realcode4you
- Feb 28, 2022

Extracting Text, Tables From PDFs Using PyPDF2 Library in Python | NLP Assignment Help

In this blog, you will learn how to extract text from a PDF using the PyPDF2 library in Python, and how to extract tables using the Camelot and tabula-py libraries.

#!pip install PyPDF2 camelot-py tabula-py
#conda install -c conda-forge camelot-py
import PyPDF2

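# Read the PDF file.
# Note: PdfFileReader, documentInfo, getPage and extractText work up to PyPDF2 2.x;
# PyPDF2 3.0 removed them in favor of PdfReader, metadata, pages[i] and extract_text().
fileName = 'WhenisEarlyClassificationofTimeSeriesMeaningful.pdf'
reader = PyPDF2.PdfFileReader(fileName)

# Document metadata (author, title, creation date, ...)
reader.documentInfo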

output:

{'/Author': 'Renjie Wu;Audrey Der;Eamonn J. Keogh',
 '/Comments': '',
 '/Company': 'IEEE Computer Society',
 '/CreationDate': 'D:20210223042231Z',
 '/Creator': 'Acrobat PDFMaker 20 for Word',
 '/ModDate': 'D:20210223042234Z',
 '/Producer': 'Adobe PDF Library 20.13.106',
 '/SourceModified': 'D:20210223042227',
 '/Title': 'When is Early Classification of Time Series Meaningful?'}
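
The date fields in the metadata above are strings such as 'D:20210223042231Z'. The following is a minimal sketch of turning them into Python datetime objects. The helper name parse_pdf_date is hypothetical, and it only handles the two shapes that appear in this output (a UTC timestamp ending in 'Z' and a timestamp with no zone); PDF dates with an explicit offset are not covered.

```python
from datetime import datetime, timezone

def parse_pdf_date(value):
    """Parse a PDF date string such as 'D:20210223042231Z' into a datetime.

    Hypothetical helper: only a trailing 'Z' (UTC) or no zone at all is handled.
    """
    text = value[2:] if value.startswith("D:") else value
    if text.endswith("Z"):
        # A trailing 'Z' marks UTC
        return datetime.strptime(text[:-1], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # No zone information: return a naive datetime
    return datetime.strptime(text, "%Y%m%d%H%M%S")

created = parse_pdf_date("D:20210223042231Z")
source_modified = parse_pdf_date("D:20210223042227")
print(created.isoformat())          # 2021-02-23T04:22:31+00:00
print(source_modified.isoformat())  # 2021-02-23T04:22:27
```

In the notebook, the input would be a value such as reader.documentInfo['/CreationDate'].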

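# Concatenate the text of the first 8 pages (use reader.numPages to cover the whole document)
pages = ''
for i in range(0,8):
    pages += reader.getPage(i).extractText()

# Drop line breaks so the text becomes one long string
pages = pages.replace('\n','')
pages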

Output:

" 1  When is Early Classification of Time Series Meaningful? Renjie Wu, Audrey Der, and Eamonn J. Keogh AbstractŠSince its introduction two decades ago, there has been increasing interest in the problem of early classification of time series. This problem generalizes classic time series classification to ask if we can classify a time series subsequence with sufficient accuracy and confidence after seeing only some prefix of a target pattern. The idea is that the earlier classification would allow us to take immediate action, in a domain in which some practical interventions are possible. For example, that intervention might be sounding an alarm or applying the brakes in an automobile. In this work, we make a surprising claim. In spite of the fact that there are dozens of papers on early classification of time series, it is not clear that any of them could ever work in a real-world setting. The problem is not with the algorithms per se but with the vague and underspecified problem description. Essentially all algorithms make implicit and unwarranted assumptions about the problem that will ensure that they will be plagued by false positives and false negatives even if their results suggested that they could obtain near-perfect results. We will explain our findings with novel insights and experiments and offer recommendations to the community. Index TermsŠEarly classification, time series analysis, data mining. ŠŠŠŠŠŠŠŠŠŠ      ŠŠŠŠŠŠŠŠŠŠ 1 INTRODUCTIONINCE its introduction two decades ago, there has been increasing interest in the problem of early classification of time series (ETSC). The problem is expressed differently by different researchers, but it generally reduced to ask-ing if we can classify a time series subsequence with suffi-cient accuracy and confidence after 

....
....
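
The extracted text above still contains fragments like "ask-ing" and "suffi-cient": removing '\n' joins the two halves of a word that was hyphenated at a line end, but leaves the hyphen behind. A possible clean-up, sketched below, removes the hyphen together with the line break before joining lines. It assumes the raw text has the hyphen immediately before the '\n'; the function name join_lines and the sample string are illustrative, not part of the original notebook.

```python
import re

def join_lines(raw):
    """Merge the lines of extracted text into one string.

    A hyphen directly before a line break is treated as a word split
    across lines and removed; every other line break becomes a space.
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)   # 'ask-\ning' -> 'asking'
    text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text).strip()    # collapse repeated spaces

# Hypothetical raw input, modeled on the text in the output above
sample = (
    "it generally reduced to ask-\ning if we can classify a time series\n"
    "subsequence with suffi-\ncient accuracy"
)
print(join_lines(sample))
# it generally reduced to asking if we can classify a time series subsequence with sufficient accuracy
```

One caveat: a genuine compound that happens to break at the hyphen (for example "real-" at a line end followed by "world") would also be joined, so the result may need a spot check.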

Breaking text into sections

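# Headings to split on, in document order, and the label for each resulting chunk
sectionNames = ['Abstract', 'INTRODUCTION', 'BACKGROUND']
mapp = ['Title', 'Abstract', 'Introduction', 'BACKGROUND']
sections = {}
for i, sectionName in enumerate(sectionNames):
    print(i, sectionName)
    # Text before the first occurrence of the heading belongs to the previous label;
    # partition() leaves any later occurrences of the heading untouched
    before, _, pages = pages.partition(sectionName)
    sections[mapp[i]] = before
# Whatever remains belongs to the last label
sections[mapp[-1]] = pages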

Output:

0 Abstract
1 INTRODUCTION
2 BACKGROUND
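
The splitting step can also be wrapped in a small reusable function. The sketch below is one way to do it with str.partition, shown on a toy string rather than the real paper. The function name split_sections, the toy text, and the choice to raise an error on a missing heading are all assumptions made for illustration.

```python
def split_sections(text, headings, labels):
    """Split text at the first occurrence of each heading, in order.

    labels must have one more entry than headings: labels[0] names the
    text before the first heading, labels[k + 1] the text after headings[k].
    """
    if len(labels) != len(headings) + 1:
        raise ValueError("labels must have exactly one more entry than headings")
    sections = {}
    remaining = text
    for label, heading in zip(labels, headings):
        # partition splits at the first occurrence only
        before, found, after = remaining.partition(heading)
        if not found:
            raise ValueError(f"heading {heading!r} not found")
        sections[label] = before.strip()
        remaining = after
    sections[labels[-1]] = remaining.strip()
    return sections

toy = "A Title Abstract Short summary. 1 INTRODUCTION Opening text. 2 BACKGROUND Prior work."
result = split_sections(
    toy,
    headings=["Abstract", "INTRODUCTION", "BACKGROUND"],
    labels=["Title", "Abstract", "Introduction", "Background"],
)
for name, body in result.items():
    print(f"{name}: {body!r}")
```

Note that a section number placed before a heading (the "1" in "1 INTRODUCTION") ends up at the tail of the preceding section, because the split happens at the heading word itself.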

Extracting tables from PDFs

Camelot is a Python library and command-line tool for extracting data tables trapped inside PDF files.

tabula-py is a simple Python wrapper around tabula-java that reads tables from a PDF and converts them into CSV, TSV, JSON, or a pandas DataFrame.

This section shows how to extract tables from a PDF with both Camelot and tabula-py.

fileName

output:

'WhenisEarlyClassificationofTimeSeriesMeaningful.pdf'

import camelot
fileName = 'pdfs/foo.pdf'
# extract tables from the PDF file (Camelot parses only page 1 unless pages="all" is passed)
tables = camelot.read_pdf(fileName)

# number of tables extracted
print("Total tables extracted:", tables.n)

output:

Total tables extracted: 1

# print the first table as Pandas DataFrame
print(tables[0].df)

output:

              0            1                2                     3  \
0  Cycle \nName  KI \n(1/km)  Distance \n(mi)  Percent Fuel Savings   
1                                                  Improved \nSpeed   
2        2012_2         3.30              1.3                  5.9%   
3        2145_1         0.68             11.2                  2.4%   
4        4234_1         0.59             58.7                  8.5%   
5        2032_2         0.17             57.8                 21.7%   
6        4171_1         0.07            173.9                 58.1%   

                   4                  5                 6  
0                                                          
1  Decreased \nAccel  Eliminate \nStops  Decreased \nIdle  
2               9.5%              29.2%             17.4%  
3               0.1%               9.5%              2.7%  
4               1.3%               8.5%              3.3%  
5               0.3%               2.7%              1.2%  
6               1.6%               2.1%              0.5%
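
In the DataFrame above, the column names are spread over the first two rows and contain embedded line breaks. The sketch below shows one possible way to merge them into a single header row. It works on plain lists of strings, which could be obtained with tables[0].df.values.tolist(); the function name merge_header_rows and the choice of two header rows are assumptions based on this particular table.

```python
def merge_header_rows(rows, n_header=2):
    """Combine the first n_header rows into one header row.

    Cells are joined top to bottom per column; line breaks and
    repeated spaces inside a cell are collapsed to single spaces.
    """
    header = []
    for column in zip(*rows[:n_header]):
        parts = [" ".join(cell.split()) for cell in column if cell.strip()]
        header.append(" ".join(parts))
    return header, rows[n_header:]

# First rows of the table printed above, as lists of strings
rows = [
    ["Cycle \nName", "KI \n(1/km)", "Distance \n(mi)", "Percent Fuel Savings", "", "", ""],
    ["", "", "", "Improved \nSpeed", "Decreased \nAccel", "Eliminate \nStops", "Decreased \nIdle"],
    ["2012_2", "3.30", "1.3", "5.9%", "9.5%", "29.2%", "17.4%"],
    ["2145_1", "0.68", "11.2", "2.4%", "0.1%", "9.5%", "2.7%"],
]
header, body = merge_header_rows(rows)
print(header)
print(body[0])
```

"Percent Fuel Savings" appears to be a spanning header, so this simple merge attaches it only to the column it was extracted into; the remaining savings columns keep just their own sub-heading.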

# export individually as CSV (the tables/ folder must already exist)
tables[0].to_csv("tables/foo.csv")

# or export all in a zip
tables.export("foo.zip", f="csv", compress=True)

# export individually as Excel (.xlsx extension)
tables[0].to_excel("foo.xlsx")

import tabula
from tabula.io import read_pdf, convert_into, convert_into_by_batch

# read PDF file
fileName = 'WhenisEarlyClassificationofTimeSeriesMeaningful.pdf'
tables = read_pdf(fileName, pages="all")
tables

output:

[  Section 4, most ETSC methods have a misPurednictedd erPsretdaictend ding  \
 0                              as Class 2 as Class 1                         
 1  about the normalization of the data that will ...                         
 
   sible, although to our knowledge there has never been an  
 0                                                NaN        
 1  ETSC algorithm deployed in the real world. As ...        ,
    We call the “cat” vs “catalog” problem the prefix prob-  \
 0   lem. We will later show two other issues, the ...        
 1   homophone problems that offer even greater stu...        
 2                          blocks to any ETSC models.        
 3   The absolute weakest interpretation of our fin...        
 4   that the ETSC community has failed to communic...        
 5   appreciate the many assumptions that must be t...        
 6   their models to be useful in the real world. H...        
 7   will argue a stronger interpretation. The ETSC...        
 8   underspecified to the point of being meaningle...
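
The output above shows that, for this paper, tabula-py returned chunks of two-column body text as if they were tables. One rough way to screen out such false positives is to check how many cells look numeric. The sketch below is only a heuristic under stated assumptions: the function name numeric_fraction, the regular expression, and any cut-off applied to the result are hypothetical and would need tuning per document. It takes rows as lists of values, which could be obtained from a DataFrame with df.values.tolist().

```python
import re

# Matches integers, decimals and percentages such as '3.30' or '5.9%'
NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?%?")

def numeric_fraction(rows):
    """Return the share of non-empty cells that look like numbers or percentages."""
    cells = [str(cell).strip() for row in rows for cell in row]
    cells = [c for c in cells if c and c.lower() != "nan"]
    if not cells:
        return 0.0
    return sum(bool(NUMERIC.fullmatch(c)) for c in cells) / len(cells)

# Cells taken from the tabula-py output above (body text detected as a table)
prose = [
    ["as Class 2 as Class 1", "NaN"],
    ["about the normalization of the data that will ...",
     "ETSC algorithm deployed in the real world. As ..."],
]
# Cells taken from the Camelot table earlier in this post
data = [
    ["2012_2", "3.30", "1.3", "5.9%", "9.5%", "29.2%", "17.4%"],
    ["2145_1", "0.68", "11.2", "2.4%", "0.1%", "9.5%", "2.7%"],
]
print(numeric_fraction(prose))
print(round(numeric_fraction(data), 2))
```

A table made mostly of text labels would score low as well, so this check is better used to flag candidates for manual review than to discard them automatically.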

# number of tables tabula-py detected
len(tables)
# optionally keep a single table and drop its first column
#tables = tables[2]
#tables = tables.iloc[:, 1:]
tables

import os
# save them in a folder
folder_name = "tables"
if not os.path.isdir(folder_name):
    os.mkdir(folder_name)
# iterate over extracted tables and export as excel individually
for i, table in enumerate(tables, start=1):
    table.to_excel(os.path.join(folder_name, f"table_{i}.xlsx"), index=False)

# convert all tables of a PDF file into a single CSV file
# supported output_formats are "csv", "json" or "tsv"
convert_into(fileName, "output.csv", output_format="csv", pages="all")

# convert all PDFs in a folder into CSV format
# `pdfs` folder should exist in the current directory
tabula.io.convert_into_by_batch("pdfs", output_format="csv", pages="all")

For any support you can contact us at:

realcode4you@gmail.com

We provide help with all data science and machine learning topics. Hire an expert and get prompt help at an affordable price.

RealCode4You
