python - Extracting text from a PDF with Python and PyPDF2

Tags: python pdf text pypdf

I want to extract text from a PDF file using Python and the PyPDF2 package. Here is my PDF file, and here is my code:

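import PyPDF2

# Open the file in binary mode and hand the file object to the reader.
# (The second positional argument of PdfFileReader is `strict`, not a file mode.)
opened_pdf = PyPDF2.PdfFileReader(open('test.pdf', 'rb'))

p = opened_pdf.getPage(0)

p_text = p.extractText()
# extract data line by line
P_lines = p_text.splitlines()
print(P_lines)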

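My problem is that P_lines is not split line by line; instead I get one huge string. I want to extract the text line by line for analysis. Any suggestions on how to improve this? Thanks! This is the string the code returns: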

[u'Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)** Information is based on the maximum potential for concentration and thus the total may be over 100%* Total Water Volume sources may include fresh water, produced water, and/or recycled water0.01271%72.00%7732-18-5Water0.00071%4.00%1310-73-2Sodium Hydroxide0.00424%24.00%533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828%75.00%Organic phosphonic acid salts0.00276%25.00%67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid 6730.00807%30.00%7732-18-5Water0.00188%7.00%Polyethoxylated alcohol surfactants0.00753%28.00%9003-06-9Ammonium Salts0.00941%35.00%64742-47-8Petroleum DistillateFriction ReducerPumpcoPlexslick 9210.05029%60.00%7732-18-5Water0.03353%40.00%7647-01-0Hydrogen ChlorideHydrochloric AcidPumpcoHCL9.84261%100.00%14808-60-7Crystaline SilicaProppantPumpcoSand90.01799%100.00%7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid(% by mass)**MaximumIngredientConcentrationin Additive(% by mass)**Chemical AbstractService Number(CAS #)IngredientsPurposeSupplierTrade NameHydraulic Fracturing Fluid Composition:2,608,032Total Water Volume (gal)*:7,595True Vertical Depth (TVD):GasProduction Type:NAD27Long/Lat Projection:32.558525Latitude:-97.215242Longitude:Ole Gieser Unit D 6HWell Name and Number:XTO EnergyOperator Name:42-439-35084API Number:TarrantCounty:TexasState:12/10/2010Fracture DateHydraulic Fracturing Fluid Product Component Information Disclosure']
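The list above has a single element, which suggests that the extracted string contains no line-break characters at all: `splitlines()` can only split where a line break exists. A minimal illustration of that behaviour, using made-up sample strings rather than the real PDF text:

```python
# Hypothetical sample strings, not taken from the PDF.
glued = "State:TexasCounty:Tarrant"        # no newline anywhere
broken = "State: Texas\nCounty: Tarrant"   # one newline between the fields

print(glued.splitlines())   # a single element: there is nothing to split on
print(broken.splitlines())  # two elements, one per line
```

So the splitting step itself is probably fine; the line breaks appear to be lost earlier, during extraction.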

Screenshot of the file: (image not included)

Best answer

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # open() instead of file(), which does not exist in Python 3
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))

Output

Hydraulic Fracturing Fluid Product Component Information Disclosure

Fracture Date State: County: API Number: Operator Name: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type: True Vertical Depth (TVD): Total Water Volume (gal)*:

12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H -97.215242 32.558525 NAD27 Gas 7,595 2,608,032

Hydraulic Fracturing Fluid Composition:

Trade Name

Supplier

Purpose

Ingredients

Chemical Abstract Service Number

(CAS #)

Maximum Ingredient

Concentration

in Additive ( by mass)**

Comments

Maximum Ingredient

Concentration

in HF Fluid ( by mass)**

Water Sand HCL

Pumpco Pumpco

Proppant Hydrochloric Acid

Plexslick 921

Pumpco

Friction Reducer

Plexaid 673

Pumpco

Scale Inhibitor

Plexcide 24L

Pumpco

Biocide

Crystaline Silica

Hydrogen Chloride Water

Petroleum Distillate Ammonium Salts Polyethoxylated alcohol surfactants Water

Methyl Alcohol Organic phosphonic acid salts

Dazomat Sodium Hydroxide Water

7732-18-5 14808-60-7

7647-01-0 7732-18-5

64742-47-8 9003-06-9

7732-18-5

67-56-1

533-74-4 1310-73-2 7732-18-5

100.00 100.00

90.01799 9.84261

40.00 60.00

35.00 28.00 7.00 30.00

25.00 75.00

24.00 4.00 72.00

0.03353 0.05029

0.00941 0.00753 0.00188 0.00807

0.00276 0.00828

0.00424 0.00071 0.01271

* Total Water Volume sources may include fresh water, produced water, and/or recycled water ** Information is based on the maximum potential for concentration and thus the total may be over 100

Ingredient information for chemicals subject to 29 CFR 1910.1200(i) and Appendix D are obtained from suppliers Material Safety Data Sheets (MSDS)
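The answer splits the text on blank lines (`'\n\n'`), so each element is a layout block rather than a single line. If individual lines are needed for analysis, the returned text can be post-processed. A minimal sketch, assuming the extracted text is an ordinary Python string; `text_to_lines` is a hypothetical helper name, not part of pdfminer:

```python
def text_to_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample imitating the block layout of the output above.
sample = "Trade Name\n\nSupplier\n\n7732-18-5\n14808-60-7\n"
print(text_to_lines(sample))
```

In the output above the table appears to be emitted column by column, so values that belong to the same table row will not necessarily end up on adjacent lines.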

This question, "python - Extracting text from a PDF with Python and PyPDF2", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/42743061/
