python - Extracting text from a formatted PDF using Python

Tags: python python-3.x parsing pdf pypdf

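I have to parse a formatted PDF to extract some fields. The PDF is here. The content I need to parse is shown in this image. I used PyPDF2 to get the text, but it returns raw text without any of the formatting.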

import PyPDF2
pdfFileObj = open('GPO-PLUMBOOK-2000-4-1.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

The output I get is as follows:

LEGISLATIVE BRANCHLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DCArchitect of the Capitol10 years02/02/07IIIEXPASLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DCComptroller General of the United States11/09/1315 years$141,300OTPASVacant  Do...........Deputy Comptroller General of the United States..................OTXSLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGOVERNMENT PRINTING OFFICEMichael F. DiMarioWashington, DCPublic Printer............IIIEXPASRobert T. Mansker  Do...........Deputy Public Printer............IVEXXSFrancis J. Buckley, Jr.  Do...........Superintendent of Documents..................SLXSRobert G. Andary  Do...........Inspector General..................SLXSMary Beth Lawler  Do...........Staff Assistant............14OTSCLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresLIBRARY OF CONGRESSLIBRARIAN OF CONGRESSJames H. BillingtonWashington, DCLibrarian of Congress............IIIEXPASLIBRARY OF CONGRESS TRUST FUND BOARDJames H. 
Billington  Do...........Chairman (Ex-Officio)..................WCPASTed Stevens  Do...........Chairman of the Joint Committee of the Library (Ex-Officio)..................WCXSLawrence Summers  Do...........Member (Ex-Officio), Secretary of the Treasury..................WCPASDonald Hammond  Do...........Member (Designee for the Secretary of the Treasurer)..................WCXSCeil Pulitzer  Do...........Member5 years03/23/03......WCPASNajeeb Halaby  Do...........Member5 years08/31/05......WCPASJohn Kluge  Do...........Member5 years03/10/03......WCXSWayne Berman  Do...........Member5 years12/22/01......WCXSEdwin Cox  Do...........Member5 years03/31/04......WCXSJohn Henry  Do...........Member5 years12/22/03......WCXSDonald Jones  Do...........Member5 years10/08/02......WCXSJulie Finley  Do...........Member5 years06/29/01......WCXSBernard Rappaport  Do...........Member5 years12/22/01......WCXS(1)

I need the data separated by column, for example the values under the Location column, and so on.

Best answer

Take a look at the tabula library (here is the GitHub page). It returns the extracted tables as pandas DataFrames.

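import tabula
tables = tabula.read_pdf("/home/michael/Downloads/GPO-PLUMBOOK-2000-4-1.pdf", pages=1)
# Recent tabula-py releases return a list of DataFrames; older ones returned a single DataFrame
df = tables[0] if isinstance(tables, list) else tables
df.dropna(inplace=True)  # drop rows that contain missing values
print(df[:2])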

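If you need to read other tables, or want to save time, you can also restrict which part of the PDF is read. That way you only need to read the PDF's tables and slice the resulting data down to the output you want.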
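As a rough sketch of restricting the region, tabula-py's read_pdf accepts an area argument given as [top, left, bottom, right], measured in points by default. The helper names inches_to_points and read_region below are hypothetical, introduced only for illustration, and the right coordinates would have to be measured from the actual PDF.

```python
def inches_to_points(top, left, bottom, right):
    """Convert a region measured in inches to PDF points (72 points per inch)."""
    return [round(value * 72, 2) for value in (top, left, bottom, right)]


def read_region(pdf_path, area, pages=1):
    """Extract tables from one rectangular region of the given page(s).

    Hypothetical helper: `area` is [top, left, bottom, right] in points.
    """
    # Imported lazily because tabula-py also needs a Java runtime installed
    import tabula

    tables = tabula.read_pdf(pdf_path, pages=pages, area=area)
    # Recent tabula-py releases return a list of DataFrames; normalise older
    # single-DataFrame results to a list as well
    return tables if isinstance(tables, list) else [tables]
```

For example, read_region("GPO-PLUMBOOK-2000-4-1.pdf", inches_to_points(1.0, 0.5, 5.0, 8.0)) would read only that rectangle of page 1. Those coordinates are placeholders, not values measured from this document.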

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/56916972/
