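I have to parse a formatted PDF to extract some fields. The PDF is here, and the part I need to parse is shown in this image. I used PyPDF2 to get the text, but it returns raw text without any formatting.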
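import PyPDF2

# Legacy PyPDF2 API (PyPDF2 3.x renames these to PdfReader, pages[0], extract_text())
pdfFileObj = open('GPO-PLUMBOOK-2000-4-1.pdf', 'rb')  # open the PDF in binary mode
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)  # first page
# extractText() returns the page text as one unformatted string
print(pageObj.extractText())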
The output I get is as follows:
LEGISLATIVE BRANCHLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DCArchitect of the Capitol10 years02/02/07IIIEXPASLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DCComptroller General of the United States11/09/1315 years$141,300OTPASVacant Do...........Deputy Comptroller General of the United States..................OTXSLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGOVERNMENT PRINTING OFFICEMichael F. DiMarioWashington, DCPublic Printer............IIIEXPASRobert T. Mansker Do...........Deputy Public Printer............IVEXXSFrancis J. Buckley, Jr. Do...........Superintendent of Documents..................SLXSRobert G. Andary Do...........Inspector General..................SLXSMary Beth Lawler Do...........Staff Assistant............14OTSCLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresLIBRARY OF CONGRESSLIBRARIAN OF CONGRESSJames H. BillingtonWashington, DCLibrarian of Congress............IIIEXPASLIBRARY OF CONGRESS TRUST FUND BOARDJames H. 
Billington Do...........Chairman (Ex-Officio)..................WCPASTed Stevens Do...........Chairman of the Joint Committee of the Library (Ex-Officio)..................WCXSLawrence Summers Do...........Member (Ex-Officio), Secretary of the Treasury..................WCPASDonald Hammond Do...........Member (Designee for the Secretary of the Treasurer)..................WCXSCeil Pulitzer Do...........Member5 years03/23/03......WCPASNajeeb Halaby Do...........Member5 years08/31/05......WCPASJohn Kluge Do...........Member5 years03/10/03......WCXSWayne Berman Do...........Member5 years12/22/01......WCXSEdwin Cox Do...........Member5 years03/31/04......WCXSJohn Henry Do...........Member5 years12/22/03......WCXSDonald Jones Do...........Member5 years10/08/02......WCXSJulie Finley Do...........Member5 years06/29/01......WCXSBernard Rappaport Do...........Member5 years12/22/01......WCXS(1)
I need the data separated by column, for example the values under the Location column, and so on.
Best answer
Take a look at the tabula library (it is on GitHub). It returns a pandas DataFrame.
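import tabula  # provided by the tabula-py package; needs a Java runtime
result = tabula.read_pdf("/home/michael/Downloads/GPO-PLUMBOOK-2000-4-1.pdf", pages=1)
# Newer tabula-py releases return a list of DataFrames; take the first table
df = result[0] if isinstance(result, list) else result
df.dropna(inplace=True)  # drop rows that contain any missing value
print(df[:2])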
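If you need to read other tables, or want to save time, you can also adjust which part of the PDF is used. As written, the code simply reads every table on the page, and I then sliced the data down to the output you wanted.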
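Restricting the read to one part of the page could look roughly like the sketch below. It relies on the `area` option of `tabula.read_pdf`, which takes `[top, left, bottom, right]` in PDF points. The helper names `area_in_points` and `read_region` are made up for illustration, and any region you pass in is a guess that would need to be measured against the actual page.

```python
from typing import List, Sequence

POINTS_PER_INCH = 72.0  # PDF coordinates are measured in points, 72 per inch


def area_in_points(top_in: float, left_in: float,
                   bottom_in: float, right_in: float) -> List[float]:
    """Convert a region given in inches to [top, left, bottom, right] in points.

    Hypothetical helper: tabula's `area` option expects this ordering.
    """
    if bottom_in <= top_in or right_in <= left_in:
        raise ValueError("bottom/right must be greater than top/left")
    return [v * POINTS_PER_INCH for v in (top_in, left_in, bottom_in, right_in)]


def read_region(pdf_path: str, page: int, area: Sequence[float]):
    """Read only the given region of one page and return a list of DataFrames."""
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    # guess=False skips tabula's automatic region detection,
    # since an explicit area is supplied.
    result = tabula.read_pdf(pdf_path, pages=page, area=list(area), guess=False)
    # Newer tabula-py releases return a list of DataFrames; older ones return one.
    return result if isinstance(result, list) else [result]
```

For example, `read_region("GPO-PLUMBOOK-2000-4-1.pdf", 1, area_in_points(1.0, 0.5, 10.0, 8.0))` would limit extraction to a box starting 1 inch from the top and 0.5 inch from the left of page 1. Those numbers are placeholders, not measurements taken from this PDF.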
Regarding "python - Extracting text from a formatted PDF using Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56916972/