python - How to extract specific text from a PDF file

Tags: python web-crawler pypdf

I am trying to extract this text:

 DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
 Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax:   614-692-6930
 Email: Desmond.Forshey@dla.mil

from this PDF file. I was able to extract some text between two markers using the following code:

import PyPDF2


pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj1 = pdfReader.getPage(0)
pagecontent = pageObj1.extractText()


def between(value, a, b):
    # Find and validate before-part.
    pos_a = value.find(a)
    if pos_a == -1: return ""
    # Find and validate after part.
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    # Return middle part.
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]

desired = between(pagecontent,"5. ","8. ")
print(desired)

The code above produces the following output:

20
REQUEST FOR QUOTATIONSTHIS RFQ           IS             IS NOT A SMALL BUSINESS SET-ASIDE 4. CERT.FOR NAT. DEF.      UNDER BDSA REG. 2      AND/OR DMS REG. 15. ISSUED BY7. DELIVERY   9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE       ISSUING OFFICE IN BLOCK 5 ON OR  BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information,  and quotations furnished are  not offers. If you are unable  to quote, please so indicate on this form and return it to the  address in Block 5.   This  request  does not commit the Government to pay any costs incurred  in  the preparation of the submission of this  quotation or to contract for supplies  or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by  the quoter.11. SCHEDULE  (See Continuation Sheets)     12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE  NOTE:   Additional   provisions   and   representations                    are             are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18       (REV. 6-95)     Prescribed by GSA-FAR (48 CFR) 53.215-1(a)    SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME 
ACTIVE DEVICES DIVISION 
PO BOX 3990 
COLUMBUS OH  43218-3990 
USA 
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930 
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY  (Date)8. TO: c. CITYd. STATE b. STREET ADDRESS a. NAME OF CONSIGNEEe. ZIP CODE a. 10 CALENDAR DAYS (%)b. 20 CALENDAR DAYS (%) c. 30 CALENDAR DAYS (%)15. Date of Quotationa. NAME (Type or Print)  
AREA CODEc. TITLE (Type or Print)d. CITY  c. COUNTY    b. STREET ADDRESSe. STATE f. ZIP CODESee  Schedule2018 MAY 10NUMBERFOB DESTINATIONOTHER  (See Schedule)CAGE          b. TELEPHONE PAGE     OF      PAGES1 
POC INFORMATION: 

WHEN TECHNICAL DATA IS PROVIDED IT MUST BE OBTAINED AT:https://pcf1x.bsm.dla.mil/cfolders. DISCREPANCIES FOUND IN TECHNICAL DATA SHOULD SUBMIT 
REQUEST TO THE DLA CUSTOMER SERVICE WEBSITE:https://www.pdmd.dla.mil/cs/ 

ALL OTHER QUESTIONS (SOLICITATION REQUIREMENTS, ITEM DESCRIPTION, AWARD CHOICE, ETC.), PLEASE CONTACT THE BUYER SHOWN ABOVE. 

QUESTIONS REGARDING OPERATION OF THE DLA-BSM INTERNET BID BOARD SYSTEM SHOULD BE E-MAILED TO: DibbsBSM@dla.mil 

FOR IMMEDIATE ASSISTANCE, PLEASE REFER TO THE FREQUENTLY ASKED QUESTIONS (FAQS) ON BSM DIBBS AT: 
https://www.dibbs.bsm.dla.mil/Refs/help/DIBBSHelp.htm  OR PHONE 1-855-DLA-0001 (1-855-352-0001). 


MASTER SOLICITATION 

THIS SOLICITATION INCORPORATES THE TERMS AND CONDITIONS SET FORTH IN THE DLA MASTER SOLICITATION FOR AUTOMATED SIMPLIFIED 
ACQUISITIONS REVISION 46 (FEBRURARY 7, 2018) WHICH CAN BE FOUND ON THE WEB AT: 
http://www.dla.mil/Portals/104/Documents/J7Acquisition/Master%20Solicitation%20Rev-46%20February-7-2018.pdf?ver=2018-02-08-063754-70 

This solicitation incorporates technical/quality requirements (‚R™ or ‚I™ number in section B). The full text is in the DLA Technical and Quality Master List of Requirements at: 
http://www.dla.mil/HQ/Acquisition/Offers/eprocurement.aspx The revisionof the TQ Master in effect on the award date controls.14. SIGNATURE OF PERSON AUTHORIZED TO SIGN QUOTATION 1                20
###################
ISSUED BY7. DELIVERY   9. DESTINATION10. PLEASE FURNISH QUOTATIONS TO THE       ISSUING OFFICE IN BLOCK 5 ON OR  BEFORE CLOSE OF BUSINESS (Date)IMPORTANT: This is a request for information,  and quotations furnished are  not offers. If you are unable  to quote, please so indicate on this form and return it to the  address in Block 5.   This  request  does not commit the Government to pay any costs incurred  in  the preparation of the submission of this  quotation or to contract for supplies  or services. Supplies are of domestic origin unless otherwise indicated by quoter. Any representations and/or certifications attached to this Request for Quotations must be completed by  the quoter.11. SCHEDULE  (See Continuation Sheets)     12. DISCOUNT FOR PROMPT PAYMENTd. CALENDAR DAYSNUMBERPERCENTAGE  NOTE:   Additional   provisions   and   representations                    are             are not attached.13. NAME AND ADDRESS OF QUOTERa. NAME OF QUOTER16. SIGNERAUTHORIZED FOR LOCAL REPRODUCTION Previous edition not useableSTANDARD FORM 18       (REV. 6-95)     Prescribed by GSA-FAR (48 CFR) 53.215-1(a)    SPE7M5-18-T-446E1. REQUEST NO.2018 APR 302. DATE ISSUED00739229623. REQUISITION/PURCHASE REQUEST NO.DO-C9RATINGDLA LAND AND MARITIME 
ACTIVE DEVICES DIVISION 
PO BOX 3990 
COLUMBUS OH  43218-3990 
USA 
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930 
Email: Desmond.Forshey@dla.mil175 DAYS ADO 6. DELIVER BY  (Date)

How can I extract the following text from the PDF file?

DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH 43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax:   614-692-6930
Email: Desmond.Forshey@dla.mil

Best answer

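This PDF reader does not give you much room to work with the structure of the data it returns. You can, however, add a new function to it that returns each text element as a separate item in a list. That at least lets you extract the data between two items. The approach is still not foolproof, because you have to decide on suitable termination conditions yourself: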

import itertools

import PyPDF2
# Written against the older PyPDF2 API used in this answer (PyPDF2.pdf, PdfFileReader).
from PyPDF2.pdf import PageObject, ContentStream, b_, TextStringObject


def extractTextList(self):
    """Return the text elements of a page as a list, one item per text operand."""
    text_list = []
    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        if operator == b_("Tj"):
            # Show a single text string.
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text.strip()):
                text_list.append(_text.strip())
        elif operator == b_("T*"):
            # Move to the next line; carries no text.
            pass
        elif operator == b_("'"):
            # Move to the next line and show a text string.
            _text = operands[0]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_('"'):
            # Set spacing, move to the next line, show the string (third operand).
            _text = operands[2]
            if isinstance(_text, TextStringObject) and len(_text):
                text_list.append(_text)
        elif operator == b_("TJ"):
            # Show an array of strings interleaved with positioning numbers.
            for i in operands[0]:
                if isinstance(i, TextStringObject) and len(i):
                    text_list.append(i)
    return text_list


# Attach the new function to PyPDF2's page class as a method.
PageObject.extractTextList = extractTextList


def between(text_elements, drop_while, take_while):    
    return list(itertools.takewhile(take_while, itertools.dropwhile(drop_while, text_elements)))[1:]    


pdfFileObj = open('SPE7M518T446E.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

page0 = pdfReader.getPage(0)
text_elements = page0.extractTextList()

lines = between(text_elements, lambda x: x != 'RATING', lambda x: 'DAYS' not in x)
print('\n'.join(lines))

This gives you the lines you want, which are then joined into a single output as follows:

DLA LAND AND MARITIME
ACTIVE DEVICES DIVISION
PO BOX 3990
COLUMBUS OH  43218-3990
USA
Name: Desmond Forshey Buyer Code:PMCMTA9 Tel: 614-692-6154 Fax: 614-692-6930
Email: Desmond.Forshey@dla.mil

Since the new function extractTextList() returns a list of the text elements found on the page, I use itertools.dropwhile() and itertools.takewhile() to process the returned list.

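The between() function works in two stages. First, dropwhile() reads the list one string at a time and discards each element until the first test stops holding (that is, until it reaches RATING). From that point on it passes the elements to takewhile(), which keeps taking them until it finds the word DAYS in one of them. list() is used to build the filtered list. I then drop the first element (since it is the word RATING).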
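The two stages can be tried without a PDF. The following is a minimal sketch that assumes a hand-written list, sample_elements, standing in for what extractTextList() might return; the list contents are illustrative and not taken from a real extraction.

```python
import itertools


def between(text_elements, drop_while, take_while):
    # Stage 1: skip elements while drop_while(x) is true (stops at the start marker).
    remaining = itertools.dropwhile(drop_while, text_elements)
    # Stage 2: keep elements while take_while(x) is true (stops at the end marker).
    kept = itertools.takewhile(take_while, remaining)
    # The first kept element is the start marker itself, so drop it.
    return list(kept)[1:]


# Hypothetical stand-in for the list returned by extractTextList().
sample_elements = [
    "DO-C9",
    "RATING",
    "DLA LAND AND MARITIME",
    "PO BOX 3990",
    "USA",
    "175 DAYS ADO",
    "6. DELIVER BY  (Date)",
]

lines = between(
    sample_elements,
    lambda x: x != "RATING",
    lambda x: "DAYS" not in x,
)
missing = between(["A", "B"], lambda x: x != "RATING", lambda x: "DAYS" not in x)
print(lines)
print(missing)
```

If the start marker never appears, dropwhile() consumes the whole list and the result is empty, which is one of the termination cases you have to plan for.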

In effect, this is an iterative way of slicing a list.
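To make the slicing comparison concrete, here is a sketch of the same selection written with explicit indices. The helper name between_by_index and the sample list are assumptions made for illustration; they are not part of the original answer.

```python
def between_by_index(text_elements, start_marker, stop_word):
    # Find the element equal to the start marker; give up if it is absent.
    try:
        start = text_elements.index(start_marker) + 1
    except ValueError:
        return []
    # Find the first later element containing the stop word, else run to the end.
    stop = next(
        (i for i in range(start, len(text_elements)) if stop_word in text_elements[i]),
        len(text_elements),
    )
    # An ordinary slice between the two positions.
    return text_elements[start:stop]


# Hypothetical stand-in for the list returned by extractTextList().
sample_elements = ["DO-C9", "RATING", "PO BOX 3990", "USA", "175 DAYS ADO", "8. TO:"]

block = between_by_index(sample_elements, "RATING", "DAYS")
print(block)
```

The itertools version avoids computing the positions up front, which is why it reads as a single expression.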

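Note: lambda is just another way of defining a function. Here each one takes a text element called x. The one passed to dropwhile() returns True as long as x is not the value RATING, and the one passed to takewhile() returns True as long as the word DAYS does not appear anywhere in x. The two itertools functions call these lambda functions for each element of the list.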
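As a small illustration of that note, the two lambdas can be written as ordinary named functions. The names not_rating and no_days and the sample strings below are chosen for this sketch only.

```python
def not_rating(x):
    # True while the start marker has not been reached.
    return x != "RATING"


def no_days(x):
    # True while the end marker word is absent from the element.
    return "DAYS" not in x


# The lambda forms used in the answer.
not_rating_lambda = lambda x: x != "RATING"
no_days_lambda = lambda x: "DAYS" not in x

samples = ["DO-C9", "RATING", "USA", "175 DAYS ADO"]
named_results = [(not_rating(s), no_days(s)) for s in samples]
lambda_results = [(not_rating_lambda(s), no_days_lambda(s)) for s in samples]
print(named_results == lambda_results)
```

Either form can be passed to between(); named functions are easier to reuse and test, while lambdas keep short conditions inline.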

Regarding python - How to extract specific text from a PDF file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50116318/
