python - Reading a PDF file and filtering its content with a regular expression

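Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.

This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way that is unlikely to help future readers.

Closed 2 years ago.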

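I am trying to filter a PDF file with a regular expression so that the output contains only the words the regular expression matches.

Here is my code: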

# FILTER PDF CONTENT FOR PHI USING REGEX

import PyPDF2
import re
# creating a pdf file object 
pdfFileObj = open('pdf.pdf', 'rb')

# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 


# creating a page object 
pageObj = pdfReader.getPage(0) 

# extracting text from page 
read=pageObj.extractText()

regex2 = re.compile(r'(?:flexibility|Alaska|)')

e=regex2.findall(read)
print(e)

Here is my output:

['', '','', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'flexibility', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''

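If you scroll to the right, you can see that my regex word (flexibility) was found, but why are all the commas there? Any ideas? I am probably missing a small detail, but I can't figure out where.

Contents of read (the extracted text):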

The pdf995 suite of products - Pdf995, PdfEdit995, and Signature995 - is a complete solution for your document publishing needs. It provides ease of use, flexibility in format, and industry-standard security- and all at no cost to you. Pdf995 makes it easy and affordable to create professional-quality documents in the popular PDF file format. Its easy-to-use interface helps you to create PDF files by simply selecting the "print" command from any application, creating documents which can be viewed on any computer with a PDF viewer. Pdf995 supports network file saving, fast user switching on XP, Citrix/Terminal Server, custom page sizes and large format printing. Pdf995 is a printer driver that works with any Postscript to PDF converter. The pdf995 printer driver and a free Converter are available for easy download. PdfEdit995 offers a wealth of additional functionality, such as: combining documents into a single PDF; automatic link insertion; hierarchical bookmark insertion; PDF conversion to HTML or DOC (text only); integration with Word toolbar with automatic table of contents and link generation; autoattach to email; stationery and stamping.  Signature995 offers state-of-the-art security and encryption to protect your documents and add digital signatures.  

 The Pdf995 Suite offers the following features, all at no cost: Automatic insertion of embedded links Hierarchical Bookmarks Support for Digital Signatures Support for Triple DES encryption Append and Delete PDF Pages Batch Print from Microsoft Office Asian and Cyrillic fonts Integration with Microsoft Word toolbar PDF Stationery Combining multiple PDF's into a single PDF Three auto-name options to bypass Save As dialog Imposition of Draft/Confidential stamps Support for large format architectural printing Convert PDF to JPEG, TIFF, BMP, PCX formats Convert PDF to HTML and Word DOC conversion Convert PDF to text Automatic Table of Contents generation Support for XP Fast User Switching and multiple user sessions Standard PDF Encryption (restricted printing, modifying, copying text and images) Support for Optimized PDF Support for custom page sizes Option to attach PDFs to email after creation  Automatic text summarization of PDF documents Easy integration with document management and Workflow systems n-Up printing Automatic page numbering Simple Programmers Interface Option to automatically display PDFs after creation Custom resizing of PDF output Configurable Font embedding Support for Citrix/Terminal Server Support for Windows 2003 Server Easy PS to PDF processing Specify PDF document properties Control PDF opening mode Can be configured to add functionality to Acrobat Distiller Free: Creates PDFs without annoying watermarks Free: Fully functional, not a trial and does not expire Over 5 million satisfied customers Over 1000 Enterprise Customers worldwide  Please visit us at www.pdf995.com to learn more.  This document illustrates several features of the Pdf995 Suite of Products.

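Best answer

There is a | at the end of the pattern with nothing after it. That creates an empty alternative, which matches the empty string at every position where neither word matches, so findall returns an empty string for each of those positions. Remove it: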

regex2 = re.compile(r'(?:flexibility|Alaska)')

e=regex2.findall(read)
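
To see the effect of the trailing | without opening a PDF, here is a minimal sketch that runs both patterns on a short string. The sample string is an assumption: it is a fragment of the extracted page text shown in the question, not the whole page.

```python
import re

# Short fragment of the extracted text, used here as a stand-in (assumption).
sample = "ease of use, flexibility in format"

# Pattern from the question: the trailing | adds an empty alternative.
with_empty = re.compile(r'(?:flexibility|Alaska|)')
# Corrected pattern: only the two words can match.
without_empty = re.compile(r'(?:flexibility|Alaska)')

broken = with_empty.findall(sample)
fixed = without_empty.findall(sample)

# The broken result mixes the real match with many empty strings.
print('flexibility' in broken, '' in broken)  # → True True
print(fixed)                                  # → ['flexibility']
```

The empty strings in the first result are the same ones that show up as '' entries in the output from the question.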

Also, with a pattern this simple, you can drop the non-capturing group:

regex2 = re.compile(r'flexibility|Alaska')
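
As a quick check that the group is optional here, the sketch below compares the two forms of the pattern. The sample string is made up for illustration (the extracted page in the question does not contain "Alaska").

```python
import re

# Made-up sample containing both words (assumption, not text from the PDF).
sample = "ease of use, flexibility in format, shipped to Alaska"

grouped = re.findall(r'(?:flexibility|Alaska)', sample)
ungrouped = re.findall(r'flexibility|Alaska', sample)

# With no capturing group in the pattern, findall returns the whole match
# each time, so both forms give the same list.
print(grouped)               # → ['flexibility', 'Alaska']
print(grouped == ungrouped)  # → True
```

The group only becomes necessary once something else is added around the alternation, for example a prefix or suffix that should apply to both words.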

A similar question about reading a PDF file and filtering its content with a regular expression can be found on Stack Overflow: https://stackoverflow.com/questions/53676407/
