python - Reading a PDF file and filtering its content with a regex

Tags: python regex pdf pypdf

I am trying to filter a PDF file with a regex so that the output contains only the words the regex matches.

Here is my code:

# FILTER PDF CONTENT FOR PHI USING REGEX

import PyPDF2
import re
# creating a pdf file object 
pdfFileObj = open('pdf.pdf', 'rb')

# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 


# creating a page object 
pageObj = pdfReader.getPage(0) 

# extracting text from page 
read=pageObj.extractText()

regex2 = re.compile(r'(?:flexibility|Alaska|)')

e=regex2.findall(read)
print(e)

Here is my output:

['', '','', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'flexibility', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''

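If you scroll to the right, you can see that my regex word (flexibility) was found, but why are all those empty strings and commas there? Any ideas? I am probably missing a small detail, but I can't figure out where.

Output of read: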

The pdf995 suite of products - Pdf995, PdfEdit995, and Signature995 - is a complete solution for your document publishing needs. It provides ease of use, flexibility in format, and industry-standard security- and all at no cost to you. Pdf995 makes it easy and affordable to create professional-quality documents in the popular PDF file format. Its easy-to-use interface helps you to create PDF files by simply selecting the "print" command from any application, creating documents which can be viewed on any computer with a PDF viewer. Pdf995 supports network file saving, fast user switching on XP, Citrix/Terminal Server, custom page sizes and large format printing. Pdf995 is a printer driver that works with any Postscript to PDF converter. The pdf995 printer driver and a free Converter are available for easy download. PdfEdit995 offers a wealth of additional functionality, such as: combining documents into a single PDF; automatic link insertion; hierarchical bookmark insertion; PDF conversion to HTML or DOC (text only); integration with Word toolbar with automatic table of contents and link generation; autoattach to email; stationery and stamping.  Signature995 offers state-of-the-art security and encryption to protect your documents and add digital signatures.  

 The Pdf995 Suite offers the following features, all at no cost: Automatic insertion of embedded links Hierarchical Bookmarks Support for Digital Signatures Support for Triple DES encryption Append and Delete PDF Pages Batch Print from Microsoft Office Asian and Cyrillic fonts Integration with Microsoft Word toolbar PDF Stationery Combining multiple PDF's into a single PDF Three auto-name options to bypass Save As dialog Imposition of Draft/Confidential stamps Support for large format architectural printing Convert PDF to JPEG, TIFF, BMP, PCX formats Convert PDF to HTML and Word DOC conversion Convert PDF to text Automatic Table of Contents generation Support for XP Fast User Switching and multiple user sessions Standard PDF Encryption (restricted printing, modifying, copying text and images) Support for Optimized PDF Support for custom page sizes Option to attach PDFs to email after creation  Automatic text summarization of PDF documents Easy integration with document management and Workflow systems n-Up printing Automatic page numbering Simple Programmers Interface Option to automatically display PDFs after creation Custom resizing of PDF output Configurable Font embedding Support for Citrix/Terminal Server Support for Windows 2003 Server Easy PS to PDF processing Specify PDF document properties Control PDF opening mode Can be configured to add functionality to Acrobat Distiller Free: Creates PDFs without annoying watermarks Free: Fully functional, not a trial and does not expire Over 5 million satisfied customers Over 1000 Enterprise Customers worldwide  Please visit us at www.pdf995.com to learn more.  This document illustrates several features of the Pdf995 Suite of Products. 

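Best answer

The pattern ends with a | that has nothing after it. That creates an empty alternative, so the pattern also matches the empty string at every position where neither word is found, and findall returns one '' for each of those positions. Remove it: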

regex2 = re.compile(r'(?:flexibility|Alaska)')

e=regex2.findall(read)
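
To see the effect of the empty alternative in isolation, here is a minimal sketch on a short string. The variable sample is a made-up stand-in for the text extracted from the page, not the actual PDF content:

```python
import re

# Hypothetical stand-in for the text returned by extractText()
sample = "ease of use, flexibility in format"

# The trailing "|" adds an empty alternative, so this pattern also
# matches the empty string wherever neither word is found.
buggy = re.compile(r'(?:flexibility|Alaska|)')
fixed = re.compile(r'(?:flexibility|Alaska)')

buggy_matches = buggy.findall(sample)
fixed_matches = fixed.findall(sample)

# Filtering out the empty strings leaves only the real matches
non_empty = [m for m in buggy_matches if m]

print(fixed_matches)
print(non_empty)
```

Filtering the empty strings afterwards works, but removing the trailing | avoids producing them in the first place.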

Also, with a pattern this simple, you can drop the non-capturing group:

regex2 = re.compile(r'flexibility|Alaska')
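
As a rough illustration, the sketch below checks that both forms find the same words, and shows one case where the group is still useful: when something such as a word boundary should apply to every alternative. The sample strings are invented for the example, and the word-boundary variant is an addition that the question did not ask for:

```python
import re

# Made-up sample text, not taken from the PDF
text = "flexibility in format, shipped to Alaska"

with_group = re.compile(r'(?:flexibility|Alaska)')
without_group = re.compile(r'flexibility|Alaska')

# With nothing around the alternation, both patterns behave the same
same = with_group.findall(text) == without_group.findall(text)

# The group matters once something surrounds the alternation,
# e.g. \b anchors that should apply to both alternatives.
bounded = re.compile(r'\b(?:flexibility|Alaska)\b')
bounded_matches = bounded.findall("inflexibility, flexibility, Alaskan, Alaska")

print(same)
print(bounded_matches)
```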

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/53676407/
