python - Finding which page of a PDF document contains a search string using Python

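Which Python packages can I use to find out which page a specific "search string" is located on?

I looked at several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems like overkill for such a simple task. Any suggestions?

More precisely: I have several PDF documents, and I would like to extract the pages that lie between the string "Begin" and the string "End".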

Best answer

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

(1) A function to locate the string

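def fnPDF_FindText(xFile, xString):
    # xFile   : path of the PDF file in which to look
    # xString : the string to look for (re.search treats it as a regular expression)
    # Returns the 0-based index of the first matching page, or -1 if no page matches.
    import pyPdf, re
    PageFound = -1
    # open() replaces the Python 2-only file() builtin; keep the file open while pages are read
    with open(xFile, "rb") as pdfFile:
        pdfDoc = pyPdf.PdfFileReader(pdfFile)
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            # Drop non-ASCII characters before searching
            content1 = content.encode('ascii', 'ignore')
            # Match case-insensitively, so that "Begin" also finds "begin" or "BEGIN"
            ResSearch = re.search(xString, content1, re.IGNORECASE)
            if ResSearch is not None:
                PageFound = i
                break
    return PageFound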
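
One caveat with the function above: it hands xString to re.search, so characters such as ( or + are read as regular-expression syntax, and text extracted from a PDF often has stray line breaks in the middle of a phrase. The sketch below is one possible workaround and is not part of the original answer. It works on a plain list of page texts (for example, the result of extractText() for each page), and the names normalize and find_first_page are hypothetical.

```python
import re


def normalize(text):
    """Collapse runs of whitespace into single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_first_page(page_texts, needle):
    """Return the 0-based index of the first page containing needle, or -1.

    The comparison is a plain substring test, so regex metacharacters in
    needle are matched literally.
    """
    target = normalize(needle)
    for index, text in enumerate(page_texts):
        if target in normalize(text):
            return index
    return -1
```

Because it takes plain strings, the matching logic can be checked without a PDF file at hand: find_first_page(["Intro", "Total (USD)\n 3.50"], "total (usd) 3.50") returns 1, even though the phrase is split across a line break and contains parentheses.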

(2) A function to extract the pages of interest

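def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (0-based; xPageEnd itself is excluded)
    # from xFileNameOriginal into a new PDF written to xFileNameOutput.
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    with open(xFileNameOriginal, "rb") as inputStream:
        pdfOne = PdfFileReader(inputStream)
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        # Write once, after all pages have been added, while the source file is still open
        with open(xFileNameOutput, "wb") as outputStream:
            output.write(outputStream)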
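
To tie the two steps back to the original question (the pages between "Begin" and "End"), here is a minimal sketch of how the page range might be computed and then extracted. It is not the answer's own code: page_span and extract_between are hypothetical names, and extract_between assumes the newer pypdf package (PdfReader, PdfWriter) is installed rather than the legacy pyPdf used above.

```python
def page_span(page_texts, begin, end):
    """Return (first, stop) so that range(first, stop) covers the pages from
    the one containing `begin` through the one containing `end`.

    Matching is a case-insensitive substring test. Returns None if either
    marker is missing, or if `end` only occurs before `begin`.
    """
    lowered = [text.lower() for text in page_texts]
    first = next((i for i, t in enumerate(lowered) if begin.lower() in t), -1)
    if first == -1:
        return None
    # Look for the end marker on the begin page or any later page.
    last = next((i for i in range(first, len(lowered)) if end.lower() in lowered[i]), -1)
    if last == -1:
        return None
    return first, last + 1


def extract_between(src_path, dst_path, begin, end):
    """Hypothetical helper: copy the pages between the markers to dst_path.

    Assumes the `pypdf` package is installed. Returns True if pages were written.
    """
    from pypdf import PdfReader, PdfWriter  # imported here: optional dependency

    reader = PdfReader(src_path)
    texts = [page.extract_text() or "" for page in reader.pages]
    span = page_span(texts, begin, end)
    if span is None:
        return False
    writer = PdfWriter()
    for index in range(*span):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
    return True
```

The returned pair is half-open, so with the functions above it can also be passed straight to fnPDF_ExtractPages as xPageStart and xPageEnd. Since this is a substring test, short markers such as "End" will also match inside longer words, so distinctive marker strings are safer.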

I hope this helps someone else.

For "python - Finding which page of a PDF document contains a search string using Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12571905/
