python - Converting PDF to text: "Text extraction is not allowed"

Tags: python python-3.x pdfminer

I am trying to convert a PDF to text in Python, but it raises this error:

PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name='C:\Users\Downloads\Facts_for_2017.pdf'>

The code I am using is:

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io    


def pdfparser(data):
    fp = open(data, 'rb')      
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data = retstr.getvalue()

    return data


if __name__ == '__main__':
    text = pdfparser(Input_path)

Can anyone help me?

The file is available at:

https://drive.google.com/file/d/1RyR-J-EwMywL6BqsYbl4Ocm96VzCYrM7/view?usp=sharing

Best answer

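The problem is that PDFPage.get_pages() checks by default whether text extraction is permitted for the document. You have to pass check_extractable=False to make it work. Also, if the PDF you are trying to convert to text is password protected, you can supply the password in the same call. Unfortunately, the PDFPage documentation is not very clear about this.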

password = ""
for page in PDFPage.get_pages(fp, check_extractable=False, password=password):
    interpreter.process_page(page)
data = retstr.getvalue()
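
For background: as far as I understand it, the check looks at the permission flags stored in an encrypted PDF's /P entry, where bit 5 (value 16) governs copying or extracting text. The following is only an illustrative sketch of how those bits decode. The constant names and decode_permissions are made up for this example and are not part of pdfminer; the sample /P values are hypothetical, not read from the file in the question.

```python
# Permission bits of the /P entry in a PDF encryption dictionary.
# Bit positions are 1-based in the PDF specification, so "bit 5" is 1 << 4.
PRINT = 1 << 2      # bit 3: print the document
MODIFY = 1 << 3     # bit 4: modify the contents
EXTRACT = 1 << 4    # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5   # bit 6: add or modify annotations


def decode_permissions(p):
    """Map a /P integer (usually negative) to permission names and booleans."""
    return {
        "print": bool(p & PRINT),
        "modify": bool(p & MODIFY),
        "extract": bool(p & EXTRACT),
        "annotate": bool(p & ANNOTATE),
    }


# Hypothetical /P values for illustration only.
locked_down = decode_permissions(-3904)   # every optional permission cleared
print_and_copy = decode_permissions(-44)  # printing and extraction allowed
```

A document whose flags decode with "extract" set to False is the kind that triggers PDFTextExtractionNotAllowed when the check is enabled. Passing check_extractable=False skips that check; it does not change the file.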

Your complete code would then look like this:

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfparser(data):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    with open(data, 'rb') as fp:
        for page in PDFPage.get_pages(fp,
                                      pagenos, 
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=False):
            interpreter.process_page(page)

    # As pointed out in another answer, this goes outside the loop
    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

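Note: Python's with open(...) as fp: pattern is useful here because it closes the file object properly, even if an error occurs while the pages are being processed.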
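
To illustrate that note, here is a small self-contained sketch using only the standard library. It writes a throwaway file (a stand-in for a real PDF, so sample_path and the simulated failure are assumptions of the example) and shows that the file is closed even when the body of the with block raises.

```python
import os
import tempfile

# Create a throwaway file so the sketch runs on its own.
# It only contains a PDF-style header, not a valid document.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    sample_path = tmp.name

handle = None
header = b""
try:
    with open(sample_path, "rb") as fp:
        handle = fp              # keep a reference to inspect afterwards
        header = fp.read(5)
        # Stand-in for an exception raised while processing pages.
        raise RuntimeError("simulated failure while processing pages")
except RuntimeError:
    pass

# The with block closed the file although its body raised.
file_was_closed = handle.closed

os.remove(sample_path)
```

With a bare fp = open(...), as in the question, the same failure would leave the file open until the object is garbage collected.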

Source: the original question for python - Converting PDF to text: "Text extraction is not allowed" is on Stack Overflow: https://stackoverflow.com/questions/54009871/
