python - How to extract text from an online PDF using pdfminer in Python

Tags: python web-scraping pdfminer

I want to extract text from an online PDF with pdfminer using the code below. It raises no error, but it prints nothing.

from pdfminer.pdfpage import PDFPage
from urllib import request
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    PDFPage.get_pages(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = request.urlopen("https://www.jstage.jst.go.jp/article/cancer/9/0/9_KJ00003588219/_pdf/-char/en")
outputString = readPDF(pdfFile)
print(outputString)

Best answer

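The following code works on Python 3.7.4. It downloads the PDF into memory and feeds each page through a PDFPageInterpreter.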

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage
import io
import urllib.request
import requests


def pdf_to_text(pdf_file):
    text_memory_file = io.StringIO()

    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, text_memory_file, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # get first 3 pages of the pdf file
    for page in PDFPage.get_pages(pdf_file, pagenos=(0, 1, 2)):
        interpreter.process_page(page)
    text = text_memory_file.getvalue()
    text_memory_file.close()
    return text

# # online pdf to text by urllib
# online_pdf_file=urllib.request.urlopen('http://www.dabeaz.com/python/UnderstandingGIL.pdf')
# pdf_memory_file=io.BytesIO()
# pdf_memory_file.write(online_pdf_file.read())
# print(pdf_to_text(pdf_memory_file))


# online pdf to text by requests
response = requests.get('http://www.dabeaz.com/python/UnderstandingGIL.pdf')
pdf_memory_file = io.BytesIO()
pdf_memory_file.write(response.content)
print(pdf_to_text(pdf_memory_file))
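
The code above copies the download into an io.BytesIO before parsing. A likely reason, stated here as an assumption since the answer does not spell it out, is that the PDF parser needs to seek within the file, while an HTTP response stream can only be read forward. The sketch below illustrates the idea with a hypothetical stand-in stream (ForwardOnlyStream) and a hypothetical helper (to_seekable); neither is part of pdfminer or requests.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a network response: readable, not seekable."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Copy the next chunk into the caller's buffer; 0 signals end of stream.
        chunk = self._source.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def to_seekable(stream):
    """Read the whole stream into memory and return a seekable buffer at position 0."""
    return io.BytesIO(stream.read())


# Placeholder bytes only; this is not a real PDF.
fake_response = ForwardOnlyStream(b"%PDF-1.4 placeholder bytes")
buffer = to_seekable(fake_response)

print(fake_response.seekable(), buffer.seekable())

# A seekable buffer can jump to the end and back, which a forward-only stream cannot.
buffer.seek(0, io.SEEK_END)
size = buffer.tell()
buffer.seek(0)
print(size)
```

Holding the whole file in memory is reasonable for typical documents; for very large PDFs, writing the download to a temporary file would be the usual alternative.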

# extract metadata
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

parser = PDFParser(pdf_memory_file)
doc = PDFDocument(parser)
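# doc.info is a list of info dictionaries; take the first one
metadata = doc.info[0]
for k in metadata:
    # resolve1 dereferences indirect PDF object references
    print(k, resolve1(metadata[k]))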
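
As for why the code in the question ran without an error yet printed nothing: PDFPage.get_pages is a generator, so calling it without looping over the result never executes its body, and no page ever reaches the converter. The toy example below shows the same behaviour with a hypothetical generator (fake_get_pages) standing in for the pdfminer call. It illustrates the mechanism only and is not pdfminer's code.

```python
import io


def fake_get_pages(pdf_file, pagenos=None):
    """Hypothetical stand-in for a page generator; not pdfminer's implementation."""
    if not hasattr(pdf_file, "read"):
        # Reached only once iteration starts, so a bad argument goes
        # unnoticed when the result is never looped over.
        raise TypeError("expected a file-like object")
    for number in range(3):
        yield "page %d" % number


processed = []

# Mirrors the question: wrong argument, result discarded.
# The generator body never runs, so nothing happens and nothing fails.
fake_get_pages("not a file")

# Mirrors the answer: iterate and handle every page.
for page in fake_get_pages(io.BytesIO(b"")):
    processed.append(page)

print(processed)
```

This is why the working version wraps the call in a for loop and passes each page to interpreter.process_page.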

This question about extracting text from an online PDF with pdfminer in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57591220/
