python - Converting remote PDF pages to temporary images for OCR

I have a remote PDF file that I need to read page by page, passing each page in turn to an OCR engine that returns the recognized text.

import cStringIO
import tempfile
import urllib2

import pytesseract
from pyPdf import PdfFileWriter, PdfFileReader
from wand.image import Image
from PIL import Image as PILImage  # aliased so it does not shadow wand's Image

remoteFile = urllib2.urlopen(urllib2.Request("file:///home/user/Documents/TestDocs/test.pdf")).read()
memoryFile = cStringIO.StringIO(remoteFile)

pdfFile = PdfFileReader(memoryFile)
for pageNum in xrange(pdfFile.getNumPages()):
    currentPage = pdfFile.getPage(pageNum)

    ## somehow convert currentPage to wand type
    ## image and then pass to tesseract-api
    ##
    ## TEMP_IMAGE = some conversion to temp file
    ## pytesseract.image_to_string(PILImage.open(TEMP_IMAGE))

memoryFile.close()

I thought about using cStringIO or tempfile, but I don't know how to use them for this purpose.

How can I solve this?

Best answer

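There are a couple of ways to do this. Given the code you provided, the most compatible approach is to store each page image temporarily on disk and delete it once pytesseract has read the text. I create a wand image to extract each page from the PDF individually, then convert it to a PIL image for pytesseract. Here is the code I used for this; it writes the detected text to the list "text", where each element corresponds to one page image from the original PDF. I also updated some imports to make the code compatible with Python 3 (cStringIO -> io and urllib2 -> urllib.request).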

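import io
import os
import tempfile
import urllib.request

import PyPDF2
import pytesseract
from PIL import Image as PILImage
from wand.image import Image

pdf_url = 'file:///home/user/Documents/TestDocs/test.pdf'
# Scratch path for the current page image (any writable location works);
# the .png extension tells wand which format to write.
tempFile_Location = os.path.join(tempfile.gettempdir(), 'ocr_page.png')

with urllib.request.urlopen(pdf_url) as response:
    pdf_read = response.read()

# Used only to count the pages. In PyPDF2 >= 3.0 the equivalent calls are
# PyPDF2.PdfReader(...) and len(reader.pages).
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix asks ImageMagick to load only page p (zero-based)
    with Image(filename=pdf_url + '[' + str(p) + ']') as img:
        with Image(image=img) as converted:  # Need second with to convert SingleImage object from wand to Image
            converted.save(filename=tempFile_Location)
            text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
            os.remove(tempFile_Location)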
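
The listing above deletes the temporary image only if OCR succeeds; if pytesseract raises, the file is left behind. A minimal sketch of one way to guarantee cleanup, using only the standard library. The helper name temporary_image_path is hypothetical and not part of the original answer.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_image_path(suffix=".png"):
    """Yield a unique scratch file path and delete the file on exit.

    Hypothetical helper: the suffix decides which image format a library
    such as wand writes when saving to this path.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the image library reopens it
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises
        if os.path.exists(path):
            os.remove(path)


# Intended use inside the page loop of the listing above (needs wand,
# Pillow and pytesseract, so it is shown as comments here):
#
#     with temporary_image_path() as path:
#         converted.save(filename=path)
#         with PILImage.open(path) as page_image:
#             text.append(pytesseract.image_to_string(page_image))
```

Because mkstemp picks a unique name on every call, two scripts running at the same time would not overwrite each other's page image.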

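Alternatively, if you want to avoid creating and deleting a temporary file for each image, you can use numpy and OpenCV: extract the image as a blob, convert it to a numpy array, and then turn that into a PIL image for pytesseract to run OCR on.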

import io
import urllib.request

import cv2
import numpy as np
import PyPDF2
import pytesseract
from PIL import Image as PILImage
from wand.image import Image

pdf_url = 'file:///home/user/Documents/TestDocs/test.pdf'

with urllib.request.urlopen(pdf_url) as response:
    pdf_read = response.read()

# Used only to count the pages. In PyPDF2 >= 3.0 the equivalent calls are
# PyPDF2.PdfReader(...) and len(reader.pages).
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix asks ImageMagick to load only page p (zero-based)
    with Image(filename=pdf_url + '[' + str(p) + ']') as img:
        # Encode the page as PNG so cv2.imdecode receives a raster format it can decode
        img_buffer = np.asarray(bytearray(img.make_blob('png')), dtype=np.uint8)
        # Decode the PNG bytes into a grayscale numpy array
        retval = cv2.imdecode(img_buffer, cv2.IMREAD_GRAYSCALE)
        text.append(pytesseract.image_to_string(PILImage.fromarray(retval)))
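
Both listings download the PDF with urllib.request and then let wand open the same URL again for every page. As a variation (not the method from the answer above), the following sketch downloads the file once, rasterizes it from memory through wand's blob argument, and hands each page to Pillow via io.BytesIO, without numpy or OpenCV. The function names and the 300 dpi default are assumptions chosen for illustration; the wand, Pillow and pytesseract imports are deferred so that the download helper works on its own.

```python
import io
import urllib.request


def fetch_pdf_bytes(url, timeout=30):
    """Download the document at `url` once and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def rasterize_pages(pdf_bytes, resolution=300):
    """Yield one PNG blob per PDF page.

    Assumes wand is installed and ImageMagick can read PDFs
    (normally through Ghostscript).
    """
    from wand.image import Image  # deferred: only needed for rasterizing

    with Image(blob=pdf_bytes, format="pdf", resolution=resolution) as pdf:
        for page in pdf.sequence:
            # Wrap the SingleImage in a full wand Image before encoding it
            with Image(image=page) as single:
                yield single.make_blob("png")


def ocr_remote_pdf(url):
    """Return a list with the OCR text of each page of the PDF at `url`."""
    import pytesseract  # deferred: requires the tesseract binary
    from PIL import Image as PILImage

    text = []
    for png_blob in rasterize_pages(fetch_pdf_bytes(url)):
        # BytesIO makes the in-memory PNG look like a file to Pillow
        with PILImage.open(io.BytesIO(png_blob)) as page_image:
            text.append(pytesseract.image_to_string(page_image))
    return text


# Example call, using the URL from the question:
#     pages = ocr_remote_pdf("file:///home/user/Documents/TestDocs/test.pdf")
```

Reading the whole document from one blob keeps every rasterized page in memory at once, so for very large PDFs the per-page loading used in the listings above may be the better trade-off.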

This post on python - converting remote PDF pages to temporary images for OCR is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/31095025/
