python - Unable to extract text from a tif image using pytesseract in Python

Tags: python python-3.x python-imaging-library python-tesseract

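I cannot extract text from a .tif image file using pytesseract and PIL in Python. It works for .png and .jpg image files and fails only for .tif files. I am using Python 3.7.1.

Running the Python code on a .tif image file produces the error below. Please let me know what I am doing wrong.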

Fax3SetupState: Bits/sample must be 1 for Group 3/4 encoding/decoding.
Traceback (most recent call last):
  File "C:/Users/u88ltuc/PycharmProjects/untitled1/Image Processing/Prog1.py", line 13, in <module>
    image_to_text = pytesseract.image_to_string(image, lang='eng')
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 347, in image_to_string
    }[output_type]()
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 346, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 246, in run_and_get_output
    with save(image) as (temp_name, input_filename):
  File "C:\Program Files\Python37\lib\contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\pytesseract\pytesseract.py", line 171, in save
    image.save(input_file_name, format=extension, **image.info)
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\PIL\Image.py", line 2102, in save
    save_handler(self, fp, filename)
  File "C:\Users\u88ltuc\PycharmProjects\untitled1\venv\lib\site-packages\PIL\TiffImagePlugin.py", line 1626, in _save
    raise OSError("encoder error %d when writing image file" % s)
OSError: encoder error -2 when writing image file

Below is the Python code.

# Import modules
from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Create an image object of PIL library
image = Image.open(r'C:\Users\u88ltuc\Desktop\12110845-e001.tif')

# pass image into pytesseract module
image_to_text = pytesseract.image_to_string(image, lang='eng')

# Print the text
print(image_to_text)

The .tif image is available at the following link:

https://ecat.aptiv.com/docs/default-source/ecatalog-documents/12110845-e001-tif.tif?sfvrsn=3ee3b8a1_0

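Best answer

First, convert the image to a different format. The traceback shows pytesseract re-saving the image as TIFF with its original settings and the encoder rejecting it ("Bits/sample must be 1 for Group 3/4 encoding/decoding"), so re-encoding the image, for example as JPEG, may solve your problem: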

from io import BytesIO

from PIL import Image
import pytesseract

# Open the original TIFF
img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif")

# Re-encode it as JPEG in an in-memory buffer, then reopen it from there
temp_io = BytesIO()
img.save(temp_io, format="JPEG")
img = Image.open(BytesIO(temp_io.getvalue()))

print(pytesseract.image_to_string(img))
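
JPEG is a lossy format, which can blur fine text before OCR. As a variation on the same idea, the sketch below re-encodes the image as PNG (lossless) in memory. The helper names `reencode_as_png` and `ocr_tif` are hypothetical and not part of Pillow or pytesseract, and the mode handling is an assumption about which inputs you might meet, not something taken from the original answer.

```python
from io import BytesIO

from PIL import Image


def reencode_as_png(img):
    """Return a copy of `img` re-encoded as PNG in memory (hypothetical helper)."""
    # PNG cannot store some Pillow modes (for example CMYK), so fall back to
    # RGB for anything outside this list. This list is an assumption.
    if img.mode not in ("1", "L", "LA", "P", "RGB", "RGBA"):
        img = img.convert("RGB")
    buffer = BytesIO()
    img.save(buffer, format="PNG")
    buffer.seek(0)
    png = Image.open(buffer)
    png.load()  # read the pixel data now, while the buffer is in scope
    return png


def ocr_tif(path, lang="eng"):
    """Open an image file, re-encode it as PNG, and run OCR on the result."""
    # Imported here so the helper above can be used without Tesseract installed.
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(reencode_as_png(img), lang=lang)
```

With the file from the question, the call would be `print(ocr_tif(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif"))`, with `pytesseract.pytesseract.tesseract_cmd` set beforehand as in the question if Tesseract is not on the PATH.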

Alternatively, if you don't mind having two copies of the same picture on your desktop, you can skip the BytesIO import:

from PIL import Image
import pytesseract

img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.tif")
img.save(r"C:\Users\u88ltuc\Desktop\12110845-e001.jpg")
img = Image.open(r"C:\Users\u88ltuc\Desktop\12110845-e001.jpg")

print(pytesseract.image_to_string(img))
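
If the extra copy on the desktop is unwanted, the same file-based conversion can be pointed at a temporary directory instead. This is only a sketch: `save_converted_copy` and `ocr_via_temp_copy` are hypothetical helper names, and the choice of PNG as the default target format is an assumption rather than part of the original answer.

```python
import tempfile
from pathlib import Path

from PIL import Image


def save_converted_copy(src_path, out_dir, fmt="PNG"):
    """Write a copy of the image at `src_path` into `out_dir` in format `fmt`."""
    src_path = Path(src_path)
    out_path = Path(out_dir) / (src_path.stem + "." + fmt.lower())
    with Image.open(src_path) as img:
        img.save(out_path, format=fmt)
    return out_path


def ocr_via_temp_copy(src_path, lang="eng"):
    """Convert the image into a temporary directory, then run OCR on the copy."""
    # Imported here so the conversion helper works without Tesseract installed.
    import pytesseract

    with tempfile.TemporaryDirectory() as tmp:
        copy_path = save_converted_copy(src_path, tmp)
        # Close the copy before the directory is cleaned up (needed on Windows).
        with Image.open(copy_path) as img:
            return pytesseract.image_to_string(img, lang=lang)
```

The temporary directory and the converted copy are deleted when `ocr_via_temp_copy` returns, so only the original .tif remains on disk.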

Regarding python - Unable to extract text from a tif image using pytesseract in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60500698/
