python - How to print Tesseract results as Chinese characters

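I'm trying to get my program to recognize Chinese with Tesseract, and the recognition itself works. The only problem is printing the result as Chinese characters: the output comes out in pinyin (the way you would type Chinese words using Latin letters).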

# Import libraries
from PIL import Image
import pytesseract
from unidecode import unidecode

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image_counter = 2

filelimit = image_counter - 1

outfile = "out_text.txt"

f = open(outfile, "a")

for i in range(1, filelimit + 1):
    print("ran")
    filename = "page_" + str(i) + ".png"

    # Recognize the text in the image as a string using pytesseract
    text = unidecode(((pytesseract.image_to_string(Image.open(filename), lang = "chi_sim"))))

    print(text)

This is the image I ran it on

This is what I got

ran
清明世解与分分,陆商行人与断缺新文旧家何出友，木易通之强化村。

The result should be the Chinese characters shown in the image.

Best answer

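Never mind, I figured out my problem. The unidecode call was transliterating the recognized characters to ASCII, which is where the pinyin came from.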

text = unidecode(((pytesseract.image_to_string(Image.open(filename), lang = "chi_sim"))))

should be

text = pytesseract.image_to_string(Image.open(filename), lang = "chi_tra")
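For anyone who also wants the text in out_text.txt (the question opens that file but never writes to it), here is a minimal sketch of the loop without unidecode. The names ocr_pages, contains_han and tesseract_recognize are hypothetical helpers made up for this example, not part of pytesseract. The sketch assumes the file should be written as UTF-8, since the default encoding on Windows often cannot represent Chinese text, and it assumes the matching Tesseract language data is installed.

```python
def contains_han(text):
    """Return True if text has at least one character in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def ocr_pages(page_paths, recognize, out_path):
    """Call recognize(path) for each page, append the text to out_path,
    and return the recognized strings in order.

    recognize is any callable that takes a path and returns a str
    (a hypothetical hook, so the loop can be tried without Tesseract).
    """
    texts = []
    # Explicit UTF-8: the platform default encoding may fail on Chinese text.
    with open(out_path, "a", encoding="utf-8") as out:
        for path in page_paths:
            text = recognize(path)  # no unidecode(), characters stay as-is
            if not contains_han(text):
                print(f"warning: no Chinese characters recognized in {path}")
            out.write(text)
            texts.append(text)
    return texts


def tesseract_recognize(path, lang="chi_tra"):
    """Recognizer backed by pytesseract, using the lang from the answer.

    If tesseract.exe is not on PATH, set
    pytesseract.pytesseract.tesseract_cmd first, as in the question.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang=lang)
```

Calling ocr_pages(["page_1.png"], tesseract_recognize, "out_text.txt") and printing each returned string should then match the loop from the question. Which language to pass depends on the image: chi_sim is the simplified-script model and chi_tra the traditional one.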

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/57866592/
