Python 3.7.5
OS: Windows Server 2016
Ghostscript version: 9.5
I am trying to use Ghostscript to extract text from several PDFs in a directory. The directory currently contains 2 PDFs: 1234.pdf and 5678.pdf.
import os
import sys

def pdf2txt(directory,file):
    import locale
    import ghostscript
    args=[file,"-dBATCH","-dNOPAUSE","-dNOPROMPT","-sDEVICE=txtwrite","-sOutputFile="+directory+"\\output\\"+file+"-%d.txt",directory+"\\"+file]
    encoding=locale.getpreferredencoding()
    args=[a.encode(encoding) for a in args]
    print (args)
    ghostscript.Ghostscript(*args)

directory=sys.argv[1]
files=os.listdir(directory)
for file in files:
    print("Trying "+directory+"\\"+file)
    pdf2txt(directory,file)
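The problem is that the first PDF is processed without issue, but attempting to process the second PDF always makes Python fail. I noticed that the same error occurs even when I run the text extraction from the Python console. The only way I can extract the second file is to exit Python and start it again.
I renamed the files so that the second PDF is processed first. That PDF is then processed without issue, while the PDF that previously succeeded now throws the fatal error. I have tried resetting the argument list and the encoding variable to empty, and calling methods that do not exist on Ghostscript, such as .quit() or .exit(). I did see a post mentioning that the exit method was commented out in the package's __init__, which it was. I uncommented it, to no avail.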
C:\Users\bob\Documents>python exporter.py c:\users\bob\Desktop\PDFs
Trying c:\users\bob\Desktop\PDFs\1234.pdf
[b'1234.pdf', b'-dBATCH', b'-dNOPAUSE', b'-dNOPROMPT', b'-sDEVICE=txtwrite', b'-sOutputFile=c:\\users\\bob\\Desktop\\PDFs\\output\\1234.pdf-%d.txt', b'c:\\users\\bob\\Desktop\\PDFs\\1234.pdf']
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 22.
Page 1
Page 2
Page 3
Page 4
Trying c:\users\bob\Desktop\PDFs\5678.pdf
[b'5678.pdf', b'-dBATCH', b'-dNOPAUSE', b'-dNOPROMPT', b'-sDEVICE=txtwrite', b'-sOutputFile=c:\\users\\bob\\Desktop\\PDFs\\output\\5678.pdf-%d.txt', b'c:\\users\\bob\\Desktop\\PDFs\\5678.pdf']
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Traceback (most recent call last):
  File "exporter.py", line 18, in <module>
    pdf2txt(directory,file)
  File "exporter.py", line 11, in pdf2txt
    ghostscript.Ghostscript(*args)
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\__init__.py", line 174, in Ghostscript
    stderr=kw.get('stderr', None))
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\__init__.py", line 74, in __init__
    rc = gs.init_with_args(instance, args)
  File "C:\Program Files\Python37\lib\site-packages\ghostscript\_gsprint.py", line 273, in init_with_args
    raise GhostscriptError(rc)
ghostscript._gsprint.GhostscriptError: Fatal
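Since a freshly started Python process always handles its first file correctly, one workaround consistent with that observation is to run each conversion in its own Ghostscript process rather than through the in-process bindings. The following is only a sketch under assumptions: it swaps the `ghostscript` Python package for the command-line executable, assumes that executable is on PATH as `gswin64c` (the usual name for the 64-bit Windows console build), and uses the hypothetical helper names `build_gs_command` and `convert_all`. It also skips non-PDF entries, because `os.listdir` returns the `output` subdirectory as well.

```python
import os
import subprocess


def build_gs_command(pdf_path, output_dir, gs_exe="gswin64c"):
    """Build the command line for one txtwrite conversion (hypothetical helper)."""
    # Same naming scheme as the script above: <pdf name>-<page>.txt
    out_pattern = os.path.join(output_dir, os.path.basename(pdf_path) + "-%d.txt")
    return [
        gs_exe,  # assumption: the Ghostscript console executable is on PATH
        "-dBATCH", "-dNOPAUSE", "-dNOPROMPT",
        "-sDEVICE=txtwrite",
        "-sOutputFile=" + out_pattern,
        pdf_path,
    ]


def convert_all(directory, gs_exe="gswin64c"):
    """Convert every PDF in `directory`, one Ghostscript process per file."""
    output_dir = os.path.join(directory, "output")
    os.makedirs(output_dir, exist_ok=True)  # make sure the target directory exists
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue  # skip the output directory and any other non-PDF entry
        cmd = build_gs_command(os.path.join(directory, name), output_dir, gs_exe)
        # check=True raises CalledProcessError if Ghostscript exits with an error
        subprocess.run(cmd, check=True)
```

It could be called as `convert_all(sys.argv[1])` from a script like exporter.py. Each file gets a brand-new interpreter, so no state is carried over from one PDF to the next, at the cost of starting one process per file.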
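Best answer
I ran into the same problem today and found that ghostscript.Ghostscript should be called in a with block. In addition, I had to call ghostscript.cleanup() before creating a new instance of ghostscript.Ghostscript.
Try this: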
import os
import sys

def pdf2txt(directory,file):
    import locale
    import ghostscript
    args=[file,"-dBATCH","-dNOPAUSE","-dNOPROMPT","-sDEVICE=txtwrite","-sOutputFile="+directory+"\\output\\"+file+"-%d.txt",directory+"\\"+file]
    encoding=locale.getpreferredencoding()
    args=[a.encode(encoding) for a in args]
    print (args)
    with ghostscript.Ghostscript(*args) as g:
        ghostscript.cleanup()

directory=sys.argv[1]
files=os.listdir(directory)
for file in files:
    print("Trying "+directory+"\\"+file)
    pdf2txt(directory,file)
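A slightly more defensive variant of the same idea is sketched below. It is an illustration, not the answer's own code: the helper names `build_gs_args` and `convert_directory` are hypothetical, paths are joined with `os.path.join` instead of string concatenation, non-PDF entries such as the `output` directory are skipped, and `ghostscript.cleanup()` is moved into a `finally` clause so that it also runs when a conversion fails. It assumes a version of the `ghostscript` package that provides `cleanup()` and context-manager support, as the answer above relies on.

```python
import os


def build_gs_args(directory, filename):
    """Return the Ghostscript argument list for one PDF (hypothetical helper)."""
    out_pattern = os.path.join(directory, "output", filename + "-%d.txt")
    return [
        "gs",  # first element plays the role of argv[0]; Ghostscript ignores it
        "-dBATCH", "-dNOPAUSE", "-dNOPROMPT",
        "-sDEVICE=txtwrite",
        "-sOutputFile=" + out_pattern,
        os.path.join(directory, filename),
    ]


def pdf2txt(directory, filename):
    import locale
    import ghostscript  # imported lazily so build_gs_args can be used without it

    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in build_gs_args(directory, filename)]
    try:
        # The conversion runs while the object is constructed; leaving the
        # with block is expected to shut the interpreter down again.
        with ghostscript.Ghostscript(*args):
            pass
    finally:
        # Release the module-level instance so the next call starts clean.
        ghostscript.cleanup()


def convert_directory(directory):
    """Convert every PDF in `directory` to text files under `directory/output`."""
    os.makedirs(os.path.join(directory, "output"), exist_ok=True)
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(".pdf"):  # skips the output directory itself
            pdf2txt(directory, name)
```

With this layout the script's main part reduces to `convert_directory(sys.argv[1])`. Whether the cleanup call belongs inside or after the with block may depend on the installed package version, so it is worth testing against the version in use.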
Regarding "python - Ghostscript fatal error when processing multiple files", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59295941/