python - How can I call pypdfocr functions to use them in a Python script?

Tags: python python-2.7 python-3.x pdfbox

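I recently downloaded pypdfocr, but the documentation has no example of how to call pypdfocr as a library. Can someone help me call it just to convert a single file? So far I have only found a terminal command: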

$ pypdfocr filename.pdf

Best answer

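If you are looking for the source code, it is usually under the site-packages directory of your Python installation. An IDE such as PyCharm will also help you find the directory and the file, which is useful for locating classes and seeing how to instantiate them. For example, https://github.com/virantha/pypdfocr/blob/master/pypdfocr/pypdfocr.py defines a pypdfocr class that you can reuse, and it can probably do what the command line does.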
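
If you do not use an IDE, Python itself can report where an installed module's source file lives. The following is a minimal sketch; the helper name `source_path` is made up for illustration, and `argparse` is used only because it is always available. Once pypdfocr is installed, passing `"pypdfocr.pypdfocr"` (the module path implied by the GitHub URL above) should point you at the file to read.

```python
import importlib
import inspect


def source_path(module_name):
    """Import a module by its dotted name and return the path of its source file."""
    module = importlib.import_module(module_name)
    return inspect.getsourcefile(module)


# Demonstrated on a standard-library module so the snippet runs anywhere.
print(source_path("argparse"))
```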

In that class, the developer defines quite a few arguments to parse:

def get_options(self, argv):
    """
        Parse the command-line options and set the following object properties:
        :param argv: usually just sys.argv[1:]
        :returns: Nothing
        :ivar debug: Enable logging debug statements
        :ivar verbose: Enable verbose logging
        :ivar enable_filing: Whether to enable post-OCR filing of PDFs
        :ivar pdf_filename: Filename for single conversion mode
        :ivar watch_dir: Directory to watch for files to convert
        :ivar config: Dict of the config file
        :ivar watch: Whether folder watching mode is turned on
        :ivar enable_evernote: Enable filing to evernote
    """
    p = argparse.ArgumentParser(description = "Convert scanned PDFs into their OCR equivalent.  Depends on GhostScript and Tesseract-OCR being installed.",
            epilog = "PyPDFOCR version %s (Copyright 2013 Virantha Ekanayake)" % __version__,
            )

    p.add_argument('-d', '--debug', action='store_true',
        default=False, dest='debug', help='Turn on debugging')

    p.add_argument('-v', '--verbose', action='store_true',
        default=False, dest='verbose', help='Turn on verbose mode')

    p.add_argument('-m', '--mail', action='store_true',
        default=False, dest='mail', help='Send email after conversion')

    p.add_argument('-l', '--lang',
        default='eng', dest='lang', help='Language(default eng)')


    p.add_argument('--preprocess', action='store_true',
            default=False, dest='preprocess', help='Enable preprocessing.  Not really useful now with improved Tesseract 3.04+')

    p.add_argument('--skip-preprocess', action='store_true',
            default=False, dest='skip_preprocess', help='DEPRECATED: always skips now.')

    #---------
    # Single or watch mode
    #--------
    single_or_watch_group = p.add_mutually_exclusive_group(required=True)
    # Positional argument for single file conversion
    single_or_watch_group.add_argument("pdf_filename", nargs="?", help="Scanned pdf file to OCR")
    # Watch directory for watch mode
    single_or_watch_group.add_argument('-w', '--watch', 
         dest='watch_dir', help='Watch given directory and run ocr automatically until terminated')

    #-----------
    # Filing options
    #----------
    filing_group = p.add_argument_group(title="Filing optinos")
    filing_group.add_argument('-f', '--file', action='store_true',
        default=False, dest='enable_filing', help='Enable filing of converted PDFs')
    #filing_group.add_argument('-c', '--config', type = argparse.FileType('r'),
    filing_group.add_argument('-c', '--config', type = lambda x: open_file_with_timeout(p,x),
         dest='configfile', help='Configuration file for defaults and PDF filing')
    filing_group.add_argument('-e', '--evernote', action='store_true',
        default=False, dest='enable_evernote', help='Enable filing to Evernote')
    filing_group.add_argument('-n', action='store_true',
        default=False, dest='match_using_filename', help='Use filename to match if contents did not match anything, before filing to default folder')


    # Add flow option to single mode extract_images,preprocess,ocr,write

    args = p.parse_args(argv)
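
To see what `p.parse_args(argv)` does with a list you build yourself, here is a reduced stand-in for the parser above. It is not pypdfocr's code: it copies only the `-v`, `-l`, positional filename and `-w` arguments, and the `prog` name `demo` is invented. It shows that the list holds the same strings you would type in a terminal, and that an empty list is rejected because the mutually exclusive group is declared with `required=True`.

```python
import argparse


def build_parser():
    # Reduced copy of a few arguments from the parser shown above (illustration only).
    p = argparse.ArgumentParser(prog="demo")
    p.add_argument('-v', '--verbose', action='store_true',
                   default=False, dest='verbose')
    p.add_argument('-l', '--lang', default='eng', dest='lang')
    group = p.add_mutually_exclusive_group(required=True)
    group.add_argument("pdf_filename", nargs="?")
    group.add_argument('-w', '--watch', dest='watch_dir')
    return p


parser = build_parser()

# Equivalent to typing: demo -v filename.pdf
args = parser.parse_args(['-v', 'filename.pdf'])
print(args.verbose, args.lang, args.pdf_filename, args.watch_dir)

# Neither a filename nor -w is given, so argparse prints a usage error
# to stderr and raises SystemExit instead of falling back to defaults.
try:
    parser.parse_args([])
    rejected = False
except SystemExit:
    rejected = True
print("empty list rejected:", rejected)
```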

You can pass any of these arguments to its parser as a list of strings, like this:

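import pypdfocr.pypdfocr

# Class name as written in this answer; check the linked source file for the exact spelling.
obj = pypdfocr.pypdfocr.pypdfocr()
# The parser requires either a PDF filename or -w/--watch, so an empty list is rejected.
# Pass the same strings you would type on the command line, e.g. ['-v', 'filename.pdf'].
obj.get_options(['filename.pdf'])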
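
If you only need to convert a single file and do not want to depend on the library's internals, another option is to run the terminal command from the question through `subprocess`. This is a sketch under assumptions: it expects the `pypdfocr` executable to be on your PATH, the helper names `build_command` and `run_ocr` are invented here, and only the `-v` and `-l` flags from the parser above are used.

```python
import subprocess


def build_command(pdf_path, lang="eng", verbose=False):
    """Return the argument list for one single-file conversion."""
    cmd = ["pypdfocr"]
    if verbose:
        cmd.append("-v")
    cmd.extend(["-l", lang, pdf_path])
    return cmd


def run_ocr(pdf_path, **options):
    """Run the pypdfocr command line tool (assumed to be installed and on PATH)."""
    # check_call raises CalledProcessError if the command exits with a non-zero status.
    subprocess.check_call(build_command(pdf_path, **options))


# Only builds and prints the command; call run_ocr("filename.pdf") to execute it.
print(build_command("filename.pdf", verbose=True))
```

Keeping the command construction separate from the call makes it easy to inspect what will be executed before anything is run.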

I hope this helps you understand it in the meantime :)

This is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/39988381/
