python - Document layout analysis for text extraction

Tags: python machine-learning nlp artificial-intelligence

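I need to analyze the layout structure of different document types, such as pdf, doc, docx, odt, etc.
My task is:
Given a document, group the text into blocks and find the correct boundaries of each block.
I ran some tests with Apache Tika. It is a very good extractor, but it often messes up the order of the blocks. Let me explain what I mean by ORDER.
Apache Tika just extracts the text, so if my document has two columns, Tika extracts the entire text of the first column and then the text of the second column. That is fine, but sometimes the text in the first column is related to the text in the second, like a table with row relationships.
So I have to take the position of each block into account, and the problems are: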

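  • Define the box boundaries, which is hard: I have to work out whether a sentence starts a new block or not.
  • Define the orientation. For example, in a table the "sentence" should be the row, not the column.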

  • So basically I have to deal with the layout structure to understand the block boundaries correctly.
    Here is a visual example:
    [image]
    A classic extractor returns:
    2019
    2018
    2017
    2016
    2015
    2014
    Oregon Arts Commission Individual Artist Fellowship...
    
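    This is wrong (in my case) because the dates are related to the text on the right.
    This task is preparation for further NLP analysis, so it matters. For example, when I need to recognize entities in the text (NER) and then identify the relations between them, working with the correct context is very important.
    How can I extract text from a document and assemble related pieces of text (understanding the document's layout structure) under the same block?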

    Best answer

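    This is only a partial solution to your problem, but it may simplify the task at hand.
    This tool takes PDF files and converts them to text files. It works very fast and can run on large batches of files.
    It creates one output text file per PDF. Its advantage over other tools is that the output text is aligned according to its original layout.
    For example, here is a resume with a complex layout:
    [image]
    Its output is the following text file: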

    Christopher                         Summary
                                        Senior Web Developer specializing in front end development.
    Morgan                              Experienced with all stages of the development cycle for
                                        dynamic web projects. Well-versed in numerous programming
                                        languages including HTML5, PHP OOP, JavaScript, CSS, MySQL.
                                        Strong background in project management and customer
                                        relations.
    
    
                                        Skill Highlights
                                            •   Project management          •   Creative design
                                            •   Strong decision maker       •   Innovative
                                            •   Complex problem             •   Service-focused
                                                solver
    
    
                                        Experience
    Contact
                                        Web Developer - 09/2015 to 05/2019
    Address:                            Luna Web Design, New York
    177 Great Portland Street, London      • Cooperate with designers to create clean interfaces and
    W5W 6PQ                                   simple, intuitive interactions and experiences.
                                           • Develop project concepts and maintain optimal
    Phone:                                    workflow.
    +44 (0)20 7666 8555
                                           • Work with senior developer to manage large, complex
                                              design projects for corporate clients.
    Email:
                                           • Complete detailed programming and development tasks
    christoper.m@gmail.com
                                              for front end public and internal websites as well as
                                              challenging back-end server code.
    LinkedIn:
                                           • Carry out quality assurance tests to discover errors and
    linkedin.com/christopher.morgan
                                              optimize usability.
    
    Languages                           Education
    Spanish – C2
                                        Bachelor of Science: Computer Information Systems - 2014
    Chinese – A1
                                        Columbia University, NY
    German – A2
    
    
    Hobbies                             Certifications
                                        PHP Framework (certificate): Zend, Codeigniter, Symfony.
       •   Writing
                                        Programming Languages: JavaScript, HTML5, PHP OOP, CSS,
       •   Sketching
                                        SQL, MySQL.
       •   Photography
       •   Design
    -----------------------Page 1 End-----------------------
    
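    Now your task is reduced to finding the bulks (blocks) within the text file, using the spaces between words as alignment hints.
    As a start, I am including a script that finds the margin between the text columns and yields rhs and lhs, the text streams of the right and left columns respectively.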
    import numpy as np
    import matplotlib.pyplot as plt
    import re
    
    # txt is assumed to hold the contents of the tool's output text file
    txt_lines = txt.split('\n')
    max_line_len = max(len(line) for line in txt_lines)
    padded_txt_lines = [line.ljust(max_line_len) for line in txt_lines]  # pad short lines with spaces
    space_idx_counters = np.zeros(max_line_len)
    
    page_end = len(padded_txt_lines)  # fallback when no end-of-page marker is found
    for idx, line in enumerate(padded_txt_lines):
        if "-----------------------Page" in line:  # reached end of page
            page_end = idx
            break
        space_idxs = [pos for pos, char in enumerate(line) if char == " "]
        space_idx_counters[space_idxs] += 1
    
    padded_txt_lines = padded_txt_lines[:page_end]  # drop the end-of-page line and everything after it
    
    # plot histogram of spaces in each character column
    plt.bar(list(range(len(space_idx_counters))), space_idx_counters)
    plt.title("Number of spaces in each column over all lines")
    plt.show()
    
    # find the separator column idx
    separator_idx = np.argmax(space_idx_counters)
    print(f"separator index: {separator_idx}")
    left_lines = []
    right_lines = []
    
    # separate two columns of text
    for line in padded_txt_lines:
        left_lines.append(line[:separator_idx])
        right_lines.append(line[separator_idx:])
    
    # join each bulk into one stream of text, remove redundant spaces
    lhs = ' '.join(left_lines)
    lhs = re.sub(r"\s{4,}", " ", lhs)
    rhs = ' '.join(right_lines)
    rhs = re.sub(r"\s{4,}", " ", rhs)
    
    print("************ Left Hand Side ************")
    print(lhs)
    print("************ Right Hand Side ************")
    print(rhs)
    
    Plot output:
    [image]
    Text output:
    separator index: 33
    ************ Left Hand Side ************
    Christopher Morgan Contact Address: 177 Great Portland Street, London W5W 6PQ Phone: +44 (0)20 7666 8555 Email: christoper.m@gmail.com LinkedIn: linkedin.com/christopher.morgan Languages Spanish – C2 Chinese – A1 German – A2 Hobbies •   Writing •   Sketching •   Photography •   Design 
    ************ Right Hand Side ************
       Summary Senior Web Developer specializing in front end development. Experienced with all stages of the development cycle for dynamic web projects. Well-versed in numerous programming languages including HTML5, PHP OOP, JavaScript, CSS, MySQL. Strong background in project management and customer relations. Skill Highlights •   Project management •   Creative design •   Strong decision maker •   Innovative •   Complex problem •   Service-focused solver Experience Web Developer - 09/2015 to 05/2019 Luna Web Design, New York • Cooperate with designers to create clean interfaces and simple, intuitive interactions and experiences. • Develop project concepts and maintain optimal workflow. • Work with senior developer to manage large, complex design projects for corporate clients. • Complete detailed programming and development tasks for front end public and internal websites as well as challenging back-end server code. • Carry out quality assurance tests to discover errors and optimize usability. Education Bachelor of Science: Computer Information Systems - 2014 Columbia University, NY Certifications PHP Framework (certificate): Zend, Codeigniter, Symfony. Programming Languages: JavaScript, HTML5, PHP OOP, CSS, SQL, MySQL. 
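
For table-like layouts such as the dates example in the question, the same separator column can also be applied row by row, so that each left-hand cell stays attached to the text on its right. The following is a minimal sketch under stated assumptions, not part of the original script: `split_rows` is a hypothetical helper, the sample text is made up for illustration, and a line with an empty left cell is assumed to be a wrapped continuation of the previous row.

```python
def split_rows(layout_text, separator_idx):
    """Split each non-empty line at separator_idx into a (left, right) pair."""
    rows = []
    for line in layout_text.split("\n"):
        left = line[:separator_idx].strip()
        right = line[separator_idx:].strip()
        if not left and not right:
            continue  # skip blank lines
        if not left and rows:
            # Assumption: no left cell means the right text wrapped
            # from the previous row, so append it there.
            prev_left, prev_right = rows[-1]
            rows[-1] = (prev_left, f"{prev_right} {right}".strip())
        else:
            rows.append((left, right))
    return rows


# Hypothetical layout-preserved text (made up for illustration)
sample = (
    "2019    Fellowship A, awarded for\n"
    "        a long-running project\n"
    "2018    Residency B\n"
    "2017    Grant C\n"
)

for year, description in split_rows(sample, separator_idx=4):
    print(f"{year}: {description}")
```

Each printed line pairs a year with its full description, for example `2019: Fellowship A, awarded for a long-running project`. The continuation rule is only a heuristic: it would mis-assign lines in a layout like the resume above, where the two columns are independent rather than row-related.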
    
    The next step is to generalize this script to multi-page documents, remove redundant symbols, and so on.
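
One possible way to handle several pages is sketched here, assuming every page ends with a marker line of the form seen in the sample output (`Page 1 End` surrounded by dashes). The helper names `split_pages`, `find_separator` and `split_columns` are hypothetical, the sample text is made up, and plain Python lists stand in for NumPy to keep the sketch dependency-free. Unlike the script above, it collapses every run of whitespace, not only runs of four or more.

```python
import re

# Assumption: each page ends with a line like
# "-----------------------Page 1 End-----------------------"
PAGE_END = re.compile(r"^[ \t]*-+Page \d+ End-+[ \t]*$", re.MULTILINE)


def split_pages(txt):
    """Return the text of each page, without the end-of-page marker lines."""
    return [page.strip("\n") for page in PAGE_END.split(txt) if page.strip()]


def find_separator(page):
    """Return the first character column that holds the most spaces."""
    lines = page.split("\n")
    width = max(len(line) for line in lines)
    counts = [0] * width
    for line in lines:
        for pos, char in enumerate(line.ljust(width)):
            if char == " ":
                counts[pos] += 1
    return counts.index(max(counts))


def split_columns(page):
    """Split one page into (left, right) text streams at its separator."""
    sep = find_separator(page)
    lines = page.split("\n")
    left = " ".join(line[:sep] for line in lines)
    right = " ".join(line[sep:] for line in lines)
    # Collapse all whitespace runs into single spaces
    return " ".join(left.split()), " ".join(right.split())


# Hypothetical two-page text (made up for illustration)
sample = (
    "Name      Summary\n"
    "Ann       Developer\n"
    "-----------------------Page 1 End-----------------------\n"
    "Skills    Python\n"
    "-----------------------Page 2 End-----------------------\n"
)

for number, page in enumerate(split_pages(sample), start=1):
    lhs, rhs = split_columns(page)
    print(f"page {number}: left={lhs!r} right={rhs!r}")
```

Computing the separator per page, rather than once for the whole document, is a deliberate choice here: the column gutter may sit at a different position on each page.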
    Good luck!

    Regarding python - Document layout analysis for text extraction, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66473977/
