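I need help with a machine learning project I am currently trying to build.

I receive a large number of invoices from many different suppliers, each with its own unique layout. I need to extract 3 key elements from each invoice. All 3 elements are located in the table / line items of every invoice.

The 3 elements are:

1: Tariff number (digits)
2: Quantity (always a number)
3: Total line amount (monetary value)

Please see the screenshot below, where I have marked these fields on a sample invoice.

I started this project with a template approach based on regular expressions. However, this did not scale at all, and I ended up with a huge number of different rules.

I am hoping that machine learning can help me here, or perhaps a hybrid solution?

Commonalities

In all of my invoices, despite the different layouts, every line item always contains one tariff number. This tariff number is always 8 digits long and is always formatted in one of the following ways: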

xxxxxxxx
xxxx.xxxx
xx.xx.xx.xx

(where "x" is a digit from 0 to 9).
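Because the three tariff formats are fixed, this part of the problem can be handled with a single regular expression regardless of layout. The following is a minimal sketch, not part of the original post; the names TARIFF_RE and find_tariff_numbers are hypothetical, and it assumes no other 8-digit number (such as an order number) appears on the same line.

```python
import re

# Matches the three layouts listed above: xxxxxxxx, xxxx.xxxx, xx.xx.xx.xx.
# The lookarounds reject matches that are part of a longer number,
# e.g. a 9-digit value or an amount such as 12345678.90.
TARIFF_RE = re.compile(
    r"(?<!\d)(?<!\d\.)"
    r"(?:\d{8}|\d{4}\.\d{4}|\d{2}\.\d{2}\.\d{2}\.\d{2})"
    r"(?!\.?\d)"
)

def find_tariff_numbers(text):
    """Return every tariff number found in text, normalized to 8 plain digits."""
    return [m.group(0).replace(".", "") for m in TARIFF_RE.finditer(text)]
```

For example, find_tariff_numbers("3 Cable 8544.4290 86.31 258.93") returns ['85444290']; the prices on the same line are not matched because they do not fit any of the three formats.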

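Further, as you can see on the invoice, each line has both a unit price and a total amount. The amount I need is always the highest amount on each line.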
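The "highest amount on the line" rule can also be expressed without knowing the column positions. The sketch below is an illustration under stated assumptions: amounts use "." as the decimal separator with exactly two decimals and an optional "," thousands separator. AMOUNT_RE and line_total are hypothetical names. Amounts are compared as Decimal values, because comparing the raw strings would be lexicographic ("99.00" sorts above "258.93").

```python
import re
from decimal import Decimal

# Assumed amount format: 258.93 or 1,258.93 (two decimals, optional thousands commas).
# The lookarounds skip dotted tariff numbers such as 8544.4290 or 76.10.90.90.
AMOUNT_RE = re.compile(
    r"(?<![\d.,])(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}(?!\.?\d)"
)

def line_total(line):
    """Return the largest monetary amount on the line, or None if there is none."""
    amounts = [
        Decimal(m.group(0).replace(",", ""))
        for m in AMOUNT_RE.finditer(line)
    ]
    return max(amounts) if amounts else None
```

Calling str() on the result gives back a plain string such as "258.93" for the output records.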

Output

For each invoice like the one above, I need the output for each line. This could, for example, look something like this:

{
    "line":"0",
    "tariff":"85444290",
    "quantity":"3",
    "amount":"258.93"
},
{
    "line":"1",
    "tariff":"85444290",
    "quantity":"4",
    "amount":"548.32"
},
{
    "line":"2",
    "tariff":"76109090",
    "quantity":"5",
    "amount":"412.30"
}

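Where to go from here?

I am not sure whether what I am trying to do falls under machine learning and, if so, under which category. Is it computer vision? NLP? Named entity recognition?

My initial thought was to:

Convert the invoice to text. (The invoices are all text-based PDFs, so I can use something like pdftotext to get the exact text values.)
Create custom named entities for quantity, tariff and amount.
Export the found entities.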
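The first step above can be sketched as follows. This is an assumption-laden illustration, not the poster's code: it assumes poppler's pdftotext binary is on PATH, and that its -layout mode separates table columns by at least two spaces, which is a heuristic that will not hold for every invoice. The names pdf_to_text and split_columns are hypothetical.

```python
import re
import subprocess

def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text.

    -layout preserves the physical column layout; "-" writes to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def split_columns(line):
    """Split one layout-preserved line into cells on runs of 2+ whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
```

Each line of pdf_to_text(path).splitlines() can then be passed to split_columns, so that multi-word descriptions stay in one cell while the numeric columns are separated.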

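However, I feel like I might be missing something.

Can anyone point me in the right direction?

Edit:

Please see below for some more examples of what an invoice table section can look like:

Sample invoice #2

Sample invoice #3

Edit 2:

Please see below the three sample images, without the borders / bounding boxes:

Image 1:

Image 2:

Image 3: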

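Best answer

Here is an attempt using OpenCV. The idea is:

Obtain a binary image. We load the image, enlarge it with imutils.resize to help obtain better OCR results (see Tesseract's guidance on improving quality), convert it to grayscale, then apply Otsu's threshold to obtain a binary (1-channel) image.
Remove the table grid lines. We create horizontal and vertical kernels to detect and erase the grid lines, then perform a morphological close to merge adjacent text contours into a single contour. The idea is to extract each row as a single ROI (region of interest) to pass to OCR.
Extract the row ROIs. We find contours, then sort them from top to bottom using imutils.contours.sort_contours. This ensures that we iterate through the rows in the correct order. From there, we iterate through the contours, extract each row ROI using NumPy slicing, run OCR on it with Pytesseract, and then parse the data.

Here is the visualization of each step:

Input image

Binary image

Morph close

Visualization of iterating through each row

Extracted row ROI

Output invoice data result: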

{'line': '0', 'tariff': '85444290', 'quantity': '3', 'amount': '258.93'}
{'line': '1', 'tariff': '85444290', 'quantity': '4', 'amount': '548.32'}
{'line': '2', 'tariff': '76109090', 'quantity': '5', 'amount': '412.30'}

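Unfortunately, I got mixed results when trying the 2nd and 3rd images. Because the invoice layouts all differ, this approach does not produce good results on the other images. It does show, however, that traditional image processing techniques can extract invoice information, provided the invoice layout is fixed.

Code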

import cv2
import numpy as np
import pytesseract
from imutils import contours
import imutils

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, enlarge, convert to grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=1000)
height, width = image.shape[:2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,50))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, 0, -1)

# Morph close to combine adjacent contours into a single contour
invoice_data = []
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (85,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)

# Find contours, sort from top-to-bottom
# Iterate through contours, extract row ROI, OCR, and parse data
cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
(cnts, _) = contours.sort_contours(cnts, method="top-to-bottom")

row = 0
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    ROI = image[y:y+h, 0:width]
    ROI = cv2.GaussianBlur(ROI, (3,3), 0)
    data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
    parsed = [word.lower() for word in data.split()] 
    if 'tariff' in parsed or 'number' in parsed:
        row_data = {}
        row_data['line'] = str(row)
        row_data['tariff'] = parsed[-1]
        row_data['quantity'] = parsed[2]
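        # Compare numerically: max() on the raw strings is lexicographic,
        # so '99.00' would beat '258.93'. Assumes both tokens parse as decimals.
        row_data['amount'] = max(parsed[10], parsed[11], key=float)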
        row += 1

        print(row_data)
        invoice_data.append(row_data)
        
        # Visualize row extraction
        '''
        mask = np.zeros(image.shape, dtype=np.uint8)
        cv2.rectangle(mask, (0, y), (width, y + h), (255,255,255), -1)
        display_row = cv2.bitwise_and(image, mask)

        cv2.imshow('ROI', ROI)
        cv2.imshow('display_row', display_row)
        cv2.waitKey(1000)
        '''
print(invoice_data)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.waitKey()

Source: a similar question on Stack Overflow about extracting text information from PDF files with different layouts using machine learning: https://stackoverflow.com/questions/60074110/
