python - Convert hOCR to an HTML table

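I am looking for a tool, or an idea I could implement in Python, to convert an hOCR file (generated by tesseract inside my application) into an HTML table. The idea is to use the text position information in the hOCR file (provided in the bbox attribute) to build a table laid out according to those positions. Here is an example to illustrate:

I used this image from SlideShare.net as input to my tesseract-based application and got the following hOCR/XML file as output.

hOCR file: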

  <div class='ocr_page' id='page_2' title='image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 638 479">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 31 104 620 439">
     <span class='ocr_line' id='line_1' title="bbox 32 104 613 138"><span class='ocrx_word' id='word_1' title="bbox 32 105 119 131">done:</span> <span class='ocrx_word' id='word_2' title="bbox 132 104 262 138">working</span> <span class='ocrx_word' id='word_3' title="bbox 273 105 405 138">product,</span> <span class='ocrx_word' id='word_4' title="bbox 419 104 517 132">hotels</span> <span class='ocrx_word' id='word_5' title="bbox 528 104 613 132">listed</span> 
     </span>
     <span class='ocr_line' id='line_2' title="bbox 31 160 471 194"><span class='ocrx_word' id='word_6' title="bbox 31 164 62 187">to</span> <span class='ocrx_word' id='word_7' title="bbox 75 161 122 187">do:</span> <span class='ocrx_word' id='word_8' title="bbox 134 164 227 187">smart</span> <span class='ocrx_word' id='word_9' title="bbox 236 160 330 187">trafﬁc</span> <span class='ocrx_word' id='word_10' title="bbox 342 160 471 194">building</span> 
     </span>
     <span class='ocr_line' id='line_3' title="bbox 32 243 284 280"><span class='ocrx_word' id='word_11' title="bbox 32 243 128 280">seed</span> <span class='ocrx_word' id='word_12' title="bbox 148 243 284 280">round:</span> 
     </span>
     <span class='ocr_line' id='line_4' title="bbox 71 316 619 361"><span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span> <span class='ocrx_word' id='word_14' title="bbox 171 319 240 355">will</span> <span class='ocrx_word' id='word_15' title="bbox 260 321 384 356">invest</span> <span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span> 
     </span>
     <span class='ocr_line' id='line_5' title="bbox 75 392 620 439"><span class='ocrx_word' id='word_17' title="bbox 75 397 252 433">investor</span> <span class='ocrx_word' id='word_18' title="bbox 489 392 620 439">$120k</span> 
     </span>
    </p>
   </div>
  </div>

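What I need is to convert the hOCR file into an HTML table based on the position of the text. The expected table should look similar to this table.

The size and position of the table cells should reflect the information provided in the hOCR file.

Image source: slideshare.net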

Best answer

Take a look at this document. I believe it describes most (or all) of what you need. From the introduction:

This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define as set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text, However, we are not actually using a new XML for the representation; instead embed the representation in XHTML (or HTML) because XHTML and XHTML processing already define many aspects of OCR output representation that would otherwise need additional, separate and ad-hoc definitions.
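Because hOCR is embedded in (X)HTML, an ordinary HTML parser is enough to read the bbox values. The following is a minimal sketch using only Python's standard library, not a complete hOCR reader. The names `HocrWordParser`, `parse_hocr_words` and `SAMPLE_HOCR` are made up for this illustration, and the sketch assumes that every `ocrx_word` span sits inside an `ocr_line` span and contains no nested spans, as in the file above.

```python
import re
from html.parser import HTMLParser

# The title attribute carries "bbox x0 y0 x1 y1", possibly next to other properties.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect the words of each ocr_line together with their bounding boxes."""

    def __init__(self):
        super().__init__()
        self.lines = []      # one list per ocr_line, holding [text, bbox] entries
        self._bbox = None    # bbox of the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes and self.lines:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self.lines[-1].append(["", self._bbox])

    def handle_data(self, data):
        # Only text inside an ocrx_word span is kept.
        if self._bbox is not None:
            self.lines[-1][-1][0] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of lines, each a list of (text, (x0, y0, x1, y1)) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return [[(text, bbox) for text, bbox in line] for line in parser.lines]


# Shortened copy of line_4 from the hOCR file in the question.
SAMPLE_HOCR = """
<span class='ocr_line' id='line_4' title="bbox 71 316 619 361">
<span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span>
<span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span>
</span>
"""

if __name__ == "__main__":
    for line in parse_hocr_words(SAMPLE_HOCR):
        print(line)
```

Each word comes back with its pixel coordinates, which is the information needed to decide which table cell it belongs to.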

XML can also be converted to HTML using XSLT. In fact, there is also a project which plans to do just that.

In addition, this project (hocr-tools) may be helpful.

Finally, note that the FAQ of Tesseract mentions this:

With the configfile 'hocr' tesseract will produce xhtml output compliant with hocr specification
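For the table itself, one possible approach (a sketch under stated assumptions, not something taken from the linked projects) is to merge words that sit close together on a line into one cell, and then treat cells whose left edges roughly coincide as belonging to the same column. The function names and the `max_gap` and `tolerance` thresholds below are hypothetical and chosen for the sample slide; other images will likely need other values. The sketch only reproduces the column layout, not the pixel sizes of the cells.

```python
from html import escape


def group_cells(line, max_gap):
    """Merge neighbouring words into one cell while the horizontal gap stays small."""
    cells = []
    for text, (x0, y0, x1, y1) in sorted(line, key=lambda word: word[1][0]):
        if cells and x0 - cells[-1]["x1"] <= max_gap:
            cells[-1]["text"] += " " + text
            cells[-1]["x1"] = x1
        else:
            cells.append({"text": text, "x0": x0, "x1": x1})
    return cells


def column_anchors(rows, tolerance):
    """Pick one left-edge x value per column; nearby left edges share a column."""
    anchors = []
    for x0 in sorted(cell["x0"] for row in rows for cell in row):
        if not anchors or x0 - anchors[-1] > tolerance:
            anchors.append(x0)
    return anchors


def words_to_table(lines, max_gap=40, tolerance=30):
    """Build an HTML table from lines of (text, (x0, y0, x1, y1)) words."""
    rows = [group_cells(line, max_gap) for line in lines]
    anchors = column_anchors(rows, tolerance)
    html_rows = []
    for row in rows:
        texts = [""] * len(anchors)
        for cell in row:
            # Put the cell into the column whose anchor is closest to its left edge.
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - cell["x0"]))
            texts[col] = (texts[col] + " " + cell["text"]).strip()
        tds = "".join(f"<td>{escape(t)}</td>" for t in texts)
        html_rows.append(f"  <tr>{tds}</tr>")
    return "<table>\n" + "\n".join(html_rows) + "\n</table>"


# Words of line_3 to line_5 from the hOCR file in the question.
SAMPLE_LINES = [
    [("seed", (32, 243, 128, 280)), ("round:", (148, 243, 284, 280))],
    [("CEO", (71, 321, 156, 356)), ("will", (171, 319, 240, 355)),
     ("invest", (260, 321, 384, 356)), ("$30k", (517, 316, 619, 361))],
    [("investor", (75, 397, 252, 433)), ("$120k", (489, 392, 620, 439))],
]

if __name__ == "__main__":
    print(words_to_table(SAMPLE_LINES))
```

With these thresholds the three sample lines yield three columns: "seed round:" in the first, "CEO will invest" and "investor" in the second, and the amounts "$30k" and "$120k" in the third.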

Regarding "python - Convert hOCR to an HTML table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31029734/
