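java - Parsing data from a structured document with ambiguous labels

Tags: java regex algorithm parsing plaintext

I am trying to move legal documents from old SGML files into a database. I have had good luck using regular expressions in Java, but I have run into a small problem. The labels for each level of a document do not appear to be standardized from one document to the next. For example, the most common labeling is: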

(<numeric>)
    (<alpha>)
        (<ROMAN>)
            (<ALPHA>)

e.g. (1)(a)(I)(A)

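However, other documents use variations in which () may appear. My current algorithm has a hard-coded regex for each element at each level, but I need a way to set the label type for each level dynamically as I walk through the document.

Has anyone run into a problem like this? Does anyone have any suggestions?

Thanks in advance.

Edit:

Here are the regular expressions I use to parse the different items: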

Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)

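Here is some sample text. I misspoke earlier: although the ultimate source of the data is SGML, what I am actually parsing is slightly different. Apart from some style tags, it is more or less plain text.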

<tab><b>SECTION 5.</b>  In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:

<tab><b>13-5-142.  National instant criminal background check system - reporting.</b>
(1)  On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":

<tab>(b)  The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and

<tab>(3)  The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:

<tab>(b)  No less than three years before the date of the written request:

<tab>(II)  The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or

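Best answer

Your statement of the problem is vague, so the only possible answer is a general approach. I have dealt with converting imprecisely formatted documents like this.

The tool from computer science that can help is the state machine. It is appropriate if you can detect (for example, with a regex) that the format is changing to a new convention. Detecting that changes the state, which in this case amounts to the translator used for the current and subsequent chunks of text. That translator stays in effect until the next state change. Overall, the algorithm looks like this: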

translator = DEFAULT 
while (chunks of input remain) {
  chunk = GetNextChunkOfInput // a line, paragraph, etc.
  new_translator = ScanChunkForStateChange(chunk, translator)
  if (new_translator != null) translator = new_translator // found a state change!
  print(translator.Translate(chunk))  // use the translator on the chunk
}
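
The loop above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the TranslatorStateMachine class, the Translator interface, and the toy "heading" and "body" translators with their cue (a chunk starting with <tab><b>) are all hypothetical names invented here, not part of the original answer or of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

class TranslatorStateMachine {
    /** The "state": converts one chunk of input into output. (Hypothetical interface.) */
    interface Translator {
        String translate(String chunk);
    }

    /**
     * Runs the loop from the pseudocode. The detector returns a new translator
     * when the chunk signals a format change, or null to keep the current one.
     */
    static List<String> run(List<String> chunks, Translator initial,
                            BiFunction<String, Translator, Translator> detector) {
        List<String> out = new ArrayList<>();
        Translator translator = initial;
        for (String chunk : chunks) {
            Translator next = detector.apply(chunk, translator);
            if (next != null) {
                translator = next; // state change: applies to this chunk and later ones
            }
            out.add(translator.translate(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy translators and cue, invented for illustration only.
        Translator body = c -> "BODY|" + c.trim();
        Translator heading = c -> "HEAD|" + c.replaceAll("</?b>|<tab>", "").trim();
        BiFunction<String, Translator, Translator> detector = (chunk, current) -> {
            if (chunk.startsWith("<tab><b>")) return heading;
            if (current == heading) return body; // in this toy, a heading lasts one chunk
            return null;                         // no change detected
        };
        List<String> chunks = Arrays.asList(
            "<tab><b>SECTION 5.</b>",
            "(1)  On and after March 20, 2013,",
            "<tab>(b)  The name of each person");
        run(chunks, body, detector).forEach(System.out::println);
    }
}
```

Passing the detector in as a function keeps the loop itself fixed while the state-change rules are revised, which is where most of the iteration is likely to happen.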

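Within this framework, designing the translators and the state-change predicates is a tedious process. All you can do is try, inspect the output, and fix problems, repeating until you can no longer improve the results. At that point you have probably found the maximum structure in the input, so an algorithm that uses only pattern matching (without trying to model the semantics, for example with AI) will not get you any further.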
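
As one possible starting point for such a predicate, the sketch below assigns label kinds to nesting levels as they are first encountered, instead of hard-coding one regex per level. It is a guess at a workable heuristic, not the answerer's method: the LevelTracker class, its Kind enum, and the tie-breaking rule for ambiguous single letters such as "I" (continue the deepest open level that fits, otherwise treat it as Roman) are assumptions made for illustration. The Roman-numeral pattern is borrowed from the question's regexes.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class LevelTracker {
    enum Kind { NUMERIC, LOWER_ALPHA, UPPER_ROMAN, UPPER_ALPHA }

    // Kinds of the currently open levels, outermost first.
    private final List<Kind> open = new ArrayList<>();

    /** Every kind the text between the parentheses could be; "I" is both Roman and alphabetic. */
    static Set<Kind> candidates(String label) {
        Set<Kind> kinds = EnumSet.noneOf(Kind.class);
        String core = label.replaceFirst("\\.\\d+$", ""); // ignore suffixes such as "1.5"
        if (core.isEmpty()) return kinds;
        if (core.matches("\\d+")) kinds.add(Kind.NUMERIC);
        if (core.matches("[a-z]")) kinds.add(Kind.LOWER_ALPHA);
        if (core.matches("(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")) kinds.add(Kind.UPPER_ROMAN);
        if (core.matches("[A-Z]")) kinds.add(Kind.UPPER_ALPHA);
        return kinds;
    }

    /** Returns the 1-based nesting level for the label, or 0 if it is not a recognised label. */
    int levelOf(String label) {
        Set<Kind> kinds = candidates(label);
        if (kinds.isEmpty()) return 0;
        // Prefer the deepest open level whose kind fits: the label continues that level.
        for (int i = open.size() - 1; i >= 0; i--) {
            if (kinds.contains(open.get(i))) {
                open.subList(i + 1, open.size()).clear(); // close any deeper levels
                return i + 1;
            }
        }
        // No open level fits, so this kind becomes the next level down.
        open.add(kinds.iterator().next());
        return open.size();
    }

    public static void main(String[] args) {
        LevelTracker tracker = new LevelTracker();
        // Labels in the order they appear in the question's sample text.
        for (String label : new String[] {"1", "b", "3", "b", "II"}) {
            System.out.println("(" + label + ") -> level " + tracker.levelOf(label));
        }
    }
}
```

For the labels in the sample text, (1), (b), (3), (b), (II), this reports levels 1, 2, 1, 2, 3. Documents that nest the kinds in a different order would get a different mapping without any code change, but ambiguous letters will still need the try, inspect, and fix cycle described above.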

Regarding "java - Parsing data from a structured document with ambiguous labels", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46493146/
