java - Converting a large XML file from ISO-8859-1 to UTF-8 using entities from an external DTD

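I have:

2.2 GiB of uncompressed XML in ISO-8859-1, starting with

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">

the corresponding DTD, which defines entities as follows:

<!ENTITY Aacute "&#193;" >

a computer that cannot fit the parsed XML into RAM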

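I want to

Import the XML into Apache Solr, which is already set up and working. Solr/Java (rightly) complains about too many expanded entities. I can raise the limit for the JVM by setting -DentityExpansionLimit=2000000, but I would have to edit the Importer to raise the limit with System::setProperty.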

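I tried

xmllint

xmllint --stream --loaddtd --encode utf8 --output dblp.utf8.xml dblp-2018-07-01.xml

Without --stream, the process is killed by the kernel while it tries to parse the structure into memory.

With --stream, it does not write the output file; I suspect it only validates the XML against the DTD.

Editing the XML, python

I could not figure out how to import the DTD with python and use it with the parser, so I put the entities into <!DOCTYPE dblp [ … ]> and then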

import xml.etree.ElementTree
f = open('dblp-2018-07-01.xml')
out = open('dblp.utf8.xml','wb')
xml.etree.ElementTree.parse(f).write(out, encoding='UTF-8')

This works for me, but it consumes about 11 GiB of memory, which is more than I would like.
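The internal-subset workaround described above can be reproduced on a tiny document. The sketch below is only an illustration: the sample record and the one-entity DOCTYPE are made up, and it relies on the standard library's xml.etree.ElementTree expanding entities that are declared inside the document itself. Like the snippet above, it builds the whole tree in memory, so it shows why the approach works, not how to make it scale.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the edited file: the entity is declared in the
# internal subset instead of the external DTD.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE dblp [ <!ENTITY Aacute "&#193;"> ]>\n'
    b'<dblp><author>&Aacute;lvaro N\xfa\xf1ez</author></dblp>'
)

def recode_to_utf8(xml_bytes):
    """Parse the document in memory and serialize it again as UTF-8 bytes."""
    root = ET.fromstring(xml_bytes)
    out = io.BytesIO()
    ET.ElementTree(root).write(out, encoding="UTF-8")
    return out.getvalue()

converted = recode_to_utf8(SAMPLE)
# The entity reference and the ISO-8859-1 bytes both come out as UTF-8 text.
print(converted)
```

The parser decodes the ISO-8859-1 bytes and expands the declared entity while reading, so the serializer only has to write ordinary characters in the requested encoding.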

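Details

I want others to be able to replicate what I am doing, so this is what I would like:

no manual editing of the XML to insert the entities
a script or compiled program that can convert the encoding
as little memory use as possible, preferably below 6 GiB
a bonus would be reading and writing gzipped files to save space, but that is not required.

I would prefer Java as the programming solution (so that I can merge the import process into Solr), but I am happy to take any other solution (I would like to avoid JavaScript).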

If you want to try the XML yourself, the files are located here:

http://dblp.org/xml/release/ (use the latest dtd).
http://dblp.org/faq/How+to+parse+dblp+xml.html (for more information).
http://dblp.org/faq/Under+what+license+is+the+data+from+dblp+released.html (for the license).

The gzipped file is about 430 MiB and expands to 2.2 GiB of XML.

Thanks!

Best answer

I found a solution myself. It is a bit slow (~11-12 minutes), but that is fine for me:

import javax.xml.stream.*;
import java.io.*;
import java.util.zip.*;

public class ConvertToUtf8 {

  public static void main(String[] args) {
    System.setProperty("entityExpansionLimit", "10000000");
    XMLInputFactory inputFactory = XMLInputFactory.newFactory();
    XMLOutputFactory outputFactory = XMLOutputFactory.newFactory();

    try (
        FileInputStream ifs = new FileInputStream("dblp-2018-08-01.xml.gz"); 
        GZIPInputStream gzIn = new GZIPInputStream(ifs);
        FileOutputStream ofs = new FileOutputStream("dblp_utf8.xml.gz");
        GZIPOutputStream gzOut = new GZIPOutputStream(ofs, true);
        ) 
    {
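      XMLEventReader inEvt = inputFactory.createXMLEventReader(gzIn);
      XMLEventWriter outEvt = outputFactory.createXMLEventWriter(gzOut, "UTF-8");
      // Copy every event from the reader to the writer.
      outEvt.add(inEvt);
      // Flush buffered output before try-with-resources closes the gzip stream.
      outEvt.close();
      inEvt.close();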
    } catch (IOException | XMLStreamException e) {
      e.printStackTrace();
    }
  }
}

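Using GZIP for input and output speeds up the process considerably (6x faster on my machine), because reading from disk otherwise holds up the rest of the system. If you want to replicate this, make sure the DTD is in your working directory, otherwise the entities will not be replaced. In that case Java inserts a comment into the XML stating that the DTD could not be found.

Based on @janbrohl's answer: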

#! python3
import re
import gzip
from lxml import etree

# read the DTD with the lxml parser
dtd = etree.DTD('dblp-2017-08-29.dtd')
# build a dict with it for lookup
replacements = {x.name: x.content for x in dtd.entities()}

# raw string, so that \w is not treated as an invalid string escape
entity_re = re.compile(r'&(\w+);')

def resolve_entity(m):
    """
    This will replace the defined entities with their expansions from the DTD:
    '&Ouml;' will be replaced with '&#214;'.
    The entities that are already escaped with '&#[0-9]+;' should not be expanded,
    Ex: if some of the escapes produced the character '<', the XML would no longer be well formed.

    If the matched entity is not in the replacements, use the match as default
    """
    return replacements.get(m.group(1),f'&{m.group(1)};')

def expand_line(line):
    return entity_re.sub(resolve_entity,line)

def recode_file(src,dst):
    with gzip.open(src,mode='rt', encoding='ISO-8859-1', newline='\n') as src_file:
        # discard first line with wrong encoding
        print('discard: ' + src_file.readline())  
        with gzip.open(dst, mode='wt', encoding='UTF-8', newline='\n') as dst_file:
            # replace with correct encoding statement
            dst_file.write('<?xml version="1.0"?>\n')  
            for line in src_file:
                dst_file.write(expand_line(line))

recode_file('dblp-2018-08-01.xml.gz','dblp_recode.xml.gz')

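I have imported the output generated by the regex replacement and it seems to work :D Admittedly, it is faster than the Java version, but I am still not sure whether the resulting XML is the same as the version that went through an actual parser. I will try that out.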
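One possible way to check whether two converted files carry the same data is to stream both through a parser and compare them record by record. The sketch below is an assumption about how such a check could look, not the comparison that was actually run: the helper names records and first_difference are made up, and it treats every direct child of the root element as one record.

```python
import io
import xml.etree.ElementTree as ET
from itertools import zip_longest

def records(source):
    """Yield each direct child of the root element as a normalized string."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if depth == 0:
                root = elem
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                elem.tail = None  # ignore whitespace between records
                yield ET.tostring(elem, encoding="unicode")
                root.clear()  # drop finished records so memory stays bounded

def first_difference(source_a, source_b):
    """Return the index of the first differing record, or None if all match."""
    pairs = zip_longest(records(source_a), records(source_b))
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index
    return None

if __name__ == "__main__":
    literal = io.BytesIO("<dblp><a>J\u00f6rg</a></dblp>".encode("UTF-8"))
    escaped = io.BytesIO(b"<dblp><a>J&#246;rg</a></dblp>")
    print(first_difference(literal, escaped))  # None
```

For the real files, the two arguments could be gzip.open('dblp_utf8.xml.gz') and gzip.open('dblp_recode.xml.gz'). Because the comparison happens after parsing, a literal character and its numeric character reference count as equal, which is the kind of equivalence that matters here.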

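Edit: After some experimenting, I found some edge cases that might modify the data. I will leave the python script here because it is fast. However, I prefer the version that uses an actual parser: it is easy to understand, uses only the standard library, and is easy to maintain. The edge case was my fault: I used python's dict the way I would use a C++ map. In C++, accessing replacements['val'] creates the entry, and replacements.at('val') throws. In python it is the other way round: replacements['val'] raises a KeyError, while replacements.get('val') does not; it returns None if no default is provided.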
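The lookup behaviour described above can be seen in a few lines. This is a minimal sketch with a made-up two-entry table standing in for the entities read from the DTD; it assumes that predefined references such as &amp; and &lt; are not in that table, so they must fall through unchanged.

```python
import re

# Hypothetical two-entry table standing in for the entities read from the DTD.
replacements = {"Ouml": "&#214;", "Aacute": "&#193;"}
entity_re = re.compile(r"&(\w+);")

def resolve_entity(match):
    # Fall back to the matched text itself, so references that are missing
    # from the table (here &amp; and &lt;) pass through unchanged.
    return replacements.get(match.group(1), match.group(0))

line = "<title>R&Ouml;SSLER &amp; Co &lt; 3</title>"
expanded = entity_re.sub(resolve_entity, line)
print(expanded)

# Lookup behaviour for a missing key:
print(replacements.get("amp"))  # None, because no default was given
try:
    replacements["amp"]
except KeyError as err:
    print("KeyError:", err)
```

Passing the original match as the default keeps the substitution from touching anything the table does not know about.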

I will leave this open for a while in case someone finds a faster solution. Edit: that is, a faster solution that uses an XML parser :D

Regarding "java - Converting a large XML file from ISO-8859-1 to UTF-8 using entities from an external DTD", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51781500/
