python - Splitting a large XML file into multiple files with BeautifulSoup

I am trying to split a large XML file into smaller files. I started with BeautifulSoup:

from bs4 import BeautifulSoup
import os
# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'

index = 0
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(extension):
            print(file)
            file_name = os.path.join(root,file)
            # f.read() loads the entire file into memory at once
            with open(file_name) as f:
                data = f.read()
            texts = data.split('?xml version="1.0" encoding="UTF-8"?')
            for text in texts:
                index += 1
                filename = to_save + "\\"+ str(index) + ".txt"
                with open(filename, 'w') as f:
                    f.write(text)

However, I ran into a memory error. I then switched to xml.etree:

from xml.etree import ElementTree as ET
import re


file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'


with open(file_name) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
    # element is a whole element
    if element.tag == '?xml version="1.0" encoding="UTF-8"?':
        index += 1
        filename = to_save + "\\"+ str(index) + ".txt"
        with open(filename, 'w') as f:
            f.write(ET.tostring(element))
        # do something with this element
        # then clean up
        element.clear()

I get the following error:

OverflowError: size does not fit in an int

I am on Windows. I know that on Linux you can split an XML file from the console, but I don't know how to do that in my case.

Best answer

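If your XML cannot be loaded because of memory limits, you should consider using SAX.

With SAX, you read the document in small pieces and do whatever you need with each piece (for example, save every N elements to a new file).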
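
To make the "save every N elements to a new file" idea concrete, here is a minimal sketch built on the standard library's xml.sax and XMLGenerator. The names ChunkingHandler, record_tag, chunk_size and out_dir, as well as the wrapping <chunk> element, are hypothetical and chosen only for illustration. It assumes the input is a single well-formed XML document; a file made of several documents concatenated together, as the split on the XML declaration in the question suggests, would likely be rejected by the parser after the first document.

```python
import os
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ChunkingHandler(xml.sax.ContentHandler):
    """Write every `chunk_size` <record_tag> elements to a new file.

    All names here are illustrative, not part of any library API.
    """

    def __init__(self, record_tag, chunk_size, out_dir):
        super().__init__()
        self.record_tag = record_tag
        self.chunk_size = chunk_size
        self.out_dir = out_dir      # assumed to exist already
        self.depth = 0              # nesting depth inside the current record
        self.count = 0              # records seen so far
        self.out = None             # currently open output file
        self.writer = None          # XMLGenerator bound to self.out
        self.paths = []             # files written so far

    def _open_chunk(self):
        path = os.path.join(self.out_dir, f"{len(self.paths) + 1}.xml")
        self.out = open(path, "w", encoding="utf-8")
        self.writer = XMLGenerator(self.out, encoding="utf-8")
        self.writer.startDocument()
        self.writer.startElement("chunk", {})   # hypothetical wrapper root
        self.paths.append(path)

    def _close_chunk(self):
        if self.out is not None:
            self.writer.endElement("chunk")
            self.writer.endDocument()
            self.out.close()
            self.out = None
            self.writer = None

    def startElement(self, name, attrs):
        if self.depth == 0 and name == self.record_tag:
            # Start a new output file once the previous one is full.
            if self.count % self.chunk_size == 0:
                self._close_chunk()
                self._open_chunk()
            self.count += 1
            self.depth = 1
            self.writer.startElement(name, attrs)
        elif self.depth > 0:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.depth > 0:
            self.writer.endElement(name)
            self.depth -= 1

    def characters(self, content):
        # Only text inside a record is copied; everything else is dropped.
        if self.depth > 0:
            self.writer.characters(content)

    def endDocument(self):
        self._close_chunk()
```

It could be used as xml.sax.parse(file_name, ChunkingHandler("record", 1000, to_save)), where "record" stands in for whatever element name delimits one item in your data. Only the current record is held by the handler at any time, so memory use should not grow with the size of the input.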

Python SAX example 1.

Python SAX example 2.
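
If the input really is many XML documents concatenated into one file, a different and simpler technique than SAX may be enough: stream the file line by line and start a new output file at each XML declaration. This is plain text streaming, not XML parsing. The sketch below is an assumption-laden illustration: the function name split_concatenated_xml is hypothetical, and it assumes every "<?xml" declaration begins on its own line and that the file is UTF-8.

```python
import os

XML_DECL = "<?xml"  # marker that starts each embedded document


def split_concatenated_xml(src_path, out_dir, encoding="utf-8"):
    """Write each document found in src_path to its own numbered file.

    Returns the number of files written. Hypothetical helper.
    """
    os.makedirs(out_dir, exist_ok=True)
    index = 0
    out = None
    try:
        with open(src_path, encoding=encoding) as src:
            for line in src:  # only one line is held in memory at a time
                if line.lstrip().startswith(XML_DECL):
                    if out is not None:
                        out.close()
                    index += 1
                    out_path = os.path.join(out_dir, f"{index}.xml")
                    out = open(out_path, "w", encoding=encoding)
                if out is not None:  # skip anything before the first declaration
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return index
```

Called as split_concatenated_xml(file_name, to_save), this works the same way on Windows as on Linux, since it needs nothing outside the Python standard library.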

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/56461707/