I am trying to split a large XML file into smaller files. I started with BeautifulSoup:
from bs4 import BeautifulSoup
import os

# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'
index = 0
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(extension):
            print(file)
            file_name = os.path.join(root, file)
            with open(file_name) as f:
                data = f.read()
            texts = data.split('?xml version="1.0" encoding="UTF-8"?')
            for text in texts:
                index += 1
                filename = to_save + "\\" + str(index) + ".txt"
                with open(filename, 'w') as f:
                    f.write(text)
However, this raised a MemoryError. I then switched to xml.etree:
from xml.etree import ElementTree as ET
import re

file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'
with open(file_name) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
    # element is a whole element
    if element.tag == '?xml version="1.0" encoding="UTF-8"?':
        index += 1
        filename = to_save + "\\" + str(index) + ".txt"
        with open(filename, 'w') as f:
            f.write(ET.tostring(element))
    # do something with this element
    # then clean up
    element.clear()
I get the following error:
OverflowError: size does not fit in an int
I am on Windows. I know that on Linux you can split XML from the console, but I don't know how to do that in my case.
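Both attempts above call f.read() on the whole file, which is where the memory pressure comes from. One way to do the same declaration-based split on any operating system is to stream the file line by line. The following is a minimal sketch, not a tested solution for this particular data set. It assumes that every embedded document begins with an XML declaration at the start of a line and that the file is UTF-8. The function name split_concatenated_xml and its parameters are hypothetical.

```python
import os

XML_DECLARATION = "<?xml"


def split_concatenated_xml(source, out_dir, extension=".xml"):
    """Stream `source` line by line and start a new output file at each
    XML declaration. Returns the number of files written.

    Assumption: every embedded document starts with '<?xml' at the
    beginning of a line.
    """
    os.makedirs(out_dir, exist_ok=True)
    index = 0
    out = None
    try:
        with open(source, encoding="utf-8") as src:
            for line in src:
                starts_document = line.lstrip().startswith(XML_DECLARATION)
                if out is None and not starts_document and not line.strip():
                    continue  # skip blank lines before the first document
                if starts_document or out is None:
                    if out is not None:
                        out.close()
                    index += 1
                    path = os.path.join(out_dir, f"{index}{extension}")
                    out = open(path, "w", encoding="utf-8")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return index
```

Only one line is held in memory at a time, so the size of the input file does not matter. It could be called as split_concatenated_xml(file_name, to_save) with the paths from the question.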
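Best answer
If your XML cannot be loaded because of memory limits, you should consider using SAX.
With SAX, you read the document in small pieces and do whatever you need with each piece as it arrives (for example, save every N elements to a new file).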
Python SAX example 1.
Python SAX example 2.
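The "save every N elements to a new file" idea can be sketched with the standard library's xml.sax module. This is only an illustration under stated assumptions, not the method from the linked examples. It assumes the input is a single well-formed document whose repeated record elements share one tag name, and a file made of several concatenated documents, as the question's split on the XML declaration suggests, would need to be separated first. The names ChunkSplitter and split_xml, the record_tag parameter, and the chunk wrapper element are all hypothetical.

```python
import os
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ChunkSplitter(xml.sax.ContentHandler):
    """Copy every `chunk_size` record elements into a new output file."""

    def __init__(self, record_tag, chunk_size, out_dir):
        super().__init__()
        self.record_tag = record_tag
        self.chunk_size = chunk_size
        self.out_dir = out_dir
        self.depth = 0      # > 0 while inside a record element
        self.count = 0      # records written to the current file
        self.files = []     # paths of the files written so far
        self.out = None
        self.writer = None

    def _open_chunk(self):
        path = os.path.join(self.out_dir, f"{len(self.files) + 1}.xml")
        self.out = open(path, "w", encoding="utf-8")
        self.writer = XMLGenerator(self.out, encoding="utf-8")
        self.writer.startDocument()
        # Hypothetical wrapper element so each output has a single root.
        self.writer.startElement("chunk", {})
        self.files.append(path)

    def _close_chunk(self):
        self.writer.endElement("chunk")
        self.writer.endDocument()
        self.out.close()
        self.out = self.writer = None
        self.count = 0

    def startElement(self, name, attrs):
        if self.depth == 0 and name == self.record_tag:
            if self.writer is None:
                self._open_chunk()
            self.depth = 1
        elif self.depth > 0:
            self.depth += 1
        if self.depth > 0:
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.depth == 0:
            return  # element outside any record, e.g. the input's root
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            self.count += 1
            if self.count == self.chunk_size:
                self._close_chunk()

    def characters(self, content):
        if self.depth > 0:
            self.writer.characters(content)

    def endDocument(self):
        if self.writer is not None:
            self._close_chunk()  # flush a final, partially filled chunk


def split_xml(source, record_tag, chunk_size, out_dir):
    """Parse `source` as a stream and return the list of files written."""
    os.makedirs(out_dir, exist_ok=True)
    handler = ChunkSplitter(record_tag, chunk_size, out_dir)
    xml.sax.parse(source, handler)
    return handler.files
```

The parser hands the handler one event at a time, so memory use depends on the size of a single element rather than the whole file. A call might look like split_xml(file_name, "patent", 1000, to_save), where "patent" stands in for whatever the real record tag is.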
Regarding python - splitting a large XML file into multiple files with BeautifulSoup, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56461707/