I am able to scrape data from a website, but I need to export it as XML.
To do this, I defined an item with a serializer, as follows:
class Person(scrapy.Item):
    Name = scrapy.Field(serializer=serialize_name)
    Location = scrapy.Field()
I also have an XmlExportPipeline like this:
from scrapy import signals
from scrapy.exporters import XmlItemExporter

class XmlExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # One output file per spider, opened in binary mode for the exporter
        file = open('%s_people.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, item_element='Person', root_element='People')
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        ...

    def process_item(self, person, spider):
        self.exporter.export_item(person)
        return person
This works and gives me an XML file like this:
<?xml version="1.0" encoding="utf-8"?>
<People><Person><Name>Bob</Name><Location>NYC</Location></Person></People>
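How can I add attributes to a tag? For example, if I want
<Person Age="25" Likes="Programming">
how would I do that?
Also, a quick follow-up: why is the output XML not formatted the way it is supposed to be? And can I wrap the values inside the tags in CDATA (I currently use a custom serializer to do this)?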
Best answer
The default implementation of XmlItemExporter does not allow this, because of this line (scrapy/exporters.py:173):
self.xg.startElement(name, {})
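As background, `self.xg` in Scrapy's exporter is an `xml.sax.saxutils.XMLGenerator` from the Python standard library. The following is a minimal standalone sketch of what a non-empty mapping passed to `startElement` does. It does not use Scrapy, and the element and attribute names are simply taken from the question.

```python
import io
from xml.sax.saxutils import XMLGenerator

out = io.StringIO()
xg = XMLGenerator(out, encoding="utf-8")

xg.startDocument()
# The second argument maps attribute names to (string) values.
xg.startElement("Person", {"Age": "25", "Likes": "Programming"})
# An empty mapping, as in Scrapy's default call, yields a bare tag.
xg.startElement("Name", {})
xg.characters("Bob")
xg.endElement("Name")
xg.endElement("Person")
xg.endDocument()

person_xml = out.getvalue()
print(person_xml)
```

The printed document contains `<Person Age="25" Likes="Programming">`, which is the shape the question asks for.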
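The second argument should contain the attributes of each new element. The workaround is therefore to implement your own XmlItemExporter subclass that supplies this argument.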
# Note: written against a Scrapy release that still imported `six`; on newer,
# Python 3-only releases this import may fail, in which case `str` can stand in
# for `six.text_type`.
from scrapy.exporters import six, is_listlike, XmlItemExporter

class AttrXmlItemExporter(XmlItemExporter):
    def _export_xml_field(self, name, serialized_value, depth):
        # Custom code: move keys starting with '_' into the attribute dict
        attrs = {}
        if isinstance(serialized_value, dict):
            serialized_value = serialized_value.copy()
            attr_keys = [k for k in serialized_value.keys() if k.startswith('_')]
            attrs = {k[1:]: serialized_value.pop(k) for k in attr_keys}

        # Default implementation (except for startElement call)
        self._beautify_indent(depth=depth)
        self.xg.startElement(name, attrs)
        if hasattr(serialized_value, 'items'):
            self._beautify_newline()
            for subname, value in serialized_value.items():
                self._export_xml_field(subname, value, depth=depth + 1)
            self._beautify_indent(depth=depth)
        elif is_listlike(serialized_value):
            self._beautify_newline()
            for value in serialized_value:
                self._export_xml_field('value', value, depth=depth + 1)
            self._beautify_indent(depth=depth)
        elif isinstance(serialized_value, six.text_type):
            self._xg_characters(serialized_value)
        else:
            self._xg_characters(str(serialized_value))
        self.xg.endElement(name)
        self._beautify_newline()
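In this example, whenever an item value is a dict (i.e. a nested dict), its keys that start with an underscore (_) are rendered as attributes of that element.
For example, the item: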
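The underscore convention can be tried in isolation. The sketch below pulls the same splitting logic out into a hypothetical helper, `split_attributes` (not part of Scrapy), so it runs without the exporter. The `votes` key is made up purely to show a non-attribute child.

```python
def split_attributes(value):
    """Split a dict into (attrs, children) using the leading-underscore rule."""
    children = dict(value)  # shallow copy so the caller's dict is untouched
    attr_keys = [k for k in children if k.startswith("_")]
    # Strip the leading underscore and move the entry into the attribute dict.
    attrs = {k[1:]: children.pop(k) for k in attr_keys}
    return attrs, children


attrs, children = split_attributes({"_rating": "4.5", "_max": "5", "votes": "120"})
print(attrs)     # keys that started with "_" (underscore removed)
print(children)  # everything else, exported as child elements
```

Here `attrs` ends up as the mapping handed to `startElement`, while `children` is what the default export logic continues to walk.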
yield {
    'name': 'Sample',
    'rating': {
        '_rating': '4.5',
        '_max': '5',
    },
}
will be rendered as the following XML:
<item>
  <name>Sample</name>
  <rating rating="4.5" max="5">
  </rating>
</item>
However, I have not found a way to make it a self-closing element. Note that all values marked as attributes must be strings.
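On both points, the standard library's `XMLGenerator` accepts a `short_empty_elements` flag (Python 3.2+) that emits `<tag/>` when nothing is written between the start and end of an element. The sketch below shows that flag, together with coercing attribute values to `str`, using only the standard library. The helper `render_empty_element` is hypothetical, and whether this can be wired into the exporter (for instance by replacing `self.xg` in a subclass) is an untested assumption.

```python
import io
from xml.sax.saxutils import XMLGenerator


def render_empty_element(name, attrs):
    """Render one element with attributes and no content, as a string."""
    out = io.StringIO()
    xg = XMLGenerator(out, encoding="utf-8", short_empty_elements=True)
    # XMLGenerator expects string attribute values, so coerce them here.
    xg.startElement(name, {key: str(value) for key, value in attrs.items()})
    xg.endElement(name)
    return out.getvalue()


rating_xml = render_empty_element("rating", {"rating": 4.5, "max": 5})
print(rating_xml)
```

Any character data written between the start and end calls, including indentation whitespace, would make the generator fall back to a separate closing tag.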
Regarding python - adding attributes to exported XML in Scrapy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46425930/