C# .NET - Is there a simple way to query the same XML nodes across a collection of XML files inside a single ZIP file?

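I am trying to convert a piece of Python code to C#. The code takes a ZIP file full of XML files, runs a particular XPath query against each XML file, and returns the results. In Python it is very lightweight and looks like this (I realize the example below is not strictly XPath, but I wrote it a while ago!):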

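```python
import zipfile
import xml.etree.ElementTree as et  # assumption: "et" refers to ElementTree

with zipfile.ZipFile(fullFileName) as zf:
    # Names of all XML entries in the archive
    zfxml = [f for f in zf.namelist() if f.endswith('.xml')]
    for zfxmli in zfxml:
        with zf.open(zfxmli) as zff:
            zfft = et.parse(zff).getroot()
            zffts = zfft.findall('Widget')
            print([wgt.find('Description').text for wgt in zffts])
```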

The closest I have come in C# is:

```csharp
foreach (ZipArchiveEntry entry in archive.Entries)
{
    FileInfo fi = new FileInfo(entry.FullName);

    if (fi.Extension.Equals(".xml", StringComparison.OrdinalIgnoreCase))
    {
        using (Stream zipEntryStream = entry.Open())
        {
            XmlDocument xmlDoc = new XmlDocument();

            xmlDoc.Load(zipEntryStream);
            XmlNodeList wgtNodes = xmlDoc.SelectNodes("//Root/Widget");

            foreach (XmlNode tmp in wgtNodes)
            {
                zipListBox.Items.Add(tmp.SelectSingleNode("//Description"));
            }
        }
    }
}
```

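While this does work for smaller ZIP files, it uses far more memory than the Python implementation, and it runs out of memory if the ZIP file contains too many XML files. Is there another, more efficient way to achieve this?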

Best answer

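As explained in What is the best way to parse (big) XML in C# Code?, you can use XmlReader to stream through huge XML files with bounded memory consumption. However, XmlReader is somewhat tricky to use, because it is easy to read too little or too much if the XML is not exactly as expected. (Even insignificant whitespace can throw off an XmlReader algorithm.)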

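To reduce the chance of such errors, first introduce the following extension methods. ReadElements() iterates through all immediate child elements of the current element, and ReadDescendants() iterates through all of its descendants: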

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static partial class XmlReaderExtensions
{
    /// <summary>
    /// Read all immediate child elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element
    /// </summary>
    public static IEnumerable<XmlReader> ReadElements(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        var isEmpty = inReader.IsEmptyElement;
        inReader.Read();
        if (isEmpty)
            yield break;
        while (!inReader.EOF)
        {
            switch (inReader.NodeType)
            {
                case XmlNodeType.EndElement:
                    // Move the reader AFTER the end of the element
                    inReader.Read();
                    yield break;
                case XmlNodeType.Element:
                    {
                        if (inReader.LocalName == localName && inReader.NamespaceURI == namespaceURI)
                        {
                            using (var subReader = inReader.ReadSubtree())
                            {
                                subReader.MoveToContent();
                                yield return subReader;
                            }
                            // ReadSubtree() leaves the reader positioned ON the end of the element, so read that also.
                            inReader.Read();
                        }
                        else
                        {
                            // Skip() leaves the reader positioned AFTER the end of the element.
                            inReader.Skip();
                        }
                    }
                    break;
                default:
                    // Not an element: Text value, whitespace, comment.  Read it and move on.
                    inReader.Read();
                    break;
            }
        }
    }

    /// <summary>
    /// Read all descendant elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element
    /// </summary>
    public static IEnumerable<XmlReader> ReadDescendants(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        using (var reader = inReader.ReadSubtree())
        {
            while (reader.ReadToFollowing(localName, namespaceURI))
            {
                using (var subReader = inReader.ReadSubtree())
                {
                    subReader.MoveToContent();
                    yield return subReader;
                }
            }
        }
        // Move the reader AFTER the end of the element
        inReader.Read();
    }
}
```

With these, your Python algorithm can be reproduced as follows:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml;

var zipListBox = new List<string>();

using (var archive = ZipFile.Open(fullFileName, ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        if (Path.GetExtension(entry.Name).Equals(".xml", StringComparison.OrdinalIgnoreCase))
        {
            using (var zipEntryStream = entry.Open())
            using (var reader = XmlReader.Create(zipEntryStream))
            {
                // Move to the root element
                reader.MoveToContent();

                var query = reader
                    // Read all child elements <Widget>
                    .ReadElements("Widget", "")
                    // And extract the text content of their first child element <Description>
                    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));

                zipListBox.AddRange(query);
            }
        }
    }
}
```

Notes:

Your C# XPath queries do not match the original Python queries. Your original Python code does the following:
```
zfft = et.parse(zff).getroot()
```
This unconditionally gets the root element (docs).
```
zffts = zfft.findall('Widget')
```
This finds all immediate child elements named "Widget" (the recursive descent operator // is not used) (docs).
```
wgt.find('Description').text for wgt in zffts
```
This loops through the widgets and, for each one, finds the first child element named "Description" and gets its text (docs).

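By comparison, xmlDoc.SelectNodes("//Root/Widget") recursively descends the entire XML element hierarchy to find nodes named <Widget> nested inside a node named <Root>, which is probably not what you want. Similarly, tmp.SelectSingleNode("//Description") uses recursive descent to find a description node. Note that an XPath expression beginning with // is evaluated from the document root rather than from the current <Widget> (.//Description would search beneath the current node), so it can return a different result when the document contains more than one <Description> node.

Using XmlReader.ReadSubtree() ensures that the entire element is consumed, no more and no less.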
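
The difference between matching immediate children and recursive descent is easy to check on the Python side. The following is a minimal sketch, assuming `et` in the question is `xml.etree.ElementTree`; the sample document and the `<Group>` element are made up for illustration.

```python
import xml.etree.ElementTree as et

# Hypothetical sample: one <Widget> directly under the root, one nested deeper.
SAMPLE = """
<Root>
  <Widget><Description>top-level</Description></Widget>
  <Group>
    <Widget><Description>nested</Description></Widget>
  </Group>
</Root>
"""

root = et.fromstring(SAMPLE)

# findall('Widget') matches only immediate children of the root.
direct = [w.find('Description').text for w in root.findall('Widget')]

# './/Widget' descends recursively, so it also picks up the nested widget.
recursive = [w.find('Description').text for w in root.findall('.//Widget')]

print(direct)
print(recursive)
```

Only the second query sees the nested widget, which is the behavior that the // operator introduces in the C# version.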

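ReadElements() works well with LINQ to XML. For example, if you want to stream through the XML and get the id, description and name of each widget without loading them all into memory, you can do: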

```csharp
// Requires: using System.Linq; using System.Xml.Linq;
var query = reader
    .ReadElements("Widget", "")
    .Select(r => XElement.Load(r))
    .Select(e => new { Description = e.Element("Description")?.Value, Id = e.Attribute("id")?.Value, Name = e.Element("Name")?.Value });

foreach (var widget in query)
{
    Console.WriteLine("Id = {0}, Name = {1}, Description = {2}", widget.Id, widget.Name, widget.Description);
}
```

Here memory use will again be bounded, because only one XElement, corresponding to a single <Widget>, is referenced at any time.
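
For comparison, the same one-element-at-a-time idea can be sketched in Python with `xml.etree.ElementTree.iterparse`. This is only an illustrative sketch, not part of the C# solution: the helper names `build_sample_zip` and `stream_widgets` and the sample data are hypothetical, and unlike ReadElements() this version matches `<Widget>` at any depth.

```python
import io
import zipfile
import xml.etree.ElementTree as et


def build_sample_zip():
    """Build a small in-memory ZIP (hypothetical sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.xml", "<Root><Widget id='1'><Name>A</Name>"
                             "<Description>first</Description></Widget></Root>")
        zf.writestr("b.xml", "<Root><Widget id='2'><Name>B</Name>"
                             "<Description>second</Description></Widget></Root>")
        zf.writestr("notes.txt", "not xml")
    buf.seek(0)
    return buf


def stream_widgets(zip_file):
    """Yield (id, name, description) for each <Widget> in each XML entry."""
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue
            with zf.open(name) as fh:
                # The "end" event fires once an element is completely parsed.
                for _event, elem in et.iterparse(fh, events=("end",)):
                    if elem.tag == "Widget":
                        yield (elem.get("id"),
                               elem.findtext("Name"),
                               elem.findtext("Description"))
                        # Drop the widget's children and text; the emptied
                        # element itself stays attached to its parent.
                        elem.clear()


widgets = list(stream_widgets(build_sample_zip()))
print(widgets)
```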

Demo fiddle here.

Update

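How would your code change if the collection of <Widget> tags did not come directly off the XML root, but were instead contained in a single <Widgets> subtree of the root?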

You have a couple of options here. First, you can nest calls to ReadElements() by chaining LINQ statements together, using SelectMany to flatten the element hierarchy:

```csharp
var query = reader
    // Read all child elements <Widgets>
    .ReadElements("Widgets", "")
    // Read all child elements <Widget>
    .SelectMany(r => r.ReadElements("Widget", ""))
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));
```

Use this option if you only want to read <Widget> nodes at one specific XPath.

Alternatively, you can simply read all descendants named <Widget>, as follows:

```csharp
var query = reader
    // Read all descendant elements <Widget>
    .ReadDescendants("Widget", "")
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));
```

Use this option if you are interested in reading <Widget> nodes wherever they occur in the XML.
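
On the Python side, the two options correspond roughly to a path query versus a full-tree iteration. The sketch below assumes `xml.etree.ElementTree`; the sample document and the `<Archive>` element are made up for illustration.

```python
import xml.etree.ElementTree as et

# Hypothetical sample: widgets under <Widgets>, plus one elsewhere in the tree.
SAMPLE = """
<Root>
  <Widgets>
    <Widget><Description>first</Description></Widget>
    <Widget><Description>second</Description></Widget>
  </Widgets>
  <Archive>
    <Widget><Description>archived</Description></Widget>
  </Archive>
</Root>
"""

root = et.fromstring(SAMPLE)

# Option 1: only widgets at the specific path Root/Widgets/Widget.
by_path = [w.findtext('Description') for w in root.findall('Widgets/Widget')]

# Option 2: every <Widget> wherever it occurs, in document order.
anywhere = [w.findtext('Description') for w in root.iter('Widget')]

print(by_path)
print(anywhere)
```

The path query skips the widget under `<Archive>`, while the iteration returns all three.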

Demo fiddle #2 here.

This question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/58692275/
