c# - 使用 LINQ GroupBy 链生成具有聚合的多级层次结构

标签 c# linq

我碰巧这样做了,并想到询问有关 O(n)、顺序稳定性和所涉及的枚举器实例的底层保证。

我知道每个聚合的枚举器拐点的数量,例如计数和持续时间,根据子级到父级的实际分布而变化,但至少每个记录在每个聚合中只枚举一次不是至少是正确的?

在这个例子中,我们有 3 个聚合和 3 个层级要聚合,因此 O(9n) ~ O(n)。

LINQ GroupBy 问题:

  1. GroupBy 是否被记录为线性复杂性和稳定排序,或者它只是 Microsoft .NET 的实现方式?我没有特别怀疑它可能是非线性的,只是出于好奇而问。
  2. 似乎实例化的枚举数的数量应该总是与层次结构中父节点的数量成线性关系,对吗?
  3. 在同一聚合统计数据和级别期间,没有叶子或非叶子被枚举多次,对吗?因为每个节点只是某个单一“结果”中一个父节点的子节点。
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    namespace NestedGroupBy
    {
        [TestClass]
        public class UnitTest1
        {
            struct Record
            {
                // cat->subcat->event is the composite 3-tier key
                private readonly string category;
                private readonly string subcategory;
                private readonly string ev;
                private readonly double duration;

                public Record(string category, string subcategory, string ev, double duration)
                {
                    this.category = category;
                    this.subcategory = subcategory;
                    this.ev = ev;
                    this.duration = duration;
                }

                public string Category { get { return category; } }
                public string Subcategory { get { return subcategory; } }
                public string Event { get { return ev; } }
                public double Duration { get { return duration; } }
            }

            static readonly IList<Record> Rec1 = new System.Collections.ObjectModel.ReadOnlyCollection<Record>
            (new[] {
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "Security", "ReadSocket", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadDirectory", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadDirectory", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadDirectory", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadSize", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadSize", 0.0145),
                 new Record ("File", "ReadMetadata", "ReadSize", 0.0145),
                 new Record ("Registry", "ReadKey", "ReadKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "ReadKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "ReadKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "ReadKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "CacheKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "CacheKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "CheckKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "CheckKeyAcl", 0.0145),
                 new Record ("Registry", "ReadKey", "CheckKeyAcl", 0.0145),
                 new Record ("Registry", "WriteKey", "CheckPermissions", 0.0145),
                 new Record ("Registry", "WriteKey", "CheckOwner", 0.0145),
                 new Record ("Registry", "WriteKey", "InheritPermissions", 0.0145),
                 new Record ("Registry", "WriteKey", "ValidateKey", 0.0145),
                 new Record ("Registry", "WriteKey", "RecacheKey", 0.0145),
                 new Record ("File", "WriteData", "FlushData", 0.0145),
                 new Record ("File", "WriteData", "WriteBuffer", 0.0145),
                 new Record ("File", "WriteData", "WritePermissions", 0.0145),
                 new Record ("File", "ReadData", "CheckDataBuffer", 0.0145),
                 new Record ("File", "ReadData", "ReadBuffer", 0.0145),
                 new Record ("File", "ReadData", "ReadPermissions", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "Security", "ReadSocket", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
                 new Record ("Network", "Security", "ReadSocket", 0.0145),
                 new Record ("Network", "SecurityCheck", "ReadAcl", 0.0145),
            });

            [TestMethod]
            public void TestMethod1()
            {
                // Perform one big sort to order all child rungs properly
                var s = Rec1.OrderBy(
                    r => r,
                    Comparer<Record>.Create(
                        (x, y) =>
                        {
                            int c = x.Category.CompareTo(y.Category);
                            if (c != 0) return c;
                            c = x.Subcategory.CompareTo(y.Subcategory);
                            if (c != 0) return c;
                            return x.Event.CompareTo(y.Event);
                        }));

                // This query enumerates bottom-up (in the key hierarchy-sense), 
                // then proceedes to each higher summary (parent) level and retains
                // the "result" collection of its children determined by the preceding GroupBy.
                //
                // This is so each level can later step down into its own children for looping. 
                // And the leaf durations, immediate child counts and leaf event counts are already calculated as well.
                //
                // I think this is O(n), since each record does not get repeatedly scanned for different levels of the same accumulation stat.
                // But under-the-hood there may be much grainy processing like enumerator instantiation, depending on child count density.
                var q = s
                    .GroupBy(
                        r => new { Category = r.Category, Subcategory = r.Subcategory, Event = r.Event },
                        (key, result) => {
                            int c = result.Count();
                            return new
                            {
                                LowKey = key,
                                // at this lowest summary level only, 
                                // the hierarchical (immediate child) count is the same as the event (leaf) count
                                LowChildCount = c,
                                LowEventCount = c,
                                LowDuration = result.Sum(x => x.Duration),
                                LowChildren = result
                            };
                        })
                        .GroupBy(
                            r => new { Category = r.LowKey.Category, Subcategory = r.LowKey.Subcategory },
                                (key, result) => new {
                                    MidKey = key,
                                    MidChildCount = result.Count(),
                                    MidEventCount = result.Sum(x => x.LowEventCount),
                                    MidDuration = result.Sum(x => x.LowDuration),
                                    MidChildren = result
                                })
                                .GroupBy(
                                    r => new { Category = r.MidKey.Category },
                                    (key, result) => new {
                                        HighKey = key,
                                        HighChildCount = result.Count(),
                                        HighEventCount = result.Sum(x => x.MidEventCount),
                                        HighDuration = result.Sum(x => x.MidDuration),
                                        HighChildren = result
                                    });


                foreach (var high in q)
                {
                    Console.WriteLine($"{high.HighKey.Category} child#={high.HighChildCount} event#={high.HighEventCount} duration={high.HighDuration}");
                    foreach (var mid in high.HighChildren)
                    {
                        Console.WriteLine($"  {mid.MidKey.Subcategory} child#={mid.MidChildCount} event#={high.HighEventCount} duration={mid.MidDuration}");
                        foreach (var low in mid.MidChildren)
                        {
                            Console.WriteLine($"    {low.LowKey.Event} child#={low.LowChildCount} event#={high.HighEventCount} duration={low.LowDuration}");
                            foreach (var leaf in low.LowChildren)
                            {
                                Console.WriteLine($"      >> {leaf.Category}/{leaf.Subcategory}/{leaf.Event} duration={leaf.Duration}");
                            }
                        }
                    }
                }
            }
        }
    }

最佳答案

让我们从回顾 Enumerable.GroupBy 的实现开始来自 MS 源代码:

分组

public static IEnumerable<TResult> GroupBy<TSource, TKey, TElement, TResult>
(this IEnumerable<TSource> source, 
 Func<TSource, TKey> keySelector, 
 Func<TSource, TElement> elementSelector, 
 Func<TKey, IEnumerable<TElement>, TResult> resultSelector)
{
  return new GroupedEnumerable<TSource, TKey, TElement, TResult>(source, 
                                                                 keySelector, 
                                                                 elementSelector, 
                                                                 resultSelector, null);
}

GroupedEnumerable

internal class GroupedEnumerable<TSource, TKey, TElement, TResult> : IEnumerable<TResult>{
    IEnumerable<TSource> source;
    Func<TSource, TKey> keySelector;
    Func<TSource, TElement> elementSelector;
    IEqualityComparer<TKey> comparer;
    Func<TKey, IEnumerable<TElement>, TResult> resultSelector;

    public GroupedEnumerable(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector, Func<TKey, IEnumerable<TElement>, TResult> resultSelector, IEqualityComparer<TKey> comparer){
        if (source == null) throw Error.ArgumentNull("source");
        if (keySelector == null) throw Error.ArgumentNull("keySelector");
        if (elementSelector == null) throw Error.ArgumentNull("elementSelector");
        if (resultSelector == null) throw Error.ArgumentNull("resultSelector");
        this.source = source;
        this.keySelector = keySelector;
        this.elementSelector = elementSelector;
        this.comparer = comparer;
        this.resultSelector = resultSelector;
    }

    public IEnumerator<TResult> GetEnumerator(){
        Lookup<TKey, TElement> lookup = Lookup<TKey, TElement>.Create<TSource>(source, keySelector, elementSelector, comparer);
        return lookup.ApplyResultSelector(resultSelector).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator(){
        return GetEnumerator();
    }
}

重要的一点是它在内部使用 LookUp数据结构,它提供了 O(1)查找 key ,因此它在内部枚举所有记录并为每条记录添加数据到 LookUp , 类型为 IEnumerable<IGrouping<TKey,TValue>>

现在您的具体问题

Is GroupBy documented to be both linear complexity and stable ordering, or is it just how Microsoft .NET's impl does it? I don't have any particular suspicion it may be non-linear, just asking out of curiosity.

设计和代码建议在顶层使用 O(N),但它肯定取决于元素选择,如果进一步执行 O(N)操作,这将自动使复杂性 O(N^2)或更多

It seems like the number of enumerators instantiated should always be linear with the number of parent nodes seen in the hierarchy, true?

是的,它只是顶层的一个枚举器

No leaf or non-leaf is enumerated more than once during the same aggregation stat and level, true? Because each node is only the child of one parent in some single "result".

对于非叶节点,我假设您指的是嵌套数据,因此这取决于您的模型设计,但是如果每个元素都按预期具有其单独的嵌套数据副本,而不是共享/引用,那么我不会看为什么一个元素会被多次触及,是针对具体子/子数据的枚举

更多细节

  • 在您的例子中,您正在编写 GroupBy查询,因此一个查询的结果充当另一个查询的输入,就复杂性而言,这更像是 O(N) + O(N) + O(N) ~ O(N) ,但是在它们中的每一个中,您至少要执行 1 O(N) 操作,因此嵌套设计的整体复杂性应为 O(N^2)不是 O(N)
  • 如果您有深层嵌套数据,则改为连接 GroupBy , SelectMany ,将数据展平是更好的选择,因为在展平数据上运行最终聚合不太复杂

关于c# - 使用 LINQ GroupBy 链生成具有聚合的多级层次结构,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56205152/

相关文章:

javascript - 创建一个浏览器以在 ASP.NET 网页中加载网站

c# - Linq - 按ID从数据库中选择记录

c# - 在 SQL 中实现 All 功能

c# - 从 Control 继承的类不显示在窗体上

c# - 在 C# 中合并两个数据表

c# - 如何声明新类型

linq - 如何使用Linq将元素插入XML?

c# - 从其orderby子句中清除IQueryable

c# - GroupBy 根据条件返回 Count 和 Value

c# - 查找 VSTO 日志