c# - 如何最大程度地减少大型数据集的运行时间(从 93,773 个对象列表中列出唯一对象)

我们正在从 EVE Online API 中提取大量 JSON 对象，并使用 Newtonsoft.Json.JsonConvert 将它们反序列化为 EveObjModel 对象。从那里我们想创建一个唯一对象列表，即每个 type_id 中最昂贵的。我也粘贴了下面的 dataContract。

问题:下面的这段代码可以处理较小的数据集，但不适用于较大的数据集。目前，我们正在运行它，它需要 50 多分钟(并且还在计算)。我们可以做些什么来将运行较大数据集所需的时间减少到可以承受的水平？

感谢您的宝贵时间。手指交叉。

    // The buyList contains about 93,000 objects. 
    public void CreateUniqueBuyList(List<EveObjModel> buyList)
    {

        List<EveObjModel> uniqueBuyList = new List<EveObjModel>();

        foreach (EveObjModel obj in buyList)
        {
            int duplicateCount = 0;

            for (int i = 0; i < uniqueBuyList.Count; i++)
            {
                if (uniqueBuyList[i].type_id == obj.type_id)
                       duplicateCount++;
            }

            if (duplicateCount == 1)
            {
                foreach (EveObjModel objinUnique in uniqueBuyList)
                {
                    if (obj.type_id == objinUnique.type_id && obj.price > objinUnique.price)
                    {
                        // instead of adding obj, the price is just changed to the price in the obj. 
                        objinUnique.price = obj.price;

                    }
                    else if (obj.type_id == objinUnique.type_id && obj.price == objinUnique.price)
                    {
                        //uniqueBuyList.RemoveAll(item => item.type_id == obj.type_id);

                    }
                    else 
                    {
                        // Hitting this  mean that there are other objects with same type and higher price OR its not the same type_id
                    }

                }
            }
            else if (duplicateCount > 1)
            {
                // shud not happn...
            }
            else
            {

                uniqueBuyList.Add(obj);
            }


            continue;
        }
        foreach (EveObjModel item in uniqueBuyList.OrderBy(item => item.type_id))
        {
            buyListtextField.Text += $"Eve Online Item! Type-ID is: {item.type_id}, Price is {item.price}\n";
        }
    }

这是我们的 EveObjModel 类

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Runtime.Serialization;
    using System.Text;
    using System.Threading.Tasks;

    namespace EveOnlineApp
    {
    [DataContract]
         public class EveObjModel
    {
    [DataMember]
    public bool is_buy_order { get; set; }

    [DataMember]
    public double price { get; set; }

    [DataMember]
    public int type_id { get; set; }

    }
}

最佳答案

这个过程很慢并不奇怪，因为您使用的算法(带有嵌套循环)至少具有二次 O(N*N) 时间复杂度，对于如此大的数据集来说确实很慢。

一种方法是使用 LINQ GroupBy 运算符，它在内部使用基于散列的查找，因此理论上具有 O(N) 时间复杂度。因此，您按 type_id 分组，并为每个组(具有相同键的元素列表)选择具有最大 price 的组:

var uniqueBuyList = buyList
    .GroupBy(e => e.type_id)
    .Select(g => g.OrderByDescending(e => e.price).First())
    .ToList();

当然，您不需要对列表进行排序以获取具有最高 price 的元素。更好的版本是为此使用 Aggregate 方法(基本上是 foreach 循环):

var uniqueBuyList = buyList
    .GroupBy(e => e.type_id)
    .Select(g => g.Aggregate((e1, e2) => e1.price > e2.price ? e1 : e2))
    .ToList();

另一种非基于 LINQ 的方法是按 type_id 升序、price 降序对输入列表进行排序。然后对排序后的列表进行一次循环，并获取每个 type_id 组的第一个元素(它将具有最大的 price):

var comparer = Comparer<EveObjModel>.Create((e1, e2) =>
{
    int result = e1.type_id.CompareTo(e2.type_id);
    if (result == 0) // e1.type_id == e2.type_id
        result = e2.price.CompareTo(e1.price); // e1, e2 exchanged to get descending order
    return result;
});
buyList.Sort(comparer);
var uniqueBuyList = new List<EveObjModel>();
EveObjModel last = null;
foreach (var item in buyList)
{
    if (last == null || last.type_id != item.type_id)
        uniqueBuyList.Add(item);
    last = item;
}

该算法的复杂度为 O(N*log(N))，因此它比基于散列的算法差(但比原始算法好得多)。好处是它使用更少的内存，并且生成的列表已经按 type_id 排序，因此您不需要使用 OrderBy。

关于c# - 如何最大程度地减少大型数据集的运行时间(从 93,773 个对象列表中列出唯一对象)，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/53495531/

c# - 如何最大程度地减少大型数据集的运行时间(从 93,773 个对象列表中列出唯一对象)

上一篇：python - 字符串替换的无错误版本

下一篇：algorithm - 有什么方法可以使用单个循环比较数组中的每个元素吗？