c# - 在 C# 中过滤巨大列表的最高效方法？

我们有一个 ASP.NET MVC Web 应用程序，它通过 Entity Framework 连接到 SQL Server 数据库。该应用程序的主要任务之一是允许用户快速搜索和过滤包含存档值的庞大数据库表。

表结构相当简单:Timestamp(DateTime)、StationId(int)、DatapointId(int)、Value(double)。该表包含大约 10 到 1 亿行。我用覆盖索引等对数据库表进行了优化，但在按 DatapointId、StationId、Time 和 Skipping and Taking only 我想在页面上显示的部分进行过滤时，用户体验仍然很滞后。

所以我尝试了一种不同的方法:由于我们的服务器有很多 RAM，我认为我们可以简单地将整个存档表加载到 List<ArchiveRow> 中。当 Web 应用程序启动时，只需直接从该列表中获取数据，而不是往返于数据库。这工作得很好，将整个存档表(目前大约有 1000 万个条目)加载到列表中大约需要 9 秒。 ArchiveRow是一个看起来像这样的简单对象:

public class ArchiveResponse {
    public  int Length { get; set; }
    public int numShown { get; set; }
    public  int numFound { get; set; }
    public  int numTotal { get; set; }
    public  List<ArchiveRow> Rows { get; set; }
}

相应地:

public class ArchiveRow {
    public  int s { get; set; }
    public  int d { get; set; }
    public  DateTime t { get; set; }
    public  double v { get; set; }
}

当我现在尝试使用 Linq 查询从列表中获取所需数据时，查询数据库已经更快了，但是在按多个条件过滤时仍然很慢。例如，当我按 1 个 StationId 和 12 个 DatapointId 进行筛选时，检索 25 行的窗口大约需要 5 秒。我已经从 Where 过滤切换了使用连接，但我认为仍有改进的余地。有没有更好的方法来实现这种缓存机制，同时保持尽可能低的内存消耗？是否有其他更适合此目的的集合类型？

下面是从 ArchiveCache 列表中过滤和获取相关数据的代码:

// Total number of entries in archive cache
var numTotal = ArchiveCache.Count();

// Initial Linq query
ParallelQuery<ArchiveCacheValue> query = ArchiveCache.AsParallel();

// The request may contain StationIds that the user is interested in,
// so here's the filtering by StationIds with a join:
if (request.StationIds.Count > 0)
{
    query = from a in ArchiveCache.AsParallel()
             join b in request.StationIds.AsParallel()
             on a.StationId equals b
             select a;
}

// The request may contain DatapointIds that the user is interested in,
// so here's the filtering by DatapointIds with a join:
if (request.DatapointIds.Count > 0)
{
    query = from a in query.AsParallel()
             join b in request.DatapointIds.AsParallel()
             on a.DataPointId equals b
             select a;
}

// Number of matching entries after filtering and before windowing
int numFound = query.Count();

// Pagination: Select only the current window that needs to be shown on the page
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// Number of entries on the current page that will be shown
int numShown = result.Count();

// Build a response object, serialize it to Json and return to client
// Note: The projection with the Rows is not a bottleneck, it is only done to
// shorten 'StationId' to 's' etc. At this point there are only 25 to 50 rows,
// so that is no problem and happens in way less than 1 ms
ArchiveResponse myResponse = new ArchiveResponse();
myResponse.Length = request.Length;
myResponse.numShown = numShown;
myResponse.numFound = numFound;
myResponse.numTotal = numTotal;
myResponse.Rows = result.Select(x => new archRow() { s = x.StationId, d = x.DataPointId, t = x.DateValue, v = x.Value }).ToList();

return JsonSerializer.ToJsonString(myResponse);

更多细节:站的数量通常在 5 到 50 之间，很少超过 50。数据点的数量 <7000。 Web 应用程序设置为 64 位 <gcAllowVeryLargeObjects enabled="true" />在 web.config 中设置。

我真的很期待进一步的改进和建议。也许有一种完全不同的基于数组或类似方法的方法，在没有 linq 的情况下表现得更好？

最佳答案

您可以针对此特定查询类型调整存储。首先，从内存中的存档创建字典:

ArchiveCacheByDatapoint = ArchiveCache.GroupBy(c => c.DataPointId)
            .ToDictionary(c => c.Key, c => c.ToList());
ArchiveCacheByStation = ArchiveCache.GroupBy(c => c.StationId)
            .ToDictionary(c => c.Key, c => c.ToList());

然后在您的查询中使用这些词典:

bool hasStations = request.StationIds.Length > 0;
bool hasDatapoints = request.DatapointIds.Length > 0;            
int numFound = 0;
List<ArchiveCacheValue> result;
if (hasDatapoints && hasStations) {
    // special case - filter by both
    result = new List<ArchiveCacheValue>();
    // store station filter in hash set
    var stationsFilter = new HashSet<int>(request.StationIds);
    // first filter by datapoints, because you have more different datapoints than stations
    foreach (var datapointId in request.DatapointIds.OrderBy(c => c)) {                    
        foreach (var item in ArchiveCacheByDatapoint[datapointId]) {                        
            if (stationsFilter.Contains(item.StationId)) {
                // both datapoint and station matches filter - found item
                numFound++;
                if (numFound >= request.Start && result.Count < request.Length) {
                    // add to result list if matches paging criteria
                    result.Add(item);
                }
            }
        }
    }
}
else if (hasDatapoints) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();                
    foreach (var datapoint in request.DatapointIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByDatapoint[datapoint];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else if (hasStations) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var station in request.StationIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByStation[station];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else {
    // no need to do Count()
    numFound = ArchiveCache.Count;
    // no need to Skip\Take here really, ArchiveCache is list\array
    // so you can use indexes which will be faster
    result = ArchiveCache.Skip(request.Start).Take(request.Length).ToList();
}

// Number of entries on the current page that will be shown
int numShown = result.Count;

我已经对其进行了测量，在我的机器上，对于我尝试过的所有类型的查询(仅按部分、仅按数据点、按部分和数据点)，1 亿个项目的运行时间为 1 毫秒(有时高达 10 毫秒)。

关于c# - 在 C# 中过滤巨大列表的最高效方法？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/47348923/

c# - 在 C# 中过滤巨大列表的最高效方法？

上一篇：c# - 算术运算在 VB6 和 C# 中不返回相同值

下一篇：C# 从工厂返回通用接口(interface)