c# - Optimizing a LINQ routine

Tags: c# sql-server multithreading linq entity-framework

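I run a build system. A simplified description of the data model: I have configurations, and each configuration has 0..n builds. Builds produce artifacts, some of which are stored on the server. What I am doing is writing a kind of rule that sums all the bytes produced by each configuration's builds and checks whether they are too many.

The code for the routine at the moment is the following: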

private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations)
{
    var sw = new Stopwatch();
    sw.Start();
    // Let's take only confs that have been updated within the last 7 days
    var items = allConfigurations.AsParallel().Where(x =>
        x.artifact_cleanup_type != null && x.build_cleanup_type != null &&
        x.updated_date > DateTime.UtcNow.AddDays(-7)
        ).ToList();

    using (var ctx = new LocalEntities())
    {
        Debug.WriteLine("Context: " + sw.Elapsed);
        var allBuilds = ctx.Builds;
        var ruleResult = new List<Notification>();
        foreach (var configuration in items)
        {
            // all builds for current configuration
            var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
                .OrderByDescending(z => z.build_date);
            Debug.WriteLine("Filter conf builds: " + sw.Elapsed);

            // Since I don't know which builds/artifacts have been cleaned up, calculate it manually
            if (configuration.build_cleanup_count != null)
            {
                var buildCleanupCount = "30"; // default
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays"))
                {
                    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount));
                    configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate)
                            .OrderByDescending(z => z.build_date);
                }
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount"))
                {
                    var buildLastCleanupCount = int.Parse(buildCleanupCount);
                    configurationBuilds =
                        configurationBuilds.Take(buildLastCleanupCount).OrderByDescending(z => z.build_date);
                }
            }

            if (configuration.artifact_cleanup_count != null)
            {
                // skipped, similar to previous block
            }

            Debug.WriteLine("Done cleanup: " + sw.Elapsed);
            const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
            // Sum all disc usage per configuration
            var confDiscSizePerConfiguration = configurationBuilds
                .GroupBy(c => new {c.configuration_id})
                .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
                .Select(groupedBuilds =>
                    new
                    {
                        configurationId = groupedBuilds.FirstOrDefault().configuration_id,
                        configurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                        Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                        Average = groupedBuilds.Average(c => c.artifact_dir_size)
                    }).ToList();
            Debug.WriteLine("Done db query: " + sw.Elapsed);

            ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
            {
                ConfigurationId = iter.configurationId,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int) RulesEnum.TooMuchDisc,
                ConfigrationPath = iter.configurationPath
            }));
            Debug.WriteLine("Finished loop: " + sw.Elapsed);
        }
        // find owners and insert...
    }
}

This does exactly what I want, but I am wondering whether I could make it any faster. Currently I see:

Context: 00:00:00.0609067
// first round
Filter conf builds: 00:00:00.0636291
Done cleanup: 00:00:00.0644505
Done db query: 00:00:00.3050122
Finished loop: 00:00:00.3062711
// avg round
Filter conf builds: 00:00:00.0001707
Done cleanup: 00:00:00.0006343
Done db query: 00:00:00.0760567
Finished loop: 00:00:00.0773370

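The SQL that .ToList() generates looks very messy. (Everything used in the WHERE clauses is covered by indexes in the database.)

I am testing with 200 configurations, so this adds up to 00:00:18.6326722. In total I have about 8k items to process every day (so the whole routine takes more than 10 minutes to complete).

I have been randomly googling around, and it seems to me that Entity Framework does not handle parallel processing very well. Knowing that, I still decided to give the async/await approach a try (first time trying it, so sorry for any nonsense).

Basically, if I move all the processing out of scope, like: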

foreach (var configuration in items)
    {

        var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);

        ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
        {
           ... skipped
    } 

and:

private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds)  
{
        var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
            .OrderByDescending(z => z.build_date);
        //..skipped
        var confDiscSizePerConfiguration = configurationBuilds
            .GroupBy(c => new {c.configuration_id})
            .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
            .Select(groupedBuilds =>
                new Tmp
                {
                    ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id,
                    ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                    Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                    Average = groupedBuilds.Average(c => c.artifact_dir_size)
                }).ToListAsync();
        return await confDiscSizePerConfiguration;
    }

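For some reason this drops the execution time for 200 items from 18 to 13 seconds. Anyway, as I understand it, since I am awaiting each .ToListAsync(), it is still processed sequentially, is that right?

So the "does not handle parallel processing well" claim kicks in when I replace foreach (var configuration in items) with Parallel.ForEach(items, async configuration => . Making this change results in: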

A second operation started on this context before a previous asynchronous operation completed. Use 'await' to ensure that any asynchronous operations have completed before calling another method on this context. Any instance members are not guaranteed to be thread safe.

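It was a bit confusing to me at first, since I await practically everywhere the compiler allows it, but possibly the data gets fed in too fast.

I tried to overcome that by being less greedy and added new ParallelOptions {MaxDegreeOfParallelism = 4} to that parallel loop. My naive assumption was that the default connection pool size is 100 and all I want to use is 4, which should be plenty. But it still fails.

I also tried creating new DbContexts inside the GetData method, but it still fails. If I remember correctly (I can't test it right now), I got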

Underlying connection failed to open

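What are the possibilities for making this routine faster?

Best answer

Before going parallel, it is worth optimizing the query itself. Here are some suggestions that might improve your times:

1) Use Key when working with GroupBy. This may solve the problem of the complex, nested SQL query, because you instruct LINQ to use the same keys defined in GROUP BY rather than create a sub-select.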

        var confDiscSizePerConfiguration = configurationBuilds
            .GroupBy(c => new { ConfigurationId = c.configuration_id, ConfigurationPath = c.configuration_path})
            .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
            .Select(groupedBuilds =>
                new
                {
                    configurationId = groupedBuilds.Key.ConfigurationId,
                    configurationPath = groupedBuilds.Key.ConfigurationPath,
                    Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                    Average = groupedBuilds.Average(c => c.artifact_dir_size)
                })
            .ToList();
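
To see what the Key-based projection returns, here is a minimal in-memory sketch that uses LINQ to Objects rather than Entity Framework. The `Build` record, the `OverLimit` helper and the sample sizes are assumptions made up for illustration; only the query shape mirrors the snippet above, and the sketch says nothing about the SQL that EF generates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity; only the columns used here.
public record Build(int configuration_id, string configuration_path, long artifact_dir_size);

public static class GroupByKeyDemo
{
    public const long MaxDiscAllocationPerConfiguration = 1000000000; // 1GB

    // Returns id, path, total and average for every configuration over the limit.
    public static List<(int Id, string Path, long Total, double Average)> OverLimit(IEnumerable<Build> builds) =>
        builds
            // Both values needed later are part of the key, so no FirstOrDefault() is required.
            .GroupBy(b => new { ConfigurationId = b.configuration_id, ConfigurationPath = b.configuration_path })
            .Where(g => g.Sum(b => b.artifact_dir_size) > MaxDiscAllocationPerConfiguration)
            .Select(g => (Id: g.Key.ConfigurationId,
                          Path: g.Key.ConfigurationPath,
                          Total: g.Sum(b => b.artifact_dir_size),
                          Average: g.Average(b => b.artifact_dir_size)))
            .ToList();

    public static void Main()
    {
        var builds = new[]
        {
            new Build(1, "root/a", 700000000),
            new Build(1, "root/a", 600000000),
            new Build(2, "root/b", 100000000),
        };
        foreach (var row in OverLimit(builds))
            Console.WriteLine($"{row.Id} {row.Path} total={row.Total} avg={row.Average}");
    }
}
```

Only configuration 1 exceeds the limit in this sample, so it is the only row reported.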

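2) It seems you have been bitten by the N+1 problem. In short: you execute one SQL query to get all the configurations and then N more queries to get the build information. That is about 8k small queries in total where 2 bigger queries would suffice. If memory usage is not a constraint, fetch all build data into memory and use ToLookup for fast lookups.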

var allBuilds = ctx.Builds.ToLookup(x=>x.configuration_id);

Later you can look up the builds with:

var configurationBuilds = allBuilds[configuration.configuration_id].OrderByDescending(z => z.build_date);
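
The lookup pattern can be sketched with in-memory data as follows. The `Build` record, the `TotalsFor` helper and the sample values are hypothetical; with Entity Framework the lookup would be built from `ctx.Builds` as shown above, which loads the table once instead of querying per configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity.
public record Build(int configuration_id, long artifact_dir_size);

public static class LookupDemo
{
    // Total artifact size per configuration id, building the lookup in a single pass.
    public static Dictionary<int, long> TotalsFor(IEnumerable<int> configurationIds, IEnumerable<Build> allBuildRows)
    {
        ILookup<int, Build> allBuilds = allBuildRows.ToLookup(b => b.configuration_id);

        // Indexing a lookup with a missing key yields an empty sequence, not an exception,
        // so configurations without builds simply sum to 0.
        return configurationIds.ToDictionary(
            id => id,
            id => allBuilds[id].Sum(b => b.artifact_dir_size));
    }

    public static void Main()
    {
        var builds = new[] { new Build(1, 10), new Build(1, 20), new Build(2, 5) };
        foreach (var pair in TotalsFor(new[] { 1, 2, 3 }, builds))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

The trade-off is memory: every build row is held at once, which is why the suggestion is conditional on memory not being a constraint.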

3) You call OrderBy on configurationBuilds multiple times. Filtering does not affect record order, so you can safely remove the extra calls to OrderBy:

...
configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate);
...
configurationBuilds = configurationBuilds.Take(buildLastCleanupCount);
...
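
For builds that are already in memory (LINQ to Objects), the ordering claim can be checked with a small sketch. The `Build` record, the `Retained` helper and the dates are assumptions for illustration, and the two filters are chained here only to show that neither disturbs the order; in the question they belong to separate cleanup types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity.
public record Build(int build_id, DateTime build_date);

public static class OrderOnceDemo
{
    // Sort once, then filter and truncate; Where and Take keep the existing order.
    public static List<int> Retained(IEnumerable<Build> builds, DateTime buildLastCleanupDate, int buildLastCleanupCount) =>
        builds
            .OrderByDescending(b => b.build_date)
            .Where(b => b.build_date > buildLastCleanupDate)
            .Take(buildLastCleanupCount)
            .Select(b => b.build_id)
            .ToList();

    public static void Main()
    {
        var day = new DateTime(2015, 8, 1, 0, 0, 0, DateTimeKind.Utc);
        var builds = new[]
        {
            new Build(1, day.AddDays(-40)),
            new Build(2, day.AddDays(-3)),
            new Build(3, day.AddDays(-1)),
            new Build(4, day.AddDays(-2)),
        };
        // Builds newer than 30 days, newest first, at most two of them.
        Console.WriteLine(string.Join(", ", Retained(builds, day.AddDays(-30), 2)));
    }
}
```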

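4) The GroupBy is pointless, because the builds are already filtered down to a single configuration.

Update:

I went one step further and wrote code that retrieves the same results as your code with a single request. It should be more performant and use less memory.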
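
With only one configuration's builds in hand, the total and average can be computed directly. This is a minimal in-memory sketch; the `Usage` helper and the sample sizes are assumptions for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DirectAggregateDemo
{
    // Returns total and average size when the total exceeds the limit, otherwise null.
    public static (long Total, double Average)? Usage(IEnumerable<long> artifactDirSizes, long limit)
    {
        var sizes = artifactDirSizes.ToList();
        if (sizes.Count == 0)
            return null; // Average() would throw on an empty sequence

        long total = sizes.Sum();
        if (total <= limit)
            return null;

        return (total, sizes.Average());
    }

    public static void Main()
    {
        const long maxDiscAllocationPerConfiguration = 1000000000; // 1GB
        var usage = Usage(new long[] { 700000000, 600000000 }, maxDiscAllocationPerConfiguration);
        Console.WriteLine(usage.HasValue
            ? $"total={usage.Value.Total} avg={usage.Value.Average}"
            : "within limit");
    }
}
```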

private void CalculateExtendedDiskUsage()
{
    using (var ctx = new LocalEntities())
    {
        // Thresholds as used in the question's code.
        const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
        const int buildCleanupCount = 30; // default
        var buildLastCleanupDate = DateTime.UtcNow.AddDays(-buildCleanupCount);

        var ruleResult = ctx.Configurations
            .Where(x => x.build_cleanup_count != null && 
                (
                    (x.build_cleanup_type == "ReserveBuildsByDays" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).Where(y => y.build_date > buildLastCleanupDate).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) ||
                    (x.build_cleanup_type == "ReserveBuildsByCount" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).OrderByDescending(y => y.build_date).Take(buildCleanupCount).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                )
            )
            .Select(x => new Notification
            {
                ConfigurationId = x.configuration_id,
                ConfigrationPath = x.configuration_path,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int)RulesEnum.TooMuchDisc,
            })
            .ToList();
    }
}
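
If per-configuration queries are kept and concurrency is still wanted, two details from the question matter: Parallel.ForEach does not wait for an async lambda to finish, and a single DbContext instance cannot run two operations at once. One possible shape, sketched here without Entity Framework, is to start one task per configuration, cap concurrency with a SemaphoreSlim, and give each task its own resources. `QueryTotalAsync` is a hypothetical placeholder for "create a context and run one query"; the delay and the returned numbers are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedParallelDemo
{
    // Hypothetical stand-in for opening a fresh context and awaiting one query on it.
    private static async Task<long> QueryTotalAsync(int configurationId)
    {
        await Task.Delay(10);           // simulates the database round trip
        return configurationId * 100L;  // made-up result
    }

    // Runs one query per configuration with at most maxDegreeOfParallelism in flight.
    public static async Task<long[]> RunAsync(IEnumerable<int> configurationIds, int maxDegreeOfParallelism)
    {
        using var gate = new SemaphoreSlim(maxDegreeOfParallelism);
        var tasks = configurationIds.Select(async id =>
        {
            await gate.WaitAsync();
            try { return await QueryTotalAsync(id); }
            finally { gate.Release(); }
        }).ToList();

        // Task.WhenAll returns the results in the same order as the input tasks.
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        var totals = await RunAsync(Enumerable.Range(1, 8), maxDegreeOfParallelism: 4);
        Console.WriteLine(string.Join(", ", totals));
    }
}
```

Whether this beats the single query above depends on the database and is not something the sketch measures.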

Regarding c# - Optimizing a LINQ routine, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31759896/
