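I run a build system. A simplified description of the data: I have configurations, and each configuration has 0..n builds. Builds produce artifacts, some of which are stored on the server. I am writing a kind of rule that sums all the bytes produced by each configuration's builds and checks whether that total is too large.
The current code for the routine is as follows: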
private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations)
{
    var sw = new Stopwatch();
    sw.Start();
    // Let's take only confs that have been updated within last 7 days
    var items = allConfigurations.AsParallel().Where(x =>
        x.artifact_cleanup_type != null && x.build_cleanup_type != null &&
        x.updated_date > DateTime.UtcNow.AddDays(-7)
    ).ToList();
    using (var ctx = new LocalEntities())
    {
        Debug.WriteLine("Context: " + sw.Elapsed);
        var allBuilds = ctx.Builds;
        var ruleResult = new List<Notification>();
        foreach (var configuration in items)
        {
            // all builds for current configuration
            var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
                .OrderByDescending(z => z.build_date);
            Debug.WriteLine("Filter conf builds: " + sw.Elapsed);
            // Since I don't know which builds/artifacts have been cleaned up, calculate it manually
            if (configuration.build_cleanup_count != null)
            {
                var buildCleanupCount = "30"; // default
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays"))
                {
                    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount));
                    configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate)
                        .OrderByDescending(z => z.build_date);
                }
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount"))
                {
                    var buildLastCleanupCount = int.Parse(buildCleanupCount);
                    configurationBuilds =
                        configurationBuilds.Take(buildLastCleanupCount).OrderByDescending(z => z.build_date);
                }
            }
            if (configuration.artifact_cleanup_count != null)
            {
                // skipped, similar to previous block
            }
            Debug.WriteLine("Done cleanup: " + sw.Elapsed);
            const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
            // Sum all disc usage per configuration
            var confDiscSizePerConfiguration = configurationBuilds
                .GroupBy(c => new {c.configuration_id})
                .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
                .Select(groupedBuilds =>
                    new
                    {
                        configurationId = groupedBuilds.FirstOrDefault().configuration_id,
                        configurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                        Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                        Average = groupedBuilds.Average(c => c.artifact_dir_size)
                    }).ToList();
            Debug.WriteLine("Done db query: " + sw.Elapsed);
            ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
            {
                ConfigurationId = iter.configurationId,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int) RulesEnum.TooMuchDisc,
                ConfigrationPath = iter.configurationPath
            }));
            Debug.WriteLine("Finished loop: " + sw.Elapsed);
        }
        // find owners and insert...
    }
}
This does exactly what I want, but I am wondering whether I can make it any faster. Currently I see:
Context: 00:00:00.0609067
// first round
Filter conf builds: 00:00:00.0636291
Done cleanup: 00:00:00.0644505
Done db query: 00:00:00.3050122
Finished loop: 00:00:00.3062711
// avg round
Filter conf builds: 00:00:00.0001707
Done cleanup: 00:00:00.0006343
Done db query: 00:00:00.0760567
Finished loop: 00:00:00.0773370
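The SQL generated by .ToList() looks very messy. (Everything used in the WHERE clause is covered by an index in the database.)
I am testing with 200 configurations, so this adds up to 00:00:18.6326722. I have about 8k items in total that need to be processed every day (so the whole routine takes more than 10 minutes to complete).
I have been googling around, and it seems to me that Entity Framework is not very good at parallel processing. Knowing that, I still decided to give the async/await approach a try (it is my first attempt, so sorry for any nonsense).
Basically, if I move all the processing out of scope, like this: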
foreach (var configuration in items)
{
    var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);
    ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
    {
        ... skipped
    }
and:
private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds)
{
    var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
        .OrderByDescending(z => z.build_date);
    //..skipped
    var confDiscSizePerConfiguration = configurationBuilds
        .GroupBy(c => new {c.configuration_id})
        .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
        .Select(groupedBuilds =>
            new Tmp
            {
                ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id,
                ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                Average = groupedBuilds.Average(c => c.artifact_dir_size)
            }).ToListAsync();
    return await confDiscSizePerConfiguration;
}
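For some reason this drops the execution time for 200 items from 18 to 13 seconds. Anyway, as I understand it, since I am awaiting each .ToListAsync(), everything is still processed sequentially, is that correct?
So the "can't process in parallel" claim starts to show when I replace foreach (var configuration in items) with Parallel.ForEach(items, async configuration =>. Making this change results in: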
A second operation started on this context before a previous asynchronous operation completed. Use 'await' to ensure that any asynchronous operations have completed before calling another method on this context. Any instance members are not guaranteed to be thread safe.
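At first I was a bit confused, as I await in practically every place the compiler allows, but possibly the data gets seeded too fast.
I tried to overcome this by being less greedy and added new ParallelOptions {MaxDegreeOfParallelism = 4} to that parallel loop, the peasant assumption being that the default connection pool size is 100 and all I want to use is 4, which should be plenty. But it still fails.
I have also tried creating new DbContexts inside the GetData method, but it still fails. If I remember correctly (I can't test now), I got:
Underlying connection failed to open
What possibilities are there to make this routine go faster?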
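Best answer
Before going parallel, it is worth optimizing the query itself. Here are some suggestions that might improve your timings:
1) Use Key when working with GroupBy. This might solve the issue of the complex, nested SQL query, because you instruct LINQ to use the same keys that are defined in GROUP BY instead of creating a sub-select.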
var confDiscSizePerConfiguration = configurationBuilds
    .GroupBy(c => new { ConfigurationId = c.configuration_id, ConfigurationPath = c.configuration_path})
    .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
    .Select(groupedBuilds =>
        new
        {
            configurationId = groupedBuilds.Key.ConfigurationId,
            configurationPath = groupedBuilds.Key.ConfigurationPath,
            Total = groupedBuilds.Sum(c => c.artifact_dir_size),
            Average = groupedBuilds.Average(c => c.artifact_dir_size)
        })
    .ToList();
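To see the Key pattern in isolation, here is a minimal in-memory sketch using LINQ to Objects rather than Entity Framework, so it says nothing about the SQL that gets generated. The Build class is a hypothetical stand-in with only the columns used here, and the long type for artifact_dir_size is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity (assumed column types).
public class Build
{
    public int configuration_id { get; set; }
    public string configuration_path { get; set; }
    public long artifact_dir_size { get; set; }
}

public static class GroupByKeyDemo
{
    // Returns (id, path, total size) for every configuration over the limit.
    public static List<Tuple<int, string, long>> OverLimit(IEnumerable<Build> builds, long limit)
    {
        return builds
            // Both values go into the key, so they can be read back from g.Key
            // instead of calling FirstOrDefault() on the group.
            .GroupBy(b => new { b.configuration_id, b.configuration_path })
            .Where(g => g.Sum(b => b.artifact_dir_size) > limit)
            .Select(g => Tuple.Create(
                g.Key.configuration_id,
                g.Key.configuration_path,
                g.Sum(b => b.artifact_dir_size)))
            .ToList();
    }

    public static void Main()
    {
        var builds = new List<Build>
        {
            new Build { configuration_id = 1, configuration_path = "a", artifact_dir_size = 700 },
            new Build { configuration_id = 1, configuration_path = "a", artifact_dir_size = 800 },
            new Build { configuration_id = 2, configuration_path = "b", artifact_dir_size = 100 },
        };
        foreach (var r in OverLimit(builds, 1000))
            Console.WriteLine(r.Item1 + " " + r.Item2 + " " + r.Item3);
    }
}
```

Anonymous types compare by value, which is why two builds with the same id and path land in the same group.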
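2) It seems you have been bitten by the N+1 problem. In short: you execute one SQL query to get all configurations and then N further queries to get the build information. That adds up to about 8k small queries where 2 bigger queries would suffice. If memory use is not a constraint, fetch all the build data into memory and use ToLookup to optimize for fast lookups.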
var allBuilds = ctx.Builds.ToLookup(x=>x.configuration_id);
Later you can look up the builds with:
var configurationBuilds = allBuilds[configuration.configuration_id].OrderByDescending(z => z.build_date);
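As a rough illustration of how the lookup behaves, the sketch below builds it from an in-memory list instead of ctx.Builds. The Build class and the sample values are hypothetical; the point is only that the index is built once and each configuration is then served without another query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity (assumed column types).
public class Build
{
    public int configuration_id { get; set; }
    public DateTime build_date { get; set; }
    public long artifact_dir_size { get; set; }
}

public static class LookupDemo
{
    // Enumerates the source once and indexes the builds by configuration id.
    public static ILookup<int, Build> IndexByConfiguration(IEnumerable<Build> builds)
    {
        return builds.ToLookup(b => b.configuration_id);
    }

    // Total artifact size for one configuration, read from the in-memory index.
    public static long TotalSize(ILookup<int, Build> index, int configurationId)
    {
        // An id with no builds yields an empty sequence, not an exception.
        return index[configurationId].Sum(b => b.artifact_dir_size);
    }

    public static void Main()
    {
        var builds = new List<Build>
        {
            new Build { configuration_id = 1, build_date = new DateTime(2015, 8, 1), artifact_dir_size = 10 },
            new Build { configuration_id = 1, build_date = new DateTime(2015, 8, 2), artifact_dir_size = 20 },
            new Build { configuration_id = 2, build_date = new DateTime(2015, 8, 3), artifact_dir_size = 5 },
        };
        var index = IndexByConfiguration(builds);
        Console.WriteLine(TotalSize(index, 1));
        Console.WriteLine(TotalSize(index, 99));
    }
}
```

Because a missing key returns an empty sequence, configurations with zero builds need no special handling in the loop.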
3) You call OrderBy on configurationBuilds multiple times. Filtering does not affect the order of the records, so you can safely remove the extra calls to OrderBy:
...
configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate);
...
configurationBuilds = configurationBuilds.Take(buildLastCleanupCount);
...
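The order-preservation claim can be checked with a small in-memory sketch. It uses LINQ to Objects, where Where and Take keep the incoming order; for a database query the final order depends on the SQL the provider emits, so treat this as an illustration only. The Build class and the helper names KeepByDays and KeepByCount are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity (assumed column types).
public class Build
{
    public int build_id { get; set; }
    public DateTime build_date { get; set; }
}

public static class OrderDemo
{
    // Sort once, newest first.
    public static List<Build> NewestFirst(IEnumerable<Build> builds)
    {
        return builds.OrderByDescending(b => b.build_date).ToList();
    }

    // "ReserveBuildsByDays": keep builds newer than the cutoff, no re-sort.
    public static List<Build> KeepByDays(IEnumerable<Build> sorted, DateTime cutoff)
    {
        return sorted.Where(b => b.build_date > cutoff).ToList();
    }

    // "ReserveBuildsByCount": keep the first N of the sorted builds, no re-sort.
    public static List<Build> KeepByCount(IEnumerable<Build> sorted, int count)
    {
        return sorted.Take(count).ToList();
    }

    public static void Main()
    {
        var builds = new List<Build>
        {
            new Build { build_id = 1, build_date = new DateTime(2015, 7, 1) },
            new Build { build_id = 2, build_date = new DateTime(2015, 7, 20) },
            new Build { build_id = 3, build_date = new DateTime(2015, 7, 10) },
        };
        var sorted = NewestFirst(builds);
        var byDays = KeepByDays(sorted, new DateTime(2015, 7, 5));
        var byCount = KeepByCount(sorted, 2);
        Console.WriteLine(string.Join(",", byDays.Select(b => b.build_id)));
        Console.WriteLine(string.Join(",", byCount.Select(b => b.build_id)));
    }
}
```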
4) The GroupBy serves no purpose, because the builds have already been filtered down to a single configuration.
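Without the GroupBy, the check reduces to plain aggregates over the builds of one configuration. The following is a minimal sketch under the assumption that the builds are already in memory; the Build class, the Usage result type and the long size column are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Build entity (assumed column types).
public class Build
{
    public long artifact_dir_size { get; set; }
}

// Hypothetical result type for illustration.
public class Usage
{
    public long Total { get; set; }
    public double Average { get; set; }
    public bool OverLimit { get; set; }
}

public static class SingleConfigurationDemo
{
    // Aggregates the builds of ONE configuration directly, with no grouping.
    public static Usage Measure(IReadOnlyCollection<Build> configurationBuilds, long limit)
    {
        long total = configurationBuilds.Sum(b => b.artifact_dir_size);
        // Average() throws on an empty sequence, so guard the zero-build case.
        double average = configurationBuilds.Count == 0
            ? 0
            : configurationBuilds.Average(b => b.artifact_dir_size);
        return new Usage { Total = total, Average = average, OverLimit = total > limit };
    }

    public static void Main()
    {
        var builds = new List<Build>
        {
            new Build { artifact_dir_size = 600000000 },
            new Build { artifact_dir_size = 500000000 },
        };
        var usage = Measure(builds, 1000000000); // 1GB limit, as in the question
        Console.WriteLine(usage.Total + " " + usage.Average + " " + usage.OverLimit);
    }
}
```

The configuration id and path are already known from the configuration being processed, so nothing has to be read back from the builds.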
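UPDATE:
I took it one step further and wrote code that retrieves the same results as the code you provided, using a single request. It should perform better and use less memory.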
private void CalculateExtendedDiskUsage()
{
    // Values taken from the question's code
    const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
    const int buildCleanupCount = 30; // default
    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-buildCleanupCount);
    using (var ctx = new LocalEntities())
    {
        var ruleResult = ctx.Configurations
            .Where(x => x.build_cleanup_count != null &&
                (
                    (x.build_cleanup_type == "ReserveBuildsByDays" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).Where(y => y.build_date > buildLastCleanupDate).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) ||
                    (x.build_cleanup_type == "ReserveBuildsByCount" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).OrderByDescending(y => y.build_date).Take(buildCleanupCount).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                )
            )
            .Select(x => new Notification
            {
                ConfigurationId = x.configuration_id,
                ConfigrationPath = x.configuration_path,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int)RulesEnum.TooMuchDisc,
            })
            .ToList();
    }
}
Regarding "c# - Optimizing a LINQ routine", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31759896/