I'm using a dictionary to hold a fairly large number (> 10^7) of items. Could splitting it into several separate dictionaries, each holding a partition of the data, improve lookup and/or insertion performance?
As an example, suppose we have a Dictionary<int,int>. We could replace it with:
var ds = new Dictionary<int,int> [256];
// ...
void Add (int key, int value) {
    // We can assume key is an evenly distributed hash
    ds[key & 0xFF].Add (key, value);
}

bool TryGetValue (int key, out int value) {
    // Lookup works the same way: pick the slice by the low byte
    return ds[key & 0xFF].TryGetValue (key, out value);
}
Of course this is something to benchmark, but I'm also interested in general advice for this situation. Surprisingly, I couldn't find a really similar question here.
I'm aware that there is a limit to how many items a single dictionary can hold. This question assumes that limit is not an issue; otherwise there is only one solution anyway.
Best answer
I've given this question some more thought. While insertion and lookup cost logarithmic time in many data structures, in a dictionary these costs are (amortized) assumed to be O(1).
In the logarithmic case, splitting off part of the work through manual indexing (an O(1) operation) can reduce the remaining work by shrinking the logarithm's argument. In effect, we would be implementing a dictionary on top of another structure.
Of course, this also means slicing should not make any significant difference when the underlying structure is itself already a dictionary. There are many ways to implement dictionaries, but as far as I know none of them benefits asymptotically from being made smaller: their average-case behavior (i.e. the handling of collisions) is constant in time and does not grow with size.
On the other hand, the manual work adds overhead. We would therefore expect such a sliced dictionary to perform worse.
To sanity-check this, I wrote a small test.
Console.WriteLine ("Times in seconds per 10m merged/sliced operations");
foreach (var init in new[] { "empty", "size", "spare" }) {
    for (int n = 10 * 1000 * 1000; n <= 40 * 1000 * 1000; n += 10 * 1000 * 1000) {
        for (int repeat = 0; repeat < 3; repeat++) {
            Stopwatch wmi, wml, wsi, wsl;
            {
                GC.Collect ();
                var r = new Random (0);
                Dictionary<int, object> d;
                if (init == "empty") {
                    d = new Dictionary<int, object> ();
                }
                else if (init == "size") {
                    d = new Dictionary<int, object> (n);
                }
                else {
                    d = new Dictionary<int, object> (2 * n);
                }
                wmi = Stopwatch.StartNew ();
                for (int i = 0; i < n; i++) {
                    var key = r.Next ();
                    d[key] = null;
                }
                wmi.Stop ();
                wml = Stopwatch.StartNew ();
                var dummy = false;
                for (int i = 0; i < n; i++) {
                    dummy ^= d.ContainsKey (i);
                }
                wml.Stop ();
            }
            {
                GC.Collect ();
                var r = new Random (0);
                var ds = new Dictionary<int, object>[256];
                for (int i = 0; i < 256; i++) {
                    if (init == "empty") {
                        ds[i] = new Dictionary<int, object> ();
                    }
                    else if (init == "size") {
                        ds[i] = new Dictionary<int, object> (n / 256);
                    }
                    else {
                        ds[i] = new Dictionary<int, object> (2 * n / 256);
                    }
                }
                wsi = Stopwatch.StartNew ();
                for (int i = 0; i < n; i++) {
                    var key = r.Next ();
                    var d = unchecked(ds[key & 0xFF]);
                    d[key] = null;
                }
                wsi.Stop ();
                wsl = Stopwatch.StartNew ();
                var dummy = false;
                for (int i = 0; i < n; i++) {
                    var d = unchecked(ds[i & 0xFF]);
                    dummy ^= d.ContainsKey (i);
                }
                wsl.Stop ();
            }
            string perM (Stopwatch w) => $"{w.Elapsed.TotalSeconds / n * 10 * 1000 * 1000,5:0.00}";
            Console.WriteLine ($"init = {init,-5}, n = {n,8};"
                + $" insert = {perM (wmi)}/{perM (wsi)},"
                + $" lookup = {perM (wml)}/{perM (wsl)}");
        }
    }
    Console.WriteLine ();
}
Each test is repeated three times. The dictionaries' initialization strategy is one of:
a) empty (let them handle their own growth)
b) initialized to size (capacity equal to the number of items)
c) initialized to twice the size
Note that the sliced dictionaries may end up with somewhat uneven distributions.
Output (lower numbers are better):
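As a side note, the evenness of that distribution is easy to sanity-check separately. The following is a standalone sketch (not part of the benchmark above) that counts how many of 10 million pseudo-random keys land in each of the 256 slices; with an evenly distributed hash, each slice should hold roughly n/256 keys:

```csharp
using System;
using System.Linq;

// Count how many keys land in each of the 256 slices.
var r = new Random (0);
var counts = new int[256];
for (int i = 0; i < 10_000_000; i++)
    counts[r.Next () & 0xFF]++;

// With Random keys the slices stay close to the mean of n/256 = 39062.5.
Console.WriteLine ($"min = {counts.Min ()}, max = {counts.Max ()}");
```

A real key distribution may be less kind to the low byte than `Random` is, which is one more reason the slicing only holds up under the "evenly distributed hash" assumption from the question.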
init = empty, n = 10000000; insert = 1.17/ 1.22, lookup = 0.42/ 0.53
init = empty, n = 10000000; insert = 1.13/ 1.24, lookup = 0.41/ 0.53
init = empty, n = 10000000; insert = 1.10/ 1.21, lookup = 0.41/ 0.53
init = empty, n = 20000000; insert = 1.19/ 1.29, lookup = 0.42/ 0.53
init = empty, n = 20000000; insert = 1.18/ 1.28, lookup = 0.42/ 0.54
init = empty, n = 20000000; insert = 1.18/ 1.28, lookup = 0.42/ 0.53
init = empty, n = 30000000; insert = 1.31/ 1.22, lookup = 0.34/ 0.66
init = empty, n = 30000000; insert = 1.35/ 1.23, lookup = 0.35/ 0.66
init = empty, n = 30000000; insert = 1.34/ 1.21, lookup = 0.35/ 0.66
init = empty, n = 40000000; insert = 1.26/ 1.20, lookup = 0.43/ 0.76
init = empty, n = 40000000; insert = 1.26/ 1.19, lookup = 0.43/ 0.76
init = empty, n = 40000000; insert = 1.25/ 1.21, lookup = 0.43/ 0.76
init = size , n = 10000000; insert = 0.82/ 0.89, lookup = 0.48/ 0.79
init = size , n = 10000000; insert = 0.80/ 0.90, lookup = 0.48/ 0.70
init = size , n = 10000000; insert = 0.80/ 0.88, lookup = 0.47/ 0.69
init = size , n = 20000000; insert = 0.84/ 0.91, lookup = 0.48/ 0.69
init = size , n = 20000000; insert = 0.84/ 0.88, lookup = 0.48/ 0.69
init = size , n = 20000000; insert = 0.84/ 0.85, lookup = 0.48/ 0.69
init = size , n = 30000000; insert = 0.88/ 0.90, lookup = 0.49/ 0.75
init = size , n = 30000000; insert = 0.93/ 0.96, lookup = 0.50/ 0.72
init = size , n = 30000000; insert = 0.88/ 0.90, lookup = 0.49/ 0.73
init = size , n = 40000000; insert = 0.85/ 0.90, lookup = 0.48/ 0.76
init = size , n = 40000000; insert = 0.86/ 0.98, lookup = 0.49/ 0.76
init = size , n = 40000000; insert = 0.86/ 0.94, lookup = 0.49/ 0.76
init = spare, n = 10000000; insert = 0.69/ 0.73, lookup = 0.29/ 0.49
init = spare, n = 10000000; insert = 0.70/ 0.71, lookup = 0.29/ 0.49
init = spare, n = 10000000; insert = 0.68/ 0.76, lookup = 0.28/ 0.49
init = spare, n = 20000000; insert = 0.70/ 0.78, lookup = 0.29/ 0.54
init = spare, n = 20000000; insert = 0.70/ 0.78, lookup = 0.29/ 0.53
init = spare, n = 20000000; insert = 0.70/ 0.76, lookup = 0.29/ 0.53
init = spare, n = 30000000; insert = 0.71/ 0.77, lookup = 0.29/ 0.50
init = spare, n = 30000000; insert = 0.73/ 0.78, lookup = 0.30/ 0.51
init = spare, n = 30000000; insert = 0.71/ 0.77, lookup = 0.29/ 0.51
init = spare, n = 40000000; insert = 0.72/ 0.80, lookup = 0.29/ 0.53
init = spare, n = 40000000; insert = 0.72/ 0.81, lookup = 0.29/ 0.53
init = spare, n = 40000000; insert = 0.72/ 0.81, lookup = 0.29/ 0.53
Consistently, neither insertion nor lookup is faster in the sliced dictionary. I believe this holds in most scenarios.
However, such a sliced dictionary may still have a use case in parallel operations: different parts of the dictionary can perform inserts, lookups, and so on in parallel on batches of data.
That holds whether or not the dictionary as a whole is used concurrently. But if it is, then slicing would (in a simple implementation) allow locking only the parts that are needed rather than everything. Note, however, that dictionaries designed specifically for concurrent operation, such as .NET's ConcurrentDictionary, do not suffer from this drawback in the first place.
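To illustrate that last point, here is a minimal sketch (the per-slice locking scheme and all names are my own, purely illustrative) of parallel batch insertion into a sliced dictionary with one lock per slice, alongside the same workload on the built-in ConcurrentDictionary:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

const int n = 1_000_000;

// Sliced dictionary: one plain Dictionary plus one lock object per slice.
var ds = new Dictionary<int, int>[256];
var locks = new object[256];
for (int i = 0; i < 256; i++) {
    ds[i] = new Dictionary<int, int> ();
    locks[i] = new object ();
}

Parallel.For (0, n, key => {
    int slice = key & 0xFF;
    lock (locks[slice]) {   // contends only with writers of the same slice
        ds[slice][key] = key;
    }
});

// The same workload on the purpose-built concurrent dictionary.
var cd = new ConcurrentDictionary<int, int> ();
Parallel.For (0, n, key => cd[key] = key);

int total = 0;
foreach (var d in ds) total += d.Count;
Console.WriteLine ($"sliced = {total}, concurrent = {cd.Count}");
// prints "sliced = 1000000, concurrent = 1000000"
```

Whether the hand-rolled slicing ever beats ConcurrentDictionary would again be a matter for benchmarking; the point is only that slicing makes such coarse per-partition locking possible at all.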
Regarding "c# - Is it wise to split one large dictionary into multiple parts?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57939220/