我有一个关于并行 for 循环的问题。我有以下代码:
public static void MultiplicateArray(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
}
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
}
}
现在我尝试添加并行函数:
public static void MultiplicateArray(double[] array, double factor)
{
Parallel.For(0, array.Length, i =>
{
array[i] = array[i] * factor;
});
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
});
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
});
}
问题是,我想节省时间,而不是浪费时间。使用标准 for 循环,它计算大约 2 分钟,但使用并行 for 循环需要 3 分钟。为什么?
最佳答案
Parallel.For()
可以通过并行化您的代码大大提高性能,但它也有开销(线程之间的同步,在每次迭代时调用委托(delegate))。由于在您的代码中,每次迭代都非常短(基本上,只有几条 CPU 指令),这种开销会变得很突出。
因此,我认为使用 Parallel.For()
不是适合您的解决方案。相反,如果您手动并行化您的代码(在本例中这非常简单),您可能会看到性能提高。
为了验证这一点,我进行了一些测量:我在一个包含 200 000 000 个项目的数组上运行了 MultiplicateArray()
的不同实现(我使用的代码如下)。在我的机器上,串行版本始终需要 0.21 秒,而 Parallel.For()
通常需要大约 0.45 秒,但有时会飙升至 8-9 秒!
首先,我将尝试改进常见情况,稍后再讨论这些峰值。我们想用 N 个 CPU 来处理这个数组,所以我们把它分成 N 个大小相等的部分,然后分别处理每个部分。结果? 0.35 秒。这仍然比串行版本差。但是 for
遍历数组中的每个项目是最优化的结构之一。我们不能做点什么来帮助编译器吗?提取计算循环的界限可能会有所帮助。事实证明确实如此:0.18 秒。这比串行版本好,但也好不了多少。而且,有趣的是,在我的 4 核机器(没有超线程)上将并行度从 4 更改为 2 不会改变结果:仍然是 0.18 秒。这让我得出结论,CPU 不是这里的瓶颈,内存带宽才是。
现在,回到峰值:我的自定义并行化没有它们,但是 Parallel.For()
有,为什么? Parallel.For()
确实使用范围分区,这意味着每个线程处理自己的数组部分。但是,如果一个线程提前结束,它将尝试帮助处理另一个尚未结束的线程的范围。如果发生这种情况,您将获得大量错误共享,这可能会大大降低代码速度。我自己的强制虚假共享测试似乎表明这确实可能是问题所在。强制 Parallel.For()
的并行度似乎对峰值有所帮助。
当然,所有这些测量都是针对我的计算机上的硬件的,对您来说会有所不同,因此您应该进行自己的测量。
我使用的代码:
static void Main()
{
double[] array = new double[200 * 1000 * 1000];
for (int i = 0; i < array.Length; i++)
array[i] = 1;
for (int i = 0; i < 5; i++)
{
Stopwatch sw = Stopwatch.StartNew();
Serial(array, 2);
Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelFor(array, 2);
Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelForDegreeOfParallelism(array, 2);
Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallel(array, 2);
Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMax(array, 2);
Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMaxHalfParallelism(array, 2);
Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelFalseSharing(array, 2);
Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
}
}
static void Serial(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
static void ParallelFor(double[] array, double factor)
{
Parallel.For(
0, array.Length, i => { array[i] = array[i] * factor; });
}
static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
Parallel.For(
0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
i => { array[i] = array[i] * factor; });
}
static void CustomParallel(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMax(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount / 2;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelFalseSharing(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
int i = -1;
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
int j = Interlocked.Increment(ref i);
while (j < array.Length)
{
array[j] = array[j] * factor;
j = Interlocked.Increment(ref i);
}
});
}
Task.WaitAll(tasks);
}
示例输出:
Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
关于c# - 使用并行 FOR 循环节省时间,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12405938/