c# - 使用并行 FOR 循环节省时间

标签 c# .net parallel-extensions

我有一个关于并行 for 循环的问题。我有以下代码:

    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        }
    }

现在我尝试添加并行函数:

    public static void MultiplicateArray(double[] array, double factor)
    {
        Parallel.For(0, array.Length, i =>
            {
                array[i] = array[i] * factor;
            });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        });
    }

问题是,我想节省时间,而不是浪费时间。使用标准 for 循环,它计算大约 2 分钟,但使用并行 for 循环需要 3 分钟。为什么?

最佳答案

Parallel.For() 可以通过并行化您的代码大大提高性能,但它也有开销(线程之间的同步,在每次迭代时调用委托(delegate))。由于在您的代码中,每次迭代都非常短(基本上,只有几条 CPU 指令),这种开销会变得很突出。

因此,我认为使用 Parallel.For() 不是适合您的解决方案。相反,如果您手动并行化您的代码(在本例中这非常简单),您可能会看到性能提高。

为了验证这一点,我进行了一些测量:我在一个包含 200 000 000 个项目的数组上运行了 MultiplicateArray() 的不同实现(我使用的代码如下)。在我的机器上,串行版本始终需要 0.21 秒,而 Parallel.For() 通常需要大约 0.45 秒,但有时会飙升至 8-9 秒!

首先,我将尝试改进常见情况,稍后再讨论这些峰值。我们想用 N 个 CPU 来处理这个数组,所以我们把它分成 N 个大小相等的部分,然后分别处理每个部分。结果? 0.35 秒。这仍然比串行版本差。但是 for 遍历数组中的每个项目是最优化的结构之一。我们不能做点什么来帮助编译器吗?提取计算循环的界限可能会有所帮助。事实证明确实如此:0.18 秒。这比串行版本好,但也好不了多少。而且,有趣的是,在我的 4 核机器(没有超线程)上将并行度从 4 更改为 2 不会改变结果:仍然是 0.18 秒。这让我得出结论,CPU 不是这里的瓶颈,内存带宽才是。

现在,回到峰值:我的自定义并行化没有它们,但是 Parallel.For() 有,为什么? Parallel.For() 确实使用范围分区,这意味着每个线程处理自己的数组部分。但是,如果一个线程提前结束,它将尝试帮助处理另一个尚未结束的线程的范围。如果发生这种情况,您将获得大量错误共享,这可能会大大降低代码速度。我自己的强制虚假共享测试似乎表明这确实可能是问题所在。强制 Parallel.For() 的并行度似乎对峰值有所帮助。

当然,所有这些测量都是针对我的计算机上的硬件的,对您来说会有所不同,因此您应该进行自己的测量。

我使用的代码:

static void Main()
{
    double[] array = new double[200 * 1000 * 1000];

    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}

static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}

static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    int i = -1;

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}

示例输出:

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s

关于c# - 使用并行 FOR 循环节省时间,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12405938/

相关文章:

.net - NUnit测试PLINQ代码后,如何防止AppDomainUnloadedException?

c# - 泛型还是只使用重载?

c# - 任何 .NET '#region directive' 约定想法?

c# - 从另一个刚刚编译的程序集编译程序集

c# - 我如何在 C# 中获取当前日期?

.net-4.0 - 如何强制在继续任务中观察到任务的异常?

c# - 将并行扩展或并行 LINQ 与 LINQ Take 结合使用

c# - 如何从服务总线主题死信队列中读取C#?

c# - 在c#中发送邮件到gmail ssl问题

.net - 具有多个 DataLoadOptions 的 Linq2SQl 预加载