delphi - TParallel 的奇怪行为。对于默认线程池

标签 delphi parallel-processing rtl-ppl

我正在尝试 Delphi XE7 Update 1 的并行编程功能。

我创建了一个简单的 TParallel.For 循环,它基本上执行一些虚假操作来打发时间。

我在 AWS 实例 (c4.8xlarge) 的 36 vCPU 上启动了该程序,尝试看看并行编程可以带来什么好处。

当我第一次启动程序并执行 TParallel.For 循环时,我看到了显着的增益(尽管无可否认,比我对 36 个 vCPU 的预期要少得多):

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

如果我不关闭程序并不久后(例如,立即或大约 10-20 秒后)在 36 vCPU 计算机上再次运行该过程,则并行过程会恶化很多:

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

如果我不关闭程序,并等待几分钟(不是几秒钟,而是几分钟),然后再次运行该过程,我会再次得到第一次启动该程序时得到的结果(提高了 10 倍)响应时间)。

启动程序后的第一遍在 36 vCPU 计算机上总是更快,因此这种效果似乎仅在程序中第二次调用 TParallel.For 时发生。

这是我正在运行的示例代码:

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

 begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr,120000); for i := 1 to 120000 do referenceStr[i] := Chr(Ord('a') + Random(26)); 

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

这是否按设计工作?我发现这篇文章( http://delphiaball.co.uk/tag/parallel-programming/ )建议我让库决定线程池,但如果我必须从一个请求到另一个请求等待几分钟以便更快地处理请求,那么我不认为使用并行编程有什么意义。

我是否遗漏了关于如何使用 TParallel.For 循环的任何内容?

请注意,我无法在 AWS m3.large 实例(根据 AWS 的说法是 2 个 vCPU)上重现此情况。在这种情况下,我总是会得到轻微的改进,并且在不久之后的 TParallel.For 后续调用中也不会得到更糟糕的结果。

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

因此,似乎当有许多可用核心(36 个)时,就会出现这种效果,这很遗憾,因为并行编程的全部要点就是从许多核心中受益。我想知道这是否是一个库错误,因为内核数量较多,或者在这种情况下内核数量不是 2 的幂。

UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:

  • 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
  • 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
  • 16 vCPUs (c3.4xlarge). You have to wait sub second times. It could be usable if load is low but response time still important
  • 8 vCPUs (c3.2xlarge). It seems to work normally
  • 4 vCPUs (c3.xlarge). It seems to work normally
  • 2 vCPUs (m3.large). It seems to work normally

最佳答案

我根据您的程序创建了两个测试程序来比较 System.Threading 和 OTL 。我使用 XE7 update 1 和 OTL r1397 构建。我使用的 OTL 源对应版本 3.04。我使用 32 位 Windows 编译器并使用发布构建选项进行构建。

我的测试机器是运行 Windows 7 x64 的双 Intel Xeon E5530。该系统有两个四核处理器。总共有 8 个处理器,但由于超线程,系统显示有 16 个。经验告诉我,超线程只是营销废话,我从未见过在这台机器上扩展超过 8 倍。

现在来说两个程序,它们几乎相同。

系统线程

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

现在是输出。

System.Threading 输出

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL输出

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

我让 OTL 版本运行了很长一段时间,并且模式从未改变。并行版本始终比串行版本快 7 倍左右。

结论

代码非常简单。可以得出的唯一合理的结论是 System.Threading 的实现存在缺陷。

有大量与新的 System.Threading 库相关的错误报告。种种迹象表明,它的质量很差。 Embarcadero 在发布不符合标准的库代码方面有着悠久的记录。我正在考虑 TMonitor、XE3 字符串助手、早期版本的 System.IOUtils、FireMonkey。这样的例子还在继续。

很明显,质量是 Embarcadero 的一个大问题。发布的代码显然没有经过充分的测试(如果有的话)。这对于线程库来说尤其麻烦,因为错误可能处于休眠状态,并且仅在特定的硬件/软件配置中暴露。 TMonitor 的经验让我相信 Embarcadero 没有足够的专业知识来生成高质量、正确的线程代码。

我的建议是您不应该使用当前形式的System.Threading。在可以看出它具有足够的质量和正确性之前,应该避免它。我建议您使用OTL。

<小时/>

编辑:该程序的原始 OTL 版本存在实时内存泄漏,这是由于丑陋的实现细节而发生的。 Parallel.For 使用 .Unobserved 修饰符创建任务。这导致所述任务仅在某些内部消息窗口收到“任务已终止”消息时才被销毁。该窗口是在与 Parallel.For 调用者相同的线程中创建的 - 即在本例中是在主线程中。由于主线程不处理消息,任务永远不会被销毁,内存消耗(加上其他资源)只会堆积起来。可能是因为该程序在一段时间后挂起。

关于delphi - TParallel 的奇怪行为。对于默认线程池,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/29062697/

相关文章:

delphi - 为什么包含 FastMM 会导致 EInvalidTypecast 错误?

delphi - 每次我尝试使用存储过程与数据库进行通信时,我都会收到此消息 "Cannot perform this operation on a closed dataset"

delphi - 透明背景TStringGrid

delphi翻译工具

string - 并行命令中的 bash 字符串替换

loops - OpenCL for 循环执行模型

parallel-processing - OpenMP - Fortran 中的任务依赖性