The Intel® 64 and IA-32 Architectures Software Developer’s Manual says:
8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.
But what if a load partially or fully overlaps an earlier store without having the same starting address? (See the end of this post for concrete cases.)
Assume the following C-like code:
// lock - pointer to an aligned int64 variable
// threadNum - integer in the range 0..7
// volatiles here just to show direct r/w of the memory as it was suggested in the comments
int TryLock(volatile INT64* lock, INT64 threadNum)
{
    if (0 != *lock)
        return 0; // another thread already had the lock

    ((volatile INT8*)lock)[threadNum] = 1; // take the lock by setting our byte

    if (1LL << 8*threadNum != *lock)
    { // another thread set its byte between our 1st and 2nd check. unset ours
        ((volatile INT8*)lock)[threadNum] = 0;
        return 0;
    }
    return 1;
}
Or the x64 asm equivalent of TryLock:
; rcx - address of an aligned int64 variable
; rdx - integer in the range 0..7
TryLock PROC
    cmp  qword ptr [rcx], 0
    jne  @fail                 ; another thread already had the lock
    mov  r8, rdx               ; r8 = threadNum
    mov  rax, 8
    mul  rdx                   ; rax = 8*threadNum; mul also zeroes rdx here
    mov  byte ptr [rcx+r8], 1  ; take the lock by setting our byte
    bts  rdx, rax              ; rdx = 1 << 8*threadNum
    cmp  qword ptr [rcx], rdx
    jz   @success
    mov  byte ptr [rcx+r8], 0  ; another thread raced us: unset our byte
@fail:
    mov  rax, 0
    ret
@success:
    mov  rax, 1
    ret
Then suppose TryLock is executed concurrently in two threads:
INT64 lock = 0;
void Thread_1() { TryLock(&lock, 1); }
void Thread_5() { TryLock(&lock, 5); }
The question:
The ((INT8*)lock)[1] = 1; and ((INT8*)lock)[5] = 1; stores aren't to the same location as the 64-bit load of lock. However, they are each fully contained by that load, so does that "count" as the same location? It seems impossible for a CPU to do that.
What about ((INT8*)lock)[0] = 1? The address of that store is then the same as the address of the following load. Are these operations "to the same location", even if the earlier cases aren't?
P.S. Please note that the question isn't about the C/asm code; it's about the behavior of x86 CPUs.
Best answer:
Can x86 reorder a narrow store with a wider load that fully contains it?
Yes, x86 can reorder a narrow store with a wider load that fully contains it.
That is why your lock algorithm is broken: shared_value does not end up equal to 800000:
shared_value = 662198: http://coliru.stacked-crooked.com/a/157380085ccad40f
shared_value = 538246: http://coliru.stacked-crooked.com/a/ecec7f021a2a9782
See the correct example below.
The question:
The ((INT8*)lock)[ 1 ] = 1; and ((INT8*)lock)[ 5 ] = 1; stores aren't to the same location as the 64bit load of lock. However, they are each fully contained by that load, so does that "count" as the same location?
No, it does not.
Intel® 64 and IA-32 Architectures Software Developer’s Manual says:
8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.
That is the simplified rule for the case where the STORE and the LOAD have the same size.
But the general rule is that a write to memory is delayed for some time: the STORE (address + value) is queued in the Store Buffer to wait for the cache line to reach the Exclusive (E) state — at which point that cache line is invalidated (I) in the caches of the other CPU cores. However, you can use the MFENCE instruction (or any instruction with the [LOCK] prefix) to force the write to complete: any subsequent instructions can then execute only after the Store Buffer has been drained, and the STORE is visible to all CPU cores.
About the reordering of these two lines:
((volatile INT8*)lock)[threadNum] = 1; // STORE
if (1LL << 8*threadNum != *lock) // LOAD
Because the 1-byte STORE and the 8-byte LOAD have different sizes, store-forwarding from the Store Buffer can supply only 1 of the 8 bytes that the LOAD needs. The CPU core then has two options:
Case 1: the LOAD is allowed to complete using the data currently in the cache line combined with the 1 byte sitting in the Store Buffer, i.e. the LOAD effectively executes before the STORE becomes globally visible — StoreLoad reordering — so it can miss bytes just stored by other cores.
Case 2: the CPU core waits until the Store Buffer has been drained, i.e. until the STORE has completed, and only then performs the LOAD. That is the same as when you use MFENCE.
Conclusion: you must use MFENCE after the STORE in any case; if the Store Buffer is already empty, MFENCE completes immediately.
Correct example in C and x86_64 asm:
We force the CPU core to behave as in case 2 by using MFENCE, so there is no StoreLoad reordering:
using mfence to flush the Store Buffer: https://godbolt.org/g/dtNMZ7
using [LOCK] xchgb reg, [addr] to flush the Store Buffer: https://godbolt.org/g/BQY6Ju
Note: xchgb always carries an implicit LOCK prefix, so it usually isn't written in the asm, or is shown in brackets. All the other compilers can be selected manually at the links above: PowerPC, ARM, ARM64, MIPS, MIPS64, AVR.
C code — sequential consistency should be used for the first STORE and the next LOAD:
#ifdef __cplusplus
#include <atomic>
using namespace std;
#else
#include <stdatomic.h>
#endif
// lock - pointer to an aligned int64 variable
// threadNum - integer in the range 0..7
// volatiles here just to show direct r/w of the memory as it was suggested in the comments
int TryLock(volatile uint64_t* lock, uint64_t threadNum)
{
    //if (0 != *lock)
    if (0 != atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_acquire))
        return 0; // another thread already had the lock

    //((volatile uint8_t*)lock)[threadNum] = 1; // take the lock by setting our byte
    uint8_t* current_lock = ((uint8_t*)lock) + threadNum;
    atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)1, memory_order_seq_cst);

    //if (1LL << 8*threadNum != *lock)
    // You already know that this flag is set and should not have to check it.
    if ( 0 != ( (~(1LL << 8*threadNum)) &
                atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_seq_cst) ))
    { // another thread set its byte between our 1st and 2nd check. unset ours
        //((volatile uint8_t*)lock)[threadNum] = 0;
        atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)0, memory_order_release);
        return 0;
    }
    return 1;
}
GCC 6.1.0 — x86_64 asm code — MFENCE should be used for the first STORE:
TryLock(unsigned long volatile*, unsigned long):
        movq    (%rdi), %rdx
        xorl    %eax, %eax
        testq   %rdx, %rdx
        je      .L7
.L1:
        rep ret
.L7:
        leaq    (%rdi,%rsi), %r8
        leaq    0(,%rsi,8), %rcx
        movq    $-2, %rax
        movb    $1, (%r8)
        rolq    %cl, %rax
        mfence
        movq    (%rdi), %rdi
        movq    %rax, %rdx
        movl    $1, %eax
        testq   %rdi, %rdx
        je      .L1
        movb    $0, (%r8)
        xorl    %eax, %eax
        ret
A complete working example: http://coliru.stacked-crooked.com/a/65e3002909d8beae
shared_value = 800000
What happens if you don't use MFENCE — a data race with StoreLoad reordering, as in case 1 above (i.e., if you don't use sequential consistency for the STORE) — asm: https://godbolt.org/g/p3j9fR
shared_value = 610307: http://coliru.stacked-crooked.com/a/469f087b1ce32977
shared_value = 678949: http://coliru.stacked-crooked.com/a/25070868d3cfbbdd
I changed the memory barrier for the STORE from memory_order_seq_cst to memory_order_release, which removes the MFENCE — and now there is a data race: shared_value is not equal to 800000.
The original question on Stack Overflow: "multithreading - Can x86 reorder a narrow store with a wider load that fully contains it?" https://stackoverflow.com/questions/35830641/