assembly - SBCL 优化 : Can we compile an efficient population count for bit-vectors?

SBCL，我使用的 Lisp 实现，知道编译 (logcount x)到 x86-64 POPCNT说明如果 x可输入为足够短的 unsigned-byte .假设一个 simple-bit-vector SBCL 存储在字对齐的连续内存段中，如何让编译器在其上执行类似的优化人口计数？该标准不提供 (bit-vector-logcount) (在我看来，这是一个奇怪的遗漏)；它也不允许我(coerce)到 fixnum .

一个故意幼稚的实现如下；请注意，还有一个 Ω(1)-time one ，而 Harley-Seal 技术 [1] 可能是大向量的更好选择。但这对于优化编译器来说很简单，可以发现意图:

(defun bit-vector-unsigned-logcount (x)
  "Not worrying about negative interpretations of X."
  (declare (type (simple-bit-vector 32) x)
           (optimize (speed 3) (safety 0) (debug 0)))
  (loop for b across x
        count (not (zerop b))))

在 SBCL 2.0.1 我得到这个:

; disassembly for BIT-VECTOR-UNSIGNED-LOGCOUNT
; Size: 67 bytes. Origin: #x52B88079                          ; BIT-VECTOR-UNSIGNED-LOGCOUNT
; 79:       31D2             XOR EDX, EDX
; 7B:       31C0             XOR EAX, EAX
; 7D:       31C9             XOR ECX, ECX
; 7F:       EB2C             JMP L1
; 81:       660F1F840000000000 NOP
; 8A:       660F1F440000     NOP
; 90: L0:   488BD0           MOV RDX, RAX
; 93:       48D1FA           SAR RDX, 1
; 96:       480FA35301       BT QWORD PTR [RBX+1], RDX
; 9B:       19D2             SBB EDX, EDX
; 9D:       83E202           AND EDX, 2
; A0:       4883C002         ADD RAX, 2
; A4:       4885D2           TEST RDX, RDX
; A7:       7404             JEQ L1
; A9:       4883C102         ADD RCX, 2
; AD: L1:   4883F840         CMP RAX, 64
; B1:       7CDD             JL L0
; B3:       488BD1           MOV RDX, RCX
; B6:       488BE5           MOV RSP, RBP
; B9:       F8               CLC
; BA:       5D               POP RBP
; BB:       C3               RET

我会给SBCL manual一个说话的机会。

If your system's performance is suffering because of some construct which could in principle be compiled efficiently, but which the SBCL compiler can't in practice compile efficiently, consider writing a patch to the compiler and submitting it for inclusion in the main sources.

我怀疑我正面临这样的情况，我很乐意提供帮助，但除了查看 Paul Khuong 的文章之外，我对 VOP 黑客几乎一无所知 here和 here .
x86-64/arith.lisp定义了几个 VOP，unsigned-byte-64-count和 positive-fixnum-count ，如果我们可以取笑 bit-vector，它们看起来好像可以重新用于工作。分开。

[1] Muła, W., Kurz, N., & Lemire, D. (2017)。 Faster Population Counts Using AVX2 Instructions .计算机杂志，61(1)，111-120。 doi:10.1093/comjnl/bxx046

最佳答案

sb-kernel:%vector-raw-bits 可以成为缺失的部分吗？

(logcount (sb-kernel:%vector-raw-bits #*100101010 0))

-> 4

(disassemble (compile () (lambda (a) (logcount (sb-kernel:%vector-raw-bits a    0)))))

-> (...) POPCNT (...)
如何查找:检查，例如，bit-and ，并跳转到定义(M-. in sly/slime repl)，选择 deftransform为 simple-bit-vectors .
毋庸置疑，这是特定于实现的，需要注意安全，但 vops 也是如此。

关于assembly - SBCL 优化 : Can we compile an efficient population count for bit-vectors?，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/62478318/

assembly - SBCL 优化 : Can we compile an efficient population count for bit-vectors?

上一篇：unity3d - 如何为 xbox 编写程序并使用其 GPU？

下一篇：python - GEKKO:将数组中的出现次数作为条件进行计数