I have an application written in Go that processes messages. It reads from the network (UDP) at a rate of 20K messages/second (potentially more), and each message can be up to the maximum UDP packet length (64KB minus the header size). The program decodes each incoming packet, re-encodes it into another format, and sends it to another network.
Right now it runs fine on a 24-core / 64GB-RAM machine, but occasionally drops packets. It already follows the pipeline pattern, using multiple goroutines/channels, and takes only about 10% of the machine's total CPU load, so it should be able to spend more CPU% or RAM to handle all 20K msg/s without losing any. I then started profiling; following that profiling, runtime.mallocgc
appeared at the top of the CPU profile. That is the garbage collector at work, and I suspected this GC might be the culprit: hanging for a few milliseconds (or microseconds) and losing some packets. Some best-practice advice says switching to sync.Pool might help, but after switching to the pool there seems to be even more CPU contention, and packets are dropped more often and more frequently:
(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
flat flat% sum% cum cum%
0 0% 0% 440.88s 96.09% runtime.goexit
1.91s 0.42% 1.75% 244.87s 53.37% sync.(*Pool).Get
64.42s 14.04% 15.79% 221.57s 48.29% sync.(*Pool).getSlow
94.29s 20.55% 36.56% 125.53s 27.36% sync.(*Mutex).Lock
1.62s 0.35% 36.91% 72.85s 15.88% runtime.systemstack
22.43s 4.89% 41.80% 60.81s 13.25% runtime.mallocgc
22.88s 4.99% 46.79% 51.75s 11.28% runtime.scanobject
1.78s 0.39% 47.17% 49.15s 10.71% runtime.newobject
26.72s 5.82% 53.00% 39.09s 8.52% sync.(*Mutex).Unlock
0.76s 0.17% 53.16% 33.74s 7.35% runtime.gcDrain
0 0% 53.16% 33.70s 7.35% runtime.gcBgMarkWorker
0 0% 53.16% 33.69s 7.34% runtime.gcBgMarkWorker.func2
The pool is used in the standard way:
// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}
// get
rf := rfpool.Get().(*aPrivateStruct)
// put it back after we are done processing this message
rfpool.Put(rf)
Am I doing something wrong here? Or are there other ways to tune the GC so it uses less CPU%? The Go version is 1.8.
The listing shows a lot of lock contention happening inside the pool. Here is the getSlow source (pool.go at golang.org):
(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
1.07mins 3.69mins (flat, cum) 48.29% of Total
. . 144: x = p.New()
. . 145: }
. . 146: return x
. . 147:}
. . 148:
80ms 80ms 149:func (p *Pool) getSlow() (x interface{}) {
. . 150: // See the comment in pin regarding ordering of the loads.
30ms 30ms 151: size := atomic.LoadUintptr(&p.localSize) // load-acquire
180ms 180ms 152: local := p.local // load-consume
. . 153: // Try to steal one element from other procs.
30ms 130ms 154: pid := runtime_procPin()
20ms 20ms 155: runtime_procUnpin()
730ms 730ms 156: for i := 0; i < int(size); i++ {
51.55s 51.55s 157: l := indexLocal(local, (pid+i+1)%int(size))
580ms 2.01mins 158: l.Lock()
10.65s 10.65s 159: last := len(l.shared) - 1
40ms 40ms 160: if last >= 0 {
. . 161: x = l.shared[last]
. . 162: l.shared = l.shared[:last]
. 10ms 163: l.Unlock()
. . 164: break
. . 165: }
490ms 37.59s 166: l.Unlock()
. . 167: }
40ms 40ms 168: return x
. . 169:}
. . 170:
. . 171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
. . 172:// Caller must call runtime_procUnpin() when done with the pool.
. . 173:func (p *Pool) pin() *poolLocal {
Best answer
sync.Pool runs slowly under a highly concurrent load. Try to allocate all the structures once at startup and reuse them many times. For example, you can create several goroutines (workers) at startup instead of spawning a new goroutine for each request. I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
This question, "golang GC profiling? runtime.mallocgc seems to be at the top; then the move to a sync.Pool solution?", originally appeared on Stack Overflow: https://stackoverflow.com/questions/43109483/