I am doing continuous profiling of a running service, so I set up a crontab on the server. It periodically runs a Python script that spawns a perf subprocess to collect performance data from a daemon started by supervise.
The perf command I use looks like this:
perf record -p {target process} -e cycles:u -a -q -g -- sleep {some time}
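The cron-driven collector can be sketched roughly like this (a minimal sketch; the function names and the 30-second default are placeholders, not the actual script):

```python
import subprocess

def build_perf_cmd(pid, seconds):
    """Build the perf record argv shown above: attach to one process,
    sample user-space cycles, record call graphs, stay quiet, and stop
    after the given duration via the trailing sleep."""
    return ["perf", "record", "-p", str(pid), "-e", "cycles:u",
            "-a", "-q", "-g", "--", "sleep", str(seconds)]

def collect(pid, seconds=30):
    # Run perf as a child process; perf.data is written to the
    # current working directory when perf exits.
    subprocess.run(build_perf_cmd(pid, seconds), check=True)
```

The cron entry would then just invoke this script with the daemon's PID looked up from supervise.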
Everything works fine until the profiled process terminates. We sometimes need to update the target process's executable and restart it with svc -t. That operation can trigger a kernel panic, after which we have to reboot the machine. My server's distribution is
CentOS release 6.5 (Final)
and the kernel version is 2.6.32-431.23.3.el6.x86_64.
The core dump log and backtrace are shown below:
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 1
Modules linked in: AliSecGuard(U) AliSecProcFilter64(U) tcp_diag inet_diag joydev microcode virtio_net virtio_balloon shpchp i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_console virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 22748, comm: server Not tainted 2.6.32-573.22.1.el6.x86_64 #1 Alibaba Cloud Alibaba Cloud ECS
RIP: 0010:[<ffffffff8111db57>] [<ffffffff8111db57>] ring_buffer_put+0x77/0xf0
RSP: 0018:ffff8801afadbda8 EFLAGS: 00010006
RAX: ffff880416d81e60 RBX: ffff8803d335f000 RCX: 63496d6165727473
RDX: 676e697274736f5f RSI: 0000000000000003 RDI: ffff880416d81c00
RBP: ffff8801afadbdd8 R08: 0000000000000001 R09: 00000000ffffffff
R10: 00000000ffffffff R11: dead000000200200 R12: ffff8803d335f058
R13: 676e697274736cff R14: ffff8803d335f060 R15: 0000000000000202
FS: 0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000004264d70 CR3: 0000000001a8d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Process gameserver (pid: 22748, threadinfo ffff8801afad8000, task ffff8803a19cd520)
Stack:
ffff8801afadbdf8 ffff8803a1b03800 ffff8804138bf78c ffff8804138bf790
<d> ffff88001a9fb800 ffff8804182a1c80 ffff8801afadbdf8 ffffffff8111e377
<d> ffff8804138bf790 ffff8803a1b03800 ffff8801afadbe28 ffffffff8111fe72
Call Trace:
[<ffffffff8111e377>] free_event+0x37/0x170
[<ffffffff8111fe72>] perf_event_release_kernel+0x72/0xb0
[<ffffffff8111ff49>] put_event+0x99/0xd0
[<ffffffff81123a65>] __perf_event_exit_task+0xf5/0x150
[<ffffffff81123c91>] perf_event_exit_task+0x1d1/0x210
[<ffffffff8107ca24>] do_exit+0x1e4/0x870
[<ffffffff8107d1b7>] sys_exit+0x17/0x20
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Code: ff ff 4c 39 f0 48 8b 97 60 02 00 00 74 5f 4c 8d aa a0 fd ff ff eb 08 0f 1f 44 00 00 49 89 cd 48 8b 8f 68 02 00 00 be 03 00 00 00 <48> 89 4a 08 48 89 11 31 c9 48 89 87 60 02 00 00 48 89 87 68 02
RIP [<ffffffff8111db57>] ring_buffer_put+0x77/0xf0
RSP <ffff8801afadbda8>
PID: 22748 TASK: ffff8803a19cd520 CPU: 1 COMMAND: "server"
#0 [ffff8801afadbb30] machine_kexec at ffffffff8103d1fb
#1 [ffff8801afadbb90] crash_kexec at ffffffff810cc882
#2 [ffff8801afadbc60] oops_end at ffffffff8153da20
#3 [ffff8801afadbc90] die at ffffffff81010fab
#4 [ffff8801afadbcc0] do_general_protection at ffffffff8153d512
#5 [ffff8801afadbcf0] general_protection at ffffffff8153cce5
[exception RIP: ring_buffer_put+119]
RIP: ffffffff8111db57 RSP: ffff8801afadbda8 RFLAGS: 00010006
RAX: ffff880416d81e60 RBX: ffff8803d335f000 RCX: 63496d6165727473
RDX: 676e697274736f5f RSI: 0000000000000003 RDI: ffff880416d81c00
RBP: ffff8801afadbdd8 R8: 0000000000000001 R9: 00000000ffffffff
R10: 00000000ffffffff R11: dead000000200200 R12: ffff8803d335f058
R13: 676e697274736cff R14: ffff8803d335f060 R15: 0000000000000202
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff8801afadbda0] ring_buffer_put at ffffffff8111db20
#7 [ffff8801afadbde0] free_event at ffffffff8111e377
#8 [ffff8801afadbe00] perf_event_release_kernel at ffffffff8111fe72
#9 [ffff8801afadbe30] put_event at ffffffff8111ff49
#10 [ffff8801afadbe60] __perf_event_exit_task at ffffffff81123a65
#11 [ffff8801afadbe90] perf_event_exit_task at ffffffff81123c91
#12 [ffff8801afadbef0] do_exit at ffffffff8107ca24
#13 [ffff8801afadbf70] sys_exit at ffffffff8107d1b7
#14 [ffff8801afadbf80] system_call_fastpath at ffffffff8100b0d2
RIP: 0000003026207c41 RSP: 00007fe4d1e56e50 RFLAGS: 00000246
RAX: 000000000000003c RBX: ffffffff8100b0d2 RCX: 0000000000000001
RDX: 0000000000000004 RSI: 00000000009fb000 RDI: 0000000000000000
RBP: 0000000000000000 R8: 000000000598f280 R9: 00000000000058dc
R10: 00007fe4d259f3ac R11: 0000000000000246 R12: ffffffff8107d1b7
R13: ffff8801afadbf78 R14: 0000000000000003 R15: 0000000000000000
ORIG_RAX: 000000000000003c CS: 0033 SS: 002b
The kernel panics when a thread of the attached process exits, and the panic is not reproducible every time, so I suspect this may be a race-condition bug in the kernel. Incidentally, after the attached process terminates on my server, the perf process does not exit (because of the old version, I guess), so perf keeps running until I interrupt it. I am not sure whether that affects the target process's exit.
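As a workaround for the lingering perf process described above, the collector script could watch the target PID and interrupt perf itself once the target is gone; a minimal sketch (the polling interval and the choice of SIGINT are assumptions):

```python
import os
import signal
import time

def pid_alive(pid):
    """True if a process with this PID exists; signal 0 probes
    for existence without actually delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True

def reap_perf_when_target_dies(target_pid, perf_proc, poll=1.0):
    """Interrupt perf once the profiled process is gone, so perf
    flushes perf.data instead of running until manually killed.
    perf_proc is the subprocess.Popen handle for perf."""
    while pid_alive(target_pid) and perf_proc.poll() is None:
        time.sleep(poll)
    if perf_proc.poll() is None:
        perf_proc.send_signal(signal.SIGINT)
        perf_proc.wait()
```

Note this only cleans up the user-space side; it does not address the kernel panic itself.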
Best answer
This sounds like a bug in an old kernel version; user-space perf should not be able to panic the kernel. Either Linux 2.6.32 itself is buggy, or CentOS's backporting (or one of the backported patches itself) introduced a bug that only manifests while perf is active.
I don't think it is reasonable to fix this yourself unless you want to get seriously into kernel debugging, so your options come down to upgrading the kernel or not running perf.
A similar question on Stack Overflow: https://stackoverflow.com/questions/65084914/
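Following that advice, the collector could refuse to attach perf on kernels older than one believed to carry the fix; a minimal version-comparison sketch (the cutoff release used in the test is only illustrative, not a verified fix point):

```python
import re

def parse_kernel(release):
    """Turn an EL-style release string such as
    '2.6.32-431.23.3.el6.x86_64' into a sortable tuple of integers,
    dropping the '.el6.x86_64' suffix."""
    numeric_part = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))

def kernel_at_least(running, cutoff):
    """True if the running kernel release is at or past the cutoff."""
    return parse_kernel(running) >= parse_kernel(cutoff)
```

At startup the script could compare `os.uname().release` against the cutoff and skip profiling (or log a warning) on older kernels.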