java - _XReply() terminates the application with _XIOError()

Tags: java linux x11 xorg xcb

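We are developing a fairly complex application that consists of a Linux binary integrated with Java JNI calls (from a JVM created in the Linux binary) into our custom-made .jar file. All GUI work is implemented and done by the Java part. Every time some GUI property has to be changed or the GUI has to be repainted, it is done through a JNI call into the JVM.

The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. This is done iteratively and frequently, hundreds or thousands of iterations per second.

After a certain amount of time, the application terminates with exit(1), which I caught with gdb as being called from _XIOError(). The termination is reproducible after a more or less exact period of time, e.g. after about 15 hours on a 2.5 GHz dual-core x86. On a slower computer it takes longer, as if the time depended on CPU/GPU speed. One conclusion is that some part of xorg runs out of some resource, or something like that.

Here is my backtrace: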

#0  0xb7fe1424 in __kernel_vsyscall ()
#1  0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2  0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3  0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4  0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5  0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6  0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7  0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8  0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9  0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so

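I made my own exit() in liboverrides.so and used it to catch the exit() call in gdb by way of abort()/SIGABRT. After some debugging of libX11 and libxcb, I noticed that _XReply() got a NULL reply (the response from xcb_wait_for_reply()), which causes the call to _XIOError() and then exit(1). Going deeper into libxcb, in xcb_wait_for_reply(), I noticed that one of the reasons it can return a NULL reply is that it detects a broken or closed socket connection, which could be my situation.

For test purposes, if I change xcb_io.c and ignore _XIOError(), the application no longer works. If I repeat the request inside _XReply(), it fails every time, i.e. it gets a NULL response on each xcb_wait_for_reply().

So my questions are: why does such an uncontrolled termination via _XReply() -> _XIOError() -> exit(1) happen, and how can I find out why it happened and what happened, so that I can fix it or work around it?

As I wrote above, reproducing the problem takes about 15 hours of waiting, and I am currently very short on debugging time and cannot find the cause of the termination. We also tried reorganizing the Java part that handles the GUI/display refresh, but that did not solve the problem.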

Some software facts:
- Java JRE 1.8.0_20; the problem also reproduces with Java 7
- libX11.so 1.5.0
- libxcb.so 1.8.1
- Debian wheezy
- kernel 3.2.0

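Best answer

This is likely a known issue in libX11 concerning the handling of request numbers used for xcb_wait_for_reply.

At some point after libxcb v1.5, code was introduced that uses 64-bit sequence numbers internally everywhere, along with logic to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.

Here is a quote from the submitted libxcb bug report (actual email addresses removed):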

We have an application that does a lot of XDrawString and XDrawLine. After several hours the application is exited by an XIOError.

The XIOError is called in libX11 in the file xcb_io.c, function _XReply. It didn't get a response from xcb_wait_for_reply.

libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to this commit:

commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
Author: Jamey Sharp
Date:   Sat Oct 9 17:13:45 2010 -0700

    xcb_in: Use 64-bit sequence numbers internally everywhere.

    Widen sequence numbers on entry to those public APIs that still take
    32-bit sequence numbers.

    Signed-off-by: Jamey Sharp <jamey@xxxxxx.xxx>

Reverting it on top of 1.8.1 helps.

Adding traces to libxcb I found that the last request numbers used for xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in the while loop of the _XReply function), half a second later: 63215 (then XIOError is called). The widen_request is also 63215, I would have expected 63215+2^32. Therefore it seems that the request is not correctly widened.

The commit above also changed the compares in poll_for_reply from XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening never worked correctly, but it was never observed, because only the lower 32bits were compared.
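The expectation stated in the report (63215 + 2^32 rather than 63215) can be illustrated with a small sketch of sequence widening. This is not libxcb's actual code: `widen_sequence` and its parameters are hypothetical names chosen for illustration. The sketch assumes the connection keeps a 64-bit count of the latest request sent and picks the 64-bit value with the given low 32 bits that does not exceed that count.

```c
#include <stdint.h>

/* Hypothetical helper, not the libxcb implementation: widen a 32-bit
 * sequence number to 64 bits, given the 64-bit number of the most
 * recent request sent on the connection. */
uint64_t widen_sequence(uint64_t latest_sent, uint32_t seq32)
{
    /* Start with the upper 32 bits of the latest request. */
    uint64_t widened = (latest_sent & UINT64_C(0xffffffff00000000)) | seq32;

    /* A reply can only belong to a request that was already sent, so a
     * candidate beyond latest_sent must come from the previous epoch.
     * The second condition avoids underflow before the first wrap. */
    if (widened > latest_sent && widened >= (UINT64_C(1) << 32))
        widened -= UINT64_C(1) << 32;
    return widened;
}
```

Under these assumptions, `widen_sequence(4295030511, 63215)` returns 4295030511 (that is, 63215 + 2^32), while a request from just before the wrap, `widen_sequence(4294967396, 4294965487)`, is mapped back to 4294965487.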

Reproducing the problem

Here is the original code snippet from the submitted bug report that was used to reproduce the problem:

  for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
  }

Apparently the problem can be reproduced with even simpler code:

  for(;;) {
    XNoOp(dpy);
  }

According to the submitted libxcb bug report, these conditions are needed to reproduce it (assuming the reproduction code is in xdraw.c):

  • libxcb >= 1.8 (i.e. includes the commit ed37b08)
  • compiled with 32bit: gcc -m32 -lX11 -o xdraw xdraw.c
  • the sequence counter wraps.
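For the third condition, it helps to estimate how long a given request rate takes to reach 2^32 requests. The sketch below is plain arithmetic; `hours_until_wrap` is a hypothetical helper, and the request rates used with it are illustrative assumptions, not measurements from the bug report or the question.

```c
/* Hours until a 32-bit request counter wraps, assuming a constant
 * request rate (requests per second). */
double hours_until_wrap(double requests_per_second)
{
    const double wrap = 4294967296.0; /* 2^32 requests */
    return wrap / requests_per_second / 3600.0;
}
```

At an assumed 1,000,000 requests per second the counter wraps after roughly 1.2 hours. An assumed rate of about 80,000 requests per second gives roughly 14.9 hours, which would be consistent with the 15 hours in the question, although the question does not state the actual rate.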

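Proposed patch

The proposed patch for use with libxcb 1.8.1 is shown below. Note that it modifies src/xcb_io.c, the file the bug report above places in libX11: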

diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
        static const xReq dummy_request;
        static char const pad[3];
        struct iovec vec[3];
-       uint64_t requests;
+       unsigned long requests;
        _XExtension *ext;
        xcb_connection_t *c = dpy->xcb->connection;
        if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
        if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
        {
                uint64_t sequence;
-               for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+               for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
                        append_pending_request(dpy, sequence);
        }
        requests = dpy->request - dpy->xcb->last_flushed;
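The effect of changing `requests` from uint64_t to unsigned long on a 32-bit build can be sketched outside Xlib. The sketch below uses uint32_t to stand in for a 32-bit unsigned long so that it behaves the same on any platform; the function names are hypothetical and the code only models the subtraction, not the rest of _XSend().

```c
#include <stdint.h>

/* Before the patch: the difference is kept as a 64-bit value. */
uint64_t requests_before_patch(uint32_t request, uint64_t last_flushed)
{
    /* request is converted to 64 bits, so a wrapped counter yields a
     * huge unsigned difference instead of a small one. */
    return request - last_flushed;
}

/* After the patch: the difference is stored in a 32-bit unsigned long
 * (modelled here as uint32_t), so the upper 32 bits are dropped. */
uint32_t requests_after_patch(uint32_t request, uint64_t last_flushed)
{
    return (uint32_t)(request - last_flushed);
}
```

When the 32-bit counter has wrapped to 0 while last_flushed is still 0xffffffff, the first function returns 0xffffffff00000001 and the second returns 1. When no wrap has occurred, for example request 10 and last_flushed 7, both return 3.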

Technical details

Below is the detailed technical explanation by Jonas Petersen (also included in the bug report above):

Hi,

Here's two patches. The first one fixes a 32-bit sequence wrap bug. The second patch only adds a comment to another relevant statement.

The patches contain some details. Here is the whole story for who might be interested:

Xlib (libx11) will crash an application with a "Fatal IO error 11 (Resource temporarily unavailable)" after 4 294 967 296 requests to the server. That is when the Xlib internal 32-bit sequence wraps.

Most applications probably will hardly reach this number, but if they do, they have a chance to die a mysterious death. For example the application I'm working on did always crash after about 20 hours when I started to do some stress testing. It does some intensive drawing through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per second in full hd resolution (on Ubuntu). Some optimizations did extend the grace to about 35 hours but it would still crash.

What then followed was some frustrating weeks of digging and debugging to realize that it's not in my application, nor in gtkmm, gtk or glib but that it's this little bug in Xlib which exists since 2006-10-06 apparently.

It took a while to turn out that the number 0x100000000 (2^32) has some relevance. (Much) later it turned out it can be reproduced with Xlib only, using this code for example:

while(1) { XDrawPoint(display, drawable, gc, x, y); XFlush(display); }

It might take one or two hours, but when it reaches the 4294 million it will explode into a "Fatal IO error 11".

What I then learned is that even though Xlib uses internal 32bit sequence numbers they get (smartly) widened to 64bit in the process so that the 32bit sequence may wrap without any disruption in the widened 64bit sequence. Obviously there must be something wrong with that.

The Fatal IO error is issued in _XReply() when it's not getting a reply where there should be one, but the cause is earlier in _XSend() in the moment when the Xlib 32-bit sequence number wraps.

The problem is that when it wraps to 0, the value of 'last_flushed' will still be at the upper boundary (e.g. 0xffffffff). There are two locations in _XSend() (xcb_io.c) that fail in this state because they rely on those values being sequential all the time. The first location is:

requests = dpy->request - dpy->xcb->last_flushed;

In case of request = 0x0 and last_flushed = 0xffffffff it will assign 0xffffffff00000001 to 'requests' and then pass it to XCB as a number (amount) of requests. This is the main killer.

The second location is this:

for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)

In case of request = 0x0 (less than last_flushed) there is no chance to enter the loop ever and as a result some requests are ignored.

The solution is to "unwrap" dpy->request at these two locations and thus retain the sequence related to last_flushed.

uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;

It creates a temporary 64-bit request number which has bit 32 (counting from bit 0) set if 'request' is less than 'last_flushed'. It is then used in the two locations instead of dpy->request.

I'm not sure if it might be more efficient to use that statement inplace, instead of using a variable.

There is another line in require_socket() that worried me at first:

dpy->xcb->last_flushed = dpy->request = sent;

That's a 64-bit, 32-bit, 64-bit assignment. It will truncate 'sent' to 32-bit when assigning it to 'request' and then also assign the truncated value to the (64-bit) 'last_flushed'. But it seems intended. I have added a note explaining that for the next poor soul debugging sequence issues... :-)

- Jonas

Jonas Petersen (2):
  xcb_io: Fix Xlib 32-bit request number wrapping
  xcb_io: Add comment explaining a mixed type double assignment

 src/xcb_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

--
1.7.10.4
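The "unwrap" expression from the explanation above can be tried in isolation. The sketch below is a stand-alone model, not the patched Xlib source: uint32_t stands in for Xlib's 32-bit request counter, and `unwrap_request` and `pending_requests` are hypothetical names.

```c
#include <stdint.h>

/* Model of the unwrap expression: if the 32-bit counter is below
 * last_flushed it must have wrapped, so add 2^32 to restore its
 * position relative to last_flushed. */
uint64_t unwrap_request(uint32_t request, uint64_t last_flushed)
{
    return ((uint64_t)(request < last_flushed) << 32) + request;
}

/* Number of requests issued since last_flushed, computed from the
 * unwrapped value instead of the raw 32-bit counter. */
uint64_t pending_requests(uint32_t request, uint64_t last_flushed)
{
    return unwrap_request(request, last_flushed) - last_flushed;
}
```

With request = 0 and last_flushed = 0xffffffff, `unwrap_request` returns 0x100000000 and `pending_requests` returns 1, so a loop running from last_flushed + 1 up to the unwrapped value executes exactly once. Without a wrap (request 10, last_flushed 7) the values are 10 and 3.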

Good luck!

Regarding java - _XReply() terminating the application with _XIOError(), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23871516/
