google-compute-engine - Network and disk I/O outage on a Google VM (europe-west1-b)

Tags: google-compute-engine

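Question:

How can we determine whether the problem we are seeing is in our application or in the Google platform?

Steps taken:

  • Checked the platform logs; there is no sign of a VM migration.
  • Checked the Google Cloud Status Dashboard; there is no sign of an outage.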

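Detailed description of the problem:

On Sunday, October 16, at about 08:37:40 UTC, we experienced a network and disk I/O outage. Here is a summary (see the logs below for details):

  • [08:37:40] - Our application runs into DNS problems
  • [08:37:43] - sd 0:0:1:0: [sda] abort reported in syslog
  • [08:37:43] - Over 5 minutes of kernel "task blocked for more than 120 seconds" messages
  • [08:43:10] - sd 0:0:1:0: device reset reported in syslog
  • [08:43:10] - Three errors from Google scripts (network related, I think)
  • [08:43:11] - Our application recovers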

    [Our application log]
    I|2016-10-16|08:37:09.271|ALM          Finished processing alarms
    W|2016-10-16|08:37:40.165|RC           Exception: DNS error: Temporary DNS error while resolving: www.googleapis.com
    W|2016-10-16|08:37:47.218|BP           Exception: DNS error: Temporary DNS error while resolving: www.some-domain.com
    I|2016-10-16|08:43:11.138|DB           line 1127: HWMDatabase::virtual void HWMDatabase::run() - Elapsed: 357999
    I|2016-10-16|08:43:11.149|PE.CON       onTimeoutNotification 185.3.54.28:9161
    

    [Linux syslog]
    Oct 16 08:37:43 hwm-node-1 kernel: [151118.601288] sd 0:0:1:0: [sda] abort
    Oct 16 08:41:07 hwm-node-1 kernel: [151321.937381] INFO: task kworker/u4:1:29 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:07 hwm-node-1 kernel: [151322.089136] INFO: task jbd2/sda1-8:104 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:07 hwm-node-1 kernel: [151322.245617] INFO: task rs:main Q:Reg:414 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:07 hwm-node-1 kernel: [151322.481381] INFO: task hwm_master:7791 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:07 hwm-node-1 kernel: [151322.616600] INFO: task hwm_master:7802 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:07 hwm-node-1 kernel: [151322.861420] INFO: task cron:18904 blocked for more than 120 seconds.
    ...
    Oct 16 08:41:08 hwm-node-1 kernel: [151323.051763] INFO: task cron:18905 blocked for more than 120 seconds.
    ...
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.634159] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.638435] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.642497] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.646611] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.650844] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.655165] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.659332] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.663459] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.667794] sd 0:0:1:0: [sda] abort
    Oct 16 08:42:53 hwm-node-1 kernel: [151428.671939] sd 0:0:1:0: [sda] abort
    Oct 16 08:43:08 hwm-node-1 kernel: [151443.169478] INFO: task jbd2/sda1-8:104 blocked for more than 120 seconds.
    ...
    Oct 16 08:43:08 hwm-node-1 kernel: [151443.328262] INFO: task ntpd:393 blocked for more than 120 seconds.
    ...
    Oct 16 08:43:08 hwm-node-1 kernel: [151443.527233] INFO: task rs:main Q:Reg:414 blocked for more than 120 seconds.
    ...
    Oct 16 08:43:10 hwm-node-1 kernel: [151445.559469] sd 0:0:1:0: device reset
    Oct 16 08:43:10 hwm-node-1 rsyslogd-2007: action 'action 18' suspended, next retry is Sun Oct 16 08:43:40 2016 [try http://www.rsyslog.com/e/2007 ]
    ...
    Oct 16 08:43:10 hwm-node-1 google-ip-forwarding: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012    metadata_key=metadata_key, recursive=recursive, wait=wait)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012    response = self._GetMetadataRequest(metadata_url, params=params)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012    response = func(*args, **kwargs)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012    return request_opener.open(request, timeout=self.timeout*1.1)#012  File "/usr/lib/python2.7/urllib2.py", line 431, in open#012    response = self._open(req, data)#012  File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012    '_open', req)#012  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012    result = func(*args)#012  File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012    return self.do_open(httplib.HTTPConnection, req)#012  File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012    r = h.getresponse(buffering=True)#012  File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012    response.begin()#012  File "/usr/lib/python2.7/httplib.py", line 444, in begin#012    version, status, reason = self._read_status()#012  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012    line = self.fp.readline(_MAXLINE + 1)#012  File "/usr/lib/python2.7/socket.py", line 476, in readline#012    data = self._sock.recv(self._rbufsize)#012timeout: timed out
    Oct 16 08:43:10 hwm-node-1 google-accounts: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012    metadata_key=metadata_key, recursive=recursive, wait=wait)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012    response = self._GetMetadataRequest(metadata_url, params=params)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012    response = func(*args, **kwargs)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012    return request_opener.open(request, timeout=self.timeout*1.1)#012  File "/usr/lib/python2.7/urllib2.py", line 431, in open#012    response = self._open(req, data)#012  File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012    '_open', req)#012  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012    result = func(*args)#012  File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012    return self.do_open(httplib.HTTPConnection, req)#012  File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012    r = h.getresponse(buffering=True)#012  File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012    response.begin()#012  File "/usr/lib/python2.7/httplib.py", line 444, in begin#012    version, status, reason = self._read_status()#012  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012    line = self.fp.readline(_MAXLINE + 1)#012  File "/usr/lib/python2.7/socket.py", line 476, in readline#012    data = self._sock.recv(self._rbufsize)#012timeout: timed out
    Oct 16 08:43:10 hwm-node-1 google-clock-skew: ERROR GET request error retrieving metadata.#012Traceback (most recent call last):#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 159, in _HandleMetadataUpdate#012    metadata_key=metadata_key, recursive=recursive, wait=wait)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 134, in _GetMetadataUpdate#012    response = self._GetMetadataRequest(metadata_url, params=params)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 50, in Wrapper#012    response = func(*args, **kwargs)#012  File "/usr/lib/python2.7/dist-packages/google_compute_engine/metadata_watcher.py", line 97, in _GetMetadataRequest#012    return request_opener.open(request, timeout=self.timeout*1.1)#012  File "/usr/lib/python2.7/urllib2.py", line 431, in open#012    response = self._open(req, data)#012  File "/usr/lib/python2.7/urllib2.py", line 449, in _open#012    '_open', req)#012  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain#012    result = func(*args)#012  File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open#012    return self.do_open(httplib.HTTPConnection, req)#012  File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open#012    r = h.getresponse(buffering=True)#012  File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse#012    response.begin()#012  File "/usr/lib/python2.7/httplib.py", line 444, in begin#012    version, status, reason = self._read_status()#012  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status#012    line = self.fp.readline(_MAXLINE + 1)#012  File "/usr/lib/python2.7/socket.py", line 476, in readline#012    data = self._sock.recv(self._rbufsize)#012timeout: timed out
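
The syslog excerpt above can be reduced to an outage window mechanically. The following is a minimal sketch, not part of the original question: the function name `outage_window`, the regular expressions, and the `year` parameter are assumptions made for illustration (syslog timestamps carry no year), and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Kernel messages that mark a stalled or recovering SCSI disk in the excerpt.
# These patterns are assumptions derived from the log lines shown above.
PATTERNS = {
    "abort": re.compile(r"sd \S+ \[\w+\] abort"),
    "blocked": re.compile(r"INFO: task .+ blocked for more than \d+ seconds"),
    "reset": re.compile(r"sd \S+ device reset"),
}


def outage_window(lines, year=2016):
    """Return (first_event, last_event, counts) for disk-stall messages, or None."""
    stamps = []
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                # Syslog timestamps omit the year, so the caller supplies it.
                stamp = " ".join(line.split()[:3])
                stamps.append(
                    datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
                )
                counts[kind] += 1
                break
    if not stamps:
        return None
    return min(stamps), max(stamps), counts


SAMPLE = [
    "Oct 16 08:37:43 hwm-node-1 kernel: [151118.601288] sd 0:0:1:0: [sda] abort",
    "Oct 16 08:41:07 hwm-node-1 kernel: [151321.937381] INFO: task kworker/u4:1:29 blocked for more than 120 seconds.",
    "Oct 16 08:42:53 hwm-node-1 kernel: [151428.634159] sd 0:0:1:0: [sda] abort",
    "Oct 16 08:43:10 hwm-node-1 kernel: [151445.559469] sd 0:0:1:0: device reset",
]

if __name__ == "__main__":
    start, end, counts = outage_window(SAMPLE)
    print(start.time(), end.time(), (end - start).total_seconds(), dict(counts))
```

For the four sample lines the window runs from 08:37:43 to 08:43:10, which is 327 seconds and agrees with the summary in the question. Running the same function over the full syslog file would show whether similar stalls occurred at other times.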
    

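Best answer

This error is usually caused by throttling, which can result from the disk size or from a large increase in disk I/O. Try increasing the size of the persistent disk (PD) or using an SSD to improve performance.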
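
One way to check the throttling hypothesis is to measure how much I/O the disk is actually doing and compare it with the documented limit for that disk type and size. The sketch below is an illustration under stated assumptions, not the answerer's method: it derives IOPS and throughput from two snapshots of the Linux `/proc/diskstats` counters. The helper names (`parse_diskstats`, `io_rates`, `sample`) and the counter values in `BEFORE` and `AFTER` are hypothetical; the applicable limit is not hard-coded here and has to be looked up in Google's persistent disk documentation.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text, device):
    """Extract cumulative read/write counters for one device from diskstats text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 14 and fields[2] == device:
            return {
                "reads": int(fields[3]),            # reads completed
                "sectors_read": int(fields[5]),
                "writes": int(fields[7]),           # writes completed
                "sectors_written": int(fields[9]),
            }
    raise KeyError(device)


def io_rates(before, after, interval_s):
    """Turn two counter snapshots taken interval_s seconds apart into rates."""
    def delta(key):
        return after[key] - before[key]

    return {
        "read_iops": delta("reads") / interval_s,
        "write_iops": delta("writes") / interval_s,
        "read_mb_s": delta("sectors_read") * SECTOR_BYTES / 1e6 / interval_s,
        "write_mb_s": delta("sectors_written") * SECTOR_BYTES / 1e6 / interval_s,
    }


def sample(device="sda", interval_s=10):
    """Measure live rates on a Linux host by reading /proc/diskstats twice."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read(), device)
    time.sleep(interval_s)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read(), device)
    return io_rates(before, after, interval_s)


# Hypothetical snapshots taken 10 seconds apart, for illustration only.
BEFORE = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 700 1400"
AFTER = "   8       0 sda 1100 0 88000 550 2500 0 200000 1100 0 900 1650"

if __name__ == "__main__":
    rates = io_rates(parse_diskstats(BEFORE, "sda"), parse_diskstats(AFTER, "sda"), 10)
    print(rates)
```

If the measured rates sit close to the limit for the disk's type and size around the time of the stall, throttling is a plausible cause and a larger or SSD-backed disk should help. If they are far below it, the cause probably lies elsewhere.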

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/40085202/
