python - Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

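I am running an Amazon Web Services RDS Aurora 5.6 database cluster. Several Lambda functions, all written in Python, talk to these database instances. Everything had been running well, but a few days ago the Python code suddenly started to throw the following error from time to time: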

[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)

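This happens for roughly 1 in every 1000 new connections. Interestingly, I have not touched the service at all in the last few days (that is, since this started happening). All Lambdas use the official MySQL Connector client and connect on each initialization with the following snippet: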

import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                         password=os.environ['DATABASE_PASSWORD'],
                         database=os.environ['DATABASE_NAME'],
                         host=os.environ['DATABASE_HOST'],
                         autocommit=True)

To rule out a problem in the Python MySQL client, I added the following to resolve the host name:

import os
import socket

host = socket.gethostbyname(os.environ['DATABASE_HOST'])

Here, too, I sometimes get the following error:

[ERROR] gaierror: [Errno -2] Name or service not known

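I now suspect this is DNS-related, but since I am only using the cluster endpoint there is not much I can do about it. Interestingly, I recently hit exactly the same problem in a different region with the same setup (Aurora 5.6 cluster, Python Lambdas connecting to it), and the same thing happens there.

I have tried restarting all the machines in the cluster, but the problem persists. Is this really a DNS issue? What can I do to stop it from happening?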

Best answer

AWS Support told me that this error is most likely caused by a traffic quota in AWS's VPCs.

According to their documentation on DNS Quotas:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.

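Note that the metric here is packets per second, per ENI. Why does that matter? It may not be obvious, but although the actual number of packets per query varies, a single DNS query typically involves several packets.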

These packets do not show up in VPC flow logs, but in my own packet captures I can see some resolutions consisting of about 4 packets.
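To put the quota in perspective, here is a rough back-of-the-envelope sketch. It assumes that every resolution costs the roughly 4 packets seen in that capture and that all of them count against the 1024 packets-per-second quota. Both are simplifying assumptions: the real packet count varies per query, and the quota as quoted applies to packets sent to the DNS server. The function name is made up for illustration.

```python
# Figures taken from the text: 1024 packets/s per ENI (AWS quota) and about
# 4 packets per resolution (seen in one packet capture; it varies per query).
DNS_PACKETS_PER_SECOND_PER_ENI = 1024
OBSERVED_PACKETS_PER_RESOLUTION = 4  # assumption: treated as a constant here


def max_resolutions_per_second(packets_per_resolution,
                               packet_quota=DNS_PACKETS_PER_SECOND_PER_ENI):
    """Upper bound on DNS resolutions per second before the quota is hit."""
    return packet_quota // packets_per_resolution


print(max_resolutions_per_second(OBSERVED_PACKETS_PER_RESOLUTION))  # 256
```

Under these assumptions, a single network interface runs out of DNS budget at about 256 resolutions per second, far fewer than the headline figure of 1024 suggests.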

Unfortunately, I can't say much about the whitepaper. At this stage I am not really considering implementing a hybrid DNS service as a "good" solution.

Solutions

I am looking into ways to reduce the risk of this error occurring and to limit its impact when it does. As I see it, there are several options:

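Force the Lambda function to resolve the Aurora cluster's DNS name before doing anything else, connect using the resulting private IP address, and handle failures with exponential backoff. To keep the cost of waiting for retries to a minimum, I set a total timeout of 5 seconds for DNS resolution. That figure includes all backoff wait time.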
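A minimal sketch of that idea could look like the following. The function name, the 0.1-second starting delay and the jitter are my own assumptions, not something AWS prescribes; only the 5-second total budget comes from the description above. Note that the budget bounds the backoff waits, not a single lookup that blocks for a long time.

```python
import random
import socket
import time


def resolve_with_backoff(hostname, total_timeout=5.0, base_delay=0.1,
                         resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve hostname to an IPv4 address, retrying with exponential backoff.

    Gives up (re-raising the last socket.gaierror) as soon as the next wait
    would push the total elapsed time past total_timeout seconds.
    """
    deadline = time.monotonic() + total_timeout
    delay = base_delay
    while True:
        try:
            return resolver(hostname)
        except socket.gaierror:
            # Jitter keeps concurrent Lambdas from retrying in lockstep.
            wait = delay + random.uniform(0, delay)
            if time.monotonic() + wait > deadline:
                raise
            sleep(wait)
            delay *= 2  # exponential growth of the base delay
```

The returned address can then be passed as host to mysql.connect in place of os.environ['DATABASE_HOST']. The resolver and sleep parameters exist only so the retry logic can be exercised without real DNS lookups or real waiting.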

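Establishing many short-lived connections carries potentially costly overhead, even if you close them. Consider using connection pooling on the client side: it is a common misconception that Aurora's own connection pooling is enough to absorb the overhead of many short-lived connections.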
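In a Lambda, the simplest form of this is to reuse one connection per execution environment rather than run a full pool. The sketch below shows that simpler technique, assuming the same environment variables as the question; get_connection is a hypothetical helper name. Fewer new connections also means fewer DNS lookups. For a real pool, MySQL Connector/Python ships a mysql.connector.pooling module.

```python
import os

_connection = None  # module-level, so it survives warm Lambda invocations


def get_connection(connect=None):
    """Return a cached connection, opening a new one only when needed."""
    global _connection
    if connect is None:
        # Imported lazily so the sketch can be exercised without the driver.
        import mysql.connector
        connect = mysql.connector.connect
    if _connection is None or not _connection.is_connected():
        _connection = connect(user=os.environ['DATABASE_USER'],
                              password=os.environ['DATABASE_PASSWORD'],
                              database=os.environ['DATABASE_NAME'],
                              host=os.environ['DATABASE_HOST'],
                              autocommit=True)
    return _connection
```

Calling get_connection() at the start of each handler invocation opens a connection on a cold start, or after the old one has dropped, and otherwise returns the existing one.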

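Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it is important to know that you are always connected to the "right" (or write, in some cases :P) instance. Because updates to the Aurora cluster's DNS name can take time to propagate, even with a 5-second TTL, it may be better to use the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If you can't be bothered, see the smart-driver option that follows.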
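As a rough sketch of what reading that table might look like: the query below assumes the table has server_id and session_id columns and that the writer is marked with the session_id value 'MASTER_SESSION_ID'. That is how I recall it from the Aurora documentation, so verify it against your engine version before relying on it. Both helper names and the endpoint_suffix parameter are hypothetical.

```python
# Assumption: the writer row carries session_id = 'MASTER_SESSION_ID'.
WRITER_QUERY = (
    "SELECT server_id "
    "FROM information_schema.replica_host_status "
    "WHERE session_id = 'MASTER_SESSION_ID'"
)


def find_writer_instance_id(cursor):
    """Return the instance identifier of the current writer, or None."""
    cursor.execute(WRITER_QUERY)
    row = cursor.fetchone()
    return row[0] if row else None


def instance_endpoint(server_id, endpoint_suffix):
    """Build an instance endpoint from its id and a caller-supplied suffix.

    endpoint_suffix is whatever follows the instance name in your instance
    endpoints; it has to come from your own configuration.
    """
    return f"{server_id}.{endpoint_suffix}"
```

The cursor can come from any existing connection to the cluster. You still need one working connection to bootstrap, so this reduces the dependence on DNS rather than removing it.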

Use a smart driver, which:

is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.

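Not solutions

Initially I thought that creating a CNAME pointing to the cluster might be a good idea, but now I'm not so sure that caching Aurora DNS query results is wise. There are several reasons for this, discussed in detail in The Aurora Connection Management Handbook: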

Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don’t further increase the DNS cache TTL

Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster

I hope this helps!

A similar question to "python - Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster" can be found on Stack Overflow: https://stackoverflow.com/questions/58179080/
