My program does the following:
- gets a folder of txt files
- for each file:
  - reads the file
  - makes a POST request with the file contents to an API on localhost
  - parses the XML response (not in the example below)
I was worried about the performance of the synchronous version of the program, so I tried to make it asynchronous using aiohttp (this is my first attempt at async programming in Python, apart from Scrapy). It turns out the async code takes 2x as long, and I don't understand why.
Synchronous code (152 seconds)
    import json
    import os
    from glob import glob

    import requests

    url = "http://localhost:6090/api/analyzexml"
    package = ...  # name of the package I send in each request

    with open("template.txt", "r", encoding="utf-8") as f:
        template = f.read()

    articles_path = ...  # location of my text files

    def fetch(session, url, article_text):
        data = {"package": package, "data": template.format(article_text)}
        response = session.post(url, data=json.dumps(data))
        print(response.text)

    files = glob(os.path.join(articles_path, "*.txt"))

    with requests.Session() as s:
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                article_text = f.read()
            fetch(s, url, article_text)
Profiling results:
+--------+---------+----------+---------+----------+-------------------------------------------------------+
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
+--------+---------+----------+---------+----------+-------------------------------------------------------+
| 849 | 145.6 | 0.1715 | 145.6 | 0.1715 | ~:0(<method 'recv_into' of '_socket.socket' objects>) |
| 2 | 1.001 | 0.5007 | 1.001 | 0.5007 | ~:0(<method 'connect' of '_socket.socket' objects>) |
| 365 | 0.772 | 0.002115 | 1.001 | 0.002742 | ~:0(<built-in method builtins.print>) |
+--------+---------+----------+---------+----------+-------------------------------------------------------+
(Wannabe) asynchronous code (327 seconds)
    import asyncio
    import json
    import os
    from glob import glob

    from aiohttp import ClientSession

    async def fetch(session, url, article_text):
        data = {"package": package, "data": template.format(article_text)}
        async with session.post(url, data=json.dumps(data)) as response:
            return await response.text()

    async def process_files(articles_path):
        tasks = []
        async with ClientSession() as session:
            files = glob(os.path.join(articles_path, "*.txt"))
            for file in files:
                with open(file, "r", encoding="utf-8") as f:
                    article_text = f.read()
                task = asyncio.ensure_future(fetch(session=session,
                                                   url=url,
                                                   article_text=article_text
                                                   ))
                tasks.append(task)
                responses = await asyncio.gather(*tasks)
                print(responses)

    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(process_files(articles_path))
    loop.run_until_complete(future)
Profiling results:
+--------+---------+---------+---------+---------+-----------------------------------------------+
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
+--------+---------+---------+---------+---------+-----------------------------------------------+
| 2278 | 156 | 0.06849 | 156 | 0.06849 | ~:0(<built-in method select.select>) |
| 365 | 128.3 | 0.3516 | 168.9 | 0.4626 | ~:0(<built-in method builtins.print>) |
| 730 | 40.54 | 0.05553 | 40.54 | 0.05553 | ~:0(<built-in method _codecs.charmap_encode>) |
+--------+---------+---------+---------+---------+-----------------------------------------------+
I am obviously missing something about this concept. Could someone also help me understand why print takes so much time in the async version (see the profiling results)?
Best answer
Because it isn't asynchronous :)
Look at your code: you do responses = await asyncio.gather(*tasks) for every file, so you are essentially running the fetches synchronously, while paying the overhead of all the coroutine machinery each time.
I suspect it is just an indentation mistake: if you de-indent responses = await asyncio.gather(*tasks) so that it sits outside the for file in files loop, you will actually start the tasks in parallel.
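To see why the placement of await asyncio.gather(...) matters, here is a minimal, self-contained sketch that times both patterns. It uses asyncio.sleep as a hypothetical stand-in for the real POST request (no server or aiohttp needed):

```python
import asyncio
import time

async def fake_fetch(i):
    # Hypothetical stand-in for the real POST request: just sleeps 0.1 s.
    await asyncio.sleep(0.1)
    return i

async def serial():
    # Buggy pattern: awaiting gather inside the loop waits for each
    # request to finish before starting the next one.
    responses = []
    for i in range(5):
        responses += await asyncio.gather(fake_fetch(i))
    return responses

async def concurrent():
    # Fixed pattern: create all tasks first, gather once after the loop.
    tasks = [asyncio.ensure_future(fake_fetch(i)) for i in range(5)]
    return await asyncio.gather(*tasks)

start = time.perf_counter()
serial_responses = asyncio.run(serial())
serial_time = time.perf_counter() - start

start = time.perf_counter()
concurrent_responses = asyncio.run(concurrent())
concurrent_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

On a typical run the serial version takes roughly 0.5 s (5 × 0.1 s) while the concurrent one takes about 0.1 s; the same effect, scaled up across hundreds of files, would explain the 327-second runtime above.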
Regarding "python - async slower than sync", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50003803/