I have an htm file of about 15MB, and I want to extract the names from a table of values in it. Since my knowledge of Python is fairly limited, this is the best solution I have found, but it is quite slow. Is there any way to parse the data faster, or to make the code run faster overall?
I have already switched the parser to lxml but did not see much improvement. What I would like is for BeautifulSoup to continue searching from the point where the last entry was found, instead of starting over each time, but I don't know how to do that.
from bs4 import BeautifulSoup
from re import search

def stew(fp):
    with open(fp, encoding="utf8") as f:
        soup = BeautifulSoup(f, features="lxml")
    return soup

def name_crawler(soup):
    i = 2
    while i < 6853:
        # select the first link in the first cell of the i-th table row
        tasty = soup.select("tr.cItem:nth-child(" + str(i) + ") > td:nth-child(1) > a:nth-child(1)")
        tastier = search('target="_blank">(.*)</a>', str(tasty))
        with open("database.json", "a+") as f:
            f.write(tastier.group(1) + "\n")
        i = i + 1
        print(" [+] Entry added for " + tastier.group(1))
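For reference, the per-row `nth-child` lookups (each of which re-scans the whole tree) can be avoided entirely by streaming the file once. A minimal sketch with the standard library's `html.parser`, independent of BeautifulSoup, using made-up markup in the same shape as the rows above:

```python
from html.parser import HTMLParser

class LinkText(HTMLParser):
    """Collect the text of every <a target="_blank"> as the file streams past."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("target") == "_blank":
            self.in_link = True

    def handle_data(self, data):
        if self.in_link:
            self.names.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

parser = LinkText()
# hypothetical one-row table mirroring the document's structure
parser.feed('<table><tr class="cItem"><td><a target="_blank">AK-47</a></td></tr></table>')
print(parser.names)  # ['AK-47']
```

This makes a single pass over the 15MB file instead of one full-tree query per row.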
Best answer
Slightly improving the code made it run a bit faster, but switching to PyPy made it run about 10x faster:
import pickle
from re import search
from bs4 import BeautifulSoup

def stew(fp):
    with open(fp, encoding="utf8") as f:
        soup = BeautifulSoup(f, features="lxml")
    return soup

def sauce(soup):
    i = 2
    tastiest = []
    while i < 6853:
        tasty = soup.select("tr.cItem:nth-child(" + str(i) + ") > td:nth-child(1) > a:nth-child(1)")
        tastier = search('target="_blank">(.*)</a>', str(tasty))
        i = i + 1
        tastiest.append(tastier.group(1))
        print(" [+] Entry added for " + tastier.group(1))
    # write the list once at the end instead of reopening the file per entry
    with open("database_weapons.obj", "ab+") as f:
        pickle.dump(tastiest, f)
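One thing to be aware of with the `"ab+"` mode used here: each run appends another complete pickle to the same file, so reading the data back takes one `pickle.load` per dump, not a single load. A small stdlib sketch (using an in-memory buffer in place of the file):

```python
import io
import pickle

buf = io.BytesIO()
# two separate dumps into one stream, as repeated "ab+" appends would produce
pickle.dump(["AK-47"], buf)
pickle.dump(["M4A4"], buf)

buf.seek(0)
batches = []
while True:
    try:
        # each load consumes exactly one appended pickle
        batches.append(pickle.load(buf))
    except EOFError:
        break
print(batches)  # [['AK-47'], ['M4A4']]
```

If a single flat list is wanted instead, overwrite with `"wb"` or merge the batches after loading.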
Edit: after reading the BS docs, I used a SoupStrainer so that only the a tags are parsed, and now it runs much faster:
import pickle
from re import findall
from bs4 import BeautifulSoup, SoupStrainer

def stew(fp):
    # only parse <a> tags; everything else is skipped during parsing
    tags = SoupStrainer("a")
    with open(fp, encoding="utf8") as f:
        soup = BeautifulSoup(f, features="lxml", parse_only=tags).prettify()
    return soup

def sauce(soup):
    tastiest = []
    # run the regex once over the whole document instead of once per entry,
    # skipping the first match as the index-1 start did before
    for tastier in findall('730/(.*)" target="_blank', str(soup))[1:]:
        tastiest.append(tastier)
        print(" [+] Entry added for " + tastier)
    with open("database_weapons.obj", "ab+") as f:
        pickle.dump(tastiest, f)
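Since this final version only runs a regular expression over the page text, the BeautifulSoup pass could in principle be dropped entirely: precompile the pattern and scan the raw file contents. A sketch with made-up markup (the `730/` prefix and URL shape are assumptions based on the pattern in the code); note the non-greedy `.*?`, which keeps a single greedy match from swallowing several links on one line:

```python
import re

# hypothetical snippet in the shape the pattern above expects
html = ('<a href="/listings/730/AK-47" target="_blank">AK-47</a>'
        '<a href="/listings/730/M4A4" target="_blank">M4A4</a>')
pattern = re.compile(r'730/(.*?)" target="_blank')
names = [m.group(1) for m in pattern.finditer(html)]
print(names)  # ['AK-47', 'M4A4']
```

The trade-off is fragility: a regex over raw HTML breaks if the markup changes, which is exactly what the parser was protecting against.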
On html - Improving BeautifulSoup parsing speed, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57862616/