html - Improving BeautifulSoup parsing speed

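I have an .htm file of about 15 MB, and I want to extract the names from a table of values in it. My Python knowledge is fairly limited, so this is the best solution I could come up with, but it is quite slow. Is there a faster way to parse the data, or to make the code run faster overall?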

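I have already switched the parser to lxml but did not see much improvement. What I would like is for BeautifulSoup to resume searching from the point of the last entry added, instead of starting over each time, but I don't know how to do that.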

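from re import search

from bs4 import BeautifulSoup


def stew(fp):
    # Parse the whole file into a BeautifulSoup tree using the lxml parser.
    with open(fp, encoding="utf8") as f:
        soup = BeautifulSoup(f, features="lxml")
    return soup


def name_crawler(soup):
    i = 2
    while i < 6853:
        # Each select() call searches the entire tree for the i-th row.
        tasty = soup.select("tr.cItem:nth-child(" + str(i) + ") > td:nth-child(1) > a:nth-child(1)")
        tastier = search('target="_blank">(.*)</a>', str(tasty))
        # The output file is reopened on every iteration.
        with open("database.json", "a+") as f:
            f.write(tastier.group(1) + "\n")
        i = i + 1
        print(" [+] Entry added for " + tastier.group(1))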

Best answer

Tweaking the code made it run slightly faster, but after switching to PyPy it ran about 10 times faster.

import pickle
from re import search

from bs4 import BeautifulSoup


def stew(fp):
    with open(fp, encoding="utf8") as f:
        soup = BeautifulSoup(f, features="lxml")
    return soup


def sauce(soup):
    i = 2
    tastiest = []
    while i < 6853:
        tasty = soup.select("tr.cItem:nth-child(" + str(i) + ") > td:nth-child(1) > a:nth-child(1)")
        tastier = search('target="_blank">(.*)</a>', str(tasty))
        i = i + 1
        # Collect names in memory instead of reopening the file per entry.
        tastiest.append(tastier.group(1))
        print(" [+] Entry added for " + tastier.group(1))
    # Write all names once, after the loop has finished.
    with open("database_weapons.obj", "ab+") as f:
        pickle.dump(tastiest, f)
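
The loop above still issues one select() call per row, and each call searches the whole tree again. As a rough sketch of an alternative (not part of the original answer), a single select() without the row index returns all matching links in one pass. The sample markup and the name collect_names are hypothetical, and the sketch assumes every wanted row is a tr.cItem whose first cell starts with the link. It uses "html.parser" only to avoid an extra dependency; "lxml" can be passed in the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the layout implied by the selector above.
SAMPLE_HTML = """
<table>
  <tr class="cHeader"><td>Name</td></tr>
  <tr class="cItem"><td><a href="#" target="_blank">Alpha</a></td></tr>
  <tr class="cItem"><td><a href="#" target="_blank">Beta</a></td></tr>
</table>
"""


def collect_names(soup):
    # One select() call returns every matching link in document order,
    # so the tree is not searched again for each row index.
    links = soup.select("tr.cItem > td:nth-child(1) > a:nth-child(1)")
    # get_text() reads the link text directly, so no regex over str(tag) is needed.
    return [link.get_text(strip=True) for link in links]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
names = collect_names(soup)
print(names)
```

Because the result is a plain list, the hard-coded upper bound of 6853 is no longer needed; the list simply has as many entries as there are matching rows.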

Edit: after reading the BS documentation, I used SoupStrainer so that only the a tags are parsed, and it now runs much faster.

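import pickle
from re import findall

from bs4 import BeautifulSoup, SoupStrainer


def stew(fp):
    # Only <a> tags are parsed; everything else is skipped.
    tags = SoupStrainer("a")
    with open(fp, encoding="utf8") as f:
        # prettify() returns a string, not a BeautifulSoup object.
        soup = BeautifulSoup(f, features="lxml", parse_only=tags).prettify()
    return soup


def sauce(soup):
    # Run the regex once over the whole text instead of once per entry.
    names = findall('730/(.*)" target="_blank', str(soup))
    # The original loop started at index 1, so the first match is skipped.
    tastiest = names[1:]
    for name in tastiest:
        print(" [+] Entry added for " + name)
    # A bounded loop lets the function reach this point; "while True" never did.
    with open("database_weapons.obj", "ab+") as f:
        pickle.dump(tastiest, f)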
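
As a hedged variation on the same idea (not the original author's code), the parsed tree can be kept instead of being turned into a string with prettify(), and the names read from each link's href attribute. The sample markup, the example.com URLs and the name names_from_links are hypothetical; only the "730/" path segment and target="_blank" come from the regex above. The parser argument defaults to "html.parser" here to avoid an extra dependency, and "lxml" can be passed instead.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup; only "730/" and target="_blank" come from the regex above.
SAMPLE_HTML = """
<div><p>ignored</p>
<a href="https://example.com/listings/730/Alpha" target="_blank">Alpha</a>
<a href="https://example.com/other" target="_self">skip me</a>
<a href="https://example.com/listings/730/Beta" target="_blank">Beta</a>
</div>
"""


def names_from_links(markup, parser="html.parser"):
    # Build tree nodes only for <a> tags that carry target="_blank".
    only_links = SoupStrainer("a", attrs={"target": "_blank"})
    soup = BeautifulSoup(markup, parser, parse_only=only_links)
    names = []
    for link in soup.find_all("a"):
        href = link.get("href", "")
        # Keep whatever follows "730/" in the URL, as the regex capture did.
        _, sep, tail = href.partition("730/")
        if sep:
            names.append(tail)
    return names


print(names_from_links(SAMPLE_HTML))
```

This assumes the href ends right after the name, which is what the regex relied on as well. Note that parse_only is not supported by the html5lib parser.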

Regarding html - Improving BeautifulSoup parsing speed, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57862616/
