python - Trying to gather data from local files using BeautifulSoup

I want to run a Python script that parses HTML files and collects a list of all links that have the target="_blank" attribute.

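I tried the following, but I get nothing back from bs4. The documentation says SoupStrainer takes arguments in the same way as findAll and friends, so shouldn't this work? Am I missing some silly mistake?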

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path

def main():

    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")

    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link

if __name__ == "__main__":
    sys.exit(main())

Best answer

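I think you need something like this: open the file and pass the file handle to BeautifulSoup. Given a plain string, BeautifulSoup treats it as markup, so passing the path parses the path text itself rather than the file's contents.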

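if path.endswith(".html"):
    # Open the file itself (path), not the directory (dirpath),
    # and close it again once parsing is done.
    with open(path) as htmlfile:
        for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
            print(link)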
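
For reference, the same idea can be written as a self-contained Python 3 sketch. The helper names `blank_target_links` and `collect_links` are hypothetical, the explicit `"html.parser"` builder and UTF-8 encoding are assumptions, and plain `os.path` stands in for unipath so the example has no extra dependency. It uses `find_all` on the strained soup rather than iterating over it directly, so that only matching tags are returned.

```python
import os
import tempfile

from bs4 import BeautifulSoup, SoupStrainer


def blank_target_links(html_path):
    """Return every tag with target="_blank" found in one HTML file."""
    only_blank = SoupStrainer(target="_blank")
    # Pass an open file handle; a bare path string would be parsed as markup.
    with open(html_path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, "html.parser", parse_only=only_blank)
    return soup.find_all(target="_blank")


def collect_links(template_dir):
    """Walk template_dir and collect (path, href) pairs from .html files."""
    found = []
    for dirpath, _dirs, files in os.walk(template_dir):
        for name in sorted(files):
            if name.endswith(".html"):
                path = os.path.join(dirpath, name)
                for tag in blank_target_links(path):
                    found.append((path, tag.get("href")))
    return found


if __name__ == "__main__":
    # Small demonstration against a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "index.html")
        with open(sample, "w", encoding="utf-8") as out:
            out.write('<a href="/docs" target="_blank">Docs</a><a href="/home">Home</a>')
        for path, href in collect_links(tmp):
            print(path, href)
```

In the original script, `collect_links` would be called with the `templatedir` path in place of the temporary directory.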

Regarding python - Trying to gather data from local files using BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17574119/
