Python - Problem creating a list of URLs with BeautifulSoup

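I am trying to write a Python crawler using BeautifulSoup, but I get an error saying that I am trying to write a non-string or other character buffer type to a file. Looking at the program output, I found that my list contains many None items. Besides None, the list also holds a lot of images and other things that are not links, such as image links. How can I add only URLs to my list?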

    import urllib
    from BeautifulSoup import *

    try:
        with open('url_file', 'r') as f:
            url_list = [line.rstrip('\n') for line in f]
            f.close()
        with open('old_file', 'r') as x:
            old_list = [line.rstrip('\n') for line in f]
            f.close()
    except:
        url_list = list()
        old_list = list()
        #for Testing
        url_list.append("http://www.dinamalar.com/")


    count = 0


    for item in url_list:
        try:
            count = count + 1
            if count > 5:
                break

            html = urllib.urlopen(item).read()
            soup = BeautifulSoup(html)
            tags = soup('a')

            for tag in tags:

                if tag in old_list:
                    continue
                else:
                    url_list.append(tag.get('href', None))


            old_list.append(item)
            #for testing
            print url_list
        except:
            continue

    with open('url_file', 'w') as f:
        for s in url_list:
            f.write(s)
            f.write('\n')


    with open('old_file', 'w') as f:
        for s in old_list:
            f.write(s)

Best answer

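First, use bs4 rather than BeautifulSoup3, which is no longer maintained. Your error occurs because not all anchors have an href, so you end up trying to write None to the file. Use find_all with href=True so that you only find anchor tags that have an href attribute: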

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a", href=True)
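
To make the effect of href=True concrete, here is a minimal sketch run against a made-up HTML snippet. The markup and the example.com URLs are assumptions for illustration only, and the code is written for Python 3 with bs4 installed, unlike the Python 2 code in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a normal link, an anchor without href, and an image link.
html = """
<a href="http://example.com/news">News</a>
<a name="top">No href here</a>
<a href="http://example.com/logo.png"><img src="logo.png"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Without the filter, tag.get("href") yields None for the anchor that has no href.
all_hrefs = [tag.get("href") for tag in soup.find_all("a")]

# With href=True, anchors lacking the attribute are never returned.
only_hrefs = [tag["href"] for tag in soup.find_all("a", href=True)]

print(all_hrefs)
print(only_hrefs)
```

The None entry in the first list is the kind of value that makes f.write(s) fail in the question's code. The second list contains strings only, although it still includes the image link.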

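Also, don't use blanket except statements. Always catch the errors you expect, and at least print them when they happen. As for "I also have a lot of images and things that are not links": if you want to filter for certain links, you have to be more specific. Either look for the tags that contain what you want, if that is possible, or use a regular expression, href=re.compile("some_pattern"), or use a css selector: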

    # hrefs starting with something
    "a[href^=something]"

    # hrefs that contain something
    "a[href*=something]"

    # hrefs ending with something
    "a[href$=something]"

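As a rough sketch of how these filters could be applied, the snippet below runs two selectors and one regular expression over a made-up HTML fragment. The markup, the URLs, and the chosen patterns are assumptions for illustration. The attribute values are quoted inside the selectors because they contain characters such as ":" and "/".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<a href="http://example.com/story/1.html">Story</a>
<a href="http://example.com/img/logo.png">Logo</a>
<a href="/about.html">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: hrefs that start with "http://".
absolute = [a["href"] for a in soup.select('a[href^="http://"]')]

# CSS selector: hrefs that end with ".png" (for example, to spot image links).
images = [a["href"] for a in soup.select('a[href$=".png"]')]

# Regular expression: keep only hrefs that end with ".html".
pages = [a["href"] for a in soup.find_all("a", href=re.compile(r"\.html$"))]

print(absolute)
print(images)
print(pages)
```
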
Only you know the structure of the html and what you want from it, so which approach you use is entirely up to you.
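
On the earlier point about avoiding a blanket except, one possible shape for the download step is sketched below. It uses Python 3's urllib.request rather than the Python 2 urllib.urlopen from the question. The helper name fetch and the 10 second timeout are assumptions made for this example, not part of the original answer.

```python
import urllib.request
from urllib.error import URLError


def fetch(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except (URLError, ValueError) as exc:
        # URLError (and its subclass HTTPError) covers failed requests;
        # ValueError is raised for strings that are not valid URLs.
        print("skipping {}: {}".format(url, exc))
        return None
```

Any other exception still propagates, so an unexpected bug shows up as a traceback instead of being silently skipped.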

This question, "Python - Problem creating a list of URLs with BeautifulSoup", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/39152827/
