python - How to continuously scrape article links from a page using Selenium in Python

Tags: python python-3.x selenium web web-crawler

I'm trying to crawl Bloomberg.com and collect the links to all English news articles. The problem with the code below is that it does find many articles on the first page, but then it just enters a loop in which it returns nothing, only occasionally completing a single iteration.

from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        # retrieve each href link and save it in url_element
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
            #save news articles
            if 'www.bloomberg.com/news/articles' in url_element:
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)

I've tried using a queue and then a stack, but the behavior is the same. I can't seem to accomplish what I'm after.

How can I crawl a site like this to scrape the news URLs?

Best Answer

The approach you're using should work fine, but after running it myself I noticed a few things that were causing it to hang or throw errors.

I made a few adjustments and added some inline comments to explain my reasoning.

from collections import deque
from selenium.common.exceptions import StaleElementReferenceException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

base = "https://www.bloomberg.com"
article = base + "/news/articles"
visited = set()


# A set discards duplicates automatically and is more efficient for lookups
articles = set()

to_crawl = deque()
to_crawl.append(base)

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    print(input_url)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")

    # this part was the issue: in the original code, links were added
    # to `visited` at the same time they were appended to `to_crawl`,
    # so they were marked as visited before ever actually being crawled
    visited.add(input_url)

    for elem in elems:

        # checks for errors
        try:
            url_element = elem.get_attribute("href")
        except StaleElementReferenceException as err:
            print(err)
            continue

        # checks to make sure links aren't being crawled more than once
        # and that all the links are in the proper domain
        if base in url_element and all(url_element not in i for i in [visited, to_crawl]):

            to_crawl.append(url_element)

            # this checks if the link matches the correct url pattern
            if article in url_element and url_element not in articles:

                articles.add(url_element)
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    
    browser.quit() # guarantees the browser closes completely


while len(to_crawl):
    # popleft makes the deque a FIFO instead of LIFO.
    # A queue would achieve the same thing.
    url_to_crawl = to_crawl.popleft()

    crawl_link(url_to_crawl)
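To illustrate the FIFO-vs-LIFO point from the comment above, here is a small standalone sketch (not part of the original answer; the `page1`/`page2`/`page3` values are placeholders) showing how `popleft()` versus `pop()` changes the crawl order:

```python
from collections import deque

links = ["page1", "page2", "page3"]

# popleft() drains the deque in insertion order -> FIFO,
# which gives a breadth-first crawl
fifo = deque(links)
fifo_order = []
while fifo:
    fifo_order.append(fifo.popleft())
print(fifo_order)  # ['page1', 'page2', 'page3']

# pop() drains it newest-first -> LIFO, the depth-first-style
# order the question's original code was using
lifo = deque(links)
lifo_order = []
while lifo:
    lifo_order.append(lifo.pop())
print(lifo_order)  # ['page3', 'page2', 'page1']
```

For a news site, FIFO means the front page's links are all crawled before any deeper pages, which tends to surface article URLs sooner.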

After running it for a bit over 60 seconds, this is the output in result.txt: https://gist.github.com/alexpdev/b7545970c4e3002b1372e26651301a23
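As a side note, the link-filtering logic above can be pulled out into a pure function, which makes it easy to test without launching a browser. This is just an illustrative sketch (the `classify_url` name is my own, and the `startswith` check is slightly stricter than the substring check in the answer):

```python
BASE = "https://www.bloomberg.com"
ARTICLE = BASE + "/news/articles"

def classify_url(url, visited, queued):
    """Return (should_enqueue, is_article) for a candidate href."""
    # only follow in-domain links that haven't been seen or queued yet
    should_enqueue = (
        url.startswith(BASE) and url not in visited and url not in queued
    )
    # an article is an in-domain, unseen link matching the article prefix
    is_article = should_enqueue and url.startswith(ARTICLE)
    return should_enqueue, is_article

print(classify_url(ARTICLE + "/some-story", set(), set()))  # (True, True)
print(classify_url(BASE + "/markets", set(), set()))        # (True, False)
print(classify_url("https://twitter.com/x", set(), set()))  # (False, False)
```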

Regarding "python - How to continuously scrape article links from a page using Selenium in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72065563/
