python - How do I make a web page load content programmatically, the way it does when I scroll down manually?

I want to scrape some news article links from this website. My code so far looks like this:

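from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base = "https://www.philstar.com/business/"
page = requests.get(base)
soup = BeautifulSoup(page.text, "html.parser")

# Select anchor tags that actually carry an href attribute
li_box = soup.find_all("a", href=True)

# Write one absolute URL per line; the with-block closes the file
with open("News article links.txt", "w") as links:
    for a in li_box:
        links.write(urljoin(base, a["href"]) + "\n")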

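The problem is that it only finds the 15 or 16 links shown on the landing page. If you scroll manually to the bottom of the page, it loads more news items. Scroll further and it loads more, and so on. My code cannot do this "scroll down to see more" part. How can I scrape all of the news links (or, say, the first 1000)?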

Best answer

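You need Selenium for this. requests only downloads the initial HTML and does not run the JavaScript that loads more items as you scroll, so the page has to be driven in a real browser. I have modified your code slightly to give you an idea of how to do it.

Try this: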

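from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Selenium 4.6+ can locate a matching chromedriver on its own. If it cannot,
# pass the driver path explicitly
# (needs: from selenium.webdriver.chrome.service import Service):
#   browser = webdriver.Chrome(service=Service('--path--'))
browser = webdriver.Chrome()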

base = "https://www.philstar.com/business/"

browser.get(base)

# Auto-scroll the page until no more content loads
SCROLL_PAUSE_TIME = 0.5  # seconds; increase if new items load slowly

# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html_source = browser.page_source
browser.quit()  # close the browser once the fully loaded HTML is captured
soup = BeautifulSoup(html_source, "html.parser")

li_box = soup.find_all('a')     # replace with whatever you want to find
print(li_box)
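
The question asked for roughly the first 1000 links, and on a feed that never ends the loop above would keep scrolling indefinitely. The sketch below is one possible way to bound it and is not part of the original answer: `scroll_until_stable`, `collect_links`, and the parameters `max_rounds`, `enough`, and `limit` are hypothetical names introduced here for illustration. It assumes only Selenium's `execute_script` method and the standard library's `urljoin`.

```python
import time
from urllib.parse import urljoin


def scroll_until_stable(driver, pause=1.0, max_rounds=50, enough=None):
    """Scroll to the bottom until the page height stops growing.

    driver     -- any object exposing Selenium's execute_script(script)
    pause      -- seconds to wait after each scroll for new items to load
    max_rounds -- hard cap so an endless feed cannot loop forever
    enough     -- optional zero-argument callable; stop once it returns True
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by this scroll
        last_height = new_height
        if enough is not None and enough():
            return rounds  # caller has as much content as it wants
    return max_rounds


def collect_links(hrefs, base, limit=1000):
    """Resolve hrefs against base, drop duplicates, keep at most `limit`."""
    seen = set()
    links = []
    for href in hrefs:
        if len(links) >= limit:
            break
        url = urljoin(base, href)
        # Skip non-web schemes such as javascript: or mailto:
        if url.startswith("http") and url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

With the `browser`, `soup`, and `base` from the answer, a call might look like `scroll_until_stable(browser, pause=1.0)` before reading `browser.page_source`, followed by `collect_links((a["href"] for a in soup.find_all("a", href=True)), base)`. A longer `pause` trades speed for a lower chance of stopping early on a slow connection.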

Hope this helps! :) Thank you!

The original question and answer can be found on Stack Overflow: https://stackoverflow.com/questions/51076355/
