python - Selenium/BeautifulSoup - Python - Looping through multiple pages

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I've spent most of the day researching and testing the best way to loop through a set of products on a retailer's website.

While I can successfully collect a set of products (and their attributes) on the first page, I've been struggling to find the best way to loop through the site's pages to continue my scrape.

Based on the code below, I tried using a "while" loop and Selenium to click the site's "next page" button and then keep collecting products.

The problem is that my code still never gets past page 1.

Am I making a silly mistake here? I've read 4 or 5 similar examples on this site, but none were specific enough to provide a solution here.

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

Best answer

You need to re-parse every time you "click" to the next page. So you need to include the parsing inside the while loop; otherwise you will keep iterating over the first page even though it clicks through to the next one, because the prod_containers object never changes.
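To illustrate: a BeautifulSoup object is just a static parse of the string it was built from, so it never reflects pages the browser navigates to afterwards. A minimal, self-contained sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Stand-ins for driver.page_source on page 1 and page 2:
page1_html = "<ul><li class='products_grid'>Shirt A</li></ul>"
page2_html = "<ul><li class='products_grid'>Shirt B</li></ul>"

soup = BeautifulSoup(page1_html, 'html.parser')
prod_containers = soup.find_all('li', class_='products_grid')
print(prod_containers[0].text)   # Shirt A

# After "navigating" to page 2, the old result set is unchanged;
# a fresh soup must be built from the new page source:
soup = BeautifulSoup(page2_html, 'html.parser')
print(soup.find_all('li', class_='products_grid')[0].text)   # Shirt B
```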

Second, the way you have it, your while loop will never stop because you set pageCounter = 0 but never increment it... it will always be < your maxPageCount.

I fixed those two things in your code and ran it, and it appears to have worked, parsing pages 1 through 5.

from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
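Once the loop finishes, the four parallel lists line up index by index, so they can be stitched into one record per product. A minimal sketch, using made-up sample values in place of the scraped data:

```python
# Hypothetical sample values standing in for the lists the loop fills:
products = ['Shirt A', 'Shirt B']
hyperlinks = ['/p/shirt-a', '/p/shirt-b']
reviewCounts = ['(12)', '(34)']
starRatings = ['4.5 out of 5', '3.9 out of 5']

# Zip the parallel lists into one dict per product:
rows = [
    {'name': n, 'url': u, 'reviews': r, 'stars': s}
    for n, u, r, s in zip(products, hyperlinks, reviewCounts, starRatings)
]
print(rows[0])
```

In practice it also helps to add an explicit wait (e.g. Selenium's WebDriverWait with an expected condition) after the click and before re-parsing, so the next page has actually finished loading.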

Regarding python - Selenium/BeautifulSoup - Python - Looping through multiple pages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53965295/
