I'm building a scraper for eBay. I'm trying to figure out a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you're on page 2, the page-number portion looks like "_pgn=2"). I noticed that if you enter a number larger than the listing's maximum number of pages, the page reloads to the last page instead of giving an error like "page doesn't exist" (if a listing has 5 pages, then _pgn=5 and _pgn=100 both route to the same last page). How can I implement a way to start from the first page, get the HTML soup of the page, pull all the relevant data I want from the soup, then load the next page with the new page number and start the process again until there are no new pages left to scrape? I tried using a selenium XPath and math.ceil on the quotient of the result count and 50 (the default maximum listings per page), and using that quotient as my max_page, but I get an error saying the element doesn't exist even though it does. self.driver.findxpath('xpath').text is what I tried; 243 is the result count I was trying to get via the XPath.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
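As an aside on the URL manipulation the question describes: the `_pgn` query parameter can be rewritten with the standard library instead of string concatenation, which avoids separator bookkeeping. A minimal sketch (the example URL is illustrative, not a live search):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_page(url, page):
    """Return `url` with its _pgn query parameter set to `page`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["_pgn"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

url = "https://www.ebay.com/sch/i.html?_nkw=laptop&_pgn=1"
print(set_page(url, 2))
# https://www.ebay.com/sch/i.html?_nkw=laptop&_pgn=2
```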
Best Answer
If you want to iterate over all pages and collect all results, then your script needs to check whether there is a next
page after it visits each page.
import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        # _ipg=200 means that we expect 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # this will loop while there are more pages, otherwise end
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    instance.iterate_page()
The important functions here are page_has_next and iterate_page.

page_has_next - a function that checks whether the page's pagination has another li element next to the selected one. For example, in < 1 2 3 >, if we are on page 1, it checks whether a 2 comes after it - something like that.

iterate_page - a function that loops until there is no next page.
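The next-sibling check can be exercised against static markup without hitting eBay at all. A minimal sketch, reusing the class names from the answer on an illustrative pagination snippet (eBay's real markup may differ):

```python
from bs4 import BeautifulSoup

# Toy pagination markup mimicking the class names used in the answer;
# the structure here is illustrative, not copied from a live eBay page.
html = """
<ol class="x-pagination__ol">
  <li class="x-pagination__li--selected"><a>1</a></li>
  <li class="x-pagination__li"><a>2</a></li>
  <li class="x-pagination__li"><a>3</a></li>
</ol>
"""

def page_has_next(soup):
    container = soup.find('ol', 'x-pagination__ol')
    current = container.find('li', 'x-pagination__li--selected')
    # find_next_sibling('li') skips the whitespace text nodes that a
    # plain .next_sibling would return between <li> elements
    return current.find_next_sibling('li') is not None

soup = BeautifulSoup(html, 'html.parser')
print(page_has_next(soup))  # True: a page follows the selected page 1
```

Using `find_next_sibling('li')` is slightly more robust than the bare `next_sibling` in the answer, because `next_sibling` can return a whitespace `NavigableString` between the `<li>` tags rather than the next element.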
Also note that you don't need selenium unless you need to mimic user clicks or need a browser for navigation.
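For completeness, the asker's math.ceil idea is also sound once the result count has been parsed (which can be done with requests/BeautifulSoup rather than selenium). A sketch with the count hard-coded to 243, the value mentioned in the question:

```python
import math

# result_count would normally be parsed from the "243 results" text on
# the search page; it is hard-coded here purely for illustration.
result_count = 243
items_per_page = 50  # eBay's default number of listings per page

max_page = math.ceil(result_count / items_per_page)
print(max_page)  # 5
```

The next-page check in the answer is still preferable, since it keeps working if eBay's result count or page size changes mid-scrape.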
Regarding python - How to iterate over pages in eBay: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59744646/