I'm building a scraper for eBay. I'm trying to figure out a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you're on page 2, the page-number portion looks like "_pgn=2"). I noticed that if you enter a number larger than the listing's maximum number of pages, the page reloads to the last page instead of giving an error like "page doesn't exist" (if a listing has 5 pages, then _pgn=5 and _pgn=100 both route to the same last page). How can I implement a way to start from the first page, get the HTML soup of the page, pull all the relevant data I want from the soup, then load the next page with the new page number and start the process again until there are no new pages left to scrape? I tried using a selenium XPath and math.ceil on the quotient of the result count and 50 (the default maximum listings per page), and using that quotient as my max_page, but I get an error saying the element doesn't exist even though it does. self.driver.findxpath('xpath').text is what I tried; 243 is the result count I was trying to get via the XPath.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
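As an aside on the URL manipulation the question describes: the `_pgn` query parameter can be rewritten with the standard library instead of string concatenation, which avoids separator bookkeeping. A minimal sketch (the example URL is illustrative, not a live search):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_page(url, page):
    """Return `url` with its _pgn query parameter set to `page`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["_pgn"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

url = "https://www.ebay.com/sch/i.html?_nkw=laptop&_pgn=1"
print(set_page(url, 2))
# https://www.ebay.com/sch/i.html?_nkw=laptop&_pgn=2
```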
Best Answer
If you want to iterate over all pages and collect all results, then your script needs to check whether there is a next
page after it visits each page.
import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        # _ipg=200 means that we expect 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # this will loop while there are more pages, otherwise end
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    instance.iterate_page()
The important functions here are page_has_next and iterate_page.

page_has_next - a function that checks whether the page's pagination has another li element next to the selected one. For example, in < 1 2 3 >, if we are on page 1, it checks whether a 2 comes after it - something like that.

iterate_page - a function that loops until there is no next page.
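The next-sibling check can be exercised against static markup without hitting eBay at all. A minimal sketch, reusing the class names from the answer on an illustrative pagination snippet (eBay's real markup may differ):

```python
from bs4 import BeautifulSoup

# Toy pagination markup mimicking the class names used in the answer;
# the structure here is illustrative, not copied from a live eBay page.
html = """
<ol class="x-pagination__ol">
  <li class="x-pagination__li--selected"><a>1</a></li>
  <li class="x-pagination__li"><a>2</a></li>
  <li class="x-pagination__li"><a>3</a></li>
</ol>
"""

def page_has_next(soup):
    container = soup.find('ol', 'x-pagination__ol')
    current = container.find('li', 'x-pagination__li--selected')
    # find_next_sibling('li') skips the whitespace text nodes that a
    # plain .next_sibling would return between <li> elements
    return current.find_next_sibling('li') is not None

soup = BeautifulSoup(html, 'html.parser')
print(page_has_next(soup))  # True: a page follows the selected page 1
```

Using `find_next_sibling('li')` is slightly more robust than the bare `next_sibling` in the answer, because `next_sibling` can return a whitespace `NavigableString` between the `<li>` tags rather than the next element.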
Also note that you don't need selenium unless you need to mimic user clicks or need a browser for navigation.
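For completeness, the asker's math.ceil idea is also sound once the result count has been parsed (which can be done with requests/BeautifulSoup rather than selenium). A sketch with the count hard-coded to 243, the value mentioned in the question:

```python
import math

# result_count would normally be parsed from the "243 results" text on
# the search page; it is hard-coded here purely for illustration.
result_count = 243
items_per_page = 50  # eBay's default number of listings per page

max_page = math.ceil(result_count / items_per_page)
print(max_page)  # 5
```

The next-page check in the answer is still preferable, since it keeps working if eBay's result count or page size changes mid-scrape.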
Regarding python - How to iterate over pages in eBay: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59744646/