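I built a scraper that works, except that it never captures the last page. The URL does not change between pages, so I run the scraper in an infinite loop.
I set the loop to break when the next button can no longer be clicked (on the last page), but the script seems to end before appending the last page of results.
How do I append the last page to the list?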

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import itertools

url = "https://example.com"
driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get(url)
inputElement = driver.find_element_by_id("txtBusinessName")
inputElement.send_keys("ship")
inputElement.send_keys(Keys.ENTER)
df2 = pd.DataFrame()

for i in itertools.count():
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "grid_businessList")))
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', id="grid_businessList")
    rows = table.findAll("tr")
    columns = [v.text.replace('\xa0', ' ') for v in rows[0].find_all('th')]
    df = pd.DataFrame(columns=columns)
    for i in range(1, len(rows)):
        tds = rows[i].find_all('td')
        if len(tds) == 5:
            values = [tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text, tds[5].text]
        else:
            values = [td.text for td in tds]
        df = df.append(pd.Series(values, index=columns), ignore_index=True)
    try:
        next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
        driver.execute_script("arguments[0].click();", next_button)
        sleep(5)
    except NoSuchElementException:
        break
    df2 = df2.append(df)

df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)

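Best answer
The problem is that the except clause breaks out of the loop before the last page is appended.
What you can do is add a finally clause to the try/except statement. The code in the finally block always runs, even when the except clause executes break; see https://docs.python.org/3/tutorial/errors.html#defining-clean-up-actions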
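To see this behaviour without a browser, here is a minimal sketch. The function name `collect_pages`, the list of page strings, and the use of `LookupError` as a stand-in for `NoSuchElementException` are all assumptions made for illustration; the list is assumed to be non-empty.

```python
def collect_pages(pages):
    """Collect every page, including the last one, using try/finally.

    `pages` is a hypothetical non-empty list standing in for the site.
    """
    collected = []
    index = 0
    while True:
        page = pages[index]  # stands in for scraping the current page
        try:
            if index + 1 >= len(pages):
                # stands in for NoSuchElementException on the last page
                raise LookupError("no next button")
            index += 1  # stands in for clicking the next button
        except LookupError:
            break  # the finally block below still runs before the loop exits
        finally:
            collected.append(page)  # runs on every iteration, including the last
    return collected


print(collect_pages(["page 1", "page 2", "page 3"]))
# → ['page 1', 'page 2', 'page 3']
```

The last page is appended because Python executes the finally block on the way out of the try statement, before the break takes effect.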
Your code can be rewritten as:
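
    # Inside the `for i in itertools.count():` loop, after the page has been parsed into df
    try:
        next_button = driver.find_element_by_css_selector("li.next:nth-child(9) > a:nth-child(1)")
        driver.execute_script("arguments[0].click();", next_button)
        sleep(5)
    except NoSuchElementException:
        break
    finally:
        # Runs on every iteration, including the last one that hits break
        df2 = df2.append(df)

# After the loop
df2.to_csv(r'/home/user/Documents/test/' + 'gasostest.csv', index=False)
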
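An alternative that avoids finally altogether is to save the current page first and only then try to move on. The sketch below shows that ordering with a hypothetical `FakePager` class in place of the Selenium driver; `read_page`, `go_to_next`, and `scrape_all` are illustrative names, not part of Selenium or pandas.

```python
class FakePager:
    """Hypothetical stand-in for the browser: serves pages one at a time."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_page(self):
        """Return the content of the current page."""
        return self.pages[self.index]

    def go_to_next(self):
        """Return True if a next page existed and was opened, else False."""
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


def scrape_all(pager):
    """Append each page before attempting to navigate, so no page is lost."""
    results = []
    while True:
        results.append(pager.read_page())  # save the current page first
        if not pager.go_to_next():         # then try to move on
            break
    return results


print(scrape_all(FakePager(["page 1", "page 2", "page 3"])))
# → ['page 1', 'page 2', 'page 3']
```

In the original script this would amount to moving `df2 = df2.append(df)` above the try statement. Note also that `DataFrame.append` was removed in pandas 2.0, so on current versions the usual replacement is to collect each page's DataFrame in a list and call `pd.concat` once after the loop.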
Source: the Stack Overflow question "Python: How to break out of a loop and append the last page of results?": https://stackoverflow.com/questions/54928329/