python - Pagination navigation using Selenium in Python

Tags: python selenium selenium-webdriver web-scraping

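I'm scraping this website with Python and Selenium. My code works, but at the moment it only scrapes the first page. I'd like to iterate over all the pages and scrape each one, but the site handles pagination in an odd way. How can I go through the pages and scrape them one by one?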

Pagination HTML:

<div class="pagination">
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to first page">First</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to previous page">Prev</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,1" title="Go to page 1">1</a>
    <span class="current">2</span>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to page 3">3</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to page 4">4</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,3" title="Go to next page">Next</a>
    <a href="/PlanningGIS/LLPG/WeeklyList/41826123,4" title="Go to last page">Last</a>
</div>

Scraper:
import re
import json
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, 
executable_path=r'/Users/weaabduljamac/Downloads/chromedriver')

url = 'https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList'
driver.get(url)

def getData():
    data = []
    rows = driver.find_element_by_xpath('//*[@id="form1"]/table/tbody').find_elements_by_tag_name('tr')
    for row in rows:
        app_number = row.find_elements_by_tag_name('td')[1].text
        address = row.find_elements_by_tag_name('td')[2].text
        proposals = row.find_elements_by_tag_name('td')[3].text
        status = row.find_elements_by_tag_name('td')[4].text
        data.append({"CaseRef": app_number, "address": address, "proposals": proposals, "status": status})
    print(data)
    return data


def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options

    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        all_data.extend(getData())
        driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
        driver.get(url)

    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()


if __name__ == "__main__":
    main()

Best answer

First, get the total number of pages from the pagination block, using

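from bs4 import BeautifulSoup

# `ins` is assumed to be the Selenium WebDriver instance (`driver` in the question)
ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,1')
ins.find_element_by_class_name("pagination")  # raises NoSuchElementException if the block is missing
source = BeautifulSoup(ins.page_source, 'html.parser')
div = source.find_all('div', {'class': 'pagination'})
all_as = div[0].find_all('a')
total = 0

# The link just before "Next" carries the highest page number shown
for i in range(len(all_as)):
    if 'Next' in all_as[i].text:
        total = int(all_as[i-1].text)  # convert to int so range() accepts it
        break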
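
As an alternative to scanning for the link before "Next", the page count can probably be read from the "Last" link, whose href in the question's HTML ends in `,4`. The sketch below only assumes that the page number always follows the final comma of the href; `last_page_from_href` is a hypothetical helper name, not part of Selenium or BeautifulSoup.

```python
import re


def last_page_from_href(href):
    """Return the page number that follows the final comma in a pagination href."""
    match = re.search(r",(\d+)$", href)
    if match is None:
        raise ValueError("no page number in href: {!r}".format(href))
    return int(match.group(1))


# href copied from the "Go to last page" link in the question's HTML
print(last_page_from_href("/PlanningGIS/LLPG/WeeklyList/41826123,4"))
```

With a live browser, the href could be obtained with something like `ins.find_element_by_css_selector('div.pagination a[title="Go to last page"]').get_attribute('href')`, assuming the link is present on the current page.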

Now simply iterate over the range of pages
for page in range(1, int(total) + 1):
    ins.get('https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList/10702380,{}'.format(page))
    # scrape the loaded page here, e.g. with getData() from the question
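
The URLs visited by this loop follow the pattern `<base>/<list id>,<page>`, so they can also be built up front and inspected before any browser call is made. This is a minimal sketch; `page_urls` and its parameter names are hypothetical, and the list id `10702380` is simply the one used in the answer.

```python
def page_urls(base_url, list_id, total_pages):
    """Build one URL per page, numbered from 1 to total_pages inclusive."""
    return [
        "{}/{},{}".format(base_url, list_id, page)
        for page in range(1, total_pages + 1)
    ]


urls = page_urls(
    "https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList",
    "10702380",
    3,
)
for url in urls:
    print(url)
```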

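Keep incrementing the page number, fetch each page's source, and then extract its data.
Note: don't forget to sleep when clicking from one page to the next.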
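
One way to combine the loop with the pause is sketched below. It is kept independent of the browser so it can run as-is: `scrape_pages`, `fetch_rows`, and `fake_fetch` are hypothetical names, and the two-second default delay is an arbitrary assumption rather than a value the site requires.

```python
import time


def scrape_pages(urls, fetch_rows, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch_rows(url) for each URL, pausing between page loads."""
    all_rows = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every load except the first
        all_rows.extend(fetch_rows(url))
    return all_rows


# Stand-in fetcher so the sketch runs without a browser.
def fake_fetch(url):
    return [{"url": url}]


rows = scrape_pages(["page-1", "page-2"], fake_fetch, delay_seconds=0)
print(len(rows))
```

With Selenium, `fetch_rows` could be a small function that calls `ins.get(url)` and then returns `getData()` from the question, assuming `ins` and `driver` refer to the same WebDriver. An explicit wait (`WebDriverWait` together with `expected_conditions.presence_of_element_located`) is an alternative to a fixed sleep.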

Regarding python - Pagination navigation using Selenium in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51743859/
