python-3.x - Scraping pagination with Python and Beautiful Soup

I am new to Python and new to web scraping...

I am trying to collect the links to all the pages of this listing:

http://www.pour-les-personnes-agees.gouv.fr/annuaire-ehpad-en-hebergement-permanent/64/0

This is tricky: I see "active first" and "next last" in the HTML code.

I wrote some Python code, but it only works for 4 pages (pages 2, 3, 4 and 11):

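import requests
from bs4 import BeautifulSoup

# Scheme and host, prepended to the relative hrefs found in the pagination block.
root_url = "http://www.pour-les-personnes-agees.gouv.fr"
url_pagination = "http://www.pour-les-personnes-agees.gouv.fr/annuaire-ehpad-en-hebergement-permanent/64/0"
dept_page_Url = []

r = requests.get(url_pagination)
soup = BeautifulSoup(r.content, "html.parser")
pagination = soup.find_all("ul", {"class": "pagination"})

if len(pagination) == 0:
    # No pagination block: the listing fits on a single page.
    dept_page_Url.append(url_pagination)
else:
    # Collect every link that appears inside the pagination block.
    for page_url_list in pagination:
        for page_url in page_url_list.find_all("a"):
            dept_page_Url.append(root_url + page_url.get('href'))

print(dept_page_Url)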

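Actually, I know why I only get 4 pages: I only collect the "href" values that are present in the HTML. But I don't know how to improve my code.

Any pointers, such as a web page with some information that could help me, or someone who knows how to do this?

Thanks a lot.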

Best answer

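The pagination block only exposes 4 links (pages 2-4 and the last page), so you cannot get every page link directly from the HTML document.
However, you can read the number of pages from the last-page link and use range to build all the page URLs.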

import requests
from bs4 import BeautifulSoup

url_pagination= "http://www.pour-les-personnes-agees.gouv.fr/annuaire-ehpad-en-hebergement-permanent/64/0"
r = requests.get(url_pagination)
soup = BeautifulSoup(r.content, "html.parser")

page_url = "http://www.pour-les-personnes-agees.gouv.fr/annuaire-ehpad-en-hebergement-permanent/64/0?page={}"
last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
#last_page = soup.select_one('ul.pagination li.next a')['href'].split('=')[1] # with css selectors
dept_page_url = [page_url.format(i) for i in range(1, int(last_page)+1)]

print(dept_page_url)

soup.find('ul', class_='pagination').find('li', class_='next').a['href'] finds the first 'ul.pagination', then the 'li.next' inside it, then its 'a', and then selects the 'href' attribute.
The result is: "/annuaire-ehpad-en-hebergement-permanent/64/0?page=11".

.split('=') splits the string on "=" into a list of 2 items, and .split('=')[1] selects the second item, "11", so last_page = '11'.
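As a side note, split('=')[1] relies on the href containing exactly one "=". A minimal sketch of a more tolerant alternative, using the standard library's urllib.parse, is shown below. The helper name last_page_from_href and the second href with an extra "sort" parameter are hypothetical, made up for illustration; only the first href comes from the answer above.

```python
from urllib.parse import urlparse, parse_qs


def last_page_from_href(href, param="page"):
    """Return the integer value of the `param` query parameter in href.

    Hypothetical helper, not part of the original answer.
    """
    # urlparse separates the query string from the path;
    # parse_qs turns "page=11" into {"page": ["11"]}.
    query = parse_qs(urlparse(href).query)
    return int(query[param][0])


# href taken from the answer above
href = "/annuaire-ehpad-en-hebergement-permanent/64/0?page=11"
print(last_page_from_href(href))

# Made-up href with an extra parameter, where split('=')[1] would return "name&page"
print(last_page_from_href("/annuaire/64/0?sort=name&page=11"))
```

If the "page" parameter is missing, this raises a KeyError, which is arguably easier to diagnose than a wrong page count.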

range(1, int(last_page)+1) creates the range of numbers from 1 to 11.

page_url.format(i) substitutes each of these numbers into page_url, so dept_page_url contains 11 URLs.
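The last two steps (range plus formatting) can be tried without any network access. The sketch below wraps them in a hypothetical helper, build_page_urls, which is not part of the original answer. It assumes, like the answer, that pages are numbered from 1; if the site turns out to count from 0, passing first_page=0 would be the adjustment.

```python
def build_page_urls(base_url, last_page, first_page=1):
    """Build one URL per page number, from first_page to last_page inclusive.

    Hypothetical helper; first_page=1 mirrors range(1, int(last_page)+1)
    in the answer and is an assumption about how the site numbers pages.
    """
    return [f"{base_url}?page={i}" for i in range(first_page, last_page + 1)]


base = "http://www.pour-les-personnes-agees.gouv.fr/annuaire-ehpad-en-hebergement-permanent/64/0"
urls = build_page_urls(base, 11)

print(len(urls))   # number of generated URLs
print(urls[0])     # first generated URL
print(urls[-1])    # last generated URL
```

Each URL in the resulting list can then be fetched with requests.get and parsed with BeautifulSoup in the same way as the first page.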

Regarding python-3.x - Scraping pagination with Python and Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47636294/
