Pythonic beautifulSoup4: How to get remaining titles from the next page link of a Wikipedia category

Tags: python python-3.x web-scraping beautifulsoup wikipedia

I have successfully written the code below to get the titles of a Wikipedia category. The category contains more than 404 titles, but my output file only gives 200 titles per page. How can I extend my code to follow the category's "next page" link (and so on) and get all of the titles?

Command: python3 getCATpages.py

Code of getCATpages.py:

from bs4 import BeautifulSoup
import requests
import csv

#getting all the contents of a url
url = 'https://en.wikipedia.org/wiki/Category:Free software'
content = requests.get(url).content
soup = BeautifulSoup(content,'lxml')

#showing the category-pages Summary
catPageSummaryTag = soup.find(id='mw-pages')
catPageSummary = catPageSummaryTag.find('p')
print(catPageSummary.text)

#showing the category pages only
tag = soup.find(id='mw-pages')
links = tag.find_all('a')

# numbering the printed output and limiting it to the first three links
for counter, link in enumerate(links[:3], start=1):
    print('        ' + str(counter) + "  " + link.text)

#getting the category pages
catpages = soup.find(id='mw-pages')
whatlinksherelist = catpages.find_all('li')
things_to_write = []
for item in whatlinksherelist:
    things_to_write.append(item.find('a').get('title'))

#writing the category pages to an output file, one title per line;
#newline='' keeps the csv module from adding blank lines on Windows
with open('001-catPages.csv', 'a', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows([title] for title in things_to_write)

Best Answer

The idea is to keep following the next page until there is no "next page" link left on the page. We maintain a single web-scraping session while making multiple requests, collecting the desired link titles in a list:

from pprint import pprint
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


base_url = 'https://en.wikipedia.org/wiki/Category:Free software'


def get_next_link(soup):
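    # the "next page" link is absent on the last page, so this returns None there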
    return soup.find("a", text="next page")

def extract_links(soup):
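    # grab the title attribute of every page link in the #mw-pages block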
    return [a['title'] for a in soup.select("#mw-pages li a")]


with requests.Session() as session:
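    # a single session reuses the underlying connection across all requests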
    content = session.get(base_url).content
    soup = BeautifulSoup(content, 'lxml')

    links = extract_links(soup)
    next_link = get_next_link(soup)
    while next_link is not None:  # while there is a Next Page link
        url = urljoin(base_url, next_link['href'])
        content = session.get(url).content
        soup = BeautifulSoup(content, 'lxml')

        links += extract_links(soup)

        next_link = get_next_link(soup)

pprint(links)

Prints:

['Free software',
 'Open-source model',
 'Outline of free software',
 'Adoption of free and open-source software by public institutions',
 ...
 'ZK Spreadsheet',
 'Zulip',
 'Portal:Free and open-source software']

The irrelevant CSV-writing part is omitted.
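For completeness, here is a minimal sketch of what that writing step could look like, reusing the 001-catPages.csv filename from the question; the answer deliberately leaves this part out, so treat it as an assumption rather than the answerer's code:

import csv

# write the collected titles to the output file, one title per row;
# newline='' keeps the csv module from inserting blank lines on Windows
with open('001-catPages.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows([title] for title in links)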

Regarding "Pythonic beautifulSoup4: How to get remaining titles from the next page link of a Wikipedia category", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41391168/
