python - Scraping multiple tables and their headings from a Wikipedia page with Python requests and BeautifulSoup?

Tags: python pandas web-scraping beautifulsoup python-requests

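Using the Python libraries requests and BeautifulSoup, I am trying to scrape the tables on this Wikipedia page: https://en.wikipedia.org/wiki/Mobile_country_code . I can get all the data in the tables; however, I would like to add another column called Country and populate it with the heading that sits above each table. As an example, consider the Wikipedia table (top) and the desired table (bottom).

The code below lets me get all the data, but without the Country column: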

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

wiki = requests.get('https://en.wikipedia.org/wiki/Mobile_country_code')
soup = BeautifulSoup(wiki.content, 'html.parser')

# Get all the tables
tables = soup.find_all('table',class_="wikitable")

# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]

# put all the content into a list
values=[]
for table in tables:
    for item in table.select('td'):
        temp = item.get_text()
        values.append(temp)

# Each table has 7 columns, so derive the row count from the number of cells
n_rows = len(values) // 7   # 2452 rows

# change the shape of the table
data = np.reshape(values, (n_rows, 7))

# put all the data into a dataframe
df = pd.DataFrame(data=data, columns=column_names)

Best answer

Try:

#This is the table which I want to extract
# Get all the tables
tables = soup.find_all('table',class_="wikitable")

# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]

# put all the content into a list

values_list = []
#find all countries
countries = soup.find_all('h3')
international = [soup.find('span',{"id":"International_operators"}).parent]
countries = countries+international
for c in countries:
    table = c.find_next_sibling("table")
    if table is not None:  # check that the country heading is followed by a table
        for item in table.select('tr')[1:]:
            values = [e.get_text() for e in item.select('td')]
            values = [c.text]+values
            values_list.append(values)

header_list = ["COUNTRY"]+ column_names

# put all the data into a dataframe
df = pd.DataFrame(values_list, columns=header_list)

df will be:

    COUNTRY           MCC  MNC  Brand     Operator       Status       Bands (MHz)                                        References and notes
0   Abkhazia - GE-AB  289  67   Aquafon   Aquafon JSC    Operational  GSM 900 / GSM 1800 / UMTS 2100 / LTE 800           MCC is not listed by ITU;[85] LTE band 20[95]
1   Abkhazia - GE-AB  289  88   A-Mobile  A-Mobile LLSC  Operational  GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 / LTE...  MCC is not listed by ITU[85]
...
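
The pairing of each table with its heading can also be tried offline on a small inline snippet. The following is only a sketch under stated assumptions: `SAMPLE_HTML` and the function name `rows_with_heading` are hypothetical, the second country and its values are made up for illustration, and the live page's markup may differ from this miniature. Instead of walking from each heading to its next sibling, it starts from each table and looks backwards with `find_previous("h3")`, which should still find the heading if it is wrapped in another element.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page layout: an <h3> heading followed by a table.
# The second country and its values are invented for illustration only.
SAMPLE_HTML = """
<h3>Abkhazia - GE-AB</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>289</td><td>67</td><td>Aquafon</td></tr>
  <tr><td>289</td><td>88</td><td>A-Mobile</td></tr>
</table>
<h3>Exampleland - XX</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>999</td><td>01</td><td>ExampleNet</td></tr>
</table>
"""


def rows_with_heading(html):
    """Return one dict per data row, tagged with the nearest preceding <h3> text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="wikitable"):
        # Nearest <h3> that appears before this table in document order
        heading = table.find_previous("h3")
        country = heading.get_text(strip=True) if heading else None

        all_rows = table.find_all("tr")
        if not all_rows:
            continue
        # Assumption: the first row of each table holds the column headers
        headers = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]

        for tr in all_rows[1:]:
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) != len(headers):
                continue  # skip rows whose cell count does not match the header
            rows.append({"COUNTRY": country, **dict(zip(headers, cells))})
    return rows


rows = rows_with_heading(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per operator with a COUNTRY column, without reshaping a flat list by a hard-coded row count.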

Regarding "python - Scraping multiple tables and their headings from a Wikipedia page with Python requests and BeautifulSoup?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44960942/
