python - Scraping multiple tables and their titles from a Wikipedia page with Python requests and BeautifulSoup?

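Using the Python libraries requests and BeautifulSoup, I am trying to scrape the tables on this Wikipedia page: https://en.wikipedia.org/wiki/Mobile_country_code. I can get all the data in the tables; however, I want to add another column called Country and fill it with the name of the table each row came from. [Example image: the Wikipedia table (top) and the desired table (bottom).]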

The code below gets me all the data, but without the Country column:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

wiki = requests.get('https://en.wikipedia.org/wiki/Mobile_country_code')
soup = BeautifulSoup(wiki.content, 'html.parser')

# Get all the tables
tables = soup.find_all('table',class_="wikitable")

# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]

# put all the content into a list
values=[]
for table in tables:
    for item in table.select('td'):
        temp = item.get_text()
        values.append(temp)

# Since there are 7 columns, obtain the number of rows and reshape the table
n_cols = 7
len(values) / n_cols   # 2452 rows

# change the shape of the table (-1 lets NumPy infer the row count)
data = np.reshape(values, (-1, n_cols))

# put all the data into a dataframe
# (column_names is defined above; header_list was never defined here)
df = pd.DataFrame(data=data, columns=column_names)
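
The reshape step above only works when the flat list of cells divides evenly into rows of 7. As a minimal sketch of the same idea in plain Python, the helper below splits a flat list into rows and fails loudly when the cell count does not fit. The name `rows_from_flat` is hypothetical and is not part of the original code; the sample cells are shortened to three columns for brevity.

```python
def rows_from_flat(values, n_cols):
    """Split a flat list of cell texts into rows of n_cols cells each."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if len(values) % n_cols != 0:
        # A ragged table (merged or missing cells) would silently shift
        # every later row, so stop instead of guessing.
        raise ValueError(
            f"{len(values)} cells do not divide evenly into rows of {n_cols}"
        )
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


# Illustrative cells only, three columns instead of seven
cells = ["289", "67", "Aquafon", "289", "88", "A-Mobile"]
rows = rows_from_flat(cells, 3)
print(rows)
```

Because this approach flattens every table into one list, the link between a row and the table it came from is lost, which is why the Country column cannot be added afterwards.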

Best answer

Try:

# Reuses `soup` from the code in the question
# Get all the tables
tables = soup.find_all('table',class_="wikitable")

# extract the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# extract the content
contents = [item.get_text() for item in tables[0].find_all('td')]

# put all the content into a list

values_list = []
#find all countries
countries = soup.find_all('h3')
international = [soup.find('span',{"id":"International_operators"}).parent]
countries = countries+international
for c in countries:
    table = c.find_next_sibling("table")
    if table is not None:  # check that the country has a table
        for item in table.select('tr')[1:]:
            values = [e.get_text() for e in item.select('td')]
            values = [c.text]+values
            values_list.append(values)

header_list = ["COUNTRY"]+ column_names

# put all the data into a dataframe
df = pd.DataFrame(values_list, columns=header_list)

df will be:

    COUNTRY             MCC MNC Brand    Operator       Status       Bands (MHz)                                        References and notes
0   Abkhazia - GE-AB    289 67  Aquafon  Aquafon JSC    Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800            MCC is not listed by ITU;[85] LTE band 20[95]
1   Abkhazia - GE-AB    289 88  A-Mobile A-Mobile LLSC  Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 / LTE...   MCC is not listed by ITU[85]
...
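
The answer pairs each heading with the table that directly follows it as a sibling. A possible variation, sketched below, starts from each table and walks backwards to the nearest heading with `find_previous`, which would still work if a page wraps its headings in an extra element. The HTML here is a hypothetical miniature page rather than the real Wikipedia markup; the wrapper `div`, the second country, and the helper name `rows_with_heading` are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the real Wikipedia HTML.
# The second heading is wrapped in a div to show the sibling-free lookup.
SAMPLE_HTML = """
<h3>Abkhazia - GE-AB</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>289</td><td>67</td><td>Aquafon</td></tr>
  <tr><td>289</td><td>88</td><td>A-Mobile</td></tr>
</table>
<div class="heading-wrapper"><h3>Exampleland - XX</h3></div>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>000</td><td>00</td><td>ExampleTel</td></tr>
</table>
"""


def rows_with_heading(html):
    """Return one list per data row, prefixed with the nearest heading text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="wikitable"):
        # Nearest heading that appears before this table in the document
        heading = table.find_previous(["h2", "h3", "h4"])
        title = heading.get_text(strip=True) if heading else ""
        for tr in table.find_all("tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append([title] + cells)
    return rows


rows = rows_with_heading(SAMPLE_HTML)
for row in rows:
    print(row)
```

Passing `strip=True` to `get_text` also trims the trailing newlines that table cells often carry. The resulting list of rows can be handed to `pd.DataFrame` together with a header list, in the same way as `values_list` in the answer.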

Regarding "python - Scraping multiple tables and their titles from a Wikipedia page with Python requests and BeautifulSoup?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44960942/
