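Using the Python libraries requests and BeautifulSoup, I am trying to scrape the tables on this Wikipedia page: https://en.wikipedia.org/wiki/Mobile_country_code . I can get all the data in the tables; however, I want to add another column named Country and fill it with the name (heading) of the table each row comes from. Here is an example: the Wikipedia table (top) and the desired table (bottom).
The code below lets me get all the data, but without the Country column: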
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

wiki = requests.get('https://en.wikipedia.org/wiki/Mobile_country_code')
soup = BeautifulSoup(wiki.content, 'html.parser')

# Get all the tables
tables = soup.find_all('table', class_="wikitable")

# Extract the column names from the first table
column_names = [item.get_text() for item in tables[0].find_all('th')]

# Put the text of every cell from every table into one flat list
values = []
for table in tables:
    for item in table.select('td'):
        values.append(item.get_text())

# Every row has len(column_names) cells (7 on this page), so derive the row count
n_cols = len(column_names)
n_rows = len(values) // n_cols  # 2452 rows when the question was asked

# Reshape the flat list into rows and columns
data = np.reshape(values, (n_rows, n_cols))

# Put all the data into a dataframe
df = pd.DataFrame(data=data, columns=column_names)
Best answer
Try:
# Reuses `soup` and `pd` from the question's code
# Get all the tables
tables = soup.find_all('table', class_="wikitable")

# Extract the column names from the first table
column_names = [item.get_text() for item in tables[0].find_all('th')]

# Collect one list of cell values per table row
values_list = []

# Find all country headings
countries = soup.find_all('h3')
# Add the heading of the "International operators" section, looked up by its span id
international = [soup.find('span', {"id": "International_operators"}).parent]
countries = countries + international

for c in countries:
    table = c.find_next_sibling("table")
    if table is not None:  # check that the country has a table
        for item in table.select('tr')[1:]:  # skip the header row
            values = [e.get_text() for e in item.select('td')]
            values = [c.text] + values  # prepend the heading text as the country
            values_list.append(values)

header_list = ["COUNTRY"] + column_names

# Put all the data into a dataframe
df = pd.DataFrame(values_list, columns=header_list)
df
The result will be:
COUNTRY MCC MNC Brand Operator Status Bands (MHz) References and notes
0 Abkhazia - GE-AB 289 67 Aquafon Aquafon JSC Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 MCC is not listed by ITU;[85] LTE band 20[95]
1 Abkhazia - GE-AB 289 88 A-Mobile A-Mobile LLSC Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 / LTE... MCC is not listed by ITU[85]
...
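The approach above walks forward from each heading with find_next_sibling, which only works while the table is a direct sibling of the heading. As a hedged variation (not the answerer's method), the lookup can be reversed: start at each table and walk backward to the nearest preceding h3 with find_previous. The sketch below runs on a small made-up HTML fragment instead of the live page; the names SAMPLE_HTML and rows_with_heading, the sample countries, and the div wrapper around the second heading are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page (all values are made up).
# The second heading is wrapped in a div to show a case where the table
# is not a direct sibling of the h3.
SAMPLE_HTML = """
<h3>Testland</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>001</td><td>01</td><td>TestNet</td></tr>
  <tr><td>001</td><td>02</td><td>DemoCell</td></tr>
</table>
<div class="mw-heading"><h3>Examplia</h3></div>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>999</td><td>99</td><td>ExampleTel</td></tr>
</table>
"""


def rows_with_heading(html):
    """Return one list per data row, prefixed with the nearest preceding h3 text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="wikitable"):
        # Walk backward in document order to the closest h3 before this table
        heading = table.find_previous("h3")
        label = heading.get_text(strip=True) if heading is not None else None
        for tr in table.find_all("tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # ignore rows that contain no td cells
                rows.append([label] + cells)
    return rows


rows = rows_with_heading(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of rows can be passed to pd.DataFrame together with a header list such as ["COUNTRY"] + column_names, in the same way as values_list above. One trade-off to keep in mind: a table that has no h3 of its own picks up the heading of an earlier section, so the labels are worth spot-checking against the page.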
Regarding python - Scraping multiple tables and their headings on a Wikipedia page using python requests and BeautifulSoup?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44960942/