python - Wikipedia scraping - need help structuring it

Tags: python pandas python-2.7 beautifulsoup wikipedia

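I am trying to scrape this Wikipedia page (https://en.wikipedia.org/wiki/List_of_government_gazettes).

I have run into a few problems and would really appreciate your help: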

  1. Some rows have more than one name or link, and I want all of them to be assigned to the correct country. Is there any way I can do that?

  2. I want to skip the 'Name (native)' column. How can I do that?

  3. If I do scrape the 'Name (native)' column, I get some gibberish. Is there any way to fix the encoding?

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
source = requests.get(url).text

soup = BeautifulSoup(source, 'lxml')
# first table on the page with the class "wikitable"
table = soup.find('table', class_='wikitable').tbody

rows = table.findAll('tr')

# cells of the first data row (Python 2.7): encode to UTF-8 bytes, then strip
# the UTF-8 bytes of a non-breaking space ('\xc2\xa0') and newlines
columns = [col.text.encode('utf').replace('\xc2\xa0','').replace('\n', '') for col in rows[1].find_all('td')]
print(columns)

Best answer

You can use the pandas function read_html and take the second DataFrame from the list of DataFrames it returns:

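import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
# read_html returns a list of DataFrames, one per table found; [1] is the second
df = pd.read_html(url)[1].head()
print(df)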
       Country/region                                              Name  \
0              Albania       Official Gazette of the Republic of Albania   
1              Algeria                                  Official Gazette   
2              Andorra  Official Bulletin of the Principality of Andorra   
3  Antigua and Barbuda              Antigua and Barbuda Official Gazette   
4            Argentina     Official Gazette of the Republic of Argentina   

                                 Name (native)                    Website  
0  Fletorja Zyrtare E Republikës Së Shqipërisë                 qbz.gov.al  
1                   Journal Officiel d'Algérie              joradp.dz/HAR  
2     Butlletí Oficial del Principat d'Andorra                www.bopa.ad  
3         Antigua and Barbuda Official Gazette    www.legalaffairs.gov.ag  
4    Boletín Oficial de la República Argentina  www.boletinoficial.gob.ar 
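
The question also asked how to skip the 'Name (native)' column. As a minimal sketch, assuming the DataFrame carries the column names shown in the output above, the column can simply be dropped once the table has been read. The hand-built `sample` frame is only a stand-in for the scraped table so that the snippet runs offline; its two rows are copied from the output above.

```python
import pandas as pd

# Stand-in for pd.read_html(url)[1]; two rows copied from the output above.
sample = pd.DataFrame({
    'Country/region': ['Albania', 'Algeria'],
    'Name': ['Official Gazette of the Republic of Albania', 'Official Gazette'],
    'Name (native)': ['Fletorja Zyrtare E Republikës Së Shqipërisë',
                      "Journal Officiel d'Algérie"],
    'Website': ['qbz.gov.al', 'joradp.dz/HAR'],
})

# drop() returns a new DataFrame without the unwanted column;
# the original frame is left unchanged.
trimmed = sample.drop(columns=['Name (native)'])
print(trimmed.columns.tolist())
```

Because pandas works with Unicode strings here, no manual encode step is involved, which may also avoid the gibberish described in the question.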

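If you check the output, row 26 is problematic, because the wiki page itself contains bad data there as well.

The fix is to set the value by row label and column name: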

import numpy as np

# assumes df holds the full table (read without .head()), so that row 26 exists
df.loc[26, 'Name (native)'] = np.nan
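
To make the label-based assignment concrete, here is a small sketch on a hypothetical three-row frame. The names `demo` and the cell values are made-up placeholders rather than data from the wiki page; 'bad value' merely plays the role of the broken cell.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; row label 1 holds the cell to be blanked out.
demo = pd.DataFrame({
    'Country/region': ['A', 'B', 'C'],
    'Name (native)': ['name a', 'bad value', 'name c'],
})

# .loc[row_label, column_name] addresses a single cell by its labels.
demo.loc[1, 'Name (native)'] = np.nan

print(demo['Name (native)'].isna().tolist())  # [False, True, False]
```

Only the addressed cell becomes missing; the other rows and columns are untouched.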

Regarding "python - Wikipedia scraping - need help structuring it", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55197498/
