python - How do I iterate through the HTML links in a table to extract data from the table?

Tags: python html json web-scraping beautifulsoup

I'm trying to work through the table at https://bgp.he.net/report/world. I want to follow each HTML link to its country page, grab the data there, and then move on to the next entry in the list. I'm using Beautiful Soup and can already pull out the data I want, but I can't quite figure out how to iterate over the column of HTML links.

from bs4 import BeautifulSoup
import requests
import json


headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

url = "https://bgp.he.net/country/LC"
html = requests.get(url, headers=headers)

country_ID = url[-2:]  # the last two characters of the URL are the country code
print("\n")

soup = BeautifulSoup(html.text, 'html.parser')
#print(soup)
data = []
for row in soup.find_all("tr")[1:]: # start from second row
    cells = row.find_all('td')
    data.append({
        'ASN': cells[0].text,
        'Country': country_ID,
        "Name": cells[1].text,
        "Routes V4": cells[3].text,
        "Routes V6": cells[5].text
    })



# Write each row dict to the output file, one per line
with open('table_attempt.txt', 'w') as r:
    for item in data:
        r.write(str(item))
        r.write("\n")


print(data)

I'd like to be able to collect the data for every country into a single text file.

Best Answer

I only tested this with the first 3 links (I ran into a UnicodeEncodeError, but fixed it and marked the fix with a comment in the code).

from bs4 import BeautifulSoup
import requests
import json

# First, get the list of country URLs

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

url = "https://bgp.he.net/report/world"
html = requests.get(url, headers=headers)

soup = BeautifulSoup(html.text, 'html.parser')

table = soup.find('table', {'id':'table_countries'})
rows = table.find_all('tr')

country_urls = []

# Go through each row and grab the link. Rows without a link raise an IndexError, so skip them.
for row in rows:
    try:
        link = row.select('a')[0]['href']
        country_urls.append(link)
    except IndexError:
        continue


# Now iterate through that list
for link in country_urls:

    url = "https://bgp.he.net" + link
    html = requests.get(url, headers=headers)

    country_ID = url[-2:]  # the last two characters of the URL are the country code
    print("\n")

    soup = BeautifulSoup(html.text, 'html.parser')
    #print(soup)
    data = []
    for row in soup.find_all("tr")[1:]: # start from second row
        cells = row.find_all('td')
        data.append({
            'ASN': cells[0].text,
            'Country': country_ID,
            "Name": cells[1].text,
            "Routes V4": cells[3].text,
            "Routes V6": cells[5].text
        })



    print('Writing from %s' % url)

    # I added encoding="utf-8" because of a UnicodeEncodeError.
    # Open in append mode ('a') so each country's rows are added to the same file
    # instead of overwriting the previous country's data.
    with open('table_attempt.txt', 'a', encoding="utf-8") as r:
        for item in data:
            r.write(str(item))
            r.write("\n")
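
As a side note, both snippets import json but never use it. If a structured file is easier to work with than lines of str(dict), a minimal sketch along the same lines (assuming the same page layout; all_data and table_attempt.json are just illustrative names) could collect every country's rows into one list and dump it once at the end:

from bs4 import BeautifulSoup
import requests
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

# Grab every country link from the world report table
html = requests.get("https://bgp.he.net/report/world", headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
table = soup.find('table', {'id': 'table_countries'})
country_urls = [a['href'] for a in table.select('a[href]')]

all_data = []  # rows for every country end up in this one list

for link in country_urls:
    url = "https://bgp.he.net" + link
    country_ID = url[-2:]  # last two characters are the country code
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

    for row in soup.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all('td')
        if len(cells) < 6:  # ignore rows that are not ASN rows
            continue
        all_data.append({
            'ASN': cells[0].text,
            'Country': country_ID,
            'Name': cells[1].text,
            'Routes V4': cells[3].text,
            'Routes V6': cells[5].text,
        })

# One JSON file with everything, written once at the end
with open('table_attempt.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)

Writing once at the end also avoids the overwrite/append question entirely, at the cost of keeping all rows in memory until the loop finishes.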

Regarding "python - How do I iterate through the HTML links in a table to extract data from the table?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54166823/
