python - Scrape all text between &lt;table&gt;TABLE I NEED&lt;/table&gt; in Python

Tags: python python-3.x web-scraping beautifulsoup python-requests

I am trying to scrape CoVid data from a WorldOMeter URL. The page has a table with the id main_table_countries_today containing the 15x225 (3,375) data cells I want to collect.

I have tried several approaches; let me share the one I think came closest:

import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


# Refreshes the Terminal Emulator window
def clear_screen():

    def bash_input(user_in):
        _ = system(user_in)
    
    bash_input('clear')


# This bot searches for <table> and </table> to start/stop recording data
class Bot:

    def __init__(self,
                 line_added=False,
                 looking_for_start=True,
                 looking_for_end=False):

        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end
    
    def set_line_adding(self, value):

        self.line_adding = value

    def set_start_look(self, value):

        self.looking_for_start = value

    def set_end_look(self, value):

        self.looking_for_end = value


if __name__ == '__main__':

    # Start with a fresh Terminal emulator
    clear_screen()
    
    my_bot = Bot()

    r = requests.get(url).text
    all_r = r.split('\n')

    for rs in all_r:

        if my_bot.looking_for_start and table_id in rs:
                
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        
        if my_bot.looking_for_end and table_end in rs:    
                
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)
        
        if my_bot.line_adding:

            all_lines.append(rs)
        

    for lines in all_lines:
        print(lines)

    print('\n\n\n\n')
    print(len(all_lines))

This prints 6,551 lines, which is more than twice what I need. That would normally be fine, since the next step would be to clean out the lines irrelevant to my data, but it doesn't even produce the whole table. An earlier, very similar attempt with BeautifulSoup also failed to start and stop at the table above. It looked like this:
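One likely reason the line-scanning approach truncates the table: the page's HTML can contain nested tables, so matching the literal string `</table>` stops at the first closing tag it sees, which may belong to an inner table rather than the one being captured. A minimal sketch with a made-up HTML snippet (not the real WorldOMeter markup) showing the failure mode:

```python
# Hypothetical HTML with a nested table inside the target table
sample = """<table id="main_table_countries_today">
<tr><td>
<table><tr><td>nested</td></tr></table>
</td></tr>
<tr><td>outer row 2</td></tr>
</table>"""

captured = []
recording = False
for line in sample.split("\n"):
    if not recording and "main_table_countries_today" in line:
        recording = True
    if recording:
        captured.append(line)
    if recording and "</table>" in line:
        break  # stops at the nested table's close, not the outer one

print(len(captured))                          # 3 lines captured
print("outer row 2" in "\n".join(captured))   # False: outer rows were lost
```

A real HTML parser tracks tag nesting, which is why the BeautifulSoup/selector approaches below don't suffer from this.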

from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


if __name__ == '__main__':

    # Here we go, again...
    _ = system('clear')

    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    my_table = soup.find_all('table', {'id': table_id})

    for current_line in my_table:

        page_lines = str(current_line).split('\n')

        for line in page_lines:
            all_lines.append(line)

    for line in all_lines:
        print(line)

    print('\n\n')
    print(len(all_lines))

The result is 5,547 lines.

I also tried Pandas and Selenium, but I deleted that code. My hope is that by showing my two "best" attempts, someone might spot something obvious I'm missing.
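For the record, the Pandas route can be very short: `pandas.read_html` parses every table in a page and the `attrs` argument narrows it to one id. A sketch using a tiny stand-in snippet (against the real site you would fetch the page with requests first and pass its text in):

```python
import io

import pandas as pd

# Tiny stand-in for the WorldOMeter page; for the real site you would do
# html = requests.get("https://www.worldometers.info/coronavirus/").text
html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td>USA</td><td>35,745,024</td></tr>
  <tr><td>India</td><td>31,693,625</td></tr>
</table>
"""

# read_html returns a list of DataFrames; attrs selects only the table
# whose id attribute matches
df = pd.read_html(io.StringIO(html), attrs={"id": "main_table_countries_today"})[0]
print(df)
```

By default `read_html` treats commas as thousands separators, so numeric columns come back as integers rather than strings.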

I'd be happy just to get the data on screen. Eventually I'm trying to turn this data into a dictionary (to be exported as a .json file), like this:

data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "Active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}

Any suggestions?

Best Answer

The table contains a lot of other information. You can take the first 15 &lt;td&gt; cells of each row and drop the first/last 8 rows:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.worldometers.info/coronavirus/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("#main_table_countries_today tr:has(td)")[8:-8]:
    tds = [td.get_text(strip=True) for td in tr.select("td")][:15]
    all_data.append(tds)

df = pd.DataFrame(
    all_data,
    columns=[
        "#",
        "Country",
        "Total Cases",
        "New Cases",
        "Total Deaths",
        "New Deaths",
        "Total Recovered",
        "New Recovered",
        "Active Cases",
        "Serious, Critical",
        "Tot Cases/1M pop",
        "Deaths/1M pop",
        "Total Tests",
        "Tests/1M pop",
        "Population",
    ],
)
print(df)

Prints:

       #                 Country Total Cases New Cases Total Deaths New Deaths Total Recovered New Recovered Active Cases Serious, Critical Tot Cases/1M pop Deaths/1M pop  Total Tests Tests/1M pop     Population
0      1                     USA  35,745,024                629,315                 29,666,117                  5,449,592            11,516          107,311         1,889  529,679,820    1,590,160    333,098,437
1      2                   India  31,693,625   +39,041      424,777       +393      30,846,509       +33,636      422,339             8,944           22,725           305  468,216,510      335,725  1,394,642,466
2      3                  Brazil  19,917,855                556,437                 18,619,542                    741,876             8,318           92,991         2,598   55,034,721      256,943    214,190,490
3      4                  Russia   6,288,677   +22,804      159,352       +789       5,625,890       +17,271      503,435             2,300           43,073         1,091  165,800,000    1,135,600    146,002,094

...

218  219                   Samoa           3                                                 3                          0                                 15                                                199,837
219  220            Saint Helena           2                                                 2                          0                                328                                                  6,097
220  221              Micronesia           1                                                 1                          0                                  9                                                116,324
221  222                   China      93,005       +75        4,636                     87,347           +24        1,022                25               65             3  160,000,000      111,163  1,439,323,776

This question about scraping all text between &lt;table&gt;TABLE I NEED&lt;/table&gt; in Python was found on Stack Overflow: https://stackoverflow.com/questions/68612714/
