I'm trying to scrape CoVid data from the following URL on WorldOMeter. The page has a table with the id main_table_countries_today
containing the 15x225 (3,375) data cells I want to collect.
I've tried several approaches, but let me share the attempt I think came closest:
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

# Refreshes the Terminal Emulator window
def clear_screen():
    def bash_input(user_in):
        _ = system(user_in)
    bash_input('clear')

# This bot searches for <table> and </table> to start/stop recording data
class Bot:
    def __init__(self,
                 line_added=False,
                 looking_for_start=True,
                 looking_for_end=False):
        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end

    def set_line_adding(self, bool):
        self.line_adding = bool

    def set_start_look(self, bool):
        self.looking_for_start = bool

    def set_end_look(self, bool):
        self.looking_for_end = bool

if __name__ == '__main__':
    # Start with a fresh Terminal emulator
    clear_screen()
    my_bot = Bot()
    r = requests.get(url).text
    all_r = r.split('\n')
    for rs in all_r:
        if my_bot.looking_for_start and table_id in rs:
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        if my_bot.looking_for_end and table_end in rs:
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)
        if my_bot.line_adding:
            all_lines.append(rs)
    for lines in all_lines:
        print(lines)
    print('\n\n\n\n')
    print(len(all_lines))
This prints 6,551 lines, more than twice what I need. That would normally be fine, since the next step is to strip out the lines that are irrelevant to my data; the problem is it doesn't capture the entire table. An earlier, very similar attempt using BeautifulSoup also failed to start and stop at the table above. It looked like this:
from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

if __name__ == '__main__':
    # Here we go, again...
    _ = system('clear')
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    my_table = soup.find_all('table', {'id': table_id})
    for current_line in my_table:
        page_lines = str(current_line).split('\n')
        for line in page_lines:
            all_lines.append(line)
    for line in all_lines:
        print(line)
    print('\n\n')
    print(len(all_lines))
The result is 5,547 lines.
I also tried Pandas and Selenium, but I've since deleted that code. I'm hoping that by showing my two "best" attempts, someone might spot some obvious problem I've missed.
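Since Pandas came up: one route worth noting is pandas.read_html, which can select a table by its id attribute and strip the thousands separators in one call. A minimal sketch on a stand-in HTML fragment (the live page would be fetched with requests as above):

```python
from io import StringIO

import pandas as pd

# Stand-in for the live page; the real table has the same id
html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td>USA</td><td>35,745,024</td></tr>
  <tr><td>India</td><td>31,693,625</td></tr>
</table>
"""

# attrs filters candidate tables by their HTML attributes;
# thousands="," parses "35,745,024" into the integer 35745024
df = pd.read_html(StringIO(html),
                  attrs={"id": "main_table_countries_today"},
                  thousands=",")[0]
print(df)
```

read_html returns a list of matching tables, hence the trailing `[0]`.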
I'd be happy just getting the data onto the screen. Ultimately I'm trying to turn this data into a dictionary (to be exported as a .json
file), like this:
data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "Active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}
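For the export step itself, once a dictionary like the one above is filled, the standard library's json.dump writes the .json file directly. A sketch with illustrative stand-in data:

```python
import json

# Illustrative stand-in for the filled dictionary above
data = {
    "Country": ["USA", "India"],
    "Total Cases": ["35,745,024", "31,693,625"],
}

# indent=4 keeps the file human-readable
with open("covid_data.json", "w") as f:
    json.dump(data, f, indent=4)
```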
Any suggestions?
Best answer
The table contains a lot of other information. You can take the first 15 <td>
cells from each row and drop the first/last 8 rows:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("#main_table_countries_today tr:has(td)")[8:-8]:
    tds = [td.get_text(strip=True) for td in tr.select("td")][:15]
    all_data.append(tds)

df = pd.DataFrame(
    all_data,
    columns=[
        "#",
        "Country",
        "Total Cases",
        "New Cases",
        "Total Deaths",
        "New Deaths",
        "Total Recovered",
        "New Recovered",
        "Active Cases",
        "Serious, Critical",
        "Tot Cases/1M pop",
        "Deaths/1M pop",
        "Total Tests",
        "Tests/1M pop",
        "Population",
    ],
)
print(df)
Prints:
# Country Total Cases New Cases Total Deaths New Deaths Total Recovered New Recovered Active Cases Serious, Critical Tot Cases/1M pop Deaths/1M pop Total Tests Tests/1M pop Population
0 1 USA 35,745,024 629,315 29,666,117 5,449,592 11,516 107,311 1,889 529,679,820 1,590,160 333,098,437
1 2 India 31,693,625 +39,041 424,777 +393 30,846,509 +33,636 422,339 8,944 22,725 305 468,216,510 335,725 1,394,642,466
2 3 Brazil 19,917,855 556,437 18,619,542 741,876 8,318 92,991 2,598 55,034,721 256,943 214,190,490
3 4 Russia 6,288,677 +22,804 159,352 +789 5,625,890 +17,271 503,435 2,300 43,073 1,091 165,800,000 1,135,600 146,002,094
...
218 219 Samoa 3 3 0 15 199,837
219 220 Saint Helena 2 2 0 328 6,097
220 221 Micronesia 1 1 0 9 116,324
221 222 China 93,005 +75 4,636 87,347 +24 1,022 25 65 3 160,000,000 111,163 1,439,323,776
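To bridge this back to the dictionary/.json goal in the question: DataFrame.to_dict(orient="list") produces the same column-to-list mapping built by hand above, and json.dump writes it out. A sketch on a two-row stand-in frame rather than the live scrape:

```python
import json

import pandas as pd

# Two-row stand-in for the scraped DataFrame
df = pd.DataFrame(
    {"Country": ["USA", "India"], "Total Cases": ["35,745,024", "31,693,625"]}
)

# orient="list" gives {column: [values]}, matching the hand-built dictionary
data = df.to_dict(orient="list")
print(data["Country"])  # ['USA', 'India']

with open("covid_table.json", "w") as f:
    json.dump(data, f, indent=4)
```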
Regarding python - Scrape all text between <table>TABLE I NEED</table> in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/68612714/