python - Scraping web data with BeautifulSoup

Tags: python html web-scraping beautifulsoup

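I am trying to get the chance of rain and the temperature/wind speed for each baseball game from rotowire.com. Once I have scraped the data, I want to convert it into three columns: rain, temperature, and wind. Thanks to another user I got close, but I cannot quite extract everything. I have tried two approaches.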

First approach:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

weather = []

for i in soup.select(".lineup__bottom"):
    
    forecast = i.select_one('.lineup__weather-text').text
    weather.append(forecast)

This returns:

['\n100% Rain\r\n                66°\xa0\xa0Wind 8 mph In                        ', '\n0% Rain\r\n                64°\xa0\xa0Wind 4 mph L-R                        ', '\n0% Rain\r\n                69°\xa0\xa0Wind 7 mph In                        ', '\nDome\r\n                In Domed Stadium\r\n                        ', '\n0% Rain\r\n                75°\xa0\xa0Wind 10 mph Out                        ', '\n0% Rain\r\n                68°\xa0\xa0Wind 9 mph R-L                        ', '\n0% Rain\r\n                82°\xa0\xa0Wind 9 mph                         ', '\n0% Rain\r\n                81°\xa0\xa0Wind 5 mph R-L                        ', '\nDome\r\n                In Domed Stadium\r\n                        ', '\n1% Rain\r\n                75°\xa0\xa0Wind 4 mph R-L                        ', '\n1% Rain\r\n                71°\xa0\xa0Wind 6 mph Out                        ', '\nDome\r\n                In Domed Stadium\r\n                        ']

The second approach I tried:

from bs4 import BeautifulSoup
import requests
import pandas as pd


url = 'https://www.rotowire.com/baseball/daily-lineups.php'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

#weather = []

for i in soup.select(".lineup__bottom"):
    
    forecast = i.select_one('.lineup__weather-text').text
    weather.append(forecast)
    #print(forecast)
    rain = i.select_one('.lineup__weather-text b:contains("Rain") ~ span').text

This raises an AttributeError: 'NoneType' object has no attribute 'text'.
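A likely cause, assuming the weather block holds a `<b>` tag followed directly by plain text (which the strings returned by the first approach suggest): `select_one` returns `None` when its selector matches nothing, and here there is no `<span>` sibling to match. The sketch below reproduces this on hypothetical, simplified markup, not the live page. Separately, recent soupsieve versions deprecate `:contains` in favor of `:-soup-contains`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the text returned by the
# first approach; the real page may be structured differently.
html = """
<div class="lineup__bottom">
  <div class="lineup__weather-text">
    <b>100% Rain</b>
    66°&nbsp;&nbsp;Wind 8 mph In
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
box = soup.select_one(".lineup__weather-text")

# No <span> follows the <b>, so the sibling selector matches nothing
# and select_one returns None; calling .text on it would raise.
missing = box.select_one("b ~ span")
print(missing)

# Guard against None before reading text from the tag.
rain_tag = box.select_one("b")
rain = rain_tag.get_text(strip=True) if rain_tag is not None else None

# In this markup the temperature and wind live in the plain text node
# right after <b>, not in a separate element.
details = rain_tag.next_sibling.split() if rain_tag is not None else []
print(rain, details)
```

Checking the result of `select_one` before using it turns the crash into a value that can be handled explicitly.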

Best answer

To extract all of the data, see this example:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []

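for tag in soup.select(".lineup__bottom"):
    # The matchup comes from the nearest preceding .lineup__teams element.
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    # The <b> tag holds the rain chance (or "Dome"); the text node that
    # follows it holds the temperature and wind.
    rain = tag.select_one(".lineup__weather-text > b")
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]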

    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )


df = pd.DataFrame(weather)
print(df)

Output:

        Header  Rain Temp     Wind
0   PHI vs CIN  100%  66°        8
1   CWS vs CLE    0%  64°        4
2    SD vs CHC    0%  69°        7
3   NYM vs ARI  Dome   In  Stadium
4   MIN vs BAL    0%  75°        9
5    TB vs NYY    0%  68°        9
6   MIA vs TOR    0%  81°        6
7   WAS vs ATL    0%  81°        4
8   BOS vs HOU  Dome   In  Stadium
9   TEX vs COL    0%  76°        6
10  STL vs LAD    0%  73°        4
11  OAK vs SEA  Dome   In  Stadium
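Note that the dome rows above end up with "In" and "Stadium" in the Temp and Wind columns, because the text is split by position. One possible alternative, shown here only as a sketch, is to parse the raw strings from the first approach with a regular expression and leave dome games empty. The pattern, the `parse_forecast` helper, and the extra `Direction` field are assumptions for illustration, not part of the original answer.

```python
import re

# Two sample strings copied from the output of the first approach.
samples = [
    "\n100% Rain\r\n                66°\xa0\xa0Wind 8 mph In                        ",
    "\nDome\r\n                In Domed Stadium\r\n                        ",
]

# Hypothetical pattern: rain percentage, temperature, wind speed,
# and an optional wind direction such as "In", "Out" or "L-R".
PATTERN = re.compile(r"(\d+)%\s*Rain\s+(\d+)°\s+Wind\s+(\d+)\s*mph\s*(\S+)?")


def parse_forecast(text):
    """Return rain/temp/wind values, or None values when nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        # Dome games carry no forecast numbers.
        return {"Rain": None, "Temp": None, "Wind": None, "Direction": None}
    rain, temp, wind, direction = match.groups()
    return {
        "Rain": int(rain),
        "Temp": int(temp),
        "Wind": int(wind),
        "Direction": direction,
    }


rows = [parse_forecast(s) for s in samples]
print(rows)
```

Passing `rows` to `pd.DataFrame` would give one row per game, with numeric columns and empty values for dome games.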

Regarding python - Scraping web data with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67814115/
