I'm trying to grab the 8 instances of x between the td tags below:
<th class="first"> Temperature </th>
<td> x </td> # repeated for 8 lines
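However, the page contains many <th class="first"> elements.
The only unique identifier is the string that immediately follows, in this case Temperature.
I'm not sure what to add to the code below, which I'm currently using, so that it only grabs the <th class="first"> whose text is Temperature (and, later, other strings):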
for tag in soup.find_all("th", {"class":"first"}):
    temps.append(tag.text)
Is this a matter of adding extra code (re.compile?), or should I be using something else entirely?
Edit: the HTML of interest is below
<tbody>
<tr> <th class="first">Temperature</th> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> </tr>
Edit: current code
from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'c:\program files\firefox\geckodriver.exe')
driver.get("http://www.bom.gov.au/places/nsw/sydney/forecast/detailed/")
html = driver.page_source
soup = bs(html, "lxml")
dates = []
for tag in soup.find_all("a", {"class":"toggle"}):
    dates.append(tag.text)
temps = [item.text for item in soup.select('th.first:contains(Temperature) ~ td')]
print(dates)
print(temps)
Best answer
This is easy with bs4 4.7.1, because you can combine the :contains pseudo-class with the ~ general sibling combinator:
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
    print(table['summary'])  # day of reading
    tds = [item.text for item in table.select('.first:contains("Air temperature (°C)") ~ td')]  # readings
    print(tds)
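To see what the selector does without hitting the live site, here is a minimal offline sketch. The HTML fragment is hypothetical (modelled on the row in the question, with made-up values), and it assumes a soupsieve release of 2.1 or later, where :-soup-contains is the newer spelling of the now-deprecated :contains. It also shows a find/re.compile variant, since the question asked about that; treat both as illustrations rather than the one correct way.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's HTML; the values are made up.
html = """
<table>
  <tbody>
    <tr><th class="first">Humidity</th><td>50</td><td>55</td></tr>
    <tr><th class="first">Temperature</th><td>18</td><td>21</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS route: the th.first whose text contains "Temperature",
# then every td that follows it as a sibling in the same row.
temps = [td.get_text(strip=True)
         for td in soup.select('th.first:-soup-contains("Temperature") ~ td')]

# Non-CSS route: locate the th by a regex on its text, then walk its siblings.
th = soup.find("th", class_="first", string=re.compile("Temperature"))
temps_via_find = [td.get_text(strip=True) for td in th.find_next_siblings("td")]

print(temps)
print(temps_via_find)
```

Both lists contain only the cells from the Temperature row; the Humidity row is skipped because its th does not match.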
You can get the times of the readings with:
print([item.text.strip() for item in table.select('tr:nth-of-type(1) th')][1:-1])
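A small sketch of why the [1:-1] slice is there. The table below is hypothetical: it assumes the first header row starts with a label cell and ends with an extra trailing cell, which is what the slice implies, but the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the cell contents and the trailing "Units" cell are assumptions.
html = """
<table summary="Temperatures for Monday">
  <tr><th class="first">At</th><th>1:00 AM</th><th>4:00 AM</th><th>7:00 AM</th><th>Units</th></tr>
  <tr><th class="first">Air temperature (°C)</th><td>14</td><td>13</td><td>15</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table")

# All header cells in the first row of the table.
headers = [th.get_text(strip=True) for th in table.select("tr:nth-of-type(1) th")]

# Drop the leading label cell and the trailing cell, keeping only the times.
times = headers[1:-1]
print(times)
```

If the real header row has no trailing cell, headers[1:] would be the slice to use instead.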
Adding pandas for a nicely formatted table:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://www.bom.gov.au/places/nsw/sydney/forecast/detailed'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('[summary*=Temperatures]'):
    print(table['summary'])
    output = pd.read_html(str(table))[0]
    print(output)
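One note for newer pandas releases: passing a literal HTML string to read_html is deprecated as of pandas 2.1, and wrapping the string in io.StringIO avoids the warning. Below is a minimal sketch on a hypothetical table (made-up values), assuming lxml is installed as in the code above.

```python
from io import StringIO

import pandas as pd

# Hypothetical table standing in for str(table) from the loop above.
html = """
<table summary="Temperatures for Monday">
  <tr><th>At</th><th>1:00 AM</th><th>4:00 AM</th></tr>
  <tr><th>Air temperature (°C)</th><td>14</td><td>13</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
frames = pd.read_html(StringIO(html))
output = frames[0]
print(output)
```

In the loop above, the equivalent change would be pd.read_html(StringIO(str(table)))[0].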
Regarding python - scraping with specific criteria when similar classes are used in the HTML source, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55773625/