I am trying to parse a web page, and I want to grab the "tr" elements that also have a bgcolor attribute. Here is the HTML of the page:
<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990 </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:White;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990 </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990 </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr>
</table>
I am trying to grab the tr elements by their style attribute:
style="color:White;background-color:#045D99;font-weight:bold;"
Here is my code:
import requests
from bs4 import BeautifulSoup
data = requests.get(url).text
soup = BeautifulSoup(data,"lxml")
elems = soup.find_all('tr',style"color:White;background-color:#045D99;font-weight:bold;")
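But my elems comes back empty. Also, in my soup object, I see that
style="color:White;background-color:#045D99;font-weight:bold;"
has been changed to
<tr bgcolor="#ECEEF2">
I am not sure whether this is what causes the problem. Also, is there a way to scrape the whole table as a pandas DataFrame?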
Edit:
There was a typo in my code; here is the corrected version:
soup.find_all('tr',{"style":"color:White;background-color:#045D99;font-weight:bold;"})
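This is the same as what the answer suggests, but I still get an empty result.
Another edit:
Even after trying the suggestions, I still get an empty result. The HTML comes from the following web page: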
http://990finder.foundationcenter.org/990results.aspx?990_type=&fn=AMERICAN+HEART+ASSOCIATION&st=&zp=&ei=&fy=&action=Search
I am trying to parse the table on that page.
Best answer
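I am going to take you literally. background-color is not an attribute; it is part of the value of the style attribute. Assuming you want rows whose style contains that substring (possibly to cater for different colours), we can use the CSS "contains" operator, *=, to match on the style attribute value: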
html = '''<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990 </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:White;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990 </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990 </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr>
</table>'''
from bs4 import BeautifulSoup as bs

soup = bs(html, "lxml")
# [style*="..."] matches every tr whose style attribute contains the substring
trs = soup.select('tr[style*=";background-color:"]')
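The question reports that the HTML fetched with requests shows rows such as <tr bgcolor="#ECEEF2"> rather than the style attribute seen in the browser. If so, any selector that looks only at style would come back empty. The following is a minimal sketch, not a tested fix for that site. It assumes a trimmed, hypothetical sample (sample_html) that mixes both forms, and it uses a selector list to accept either one. The names sample_html, coloured_rows, headers and records are illustrative only. It uses the built-in "html.parser" so the example does not depend on lxml, and it drops the leading ";" from the substring so a style that starts with background-color would also match.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the table in the question.
# One row deliberately uses bgcolor instead of style.
sample_html = """
<table id="MainContent_GridView1">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th>ORGANIZATION NAME</th><th>STATE</th><th>YEAR</th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td>Zoological Society of Philadelphia Philadelphia Zoo</td><td>PA</td><td>2017</td>
</tr><tr bgcolor="White">
<td>Zoological Society of Philadelphia</td><td>PA</td><td>2016</td>
</tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Selector list: rows whose style contains "background-color:",
# or rows that carry a bgcolor attribute at all.
coloured_rows = soup.select('tr[style*="background-color:"], tr[bgcolor]')

# Header cells are <th>; data cells are <td>.
headers = [th.get_text(strip=True)
           for th in soup.select("#MainContent_GridView1 th")]

records = []
for tr in coloured_rows:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        records.append(dict(zip(headers, cells)))

print(len(coloured_rows), "rows matched")
print(records)
```

For the DataFrame part of the question, a list of dicts like records can be passed to pandas.DataFrame(records). pandas.read_html is another option that parses every table in an HTML document, though whether it copes with the live page is not verified here.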
Regarding "python - scraping td elements by background-color style with BeautifulSoup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56458619/