I have this structure:
<tr id="table3620_0_5" class="l1">
<td class="r"> North America</td>
<td x:num="02/12/20">02/12/20</td>
<td x:num="" class="r">5553226</td>
<td x:num="" class="r">TEST TWI</td>
<td x:num="0.03365930063626542">3.37 %</td>
<td/>
<td x:num="0.03365930063626542">3.37 %</td>
</tr>
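Using Pandas to read the HTML, I can extract the table, but I am interested in the x:num attribute values rather than the text inside the tags. I have also been exploring a solution with Beautiful Soup, but so far I have gotten nowhere.
Best answer
You can try Beautiful Soup: select the td cells that carry an x:num attribute and read that attribute's value.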
from bs4 import BeautifulSoup
s = """<tr id="table3620_0_5" class="l1">
<td class="r"> North America</td>
<td x:num="02/12/20">02/12/20</td>
<td x:num="" class="r">5553226</td>
<td x:num="" class="r">TEST TWI</td>
<td x:num="0.03365930063626542">3.37 %</td>
<td/>
<td x:num="0.03365930063626542">3.37 %</td>
</tr>"""
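# Parse the HTML fragment with the lxml parser
soup = BeautifulSoup(s, 'lxml')
# Keep only the <td> cells that have an x:num attribute and read its value
output = [td.get('x:num') for td in soup.find_all("td", {"x:num": True})]
print(output)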
['02/12/20', '', '', '0.03365930063626542', '0.03365930063626542']
output[-2:]
['0.03365930063626542', '0.03365930063626542']
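If the goal is a full table row rather than only the cells that have the attribute, one possible variation is to fall back to the cell text whenever x:num is missing or empty. The following is only a sketch under a few assumptions: the helper name row_values is hypothetical, the row is wrapped in a table element for illustration, and the built-in "html.parser" is used instead of lxml simply to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Same row as in the question, wrapped in <table> for illustration (assumption).
html = """<table>
<tr id="table3620_0_5" class="l1">
<td class="r"> North America</td>
<td x:num="02/12/20">02/12/20</td>
<td x:num="" class="r">5553226</td>
<td x:num="" class="r">TEST TWI</td>
<td x:num="0.03365930063626542">3.37 %</td>
<td/>
<td x:num="0.03365930063626542">3.37 %</td>
</tr>
</table>"""


def row_values(tr):
    """Hypothetical helper: one value per <td>, preferring a non-empty x:num."""
    values = []
    for td in tr.find_all("td"):
        raw = td.get("x:num")  # None when the attribute is absent
        # Use the attribute when it holds something, else the visible cell text.
        values.append(raw if raw else td.get_text(strip=True))
    return values


# "html.parser" ships with Python; the answer above uses lxml instead.
soup = BeautifulSoup(html, "html.parser")
rows = [row_values(tr) for tr in soup.find_all("tr")]
print(rows)
```

The resulting list of rows can be passed to pandas.DataFrame(rows) if a DataFrame is needed, and the numeric strings can then be converted with pandas.to_numeric on the relevant columns.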
Regarding python - extracting x:num values with Beautiful Soup and Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66516856/