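I am trying to use Beautiful Soup to extract some data from a website, specifically a table whose table and rows are stored in div tags rather than the usual table tags. This means I cannot simply use the pandas read_html function to pull out all the tables.
Here is the HTML I extracted: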
<div class="block">
<div class="expand">
<div class="expand-button collapsed" data-toggle="collapse">Forex</div>
<div class="panel-collapse collapse">
<div class="table">
<div class="search">
<div class="date"></div>
<div class="group search">
<span>Search </span>
<input class="search-box" type="search"/>
</div>
<div class="group ">
<span class="label"></span>
<span class="toggle-a"> </span>
<span class="toggle-b"> </span>
</div>
</div>
<div class="skin">
<div class="table visible">
<div class="header">
<div>Product</div>
<div>Account A</div>
</div>
<div class="column-header">
<div class="column-name">NAME</div>
<div class="column-name">DESCRIPTION</div>
<div class="column-name">Value1</div>
<div class="column-name">Value2</div>
<div class="column-name">Value3</div>
<div class="column-name">Value4</div>
</div>
<div class="table-row">
<div class="table-cell c1">bronze</div>
<div class="table-cell c2">3rd tier</div>
<div class="table-cell c3">0</div>
<div class="table-cell c4">1</div>
<div class="table-cell c5">1</div>
<div class="table-cell c6">1</div>
<div class="table-cell c-true">Account A</div>
<div class="table-cell c-standard">Account B</div>
</div>
<div class="table-row">
<div class="table-cell c1">silver</div>
<div class="table-cell c2">2nd tier</div>
<div class="table-cell c3">1</div>
<div class="table-cell c4">0</div>
<div class="table-cell c5">3</div>
<div class="table-cell c6">0</div>
<div class="table-cell c-true">Account A</div>
<div class="table-cell c-standard">Account B</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
What I want to end up with:
| Product | | Account A | | Account B | |
|---------|-------------|-----------|---------|-----------|---------|
| NAME | DESCRIPTION | Value 1 | Value 2 | Value 3 | Value 4 |
| bronze | 3rd tier | 0 | 1 | 1 | 1 |
| silver | 2nd tier | 1 | 0 | 3 | 0 |
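Is there a simple way to do this with Python or Beautiful Soup?
Best answer
The code below builds the row data from the given HTML. It assumes your markup is stored in a string named html.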
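from bs4 import BeautifulSoup

# `html` is assumed to hold the markup from the question as a string
soup = BeautifulSoup(html, "html.parser")

# Column names: one entry per child div of the "column-header" div
header_div = soup.find("div", attrs={"class": "column-header"})
first_row = [d.get_text(strip=True) for d in header_div.find_all("div")]

# Data rows: keep the first six cells, dropping the trailing "Account A"/"Account B" cells
rows = [[c.get_text(strip=True) for c in row.find_all("div")][:6]
        for row in soup.select("div[class=table-row]")]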
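Since the question mentions pandas, the same header and row lists can also be loaded into a DataFrame. The following is only a sketch under stated assumptions: the sample markup is a trimmed-down stand-in for the HTML in the question (three columns instead of six), and the names columns, records and df are illustrative rather than part of the original answer.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed-down sample that reuses the class names from the question (assumption)
html = """
<div class="table visible">
  <div class="column-header">
    <div class="column-name">NAME</div>
    <div class="column-name">DESCRIPTION</div>
    <div class="column-name">Value1</div>
  </div>
  <div class="table-row">
    <div class="table-cell c1">bronze</div>
    <div class="table-cell c2">3rd tier</div>
    <div class="table-cell c3">0</div>
  </div>
  <div class="table-row">
    <div class="table-cell c1">silver</div>
    <div class="table-cell c2">2nd tier</div>
    <div class="table-cell c3">1</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One column name per "column-name" div inside the "column-header" div
columns = [d.get_text(strip=True)
           for d in soup.select("div.column-header > div.column-name")]

# One record per "table-row" div, truncated to the number of named columns
records = [
    [c.get_text(strip=True) for c in row.select("div.table-cell")][: len(columns)]
    for row in soup.select("div.table-row")
]

df = pd.DataFrame(records, columns=columns)
print(df)
```

All cell values arrive as strings here; converting the numeric columns (for example with pd.to_numeric) would be a separate step.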
To render the table, you can install BeautifulTable:
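from beautifultable import BeautifulTable

table = BeautifulTable()
# Top-level header; the blank entries sit beside each group label.
# Note: column_headers and append_row are the older API names; newer
# beautifultable releases use table.columns.header and table.rows.append.
table.column_headers = ["Product", "", "Account A", "", "Account B", ""]
# The column names go in as the first body row, followed by the data rows
table.append_row(first_row)
for row in rows:
    table.append_row(row)
print(table)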
Output:
+---------+-------------+-----------+--------+-----------+--------+
| Product | | Account A | | Account B | |
+---------+-------------+-----------+--------+-----------+--------+
| NAME | DESCRIPTION | Value1 | Value2 | Value3 | Value4 |
+---------+-------------+-----------+--------+-----------+--------+
| bronze | 3rd tier | 0 | 1 | 1 | 1 |
+---------+-------------+-----------+--------+-----------+--------+
| silver | 2nd tier | 1 | 0 | 3 | 0 |
+---------+-------------+-----------+--------+-----------+--------+
You can also use the tabulate library to format the tabular data.
Original question on Stack Overflow (pandas: extracting table data from HTML with Python where the rows are stored in divs): https://stackoverflow.com/questions/67536810/