pandas - Extracting table data from HTML with Python when the rows are stored in divs

Tags: pandas beautifulsoup

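I am trying to extract some data from a website with Beautiful Soup, specifically a table whose structure and rows are stored in div tags rather than the usual table tags. This means I cannot use the pandas read_html function to simply pull out all the tables.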

This is the HTML I extracted:

<div class="block">
<div class="expand">
<div class="expand-button collapsed" data-toggle="collapse">Forex</div>
<div class="panel-collapse collapse">
<div class="table">
<div class="search">
<div class="date"></div>
<div class="group search">
<span>Search </span>
<input class="search-box" type="search"/>
</div>
<div class="group ">
<span class="label"></span>
<span class="toggle-a"> </span>
<span class="toggle-b"> </span>
</div>
</div>
<div class="skin">
<div class="table visible">
<div class="header">
<div>Product</div>
<div>Account A</div>
</div>
<div class="column-header">
<div class="column-name">NAME</div>
<div class="column-name">DESCRIPTION</div>
<div class="column-name">Value1</div>
<div class="column-name">Value2</div>
<div class="column-name">Value3</div>
<div class="column-name">Value4</div>
</div>
<div class="table-row">
<div class="table-cell c1">bronze</div>
<div class="table-cell c2">3rd tier</div>
<div class="table-cell c3">0</div>
<div class="table-cell c4">1</div>
<div class="table-cell c5">1</div>
<div class="table-cell c6">1</div>
<div class="table-cell c-true">Account A</div>
<div class="table-cell c-standard">Account B</div>
</div>
<div class="table-row">
<div class="table-cell c1">silver</div>
<div class="table-cell c2">2nd tier</div>
<div class="table-cell c3">1</div>
<div class="table-cell c4">0</div>
<div class="table-cell c5">3</div>
<div class="table-cell c6">0</div>
<div class="table-cell c-true">Account A</div>
<div class="table-cell c-standard">Account B</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>

What I want in the end:

| Product |             | Account A |         | Account B |         |
|---------|-------------|-----------|---------|-----------|---------|
| NAME    | DESCRIPTION | Value 1   | Value 2 | Value 3   | Value 4 |
| bronze  | 3rd tier    | 0         | 1       | 1         | 1       |
| silver  | 2nd tier    | 1         | 0       | 3         | 0       |

Is there a simple way to do this with Python or Beautiful Soup?

Best answer

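Here is code that builds the data from the HTML you posted. I parsed your data as HTML, so the variable html is assumed to hold that markup as a string.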

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Column names from the "column-header" div
first_row = [div.get_text(strip=True) for div in soup.select("div.column-header > div.column-name")]
# Keep the first six cells of each row; the trailing "Account A"/"Account B" cells are dropped
rows = [
    [cell.get_text(strip=True) for cell in row.select("div.table-cell")[:6]]
    for row in soup.select("div.table-row")
]
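
Since the question is tagged pandas, the extracted lists could also be loaded into a DataFrame with two-level column headers. The following is only a sketch under stated assumptions: first_row and rows are hard-coded here with the values shown in the answer's printed output so the block runs on its own, and the groups list is written by hand from the layout requested in the question, because the "header" div in the HTML only lists "Product" and "Account A".

```python
import pandas as pd

# Values as they appear in the answer's printed output (hard-coded so this sketch is self-contained)
first_row = ["NAME", "DESCRIPTION", "Value1", "Value2", "Value3", "Value4"]
rows = [
    ["bronze", "3rd tier", "0", "1", "1", "1"],
    ["silver", "2nd tier", "1", "0", "3", "0"],
]

# Assumption: top-level groups copied by hand from the desired table in the question
groups = ["Product", "Product", "Account A", "Account A", "Account B", "Account B"]

df = pd.DataFrame(rows, columns=first_row)

# Cells scraped from HTML are strings; convert the four value columns to integers
df = df.astype({name: int for name in first_row[2:]})

# Replace the flat header with a two-level (group, name) header
df.columns = pd.MultiIndex.from_arrays([groups, first_row])

print(df)
```

A column is then addressed by its (group, name) pair, for example df[("Account A", "Value1")].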

To render the table, you can install BeautifulTable:

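from beautifultable import BeautifulTable

table = BeautifulTable()
# Top-level group labels, written out by hand to match the desired layout
# (beautifultable 1.0+ names these operations table.columns.header and table.rows.append)
table.column_headers = ["Product", "", "Account A", "", "Account B", ""]
table.append_row(first_row)  # the column names become the first data row
for row in rows:
    table.append_row(row)
print(table)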

Output:

+---------+-------------+-----------+--------+-----------+--------+
| Product |             | Account A |        | Account B |        |
+---------+-------------+-----------+--------+-----------+--------+
|  NAME   | DESCRIPTION |  Value1   | Value2 |  Value3   | Value4 |
+---------+-------------+-----------+--------+-----------+--------+
| bronze  |  3rd tier   |     0     |   1    |     1     |   1    |
+---------+-------------+-----------+--------+-----------+--------+
| silver  |  2nd tier   |     1     |   0    |     3     |   0    |
+---------+-------------+-----------+--------+-----------+--------+

You can also format the tabular data with the tabulate library.

Regarding "pandas - Extracting table data from HTML with Python when the rows are stored in divs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67536810/
