pd.read_html changes the number format: the values 1,2,3,4,5,6 in column CCCCCCC come back as 123456. My expected result should keep them as 1,2,3,4,5,6.
HTML code
html = """<html>
<body>
<div id="MMMMMMMM" class="MMMMMMMMMMM" style="">
<table class="OOOOOOOO" style="">
<thead>
<tr class="PPPPPPPPPP">
<td colspan="3" style="font-size:14px;font-weight:bold;" class="QQQQQQQQQQ">AAAAAAA</td>
</tr>
<tr class="RRRRRRRRRR">
<td>BBBBBB</td>
<td>CCCCCCC</td>
<td>AAAAAAA</td>
</tr>
</thead>
<tbody>
<tr class="SSSSSSSS">
<td rowspan="1">DDDDDD</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td rowspan="3">EEEEEEEEE</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td rowspan="1">FFFFFFFFF</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="TTTTTT">
<td rowspan="1">GGGGGGGGG</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td rowspan="1">HHHHHHHHH</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="TTTTTTT">
<td rowspan="1">IIIIIIIIII</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="">
<td rowspan="1">JJJJJJJJ</td>
<td class="L_LLLL67">1,2,3,4,5,6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="TTTTT">
<td rowspan="2">KKKKKKKK</td>
<td class="L_LLLL67">1/2/3/4/5/6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
<tr class="TTTTTT">
<td class="L_LLLL67">1/2/3/4/5/6</td>
<td class="L_LLLL67 f_tar">1234.56</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>"""
Python code
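from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd
# Locate the wrapper <div> that contains the table
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# header=1 takes the second header row (BBBBBB, CCCCCCC, AAAAAAA) as column names.
# StringIO is used because recent pandas versions deprecate literal HTML strings.
df_list = pd.read_html(StringIO(str(table)), header=1)
df_list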
Actual output
[        BBBBBB      CCCCCCC  AAAAAAA
0       DDDDDD       123456  1234.56
1    EEEEEEEEE       123456  1234.56
2    EEEEEEEEE       123456  1234.56
3    EEEEEEEEE       123456  1234.56
4    FFFFFFFFF       123456  1234.56
5    GGGGGGGGG       123456  1234.56
6    HHHHHHHHH       123456  1234.56
7   IIIIIIIIII       123456  1234.56
8     JJJJJJJJ       123456  1234.56
9     KKKKKKKK  1/2/3/4/5/6  1234.56
10    KKKKKKKK  1/2/3/4/5/6  1234.56]
Expected output
[        BBBBBB      CCCCCCC  AAAAAAA
0       DDDDDD  1,2,3,4,5,6  1234.56
1    EEEEEEEEE  1,2,3,4,5,6  1234.56
2    EEEEEEEEE  1,2,3,4,5,6  1234.56
3    EEEEEEEEE  1,2,3,4,5,6  1234.56
4    FFFFFFFFF  1,2,3,4,5,6  1234.56
5    GGGGGGGGG  1,2,3,4,5,6  1234.56
6    HHHHHHHHH  1,2,3,4,5,6  1234.56
7   IIIIIIIIII  1,2,3,4,5,6  1234.56
8     JJJJJJJJ  1,2,3,4,5,6  1234.56
9     KKKKKKKK  1/2/3/4/5/6  1234.56
10    KKKKKKKK  1/2/3/4/5/6  1234.56]
Best answer
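You need to add the thousands parameter and set it to None. Its default is ',', so read_html treats the commas in 1,2,3,4,5,6 as thousands separators and strips them, which is why the value comes back as 123456.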
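from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd
# Locate the wrapper <div> that contains the table
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# thousands=None disables thousands-separator handling, so commas are kept as-is.
# StringIO is used because recent pandas versions deprecate literal HTML strings.
df_list = pd.read_html(StringIO(str(table)), header=1, thousands=None)
df_list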
Output:
[        BBBBBB      CCCCCCC  AAAAAAA
0       DDDDDD  1,2,3,4,5,6  1234.56
1    EEEEEEEEE  1,2,3,4,5,6  1234.56
2    EEEEEEEEE  1,2,3,4,5,6  1234.56
3    EEEEEEEEE  1,2,3,4,5,6  1234.56
4    FFFFFFFFF  1,2,3,4,5,6  1234.56
5    GGGGGGGGG  1,2,3,4,5,6  1234.56
6    HHHHHHHHH  1,2,3,4,5,6  1234.56
7   IIIIIIIIII  1,2,3,4,5,6  1234.56
8     JJJJJJJJ  1,2,3,4,5,6  1234.56
9     KKKKKKKK  1/2/3/4/5/6  1234.56
10    KKKKKKKK  1/2/3/4/5/6  1234.56]
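To see the effect in isolation, the two calls can be compared side by side on a much smaller table. This is only a minimal sketch: the table and the column names label, codes and amount are made up for illustration and are not part of the original question, and it assumes pandas plus an HTML parser backend that read_html can use (lxml, or bs4 with html5lib) are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table, reduced from the question's HTML for illustration
small_html = """<table>
<tr><th>label</th><th>codes</th><th>amount</th></tr>
<tr><td>a</td><td>1,2,3,4,5,6</td><td>1234.56</td></tr>
<tr><td>b</td><td>1/2/3/4/5/6</td><td>1234.56</td></tr>
</table>"""

# Default behaviour: thousands=',' so commas in number-like cells are stripped
default_df = pd.read_html(StringIO(small_html))[0]

# thousands=None: no separator handling, cell text is kept unchanged
raw_df = pd.read_html(StringIO(small_html), thousands=None)[0]

print(default_df["codes"].tolist())
print(raw_df["codes"].tolist())
```

With the default setting the first codes value should lose its commas, as in the question, while the thousands=None result should keep both cells exactly as written in the HTML. Note that thousands=None applies to the whole table, so a column that genuinely uses ',' as a thousands separator would then need to be cleaned separately.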
Regarding "python - pd.read_html changed number formatting", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68264711/