python - pd.read_html changes the number format

Tags: python pandas list dataframe beautifulsoup

After pd.read_html parses the table, the values 1,2,3,4,5,6 in column CCCCCCC come back as 123456. My expected result should keep them as 1,2,3,4,5,6.

HTML code

html = """<html>
<body>
<div id="MMMMMMMM" class="MMMMMMMMMMM" style="">
        <table class="OOOOOOOO" style="">
            <thead>
                <tr class="PPPPPPPPPP">
                    <td colspan="3" style="font-size:14px;font-weight:bold;" class="QQQQQQQQQQ">AAAAAAA</td>
                </tr>
                <tr class="RRRRRRRRRR">
                    <td>BBBBBB</td>
                    <td>CCCCCCC</td>
                    <td>AAAAAAA</td>
                </tr>
            </thead>
            <tbody>
                    <tr class="SSSSSSSS">
                        <td rowspan="1">DDDDDD</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="3">EEEEEEEEE</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                        <tr class="">
                            <td class="L_LLLL67">1,2,3,4,5,6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
                    <tr class="">
                        <td rowspan="1">FFFFFFFFF</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTT">
                        <td rowspan="1">GGGGGGGGG</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">HHHHHHHHH</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTTTT">
                        <td rowspan="1">IIIIIIIIII</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="">
                        <td rowspan="1">JJJJJJJJ</td>
                        <td class="L_LLLL67">1,2,3,4,5,6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                    <tr class="TTTTT">
                        <td rowspan="2">KKKKKKKK</td>
                        <td class="L_LLLL67">1/2/3/4/5/6</td>
                        <td class="L_LLLL67 f_tar">1234.56</td>
                    </tr>
                        <tr class="TTTTTT">
                            <td class="L_LLLL67">1/2/3/4/5/6</td>
                            <td class="L_LLLL67 f_tar">1234.56</td>
                        </tr>
            </tbody>
        </table>
</body>
</html>"""

Python code

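from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# StringIO wraps the markup because newer pandas releases deprecate passing literal HTML as a plain string.
df_list = pd.read_html(StringIO(str(table)), header=1)
df_list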

Actual result

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD       123456  1234.56
 1    EEEEEEEEE       123456  1234.56
 2    EEEEEEEEE       123456  1234.56
 3    EEEEEEEEE       123456  1234.56
 4    FFFFFFFFF       123456  1234.56
 5    GGGGGGGGG       123456  1234.56
 6    HHHHHHHHH       123456  1234.56
 7   IIIIIIIIII       123456  1234.56
 8     JJJJJJJJ       123456  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]

Expected result

 [        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD  1,2,3,4,5,6  1234.56
 1    EEEEEEEEE  1,2,3,4,5,6  1234.56
 2    EEEEEEEEE  1,2,3,4,5,6  1234.56
 3    EEEEEEEEE  1,2,3,4,5,6  1234.56
 4    FFFFFFFFF  1,2,3,4,5,6  1234.56
 5    GGGGGGGGG  1,2,3,4,5,6  1234.56
 6    HHHHHHHHH  1,2,3,4,5,6  1234.56
 7   IIIIIIIIII  1,2,3,4,5,6  1234.56
 8     JJJJJJJJ  1,2,3,4,5,6  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]

Best answer

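You need to pass the thousands parameter and set it to None. Its default is ',', so read_html treats the commas as thousands separators and strips them, which is how 1,2,3,4,5,6 turns into 123456.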

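from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', attrs={'id': 'MMMMMMMM'})
# thousands=None stops read_html from stripping commas out of number-like cells.
# StringIO is not part of the fix: newer pandas releases deprecate passing literal HTML as a plain string.
df_list = pd.read_html(StringIO(str(table)), header=1, thousands=None)
df_list

Output: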
[        BBBBBB      CCCCCCC  AAAAAAA
 0       DDDDDD  1,2,3,4,5,6  1234.56
 1    EEEEEEEEE  1,2,3,4,5,6  1234.56
 2    EEEEEEEEE  1,2,3,4,5,6  1234.56
 3    EEEEEEEEE  1,2,3,4,5,6  1234.56
 4    FFFFFFFFF  1,2,3,4,5,6  1234.56
 5    GGGGGGGGG  1,2,3,4,5,6  1234.56
 6    HHHHHHHHH  1,2,3,4,5,6  1234.56
 7   IIIIIIIIII  1,2,3,4,5,6  1234.56
 8     JJJJJJJJ  1,2,3,4,5,6  1234.56
 9     KKKKKKKK  1/2/3/4/5/6  1234.56
 10    KKKKKKKK  1/2/3/4/5/6  1234.56]
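
One side effect worth knowing: with thousands=None, a column that really does contain thousands-separated numbers also stays as text. The sketch below is a self-contained comparison on a small made-up table (the column names codes and amount, the sample values, and the helper read_first_table are all assumptions for illustration, not part of the original question). It reads the table once with the default and once with thousands=None, then converts only the genuinely numeric column back by hand.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: "codes" holds comma-separated labels,
# "amount" holds genuine thousands-separated numbers.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>codes</th><th>amount</th></tr>
  </thead>
  <tbody>
    <tr><td>1,2,3,4,5,6</td><td>1,234.56</td></tr>
    <tr><td>7,8,9</td><td>2,000.00</td></tr>
  </tbody>
</table>
"""


def read_first_table(html, **kwargs):
    """Return the first table pd.read_html finds in an HTML string."""
    return pd.read_html(StringIO(html), **kwargs)[0]


# Default thousands=",": commas are stripped from every number-like cell,
# so the labels in "codes" are merged into a single number.
df_default = read_first_table(SAMPLE_HTML)

# thousands=None: cell text is left untouched, so both columns stay text.
df_raw = read_first_table(SAMPLE_HTML, thousands=None)

# Restore numbers only in the column that really is numeric.
df_raw["amount"] = pd.to_numeric(
    df_raw["amount"].str.replace(",", "", regex=False)
)

print(df_default)
print(df_raw)
```

Converting per column after the fact keeps the decision explicit: read_html applies thousands to the whole table, so it cannot tell a label like 1,2,3,4,5,6 apart from a number like 1,234.56.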

Regarding "python - pd.read_html changes the number format", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68264711/
