python - Reading an HTML table correctly

Tags: python html pandas

Here is an HTML table:



<table width="100%" cellpadding="4" cellspacing="0" style="page-break-before: always">
        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">A</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">B</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">C</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">D</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">E</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">F</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">G</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">H</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">I</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">J</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">K</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">L</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P2</font></font></font></p>
                </td>
        </tr>
</table>





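The last row here has twice as many columns as the other rows. When I try to read the table into a pandas DataFrame, I get the following result: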

import pandas as pd
table = pd.read_html('1111.html')
table[0]

   0   1  2   3  4   5  6   7
0  A   A  B   B  C   C  D   D
1  E   E  F   F  G   G  H   H
2  I   I  J   J  K   K  L   L
3  M  M2  N  N2  O  O2  P  P2


How can I read it correctly, without the duplicated values? I don't need the last row.

Best answer

You can parse the table with BeautifulSoup and then convert the result to a DataFrame:

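import pandas as pd
from bs4 import BeautifulSoup as soup
from pathlib import Path
html = Path('1111.html').read_text()  # the file from the question
# strip() removes the whitespace around each cell's text; colspan is ignored, so no value is repeated.
df = pd.DataFrame([[k.strip() for i in b.find_all('td') if (k:=i.text) is not None] for b in soup(html, 'html.parser').table.find_all('tr')])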


Output:

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2
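
The output above still contains the last row and the None padding, while the question says the last row is not needed. A minimal sketch of trimming both follows. The inline `html` string is a shortened stand-in for the table in the question (an assumption made to keep the example self-contained), and `trimmed` is just an illustrative name:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the question's table (assumption: same shape, fewer cells).
html = """
<table>
  <tr><td colspan="2"><p> A </p></td><td colspan="2"><p> B </p></td></tr>
  <tr><td colspan="2"><p> E </p></td><td colspan="2"><p> F </p></td></tr>
  <tr><td><p>M</p></td><td><p>M2</p></td><td><p>N</p></td><td><p>N2</p></td></tr>
</table>
"""

# One list per <tr>; each <td> contributes its text once, whatever its colspan.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in BeautifulSoup(html, "html.parser").table.find_all("tr")
]

# Shorter rows are padded with missing values when the DataFrame is built.
df = pd.DataFrame(rows)

# Drop the last row, then drop the columns that are now entirely empty.
trimmed = df.iloc[:-1].dropna(axis=1, how="all")
print(trimmed)
```

Dropping the row first matters here: the padding columns only become entirely empty once the wider last row is gone.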


Edit: a solution without assignment expressions (for Python versions before 3.8):

df = pd.DataFrame([[i.text.strip() for i in b.find_all('td')] for b in soup(html, 'html.parser').table.find_all('tr')])


Output:

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2
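
As a possible alternative that keeps `pd.read_html`, the duplicates can be removed after the fact. `read_html` appears to repeat a cell's text across every column the cell spans, so when each kept cell spans exactly two columns (as in this table) it may be enough to drop the last row and keep every second column. This is only a sketch under that assumption. The frame `table0` is rebuilt by hand from the output shown in the question so the example runs without the HTML file, and the names `table0` and `deduped` are illustrative:

```python
import pandas as pd

# Rebuilt by hand from the read_html output shown in the question.
table0 = pd.DataFrame([
    ["A", "A", "B", "B", "C", "C", "D", "D"],
    ["E", "E", "F", "F", "G", "G", "H", "H"],
    ["I", "I", "J", "J", "K", "K", "L", "L"],
    ["M", "M2", "N", "N2", "O", "O2", "P", "P2"],
])

# Drop the last row and keep every second column (0, 2, 4, 6).
deduped = table0.iloc[:-1, ::2].copy()

# Renumber the remaining columns 0..3.
deduped.columns = range(deduped.shape[1])
print(deduped)
```

This would not generalize to tables whose cells span differing numbers of columns; parsing the cells directly, as in the answer above, avoids that limitation.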
