python - Reading an HTML table correctly

Tags: python html pandas

Here is an HTML table:

<table width="100%" cellpadding="4" cellspacing="0" style="page-break-before: always">
        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">A</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">B</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">C</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">D</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">E</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">F</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">G</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">H</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">I</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">J</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">K</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">L</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P2</font></font></font></p>
                </td>
        </tr>
</table>

The last row has twice as many columns as the other rows. When I try to read the table into a pandas DataFrame, I get the following result:

import pandas as pd

# read_html returns a list with one DataFrame per <table> found in the file
table = pd.read_html('1111.html')
table[0]

   0   1  2   3  4   5  6   7
0  A   A  B   B  C   C  D   D
1  E   E  F   F  G   G  H   H
2  I   I  J   J  K   K  L   L
3  M  M2  N  N2  O  O2  P  P2

How can I read it correctly, without the duplicated values? I don't need the last row.

Best answer

You can parse the table with BeautifulSoup and then convert the result to a DataFrame:

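import pandas as pd
from bs4 import BeautifulSoup as soup
with open('1111.html') as f:  # the file from the question
    html = f.read()
# One list of cell texts per <tr>; strip() removes the whitespace surrounding each value
df = pd.DataFrame([[k.strip() for i in b.find_all('td') if (k:=i.text) is not None] for b in soup(html, 'html.parser').table.find_all('tr')])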

Output:

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2
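
The question also says the last row is not needed, and the rows above are padded with None because the last row is longer. A minimal sketch of trimming such a frame follows. The `html` string is a made-up, shortened stand-in for the table in the question, not the original file, and `rows` and `trimmed` are names chosen for illustration:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the question's table:
# two rows of colspan="2" cells and a final row with twice as many cells.
html = """
<table>
  <tr><td colspan="2"><p>A</p></td><td colspan="2"><p>B</p></td></tr>
  <tr><td colspan="2"><p>E</p></td><td colspan="2"><p>F</p></td></tr>
  <tr><td><p>M</p></td><td><p>M2</p></td><td><p>N</p></td><td><p>N2</p></td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").table
# One list of stripped cell texts per <tr>; colspan is ignored here.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
df = pd.DataFrame(rows)

# Drop the last row, then drop the columns that are now entirely empty.
trimmed = df.iloc[:-1].dropna(axis=1, how="all")
print(trimmed)
```

Dropping the row first matters: the padding columns only become entirely empty once the longer last row is gone.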

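Edit: a solution without assignment expressions (which require Python 3.8 or later):

# Same result without the walrus operator; `html` and `soup` are defined as above
df = pd.DataFrame([[i.text.strip() for i in b.find_all('td')] for b in soup(html, 'html.parser').table.find_all('tr')])

Output: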

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2
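
As an alternative that stays within pandas, one could keep `pd.read_html` and undo the duplication afterwards. This is only a sketch under assumptions: it relies on every cell above the last row spanning exactly two columns, as in the question, and it uses a made-up, shortened stand-in for the table. `pd.read_html` also needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened stand-in for the question's table.
html = """
<table>
  <tr><td colspan="2"><p>A</p></td><td colspan="2"><p>B</p></td></tr>
  <tr><td colspan="2"><p>E</p></td><td colspan="2"><p>F</p></td></tr>
  <tr><td><p>M</p></td><td><p>M2</p></td><td><p>N</p></td><td><p>N2</p></td></tr>
</table>
"""

# read_html repeats a cell's value across every column it spans.
df = pd.read_html(StringIO(html))[0]

# Skip the last row and keep every second column, which removes the repeats.
result = df.iloc[:-1, ::2].copy()
result.columns = range(result.shape[1])  # renumber the columns 0, 1, ...
print(result)
```

The BeautifulSoup approach in the answer is more general, because it does not depend on a fixed colspan.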

Regarding python - reading an HTML table correctly, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58486063/
