python - How to extract elements from HTML using BeautifulSoup

I'm starting to learn Python and would like to try using BeautifulSoup to extract the elements from the HTML below.

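This HTML is taken from a voice recording system that logs the date and time in local time and in UTC, the call duration, the called number and name, the calling number and name, and so on.
There are usually hundreds of these entries.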

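What I'm trying to do is extract the elements and print each entry on one line in comma-separated format, so that it can be compared with the call detail records from the call manager. This would help verify that every call was recorded and none were missed.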

I believe BeautifulSoup is the right tool for this.
Can someone point me in the right direction?

<tbody>
   <tr class="formRowLight">

<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight">&nbsp;</td>

   <tr class="formRowLight">

<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight">&nbsp;</td>
</tr>
</tbody>

Best answer

pandas.read_html() would make things much easier: it converts the tabular data from an HTML table into a DataFrame, which you can later dump into CSV if needed.

Here is some sample code to get you started:

import pandas as pd

data = """
<table>
    <thead>
        <tr>
            <th>Date</th>
            <th>Name</th>
            <th>ID</th>
        </tr>
    </thead>
    <tbody>
        <tr class="formRowLight">
            <td class="formRowLight">24/10/16<br>16:24:47</td>
            <td class="formRowLight">Joe Smith</td>
            <td class="formRowLight">1432875648934</td>
        </tr>

        <tr class="formRowLight">
            <td class="formRowLight">24/10/16<br>17:33:02</td>
            <td class="formRowLight">Billy Bob</td>
            <td class="formRowLight">9934959586849</td>
        </tr>
    </tbody>
</table>"""

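# Newer pandas versions expect a file-like object rather than a literal HTML string
from io import StringIO

df = pd.read_html(StringIO(data))[0]
print(df.to_csv(index=False))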

Prints:

Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
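
Note that the date and time run together in the output above, because the `<br>` between them is dropped when the cell text is read. One possible way to recover a usable timestamp is sketched below. It assumes the fused value is always a dd/mm/yy date immediately followed by HH:MM:SS (an assumption based on the sample rows, not something the recording system guarantees), and the `Timestamp` column name is made up for illustration.

```python
import pandas as pd

# The values exactly as read_html() produced them above: date and time fused
df = pd.DataFrame({
    "Date": ["24/10/1616:24:47", "24/10/1617:33:02"],
    "Name": ["Joe Smith", "Billy Bob"],
    "ID": [1432875648934, 9934959586849],
})

# Assumption: dd/mm/yy followed directly by HH:MM:SS, with no separator
df["Timestamp"] = pd.to_datetime(df["Date"], format="%d/%m/%y%H:%M:%S")

# Emit the cleaned-up columns as CSV
print(df[["Timestamp", "Name", "ID"]].to_csv(index=False))
```

Parsing into a real datetime column, rather than just inserting a space, should also make it easier to line the rows up against timestamps from another system.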

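FYI, read_html() can use BeautifulSoup to parse the HTML under the hood (by default it tries lxml first and falls back to BeautifulSoup with html5lib).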
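
If you would rather use BeautifulSoup directly, as the question asks, a minimal sketch might look like the following. It uses a shortened, well-formed version of the sample rows (each `<tr>` is closed, which is an assumption; the snippet in the question is missing a `</tr>`, and with the built-in `html.parser` that may cause rows to nest). The variable names `html` and `lines` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed sample modelled on the rows in the question
html = """
<table><tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">24/10/16 07:24:47</td>
<td class="formRowLight">00:45</td>
<td class="formRowLight">31301</td>
<td class="formRowLight">Joe Smith</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">24/10/16 08:33:02</td>
<td class="formRowLight">00:58</td>
<td class="formRowLight">35664</td>
<td class="formRowLight">Billy Bob</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

lines = []
for row in soup.find_all("tr", class_="formRowLight"):
    # separator=" " keeps the date and time apart where the <br> was
    cells = [td.get_text(separator=" ", strip=True) for td in row.find_all("td")]
    lines.append(",".join(cells))

print("\n".join(lines))
```

A plain `",".join(...)` is enough for the sample data; if a name could itself contain a comma, the standard `csv` module would be a safer way to write the rows.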

Regarding "python - How to extract elements from HTML using BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40262728/
