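python - How to fix incorrect HTML tags in a string using Python?

Tags: python html chatgpt-api

I'm using Python to generate articles containing HTML tags from OpenAI's API. The articles are long, and most of the time the result is correct, but sometimes the HTML tags are malformed. Here is an example: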

<h3><strong>1. Gaze:</strong></h 3 >
<p><strong>Gaze</ strong>. is a free and easy-to-use video streaming app , supporting both live and pre-recorded content. It supports up to 10 people joining a single session at once, with synchronized video playback for all users. Additionally, Gaze offers its own messaging service so you can chat during the viewing experience.</p>
 
<h3><strong>2. Chrono:</strong></h 3 >

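How can I fix these HTML tags? I have already tried bs4, but it puts the tags on separate lines, which is not what I want.

Is there another solution using Python?

I tried bs4 but did not get good results...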

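Best answer

The HTML in your question needs correcting because its closing tags contain extra whitespace. You can repair these malformed closing tags by removing all whitespace inside them. Here is an example: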

import re

def remove_spaces_from_closing_tags(html):
    fixed_html = ""
    # The regular expression separates tags from the text between them
    for tag, other_content in re.findall(r'(<[^>]*>)|([^<]*)', html):
        if tag:
            # A closing tag starts with "<" and "/" (possibly separated by spaces).
            # Remove all whitespace from closing tags; leave other tags as they are,
            # so that a "/" inside an attribute value (e.g. a URL) does not trigger stripping.
            fixed_html += re.sub(r'\s+', '', tag) if re.match(r'<\s*/', tag) else tag
        if other_content:
            # Leave text content unchanged
            fixed_html += other_content
    return fixed_html



# Input malformed HTML
html = """
<h3><strong>1. Gaze:</strong></h 3 >
<p><strong>Gaze</ strong>. is a free and easy-to-use video streaming app , supporting both live and pre-recorded content. It supports up to 10 people joining a single session at once, with synchronized video playback for all users. Additionally, Gaze offers its own messaging service so you can chat during the viewing experience.</p>
 
<h3><strong>2. Chrono:< / st rong ></h 3 >
"""

print(remove_spaces_from_closing_tags(html))

This example code outputs:

<h3><strong>1. Gaze:</strong></h3>
<p><strong>Gaze</strong>. is a free and easy-to-use video streaming app , supporting both live and pre-recorded content. It supports up to 10 people joining a single session at once, with synchronized video playback for all users. Additionally, Gaze offers its own messaging service so you can chat during the viewing experience.</p>
 
<h3><strong>2. Chrono:</strong></h3>

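You can fix malformed closing tags with the remove_spaces_from_closing_tags function defined above. Your example does not show any malformed opening tags, but keep in mind that the same approach does not work for them. For example, removing all whitespace from a malformed opening tag such as <h 3 class="some-class"> does not fix it, because that also removes the space separating the tag name from its attributes. So remove_spaces_from_closing_tags only fixes closing tags with extra whitespace, like those in the sample HTML.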
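
The snippet below illustrates why stripping whitespace breaks an opening tag, and sketches one possible workaround that is not part of the original answer: rejoin the tag name only when the result is a name from a known list. KNOWN_TAGS and fix_opening_tag are hypothetical names, and the list must be adapted to the tags your articles actually use.

```python
import re

# Hypothetical whitelist of tag names expected in the generated articles.
KNOWN_TAGS = {"h1", "h2", "h3", "h4", "p", "strong", "em", "ul", "ol", "li", "a"}

def fix_opening_tag(tag):
    """Rejoin a tag name split by stray spaces, keeping attributes intact.

    Tags whose name cannot be matched against KNOWN_TAGS are returned unchanged.
    """
    inner = tag[1:-1].strip()
    # Candidate end positions for the tag name: each whitespace run, then the end
    ends = [m.start() for m in re.finditer(r'\s+', inner)] + [len(inner)]
    for end in ends:
        name = re.sub(r'\s+', '', inner[:end])
        if name.lower() in KNOWN_TAGS:
            rest = inner[end:].strip()
            return f"<{name} {rest}>" if rest else f"<{name}>"
    return tag

broken = '<h 3 class="some-class">'

# Naive approach: the tag name and the attribute are fused together
print(re.sub(r'\s+', '', broken))   # <h3class="some-class">

# Whitelist approach: only the tag name is rejoined
print(fix_opening_tag(broken))      # <h3 class="some-class">
```

The shortest matching name wins, so a valid tag such as <p class="x"> is left alone. Ambiguous cases remain possible, which is why this is only a heuristic.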

Source: a similar question on Stack Overflow, "python - How to fix incorrect HTML tags in a string using Python?": https://stackoverflow.com/questions/75712694/
