python - Parsing a broken HTML page in Python

Tags: python html-parsing beautifulsoup lxml

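I'm trying to parse a broken HTML page that contains a comment nested inside another comment, and all the well-known HTML parsers (such as beautifulsoup, lxml, and HTMLParser) report syntax errors. The page source is below. How can I ignore the broken part and parse the rest of the page?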

<html xmlns="http://www.w3.org/1999/xhtml"><head>

<script language="JavaScript">
<!--
     function setTimeOffsetVars (Link) { 
   // code removed
 } 

<!-- Image Preloader - takes an array of images to preload --> 
    function warningCheck(e, warnMsg) {
   // code removed
}
-->
</script>

</head>

<body topmargin="0" leftmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0">
<!-- lot of useful code -->
</body></html>

Best answer

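If you know what the problem is, you can preprocess the page: first use a crude method, such as a regular expression, to remove the offending inner comment, then hand the result to a real parser.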
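A minimal sketch of that preprocessing step, under some assumptions: the nested comment sits on a single line inside a `<script>` block, as in the page above, and the standard library's `html.parser` stands in for whichever "real parser" you prefer. The names `strip_inner_comments`, `TagCollector`, and `parse_tags` are hypothetical helpers introduced for illustration, not part of any library.

```python
import re
from html.parser import HTMLParser

# A complete "<!-- ... -->" that fits on one line. In the sample page this
# matches only the nested "Image Preloader" comment, because the outer
# comment opens and closes on different lines.
INNER_COMMENT = re.compile(r"<!--[^\n]*?-->")

# Opening tag, body, and closing tag of each <script> element.
SCRIPT_BLOCK = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script>)", re.IGNORECASE | re.DOTALL
)


def strip_inner_comments(html):
    """Remove single-line comments found inside <script> blocks."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + INNER_COMMENT.sub("", body) + closing

    return SCRIPT_BLOCK.sub(clean, html)


class TagCollector(HTMLParser):
    """Stand-in for a real parser: records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_tags(html):
    """Preprocess the page, then parse it and return the start tags."""
    collector = TagCollector()
    collector.feed(strip_inner_comments(html))
    collector.close()
    return collector.tags
```

To use a different parser, pass the cleaned string to it instead, for example `BeautifulSoup(strip_inner_comments(page), "html.parser")`. The regular expression is deliberately crude: it would also delete a legitimate one-line comment inside a script, so adjust the pattern if your pages differ from the sample.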

For more on python - Parsing a broken HTML page in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/14037866/
