python - Removing text outside of a tag's contents

With pyparsing, the opposite can be achieved as follows:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)

How can I keep only the contents of the "table" tag?

Update 0:

I tried:

# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print(data)

and I get something like this...

garbages
<input type="hidden" name="cassstx"   value="en_US:frontend"></form></td></tr></table></span></td></tr></table> 

{<"table"> SkipTo:(</"table">) </"table">} 
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">} 
</div> 
even more garbages

Update 2:

Thanks, Martelli. What I needed is:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'

tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]

print(thetable)
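
The `[0][2]` lookup above picks out the first match only, and the position of the body token appears to depend on how the opening tag parses (attributes add tokens ahead of it). A minimal sketch, assuming a results name is attached to the `SkipTo` so that every table body can be collected by name; the name `contents` and the sample `data` string are made up here for illustration:

```python
from pyparsing import makeHTMLTags, SkipTo

# Sample input invented for this sketch: two tables, one with an attribute
data = 'x<table>one</table>y<table class="t">two</table>z'

table_open, table_close = makeHTMLTags("table")
# Name the skipped text so it can be read back without a positional index
table_body = table_open + SkipTo(table_close)("contents") + table_close

# searchString yields one result per match; read each body by its name
all_tables = [match.contents for match in table_body.searchString(data)]
print(all_tables)
```

Because `SkipTo(table_close)` stops at the first closing tag it meets, nested tables would likely be cut short by this expression.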

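Best answer

You could first extract the table (similarly to the way you're now extracting the script, but without the removal of course ;-), obtaining a thetable string; then you extract the script with replaceWith(thetable) instead of replaceWith(''). Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. For example (preserving specifically the contents of the table, not the table tags):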

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'

# Phase 1: find the first table and grab the text between its tags
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]

# Phase 2: replace each whole <script>...</script> block with that text
removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)

print(data)

This prints beforebuhafter (what's outside the script tag, with the contents of the table tag sandwiched in between), hopefully "as desired".
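
The "more elaborate parse action" mentioned above could look roughly like the following single-pass sketch. It is an illustration rather than the answerer's code: the function name `keep_tables` and the results names `body` and `contents` are assumptions introduced here.

```python
from pyparsing import makeHTMLTags, SkipTo

data = 'before<script>ciao<table>buh</table>bye</script>after'

table_open, table_close = makeHTMLTags("table")
table_body = table_open + SkipTo(table_close)("contents") + table_close

script_open, script_close = makeHTMLTags("script")
script_body = script_open + SkipTo(script_close)("body") + script_close

def keep_tables(tokens):
    # Hypothetical helper: the returned string replaces the whole
    # <script>...</script> match, keeping only table contents found inside it
    tables = table_body.searchString(tokens.body)
    return "".join(match.contents for match in tables)

script_body.setParseAction(keep_tables)
result = script_body.transformString(data)
print(result)  # beforebuhafter
```

Unlike the two-phase version, each script block here is replaced by the tables found inside that same block, so several script blocks with different tables would each keep their own contents. pyparsing 3 also exposes snake_case spellings of these calls (for example `make_html_tags` and `transform_string`); the camelCase names are used to match the rest of this thread.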

Regarding "python - Removing text outside of a tag's contents", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3032532/
