With pyparsing, the opposite effect (removing a tag together with its contents) can be achieved as follows:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
How can I keep only the contents of the "table" tag?
Update 0:
I tried:
# keep only the table
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print data
and I got something like this:
garbages
<input type="hidden" name="cassstx" value="en_US:frontend"></form></td></tr></table></span></td></tr></table>
{<"table"> SkipTo:(</"table">) </"table">}
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">}
</div>
even more garbages
Update 2:
Thanks, Martelli. What I needed is:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
print thetable
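Best answer
You could first extract the table (similarly to the way you are now extracting the script, but without the removal of course ;-), obtaining a thetable string; then extract the script with replaceWith(thetable) instead of replaceWith(''). Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. For example (preserving specifically the contents of the table, not the table tags):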
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
print data
This prints beforebuhafter (what is outside the script tag, with the contents of the table tag sandwiched in between), hopefully "as desired".
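For comparison, here is a minimal sketch of the "more elaborate parse action" alternative mentioned above, written for Python 3. It is an illustration under stated assumptions, not the answerer's code: the function name keep_tables and the results names "body" and "contents" are hypothetical, chosen only for this example. The results names are used so that the code does not depend on a positional index such as [0][2].

```python
from pyparsing import SkipTo, makeHTMLTags

# Grammar for a <table>...</table> element; the text between the tags
# is given the (hypothetical) results name "contents".
table_open, table_close = makeHTMLTags("table")
table_body = table_open + SkipTo(table_close)("contents") + table_close

# Grammar for a <script>...</script> element; the text between the tags
# is given the (hypothetical) results name "body".
script_open, script_close = makeHTMLTags("script")
script_body = script_open + SkipTo(script_close)("body") + script_close


def keep_tables(tokens):
    """Replace a matched script element with the contents of its tables."""
    found = table_body.searchString(tokens["body"])
    return ["".join(match["contents"] for match in found)]


script_body.setParseAction(keep_tables)

data = "before<script>ciao<table>buh</table>bye</script>after"
result = script_body.transformString(data)
print(result)  # prints: beforebuhafter
```

Unlike the two-phase version, this looks for tables inside each script element separately, so every script is replaced by the contents of its own tables, and a script without any table is simply removed.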
Regarding "python - stripping text except the contents of a tag", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3032532/