I have a sentence:
text="The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012".
I want to insert <PERSON></PERSON> tags around "Obama", so the result would look like this:
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
I want to find the substring (for example, Obama) that has no <PERSON> tag before it and no </PERSON> tag after it, but I don't know the correct regular expression syntax in Python.
I am new to Python. Using a simple regular expression, re.sub(namedEntity, "<PERSON>"+namedEntity+"</PERSON>", text) gives this output, with the first "Obama" tagged twice:
The president of America is <PERSON>Barack <PERSON>Obama</PERSON></PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
Here is my code (using Python 2.7):
import re
result=re.sub(r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
print "result: "+result
Output:
result: <PERSON>Obama</PERSON>
I don't know whether this is the first "Obama" or the second one.
Thanks in advance for your help.
Best answer
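You were close. In your new regular expression r"((?!<PERSON>).*"+namedEntity+".*(?!</PERSON>))" you have .* before and after "Obama", which match any characters before and after it. The lookarounds then have no effect, because the tags are consumed as part of the match, so the whole sentence is matched and replaced by a single tagged "Obama". If you remove the two .* parts, you get the result you want: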
>>> import re
>>> text = "The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. Obama was reelected president in November 2012"
>>> namedEntity = 'Obama'
>>> result = re.sub(r"((?!<PERSON>)"+namedEntity+"(?!</PERSON>))","<PERSON>"+namedEntity+"</PERSON>",text)
>>> print result
The president of America is <PERSON>Barack Obama</PERSON>. He was born on August 4, 1961. <PERSON>Obama</PERSON> was reelected president in November 2012
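One caveat about the pattern above: (?!<PERSON>) is a lookahead, so it is tested at the position where "Obama" starts and always succeeds. Only the trailing (?!</PERSON>) filters anything. That is enough for this sentence, but it would still re-tag a name that sits inside an already tagged span without touching the closing tag (for example "Barack" in <PERSON>Barack Obama</PERSON>). Below is a minimal sketch of one possible alternative, not the method from the question: match whole tagged spans first and return them unchanged, and wrap only the bare occurrences. The helper name tag_untagged and its parameters are hypothetical, and the sketch assumes tags are well formed and never nested.

```python
import re


def tag_untagged(text, named_entity, tag="PERSON"):
    """Wrap occurrences of named_entity that lie outside <tag>...</tag> spans.

    Hypothetical helper; assumes tags are well formed and never nested.
    """
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    # First alternative: a whole, already tagged span (no capture group).
    # Second alternative: a bare entity, captured in group 1.
    pattern = re.compile(
        re.escape(open_tag) + r".*?" + re.escape(close_tag)
        + r"|(" + re.escape(named_entity) + r")",
        re.DOTALL,
    )

    def replace(match):
        if match.group(1) is None:
            # Already tagged span: return it unchanged.
            return match.group(0)
        # Bare entity: wrap it in the tag.
        return open_tag + match.group(1) + close_tag

    return pattern.sub(replace, text)


text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
result = tag_untagged(text, "Obama")
print(result)
```

re.escape is used so that an entity containing regex metacharacters (a dot in "J. Smith", for instance) is matched literally. Calling tag_untagged(text, "Barack") returns the text unchanged, because the only "Barack" is inside an existing tagged span.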
For future regex testing, regex101 is a good tool for checking how a pattern behaves as you edit it live. For your case, this shows what is happening.
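The same kind of check can also be done locally in the interpreter. Here is a small sketch that inspects what each pattern actually matches. The text and patterns are the ones from the question and the answer, while the variable names named_entity, greedy and fixed are only illustrative.

```python
import re

text = ("The president of America is <PERSON>Barack Obama</PERSON>. "
        "He was born on August 4, 1961. "
        "Obama was reelected president in November 2012")
named_entity = "Obama"

# Pattern from the question: the greedy .* on both sides swallow the
# whole sentence, so the match spans the entire text.
greedy = re.search(
    r"((?!<PERSON>).*" + named_entity + ".*(?!</PERSON>))", text)
print(greedy.span() == (0, len(text)))  # prints True

# Pattern from the answer: only an "Obama" that is not directly
# followed by </PERSON> is matched.
fixed = [m.span() for m in re.finditer(
    r"((?!<PERSON>)" + named_entity + "(?!</PERSON>))", text)]
for start, end in fixed:
    # Show each match together with the text that follows it.
    print(text[start:end] + " | " + text[end:end + 15])
```

Seeing the span of the first match cover the entire text explains why the original output was just a single <PERSON>Obama</PERSON>.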
Regarding "python - How to replace a substring that is not enclosed in tags with a regular expression in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35830544/