I have a dictionary of words and phrases with their frequencies, as follows.
mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
I have a string (with punctuation removed) as follows.
recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
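From the string above, I need to output only "biscuit pudding", "yummy tim tam" and "milk" by looking them up in the dictionary. "sugar" should not be output, because in the string it only appears as part of "rawsugar".
However, the code I am currently using also outputs "sugar".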
import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
# The keys are joined into one alternation with no word boundaries,
# so 'sugar' also matches inside 'rawsugar'.
searcher = re.compile(r'{}'.format("|".join(mydictionary.keys())), flags=re.I | re.S)
for match in searcher.findall(recipes_book):
    print(match)
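How can I avoid matching substrings like this and consider only complete tokens, such as "milk"? Please help me.
Best answer
Use the word boundary '\b'. It matches only at the transition between a word character and a non-word character, so "sugar" inside "rawsugar" is not matched. Put simply: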
>>> import re
>>> recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
>>> re.findall(r'(?is)(\bchocolates\b|\bbiscuit pudding\b|\bsugar\b|\byummy tim tam\b|\bmilk\b)', recipes_book)
['biscuit pudding', 'yummy tim tam', 'milk']
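The pattern above is written out by hand. A minimal sketch of building the same kind of pattern from the dictionary keys might look like the following. Using re.escape and sorting the keys longest-first are additions not present in the original answer: the escape guards against keys that contain regex metacharacters, and the sort is only a precaution in case one key is a prefix of another.

```python
import re

mydictionary = {'yummy tim tam': 3, 'milk': 2, 'chocolates': 5, 'biscuit pudding': 3, 'sugar': 2}
recipes_book = ("For todays lesson we will show you how to make biscuit pudding "
                "using yummy tim tam milk and rawsugar")

# Assumption: longest keys first, so a longer phrase is tried before a shorter
# key that happens to be its prefix.
keys = sorted(mydictionary, key=len, reverse=True)

# Wrap the whole alternation in \b ... \b so only complete tokens match.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b'
searcher = re.compile(pattern, flags=re.IGNORECASE)

matches = searcher.findall(recipes_book)
print(matches)  # ['biscuit pudding', 'yummy tim tam', 'milk']
```

Because the non-capturing group (?:...) is used, findall returns the full matched text for each hit, and "rawsugar" produces no match since there is no word boundary between "w" and "s".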
Regarding "python - identifying strings while excluding substrings in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46542271/