python - Fuzzy search in Python

Tags: python regex nltk fuzzy-search fuzzywuzzy

I have a large sample text, for example:

"The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

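I am trying to detect whether "engage the prognosis for survival" occurs in the text, but in a fuzzy way. For example, a close variant such as "has a prognosis for survival" must also return a positive answer.

I looked into fuzzywuzzy, nltk, and the new fuzzy matching functions in regex, but I did not find a way to do something like: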

if [anything similar (>90%) to "that sentence"] in mybigtext:
    print(True)

Best answer

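The following is not ideal, but it should help you get started. It uses nltk to first split the text into words, then builds a set containing the stem of every word, filtering out any stop words. It does this for both the sample text and the sample query.

If the intersection of the two sets contains every stem from the query, it is considered a match.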

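import nltk

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time setup: the tokenizer and stop word list need NLTK data packages,
# e.g. nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
stop_words = stopwords.words('english')
ps = PorterStemmer()

def get_word_set(text):
    # Tokenize, drop stop words, and return the set of stems of the remaining tokens
    return set(ps.stem(word) for word in word_tokenize(text) if word not in stop_words)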

text1 = "The arterial high blood pressure may engage the prognosis for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."
text2 = "The arterial high blood pressure may engage the for survival of the patient as a result of complications. TENSTATEN enters within the framework of a preventive treatment(processing). His(Her,Its) report(relationship) efficiency / effects unwanted is important. diuretics, medicine of first intention of which TENSTATEN, is. The therapeutic alternatives are very numerous."

query = "engage the prognosis for survival"

set_query = get_word_set(query)
for text in [text1, text2]:
    set_text = get_word_set(text)
    intersection = set_query & set_text

    print("Query:", set_query)
    print("Test:", set_text)
    print("Intersection:", intersection)
    # The intersection is a subset of the query, so equal sizes mean every query stem was found
    print("Match:", len(intersection) == len(set_query))
    print()

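The script is given two texts, one that passes and one that fails. It produces the following output to show what it is doing (captured under Python 2, hence the set([u'...']) notation; the order of set elements may differ between runs):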

Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'framework', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'prognosi', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'prognosi', u'engag', u'surviv'])
Match: True

Query: set([u'prognosi', u'engag', u'surviv'])
Test: set([u'medicin', u'prevent', u'effici', u'engag', u'Her', u'process', u'within', u'surviv', u'high', u'pressur', u'result', u'diuret', u')', u'(', u',', u'/', u'.', u'numer', u'Hi', u'treatment', u'import', u'complic', u'altern', u'patient', u'relationship', u'may', u'arteri', u'effect', u'framework', u'intent', u'blood', u'report', u'The', u'TENSTATEN', u'unwant', u'It', u'therapeut', u'enter', u'first'])
Intersection: set([u'engag', u'surviv'])
Match: False
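
The set-based check above ignores word order and proximity, and it has no notion of a similarity percentage. As a complement, here is a minimal sketch that is closer to the "similar (>90%)" condition in the question. It is not part of the original answer: it swaps nltk for the standard library's difflib, slides a window of query-sized word runs over the text, and scores each window with SequenceMatcher.ratio(). The helper names (tokenize, best_window_match, fuzzy_contains), the 0.9 threshold, and the second query are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_window_match(query, text):
    """Return (ratio, window) for the run of words in text closest to query.

    The window has as many words as the query; each window is compared to the
    query with SequenceMatcher.ratio(), a character-level score in [0, 1].
    """
    query_words = tokenize(query)
    text_words = tokenize(text)
    target = " ".join(query_words)
    size = len(query_words)
    best_ratio, best_window = 0.0, ""
    # max(1, ...) still checks one window when the text is shorter than the query
    for start in range(max(1, len(text_words) - size + 1)):
        window = " ".join(text_words[start:start + size])
        ratio = SequenceMatcher(None, target, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_window = ratio, window
    return best_ratio, best_window


def fuzzy_contains(query, text, threshold=0.9):
    """True if some window of the text scores at least `threshold` (assumed cutoff)."""
    ratio, _ = best_window_match(query, text)
    return ratio >= threshold


# First sentence of the sample text from the question
sample = ("The arterial high blood pressure may engage the prognosis for "
          "survival of the patient as a result of complications.")

# The second query is a hypothetical near-variant, used only for illustration
for query in ("engage the prognosis for survival",
              "engage the prognosis of survival"):
    ratio, window = best_window_match(query, sample)
    print(query, "->", round(ratio, 3), repr(window), fuzzy_contains(query, sample))
```

Every window is compared against the query, so the cost grows with the length of the text times the length of the query. For large texts it may be worth filtering candidates first, for example with the stem-set test above, and scoring only the texts that pass.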

Regarding python - Fuzzy search in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35705582/
