python - Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words

Tags: python python-3.x nlp nltk chatbot

I am building a chatbot in Python. The code:

import nltk
import numpy as np
import random
import string 
f=open('/home/hostbooks/ML/stewy/speech/chatbot.txt','r',errors = 'ignore')
raw=f.read()
raw=raw.lower()# converts to lowercase

sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()    

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey","hii")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]


def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robo_response=''
    sent_tokens.append(user_response)    

    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx=vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]    

    if(req_tfidf==0):
        robo_response=robo_response+"I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response

flag=True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response)!=None):
                print("ROBO: "+greeting(user_response))
            else:
                print("ROBO: ",end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("ROBO: Bye! take care..")

It works fine, but every conversation produces this warning:

/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.

Here are some exchanges from the command line:

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

what is india

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

ROBO: the wildlife of india, which has traditionally been viewed with tolerance in india's culture, is supported in these forests and elsewhere in protected habitats.

what is a chatbot

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

Best Answer

The reason is that you are using a custom tokenizer together with the default stop_words='english', so when the features are extracted a check is run to see whether the stop_words and the tokenizer are consistent with each other.
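To see concretely where 'ha', 'le', 'u' and 'wa' come from, you can run that same check by hand: push every built-in English stop word through the question's LemNormalize and collect whatever falls outside the stop list. They most likely come from stop words such as "has", "was" and "us" losing their final "s" under the default noun lemmatization. A minimal sketch (my own illustration, assuming the WordNetLemmatizer-based tokenizer from the question and that the NLTK punkt/wordnet data is downloaded):

import string
import nltk
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

lemmer = nltk.stem.WordNetLemmatizer()
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    # lowercase, strip punctuation, tokenize, then lemmatize each token
    return [lemmer.lemmatize(token) for token in nltk.word_tokenize(text.lower().translate(remove_punct_dict))]

# mimic the consistency check: tokenize every stop word and keep the
# results that are no longer in the stop list themselves
inconsistent = set()
for w in ENGLISH_STOP_WORDS:
    for token in LemNormalize(w):
        if token not in ENGLISH_STOP_WORDS:
            inconsistent.add(token)

print(sorted(inconsistent))  # expected to include tokens like 'ha', 'le', 'u', 'wa'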

If you dig into the code of sklearn/feature_extraction/text.py, you will find the snippet that performs the consistency check:

def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words are were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))

As you can see, it issues the warning whenever an inconsistency is found.
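If you also want to silence the warning, one possible workaround (my own sketch, not something the check above requires) is to give TfidfVectorizer a stop list that has already been passed through the same tokenizer, so tokenizing a stop word can no longer produce anything outside the list:

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# extend the built-in stop list with the lemmatized form of every stop word,
# using the same LemNormalize the vectorizer uses as its tokenizer
lemmatized_stop_words = set(ENGLISH_STOP_WORDS)
for w in ENGLISH_STOP_WORDS:
    lemmatized_stop_words.update(LemNormalize(w))

TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words=list(lemmatized_stop_words))

Either way the warning is only informational: as the conversations above show, the vectorizer still works; it is simply telling you that the stop list and your tokenizer follow different conventions.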

Hope this helps.

Regarding python - Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60280307/
