python - Removing stop words with spaCy

Tags: python nlp spacy python-3.7 data-cleaning

I am cleaning a column in my data frame, Sumcription, and I am trying to do three things:

  • Tokenize
  • Lemmatize
  • Remove stop words
    import spacy        
    nlp = spacy.load('en_core_web_sm', parser=False, entity=False)        
    df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))    
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS        
    spacy_stopwords.add('attach')
    df['Lema_Token']  = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
    

    However, when I print, for example:
    df.Lema_Token.iloc[8]
    

    the output still contains the word attach: attach poster on the wall because it is cool
    Why doesn't it remove the stop word?

    I also tried this:
    df['Lema_Token_Test']  = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])
    

    But the string attach still appears.

    Best answer

    import spacy
    import pandas as pd
    
    # Load the spaCy model; the parser and NER components are not needed here
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    
    # New stop words list 
    customize_stop_words = [
        'attach'
    ]
    
    # Mark them as stop words
    for w in customize_stop_words:
        nlp.vocab[w].is_stop = True
    
    
    # Test data
    df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                       "eating and sleeping"]})
    
    # Convert each row into a spaCy document and return the lemmas of the tokens in
    # the document that are not stop words. Finally, join the lemmas into a string
    df['Sumcription_lema'] = df.Sumcription.apply(lambda text: 
                                              " ".join(token.lemma_ for token in nlp(text) 
                                                       if not token.is_stop))
    
    print(df)
    

    Output:
       Sumcription                                   Sumcription_lema
    0  attach poster on the wall because it is cool  poster wall cool
    1                           eating and sleeping         eat sleep
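
As for why the filter in the question had no effect: `token not in spacy_stopwords` tests a `Token` object against a set of strings, and an object of another type is never found in that set, so every token passes. Comparing the token's text instead makes the membership test meaningful. The sketch below illustrates this without loading a model; `FakeToken` is a hypothetical stand-in that only mimics the `text` and `lemma_` attributes of a spaCy token, and the small stop word set is an assumption for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a spaCy Token (assumption: only the two
# attributes used below are mimicked; this is not part of spaCy).
@dataclass(frozen=True)
class FakeToken:
    text: str
    lemma_: str


# Illustrative stop word set (strings), plus the custom word from the question
stop_words = {"on", "the", "because", "it", "is"}
stop_words.add("attach")

words = [("attach", "attach"), ("poster", "poster"), ("on", "on"),
         ("the", "the"), ("wall", "wall"), ("because", "because"),
         ("it", "it"), ("is", "be"), ("cool", "cool")]
doc = [FakeToken(text, lemma) for text, lemma in words]

# Filter as written in the question: the token object itself is compared
# with a set of strings, so nothing is ever filtered out.
buggy = " ".join(t.lemma_ for t in doc if t not in stop_words)

# Filter on the token's lowercased text: a string is compared with strings.
fixed = " ".join(t.lemma_ for t in doc if t.text.lower() not in stop_words)

print(buggy)
print(fixed)
```

With a real pipeline, the same idea would mean writing `token.text.lower() not in spacy_stopwords` in the question's `apply` call, or using `token.is_stop` as in the answer above.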
    

    Regarding python - removing stop words with spaCy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55817040/
