python - Removing stop words with spaCy

I am cleaning a column in my data frame, Sumcription, and I am trying to do 3 things:

  • Tokenize
  • Lemmatize
  • Remove stop words
    import spacy        
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))    
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS        
    df['Lema_Token']  = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))

However, when I print a row, the output still contains the word attach:

    attach poster on the wall because it is cool

    df['Lema_Token_Test']  = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])

    With this variant as well, the string attach still appears.


    import spacy
    import pandas as pd
    # Load spacy model
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    # New stop words list 
    customize_stop_words = ['attach']  # the word the question wants removed
    # Mark them as stop words
    for w in customize_stop_words:
        nlp.vocab[w].is_stop = True
    # Test data
    df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                       "eating and sleeping"]})
    # Convert each row into a spaCy document and keep the lemma of every token
    # that is not a stop word, then join the lemmas into a single string
    df['Sumcription_lema'] = df.Sumcription.apply(lambda text: 
                                              " ".join(token.lemma_ for token in nlp(text) 
                                                       if not token.is_stop))
    print(df)

       Sumcription                                   Sumcription_lema
    0  attach poster on the wall because it is cool  poster wall cool
    1                           eating and sleeping         eat sleep
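
A likely reason the filter in the question removed nothing is that `token not in spacy_stopwords` compares spaCy `Token` objects against a set of plain strings, so the test never matches. The sketch below illustrates the difference. It assumes a blank English pipeline from `spacy.blank("en")` (tokenizer only, so no lemmas) purely to keep the example small; the variable names are illustrative.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("attach poster on the wall because it is cool")

# Token objects are never members of a set of strings,
# so this filter keeps every token.
kept_by_object = [t.text for t in doc if t not in STOP_WORDS]

# Comparing the token's text (a str) against the set does match.
kept_by_text = [t.text for t in doc if t.text.lower() not in STOP_WORDS]

# token.is_stop runs the stop-word check on the token itself.
kept_by_flag = [t.text for t in doc if not t.is_stop]

print(kept_by_object)
print(kept_by_text)
print(kept_by_flag)
```

Even with the corrected check, attach is kept, because it is not in the default English stop word list. That is why the code above marks it as a stop word explicitly.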

