python - Why is pos_tag() so slow, and can it be avoided?

I want to be able to get the POS tags of sentences one at a time, like this:

def __remove_stop_words(self, tokenized_text, stop_words):

    sentences_pos = nltk.pos_tag(tokenized_text)  
    filtered_words = [word for (word, pos) in sentences_pos 
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

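The problem is that pos_tag() takes about one second per sentence. Another option is to use pos_tag_sents() to tag in batches and speed things up, but my life would be easier if I could do this sentence by sentence.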

Is there a way to do this faster?

Best answer

In nltk version 3.1, pos_tag is defined in nltk/tag/__init__.py as follows:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

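So each call to pos_tag first instantiates PerceptronTagger, which takes some time because it involves loading a pickle file. When tagset is None, _pos_tag simply calls tagger.tag. You can therefore save time by loading the file once and calling tagger.tag yourself instead of calling pos_tag: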

from nltk.tag.perceptron import PerceptronTagger

# Load the pickled model once, at import time, instead of on every call.
tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    # Reuse the already-loaded tagger rather than calling nltk.pos_tag().
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

pos_tag_sents uses the same trick: it instantiates PerceptronTagger once before calling _pos_tag many times. So the code above should give you a speedup comparable to refactoring your code to call pos_tag_sents.
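If a module-level global is inconvenient, the same load-once idea can be sketched with a cached factory function. This is only an illustration: SlowTagger, get_tagger and tag_sentence are hypothetical names, and SlowTagger is a toy stand-in so the snippet runs without NLTK or its model data. In real code the factory would presumably return a PerceptronTagger instead.

```python
from functools import lru_cache


class SlowTagger:
    """Hypothetical stand-in for an expensive-to-build tagger (not an NLTK class)."""

    instances_created = 0  # counts how many times the costly setup ran

    def __init__(self):
        # A real tagger would load its model from disk here.
        SlowTagger.instances_created += 1

    def tag(self, tokens):
        # Toy rule: label every token "X" so the example needs no model data.
        return [(token, "X") for token in tokens]


@lru_cache(maxsize=None)
def get_tagger():
    """Build the tagger on first use, then return the same instance afterwards."""
    return SlowTagger()


def tag_sentence(tokens):
    # Sentence-by-sentence interface; construction still happens only once.
    return get_tagger().tag(tokens)


sentences = [["Hello", "world"], ["Tagging", "is", "fun"]]
tagged = [tag_sentence(sentence) for sentence in sentences]
print(SlowTagger.instances_created)
```

Compared with a global, the cached factory delays the expensive load until the first sentence is actually tagged, which can help if importing the module should stay cheap.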

Also, if stop_words is a long list, you can save some time by converting stop_words to a set:

stop_words = set(stop_words)

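Checking membership in a set (e.g. pos not in stop_words) is an O(1) (constant-time) operation on average, whereas checking membership in a list is an O(n) operation (i.e. the time it takes grows in proportion to the length of the list).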
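The difference can be measured with a rough timing sketch like the one below. The word list is synthetic and the names (stop_words_list, stop_words_set, probe) are made up for illustration. Absolute timings will vary by machine, so no specific numbers are claimed here.

```python
import timeit

# Synthetic "stop word" collection; a real list would come from your own data.
stop_words_list = [f"word{i}" for i in range(10_000)]
stop_words_set = set(stop_words_list)

# A word that is absent forces the list lookup to compare against every element.
probe = "not-a-stop-word"

list_time = timeit.timeit(lambda: probe in stop_words_list, number=200)
set_time = timeit.timeit(lambda: probe in stop_words_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

Both lookups give the same answer. Only the cost differs, and the gap widens as the collection grows.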

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/33829160/
