python - Tokenized list of words

Tags: python python-3.x pandas nltk

I have a column in a pandas df that has been tokenized as follows:

df['token_col'] = df.col.apply(word_tokenize)


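I am now trying to POS-tag the tokens and convert the tags to WordNet tags with the code below, but it raises the error that follows:

df['pos_col'] = nltk.tag.pos_tag(df['token_col'])
df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])]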


AttributeError                            Traceback (most recent call last)
<ipython-input-28-99d28433d090> in <module>()
      1 #tag tokenized lists
----> 2 df['pos_col'] = nltk.tag.pos_tag(df['token_col'])
      3 df['wordnet_tagged_pos_col'] = [(w,get_wordnet_pos(t)) for (w, t) in (df['pos_col'])]

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\ in pos_tag(tokens, tagset, lang)
    125     """
    126     tagger = _get_tagger(lang)
--> 127     return _pos_tag(tokens, tagset, tagger)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\ in _pos_tag(tokens, tagset, tagger)
     94 def _pos_tag(tokens, tagset, tagger):
---> 95     tagged_tokens = tagger.tag(tokens)
     96     if tagset:
     97         tagged_tokens = [(token, map_tag('en-ptb', tagset, tag)) for (token, tag) in tagged_tokens]

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\ in tag(self, tokens)
    150         output = []
--> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
    153         for i, word in enumerate(tokens):
    154             tag = self.tagdict.get(word)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\ in <listcomp>(.0)
    150         output = []
--> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
    153         for i, word in enumerate(tokens):
    154             tag = self.tagdict.get(word)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tag\ in normalize(self, word)
    236         if '-' in word and word[0] != '-':
    237             return '!HYPHEN'
--> 238         elif word.isdigit() and len(word) == 4:
    239             return '!YEAR'
    240         elif word[0].isdigit():

AttributeError: 'list' object has no attribute 'isdigit'


The next step is lemmatization:

df['lmtzd_col'] = [(lmtzr.lemmatize(w, pos=t if t else 'n').lower(),t) for (w,t) in wordnet_tagged_pos_col]

My df is more than 70 columns wide, so here is a small snapshot:

ID_number   Meeting1    Meeting2    Meeting3    Meeting4    Meeting5    col    
123456789   9/15/2015   1/8/2016    4/27/2016   NaN         NaN         [Assessment, of, Improvement, will, be, on-goi...   
987654321   9/22/2016   NaN         2/25/2017   NaN         NaN         [A, member, of, the, administrative, team, wil..   
456789123   10/1/2015   11/30/2015  NaN         NaN         NaN         [During, our, second, and, third, meetings, we...


You can use apply to get the part-of-speech tags, i.e.

df['pos_col'] = df['token_col'].apply(nltk.tag.pos_tag)

0    [(Assessment, NNP), ( of, NNP), ( Improvement,...
1    [(A, DT), ( member, NNP), ( of, NNP), ( the, N...
2    [(During, IN), ( our, JJ), ( second, NN), ( an...
Name: pos_col, dtype: object
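
To see why passing the whole column fails while apply works, here is a minimal sketch. `toy_tag` is a hypothetical stand-in for `nltk.tag.pos_tag` (it is not part of NLTK), and the sample sentences are made up. Like the tagger in the traceback, it calls a string method on every item it iterates over, so it needs a list of words, not a column of lists.

```python
import pandas as pd


def toy_tag(tokens):
    """Hypothetical stand-in for nltk.tag.pos_tag: expects a list of strings."""
    # Like the tagger's normalize() step, this calls a str method on each item.
    return [(w, "CD" if w.isdigit() else "X") for w in tokens]


# Made-up sample data: each cell holds one list of tokens.
df = pd.DataFrame({"token_col": [["Meeting", "in", "2016"], ["A", "member"]]})

# Passing the Series itself: iterating it yields whole cells (lists), not words.
try:
    toy_tag(df["token_col"])
    error = None
except AttributeError as exc:
    error = str(exc)

# apply hands the function one cell (one token list) at a time.
df["pos_col"] = df["token_col"].apply(toy_tag)
```

Here `error` holds the same message as in the question, because the function receives a list where it expects a word, while the `apply` call tags each row separately.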

Similarly, it is better to use apply with a lambda, so that the function runs on every row, than to pass the whole Series to the function:

df['wordnet_tagged_pos_col'] = df['pos_col'].apply(lambda x: [(w, get_wordnet_pos(t)) for (w, t) in x])

because you need to apply get_wordnet_pos to every cell of the column.

0    [(Assessment, (N, n)), ( of, (N, n)), ( Improv...
1    [(A, (D, n)), ( member, (N, n)), ( of, (N, n))...
2    [(During, (I, n)), ( our, (J, a)), ( second, (...
Name: wordnet_tagged_pos_col, dtype: object
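
The question does not show `get_wordnet_pos`, and its lemmatization line has the same shape problem as the tagging line: it loops over a whole column instead of one row's list. The sketch below is one possible version, not the asker's code. Assumptions: `get_wordnet_pos` maps a Penn Treebank tag to one of WordNet's single-letter POS codes (`'n'`, `'v'`, `'a'`, `'r'`) or `None`, which differs from whatever helper produced the `(N, n)` pairs above; `lemmatize_row` and `fake_lemmatize` are hypothetical helpers, with the lemmatizer passed in as an argument so the block runs without NLTK data.

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None (assumed helper)."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("N"):
        return "n"  # noun
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return None


def lemmatize_row(tagged, lemmatize):
    """Lemmatize one row: a list of (word, wordnet_pos) pairs."""
    # Same expression as in the question, applied to a single row's list.
    return [(lemmatize(w, pos=t if t else "n").lower(), t) for (w, t) in tagged]


def fake_lemmatize(word, pos="n"):
    """Toy lemmatizer for this demo only: strips a trailing 's' from nouns."""
    return word[:-1] if pos == "n" and word.endswith("s") else word


# Made-up row of (word, Penn Treebank tag) pairs.
row = [("Meetings", "NNS"), ("were", "VBD"), ("the", "DT")]
wordnet_row = [(w, get_wordnet_pos(t)) for (w, t) in row]
lemmas = lemmatize_row(wordnet_row, fake_lemmatize)
```

On the real data, assuming `lmtzr` is an NLTK `WordNetLemmatizer` and the column holds `(word, wordnet_pos)` pairs, the row-wise call would look something like `df['lmtzd_col'] = df['wordnet_tagged_pos_col'].apply(lambda row: lemmatize_row(row, lmtzr.lemmatize))`.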


