I have a large pandas dataframe containing many documents:
    id    text
1   doc1  Google i...
2   doc2  Amazon...
3   doc3  This was...
...
n   docN  nice camara...
How can I stack all the documents into sentences, with each sentence carrying the id of the document it came from?
    id    text
1   doc1  Google is a great company.
2   doc1  It is in silicon valley.
3   doc1  Their search engine is the best
4   doc2  Amazon is a great store.
5   doc2  it is located in Seattle.
6   doc2  its new product is alexa.
7   doc2  its expensive.
8   doc3  This was a great product.
...
n   docN  nice camara I really liked it.
I tried:
import nltk
def sentence(document):
    sentences = nltk.sent_tokenize(document.strip(' '))
    return sentences
df['sentece'] = df['text'].apply(sentence)
df.stack(level=0)
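However, this did not work. Any idea how to stack the sentences while keeping track of where they came from?
Best answer
There is a solution to a similar problem here: pandas: When cell contents are lists, create a row for each element in the list. Here is my adaptation of it to your specific task: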
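import nltk
import pandas as pd
# Tokenize each document into a list of sentences.
df['sents'] = df['text'].apply(nltk.sent_tokenize)
# One row per sentence, still labelled with the original row index.
s = df.apply(lambda x: pd.Series(x['sents']), axis=1).stack().\
    reset_index(level=1, drop=True)
s.name = 'sents'
# Join the sentences back onto the remaining columns (id).
df = df.drop(['sents', 'text'], axis=1).join(s)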
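As a possible alternative, newer pandas versions (0.25 and later) provide DataFrame.explode, which turns each element of a list-valued cell into its own row. The sketch below is not the answer's method. It uses a small sample taken from the question, and split_sentences is a hypothetical regex-based stand-in for nltk.sent_tokenize so that the example runs without NLTK's tokenizer data. In real use, nltk.sent_tokenize would likely give better sentence boundaries.

```python
import re

import pandas as pd


def split_sentences(text):
    """Naive, hypothetical stand-in for nltk.sent_tokenize.

    Splits on whitespace that follows '.', '!' or '?'.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


# Small sample built from the sentences shown in the question.
df = pd.DataFrame({
    'id': ['doc1', 'doc2'],
    'text': [
        'Google is a great company. It is in silicon valley.',
        'Amazon is a great store. it is located in Seattle.',
    ],
})

out = (
    df.assign(sents=df['text'].apply(split_sentences))  # list of sentences per row
      .drop(columns='text')                             # keep only id and sents
      .explode('sents')                                 # one row per sentence
      .reset_index(drop=True)                           # fresh 0..n-1 index
)
print(out)
```

Each sentence ends up on its own row next to the id of the document it came from, which is the shape asked for in the question.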
Regarding python: stacking sentences in a pandas dataframe while keeping their provenance, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41472234/