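I have a very large dataframe full of song lyrics. I've tokenized the lyrics column, so each row holds a list of words, e.g. ["You", "say", "goodbye", "and", "I", "say", "hello"], and so on. I wrote a function that computes sentiment scores using a list of positive words and a list of negative words. I now need to apply this function to the lyrics column to compute positive sentiment, negative sentiment, and net sentiment, and store them as new columns.
I tried splitting the dataframe into a list of 1000-row chunks and applying the function in a loop, but it still takes quite a long time. Is there a more efficient way to do this, or is this as good as it gets and I just need to wait?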
def sentiment_scorer(row):
    pos = neg = 0
    for item in row['lyrics']:
        # count positive words
        if item in positiv:
            pos += 1
        # count negative words
        elif item in negativ:
            neg += 1
        # ignore words that are neither negative nor positive
        else:
            pass
    # set sentiment to 0 if pos is 0
    if pos < 1:
        pos_sent = 0
    else:
        pos_sent = pos / len(row['lyrics'])
    # set sentiment to 0 if neg is 0
    if neg < 1:
        neg_sent = 0
    else:
        neg_sent = neg / len(row['lyrics'])
    # return positive and negative sentiment to make new columns
    return pos_sent, neg_sent
# chunk data frames
n = 1000
list_df = [lyrics_cleaned_df[i:i+n] for i in range(0, lyrics_cleaned_df.shape[0], n)]
for lr in range(len(list_df)):
    # credit for method: toto_tico on Stack Overflow https://stackoverflow.com/a/46197147
    list_df[lr]['positive_sentiment'], list_df[lr]['negative_sentiment'] = zip(*list_df[lr].apply(sentiment_scorer, axis=1))
    list_df[lr]['net_sentiment'] = list_df[lr]['positive_sentiment'] - list_df[lr]['negative_sentiment']
ETA: example dataframe
import pandas as pd

data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how']],
        ['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
        ['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns=['song', 'year', 'artist', 'genre', 'lyrics'])
Best answer
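If I've understood the question correctly, and using your example (I added some extra words to create lists of uneven length), you can create a separate dataframe, lyrics, that spreads the words of each song's lyrics across separate columns.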
data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how', "d"]],
        ['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
        ['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns=['song', 'year', 'artist', 'genre', 'lyrics'])
Then define lyrics:
lyrics = pd.DataFrame(df.lyrics.values.tolist())
#         0           1       2     3
# 0      oh        baby     how     d
# 1  playin  everything      so  None  # Null rows need to be accounted for
# 2      if         you  search  None  # Null rows need to be accounted for
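To make the padding concrete, here is a minimal sketch of what happens when lists of unequal length are expanded into columns. The names tokens, wide, and word_counts are mine, chosen for illustration, and are not part of the question or of pandas.

```python
import pandas as pd

# Token lists of unequal length, mirroring the example above.
tokens = pd.Series([['oh', 'baby', 'how', 'd'],
                    ['playin', 'everything', 'so'],
                    ['if', 'you', 'search']])

# Expanding to one column per word pads the shorter rows with missing values.
wide = pd.DataFrame(tokens.tolist())

# Number of real words per row, with the padding excluded.
word_counts = wide.notna().sum(axis=1)
print(word_counts.tolist())

# The padded width is the length of the longest list, so a plain
# row-wise mean would divide by this number rather than by each row's length.
print(wide.shape[1])
```

The frame is as wide as the longest song, so this approach trades memory for speed; whether that trade is acceptable depends on how long your longest lyrics are.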
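Then, if you have two lists containing the positive and negative sentiment words (as below), you can use the mean() method to compute the sentiment of each row (each song's lyrics).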
# positive and negative sentiment words
pos = ['baby', 'you']
neg = ['if', 'so']
# When converting the lyrics lists to a new dataframe, it will contain Null values
# wherever the lists are not all the same length. The row means therefore need to
# be rescaled by the proportion of non-null values in each row
# (despite its name, null_rows holds the share of NON-null values per row)
null_rows = lyrics.notnull().mean(axis=1)
# Calculate the proportion of positive and negative words, accounting for null values
pos_sent = lyrics.isin(pos).mean(axis=1) / null_rows
neg_sent = lyrics.isin(neg).mean(axis=1) / null_rows
# pos_sent
# 0 0.250000
# 1 0.000000
# 2 0.333333
# neg_sent
# 0 0.000000
# 1 0.333333
# 2 0.333333
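As a sanity check on the rescaling, the sketch below compares the vectorized result with a direct per-list count on the same toy data. The names rows, pos_words, vectorized, and reference are illustrative only.

```python
import pandas as pd

rows = [['oh', 'baby', 'how', 'd'],
        ['playin', 'everything', 'so'],
        ['if', 'you', 'search']]
pos_words = ['baby', 'you']

wide = pd.DataFrame(rows)

# Vectorized: share of matches over the padded width,
# rescaled by the share of non-null cells in each row.
vectorized = wide.isin(pos_words).mean(axis=1) / wide.notna().mean(axis=1)

# Reference: count matches in each list and divide by that list's own length.
reference = [sum(word in pos_words for word in row) / len(row) for row in rows]

print(vectorized.round(6).tolist())
print([round(value, 6) for value in reference])
```

One caveat I would expect, though I have not checked it against your data: a song with an empty lyrics list would produce 0 / 0 here and so come out as NaN rather than 0, in which case a fillna(0) on the result should bring it in line with the original function.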
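If I've understood your question correctly, you should then be able to use df['pos'] = pos_sent and df['neg'] = neg_sent. I suspect there may still be some issues, so let me know whether this is in the right ballpark.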
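Putting the pieces together, a minimal end-to-end sketch might look like the following. It uses the column names from the question (positive_sentiment, negative_sentiment, net_sentiment) and the toy word lists from above; non_null_share is a name I introduced, and passing index=df.index is my own addition to keep the rows aligned if df does not have a default index.

```python
import pandas as pd

data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how', 'd']],
        ['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
        ['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns=['song', 'year', 'artist', 'genre', 'lyrics'])

# Toy word lists; substitute the real positive and negative lists.
pos = ['baby', 'you']
neg = ['if', 'so']

# One column per word; reusing df's index keeps the rows aligned with df.
lyrics = pd.DataFrame(df['lyrics'].tolist(), index=df.index)

# Share of real (non-padding) cells in each row.
non_null_share = lyrics.notna().mean(axis=1)

df['positive_sentiment'] = lyrics.isin(pos).mean(axis=1) / non_null_share
df['negative_sentiment'] = lyrics.isin(neg).mean(axis=1) / non_null_share
df['net_sentiment'] = df['positive_sentiment'] - df['negative_sentiment']

print(df[['song', 'positive_sentiment', 'negative_sentiment', 'net_sentiment']])
```

This replaces both the row-wise apply and the manual chunking with a handful of whole-frame operations, which is where I would expect the speedup to come from.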
Regarding "python - most efficient way to apply a function to a dataframe column", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60345273/