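I am building a huge tensor that holds millions of word triples and their counts. A word triple has the form (word0, link, word1). The triples are collected in a dictionary whose values are their counts, for example (word0, link, word1): 15. Imagine I have millions of these triples. After counting the occurrences, I try to run further calculations, and that is where my Python script gets stuck. This is the part of the code that takes forever: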
from math import log
import pandas as pd

# covert_to_tuple is my own helper; it turns big_dict into (word0, link, word1, count) records
big_tuple = covert_to_tuple(big_dict)
pdf = pd.DataFrame.from_records(big_tuple)
pdf.columns = ['word0', 'link', 'word1', 'counts']
total_cnts = pdf.counts.sum()
for _, row in pdf.iterrows():
    w0, link, w1 = row['word0'], row['link'], row['word1']
    w0w1_link = row.counts
    # very slow
    w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum()
    w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum()
    p_w0w1_link = w0w1_link / total_cnts
    p_w0_link = w0_link / total_cnts
    p_w1_link = w1_link / total_cnts
    new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
    big_dict[(w0, link, w1)] = new_score
I profiled my script, and it looks like the following two lines
w0_link = pdf[(pdf.word0 == w0) & (pdf.link == link)]['counts'].sum()
w1_link = pdf[(pdf.word1 == w1) & (pdf.link == link)]['counts'].sum()
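each take 49% of the computation time. These lines look up the counts for (word0, link) and (word1, link). So it seems that accessing pdf this way is what takes so long? Is there anything I can do to optimize it?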
Best answer
Please check my solution. I optimized a few things in the calculation (hopefully without introducing bugs :))
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'word0': list('aabb'), 'link': list('llll'), 'word1': list('cdcd'), 'counts': [10, 20, 30, 40]})
# cache the total count
total_cnt = df['counts'].sum()
# two series with the raw count sums for every ('word0', 'link') and ('word1', 'link') combination
# (not divided by total_cnt, because the transformed expression below works on raw counts)
grouped_w0_l = df.groupby(['word0', 'link'])['counts'].sum()
grouped_w1_l = df.groupby(['word1', 'link'])['counts'].sum()
# join the ('word0', 'link') sums onto the original df as 'counts_w0'
merged_w0 = df.set_index(['word0', 'link']).join(grouped_w0_l, how='left', rsuffix='_w0').reset_index()
# join the ('word1', 'link') sums onto the merged df as 'counts_w1'
merged_w0_w1 = merged_w0.set_index(['word1', 'link']).join(grouped_w1_l, how='left', rsuffix='_w1').reset_index()
# merged_w0_w1 now has everything needed to compute new_score
# note: the expression is transformed, see the derivation below
merged_w0_w1['new_score'] = np.log(merged_w0_w1['counts'] * total_cnt / (merged_w0_w1['counts_w0'] * merged_w0_w1['counts_w1']))
# export the results to a dict (not sure this is really needed; you can keep working with the dataframe)
big_dict = merged_w0_w1.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()
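As a possible alternative to the two joins, the per-group sums can be broadcast straight back onto the rows with `groupby(...).transform('sum')`. This is only a sketch on the same sample data, not the method used in the answer. The names `w0_link`, `w1_link` and `scores` are picked here for illustration, and it assumes raw counts in the `counts` column.

```python
import numpy as np
import pandas as pd

# same sample data as in the answer
df = pd.DataFrame({
    'word0': list('aabb'),
    'link': list('llll'),
    'word1': list('cdcd'),
    'counts': [10, 20, 30, 40],
})

total_cnt = df['counts'].sum()

# transform('sum') returns a Series aligned with df: each row gets its group's total
w0_link = df.groupby(['word0', 'link'])['counts'].transform('sum')
w1_link = df.groupby(['word1', 'link'])['counts'].transform('sum')

# same transformed expression: log(count * total / (w0_link * w1_link))
df['new_score'] = np.log(df['counts'] * total_cnt / (w0_link * w1_link))

# hypothetical output name; keys are (word0, link, word1) tuples
scores = df.set_index(['word0', 'link', 'word1'])['new_score'].to_dict()
```

Because the result stays aligned with the original rows, no `set_index` / `reset_index` round trip is needed. Whether it is faster than the join version on millions of rows would have to be measured.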
The expression for new_score is transformed as follows:
new_score = log(p_w0w1_link / (p_w0_link * p_w1_link))
= log((w0w1_link / total_cnts) / ((w0_link / total_cnts) * (w1_link / total_cnts)))
= log((w0w1_link / total_cnts) * (total_cnts * total_cnts) / (w0_link * w1_link))
= log(w0w1_link * total_cnts / (w0_link * w1_link))
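A quick numeric check of this identity, using the values for the row ('a', 'l', 'c') of the sample data (count 10, ('a', 'l') sum 30, ('c', 'l') sum 40, total 100). The variable names `direct` and `simplified` are made up for this sketch.

```python
from math import log, isclose

# values taken from the sample dataframe, row ('a', 'l', 'c')
w0w1_link, w0_link, w1_link, total_cnts = 10, 30, 40, 100

# original form: log(p_w0w1_link / (p_w0_link * p_w1_link))
direct = log((w0w1_link / total_cnts)
             / ((w0_link / total_cnts) * (w1_link / total_cnts)))

# transformed form, working on raw counts only
simplified = log(w0w1_link * total_cnts / (w0_link * w1_link))

print(isclose(direct, simplified))
```

Both forms evaluate to log(10 * 100 / (30 * 40)), so the check should hold up to floating-point rounding.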
This question about extremely slow access to a pandas tensor in Python comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/37318554/