python - Computing the frequency of each word in a transition matrix using only numpy and pandas

I am trying to compute the frequency of each word in a transition matrix using only numpy and pandas.

I have a list of word pairs (tuples of strings):

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'), 
         ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

I used this question to build a matrix of counts from that list:

             chewbacca  darth  han  leia  luke  obi
chewbacca          0      0    0     0     2    1
darth              0      0    0     1     0    0
han                0      0    0     0     1    0
leia               0      0    0     0     1    0
luke               0      0    0     0     0    0
obi                0      0    0     0     0    0
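For reference, here is a minimal sketch of one way such a count matrix could be built with pandas alone. It assumes each tuple is read as (from, to); the column names `source` and `target` are illustrative and not part of the original post. Under this convention the pair ('luke', 'han') is counted in row luke, column han, whereas the matrix printed above has it in row han, column luke, so the original may have been built differently.

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# Column names are illustrative: first element = source, second = target.
df = pd.DataFrame(star_wars, columns=['source', 'target'])

# Every name that appears on either side, so the matrix comes out square.
names = sorted(set(df['source']) | set(df['target']))

# crosstab counts each (source, target) pair; reindex adds the missing
# rows and columns and fills them with zeros.
counts = (pd.crosstab(df['source'], df['target'])
            .reindex(index=names, columns=names, fill_value=0))

print(counts)
```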

Now I am trying to convert these counts into probabilities using this question:

Using crosstab works on the initial data, but it only gives me results for pairs:

pd.crosstab(pd.Series(star_wars[1:]),
        pd.Series(star_wars[:-1]), normalize = 1)

The output is wrong, and this approach does not work on the matrix I created either; it is just an example:

col_0              (chewbacca, luke)  (chewbacca, obi)  (darth, leia)  (luke, han)
row_0
(chewbacca, luke)                0.0               1.0            0.0          1.0
(chewbacca, obi)                 0.5               0.0            0.0          0.0
(leia, luke)                     0.5               0.0            0.0          0.0
(luke, han)                      0.0               0.0            1.0          0.0

I also wrote a function:

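from itertools import islice

def my_function(seq, n = 2):
    # Yield a sliding window of n consecutive items from seq.
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        # Drop the oldest item and append the next one.
        result = result[1:] + (elem,)
        yield result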

Applying the function and computing the probabilities:

pairs = pd.DataFrame(my_function(star_wars), columns=['Columns', 'Rows'])
counts = pairs.groupby('Columns')['Rows'].value_counts()
probs = (counts/counts.sum()).unstack()

print(probs)

But it gives me a calculation over pairs (and I am not even sure it is correct):

Rows               (chewbacca, luke)  (chewbacca, obi)  (leia, luke)  \
Columns                                                                
(chewbacca, luke)                NaN               0.2           0.2   
(chewbacca, obi)                 0.2               NaN           NaN   
(darth, leia)                    NaN               NaN           NaN   
(luke, han)                      0.2               NaN           NaN   

Rows               (luke, han)  
Columns                         
(chewbacca, luke)          NaN  
(chewbacca, obi)           NaN  
(darth, leia)              0.2  
(luke, han)                NaN
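A possible explanation for the uniform 0.2 values: `my_function` slides over the list of tuples, so each "pair" is two consecutive tuples, and `counts.sum()` is the grand total of those five windows, which turns every count of 1 into 1/5. Below is a minimal sketch, assuming each tuple is itself a (from, to) transition, that groups on words directly and divides by the per-group totals instead of the grand total.

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# Each tuple is treated as one transition: first word -> second word.
pairs = pd.DataFrame(star_wars, columns=['Columns', 'Rows'])
counts = pairs.groupby('Columns')['Rows'].value_counts()

# transform('sum') returns the total of each first-word group, aligned with
# counts, so the division is per group rather than by the grand total.
group_totals = counts.groupby(level=0).transform('sum')
probs = (counts / group_totals).unstack(fill_value=0)

print(probs)
```

Only words that occur as a first element get a row here, so the result is not square; reindexing on the full set of names would be needed for a square matrix.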

Trying again, using only crosstab.

What I need: a matrix of probabilities rather than counts.

For example:

            chewbacca  darth  han  leia  luke   obi
chewbacca          0      0    0     0  0.66  0.33
darth              0      0    0     1     0     0
han                0      0    0     0     1     0
leia               0      0    0     0     1     0
luke               0      0    0     0     0     0
obi                0      0    0     0     0     0
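If the count matrix already exists as a DataFrame, the conversion amounts to dividing each row by its row total. The sketch below uses the counts exactly as printed in the question; `row_normalize` is a hypothetical helper name, and leaving all-zero rows at zero (rather than NaN) is an assumption about the desired behaviour.

```python
import pandas as pd

def row_normalize(counts: pd.DataFrame) -> pd.DataFrame:
    """Divide every row by its sum; rows that sum to zero stay all-zero."""
    totals = counts.sum(axis=1)
    # Replace zero totals with 1 so that 0 / 1 = 0 instead of 0 / 0 = NaN.
    safe_totals = totals.where(totals != 0, 1)
    return counts.div(safe_totals, axis=0)

names = ['chewbacca', 'darth', 'han', 'leia', 'luke', 'obi']
# Counts copied from the matrix shown in the question.
counts = pd.DataFrame(
    [[0, 0, 0, 0, 2, 1],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    index=names, columns=names)

probs = row_normalize(counts)
print(probs.round(2))
```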

Thank you for your time and help!

Best answer

We can still do this with crosstab:

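df = pd.DataFrame(star_wars)
# normalize='index' divides each row by its row total
s = pd.crosstab(df[0], df[1], normalize='index')
# add the names missing from either axis so the matrix is square
names = df.stack().unique()
s = s.reindex(index=names, fill_value=0).reindex(columns=names, fill_value=0)
s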
1          darth  leia      luke  han  chewbacca       obi
0                                                         
darth          0   1.0  0.000000  0.0          0  0.000000
leia           0   0.0  1.000000  0.0          0  0.000000
luke           0   0.0  0.000000  1.0          0  0.000000
han            0   0.0  0.000000  0.0          0  0.000000
chewbacca      0   0.0  0.666667  0.0          0  0.333333
obi            0   0.0  0.000000  0.0          0  0.000000
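For comparison, the same row-normalized matrix can be sketched with numpy alone. This is an alternative to the crosstab answer above, not part of it; it assumes the tuples are (from, to) pairs, and the rows and columns come out in alphabetical order because `np.unique` sorts the names.

```python
import numpy as np

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

pairs = np.array(star_wars)                      # shape (6, 2)

# names: sorted unique words; codes: an integer id for every word in pairs.
names, codes = np.unique(pairs, return_inverse=True)
codes = codes.reshape(pairs.shape)               # same shape as pairs

n = len(names)
counts = np.zeros((n, n), dtype=int)
# np.add.at accumulates repeated index pairs, unlike plain fancy assignment.
np.add.at(counts, (codes[:, 0], codes[:, 1]), 1)

# Divide each row by its total; rows with no transitions are left at zero.
totals = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, totals, out=np.zeros((n, n)), where=totals != 0)

print(names)
print(probs.round(3))
```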

The original question, "python - Computing the frequency of each word in a transition matrix using only numpy and pandas", can be found on Stack Overflow: https://stackoverflow.com/questions/62962843/
