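This is a follow-up to a question I posted earlier. I think I have made significant progress, and the problem has now changed.
I have a "matching" matrix that looks like this: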
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 7 8
[4,] 9 10
[5,] 11 13
[6,] 14 15
[7,] 16 17
[8,] 18 19
I also have dtm, a document-term matrix:
1108058_10-K_2005 . . . . . . . 1 . . . . 1 . . . . 1 . .
1108058_10-K_2006 . . . . . . . . . . . . . . . . . . . .
72243_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1352341_10-K_2006 1 . 1 . . 1 . . . . . . . . 1 . . . . .
64040_10-K_2005 . . . . . . . . . . . . . . . . . . . .
64040_10-K_2006 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1111247_10-K_2006 . . . . 1 . . . . . . . . . . . . . . .
1129425_10-K_2005 . . . . . . . . . . 1 1 . . . . . . . .
1129425_10-K_2006 . . . . . . . . . . . . . . . 1 1 . . .
943894_10-K_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K/A_2005 . . . . . . . . . . . . . . . . . . . .
943894_10-K_2006 . . . 1 . . . . . 1 . . . . . . . . . .
1176316_10-K_2005 . . . . . . . . . . . . . . . . . . . .
1176316_10-K_2006 . . . . . . 1 . . . . . . . . . . . . .
805305_10-K_2005 . . . . . . . . . . . . . . . . . . . .
805305_10-K_2006 . 1 . . . . . . . . . . . 1 . . . . 1 1
63276_10-K_2005 . . . . . . . . 1 . . . . . . . . . . .
63276_10-K_2006 . . . . . . . . . . . . . . . . . . . .
I can run the following dist2 call:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = "cosine", norm = "none")
Which outputs:
WARN [2019-09-11 20:51:40] Sparsity will be lost - worth to calculate similarity instead of distance.
8 x 8 Matrix of class "dgeMatrix"
1108058_10-K_2006 64040_10-K_2006 1111247_10-K_2006 1129425_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
943894_10-K_2006 1176316_10-K_2006 805305_10-K_2006 63276_10-K_2006
1108058_10-K_2005 1 1 1 1
64040_10-K_2005 1 1 1 1
1111247_10-K_2005 1 1 1 1
1129425_10-K_2005 1 1 1 1
943894_10-K_2005 1 1 1 1
1176316_10-K_2005 1 1 1 1
805305_10-K_2005 1 1 1 1
63276_10-K_2005 1 1 1 1
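This is almost what I want, but not quite: it still computes too many distances. I want to compute dist2 row-wise over the pairs in matching, i.e. compute dist2 between observations 1 and 2, then between observations 5 and 6, then 7 and 8, and so on.
Data: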
library(text2vec)
matching <- structure(c(1, 5, 7, 9, 11, 14, 16, 18, 2, 6, 8, 10, 13, 15,
17, 19), .Dim = c(8L, 2L))
dtm <- new("dgCMatrix", i = c(3L, 16L, 3L, 12L, 7L, 3L, 14L, 0L, 17L,
12L, 8L, 8L, 0L, 16L, 3L, 9L, 9L, 0L, 16L, 16L), p = 0:20, Dim = 19:20,
Dimnames = list(c("1108058_10-K_2005", "1108058_10-K_2006",
"72243_10-K_2005", "1352341_10-K_2006", "64040_10-K_2005",
"64040_10-K_2006", "1111247_10-K_2005", "1111247_10-K_2006",
"1129425_10-K_2005", "1129425_10-K_2006", "943894_10-K_2005",
"943894_10-K/A_2005", "943894_10-K_2006", "1176316_10-K_2005",
"1176316_10-K_2006", "805305_10-K_2005", "805305_10-K_2006",
"63276_10-K_2005", "63276_10-K_2006"), c("counterclaim",
"reacting", "dissipating", "delisted", "trades", "relocated",
"buyers", "allege", "wind", "antiquated", "initiating", "detract",
"instat", "putters", "confronted", "enrolling", "futility",
"repatriating", "oppose", "communicates")), x = c(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), factors = list())
Edit (my attempts, not yet correct): this lets me apply the dist2 function to the first row of matching:
m1 <- as.matrix(dtm[matching[1, ], ])
dist2(m1, method = "cosine", norm = "none")[1, 2]
Applying it to the second row:
m1 <- as.matrix(dtm[matching[2, ], ])
dist2(m1, method = "cosine", norm = "none")
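All that is left is to iterate and wrap this in a function that applies it to every row.
I have put together a partial solution (incomplete):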
for(i in 1:nrow(matching)){
m <- as.matrix(dtm[matching[i, ], ])
dist <- dist2(m, method = "cosine", norm = "none")[1, 2]
print(dist)
}
It would be great if someone could help turn this into a proper function!
This does not give me the correct result:
foo <- function(data){
col1 = data[, 1]
col2 = data[, 2]
dist = dist2(dtm[col1, ], dtm[col2, ], method = "cosine", norm = "none")
return(dist)
}
foo(matching)
Nor does this (it does not work):
apply(matching, 1, function(x, y) dist2(dtm[x, ], dtm[y, ], method = "cosine", norm = "norm"))
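Edit:
When I apply the "full" function to the matching data I get a matrix like this:
dist2(dtm[matching[, 1], ], dtm[matching[, 2], ], method = rwmd, norm = "none")
(Note: I am using a custom method, rwmd, instead of cosine, and I am using all of the data in the document-term matrix. I also drew a new random sample, so this data does not match the data above.)
1019695_10-K_2006 718937_10-K_2006 708955_10-K_2006 923120_10-K_2006 1020569_10-K_2006 862022_10-K_2006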
1019695_10-K_2005 0.06690147 0.26848699 0.52009095 0.29421497 0.27183372 0.4673677
718937_10-K_2005 0.21579128 0.03183972 0.44026262 0.26678393 0.24644321 0.4339234
708955_10-K_2005 0.51919906 0.44900795 0.02992449 0.40760294 0.39043990 0.4338723
923120_10-K_2005 0.35596766 0.32048006 0.43839797 0.07794912 0.25703208 0.4123749
1020569_10-K_2005 0.27958200 0.24791561 0.39780292 0.19322863 0.01679282 0.3915167
862022_10-K_2005 0.51707930 0.49270230 0.44924855 0.45008895 0.45454247 0.0887527
917857_10-K_2005 0.30562057 0.27731399 0.41435485 0.22840343 0.22982293 0.4053557
917857_10-K_2006
1019695_10-K_2005 0.30368532
718937_10-K_2005 0.25491939
708955_10-K_2005 0.42074617
923120_10-K_2005 0.30625747
1020569_10-K_2005 0.22772452
862022_10-K_2005 0.48192247
917857_10-K_2005 0.03438092
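This gives me what I want, but it performs too many computations. I am only interested in the diagonal of this matrix, i.e. the values 0.06690147, 0.03183972, 0.02992449 and so on, which correspond to the pairs in the matching data here:
[,1] [,2]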
[1,] 1 2
[2,] 3 5
[3,] 7 8
[4,] 9 10
[5,] 12 13
[6,] 15 16
[7,] 18 19
These pairs are row positions in the dtm matrix.
> dtm[,1:10]
19 x 10 sparse Matrix of class "dgCMatrix"
[[ suppressing 10 column names ‘reacting’, ‘ments’, ‘proper’ ... ]]
1019695_10-K_2005 . . . . . . . . . .
1019695_10-K_2006 . . . . . . . . 1 1
718937_10-K_2005 . . . . . . . . . .
718937_10-K/A_2005 . . . . . . . . . .
718937_10-K_2006 . . . . . . . . . .
1034258_10-K_2006 . . . 1 . . . . . .
708955_10-K_2005 . . . . . . . . . .
708955_10-K_2006 . . . . . . . . . .
923120_10-K_2005 . . . . . . . . . .
923120_10-K_2006 . . . . . . . . . .
923120_10-K/A_2006 . . . . . . . . . .
1020569_10-K_2005 . . . . . . . . . .
1020569_10-K_2006 1 . . . . . 1 . . .
1009463_10-K_2005 . . . . . 1 . . . .
862022_10-K_2005 . . . . . . . . . .
862022_10-K_2006 . . 1 . . . . . . .
868271_10-K_2005 . 1 . . . . . 1 . .
917857_10-K_2005 . . . . . . . . . .
917857_10-K_2006 . . . . 1 . . . . .
That is, I should get 7 results: the diagonal of the dist2 matrix.
Edit 2:
Applying all of your functions gives the following results:
Method 1:
> apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = rwmd, norm = 'none'))
Error in method$dist2(x, y) :
inherits(x, "sparseMatrix") && inherits(y, "sparseMatrix") is not TRUE
Called from: method$dist2(x, y)
Method 2:
> apply(matching, 1, function(x) dist2((dtm[x,]), method = rwmd, norm = 'none'))
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
|====================================================================================================| 100%
[,1] [,2] [,3]
[1,] -0.00000000000000001804112 -0.00000000000000001518568 -0.00000000000000003168025
[2,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[3,] 0.06690147056044426499000 0.03183972474513259431905 0.02992448660488894462972
[4,] -0.00000000000000002283564 -0.00000000000000001232901 -0.00000000000000003952019
[,4] [,5] [,6]
[1,] -0.00000000000000001162810 -0.000000000000000009077403 -0.00000000000000003039822
[2,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[3,] 0.07794911930538156452641 0.016792819916915013161995 0.08875270114006890420644
[4,] -0.00000000000000001939834 -0.000000000000000009394918 -0.00000000000000004965902
[,7]
[1,] -0.00000000000000001829033
[2,] 0.03438092421044294105803
[3,] 0.03438092421044294105803
[4,] -0.00000000000000001748001
(This gives the correct diagonal values, but also some extra results.)
Best answer
This loops over each row of your matching matrix and runs the call you described on that pair:
apply(matching, 1, function(x) dist2(as.matrix(dtm[x,]), method = 'cosine', norm = 'none'))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -2 1 1 -1 1 1 1 0
[2,] 1 1 1 1 1 1 1 1
[3,] 1 1 1 1 1 1 1 1
[4,] 1 1 0 -1 -1 0 -3 1
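A note on reading that output, as a minimal base-R sketch: apply() flattens each 2 x 2 result column by column, so every column of the output is one pair, and rows 2 and 3 hold the distance between the two documents of that pair. Here toy_dist and vals are made-up stand-ins for dist2 and dtm, used only so the example runs without text2vec.

```r
# Toy stand-ins (assumptions for illustration, not the real data)
pairs <- cbind(c(1, 3), c(2, 4))   # plays the role of `matching`
vals  <- c(10, 20, 30, 40)         # plays the role of the rows of `dtm`

# Stand-in for dist2(): a symmetric 2 x 2 matrix of absolute differences
toy_dist <- function(v) abs(outer(v, v, "-"))

# Each column of `res` is one flattened 2 x 2 matrix: [1,1], [2,1], [1,2], [2,2]
res <- apply(pairs, 1, function(x) toy_dist(vals[x]))

# Row 2 (equal to row 3 for a symmetric matrix) is the between-pair distance
pair_dist <- res[2, ]
print(pair_dist)
```

With the real data the same indexing, row 2 of the apply() result, should give one distance per pair.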
Alternatively, if you want to keep the document names, you can skip the as.matrix conversion:
res <- apply(matching, 1, function(x) dist2((dtm[x,]), method = 'cosine', norm = 'none'))
res
[[1]]
2 x 2 Matrix of class "dgeMatrix"
1108058_10-K_2005 1108058_10-K_2006
1108058_10-K_2005 -2 1
1108058_10-K_2006 1 1
[[2]]
2 x 2 Matrix of class "dgeMatrix"
64040_10-K_2005 64040_10-K_2006
64040_10-K_2005 1 1
64040_10-K_2006 1 1
#6 more list items...
If you would rather not work with a list, you can convert it to an array:
library(abind)
abind::abind(lapply(res, as.matrix), along = 3)
, , 1
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 -2 1
63276_10-K_2006 1 1
, , 2
63276_10-K_2005 63276_10-K_2006
63276_10-K_2005 1 1
63276_10-K_2006 1 1
#6 more matrix slices...
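If only one number per pair is needed, the 2 x 2 matrices can be avoided altogether. The sketch below computes the distance for each matched pair directly with the Matrix package. paired_cosine_dist and its normalize argument are hypothetical names, not part of text2vec. It assumes that dist2 with method = 'cosine' and norm = 'none' returns 1 minus the raw dot product (which would explain values such as -2 in the output above), while normalize = TRUE gives the usual cosine distance; treat that reading as an assumption and check it against your own output. It does not cover custom methods such as rwmd.

```r
library(Matrix)

# Hypothetical helper: one cosine-type distance per row of `matching`
paired_cosine_dist <- function(dtm, matching, normalize = TRUE) {
  a <- dtm[matching[, 1], , drop = FALSE]   # first document of each pair
  b <- dtm[matching[, 2], , drop = FALSE]   # second document of each pair
  dot <- Matrix::rowSums(a * b)             # row-wise dot products only
  if (!normalize) {
    return(1 - dot)                         # assumed norm = "none" behaviour
  }
  norms <- sqrt(Matrix::rowSums(a * a)) * sqrt(Matrix::rowSums(b * b))
  sim <- ifelse(norms > 0, dot / norms, 0)  # all-zero rows get similarity 0
  1 - sim
}

# Toy data (illustrative only): 4 documents, 3 terms, 2 matched pairs
toy_dtm <- Matrix::Matrix(rbind(c(1, 0, 1),
                                c(1, 0, 0),
                                c(0, 2, 0),
                                c(0, 1, 0)), sparse = TRUE)
toy_matching <- cbind(c(1, 3), c(2, 4))

d_cos <- paired_cosine_dist(toy_dtm, toy_matching)
d_raw <- paired_cosine_dist(toy_dtm, toy_matching, normalize = FALSE)
print(d_cos)
print(d_raw)
```

Because only the matched rows are multiplied element-wise, the work grows with the number of pairs rather than with its square.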
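Also, your apply attempt tried to pass two variables, x and y. apply() passes only one argument to the function: the current row as a vector. You have to subset that vector instead:
apply(matching, 1, function(x) sum(x[1],x[2]))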
[1] 3 11 15 19 24 29 33 37
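To make that concrete, here is a small base-R sketch on a made-up three-row matrix. It shows the row-vector form that apply() expects and, as an alternative when a two-argument function is preferred, mapply() over the two columns. toy_matching is illustrative data only.

```r
# Illustrative data only: three pairs of row positions
toy_matching <- cbind(c(1, 5, 7), c(2, 6, 8))

# apply() hands each row to the function as ONE vector; index into it
sums_apply <- apply(toy_matching, 1, function(x) x[1] + x[2])

# mapply() walks the two columns in parallel and passes TWO arguments
sums_mapply <- mapply(function(x, y) x + y,
                      toy_matching[, 1], toy_matching[, 2])

print(sums_apply)
print(sums_mapply)
```

Either form can wrap a per-pair distance call; the difference is only in how the two row positions reach the function.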
Regarding "r - applying a dist function row-wise inside a custom function", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57895375/