R user-defined function works on its own but returns incorrect values when used with apply

Tags: r function apply nearest-neighbor euclidean-distance

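My user-defined function (dist.func) runs and gives the correct output when I use it on a single row of data, but when I embed it in an apply() call it still executes and no longer gives the correct output. In this case I want to compute by row.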



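The function essentially measures the Euclidean distance between XY coordinates (using rdist()), but it first takes a subset of the data, keeping only those rows of the "TO" data that fall within a given similarity of the reference point. Similarity here is the Euclidean distance between the first and second principal components, PC1 and PC2.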
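The two steps described above (filter by PC similarity, then measure XY distance to what remains) can be sketched in base R. This is only an illustration under stated assumptions: the toy values, the helper `euclid()`, and the cap `max_n` are hypothetical and not taken from the question, and plain arithmetic stands in for rdist().

```r
# Hypothetical toy data (not the question's values)
from_row <- data.frame(X = 0, Y = 0, PC1 = 0, PC2 = 0)
to <- data.frame(
  X   = c(3, 6, 0, 100),
  Y   = c(4, 8, 1, 100),
  PC1 = c(0.1, 0.2, 2.0, 0.0),
  PC2 = c(0.1, 0.0, 2.0, 0.3)
)
threshold <- 0.5
max_n <- 50  # hypothetical cap on the number of nearest points kept

# Hypothetical helper: Euclidean distance from one point (numeric vector)
# to every row of a numeric matrix
euclid <- function(p, m) sqrt(rowSums(sweep(m, 2, p)^2))

# Step 1: keep destination rows whose (PC1, PC2) lies within the threshold
pc_dist <- euclid(unlist(from_row[c("PC1", "PC2")]),
                  as.matrix(to[c("PC1", "PC2")]))
near <- to[pc_dist < threshold, c("X", "Y"), drop = FALSE]

# Step 2: XY distances to the retained rows, sorted ascending, trimmed to max_n
xy_dist <- euclid(unlist(from_row[c("X", "Y")]), as.matrix(near))
dst <- head(sort(xy_dist), max_n)
print(dst)
```

Here the third destination row is dropped by the similarity filter, and the remaining three XY distances are returned in ascending order.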


# This data is the reference points to measure FROM
FROM <- data.frame(X=c(-4187500,-4183500,-4155500,-4179500,-2883500),

# This data is the destination points to measure TO
TO <- data.frame(X=c(-4207500,-4183500,-4203500,-4187500,-2827500,-4203500,-4199500,-4183500,-4195500,-4191500),
             PC1=c(-0.371,0.447,-0.344,-0.026,-0.652,-0.460,-0.313,0.010,-0.293,-0.319 ),

# This is the threshold of the data similarity match (distance between PC1 and PC2 in both data sets)
threshold <- 0.5


library(fields)  # assumed source of rdist(); the post does not show where it is loaded

dist.func <- function(REF){
  # Calculate the similarity (PC1 and PC2 distance) to all points in the destination
  # Select only those under the threshold
  bt <- as.matrix(TO[rdist(REF[3:4],TO[3:4])[1,] < threshold, c("X","Y")])
  # Calculate the number of points under the threshold (the "sample size"), capped at 50
  # If there are no points under the threshold, the SS is zero (otherwise 'NA' kills the loop)
  ss <- min(50, nrow(bt))
  # If/else to deal with SS=0
  if (nrow(bt)>0) {
    # Calculate the Euclidean distance between the reference point and all points under the threshold
    # This calculates the distances, sorts them in ascending order, and trims to the sample size
    dst <- sort(rdist(REF[1:2],bt)[1,])[1:ss]
  } else {
    dst <- c(NA)
  }
  # Report (in a list or table or whatever) the summary stats for the distances
  # The opening of this call is missing from the post; list() is assumed
  list(
    p05=ifelse(nrow(bt)==0, NA, quantile(dst,0.05)),
    MIN=ifelse(nrow(bt)==0, NA, min(dst)),
    AVG=ifelse(nrow(bt)==0, NA, mean(dst)),
    N=ifelse(nrow(bt)==0, 0, nrow(bt)))
}

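Below are the tests using a single row of the FROM data (which works) and embedding the function in an apply() call (which does not return the correct values):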

# Using the function on a single line of data returns correct values for the given line

# Embedding the function into apply() returns incorrect outputs
# I'm committed to using apply() here (or some variant) to avoid a for() loop by rows
apply(FROM, 1, dist.func)



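The problem is that apply() converts FROM to a matrix, so each row reaches dist.func as a plain numeric vector rather than a one-row data frame. Compare: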

> dist.func(FROM[1,])
[1] 14939.76
[1] 14422.21
[1] 19795.44
[1] 6

> dist.func(as.matrix(FROM)[1,])
[1] 1400
[1] 1e-10
[1] 179500
[1] 8

> apply(FROM, 1, dist.func)[[1]]
[1] 1400
[1] 1e-10
[1] 179500
[1] 8
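The difference can be reproduced on a small example, and one possible workaround is to index the rows yourself so each one stays a one-row data frame. This is a minimal sketch: the toy data frame `df` and the helper `describe_row()` are hypothetical and only report what kind of object they receive.

```r
# Hypothetical toy data frame (not the question's data)
df <- data.frame(X = c(1, 2, 3), Y = c(10, 20, 30))

# Hypothetical helper: reports what kind of object it is handed
describe_row <- function(REF) {
  if (is.data.frame(REF)) "data.frame" else class(REF)[1]
}

# apply() coerces df with as.matrix(), so each row arrives as a numeric vector
via_apply <- apply(df, 1, describe_row)

# Indexing rows explicitly keeps each row as a one-row data frame
via_lapply <- lapply(seq_len(nrow(df)), function(i) describe_row(df[i, ]))

print(via_apply)
print(unlist(via_lapply))
```

Applied to the question's objects, the same pattern would be `lapply(seq_len(nrow(FROM)), function(i) dist.func(FROM[i, ]))`, which calls dist.func the same way as the single-row call that works and still avoids an explicit for() loop.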

A similar question about an R user-defined function that works on its own but returns incorrect values when used with apply can be found on Stack Overflow: https://stackoverflow.com/questions/19259897/

