r - Feature selection with mRMRe: my categorical target variable is sometimes selected

Tags: r machine-learning bioinformatics categorical-data feature-selection

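I have a data frame "data" with 60 rows (= samples) and 20228 columns, where the first column is my target variable (an ordered factor: 0 or 1) and the other columns are my features (numeric). I want to run feature selection with mRMRe inside a loop that corresponds to a 5-fold cross-validation repeated 3 times. I select 25 features each time. Here is the problematic part of my code: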

library(caret)
library(mRMRe)

data <- read.csv("home/RNA_seq.csv", row.names=1, sep=";", stringsAsFactors=FALSE)
data <- data.frame(t(data))
data[,1] <- factor(data[,1])
data[,1] <- ordered(data[,1], levels = c("0", "1"))

features_select <- list()

r <- 5 # number of folds (5-fold cross-validation)
t <- 3 # number of times the 5-fold cross-validation is repeated
for (j in 1:t){
  for (i in 1:r){
    # 5-fold cross-validation (note: the folds are re-drawn on every inner iteration)
    train.index <- createFolds(factor(data$Response), k = 5, list = TRUE, returnTrain = TRUE)
    datatrain <- data[train.index[[i]],]
    datatest  <- data[-train.index[[i]],]

    # Feature selection
    data.mrmre.train <- mRMR.data(data=datatrain)
    res.fs.mrmr <- mRMR.classic(data=data.mrmre.train, target_indices=1, feature_count=25)
    selected.features.mrmre <- mRMRe::solutions(res.fs.mrmr)
    features_select[[((j-1)*r+i)]] <- res.fs.mrmr@feature_names[unlist(selected.features.mrmre)]
    print(features_select[[((j-1)*r+i)]])
    print(res.fs.mrmr)
  }
}
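
A side note on the loop above: createFolds() samples at random and is called again on every inner iteration, so a given (i, j) pair sees a freshly drawn split on each run, and the five training sets used within one repetition do not come from a single partition. The following is a minimal sketch of drawing the split once per repetition with a fixed seed. It uses only base R; make_train_folds() is a hypothetical helper standing in for caret's createFolds(), and the toy vector y stands in for the Response column.

```r
# Hypothetical helper (not part of caret or mRMRe): stratified fold assignment
# in base R. Returns, for each fold, the row indices used for TRAINING.
make_train_folds <- function(y, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  y_chr <- as.character(y)
  fold_id <- integer(length(y_chr))
  # Assign fold labels within each class so every fold contains both classes
  for (cls in unique(y_chr)) {
    idx <- which(y_chr == cls)
    fold_id[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  lapply(seq_len(k), function(f) which(fold_id != f))
}

# Toy stand-in for the Response column: 60 samples, two classes (assumption)
y <- factor(rep(c("0", "1"), each = 30))

n_rep <- 3  # repetitions
k <- 5      # folds per repetition
for (j in seq_len(n_rep)) {
  # Draw the split ONCE per repetition, with a seed that depends on j
  train_folds <- make_train_folds(y, k = k, seed = 100 + j)
  for (i in seq_len(k)) {
    train_idx <- train_folds[[i]]
    test_idx <- setdiff(seq_along(y), train_idx)
    # ... feature selection on the training rows would go here ...
  }
}
```

With the split fixed this way, a fold that produces a suspicious selection can be re-created exactly from its (i, j) pair, which makes the behaviour easier to investigate.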

My problem is that mRMRe sometimes selects the target variable itself, named "Response" (= column 1 of "data"). For example:

features_select :

[[1]]
[1] "AC137800.2" "AC007387.1" "AC079354.1" "AC145138.1" "RNA5SP370" 
[6] "RNA5SP219"  "AL022324.1" "AC023449.1" "AP000873.1" "AC020612.2"
[11] "RNA5SP473"  "AC092810.1" "IGKV1D.37"  "SST"        "AC093331.1"
[16] "TRAJ34"     "AC107983.1" "RPL39P"     "HSBP1P1"    "TRBJ1.6"   
[21] "PHGR1"      "RNA5SP435"  "RNA5SP301"  "AC005255.1" "KRT127P"

[[2]]
 [1] "AC073869.8"   "Response" "Response" "Response" "Response" "Response"
 [7] "Response" "Response" "Response" "Response" "Response" "Response"
[13] "Response" "Response" "Response" "Response" "Response" "Response"
[19] "Response" "Response" "Response" "Response" "Response" "Response"
[25] "Response"

Here is the output of mRMR.classic() in the first case and in the second (= bad) case:

[[1]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 18837 18781 15503 15526 17437 20028 18924 17133 17024 16104 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0.817 0.819 0.817 0.817 0.817 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.3786 -0.1536 -0.0929 -0.0964 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

[[2]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.518 -0.246 -0.211 -0.204 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NA NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

This does not happen every time the loop is entered with the same values of i and j. Do you know where the problem is?

Thanks in advance!

Best answer

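I received a reply from the authors of the mRMRe package. The solution is to use the "strata" argument of the mRMR.data() function to indicate my target variable (= ordered factor). So I had to change: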

data.mrmre.train <- mRMR.data(data=datatrain)

to:

data.mrmre.train <- mRMR.data(data=datatrain[,-1], strata=datatrain[,1])

For more details, see: https://github.com/bhklab/mRMRe/issues/27
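
Whichever call is used, it can help to check each fold's result so that a selection like the second one shown in the question does not go unnoticed. Below is a minimal sketch in base R; target_was_selected() is a hypothetical helper (not part of mRMRe), and the two vectors are copied from the question's output, with the good fold shortened to its first five names.

```r
# Hypothetical helper (not part of mRMRe): TRUE when the target's own name
# appears among the selected feature names.
target_was_selected <- function(selected_names, target_name = "Response") {
  target_name %in% selected_names
}

# Selections copied from the output in the question (good fold shortened)
good_fold <- c("AC137800.2", "AC007387.1", "AC079354.1", "AC145138.1", "RNA5SP370")
bad_fold <- c("AC073869.8", rep("Response", 24))

features_select <- list(good_fold, bad_fold)

# One logical flag per fold
bad <- vapply(features_select, target_was_selected, logical(1))

# Indices of the folds whose selection contains the target
which(bad)
```

Inside the cross-validation loop, the same check could be applied to features_select[[((j-1)*r+i)]] right after it is filled, for example to stop or log a warning for that fold.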

Source: Stack Overflow, https://stackoverflow.com/questions/59953732/
