r - Converting a dataset of strings into a matrix

Tags: r string split bioinformatics

I have a tab-delimited dataset, and I want to convert the following data into a matrix:

CATGGGGAAAACTGA
CCTCTCGATCACCGA
CCTATAGATCACCGA
CCGATTGATCACCGA
CCTTGTGCAGACCGA

I used:

rbind(strsplit("CATGGGGAAAACTGA","")[[1]],
        strsplit("CCTCTCGATCACCGA","")[[1]],
        strsplit("CCTCTCGATCACCGA","")[[1]],
        strsplit("CCTATAGATCACCGA","")[[1]],
        strsplit("CCGATTGATCACCGA","")[[1]],
        strsplit("CCTTGTGCAGACCGA","")[[1]])

This produces:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "C"  "A"  "T"  "G"  "G"  "G"  "G"  "A"  "A"  "A"   "A"   "C"   "T"   "G"   "A"  
[2,] "C"  "C"  "T"  "C"  "T"  "C"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[3,] "C"  "C"  "T"  "C"  "T"  "C"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[4,] "C"  "C"  "T"  "A"  "T"  "A"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[5,] "C"  "C"  "G"  "A"  "T"  "T"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"  
[6,] "C"  "C"  "T"  "T"  "G"  "T"  "G"  "C"  "A"  "G"   "A"   "C"   "C"   "G"   "A"

But this becomes tedious when the dataset is very large. How can I automate it?

Best answer

You can use read.fwf to split each line into single characters:

read.fwf(textConnection("CATGGGGAAAACTGA
CCTCTCGATCACCGA
CCTATAGATCACCGA
CCGATTGATCACCGA
CCTTGTGCAGACCGA"), rep(1, nchar("CATGGGGAAAACTGA")))
#  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
#1  C  A  T  G  G  G  G  A  A   A   A   C   T   G   A
#2  C  C  T  C  T  C  G  A  T   C   A   C   C   G   A
#3  C  C  T  A  T  A  G  A  T   C   A   C   C   G   A
#4  C  C  G  A  T  T  G  A  T   C   A   C   C   G   A
#5  C  C  T  T  G  T  G  C  A   G   A   C   C   G   A

You will probably want to pass a file name instead of a text connection.
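As a rough sketch of what that might look like with a file, the example below writes the sample sequences to a temporary file (a hypothetical stand-in for the real data file) and builds a character matrix in two ways. It assumes one sequence per line and that all sequences have the same length. The colClasses = "character" argument is an addition not in the answer above: without it, a column containing only "T" would likely be read as logical TRUE. The strsplit variant is an alternative technique, not part of the original answer.

```r
# Hypothetical input: a temp file standing in for the real data file,
# with one sequence per line and all sequences of equal length (assumption).
seqs <- c("CATGGGGAAAACTGA",
          "CCTCTCGATCACCGA",
          "CCTATAGATCACCGA",
          "CCGATTGATCACCGA",
          "CCTTGTGCAGACCGA")
path <- tempfile(fileext = ".txt")
writeLines(seqs, path)

# Option 1: read.fwf with one width-1 field per character.
# The width is taken from the first line of the file.
# colClasses = "character" keeps every column as text.
width <- nchar(readLines(path, n = 1))
df <- read.fwf(path, widths = rep(1, width), colClasses = "character")
m1 <- as.matrix(df)  # read.fwf returns a data frame; convert it to a matrix

# Option 2: split every line into single characters and bind the rows.
m2 <- do.call(rbind, strsplit(readLines(path), ""))

dim(m1)
dim(m2)
```

Both options give a matrix with one row per sequence and one column per position. The read.fwf result carries column names V1, V2, and so on, while the strsplit result has no dimnames, which is closer to the rbind output shown in the question.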

Regarding "r - Converting a dataset of strings into a matrix", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40678032/
