r - From a messy list of characters to a matrix in R

Tags: r split data-conversion

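I would greatly appreciate your help. I have a large vector containing 2000 strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.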

Data structure and desired result

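Each element of the vector has a different "length" but follows the same pattern. The characters inside "[]" determine the number of rows, and the characters outside "[]" determine the number of columns. I will use these three lines as an example: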

[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium

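The first line has 2 groups in "[]", each with 5 columns; the second line has 3 groups, with 3, 4, and 5 columns; the third line has 3 groups, with 4, 4, and 5 columns.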
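As an illustration of this pattern, here is a minimal base-R sketch that counts the author groups and the comma-separated affiliation fields for the first example line. The variable names (`x`, `groups`, `affils`) are made up for this illustration and are not part of the original post.

```r
# Illustrative only: the first example line, copied from above.
x <- "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy"

# Locate every "[...]" block; each block is one author group.
m <- gregexpr("\\[.*?\\]", x)
groups <- gsub("\\[|\\]", "", regmatches(x, m)[[1]])

# Authors inside a group are separated by ";": each author becomes a row.
authors_per_group <- lengths(strsplit(groups, ";", fixed = TRUE))

# Text outside the brackets is the affiliation of the preceding group.
# The piece before the first "[" is empty, so it is dropped with [-1].
affils <- regmatches(x, m, invert = TRUE)[[1]][-1]
affils <- trimws(sub(";\\s*$", "", affils))

# Comma-separated affiliation fields become the Info columns.
columns_per_group <- lengths(strsplit(affils, ",", fixed = TRUE))

authors_per_group  # 3 1
columns_per_group  # 5 5
```

For this line the sketch gives four author rows in total (3 + 1), each carrying five Info columns.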

The result would be a matrix like this:

ID  Author                 Info01           Info02                     Info03                          Info04                 Info05
1   Sorce, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Greco, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Magistri, L.           Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Costamagna, P.         Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DICCA  I-16145 Genoa          Italy
2   Allema, Bas            Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Hemerik, Lia           Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Rossing, Walter A. H.  Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Allema, Bas            Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van Lenteren, Joop C.  Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van der Werf, Wopke    Wageningen Univ  Ctr Crop Syst Anal         Crop & Weed Ecol Grp            NL-6700 AP Wageningen  Netherlands
3   Abdissa, Ketema        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Tadesse, Mulualem      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bezabih, Mesele        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bekele, Alemayehu      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Abebe, Gemeda          Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Apers, Ludwig          Inst Trop Med    Dept Clin Sci              B-2000 Antwerp                  Belgium                N/A
3   Rigouts, Leen          Inst Trop Med    Dept Microbiol             Mycobacteriol Unit              B-2000 Antwerp         Belgium

My approach

I split the strings and convert the vector into a list with the following command:

library(stringr)  # provides str_split()
CL1 <- str_split(CL, "\\[|\\]", n = Inf)

This produces a list of character vectors like this:

[[1999]]
[1] ""                                                                                               
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"                                       
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"

[[2000]]
[1] ""                                                                                                        
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"                                                       
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "                     
[4] "Yuan, Kai-Tao"                                                                                           
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "         
[6] "Yu, Li"                                                                                                  
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "   
[8] "Yang, Ding-Hua"                                                                                          
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"

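As you can see, the first element of each vector in the list is empty. After that, every even-numbered element contains a "group" (the authors), and every odd-numbered element contains the columns for that group.

The next step is to separate the groups in order to assemble a matrix. For that I use these two commands: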

CL2 <- lapply(CL1, function(x) x[2])

AF1 <- lapply(CL1, function(x) x[3])

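Because in some cases I have more than 50 groups in the same line, I basically have to repeat this process in a loop, but I don't know how, so for now I am doing it manually. Another problem is that I don't know how to create the ID or how to merge the lists into a matrix.

Any ideas or suggestions are welcome.

Best answer

The following should achieve what you are trying to do: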

A <- read.csv("AU.csv", stringsAsFactors = FALSE)

## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)

A1 <- gsub("\\[|\\]", "", unlist(A1))

## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1

A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))

## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
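To see why the code above subtracts 1 from `lengths(A2)` and drops the first element with `x[-1]`, here is a minimal sketch on a made-up string. The input `s` and the names `inside`, `outside`, and `cleaned` are hypothetical and only mirror the shape of the real data.

```r
# Toy input (hypothetical), shaped like one record of the real data.
s <- "[A, B.; C, D.] Univ X, City, Country; [E, F.] Inst Y, Town, Land"
m <- gregexpr("\\[.*?\\]", s)

# The matches themselves: one element per "[...]" block.
inside <- regmatches(s, m)[[1]]

# invert = TRUE returns the text between matches: n matches give n + 1 pieces.
outside <- regmatches(s, m, invert = TRUE)[[1]]

# The record starts with "[", so the first piece is the empty string
# before the first match. Dropping it aligns the pieces with the groups.
cleaned <- gsub("^\\s+|\\s+$|;\\s+$", "", outside[-1])

inside   # "[A, B.; C, D.]" "[E, F.]"
outside  # "" " Univ X, City, Country; " " Inst Y, Town, Land"
cleaned  # "Univ X, City, Country" "Inst Y, Town, Land"
```

After cleaning, each bracketed group has exactly one affiliation string, which is the property the `all.equal()` checks confirm on the full data.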

Now that we have the two vectors, we can use cSplit from the "splitstackshape" package to get the output you want:

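library(splitstackshape)
library(data.table)  ## provides data.table(); attached explicitly in case splitstackshape does not attach it
library(magrittr)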

## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)

## Here's the splitting....
final <- DT %>% 
  cSplit("A1", ";", "long") %>%  ## The first column is split and made long
  cSplit("A2", ",")              ## The second column is split and made wide

The result looks like this:

final
#          ID                      A1                                  A2_01                            A2_02
#     1:    1         Aalten, Pauline                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     2:    1 Ramakers, Inez H. G. B.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     3:    1         Rozendaal, Nico                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     4:    1     Verhey, Frans R. J.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     5:    1     Biessels, Geert Jan                   Univ Med Ctr Utrecht                      Dept Neurol
#    ---                                                                                                     
# 13949: 2000         Meng, Qing-Hong                       Guiyang Med Coll                     Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung                       Guiyang Med Coll                     Dept Immunol
# 13951: 2000           Yuan, Kai-Tao                       Sun Yat Sen Univ                Affiliated Hosp 1
# 13952: 2000                  Yu, Li Guangzhou First Municipal Peoples Hosp                     Dept Paediat
# 13953: 2000          Yang, Ding-Hua                      Southern Med Univ                    Nan Fang Hosp
#                          A2_03                 A2_04           A2_05           A2_06 A2_07 A2_08 A2_09 A2_10
#     1:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     2:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     3:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     4:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     5:                 Utrecht           Netherlands              NA              NA    NA    NA    NA    NA
#    ---                                                                                                      
# 13949:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13950:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13951:               Dept Surg      Guangzhou 510080       Guangdong Peoples R China    NA    NA    NA    NA
# 13952:        Guangzhou 510180             Guangdong Peoples R China              NA    NA    NA    NA    NA
# 13953: Dept Hepatobiliary Surg      Guangzhou 510515       Guangdong Peoples R China    NA    NA    NA    NA
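For readers who want to see what the "long" and "wide" splits amount to, the following is a rough base-R sketch of the same two steps on toy data. It is not how cSplit is implemented, and the data frame `df` and its values are invented for illustration. Column names here are simply numbered `A2_1`, `A2_2`, and so on.

```r
# Toy stand-in (hypothetical values) for the DT built in the answer.
df <- data.frame(
  ID = c(1, 1),
  A1 = c("Sorce, A.; Greco, A.", "Costamagna, P."),
  A2 = c("Univ Genoa, Italy", "Univ Genoa, Polytech Sch, Italy"),
  stringsAsFactors = FALSE
)

# "Long" step: one row per author, repeating the other columns.
authors <- strsplit(df$A1, ";", fixed = TRUE)
long <- df[rep(seq_len(nrow(df)), lengths(authors)), ]
long$A1 <- trimws(unlist(authors))
rownames(long) <- NULL

# "Wide" step: one column per comma-separated piece, padded with NA
# so that every row has the same number of columns.
pieces <- lapply(strsplit(long$A2, ",", fixed = TRUE), trimws)
width <- max(lengths(pieces))
wide <- do.call(rbind, lapply(pieces, function(p) {
  c(p, rep(NA_character_, width - length(p)))
}))
colnames(wide) <- sprintf("A2_%d", seq_len(width))

result <- data.frame(long[c("ID", "A1")], wide, stringsAsFactors = FALSE)
result
```

The two toy input rows become three output rows (one per author), and the shorter affiliation is padded with NA on the right, which matches the NA padding visible in the output above.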

Regarding "r - From a messy list of characters to a matrix in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30090144/
