r - How should I merge (full join) multiple (> 100) CSV files that share a common key but have inconsistent row counts?

Before I get into the problem: a similar question has been asked here, but it has no solution yet.

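I'm working in R, and my working directory contains a folder named columns that holds 198 similar .csv files. Each file name is a 6-digit integer (e.g. 100000), and the numbers increase irregularly, because each file name is actually the name of a variable.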

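Now I want to full-join them, which means I first have to import all of these files into R and then join them. Naturally, I thought of holding the files in a list and joining them in a loop. Here is the code I tried: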

library(readr)  # read_csv
library(dplyr)  # %>%, select, rename, full_join

# These are the first 3 columns containing identifiers
matrix_starter <- read_csv("files/matrix_starter.csv")

## import_multiple_csv_files_to_R
# Purpose: Import multiple csv files to the Global Environment in R

# set working directory
setwd("columns")

# list all csv files from the current directory
list.files(pattern=".csv$") # use the pattern argument to define a common pattern for import files with regex. Here: .csv

# create a list from these files
list.filenames <- list.files(pattern=".csv$")
#list.filenames

# create an empty list that will serve as a container to receive the incoming files
list.data <- list()

# create a loop to read in your data
for (i in seq_along(list.filenames)) {
  list.data[[i]] <- read.csv(list.filenames[i])
  list.data[[i]] <- list.data[[i]] %>%
    select(`Occupation.Title`, `X2018.Employment`) %>%
    rename(`Occupation title` = `Occupation.Title`) #%>%
    #rename(list.filenames[i] = `X2018.Employment`)
}

# add the names of your data to the list
names(list.data) <- list.filenames

# now you can index one of your tables like this
list.data$`113300.csv`

# or this
list.data[1]

# source: https://www.edureka.co/community/1902/how-can-i-import-multiple-csv-files-into-r

The block above takes care of the import. I now have a list of the .csv files' contents. Next, I want to join them:

for (i in seq_along(list.filenames)) {
  # `by` takes the key column name as a character string
  matrix_starter <- matrix_starter %>% full_join(list.data[[i]], by = "Occupation title")
}

However, this doesn't work well. I end up with about 47,000 rows, whereas I expected only about 1,700. Please let me know what you think.

Best answer

Reading the files into R as a list, with the file name included as a column, can be done like this:

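library(readxl)  # read_xls

# `path` is the folder that holds the files (set to "./files" in the update below)
files <- list.files(path = path,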
                    full.names = TRUE,
                    all.files = FALSE)
files <- files[!file.info(files)$isdir]

data <- lapply(files,
               function(x) {
                 data <- read_xls(
                   x,
                   sheet = 1
                 )
                 data$File_name <- x
                 data
                 })

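I'm assuming here that all of your Excel files have the same structure: the same columns and column types. If so, you can use dplyr::bind_rows to build one combined data frame. Alternatively, you can loop over the list and left_join the list elements, for example with Reduce and merge.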
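The Reduce and merge idea mentioned above could look roughly like the following minimal sketch. It uses base R's merge with all = TRUE (a full join) instead of left_join, since the question asks for a full join. The toy data frames, the key column name `key`, and the helper name `full_join_all` are made up for illustration and are not part of the original answer.

```r
# Toy data frames standing in for the imported files (hypothetical data).
df_a <- data.frame(key = c("x", "y", "z"), a = 1:3, stringsAsFactors = FALSE)
df_b <- data.frame(key = c("y", "z", "w"), b = 4:6, stringsAsFactors = FALSE)
df_c <- data.frame(key = c("x", "w"),      c = 7:8, stringsAsFactors = FALSE)

# Hypothetical helper: full-join every data frame in a list on one key column.
# merge(..., all = TRUE) keeps keys that appear in either input.
full_join_all <- function(dfs, key) {
  Reduce(function(left, right) merge(left, right, by = key, all = TRUE), dfs)
}

joined <- full_join_all(list(df_a, df_b, df_c), key = "key")
print(joined)
```

If a key value occurs more than once in an input, each join multiplies the matching rows, which is one possible reason a result ends up with far more rows than expected. Checking anyDuplicated() on the key column of each input before joining may help narrow that down.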

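Update following mihndang's comment. When you ask whether there is a way to use the file names to name the columns instead of adding a file-name column, is this what you're after?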

library(dplyr)
library(stringr)

path <- "./files"
files <- list.files(path = path,
                    full.names = TRUE,
                    all.files = FALSE)
files <- files[!file.info(files)$isdir]

data <- lapply(files,
               function(x) {
                 read.csv(x, stringsAsFactors = FALSE)
               })

col1 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Values")
col2 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Character")
df1 <- data[[1]] %>%
  rename(!!col1 := Value,
         !!col2 := Character)

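I created two simple .csv files in ./files: file1.csv and file2.csv, and read them into a list. I extract the first list element (a data frame) and build the new column names in two variables. I then rename the columns by passing those variables to rename, so the column names include the file name. Result: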

> View(df1)
> df1
   file1: Values file1: Character
1              1                a
2              2                b
3              3                c
4              4                d
5              5                e
6              6                f
7              7                g
8              8                h
9              9                i
10            10                j
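To connect this renaming step back to the question, here is a hedged sketch that applies it to every file and then full-joins the results on the shared key. It uses base R only. The temporary folder, the two demo files, and their contents are invented for illustration; the column names Occupation.Title and X2018.Employment are taken from the question's code.

```r
# Write two small CSV files to a temporary folder (hypothetical data that
# mimics the question's columns as read.csv would name them).
demo_dir <- file.path(tempdir(), "columns_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Occupation.Title = c("A", "B"), X2018.Employment = c(10, 20)),
          file.path(demo_dir, "113300.csv"), row.names = FALSE)
write.csv(data.frame(Occupation.Title = c("B", "C"), X2018.Employment = c(30, 40)),
          file.path(demo_dir, "211000.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, keep the key and value columns, and name the value
# column after the file (without the .csv extension).
data <- lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df <- df[, c("Occupation.Title", "X2018.Employment")]
  names(df)[2] <- tools::file_path_sans_ext(basename(f))
  df
})

# Full join all data frames on the shared key.
wide <- Reduce(function(x, y) merge(x, y, by = "Occupation.Title", all = TRUE), data)
print(wide)
```

Because every value column gets a unique name before the join, merge does not need to add .x/.y suffixes, and the result has one row per occupation and one column per file.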

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/61787806/
