r - Parse a text file to create a tidy data frame

Tags: r dplyr

I have a text file with hundreds of lines that repeat in the following pattern:

Mito_1.fastq.gz
Uniquely mapped reads: 3314106 (74%) 
Multi-mapping reads: 956 (0%) 
Unmapped reads: 1165802 (26%) 
Total: 4480864

Mito_1_old.fastq.gz
Uniquely mapped reads: 1188564 (88%) 
Multi-mapping reads: 406 (0%) 
Unmapped reads: 162676 (12%) 
Total: 1351646

I want to parse this file and convert it into a data frame like this:

> head(desired_outcome)
               Sample Uniquely.mapped.reads Multi.mapping.reads Unmapped.reads   Total
1     Mito_1.fastq.gz               3314106                 956        1165802 4480864
2 Mito_1_old.fastq.gz               1188564                 406         162676 1351646

I tried the following code:

library(tidyverse)

text_file <- read_tsv("./Results/mapping_stats.txt",
                      col_names = "text")

head(text_file)

# A tibble: 6 × 1
  text                                
  <chr>                               
1 Mito_1.fastq.gz                     
2 Uniquely mapped reads: 3314106 (74%)
3 Multi-mapping reads: 956 (0%)       
4 Unmapped reads: 1165802 (26%)       
5 Total: 4480864                      
6 Mito_1_old.fastq.gz

text_file <- text_file |>
  # Separate numeric values after the colon
  separate(
    col = text,
    into = c("id", "value"),
    sep = "\\:",
    remove = TRUE,
    extra = "warn"
  ) |> 
  # Extract reads and percentage
  mutate(
    reads_number = parse_number(value),
    reads_percentage = str_extract(value, "\\d+(?=%)")
  ) |> 
  # Keep only the relevant columns
  select(id, reads_number)


head(text_file)

# A tibble: 6 × 2
  id                    reads_number
  <chr>                        <dbl>
1 Mito_1.fastq.gz                 NA
2 Uniquely mapped reads      3314106
3 Multi-mapping reads            956
4 Unmapped reads             1165802
5 Total                      4480864
6 Mito_1_old.fastq.gz             NA

However, I am completely stuck after this point. Any help or suggestions would be greatly appreciated.

Best answer

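This data almost follows the DCF format, except that the first line of each record is missing a tag. If we read the data in as text and insert the missing tag, read.dcf() can easily convert it to a data frame.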

library(dplyr)
library(readr)

txt <- readLines("./Results/mapping_stats.txt")
txt <- ifelse(!grepl(":", txt) & nzchar(txt), paste("Sample:", txt), txt)

dat <- read.dcf(textConnection(txt), all = TRUE) 

This gives:

               Sample Uniquely mapped reads Multi-mapping reads Unmapped reads   Total
1     Mito_1.fastq.gz         3314106 (74%)            956 (0%)  1165802 (26%) 4480864
2 Mito_1_old.fastq.gz         1188564 (88%)            406 (0%)   162676 (12%) 1351646
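
To try the tagging step without the file on disk, here is a minimal self-contained sketch in base R. The sample lines are copied from the question, and `is_sample` is just an illustrative name. It assumes that every non-empty line without a colon is a sample name, which holds for the excerpt shown but may not hold for every file.

```r
# Sample lines copied from the question; real data would come from readLines()
txt <- c(
  "Mito_1.fastq.gz",
  "Uniquely mapped reads: 3314106 (74%)",
  "Multi-mapping reads: 956 (0%)",
  "Unmapped reads: 1165802 (26%)",
  "Total: 4480864",
  "",
  "Mito_1_old.fastq.gz",
  "Uniquely mapped reads: 1188564 (88%)",
  "Multi-mapping reads: 406 (0%)",
  "Unmapped reads: 162676 (12%)",
  "Total: 1351646"
)

# Assumption: a non-empty line with no colon is a sample name, so give it a tag
is_sample <- !grepl(":", txt) & nzchar(txt)
txt[is_sample] <- paste("Sample:", txt[is_sample])

# Blank lines separate the records; all = TRUE returns a data frame
con <- textConnection(txt)
dat <- read.dcf(con, all = TRUE)
close(con)

str(dat)
```

At this point every column is still character, so the read counts carry their "(74%)" suffix until they are parsed.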

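From there, readr::parse_number() can be used to parse the values. The percentages could also be extracted, but recalculating them is just as easy and probably more efficient: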

dat |>
  mutate(across(-Sample, parse_number),
         across(-c(Sample, Total), ~ .x / Total * 100, .names = "{.col}_percent")) |>
  as_tibble() |>
  rename_with(make.names)

# A tibble: 2 × 8
  Sample           Uniquely.mapped.reads Multi.mapping.reads Unmapped.reads  Total Uniquely.mapped.read…¹
  <chr>                            <dbl>               <dbl>          <dbl>  <dbl>                  <dbl>
1 Mito_1.fastq.gz                3314106                 956        1165802 4.48e6                   74.0
2 Mito_1_old.fast…               1188564                 406         162676 1.35e6                   87.9
# ℹ abbreviated name: ¹​Uniquely.mapped.reads_percent
# ℹ 2 more variables: Multi.mapping.reads_percent <dbl>, Unmapped.reads_percent <dbl>
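
For comparison, the attempt in the question can also be carried through to the wide layout without read.dcf(). The following is only a sketch of one possible continuation, not part of the original answer. It rebuilds the long `id` / `reads_number` table from the question by hand and assumes that every row without a number is a sample name that starts a new record. The names `long`, `wide`, and `record` are illustrative.

```r
library(dplyr)
library(tidyr)

# Long table in the shape produced by the question's attempt (values from the question)
long <- tibble(
  id = c("Mito_1.fastq.gz", "Uniquely mapped reads", "Multi-mapping reads",
         "Unmapped reads", "Total",
         "Mito_1_old.fastq.gz", "Uniquely mapped reads", "Multi-mapping reads",
         "Unmapped reads", "Total"),
  reads_number = c(NA, 3314106, 956, 1165802, 4480864,
                   NA, 1188564, 406, 162676, 1351646)
)

wide <- long |>
  # Assumption: a row with no number is a sample name and opens a new record
  mutate(record = cumsum(is.na(reads_number))) |>
  group_by(record) |>
  # The first row of each record holds the sample name
  mutate(Sample = first(id)) |>
  ungroup() |>
  # Drop the sample-name rows, then spread the statistics into columns
  filter(!is.na(reads_number)) |>
  pivot_wider(id_cols = Sample, names_from = id, values_from = reads_number)

wide
```

The column names keep their spaces here; rename_with(make.names) can be appended, as in the answer, if syntactic names are preferred.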

For "r - Parse a text file to create a tidy data frame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/77030265/
