I have a text file with several hundred lines that repeat in the following pattern:
Mito_1.fastq.gz
Uniquely mapped reads: 3314106 (74%)
Multi-mapping reads: 956 (0%)
Unmapped reads: 1165802 (26%)
Total: 4480864
Mito_1_old.fastq.gz
Uniquely mapped reads: 1188564 (88%)
Multi-mapping reads: 406 (0%)
Unmapped reads: 162676 (12%)
Total: 1351646
I would like to parse this file and convert it into a data frame that looks like this:
> head(desired_outcome)
Sample Uniquely.mapped.reads Multi.mapping.reads Unmapped.reads.Total Total
1 Mito_1.fastq.gz 3314106 956 1165802 4480864
2 Mito_1_old.fastq.gz 1188564 406 162676 1351646
I have tried the following code:
library(tidyverse)
text_file <- read_tsv("./Results/mapping_stats.txt",
                      col_names = "text")
head(text_file)
# A tibble: 6 × 1
text
<chr>
1 Mito_1.fastq.gz
2 Uniquely mapped reads: 3314106 (74%)
3 Multi-mapping reads: 956 (0%)
4 Unmapped reads: 1165802 (26%)
5 Total: 4480864
6 Mito_1_old.fastq.gz
text_file <- text_file |>
  # Separate numeric values after the colon
  separate(
    col = text,
    into = c("id", "value"),
    sep = "\\:",
    remove = TRUE,
    extra = "warn"
  ) |>
  # Extract reads and percentage
  mutate(
    reads_number = parse_number(value),
    reads_percentage = str_extract(value, "\\d+(?=%)")
  ) |>
  # Keep only the relevant columns
  select(id, reads_number)
head(text_file)
# A tibble: 6 × 2
id reads_number
<chr> <dbl>
1 Mito_1.fastq.gz NA
2 Uniquely mapped reads 3314106
3 Multi-mapping reads 956
4 Unmapped reads 1165802
5 Total 4480864
6 Mito_1_old.fastq.gz NA
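However, I am completely stuck after this point. Any help or suggestions would be greatly appreciated.
Best answer
This data almost follows the DCF format, except that the first line of each record is missing a tag. If we read the data as text and insert the missing tag, it can easily be converted to a data frame with read.dcf().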
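library(dplyr)
library(readr)
# Read the raw lines, then tag each sample-name line (non-empty, no colon) as a "Sample" field
txt <- readLines("./Results/mapping_stats.txt")
txt <- ifelse(!grepl(":", txt) & nzchar(txt), paste("Sample:", txt), txt)
# all = TRUE makes read.dcf() return a data frame rather than a character matrix
dat <- read.dcf(textConnection(txt), all = TRUE)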
This gives:
Sample Uniquely mapped reads Multi-mapping reads Unmapped reads Total
1 Mito_1.fastq.gz 3314106 (74%) 956 (0%) 1165802 (26%) 4480864
2 Mito_1_old.fastq.gz 1188564 (88%) 406 (0%) 162676 (12%) 1351646
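One caveat: read.dcf() treats blank lines as record separators, and the nzchar() check above suggests the original file contains them. If a file has no blank lines between records (as in the listing shown in the question), a minimal sketch is to insert the separator together with the missing tag. It assumes that every line without a colon is a sample name starting a new record; the inline `txt` vector is sample data copied from the question, and `is_sample` and `txt_dcf` are names introduced here for illustration.

```r
# Sample lines copied from the question, with no blank lines between records
txt <- c(
  "Mito_1.fastq.gz",
  "Uniquely mapped reads: 3314106 (74%)",
  "Multi-mapping reads: 956 (0%)",
  "Unmapped reads: 1165802 (26%)",
  "Total: 4480864",
  "Mito_1_old.fastq.gz",
  "Uniquely mapped reads: 1188564 (88%)",
  "Multi-mapping reads: 406 (0%)",
  "Unmapped reads: 162676 (12%)",
  "Total: 1351646"
)

# Drop any existing blank lines so separators are not doubled up
txt <- txt[nzchar(txt)]

# Assumption: a line without a colon is a sample name that starts a new record
is_sample <- !grepl(":", txt)

# Tag each sample line and put a blank separator line before every record
# except the first
txt_dcf <- unlist(lapply(seq_along(txt), function(i) {
  if (!is_sample[i]) return(txt[i])
  tagged <- paste("Sample:", txt[i])
  if (i == 1) tagged else c("", tagged)
}))

con <- textConnection(txt_dcf)
dat <- read.dcf(con, all = TRUE)
close(con)
```

With the separators in place, each sample should come back as its own row, so the rest of the answer applies unchanged.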
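From there, readr::parse_number() can be used to parse the values. Although the percentages could also be extracted, recalculating them is just as easy and probably more efficient: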
dat |>
  mutate(across(-Sample, parse_number),
         across(-c(Sample, Total), ~ .x / Total * 100, .names = "{.col}_percent")) |>
  as_tibble() |>
  rename_with(make.names)
# A tibble: 2 × 8
Sample Uniquely.mapped.reads Multi.mapping.reads Unmapped.reads Total Uniquely.mapped.read…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mito_1.fastq.gz 3314106 956 1165802 4.48e6 74.0
2 Mito_1_old.fast… 1188564 406 162676 1.35e6 87.9
# ℹ abbreviated name: ¹Uniquely.mapped.reads_percent
# ℹ 2 more variables: Multi.mapping.reads_percent <dbl>, Unmapped.reads_percent <dbl>
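If the percentages exactly as printed in the file are wanted instead (for example 88 rather than the recomputed 87.9), they can be extracted from the strings. The following is only a sketch: it reuses the "\\d+(?=%)" pattern from the question, and the `dat` defined inline is a hypothetical stand-in for the data frame produced by read.dcf(), built by hand so the example runs on its own.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the data frame returned by read.dcf() in the answer
dat <- data.frame(
  Sample = c("Mito_1.fastq.gz", "Mito_1_old.fastq.gz"),
  `Uniquely mapped reads` = c("3314106 (74%)", "1188564 (88%)"),
  `Multi-mapping reads` = c("956 (0%)", "406 (0%)"),
  `Unmapped reads` = c("1165802 (26%)", "162676 (12%)"),
  Total = c("4480864", "1351646"),
  check.names = FALSE
)

out <- dat |>
  # First pull the digits that precede "%" into new *_percent columns
  mutate(across(-c(Sample, Total),
                ~ as.numeric(str_extract(.x, "\\d+(?=%)")),
                .names = "{.col}_percent")) |>
  # Then reduce the original columns to the leading read count
  mutate(across(-c(Sample, ends_with("_percent")),
                ~ as.numeric(str_extract(.x, "^\\d+")))) |>
  as_tibble() |>
  rename_with(make.names)
```

The two mutate() calls are kept separate so that the count conversion does not overwrite the strings before the percentages have been read from them.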
This question about parsing a text file in R to create a tidy data frame comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/77030265/