I have two large files, as shown below:
f1:
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
f2:
chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025
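What I want to do is append the matching row of file 2 as extra columns in file 1 when the range given by the start and end columns of file2 lies within the range in file1 (columns 2 and 3) and the chr value is the same. So, based on the dummy example files above, the desired output would be (only the range of linc1320 lies within the first row of file 1):
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300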
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
I am not a professional coder, but I have been using this code, changing the range by hand according to file2:
awk -F ',' '$2<=3073300,$3>=3074300, {print $1,$2,$3,$4,$5,$6,$7}' f1.csv
I have no particular preference for a programming language; either Python or awk would be helpful. Thanks for any help.
Best answer
You can use this awk command:
awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
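A more readable form:
awk '
BEGIN { FS = OFS = "," }
# First file on the command line (file2): store every row except the header
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
# Second file (file1): append the first stored range that lies strictly inside columns 2 and 3
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1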
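Since the question also asks for Python, the same logic can be sketched roughly as follows. This is not part of the original answer. The function name `annotate` is hypothetical, and the sketch assumes that file2 has a single header line, that columns 2 and 3 of file1 are integers, and that no quoted field contains a comma (lines are split on every comma, as the awk version does).

```python
def annotate(f1_lines, f2_lines):
    """Yield each file1 line, extended with the first file2 row on the same
    chromosome whose start/end lie strictly inside columns 2 and 3."""
    ranges = []
    for n, line in enumerate(f2_lines):
        if n == 0 or not line.strip():
            continue  # assumption: first line of file2 is the header
        chrom, name, start, end = line.rstrip("\n").split(",")
        ranges.append((chrom, name, int(start), int(end)))

    for line in f1_lines:
        line = line.rstrip("\n")
        fields = line.split(",")
        chrom, lo, hi = fields[0], int(fields[1]), int(fields[2])
        for r_chrom, name, start, end in ranges:
            # same strict comparison as the awk condition
            if r_chrom == chrom and start > lo and end < hi:
                line = f"{line},{name},{start},{end}"
                break  # keep only the first match, like the awk version
        yield line


# Small in-memory demo using rows from the question
f2 = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
    "chr1,linc8956,4410501,4406025",
]
f1 = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]
for row in annotate(f1, f2):
    print(row)
```

With real files, open file objects can be passed in place of the lists, for example `annotate(open("file1"), open("file2"))`. Like the awk version, this scans every stored range for each line of file1, so for very large inputs it may be worth grouping the ranges by chromosome first.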
Regarding "python - how to print rows if they fall within a given range", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67280459/