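In the awk below, I am trying to match the value in $4 of file1 against the part of $4 in file2 that comes before the first _. I store the $4 value from file1 in A. I then store the value in $2 as min, the value in $3 as max, and the value in $1 as chr.
If $1 in A equals array[1], I use the values stored in min, max, and chr to check whether there is overlap with the $2, $3, and $1 values in file2. If there is, overlap is printed; if not, missing is printed. I am trying to make sure that the lines match and that the coordinates in file2 are covered by file1. My actual data is thousands of lines in the format below, and every line in file2 should find a match. I commented the awk in the hope that it helps, because I am getting a syntax error. There may be a better way, but I wanted to try this and see.
If I remove {split($4,array,"_")} and remove array[1], I get the current output, but not all lines are printed, only the overlap line, and I am not sure that only exact matches would be printed.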
file1 (tab-delimited)
chr19 42373737 42373856 RPS19
chr6 32790021 32790140 TAP2
file2 (tab-delimited)
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 +
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 +
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 +
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 +
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 +
Desired output (tab-delimited)
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + missing
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + missing
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + missing
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + missing
awk
awk ' # call awk script
BEGIN { FS=OFS="\t" } # define FS and OFS as tab
FNR==NR{ # start processing same line in files
a[$4]; # store gene name as a key in a
min[$4]=$2; # store starting coordinate
max[$4]=$3; # store ending coordinate
next # process next line
} # close block
{ # start block
split($4,array,"_"); # split $4 on _ and store in array[1]
print $0,(array[1] in a) && ($2>=min[array[1]] &&
$2<=max[array[1]])?"overlap":"missing" # print all lines followed by
overlap or missing depending on condition (if array[1] = a and $2 in
file2 is greater than or equal to min and $3 in file2 greater than or
equal to max print overlap, else missing)
} # close block
' file1 file2 # define input
Current output
1 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
Best answer
Superstar awk to the rescue here:
I also can't tell whether your Input_file(s) are actually TAB-delimited. If they are, set FS="\t" before Input_file1 in this code as well.
awk 'FNR==NR{a[$4];min[$4]=$2;max[$4]=$3;next} {split($4,array,"_");print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"}' Input_file1 OFS="\t" Input_file2
Here is the same solution in a non-one-liner form:
awk '
FNR==NR{
a[$4];
min[$4]=$2;
max[$4]=$3;
next
}
{
split($4,array,"_");
print $0,(array[1] in a) && ($2>=min[array[1]] && $2<=max[array[1]])?"overlap":"missing"
}
' Input_file1 OFS="\t" Input_file2
The output will be as follows:
chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + missing
chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + missing
chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + missing
chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + overlap
chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + missing
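Note that the code above only tests whether the file2 start ($2) falls inside the file1 range, and it does not compare the chromosome in $1. If a full interval-overlap test is wanted, something like the following may work. It is only a sketch, assuming tab-delimited input, inclusive coordinates, and a single region per gene in file1 (with repeated genes the last region would win). The names workdir, part, g, hit, and result.txt are made up for illustration and are not from the question.

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)

printf '%s\t%s\t%s\t%s\n' \
  chr19 42373737 42373856 RPS19 \
  chr6  32790021 32790140 TAP2 > "$workdir/file1"

printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr19 42364844 42364915 RPS19_cds_1_0_chr19_42364845_f 0 + \
  chr19 42365180 42365281 RPS19_cds_2_0_chr19_42365181_f 0 + \
  chr19 42373100 42373284 RPS19_cds_3_0_chr19_42373101_f 0 + \
  chr19 42373768 42373823 RPS19_cds_4_0_chr19_42373769_f 0 + \
  chr19 42375418 42375445 RPS19_cds_5_0_chr19_42375419_f 0 + > "$workdir/file2"

awk '
BEGIN { FS = OFS = "\t" }          # read and write tab-delimited fields
FNR == NR {                        # first file: index each region by gene name
    chr[$4] = $1
    min[$4] = $2
    max[$4] = $3
    next
}
{
    split($4, part, "_")           # part[1] is the text before the first "_"
    g = part[1]
    # Two inclusive intervals overlap when each one starts
    # no later than the other one ends; "+ 0" forces numeric comparison.
    hit = (g in chr) && ($1 == chr[g]) &&
          ($2 + 0 <= max[g] + 0) && ($3 + 0 >= min[g] + 0)
    print $0, (hit ? "overlap" : "missing")
}' "$workdir/file1" "$workdir/file2" > "$workdir/result.txt"

cat "$workdir/result.txt"
```

On the sample data this labels the same single line as overlap as the output above. The two versions would differ when a file2 interval starts before the file1 region but ends inside it, or when the gene name matches but the chromosome does not.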
Regarding awk to print text in a field based on a coordinate range and an exact match, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48818891/