python - How to print a row if it falls within a certain range

Tags: python pandas awk

I have two large files, as shown below:
f1:

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
f2:
chr,name,start,end
chr1,linc1320,3073300,3074300
chr3,linc2245,3077270,3078250
chr1,linc8956,4410501,4406025
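What I want to do is append a row of file 2 as extra columns to a row of file 1 if the range given by the start and end columns of file 2 falls within the range in file 1 (columns 2 and 3) and the chr value is the same. So, based on the dummy example files above, the desired output would be (only the linc1320 range falls within a file 1 range, namely the first row):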
chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
I'm not a professional coder, but I have been using this code, manually changing the range according to file2:
awk -F ',' '$2<=3073300 && $3>=3074300 {print $1,$2,$3,$4,$5,$6,$7}' f1.csv
I have no particular preference for a programming language; either Python or awk would be helpful. Thank you for any help.

Best answer

You can use this awk command:

awk 'BEGIN{FS=OFS=","} FNR==NR {if (FNR>1) {chr[++n] = $1; id[n]=$2; r1[n]=$3; r2[n]=$4}; next} {for (i=1; i<=n; ++i) if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {$0 = $0 OFS id[i] OFS r1[i] OFS r2[i]; break}} 1' file2 file1

chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC",linc1320,3073300,3074300
chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3077253,3078322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"
chr1,3102916,3103025,gene_id,"ENSMUSG00000064842.1",gene_type,"snRNA"
chr1,3105016,3106025,gene_id,"ENSMUSG00000064842.1",transcript_id,"ENSMUST00000082908.1"
A more readable form:
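awk '
BEGIN { FS = OFS = "," }   # comma-separated input and output
# First file (file2): store every range, skipping the header line
FNR == NR {
   if (FNR > 1) {
      chr[++n] = $1
      id[n] = $2
      r1[n] = $3
      r2[n] = $4
   }
   next
}
# Second file (file1): append the first range on the same chromosome
# that lies strictly inside columns 2 and 3.
# The trailing 1 prints every line, matched or not.
{
   for (i=1; i<=n; ++i)
      if ($1 == chr[i] && r1[i] > $2 && r2[i] < $3) {
         $0 = $0 OFS id[i] OFS r1[i] OFS r2[i]
         break
      }
} 1' file2 file1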
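
Since the question also asks about Python, here is a minimal sketch of the same logic in plain Python. It is not part of the original answer. The function names `load_ranges` and `annotate` are made up for illustration. Like the awk version, it splits on commas without interpreting the quotes, uses the same strict comparison, and appends only the first matching range. One difference is that the ranges are grouped by chromosome, so each file 1 row is only compared against ranges on its own chromosome.

```python
from collections import defaultdict


def load_ranges(lines):
    """Group (name, start, end) tuples from file2 by chromosome."""
    ranges = defaultdict(list)
    for i, line in enumerate(lines):
        line = line.strip()
        if i == 0 or not line:  # skip the header (chr,name,start,end) and blanks
            continue
        chrom, name, start, end = line.split(",")[:4]
        ranges[chrom].append((name, int(start), int(end)))
    return ranges


def annotate(lines, ranges):
    """Yield each file1 line, extended by the first range strictly inside it."""
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(",")
        if len(fields) < 3:  # pass through blank or malformed lines unchanged
            yield line
            continue
        lo, hi = int(fields[1]), int(fields[2])
        for name, start, end in ranges.get(fields[0], ()):
            if start > lo and end < hi:  # same strict test as the awk answer
                line = f"{line},{name},{start},{end}"
                break
        yield line


# Sample data copied from the question (first two rows of f1, all of f2).
f2_lines = [
    "chr,name,start,end",
    "chr1,linc1320,3073300,3074300",
    "chr3,linc2245,3077270,3078250",
    "chr1,linc8956,4410501,4406025",
]
f1_lines = [
    'chr1,3073253,3074322,gene_id,"ENSMUSG00000102693.1",gene_type,"TEC"',
    'chr1,3074253,3075322,gene_id,"ENSMUSG00000102693.1",transcript_id,"ENSMUST00000193812.1"',
]

result = list(annotate(f1_lines, load_ranges(f2_lines)))
for line in result:
    print(line)
```

Both functions accept any iterable of lines, so for real files you can pass open file objects instead of the lists, for example `annotate(open("file1"), load_ranges(open("file2")))`.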

Regarding "python - How to print a row if it falls within a certain range", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67280459/
