python - Binning data

Tags: python perl statistics bioinformatics

I have a dataset that looks like this:

-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14

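Column 1 gives the distance (in base pairs) of the element (gene) in column 2 from 0, in either direction. I want to bin this data into windows of 50 base pairs.

Any suggestions? Thanks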

Best answer

Program (hope Perl is OK):

#!/usr/bin/perl

# Create a histogram of some data

# Denis Howe 2012-07-03 - 2012-07-03 18:40

use strict;
use warnings;
use POSIX qw(floor);

# my $n_bins = 10;
my $width = 50;

# Read lines into @d
my @d = <DATA>; chomp @d;

# Split each line containing a digit into a pair
@d = map [split(/\s+/, $_)], grep /\d/, @d;

# Find range
my $min = 9E9; my $max = -9E9;
foreach (@d)
{
    $min = $_->[0] if ($_->[0] < $min);
    $max = $_->[0] if ($_->[0] > $max);
}

# Round down to multiple of $width.
# floor() is needed here: int() truncates toward zero, which would round
# a negative minimum up and leave the smallest value outside its bin.
$min = floor($min/$width) * $width;

# Ensure there's a bin for max value
# my $width = ($max*1.01 - $min) / $n_bins;
my $n_bins = int(($max - $min) / $width) + 1;

# Allocate data to bins
my @bin;
foreach (@d)
{
    push @{$bin[int(($_->[0]-$min)/$width)]}, $_;
}

# Show content of each bin
foreach (0 .. $n_bins-1)
{
    next unless ($bin[$_]);             # Ignore empty bins
    printf "%6d - %6d", $min + $_*$width, $min + ($_+1)*$width;
    print map("  " . $_->[0] . ":" . $_->[1], @{$bin[$_]}), "\n";
}

__DATA__
-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14
EOF

Output:

-23350 - -23300  -23324:MIR198
-19750 - -19700  -19749:PRPS2
-12150 - -12100  -12111:SLC7A5
-11650 - -11600  -11643:CAMK2G
 -9050 -  -9000  -9030:KIR3DX1
  -300 -   -250  -251:ARAP1
  -100 -    -50  -75:SLC12A6  -77:CCDC88A
  8000 -   8050  8005:C14orf79
 10000 -  10050  10012:LOC100506172
 12150 -  12200  12171:MMP14
 65950 -  66000  65994:EFNB1
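
Since the question is also tagged python, the same idea can be sketched in Python. This is only a minimal sketch, not part of the original answer: the names bin_by_window, WIDTH and DATA are made up for illustration, and the first column is assumed to hold integers. Each window is the half-open range from its lower edge up to (but not including) lower edge + 50. Python's floor division (//) rounds toward negative infinity, so negative distances fall into the correct window without any special handling.

```python
from collections import defaultdict

WIDTH = 50  # window size in base pairs, as asked in the question

# Sample data from the question: distance, then gene name
DATA = """\
-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14
"""


def bin_by_window(lines, width=WIDTH):
    """Group (distance, gene) pairs by the lower edge of their window."""
    bins = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        distance, gene = int(fields[0]), fields[1]
        # Floor division rounds toward negative infinity,
        # e.g. -23324 // 50 * 50 == -23350
        lower = distance // width * width
        bins[lower].append((distance, gene))
    return dict(bins)


if __name__ == "__main__":
    bins = bin_by_window(DATA.splitlines())
    # Print only non-empty windows, in ascending order
    for lower in sorted(bins):
        members = "  ".join(f"{d}:{g}" for d, g in bins[lower])
        print(f"{lower:6d} - {lower + WIDTH:6d}  {members}")
```

Keying a dictionary by the window's lower edge avoids allocating the many empty windows between distant genes; if only counts per window are needed, the lists can be replaced by counters.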

HTH

Source: a similar question, "python - Binning data", on Stack Overflow: https://stackoverflow.com/questions/11314802/
