I have a dataset that looks like this:
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
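Column 1 is the distance, in base pairs, of the element (gene) in column 2 from position 0, in either direction. I want to bin these data into windows of 50 base pairs.
Any suggestions? Thanks.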
Best answer
A program (hope Perl is OK):
#!/usr/bin/perl
# Create a histogram of some data
# Denis Howe 2012-07-03 - 2012-07-03 18:40
use strict;
use warnings;
use POSIX qw(floor);

# my $n_bins = 10;
my $width = 50;     # Bin width in base pairs

# Read lines into @d
my @d = <DATA>;
chomp @d;

# Split each line containing a digit into a [distance, gene] pair
@d = map [split(/\s+/, $_)], grep /\d/, @d;

# Find range
my $min = 9E9;
my $max = -9E9;
foreach (@d)
{
    $min = $_->[0] if ($_->[0] < $min);
    $max = $_->[0] if ($_->[0] > $max);
}

# Round down to multiple of $width. floor() is needed here because
# int() truncates toward zero and would round a negative minimum up.
$min = floor($min / $width) * $width;

# Ensure there's a bin for max value
# my $width = ($max*1.01 - $min) / $n_bins;
my $n_bins = int(($max - $min) / $width) + 1;

# Allocate data to bins ($_->[0] - $min is never negative, so int() is safe)
my @bin;
foreach (@d)
{
    push @{$bin[int(($_->[0] - $min) / $width)]}, $_;
}

# Show content of each bin as "lower - upper" (upper bound exclusive)
foreach (0 .. $n_bins-1)
{
    next unless ($bin[$_]);     # Ignore empty bins
    printf "%6d - %6d", $min + $_*$width, $min + ($_+1)*$width;
    print map(" " . $_->[0] . ":" . $_->[1], @{$bin[$_]}), "\n";
}
__DATA__
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
EOF
Output:
-23350 - -23300 -23324:MIR198
-19750 - -19700 -19749:PRPS2
-12150 - -12100 -12111:SLC7A5
-11650 - -11600 -11643:CAMK2G
 -9050 -  -9000 -9030:KIR3DX1
  -300 -   -250 -251:ARAP1
  -100 -    -50 -75:SLC12A6 -77:CCDC88A
  8000 -   8050 8005:C14orf79
 10000 -  10050 10012:LOC100506172
 12150 -  12200 12171:MMP14
 65950 -  66000 65994:EFNB1
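Since the question was filed under Python, the same fixed-width binning can also be sketched there. This is a rough illustration rather than part of the original answer: `parse` and `bin_by_width` are hypothetical helper names, and it assumes each row is a whitespace-separated integer distance followed by a gene name. Each window is keyed by its lower bound and covers [lower, lower + width).

```python
from collections import defaultdict

# Sample rows copied from the question: distance in base pairs, gene name
ROWS = """\
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
"""


def parse(text):
    """Turn 'distance gene' lines into (int, str) pairs, skipping other lines."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), fields[1]))
    return pairs


def bin_by_width(pairs, width=50):
    """Group (distance, gene) pairs by the lower bound of their window."""
    bins = defaultdict(list)
    for distance, gene in pairs:
        # Floor division rounds toward negative infinity, so -77 // 50 == -2
        # and the window starts at -100 (truncation would give -50).
        lower = (distance // width) * width
        bins[lower].append((distance, gene))
    return dict(bins)


if __name__ == "__main__":
    width = 50
    for lower, members in sorted(bin_by_width(parse(ROWS), width).items()):
        genes = " ".join(f"{d}:{g}" for d, g in members)
        print(f"{lower:6d} - {lower + width:6d} {genes}")
```

Keying a dictionary by the lower bound avoids allocating the many empty windows between sparse values, which matters here because the distances span roughly 90,000 base pairs but fill only a handful of windows.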
HTH
Regarding "python - binning data", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11314802/