python - 对大型分隔文件进行子集化的有效方法

我之前发布过一个需要使用R解决的问题: subset recursively a data.frame ，但是该文件太大了，我需要大量的时间和 RAM 内存来读取它。我想知道我是否可以在 python 中使用 pandas 来做同样的事情，因为我是 python 的新手，并且 pandas 看起来更类似于 R，至少在它的语法上。这是我上一篇文章的摘要:

上一篇文章: 我有一个制表符分隔的文件，有近 1500 万行，其大小为 27GB。我需要一种有效的方法来根据两个标准对数据进行子集化。我可以用 for 循环来做到这一点，但想知道是否有更优雅的方法来做到这一点，并且显然更有效。 data.frame 如下所示:

SNP         CHR     BP          P
rs1000000   chr1    126890980   0.000007
rs10000010  chr4    21618674    0.262098    
rs10000012  chr4    1357325     0.344192
rs10000013  chr4    37225069    0.726325    
rs10000017  chr4    84778125    0.204275    
rs10000023  chr4    95733906    0.701778
rs10000029  chr4    138685624   0.260899
rs1000002   chr3    183635768   0.779574
rs10000030  chr4    103374154   0.964166    
rs10000033  chr2    139599898   0.111846    
rs10000036  chr4    139219262   0.564791
rs10000037  chr4    38924330    0.392908    
rs10000038  chr4    189176035   0.971481    
rs1000003   chr3    98342907    0.000004
rs10000041  chr3    165621955   0.573376
rs10000042  chr3    5237152     0.834206    
rs10000056  chr4    189321617   0.268479
rs1000005   chr1    34433051    0.764046
rs10000062  chr4    5254744     0.238011    
rs10000064  chr4    127809621   0.000044
rs10000068  chr2    36924287    0.000003
rs10000075  chr4    179488911   0.100225    
rs10000076  chr4    183288360   0.962476
rs1000007   chr2    237752054   0.594928
rs10000081  chr1    17348363    0.517486    
rs10000082  chr1    167310192   0.261577    
rs10000088  chr1    182605350   0.649975
rs10000092  chr4    21895517    0.000005
rs10000100  chr4    19510493    0.296693

我需要做的第一件事是选择那些 P 值低于阈值的 SNP，然后按 CHR 和 BP 对这个子集进行排序。一旦我有了这个子集，我需要从重要的 SNP 中获取上下 500,000 个窗口中的所有 SNP，这一步将定义一个区域。我需要对所有重要的 SNP 执行此操作，并将每个区域存储到列表或类似的内容中以进行进一步分析。例如，在显示的数据框中，CHR==chr1 的最显着 SNP(即低于阈值 0.001)是 rs1000000，CHR==chr4 的最显着 SNP 是 rs10000092。因此，这两个 SNP 将定义两个区域，我需要在每个区域中获取从每个最重要的 SNP 的 POS 上下 500,000 个区域中的 SNP。

@eddi和@rafaelpereira提供的R代码解决方案如下:

library(data.table) # v1.9.7 (devel version)
df <- fread("C:/folderpath/data.csv") # load your data
setDT(df) # convert your dataset into data.table
#1st step
# Filter data under threshold 0.05 and Sort by CHR, POS
df <- df[ P < 0.05, ][order(CHR, POS)]
#2nd step
df[, {idx = (1:.N)[which.min(P)]
  SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]

最佳答案

首先，我强烈建议从 CSV 文件切换到 PyTables(HDF 存储)并存储按 ['SNP','BP'] 排序的 DF如果可能的话，因为它的速度要快几个数量级，允许条件选择(请参阅 where 参数)并且通常占用更少的空间 - 请参阅 this comparison .

HDF store recipes

这是一个工作示例脚本，它执行以下操作:

生成样本 DF(20M 行，8 列: 'SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2' )。我故意将列数加倍，因为我认为您的 CSV 有更多的列，因为我生成的 20M 行和 8 列的 CSV 文件的大小仅 1.7GB。
将生成的 DF 保存到 CSV 文件(文件大小:1.7 GB)
以 1M 行 block 读取 CSV 文件(仅读取前 4 列)
按 ['CHR','BP'] 对 DF 进行排序并将结果保存为 PyTable (.h5)
从 HDF 读取仅存储 P < threshold 的行
从 HDF 存储中读取 min(SNP) - 500K 之间的所有行和max(SNP) + 500K - 你可能想改进这部分

代码:

import numpy as np
import pandas as pd

##############################
# generate sample DF
##############################
rows = 2*10**7
chr_lst = ['chr{}'.format(i) for i in range(1,10)]

df = pd.DataFrame({'SNP': np.random.randint(10**6, 10**7, rows).astype(str)})
df['CHR'] = np.random.choice(chr_lst, rows)
df['BP'] = np.random.randint(10**6, 10**9, rows)
df['P'] = np.random.rand(rows)
df.SNP = 'rs' + df.SNP

"""
# NOTE: sometimes it gives me MemoryError !!!
# because of that i did it "column-by-column" before
df = pd.DataFrame({
    'SNP': np.random.randint(10**6, 10**7, rows).astype(str),
    'CHR': np.random.choice(chr_lst, rows),
    'BP':  np.random.randint(10**6, 10**9, rows),
    'P':   np.random.rand(rows)
}, columns=['SNP','CHR','BP','P'])

df.SNP = 'rs' + df.SNP
"""

# make 8 columns out of 4 ...
df = pd.concat([df]*2, axis=1)
df.columns = ['SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2']

##############################
# store DF as CSV file
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
df.to_csv(csv_path, index=False)

##############################
# read CSV file (only needed cols) in chunks
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
cols = ['SNP', 'CHR', 'BP', 'P']
chunksize = 10**6

df = pd.concat([x for x in pd.read_csv(csv_path, usecols=cols,
                                       chunksize=chunksize)],
               ignore_index=True )

##############################
# sort DF and save it as .h5 file
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
(df.sort_values(['CHR','BP'])
   .to_hdf(store_path, store_key, format='t', mode='w', data_columns=True)
)


##############################
# read HDF5 file in chunks
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
chunksize = 10**6

store = pd.HDFStore(store_path)

threshold = 0.001
store_condition = 'P < %s' % threshold

i = store.select(key=store_key, where=store_condition)

# select all rows between `min(SNP) - 500K` and `max(SNP) + 500K`
window_size = 5*10**5
start = max(0, i.index.min() - window_size)
stop = min(store.get_storer(store_key).nrows, i.index.max() + window_size)
df = pd.concat([
        x for x in store.select(store_key, chunksize=chunksize,
                                start=start, stop=stop, )
               ])

# close the store before exiting...
store.close()

示例数据:

In [39]: df.head(10)
Out[39]:
                SNP   CHR       BP         P
18552732  rs8899557  chr1  1000690  0.764227
3837818   rs1883864  chr1  1000916  0.145544
13055060  rs2403233  chr1  1001591  0.116835
9303493   rs5567473  chr1  1002297  0.409937
14370003  rs1661796  chr1  1002523  0.322398
9453465   rs8222028  chr1  1004318  0.269862
2611036   rs9514787  chr1  1004666  0.936439
10378043  rs3345160  chr1  1004930  0.271848
16149860  rs4245017  chr1  1005219  0.157732
667361    rs3270325  chr1  1005252  0.395261

关于python - 对大型分隔文件进行子集化的有效方法，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/37660278/

python - 对大型分隔文件进行子集化的有效方法

上一篇：python - Pandas 按所有列中的特定行比较 2 个数据帧

下一篇：python - 使用 re.sub 的更简洁方法