python - How do I split a single CSV file into multiple smaller files grouped by field and drop columns from the final CSVs?

Tags: python csv

Although I suspect this sounds like a duplicate question, I have not found a solution. I have a large .csv file that looks like this:

prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....

The goal is to split this .csv file into multiple smaller .csv files based on the last two columns ("ident" and "country").

I used the code from an answer to an earlier post, as follows:

csv_contents = []
with open(outfile_path4, 'rb') as fin:
  dict_reader = csv.DictReader(fin)   # default delimiter is comma
  fieldnames = dict_reader.fieldnames # save for writing
  for line in dict_reader:            # read in all of your data
    csv_contents.append(line)         # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc','ident','country'))


for groupkey, groupdata in it.groupby(sorted_csv_contents, 
                                  key=op.itemgetter('prot_desc','ident','country')):

  with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
    dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)    
    dict_writer.writerows(groupdata)

However, I need my output .csv files to contain only the "pep_seq" column. The desired output looks like this:

pep_seq    
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW

What can I do?

Best answer

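Your code is almost correct. You only need to set fieldnames to the columns you want and pass extrasaction='ignore'. This tells DictWriter to write only the fields you specify and to skip any other keys in each row: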

import csv
import itertools
import operator
import os

outfile_path4 = 'input.csv'
outfile_path5 = 'my_output_folder'      # output directory, must already exist
csv_contents = []

# Python 3: open CSV files in text mode with newline=''
with open(outfile_path4, 'r', newline='') as fin:
    dict_reader = csv.DictReader(fin)   # default delimiter is comma

    for line in dict_reader:            # read in all of your data
        csv_contents.append(line)       # gather data into a list (of dicts)

group = ['prot_desc', 'ident', 'country']
# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=operator.itemgetter(*group))

for groupkey, groupdata in itertools.groupby(sorted_csv_contents, key=operator.itemgetter(*group)):
    # groupkey is a tuple of strings, so join its parts to build the file name
    # (assumes the key values contain only characters that are valid in file names)
    out_name = os.path.join(outfile_path5, 'slice_{}.csv'.format('_'.join(groupkey)))
    with open(out_name, 'w', newline='') as fou:
        # write only pep_seq; every other key in the row dicts is ignored
        dict_writer = csv.DictWriter(fou, fieldnames=['pep_seq'], extrasaction='ignore')
        dict_writer.writeheader()
        dict_writer.writerows(groupdata)

This will give you an output csv file containing:

pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
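
To check the two key pieces (sorting before groupby, and extrasaction='ignore') without touching the disk, here is a minimal sketch that works on in-memory text instead of files. The sample rows, the shortened protein names, and the split_to_text helper are made up for illustration and are not part of the original answer.

```python
import csv
import io
import itertools
import operator

# Hypothetical sample rows, shaped like the question's data (values shortened).
SAMPLE = """prot_desc,pep_seq,ident,country
seed protein,ANSPV,F40,EB
other protein,QQQ,F41,EB
seed protein,ANSPVL,F40,EB
"""

def split_to_text(csv_text, group_fields, keep_fields):
    """Group rows by group_fields; return {group key: CSV text with only keep_fields}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    key = operator.itemgetter(*group_fields)
    result = {}
    # groupby only merges adjacent rows, so sort by the same key first
    for groupkey, groupdata in itertools.groupby(sorted(rows, key=key), key=key):
        buf = io.StringIO()
        # extrasaction='ignore' drops every dict key not listed in fieldnames
        writer = csv.DictWriter(buf, fieldnames=keep_fields,
                                extrasaction='ignore', lineterminator='\n')
        writer.writeheader()
        writer.writerows(groupdata)
        result[groupkey] = buf.getvalue()
    return result

slices = split_to_text(SAMPLE, ['prot_desc', 'ident', 'country'], ['pep_seq'])
for groupkey, text in slices.items():
    print(groupkey)
    print(text)
```

With the default extrasaction='raise', the same writerows call would raise ValueError, because each row dict still carries the columns that are not listed in fieldnames.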

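Regarding "python - How do I split a single CSV file into multiple smaller files grouped by field and drop columns from the final CSVs?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36014075/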
