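I have a tab-delimited file from which I need to extract everything in column 12 (the document categories). However, the content of column 12 is highly repetitive, so first I need a list of the distinct categories (with the repeats removed), which also gives me the number of categories. Then I need a way to get the number of lines per category. My attempt is as follows: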
def remove_duplicates(l): # define function to remove duplicates
    return list(set(l))

input = sys.argv[1] # command line arguments to open tab file
infile = open(input)
for lines in infile: # split content into lines
    words = lines.split("\t") # split lines into words i.e. columns
    dataB2.append(words[11]) # column 12 contains the desired repetitive categories
    dataB2 = dataA.sort() # sort the categories
    dataB2 = remove_duplicates(dataA) # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
    print(len(dataB2))
infile.close()
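I also have no idea how to get the number of lines per category. So my questions are: how do I effectively eliminate the duplicates, and how do I get the number of lines per category?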
Best answer
I'd suggest using Python's Counter for this. A Counter does almost exactly what you are asking for, so your code would look like this:
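from collections import Counter
import sys

count = Counter()

# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for lines in infile: # split content into lines
        # strip the newline so a category in the last column is not counted as "name\n"
        words = lines.rstrip("\n").split("\t") # split lines into words i.e. columns
        count.update([words[11]]) # tally the category found in column 12
print(count) # maps each category to its number of lines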
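To make both parts of the question explicit (the distinct categories and the lines per category), here is a minimal sketch of reading those values back out of a Counter. The function name `count_categories`, its `column` and `sep` parameters, and the sample rows are made up for illustration and are not part of the original answer; the sketch works on an in-memory list of lines so it can be run without an input file.

```python
from collections import Counter


def count_categories(lines, column=11, sep="\t"):
    """Count how many rows fall into each category of one column.

    Hypothetical helper for illustration: `column` is zero-based,
    so 11 selects the 12th column.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > column:  # skip blank or short rows
            counts[fields[column]] += 1
    return counts


# Made-up sample rows: 11 filler columns followed by the category column.
sample = [
    "\t".join(["x"] * 11 + ["report"]) + "\n",
    "\t".join(["x"] * 11 + ["memo"]) + "\n",
    "\t".join(["x"] * 11 + ["report"]) + "\n",
]

counts = count_categories(sample)
categories = sorted(counts)  # distinct categories, duplicates already removed
print(len(categories))       # number of distinct categories
for category, n in counts.most_common():
    print(category, n)       # lines per category, most frequent first
```

The keys of the Counter are the de-duplicated categories, so no separate `remove_duplicates` step is needed. Also note that `list.sort()` sorts in place and returns `None`, so assigning its result to a variable discards the list; `sorted(...)` returns a new list instead.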
This question, "python - Removing duplicates from list product from tab delimited file and further classification", was found on Stack Overflow: https://stackoverflow.com/questions/34279904/