python - Removing duplicates from a list produced from a tab-delimited file and categorizing further

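I have a tab-delimited file from which I need to extract everything in column 12 (the document categories). However, the contents of column 12 are highly repetitive, so first I need a list that gives just the number of categories (by removing duplicates). Then I need to find a way to get the number of rows for each category. My attempt is as follows: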

def remove_duplicates(l): # define function to remove duplicates
    return list(set(l))

input = sys.argv[1] # command line arguments to open tab file
infile = open(input)
for lines in infile: # split content into lines
    words = lines.split("\t") # split lines into words i.e. columns
    dataB2.append(words[11]) # column 12 contains the desired repetitive categories
    dataB2 = dataA.sort() # sort the categories
    dataB2 = remove_duplicates(dataA) # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
    print(len(dataB2))
infile.close()

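I don't know how to get the number of rows for each category. So my questions are: how can I remove the duplicates efficiently, and how do I get the number of rows per category?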

Best answer

I would suggest using a Python Counter for this. A Counter does almost exactly what you are asking for, so your code would look like this:

from collections import Counter
import sys

count = Counter()

# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for line in infile:  # iterate over the file line by line
        words = line.rstrip("\n").split("\t")  # split the line into columns
        count[words[11]] += 1  # column 12 holds the category

print(count)
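
Once the Counter is filled, both parts of the question can be read off it: its keys are the de-duplicated categories, and its values are the row counts. Here is a minimal sketch of that idea, assuming the file parses cleanly with the standard csv module. The helper name count_categories, its column parameter, and the sample rows are made up for illustration and do not come from the question.

```python
import csv
from collections import Counter


def count_categories(rows, column=11):
    """Count how many rows fall into each category found in `column`.

    `rows` is any iterable of tab-delimited lines (an open file works).
    The function and parameter names are illustrative, not from the question.
    """
    counts = Counter()
    for fields in csv.reader(rows, delimiter="\t"):
        if len(fields) > column:  # skip blank or short rows
            counts[fields[column]] += 1
    return counts


# Hypothetical in-memory sample standing in for the real file.
sample = [
    "\t".join(["x"] * 11 + ["news"]),
    "\t".join(["x"] * 11 + ["sports"]),
    "\t".join(["x"] * 11 + ["news"]),
]
counts = count_categories(sample)

unique_categories = sorted(counts)  # de-duplicated, sorted category names
print(len(counts))                  # number of distinct categories
for category, n in counts.most_common():
    print(category, n)              # rows per category, most frequent first
```

For a real file, passing the open file object (for example one opened with open(sys.argv[1], newline="")) in place of sample should work the same way, since csv.reader accepts any iterable of lines.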

Regarding python - Removing duplicates from a list produced from a tab-delimited file and categorizing further, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34279904/
