python - Searching and sorting data from multiple files

I have a set of 1000 text files named in_s1.txt, in_s2.txt, and so on. Each file contains millions of rows, and each row has 7 columns, for example:

ccc245 1 4 5 5 3 -12.3

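For me, the most important values are those in the first and seventh columns: the pair ccc245, -12.3.

What I need to do is find, across all the in_sXXXX.txt files, the 10 cases with the lowest seventh-column value, and I also need to know which file each value is in. I need something like: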

FILE  1st_col  7th_col

in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6

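I was thinking of using python and bash for this, but so far I have not found a practical approach. All I know how to do is:

concatenate all the in_s files into IN.TXT
search for the lowest values there with: for i in IN.TXT ; do sort -k7n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files with grep -n VALUE in_s* , so that I get the file name for each value

It works, but it is a bit tedious. I would like to know a faster approach using only bash, or python, or both. Or another, better-suited language.

Thanks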

Best answer

In Python, use the nsmallest function in the heapq module; it is designed for exactly this kind of task.

Example for Python 2.5 and 2.6 (tested):

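import heapq, glob

def my_iterable():
    # Lazily yield (file name, 1st column, 7th column) for every line of every file.
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

# The sort key is the 7th-column value, stored at index 2 of each tuple.
result = heapq.nsmallest(10, my_iterable(), key=lambda x: x[2])
print(result)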
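
For readers on Python 3, here is a minimal sketch of the same nsmallest approach that also prints the result in the FILE / 1st_col / 7th_col layout asked for in the question. The function names (iter_rows, lowest_rows, format_report) are made up for illustration, and skipping lines with fewer than 7 fields is an assumption about how to treat malformed input, not something the question specifies.

```python
import glob
import heapq
import os


def iter_rows(pattern):
    """Yield (file name, 1st column, 7th column as float) for every row."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) < 7:
                    continue  # assumption: ignore blank or short lines
                yield os.path.basename(path), fields[0], float(fields[6])


def lowest_rows(pattern, n=10):
    """Return the n rows with the lowest 7th-column value, lowest first."""
    return heapq.nsmallest(n, iter_rows(pattern), key=lambda row: row[2])


def format_report(rows):
    """Render rows as text, one 'file 1st_col 7th_col' line per row."""
    lines = ["FILE  1st_col  7th_col"]
    lines.extend("%s %s %s" % row for row in rows)
    return "\n".join(lines)
```

Running print(format_report(lowest_rows("in_s*.txt"))) from the directory holding the files should produce the table. The generator reads one line at a time, so the files themselves are never loaded whole.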

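Update after the above answer was accepted

Looking at the source code for Python 2.6, it appears possible that it does list(iterable) and works on that ... if so, that is not going to work with a thousand files of millions of lines each. If the first answer gives you MemoryError etc., here is an alternative that limits the size of the list to n (n == 10 in your case).

Note: 2.6 only; if you need it for 2.5, use a conditional heapreplace() as explained in the docs. This uses heappush() and heappushpop(), which don't have the key arg :-( so we have to fake it.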

import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    # Negate the 7th column and put it first, so that the largest tuples
    # correspond to the lowest original values (this fakes the key arg).
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            # Push the new item, then pop the smallest: the n largest remain.
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
# Lowest original value first, then undo the negation.
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
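
The note above mentions a conditional heapreplace() for versions without heappushpop(). A rough sketch of that variant follows; it should also run unchanged on Python 3. The names nlargest_bounded and n_lowest are invented for this example, the rows are assumed to arrive as (file name, 1st column, 7th column) tuples, and returning an empty list for n <= 0 is an added guard.

```python
from heapq import heappush, heapreplace


def nlargest_bounded(n, iterable):
    """Return the n largest items, never holding more than n at once."""
    if n <= 0:
        return []  # assumption: nothing to keep for a non-positive n
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        elif item > heap[0]:
            # heap[0] is the smallest item kept so far; replace it only
            # when the new item is larger (the "conditional" part).
            heapreplace(heap, item)
    return heap


def n_lowest(rows, n=10):
    """rows: iterable of (file name, 1st column, 7th column) tuples."""
    # Negate the value and put it first to fake a key argument.
    negated = ((-value, fname, first) for fname, first, value in rows)
    kept = sorted(nlargest_bounded(n, negated), reverse=True)
    return [(fname, first, -neg) for neg, fname, first in kept]
```

Feeding n_lowest() a generator that yields one tuple per input line keeps at most n entries in the heap at any time, which is the point of the alternative above.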

Regarding python - Searching and sorting data from multiple files, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/2048779/
