python - What is the fastest way to read a large data file of text columns?

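I have a data file of almost 9 million lines (soon to be more than 500 million lines) and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields that I want. My Python routine takes 45 secs: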

import time

start = time.time()
filename = 'test.txt'    # space-delimited, aligned columns
trans=[]
numax=0
for line in open(filename,'r'):
    # fixed-width slices counted from the end of the line (the last character is the newline)
    nu=float(line[-23:-11]); S=float(line[-10:-1])
    if nu>numax: numax=nu
    trans.append((nu,S))
end=time.time()
print('%d transitions read in %.1f secs' % (len(trans), end-start))
print('numax = %s' % numax)

whereas the routine I've come up with in C is a more pleasing 4 secs:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226

int main(void) {
  size_t num;
  unsigned long i;
  char buf[BPL];
  char* sp;
  double *nu, *S;
  double numax;
  FILE *fp;
  time_t start,end;

  nu = (double *)malloc(NTRANS * sizeof(double));
  S = (double *)malloc(NTRANS * sizeof(double));

  start = time(NULL);
  if ((fp=fopen(FILENAME,"rb"))!=NULL) {
    i=0;
    numax=0.;
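    while (i < NTRANS) {
      num = fread(buf, 1, BPL, fp);
      if (num != BPL) {break;}   /* short read: end of file, nothing to parse */
      buf[BPL-1]='\0';           /* overwrite the newline to terminate S */
      sp = &buf[BPL-10]; S[i] = atof(sp);
      buf[BPL-11]='\0';          /* overwrite the separator to terminate nu */
      sp = &buf[BPL-23]; nu[i] = atof(sp);
      if (nu[i]>numax) {numax=nu[i];}
      ++i;
    }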
    fclose(fp);
    end = time(NULL);
    fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
    fprintf(stdout, "that took %.1f secs\n", difftime(end,start));
  } else {
    fprintf(stderr, "Error opening file %s\n", FILENAME);
    free(nu); free(S);
    return EXIT_FAILURE;
  }

  free(nu); free(S);
  return EXIT_SUCCESS;
}

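Solutions in Fortran, C++ and Java take intermediate amounts of time (27 secs, 20 secs, 8 secs). My question is: have I made any outrageous blunders in the above (particularly the C code)? And is there any way to speed up the Python routine? I quickly realized that storing my data in an array of tuples was better than instantiating a class for each entry.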

Best answer

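A few points:

Your C routine is cheating; it is told the file size in advance, and it is pre-allocating ...
Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocating.
Python: write your routine as a function and call it; accessing function-local variables is considerably faster than accessing module-global variables.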
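
The Python suggestions above could be sketched roughly as follows. This is only an illustration, not a benchmarked solution: the function names read_transitions and read_transitions_preallocated are made up here, the code is written for Python 3, and it assumes every line (including the last) ends with a newline, as the slice offsets in the question do. The pre-allocating variant also assumes the caller knows the line count, which is the same head start the C program gets from NTRANS.

```python
from array import array


def read_transitions(filename):
    """Read the nu and S fields of each fixed-width line into two arrays."""
    nu_values = array('d')      # compact arrays of C doubles, one per field
    s_values = array('d')
    add_nu = nu_values.append   # bound methods held in local variables
    add_s = s_values.append
    numax = 0.0
    with open(filename, 'r') as f:
        for line in f:
            nu = float(line[-23:-11])
            add_nu(nu)
            add_s(float(line[-10:-1]))
            if nu > numax:
                numax = nu
    return nu_values, s_values, numax


def read_transitions_preallocated(filename, n_lines):
    """Same as read_transitions, but fills arrays sized up front.

    n_lines is an assumption supplied by the caller; a file with more
    lines than n_lines raises IndexError.
    """
    nu_values = array('d', [0.0]) * n_lines
    s_values = array('d', [0.0]) * n_lines
    numax = 0.0
    with open(filename, 'r') as f:
        for i, line in enumerate(f):
            nu = float(line[-23:-11])
            nu_values[i] = nu
            s_values[i] = float(line[-10:-1])
            if nu > numax:
                numax = nu
    return nu_values, s_values, numax
```

Calling read_transitions('test.txt') returns the two arrays and the largest nu value. Whether either variant actually beats the list of tuples depends on the interpreter and the data, so it is worth timing both against the original loop.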

Regarding "python - What is the fastest way to read a large data file of text columns?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/3779073/
