python - Pandas read_table() is missing rows

The Pandas read_table function is dropping some rows from a file I am trying to read, and I can't figure out why.

import pandas as pd
import numpy as np
filename = "whatever.txt"

df_pd = pd.read_table(filename, usecols=['FirstColumn'], skip_blank_lines=False)
df_np = np.genfromtxt(filename, usecols=0)

# Count the lines in a file by iterating over it line by line
def file_len(fname):
    count = 0  # stays 0 for an empty file
    with open(fname) as f:
        for count, _ in enumerate(f, start=1):
            pass
    return count

len_pd = len(df_pd)
len_np = len(df_np)
len_linebyline = file_len(filename)

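Unfortunately, I can't share my actual data: it is a huge file (30 columns x 58 million rows) and it is also protected by a license. For some reason, the numpy and file_len methods both give the correct length of roughly 58 million rows, but the pandas method gives only about 55 million.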

Does anyone have an idea of what could be causing this, or how I could investigate it?

Best answer

You can try to find the missing data with the following approach:

In [31]: df = pd.DataFrame({'col':[0,1,2,3,4,6,7,8]})

In [32]: a = np.arange(10)

In [33]: df
Out[33]:
   col
0    0
1    1
2    2
3    3
4    4
5    6
6    7
7    8

In [34]: a
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [35]: np.setdiff1d(a, df.col)
Out[35]: array([5, 9])
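
The same idea can be applied to the file itself. The sketch below is only an illustration under stated assumptions: the sample data is made up, the first column is assumed to hold unique values (otherwise np.setdiff1d is not a reliable check), and stray double quotes are only one possible cause of lost rows, not something the question confirms. It compares the first column as pandas reads it against a naive tab split of every line, then re-reads the data with quoting=csv.QUOTE_NONE to see whether the row count recovers.

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: a tab-separated file in which two fields
# contain stray double quotes (rows 3 and 5).
raw = (
    "FirstColumn\tSecondColumn\n"
    "1\ta\n"
    "2\tb\n"
    "3\t\"c\n"
    "4\td\n"
    "5\te\"\n"
    "6\tf\n"
)

# Reference values: first field of every data line, split naively on tabs.
expected = [line.split("\t")[0] for line in raw.splitlines()[1:]]

# Default read: a double quote at the start of a field opens a quoted
# field, which can span several physical lines.
df_default = pd.read_table(io.StringIO(raw), usecols=["FirstColumn"], dtype=str)

# Values present in the file but absent from the DataFrame.
missing = np.setdiff1d(expected, df_default["FirstColumn"])
print(len(expected), len(df_default), missing)

# Same read with quote handling disabled, so every line is one record.
df_noquote = pd.read_table(
    io.StringIO(raw),
    usecols=["FirstColumn"],
    dtype=str,
    quoting=csv.QUOTE_NONE,
)
print(len(df_noquote))
```

In this toy case the default read folds rows 3 through 5 into a single record, so the missing values point at the lines just after the first stray quote. On a 58-million-row file the reference column would have to be built in chunks or streamed rather than held in a list.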

This question, python - Pandas read_table() is missing rows, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/42097561/
