python - Using pandas read_csv() twice on an open file

While experimenting with pandas, I noticed some odd behavior from pandas.read_csv and was hoping someone with more experience could explain what causes it.

To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

This works fine, and calling the class from my __main__.py successfully creates a pandas DataFrame:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

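However, I noticed that this process automatically uses the first row of the .csv as the pandas.DataFrame.columns index. I decided to number the columns instead. Since I did not want to assume I knew the number of columns in advance, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the correct number of column names using range().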

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their 
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

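Keeping the code in my __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the correct names (0...499), but it was otherwise empty (no row data).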

Scratching my head, I decided to close self.csvfile and reopen it like so:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

Closing the file and reopening it correctly returns a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.

My question is: why does closing the file and reopening it make a difference?

Best answer

When you open a file with

open(filepath)

a file handle is returned, and that handle is an iterator. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

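reads the contents and exhausts the iterator. A subsequent call to pd.read_csv on the same handle sees the iterator as empty, which is why the second call produced the column names passed via names but no rows.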
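To illustrate the exhausted-handle behavior without pandas, here is a minimal sketch using only the standard library. The temporary file and its contents are made up for the example; the point is only that a second read on the same handle returns nothing until the position is rewound.

```python
import os
import tempfile

# Write a tiny throwaway CSV so the example is self-contained
# (the contents are hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    path = tmp.name

with open(path) as handle:
    first = handle.read()    # consumes everything; the position is now at end of file
    second = handle.read()   # nothing left to read, so this is an empty string
    handle.seek(0)           # rewind to the beginning of the file
    third = handle.read()    # the full contents are available again

os.remove(path)

print(repr(first))
print(repr(second))
print(first == third)
```

Closing and reopening the file has the same effect as the rewind here, because a freshly opened handle starts at the beginning.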

Note that you can avoid this problem by passing the file path to pd.read_csv instead:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)


        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath, 
                                        names=range(self.numcolumns))

pd.read_csv will open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
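As a rough sketch of the seek(0) option, the snippet below reads the same handle twice with a rewind in between. It uses io.StringIO as a stand-in for an open file, and the sample data and variable names are assumptions made for the example, not part of the original code.

```python
import io

import pandas as pd

# io.StringIO stands in for an open file handle (hypothetical sample data).
handle = io.StringIO("x,y,z\n1,2,3\n4,5,6\n")

# First pass: only used to count the columns.
probe = pd.read_csv(handle)
num_columns = len(probe.columns)

# Rewind the handle; without this the second read would see no data.
handle.seek(0)

# Passing names= without header= makes pandas treat the first line as data.
frame = pd.read_csv(handle, names=list(range(num_columns)))

print(frame.shape)
```

Because names is given and no header is specified, the original header line ends up as the first data row, so the result has one more row than the first pass.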

Better yet, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)
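One thing to be aware of, which the answer above does not spell out: renaming the columns after a default read still consumes the first line of the file as a header, whereas the two-pass approach with names kept it as data. If the first line should be kept as data, pandas also accepts header=None, which numbers the columns 0..n-1 in a single read. The comparison below is a small sketch with made-up sample data.

```python
import io

import pandas as pd

# Hypothetical sample data: three lines, three columns.
csv_text = "x,y,z\n1,2,3\n4,5,6\n"

# header=None: no header row is assumed, so every line is data and the
# columns are labelled 0..n-1 automatically.
numbered = pd.read_csv(io.StringIO(csv_text), header=None)

# Rename approach: the first line is consumed as the header and its
# labels are then overwritten with integers.
renamed = pd.read_csv(io.StringIO(csv_text))
renamed.columns = range(len(renamed.columns))

print(numbered.shape, renamed.shape)
```

Which of the two is appropriate depends on whether the first line of the .csv is a real header or ordinary data.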

For more on python - Using pandas read_csv() twice on an open file, see the similar question on Stack Overflow: https://stackoverflow.com/questions/25943208/
