python - Looping over multiple CSV files in Python and extracting the rows with non-empty cells in a specific column

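I wrote some code to process many CSV files. For each of them, I want to extract all rows that correspond to the non-empty cells of a column named "20201-2.0". Take a look at the attached example (it is the LCE column):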

https://uoe-my.sharepoint.com/personal/gpapanas_ed_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fgpapanas%5Fed%5Fac%5Fuk%2FDocuments%2FCSV%20File%20screenshot%2EPNG&parent=%2Fpersonal%2Fgpapanas%5Fed%5Fac%5Fuk%2FDocuments&originalPath=aHR0cHM6Ly91b2UtbXkuc2hhcmVwb2ludC5jb20vOmk6L2cvcGVyc29uYWwvZ3BhcGFuYXNfZWRfYWNfdWsvRWF5QmJsRlRIbVZKdlJmc0I2aDhWcjRCMDlJZmpRMkwxSTVPUUtVTjJwNXd6dz9ydGltZT10V2Y0c2Q1UzEwZw

I wrote the following code to do this:

import pandas as pd
import glob
import os

path = './'
#column = ['20201-2.0']

all_files = glob.glob(path + "/*.csv")

for filename in all_files:

    # Option 1 below worked, although without isolating the non-nulled values
    # 1. df = pd.read_csv(filename, encoding="ISO-8859-1")
    df = pd.read_csv(filename, header = 0)
    df = df[df['20201-2.0'].notnull()]

    print('extracting info from csv...')
    print(df)

    # You can now export all outcomes in new csv files
    file_name = filename + 'new' + '.csv'
    save_path = os.path.abspath(
        os.path.join(
            path, file_name
        )
    )
    print('saving ...')
    export_csv = df.to_csv(save_path, index=None)

    del df
    del export_csv

However, although I managed to generate the first file, I get the following error:

Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '20201-2.0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/PycharmProjects/OPTIMAT/Read_MR_from_all_csv.py", line 21, in <module>
    df = df[df['20201-2.0'].notnull()]
  File "/home/giorgos/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '20201-2.0'

I don't understand why this happens. Any ideas would be much appreciated.
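
The KeyError means that at least one of the files being read has no column labelled exactly '20201-2.0'. A minimal sketch for finding out which file is responsible, assuming every file is a comma-separated CSV with a header row; the helper name `files_missing_column` is hypothetical, not part of pandas:

```python
import glob
import os

import pandas as pd


def files_missing_column(folder, column):
    """Return the CSV files in `folder` whose header lacks `column`."""
    missing = []
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # nrows=0 reads only the header row, so this stays cheap for big files
        header = pd.read_csv(filename, nrows=0)
        if column not in header.columns:
            missing.append(filename)
    return missing
```

Calling `files_missing_column('.', '20201-2.0')` returns the names of the files that would trigger the error, which can then be inspected for a different header, delimiter or encoding.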

Best answer

Happy to say I found a way to do this:

import os

import pandas as pd

path = './'
column = '20201-2.0'

# os.listdir returns bare file names, so join them with `path` before reading
for filename in os.listdir(path):
    if not filename.endswith('.csv'):
        continue

    print('extracting info from ' + filename)
    # pd.read_csv(filename, encoding="ISO-8859-1") also worked,
    # although without isolating the non-null values
    df = pd.read_csv(os.path.join(path, filename), header=0)

    # Keep only the rows whose cell in `column` is not empty (NaN)
    df_subset = df.dropna(subset=[column])
    print('processed ' + filename)

    # Export the filtered rows to <original name>_new.csv
    file_name = os.path.splitext(filename)[0] + '_new.csv'

    print('saving to ' + file_name)
    # to_csv returns None when given a path, so there is nothing to keep or delete
    df_subset.to_csv(os.path.join(path, file_name), index=False)
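
Note that `df[df[col].notnull()]` and `df.dropna(subset=[col])` select the same rows, and both raise a KeyError when the column is absent. The sketch below is one way to make the loop tolerant of that. It assumes it is acceptable to skip files without the column, and to skip files the script itself wrote on an earlier run; the function name `extract_non_empty_rows` and its `suffix` parameter are illustrative and not from the original answer:

```python
from pathlib import Path

import pandas as pd


def extract_non_empty_rows(folder, column, suffix="_new"):
    """Write <name><suffix>.csv for every CSV in `folder` that has `column`.

    Returns the list of output paths that were written.
    """
    written = []
    # sorted() materialises the listing before any output file is created
    for src in sorted(Path(folder).glob("*.csv")):
        if src.stem.endswith(suffix):
            continue  # output of an earlier run, do not process it again
        df = pd.read_csv(src, header=0)
        if column not in df.columns:
            print(f"skipping {src.name}: no column {column!r}")
            continue
        out = src.with_name(src.stem + suffix + ".csv")
        # Drop the rows whose cell in `column` is empty (NaN), keep the rest
        df.dropna(subset=[column]).to_csv(out, index=False)
        written.append(out)
    return written
```

For example, `extract_non_empty_rows('.', '20201-2.0')` processes the current directory and reports which files were skipped instead of stopping at the first one that lacks the column.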

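For this topic (python - looping over multiple CSV files in Python and extracting the rows with non-empty cells in a specific column), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58428610/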
