python - Parsing Pandas dataframe columns to check for identical values

I am working with a huge csv file (873,323 x 271) that looks similar to the following:

| Part_Number |   Type_Code   |  Building_Code | Handling_Code | Price to Buy | Price to Sell |      Name     |
|:-----------:|:-------------:|:--------------:|:-------------:|:------------:|:-------------:|:-------------:|
|      A      |      1, 2     |   XX, XX, XX   | Y, Y, Y, Y, Y |    304.32    |      510      |     Mower     |
|      B      |    1, 1, 1    |   XX, XX, XX   |   Y, Y, Y, Y  |    1282.04   |      5000     |      Saw      |
|      C      |    1, 2, 3    |     XX, XX     |      Y, Y     |     68.91    |       65      | Barrel (Hard) |
|      D      | 1, 1, 1, 1, 1 | XX, XX, XX, XX |    Y, Y, Y    |       0      |      300      | Barrel (Make) |
|      E      |       1       |       XX       |   Y, Y, Y, Y  |    321.11    |      415      |  Cement Mixer |
|      F      |       2       |   XX, XX, XX   |       Y       |    194.44    |      1095     |   Cement Mix  |

There are several column types: some are numeric, some are strings, and some are strings that look like lists (i.e., Type_Code, Building_Code, Handling_Code, etc.).

What I want to accomplish is:

If every item in a list-like cell is the same value, remove the list-like structure and replace it with just that value, i.e., 1, 1, 1 should become just 1. Numerical and non-list-like strings should not be changed.

The table above, modified:

| Part_Number | Type_Code | Building_Code | Handling_Code | Price to Buy | Price to Sell |      Name     |
|:-----------:|:---------:|:-------------:|:-------------:|:------------:|:-------------:|:-------------:|
|      A      |    1, 2   |       XX      |       Y       |    304.32    |      510      |     Mower     |
|      B      |     1     |       XX      |       Y       |    1282.04   |      5000     |      Saw      |
|      C      |  1, 2, 3  |       XX      |       Y       |     68.91    |       65      | Barrel (Hard) |
|      D      |     1     |       XX      |       Y       |       0      |      300      | Barrel (Make) |
|      E      |     1     |       XX      |       Y       |    321.11    |      415      |  Cement Mixer |
|      F      |     2     |       XX      |       Y       |    194.44    |      1095     |   Cement Mix  |

(i.e., since Building_Code is just an aggregation of XX, it should show only XX)

Here is my attempt so far:

import pandas as pd

# Read in CSV
df = pd.read_csv('C:\\Users\\wundermahn\\Desktop\\test_stack_csv.csv')

# Turn all columns into a list
for col in df.columns:
    col_name = str(col)
    temp = pd.DataFrame(df[col_name].tolist())
    df.drop(col, axis=1, inplace=True)
    df = pd.concat([df, temp], axis=1, join='inner')

# Now loop through the columns and remove items from the list
for col in df.columns:
    # If all items are the same
    if (len(set(col)) <= 1):
        # Set it to be that item
        col = col[0]
    else:
        # If they aren't the same, then just take the items out of the list
        col = str(col)

print(df)

But I get an error:

Traceback (most recent call last):
  File "c:\Users\wundermahn\Desktop\stack_0318.py", line 15, in <module>
    if (len(set(col)) <= 1):
TypeError: 'int' object is not iterable

How can I achieve the result I want?
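
A note on the traceback: iterating over df.columns yields column labels, not cell values, and rebuilding each column with pd.DataFrame(df[col_name].tolist()) replaces its name with the default integer label 0. The second loop therefore ends up calling set(0). The sketch below reproduces this on a two-row stand-in frame; the sample data and the variable names labels, message, and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real CSV.
df = pd.DataFrame({"Type_Code": ["1, 1, 1", "1, 2"]})

# Rebuilding a column from a plain list discards its name;
# the new frame gets a default integer column label instead.
temp = pd.DataFrame(df["Type_Code"].tolist())
labels = list(temp.columns)
print(labels)

# Iterating over .columns yields the labels, never the cell values,
# so set(col) is called on an integer and raises TypeError.
message = None
for col in temp.columns:
    try:
        set(col)
    except TypeError as exc:
        message = str(exc)
        print(message)

# To reach the cell values, index the frame with the label.
values = temp[labels[0]].tolist()
print(values)
```

Assigning to col inside the loop (col = col[0]) would also only rebind the loop variable; it does not write anything back into df.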

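Best answer

This looks like a job for a custom function that splits each string on , and joins the pieces back together after removing duplicates, for which I use dict.fromkeys: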

f = lambda x:','.join(dict.fromkeys([i.strip() for i in x.split(',')]).keys())

df.loc[:,df.dtypes.eq('object')]=df.select_dtypes('O').applymap(f)

print(df)

   Part_Number Type_Code Building_Code Handling_Code  Price to Buy  \
0           A       1,2            XX             Y        304.32   
1           B         1            XX             Y       1282.04   
2           C     1,2,3            XX             Y         68.91   
3           D         1            XX             Y          0.00   
4           E         1            XX             Y        321.11   
5           F         2            XX             Y        194.44   

   Price to Sell           Name  
0            510          Mower  
1           5000            Saw  
2             65  Barrel (Hard)  
3            300  Barrel (Make)  
4            415   Cement Mixer  
5           1095     Cement Mix
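
Note that the function above removes every duplicate, so a cell such as 1, 2, 1 would become 1,2, and the separator changes from ", " to ",". If cells should be collapsed only when all of their items are identical, as the question describes, a stricter variant might look like the sketch below. The helper name collapse_if_uniform and the sample frame are assumptions for illustration; the sketch also skips non-string cells such as NaN, which the lambda above would fail on. It uses Series.map column by column rather than applymap, which was renamed to DataFrame.map in pandas 2.1.

```python
import pandas as pd


def collapse_if_uniform(value):
    """Return the single repeated item if all comma-separated items match.

    Anything else (mixed items, NaN, numbers) is returned unchanged.
    """
    if not isinstance(value, str):
        return value  # leave NaN and other non-string cells alone
    items = [item.strip() for item in value.split(",")]
    return items[0] if len(set(items)) == 1 else value


# Hypothetical sample mirroring a few rows of the table in the question.
df = pd.DataFrame({
    "Part_Number": ["A", "B", "C"],
    "Type_Code": ["1, 2", "1, 1, 1", "1, 2, 3"],
    "Building_Code": ["XX, XX, XX", "XX, XX, XX", "XX, XX"],
    "Price to Buy": [304.32, 1282.04, 68.91],
})

# Only touch non-numeric columns; numeric columns are skipped entirely.
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].map(collapse_if_uniform)

print(df)
```

One caveat: any free-text column whose value happens to be the same word repeated with commas would be collapsed as well, so it may be safer to pass an explicit list of list-like column names instead of looping over every non-numeric column.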

Regarding python - Parsing Pandas dataframe columns to check for identical values, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60740834/
