python - Cleaning a pandas DataFrame

Tags: python pandas dataframe data-cleaning

I'm currently learning pandas and have run into a problem cleaning my DataFrame:

"TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms","WM1_w_ms","WM2_u_ms","WM2_v_ms","WM2_w_ms","WS1_u_ms","WS1_v_ms"
"2018-04-06 14:31:11.5",29699805,2.628,4.629,0.599,3.908,7.971,0.47,2.51,7.18
"2018-04-06 14:31:11.75",29699806,3.264,4.755,-0.095,2.961,6.094,-0.504,2.47,7.18
"2018-04-06 14:31:12",29699807,1.542,5.793,0.698,4.95,4.91,0.845,2.18,7.5
"2018-04-06 14:31:12.25",29699808,2.527,5.207,0.012,4.843,6.285,0.924,2.15,7.4
"2018-04-06 14:31:12.5",29699809,3.511,4.528,1.059,2.986,5.636,0.949,3.29,5.54
"2018-04-06 14:31:12.75",29699810,3.445,3.957,-0.075,3.127,6.561,0.259,3.85,5.45
"2018-04-06 14:31:13",29699811,2.624,5.238,-0.166,3.451,7.199,0.242,3.94,6.24

df = pd.read_csv(FilePath,parse_dates=True)  #read the csv file and save it into a variable
df = df.drop(['RECORD'],axis=1)

[Image: output of df.dtypes]

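I don't understand why pandas reads some columns as float64 and others as object. Do you have any idea? Because of this, I started trying to convert the columns myself: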

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df['WM1_u_ms':] = df.iloc[:, df.columns != 'TIMESTAMP'].values.astype(float)

But I get this error:

cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [WM1_u_ms] of <class 'str'>

Why can't pandas read the .dat file correctly from the start, and what is wrong with my conversion? As a next step, I want to interpolate with df.interpolate() to get rid of the NaNs.

Thanks for your help!

Best answer

I think you can create a DatetimeIndex directly in read_csv by passing the parse_dates and index_col parameters:

df = pd.read_csv(FilePath, parse_dates=['TIMESTAMP'], index_col=['TIMESTAMP'])

df = df.drop(['RECORD'],axis=1)

But I think the data contains some non-numeric values, so you need to_numeric with errors='coerce', which parses them as NaN:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
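As for the error in the question: df['WM1_u_ms':] is a row slice, so pandas looks for the label 'WM1_u_ms' in the row index (a RangeIndex) rather than in the columns, and fails. A minimal sketch of a column-wise alternative follows. The small frame and the names raw and value_cols are made up for illustration and are not taken from the question's file:

```python
import pandas as pd

# Hypothetical miniature of the question's frame; values are illustrative only.
raw = pd.DataFrame({
    "TIMESTAMP": ["2018-04-06 14:31:11.50", "2018-04-06 14:31:11.75"],
    "WM1_u_ms": ["2.628", "bad"],    # strings, so this column is not numeric
    "WM1_v_ms": ["4.629", "4.755"],
})

raw["TIMESTAMP"] = pd.to_datetime(raw["TIMESTAMP"])

# Columns are sliced by label with .loc[:, start:], not with df[start:].
value_cols = raw.loc[:, "WM1_u_ms":].columns

# Convert each selected column; unparseable values become NaN.
raw[value_cols] = raw[value_cols].apply(pd.to_numeric, errors="coerce")

print(raw.dtypes)
```

Using to_numeric here instead of astype(float) is a deliberate choice: astype(float) would raise on a value such as 'bad', while errors='coerce' turns it into NaN.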

A worked example using the sample data, with some strings added so that certain columns are read as object:

from io import StringIO

import pandas as pd

pd.options.display.max_columns = 20

temp=u""""TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms","WM1_w_ms","WM2_u_ms","WM2_v_ms","WM2_w_ms","WS1_u_ms","WS1_v_ms"
"2018-04-06;14:31:11.5",29699805,2.628a,4.629a,0.599s,3.908,7.971,0.47,2;;51,7.18
"2018-04-06;14:31:11.75",29699806,3.264,4.755,-0.095,2.961,6.094,-0.504,2.47,7.18
"2018-04-06;14:31:12",29699807,1.542,5.793,0.698,4.95,4.91,0.845,2.18,7.5
"2018-04-06;14:31:12.25",29699808,2.527,5.207,0.012,4.843,6.285,0.924,2.15,7.4
"2018-04-06;14:31:12.5",29699809,3.511,4.528,1.059,2.986,5.636,0.949,3.29,5.54
"2018-04-06;14:31:12.75",29699810,3.445,3.957,-0.075,3.127,6.561,0.259,3.85,5.45
"2018-04-06;14:31:13",29699811,2.624,5.238,-0.166,3.451,7.199,0.242,3.94,a"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
#(io.StringIO is used because pd.compat.StringIO is not available in recent pandas versions)
df = pd.read_csv(StringIO(temp), parse_dates=['TIMESTAMP'], index_col=['TIMESTAMP'])

print (df)
                           RECORD WM1_u_ms WM1_v_ms WM1_w_ms  WM2_u_ms  \
TIMESTAMP                                                                
2018-04-06 14:31:11.500  29699805   2.628a   4.629a   0.599s     3.908   
2018-04-06 14:31:11.750  29699806    3.264    4.755   -0.095     2.961   
2018-04-06 14:31:12.000  29699807    1.542    5.793    0.698     4.950   
2018-04-06 14:31:12.250  29699808    2.527    5.207    0.012     4.843   
2018-04-06 14:31:12.500  29699809    3.511    4.528    1.059     2.986   
2018-04-06 14:31:12.750  29699810    3.445    3.957   -0.075     3.127   
2018-04-06 14:31:13.000  29699811    2.624    5.238   -0.166     3.451   

                         WM2_v_ms  WM2_w_ms WS1_u_ms WS1_v_ms  
TIMESTAMP                                                      
2018-04-06 14:31:11.500     7.971     0.470    2;;51     7.18  
2018-04-06 14:31:11.750     6.094    -0.504     2.47     7.18  
2018-04-06 14:31:12.000     4.910     0.845     2.18      7.5  
2018-04-06 14:31:12.250     6.285     0.924     2.15      7.4  
2018-04-06 14:31:12.500     5.636     0.949     3.29     5.54  
2018-04-06 14:31:12.750     6.561     0.259     3.85     5.45  
2018-04-06 14:31:13.000     7.199     0.242     3.94        a  

print (df.dtypes)
RECORD        int64
WM1_u_ms     object
WM1_v_ms     object
WM1_w_ms     object
WM2_u_ms    float64
WM2_v_ms    float64
WM2_w_ms    float64
WS1_u_ms     object
WS1_v_ms     object
dtype: object

print (df.index)
DatetimeIndex(['2018-04-06 14:31:11.500000', '2018-04-06 14:31:11.750000',
                      '2018-04-06 14:31:12', '2018-04-06 14:31:12.250000',
               '2018-04-06 14:31:12.500000', '2018-04-06 14:31:12.750000',
                      '2018-04-06 14:31:13'],
              dtype='datetime64[ns]', name='TIMESTAMP', freq=None)
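read_csv does not treat a column as numeric when at least one of its values cannot be parsed as a number, which is why only some columns come out as float64. One way to locate the offending values is sketched below on a hypothetical two-column frame. The names sample, converted, bad_mask and offenders are illustrative:

```python
import pandas as pd

# Made-up frame imitating two of the problem columns from the example.
sample = pd.DataFrame({
    "WM1_u_ms": ["2.628a", "3.264", "1.542"],
    "WS1_u_ms": ["2;;51", "2.47", "2.18"],
})

# Coerce everything to numbers; unparseable entries become NaN.
converted = sample.apply(pd.to_numeric, errors="coerce")

# True where a value was present in the input but could not be parsed.
bad_mask = converted.isna() & sample.notna()

# Collect the original strings that failed, per column.
offenders = {
    col: sample.loc[bad_mask[col], col].tolist()
    for col in sample.columns
    if bad_mask[col].any()
}
print(offenders)
```

Checking sample.notna() as well keeps genuinely missing cells out of the report, so only malformed entries are listed.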


df = df.drop(['RECORD'],axis=1)
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

print (df)
                         WM1_u_ms  WM1_v_ms  WM1_w_ms  WM2_u_ms  WM2_v_ms  \
TIMESTAMP                                                                   
2018-04-06 14:31:11.500       NaN       NaN       NaN     3.908     7.971   
2018-04-06 14:31:11.750     3.264     4.755    -0.095     2.961     6.094   
2018-04-06 14:31:12.000     1.542     5.793     0.698     4.950     4.910   
2018-04-06 14:31:12.250     2.527     5.207     0.012     4.843     6.285   
2018-04-06 14:31:12.500     3.511     4.528     1.059     2.986     5.636   
2018-04-06 14:31:12.750     3.445     3.957    -0.075     3.127     6.561   
2018-04-06 14:31:13.000     2.624     5.238    -0.166     3.451     7.199   

                         WM2_w_ms  WS1_u_ms  WS1_v_ms  
TIMESTAMP                                              
2018-04-06 14:31:11.500     0.470       NaN      7.18  
2018-04-06 14:31:11.750    -0.504      2.47      7.18  
2018-04-06 14:31:12.000     0.845      2.18      7.50  
2018-04-06 14:31:12.250     0.924      2.15      7.40  
2018-04-06 14:31:12.500     0.949      3.29      5.54  
2018-04-06 14:31:12.750     0.259      3.85      5.45  
2018-04-06 14:31:13.000     0.242      3.94       NaN  

print (df.dtypes)
WM1_u_ms    float64
WM1_v_ms    float64
WM1_w_ms    float64
WM2_u_ms    float64
WM2_v_ms    float64
WM2_w_ms    float64
WS1_u_ms    float64
WS1_v_ms    float64
dtype: object

Regarding python - Cleaning a pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53744456/
