python - Splitting one column into multiple columns / cleaning a dataset

Tags: python pandas dataframe

I have loaded a table from a PDF into a pandas DataFrame, as shown below:

import pandas as pd

df_current = pd.DataFrame({
    'Country': ['NaN', 'NaN', 'NaN', 'NaN', 'Denmark', 'Sweden', 'Germany'],
    'Explained Part': ['Personal and job characteristics',
                       'Education Occupation Job Employment',
                       'experience contract', 'Employment contract',
                       '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})

Expected output (my end goal):

df_expected = pd.DataFrame({'Country': ['Denmark', 'Sweden',
'Germany'],'Personal and job characteristics':[20 ,4,-9],
'Education Occupation Job Employment':[-7,6,-6],
'experience contract':[2,2,-1],'Employment contract':[0,0,':']})

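The problem: the 'Explained Part' column holds the data for 4 columns, and some of the values appear as symbols such as ':'.

I was thinking of using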

     df[['Personal and job characteristics',
'Education Occupation Job Employment',
'experience contract',
'experience contract']] = df['Explained part'].str.split(" ",expand=True,)

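but I can't get it to work.

I wanted to split the column into 3, but in some cells the numbers are already split. Any ideas?

Thanks in advance. P.S. I have updated the question because I think my first post was too hard to follow. I have now added some data from the actual problem along with the expected output. Thanks for the feedback so far!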

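Best answer

If the NaN entries are missing values, first remove the rows that contain them with DataFrame.dropna, then apply your solution, using DataFrame.pop to extract the column: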

import numpy as np
import pandas as pd

df_current = pd.DataFrame({
    'Country': [np.nan, np.nan, np.nan, np.nan, 'Denmark', 'Sweden', 'Germany'],
    'Explained Part': ['Personal and job characteristics',
                       'Education Occupation Job Employment',
                       'experience contract', 'Employment contract',
                       '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})
print(df_current)
   Country                       Explained Part
0      NaN     Personal and job characteristics
1      NaN  Education Occupation Job Employment
2      NaN                  experience contract
3      NaN                  Employment contract
4  Denmark                            20 -7 2 0
5   Sweden                              4 6 2 0
6  Germany                           -9 -6 -1 :
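
# Keep only the rows that have a country (this drops the four header rows)
df = df_current.dropna(subset=['Country']).copy()
cols = ['Personal and job characteristics', 'Education Occupation Job Employment',
        'experience contract', 'Employment contract']
# pop removes 'Explained Part' from df and returns it; splitting on whitespace
# with expand=True yields one column per value
df[cols] = df.pop('Explained Part').str.split(expand=True)
print(df)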
   Country Personal and job characteristics  \
4  Denmark                               20   
5   Sweden                                4   
6  Germany                               -9   

  Education Occupation Job Employment experience contract Employment contract  
4                                  -7                   2                   0  
5                                   6                   2                   0  
6                                  -6                  -1                   :  
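
Note that str.split returns strings, so the new columns above hold text rather than numbers, while the expected output contains integers. A minimal sketch of one way to convert them with pd.to_numeric follows. It assumes that the ':' marker means "no data" and may be turned into NaN; that is an assumption, since the question keeps ':' as it is.

```python
import numpy as np
import pandas as pd

df_current = pd.DataFrame({
    'Country': [np.nan, np.nan, np.nan, np.nan, 'Denmark', 'Sweden', 'Germany'],
    'Explained Part': ['Personal and job characteristics',
                       'Education Occupation Job Employment',
                       'experience contract', 'Employment contract',
                       '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})

cols = ['Personal and job characteristics', 'Education Occupation Job Employment',
        'experience contract', 'Employment contract']

# Same steps as in the answer: drop the header rows, then split the values
df = df_current.dropna(subset=['Country']).copy()
df[cols] = df.pop('Explained Part').str.split(expand=True)

# Assumption: ':' stands for a missing value, so errors='coerce' maps it to NaN
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# Renumber the rows from 0, as in the expected output
df = df.reset_index(drop=True)
print(df.dtypes)
```

A column that contains a coerced NaN ends up as a float column, so if ':' has to be preserved, leaving that one column as text may be the simpler choice.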

Alternatively, the same split without pop, dropping the original column afterwards:

df = df_current.dropna(subset=['Country']).copy()
cols = ['Personal and job characteristics','Education Occupation Job Employment',
        'experience contract','Employment contract']
df[cols] = df['Explained Part'].str.split(expand=True)
df = df.drop('Explained Part', axis=1)
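
The code above types out the four column names by hand. In the sample data those names are exactly the 'Explained Part' values of the rows without a country, so they could also be read from the frame itself. The following is only a sketch under that assumption: every row with a missing Country holds one complete column name, and the names appear in column order. That holds for the sample, but it is not guaranteed for other tables extracted from a PDF. The variable names is_header and parts are illustrative.

```python
import numpy as np
import pandas as pd

df_current = pd.DataFrame({
    'Country': [np.nan, np.nan, np.nan, np.nan, 'Denmark', 'Sweden', 'Germany'],
    'Explained Part': ['Personal and job characteristics',
                       'Education Occupation Job Employment',
                       'experience contract', 'Employment contract',
                       '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})

# Assumption: rows without a country are header rows holding one name each
is_header = df_current['Country'].isna()
cols = df_current.loc[is_header, 'Explained Part'].tolist()

# Split the data rows on whitespace and label the pieces with those names
df = df_current.loc[~is_header].copy()
parts = df.pop('Explained Part').str.split(expand=True)
parts.columns = cols  # raises ValueError if the number of pieces differs

# Join on the shared index, then renumber the rows from 0
df = df.join(parts).reset_index(drop=True)
print(df)
```

The values are still strings at this point, exactly as in the answer's output.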

Regarding "python - Splitting one column into multiple columns / cleaning a dataset", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60074155/
