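I am trying to create a custom DataFrame that summarizes all the missing (NaN) values in my data.
The solution I came up with works, but it is slow and inefficient on a dataset of 300 rows and 50 columns.
pandas version: 0.24.2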
import numpy as np
import pandas as pd

data = {
    'city_code': ['Sydney2017', 'London2017', 'Sydney2018', 'London2018'],
    'population_mil': [5.441, 7.375, np.nan, np.nan],
}

class NaNData:
    def __init__(self, data: dict):
        self.data: dict = data

    @property
    def data_df(self) -> pd.DataFrame:
        """ Returns input data as a DataFrame. """
        return pd.DataFrame(self.data)

    def select_city(self, city_code: str) -> pd.DataFrame:
        """ Creates DataFrame where city_code column value matches
        requested city_code string. """
        df = self.data_df
        return df.loc[df['city_code'] == city_code]

    @property
    def df(self) -> pd.DataFrame:
        """ Creates custom summary DataFrame to represent missing data. """
        data_df = self.data_df
        # There are duplicates in 'city_code' column. Make sure your cities
        # are unique values only.
        all_cities = list(set(data_df['city_code']))
        # Check whether given city has any NaN values in any column.
        has_nan = [
            self.select_city(i).isnull().values.any() for i in all_cities
        ]
        data = {
            'cities' : all_cities,
            'has_NaN': has_nan,
        }
        df = pd.DataFrame(data)
        return df

nan_data = NaNData(data)
print(nan_data.df)
# Output:
#        cities  has_NaN
# 0  London2018     True
# 1  London2017    False
# 2  Sydney2018     True
# 3  Sydney2017    False
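I feel like I am not handling iteration in pandas the right way. Is there a proper (or idiomatic) solution for this kind of problem? Should I be using groupby for this somehow?
Any input is much appreciated. Thank you for your time.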
Best answer
You don't need to iterate over multiple DataFrames to get this result; you can indeed use groupby together with apply:
import numpy as np
import pandas as pd

data = {
    'city_code': ['Sydney2017', 'London2017', 'Sydney2018', 'London2018'],
    'population_mil': [5.441, 7.375, np.nan, np.nan],
    'temp': [28, np.nan, 24, 25],
}
df = pd.DataFrame(data)

# Per city: does each column contain a NaN? Then collapse the columns
# so every city gets a single True/False flag.
df.groupby('city_code').apply(lambda x: x.isna().any()).any(axis=1)
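
If apply itself turns out to be a bottleneck on larger data, one possible alternative (a sketch, not part of the original answer) is to flag the rows first and only then group, so that no Python lambda runs per group. The names row_has_nan, has_nan and summary are made up for this illustration, and the sample data uses np.nan instead of pd.np.nan because the pd.np alias was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

data = {
    'city_code': ['Sydney2017', 'London2017', 'Sydney2018', 'London2018'],
    'population_mil': [5.441, 7.375, np.nan, np.nan],
    'temp': [28, np.nan, 24, 25],
}
df = pd.DataFrame(data)

# Flag every row that has a NaN in any column other than the key column.
row_has_nan = df.drop(columns='city_code').isna().any(axis=1)

# Reduce the row flags per city: a city is True if any of its rows is True.
has_nan = row_has_nan.groupby(df['city_code']).any()

# Reshape into the 'cities' / 'has_NaN' layout used in the question.
summary = has_nan.rename('has_NaN').rename_axis('cities').reset_index()
print(summary)
```

With the sample data above, only Sydney2017 has no missing values, so it is the only city flagged False. Because groupby sorts its keys by default, the cities come back in alphabetical order rather than in the arbitrary order produced by set().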
Source: "python - A more efficient way to iterate over multiple DataFrames", a similar question found on Stack Overflow: https://stackoverflow.com/questions/57883198/