python - How to replace null values with the mode of a non-numeric column?

Tags: python dataframe apache-spark-sql pyspark

My DataFrame has null values in the Continent_Name column, and I want to replace them with the mode (the most frequent value) of that same column.

+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
|     Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
|      Afghanistan|                      0|                        0|                      0|                            0.0|            AS|
|          Albania|                     89|                      132|                     54|                            4.9|            EU|
|          Algeria|                     25|                        0|                     14|                            0.7|            AF|
|          Andorra|                    245|                      138|                    312|                           12.4|            EU|
|           Angola|                    217|                       57|                     45|                            5.9|            AF|
|Antigua & Barbuda|                    102|                      128|                     45|                            4.9|          null|
|        Argentina|                    193|                       25|                    221|                            8.3|            SA|
|          Armenia|                     21|                      179|                     11|                            3.8|            EU|
|        Australia|                    261|                       72|                    212|                           10.4|            OC|
|          Austria|                    279|                       75|                    191|                            9.7|            EU|
|       Azerbaijan|                     21|                       46|                      5|                            1.3|            EU|
|          Bahamas|                    122|                      176|                     51|                            6.3|          null|
|          Bahrain|                     42|                       63|                      7|                            2.0|            AS|
|       Bangladesh|                      0|                        0|                      0|                            0.0|            AS|
|         Barbados|                    143|                      173|                     36|                            6.3|          null|
|          Belarus|                    142|                      373|                     42|                           14.4|            EU|
|          Belgium|                    295|                       84|                    212|                           10.5|            EU|
|           Belize|                    263|                      114|                      8|                            6.8|          null|
|            Benin|                     34|                        4|                     13|                            1.1|            AF|
|           Bhutan|                     23|                        0|                      0|                            0.4|            AS|
+-----------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+

only showing top 20 rows

I tried the following:

for column in df_copy['Continent_Name']:
    df_copy['Continent_Name'].fillna(df_copy['Continent_Name'].mode()[0], inplace=True)

The error I get:

TypeError: Column is not iterable

Best answer

First, create the DataFrame below.

df = spark.createDataFrame([('Afghanistan',0,0,0,0.0,'AS'),('Albania',89,132,54,4.9,'EU'),
                            ('Algeria',25,0,14,0.7,'AF'),('Andorra',245,138,312,12.4,'EU'),
                            ('Angola',217,57,45,5.9,'AF'),('Antigua&Barbuda',102,128,45,4.9,None),
                            ('Argentina',193,25,221,8.3,'SA'),('Armenia',21,179,11,3.8,'EU'),
                            ('Australia',261,72,212,10.4,'OC'),('Austria',279,75,191,9.7,'EU'),
                            ('Azerbaijan',21,46,5,1.3,'EU'),('Bahamas',122,176,51,6.3,None),
                            ('Bahrain',42,63,7,2.0,'AS'),('Bangladesh',0,0,0,0.0,'AS'),
                            ('Barbados',143,173,36,6.3,None),('Belarus',142,373,42,14.4,'EU'),
                            ('Belgium',295,84,212,10.5,'EU'),('Belize',263,114,8,6.8,None),
                            ('Benin',34,4,13,1.1,'AF'),('Bhutan',23,0,0,0.4,'AS')],
                            ['Country_Name','Number_of_Beer_Servings','Number_of_Spirit_Servings',
                             'Number_of_Wine_servings','Pure_alcohol_Consumption_litres',
                             'Continent_Name'])

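Since we want the mode, we need the most frequently occurring Continent_Name value. Start by keeping only the rows where Continent_Name is not null.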

from pyspark.sql.functions import col
df1 = df.where(col('Continent_Name').isNotNull())

Register the DataFrame as a view, then run a SQL query on it that groups by Continent_Name and counts the rows in each group, sorted by count in descending order.

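# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df1.createOrReplaceTempView('table')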
df2=spark.sql(
    'SELECT Continent_Name, COUNT(Continent_Name) AS count FROM table GROUP BY Continent_Name ORDER BY count desc'
)
df2.show()
+--------------+-----+
|Continent_Name|count|
+--------------+-----+
|            EU|    7|
|            AS|    4|
|            AF|    3|
|            SA|    1|
|            OC|    1|
+--------------+-----+

Finally, take the first row of the result, which is the most frequent group.

mode_value = df2.first()['Continent_Name']
print(mode_value)
     EU

Once we have mode_value, simply pass it to the .fillna() function to fill the nulls.

df = df.fillna({'Continent_Name':mode_value})
df.show()
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
|   Country_Name|Number_of_Beer_Servings|Number_of_Spirit_Servings|Number_of_Wine_servings|Pure_alcohol_Consumption_litres|Continent_Name|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+
|    Afghanistan|                      0|                        0|                      0|                            0.0|            AS|
|        Albania|                     89|                      132|                     54|                            4.9|            EU|
|        Algeria|                     25|                        0|                     14|                            0.7|            AF|
|        Andorra|                    245|                      138|                    312|                           12.4|            EU|
|         Angola|                    217|                       57|                     45|                            5.9|            AF|
|Antigua&Barbuda|                    102|                      128|                     45|                            4.9|            EU|
|      Argentina|                    193|                       25|                    221|                            8.3|            SA|
|        Armenia|                     21|                      179|                     11|                            3.8|            EU|
|      Australia|                    261|                       72|                    212|                           10.4|            OC|
|        Austria|                    279|                       75|                    191|                            9.7|            EU|
|     Azerbaijan|                     21|                       46|                      5|                            1.3|            EU|
|        Bahamas|                    122|                      176|                     51|                            6.3|            EU|
|        Bahrain|                     42|                       63|                      7|                            2.0|            AS|
|     Bangladesh|                      0|                        0|                      0|                            0.0|            AS|
|       Barbados|                    143|                      173|                     36|                            6.3|            EU|
|        Belarus|                    142|                      373|                     42|                           14.4|            EU|
|        Belgium|                    295|                       84|                    212|                           10.5|            EU|
|         Belize|                    263|                      114|                      8|                            6.8|            EU|
|          Benin|                     34|                        4|                     13|                            1.1|            AF|
|         Bhutan|                     23|                        0|                      0|                            0.4|            AS|
+---------------+-----------------------+-------------------------+-----------------------+-------------------------------+--------------+

This answer to "python - How to replace null values with the mode of a non-numeric column?" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/57969059/
