In pandas, you can do the following:
import pandas as pd

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}
x = pd.Series(['a', 'b', 'a', 'c']).map(mapping)
and get something like:
pd.Series([
    'The letter A',
    'The letter B',
    'The letter A',
    'The third letter'
])
Naively, I could achieve this on a PySpark DataFrame with something like:
import pyspark.sql.functions as F
import pyspark.sql.types as T

def _map_values_str(value, mapping, default=None):
    """ Apply a mapping, assuming the result is a string """
    return mapping.get(value, default)

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}
# A UDF only accepts columns as arguments, so the dict is bound through a closure
map_values_str = F.udf(lambda value: _map_values_str(value, mapping), T.StringType())
data = spark.createDataFrame([('a',), ('b',), ('a',), ('c',)], schema=['letters'])
data = data.withColumn('letters_mapped', map_values_str(F.col('letters')))
In my experience, though, UDFs like this tend to be slow on large datasets. Is there a more efficient way?
Best answer
I think in this case you can convert the dict to a DataFrame and simply use a join:
import pyspark.sql.functions as F

mapping = {
    'a': 'The letter A',
    'b': 'The letter B',
    'c': 'The third letter'
}
# Convert the dict to a Spark DataFrame of (key, value) rows
mapping_df = spark.sparkContext.parallelize(list(mapping.items())).toDF(['letters', 'val'])
data = spark.createDataFrame([('a',), ('b',), ('a',), ('c',)], schema=['letters'])
# A left join keeps every row of data; letters without a mapping get null
data = data.join(mapping_df.withColumnRenamed('val', 'letters_mapped'), 'letters', 'left')
data.show()
Output:
+-------+----------------+
|letters|  letters_mapped|
+-------+----------------+
|      c|The third letter|
|      b|    The letter B|
|      a|    The letter A|
|      a|    The letter A|
+-------+----------------+
Hope this helps!
Regarding python - mapping values in a (Py)Spark DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51641658/