python - How to modify the join columns in a Spark DataFrame join when the join keys are given as a list?

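I have been trying to join two DataFrames using join keys that are passed as a list, and I want the join to fall back to a subset of those keys when one of the key values is null.

The two DataFrames I am joining are df_1 and df_2.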

data1 = [[1, '2018-07-31', 215, 'a'],
         [2, '2018-07-30', None, 'b'],
         [3, '2017-10-28', 201, 'c']]
df_1 = sqlCtx.createDataFrame(
    data1, ['application_number', 'application_dt', 'account_id', 'var1'])

and

data2 = [[1, '2018-07-31', 215, 'aaa'],
         [2, '2018-07-30', None, 'bbb'],
         [3, '2017-10-28', 201, 'ccc']]
df_2 = sqlCtx.createDataFrame(
    data2, ['application_number', 'application_dt', 'account_id', 'var2'])

The code I use for the join is:

key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')

The output of this is:

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b|null|
+------------------+--------------+----------+----+----+

My concern is that when account_id is null, the join should still work by comparing the other two keys.

The desired output looks like this:

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b| bbb|
+------------------+--------------+----------+----+----+

I found a way to get a similar result with the following statements:

import pyspark.sql.functions as F

# Each "|"-separated piece is evaluated into a Column expression.
join_elem = ("df_1.application_number == df_2.application_number|"
             "df_1.application_dt == df_2.application_dt|"
             "F.coalesce(df_1.account_id, F.lit(0)) == "
             "F.coalesce(df_2.account_id, F.lit(0))").split("|")
join_elem_column = [eval(x) for x in join_elem]

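However, design constraints do not allow me to use full join expressions, and I have to stick with a list of column names as the join keys.

I have been trying to find a way to fit this coalesce logic into the list itself, but so far without success.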

Best answer

I would call this solution a workaround.

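The problem here is that one of the keys in the DataFrames has a Null value, and the OP wants the join to use the remaining key columns in that case. Why not assign an arbitrary value to this Null and then apply the join? For rows where the key is Null in both DataFrames, this has the same effect as joining on the remaining two keys.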
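To see why the original join leaves var2 empty, and why the placeholder trick helps, here is a minimal pure-Python sketch. It only mimics SQL comparison rules and does not use Spark; the helper names sql_eq, null_safe_eq and fill_nulls are made up for illustration, and the sentinel -100000 is the same arbitrary value used in this answer.

```python
SENTINEL = -100000  # arbitrary placeholder, assumed never to occur in real data


def sql_eq(a, b):
    """Mimic SQL '=': a comparison involving NULL (None) yields NULL, never True."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Mimic SQL '<=>': two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def fill_nulls(key):
    """Replace every None in a key tuple with the sentinel value."""
    return tuple(SENTINEL if v is None else v for v in key)


# The key of the problematic row, as it appears in both DataFrames.
left_key = (2, "2018-07-30", None)
right_key = (2, "2018-07-30", None)

# A join keeps a pair of rows only if every key comparison is True.
plain = all(sql_eq(a, b) is True for a, b in zip(left_key, right_key))
patched = all(
    sql_eq(a, b) is True
    for a, b in zip(fill_nulls(left_key), fill_nulls(right_key))
)
safe = all(null_safe_eq(a, b) for a, b in zip(left_key, right_key))

print(plain, patched, safe)  # False True True
```

The plain comparison fails because NULL = NULL is not true in SQL, so the row finds no partner. Filling both sides with the same placeholder makes the ordinary comparison succeed.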

from pyspark.sql.functions import col, when

# Replace Null with an arbitrary sentinel value that is very unlikely
# to occur in the dataset, e.g. -100000.
df_1 = df_1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df_2 = df_2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))

# Do a FULL join on the list of key columns.
df = df_1.join(df_2, ['application_number', 'application_dt', 'account_id'], 'full')

# Replace the sentinel value with Null again.
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 2|    2018-07-30|      null|   b| bbb|
|                 3|    2017-10-28|       201|   c| ccc|
+------------------+--------------+----------+----+----+

Regarding "python - How to modify the join columns in a Spark DataFrame join when the join keys are given as a list?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54655518/
