I have the following DataFrame:
df = sc.parallelize([(1, 2, 3, '2','1','1'), (4, 5, 6, '3','2','1')]).toDF(['ID1', 'ID2', 'ID3','Impressions','Clicks','ImpressionsMinusClicks'])
df.show()
I want to convert it into this (but I don't know how, or whether split() and explode() are the right tools for it):
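The key here is essentially to replicate each instance so that it matches its number of impressions (for example, an instance with 10 impressions becomes 10 rows), then label #Clicks of those rows as positive examples and the remaining #Impressions - #Clicks rows as negative examples. To summarize: an instance has 10 impressions and 3 clicks. I want to turn it into 10 rows, 3 positive samples ("1" for clicked) and 7 negative samples ("0" for impressed but not clicked). The goal is to use this as input to a classification model such as Naive Bayes or logistic regression. The data comes from the Kaggle KDD Cup 2012 dataset.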
Best answer
You can indeed use explode on the result of a UDF to generate a series of "events": 1 for a click event, 0 for an impression that was not clicked:
import org.apache.spark.sql.functions.{abs, explode, lit, udf}
import spark.implicits._ // assumes a SparkSession named `spark`; enables the $"col" syntax

// We create a UDF which expects two columns (imps and clicks) as input,
// and returns an array of "is clicked" (0 or 1) integers
val toClickedEvents = udf[Array[Int], Int, Int] {
  case (imps, clicks) =>
    // First, we map the number of imps (e.g. 3) into a sequence
    // of "imps" indices starting from zero; each one will later
    // represent a single impression "event"
    val impsIndices = 0 until imps
    // We map each impression "event", represented by its index,
    // into a 1 or a 0, depending on whether that event had a matching click;
    // we do that by assigning 1 to indices lower than the number of clicks
    // and 0 to the rest
    val clickIndicatorPerImp = impsIndices.map(index => if (clicks > index) 1 else 0)
    // Finally we convert into an array, to comply with the UDF signature
    clickIndicatorPerImp.toArray
}

// Explode the result of the UDF and calculate ImpressedNotClicked.
// The sample columns hold strings, so they are cast to int to match the UDF signature.
df.withColumn("Clicked", explode(toClickedEvents($"Impressions".cast("int"), $"Clicks".cast("int"))))
  .select($"ID1", $"ID2", $"ID3", $"Clicked", abs($"Clicked" - lit(1)) as "ImpressedNotClicked")
Note: the original post was tagged scala; if you can convert this to python, feel free to edit.
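As a starting point for such a conversion, here is a minimal plain-Python sketch of the same per-row logic, without Spark. The function names `to_clicked_events` and `expand_row` are hypothetical, and the input rows simply mirror the sample DataFrame from the question; it only illustrates what the UDF plus explode step is meant to produce.

```python
from typing import Dict, Iterator, List


def to_clicked_events(imps: int, clicks: int) -> List[int]:
    """Return one label per impression: 1 for the first `clicks` indices, 0 for the rest."""
    return [1 if index < clicks else 0 for index in range(imps)]


def expand_row(row: Dict[str, object]) -> Iterator[Dict[str, object]]:
    """Yield one output row per impression, mirroring explode() on the UDF result."""
    # The sample DataFrame stores the counts as strings, so convert explicitly.
    imps, clicks = int(row["Impressions"]), int(row["Clicks"])
    for clicked in to_clicked_events(imps, clicks):
        yield {
            "ID1": row["ID1"],
            "ID2": row["ID2"],
            "ID3": row["ID3"],
            "Clicked": clicked,
            "ImpressedNotClicked": 1 - clicked,
        }


# Hypothetical input mirroring the two rows of the question's DataFrame.
rows = [
    {"ID1": 1, "ID2": 2, "ID3": 3, "Impressions": "2", "Clicks": "1"},
    {"ID1": 4, "ID2": 5, "ID3": 6, "Impressions": "3", "Clicks": "2"},
]

expanded = [out for row in rows for out in expand_row(row)]
for out in expanded:
    print(out)
```

In PySpark, a function like `to_clicked_events` could presumably be wrapped with `pyspark.sql.functions.udf` using an `ArrayType(IntegerType())` return type and passed to `explode`, in the same way as the Scala version; that wiring is not shown or tested here.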
Source: a similar question on Stack Overflow, "python - Spark : How to split an instance into Postivie/Negative Samples according to two columns": https://stackoverflow.com/questions/43685580/