python - Spark: How to split an instance into Positive/Negative Samples according to two columns

I have the following DataFrame:

df = sc.parallelize([(1, 2, 3, '2','1','1'), (4, 5, 6, '3','2','1')]).toDF(['ID1', 'ID2', 'ID3','Impressions','Clicks','ImpressionsMinusClicks'])
df.show()

I want to convert it into this (but I don't know how, or whether split() and explode() are the right tools to achieve it):

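The key is to replicate each instance so that it matches its number of impressions (for example, an instance with 10 impressions becomes 10 rows), then label as many of those rows as there are clicks as positive examples, and the remaining Impressions minus Clicks rows as negative examples. To summarize: an instance has 10 impressions and 3 clicks. I want to turn it into 10 rows: 3 positive samples ("1" for clicked) and 7 negative samples ("0" for impressed but not clicked). The goal is to use the result as input to a classification model such as Naive Bayes or logistic regression. The data comes from the Kaggle KDD Cup 2012 dataset.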
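To make the target concrete, here is a plain-Python sketch (no Spark involved) of the expansion described above, applied to the two sample rows. The helper name `expand_instance` is made up for illustration, and the `int()` casts assume the counts are stored as strings, as they are in the sample DataFrame.

```python
def expand_instance(ids, impressions, clicks):
    """Replicate one aggregated instance into one labeled row per impression."""
    # Assumption: counts arrive as strings (as in the sample data), so cast them.
    impressions, clicks = int(impressions), int(clicks)
    # The first `clicks` rows are positives (1); the remaining rows are negatives (0).
    return [ids + (1 if i < clicks else 0,) for i in range(impressions)]


rows = [(1, 2, 3, '2', '1', '1'), (4, 5, 6, '3', '2', '1')]
expanded = [row for r in rows for row in expand_instance(r[:3], r[3], r[4])]
for row in expanded:
    print(row)
# (1, 2, 3, 1)
# (1, 2, 3, 0)
# (4, 5, 6, 1)
# (4, 5, 6, 1)
# (4, 5, 6, 0)
```

Each output tuple is (ID1, ID2, ID3, Clicked). Which of the replicated rows receive the label 1 is arbitrary; only the counts matter for training.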

Best answer

You can indeed use explode on the result of a UDF to generate a sequence of "events": 1 for a click event, 0 for an impression that was not clicked:

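// Outside spark-shell, the SQL functions used below must be imported; the $"..."
// column syntax additionally needs the session's implicits (import spark.implicits._)
import org.apache.spark.sql.functions.{abs, explode, lit, udf}

// We create a UDF which expects two columns (imps and clicks) as input,
// and returns an array of "is clicked" (0 or 1) integers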
val toClickedEvents = udf[Array[Int], Int, Int] {
  case (imps, clicks) => {
    // First, we map the number of imps (e.g. 3) into a sequence
    // of "imps" indices starting from zero; Each one would later
    // represent a single impression "event"
    val impsIndices = (0 until imps)

    // we map each impression "event", represented by its index, 
    // into a 1 or a 0: depending if that event had a matching click;
    // we do that by assigning "1" to indices lower than the number of clicks
    // and "0" for the rest
    val clickIndicatorPerImp = impsIndices.map(index => if (clicks > index) 1 else 0)

    // finally we just convert into an array, to comply with the UDF signature
    clickIndicatorPerImp.toArray
  }
}

// explode the result of the UDF and calculate ImpressedNotClicked
df.withColumn("Clicked", explode(toClickedEvents($"Impressions", $"Clicks")))
  .select($"ID1", $"ID2", $"ID3", $"Clicked", abs($"Clicked" - lit(1)) as "ImpressedNotClicked")

Note: the original post was tagged scala; if you can convert this to python, feel free to edit.
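A possible PySpark version of the Scala answer could look like the sketch below. The function names `click_indicators` and `explode_click_events` are hypothetical, and the `int()` casts assume the counts are stored as strings, as in the sample DataFrame from the question.

```python
def click_indicators(imps, clicks):
    """Return one 0/1 label per impression: 1 for the first `clicks` entries, 0 for the rest."""
    # Assumption: counts may arrive as strings (as in the sample data), so cast them.
    imps, clicks = int(imps), int(clicks)
    return [1 if index < clicks else 0 for index in range(imps)]


def explode_click_events(df):
    """Turn each aggregated row of `df` into one row per impression."""
    # Imported here so the pure-Python helper above stays usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Wrap the helper as a UDF that returns an array of integers.
    to_clicked_events = F.udf(click_indicators, ArrayType(IntegerType()))
    return (
        df.withColumn("Clicked", F.explode(to_clicked_events("Impressions", "Clicks")))
          .select("ID1", "ID2", "ID3", "Clicked",
                  F.abs(F.col("Clicked") - F.lit(1)).alias("ImpressedNotClicked"))
    )
```

Calling `explode_click_events(df).show()` on the DataFrame from the question should yield one row per impression. Note that `explode` drops rows whose array is empty, so instances with zero impressions would disappear from the result.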

Regarding "python - Spark: How to split an instance into Positive/Negative Samples according to two columns", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43685580/