json - Spark.RDD take(n) 返回元素为 n 的数组，n 次

我正在使用 https://github.com/alexholmes/json-mapreduce 中的代码将多行 json 文件读入 RDD。

var data = sc.newAPIHadoopFile( 
    filepath, 
    classOf[MultiLineJsonInputFormat], 
    classOf[LongWritable], 
    classOf[Text], 
    conf)

我打印了前 n 个元素来检查它是否正常工作。

data.take(n).foreach { p => 
  val (line, json) = p
  println
  println(new JSONObject(json.toString).toString(4))
}

但是，当我尝试查看数据时，从 take 返回的数组似乎不正确。

而不是返回以下形式的数组

[ data[0], data[1], ... data[n] ]

其形式为

[ data[n], data[n], ... data[n] ]

这是我创建的 RDD 的问题，还是我尝试打印它的方式的问题？

最佳答案

我明白了为什么take它返回一个具有重复值的数组。

正如 API 中提到的:

Note: Because Hadoop's RecordReader class re-uses the same Writable object 
for each record, directly caching the returned RDD will create many 
references to the same object. If you plan to directly cache Hadoop 
writable objects, you should first copy them using a map function.

因此，在我的例子中，它重复使用相同的 LongWritable 和 Text 对象。例如，如果我这样做:

val foo = data.take(5)
foo.map( r => System.identityHashCode(r._1) )

输出是:

Array[Int] = Array(1805824193, 1805824193, 1805824193, 1805824193, 1805824193)

因此，为了防止它这样做，我简单地将重用对象映射到它们各自的值:

val data = sc.newAPIHadoopFile(
    filepath,
    classOf[MultiLineJsonInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf ).map(p => (p._1.get, p._2.toString))

关于json - Spark.RDD take(n) 返回元素为 n 的数组，n 次，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/25310985/

json - Spark.RDD take(n) 返回元素为 n 的数组，n 次

上一篇：hadoop - 如何将 -text HDFS 命令的输出复制到另一个文件中？

下一篇：使用 Hadoop Streaming 进行 avro 转换的 python 脚本