apache-spark - How do I load binary files into an RDD in Spark?

I am trying to load SEG-Y files into Spark and turn them into an RDD for MapReduce-style operations, but I have not managed to get them into an RDD. Can anyone help?

Best answer

You can use the PySpark binaryRecords() call to convert the contents of a binary file into an RDD:

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

binaryRecords(path, recordLength)

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.

Parameters:
path – Directory to the input data files
recordLength – The length at which to split the records

You can then map() that RDD into a structure using something like struct.unpack():

https://docs.python.org/2/library/struct.html
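As a rough illustration of the two steps above, here is a minimal sketch. The record layout (a big-endian int32, a float32, and an 8-byte ASCII tag) and the names RECORD_FORMAT, parse_record, and load_records are assumptions made up for this example, not part of Spark or of any real file format; sc is assumed to be an existing SparkContext.

```python
import struct

# Hypothetical layout (an assumption for illustration only):
# big-endian int32 id, float32 value, 8-byte ASCII tag.
RECORD_FORMAT = ">if8s"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # bytes per record


def parse_record(raw):
    """Unpack one fixed-width record into an (id, value, tag) tuple."""
    rec_id, value, tag = struct.unpack(RECORD_FORMAT, raw)
    # Strip the NUL/space padding that fixed-width text fields usually carry.
    return rec_id, value, tag.rstrip(b"\x00 ").decode("ascii")


def load_records(sc, path):
    """Return an RDD of parsed tuples; sc is an existing SparkContext."""
    # binaryRecords yields one bytes object of RECORD_LENGTH per record.
    return sc.binaryRecords(path, RECORD_LENGTH).map(parse_record)
```

Because binaryRecords assumes a constant number of bytes per record, this only works as written when every record in the file has the same length.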

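We use this approach to ingest binary files with fixed-width records. There is a bit of Python code that generates the format string (the first argument to struct.unpack), but if your file layout is static, it is fairly simple to write it by hand once.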
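The answer does not show its format-string generator, so the following is only one possible sketch of the idea. The field list FIELDS and the helpers build_format and unpack_named are hypothetical names, and the layout is invented for the example.

```python
import struct

# Hypothetical field layout (an assumption): (name, struct type code)
# pairs listed in the order they appear in each record.
FIELDS = [("record_id", "i"), ("sample_count", "h"), ("amplitude", "f")]


def build_format(fields, byte_order=">"):
    """Join the per-field type codes into one struct format string."""
    return byte_order + "".join(code for _, code in fields)


def unpack_named(fields, raw, byte_order=">"):
    """Unpack one record and return a dict keyed by field name."""
    values = struct.unpack(build_format(fields, byte_order), raw)
    return dict(zip((name for name, _ in fields), values))
```

struct.calcsize(build_format(FIELDS)) then gives the value to pass as recordLength, which keeps the record length and the format string from drifting apart.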

Something similar can be done in pure Scala:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]

Regarding "apache-spark - How do I load binary files into an RDD in Spark?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32602489/