mongodb - Exception when connecting to MongoDB from Spark

Tags: mongodb exception hadoop apache-spark hadoop-streaming

When trying to use MongoDB as an input RDD, I get "java.lang.IllegalStateException: not ready" in org.bson.BasicBSONDecoder._decode:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

Configuration conf = new Configuration();
conf.set("mongo.input.uri", "mongodb://127.0.0.1:27017/test.input");

JavaPairRDD<Object, BSONObject> rdd = sc.newAPIHadoopRDD(
        conf, MongoInputFormat.class, Object.class, BSONObject.class);

System.out.println(rdd.count());

The exception I get is:

14/08/06 09:49:57 INFO rdd.NewHadoopRDD: Input split:

MongoInputSplit{URI=mongodb://127.0.0.1:27017/test.input, authURI=null, min={ "_id" : { "$oid" : "53df98d7e4b0a67992b31f8d"}}, max={ "_id" : { "$oid" : "53df98d7e4b0a67992b331b8"}}, query={ }, sort={ }, fields={ }, notimeout=false}
14/08/06 09:49:57 WARN scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException
java.lang.IllegalStateException: not ready
            at org.bson.BasicBSONDecoder._decode(BasicBSONDecoder.java:139)
            at org.bson.BasicBSONDecoder.decode(BasicBSONDecoder.java:123)
            at com.mongodb.hadoop.input.MongoInputSplit.readFields(MongoInputSplit.java:185)
            at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
            at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
            at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
            at java.lang.reflect.Method.invoke(Method.java:618)
            at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1089)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1962)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1867)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2059)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1984)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1867)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:420)
            at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
            at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1906)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1865)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1419)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:420)
            at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
            at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
            at java.lang.Thread.run(Thread.java:804)

Full program output is here.

Environment:

  • Red Hat
  • Spark 1.0.1
  • Hadoop 2.4.1
  • MongoDB 2.4.10
  • mongo-hadoop-1.3

Best Answer

I think I found the problem: mongodb-hadoop has a "static" modifier on its BSON encoder/decoder instances in core/src/main/java/com/mongodb/hadoop/input/MongoInputSplit.java. When Spark runs in multithreaded mode, all threads try to deserialize using the same encoder/decoder instances, which predictably produces bad results.
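To illustrate the race, here is a minimal, self-contained sketch. StatefulDecoder is a hypothetical stand-in (not the real org.bson.BasicBSONDecoder, which would need the Mongo driver on the classpath): it models an object with per-call internal state, which is exactly why a single static instance shared by all executor threads can throw "not ready". The ThreadLocal shows the shape of the fix, giving each thread its own decoder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DecoderSharingDemo {

    // Hypothetical stand-in for org.bson.BasicBSONDecoder: it keeps
    // internal state for the duration of a decode() call, so a single
    // instance must never be used by two threads at once.
    static class StatefulDecoder {
        private boolean busy = false;

        int decode() {
            if (busy) {
                // The failure mode seen in the stack trace above.
                throw new IllegalStateException("not ready");
            }
            busy = true;
            // ... pretend to read a BSON document here ...
            busy = false;
            return 1;
        }
    }

    // The shape of the fix: one decoder per thread, instead of one
    // static instance shared by every executor thread.
    static final ThreadLocal<StatefulDecoder> DECODER =
            ThreadLocal.withInitial(StatefulDecoder::new);

    // Run many concurrent "deserializations"; with per-thread decoders,
    // no thread ever observes another thread's in-progress state.
    static String run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            futures.add(pool.submit(() -> {
                for (int j = 0; j < 10_000; j++) {
                    DECODER.get().decode();
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // would rethrow IllegalStateException if one occurred
        }
        pool.shutdown();
        return "ok";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

With a single shared StatefulDecoder instead of the ThreadLocal, the same loop would intermittently throw IllegalStateException, matching the behavior of the static decoder in MongoInputSplit.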

My patch is on github here (pull request submitted upstream).

I can now run an 8-core multithreaded Spark->Mongo collection count() from Python!

Original question on Stack Overflow: https://stackoverflow.com/questions/25226515/
