mongodb - Pig MongoLoader exception loading data containing UUIDs

Tags: mongodb hadoop mapreduce apache-pig

I am trying to load data from a Mongo collection that contains a UUID field stored in binary form (e.g. BinData(3, "/qHWF5hGQU+w6unYcTQxWw==") ). The job fails with

org.apache.pig.backend.executionengine.ExecException: ERROR 2108: \
  Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b

I built mongo-hadoop version 1.1 (from the master branch): https://github.com/mongodb/mongo-hadoop . It works fine unless a UUID is present. My script and the error are below. Any ideas?

register '/pig/lib/mongo-java-driver-2.9.3.jar';
register '/pig/lib/mongo-hadoop-core_cdh4.3.0-1.1.0.jar';
register '/pig/lib/mongo-hadoop-pig_cdh4.3.0-1.1.0.jar';
a = LOAD 'mongodb://localhost/TestDb.SocialUser'
      USING com.mongodb.hadoop.pig.MongoLoader();
STORE a INTO 'a';

2013-07-10 15:03:35,630 [Thread-6] INFO  org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete.
2013-07-10 15:03:35,632 [Thread-6] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local402930066_0001
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2108: Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2108: \
    Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b
  at org.apache.pig.impl.util.StorageUtil.putField(StorageUtil.java:208)
  at org.apache.pig.impl.util.StorageUtil.putField(StorageUtil.java:166)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat$PigLineRecordWriter.write(PigTextOutputFormat.java:68)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat$PigLineRecordWriter.write(PigTextOutputFormat.java:44)
  at org.apache.pig.builtin.PigStorage.putNext(PigStorage.java:296)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
  at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:558)
  at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
  at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:264)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
  at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
2013-07-10 15:03:39,235 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.

Best answer

MongoLoader has a convertBSONtoPigType method that converts the types returned by the record reader into Pig-compatible types. If a value is not one of the recognized types (which include java.util.Date), the method falls through to returning the object as-is, which breaks Pig.

If you add a schema to the MongoLoader that gives the UUID field a Pig data type of chararray, e.g.

a = LOAD 'mongodb://mongoserver/db.collection'
      USING com.mongodb.hadoop.pig.MongoLoader('myguid:chararray');

the underlying Java code calls .toString() on the object (a java.util.UUID in this case) and will output a plain UUID string.
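As a quick illustration of what that conversion yields (a standalone sketch, reusing the UUID value from the error message above; the class name is made up for the example):

```java
import java.util.UUID;

public class UuidChararrayDemo {
    public static void main(String[] args) {
        // The Java driver decodes a BinData(3, ...) field into a java.util.UUID;
        // with a chararray schema, Pig receives its canonical toString() form.
        UUID guid = UUID.fromString("1423ed53-5064-0000-784b-7bf2e2dd837b");
        System.out.println(guid); // prints 1423ed53-5064-0000-784b-7bf2e2dd837b
    }
}
```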

You could also change the convertBSONtoPigType method itself to do the same thing, e.g.

public static Object convertBSONtoPigType(final Object o) throws ExecException {
    if (o == null) {
        return null;
    } else if (o instanceof Number || o instanceof String) {
        return o;
    } else if (o instanceof Date) {
        return ((Date) o).getTime();
    } else if (o instanceof ObjectId) {
        return o.toString();
    } else if (o instanceof UUID) {
        return o.toString();
    } else if (o instanceof BasicBSONList) {
        BasicBSONList bl = (BasicBSONList) o;
        Tuple t = tupleFactory.newTuple(bl.size());
        for (int i = 0; i < bl.size(); i++) {
            t.set(i, convertBSONtoPigType(bl.get(i)));
        }
        return t;
    } else if (o instanceof Map) {
        //TODO make this more efficient for lazy objects?
        Map<String, Object> fieldsMap = (Map<String, Object>) o;
        HashMap<String, Object> pigMap = new HashMap<String, Object>(fieldsMap.size());
        for (Map.Entry<String, Object> field : fieldsMap.entrySet()) {
            pigMap.put(field.getKey(), convertBSONtoPigType(field.getValue()));
        }
        return pigMap;
    } else {
        return o;
    }
}

What puzzles me is why MongoLoader doesn't support UUIDs when the schema is unknown, given that UUID/BinData is part of Mongo and widely used.

Perhaps that is something they could fix.

Anyway - hope this helps.

Regards

Regarding "mongodb - Pig MongoLoader exception loading data containing UUIDs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17579027/
