json - 将 JSON 数组加载到 Pig 中

标签 json hadoop apache-pig hdfs bigdata

我有一个格式如下的json文件

[
  {
    "id": 2,
    "createdBy": 0,
    "status": 0,
    "utcTime": "Oct 14, 2014 4:49:47 PM",
    "placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia",
    "longitude": 77.5983817,
    "latitude": 12.9832418,
    "createdDate": "Sep 16, 2014 2:59:03 PM",
    "accuracy": 5,
    "loginType": 1,
    "mobileNo": "0000005567"
  },
  {
    "id": 4,
    "createdBy": 0,
    "status": 0,
    "utcTime": "Oct 14, 2014 4:52:48 PM",
    "placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia",
    "longitude": 77.5983817,
    "latitude": 12.9832418,
    "createdDate": "Oct 8, 2014 5:24:42 PM",
    "accuracy": 5,
    "loginType": 1,
    "mobileNo": "0000005566"
  }
]

当我尝试使用 JsonLoader 类将数据加载到 pig 时,我收到了一个错误,例如意外的输入结束:OBJECT 的预期关闭标记

a = LOAD '/user/root/jsoneg/exp.json' USING JsonLoader('id:int,createdBy:int,status:int,utcTime:chararray,placeName:chararray,longitude:double,latitude:double,createdDate:chararray,accuracy:double,loginType:double,mobileNo:chararray');
b = foreach a generate $0,$1,$2;
dump b;

最佳答案

我以前也遇到过类似的问题,后来我才知道 Pig JSON 不支持多行 json 格式。它总是期望 json 输入必须在单行中。

我建议您使用 elephantbird json loader,而不是原生的 Jsonloader。它非常适合 Jsons 格式。

您可以从下面的链接下载 jar

http://www.java2s.com/Code/Jar/e/elephant.htm

我将您的输入格式更改为单行并通过 elephantbird 加载,如下所示

input.json
{"test":[{"id": 2,"createdBy": 0,"status": 0,"utcTime": "Oct 14, 2014 4:49:47 PM","placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia","longitude": 77.5983817,"latitude": 12.9832418,"createdDate": "Sep 16, 2014 2:59:03 PM","accuracy": 5,"loginType": 1,"mobileNo": "0000005567"},{"id": 4,"createdBy": 0,"status": 0,"utcTime": "Oct 14, 2014 4:52:48 PM","placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia","longitude": 77.5983817,"latitude": 12.9832418,"createdDate": "Oct 8, 2014 5:24:42 PM","accuracy": 5,"loginType": 1,"mobileNo": "0000005566"}]}

PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';

A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'test');
C = FOREACH B GENERATE FLATTEN($0) AS mymap;
D = FOREACH C GENERATE mymap#'id',mymap#'placeName',mymap#'status';
DUMP D;

Output:
(2,21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia,0)
(4,21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia,0)

关于json - 将 JSON 数组加载到 Pig 中,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/26710438/

相关文章:

android - 使用 Retrofit2 解析 API 时获取 null

hadoop - org.apache.hadoop.hbase.TableNotFoundException : SYSTEM. 目录异常与凤凰 4.5.2

c# - 显示hadoop内容C#

hadoop - 从 pig 中的分组数据生成二元组合

java - 避免在 Hadoop pig 中使用指数表示法

javascript - 如何在控制台中调试我的数据?

ios - swift - 将 JSON 数据保存到本地文件

java - 如何在jsp表中显示Json字符串

java - 对于条目太多的目录,ABFS hadoop-azure AzureBlobFileSystem.listStatus(path) 花费太多时间(不返回)

hadoop - 在 EvalFunc pig UDF 中抛出异常是跳过那一行,还是完全停止?