json - Unable to read a JSON file using Elephant Bird

Tags: json hadoop apache-pig

I am trying to load a JSON file that contains null values using the Elephant Bird JsonLoader.

sample.json

{"created_at": "Mon Aug 22 10:48:23 +0000 2016","id": 767674772662607873,"id_str": "767674772662607873","text": "KPIT Image Result for https:\/\/t.co\/Nas2ZnF1zZ... https:\/\/t.co\/9TnelwtIvm","source": "\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated": false,"in_reply_to_status_id": 123,"in_reply_to_status_id_str": null,"in_reply_to_user_id": null,"in_reply_to_user_id_str": null,"in_reply_to_screen_name": null,"geo": null,"coordinates": null,"place": null,"contributors": null,"is_quote_status": false,"retweet_count": 0,"favorite_count": 0,"entities": {"hashtags": [],"urls": [{"url": "https:\/\/t.co\/Nas2ZnF1zZ","expanded_url": "http:\/\/miltonious.com\/","display_url": "miltonious.com","indices": [24, 47]}],"user_mentions": [],"symbols": []},"favorited": false,"retweeted": false,"possibly_sensitive": false,"filter_level": "low","lang": "en","timestamp_ms": "1471862903167"}

Script:
REGISTER piggybank.jar
REGISTER json-simple-1.1.1.jar
REGISTER elephant-bird-pig-4.3.jar
REGISTER elephant-bird-core-4.1.jar
REGISTER elephant-bird-hadoop-compat-4.3.jar

json = LOAD 'sample.json' USING JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');

describe json;
dump json;

When I dump json, I get the following output and warning:

(Mon Aug 22 10:48:23 +0000 2016,767674772662607873,767674772662607873,google Image Result for Twitter Web Client,false,1234,12345,3214,43215 ,,,,,,,,,,,,)

WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.JsonLoader(UDF_WARNING_1): Bad record, returning null for {complete json}

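From the warning, I guess the loader is failing on the NULL values.
So how can we load JSON that contains null values?

I also tried it another way, with the fully qualified Elephant Bird loader: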
json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');

describe json;

Output:
Schema for json unknown.

Please advise.

Thanks.

Best answer

You can try something like this:

 MY_JSON = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
 dump MY_JSON;
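
To get at individual fields after a nested load, one possible follow-up is sketched below. It assumes the Elephant Bird JsonLoader returns each input line as a single map, so that values can be read with Pig's `#` map-lookup operator. The jar names are copied from the question, and the alias names `tweets` and `fields` are made up for illustration. JSON nulls such as `geo` are expected to come through as Pig nulls, but this has not been verified against the sample file.

```pig
-- Jar names are taken from the question; adjust paths and versions to your setup.
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER elephant-bird-core-4.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;

-- Each input line becomes one tuple holding a single map.
-- The loader argument is an option string, not a schema string.
tweets = LOAD 'sample.json'
         USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
         AS (json:map[]);

-- Look up the keys of interest and cast them to the types you need.
fields = FOREACH tweets GENERATE
    (chararray) json#'created_at' AS created_at,
    (chararray) json#'id_str'     AS id_str,
    (chararray) json#'text'       AS text,
    (chararray) json#'geo'        AS geo,
    json#'entities'               AS entities;

DESCRIBE fields;
DUMP fields;
```

Declaring the schema in the `AS` clause rather than in the loader argument should also let `DESCRIBE` report a schema instead of "Schema for json unknown".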

Regarding "json - Unable to read a JSON file using Elephant Bird", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39369535/
