json - Twitter JSON data in Hadoop

Tags: json hadoop twitter hive

I have streamed Twitter data into HDFS. Here is my Twitter agent configuration:

#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1

#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata

#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100

#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text

The Twitter data streams successfully, but every FlumeData file in HDFS looks like this:

SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable�	���^�kd��h?�tN ���h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\"http://tweetlogix.com\" rel=\"nofollow\">Tweetlogix<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"Morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"See, I was actually on my way to get a piece of burger from Burger King.....","created_at":"Sat Apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"NilesDontCurrr","id_str":"290266873","profile_link_color":"FF0000","id":290266873,"geo_enabled":false,"profile_background_color":"FFFFFF","lang":"en","profile_sidebar_border_color":"FFFFFF","profile_text_color":"34AA7A","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","time_zone":"Eastern Time (US & 
Canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"New York City.","profile_sidebar_fill_color":"AFDFB7","notifications":null}}

When I parse this JSON data in Hive, I get an error like this:

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@5fdcaa40; line: 1, column: 2]

I think the error occurs because every FlumeData file starts with this: SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable�����^��kd��h?��tN ����h Am I right?

Shouldn't the Twitter JSON data start like this instead: {"in_reply_to_status_id_str":......} ?

Best answer

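Flume is writing the files in a binary format (a Hadoop SequenceFile) rather than as text. This happens because some properties in your configuration file are set incorrectly: they are missing the "hdfs." prefix, so the HDFS sink ignores them and falls back to its default file type, SequenceFile. The incorrectly set properties include the following two.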

Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
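
Keys like these can be spotted mechanically before the agent is started. What follows is only a minimal sketch in Python, assuming the agent configuration is a plain key=value properties file. The helper name `find_unprefixed_hdfs_keys` and the set of property names it checks are illustrative choices of mine, not part of Flume, and the set is not a complete list of HDFS sink properties.

```python
import re

# HDFS sink settings that Flume expects under the "hdfs." prefix.
# Illustrative subset only (an assumption), not the sink's full property list.
HDFS_SINK_KEYS = {"path", "fileType", "writeFormat", "rollSize",
                  "rollCount", "rollInterval", "batchSize"}


def find_unprefixed_hdfs_keys(config_text):
    """Return (line_number, key) pairs for HDFS sink keys lacking 'hdfs.'."""
    hdfs_sinks = set()
    entries = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        entries.append((lineno, key))
        # Remember which sinks are declared with type=hdfs.
        m = re.fullmatch(r"([^.]+)\.sinks\.([^.]+)\.type", key)
        if m and value == "hdfs":
            hdfs_sinks.add((m.group(1), m.group(2)))

    problems = []
    for lineno, key in entries:
        # Matches <agent>.sinks.<sink>.<name> with no further dots,
        # so correctly prefixed keys such as ...hdfs.path are skipped.
        m = re.fullmatch(r"([^.]+)\.sinks\.([^.]+)\.([^.]+)", key)
        if (m and (m.group(1), m.group(2)) in hdfs_sinks
                and m.group(3) in HDFS_SINK_KEYS):
            problems.append((lineno, key))
    return problems


sample = """\
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
"""

for lineno, key in find_unprefixed_hdfs_keys(sample):
    print(f"line {lineno}: {key} is missing the 'hdfs.' prefix")
```

Run against the sink section of the question's configuration, this would also flag rollSize, rollCount and batchSize, which are HDFS sink settings that take the same prefix.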

The correct way to set these properties is as follows.

Twitter-agent.sinks.sink1.hdfs.fileType=DataStream
Twitter-agent.sinks.sink1.hdfs.writeFormat=Text
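
Whether a given FlumeData file was written as a SequenceFile or as plain text can be checked from its first bytes: a Hadoop SequenceFile starts with the three bytes SEQ followed by a version byte, while a text file of tweets should start with `{`. The sketch below is a rough check under the assumption that the file has first been copied to the local filesystem (for example with `hdfs dfs -get`). The function name `classify_flume_file` and the temporary file names are hypothetical.

```python
import os
import tempfile


def classify_flume_file(path):
    """Guess the on-disk format of a locally copied FlumeData file."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(b"SEQ"):
        # SequenceFile magic: 'SEQ' plus a one-byte version number.
        return "sequencefile"
    if head.lstrip().startswith(b"{"):
        # One JSON object per line, as Hive's JSON SerDe expects.
        return "json-text"
    return "unknown"


# Demonstration with two small stand-in files (not real Flume output).
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "FlumeData.old")
    new_path = os.path.join(tmp, "FlumeData.new")
    with open(old_path, "wb") as f:
        f.write(b"SEQ\x06!org.apache.hadoop.io.LongWritable")
    with open(new_path, "wb") as f:
        f.write(b'{"text":"Morning.","lang":"en"}\n')
    old_kind = classify_flume_file(old_path)
    new_kind = classify_flume_file(new_path)

print(old_kind)
print(new_kind)
```

Files written before the configuration change remain SequenceFiles, so Hive would still fail on them unless they are moved out of the table location or converted.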

Source: this question and answer about Twitter JSON data in Hadoop come from a similar question on Stack Overflow: https://stackoverflow.com/questions/31073098/
