json - Hadoop 中的 Twitter json 数据

我已经将 Twitter 数据流式传输到 HDFS。这是我的 Twitter 代理配置:

#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1

#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata

#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100

#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text

Twitter 数据流式传输成功。但是HDFS中的每一个FlumeData文件都是这样的:

SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable�	���^�kd��h?�tN ���h{"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":null,"source":"<a href=\"http://tweetlogix.com\" rel=\"nofollow\">Tweetlogix<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":null,"id_str":"613363262709723139","in_reply_to_user_id":null,"favorite_count":0,"id":613363262709723139,"text":"Morning.","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172225","entities":{"urls":[],"hashtags":[],"user_mentions":[],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":-14400,"friends_count":195,"profile_image_url_https":"https://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","listed_count":16,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","default_profile_image":false,"favourites_count":891,"description":"See, I was actually on my way to get a piece of burger from Burger King.....","created_at":"Sat Apr 30 00:51:06 +0000 2011","is_translator":false,"profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000045222063/847094549362b20f2b1e3c1ff137a80f.png","protected":false,"screen_name":"NilesDontCurrr","id_str":"290266873","profile_link_color":"FF0000","id":290266873,"geo_enabled":false,"profile_background_color":"FFFFFF","lang":"en","profile_sidebar_border_color":"FFFFFF","profile_text_color":"34AA7A","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/613121771093532673/mA5NPv6X_normal.jpg","time_zone":"Eastern Time (US & Canada)","url":null,"contributors_enabled":false,"profile_background_tile":true,"profile_banner_url":"https://pbs.twimg.com/profile_banners/290266873/1432844093","statuses_count":68154,"follow_request_sent":null,"followers_count":4611,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"niles.","location":"New York City.","profile_sidebar_fill_color":"AFDFB7","notifications":null}}

当我在 Hive 中解析这个 json 数据时，我收到类似

的错误

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@5fdcaa40; line: 1, column: 2]

我认为错误是因为这一行是每个 FlumeData 文件中的第一行。 SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable��^��kd��h?��tN ��h 我说得对吗？

twitter 的 json 数据不应该这样开头 {"in_reply_to_status_id_str":......} 吗？

最佳答案

Flume 以二进制格式而不是文本格式生成文件。这是因为您的配置文件中的一些属性设置不正确，包括以下两个属性。

Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text

正确的属性设置方法如下。

Twitter-agent.sinks.sink1.hdfs.fileType=DataStream
Twitter-agent.sinks.sink1.hdfs.writeFormat=Text

关于json - Hadoop 中的 Twitter json 数据，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/31073098/

json - Hadoop 中的 Twitter json 数据

上一篇：java - Spark Java scala 错误

下一篇：hadoop - 安全模式下的 Oozie 无效用户