python - Hive query from a Python program returns output like "x00e\x00"\x00"

Tags: python hadoop character-encoding hive

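I created a table in Hive and loaded data into it from an external CSV file. When I try to print the data from Python, I get output like ['\x00"\x00m\x00e\x00s\x00s\x00a\x00g\x00e\x00"\x00']. When I run the query in the Hive GUI, the results are correct. How can I get the same results from a Python program?

My Python code: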

import pyhs2

with pyhs2.connect(host='192.168.56.101',
               port=10000,
               authMechanism='PLAIN',
               user='hiveuser',
               password='password',
               database='anuvrat') as conn:
    with conn.cursor() as cur:
        cur.execute('SELECT message FROM ABC_NEWS LIMIT 5')

        print cur.fetchone()

The output is:
/usr/bin/python2.7 /home/anuvrattiku/SPRING_2017/CMPE239/Facebook_Fake_news_detection/code_fake_news/code.py
['\x00"\x00m\x00e\x00s\x00s\x00a\x00g\x00e\x00"\x00']

Process finished with exit code 0

When I query the same table in Hive, I get the correct output (shown as a screenshot in the original post).

This is how I created the table:
CREATE TABLE ABC_NEWS(
ID STRING, 
PAGE_ID INT, 
NAME STRING, 
MESSAGE STRING, 
DESCRIPTION STRING, 
CAPTION STRING, 
POST_TYPE STRING, 
STATUS_TYPE STRING, 
LIKES_COUNT SMALLINT, 
COMMENTS SMALLINT, 
SHARES_COUNT SMALLINT, 
LOVE_COUNT SMALLINT, 
WOW_COUNT SMALLINT, 
HAHA_COUNT SMALLINT, 
SAD_COUNT SMALLINT, 
THANKFUL_COUNT SMALLINT, 
ANGRY_COUNT SMALLINT, 
LINK STRING, 
IMAGE_LINK STRING, 
POSTED_AT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

The CSV file used to load the table is available at:
https://www.dropbox.com/s/fiwygyqt8u9eo5s/abc-news-86680728811.csv?dl=0

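Best answer

  • Because the text is qualified (") and the delimiter (,) appears inside the qualified text, you should use the CSV SerDe.
  • You printed cur.fetchone(), which is a list rather than a string, so Python showed the escaped representation of its elements (hence the \x00 sequences). You should have printed the first element of the list instead: cur.fetchone()[0]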

    create external table abc_news
    (
        id              string 
       ,page_id         int 
       ,name            string 
       ,message         string 
       ,description     string 
       ,caption         string 
       ,post_type       string 
       ,status_type     string 
       ,likes_count     smallint 
       ,comments        smallint 
       ,shares_count    smallint 
       ,love_count      smallint 
       ,wow_count       smallint 
       ,haha_count      smallint 
       ,sad_count       smallint 
       ,thankful_count  smallint 
       ,angry_count     smallint 
       ,link            string 
       ,image_link      string 
       ,posted_at       string
    )
    row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties 
    (
       'separatorChar' = ','
      ,'quoteChar'     = '"'
    )  
    stored as textfile
    ;
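To illustrate why a quote-aware parser matters here, the following is a minimal sketch using Python's standard csv module as a stand-in for the Hive SerDe (it is not the SerDe itself). The sample line is hypothetical and only mimics the shape of the data: a quoted field that contains the delimiter. The `delimiter` and `quotechar` arguments play roughly the same role as `separatorChar` and `quoteChar` in the table definition above.

```python
import csv
import io

# Hypothetical sample line: the second field is quoted and contains a comma.
line = '"123","Do you agree, or not?","status"\n'

# Naive splitting on the delimiter cuts the quoted field in two.
naive = line.strip().split(',')

# A CSV-aware parser honours the quote character and keeps the field whole.
parsed = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))

print(len(naive))   # number of pieces after naive splitting
print(parsed)       # fields as seen by the CSV-aware parser
```

A table declared with FIELDS TERMINATED BY "," behaves like the naive split, which is why columns can shift when a message contains a comma.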
    
    >>> import pyhs2
    >>> 
    >>> with pyhs2.connect(host='localhost',port=10000,authMechanism='PLAIN',user='cloudera',password='cloudera',database='local_db') as conn:
    ...     with conn.cursor() as cur:
    ...         cur.execute('SELECT message FROM ABC_NEWS LIMIT 10')
    ...         for i in cur.fetch():
    ...             print i[0]
    ...             
    ...             
    ... 
    "message"
    "Roberts took the unusual step of devoting the majority of  his annual  report to the issue of judicial ethics."
    "Do you agree with the new law?"
    "Some pretty cool confetti will rain down on New York City celebrators."
    NULL
    "The pharmacy was held up by a man seeking prescription medication. "
    NULL
    "There were no immediate reports of damage or injuries."
    "Were you an LCD screen early adopter? A settlement may be headed your way."
    "As Americans get bigger, passenger limits are becoming more restrictive."
    >>> 
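The difference between printing the row and printing its first element can be reproduced without a Hive connection. The sketch below is written for Python 3 (the thread itself uses Python 2.7) and uses a hypothetical row copied from the value shown in the question. The final step rests on an assumption the thread does not confirm: the interleaved \x00 characters are consistent with UTF-16 encoded text, and for ASCII-only content they can simply be dropped.

```python
# Hypothetical row, shaped like the value shown in the question.
row = ['\x00"\x00m\x00e\x00s\x00s\x00a\x00g\x00e\x00"\x00']

# Converting the list to text uses the repr of each element,
# so NUL characters show up as the escape sequence \x00.
as_list = str(row)

# The element itself is an ordinary string; printed directly, the NUL
# characters are written as-is and many terminals display nothing for them.
first = row[0]

# Assumption: the text is ASCII-only, so removing NULs recovers it.
cleaned = first.replace('\x00', '')

print(as_list)
print(cleaned)
```

If the file really is UTF-16, converting it to UTF-8 before loading it into Hive would be a more robust fix than stripping characters after the query.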
    

Source: this question ("Hive query from a Python program returns output like "x00e\x00"\x00"") comes from Stack Overflow: https://stackoverflow.com/questions/43712292/