hadoop - Apache PIG - ERROR org.apache.pig.impl.PigContext - Encountered "<OTHER> ", = "" at line 1, column 1

Tags: hadoop hive apache-pig issue-tracking hcatalog

I am trying to do some data cleaning with Apache Pig, using data from a Hive table.

I have the following statements in my Apache Pig script:

   INPUT_FILE = LOAD 'staging_area' USING org.apache.hive.hcatalog.pig.HCatLoader()
AS
          (ID:Long, 
          CHAIN:Int,
          DEPT:Int,
          CATEGORY:Int,
          COMPANY:Long,
          BRAND:Long,
          DATE:Chararray,
          QUARTER:Int,
          MONTH:Int,
          DAY:Int,
          WEEKDAY:Int,
          PRODUCT_SIZE:Int,
          PRODUCT_MEASURE:Chararray,
          PRODUCT_QUANTITY:Int,
          PURCHASE_AMOUNT:Double);

SPLIT INPUT_FILE INTO DATA IF (PRODUCT_SIZE > 0 AND PURCHASE_AMOUNT > 0 AND PRODUCT_QUANTITY > 0), MISSING_VALUES if (PRODUCT_QUANTITY <= 0 OR PURCHASE_AMOUNT <= 0);

DATA_TRANSFORMATION = FOREACH DATA GENERATE 
                                            ID,
                                            CHAIN,
                                            DEPT,
                                            CATEGORY,
                                            ToDate(DATE,'yyyy-MM-dd') as DATE_ID,
                                            QUARTER,
                                            MONTH,
                                            DAY,
                                            WEEKDAY,
                                            PRODUCT_SIZE,
                                            PURCHASE_AMOUNT;

GRP = GROUP DATA_TRANSFORMATION BY ID;

SUMMED = foreach GRP {
     amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
     cnt = COUNT(DATA_TRANSFORMATION.ID);
     generate group, Purchase_Average,Freq_Visits;
}

JOINED = join DATA_TRANSFORMATION by $0, SUMMED by $0;

DATASET = FOREACH JOINED GENERATE $0,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12;

RANKING = rank DATASET by $6,$1,$0;

DW = FOREACH RANKING GENERATE $1 as ID,$2 as Purchase_Average, $3 as Freq_Visits, $0 as Transaction_ID, $4,$5,$6,$7,$8,$9,$10,$11,$12,$13;

STORE DW INTO '/user/cloudera/data' USING PigStorage(',');

The table in Hive contains the following data (first 10 rows):
id  chain   dept    category    company brand   date_id quarter month_id    day_id  weekday productsize productmeasure  purchasequantity    purchaseamount
1940424003  46  99  9909    1081843181  25935   29-01-2013 00:00    1   1   29  2   6   OZ  2   5
1940424003  46  35  3504    103500030   13470   04-02-2013 00:00    1   2   4   1   25  OZ  2   5
1940424003  46  91  9115    108048080   1230    08-02-2013 00:00    1   2   8   5   0   LT  1   13.99
1940452798  46  7   706 101200010   17286   09-02-2013 00:00    1   2   9   6   38  OZ  1   5.75
1940452798  46  45  4517    107220575   17340   10-02-2013 00:00    1   2   10  7   16  OZ  1   45
1940452798  46  99  9909    107143070   5072    10-02-2013 00:00    1   2   10  7   12  OZ  1   1.99
1940452798  46  21  2119    1061300868  867 10-02-2013 00:00    1   2   10  7   138 OZ  1   43.8
1940452798  46  56  5616    1071373373  11473   10-02-2013 00:00    1   2   10  7   8   OZ  1   2.5
1940452798  46  7   706 107146474   2142    10-02-2013 00:00    1   2   10  7   15  OZ  1   2
1940452798  46  72  7205    103700030   4294    22-02-2013 00:00    1   2   22  5   6   OZ  1   3

Whenever I run the script, I get this error:
ERROR org.apache.pig.impl.PigContext - Encountered " <OTHER> ",= "" at line 1, column 1

Does anyone know how to fix this? My data set has 3 million records, and I am using the Cloudera QuickStart VM 5.8.

Best Answer

SUMMED = foreach GRP {
     amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
     cnt = COUNT(DATA_TRANSFORMATION.ID);
     generate group, Purchase_Average,Freq_Visits;
}

You cannot project Purchase_Average and Freq_Visits here: those aliases are never defined anywhere in the script. The nested block only defines amount and cnt, so those are the only names the GENERATE statement can reference.
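A minimal corrected version of that block (assuming, from the names, that Purchase_Average is meant to be the mean purchase amount per ID and Freq_Visits the number of rows per ID) generates the aliases that actually exist in the nested block and renames them on output:

SUMMED = FOREACH GRP {
     amount = SUM(DATA_TRANSFORMATION.PURCHASE_AMOUNT);
     cnt    = COUNT(DATA_TRANSFORMATION.ID);
     -- GENERATE may only reference group and the aliases defined above
     GENERATE group          AS ID,
              (amount / cnt) AS Purchase_Average,
              cnt            AS Freq_Visits;
}

Since the later statements (JOINED, DATASET, RANKING) refer to SUMMED positionally ($0, $1, $2), they should keep working unchanged with this fix.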

Regarding hadoop - Apache PIG - ERROR org.apache.pig.impl.PigContext - Encountered "<OTHER> ", = "" at line 1, column 1, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41231847/

Related articles:

json - Change the Derby metastore in Hive to allow structs with definitions >4000B

mysql - SQL/Hive table aliases

java - apache PIG with datafu: Cannot resolve UDF's

java - Unhandled internal error. org.apache.hadoop.mapred.jobcontrol.JobControl.addJob

json - How to read multiple JSON files under a subdirectory using Scala

hadoop - Building an application for reporting and analysis on the Hadoop framework

hadoop - HBase read/write without a reducer, exception

hadoop - Hive table column date format

hadoop - Appending values to a PIG variable

hadoop - How to get the top n counts per group in Hive