hive - Pig script fails with java.io.EOFException: Unexpected end of input stream

Tags: hive apache-pig

I have a Pig script that uses a regular expression to extract a set of fields and stores the data into a Hive table.

--Load data

cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);

--There are two types of data, filter type1 - The field dst_country seems unique there

cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');

--Parse each line and pick up the required fields

cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
 FLATTEN(
 REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
 ) AS (
 rmonth:charArray, rdate:charArray, rtime:charArray, ip:charArray, date:charArray, time:charArray,
 devname:charArray, deviceid:charArray, logid:charArray, type:charArray, subtype:charArray,
 pri:charArray, vd:charArray, src:charArray, srcport:charArray, srcint:charArray, dst:charArray,
 dstport:charArray, dstint:charArray, sn:charArray, status:charArray, policyid:charArray,
 dstcountry:charArray, srccountry:charArray, rest:charArray );

--Store to hive table 

STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();
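As a side note, the long REGEX_EXTRACT_ALL pattern above is hard to verify by eye. A trimmed-down version of the same key=value pattern can be exercised locally, here in Python on a synthetic, made-up FortiGate-style line (the field values are illustrative, not real log data):

```python
import re

# Trimmed-down version of the script's key=value pattern, applied to a
# synthetic FortiGate-style log line for local verification.
pattern = re.compile(
    r'date=(.*?)\s+time=(.*?)\sdevname=(.*?)\ssrc=(.*?)\sdst_country=(.*?)\s'
)

line = ('date=2014-06-25 time=15:31:33 devname=fw01 '
        'src=10.0.0.1 dst_country=US rest=ignored')

m = pattern.search(line)
print(m.groups())  # ('2014-06-25', '15:31:33', 'fw01', '10.0.0.1', 'US')
```

Checking the pattern this way on a handful of real lines before running the full job can rule out the regex as the cause of a failure.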

The script works fine on small files, but breaks with the following exception on a larger file (750 MB). Any idea how I can debug this and find the root cause?

2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
        at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)

Best Answer

Check the size of the text you are loading into line:chararray. If the size is greater than the HDFS block size (64 MB), you will get an error.
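Since the stack trace originates in DecompressorStream, the exception can also indicate a truncated or corrupted .gz file rather than a record-size problem. A minimal local sanity check, assuming the file has first been copied out of HDFS (the function name and chunk size are illustrative):

```python
import gzip

def check_gzip(path):
    """Read a .gz file to the end; return False if the stream is
    truncated or corrupted -- the same condition that surfaces as
    java.io.EOFException in Hadoop's DecompressorStream."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in 1 MB chunks until EOF; a truncated stream raises
            # EOFError before the gzip end-of-stream marker is reached.
            while f.read(1024 * 1024):
                pass
        return True
    except (EOFError, OSError):
        return False
```

On the cluster itself, something like `hadoop fs -cat <file> | gzip -t` performs the same integrity test without copying the file locally.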

Regarding "hive - Pig script fails with java.io.EOFException: Unexpected end of input stream", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25641844/
