hadoop - Reading files with patterns in Pig

Tags: hadoop apache-pig hadoop2

I have a scenario where I load 40 files matching different name patterns from one directory into Hive tables using HCatStorer.

Directory : opt/inputfolder/ 
Input Files Pattern :

inp1*.log,
inp2*.log,
    .....
inp39*.log,
inp40*.log.

I wrote a Pig script that reads all files matching the 40 patterns.

My problem is that all 40 files are expected, but some of them may not arrive. In that case I get an exception:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
           Input Pattern opt/ip_files/inp16*.log matches 0 files

Is there any way to handle this exception?

Even if that one file does not exist, I want to read the remaining 39 file patterns.

What if my source files are named with string prefixes (i.e. banana_2014012.log, orange_2014012.log, apple_2014012.log)?

Below is how I load the data from these files into Hive tables using HCatStorer.

*** Pseudo code ***
banana_src = LOAD 'banana_*.log' using PigStorage();
......
Store banana_src into BANANA using HCatStorer;

apple_src = LOAD 'apple_*.log' using PigStorage();
......
Store apple_src into APPLE using HCatStorer;

orange_src = LOAD 'orange_*.log' using PigStorage();
......
Store orange_src into ORANGE using HCatStorer;

If any source has no files, this Pig script throws an error saying the input pattern matched 0 files, and the whole script fails. Even if one source file is unavailable, I want my script to load the other tables without failing the job.

Thanks.

Best Answer

If you load inp1*.log, it also matches inp16*.log (if such a file is present), so why are you loading inp16*.log again separately?

Based on the above, I feel the condition below is sufficient for you:
        LOAD 'opt/ip_files/inp[1-9]*.log'
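The overlap between the patterns can be checked with a quick local shell sketch (file names here are hypothetical; Hadoop's input globs follow the same wildcard-matching rules for this case):

```shell
# Create a few hypothetical log files in a scratch directory.
tmpdir=$(mktemp -d)
touch "$tmpdir/inp1_a.log" "$tmpdir/inp16_b.log" "$tmpdir/inp2_c.log"

# The glob inp1*.log matches inp1_a.log AND inp16_b.log, so loading
# inp1*.log and inp16*.log as separate inputs would read inp16 twice.
set -- "$tmpdir"/inp1*.log
count=$#
echo "inp1*.log matched $count files"

rm -r "$tmpdir"
```

This is why a single bracketed pattern like inp[1-9]*.log covers all forty prefixes at once.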

Please let me know if you are trying something different.

UPDATE:
I have one more option, but I am not sure it works for you.
1. Split your Pig script into three parts, say banana.pig, apple.pig and orange.pig; each script will have its own logic.
2. Write a shell script that checks for the existence of each file pattern.
3. If the files are present, call the corresponding Pig script using Pig's param option; otherwise don't call it.
   With this option, if the files are not present, that particular Pig script will not be triggered.

Shell script: test.sh
#!/bin/bash

BANANA_FILES="opt/ip_files/banana_*.log"
APPLE_FILES="opt/ip_files/apple_*.log"
ORANGE_FILES="opt/ip_files/orange_*.log"

if ls $BANANA_FILES > /dev/null 2>&1
then
    echo "Banana File Found"
    pig -x local -param PIG_BANANA_INPUT_FILES="$BANANA_FILES" -f banana.pig
else
    echo "No Banana files found"
fi

if ls $APPLE_FILES > /dev/null 2>&1
then
    echo "Apple File Found"
    pig -x local -param PIG_APPLE_INPUT_FILES="$APPLE_FILES" -f apple.pig
else
    echo "No APPLE files found"
fi

if ls $ORANGE_FILES > /dev/null 2>&1
then
    echo "Orange File Found"
    pig -x local -param PIG_ORANGE_INPUT_FILES="$ORANGE_FILES" -f orange.pig
else
    echo "No Orange files found"
fi


Pig script: banana.pig
banana_src = LOAD '$PIG_BANANA_INPUT_FILES' using PigStorage();
DUMP banana_src;

Pig script: apple.pig
apple_src = LOAD '$PIG_APPLE_INPUT_FILES' using PigStorage();
DUMP apple_src;

Pig script: orange.pig
orange_src = LOAD '$PIG_ORANGE_INPUT_FILES' using PigStorage();
DUMP orange_src;
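In the real workflow, each DUMP above would be replaced by the HCatStorer store from the question. A minimal sketch for banana.pig (the table name and the fully qualified HCatStorer class are assumptions; older HCatalog releases ship the class as org.apache.hcatalog.pig.HCatStorer instead):

```pig
banana_src = LOAD '$PIG_BANANA_INPUT_FILES' using PigStorage();
-- per-source transformations would go here
STORE banana_src INTO 'BANANA' USING org.apache.hive.hcatalog.pig.HCatStorer();
```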

Output 1: all three files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
Apple File Found
(test1,test2)
Orange File Found
(13,4,5)

Output 2: only banana files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
No APPLE files found
No Orange files found

A similar question about reading files with patterns in Pig on Hadoop can be found on Stack Overflow: https://stackoverflow.com/questions/26347281/
