This is what I want to do:
INPUT
1,code=1a_asdfasdf_code=1b,asdf
2,code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf,asdf
3,code=3a_,sdoliclwmd
Intermediate
1,{1a,1b}
2,{2a,2b,2c}
3,{3a}
Finally
1,1a
1,1b
2,2a
2,2b
2,2c
3,3a
I know about REGEX_EXTRACT and REGEX_EXTRACT_ALL, but neither of them returns multiple matches for the same regular expression.
This only gives me the first match:
A = LOAD '/data/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1,REGEX_EXTRACT_ALL(c2,'.*code=([^_]+)_.*') as m1;
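To illustrate why a single application of this pattern yields at most one code per row, here is a small Python sketch. It assumes that REGEX_EXTRACT_ALL matches the pattern against the whole field once and returns that one match's capture groups; `re.fullmatch` is used as a stand-in for that behavior, and `single_match_groups` is a hypothetical helper name, not part of Pig.

```python
import re

# Same pattern as in the Pig statement above: one capture group,
# with .* on both sides so the whole field has to match.
PATTERN = r".*code=([^_]+)_.*"

def single_match_groups(value):
    """Return the capture groups of one whole-string match, or None."""
    m = re.fullmatch(PATTERN, value)
    return m.groups() if m else None

if __name__ == "__main__":
    for field in ("code=1a_asdfasdf_code=1b",
                  "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf",
                  "code=3a_"):
        # Each field produces a tuple with a single element,
        # no matter how many code=... occurrences it contains.
        print(single_match_groups(field))
```

Because the pattern has only one capture group, one match can carry only one value, so the remaining occurrences are never reported.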
Best answer
FYI, this question is about Pig Latin.
I ended up writing a Python UDF:
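#!/usr/bin/python
import re

# The outputSchema decorator is provided by Pig's Jython script engine.
@outputSchema("bag1:bag{tuple1:tuple(match:chararray)}")
def findallregex(pattern, text):
    # Return every match of pattern in text as a bag of one-field tuples.
    outbag = []
    matches = re.findall(pattern, text)
    for m in matches:
        tuple1 = (m,)
        outbag.append(tuple1)
    return outbag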
followed by this Pig Latin code:
REGISTER '/findall.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
A = LOAD '/regsearch1.csv' using PigStorage(',') as (c1:chararray,c2:chararray,c3:chararray);
B = foreach A generate c1, myfuncs.findallregex('code=([^_]+)',c2) as bag1;
C = foreach B generate c1, flatten(bag1);
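The pattern can be checked outside Pig before registering the UDF. The following is a minimal plain-Python sketch of the same two steps (collect all matches per row, then flatten); `ROWS` and `find_all_flattened` are hypothetical names introduced for illustration, and the rows are copied from the sample input in the question.

```python
import re

# Hypothetical stand-in for columns c1 and c2 of regsearch1.csv.
ROWS = [
    ("1", "code=1a_asdfasdf_code=1b"),
    ("2", "code=2a_asdfasdf_code=2b_code=2c_laksjdf;lksjdf"),
    ("3", "code=3a_"),
]

def find_all_flattened(rows, pattern=r"code=([^_]+)"):
    """Mimic the UDF followed by flatten: one (c1, match) pair per match."""
    out = []
    for c1, c2 in rows:
        # Step 1: build the bag of one-field tuples, as the UDF does.
        bag = [(m,) for m in re.findall(pattern, c2)]
        # Step 2: flatten the bag into one output row per tuple.
        for (match,) in bag:
            out.append((c1, match))
    return out

if __name__ == "__main__":
    for c1, match in find_all_flattened(ROWS):
        print("%s,%s" % (c1, match))
```

Unlike the pattern in the question, this one has no leading or trailing `.*`, so `re.findall` can report each `code=` occurrence separately.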
Source: "regex - Extracting multiple regex matches from the same line in Pig", a question on Stack Overflow: https://stackoverflow.com/questions/21469541/