I am using Cloudera QuickStart 5.4. I have a file in which every line contains data such as:
323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] "GET /download/download6.zip HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19"
In Apache Pig, I am using the following script:
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] “(.+?)” (\\S+) (\\S+) “([^”]*)” “([^”]*)”')) AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray);
DUMP B;
The query above produces output like this:
()
()
Can anyone tell me what I am doing wrong? Is the regular expression OK?
Best answer
Add ", line" at the end of the FOREACH statement, after the closing "chararray)" and before the ";":
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE FLATTEN(
REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'))
AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray)
, line;
DUMP B;
As for the regular expression, it matches the sample string fine; see the regex demo.
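To reproduce that regex check outside Pig, here is a minimal Python sketch. It assumes Python's re module handles this particular pattern the same way as the Java regex engine that Pig relies on, and it uses re.fullmatch to stand in for the whole-string match that REGEX_EXTRACT_ALL performs by default, as far as I know. The names LINE, PATTERN, CURLY_PATTERN and extract_fields are made up for illustration. The second pattern copies the quoting exactly as it is printed in the question, where typographic quotes appear instead of straight ones; that may only be a copy/paste artifact, but it is worth ruling out.

```python
import re

# Sample log line copied from the question.
LINE = (
    '323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] '
    '"GET /download/download6.zip HTTP/1.1" 200 0 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) '
    'Gecko/2010031422 Firefox/3.0.19"'
)

# Pattern from the answer. Pig's doubled backslashes become single ones
# inside a Python raw string.
PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Same pattern with the typographic quotes shown in the question's script.
CURLY_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'“(.+?)” (\S+) (\S+) “([^”]*)” “([^”]*)”'
)


def extract_fields(line, pattern=PATTERN):
    """Return the captured groups as a tuple, or None if the whole line does not match."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(extract_fields(LINE))                 # nine captured fields
    print(extract_fields(LINE, CURLY_PATTERN))  # no match with typographic quotes
```

If the script on disk really contains the typographic quotes, no line can match, because the log data uses straight double quotes; an empty result is then consistent with the "()" rows shown in the question.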
Regarding "regex - Using REGEX_EXTRACT_ALL but projecting I get "()"", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33499134/