How to extract a string from a URL in a weblog file using Pig/Hive
Input file:
122.161.182.202 - jane [21/Jul/2012:13:14:17-0700] "GET /rss.pl HTTP/1.1" 200 35942 "http://www.e.com/bam_applicatin/VD55173061" "IE/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
Desired output:
122.161.182.202 - jane [21/Jul/2012:13:14:17-0700] "GET /rss.pl HTTP/1.1" 200 35942 "VD55173061" "IE/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3; MSOffice 12)"
Input URL:
http://www.e.com/bam_applicatin/VD55173061
Desired string from the URL:
VD55173061
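I want to process the weblog with Pig or Hive. Please help.
Best answer
Using Apache Pig
The string you want is everything after the last '/' in the URL, so take the substring that starts one position past LAST_INDEX_OF(url, '/') and runs to the end of the string. See http://pig.apache.org/docs/r0.14.0/func.html#substring for the SUBSTRING API documentation and usage.
Input: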
http://www.e.com/bam_applicatin/VD55173061
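Pig script:
-- Load one URL per line; the sample URL contains no commas, so each line is a single field.
url_data = LOAD 'input.csv' USING PigStorage(',') AS (url:chararray);
-- Keep everything after the last '/'. SIZE returns a long, so cast it to int for SUBSTRING.
req_url = FOREACH url_data GENERATE SUBSTRING(url, LAST_INDEX_OF(url, '/') + 1, (int)SIZE(url));
DUMP req_url;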
Output (DUMP prints each record as a tuple):
(VD55173061)
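The script above works on a file that holds only URLs, whereas the question asks for the whole log line with the referrer shortened in place. One way this might be done is with Pig's built-in REPLACE, which takes a Java regular expression. This is a minimal sketch and not part of the original answer: the paths 'access.log' and 'output' are hypothetical, and it assumes the referrer is the only quoted field on the line that starts with http:// or https://.

```pig
-- Read each log line whole; TextLoader yields one chararray per line.
-- 'access.log' is a hypothetical input path.
logs = LOAD 'access.log' USING TextLoader() AS (line:chararray);

-- Replace a quoted http(s) URL with only its last path segment, keeping the quotes.
-- [^"]*/ greedily consumes everything up to the final '/',
-- and ([^/"]+) captures the segment after it, which $1 puts back.
shortened = FOREACH logs GENERATE
    REPLACE(line, '"https?://[^"]*/([^/"]+)"', '"$1"') AS line;

-- 'output' is a hypothetical output directory.
STORE shortened INTO 'output' USING PigStorage();
```

A referrer with no path segment after the host, such as "http://www.e.com/", does not match the pattern, so that line is left unchanged.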