I am trying to write the following in Hadoop map-reduce. I have a log file that contains IP addresses and the URLs opened by each IP. It looks like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
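Now I need to organize the results from this file so that it lists each distinct IP address with its URLs, followed by the number of times that IP address opened each URL.
For example, if 192.168.72.224 opened www.yahoo.com 15 times according to the whole log file, then the output must contain: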
192.168.72.224 www.yahoo.com 15
This should be done for all IPs in the file, and the final output should look like this:
192.168.72.224 www.yahoo.com 15
www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
www.gmail.com 19
....
...
..
.
The code I have tried is:
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
I know this code is seriously flawed; please give me an idea of how to proceed.
Thanks.
Best answer
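I would suggest this design:
- The mapper takes a line from the file and outputs the IP as the key and a (website, 1) pair as the value.
- The combiner and reducer receive the IP as the key and a sequence of (website, count) pairs, aggregate them by website (using a HashMap), and output the IP, website, and count.
Implementing this requires a custom Writable to handle the pair.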
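The design above can be sketched in plain Python to check the logic before writing any Hadoop code. This is only an in-memory simulation under stated assumptions: the names mapper, combine, and run_job are hypothetical, a dict stands in for the HashMap, and a tuple stands in for the custom Writable pair. Because combine accepts counts larger than 1, its output can be fed back into itself, which is the property a combiner needs.

```python
from collections import defaultdict

def mapper(line):
    """Emit (ip, (website, 1)) for one well-formed log line."""
    parts = line.split()
    if len(parts) == 2:  # skip blank or malformed lines
        ip, website = parts
        yield ip, (website, 1)

def combine(pairs):
    """Sum counts per website; the same logic serves as combiner and reducer."""
    totals = defaultdict(int)
    for website, count in pairs:
        totals[website] += count
    return dict(totals)

def run_job(lines):
    """Simulate map -> shuffle (group by IP) -> reduce, all in memory."""
    shuffled = defaultdict(list)
    for line in lines:
        for ip, pair in mapper(line):
            shuffled[ip].append(pair)
    return {ip: combine(pairs) for ip, pairs in shuffled.items()}

# First four lines of the log from the question
sample = [
    "192.168.72.224 www.m4maths.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.facebook.com",
]
result = run_job(sample)
for ip, sites in result.items():
    for website, count in sites.items():
        print(ip, website, count)
```

In a real job the combiner's output types would have to match the mapper's output types, so the reducer would be the place to format the final IP, website, and count line.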
Personally, I would use Spark for this, unless you care too much about performance. With PySpark it would be as simple as this:
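from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the shell's context if one already exists
rdd = sc.textFile('/sparkdemo/log.txt')
# Key each record by (ip, url) and sum the 1s to get per-pair hit counts
counts = rdd.map(lambda line: line.split()).map(lambda f: ((f[0], f[1]), 1)).reduceByKey(lambda x, y: x + y)
# Re-key by IP alone and gather each IP's (url, count) pairs
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print(' website: %s count: %d' % (url, cnt))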
The output for your example would be:
IP: 192.168.72.224
 website: www.facebook.com count: 2
 website: www.m4maths.com count: 2
 website: www.google.com count: 5
 website: www.gmail.com count: 4
 website: www.indiabix.com count: 8
 website: www.yahoo.com count: 3
IP: 192.168.72.177
 website: www.yahoo.com count: 14
 website: www.google.com count: 3
 website: www.facebook.com count: 3
 website: www.m4maths.com count: 3
 website: www.indiabix.com count: 1
IP: 192.168.198.92
 website: www.facebook.com count: 4
 website: www.m4maths.com count: 3
 website: www.yahoo.com count: 3
 website: www.askubuntu.com count: 2
 website: www.indiabix.com count: 1
 website: www.google.com count: 5
 website: www.gmail.com count: 1
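Counts like these can be cross-checked locally without a Spark cluster. The following is a minimal sketch, assuming the log fits in memory; count_hits is a hypothetical helper name. It follows the same two steps as the Spark pipeline: count by the composite (ip, url) key, then regroup the totals by IP.

```python
from collections import Counter, defaultdict

def count_hits(lines):
    """Count hits per (ip, url) pair, then regroup the totals by IP."""
    pair_counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            pair_counts[(parts[0], parts[1])] += 1
    by_ip = defaultdict(dict)
    for (ip, url), cnt in pair_counts.items():
        by_ip[ip][url] = cnt
    return dict(by_ip)

# First six lines of the log from the question
sample = [
    "192.168.72.224 www.m4maths.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.facebook.com",
    "192.168.72.224 www.gmail.com",
    "192.168.72.177 www.facebook.com",
]
hits = count_hits(sample)
for ip, sites in hits.items():
    print('IP: %s' % ip)
    for url, cnt in sites.items():
        print(' website: %s count: %d' % (url, cnt))
```

Keying on the (ip, url) pair first means every value is a plain integer, so the aggregation step is an ordinary sum.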
Regarding url - extracting hit counts from a log file using mapreduce, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29005755/