java - Best way to use the Distributed Cache to distribute a small lookup file

Which of these is the best way to read data from the distributed cache?

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Filled once per task in setup() and reused by every map() call
    private final ArrayList<String> globalFreq = new ArrayList<String>();

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
        Path getPath = new Path(cacheFiles[0].getPath());
        // try-with-resources closes the reader once the file has been loaded
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Access "globalFreq" here and do further processing
    }
}

or

public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private URI[] cacheFiles;
    private FileSystem fs;

    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        fs = FileSystem.get(conf);
        cacheFiles = DistributedCache.getCacheFiles(conf);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The cache file is reopened and re-parsed for every input record
        ArrayList<String> globalFreq = new ArrayList<String>();
        Path getPath = new Path(cacheFiles[0].getPath());
        try (BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)))) {
            String setupData;
            while ((setupData = bf.readLine()) != null) {
                String[] parts = setupData.split(" ");
                globalFreq.add(parts[0]);
            }
        }
    }
}

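So if we do it the second way (code 2), and say we have 5 map tasks, every map task reads the same copy of the data. Written like this inside map, the tasks read the data multiple times, right (5 times)?

Code 1: because the read is written in setup, the file is read once and the global data is then accessed in map.

Is this the right way to use the distributed cache?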

Best answer

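Do as much as you can in the setup method: it is called once per mapper, and whatever it loads stays cached for every record passed to that mapper. This avoids the overhead of parsing the data for each record, because nothing in the parsing depends on the key, value, and context variables that you receive in the map method.

The setup method is called once per map task, whereas map is called for every record that the task processes (which can obviously be a very high number).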
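To make the difference in call counts concrete, here is a minimal sketch that imitates one map task without any Hadoop dependency. The names `LookupMapperSketch`, `runTask`, `loads`, and `matches` are hypothetical and exist only for illustration. It also assumes a `HashSet` in place of the question's `ArrayList`, which is a substitution that makes each per-record lookup cheap, not something the original code does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hadoop-free stand-in for a Mapper, used only to illustrate the call pattern.
class LookupMapperSketch {
    // Assumption: a Set instead of the question's ArrayList, for fast lookups.
    private final Set<String> globalFreq = new HashSet<>();
    int loads = 0;   // how many times the lookup file was parsed
    int matches = 0; // how many records were found in the lookup set

    // Mirrors Mapper.setup(): runs once per task, so the file is parsed once.
    void setup(Reader lookupFile) throws IOException {
        loads++;
        try (BufferedReader reader = new BufferedReader(lookupFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                globalFreq.add(line.split(" ")[0]);
            }
        }
    }

    // Mirrors Mapper.map(): runs once per record and only touches memory.
    void map(String record) {
        if (globalFreq.contains(record)) {
            matches++;
        }
    }

    // Imitates what the framework does for a single map task.
    static LookupMapperSketch runTask(String lookupContents, List<String> records) {
        LookupMapperSketch mapper = new LookupMapperSketch();
        try {
            mapper.setup(new StringReader(lookupContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String record : records) {
            mapper.map(record);
        }
        return mapper;
    }

    public static void main(String[] args) {
        LookupMapperSketch m = runTask(
                "apple 3\nbanana 7\n",
                Arrays.asList("apple", "cherry", "banana", "apple"));
        System.out.println("file parsed " + m.loads + " time(s), matches = " + m.matches);
    }
}
```

Running `main` prints `file parsed 1 time(s), matches = 3`: four records are processed, but the lookup data is parsed only once. If the parsing lived in `map` instead, it would run once per record, which is the overhead the answer warns about.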

This page is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25760810/
