java - Hadoop: choosing an input file within the input folder

In the training_set folder, the files are named like this:

mv_000000
mv_000001
mv_000002
...

The numeric suffix is the movie ID, which can be looked up in movie_title.txt.
The movie_title.txt file looks like this:

1,2003,Dinosaur Planet
2,2004,Isle of Man TT 2004 Review
3,1997,Character   
4,1994,Paula Abdul's Get Up & Dance
5,2004,The Rise and Fall of ECW 
...

The first column is the ID of the movie named in the third column.

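I am practicing Hadoop with the Netflix Prize dataset.
Suppose I pass in a specific movie title, for example "Sick".
The program should then open movie_titles.txt and find the ID of the movie titled "Sick".
Finally, it should set the input path according to that movie ID.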

For example, if I launch the Hadoop program like this:

hadoop jar ~ [input path] [output path] [movieA name]

then the input path must be set to training_set/mv_movieAIndex.

As I said, the movie ID information is in movie_title.txt.

Please give me some hints on how to solve this.

Best answer

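Your requirement does not really seem to be about Hadoop at all. You only need to look up the ID for the movieName given as the third argument of the hadoop jar command. The following snippet should do the job: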

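import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Builds a title -> movie ID map from lines of the form "id,year,title".
private static Map<String, Integer> getMovieMappings(String filePath)
        throws IOException {
    Map<String, Integer> movieMap = new HashMap<String, Integer>();
    // try-with-resources closes the reader even if parsing fails
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = br.readLine()) != null) {
            // Limit the split to 3 fields so commas inside a title are kept
            String[] temp = line.split(",", 3);
            if (temp.length < 3) continue; // skip blank or malformed lines
            movieMap.put(temp[2].trim(), Integer.parseInt(temp[0].trim()));
        }
    }
    return movieMap;
}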
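
The question enters "Sick" but then searches for "sick", so an exact-match map may miss the title. The following is a minimal sketch of a case-insensitive lookup, assuming that lower-casing titles is acceptable and that the first entry wins when two movies share a title. The class MovieLookup and its methods index and find are hypothetical names, not part of the original answer or of Hadoop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MovieLookup {
    // Builds a lower-cased title -> ID map from "id,year,title" lines.
    static Map<String, Integer> index(List<String> lines) {
        Map<String, Integer> ids = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 3); // keep commas inside the title
            if (parts.length < 3) continue;      // skip blank or malformed lines
            String key = parts[2].trim().toLowerCase(Locale.ROOT);
            // Assumption: if two movies share a title, the first one wins.
            ids.putIfAbsent(key, Integer.parseInt(parts[0].trim()));
        }
        return ids;
    }

    // Returns the ID for a title, or null when the title is unknown.
    static Integer find(Map<String, Integer> ids, String title) {
        return ids.get(title.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1,2003,Dinosaur Planet",
                "3,1997,Character   ");
        Map<String, Integer> ids = index(sample);
        System.out.println(find(ids, "character")); // prints 3
    }
}
```

In a driver, the lines could come from java.nio.file.Files.readAllLines, and a null result from find would signal an unknown title.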

Now, in the driver, just get the map and set the inputPath accordingly:

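Map<String, Integer> movieMap = getMovieMappings("/pathTo/movie_title.txt");
// get() returns null for an unknown title, so check before unboxing
Integer movieId = movieMap.get(args[2]);
if (movieId == null) {
    throw new IllegalArgumentException("Unknown movie title: " + args[2]);
}
// File names are zero-padded to six digits, e.g. mv_000001
String fileName = String.format("mv_%06d", movieId);
System.out.println(fileName);
FileInputFormat.addInputPath(job, new Path("training_set", fileName));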
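
As a quick check of the format string, this small sketch shows how "mv_%06d" pads an ID to six digits, matching the file names listed in the question. The class InputFileName and the method forMovie are hypothetical names used only for illustration.

```java
class InputFileName {
    // Formats a movie ID as a zero-padded, six-digit file name.
    static String forMovie(int movieId) {
        return String.format("mv_%06d", movieId);
    }

    public static void main(String[] args) {
        System.out.println("training_set/" + forMovie(1));    // training_set/mv_000001
        System.out.println("training_set/" + forMovie(1234)); // training_set/mv_001234
    }
}
```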

Hope this helps.

Regarding "java - Hadoop: choosing an input file within the input folder", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26825570/
