shell - How to recursively load multiple CSV files from a directory into Hive

I have created an external Hive table, say table A, with a defined schema but no data. Now suppose I have CSV files in an HDFS directory, organized as follows:

20150718/dir1/dir2/file1.csv
20150718/dir1/dir2/file2.csv
...................
20150718/dir1/dir2/..../dirN/file10000.csv

In other words, the files may sit at many different directory levels under 20150718. How can I load all of these CSV files with a single Hive/shell command?

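One more consideration: I plan to create partitions by date as time goes on, so how should I go about that? I am still new to Hive, and any advice is much appreciated.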

Best answer

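// Requires the Hadoop client library:
// import java.io.IOException;
// import org.apache.hadoop.conf.Configuration;
// import org.apache.hadoop.fs.*;

// Get the configuration (getConf() comes from Configured, e.g. in a Tool implementation)
Configuration conf = getConf();
FileSystem fs = inputPath.getFileSystem(conf);

// Specify the filter (in your case it could also select by date).
// FileFilter is a user-defined PathFilter (not shown here); it must accept
// directories as well as the wanted files, or the recursion stops early.
PathFilter pf = new FileFilter(conf, fs, new String[] { "csv" });

// Move or copy recursively into the target directory
moveRecursivelyToTarget(new Path(target), fs, inputPath, pf);

protected void moveRecursivelyToTarget(Path target, FileSystem fs, Path path, PathFilter inputFilter)
    throws IOException
{
  for (FileStatus stat : fs.listStatus(path, inputFilter)) {
    if (stat.isDirectory()) {
      moveRecursivelyToTarget(target, fs, stat.getPath(), inputFilter);
    } else {
      // Files with the same name in different subdirectories would collide here.
      Path dest = new Path(target, stat.getPath().getName());
      // The sources are already in HDFS, so copy within the file system
      FileUtil.copy(fs, stat.getPath(), fs, dest, false, fs.getConf());
      // Or move instead of copying:
      // fs.rename(stat.getPath(), dest);
    }
  }
}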

You can follow the same procedure in a shell script too.
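
The FileFilter class used above is not shown in the answer and is not part of Hadoop. As a rough sketch of the check such a filter might perform, the plain-Java helper below tests a file name against a list of extensions. The class and method names are hypothetical. In a real PathFilter you would presumably call it from accept(Path) and return true for directories as well, since listStatus applies the filter to subdirectories too.

```java
import java.util.Locale;

// Hypothetical helper: the extension test a custom PathFilter could delegate to.
class ExtensionCheck {
    // Returns true when fileName ends with "." plus one of the extensions (case-insensitive).
    static boolean hasExtension(String fileName, String[] extensions) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : extensions) {
            if (lower.endsWith("." + ext.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] exts = { "csv" };
        System.out.println(hasExtension("file1.csv", exts)); // true
        System.out.println(hasExtension("notes.txt", exts)); // false
    }
}
```

Keeping the check separate from the Hadoop types makes it easy to try out without a cluster.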

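To create dynamic partitions, load the data collected above into a staging table, call it tableA, then read from tableA and write into tableMain using partitions. You can clean out tableA after each day's load.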

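set hive.exec.dynamic.partition=true;
-- Required when no partition column is given a static value
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last SELECT column (z here) supplies the dynamic partition value
INSERT OVERWRITE TABLE tableMain PARTITION (`date`) SELECT x, y, z
FROM tableA t;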
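
Dynamic partitioning needs a date value for every row in the staging table. With the layout from the question, one option is to take it from the file path. The sketch below shows that extraction in plain Java. The class and method names are hypothetical, and it assumes the date is the first path segment made of exactly eight digits. It could run inside the recursive copy to decide which partition a file belongs to.

```java
import java.util.Optional;

// Hypothetical helper: derives a partition value from a path such as
// "20150718/dir1/dir2/file1.csv".
class PartitionDate {
    // Returns the first path segment that consists of exactly eight digits, if any.
    static Optional<String> fromPath(String path) {
        for (String segment : path.split("/")) {
            if (segment.matches("\\d{8}")) {
                return Optional.of(segment);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromPath("20150718/dir1/dir2/file1.csv").orElse("no date found"));
    }
}
```

The pattern only checks for eight digits, so it does not verify that the segment is a valid calendar date.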

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/31581401/
