java - What are the empty files after RDD.saveAsTextFile?

Tags: java apache-spark data-analysis apache-spark-1.3

I'm learning Spark by working through some of the examples in Learning Spark: Lightning-Fast Data Analysis and then adding my own changes.

I created this class to get a feel for basic transformations and actions.

/**
 * Find errors in a log file
 */

package com.oreilly.learningsparkexamples.mini.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FindErrors {
    public static void main(String args[]){
        String inputFile = args[0];
        String outputFile = args[1];
        //Create a Spark context
        SparkConf conf = new SparkConf().setAppName("findErrors");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //Load input data
        JavaRDD<String> input = sc.textFile(inputFile);
        //Keep only the lines that contain "error"
        JavaRDD<String> errorsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("error");
                }
            });
        //Earlier experiment, left disabled: save only the error lines
        //errorsRDD.saveAsTextFile(outputFile);

        JavaRDD<String> warningsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("warning");
                }
            });

        JavaRDD<String> badLinesRDD = errorsRDD.union(warningsRDD);

        badLinesRDD.saveAsTextFile(outputFile);

        System.out.println("I had " + badLinesRDD.count() + " concerning lines.");
        System.out.println("Here are 10 examples:");
        for(String line: badLinesRDD.take(10)){
            System.out.println(line);
        }

    }   
}

This is the command I use to run it:

$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.FindErrors ./target/learning-spark-mini-example-0.0.1.jar ../files/fake_logs/log1.log ./errorLog

Here are the contents of the log file:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /favicon.ico HTTP/1.1" 200 1713 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET / HTTP/1.1" 200 18785 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

One thing I noticed is that the output consists of several files rather than the single file I expected.

The files are:

_SUCCESS


part-00000
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

part-00001
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

part-00002


part-00003
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

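It looks as though one file was created for each warning/error "group". What are the blank files for?

Also, is this caused by something in my code that I haven't found yet, or is it a feature of Spark?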

Best answer

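This is a feature. With saveAsTextFile, Spark writes one output file per partition, whether or not that partition contains any data. Because you apply filter, some partitions that originally contained data can end up empty. Hence the empty files.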
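To make the partition argument concrete, here is a toy model in plain Java (no Spark involved) of what the job in the question does. Everything in it is an illustrative assumption: the class name PartitionSketch, the helper methods, the shortened log lines, and in particular the split of the ten input lines into two partitions of six and four lines, which is inferred from the part files listed in the question rather than taken from Spark itself. The model relies on two properties: filter keeps the number of partitions unchanged, and union places the partitions of the second RDD after those of the first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {

    // Filter inside each partition; the number of partitions never changes.
    static List<List<String>> filter(List<List<String>> partitions, String needle) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> partition : partitions) {
            List<String> kept = new ArrayList<>();
            for (String line : partition) {
                if (line.contains(needle)) {
                    kept.add(line);
                }
            }
            result.add(kept); // added even when nothing matched
        }
        return result;
    }

    // Union places the partitions of b after the partitions of a.
    static List<List<String>> union(List<List<String>> a, List<List<String>> b) {
        List<List<String>> result = new ArrayList<>(a);
        result.addAll(b);
        return result;
    }

    // Hypothetical stand-in for log1.log, shortened to the request part.
    // Assumption: the file was read as two partitions of 6 and 4 lines.
    static List<List<String>> sampleInput() {
        return Arrays.asList(
            Arrays.asList("GET /071300/242153", "GET /error", "GET /favicon.ico",
                          "GET /", "GET /jobmineimg.php?q=m", "GET /error"),
            Arrays.asList("GET /error", "GET /jobmineimg.php?q=m",
                          "GET /warning", "GET /warning"));
    }

    // Mirrors errorsRDD.union(warningsRDD) from the question.
    static List<List<String>> badLines() {
        List<List<String>> input = sampleInput();
        return union(filter(input, "error"), filter(input, "warning"));
    }

    public static void main(String[] args) {
        List<List<String>> parts = badLines();
        // One "file" per partition, including the empty one.
        for (int i = 0; i < parts.size(); i++) {
            System.out.println(String.format("part-%05d: %d line(s)", i, parts.get(i).size()));
        }
    }
}
```

Under these assumptions the model yields four partitions holding 2, 1, 0 and 2 lines, which matches part-00000 through part-00003 in the question: part-00002 is the "warning" filter applied to the first input partition, which contains no warnings. The files are therefore grouped by partition, not by error or warning type. The empty _SUCCESS file is a different thing: it is a marker that the Hadoop output layer writes when the job finishes successfully. If a single output file is preferred, one common option is to reduce the RDD to one partition before saving, for example badLinesRDD.coalesce(1).saveAsTextFile(outputFile), at the cost of writing all the data from a single task.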

This question, "java - What are the empty files after RDD.saveAsTextFile?", has a similar question on Stack Overflow: https://stackoverflow.com/questions/44869912/
