java - Apache Spark cannot find the CSVReader class

Tags: java maven intellij-idea apache-spark

My code for parsing a simple CSV file looks like this:

SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");

JavaRDD<String[]> parsed = csv.map(x-> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(java.util.Arrays.toString(x)));

However, the Spark job fails with a class-not-found exception reporting that CSVReader cannot be found. My pom.xml looks like this:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

How can I fix this?

Best answer

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
Source: http://spark.apache.org/docs/latest/submitting-applications.html

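When Maven packages a project into a JAR, it does not include the dependency JARs. To ship the dependencies along with the application, I added the Maven Shade plugin: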

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>${project.artifactId}-${project.version}</finalName>
    </configuration>
</plugin>  
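
One caveat, which the answer does not spell out and which is stated here as an assumption about the asker's setup: the Shade plugin bundles only compile- and runtime-scope dependencies, so an opencsv entry that is still marked `provided` (as in the question's pom.xml) would remain missing from the shaded JAR. A minimal sketch of the dependency section with that scope dropped for opencsv while Spark stays `provided`:

```xml
<dependencies>
    <!-- Spark is supplied by the cluster at runtime, so it stays provided -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <!-- No <scope> element means the default compile scope, which Shade bundles -->
    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
    </dependency>
</dependencies>
```

With this in place, `mvn package` should produce a JAR that contains the `com/opencsv` classes next to the application code.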

See also: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?
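
To check whether a class such as `com.opencsv.CSVReader` is actually reachable at runtime, a small diagnostic along these lines may help. It is only a sketch: `ClasspathCheck` and `isOnClasspath` are hypothetical names introduced for illustration and are not part of Spark or opencsv.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class from the question; pass another fully qualified name to test it
        String name = args.length > 0 ? args[0] : "com.opencsv.CSVReader";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

Running this with the packaged JAR on the classpath shows whether the dependency was bundled. It checks only the local JVM, so it does not prove that every Spark executor sees the same classes.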

A similar question about java - Apache Spark cannot find the CSVReader class can be found on Stack Overflow: https://stackoverflow.com/questions/39684327/
