The code I am using to parse a simple CSV file looks like this:
SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");
JavaRDD<String[]> parsed = csv.map(x -> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(x));
However, the Spark job ends with a ClassNotFoundException saying that CSVReader cannot be found. My pom.xml looks like this:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>3.8</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
How can I fix this?
Best Answer
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
Source: http://spark.apache.org/docs/latest/submitting-applications.html
When Maven packages the project into a JAR, it does not ship the dependency JARs along with it. To ship the dependency JARs, I added the Maven Shade plugin:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
    <finalName>${project.artifactId}-${project.version}</finalName>
  </configuration>
</plugin>
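One detail worth noting: the Shade plugin only bundles compile- and runtime-scope dependencies into the shaded jar; anything marked provided is left out. So, following the Spark documentation quoted above, only Spark (and Hadoop) should stay provided, while opencsv should use the default compile scope. A minimal sketch of the corrected dependency entry (same coordinates as in the question, with the provided scope dropped):

<dependency>
  <groupId>com.opencsv</groupId>
  <artifactId>opencsv</artifactId>
  <version>3.8</version>
  <!-- default (compile) scope, so the Shade plugin bundles it into the uber jar -->
</dependency>

After running mvn package, the shaded jar produced under target/ (named by the finalName setting above) contains opencsv and can be submitted to the cluster with spark-submit as usual.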
See also: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?
Regarding "java - Apache Spark cannot find the CSVReader class", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39684327/