apache-spark - Creating a Spark Dataset from a CSV file

Tags: apache-spark apache-spark-dataset

I want to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:

name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"

Here is the code that creates the Dataset:
var location = "s3a://path_to_csv"

case class City(name: String, state: String, number_of_people: Long)

val cities = spark.read
  .option("header", "true")
  .option("charset", "UTF8")
  .option("delimiter",",")
  .csv(location)
  .as[City]

Here is the error message: "Unable to cast number_of_people from string to bigint because it may be truncated"

Databricks discusses creating Datasets and this particular error message in this blog post:

Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255) the Analyzer will emit an AnalysisException.



I am using the Long type, so I did not expect to see this error message.

Best answer

Use schema inference:

val cities = spark.read
  .option("inferSchema", "true")
  ...
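
For reference, a fuller sketch of the inference-based read might look like this (assuming the same spark session, location, and City case class from the question, plus spark.implicits._ for the encoder):

// Let Spark infer numeric types instead of reading every column as string;
// the inferred integer column can then be safely upcast to Long by .as[City].
val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(location)
  .as[City]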

Or provide a schema:
val cities = spark.read
  .schema(StructType(Array(StructField("name", StringType), ...)))
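
A more complete version of the explicit-schema approach could look like the sketch below; the field names and types are taken from the CSV header in the question, and the citySchema name is just illustrative:

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType, DoubleType}

// Assumes spark.implicits._ and the City case class from the question are in scope.
val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema)
  .csv(location)
  .as[City]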

Or cast the column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Read everything as string, then cast the numeric column before converting to the typed Dataset.
val cities = spark.read
  .option("header", "true")
  .csv(location)
  .withColumn("number_of_people", col("number_of_people").cast(LongType))
  .as[City]
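
Whichever variant you use, a quick sanity check is to print the schema and a few rows:

cities.printSchema() // number_of_people should now show as long (bigint)
cities.show()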

Regarding "apache-spark - Creating a Spark Dataset from a CSV file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39522411/
