I want to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:
name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"
Here is the code that creates the Dataset:
var location = "s3a://path_to_csv"
case class City(name: String, state: String, number_of_people: Long)
val cities = spark.read
.option("header", "true")
.option("charset", "UTF8")
.option("delimiter",",")
.csv(location)
.as[City]
Here is the error message: "Cannot up cast `number_of_people` from string to bigint as it may truncate". Databricks discusses creating Datasets and this particular error message in this blog post:
Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255) the Analyzer will emit an AnalysisException.
I am using the Long type, so I did not expect to see this error message.
Best Answer
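The error occurs because, without inferSchema or an explicit schema, the CSV reader types every column as StringType, and the Analyzer refuses to up-cast string to bigint. A quick way to confirm this (a sketch, assuming the same file and an active SparkSession):

// Read the CSV with no type information: every column comes back as StringType.
val raw = spark.read
  .option("header", "true")
  .csv(location)

raw.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- state: string (nullable = true)
//  |-- number_of_people: string (nullable = true)
//  |-- coolness_index: string (nullable = true)

Given that, there are three ways to fix it.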
Use schema inference:
val cities = spark.read
.option("inferSchema", "true")
...
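Spelled out, a sketch of the inferSchema variant, reusing the options from the question; inference should produce an integral type for number_of_people, which up-casts safely to Long:

val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // extra pass over the data to guess column types
  .option("charset", "UTF8")
  .option("delimiter", ",")
  .csv(location)
  .as[City]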
Or provide a schema:
val cities = spark.read
.schema(StructType(Array(StructField("name", StringType), ...)))
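Spelled out for this file (a sketch; the four fields mirror the CSV header, and the types are one reasonable choice):

import org.apache.spark.sql.types._

val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema) // skip inference; columns get exactly these types
  .csv(location)
  .as[City]

This also avoids the extra pass over the data that inferSchema requires, which matters for large files on S3.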
Or cast:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val cities = spark.read
  .option("header", "true")
  .csv(location)
  .withColumn("number_of_people", col("number_of_people").cast(LongType)) // cast before the typed conversion
  .as[City]
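Whichever variant you choose, a quick sanity check (a sketch; the expected sum assumes the four sample rows and spark.implicits._ in scope):

cities.printSchema() // number_of_people should now be long
cities.map(_.number_of_people).reduce(_ + _) // 10 + 20 + 30 + 40 = 100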
This apache-spark question, "Create a Spark Dataset from a CSV file", is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/39522411/