apache-spark - SparkSQL : Ignoring invalid json files

我正在使用 SparkSQL 加载一堆 JSON 文件，但有些文件有问题。

我想继续处理其他文件，同时忽略坏文件，我该怎么做？

我尝试使用 try-catch 但它仍然失败。示例:

try {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    val jsonFiles=sqlContext.jsonFile("/requests.loading")
} catch {
    case _: Throwable => // Catching all exceptions and not doing anything with them
}

我失败了:

14/11/20 01:20:44 INFO scheduler.TaskSetManager: Starting task 3065.0 in stage 1.0 (TID 6150, HDdata2, NODE_LOCAL, 1246 bytes)<BR>
14/11/20 01:20:44 WARN scheduler.TaskSetManager: Lost task 3027.1 in stage 1.0 (TID 6130, HDdata2): com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
 at [Source: java.io.StringReader@753ab9f1; line: 1, column: 1805]

最佳答案

如果您使用的是 Spark 1.2，Spark SQL 将为您处理那些损坏的 JSON 记录。这是一个例子...

// requests.loading has some broken records
val jsonFiles=sqlContext.jsonFile("/requests.loading")
// Look at the schema of jsonFiles, you will see a new column called "_corrupt_record", which holds all broken JSON records
// jsonFiles.printSchema
// Register jsonFiles as a table
jsonFiles.registerTempTable("jsonTable")
// To query all normal records
sqlContext.sql("SELECT * FROM jsonTable WHERE _corrupt_record IS NULL")
// To query all broken JSON records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")

关于apache-spark - SparkSQL : Ignoring invalid json files，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/27035932/

上一篇：Neo4j JDBCCypherExecutor 阻塞线程

下一篇：google-apps-script - e.在 onFormSubmit 处未定义的响应

相关文章：

python - pyspark；如何有效地减少值

scala - 尝试在 Windows 中使用 sc.textFile 加载文件时出错

scala - Spark Dataframe 除了方法问题

python - 如何在Spark中调用python脚本？

scala - 使用 Pyspark 删除表

python - 如何从同一个数据库读取多个表并将它们保存到自己的 CSV 文件中？

apache-spark - 在 foreachRDD 中执行 rdd.count() 是否将结果返回给 Driver 或 Executor？

apache-spark - 如何在带有分隔符| @ |的spark sql中使用Split函数？

azure - 配置独立 Spark 以进行 Azure 存储访问

scala - 如何在单个文件中执行多个 SQL 查询的 hql 文件？