apache-spark - Can Spark SQL not count correctly or can I not write SQL correctly?

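In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco's open data about emergency calls to 911 requesting firefighters. (It is the old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube), made available on S3 for that tutorial.)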

After mounting the data and reading it into a DataFrame fire_service_calls_df with an explicitly defined schema, I aliased that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()

Out[n]: 34

...or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()

+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

...or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Why do I get two different count results? (It seems that 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

Best answer

To answer the question

Can Spark SQL not count correctly or can I not write SQL correctly?

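from the title: I can not write SQL correctly.

Rule <insert number> of writing SQL: Think about NULL and UNDEFINED. count(DISTINCT CallType) ignores NULL values, whereas SELECT DISTINCT CallType returns NULL as a row of its own, so counting those rows yields one more: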

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34
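The difference between the two queries can be reproduced outside Spark. The following is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for Spark SQL and a made-up miniature of the fireServiceCalls table; both engines follow the standard SQL rule that aggregate functions over a column skip NULL.

```python
import sqlite3

# Hypothetical miniature of the fireServiceCalls table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Alarms",), ("HazMat",), (None,), (None,), ("Explosion",)],
)

# count(DISTINCT col) skips NULLs, so only the non-NULL values are counted.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# SELECT DISTINCT keeps NULL as one row of its own, and count(*) counts rows.
count_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

print(count_distinct, count_rows)
```

With this sample data the two counts differ by exactly one, which matches the 33 versus 34 seen in the notebook.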

Also, I apparently can't read:

Paul suggested in a comment:

With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

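Well, I had actually thought of that myself. (Minus the sorting.) Except that there was no difference: the output always contained 34 call types, whether I generated it with SQL or with DataFrame queries. I simply didn't notice that one of them was ominously named null: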

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
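A NULL entry is easy to overlook in a listing like the one above. One way to make it stand out is to count the NULL rows directly and to render NULL with an unmistakable label. This is a sketch under the same assumptions as before: sqlite3 stands in for Spark SQL, the rows are made up, and the '<NULL>' label is an arbitrary choice. count(col) and coalesce behave the same way in Spark SQL.

```python
import sqlite3

# Hypothetical sample rows; one CallType is missing (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Other",), (None,), ("Alarms",), ("Other",)],
)

# count(*) counts every row, count(CallType) only the non-NULL ones,
# so the difference is the number of rows where CallType is NULL.
null_rows = conn.execute(
    "SELECT count(*) - count(CallType) FROM fireServiceCalls"
).fetchone()[0]

# Replace NULL with an explicit label so it cannot be mistaken for a call type.
labels = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT coalesce(CallType, '<NULL>') AS CallType "
        "FROM fireServiceCalls ORDER BY 1"
    )
]

print(null_rows, labels)
```

A non-zero null_rows is a hint that count(DISTINCT ...) and the number of rows returned by SELECT DISTINCT will disagree.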

Source: a similar question on Stack Overflow, "Can Spark SQL not count correctly or can I not write SQL correctly?": https://stackoverflow.com/questions/49265307/
