apache-spark - How to get the DAG of a Spark SQL query execution plan?

Tags: apache-spark pyspark apache-spark-sql explain spark-ui

I'm doing some analysis of Spark SQL query execution plans. The execution plan printed by the explain() API is not very readable. The Spark Web UI, by contrast, renders a DAG visualization broken down into jobs, stages, and tasks, which is much easier to follow. Is there a way to create that graph in code, from the execution plan or through some API? If not, is there any API for reading that graph from the UI?
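For concreteness, this is the kind of output I mean. A minimal PySpark sketch (the input file and the query here are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-demo").getOrCreate()

# Hypothetical input file and query, just to produce a plan to inspect.
df = spark.read.csv("wikidata.csv", header=True, inferSchema=True)
agg = df.groupBy("domain_code").count()

# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the physical plan as indented text -- hard to read for large queries.
agg.explain(True)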

Best Answer

As far as I know, this project ( https://github.com/AbsaOSS/spline-spark-agent ) is able to interpret the execution plan and generate a readable representation of it. The Spark job demonstrated here reads a file, converts it, and writes the result out locally as CSV.
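As a rough sketch of how the agent can be attached to such a job (the Maven coordinates, listener class, and "console" dispatcher below follow the Spline 0.5.x documentation; treat them as assumptions and verify them against the agent version you actually use):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-spline-demo-application")
    # Registers Spline's listener so every write action is captured.
    # Assumes the agent bundle jar is already on the classpath, e.g. via
    # spark-submit --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.11:0.5.5
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    # The "console" dispatcher prints the lineage JSON (like the sample below)
    # to stdout instead of posting it to a Spline server.
    .config("spark.spline.lineageDispatcher", "console")
    .getOrCreate()
)

df = spark.read.csv("wikidata.csv", header=True, inferSchema=True)
df.write.mode("overwrite").csv("output.csv")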

Sample output in JSON format is shown below:

{
    "id": "3861a1a7-ca31-4fab-b0f5-6dbcb53387ca",
    "operations": {
        "write": {
            "outputSource": "file:/output.csv",
            "append": false,
            "id": 0,
            "childIds": [
                1
            ],
            "params": {
                "path": "output.csv"
            },
            "extra": {
                "name": "InsertIntoHadoopFsRelationCommand",
                "destinationType": "csv"
            }
        },
        "reads": [
            {
                "inputSources": [
                    "file:/Users/liajiang/Downloads/spark-onboarding-demo-application/src/main/resources/wikidata.csv"
                ],
                "id": 2,
                "schema": [
                    "6742cfd4-d8b6-4827-89f2-4b2f7e060c57",
                    "62c022d9-c506-4e6e-984a-ee0c48f9df11",
                    "26f1d7b5-74a4-459c-87f3-46a3df781400",
                    "6e4063cf-4fd0-465d-a0ee-0e5c53bd52b0",
                    "2e019926-3adf-4ece-8ea7-0e01befd296b"
                ],
                "params": {
                    "inferschema": "true",
                    "header": "true"
                },
                "extra": {
                    "name": "LogicalRelation",
                    "sourceType": "csv"
                }
            }
        ],
        "other": [
            {
                "id": 1,
                "childIds": [
                    2
                ],
                "params": {
                    "name": "`source`"
                },
                "extra": {
                    "name": "SubqueryAlias"
                }
            }
        ]
    },
    "systemInfo": {
        "name": "spark",
        "version": "2.4.2"
    },
    "agentInfo": {
        "name": "spline",
        "version": "0.5.5"
    },
    "extraInfo": {
        "appName": "spark-spline-demo-application",
        "dataTypes": [
            {
                "_typeHint": "dt.Simple",
                "id": "f0dede5e-8fe1-4c22-ab24-98f7f44a9a5a",
                "name": "timestamp",
                "nullable": true
            },
            {
                "_typeHint": "dt.Simple",
                "id": "dbe1d206-3d87-442c-837d-dfa47c88b9c1",
                "name": "string",
                "nullable": true
            },
            {
                "_typeHint": "dt.Simple",
                "id": "0d786d1e-030b-4997-b005-b4603aa247d7",
                "name": "integer",
                "nullable": true
            }
        ],
        "attributes": [
            {
                "id": "6742cfd4-d8b6-4827-89f2-4b2f7e060c57",
                "name": "date",
                "dataTypeId": "f0dede5e-8fe1-4c22-ab24-98f7f44a9a5a"
            },
            {
                "id": "62c022d9-c506-4e6e-984a-ee0c48f9df11",
                "name": "domain_code",
                "dataTypeId": "dbe1d206-3d87-442c-837d-dfa47c88b9c1"
            },
            {
                "id": "26f1d7b5-74a4-459c-87f3-46a3df781400",
                "name": "page_title",
                "dataTypeId": "dbe1d206-3d87-442c-837d-dfa47c88b9c1"
            },
            {
                "id": "6e4063cf-4fd0-465d-a0ee-0e5c53bd52b0",
                "name": "count_views",
                "dataTypeId": "0d786d1e-030b-4997-b005-b4603aa247d7"
            },
            {
                "id": "2e019926-3adf-4ece-8ea7-0e01befd296b",
                "name": "total_response_size",
                "dataTypeId": "0d786d1e-030b-4997-b005-b4603aa247d7"
            }
        ]
    }
}
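The DAG itself is encoded in the id/childIds fields of the operations: each edge points from an operation to the operation that feeds it. A small sketch of recovering those edges from a plan document like the one above (plain JSON traversal, no Spline code involved; the file name is hypothetical):

import json

def plan_edges(plan):
    """Return (parent_id, child_id) edges from a Spline execution-plan document."""
    ops = plan["operations"]
    nodes = [ops["write"]] + ops.get("reads", []) + ops.get("other", [])
    return [(node["id"], child)
            for node in nodes
            for child in node.get("childIds", [])]

with open("plan.json") as f:  # the JSON document shown above
    print(plan_edges(json.load(f)))
# -> [(0, 1), (1, 2)], i.e. write <- SubqueryAlias <- LogicalRelation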


Regarding "apache-spark - How to get the DAG of a Spark SQL query execution plan?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64172183/
