pyspark - Where to find Spark logs in Dataproc when running a job in cluster mode

Tags: pyspark google-cloud-dataproc dataproc

I am running the code below as a job in Dataproc. When it runs in "cluster" mode, I cannot find the logs in the console.

import sys
import time
from datetime import datetime

from pyspark.sql import SparkSession

start_time = datetime.utcnow()

spark = SparkSession.builder.appName("check_confs").getOrCreate()

# Print every configuration entry visible to this Spark application.
all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
# Optionally sleep for the number of seconds passed as the first job argument.
incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)

end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))

If I trigger the job with the deployMode property set to cluster, I cannot see the corresponding logs. But if the job is triggered in the default mode, i.e. client mode, I can see them. The dictionary I use to trigger the job is given below.

"spark.submit.deployMode": "集群"

{
    'placement': {
        'cluster_name': dataproc_cluster
    },
    'pyspark_job': {
        'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
        'args': trigger_params,
        "properties": {
            "spark.submit.deployMode": "cluster",
            "spark.executor.memory": "3155m",
            "spark.scheduler.mode": "FAIR",
        }
    }
}
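
For reference, this dictionary is the job payload sent to the Dataproc jobs API. A minimal sketch of submitting it with the google-cloud-dataproc Python client could look like the following; the project, region, bucket, and cluster names are placeholders:

from google.cloud import dataproc_v1

# Placeholder values -- substitute your own project, region, and cluster.
project_id = "my-project"
region = "us-central1"

# The job controller client must target the regional Dataproc endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "{}-dataproc.googleapis.com:443".format(region)}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/check_confs.py",
        "args": ["1"],  # seconds to sleep, consumed by the script above
        "properties": {"spark.submit.deployMode": "cluster"},
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
print(operation.result())

The console output below is what the same job produces when it runs in the default (client) deploy mode: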
21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.


=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]


======================



Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}

When the job is not run in client mode (i.e. when it runs in cluster mode), there is no driver output in the console:

21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013

Best answer

When a job runs in cluster mode, the driver logs are in Cloud Logging, under yarn-userlogs. See the docs:

By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained, below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
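
In practice, that means the driver output of a cluster-mode job can be read from Cloud Logging by filtering on the cluster resource and the yarn-userlogs log name. A minimal sketch with the google-cloud-logging client (the project and cluster names are placeholders):

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")

# YARN container logs from the cluster; in cluster mode the driver's
# stdout/stderr ends up in yarn-userlogs rather than in the job driver output.
log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'resource.labels.cluster_name="my-cluster" '
    'log_name:"yarn-userlogs"'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.payload)

The same filter can also be pasted into the Logs Explorer in the Cloud Console.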

Regarding "pyspark - Where to find Spark logs in Dataproc when running a job in cluster mode", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70266214/
