I am running the following code as a job in Dataproc. When it runs in "cluster" mode, I cannot find its logs in the console.
import sys
import time
from datetime import datetime
from pyspark.sql import SparkSession
start_time = datetime.utcnow()
spark = SparkSession.builder.appName("check_confs").getOrCreate()
all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
incoming_args = sys.argv
if len(incoming_args) > 1:
    sleep_time = int(incoming_args[1])
    print("Sleep time is {} seconds".format(sleep_time))
    if sleep_time > 0:
        time.sleep(sleep_time)
end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))
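If I trigger the job with the deployMode property set to cluster, I cannot see the corresponding logs.
But if the job is triggered in the default mode, i.e. client mode, I can see the corresponding logs.
Below is the dictionary I used to trigger the job.
"spark.submit.deployMode": "cluster"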
{
    'placement': {
        'cluster_name': dataproc_cluster
    },
    'pyspark_job': {
        'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
        'args': trigger_params,
        "properties": {
            "spark.submit.deployMode": "cluster",
            "spark.executor.memory": "3155m",
            "spark.scheduler.mode": "FAIR",
        }
    }
}
21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]
======================
Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
Console output when running in cluster mode (the driver output does not appear):
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013
Best answer
When a job runs in cluster mode, the driver logs are in Cloud Logging under yarn-userlogs. See the doc:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained, below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
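To illustrate the point above, here is a minimal sketch of reading those entries from Python rather than from the Logging console. It assumes the google-cloud-logging client library is installed and credentials are configured, that the log ID is yarn-userlogs as the answer states, and that the entries are attached to the cloud_dataproc_cluster resource type. The helper names build_driver_log_filter and print_driver_logs and the project and cluster names are hypothetical.

```python
from itertools import islice


def build_driver_log_filter(project_id, cluster_name, log_id="yarn-userlogs"):
    """Build a Cloud Logging filter for one Dataproc cluster's YARN user logs.

    Assumption: entries use the cloud_dataproc_cluster resource type and
    the log ID given in the answer above.
    """
    return (
        'resource.type="cloud_dataproc_cluster" '
        'AND resource.labels.cluster_name="{}" '
        'AND logName="projects/{}/logs/{}"'
    ).format(cluster_name, project_id, log_id)


def print_driver_logs(project_id, cluster_name, limit=50):
    """Print up to `limit` matching log entries.

    Requires google-cloud-logging and application default credentials.
    """
    # Imported lazily so the filter helper works without the library installed.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    entries = client.list_entries(
        filter_=build_driver_log_filter(project_id, cluster_name)
    )
    # islice stops the (paged) iterator after `limit` entries.
    for entry in islice(entries, limit):
        print(entry.timestamp, entry.payload)
```

Calling print_driver_logs("my-project", "my-cluster") with real names should then list the YARN container output, which in cluster mode includes what the driver printed. The same filter string can also be pasted into the Logs Explorer query box.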
Source: "pyspark - where to find Spark logs in Dataproc when running a job in cluster mode", Stack Overflow: https://stackoverflow.com/questions/70266214/