amazon-web-services - How to get a custom log4j.properties to take effect for the Spark driver and executors on an AWS EMR cluster?

Tags: amazon-web-services apache-spark log4j amazon-emr

I have an AWS CLI cluster-creation command that I am trying to modify so that my driver and executors use a custom log4j.properties file. On a standalone Spark cluster I have successfully used the approach of passing the file via --files, combined with -Dlog4j.configuration= settings specified through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
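For reference, here is a minimal sketch of that standalone pattern (class and jar names are taken from the command further down; the function only echoes the assembled command rather than running spark-submit):

```shell
# Sketch of the --files + extraJavaOptions pattern that worked on standalone.
# build_submit_cmd only echoes the assembled command; paths are examples.
build_submit_cmd() {
  jvm_opts="-Dlog4j.configuration=log4j.properties"
  echo spark-submit \
    --files log4j.properties \
    --conf "spark.driver.extraJavaOptions=$jvm_opts" \
    --conf "spark.executor.extraJavaOptions=$jvm_opts" \
    --class com.acme.SparkFoo spark.jar
}
build_submit_cmd
```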

I have tried many different permutations and variations, but have not yet gotten this to work for the Spark jobs I run on my AWS EMR cluster.

I use the AWS CLI create-cluster command, plus an intermediate step that downloads my Spark jar and unzips it so it can pull out the log4j.properties packaged inside that .jar. I then copy log4j.properties to my HDFS /tmp folder and try to distribute that log4j.properties file via --files.

Note that I have also tried this without HDFS (specifying --files log4j.properties instead of --files hdfs:///tmp/log4j.properties), and that did not work either.

The latest non-working version of the command (using HDFS) is given below. I am wondering if anyone can share a recipe that actually works. When I run this version, the command output for the driver is:

log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader@1e67b872.
log4j: Using URL [file:/etc/spark/conf.dist/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/etc/spark/conf.dist/log4j.properties
log4j: Parsing for [root] with value=[WARN,stdout].

From the above I can see that my log4j.properties file is not being picked up (the default is). Besides -Dlog4j.configuration=log4j.properties, I also tried configuring -Dlog4j.configuration=classpath:log4j.properties (which failed as well).
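One variation worth noting, drawn from the standard Spark-on-YARN pattern rather than anything confirmed in this question: files shipped with --files are copied into each YARN container's working directory, so the JVM flag is usually given an explicit file: scheme pointing at that local copy. A sketch (the function echoes the command; class and jar names are examples):

```shell
# Sketch: same submit, but with an explicit file: scheme so log4j reads the
# copy that --files places in the container's working directory (YARN).
build_submit_cmd_file_scheme() {
  jvm_opts="-Dlog4j.debug -Dlog4j.configuration=file:log4j.properties"
  echo spark-submit \
    --files log4j.properties \
    --conf "spark.driver.extraJavaOptions=$jvm_opts" \
    --conf "spark.executor.extraJavaOptions=$jvm_opts" \
    --class com.acme.SparkFoo spark.jar
}
build_submit_cmd_file_scheme
```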

Any guidance would be much appreciated!

AWS command

jarPath=s3://com-acme/deployments/spark.jar
class=com.acme.SparkFoo


log4jConfigExtractCmd="aws s3 cp $jarPath /tmp/spark.jar ; cd /home/hadoop ; unzip /tmp/spark.jar log4j.properties ;  hdfs dfs -put log4j.properties /tmp/log4j.properties  "


aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark \
--tags 'Project=mouse' \
      'Owner=SwarmAnalytics'\
       'DatadogMonitoring=True'\
       'StreamMonitorRedshift=False'\
       'DeployRedshiftLoader=False'\
       'Environment=dev'\
       'DeploySpark=False'\
       'StreamMonitorS3=False'\
       'Name=CCPASixCore' \
--ec2-attributes '{"KeyName":"mouse-spark-2021","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-07039960","EmrManagedSlaveSecurityGroup":"sg-09c806ca38fd32353","EmrManagedMasterSecurityGroup":"sg-092288bbc8812371a"}' \
--release-label emr-5.27.0 \
--log-uri 's3n://log-foo' \
--steps '[{"Args":["bash","-c", "$log4jConfigExtractCmd"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"downloadSparkJar"},{"Args":["spark-submit","--files", "hdfs:///tmp/log4j.properties","--deploy-mode","client","--class","$class","--driver-memory","24G","--conf","spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256    -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.yarn.executor.memoryOverhead=10g","--conf","spark.yarn.driver.memoryOverhead=10g","$jarPath"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"SparkFoo"}]'\
 --instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"r5d.4xlarge","Name":"Core - 6"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"}]' \
--configurations '[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR","log4j.logger.org.apache.hadoop":"ERROR","log4j.appender.stdout":"org.apache.log4j.ConsoleAppender","log4j.logger.io.netty":"ERROR","log4j.logger.org.apache.spark.scheduler.cluster":"ERROR","log4j.rootLogger":"WARN,stdout","log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n","log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO"}},{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'\
 --auto-terminate --ebs-root-volume-size 10 --service-role EMR_DefaultRole \
--security-configuration 'CCPA_dev_security_configuration_2' --enable-debugging --name 'SparkFoo' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1 --profile sandbox

Best answer

Here is how to change the logging. The best approach on AWS/EMR (that I have found) is not to fiddle with

spark.driver.extraJavaOptions  or
spark.executor.extraJavaOptions

Instead, take advantage of a configuration block like the one below:

[{"Classification":"spark-log4j","Properties":{
    "log4j.rootLogger":"WARN,stdout",
    "log4j.appender.stdout":"org.apache.log4j.ConsoleAppender",
    "log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout",
    "log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n",
    "log4j.logger.com.foo":"INFO",
    "log4j.logger.org.apache.spark":"ERROR",
    "log4j.logger.org.apache.spark.cluster":"ERROR",
    "log4j.logger.org.apache.spark.scheduler.cluster":"ERROR",
    "log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO",
    "log4j.logger.org.apache.hadoop":"ERROR",
    "log4j.logger.org.apache.zookeeper":"ERROR",
    "log4j.logger.io.netty":"ERROR"}}]

Now suppose you want to change all logging done by classes under com.foo and its descendants to TRACE. You would change the block above to look like this:

[{"Classification":"spark-log4j","Properties":{
    "log4j.rootLogger":"WARN,stdout",
    "log4j.appender.stdout":"org.apache.log4j.ConsoleAppender",
    "log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout",
    "log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n",
    "log4j.logger.com.foo":"TRACE",
    "log4j.logger.org.apache.spark":"ERROR",
    "log4j.logger.org.apache.spark.cluster":"ERROR",
    "log4j.logger.org.apache.spark.scheduler.cluster":"ERROR",
    "log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO",
    "log4j.logger.org.apache.hadoop":"ERROR",
    "log4j.logger.org.apache.zookeeper":"ERROR",
    "log4j.logger.io.netty":"ERROR"}}]
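A practical way to feed that block to the CLI is to write it to a file and pass it with --configurations file://…, which aws emr create-cluster accepts. A sketch (the file name log4j-config.json is an example, and the properties are abridged to the two relevant keys):

```shell
# Write the spark-log4j classification to a file and reference it at
# cluster-creation time. File name is an example; properties abridged.
cat > log4j-config.json <<'EOF'
[{"Classification": "spark-log4j",
  "Properties": {
    "log4j.rootLogger": "WARN,stdout",
    "log4j.logger.com.foo": "TRACE"
  }}]
EOF
# On the real cluster-creation command, reference the file instead of
# inlining the JSON:
#   aws emr create-cluster ... --configurations file://log4j-config.json ...
```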

This question ("How to get a custom log4j.properties to take effect for the Spark driver and executors on an AWS EMR cluster?") originally appeared on Stack Overflow: https://stackoverflow.com/questions/67053135/
