apache-spark - How to pass a javaagent to an EMR Spark application?

Tags: apache-spark apache-spark-sql hadoop-yarn amazon-emr profiler

I am trying to profile my Spark application (Spark 2.4, running on EMR 5.21) with the Uber JVM Profiler.

Here is my cluster configuration:

    [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.executor.memory": "38300M",
          "spark.driver.memory": "38300M",
          "spark.yarn.scheduler.reporterThread.maxFailures": "5",
          "spark.driver.cores": "5",
          "spark.yarn.driver.memoryOverhead": "4255M",
          "spark.executor.heartbeatInterval": "60s",
          "spark.rdd.compress": "true",
          "spark.network.timeout": "800s",
          "spark.executor.cores": "5",
          "spark.memory.storageFraction": "0.27",
          "spark.speculation": "true",
          "spark.sql.shuffle.partitions": "200",
          "spark.shuffle.spill.compress": "true",
          "spark.shuffle.compress": "true",
          "spark.storage.level": "MEMORY_AND_DISK_SER",
          "spark.default.parallelism": "200",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.memory.fraction": "0.80",
          "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:OnOutOfMemoryError='kill -9 %p'",
          "spark.executor.instances": "107",
          "spark.yarn.executor.memoryOverhead": "4255M",
          "spark.dynamicAllocation.enabled": "false",
          "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:OnOutOfMemoryError='kill -9 %p'"
        },
        "configurations": []
      },
      {
        "classification": "yarn-site",
        "properties": {
          "yarn.log-aggregation-enable": "true",
          "yarn.nodemanager.pmem-check-enabled": "false",
          "yarn.nodemanager.vmem-check-enabled": "false"
        },
        "configurations": []
      },
      {
        "classification": "spark",
        "properties": {
          "maximizeResourceAllocation": "true",
          "spark.sql.broadcastTimeout": "-1"
        },
        "configurations": []
      },
      {
        "classification": "emrfs-site",
        "properties": {
          "fs.s3.threadpool.size": "50",
          "fs.s3.maxConnections": "5000"
        },
        "configurations": []
      },
      {
        "classification": "core-site",
        "properties": {
          "fs.s3.threadpool.size": "50",
          "fs.s3.maxConnections": "5000"
        },
        "configurations": []
      }
    ]

The profiler jar is stored in S3 (mybucket/profilers/jvm-profiler-1.0.0.jar). When bootstrapping my core and master nodes, I run the following bootstrap script:

     sudo mkdir -p /tmp
     aws s3 cp s3://mybucket/profilers/jvm-profiler-1.0.0.jar /tmp/
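
For reference, a script like this is registered as a bootstrap action when the cluster is created. A minimal AWS CLI sketch, where s3://mybucket/bootstrap/copy-profiler.sh is an assumed location holding the two lines above and the remaining create-cluster options are elided:

     # Sketch: register the copy script as a bootstrap action at cluster creation.
     # s3://mybucket/bootstrap/copy-profiler.sh is an assumed path; all other
     # create-cluster options (instance groups, applications, ...) are elided.
     aws emr create-cluster \
       --release-label emr-5.21.0 \
       --bootstrap-actions Name="CopyProfilerJar",Path="s3://mybucket/bootstrap/copy-profiler.sh" \
       ...(other options)...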

I submit my EMR step as follows:

       spark-submit --deploy-mode cluster \
       --master=yarn \
       ......(other parameters)......... \
       --conf spark.jars=/tmp/jvm-profiler-1.0.0.jar \
       --conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000 \
       --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000

But I cannot see any profiling-related output in the logs (I checked the stdout and stderr logs of all containers). Are the parameters being ignored? Am I missing something? What else can I check to figure out why these parameters are being ignored?
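
One way to verify whether the flag ever reached the JVMs is to grep the aggregated YARN logs for the reporter's output, since ConsoleOutputReporter prints metric lines to the container console every metricInterval milliseconds. A minimal sketch (the application ID is a placeholder):

     # Pull the aggregated container logs on the master node
     # (yarn.log-aggregation-enable is set to "true" above).
     yarn logs -applicationId application_1234567890123_0001 > app.log

     # If the agent loaded, ConsoleOutputReporter metric lines appear in stdout.
     grep -n "ConsoleOutputReporter" app.log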

Best Answer

I have not used the Uber JVM Profiler, but I believe that to ship extra jars with spark-submit you should use the --jars option. On EMR you can add them directly from an S3 bucket.

Also, your bootstrap script copies jvm-profiler-1.0.0.jar into the /tmp folder, but when you set the Java options you do not include that path. Try this:

 spark-submit --deploy-mode cluster \
 --master=yarn \
 --conf "spark.driver.extraJavaOptions=-javaagent:/tmp/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000" \
 --conf "spark.executor.extraJavaOptions=-javaagent:/tmp/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000" \
 --jars "/tmp/jvm-profiler-1.0.0.jar" \
 --<other params> 
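
An untested variant of the same idea: YARN localizes --jars files into each container's working directory, so the jar can be pulled straight from the S3 bucket and referenced by a relative path in -javaagent, which skips the bootstrap copy entirely:

 spark-submit --deploy-mode cluster \
 --master=yarn \
 --jars "s3://mybucket/profilers/jvm-profiler-1.0.0.jar" \
 --conf "spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000" \
 --conf "spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000" \
 --<other params>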

Source: https://stackoverflow.com/questions/59233394/ (Stack Overflow: "How to pass a javaagent to an EMR Spark application?")
