apache-spark - Spark submit of a pyspark script on yarn throws maximum recursion depth exceeded

Tags: apache-spark hadoop pyspark cloudera

I can submit the org.apache.spark.examples.SparkPi example jar with spark-submit in yarn-cluster mode and it succeeds, but the following pyspark snippet fails with a maximum recursion depth exceeded error.

spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 4 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="/usr/bin/python2.7" test.py --verbose
I added the PYSPARK_PYTHON env as suggested in Pyspark on yarn-cluster mode.
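If the executors also need the non-default interpreter, the executor-side counterpart of that setting can be added as well (a sketch of the same command; spark.executorEnv.PYSPARK_PYTHON is the executor-side equivalent of spark.yarn.appMasterEnv.PYSPARK_PYTHON):

spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 4 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON="/usr/bin/python2.7" --conf spark.executorEnv.PYSPARK_PYTHON="/usr/bin/python2.7" test.py --verbose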
test.py
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create the SparkContext and a HiveContext on top of it
sc_new = SparkContext()
SQLContext = HiveContext(sc_new)
# Keep Hive's ORC serde instead of Spark's native ORC reader for metastore tables
SQLContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
# Run a trivial query and print the result
txt = SQLContext.sql("SELECT 1")
txt.show(2000000, False)
How do I resolve this issue?
File "/hdfs/data_06/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123855/container_e59_1583989737267_1123855_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
                   raise Py4JError("Answer from Java side is empty")
               Py4JError: Answer from Java side is empty
               ERROR:py4j.java_gateway:Error while sending or receiving.
               Traceback (most recent call last):File "/hdfs/data_10/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123601/container_e59_1583989737267_1123601_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 626, in send_command
File "/hdfs/data_10/yarn/nm/usercache/<alias>/appcache/application_1583989737267_1123601/container_e59_1583989737267_1123601_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 749, in send_command
File "/usr/lib64/python2.7/logging/__init__.py", line 1182, in exception
  self.error(msg, *args, **kwargs)
File "/usr/lib64/python2.7/logging/__init__.py", line 1175, in error
  self._log(ERROR, msg, args, **kwargs)
File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log
  self.handle(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle
  self.callHandlers(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers
  hdlr.handle(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 749, in handle
  self.emit(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 879, in emit
  self.handleError(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 802, in handleError
  None, sys.stderr)
File "/usr/lib64/python2.7/traceback.py", line 125, in print_exception
  print_tb(tb, limit, file)
File "/usr/lib64/python2.7/traceback.py", line 69, in print_tb
  line = linecache.getline(filename, lineno, f.f_globals)
File "/usr/lib64/python2.7/linecache.py", line 14, in getline
  lines = getlines(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 40, in getlines
  return updatecache(filename, module_globals)
File "/usr/lib64/python2.7/linecache.py", line 128, in updatecache
  lines = fp.readlines()
RuntimeError: maximum recursion depth exceeded while calling a Python object
  • Spark version 1.6.0
  • Hive version 1.1.0
  • Hadoop version 2.6.0-cdh5.13.0
Best answer

    By calling txt.show(2000000, False) you are making py4j do to-and-fro JVM-Python-object-JVM calls for a result that does not have anywhere near that many rows.
    I believe the practical maximum you can pass to show() is around 2000.
    Why do you need to display 2,000,000 records when all you run is SELECT 1?
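
    A minimal fix along those lines (a sketch against the same Spark 1.6 pyspark API used in test.py; only the arguments passed to show() change, or the one-row result can be pulled back with collect()):

    txt = SQLContext.sql("SELECT 1")
    txt.show(10, False)      # print at most a few rows instead of 2,000,000
    rows = txt.collect()     # or bring the single-row result back to the driver
    print(rows)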

    Regarding apache-spark - Spark submit of a pyspark script on yarn throws maximum recursion depth exceeded, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63748099/
