python - IBM Bluemix Spark: Supplying Python dependencies to spark-submit.sh

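I am using the Cloudant Python API in an IBM Bluemix PySpark application.

How can I supply dependency packages to spark-submit? The py-files option of spark-submit.sh only accepts py, zip, or egg files, and my packages are in tar.gz and whl formats.

Here is a link to the Cloudant Python client library I am trying to use - https://pypi.python.org/pypi/cloudant

The article "How to install dependencies for python" discusses the same topic, but I would like to see examples of the requirements.txt, Procfile, and manifest.yml files mentioned in its solution.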

Best answer

You should be able to use pip programmatically from within your Python script, for example:

import pip
pip.main(['install', '--user', 'cloudant'])

This worked for me:

helloSpark.py

import sys
from pyspark import SparkContext

import pip
pip.main(['install', '--user', 'cloudant'])

from cloudant.client import Cloudant
client = Cloudant('username', 'password', account='account', connect=True)

# do some spark processing
def computeStatsForCollection(sc, countPerPartitions=100000, partitions=5):
    # Cap the collection size, then spread it over the requested partitions.
    totalNumber = min(countPerPartitions * partitions, sys.maxsize)
    rdd = sc.parallelize(range(totalNumber), partitions)
    return (rdd.mean(), rdd.variance())

if __name__ == "__main__":
    sc = SparkContext(appName="Hello Spark")
    print("Hello Spark Demo. Compute the mean and variance of a collection")
    stats = computeStatsForCollection(sc)
    print(">>> Results: ")
    print(">>>>>>>Mean: " + str(stats[0]))
    print(">>>>>>>Variance: " + str(stats[1]))
    sc.stop()

run.sh

./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
     --master https://169.54.219.20:8443 \
     --conf spark.service.spark_version=1.6 \
     helloSpark.py

Standard output after the run:

$ cat stdout_1498114277669877424 
no extra config
load default config from : /usr/local/src/spark160master/spark/profile/batch/
Requirement already satisfied: cloudant in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages
Requirement already satisfied: requests<3.0.0,>=2.7.0 in /usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages (from cloudant)
Traceback (most recent call last):
  File "/tmp/spark-160-ego-master/work/spark-driver-380d8ae7-4ddc-452e-bb29-1665375a348c/helloSpark.py", line 8, in <module>
    client = Cloudant('username', 'password', account='account', connect=True)
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 443, in __init__
    self.connect()
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 114, in connect
    self.session_login(self._user, self._auth_token)
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 172, in session_login
    resp.raise_for_status()
  File "/usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages/requests/models.py", line 840, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://account.cloudant.com/_session

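Unfortunately, I did not save the output from the first time I ran the script, which reported that Cloudant was being installed. But here you can see that the Cloudant library is available and that it attempts to connect to the cluster with invalid credentials, which is why Cloudant returns a 401 error.

You probably don't want to attempt a pip install every time the script runs, so you could try this: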
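
To confirm that the library was installed without needing valid credentials, one option is to ask Python where it would load the module from. A minimal sketch, assuming Python 3 (`locate_module` is a hypothetical helper name, not part of the original answer):

```python
import importlib.util


def locate_module(module_name):
    """Return the file a top-level module would be loaded from.

    Returns None if the module cannot be found on sys.path.
    locate_module is a hypothetical helper used only for illustration.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # origin is the path of the module file (or the package's __init__.py).
    return spec.origin
```

Calling `locate_module("cloudant")` on the driver should return a path under the user's site-packages directory if the `--user` install succeeded, or `None` if the module is not importable.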

try:
    import cloudant
except ImportError:
    # Only install when the import fails because the package is missing.
    import pip
    pip.main(['install', '--user', 'cloudant'])

This tries to load the Cloudant library. If loading it fails (for example, because it has not been installed yet), it installs it with pip.
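
On newer environments the `pip.main` call may not be available: pip 10 and later no longer expose `main` as a public entry point, and pip's documentation recommends running pip as a subprocess instead. A minimal sketch of the same install-if-missing idea under that assumption (Python 3; `ensure_package` and its `pip_name` parameter are hypothetical names introduced for illustration, not part of the original answer):

```python
import importlib
import site
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it into the user site first if missing.

    ensure_package and pip_name are hypothetical names for this sketch.
    pip_name is the distribution name on PyPI when it differs from the
    importable module name.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Run pip with the same interpreter that is executing this script.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user",
             pip_name or module_name]
        )
        # The user site directory may not have existed at interpreter
        # startup, in which case it is not on sys.path yet.
        user_site = site.getusersitepackages()
        if user_site not in sys.path:
            sys.path.append(user_site)
        # Make the import system notice the newly installed files.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

In the script above, `cloudant = ensure_package("cloudant")` would take the place of the try/except block. If pip itself fails, `subprocess.check_call` raises `CalledProcessError`, so the job stops at the install step rather than at a later import.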

Regarding python - IBM Bluemix Spark: Supplying Python dependencies to spark-submit.sh, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44688434/
