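I am using the Cloudant Python API in a PySpark application on IBM Bluemix.
How can I supply dependency packages to spark-submit? The --py-files option of spark-submit.sh only accepts .py, .zip, or .egg files, but my packages are in tar.gz and .whl format.
This is the Cloudant Python client library I am trying to use: https://pypi.python.org/pypi/cloudant
The article "How to install dependencies for python" discusses the same topic, but I would like to see examples of the requirements.txt, Procfile, and manifest.yml files mentioned in its solution.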
Best answer
You should be able to call pip programmatically from within your Python script, for example:
import pip
pip.main(['install', '--user', 'cloudant'])
This worked for me:
helloSpark.py
import sys
from pyspark import SparkContext

import pip
pip.main(['install', '--user', 'cloudant'])

from cloudant.client import Cloudant
client = Cloudant('username', 'password', account='account', connect=True)

# do some spark processing
def computeStatsForCollection(sc, countPerPartitions=100000, partitions=5):
    totalNumber = min(countPerPartitions * partitions, sys.maxsize)
    rdd = sc.parallelize(range(totalNumber), partitions)
    return (rdd.mean(), rdd.variance())

if __name__ == "__main__":
    sc = SparkContext(appName="Hello Spark")
    print("Hello Spark Demo. Compute the mean and variance of a collection")
    stats = computeStatsForCollection(sc)
    print(">>> Results: ")
    print(">>>>>>>Mean: " + str(stats[0]))
    print(">>>>>>>Variance: " + str(stats[1]))
    sc.stop()
run.sh
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \
    --master https://169.54.219.20:8443 \
    --conf spark.service.spark_version=1.6 \
    helloSpark.py
Standard output after the run:
$ cat stdout_1498114277669877424
no extra config
load default config from : /usr/local/src/spark160master/spark/profile/batch/
Requirement already satisfied: cloudant in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages
Requirement already satisfied: requests<3.0.0,>=2.7.0 in /usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages (from cloudant)
Traceback (most recent call last):
  File "/tmp/spark-160-ego-master/work/spark-driver-380d8ae7-4ddc-452e-bb29-1665375a348c/helloSpark.py", line 8, in <module>
    client = Cloudant('username', 'password', account='account', connect=True)
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 443, in __init__
    self.connect()
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 114, in connect
    self.session_login(self._user, self._auth_token)
  File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 172, in session_login
    resp.raise_for_status()
  File "/usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages/requests/models.py", line 840, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://account.cloudant.com/_session
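Unfortunately, I did not save the output from the first run of the script, which reported that Cloudant had been installed. The output above still shows that the Cloudant library is available: the script tried to connect to the cluster with invalid credentials, which is why Cloudant returned a 401 error.
You probably don't want to attempt a pip install every time the script runs, so you could try this instead: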
try:
    import cloudant
except ImportError:
    # Only install when the library is missing
    import pip
    pip.main(['install', '--user', 'cloudant'])
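This tries to import the Cloudant library. If the import fails (for example, because the library has not been installed yet), it installs the library with pip.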
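One caveat: pip.main() was removed from pip's public API in pip 10, so the snippets above only work with older pip versions such as the one on this Python 2.7 service. A minimal sketch of the same "import, else install" idea for Python 3 and newer pip versions follows. It runs pip as a subprocess of the current interpreter. The helper name ensure_package and its pip_name parameter are hypothetical, not part of pip or Cloudant, and the user site-packages handling is an assumption about how the environment is set up.

```python
import importlib
import site
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Import module_name, installing it with pip first if it is missing.

    ensure_package and pip_name are hypothetical names used for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Run pip through the current interpreter instead of calling pip.main().
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user",
             pip_name or module_name]
        )
        # The user site-packages directory may not be on sys.path if it did
        # not exist when the interpreter started, so add it if needed.
        user_site = site.getusersitepackages()
        if user_site not in sys.path:
            sys.path.append(user_site)
        # Let the import system pick up the newly installed package.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demonstration with a standard-library module, so nothing gets installed.
json_module = ensure_package("json")
```

In the Spark script this would be called as cloudant = ensure_package("cloudant") before creating the client. Unlike the try/except above, it also imports the module after installing it, so the first run does not have to be repeated.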
Source: Stack Overflow, "IBM Bluemix Spark: Supplying python dependencies to spark-submit.sh": https://stackoverflow.com/questions/44688434/