python - 亚马逊电子病历 : Pyspark having strange dependency issues

标签 python amazon-web-services pyspark emr amazon-emr

我在让 pyspark 作业在 EMR 集群上运行时遇到问题,所以我登录到主节点并直接在那里运行 spark-submit

我有一个提交给 pyspark 的 python 文件,在这个文件中我有:

import subprocess
from pyspark import SparkContext, SparkConf
import boto3
from boto3.s3.transfer import S3Transfer
import os, re
import tarfile
import time
...

当我尝试在集群模式下运行它时,我得到: (来自 yarn 日志,为简洁起见进行了修剪)

16/01/31 21:45:57 INFO spark.CacheManager: Partition rdd_2_0 not found, computing it
16/01/31 21:45:57 INFO spark.CacheManager: Partition rdd_1_0 not found, computing it
16/01/31 21:45:57 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

稍后我收到有关无法导入 boto3 的错误。

如果我在客户端模式下运行,我只会收到关于 boto3.s3.transfer 的 ImportError。

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-39-79.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt1/yarn/usercache/hadoop/appcache/application_1454273602144_0005/container_1454273602144_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
ImportError: No module named boto3.s3.transfer

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

但是,如果我检查 pip freeze:

boto3==1.2.3
botocore==1.3.23

如果我在 master 上打开 Spark Shell 并执行以下操作:

import boto3
client = boto3.client("s3")

它工作正常。

这里有某种虚拟环境吗?我完全糊涂了。

编辑 忘了说我使用的是最新的 EMR 版本和 Spark 1.6.0。

此外,这在我自己的机器上以本地模式运行良好。

最佳答案

好吧,derp,我发现了问题。

原来我必须 pip install boto3,默认情况下 EMR 节点不会安装它。

这是一个错误非常具有描述性的案例。

关于python - 亚马逊电子病历 : Pyspark having strange dependency issues,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35120048/

相关文章:

python - 在 python 中分组,列具有 N/A 值

php - 如何使用 PHP 调用亚马逊 MWS 订单?

amazon-web-services - 设置 Tableau Server 以在 AWS GovCloud 上运行

apache-spark - Spark : Removing rows which occur less than N times

python-3.x - Pyspark UDF 属性错误 : 'NoneType' object has no attribute '_jvm'

python - _mysql_exceptions 错误(1064,默认为 "check the manual that corresponds to your MySQL server version for the right syntax to use near ')VALUES

python - Django 模型 DateTimeField 设置 auto_now_add 格式或修改序列化器

python - 确保角度小于 360 度的简单循环 [Python]

java - Java SDK 中的 AmazonDynamoDBClient 和 DynamoDB 类之间的区别?

python - 如何在 PySpark 中连接两个 LabeledPoints 的特征列