python - Loading my own Python modules for Pig UDFs on Amazon EMR

Tags: python hadoop apache-pig emr

I'm trying to call two of my own modules from Pig.

This is module_one.py:

import sys 
print sys.path

def foo():
    pass

This is module_two.py:
from module_one import foo

def bar():
    foo()

I uploaded both of them to S3.

This is what I get when I try to register them from Pig:

2015-06-14 12:12:10,578 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0-amzn-2 (rexported) compiled May 05 2015, 19:03:23
2015-06-14 12:12:10,579 [main] INFO org.apache.pig.Main - Logging error messages to: /mnt/var/log/apps/pig.log
2015-06-14 12:12:10,620 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2015-06-14 12:12:11,277 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-06-14 12:12:11,279 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:11,279 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://1.1.1.1:9000
2015-06-14 12:12:12,794 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> REGISTER 's3://mybucket/pig/module_one.py' USING jython AS m1;
2015-06-14 12:12:15,177 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:17,457 [main] INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem - Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2015-06-14 12:12:17,889 [main] INFO amazon.emr.metrics.MetricsSaver - MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500
2015-06-14 12:12:17,889 [main] INFO amazon.emr.metrics.MetricsSaver - Created MetricsSaver j-5G45FR7N987G:i-a95a5379:RunJar:03073 period:60 /mnt/var/em/raw/i-a95a5379_20150614_RunJar_03073_raw.bin
2015-06-14 12:12:18,633 [main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - Opening 's3://mybucket/pig/module_one.py' for reading
2015-06-14 12:12:18,661 [main] INFO amazon.emr.metrics.MetricsSaver - Thread 1 created MetricsLockFreeSaver 1
2015-06-14 12:12:18,743 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_4599752347759040376
2015-06-14 12:12:21,060 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
['/home/hadoop/.versions/pig-0.12.0-amzn-2/lib/Lib', '/home/hadoop/.versions/pig-0.12.0-amzn-2/lib/jython-standalone-2.5.3.jar/Lib', 'classpath', 'pyclasspath/', '/home/hadoop']
2015-06-14 12:12:21,142 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: m1.foo

grunt> REGISTER 's3://mybucket/pig/module_two.py' USING jython AS m2;
2015-06-14 12:12:33,870 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:33,918 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:34,020 [main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - Opening 's3://mybucket/pig/module_two.py' for reading
2015-06-14 12:12:34,064 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2015-06-14 12:12:34,621 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
  File "/tmp/pig1436120267849453375tmp/module_two.py", line 1, in <module>
    from module_one import foo
ImportError: No module named module_one
Details at logfile: /mnt/var/log/apps/pig.log


I've tried:
  • The usual sys.path.append('./Lib') and sys.path.append('.'), which didn't help
  • Hacking the folder location with sys.path.append(os.path.dirname(__file__)), but that gives NameError: name '__file__' is not defined
  • Creating an __init__.py and loading it with REGISTER
  • sys.path.append('s3://mybucket/pig/'), which didn't work either.

  • I'm using Apache Pig version 0.12.0-amzn-2, since that's apparently the only version I can choose at the moment.

    Best Answer

    You registered the first Python UDF as m1, so you should access its namespace through m1.foo(), not through module_one.
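
    The same alias is also how the UDF is referenced from Pig Latin. A minimal sketch, assuming the REGISTER ... AS m1 statement from the grunt session above (the input path and relation names are made up, and since foo() is just the OP's placeholder that returns nothing, the generated values would simply be null):

    A = LOAD 's3://mybucket/input.txt' AS (line:chararray); -- hypothetical input
    B = FOREACH A GENERATE m1.foo();                        -- call through the registered alias, not the file name
    DUMP B;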

    EDIT: the second Python file should be:

    from m1 import foo
    
    def bar():
        foo()
    

    I just tested it on Amazon EMR and it works.
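
    For completeness, a minimal end-to-end sketch of how the corrected pair of scripts might be used (the input path and relation names are hypothetical; bar() is still the OP's placeholder and returns nothing, so this only illustrates the registration order and the alias-based call):

    REGISTER 's3://mybucket/pig/module_one.py' USING jython AS m1; -- registered first, as in the grunt session above
    REGISTER 's3://mybucket/pig/module_two.py' USING jython AS m2; -- module_two.py now does 'from m1 import foo'
    A = LOAD 's3://mybucket/input.txt' AS (line:chararray);
    B = FOREACH A GENERATE m2.bar();
    DUMP B;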

    Regarding python - Loading my own Python modules for Pig UDFs on Amazon EMR, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30829471/
