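python - Accessing an external file in a Python UDF

Tags: python hadoop hive user-defined-functions

I am using Hive with a Python UDF. I defined a .sql file in which I add the Python UDF and call it. So far so good: I can process my query results with my Python function. Now, however, I need to use an external .txt file inside the Python UDF. I uploaded the file to my cluster (in the same directory as the .sql and .py files), and I also added it in my .sql file with the following command: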

ADD FILE /home/ra/stopWords.txt;

When I open this file in my Python UDF like this:

file = open("/home/ra/stopWords.txt", "r")

I get several errors. I don't know how to add nested files and use them in Hive.

Any ideas?

Best answer

All added files are located in the current working directory (./) of the UDF script.
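To check which files actually reached the script's working directory, a small diagnostic like the one below could be called at the top of the UDF. This is only a sketch: the helper names `describe_working_dir` and `log_working_dir` are made up for illustration, and it assumes that whatever the script writes to stderr ends up in the task logs rather than in the query output.

```python
import os
import sys


def describe_working_dir(directory="."):
    """Return a sorted list of file paths under `directory`, relative to it."""
    found = []
    # followlinks=True: added resources may be symlinked into the working
    # directory (an assumption about the cluster setup).
    for root, _dirs, files in os.walk(directory, followlinks=True):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), directory))
    return sorted(found)


def log_working_dir(stream=sys.stderr):
    """Write the working directory and its files to `stream`."""
    # stderr is used so the listing does not mix with rows written to stdout.
    stream.write("cwd: %s\n" % os.getcwd())
    for path in describe_working_dir():
        stream.write("  %s\n" % path)
```

If stopWords.txt shows up in the listing as a bare file name, the absolute path /home/ra/stopWords.txt used in the question is the likely cause of the errors.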

If you add a single file with ADD FILE /dir1/dir2/dir3/myfile.txt, its path will be

./myfile.txt

If you add a directory with ADD FILE /dir1/dir2, the path of the file will be

./dir2/dir3/myfile.txt
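Applied to the question, the file added with ADD FILE /home/ra/stopWords.txt would be opened by its bare name. The following is a minimal sketch, not the asker's actual script: it assumes the UDF is run through Hive's TRANSFORM ... USING mechanism (rows arrive on stdin as tab-separated lines), that stopWords.txt holds one word per line, and that the last column contains the text to filter. The function names are hypothetical.

```python
import sys

# Relative to the script's working directory, not /home/ra/stopWords.txt.
STOP_WORDS_PATH = "stopWords.txt"


def load_stop_words(path=STOP_WORDS_PATH):
    """Read one stop word per line (assumed layout) into a lowercase set."""
    with open(path, "r") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)


def main(stdin=sys.stdin, stdout=sys.stdout, path=STOP_WORDS_PATH):
    """Filter the last column of every tab-separated input row."""
    stop_words = load_stop_words(path)
    for row in stdin:
        columns = row.rstrip("\n").split("\t")
        # Assumption: the text to clean is in the last column.
        columns[-1] = remove_stop_words(columns[-1], stop_words)
        stdout.write("\t".join(columns) + "\n")
```

In the real script, `main()` would be called at the bottom under the usual `if __name__ == "__main__":` guard. For a file shipped inside an added directory, `STOP_WORDS_PATH` would instead be the relative path that starts at that directory's name, as in ./dir2/dir3/myfile.txt above.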

A similar question about accessing an external file in a Python UDF can be found on Stack Overflow: https://stackoverflow.com/questions/45112390/
