python - Reading/writing files from HDFS using Python with subprocess, pipes, and Popen gives an error

I am trying to read (open) and write files in HDFS from a Python script, but I get errors. Can anyone tell me what is wrong here?

Code (complete): sample.py

#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

print "After Loop 2"
for line in cat.stdout:
    line += "Blah"
    print line
    print "Inside Loop"
    put.stdin.write(line)

cat.stdout.close()
cat.wait()
put.stdin.close()
put.wait()

When I run:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./sample.py -mapper './sample.py' -input sample.txt -output fileRead

it runs fine, but I cannot find modifiedfile.txt, the file that should have been created in HDFS.

And when I run:

 hadoop fs -getmerge ./fileRead/ file.txt

In file.txt, I get:

Before Loop 
Before Loop 
After Loop 1    
After Loop 1    
After Loop 2    
After Loop 2

Can someone tell me what I am doing wrong? I don't think it is reading from sample.txt at all.

Best answer

Try changing your put subprocess so that it takes cat's stdout as its own stdin, by changing this

put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

into this

put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)

Full script:

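#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

# Read the source file from HDFS; its contents arrive on cat.stdout.
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
# Feed cat's output straight into "hadoop fs -put -", which reads from stdin.
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
# Close the parent's copy of the pipe so cat gets SIGPIPE if put exits early.
cat.stdout.close()
put.communicate()
cat.wait()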
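
Note that connecting cat.stdout directly to put's stdin copies the file unchanged, so the per-line "Blah" edit from the question no longer happens. If that edit is still wanted, the loop can stay in Python. The following is only a minimal sketch under some assumptions: the names pipe_with_transform and add_blah are made up for illustration, lines are handled as bytes, and the marker is appended before the newline (the question's line += "Blah" would put it after the newline, at the start of the next line).

```python
import subprocess
import sys


def pipe_with_transform(src_cmd, dst_cmd, transform):
    """Stream src_cmd's stdout into dst_cmd's stdin, rewriting each line.

    Returns the exit codes of both child processes as (src_rc, dst_rc).
    """
    src = subprocess.Popen(src_cmd, stdout=subprocess.PIPE)
    dst = subprocess.Popen(dst_cmd, stdin=subprocess.PIPE)
    try:
        for line in src.stdout:           # each line is bytes, newline included
            dst.stdin.write(transform(line))
    finally:
        src.stdout.close()
        dst.stdin.close()                 # EOF tells the writer to finish
    return src.wait(), dst.wait()


def add_blah(line):
    # Hypothetical transform: put the marker before the trailing newline.
    return line.rstrip(b"\n") + b"Blah\n"


def report(codes):
    # Report on stderr so that stdout stays free for real output.
    sys.stderr.write("exit codes: %r\n" % (codes,))
```

With the paths from the question, this would be called as pipe_with_transform(["hadoop", "fs", "-cat", "./sample.txt"], ["hadoop", "fs", "-put", "-", "./modifiedfile.txt"], add_blah), followed by report(...) on the result. Checking both exit codes matters here, because a failing hadoop command otherwise goes unnoticed.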

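One more point that may explain the file.txt contents: when a script runs as a Hadoop Streaming mapper, the input split is delivered on the script's stdin and whatever the script prints to stdout becomes the job output, which is why the "Before Loop" and "After Loop" lines end up in fileRead. If the goal is simply to modify each line of the input, a mapper can work on stdin and stdout without calling hadoop fs at all. This is only a sketch under that assumption; the function names are hypothetical, and the result lands in the -output directory rather than in modifiedfile.txt.

```python
import sys


def map_lines(lines, suffix="Blah"):
    """Yield each input line with `suffix` appended before its newline."""
    for line in lines:
        yield line.rstrip("\n") + suffix + "\n"


def run_mapper(stdin, stdout, stderr):
    # Diagnostics go to stderr so they do not end up in the job output.
    stderr.write("mapper started\n")
    for out_line in map_lines(stdin):
        stdout.write(out_line)
    stderr.write("mapper finished\n")
```

In the actual mapper script, the last line would be run_mapper(sys.stdin, sys.stdout, sys.stderr). Messages written to stderr typically show up in the task logs instead of the output files.
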
For more on this topic, see the similar question on Stack Overflow: https://stackoverflow.com/questions/28139406/
