I am trying to read (open) and write files in HDFS from inside a Python script, but I am getting errors. Can anyone tell me what is going wrong here?
Code (complete): sample.py
#!/usr/bin/python
from subprocess import Popen, PIPE
print "Before Loop"
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)
print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)
print "After Loop 2"
for line in cat.stdout:
    line += "Blah"
    print line
    print "Inside Loop"
    put.stdin.write(line)
cat.stdout.close()
cat.wait()
put.stdin.close()
put.wait()
When I run:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./sample.py -mapper './sample.py' -input sample.txt -output fileRead
it runs fine, but I cannot find modifiedfile.txt, which should have been created in HDFS.
When I run:
hadoop fs -getmerge ./fileRead/ file.txt
in file.txt I get:
Before Loop
Before Loop
After Loop 1
After Loop 1
After Loop 2
After Loop 2
Can someone tell me what I am doing wrong? I don't think it is reading from sample.txt at all.
Best Answer
Try changing your put subprocess so that it reads cat's stdout directly, by changing this:
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)
into this:
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
Full script:
#!/usr/bin/python
from subprocess import Popen, PIPE
print "Before Loop"
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)
print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
put.communicate()
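One refinement worth noting: when chaining two Popen objects like this, the Python subprocess documentation recommends that the parent close its own copy of cat.stdout after starting put, so that cat receives SIGPIPE if put exits early. A minimal sketch of that pattern, assuming the same paths as above:

#!/usr/bin/python
from subprocess import Popen, PIPE
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
# Close the parent's handle on cat's stdout; put still holds its own
# copy, and cat will get SIGPIPE if put exits before cat finishes.
cat.stdout.close()
put.communicate()

With stdin=cat.stdout the two hadoop commands are connected at the OS level, so the data never passes through the Python process; that is also why the per-line "Blah" modification from the original script is absent in this version.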
For "python - Reading/writing a file from HDFS using Python with subprocess, Pipe, Popen gives error", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28139406/