python - 为什么在使用 Python 脚本读取或写入 Hadoop 文件系统时会出现这些奇怪的连接错误?

标签 python hadoop read-write

我写了一个 python 代码,用 IP hdfs_ip 读写 hadoop 文件系统。它需要 3 个参数 -

  1. 本地文件路径
  2. 远程文件路径
  3. 读或写

当我想运行它时,我首先需要导出名称节点 IP,然后使用 3 个参数运行 python 代码。但是,我遇到了一些奇怪的错误。

$ export namenode='http://hdfs_ip1:50470,http://hdfs_ip2:50470'
$ python3 python_hdfs.py ./1.png /user/testuser/new_1.png read
['http://hdfs_ip1:50470', 'http://hdfs_ip2:50470']
./1.png /user/testuser/new_1.png
http://hdfs_ip1:50470
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 377, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: 


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 247, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python_hdfs.py", line 63, in <module>
    status, name, nnaddress= check_node_status(node)
  File "python_hdfs.py", line 18, in check_node_status
    request = requests.get("%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"%name,verify=False).json()
  File "/usr/lib/python3/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))

这里可能是什么问题,如果有人可以指出,请问


下面是我尝试运行的 Python 脚本,python_hdfs.py:

import requests
import json
import os
import kerberos
import sys

node = os.getenv("namenode").split(",")
print (node)

local_file_path = sys.argv[1]
remote_file_path = sys.argv[2]
read_or_write = sys.argv[3]
print (local_file_path,remote_file_path)

def check_node_status(node):
    for name in node:
        print (name)
        request = requests.get("%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"%name,
                               verify=False).json()
        status = request["beans"][0]["State"]
        if status =="active":
            nnhost = request["beans"][0]["HostAndPort"]
            splitaddr = nnhost.split(":")
            nnaddress = splitaddr[0]
            print(nnaddress)
            break
    return status,name,nnaddress

def kerberos_auth(nnaddress):
    __, krb_context = kerberos.authGSSClientInit("HTTP@%s"%nnaddress)
    kerberos.authGSSClientStep(krb_context, "")
    negotiate_details = kerberos.authGSSClientResponse(krb_context)
    headers = {"Authorization": "Negotiate " + negotiate_details,
                "Content-Type":"application/binary"}
    return headers

def kerberos_hdfs_upload(status,name,headers):
    print("running upload function")
    if status =="active":
        print("if function")
        data=open('%s'%local_file_path, 'rb').read()
        write_req = requests.put("%s/webhdfs/v1%s?op=CREATE&overwrite=true"%(name,remote_file_path),
                                 headers=headers,
                                 verify=False, 
                                 allow_redirects=True,
                                 data=data)
        print(write_req.text)

def kerberos_hdfs_read(status,name,headers):
    if status == "active":
        read = requests.get("%s/webhdfs/v1%s?op=OPEN"%(name,remote_file_path),
                            headers=headers,
                            verify=False,
                            allow_redirects=True)

        if read.status_code == 200:
            data=open('%s'%local_file_path, 'wb')
            data.write(read.content)
            data.close()
        else : 
            print(read.content)


status, name, nnaddress= check_node_status(node)
headers = kerberos_auth(nnaddress)
if read_or_write == "write":
    kerberos_hdfs_upload(status,name,headers)
elif read_or_write == "read":
    print("fun")
    kerberos_hdfs_read(status,name,headers)

最佳答案

我在用 urllib 替换 requests 模块时遇到了同样的错误。问题是异常跟踪在概念上是错误的顺序。真正的问题是跟踪中的最后一个异常。

这里有更多相关信息:https://github.com/requests/requests/issues/2510

在您的情况下,python_hdfs.py 中出现了连接错误 这才是真正的问题

关于python - 为什么在使用 Python 脚本读取或写入 Hadoop 文件系统时会出现这些奇怪的连接错误?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48702844/

相关文章:

python - 导入错误: bad magic number in 'dateparser' : b'\x03\xf3\r\n'

python - 使用另一个张量作为种子在 tensorflow 中生成随机序列

windows - 使用 cmd 在任务完成之前读取和写入文件

hadoop - 在 Hadoop 中使用外部 Web 服务数据

python - 在另一个文件中搜索一个文件的行并在python中打印适当的行

java - Java读写文件时如何保持格式一致?

python - 在 Pandas 数据帧中隐式将字节转换为字符串(Python 3.6)

Python:strftime() UTC 偏移量在 Windows 中未按预期工作

java - 如何正确理解 yarn 容器?

hadoop - MapReduce:当 2 个 block 分布在不同节点上时,如何进行输入分割?