hadoop - Dask: pyarrow/hdfs.py returns OSError: Getting symbol hdfsNewBuilder failed when reading from HDFS

Tags: hadoop hdfs dask dask-distributed pyarrow

I am trying to run dask-on-yarn on my research group's Hadoop cluster.
I tried each of the following:

  • dd.read_parquet('hdfs://file.parquet', engine='fastparquet')
  • dd.read_parquet('hdfs://file.parquet', engine='pyarrow')
  • dd.read_csv('hdfs://file.csv')

Each of these attempts produces the following error:
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
        521         path = cls._strip_protocol(urlpath)
        522         update_storage_options(options, storage_options)
    --> 523         fs = cls(**options)
        524 
        525         if "w" in mode:
    
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
         52             return cls._cache[token]
         53         else:
    ---> 54             obj = super().__call__(*args, **kwargs)
         55             # Setting _fs_token here causes some static linters to complain.
         56             obj._fs_token_ = token
    
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/implementations/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf, **kwargs)
         42         AbstractFileSystem.__init__(self, **kwargs)
         43         self.pars = (host, port, user, kerb_ticket, driver, extra_conf)
    ---> 44         self.pahdfs = HadoopFileSystem(
         45             host=host,
         46             port=port,
    
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf)
         38             _maybe_set_hadoop_classpath()
         39 
    ---> 40         self._connect(host, port, user, kerb_ticket, extra_conf)
         41 
         42     def __reduce__(self):
    
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect()
    
    ~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
    
    OSError: Getting symbol hdfsNewBuilderfailed
    
How should I fix this?
My environment
These are the packages in this conda env:
    # Name                    Version                   Build  Channel
    _libgcc_mutex             0.1                        main
    abseil-cpp                20200225.2           he1b5a44_0    conda-forge
    arrow-cpp                 0.17.1          py38h1234567_9_cpu    conda-forge
    attrs                     19.3.0                     py_0
    aws-sdk-cpp               1.7.164              hc831370_1    conda-forge
    backcall                  0.2.0                      py_0
    blas                      1.0                         mkl
    bleach                    3.1.5                      py_0
    bokeh                     2.1.1                    py38_0
    boost-cpp                 1.72.0               h7b93d67_1    conda-forge
    brotli                    1.0.7                he6710b0_0
    brotlipy                  0.7.0           py38h7b6447c_1000
    bzip2                     1.0.8                h7b6447c_0
    c-ares                    1.15.0            h7b6447c_1001
    ca-certificates           2020.6.24                     0
    certifi                   2020.6.20                py38_0
    cffi                      1.14.0           py38he30daa8_1
    chardet                   3.0.4                 py38_1003
    click                     7.1.2                      py_0
    cloudpickle               1.4.1                      py_0
    conda-pack                0.4.0                      py_0
    cryptography              2.9.2            py38h1ba5d50_0
    curl                      7.71.0               hbc83047_0
    cytoolz                   0.10.1           py38h7b6447c_0
    dask                      2.19.0                     py_0
    dask-core                 2.19.0                     py_0
    dask-yarn                 0.8.1            py38h32f6830_0    conda-forge
    decorator                 4.4.2                      py_0
    defusedxml                0.6.0                      py_0
    distributed               2.19.0                   py38_0
    entrypoints               0.3                      py38_0
    fastparquet               0.3.2            py38heb32a55_0
    freetype                  2.10.2               h5ab3b9f_0
    fsspec                    0.7.4                      py_0
    gflags                    2.2.2                he6710b0_0
    glog                      0.4.0                he6710b0_0
    grpc-cpp                  1.30.0               h9ea6770_0    conda-forge
    grpcio                    1.27.2           py38hf8bcb03_0
    heapdict                  1.0.1                      py_0
    icu                       67.1                 he1b5a44_0    conda-forge
    idna                      2.10                       py_0
    importlib-metadata        1.7.0                    py38_0
    importlib_metadata        1.7.0                         0
    intel-openmp              2020.1                      217
    ipykernel                 5.3.0            py38h5ca1d4c_0
    ipython                   7.16.1           py38h5ca1d4c_0
    ipython_genutils          0.2.0                    py38_0
    jedi                      0.17.1                   py38_0
    jinja2                    2.11.2                     py_0
    jpeg                      9b                   h024ee3a_2
    json5                     0.9.5                      py_0
    jsonschema                3.2.0                    py38_0
    jupyter_client            6.1.3                      py_0
    jupyter_core              4.6.3                    py38_0
    jupyterlab                2.1.5                      py_0
    jupyterlab_server         1.1.5                      py_0
    krb5                      1.18.2               h173b8e3_0
    ld_impl_linux-64          2.33.1               h53a641e_7
    libcurl                   7.71.0               h20c2e04_0
    libedit                   3.1.20191231         h7b6447c_0
    libevent                  2.1.10               hcdb4288_1    conda-forge
    libffi                    3.3                  he6710b0_1
    libgcc-ng                 9.1.0                hdf63c60_0
    libgfortran-ng            7.3.0                hdf63c60_0
    libllvm9                  9.0.1                h4a3c616_0
    libpng                    1.6.37               hbc83047_0
    libprotobuf               3.12.3               hd408876_0
    libsodium                 1.0.18               h7b6447c_0
    libssh2                   1.9.0                h1ba5d50_1
    libstdcxx-ng              9.1.0                hdf63c60_0
    libtiff                   4.1.0                h2733197_1
    llvmlite                  0.33.0           py38hd408876_0
    locket                    0.2.0                    py38_1
    lz4-c                     1.9.2                he6710b0_0
    markupsafe                1.1.1            py38h7b6447c_0
    mistune                   0.8.4           py38h7b6447c_1000
    mkl                       2020.1                      217
    mkl-service               2.3.0            py38he904b0f_0
    mkl_fft                   1.1.0            py38h23d657b_0
    mkl_random                1.1.1            py38h0573a6f_0
    msgpack-python            1.0.0            py38hfd86e86_1
    nbconvert                 5.6.1                    py38_0
    nbformat                  5.0.7                      py_0
    ncurses                   6.2                  he6710b0_1
    notebook                  6.0.3                    py38_0
    numba                     0.50.1           py38h0573a6f_0
    numpy                     1.18.5           py38ha1c710e_0
    numpy-base                1.18.5           py38hde5b4d6_0
    olefile                   0.46                       py_0
    openssl                   1.1.1g               h7b6447c_0
    packaging                 20.4                       py_0
    pandas                    1.0.5            py38h0573a6f_0
    pandoc                    2.9.2.1                       0
    pandocfilters             1.4.2                    py38_1
    parquet-cpp               1.5.1                         2    conda-forge
    parso                     0.7.0                      py_0
    partd                     1.1.0                      py_0
    pexpect                   4.8.0                    py38_0
    pickleshare               0.7.5                 py38_1000
    pillow                    7.1.2            py38hb39fc2d_0
    pip                       20.1.1                   py38_1
    prometheus_client         0.8.0                      py_0
    prompt-toolkit            3.0.5                      py_0
    protobuf                  3.12.3           py38he6710b0_0
    psutil                    5.7.0            py38h7b6447c_0
    ptyprocess                0.6.0                    py38_0
    pyarrow                   0.17.1          py38h1234567_9_cpu    conda-forge
    pycparser                 2.20                       py_0
    pygments                  2.6.1                      py_0
    pyopenssl                 19.1.0                   py38_0
    pyparsing                 2.4.7                      py_0
    pyrsistent                0.16.0           py38h7b6447c_0
    pysocks                   1.7.1                    py38_0
    python                    3.8.3                hcff3b4d_2
    python-dateutil           2.8.1                      py_0
    python_abi                3.8                      1_cp38    conda-forge
    pytz                      2020.1                     py_0
    pyyaml                    5.3.1            py38h7b6447c_1
    pyzmq                     19.0.1           py38he6710b0_1
    re2                       2020.07.01           he1b5a44_0    conda-forge
    readline                  8.0                  h7b6447c_0
    requests                  2.24.0                     py_0
    send2trash                1.5.0                    py38_0
    setuptools                47.3.1                   py38_0
    six                       1.15.0                     py_0
    skein                     0.8.0            py38h32f6830_1    conda-forge
    snappy                    1.1.8                he6710b0_0
    sortedcontainers          2.2.2                      py_0
    sqlite                    3.32.3               h62c20be_0
    tbb                       2020.0               hfd86e86_0
    tblib                     1.6.0                      py_0
    terminado                 0.8.3                    py38_0
    testpath                  0.4.4                      py_0
    thrift                    0.13.0           py38he6710b0_0
    thrift-cpp                0.13.0               h62aa4f2_2    conda-forge
    tk                        8.6.10               hbc83047_0
    toolz                     0.10.0                     py_0
    tornado                   6.0.4            py38h7b6447c_1
    traitlets                 4.3.3                    py38_0
    typing_extensions         3.7.4.2                    py_0
    urllib3                   1.25.9                     py_0
    wcwidth                   0.2.5                      py_0
    webencodings              0.5.1                    py38_1
    wheel                     0.34.2                   py38_0
    xz                        5.2.5                h7b6447c_0
    yaml                      0.2.5                h7b6447c_0
    zeromq                    4.3.2                he6710b0_2
    zict                      2.0.0                      py_0
    zipp                      3.1.0                      py_0
    zlib                      1.2.11               h7b6447c_3
    zstd                      1.4.4                h0b5b093_3
    
The Hadoop cluster is running Hadoop 2.7.0-mapr-1607.
The Cluster object is created as follows:
    from dask_yarn import YarnCluster

    # Create a cluster where each worker has two cores and eight GiB of memory
    cluster = YarnCluster(
        environment='conda-env-packed-for-worker-nodes.tar.gz',
        worker_vcores=2,
        worker_memory='8GiB',
        worker_env={
            # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
            'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
        },
    )
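
As a quick sanity check, it can help to confirm that worker_env actually reached the workers. A minimal sketch, assuming a Client connected to the cluster above:

    import os
    from dask.distributed import Client

    client = Client(cluster)
    # Client.run executes the function on every worker and returns a
    # {worker_address: result} mapping; each worker should report the
    # ARROW_LIBHDFS_DIR value passed via worker_env above.
    print(client.run(lambda: os.environ.get('ARROW_LIBHDFS_DIR')))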
    
Suspected cause
I suspect that a version mismatch between the ARROW_LIBHDFS_DIR environment variable (which points into hadoop-0.20.2) and the Hadoop CLI version (Hadoop 2.7.0) may be causing the problem.
I had to specify that path manually so that pyarrow could find a libhdfs.so at all (using the setup from https://stackoverflow.com/a/62749053/1147061); the required libhdfs.so is not provided anywhere under /opt/mapr/hadoop/hadoop-2.7.0/. Installing libhdfs3 via conda install -c conda-forge libhdfs3 does not satisfy the requirement either.
Could this be the problem?
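
One way to test this suspicion directly: the error means pyarrow managed to load a libhdfs.so but could not resolve the hdfsNewBuilder symbol inside it. hdfsNewBuilder belongs to the hdfsBuilder API, which (as far as I know) first appeared in Hadoop 2.x, so a Hadoop 0.20.2-era library would not export it. A minimal sketch that checks the library in question with ctypes (the path is the one from worker_env above; adjust it to your environment):

    import ctypes

    # Path taken from worker_env above; substitute whatever
    # ARROW_LIBHDFS_DIR points at in your environment.
    lib = ctypes.CDLL(
        '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib/libhdfs.so')
    try:
        lib.hdfsNewBuilder  # dlsym lookup; raises if the symbol is absent
        print('hdfsNewBuilder found: this libhdfs.so should satisfy pyarrow')
    except AttributeError:
        print('hdfsNewBuilder missing: this libhdfs.so is too old for pyarrow')

Note also that the traceback above comes from the client process (fsspec constructs the filesystem where read_parquet is called), so ARROW_LIBHDFS_DIR must point at a suitable library on the client machine as well, not only in worker_env.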

Best answer

(Partial answer)
To use libhdfs3 (which is poorly maintained these days), you need to call

    dd.read_csv('hdfs://file.csv', storage_options={'driver': 'libhdfs3'})
    
and, of course, have libhdfs3 installed. This does not help with the Hadoop-library option; those are separate code paths.
I also suspect that getting the JNI libhdfs (without the "3") to work is a matter of finding the right .so file.
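
To make the JNI route concrete, here is a minimal sketch of what it looks like once a suitable library has been found. Since the question notes that no libhdfs.so ships under /opt/mapr/hadoop/hadoop-2.7.0/, the directory below is purely a placeholder for wherever a Hadoop-2.x-era libhdfs.so actually lives (it could, for instance, come from an Apache Hadoop 2.7 distribution):

    import os
    import subprocess

    import dask.dataframe as dd

    # Placeholder: point this at a directory containing a Hadoop-2.x libhdfs.so.
    os.environ['ARROW_LIBHDFS_DIR'] = '/path/to/hadoop-2.x/lib/native'

    # pyarrow's JNI driver also needs the Hadoop jars on the CLASSPATH;
    # `hadoop classpath --glob` prints them.
    os.environ['CLASSPATH'] = subprocess.check_output(
        ['hadoop', 'classpath', '--glob']).decode().strip()

    df = dd.read_parquet('hdfs://file.parquet', engine='pyarrow')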

The original question, "hadoop - Dask: pyarrow/hdfs.py returns OSError: Getting symbol hdfsNewBuilder failed when reading from HDFS", can be found on Stack Overflow: https://stackoverflow.com/questions/62749263/
