hadoop - pyspark.sql.utils.IllegalArgumentException : u'java.net.UnknownHostException: user'

Tags: hadoop apache-spark pyspark

I'm new to PySpark and I'm trying to do a simple count, but it gives me the error below. The text file is in HDFS.

Code:

>>> mydata = sc.textFile("hdfs://user/poem.txt")
>>> mydata.count()

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 1008, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 999, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 873, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/lib/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: user'

Accepted Answer

You're missing a "/". With only two slashes, the path's first segment is taken as the host:

>>> r = sc.textFile("hdfs://user/myFile")
>>> r.count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p1464.1349/lib/spark/python/pyspark/sql/utils.py", line 53, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: user'

But if you do this instead:

>>> r = sc.textFile("hdfs:///user/myFile")
>>> r.count()
318199

This is because hdfs:// is a URI prefix; in fully qualified form with no explicit host it should be hdfs:/// (an empty authority). With only two slashes, Spark/Hadoop parses the token "user" as the NameNode host, which is what raises the UnknownHostException.
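The same parsing behavior can be demonstrated with Python's standard urllib.parse, which follows the generic URI rules Hadoop also applies: with two slashes the first path segment lands in the authority (host) position, with three slashes the authority is empty and the full path survives. This is a minimal illustration, not Hadoop's actual parser:

```python
from urllib.parse import urlparse

# Two slashes: "user" is parsed as the authority (host), and only
# "/poem.txt" remains as the path -- hence UnknownHostException: user.
bad = urlparse("hdfs://user/poem.txt")
print(bad.netloc, bad.path)   # -> user /poem.txt

# Three slashes: the authority is empty, so the whole "/user/poem.txt"
# path survives and Hadoop falls back to the default NameNode from
# its configuration (fs.defaultFS).
good = urlparse("hdfs:///user/poem.txt")
print(good.netloc, good.path)  # -> (empty) /user/poem.txt
```

An alternative to the triple-slash form is to spell out the NameNode host explicitly, e.g. `hdfs://namenode-host:8020/user/poem.txt` (host and port here are placeholders for your cluster's values).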

About hadoop - pyspark.sql.utils.IllegalArgumentException : u'java.net.UnknownHostException: user', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40209393/
