linux - Problem when trying to print the dataset table

Tags: linux apache-spark machine-learning pyspark apache-spark-mllib

I am trying out a machine learning tutorial for PySpark.

I was following this tutorial here.

I ran into a problem when I reached the "Correlations and Data Preparation" section.

I was trying to run this code:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

CV_data = CV_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()


final_test_data = final_test_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(final_test_data['Churn'])) \
    .withColumn('International plan', toNum(final_test_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()

This is the error message printed to the terminal (partial):

17/06/20 17:58:53 WARN BlockManager: Putting block rdd_38_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_38_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 WARN BlockManager: Putting block rdd_53_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_53_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 16)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 1, in <lambda>
KeyError: False

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    ....

The rest of the error message can be seen in this document here.

Does anyone know what the problem is?

Thanks in advance.

Best Answer

[SOLVED]

Referring to this thread from 2 months back solved it.

The main problem is, as @user6910411 explained there, a data type error: the 'Churn' column holds booleans (True/False), while binary_map only contains string keys, which is why the lookup fails with KeyError: False.
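A minimal sketch of an alternative fix that keeps the numeric conversion instead of dropping it: since Spark hands the UDF the 'Churn' values as Python booleans, normalizing the key to a string before the dictionary lookup avoids the KeyError. The `to_num` name and the toy calls below are illustrative, not from the tutorial.

```python
# The tutorial's mapping has string keys only, but the 'Churn' column
# arrives in the UDF as a Python bool, so binary_map[False] raises KeyError.
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

# Normalize the key to a string first; this handles both bool and str inputs,
# because str(False) == 'False' and str(True) == 'True'.
to_num = lambda k: binary_map[str(k)]

print(to_num(False))  # 0.0
print(to_num('Yes'))  # 1.0
```

Wrapped as `UserDefinedFunction(lambda k: binary_map[str(k)], DoubleType())`, this should let the original `withColumn` calls run unchanged.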

Since I did not need all of the data converted to numbers, I removed the following lines from the tutorial's code.

Removed from the CV_data block:

.withColumn('Churn', toNum(CV_data['Churn'])) \
.withColumn('International plan', toNum(CV_data['International plan'])) \
.withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()

Removed from the final_test_data block:

.withColumn('Churn', toNum(final_test_data['Churn'])) \
.withColumn('International plan', toNum(final_test_data['International plan'])) \
.withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()

The table now prints:

>>> pd.DataFrame(CV_data.take(5), columns=CV_data.columns).transpose()
17/06/21 13:49:54 WARN Executor: 1 block locks were not released by TID = 11:
[rdd_16_0]
                            0      1      2      3      4
Account length            128    107    137     84     75
International plan         No     No     No    Yes    Yes
Voice mail plan           Yes    Yes     No     No     No
Number vmail messages      25     26      0      0      0
Total day minutes       265.1  161.6  243.4  299.4  166.7
Total day calls           110    123    114     71    113
Total eve minutes       197.4  195.5  121.2   61.9  148.3
Total eve calls            99    103    110     88    122
Total night minutes     244.7  254.4  162.6  196.9  186.9
Total night calls          91    103    104     89    121
Total intl minutes         10   13.7   12.2    6.6   10.1
Total intl calls            3      3      5      7      3
Customer service calls      1      1      0      2      3
Churn                   False  False  False  False  False
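For reference, the take-and-transpose preview pattern used above works the same way outside Spark; a small self-contained sketch with made-up rows (the values below are hypothetical stand-ins for `CV_data.take(5)`, not the real churn data):

```python
import pandas as pd

# Hypothetical rows standing in for the result of CV_data.take(5)
rows = [(128, 'No', 'Yes'), (107, 'No', 'Yes'), (137, 'No', 'No')]
cols = ['Account length', 'International plan', 'Voice mail plan']

# After .transpose(), column names become row labels and each original
# row becomes a column, matching the table layout shown above.
preview = pd.DataFrame(rows, columns=cols).transpose()
print(preview)
```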

Regarding "linux - Problem when trying to print the dataset table", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44650557/
