java - Spark StringIndexer returns an empty dataset

Tags: java apache-spark

An Apache Spark StringIndexerModel returns an empty dataset after transforming one particular column. I am using the Adult dataset: http://mlr.cs.umass.edu/ml/datasets/Adult

Step 1: Create the StringIndexerModel and save it locally

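// Fit an indexer on `column`: labels are ordered alphabetically, and rows
// whose value was not seen during fitting are skipped on transform.
StringIndexerModel model = new StringIndexer()
        .setInputCol(column)
        .setOutputCol("label")
        .setHandleInvalid("skip")
        .setStringOrderType("alphabetAsc")
        .fit(originalDataset);
model.write().save(filelocation);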

Step 2: Load the indexer model and transform the new dataset

// Load the saved model, index `column`, then replace the original column with its index.
StringIndexerModel model = StringIndexerModel.read().load(filelocation);
newDataset = model.transform(newDataset)
        .drop(column)
        .withColumnRenamed("label", column);

New dataset:

+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|age|capital gain|capital loss|education |education num|fnlgwt|hours per week|marital status     |native country|occupation      |race |relationship  |sex |workclass        |
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+
|39 |2174        |0           | Bachelors|13           |77516 |40            | Never-married     | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |
|50 |0           |0           | Bachelors|13           |83311 |13            | Married-civ-spouse| United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|
+---+------------+------------+----------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+

Correct output:

Column: education | File Location: localFolder/stringIndex/education
Labels: [ 10th,  11th,  12th,  1st-4th,  5th-6th,  7th-8th,  9th,  Assoc-acdm,  Assoc-voc,  Bachelors,  Doctorate,  HS-grad,  Masters,  Preschool,  Prof-school,  Some-college]
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|marital status     |native country|occupation      |race |relationship  |sex |workclass        |education|
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+
|39 |2174        |0           |13           |77516 |40            | Never-married     | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |
|50 |0           |0           |13           |83311 |13            | Married-civ-spouse| United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |
+---+------------+------------+-------------+------+--------------+-------------------+--------------+----------------+-----+--------------+----+-----------------+---------+

Column: marital status | File Location: localFolder/stringIndex/marital status
Labels: [ Divorced,  Married-AF-spouse,  Married-civ-spouse,  Married-spouse-absent,  Never-married,  Separated,  Widowed]
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|native country|occupation      |race |relationship  |sex |workclass        |education|marital status|
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+
|39 |2174        |0           |13           |77516 |40            | United-States| Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |4.0           |
|50 |0           |0           |13           |83311 |13            | United-States| Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |
+---+------------+------------+-------------+------+--------------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+

Column: native country | File Location: localFolder/stringIndex/native country
Labels: [ ?,  Cambodia,  Canada,  China,  Columbia,  Cuba,  Dominican-Republic,  Ecuador,  El-Salvador,  England,  France,  Germany,  Greece,  Guatemala,  Haiti,  Holand-Netherlands,  Honduras,  Hong,  Hungary,  India,  Iran,  Ireland,  Italy,  Jamaica,  Japan,  Laos,  Mexico,  Nicaragua,  Outlying-US(Guam-USVI-etc),  Peru,  Philippines,  Poland,  Portugal,  Puerto-Rico,  Scotland,  South,  Taiwan,  Thailand,  Trinadad&Tobago,  United-States,  Vietnam,  Yugoslavia]
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|occupation      |race |relationship  |sex |workclass        |education|marital status|native country|
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+
|39 |2174        |0           |13           |77516 |40            | Adm-clerical   |White| Not-in-family|Male| State-gov       |9.0      |4.0           |39.0          |
|50 |0           |0           |13           |83311 |13            | Exec-managerial|White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |39.0          |
+---+------------+------------+-------------+------+--------------+----------------+-----+--------------+----+-----------------+---------+--------------+--------------+

Column: occupation | File Location: localFolder/stringIndex/occupation
Labels: [ ?,  Adm-clerical,  Armed-Forces,  Craft-repair,  Exec-managerial,  Farming-fishing,  Handlers-cleaners,  Machine-op-inspct,  Other-service,  Priv-house-serv,  Prof-specialty,  Protective-serv,  Sales,  Tech-support,  Transport-moving]
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|race |relationship  |sex |workclass        |education|marital status|native country|occupation|
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+
|39 |2174        |0           |13           |77516 |40            |White| Not-in-family|Male| State-gov       |9.0      |4.0           |39.0          |1.0       |
|50 |0           |0           |13           |83311 |13            |White| Husband      |Male| Self-emp-not-inc|9.0      |2.0           |39.0          |4.0       |
+---+------------+------------+-------------+------+--------------+-----+--------------+----+-----------------+---------+--------------+--------------+----------+

Incorrect output (every model other than this one works correctly):

Column: race | File Location: localFolder/stringIndex/race
Labels: [ Amer-Indian-Eskimo,  Asian-Pac-Islander,  Black,  Other,  White]
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
|age|capital gain|capital loss|education num|fnlgwt|hours per week|relationship|sex|workclass|education|marital status|native country|occupation|race|
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+
+---+------------+------------+-------------+------+--------------+------------+---+---------+---------+--------------+--------------+----------+----+

I would appreciate any help in resolving this. Thanks!

Best answer

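It turned out that the data in the new dataset was wrong: each value should be preceded by a space. The model's labels carry that leading space (for example " White"), so the value "White" matched none of them, and with setHandleInvalid("skip") every such row was dropped, leaving an empty dataset.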
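To make the effect of an exact-match lookup combined with "skip" handling concrete, here is a minimal plain-Java sketch. It is not Spark code: the class SkipIndexerSketch and its methods are hypothetical stand-ins that only imitate the behavior, using the race labels printed in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an exact-match label index with "skip" handling.
// This only imitates the behavior discussed above; it does not use Spark.
class SkipIndexerSketch {

    // Assign each label its position in the given (already sorted) list.
    static Map<String, Double> buildIndex(List<String> labels) {
        Map<String, Double> index = new HashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            index.put(labels.get(i), (double) i);
        }
        return index;
    }

    // Map each value to its index; values without an exact match are dropped.
    static List<Double> transformSkippingUnknown(Map<String, Double> index, List<String> values) {
        List<Double> out = new ArrayList<>();
        for (String value : values) {
            Double idx = index.get(value);
            if (idx != null) {
                out.add(idx);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Labels as printed in the question: each one starts with a space.
        List<String> labels = Arrays.asList(
                " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White");
        Map<String, Double> index = buildIndex(labels);

        // No leading space: nothing matches, so every row is skipped.
        System.out.println(transformSkippingUnknown(index, Arrays.asList("White", "White")));
        // With the leading space: both rows are indexed.
        System.out.println(transformSkippingUnknown(index, Arrays.asList(" White", " White")));
    }
}
```

The first call prints an empty list and the second prints [4.0, 4.0], which mirrors the empty race output in the question.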

Adding the space, so that the value reads ' White', gave me the correct output.
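A quick way to spot this kind of mismatch before transforming is to compare the new values against the model's labels. The following is a rough plain-Java sketch, not part of the original answer: LabelMismatchCheck and its methods are hypothetical, and the labels and values are copied from the question by hand rather than read from Spark.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, plain Java rather than Spark: lists the values that an
// exact-match index would skip and flags those that differ only by whitespace.
class LabelMismatchCheck {

    // Distinct values that have no exact match among the labels.
    static Set<String> unmatched(List<String> labels, List<String> values) {
        Set<String> known = new LinkedHashSet<>(labels);
        Set<String> missing = new LinkedHashSet<>();
        for (String value : values) {
            if (!known.contains(value)) {
                missing.add(value);
            }
        }
        return missing;
    }

    // True when the value equals some label once both are trimmed.
    static boolean matchesAfterTrim(List<String> labels, String value) {
        for (String label : labels) {
            if (label.trim().equals(value.trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList(
                " Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White");
        List<String> newValues = Arrays.asList("White", "White");

        for (String value : unmatched(labels, newValues)) {
            System.out.println("'" + value + "' unmatched; whitespace-only difference: "
                    + matchesAfterTrim(labels, value));
        }
    }
}
```

If the difference is whitespace only, one option (an assumption about the desired pipeline, not what the answer did) is to trim the column with org.apache.spark.sql.functions.trim before both fit and transform, so that labels and new values agree. Setting handleInvalid to "error" instead of "skip" would also surface the mismatch as an exception rather than silently dropping rows.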

Original question on Stack Overflow: https://stackoverflow.com/questions/59518208/
