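java - How to transform a DataFrame of JSON objects with StopWordsRemover?

Tags: java json apache-spark

I am using MLlib with Spark 1.5.1. StopWordsRemover fails with an error saying the input type must be ArrayType(StringType), but it got StringType. What is wrong with my code?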

StopWordsRemover remover = new StopWordsRemover()
                       .setInputCol("text")
                       .setOutputCol("filtered");

DataFrame df = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json");

System.out.println("***DATAFRAME SCHEMA: " + df.schema());

DataFrame filteredTokens = remover.transform(df);
filteredTokens.show();

Output:

***DATAFRAME SCHEMA: StructType(StructField(doc_id,LongType,true), StructField(image,StringType,true), StructField(link_title,StringType,true), StructField(sentiment_polarity,DoubleType,true), StructField(sentiment_subjectivity,DoubleType,true), StructField(text,StringType,true), StructField(url,StringType,true))

Error:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Input type must be ArrayType(StringType) but got StringType.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.feature.StopWordsRemover.transformSchema(StopWordsRemover.scala:149)
    at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:129)
    at com.bah.ossem.spark.topic.LDACountVectorizer.main(LDACountVectorizer.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

article.json (first line):

    {"doc_id": 11, "sentiment_polarity": 0.223, "link_title": "Donald Trump will live-tweet 's Democratic Debate - Politics.com", "sentiment_subjectivity": 0.594, "url": "https://www.cnn.com/...", "text": "Watch the first Democratic presidential debate Tuesday...", "image": "http://i2.cdn.turner.com..."}

Edit: I implemented zero323's Scala code in Java and it works well. Thanks, zero323!

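import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.StopWordsRemover;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.DataFrame;

// Split the raw "text" string column into an array of words.
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");

// StopWordsRemover now reads the array column produced by the tokenizer.
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");

DataFrame jsondf = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json");

DataFrame wordsDataFrame = tokenizer.transform(jsondf);

DataFrame filteredTokens = remover.transform(wordsDataFrame);
filteredTokens.show();

// Build term-count feature vectors from the filtered tokens.
CountVectorizerModel cvModel = new CountVectorizer()
        .setInputCol("filtered").setOutputCol("features")
        .setVocabSize(10).fit(filteredTokens);
cvModel.transform(filteredTokens).show();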

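Best answer

Well, the error message is self-explanatory. StopWordsRemover expects an array of strings (ArrayType(StringType)) as input, not a single String. This means you have to tokenize the data first. Using the Scala API: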

import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.DataFrame

val tokenizer: Tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("tokens_raw")

val remover: StopWordsRemover = new StopWordsRemover()
  .setInputCol("tokens_raw")
  .setOutputCol("tokens")

val tokenized: DataFrame = tokenizer.transform(df)
val filtered: DataFrame = remover.transform(tokenized)
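
To see why the input has to be an array, here is a minimal plain-Java sketch of what the two stages do conceptually. It does not use Spark and is not Spark's implementation: the class name StopWordSketch, its methods, and the tiny stop-word list are all hypothetical, and the real Tokenizer and StopWordsRemover may differ in details such as how whitespace is split and which words count as stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical, tiny stop-word list for illustration only;
    // Spark ships its own, much longer default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "is"));

    // Stage 1 (roughly what a tokenizer does): lowercase the text
    // and split it on whitespace, turning one String into a list of words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stage 2 (roughly what a stop-word remover does): drop every token
    // found in the stop-word set. It works word by word, which is why it
    // needs a list of words rather than one undivided String.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Watch the first Democratic presidential debate");
        // prints [watch, first, democratic, presidential, debate]
        System.out.println(removeStopWords(tokens));
    }
}
```

Passing the raw "text" column straight to the remover skips stage 1, so there is nothing to filter word by word; that is the situation the IllegalArgumentException reports.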

Source: this question on Stack Overflow: https://stackoverflow.com/questions/33583661/
