scala - Skipping some lines based on length in Scala and Spark

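I have a file containing a large number of documents. How can I skip the lines whose length is <= 2 words and then process only the lines whose length is > 2?
For example: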

fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .

After skipping the short lines:

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

My code:

 val Bi = text.map(sen=> sen.split(" ").sliding(2))

Is there a way to do this?

Best answer

I would use filter:

> val text = sc.parallelize(Array("fit perfectly clie .",
                                "purchased not",
                                "instructions install helpful . improvement battery life not hoped .",
                                "product.",
                                "cable good not work . cable extremely hot not recognize devices ."))

> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
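
Note that split(" ") yields empty tokens when a line has leading or repeated spaces, which can inflate the count. The following is a minimal sketch of a more tolerant predicate, written against plain Scala collections so that it runs without a Spark context. The object name FilterShortLines and the helper tokenCount are made up for illustration; the assumption is that the same predicate can be passed to filter on the RDD.

```scala
// Sketch only: a plain Seq stands in for the RDD from the answer.
object FilterShortLines {
  // Hypothetical helper: number of whitespace-separated tokens in a line.
  // Blank lines count as 0 tokens, since "".split("\\s+") would return Array("").
  def tokenCount(line: String): Int = {
    val trimmed = line.trim
    if (trimmed.isEmpty) 0 else trimmed.split("\\s+").length
  }

  val lines: Seq[String] = Seq(
    "fit perfectly clie .",
    "purchased not",
    "instructions install helpful . improvement battery life not hoped .",
    "product.",
    "cable good not work . cable extremely hot not recognize devices ."
  )

  // Keep only the lines with more than two tokens.
  val kept: Seq[String] = lines.filter(line => tokenCount(line) > 2)

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```

With an RDD this would presumably read text.filter(line => FilterShortLines.tokenCount(line) > 2).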

From here, you can process the filtered data in its original (that is, untokenized) form. If you prefer to tokenize first, you can do this instead:

text.map{_.split(" ")}.filter{_.size > 2}

So, finally, to tokenize, then filter, and then use sliding to find the bigrams, you would use:

text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
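
One thing to keep in mind is that sliding(2) returns an Iterator, which is lazy and can be traversed only once, and an RDD of iterators may not serialize cleanly when collected. A minimal sketch of one way around this is to materialize each line's bigrams as a list of pairs. It uses plain Scala collections rather than Spark, and the names Bigrams, bigrams and minTokens are hypothetical, not part of the original answer.

```scala
// Sketch only: plain Scala collections stand in for the RDD.
object Bigrams {
  // Hypothetical helper: adjacent word pairs of a line.
  // Lines with minTokens tokens or fewer are skipped and yield an empty list.
  def bigrams(line: String, minTokens: Int = 2): List[(String, String)] = {
    val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
    if (tokens.length <= minTokens) Nil
    else tokens.sliding(2).map(w => (w(0), w(1))).toList // materialize the Iterator
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("fit perfectly clie .", "purchased not", "product.")
    // flatMap drops the skipped lines because they contribute no pairs.
    lines.flatMap(line => bigrams(line)).foreach(println)
  }
}
```

On an RDD, the equivalent would presumably be text.flatMap(line => Bigrams.bigrams(line)), or text.map(...) if the bigrams should stay grouped per line.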

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/30670337/