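I have a file containing a large number of documents, one per line. How can I skip the lines whose length (number of space-separated tokens) is <= 2 and then process only the lines whose length is > 2?
For example: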
fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .
After skipping the short lines:
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
My code:
val Bi = text.map(sen=> sen.split(" ").sliding(2))
Is there a way to do this?
Best answer
I would use filter:
> val text = sc.parallelize(Array("fit perfectly clie .",
"purchased not",
"instructions install helpful . improvement battery life not hoped .",
"product.",
"cable good not work . cable extremely hot not recognize devices ."))
> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
From here, you can process the filtered data in its original (i.e. untokenized) form. If you would rather tokenize first, you can do this instead:
text.map{_.split(" ")}.filter{_.size > 2}
So, finally, to tokenize, then filter, and then find the bigrams with sliding, you would use:
text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
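As a minimal sketch of the tokenize, filter, sliding pipeline, the snippet below applies the same steps to a plain Scala Seq instead of an RDD, so it runs without a SparkContext. The object name BigramSketch and the method bigramsOf are hypothetical names introduced for illustration and are not part of the original answer. Because sliding(2) returns an Iterator, each window is converted to a tuple and materialized into a List so the result can be printed or inspected.

```scala
object BigramSketch {
  // Hypothetical helper: tokenize each line on single spaces, drop lines
  // with two or fewer tokens, then build the bigrams of each remaining line.
  def bigramsOf(lines: Seq[String]): Seq[List[(String, String)]] =
    lines
      .map(_.split(" "))                  // tokenize
      .filter(_.length > 2)               // keep lines with more than 2 tokens
      .map { tokens =>
        // Every kept line has at least 3 tokens, so each window has exactly
        // 2 elements; materialize the Iterator returned by sliding(2).
        tokens.sliding(2).map(w => (w(0), w(1))).toList
      }
}
```

Calling BigramSketch.bigramsOf on the five example lines from the question should keep three lines and return one list of (first, second) token pairs per kept line. On an RDD the same map and filter calls would apply, under the assumption that the bigrams are materialized in the same way before calling collect.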
The original question, "scala - Skip some lines based on length in Scala and Spark", can be found on Stack Overflow: https://stackoverflow.com/questions/30670337/