python - How to make Scala string splitting match Python

Tags: python scala split apache-spark

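I am using spark-shell and pyspark to run a word count on the same article. A Scala flatMap over line.split(" ") and Python's split() give different word counts (Scala's is higher). I tried split(" +") and split("\\W+") in the Scala code, but could not bring the count down to match Python's.

Does anyone know which pattern would match Python exactly?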

Best answer

Python's str.split() has some special behavior when the default separator is used:

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

For example, ' 1 2 3 '.split() returns ['1', '2', '3']
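A small Python sketch can make the quoted behavior concrete. It compares the default separator with an explicit single-space separator on a sample string; the variable names are illustrative and not from the original post.

```python
text = " 1  2   3  "

# Default separator: runs of whitespace collapse into one boundary,
# and no empty strings appear at the start or end.
default_tokens = text.split()

# Explicit single-space separator: every space is its own boundary,
# so empty strings show up between and around the words.
space_tokens = text.split(" ")

print(default_tokens)
print(space_tokens)
print(len(default_tokens), len(space_tokens))
```

Counting the elements of the second list would overstate the number of words, which is the kind of discrepancy described in the question.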

The simplest way to match this exactly in Scala is probably something like this:

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
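As a cross-check on the Python side, the same `\S+` idea can be compared against the default split. This is a minimal sketch assuming Python 3 and the standard `re` module; the sample strings are made up for illustration.

```python
import re

samples = [" 1  2   3  ", "   ", "", "a\tb\nc"]

for s in samples:
    # Collect maximal runs of non-whitespace characters,
    # the same idea as the findAllIn calls above.
    tokens = re.findall(r"\S+", s)
    print(repr(s), tokens, tokens == s.split())
```

For these samples the two approaches agree, including on the empty and all-whitespace strings.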

Another approach is to trim() the string first:

scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)

But this differs from Python's behavior for the empty string:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")

In Scala, split() on an empty string returns an array with one element (the empty string itself), whereas in Python the result is a list with zero elements.
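The same edge case can be reproduced in Python when splitting with a regular expression instead of the default separator, which suggests the difference comes from the splitting approach rather than from Scala alone. A hedged sketch, assuming Python 3; `split_like_default` is a hypothetical helper name, not something from the original answer:

```python
import re

def split_like_default(s):
    # Trim, split on whitespace runs, then drop the lone empty string
    # that a regex split yields for empty or all-whitespace input.
    return [tok for tok in re.split(r"\s+", s.strip()) if tok]

print(re.split(r"\s+", "".strip()))       # regex split keeps one empty string
print(split_like_default(""))
print(split_like_default(" 1  2   3  "))
```

The same filter-after-split idea carries over to the Scala trim() variant, though the `\S+` approach avoids the special case entirely.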

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/30008160/
