Reading text files in SparkR 1.4.0

Does anyone know how to read a text file in SparkR version 1.4.0?
Is there a Spark package available for that?

Best answer

Spark 1.6+

You can use the text input format to read a text file as a DataFrame:

read.df(sqlContext=sqlContext, source="text", path="README.md")
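
To illustrate what can be done with the result, here is a minimal sketch. It assumes a Spark 1.6.x installation with SparkR on the library path, a local master, and a README.md in the working directory; the app name and the n_chars alias are made up for this example. As far as I know, the text source returns a single string column named value, but check printSchema on your version before relying on that.

```r
library(SparkR)

# Assumption: Spark 1.6.x in local mode; change master for a real cluster.
sc <- sparkR.init(master = "local[*]", appName = "read-text-sketch")
sqlContext <- sparkRSQL.init(sc)

# Each line of the file becomes one row of a single string column.
df <- read.df(sqlContext, "README.md", source = "text")
printSchema(df)

# Line lengths computed through the public DataFrame API
# (no private RDD functions involved). Assumes the column is named "value".
lens <- select(df, alias(length(df$value), "n_chars"))
first3 <- head(lens, 3)
print(first3)
```

Staying within the DataFrame API means the work goes through the Catalyst optimizer, which the internal RDD functions do not.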

Spark <= 1.5

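The short answer is that you don't. SparkR 1.4 has been almost completely stripped of its low-level API, leaving only a limited subset of DataFrame operations.
As you can read on the old SparkR webpage: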

As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R be focussed on high level operations instead of low level ETL.

Probably the closest thing is to load the text file using spark-csv:

> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
|                  C0|
+--------------------+
|      # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+

Since typical RDD operations such as map, flatMap, reduce, or filter are gone as well, this is probably what you want anyway.

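The low-level API is still there underneath, so you can always do something like the code below, but I doubt it is a good idea. The SparkR developers most likely had a good reason to make it private. To quote the ::: man page: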

It is typically a design mistake to use ‘:::’ in your code since the corresponding object has probably been kept internal for a good reason. Consider contacting the package maintainer if you feel the need to access the object for anything but mere inspection.

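Even if you are willing to ignore good coding practice, it is most likely not worth your time. The pre-1.4 low-level API was embarrassingly slow and clumsy, and without the benefits of the Catalyst optimizer the internal 1.4 API is most likely no better.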

> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)

[[1]]
[1] 14

[[2]]
[1] 0

[[3]]
[1] 78

Note that spark-csv, unlike textFile, ignores empty lines.

Source: the corresponding question on Stack Overflow, https://stackoverflow.com/questions/31157649/
