python - Is it possible to subclass DataFrame in PySpark?

The PySpark documentation shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.

Best answer

It really depends on your goals.

Technically, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to.

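from pyspark.sql import DataFrame

class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Reuse the JVM DataFrame handle and SQL context of an existing DataFrame.
        # The class is named explicitly because super(self.__class__, self)
        # recurses forever as soon as this class is itself subclassed.
        super(DataFrameWithZipWithIndex, self).__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Prepend each row's index as an "_idx" column.
        return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))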

Example usage:

df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])

with_zipwithindex = DataFrameWithZipWithIndex(df)

isinstance(with_zipwithindex, DataFrame)

True

with_zipwithindex.zipWithIndex().show()

+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+
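
The monkey-patching route mentioned earlier can be sketched along the same lines. This is only a minimal sketch: the helper name `zip_with_index` and its `idx_col` parameter are made up for illustration, and the import is guarded so that the helper stays usable as a plain function where PySpark is not installed.

```python
def zip_with_index(df, idx_col="_idx"):
    """Return a new DataFrame with each row's index prepended as idx_col."""
    return (df.rdd
        .zipWithIndex()
        # Each element is a (row, index) pair; put the index first.
        .map(lambda pair: (pair[1],) + tuple(pair[0]))
        .toDF([idx_col] + df.columns))


try:
    from pyspark.sql import DataFrame
except ImportError:
    # PySpark is not installed; zip_with_index(df) can still be called directly.
    DataFrame = None

if DataFrame is not None:
    # Monkey-patch: every DataFrame in this process gains the method.
    DataFrame.zipWithIndex = zip_with_index
```

With the patch applied, `df.zipWithIndex().show()` can be called on the plain `df` from the example without wrapping it first. Keep in mind that the patch is global: it affects every DataFrame in the process, including those created by third-party code.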

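In practice, there is not much you can do here. DataFrame is a thin wrapper around a JVM object. Beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results with Python adapters where necessary, it does very little.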
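
To make the "thin wrapper" point concrete, here is a toy model in plain Python. It is not PySpark's actual source: `FakeJvmFrame`, `ThinFrame`, and `MyFrame` are hypothetical names invented for illustration. It sketches the delegation pattern described above and one practical consequence: built-in methods wrap their results in the base class, so a subclass instance does not survive a transformation.

```python
class FakeJvmFrame:
    """Stand-in for the JVM-side object that holds the real data and logic."""

    def __init__(self, rows):
        self._rows = list(rows)

    def limit(self, n):
        return FakeJvmFrame(self._rows[:n])

    def count(self):
        return len(self._rows)


class ThinFrame:
    """Toy counterpart of the Python-side wrapper: it only forwards calls."""

    def __init__(self, jdf):
        self._jdf = jdf

    def limit(self, n):
        # Call the backend, then wrap the result in the *base* class.
        return ThinFrame(self._jdf.limit(n))

    def count(self):
        return self._jdf.count()


class MyFrame(ThinFrame):
    """Subclass that adds one convenience method."""

    def is_empty(self):
        return self.count() == 0


frame = MyFrame(FakeJvmFrame([("a", 1), ("b", 2)]))
limited = frame.limit(1)

print(type(frame).__name__)          # MyFrame
print(type(limited).__name__)        # ThinFrame
print(hasattr(limited, "is_empty"))  # False
```

To keep the extra method after a transformation, the result has to be wrapped again: `MyFrame(limited._jdf)` in this toy model, or `DataFrameWithZipWithIndex(df.filter(...))` with the subclass shown earlier.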

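With pure Python code, you cannot even get close to the DataFrame / Dataset internals or modify its core behavior. If you are looking for a standalone, Python-only Spark DataFrame implementation, that is not possible.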

For the original question, "python - Is it possible to subclass DataFrame in PySpark?", see Stack Overflow: https://stackoverflow.com/questions/41598383/
