apache-spark - Spark: writing Parquet partitioned by leading letters

I have done a lot of research on this topic. I have a dataset that is 3 TB in size. This is the schema of the table:

root
 |-- user: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)

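Every day I receive a list of users whose attributes I need. I would like to know whether I can write the schema above to Parquet files partitioned by the first 2 letters of the user name. For example: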

Omkar | [a,b,c,d,e]
Mac   | [a,b,c,d,e]
Zee   | [a,b,c,d,e]
Kim   | [a,b,c,d,e]
Kelly | [a,b,c,d,e]

On the dataset above, can I do something like this:

spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")

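My thinking is that the next time I join against the users, very little data would be loaded into memory, because only the matching partitions would be read.

Has anyone implemented it this way? Any thoughts?

Thanks!!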

Best answer

Yes. Just replace your code with:

import spark.implicits._                        // enables the $"col" syntax

df
  .withColumn("prefix", $"user".substr(1, 2))   // add a prefix column (substr positions are 1-based)
  .write.mode("overwrite")
  .partitionBy("prefix")                        // use it for partitioning
  .parquet("path/to/location")
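
To benefit from this layout on the read side, the query has to filter on the partition column itself; a filter on `user` alone does not tell Spark which `prefix=...` directories to skip. The following is a minimal sketch, assuming the data was written to `path/to/location` as shown above. The `wanted` list, the object name, and the local master setting are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReadByPrefix {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use the cluster's configuration.
    val spark = SparkSession.builder()
      .appName("read-by-prefix")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical daily list of users whose attributes are needed.
    val wanted = Seq("Kim", "Kelly", "Zee")

    // Derive prefixes with the same rule used at write time (first two characters).
    val prefixes = wanted.map(_.take(2)).distinct

    val result = spark.read.parquet("path/to/location")
      .where(col("prefix").isin(prefixes: _*)) // filter on the partition column: limits which directories are scanned
      .where(col("user").isin(wanted: _*))     // then keep only the exact users

    result.show(false)
    spark.stop()
  }
}
```

An `isin` filter suits a small user list. For a large list, a join against a DataFrame of users that carries the same derived `prefix` column is an alternative, though whether partitions are pruned in that case depends on the Spark version and its optimizer settings.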

A similar question about "apache-spark - Spark: writing Parquet partitioned by leading letters" can be found on Stack Overflow: https://stackoverflow.com/questions/50395139/
