我正在加载 parquet 文件作为 Spark 数据集。我可以从查询中查询并创建新的数据集。现在,我想向数据集添加一个新列(“hashkey”)并生成值(例如 md5sum(nameValue))。我怎样才能实现这个目标?
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Hello Spark");
sparkConf.setMaster("local");
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
.config("spark.master", "local").config("spark.sql.warehouse.dir", "file:///C:\\spark_warehouse")
.getOrCreate();
Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("meetup.parquet");
df.show();
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT * FROM tmpview where name like 'Spark-%'");
namesDF.show();
}
输出如下所示:
+-------------+-----------+-----+---------+--------------------+
| name|meetup_date|going|organizer| topics|
+-------------+-----------+-----+---------+--------------------+
| Spark-H20| 2016-01-01| 50|airisdata|[h2o, repeated sh...|
| Spark-Avro| 2016-01-02| 60|airisdata| [avro, usecases]|
|Spark-Parquet| 2016-01-03| 70|airisdata| [parquet, usecases]|
+-------------+-----------+-----+---------+--------------------+
最佳答案
只需在查询中添加 MD5 的 Spark sql 函数即可。
Dataset<Row> namesDF = spark.sql("SELECT *, md5(name) as modified_name FROM tmpview where name like 'Spark-%'");
关于java - 向 Spark 数据集添加列并转换数据,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43323158/