I have a text file like the following:
1234_4567_DigitalDoc_XRay-01.pdf
2345_5678_DigitalDoc_CTC-03.png
1234_5684_DigitalDoc_XRay-05.pdf
1234_3345_DigitalDoc_XRay-02.pdf
My expected output is:
|catg|sub_catg|doc_name              |revision_label|extension|
|1234|4567    |DigitalDoc_XRay-01.pdf|01            |pdf      |
I created a custom schema:
val customSchema = StructType(
  StructField("catg", StringType, true)
    :: StructField("sub_catg", StringType, true)
    :: StructField("doc_name", StringType, true)
    :: StructField("revision_label", StringType, true)
    :: StructField("extension", StringType, true)
    :: Nil
)
I am trying to create a DataFrame:
val df = sparkSession.read
  .format("csv")
  .schema(customSchema)
  .option("delimiter", "_")
  .load("src/main/resources/data/sample.txt")
df.show()
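I would like to know how to split each line into a custom record.
I could probably write Java code like the following. Can someone help me do this in Spark? I am new to Spark.
String[] word = line.split("_");
String[] fileName = word[3].split("-");
String revision = fileName[1];
String record = word[0] + "," + word[1] + "," + word[2] + "_" + word[3] + "," + revision.replace(".", " ");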
Best answer
You can use Spark functions to extract the required details:
1. Load the data
// Assumes an active SparkSession named `spark`, as in spark-shell.
import org.apache.spark.sql.types._
import spark.implicits._ // provides toDS() on Seq[String]

val data =
  """
    |1234_4567_DigitalDoc_XRay-01.pdf
    |2345_5678_DigitalDoc_CTC-03.png
    |1234_5684_DigitalDoc_XRay-05.pdf
    |1234_3345_DigitalDoc_XRay-02.pdf
  """.stripMargin
val customSchema = StructType(
  StructField("catg", StringType, true)
    :: StructField("sub_catg", StringType, true)
    :: StructField("doc_name", StringType, true)
    :: StructField("revision_label", StringType, true)
    :: StructField("extension", StringType, true)
    :: Nil
)
// Parse the in-memory lines as CSV, using "_" as the separator.
val df = spark.read.schema(customSchema)
  .option("sep", "_")
  .csv(data.split(System.lineSeparator()).toSeq.toDS())
df.show(false)
df.printSchema()
Output:
+----+--------+----------+--------------+---------+
|catg|sub_catg|doc_name  |revision_label|extension|
+----+--------+----------+--------------+---------+
|1234|4567    |DigitalDoc|XRay-01.pdf   |null     |
|2345|5678    |DigitalDoc|CTC-03.png    |null     |
|1234|5684    |DigitalDoc|XRay-05.pdf   |null     |
|1234|3345    |DigitalDoc|XRay-02.pdf   |null     |
+----+--------+----------+--------------+---------+
root
 |-- catg: string (nullable = true)
 |-- sub_catg: string (nullable = true)
 |-- doc_name: string (nullable = true)
 |-- revision_label: string (nullable = true)
 |-- extension: string (nullable = true)
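The `extension` column is null in the output above because each line contains only four `_`-separated tokens, while the schema declares five fields. In its default permissive mode the CSV reader fills the missing trailing field with null, and `revision_label` temporarily holds the tail of the file name. A minimal plain-Scala sketch of that token count follows; it needs no Spark, and the object name `TokenCheck` is only illustrative.

```scala
object TokenCheck {
  // One sample line from the input file.
  val line: String = "1234_4567_DigitalDoc_XRay-01.pdf"

  // Splitting on "_" yields four tokens, one fewer than the five schema fields.
  val tokens: Array[String] = line.split("_")

  def main(args: Array[String]): Unit =
    tokens.zipWithIndex.foreach { case (token, i) => println(s"$i: $token") }
}
```

The fourth token, `XRay-01.pdf`, still carries both the revision number and the extension, which is why a second step is needed.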
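2. Extract the required information
import org.apache.spark.sql.functions._

// Order matters: doc_name and extension are derived before revision_label is overwritten.
df.withColumn("doc_name", concat_ws("_", col("doc_name"), col("revision_label")))
  .withColumn("extension", substring_index(col("revision_label"), ".", -1))
  .withColumn("revision_label", regexp_extract(col("revision_label"), """\d+""", 0))
  .show(false)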
Output:
+----+--------+----------------------+--------------+---------+
|catg|sub_catg|doc_name              |revision_label|extension|
+----+--------+----------------------+--------------+---------+
|1234|4567    |DigitalDoc_XRay-01.pdf|01            |pdf      |
|2345|5678    |DigitalDoc_CTC-03.png |03            |png      |
|1234|5684    |DigitalDoc_XRay-05.pdf|05            |pdf      |
|1234|3345    |DigitalDoc_XRay-02.pdf|02            |pdf      |
+----+--------+----------------------+--------------+---------+
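To check the extraction logic without a Spark session, the same result can be sketched in plain Scala with a single regular expression. This is only a sketch under assumptions: the names `FileNameParser`, `DocRecord` and `parseLine` are hypothetical, and the pattern assumes every line looks like the four samples (two numeric prefixes, then a name ending in `-<digits>.<extension>`). Unlike the `regexp_extract` call above, which takes the first run of digits in the last token, this pattern takes the digits immediately before the extension; for the sample data the two agree.

```scala
object FileNameParser {
  // Hypothetical record type mirroring the five columns of customSchema.
  final case class DocRecord(
      catg: String,
      subCatg: String,
      docName: String,
      revisionLabel: String,
      extension: String
  )

  // Groups, in order: catg, sub_catg, doc_name (everything after the second "_"),
  // revision digits, extension.
  private val Pattern = """^(\d+)_(\d+)_(.+-(\d+)\.(\w+))$""".r

  // Returns None for lines that do not match the assumed layout.
  def parseLine(line: String): Option[DocRecord] = line.trim match {
    case Pattern(catg, subCatg, docName, revision, ext) =>
      Some(DocRecord(catg, subCatg, docName, revision, ext))
    case _ => None
  }

  def main(args: Array[String]): Unit =
    println(parseLine("1234_4567_DigitalDoc_XRay-01.pdf"))
}
```

Returning an `Option` lets malformed lines be dropped rather than failing the whole job. A function like this could presumably also be applied to a `Dataset[String]` read with `spark.read.textFile`, though the column-function version above keeps the work inside Spark's built-in expressions.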
Regarding "dataframe - how to convert a file with multiple delimiters into a DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62130128/