I have a torque column with 2,500 rows in a Spark DataFrame, with data like the following:
torque
190Nm@ 2000rpm
250Nm@ 1500-2500rpm
12.7@ 2,700(kgm@ rpm)
22.4 kgm at 1750-2750rpm
11.5@ 4,500(kgm@ rpm)
I want to split each row into two columns, Nm and rpm, for example:
Nm | rpm
190Nm | 2000rpm
250Nm | 1500-2500rpm
12.7Nm | 2,700(kgm@ rpm)
22.4 kgm | 1750-2750rpm
11.5Nm | 4,500(kgm@ rpm)
How can I do this in Databricks?
I tried the following:
from pyspark.sql.functions import split, trim, regexp_extract, when
df=cars
# Assuming the name of your dataframe is "df" and the torque column is "torque"
df = df.withColumn("torque_split", split(df["torque"], "@"))
# Extract the torque values and units, assign to columns 'torque_value' and 'torque_units'
df = df.withColumn("torque_value", trim(regexp_extract(df["torque_split"].getItem(0), r'\d+\.?\d*', 0)))
df = df.withColumn("torque_units", trim(regexp_extract(df["torque_split"].getItem(0), r'[a-zA-Z]+', 0)))
# Extract the rpm values and assign to the 'rpm' column
df = df.withColumn("rpm", trim(regexp_extract(df["torque_split"].getItem(1), r'\d+-?\d*\s?rpm', 0)))
# Convert kgm values to Nm
df = df.withColumn("Nm",
                   when(df["torque_units"] == "kgm", df["torque_value"] * 9.80665)
                   .otherwise(df["torque_value"]))
# Drop the original torque, torque_split and torque_units columns
df = df.drop("torque", "torque_split", "torque_units", "torque_value")
# Show the resulting dataframe
df.display()
But when the data looks like 2,700(kgm@ rpm), I get null values.
Best answer
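One likely reason for the missing rpm values can be reproduced without Spark. The sketch below is plain Python, not the asker's Spark job: the helper names rpm_part and extract_rpm are hypothetical, and it assumes that Python's re module behaves like Spark's Java regex engine for this simple pattern. It mimics the split on "@" followed by the rpm pattern from the question.

```python
import re

# Pattern copied from the question's rpm extraction step.
RPM_PATTERN = r"\d+-?\d*\s?rpm"


def rpm_part(torque):
    """Mimic split(torque, "@").getItem(1): the piece after the first '@'."""
    parts = torque.split("@")
    # Assumption: an out-of-range getItem surfaces as null in Spark (non-ANSI mode).
    return parts[1] if len(parts) > 1 else None


def extract_rpm(torque):
    """Mimic regexp_extract(..., 0): matched text, or '' when nothing matches."""
    part = rpm_part(torque)
    if part is None:
        return None
    match = re.search(RPM_PATTERN, part)
    return match.group(0) if match else ""


for sample in ["190Nm@ 2000rpm", "12.7@ 2,700(kgm@ rpm)", "22.4 kgm at 1750-2750rpm"]:
    print(repr(sample), "->", repr(rpm_part(sample)), "->", repr(extract_rpm(sample)))
```

For "12.7@ 2,700(kgm@ rpm)" the second piece is " 2,700(kgm", which contains no "rpm" and has a comma the pattern does not allow, so nothing is extracted. "22.4 kgm at 1750-2750rpm" has no "@" at all, so there is no second piece to search.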
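A method without UDFs:
1. Use regexp_extract_all to extract the first two groups of numbers.
2. If the original string contains kgm, set a factor to adjust the torque value.
3. Split the array from step 1 into two columns and multiply the torque value by the factor from step 2.
4. Drop the intermediate columns.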
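from pyspark.sql import functions as F

df = ...

# Step 1: regexp_extract_all is a Spark SQL function (Spark 3.1+), called here through F.expr.
# Step 2: factor is 9.80665 for rows that mention kgm, otherwise 1.0.
# Step 3: numbers[0] is the torque value, numbers[1] is the rpm text.
# Step 4: drop the helper columns.
df.withColumn("numbers", F.expr("regexp_extract_all(Torque, '([0-9,.-]+)')")) \
    .withColumn("factor", F.when(F.instr("Torque", "kgm") > 0, 9.80665).otherwise(1.0)) \
    .withColumn("torque value", F.col("numbers")[0] * F.col("factor")) \
    .withColumn("rpm", F.col("numbers")[1]) \
    .drop("numbers", "factor") \
    .show(truncate=False)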
Result:
+------------------------+------------------+---------+
|Torque |torque value |rpm |
+------------------------+------------------+---------+
|190Nm@ 2000rpm |190.0 |2000 |
|250Nm@ 1500-2500rpm |250.0 |1500-2500|
|12.7@ 2,700(kgm@ rpm) |124.54445499999999|2,700 |
|22.4 kgm at 1750-2750rpm|219.66895999999997|1750-2750|
|11.5@ 4,500(kgm@ rpm) |112.77647499999999|4,500 |
+------------------------+------------------+---------+
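The extraction logic can also be checked locally before running it on a cluster. This is a minimal sketch in plain Python, assuming that re.findall matches the same substrings as regexp_extract_all for this character class; the function name split_torque and the constant names are hypothetical and not part of the answer.

```python
import re

KGM_TO_NM = 9.80665            # same conversion factor the answer uses
NUMBER_PATTERN = r"[0-9,.-]+"  # same character class as the answer's regex


def split_torque(torque):
    """Return (torque value in Nm, rpm text) for one raw torque string."""
    # Step 1: collect every run of digits, commas, dots and hyphens.
    numbers = re.findall(NUMBER_PATTERN, torque)
    # Step 2: rows that mention kgm are converted to Nm.
    factor = KGM_TO_NM if "kgm" in torque else 1.0
    # Step 3: the first run is the torque value, the second is the rpm text.
    # float() would fail on a torque value written with a thousands comma.
    return float(numbers[0]) * factor, numbers[1]


samples = [
    "190Nm@ 2000rpm",
    "250Nm@ 1500-2500rpm",
    "12.7@ 2,700(kgm@ rpm)",
    "22.4 kgm at 1750-2750rpm",
    "11.5@ 4,500(kgm@ rpm)",
]
for sample in samples:
    print(sample, "->", split_torque(sample))
```

The pattern is deliberately loose, so a stray ".", "," or "-" elsewhere in a row would also count as a number. That does not happen in the five sample rows, but it may be worth checking against the full 2,500 rows.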
Source: the Stack Overflow question on python - splitting a column in a Spark DataFrame: https://stackoverflow.com/questions/76015864/