我想阅读来自 Rigol MSO5000 示波器的带有 Pola-rs 的以下(非常破损,IMO)CSV:
D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,
+2.400000E+02,0.000000E+00,,
(...)
在发现 Selecting with indexing is discouraged in Pola-rs 之前,这是我当前的 Jupyter Notebook 迭代/尝试:
import polars as pl
df = pl.read_csv("normal0.csv")
# Grab initial condition and increments
t0 = df.columns[2]; assert "t0" in t0; t0 = float(t0.split('=')[1].replace('s', '').strip())
tinc = df.columns[3]; assert "tInc" in tinc; tinc = float(tinc.split('=')[1].strip())
# TODO: Generate Series() from t0+tinc and include it in the final DataFrame
# Reshape and cleanup
probes = df.with_column(df["D7-D0"].cast(pl.Float32).alias("D0-D7 floats")) \
.with_column(df["D15-D8"].cast(pl.Float32).alias("D15-D8 floats")) \
.drop(df.columns[2]) \
.drop(df.columns[3])
(...)
def split_probes(probes: pl.Series):
''' Splits bundles of probe cables such as D0-D7 or D15-D8 into individual dataframe columns
'''
out = pl.DataFrame(columns=["D"+str(line) for line in range(0,16)])
# for row in range(probes.height):
# for probe in range(0, 7):
# out["D"+str(probe)].with_columns(probes["D0-D7 floats"][row % (probe + 1)])
# for probe in reversed(range(9, 16)): # TODO: Fix future captures or parametrise this
# outprobes["D15-D8 floats"][row % probe]
这是我在 I was told on Polars Discord that this problem might not be optimally solvable with dataframe libraries 时的低级 CSV 解析 Rust 伪代码方法:
use csv;
use std::error::Error;
use std::io;
use std::process;
use serde::Deserialize;
#[derive(Debug, Deserialize)]
struct OrigOscilloscope {
#[serde(rename = "D7-D0")]
d7_d0: String, // TODO: Unfortunately those fields are "user-flippable" in order from the scope, i.e: d0_d7 vs d7_d0
#[serde(rename = "D15-D8")]
d15_d8: String,
// Do not even register those on Serde as they are empty rows anyway
// t0: Option<String>,
// t_inc: Option<String>
}
#[derive(Debug, Deserialize)]
struct LAProbesOscilloscopeState {
//Vec<d0...d15>: Vec<Vec<16*f32>>, // TODO: More appropriate struct representation for the target dataframe
d0: f32,
d1: f32,
d2: f32,
d3: f32,
d4: f32,
d5: f32,
d6: f32,
d7: f32,
d8: f32,
d9: f32,
d10: f32,
d11: f32,
d12: f32,
d13: f32,
d14: f32,
d15: f32,
timestamp: f32
}
fn run() -> Result<(), Box<dyn Error>> {
let mut rdr = csv::ReaderBuilder::new()
.has_headers(false)
.flexible(true) // ignore broken header
.from_reader(io::stdin());
// Get timeseries information from header... :facepalm: rigol
// Initial timestamp...
let header = rdr.headers()?.clone();
let t0_header: Vec<&str> = header[2].split('=').collect();
let t0 = t0_header[1].trim_start().replace('s', "").parse::<f32>()?;
// ...and increments
let tinc_header: Vec<&str> = header[3].split('=').collect();
let tinc = tinc_header[1].trim_start().parse::<f32>()?;
println!("Initial timestamp {t0} with increments of {tinc} seconds");
// Now do the splitting of wires from its Dx-Dy "bundles"
let mut timestamp = t0;
for result in rdr.deserialize().skip(1) {
let row: OrigOscilloscope = result?;
// if rdr.position().line().rem_euclid(8) {
// // Read D0-D15 field(s), expanding them into the current row, matching its column
// }
// println!("{:#?}", row.d7_d0.parse::<f32>()?);
// update timestamp for this row
timestamp = timestamp + tinc;
}
Ok(())
}
fn main() {
if let Err(err) = run() {
println!("{}", err);
process::exit(1);
}
}
我希望很明显,所需的目标数据框看起来像这样:
d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, timestamp
95, 95, 247, 159, 247, 240, 159, 247 (...) -25.01+000000.2
(...)
生成的代码应该(理想情况下?)有效地使用 Polars。此外,如上所述,避免“使用索引选择”,因为它可能在未来在 Polars 中完全弃用。
最佳答案
也许您正在寻找 unstack
(它的性能令人难以置信)。
让我们从这些数据开始(它只是复制您提供的值)。我还更改了列名,只是为了让事情更容易检查:
import polars as pl
from io import StringIO
normal_csv = """D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,"""
df = pl.read_csv(file=StringIO(normal_csv),
new_columns=('D', 'E', 't0', 'tInc'))
df
shape: (16, 4)
┌───────┬───────┬──────┬──────┐
│ D ┆ E ┆ t0 ┆ tInc │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str │
╞═══════╪═══════╪══════╪══════╡
│ 247.0 ┆ 2.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0 ┆ 18.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0 ┆ 18.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 2.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 16.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 240.0 ┆ 2.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 159.0 ┆ 18.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 247.0 ┆ 130.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0 ┆ 18.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 95.0 ┆ 18.0 ┆ null ┆ null │
└───────┴───────┴──────┴──────┘
由于我不熟悉您的数据源,我假设您的输入数据具有正则化模式——特别是始终提供 D7-D0(即,csv 文件中没有跳过的行)。
如果是这样,这里有一些高性能代码应该可以启动......
(
df
.reverse()
.unstack(step=8,
how='horizontal',
columns=('D', 'E'),
)
)
shape: (2, 16)
┌──────┬──────┬───────┬───────┬───────┬───────┬───────┬───────┬──────┬──────┬───────┬──────┬───────┬─────┬──────┬─────┐
│ D_0 ┆ D_1 ┆ D_2 ┆ D_3 ┆ D_4 ┆ D_5 ┆ D_6 ┆ D_7 ┆ E_0 ┆ E_1 ┆ E_2 ┆ E_3 ┆ E_4 ┆ E_5 ┆ E_6 ┆ E_7 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪══════╪══════╪═══════╪══════╪═══════╪═════╪══════╪═════╡
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
└──────┴──────┴───────┴───────┴───────┴───────┴───────┴───────┴──────┴──────┴───────┴──────┴───────┴─────┴──────┴─────┘
这没有您想要的列名称(您可以更改)。它也不包含时间信息。但它可能会帮助您朝着正确的方向开始。
关于csv - 使用索引选择是 Polars : How to parse and transform (select/filter? 中的反模式)似乎需要这样的 CSV?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/74841242/