r - Importing irregular data in R

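I'm hoping someone can help me with a data import problem. I suspect the solution is simple, but I haven't found an answer yet. I have a large number of txt files containing antenna scans, and I need to import them all in a uniform layout. The problem is that each file contains an irregular number of lines of diagnostic data about the antenna before the actual data begins. I need a function that can identify where the actual data starts, so that I can import it with the right data in the right columns. Basically, for each file I need to determine the number of lines of diagnostic output, so that I can specify skip = "" when reading the file with read.delim or something similar.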

Here is an example of one of the files I'm talking about:

Power OFF @ 12:05:50 02/15/13 
Power ON  @ 12:06:03 02/15/13 
Reader #1 12:06:03 02/15/13 

Reader #2 12:06:03 02/15/13 

Battery Voltage = 13.35 @ 13:00:00 02/15/13 
Battery Voltage = 13.42 @ 14:00:00 02/15/13 
Battery Voltage = 13.32 @ 15:00:00 02/15/13 
Battery Voltage = 13.55 @ 16:00:00 02/15/13 

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Best answer

`read.table`

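If you read the text in line by line with readLines, you can use grep to find the highest line number that matches "Battery Voltage" and use that as the skip value.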

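# file.txt stands for the path to one scan file (e.g. a character variable)
read.table(file.txt, 
           # skip everything up to and including the last diagnostic line
           skip = max(grep('Battery Voltage', readLines(file.txt))), 
           # set comment delimiting character to anything besides "#"
           comment.char = '')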
##       V1 V2       V3       V4 V5  V6       V7
## 1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11

Note that further cleanup is required (merging columns, formatting dates).
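The cleanup mentioned above could look roughly like the following. This is only a sketch: the demo file is made up to mirror the sample in the question, and the output column names (`reader`, `datetime`, `type`, `code`, `tag_id`) are hypothetical labels, since the question does not say what the fields mean. Reading every column as character also keeps the long ID from being displayed in scientific notation.

```r
# Made-up demo file that mirrors the sample in the question
sample_lines <- c(
  "Power OFF @ 12:05:50 02/15/13",
  "Power ON  @ 12:06:03 02/15/13",
  "Reader #1 12:06:03 02/15/13",
  "",
  "Battery Voltage = 13.35 @ 13:00:00 02/15/13",
  "Battery Voltage = 13.42 @ 14:00:00 02/15/13",
  "",
  "Reader #2 02:57:40 02/17/13 LA 900 226000012999",
  "",
  "Reader #2 02:57:40 02/17/13 LA 900 226000012999"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Line number of the last diagnostic line, used as the skip count
skip_n <- max(grep("Battery Voltage", readLines(path)))

# Read everything as character so nothing is silently converted
raw <- read.table(path, skip = skip_n, comment.char = "",
                  colClasses = "character")

# Merge the split columns and parse the date-time
# (column names below are hypothetical)
clean <- data.frame(
  reader   = paste(raw$V1, raw$V2),
  datetime = as.POSIXct(paste(raw$V4, raw$V3),
                        format = "%m/%d/%y %H:%M:%S", tz = "UTC"),
  type     = raw$V5,
  code     = as.integer(raw$V6),
  tag_id   = raw$V7,
  stringsAsFactors = FALSE
)
```

The time zone is set to UTC here purely as an assumption; substitute whatever zone the readers were actually logging in.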

`read.fwf`

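It may make more sense to use read.fwf (fixed-width file) if the column widths are consistent. You will need na.omit, complete.cases, or some other way of dropping the empty rows, because read.fwf does not accept the blank.lines.skip argument the way read.table and its variants do: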

na.omit(read.fwf(file.txt, 
                 widths = c(9, -1, 17, -1, 2, -1, 3, -1, 12), 
                 skip = max(grep('Battery Voltage', readLines(file.txt))), 
                 comment.char = ''))
##          V1                V2 V3  V4       V5
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 6 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 8 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11

However, working out the column widths by counting characters is a pain (and error-prone).
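One way to avoid counting by hand is to let nchar do it on a representative data row. The sketch below assumes every data row has the same layout as the sample row and that fields are separated by exactly one space; `fields` and `widths_spec` are illustrative names.

```r
# One representative data row, split by hand into its logical fields
data_row <- "Reader #2 02:57:40 02/17/13 LA 900 226000012999"
fields   <- c("Reader #2", "02:57:40 02/17/13", "LA", "900", "226000012999")

# Width of each field in characters
widths <- nchar(fields)

# Interleave -1 (skip one separator character) between the fields,
# then drop the trailing -1 after the last field
widths_spec <- head(as.vector(rbind(widths, -1)), -1)

# Sanity check: field widths plus separators should cover the whole row
stopifnot(sum(abs(widths_spec)) == nchar(data_row))
```

The resulting vector has the same form as the `widths` argument passed to read.fwf above, so it can be used in its place.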

`readr::read_fwf`

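The readr package makes working with fixed-width files slightly less annoying, and gives useful warnings when things don't parse as expected. It also provides arguments for parsing dates and datetimes as you read the data in, which is handy: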

library(readr)

df <- read_fwf(file.txt, 
               fwf_widths(c(9, 18, 3, 4, NA)), 
               col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd'), 
               skip = max(grep('Battery Voltage', readLines(file.txt))))

df <- df[complete.cases(df), ]
# or df <- na.omit(df)
# or if some NAs are possible, more robust:
# df <- df[colSums(!apply(df, 1, is.na)) > 0, ]

df
## # A tibble: 4 x 5
##          X1                  X2    X3    X4       X5
##       <chr>              <time> <chr> <int>    <dbl>
## 1 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11

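Note the nicely parsed datetimes and the slightly simpler way of specifying column widths (you can use fwf_empty to have it guess, which works well if you have column names).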

If your widths and column types are right, anything that does not fit will come in as NA, so if you use na.omit you may be able to avoid the skip argument entirely:

na.omit(read_fwf(file.txt, 
                 fwf_widths(c(9, 18, 3, 4, 13)), 
                 col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd')))
## # A tibble: 4 x 5
##          X1                  X2    X3    X4       X5
##       <chr>              <time> <chr> <int>    <dbl>
## 1 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11

This approach is somewhat fragile, though, so only use it if you can verify that it works correctly.
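Since the question involves many files with different numbers of diagnostic lines, a more explicit alternative is to keep only the lines that look like data rows and parse those. Note that this swaps skip for a regular-expression filter, which is not what the answer above does. The sketch assumes data rows always follow the pattern of the sample ("Reader #N", time, date, then three more fields); `read_scan_file` and `data_pattern` are hypothetical names, and the two demo files are made up.

```r
# Assumed shape of a data row: "Reader #N hh:mm:ss mm/dd/yy XX 900 <digits>"
data_pattern <- "^Reader #\\d+ \\d{2}:\\d{2}:\\d{2} \\d{2}/\\d{2}/\\d{2} \\S+ \\d+ \\d+\\s*$"

# Read one scan file, keeping only lines that match the data-row pattern
read_scan_file <- function(path, pattern = data_pattern) {
  lines <- readLines(path)
  data_lines <- grep(pattern, lines, value = TRUE, perl = TRUE)
  if (length(data_lines) == 0) stop("no data rows found in ", path)
  out <- read.table(text = data_lines, comment.char = "",
                    colClasses = "character")
  out$source_file <- basename(path)  # remember where each row came from
  out
}

# Two made-up files with different amounts of diagnostic output
data_row <- "Reader #2 02:57:40 02/17/13 LA 900 226000012999"
file_a <- tempfile(fileext = ".txt")
file_b <- tempfile(fileext = ".txt")
writeLines(c("Power OFF @ 12:05:50 02/15/13",
             "Power ON  @ 12:06:03 02/15/13",
             "Reader #1 12:06:03 02/15/13",
             "",
             "Battery Voltage = 13.35 @ 13:00:00 02/15/13",
             "", data_row, "", data_row), file_a)
writeLines(c("Power ON  @ 12:06:03 02/15/13",
             "Battery Voltage = 13.42 @ 14:00:00 02/15/13",
             data_row,
             "Battery Voltage = 13.32 @ 15:00:00 02/17/13",
             data_row, data_row), file_b)

# Apply the same reader to every file and stack the results
scans <- do.call(rbind, lapply(c(file_a, file_b), read_scan_file))
```

Because the filter is applied line by line, a diagnostic line that shows up between data rows (as in the second demo file) is dropped as well. Comparing the number of matched lines with the number of non-empty lines per file is a cheap way to check that the pattern is not discarding real data.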

Regarding r - Importing irregular data in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38881061/
