python - From Python to R - DataFrame from a string

Tags: python r regex

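I have the following working example in Python. It takes a string, applies a dict comprehension with a regular expression to each line, and builds a DataFrame from the result: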

import re, pandas as pd

junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

rx = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')
records = [{m.group('key'): m.group('value') 
            for m in rx.finditer(line)} 
            for line in junk.split("\n")]
df = pd.DataFrame(records)
print(df)

This produces

  buffers  cached    free  shared   total    used
0   304MB  1059MB  5711MB     0MB  7871MB  2159MB
1    30MB  1059MB    71MB  3159MB  5751MB     5MB
2    30MB  1059MB   109MB  3159MB  5751MB     5MB
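
On the Python side, missing values need no special handling: pd.DataFrame(records) fills any key that is absent from a record with NaN. Below is a minimal standard-library sketch of that fill step. It assumes a hypothetical two-line input whose second line has no shared entry; the helper name parse_line and the use of None as the filler are illustrative choices, not part of the code above.

```python
import re

rx = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')

def parse_line(line):
    """Turn 'a=1;b=2;' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    return {m.group('key'): m.group('value') for m in rx.finditer(line)}

# Hypothetical input: the second line has no "shared" entry.
junk = "total=7871MB;free=5711MB;shared=0MB;\nfree=71MB;total=5751MB;"
records = [parse_line(line) for line in junk.split("\n")]

# Collect every key seen on any line, in first-seen order.
columns = list(dict.fromkeys(key for rec in records for key in rec))

# Fill absent keys with None so that every row has the same columns.
rows = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in rows:
    print(row)
```

Whatever the R solution looks like, it should give the same shape: one column per key seen anywhere, with an NA where a line lacks that key.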


Now, how can I do the same thing in R?
I messed around with lapply and regmatches, but to no avail. Also, how would I handle missing values?

Best answer

A purrr option:

library(purrr)

'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;' %>% 
    strsplit('\n') %>% .[[1]] %>%    # separate lines into character vector
    strsplit(';') %>%     # separate each line into a list of key-value pairs
    map(strsplit, '=') %>%    # split key-value pairs into length-2 sublists
    map(transpose) %>%    # flip list of key-value pairs to list of keys and values
    map_dfr(~set_names(.x[[2]], .x[[1]]))    # set names of values to keys and simplify to data frame
#> # A tibble: 3 x 6
#>   total  free   used   shared buffers cached
#>   <chr>  <chr>  <chr>  <chr>  <chr>   <chr> 
#> 1 7871MB 5711MB 2159MB 0MB    304MB   1059MB
#> 2 5751MB 71MB   5MB    3159MB 30MB    1059MB
#> 3 5751MB 109MB  5MB    3159MB 30MB    1059MB
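
For readers coming from the Python side, the same chain of steps can be mirrored with plain lists. This is only a rough sketch of the idea, not a translation of what purrr does internally; the intermediate names (pairs, split_pairs, transposed) are made up for illustration. One difference worth noting: Python's str.split keeps the empty string after a trailing semicolon, so it is filtered out explicitly here.

```python
text = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

# Separate lines into a list of strings.
lines = text.split("\n")

# Separate each line into 'key=value' strings, dropping the empty
# string that the trailing semicolon leaves behind.
pairs = [[p for p in line.split(";") if p] for line in lines]

# Split each 'key=value' string into a [key, value] list.
split_pairs = [[p.split("=", 1) for p in line] for line in pairs]

# Flip each line from a list of [key, value] pairs
# into a (keys, values) pair of tuples.
transposed = [list(zip(*line)) for line in split_pairs]

# Name the values by their keys: one dict per line.
records = [dict(zip(keys, values)) for keys, values in transposed]

for rec in records:
    print(rec)
```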

Or a more data-frame-centric option:

library(tidyverse)

# put text in data frame
tibble(text = 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;') %>% 
    separate_rows(text, sep = '\n') %>%    # separate lines into separate rows
    rowid_to_column('line') %>%    # add index for each line to help spreading later
    separate_rows(text, sep = ';') %>%    # separate each line into key-value pairs
    filter(text != '') %>%    # drop extra entries from superfluous semicolons
    separate(text, c('key', 'value')) %>%    # separate keys and values into columns
    spread(key, value) %>%    # reshape to wide form
    select(-line)    # drop line index column
#> # A tibble: 3 x 6
#>   buffers cached free   shared total  used  
#>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 304MB   1059MB 5711MB 0MB    7871MB 2159MB
#> 2 30MB    1059MB 71MB   3159MB 5751MB 5MB   
#> 3 30MB    1059MB 109MB  3159MB 5751MB 5MB
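
The long-then-wide reshape in this pipeline may be easier to follow with the intermediate table spelled out. The sketch below is a plain-Python illustration under my own naming (long_rows, wide, table); it is not how tidyr implements the reshape. It first builds one (line, key, value) row per entry, then regroups the rows by line number, with columns sorted alphabetically to match the output above.

```python
text = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

# Long form: one (line, key, value) tuple per entry.
long_rows = [
    (line_no, *entry.split("=", 1))
    for line_no, line in enumerate(text.split("\n"), start=1)
    for entry in line.split(";")
    if entry  # drop empty entries left by trailing semicolons
]

# Wide form: regroup the long rows into one dict per line number.
wide = {}
for line_no, key, value in long_rows:
    wide.setdefault(line_no, {})[key] = value

# One column per distinct key, sorted alphabetically.
columns = sorted({key for _, key, _ in long_rows})

# A key missing from a line would show up as None here.
table = [[wide[n].get(col) for col in columns] for n in sorted(wide)]

print(columns)
for row in table:
    print(row)
```

The line index plays the same role as in the R version: without it there is nothing to say which entries belong to the same row once they have been split apart.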

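If you want to avoid packages, you can hack it with read.dcf, which reads Debian Control Format (the format of R package DESCRIPTION files), which is just key-value pairs. DCF uses : instead of = and \n instead of ;, so you need to do a little gsubing first: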

junk <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;'

junk <- gsub('=', ':', junk) 
junk <- gsub(';', '\n', junk)
mat <- read.dcf(textConnection(junk))
mat
#>      total    free     used     shared   buffers cached  
#> [1,] "7871MB" "5711MB" "2159MB" "0MB"    "304MB" "1059MB"
#> [2,] "5751MB" "71MB"   "5MB"    "3159MB" "30MB"  "1059MB"
#> [3,] "5751MB" "109MB"  "5MB"    "3159MB" "30MB"  "1059MB"

It returns a matrix, but one that is well formatted and easy to convert to a proper data.frame:

df <- as.data.frame(mat, stringsAsFactors = FALSE)
df
#>    total   free   used shared buffers cached
#> 1 7871MB 5711MB 2159MB    0MB   304MB 1059MB
#> 2 5751MB   71MB    5MB 3159MB    30MB 1059MB
#> 3 5751MB  109MB    5MB 3159MB    30MB 1059MB
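
To see why the two substitutions are enough, it helps to look at the string they produce: every ";" followed by a line break becomes a blank line, which is what separates records in DCF. The sketch below repeats the substitutions in Python and then reads the result with a small hand-rolled parser. The parser is an assumption made for illustration and covers only this simple case; it is not how read.dcf is implemented.

```python
junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

# The same two substitutions as the gsub() calls.
dcf = junk.replace("=", ":").replace(";", "\n")

# Each ";" plus line break in the input is now "\n\n",
# i.e. a blank line between records.
records = []
for block in dcf.split("\n\n"):
    # One "key:value" field per non-empty line of the block.
    fields = dict(line.split(":", 1) for line in block.splitlines() if line)
    if fields:
        records.append(fields)

print(dcf)
for rec in records:
    print(rec)
```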

Regarding python - From Python to R - DataFrame from a string, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49369657/
