python - From Python to R - DataFrame from a string

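I have the following working example in Python. It takes a string, applies a dictionary comprehension with a regular expression to it, and finally builds a DataFrame from the result: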

import re, pandas as pd

junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

rx = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')
records = [{m.group('key'): m.group('value') 
            for m in rx.finditer(line)} 
            for line in junk.split("\n")]
df = pd.DataFrame(records)
print(df)

This produces

  buffers  cached    free  shared   total    used
0   304MB  1059MB  5711MB     0MB  7871MB  2159MB
1    30MB  1059MB    71MB  3159MB  5751MB     5MB
2    30MB  1059MB   109MB  3159MB  5751MB     5MB

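Now, how can I do the same thing in R?
I messed around with lapply and regmatches, but to no avail. Also, how would I deal with missing values, i.e. a key that does not appear on every line?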
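To make the missing-values part of the question concrete, here is a minimal sketch in plain Python (no pandas) of the behaviour being asked for: every line is parsed into a dict, and keys absent from a line are padded with None. The helper name `parse_records` and the shortened sample input are assumptions made for illustration; they are not part of the original post.

```python
import re

# Same pattern as in the question: key=value pairs separated by semicolons.
RX = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')

def parse_records(text):
    """Parse each line into a dict, then pad absent keys with None."""
    rows = [{m.group('key'): m.group('value') for m in RX.finditer(line)}
            for line in text.splitlines()]
    # Union of keys in first-seen order, so every row gets the same columns.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns, [[row.get(key) for key in columns] for row in rows]

# Hypothetical input: the second line has no "used" field.
sample = "total=7871MB;free=5711MB;used=2159MB;\nfree=71MB;total=5751MB;"
columns, table = parse_records(sample)
print(columns)
for record in table:
    print(record)
```

Any R solution should behave the same way: one column per key seen anywhere in the input, with a missing marker where a line lacks that key.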

Best answer

A purrr option:

library(purrr)

'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;' %>% 
    strsplit('\n') %>% .[[1]] %>%    # separate lines into character vector
    strsplit(';') %>%     # separate each line into a list of key-value pairs
    map(strsplit, '=') %>%    # split key-value pairs into length-2 sublists
    map(transpose) %>%    # flip list of key-value pairs to list of keys and values
    map_dfr(~set_names(.x[[2]], .x[[1]]))    # set names of values to keys and simplify to data frame
#> # A tibble: 3 x 6
#>   total  free   used   shared buffers cached
#>   <chr>  <chr>  <chr>  <chr>  <chr>   <chr> 
#> 1 7871MB 5711MB 2159MB 0MB    304MB   1059MB
#> 2 5751MB 71MB   5MB    3159MB 30MB    1059MB
#> 3 5751MB 109MB  5MB    3159MB 30MB    1059MB

Or a more data-frame-centric option:

library(tidyverse)

# put text in data frame
tibble(text = 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;') %>% 
    separate_rows(text, sep = '\n') %>%    # separate lines into separate rows
    rowid_to_column('line') %>%    # add index for each line to help spreading later
    separate_rows(text, sep = ';') %>%    # separate each line into key-value pairs
    filter(text != '') %>%    # drop extra entries from superfluous semicolons
    separate(text, c('key', 'value')) %>%    # separate keys and values into columns
    spread(key, value) %>%    # reshape to wide form
    select(-line)    # drop line index column
#> # A tibble: 3 x 6
#>   buffers cached free   shared total  used  
#>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 304MB   1059MB 5711MB 0MB    7871MB 2159MB
#> 2 30MB    1059MB 71MB   3159MB 5751MB 5MB   
#> 3 30MB    1059MB 109MB  3159MB 5751MB 5MB
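The role of the line index in the pipeline above may be easier to see in a small sketch. This is a plain-Python stand-in for the long-to-wide reshape, not tidyr's implementation; the function name `spread_rows` and the shortened data are assumptions for illustration. Once the pairs are in long form, the index is the only thing that says which output row each pair belongs to.

```python
# Long form: one (line, key, value) triple per pair,
# roughly what the data looks like after the separate() step.
long_rows = [
    (1, 'total', '7871MB'), (1, 'free', '5711MB'),
    (2, 'free', '71MB'), (2, 'total', '5751MB'),
]

def spread_rows(rows):
    """Group triples by line index and turn each group into one wide record."""
    wide = {}
    for line, key, value in rows:
        wide.setdefault(line, {})[key] = value
    # Order by line index, then drop it, mirroring select(-line).
    return [wide[line] for line in sorted(wide)]

wide_rows = spread_rows(long_rows)
print(wide_rows)
```

If all triples shared the same index, the sketch would collapse them into a single record, which is why the index has to be added before the pairs are split into separate rows.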

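If you want to avoid packages, you can hack it with read.dcf, which reads Debian Control Format (the format of R package DESCRIPTION files), which is just key-value pairs. DCF uses : instead of = and \n instead of ;, so you'll need to do a little gsubbing first. Conveniently, each line of the input already ends in ;, so after the substitution every original line break becomes a blank line, which is what DCF uses to separate records: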

junk <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;'

junk <- gsub('=', ':', junk) 
junk <- gsub(';', '\n', junk)
mat <- read.dcf(textConnection(junk))
mat
#>      total    free     used     shared   buffers cached  
#> [1,] "7871MB" "5711MB" "2159MB" "0MB"    "304MB" "1059MB"
#> [2,] "5751MB" "71MB"   "5MB"    "3159MB" "30MB"  "1059MB"
#> [3,] "5751MB" "109MB"  "5MB"    "3159MB" "30MB"  "1059MB"

It returns a matrix, but it is well-formed and easy to convert to a proper data.frame:

df <- as.data.frame(mat, stringsAsFactors = FALSE)
df
#>    total   free   used shared buffers cached
#> 1 7871MB 5711MB 2159MB    0MB   304MB 1059MB
#> 2 5751MB   71MB    5MB 3159MB    30MB 1059MB
#> 3 5751MB  109MB    5MB 3159MB    30MB 1059MB
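To see why the two gsub calls are enough, the sketch below mirrors them in Python and then splits the result on blank lines. It is only an illustration of the text transformation: `to_dcf_like` and `parse_dcf_like` are hypothetical helpers, the input is shortened, and the parser ignores DCF features such as continuation lines, so it is not a substitute for read.dcf.

```python
def to_dcf_like(text):
    """Mirror the two gsub calls: '=' becomes ':' and ';' becomes a newline."""
    return text.replace('=', ':').replace(';', '\n')

def parse_dcf_like(text):
    """Split blank-line-separated records into dicts of key -> value."""
    records = []
    for block in text.split('\n\n'):
        fields = dict(line.split(':', 1) for line in block.splitlines() if line)
        if fields:
            records.append(fields)
    return records

junk = "total=7871MB;free=5711MB;\nfree=71MB;total=5751MB;"
converted = to_dcf_like(junk)
records = parse_dcf_like(converted)
print(repr(converted))
print(records)
```

The trailing semicolon on each input line is what produces the blank line between records; input without it would run all pairs together into one record.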

Regarding python - From Python to R - DataFrame from a string, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49369657/
