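I have the following working example in Python. It takes a string, applies a dict comprehension with a regular expression to each line, and finally builds a DataFrame from the result: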
import re, pandas as pd
junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""
rx = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')
records = [{m.group('key'): m.group('value')
            for m in rx.finditer(line)}
           for line in junk.split("\n")]
df = pd.DataFrame(records)
print(df)
This produces:
  buffers  cached    free  shared   total    used
0   304MB  1059MB  5711MB     0MB  7871MB  2159MB
1    30MB  1059MB    71MB  3159MB  5751MB     5MB
2    30MB  1059MB   109MB  3159MB  5751MB     5MB
Now, how can I do the same thing in R? I have played around with lapply and regmatches, but to no avail. Also, how would I handle missing values?
Best answer
A purrr option:
library(purrr)
'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;' %>%
    strsplit('\n') %>% .[[1]] %>%    # separate lines into character vector
    strsplit(';') %>%    # separate each line into a list of key-value pairs
    map(strsplit, '=') %>%    # split key-value pairs into length-2 sublists
    map(transpose) %>%    # flip list of key-value pairs to list of keys and values
    map_dfr(~set_names(.x[[2]], .x[[1]]))    # set names of values to keys and simplify to data frame
#> # A tibble: 3 x 6
#>   total  free   used   shared buffers cached
#>   <chr>  <chr>  <chr>  <chr>  <chr>   <chr>
#> 1 7871MB 5711MB 2159MB 0MB    304MB   1059MB
#> 2 5751MB 71MB   5MB    3159MB 30MB    1059MB
#> 3 5751MB 109MB  5MB    3159MB 30MB    1059MB
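For readers coming from the Python side, the same split-based steps (lines, then pairs, then key and value, with no regular expression) can be sketched in Python. This is only an illustration of the idea, not a translation of the purrr internals; the helper name `parse_line` is made up for this sketch.

```python
junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""


def parse_line(line):
    """Hypothetical helper: turn 'k1=v1;k2=v2;' into {'k1': 'v1', 'k2': 'v2'}."""
    # Drop the empty piece left over after the trailing ';'.
    pairs = [p for p in line.split(";") if p]
    # Split each pair once on '=' and use the two halves as key and value.
    return dict(p.split("=", 1) for p in pairs)


# One dict per input line, keyed by field name.
records = [parse_line(line) for line in junk.splitlines()]
for record in records:
    print(record)
```

Each dict plays the role of one named vector in the purrr pipeline; passing `records` to `pd.DataFrame` would give the table from the question.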
Or a more data-frame-centric option:
library(tidyverse)
# put text in data frame
tibble(text = 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;') %>%
    separate_rows(text, sep = '\n') %>%    # separate lines into separate rows
    rowid_to_column('line') %>%    # add index for each line to help spreading later
    separate_rows(text, sep = ';') %>%    # separate each line into key-value pairs
    filter(text != '') %>%    # drop extra entries from superfluous semicolons
    separate(text, c('key', 'value')) %>%    # separate keys and values into columns
    spread(key, value) %>%    # reshape to wide form
    select(-line)    # drop line index column
#> # A tibble: 3 x 6
#>   buffers cached free   shared total  used
#>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr>
#> 1 304MB   1059MB 5711MB 0MB    7871MB 2159MB
#> 2 30MB    1059MB 71MB   3159MB 5751MB 5MB
#> 3 30MB    1059MB 109MB  3159MB 5751MB 5MB
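The long-then-wide reshaping used here also has a rough pandas counterpart, which may help when comparing the two languages. The following is a sketch under the assumption of a reasonably recent pandas (one that provides `DataFrame.explode`); the variable names are chosen for illustration only.

```python
import pandas as pd

text = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

rows = text.splitlines()

# One row per input line, with an explicit line index for the later pivot.
lines = pd.DataFrame({"line": range(len(rows)), "text": rows})

# One row per key-value pair: split on ';' and explode the resulting lists.
pairs = lines.assign(text=lines["text"].str.split(";")).explode("text")

# Drop the empty entries produced by the trailing semicolons.
pairs = pairs[pairs["text"] != ""].reset_index(drop=True)

# Split each pair into separate key and value columns.
kv = pairs["text"].str.split("=", n=1, expand=True)
kv.columns = ["key", "value"]
long = pd.concat([pairs[["line"]], kv], axis=1)

# Reshape to wide form and drop the line index.
wide = long.pivot(index="line", columns="key", values="value").reset_index(drop=True)
wide.columns.name = None
print(wide)
```

As with `spread`, the pivot orders the columns alphabetically by key rather than by first appearance.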
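If you want to avoid packages, you can hack it with read.dcf, which reads Debian Control Format (like R package DESCRIPTION files), which is just key-value pairs. DCF uses : instead of = and \n instead of ;, though, so you'll need to do a little gsubing first. After the substitution, each original line break follows a newline that used to be a trailing ;, so the records end up separated by blank lines, which is how DCF delimits records: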
junk <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;'
junk <- gsub('=', ':', junk)
junk <- gsub(';', '\n', junk)
mat <- read.dcf(textConnection(junk))
mat
#>      total    free     used     shared   buffers cached
#> [1,] "7871MB" "5711MB" "2159MB" "0MB"    "304MB" "1059MB"
#> [2,] "5751MB" "71MB"   "5MB"    "3159MB" "30MB"  "1059MB"
#> [3,] "5751MB" "109MB"  "5MB"    "3159MB" "30MB"  "1059MB"
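To see why the two substitutions are enough, it can help to look at the intermediate text. The sketch below mimics the same replacements in Python and then parses the result by hand; it only illustrates the shape of the data and makes no claim about how read.dcf is implemented. The names `dcf_text` and `blocks` are invented for the example.

```python
junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

# Same two substitutions as the gsub calls: '=' -> ':' and ';' -> newline.
dcf_text = junk.replace("=", ":").replace(";", "\n")

# Every input line ended in ';' plus a newline, which is now two newlines,
# so a blank line sits between consecutive records.
blocks = [b for b in dcf_text.split("\n\n") if b.strip()]

# Within a record, each non-empty line is a single 'key:value' field.
records = [
    dict(field.split(":", 1) for field in block.splitlines() if field)
    for block in blocks
]

print(len(blocks))
print(records[0])
```

The blank-line separation is what lets a DCF reader treat each original line as its own record.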
read.dcf returns a matrix, but one that is well-formed and easy to convert to a proper data.frame:
df <- as.data.frame(mat, stringsAsFactors = FALSE)
df
#>    total   free   used shared buffers cached
#> 1 7871MB 5711MB 2159MB    0MB   304MB 1059MB
#> 2 5751MB   71MB    5MB 3159MB    30MB 1059MB
#> 3 5751MB  109MB    5MB 3159MB    30MB 1059MB
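The question also asked about missing values, which the sample data never exercises. As a point of comparison on the Python side, here is a minimal sketch assuming an input where one line lacks a field. The shortened input and the choice to fill gaps with "0MB" are assumptions made for this illustration; whether zero is the right default depends on the data.

```python
import re

import pandas as pd

# Assumed input for illustration: the second line has no "shared" field.
junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;
free=71MB;total=5751MB;used=5MB;"""

rx = re.compile(r"(?P<key>\w+)=(?P<value>[^;]+)")
records = [
    {m.group("key"): m.group("value") for m in rx.finditer(line)}
    for line in junk.splitlines()
]

# Columns are the union of all keys; a key absent from a record becomes NaN.
df = pd.DataFrame(records)
print(df)

# One possible policy (an assumption, not the only option): treat missing as zero.
filled = df.fillna("0MB")

# Strip the unit so the columns can be used in arithmetic.
numeric = filled.apply(lambda col: col.str.replace("MB", "", regex=False).astype(int))
print(numeric)
```

Leaving the NaN in place and converting with a float dtype instead would be an equally reasonable choice if a gap should stay distinguishable from a true zero.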
Regarding "python - From Python to R - DataFrame from a string", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49369657/