R: Joining two data frames by a character column with small spelling differences

Tags: r dataframe dplyr tidyverse

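My two data frames share a character column. Joining them on that column with dplyr::full_join would be easy, but the common column has small yet noticeable spelling differences between the two data frames. Relative to the length of each string that defines a skill, the differences are small: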

Skill                   Grade_Judge_A

pack & ship               1
pack & store              5
sell                      3
Design a room             9


Skill                   Grade_Judge_B

pack and store            3
pack & ship               7
sell                      2
Design room               6

How can I get the desired output below?

Skill                   Grade_Judge_A      Grade_Judge_B

pack & ship               1                     7
pack & store              5                     3
sell                      3                     2
Design a room             9                     6

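I am thinking of matching rows across the two data frames by the distance between the strings in the "skill" column, for example with the stringdist package. If the difference between two strings is small, they refer to the same skill.

I would prefer a dplyr/tidyverse solution.

Here is the actual dput output for data frame A: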

> dput(df_A)

structure(list(skill = c(" [Assess abdomen for a floating mass]", 
" [Assess Nerve Root Compression]", " [Evaluate breathing effort (rate, patterns, chest expansions)]", 
" [Evaluate Plantar Reflex/Babinski sign]", " [Evaluate Speech]", 
" [External palpation of a uterus]", " [Heel to Shin test]", 
" [Inspect anterior chamber of eye with ophthalmoscope or penlight]", 
" [Inspect breast]", " [Inspect Overall Skin Color/Tone]", " [Inspect Skin Lesions]", 
" [Inspect Wounds]", " [Mental Status/level of consciousness]", 
" [Nose/index finger]", " [Percuss abdomen to determine spleen size]", 
" [Percuss costovertebral angle for kidney tenderness]", " [Percuss for diaphragmatic excursion]", 
" [Percuss the abdomen for abdominal tones]", " [Percuss the abdomen to determine liver span]"
), `2016-09-17 13:41:08` = c(1, 1, 5, 3, 4, 0, 4, 3, 3, 5, 4, 
5, 5, 3, 1, 1, 2, 4, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-17 13:41:08"
))

And for data frame B:

> dput(df_B)

structure(list(skill = c(" [Assess abdomen for floating mass]", 
" [Assess nerve root compression]", " [Evaluate breathing effort (rate, patterns, chest expansion)]", 
" [Evaluate plantar reflex/Babinski sign]", " [Evaluate speech]", 
" [External palpation of uterus]", " [Heel to shin test]", " [Inspect anterior chamber of the eye with opthalmoscope or penlight]", 
" [Inspect breasts]", " [Inspect overall skin color/tone]", " [Inspect skin lesions]", 
" [Inspect wounds]", " [Mental status/level of consciousness]", 
" [Nose/Index finger]", " [Percuss costovertebral angle for kidney tenderness]", 
" [Percuss for diaphragmatic excursion]", " [Percuss the abdomen for abdominal tones]", 
" [Percuss the abdomen to determine liver span]", " [Percuss the abdomen to determine spleen size]"
), `2016-09-21 07:58:43` = c(0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-21 07:58:43"
))

Here is the head of each data frame. The first:

 > head(df_A)
        # A tibble: 6 × 2
                                                                    skill `2016-09-17 13:41:08`
                                                                    <chr>                 <dbl>
        1                            [Assess abdomen for a floating mass]                     1
        2                                 [Assess Nerve Root Compression]                     1
        3  [Evaluate breathing effort (rate, patterns, chest expansions)]                     5
        4                         [Evaluate Plantar Reflex/Babinski sign]                     3
        5                                               [Evaluate Speech]                     4
        6                                [External palpation of a uterus]                     0

And the second:

> head(df_B)
# A tibble: 6 × 2
                                                           skill `2016-09-21 07:58:43`
                                                           <chr>                 <dbl>
1                             [Assess abdomen for floating mass]                     0
2                                [Assess nerve root compression]                     2
3  [Evaluate breathing effort (rate, patterns, chest expansion)]                     2
4                        [Evaluate plantar reflex/Babinski sign]                     2
5                                              [Evaluate speech]                     2
6                                 [External palpation of uterus]                     1

Best answer

How close is this?

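library(tidyverse)
library(stringdist)

df_A %>%
    rownames_to_column() %>%
    mutate(foo = 1) %>%   # dummy key, identical on every row
    # joining on the dummy key pairs every row of df_A with every row of df_B
    full_join(df_B %>% rownames_to_column() %>% mutate(foo = 1), by = 'foo') %>%
    select(-foo) %>%
    # note: length(skill.x) is the row count of the joined table (19 * 19 = 361),
    # not the string length, so the cutoff below keeps pairs with dist <= 5
    mutate(dist = stringdist(skill.x, skill.y), norm_dist = dist / length(skill.x)) %>%
    arrange(norm_dist) %>%
    filter(norm_dist < 0.015)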

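I did a true (relational-algebra style) full join of df_A and df_B, which can become a problem if your real data is large (for example, if both data frames have 1,000 rows, the joined result has 1,000,000 rows). The join is done by creating a dummy column foo that has the same value on every row and then joining on that dummy column.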
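The dummy-column join described above can also be written with a dedicated verb. The following is a minimal sketch, assuming dplyr 1.1.0 or later (which provides `cross_join()`); it uses the toy data from the question, and the column names `grade_a` and `grade_b` are shortened stand-ins chosen for illustration.

```r
library(dplyr)

# Toy data from the question (grade column names shortened for illustration).
judge_a <- data.frame(
  skill   = c("pack & ship", "pack & store", "sell", "Design a room"),
  grade_a = c(1, 5, 3, 9),
  stringsAsFactors = FALSE
)
judge_b <- data.frame(
  skill   = c("pack and store", "pack & ship", "sell", "Design room"),
  grade_b = c(3, 7, 2, 6),
  stringsAsFactors = FALSE
)

# Assumes dplyr >= 1.1.0. Every row of judge_a is paired with every row of
# judge_b; the shared column name "skill" gets the .x / .y suffixes.
pairs <- cross_join(judge_a, judge_b)

nrow(pairs)   # 4 * 4 candidate pairs
names(pairs)
```

The row count still grows as the product of the two inputs, so this changes only how the join is expressed, not its cost.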

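The stringdist package mentioned in the comments then compares every possible combination of a string from an A row with a string from a B row. For your example data, a cutoff of 0.015 on the normalized string distance (`norm_dist`) seems to give good results. Of course, such an arbitrary cutoff may be overfitted to your example data.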
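One way to rely less on a hand-tuned cutoff is to keep, for each skill in A, only its closest candidate in B. The sketch below is a variation on the answer rather than the answer's own method. It assumes dplyr 1.1.0 or later for `cross_join()`, uses the toy data from the question with shortened, illustrative column names, and divides the distance by the length of the longer string via `nchar()`. In the answer's code, by contrast, `length(skill.x)` is the number of rows in the joined table.

```r
library(dplyr)
library(stringdist)

# Toy data from the question (grade column names shortened for illustration).
judge_a <- data.frame(
  skill   = c("pack & ship", "pack & store", "sell", "Design a room"),
  grade_a = c(1, 5, 3, 9),
  stringsAsFactors = FALSE
)
judge_b <- data.frame(
  skill   = c("pack and store", "pack & ship", "sell", "Design room"),
  grade_b = c(3, 7, 2, 6),
  stringsAsFactors = FALSE
)

matched <- judge_a %>%
  cross_join(judge_b) %>%                      # assumes dplyr >= 1.1.0
  mutate(
    dist      = stringdist(skill.x, skill.y),  # default method: "osa"
    # normalize by the longer of the two strings, giving a value in [0, 1]
    norm_dist = dist / pmax(nchar(skill.x), nchar(skill.y))
  ) %>%
  group_by(skill.x) %>%
  # keep only the closest B candidate for each A skill
  slice_min(norm_dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(skill = skill.x, grade_a, grade_b)

matched
```

Two caveats apply: nothing here stops two rows of A from selecting the same row of B, and a skill with no real counterpart still gets its nearest neighbor, so a maximum-distance filter may still be needed on real data.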

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/40580277/
