r - Classifying words that share the same pattern using R

Tags: r dplyr tm fuzzy-search

I want to run a text mining analysis, but I have run into some trouble.
Using dput(), I am sharing a small part of my text.

text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 
3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 
3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 
3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L, 
17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L, 
7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg", 
"* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g", 
"197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g", 
"2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g", 
"3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g", 
"809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+", 
"MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL", 
"Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.", 
"SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL", 
"TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow"
), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER", 
"GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))

(The NA values are accidental.)
The text contains the names of products taken from receipts.

I want to put all similar names into one group.

For example, here I manually picked MAKFA Makar (a Ukrainian name). I found 7 rows containing the root or key word "MAKFA Makar":
Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g

All of these product entries share the same root.
MAKFA Makar must not be reduced to something like MFAMKR. As output I would like to get:
                                                Initially                 class
1                       Pasta Makfa snail flow-pack 450 g.          MAKFA Makar.
2                  MAKFA Macaroni feathers like. in / with          MAKFA Makar.
3                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
4                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
5                              6788 MAKFA Makar.perya 450g          MAKFA Makar.
6                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
7                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
8          * 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35                  kolb
9               * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg             Spikachki
10                                         809 Bananas 1kg              Bananas 
11                                              Lemons 55+                Lemons
12                           Napkins paper color 100pcs PL        Napkins paper 
13                         SOFT Cotton sticks 100 PE (BELL         Cotton sticks
14                     SHEBEKINSKIE Macaroni Butterfly №40 SHEBEKINSKIE Macaroni
15 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g              CAT seed
16                        FetaXa Cheese product 60% 400g (               Cheese 
17          3491144 LIP.NAP.ICE TEA green yellow 0.5 liter                  TEA 
18                  2030918 MARIA TRADITIONAL Biscuit 180g              Biscuit 
19                                          197 Onion 1 kg                 Onion
20                         TOBUS steering-wheel 0.5kg flow        steering-wheel
21                     Package "Magnet" white (Plastiktre) Package  (Plastiktre)
22                    * 2108609 SLOB.Mayon.OLIVK.67% 400ml                 Mayon
23                            TENDER AGE Cottage cheese 10        Cottage cheese

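How can I classify the products by their root word? (That is, the same pattern occurs in words such as Makar.Makfa and cheese.)

Best answer

I think you can get where you want by cleaning and then clustering your text. Here is a starting point: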

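# Keep only the rows that contain product names (the rest are empty/NA)
text <- text[1:24,]
library(quanteda)
# Note: in quanteda >= 3.0, textstat_simil() is provided by the
# separate quanteda.textstats package
library(tidyverse)
hc <- text %>% 
  pull(GOODS_NAME) %>% 
  as.character %>% 
  # Tokenise and drop numbers, punctuation, symbols and separators
  quanteda::tokens(
    remove_numbers = TRUE,  
    remove_punct = TRUE,
    remove_symbols = TRUE, 
    remove_separators = TRUE
  ) %>% 
  quanteda::tokens_tolower() %>% 
  # Remove tokens that start with a digit (e.g. "450g", "1kg")
  quanteda::tokens_remove(valuetype="regex", pattern = c("^\\d.*")) %>% 
  # Document-feature matrix: one row per product name
  quanteda::dfm() %>% 
  # Pairwise Jaccard similarity between product names
  textstat_simil(method = "jaccard") %>% 
  # Negate the similarities so they can be used as distances
  magrittr::multiply_by(-1) %>% 
  `attr<-`("Labels", text$GOODS_NAME) %>% 
  hclust(method = "average") 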
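
The clustering above is driven by the Jaccard similarity between the token sets of two product names. As a rough illustration of what that measure computes, here is a minimal base R sketch. The helpers `tokenise()` and `jaccard()` are hypothetical names written for this example, and the cleaning rule (lower-case, split on non-letters, drop tokens shorter than three characters) is an assumption that only approximates the quanteda steps.

```r
# Hypothetical helper: lower-case a name, split on anything that is not a
# letter, and keep the distinct tokens longer than two characters.
tokenise <- function(x) {
  parts <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(parts[nchar(parts) > 2])
}

# Jaccard similarity: number of shared tokens divided by the number of
# distinct tokens across both names.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

a <- tokenise("2013077 MAKFA Makar.RAKERS 450g")  # makfa, makar, rakers
b <- tokenise("6788 MAKFA Makar.perya 450g")      # makfa, makar, perya
d <- tokenise("809 Bananas 1kg")                  # bananas

jaccard(a, b)  # 0.5: two shared tokens out of four distinct tokens
jaccard(a, d)  # 0: nothing in common
```

Names that share the "makfa makar" root score high against each other and zero against unrelated products, which is what lets the hierarchical clustering group them.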

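# Draw the dendrogram into a temporary PDF so the long labels stay readable
pdf(tf<-tempfile(fileext = ".pdf"), width = 20, height = 10)
plot(hc)
dev.off()
# Open the PDF (shell.exec() is available on Windows only)
shell.exec(tf)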

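# Cut the tree at height -0.1, i.e. keep merges made at an average
# Jaccard similarity of at least 0.1, then list the rows of each cluster
clusters <- cutree(hc, h = -0.1)
split(text, clusters)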
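
The question also asks for a `class` column next to each name. One possible way to get there from cluster ids, sketched in base R under stated assumptions: `goods` and `cluster_id` are small hypothetical stand-ins for the product names and the `cutree()` result, and the label is simply the set of tokens shared by every name in a cluster. This is not part of the original answer, only one idea for a next step.

```r
# Hypothetical stand-ins for the product names and their cluster ids
goods <- c("2013077 MAKFA Makar.RAKERS 450g",
           "6788 MAKFA Makar.perya 450g",
           "2049750 MAKFA Makar.SHIGHTS 450g",
           "809 Bananas 1kg")
cluster_id <- c(1, 1, 1, 2)

# Hypothetical helper: distinct lower-case tokens longer than two characters
tokenise <- function(x) {
  parts <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(parts[nchar(parts) > 2])
}

# Label a cluster with the tokens that occur in every one of its names
label_cluster <- function(names_in_cluster) {
  shared <- Reduce(intersect, lapply(names_in_cluster, tokenise))
  paste(shared, collapse = " ")
}

# One label per cluster, then map the labels back onto the rows
labels <- vapply(split(goods, cluster_id), label_cluster, character(1))
result <- data.frame(Initially = goods,
                     class = unname(labels[as.character(cluster_id)]),
                     stringsAsFactors = FALSE)
result
```

A name that lacks part of the root (for instance "Pasta Makfa snail flow-pack 450 g.", which has no "makar" token) would shrink the shared label to "makfa", so a majority vote over tokens may be a better rule for real data.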

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/52346232/
