r - Opposite of unnest_tokens

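This is most likely a stupid question, but I have googled and googled and can't find a solution. I think it's because I don't know the right way to phrase my question for a search.

I have a data frame that I converted to tidy text format in R to get rid of stop words. I would now like to "untidy" that data frame back to its original format.

What is the opposite / inverse command of unnest_tokens?

Edit: here is what the data I'm working with looks like. I'm trying to replicate the analyses from Silge and Robinson's Tidy Text book, but using Italian opera librettos.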

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line)
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I convert it to tidy text so that I can get rid of stop words:

library(dplyr)
library(tidytext)

tribble <- sample_df %>%
           unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# (mystopwords holds that list; the column is named "word" so it matches the tokens)
itstopwords <- tibble(word = mystopwords)
tribble2 <- tribble %>%
            anti_join(itstopwords, by = "word")

Now I have something like this:

text    word
FIGARO  cinque
FIGARO  dieci
FIGARO  venti
FIGARO  trenta
...

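I'd like to get it back into the format of character name and associated line so I can look at other things. Basically, I want the text in the format it had before, but with the stop words removed.

Best answer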

This is not a stupid question! Here is my typical approach, using group_by(), for getting text back to its original form after some processing in its tidied format.

First, let's go from raw text to a tidied format.

library(tidyverse)
library(tidytext)

tidy_austen <- janeaustenr::austen_books() %>%
    group_by(book) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word
#>    <fct>                    <int> <chr>
#>  1 Sense & Sensibility          1 sense
#>  2 Sense & Sensibility          1 and
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by
#>  5 Sense & Sensibility          3 jane
#>  6 Sense & Sensibility          3 austen
#>  7 Sense & Sensibility          5 1811
#>  8 Sense & Sensibility         10 chapter
#>  9 Sense & Sensibility         10 1
#> 10 Sense & Sensibility         13 the
#> # … with 725,045 more rows

The text is tidy now! But we can untidy it, back to something like its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. Notice what the text looks like at the end, in this case?

tidy_austen %>%
    group_by(book, linenumber) %>%
    summarize(text = str_c(word, collapse = " ")) %>%
    ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text
#>    <fct>                <int> <chr>
#>  1 Sense & Sensib…          1 sense and sensibility
#>  2 Sense & Sensib…          3 by jane austen
#>  3 Sense & Sensib…          5 1811
#>  4 Sense & Sensib…         10 chapter 1
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows

Created on 2019-07-11 by the reprex package (https://reprex.tidyverse.org) (v0.3.0)
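The same group_by() / summarize() / str_c() pattern can be applied to the opera data from the question. The following is only a sketch under stated assumptions: the two-word stop-word list is hypothetical (the asker's real `mystopwords` is not shown), and the `line_id` column is an added helper so that a character with several lines does not have them merged into one.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Two of the asker's sample rows, for brevity
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop-word list; stands in for the asker's own `mystopwords`
itstopwords <- tibble(word = c("mi", "e"))

untidied <- sample_df %>%
  mutate(line_id = row_number()) %>%          # id to rebuild each original line
  unnest_tokens(word, line) %>%               # one row per word
  anti_join(itstopwords, by = "word") %>%     # drop the stop words
  group_by(character, line_id) %>%
  summarize(line = str_c(word, collapse = " ")) %>%  # glue the words back together
  ungroup()

untidied
```

With this stop-word list, the CONTE line comes back as "susanna sembri agitata confusa". As in the Austen example, the rebuilt text is lowercase and has no punctuation, because unnest_tokens() lowercases and strips punctuation by default when tokenizing into words.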
					
On r - Opposite of unnest_tokens, a similar question can be found on Stack Overflow:
https://stackoverflow.com/questions/46734501/
