For example, I have a text file with the following content:
I wantto separate those wordswhich arejoined.
How can I separate the words in this text so that I get the following as output?
I want to separate those words which are joined.
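Basically, it should detect the meaningless words in the text and turn them into meaningful ones.
For example, the code should detect that "wantto" does not mean anything and, after processing it, return "want to" as output.
It may return some other meaningful combination of words, and that is fine.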
Best answer
If you have aspell installed (see ?aspell), this may give you a hint:
> writeLines("I wantto separate those wordswhich arejoined.", "/tmp/test.txt")
> sp <- aspell('/tmp/test.txt')
> sp
arejoined
/tmp/test.txt:1:36
wantto
/tmp/test.txt:1:3
wordswhich
/tmp/test.txt:1:25
> sp[[5]]
[[1]]
[1] "want to" "want-to" "want" "wanton" "Watt" "watt" "wand" "went" "wont" "whatnot" "wants" "canto"
[13] "panto" "Wanda" "waned" "won't" "want's" "wanted" "NATO" "vanity" "wander" "winter" "wart" "natty"
[25] "vaunt" "wan" "ant" "walnut" "wasn't" "Witt" "wait" "wane" "wino"
[[2]]
[1] "words which" "words-which" "wordsmith" "Wordsworth" "words" "Woodstock" "word's" "woodsier"
[9] "Woods" "wards" "woods" "ward's" "woad's" "wood's" "wort's"
[[3]]
[1] "are joined" "are-joined" "rejoined" "adjoined" "enjoined" "rejoinder" "regained"
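To turn the suggestions above into corrected text, one option is to replace each flagged word with its first suggestion. The following is only a minimal sketch: apply_first_suggestion is a hypothetical helper, not part of R's aspell interface, and it assumes the first suggestion is the split you want, which happens to hold for this example but is not guaranteed in general. The values are copied by hand from the output above so that the snippet runs without aspell.

```r
# Hypothetical helper: replace each flagged word with its first suggestion.
# `originals` is a character vector of flagged words; `suggestions` is a list
# of character vectors in the same order.
apply_first_suggestion <- function(text, originals, suggestions) {
  for (i in seq_along(originals)) {
    candidates <- suggestions[[i]]
    if (length(candidates) == 0) next  # nothing suggested; leave the word alone
    # fixed = TRUE treats the flagged word literally rather than as a regex
    text <- gsub(originals[i], candidates[1], text, fixed = TRUE)
  }
  text
}

# Values copied from the aspell output above (first three suggestions only)
originals <- c("wantto", "wordswhich", "arejoined")
suggestions <- list(
  c("want to", "want-to", "want"),
  c("words which", "words-which", "wordsmith"),
  c("are joined", "are-joined", "rejoined")
)

fixed_text <- apply_first_suggestion(
  "I wantto separate those wordswhich arejoined.", originals, suggestions
)
print(fixed_text)
```

With a real aspell result, the inputs would presumably come from the Original and Suggestions columns of sp. Note that gsub replaces substrings rather than whole words, so a flagged word that also occurs inside a longer word would be changed there too.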
In any case, a task like this is always dictionary-based.
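To illustrate what "dictionary-based" can mean without aspell, here is a rough sketch that splits a run-together token into known words with dynamic programming. The function segment_word and the tiny dictionary are hypothetical and for illustration only; a real word list would be far larger, and this sketch returns just one possible split even when several exist.

```r
# Split a run-together token into dictionary words.
# Returns a character vector of words, or NULL if no split exists.
segment_word <- function(token, dictionary) {
  n <- nchar(token)
  # reachable[i + 1] is TRUE if the first i characters can be split into words
  reachable <- c(TRUE, rep(FALSE, n))
  # prev_cut[i + 1] records where the last word of that split begins (0-based)
  prev_cut <- rep(NA_integer_, n + 1)
  for (right in seq_len(n)) {
    for (left in 0:(right - 1)) {
      piece <- substr(token, left + 1, right)
      if (reachable[left + 1] && piece %in% dictionary) {
        reachable[right + 1] <- TRUE
        prev_cut[right + 1] <- left
        break  # keep the first split found for this prefix
      }
    }
  }
  if (!reachable[n + 1]) return(NULL)
  # Walk the recorded cut points backwards to recover the words
  words <- character(0)
  right <- n
  while (right > 0) {
    left <- prev_cut[right + 1]
    words <- c(substr(token, left + 1, right), words)
    right <- left
  }
  words
}

# Toy dictionary (an assumption for this example)
dictionary <- c("want", "to", "words", "which", "are", "joined")
tokens <- c("wantto", "wordswhich", "arejoined")

splits <- vapply(tokens, function(tok) {
  parts <- segment_word(tok, dictionary)
  # Fall back to the original token when no split is found
  if (is.null(parts)) tok else paste(parts, collapse = " ")
}, character(1), USE.NAMES = FALSE)
print(splits)
```

With a larger dictionary a token may have several valid splits, so some ranking (word frequency, for instance) would be needed to pick the most plausible one.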
Regarding "r - How to separate words in a given text in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24058618/