I have a number of PDF documents which I have read into a corpus with library tm. How can one break the corpus into sentences?
It can be done by reading the file with readLines followed by sentSplit from the qdap package [*]. That function requires a data frame, however, and it would also mean abandoning the corpus and reading all the files individually. How can I pass the function sentSplit {qdap} over a corpus in tm? Or is there a better way?
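For reference, the data-frame route described above would look roughly like this (a sketch only; doc1.txt and the column names are made up, and it assumes the qdap package is installed):
library(qdap)
txt <- paste(readLines("doc1.txt"), collapse = " ")
df <- data.frame(doc.id = "doc1", text = txt, stringsAsFactors = FALSE)
sentSplit(df, "text")  # returns one row per sentence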
Note: library openNLP used to have a function sentDetect, which is now Maxent_Sent_Token_Annotator - the same question applies: how can it be combined with a corpus [tm]?
Best Answer
I do not know how to reshape a corpus, but that would be a fantastic functionality to have.
I guess my approach would be something like this:
Using these packages:
# Load Packages
require(tm)
require(NLP)
require(openNLP)
I would set up my text-to-sentences function as follows:
convert_text_to_sentences <- function(text, lang = "en") {
  # Compute sentence annotations using the Apache OpenNLP Maxent sentence detector,
  # employing the default model for language 'en'
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  # Convert the text to class String from package NLP
  text <- as.String(text)
  # Find the sentence boundaries in the text
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  # Extract the sentences
  sentences <- text[sentence.boundaries]
  # Return the sentences
  return(sentences)
}
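As a quick sanity check, the function can be called directly on a plain character string (illustrative only; it requires the English sentence model that ships with openNLPdata):
convert_text_to_sentences("The TARDIS is bigger on the inside. It also travels in time.")
# should return a character vector with one element per sentence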
And here is my hack at a reshape-corpus function (NB: you will lose the meta attributes here unless you modify this function somehow and copy them over appropriately):
reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put it into a list
  # (Content() is the accessor in the tm 0.5.x used here; tm >= 0.6 uses content() or as.character())
  text <- lapply(current.corpus, Content)
  # Apply the conversion function to the text of each document
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))
  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}
Which works as follows:
## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
stringsAsFactors = FALSE)
current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents
## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
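The meta data caveat above still applies. If all you need is to know which original document each sentence came from, one minimal variant (my own sketch, not part of the answer; reshape_corpus_keep_ids is a made-up name) returns that mapping alongside the new corpus:
reshape_corpus_keep_ids <- function(current.corpus, FUN, ...) {
  text <- lapply(current.corpus, Content)  # same accessor as above (content()/as.character() in tm >= 0.6)
  docs <- lapply(text, FUN, ...)
  # Record the index of the source document for every sentence produced
  origin <- rep(seq_along(docs), times = vapply(docs, length, integer(1)))
  new.corpus <- Corpus(VectorSource(unlist(docs)))
  list(corpus = new.corpus, origin = origin)
}
res <- reshape_corpus_keep_ids(current.corpus, convert_text_to_sentences)
res$origin  # index of the source document for each sentence in res$corpus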
My sessionInfo output:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NLP_0.1-0 openNLP_0.2-1 tm_0.5-9.1
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 parallel_3.0.1 rJava_0.9-4 slam_0.1-29 tools_3.0.1
About "R: breaking a corpus into sentences", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18712878/