r - Parallelized sentence generation produces garbled output

Tags: r machine-learning foreach nlp doparallel

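I am trying to create a dataset for some neural network learning purposes. Previously I used a for loop to concatenate and generate the sentences, but since that took a very long time, I implemented the sentence generation with foreach. The whole process is fast and finishes in under 50 seconds. I am only doing slot filling on a template and pasting the pieces together to form sentences, but the output comes out garbled (misspelled words, unexpected spaces between words, words missing entirely, etc.).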

library(foreach)
library(doParallel)
library(tictoc)

tic("Data preparation - parallel mode")
cl <- makeCluster(3)
registerDoParallel(cl)

f_sentences<-c();sentences<-c()
hr=38:180;fl=1:5;month=1:5
strt<-Sys.time()
a<-foreach(hr=38:180,.packages = c('foreach','doParallel')) %dopar% {
  foreach(fl=1:5,.packages = c('foreach','doParallel')) %dopar%{
    foreach(month=1:5,.packages = c('foreach','doParallel')) %dopar% {
      if(hr>=35 & hr<=44){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=45 & hr<=59){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=60 & hr<=100){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being medium).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=101 & hr<=150){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=151 & hr<=180){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      return(outfile)
    }
    write.table(outfile,file="/home/outfile.txt",append = T,row.names = F,col.names = F)
    gc()
  }
}
stopCluster(cl)
toc()

Statistics of the file created this way:

  • Number of lines: 427,975
  • Split used: word split ("")
  • Vocabulary size: 567

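    library(data.table)  # provides fread
    library(magrittr)    # provides %>% (assumed; dplyr also exports it)

    path<-"/home/outfile.txt"
    # Read the file as one string per line
    File<-(fread(path,sep = "\n",header = F))[[1]]
    # `splitting` holds the separator listed under the file statistics
    corpus<-tolower(File) %>%
    #removePunctuation() %>%
    strsplit(splitting) %>%
    unlist()
    vocab<-unique(corpus)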

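Simple sentences like these should yield a very small vocabulary, because the numbers are the only varying parameter here. When inspecting the vocabulary output and using grep, I found many garbled tokens (and some missing words), such as wentt and crpply, appearing in the sentences. That should not happen, because I have a fixed template.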

    Expected sentence
    "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"

    grep -rnw 'outfile.txt' -e 'wentt'
    24105:"About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply"

    grep -rnw 'outfile.txt' -e 'crpply'
    76450:"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply"
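
Because the template is fixed, the vocabulary check described above can also be sketched as a comparison of every token against the template's own words. This is only an illustration: the helper name `unexpected_tokens` is hypothetical, lines are assumed to have any surrounding quotes already stripped, and the two sample lines are the expected sentence and the first garbled line quoted in this question.

```r
# Words that may legitimately appear, lowercased like the vocabulary above.
template <- paste("About soldiers died in the battle (count being",
                  "severly_low). low). medium). high). severly_high).",
                  "Around soldiers and civilians went missing.",
                  "We only have about crates which lasts for months as food supply")
allowed <- unique(strsplit(tolower(template), " ", fixed = TRUE)[[1]])

# Hypothetical helper: tokens that are neither template words nor plain numbers.
unexpected_tokens <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), " ", fixed = TRUE))
  tokens <- tokens[!grepl("^[0-9]+$", tokens)]
  setdiff(tokens, allowed)
}

good <- "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"
bad  <- "About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply"

unexpected_tokens(good)  # character(0)
unexpected_tokens(bad)   # "wentt"
```

A check like this flags misspelled tokens, but a line that was merely truncated at a word boundary would pass it, so counting tokens per line would be a useful companion check.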

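The first few sentences are generated correctly; the problem appears after that. What could be the reason? I am only doing a plain paste and slot filling. Any help would be appreciated!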

Best answer

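The code now runs correctly, with no more errors. I assume the errors were caused by a glitch during the earlier run. I also tested it on other machines with different R versions and still saw no problems.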
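
The answer leaves the cause open. One possible explanation (an assumption, not something confirmed in this thread) is that several workers appended to the same file at the same time through `write.table(..., append = T)`, so their writes could interleave. A minimal sketch that sidesteps this uses a single flat `foreach` over all combinations and lets the master process write the file once. The helper `make_sentence`, the `grid` table, the cluster size, and the temporary output path are all hypothetical choices for illustration.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: fill the template slots for one combination.
make_sentence <- function(hr, fl, month) {
  # Same ranges as the if-blocks in the question: 35-44, 45-59, 60-100, 101-150, 151-180.
  label <- as.character(cut(hr,
                            breaks = c(34, 44, 59, 100, 150, 180),
                            labels = c("severly_low", "low", "medium",
                                       "high", "severly_high")))
  paste("About", hr, "soldiers died in the battle (count being",
        paste0(label, ")."),
        "Around", fl,
        "soldiers and civilians went missing. We only have about",
        sample(38:180, 1), "crates which lasts for", month,
        "months as food supply")
}

# One row per (hr, fl, month) combination, so no nested %dopar% is needed.
grid <- expand.grid(hr = 38:180, fl = 1:5, month = 1:5)

cl <- makeCluster(2)
registerDoParallel(cl)

# Workers only return strings; nothing is written to disk in parallel.
sentences <- foreach(i = seq_len(nrow(grid)), .combine = c) %dopar% {
  make_sentence(grid$hr[i], grid$fl[i], grid$month[i])
}

stopCluster(cl)

# The master process writes the file exactly once.
out_path <- tempfile(fileext = ".txt")  # hypothetical path
writeLines(sentences, out_path)
```

Returning values through `.combine` also removes the need to grow `f_sentences` inside the loop, since each worker has its own copy of that variable anyway.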

Regarding "r - Parallelized sentence generation produces garbled output", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49248519/
