r - Neural network formula error

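This is actually a copy of a very useful question whose answer was (partly) worked out by the asker. The original title was "Formula with more than 512 characters in R neural network model for text analysis". He eventually solved the problem, although the reasoning he gave was incorrect. He then deleted the question, which made the comments and his solution invisible and compounded the error.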

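I'm trying to fit a neural network model that classifies websites into one of 2 buckets. The training features are the words in all the links on a website; for example, a site might have the features "home", "about", "contact", "products", and so on. The data is a data frame with a class column followed by one column for each word in the training set. Each row holds the class (qualified or not qualified) and a 0 or 1 for each word, indicating whether it appears on that site.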

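About 1000 words occur a reasonable number of times, and I'd like to use all of them as features. However, formulas seem to have a 225-character limit, so I can't do that.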

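I don't have a good dataset to provide reproducible output, but here is my code and the errors I'm getting.

If I try to build the formula, it gets cut off: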

nn.model <- neuralnet(paste("class ~ ", paste(clean.features, collapse = "+", sep = "")), data = training.data, 
                hidden = num.nodes)
Error in parse(text = x, keep.source = FALSE) : 
  <text>:2:0: unexpected end of input
1: ranty+recipes+contract+just+inventory+types+working+wine+hampshire+suppliers+rise+body+selection+laurel+trek+arlington+cabinet+citrus+advertisers+rhode+highway+intl+province+jewelers+cycles+wy

If I use as.formula, the same thing happens:

f <- as.formula(paste("class ~ ", paste(clean.features, collapse = "+", sep = "")))
Error in parse(text = x, keep.source = FALSE) : 
  <text>:2:0: unexpected end of input
1: ranty+recipes+contract+just+inventory+types+working+wine+hampshire+suppliers+rise+body+selection+laurel+trek+arlington+cabinet+citrus+advertisers+rhode+highway+intl+province+jewelers+cycles+wy

If I try to use all the features in the dataset, it says there is no 'data' argument (even though there is one):

nn.model <- neuralnet(class ~ . , data = training.data, 
                hidden = num.nodes)
Error in terms.formula(formula) : '.' in formula and no 'data' argument

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Is there a workaround?

Best answer

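I'll suggest a possible starting point for isolating the problem. If the length of the formula itself were the issue, it should be possible to reproduce the problem just by creating the formula. Try this: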

form <- reformulate( clean.features, quote(class) )

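Ew, just typing that made my internal R parser cringe. Please rename your LHS variable to something other than the name of such a central R function. Perhaps something like this: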

names( training.data)[ names(training.data) %in% "class"] <- "myclass"
form <- reformulate( clean.features, quote(myclass) )
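The length hypothesis can also be checked directly. The following is a minimal base R sketch, assuming made-up feature names X1 to X1000 as a stand-in for clean.features and the renamed response myclass. If formulas really had a 225- or 512-character limit, building this one should fail.

```r
# Hypothetical stand-in for clean.features: 1000 syntactically valid names
fake.features <- paste0("X", 1:1000)

# The right-hand side alone is far longer than 512 characters
rhs <- paste(fake.features, collapse = "+")
nchar(rhs) > 512

# Building the formula should succeed despite its length
long.form <- reformulate(fake.features, response = "myclass")

# The response plus all 1000 terms should be present
length(all.vars(long.form))
```

If this runs cleanly, the length of the formula is not what triggers the parse error, and the names themselves are the next thing to inspect.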

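The asker responded to other comments, which I won't repeat here. I had told him that his theory about a 512-character limit was incorrect, but he then posted: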

So with a lot of manual review, it looks like the word "for" happened to be exactly at the character limit that was mentioned in other posts (512). But the actual problem was that "for" was being recognized as a function in the formula. Sorry for all the confusion.

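That is incorrect. The problem has nothing to do with a character limit in formulas. It comes from his having a column named "for". That is a reserved control-flow word in R, and it would have caused the same error anywhere in the formula. See this demonstration (which shows some reserved words, but not all of them):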

> f <- reformulate(c( paste(sep="","X",1:5), "for", paste(sep="","X",1:5)), quote(Y))
Error in parse(text = termtext, keep.source = FALSE) : 
  <text>:1:30: unexpected '+'
1: response ~ X1+X2+X3+X4+X5+for+
                                 ^
> f <- reformulate(c( paste(sep="","X",1:5), "class", paste(sep="","X",1:5)), quote(Y))
# no error ... OK perhaps not a reserved word
> f <- reformulate(c( paste(sep="","X",1:5), "in", paste(sep="","X",1:5)), quote(Y))
Error in parse(text = termtext, keep.source = FALSE) : 
  <text>:1:27: unexpected 'in'
1: response ~ X1+X2+X3+X4+X5+in
                              ^
> f <- reformulate(c( paste(sep="","X",1:5), "TRUE", paste(sep="","X",1:5)), quote(Y))
#  no error, so maybe "TRUE" is not reserved and quote(TRUE) is?

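So it was right to ask whether a term can share a name with a function, although the answer is not quite what I expected. An ordinary function name such as class is fine as a term. What fails are reserved words like for and in, which the parser cannot accept as a bare operand. TRUE is reserved too, but it parses as a constant, so no error is raised. If anyone wants to offer a more careful CS explanation, I'd be happy to review their effort.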
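One possible workaround, sketched here in base R and not taken from the original thread, is to let make.names() rewrite any feature name that is not a syntactically valid R name before building the formula (reserved words get a trailing dot). The feature vector and the tiny data frame below are hypothetical, and whether neuralnet accepts the renamed columns without further changes is an assumption.

```r
# Hypothetical feature names, including reserved words that break parsing
features <- c("home", "for", "about", "in", "TRUE")

# Hypothetical training data: one 0/1 column per word, plus the response
training.data <- as.data.frame(matrix(c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0), nrow = 2))
names(training.data) <- features
training.data$myclass <- c(1, 0)

# make.names() appends "." to reserved words and leaves valid names alone
safe <- make.names(features, unique = TRUE)
features[safe != features]   # the names that had to change

# Rename the columns to match, then build the formula from the safe names
names(training.data)[match(features, names(training.data))] <- safe
form <- reformulate(safe, response = "myclass")
```

Renaming keeps the formula free of special quoting. Backtick-quoting the offending terms also parses, but it relies on every downstream function handling non-syntactic names correctly.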

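Another context where this issue shows up is the prefix ? operator that brings up help pages. Try getting help with ?for. You just get a continuation-line + prompt. The parser is waiting for an opening parenthesis.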
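The same behaviour can be checked without an interactive prompt. In this small sketch (the helper name parses_ok is made up for illustration), parse() rejects the bare ?for as incomplete, while the quoted forms parse. That is why ?"for" or help("for") is the usual way to reach that help page.

```r
# Hypothetical helper: TRUE if the text is a complete, parseable expression
parses_ok <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

parses_ok("?for")      # bare reserved word: the parser wants the rest of the loop
parses_ok('?"for"')    # topic quoted as a string
parses_ok("?`for`")    # topic quoted with backticks
```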

A similar question about "r - Neural network formula error" can be found on Stack Overflow: https://stackoverflow.com/questions/42238675/
