r - How to pad text sequences in R using keras and pad_sequences?

Tags: r tensorflow keras

I have a dataset containing text.

dat <- data.frame(id=c("1","2","3","4","5"),text=as.character(c("hello","hello you","hello duck","Dogs and cats","hello cats, ducks and dogs")),stringsAsFactors = F)
str(dat)

I want to prepare the text for text classification with keras. This works with a small number of tokens and a short padding length.

library(keras)
install_keras()  # one-time setup; skip this if keras/TensorFlow is already installed
library(dplyr)

data<- dat$text

tok <- keras::text_tokenizer(10, lower = TRUE, split = " ", char_level = FALSE)
keras::fit_text_tokenizer(tok, data)
data_idx <- keras::texts_to_sequences(tok, data)


data_idx <- data_idx %>% pad_sequences(maxlen=10,padding="post",value=0)

> data_idx
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    0    0    0    0    0    0    0    0     0
[2,]    1    5    0    0    0    0    0    0    0     0
[3,]    1    6    0    0    0    0    0    0    0     0
[4,]    2    3    4    0    0    0    0    0    0     0
[5,]    1    4    7    3    2    0    0    0    0     0
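
For reference, the post-padding shown above can be imitated in base R without keras. This is only a minimal sketch: `pad_post` is a hypothetical helper written for illustration, not part of keras, and the input sequences are copied from the matrix above.

```r
# Hypothetical helper (not a keras function): right-pad each sequence with
# `value` up to `maxlen`, keeping only the first `maxlen` entries of longer ones.
# Note: this truncation rule is an assumption of the sketch and may differ
# from the truncating behaviour pad_sequences() uses by default.
pad_post <- function(seqs, maxlen, value = 0) {
  rows <- lapply(seqs, function(s) {
    s <- head(as.numeric(s), maxlen)
    c(s, rep(value, maxlen - length(s)))
  })
  do.call(rbind, rows)  # one row per sequence, maxlen columns
}

# Token indices taken from the printed matrix above
seqs <- list(1, c(1, 5), c(1, 6), c(2, 3, 4), c(1, 4, 7, 3, 2))

padded <- pad_post(seqs, maxlen = 10)
dim(padded)   # 5 10
padded[5, ]   # 1 4 7 3 2 0 0 0 0 0
```

The result always has one row per text and `maxlen` columns, so a larger `maxlen` only makes the matrix wider.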

However, if I raise the number of tokens and the padding length (which I have to do for my real text), I get a strange padded sequence.

data<- dat$text

tok <- keras::text_tokenizer(10000, lower = TRUE, split = " ", char_level = FALSE)
keras::fit_text_tokenizer(tok, data)
data_idx <- keras::texts_to_sequences(tok, data)


data_idx <- data_idx %>% pad_sequences(maxlen=10000,padding="post",value=0)

> data_idx
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
     [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
     [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73]
     [,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97]
     [,98] [,99] [,100] [,101] [,102] [,103] [,104] [,105] [,106] [,107] [,108] [,109] [,110] [,111] [,112] [,113] [,114] [,115] [,116] [,117] [,118]
     [,119] [,120] [,121] [,122] [,123] [,124] [,125] [,126] [,127] [,128] [,129] [,130] [,131] [,132] [,133] [,134] [,135] [,136] [,137] [,138]
     [,139] [,140] [,141] [,142] [,143] [,144] [,145] [,146] [,147] [,148] [,149] [,150] [,151] [,152] [,153] [,154] [,155] [,156] [,157] [,158]
     [,159] [,160] [,161] [,162] [,163] [,164] [,165] [,166] [,167] [,168] [,169] [,170] [,171] [,172] [,173] [,174] [,175] [,176] [,177] [,178]
     [,179] [,180] [,181] [,182] [,183] [,184] [,185] [,186] [,187] [,188] [,189] [,190] [,191] [,192] [,193] [,194] [,195] [,196] [,197] [,198]
     [,199] [,200] [,201] [,202] [,203] [,204] [,205] [,206] [,207] [,208] [,209] [,210] [,211] [,212] [,213] [,214] [,215] [,216] [,217] [,218]

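I think I am doing something completely wrong, but I can't work out what.

Best answer

There is nothing wrong with the output. We need to check the dimensions: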

dim(data_idx)
#[1]     5 10000 

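The console simply prints only the column headers: the matrix is larger than the max.print limit, so the full output cannot be displayed.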

#[ reached getOption("max.print") -- omitted 5 rows ]

If we subset the matrix, we can see the output:

data_idx[1:3, 1:3]
#      [,1] [,2] [,3]
#[1,]    1    0    0
#[2,]    1    5    0
#[3,]    1    6    0
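
The same check can be tried without keras. The sketch below builds a stand-in matrix of the same shape (the object `m` and its contents are made up for illustration, not the real tokenizer output) and inspects it by dimension and by subset instead of printing it whole.

```r
# Stand-in for data_idx: 5 rows, 10000 columns, mostly zeros (illustrative data)
m <- matrix(0L, nrow = 5, ncol = 10000)
m[, 1] <- c(1L, 1L, 1L, 2L, 1L)

dim(m)                    # 5 10000
getOption("max.print")    # the limit print() respects; the value depends on your setup

# Look at a small window rather than the whole matrix
preview <- m[1:3, 1:3]
preview

# Columns that contain at least one non-zero token index
nonzero_cols <- which(colSums(m != 0) > 0)
nonzero_cols
```

Raising the limit with `options(max.print = ...)` is another option, but for a matrix this wide a subset is usually easier to read.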

Regarding "r - How to pad text sequences in R using keras and pad_sequences?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49594691/
