tensorflow - How should numbers in a text string be handled when vectorizing words?

Tags: tensorflow nlp word2vec word-embedding

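If I have a text string to vectorize, how should I handle the numbers inside it? Or, if I feed a neural network with both numbers and words, how can I keep the numbers as numbers?

I am planning to build a dictionary of all my words (as suggested here). In that case every string becomes an array of numbers. How should I handle the numeric characters? How do I output a vector that does not mix word indices with numeric characters?

Does converting numbers to strings weaken the information I feed to the network?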

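Best answer

Expanding on the discussion with @user1735003, let's consider two ways of representing numbers:

1. Treat the number as a string, consider it just another word, and assign it an ID when building the dictionary. Or

2. Convert the number to an actual word: "1" becomes "one", "2" becomes "two", and so on.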
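The two options can be sketched in plain Python. This is only an illustration under stated assumptions, not the answerer's code: the helper names (`tokenize`, `build_vocab`, `spell_out_digits`), the `NUMBER_WORDS` table, the whitespace tokenizer, and the toy sentence are all made up for the example.

```python
# Hypothetical helpers for illustration; none of these names come from the answer.
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def spell_out_digits(tokens):
    """Replace single-digit tokens with their English word; leave others as-is."""
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]


sentence = "I bought 2 apples and two pears"  # toy input
tokens = tokenize(sentence)

# Option 1: the digit "2" is just another token with its own ID.
vocab_1 = build_vocab(tokens)
ids_1 = [vocab_1[t] for t in tokens]

# Option 2: digits are spelled out first, so "2" and "two" share one ID.
tokens_2 = spell_out_digits(tokens)
vocab_2 = build_vocab(tokens_2)
ids_2 = [vocab_2[t] for t in tokens_2]

print(ids_1)
print(ids_2)
```

Under option 1 the tokens "2" and "two" keep separate IDs, so the network can learn a different embedding for each. Under option 2 they collapse into a single ID, and whatever distinction existed between them in the corpus is lost.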

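Does the second option change the context? To check, we can use word2vec to measure the similarity between the two representations. If they appear in similar contexts, the score will be high.

For example, the similarity score between "1" and "one" is 0.17, and between "2" and "two" it is 0.23. These scores suggest that the contexts in which the digit and the spelled-out word are used are quite different.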
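To make the idea of "similar context gives a high score" concrete, here is a small sketch that does not use word2vec at all. It substitutes raw co-occurrence counts and cosine similarity, on an invented four-sentence corpus, so the numbers it produces are unrelated to the 0.17 and 0.23 quoted above. The function names, the window size, and the corpus are assumptions for illustration only.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[toks[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy corpus: digits label chapters, "one" is used as a pronoun.
corpus = [
    "chapter 1 covers the basics",
    "chapter 2 covers the details",
    "no one knows the answer",
    "each one of us agrees",
]

sim_digits = cosine(context_vector("1", corpus), context_vector("2", corpus))
sim_mixed = cosine(context_vector("1", corpus), context_vector("one", corpus))

print("1 vs 2:  ", round(sim_digits, 2))
print("1 vs one:", round(sim_mixed, 2))
```

In this toy corpus the two digits share their surrounding words, while "1" and "one" barely overlap, which mirrors the pattern the answer reports. With a trained gensim model, the corresponding check would be a call such as `model.wv.similarity("1", "one")`.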

By treating the numbers as just another word, you do not change the context. If you apply any other transformation to those numbers, you cannot guarantee that it is for the better. So it is better to leave them untouched and treat them as ordinary words.

Note: both word2vec and GloVe were trained by treating numbers as strings (case 1).

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/44865840/
