python - Is it possible to have unordered bigrams in a CountVectorizer?

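I would like unordered bigrams, e.g. for "the cat sat on the mat":

[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]

Each bigram is sorted alphabetically. This means that, for example, "to house to" gives [("house", "to"),("house", "to")], which gives these bigrams a higher frequency while minimizing the search space.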

I can get the above using:
unordered_bigrams = [tuple(sorted(n)) for n in list(nltk.bigrams(words))]
But I would now like a "bag of words" type vector.
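
For reference, the sorting step described above can be sketched without nltk by pairing each word with its successor. The function name `unordered_bigrams` is chosen here for illustration and is not part of any library; this is a minimal sketch assuming the input is already a list of tokens.

```python
def unordered_bigrams(words):
    """Return adjacent word pairs, each pair sorted alphabetically."""
    # zip(words, words[1:]) yields the same adjacent pairs as nltk.bigrams(words)
    return [tuple(sorted(pair)) for pair in zip(words, words[1:])]


print(unordered_bigrams("to house to".split()))
# [('house', 'to'), ('house', 'to')]
print(unordered_bigrams("the cat sat on the mat".split()))
# [('cat', 'the'), ('cat', 'sat'), ('on', 'sat'), ('on', 'the'), ('mat', 'the')]
```

Both "to house" and "house to" collapse into the same pair, which is what raises the frequency of that bigram.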

I have ordered bigram feature vectors using the following:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

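So I would like the same for my unordered bigrams... I am struggling to find an option in CountVectorizer that gives me this kind of processing (I have looked at vocabulary and preprocessor, without much luck).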

Best answer

If all you need is the possible pairs of words given an unordered list of words, then you don't actually need a bigram generator:

>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]

Or, if you don't want duplicate tuples that contain the same words in a different order:

>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]

There is a nice answer on product, combinations and permutations at https://stackoverflow.com/a/942551/610569
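
Note that permutations and combinations over the word set pair every word with every other word, not only adjacent ones. If adjacent, alphabetically sorted bigrams are what should be counted, one possible approach is a custom analyzer function. The sketch below is an assumption, not part of the original answer: the name `unordered_bigram_analyzer` and the simple token pattern (runs of two or more word characters) are chosen here for illustration, and the counting is shown with `collections.Counter` so the example runs with the standard library alone.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase, keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def unordered_bigram_analyzer(doc):
    """Turn a raw string into adjacent bigrams, each sorted alphabetically."""
    tokens = TOKEN.findall(doc.lower())
    # Join each sorted pair into one string so it can serve as a feature name.
    return [" ".join(sorted(pair)) for pair in zip(tokens, tokens[1:])]


counts = Counter(unordered_bigram_analyzer("to house to"))
print(counts)
# Counter({'house to': 2})
```

CountVectorizer accepts a callable for its `analyzer` parameter, so, assuming scikit-learn is installed, `CountVectorizer(analyzer=unordered_bigram_analyzer)` should produce a bag-of-words style matrix over these features. When a callable is supplied, it receives the raw document, so options such as `ngram_range` and the built-in preprocessing would no longer apply.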

A similar question about unordered bigrams in a CountVectorizer can be found on Stack Overflow: https://stackoverflow.com/questions/42630026/
