python - Using set on a pandas Series with no duplicate values returns a smaller object

I have a pandas Series with unique values, but calling set (or pd.unique()) on it somehow returns a smaller object.

In [255]: titles.shape
Out[255]: (77767,)

In [256]: len(set(titles))
Out[256]: 77750

In [257]: titles.nunique()
Out[257]: 77750

On closer inspection, I found that the entries set treats as duplicates resemble one another, but they are not actually duplicates.

In [254]: titles[titles.duplicated()]
Out[254]: 
927892                            Sham (film)
945686                     Shalom in the Home
947578                            Sham (play)
4380452                Blind Spot (1958 film)
4390747                Blind Spot (1932 film)
4403857                     Blind Rage (film)
4406443                  Blind Witness (film)
4421728                          Blind Terror
4424566                Blind Spot (1947 film)
4435819                           Blind Wives
4441354                           Blind Youth
4452296                Blind Side (1993 film)
4629350                  Ports of Call (film)
5562561                 Great Day (1945 film)
5586514              Great Day in the Morning
5634649    Great Continental Railway Journeys
5640835           Great Day (unfinished film)
Name: Title, dtype: object

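What exactly is triggering this odd behavior? set seems to treat titles that share the same first word as duplicates. Strangely, I am extracting these movie titles from a Wikipedia dataset, so there must be many more entries that share a first word, yet only these 17 titles show up here.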

In [265]: title_list = list(titles)

In [266]: len(title_list)
Out[266]: 77767

In [267]: title_list = [i.split()[0] for i in title_list]

In [268]: len(set(title_list))
Out[268]: 17696

Any ideas?

Edit 2: Removed the link to the data, since the question has been answered.

Best answer

Let's take a simple example:

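import pandas as pd

check = pd.Series([1, 2, 2, 3, 4, 2])
# duplicated() defaults to keep='first', so the first occurrence is not flagged
check[check.duplicated()]
# 2    2
# 5    2
# dtype: int64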

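So duplicated() shows only the repeated occurrences, not the first instance of each value. The titles listed in the question are therefore real duplicates: each one matches an earlier row that the filter does not display.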
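The same effect can be reproduced on a small scale. The sketch below uses a made-up five-row Series as a stand-in for the question's `titles` (the values and index labels are assumptions for illustration only). It shows that the rows selected by `duplicated()` look unique among themselves, and that their count equals `len(titles) - titles.nunique()`, which matches the question's 77767 - 77750 = 17.

```python
import pandas as pd

# Hypothetical stand-in for the question's `titles` Series (not the real data).
titles = pd.Series(
    ["Sham (film)", "Blind Terror", "Sham (film)", "Blind Wives", "Blind Terror"],
    index=[10, 20, 30, 40, 50],
)

# Default keep='first': only the second and later occurrences are flagged.
later_copies = titles[titles.duplicated()]

# Number of rows that set() / nunique() collapse away
# (assumes the Series has no missing values, which nunique() skips by default).
n_missing = len(titles) - titles.nunique()

print(later_copies)
print(n_missing, len(later_copies))

# The flagged rows contain each repeated title only once here,
# which is why they can look like non-duplicates at first glance.
print(later_copies.is_unique)
```

Because each repeated title appears only once in the filtered output, it is easy to misread the result as a list of distinct, unrelated entries.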

To see every occurrence of the duplicated values, including the first, use:

# select every row whose value occurs more than once, first occurrence included
check[check.isin(check[check.duplicated()])]
# 1    2
# 2    2
# 5    2
# dtype: int64
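As an alternative that is not part of the original answer, `Series.duplicated` also accepts `keep=False`, which flags every occurrence of a repeated value in one step. A minimal sketch, reusing the answer's `check` Series, that compares the two approaches:

```python
import pandas as pd

check = pd.Series([1, 2, 2, 3, 4, 2])

# Approach from the answer: look up the values flagged as later copies.
via_isin = check[check.isin(check[check.duplicated()])]

# Alternative: keep=False marks all occurrences, including the first one.
via_keep_false = check[check.duplicated(keep=False)]

# Both selections contain the same rows.
print(via_isin.equals(via_keep_false))

# Sorting by value puts matching entries next to each other,
# which helps when inspecting a long Series of titles.
print(via_keep_false.sort_values())
```

On a Series of strings such as the question's titles, sorting the `keep=False` result places each title directly beside its copy, so the matching pairs are easy to inspect.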

Regarding "python - Using set on a pandas Series with no duplicate values returns a smaller object", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47394969/
