python - Changing the algorithm used to find nearest neighbors

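I am building a recommender system that recommends the 20 most suitable songs to a user. I have trained my model and I am ready to recommend songs for a given playlist! One problem I ran into, however, is that I need an embedding for that new playlist so that I can use kmeans to find the closest related playlists in that embedding space.
To recommend songs, I first cluster the learned embeddings of all the training playlists, and then take as "neighbor" playlists for my given test playlist all the other playlists in the same cluster. I then collect all the tracks from these playlists and feed the test playlist embedding and these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to appear next in the given test playlist.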

desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model =keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

print('\nPerforming kmeans to find the nearest users/playlists...')
# get 100 similar users
kmeans = KMeans(n_clusters=100, random_state=0, verbose=0).fit(user_latent_matrix)
desired_user_label = kmeans.predict(one_user_vector)
user_labels = kmeans.labels_
neighbors = []
for user_id, user_label in enumerate(user_labels):
    if user_label == desired_user_label[0]:
        neighbors.append(user_id)
print('Found {0} neighbor users/playlists.'.format(len(neighbors)))

tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])
print('Found {0} neighbor tracks from these users.'.format(len(tracks))) 

users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')

# and predict tracks for my user
results = model.predict([users,items],batch_size=100, verbose=0) 
results = results.tolist()
print('Ranked the tracks!')

results_df = pd.DataFrame(np.nan, index=range(len(results)), columns=['probability','track_name', 'track artist'])
print(results_df.shape)

# results[i] is the score for tracks[i], so look up the track name and artist by tracks[i] (not by i)
for i, prob in enumerate(results):
    results_df.loc[i] = [prob[0], df[df['trackindex'] == tracks[i]].iloc[0]['track_name'], df[df['trackindex'] == tracks[i]].iloc[0]['artist_name']]
results_df = results_df.sort_values(by=['probability'], ascending=False)

results_df.head(20)

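I would like to replace the code above with this https://www.tensorflow.org/recommenders/examples/basic_retrieval#building_a_candidate_ann_index or with the official GitHub repository from Spotify, https://github.com/spotify/annoy .
Unfortunately, I don't know how to use it so that the new program gives me the 20 most popular tracks for a user.
How do I change this?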

Edit:
What I tried:

from annoy import AnnoyIndex
import random
desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model =keras.models.load_model(model_path)
print('Loaded model!')
    
mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())
    
# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

t = AnnoyIndex(desired_user_id , one_user_vector)  #Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

u = AnnoyIndex(desired_user_id , one_user_vector)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
# Now how do I get the probability and the values?

Best answer

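You're almost there!
In the code that starts with desired_user_id = 123, there are 4 main steps (line numbers count from the first line of that code block):
1 (lines 1-12): retrieve the user embedding matrix (user_latent_matrix) from your saved model.
2 (lines 14-23): use kmeans to find the user's cluster label (desired_user_label) and list the other users in that cluster (neighbors). Users in the same cluster should listen to songs similar to yours.
3 (lines 25-31): list the songs liked by the other users in the cluster (tracks). The music you like will be similar to what others in your cluster listen to. Steps 2 and 3 only filter out 99% of all music, so that you run the model on the remaining 1% and save time and money. Removing steps 2 and 3 and adding every song to tracks would still work (but would take 100 times longer).
4 (lines 33+): use the saved model to predict whether the songs liked by the other users in your cluster suit you (results_df).
Annoy is a replacement for finding similar users (step 2). Instead of using kmeans to find the user's cluster and then finding the other users in that cluster, it uses a k-nearest-neighbors-style algorithm to find close users directly.
After you find one_user_vector on line 12, replace step 2 (lines 14-23) with something like: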
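To make the difference from clustering concrete, here is a minimal sketch of what a nearest-neighbor lookup computes, done exactly by brute force in NumPy instead of with Annoy's approximate trees. The function name exact_nearest_users and the random demo_matrix are illustrative assumptions, not part of the original code. Ranking by cosine similarity gives the same order as Annoy's 'angular' metric, so on a small matrix this can serve as a slow reference for sanity-checking Annoy's results.

```python
import numpy as np

def exact_nearest_users(user_latent_matrix, desired_user_id, n_neighbors):
    """Return the ids of the n_neighbors rows most similar to the query row.

    Hypothetical helper: exact cosine-similarity search over every row.
    The query user itself comes back first, as it is its own closest match.
    """
    # Normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(user_latent_matrix, axis=1, keepdims=True)
    unit_rows = user_latent_matrix / np.clip(norms, 1e-12, None)
    similarity = unit_rows @ unit_rows[desired_user_id]
    # Sort from most to least similar and keep the first n_neighbors ids.
    order = np.argsort(-similarity, kind="stable")
    return order[:n_neighbors].tolist()

# Made-up stand-in for the real embedding matrix (1000 users, 32 dimensions).
rng = np.random.default_rng(0)
demo_matrix = rng.normal(size=(1000, 32)).astype("float32")
demo_neighbors = exact_nearest_users(demo_matrix, 123, 100)
```

Unlike the kmeans version, the number of neighbors here is chosen directly (100) instead of depending on how many users happen to fall into the same cluster.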

from annoy import AnnoyIndex

user_embedding_length = 32  # must equal the embedding width (the question reshapes to (1, 32))
t = AnnoyIndex(user_embedding_length, 'angular')

# add the user embeddings to annoy (your annoy userids will be the row indexes)
for user_id, user_embedding in enumerate(user_latent_matrix):
    t.add_item(user_id, user_embedding)

# build the forest
t.build(10) # 10 trees

# save the forest for later if you're using this again and don't want to rebuild the trees every time
t.save('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)
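Two details may be worth handling here. The list returned by get_nns_by_item normally includes the query user itself, and Annoy returns distances, not probabilities (the probabilities still come from the model in step 4). The sketch below assumes you called get_nns_by_item with include_distances=True, which returns a pair of lists (ids, distances). The helper name neighbors_with_similarity and the example values are made up for illustration. The conversion relies on Annoy's documented 'angular' distance, sqrt(2 * (1 - cos)).

```python
def neighbors_with_similarity(ids, distances, query_id):
    """Drop the query item and convert angular distances to cosine similarity.

    Hypothetical helper: `ids` and `distances` are the two lists that
    t.get_nns_by_item(query_id, n, include_distances=True) returns.
    """
    pairs = []
    for item_id, distance in zip(ids, distances):
        if item_id == query_id:
            continue  # the query user is its own nearest neighbor
        # angular distance d = sqrt(2 * (1 - cos))  =>  cos = 1 - d**2 / 2
        pairs.append((item_id, 1.0 - distance * distance / 2.0))
    return pairs

# Illustrative values only, in the shape Annoy would return them.
example_ids = [123, 7, 42, 981]
example_distances = [0.0, 0.5, 1.0, 2 ** 0.5]
example_pairs = neighbors_with_similarity(example_ids, example_distances, 123)
```

If you only need the ids for step 3, `neighbors = [user_id for user_id, _ in example_pairs]` keeps the rest of the code unchanged.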

If you want to run this again later and don't want to rebuild the trees (because you have already run it once), just replace step 2 with:

from annoy import AnnoyIndex

user_embedding_length = 32  # must equal the embedding width used when the index was built
t = AnnoyIndex(user_embedding_length, 'angular')

# load the trees
t.load('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100)

After replacing step 2, just run steps 3 and 4 (lines 25+) as usual.
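For reference, steps 3 and 4 can be sketched as one function that works the same whether neighbors came from kmeans or from Annoy. This is only a sketch under assumptions: rank_neighbor_tracks, predict_fn, fake_predict and the demo data are hypothetical names introduced here, while the column names (pid, trackindex, track_name, artist_name) are taken from the question. It de-duplicates candidate tracks and looks up each name by its own track id, so each probability stays attached to the right track.

```python
import numpy as np
import pandas as pd

def rank_neighbor_tracks(df, neighbors, desired_user_id, predict_fn, top_n=20):
    """Score every track from the neighbor playlists and return the top_n."""
    # Step 3: unique tracks that appear in any neighbor playlist.
    candidates = df[df['pid'].isin(neighbors)]
    track_ids = candidates['trackindex'].unique()

    # Step 4: one (user, track) pair per candidate track.
    users = np.full(len(track_ids), desired_user_id, dtype='int32')
    items = track_ids.astype('int32')
    probabilities = np.asarray(predict_fn([users, items])).reshape(-1)

    # Look up name and artist by track id, not by position in the results.
    track_info = df.drop_duplicates('trackindex').set_index('trackindex')
    ranked = pd.DataFrame({
        'probability': probabilities,
        'track_name': track_info.loc[track_ids, 'track_name'].to_numpy(),
        'artist_name': track_info.loc[track_ids, 'artist_name'].to_numpy(),
    }, index=track_ids)
    return ranked.sort_values('probability', ascending=False).head(top_n)

# Tiny made-up dataset and a stand-in for model.predict, for illustration only.
demo_df = pd.DataFrame({
    'pid': [1, 1, 2, 2, 3],
    'trackindex': [10, 11, 11, 12, 13],
    'track_name': ['A', 'B', 'B', 'C', 'D'],
    'artist_name': ['x', 'y', 'y', 'z', 'w'],
})

def fake_predict(inputs):
    users, items = inputs
    return (items / 100.0).reshape(-1, 1)  # higher track id => higher score

demo_ranked = rank_neighbor_tracks(demo_df, [1, 2], 0, fake_predict, top_n=2)
```

With the real model, predict_fn could be something like `lambda pairs: model.predict(pairs, batch_size=100, verbose=0)`, mirroring the call in the question.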

A similar question to this one can be found on Stack Overflow: https://stackoverflow.com/questions/64754955/
