python - How to use sklearn.metrics.pairwise pairwise_distances with a callable metric?

I'm doing some behavior analysis in which I track behaviors over time and then create n-grams of those behaviors.

sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
                      ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
                      ['scratch', 'scratch', 'scratch', 'sit', 'stand']]

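I want to cluster these n-grams, but I need to create a precomputed distance matrix using a custom metric. My metric appears to work fine, but when I try to create the distance matrix with the sklearn function, I get an error: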

ValueError: could not convert string to float: 'scratch'

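I've looked at the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html and it isn't particularly clear on this topic.

Is anyone familiar with how to use this properly?

The full code is below: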

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.mlab as mlab
import math
import hashlib 
import networkx as nx
import itertools
import hdbscan
from sklearn.metrics.pairwise import pairwise_distances

def get_levenshtein_distance(path1, path2):
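    """
    Compute the Levenshtein (edit) distance between two sequences.
    https://en.wikipedia.org/wiki/Levenshtein_distance
    :param path1: first sequence of tokens
    :param path2: second sequence of tokens
    :return: minimum number of insertions, deletions and substitutions
    """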
    matrix = [[0 for x in range(len(path2) + 1)] for x in range(len(path1) + 1)]

    for x in range(len(path1) + 1):
        matrix[x][0] = x
    for y in range(len(path2) + 1):
        matrix[0][y] = y

    for x in range(1, len(path1) + 1):
        for y in range(1, len(path2) + 1):
            if path1[x - 1] == path2[y - 1]:
                matrix[x][y] = min(
                    matrix[x - 1][y] + 1,
                    matrix[x - 1][y - 1],
                    matrix[x][y - 1] + 1
                )
            else:
                matrix[x][y] = min(
                    matrix[x - 1][y] + 1,
                    matrix[x - 1][y - 1] + 1,
                    matrix[x][y - 1] + 1
                )

    return matrix[len(path1)][len(path2)]

sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
                      ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
                      ['scratch', 'scratch', 'scratch', 'sit', 'stand']]

print("should be 0")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[1]))
print("should be 1")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[0]))
print("should be 2")
print(get_levenshtein_distance(sample_n_gram_list[0],sample_n_gram_list[2]))

clust_number = 2
distance_matrix = pairwise_distances(sample_n_gram_list, metric=get_levenshtein_distance)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_

Best answer

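That's because pairwise_distances in sklearn is designed to work on numerical arrays (so that all the different built-in distance functions can work properly), but you are passing it lists of strings. sklearn tries to convert the input to a float array before it ever calls your metric, which is where the ValueError comes from. If you convert the strings to numbers (encode each distinct string as a specific number) and pass that instead, it will work properly.

A quick numpy way to do this is: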

# Get all the unique strings in the input data
uniques = np.unique(sample_n_gram_list)
# Output:
# array(['scratch', 'sit', 'smell/sniff', 'stand'])

# Encode the strings to numbers according to the indices in "uniques" array
X = np.searchsorted(uniques, sample_n_gram_list)

# Output:
# array([[0, 0, 0, 0, 0],    <= scratch is assigned 0, sit = 1 and so on
#        [0, 0, 0, 0, 2],
#        [0, 0, 0, 1, 3]])

# Now this works
distance_matrix = pairwise_distances(X, metric=get_levenshtein_distance)

# Output:
# array([[0., 1., 2.],
#        [1., 0., 2.],
#        [2., 2., 0.]])
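
If you would rather not encode the strings at all, another option is to build the precomputed matrix yourself and skip pairwise_distances entirely. The following is only a minimal sketch of that idea, not something from sklearn: the helper names levenshtein and pairwise_matrix are made up for this illustration, and the edit distance is a two-row variant of the function in the question. It uses only the standard library and assumes the metric is symmetric, so each pair is computed once.

```python
from itertools import combinations


def levenshtein(seq1, seq2):
    """Edit distance between two sequences of comparable tokens (hypothetical helper)."""
    # previous[j] holds the distance between the first i-1 tokens of seq1
    # and the first j tokens of seq2.
    previous = list(range(len(seq2) + 1))
    for i, token1 in enumerate(seq1, start=1):
        current = [i]
        for j, token2 in enumerate(seq2, start=1):
            cost = 0 if token1 == token2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def pairwise_matrix(sequences, metric):
    """Symmetric distance matrix as a list of lists of floats (hypothetical helper)."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    # Assumes metric(a, b) == metric(b, a) and metric(a, a) == 0.
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = float(metric(sequences[i], sequences[j]))
    return matrix


sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
                      ['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
                      ['scratch', 'scratch', 'scratch', 'sit', 'stand']]

distance_matrix = pairwise_matrix(sample_n_gram_list, levenshtein)
print(distance_matrix)
# [[0.0, 1.0, 2.0], [1.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
```

Before handing this to a clusterer with metric='precomputed', you would probably want to convert it with np.array(distance_matrix, dtype=float), since such estimators generally expect a numeric array rather than nested lists.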

Regarding "python - How to use sklearn.metrics.pairwise pairwise_distances with a callable metric?", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/53808957/
