I want to optimize a piece of code that computes, for every item in a dataset of 100k rows, its nearest neighbor. The dataset contains 50 variable columns that describe each row item, and most cells hold a probability value between 0 and 1.
Question: I'm new to Python, but I wonder whether a more advanced user can recommend a better structure for the code below to help speed up the computation. At the moment the program takes a very long time to finish. Thanks in advance!
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
from sklearn.neighbors import KNeighborsRegressor

df_set = pd.read_excel('input.xlsx', skiprows=0)

distance_columns = ["var_1",
                    ......,
                    ......,
                    ......,
                    "var_50"]

def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_row[k]) ** 2
    return math.sqrt(inner_value)

knn_name_list = []

for i in range(len(df_set.index)):
    numeric = df_set[distance_columns]
    normalized = (numeric - numeric.mean()) / numeric.std()
    normalized.fillna(0, inplace=True)
    selected_normalized = normalized[df_set["Filename"] == df_set["Filename"][i]]
    euclidean_distances = normalized.apply(lambda row: distance.euclidean(row, selected_normalized), axis=1)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    second_smallest = distance_frame.iloc[1]["idx"]
    most_similar_to_selected = df_set.loc[int(second_smallest)]["Filename"]
    knn_name_list.append(most_similar_to_selected)

print(knn_name_list)

df_set['top_neighbor'] = np.array(knn_name_list)
df_set.to_csv('output.csv', encoding='utf-8', sep=',', index=False)
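For context on where the time goes: the loop above re-normalizes the whole DataFrame on every iteration and then runs a row-by-row `apply`, so the work is roughly quadratic in pure Python. The same result can be obtained by normalizing once and computing all pairwise distances in a single vectorized call. A minimal sketch with `scipy.spatial.distance.cdist`, using a small random array as a stand-in for the real 100k × 50 data:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Small random stand-in for the real 100k x 50 dataset
rng = np.random.default_rng(0)
X = rng.random((6, 4))  # rows x variable columns, values in [0, 1]

# z-score normalize once, outside any loop
mu = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)  # ddof=1 matches pandas' default .std()
normalized = np.nan_to_num((X - mu) / sd)  # 0 where a column is constant

# All pairwise Euclidean distances in one call: shape (6, 6)
dists = cdist(normalized, normalized, metric='euclidean')

# For each row, the nearest *other* row: mask out the zero self-distance
np.fill_diagonal(dists, np.inf)
nearest_idx = dists.argmin(axis=1)
print(nearest_idx)
```

Note that `cdist` materializes the full distance matrix, so for 100k rows (100k × 100k floats) it would not fit in memory; the `NearestNeighbors` approach in the answer below avoids that by using a tree-based index.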
Best answer
I suggest using NearestNeighbors (set n_jobs to -1 to use all processors).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
# Loading data
df_set = ...

# Selecting the numerical columns
numeric = df_set[distance_columns]

# Normalizing
normalized = normalize(numeric, norm='l1', axis=1, copy=True, return_norm=False)

# Initializing NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='euclidean', n_jobs=-1)

# Fitting with the normalized data
neigh.fit(normalized)

...
second_smallest = ...

# Getting the rows most similar to your selected data
most_similar_to_selected = neigh.kneighbors(second_smallest)
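To show how this maps back onto the question's goal (a `top_neighbor` name for every row), here is a runnable end-to-end sketch. The toy DataFrame and its column names are placeholders, and I've kept the question's z-score normalization instead of the answer's per-row L1 `normalize`; `kneighbors` with `n_neighbors=2` returns each row itself in column 0 (distance 0) and its nearest other row in column 1:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real dataset (names and values are made up)
df_set = pd.DataFrame({
    "Filename": ["a", "b", "c", "d"],
    "var_1": [0.1, 0.9, 0.15, 0.8],
    "var_2": [0.2, 0.8, 0.25, 0.7],
})
distance_columns = ["var_1", "var_2"]

# Normalize once, up front (z-score, as in the question)
numeric = df_set[distance_columns]
normalized = ((numeric - numeric.mean()) / numeric.std()).fillna(0).to_numpy()

# n_neighbors=2 because each point's nearest neighbor is itself
neigh = NearestNeighbors(n_neighbors=2, metric='euclidean', n_jobs=-1)
neigh.fit(normalized)

# One call answers the query for every row at once
dist, idx = neigh.kneighbors(normalized)

# Column 0 is the row itself; column 1 is its nearest other row
df_set['top_neighbor'] = df_set["Filename"].to_numpy()[idx[:, 1]]
print(df_set['top_neighbor'].tolist())
```

The key structural change versus the question's code is that normalization and the neighbor search each happen once, instead of once per row, and the tree index built by `fit` makes every query fast without holding a full pairwise distance matrix.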
Regarding python - Optimizing a K-nearest-neighbors algorithm on a 50-variable x 100k-row dataset, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58082177/