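I want to automate my code because I need it to process several files. Each time, I want to build a correlation matrix and set a threshold; if the correlation between two columns is above the threshold, one of the two columns is chosen and dropped from the DataFrame. I want to repeat this process until no correlation is above the threshold.
Does anyone know how to solve this? Thanks!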
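Best answer
Removing a variable does not change the correlations between the remaining variables, so the correlation matrix only needs to be computed once. You can then iteratively remove the variable that has the most correlations above the threshold until none are left. You may also want to look into dimensionality reduction or feature importance as ways to remove redundant variables.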
import numpy as np

np.random.seed(42)

# 100 variables, 100 samples, to make some features
# highly correlated by random chance.
# np.corrcoef treats each row of x as one variable.
x = np.random.random((100, 100))
corr = abs(np.corrcoef(x))

# Set diagonal to zero to make comparison with threshold simpler
np.fill_diagonal(corr, 0)
threshold = 0.3

# Mask to keep track of what is removed
keep_idx = np.ones(x.shape[0], dtype=bool)

for i in range(x.shape[0]):
    # Create the mask from the kept indices
    mask = np.ix_(keep_idx, keep_idx)
    # Get the number of correlations above the threshold
    # for each variable that is still kept
    counts = np.sum(corr[mask] > threshold, axis=0)
    print(counts.shape)
    if max(counts) == 0:
        break
    # Get the worst offender and work out what the
    # original index was
    idx = np.where(keep_idx)[0][np.argmax(counts)]
    # Update mask
    keep_idx[idx] = False
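Since the question is about dropping columns from a DataFrame, here is a minimal sketch of the same greedy loop wrapped in a function for pandas. The function name `drop_correlated_columns` and the demo column names are made up for illustration and are not part of the original answer. The sketch also assumes every column is numeric.

```python
import numpy as np
import pandas as pd


def drop_correlated_columns(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical helper: greedily drop columns until no pair of
    remaining columns has an absolute correlation above `threshold`."""
    # Absolute correlation between columns; assumes all columns are numeric.
    corr = df.corr().abs().to_numpy().copy()
    # Zero the diagonal so a column is never compared with itself.
    np.fill_diagonal(corr, 0.0)

    keep = np.ones(corr.shape[0], dtype=bool)
    while True:
        # Correlations among the columns that are still kept.
        sub = corr[np.ix_(keep, keep)]
        # For each kept column, count its correlations above the threshold.
        counts = (sub > threshold).sum(axis=0)
        if counts.size == 0 or counts.max() == 0:
            break
        # Map the worst offender back to its original column position.
        worst = np.flatnonzero(keep)[np.argmax(counts)]
        keep[worst] = False

    return df.loc[:, keep]


# Small demo with made-up data: "a_noisy" is nearly a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})
reduced = drop_correlated_columns(demo, threshold=0.9)
print(list(reduced.columns))
```

To process several files, the same function could be called on each DataFrame after loading it, for example with `pd.read_csv`. When two columns have the same count, `np.argmax` picks the first one, so which column of a correlated pair survives depends on column order.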
Regarding python - automatically deciding which feature to drop from a correlation matrix in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59771290/