python - How to print correlated features using python pandas?

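I am trying to get some information about the correlation between my independent variables.

My dataset has a lot of variables, so a heatmap is not a solution; it is unreadable.

Currently, I have a function that returns only the variables that are highly correlated. I would like to change it so that it reports the pairs of correlated features.

The rest of the explanation is below: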

import numpy as np

def find_correlated_features(df, threshold, target_variable):

    # remove the target column
    df_1 = df.drop(columns=target_variable)

    # corr_matrix has the variable names as both index and columns
    corr_matrix = df_1.corr().abs()

    # I'm taking only half of this matrix to prevent doubling results
    half_of_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # This builds the list of columns which are correlated
    to_drop = [column for column in half_of_matrix.columns if any(half_of_matrix[column] > threshold)]

    return to_drop

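It would be best if this function returned a pandas DataFrame with the columns column_1, column_2 and corr_coef, containing only the pairs above the threshold.

Something like this: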

output = {'feature name 1': column_name,
          'feature name 2': index,
          'correlation coef': corr_coef}

output_list.append(output)
return pd.DataFrame(output_list).sort_values('correlation coef', ascending=False)

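Best answer

After the edit:

After the OP's comment and @user6386471's answer, I read the question again, and I think simply restructuring the correlation matrix will do, with no loop needed. Something like half_of_matrix.stack().reset_index() plus a filter. See: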

import numpy as np

def find_correlated_features(df, threshold, target_variable):
    # remove target column
    df = df.drop(columns=target_variable).copy()
    # Get correlation matrix
    corr_matrix = df.corr().abs()
    # Take half of the matrix to prevent doubling results
    corr_matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Restructure correlation matrix to dataframe
    df = corr_matrix.stack().reset_index()
    df.columns = ['feature1', 'feature2', 'corr_coef']
    # Apply filter and sort coefficients
    df = df[df.corr_coef >= threshold].sort_values('corr_coef', ascending=False)
    return df
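
As a quick check, the restructuring approach can be run on a small synthetic frame. This is only a sketch: the column names (a, b, c, target) and the random data are made up for illustration, and the function is repeated here with the built-in bool so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def find_correlated_features(df, threshold, target_variable):
    # Same approach as above, repeated so this snippet is self-contained
    features = df.drop(columns=target_variable)
    corr_matrix = features.corr().abs()
    # Boolean mask that keeps only the strict upper triangle (each pair once)
    upper = np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    pairs = corr_matrix.where(upper).stack().reset_index()
    pairs.columns = ['feature1', 'feature2', 'corr_coef']
    # NaN entries (if any survive stack) fail the comparison and are dropped here
    return pairs[pairs.corr_coef >= threshold].sort_values('corr_coef', ascending=False)


# Hypothetical toy data: b is an exact linear function of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + 1,
    'c': rng.normal(size=200),
    'target': rng.normal(size=200),
})

result = find_correlated_features(toy, 0.9, 'target')
print(result)
```

Only the (a, b) pair should survive the 0.9 threshold, since b is built directly from a while c is independent noise.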

Original answer:

You can easily create a Series of the coefficients above a threshold like this:

s = df.corr().loc[target_col]
s[s.abs() >= threshold]

where df is your DataFrame, target_col is your target column, and threshold is the threshold.

Example:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')

print(df.shape)
# -> (150, 5)

print(df.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

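def find_correlated_features(df, threshold, target_variable):
    # keep numeric columns only, so the string column 'species' does not break corr()
    s = df.select_dtypes('number').corr().loc[target_variable].drop(target_variable)
    return s[s.abs() >= threshold]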

find_correlated_features(df, .7, 'sepal_length')

Output:

petal_length    0.871754
petal_width     0.817941
Name: sepal_length, dtype: float64

You can use .to_frame() and .T on the output to get a pandas DataFrame.
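
A minimal sketch of that conversion. To avoid needing seaborn and the iris download, the Series here is built by hand from the values in the output above and stands in for the function's return value.

```python
import pandas as pd

# Series built by hand from the output shown above
# (a stand-in for the return value of find_correlated_features)
s = pd.Series(
    {'petal_length': 0.871754, 'petal_width': 0.817941},
    name='sepal_length',
)

as_column = s.to_frame()   # one column named 'sepal_length', features as the index
as_row = s.to_frame().T    # one row labeled 'sepal_length', features as the columns

print(as_column)
print(as_row)
```

The column form is usually easier to sort or join on; the transposed form puts each correlated feature in its own column.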

Regarding python - How to print correlated features using python pandas?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65039955/
