python - Unsupervised population classification

I have a dataset with 2 parameters that looks like this (I added density contour plots):

My goal is to separate this sample into 2 subsets, like this:

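This image comes from "Quenching of Star Formation in SDSS Groups: Centrals, Satellites, and Galactic Conformity", Knobel et al., The Astrophysical Journal, 800:24 (20pp), 2015 February 1, available here. The separation line was drawn by eye and is not perfect.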

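What I need is something like the red line (the one maximizing the distance) in this nice Wikipedia chart:

Unfortunately, all the linear classifiers that look close to what I am after (SVM, SVC, etc.) are supervised learning methods.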

I have tried unsupervised learning, such as KMeans with 2 clusters, in the following way (CompactSFR[['lgm_tot_p50','sSFR']] is the Pandas dataset that you can find at the end of this post):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = CompactSFR[['lgm_tot_p50', 'sSFR']]

    kmeans2 = KMeans(n_clusters=2)
    # Fitting the input data
    kmeans2 = kmeans2.fit(X)
    # Getting the cluster labels
    labels2 = kmeans2.predict(X)
    # Centroid values
    centroids = kmeans2.cluster_centers_

    f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True)
    ax1.scatter(CompactSFR['lgm_tot_p50'], CompactSFR['sSFR'], c=labels2)
    # Distance of every point to each of the two centroids
    X2 = kmeans2.transform(X)
    ax1.set_title("Kmeans 2 clusters", fontsize=15)
    # Raw string so that \l is not read as an escape sequence
    ax1.set_xlabel(r'$\log_{10}(M)$', fontsize=10)
    ax1.set_ylabel('sSFR', fontsize=10)
    f.subplots_adjust(hspace=0)

But the classification I get is this:

which does not work.

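Also, what I want is not just a classification but the equation of the separation line (which is obviously very different from a linear regression).

I would like to avoid developing a Bayesian maximum-likelihood model if something already exists.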

You can find a small sample (959 points) here.

Note: this question does not correspond to my case.

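Best answer

The following code does this with a 2-component Gaussian mixture model and produces this result.

First, read the data from your file and remove the outliers: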

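    import numpy as np
    import pandas as pd
    from sklearn.neighbors import KernelDensity

    # FILE is the path to the CSV sample from the question
    frm = pd.read_csv(FILE, index_col=0)

    # Estimate the local density at every point with a Gaussian kernel
    kd = KernelDensity(kernel='gaussian')
    kd.fit(frm.values)
    density = np.exp(kd.score_samples(frm.values))

    # Keep only the points that lie in sufficiently dense regions
    filtered = frm.values[density > 0.05, :]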
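The fixed cutoff of 0.05 depends on the units of the data and on the kernel bandwidth (the `KernelDensity` default is 1.0), so it may need tuning for another sample. One possible variation, not part of the original answer, is to drop a fixed fraction of the lowest-density points instead. The sketch below uses made-up synthetic points in place of the real file; the bandwidth of 0.5 and the 5% fraction are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic stand-in for the real sample (assumption): a dense core
# plus a few uniformly scattered stray points.
rng = np.random.default_rng(2)
core = rng.normal(loc=[10.5, -11.0], scale=[0.5, 0.8], size=(500, 2))
strays = rng.uniform(low=[7.0, -16.0], high=[14.0, -6.0], size=(15, 2))
values = np.vstack([core, strays])

# Log-density of each point under a Gaussian kernel density estimate
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(values)
log_density = kde.score_samples(values)

# Discard the 5% of points with the lowest estimated density
cutoff = np.quantile(log_density, 0.05)
kept = values[log_density > cutoff]

print(f"kept {len(kept)} of {len(values)} points")
```

Working with the log-density directly avoids the `np.exp` step, and the quantile makes the number of removed points predictable whatever the scale of the axes.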

Then fit a Gaussian mixture model:

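    from sklearn.mixture import GaussianMixture

    # Two components with unconstrained (full) covariance matrices
    model = GaussianMixture(n_components=2, covariance_type='full')
    model.fit(filtered)

    # Component index (0 or 1) assigned to each point
    cl = model.predict(filtered)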
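With `covariance_type='full'` the boundary between the two components is in general a curve (a conic), not a straight line. If a line equation is what is needed, one option (a variation on the answer, not something it states) is `covariance_type='tied'`: both components then share one covariance matrix Σ and the boundary becomes the line w·x + b = 0 with w = Σ⁻¹(μ₁ − μ₀) and b = log(π₁/π₀) − ½(μ₁ᵀΣ⁻¹μ₁ − μ₀ᵀΣ⁻¹μ₀). The sketch below uses synthetic points as a stand-in for `filtered`; the blob positions are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for `filtered` (assumption): two elongated blobs
rng = np.random.default_rng(0)
cov = [[0.30, 0.05], [0.05, 0.10]]
blob_a = rng.multivariate_normal([10.0, -10.0], cov, size=400)
blob_b = rng.multivariate_normal([11.0, -12.0], cov, size=400)
points = np.vstack([blob_a, blob_b])

# 'tied' forces both components to share a single covariance matrix
tied = GaussianMixture(n_components=2, covariance_type='tied', random_state=0)
tied.fit(points)

# Linear boundary: component 1 is chosen where w @ x + b > 0
precision = np.linalg.inv(tied.covariances_)
mu0, mu1 = tied.means_
w = precision @ (mu1 - mu0)
b = (np.log(tied.weights_[1] / tied.weights_[0])
     - 0.5 * (mu1 @ precision @ mu1 - mu0 @ precision @ mu0))

# Rewrite w[0]*x + w[1]*y + b = 0 as y = slope * x + intercept
slope = -w[0] / w[1]
intercept = -b / w[1]
print(f"y = {slope:.3f} * x + {intercept:.3f}")

# Sanity check: the analytic rule should reproduce the model's own labels
side = (points @ w + b > 0).astype(int)
print("agreement with predict():", (side == tied.predict(points)).mean())
```

The slope/intercept form assumes the line is not vertical (`w[1]` is nonzero); otherwise use `w` and `b` directly.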

Plot the result:

    import matplotlib.pyplot as plt

    # One colour per mixture component
    plt.scatter(filtered[cl == 0, 0], filtered[cl == 0, 1], color='Blue')
    plt.scatter(filtered[cl == 1, 0], filtered[cl == 1, 1], color='Red')
    plt.show()
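The question asks for a maximum-margin line like the one in the SVM illustration. Once the mixture model has produced labels, a possible follow-up (an addition to the answer, not part of it) is to treat those labels as if they were ground truth and fit a linear SVM on them, then read the line from `coef_` and `intercept_`. The sketch below is self-contained and uses invented synthetic blobs in place of `filtered`; `C=1.0` is just the scikit-learn default.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Synthetic stand-in for `filtered` (assumption): two blobs
rng = np.random.default_rng(1)
sample = np.vstack([
    rng.multivariate_normal([10.0, -10.0], [[0.30, 0.05], [0.05, 0.10]], size=300),
    rng.multivariate_normal([11.0, -12.0], [[0.20, 0.00], [0.00, 0.15]], size=300),
])

# Step 1: unsupervised labels from the Gaussian mixture model
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
pseudo_labels = gmm.fit_predict(sample)

# Step 2: supervised linear SVM trained on those pseudo-labels
svm = SVC(kernel='linear', C=1.0)
svm.fit(sample, pseudo_labels)

# Decision line: a*x + c*y + d = 0, i.e. y = -(a/c)*x - d/c
a, c = svm.coef_[0]
d = svm.intercept_[0]
svm_slope = -a / c
svm_intercept = -d / c
print(f"y = {svm_slope:.3f} * x + {svm_intercept:.3f}")

# Fraction of points on which the line agrees with the mixture labels
agreement = (svm.predict(sample) == pseudo_labels).mean()
print("agreement:", agreement)
```

The resulting line is only as good as the mixture labels it was trained on, so it is worth checking the agreement figure and plotting the line over the scatter before relying on it.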

Regarding python - Unsupervised population classification, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55143333/
