python - Problem handling NaN when computing distances?

I have a DataFrame like the following (simplified), with the point names as the index:

import numpy as np
import pandas as pd
a = {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]}
df = pd.DataFrame(a, index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
df

Because it contains NaN, I want any column that can be parsed as numbers to be treated as numeric, so I do the following:

for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s

After converting the columns to numeric types, I want to compute the distance matrix as follows:

def distmetric(x,y):
    numeric5=x.select_dtypes(include=["number"])
    others5=x.select_dtypes(exclude=["number"])
    numeric6=y.select_dtypes(include=["number"])
    others6=y.select_dtypes(exclude=["number"])
    numnp5=numeric5.values
    catnp5=others5.values
    numnp6=numeric6.values
    catnp6=others6.values
    result3=np.around((np.repeat(numnp5, len(numnp6),axis=0) - np.tile(numnp6,(len(numnp5),1)))**2,3)
    catres3=~(np.equal((np.repeat(catnp5,len(catnp6),axis=0)),(np.tile(catnp6,(len(catnp5),1)))))
    sumtogeth3=result3.sum(axis=1)
    sumcattoget3=catres3.sum(axis=1)
    sum_result3=sumtogeth3+sumcattoget3
    final_result3=np.around(np.sqrt(sum_result3),3)
    final_result20=np.reshape(final_result3, (len(x.index),len(y.index)))
    return final_result20

metric=distmetric(df,df)
print(metric)

The distance matrix I get is:

 [[0.    1.005 0.2     nan 1.005 1.02  1.005   nan]
 [1.005 0.    1.044   nan 0.2   1.044 0.2     nan]
 [0.2   1.044 0.      nan 1.005 1.    1.005   nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [1.02  1.044 1.      nan 1.005 1.    1.005   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]]

I would like to get the following output instead:

            x1       x2       x3      x4      x5       x6       x7       x8
x1         0.0      1.005    0.2     1.0     1.005    1.02     1.005   1.414
x2         1.005    0.0     1.044   1.414    0.2      1.044    0.2     1.414
x3         0.2      1.044    0.0     1.0     1.005    1.0      1.005   1.414
x4         1.0      1.414    1.0     0.0     1.414    1.414    1.414    1.0
x5         1.005    0.2     1.005   1.414    0.0      1.005    0.0     1.414
x6         1.02     1.044    1.0    1.414    1.005    0.0      1.005    1.0
x7         1.005    0.2     1.005   1.414    0.0      1.005    0.0     1.414
x8         1.414    1.414   1.414    1.0     1.414     1.0     1.414    0.0

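I want the distance between two NaNs to be 0, and the distance between NaN and any number or any string to be 1. Is there a way to do this?

Edit: I compute the distance in the following way: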

for each row:
     if col is numerical:
         then calculate ((x1 element)-(x2 element))**2 and add this value to squareresult
     if col is categorical:
         then compare x1 element and x2 element.
         if they are equal then cateresult=0
         else cateresult=1
     totaldistanceresultforrow=sqrt(squareresult+cateresult)

Note: NaN-NaN=0 and NaN-any number or string=1 (here "-" is subtraction)
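The rules above can be written out as a plain per-pair function, which is handy as a reference when checking a vectorized version. This is only a sketch under stated assumptions: the helper names `is_missing` and `pair_distance` are hypothetical, rows are passed as plain dicts, and a numeric NaN paired with a real number contributes 1 to the squared sum (which matches the desired table, since 1**2 == 1).

```python
import math

def is_missing(value):
    """Return True for None or a float NaN; strings are never missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def pair_distance(row1, row2, numeric_cols, categorical_cols):
    """Distance between two rows using the NaN rules from the question.

    Hypothetical helper: row1 and row2 are dict-like, keyed by column name.
    """
    total = 0.0
    for col in list(numeric_cols) + list(categorical_cols):
        u, v = row1[col], row2[col]
        if is_missing(u) and is_missing(v):
            continue            # NaN vs NaN contributes 0
        if is_missing(u) or is_missing(v):
            total += 1.0        # NaN vs any value contributes 1
        elif col in numeric_cols:
            total += (u - v) ** 2
        else:
            total += 0.0 if u == v else 1.0
    return math.sqrt(total)

# Three rows taken from the example DataFrame
x1 = {'a': 0.6, 'b': 'cat'}
x4 = {'a': float('nan'), 'b': 'cat'}
x8 = {'a': float('nan'), 'b': float('nan')}

d_x1_x4 = pair_distance(x1, x4, ['a'], ['b'])
d_x1_x8 = pair_distance(x1, x8, ['a'], ['b'])
d_x4_x8 = pair_distance(x4, x8, ['a'], ['b'])
```

Looping over every pair of rows with this function is O(n²) Python calls, so it is mainly useful for validating a faster implementation on small inputs.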

Best answer

This is what worked for me:

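# Pairwise squared differences of the numeric column
# (NaN wherever either value is NaN)
a_vals = df['a'].values
square_res = (a_vals - a_vals[:, None]) ** 2
numeric = pd.DataFrame(square_res)

# A column that is entirely NaN belongs to a row whose 'a' value is NaN
idx = numeric.isnull().all()
alltrueindices = np.where(idx)

# NaN vs NaN -> 0
for index in alltrueindices:
    numeric.loc[index, index] = 0
# NaN vs any number -> 1
numeric = numeric.fillna(1)

# Replace NaN with a placeholder string so that two missing categories compare equal
df['b'] = df['b'].fillna('?')
b_vals = df['b'].values
cat_res = (b_vals != b_vals[:, None])
res = (numeric + cat_res) ** .5

print(res.round(3))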
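The same idea can be generalized to several columns and made to return a matrix labeled with the original index, as in the desired output. The following is a minimal sketch, not part of the original answer: the function name `nan_aware_distance` and its parameters are hypothetical, and it uses explicit boolean masks instead of a placeholder string, so the input DataFrame is left unmodified.

```python
import numpy as np
import pandas as pd

def nan_aware_distance(df, numeric_cols, categorical_cols):
    """Pairwise row distances where NaN vs NaN counts 0 and NaN vs value counts 1."""
    n = len(df)
    total = np.zeros((n, n))
    for col in list(numeric_cols) + list(categorical_cols):
        missing = df[col].isna().to_numpy()
        one_missing = missing[:, None] ^ missing[None, :]
        both_missing = missing[:, None] & missing[None, :]
        if col in numeric_cols:
            v = df[col].to_numpy(dtype=float)
            part = (v[:, None] - v[None, :]) ** 2    # NaN where a value is missing
        else:
            v = df[col].to_numpy(dtype=object)
            part = (v[:, None] != v[None, :]).astype(float)
        part[one_missing] = 1.0     # NaN vs any number or string
        part[both_missing] = 0.0    # NaN vs NaN
        total += part
    return pd.DataFrame(np.sqrt(total), index=df.index, columns=df.index)

# Sample data from the question
sample = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])

dist = nan_aware_distance(sample, ['a'], ['b'])
print(dist.round(3))
```

Because the masks are computed per column, a row with NaN in one column and a real value in another is handled without special cases. Memory use grows as n², so this suits small to medium row counts.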

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/52587239/
