python - Problem handling NaN when computing distances?

Tags: python python-3.x pandas numpy

I have a DataFrame like the one below (simplified), with the point names as the index:

import numpy as np
import pandas as pd
# np.nan is used throughout: the np.NaN alias was removed in NumPy 2.0
a = {'a' : [0.6,0.7,0.4,np.nan,0.5,0.4,0.5,np.nan],'b':['cat','bat','cat','cat','bat',np.nan,'bat',np.nan]}
df = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
df

Because it contains NaN, I want the columns to be treated as numeric where possible, so I do the following:

for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s
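As a quick check of what that loop does to this particular frame, here is a minimal sketch. `pd.to_numeric(..., errors='coerce')` turns every value it cannot parse into NaN, so a purely textual column such as `b` comes back all NaN and the `notnull().any()` guard leaves it unchanged. The variable names `coerced_b`, `all_nan`, `a_is_numeric` and `b_is_numeric` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])

# Coercing the text column: nothing in it parses as a number.
coerced_b = pd.to_numeric(df['b'], errors='coerce')
all_nan = bool(coerced_b.isna().all())

# So after the loop 'a' is still numeric and 'b' is still non-numeric.
a_is_numeric = pd.api.types.is_numeric_dtype(df['a'])
b_is_numeric = pd.api.types.is_numeric_dtype(df['b'])
print(all_nan, a_is_numeric, b_is_numeric)
```

In other words, for this data the loop is effectively a no-op, and `b` is still handled as a categorical column by the distance function.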

After converting the columns to a numeric type, I want to compute the distance matrix as follows:

def distmetric(x,y):
    numeric5=x.select_dtypes(include=["number"])
    others5=x.select_dtypes(exclude=["number"])
    numeric6=y.select_dtypes(include=["number"])
    others6=y.select_dtypes(exclude=["number"])
    numnp5=numeric5.values
    catnp5=others5.values
    numnp6=numeric6.values
    catnp6=others6.values
    result3=np.around((np.repeat(numnp5, len(numnp6),axis=0) - np.tile(numnp6,(len(numnp5),1)))**2,3)
    catres3=~(np.equal((np.repeat(catnp5,len(catnp6),axis=0)),(np.tile(catnp6,(len(catnp5),1)))))
    sumtogeth3=result3.sum(axis=1)
    sumcattoget3=catres3.sum(axis=1)
    sum_result3=sumtogeth3+sumcattoget3
    final_result3=np.around(np.sqrt(sum_result3),3)
    final_result20=np.reshape(final_result3, (len(x.index),len(y.index)))
    return final_result20

metric=distmetric(df,df)
print(metric)

The distance matrix I get is:

[[0.    1.005 0.2     nan 1.005 1.02  1.005   nan]
 [1.005 0.    1.044   nan 0.2   1.044 0.2     nan]
 [0.2   1.044 0.      nan 1.005 1.    1.005   nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [1.02  1.044 1.      nan 1.005 1.    1.005   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]]
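The NaN rows and columns, and the 1. at position (x6, x6), follow from how floating-point NaN behaves: any arithmetic involving NaN yields NaN, and NaN compares unequal to everything, itself included. A small sketch of both effects, independent of the DataFrame above (all variable names are illustrative):

```python
import numpy as np

# Arithmetic: a NaN operand makes the squared difference NaN ...
diff_sq = (np.nan - 0.4) ** 2
# ... and a single NaN term is enough to make a whole row sum NaN.
row_sum = np.array([diff_sq, 0.0]).sum()

# Comparison: NaN is not equal to itself, so two missing categories
# are counted as a mismatch rather than a match.
cats = np.array(['cat', np.nan], dtype=object)
mismatch = cats != cats  # elementwise: 'cat' vs 'cat', then NaN vs NaN

print(diff_sq, row_sum, mismatch)
```

Any NaN-aware distance therefore has to test for missing values explicitly instead of relying on subtraction or equality.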

I would like to get the following output instead:

      x1     x2     x3     x4     x5     x6     x7     x8
x1    0.0    1.005  0.2    1.0    1.005  1.02   1.005  1.414
x2    1.005  0.0    1.044  1.414  0.2    1.044  0.2    1.414
x3    0.2    1.044  0.0    1.0    1.005  1.0    1.005  1.414
x4    1.0    1.414  1.0    0.0    1.414  1.414  1.414  1.0
x5    1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x6    1.02   1.044  1.0    1.414  1.005  0.0    1.005  1.0
x7    1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x8    1.414  1.414  1.414  1.0    1.414  1.0    1.414  0.0

The distance between two NaN values should be 0, while the distance between NaN and any number or any string should be 1. Is there a way to do this?

Edit: I compute the distance in the following way:

for each row:
     if col is numerical:
         then calculate ((x1 element)-(x2 element))**2 and store this value in squareresult
     if col is categorical:
         then compare x1 element and x2 element.
         if they are equal then cateresult=0
         else cateresult=1
     totaldistanceresultforrow=sqrt(squareresult+cateresult)

Note: NaN-NaN=0 and NaN-(any number or string)=1 (here "-" denotes subtraction).
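The rule described above can be written out for a single pair of rows. The following is only a reference sketch of that rule, not code from the question: the helper name `pair_distance` and its `is_numeric` argument are hypothetical, and rows are passed as plain tuples to keep the example small.

```python
import math
import numpy as np
import pandas as pd

def pair_distance(x, y, is_numeric):
    """Distance between two rows under the NaN rule from the question.

    x, y: sequences of values, one per column.
    is_numeric: one flag per column, True for numerical columns.
    """
    total = 0.0
    for u, v, numeric in zip(x, y, is_numeric):
        u_missing, v_missing = pd.isna(u), pd.isna(v)
        if u_missing and v_missing:
            continue                  # NaN vs NaN contributes 0
        if u_missing or v_missing:
            total += 1.0              # NaN vs any value contributes 1
        elif numeric:
            total += (u - v) ** 2     # squared difference
        else:
            total += float(u != v)    # 0 if equal, 1 otherwise
    return math.sqrt(total)

# Rows x1, x4 and x8 from the example frame (columns 'a' and 'b').
x1, x4, x8 = (0.6, 'cat'), (np.nan, 'cat'), (np.nan, np.nan)
kinds = (True, False)
d14 = round(pair_distance(x1, x4, kinds), 3)
d18 = round(pair_distance(x1, x8, kinds), 3)
d48 = round(pair_distance(x4, x8, kinds), 3)
print(d14, d18, d48)
```

A double loop over all rows with this function is slow for large frames, but it is a convenient baseline for checking a vectorized version.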

Best answer

This helped me:

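# Pairwise squared differences of the numeric column (NaN wherever either value is NaN).
# Index the NumPy array, not the Series: recent pandas rejects df['a'][:, None].
a = df['a'].values
square_res = (a - a[:, None]) ** 2
numeric = pd.DataFrame(square_res)
# A column that is entirely NaN belongs to a row whose 'a' value is NaN
idx = numeric.isnull().all()
alltrueindices = np.where(idx)

# NaN vs NaN counts as 0 ...
for index in alltrueindices:
    numeric.loc[index, index] = 0
# ... and NaN vs any number counts as 1
numeric = numeric.fillna(1)
# Give missing categories a shared placeholder so that NaN vs NaN compares equal
df['b'] = df['b'].replace(np.nan, '?')
b = df['b'].values
cat_res = (b != b[:, None])
res = (numeric + cat_res) ** .5

print(res.round(3))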
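The same idea can be generalized to any number of columns and made to return a matrix labelled with the original index. This is a sketch under assumptions rather than part of the answer: the function name `nan_aware_distance` is hypothetical, and it decides numeric versus categorical per column by dtype.

```python
import numpy as np
import pandas as pd

def nan_aware_distance(df):
    """Pairwise row distances with NaN-NaN = 0 and NaN-value = 1 per column."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        s = df[col]
        missing = s.isna().to_numpy()
        both = missing[:, None] & missing[None, :]
        either = missing[:, None] | missing[None, :]
        if pd.api.types.is_numeric_dtype(s):
            v = s.to_numpy(dtype=float)
            part = (v[:, None] - v[None, :]) ** 2          # squared difference
        else:
            v = s.to_numpy(dtype=object)
            part = (v[:, None] != v[None, :]).astype(float)  # 0 equal, 1 different
        part = np.where(either, 1.0, part)  # a missing value on one side -> 1
        part = np.where(both, 0.0, part)    # missing on both sides -> 0
        total += part
    return pd.DataFrame(np.sqrt(total), index=df.index, columns=df.index)

df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
result = nan_aware_distance(df).round(3)
print(result)
```

Unlike the snippet above, this version does not modify `df` and does not need a placeholder string, because missing values are handled through boolean masks.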

Regarding "python - Problem handling NaN when computing distances?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52587239/
