I have a DataFrame like the one below (simplified), with the points as the index:
import numpy as np
import pandas as pd
a = {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan], 'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]}
df = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
df
Because it contains NaN, I want to treat such columns as numeric, so I do the following:
for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s
After converting the columns to numeric types, I want to compute the distance matrix as follows:
def distmetric(x, y):
    numeric5 = x.select_dtypes(include=["number"])
    others5 = x.select_dtypes(exclude=["number"])
    numeric6 = y.select_dtypes(include=["number"])
    others6 = y.select_dtypes(exclude=["number"])
    numnp5 = numeric5.values
    catnp5 = others5.values
    numnp6 = numeric6.values
    catnp6 = others6.values
    result3 = np.around((np.repeat(numnp5, len(numnp6), axis=0) - np.tile(numnp6, (len(numnp5), 1)))**2, 3)
    catres3 = ~(np.equal((np.repeat(catnp5, len(catnp6), axis=0)), (np.tile(catnp6, (len(catnp5), 1)))))
    sumtogeth3 = result3.sum(axis=1)
    sumcattoget3 = catres3.sum(axis=1)
    sum_result3 = sumtogeth3 + sumcattoget3
    final_result3 = np.around(np.sqrt(sum_result3), 3)
    final_result20 = np.reshape(final_result3, (len(x.index), len(y.index)))
    return final_result20

metric = distmetric(df, df)
print(metric)
The distance matrix I get is:
[[0. 1.005 0.2 nan 1.005 1.02 1.005 nan]
[1.005 0. 1.044 nan 0.2 1.044 0.2 nan]
[0.2 1.044 0. nan 1.005 1. 1.005 nan]
[ nan nan nan nan nan nan nan nan]
[1.005 0.2 1.005 nan 0. 1.005 0. nan]
[1.02 1.044 1. nan 1.005 1. 1.005 nan]
[1.005 0.2 1.005 nan 0. 1.005 0. nan]
[ nan nan nan nan nan nan nan nan]]
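The nan entries in this matrix follow from how NaN behaves in floating-point arithmetic and comparison: any arithmetic involving NaN yields NaN, and NaN never compares equal to anything, itself included. The 1. on the diagonal for x6 is consistent with the second property. A minimal sketch of both effects, using a two-element array chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: one ordinary value and one missing value.
a = np.array([0.4, np.nan])

# Arithmetic: every pairwise difference that touches NaN is NaN,
# so any sum over that row of differences is NaN as well.
sq_diff = (a - a[:, None]) ** 2
nan_in_diff = np.isnan(sq_diff)

# Comparison: NaN is not equal to anything, itself included.
nan_equals_itself = (np.nan == np.nan)

# Missing values therefore have to be detected explicitly, e.g. with pd.isna.
missing = pd.isna(a)

print(nan_in_diff)
print(nan_equals_itself)
print(missing)
```

This is why the NaN cases need explicit handling rather than plain subtraction or equality tests.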
I want to get the following output:
       x1     x2     x3     x4     x5     x6     x7     x8
x1    0.0  1.005    0.2    1.0  1.005   1.02  1.005  1.414
x2  1.005    0.0  1.044  1.414    0.2  1.044    0.2  1.414
x3    0.2  1.044    0.0    1.0  1.005    1.0  1.005  1.414
x4    1.0  1.414    1.0    0.0  1.414  1.414  1.414    1.0
x5  1.005    0.2  1.005  1.414    0.0  1.005    0.0  1.414
x6   1.02  1.044    1.0  1.414  1.005    0.0  1.005    1.0
x7  1.005    0.2  1.005  1.414    0.0  1.005    0.0  1.414
x8  1.414  1.414  1.414    1.0  1.414    1.0  1.414    0.0
I want the distance between two NaN values to be 0, and the distance between NaN and any number or any string to be 1. Is there a way to do this?
Edit: I compute the distance in the following form:
for each row:
    if col is numerical:
        squareresult = ((x1 element) - (x2 element))**2
    if col is categorical:
        compare x1 element and x2 element
        if they are equal then cateresult = 0
        else cateresult = 1
    totaldistanceresultforrow = sqrt(squareresult + cateresult)
Note: NaN - NaN = 0 and NaN - any number or string = 1 (here "-" is subtraction).
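The rule described above can be written out as a plain loop over row pairs. This is only a reference sketch, not a fast implementation: `pair_distance` and its parameters are hypothetical names introduced here, and the column lists `['a']` and `['b']` are assumed from the example DataFrame.

```python
import numpy as np
import pandas as pd


def pair_distance(row_x, row_y, numeric_cols, categorical_cols):
    """Distance between two rows with NaN-NaN = 0 and NaN-value = 1 (hypothetical helper)."""
    total = 0.0
    for col in numeric_cols:
        u, v = row_x[col], row_y[col]
        if pd.isna(u) and pd.isna(v):
            total += 0.0              # NaN vs NaN contributes 0
        elif pd.isna(u) or pd.isna(v):
            total += 1.0              # NaN vs a number contributes 1
        else:
            total += (u - v) ** 2     # ordinary squared difference
    for col in categorical_cols:
        u, v = row_x[col], row_y[col]
        if pd.isna(u) and pd.isna(v):
            total += 0.0              # NaN vs NaN contributes 0
        elif pd.isna(u) or pd.isna(v) or u != v:
            total += 1.0              # NaN vs a string, or two different strings
    return np.sqrt(total)


df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])

# Evaluate every pair of rows and keep the index labels on both axes.
dist = pd.DataFrame(
    [[pair_distance(df.loc[i], df.loc[j], ['a'], ['b']) for j in df.index]
     for i in df.index],
    index=df.index, columns=df.index).round(3)
print(dist)
```

The loop costs one Python call per pair of rows, so it is mainly useful for checking a vectorized version on small data.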
Best answer
This worked for me:
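# Pairwise squared differences on column 'a'; any pair involving NaN gives NaN.
# Broadcast on the NumPy array, since indexing a Series with [:, None] is not supported in recent pandas.
a_vals = df['a'].to_numpy()
square_res = (a_vals - a_vals[:, None]) ** 2
numeric = pd.DataFrame(square_res)
# Columns that are entirely NaN correspond to rows where 'a' itself is NaN.
idx = numeric.isnull().all()
alltrueindices = np.where(idx)
# NaN vs NaN -> 0
for index in alltrueindices:
    numeric.loc[index, index] = 0
# NaN vs a number -> 1
numeric = numeric.fillna(1)
# Replace NaN in 'b' with a placeholder so that two missing values compare equal.
df['b'] = df['b'].replace(np.nan, '?')
b_vals = df['b'].to_numpy()
cat_res = (b_vals != b_vals[:, None])
res = (numeric + cat_res) ** .5
print(res.round(3))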
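The same idea can be wrapped in a function that handles several columns and keeps the x1..x8 labels requested in the question. This is a sketch under assumptions, not part of the original answer: `nan_aware_distance` is a hypothetical name, columns are assumed to be ordinary NumPy-backed float or object columns, and `pd.factorize` is swapped in for the '?' placeholder so the input DataFrame is not modified.

```python
import numpy as np
import pandas as pd


def nan_aware_distance(frame):
    """Pairwise distances with NaN-NaN = 0 and NaN-value = 1 (hypothetical helper)."""
    n = len(frame)
    total = np.zeros((n, n))
    for col in frame.columns:
        missing = frame[col].isna().to_numpy()
        both_missing = missing[:, None] & missing[None, :]
        one_missing = missing[:, None] ^ missing[None, :]
        if pd.api.types.is_numeric_dtype(frame[col]):
            values = frame[col].to_numpy(dtype=float)
            term = (values[:, None] - values[None, :]) ** 2
        else:
            # Integer codes per category; missing values are coded as -1.
            codes, _ = pd.factorize(frame[col])
            term = (codes[:, None] != codes[None, :]).astype(float)
        term = np.where(one_missing, 1.0, term)    # NaN vs value -> 1
        term = np.where(both_missing, 0.0, term)   # NaN vs NaN -> 0
        total += term
    return pd.DataFrame(np.sqrt(total), index=frame.index, columns=frame.index)


df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])

labeled = nan_aware_distance(df).round(3)
print(labeled)
```

Building the two boolean masks per column keeps the NaN rule in one place, so the numeric and categorical branches only have to compute the ordinary case.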
Regarding "python - Problem handling NaN in distance calculation?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52587239/