python - Text clustering with an arbitrary metric using sklearn kmeans

Tags: python cluster-analysis k-means cosine-similarity

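I am running text clustering on a table of medical terms. I want to cluster strings that contain similar words, so that two strings sharing two or more words are more likely to end up in the same cluster than two strings sharing only one word.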

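I have tried many techniques but got no useful results. I first tried Levenshtein distance with both kmeans and AgglomerativeClustering (with three linkage methods: ward, complete and average). The results were poor, and the metric grouped words that merely share some letters, such as "dog" and "door".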
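To illustrate why a character-level edit distance groups words like "dog" and "door", here is a minimal sketch comparing Levenshtein distance with a word-level overlap count. The helper names `levenshtein` and `shared_tokens` are hypothetical and written only for this illustration; they are not taken from the original code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def shared_tokens(a, b):
    """Number of whole words two strings have in common (case-insensitive)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


# Character-level distance is small even though the words are unrelated.
print(levenshtein("dog", "door"))
# Word-level overlap separates unrelated words from related phrases.
print(shared_tokens("dog", "door"))
print(shared_tokens("Lower GI bleeding", "GI bleeding"))
```

A word-level measure of this kind is closer to the stated goal, where strings sharing more words should be more likely to cluster together.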

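I then changed the distance metric: I computed TF-IDF vectors, ran cosine similarity on them, and converted similarity to distance by subtracting each value from 1 (distance = 1 - similarity). I did this because the method from the wiki, 2*acosine(similarity), returned nan values!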
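One plausible cause of the nan values (an assumption, since the original computation is not shown): floating-point rounding can push a cosine similarity slightly above 1, and the arccosine of such a value is undefined. A minimal sketch of clipping before taking the angle, using made-up similarity values:

```python
import numpy as np

# A similarity that should be exactly 1 but has drifted by rounding error.
similarity = np.array([1.0000000000000002, 0.5, 0.0])

with np.errstate(invalid="ignore"):      # silence the expected warning
    unclipped = np.arccos(similarity)    # first entry becomes nan

# Clip into the valid domain of arccos before converting to an angle.
clipped = np.arccos(np.clip(similarity, -1.0, 1.0))

print(unclipped)
print(clipped)
```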

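Anyway, with this distance metric I tried both algorithms again. Overall it returned good clusters, except for one huge cluster whose members have no words in common! No matter how I change the number of clusters, this huge cluster still appears, even when I choose a large k (close to n, the length of the input). It usually shows up near the beginning, as cluster 0, 1, 2 or 3. Why does this happen? What am I doing wrong? My dataset has more than 5000 entries. Here is part of the cluster output.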

 cluster no 0:['Prolonged INR', 'Prolonged PTT', 'Prolonged QT Interval']
 cluster no 1:['GI bleeding', 'Gastrointestinal (GI) Bleeding', 'Lower GI bleeding']
 cluster no 2:['ACS', 'Acetazolamide', 'Achondroplasia', 'Acrocyanosis', 'Acromegaly', 'Adenoidectomy', 'Adenomyosis', 'Afebrile', 'Antihistamine', 'Apheresis', 'Aplasia', 'Argatroban', 'Arthralgia', 'Arthrocentesis', 'Arthrography', 'Arthroplasty', 'Asbestosis', 'Ascorbate', 'Asian', 'Asterixis', 'Astigmatism', 'Astrocytoma', 'Asymptomatic', 'Atelectasis', 'Atherosclerosis', 'Atropine', 'Audiogram', 'Autonomic Dysreflexia', 'Autopsy', 'Bacteremia', 'Balanitis', 'Balanoposthitis', 'Breastfeeding', 'Breech Presentation', 'Bronchiectasis', 'Bronchiolitis', 'Bronchospasm', 'Cachexia', 'Caf� Au Lait Spot', 'Calcaneovalgus', 'Chalazion', 'Chemistry Panels', 'Chills', 'Cholelithiasis', 'Cholera', 'Chondroblastoma', 'Chondrosarcoma', 'Chorioamnionitis', 'Chorionic Villus Sampling (CVS)', 'Choroid Plexus Papilloma (CPP)', 'Circumcision', 'Citrate', 'Claudication', 'Clonus', 'Coccidioidomycosis', 'Coccygodynia', 'Costochondritis', 'Craniectomy', 'Craniofacial Anomalies', 'Craniopharyngioma', 'Craniosynostosis', 'Craniotomy', 'Cri du Chat', 'Croup', 'Cryofibrinogen', 'Cryoglobulin', 'Cyclophosphamide', 'Cystometry', 'D-Dimer', 'Dacryocystitis', 'Dacryocystorhinostomy (DCR)', 'Dacryostenosis', 'Dantrolene', 'Deformational Plagiocephaly', 'Delusions', 'Demeclocycline', 'Dentures', 'Dermabrasion', 'Deviated Septum', 'Electrolytes', 'Electronystagmography (ENG)', 'Embolectomy', 'Emmetropia', 'Empyema', 'Enchondroma', 'Encopresis', 'Enterovirus', 'Ependymoma', 'Epididymitis', 'Epirubicin', 'Episiotomy', 'Epispadias', 'Eribulin', 'Erythroderma', 'Esophagectomy', 'Essential Tremor', 'Foraminotomy', 'Frostnip/Frostbite', 'Gallstones', 'Gastritis', 'Gastrojejunostomy', 'Gastroschisis', 'Giardiasis', 'Gingivitis', 'Gingivostomatitis', 'Glaucoma', 'Gliomas', 'Glomerulonephritis', 'Glomerulosclerosis', 'Group B Streptococcus', 'Herpangina', 'Hiccups', 'Hidradenitis Suppurativa', 'Hirsutism', 'Hookworm', 'Hordeolum (Stye)', 'Hydatidiform Mole', 'Hydration', 'Hydrocelectomy', 'Hydrops 
Fetalis', 'Hyperbilirubinemia', 'Hyperlipidemia', 'Hyperopia', 'Hyperphosphatemia', 'Hyperreflexia', 'Hypnosis', 'Hypoparathyroidism', 'Hypopituitarism', 'Hypovolemia', 'Hypoxia', 'Hysterosalpingogram (HSG)', 'Hysteroscopy', 'Intussusception', 'Irritability', 'Isoproterenol', 'Ixabepilone', 'Jewish', 'Karyotype', 'Keratoconus', 'Ketonemia', 'Ketonuria', 'Kyphoplasty', 'Kyphosis', 'Labyrinthitis', 'Lactulose', 'Laminectomy', 'Laminotomy', 'Lapatinib', 'Laryngectomy', 'Laryngitis', 'Laryngomalacia', 'Laryngoscopy', 'Laxative', 'Lymphadenitis', 'Lymphangitis', 'Lymphocele', 'Malaise', 'Malaria', 'Malocclusion', 'Mammography', 'Mannitol', 'Mastalgia', 'Mastectomy', 'Mastitis', 'Mastoidectomy', 'Mastopexy', 'Mediastinoscopy', 'Megaureter', 'Melena', 'Meningioma', 'Menopause', 'Menorrhagia', 'Menstruation', 'Metatarsalgia', 'Metatarsus Adductus', 'Metoclopramide', 'Neomycin', 'Nephrectomy', 'Nephrolithiasis', 'Neuromyelitis Optica', 'Neurosonography', 'Neurosurgery', 'Nocturnal Enuresis', 'Norovirus', 'Pericardectomy', 'Perimenopause', 'Periventricular Leukomalacia', 'Pertuzumab', 'Phimosis', 'Phobia', 'Photorefractive Keratectomy (PRK)', 'Phytophotodermatitis', 'Pilomatrixoma', 'Pinworms', 'Pityriasis Rosea', 'Plain radiograph', 'Platelets', 'Pleurisy', 'Pneumococcus', 'Pneumoconiosis', 'Pneumonectomy', 'Psychosis', 'Pterygium', 'Ptosis', 'Pulpitis (Toothache)', 'Pyeloplasty', 'Quantitative Immunoglobulins', 'Rabies', 'Rales', 'Red wale marks', 'Refractive Error', 'Smallpox', 'Smoking Cessation', 'Snoring', 'Sonohysterography', 'Spasmodic Dysphonia', 'Spina Bifida', 'Terlipressin', 'Tetany', 'Thoracotomy', 'Thrombocythemia', 'Thrombophilia', 'Thrombophlebitis', 'Thyroidectomy', 'Tinnitus', 'Tonsillar enlargement', 'Torn Annulus', 'Toxoplasmosis', 'Trabeculectomy', 'Ureterolysis', 'Ureteroplasty', 'Ureterosigmoidostomy', 'Urethritis', 'Urethroplasty', 'Uroflowmetry', 'Urostomy', 'Urticaria (Hives)', 'Uvulitis', 'Uvulopalatopharyngoplasty (UPPP)', 'Valsalva Maneuver', 
'Varicella (Chickenpox)', 'Vasculitis', 'Vasopressin', 'Vasopressor', 'Venography', 'Ventriculostomy', 'Vertebroplasty', 'Vesicoureteral Reflux (VUR)', 'Osteochondritis Dissecans (OCD)', 'Osteochondroma', 'Osteogenesis Imperfecta (OI)', 'Osteopenia', 'Osteophyte formation', 'Osteosarcoma', 'Overuse Injuries', 'Overweight', 'Pallister Killian', 'Pallor', 'Palpitation', 'Palpitations', 'Paraesthesia', 'Paranoia', 'Paraphimosis', 'Parasomnias', 'Parathyroidectomy', 'Paronychia', 'Parotidectomy', 'Peaked T waves', 'Pemphigus Vulgaris', 'Lepirudin', 'Lethargy', 'Letrozole', 'Lichen Planus', 'Liposarcoma', 'Listeriosis', 'Living will', 'Lordosis', 'Excessive urination', 'Exemestane', 'Exploratory Laparotomy', 'Facelift (Rhytidectomy)', 'Fainting', 'Fibrinogen', 'Fibromyalgia', 'Fluorouracil', 'Folliculitis', 'Fondaparinux', 'Bedbound', 'Bedrest', 'Bevacizumab', 'BiPAP', 'Biloma', 'Birthmark', 'Bisphosphonate', 'Bivalirudin', 'Blepharitis', 'Blepharoplasty', 'Blindness', 'Blister', 'Bloodborne Pathogens', 'Allopurinol', 'Alopecia', 'Amblyopia', 'Amenorrhea', 'Amniocentesis', 'Anastrozole', 'Anencephaly', 'Angiodysplasia', 'Angioembolization', 'Ankyloglossia', 'Ankylosing Spondylitis', 'Haptoglobin', 'HbA1C', 'Heatstroke', 'Height', 'Heliox', 'Hematemesis', 'Hematochezia', 'Hematocrit', 'Hematology', 'Hemifacial Microsomia', 'Hemochromatosis', 'Hemoglobinuria', 'Hemophagocytic Lymphohistiocytosis (HLH)', 'Hemothorax', 'Hepatoblastoma', 'Hepatomegaly', 'Hepatosplenomegaly', 'Hepatotoxicity', 'Her2neu', 'IgG Deficiencies', 'Ileostomy', 'Impetigo', 'Improving', 'Impulsiveness', 'Incontinentia Pigmenti', 'Restlessness', 'Retinitis Pigmentosa', 'Retinoblastoma', 'Reversible Dementias', 'Rhabdomyosarcoma', 'Rhinoplasty', 'Rifaximin', 'Rosacea', 'Roseola', 'STEMI', 'Sacroiliitis', 'Scabies', 'Schistocytes', 'Sciatica', 'Scleral Buckling', 'Scleroderma', 'Sclerotherapy', 'Scotoma', 'Selective Mutism', 'Digitalization', 'Dihydroergotamine', 'Discogram', 'Dislocations', 
'Disorientation', 'Diverticulosis', 'Docetaxel', 'Domperidone', 'Dopamine', 'Doxorubicin', 'Drooling', 'Drowsiness', 'Duodenitis', "Dupuytren's Contracture", 'Dyskeratosis Congenita', 'Dyslipidemia', 'Dysmenorrhea', 'Dysphasia', 'Dyssomnias', 'Dysthymia', 'Dysuria', 'ESR', 'Eclampsia', 'Ectropion (Eublepharon)', 'Ehrlichiosis', 'Translocations', 'Transverse Myelitis', 'Trastuzumab', 'Trigeminal Neuralgia', 'Tympanoplasty', 'Unconscious', 'Underweight', 'Undescended Testes (Cryptorchidism)', 'Ureter obstructed', 'Colchicine', 'Coldness', 'Colectomy', 'Coloboma', 'Colostomy', 'Colposcopy', 'Comfort Measures Only (CMO)', 'Comorbid conditions', 'Compromised local circulation', 'Conivaptan', 'Constipation', 'Continence', 'Cor Pulmonale', 'Splinters', 'Spondylolisthesis', 'Spondylolysis', 'Stapedectomy', 'Steroid', 'Stillbirth', 'Stomatitis', 'Strabismus (Crossed Eyes)', 'Stridor', 'Stupor', 'Suicide plan', 'Sunburn', 'Suprasternal retractions', 'Sympathectomy', 'Tapeworm', 'Tattoo', 'Tau/A Beta42', 'Teething', 'Telangiectasias', 'Temper Tantrum', 'Temporal Arteritis', 'Microbiology', 'Microcephaly', 'Microdiskectomy', 'Micropenis', 'Midodrine', 'Miscarriage', 'Modified duke criteria', 'Molluscum Contagiosum', 'Monoamniotic twins', 'Mosaicism', 'Motorcycle accident', 'Myalgias', 'Myasthenia Gravis', 'Myelogram', 'Myoclonus', 'Myoglobinuria', 'Myopia', 'Myositis', 'Myxedema', 'NSAID', 'Narcolepsy', 'Nausea', 'Poliomyelitis', 'Poly-pharmacy', 'Polyhydramnios (Hydramnios)', 'Polymyalgia Rheumatica', 'Polymyositis', 'Postictal State', 'Presbycusis', 'Presbyopia', 'Presyncope', 'Proctectomy', 'Proctocolectomy', 'Pruritis Ani', 'Pseudotumor Cerebri', 'Vinorelbine', 'Vitrectomy', 'Voiding Cystourethrogram (VCUG)', 'Vomit', 'Vulvitis', "Wegener's Granulomatosis", 'Whiplash', 'Widening QRS', 'Wrinkles', 'X-linked Agammaglobulinemia', 'YAG Capsulotomy', 'Yersiniosis', 'caffeine', 'coagulopathy', 'dexamethasone', 'Infliximab', 'Insomnia', 'Insulinoma', 'Intravenous contrast 
extravasation', 'Obtundation', 'Octreotide', 'Odynophagia', 'Oligodendroglioma', 'Oligohydramnios', 'Oliguria', 'Omphalocele', 'Onychomycosis', 'Oophorectomy', 'Orchiectomy', 'Orchitis', 'Orthopnea', 'Carboplatin', 'Cardiomegaly', 'Cataracts', 'Cecostomy', 'Cephalopelvic Disproportion (CPD)']
 cluster no 3:['Brain Malignancy', 'Brain metastasis']
 cluster no 4:['Pubic Lice', 'Lice', 'Head Lice']
 cluster no 5:['Assistive, Adaptive, Supportive or Protective Device Fitting', 'Gait Training Using an Assistive Device', 'Unsteady gait']
 cluster no 6:['Removal of Soft Tissue Foreign Body', 'Soft Tissue Foreign Body']
 cluster no 7:['Necrotizing pneumonia', 'Pneumocystis Pneumonia', 'Pneumocystis pneumonia', 'Pneumonia', 'Pneumonia', 'Mycoplasma Pneumonia', 'Walking Pneumonia']
 cluster no 8:['Esophageal Atresia', 'Esophageal Dilation', 'Esophageal Manometry', 'Esophageal ring/web', 'Esophageal stricture']
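
One possible explanation for the giant cluster, offered as a hypothesis rather than a confirmed diagnosis: most strings in cluster 2 are single terms that share no token with any other string, so their cosine distance to every other string is exactly 1. Their rows in the distance matrix are then almost identical, and an algorithm that treats those rows as feature vectors will tend to lump them together. The sketch below counts such isolated strings; `sample` is a small hypothetical list assembled from terms in the output above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample drawn from the cluster output shown above.
sample = ['Prolonged INR', 'Prolonged PTT', 'GI bleeding', 'Lower GI bleeding',
          'Acromegaly', 'Bronchospasm', 'Cachexia', 'Glaucoma']

sim = cosine_similarity(TfidfVectorizer().fit_transform(sample))
np.fill_diagonal(sim, 0.0)  # ignore each string's similarity to itself

# Strings whose best match among all other strings has similarity 0.
isolated = [s for s, row in zip(sample, sim) if row.max() == 0.0]
print(isolated)
```

Running the same check on the full list would show how many of the 5000+ entries have no neighbor at all under this representation.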

What am I doing wrong? Is my technique wrong here? Here is my code. I use the sklearn package so that I can easily switch to other techniques:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import pprint

my_list = ['Cervical Cryotherapy', 'Cervical Disk Replacement Surgery', 'Cervical Disk Rupture', 'Cervical Disk Surgery', 'Cervical Epidural Injection', 'Cervical Fracture (exclude uncomplicated compression fractures)', 'Cervical Insufficiency (Cervical Incompetence)', 'Cervical Neck Brace', 'Cervical Radiculopathy', 'Cervical Spinal Fusion', 'Cervical Spine Disorder', 'Cervical Spondylosis', 'Cervical Subluxation', 'Cervical dilation', 'Cervical dislocation', 'Cervical effacement', 'Cervical ripening procedure', 'Cervicitis', 'Cervicitis (Non-STD)', 'Cervicitis (STD)', 'Cervix', 'Cervix closed', 'Cesarean Section (C-Section)', 'Cesarean section procedure', 'Chagas Disease']

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(my_list)
#print (tfidf_matrix.shape)

k=len(my_list)
dist = np.zeros((k,k))

for i in range(k):
    # row i: cosine similarity between string i and every string
    dist[i] = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)
#print(dist)

dist1 = np.subtract(np.ones((k,k),dtype=float), dist) ## convert to distance
#print(dist1)

data2=np.asarray(dist1)
arr_3d = data2.reshape((1,k,k))

#print(arr_3d)


for i in range(len(arr_3d)):

  km = KMeans(n_clusters=5, init='k-means++')
  km = km.fit(arr_3d[i])

  centers = km.cluster_centers_
  labels = km.labels_

  print (labels)
  print(type(labels))

Groups = {}
for element, label in zip(my_list, labels):
    print('element', element)
    print('label', label)

    # group the strings by their cluster label
    try:
        Groups[str(label)].append(element)
    except KeyError:
        Groups[str(label)] = [element]

pprint.pprint(Groups)

Edit: I now use plain cosine similarity only and hit the same problem, a big cluster of unrelated words, so this is not a tf-idf issue!

import re
import math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # cosine similarity between two word-count vectors (Counter objects)
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    # split the text into word tokens and count them
    words = WORD.findall(text)
    return Counter(words)


k=len(my_list)

data1 = np.zeros((k,k))

for i,string1 in enumerate(my_list):
   for j,string2 in enumerate(my_list):
        data1[i][j] = 1-get_cosine(text_to_vector(string1), text_to_vector(string2))

print(data1)
k=len(my_list)
data2=np.asarray(data1)
arr_3d = data2.reshape((1,k,k))

Edit: I ran LSA instead of TF-IDF, since it is supposed to work for short texts, but I got very, very bad results! Mismatched clusters:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

vectorizer = CountVectorizer(min_df = 1, stop_words = 'english')
dtm = vectorizer.fit_transform(my_list)

lsa = TruncatedSVD(2, algorithm = 'arpack')
dtm_lsa = lsa.fit_transform(dtm)
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
similarity = dtm_lsa @ dtm_lsa.T  # cosine similarity of the unit-length rows
#print(1-similarity)
k=len(my_list)
dist1 = np.subtract(np.ones((k,k),dtype=float), similarity)
#dist1.astype(float)
print(dist1)

Best answer

k-means is based on variance minimization

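It minimizes the sum of squared deviations (x[i]-center[i])**2 over every object x, every dimension i, and the best (least-cost) center. It cannot minimize arbitrary distances (there are many, many questions on this site about exactly this).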
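
To make the point concrete, here is a small numeric sketch (the data values are arbitrary): the arithmetic mean is the center that minimizes the sum of squared deviations, but it is generally not the minimizer for other distances. Under the sum of absolute deviations, for example, the median does better.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # arbitrary 1-D data with an outlier


def sum_squared(center):
    """Sum of squared deviations from a candidate center."""
    return np.sum((x - center) ** 2)


def sum_absolute(center):
    """Sum of absolute deviations from a candidate center."""
    return np.sum(np.abs(x - center))


mean, median = x.mean(), np.median(x)

# The mean wins under squared deviations (the k-means objective) ...
print(sum_squared(mean), sum_squared(median))
# ... but the median wins under absolute deviations.
print(sum_absolute(mean), sum_absolute(median))
```

This is why recomputing centers as means, as k-means does, is only guaranteed to improve the objective when that objective is the squared Euclidean deviation.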

There are two fatal problems in your code:

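  • The vectorization that any cosine-based approach needs only works on longer texts, such as news articles. It does not work on tweets or any other short text, because these contain too few useful tokens. As a rule of thumb, you need 100+ words per document.
  • kmeans must be applied to the data matrix, not to a distance matrix. It needs to compute means of the original data (remember, it is called k-means), so it needs the original data matrix. Furthermore, kmeans does not use pairwise distances at all; it only seeks to minimize the point-to-center least-squares deviation.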
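
As a sketch of the second point, assuming the TF-IDF rows themselves are acceptable features: pass the data matrix to `KMeans` directly rather than the k-by-k distance matrix. `TfidfVectorizer` L2-normalizes rows by default, so the squared Euclidean distance between two rows equals 2 - 2 * (cosine similarity), and k-means on this matrix behaves much like clustering by cosine. The short `terms` list and the choice of `n_clusters=3` are illustrative assumptions, and the first point about short texts still applies.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of the terms from the question.
terms = ['Cervical Disk Rupture', 'Cervical Disk Surgery',
         'Cervicitis (STD)', 'Cervicitis (Non-STD)',
         'Cesarean Section (C-Section)', 'Cesarean section procedure']

# Rows are unit-length TF-IDF vectors: this is the data matrix k-means needs.
X = TfidfVectorizer().fit_transform(terms)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)          # fit on the data, not on a distance matrix

# Collect the terms assigned to each cluster label.
groups = defaultdict(list)
for term, label in zip(terms, labels):
    groups[int(label)].append(term)
print(dict(groups))
```

Here the cluster centers live in the same feature space as the data (one coordinate per token), which is what allows them to be computed as means.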

This Q&A on python - text clustering with an arbitrary metric using sklearn kmeans is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37601294/
