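python - How do I convert my csv file into this scikit-learn dataset

Tags: python scikit-learn dataset

Apologies if I'm not using the right terminology here. I have a csv file containing my own data. I first need to convert it into another format so that I can load it into another piece of Python code. An example of the format is shown below; it is a subset of the Iris dataset as loaded by the example: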

from sklearn import datasets
data = datasets.load_iris()


{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]]), 'target': array([0, 0, 0, ... 2, 2, 2]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\n\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n', 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']}
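
For orientation, the object printed above is a scikit-learn Bunch, which behaves like a dictionary whose keys are also readable as attributes. A minimal sketch of inspecting its fields, assuming scikit-learn is installed (the variable name `iris` is only illustrative):

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch supports both dictionary-style and attribute-style access.
print(sorted(iris.keys()))
print(iris.data.shape)          # feature matrix: one row per sample
print(iris.target.shape)        # one integer class label per sample
print(iris["feature_names"])    # same object as iris.feature_names
print(list(iris.target_names))  # class names indexed by the target integers
```

So any replacement dataset mainly needs a 2-D `data` array and a 1-D `target` array of matching length; the other fields are descriptive.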




Ideally I'm looking for a piece of code that generates this format from my csv file.


from numpy import genfromtxt
data = genfromtxt('myfile.csv', delimiter=',')
features = data[:, :3]
targets = data[:, 3]

myfile.csv is just random numbers in 4 columns, with a header and a few rows, for testing only.
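
One possible way to take the genfromtxt attempt further is sketched below. It is not a definitive solution. It assumes the last column holds integer class labels, and it uses an in-memory string (`csv_text`, with made-up values) in place of myfile.csv so the snippet runs on its own. Because the file has a header row, `skip_header=1` is passed; otherwise that row is parsed as NaN.

```python
import io

import numpy as np
from sklearn.utils import Bunch

# Stand-in for myfile.csv: a header row plus a few rows of numbers
# (values invented purely for illustration).
csv_text = """a,b,c,label
0.1,0.2,0.3,0
0.4,0.5,0.6,1
0.7,0.8,0.9,0
"""

# skip_header=1 drops the header row before parsing the numbers.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

features = raw[:, :3]            # first three columns are the features
targets = raw[:, 3].astype(int)  # last column, assumed to be integer class labels

# Wrap the arrays in a Bunch so they can be read as dataset.data / dataset.target.
dataset = Bunch(data=features, target=targets)
print(dataset.data.shape, dataset.target)
```

With a real file, the `io.StringIO(csv_text)` argument would be replaced by the file name.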


OK. With the help of this post, I found a way to do it: How to create my own datasets using in scikit-learn?

My iris.csv file looks like this:

....(150 rows)


import csv

import numpy as np
# Bunch is importable from sklearn.utils; the older sklearn.datasets.base
# path fails on recent scikit-learn releases.
from sklearn.utils import Bunch

def load_my_dataset():
    with open('iris.csv') as csv_file:
        data_file = csv.reader(csv_file)
        temp = next(data_file)  # read and discard the header row
        n_samples = 150 #number of data rows, don't count header
        n_features = 4 #number of columns for features, don't count target column
        feature_names = ['f1','f2','f3','f4'] #adjust accordingly
        target_names = ['t1','t2','t3'] #adjust accordingly
        data = np.empty((n_samples, n_features))
        target = np.empty((n_samples,), dtype=int)  # assumes integer class labels

        for i, sample in enumerate(data_file):
            data[i] = np.asarray(sample[:-1], dtype=np.float64)  # all columns but the last
            target[i] = int(sample[-1])  # last column is the class label

    return Bunch(data=data, target=target, feature_names = feature_names, target_names = target_names)

data = load_my_dataset()


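To use this with your own data, adjust the following:

  • your file name
  • the number of data rows, not counting the header
  • the number of feature columns, not counting the final target column
  • the list of feature names
  • the list of target names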
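
As a variation, most of the values listed above can be inferred from the file itself instead of being hard-coded. The following is only a sketch under assumptions: the helper name `load_csv_as_bunch` is hypothetical, the file is assumed to have a header row, and the last column is assumed to hold integer class labels. An in-memory string stands in for the csv file so the example is self-contained.

```python
import csv
import io

import numpy as np
from sklearn.utils import Bunch


def load_csv_as_bunch(csv_file, target_names=None):
    """Hypothetical helper: header row required, last column = integer label."""
    reader = csv.reader(csv_file)
    header = next(reader)                  # column names from the header row
    rows = [row for row in reader if row]  # skip blank lines
    # Row and column counts follow from the file, so nothing is hard-coded.
    data = np.asarray([row[:-1] for row in rows], dtype=np.float64)
    target = np.asarray([row[-1] for row in rows], dtype=int)
    return Bunch(
        data=data,
        target=target,
        feature_names=header[:-1],
        target_names=target_names,
    )


# Two made-up rows in the same layout as the function expects.
sample = io.StringIO("f1,f2,f3,f4,target\n5.1,3.5,1.4,0.2,0\n6.2,3.4,5.4,2.3,2\n")
bunch = load_csv_as_bunch(sample, target_names=["t1", "t2", "t3"])
print(bunch.data.shape, bunch.target, bunch.feature_names)
```

For a file on disk, the function could be called as `load_csv_as_bunch(open('iris.csv', newline=''))`, ideally inside a `with` block.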
