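python - Unable to load arff dataset with scipy.arff.loadarff

Tags: python scipy

I am trying to download arff datasets from https://cometa.ujaen.es/ (for example https://cometa.ujaen.es/datasets/yahoo_arts ) and load them into Python with scipy.io.arff.loadarff.

However, scipy seems to expect CSV-like rows after the header, and it fails to parse the vast majority of these datasets.

For example, to reproduce the problem: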

from scipy.io.arff import loadarff
import urllib.request

urllib.request.urlretrieve('https://cometa.ujaen.es/public/full/yahoo_arts.arff', 'yahoo_arts.arff')
ds = loadarff('yahoo_arts.arff')

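(In this case I get ValueError: could not convert string to float: '{8 1'.)

Is this expected (i.e. the scipy implementation does not fully conform to the arff format)? Do you know of a workaround or a hand-written parsing function?

Thanks for any help or advice on this topic.

Best answer

You can use the following workaround: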
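
The '{8 1' in the error message is the start of a sparse ARFF data row, written as {index value, ...}. As far as I know, the notes in the scipy documentation state that loadarff cannot read files with sparse data. A minimal sketch for checking which layout a file uses; the helper name uses_sparse_rows and the two sample inputs are made up for illustration:

```python
def uses_sparse_rows(lines):
    """Return True if any data row after @data uses the sparse {index value, ...} form."""
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # keywords are case-insensitive; everything before @data is header
            in_data = line.lower().startswith('@data')
            continue
        if line.startswith('{'):
            return True
    return False


# Hypothetical two-attribute files: one dense row, one sparse row
dense = ['@relation demo', '@attribute a numeric', '@attribute b numeric', '@data', '0,1']
sparse = ['@relation demo', '@attribute a numeric', '@attribute b numeric', '@data', '{1 1}']

print(uses_sparse_rows(dense), uses_sparse_rows(sparse))
```

For a real file, pass the open file object (or the result of readlines()) instead of the sample lists.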

import numpy as np
import pandas as pd


with open('yahoo_arts.arff', 'r') as fp:
    file_content = fp.readlines()


def parse_row(line, len_row):
    # Strip the braces (and trailing newline) of a sparse row: "{index value, ...}"
    line = line.strip().replace('{', '').replace('}', '')

    # Entries that a sparse row does not list are zero
    row = np.zeros(len_row)
    for data in line.split(','):
        if not data.strip():
            continue  # empty row "{}": every value stays zero
        index, value = data.split()
        row[int(index)] = float(value)

    return row


columns = []
len_attr = len('@attribute')

# get the columns
for line in file_content:
    # ARFF keywords are case-insensitive
    if line.lower().startswith('@attribute '):
        col_name = line[len_attr:].split()[0]
        columns.append(col_name)

rows = []
len_row = len(columns)
# get the rows
for line in file_content:
    if line.startswith('{'):
        rows.append(parse_row(line, len_row))

df = pd.DataFrame(data=rows, columns=columns)

df.head()
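
If you would rather keep using loadarff, another option is to rewrite the sparse rows as dense comma-separated rows first. The following is only a sketch under assumptions: the function name sparse_to_dense_arff is made up, values are assumed to contain no commas or quoted spaces, and omitted entries are written as a literal 0, which is only right for numeric attributes or nominal attributes whose first declared value is 0.

```python
def sparse_to_dense_arff(lines):
    """Rewrite sparse {index value, ...} data rows as dense comma-separated rows."""
    n_attrs = 0
    in_data = False
    out = []
    for raw in lines:
        line = raw.strip()
        lower = line.lower()
        if not in_data:
            # Header section: count attributes and copy lines through unchanged
            if lower.startswith('@attribute'):
                n_attrs += 1
            elif lower.startswith('@data'):
                in_data = True
            out.append(line)
        elif line.startswith('{') and line.endswith('}'):
            # Assumption: an omitted entry can be written as the literal 0
            row = ['0'] * n_attrs
            for pair in line[1:-1].split(','):
                if pair.strip():
                    index, value = pair.split(None, 1)
                    row[int(index)] = value.strip()
            out.append(','.join(row))
        else:
            # Comments, blank lines and rows that are already dense
            out.append(line)
    return out


# Hypothetical three-attribute file with one sparse row and one empty sparse row
demo = [
    '@relation demo',
    '@attribute a numeric',
    '@attribute b numeric',
    '@attribute c numeric',
    '@data',
    '{0 2, 2 5}',
    '{}',
]
dense_lines = sparse_to_dense_arff(demo)
print('\n'.join(dense_lines))
```

The returned lines can be joined with newlines, written to a new file, and that file passed to loadarff. I have not checked this against the full yahoo_arts dataset, so treat it as a starting point.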


A similar question about "python - Unable to load arff dataset with scipy.arff.loadarff" can be found on Stack Overflow: https://stackoverflow.com/questions/59271661/
