Python machine learning: Creating a training and test set from a list of arrays

Tags: python arrays dictionary for-loop machine-learning

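I want to build a neural network trained on the RAVDESS dataset (https://smartlaboratory.org/ravdess/): the idea is to use this dataset to detect the emotion of a person speaking into my application's microphone.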

Using librosa and the for loop below, I extracted the labels and features I want to use for the analysis.

import os

import librosa

# Start with a single actor's folder to speed things up
oneActorPath = '/content/drive/My Drive/RAVDESS/Audio_Speech_Actors_01-24/Actor_01/'
lst = []

# Walk through each folder to find the wav files
for subdir, dirs, files in os.walk(oneActorPath):
  for file in files:
    if file == '.DS_Store':
      continue
    try:
      # Load the audio as a NumPy array (rate is the sampling rate)
      data, rate = librosa.load(os.path.join(subdir, file))
    except ValueError:
      # Skip files whose format is not valid
      continue
    # Characters 6-7 of the file name hold the emotion code
    label = file[6:8]
    lst.append((data, label))

The output of this loop is a list of (array, label) tuples in the following format:

[(array([-8.1530527e-10,  8.9952795e-10, -9.1185753e-10, ...,
          0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32),
  '08'),
 (array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '08'),
 (array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '06'),
 (array([-0.00050612, -0.00057967, -0.00035985, ...,  0.        ,
          0.        ,  0.        ], dtype=float32), '05'),
 (array([ 6.8139506e-08, -2.3837963e-05, -2.4622474e-05, ...,
          3.1678758e-06, -2.4535689e-06,  0.0000000e+00], dtype=float32),
  '05'),
 (array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          6.9306935e-07, -6.6020442e-07,  0.0000000e+00], dtype=float32),
  '04'),
 (array([-7.30260945e-05, -1.18022966e-04, -1.08280736e-04, ...,
          8.83421380e-05,  4.97258679e-06,  0.00000000e+00], dtype=float32),
  '06'),
 (array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '07'),
 (array([ 2.3406714e-05,  3.1186773e-05,  4.9467826e-06, ...,
          1.2180173e-07, -9.2944845e-08,  0.0000000e+00], dtype=float32),
  '01'),
 (array([ 1.1845550e-06, -1.6399191e-06,  2.5565218e-06, ...,
         -8.7445065e-09,  5.9859917e-09,  0.0000000e+00], dtype=float32),
  '04'),
 (array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '03'),
 (array([-1.3284328e-05, -7.4090644e-07,  7.2679302e-07, ...,
          0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32),
  '07'),
 (array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
          5.0694009e-08, -3.4546797e-08,  0.0000000e+00], dtype=float32),
  '03'),
 (array([ 1.5591205e-07, -1.5845627e-07,  1.5362870e-07, ...,
          0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32),
  '01'),
 (array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), '03'),
 (array([0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.1608539e-05,
         8.2463991e-09, 0.0000000e+00], dtype=float32), '03'),
 (array([-3.6192148e-07, -1.4590451e-05, -5.3999561e-06, ...,
         -1.9935460e-05, -3.4417746e-05,  0.0000000e+00], dtype=float32),
  '02'),
 (array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -2.5319534e-07,  2.6521766e-07,  0.0000000e+00], dtype=float32),
  '02'),
 (array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
         -2.5055220e-08,  1.2936166e-08,  0.0000000e+00], dtype=float32)
...

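The second element of each entry in the list above ('08' in the first entry) is the label for that sample, according to the dictionary below: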

emotions = {
    "neutral": "01",
    "calm": "02",
    "happy": "03",
    "sad": "04",
    "angry": "05", 
    "fearful": "06", 
    "disgust": "07", 
    "surprised": "08"
}
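Because the loop stores the two-digit code rather than the emotion name, it can help to invert this dictionary for lookups. The sketch below assumes the hyphen-separated RAVDESS file name layout, where the third field is the emotion code; the helper name `emotion_from_filename` and the sample file name are illustrative only.

```python
emotions = {
    "neutral": "01",
    "calm": "02",
    "happy": "03",
    "sad": "04",
    "angry": "05",
    "fearful": "06",
    "disgust": "07",
    "surprised": "08",
}

# Invert the mapping so a code such as "08" resolves to its name.
code_to_emotion = {code: name for name, code in emotions.items()}


def emotion_from_filename(filename):
    """Return the emotion name encoded in a RAVDESS-style file name (hypothetical helper)."""
    # The third hyphen-separated field is the emotion code,
    # which is the same substring as filename[6:8].
    code = filename.split("-")[2]
    return code_to_emotion[code]


print(emotion_from_filename("03-01-08-01-01-01-01.wav"))  # surprised
```

Splitting on the hyphen is a little more robust than slicing by position, since it does not depend on every field being exactly two characters wide.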

At this point I have the labels and the data: how can I split this dataset to obtain a training set and a test set?

EDIT1: I need to understand how to get X and y from this structure so that I can use train_test_split on the data.

Best answer

You can use scikit-learn's train_test_split function (relevant docs). The example in the documentation is quite straightforward:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
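Conceptually, such a split amounts to shuffling the sample indices and cutting them into two groups, so that each row of X stays paired with its label. The NumPy-only sketch below illustrates that idea; it is not scikit-learn's implementation, and the function name `simple_split` is hypothetical.

```python
import math

import numpy as np


def simple_split(X, y, test_size=0.33, seed=42):
    """Shuffle indices and split X and y into train and test parts (illustrative only)."""
    n_samples = len(X)
    # Round the test share up so that small datasets still get test samples.
    n_test = math.ceil(test_size * n_samples)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same positions to keep samples and labels aligned.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = np.arange(10).reshape((5, 2))
y = list(range(5))
X_train, X_test, y_train, y_test = simple_split(X, y)
print(len(X_train), len(X_test))  # 3 2
```

In practice the library function is preferable, since it also handles options such as stratification.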

In your case, you may need some data manipulation to get the X and y vectors out of the output list:

X, y = zip(*lst)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
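One caveat: the recordings have different lengths, so X comes out as a tuple of 1-D arrays of unequal size, which most models cannot consume directly. The sketch below shows one possible way to handle that, using a small synthetic `lst` in place of the real data. It zero-pads or truncates each waveform to a fixed length, converts the string codes to integers, and stratifies the split by label. The helper `pad_or_truncate`, the target length, and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for `lst`: (waveform, emotion code) pairs of varying length.
rng = np.random.default_rng(0)
lst = [
    (rng.standard_normal(n).astype(np.float32), code)
    for n, code in [(50, "01"), (80, "01"), (60, "01"),
                    (70, "02"), (90, "02"), (40, "02")]
]


def pad_or_truncate(signal, length):
    """Zero-pad or cut a 1-D array so it has exactly `length` samples."""
    if len(signal) >= length:
        return signal[:length]
    return np.pad(signal, (0, length - len(signal)))


signals, codes = zip(*lst)
target_len = 64  # arbitrary value chosen for this example
X = np.stack([pad_or_truncate(s, target_len) for s in signals])
y = np.array([int(code) - 1 for code in codes])  # "01".."08" -> 0..7

# stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

For real audio, fixed-size features extracted from each clip (for example MFCCs) are often a better input than raw padded samples, but the unpack, stack, and split steps stay the same.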

Regarding "Python machine learning: Creating a training and test set from a list of arrays", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53349568/
