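I'm currently trying to write a setup script that sets up a workspace for me, so that I don't have to do it by hand. I started doing this in bash, but quickly realized it wasn't working well.
My next idea was to do it in Python, but I can't seem to get it right. My plan was to make a list (a .txt file containing the paths of all the data files), shuffle that list, and then move each file into my train or test directory according to a given ratio.
But this is Python; isn't there a simpler way? It feels like I'm building an unnecessary workaround just to split the files.
Code: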
# Partition data randomly into train and test.
cd ${PATH_TO_DATASET}
SPLIT=0.5 #train/test split
NUMBER_OF_FILES=$(ls ${PATH_TO_DATASET} | wc -l) ## number of directories in the dataset
even=1
echo ${NUMBER_OF_FILES}
if [ `echo "${NUMBER_OF_FILES} % 2" | bc` -eq 0 ]
then
even=1
echo "Even is true"
else
even=0
echo "Even is false"
fi
echo -e "${BLUE}Separating files into train and test set!${NC}"
for ((i=1; i<=${NUMBER_OF_FILES}; i++))
do
ran=$(python -c "import random;print(random.uniform(0.0, 1.0))")
if [[ ${ran} < ${SPLIT} ]]
then
##echo "test ${ran}"
cp -R $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/test/
else
##echo "train ${ran}"
cp -R $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/train/
fi
##echo $(ls -d */|sed "${i}q;d")
done
cd ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data
NUMBER_TRAIN_FILES=$(ls train/ | wc -l)
NUMBER_TEST_FILES=$(ls test/ | wc -l)
echo "${NUMBER_TRAIN_FILES} and ${NUMBER_TEST_FILES}..."
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
if [[ ${even} = 1 ]] && [[ ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES} != ${SPLIT} ]]
then
echo "Something need to be fixed!"
if [[ $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES}) > ${SPLIT} ]]
then
echo "Too many files in the TRAIN set move some to TEST"
cd train
echo $(pwd)
while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
do
mv $(ls -d */|sed "1q;d") ../test/
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
done
else
echo "Too many files in the TEST set move some to TRAIN"
cd test
while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
do
mv $(ls -d */|sed "1q;d") ../train/
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
done
fi
fi
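My problem is the last part. Because I pick the numbers at random, I can't be sure the data ends up partitioned as intended. My last if statement is meant to check whether the partition came out right and to fix it if it didn't. That isn't possible, because I'm comparing floats, and overall the solution is turning into a quick hack.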
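To make the problem concrete, here is a minimal Python sketch (not the asker's code; `coin_flip_split` and `shuffle_split` are hypothetical names) that compares the per-item random draw used in the loop above with shuffling once and cutting the list at a fixed index. Only the second approach guarantees the sizes, so no float comparison or repair step is needed afterwards.

```python
import random

def coin_flip_split(items, test_ratio, rng):
    """Assign each item independently; the resulting sizes vary from run to run."""
    train, test = [], []
    for item in items:
        (test if rng.random() < test_ratio else train).append(item)
    return train, test

def shuffle_split(items, test_ratio, rng):
    """Shuffle once, then cut at a fixed index; the sizes are always the same."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

items = ["dir_%02d" % i for i in range(10)]  # stand-ins for dataset directories
train, test = shuffle_split(items, 0.5, random.Random(0))
print(len(train), len(test))  # 5 5
```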
Best answer
scikit-learn to the rescue =)
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> y
[0, 1, 2, 3, 4]
# If I want 1/4 of the data for testing
# and set the random seed to 42:
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]
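See http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html (in scikit-learn 0.18 and later the function is in sklearn.model_selection; sklearn.cross_validation was removed in 0.20).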
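The 5-row example above ends up with 2 test rows, because 1.25 rows has to become a whole number. As far as I can tell, scikit-learn rounds a fractional `test_size` up and gives the remainder to training. The sketch below mirrors that arithmetic using only the standard library; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = int(math.ceil(test_size * n_samples))
    return n_samples - n_test, n_test

print(split_sizes(5, 0.25))  # (3, 2), matching the 3 train / 2 test rows above
print(split_sizes(4, 0.25))  # (3, 1)
```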
Demo:
alvas@ubi:~$ mkdir splitfileproblem
alvas@ubi:~$ cd splitfileproblem/
alvas@ubi:~/splitfileproblem$ mkdir original
alvas@ubi:~/splitfileproblem$ mkdir train
alvas@ubi:~/splitfileproblem$ mkdir test
alvas@ubi:~/splitfileproblem$ ls
original train test
alvas@ubi:~/splitfileproblem$ cd original/
alvas@ubi:~/splitfileproblem/original$ ls
alvas@ubi:~/splitfileproblem/original$ echo 'abc' > a.txt
alvas@ubi:~/splitfileproblem/original$ echo 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat a.txt
abc
alvas@ubi:~/splitfileproblem/original$ echo -e 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat b.txt
def
ghi
alvas@ubi:~/splitfileproblem/original$ echo -e 'jkl' > c.txt
alvas@ubi:~/splitfileproblem/original$ echo -e 'mno' > d.txt
alvas@ubi:~/splitfileproblem/original$ ls
a.txt b.txt c.txt d.txt
In Python:
alvas@ubi:~/splitfileproblem$ ls
original test train
alvas@ubi:~/splitfileproblem$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from sklearn.model_selection import train_test_split
>>> os.listdir('original')
['b.txt', 'd.txt', 'c.txt', 'a.txt']
>>> X = y= os.listdir('original')
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
>>> X_train
['a.txt', 'd.txt', 'b.txt']
>>> X_test
['c.txt']
Now move the files:
>>> for x in X_train:
... os.rename('original/'+x , 'train/'+x)
...
>>> for x in X_test:
... os.rename('original/'+x , 'test/'+x)
...
>>> os.listdir('test')
['c.txt']
>>> os.listdir('train')
['b.txt', 'd.txt', 'a.txt']
>>> os.listdir('original')
[]
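If pulling in scikit-learn just to split a file list feels heavy, the same steps can be sketched with the standard library alone. This is only an illustration under assumptions: `split_files`, its parameters, the use of Python 3, and copying instead of moving are my own choices, not part of the session above. It also assumes the destination directories do not already contain entries with the same names.

```python
import os
import random
import shutil

def split_files(src_dir, train_dir, test_dir, test_ratio=0.25, seed=0):
    """Copy every entry of src_dir into train_dir or test_dir at a fixed ratio."""
    names = sorted(os.listdir(src_dir))  # sort first so a given seed is repeatable
    random.Random(seed).shuffle(names)
    n_test = int(round(len(names) * test_ratio))
    for i, name in enumerate(names):
        dest_dir = test_dir if i < n_test else train_dir
        os.makedirs(dest_dir, exist_ok=True)
        src = os.path.join(src_dir, name)
        dest = os.path.join(dest_dir, name)
        if os.path.isdir(src):
            shutil.copytree(src, dest)  # the question's dataset consists of directories
        else:
            shutil.copy2(src, dest)
    # Return (train_names, test_names) so the caller can log or verify the split.
    return names[n_test:], names[:n_test]
```

Calling `split_files('original', 'train', 'test')` on the four files from the demo would copy one file into `test` and three into `train`; which file lands where depends on the seed.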
This is based on a similar question we found on Stack Overflow, "python - Randomly distribute files into train/test given a ratio": https://stackoverflow.com/questions/39210765/