train_test_split

To split data into training data and test data, use scikit-learn's train_test_split function.

array([[ 1, 1],
       [ 2, 2],
       [ 1, 3],
       [ 2, 4],
       [ 1, 5],
       [ 2, 6],
       [ 1, 7],
       [ 2, 8],
       [ 1, 9],
       [ 2, 10]])

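Put the values 1 through 10 into a one-dimensional array, col.
Use reshape to convert it into an array with 10 rows and 1 column.
A horizontal line of the array is a row and a vertical line is a column.

Column 0 of data holds an attribute whose value is either 1 or 2.
Append col to data as a new column.
When the third argument of np.append (axis) is 1, the values are appended as a column.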
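The steps above correspond to the following lines, taken from the full listing at the end of this post. They build the two-column array shown at the start of this section. The printed shape is only a sanity check added for illustration.

```python
import numpy as np

# Values 1..10 as a 1-D array, reshaped to 10 rows x 1 column.
col = np.arange(1, 11).reshape(10, 1)

# Column 0: an attribute that alternates between 1 and 2.
data = [[1], [2], [1], [2], [1], [2], [1], [2], [1], [2]]

# The third argument (axis=1) appends col as a new column.
data = np.append(data, col, 1)

print(data.shape)  # (10, 2)
```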

newcol = col + 10
newdata = np.append(data, newcol, 1)
newdata

array([[ 1, 1, 11],
       [ 2, 2, 12],
       [ 1, 3, 13],
       [ 2, 4, 14],
       [ 1, 5, 15],
       [ 2, 6, 16],
       [ 1, 7, 17],
       [ 2, 8, 18],
       [ 1, 9, 19],
       [ 2, 10, 20]])

newdata is built by appending newcol to data.

x = newdata[:, 0:2]
y = newdata[:, 2]
x

array([[ 1, 1],
       [ 2, 2],
       [ 1, 3],
       [ 2, 4],
       [ 1, 5],
       [ 2, 6],
       [ 1, 7],
       [ 2, 8],
       [ 1, 9],
       [ 2, 10]])

The first two columns go into x and the last column into y. Both are NumPy arrays rather than Python lists.
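As a quick check of what the slicing produces, the sketch below rebuilds the same three-column array and prints the shapes of x and y. It uses np.hstack instead of the post's np.append calls purely for brevity, and the name attr is a hypothetical helper, not part of the original code.

```python
import numpy as np

col = np.arange(1, 11).reshape(10, 1)
attr = np.array([1, 2] * 5).reshape(10, 1)   # hypothetical name for column 0

# Same contents as newdata in the post, built in one step.
newdata = np.hstack([attr, col, col + 10])

x = newdata[:, 0:2]   # all rows, columns 0 and 1 -> 2-D array
y = newdata[:, 2]     # all rows, column 2 only  -> 1-D array

print(x.shape)  # (10, 2)
print(y.shape)  # (10,)
```

Slicing with 0:2 keeps two dimensions, while indexing with a single integer drops the column axis, which is why y is one-dimensional.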

seed = 5
# Unstratified alternative, kept for comparison:
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed, stratify=newdata[:, 0:1], shuffle=True)
print('x_train total samples: %d' % len(x_train))
print('x_test total samples: %d' % len(x_test))
x_test

x_train total samples: 8
x_test total samples: 2
array([[2, 2],
       [1, 7]])

scikit-learn's train_test_split function splits the data into training data and test data.

test_size=0.2
Sets the share of the data that goes to the test set, with the whole dataset counted as 1. Here there are 10 samples in total, so 2 of them become test data.
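The arithmetic can be checked with a small sketch. The arrays here are placeholder data made up for illustration, not the post's newdata. The second call shows that an integer test_size is read as an absolute number of samples rather than a fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# A float test_size is a fraction of the whole: 0.2 * 10 = 2 test samples.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=5)
print(len(x_train), len(x_test))    # 8 2

# An int test_size is an absolute count of test samples.
x_train3, x_test3, y_train3, y_test3 = train_test_split(
    x, y, test_size=3, random_state=5)
print(len(x_train3), len(x_test3))  # 7 3
```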

random_state=seed
Sets the random seed. With a fixed seed, the same split is produced on every run.
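A minimal sketch of what fixing the seed buys: two calls with the same random_state return identical splits. The data is again a made-up placeholder, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=5)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=5)

same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)  # True
```

Leaving random_state unset lets the split change from run to run, which makes results harder to compare.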

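stratify=newdata[:, 0:1]
Makes sure the values 1 and 2 in column 0 are represented evenly in both the training and the test data.
Having every value represented keeps the split from being skewed toward one class, which can help guard against overfitting.
If stratify is not set, x_test may end up like this, with only the value 2 in column 0:
array([[2, 2],
       [2, 8]])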
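One way to see the effect of stratify is to repeat the split with several seeds and look at which column 0 values land in the test set. The sketch below assumes the same 10-row data as the post, rebuilt with np.hstack for brevity. The names attr and test_attrs are hypothetical, and a 1-D slice x[:, 0] is passed to stratify instead of the post's 0:1 slice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

col = np.arange(1, 11).reshape(10, 1)
attr = np.array([1, 2] * 5).reshape(10, 1)   # hypothetical name for column 0
x = np.hstack([attr, col])
y = (col + 10).ravel()

test_attrs = []
for seed in range(5):
    # Stratify on column 0 so both of its values appear in the test set.
    _, x_test, _, _ = train_test_split(
        x, y, test_size=0.2, random_state=seed, stratify=x[:, 0])
    test_attrs.append(sorted(x_test[:, 0].tolist()))

print(test_attrs)  # every entry is [1, 2]
```

With five samples per value and two test samples, the stratified split always picks one row of each value, whatever the seed.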

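shuffle=True
Shuffles the data before splitting. The reason for shuffling is that mini-batch training updates the model with the average gradient over each batch, so if the data stays in its original order the batches can be unrepresentative and training can move in the wrong direction.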
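To see what shuffling changes in the split itself, the sketch below turns it off. With shuffle=False the test set is simply the tail of the data in its original order. It also shows that scikit-learn rejects stratify combined with shuffle=False. The data is the post's, rebuilt with np.hstack for brevity, and the names attr and raised are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

col = np.arange(1, 11).reshape(10, 1)
attr = np.array([1, 2] * 5).reshape(10, 1)   # hypothetical name for column 0
x = np.hstack([attr, col])
y = (col + 10).ravel()

# Without shuffling, the last 20% of the rows become the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=False)
print(x_test.tolist())  # [[1, 9], [2, 10]]

# Stratified splitting needs shuffling, so this combination raises ValueError.
raised = False
try:
    train_test_split(x, y, test_size=0.2, shuffle=False, stratify=x[:, 0])
except ValueError as err:
    raised = True
    print(err)
```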

The full code is as follows.

import numpy as np
from sklearn.model_selection import train_test_split

# Values 1..10 as a 10 x 1 column
col = np.arange(1, 11).reshape(10, 1)
# Column 0: an attribute that alternates between 1 and 2
data = [[1], [2], [1], [2], [1], [2], [1], [2], [1], [2]]
data = np.append(data, col, 1)

# Third column: col + 10
newcol = col + 10
newdata = np.append(data, newcol, 1)

# First two columns go to x, the last column to y
x = newdata[:, 0:2]
y = newdata[:, 2]

seed = 5
# Unstratified alternative, kept for comparison:
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=seed, stratify=newdata[:, 0:1], shuffle=True)
print('x_train total samples: %d' % len(x_train))
print('x_test total samples: %d' % len(x_test))