
Class imbalance problem - data imbalance (oversampling, undersampling)

쌍쌍바나나 2017. 8. 27. 22:16

Class imbalance problem


What is the Class Imbalance Problem?

It is the problem that arises when the number of instances in each class of a dataset differs markedly. It shows up in practice across many fields, including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, and facial recognition.

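Why is it a problem?

Machine learning algorithms tend to give their best results when every class has roughly the same number of instances. When one class has far more instances than another, problems like the following arise.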

Suppose we are given a dataset of transaction data and have to tell fraudulent transactions from genuine ones. For an e-commerce company, deciding whether a transaction is fraudulent matters a great deal, and the company wants to catch as many fraudulent transactions as possible.

If the dataset consists of 10000 genuine and 10 fraudulent transactions, a classifier will tend to classify fraudulent transactions as genuine. The reason is easy to explain. Suppose the machine learning algorithm has two possible outputs:

Model 1 classifies 7 out of 10 fraudulent transactions as genuine, and 10 out of 10000 genuine transactions as fraudulent
Model 2 classifies 2 out of 10 fraudulent transactions as genuine, and 100 out of 10000 genuine transactions as fraudulent

If we judge a classifier by its number of mistakes, Model 1 made 17 mistakes while Model 2 made 102, so Model 1 looks better. But what we actually want is to minimize the number of fraudulent transactions that slip through, and on that count Model 2, with only 2 missed frauds, is better than Model 1. Classifying genuine transactions as fraudulent does carry a cost, but that cost is relatively small. A typical machine learning algorithm will nevertheless pick Model 1 over Model 2, and that is exactly the problem: many fraudulent transactions that Model 2 would have caught go undetected, which for the company means financial loss and unhappy customers.

How to tell machine learning algorithm which is the better solution

To be able to say that Model 2 is better than Model 1, we need a better metric than simply counting mistakes.

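TP: positive -> positive (a positive instance classified as positive)
TN: negative -> negative (a negative instance classified as negative)
FP: negative -> positive (a negative instance classified as positive)
FN: positive -> negative (a positive instance classified as negative)

Building on these, we define the True Positive Rate, True Negative Rate, False Positive Rate, and False Negative Rate: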

name	formula	explanation
TP rate	TP / (TP + FN)	The closer to 1, the better
TN rate	TN / (TN + FP)	The closer to 1, the better
FP rate	FP / (FP + TN)	The closer to 0, the better
FN rate	FN / (FN + TP)	The closer to 0, the better
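The four rates can be sketched as a small function. Each rate is normalized by the number of instances that actually belong to the class in question (actual positives for the TP and FN rates, actual negatives for the TN and FP rates). The function name `rates` and the example counts are chosen for illustration only; the counts describe a classifier that catches 3 of 10 positives and wrongly flags 10 of 10000 negatives.

```python
def rates(tp, tn, fp, fn):
    """Return the four rates computed from confusion-matrix counts."""
    return {
        "tp_rate": tp / (tp + fn),  # share of actual positives that were caught
        "tn_rate": tn / (tn + fp),  # share of actual negatives that were kept
        "fp_rate": fp / (fp + tn),  # share of actual negatives wrongly flagged
        "fn_rate": fn / (fn + tp),  # share of actual positives that were missed
    }


# Illustrative counts: 10 actual positives, 10000 actual negatives
r = rates(tp=3, tn=9990, fp=10, fn=7)
for name, value in r.items():
    print(f"{name}: {value:.3f}")
```

Note that `tp_rate + fn_rate` and `tn_rate + fp_rate` each sum to 1, so two of the four numbers already carry all the information.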

Let us compare these new metrics with the conventional metric that simply counts mistakes.

Error(Model 1)
= (FP + FN) / total dataset size
= (10 + 7) / 10010
≈ 0.0017 (about 0.17% error)

Error(Model 2)
= (FP + FN) / total dataset size
= (100 + 2) / 10010
≈ 0.0102 (about 1% error)

Measured by the number of mistakes alone, Model 1's error of about 0.17% is smaller than Model 2's error of about 1%, so we would say that Model 1 is better than Model 2. But we already know that Model 2 is the better one…
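The same arithmetic as a short sketch. The third case is a hypothetical baseline that is not part of the comparison above: a classifier that labels every transaction as genuine. It is included only to show how far the plain error rate can mislead, since it scores best of all while catching no fraud.

```python
def error_rate(fp, fn, total):
    """Fraction of all instances that were misclassified."""
    return (fp + fn) / total


total = 10000 + 10  # genuine + fraudulent transactions

model1_error = error_rate(fp=10, fn=7, total=total)
model2_error = error_rate(fp=100, fn=2, total=total)
# Hypothetical baseline: everything is labelled genuine,
# so there are no false positives and all 10 frauds are missed.
baseline_error = error_rate(fp=0, fn=10, total=total)

print(f"Model 1:  {model1_error:.4f}")
print(f"Model 2:  {model2_error:.4f}")
print(f"Baseline: {baseline_error:.4f}")
```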

Computing the new metrics gives:

Model 1

TP_rate(M1) = 3 / (3 + 7) = 0.3
TN_rate(M1) = 9990 / (10 + 9990) = 0.999
FP_rate(M1) = 10 / (10 + 9990) = 0.001
FN_rate(M1) = 7 / (7 + 3) = 0.7

Model 2

TP_rate(M2) = 8 / (8 + 2) = 0.8
TN_rate(M2) = 9900 / (100 + 9900) = 0.990
FP_rate(M2) = 100 / (100 + 9900) = 0.01
FN_rate(M2) = 2 / (2 + 8) = 0.2

Model 1 has a false negative rate of 70% and Model 2 has 20%, so by this metric Model 2 performs better.

How to mitigate this problem

So far we have looked at what the class imbalance problem is. Next we describe how to deal with an imbalanced dataset.

Broadly, there are two categories of approach: sampling based approaches and cost function based approaches.

Cost function based approaches
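
The general idea of a cost function based approach is to make a missed fraud (false negative) more expensive than a false alarm (false positive), so that the learner no longer prefers Model 1. A minimal sketch, assuming hypothetical costs of 1 per false positive and 100 per false negative (both numbers are made up for illustration), applied to the Model 1 and Model 2 counts from above:

```python
def total_cost(fp, fn, cost_fp=1.0, cost_fn=100.0):
    """Weighted misclassification cost; the default weights are illustrative assumptions."""
    return cost_fp * fp + cost_fn * fn


# Equal costs reproduce plain mistake counting: Model 1 looks better.
model1_plain = total_cost(fp=10, fn=7, cost_fp=1.0, cost_fn=1.0)
model2_plain = total_cost(fp=100, fn=2, cost_fp=1.0, cost_fn=1.0)

# With a missed fraud costing 100x a false alarm, Model 2 wins.
model1_cost = total_cost(fp=10, fn=7)
model2_cost = total_cost(fp=100, fn=2)

print(model1_plain, model2_plain)
print(model1_cost, model2_cost)
```

How large the cost ratio should be depends on the application; the point is only that the ranking of the two models flips once the two kinds of mistakes are weighted differently.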

Sampling based approaches

oversampling: add instances of the minority class
undersampling: remove instances of the majority class
hybrid (mix of oversampling and undersampling)

Sampling based approaches do have clear drawbacks.

undersampling

This method removes instances of the more frequent majority class, so useful information gets thrown away.

In the figure, the green line is the ideal decision boundary and the blue line is the actual result.
The left side shows a general machine learning algorithm applied without undersampling. On the right, the negative class has been undersampled, which removes information about the negative class. The blue decision boundary ends up slanted, so some negative instances can be misclassified as positive.
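
Random undersampling can be sketched as follows, using only the standard library. The function name `random_undersample` and the toy data are made up for this example; real data would carry feature vectors instead of index numbers.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep every minority instance and a random, equally sized subset of the majority."""
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    kept = rng.sample(majority, k=k)  # sampling without replacement
    return kept, minority


majority = [("genuine", i) for i in range(10000)]
minority = [("fraud", i) for i in range(10)]

kept, minority_out = random_undersample(majority, minority)
print(len(kept), len(minority_out))
```

In this toy setup 9990 of the 10000 majority instances are discarded, which is exactly the information loss described above.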

oversampling

Duplicating minority class instances can cause the classifier to overfit.

The left figure shows the data before oversampling, and the right one shows it after oversampling has been applied. The thick positive signs represent positive instances that have been duplicated multiple times.
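
Oversampling by duplication can be sketched in the same way. This is a minimal illustration with the standard library; the function name `random_oversample` and the toy data are assumptions for this example.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority instances until both classes have the same size."""
    rng = random.Random(seed)
    n_extra = max(len(majority) - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(n_extra)]  # sampling with replacement
    return majority, minority + extra


majority = [("genuine", i) for i in range(10000)]
minority = [("fraud", i) for i in range(10)]

majority_out, oversampled = random_oversample(majority, minority)
print(len(majority_out), len(oversampled))
print(len(set(oversampled)))  # number of distinct minority instances
```

The oversampled class is as large as the majority, yet it still contains only the original distinct instances, each repeated many times. That repetition is what makes overfitting likely.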

hybrid approach

This approach combines undersampling and oversampling, but it has drawbacks too: the trade-off is still there.
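
One possible way to combine the two, shown as a sketch: shrink the majority class and grow the minority class until both reach a common size. The function name `hybrid_resample` and the `target` value of 1000 are arbitrary choices for illustration, not a recommendation.

```python
import random


def hybrid_resample(majority, minority, target, seed=0):
    """Undersample the majority and oversample the minority toward `target` instances each."""
    rng = random.Random(seed)
    # Undersample: keep at most `target` majority instances, without replacement.
    new_majority = rng.sample(majority, k=min(target, len(majority)))
    # Oversample: add duplicates until the minority reaches `target` (no-op if already larger).
    n_extra = max(target - len(minority), 0)
    new_minority = minority + [rng.choice(minority) for _ in range(n_extra)]
    return new_majority, new_minority


majority = [("genuine", i) for i in range(10000)]
minority = [("fraud", i) for i in range(10)]

new_majority, new_minority = hybrid_resample(majority, minority, target=1000)
print(len(new_majority), len(new_minority))
```

Compared with pure undersampling, less majority data is thrown away; compared with pure oversampling, each minority instance is duplicated fewer times. Both drawbacks are reduced but neither disappears.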

Even more recent approaches

Recent literature covers RUSBoost, SMOTEBagging, and UnderBagging, all of which are approaches that grew out of SMOTE.
Even so, SMOTE itself is still very widely used because it is simple.
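
The core idea of SMOTE is to create synthetic minority instances by interpolating between a minority instance and one of its nearest minority neighbours, instead of duplicating existing instances. The following is a simplified sketch of that idea, not the reference implementation; the function name `smote_sketch`, the value of `k`, and the toy points are assumptions for illustration. It needs at least two minority points.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two existing minority points, the new instances stay inside the region the minority class already occupies.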

Summary

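The Class Imbalance Problem is one of the problems that affect machine learning, caused by an imbalance in the number of instances per class. To compare solutions, we set alternative metrics against general accuracy, which just counts mistakes.

The ways to mitigate the problem fall broadly into sampling based and cost function based approaches, and the sampling based ones are oversampling, undersampling, and hybrid.

[Reference]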
http://www.chioka.in/class-imbalance-problem/
