2.2 Assessing Model Accuracy

2.2.3 The Classification Setting

So far, we have discussed model accuracy in the regression setting.

How should we assess accuracy in the classification setting?

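The most common approach for quantifying the accuracy of the estimate $\hat{f}$ is the training error rate, the proportion of training observations that $\hat{f}$ misclassifies:

$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$

Here $\hat{y}_i$ is the class label predicted for the $i$th observation, and $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i \neq \hat{y}_i$ and 0 otherwise.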
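The error rate defined above can be computed directly from the indicator values. The following is a minimal sketch in plain Python; the function name `error_rate` and the toy labels are made up for illustration and do not come from the text.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # I(y_i != yhat_i): 1 for a misclassified observation, 0 otherwise
    indicators = [1 if y != y_hat else 0 for y, y_hat in zip(y_true, y_pred)]
    return sum(indicators) / len(indicators)


# Toy labels (hypothetical): positions 3 and 5 are misclassified
y_train = ["a", "b", "a", "a", "b"]
y_hat = ["a", "b", "b", "a", "a"]

train_error = error_rate(y_train, y_hat)
print(train_error)  # 2 of the 5 labels differ
```

The same function gives a test error rate when it is applied to labels and predictions for observations that were not used to fit the classifier.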

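As in the regression setting, we are most interested in how well the classifier performs on test data that were not used in training. The test error rate over test observations of the form $(x_0, y_0)$ is

$$\mathrm{Ave}\left(I(y_0 \neq \hat{y}_0)\right)$$

where $\hat{y}_0$ is the class label predicted for the test observation with predictor $x_0$.

The better the classifier, the smaller its test error.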

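The test error rate is minimized, on average, by the classifier that assigns each observation to its most likely class given its predictor values.

This is called the Bayes classifier. It uses the conditional probability

$$\Pr(Y = j \mid X = x_0)$$

and assigns a test observation with predictor vector $x_0$ to the class $j$ for which this probability is largest.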
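The assignment rule of the Bayes classifier can be sketched as picking the class with the largest conditional probability. In practice the true conditional distribution is unknown, so the probabilities below are assumed values for illustration, and `bayes_classify` is a hypothetical helper name.

```python
def bayes_classify(cond_probs):
    """Return the class j with the largest Pr(Y = j | X = x0).

    cond_probs maps each class label to its conditional probability at x0.
    """
    return max(cond_probs, key=cond_probs.get)


# Assumed conditional probabilities at a single point x0 (illustrative only)
probs_at_x0 = {"orange": 0.7, "blue": 0.3}

label = bayes_classify(probs_at_x0)
print(label)  # orange
```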

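The Bayes classifier determines the Bayes decision boundary.

In a two-class problem, the Bayes decision boundary is the set of points where the conditional probability of each class is exactly 50%. Observations are assigned to one class or the other according to which side of the boundary they fall on.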
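For a two-class problem with a single predictor, the boundary is the point where the conditional probability crosses 0.5. The sketch below locates that point by bisection. The logistic-shaped function `prob_class1` is an assumption standing in for the unknown true conditional probability, and `find_boundary` is a hypothetical helper.

```python
import math


def prob_class1(x):
    # Assumed Pr(Y = 1 | X = x), increasing in x; for illustration only
    return 1.0 / (1.0 + math.exp(-(x - 2.0)))


def find_boundary(p, lo, hi, tol=1e-9):
    """Bisection for the x where p(x) = 0.5, assuming p is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if p(mid) < 0.5:
            lo = mid  # still on the class-0 side, move right
        else:
            hi = mid  # on or past the boundary, move left
    return (lo + hi) / 2.0


boundary = find_boundary(prob_class1, -10.0, 10.0)


def classify(x):
    # Points to the right of the boundary have Pr(Y = 1 | X = x) > 0.5
    return 1 if x > boundary else 0


print(round(boundary, 6), classify(3.0), classify(1.0))
```

With this assumed probability function the boundary sits at $x = 2$, so points to its right are assigned to class 1 and points to its left to class 0.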

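Because the Bayes classifier always chooses the class with the largest conditional probability, its error rate at $X = x_0$ is $1 - \max_j \Pr(Y = j \mid X = x_0)$.

Bayes error rate

Averaging over all possible values of $X$ gives the overall Bayes error rate:

$$1 - E\left(\max_j \Pr(Y = j \mid X)\right)$$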
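As a small worked example of the Bayes error rate, suppose $X$ takes only a few discrete values, so the expectation becomes a weighted sum. The distribution below is invented for illustration, and `bayes_error_rate` is a hypothetical helper name.

```python
def bayes_error_rate(x_dist):
    """Compute 1 - E[max_j Pr(Y = j | X)] for a discrete X.

    x_dist is a list of (Pr(X = x), {class: Pr(Y = class | X = x)}) pairs.
    """
    expected_max = sum(p_x * max(cond.values()) for p_x, cond in x_dist)
    return 1.0 - expected_max


# Assumed distribution: X takes two values, each with probability 0.5
x_dist = [
    (0.5, {"a": 0.9, "b": 0.1}),  # pointwise error here is 1 - 0.9 = 0.1
    (0.5, {"a": 0.4, "b": 0.6}),  # pointwise error here is 1 - 0.6 = 0.4
]

rate = bayes_error_rate(x_dist)
print(rate)  # about 0.25, the average of 0.1 and 0.4
```

The result is greater than zero whenever the classes overlap, that is, whenever some value of $X$ has a largest conditional probability below 1.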