2.4 Exercises

1. For each of parts (a) through (d), indicate whether we would generally expect a flexible statistical learning method to perform better or worse than an inflexible method. Justify your answer.

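(a) The sample size n is extremely large, and the number of predictors p is small.

→ better. With a large sample, a flexible method can fit the data closely without overfitting, so it gains from its lower bias.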

(b) The number of predictors p is extremely large, and the number of observations n is small.

→ worse. With so few observations, a flexible method is likely to overfit.

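(c) The relationship between the predictors and the response is highly non-linear.

→ better. An inflexible model struggles to capture a non-linear relationship.

⇒ (Answer) With more degrees of freedom, a flexible model fits the non-linear relationship better.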

(d) The variance of the error terms is extremely high.

→ worse. A flexible method fits the noise in the error terms, which increases its variance.
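
The answers to (c) and (d) can be checked with a small simulation. The sketch below is only illustrative and is not part of the exercise: it assumes a least-squares line as the inflexible method and 1-nearest-neighbour regression as the flexible one, and the true functions, noise levels, and helper names (`fit_line`, `fit_1nn`, `estimate_test_mse`) are all made up for the example.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares line: stands in for the inflexible method."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    """1-nearest-neighbour regression: stands in for the flexible method."""
    pts = list(zip(xs, ys))
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]

def estimate_test_mse(fit, f, sigma, n_train=100, n_test=1000, seed=0):
    """Train on simulated data y = f(x) + noise and return the test MSE."""
    rng = random.Random(seed)

    def sample(n):
        xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        return xs, ys

    model = fit(*sample(n_train))
    xt, yt = sample(n_test)
    return sum((y - model(x)) ** 2 for x, y in zip(xt, yt)) / n_test

# (c) highly non-linear relationship, little noise (assumed f and sigma)
nonlinear = lambda x: math.sin(3 * x)
c_line = estimate_test_mse(fit_line, nonlinear, sigma=0.1)
c_1nn = estimate_test_mse(fit_1nn, nonlinear, sigma=0.1)

# (d) linear relationship, very noisy error term (assumed f and sigma)
linear = lambda x: 2 * x
d_line = estimate_test_mse(fit_line, linear, sigma=3.0)
d_1nn = estimate_test_mse(fit_1nn, linear, sigma=3.0)

print("(c) line:", c_line, " 1-NN:", c_1nn)
print("(d) line:", d_line, " 1-NN:", d_1nn)
```

In scenario (c) the flexible fit should come out with the lower test MSE, and in scenario (d) the line should, because 1-NN copies the noise of a single training point into every prediction.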

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

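(a) We collect data on the top 500 firms in the US. For each firm we record profit, number of employees, industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

→ regression. inference. The response (CEO salary) is quantitative, and the goal is to understand how the predictors affect it rather than to predict it.

n: 500 (the top 500 US firms)

p: 3 (profit, number of employees, industry)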

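(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, the price charged for the product, the marketing budget, the competition price, and ten other variables.

→ classification. prediction. The response is success or failure, and we want to predict it for the new product.

n: 20 (similar products)

p: 13 (price charged for the product, marketing budget, competition price, and ten other variables)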

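(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

→ regression. prediction. The response is quantitative, and the goal is to predict its value rather than to analyze the relationship between the variables.

n: 52 (weekly data for all of 2012)

p: 3 (% change in the US, British, and German markets)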

3. Review of the bias-variance decomposition
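
For reference, the decomposition reviewed here can be written as below. This follows the standard textbook form; the symbols are assumptions taken from that form rather than from these notes: x_0 is a test point, \hat{f} the fitted model, and \varepsilon the error term whose variance is the irreducible error.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The left-hand side is the expected test MSE at x_0. Since the first two terms are non-negative, the expected test MSE can never fall below \mathrm{Var}(\varepsilon).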

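(a) Sketch typical (squared) bias, variance, training error, test error, and Bayes (irreducible) error curves on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis represents the amount of flexibility in the method, and the y-axis represents the value of each curve. Draw the curves and explain them.

→ bias: decreases as flexibility increases.

→ variance: increases as flexibility increases.

→ training error: decreases as flexibility increases.

→ test error: U-shaped. It decreases at first, then rises again once the method starts to overfit.

→ Bayes error curve: constant in flexibility (a horizontal line). It is the irreducible error, so the test error curve does not drop below it. For a classification problem it is the error rate of the Bayes classifier, i.e. the expected share of data points that fall on the wrong side of the Bayes decision boundary.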
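
The shape of the training and test error curves can be reproduced numerically. The following is a minimal sketch under assumed settings, not the exercise's intended solution: it uses k-nearest-neighbour regression, where a smaller k means more flexibility, on data simulated from an assumed true function sin(x) with noise standard deviation 0.5. The helper names `knn_predict`, `mse`, and `simulate` are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN fit (built on `train`) over `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def simulate(n, sigma, rng):
    """Draw n points from y = sin(x) + Gaussian noise (assumed true model)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, sigma)) for x in xs]

rng = random.Random(1)
sigma = 0.5
train = simulate(100, sigma, rng)
test = simulate(500, sigma, rng)

# Ordered from least flexible (k = n, the global mean) to most flexible (k = 1).
ks = [100, 30, 10, 3, 1]
curves = {k: (mse(train, train, k), mse(train, test, k)) for k in ks}

for k in ks:
    train_err, test_err = curves[k]
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}  "
          f"irreducible error={sigma ** 2:.3f}")
```

Reading the rows from top to bottom corresponds to moving right along the x-axis of the plot. The training error reaches zero at k = 1 because every training point is its own nearest neighbour, while the test error is lowest at an intermediate k and stays above the irreducible error sigma².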

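4. Now think of some real-life applications of statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response as well as the predictors. Is the goal of each application inference or prediction?

Cancer detection. prediction. response: cancer present or not. predictors: cancer data
Stock movement prediction. prediction. response: up/down. predictors: yesterday's stock price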

(b) Describe three real-life applications in which regression might be useful. Describe the response as well as the predictors. Is the goal of each application inference or prediction?

(1)

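5. What are the advantages and disadvantages of a very flexible approach for regression or classification?

Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?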

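flexible approach

→ Advantages: it can represent the data well and can reduce bias.

→ Disadvantages: it can overfit, and its variance can increase.

→ It has the advantage when a lot of data is available. (probably not quite the expected answer)

⇒ Useful when the goal is prediction and the results do not need to be explained.

less-flexible approach

→ Useful when the goal is inference and the results need to be explained.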