[Machine Learning] Session 4: Ensembles and Boosting (06.27)

코딩 아가 2025. 6. 27. 16:10

Transforming numeric variables

Log transform

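import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def log_transform(X, columns):
    """Add a log1p-transformed copy of each listed column and plot before/after."""
    X_transformed = X.copy()
    for col in columns:
        # log1p(x) = log(1 + x): defined at x = 0, requires x > -1
        X_transformed[f'{col}_log'] = np.log1p(X_transformed[col])

        # Visualize the original and transformed distributions
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        sns.histplot(X[col], ax=ax1)
        ax1.set_title(f'Original Distribution of {col}')
        sns.histplot(X_transformed[f'{col}_log'], ax=ax2)
        ax2.set_title(f'Log Transformed Distribution of {col}')
        plt.show()

    return X_transformed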

Box-Cox transform

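from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

def boxcox_transform(X, columns):
    """Add a Box-Cox-transformed copy of each listed column and plot before/after."""
    X_transformed = X.copy()
    for col in columns:
        # stats.boxcox requires strictly positive input (it raises ValueError otherwise);
        # it returns the transformed values and the fitted lambda
        X_transformed[f'{col}_boxcox'], lambda_param = stats.boxcox(X_transformed[col])

        # Visualize the original and transformed distributions
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        sns.histplot(X[col], ax=ax1)
        ax1.set_title(f'Original Distribution of {col}')
        sns.histplot(X_transformed[f'{col}_boxcox'], ax=ax2)
        ax2.set_title(f'Box-Cox Transformed Distribution of {col}')
        plt.show()

    return X_transformed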
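
Both transforms above are aimed at right-skewed columns. As a minimal sketch of the effect, the snippet below draws a synthetic log-normal sample (an assumption made purely for illustration, not data from the lecture) and compares the skewness before and after each transform, without plotting.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed, strictly positive sample (log-normal)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

x_log = np.log1p(x)               # log(1 + x)
x_boxcox, lam = stats.boxcox(x)   # lambda is fitted by maximum likelihood

skew_raw = stats.skew(x)
skew_log = stats.skew(x_log)
skew_boxcox = stats.skew(x_boxcox)

print(f"skew (raw):     {skew_raw:.3f}")
print(f"skew (log1p):   {skew_log:.3f}")
print(f"skew (Box-Cox): {skew_boxcox:.3f}  (fitted lambda = {lam:.3f})")
```

A skewness closer to 0 means a more symmetric distribution. Box-Cox only accepts strictly positive values, so a column containing zeros or negatives would need to be shifted first, or handled with log1p instead.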

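What is an ensemble model?

A method that combines the predictions of several models to get a better result than a single model

1. Bagging: random forest

Trains many decision trees independently
Each tree uses different data and features

2. Boosting: XGBoost, LightGBM

Each new model is trained to compensate for the errors of the previous ones
Widely used in practice because of its high predictive performance

Advantages

Stable, robust predictions
Lower risk of overfitting
Higher predictive performance

Disadvantages

Training and prediction take a lot of time and resources
Complex and hard to interpret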

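1. What is a decision tree?

At each node, picks the feature and split threshold that reduce impurity (Gini index or entropy) the most
(Pro): the decision process can be drawn as a tree, so it is easy to interpret
(Con): prone to overfitting, which is countered with pruning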
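
One way to try pruning is scikit-learn's cost-complexity pruning, controlled by the `ccp_alpha` parameter. The sketch below is only an illustration: the synthetic dataset from `make_classification` and the value `ccp_alpha=0.01` are assumptions chosen for the demo, not values from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy classification data (flip_y adds label noise)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Unpruned tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pruned tree: subtrees whose impurity gain is not worth their size are collapsed
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(f"{name:<9} leaves={tree.get_n_leaves():>4} "
          f"train acc={tree.score(X_train, y_train):.3f} "
          f"test acc={tree.score(X_test, y_test):.3f}")
```

A large gap between training and test accuracy for the unpruned tree is the overfitting symptom that pruning is meant to reduce. In practice `ccp_alpha` would be tuned, for example with cross-validation.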

What is impurity?

A measure of how mixed the classes are within a node

Single class: impurity = 0

Several classes mixed evenly: impurity = maximum
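
The two impurity measures mentioned above can be computed by hand. This is a small sketch with made-up label arrays; the helper names `gini`, `entropy`, and `impurity_decrease` are hypothetical and not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p))) + 0.0   # "+ 0.0" turns -0.0 into 0.0

def impurity_decrease(parent, left, right, measure=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * measure(left) + len(right) / n * measure(right)
    return measure(parent) - weighted

pure = [1, 1, 1, 1]    # single class
mixed = [0, 0, 1, 1]   # two classes, evenly mixed

print(f"pure : gini={gini(pure):.3f}, entropy={entropy(pure):.3f}")    # 0.000, 0.000
print(f"mixed: gini={gini(mixed):.3f}, entropy={entropy(mixed):.3f}")  # 0.500, 1.000

# A split that separates the classes perfectly removes all impurity
parent = [0, 0, 0, 1, 1, 1]
print(f"decrease: {impurity_decrease(parent, [0, 0, 0], [1, 1, 1]):.3f}")  # 0.500
```

For two classes, 0.5 is the maximum Gini impurity and 1 bit is the maximum entropy. A decision tree picks the split with the largest impurity decrease.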

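2. What is a random forest?

Each tree is trained on a randomly drawn sample of the data, considering only a random subset of the features
(Pros) low risk of overfitting, feature importances are easy to compute, robust to outliers, works on large datasets
(Cons) costly in computation and time, complex and hard to interpret, hyperparameter tuning is involved

How it works

Bootstrap sampling
- Sampling with replacement (duplicates allowed)
Random feature selection
- Only a subset of the features is considered at each split
Training the individual trees
- Each tree learns from different data and features
Ensemble prediction
- Classification: the trees' predictions are combined by voting; regression: the trees' predictions are averaged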
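
The four steps above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions (synthetic data from `make_classification`, 25 trees, binary labels 0/1), not how `RandomForestClassifier` is implemented internally; scikit-learn's forest, for instance, averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25   # odd, so a majority vote on binary labels cannot tie
trees = []

for i in range(n_trees):
    # 1) Bootstrap sampling: draw n row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2) Random feature selection: max_features limits the candidates at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    # 3) Train each tree on its own bootstrap sample
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# 4) Ensemble prediction: majority vote across the trees
votes = np.stack([t.predict(X_test) for t in trees])   # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

unique_share = len(np.unique(idx)) / len(X_train)
print(f"share of distinct rows in the last bootstrap sample: {unique_share:.3f}")
print(f"first tree accuracy: {trees[0].score(X_test, y_test):.3f}")
print(f"voted accuracy:      {(forest_pred == y_test).mean():.3f}")
```

Each bootstrap sample contains roughly 63% of the distinct training rows, which is one reason the trees differ from each other.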

Hands-on

1. Import libraries

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor    # for regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

2. Prepare the data

# Generate synthetic loan data
np.random.seed(42)  # seed for reproducibility
n_samples = 1000    # generate 1,000 samples

# Generate realistic loan features
data = {
    'income': np.random.normal(5000000, 2000000, n_samples),        # income in KRW (mean 5,000,000)
    'credit_score': np.random.normal(650, 100, n_samples),          # credit score (mean 650)
    'employment_years': np.random.normal(5, 3, n_samples),          # years employed (mean 5)
    'dti_ratio': np.random.normal(0.3, 0.1, n_samples),            # debt-to-income ratio (mean 30%)
    'ltv_ratio': np.random.normal(0.6, 0.2, n_samples),            # loan-to-value ratio (mean 60%)
    'age': np.random.normal(40, 10, n_samples)                      # age (mean 40)
}

df = pd.DataFrame(data)

# Clip the data to realistic ranges
df['credit_score'] = df['credit_score'].clip(300, 900)     # credit score between 300 and 900
df['employment_years'] = df['employment_years'].clip(0, 40) # years employed between 0 and 40
df['dti_ratio'] = df['dti_ratio'].clip(0, 1)              # DTI between 0 and 100%
df['ltv_ratio'] = df['ltv_ratio'].clip(0, 1)              # LTV between 0 and 100%
df['age'] = df['age'].clip(20, 80)                        # age between 20 and 80

3. Create the target variable (repaid or not) with a simple rule

def determine_repayment(row):
    score = 0
    # Add one point for each condition that is met
    score += 1 if row['credit_score'] > 650 else 0
    score += 1 if row['dti_ratio'] < 0.4 else 0
    score += 1 if row['employment_years'] > 2 else 0
    score += 1 if row['income'] > 3000000 else 0
    score += 1 if row['ltv_ratio'] < 0.7 else 0

    # Add a little randomness (realistic noise)
    score += np.random.normal(0, 0.5)

    return 1 if score > 2.5 else 0  # repaid if the score is above 2.5

df['repaid'] = df.apply(determine_repayment, axis=1)

4. Split the data

X = df.drop('repaid', axis=1)    # features (explanatory variables)
y = df['repaid']                 # target variable

# Split into training and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

5. Build the decision tree model

dt_model = DecisionTreeClassifier(
    max_depth=5,                # limit the maximum depth of the tree
    min_samples_split=50,       # minimum samples required to split a node
    min_samples_leaf=20,        # minimum samples in a leaf node
    random_state=42
)

# Train the decision tree model
dt_model.fit(X_train, y_train)

6. Build the random forest model

rf_model = RandomForestClassifier(
    n_estimators=100,           # number of trees
    max_depth=5,                # maximum depth of each tree
    min_samples_split=50,       # minimum samples required to split a node
    min_samples_leaf=20,        # minimum samples in a leaf node
    random_state=42
)

# Train the random forest model
rf_model.fit(X_train, y_train)

7. Predict with both models

dt_pred = dt_model.predict(X_test)
rf_pred = rf_model.predict(X_test)

8. Define an evaluation function

def evaluate_model(y_true, y_pred, model_name):
    print(f"\n{model_name} performance")
    print("=" * 50)

    # Print the accuracy
    accuracy = accuracy_score(y_true, y_pred)
    print(f"\nAccuracy: {accuracy:.4f}")

    # Print the classification report (precision, recall, F1 score, etc.)
    print("\nClassification report:")
    print(classification_report(y_true, y_pred))

    # Plot the confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix(y_true, y_pred),
                annot=True, fmt='d', cmap='Blues',
                xticklabels=['Not repaid', 'Repaid'],
                yticklabels=['Not repaid', 'Repaid'])
    plt.title(f'{model_name} confusion matrix')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

9. Evaluate both models

evaluate_model(y_test, dt_pred, "Decision tree")
evaluate_model(y_test, rf_pred, "Random forest")

10. Feature importance plotting function

def plot_feature_importance(model, model_name):
    # Uses the global feature DataFrame X for the column names
    importances = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(x='importance', y='feature', data=importances)
    plt.title(f'{model_name} feature importance')
    plt.show()

# Compare the feature importances of the two models
plot_feature_importance(dt_model, "Decision tree")
plot_feature_importance(rf_model, "Random forest")

11. Predict for a new customer

new_customer = pd.DataFrame({
    'income': [6000000],        # income of 6,000,000 KRW
    'credit_score': [750],      # credit score of 750
    'employment_years': [5],    # 5 years employed
    'dti_ratio': [0.3],        # DTI 30%
    'ltv_ratio': [0.6],        # LTV 60%
    'age': [35]                # 35 years old
})

print("\nPrediction for the new customer")
print("=" * 50)
print("Decision tree prediction:", "Repaid" if dt_model.predict(new_customer)[0] == 1 else "Not repaid")
print("Random forest prediction:", "Repaid" if rf_model.predict(new_customer)[0] == 1 else "Not repaid")

# Predicted probabilities from the random forest
rf_proba = rf_model.predict_proba(new_customer)[0]
print(f"Random forest predicted probability: not repaid {rf_proba[0]:.1%}, repaid {rf_proba[1]:.1%}")
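
To see where the random forest's probabilities come from, the sketch below averages the per-tree class probabilities by hand and compares the result with `predict_proba`. It uses a small synthetic dataset from `make_classification` as a stand-in (an assumption for the demo) rather than the loan data above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42).fit(X, y)

sample = X[:3]
forest_proba = rf.predict_proba(sample)

# Average the class probabilities predicted by the individual trees
tree_proba = np.mean([tree.predict_proba(sample) for tree in rf.estimators_], axis=0)

print(np.round(forest_proba, 3))
print(np.round(tree_proba, 3))
```

The two arrays should agree, which shows that the forest's probability is the mean of its trees' probabilities.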

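What is boosting?

A method that trains weak learners one after another and combines them into a strong learner
Weak learner (simple): a model that decides based on a single simple criterion
Strong learner (composite): a model that weighs several conditions together
Core principle: sequential learning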
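
The sequential idea can be sketched with depth-1 trees (stumps) as weak learners, where each new stump is fitted to what the ensemble still gets wrong. This is a minimal squared-error illustration on a made-up sine-shaped target. The learning rate, number of rounds, and data are all assumptions, and library implementations add many refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Hypothetical toy regression problem: noisy sine curve
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
n_rounds = 100

prediction = np.full_like(y, y.mean())           # round 0: predict the mean
mse_history = [np.mean((y - prediction) ** 2)]
stumps = []

for _ in range(n_rounds):
    residual = y - prediction                    # errors of the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1, random_state=0)  # weak learner: one split
    stump.fit(X, residual)                       # learn to correct those errors
    prediction += learning_rate * stump.predict(X)
    stumps.append(stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE after   0 rounds: {mse_history[0]:.4f}")
print(f"training MSE after  10 rounds: {mse_history[10]:.4f}")
print(f"training MSE after 100 rounds: {mse_history[-1]:.4f}")
```

A single stump can only produce two distinct output values, yet the sum of many stumps traces the curve. The training error never increases from round to round here, which is also why boosting needs limits such as a learning rate or a cap on rounds to avoid overfitting.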

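Model types

AdaBoost (Adaptive Boosting): weight updates
- Concentrates training on the samples that were misclassified
- Combines the models, giving higher weight to those that performed better
Gradient Boosting Machine (GBM)
- Applies the idea of gradient descent (moving along the negative gradient to find a minimum)
- Captures nonlinear relationships well
XGBoost (eXtreme Gradient Boosting)
- Adds regularization terms to prevent overfitting (like ridge and lasso)
- Speeds up training through parallel processing
- Controls complexity through pruning
- Handles missing values automatically
- Uses second-order derivatives to optimize the model in a more accurate direction
LightGBM (Light GBM)
- Used when there is a lot of data; tends to overfit when there is little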
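
The AdaBoost weight update from the list above can be worked through for a single round. The sketch follows the classic discrete AdaBoost formulation with labels in {-1, +1}. The labels and predictions are made up for illustration, and scikit-learn's implementation differs in its details.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions
y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])   # positions 4 and 7 are wrong

weights = np.full(len(y_true), 1 / len(y_true))   # start with uniform sample weights

miss = y_pred != y_true
error = weights[miss].sum()                       # weighted error rate: 0.25
alpha = 0.5 * np.log((1 - error) / error)         # this learner's weight in the final vote

# Raise the weights of mistakes, lower the weights of correct samples, renormalize
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()

print(f"error = {error:.2f}, alpha = {alpha:.4f}")
print("weights of misclassified samples:", np.round(weights[miss], 4))    # 0.25 each
print("weights of correct samples:      ", np.round(weights[~miss], 4))   # about 0.0833 each
```

After the update, the two misclassified samples carry half of the total weight, so the next weak learner is pushed to get them right. A learner with a lower error receives a larger alpha and therefore more say in the combined prediction.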

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
# lightgbm is optional and may not be installed, so it is not imported here

# Generate the data
np.random.seed(42)
n_samples = 1000
data = {
    'income': np.random.normal(5000000, 2000000, n_samples),
    'credit_score': np.random.normal(650, 100, n_samples),
    'employment_years': np.random.normal(5, 3, n_samples),
    'dti_ratio': np.random.normal(0.3, 0.1, n_samples),
    'ltv_ratio': np.random.normal(0.6, 0.2, n_samples),
    'age': np.random.normal(40, 10, n_samples)
}
df = pd.DataFrame(data)
df['credit_score'] = df['credit_score'].clip(300, 900)
df['employment_years'] = df['employment_years'].clip(0, 40)
df['dti_ratio'] = df['dti_ratio'].clip(0, 1)
df['ltv_ratio'] = df['ltv_ratio'].clip(0, 1)
df['age'] = df['age'].clip(20, 80)

def determine_repayment_score(row):
    score = 0
    score += 1 if row['credit_score'] > 650 else 0
    score += 1 if row['dti_ratio'] < 0.4 else 0
    score += 1 if row['employment_years'] > 2 else 0
    score += 1 if row['income'] > 3000000 else 0
    score += 1 if row['ltv_ratio'] < 0.7 else 0
    score += np.random.normal(0, 0.5)
    return np.clip(score / 5, 0, 1)

df['repayment_prob'] = df.apply(determine_repayment_score, axis=1)
X = df.drop('repayment_prob', axis=1)
y = df['repayment_prob']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

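# Define the regression models
models = {
    # 🚀 AdaBoost regression
    'AdaBoost': AdaBoostRegressor(
        n_estimators=100,     # number of weak learners (default: decision trees)
        learning_rate=0.1,    # learning rate: lower values learn more slowly
        random_state=42
    ),

    # 📈 Gradient Boosting regression (GBM)
    'Gradient Boosting': GradientBoostingRegressor(
        n_estimators=100,     # number of trees (more can improve fit; watch for overfitting)
        learning_rate=0.1,    # learning rate
        max_depth=3,          # maximum depth of each tree (shallower generalizes better)
        min_samples_split=5,  # minimum samples required to split a node
        random_state=42
    ),

    # 💥 XGBoost regression (regularized, fast, strong performance)
    'XGBoost': xgb.XGBRegressor(
        n_estimators=100,         # number of trees
        learning_rate=0.1,        # learning rate
        max_depth=3,              # tree depth
        min_child_weight=1,       # minimum sum of instance weights in a child (smaller: more overfitting)
        subsample=0.8,            # share of samples used per tree (curbs overfitting)
        colsample_bytree=0.8,     # share of features used per tree
        random_state=42
    ),
}

# ⚡ LightGBM regression (very fast, suited to large data)
# lightgbm is optional: the model is added only when the package is installed
try:
    import lightgbm as lgb
    models['LightGBM'] = lgb.LGBMRegressor(
        n_estimators=100,         # number of trees
        learning_rate=0.1,        # learning rate
        max_depth=3,              # maximum tree depth
        num_leaves=31,            # max leaves per tree (too many overfits; depth 3 allows at most 8)
        subsample=0.8,            # row sampling ratio (takes effect only when subsample_freq > 0)
        colsample_bytree=0.8,     # feature sampling ratio
        random_state=42
    )
except ImportError:
    print("lightgbm is not installed; skipping LightGBM")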

# Train and evaluate the models
print("\n📊 Regression model comparison (RMSE, R²)")
print("="*50)
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"{name:<20} | RMSE: {rmse:.4f} | R²: {r2:.4f}")

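Model selection guide

When to choose AdaBoost:

The dataset is relatively small and has little noise
You need to explain clearly how the model works
Especially effective for binary classification

When to choose GBM:

Predictive performance is the most important consideration
The data is strongly nonlinear
You can afford enough training time

When to choose XGBoost:

You are working with a large dataset
The data has many missing values
Preventing overfitting matters
You need both high predictive performance and reasonable training speed

When to choose LightGBM:

You are working with a very large dataset
Fast training is essential
Memory is limited
Note: if the dataset is too small there is a risk of overfitting, so take care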
