[Key Machine Learning Techniques] Session 2 (07.01)

코딩 아가 2025. 7. 1. 20:48

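What is a Decision Tree?

An algorithm that represents a decision-making process as a tree structure.

It searches for the split points that minimize impurity.

[Characteristics]

Interpretability: the decision process is easy to explain
Non-linear relationships: can capture complex patterns
Feature selection: automatically picks out the important variables
Data preprocessing: no scaling or normalization required

[How it works]

Step 1: Measure impurity (Gini impurity)

Measures how mixed the classes in the data are
Lower is better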
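
As a rough illustration of Step 1, Gini impurity can be computed as 1 minus the sum of squared class proportions. The helper name `gini_impurity` and the toy label lists below are made up for this sketch and are not part of the lecture code.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A pure node (one class only) versus a perfectly mixed two-class node
pure = gini_impurity([1, 1, 1, 1])
mixed = gini_impurity([0, 0, 1, 1])
print(pure, mixed)  # 0.0 0.5
```

A node containing a single class scores 0, and a 50:50 two-class node scores 0.5, the maximum for two classes.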

Step 2: Compute information gain

Measures how much the impurity decreases from before a split to after it
Higher is better
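
One way to sketch Step 2: the gain of a split is the parent's impurity minus the size-weighted impurity of the two children. Strictly speaking, "information gain" usually refers to the entropy-based version; with Gini impurity the same quantity is often called the impurity decrease. The function names and toy labels here are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])  # children are pure
poor = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])     # children stay mixed
print(f"perfect split: {perfect:.3f}, poor split: {poor:.3f}")
```

The split that separates the classes completely recovers the full parent impurity as gain, while the split that leaves both children mixed gains very little.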

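Step 3: Build the tree (greedy algorithm)

Find the best information gain: among all possible splits, choose the one with the highest information gain
Recursive splitting: repeat from Step 1 at each child node
Stopping condition: continue until no further split is possible or a stopping criterion (e.g. a maximum depth) is met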
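
The greedy search in Step 3 can be sketched for a single numeric feature: try every midpoint between consecutive distinct values as a threshold and keep the one with the largest impurity decrease. This is a simplified sketch, not scikit-learn's actual implementation; `best_split` and the toy `ages`/`survived` data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, gain) of the best binary split `value <= threshold`."""
    n = len(labels)
    parent_impurity = gini(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        gain = parent_impurity - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy data: the two youngest passengers survived
ages = [5, 8, 30, 40, 50, 60]
survived = [1, 1, 0, 0, 0, 0]
threshold, gain = best_split(ages, survived)
print(f"best threshold: {threshold}, gain: {gain:.3f}")
```

A full tree would call the same search on every feature, split on the overall winner, and then recurse into the left and right subsets until a stopping condition is hit.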

[Code]

Step 1: Import libraries

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Korean font setup (only needed if plot labels contain Korean text)
plt.rcParams['font.family'] = 'Malgun Gothic'  # Windows
# plt.rcParams['font.family'] = 'AppleGothic'  # Mac

Step 2: Load the Titanic data and check basic information

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Basic information
print("=== Titanic dataset: basic information ===")
print(f"Total rows: {len(titanic)}")
print(f"Survivors: {titanic['survived'].sum()}")
print(f"Survival rate: {titanic['survived'].mean():.3f}")

# Data structure
print(f"\nShape: {titanic.shape}")
print(f"Column dtypes:\n{titanic.dtypes}")

Step 3: Data preprocessing

# Select the features to use in the analysis
features_to_use = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
titanic_clean = titanic[features_to_use + ['survived']].copy()

print("=== Missing values ===")
print(titanic_clean.isnull().sum())

# Handle missing values: median for age, most frequent value for embarked
titanic_clean['age'] = titanic_clean['age'].fillna(titanic_clean['age'].median())
titanic_clean['embarked'] = titanic_clean['embarked'].fillna(titanic_clean['embarked'].mode()[0])

print("=== After handling missing values ===")
print(titanic_clean.isnull().sum())

Step 4: Encode categorical variables

# Convert sex to a number
titanic_clean['sex'] = titanic_clean['sex'].map({'male': 0, 'female': 1})

# One-hot encode the port of embarkation (drop_first=True drops the first category)
titanic_clean = pd.get_dummies(titanic_clean, columns=['embarked'], prefix='embarked', drop_first=True)

print("=== Columns after encoding ===")
print(list(titanic_clean.columns))
print(f"Final number of features: {len(titanic_clean.columns) - 1}")  # excluding survived

Step 5: Separate the independent and dependent variables

# Separate X (independent variables) and y (dependent variable)
X = titanic_clean.drop('survived', axis=1)
y = titanic_clean['survived']

print("=== Variables separated ===")
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
print(f"Features: {list(X.columns)}")

Step 6: Train/test split

# Split the data 80:20
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # use 20% for testing
    random_state=42,    # reproducible results
    stratify=y          # preserve the class ratio
)

print("=== Data split complete ===")
print(f"Training data: {X_train.shape[0]} samples")
print(f"Test data: {X_test.shape[0]} samples")
print(f"Survival rate (training): {y_train.mean():.3f}")
print(f"Survival rate (test): {y_test.mean():.3f}")

Step 7: Create the decision tree model

# Create the decision tree model
dt_model = DecisionTreeClassifier(
    criterion='gini',        # split criterion: Gini impurity
    max_depth=5,             # limit the maximum depth (prevents overfitting)
    min_samples_split=20,    # minimum samples required to split a node
    min_samples_leaf=10,     # minimum samples required in a leaf node
    random_state=42          # reproducible results
)

print("=== Decision tree model created ===")
print(f"Split criterion: {dt_model.criterion}")
print(f"Max depth: {dt_model.max_depth}")
print(f"Min samples to split: {dt_model.min_samples_split}")

Step 8: Train the model

# Train the model
print("=== Training started ===")
dt_model.fit(X_train, y_train)
print("Training complete!")

# Inspect the fitted tree
print("\n=== Fitted tree information ===")
print(f"Actual tree depth: {dt_model.get_depth()}")
print(f"Number of leaf nodes: {dt_model.get_n_leaves()}")
print(f"Total number of nodes: {dt_model.tree_.node_count}")

Step 9: Prediction and performance evaluation

# Predict on the training and test data
y_pred_train = dt_model.predict(X_train)
y_pred_test = dt_model.predict(X_test)

# Compute accuracy
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("=== Decision tree performance ===")
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")

Step 10: Detailed classification performance analysis

# Detailed classification report
print("\n=== Classification report (test data) ===")
report = classification_report(y_test, y_pred_test, target_names=['Died', 'Survived'], output_dict=True)
print(classification_report(y_test, y_pred_test, target_names=['Died', 'Survived']))

# Interpret the key metrics
print("\n=== Interpreting the metrics ===")
precision_survived = report['Survived']['precision']
recall_survived = report['Survived']['recall']
f1_survived = report['Survived']['f1-score']

print(f"Precision (survived): {precision_survived:.3f} → {precision_survived:.1%} of predicted survivors actually survived")
print(f"Recall (survived): {recall_survived:.3f} → {recall_survived:.1%} of actual survivors were predicted correctly")
print(f"F1-score (survived): {f1_survived:.3f} → harmonic mean of precision and recall")

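Visualizing the tree

# Tree visualization (simple version)
plt.figure(figsize=(20, 12))
plot_tree(dt_model,
          feature_names=list(X.columns),  # a plain list is accepted across scikit-learn versions
          class_names=['Died', 'Survived'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision tree for Titanic survival prediction', fontsize=16)
plt.show()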

Checking feature importance

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)

print("=== Feature importance ranking ===")
print(feature_importance)

# Plot the feature importances
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
plt.title('Decision tree feature importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

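Model optimization with GridSearch

A method that systematically tests many hyperparameter combinations to find the best model.

Instead of testing combinations by hand one at a time, the best combination is found in an automated way.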

from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [10, 20, 50],
    'min_samples_leaf': [5, 10, 20],
    'criterion': ['gini', 'entropy']
}

print("=== GridSearch hyperparameter optimization ===")
print(f"Combinations to test: {len(param_grid['max_depth']) * len(param_grid['min_samples_split']) * len(param_grid['min_samples_leaf']) * len(param_grid['criterion'])}")

# Configure GridSearch
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy', # evaluation metric
    n_jobs=-1,          # use all CPU cores
    verbose=1           # print progress
)

# Run GridSearch
grid_search.fit(X_train, y_train)

# Best model
print("\n=== Best hyperparameters ===")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

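What is k-fold cross-validation?

A method that splits the data into k folds and repeats training and validation k times, each time holding out a different fold for validation.

[Why it is needed] = Problems with the hold-out method (splitting into train/test sets only once)

Split-dependent bias: model performance can vary widely depending on how the data happened to be split (over- or under-estimated)
Under-used data: with a small dataset, the test portion is never used for training, so the model may not learn enough
Unreliable overfitting check: with a single split, even an overfitted model can happen to score well on the test set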
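
To make the mechanics concrete, here is a minimal pure-Python sketch of how k folds partition the sample indices. The helper `k_fold_indices` is hypothetical and does no shuffling or stratification; in practice scikit-learn's `KFold` or `StratifiedKFold` would normally be used (for a classifier, `GridSearchCV(cv=5)` uses stratified folds).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for fold_no, (train, validation) in enumerate(folds, 1):
    print(f"Fold {fold_no}: validate on {validation}, train on {len(train)} samples")
```

Every sample is used for validation exactly once and for training k-1 times, which is what removes the dependence on one lucky or unlucky split.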

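What is a Random Forest? (an ensemble algorithm)

Trains many decision trees independently and combines their predictions to reach the final decision
Key idea for the bias-variance trade-off: averaging many weakly correlated trees lowers variance while keeping bias roughly unchanged

Parallel training: the trees do not depend on each other, so they can be trained in parallel
Bagging: bootstrap sampling generates diverse training sets
Feature randomness: only a subset of the features is considered at each split
Voting: the predictions of the individual trees are combined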

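[How it works]

Step 1: Bootstrap sampling

Create new datasets by sampling from the original data with replacement
Each tree learns different patterns
Helps prevent overfitting and improves generalization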
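
A small sketch of bootstrap sampling, using only the standard library. The 1,000 row indices are a stand-in for a training set, and the seed is arbitrary.

```python
import random

rng = random.Random(42)
n_samples = 1000
indices = list(range(n_samples))  # stand-in for the row indices of a training set

# Draw n_samples indices WITH replacement: some rows repeat, others are never drawn
bootstrap = rng.choices(indices, k=n_samples)
unique_fraction = len(set(bootstrap)) / n_samples
out_of_bag = set(indices) - set(bootstrap)

print(f"Unique rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Out-of-bag rows (never drawn): {len(out_of_bag)}")
```

On average about 63% of the rows (1 - 1/e) appear in a given bootstrap sample, so each tree sees a noticeably different dataset, and the rows left out can serve as a built-in validation set for that tree.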

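Step 2: Feature randomness (Random Feature Selection)

At each node split, only a random subset of all features is considered
Each split sees a different combination of features
Lower correlation between trees maximizes the ensemble effect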
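
The following sketch imitates the feature subsampling with the square-root rule (the rule selected by `max_features='sqrt'`). The feature list assumes the columns produced by the preprocessing in this post; the seed and the three simulated splits are arbitrary.

```python
import math
import random

# Assumed column names after the one-hot encoding step of this post
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked_Q', 'embarked_S']

rng = random.Random(0)
# Square-root rule: with 8 features, 2 candidates are considered per split
n_candidates = max(1, int(math.sqrt(len(features))))

for split_no in range(1, 4):
    candidates = rng.sample(features, n_candidates)  # sampled without replacement
    print(f"Split {split_no}: only {candidates} are considered")
```

Because a dominant feature such as `sex` is not available at every split, different trees are pushed to rely on different features, which is what reduces the correlation between them.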

Step 3: Voting (Voting Mechanism)

The final prediction of the ensemble (combining the predictions of the individual trees)
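
A minimal sketch of hard majority voting. The predictions of the five trees below are invented for illustration. Note that scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities and picks the class with the highest mean, which often but not always matches a hard vote.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for three passengers (1 = survived)
tree_predictions = [
    [1, 0, 1],  # tree 1
    [1, 0, 0],  # tree 2
    [0, 0, 1],  # tree 3
    [1, 1, 1],  # tree 4
    [1, 0, 0],  # tree 5
]

# zip(*...) regroups the predictions per passenger
final = [majority_vote(votes) for votes in zip(*tree_predictions)]
print(final)  # [1, 0, 1]
```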

[Code] Using scikit-learn

Step 1: Create the random forest model

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create the random forest model
rf_model = RandomForestClassifier(
    n_estimators=100,        # number of trees
    max_depth=10,            # maximum depth
    min_samples_split=20,    # minimum samples required to split a node
    min_samples_leaf=10,     # minimum samples required in a leaf node
    max_features='sqrt',     # feature randomness
    bootstrap=True,          # use bootstrap sampling
    random_state=42,
)

print("=== Random forest model settings ===")
print(f"Number of trees: {rf_model.n_estimators}")
print(f"Max depth: {rf_model.max_depth}")
print(f"Feature selection: {rf_model.max_features}")
print(f"Bootstrap: {rf_model.bootstrap}")

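Step 2: Train the model

# Start training
print("=== Random forest training started ===")
# The trees are independent, so passing n_jobs=-1 to the constructor would train them in parallel
print(f"Training {rf_model.n_estimators} trees...")

# Measure the training time
import time
start_time = time.time()

rf_model.fit(X_train, y_train)

end_time = time.time()
training_time = end_time - start_time

print(f"Training complete! Elapsed time: {training_time:.2f} s")
print(f"Number of trained trees: {len(rf_model.estimators_)}")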

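Step 3: Basic performance evaluation

# Predict on the training and test data
rf_train_pred = rf_model.predict(X_train)
rf_test_pred = rf_model.predict(X_test)

# Compute accuracy
rf_train_acc = accuracy_score(y_train, rf_train_pred)
rf_test_acc = accuracy_score(y_test, rf_test_pred)
rf_overfit = rf_train_acc - rf_test_acc  # train-test gap; a large gap suggests overfitting

print(f"Training accuracy: {rf_train_acc:.3f}")
print(f"Test accuracy: {rf_test_acc:.3f}")
print(f"Train-test gap: {rf_overfit:.3f}")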

Checking feature importance

# Extract the feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Random forest feature importance ranking:")
print("-" * 30)
for _, row in feature_importance.iterrows():
    print(f"{row['feature']:15}: {row['importance']:.3f}")

# Highlight the top 5 features
print("\n🏆 TOP 5 features:")
top5_features = feature_importance.head(5)
for i, (_, row) in enumerate(top5_features.iterrows(), 1):
    print(f"{i}. {row['feature']} ({row['importance']:.3f})")

Visualizing a tree (first tree only)

first_tree = rf_model.estimators_[0]  # take the first tree of the forest

# Plot the first tree
plt.figure(figsize=(20, 12))
plot_tree(first_tree,
          feature_names=list(X.columns),  # a plain list is accepted across scikit-learn versions
          class_names=['Died', 'Survived'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('First decision tree in the random forest', fontsize=16)
plt.show()

# Print the structure of the first tree as text
print("\n=== Structure of the first decision tree in the random forest (text) ===")
tree_rules = export_text(first_tree, feature_names=list(X.columns))
print(tree_rules[:1000])  # print only the first 1000 characters

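What is XGBoost? (a next-generation boosting algorithm)

A heavily optimized implementation of gradient boosting (eXtreme Gradient Boosting)

Sequential learning: each new model compensates for the errors of the previous ones
Regularization: L1 and L2 regularization help prevent overfitting
Parallel processing: split finding is parallelized across CPU threads, and GPU training is supported for faster learning
Missing-value handling: automatically learns the best direction to send missing values at each split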

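[How it works]

Step 1: Build a base tree (weak classifier)

Captures only the rough patterns and leaves many mistakes

Step 2: Analyze the errors of the first model

Find the error patterns
Weighting mechanism: samples with larger errors get more influence on the next model (in gradient boosting this happens through the gradient of the loss, i.e. the residual)

Step 3: Build the second weak classifier (focused on fixing the mistakes)

Train the second model to compensate for the weaknesses of the first

Step 4: Build the third weak classifier (fixing the remaining mistakes)

Step 5: Final prediction

Combine the wisdom of all models (the contributions of all trees are summed)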
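
The five steps above can be sketched with a bare-bones gradient boosting loop. This is an illustration under simplifying assumptions, not XGBoost itself: it uses a regression toy problem with squared error (where the negative gradient is simply the residual), depth-1 trees written by hand, and none of XGBoost's second-order gradients or regularization. All names and data here are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (one threshold, two leaf means) by least squares."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy regression data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)  # round 0: predict the mean everywhere
history = [mse(ys, preds)]

for round_no in range(1, 6):
    residuals = [y - p for y, p in zip(ys, preds)]  # what the ensemble still gets wrong
    stump = fit_stump(xs, residuals)                 # next weak learner targets those errors
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print("Training MSE per round:", [round(m, 3) for m in history])
```

Each round adds a small correction on top of the running prediction, so the training error shrinks step by step; the learning rate controls how much of each correction is applied.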

[Code]

Step 1: Import the XGBoost library

# Install and import XGBoost (when installing from a Jupyter notebook)
# !pip install xgboost

import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score
import time

print("=== XGBoost library setup ===")
print(f"XGBoost version: {xgb.__version__}")
print("✅ XGBoost library loaded!")

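Step 2: Create the XGBoost model and set hyperparameters

xgb_model = XGBClassifier(
    n_estimators=100,        # number of boosting rounds (trees)
    max_depth=6,             # maximum tree depth
    learning_rate=0.1,       # learning rate (contribution of each tree)
    subsample=0.8,           # row subsampling ratio
    colsample_bytree=0.8,    # feature subsampling ratio per tree
    random_state=42,
    eval_metric='logloss',   # evaluation metric
    # Early stopping: stop when the validation logloss has not improved for 10 rounds.
    # Constructor argument in xgboost >= 1.6; the value 10 is an assumption, not from the lecture.
    early_stopping_rounds=10,
)

print("🎯 XGBoost model configured!")
print(f"   Boosting rounds: {xgb_model.n_estimators}")
print(f"   Learning rate: {xgb_model.learning_rate}")
print(f"   Max depth: {xgb_model.max_depth}")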

Step 3: Prepare a validation set for early stopping

# Split off a validation set for early stopping
print("\n=== Preparing a validation set for early stopping ===")

X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train,
    test_size=0.2,           # use 20% of the training data for validation
    random_state=42,
    stratify=y_train         # preserve the class ratio
)

print("📊 Data split:")
print(f"   Training: {len(X_train_split)} samples")
print(f"   Validation: {len(X_val_split)} samples")
print(f"   Test: {len(X_test)} samples")

# Check the survival rate of each set
train_survival_rate = y_train_split.mean()
val_survival_rate = y_val_split.mean()
test_survival_rate = y_test.mean()

print("\n📈 Survival rate per set:")
print(f"   Training set: {train_survival_rate:.3f}")
print(f"   Validation set: {val_survival_rate:.3f}")
print(f"   Test set: {test_survival_rate:.3f}")
print("✅ Stratified splitting keeps the survival rate similar across all sets")

Step 4: Train the XGBoost model

# Measure the training time
start_time = time.time()

# Train
xgb_model.fit(
    X_train_split, y_train_split,
    eval_set=[(X_val_split, y_val_split)],  # validation set monitored during training
)

end_time = time.time()
training_time = end_time - start_time

print("✅ XGBoost training complete!")
print(f"   Training time: {training_time:.2f} s")

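Step 5: Basic XGBoost performance evaluation

# Evaluate on the rows actually used for fitting (X_train also contains the validation rows)
xgb_train_pred = xgb_model.predict(X_train_split)
xgb_test_pred = xgb_model.predict(X_test)

# Compute accuracy
xgb_train_acc = accuracy_score(y_train_split, xgb_train_pred)
xgb_test_acc = accuracy_score(y_test, xgb_test_pred)
xgb_overfit = xgb_train_acc - xgb_test_acc  # train-test gap; a large gap suggests overfitting

print("🎯 XGBoost performance:")
print(f"   Training accuracy: {xgb_train_acc:.3f}")
print(f"   Test accuracy: {xgb_test_acc:.3f}")
print(f"   Train-test gap: {xgb_overfit:.3f}")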

Step 6: Feature importance

feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n\nXGBoost feature importance ranking:")
print("-" * 30)
for _, row in feature_importance.iterrows():
    print(f"{row['feature']:15}: {row['importance']:.3f}")

# Highlight the top 5 features
print("\n🏆 TOP 5 features:")
top5_features = feature_importance.head(5)
for i, (_, row) in enumerate(top5_features.iterrows(), 1):
    print(f"{i}. {row['feature']} ({row['importance']:.3f})")
