🌳 What Is It?
A Decision Tree is like playing "20 Questions" with your data. It makes decisions using a flowchart of if-then rules, where each node asks a question about a feature.
Mental Model: Imagine you're trying to guess which animal someone is thinking of. You'd ask questions like "Does it have fur?" or "Can it fly?" — each answer narrows down the possibilities until you reach the answer!
🔢 The Math Behind It
1. Entropy (Disorder Measure)
H(S) = -Σ pi × log₂(pi)
What it means: How mixed up is this group?
- H = 0 → Perfectly pure (all same class) 🎯
- H = 1 → Maximum chaos (equal split) 🌪️
2. Information Gain
IG = H(parent) - weighted_avg[H(children)]
What it means: How much uncertainty did this split remove?
The algorithm picks the split with the highest information gain!
3. Gini Impurity (Alternative)
Gini = 1 - Σ pi²
What it means: Probability of incorrect classification
- ✅ Faster to compute than entropy
- ✅ Similar results in practice
🎯 Key Concepts
1. Tree Depth vs Overfitting
Shallow Trees 🌱
High bias, low variance
Risk: Underfit (too simple)
Deep Trees 🌲
Low bias, high variance
Risk: Overfit (memorizes noise)
Sweet spot: Use cross-validation to find optimal max_depth
2. Pruning Strategies
- Pre-pruning: Stop early (max_depth, min_samples_split)
- Post-pruning: Build full tree, then cut back weak branches
- Goal: Reduce overfitting, improve generalization
3. Feature Importance
Measures how much each feature reduces impurity across all splits
Great for feature selection — tells you which features actually matter!
📊 When to Use Decision Trees
✅ Great For:
- Interpretability: Easy to explain decisions to non-technical folks
- Mixed data types: Handles numerical + categorical features
- Non-linear relationships: Captures complex patterns automatically
- Feature interactions: Finds interactions without manual engineering
- Quick prototyping: Fast to train and test
❌ Not Ideal For:
- High-dimensional sparse data (text embeddings, images)
- Extrapolation (can't predict outside training range)
- Stable predictions (small data changes = big tree changes)
🆚 Comparison with Other Algorithms
vs Linear Regression
✅ Handles non-linearity
✅ No feature scaling needed
vs k-NN
✅ Faster prediction (O(log n))
✅ More interpretable
🛠️ Hyperparameters to Tune
max_depth: Maximum tree depth (controls complexity)
min_samples_split: Minimum samples to split node
min_samples_leaf: Minimum samples at leaf
max_features: Features to consider per split
criterion: "gini" or "entropy"
Pro tip: Start with max_depth=3, then gradually increase until validation performance plateaus
🎓 Checkpoint Questions
Question 1: What is entropy?
Think before you peek at the answer below...
💡 Answer
Entropy is a measure of disorder/uncertainty in a dataset. 0 = perfectly pure (all same class), higher values = mixed classes. The algorithm uses entropy to decide which splits reduce uncertainty the most.
Question 2: How does a decision tree choose which feature to split on?
💡 Answer
The tree evaluates ALL possible splits, calculates information gain (reduction in entropy/Gini impurity) for each, and picks the split with the highest gain. This maximizes how much we learn from each question.
Question 3: Why do deep trees overfit?
💡 Answer
Deep trees memorize training data noise by creating too-specific rules. They perform great on training data but fail on new data. Pruning (limiting depth) keeps rules general and improves generalization.
🚀 Next Steps
After mastering Decision Trees, you'll learn:
- Algorithm #4: Random Forest — Ensemble of decision trees
- Algorithm #8: Gradient Boosting — Sequential trees correcting errors
Practice suggestion: Try the Kaggle Titanic dataset — a classic decision tree problem!
✨ Phase 1: Supervised Learning — Algorithm 3 of 7 ✨