Breaking Down Decision Trees: A Simple Guide to Predicting the Future
- Pranjali Srivastava
- Aug 30
- 4 min read
Decision trees are versatile supervised learning algorithms used for both classification and regression tasks. They belong to the family of information-based learning methods, which rely on measures of information gain to guide their learning process. Decision trees can handle both continuous and categorical input and target features, making them suitable for a wide range of problems.
The core idea of decision trees is to identify the features that provide the most “information” about the target variable. By splitting the dataset based on these features, the algorithm aims to create subsets where the target variable is as pure as possible. The feature that results in the highest purity of the target variable is considered the most informative. This process of selecting the most informative feature and splitting the data continues iteratively until a predefined stopping criterion is met, at which point the process concludes at the leaf nodes.
The leaf nodes are where the decision tree makes predictions for new data. This works because the tree has learned patterns from the training data and can use those patterns to guess the target value (or class) for new, unseen data. A decision tree is made up of a root node (where the decision-making starts), interior nodes (where decisions are made based on features), and leaf nodes (which hold the final predictions). These nodes are connected by branches, showing the path from the root to the leaves.

The diagram might look simple and straightforward, but here is the challenge: how do you decide which feature or variable should become the root node and what comes next as the interior nodes?
In real-world scenarios, datasets often contain numerous variables, and figuring out the best way to split the data is the key to building an effective decision tree. The true power of a decision tree lies in the technique it uses to determine these splits. Only by using the right approach can you train the algorithm to make predictions with maximum accuracy.
The examples below will help you zoom out and appreciate the intricacies of real-world data:
House price prediction: based on features like location, number of bedrooms, and square footage
Stock market analysis: based on market factors like company revenue, global news sentiment, and interest rates
Weather forecasting: based on factors like humidity, wind speed, and cloud cover
E-commerce revenue prediction: based on features like website visit duration, number of items viewed, and cart add/removal rate
The Attribute Selection Measure (ASM) comes into play to solve the problem of selecting the best attribute for the root node and sub-nodes. Two commonly used measures are information gain and entropy.
When building a regression decision tree for predicting house prices, we use ASM to decide the best splits in the data. Let’s break this down step by step with an example and illustrations.
Step 1: The Problem — Predicting House Prices
Imagine you have a dataset of houses with features like:
Location (City A, City B, City C)
Number of Bedrooms
Square Footage
The target variable is the price of the house. Your goal is to predict house prices based on these features.
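To make the setup concrete, here is a minimal sketch of such a dataset in pandas. The locations and prices mirror the example used in the next steps; the bedroom and square-footage values are purely illustrative assumptions.
import pandas as pd

toy_houses = pd.DataFrame({
    'location': ['City A', 'City A', 'City B', 'City B'],
    'bedrooms': [2, 3, 3, 4],                     # illustrative values
    'square_footage': [1000, 1200, 1400, 1600],   # illustrative values
    'price': [300000, 350000, 400000, 450000]     # target variable
})
print(toy_houses)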
Step 2: Understanding Entropy
Entropy, in a regression context, refers to the variance of the target variable (house prices) in a dataset. It can be thought of as the degree of randomness in the data. High entropy means the prices are spread out and unpredictable; low entropy means the prices are tightly clustered and more predictable.
Example: Initial Entropy
Consider this small dataset of four houses:

Location | Price
City A | $300,000
City A | $350,000
City B | $400,000
City B | $450,000
Variance in Prices (Entropy):
The prices range from $300,000 to $450,000, so there’s a lot of spread. This is high entropy.
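As a quick sketch with NumPy (using population variance), the initial "entropy" here is just the variance of these four prices:
import numpy as np

prices = np.array([300000, 350000, 400000, 450000])
print(np.var(prices))  # 3,125,000,000: a large variance, i.e. high entropy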
Step 3: Splitting to Reduce Entropy
To make better predictions, we split the dataset based on a feature that reduces entropy the most. This is where information gain comes in.
Information Gain Formula:
IG = Variance before split − Weighted Variance after split
Example: Splitting by Location
If we split the data by the “Location” feature:
City A:
Prices: $300,000, $350,000
Variance: Low (prices are close together)
City B:
Prices: $400,000, $450,000
Variance: Low
Weighted Variance After Split: Combine the variances of City A and City B, weighted by the number of houses in each city.
Information Gain: The difference between the initial variance (high) and the weighted variance after the split (low) tells us how much information we gained by splitting on “Location.”
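A minimal sketch of this calculation, using the same four prices and NumPy's variance:
import numpy as np

city_a = np.array([300000, 350000])
city_b = np.array([400000, 450000])
all_prices = np.concatenate([city_a, city_b])

# Variance before the split
var_before = np.var(all_prices)

# Variance after the split, weighted by the number of houses in each city
n = len(all_prices)
var_after = (len(city_a) / n) * np.var(city_a) + (len(city_b) / n) * np.var(city_b)

# Information gain = reduction in variance achieved by splitting on "Location"
info_gain = var_before - var_after
print(var_before, var_after, info_gain)  # 3.125e9, 6.25e8, 2.5e9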
Step 4: Iterative Splitting
After splitting by location, you can further reduce entropy by splitting within each group. For example:
In City A, split based on the number of bedrooms.
In City B, split based on square footage.
Each split reduces variance, making predictions more accurate.
Entropy measures the unpredictability (variance) in house prices, and information gain helps us decide the feature that reduces this unpredictability the most. By iteratively splitting the data, we build a tree that predicts house prices with high accuracy.
This method ensures the model focuses on the most important features step by step, improving the precision of its predictions.

Decision Tree Regression — Implementation in Python
Problem Statement: Use machine learning to predict the selling prices of houses based on a few market factors. Build a model using a Decision Tree in Python.
Import the required libraries and packages. Load the dataset and visualize it using a scatter plot to observe the spread (entropy) of the feature.
Here, we are plotting house price against the area of the house.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv("Bengaluru_House_Data.csv")

# Scatter plot of price vs. area to see how spread out (high entropy) the prices are
plt.scatter(x=data['total_sqft'], y=data['price'], color='blue')
plt.xlabel('Area in square feet')
plt.ylabel('House price')
plt.show()
Define the features and target. Split the dataset into train and test sets.
X = pd.DataFrame(data['total_sqft'])
Y = pd.DataFrame(data['price'])
x_trn, x_tst, y_trn, y_tst = train_test_split(X, Y, test_size=0.20)
Build the model with the DecisionTreeRegressor function. The 'criterion' parameter of DecisionTreeRegressor has the options 'squared_error', 'friedman_mse', 'absolute_error', and 'poisson'.
regressor = DecisionTreeRegressor(criterion='friedman_mse', random_state=100, max_depth=4, min_samples_leaf=1)
regressor.fit(x_trn, y_trn)
Predict the values.
y_pred = regressor.predict(np.array([1150]).reshape(1, 1))
print(y_pred)
Thus, for a house with an area of 1,150 sq. ft., the price predicted by the model is Rs. 77.83 lakh.
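Since we already held out a test set, a quick sketch of evaluating the model on it could look like this (the choice of metrics here is just an illustration):
from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate the trained tree on the held-out test data
y_tst_pred = regressor.predict(x_tst)
print('R^2 score:', r2_score(y_tst, y_tst_pred))
print('Mean absolute error:', mean_absolute_error(y_tst, y_tst_pred))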
Decision trees do have some weaknesses: they are relatively expensive to train, and there is a high chance of overfitting as the tree tries to fit every training sample perfectly. Random Forest Regression helps resolve these issues by combining numerous decision trees into one model and averaging the predictions of the individual trees.
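A minimal sketch of that idea with scikit-learn's RandomForestRegressor, reusing the same training data (the hyperparameters below are illustrative assumptions):
from sklearn.ensemble import RandomForestRegressor

# An ensemble of decision trees; the final prediction is the average across trees
rf = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=100)
rf.fit(x_trn, y_trn.values.ravel())  # ravel() turns the single-column DataFrame into a 1-D array
print(rf.predict(np.array([1150]).reshape(1, 1)))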


