
In the world of machine learning, the Random Forest algorithm has gained immense popularity for its versatility and robustness. This ensemble method combines the strengths of decision trees and the wisdom of crowds, making it a go-to algorithm for a wide range of tasks, from classification to regression and beyond. In this post, we walk through a complete Python code example that uses a Random Forest to predict survival on the Titanic.

What is Random Forest?

Random Forest comprises multiple individual decision trees, each trained on a different random subset of the data and features. Because it combines many trees, the model is less sensitive to the particular training sample than a single decision tree.
These decision trees learn simple decision rules from the input features to make predictions. Decision trees are known for capturing non-linear relationships and interactions between features.
Each input sample is passed through every decision tree, and each tree makes its own prediction, for example 0 or 1 in binary classification. The final prediction is a majority vote (for classification) or an average (for regression) over all trees. Since the algorithm does not depend on a single decision tree, it tends to be more robust and accurate.

Adding more trees to the forest generally makes the predictions more stable and does not by itself cause overfitting, although the accuracy gains level off beyond a certain number of trees.
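To get a feel for how the number of trees affects the result, the following is a minimal sketch on synthetic data (the dataset from make_classification and the tree counts 25, 100, and 300 are assumptions for illustration, not part of the Titanic example). It uses scikit-learn's out-of-bag score, which evaluates each sample only with the trees that did not see it during training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only as a stand-in for a real dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

oob_scores = {}
for n_trees in (25, 100, 300):
    # oob_score=True asks scikit-learn to compute accuracy on out-of-bag samples.
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    forest.fit(X, y)
    oob_scores[n_trees] = forest.oob_score_
    print(n_trees, "trees -> OOB accuracy:", round(forest.oob_score_, 3))
```

The exact numbers depend on the data, so treat the trend rather than any single value as the takeaway.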

How does Random Forest make predictions?

The algorithm consists of the following steps:

  • Step 1 Data Splitting: The dataset consists of features and corresponding labels. It is split into two datasets, i.e., training and test, and the trees are built from samples of the training set only.
  • Step 2 Bagging: For each decision tree in a Random Forest, a bootstrap sample is generated by randomly sampling the training data with replacement. This results in multiple resampled versions of the training data, each of which may contain duplicate samples and covers approximately two-thirds (about 63%) of the distinct original rows.
  • Step 3 Decision Tree Construction: Each decision tree in the Random Forest is built with a bootstrap sample. At each node, a random subset of features (determined by ‘max_features’) is considered when searching for the best split, until a stopping criterion is met (e.g. reaching a maximum depth or a minimum number of samples per leaf).
  • Step 4 Prediction Aggregation: After constructing decision trees, predictions are made for each tree. For classification tasks, the final prediction is the class with the most votes. For regression tasks, the final output is obtained by averaging the predictions of all trees.
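The four steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is only an illustration under stated assumptions: the data is synthetic (make_classification), and the number of trees (25) and max_depth (5) are arbitrary choices, not values from this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# Step 1: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []
unique_fractions = []

for i in range(n_trees):
    # Step 2: bootstrap sample of the same size as the training set, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    unique_fractions.append(len(np.unique(idx)) / len(X_train))

    # Step 3: each tree considers a random subset of features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", max_depth=5, random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote (labels are 0/1, so the mean is the share of votes for class 1).
all_votes = np.array([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)
ensemble_accuracy = (forest_pred == y_test).mean()

print("mean fraction of distinct rows per bootstrap:", round(float(np.mean(unique_fractions)), 3))
print("ensemble accuracy:", ensemble_accuracy)
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so this sketch mirrors the idea, not the library's exact internals.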

Architecture of Random Forest

 

Advantages of using Random Forest:

 

  1. Robustness: Compared to individual decision trees, Random Forest is less prone to overfitting. The aggregation of multiple trees helps to reduce variance and improve generalization on unseen data.
  2. Feature Importance: Random Forest provides a measure of feature importance based on how much each feature contributes to the overall performance of the model. This information can be valuable for feature selection and understanding the underlying importance of different variables.
  3. Handles Large Datasets: Random Forest can handle large datasets with high dimensionality, making it suitable for a wide range of real-world problems.
  4. Versatility: Random Forest can be used for both classification and regression tasks, and has proven to be effective in various domains, including finance, healthcare, and natural language processing.
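As a small illustration of the feature importance point, a fitted scikit-learn forest exposes a feature_importances_ attribute. The sketch below uses synthetic data and hypothetical column names (feature_0 to feature_5), which are assumptions and not part of the Titanic dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 of the 6 features carry signal, the rest are noise.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based scores are a convenient first look; they describe the fitted model rather than proving that a variable matters in the real world.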

 

Ensemble:

Ensemble models combine multiple learning models to obtain better predictive performance than any single model. There are three main types of ensemble learning methods: bagging, boosting, and stacking. Random Forest belongs to the bagging family.
Ensemble methods combine multiple models to make more accurate and robust predictions. They are useful in situations with noisy or uncertain data, as they can smooth out irregularities and inconsistencies. Ensemble methods can also identify and mitigate biases or errors in individual models, improving the overall quality of predictions. They are a powerful tool for data scientists and machine learning practitioners, applicable to a wide range of problems and data sets.
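To make the three families concrete, here is a hedged sketch that fits one scikit-learn estimator of each kind on the same synthetic data. The dataset, the base models, and every hyperparameter value are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

ensembles = {
    # Bagging: many trees trained independently on bootstrap samples.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=30, random_state=0),
    # Boosting: trees trained one after another, each correcting the previous ones.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: a final model learns how to combine different base models.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, clf in ensembles.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out split
    print(name, round(scores[name], 3))
```

Which family does best depends heavily on the data, so the scores here should not be read as a ranking.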

Python Code:

This code example shows a Python implementation of the Random Forest algorithm to predict survival outcomes in the Titanic dataset. The dataset, available on Kaggle, contains information about passengers and whether they survived or not. Random Forest is an ensemble learning algorithm that effectively analyzes and predicts based on this dataset. We will demonstrate how to preprocess data, build a Random Forest model, and evaluate its accuracy.

Dataset Link: https://www.kaggle.com/competitions/titanic

Import required libraries:

We import the necessary libraries: pandas for data handling, seaborn and matplotlib for plotting, RandomForestClassifier from scikit-learn for creating a Random Forest classifier, train_test_split for splitting the data into training and testing sets, and accuracy_score for evaluating the model’s accuracy.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # used for plt.title() and plt.show() in the plots
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

 

Preprocess the Data:

We load the Titanic dataset from the provided CSV file using pandas’ read_csv() function. This dataset contains information about passengers, including features like age, sex, ticket class, etc., along with the “Survived” column indicating whether a passenger survived (1) or not (0).

For simplicity, we keep only a few features in this example. Using the pandas DataFrame method .drop(), we drop the columns we will not use.

df = pd.read_csv('D:\\Downloads\\titanic\\titanic.csv')
df = df.drop(['Name','SibSp','Parch','Fare','Cabin','Ticket','Embarked','PassengerId'], axis=1)
df.head()

output:

Handling Missing Values:

Before training a Random Forest model, we typically perform preprocessing steps such as feature selection and engineering, handling missing values, and encoding categorical variables. Scaling numeric features is generally not required for tree-based models. In this example, we simply drop the rows that contain missing values with dropna(), then separate the features (X) from the target (y).

df = df.dropna()
X = df[['Pclass','Sex','Age']]
y = df['Survived']
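Dropping rows is the simplest option, but it discards data. A common alternative is to fill the gaps, for instance with the median. The sketch below compares both approaches on a tiny hand-made DataFrame (the five rows are invented for illustration and are not taken from the Titanic file).

```python
import numpy as np
import pandas as pd

# Toy data with Titanic-like columns; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1],
    "Sex": ["female", "male", "male", "female", "male"],
    "Age": [29.0, np.nan, 40.0, np.nan, 54.0],
    "Survived": [1, 0, 0, 1, 0],
})

# Count missing values per column before deciding what to do.
print(toy.isna().sum())

# Option 1: drop every row that has a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill missing ages with the median age.
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())

print(len(dropped), "rows after dropna,", len(imputed), "rows after imputation")
```

In a real workflow the median would normally be computed on the training split only, so that no information from the test set leaks into the model.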

 

Split the data into training and test datasets:

We split the preprocessed data into training and testing sets using train_test_split(). This allows us to train the model on a portion of the data and evaluate its performance on unseen data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
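When the classes are imbalanced, it can help to keep the class proportions the same in both splits. train_test_split supports this through its stratify parameter. The following sketch uses made-up arrays (X_demo and y_demo are hypothetical names, with 60 negatives and 40 positives) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 40% of them belong to class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo keeps the 60/40 class ratio in both the training and test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```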

 

Visualize relationship between survival and features:

Visualizing the relationship between survival and the features helps us understand the data before modeling. Through techniques such as count plots, histograms, and pair plots, we can explore the connections between survival and the features we kept, i.e., passenger class, sex, and age. These visualizations enable us to make data-driven observations and uncover patterns in the factors affecting survival in the Titanic dataset.

sns.pairplot(df, height=2.5, hue='Survived')

 

Relationship between survival and features

sns.countplot(x='Survived', hue='Sex', data=df)
plt.title("Survival by Sex")
plt.show()

Survival of passengers with respect to sex

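Encode categorical features:

The Sex column contains strings, so we one-hot encode the features with pd.get_dummies(). The Survived column is already 0/1, so the labels are used as they are.

X1 = pd.get_dummies(X_train)
# Align the test columns with the training columns in case a category is missing from the test split
X2 = pd.get_dummies(X_test).reindex(columns=X1.columns, fill_value=0)

# The target is already numeric (0/1), so no encoding is needed
y1 = y_train
y2 = y_test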
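One thing to watch when encoding training and test data separately is that a category can be absent from one split, which produces different columns. The toy sketch below (two invented frames, train_part and test_part) shows the problem and one way to align the columns with reindex.

```python
import pandas as pd

# Invented frames: the test part happens to contain no "female" rows.
train_part = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["female", "male", "female"]})
test_part = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "male"]})

train_enc = pd.get_dummies(train_part)
test_enc = pd.get_dummies(test_part)

print(list(train_enc.columns))  # Pclass plus one column per Sex category seen in training
print(list(test_enc.columns))   # the Sex_female column is missing here

# Give the test frame exactly the training columns, filling absent ones with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Encoding the full feature table once before splitting is another way to avoid the mismatch.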

 

Create a Random Forest Classifier:

We create an instance of the RandomForestClassifier class from scikit-learn with specified parameters, such as the number of estimators (decision trees) in the forest. In this example, we set n_estimators to 120, limit each tree to a maximum depth of 7 with max_depth, and use a fixed random_state for reproducibility.

model = RandomForestClassifier(n_estimators=120, max_depth=7, random_state=1)

 

Train the Model:

We fit the Random Forest classifier to the training data using the fit() method. This trains the ensemble of decision trees on the training set, learning patterns and relationships between features and the target variable (Survived).

model.fit(X1, y1)

 

Make Predictions:

We use the trained Random Forest model to make predictions on the test set (unseen data) using the predict() method. This generates predicted survival outcomes for the passengers in the test set.

predictions = model.predict(X2)
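Besides hard class labels, a fitted forest can also report how confident it is through predict_proba(). The sketch below uses synthetic data and an arbitrary 250/50 split (both are assumptions for illustration) to show how the two outputs relate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: first 250 rows for training, last 50 for prediction.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:250], y[:250])

labels = clf.predict(X[250:])       # hard class labels, one per row
proba = clf.predict_proba(X[250:])  # one probability per class, per row

# The predicted label is the class with the highest averaged probability.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(proba[:3])
print(labels[:3], labels_from_proba[:3])
```

Probabilities are useful when the decision threshold matters, for example when one kind of error is costlier than the other.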

 

Evaluate Model:

We evaluate the model’s performance by comparing the predicted survival outcomes with the actual labels in the test set. We calculate the accuracy of the predictions using accuracy_score() and print the result.

accuracy = accuracy_score(y2, predictions)
print(accuracy)

0.8601398601398601

Confusion Matrix:

In the field of machine learning and classification tasks, a confusion matrix is a fundamental tool for evaluating the performance of a model. It provides a comprehensive summary of the predictions made by a classification model and their comparison with the actual ground truth labels.

A confusion matrix is a square matrix that displays the counts or percentages of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. By revealing the types of errors a classifier makes, it allows us to assess the accuracy and quality of the classifier.

Confusion Matrix of the Model
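As a minimal sketch of how such a matrix can be computed, scikit-learn provides confusion_matrix(). The labels below (y_true and y_pred) are a small made-up example, not the output of the Titanic model.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for eight samples.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Accuracy can be recovered from the matrix as (TP + TN) / total.
accuracy_from_cm = (tp + tn) / cm.sum()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy_from_cm)
```

The same call works on a model's test labels and predictions, provided both are 1-D arrays of class labels.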

Conclusion:

In this blog post, we delved into the Random Forest algorithm, a powerful ensemble learning technique, and demonstrated its implementation using a Python code example. By combining the predictions of multiple decision trees, Random Forest offers robust and accurate predictions for various machine learning tasks, including the Titanic survival prediction problem.

In this blog we did the following:

– Provided an introduction to the Random Forest algorithm and its advantages in ensemble learning.
– Demonstrated the implementation of Random Forest using a Python code example.
– Highlighted the versatility of Random Forest in handling complex classification tasks and its ability to capture intricate relationships in high-dimensional data.
– Provided a conclusion summarizing the key takeaways from the blog post.

 


Sania Shujaat

A Mechanical Engineer with a keen interest in applying AI to revolutionize Mechanical Engineering.
