# Random Forests for Regression

Predictive modeling plays a crucial role in extracting valuable insights from data, enabling researchers to make informed decisions and uncover underlying patterns. Regression analysis is a powerful technique that has gained wide popularity. Here I want to focus on one particularly popular method: Random Forest Regression. By bundling decision trees together into a forest (the Random Forest), one can exploit their combined power to solve complex prediction problems and deliver accurate and, most importantly, interpretable results.

Random Forest Regression offers a robust approach to modeling relationships between variables, accommodating non-linearities, and handling missing data (this is why I like them so much). Unlike traditional linear regression methods, this ensemble learning technique leverages the strength of multiple decision trees to provide reliable predictions. Whether you're analysing financial markets, predicting customer behaviour, or studying scientific phenomena, Random Forest Regression should always be a favourite in your toolkit.

## A Crash course in Random Forests

### Decision trees

One cannot fully appreciate how random forests work without first understanding decision trees. I might write an article on these at some later stage, but for the time being here is a short summary.

A decision tree is basically a flow diagram, plain and simple; more precisely, it's a supervised machine learning algorithm that can be used for both classification and regression tasks (we will be interested in the latter). It models decisions based on the input features and uses these to predict the value of the target variable. The tree structure consists of internal nodes that represent features, branches that correspond to feature values, and leaf nodes that hold the predicted outcomes. The decision tree algorithm recursively splits the data based on feature conditions, aiming to create homogeneous subsets that are "more pure". At each split, a specific criterion, e.g. Gini impurity or entropy, is used to evaluate the quality of the split. Finally, the resulting tree can be easily interpreted and provides insights into the decision-making process.

For example, suppose we want to classify a dataset characterised by tuples $(\vec{x}_{i},y_{i})$, where $\vec{x}_{i}$ is a vector of features and $y_{i}$ is the label of the $i$th data point. A decision tree might then look at each individual feature $x_{i}^{(j)}$ of each data point and use these to decide whether, in this example, the classification is $y \in \{{\rm Red}, {\rm Green}\}$.
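To make this concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on synthetic two-feature data, where the boundary $x^{(1)}>1.73$ from the example above defines the labels; the data itself is made up purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic data: points with x^(1) > 1.73 are labelled "Red", the rest "Green"
X = rng.uniform(0, 3, size=(200, 2))
y = np.where(X[:, 0] > 1.73, "Red", "Green")

# A depth-1 tree (a single split) is enough to recover the boundary
tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(tree.tree_.feature[0])    # index of the feature used at the root split
print(tree.tree_.threshold[0])  # learned decision boundary, close to 1.73
```

Since the labels are perfectly separable on the first feature, a single split recovers a boundary very close to 1.73.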

Now, we need to determine the best decision boundary. For example, why is $x^{(1)}>1.73$ the best split for deciding whether $y= {\rm Red}$? As in most of statistics and forecasting, we are trying to minimise or maximise some meaningful criterion. In the case of decision trees, a common choice is the convex Gini index, defined as

$$G = 1 - \sum_{k}p_{k}^{2},$$

where $p_{k}$ is the proportion of samples of class $k$, which in our case is Green or Red. If we can find a decision boundary that minimises the Gini index, such that $\sum_{k}p_{k}^{2}\rightarrow1$, then this split is said to carry high "information" about the classification at hand. One can also use the Shannon entropy $S = -\sum_{k}p_{k}\log p_{k}$, which is similarly convex and much more closely associated with information theory. However, for large datasets it can be preferable to work with the Gini index, since logarithms are computationally expensive.

In any case, we can cycle through different decision boundaries algorithmically and seek the optimal decision boundaries that maximise our information. With this, we can now look at Random Forests.
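As a sketch of this cycling procedure, the following toy snippet scans the midpoints between sorted samples of a single made-up feature and picks the boundary with the lowest weighted Gini index; the data and helper functions are purely illustrative:

```python
import numpy as np

def gini(labels):
    """Gini index G = 1 - sum_k p_k^2 for an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(threshold):
    """Weighted Gini index of the two subsets created by a boundary."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy one-dimensional data: values below 1.73 are Green, above are Red
x = np.array([0.5, 1.0, 1.5, 1.7, 1.8, 2.0, 2.5, 3.0])
y = np.array(["Green", "Green", "Green", "Green", "Red", "Red", "Red", "Red"])

# Candidate boundaries are midpoints between adjacent sorted samples;
# the best boundary is the one minimising the weighted Gini index
candidates = (x[:-1] + x[1:]) / 2
best = min(candidates, key=split_score)
print(best)  # 1.75: both subsets are pure, so the weighted Gini index is 0
```

This brute-force scan over midpoints is essentially what tree-building libraries do at each node, just heavily optimised.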

## From trees to a forest

Now a collection, or an ensemble, of trees is called a forest! Random Forest Regression is based on the idea of ensemble learning, where multiple decision trees work together to make predictions. The power of a forest, and of ensemble methods in general, comes from two underlying concepts that improve robustness: randomisation and aggregation.

*Randomisation*: Each decision tree in the Random Forest is trained on a randomly selected subset of the data, a procedure known as bagging (bootstrap aggregation). We might only consider $20\%$ of the data, or we might even drop entire features. By introducing randomness, we reduce the risk of overfitting and encourage diversity among the trees. This randomisation helps capture different aspects of the data and leads to a more robust model.
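A minimal sketch of what such a random draw looks like in NumPy; the sample size and the square-root feature rule here are illustrative choices, not fixed properties of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 10

# Bootstrap: draw row indices with replacement, same size as the dataset
rows = rng.integers(0, n_samples, size=n_samples)
# Feature subsampling: keep, say, sqrt(n_features) randomly chosen columns
cols = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

# A bootstrap sample contains roughly 63% of the unique rows on average
print(len(np.unique(rows)) / n_samples)
```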

*Aggregation*: Once the individual random trees are trained, their predictions are combined to produce the final prediction. This aggregation process, often referred to as "voting" or "averaging," helps reduce the impact of individual tree errors and produces a more accurate and stable prediction. The exact method of aggregation depends on the type of problem (regression or classification) and can include averaging, weighted averaging, or majority voting.
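To illustrate both ideas end to end, here is a hand-rolled sketch of bagging for regression: several DecisionTreeRegressor models are each fit on their own bootstrap sample of synthetic noisy sine data, and their predictions are averaged. This is only a toy version of what RandomForestRegressor does internally:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Train several trees, each on its own bootstrap sample of the data...
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# ...then aggregate by averaging their predictions (for regression)
X_new = np.array([[0.0], [1.5]])
pred = np.mean([t.predict(X_new) for t in trees], axis=0)
print(pred)  # the averages land near sin(0) = 0 and sin(1.5) ≈ 1
```

Each individual tree badly overfits the noise, but the ensemble average smooths those errors out.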

## But why Random Forests?

I like using Random Forests as a quick tool for building predictive models for several reasons. Firstly, they are super easy to implement using scikit-learn, but they also have several useful benefits.

*Random forests are flexible*: Random Forest Regression can handle numerical and categorical features as well as missing data, making it suitable for a wide range of datasets. It can capture non-linear relationships and interactions between variables, making it a versatile modeling technique.

*Random forests are robust*: Random Forest Regression is robust to outliers and noise in the data. The ensemble nature of the algorithm helps mitigate the impact of individual data points, resulting in more reliable predictions. This can make it easy to get some quick and relatively accurate results.

*Random forests are interpretable*: given the collection of decision trees, a Random Forest provides a measure of feature importance, indicating which variables have the most significant impact on the prediction. After constructing the forest, we can look at which variables carry the most information and thus determine their importance. This can be invaluable for feature selection and for gaining insights into the underlying relationships in the data. Other techniques like principal component analysis garble the dominant features into linear combinations, removing their interpretability; this is not the case for random forests.

## Python Example: House Price Prediction

In this simple example, we will cover all the above concepts using the House Price Kaggle dataset in Python. In this dataset, SalePrice is the target variable, and there are 79 features ranging from the year built to the lot area and the kinds of utilities available. In any preliminary data investigation one should spend considerable time studying and preprocessing the data to remove outliers and errors; I won't focus on this here, but rather on the Random Forest implementation.

In any case, we can clearly see that the distribution of our target variable follows a log-normal distribution, as we would expect for prices, so we will use the log of the sale price as our target variable rather than dealing with the extreme tails. We also use the Shapiro-Wilk test to show that the log-transformed distribution is close to normal.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# import data
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
# Make a plot
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(1, 2, 1)
sns.histplot(train['SalePrice'], color='b', kde=True, bins=50, stat='probability')
ax1.set_title('Sale Price')
ax2 = fig.add_subplot(1, 2, 2)
sns.histplot(np.log(train['SalePrice']), color='r', kde=True, bins=50, stat='probability')
ax2.set_title('Log of Sale Price')
# add Shapiro-Wilk test
shap = stats.shapiro(np.log(train['SalePrice']))
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax2.text(12.5, 0.07, f'W = {shap[0]:.4f}', fontsize=12, bbox=props)
plt.show()
```

We can now split our data into features and log-transformed labels

```
# Now split data into features and labels
train_X = train.drop(columns=['SalePrice'])
train_y = np.log(train['SalePrice'])
```

Now we are going to use scikit-learn's RandomForestRegressor. We will then want to do some hyperparameter tuning to get the best possible model. Here we are going to vary a few hyperparameters of our forest, including:

- Number of trees in the forest (n_estimators)
- Number of features to consider at every split (all features, or the square root of the total number)
- Maximum depth of each tree
- Minimum number of samples required to split a node
- Minimum number of samples required at each leaf node
- Whether or not to bootstrap

We can then combine this with the RandomizedSearchCV module from scikit-learn, which iterates over random combinations of these hyperparameters to find the best ones

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# (None uses all features; 'auto' was removed in recent scikit-learn)
max_features = [None, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=3, verbose=2,
                               random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(train_X, train_y)
```

This will take a few minutes to run, but it will provide us with the set of best-performing hyperparameters, which we can then use to build a new model

```
# get best parameters
best_params = rf_random.best_params_
# Now lets make a model with these parameters
rf = RandomForestRegressor(**best_params)
```

We are now in a position to test the accuracy of our random forest. We can split our training data into a new training and validation set using the train_test_split function in scikit-learn, fit our model, and then evaluate it by computing the root-mean-square error

```
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(train_X, train_y, test_size=0.2, random_state=42)
# fit model
rf.fit(X_train, y_train)
# make predictions
preds = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
# plot predictions vs actual
fig = plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_test, y=preds, color='b', alpha=0.5)
plt.ylabel('Predictions')
plt.xlabel('Actual')
plt.title(f'Predictions vs Actuals (RMSE: {rmse:.4f})')
plt.show()
```

So we can see that overall our model does a pretty good job of predicting the final log distribution of the sales price.

### Feature importance

As I mentioned earlier, one of my favourite components of Random Forests is their interpretability, as determined by the aggregate impurity reduction each feature provides across the forest (the Gini index for classification, or the variance reduction in our regression case). We can easily access the feature importances and plot them in descending order

```
# Let's now get the feature importances
fi = pd.DataFrame(data=rf.feature_importances_,
                  index=rf.feature_names_in_,
                  columns=['importance'])
# Sort the values and plot the top 20
fi.sort_values(by='importance', ascending=False, inplace=True)
plt.figure(figsize=(20, 6))
sns.barplot(x=fi.index, y=fi['importance'], color='b', saturation=0.5)
plt.xticks(rotation=90)
plt.show()
```

Thus, we can now clearly see which features have the greatest impact on the overall price. Unsurprisingly, the top four are the Overall quality, Ground floor living area size, the year it was built, and the number of Garage spaces.

As a final sanity check, we can double check that the overall model reproduces the same Log-Normal distribution for the unseen test data.

```
# Now make predictions
test_y = rf.predict(test)
fig, ax = plt.subplots()
sns.histplot(np.exp(train_y), ax=ax, color='b', alpha=0.5, kde=True, label='train')
sns.histplot(np.exp(test_y), ax=ax, color='r', alpha=0.5, kde=True, label='test')
ax.legend()
plt.show()
```

Thus our model reproduces a log-normal distribution similar to that of our training data, which is a good sanity check that it is behaving well, on average.

## Conclusion

In conclusion, random forests are powerful and versatile machine learning algorithms for multi-parameter regression tasks. By combining the predictions of multiple decision trees, random forests can improve the accuracy and robustness of the models while reducing the risk of overfitting. The ensemble nature of random forests allows them to capture complex relationships and interactions among features, resulting in robust predictions even in the presence of noisy or incomplete data. Furthermore, the random feature selection and bootstrapping techniques employed in random forests contribute to their ability to handle high-dimensional data and mitigate the effects of outliers.