In an earlier post, we discussed different methods for forecasting the future values of a variable. Forecasting is a rich subject; even a cursory survey suggests several different algorithms. As you go deeper, you see even more possibilities. Some algorithms, such as deep learning models, can be quite involved in order to identify significant patterns in the time series.
However, sometimes even the best algorithm may not be good enough.
Furthermore, one algorithm that looks less accurate under some conditions may be superior under other conditions. Therefore, it might be better to use a combination of these algorithms, especially in the context of time-series forecasting. Additionally, practitioners observe that even the most sophisticated algorithms are frequently outperformed by simple ones. Instead of chasing The One Big algorithm that generates forecasts with very little error, combining several weak algorithms in simple ways, such as averaging, seems to be the better approach (Armstrong, 2001).
For the rest of this post, we will assume that a set of forecasts is given. We don’t particularly care where they come from; they can be coming out of black boxes for all we care. We will only discuss different ways to combine them, optionally adding some extra features.
The Simple Mean method takes the forecasts for a particular target time and just averages them. No bells, no whistles. No parameters to estimate, no learning from the past. Still, it works surprisingly well. It frequently outperforms more sophisticated combinations.
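As a concrete sketch (the numbers and variable names here are purely illustrative, not from the data set discussed below), the Simple Mean is a one-liner:

```python
import numpy as np

# Each row is a target time, each column a forecaster (illustrative values in MW).
forecasts = np.array([
    [12.0, 15.0, 10.0],   # three forecasts for the first target time
    [30.0, 28.0, 33.0],   # three forecasts for the second target time
])

# Simple Mean: average the forecasters at each target time with equal weight.
combined = forecasts.mean(axis=1)
print(combined)  # [12.33... 30.33...]
```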
The Simple Mean averages the forecasts with equal weight. However, if we have reason to trust some forecasters more than others, we can assign a greater weight to them so that they have a greater influence on the result. Usually, these weights are set using past data by evaluating the error of each forecaster. Forecasters with smaller errors would carry greater weight in the result.
One such formula, the Minimum Variance method (also called the Inverse-Variance Weighting method), was proposed by Bates and Granger in 1969. It weights each forecaster according to its past precision, that is, the inverse of its variance. The variance of a forecaster can be estimated as the mean of the squares of its past errors. It turns out that, when we average forecasts with weights proportional to the inverses of their variances, the variance of the combined forecast is minimized; hence the name of the method. The same method is also used in portfolio theory to set up a stock portfolio that minimizes risk.
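As a sketch (the helper and variable names are my own, not from the original post), the weights can be estimated from past errors like this:

```python
import numpy as np

def inverse_variance_weights(past_errors):
    """past_errors: array of shape (n_samples, n_forecasters) holding
    past (forecast - actual) values for each forecaster."""
    # Estimate each forecaster's variance as the mean of its squared past errors.
    variances = np.mean(past_errors ** 2, axis=0)
    # Weight each forecaster by the inverse of its variance, then normalize.
    weights = 1.0 / variances
    return weights / weights.sum()

def combine(forecasts, weights):
    # Weighted average of the forecasters at each target time.
    return forecasts @ weights
```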
Regression is the bread-and-butter of data scientists, and one might naturally think of using linear regression algorithms to determine the best parameters in the linear combination of forecasters.
So, how well do they work? Let us illustrate some of these methods on example data: hourly wind power production from two wind farms over more than two years.
Model combination on wind farm production forecast
Our data is composed of three different sets of forecasts on two separate wind farms in Western Turkey, as well as the actual production on these farms. Two of the forecasters are commercial products based on meteorological models. The third is just a “persistence model”, which simply says that the production at the target time will be the same as the production two hours before it.
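In code, the persistence forecast is just the production series shifted by two hours; a minimal sketch with pandas (the series here is made up for illustration):

```python
import pandas as pd

# Illustrative hourly production series (MW), indexed by timestamp.
index = pd.date_range("2021-01-01", periods=6, freq="h")
production = pd.Series([10.0, 12.0, 15.0, 14.0, 9.0, 11.0], index=index)

# Persistence: the forecast for time t is the observed production at t - 2 hours.
persistence_forecast = production.shift(2)
```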
Here is an overview of the commercial forecasts over one week, together with the actual production:
The plots show that even though the forecasts follow the general trend, there are occasional dips and peaks in the production that the models do not capture. The addition of the persistence model helps to capture such short-term variations.
We can assess the performance of the forecasts using their mean absolute error over more than two years of data. Smaller values indicate a better forecaster:
The mean absolute error values have the same units (MW) as the production values. Farm 2 has a greater production capacity; accordingly, the errors there have larger values. We can compare forecast performances within each farm, but we should not compare forecasters across farms.
We see that Forecaster 1 has the best performance in Farm 1. The persistence model is the worst in Farm 1 but, interestingly, the best forecaster in Farm 2.
The simplest combination is averaging the forecasts at each time step. This gives a forecaster that is better than any of the existing ones, as measured by the mean absolute error:
Can we improve on this by using more sophisticated combinations using the Minimum Variance method? We will split the two-year data set into training and test sets, evaluate the mean square errors of each forecaster over the training set, and make predictions over the test set. The mean absolute errors over the test set are as follows:
We see that the Minimum Variance method again outperforms the individual forecasts. However, it is not significantly better than simple averaging, even though its algorithm is more complex.
We can get around this problem with a cheat. We can enrich the data with some new features using our domain knowledge. It makes sense that the wind data varies within a day and a year, following the natural cycles. It might be the case that individual forecasters’ errors are not constant but depend on the time of the day and year.
To test this hypothesis, we break the data points into hours and months and evaluate each group's mean square errors separately. This gives us a different weight for each (hour, month) pair. For example, the weights of forecaster 1 at 9:00 in January, 10:00 in January, and 9:00 in February will all be different.
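A sketch of how these group-wise weights might be computed with pandas (the DataFrame layout and column names are assumptions for illustration):

```python
import pandas as pd

def seasonal_inverse_variance_weights(train, forecaster_cols, actual_col="actual"):
    """train: DataFrame with a DatetimeIndex, one column per forecaster,
    and a column with the actual production (all names are illustrative)."""
    df = train.copy()
    df["month"] = df.index.month
    df["hour"] = df.index.hour

    weights = {}
    for (month, hour), group in df.groupby(["month", "hour"]):
        errors = group[forecaster_cols].sub(group[actual_col], axis=0)
        variances = (errors ** 2).mean()        # per-forecaster MSE in this group
        w = 1.0 / variances
        weights[(month, hour)] = w / w.sum()    # normalized weights for this group
    return weights
```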
Modifying our model combination algorithm this way, we get the following mean absolute error:
We get an improved forecast as a result of grouping the data by month and hour, but not by very much. Still, depending on the objective, this can be a significant improvement.
As a last attempt, let us combine the three forecasts using linear regression. This model uses the month and hour information as categorical variables in addition to the individual forecasts.
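A sketch of such a model with scikit-learn, where month and hour are one-hot encoded and the three forecasts enter as numeric features (all column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def fit_regression_combiner(train):
    """train: DataFrame with a DatetimeIndex, forecaster columns
    'fc1', 'fc2', 'persistence', and the actual production in 'actual'."""
    X = train[["fc1", "fc2", "persistence"]].copy()
    X["month"] = train.index.month
    X["hour"] = train.index.hour
    y = train["actual"]

    # One-hot encode the calendar features; pass the forecasts through unchanged.
    preprocess = ColumnTransformer(
        [("calendar", OneHotEncoder(handle_unknown="ignore"), ["month", "hour"])],
        remainder="passthrough",
    )
    model = make_pipeline(preprocess, LinearRegression())
    model.fit(X, y)
    return model
```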
In both farms, linear regression performs better than all individual forecasts. In Farm 2, it is the best combination, albeit by a very small margin. In Farm 1, however, it performs slightly worse than the other combined models.
We see that there is no single method that is consistently better. If we investigated another farm, we might find that the Simple Mean is the most successful one there. In practice, we can pick the best combination method for each farm.
What else?
We can try other model combination algorithms as well. For example, Bayesian Model Averaging (Hinne et al., 2020) is another method that assigns weights to the forecasters in the sum. It evaluates the Bayesian Information Criterion (BIC) for each forecaster using past data. The BIC approximates the probability, given the data, that this forecaster is the best model among the candidates. These probabilities are used as the weights for the model combination.
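A common way to turn BIC values into such weights is to exponentiate half the (negated) BIC differences and normalize; this is a sketch of that general idea, not necessarily the exact procedure of the cited paper:

```python
import numpy as np

def bic_weights(bic_values):
    """Convert BIC scores (one per forecaster) into approximate posterior
    model probabilities that can serve as combination weights."""
    bic = np.asarray(bic_values, dtype=float)
    delta = bic - bic.min()        # differences from the best (lowest) BIC
    raw = np.exp(-0.5 * delta)     # relative evidence for each model
    return raw / raw.sum()         # normalize to probabilities
```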
We can assume that the weights are not constant but change over time. We can evaluate the weights only within a fixed time window, say, a few weeks before each target time. Then we can shift the window as we go along. Timmermann (2006) lists several methods that can be used for combining forecasts.
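For instance, a rolling variant of inverse-variance weighting could look like the following sketch (the window length, data layout, and names are illustrative assumptions):

```python
import pandas as pd

def rolling_combination(forecasts, actual, window="21D"):
    """forecasts: DataFrame with a DatetimeIndex and one column per forecaster;
    actual: Series of observed production aligned to the same index."""
    errors = forecasts.sub(actual, axis=0)
    # Rolling mean squared error of each forecaster, shifted by one step so that
    # only data strictly before the target time is used.
    mse = (errors ** 2).rolling(window).mean().shift(1)
    # Inverse-variance weights, renormalized at each target time.
    weights = 1.0 / mse
    weights = weights.div(weights.sum(axis=1), axis=0)
    # Weighted average of the forecasts; the first window has no weights yet (NaN).
    return (forecasts * weights).sum(axis=1, min_count=1)
```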
However, diversifying the portfolio of combination algorithms may not always be useful. In many practical problems, it turns out that the Simple Mean is the best combination method (in the sense of minimum error). More complicated combination methods do not significantly reduce the error metric; they may even result in a bigger error. This observation is called the Forecast Combination Puzzle.
The main reason is that the weights of the forecasters are not known beforehand but are estimated from past data. This estimation introduces its own biases and errors. If the error variances of the individual forecasters are close to each other, the weight estimation error will dominate and throw us off course. To avoid this problem, Timmermann (2006) suggests using the Simple Mean unless there is statistical evidence that the error variances of the forecasters are significantly different.
This brings us back to Occam’s Razor: Clever and sophisticated algorithms may look sexy, but the simplest methods usually perform the best. Start simple, and improve only if necessary.