- by Julien Herzen, Florian Ravasi, Guillaume Raille, Gaël Grosch
- 12 min read

In this article, we will see how transfer learning can be applied to time series forecasting, and how forecasting models can be trained once on a diverse time series dataset and used later on to obtain forecasts on different datasets without training. We will use the open-source Darts library to do all this in a few lines of code. A self-contained notebook containing everything needed to reproduce the results is available here.

Time series forecasting has numerous applications in supply chain, energy, agriculture, control, IT operations, finance and other domains. For a long time, the best-performing approaches were relatively sophisticated statistical methods such as Exponential Smoothing or ARIMA. Recently, however, machine learning and deep learning have started to outperform these classical approaches on a number of forecasting tasks and competitions.

One of the distinctive features of machine learning models is that their parameters can be estimated on a potentially large number of series, unlike classical methods, which are usually estimated on a single series at a time. Although machine learning shows great potential, its utilisation still poses a few practical challenges. On the one hand, it requires enough data to train. In some cases, this causes a cold-start problem, where it is hard to get ML models to work when not much data has been acquired yet. On the other hand, even if enough data is available, training models can be cumbersome. Even though the actual training code can now be made simple (e.g., using Darts), training large models may still require a time-consuming hyper-parameter tuning phase, special hardware (such as GPUs), or changes to the infrastructure or processes involved in managing the data and the models.

In this article, we look at transfer learning applied to time series forecasting. We will train a deep learning model once on a large and diverse dataset of time series, and see that it performs competitively when used to forecast different time series, in different datasets. This in itself is quite intriguing, as it means that time series coming from different domains (such as demographics, finance or industry) can share some common features. Depending on what we decide constitutes a “learning task”, this can also be seen under the angle of meta-learning (or “learning to learn”), where models can adapt themselves to new tasks (e.g. forecasting a new time series) at inference time without further training [1].

Beyond ease-of-use, using models that do not require training can also be beneficial in situations where the inference time has to be minimised. With neural networks for example, the inference time is typically small, as it requires forward passes only. In this article, we will train a large model that needs only a few milliseconds to forecast new unseen time series.

We will use the Darts open-source Python library for time series to train and use our models in only a few lines of code. Darts has a rich feature set: for instance, it can also be used to train models on multivariate series (where each individual time series can contain multiple dimensions), provide probabilistic forecasts, and take external data (covariates) into account. In this article, we will only use it to get forecasts for large numbers of univariate time series. We will provide key code snippets in this article, and a complete notebook to reproduce the results (including downloading the required datasets) is available here.

Let’s start by loading a dataset containing the number of passengers for different airline companies:

We’re not showing the function `load_air()` here, but it returns two lists, each containing 301 monthly `TimeSeries` objects. `air_train` contains the training part of the series (of average length ~137 months) and `air_test` contains the last 18 months, which we keep aside as a validation set. (Note: a good practice would be to keep aside yet another test set which we never touch before the final model has been selected.) We use a forecast horizon of 18 months here as this is consistent with the horizon used for monthly series in the M3 and M4 competitions (whose datasets we will use later in the article). In addition, we will use the symmetric mean absolute percentage error (sMAPE) as an error metric to evaluate the quality of our forecasts.
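To make the metric concrete, here is a minimal plain-Python sketch of sMAPE, following the definition used by Darts (which ships its own `smape` in `darts.metrics`): 200 times the mean of |y - ŷ| / (|y| + |ŷ|). The example values are made up for illustration, and the sketch assumes the actual and forecast values are never both zero at the same time step.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent: 200 * mean(|y - y_hat| / (|y| + |y_hat|)).

    Assumes no time step where both the actual and forecast values are zero.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    terms = [
        abs(y - y_hat) / (abs(y) + abs(y_hat))
        for y, y_hat in zip(actual, forecast)
    ]
    return 200.0 * sum(terms) / len(terms)


# Made-up example: a forecast that is 10% too high at every step
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 220.0, 330.0]
error = smape(actual, forecast)
print(f"sMAPE: {error:.2f}")
```

Because each term is a ratio, multiplying both series by the same constant leaves the result unchanged.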

Let’s plot a few of the training series:

A few of the air traffic series (for different carriers) making our training dataset.

Most series look quite different, and they don’t even share the same time axis. For example, some series start in January 2001 while others start in April 2010. You can see that the maximum value for each of these training series is 1: we have used a Darts scaler (wrapping around a scikit-learn `MaxAbsScaler`) to divide each of them by their largest absolute value. Doing this type of scaling does not impact the sMAPE, so for simplicity we will work with the scaled series only here. In a real application, we would have to call `scaler.inverse_transform()` on our forecasts to translate them back to the original domain.
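The claim that this scaling leaves the sMAPE untouched can be checked with a small sketch. It uses plain Python lists instead of `TimeSeries` objects, and the helper names (`max_abs_scale`, `inverse_scale`) and the numbers are hypothetical stand-ins for what the Darts scaler does, not its actual implementation.

```python
def max_abs_scale(series):
    """Divide a series by its largest absolute value; return (scaled, factor)."""
    factor = max(abs(v) for v in series)
    if factor == 0:
        raise ValueError("cannot scale an all-zero series")
    return [v / factor for v in series], factor


def inverse_scale(scaled, factor):
    """Map scaled values back to the original domain."""
    return [v * factor for v in scaled]


def smape(actual, forecast):
    """Symmetric MAPE in percent."""
    terms = [abs(y - f) / (abs(y) + abs(f)) for y, f in zip(actual, forecast)]
    return 200.0 * sum(terms) / len(terms)


# Made-up values in the original domain
passengers = [1200.0, 1500.0, 3000.0, 2400.0]
forecast = [1300.0, 1400.0, 2900.0, 2500.0]

scaled_passengers, factor = max_abs_scale(passengers)
scaled_forecast = [v / factor for v in forecast]  # same units as the scaled series

error_original = smape(passengers, forecast)
error_scaled = smape(scaled_passengers, scaled_forecast)
restored = inverse_scale(scaled_forecast, factor)
```

Both errors agree up to floating-point rounding, since the scaling factor cancels out in every term of the sMAPE.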

We will now try to forecast our 300 series with the “classical” approach of fitting one model for each time series. We call models trained on a single series *local models*. Below, we first write two small functions that will make our life easier afterwards. First, `eval_forecasts()` computes the median sMAPE error of all our forecasts (over the 300 test series), and shows the distribution of errors.

Second, `eval_local_model()` iterates over all the series, and for each series, it builds a (local) model, fits it on the training part of the series, and stores the forecast. It then calls `eval_forecasts()` to show the sMAPE errors over all series. It returns both the list of all errors and the total elapsed time.
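The two helpers could look roughly like the sketch below. It is not the notebook's code: series are plain Python lists rather than `TimeSeries` objects, the error distribution plot is replaced by a printed median, and `LastValueModel` and the `model_factory` argument are hypothetical stand-ins for any model exposing `fit()` and `predict(n)`.

```python
import statistics
import time


def smape(actual, forecast):
    """Symmetric MAPE in percent."""
    terms = [abs(y - f) / (abs(y) + abs(f)) for y, f in zip(actual, forecast)]
    return 200.0 * sum(terms) / len(terms)


class LastValueModel:
    """Hypothetical stand-in for a local model: repeats the last training value."""

    def fit(self, series):
        self.last_value = series[-1]
        return self

    def predict(self, n):
        return [self.last_value] * n


def eval_forecasts(forecasts, test_series):
    """Compute one sMAPE per series and print the median."""
    errors = [smape(test, pred) for test, pred in zip(test_series, forecasts)]
    print(f"median sMAPE: {statistics.median(errors):.2f}")
    return errors


def eval_local_model(train_series, test_series, model_factory):
    """Fit one fresh model per series; return (errors, elapsed seconds)."""
    start = time.perf_counter()
    forecasts = []
    for train, test in zip(train_series, test_series):
        model = model_factory()  # local model: never shared across series
        model.fit(train)
        forecasts.append(model.predict(len(test)))
    elapsed = time.perf_counter() - start
    return eval_forecasts(forecasts, test_series), elapsed


# Tiny made-up dataset with two series
train = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
test = [[3.0, 3.0], [10.0, 20.0]]
errors, elapsed = eval_local_model(train, test, LastValueModel)
```

Passing a factory rather than a model instance makes it explicit that each series gets its own freshly built model.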

We can now try a first forecasting model on this dataset. As a first step, it is usually a good practice to see how a (very) naive model blindly repeating the last value of the training series performs. This can be done in Darts using a NaiveSeasonal model:

So the most naive model gives us a median sMAPE of about 29.4. Can we do better with a “less naive” model exploiting the fact that most monthly series have a seasonality of 12?
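For intuition, what such a baseline computes can be sketched in a few lines of plain Python, independently of Darts. The function below is a hypothetical helper, not the library's implementation: it repeats the last `k` observed values cyclically, so `k=1` repeats the last value and `k=12` would repeat the last year of a monthly series (in Darts, the corresponding parameter of `NaiveSeasonal` is `K`).

```python
def naive_seasonal_forecast(series, horizon, k=1):
    """Forecast by cyclically repeating the last `k` observed values."""
    if k < 1 or k > len(series):
        raise ValueError("k must be between 1 and len(series)")
    last_season = series[-k:]
    return [last_season[i % k] for i in range(horizon)]


history = [1, 2, 3, 4, 5, 6]

# k=1: blindly repeat the last value
repeat_last = naive_seasonal_forecast(history, horizon=4, k=1)

# k=3: repeat the last "season" of three values
repeat_season = naive_seasonal_forecast(history, horizon=5, k=3)

print(repeat_last)    # [6, 6, 6, 6]
print(repeat_season)  # [4, 5, 6, 4, 5]
```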

This is better. Let’s try ExponentialSmoothing (by default, for monthly series, it will use a seasonality of 12).

It’s even better! Now I hope you see how simple this is. Let’s try a few more models while we’re at it:

So it seems that ARIMA is winning. Let’s plot the (median) error obtained by each model against the time taken to fit and predict:

ARIMA gives the best results, but it is also (by far) the most time-consuming model. The Theta method provides an interesting tradeoff, with good forecasting accuracy and about 50x faster than ARIMA. Can we maybe find a better compromise by considering *global* models — i.e., models that are trained only once, jointly on all time series?

In this section we will use “global models” — that is, models that are trained on multiple series at once. Darts has essentially two kinds of global models:

- `RegressionModels`, which are wrappers around sklearn-like regression models.
- PyTorch-based models, which offer various deep learning models.

Both kinds of models can be trained on multiple series by “tabularizing” the data, i.e., taking many (input, output) sub-slices from all the training series, and training machine learning models in a supervised fashion to predict the output based on the input.
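As an illustration of what tabularizing means, the sketch below slices a couple of toy series into (input, output) pairs with a sliding window. It uses plain lists and a hypothetical helper name, and it is only meant to convey the idea, not to mirror how Darts builds its training datasets internally.

```python
def make_training_samples(series_list, input_length, output_length):
    """Slice every series into (input, output) pairs with a sliding window."""
    window = input_length + output_length
    samples = []
    for series in series_list:
        # A series shorter than one full window contributes no samples
        for start in range(len(series) - window + 1):
            inputs = series[start : start + input_length]
            outputs = series[start + input_length : start + window]
            samples.append((inputs, outputs))
    return samples


series_a = [1, 2, 3, 4, 5, 6]
series_b = [10, 20, 30, 40]
samples = make_training_samples(
    [series_a, series_b], input_length=3, output_length=1
)

for inputs, outputs in samples:
    print(inputs, "->", outputs)
```

All pairs end up in a single training set regardless of which series they came from, which is what allows one model to learn from many series at once.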

We start by defining a function `eval_global_model()` which works similarly to `eval_local_model()`, but on global models.

`RegressionModel`s in Darts are forecasting models that can wrap around any “scikit-learn compatible” regression model to obtain forecasts. Compared to deep learning, they represent good go-to global models because they typically don’t have many hyper-parameters and can be faster to train. In addition, Darts also offers some pre-packaged regression models such as `LinearRegressionModel` and `LightGBMModel`.

We’ll now use our function `eval_global_model()` and try a few of those regression models.

You can refer to the API doc for how to use these models. Important parameters are `lags` and `output_chunk_length`. They determine respectively the length of the lookback and “lookforward” windows used by the model, and they correspond to the lengths of the input/output sub-slices used for training. For instance, `lags=24` and `output_chunk_length=12` mean that the model will consume the past 24 lags in order to predict the next 12. In our case, because the shortest training series has length 36, we must have `lags + output_chunk_length <= 36`. (Note that `lags` can also be a list of integers representing the individual lags to be consumed by the model instead of the window length.)
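To make the roles of `lags` and `output_chunk_length` concrete, here is a sketch of a global linear model written directly with NumPy. It is an illustration under simplifying assumptions (plain lists of floats, ordinary least squares with an intercept, synthetic linear-trend series) and not how Darts implements its regression models; the function names are hypothetical.

```python
import numpy as np


def tabularize(series_list, lags, output_chunk_length):
    """Stack sliding windows: past `lags` values -> next `output_chunk_length` values."""
    window = lags + output_chunk_length
    X, Y = [], []
    for series in series_list:
        if len(series) < window:
            raise ValueError("series shorter than lags + output_chunk_length")
        for start in range(len(series) - window + 1):
            X.append(series[start : start + lags])
            Y.append(series[start + lags : start + window])
    return np.array(X, dtype=float), np.array(Y, dtype=float)


def fit_global_linear(series_list, lags, output_chunk_length):
    """Fit one least-squares model jointly on the windows of all series."""
    X, Y = tabularize(series_list, lags, output_chunk_length)
    X = np.hstack([X, np.ones((len(X), 1))])  # intercept column
    weights, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return weights


def predict(weights, series, lags):
    """Forecast the next chunk from the last `lags` values of a series."""
    features = np.append(np.asarray(series[-lags:], dtype=float), 1.0)
    return features @ weights


# Two synthetic training series with different linear trends
train = [
    [float(t) for t in range(20)],
    [2.0 * t + 5.0 for t in range(20)],
]
weights = fit_global_linear(train, lags=4, output_chunk_length=2)

# A series the model has not seen during training: 3 * t + 1
unseen = [1.0, 4.0, 7.0, 10.0, 13.0, 16.0]
forecast = predict(weights, unseen, lags=4)
print(forecast)  # approximately [19. 22.]
```

The length check in `tabularize` is the constraint mentioned above: a series must contain at least one full window of `lags + output_chunk_length` points to produce a training sample.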

Let’s try linear regression:

LGBM:

And random forest:

Below, we will train an N-BEATS model on our `air` dataset. Again, you can refer to the API doc for documentation on the hyper-parameters. The following hyper-parameters should be a good starting point:

And now let’s build the model, train it, and get some forecasts. Training takes on the order of a minute or two on a (somewhat slow) Colab GPU.

Let’s compare our models again:

So it looks like a linear regression model trained jointly on all series is now providing the best tradeoff between accuracy and speed (about 85x faster than ARIMA for similar accuracy). Linear regression is often the way to go!

Our deep learning model N-BEATS is not doing great. Note that we haven’t tried to tune it to this problem explicitly, which might have produced more accurate results. Instead of spending time tuning it though, in the next section we will see if it can do better if we train it on an entirely different dataset.

Deep learning models often do better when trained on *large* datasets. Let’s try to load all 48,000 monthly time series in the M4 competition dataset and train our model once more on this larger dataset.

`m4_train` is a list containing 47,992 `TimeSeries` that have already been scaled so the maximum value in the training series is 1. We will use only the training part of the M4 series and do not store the testing part here.

We will now try again to train an N-BEATS model, but on this larger dataset.

By default, the number of (input, output) training samples generated to train an ML-based forecasting model from a given sequence of series is proportional to the number of series multiplied by their lengths. The M4 dataset contains 48,000 series with an average length of ~216 time steps. So if we leave the default parameters, we would end up with an order of magnitude of ~10M training samples. In order to somewhat limit the time required by each epoch, we will limit the number of training samples used per series. This is done when calling `fit()` with the parameter `max_samples_per_ts`. We add a new hyper-parameter `MAX_SAMPLES_PER_TS` to capture this. Note: if we wanted more control over the way the (input, output) training examples are generated to train the model, we could call `fit_from_dataset()` instead of `fit()` and provide a `darts.utils.data.TrainingDataset` implementation of our choice.
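The effect of such a cap can be estimated with a quick back-of-the-envelope sketch. The numbers below are illustrative assumptions only: every series is given the average length of 216, and the window lengths (36 and 4) and the cap of 10 samples per series are made-up values, not necessarily the ones used to train the model.

```python
def count_training_samples(series_lengths, input_length, output_length,
                           max_samples_per_ts=None):
    """Count sliding-window samples, optionally capped per series."""
    window = input_length + output_length
    total = 0
    for length in series_lengths:
        n_samples = max(0, length - window + 1)
        if max_samples_per_ts is not None:
            n_samples = min(n_samples, max_samples_per_ts)
        total += n_samples
    return total


# Assumption: 47,992 series, all of the average length of 216 time steps
lengths = [216] * 47_992

uncapped = count_training_samples(lengths, input_length=36, output_length=4)
capped = count_training_samples(
    lengths, input_length=36, output_length=4, max_samples_per_ts=10
)

print(f"without cap: {uncapped:,} samples")  # 8,494,584
print(f"with cap:    {capped:,} samples")    # 479,920
```

Under these assumptions, the cap shrinks each epoch by more than an order of magnitude while still drawing samples from every series.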

Since the M4 training series are all slightly longer, we can also use a slightly longer `input_chunk_length`.

We can now again build and train our model:

We can now use our M4-trained model to get forecasts for the air passengers series. As we use the model in a transfer learning way here, we will be timing only the inference part (assuming the model has been pre-trained in advance).

And let’s compare all our models again:

Although it’s not the absolute best in terms of accuracy, our N-BEATS model pre-trained on M4 reaches competitive accuracies. This is quite remarkable because this model has *not* been trained on *any* of the air passengers series we’ve asked it to forecast! The forecasting step with N-BEATS is ~350x faster than the fit-predict step we needed with ARIMA, and about 4x faster than the fit-predict step of linear regression.

Just for fun, we can also inspect manually how this model does on another series, for example the monthly milk production series available in `darts.datasets`:

So it seems that this model is quite capable on monthly series. Is this a trait of N-BEATS or would we get similar behaviours if we trained other global models (such as linear regression or LGBM) on M4 and then evaluated them on air passengers series?

Let’s try first with `LinearRegressionModel`:

And with `LightGBMModel`:

Finally, let’s plot these new results as well:

Linear regression offers competitive performance too. It is somewhat slower, probably only because inference with N-BEATS is efficiently batched across time series and performed on GPU.

OK, now, were we lucky with the airline passengers dataset? Let’s see by repeating the entire process on a new dataset 🙂 You will see that it actually requires very few lines of code. As a new “test” dataset, we will use the 1,400 monthly series from the M3 forecasting competition. Here’s all the code required to run and test all our models:

And now, comparing them all:

Here too, the pre-trained N-BEATS model obtains reasonable accuracy, although not as good as the most accurate models. Note that two models out of the 3 most accurate (Exponential Smoothing and Kalman Filter) did not perform so well when used on the air passengers series. ARIMA performs best but is about 170x slower than N-BEATS, which didn’t require any training and takes about 15 ms per time series to produce its forecasts. Recall that this N-BEATS model has *never* been trained on *any* of the series we’re asking it to forecast.

Transfer learning and meta-learning are definitely interesting avenues that are at the moment under-explored in time series forecasting. When do they succeed? When do they fail? Can fine-tuning help? When should they be used? Many of these questions still have to be explored, but we hope to have shown that doing so is quite easy with Darts models.

Now, which method is best for your case? As always, it depends. If you’re dealing mostly with isolated series that have a sufficient history, classical methods such as ARIMA will get you a long way. Even on larger datasets, if compute power is not too much of an issue, they can represent interesting out-of-the-box options for univariate series. On the other hand, if you’re dealing with a larger number of series, or multivariate series, ML methods and global models will often be the way to go. They can capture patterns across wide ranges of different time series, and are in general faster to run. Don’t under-estimate linear regression based models in this category! If you have reasons to believe you need to capture more complex patterns, or if inference speed is *really* important for you, give deep learning methods a shot. N-BEATS has proved its worth for meta-learning [1], but this can potentially work with other models too.

[1] Oreshkin et al., “Meta-learning framework with applications to zero-shot time-series forecasting”, 2020, https://arxiv.org/abs/2002.02887