- Jul 6, 2021
- 16 min read

- Francesco Lässig

## How a convolutional network with some simple adaptations can become a powerful tool for sequence modeling and forecasting.

Although commonly associated with image classification tasks, convolutional neural networks (CNNs) have proven to be valuable tools for sequence modeling and forecasting, given the right modifications. In this article we explore in detail the basic building blocks that a Temporal Convolutional Network (TCN) consists of, and how they all fit together to create a powerful forecasting model. Using our open-source *Darts* TCN implementation, we show that accurate predictions can be achieved on a real-world dataset with only a few lines of code.

The following description of a Temporal Convolutional Network is based on the following paper: https://arxiv.org/pdf/1803.01271.pdf. References to this paper will be denoted by (*).

## Motivation

Up until recently, the topic of sequence modeling in the context of deep learning has been largely associated with recurrent neural network architectures such as LSTM and GRU. S. Bai et al. (*) suggest that this way of thinking is antiquated, and that convolutional networks should be taken into consideration as one of the primary candidates when modeling sequential data. They were able to show that convolutional networks can achieve better performance than RNNs in many tasks while avoiding common drawbacks of recurrent models, such as the exploding/vanishing gradient problem or lacking memory retention. Furthermore, using a convolutional network instead of a recurrent one can lead to performance improvements as it allows for parallel computation of outputs. The architecture they propose is called Temporal Convolutional Network (TCN) and will be explained in the following sections. To facilitate understanding the TCN architecture in conjunction with its *Darts* implementation, this article will use the same model parameter names as seen in the library wherever possible (indicated in **bold**).

## Overview

A TCN, short for Temporal Convolutional Network, consists of dilated, causal 1D convolutional layers with the same input and output lengths. The following sections go into detail about what these terms actually mean.

## 1D Convolutional Network

A 1D convolutional network takes as input a 3-dimensional tensor and also outputs a 3-dimensional tensor. The input tensor of our TCN implementation has the shape (**batch_size**, **input_length**, **input_size**) and the output tensor has the shape (**batch_size**, **input_length**, **output_size**). Since every layer in a TCN has the same input and output length, only the third dimension of the input and output tensors differs. In the univariate case, **input_size** and **output_size** will both be equal to one. In the more general multivariate case, **input_size** and **output_size** might differ since we might not want to forecast every component of the input sequence.

One single 1D convolutional layer receives an input tensor of shape (**batch_size**, **input_length**, *nr_input_channels*) and outputs a tensor of shape (**batch_size**, **input_length**, *nr_output_channels*). To see how one single layer converts its input into the output, let’s look at one element of the batch (the same process takes place for every element in the batch). Let’s start with the simplest case where *nr_input_channels* and *nr_output_channels* both equal 1. In this case we are looking at 1D input and output tensors. The following image shows how one element of the output tensor is computed.

We can see that to compute one element of the output, we look at a series of consecutive elements of length **kernel_size **of the input. In the above example we chose a **kernel_size **of 3. To obtain the output, we take the dot product of the subsequence of the input and a kernel vector of learned weights of the same length. To get the next element of the output, the same procedure is applied, but the **kernel_size**-sized window of the input sequence is shifted to the right by one element (for the purposes of this forecasting model, the stride is always set to 1). Please note that the same set of kernel weights will be used to compute every output of one convolutional layer. The following image shows two consecutive output elements and their respective input subsequences.

To make the visualization simpler, the dot product with the kernel vector is not shown anymore, but takes place for every output element with the same kernel weights.

To make sure that the output sequence has the same length as the input sequence, some zero-padding is applied. This means that additional zero-valued entries are added to either the beginning or the end of the input tensor to ensure that the output has the desired length. How this is done exactly will be explained in later sections.

Now let’s look at the case where we have multiple input channels, i.e. *nr_input_channels* is larger than 1. In this case, the process described above is repeated for every single input channel, but each time with a different kernel. This leads to *nr_input_channels* intermediate output vectors and a number of kernel weights of **kernel_size *** *nr_input_channels*. Then, all the intermediate output vectors are summed up to obtain the final output vector. In some sense this is equivalent to a 2D convolution with an input tensor of shape (**input_size**, *nr_input_channels*) and a kernel of shape (**kernel_size**, *nr_input_channels*) as shown in the following image. It’s still 1D in the sense that the window only moves along a single axis, but we do have a 2D convolution at every step in the sense that we are using a 2-dimensional kernel matrix.

For this example we chose *nr_input_channels* to be equal to 2. Now, instead of a kernel vector sliding over a 1-dimensional input sequence we have a *nr_input_channels* by **kernel_size **kernel matrix sliding along a *nr_input_channels* wide series of length **input_length**.

If both *nr_input_channels* and *nr_output_channels* are larger than 1, the above process is simply repeated for every output channel with a different kernel matrix. The output vectors are then stacked on top of each other resulting in an output tensor of shape (**input_length**, *nr_output_channels*). The number of kernel weights in this case is equal to **kernel_size****nr_input_channels***nr_output_channels*.

The two variables *nr_input_channels* and *nr_output_channels* depend on the position of the layer within the network. The first layer will have *nr_input_channels* = **input_size**, and the last layer will have *nr_output_channels* = **output_size.** All other layers will use the intermediate channel number given by **num_filters**.

## Causal Convolution

For a convolutional layer to be causal, for every i in {0, …, **input_length** — 1} the i’th element of the output sequence may only depend on the elements of the input sequence with indices {0, …, i}. In other words, an element in the output sequence can only depend on elements that come before it in the input sequence. As mentioned before, to ensure an output tensor has the same length as the input tensor, we need to apply zero padding. If we only apply zero-padding on the left side of the input tensor, then causal convolution will be ensured. To understand this, consider the rightmost output element. Given that there is no padding on the right side of the input sequence, the last element it depends on is the last element of the input. Now consider the second to last output element of the output sequence. Its kernel window is shifted to the left by one compared to the last output element, that means its rightmost dependency in the input sequence is the second to last element of the input sequence. It follows by induction that for every element in the output sequence, its latest dependency in the input sequence has the same index as itself. The following figure shows an example with an **input_length** of 4 and a **kernel_size** of 3.

We can see that with a left zero-padding of 2 entries we can achieve the same output length while obeying the causality rule. In fact, without dilation, the number of zero-padding entries required for maintaining the input length is always equal to **kernel_size – **1.

## Dilation

One desirable quality of a forecasting model is that the value of a specific entry in the output depends on all previous entries in the input, i.e. all entries that have an index smaller or equal to itself. This is achieved when the receptive field, meaning the set of entries of the original input that affect a specific entry of the output, has size **input_length**. We also call this ‘full history coverage’. As we have seen before, one conventional convolutional layer makes an entry in the output dependent on the **kernel_size** entries of the input that have an index smaller or equal to itself. For instance, if we have a **kernel_size** of 3, the 5th element in the output will be dependent on elements 3, 4 and 5 of the input. This reach is expanded when we stack multiple layers on top of each other. In the following figure we can see that by stacking two layers with **kernel_size** 3 we get a receptive field size of 5.

More generally, a 1D convolutional network with *n* layers and a **kernel_size***k* has a receptive field *r* of size

To know how many layers are needed for full coverage, we can set the receptive field size to **input_length ***l* and solve for the number of layers *n *(we need to round up in case of non-integer values):

This means that, given a fixed **kernel_size**, the number of layers required for full history coverage is linear in the length of the input tensor, which will result in networks that become very deep very fast, leading to models with a very large number of parameters that take longer to train. Furthermore, a high number of layers has been shown to lead to degradation problems related to the gradient of the loss function. One way to increase the receptive field size while still keeping the number of layers relatively small is to introduce dilation to the convolutional network.

Dilation in the context of a convolutional layer refers to the distance between the elements of the input sequence that are used to compute one entry of the output sequence. So a conventional convolutional layer could be seen as a 1-dilated layer, since the input elements for 1 output value are adjacent. The following figure shows an example of a 2-dilated layer with an **input_length** of 4 and a **kernel_size** of 3.

Compared to the 1-dilated case, this layer has a receptive field that spreads across a length of 5 rather than 3. More generally, a *d*-dilated layer with a kernel size *k* has a receptive field spreading across a length of 1+d*(k-1). If *d* is fixed, this will still require a number linear in the length of the input tensor to achieve full receptive field coverage (we just decreased the constant).

This problem can be addressed by increasing the value of *d* exponentially as we move up through the layers. For this, we choose a constant **dilation_base** integer *b* which will let us compute the dilation *d *of a specific layer as a function of the number of layers below it, *i*, as *d = b**i.* The following figure shows a network with an **input_length** of 10, a **kernel_size** of 3 and a **dilation_base** of 2 which results in 3 dilated convolutional layers for full coverage.

Here we only show the influence of inputs that affect the last value of the output. Likewise, only zero-padding entries necessary for the last output value are shown. Clearly, the last output value is dependent on the whole input coverage. Actually, given the hyperparameters, an **input_length** of up to 15 could be used while maintaining full receptive field coverage. Generally speaking, every additional layer adds a value of *d*(k-1)* to the current receptive field width, where *d* is computed as *d=b**i*, with *i*representing the number of layers below our new layer. Consequently, the width of the receptive field *w* of a TCN with exponential dilation of base *b*, kernel size *k* and number of layers *n* is given by

However, depending on the values *b* and *k*, this receptive field can have ‘holes’. Consider the following network with a **dilation_base** of 3 and a kernel size of 2:

The receptive field does cover a range that is larger than the input size (namely 15). However, the receptive field has holes in it; that is there are entries in the input sequence that the output value is not dependent on (shown above in red). To solve this problem, we need to either increase the kernel size to 3, or decrease the dilation base to 2. Generally speaking, for a receptive field with no holes, the kernel size k has to be at least as big as the dilation base b.

Considering these observations, we can compute how many layers our network needs for full history coverage. Given a kernel size *k*, dilation base *b*, where *k* ≥ *b*, and input length *l,* the following inequality must hold for full history coverage:

We can solve for *n* and get the minimum number of required layers as

We can see that the number of layers is now logarithmic rather than linear in the length of the input. This constitutes a significant improvement, which can be achieved without sacrificing receptive field coverage.

Now the only thing left to specify is the number of zero-padding entries required at each layer. Given a dilation base *b*, a kernel size *k* and a number of *i* layers below our current layer, then the number of zero-padding entries *p* needed for the current layer are computed as follows:

## Basic TCN Overview

Given **input_length**, **kernel_size**, **dilation_base** and the minimum number of layers required for full history coverage, the basic TCN network would look something like this:

## Forecasting

So far we have only talked about the ‘input sequence’ and the ‘output sequence’ without getting into how they relate to each other. In the context of forecasting, we want to predict the next entries of a time series into the future. To train our TCN network to do forecasting, the training set will consist of (input sequence, target sequence)-pairs of equally-sized subsequences of the given time series. A target series will be a series that is shifted forward in relation to its respective input series by a certain number **output_length**. This means that a target series of length **input_length**contains the last (**input_length – output_length**) elements of its respective input sequence as first elements, and the **output_length** elements that come after the last entry of the input series as its final elements. In the context of forecasting, this means that the maximum forecasting horizon that can be predicted with such a model is equal to **output_length**. Using a sliding window approach, many overlapping pairs of input and target sequences can be created out of one time series.

## Improvements to the Model

S. Bai et al. (*) suggest a few additions to the basic TCN architecture for improved performance which will be discussed in this section, namely residual connections, regularization and activation functions.

### Residual Blocks

The biggest modification we make to the previously introduced basic model is to change the fundamental building block of the model from a simple 1D causal convolutional layer to a residual block which consists of 2 layers with the same dilation factor and a residual connection.

Let’s consider a layer with a dilation factor *d* of 2 and kernel size *k* of 3 from the basic model to see how this translates into a residual block of the improved model.

becomes

The output of the two convolutional layers will be added to the input of the residual block to produce the input for the next block. For all inner blocks of the network, i.e. all but the first and the last one, the input and output channel width is the same, namely **num_filters**. Since the first convolutional layer of the first residual block and the second convolutional layer of the last residual block may have different input and output channel widths, the width of the residual tensor might have to be adjusted, which is done using a 1×1 convolution.

This change affects the calculus for the minimum number of required layers for full coverage. Now we have to think about how many residual blocks are necessary to achieve a full receptive field coverage. Adding a residual block to a TCN adds twice as much receptive field width than when adding a basic causal layer, since it includes 2 such layers. So the total size of the receptive field *r* of a TCN with dilation base *b*, kernel size* k* with k ≥ b and number of residual blocks *n* can be computed as

which leads to a minimum number of residual blocks *n* for full history coverage of **input_length** *l* of

### Activation, Normalization, Regularization

To make our TCN more than just an overly complex linear regression model, activation functions need to be added on top of the convolutional layers to introduce non-linearities. ReLU activations are added to the residual blocks after both convolutional layers.

To normalize the input of hidden layers (which counteracts the exploding gradient problem among other things), weight normalization is applied to every convolutional layer.

In order to prevent overfitting, regularization is introduced via dropout after every convolutional layer in every residual block. The following figure shows the final residual block.

The asterisk in the second ReLU unit indicates that it is present in every layer but the last one, since we want our final output to be able to take on negative values as well (this differs from the architecture outlined in the paper).

## Final Model

The following picture shows our final TCN model with *l* equal to **input_length, ***k *equal to **kernel_size**, *b* equal to **dilation_base, ***k* ≥ *b* and with a minimum number of residual blocks for full history coverage *n,*where *n* can be computed from the other values as explained above.

## Example

Let’s take a look at an example of how we can use the TCN architecture to forecast a time series using the *Darts* library.

First, we need a time series to train and evaluate our model on. For this, we use a Kaggle dataset containing hourly energy production data from Spain. More specifically, we choose to predict the ‘run-of-river hydroelectricity’ production. Furthermore, to make the problem less computation-intensive, we average the energy production over each day to get a daily time series.

from darts import TimeSeries from darts.dataprocessing.transformers import MissingValuesFiller import pandas as pd df = pd.read_csv('energy_dataset.csv', delimiter=",") df['time'] = pd.to_datetime(df['time'], utc=True) df['time']= df.time.dt.tz_localize(None) df_day_avg = df.groupby(df['time'].astype(str).str.split(" ").str[0]).mean().reset_index() value_filler = MissingValuesFiller() series = value_filler.transform(TimeSeries.from_dataframe(df_day_avg, 'time', ['generation hydro run-of-river and poundage'])) series.plot()

We can see that, in addition to a yearly seasonality, there are regular ‘spikes’ in energy production that come in monthly intervals. Since the TCN model supports multiple input channels, we can add additional time series components to the current time series that encode the current day of the month. This can help our TCN model to converge faster.

`series = series.add_datetime_attribute('day', one_hot=True)`

Now we split our data into train and validation components and perform standardization.

from darts.dataprocessing.transformers import Scaler train, val = series.split_after(pd.Timestamp('20170901')) scaler = Scaler() train_transformed = scaler.fit_transform(train) val_transformed = scaler.transform(val) series_transformed = scaler.transform(series)

Now it’s time to create and train our TCN model. Note that all **bold variable names** that have appeared in the above description of the architecture can be used as arguments for the constructor of the *Darts* TCN implementation. The **output_length** parameter is set to 7 since we want to perform weekly forecasts. When training the model, we specify as **target_series** only the first component of our training series, since we do not want to predict the helper time series we added earlier. We tried out a few different hyperparameter combinations, but most of the values were chosen quite arbitrarily.

from darts.models import TCNModel model = TCNModel( input_size=train.width, n_epochs=20, input_length=365, output_length=7, dropout=0, dilation_base=2, weight_norm=True, kernel_size=7, num_filters=4, random_state=0 ) model.fit( training_series=train_transformed, target_series=train_transformed['0'], val_training_series=val_transformed, val_target_series=val_transformed['0'], verbose=True )

To evaluate our model, we want to test its performance using a 7-day forecasting horizon on many different points in time in the validation set. For this, we use the historic backtesting functionality from *Darts*. Note that the model is provided new input data for every prediciton, but it is never retrained. To save time, we set the stride to 5.

```
pred_series = model.backtest(
series_transformed,
target_series=series_transformed['0'],
start=pd.Timestamp('20170901'),
forecast_horizon=7,
stride=5,
retrain=False,
verbose=True,
use_full_output_length=True
)
```

Let’s visualize the historic forecast predictions of our TCN model against the ground truth data points and compute the R2 score.

from darts.metrics import r2_score import matplotlib.pyplot as plt series_transformed[900:]['0'].plot(label='actual') pred_series.plot(label=('historic 7 day forecasts')) r2_score_value = r2_score(series_transformed['0'], pred_series) plt.title('R2:' + str(r2_score_value)) plt.legend()

For more details and additional examples, please check out our TCN examples notebook on GitHub.

### Conclusion

Deep learning in sequence modeling is, for the most part, still widely associated with recurrent neural network architectures. But research has shown that these types of models can be outperformed in many tasks by a TCN, both in terms of predictive performance and efficiency. In this article we explored how this promising model can be understood in terms of simple building blocks such as 1D convolutional layers, dilations and residual connections, and how they all fit together. Furthermore, we successfully applied the current *Darts* implementation of the TCN architecture to predict a real-world time series.

Thanks to Julien Herzen.