Time Series Convolution

A convolutional kernel approach for reinforcing the modeling of time series trends and interpreting temporal patterns, which leverages Fourier transforms and sparse representations. Interpretable machine learning models such as sparse regression unlock opportunities to better capture the long-term changes and temporal patterns of real-world time series.

(Updated on October 29, 2024)


In this post, we intend to explain the essential ideas of our latest research work:

Content:

  • Motivation
  • Preliminaries
  • Circulant matrix nuclear norm minimization
  • Laplacian convolutional representation
  • Learning interpretable convolutional kernels


I. Motivation

Convolution is one of the most commonly used operations in applied mathematics and signal processing, and it has been widely applied to machine learning problems. We hope to revisit the essential ideas of circular convolution and lay an insightful foundation for modeling time series data.

Though nowadays we have a lot of machine learning algorithms on hand, it is still necessary to address the following challenges:

  • How to characterize global time series trends?
  • How to characterize local time series trends?
  • How to learn interpretable local and nonlocal patterns as convolutional kernels?

Time series often exhibit complicated trends when they are not stationary, as illustrated in Figure 1.


Figure 1. Illustration of time series trends: (a) global trends; (b) local trends.


II. Preliminaries

In this study, we build the modeling concepts of Laplacian convolutional representation (LCR) upon several key ideas from the fields of signal processing and machine learning, including circular convolution, discrete Fourier transform, and fast Fourier transform. In the following sections, we will discuss:

  • What are circular convolution, convolution matrix, and circulant matrix?
  • What is the convolution theorem?
  • How can the fast Fourier transform be used to compute the circular convolution?

II-A. Circular Convolution

Convolution is a powerful operation in many classical deep learning frameworks, such as convolutional neural networks (CNNs). In the context of discrete sequences (typically vectors), circular convolution refers to the convolution of two discrete sequences of data, and it plays an important role in maximizing the efficiency of certain common filtering operations (see circular convolution on Wikipedia).

By definition, for any vectors $\boldsymbol{x}=(x_1,x_2,\ldots,x_T)^\top\in\mathbb{R}^{T}$ and $\boldsymbol{y}=(y_1,y_2,\ldots,y_\tau)^\top\in\mathbb{R}^{\tau}$ with $\tau\leq T$, the circular convolution (denoted by the symbol $\star$) of these two vectors is formulated as follows,

$$\boldsymbol{z}=\boldsymbol{x}\star\boldsymbol{y}\in\mathbb{R}^{T}$$

Element-wise, we have

$$z_{t}=\sum_{k=1}^{\tau}x_{t-k+1}y_{k},\quad\forall t\in\{1,2,\ldots,T\}$$

where $z_t$ represents the $t$th entry of $\boldsymbol{z}$. For a cyclical operation, it takes $x_{t-k+1}=x_{t-k+1+T}$ when $t+1\leq k$. While the definition of circular convolution might seem complicated for beginners, it becomes much clearer when you consider the concept of a convolution or circulant matrix, as demonstrated in the examples provided below.

As mentioned above, the vector $\boldsymbol{x}$ has a length of $T$, and the vector $\boldsymbol{y}$ has a length of $\tau$. According to the definition of circular convolution, the resulting vector $\boldsymbol{z}=\boldsymbol{x}\star\boldsymbol{y}$ will also have a length of $T$, matching the length of the original vector $\boldsymbol{x}$.

Given any vectors $\boldsymbol{x}\in\mathbb{R}^{T}$ and $\boldsymbol{y}\in\mathbb{R}^{\tau}$, the circular convolution between them can be expressed as follows,

$$\boldsymbol{x}\star\boldsymbol{y}=\begin{pmatrix}x_1y_1+x_0y_2+\cdots+x_{2-\tau}y_{\tau}\\ x_2y_1+x_1y_2+\cdots+x_{3-\tau}y_{\tau}\\ \vdots\\ x_Ty_1+x_{T-1}y_2+\cdots+x_{T-\tau+1}y_{\tau}\end{pmatrix}$$

where $x_{t-k+1}=x_{t-k+1+T}$ for $t+1\leq k$ according to the definition.

In this case, each entry $z_t$ of the resulting vector is computed as the inner product between the vector $\boldsymbol{y}$ and a reversed and truncated version of the vector $\boldsymbol{x}$. Specifically, the entry $z_1$ is obtained by computing the inner product between $\boldsymbol{y}$ and $(x_1,x_T,x_{T-1},\ldots,x_{T-\tau+2})^\top$. For subsequent entries $z_t,\,t=2,3,\ldots,T$, the vector $\boldsymbol{x}$ is cyclically shifted in reverse and truncated (only preserving the first $\tau$ entries), and the inner product with $\boldsymbol{y}$ is calculated. Figure 2 illustrates the basic steps for computing each entry of the circular convolution.


Figure 2. Illustration of the circular convolution between $\boldsymbol{x}$ and $\boldsymbol{y}$. (a) Computing the entry $z_1$ involves $\boldsymbol{y}$ and a reversed, truncated copy of $\boldsymbol{x}$. (b) Computing the entry $z_2$ involves a cyclic shift of that copy. The figure is inspired by Prince (2023).


As can be seen, circular convolution between two vectors can essentially be viewed as a linear operation. This perspective allows one to reformulate the circular convolution as a linear transformation using a convolution matrix (or a circulant matrix when these two vectors have the same length).



Example 1. Given vectors $\boldsymbol{x}=(0,1,2,3,4)^\top$ and $\boldsymbol{y}=(2,-1,3)^\top$, the circular convolution $\boldsymbol{z}=\boldsymbol{x}\star\boldsymbol{y}$ has the following entries:

$$\begin{cases}z_1=x_1y_1+x_0y_2+x_{-1}y_3=0\times 2+4\times(-1)+3\times 3=5\\ z_2=x_2y_1+x_1y_2+x_0y_3=1\times 2+0\times(-1)+4\times 3=14\\ z_3=x_3y_1+x_2y_2+x_1y_3=2\times 2+1\times(-1)+0\times 3=3\\ z_4=x_4y_1+x_3y_2+x_2y_3=3\times 2+2\times(-1)+1\times 3=7\\ z_5=x_5y_1+x_4y_2+x_3y_3=4\times 2+3\times(-1)+2\times 3=11\end{cases}$$

where $x_0=x_5$ and $x_{-1}=x_4$ according to the definition. As a result, we have $\boldsymbol{z}=(5,14,3,7,11)^\top$.
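
To make the element-wise definition concrete, here is a quick numpy check of Example 1; the helper function circ_conv below is our own minimal sketch of the definition, not part of the original implementation.

import numpy as np

def circ_conv(x, y):
    """Circular convolution via the element-wise definition."""
    T = x.shape[0]
    z = np.zeros(T)
    for t in range(T):
        for k in range(y.shape[0]):
            z[t] += x[(t - k) % T] * y[k]  # cyclical indexing wraps around
    return z

x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3])
print(circ_conv(x, y))  # expected: [ 5. 14.  3.  7. 11.]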



II-B. Convolution Matrix

Using the notations above, for any vectors $\boldsymbol{x}\in\mathbb{R}^{T}$ and $\boldsymbol{y}\in\mathbb{R}^{\tau}$ with $\tau\leq T$, the circular convolution can be expressed as a linear transformation:

$$\boldsymbol{x}\star\boldsymbol{y}=\mathcal{C}_{\tau}(\boldsymbol{x})\,\boldsymbol{y}$$

where $\mathcal{C}_{\tau}:\mathbb{R}^{T}\to\mathbb{R}^{T\times\tau}$ denotes the convolution operator. The convolution matrix can be represented as follows,

$$\mathcal{C}_{\tau}(\boldsymbol{x})=\begin{bmatrix}x_1&x_T&\cdots&x_{T-\tau+2}\\ x_2&x_1&\cdots&x_{T-\tau+3}\\ \vdots&\vdots&\ddots&\vdots\\ x_T&x_{T-1}&\cdots&x_{T-\tau+1}\end{bmatrix}\in\mathbb{R}^{T\times\tau}$$

In signal processing, this linear transformation is a fundamental property of circular convolution, highlighting its role in efficiently implementing filtering operations.


Figure 3. Illustration of circular convolution as the linear transformation with a convolution matrix.



Example 2. Given vectors $\boldsymbol{x}=(0,1,2,3,4)^\top$ and $\boldsymbol{y}=(2,-1,3)^\top$, the circular convolution $\boldsymbol{z}=\boldsymbol{x}\star\boldsymbol{y}$ can be expressed as:

$$\boldsymbol{z}=\mathcal{C}_{3}(\boldsymbol{x})\,\boldsymbol{y}$$

where $\mathcal{C}_{3}(\boldsymbol{x})$ is the convolution matrix with $\tau=3$ columns. Specifically, the convolution matrix is structured as follows,

$$\mathcal{C}_{3}(\boldsymbol{x})=\begin{bmatrix}0&4&3\\ 1&0&4\\ 2&1&0\\ 3&2&1\\ 4&3&2\end{bmatrix}$$

As a result, it gives

$$\boldsymbol{z}=\mathcal{C}_{3}(\boldsymbol{x})\,\boldsymbol{y}=(5,14,3,7,11)^\top$$

This representation shows that the circular convolution is equivalent to a matrix-vector multiplication, making it easier to understand, especially in signal processing applications.



In this post, we aim to make the concepts clear and accessible by incorporating programming code, intuitive illustrations, and detailed explanations of the formulas. To demonstrate how circular convolution can be computed, we will use Python's numpy library. First, we will construct the convolution matrix of the vector $\boldsymbol{x}$ and then perform the circular convolution as follows.


import numpy as np

def conv_mat(vec, tau):
    """Build the convolution matrix of `vec` with `tau` columns."""
    n = vec.shape[0]
    mat = np.zeros((n, tau))
    mat[:, 0] = vec
    for i in range(1, tau):
        # Each column is the previous one cyclically shifted down by one
        mat[:, i] = np.append(vec[-i :], vec[: n - i], axis = 0)
    return mat

x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3])
mat = conv_mat(x, y.shape[0])
print('Convolution matrix of x with 3 columns:')
print(mat)
print()
z = mat @ y
print('Circular convolution of x and y:')
print(z)
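
in which the outputs are

Convolution matrix of x with 3 columns:
[[0. 4. 3.]
 [1. 0. 4.]
 [2. 1. 0.]
 [3. 2. 1.]
 [4. 3. 2.]]

Circular convolution of x and y:
[ 5. 14.  3.  7. 11.]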


II-C. Circulant Matrix

Recall that the convolution matrix $\mathcal{C}_{\tau}(\boldsymbol{x})$ is specified with $\tau$ columns, corresponding to the length of the vector $\boldsymbol{y}$. In the case of $\tau=T$, i.e., with $T$ columns, the convolution matrix becomes a square matrix, known as a circulant matrix. In this study, we emphasize the importance of circulant matrices and their properties, such as their strong connection with circular convolution and the discrete Fourier transform, even though we do not work directly with circulant matrices.

For any vectors $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{T}$, the circular convolution $\boldsymbol{x}\star\boldsymbol{y}$ can be expressed as a linear transformation such that

$$\boldsymbol{x}\star\boldsymbol{y}=\mathcal{C}(\boldsymbol{x})\,\boldsymbol{y}$$

where $\mathcal{C}:\mathbb{R}^{T}\to\mathbb{R}^{T\times T}$ denotes the circulant operator. The circulant matrix is defined as:

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}x_1&x_T&x_{T-1}&\cdots&x_2\\ x_2&x_1&x_T&\cdots&x_3\\ x_3&x_2&x_1&\cdots&x_4\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ x_T&x_{T-1}&x_{T-2}&\cdots&x_1\end{bmatrix}\in\mathbb{R}^{T\times T}$$

which forms a square matrix.


Figure 4. Illustration of circular convolution as the linear transformation with a circulant matrix.



Example 3. Given vectors $\boldsymbol{x}=(0,1,2,3,4)^\top$ and $\boldsymbol{y}=(2,-1,3)^\top$, the circular convolution $\boldsymbol{x}\star\boldsymbol{y}$ is identical to $\boldsymbol{x}\star\tilde{\boldsymbol{y}}$ with the padded vector $\tilde{\boldsymbol{y}}=(2,-1,3,0,0)^\top$, namely,

$$\boldsymbol{x}\star\boldsymbol{y}=\boldsymbol{x}\star\tilde{\boldsymbol{y}}=\mathcal{C}(\boldsymbol{x})\,\tilde{\boldsymbol{y}}$$

where $\mathcal{C}(\boldsymbol{x})$ is the circulant matrix formed from $\boldsymbol{x}$, defined as:

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}0&4&3&2&1\\ 1&0&4&3&2\\ 2&1&0&4&3\\ 3&2&1&0&4\\ 4&3&2&1&0\end{bmatrix}$$

Thus, the result can be written as follows,

$$\boldsymbol{x}\star\boldsymbol{y}=\mathcal{C}(\boldsymbol{x})\,\tilde{\boldsymbol{y}}=(5,14,3,7,11)^\top$$

The example shows that the circular convolution of $\boldsymbol{x}$ and $\boldsymbol{y}$ is equivalent to the circular convolution of $\boldsymbol{x}$ and $\tilde{\boldsymbol{y}}$, i.e., $\boldsymbol{y}$ with its last two entries padded with zeros. Thus, to compute the circular convolution of $\boldsymbol{x}\in\mathbb{R}^{T}$ and $\boldsymbol{y}\in\mathbb{R}^{\tau}$ when $T$ is greater than $\tau$, one can simply append $T-\tau$ zeros to the end of $\boldsymbol{y}$ and perform the circular convolution.
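
As a quick sanity check, one can reuse the conv_mat function defined above: padding $\boldsymbol{y}$ with two zeros and multiplying by the full convolution matrix with $T=5$ columns (which is exactly the circulant matrix) reproduces the same result. This snippet is a sketch assuming conv_mat has been defined as in the previous code block.

import numpy as np

x = np.array([0, 1, 2, 3, 4])
y_tilde = np.array([2, -1, 3, 0, 0])  # y padded with two zeros
mat = conv_mat(x, 5)  # with 5 columns, conv_mat yields the circulant matrix C(x)
print('Circular convolution of x and padded y:')
print(mat @ y_tilde)  # expected: [ 5. 14.  3.  7. 11.]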



II-D. Discrete Fourier Transform

The discrete Fourier transform (see Wikipedia) is a fundamental tool in mathematics and signal processing with widespread applications to machine learning. The discrete Fourier transform is the key discrete transform used for Fourier analysis, enabling the decomposition of a signal into its constituent frequencies. The fast Fourier transform is an efficient algorithm for computing the discrete Fourier transform (see the difference between discrete Fourier transform and fast Fourier transform), significantly reducing the time complexity from $\mathcal{O}(T^2)$ to $\mathcal{O}(T\log T)$, where $T$ is the number of data points. This efficiency makes the fast Fourier transform essential for processing large problems.

A crucial concept in signal processing is the convolution theorem, which states that convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. This implies that the circular convolution can be efficiently computed using the fast Fourier transform. The convolution theorem for discrete Fourier transform is summarized as follows,

$$\boldsymbol{x}\star\boldsymbol{y}=\mathcal{F}^{-1}(\mathcal{F}(\boldsymbol{x})\circ\mathcal{F}(\boldsymbol{y}))$$

or

$$\mathcal{F}(\boldsymbol{x}\star\boldsymbol{y})=\mathcal{F}(\boldsymbol{x})\circ\mathcal{F}(\boldsymbol{y})$$

where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the discrete Fourier transform and the inverse discrete Fourier transform, respectively. The symbol $\circ$ represents the Hadamard product, i.e., element-wise multiplication.

In fact, this principle underlies many efficient algorithms in signal processing and data analysis, allowing complex operations to be performed efficiently in the frequency domain.



Example 4. Given vectors $\boldsymbol{x}=(0,1,2,3,4)^\top$ and $\boldsymbol{y}=(2,-1,3,0,0)^\top$, the circular convolution $\boldsymbol{z}=\boldsymbol{x}\star\boldsymbol{y}$ can be computed via the use of the fast Fourier transform in numpy as follows,

import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3, 0, 0])
fx = np.fft.fft(x)
fy = np.fft.fft(y)
z = np.fft.ifft(fx * fy).real
print('Fast Fourier transform of x:')
print(fx)
print('Fast Fourier transform of y:')
print(fy)
print('Circular convolution of x and y:')
print(z)

in which the outputs are

Fast Fourier transform of x:
[10. +0.j         -2.5+3.4409548j  -2.5+0.81229924j -2.5-0.81229924j
 -2.5-3.4409548j ]
Fast Fourier transform of y:
[ 4.        +0.j         -0.73606798-0.81229924j  3.73606798+3.4409548j
  3.73606798-3.4409548j  -0.73606798+0.81229924j]
Circular convolution of x and y:
[ 5. 14.  3.  7. 11.]




III. Circulant Matrix Nuclear Norm Minimization

Circulant matrices are fundamental in many computational and theoretical aspects of signal processing and machine learning, providing an efficient framework for implementing various algorithms such as circulant matrix nuclear norm minimization. By definition, a circulant matrix is a special square matrix in which each row shifts the previous row to the right by one position, with the last entry wrapping around to the first position. Having already discussed circulant matrices above, we now present the circulant matrix nuclear norm, its minimization problem, and applications.

III-A. Definition

Nuclear norm is a key concept in matrix computations and convex optimization, frequently applied in low-rank matrix approximation and completion problems. For any matrix $\boldsymbol{X}\in\mathbb{R}^{m\times n}$, the nuclear norm is defined as the sum of singular values:

$$\|\boldsymbol{X}\|_{*}=\sum_{t=1}^{r}s_{t}$$

where $\|\cdot\|_{*}$ denotes the nuclear norm. As illustrated in Figure 5, the singular values are $s_1\geq s_2\geq\cdots\geq s_r\geq 0$ with $r=\min\{m,n\}$.


Figure 5. Singular value decomposition of matrix $\boldsymbol{X}\in\mathbb{R}^{m\times n}$. In the decomposed matrices, the unitary matrix $\boldsymbol{W}$ (or $\boldsymbol{Q}$) consists of orthogonal left (or right) singular vectors, while the diagonal entries of $\boldsymbol{S}$ are singular values such that $s_1\geq s_2\geq\cdots\geq s_r\geq 0$. Note that $r=\min\{m,n\}$ for notational convenience.



Example 5. Given the vector $\boldsymbol{x}=(0,1,2,3,4)^\top$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}0&4&3&2&1\\ 1&0&4&3&2\\ 2&1&0&4&3\\ 3&2&1&0&4\\ 4&3&2&1&0\end{bmatrix}$$

Thus, the singular values are

$$s_1=10,\quad s_2=s_3\approx 4.2533,\quad s_4=s_5\approx 2.6287$$

As a result, we have the nuclear norm as follows,

$$\|\mathcal{C}(\boldsymbol{x})\|_{*}=\sum_{t=1}^{5}s_t\approx 23.7638$$

Please reproduce the results by using the following numpy implementation.

import numpy as np

def circ_mat(vec):
    """Build the circulant matrix of `vec` (a square convolution matrix)."""
    n = vec.shape[0]
    mat = np.zeros((n, n))
    mat[:, 0] = vec
    for i in range(1, n):
        mat[:, i] = np.append(vec[-i :], vec[: n - i], axis = 0)
    return mat

x = np.array([0, 1, 2, 3, 4])
mat = circ_mat(x)
w, s, q = np.linalg.svd(mat, full_matrices = False)
print('Singular values of C(x):')
print(s)
print('Nuclear norm of C(x):')
print(np.sum(s))




III-B. Property

One of the most intriguing properties of circulant matrices is that they are diagonalizable by the discrete Fourier transform matrix. The eigenvalue decomposition of the circulant matrix $\mathcal{C}(\boldsymbol{x})$ (constructed from any vector $\boldsymbol{x}\in\mathbb{R}^{T}$) is given by

$$\mathcal{C}(\boldsymbol{x})=\boldsymbol{F}^{H}\operatorname{diag}(\mathcal{F}(\boldsymbol{x}))\,\boldsymbol{F}$$

where $\boldsymbol{F}$ is the unitary discrete Fourier transform matrix, $\boldsymbol{F}^{H}$ is the Hermitian transpose of $\boldsymbol{F}$, and $\operatorname{diag}(\mathcal{F}(\boldsymbol{x}))$ is a diagonal matrix containing the eigenvalues of $\mathcal{C}(\boldsymbol{x})$, i.e., the discrete Fourier transform of $\boldsymbol{x}$. Due to this property, the nuclear norm of the circulant matrix can be formulated as the $\ell_1$-norm of the discrete Fourier transform of the vector $\boldsymbol{x}$:

$$\|\mathcal{C}(\boldsymbol{x})\|_{*}=\|\mathcal{F}(\boldsymbol{x})\|_{1}$$

This relationship draws a strong connection between circulant matrices and Fourier analysis, enabling efficient computation and analysis in various applications.



Example 6. Given the vector $\boldsymbol{x}=(0,1,2,3,4)^\top$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}0&4&3&2&1\\ 1&0&4&3&2\\ 2&1&0&4&3\\ 3&2&1&0&4\\ 4&3&2&1&0\end{bmatrix}$$

The eigenvalues of $\mathcal{C}(\boldsymbol{x})$ and the fast Fourier transform of $\boldsymbol{x}$ can be computed using numpy as follows.

import numpy as np

x = np.array([0, 1, 2, 3, 4])
mat = circ_mat(x)  # circ_mat as defined in Example 5
eigval, eigvec = np.linalg.eig(mat)
print('Eigenvalues of C(x):')
print(eigval)
print('Fast Fourier transform of x:')
print(np.fft.fft(x))

in which the outputs are

Eigenvalues of C(x):
[10. +0.j         -2.5+3.4409548j  -2.5-3.4409548j  -2.5+0.81229924j
 -2.5-0.81229924j]
Fast Fourier transform of x:
[10. +0.j         -2.5+3.4409548j  -2.5+0.81229924j -2.5-0.81229924j
 -2.5-3.4409548j ]

In this case, the $\ell_1$-norm of the complex-valued $\hat{\boldsymbol{x}}=\mathcal{F}(\boldsymbol{x})$ (the imaginary unit is defined as $i=\sqrt{-1}$) is given by

$$\|\hat{\boldsymbol{x}}\|_{1}=\sum_{t=1}^{5}|\hat{x}_t|=10+2\times|-2.5+3.4409548i|+2\times|-2.5+0.81229924i|\approx 23.7638$$

which matches the nuclear norm of $\mathcal{C}(\boldsymbol{x})$ computed in Example 5.



III-C. Optimization

For any partially observed time series in the form of a vector $\boldsymbol{y}\in\mathbb{R}^{T}$ with the observed index set $\Omega$, solving the circulant matrix nuclear norm minimization allows one to reconstruct missing values in time series. The problem is formulated as follows,

$$\begin{aligned}\min_{\boldsymbol{x}}\;&\|\mathcal{C}(\boldsymbol{x})\|_{*}\\ \text{s.t.}\;&\|\mathcal{P}_{\Omega}(\boldsymbol{x}-\boldsymbol{y})\|_{2}\leq\epsilon\end{aligned}$$

where $\epsilon$ in the constraint represents the tolerance of errors between the reconstructed time series $\boldsymbol{x}$ and the partially observed time series $\boldsymbol{y}$, and $\mathcal{P}_{\Omega}(\cdot)$ denotes the orthogonal projection onto the observed index set $\Omega$.

In this case, one can rewrite the constraint as a penalty term (weighted by $\gamma$) in the objective function:

$$\min_{\boldsymbol{x}}\;\|\mathcal{C}(\boldsymbol{x})\|_{*}+\frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{x}-\boldsymbol{y})\|_{2}^{2}$$

Since the variable $\boldsymbol{x}$ is associated with both a circulant matrix nuclear norm and a penalty term, the first impulse is using variable separation to convert the problem into the following one:

$$\begin{aligned}\min_{\boldsymbol{x},\,\boldsymbol{z}}\;&\|\mathcal{C}(\boldsymbol{x})\|_{*}+\frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{z}-\boldsymbol{y})\|_{2}^{2}\\ \text{s.t.}\;&\boldsymbol{x}=\boldsymbol{z}\end{aligned}$$


III-D. Solution Algorithm

The augmented Lagrangian function is

$$\mathcal{L}(\boldsymbol{x},\boldsymbol{z},\boldsymbol{w})=\|\mathcal{C}(\boldsymbol{x})\|_{*}+\frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{z}-\boldsymbol{y})\|_{2}^{2}+\frac{\lambda}{2}\|\boldsymbol{x}-\boldsymbol{z}\|_{2}^{2}+\langle\boldsymbol{w},\boldsymbol{x}-\boldsymbol{z}\rangle$$

where the dual variable $\boldsymbol{w}\in\mathbb{R}^{T}$ is an estimate of the Lagrange multiplier, and $\lambda$ is a penalty parameter that controls the convergence rate.

Thus, the variables $\boldsymbol{x}$, $\boldsymbol{z}$, and the dual variable $\boldsymbol{w}$ can be updated iteratively as follows,

$$\begin{cases}\boldsymbol{x}:=\arg\min_{\boldsymbol{x}}\;\mathcal{L}(\boldsymbol{x},\boldsymbol{z},\boldsymbol{w})\\ \boldsymbol{z}:=\arg\min_{\boldsymbol{z}}\;\mathcal{L}(\boldsymbol{x},\boldsymbol{z},\boldsymbol{w})\\ \boldsymbol{w}:=\boldsymbol{w}+\lambda(\boldsymbol{x}-\boldsymbol{z})\end{cases}$$


III-E. $\boldsymbol{x}$-Subproblem

Solving for the variable $\boldsymbol{x}$ involves the discrete Fourier transform, the convolution theorem, and $\ell_1$-norm minimization in complex space. The optimization problem with respect to the variable $\boldsymbol{x}$ is given by

$$\boldsymbol{x}:=\arg\min_{\boldsymbol{x}}\;\|\mathcal{C}(\boldsymbol{x})\|_{*}+\frac{\lambda}{2}\Bigl\|\boldsymbol{x}-\boldsymbol{z}+\frac{\boldsymbol{w}}{\lambda}\Bigr\|_{2}^{2}$$

Using the discrete Fourier transform, the optimization problem can be converted into the following one in complex space:

$$\hat{\boldsymbol{x}}:=\arg\min_{\hat{\boldsymbol{x}}}\;\|\hat{\boldsymbol{x}}\|_{1}+\frac{\lambda}{2T}\|\hat{\boldsymbol{x}}-\hat{\boldsymbol{h}}\|_{2}^{2}$$

where the complex-valued variables $\hat{\boldsymbol{x}}=\mathcal{F}(\boldsymbol{x})$ and $\hat{\boldsymbol{h}}=\mathcal{F}(\boldsymbol{h})$ refer to $\boldsymbol{x}$ and $\boldsymbol{h}\triangleq\boldsymbol{z}-\boldsymbol{w}/\lambda$ in the frequency domain. The closed-form solution to the complex-valued variable $\hat{\boldsymbol{x}}$ is

$$\hat{x}_{t}:=\frac{\hat{h}_{t}}{|\hat{h}_{t}|}\cdot\max\{|\hat{h}_{t}|-T/\lambda,\,0\},\quad t=1,2,\ldots,T$$

with $\hat{h}_{t}$ being the $t$th entry of $\hat{\boldsymbol{h}}$. For reference, this closed-form solution can be found in Lemma 3.3 of Yang et al. (2009); see Eqs. (3.18) and (3.19) therein for the discussion of real-valued variables. In the sparsity-induced norm optimization of machine learning, this closed-form solution is also referred to as the proximal operator or shrinkage operator.


import numpy as np

def update_x(z, w, lmbda):
    T = z.shape[0]
    h = np.fft.fft(z - w / lmbda)  # h in the frequency domain
    # Shrinkage (soft thresholding) on the magnitudes with threshold T / lmbda
    temp = 1 - T / (lmbda * np.abs(h))
    temp[temp <= 0] = 0
    return np.fft.ifft(h * temp).real



Shrinkage Operator. For any vector $\boldsymbol{h}\in\mathbb{R}^{T}$ and parameter $\eta>0$, the closed-form solution to the optimization problem

$$\min_{\boldsymbol{x}}\;\|\boldsymbol{x}\|_{1}+\frac{\eta}{2}\|\boldsymbol{x}-\boldsymbol{h}\|_{2}^{2}$$

can be expressed as

$$\boldsymbol{x}:=\operatorname{sign}(\boldsymbol{h})\circ\max\{|\boldsymbol{h}|-1/\eta,\,0\}$$

or, element-wise,

$$x_{t}:=\frac{h_{t}}{|h_{t}|}\cdot\max\{|h_{t}|-1/\eta,\,0\},\quad t=1,2,\ldots,T$$


Figure 6. Illustration of the shrinkage operator for solving the $\ell_1$-norm minimization problem.
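
For real-valued inputs, the shrinkage operator reduces to the familiar soft-thresholding rule. Below is a minimal numpy sketch; the function name shrinkage and the test values are our own choices for illustration.

import numpy as np

def shrinkage(h, r):
    """Soft thresholding: shrink each entry of h toward zero by r."""
    return np.sign(h) * np.maximum(np.abs(h) - r, 0)

h = np.array([-3.0, -0.5, 0.2, 1.5])
print(shrinkage(h, 1.0))  # entries shrunk by 1: [-2, 0, 0, 0.5]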




III-F. $\boldsymbol{z}$-Subproblem

In terms of the variable $\boldsymbol{z}$, the partial derivative of the augmented Lagrangian function with respect to $\mathcal{P}_{\Omega}(\boldsymbol{z})$ is given by

$$\frac{\partial\mathcal{L}}{\partial\mathcal{P}_{\Omega}(\boldsymbol{z})}=\gamma\,\mathcal{P}_{\Omega}(\boldsymbol{z}-\boldsymbol{y})-\lambda\,\mathcal{P}_{\Omega}(\boldsymbol{x}-\boldsymbol{z})-\mathcal{P}_{\Omega}(\boldsymbol{w})$$

while

$$\frac{\partial\mathcal{L}}{\partial\mathcal{P}_{\Omega}^{\perp}(\boldsymbol{z})}=-\lambda\,\mathcal{P}_{\Omega}^{\perp}(\boldsymbol{x}-\boldsymbol{z})-\mathcal{P}_{\Omega}^{\perp}(\boldsymbol{w})$$

As a result, letting the partial derivative of the augmented Lagrangian function with respect to $\boldsymbol{z}$ be a zero vector, the least squares solution is given by

$$\boldsymbol{z}:=\frac{1}{\gamma+\lambda}\,\mathcal{P}_{\Omega}(\gamma\boldsymbol{y}+\lambda\boldsymbol{x}+\boldsymbol{w})+\frac{1}{\lambda}\,\mathcal{P}_{\Omega}^{\perp}(\lambda\boldsymbol{x}+\boldsymbol{w})$$

where the partial derivative with respect to the variable $\boldsymbol{z}$ is the combination of the derivatives with respect to the observed part $\mathcal{P}_{\Omega}(\boldsymbol{z})$ and the unobserved part $\mathcal{P}_{\Omega}^{\perp}(\boldsymbol{z})$.


import numpy as np

def update_z(y_train, pos_train, x, w, lmbda, gamma):
    z = x + w / lmbda  # unobserved entries take x + w / lmbda
    # Observed entries: weighted combination of the observations and x + w / lmbda
    z[pos_train] = (gamma * y_train + lmbda * z[pos_train]) / (gamma + lmbda)
    return z
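
Putting the two subproblems and the dual update together, a minimal sketch of the whole iterative scheme might look as follows. The function name circ_nnm, the initialization, and the fixed iteration count are our own assumptions; please refer to the LCR repository on GitHub for the reference implementation.

import numpy as np

def circ_nnm(y, pos, lmbda, gamma, maxiter = 100):
    # y: time series with placeholder values at missing points; pos: observed indices
    z = y.copy()
    w = np.zeros(y.shape[0])
    for _ in range(maxiter):
        x = update_x(z, w, lmbda)  # x-subproblem: shrinkage in the frequency domain
        z = update_z(y[pos], pos, x, w, lmbda, gamma)  # z-subproblem: least squares
        w = w + lmbda * (x - z)  # dual variable update
    return x

The dual update $\boldsymbol{w}:=\boldsymbol{w}+\lambda(\boldsymbol{x}-\boldsymbol{z})$ follows the standard scheme of the alternating direction method of multipliers (ADMM).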


III-G. Time Series Imputation

As shown in Figure 7, we randomly remove 95% of the observations as missing values, so that only 14 volume observations (i.e., 14 blue dots) remain for the reconstruction. The circulant matrix nuclear norm minimization can capture the global trends from these partial observations.


Figure 7. Univariate time series imputation on the freeway traffic volume time series. The blue and red curves correspond to the ground truth time series and the reconstructed time series achieved by the circulant matrix nuclear norm minimization, respectively.


Please reproduce the experiments by following the Jupyter Notebook, which is available at the LCR repository on GitHub. For the supplementary material, please check out Appendix I(A).


IV. Laplacian Convolutional Representation

V. Learning Interpretable Convolutional Kernels

V-A. Convolutional Kernels

On the univariate time series in the form of a vector $\boldsymbol{x}\in\mathbb{R}^{T}$, the circular convolution between the time series $\boldsymbol{x}$ and a convolutional kernel $\boldsymbol{\theta}\in\mathbb{R}^{T}$ can be constructed for certain purposes. The expression is formally defined by $\boldsymbol{x}\star\boldsymbol{\theta}$. When using this kernel to characterize the temporal correlations of time series, one simple yet interesting idea is making the loss of circular convolution, namely, $\|\boldsymbol{x}\star\boldsymbol{\theta}\|_{2}^{2}$, as small as possible. However, the optimization problem

$$\min_{\boldsymbol{\theta}}\;\|\boldsymbol{x}\star\boldsymbol{\theta}\|_{2}^{2}$$

is ill-posed because the optimal solution is all entries of the kernel $\boldsymbol{\theta}$ being zeros. To address this challenge, we assume that the first entry of $\boldsymbol{\theta}$ is one and the following entries are $-\boldsymbol{w}$, in which $\boldsymbol{w}\in\mathbb{R}^{T-1}$ is a non-negative vector. Thus, the optimization problem becomes

$$\begin{aligned}\min_{\boldsymbol{w}}\;&\|\boldsymbol{x}\star\boldsymbol{\theta}\|_{2}^{2}\\ \text{s.t.}\;&\boldsymbol{\theta}=\begin{pmatrix}1\\-\boldsymbol{w}\end{pmatrix},\;\boldsymbol{w}\geq 0\end{aligned}$$

where the constraint is of great significance for minimizing the objective function.


V-B. Sparse Linear Regression

Recall that the circular convolution can be converted into a linear transformation with a circulant matrix. Considering the kernel $\boldsymbol{\theta}=(1,-\boldsymbol{w}^\top)^\top$, the loss of circular convolution is equivalent to

$$\|\boldsymbol{x}\star\boldsymbol{\theta}\|_{2}^{2}=\|\mathcal{C}(\boldsymbol{x})\,\boldsymbol{\theta}\|_{2}^{2}=\|\boldsymbol{x}-\boldsymbol{A}\boldsymbol{w}\|_{2}^{2}$$

where the auxiliary matrix $\boldsymbol{A}\in\mathbb{R}^{T\times(T-1)}$ is comprised of the last $T-1$ columns of the circulant matrix $\mathcal{C}(\boldsymbol{x})$, namely,

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}\boldsymbol{x}&\boldsymbol{A}\end{bmatrix}$$

where the matrix $\mathcal{C}(\boldsymbol{x})$ is separated into the vector $\boldsymbol{x}$ (i.e., the first column of $\mathcal{C}(\boldsymbol{x})$) and the matrix $\boldsymbol{A}$ (i.e., the last $T-1$ columns of $\mathcal{C}(\boldsymbol{x})$); see Figure 8 and Figure 9 for illustrations.


Figure 8. Illustration of circular convolution as the linear transformation with a circulant matrix.


As can be seen, one of the most intriguing properties is that the circular convolution $\boldsymbol{x}\star\boldsymbol{\theta}$ can be converted into the expression $\boldsymbol{x}-\boldsymbol{A}\boldsymbol{w}$ (see Figure 9), which takes the form of a linear regression with the data pair $(\boldsymbol{x},\boldsymbol{A})$.


Figure 9. Illustration of circular convolution with the structured kernel $\boldsymbol{\theta}=(1,-\boldsymbol{w}^\top)^\top$.



Example 7. Given the vectors $\boldsymbol{x}=(0,1,2,3,4)^\top$ and $\boldsymbol{\theta}=(1,-\boldsymbol{w}^\top)^\top$ with $\boldsymbol{w}\in\mathbb{R}^{4}$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x})=\begin{bmatrix}0&4&3&2&1\\ 1&0&4&3&2\\ 2&1&0&4&3\\ 3&2&1&0&4\\ 4&3&2&1&0\end{bmatrix}$$

and the auxiliary matrix is

$$\boldsymbol{A}=\begin{bmatrix}4&3&2&1\\ 0&4&3&2\\ 1&0&4&3\\ 2&1&0&4\\ 3&2&1&0\end{bmatrix}$$

Thus, we can compute the circular convolution by

$$\boldsymbol{x}\star\boldsymbol{\theta}=\mathcal{C}(\boldsymbol{x})\,\boldsymbol{\theta}=\boldsymbol{x}-\boldsymbol{A}\boldsymbol{w}$$

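
The identity can be verified numerically; the snippet below reuses circ_mat from Example 5 and picks arbitrary non-negative weights $\boldsymbol{w}$ for illustration.

import numpy as np

x = np.array([0, 1, 2, 3, 4])
C = circ_mat(x)
A = C[:, 1 :]  # auxiliary matrix: the last 4 columns of C(x)
w = np.array([0.5, 0.1, 0.2, 0.3])  # arbitrary non-negative weights
theta = np.append(1, -w)  # kernel theta = (1, -w)
print(C @ theta)  # circular convolution x * theta
print(x - A @ w)  # identical result: x - A w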


To reinforce the model interpretability, we impose a sparsity constraint on the vector $\boldsymbol{w}$. This is extremely important for capturing local and nonlocal correlations of time series. The process of learning interpretable convolutional kernels can be formulated as follows,

$$\begin{aligned}\min_{\boldsymbol{w}}\;&\|\boldsymbol{x}-\boldsymbol{A}\boldsymbol{w}\|_{2}^{2}\\ \text{s.t.}\;&\boldsymbol{w}\geq 0,\;\|\boldsymbol{w}\|_{0}\leq\tau\end{aligned}$$

where $\tau$ is the sparsity level, i.e., the number of nonzero values in $\boldsymbol{w}$.


Figure 10. Illustration of learning a $\tau$-sparse vector $\boldsymbol{w}$ from the time series $\boldsymbol{x}$ with the constructed formula as $\boldsymbol{x}\approx\boldsymbol{A}\boldsymbol{w}$.



$\ell_0$-Norm. For any vector $\boldsymbol{w}\in\mathbb{R}^{n}$, the $\ell_0$-norm is defined as follows,

$$\|\boldsymbol{w}\|_{0}=\operatorname{card}(\operatorname{supp}(\boldsymbol{w}))$$

where $\operatorname{card}(\cdot)$ and $\operatorname{supp}(\boldsymbol{w})=\{t\mid w_t\neq 0\}$ denote the cardinality and the support set of $\boldsymbol{w}$, respectively. For instance, given the vector $\boldsymbol{w}=(1,0,3,0,5)^\top$, the support set is $\operatorname{supp}(\boldsymbol{w})=\{1,3,5\}$ and the number of nonzero entries is $\|\boldsymbol{w}\|_{0}=3$.
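
In numpy, the support set and the $\ell_0$-norm of this instance can be checked as follows (note that numpy indices are 0-based, whereas the support set above is written with 1-based indices).

import numpy as np

w = np.array([1, 0, 3, 0, 5])
print(np.nonzero(w)[0])  # support set (0-based indices): [0 2 4]
print(np.count_nonzero(w))  # l0-norm: 3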



V-C. Mixed-Integer Programming

There are often special-purpose algorithms for solving constrained linear regression problems; see constrained least squares. Some examples of constraints include: 1) non-negative least squares, in which the vector $\boldsymbol{w}$ must satisfy the vector inequality $\boldsymbol{w}\geq\boldsymbol{0}$ (each entry must be either positive or zero); 2) box-constrained least squares, in which the vector $\boldsymbol{w}$ must satisfy the vector inequalities $\boldsymbol{\ell}\leq\boldsymbol{w}\leq\boldsymbol{u}$ for given lower and upper bounds $\boldsymbol{\ell}$ and $\boldsymbol{u}$.



Example 8. Given the vector $\boldsymbol{x}=(0,1,2,3,4)^\top$, the auxiliary matrix (i.e., the last 4 columns of the circulant matrix $\mathcal{C}(\boldsymbol{x})$) is

$$\boldsymbol{A}=\begin{bmatrix}4&3&2&1\\ 0&4&3&2\\ 1&0&4&3\\ 2&1&0&4\\ 3&2&1&0\end{bmatrix}$$

The question is how to learn the vector $\boldsymbol{w}$ with sparsity level $\tau=2$ from the following optimization problem:

$$\begin{aligned}\min_{\boldsymbol{w}}\;&\|\boldsymbol{x}-\boldsymbol{A}\boldsymbol{w}\|_{2}^{2}\\ \text{s.t.}\;&\boldsymbol{w}\geq 0,\;\|\boldsymbol{w}\|_{0}\leq 2\end{aligned}$$

In this case, the empirical solution of the decision variable is $\boldsymbol{w}=(4/9,0,0,4/9)^\top\approx(0.4444,0,0,0.4444)^\top$. To reproduce it, the solution algorithm of mixed-integer programming is given by

import numpy as np
import cvxpy as cp

x = np.array([0, 1, 2, 3, 4])
A = circ_mat(x)[:, 1 :]  # circ_mat as defined in Example 5
tau = 2 # sparsity level
d = A.shape[1]

# Variables
w = cp.Variable(d, nonneg=True)
z = cp.Variable(d, boolean=True)

# Constraints (w <= z acts as a big-M constraint with M = 1,
# which is valid here since the optimal weights do not exceed 1)
constraints = [cp.sum(z) <= tau, w <= z, w >= 0]

# Objective
objective = cp.Minimize(cp.sum_squares(x - A @ w))

# Problem
problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.GUROBI)  # Ensure to use a solver that supports MIP

# Solution
print("Optimal beta:", w.value)
print("Active indices:", np.nonzero(z.value > 0.5)[0])

Please install the optimization packages in advance, e.g., via pip install gurobipy. One can also check out the available optimization solvers in cvxpy:

import cvxpy as cp

print(cp.installed_solvers())





VI. Experiments


Figure 11. Daily average of rideshare trips over 77 zones during the first 8 weeks since April 1, 2024 in the City of Chicago, USA. There are 11,374,540 trips in total, while the daily average is 203,117 trips.



VII. Concluding Remarks

From a time series analysis perspective, our modeling ideas can guide further research by using the properties of circular convolution and connecting them with various signals. Because circular convolution is a core component in many machine learning tasks, this post provides an example of how to leverage circular convolution, the discrete Fourier transform, and linear regression. While we focused on the time series imputation and convolutional kernel learning problems, we hope that our modeling ideas will inspire others working on machine learning and optimization.



(Posted by Xinyu Chen on April 17, 2024.)