Time Series Convolution
A convolutional kernel approach for reinforcing the modeling of time series trends and interpreting temporal patterns, allowing one to leverage Fourier transforms and learn sparse representations. Interpretable machine learning models such as sparse regression unlock opportunities to better capture the long-term changes and temporal patterns of real-world time series.
(Updated on February 15, 2025)
In this post, we intend to explain the essential ideas of our research work:
- Xinyu Chen, Zhanhong Cheng, HanQin Cai, Nicolas Saunier, Lijun Sun (2024). Laplacian convolutional representation for traffic time series imputation. IEEE Transactions on Knowledge and Data Engineering. 36 (11): 6490-6502.
- Xinyu Chen, Xi-Le Zhao, Chun Cheng (2024). Forecasting urban traffic states with sparse data using Hankel temporal matrix factorization. INFORMS Journal on Computing.
- Xinyu Chen, HanQin Cai, Fuqiang Liu, Jinhua Zhao (2024). Correlating time series with interpretable convolutional kernels. arXiv:2409.01362.
Content:
In Part I of this series, we introduce the motivations of time series modeling with global and local trends. These time series trends are important for improving the performance of time series imputation. With an appropriate interpretable machine learning model, it is also possible to quantify the periodicity of time series. Following these basic motivations, we elaborate on several key concepts such as circular convolution, convolution matrix, circulant matrix, and discrete Fourier transform in Part II.
Part III and Part IV give the modeling ideas of circulant matrix nuclear norm minimization and Laplacian convolutional representation, addressing critical challenges in time series imputation tasks. The optimization algorithms of both models make use of the fast Fourier transform, achieving log-linear time complexity. Part V presents an interpretable convolutional kernel method in which the sparsity of convolutional kernels is modeled by $\ell_0$-norm induced sparsity constraints.
I. Motivation
The development of machine learning models in the past decade is truly remarkable. Convolution is one of the most commonly used operations in applied mathematics and signal processing, and it has been widely applied to many machine learning problems. The aims of this post are to revisit the essential ideas of circular convolution and to lay an insightful foundation for modeling time series data.
Nowadays, although we have quite a lot of machine learning algorithms on hand, it is still necessary to rethink the following perspectives in time series modeling:
- How to characterize global time series trends?
- How to characterize local time series trends?
- How to learn interpretable local and nonlocal patterns as convolutional kernels?
Sometimes, time series exhibit complicated trends if they are not stationary.
Figure 1. Illustration of time series trends: (a) global trends (e.g., long-term daily/weekly periodicity); (b) local trends (e.g., short-term time series trends).
II. Preliminaries
In this study, we build the modeling concepts of Laplacian convolutional representation (LCR) upon several essential ideas from the field of signal processing, including circular convolution, discrete Fourier transform, and fast Fourier transform. In the following sections, we will discuss:
- What are circular convolution, convolution matrix, and circulant matrix?
- What is the convolution theorem?
- How can fast Fourier transform be used to compute the circular convolution?
II-A. Circular Convolution
Convolution is a powerful operation in many classical deep learning frameworks, such as convolutional neural networks (CNNs). In the context of discrete sequences (typically vectors), circular convolution refers to the convolution of two discrete sequences of data, and it plays an important role in maximizing the efficiency of certain common filtering operations (see circular convolution on Wikipedia).
By definition, for any vectors $\boldsymbol{x} = (x_1, x_2, \ldots, x_T)^\top \in \mathbb{R}^{T}$ and $\boldsymbol{y} = (y_1, y_2, \ldots, y_\tau)^\top \in \mathbb{R}^{\tau}$ with $\tau \leq T$, the circular convolution (denoted by the symbol $\star$) of these two vectors is formulated as follows,

$$\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y} \in \mathbb{R}^{T}.$$

Element-wise, we have

$$z_t = \sum_{k=1}^{\tau} x_{t-k+1}\, y_k, \quad \forall t \in \{1, 2, \ldots, T\},$$

where $z_t$ represents the $t$th entry of $\boldsymbol{z}$. For a cyclical operation, it takes $x_{t-k+1} = x_{t-k+1+T}$ when $t - k + 1 \leq 0$. While the definition of circular convolution might seem complicated for beginners, it becomes much clearer when you consider the concept of a convolution or circulant matrix, as demonstrated in the examples provided below.
As mentioned above, the vector $\boldsymbol{x}$ has a length of $T$, and the vector $\boldsymbol{y}$ has a length of $\tau$. According to the definition of circular convolution, the resulting vector $\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y}$ will also have a length of $T$, matching the length of the original vector $\boldsymbol{x}$.
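To make the element-wise definition concrete, here is a minimal numpy sketch that follows the formula directly; the function name circ_conv and the example values are ours for illustration.
import numpy as np
def circ_conv(x, y):
    # Direct implementation of z_t = sum_k x_{t-k+1} y_k with cyclic indexing
    T, tau = x.shape[0], y.shape[0]
    z = np.zeros(T)
    for t in range(T):
        for k in range(tau):
            z[t] += x[(t - k) % T] * y[k]
    return z
print(circ_conv(np.array([0, 1, 2, 3, 4]), np.array([2, -1, 3])))
# Expected: [ 5. 14.  3.  7. 11.]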
Given any vectors $\boldsymbol{x} = (x_1, x_2, \ldots, x_T)^\top$ and $\boldsymbol{y} = (y_1, y_2, \ldots, y_\tau)^\top$, the circular convolution between them can be expressed as follows,

$$\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y} = (z_1, z_2, \ldots, z_T)^\top,$$

where $z_t = \sum_{k=1}^{\tau} x_{t-k+1}\, y_k$ and $x_{t-k+1} = x_{t-k+1+T}$ for $t - k + 1 \leq 0$ according to the definition.

In this case, each entry $z_t$ of the resulting vector $\boldsymbol{z}$ is computed as the inner product between the vector $\boldsymbol{y}$ and a reversed and truncated version of the vector $\boldsymbol{x}$. Specifically, the entry $z_1$ is obtained by computing the inner product between $\boldsymbol{y}$ and $(x_1, x_T, x_{T-1}, \ldots, x_{T-\tau+2})^\top$, i.e.,

$$z_1 = x_1 y_1 + x_T y_2 + \cdots + x_{T-\tau+2}\, y_\tau.$$

For subsequent entries $z_t,\ t = 2, 3, \ldots, T$, the vector $\boldsymbol{x}$ is cyclically shifted in reverse and truncated (only preserving the first $\tau$ entries), and the inner product with $\boldsymbol{y}$ is calculated. Figure 2 illustrates the basic steps for computing each entry of the circular convolution.
Figure 2. Illustration of the circular convolution between $\boldsymbol{x}$ and $\boldsymbol{y}$. (a) Computing the entry $z_1$ involves $(x_1, x_T, x_{T-1}, \ldots)^\top$ and $\boldsymbol{y}$. (b) Computing the entry $z_2$ involves $(x_2, x_1, x_T, \ldots)^\top$ and $\boldsymbol{y}$. The figure is inspired by Prince (2023).
As can be seen, circular convolution between two vectors can essentially be viewed as a linear operation. This perspective allows one to reformulate the circular convolution as a linear transformation using a convolution matrix (or a circulant matrix when these two vectors have the same length).
Example 1. Given vectors $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ and $\boldsymbol{y} = (2, -1, 3)^\top$, the circular convolution $\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y}$ has the following entries:

$$\begin{cases} z_1 = x_1 y_1 + x_5 y_2 + x_4 y_3 = 0\times 2 + 4\times(-1) + 3\times 3 = 5, \\ z_2 = x_2 y_1 + x_1 y_2 + x_5 y_3 = 1\times 2 + 0\times(-1) + 4\times 3 = 14, \\ z_3 = x_3 y_1 + x_2 y_2 + x_1 y_3 = 2\times 2 + 1\times(-1) + 0\times 3 = 3, \\ z_4 = x_4 y_1 + x_3 y_2 + x_2 y_3 = 3\times 2 + 2\times(-1) + 1\times 3 = 7, \\ z_5 = x_5 y_1 + x_4 y_2 + x_3 y_3 = 4\times 2 + 3\times(-1) + 2\times 3 = 11, \end{cases}$$

where $T = 5$ and $\tau = 3$ according to the definition.
II-B. Convolution Matrix
Using the notations above, for any vectors $\boldsymbol{x} \in \mathbb{R}^{T}$ and $\boldsymbol{y} \in \mathbb{R}^{\tau}$ with $\tau \leq T$, the circular convolution can be expressed as a linear transformation:

$$\boldsymbol{x} \star \boldsymbol{y} = \mathcal{C}_{\tau}(\boldsymbol{x})\,\boldsymbol{y},$$

where $\mathcal{C}_{\tau}: \mathbb{R}^{T} \to \mathbb{R}^{T\times\tau}$ denotes the convolution operator. The convolution matrix can be represented as follows,

$$\mathcal{C}_{\tau}(\boldsymbol{x}) \triangleq \begin{bmatrix} x_1 & x_T & \cdots & x_{T-\tau+2} \\ x_2 & x_1 & \cdots & x_{T-\tau+3} \\ \vdots & \vdots & \ddots & \vdots \\ x_T & x_{T-1} & \cdots & x_{T-\tau+1} \end{bmatrix} \in \mathbb{R}^{T\times\tau}.$$
In the field of signal processing, this linear transformation is a fundamental property of circular convolution, highlighting its role in efficiently implementing filtering operations.
Figure 3. Illustration of circular convolution as the linear transformation with a convolution matrix.
Example 2. Given vectors $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ and $\boldsymbol{y} = (2, -1, 3)^\top$, the circular convolution $\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y}$ can be expressed as:

$$\boldsymbol{z} = \mathcal{C}_{3}(\boldsymbol{x})\,\boldsymbol{y},$$

where $\mathcal{C}_{3}(\boldsymbol{x})$ is the convolution matrix with $\tau = 3$ columns. Specifically, the convolution matrix is structured as follows,

$$\mathcal{C}_{3}(\boldsymbol{x}) = \begin{bmatrix} 0 & 4 & 3 \\ 1 & 0 & 4 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \\ 4 & 3 & 2 \end{bmatrix}.$$

As a result, it gives

$$\boldsymbol{z} = \mathcal{C}_{3}(\boldsymbol{x})\,\boldsymbol{y} = \begin{bmatrix} 0 & 4 & 3 \\ 1 & 0 & 4 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \\ 4 & 3 & 2 \end{bmatrix}\begin{bmatrix} 2 \\ -1 \\ 3 \end{bmatrix} = \begin{bmatrix} 5 \\ 14 \\ 3 \\ 7 \\ 11 \end{bmatrix}.$$
This representation shows that circular convolution is equivalent to a matrix-vector multiplication, making it easier to understand, especially in signal processing applications.
In this post, we aim to make the concepts clear and accessible by incorporating programming code in Python, intuitive illustrations, and detailed explanations of the formulas. To demonstrate how circular convolution can be computed, we will use Python’s numpy library. First, we will construct the convolution matrix of the vector $\boldsymbol{x}$ and then perform the circular convolution as follows.
import numpy as np
def conv_mat(vec, tau):
    # Construct the T-by-tau convolution matrix of the vector `vec`
    n = vec.shape[0]
    mat = np.zeros((n, tau))
    mat[:, 0] = vec
    for i in range(1, tau):
        # Each column is the vector cyclically shifted down by i positions
        mat[:, i] = np.append(vec[-i :], vec[: n - i], axis = 0)
    return mat
x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3])
mat = conv_mat(x, y.shape[0])
print('Convolution matrix of x with 3 columns:')
print(mat)
print()
z = mat @ y
print('Circular convolution of x and y:')
print(z)
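If everything is set up correctly, the printed convolution matrix should have columns $(0, 1, 2, 3, 4)^\top$, $(4, 0, 1, 2, 3)^\top$, and $(3, 4, 0, 1, 2)^\top$, and the resulting circular convolution should be $(5, 14, 3, 7, 11)^\top$, matching Examples 1 and 2.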
II-C. Circulant Matrix
Recall that the convolution matrix $\mathcal{C}_{\tau}(\boldsymbol{x})$ is specified with $\tau$ columns, corresponding to the length of the vector $\boldsymbol{y}$. In the case of $\tau = T$, i.e., a convolution matrix with $T$ columns, the convolution matrix becomes a square matrix, known as a circulant matrix. In this study, we emphasize the importance of circulant matrices and their properties, such as their strong connection with circular convolution and discrete Fourier transform, even though we do not work directly with circulant matrices.
For any vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^{T}$, the circular convolution can be expressed as a linear transformation such that

$$\boldsymbol{x} \star \boldsymbol{y} = \mathcal{C}(\boldsymbol{x})\,\boldsymbol{y},$$

where $\mathcal{C}: \mathbb{R}^{T} \to \mathbb{R}^{T\times T}$ denotes the circulant operator. The circulant matrix is defined as:

$$\mathcal{C}(\boldsymbol{x}) \triangleq \begin{bmatrix} x_1 & x_T & x_{T-1} & \cdots & x_2 \\ x_2 & x_1 & x_T & \cdots & x_3 \\ x_3 & x_2 & x_1 & \cdots & x_4 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_T & x_{T-1} & x_{T-2} & \cdots & x_1 \end{bmatrix},$$

which forms a square matrix. It always holds that $\|\mathcal{C}(\boldsymbol{x})\|_F = \sqrt{T}\,\|\boldsymbol{x}\|_2$, where $\|\cdot\|_F$ and $\|\cdot\|_2$ are the Frobenius norm of a matrix and the $\ell_2$-norm of a vector, respectively.
Figure 4. Illustration of circular convolution as the linear transformation with a circulant matrix.
Example 3. Given vectors $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ and $\boldsymbol{y} = (2, -1, 3)^\top$, the circular convolution $\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y}$ is identical to

$$\boldsymbol{z} = \mathcal{C}(\boldsymbol{x})\,\tilde{\boldsymbol{y}},$$

where $\tilde{\boldsymbol{y}} = (2, -1, 3, 0, 0)^\top$ and $\mathcal{C}(\boldsymbol{x})$ is the circulant matrix formed from $\boldsymbol{x}$, defined as:

$$\mathcal{C}(\boldsymbol{x}) = \begin{bmatrix} 0 & 4 & 3 & 2 & 1 \\ 1 & 0 & 4 & 3 & 2 \\ 2 & 1 & 0 & 4 & 3 \\ 3 & 2 & 1 & 0 & 4 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}.$$

Thus, the result can be written as follows,

$$\boldsymbol{z} = \mathcal{C}(\boldsymbol{x})\,\tilde{\boldsymbol{y}} = (5, 14, 3, 7, 11)^\top.$$

The example shows that the circular convolution of $\boldsymbol{x}$ and $\boldsymbol{y}$ is equivalent to the circular convolution of $\boldsymbol{x}$ and $\tilde{\boldsymbol{y}}$, i.e., $\boldsymbol{y}$ with its last two entries padded with zeros. Thus, to compute the circular convolution of $\boldsymbol{x} \in \mathbb{R}^{T}$ and $\boldsymbol{y} \in \mathbb{R}^{\tau}$ when $T$ is greater than $\tau$, one can simply append $T - \tau$ zeros to the end of $\boldsymbol{y}$ and perform the circular convolution.
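As a quick sanity check of this zero-padding trick, one can reuse the conv_mat function above with $\tau = T$ (which yields the circulant matrix); the snippet below is a minimal sketch with our own variable names.
import numpy as np
x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3])
y_tilde = np.append(y, np.zeros(x.shape[0] - y.shape[0]))  # pad y with T - tau zeros
mat = conv_mat(x, x.shape[0])  # square convolution matrix, i.e., circulant matrix
print(mat @ y_tilde)
# Expected: [ 5. 14.  3.  7. 11.]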
II-D. Discrete Fourier Transform
Discrete Fourier transform (see Wikipedia) is a fundamental tool in mathematics and signal processing with widespread applications to machine learning. The discrete Fourier transform is the key discrete transform used for Fourier analysis, enabling the decomposition of a signal into its constituent frequencies. The fast Fourier transform is an efficient algorithm for computing the discrete Fourier transform (see the difference between discrete Fourier transform and fast Fourier transform), significantly reducing the time complexity from $\mathcal{O}(T^2)$ to $\mathcal{O}(T\log T)$, where $T$ is the number of data points. This efficiency makes the fast Fourier transform essential for processing large problems.
A crucial concept in signal processing is the convolution theorem, which states that convolution in the time domain is equivalent to element-wise multiplication in the frequency domain. This implies that the circular convolution can be efficiently computed using the fast Fourier transform. The convolution theorem for the discrete Fourier transform is summarized as follows,

$$\boldsymbol{x} \star \boldsymbol{y} = \mathcal{F}^{-1}(\mathcal{F}(\boldsymbol{x}) \circ \mathcal{F}(\boldsymbol{y})),$$

or

$$\mathcal{F}(\boldsymbol{x} \star \boldsymbol{y}) = \mathcal{F}(\boldsymbol{x}) \circ \mathcal{F}(\boldsymbol{y}),$$

where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the discrete Fourier transform and the inverse discrete Fourier transform, respectively. The symbol $\circ$ represents the Hadamard product, i.e., element-wise multiplication.
In fact, this principle underlies many efficient algorithms in signal processing and data analysis, allowing complex operations to be performed efficiently in the frequency domain.
Example 4. Given vectors $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ and $\boldsymbol{y} = (2, -1, 3, 0, 0)^\top$, the circular convolution $\boldsymbol{z} = \boldsymbol{x} \star \boldsymbol{y}$ can be computed via the use of fast Fourier transform in numpy as follows,
import numpy as np
x = np.array([0, 1, 2, 3, 4])
y = np.array([2, -1, 3, 0, 0])
fx = np.fft.fft(x)
fy = np.fft.fft(y)
z = np.fft.ifft(fx * fy).real
print('Fast Fourier transform of x:')
print(fx)
print('Fast Fourier transform of y:')
print(fy)
print('Circular convolution of x and y:')
print(z)
in which the outputs are
Fast Fourier transform of x:
[10. +0.j -2.5+3.4409548j -2.5+0.81229924j -2.5-0.81229924j
-2.5-3.4409548j ]
Fast Fourier transform of y:
[ 4. +0.j -0.73606798-0.81229924j 3.73606798+3.4409548j
3.73606798-3.4409548j -0.73606798+0.81229924j]
Circular convolution of x and y:
[ 5. 14. 3. 7. 11.]
II-E. Hankel Matrix Factorization & Discrete Fourier Transform
The Hankel matrix plays a fundamental role in numerous areas of applied mathematics and signal processing. By definition, a Hankel matrix is a square or rectangular matrix in which each ascending skew-diagonal (from left to right) has the same value. Given vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_T)^\top \in \mathbb{R}^{T}$, the Hankel matrix can be constructed as follows,

$$\mathcal{H}(\boldsymbol{x}) \triangleq \begin{bmatrix} x_1 & x_2 & \cdots & x_m \\ x_2 & x_3 & \cdots & x_{m+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_n & x_{n+1} & \cdots & x_T \end{bmatrix} \in \mathbb{R}^{n\times m},$$

where the Hankel matrix has $n$ rows and $m = T - n + 1$ columns. This matrix is often used to represent signals or time series data, capturing their sequential dependencies and structure, see e.g., Chen et al. (2024) in our work.
On the Hankel matrix $\mathcal{H}(\boldsymbol{x})$, if it can be approximated by the multiplication of two matrices $\boldsymbol{W} \in \mathbb{R}^{n\times R}$ and $\boldsymbol{Q} \in \mathbb{R}^{m\times R}$ such that $\mathcal{H}(\boldsymbol{x}) \approx \boldsymbol{W}\boldsymbol{Q}^\top$, then one can compute the inverse of Hankel matrix factorization as follows,

$$\tilde{x}_t = \bigl[\mathcal{H}^{\dagger}(\boldsymbol{W}\boldsymbol{Q}^\top)\bigr]_t = \frac{1}{\rho_t}\sum_{i+j = t+1} \boldsymbol{w}_i^\top\boldsymbol{q}_j,$$

where $\boldsymbol{w}_i$ and $\boldsymbol{q}_j$ are the $i$-th and $j$-th rows of $\boldsymbol{W}$ and $\boldsymbol{Q}$, respectively. Herein, $\mathcal{H}^{\dagger}(\cdot)$ denotes the inverse operator of the Hankel matrix. For any matrix $\boldsymbol{Y}$ of size $n\times m$, the inverse operator is given by

$$\bigl[\mathcal{H}^{\dagger}(\boldsymbol{Y})\bigr]_t = \frac{1}{\rho_t}\sum_{i+j = t+1} y_{i,j}, \quad t = 1, 2, \ldots, n + m - 1,$$

where $\rho_t$ is the number of entries on the $t$-th antidiagonal of the $n\times m$ matrix, satisfying

$$\rho_t = \min\{t, n, m, n + m - t\}.$$
Following the aforementioned formula, it is easy to connect the Hankel matrix factorization with circular convolution and discrete Fourier transform (Cai et al., 2019; Cai et al., 2022) such that

$$\tilde{x}_t = \frac{1}{\rho_t}\bigl[\tilde{\boldsymbol{w}} \star \tilde{\boldsymbol{q}}\bigr]_t, \quad t = 1, 2, \ldots, n + m - 1,$$

where we define two vectors:

$$\tilde{\boldsymbol{w}} \triangleq \begin{cases} (w_1, \ldots, w_t)^\top, & \text{if } t \leq n, \\ (w_1, \ldots, w_n, \underbrace{0, \ldots, 0}_{t-n})^\top, & \text{otherwise}, \end{cases} \qquad \tilde{\boldsymbol{q}} \triangleq \begin{cases} (q_1, \ldots, q_t)^\top, & \text{if } t \leq m, \\ (q_1, \ldots, q_m, \underbrace{0, \ldots, 0}_{t-m})^\top, & \text{otherwise}, \end{cases}$$

of length $t$. The notation $[\cdot]_t$ refers to the $t$th entry of the vector. Notably, they are different from the vectors $\boldsymbol{w}$ and $\boldsymbol{q}$. If $t \leq n$, then the vector $\tilde{\boldsymbol{w}}$ consists of the first $t$ entries of $\boldsymbol{w}$. If $t > n$, then the vector $\tilde{\boldsymbol{w}}$ consists of the vector $\boldsymbol{w}$ and $t - n$ zeros, i.e.,

$$\tilde{\boldsymbol{w}} = (w_1, w_2, \ldots, w_n, 0, \ldots, 0)^\top \in \mathbb{R}^{t}.$$

This principle is well-suited to the construction of the vector $\tilde{\boldsymbol{q}}$. In order to compute each $\tilde{x}_t$, the time complexity of the aforementioned circular convolution (via fast Fourier transform) is $\mathcal{O}(t\log t)$, and the element-wise multiplication takes $\mathcal{O}(t)$.
Example 5. Given vector $\boldsymbol{x} = (x_1, x_2, x_3, x_4, x_5)^\top$, let the number of rows of the Hankel matrix be $n = 2$, the Hankel matrix is given by

$$\mathcal{H}(\boldsymbol{x}) = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \\ x_2 & x_3 & x_4 & x_5 \end{bmatrix}.$$

If it takes a rank-one approximation such that $\mathcal{H}(\boldsymbol{x}) \approx \boldsymbol{w}\boldsymbol{q}^\top$ with $\boldsymbol{w} = (w_1, w_2)^\top$ and $\boldsymbol{q} = (q_1, q_2, q_3, q_4)^\top$, then the inverse of Hankel matrix factorization can be written as follows,

$$\tilde{\boldsymbol{x}} = \mathcal{H}^{\dagger}(\boldsymbol{w}\boldsymbol{q}^\top) = \Bigl(w_1 q_1,\ \frac{w_1 q_2 + w_2 q_1}{2},\ \frac{w_1 q_3 + w_2 q_2}{2},\ \frac{w_1 q_4 + w_2 q_3}{2},\ w_2 q_4\Bigr)^\top,$$

which can be converted into circular convolution. By doing so, the computing process can be implemented with fast Fourier transform.
Acknowledgement. Thank you @Jinming Yang for correcting the notational mistake of circular convolution in this example.
Time complexity. Given the Hankel matrix factorization formula as $\mathcal{H}(\boldsymbol{x}) \approx \boldsymbol{w}\boldsymbol{q}^\top$ where $\boldsymbol{w} \in \mathbb{R}^{n}$ and $\boldsymbol{q} \in \mathbb{R}^{m}$, suppose $n \leq m$, the number of operations for computing the vector $\tilde{\boldsymbol{x}}$ in the element-wise multiplication is $\sum_{t=1}^{n+m-1}\rho_t = nm$, leading to $\mathcal{O}(nm)$ time complexity. Following the inverse of Hankel matrix factorization in Example 5 with a (nearly) square Hankel matrix, the number of operations is $\frac{(T+1)^2}{4}$ (taking $n = \frac{T+1}{2}$) if $T$ is odd. Otherwise, the number of operations is $\frac{T(T+2)}{4}$ (taking $n = \frac{T}{2}$).
Figure 5 shows the empirical time complexity of the inverse of Hankel matrix factorization with element-wise multiplication and circular convolution, respectively. If one uses circular convolution with fast Fourier transform, then the computational cost of the inverse operation is about 100 times that of the element-wise multiplication. It turns out that the element-wise multiplication is more efficient than the circular convolution in this case.
Figure 5. Empirical time complexity of the inverse of Hankel matrix factorization with element-wise multiplication and circular convolution, respectively, where $\boldsymbol{w} \in \mathbb{R}^{n}$ and $\boldsymbol{q} \in \mathbb{R}^{m}$. We repeat the numerical experiments 100 times corresponding to each vector length.
import numpy as np
def inverse_hankel(w, q, fft = False):
    dim1 = w.shape[0]
    dim2 = q.shape[0]
    T = dim1 + dim2 - 1
    x_tilde = np.zeros(T)
    for t in range(T):
        # Truncate or zero-pad w and q to vectors of length t + 1
        w_new = np.zeros(t + 1)
        if t < dim1:
            w_new[: t + 1] = w[: t + 1]
        else:
            w_new[: dim1] = w
        q_new = np.zeros(t + 1)
        if t < dim2:
            q_new[: t + 1] = q[: t + 1]
        else:
            q_new[: dim2] = q
        # Number of entries on the t-th antidiagonal of the n-by-m matrix
        rho = min(t + 1, dim1, dim2, T - t)
        if fft:
            # Circular convolution via the convolution theorem
            vec = np.fft.ifft(np.fft.fft(w_new) * np.fft.fft(q_new)).real
            x_tilde[t] = vec[t] / rho
        else:
            # Inner product over the t-th antidiagonal
            x_tilde[t] = np.inner(w_new, np.flip(q_new)) / rho
    return x_tilde
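As a quick usage check of inverse_hankel (with hypothetical rank-one factors chosen by us), both branches should agree:
w = np.array([1.0, 2.0])
q = np.array([0.5, 1.0, 1.5, 2.0])
print(inverse_hankel(w, q, fft = False))
print(inverse_hankel(w, q, fft = True))
# Expected (both): approximately [0.5  1.   1.75 2.5  4.  ]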
For the entire implementation, please check out Appendix.
Example 6. For any vector $\boldsymbol{x} \in \mathbb{R}^{T}$, it always holds that

$$\|\mathcal{H}(\boldsymbol{x})\|_F = \|\boldsymbol{x}\|_{\boldsymbol{\rho}},$$

in which the operator is defined as follows,

$$\|\boldsymbol{x}\|_{\boldsymbol{\rho}} \triangleq \sqrt{\sum_{t=1}^{T}\rho_t\, x_t^2}.$$

Verify this property on the vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ if the number of rows of the Hankel matrix is set as $n = 2$.

According to the definition, we have

$$\|\boldsymbol{x}\|_{\boldsymbol{\rho}} = \sqrt{1\times 0^2 + 2\times 1^2 + 2\times 2^2 + 2\times 3^2 + 1\times 4^2} = \sqrt{44}.$$

The Hankel matrix is given by

$$\mathcal{H}(\boldsymbol{x}) = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix},$$

whose squared Frobenius norm is $0 + 1 + 4 + 9 + 1 + 4 + 9 + 16 = 44$. Thus, the Frobenius norm of this Hankel matrix is equivalent to the $\boldsymbol{\rho}$-weighted norm of the vector $\boldsymbol{x}$.
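The property in Example 6 can also be verified numerically; the short snippet below (our own construction) compares the squared Frobenius norm with the weighted squared norm.
import numpy as np
x = np.array([0, 1, 2, 3, 4])
H = np.array([x[: -1], x[1 :]])       # Hankel matrix with n = 2 rows
rho = np.array([1, 2, 2, 2, 1])       # antidiagonal counts
print(np.linalg.norm(H, 'fro') ** 2)  # 44.0, up to floating-point error
print(np.sum(rho * x ** 2))           # 44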
III. Circulant Matrix Nuclear Norm Minimization
Circulant matrices are fundamental in many computational and theoretical aspects of signal processing and machine learning, providing an efficient framework for implementing various algorithms such as circulant matrix nuclear norm minimization. By definition, a circulant matrix is a special square matrix in which each row shifts the previous row to the right by one position, with the last entry wrapping around to the first position. As we already discussed the circulant matrix above, we will present the circulant matrix nuclear norm, its minimization problem, and applications.
III-A. Definition
Nuclear norm is a key concept in matrix computations and convex optimization, frequently applied in low-rank matrix approximation and completion problems. For any matrix $\boldsymbol{X} \in \mathbb{R}^{m\times n}$, the nuclear norm is defined as the sum of singular values:

$$\|\boldsymbol{X}\|_{*} = \sum_{r=1}^{\min\{m, n\}} s_r,$$

where $\|\cdot\|_{*}$ denotes the nuclear norm. As illustrated in Figure 6, the singular values are $\{s_1, s_2, \ldots\}$ with $s_1 \geq s_2 \geq \cdots \geq 0$.
Figure 6. Singular value decomposition of matrix $\boldsymbol{X} = \boldsymbol{W}\boldsymbol{S}\boldsymbol{Q}^\top$. In the decomposed matrices, the unitary matrix $\boldsymbol{W}$ (or $\boldsymbol{Q}$) consists of $r$ orthogonal left (or right) singular vectors, while the $r$ diagonal entries of $\boldsymbol{S}$ are singular values such that $s_1 \geq s_2 \geq \cdots \geq s_r \geq 0$. Note that $r = \min\{m, n\}$ for notational convenience.
Example 7. Given vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x}) = \begin{bmatrix} 0 & 4 & 3 & 2 & 1 \\ 1 & 0 & 4 & 3 & 2 \\ 2 & 1 & 0 & 4 & 3 \\ 3 & 2 & 1 & 0 & 4 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}.$$

Thus, the singular values are

$$\{s_1, s_2, s_3, s_4, s_5\} = \{10, 4.2533, 4.2533, 2.6287, 2.6287\}.$$

As a result, we have the nuclear norm as follows,

$$\|\mathcal{C}(\boldsymbol{x})\|_{*} = \sum_{r=1}^{5} s_r \approx 23.7638.$$

Please reproduce the results by using the following numpy implementation.
import numpy as np
def circ_mat(vec):
    # Construct the T-by-T circulant matrix of the vector `vec`
    n = vec.shape[0]
    mat = np.zeros((n, n))
    mat[:, 0] = vec
    for i in range(1, n):
        mat[:, i] = np.append(vec[-i :], vec[: n - i], axis = 0)
    return mat
x = np.array([0, 1, 2, 3, 4])
mat = circ_mat(x)
w, s, q = np.linalg.svd(mat, full_matrices = False)
print('Singular values of C(x):')
print(s)
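As a follow-up to the snippet above (reusing the computed singular values s), the nuclear norm is simply their sum:
print('Nuclear norm of C(x):')
print(np.sum(s))
# Expected: approximately 23.7638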
III-B. Property
One of the most intriguing properties of circulant matrices is that they are diagonalizable by the discrete Fourier transform matrix. The eigenvalue decomposition of the circulant matrix $\mathcal{C}(\boldsymbol{x})$ (constructed from any vector $\boldsymbol{x} \in \mathbb{R}^{T}$) is given by

$$\mathcal{C}(\boldsymbol{x}) = \boldsymbol{F}^{H}\operatorname{diag}(\mathcal{F}(\boldsymbol{x}))\,\boldsymbol{F},$$

where $\boldsymbol{F}$ is the unitary discrete Fourier transform matrix, $\boldsymbol{F}^{H}$ is the Hermitian transpose of $\boldsymbol{F}$, and $\operatorname{diag}(\mathcal{F}(\boldsymbol{x}))$ is a diagonal matrix containing the eigenvalues of $\mathcal{C}(\boldsymbol{x})$, namely, the discrete Fourier transform of $\boldsymbol{x}$. Due to this property, the nuclear norm of the circulant matrix can be formulated as the $\ell_1$-norm of the discrete Fourier transform of the vector $\boldsymbol{x}$:

$$\|\mathcal{C}(\boldsymbol{x})\|_{*} = \|\mathcal{F}(\boldsymbol{x})\|_1.$$
This relationship highlights the strong connection between circulant matrices and Fourier analysis, enabling efficient computation and analysis in various applications, e.g., circulant matrix nuclear norm minimization (Chen et al., 2024).
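The diagonalization property can be verified numerically. Below is a minimal sketch (our own construction, using the unitary DFT matrix); it should recover the circulant matrix of $\boldsymbol{x}$ up to floating-point error.
import numpy as np
x = np.array([0, 1, 2, 3, 4])
T = x.shape[0]
F = np.fft.fft(np.eye(T)) / np.sqrt(T)       # unitary DFT matrix
C = F.conj().T @ np.diag(np.fft.fft(x)) @ F  # F^H diag(F(x)) F
print(np.round(C.real, 6))                   # recovers the circulant matrix C(x)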
Example 8. Given vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x}) = \begin{bmatrix} 0 & 4 & 3 & 2 & 1 \\ 1 & 0 & 4 & 3 & 2 \\ 2 & 1 & 0 & 4 & 3 \\ 3 & 2 & 1 & 0 & 4 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}.$$

The eigenvalues of $\mathcal{C}(\boldsymbol{x})$ and the fast Fourier transform of $\boldsymbol{x}$ can be computed using numpy as follows.
import numpy as np
x = np.array([0, 1, 2, 3, 4])
mat = circ_mat(x)
eigval, eigvec = np.linalg.eig(mat)
print('Eigenvalues of C(x):')
print(eigval)
print('Fast Fourier transform of x:')
print(np.fft.fft(x))
in which the outputs are
Eigenvalues of C(x):
[10. +0.j -2.5+3.4409548j -2.5-3.4409548j -2.5+0.81229924j
-2.5-0.81229924j]
Fast Fourier transform of x:
[10. +0.j -2.5+3.4409548j -2.5+0.81229924j -2.5-0.81229924j
-2.5-3.4409548j ]
In this case, the $\ell_1$-norm of the complex-valued $\mathcal{F}(\boldsymbol{x})$ (the imaginary unit is defined as $i = \sqrt{-1}$) is given by

$$\|\mathcal{F}(\boldsymbol{x})\|_1 = |10| + 2\times|-2.5 + 3.4410\,i| + 2\times|-2.5 + 0.8123\,i| \approx 23.7638,$$

which equals the nuclear norm of the circulant matrix in Example 7.
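Consistently, the $\ell_1$-norm of the fast Fourier transform can be evaluated in one line:
import numpy as np
x = np.array([0, 1, 2, 3, 4])
print(np.sum(np.abs(np.fft.fft(x))))
# Expected: approximately 23.7638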
III-C. Optimization
For any partially observed time series in the form of a vector $\boldsymbol{y} \in \mathbb{R}^{T}$ with the observed index set $\Omega$, solving the circulant matrix nuclear norm minimization allows one to reconstruct missing values in time series. The problem is formulated as follows,

$$\begin{aligned}\min_{\boldsymbol{x}}\ &\|\mathcal{C}(\boldsymbol{x})\|_{*} \\ \text{s.t.}\ &\|\mathcal{P}_{\Omega}(\boldsymbol{x} - \boldsymbol{y})\|_2 \leq \epsilon,\end{aligned}$$

where $\epsilon$ in the constraint represents the tolerance of errors between the reconstructed time series $\boldsymbol{x}$ and the partially observed time series $\boldsymbol{y}$.
Example 9. Given vector $\boldsymbol{y} = (y_1, y_2, y_3, y_4, y_5)^\top$ and observed index set $\Omega = \{1, 3, 5\}$, the orthogonal projection supported on $\Omega$ is

$$\mathcal{P}_{\Omega}(\boldsymbol{y}) = (y_1, 0, y_3, 0, y_5)^\top.$$

On the complement of $\Omega$, we have

$$\mathcal{P}_{\Omega^{\perp}}(\boldsymbol{y}) = (0, y_2, 0, y_4, 0)^\top.$$
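For illustration, the orthogonal projection in Example 9 can be implemented with index masking; the values below are hypothetical.
import numpy as np
y = np.array([1.2, 0.8, 0.6, 1.0, 1.4])  # hypothetical observations
pos = np.array([0, 2, 4])                # observed index set (0-indexed)
proj = np.zeros_like(y)
proj[pos] = y[pos]                       # orthogonal projection on Omega
print(proj)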
In this case, one can rewrite the constraint as a penalty term (weighted by $\gamma / 2$) in the objective function:

$$\min_{\boldsymbol{x}}\ \|\mathcal{C}(\boldsymbol{x})\|_{*} + \frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{x} - \boldsymbol{y})\|_2^2.$$

Since the variable $\boldsymbol{x}$ is associated with both a circulant matrix nuclear norm and a penalty term, the first impulse is using variable splitting to convert the problem into the following one:

$$\begin{aligned}\min_{\boldsymbol{x},\,\boldsymbol{z}}\ &\|\mathcal{C}(\boldsymbol{x})\|_{*} + \frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{z} - \boldsymbol{y})\|_2^2 \\ \text{s.t.}\ &\boldsymbol{x} = \boldsymbol{z}.\end{aligned}$$
III-D. Solution Algorithm
The augmented Lagrangian function is

$$\mathcal{L}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{w}) = \|\mathcal{C}(\boldsymbol{x})\|_{*} + \frac{\gamma}{2}\|\mathcal{P}_{\Omega}(\boldsymbol{z} - \boldsymbol{y})\|_2^2 + \frac{\lambda}{2}\|\boldsymbol{x} - \boldsymbol{z}\|_2^2 + \langle\boldsymbol{w},\, \boldsymbol{x} - \boldsymbol{z}\rangle,$$

where the dual variable $\boldsymbol{w} \in \mathbb{R}^{T}$ is an estimate of the Lagrange multiplier, and $\lambda$ is a penalty parameter that controls the convergence rate.

Thus, the variables $\boldsymbol{x}$, $\boldsymbol{z}$, and the dual variable $\boldsymbol{w}$ can be updated iteratively as follows,

$$\begin{cases}\boldsymbol{x} := \arg\min_{\boldsymbol{x}}\ \mathcal{L}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{w}), \\ \boldsymbol{z} := \arg\min_{\boldsymbol{z}}\ \mathcal{L}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{w}), \\ \boldsymbol{w} := \boldsymbol{w} + \lambda(\boldsymbol{x} - \boldsymbol{z}),\end{cases}$$

where the dual variable takes a standard update in the ADMM. In what follows, we give details about how to get the closed-form solutions to the variables $\boldsymbol{x}$ and $\boldsymbol{z}$.
III-E. $\boldsymbol{x}$-Subproblem
Solving for the variable $\boldsymbol{x}$ involves the discrete Fourier transform, the convolution theorem, and $\ell_1$-norm minimization in the complex space. The optimization problem with respect to the variable $\boldsymbol{x}$ is given by

$$\boldsymbol{x} := \arg\min_{\boldsymbol{x}}\ \|\mathcal{C}(\boldsymbol{x})\|_{*} + \frac{\lambda}{2}\Bigl\|\boldsymbol{x} - \boldsymbol{z} + \frac{1}{\lambda}\boldsymbol{w}\Bigr\|_2^2.$$

Using the discrete Fourier transform, the optimization problem can be converted into the following one in the complex space (Chen et al., 2024):

$$\hat{\boldsymbol{x}} := \arg\min_{\hat{\boldsymbol{x}}}\ \|\hat{\boldsymbol{x}}\|_1 + \frac{\lambda}{2T}\|\hat{\boldsymbol{x}} - \hat{\boldsymbol{h}}\|_2^2,$$

where the complex-valued variables refer to $\hat{\boldsymbol{x}} = \mathcal{F}(\boldsymbol{x})$ and $\hat{\boldsymbol{h}} = \mathcal{F}(\boldsymbol{z} - \boldsymbol{w}/\lambda)$ in the frequency domain, or see nuclear norm minimization of a circulant matrix with fast Fourier transform on StackExchange Mathematics. The closed-form solution to the complex-valued variable $\hat{\boldsymbol{x}}$ is

$$\hat{x}_t := \frac{\hat{h}_t}{|\hat{h}_t|}\cdot\max\{|\hat{h}_t| - T/\lambda,\, 0\}, \quad t = 1, 2, \ldots, T,$$

with $\hat{h}_t$ being the $t$th entry of $\hat{\boldsymbol{h}}$. For reference, this closed-form solution can be found in Lemma 3.3 of Yang et al. (2009), see Eq. (3.18) and (3.19) for the discussion of real-valued variables. In the sparsity-induced norm optimization of machine learning, this closed-form solution is also called the proximal operator or the shrinkage operator.
import numpy as np
def update_x(z, w, lmbda):
    T = z.shape[0]
    h = np.fft.fft(z - w / lmbda)        # frequency-domain target
    temp = 1 - T / (lmbda * np.abs(h))   # shrinkage factor
    temp[temp <= 0] = 0
    return np.fft.ifft(h * temp).real    # back to the time domain
Shrinkage Operator. For any vector $\boldsymbol{h} \in \mathbb{R}^{T}$ and parameter $\alpha > 0$, the closed-form solution to the optimization problem

$$\min_{\boldsymbol{x}}\ \|\boldsymbol{x}\|_1 + \frac{\alpha}{2}\|\boldsymbol{x} - \boldsymbol{h}\|_2^2$$

can be expressed as

$$x_t := \frac{h_t}{|h_t|}\cdot\max\{|h_t| - 1/\alpha,\, 0\}, \quad t = 1, 2, \ldots, T,$$

or

$$\boldsymbol{x} := \operatorname{sgn}(\boldsymbol{h}) \circ \max\{|\boldsymbol{h}| - 1/\alpha,\, \boldsymbol{0}\}.$$
Figure 7. Illustration of the shrinkage operator for solving the $\ell_1$-norm minimization problem.
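A minimal numpy sketch of the shrinkage operator (our own helper, matching the closed-form solution above) is given below.
import numpy as np
def shrinkage(h, alpha):
    # Element-wise soft thresholding with threshold 1 / alpha
    return np.sign(h) * np.maximum(np.abs(h) - 1 / alpha, 0)
print(shrinkage(np.array([-2.0, -0.1, 0.5, 3.0]), 2.0))
# Expected: [-1.5 -0.   0.   2.5]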
III-F. $\boldsymbol{z}$-Subproblem
In terms of the variable $\boldsymbol{z}$, the partial derivative of the augmented Lagrangian function with respect to $\mathcal{P}_{\Omega}(\boldsymbol{z})$ is given by

$$\frac{\partial\mathcal{L}}{\partial\mathcal{P}_{\Omega}(\boldsymbol{z})} = \gamma\,\mathcal{P}_{\Omega}(\boldsymbol{z} - \boldsymbol{y}) - \lambda\,\mathcal{P}_{\Omega}(\boldsymbol{x} - \boldsymbol{z}) - \mathcal{P}_{\Omega}(\boldsymbol{w}),$$

while

$$\frac{\partial\mathcal{L}}{\partial\mathcal{P}_{\Omega^{\perp}}(\boldsymbol{z})} = -\lambda\,\mathcal{P}_{\Omega^{\perp}}(\boldsymbol{x} - \boldsymbol{z}) - \mathcal{P}_{\Omega^{\perp}}(\boldsymbol{w}).$$

As a result, letting the partial derivative of the augmented Lagrangian function with respect to $\boldsymbol{z}$ be a zero vector, the least squares solution is given by

$$\boldsymbol{z} := \frac{1}{\gamma + \lambda}\,\mathcal{P}_{\Omega}(\gamma\boldsymbol{y} + \lambda\boldsymbol{x} + \boldsymbol{w}) + \frac{1}{\lambda}\,\mathcal{P}_{\Omega^{\perp}}(\lambda\boldsymbol{x} + \boldsymbol{w}),$$

which combines the variables $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{w}$ on the observed and unobserved index sets.
import numpy as np
def update_z(y_train, pos_train, x, w, lmbda, gamma):
    z = x + w / lmbda  # solution on the unobserved entries
    # Weighted combination on the observed entries
    z[pos_train] = (gamma * y_train + lmbda * z[pos_train]) / (gamma + lmbda)
    return z
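Putting the two updates together, a minimal sketch of the ADMM iteration (a hypothetical driver written by us; the complete experiment lives in the LCR repository) could look as follows.
import numpy as np
def lcr_admm(y, pos_train, lmbda = 1.0, gamma = 10.0, maxiter = 100):
    # Alternating updates of x, z, and the dual variable w
    y_train = y[pos_train]
    z = y.copy()
    w = np.zeros_like(y)
    x = np.zeros_like(y)
    for _ in range(maxiter):
        x = update_x(z, w, lmbda)                             # nuclear norm step (FFT)
        z = update_z(y_train, pos_train, x, w, lmbda, gamma)  # least squares step
        w = w + lmbda * (x - z)                               # dual update
    return x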
III-G. Time Series Imputation
As shown in Figure 8, we randomly remove 95% of the observations as missing values, so that we only have 14 volume observations (i.e., 14 blue dots) for the reconstruction. The circulant matrix nuclear norm minimization can capture the global trends from partial observations.
Figure 8. Univariate time series imputation on the freeway traffic volume time series. The blue and red curves correspond to the ground truth time series and reconstructed time series achieved by the circulant matrix nuclear norm minimization.
Please reproduce the experiments by following the Jupyter Notebook, which is available at the LCR repository on GitHub. For the supplementary material, please check out Appendix I(A).
IV. Laplacian Convolutional Representation
IV-A. Representing Circulant Graphs by Laplacian Kernels, Instead of Laplacian Matrices?
The Laplacian convolutional representation (LCR) model proposed by Chen et al. (2024) integrates the local trends of time series into global trend modeling via the use of circulant matrix nuclear norm minimization. As shown in Figure 1(b), the transition of time series data points can be naively modeled by smoothing regularization. We start by introducing the Laplacian matrix to represent the prescribed relationships among time series data points.
Figure 9. Undirected and circulant graphs on the data points with certain degrees. The degrees of the left and right graphs are 2 and 4, respectively.
By definition, the Laplacian matrix of a circulant graph (see e.g., Figure 9) is a circulant matrix (refer to the formal definition of circulant matrix in Section II-C). In Figure 9, the Laplacian matrix corresponding to the left graph (of degree 2) is expressed as

$$\boldsymbol{L} = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 & -1 \\ -1 & 2 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -1 & 0 & 0 & \cdots & -1 & 2 \end{bmatrix},$$

while the Laplacian matrix for the right one (of degree 4) is

$$\boldsymbol{L} = \begin{bmatrix} 4 & -1 & -1 & \cdots & -1 & -1 \\ -1 & 4 & -1 & \cdots & 0 & -1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ -1 & -1 & 0 & \cdots & -1 & 4 \end{bmatrix}.$$

In both Laplacian matrices, the diagonal entries are the degrees of the circulant graphs. The first column encodes the structural information of these graphs, capturing their underlying circulant structure. In this work, we define the first columns of the Laplacian matrices as the Laplacian kernels (Chen et al., 2024):

$$\boldsymbol{\ell} = (2, -1, 0, \cdots, 0, -1)^\top,$$

and

$$\boldsymbol{\ell} = (4, -1, -1, 0, \cdots, 0, -1, -1)^\top,$$

respectively.
Laplacian Kernel. Given any time series $\boldsymbol{x} \in \mathbb{R}^{T}$, suppose $\tau \in \mathbb{Z}^{+}$ to be the kernel size of an undirected and circulant graph, then the Laplacian kernel is defined as

$$\boldsymbol{\ell} \triangleq (2\tau, \underbrace{-1, \cdots, -1}_{\tau}, 0, \cdots, 0, \underbrace{-1, \cdots, -1}_{\tau})^\top \in \mathbb{R}^{T},$$

which is the first column of the Laplacian matrix, and the degree matrix is diagonalized with entries $2\tau$.
IV-B. Reformulating Temporal Regularization with Circular Convolution
In machine learning, one can write the temporal regularization on the time series $\boldsymbol{x}$, e.g.,

$$\mathcal{R}(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{L}\boldsymbol{x}\|_2^2.$$

Since the Laplacian matrix is a circulant matrix, the matrix-vector multiplication $\boldsymbol{L}\boldsymbol{x}$ can be reformulated as a circular convolution between the Laplacian kernel $\boldsymbol{\ell}$ and the vector $\boldsymbol{x}$, i.e., $\boldsymbol{L}\boldsymbol{x} = \boldsymbol{\ell} \star \boldsymbol{x}$. For instance, given a Laplacian kernel $\boldsymbol{\ell} = (2, -1, 0, 0, -1)^\top$, we have

$$\boldsymbol{\ell} \star \boldsymbol{x} = \begin{bmatrix} 2x_1 - x_2 - x_5 \\ 2x_2 - x_1 - x_3 \\ 2x_3 - x_2 - x_4 \\ 2x_4 - x_3 - x_5 \\ 2x_5 - x_1 - x_4 \end{bmatrix}.$$

As can be seen, this Laplacian kernel can build local correlations for the vector $\boldsymbol{x}$. Thus, the purpose of introducing Laplacian kernels on time series is local trend modeling with temporal regularization such that

$$\mathcal{R}(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{\ell} \star \boldsymbol{x}\|_2^2.$$
As a matter of fact, we have several motivations and reasons for reformulating the temporal regularization with circular convolution. In particular, one of the most useful properties of circular convolution is its relationship with the discrete Fourier transform, i.e.,

$$\mathcal{R}(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{\ell} \star \boldsymbol{x}\|_2^2 = \frac{1}{2T}\|\mathcal{F}(\boldsymbol{\ell}) \circ \mathcal{F}(\boldsymbol{x})\|_2^2,$$

where $\mathcal{F}(\cdot)$ denotes the discrete Fourier transform. The symbol $\circ$ is the Hadamard product or element-wise product.
Example 10. Given the Laplacian kernel $\boldsymbol{\ell} = (2, -1, 0, 0, -1)^\top$ and the vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$, the circular convolution is

$$\boldsymbol{\ell} \star \boldsymbol{x} = (-5, 0, 0, 0, 5)^\top,$$

and the regularization is

$$\mathcal{R}(\boldsymbol{x}) = \frac{1}{2}\|\boldsymbol{\ell} \star \boldsymbol{x}\|_2^2 = 25.$$

How to compute the regularization with fast Fourier transform? 1) Compute the fast Fourier transforms of $\boldsymbol{\ell}$ and $\boldsymbol{x}$ as $\mathcal{F}(\boldsymbol{\ell})$ and $\mathcal{F}(\boldsymbol{x})$, respectively; 2) Compute the regularization as $\mathcal{R}(\boldsymbol{x}) = \frac{1}{2T}\|\mathcal{F}(\boldsymbol{\ell}) \circ \mathcal{F}(\boldsymbol{x})\|_2^2$.
import numpy as np
ell = np.array([2, -1, 0, 0, -1])
x = np.array([0, 1, 2, 3, 4])
f_ell = np.fft.fft(ell)
f_x = np.fft.fft(x)
print('Regularization R(x):')
print(np.linalg.norm(f_ell * f_x, 2) ** 2 / (2 * len(x)))
V. Learning Interpretable Convolutional Kernels
V-A. Convolutional Kernels
On the univariate time series in the form of a vector, the circular convolution between the time series $\boldsymbol{x} \in \mathbb{R}^{T}$ and the convolutional kernel $\boldsymbol{\theta} \in \mathbb{R}^{T}$ can be constructed with certain purposes. The expression is formally defined by $\boldsymbol{x} \star \boldsymbol{\theta}$. When using this kernel to characterize the temporal correlations of time series, one simple yet interesting idea is making the loss of circular convolution, namely, $\|\boldsymbol{x} \star \boldsymbol{\theta}\|_2^2$, as small as possible. However, the optimization problem

$$\min_{\boldsymbol{\theta}}\ \|\boldsymbol{x} \star \boldsymbol{\theta}\|_2^2$$

is ill-posed because the optimal solution is all entries of the kernel $\boldsymbol{\theta}$ being zeros. To address this challenge, we assume that the first entry of $\boldsymbol{\theta}$ is one and the following $T - 1$ entries are $-\boldsymbol{w}$, in which $\boldsymbol{w} \geq 0$ is a non-negative vector. Thus, the optimization problem becomes

$$\begin{aligned}\min_{\boldsymbol{w}}\ &\|\boldsymbol{x} \star \boldsymbol{\theta}\|_2^2 \\ \text{s.t.}\ &\begin{cases}\boldsymbol{\theta} = (1, -\boldsymbol{w}^\top)^\top, \\ \boldsymbol{w} \geq 0,\end{cases}\end{aligned}$$

where the constraint is of great significance for minimizing the objective function.
V-B. Sparse Linear Regression
Recall that the circular convolution can be converted into a linear transformation with the circulant matrix, namely, $\boldsymbol{x} \star \boldsymbol{\theta} = \mathcal{C}(\boldsymbol{x})\boldsymbol{\theta}$. If we let $\boldsymbol{P} \in \mathbb{R}^{T\times T}$ be the time-shift matrix, then the circulant matrix can be expressed as a matrix polynomial such that

$$\mathcal{C}(\boldsymbol{x}) = x_1\boldsymbol{I}_T + x_2\boldsymbol{P} + x_3\boldsymbol{P}^2 + \cdots + x_T\boldsymbol{P}^{T-1},$$

which is equivalent to

$$\mathcal{C}(\boldsymbol{x}) = \sum_{k=1}^{T} x_k\boldsymbol{P}^{k-1},$$

where $\boldsymbol{I}_T = \boldsymbol{P}^{0}$ is the identity matrix and the time-shift matrix is given by

$$\boldsymbol{P} = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.$$
Example 11. Any square matrix can be multiplied by itself, and the result is a square matrix of the same size. Given a time-shift matrix of size $5\times 5$ such that

$$\boldsymbol{P} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix},$$

the powers of the matrix can be written as follows,

$$\boldsymbol{P}^2 = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}, \quad \boldsymbol{P}^3 = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix},$$

and so forth; each power $\boldsymbol{P}^{k}$ cyclically shifts the identity matrix down by $k$ positions.
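The matrix polynomial identity can be checked numerically; the snippet below (our own construction) rebuilds the circulant matrix from powers of the time-shift matrix.
import numpy as np
P = np.roll(np.eye(5, dtype = int), 1, axis = 0)  # time-shift matrix
x = np.array([0, 1, 2, 3, 4])
C = sum(x[k] * np.linalg.matrix_power(P, k) for k in range(5))
print(C)  # identical to the circulant matrix C(x)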
As a result, the loss function constructed by the circular convolution is

$$\|\boldsymbol{x} \star \boldsymbol{\theta}\|_2^2 = \|\mathcal{C}(\boldsymbol{x})\boldsymbol{\theta}\|_2^2 = \|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2,$$

where the auxiliary matrix $\boldsymbol{A} \in \mathbb{R}^{T\times(T-1)}$ is comprised of the last $T - 1$ columns of the circulant matrix $\mathcal{C}(\boldsymbol{x})$, namely,

$$\boldsymbol{A} \triangleq \begin{bmatrix} x_T & x_{T-1} & \cdots & x_2 \\ x_1 & x_T & \cdots & x_3 \\ \vdots & \vdots & \ddots & \vdots \\ x_{T-1} & x_{T-2} & \cdots & x_1 \end{bmatrix},$$

where the matrix $\mathcal{C}(\boldsymbol{x})$ is separated into the vector $\boldsymbol{x}$ (i.e., the first column of $\mathcal{C}(\boldsymbol{x})$) and the matrix $\boldsymbol{A}$ (i.e., the last $T - 1$ columns of $\mathcal{C}(\boldsymbol{x})$), see Figure 10 and Figure 11 for illustrations.
Figure 10. Illustration of circular convolution as the linear transformation with a circulant matrix.
As can be seen, one of the most intriguing properties is that the circular convolution can be converted into the expression

$$\boldsymbol{x} \star \boldsymbol{\theta} = \boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}$$

(see Figure 11), which takes the form of a linear regression with the data pair $\{\boldsymbol{x}, \boldsymbol{A}\}$.
Figure 11. Illustration of circular convolution with a structured vector $\boldsymbol{\theta} = (1, -\boldsymbol{w}^\top)^\top$.
Example 12. Given the vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$ and the kernel $\boldsymbol{\theta} = (1, -\boldsymbol{w}^\top)^\top$, the circulant matrix is

$$\mathcal{C}(\boldsymbol{x}) = \begin{bmatrix} 0 & 4 & 3 & 2 & 1 \\ 1 & 0 & 4 & 3 & 2 \\ 2 & 1 & 0 & 4 & 3 \\ 3 & 2 & 1 & 0 & 4 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix},$$

and the auxiliary matrix is

$$\boldsymbol{A} = \begin{bmatrix} 4 & 3 & 2 & 1 \\ 0 & 4 & 3 & 2 \\ 1 & 0 & 4 & 3 \\ 2 & 1 & 0 & 4 \\ 3 & 2 & 1 & 0 \end{bmatrix}.$$

Thus, we can compute the circular convolution by

$$\boldsymbol{x} \star \boldsymbol{\theta} = \mathcal{C}(\boldsymbol{x})\boldsymbol{\theta} = \boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}.$$
V-C. Optimization Problem
To reinforce the model interpretability, we impose a sparsity constraint on the vector $\boldsymbol{w}$. This is extremely important for capturing local and nonlocal correlations of time series. The process of learning interpretable convolutional kernels can be formulated as follows,

$$\begin{aligned}\min_{\boldsymbol{w}}\ &\|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\ &\begin{cases}\boldsymbol{w} \geq 0, \\ \|\boldsymbol{w}\|_0 \leq \tau,\end{cases}\end{aligned}$$

where $\tau \in \mathbb{Z}^{+}$ is the sparsity level, i.e., the number of nonzero values.
Figure 12. Illustration of learning a $\tau$-sparse vector $\boldsymbol{w}$ from the time series $\boldsymbol{x}$ with the constructed formula as $\boldsymbol{x} \approx \boldsymbol{A}\boldsymbol{w}$.
$\ell_0$-Norm. For any vector $\boldsymbol{w} = (w_1, w_2, \ldots, w_{T-1})^\top$, the $\ell_0$-norm is defined as follows,

$$\|\boldsymbol{w}\|_0 = \operatorname{card}(\operatorname{supp}(\boldsymbol{w})),$$

where $\operatorname{card}(\cdot)$ and $\operatorname{supp}(\boldsymbol{w}) = \{k : w_k \neq 0\}$ denote the cardinality and the support set of $\boldsymbol{w}$, respectively. For instance, given vector $\boldsymbol{w} = (1, 0, 0, 3)^\top$, the support set is $\operatorname{supp}(\boldsymbol{w}) = \{1, 4\}$ and the number of nonzero entries is $\|\boldsymbol{w}\|_0 = 2$.
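In numpy, the support set and the $\ell_0$-norm can be computed directly; the vector below is hypothetical.
import numpy as np
w = np.array([1.0, 0, 0, 3.0])
print(np.nonzero(w)[0])     # support set (0-indexed)
print(np.count_nonzero(w))  # l0-norm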
V-D. Subspace Pursuit
Subspace Pursuit (SP), introduced by W. Dai and O. Milenkovic in 2008, is an iterative greedy algorithm used for sparse signal recovery, particularly in the context of compressed sensing. It aims to solve the $\ell_0$-norm minimization problem, which seeks to find the sparsest solution to an underdetermined system of linear equations. The $\ell_0$-norm counts the number of nonzero elements in a vector, making it a natural measure of sparsity. Generally speaking, one can use the subspace pursuit algorithm to solve the following optimization problem:

$$\begin{aligned}\min_{\boldsymbol{w}}\ &\|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\ &\|\boldsymbol{w}\|_0 \leq \tau.\end{aligned}$$
Figure 13. Illustration of learning a $\tau$-sparse vector $\boldsymbol{w}$ from the signal vector $\boldsymbol{x}$ and the dictionary matrix $\boldsymbol{A}$ with the linear formula $\boldsymbol{x} \approx \boldsymbol{A}\boldsymbol{w}$.
Algorithm. Subspace Pursuit
- Input: Signal vector $\boldsymbol{x}$, dictionary matrix $\boldsymbol{A}$, and sparsity level $\tau$.
- Output: $\tau$-sparse vector $\boldsymbol{w}$ and index set $S$.
- Initialization: Sparse vector $\boldsymbol{w} = \boldsymbol{0}$ (i.e., zero vector), index set $S = \emptyset$ (i.e., empty set), and error vector $\boldsymbol{r} = \boldsymbol{x}$.
- while not stop do
  - Find $S_0$ as the index set of the $\tau$ largest entries of $|\boldsymbol{A}^\top\boldsymbol{r}|$.
  - $S := S \cup S_0$.
  - $\boldsymbol{w}_S := \boldsymbol{A}_S^{\dagger}\boldsymbol{x}$ (least squares).
  - Find $S$ as the index set of the $\tau$ largest entries of $|\boldsymbol{w}|$.
  - $\boldsymbol{w}_S := \boldsymbol{A}_S^{\dagger}\boldsymbol{x}$ (least squares again!).
  - Set $w_k = 0$ for all $k \notin S$.
  - Find $\boldsymbol{r} := \boldsymbol{x} - \boldsymbol{A}_S\boldsymbol{w}_S$.
- end
Example 13. Given vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$, the auxiliary matrix $\boldsymbol{A}$ (i.e., the last 4 columns of the circulant matrix $\mathcal{C}(\boldsymbol{x})$) is

$$\boldsymbol{A} = \begin{bmatrix} 4 & 3 & 2 & 1 \\ 0 & 4 & 3 & 2 \\ 1 & 0 & 4 & 3 \\ 2 & 1 & 0 & 4 \\ 3 & 2 & 1 & 0 \end{bmatrix}.$$

The question is how to learn the vector $\boldsymbol{w}$ with sparsity level $\tau = 2$ from the following optimization problem:

$$\begin{aligned}\min_{\boldsymbol{w}}\ &\|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\ &\|\boldsymbol{w}\|_0 \leq 2.\end{aligned}$$

In this case, the empirical solution of the decision variable is $\boldsymbol{w} = (4/9, 0, 0, 4/9)^\top \approx (0.444, 0, 0, 0.444)^\top$. To reproduce it, the solution algorithm of subspace pursuit is given by
import numpy as np
def circ_mat(vec):
    # Construct the T-by-T circulant matrix of the vector `vec`
    n = vec.shape[0]
    mat = np.zeros((n, n))
    mat[:, 0] = vec
    for i in range(1, n):
        mat[:, i] = np.append(vec[-i :], vec[: n - i], axis = 0)
    return mat
def SP(x, tau, stop = np.inf, epsilon = 1e-2):
    t = x.shape[0]
    mat = circ_mat(x)
    A = mat[:, 1 :]  # auxiliary matrix: last columns of the circulant matrix
    r = x            # error (residual) vector
    w = np.zeros(t - 1)
    S = np.array([])
    i = 0
    while np.linalg.norm(r, 2) > epsilon and i < stop:
        Ar = A.T @ r
        S0 = np.argsort(abs(Ar))[- tau :]       # indices of the tau largest entries
        S = np.append(S[:], S0[:]).astype(int)  # merge with the current support
        AS = A[:, S]
        w[S] = np.linalg.pinv(AS) @ x           # least squares
        S = np.argsort(abs(w))[- tau :]         # keep the tau largest coefficients
        w = np.zeros(t - 1)
        AS = A[:, S]
        w[S] = np.linalg.pinv(AS) @ x           # least squares again
        r = x - AS @ w[S]                       # update the error vector
        i += 1
    print('Indices of non-zero coefficients (support set):', S)
    print('Non-zero entries of sparse temporal kernel:', w[S])
    print()
    return w, S, r
x = np.array([0, 1, 2, 3, 4])
tau = 2 # sparsity level
w, S, r = SP(x, tau, 10)
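For reference, running this snippet should report the support set [0 3] (0-indexed) with both nonzero coefficients approximately equal to 0.444, consistent with the empirical solution in Example 13.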
V-E. Mixed-Integer Programming
There are often special-purpose algorithms for solving constrained linear regression problems, see constrained least squares. Some examples of constraints include: 1) Non-negative least squares, in which the vector $\boldsymbol{w}$ must satisfy the vector inequality $\boldsymbol{w} \geq 0$ (each entry must be either positive or zero); 2) Box-constrained least squares, in which the vector $\boldsymbol{w}$ must satisfy the vector inequalities $\boldsymbol{\ell} \leq \boldsymbol{w} \leq \boldsymbol{u}$ with given lower and upper bounds.
In fact, the linear regression with sparsity constraints in the form of the $\ell_0$-norm can be easily converted into a mixed-integer programming problem. Thus, the problem illustrated in Figure 12 is equivalent to

$$\begin{aligned}\min_{\boldsymbol{w},\,\boldsymbol{\beta}}\ &\|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\ &\begin{cases}\displaystyle\sum_{k=1}^{T-1}\beta_k \leq \tau, \\ \boldsymbol{\beta} \in \{0, 1\}^{T-1}, \\ \boldsymbol{0} \leq \boldsymbol{w} \leq M\boldsymbol{\beta},\end{cases}\end{aligned}$$

where $\boldsymbol{\beta}$ is the binary decision variable. The weight $M$ (possibly a great value for most cases) can control the upper bound of the vector $\boldsymbol{w}$. If $\beta_k = 1$, then the value of $w_k$ ranges between $0$ and $M$. Otherwise, the value of $w_k$ is $0$ because both lower and upper bounds are $0$. In fact, the last constraint can also be written as follows,

$$\boldsymbol{w} \leq M\boldsymbol{\beta},$$

because the vector $\boldsymbol{w}$ is a non-negative vector.
Example 14. Given vector $\boldsymbol{x} = (0, 1, 2, 3, 4)^\top$, the auxiliary matrix $\boldsymbol{A}$ (i.e., the last 4 columns of the circulant matrix $\mathcal{C}(\boldsymbol{x})$) is

$$\boldsymbol{A} = \begin{bmatrix} 4 & 3 & 2 & 1 \\ 0 & 4 & 3 & 2 \\ 1 & 0 & 4 & 3 \\ 2 & 1 & 0 & 4 \\ 3 & 2 & 1 & 0 \end{bmatrix}.$$

The question is how to learn the vector $\boldsymbol{w}$ with sparsity level $\tau = 2$ from the following optimization problem:

$$\begin{aligned}\min_{\boldsymbol{w},\,\boldsymbol{\beta}}\ &\|\boldsymbol{x} - \boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\ &\begin{cases}\displaystyle\sum_{k=1}^{4}\beta_k \leq 2, \\ \boldsymbol{\beta} \in \{0, 1\}^{4}, \\ \boldsymbol{0} \leq \boldsymbol{w} \leq \boldsymbol{\beta}\ \text{(i.e., } M = 1\text{)}.\end{cases}\end{aligned}$$

In this case, the empirical solution of the decision variable is $\boldsymbol{w} = (4/9, 0, 0, 4/9)^\top \approx (0.444, 0, 0, 0.444)^\top$. To reproduce it, the solution algorithm of mixed-integer programming is given by
import numpy as np
import cvxpy as cp
x = np.array([0, 1, 2, 3, 4])
A = circ_mat(x)[:, 1 :]
tau = 2 # sparsity level
d = A.shape[1]
# Variables
w = cp.Variable(d, nonneg=True)
beta = cp.Variable(d, boolean=True)
# Constraints
constraints = [cp.sum(beta) <= tau, w <= beta, w >= 0]
# Objective
objective = cp.Minimize(cp.sum_squares(x - A @ w))
# Problem
problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.GUROBI) # Ensure to use a solver that supports MIP
# Solution
print("Optimal beta:", w.value)
print("Active indices:", np.nonzero(beta.value > 0.5)[0])
Please install the optimization packages beforehand, e.g., pip install gurobipy. One can also check out the available optimization solvers in cvxpy:
import cvxpy as cp
print(cp.installed_solvers())
VI. Insight into Ridesharing Trip Time Series
The City of Chicago’s open data portal provides a large amount of human mobility data, including taxi trips and TNP ridesharing trips. Figure 14 shows the realistic time series data of ridesharing trips with a strong weekly periodicity, allowing one to examine the usage of interpretable convolutional kernels.
Figure 14. Hourly time series of aggregated ridesharing trip counts in the City of Chicago during the first two weeks (i.e., $14\times 24 = 336$ hours in total) since April 1, 2024. The time series exhibits weekly periodicity, referring to the regularity of human mobility.
Taking the time series in Figure 14 as an example, the mixed-integer programming solver in CPLEX produces the temporal kernel with three nonzero coefficients, where the sparsity level is set as $\tau = 3$ in the constraint $\|\boldsymbol{w}\|_0 \leq \tau$, or equivalently $\sum_{k}\beta_k \leq \tau$ on the binary decision variable $\boldsymbol{\beta}$. This result basically demonstrates local correlations such as the lags $t - 1$ and $t - 2$, as well as the weekly seasonality at the lag $t - 168$ (i.e., $7\times 24$ hours). Below is the Python implementation of the mixed-integer programming solver with CPLEX.
import numpy as np
from docplex.mp.model import Model
def kernel_mip(data, tau):
    model = Model(name = 'Sparse Autoregression')
    T = data.shape[0]
    w = [model.continuous_var(lb = 0, name = f'w_{k}') for k in range(T - 1)]
    beta = [model.binary_var(name = f'beta_{k}') for k in range(T - 1)]
    # Negative indices wrap around, matching the circular convolution
    error = [data[t] - model.sum(w[k] * data[t - k - 1] for k in range(T - 1)) for t in range(T)]
    model.minimize(model.sum(r ** 2 for r in error))
    model.add_constraint(model.sum(beta[k] for k in range(T - 1)) <= tau)
    for k in range(T - 1):
        model.add_constraint(w[k] <= beta[k])
    solution = model.solve()
    if solution:
        w_coef = np.array(solution.get_values(w))
        error = 0
        for t in range(T):
            a = data[t]
            for k in range(T - 1):
                a -= w_coef[k] * data[t - k - 1]
            error += a ** 2
        print('Objective function: {}'.format(error))
        ind = np.where(w_coef > 0)[0].tolist()
        print('Support set: ', ind)
        print('Coefficients w: ', w_coef[ind])
        print('Cardinality of support set: ', len(ind))
        return w_coef, ind
    else:
        print('No solution found.')
        return None
import numpy as np
import time
tensor = np.load('Chicago_rideshare_mob_tensor_24.npz')['arr_0'][:, :, : 14 * 24]
data = np.sum(np.sum(tensor, axis = 0), axis = 0)
tau = 3
start = time.time()
w, ind = kernel_mip(data, tau)
end = time.time()
print('Running time (s):', end - start)
For the entire implementation, please check out the Jupyter Notebook on GitHub. The processed time series dataset is available at the folder of Chicago-data.
VII. Concluding Remarks
From a time series analysis perspective, our modeling ideas can guide future research by using the properties of circular convolution and connecting with various signals. Because circular convolution is a core component in many machine learning tasks, this post could provide an example for how to leverage the circular convolution, discrete Fourier transform, and linear regression. While we focused on the time series imputation and the convolutional kernel learning problems, we hope that our modeling ideas will inspire others in the fields machine learning and optimization.
(Posted by Xinyu Chen on April 17, 2024)