Knowledge Repository

Xinyu Chen (陈新宇) created this page since early 2024 with the purpose of fostering research knowledge, vision, insight, and style. In the meantime, it aims to connect random concepts with mathematics and machine learning.

59th Mile

Optimal Sketching Bounds for Sparse Linear Regression

References

Tung Mai, Alexander Munteanu, Cameron Musco, Anup B. Rao, Chris Schwiegelshohn, David P. Woodruff (2023). Optimal Sketching Bounds for Sparse Linear Regression. arXiv:2304.02261.

58th Mile

Sum of Squares (SOS) Technique

SOS is a classical method for solving polynomial optimization. The feasible set with a multivariate polynomial function

$p(x)=\sum_{\alpha=0}^{n}a_{\alpha}x^{\alpha}\geq 0,\forall x$

$p(x)=\sum_{\alpha=0}^{n}q_{\alpha}^{2}(x), q_{\alpha}(x)\in\mathbb{R}[x]$

becomes an SOS by using semidefinite programming, i.e., $p(x)=w(x)^\top Qw(x)$ with a semidefinite matrix $Q$ . Sometimes, SOS does not work, e.g., on the polynomial function such that

$p(x,y)=2x^4+5y^4+2x^3y$

SOS can optimize directly over the sum-of-squares cone and its dual, circumventing the semidefinite programming (SDP) reformulation, which requires a large number of auxiliary variables when the degree of sum-of-squares polynomials is large.

References

Pablo A. Parrilo. Sum of squares techniques and polynomial optimization. [MIT Course – 6.7230 Algebraic Techniques and Semidefinite Optimization]
Dávid Papp, Sercan Yildiz (2019). Sum-of-Squares Optimization without Semidefinite Programming. SIAM Journal on Optimization. 29 (1).
Yang Zheng, Giovanni Fantuzzi (2023). Sum-of-squares chordal decomposition of polynomial matrix inequalities. Mathematical Programming. Volume 197, pages 71–108.

57th Mile

Subspace-Conjugate Gradient

Solving multi-term linear equations efficiently in numerical linear algebra is still a challenging problem. For any $\boldsymbol{A}_{k}\in\mathbb{R}^{m\times m},\boldsymbol{B}_{k}\in\mathbb{R}^{n\times n},\,k\in[d]$ and $\boldsymbol{C}\in\mathbb{R}^{m\times n}$ , the Sylvester equation is

$\sum_{k\in[d]}\boldsymbol{A}_k\boldsymbol{X}\boldsymbol{B}_k=\boldsymbol{C}$

whose closed-form solution is

$\operatorname{vec}(\boldsymbol{X})=\Bigl(\sum_{k\in[d]}\boldsymbol{B}_k^\top\otimes\boldsymbol{A}_k\Bigr)^\dagger\operatorname{vec}(\boldsymbol{C})$

Recent study proposed a new iterative scheme for symmetric and positive definite operators, significantly advancing methods such as truncated matrix-oriented Conjugate Gradients (CG). The new algorithm capitalizes on the low-rank matrix format of its iterates by fully exploiting the subspace information of the factors as iterations proceed.

References

Davide Palitta, Martina Iannacito, Valeria Simoncini (2025). A subspace-conjugate gradient method for linear matrix equations. arXiv:2501.02938.

56th Mile

Characterizing Wikipedia Linking Across the Web

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. Using the dataset with 90 million English Wikipedia links spanning 1.68% of Web domains, recent study (Veselovsky et al, 2025) reveals three key findings:

Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often.
The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation.
Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia’s function as a background knowledge provider.

References

Veniamin Veselovsky, Tiziano Piccardi, Ashton Anderson, Robert West, Akhil Arora (2025). Web2Wiki: Characterizing Wikipedia Linking Across the Web. arXiv:2505.15837.
Cristian Consonni, David Laniado, Alberto Montresor (2019). WikiLinkGraphs: A Complete, Longitudinal and Multi-Language Dataset of the Wikipedia Link Networks. arXiv:1902.04298.

55th Mile

Random Projections for Ordinary Least Squares

For any $\boldsymbol{y}\in\mathbb{R}^{n}$ and $\boldsymbol{\Phi}\in\mathbb{R}^{n\times d}$ (i.e., $n$ observations and $d$ features), the linear regression such that

$\min_{\boldsymbol{\theta}\in\mathbb{R}^{d}}\,\|\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{\theta}\|_2^2$

can be replaced by

$\min_{\boldsymbol{\theta}\in\mathbb{R}^{d}}\,\|\boldsymbol{S}\boldsymbol{y}-\boldsymbol{S}\boldsymbol{\Phi}\boldsymbol{\theta}\|_2^2$

where $\boldsymbol{S}\in\mathbb{R}^{s\times n}$ is an i.i.d. Gaussian matrix. We typically have $n>s>d$ (more observations than the feature dimension), and one of the benefits of sketching is to be able to store a reduced representation of the data (i.e., $\mathbb{R}^{s\times d}$ instead of $\mathbb{R}^{n\times d}$ ).

The matrix $\boldsymbol{S}\boldsymbol{\Phi}\in\mathbb{R}^{s\times d}$ is a subspace embedding for $\boldsymbol{\Phi}\in\mathbb{R}^{n\times d}$ if

$(1-\epsilon)\|\boldsymbol{\Phi}\boldsymbol{\theta}\|_2^2\leq \|\boldsymbol{S}\boldsymbol{\Phi}\boldsymbol{\theta}\|_2^2\leq (1+\epsilon)\|\boldsymbol{\Phi}\boldsymbol{\theta}\|_2^2$

for all $\boldsymbol{\theta}\in\mathbb{R}^{d}$ .

A sketching matrix $\boldsymbol{S}\in\mathbb{R}^{d\times s}$ of random projection can also be introduced to

$\min_{\boldsymbol{\eta}\in\mathbb{R}^{s}}\,\|\boldsymbol{y}-\boldsymbol{\Phi}\boldsymbol{S}\boldsymbol{\eta}\|_2^2$

for the case of $d>n>s$ (i.e., underdetermined). This corresponds to replacing the feature vectors $\phi(\boldsymbol{x})\in\mathbb{R}^{d}$ by $\boldsymbol{S}^\top\phi(\boldsymbol{x})\in\mathbb{R}^s$ .

References

Ethan N. Epperly. Does Sketching Work?
Francis Bach (2025). Learning Theory from First Principles. Chapter 10.2. MIT Press. [PDF]
Yuji Nakatsukasa, Joel A Tropp (2024). Fast and accurate randomized algorithms for linear systems and eigenvalue problems. SIAM Journal on Matrix Analysis and Applications. 45 (2): 1183-1214.
Christopher Musco. Tutorial on Matrix Sketching.
Moses Charikar. Lecture 19: Sparse Subspace Embeddings. CS369G: Algorithmic Techniques for Big Data.
Tensor sketch. Wikipedia.

54th Mile

The Challenge of Insuring Vehicles with Autonomous Functions

References

Oliver Wyman. The Challenge of Insuring Vehicles with Autonomous Functions. [PDF]

53rd Mile

Two Decades of Low-Rank Optimization

Semidefinite programming (SDP) is powerful for solving low-rank optimization. Despite the most classical algorithms of SDP, one can use first-order methods to solve bigger SDPs more efficiently, including dual-scaling method, spectral bundle method, nonlinear programming approaches, dual Cholesky approach, chordal-graph approaches, and iterative solver for the Newton system.

Low-rank matrix optimization can be formulated as

$\begin{aligned}\min_{\boldsymbol{X}}\,&\langle\boldsymbol{C},\boldsymbol{X}\rangle \\ \text{s.t.}\,&\langle\boldsymbol{A}_{i},\boldsymbol{X}\rangle=b_i,\,\forall i=1,2,\ldots,m \\ &\boldsymbol{X}\succeq 0 \end{aligned}$

There exists an optimal solution $\boldsymbol{X}^{*}$ with rank $r^{*}$ satisfying $r^{*}<\sqrt{2m}$ . For almost all cost matrices $\boldsymbol{C}$ , the first- and second-order necessary optimality conditions are sufficiently for global optimality.

One idea to verify the global optimality is summarized as follows. Suppose the original SDP feasible set is compact with interior. Consider the rank-constrained SDP, enforcing $\text{rank}(\boldsymbol{X})\leq r$ with $r\geq \lceil \sqrt{2(m+1)}\rceil$ . Let $\bar{\boldsymbol{X}}$ be a local minimum of the rank-constrained SDP. If $\text{rank}(\boldsymbol{X})< r$ , then $\bar{\boldsymbol{X}}$ is optimal for the original SDP.

Benign nonconvexity refers to a property of certain nonconvex optimization problems where, despite the lack of global convexity, the problem exhibits characteristics that make it tractable—meaning efficient optimization methods can still find good (often global) solutions.

References

Sam Burer (2023). Two decades of low-rank optimization. [YouTube]

52nd Mile

Matrix Completion and Decomposition in Phase-Bounded Cones

The nuclear norm minimization for matrix completion

$\min_{\boldsymbol{X}}\,\|\boldsymbol{X}\|_{*}\quad\text{s.t.}\,\mathcal{P}_{\Omega}(\boldsymbol{X})=\mathcal{P}_{\Omega}(\boldsymbol{Y})$

is equivalent to the semidefinite programming such that

$\min_{\boldsymbol{X},\boldsymbol{W}_1,\boldsymbol{W}_2}\,\frac{1}{2}\text{tr}(\boldsymbol{W}_1+\boldsymbol{W}_2)\quad \text{s.t.}\,\begin{bmatrix} \boldsymbol{W}_1 & \boldsymbol{X} \\ \boldsymbol{X}^\top & \boldsymbol{W}_2 \end{bmatrix}\succeq 0,\quad \mathcal{P}_{\Omega}(\boldsymbol{X})=\mathcal{P}_{\Omega}(\boldsymbol{Y})$

where the feasible set is the positive semidefinite cone. There is a phase-bounded cone for matrix completion with chordal graph pattern, i.e., phase-bounded completions of a completable partial matrix with a block bounded pattern.

References

Ding Zhang, Axel Ringl, and Li Qiu (2025). Matrix Completion and Decomposition in Phase-Bounded Cones. SIAM Journal on Matrix Analysis and Applications. [DOI]

51st Mile

Revisiting Interpretable Machine Learning (IML)

Local IML methods explain individual predictions of ML models. Popular IML methods are Shapley values and counterfactual explanations. Counterfactual explanations explain predictions in the form of what-if scenarios, they are contrastive and focus on a few reasons. The Shapley values provide an answer on how to fairly share a payout among the players of a collaborative game.

Global model-agnostic explanation methods are used to explain the expected model behavior, i.e., how the model behaves on average for a given dataset. A useful distinction of global explanations are feature importance and feature effect. Feature importance ranks features based on how relevant they were for the prediction. One of the most popular importance measures is permutation feature importance, originated from random forests. Feature effect expresses how a change in a feature changes the predicted outcome.

There are many challenges in IML methods: 1) uncertainty quantification of the explanation, 2) causal interpretation for reflecting the true causal structure of its underlying phenomena, and 3) feature dependence.

References

Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl (2020). Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges. arXiv preprint arXiv:2010.09337.

50th Mile

Orthogonal Procrustes Problem

Ever heard of the Orthogonal Procrustes Problem (OPP)? It might sound complex, but the optimal solution can be achieved in just two simple steps:

Singular Value Decomposition (SVD)
Matrix Multiplication

That’s it! This elegant approach helps find the closest orthogonal matrix to a given one, minimizing the Frobenius norm.

📌 Key Takeaways:

OPP is a powerful tool for matrix alignment and optimization.
The solution is computationally efficient with SVD at its core.
Python makes implementation a breeze—just a few lines of code!

49th Mile

Amazon Deforestation

Deforestation in the Amazon: past, present and future visually analyzed the deforestation rates of recent years, identifying the main threats (e.g., cattle-raising activity, road network, and navigable rivers) in the present and pointing to measures needed to reverse this process. There are some basic data:

In 2001, the forest cover of the Amazon occupied over 600 million hectares.
Between 2001 and 2020, the deforestation in the Amazon totalled about 54.2 million hectares, the equivalent of 9% of the forest cover.

The Amazon could lose almost half of what it lost in the past two decades.

48th Mile

Convergence Rates

References

Rates of convergence

47th Mile

The Art of Linear Programming

Linear programming is an important technique for solving NP-hard problems such as Knapsack (e.g., packing problem), TSP, and Vertex Cover.

46th Mile

Semidefinite Programming

The basic definition of positive definite matrix is that: For any square matrix $\boldsymbol{A}\in\mathbb{R}^{n\times n}$ , if it always holds that

$\boldsymbol{x}^\top\boldsymbol{A}\boldsymbol{x}>0$

with any $\boldsymbol{x}\in\mathbb{R}^{n}$ not being a vector of zeros, then $\boldsymbol{A}\succ 0$ is a positive definite matrix. Similarly, we can define a positive semidefinite matrix as follows,

$\boldsymbol{A}\succeq 0 \Leftrightarrow \boldsymbol{x}^\top\boldsymbol{A}\boldsymbol{x}\geq 0,\forall \boldsymbol{x}\neq\boldsymbol{0}$

Semidefinite programming is the most exciting development of mathematical programming techniques in 1990s. One can leverage an initial point $\boldsymbol{u}_0$ , following by the governing equation $\boldsymbol{u}_{t+1}=\boldsymbol{A}\boldsymbol{u}_{t},t=0,1,2,\ldots$ , to build a dynamical system as shown below.

References

45th Mile

Physics-Informed Machine Learning

Reviewed machine learning algorithms:

Deterministic regression
- Linear regression
- Decision tree
- Machine learning algorithms: Sparse Identification of Nonlinear Dynamics (SINDy)
- Neural networks
Probabilistic regression (Gaussian process regression)
Ensemble methods
- Bagging (Random forest)
- Boosting (Gradient boosting machine, XGB regressor)

References

Navid Zobeiry. Physics-Informed Machine Learning: A nine-lecture series on Physics-Informed Machine Learning (PIML) delivered by Professor Navid Zobeiry. This course introduces the key techniques of PIML and demonstrates how integrating physics-based constraints with machine learning (ML) can help solve complex multi-physics challenges in engineering.

44th Mile

Sparse Linear Regression

For any vector $\boldsymbol{y}\in\mathbb{R}^{m}$ and matrix $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ , the sparse linear regression such that

$\begin{align} \min_{\boldsymbol{w}}\,&\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\,&\|\boldsymbol{w}\|_0\leq \tau \end{align}$

There might be two solutions:

1) Mixed-integer programming such that

$\begin{align} \min_{\boldsymbol{w},\boldsymbol{\beta}}\,&\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\,&\begin{cases} \|\boldsymbol{\beta}\|_1\leq \tau,\,\boldsymbol{\beta}\in\{0,1\}^{n}, \\ -M\cdot\boldsymbol{\beta}\leq\boldsymbol{w}\leq M\cdot\boldsymbol{\beta} \end{cases} \end{align}$

where $M$ is a sufficiently large constant.

2) Semidefinite programming such that

$\begin{align} \min_{\boldsymbol{X},\boldsymbol{w},\boldsymbol{\beta}}\,&\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\,&\begin{cases} \begin{bmatrix} 1 & \boldsymbol{w}^\top \\ \boldsymbol{w} & \boldsymbol{X} \\ \end{bmatrix} \succeq 0, \\ x_{i}\leq M^2\beta_{i},\forall i\in[n], \\ \|\boldsymbol{\beta}\|_1\leq \tau,\,\boldsymbol{\beta}\in\{0,1\}^{n}, \\ \end{cases} \end{align}$

MIP provides exact solution, but it scales poorly with $n$ . SDP has better scalability, but not exact in most cases.

43rd Mile

Matrix Calculus for Machine Learning and Beyond

42nd Mile

Causal Inference, Causal Discovery, and Machine Learning

Causal inference is a framework to answer causal questions from observational and/or experimental data. It is important to infer underlying mechanisms of data (e.g., climate system), learn correlation networks, and recognize patterns. Pearl’s causal inference framework assumes an underlying structural causal model with an associated acyclic graph.

There are two types of tasks in the causal inference framework: 1) Utilizing qualitative causal knowledge in the form of directed acyclic graphs; 2) Learning causal graphs based on general assumption.

References

Causal Inference, Causal Discovery, and Machine Learning with Jakob Runge. YouTube.

41st Mile

Composable Optimization for Robotic Motion Planning and Control

From a perspective of control as optimization, the objective could be what one wants system to do, given the model of one’s robot as constraints. This is exactly an optimal control problem such that

$\begin{aligned} \min_{x(t),u(t)}\,&J(x(t),u(t))=\int_{0}^{T}L(x(t),u(t))\,\mathrm{d}t \\ \text{s.t.}\,&\begin{cases} \dot{x}=f(x,u) \\ u_{\text{min}}\leq u\leq u_{\text{max}} \end{cases} \end{aligned}$

References

MIT Robotics - Zac Manchester - Composable Optimization for Robotic Motion Planning and Control. YouTube.

40th Mile

Cauchy-Schwarz Regularizers

The main idea is that Cauchy-Schwarz inequality $|\langle\boldsymbol{x},\boldsymbol{y}\rangle|\leq\|\boldsymbol{x}\|_2\|\boldsymbol{y}\|_2$ can be used to binarize neural networks. Cauchy-Schwarz regularizers are a new class of regularization that can promote discrete-valued vectors, eigenvectors of a given matrix, and orthogonal matrices. These regularizers are effective for quantizing neural network weights and solving underdetermined systems of linear equations.

References

Sueda Taner, Ziyi Wang, Christoph Studer (2025). Cauchy-Schwarz Regularizers. ICLR 2025.

39th Mile

Nystrom Truncation of Spectral Features

Parameterizing policy gradient. (Mercer’s Theorem) If $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a continuous, symmetric, and positive definite kernel, then there exists a sequence of non-negative eigenvalues $\{\lambda_i\}_{i=1}^{\infty}$ and corresponding orthonomal basis $\{\phi_i\}_{i=1}^{\infty}$ such that

$K(s,t)=\sum_{i=1}^{\infty}\lambda_i\phi_i(s)\phi_i(t),\,\forall (s,t)\in\mathcal{X}\times\mathcal{X}$

In practice, for large datasets, computing the full kernel matrix $K$ and its eigenvalue decomposition (EVD) is in $\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ memory.

The Nystrom method approximates the kernel matrix $K$ by selecting a subset. Let $K_1$ be the $m\times m$ kernel matrix for the subset and $K_2$ be the $n\times m$ kernel matrix between the full dataset and the subset, then the Nystrom approximation of $K$ is

$K\approx K_2K_1^{-1}K_2^\top$

Furthermore, let $K_1=U\Lambda_1U^\top$ be the EVD, then

$\tilde{\Phi}=K_2U\Lambda_1^{-1/2},\quad\tilde{\Lambda}=\Lambda_1$

corresponding to eigenvectors and eigenvalues. Selecting top- $k$ eigenvalues and eigenvectors, denoted by $\tilde{\Lambda}_k$ and $\tilde{\Phi}_k$ , respectively. The truncation approximation of the kernel matrix becomes

$K\approx\tilde{\Phi}_k\tilde{\Lambda}_k\tilde{\Phi}_k^\top$

References

Na Li (Harvard). Representation-based Learning and Control for Dynamical Systems. YouTube.

38th Mile

Sparse Dictionary Learning

Interpretable machine learning provides a data-driven framework for understanding complicated dynamical systems. One important perspective is sparsity to reinforce the interpretability of several state-of-the-art models. Sparse dictionary learning stems from sparse signal processing, which takes the form of linear regression with sparse parameters.

In a very recent study, researchers developed an interpretable and efficient reinforcement learning model for sparse dictionary learning. The essential idea is iteratively improving control and dynamics on the data $(\boldsymbol{X},\boldsymbol{U})$ where $\boldsymbol{X}$ is the velocity matrix and $\boldsymbol{U}$ could be angles. Reinforcement learning can train policies in the dictionary approximation. However, one of the challenges is how to address the overfitting issue. In the modeling process, uncertainty quantification is also meaningful to improve the robustness of the system control.

References

Nicholas Zolman, Urban Fasel, J. Nathan Kutz, Steven L. Brunton (2024). SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning. arXiv:2403.09110.

37th Mile

Cardinality Minimization, Constraints, and Regularization

This is a review paper for solving the optimization problem that involves the cardinality of variable vectors in constraints or objective function. The problems can be formulated as follows,

Cardinality minimization problems:

$\begin{aligned} \min_{\boldsymbol{x}}\,&\|\boldsymbol{x}\|_0 \\ \text{s.t.}\,&\boldsymbol{x}\in\mathcal{X}\subset\mathbb{R}^{n} \end{aligned}$

Cardinality-constrained problems:

$\begin{aligned} \min_{\boldsymbol{x}}\,&f(\boldsymbol{x}) \\ \text{s.t.}\,&\|\boldsymbol{x}\|_0\leq k,\quad\boldsymbol{x}\in\mathcal{X}\subset\mathbb{R}^{n} \end{aligned}$

Regularized cardinality problems:

$\begin{aligned} \min_{\boldsymbol{x}}\,&\|\boldsymbol{x}\|_0+\rho(\boldsymbol{x}) \\ \text{s.t.}\,&\boldsymbol{x}\in\mathcal{X}\subset\mathbb{R}^{n} \end{aligned}$

These optimization problems have broad applications such as signal and image processing, portfolio selection, and machine learning.

References

Andreas M. Tillmann, Daniel Bienstock, Andrea Lodi, Alexandra Schwartz (2024). Cardinality Minimization, Constraints, and Regularization: A Survey. SIAM Review, 66(3). [PDF]

36th Mile

Single-Factor Matrix Decomposition with Sparse Penalty

For any positive semidefinite matrix $\boldsymbol{Y}\in\mathbb{R}^{n\times n}$ , the optimization problem of rank-one single-factor matrix decomposition with sparse penalty can be formulated as follows,

$\min_{\boldsymbol{x}}\,\frac{1}{2}\|\boldsymbol{Y}-\boldsymbol{x}\boldsymbol{x}^{\top}\|_F^2+\frac{\lambda}{2}\|\boldsymbol{x}\|_1$

can be solved by the following algorithm:

Initialize $\boldsymbol{x}$ as the unit vector with equal entries;
Repeat
- Compute
$\boldsymbol{x}:=\mathcal{S}_{\lambda}(\boldsymbol{Y}\boldsymbol{x})/\|\mathcal{S}_{\lambda}(\boldsymbol{Y}\boldsymbol{x})\|_2$
Until convergence
Compute $d=\boldsymbol{x}^{\top}\boldsymbol{Y}\boldsymbol{x}$ (referring to the singular value);
Compute $\boldsymbol{x}:=\sqrt{d}\boldsymbol{x}$ .

In the algorithm, the soft-thresholding operator is defined as

$[\mathcal{S}_{\lambda}(\boldsymbol{x})]_{i}=\begin{cases} x_{i}-\lambda, & \text{if}\,x_{i}>t \\ x_{i}+\lambda, & \text{if}\,x_{i}<-t \\ 0, & \text{otherwise} \end{cases}$

for all $i\in\{1,2,\ldots,n\}$ .

References

(Deng et al., 2021) Correlation tensor decomposition and its application in spatial imaging data. Journal of the American Statistical Association. [DOI] (see Algorithm 2)
(Witten et al., 2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. [DOI]

35th Mile

Iterative Shrinkage Thresholding Algorithm (ISTA)

In machine learning, the closed-form solution to LASSO is defined upon the soft thresholding operator such that

$\mathcal{S}_{\lambda t}(\boldsymbol{\beta})=\arg\min_{\boldsymbol{z}}\,\frac{1}{2t}\|\boldsymbol{\beta}-\boldsymbol{z}\|_2^2+\lambda\|\boldsymbol{z}\|_1$

element-wise, we have

$[\mathcal{S}_{\lambda t}(\boldsymbol{\beta})]_{i}=\begin{cases} \beta_i-\lambda t, & \text{if}\, \beta_i>\lambda t \\ \beta_i+\lambda t, & \text{if}\, \beta_i<-\lambda t \\ 0, & \text{otherwise} \end{cases}$

for all $i\in\{1,2,\ldots, n\}$ .

Considering the optimization problem

$\min_{\boldsymbol{\beta}}\,\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|_2^2+\lambda\|\boldsymbol{\beta}\|_1$

The proximal gradient update can be written as follows,

$\boldsymbol{\beta}:=\mathcal{S}_{\lambda t}(\boldsymbol{\beta}+t\boldsymbol{X}^\top (\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}))$

where $t$ is the step size, and the gradient of the first component in the objective function is $-\boldsymbol{X}^\top(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta})$ .

References

Ryan Tibshirani. Proximal Gradient Descent (and Acceleration).
Xiaohan Chen, Jialin Liu, Zhangyang Wang, Wotao Yin (2021). Hyperparameter Tuning is All You Need for LISTA. NeurIPS 2021.

34th Mile

Learning Sparse Nonparametric Directed Acyclic Graphs (DAG)

DAG learning problem: Given a data matrix $\boldsymbol{X}\in\mathbb{R}^{n\times d}$ consisting of $n$ independent and identically distributed observations and $d$ column vectors $\{\boldsymbol{x}_{j}\}_{j=1}^{d}$ , one can learn the DAG $\mathcal{G}(\boldsymbol{X})$ that encodes the dependency between the variables in $\boldsymbol{X}$ . One approach is to learn $f=f(f_1,f_2,\cdots,f_d)$ such that $\mathcal{G}(f)=\mathcal{G}(\boldsymbol{X})$ using a well-designed score. Given a loss function $\ell(y,\hat{y})$ such as least squares or the negative log-likelihood, the optimization problem can be summarized as follows,

$\begin{aligned} \min_{f}\,&\frac{1}{n}\sum_{j=1}^{d}\ell(\boldsymbol{x}_j,f_j(\boldsymbol{X})) \\ \text{s.t.}\,&\mathcal{G}(f)\in\text{DAG} \end{aligned}$

Two challenges in this formulation:

How to enforce the acyclicity constraint that $\mathcal{G}(f)\in\text{DAG}$ ?
How to enforce sparsity in the learned DAG $\mathcal{G}(f)$ ?

If one uses MLP in the optimization, then it becomes

$\begin{aligned} \min_{\boldsymbol{\theta}}\,&\frac{1}{n}\sum_{j=1}^{d}\ell(\boldsymbol{x}_j,\text{MLP}(\boldsymbol{X};\boldsymbol{\theta}_j))+\lambda\|\boldsymbol{A}_{j}^{(1)}\|_{1,1} \\ \text{s.t.}\,&h(W(\boldsymbol{\theta}))=0 \end{aligned}$

where $\boldsymbol{\theta}=\{\boldsymbol{\theta}_{j}\}_{j=1}^{d}$ denotes all parameters and the parameters of the $j$ th MLP are $\boldsymbol{\theta}_j=(\boldsymbol{A}_{j}^{(1)},\boldsymbol{A}_{j}^{(2)},\cdots, \boldsymbol{A}_{j}^{(h)})$ .

References

Xun Zheng, Bryon Aragam, Pradeep Ravikumar, Eric P. Xing (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning. arXiv:1803.01422.
Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, Eric P. Xing (2019). Learning Sparse Nonparametric DAGs. arXiv:1909.13189.
Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, Vasilis Syrgkanis (2024). Applied Causal Inference Powered by ML and AI. arXiv:2403.02467. (See Chapter 7)

33rd Mile

Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization

Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy, see Brunton et al., (2016)) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. The optimization problem for learning system equations is

$\begin{aligned} \min_{\boldsymbol{\xi},\boldsymbol{z}}\,&\|\dot{\boldsymbol{X}}_{j}-\boldsymbol{\Theta}(\boldsymbol{X})\boldsymbol{\xi}\|_2^2+\lambda\|\boldsymbol{\xi}\|_2^2 \\ \text{s.t.}\,&\begin{cases} M_i^{\ell}z_{i}\leq \xi_{i}\leq M_i^{u}z_{i} \\ \displaystyle\sum_{i=1}^{D}z_{j}\leq k_j,\,z_{i}\in\{0,1\} \end{cases} \end{aligned}$

where $M_i^{\ell},M_{i}^{u}$ are lower and upper bounds on the coefficients. The full system dynamics are $\boldsymbol{\xi}$ .

References

Bertsimas, D., & Gurnee, W. (2023). Learning sparse nonlinear dynamics via mixed-integer optimization. Nonlinear Dynamics, 111(7), 6585-6604.

32nd Mile

31st Mile

$t$ -Statistic & Student $t$ -Distribution

Given population mean $\mu$ , suppose the sample mean $\bar{x}$ , sample standard deviation $s$ , and sample size $n$ (small value), the formula of $t$ -statistic for small sample sizes is written as follows,

$t=\frac{\bar{x}-\mu}{s/\sqrt{n}}$

A high absolute value of $t$ suggests a statistically significant difference. $H_0$ is the null hypothesis, namely, the population mean is $\mu$ . Below is the student $t$ -distribution with a $95%$ confidence interval.

Please check out the details of the relevance of $t$ -statistics for small sample sizes as the teaching sample.

Supporting Materials

30th Mile

Interpretable ML vs. Explainable ML

In the context of AI, there is a subtle difference between terms interpretability and explainability. The interpretability techniques such as sparse linear regression were used to “understand how the underlying AI technology works”, while the explainability refer to “the level of understanding how AI-based systems produce with a given result”. Main claim from Wikipedia is that “treating the model as a black box and analyzing how marginal changes to the inputs affect the result sometimes provides a sufficient explanation.”

References

Explainable Artificial Intelligence on Wikipedia.
W. James Murdoch, Chandan Singh, Karl Kumbier, and Bin Yu (2019). Definitions, methods, and applications in interpretable machine learning. PNAS.
Christoph Molnar (2024). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.

29th Mile

INFORMS 2024 | Optimal k-Sparse Ridge Regression

The classical linear regression with a $\ell_0$ -norm induced sparsity penalty can be written as follows,

$\begin{aligned} \min_{\boldsymbol{\beta}}\,&\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|_2^2+\lambda\|\boldsymbol{\beta}\|_2^2 \\ \text{s.t.}\,&\|\boldsymbol{\beta}\|_0\leq k \end{aligned}$

which is equivalent to

$\begin{aligned} \min_{\boldsymbol{\beta},\boldsymbol{z}}\,&\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta}\|_2^2+\lambda\|\boldsymbol{\beta}\|_2^2 \\ \text{s.t.}\,&\begin{cases} (1-z_j)\beta_{j}=0\quad\text{or}\quad \underbrace{-Mz_{j}\leq \beta_{j}\leq Mz_{j}}_{\text{\color{blue}lower/upper bounds}} \\ \displaystyle\sum_{j=1}^{p}z_{j}\leq k,\,z_{j}\in\{0,1\} \end{cases} \end{aligned}$

References

Jiachang Liu, Sam Rosen, Chudi Zhong, Cynthia Rudin (2023). OKRidge: Scalable Optimal k-Sparse Ridge Regression. NeurIPS 2023.

28th Mile

Mixed Integer Linear Programming (Example)

Mixed Integer Linear Programming with Python

import cvxpy as cp
import numpy as np

# Data
n, d, k = 100, 50, 3  # n: samples, d: features, k: sparsity level
X = np.random.randn(n, d)
y = np.random.randn(n)
M = 1  # Large constant for enforcing non-zero constraint

# Variables
beta = cp.Variable(d, nonneg=True)
z = cp.Variable(d, boolean=True)

# Constraints
constraints = [
    cp.sum(z) <= k,
    beta <= M * z,
    beta >= 0
]

# Objective
objective = cp.Minimize(cp.sum_squares(y - X @ beta))

# Problem
problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.GUROBI)  # Ensure to use a solver that supports MIP

# Solution
print("Optimal beta:", beta.value)
print("Active indices:", np.nonzero(z.value > 0.5)[0])

Note that the “Model too large for size-limited Gurobi license” error.

27th Mile

Importance of Sparsity in Interpretable Machine Learning

Sparsity is an important type of model-based interpretability methods. Typically, the practitioner can impose sparsity on the model by limiting the number of nonzero parameters. When the number of nonzero parameters is sufficiently small, a practitioner can interpret the variables corresponding to those parameters as being meaningfully related to the outcome in question and can also interpret the magnitude and direction of the parameters. Two important methods including $\ell_1$ -norm penalty (e.g., LASSO) and $\ell_0$ -norm constraint (e.g., OMP). Model sparsity is often useful for high-dimensional problems, where the goal is to identify key features for further analysis.

References

W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu (2019). Definitions, methods, and applications in interpretable machine learning. PNAS.

26th Mile

INFORMS 2024 | Core Tensor Shape Optimization

Recall that the sum of squared singular values of $\boldsymbol{X}_{t}\in\mathbb{R}^{N\times D},\,t=1,2,\ldots,T$ and outcomes $\boldsymbol{X}$ is

$\sum_{i=1}^{I}(\sigma_{i})^2=\|\boldsymbol{X}\|_F^2$

because Frobenius norm is invariant under orthogonal transformations with respect to singular vectors.

This means that we can solve a singular value packing problem instead of considering the complement of the surrogate loss. Please reproduce the aforementioned property as follows,

import numpy as np

X = np.random.rand(100, 100)
print(np.linalg.norm(X, 'fro') ** 2)
u, s, v = np.linalg.svd(X, full_matrices = False)
print(np.linalg.norm(s, 2) ** 2)

Thus, Tucker packing problem on the non-increasing sequences $\boldsymbol{a}^{(n)}\in\mathbb{R}_{\geq 0}^{I_n}$ (w.r.t. singular values), the optimization problem is given by

$\begin{aligned} \min_{\{R_n\}_{n=1}^{N}}\,&\underbrace{\sum_{n=1}^{N}\sum_{i_n=1}^{R_n}a_{i_n}^{(n)}}_{\text{\color{blue}sum of singular values}} \\ \text{s.t.}\,&\underbrace{\prod_{n=1}^{N}R_n}_{\text{\color{blue}core tensor shape}}+\underbrace{\sum_{n=1}^{N}I_nR_n}_{\text{\color{blue}matrix shapes}}\leq \underbrace{c}_{\text{\color{blue}weight}} \end{aligned}$

The optimization problem can be implemented by using an integer programming solvers, and its solution quality is competitive with the greedy algorithm.

References

Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni (2023). Approximately Optimal Core Shapes for Tensor Decompositions. ICML 2023. [Python code]

25th Mile

Mobile Service Usage Data

Orlando E. Martínez-Durive et al. (2023). The NetMob23 Dataset: A High-resolution Multi-region Service-level Mobile Data Traffic Cartography. arXiv:2305.06933.
André Zanella (2024). Characterizing Large-Scale Mobile Traffic Measurements for Urban, Social and Networks Sciences. PhD thesis.

24th Mile

Optimization in Reinforcement Learning

References

Jalaj Bhandari and Daniel Russo (2024). Global Optimality Guarantees for Policy Gradient Methods. Operations Research, 72(5): 1906 - 1927.
Lucia Falconi, Andrea Martinelli, and John Lygeros (2024). Data-driven optimal control via linear programming: boundedness guarantees. IEEE Transactions on Automatic Control.

23rd Mile

Sparse and Time-Varying Regression

This work addresses a time series regression problem for features $\boldsymbol{X}_{t}\in\mathbb{R}^{N\times D},\,t=1,2,\ldots,T$ and outcomes $\boldsymbol{y}_t\in\mathbb{R}^{N},\,t=1,2,\ldots,T$ , taking the following expression:

$\boldsymbol{y}_t\approx\boldsymbol{X}_t\boldsymbol{\beta}_t$

where $\boldsymbol{\beta}_t\in\mathbb{R}^{D},\,t=1,2,\ldots,T$ are coefficient vectors, which are supposed to represent both sparsity and time-varying behaviors of the system. Thus, the optimization problem has both temporal smoothing (in the objective) and sparsity (in the constraint), e.g.,

$\begin{aligned} \min_{\boldsymbol{\beta}_1,\boldsymbol{\beta}_2,\ldots,\boldsymbol{\beta}_T}\,&\sum_{t=1}^{T}\bigl(\|\boldsymbol{y}_t-\boldsymbol{X}_t\boldsymbol{\beta}_t\|_2^2+\lambda_{\beta}\|\boldsymbol{\beta}_t\|_2^2\bigr)+\lambda_{\delta}\sum_{(s,t)\in E}\|\boldsymbol{\beta}_t-\boldsymbol{\beta}_s\|_2^2 \\ \text{s.t.}\,&\begin{cases} |\text{supp}(\boldsymbol{\beta}_t)|\leq K_L,\,\forall t \\ \bigl|\bigcup\limits_{t=1}^{T}\text{supp}(\boldsymbol{\beta}_t)\bigr|\leq K_G \\ \sum_{(s,t)\in E}|\text{supp}(\boldsymbol{\beta}_t)\Delta\text{supp}(\boldsymbol{\beta}_s)|\leq K_C \end{cases} \end{aligned}$

where the constraint is indeed the $\ell_0$ -norm of vectors, as the symbol $\text{supp}(\cdot)$ denotes the index set of nonzero entries in the vector. For instance, the first constraint can be rewritten as $|\text{Supp}(\boldsymbol{\beta}_t)|=\|\boldsymbol{\beta}_t\|_0\leq K_L$ . Thus, $K_L,K_G,K_C\in\mathbb{Z}^{+}$ stand for sparsity levels.

The methodological contribution is reformulating this problem as a binary convex optimization problem (w/ a novel relaxation of the objective function), which can be solved efficiently using a cutting plane-type algorithm.

References

Dimitris Bertsimas, Vassilis Digalakis, Michael Lingzhi Li, Omar Skali Lami (2024). Slowly Varying Regression Under Sparsity. Operations Research. [arXiv]

22nd Mile

Revisiting $\ell_1$ -Norm Minimization

References

Lijun Ding (2023). One dimensional least absolute deviation problem. Blog post.
Gregory Gundersen (2022). Weighted Least Squares. Blog post.
stephentu’s blog (2014). Subdifferential of a norm. Blog post.

21st Mile

Research Seminars

Computational Research in Boston and Beyond (CRIBB) seminar series: A forum for interactions among scientists and engineers throughout the Boston area working on a range of computational problems. This forum consists of a monthly seminar where individuals present their work.
Param-Intelligence (𝝅) seminar series: A dynamic platform for researchers, engineers, and students to explore and discuss the latest advancements in integrating machine learning with scientific computing. Key topics include data-driven modeling, physics-informed neural surrogates, neural operators, and hybrid computational methods, with a strong focus on real-world applications across various fields of computational science and engineering.

20th Mile

Robust, Interpretable Statistical Models: Sparse Regression with the LASSO

First of all, we revisit the classical least squares such that

$\min_{\boldsymbol{x}}\,\frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\|_2^2$

Putting the Tikhonov regularization together with least squares, it refers to as the Ridge regression used almost everywhere:

$\min_{\boldsymbol{x}}\,\frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\|_2^2+\alpha\|\boldsymbol{x}\|_2^2$

Another classical variant is the LASSO:

$\min_{\boldsymbol{x}}\,\frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\|_2^2+\lambda\|\boldsymbol{x}\|_1$

with $\ell_1$ -norm on the vector $\boldsymbol{x}$ . It allows one to find a few columns of the matrix $\boldsymbol{A}$ that are most correlated with the designed outcomes (e.g., $\boldsymbol{b}$ ) for making decisions (e.g., why they take actions?).

One interesting application is using sparsity-promoting techniques and machine learning with nonlinear dynamical systems to discover governing equations from noisy measurement data. The only as-sumption about the structure of the model is that there are only a fewimportant terms that govern the dynamics, so that the equations aresparse in the space of possible functions; this assumption holds formany physical systems in an appropriate basis.

References

Steve Brunton (2021). Robust, Interpretable Statistical Models: Sparse Regression with the LASSO. see YouTube. (Note: Original paper by Tibshirani (1996))
Steve L. Brunton, Joshua L. Proctor, and Nathan Kutz (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences. 113 (15), 3932-3937.

19th Mile

Causal Inference for Geosciences

Learning causal interactions from time series of complex dynamical systems is of great significance in real-world systems. But the questions arise as: 1) How to formulate causal inference for complex dynamical systems? 2) How to detect causal links? 3) How to quantify causal interactions?

References

Jakob Runge (2017). Causal inference and complex network methods for the geosciences. Slides.
Jakob Runge, Andreas Gerhardus, Gherardo Varando, Veronika Eyring & Gustau Camps-Valls (2023). Causal inference for time series. Nature Reviews Earth & Environment, 4: 487–505.
Jitkomut Songsiri (2013). Sparse autoregressive model estimation for learning Granger causality in time series. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

18th Mile

Tensor Factorization for Knowledge Graph Completion

Knowledge graph completion is a kind of link prediction problems, inferring missing “facts” based on existing ones. Tucker decomposition of the binary tensor representation of knowledge graph triples allows one to make data completion.

References

TuckER: Tensor Factorization for Knowledge Graph Completion. GitHub.
Ivana Balazevic, Carl Allen, Timothy M. Hospedales (2019). TuckER: Tensor Factorization for Knowledge Graph Completion. arXiv:1901.09590. [PDF]

17th Mile

RESCAL: Tensor-Based Relational Learning

Multi-relational data is everywhere in real-world applications such as computational biology, social networks, and semantic web. This type of data is often represented in the form of graphs or networks where nodes represent entities, and edges represent different types of relationships.

Instead of using the classical Tucker and CP tensor decomposition, RESCAL takes the inherent structure of dyadic relational data into account, whose tensor factorization on the tensor variable $\boldsymbol{\mathcal{X}}:=\{\boldsymbol{X}_k\}_k$ (i.e., frontal tensor slices) is

$\boldsymbol{X}_k=\boldsymbol{A}\boldsymbol{S}_{k}\boldsymbol{A}^\top$

where $\boldsymbol{A}\in\mathbb{R}^{n\times r}$ is the global entity factor matrix, and $\boldsymbol{S}_{k}\in\mathbb{R}^{r\times r},\forall k$ specifies the interaction of the latent components. Such kind of methods can be used to solve link prediction, collective classification, and link-based clustering.

References

Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel (2011). A Three-Way Model for Collective Learning on Multi-Relational Data. ICML 2011. [PDF] [Slides]
Maximilian Nickel (2013). Tensor Factorization for Relational Learning. PhD thesis.
Denis Krompaß, Maximilian Nickel, Xueyan Jiang, Volker Tresp (2013). Non-Negative Tensor Factorization with RESCAL. ECML Workshop 2013.
Elynn Y. Chen, Rong Chen (2019). Modeling Dynamic Transport Network with Matrix Factor Models: with an Application to International Trade Flow. arXiv:1901.00769. [PDF]
Zhanhong Cheng. factor_matrix_time_series. GitHub.

16th Mile

Subspace Pursuit Algorithm

Considering the optimization problem for estimating $K$ -sparse vector $\boldsymbol{x}\in\mathbb{R}^{n}$ :

$\begin{aligned} \min_{\boldsymbol{x}}\,&\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{x}\|_2^2 \\ \text{s.t.}\,& \|\boldsymbol{x}\|_0\leq K,\,K\in\mathbb{Z}^{+} \end{aligned}$

with the signal vector $\boldsymbol{y}\in\mathbb{R}^{m}$ (or measurement vector), the dictionary matrix $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ (or measurement matrix), and the sparsity level $K\in\mathbb{Z}^{+}$ .

The subspace pursuit algorithm, introduced by W. Dai and O. Milenkovic in 2008, is a classical algorithm in the greedy family. It bears some resemblance with compressive sampling matching pursuit (CoSaMP by D. Needell and J. A. Tropp in 2008), except that, instead of $2K$ , only $K$ indices of largest (in modulus) entries of the residual vector are selected, and that an additional orthogonal projection step is performed at each iteration. The implementation of subspace pursuit algorithm (adapted from A Mathematical Introduction to Compressive Sensing, see Page 65) can be summarized as follows:

Input: Signal vector $\boldsymbol{y}\in\mathbb{R}^{m}$ , dictionary matrix $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ , and sparsity level $K\in\mathbb{Z}^{+}$ .
Output: $K$ -sparse vector $\boldsymbol{x}\in\mathbb{R}^{n}$ and index set $S$ .
Initialization: Sparse vector $\boldsymbol{x}=\boldsymbol{0}$ (i.e., zero vector), index set $S=\emptyset$ (i.e., empty set), and error vector $\boldsymbol{r}=\boldsymbol{y}$ .
while not stop do
- Find $\ell$ as the index set of the $K$ largest entries of $|\boldsymbol{A}^\top\boldsymbol{r}|$ .
- $S:=S\cup\ell$ .
- $\boldsymbol{x}_S:=\boldsymbol{A}_S^{\dagger}\boldsymbol{y}$ (least squares).
- Find $S$ as the index set of the $K$ largest entries of $|\boldsymbol{x}|$ .
- $\boldsymbol{x}_S:=\boldsymbol{A}_S^{\dagger}\boldsymbol{y}$ (least squares again!).
- Set $x_i=0$ for all $i\notin S$ .
- $\boldsymbol{r}=\boldsymbol{y}-\boldsymbol{A}_S\boldsymbol{x}_S$ .
end

The subspace pursuit algorithm is a fixed-cardinality method, quite different from the classical orthogonal matching pursuit algorithm developed in 1993 such that

Input: Signal vector $\boldsymbol{y}\in\mathbb{R}^{m}$ , dictionary matrix $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ , and sparsity level $K\in\mathbb{Z}^{+}$ .
Output: $K$ -sparse vector $\boldsymbol{x}\in\mathbb{R}^{n}$ and index set $S$ .
Initialization: Sparse vector $\boldsymbol{x}=\boldsymbol{0}$ (i.e., zero vector), index set $S=\emptyset$ (i.e., empty set), and error vector $\boldsymbol{r}=\boldsymbol{y}$ .
while not stop do
- Find $\ell$ as the index set of the largest entry of $|\boldsymbol{A}^\top\boldsymbol{r}|$ , while $\ell\notin S$ .
- $S:=S\cup\ell$ .
- $\boldsymbol{x}_S:=\boldsymbol{A}_S^{\dagger}\boldsymbol{y}$ (least squares).
- $\boldsymbol{r}=\boldsymbol{y}-\boldsymbol{A}_S\boldsymbol{x}_S$ .
end

15th Mile

Synthetic Sweden Mobility

The Synthetic Sweden Mobility (SySMo) model provides a simplified yet statistically realistic microscopic representation of the real population of Sweden. The agents in this synthetic population contain socioeconomic attributes, household characteristics, and corresponding activity plans for an average weekday. This agent-based modelling approach derives the transportation demand from the agents’ planned activities using various transport modes (e.g., car, public transport, bike, and walking). The dataset is available on Zenodo.

Going back to the individual mobility trajectory, there would be some opportunities to approach taxi trajectory data such as

1 million+ trips collected by 13,000+ taxi cabs during 5 days in Harbin, China
Daily GPS trajectory data of 664 electric taxis in Shenzhen, China
Takahiro Yabe, Kota Tsubouchi, Toru Shimizu, Yoshihide Sekimoto, Kaoru Sezaki, Esteban Moro & Alex Pentland (2024). YJMob100K: City-scale and longitudinal dataset of anonymized human mobility trajectories. Scientific Data. [Data] [Challenge]
The dataset comprises trajectory data of traffic participants, along with traffic light data, current local weather data, and air quality data from the Application Platform Intelligent Mobility (AIM) Research Intersection

14th Mile

Prediction on Extreme Floods

AI increases global access to reliable flood forecasts (see dataset).

Another weather forecasting dataset for consideration: Rain forecasts world-wide on an expansive data set with over a magnitude more hi-res rain radar data.

References

Nearing, G., Cohen, D., Dube, V. et al. (2024). Global prediction of extreme floods in ungauged watersheds. Nature, 627: 559–563.

13th Mile

Sparse Recovery Problem

Considering a general optimization problem for estimating the sparse vector $\boldsymbol{x}\in\mathbb{R}^{n}$ :

$\begin{aligned} \min_{\boldsymbol{x}}\,&\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{x}\|_2^2 \\ \text{s.t.}\,& \begin{cases} \boldsymbol{x}\geq 0 \\ \displaystyle\|\boldsymbol{x}\|_0\leq K,\,K\in\mathbb{Z}^{+} \end{cases} \end{aligned}$

with the signal vector $\boldsymbol{y}\in\mathbb{R}^{m}$ and a dictionary of elementary functions $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ (i.e., dictionary matrix). There are a lot of solution algorithms in literature:

Mehrdad Yaghoobi, Di Wu, Mike E. Davies (2015). Fast Non-Negative Orthogonal Matching Pursuit. IEEE Signal Processing Letters, 22 (9): 1229-1233.
Thanh Thi Nguyen, Jérôme Idier, Charles Soussen, El-Hadi Djermoune (2019). Non-Negative Orthogonal Greedy Algorithms. IEEE Transactions on Signal Processing, 67 (21): 5643-5658.
Nicolas Nadisic, Arnaud Vandaele, Nicolas Gillis, Jeremy E. Cohen (2020). Exact Sparse Nonnegative Least Squares. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Chiara Ravazzi, Francesco Bullo, Fabrizio Dabbene (2022). Unveiling Oligarchy in Influence Networks From Partial Information. IEEE Transactions on Control of Network Systems, 10 (3): 1279-1290.
Thi Thanh Nguyen (2019). Orthogonal greedy algorithms for non-negative sparse reconstruction. PhD thesis.

The most classical (greedy) method for solving the linear sparse regression is orthogonal matching pursuit (see an introduction here).

12th Mile

Economic Complexity

References

César A. Hidalgo (2021). Economic complexity theory and applications. Nature Reviews Physics. 3: 92-113.

11th Mile

Time-Varying Autoregressive Models

Vector autoregression (VAR) has a key assumption that the coeffcients are invariant across time (i.e., time-invariant), but it is not always true when accounting for psychological phenomena such as the phase transition from a healthy to unhealthy state (or vice versa). Consequently, time-varying vector autoregressive models are of great significance for capturing the parameter changes in response to interventions. From the statistical perspective, there are two types of lagged effects between pairs of variables: autocorrelations (e.g., $\boldsymbol{x}_{t}\to\boldsymbol{x}_{t+1}$ ) and cross-lagged effects (e.g., $\boldsymbol{x}_{t}\to\boldsymbol{y}_{t+1}$ ). The time-varying autoregressive models can be solved by using generalized additive model and kernel smoothing estimation.

References

Haslbeck, J. M., Bringmann, L. F., & Waldorp, L. J. (2021). A tutorial on estimating time-varying vector autoregressive models. Multivariate Behavioral Research, 56(1), 120-149.

10th Mile

Higher-Order Graph & Hypergraph

The concept of a higher-order graph extends the traditional notion of a graph, which consists of nodes and edges, to capture more complex relationships and structures in data. A common formalism for representing higher-order graphs is through hypergraphs, which generalize the concept of a graph to allow for hyperedges connecting multiple nodes. In a hypergraph, each hyperedge connects a subset of nodes, forming higher-order relationships among them.

References

Higher-order organization of complex networks. Stanford University.
Quintino Francesco Lotito, Federico Musciotto, Alberto Montresor, Federico Battiston (2022). Higher-order motif analysis in hypergraphs. Communications Physics, volume 5, Article number: 79.
Christian Bick, Elizabeth Gross, Heather A. Harrington, and Michael T. Schaub (2023). What are higher-order networks? SIAM Review. 65(3).
Vincent Thibeault, Antoine Allard & Patrick Desrosiers (2024). The low-rank hypothesis of complex systems. Nature Physics. 20: 294-302.
Louis Boucherie, Benjamin F. Maier, Sune Lehmann (2024). Decomposing geographical and universal aspects of human mobility. arXiv:2405.08746.
Raissa M. D’Souza, Mario di Bernardo & Yang-Yu Liu (2023). Controlling complex networks with complex nodes. Nature Reviews Physics. 5: 250–262.
PS Chodrow, N Veldt, AR Benson (2021). Generative hypergraph clustering: From blockmodels to modularity. Science Advances, 28(7).

9th Mile

Eigenvalues of Directed Cycles

The graph signal processing possesses an interesting property of directed cycle (see Figure 2 in the literature). The adjacency matrix of a directed cycle has a set of unit eigenvalues as follows.

import numpy as np

## Construct an adjacency matrix A
a = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = a.shape[0]
A = np.zeros((n, n))
A[:, 0] = a
for i in range(1, n):
    A[:, i] = np.append(a[-i :], a[: -i])

## Perform eigenvalue decomposition on A
eig_val, eig_vec = np.linalg.eig(A)

## Plot eigenvalues
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Helvetica'

fig = plt.figure(figsize = (3, 3))
ax = fig.add_subplot(1, 1, 1)
circ = plt.Circle((0, 0), radius = 1, edgecolor = 'b', facecolor = 'None', linewidth = 2)
ax.add_patch(circ)
plt.plot(eig_val.real, eig_val.imag, 'rx', markersize = 8)
ax.set_aspect('equal', adjustable = 'box')
plt.xlabel('Re')
plt.ylabel('Im')
plt.show()
fig.savefig('eigenvalues_directed_cycle.png', bbox_inches = 'tight')

8th Mile

Graph Filter

Defining graph-aware operator plays an important role for characterizing a signal $\boldsymbol{x}\in\mathbb{R}^{N}$ with $N$ vertices over a graph $\mathcal{G}$ . One simple idea is introducing the adjacency matrix $\boldsymbol{A}$ so that the operation is $\boldsymbol{A}\boldsymbol{x}$ . In that case, $\boldsymbol{A}$ is a simple operator that accounts for the local connectivity of $\mathcal{G}$ . One example is using the classical unit delay (seems to be time-shift) such that

$\boldsymbol{A}=\begin{bmatrix} 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}\in\mathbb{R}^{N\times N}$

The simplest signal operation as multiplication by the adjacency matrix $\boldsymbol{A}$ defines graph filters as matrix polynomials of the form

$p(\boldsymbol{A})=p_0\boldsymbol{I}_N+p_1\boldsymbol{A}+\cdots+p_{N-1}\boldsymbol{A}^{N-1}$

For instance, we have

$\boldsymbol{A}=\begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}\quad \boldsymbol{A}^2=\begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$

On the signal $\boldsymbol{x}=(x_1,x_2,x_3,x_4,x_5)^\top$ , it always holds that

$\boldsymbol{A}\boldsymbol{x}=\underbrace{(x_5,x_1,x_2,x_3,x_4)^\top}_{\text{\color{red}one-hop neighbors}}\quad \boldsymbol{A}^2\boldsymbol{x}=\underbrace{(x_4,x_5,x_1,x_2,x_3)^\top}_{\text{\color{red}two-hop neighbors}}$

When applying the polynomial filter to a graph signal $\boldsymbol{x}\in\mathbb{R}^{N}$ , the operation $\boldsymbol{A}\boldsymbol{x}$ takes a local linear combination of the signal values at one-hop neighbors. $\boldsymbol{A}^2\boldsymbol{x}$ takes a local linear combination of $\boldsymbol{A}\boldsymbol{x}$ , referring to two-hop neighbors. Consequently, a graph filter $p(\boldsymbol{A})$ of order $N-1$ represents the mixing values that are at most $N-1$ hops away.

References

Geert Leus, Antonio G. Marques, José M. F. Moura, Antonio Ortega, David I Shuman (2023). Graph Signal Processing: History, Development, Impact, and Outlook. arXiv:2303.12211.
A Sandryhaila, JMF Moura (2013). Discrete signal processing on graphs: Graph filters. Section 3: Graph Filters.
Henry Kenlay, Dorina Thanou, Xiaowen Dong (2020). On The Stability of Polynomial Spectral Graph Filters. ICASSP 2020.
Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard (2019). Learning Graphs From Data: A signal representation perspective. IEEE Signal Processing Magazine.
Eylem Tugçe Güneyi, Berkay Yaldız, Abdullah Canbolat, and Elif Vural (2024). Learning Graph ARMA Processes From Time-Vertex Spectra. IEEE Transactions on Signal Processing, 72: 47 - 56.

7th Mile

Graph Signals

For any graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ where $\mathcal{V}=\{1,2,\ldots,N\}$ is a finite set of $N$ vertices, and $\mathcal{E} \subseteq \mathcal{V}\times\mathcal{V}$ is the set of edges. Graph signals can be formally represented as vectors $\boldsymbol{x}\in\mathbb{R}^{N}$ where $x_{n}$ (or say $\boldsymbol{x}(n)$ in the following) stores the signal value at the $n$ th vertex in $\mathcal{V}$ . The graph Fourier transform of $\boldsymbol{x}$ is element-wise defined as follows,

$\hat{\boldsymbol{x}}(k)=\langle\boldsymbol{x},\boldsymbol{\psi}_k\rangle=\sum_{n=1}^{N}\boldsymbol{x}(n)\boldsymbol{\psi}_{k}^{*}(n)$

or another form such that

$\hat{\boldsymbol{x}}=\boldsymbol{\Psi}^{H}\boldsymbol{x}$

where $\boldsymbol{\Psi}$ consists of the eigenvectors $\boldsymbol{\psi}_k,\,k=1,2,\ldots,N$ . The notation $\cdot^{*}$ is the conjugate of complex values, and $\cdot^{H}$ is the conjugate transpose.

The above graph Fourier transform can also be generalized to the graph signals in the form of multivariate time series. For instance, on the data $\boldsymbol{X}\in\mathbb{R}^{N\times T}$ , we have

$\hat{\boldsymbol{X}}=\boldsymbol{\Psi}^{H}\boldsymbol{X}$

where $\boldsymbol{\Psi}$ consists of the eigenvectors of the graph Laplacian matrix.

References

Santiago Segarra, Weiyu Huang, and Alejandro Ribeiro (2020). Signal Processing on Graphs.
Matthew Begue. Fourier analysis on graphs. Slides.

6th Mile

Graph Signal Processing

Graph signal processing not only focuses on the graph typology (e.g., connection between nodes), but also covers the quantity of nodes (i.e., graph signals) with weighted adjacency information.

References

Antonio Ortega, Pascal Frossard, Jelena Kovacevic, Jose M. F. Moura, Pierre Vandergheynst (2017). Graph Signal Processing: Overview, Challenges and Applications. arXiv:1712.00468.
Gonzalo Mateos, Santiago Segarra, Antonio G. Marques, Alejandro Ribeiro (2019). Connecting the Dots: Identifying Network Structure via Graph Signal Processing. IEEE Signal Processing Magazine. 36 (3): 16-43.
Xiaowen Dong, Dorina Thanou, Laura Toni, Michael Bronstein, and Pascal Frossard (2020). Graph signal processing for machine learning: A review and new perspectives. arXiv:2007.16061. [Slides]
Michael M. Bronstein, Joan Bruna, Taco Cohen, Petar Veličković (2021). Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv:2104.13478.
Wei Hu, Jiahao Pang, Xianming Liu, Dong Tian, Chia-Wen Lin, Anthony Vetro (2022). Graph Signal Processing for Geometric Data and Beyond: Theory and Applications. IEEE Transactions on Multimedia, 24: 3961-3977.
Geert Leus, Antonio G. Marques, José M. F. Moura, Antonio Ortega, David I Shuman (2023). Graph Signal Processing: History, Development, Impact, and Outlook. arXiv:2303.12211.
Spectral graph theory for dummies. YouTube.

5th Mile

Clifford Product

In Grassmann algebra, the inner product between two vectors $\vec{x}=x_1\vec{e}_1+x_2\vec{e}_2$ and $\vec{y}=y_1\vec{e}_1+y_2\vec{e}_2$ (w/ basis vectors $\vec{e}_1$ and $\vec{e}_2$ ) is given by

$\langle\vec{x},\vec{y}\rangle=\|\vec{x}\|_2 \|\vec{y}\|_2 \cos\theta$

implies to be the multiplication between the magnitude of $\vec{x}$ and the projection of $\vec{y}$ on $\vec{x}$ . Here, the notation $\|\cdot\|_2$ refers to the $\ell_2$ norm, or say the magnitude. $\theta$ is the angle between $\vec{x}$ and $\vec{y}$ in the plane containing them.

In contrast, the outer product (usually called Wedge product) is

$\vec{x}\wedge\vec{y}=\underbrace{(\vec{e}_1\wedge\vec{e}_2)}_{\text{\color{red}orientation}}\underbrace{\|\vec{x}\|_2 \|\vec{y}\|_2 \sin\theta}_{\text{\color{red}area/determinant}}$

implies to be the multiplication between $\vec{x}$ and the projection of $\vec{y}$ on the orthogonal direction of $\vec{x}$ . Here, the unit bivector $\vec{e}_1\wedge\vec{e}_2$ represents the orientation ( $+1$ or $-1$ ) of the hyperplane of $\vec{x}\wedge\vec{y}$ (see Section II in geometric-algebra adaptive filters).

As a result, they consist of Clifford product (or called geometric product, denoted by the symbol $\cdot$ ) such that

$\begin{aligned} \vec{x}\cdot\vec{y}=&\langle\vec{x},\vec{y}\rangle+\vec{x}\wedge\vec{y} \\ =&\|\vec{x}\|_2\|\vec{y}\|_2(\cos\theta +(\vec{e}_1\wedge\vec{e}_2)\sin\theta) \\ =&\|\vec{x}\|_2\|\vec{y}\|_2e^{(\vec{e}_1\wedge\vec{e}_2)\theta} \end{aligned}$

In particular, Clifford algebra is important for modeling vector fields, thus demonstrating valuable applications to wind velocity and fluid dynamics (e.g., Navier-Stokes equation).

References

Spinors for Beginners 11: What is a Clifford Algebra? (and Geometric, Grassmann, Exterior Algebras). YouTube.
A Swift Introduction to Geometric Algebra. YouTube.
Learning on Graphs & Geometry. Weekly reading groups every Monday at 11 am ET.
What’s the Clifford algebra? Mathematics stackexchange.
Introducing CliffordLayers: Neural Network layers inspired by Clifford / Geometric Algebras. Microsoft Research AI4Science.
David Ruhe, Jayesh K. Gupta, Steven de Keninck, Max Welling, Johannes Brandstetter (2023). Geometric Clifford Algebra Networks. arXiv:2302.06594.
Maksim Zhdanov, David Ruhe, Maurice Weiler, Ana Lucic, Johannes Brandstetter, Patrick Forre (2024). Clifford-Steerable Convolutional Neural Networks. arXiv:2402.14730.

4th Mile

Bayesian Variable Selection

In genetic fine mapping, one critical problem is the variable selection in linear regression. There is a Bayesian variable selection based on the sum of single effects, i.e., the vector with one non-zero element. Given any data $\boldsymbol{X}\in\mathbb{R}^{m\times n}$ (of $n$ explanatory variables) and $\boldsymbol{y}\in\mathbb{R}^{m}$ , one can build an optimization problem as follows,

$\begin{aligned} \min_{\{\boldsymbol{b}_{\ell}\}_{\ell\in[L]}}\,&\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{b}\|_2^2 \\ \text{s.t.}\,&\begin{cases} \displaystyle\boldsymbol{b}=\sum_{\ell\in[L]}\boldsymbol{b}_{\ell} \\ \|\boldsymbol{b}_{\ell}\|_{0}=1,\,\forall \ell\in[L] \\ \boldsymbol{b}_{k}^\top\boldsymbol{b}_{\ell}=0,\,k\neq \ell \end{cases} \end{aligned}$

where $L\in\mathbb{Z}^{+}$ ( $L< n$ ) is predefined by the number of correlated variables. The vectors $\boldsymbol{b}_{\ell},\,\ell\in[L]$ are the coefficients in this linear regression. This optimization problem can also be written as follows,

$\begin{aligned} \min_{\boldsymbol{w}}\,&\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{w}\|_2^2 \\ \text{s.t.}\,& \displaystyle\|\boldsymbol{w}\|_0=L,\,L\in\mathbb{Z}^{+} \end{aligned}$

Or see Figure 1.1 in Section 1.1 Non-negative sparse reconstruction (Page 2) for an illustration.

References

Gao Wang, Abhishek Sarkar, Peter Carbonetto and Matthew Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5), 1273-1300.
Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen & Danielle Posthuma (2021). Genome-wide association studies. Nature Reviews Methods Primers. 1: 59.

3rd Mile

Causal Effect Estimation/Imputation

The causal effect estimation problem is usually defined as a matrix completion on the partially observed matrix $\boldsymbol{Y}\in\mathbb{R}^{N\times T}$ in which $N$ units and $T$ periods are involved. The observed index set is denoted by $\Omega$ . The optimization is from the classical matrix factorization techniques for recommender systems (see Koren et al.’09):

$\min_{\boldsymbol{W},\boldsymbol{X},\boldsymbol{u},\boldsymbol{p}}\,\frac{1}{2}\left\|\mathcal{P}_{\Omega}(\boldsymbol{Y}-\boldsymbol{W}^\top\boldsymbol{X}-\boldsymbol{u}\mathbf{1}_{T}^\top-\mathbf{1}_{N}\boldsymbol{p}^\top)\right\|_F^2$

where $\boldsymbol{W}\in\mathbb{R}^{R\times N}$ and $\boldsymbol{X}\in\mathbb{R}^{R\times T}$ are factor matrices, referring to units and periods, respectively. Here, $\boldsymbol{u}\in\mathbb{R}^{N}$ and $\boldsymbol{p}\in\mathbb{R}^{T}$ are bias vectors, corresponding to $N$ units and $T$ periods, respectively. This idea has also been examined on the tensor factorization (to be honest, performance gains are marginal), see e.g., Bayesian augmented tensor factorization by Chen et al.’19. In the causal effect imputation, one great challenge is how to handle the structural patterns of missing data as mentioned by Athey et al.’21. The structural missing patterns have been discussed on spatiotemporal data with autoregressive tensor factorization (for spatiotemporal predictions).

2nd Mile

Data Standardization in Healthcare

The motivation for discussing the value of standards for health datasets is the risk of algorithmic bias, consequently leading to the possible healthcare inequity. The problem arises from the systemic inequalities in the dataset curation and the unequal opportunities to access the data and research. The aim is to expolore the standards, frameworks, and best practices in health datasets. Some discrete insights throughout the whole paper are summarized as follows,

AI as a medical device (AIaMD). One concern is the risk of systemic algorithmic bias (well-recognized in the literature) if models are trained on biased training datasets.
Less accurate performance in certain patient groups when using the biased algorithms.
Data diversity (Mainly discuss “how to improve”):
- Challenges: lack of standardization across attribute categories, difficulty in harmonizing several methods of data capture and data-governance restrictions.
- Inclusiveness is a core tenet of ethical AI in healthcare.
- Guidance on how to apply the principles in the curation (e.g., developing the data collection strategy), aggregation and use of health data.
The use of metrics (measuring diversity). How to promote diversity and transparency?
Future actions: Guidelines for data collection, handling missing data and labeling data.

References

Anmol Arora, Joseph E. Alderman, Joanne Palmer, Shaswath Ganapathi, Elinor Laws, Melissa D. McCradden, Lauren Oakden-Rayner, Stephen R. Pfohl, Marzyeh Ghassemi, Francis McKay, Darren Treanor, Negar Rostamzadeh, Bilal Mateen, Jacqui Gath, Adewole O. Adebajo, Stephanie Kuku, Rubeta Matin, Katherine Heller, Elizabeth Sapey, Neil J. Sebire, Heather Cole-Lewis, Melanie Calvert, Alastair Denniston, Xiaoxuan Liu (2023). The value of standards for health datasets in artificial intelligence-based applications. Nature Medicine, 29: 2929–2938.

1st Mile

Large Time Series Forecasting Models

As we know, the training data in the large time series model is from different areas, this means that the model training process highly depends on the selected datasets across various areas, so one question is how to reduce the model biases if we consider the forecasting scenario as traffic flow or human mobility? Because I guess time series data in different areas should demonstrate different data behaviors. Hopefully, it is interesting to develop domain-specific time series datasets (e.g., Largest multi-city traffic dataset) and large models (e.g., TimeGPT).

References

Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, Doyen Sahoo (2024). Unified Training of Universal Time Series Forecasting Transformers. arXiv:2402.02592.

Motivation & Principle: “不积硅步，无以至千里。” (Small Steps to Accuracy)