Step Forward on the Prior Knowledge

Xinyu Chen (陈新宇) created this page since early 2024 with the purpose of fostering research knowledge, vision, insight, and style. In the meantime, it aims to connect random concepts with mathematics and machine learning.


34th Mile

Cardinality Minimization, Constraints, and Regularization

References


33rd Mile

Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization

Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy, see Brunton et al., (2016)) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. The optimization problem for learning system equations is

where are lower and upper bounds on the coefficients. The full system dynamics are .

References


32nd Mile


31st Mile

-Statistic & Student -Distribution

Given population mean , suppose the sample mean , sample standard deviation , and sample size (small value), the formula of -statistic for small sample sizes is written as follows,

A high absolute value of suggests a statistically significant difference. is the null hypothesis, namely, the population mean is . Below is the student -distribution with a confidence interval.


Please check out the details of the relevance of -statistics for small sample sizes.

Supporting Materials


30th Mile

Interpretable ML vs. Explainable ML

In the context of AI, there is a subtle difference between terms interpretability and explainability. The interpretability techniques such as sparse linear regression were used to “understand how the underlying AI technology works”, while the explainability refer to “the level of understanding how AI-based systems produce with a given result”. Main claim from Wikipedia is that “treating the model as a black box and analyzing how marginal changes to the inputs affect the result sometimes provides a sufficient explanation.”

References


29th Mile

INFORMS 2024 | Optimal k-Sparse Ridge Regression

The classical linear regression with a -norm induced sparsity penalty can be written as follows,

which is equivalent to

References


28th Mile

Mixed Integer Linear Programming (Example)


import cvxpy as cp
import numpy as np

# Data
n, d, k = 100, 50, 3  # n: samples, d: features, k: sparsity level
X = np.random.randn(n, d)
y = np.random.randn(n)
M = 1  # Large constant for enforcing non-zero constraint

# Variables
beta = cp.Variable(d, nonneg=True)
z = cp.Variable(d, boolean=True)

# Constraints
constraints = [
    cp.sum(z) <= k,
    beta <= M * z,
    beta >= 0
]

# Objective
objective = cp.Minimize(cp.sum_squares(y - X @ beta))

# Problem
problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.GUROBI)  # Ensure to use a solver that supports MIP

# Solution
print("Optimal beta:", beta.value)
print("Active indices:", np.nonzero(z.value > 0.5)[0])


Note that the “Model too large for size-limited Gurobi license” error.


27th Mile

Importance of Sparsity in Interpretable Machine Learning

Sparsity is an important type of model-based interpretability methods. Typically, the practitioner can impose sparsity on the model by limiting the number of nonzero parameters. When the number of nonzero parameters is sufficiently small, a practitioner can interpret the variables corresponding to those parameters as being meaningfully related to the outcome in question and can also interpret the magnitude and direction of the parameters. Two important methods including -norm penalty (e.g., LASSO) and -norm constraint (e.g., OMP). Model sparsity is often useful for high-dimensional problems, where the goal is to identify key features for further analysis.

References


26th Mile

INFORMS 2024 | Core Tensor Shape Optimization

Recall that the sum of squared singular values of and outcomes is

because Frobenius norm is invariant under orthogonal transformations with respect to singular vectors.

This means that we can solve a singular value packing problem instead of considering the complement of the surrogate loss. Please reproduce the aforementioned property as follows,


import numpy as np

X = np.random.rand(100, 100)
print(np.linalg.norm(X, 'fro') ** 2)
u, s, v = np.linalg.svd(X, full_matrices = False)
print(np.linalg.norm(s, 2) ** 2)


Thus, Tucker packing problem on the non-increasing sequences (w.r.t. singular values), the optimization problem is given by

The optimization problem can be implemented by using an integer programming solvers, and its solution quality is competitive with the greedy algorithm.

References


25th Mile

Mobile Service Usage Data


24th Mile

Optimization in Reinforcement Learning

References


23rd Mile

Sparse and Time-Varying Regression

This work addresses a time series regression problem for features and outcomes , taking the following expression:

where are coefficient vectors, which are supposed to represent both sparsity and time-varying behaviors of the system. Thus, the optimization problem has both temporal smoothing (in the objective) and sparsity (in the constraint), e.g.,

where the constraint is indeed the -norm of vectors, as the symbol denotes the index set of nonzero entries in the vector. For instance, the first constraint can be rewritten as . Thus, stand for sparsity levels.

The methodological contribution is reformulating this problem as a binary convex optimization problem (w/ a novel relaxation of the objective function), which can be solved efficiently using a cutting plane-type algorithm.

References


22nd Mile

Revisiting -Norm Minimization

References


21st Mile

Research Seminars

  • Computational Research in Boston and Beyond (CRIBB) seminar series: A forum for interactions among scientists and engineers throughout the Boston area working on a range of computational problems. This forum consists of a monthly seminar where individuals present their work.
  • Param-Intelligence (𝝅) seminar series: A dynamic platform for researchers, engineers, and students to explore and discuss the latest advancements in integrating machine learning with scientific computing. Key topics include data-driven modeling, physics-informed neural surrogates, neural operators, and hybrid computational methods, with a strong focus on real-world applications across various fields of computational science and engineering.


20th Mile

Robust, Interpretable Statistical Models: Sparse Regression with the LASSO

First of all, we revisit the classical least squares such that

Putting the Tikhonov regularization together with least squares, it refers to as the Ridge regression used almost everywhere:

Another classical variant is the LASSO:

with -norm on the vector . It allows one to find a few columns of the matrix that are most correlated with the designed outcomes (e.g., ) for making decisions (e.g., why they take actions?).

One interesting application is using sparsity-promoting techniques and machine learning with nonlinear dynamical systems to discover governing equations from noisy measurement data. The only as-sumption about the structure of the model is that there are only a fewimportant terms that govern the dynamics, so that the equations aresparse in the space of possible functions; this assumption holds formany physical systems in an appropriate basis.

References


19th Mile

Causal Inference for Geosciences

Learning causal interactions from time series of complex dynamical systems is of great significance in real-world systems. But the questions arise as: 1) How to formulate causal inference for complex dynamical systems? 2) How to detect causal links? 3) How to quantify causal interactions?

References


18th Mile

Tensor Factorization for Knowledge Graph Completion

Knowledge graph completion is a kind of link prediction problems, inferring missing “facts” based on existing ones. Tucker decomposition of the binary tensor representation of knowledge graph triples allows one to make data completion.

References


17th Mile

RESCAL: Tensor-Based Relational Learning

Multi-relational data is everywhere in real-world applications such as computational biology, social networks, and semantic web. This type of data is often represented in the form of graphs or networks where nodes represent entities, and edges represent different types of relationships.

Instead of using the classical Tucker and CP tensor decomposition, RESCAL takes the inherent structure of dyadic relational data into account, whose tensor factorization on the tensor variable (i.e., frontal tensor slices) is

where is the global entity factor matrix, and specifies the interaction of the latent components. Such kind of methods can be used to solve link prediction, collective classification, and link-based clustering.

References


16th Mile

Subspace Pursuit Algorithm

Considering the optimization problem for estimating -sparse vector :

with the signal vector (or measurement vector), the dictionary matrix (or measurement matrix), and the sparsity level .

The subspace pursuit algorithm, introduced by W. Dai and O. Milenkovic in 2008, is a classical algorithm in the greedy family. It bears some resemblance with compressive sampling matching pursuit (CoSaMP by D. Needell and J. A. Tropp in 2008), except that, instead of , only indices of largest (in modulus) entries of the residual vector are selected, and that an additional orthogonal projection step is performed at each iteration. The implementation of subspace pursuit algorithm (adapted from A Mathematical Introduction to Compressive Sensing, see Page 65) can be summarized as follows:

  • Input: Signal vector , dictionary matrix , and sparsity level .
  • Output: -sparse vector and index set .
  • Initialization: Sparse vector (i.e., zero vector), index set (i.e., empty set), and error vector .
  • while not stop do
    • Find as the index set of the largest entries of .
    • .
    • (least squares).
    • Find as the index set of the largest entries of .
    • (least squares again!).
    • Set for all .
    • .
  • end

The subspace pursuit algorithm is a fixed-cardinality method, quite different from the classical orthogonal matching pursuit algorithm developed in 1993 such that

  • Input: Signal vector , dictionary matrix , and sparsity level .
  • Output: -sparse vector and index set .
  • Initialization: Sparse vector (i.e., zero vector), index set (i.e., empty set), and error vector .
  • while not stop do
    • Find as the index set of the largest entry of , while .
    • .
    • (least squares).
    • .
  • end


15th Mile

Synthetic Sweden Mobility

The Synthetic Sweden Mobility (SySMo) model provides a simplified yet statistically realistic microscopic representation of the real population of Sweden. The agents in this synthetic population contain socioeconomic attributes, household characteristics, and corresponding activity plans for an average weekday. This agent-based modelling approach derives the transportation demand from the agents’ planned activities using various transport modes (e.g., car, public transport, bike, and walking). The dataset is available on Zenodo.

Going back to the individual mobility trajectory, there would be some opportunities to approach taxi trajectory data such as


14th Mile

Prediction on Extreme Floods

AI increases global access to reliable flood forecasts (see dataset).

Another weather forecasting dataset for consideration: Rain forecasts world-wide on an expansive data set with over a magnitude more hi-res rain radar data.

References


13th Mile

Sparse Recovery Problem

Considering a general optimization problem for estimating the sparse vector :

with the signal vector and a dictionary of elementary functions (i.e., dictionary matrix). There are a lot of solution algorithms in literature:

The most classical (greedy) method for solving the linear sparse regression is orthogonal matching pursuit (see an introduction here).


12th Mile

Economic Complexity

References


11th Mile

Time-Varying Autoregressive Models

Vector autoregression (VAR) has a key assumption that the coeffcients are invariant across time (i.e., time-invariant), but it is not always true when accounting for psychological phenomena such as the phase transition from a healthy to unhealthy state (or vice versa). Consequently, time-varying vector autoregressive models are of great significance for capturing the parameter changes in response to interventions. From the statistical perspective, there are two types of lagged effects between pairs of variables: autocorrelations (e.g., ) and cross-lagged effects (e.g., ). The time-varying autoregressive models can be solved by using generalized additive model and kernel smoothing estimation.

References


10th Mile

Higher-Order Graph & Hypergraph

The concept of a higher-order graph extends the traditional notion of a graph, which consists of nodes and edges, to capture more complex relationships and structures in data. A common formalism for representing higher-order graphs is through hypergraphs, which generalize the concept of a graph to allow for hyperedges connecting multiple nodes. In a hypergraph, each hyperedge connects a subset of nodes, forming higher-order relationships among them.

References


9th Mile

Eigenvalues of Directed Cycles

The graph signal processing possesses an interesting property of directed cycle (see Figure 2 in the literature). The adjacency matrix of a directed cycle has a set of unit eigenvalues as follows.


import numpy as np

## Construct an adjacency matrix A
a = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = a.shape[0]
A = np.zeros((n, n))
A[:, 0] = a
for i in range(1, n):
    A[:, i] = np.append(a[-i :], a[: -i])

## Perform eigenvalue decomposition on A
eig_val, eig_vec = np.linalg.eig(A)

## Plot eigenvalues
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Helvetica'

fig = plt.figure(figsize = (3, 3))
ax = fig.add_subplot(1, 1, 1)
circ = plt.Circle((0, 0), radius = 1, edgecolor = 'b', facecolor = 'None', linewidth = 2)
ax.add_patch(circ)
plt.plot(eig_val.real, eig_val.imag, 'rx', markersize = 8)
ax.set_aspect('equal', adjustable = 'box')
plt.xlabel('Re')
plt.ylabel('Im')
plt.show()
fig.savefig('eigenvalues_directed_cycle.png', bbox_inches = 'tight')


8th Mile

Graph Filter

Defining graph-aware operator plays an important role for characterizing a signal with vertices over a graph . One simple idea is introducing the adjacency matrix so that the operation is . In that case, is a simple operator that accounts for the local connectivity of . One example is using the classical unit delay (seems to be time-shift) such that

The simplest signal operation as multiplication by the adjacency matrix defines graph filters as matrix polynomials of the form

For instance, we have

On the signal , it always holds that

When applying the polynomial filter to a graph signal , the operation takes a local linear combination of the signal values at one-hop neighbors. takes a local linear combination of , referring to two-hop neighbors. Consequently, a graph filter of order represents the mixing values that are at most hops away.

References


7th Mile

Graph Signals

For any graph where is a finite set of vertices, and is the set of edges. Graph signals can be formally represented as vectors where (or say in the following) stores the signal value at the th vertex in . The graph Fourier transform of is element-wise defined as follows,

or another form such that

where consists of the eigenvectors . The notation is the conjugate of complex values, and is the conjugate transpose.

The above graph Fourier transform can also be generalized to the graph signals in the form of multivariate time series. For instance, on the data , we have

where consists of the eigenvectors of the graph Laplacian matrix.

References


6th Mile

Graph Signal Processing

Graph signal processing not only focuses on the graph typology (e.g., connection between nodes), but also covers the quantity of nodes (i.e., graph signals) with weighted adjacency information.

References


5th Mile

Clifford Product

In Grassmann algebra, the inner product between two vectors and (w/ basis vectors and ) is given by

implies to be the multiplication between the magnitude of and the projection of on . Here, the notation refers to the norm, or say the magnitude. is the angle between and in the plane containing them.

In contrast, the outer product (usually called Wedge product) is

implies to be the multiplication between and the projection of on the orthogonal direction of . Here, the unit bivector represents the orientation ( or ) of the hyperplane of (see Section II in geometric-algebra adaptive filters).



As a result, they consist of Clifford product (or called geometric product, denoted by the symbol ) such that

In particular, Clifford algebra is important for modeling vector fields, thus demonstrating valuable applications to wind velocity and fluid dynamics (e.g., Navier-Stokes equation).

References


4th Mile

Bayesian Variable Selection

In genetic fine mapping, one critical problem is the variable selection in linear regression. There is a Bayesian variable selection based on the sum of single effects, i.e., the vector with one non-zero element. Given any data (of explanatory variables) and , one can build an optimization problem as follows,

where () is predefined by the number of correlated variables. The vectors are the coefficients in this linear regression. This optimization problem can also be written as follows,

Or see Figure 1.1 in Section 1.1 Non-negative sparse reconstruction (Page 2) for an illustration.

References


3rd Mile

Causal Effect Estimation/Imputation

The causal effect estimation problem is usually defined as a matrix completion on the partially observed matrix in which units and periods are involved. The observed index set is denoted by . The optimization is from the classical matrix factorization techniques for recommender systems (see Koren et al.’09):

where and are factor matrices, referring to units and periods, respectively. Here, and are bias vectors, corresponding to units and periods, respectively. This idea has also been examined on the tensor factorization (to be honest, performance gains are marginal), see e.g., Bayesian augmented tensor factorization by Chen et al.’19. In the causal effect imputation, one great challenge is how to handle the structural patterns of missing data as mentioned by Athey et al.’21. The structural missing patterns have been discussed on spatiotemporal data with autoregressive tensor factorization (for spatiotemporal predictions).


2nd Mile

Data Standardization in Healthcare

The motivation for discussing the value of standards for health datasets is the risk of algorithmic bias, consequently leading to the possible healthcare inequity. The problem arises from the systemic inequalities in the dataset curation and the unequal opportunities to access the data and research. The aim is to expolore the standards, frameworks, and best practices in health datasets. Some discrete insights throughout the whole paper are summarized as follows,

  • AI as a medical device (AIaMD). One concern is the risk of systemic algorithmic bias (well-recognized in the literature) if models are trained on biased training datasets.
  • Less accurate performance in certain patient groups when using the biased algorithms.
  • Data diversity (Mainly discuss “how to improve”):
    • Challenges: lack of standardization across attribute categories, difficulty in harmonizing several methods of data capture and data-governance restrictions.
    • Inclusiveness is a core tenet of ethical AI in healthcare.
    • Guidance on how to apply the principles in the curation (e.g., developing the data collection strategy), aggregation and use of health data.
  • The use of metrics (measuring diversity). How to promote diversity and transparency?
  • Future actions: Guidelines for data collection, handling missing data and labeling data.

References

  • Anmol Arora, Joseph E. Alderman, Joanne Palmer, Shaswath Ganapathi, Elinor Laws, Melissa D. McCradden, Lauren Oakden-Rayner, Stephen R. Pfohl, Marzyeh Ghassemi, Francis McKay, Darren Treanor, Negar Rostamzadeh, Bilal Mateen, Jacqui Gath, Adewole O. Adebajo, Stephanie Kuku, Rubeta Matin, Katherine Heller, Elizabeth Sapey, Neil J. Sebire, Heather Cole-Lewis, Melanie Calvert, Alastair Denniston, Xiaoxuan Liu (2023). The value of standards for health datasets in artificial intelligence-based applications. Nature Medicine, 29: 2929–2938.


1st Mile

Large Time Series Forecasting Models

As we know, the training data in the large time series model is from different areas, this means that the model training process highly depends on the selected datasets across various areas, so one question is how to reduce the model biases if we consider the forecasting scenario as traffic flow or human mobility? Because I guess time series data in different areas should demonstrate different data behaviors. Hopefully, it is interesting to develop domain-specific time series datasets (e.g., Largest multi-city traffic dataset) and large models (e.g., TimeGPT).

References


Motivation & Principle: “不积硅步,无以至千里。” (Small Steps to Accuracy)