Step Forward on the Prior Knowledge
Xinyu Chen (陈新宇) created this page since early 2024 with the purpose of fostering research knowledge, vision, insight, and style. In the meantime, it aims to connect random concepts with mathematics and machine learning.
34th Mile
Cardinality Minimization, Constraints, and Regularization
References
- Andreas M. Tillmann, Daniel Bienstock, Andrea Lodi, Alexandra Schwartz (2024). Cardinality Minimization, Constraints, and Regularization: A Survey. SIAM Review, 66(3). [PDF]
33rd Mile
Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization
Discovering governing equations of complex dynamical systems directly from data is a central problem in scientific machine learning. In recent years, the sparse identification of nonlinear dynamics (SINDy, see Brunton et al., (2016)) framework, powered by heuristic sparse regression methods, has become a dominant tool for learning parsimonious models. The optimization problem for learning system equations is
where are lower and upper bounds on the coefficients. The full system dynamics are .
References
- Bertsimas, D., & Gurnee, W. (2023). Learning sparse nonlinear dynamics via mixed-integer optimization. Nonlinear Dynamics, 111(7), 6585-6604.
32nd Mile
- Behavioral changes during the COVID-19 pandemic decreased income diversity of urban encounters
- COVID-19 is linked to changes in the time–space dimension of human mobility
- Evacuation patterns and socioeconomic stratification in the context of wildfires in Chile
- Fine tune an LLM
- Learning with Combinatorial Optimization Layers: A Probabilistic Approach
31st Mile
-Statistic & Student -Distribution
Given population mean , suppose the sample mean , sample standard deviation , and sample size (small value), the formula of -statistic for small sample sizes is written as follows,
A high absolute value of suggests a statistically significant difference. is the null hypothesis, namely, the population mean is . Below is the student -distribution with a confidence interval.
Please check out the details of the relevance of -statistics for small sample sizes.
Supporting Materials
- Hypothesis Testing Problems - Z Test & T Statistics - One & Two Tailed Tests 2
- Introduction to the t Distribution (non-technical)
- Confidence Intervals for One Mean: Determining the Required Sample Size
- Student’s T Distribution
- An Introduction to the t Distribution (Includes some mathematical details)
30th Mile
Interpretable ML vs. Explainable ML
In the context of AI, there is a subtle difference between terms interpretability and explainability. The interpretability techniques such as sparse linear regression were used to “understand how the underlying AI technology works”, while the explainability refer to “the level of understanding how AI-based systems produce with a given result”. Main claim from Wikipedia is that “treating the model as a black box and analyzing how marginal changes to the inputs affect the result sometimes provides a sufficient explanation.”
References
- Explainable Artificial Intelligence on Wikipedia.
- W. James Murdoch, Chandan Singh, Karl Kumbier, and Bin Yu (2019). Definitions, methods, and applications in interpretable machine learning. PNAS.
- Christoph Molnar (2024). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.
29th Mile
INFORMS 2024 | Optimal k-Sparse Ridge Regression
The classical linear regression with a -norm induced sparsity penalty can be written as follows,
which is equivalent to
References
- Jiachang Liu, Sam Rosen, Chudi Zhong, Cynthia Rudin (2023). OKRidge: Scalable Optimal k-Sparse Ridge Regression. NeurIPS 2023.
28th Mile
Mixed Integer Linear Programming (Example)
import cvxpy as cp
import numpy as np
# Data
n, d, k = 100, 50, 3 # n: samples, d: features, k: sparsity level
X = np.random.randn(n, d)
y = np.random.randn(n)
M = 1 # Large constant for enforcing non-zero constraint
# Variables
beta = cp.Variable(d, nonneg=True)
z = cp.Variable(d, boolean=True)
# Constraints
constraints = [
cp.sum(z) <= k,
beta <= M * z,
beta >= 0
]
# Objective
objective = cp.Minimize(cp.sum_squares(y - X @ beta))
# Problem
problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.GUROBI) # Ensure to use a solver that supports MIP
# Solution
print("Optimal beta:", beta.value)
print("Active indices:", np.nonzero(z.value > 0.5)[0])
Note that the “Model too large for size-limited Gurobi license” error.
27th Mile
Importance of Sparsity in Interpretable Machine Learning
Sparsity is an important type of model-based interpretability methods. Typically, the practitioner can impose sparsity on the model by limiting the number of nonzero parameters. When the number of nonzero parameters is sufficiently small, a practitioner can interpret the variables corresponding to those parameters as being meaningfully related to the outcome in question and can also interpret the magnitude and direction of the parameters. Two important methods including -norm penalty (e.g., LASSO) and -norm constraint (e.g., OMP). Model sparsity is often useful for high-dimensional problems, where the goal is to identify key features for further analysis.
References
- W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu (2019). Definitions, methods, and applications in interpretable machine learning. PNAS.
26th Mile
INFORMS 2024 | Core Tensor Shape Optimization
Recall that the sum of squared singular values of and outcomes is
because Frobenius norm is invariant under orthogonal transformations with respect to singular vectors.
This means that we can solve a singular value packing problem instead of considering the complement of the surrogate loss. Please reproduce the aforementioned property as follows,
import numpy as np
X = np.random.rand(100, 100)
print(np.linalg.norm(X, 'fro') ** 2)
u, s, v = np.linalg.svd(X, full_matrices = False)
print(np.linalg.norm(s, 2) ** 2)
Thus, Tucker packing problem on the non-increasing sequences (w.r.t. singular values), the optimization problem is given by
The optimization problem can be implemented by using an integer programming solvers, and its solution quality is competitive with the greedy algorithm.
References
- Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni (2023). Approximately Optimal Core Shapes for Tensor Decompositions. ICML 2023. [Python code]
25th Mile
Mobile Service Usage Data
- Orlando E. Martínez-Durive et al. (2023). The NetMob23 Dataset: A High-resolution Multi-region Service-level Mobile Data Traffic Cartography. arXiv:2305.06933.
- André Zanella (2024). Characterizing Large-Scale Mobile Traffic Measurements for Urban, Social and Networks Sciences. PhD thesis.
24th Mile
Optimization in Reinforcement Learning
References
- Jalaj Bhandari and Daniel Russo (2024). Global Optimality Guarantees for Policy Gradient Methods. Operations Research, 72(5): 1906 - 1927.
- Lucia Falconi, Andrea Martinelli, and John Lygeros (2024). Data-driven optimal control via linear programming: boundedness guarantees. IEEE Transactions on Automatic Control.
23rd Mile
Sparse and Time-Varying Regression
This work addresses a time series regression problem for features and outcomes , taking the following expression:
where are coefficient vectors, which are supposed to represent both sparsity and time-varying behaviors of the system. Thus, the optimization problem has both temporal smoothing (in the objective) and sparsity (in the constraint), e.g.,
where the constraint is indeed the -norm of vectors, as the symbol denotes the index set of nonzero entries in the vector. For instance, the first constraint can be rewritten as . Thus, stand for sparsity levels.
The methodological contribution is reformulating this problem as a binary convex optimization problem (w/ a novel relaxation of the objective function), which can be solved efficiently using a cutting plane-type algorithm.
References
- Dimitris Bertsimas, Vassilis Digalakis, Michael Lingzhi Li, Omar Skali Lami (2024). Slowly Varying Regression Under Sparsity. Operations Research. [arXiv]
22nd Mile
Revisiting -Norm Minimization
References
- Lijun Ding (2023). One dimensional least absolute deviation problem. Blog post.
- Gregory Gundersen (2022). Weighted Least Squares. Blog post.
- stephentu’s blog (2014). Subdifferential of a norm. Blog post.
21st Mile
Research Seminars
- Computational Research in Boston and Beyond (CRIBB) seminar series: A forum for interactions among scientists and engineers throughout the Boston area working on a range of computational problems. This forum consists of a monthly seminar where individuals present their work.
- Param-Intelligence (𝝅) seminar series: A dynamic platform for researchers, engineers, and students to explore and discuss the latest advancements in integrating machine learning with scientific computing. Key topics include data-driven modeling, physics-informed neural surrogates, neural operators, and hybrid computational methods, with a strong focus on real-world applications across various fields of computational science and engineering.
20th Mile
Robust, Interpretable Statistical Models: Sparse Regression with the LASSO
First of all, we revisit the classical least squares such that
Putting the Tikhonov regularization together with least squares, it refers to as the Ridge regression used almost everywhere:
Another classical variant is the LASSO:
with -norm on the vector . It allows one to find a few columns of the matrix that are most correlated with the designed outcomes (e.g., ) for making decisions (e.g., why they take actions?).
One interesting application is using sparsity-promoting techniques and machine learning with nonlinear dynamical systems to discover governing equations from noisy measurement data. The only as-sumption about the structure of the model is that there are only a fewimportant terms that govern the dynamics, so that the equations aresparse in the space of possible functions; this assumption holds formany physical systems in an appropriate basis.
References
- Steve Brunton (2021). Robust, Interpretable Statistical Models: Sparse Regression with the LASSO. see YouTube. (Note: Original paper by Tibshirani (1996))
- Steve L. Brunton, Joshua L. Proctor, and Nathan Kutz (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences. 113 (15), 3932-3937.
19th Mile
Causal Inference for Geosciences
Learning causal interactions from time series of complex dynamical systems is of great significance in real-world systems. But the questions arise as: 1) How to formulate causal inference for complex dynamical systems? 2) How to detect causal links? 3) How to quantify causal interactions?
References
- Jakob Runge (2017). Causal inference and complex network methods for the geosciences. Slides.
- Jakob Runge, Andreas Gerhardus, Gherardo Varando, Veronika Eyring & Gustau Camps-Valls (2023). Causal inference for time series. Nature Reviews Earth & Environment, 4: 487–505.
- Jitkomut Songsiri (2013). Sparse autoregressive model estimation for learning Granger causality in time series. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
18th Mile
Tensor Factorization for Knowledge Graph Completion
Knowledge graph completion is a kind of link prediction problems, inferring missing “facts” based on existing ones. Tucker decomposition of the binary tensor representation of knowledge graph triples allows one to make data completion.
References
- TuckER: Tensor Factorization for Knowledge Graph Completion. GitHub.
- Ivana Balazevic, Carl Allen, Timothy M. Hospedales (2019). TuckER: Tensor Factorization for Knowledge Graph Completion. arXiv:1901.09590. [PDF]
17th Mile
RESCAL: Tensor-Based Relational Learning
Multi-relational data is everywhere in real-world applications such as computational biology, social networks, and semantic web. This type of data is often represented in the form of graphs or networks where nodes represent entities, and edges represent different types of relationships.
Instead of using the classical Tucker and CP tensor decomposition, RESCAL takes the inherent structure of dyadic relational data into account, whose tensor factorization on the tensor variable (i.e., frontal tensor slices) is
where is the global entity factor matrix, and specifies the interaction of the latent components. Such kind of methods can be used to solve link prediction, collective classification, and link-based clustering.
References
- Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel (2011). A Three-Way Model for Collective Learning on Multi-Relational Data. ICML 2011. [PDF] [Slides]
- Maximilian Nickel (2013). Tensor Factorization for Relational Learning. PhD thesis.
- Denis Krompaß, Maximilian Nickel, Xueyan Jiang, Volker Tresp (2013). Non-Negative Tensor Factorization with RESCAL. ECML Workshop 2013.
- Elynn Y. Chen, Rong Chen (2019). Modeling Dynamic Transport Network with Matrix Factor Models: with an Application to International Trade Flow. arXiv:1901.00769. [PDF]
- Zhanhong Cheng. factor_matrix_time_series. GitHub.
16th Mile
Subspace Pursuit Algorithm
Considering the optimization problem for estimating -sparse vector :
with the signal vector (or measurement vector), the dictionary matrix (or measurement matrix), and the sparsity level .
The subspace pursuit algorithm, introduced by W. Dai and O. Milenkovic in 2008, is a classical algorithm in the greedy family. It bears some resemblance with compressive sampling matching pursuit (CoSaMP by D. Needell and J. A. Tropp in 2008), except that, instead of , only indices of largest (in modulus) entries of the residual vector are selected, and that an additional orthogonal projection step is performed at each iteration. The implementation of subspace pursuit algorithm (adapted from A Mathematical Introduction to Compressive Sensing, see Page 65) can be summarized as follows:
- Input: Signal vector , dictionary matrix , and sparsity level .
- Output: -sparse vector and index set .
- Initialization: Sparse vector (i.e., zero vector), index set (i.e., empty set), and error vector .
- while not stop do
- Find as the index set of the largest entries of .
- .
- (least squares).
- Find as the index set of the largest entries of .
- (least squares again!).
- Set for all .
- .
- end
The subspace pursuit algorithm is a fixed-cardinality method, quite different from the classical orthogonal matching pursuit algorithm developed in 1993 such that
- Input: Signal vector , dictionary matrix , and sparsity level .
- Output: -sparse vector and index set .
- Initialization: Sparse vector (i.e., zero vector), index set (i.e., empty set), and error vector .
- while not stop do
- Find as the index set of the largest entry of , while .
- .
- (least squares).
- .
- end
15th Mile
Synthetic Sweden Mobility
The Synthetic Sweden Mobility (SySMo) model provides a simplified yet statistically realistic microscopic representation of the real population of Sweden. The agents in this synthetic population contain socioeconomic attributes, household characteristics, and corresponding activity plans for an average weekday. This agent-based modelling approach derives the transportation demand from the agents’ planned activities using various transport modes (e.g., car, public transport, bike, and walking). The dataset is available on Zenodo.
Going back to the individual mobility trajectory, there would be some opportunities to approach taxi trajectory data such as
- 1 million+ trips collected by 13,000+ taxi cabs during 5 days in Harbin, China
- Daily GPS trajectory data of 664 electric taxis in Shenzhen, China
- Takahiro Yabe, Kota Tsubouchi, Toru Shimizu, Yoshihide Sekimoto, Kaoru Sezaki, Esteban Moro & Alex Pentland (2024). YJMob100K: City-scale and longitudinal dataset of anonymized human mobility trajectories. Scientific Data. [Data] [Challenge]
- The dataset comprises trajectory data of traffic participants, along with traffic light data, current local weather data, and air quality data from the Application Platform Intelligent Mobility (AIM) Research Intersection
14th Mile
Prediction on Extreme Floods
AI increases global access to reliable flood forecasts (see dataset).
Another weather forecasting dataset for consideration: Rain forecasts world-wide on an expansive data set with over a magnitude more hi-res rain radar data.
References
- Nearing, G., Cohen, D., Dube, V. et al. (2024). Global prediction of extreme floods in ungauged watersheds. Nature, 627: 559–563.
13th Mile
Sparse Recovery Problem
Considering a general optimization problem for estimating the sparse vector :
with the signal vector and a dictionary of elementary functions (i.e., dictionary matrix). There are a lot of solution algorithms in literature:
- Mehrdad Yaghoobi, Di Wu, Mike E. Davies (2015). Fast Non-Negative Orthogonal Matching Pursuit. IEEE Signal Processing Letters, 22 (9): 1229-1233.
- Thanh Thi Nguyen, Jérôme Idier, Charles Soussen, El-Hadi Djermoune (2019). Non-Negative Orthogonal Greedy Algorithms. IEEE Transactions on Signal Processing, 67 (21): 5643-5658.
- Nicolas Nadisic, Arnaud Vandaele, Nicolas Gillis, Jeremy E. Cohen (2020). Exact Sparse Nonnegative Least Squares. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Chiara Ravazzi, Francesco Bullo, Fabrizio Dabbene (2022). Unveiling Oligarchy in Influence Networks From Partial Information. IEEE Transactions on Control of Network Systems, 10 (3): 1279-1290.
- Thi Thanh Nguyen (2019). Orthogonal greedy algorithms for non-negative sparse reconstruction. PhD thesis.
The most classical (greedy) method for solving the linear sparse regression is orthogonal matching pursuit (see an introduction here).
12th Mile
Economic Complexity
References
- César A. Hidalgo (2021). Economic complexity theory and applications. Nature Reviews Physics. 3: 92-113.
11th Mile
Time-Varying Autoregressive Models
Vector autoregression (VAR) has a key assumption that the coeffcients are invariant across time (i.e., time-invariant), but it is not always true when accounting for psychological phenomena such as the phase transition from a healthy to unhealthy state (or vice versa). Consequently, time-varying vector autoregressive models are of great significance for capturing the parameter changes in response to interventions. From the statistical perspective, there are two types of lagged effects between pairs of variables: autocorrelations (e.g., ) and cross-lagged effects (e.g., ). The time-varying autoregressive models can be solved by using generalized additive model and kernel smoothing estimation.
References
- Haslbeck, J. M., Bringmann, L. F., & Waldorp, L. J. (2021). A tutorial on estimating time-varying vector autoregressive models. Multivariate Behavioral Research, 56(1), 120-149.
10th Mile
Higher-Order Graph & Hypergraph
The concept of a higher-order graph extends the traditional notion of a graph, which consists of nodes and edges, to capture more complex relationships and structures in data. A common formalism for representing higher-order graphs is through hypergraphs, which generalize the concept of a graph to allow for hyperedges connecting multiple nodes. In a hypergraph, each hyperedge connects a subset of nodes, forming higher-order relationships among them.
References
- Higher-order organization of complex networks. Stanford University.
- Quintino Francesco Lotito, Federico Musciotto, Alberto Montresor, Federico Battiston (2022). Higher-order motif analysis in hypergraphs. Communications Physics, volume 5, Article number: 79.
- Christian Bick, Elizabeth Gross, Heather A. Harrington, and Michael T. Schaub (2023). What are higher-order networks? SIAM Review. 65(3).
- Vincent Thibeault, Antoine Allard & Patrick Desrosiers (2024). The low-rank hypothesis of complex systems. Nature Physics. 20: 294-302.
- Louis Boucherie, Benjamin F. Maier, Sune Lehmann (2024). Decomposing geographical and universal aspects of human mobility. arXiv:2405.08746.
- Raissa M. D’Souza, Mario di Bernardo & Yang-Yu Liu (2023). Controlling complex networks with complex nodes. Nature Reviews Physics. 5: 250–262.
- PS Chodrow, N Veldt, AR Benson (2021). Generative hypergraph clustering: From blockmodels to modularity. Science Advances, 28(7).
9th Mile
Eigenvalues of Directed Cycles
The graph signal processing possesses an interesting property of directed cycle (see Figure 2 in the literature). The adjacency matrix of a directed cycle has a set of unit eigenvalues as follows.
import numpy as np
## Construct an adjacency matrix A
a = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
n = a.shape[0]
A = np.zeros((n, n))
A[:, 0] = a
for i in range(1, n):
A[:, i] = np.append(a[-i :], a[: -i])
## Perform eigenvalue decomposition on A
eig_val, eig_vec = np.linalg.eig(A)
## Plot eigenvalues
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Helvetica'
fig = plt.figure(figsize = (3, 3))
ax = fig.add_subplot(1, 1, 1)
circ = plt.Circle((0, 0), radius = 1, edgecolor = 'b', facecolor = 'None', linewidth = 2)
ax.add_patch(circ)
plt.plot(eig_val.real, eig_val.imag, 'rx', markersize = 8)
ax.set_aspect('equal', adjustable = 'box')
plt.xlabel('Re')
plt.ylabel('Im')
plt.show()
fig.savefig('eigenvalues_directed_cycle.png', bbox_inches = 'tight')
8th Mile
Graph Filter
Defining graph-aware operator plays an important role for characterizing a signal with vertices over a graph . One simple idea is introducing the adjacency matrix so that the operation is . In that case, is a simple operator that accounts for the local connectivity of . One example is using the classical unit delay (seems to be time-shift) such that
The simplest signal operation as multiplication by the adjacency matrix defines graph filters as matrix polynomials of the form
For instance, we have
On the signal , it always holds that
When applying the polynomial filter to a graph signal , the operation takes a local linear combination of the signal values at one-hop neighbors. takes a local linear combination of , referring to two-hop neighbors. Consequently, a graph filter of order represents the mixing values that are at most hops away.
References
- Geert Leus, Antonio G. Marques, José M. F. Moura, Antonio Ortega, David I Shuman (2023). Graph Signal Processing: History, Development, Impact, and Outlook. arXiv:2303.12211.
- A Sandryhaila, JMF Moura (2013). Discrete signal processing on graphs: Graph filters. Section 3: Graph Filters.
- Henry Kenlay, Dorina Thanou, Xiaowen Dong (2020). On The Stability of Polynomial Spectral Graph Filters. ICASSP 2020.
- Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard (2019). Learning Graphs From Data: A signal representation perspective. IEEE Signal Processing Magazine.
- Eylem Tugçe Güneyi, Berkay Yaldız, Abdullah Canbolat, and Elif Vural (2024). Learning Graph ARMA Processes From Time-Vertex Spectra. IEEE Transactions on Signal Processing, 72: 47 - 56.
7th Mile
Graph Signals
For any graph where is a finite set of vertices, and is the set of edges. Graph signals can be formally represented as vectors where (or say in the following) stores the signal value at the th vertex in . The graph Fourier transform of is element-wise defined as follows,
or another form such that
where consists of the eigenvectors . The notation is the conjugate of complex values, and is the conjugate transpose.
The above graph Fourier transform can also be generalized to the graph signals in the form of multivariate time series. For instance, on the data , we have
where consists of the eigenvectors of the graph Laplacian matrix.
References
- Santiago Segarra, Weiyu Huang, and Alejandro Ribeiro (2020). Signal Processing on Graphs.
- Matthew Begue. Fourier analysis on graphs. Slides.
6th Mile
Graph Signal Processing
Graph signal processing not only focuses on the graph typology (e.g., connection between nodes), but also covers the quantity of nodes (i.e., graph signals) with weighted adjacency information.
References
- Antonio Ortega, Pascal Frossard, Jelena Kovacevic, Jose M. F. Moura, Pierre Vandergheynst (2017). Graph Signal Processing: Overview, Challenges and Applications. arXiv:1712.00468.
- Gonzalo Mateos, Santiago Segarra, Antonio G. Marques, Alejandro Ribeiro (2019). Connecting the Dots: Identifying Network Structure via Graph Signal Processing. IEEE Signal Processing Magazine. 36 (3): 16-43.
- Xiaowen Dong, Dorina Thanou, Laura Toni, Michael Bronstein, and Pascal Frossard (2020). Graph signal processing for machine learning: A review and new perspectives. arXiv:2007.16061. [Slides]
- Michael M. Bronstein, Joan Bruna, Taco Cohen, Petar Veličković (2021). Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv:2104.13478.
- Wei Hu, Jiahao Pang, Xianming Liu, Dong Tian, Chia-Wen Lin, Anthony Vetro (2022). Graph Signal Processing for Geometric Data and Beyond: Theory and Applications. IEEE Transactions on Multimedia, 24: 3961-3977.
- Geert Leus, Antonio G. Marques, José M. F. Moura, Antonio Ortega, David I Shuman (2023). Graph Signal Processing: History, Development, Impact, and Outlook. arXiv:2303.12211.
- Spectral graph theory for dummies. YouTube.
5th Mile
Clifford Product
In Grassmann algebra, the inner product between two vectors and (w/ basis vectors and ) is given by
implies to be the multiplication between the magnitude of and the projection of on . Here, the notation refers to the norm, or say the magnitude. is the angle between and in the plane containing them.
In contrast, the outer product (usually called Wedge product) is
implies to be the multiplication between and the projection of on the orthogonal direction of . Here, the unit bivector represents the orientation ( or ) of the hyperplane of (see Section II in geometric-algebra adaptive filters).
As a result, they consist of Clifford product (or called geometric product, denoted by the symbol ) such that
In particular, Clifford algebra is important for modeling vector fields, thus demonstrating valuable applications to wind velocity and fluid dynamics (e.g., Navier-Stokes equation).
References
- Spinors for Beginners 11: What is a Clifford Algebra? (and Geometric, Grassmann, Exterior Algebras). YouTube.
- A Swift Introduction to Geometric Algebra. YouTube.
- Learning on Graphs & Geometry. Weekly reading groups every Monday at 11 am ET.
- What’s the Clifford algebra? Mathematics stackexchange.
- Introducing CliffordLayers: Neural Network layers inspired by Clifford / Geometric Algebras. Microsoft Research AI4Science.
- David Ruhe, Jayesh K. Gupta, Steven de Keninck, Max Welling, Johannes Brandstetter (2023). Geometric Clifford Algebra Networks. arXiv:2302.06594.
- Maksim Zhdanov, David Ruhe, Maurice Weiler, Ana Lucic, Johannes Brandstetter, Patrick Forre (2024). Clifford-Steerable Convolutional Neural Networks. arXiv:2402.14730.
4th Mile
Bayesian Variable Selection
In genetic fine mapping, one critical problem is the variable selection in linear regression. There is a Bayesian variable selection based on the sum of single effects, i.e., the vector with one non-zero element. Given any data (of explanatory variables) and , one can build an optimization problem as follows,
where () is predefined by the number of correlated variables. The vectors are the coefficients in this linear regression. This optimization problem can also be written as follows,
Or see Figure 1.1 in Section 1.1 Non-negative sparse reconstruction (Page 2) for an illustration.
References
- Gao Wang, Abhishek Sarkar, Peter Carbonetto and Matthew Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5), 1273-1300.
- Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen & Danielle Posthuma (2021). Genome-wide association studies. Nature Reviews Methods Primers. 1: 59.
3rd Mile
Causal Effect Estimation/Imputation
The causal effect estimation problem is usually defined as a matrix completion on the partially observed matrix in which units and periods are involved. The observed index set is denoted by . The optimization is from the classical matrix factorization techniques for recommender systems (see Koren et al.’09):
where and are factor matrices, referring to units and periods, respectively. Here, and are bias vectors, corresponding to units and periods, respectively. This idea has also been examined on the tensor factorization (to be honest, performance gains are marginal), see e.g., Bayesian augmented tensor factorization by Chen et al.’19. In the causal effect imputation, one great challenge is how to handle the structural patterns of missing data as mentioned by Athey et al.’21. The structural missing patterns have been discussed on spatiotemporal data with autoregressive tensor factorization (for spatiotemporal predictions).
2nd Mile
Data Standardization in Healthcare
The motivation for discussing the value of standards for health datasets is the risk of algorithmic bias, consequently leading to the possible healthcare inequity. The problem arises from the systemic inequalities in the dataset curation and the unequal opportunities to access the data and research. The aim is to expolore the standards, frameworks, and best practices in health datasets. Some discrete insights throughout the whole paper are summarized as follows,
- AI as a medical device (AIaMD). One concern is the risk of systemic algorithmic bias (well-recognized in the literature) if models are trained on biased training datasets.
- Less accurate performance in certain patient groups when using the biased algorithms.
- Data diversity (Mainly discuss “how to improve”):
- Challenges: lack of standardization across attribute categories, difficulty in harmonizing several methods of data capture and data-governance restrictions.
- Inclusiveness is a core tenet of ethical AI in healthcare.
- Guidance on how to apply the principles in the curation (e.g., developing the data collection strategy), aggregation and use of health data.
- The use of metrics (measuring diversity). How to promote diversity and transparency?
- Future actions: Guidelines for data collection, handling missing data and labeling data.
References
- Anmol Arora, Joseph E. Alderman, Joanne Palmer, Shaswath Ganapathi, Elinor Laws, Melissa D. McCradden, Lauren Oakden-Rayner, Stephen R. Pfohl, Marzyeh Ghassemi, Francis McKay, Darren Treanor, Negar Rostamzadeh, Bilal Mateen, Jacqui Gath, Adewole O. Adebajo, Stephanie Kuku, Rubeta Matin, Katherine Heller, Elizabeth Sapey, Neil J. Sebire, Heather Cole-Lewis, Melanie Calvert, Alastair Denniston, Xiaoxuan Liu (2023). The value of standards for health datasets in artificial intelligence-based applications. Nature Medicine, 29: 2929–2938.
1st Mile
Large Time Series Forecasting Models
As we know, the training data in the large time series model is from different areas, this means that the model training process highly depends on the selected datasets across various areas, so one question is how to reduce the model biases if we consider the forecasting scenario as traffic flow or human mobility? Because I guess time series data in different areas should demonstrate different data behaviors. Hopefully, it is interesting to develop domain-specific time series datasets (e.g., Largest multi-city traffic dataset) and large models (e.g., TimeGPT).
References
- Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, Doyen Sahoo (2024). Unified Training of Universal Time Series Forecasting Transformers. arXiv:2402.02592.
Motivation & Principle: “不积硅步,无以至千里。” (Small Steps to Accuracy)