Algorithmic, Mathematical, and Statistical Foundations of Data Science and Applications

April 12-13, 2019

Abstracts

On the Theory of Gradient-Based Learning: A View from Continuous Time

Speaker: Michael Jordan

Gradient-based optimization has provided the theoretical and practical foundations on which recent developments in statistical machine learning have reposed. A complementary set of foundations is provided by Monte Carlo sampling, where gradient-based methods have also been leading the way in recent years. We explore links between gradient-based optimization algorithms and gradient-based sampling algorithms. Although these algorithms are generally studied in discrete time, we find that fundamental insights can be obtained more readily if we work in continuous time. A particularly striking finding is that there is a counterpart of Nesterov acceleration in the world of Langevin diffusion.
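
For context, the continuous-time objects linking these worlds (standard formulations, not the talk's specific results) are the gradient flow, the Nesterov acceleration ODE of Su, Boyd, and Candès, and the Langevin diffusion:

$$\dot{X}_t = -\nabla f(X_t), \qquad \ddot{X}_t + \tfrac{3}{t}\,\dot{X}_t + \nabla f(X_t) = 0, \qquad dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$

where the last diffusion has stationary density proportional to $e^{-f(x)}$, so optimizing $f$ and sampling from $e^{-f}$ are driven by the same gradient information.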

Controlling Confounding and Selection Biases in Causal Inference

Speaker: Elias Bareinboim

A fundamental challenge throughout the empirical sciences is the study of cause-and-effect relationships from a combination of non-experimental observations and substantive knowledge about the phenomenon under investigation. Causal relations are deemed more interpretable and robust than their purely statistical counterparts. Two of the most common obstacles to discovering and reasoning with causal relations are the biases introduced by confounding and by selection. In this talk, I briefly describe these two biases and a general solution for controlling both simultaneously. This approach generalizes the pervasive and well-known strategies of the backdoor criterion and inverse probability weighting (IPW). Suggested reference: https://causalai.net/r29.pdf
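
As a reminder of the special case that this approach generalizes (standard identities, not the general result of the referenced paper): if a set $Z$ satisfies the backdoor criterion relative to $(X, Y)$, the interventional distribution is identified by adjustment, which can equivalently be written as an inverse-probability-weighted expectation,

$$P(y \mid do(x)) \;=\; \sum_{z} P(y \mid x, z)\,P(z) \;=\; \mathbb{E}\!\left[\frac{\mathbf{1}\{X = x\}\,\mathbf{1}\{Y = y\}}{P(x \mid Z)}\right].$$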

Engineering drug discovery using chemical data science

Speaker: Gaurav Chopra

Current drug discovery has not matched the accelerating rate of technology development observed in many other walks of life: it is getting exponentially more expensive (costing billions of dollars) and less efficient (taking a decade or more). We have developed automated methods that accelerate several stages of the drug design pipeline, from synthesis to bioactivity. Our approach employs deep-learning-compatible molecular representations (features) that reduce the time and cost of developing new molecules and processes while increasing their efficacy (desired property), because molecules (and related processes) are designed for specific properties rather than created from empirical knowledge alone. We will briefly discuss molecular representations and computational pipelines for developing a library of synthetically feasible bioactive molecules, as well as reactivity models used to predict and validate synthetic conditions for chemical reactions. We expect that engineering drug design based on the phenotype (property) of interest will save cost and time by accelerating different aspects of the drug discovery process.

Stochastic Gradient Descent, in Theory and Practice

Speaker: Rachel Ward

Stochastic Gradient Descent (SGD) is a first-order stochastic optimization algorithm and the algorithm of choice behind the powerful deep learning architectures that are becoming increasingly pervasive in society. Still, there remains a wide gap between the settings where SGD has theoretical guarantees and the settings where SGD is most effective and useful in practice. We discuss recent theoretical results that make progress towards closing this gap. First, we present the first theoretical guarantees for "adaptive learning rate" SGD algorithms such as AdaGrad, which are widely used in practice because they make SGD less sensitive to hyper-parameters such as the step size, but which are challenging to analyze theoretically because they are non-linear stochastic dynamical systems. Second, we provide new guarantees for the convergence of SGD to global minimizers of certain non-convex optimization problems via a novel analysis of products of independent random matrices. We conclude by discussing several open problems.
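
To make the "adaptive learning rate" idea concrete, here is a minimal AdaGrad-Norm-style SGD sketch: a single step size that adapts to accumulated gradient norms. This is an illustration only, not the exact algorithm or assumptions analyzed in the talk; all names are illustrative.

    import numpy as np

    def adagrad_norm_sgd(grad_fn, x0, eta=1.0, b0=1e-2, n_steps=1000):
        """SGD with a single adaptive step size (AdaGrad-Norm style).

        grad_fn(x) returns a stochastic gradient estimate at x.
        The effective step size eta / b_k needs no manual tuning to the
        (unknown) smoothness or noise level: b_k grows with the observed
        gradient norms.
        """
        x = np.asarray(x0, dtype=float)
        b_sq = b0 ** 2
        for _ in range(n_steps):
            g = grad_fn(x)
            b_sq += np.dot(g, g)                  # accumulate squared gradient norms
            x = x - (eta / np.sqrt(b_sq)) * g
        return x

    # toy usage: noisy gradients of f(x) = 0.5 * ||x||^2
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    print(adagrad_norm_sgd(noisy_grad, x0=np.ones(5)))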

Learning to Grow Economies

Speaker: Simina Branzei

We consider a simple variant of the von Neumann model of an expanding economy, in which multiple producers make goods according to their production functions. The players trade their goods at the market and then use the bundles acquired for production in the next round. We study a simple decentralized dynamic---known as proportional response---in which players update their bids proportionally to how useful the investments were in the past round. We show that this dynamic leads to growth of the economy in the long term (whenever growth is possible) but also creates unbounded inequality, i.e., very rich and very poor players emerge. We analyze several other phenomena, such as how a player's relations with others influence its development, and the Gini index of the system. One of the key technical findings is that the players learn a global feature of the network (the optimal cycle) in a decentralized way, while interacting locally with their direct neighbors.

Joint with Ruta Mehta and Noam Nisan
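
To make the proportional-response dynamic above concrete, here is a minimal sketch in a linear Fisher-style market, a simplification of the production economy studied in the talk; the utilities, initial bids, unit supplies, and strictly positive values are illustrative assumptions.

    import numpy as np

    def proportional_response(U, bids, n_rounds=100):
        """Proportional response dynamics in a linear market (unit supply of each good).

        U[i, j]    : value of good j to player i (assumed strictly positive here).
        bids[i, j] : money player i bids on good j (row sums are the players' budgets).
        Each round, goods are allocated in proportion to bids, and each player
        re-splits its budget in proportion to the utility each good delivered.
        """
        bids = np.array(bids, dtype=float)
        for _ in range(n_rounds):
            prices = bids.sum(axis=0)                              # p_j = total money bid on good j
            alloc = bids / prices                                  # x_ij = b_ij / p_j
            utils = U * alloc                                      # utility i derives from each good
            budgets = bids.sum(axis=1, keepdims=True)
            bids = budgets * utils / utils.sum(axis=1, keepdims=True)
        return bids, bids.sum(axis=0)

    U = np.array([[1.0, 2.0], [3.0, 1.0]])
    bids0 = np.array([[0.5, 0.5], [0.5, 0.5]])
    print(proportional_response(U, bids0))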

From ADMM to Consensus Equilibrium

Speaker: Stanley Chan

Over the past decade, the alternating direction method of multipliers (ADMM) has become a workhorse for many large-scale optimization problems arising in data science, machine learning, and computational imaging. But why should ADMM be limited to solving optimization problems? In this talk, I will discuss a new concept called consensus equilibrium, which allows us to use an ADMM-type procedure to solve inverse problems involving physical forward models and deep neural networks. I will talk about the basic properties of the method and demonstrate a few practical examples.

Joint work with Greg Buzzard and Charles Bouman; see https://arxiv.org/abs/1705.08983
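
For background, the standard scaled-form ADMM iteration for $\min_x f(x) + g(x)$, with penalty parameter $\rho > 0$ and $\lambda = 1/\rho$, is

$$x^{k+1} = \operatorname{prox}_{\lambda f}\big(z^k - u^k\big), \qquad z^{k+1} = \operatorname{prox}_{\lambda g}\big(x^{k+1} + u^k\big), \qquad u^{k+1} = u^k + x^{k+1} - z^{k+1}.$$

Plug-and-play and consensus equilibrium methods keep this algorithmic structure but replace the proximal map of $g$ with a more general agent, such as a deep denoiser, and their fixed points are then characterized by equilibrium conditions rather than by a single optimization objective.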

Graph Mining at Scale: From Theory to Practice and Back

Speaker: Vahab Mirrokni

As a fundamental tool for analyzing massive data, large-scale graph mining is a central component of big data processing. More recently, graph-based learning has become a powerful tool for semi-supervised learning with large output spaces. In this talk, we discuss challenges and techniques in developing such a large-scale system from three angles: applications, algorithmic techniques, and distributed systems. For the first part, we discuss public-private computation models, and semi-supervised graph learning and clustering. For the second part, we present distributed algorithms that optimize the number of rounds of computation and the communication complexity, and show how to employ core-set and sketching techniques. For the third part, we discuss augmenting MapReduce with a distributed hash table (DHT) service, and show that we can achieve exponential speedup in the round complexity of graph algorithms in this model.

Statistical methods for prediction and anomaly detection in dynamic networks

Speaker: Jen Neville

Relational machine learning and statistical network analysis methods have been successfully applied to a wide range of complex networks, including social, physical, and information networks. However, much of this work has focused on modeling large-scale static graphs rather than network sequences whose structure changes over time. A number of key computational and statistical challenges arise in the analysis of evolving network structure, since prior assumptions of fixed population size and connectivity-prescribed conditional independence are no longer valid. At the same time, temporal information provides an additional source of information for reasoning more effectively about relational behavior and influence, and it will improve predictive performance if it is incorporated accurately into the models. In this talk, I will discuss these issues while describing our recent work on latent space point process models, subgraph pattern neural networks, and size-consistent statistics for anomaly detection.

Leveraging Big Data to Understand the Genetics of Health and Disease

Speaker: Peristera Paschou

The first human genome took billions of dollars, a huge international team of scientists and technicians, and about eight years to complete. Today, a whole human genome can be sequenced for less than $1200 and in about one day. Given the computational and technical ability to handle large amounts of data, our knowledge of human evolution, development, and disease now grows exponentially. I will discuss examples of the methods and insights that can be offered by new technologies that enable the fast and accurate scanning of the genetic makeup of hundreds of thousands of individuals, and by systems biology approaches that allow integration of data across multiple levels. Using big data in genomics research carries the promise of delivering "personalized" or "precision" medicine for disease management and, ultimately, prevention.

Computation in the Brain

Speaker: Christos Papadimitriou

How does the brain beget the mind? How do molecules, cells and synapses effect reasoning, intelligence, language, science? Despite dazzling progress in experimental neuroscience, we do not seem to be making progress on the overarching question -- the gap is huge and a completely new approach seems to be required. As Richard Axel recently put it: "We don't have a logic for the transformation of neural activity into thought."

What kind of formal system would qualify as this "logic"? I will sketch a possible answer.

(Joint work with Santosh Vempala, Dan Mitropolsky, and Mike Collins.)

Information Content in Dynamic Networks

Speaker: Wojciech Szpankowski

Shannon's information theory has served as a bedrock for advances in communication and storage systems for over seven decades. However, this theory does not handle well higher-order structures (e.g., graphs, geometric structures), temporal aspects (e.g., real-time considerations), or semantics. We argue that these are essential aspects of data and information that underlie a broad class of current and emerging data science applications. We discuss the concept of information-efficient computation as a framework in which we first establish fundamental limits on learnable information and then seek efficient algorithms that achieve those limits. We illustrate the framework by extracting temporal information (the arrival order of nodes) in a dynamic network from its structure (an unlabeled graph).
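
As a toy illustration of the arrival-order problem, the sketch below ranks nodes of a preferential-attachment graph by degree, exploiting the fact that older nodes tend to accumulate more edges. This is a naive baseline rather than the talk's information-theoretic approach; networkx and scipy are assumed to be available.

    import networkx as nx
    from scipy.stats import kendalltau

    # Grow a preferential-attachment graph; node labels 0..n-1 are the true arrival order.
    n = 500
    G = nx.barabasi_albert_graph(n, m=3, seed=0)

    # Naive estimate: rank nodes by degree (descending) as a guess at arrival order.
    est_order = sorted(G.nodes(), key=lambda v: G.degree(v), reverse=True)
    est_rank = {v: r for r, v in enumerate(est_order)}

    # Compare estimated ranks with the true arrival order.
    tau, _ = kendalltau([est_rank[v] for v in range(n)], list(range(n)))
    print(f"Kendall tau between degree ranking and true arrival order: {tau:.2f}")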

Brain Connectomics: From Maximizing Subject Identifiability to Disentangling Heritable and Environmental Traits

Speaker: Joaquin Goni

In the 17th century, physician Marcello Malpighi observed the existence of patterns of ridges and sweat glands on fingertips. This was a major breakthrough and originated a long and continuing quest for ways to uniquely identify individuals based on fingerprints. In the modern era, the concept of fingerprinting has expanded to other sources of data, such as voice recognition and retinal scans. Only in the last few years have technologies and methodologies achieved high-quality data for individual human brain imaging and the subsequent estimation of structural and functional connectivity. In this context, the next challenge for human identifiability lies in brain data, particularly in brain networks, both structural and functional.

Here I present how the individual fingerprint of a human structural or functional connectome (as represented by a network) can be maximized through a reconstruction procedure based on group-wise decomposition into a finite number of orthogonal brain connectivity modes. Using data from the Human Connectome Project and from a local cohort, I also introduce extensions of this work, including an extended version of the framework for inter-scanner identifiability and an extended version for disentangling heritable and environmental brain network traits.
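
A minimal sketch of the kind of group-wise decomposition and identifiability score involved is given below; the data shapes and names are hypothetical, PCA stands in for the orthogonal connectivity modes, and in practice the number of modes would be chosen to maximize the test-retest identifiability score.

    import numpy as np
    from sklearn.decomposition import PCA

    def reconstruct(test, retest, n_modes):
        """Project test/retest connectomes (n_subjects x n_edges arrays of
        vectorized connectivity matrices) onto n_modes group-wise orthogonal
        components and reconstruct them."""
        data = np.vstack([test, retest])
        pca = PCA(n_components=n_modes).fit(data)
        recon = pca.inverse_transform(pca.transform(data))
        return np.split(recon, 2)

    def identifiability(test, retest):
        """Mean within-subject minus mean between-subject correlation
        of test vs. retest connectomes (an Idiff-style score)."""
        n = len(test)
        C = np.corrcoef(test, retest)[:n, n:]
        return np.mean(np.diag(C)) - np.mean(C[~np.eye(n, dtype=bool)])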

Stochastic Optimization for Large-Scale Tensor Decomposition

Speaker: Tammy Kolda

Tensor decomposition is a fundamental unsupervised machine learning method in data science, with applications including network analysis and sensor data processing. Recently, the generalized canonical polyadic (GCP) low-rank tensor decomposition has extended tensor decomposition to loss functions besides squared error. For instance, GCP can use logistic loss or Kullback-Leibler divergence, enabling tensor decomposition for binary or count data. However, GCP tensor decomposition can be prohibitively expensive for large-scale and/or sparse tensors. In this talk, we conceptualize a stochastic optimization framework for efficient computation of the GCP decomposition. Our focus is on computing a stochastic gradient, including adaptations for stratified sampling. In particular, we motivate and show how to independently sample zeros and nonzeros for sparse tensors, in contrast to recommender system scenarios that sample only nonzeros. Our framework is amenable to missing data and distributed parallelism. We consider pragmatic questions such as how many samples to use for the stochastic gradient, and we apply it to real-world examples. This is joint work with David Hong (Michigan) and Jed Duersch (Sandia).
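
A minimal sketch of the stratified zero/nonzero sampling idea for a sparse tensor stored in coordinate format is shown below; the GCP loss, gradient weights, and distributed details are omitted, and all names are illustrative.

    import numpy as np

    def sample_strata(nz_idx, shape, n_nonzeros, n_zeros, rng):
        """Sample tensor entries in two strata for a stochastic gradient.

        nz_idx : (nnz, ndim) integer array of stored nonzero coordinates.
        Returns sampled nonzero coordinates and sampled zero coordinates;
        zero candidates are drawn uniformly at random over the full index
        space and rejected if they happen to hit a stored nonzero.
        """
        # Stratum 1: uniform sample of stored nonzeros.
        rows = rng.integers(0, len(nz_idx), size=n_nonzeros)
        nonzeros = nz_idx[rows]

        # Stratum 2: uniform random coordinates, rejecting nonzero hits.
        nz_set = {tuple(i) for i in nz_idx}
        zeros = []
        while len(zeros) < n_zeros:
            cand = tuple(rng.integers(0, d) for d in shape)
            if cand not in nz_set:
                zeros.append(cand)
        return nonzeros, np.array(zeros)

    rng = np.random.default_rng(0)
    nz = rng.integers(0, 50, size=(100, 3))          # toy 50x50x50 tensor with 100 nonzeros
    nz_samp, z_samp = sample_strata(nz, (50, 50, 50), 10, 10, rng)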

Data Science Case Studies at Purdue and Sparse Bayesian Deep Learning

Speaker: Guang Lin

In this talk, I will first highlight data science case studies at Purdue, including designing optimal control strategies for Ebola outbreaks, improving the quality of Chrysler crossmember castings, and robust data-driven discovery of physical laws. I will then introduce a novel sparse Bayesian deep learning algorithm we recently developed: an adaptive empirical Bayes method for sparse deep learning. The proposed method works by alternately sampling from an adaptive posterior distribution using stochastic gradient MCMC and optimizing the hyperparameters using stochastic approximation. The convergence of the proposed method to the target posterior is established under mild conditions. Empirical applications of the proposed method lead to state-of-the-art performance on MNIST with shallow convolutional neural networks and state-of-the-art compression performance on CIFAR10 with residual networks.
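
For reference, here is a generic stochastic gradient Langevin dynamics (SGLD) step, the basic stochastic gradient MCMC building block; the adaptive empirical Bayes updates of the hyperparameters in the proposed method are not shown, and the function names are illustrative.

    import numpy as np

    def sgld_step(theta, grad_log_prior, grad_log_lik_batch, n_total, n_batch, eps, rng):
        """One SGLD update for parameters theta.

        Uses a minibatch estimate of the log-likelihood gradient (rescaled
        by n_total / n_batch) plus the log-prior gradient, and injects
        Gaussian noise with variance eps so that, for small eps and many
        steps, the iterates approximately sample the posterior.
        """
        g = grad_log_prior(theta) + (n_total / n_batch) * grad_log_lik_batch(theta)
        noise = rng.standard_normal(theta.shape)
        return theta + 0.5 * eps * g + np.sqrt(eps) * noise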

Goodness of Fit Testing for Dynamic Network Models

Speaker: Abram Magner

Networks in the real world change over time, in the sense that nodes and edges enter and leave them. Various dynamic random graph models have been proposed to explain the macroscopic properties of these systems and to provide a foundation for statistical inferences and predictions. It is of interest to have a rigorous way to determine how well these models match observed networks. We thus ask the following goodness of fit question: given a sequence of observations/snapshots of a growing random graph, along with a candidate model $M$, can we determine whether the snapshots came from $M$ or from some arbitrary alternative model that is well-separated from $M$ in some natural metric? We formulate this problem precisely and boil it down to goodness of fit testing for graph-valued, infinite-state Markov processes and exhibit and analyze a test based on non-stationary sampling for a natural class of models.
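
Schematically (a restatement of the question above, with $d$ a generic metric on models and $\epsilon > 0$ a separation parameter), the test is between

$$H_0:\ (G_{t_1}, \dots, G_{t_k}) \sim M \qquad \text{vs.} \qquad H_1:\ (G_{t_1}, \dots, G_{t_k}) \sim M' \ \text{ for some } M' \text{ with } d(M', M) \ge \epsilon,$$

and one seeks a test whose error probabilities vanish as the observed snapshots grow.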

Sparse Matrices in Sparse Analysis

Speaker: Anna Gilbert

In this talk, I will give two vignettes on the theme of sparse matrices in sparse analysis. The first vignette covers work from compressive sensing, in which we want to design sparse matrices (i.e., matrices with few non-zero entries) that we use to (linearly) sense or measure compressible signals. We also design algorithms such that, from these measurements and these matrices, we can efficiently recover a compressed, or sparse, representation of the sensed data. I will discuss the role of expander graphs and error-correcting codes in these designs, along with applications to high-throughput biological screens. The second vignette flips the theme: suppose we are given a distance or similarity matrix for a data set that has been corrupted in some fashion; the goal is to find a sparse correction or repair to the distance matrix that ensures the corrected distances come from a metric, i.e., to repair as few entries as possible so that we have a metric. I will discuss generalizations to graph metrics, applications to (and from) metric embeddings, and algorithms for variations of this problem. I will also touch upon applications in machine learning and bioinformatics.
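
One natural formulation of the repair problem in the second vignette is the following sparse-correction program (a schematic version; the talk's precise variants, e.g. for graph metrics, may differ):

$$\min_{S}\ \|S\|_0 \quad \text{subject to} \quad (D+S)_{ij} \le (D+S)_{ik} + (D+S)_{kj}\ \ \forall\, i, j, k,$$

with $D + S$ symmetric and nonnegative, i.e., change as few entries of $D$ as possible so that the corrected distances satisfy the triangle inequality.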

Stein Goodness-of-fit Tests for Discrete and Point Process Data

Speaker: Vinayak Rao

Recent work has combined Stein's method with reproducing kernel Hilbert space theory to develop nonparametric goodness-of-fit tests for unnormalized probability distributions. However, the currently available tests apply exclusively to distributions with smooth density functions. Here I will describe recent work on kernelized Stein discrepancy measures for discrete spaces as well as infinite-dimensional point pattern spaces. This allows nonparametric goodness-of-fit tests for intractable distributions in such settings and provides a recipe for constructing new Stein operators. I will describe applications to a number of synthetic and real datasets, demonstrating the usefulness of the proposed approach. This is joint work with Jiasen Yang and Jennifer Neville.
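
For orientation, here is the smooth-density case that existing tests cover (the standard kernelized Stein discrepancy; the contribution described in the talk is the extension of the Stein operator to discrete and point-process spaces). For a density $p$ with score $s_p(x) = \nabla_x \log p(x)$ and a kernel $k$,

$$\mathrm{KSD}^2(q \,\|\, p) = \mathbb{E}_{x, x' \sim q}\,[u_p(x, x')], \quad u_p(x, x') = s_p(x)^\top k(x, x')\, s_p(x') + s_p(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_p(x') + \operatorname{tr}\!\big(\nabla_x \nabla_{x'} k(x, x')\big),$$

which can be estimated from samples of $q$ without the normalizing constant of $p$ and, under mild conditions, vanishes if and only if $q = p$.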

Learning From Small Data by Exploiting Physics

Speaker: Ilias Bilionis

In most engineering applications we do not have big data. Because of the cost of data collection, we have just one or two examples from which we have to say something about a physical system. Classic supervised learning techniques fail spectacularly in this regime because they rely on generic loss functions, e.g., squared loss, which overfit without adequate regularization. In this short talk, I show how knowledge of the underlying physics, in the form of the system's Lagrangian, can be exploited to construct loss functions that are physics-informed. I argue that the resulting loss functions are good in the sense that they either have a unique minimum (in a suitable Hilbert space) or all local minima represent true physical possibilities. In other words, the physics acts as a powerful regularizer that eliminates bad solutions. I demonstrate the framework with two examples (a sketch of such a physics-informed loss follows the list):

  1. The propagation of 100-dimensional uncertainty through an elliptic partial differential equation (zero data);
  2. The reconstruction of the velocity and pressure fields of a fluid from images (two data points).
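
As an illustration of such a physics-informed loss (a Deep-Ritz-style energy functional for an elliptic problem $-\nabla \cdot (a \nabla u) = f$ on $\Omega$, offered as a representative example rather than the exact functional used in the talk): parametrize the solution $u_\theta$ by a neural network, enforce the boundary conditions by construction or by penalty, and minimize the Monte Carlo estimate of

$$\mathcal{L}(\theta) = \int_\Omega \Big( \tfrac{1}{2}\, a(x)\, \|\nabla u_\theta(x)\|^2 - f(x)\, u_\theta(x) \Big)\, dx \;\approx\; \frac{|\Omega|}{N} \sum_{i=1}^{N} \Big( \tfrac{1}{2}\, a(x_i)\, \|\nabla u_\theta(x_i)\|^2 - f(x_i)\, u_\theta(x_i) \Big), \qquad x_i \sim \mathrm{Unif}(\Omega).$$

The minimizer of the continuous functional is the weak solution of the PDE, so the physics itself supplies the training signal and no labeled data are required.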