Browsing by Author "Boley, Daniel"

Now showing 1 - 20 of 23

A Simple Rank Test to Distinguish Extreme Pathways from Elementary Modes in Metabolic Networks
(2008-10-20) Jevremovic, Dimitrije; Trinh, Cong T.; Srienc, Friedrich; Boley, Daniel
Background: Metabolic pathway analysis is a powerful tool to study the metabolic structure of a cellular metabolism that comprises an intricate network for transforming metabolites through enzyme-catalyzed reactions. The approach is based on convex analysis to solve a homogeneous system of linear equations and inequality constraints derived from the steady state operation of mass conservation of metabolites. The solutions constitute the admissible flux space known as the convex polyhedral cone. Elementary Mode and Extreme Pathway Analysis are two closely related techniques that have been developed to identify pathways spanning the admissible flux space. Both elementary modes and extreme pathways are genetically independent pathways that can support steady state operation of cellular metabolism. However, the set of extreme pathways is often a subset of elementary modes, and under certain conditions only extreme pathways are the generating edges of the polyhedral cone. Because the two techniques are closely related, it is important to develop a theoretical framework to distinguish extreme pathways from elementary modes. Results: We have found a simple algebraic test to distinguish extreme pathways from elementary modes which requires only the stoichiometry matrix. The method has been tested with published metabolic networks that have been characterized with Elementary Mode Analysis and Extreme Pathway Analysis. The identity and number of elementary modes are not altered in networks subjected to splitting every reversible reaction into two different irreversible reactions, other than the spurious futile cycles involving the new reactions themselves. However, the set of extreme pathways depends strongly on the specific treatment of the reversible reactions of the network. The application of this algebraic test for efficient computation of elementary modes in very large networks is discussed. Conclusions: Elementary modes are the complete set of genetically independent pathways of a cellular metabolism that supports steady state operation. With the simple algebraic test, we can easily identify whether a given pathway is an elementary mode or an extreme pathway before computing the complete set of pathways. This test provides a convenient way to analyze and interpret network topology with Metabolic Pathway Analysis. The algebraic test is also useful for improving the efficiency of computing elementary modes in very large metabolic networks.
Automated Clustering and Extraction of Distinctive Words in Legal Documents
(2001-12-03) Vaughn, Neal; Boley, Daniel
An agent is described to automatically organize and annotate a collection of text documents data set. Thisagent clusters the data collection and generates a setof distinctive word labels for each cluster of documents,all entirely autonomously. The agent is based on the useof Principal Direction Divisive Partitioning and k-means,applied to both the documents and the words, using a bag ofwords model. The agent is capable of extracting the wordsthat are most useful in distinguishing among the documents.All this processing by the agent occurs without input froma human user, except to specify the original document set.The agent is illustrated with a collection of alcohol lawsenacted in the 50 states of the U.S.
Bisecting K-means and PDDP: A Comparative Analysis
(2000-09-28) Savaresi, Sergio M.; Boley, Daniel
This paper deals with the problem of clustering a data-set. In particular, the bisecting divisive partitioning approach is here considered. We focus on two algorithms: the celebrated K-means algorithm, and the recently proposed Principal Direction Divisive Partitioning (PDDP) algorithm. A comparison of the two algorithms is given, under the assumption that the data set is uniformly distributed within an ellipsoid. In particular, the dynamic behavior of the K-means iterative procedure is studied; for the 2-dimensional case a closed-form model is given.
Building and Navigating a Network of Local Minima
(1999-10-06) Kim, Seung-Woo; Boley, Daniel
We present a novel method that constructs and navigates a network of local minima of potential fields defined over multi-dimensional spaces. Though motivated by problems of motion planning for robotic manipulators, similar techniques have been proposed for use in other domains such as molecular chemistry and drug design. The method is based on building a roadmap of paths connecting local minima of a potential function. The novel approach consists of an up-hill search strategy used to climb out of local minima and find new nearby local minima, without doubling back on previous local minima. With this up-hill search strategy, one can find local minima otherwise difficult to encounter, and one can focus the search to specific local minima and specific directions from those local minima. The construction of the roadmap can be done in parallel with very little communication. We present extensive simulation results.
Choosing the Cluster to Split in Bisecting Divisive Clustering Algorithms
(2000-10-26) Savaresi, Sergio M.; Boley, Daniel; Bittanti, Sergio; Gazzaniga, Giovanna
This paper deals with the problem of clustering a data-set. In particular, the bisecting divisive approach is here considered. This approach can be naturally divided into two sub-problems: the problem of choosing which cluster must be divided, and the problem of splitting the selected cluster. The focus here is on the first problem. The contribution of this work is to propose a new simple technique for the selection of the cluster to split. This technique is based upon the shape of the cluster. This result is presented with reference to two specific splitting algorithms: the celebrated bisecting K-means algorithm, and the recently proposed Principal Direction Divisive Partitioning (PDDP) algorithm. The problem of evaluating the quality of a partition is also discussed.
Common Component Analysis for Multiple Covariance Matrices
(2010-08-04) Wang, Huahua; Banerjee, Arindam; Boley, Daniel
We consider the problem of finding a suitable common low-dimensional subspace for accurately representing a given set of covariance matrices. When the set contains only one covariance matrix, the subspace is given by Principal Component Analysis (PCA). For multiple covariance matrices, we term the problem Common Component Analysis (CCA). While CCA can be posed as a tensor decomposition problem, standard approaches to tensor decomposition have two critical issues: (i) Tensor decomposition methods are iterative and rely on the initialization. A bad initialization may lead to poor local optima; (ii) For a given level of approximation error, one does not know how to choose a suitable low dimensionality. In this paper, we present a detailed analysis of CCA which yields an effective initialization and iterative algorithms for the problem. The proposed methodology has provable approximation guarantees w.r.t. the global optimum, and also allows one to choose the dimensionality for a given level of approximation error. We also establish conditions under which the methodology will obtain the global optimum. We illustrate the effectiveness of the proposed method through extensive experiments on synthetic data as well as two real stock market datasets, where major financial events can be visualized in low dimensions.
Commute Times for a Directed Graph using an Asymmetric Laplacian
(2010-03-08) Boley, Daniel; Ranjan, Gyan; Zhang, Zhi-Li
The expected commute times for a strongly connected directed graph are related to an asymmetric Laplacian matrix as a direct extension to similar well known formulas for undirected graphs. We show the close relationships between the asymmetric Laplacian and the so-called Fundamental matrix. We give bounds for the commute times in terms of the stationary probabilities for a random walk over the graph together with the asymmetric Laplacian and show how this can be approximated by a symmetrized Laplacian derived from a related weighted undirected graph.
Continuum of All-Pair Shortest-Path to All-Path via Random Walk
(2013-04-19) Golnari, Golshan; Boley, Daniel
A method is proposed to compute the continuum of paths, from shortest paths to all random paths between all pairs of nodes at once in a unified way. The analysis is based on treating the network as a random walk with an additional absorbing state named evaporating state reachable with nonzero probability from any state (so called "evaporating random walk"). The probability of avoiding absorption is tuned by a single parameter varying between 0 and 1, with lower values favoring shorter paths. A computational example is used to illustrate the method.
Finding Novel Multivariate Relationships in Time Series Data: Applications to Climate and Neuroscience
(2018-02-12) Agrawal, Saurabh; Steinbach, Michael; Boley, Daniel; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin; Atluri, Gowtham
In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets, and lead to novel domain insights.
Generalized Laplacians and First Transit Times for Directed Graphs
(2009-03-23) Boley, Daniel
In this paper, we extend previous results on average commute-times for undirected graphs to fully-connected directed graphs, corresponding to irreducible Markov chains. We introduce an unsymmetrized generalized Laplacian matrix and show how its pseudo-inverse directly yields the one-way first-transit times and round-trip commute times with formulas almost matching those for the undirected graph case. We show that the results are equivalent to similar formulas in terms of the Fundamental Matrix for recurrent irreducible Markov chains. We show that the unsymmetrized generalized Laplacian and its pseudo-inverse are positive semi-definite, leading to a natural embedding of the graph in Euclidean space which preserves the round-trip commute times.
Incremental PDDP for the Clustering of Large Data Sets
(2001-03-19) Littau, David; Boley, Daniel
Principal Direction Divisive Partitioning (PDDP) is an unsupervised method for partitioning data into clusters. The original method was designed to be applied to the entire data set at once, and for good performance required the entire data set be present in core memory. This paper introduces a variant of PDDP which allows a PDDP tree representing the entire data set to be built in sections. This permits the construction of PDDP trees on large data sets even with limited memory. The performance of the resulting Incremental PDDP tree is comparable to a basic PDDP tree.
Linear Convergence of ADMM on a Model Problem
(2012-03-26) Boley, Daniel
In this short report, we analyze the convergence of ADMM as a matrix recurrence for the particular case of a quadratic program or a linear program. We identify a particular combination of the vector iterates in the standard ADMM iteration that exhibits monotonic convergence. We present an analysis which indicates the convergence depends on the eigenvalues of a particular matrix operator. The theory predicts that ADMM should exhibit linear convergence when close enough to the optimal solution, but when far away can exhibit slow "constant step" convergence. This is illustrated with a convergence trace from linear program.
LQ-Schur Projection on Large Sparse Matrix Equations
(1999-12-21) Boley, Daniel; Goehring, Todd
A new paradigm for the solution of nonsymmetric large sparse systems oflinear equations is proposed. The paradigm is based on an LQfactorization of the matrix of coefficients, i.e.~factoring the matrixof coefficients into the product of a lower triangular matrix and anorthogonal matrix. We show how the system of linear equations can bedecomposed into a collection of smaller independent problems which canthen beused to construct an iterative method for a system of smallerdimensionality. We show that the conditioning of the reducedproblem cannot be worse than that of the original, unlike Schurcomplement methods in the nonsymmetric case. The paradigm depends onthe existence of an ordering of the rows representing the equationsinto blocks of rows which are mutually structurely orthogonal, exceptfor a last block row which is coupled to all other rows in a limitedway.
NORTHSTAR: A Parameter Estimation Method for the Spatial Autoregression Model
(2007-02-09) Celik, Mete; Kazar, Baris M.; Shekhar, Shashi; Boley, Daniel; Lilja, David J.
Parameter estimation method for the spatial autoregression model (SAR) is important because of the many application domains, such as regional economics, ecology, environmental management, public safety, transportation, public health, business, travel and tourism. However, it is computationally very expensive because of the need to compute the determinant of a large matrix due to Maximum Likelihood Theory. The limitation of previous studies is the need for numerous computations of the computationally expensive determinant term of the likelihood function. In this paper, we present a faster, scalable and NOvel pRediction and estimation TecHnique for the exact SpaTial Auto Regression model solution (NORTHSTAR). We provide a proof of the correctness of this algorithm by showing the objective function to be unimodular. Analytical and experimental results show that the NORTHSTAR algorithm is computationally faster than the related approaches, because it reduces the number of evaluations of the determinant term in the likelihood function.
Oral history interview with Daniel (Dan) Boley
(Charles Babbage Institute, 2024-01-30) Boley, Daniel
This interview was conducted by CBI for CS&E, a multi-year project extending from the 50th Anniversary of the University of Minnesota Computer Science Department (now Computer Science and Engineering, CS&E). The oral history begins with Boley’s early interests, undergraduate work at Cornell, and completing a doctorate at Stanford University. It explores the Computer Science Department environment in the 1980s, its administration, Boley’s teaching, and research in various areas of numerical analysis, data science, and machine learning. This includes his work, often allowing graduate students to follow their interests, in applications such as health/medicine, navigation, etc. He discusses this work with Vipin Kumar, collaborations across departments in the College of Science and Engineering, and with other colleges such as the College of Liberal Arts, and the discussions and debates, and launch of the immediately popular and fast-growing Data Science Program.
Parallelization of Nullspace Algorithm for the computation of metabolic pathways
(2010-12-14) Jevremovic, Dimitrije; Trinh, Cong T.; Srienc, Friedrich; Sosa, Carlos P.; Boley, Daniel
Elementary mode analysis is a useful metabolic pathway analysis tool in understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady state conditions. However, computation of the elementary modes of a genome-scale metabolic network with 100-1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace algorithm. In this work, we develop a distributed memory parallelization of the Nullspace algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ language with the support of MPI library functions for the parallel communication. Our proposed algorithm is accompanied with an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute-nodes and specific communication patterns to reduce the communication overhead and improve efficiency.
Principal Direction Divisive Partitioning
(1997) Boley, Daniel
We propose a new algorithm capable of partitioning a set of documents or other samples based on an embedding in a high dimensional Euclidean space (i.e. in which every document is a vector of real numbers). The method is unusual in that it is divisive, as opposed to agglomerative, and operates by repeatedly splitting clusters into smaller clusters. The splits are not based on any distance or similarity measure. The documents are assembled in to a matrix which is very sparse. It is this sparsity that permits the algorithm to be very efficient. The performance of the method is illustrated with a set of text documents obtained from the World Wide Web. Some possible extensions are proposed for further investigation.
Recognizing and Learning Unknown Emerging Concepts
(2004-05-24) Cao, Dongwei; Boley, Daniel
We study the classification of data in which some of the concepts represented by the data are known in advance, while new, emerging, concepts are discovered as they appear in the data. The resulting paradigm is called Concept Emergence. Unlike clustering, we start with some known classes, but then learn emerging concepts using new unlabeled data. Unlike Concept Drift, we assume the original concepts remain stationary. This differs from outlier detection because we use the rejected samples to update the classifier. We illustrate the method on both synthetic and real data sets using Support Vector Machine classifiers.
Temporal Sequence Prediction Using an Actively Pruned Hypothesis Space
(2004-04-14) Jensen, Steven; Boley, Daniel; Gini, Maria; Schrater, Paul
We propose a novel time/space efficient method for learning temporal sequences that operates on-line, is rapid (requiring few exemplars), and adapts easily to changes in the underlying stochastic world model. This work is motivated by humans' remarkable ability to learnspatio-temporal patterns and make short-term predictions much faster than most existing machine learning methods. Using a short-term memory of recent observations, our method maintains a dynamic space of candidate hypotheses in which the growth of the space is systematically and dynamically pruned using an entropy measure over the observed predictive quality of each candidate hypothesis. We further demonstrate the application of this method in the domain of ``matching pennies'' and ``rock-paper-scissors'' games.
Training Support Vector Machine using Adaptive Clustering
(2003-10-07) Cao, Dongwei; Boley, Daniel
Training support vector machines involves a huge optimization problem and many specially designed algorithms have been proposed. In this paper, we proposed an algorithm called ClusterSVM that accelerates the training process by exploiting the distributional properties of the training data, that is, the natural clustering of the training data and the overall layout of these clusters relative to the decision boundary of support vector machines. The proposed algorithm first partitions the training data into several pair-wise disjoint clusters. Then, the representatives of these clusters are used to train an initial support vector machine, based on which we can approximately identify the support vectors and non-support vectors. After replacing the cluster containing only non-support vectors with its representative, the number of training data can be significantly reduced, thereby speeding up the training process. The proposed ClusterSVM has been tested against the popular training algorithm SMO on both the artificial data and the real data, and a significant speedup was observed. The complexity of ClusterSVM scales with the square of the number of support vectors and, after a further improvement, it is expected that it will scale with square of the number of non-boundary support vectors.

University Digital Conservancy

Browse by Author

Browsing by Author "Boley, Daniel"