Results 1-8 of 8
Materialization Optimizations for Feature Selection Workloads
Abstract

Cited by 6 (3 self)
There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warm-start. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
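The reuse opportunity from materializing a QR factorization can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's system: the feature matrix, target, and refit scenario are all hypothetical. The point is that once `Q` and `R` are cached, each subsequent least-squares solve costs O(d^2) instead of O(n d^2).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))   # hypothetical feature matrix (n >> d)
b = rng.standard_normal(1000)         # hypothetical target vector

# Materialize the QR factorization once ...
Q, R = np.linalg.qr(A)                # reduced QR: A = Q @ R
Qtb = Q.T @ b                         # also reusable across solves

# ... then each least-squares refit over the cached factors only needs
# a small triangular solve, e.g. when the analyst re-runs the model.
x = np.linalg.solve(R, Qtb)

# Sanity check against a from-scratch least-squares solve.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```

In an interactive workload the same cached factors serve many iterations of the loop, which is where the reuse pays off.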
Direct QR factorizations for tall-and-skinny matrices
 in MapReduce architectures, arXiv:1301.1071 [cs.DC], 2013
Abstract

Cited by 5 (1 self)
Abstract—The QR factorization and the SVD are two fundamental matrix decompositions with applications throughout scientific computing and data analysis. For matrices with many more rows than columns, so-called “tall-and-skinny matrices,” there is a numerically stable, efficient, communication-avoiding algorithm for computing the QR factorization. It has been used in traditional high-performance computing and grid computing environments. For MapReduce environments, existing methods to compute the QR decomposition use a numerically unstable approach that relies on indirectly computing the Q factor. In the best case, these methods require only two passes over the data. In this paper, we describe how to compute a stable tall-and-skinny QR factorization on a MapReduce architecture in only slightly more than two passes over the data. We can compute the SVD with only a small change and no difference in performance. We present a performance comparison between our new direct TSQR method, a standard unstable implementation for MapReduce (Cholesky QR), and the classic stable algorithm implemented for MapReduce (Householder QR). We find that our new stable method has a large performance advantage over the Householder QR method, both in a theoretical performance model and in an actual implementation. Keywords: matrix factorization, QR, SVD, TSQR, MapReduce, Hadoop
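The communication-avoiding structure of TSQR is easy to see in a serial NumPy sketch: factor each row block independently (the "map" step), then take one more QR of the stacked small R factors (the "reduce" step). This is a single-level toy version on in-memory data, not the paper's Hadoop implementation; the block count and matrix are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8000, 10))          # tall-and-skinny input

# "Map": factor each row block independently, keep only the R factors.
blocks = np.array_split(A, 4)
Rs = [np.linalg.qr(blk)[1] for blk in blocks]

# "Reduce": one QR of the stacked small R factors yields the final R.
R = np.linalg.qr(np.vstack(Rs))[1]

# R agrees with the R from a direct QR of A up to row signs, and both
# reproduce the Gram matrix A^T A.
R_direct = np.linalg.qr(A)[1]
assert np.allclose(np.abs(R), np.abs(R_direct))
assert np.allclose(R.T @ R, A.T @ A)
```

Each reduce input is only 10 x 10 regardless of the number of rows, which is why the communication cost stays small as the matrix grows taller.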
Model reduction with MapReduce-enabled tall-and-skinny singular value decomposition
 SIAM Journal on Scientific Computing
Abstract

Cited by 2 (1 self)
Abstract. We present a method for computing reduced-order models of parameterized partial differential equation solutions. The key analytical tool is the singular value expansion of the parameterized solution, which we approximate with a singular value decomposition of a parameter snapshot matrix. To evaluate the reduced-order model at a new parameter, we interpolate a subset of the right singular vectors to generate the reduced-order model’s coefficients. We employ a novel method to select this subset that uses the parameter gradient of the right singular vectors to split the terms in the expansion, yielding a mean prediction and a prediction covariance—similar to a Gaussian process approximation. The covariance serves as a confidence measure for the reduced-order model. We demonstrate the efficacy of the reduced-order model using a parameter study of heat transfer in random media. The high-fidelity simulations produce more than 4 TB of data; we compute the singular value decomposition and evaluate the reduced-order model using scalable MapReduce/Hadoop implementations. We compare the accuracy of our method with a scalar response surface on a set of temperature profile measurements and find that our model better captures sharp, local features in the parameter space.
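The core mechanism (interpolate right singular vectors across the parameter, then recombine with the left singular vectors and singular values) can be sketched with a synthetic snapshot matrix. Everything here is a stand-in: the "PDE solutions" are sine profiles, the interpolation is plain 1-D linear interpolation, and the mode count is arbitrary; the paper's gradient-based subset selection and covariance estimate are not reproduced.

```python
import numpy as np

# Hypothetical stand-in for PDE snapshots: column j of X is the
# "solution" at parameter mu[j]; x is the spatial grid.
mu = np.linspace(0.0, 1.0, 9)
x = np.linspace(0.0, 1.0, 200)
X = np.sin(np.outer(x, 1.0 + 4.0 * mu))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                   # retained modes (arbitrary)

def rom(mu_new):
    """Reduced-order prediction: interpolate each retained right
    singular vector in the parameter, then recombine with U and s."""
    v = np.array([np.interp(mu_new, mu, Vt[i]) for i in range(k)])
    return U[:, :k] @ (s[:k] * v)

# At a snapshot parameter, the interpolant passes through the data,
# so the ROM reduces to the rank-k truncation of that snapshot.
Xk = U[:, :k] @ (s[:k, None] * Vt[:k])
assert np.allclose(rom(mu[4]), Xk[:, 4])
```

Between snapshot parameters the prediction costs only a few small interpolations plus one matrix-vector product, which is the point of the reduced-order model.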
Low-rank and sparse dynamic mode decomposition
Abstract

Cited by 1 (0 self)
Even though fluid flows possess an exceedingly high number of degrees of freedom, their dynamics often can be approximated reliably by models of low complexity. This observation has given rise to the notion of coherent structures – organized fluid elements that, together with dynamic processes, are responsible for the bulk of momentum and energy transfer in the flow. Recent decades have seen great advances in the extraction of coherent structures from experiments and numerical simulations. Proper orthogonal decomposition (POD) modes (Lumley 2007; Sirovich 1987), global eigenmodes, frequential modes (Sipp et al. 2010), and balanced modes (Moore 1981; Rowley 2005) have provided useful insight into the dynamics of fluid flows. Recently, dynamic mode decomposition (DMD) (Rowley et al. 2009; Schmid 2010) has joined the group of feature extraction techniques. Both POD and DMD are snapshot-based post-processing algorithms that may be applied equally well to data from simulations or experiments. By enforcing orthogonality, POD modes possess multi-frequency temporal content; on the other hand, DMD modes are characterized by a single temporal frequency. DMD modes may potentially be non-normal, but this non-normality may be essential to capturing certain types
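A minimal sketch of the standard (exact) DMD algorithm of Schmid 2010, on fabricated data: snapshots of a known linear system, so that the eigenvalues DMD recovers (each mode's single temporal decay rate) can be checked against ground truth. The embedding, trajectory length, and rank are all toy choices, and none of the low-rank/sparse machinery of this particular paper is included.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: a linear system x_{k+1} = diag(0.9, 0.5) x_k,
# embedded in 6 dimensions via an orthonormal basis P.
P = np.linalg.qr(rng.standard_normal((6, 2)))[0]
state = rng.standard_normal(2)
states = [state]
for _ in range(20):
    state = np.array([0.9, 0.5]) * state
    states.append(state)
snaps = P @ np.array(states).T                 # 6 x 21 snapshot matrix

X1, X2 = snaps[:, :-1], snaps[:, 1:]           # time-shifted pairs

# Project the one-step map onto the leading POD modes of X1, then
# eigendecompose the small projected operator.
r = 2
U, s, Vt = np.linalg.svd(X1, full_matrices=False)
U, s, V = U[:, :r], s[:r], Vt[:r].T
A_tilde = U.T @ X2 @ V / s                     # r x r projected operator
eigvals, W = np.linalg.eig(A_tilde)
modes = (X2 @ V / s) @ W                       # DMD modes in full space

# DMD recovers the dynamics' eigenvalues from data alone.
assert np.allclose(np.sort(eigvals.real), [0.5, 0.9])
```

Each eigenvalue's modulus and phase give the growth rate and frequency of the corresponding mode, which is the single-frequency property contrasted with POD above.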
unknown title
Abstract
Abstract. We present a method for computing reduced-order models of parameterized partial differential equation solutions. The key analytical tool is the singular value expansion of the parameterized solution, which we approximate with a singular value decomposition of a parameter snapshot matrix. To evaluate the reduced-order model at a new parameter, we interpolate a subset of the right singular vectors to generate the reduced-order model’s coefficients. We employ a novel method to select this subset that uses the parameter gradient of the right singular vectors to split the terms in the expansion, yielding a mean prediction and a prediction covariance—similar to a Gaussian process approximation. The covariance serves as a confidence measure for the reduced-order model. We demonstrate the efficacy of the reduced-order model using a parameter study of heat transfer in random media. The high-fidelity simulations produce more than 4 TB of data; we compute the singular value decomposition and evaluate the reduced-order model using scalable MapReduce/Hadoop implementations. We compare the accuracy of our method with a scalar response surface on a set of temperature profile measurements and find that our model better captures sharp, local features in the parameter space.
DeepDive: A Data Management System for Automatic Knowledge Base Construction
, 2015
Abstract
ACKNOWLEDGMENTS I owe Christopher Ré my career as a researcher, the greatest dream of my life. Since the day I first met Chris and told him about my dream, he has done everything he could, as a scientist, an educator, and a friend, to help me. I am forever indebted to him for his completely honest criticisms and feedback, the most valuable gifts an advisor can give. His training equipped me with confidence and pride that I will carry for the rest of my career. He is the role model that I will follow. If my whole future career achieves an approximation of what he has done so far in his, I will be proud and happy. I am also indebted to Jude Shavlik and Miron Livny, who, after Chris left for Stanford, kindly helped me through all the paperwork and payments at Wisconsin. If it were not for their help, I would not have been able to continue my PhD studies. I am also profoundly grateful to Jude for being the chair of my committee. I am likewise grateful to Jeffrey Naughton, David Page, and Shanan Peters for serving on my committee, and to Thomas Reps for his feedback during my defense. DeepDive would not have been possible without all its users. Shanan Peters was the first user, working with it before it even got its name. He spent three years going through a painful process with us before we understood the current abstraction of DeepDive. I am grateful to him for sticking with
unknown title
, 2012
"... Randomized methods for computing low-rank approximations of matrices by ..."
DISTINGUISHING SIGNAL FROM NOISE IN AN SVD OF SIMULATION DATA
Abstract
Our goal is to predict the output of a parameterized computer simulation code given a database of outputs at different parameter values. To do so, we investigate a particular model reduction technique that interpolates the right singular vectors in the singular value decomposition of the matrix of outputs. A common observation about these singular vectors is that they become more oscillatory as the index of the singular vectors increases. We use this property to split the singular vectors into “signal” and “noise” regions. The model reduction then interpolates the “signal” and uses the “noise” to estimate the uncertainty in the result. This methodology requires a big-data approach because the simulations we study produce snapshots with hundreds or thousands of timesteps on thousands to millions of nodal values. Each simulation output is then a vector with millions to billions of values. We utilize a MapReduce-based SVD routine to compute the SVD of the snapshot matrix.
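The signal/noise split this abstract describes rests on the observation that later right singular vectors oscillate more. A toy sketch of one way to operationalize that observation is below: count sign changes in each right singular vector and split at the first one that oscillates like noise. The synthetic data, the zero-crossing proxy, and the threshold are all assumptions of this sketch, not the paper's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical snapshot matrix: a smooth mean profile plus a smooth
# parameter trend plus small noise, so the leading right singular
# vectors vary slowly across the 40 "simulations" and later ones
# oscillate like noise.
t = np.linspace(0.0, 1.0, 300)
p = np.linspace(-1.0, 1.0, 40)
X = (np.outer(np.sin(np.pi * t), np.ones(40))
     + 0.5 * np.outer(np.sin(2.0 * np.pi * t), p)
     + 0.01 * rng.standard_normal((300, 40)))

_, _, Vt = np.linalg.svd(X, full_matrices=False)

def zero_crossings(v):
    """Oscillation proxy: count sign changes along the vector."""
    return int(np.sum(np.diff(np.sign(v)) != 0))

# Split at the first right singular vector whose crossing count
# exceeds an arbitrary threshold; everything before it is "signal",
# the rest is the "noise" used for the uncertainty estimate.
counts = [zero_crossings(v) for v in Vt]
split = next((i for i, c in enumerate(counts) if c > 8), len(counts))
signal, noise = Vt[:split], Vt[split:]
```

On this toy data the smooth mean and trend modes land in `signal` while the random remainder lands in `noise`, mirroring the qualitative picture in the abstract.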