ENFrame: A Platform for Processing Probabilistic Data
"... This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as boundedrange loops, list comprehension, aggregate operations on lists, and calls to external databas ..."
Abstract

Cited by 3 (1 self)
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehensions, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame. The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees. Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.
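The probabilistic interpretation described in the abstract can be illustrated with a small sketch. This is not ENFrame's API or its event-language encoding; it is a naive possible-worlds enumeration (the baseline the paper improves upon by orders of magnitude), assuming independent tuple probabilities, showing how a program variable acquires a discrete probability distribution:

```python
import itertools

# Hypothetical illustration (not ENFrame's implementation): each uncertain
# tuple is independently present with some probability; a user program is
# evaluated in every possible world, yielding a discrete distribution over
# the program's result.

def possible_world_distribution(tuples, program):
    """tuples: list of (value, probability) pairs; program: a function from
    a list of present values to a result. Returns {result: probability}."""
    dist = {}
    for mask in itertools.product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        world = []
        for present, (value, prob) in zip(mask, tuples):
            world_prob *= prob if present else (1.0 - prob)
            if present:
                world.append(value)
        result = program(world)
        dist[result] = dist.get(result, 0.0) + world_prob
    return dist

# Example program: the count of present sensor readings.
dist = possible_world_distribution([(1, 0.5), (2, 0.8)], len)
# dist maps each possible count to its probability:
# 0 -> 0.5*0.2 = 0.1, 1 -> 0.5*0.8 + 0.5*0.2 = 0.5, 2 -> 0.5*0.8 = 0.4
```

Enumeration is exponential in the number of uncertain tuples, which is precisely why the paper's event language and exact/approximate probability-computation algorithms are needed.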
F: Regression Models over Factorized Views
"... ABSTRACT We demonstrate F, a system for building regression models over database views. At its core lies the observation that the computation and representation of materialized views, and in particular of joins, entail nontrivial redundancy that is not necessary for the efficient computation of ag ..."
Abstract
We demonstrate F, a system for building regression models over database views. At its core lies the observation that the computation and representation of materialized views, and in particular of joins, entail non-trivial redundancy that is not necessary for the efficient computation of the aggregates used to build regression models. F avoids this redundancy by factorizing data and computation, and can outperform the state-of-the-art systems MADlib, R, and Python StatsModels by orders of magnitude on real-world datasets. We illustrate how to incrementally build regression models over factorized views using both an in-memory implementation of F and its SQL encoding. We also showcase the effective use of F for model selection: F decouples the data-dependent computation step from the data-independent convergence of model parameters and performs the former only once to explore the entire model space.

WHAT IS F? F is a fast learner of regression models over training datasets defined by select-project-join-aggregate (SPJA) views. It is part of an ongoing effort to integrate databases and machine learning, including MADlib [2] and Santoku. Database joins are an unnecessarily expensive bottleneck for learning due to redundancy in their tabular representation. To alleviate this limitation, F learns models in one pass over factorized joins, where repeating data patterns are computed and represented only once. This has both theoretical and practical benefits: the computational complexity of F follows that of factorized materialized SPJA views. The first step computes the aggregates necessary for regression and the factorized view on the input database. The output of this step is a matrix of reals whose dimensions depend only on the arity of the view and are independent of the database size. This matrix contains the necessary information to compute the parameters of any model defined by a subset of the features in the view.
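The redundancy-avoidance idea behind factorized aggregates can be sketched in a few lines. This is an illustrative toy, not F's actual implementation: the aggregate SUM(a*b) over a join R(k, a) ⋈ S(k, b) can be computed without ever materializing the join, by pushing the per-key sums past the Cartesian product:

```python
from collections import defaultdict

# Illustrative sketch (not F's code): computing SUM(a*b) over the join of
# R(k, a) and S(k, b). The flat version materializes every joined pair;
# the factorized version does one pass per relation and combines per-key
# partial sums, exploiting distributivity of product over sum.

def sum_ab_flat(R, S):
    # Naive: enumerate the full join result, then aggregate.
    return sum(a * b for (k1, a) in R for (k2, b) in S if k1 == k2)

def sum_ab_factorized(R, S):
    # Factorized: per-key sums are computed once, however many join
    # partners each key has.
    sum_a, sum_b = defaultdict(float), defaultdict(float)
    for k, a in R:
        sum_a[k] += a
    for k, b in S:
        sum_b[k] += b
    return sum(sum_a[k] * sum_b[k] for k in sum_a.keys() & sum_b.keys())

R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (1, 20.0), (2, 5.0)]
assert sum_ab_flat(R, S) == sum_ab_factorized(R, S)  # both 170.0
```

If a key has m matches in R and n in S, the flat aggregate does m*n multiplications per key while the factorized one does m + n additions and a single multiplication, which is the source of the asymptotic gap the paper exploits.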
This step comes in three flavors. F's factorization and task decomposition rely on a representation of data and computation as expressions in the sum-product commutative semiring, which is subject to the law of distributivity of product over sum. Results of SPJA queries are naturally represented in this semiring, with Cartesian product as product and union as sum. The derivatives of the objective functions for least-squares, ridge, lasso, and elastic-net regression models are expressible in the sum-product semiring. Optimization methods such as gradient descent and (quasi-)Newton, which rely on first- and second-order derivatives of such objective functions, respectively, can thus be used to train any such model using F.

HOW DOES F WORK? We next explain F by means of an example for learning a least-squares regression model over a factorized join. Factorized Joins.
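The decoupling of the data-dependent step from the data-independent optimization can be sketched as follows. This is an assumed illustration, not F's API: once the aggregates Sigma = XᵀX and c = Xᵀy have been computed (e.g., in one pass over a factorized join, as above), gradient descent for least-squares touches only these small matrices and never the data again, since the gradient of ½‖Xθ − y‖² is Sigma·θ − c:

```python
# Sketch under assumed names (not F's implementation): training a
# least-squares model from precomputed aggregates only. Sigma is the d x d
# matrix X^T X and c is the d-vector X^T y; the data is never revisited.

def train_least_squares(Sigma, c, lr=0.05, iters=200):
    d = len(c)
    theta = [0.0] * d
    for _ in range(iters):
        # Gradient of 0.5 * ||X theta - y||^2 expressed via the aggregates.
        grad = [sum(Sigma[i][j] * theta[j] for j in range(d)) - c[i]
                for i in range(d)]
        theta = [theta[i] - lr * grad[i] for i in range(d)]
    return theta

# Tiny example with a single feature where y = 2*x exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
Sigma = [[sum(row[0] * row[0] for row in X)]]          # X^T X = [[14.0]]
c = [sum(X[i][0] * y[i] for i in range(len(X)))]       # X^T y = [28.0]
theta = train_least_squares(Sigma, c)                  # converges to ~2.0
```

Because Sigma and c cover all features of the view, the same aggregates can be reused to fit any model over a subset of those features, which is what makes the one-pass data step sufficient for exploring the whole model space.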