The MADlib Analytics Library or MAD Skills, the SQL
Proceedings of the VLDB Endowment, 2012
Abstract

Cited by 18 (2 self)
MADlib is a free, open-source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open-source nature. We provide an overview of the library's architecture and design patterns, and a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods and ports to additional database platforms.
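The parallel-aggregation design pattern at the core of such in-database libraries can be sketched in a few lines: a transition step folds each row into a state, a merge step combines partial states from different database segments, and a final step produces the statistic. This is an illustrative Python sketch of the pattern, not MADlib's actual API; the function names and the variance example are made up.

```python
from functools import reduce

def transition(state, x):
    """Fold one row into a running (count, sum, sum-of-squares) state."""
    n, s, ss = state
    return (n + 1, s + x, ss + x * x)

def merge(a, b):
    """Combine two partial states; this step is what makes the aggregate parallel."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def final(state):
    """Turn the finished state into the statistic: here, the population variance."""
    n, s, ss = state
    return ss / n - (s / n) ** 2

# Simulate two database segments scanning disjoint partitions of the data.
part1, part2 = [1.0, 2.0, 3.0], [4.0, 5.0]
s1 = reduce(transition, part1, (0, 0.0, 0.0))
s2 = reduce(transition, part2, (0, 0.0, 0.0))
print(final(merge(s1, s2)))  # variance of [1..5] -> 2.0
```

Because `merge` is associative and commutative, the engine can run `transition` on each segment independently and combine states in any order.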
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference
Abstract

Cited by 17 (1 self)
We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this approach is scalability, i.e., how to deal with terabytes of imperfect data efficiently. We describe how we address the scalability challenges to achieve web-scale KBC and the lessons we have learned from building DeepDive.
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
2012
Abstract

Cited by 7 (2 self)
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress at theoretically evaluating the difference in performance between sampling with and without replacement in such algorithms. Focusing on least-mean-squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.
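The two sampling schemes the abstract contrasts are easy to experiment with. A toy scalar least-mean-squares sketch (the data, step size, and seeds are arbitrary choices for illustration, not the paper's setup):

```python
import random

def sgd_pass(rows, w, lr, replace, rng):
    """One epoch of scalar SGD on sum_i (a_i * w - b_i)^2."""
    if replace:
        order = [rng.randrange(len(rows)) for _ in rows]      # with replacement
    else:
        order = rng.sample(range(len(rows)), len(rows))       # random permutation
    for i in order:
        a, b = rows[i]
        w -= lr * 2 * a * (a * w - b)   # gradient of (a*w - b)^2
    return w

# Data generated from b = 2*a plus tiny noise, so the solution is near w = 2.
data_rng = random.Random(0)
rows = [(a, 2 * a + data_rng.gauss(0, 0.01)) for a in (0.5, 1.0, 1.5, 2.0)]

results = {}
for replace in (True, False):
    w = 0.0
    for _ in range(50):   # re-seed per epoch so runs are deterministic
        w = sgd_pass(rows, w, lr=0.05, replace=replace, rng=random.Random(1))
    results["with" if replace else "without"] = w
print(results)            # both schemes converge near 2.0
```

On a problem this small both schemes converge; the paper's question is how their expected convergence rates compare, which this sketch does not settle.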
Sparkler: Supporting Large-Scale Matrix Factorization
In EDBT, 2013
Abstract

Cited by 5 (2 self)
Low-rank matrix factorization has recently been applied with great success to matrix completion problems for applications like recommendation systems, link prediction for social networks, and click prediction for web search. However, as this approach is applied to increasingly larger datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorization. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent as an approach to solving the factorization problem – an iterative technique that has been shown to perform very well in practice. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and solve this by introducing a novel abstraction called “Carousel Maps” (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally importantly, we show that this can be done without imposing any changes to the ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
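The computation Sparkler distributes, stochastic gradient descent on a low-rank factorization R ≈ U·Vᵀ from observed (user, item, rating) triples, can be sketched on a single machine as follows. Rank, step size, and epoch count are illustrative; this is the textbook SGD update, not Sparkler's implementation.

```python
import random

def factorize(triples, n_users, n_items, rank=2, lr=0.05, epochs=500, seed=0):
    """Plain SGD on the squared error of observed entries of R ~= U V^T."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in triples:
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = pred - r
            for k in range(rank):                 # step on both factors
                gu, gv = err * V[i][k], err * U[u][k]
                U[u][k] -= lr * gu
                V[i][k] -= lr * gv
    return U, V

# A tiny, fully observed 2x2 rating matrix.
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
U, V = factorize(triples, n_users=2, n_items=2)
recon = [[sum(U[u][k] * V[i][k] for k in range(2)) for i in range(2)]
         for u in range(2)]
```

The data-management problem the paper tackles is what happens when U and V no longer fit in one machine's memory and every SGD step must fetch and update remote factor rows.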
The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
CIDR, 2015
Abstract

Cited by 4 (1 self)
To enable complex data-intensive applications such as personalized recommendations, targeted advertising, and intelligent services, the data management community has focused heavily on the design of systems to train complex models on large datasets. Unfortunately, the design of these systems largely ignores a critical component of the overall analytics process: the serving and management of models at scale. In this work, we present Velox, a new component of the Berkeley Data Analytics Stack. Velox is a data management system for facilitating the next steps in real-world, large-scale analytics pipelines: online model management, maintenance, and serving. Velox provides end-user applications and services with a low-latency, intuitive interface to models, transforming the raw statistical models currently trained using existing offline large-scale compute frameworks into full-blown, end-to-end data products capable of targeting advertisements, recommending products, and personalizing web content. To provide up-to-date results for these complex models, Velox also facilitates lightweight online model maintenance and selection (i.e., dynamic weighting). In this paper, we describe the challenges and architectural considerations required to achieve this functionality, including the abilities to span online and offline systems, to adaptively adjust model materialization strategies, and to exploit inherent statistical properties such as model error tolerance, all while operating at “Big Data” scale.
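One way to picture the offline/online split described above: a feature representation trained offline and held fixed at serving time, plus a small per-user weight vector updated online as feedback arrives. This is a generic sketch of that pattern, not Velox's API; the feature function and all names are stand-ins.

```python
def features(item):
    """Offline-trained item representation (fixed at serving time); a stand-in."""
    return [1.0, float(item % 2), float(item % 3)]

class OnlineModel:
    def __init__(self, dim=3, lr=0.1):
        self.weights = {}          # per-user weights, maintained online
        self.dim, self.lr = dim, lr

    def predict(self, user, item):
        w = self.weights.setdefault(user, [0.0] * self.dim)
        return sum(wk * fk for wk, fk in zip(w, features(item)))

    def observe(self, user, item, rating):
        """Lightweight online maintenance: one SGD step on the squared error."""
        w = self.weights.setdefault(user, [0.0] * self.dim)
        err = self.predict(user, item) - rating
        f = features(item)
        for k in range(self.dim):
            w[k] -= self.lr * err * f[k]

m = OnlineModel()
for _ in range(100):               # user 7 consistently rates item 4 as 1.0
    m.observe(user=7, item=4, rating=1.0)
```

Because only the small per-user vector changes online, updates are cheap to serve at low latency, while the expensive feature model is periodically retrained offline.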
PF-OLA: A High-Performance Framework for Parallel Online Aggregation
 DAPD
Abstract

Cited by 4 (2 self)
Online aggregation provides estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation incurs virtually no overhead on top of the actual execution. We define a generic interface to express any estimation model that completely abstracts the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size, without incurring overhead.
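The basic mechanism behind online aggregation, a running estimate with confidence bounds that tighten as the scan proceeds, can be sketched as follows. This is the textbook sampling-based SUM estimator with a normal confidence interval, not the paper's estimator.

```python
import math
import random

def online_sum(values, z=1.96, seed=0):
    """Yield (estimate, half_width) for SUM after each scanned tuple."""
    rng = random.Random(seed)
    data = list(values)
    rng.shuffle(data)                  # random scan order acts as a random sample
    n_total, n, s, ss = len(data), 0, 0.0, 0.0
    for x in data:
        n, s, ss = n + 1, s + x, ss + x * x
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)
        estimate = n_total * mean      # scale the sample mean up to the full SUM
        half = z * n_total * math.sqrt(var / n) if n > 1 else float("inf")
        yield estimate, half

data = range(1, 1001)                  # true SUM is 500500
for i, (est, half) in enumerate(online_sum(data), 1):
    if i in (10, 100, 1000):
        print(f"after {i:4d} tuples: {est:9.1f} +/- {half:8.1f}")
```

The framework's contribution is making this kind of estimator run in parallel across segments with essentially no overhead; the statistics above are only the single-threaded core of the idea.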
A Performance Comparison of Parallel DBMSs and MapReduce on Large-Scale Text Analytics
Abstract

Cited by 3 (0 self)
Text analytics has become increasingly important with the rapid growth of text data. In particular, information extraction (IE), which extracts structured data from text, has received significant attention. Unfortunately, IE is often computationally intensive. To address this issue, MapReduce has been used for large-scale IE. Recently, there are emerging efforts from both academia and industry to push IE inside DBMSs. This leads to an interesting and important question: given that both MapReduce and parallel DBMSs are for large-scale analytics, which platform is a better choice for large-scale IE? In this paper, we propose a benchmark to systematically study the performance of both platforms for large-scale IE tasks. The benchmark includes both statistical-learning-based and rule-based IE programs, which have been extensively used in real-world IE tasks. We show how to express these programs on both platforms and conduct experiments on real-world datasets. Our results show that parallel DBMSs are a viable alternative for large-scale IE.
ENFrame: A Platform for Processing Probabilistic Data
Abstract

Cited by 3 (1 self)
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame. The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees. Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.
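The probabilistic interpretation of a user program can be illustrated by brute force: run the program in every possible world of independent uncertain tuples and aggregate a distribution over its outputs. ENFrame's event language exists precisely to avoid this exponential enumeration; the sketch below shows only the semantics, with made-up readings and a trivial program.

```python
from itertools import product

# Each uncertain reading exists independently with the given probability.
readings = [(10.0, 0.9), (11.0, 0.5), (30.0, 0.2)]

def program(values):
    """The user program: here, the mean of the present readings (0 if none)."""
    return sum(values) / len(values) if values else 0.0

dist = {}
for world in product([0, 1], repeat=len(readings)):   # all 2^n possible worlds
    p = 1.0
    present = []
    for (v, pv), bit in zip(readings, world):
        p *= pv if bit else 1 - pv                    # world probability
        if bit:
            present.append(v)
    out = program(present)
    dist[out] = dist.get(out, 0.0) + p                # distribution of the output

# dist maps each possible program output to its probability; masses sum to 1.
```

Even this three-tuple example enumerates eight worlds; the paper's contribution is computing such distributions exactly or with error guarantees without the enumeration.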
Learning Generalized Linear Models Over Normalized Data
Abstract

Cited by 2 (1 self)
Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
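The intuition behind pushing gradient computations through a key-foreign key join can be sketched for squared loss: the inner product over a joined tuple splits into an S part plus an R part, so per-R partial products are computed once and reused instead of being recomputed for every matching S tuple in a materialized join. The schemas, names, and tiny tables below are made up for illustration; this is a sketch of the idea, not the paper's system.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

# R: dimension table keyed by rid; S: fact table with (rid, own features, label).
R = {1: [1.0, 0.0], 2: [0.0, 1.0]}
S = [(1, [2.0], 1.0), (1, [3.0], 0.0), (2, [1.0], 1.0)]

def gradient_factorized(wS, wR):
    """Squared-loss gradient over the join, touching each R tuple once."""
    partial = {rid: dot(wR, xR) for rid, xR in R.items()}  # hoisted out of the scan
    gS = [0.0] * len(wS)
    gR = {rid: [0.0] * len(wR) for rid in R}
    for rid, xS, y in S:
        err = dot(wS, xS) + partial[rid] - y               # w.x over the join tuple
        for k in range(len(wS)):
            gS[k] += 2 * err * xS[k]
        for k in range(len(wR)):
            gR[rid][k] += 2 * err * R[rid][k]
    gR_total = [sum(gR[rid][k] for rid in R) for k in range(len(wR))]
    return gS, gR_total

def gradient_materialized(wS, wR):
    """Reference: materialize the join, then compute the same gradient."""
    joined = [(xS + R[rid], y) for rid, xS, y in S]
    w = wS + wR
    g = [0.0] * len(w)
    for x, y in joined:
        err = dot(w, x) - y
        for k in range(len(w)):
            g[k] += 2 * err * x[k]
    return g[:len(wS)], g[len(wS):]
```

Both functions compute the same gradient; the factorized version never widens S tuples with duplicated R features, which is where the I/O and computation savings come from when R's feature vectors are wide.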