A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence
 arXiv.org, 2008
Group Anomaly Detection using Flexible Genre Models
Abstract

Cited by 7 (2 self)
An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies. We evaluate the effectiveness of FGM on both synthetic and real data sets including images and turbulence data, and show that it is superior to existing approaches in detecting group anomalies.
An efficient multi-tier tablet server storage architecture
 In 2nd ACM Symposium on Cloud Computing. ACM, 2011
Abstract

Cited by 5 (3 self)
Distributed, structured data stores such as Big Table, HBase, and Cassandra use a cluster of machines, each running a database-like software system called the Tablet Server Storage Layer or TSSL. A TSSL’s performance on each node directly impacts the performance of the entire cluster. In this paper we introduce an efficient, scalable, multi-tier storage architecture for tablet servers. Our system can use any layered mix of storage devices such as Flash SSDs and magnetic disks. Our experiments show that by using a mix of technologies, performance for certain workloads can be improved beyond configurations using strictly two-tier approaches with one type of storage technology. We utilized, adapted, and integrated cache-oblivious algorithms and data structures, as well as Bloom filters, to improve scalability significantly. We also support versatile, efficient transactional semantics. We analyzed and evaluated our system against the storage layers of Cassandra and Hadoop HBase. We used a wide range of workloads and configurations from read- to write-optimized, as well as different input sizes. We found that our system is 3–10× faster than existing systems; that using proper data structures, algorithms, and techniques is critical for scalability, especially on modern Flash SSDs; and that one can fully support versatile transactions without sacrificing performance.
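The abstract above names Bloom filters as one ingredient but does not say how the system uses them. As a rough illustration only, the sketch below shows a generic Bloom filter of the kind a storage layer might consult before reading an on-disk file, so that lookups for keys that are definitely absent can skip the read. The class name, the parameters `num_bits` and `num_hashes`, and the SHA-256 based hashing are all assumptions made for this example, not details from the paper.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: each key sets num_hashes bits in a fixed bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both sizes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per probe by salting SHA-256 with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True means it may have been.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A filter answers "definitely not present" or "possibly present", so false positives cost an unnecessary read but never a missed key.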
Support Distribution Machines
Abstract

Cited by 3 (0 self)
Most machine learning algorithms, such as classification or regression, treat the individual data point as the object of interest. Here we consider extending machine learning algorithms to operate on groups of data points. We suggest treating a group of data points as a set of i.i.d. samples from an underlying feature distribution for the group. Our approach is to generalize kernel machines from vectorial inputs to i.i.d. sample sets of vectors. For this purpose, we use a nonparametric estimator that can consistently estimate the inner product and certain kernel functions of two distributions. The projection of the estimated Gram matrix to the cone of semidefinite matrices enables us to employ the kernel trick, and hence use kernel machines for classification, regression, anomaly detection, and low-dimensional embedding in the space of distributions. We present several numerical experiments both on real and simulated datasets to demonstrate the advantages of our new approach.
Organization of Data in Non-Convex Spatial Domains
Abstract

Cited by 3 (0 self)
We present a technique for organizing data in spatial databases with non-convex domains based on an automatic characterization using the medial-axis transform (MAT). We define a tree based on the MAT and enumerate its branches to partition space and define a linear order on the partitions. This ordering clusters data in a manner that respects the complex shape of the domain. The ordering has the property that all data down any branch of the medial axis, regardless of the geometry of the subregion, are contiguous on disk. Using this data organization technique, we build a system to provide efficient data discovery and analysis of the observational and model data sets of the Chesapeake Bay Environmental Observatory (CBEO). On typical CBEO workloads in which scientists query contiguous substructures of the Chesapeake Bay, we improve query processing performance by a factor of two when compared with orderings derived from space-filling curves.
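The abstract above compares its medial-axis ordering against orderings derived from space-filling curves without naming a particular curve. For readers unfamiliar with that baseline, the sketch below shows one common example, the Z-order (Morton) curve, which linearizes 2D grid cells by interleaving the bits of their coordinates. The choice of Z-order, the function names, and the 16-bit coordinate width are assumptions for illustration; the paper's baseline may use a different curve.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton (Z-order) code of non-negative integer cell coordinates.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_order(cells):
    """Sort (x, y) grid cells into the linear order in which they would be stored."""
    return sorted(cells, key=lambda c: interleave_bits(c[0], c[1]))
```

Such a curve depends only on coordinates, not on the shape of the domain, which is the limitation the medial-axis ordering described above is meant to address.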
The Saaz Framework for Turbulent Flow Queries
 In Proceedings of the 2011 IEEE conference on eScience. IEEE, 2011
Abstract

Cited by 2 (1 self)
In many respects, numerical simulations involving solutions to partial differential equations have replaced physical experimentation. However, few tools are available to sift through the deluge of data. We present Saaz, a query framework to analyze the simulation results of multiscale physical phenomena which admit mathematical rules for characterizing features of interest. Saaz provides high-level primitives that free the domain scientist to concentrate more on scientific discovery and less on code implementation and maintenance. It supports user-defined domain-specific query operations which may be subsequently composed into more complex queries. While Saaz supports offline processing of queries, we explore here the online capabilities by attaching Saaz to a running simulation, improving the simulation’s effective temporal resolution. We discuss analysis for a computational fluid dynamics simulation of turbulent flow running on a cluster.
Kernels on Sample Sets via Nonparametric Divergence Estimates
Abstract

Cited by 2 (2 self)
Most machine learning algorithms, such as classification or regression, treat the individual data point as the object of interest. Here we consider extending machine learning algorithms to operate on groups of data points. We suggest treating a group of data points as an i.i.d. sample set from an underlying feature distribution for that group. Our approach employs kernel machines with a kernel on i.i.d. sample sets of vectors. We define certain kernel functions on pairs of distributions, and then use a nonparametric estimator to consistently estimate those functions based on sample sets. The projection of the estimated Gram matrix to the cone of symmetric positive semidefinite matrices enables us to use kernel machines for classification, regression, anomaly detection, and low-dimensional embedding in the space of distributions. We present several numerical experiments both on real and simulated datasets to demonstrate the advantages of our new approach.
Active Pointillistic Pattern Search
Abstract
We introduce the problem of active pointillistic pattern search (APPS), which seeks to discover regions of a domain exhibiting desired behavior with limited observations. Unusually, the patterns we consider are defined by large-scale properties of an underlying function that we can only observe at a limited number of points. Given a description of the desired patterns (in the form of a classifier taking functional inputs), we sequentially decide where to query function values to identify as many regions matching the pattern as possible, with high confidence. For one broad class of models the expected reward of each unobserved point can be computed analytically. We demonstrate the proposed algorithm on three difficult search problems: locating polluted regions in a lake via mobile sensors, forecasting winning electoral districts with minimal polling, and identifying vortices in a fluid flow simulation.
Active Pointillistic Pattern Search
Abstract
We introduce the problem of active pointillistic pattern search (APPS), which seeks to discover regions of a domain exhibiting desired behavior with limited observations. Unusually, the patterns we consider are defined by large-scale properties of an underlying function that we can only observe at a limited number of points. Given a description of the desired patterns (in the form of a classifier taking functional inputs), we sequentially decide where to query function values to identify as many regions matching the pattern as possible. We demonstrate the proposed algorithm by identifying vortices in a fluid flow simulation.
GrayWulf: Scalable Clustered Architecture for Data Intensive Computing
 In Proceedings of the 42nd Hawaii International Conference on System Sciences, 2009
Abstract
Data intensive computing presents a significant challenge for traditional supercomputing architectures that maximize FLOPS since CPU speed has surpassed IO capabilities of HPC systems and BeoWulf clusters. We present the architecture for a three tier commodity component cluster designed for a range of data intensive computations operating on petascale data sets named GrayWulf. The design goal is a balanced system in terms of IO performance and memory size, according to Amdahl’s Laws. The hardware currently installed at JHU exceeds one petabyte of storage and has 0.5 bytes/sec of I/O and 1 byte of memory for each CPU cycle. The GrayWulf provides almost an order of magnitude better balance