Results 1 - 10 of 143
Mapreduce online
, 2009
"... MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture tha ..."
Abstract - Cited by 181 (3 self)
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see “early returns” from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
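The key idea, pipelining map output to reducers that fold it in as it arrives, can be sketched in a few lines of Python. This is a minimal illustration of the technique, not HOP's code; all names are hypothetical.

    from collections import defaultdict

    def map_phase(records):
        # emit (key, value) pairs one at a time instead of materializing them
        for rec in records:
            for word in rec.split():
                yield (word, 1)

    def online_reduce(pairs, snapshot_every=1000):
        counts = defaultdict(int)
        for i, (key, value) in enumerate(pairs, 1):
            counts[key] += value          # fold each pair in as it arrives
            if i % snapshot_every == 0:
                yield dict(counts)        # early, approximate "online" result
        yield dict(counts)                # final, exact result

    records = ["a b a", "b c"] * 2000
    for snapshot in online_reduce(map_phase(records)):
        pass                              # each snapshot refines the previous one
    print(snapshot["a"])                  # exact count once input is exhausted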
Network coordinates in the wild
- In Proceedings of USENIX NSDI’07
, 2007
"... Network coordinates provide a mechanism for selecting and placing servers efficiently in a large distributed system. This approach works well as long as the coordinates continue to accurately reflect network topology. We conducted a long-term study of a subset of a million-plus node coordinate syste ..."
Abstract - Cited by 81 (2 self)
Network coordinates provide a mechanism for selecting and placing servers efficiently in a large distributed system. This approach works well as long as the coordinates continue to accurately reflect network topology. We conducted a long-term study of a subset of a million-plus node coordinate system and found that it exhibited some of the problems for which network coordinates are frequently criticized, for example, inaccuracy and fragility in the presence of violations of the triangle inequality. Fortunately, we show that several simple techniques remedy many of these problems. Using the Azureus BitTorrent network as our testbed, we show that live, large-scale network coordinate systems behave differently than their tame PlanetLab and simulation-based counterparts. We find higher relative errors, more triangle inequality violations, and higher churn. We present and evaluate a number of techniques that, when applied to Azureus, efficiently produce accurate and stable network coordinates.
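The systems studied are Vivaldi-style coordinates, in which each node nudges its position after every RTT measurement so that coordinate distance tracks latency. A minimal sketch of that update rule, with illustrative constants that are assumptions rather than the paper's:

    import math

    def update(coord, remote_coord, rtt_ms, delta=0.25):
        dist = math.dist(coord, remote_coord)    # latency predicted by coordinates
        error = rtt_ms - dist                    # spring displacement
        if dist > 0:
            direction = [(a - b) / dist for a, b in zip(coord, remote_coord)]
        else:
            direction = [1.0] + [0.0] * (len(coord) - 1)  # arbitrary unit vector
        # move along the spring's axis, proportionally to the prediction error
        return [c + delta * error * d for c, d in zip(coord, direction)]

    me, peer = [0.0, 0.0], [3.0, 4.0]            # predicted distance: 5 ms
    me = update(me, peer, rtt_ms=25.0)           # measured 25 ms, so move away
    print(me)                                    # roughly [-3.0, -4.0]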
Processing flows of information: from data stream to complex event processing
- ACM COMPUTING SURVEYS
, 2011
"... A large number of distributed applications requires continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems which analyze network traffic in real-time to identify possible attacks; environmental monitori ..."
Abstract - Cited by 67 (11 self)
A large number of distributed applications require continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems, which analyze network traffic in real time to identify possible attacks; environmental monitoring applications, which process raw data coming from sensor networks to identify critical situations; and applications performing online analysis of stock prices to identify trends and forecast future values. Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the timeliness requirements of such domains. Accordingly, during the last decade, different research communities have developed a number of tools, which we collectively call information flow processing (IFP) systems, to support these scenarios. They differ in their system architecture, data model, rule model, and rule language. In this article, we survey these systems to help researchers, who often come from different backgrounds, understand how the various approaches they adopt may complement each other. In particular, we propose a general, unifying model to capture the different aspects of an IFP system and use it to provide a complete and precise classification of the systems and mechanisms proposed so far.
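The defining trait of an IFP system, evaluating a standing query as data arrives rather than storing and then querying, can be shown with a toy continuous query; names and thresholds here are hypothetical:

    from collections import deque

    def sliding_average_alerts(stream, window=3, threshold=100.0):
        buf = deque(maxlen=window)      # bounded window, no persistent storage
        for reading in stream:
            buf.append(reading)
            avg = sum(buf) / len(buf)
            if avg > threshold:
                yield avg               # fire the alert the moment it holds

    readings = [90, 95, 102, 110, 120, 130]
    print(list(sliding_average_alerts(readings)))   # alerts fire mid-stream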
M-DPOP: Faithful distributed implementation of efficient social choice problems
- In AAMAS’06 - Autonomous Agents and Multiagent Systems
, 2006
"... In the efficient social choice problem, the goal is to assign values, subject to side constraints, to a set of variables to maximize the total utility across a population of agents, where each agent has private information about its utility function. In this paper we model the social choice problem ..."
Abstract - Cited by 48 (17 self)
In the efficient social choice problem, the goal is to assign values, subject to side constraints, to a set of variables to maximize the total utility across a population of agents, where each agent has private information about its utility function. In this paper we model the social choice problem as a distributed constraint optimization problem (DCOP), in which each agent can communicate with other agents that share an interest in one or more variables. Whereas existing DCOP algorithms can be easily manipulated by an agent, either by misreporting private information or deviating from the algorithm, we introduce M-DPOP, the first DCOP algorithm that provides a faithful distributed implementation for efficient social choice. This provides a concrete example of how the methods of mechanism design can be unified with those of distributed optimization. Faithfulness ensures that no agent can benefit by unilaterally deviating from any aspect of the protocol, neither information revelation, computation, nor communication, whatever the private information of other agents. We allow for payments by agents to a central bank, which is the only central authority that we require. To achieve faithfulness, we carefully integrate the Vickrey-Clarke-Groves (VCG) mechanism with the DPOP algorithm, such that each agent is only asked to perform computation, report …
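The VCG mechanism that M-DPOP integrates charges each agent the welfare loss its presence imposes on the others, computed by re-solving the problem without that agent. A centralized, brute-force illustration of the payment rule (M-DPOP computes these quantities distributively via DPOP; the toy problem below is invented):

    def best(assignments, utilities, agents):
        # exhaustive search over joint assignments, for illustration only
        return max(assignments, key=lambda a: sum(utilities[i](a) for i in agents))

    def vcg(assignments, utilities, agents):
        chosen = best(assignments, utilities, agents)
        payments = {}
        for i in agents:
            others = [j for j in agents if j != i]
            without_i = best(assignments, utilities, others)
            # others' best welfare without i, minus their welfare when i takes part
            payments[i] = (sum(utilities[j](without_i) for j in others)
                           - sum(utilities[j](chosen) for j in others))
        return chosen, payments

    # two agents choosing a single shared variable x in {0, 1}
    assignments = [{"x": 0}, {"x": 1}]
    utilities = {1: lambda a: 3 if a["x"] == 0 else 0,
                 2: lambda a: 0 if a["x"] == 0 else 2}
    print(vcg(assignments, utilities, [1, 2]))   # ({'x': 0}, {1: 2, 2: 0})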
SODA: An optimizing scheduler for large-scale stream-based distributed computer systems
, 2008
"... This paper describes the SODA scheduler for System S, a highly scalable distributed stream processing system. Unlike traditional batch applications, streaming applications are open-ended. The system cannot typically delay the processing of the data. The scheduler must be able to shift resource allo ..."
Abstract - Cited by 31 (1 self)
This paper describes the SODA scheduler for System S, a highly scalable distributed stream processing system. Unlike traditional batch applications, streaming applications are open-ended. The system cannot typically delay the processing of the data. The scheduler must be able to shift resource allocation dynamically in response to changes in resource availability, job arrivals and departures, incoming data rates, and so on. The design assumptions of System S, in particular, pose additional scheduling challenges. SODA must deal with a highly complex optimization problem, which must be solved in real time while maintaining scalability. SODA relies on a careful problem decomposition and intelligent use of both heuristic and exact algorithms. We describe the design and functionality of SODA, outline the mathematical components, and describe experiments that show the performance of the scheduler.
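As a rough illustration of the decomposition flavor (first decide which jobs to admit, then divide resources among those admitted), here is a toy two-stage allocator; the real SODA components are far richer, so everything below is an assumption for illustration:

    def admit_and_allocate(jobs, capacity):
        # jobs: (name, importance, minimum resource need)
        admitted, used = [], 0.0
        for name, importance, need in sorted(jobs, key=lambda j: -j[1]):
            if used + need <= capacity:         # admission: greedy by importance
                admitted.append((name, importance, need))
                used += need
        spare = capacity - used
        total_imp = sum(imp for _, imp, _ in admitted)
        # allocation: minimum need plus spare capacity shared by importance
        return {name: need + spare * imp / total_imp
                for name, imp, need in admitted}

    jobs = [("alerts", 10, 2.0), ("trends", 5, 3.0), ("archive", 1, 6.0)]
    print(admit_and_allocate(jobs, capacity=8.0))   # archive is shed under load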
Wishbone: Profile-based Partitioning for Sensornet Applications
"... The ability to partition sensor network application code across sensor nodes and backend servers is important for running complex, data-intensive applications on sensor platforms that have CPU, energy, and bandwidth limitations. This paper presents Wishbone, a system that takes a dataflow graph of o ..."
Abstract - Cited by 30 (5 self)
The ability to partition sensor network application code across sensor nodes and backend servers is important for running complex, data-intensive applications on sensor platforms that have CPU, energy, and bandwidth limitations. This paper presents Wishbone, a system that takes a dataflow graph of operators and produces an optimal partitioning. With Wishbone, users can run the same program on a range of sensor platforms, including TinyOS motes, smartphones running Java ME, and the iPhone. The resulting program partitioning will in general be different in each case, reflecting the different node capabilities. Wishbone uses profiling to determine how each operator in the dataflow graph will actually perform on sample data, without requiring cumbersome user annotations. Its partitioning algorithm models the problem as an integer linear program that minimizes a linear combination of network bandwidth and CPU load, and uses program structure to solve the problem efficiently in practice. Our results on a speech detection application show that the system can quickly identify good trade-offs given limitations in CPU and network capacity.
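On the special case of a linear pipeline the ILP reduces to choosing a single cut point, which makes the objective easy to see; the profiled costs below are invented for illustration:

    def best_cut(cpu_cost, edge_rate, alpha=1.0, beta=1.0):
        # cpu_cost[i]: profiled node CPU for operator i
        # edge_rate[i]: data rate on the edge entering operator i
        #               (edge_rate[n] is the final output rate)
        n = len(cpu_cost)
        def total(cut):          # operators < cut run on the node, rest on server
            return alpha * sum(cpu_cost[:cut]) + beta * edge_rate[cut]
        return min(range(n + 1), key=total)

    cpu_cost = [5.0, 20.0, 60.0]          # e.g. filter, FFT, classifier
    edge_rate = [100.0, 80.0, 8.0, 1.0]   # rates shrink as operators reduce data
    print(best_cut(cpu_cost, edge_rate, alpha=0.5))   # 2: filter+FFT on the node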
Synergy: sharing-aware component composition for distributed stream processing systems
- In: Middleware ’06: Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware. Springer-Verlag
, 2006
"... Abstract. Many emerging on-line data analysis applications require applying continuous query operations such as correlation, aggregation, and filtering to data streams in real-time. Distributed stream processing systems allow in-network stream processing to achieve better scalability and quality-of ..."
Abstract - Cited by 29 (13 self)
Many emerging on-line data analysis applications require applying continuous query operations such as correlation, aggregation, and filtering to data streams in real time. Distributed stream processing systems allow in-network stream processing to achieve better scalability and quality-of-service (QoS) provision. In this paper we present Synergy, a distributed stream processing middleware that provides sharing-aware component composition. Synergy enables efficient reuse of both data streams and processing components while composing distributed stream processing applications with QoS demands. Synergy provides a set of fully distributed algorithms to discover and evaluate the reusability of available data streams and processing components when instantiating new stream applications. For QoS provision, Synergy performs QoS impact projection to examine whether the shared processing can cause QoS violations on currently running applications. We have implemented a prototype of the Synergy middleware and evaluated its performance on both PlanetLab and simulation testbeds. The experimental results show that Synergy can achieve much better resource utilization and QoS provision than previously proposed schemes by judiciously sharing streams and processing components during application composition.
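A toy version of QoS impact projection: before composing a new application onto an already-running operator, estimate the operator's latency under the added input rate and check every affected application's bound. The single-queue latency model is an assumption made for illustration, not Synergy's:

    def projected_latency(service_rate, current_rate, extra_rate):
        load = current_rate + extra_rate
        if load >= service_rate:
            return float("inf")             # operator would saturate
        return 1.0 / (service_rate - load)  # simple queueing estimate

    def can_share(op, extra_rate):
        latency = projected_latency(op["mu"], op["rate"], extra_rate)
        return all(latency <= bound for bound in op["qos_bounds"])

    op = {"mu": 100.0, "rate": 60.0, "qos_bounds": [0.05, 0.08]}  # tuples/s, s
    print(can_share(op, extra_rate=15.0))   # True: 1/(100-75) = 0.040 s
    print(can_share(op, extra_rate=25.0))   # False: 0.067 s breaks the 0.05 s bound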
Network-Aware Overlays with Network Coordinates
- In Proceedings of the 26th IEEE International Conference on Distributed Computing Systems Workshops, ICDCSW’06
, 2006
"... ..."
(Show Context)
Agimone: Middleware Support for Seamless Integration of Sensor and IP Networks
- In DCOSS’06
, 2006
"... The scope of wireless sensor network (WSN) applications has traditionally been restricted by physical sensor coverage and limited computational power. Meanwhile, IP networks like the Internet offer tremendous connectivity and computing resources. This paper presents Agimone, a middleware layer that ..."
Abstract - Cited by 23 (5 self)
The scope of wireless sensor network (WSN) applications has traditionally been restricted by physical sensor coverage and limited computational power. Meanwhile, IP networks like the Internet offer tremendous connectivity and computing resources. This paper presents Agimone, a middleware layer that integrates sensor and IP networks as a uniform platform for flexible application deployment. This layer allows applications to be deployed on the WSN in the form of mobile agents which can autonomously discover and migrate to other WSNs, using a common IP backbone as a bridge. It facilitates data sharing between WSNs and the IP network through remote tuple space operations, allowing sensors to easily defer expensive computations to more powerful devices. We demonstrate the expressiveness of Agimone’s programming model by examining a prototype cargo-tracking application that has been deployed using this system. We also provide an empirical evaluation of Agimone using a series of benchmarks deployed on two WSNs consisting of MICA2 sensor nodes connected by an IP network. These benchmarks show that inter-network tuple space operations take 10 ms, and that one-way agent migrations between two different WSNs take approximately 83 ms.
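The coordination primitive underneath is the tuple space. A toy local version with the classic out/rd/in operations shows the flavor of the remote operations Agimone provides (an illustrative stand-in, not Agimone's API):

    class TupleSpace:
        def __init__(self):
            self.tuples = []

        def out(self, tup):                 # publish a tuple
            self.tuples.append(tup)

        def _match(self, tup, pattern):     # None in a pattern is a wildcard
            return len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup))

        def rd(self, pattern):              # read without removing
            return next((t for t in self.tuples if self._match(t, pattern)), None)

        def inp(self, pattern):             # read and remove
            t = self.rd(pattern)
            if t is not None:
                self.tuples.remove(t)
            return t

    ts = TupleSpace()
    ts.out(("temperature", "wsn-1", 23.5))         # sensor side publishes a reading
    print(ts.rd(("temperature", "wsn-1", None)))   # IP side reads it by pattern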
Cobra: Content-based filtering and aggregation of blogs and RSS feeds
- In Proc. NSDI ’07
"... Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The “Blogosphere ” is a new hotbed of Internet-based media that represents a shift from mostly s ..."
Abstract - Cited by 16 (1 self)
Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The “Blogosphere” is a new hotbed of Internet-based media that represents a shift from mostly static content to dynamic, continuously updated discussions. The problem is that finding and tracking blogs with interesting content is an extremely cumbersome process. In this paper, we present Cobra (Content-Based RSS Aggregator), a system that crawls, filters, and aggregates vast numbers of RSS feeds, delivering to each user a personalized feed based on their interests. Cobra consists of a three-tiered network of crawlers that scan web feeds, filters that match crawled articles to user subscriptions, and reflectors that provide recently matching articles on each subscription as an RSS feed, which can be browsed using a standard RSS reader. We present the design, implementation, and evaluation of Cobra in three settings: a dedicated cluster, the Emulab testbed, and PlanetLab. We present a detailed performance study of the Cobra system, demonstrating that the system scales well to support a large number of source feeds and users; that the mean update detection latency is low (bounded by the crawler rate); and that an offline service provisioning step combined with several performance optimizations is effective at reducing memory usage and network load.
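The filter tier's job, matching every crawled article against many subscriptions, is classically done with an inverted index from keyword to subscribers so each article is scanned once; a miniature, illustrative version (not Cobra's code):

    from collections import defaultdict

    class FilterNode:
        def __init__(self):
            self.index = defaultdict(set)   # keyword -> subscription ids

        def subscribe(self, sub_id, keywords):
            for kw in keywords:
                self.index[kw.lower()].add(sub_id)

        def match(self, article_text):
            hits = set()
            for word in article_text.lower().split():
                hits |= self.index.get(word, set())
            return hits                     # subscriptions to notify

    node = FilterNode()
    node.subscribe("alice", ["mapreduce", "hadoop"])
    node.subscribe("bob", ["bittorrent"])
    print(node.match("new hadoop release speeds up mapreduce jobs"))   # {'alice'}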