Results 1 - 10 of 50
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
"... Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability of users ’ inputs and file content d ..."
Abstract
-
Cited by 63 (8 self)
- Add to MetaCart
(Show Context)
Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production-run failures is an important but challenging task, since it is difficult to reproduce them in house for various reasons: (1) unavailability of users' inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors. Therefore, programmers often have to diagnose a production-run failure based on logs collected back from customers and the corresponding source code. Such diagnosis requires expert knowledge and is also too time-consuming and tedious for narrowing down root causes. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log's semantics. It infers both control-flow and data-value information about the failed execution. We evaluate SherLog with 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers in diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within only 40 minutes.
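To make the core idea concrete, here is a minimal, hypothetical sketch (not SherLog's actual algorithm or code) of the first step such a tool performs: mapping each run-time log line back to the printf-style logging statement in the source that could have produced it, which reveals both which statement must have executed and what values its arguments held. The statement list, file names, and log lines are invented for illustration.

```python
import re

# Hypothetical logging statements extracted from source code:
# (source location, printf-style format string). These are invented examples;
# a real tool would parse them out of the program under diagnosis.
LOG_STATEMENTS = [
    ("server.c:120", "accepted connection from %s"),
    ("server.c:245", "config value %s out of range: %d"),
    ("worker.c:88",  "request %d failed with error %s"),
]

def format_to_regex(fmt):
    """Convert a printf-style format string into a capturing regex."""
    out = []
    for part in re.split(r"(%[sd])", fmt):   # keep %s / %d as separate tokens
        if part == "%s":
            out.append("(.+?)")              # captured string argument
        elif part == "%d":
            out.append(r"(-?\d+)")           # captured integer argument
        else:
            out.append(re.escape(part))      # literal text from the format
    return re.compile("^" + "".join(out) + "$")

def explain(log_line):
    """Map a run-time log line to the statement that must have produced it,
    returning its source location and the argument values it was called with."""
    for location, fmt in LOG_STATEMENTS:
        match = format_to_regex(fmt).match(log_line)
        if match:
            return location, match.groups()
    return None, ()

if __name__ == "__main__":
    for line in ["config value timeout out of range: -5",
                 "request 42 failed with error EIO"]:
        print(explain(line))
```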
Understanding network failures in data centers: measurement, analysis, and implications
In Proc. of SIGCOMM, ACM, 2011
"... We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic and how effective is network redundancy? We ans ..."
Abstract
-
Cited by 49 (3 self)
- Add to MetaCart
(Show Context)
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause the loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
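The 40% figure invites the question of how redundancy effectiveness can be quantified at all. The sketch below is a hypothetical illustration of one such definition, loosely inspired by the abstract's comparison of traffic carried during a failure to traffic carried before it: normalized traffic is computed at the failed link and across its redundancy group, and effectiveness is the share of the link-level loss that the group masks. The event records and the exact definition are invented, not the paper's data or methodology.

```python
from statistics import median

# Invented failure events: traffic (arbitrary units) carried before and during
# each failure, measured on the failed link alone and on its redundancy group.
events = [
    {"link_before": 100.0, "link_during": 10.0, "group_before": 200.0, "group_during": 150.0},
    {"link_before": 80.0,  "link_during": 0.0,  "group_before": 160.0, "group_during": 120.0},
    {"link_before": 120.0, "link_during": 60.0, "group_before": 240.0, "group_during": 200.0},
]

def normalized(during, before):
    """Fraction of pre-failure traffic still carried during the failure."""
    return during / before if before else 1.0

link_ratio  = median(normalized(e["link_during"],  e["link_before"])  for e in events)
group_ratio = median(normalized(e["group_during"], e["group_before"]) for e in events)

# Effectiveness: what share of the traffic lost at the individual link is
# recovered when the same events are measured across the redundancy group.
effectiveness = (group_ratio - link_ratio) / (1.0 - link_ratio)
print(f"median link ratio {link_ratio:.2f}, group ratio {group_ratio:.2f}, "
      f"redundancy effectiveness {effectiveness:.0%}")
```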
Crowdsourcing service-level network event monitoring
In Proc. of ACM SIGCOMM, 2010
"... The user experience for networked applications is becoming a key benchmark for customers and network providers. Perceived user experience is largely determined by the frequency, duration and severity of network events that impact a service. While today’s networks implement sophisticated infrastructu ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
(Show Context)
The user experience for networked applications is becoming a key benchmark for customers and network providers. Perceived user experience is largely determined by the frequency, duration and severity of network events that impact a service. While today's networks implement sophisticated infrastructure that issues alarms for most failures, there remains a class of silent outages (e.g., caused by configuration errors) that are not detected. Further, existing alarms provide little information to help operators understand the impact of network events on services. Attempts to address this through infrastructure that monitors end-to-end performance for customers have been hampered by the cost of deployment and by the volume of data generated by these solutions. We present an alternative approach that pushes monitoring to applications on end systems and uses their collective view to detect network events and their impact on services, an approach we call Crowdsourcing Event Monitoring (CEM). This paper presents a general framework for CEM systems and demonstrates its effectiveness for a P2P application using a large dataset gathered from BitTorrent users and confirmed network events from two ISPs. We discuss how we designed and deployed a prototype CEM implementation as an extension to BitTorrent. This system performs online service-level network event detection through passive monitoring and correlation of performance in end-users' applications.
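A hypothetical, greatly simplified sketch of the crowdsourcing idea (not the CEM framework itself): each host flags sharp drops in its own passively observed throughput, and a network event is suspected only when enough hosts in the same network flag a drop in the same time window. The drop factor, host threshold, window size, and sample data are all assumptions for illustration.

```python
from collections import defaultdict

DROP_FACTOR = 0.5      # a host "sees a drop" if throughput falls below 50% of its recent mean
MIN_HOSTS   = 3        # how many concurrent per-host drops count as a network event

def host_drops(samples, window=5):
    """Yield sample indices where a host's throughput drops sharply vs. its recent mean."""
    for i in range(window, len(samples)):
        recent_mean = sum(samples[i - window:i]) / window
        if recent_mean > 0 and samples[i] < DROP_FACTOR * recent_mean:
            yield i

def detect_events(per_host_samples):
    """Count, per time index, how many hosts saw a drop; flag indices over threshold."""
    votes = defaultdict(int)
    for samples in per_host_samples.values():
        for i in host_drops(samples):
            votes[i] += 1
    return sorted(i for i, v in votes.items() if v >= MIN_HOSTS)

if __name__ == "__main__":
    # Three hosts in the same ISP; all slow down around time index 7.
    hosts = {
        "a": [100, 98, 102, 99, 101, 100, 97, 30, 35, 100],
        "b": [50, 52, 49, 51, 50, 48, 50, 15, 20, 49],
        "c": [200, 195, 205, 199, 201, 198, 202, 60, 70, 200],
    }
    print("suspected network events at time indices:", detect_events(hosts))
```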
WebProphet: Automating performance prediction for web services
In NSDI, 2010
"... Today, large-scale web services run on complex systems, spanning multiple data centers and content distribution networks, with performance depending on diverse factors in end systems, networks, and infrastructure servers. Web service providers have many options for improving service performance, var ..."
Abstract
-
Cited by 26 (0 self)
- Add to MetaCart
(Show Context)
Today, large-scale web services run on complex systems, spanning multiple data centers and content distribution networks, with performance depending on diverse factors in end systems, networks, and infrastructure servers. Web service providers have many options for improving service performance, varying greatly in feasibility, cost and benefit, but have few tools to predict the impact of these options. A key challenge is to precisely capture web object dependencies, as these are essential for predicting performance in an accurate and scalable manner. In this paper, we introduce WebProphet, a system that automates performance prediction for web services. WebProphet employs a novel technique based on timing perturbation to extract web object dependencies, and then uses these dependencies to predict the performance impact of changes to the handling of the objects. We have built, deployed, and evaluated the accuracy and efficiency of WebProphet. Applying WebProphet to the Search and Maps services of Google and Yahoo, we find WebProphet predicts the median and 95th percentiles of the page load time distribution with an error rate smaller than 16% in most cases. Using Yahoo Maps as an example, we find that WebProphet reduces the problem of performance optimization to a small number of web objects whose optimization would reduce the page load time by nearly 40%.
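As a rough illustration of the prediction step only (the dependency extraction via timing perturbation is the paper's contribution and is not reproduced here), once per-object dependencies and download times are known, page load time can be estimated as the latest finish time over the dependency DAG, and a what-if change to one object can be evaluated by recomputing it. The page structure and timings below are invented.

```python
from functools import lru_cache

# object -> (download time in ms, list of objects it depends on); invented page.
page = {
    "html":   (120, []),
    "css":    (80,  ["html"]),
    "js":     (200, ["html"]),
    "image":  (150, ["css"]),
    "widget": (90,  ["js", "css"]),
}

def load_time(page):
    """Estimated page load time: latest finish time over the dependency DAG."""
    @lru_cache(maxsize=None)
    def finish(obj):
        duration, deps = page[obj]
        start = max((finish(d) for d in deps), default=0)  # wait for all dependencies
        return start + duration
    return max(finish(obj) for obj in page)

baseline = load_time(page)
# What-if: predict the effect of halving the JS download time.
optimized = dict(page)
optimized["js"] = (100, page["js"][1])
print(f"baseline {baseline} ms, with faster js {load_time(optimized)} ms")
```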
NetPilot: Automating Datacenter Network Failure Mitigation
"... The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses under check. The d ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
(Show Context)
The soaring demands for always-on and fast-response online services have driven modern datacenter networks to undergo tremendous growth. These networks often rely on scale-out designs with large numbers of commodity switches to reach immense capacity while keeping capital expenses in check. The downside is that more devices mean more failures, raising a formidable challenge for network operators to promptly handle these failures with minimal disruption to the hosted services. Recent research efforts have focused on automatic failure localization. Yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery time. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. NetPilot circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. We demonstrate that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
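The abstract's trial-and-error loop can be sketched as follows. This is a hypothetical illustration, not NetPilot's planner: the impact model, safety threshold, device names, and callbacks are all invented. Candidate actions (restart or deactivate a suspect) are filtered and ordered by estimated impact, and each is tried until the failure symptom clears.

```python
MAX_ACCEPTABLE_IMPACT = 0.2   # e.g. fraction of capacity the network can afford to lose

def estimate_impact(action, topology):
    """Stand-in for an impact estimator: fraction of traffic the action would disrupt."""
    return topology.get(action["device"], {}).get("load_fraction", 1.0)

def mitigate(candidates, topology, apply_action, symptom_cleared, rollback):
    """Try low-impact candidate actions one at a time until the symptom clears."""
    safe = [a for a in candidates
            if estimate_impact(a, topology) <= MAX_ACCEPTABLE_IMPACT]
    safe.sort(key=lambda a: estimate_impact(a, topology))
    for action in safe:
        apply_action(action)
        if symptom_cleared():
            return action          # mitigation succeeded
        rollback(action)           # undo and try the next candidate
    return None                    # nothing safe worked; escalate to operators

if __name__ == "__main__":
    topo = {"lb-3": {"load_fraction": 0.05}, "agg-1": {"load_fraction": 0.5}}
    cands = [{"device": "agg-1", "op": "restart"}, {"device": "lb-3", "op": "restart"}]
    tried = []
    chosen = mitigate(cands, topo,
                      apply_action=tried.append,
                      symptom_cleared=lambda: len(tried) > 0,  # pretend the first try works
                      rollback=lambda a: tried.remove(a))
    print("applied:", chosen)
```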
Instrumenting Home Networks
"... In managing and troubleshooting home networks, one of the challenges is in knowing what is actually happening. Availability of a record of events that occurred on the home network before trouble appeared would go a long way toward addressing that challenge. In this position/work-in-progress paper, w ..."
Abstract
-
Cited by 16 (0 self)
- Add to MetaCart
(Show Context)
In managing and troubleshooting home networks, one of the challenges is knowing what is actually happening. A record of the events that occurred on the home network before trouble appeared would go a long way toward addressing that challenge. In this position/work-in-progress paper, we consider requirements for a general-purpose logging facility for home networks. Such a facility, if properly designed, would potentially have other uses. We describe several such uses and discuss requirements to be considered in the design of a logging platform that would be widely supported and accepted. We also report on our initial experience deploying such a facility.
NetClinic: Interactive Visualization to Enhance Automated Fault Diagnosis in Enterprise Networks
"... Diagnosing faults in an operational computer network is a frustrating, time-consuming exercise. Despite advances, automatic diagnostic tools are far from perfect: they occasionally miss the true culprit and are mostly only good at narrowing down the search to a few potential culprits. This uncertain ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
(Show Context)
Diagnosing faults in an operational computer network is a frustrating, time-consuming exercise. Despite advances, automatic diagnostic tools are far from perfect: they occasionally miss the true culprit and are mostly only good at narrowing down the search to a few potential culprits. This uncertainty, and the difficulty of extracting useful sense from tool output, renders most tools unusable for administrators. To bridge this gap, we present NetClinic, a visual analytics system that couples interactive visualization with an automated diagnostic tool for enterprise networks. It enables administrators to verify the output of the automatic analysis at different levels of detail and to move seamlessly across levels while retaining appropriate context. A qualitative user study shows that NetClinic users can accurately identify the culprit even when it is not present in the suggestions made by the automated component. We also find that supporting a variety of sensemaking strategies is key to the success of systems that enhance automated diagnosis.
SecureAngle: Improving Wireless Security Using Angle-of-Arrival Information
In Proceedings of the Ninth ACM SIGCOMM Workshop on Hot Topics in Networks
"... Wireless networks play an important role in our everyday lives, at the workplace and at home. However, they are also relatively vulnerable: physically located off site, attackers can circumvent wireless security protocols such as WEP, WPA, and even to some extent WPA2, presenting a secu-rity risk to ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
(Show Context)
Wireless networks play an important role in our everyday lives, at the workplace and at home. However, they are also relatively vulnerable: attackers physically located off site can circumvent wireless security protocols such as WEP, WPA, and to some extent even WPA2, presenting a security risk to the entire network. To address this problem, we propose SecureAngle, a system designed to operate alongside existing wireless security protocols, adding defense in depth. SecureAngle leverages multi-antenna APs to profile the directions at which a client's signal arrives, using this angle-of-arrival (AoA) information to construct signatures that uniquely identify each client. We identify SecureAngle's role in providing a fine-grained location service in a multi-path indoor environment. With this location information, we investigate how an AP might create a “virtual fence” that drops frames received from clients physically located outside a building or office. With SecureAngle signatures, we also identify how an AP can prevent malicious parties from spoofing the link-layer address of legitimate clients. We discuss how SecureAngle might aid whitespace radios in yielding to incumbent transmitters, as well as its role in directional downlink transmissions with uplink AoA information.
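A hypothetical sketch of only the signature-matching step (the AoA estimation itself requires multi-antenna PHY processing that is out of scope here): each client is enrolled with a per-angle power profile, and a frame is accepted only if its observed profile is close enough to the enrolled one, which is what lets an AP reject a spoofer using the same link-layer address from a different direction. The bin count, distance metric, threshold, and profiles below are all assumptions.

```python
import math

ANGLE_BINS = 36               # 10-degree bins over 0..360 degrees
MATCH_THRESHOLD = 0.15        # max allowed distance between normalized profiles

def normalize(profile):
    """Scale a per-angle power profile so its bins sum to 1."""
    total = sum(profile) or 1.0
    return [p / total for p in profile]

def distance(a, b):
    """Euclidean distance between two normalized per-angle power profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matches(enrolled_profile, observed_profile):
    """Accept the frame only if the observed profile is close to the enrolled one."""
    return distance(normalize(enrolled_profile), normalize(observed_profile)) <= MATCH_THRESHOLD

if __name__ == "__main__":
    legit = [0.0] * ANGLE_BINS
    legit[4], legit[5] = 0.7, 0.3            # signal arriving mostly around 40-50 degrees
    same_client = [0.0] * ANGLE_BINS
    same_client[4], same_client[5] = 0.65, 0.35
    spoofer = [0.0] * ANGLE_BINS
    spoofer[20] = 1.0                        # same MAC address, very different arrival angle
    print("legit accepted:", matches(legit, same_client))
    print("spoofer accepted:", matches(legit, spoofer))
```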
Q-score: Proactive Service Quality Assessment in a Large IPTV System
"... Abstract — In large-scale IPTV systems, it is essential to maintain high service quality while providing a wider variety of service features than typical traditional TV. Thus service quality assessment systems are of paramount importance as they monitor the user-perceived service quality and alert w ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
(Show Context)
In large-scale IPTV systems, it is essential to maintain high service quality while providing a wider variety of service features than traditional TV. Service quality assessment systems are therefore of paramount importance, as they monitor the user-perceived service quality and alert when issues occur. For IPTV systems, however, there is no simple metric to represent user-perceived service quality and Quality of Experience (QoE). Moreover, there is only limited user feedback, often in the form of noisy and delayed customer calls. Therefore, we aim to approximate the QoE through a selected set of performance indicators in a proactive (i.e., detect issues before customers report them to call centers) and scalable fashion. In this paper, we present a service quality assessment framework, Q-score, which accurately learns a small set of performance indicators most relevant to user-perceived service quality and proactively infers service quality as a single score. We evaluate Q-score using network data collected from a commercial IPTV service provider and show that Q-score is able to predict 60% of the service problems reported by customers with 0.1% false positives. Through Q-score, we have (i) gained insight into various types of service problems causing user dissatisfaction, including why users tend to react promptly to sound issues but slowly to video issues; (ii) identified and quantified the opportunity to proactively detect the service quality degradation of individual customers before severe performance impact occurs; and (iii) observed the possibility of allocating customer care workforce to potentially troubled service areas before issues break out.
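As a hypothetical illustration of the scoring step only (the indicator selection and learning are the paper's contribution and are not reproduced here), a handful of per-user indicators can be combined into one score with weights assumed to have been learned offline from noisy, delayed customer feedback, and users whose score crosses a threshold are flagged proactively. The indicator names, weights, and threshold below are invented.

```python
# Weights for each (already normalized, 0..1) indicator; assumed to come from an
# offline fit of customer feedback against the indicators, not from the paper.
WEIGHTS = {
    "video_packet_loss":    0.5,
    "audio_gap_rate":       0.9,   # abstract notes users react promptly to sound issues
    "stb_reboots":          0.3,
    "channel_change_delay": 0.2,
}
ALERT_THRESHOLD = 0.6

def q_score(indicators):
    """Single quality score: weighted sum of the per-user indicators we know about."""
    return sum(WEIGHTS[name] * value for name, value in indicators.items() if name in WEIGHTS)

def flag_users(per_user_indicators):
    """Return users whose score suggests a problem worth proactive attention."""
    return [user for user, ind in per_user_indicators.items()
            if q_score(ind) >= ALERT_THRESHOLD]

if __name__ == "__main__":
    users = {
        "user-1": {"video_packet_loss": 0.05, "audio_gap_rate": 0.0,
                   "stb_reboots": 0.0, "channel_change_delay": 0.1},
        "user-2": {"video_packet_loss": 0.4,  "audio_gap_rate": 0.6,
                   "stb_reboots": 0.2, "channel_change_delay": 0.3},
    }
    print("proactively flagged:", flag_users(users))
```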