
Date : Monday, May 1, 2006
Speaker : Peter Bodik
Affiliation : UC Berkeley
Talk Title : Better Tools for Operators of Internet Services
Slides :

Abstract

Web applications suffer from software and configuration faults that lower their availability. The time to recover from a failure is dominated by the interval between when a fault appears and when site operators detect and fix it. We introduce a set of tools that improve operators' ability to perceive that a failure is occurring. The tools were evaluated on data from Ebates.com and Amazon.com.

The first tool combines an automatic anomaly detector, which scours HTTP access logs for changes in user behavior that indicate site failures, with a visualizer that helps operators rapidly detect and diagnose problems. Visualization addresses a key question in autonomic computing: how to win operators' confidence so that they will embrace new tools. An evaluation using HTTP logs from Ebates.com shows that these tools can improve failure detection and shorten detection time. Our approach is application-generic and can be applied to any Web application without the need for instrumentation.
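The abstract does not say how the detector decides that user behavior has changed. One simple way to make the idea concrete, assuming only that each access-log line yields a requested URL path, is to compare the mix of pages hit in a recent window against a baseline window with a chi-squared test of homogeneity, then rank pages by how much they contribute to the difference. This is a sketch under those assumptions, not the detector evaluated on the Ebates.com logs. The names `page_counts`, `chi_squared_homogeneity`, `chi2_critical` and `detect_change`, the example windows, and the `z = 3.09` cutoff are all hypothetical; the critical value uses the Wilson-Hilferty approximation so the example needs only the standard library.

```python
import math
from collections import Counter
from typing import Iterable


def page_counts(paths: Iterable[str]) -> Counter:
    """Count hits per page (query string dropped) for one window of log entries."""
    return Counter(p.split("?", 1)[0] for p in paths)


def chi_squared_homogeneity(baseline: Counter, current: Counter) -> tuple[float, dict[str, float]]:
    """Chi-squared statistic for the 2 x k table (baseline window vs current window).

    Also returns each page's contribution, so the pages whose share of traffic
    shifted most can be surfaced to an operator first.
    """
    n_base, n_cur = sum(baseline.values()), sum(current.values())
    if n_base == 0 or n_cur == 0:
        raise ValueError("both windows need at least one request")
    total = n_base + n_cur
    contributions: dict[str, float] = {}
    for page in set(baseline) | set(current):
        column = baseline[page] + current[page]
        if column <= 0:  # skip pages with no hits in either window
            continue
        stat = 0.0
        for observed, row_total in ((baseline[page], n_base), (current[page], n_cur)):
            expected = row_total * column / total  # > 0 because column > 0
            stat += (observed - expected) ** 2 / expected
        contributions[page] = stat
    return sum(contributions.values()), contributions


def chi2_critical(df: int, z: float) -> float:
    """Approximate upper chi-squared quantile via the Wilson-Hilferty transform."""
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * math.sqrt(h)) ** 3


def detect_change(baseline: Counter, current: Counter, z: float = 3.09) -> tuple[bool, float, list[str]]:
    """Flag the current window if its page mix differs from the baseline.

    z = 3.09 targets roughly a 0.1% false-alarm rate per window
    (a hypothetical setting, not one taken from the talk).
    """
    stat, contributions = chi_squared_homogeneity(baseline, current)
    ranked = sorted(contributions, key=contributions.get, reverse=True)[:3]
    df = len(contributions) - 1
    if df < 1:  # a single page says nothing about the mix of pages
        return False, stat, ranked
    return stat > chi2_critical(df, z), stat, ranked


# Hypothetical windows: checkout traffic disappears in the current window.
base = page_counts(["/home"] * 50 + ["/search"] * 30 + ["/checkout"] * 20)
now = page_counts(["/home"] * 60 + ["/search?q=tv"] * 40)
flagged, stat, suspects = detect_change(base, now)
print(flagged, round(stat, 1), suspects[0])  # True 22.3 /checkout
```

Ranking pages by their contribution is what would let a visualizer point an operator at `/checkout` instead of raising a bare alarm. A real deployment would likely also need to choose the baseline window so that normal daily and weekly traffic patterns are not flagged as failures.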

The other two tools are based on our experience with operators and resolvers at Amazon.com. The first lets operators explore the health of system components and the dependencies between them; the second monitors operators' actions and automatically suggests solutions to recurring problems.