Sunday, January 23, 2011

Any software for managing data center, incidents, SLAs?

I have been looking into using OTRS - ITSM to manage all of our services, but I wanted to know: does any software exist for this that is not built into a ticket system?

Pretty much, I want to add all services and tie each one to an SLA. Once that is done, we would manually declare incidents and the system would automatically send out notifications to the proper e-mail lists for downtime/planned maintenance.

It would be nice if it also calculated reports for SLAs, etc.

thanks

  • At first I was not going to respond, as I do not have a specific software package to recommend. Nevertheless, the lack of responses does not do this question justice.

    I tend to lean towards creating or customizing tools to serve the specific needs in question.

    Currently, I use a combination of tools. Specific to the Service Level Agreement (SLA):

    My current SLA is focused on production uptime of critical services. The three categories are critical, major, and minor. Critical is revenue impacting, major is internal production/non revenue impacting, and minor is everything else. We base the SLA reporting on critical services.

    The primary method for tracking this metric is functionality that was developed within the Web application we created for tracking system and network changes. If a change is made to anything, it is logged to this system. It's essentially a fancy MOTD that's designed to be simple, quick, and easy.

    In case of an outage, the log entry records the level of service, the length of the outage, the type of service, and finally the cause of the outage. If the cause is external, we record the outage but do not count it against our internal metric. Scheduled changes are identified and not reported against the SLA. Reports and graphs are based on this. A checkbox e-mails outage notifications to an e-mail list, which is used for notifications before and after the event.

    This is supplemented by external monitoring of availability and response times, for which I currently use Web Site Pulse as well as scripts on external servers.

    I'd seriously recommend you consider creating and/or customizing tools to meet your exact requirements. It's an incredibly useful approach. You may also find Request Tracker useful, which is something I've used for access and change control as well as a normal ticketing system. It's highly customizable, which may enable you to use it for your SLA reporting.

    coderwhiz : Well, the reason I ask is that my boss says to "not reinvent the wheel". However, I love coding in my spare time, and I think most of my ideas would be useful to a lot of people and might be a nice way to give back to the community (we're mainly an open source shop). Obviously outages would be declared manually, as not all alerts generated are actually service impacting, especially in a highly available environment. I hope this gets some buzz, as I would love to hear other opinions, but as you may have guessed I have the same feelings! Thanks
    Warner : I wouldn't necessarily consider it reinventing the wheel if a solution specific to your needs doesn't already exist. Let us know if you find anything!
    From Warner
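
For anyone who wants to try the build-your-own route, the outage log described in the answer above could be sketched roughly as follows. This is only an illustration in Python, not Warner's actual application: the names OutageEntry, counts_against_sla, and availability_percent are hypothetical, and it assumes that only critical, unscheduled, internally caused outages count against the SLA, as the answer describes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class OutageEntry:
    """One log entry for an outage (hypothetical field names)."""
    service: str
    level: str               # "critical", "major", or "minor"
    duration: timedelta      # length of the outage
    cause: str
    external: bool = False   # cause was outside our control
    scheduled: bool = False  # planned maintenance


def counts_against_sla(entry: OutageEntry) -> bool:
    # Assumption: SLA reporting covers critical services only, and
    # external or scheduled outages are recorded but not counted.
    return entry.level == "critical" and not entry.external and not entry.scheduled


def availability_percent(entries, period: timedelta) -> float:
    """Percentage of the reporting period without countable downtime."""
    downtime = sum(
        (e.duration for e in entries if counts_against_sla(e)),
        timedelta(),
    )
    return 100.0 * (1 - downtime / period)


if __name__ == "__main__":
    # Made-up entries, purely for illustration.
    log = [
        OutageEntry("checkout", "critical", timedelta(minutes=43.2), "disk failure"),
        OutageEntry("checkout", "critical", timedelta(hours=2), "upstream provider", external=True),
        OutageEntry("checkout", "critical", timedelta(hours=1), "kernel upgrade", scheduled=True),
        OutageEntry("wiki", "minor", timedelta(hours=5), "bad deploy"),
    ]
    print(f"{availability_percent(log, timedelta(days=30)):.3f}%")  # prints 99.900%
```

Only the first entry counts here, so a 30-day period with 43.2 minutes of countable downtime comes out at 99.9%. Keeping the external and scheduled flags on the entry itself means those outages stay in the log for reference while being excluded from the metric.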
