Ideas, thoughts, and resources from the permanently curious.
 
 
 

Lessons learned about risk management (3/24/05)

Again we find ourselves in the midst of a Squirrel, Inc. story, this time at the Division of Financial Management, where some Risk Management concerns have arisen. SNSI is a regional stock index of Squirrel Nut stocks, a massive zillion-acorns-a-year market.

Hello,

I've already got some experience with Risk Management due to SNSI and a few other things, so I figured I'd compose a lessons learned document. Please let me know if it is helpful.

With the old SNSI system the onus was on the clients (TV & Newspapers) to notify us when the system went wrong. This, of course, created no end of stress for everyone involved and in general involved a lot of screaming. With the new system we tried to put some safeguards in place so the system would notify us if there was an issue. Nairc and I coded so many fail-safes into it that it should never have failed, and yet it did. Things beyond our control, like the network or email, would have a blip and mess the system up, and then we'd be in the same boat as before, except even more exasperated because we'd done so much work to prevent it and it still happened. Half the time one of our fail-safes had a bug that caused the system to fail, so it was almost like the system was better off before all of that work we'd done. I coded fail-safes for the fail-safes, and in general went down a never-ending hole where I felt like I was trying to respond to an endless list of possibilities and the various combinations and permutations of them.

In the end, the breakthrough came from changing my thinking. I took the perspective of a client and had the deliverables sent to my phone via SMS. So, while in the past I'd be chained to a computer every weekday at 5, by assuming this new perspective I was able to gain a degree of freedom in that I just needed to be near a computer IF I didn't get the emails on my phone. I also ditched the fail-safes and focused on fixing the bugs in the "main" code, and stuck in wait times to give the code time to jump to the next section. Rather than depending on something happening exactly when I wanted it to, I had it try a few times, so if the first attempt failed because some other part of the program hadn't finished, it'd accept that, wait a bit, and then try again before moving on. In other words, I coded it to expect failure and sense success, which is a lot less complicated (see point #3 below) than "sensing" failure, given the many known and unknown facets of failure.
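The "expect failure and sense success" idea can be sketched in a few lines of Python. This is only a minimal illustration, not the actual SNSI code: the function name `wait_for_success`, the `step` callable, and the attempt count and wait time are all assumptions made for the example.

```python
import time

def wait_for_success(step, attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Call step() until it reports success or the attempts run out.

    step is any zero-argument callable that returns True once the thing
    we are waiting for has happened (hypothetical; not from the SNSI code).
    """
    for attempt in range(1, attempts + 1):
        try:
            if step():
                return True          # success sensed: move on
        except Exception:
            pass                     # failure is expected; no diagnosis attempted
        if attempt < attempts:
            sleep(wait_seconds)      # give the other part of the program time
    return False                     # never saw success: time to get to a computer
```

Note that the loop never tries to work out why an attempt failed. It only checks for the one thing it can recognize reliably, which is success.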

One other thing that made a huge difference was ditching the bells and whistles we had used to get data to the clients, and giving them more ways to get the data. So they still get the data via email, but if the emails fail they also know that after a certain time the data is uploaded to a web server, and they can check a really simple webpage to get whatever they need instead of the slower main page. So while the main SNSI page is data intensive and takes some time to load (encouraging stress, I might add), they now have a quick and simple page to use.

On that page I also gave them open access to the data itself, so they are free to use their own graphs if they wish. They've got the data, so they're not dependent on our graphs, which would occasionally fail or be a little messed up. Again, it is all about providing them with choices to do as they please.
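From the client's side, having several ways to get the same data amounts to trying each source in order and taking the first one that answers. A rough sketch of that, where the source names, the fetch functions, and the sample value are all made up for illustration:

```python
def first_available(sources):
    """Return (name, data) from the first source that yields data.

    sources is a list of (name, fetch) pairs, where fetch is a zero-argument
    callable that returns the data, or raises / returns None on failure.
    All names here are hypothetical.
    """
    for name, fetch in sources:
        try:
            data = fetch()
        except Exception:
            continue                 # this channel is down; try the next one
        if data is not None:
            return name, data
    return None, None                # every channel failed


def from_email():
    raise ConnectionError("mail server unreachable")   # simulated outage

def from_simple_page():
    return "SNSI close: 1234.5"                        # made-up sample value

print(first_available([("email", from_email), ("simple page", from_simple_page)]))
# prints ('simple page', 'SNSI close: 1234.5')
```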

It is my "chained to the computer" comment above that I really want to ensure we steer around with this effort. With SNSI failing "off" the onus was on me to check every single day to find out if it had worked or not. By having it go to my cell I simply have to listen for a beep around 5:07pm. If I don't hear it, I know to get to a computer. The knee-jerk reaction to that is to dream up some system that'll listen for the beep and notify me if it doesn't happen, but the complexity of that just adds more points of failure and would end up making the situation worse, rather than better. In the end it has been reducing points of failure and sensing success that has helped me. It has also helped to focus on those points of failure that do exist, to try to make them less likely to fail.

So, to summarize:

  1. Keep the clients informed. We've got two network providers (Oak and Pine) as well as several other ways to get data out, such as via the Grape lines, cells with wireless access, and for those that have them, Blackberries. Establishing an off-site website, off our network, and informing users about its URL would immediately reduce the stress and uncertainty in their minds that causes them to call and further delay efforts to fix whatever problem they're worrying about. With all of the access methods I just outlined it shouldn't be hard for us to get out there and update that info center. Perhaps better would be a way to have all requests shift to that site if our connection is down, although I've no idea how possible that is. I seem to recall something about secondary IPs associated with domains via DNS in case the first one couldn't be accessed. That seems like a good starting point.
  2. Give clients a degree of freedom to act as they please. People hate having just one option. Give them three and you'll never hear from them. Give them four and they'll complain about all the options. After I gave the clients that simple webpage above I went from hearing from them weekly to yearly. How does this affect our current discussions? The info center should tell the clients their options. If the problem is a DNS issue, then they should be able to use IPs to get to what they want to get to. If the issue is email, direct them to the phone directory. Router down? Direct authorized users to the modem banks, and give everyone else a quick lesson in Google's cache and archive.org's Wayback Machine.
  3. Keep things simple. It reduces points of failure, and while this is a short point I can't possibly say how important it is. The simpler a system is, the fewer things there are to fail. It is the single cheapest and most effective part of a Risk Management strategy.
  4. Whatever technology is in place needs to be highly dependable. That means ensuring standards are met, reading reviews, and returning things we buy that we lose confidence in. This is a double-edged sword in that it means when a problem does happen we know exactly what is likely to have caused an issue and what we can depend on, but that knowledge can lead people away from what is actually causing the problem.
  5. Establish tiers of importance and make them real. The urge is there to try to make everything bulletproof. That's not necessary. For the network, email, Knowledge Management Server and the other things below, sure it is. But bringing that strategy down to people's desktop PCs is ludicrous. At that level it is a waste of effort unless some HUGE mistake is made, and it is far cheaper just to wait for the component to fail and replace it. That's why the SNSI computer has hot-swappable drives, and why I keep spare drives in my cube.
  6. Codify, codify, codify. Take #5: have we got something in the high-importance tier that needs to be moved down? What are the procedures that must be followed to ensure that is a good idea? Take #s 1 and 2: the options given to clients need to be determined ahead of time. In other words, responses to certain issues need to be established ahead of time. Then, when an unplanned problem occurs, we are in the position of adapting well-thought-out plans rather than scrambling and making mistakes. Fault tree analysis would be really helpful with all of this... it'd be good to get people trained in it.
  7. Empower the people on the ground. I can cite example after example, but all it really comes down to is that they're the folks who know what is going on, and are the best equipped to respond to an issue or even plan around one. I believe "The Nordstrom Way" by Robert Spector and Patrick D. McCarthy talks about this.
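Points 3 and 6 can be made concrete with a tiny fault tree calculation. The sketch below assumes the basic events are independent, and the probabilities are invented purely for illustration, not measured from any real system. An OR gate fails if any input fails; an AND gate fails only if every input fails.

```python
from math import prod

def or_gate(*probs):
    """Probability that at least one independent input event occurs."""
    return 1 - prod(1 - p for p in probs)

def and_gate(*probs):
    """Probability that all independent input events occur."""
    return prod(probs)

# Hypothetical per-day failure probabilities (illustrative only).
network, email, web = 0.01, 0.02, 0.02

# Delivery fails if the network is down, OR if email AND the web page both fail.
delivery_fails = or_gate(network, and_gate(email, web))

# Adding a buggy "fail-safe" in series adds one more way to fail.
failsafe_bug = 0.03
with_failsafe = or_gate(delivery_fails, failsafe_bug)

print(f"{delivery_fails:.4f} {with_failsafe:.4f}")
```

With these made-up numbers, the redundant channels (the AND gate) barely add to the overall failure probability, while the extra component in series (the OR gate) raises it noticeably. That is the arithmetic behind giving clients more ways to get the data while keeping the system itself simple.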

- Jason



 
© 2003-2007 by Jason R. Wells. Some rights reserved. Sitemap.