Fernando talks about the uselessness of too many people in technology. After an all-too-familiar tale of woe, he says:
The purpose of this post is to show how important it is to know how to debug a problem. Tools like truss, dbx/gdb/adb and lsof are invaluable for this. I personally think anybody working on IT problem solving, regardless of their role (DBA, programmer, system administrator, network administrator, etc.), should have a minimal knowledge of how they work and what kind of information you can collect with them. There are obviously other tools, like tcpdump and netstat, that are equally useful in some scenarios. This whole case revealed that many people involved in this area don't have a clue about how these tools work or the information they can collect.
Everybody involved had a theory about the problem. But none of those theories were based on facts or real observations. They were mere speculation about what could be happening, and such theories tend to pass the problem on to a different team...
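As a minimal illustration of the kind of information these tools expose, here is a sketch (assuming a Linux system, where lsof gets much of its data from the /proc filesystem):

```shell
#!/bin/sh
# Sketch: on Linux, lsof reads much of its data from /proc.
# Open a file on descriptor 3 and list this shell's open descriptors,
# which is essentially what `lsof -p $$` would report for this process.
exec 3< /etc/passwd
ls -l /proc/$$/fd     # fd 3 is now a symlink to /etc/passwd
exec 3<&-             # close the descriptor again
```

Just seeing which files, sockets, and pipes a misbehaving process holds open is often enough to tell which team the problem really belongs to.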
Some facts about the situation:
* lots of emails including polished accusations were exchanged
* during two months I spent a considerable amount of time paying attention to this, trying to lead people in the right direction (mostly without success)
* Vendor S2 had a very large team involved. They even sent a consultant from another country to this customer site (by the time he arrived we were about to receive the fix, so apart from politics this guy was not able to do much about the problem)
* The problem could have been solved in around two weeks (one for debugging and another for the vendor investigation and code fixing)
* No one admitted that they didn't understand the output of the tools above, and no one (even after the conclusion) took the opportunity to ask me to explain how to use these kinds of tools
I don't think I know everything (and in fact my knowledge of these tools is somewhat superficial), but I only learned how to use this kind of stuff because at some point in the past I came across references to them and took the time to experiment and read about them. In other words, we should use problems as opportunities to gather more knowledge and to learn new things.
I keep receiving reports about problems without any useful information. My favorite ones are:
* I tried it and it doesn't work!
Surely I believe it doesn't work... But usually when something "doesn't work" it raises an error. Only rarely do people include the error code/description. They are more likely to include possible causes (usually unrelated to the actual error) than the error code or error message
* The process was hung! It was probably in some kind of loop! So I killed it
Well... processes don't "hang". They wait on something or they effectively stay in some loop. And the way to see which is by using the tools... And the tools don't work on a process that doesn't exist anymore...
* I've found a bug!
Sure... A bug is a malfunction: something that works differently from what's documented or what is expected. Most of the time people find a problem. After some analysis (often involving the supplier's technical support) it may be mapped to a bug. People tend to expect something; if it doesn't happen, they've "found a bug". Usually they don't bother to read the documentation and try to understand the reasons for the unexpected behaviour.
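The "hung process" complaint above can be made concrete with a small sketch (assuming Linux, where a process's scheduling state is visible in /proc for as long as the process exists):

```shell
#!/bin/sh
# A "hung" process is usually just waiting. Its state is visible in /proc
# while it is alive -- once killed, there is nothing left to inspect.
sleep 60 &                     # a process that waits (not a busy loop)
pid=$!
sleep 1                        # let it settle into its wait
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "pid $pid is in state $state"   # S = sleeping/waiting, R = running
kill "$pid"
```

A tool like truss (or strace on Linux) attached to the same PID would go further and show the exact system call the process is blocked in.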
Obviously people do hit bugs. Most of the cases that I open within IBM end up as bugs. But this is just a very small portion of the problems that customers report.
In short, I feel that in general, people's ability to study a problem in the IT world is very limited. People usually spend more time trying alternatives than collecting and understanding problem data. Error messages and codes are ignored much of the time. All of this translates into a big waste of time, and obviously money... And of course, it directly impacts the quality and availability of IT systems.
I couldn't say it better myself...