The Non-Technical Art of Being a Successful DBA – Paranoid DBA Best Practices
The Perils of
our Profession
One of the common challenges that all DBAs face, no matter what vendor's database
they work on, is the absolute attention to detail our profession demands. Switch
a couple of characters in a script, forget to set your SID, set the wrong flag
at the wrong time and the end result usually isn't very pretty. Many commands
we issue on a regular basis are destructive by their very nature. This is the
reason why I have a great respect for all technicians who have selected database
administration as their chosen profession.
I know they have all experienced that uncontrolled "eye-twitching" at 2 AM when they are ready to hit the final enter key to execute the command. You know what command I'm talking about too. It's that one command that you really, really, really hope is going to come back with a successful return code and ultimately end with a database that is finally usable. Whether it's a recovery, a file fix or corrupt data is immaterial, it's the wait that we are talking about.
There is no longer wait in our profession than waiting for the message below after a database recovery:
SQL> Database opened.
Time always seems to stand still. The longer the recovery, the messier the recovery, the more critical the database - the longer you wait. You stare at the screen hoping beyond hope that the above message will appear. It's the ritual cross your fingers, spin around three times, face towards Oracle headquarters and pray to everything that is Larry Ellison wait. I've actually caught myself mumbling "come on, please open, please open " I don't care how sure you are of your capabilities - you know the anticipation I'm talking about.
You then either breathe a sigh of relief or you are in absolute disgust when you see an Oracle error message appear. How about the old "File 1 needs more recovery to be consistent" or the "File 2 not restored from a sufficiently old backup"? Those messages are enough to make anyone cringe. I'm an ex-Oracle instructor, I've seen those messages A LOT in class. I still cringe.
At a previous job, I once had to run through 36 hours of tapes to restore a multi-terabyte warehouse. A disaster occurred that required us to do a recovery. THAT was the longest wait for a database open message I ever experienced. One of my fellow DBAs asked if I needed a brown paper bag to breathe into.
Or it's the command that drops the schema in the test environment that will allow you to do a refresh from production. It's that test database that runs on the same box as production. The environment that makes you do a "SELECT name FROM V$DATABASE" command in SQL*PLUS about 15 times in a row before you finally execute the "DROP USER CASCADE" statement.
Not only must we try to prevent our own mistakes, we must safeguard our environments against the mistakes of others. Operating system administrators, disk storage technicians and application developers are just like us. We are all part of the human community that makes mistakes from time to time.
If you never make mistakes, send me a resume. I'm always looking for a "Patron Saint of Oracle" here at Remote DBA Experts. It will also save us on travel costs because I'm sure you'll be able to spread your wings and fly here on your own.
But as my old boss Dan Pizzica used to tell me (when I was a VERY junior DBA) "It really doesn't make a difference who broke the database. You are the technician who is ultimately responsible for fixing it. The buck stops with you. If you can't protect your environments, you aren't doing your job." We all know he's absolutely correct.
Then there's the software glitches. The problems that pop up out of the blue and make you go:
"WHAT THE ? - how did THAT happen? I've done this 317 times in a row and it worked every time."
For you math majors, here's my calculation for this:
CLOSER YOU ARE TO PRODUCTION TURNOVER | |
+
|
THE GREATER THE VISIBILITY OF THE PROJECT |
=
|
THE MORE LIKELY A PREVIOUSLY UNKNOWN SOFTWARE GLITCH WILL OCCUR |
I don't care what software you are using, you will run into the "only occurs on this release, on this version of the operating system, using this particular feature on the third Tuesday of the sixth month when it's cloudy outside" BUG. Be sure to expect management to stop by and ask "well why didn't you test this on the third Tuesday of the sixth month when it was cloudy outside?"
The more complex
the database ecosystem, the more paranoid I become. Which is why I'm not a follower
of "the database is getting so easy - we won't need DBAs" mantra that
mindless industry pundits profess on a seemingly endless basis.
So now we know that our jobs are somewhat unforgiving and we do make a mistake
from time to time. What can we do to reduce the chance of an error occurring?
The Second Set
of Eyes
As I have stated in previous blogs, I have over 20 years of experience using
Oracle and have done my fair share of database backups and recoveries. During
my career as an Oracle instructor, I have assisted in hundreds of database recoveries
in Oracle's classroom environments. If possible, I still have others review
my recovery strategy and recovery steps before I begin the recovery process.
I used backup and recovery just as an example. Whatever the process is you are
are performing, a second opinion may prevent you from making a mistake. A review
from a fellow DBA has saved me more than once. I may be described as having
an ego (I have no idea where they get that opinion) but it doesn't prevent me
from asking for help from others.
We were recently correcting a very poor backup script created by a customer's previous database support vendor. The customer described this particular environment as "if it goes down we lose our ability to make money" application. After our massive set of changes were complete, two of us went line-by-line verifying each line of the backup script. At the end of each script we asked each other "are you OK with this?". Only then did we move on to the next one. I don't care how much time you have "in the seat" using Oracle, you need to put your ego aside at times and have someone check your work on critical activities.
Concentration
I used to work for a shop that subscribed to "the everybody in one big
room" philosophy. I guess it was supposed to allow everyone to work together
as a team and become as "one with each other". It may have achieved
that purpose but it sure didn't allow you to concentrate on your work very well.
You could hear so many different conversations they had to pump white noise
in. The constant 'whhhsssssshhhssshhh" noise made me feel like I was a
crewmember of the Starship Enterprise. I always wondered when Captain Picard
would walk through.
Like all DBA units, our particular area was often populated with various developers and O/S technicians. Many different conversations were occurring, some that could be described as somewhat animated. The environment did not allow you to concentrate on the task at hand. We often had to go into small conference rooms to work on critical tasks.
The point I'm trying to make is that no matter what type of environment you work in; if you can concentrate OK, but if you are like me and you can't, find a spot where you can. Block off some time, send questions to other DBAs and concentrate on the task at hand. Don't attempt to answer questions and code a complex script at the same time. May seem obvious, but throughout my career I have personally watched numerous DBAs attempt to multitask when they are working on a critical process. It's a recipe for a problem. Once you are done, follow rule number one and have someone review your work.
What Database
Are You Working IN?
Working in the wrong database is a common problem for database experts as well
as their less experienced counterparts. How many times have YOU found yourself
running statements in the wrong environment? Feel free to include me in that
not so select group. The operating system command SET can be used in Windows
systems to display environment variables. The ENV command can be used to display
the environment variables in UNIX. Many seasoned database administrators change
their UNIX shell prompt in their.profile to display the current Oracle SID.
Displaying the current Oracle SID in the shell's prompt provides a continuous
reminder to the DBA of the database they are working in.
The entire paragraph above was from a recent blog I wrote on the new capabilities of 10G SQL*PLUS. I highly recommend you read this blog and use the 10G Version of SQL*PLUS to take advantage of this safeguard.
Setting your
SID Automatically During Logon
Here's an easy recommendation - don't do it. Many of the servers we work on
contain several databases that use different database software releases. It is our standard
to not set environments in logon scripts. We create shell scripts that have
the database name in their title to set our SID, ORACLE_HOME, etc..
It is a little harder in Windows environments. During installation, Oracle often uses the first database created on the server as the default SID. You can override this configuration, but you'll need to read the Oracle installation guides to find out how. The manuals will show you how to override the default database on Windows operating systems.
Saving Time
VS Creating a Problem
I once watched a fellow DBA perform a rather complex set of administrative tasks
to solve a problem. He was rapidly flipping back and forth between at least
15 active screens, copying and pasting and editing and copying and pasting and
editing
I describe this particular activity as "Multiple Screen Syndrome".
He also had several other screens open that were connected to other databases.
He was multi-tasking to its highest degree.
How about the rm -r /u0*/ora*/prod*/*/*.* command in UNIX? It's the command that drops multiple databases in multiple directories. All in one painful swoop. How many times have you heard of a mistake caused by commands like this causing mass mayhem? When you make a mistake like this, you become immortalized in conversations for years to come. Get a few technicians together after work and ultimately the conversation will include "remember when Bob so-and-so ran that big rm -r command by mistake and wiped out the entire O/S on our production web server? You can't tell me you haven't heard stories like this.
My opinion as a database manager is that I would rather you take your time than showcase your multi-tasking and time saving skills. The more complex and critical the activity, the more basic you should become in your plan of attack.
Safety First
Mindset
I once saw a DBA log in to a database using a particular schema account. He
then logged into a different database using a DBA account and dropped the schema
with the same name he was logged into on the first database. I asked him why
he logged into the first database using the schema account he just dropped in
the second. He stated "Oracle won't let you drop a schema that is connected.
No matter what happens after this, I'm positive that I won't drop the user in
this database by mistake." I like that Safety First mindset in a DBA.
You need to think Safety First when you are performing any particular complex or critical activity. Take the time and put one or two safeguards in place like the DBA did when he dropped the user.
Other DBAs may call you paranoid, I'll call you an experienced DBA that would rather be safe than sorry.
Wrapup
The intent of this blog was to not provide you with a laundry list of recommendations.
It was intended to help jumpstart your creative juices to think about different
methods to protect yourself against problems. If you have any helpful hints,
please feel free to respond to this blog with your Safety First Tips and Tricks.
Thanks for Reading,
Chris Foot
Oracle Ace
Replies to this comment