Bad Standards: #4--Data Duplication
It is undeniable that data growth is out of control. Businesses today are gathering and storing more data than ever before, and with this explosion in stored data, organizations are relying increasingly on database management systems to get a handle on corporate data and extract useful business information from it. But rarely is there a uniform, guiding data infrastructure in place that dictates when, why, how, and where data is to be stored.
But I'm getting ahead of myself a bit here. The missing standard that I am proposing is one that limits copies of the same data. One of the biggest contributors to data growth is that we copy and store the same data over and over and over again. It may reside in the production system in a DB2 database on the mainframe (and, oh yes, it was copied from an IMS database that still exists because there are a few transactions that have yet to be converted). And then it is copied to the data warehouse (perhaps running Oracle on a Linux server), an operational data store, several data marts, and maybe even to an ad hoc SQL Server database in the business unit. This just has to stop!
A DBMS is a viable, useful piece of software because it enables multiple users and applications to share data while ensuring data integrity and control. But human nature being what it is, everyone wants their own copy of the data to "play with" and/or manipulate. But at what cost? Data storage requirements are but one small piece of the cost. The true cost is the data integrity problems that are created. If you have customer data (for example) spread across 5 platforms, 4 database systems, and 3 different locations, what do you think the chances are that all of that data will be accurate? My guess is that there is a zero percent chance!
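To make that integrity cost concrete, here is a minimal SQL sketch. The names are hypothetical (assume PROD.CUSTOMER is the system of record and MART.CUSTOMER is a departmental copy refreshed by an extract job); the point is that as soon as a second copy exists, someone has to write and run queries like these just to find out how far the copies have drifted apart.

```sql
-- Hypothetical example: compare the production copy against a departmental copy.
-- Rows that exist in both copies but no longer agree:
SELECT p.CUST_ID,
       p.CUST_NAME    AS PROD_NAME,
       m.CUST_NAME    AS MART_NAME,
       p.CREDIT_LIMIT AS PROD_LIMIT,
       m.CREDIT_LIMIT AS MART_LIMIT
  FROM PROD.CUSTOMER p
  JOIN MART.CUSTOMER m
    ON m.CUST_ID = p.CUST_ID
 WHERE p.CUST_NAME    <> m.CUST_NAME
    OR p.CREDIT_LIMIT <> m.CREDIT_LIMIT;

-- Rows in the production copy that the departmental copy has never received:
SELECT p.CUST_ID
  FROM PROD.CUSTOMER p
 WHERE NOT EXISTS (SELECT 1
                     FROM MART.CUSTOMER m
                    WHERE m.CUST_ID = p.CUST_ID);
```

Now multiply that reconciliation effort by every additional platform, database system, and location holding yet another copy.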
So we need to create standards that control and limit the mass duplication of data that is rampant within today's companies. Of course, doing so requires enacting a data management discipline that makes data available and accurate to all potential consumers. If the data can be accessed efficiently from a single location, or at least from fewer locations, we can reduce the amount of data we need to manage and improve data quality.
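One way such a standard can be put into practice, sketched below with hypothetical object names, is to give the business unit controlled access to the single authoritative table (here assumed to be PROD.CUSTOMER) through a read-only view, instead of shipping it yet another extract to load into its own database.

```sql
-- Hypothetical sketch: expose the authoritative customer data through a view
-- rather than copying it into another database.
CREATE VIEW SALES.V_CUSTOMER AS
  SELECT CUST_ID, CUST_NAME, CREDIT_LIMIT
    FROM PROD.CUSTOMER;

-- The business unit gets read access only; no second copy is created,
-- so there is nothing to drift out of sync.
GRANT SELECT ON SALES.V_CUSTOMER TO SALES_ANALYSTS;
```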
Doesn't that sound like a win/win scenario?