Thoughts on Data Quality
Speed-to-market pressures tend to hinder data quality improvement in today’s world of web-enabled, rapidly deployed systems. Instead of operating within a well-defined data architecture that mandates data modeling and logical database design, we fall prey to the opportunistic techniques of the day.
There are many barriers and inhibitors to data quality. Yes, the rapid-fire nature of project planning and application development is one of them, but it is not the only one. The data explosion exacerbates the problem: as we add more and more data without adding the people, processes, and policies needed to ensure its accuracy, data quality will continue to degrade.
The propensity to copy data all over the place is another big problem. Instead of using database management systems to share data among multiple applications, we simply make copies.
Think about it: you’ve probably done it yourself. You’ve got that customer database up on the mainframe. Maybe it exists in DB2 or perhaps even IMS. Then a new application gets funded for development and it needs customer data. But the users don’t want to access the customer data where it resides; instead they want their own copy, maybe even on the same DBMS. Then a new project comes up and we create a copy in Oracle on Unix… and those SQL Server folks need a copy, too. And it gets downloaded to workstations in spreadsheets, another copy is made for the data warehouse, and oh, yeah, don’t forget to copy it out to those data marts, too.
With copies of the (supposedly) same data swimming all over the place, who can guarantee its accuracy? When it changes in one system, does it get changed everywhere? What about the user who has a spreadsheet up in one window and an application up in another: who can ensure that he or she does not wreck the application data because the information in the spreadsheet is outdated?
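To make the problem concrete, consider what it takes just to find out whether two copies have drifted apart. The following is a minimal sketch; the SQLite files, the customer table, and its cust_id key are assumptions for illustration, not a prescription:

    # A minimal drift check between two copies of a hypothetical "customer" table.
    # The table name, key column, and database files are illustrative assumptions.
    import hashlib
    import sqlite3

    def row_fingerprints(db_path, table="customer", key="cust_id"):
        """Return {key value: checksum of the rest of the row}."""
        conn = sqlite3.connect(db_path)
        try:
            fingerprints = {}
            # "SELECT key, *" is valid SQLite: the key leads, the full row follows.
            for row in conn.execute(f"SELECT {key}, * FROM {table}"):
                fingerprints[row[0]] = hashlib.sha256(
                    repr(row[1:]).encode("utf-8")).hexdigest()
            return fingerprints
        finally:
            conn.close()

    def report_drift(db_a, db_b):
        a, b = row_fingerprints(db_a), row_fingerprints(db_b)
        only_a = set(a) - set(b)
        only_b = set(b) - set(a)
        changed = {k for k in set(a) & set(b) if a[k] != b[k]}
        print(f"rows only in {db_a}: {len(only_a)}")
        print(f"rows only in {db_b}: {len(only_b)}")
        print(f"rows that differ:   {len(changed)}")

    # Example (assumed file names):
    # report_drift("mainframe_extract.db", "datamart_copy.db")

And that only detects the drift; it says nothing about which copy is right or how to reconcile them.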
The best analogy I’ve heard is that maintaining duplicate data implementations is like having multiple watches that do not tell the same time. Synchronizing the data can be a Herculean task.
Automation can help. ETL tools that automatically propagate data changes from one database out to the others can help to keep multiple copies of the same data synchronized. But the results will only be as good as the automation you put in place, and it is doubtful that you will automate the modification of every copy of the data throughout your organization. A much better approach is to stop the wild copying of data and instead share it using a DBMS. That is what a DBMS is designed for, you know?
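Here is a rough sketch of what sharing through the DBMS looks like, with SQLite standing in for the shared database and the customer table, view, and column names assumed for illustration. The point is that each application queries the one system of record rather than spinning off its own extract:

    # A sketch of sharing one customer table through the DBMS instead of copying it.
    # SQLite stands in for the shared DBMS; names here are illustrative assumptions.
    import sqlite3

    SHARED_DB = "shared_customer.db"   # the single system of record

    def set_up_shared_database():
        conn = sqlite3.connect(SHARED_DB)
        with conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS customer (
                                cust_id   INTEGER PRIMARY KEY,
                                name      TEXT NOT NULL,
                                region    TEXT NOT NULL,
                                credit_ok INTEGER NOT NULL)""")
            # Each application gets a view shaped for its needs -- not a copy.
            conn.execute("""CREATE VIEW IF NOT EXISTS v_marketing_customer AS
                            SELECT cust_id, name, region FROM customer""")
        conn.close()

    def marketing_app_reads():
        # The "new application" queries the shared view; there is no extract,
        # no spreadsheet download, and nothing to keep synchronized.
        conn = sqlite3.connect(SHARED_DB)
        rows = conn.execute("SELECT * FROM v_marketing_customer").fetchall()
        conn.close()
        return rows

    if __name__ == "__main__":
        set_up_shared_database()
        print(marketing_app_reads())

When the customer data changes, every application sees the change because there is only one place for it to live.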
Assuring data quality can be a struggle even in a centralized and homogeneous environment. Data stewards are needed to implement programs for data quality improvement, and such programs need to include both technology and business tasks. Without executive-level business folks intimately involved in the data quality effort, it will surely fail. Implementing data quality programs is costly, and without upper-level management buy-in, human nature will kick in and we’ll start copying data hither and yon all over again.
What is required for data quality? At a high level, there are six components: accuracy, integrity, security, accessibility, nonredundancy, and flexibility. Accuracy involves data modeling and implementing physical databases that correlate to those models. Integrity has a two-fold meaning: defining rules in the DBMS that govern the allowable values of the data, and ensuring that the database can be backed up and recovered in the event of hardware or software failures. Security involves protecting the data from unauthorized access. Accessibility means that the data is available in an optimal manner to those who are authorized to use it. Nonredundancy means that data is not stored inconsistently in multiple places. And flexibility refers to the ability to modify database structures as changes are required.
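To illustrate the first half of integrity, here is a small sketch, again using SQLite and made-up table names, of rules declared in the DBMS so that the database itself, not each application, rejects bad data:

    # A sketch of declarative integrity: rules defined in the DBMS that constrain
    # allowable values. Table and column names are illustrative assumptions.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if asked
    conn.executescript("""
        CREATE TABLE region (
            region_code TEXT PRIMARY KEY
        );
        CREATE TABLE customer (
            cust_id      INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,                      -- no nameless customers
            email        TEXT UNIQUE,                        -- no duplicate emails
            region_code  TEXT NOT NULL REFERENCES region,    -- only known regions
            credit_limit INTEGER CHECK (credit_limit >= 0)   -- no negative limits
        );
    """)
    conn.execute("INSERT INTO region VALUES ('NA')")
    conn.execute("INSERT INTO customer VALUES (1, 'Acme Co', 'a@acme.com', 'NA', 5000)")

    # The DBMS, not application code, rejects the bad row:
    try:
        conn.execute("INSERT INTO customer VALUES (2, 'Bad Co', 'b@bad.com', 'NA', -10)")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)

Every application that touches the table gets the same rules for free, which is exactly what you lose when each copy of the data enforces (or ignores) its own.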
Both database administration and data administration need to be involved as well. DA sets the policies and rules for data quality, and DBA helps to implement them using existing database technology. DBA tools for database change management, backup and recovery, and performance assurance can help to achieve these goals from a physical database perspective. From a logical perspective, there are tools to automate and facilitate data modeling, metadata management, and data profiling. Of course, tools aren't the only answer, but they sometimes can make an impossible task at least possible.
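For a flavor of what data profiling automates, here is a bare-bones sketch against an assumed SQLite customer table. It only reports null rates and duplicate keys, where real profiling tools do far more, but it captures the essence of the exercise:

    # A minimal data-profiling sketch: null rates and duplicate keys for a
    # hypothetical customer table. Table, key, and file names are assumptions.
    import sqlite3

    def profile_table(db_path, table="customer", key="cust_id"):
        conn = sqlite3.connect(db_path)
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(f"{table}: {total} rows")
        for col in cols:
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL").fetchone()[0]
            print(f"  {col}: {nulls / total:.1%} null"
                  if total else f"  {col}: empty table")
        dupes = conn.execute(
            f"SELECT COUNT(*) FROM (SELECT {key} FROM {table} "
            f"GROUP BY {key} HAVING COUNT(*) > 1)").fetchone()[0]
        print(f"  duplicate {key} values: {dupes}")
        conn.close()

    # Example (assumed file name): profile_table("shared_customer.db")

Reports like this give the data stewards something concrete to act on instead of a vague sense that the data is "dirty."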
"Through 2007, more than 50 percent of data warehouse projects will have limited acceptance or be an outright failure as a result of lack of attention to data quality issues. Furthermore, many organizations fail to see that they have an issue with data quality, focusing rather on identifying, extracting and loading data."
Just goes to further bolster my thoughts on data quality...