The Rakes - Ten New Messages
Wednesday, May 23, 2007
The Evil That is Master / Meta Data - Part 2 (or the one where Steve talks about Ireland..!)
In the last post on this subject I talked about some of the key attributes of master data and meta data management, and how they are intrinsically linked to what I call the Information or Data lifecycle. Now I would like to elaborate on this and identify some of the advantages that investing in this kind of strategy can provide.
Embarking on a corporate-wide gathering of all things data requires investment at all levels. Time, effort, money and, most importantly, commitment are essential. But ensuring a business receives any kind of ROI (return on investment, which in the early part of my career I thought meant Republic of Ireland, hence the title) before that kind of commitment takes place can prove daunting, to say the least. Let's forget the cons for a moment and look at some examples of the key advantages a data and information strategy can provide:
- Everyone on the same page.
- Information means the same thing throughout the business.
- Reduced cost of new reporting and / or analytical requirements.
This doesn't look like a very extensive list, and to be honest, if someone presented me with a pitch like this I would be showing them the exit pretty rapidly. But when you examine the nature of each of these bullets further, you see that they are rooted incredibly deeply in practically all of the business processes and systems in place within an organisation. Key to the whole concept is that meta data and master data are not only for use within reporting systems; it's just that the projects that tend to drive this kind of requirement are usually also implementing some kind of reporting mechanism.
Let's take a step back and look at a simplified implementation of a number of reports. First we gather the requirements for the reports, which would be based upon an existing set of data: possibly sheets of paper, possibly a database storing transactions. Then comes the nasty business of performing analysis on said data, conforming it into your existing dimensional structure or creating new dimensions from scratch. Only once you have defined the model on which your reporting will be based can you finally start building reports.
So how could we improve this process and reduce the time taken to turn around a reporting requirement? Having some degree of knowledge of the system before a reporting requirement comes along would be advantageous, but that's not the way the world works. Looking at a single report as a deliverable, we would need to understand where its constituent data is sourced from. The report, for example, has customers, a geographical breakdown, product type, number of orders and order value. Very simple, but already pulling data from perhaps the CRM, product catalogue and ordering systems.
When building a picture of the data held within the company it is very important that ownership is established. Who owns the customer data? Who is responsible for maintaining the product catalogue? These are the people that own these data elements within the organisation, and they are therefore responsible for ensuring the quality of not only the data in their own systems but also the reporting that is based upon it.
The point of this is that data quality needs to come from the top down. BI projects are generally just the catalyst for this, but they should also be used as a means of driving improvement in the source systems. Too often data cleansing has been hooked onto the back of a BI project and weighed it down with responsibility that should lie elsewhere.
OK, enough of this business-type talk of responsibility and stuff. Next time I'm going to go into what master data and meta data are actually made of.
Unit Testing SSIS
SSIS packages are almost mini applications in their own right. On most occasions there are one or more inputs, which may consist of anything from a single variable value to an entire data set. These pass through a transformation or validation process before producing a final output that could, again, take a number of different formats.
Unit testing these packages should involve minimal change to the structure or behaviour of the package itself. Influencing the code's behaviour through the testing process is as great a risk as an incorrect deployment or a missed testing scenario.
The most important factor in testing a package is to understand how it will react in controlled circumstances and to be in a position to verify that the anticipated result was achieved. Testing this using the natural output of the package will, for the reasons discussed previously, provide the most robust results.
Due to some of the current debug limitations of SSIS, and taking into account the need to keep the package structure and design static, it is only really possible to effectively test the control flow of a package whilst remaining ‘hands off’.
Let's look at a simple package example: say, a control flow with an Execute SQL task that truncates a destination table, a data flow that loads it, and a File System task that archives the source file once processed.
The same type of output would be taken from each of the standard control flow tasks. The Execute SQL task would have an initial state, a completion state and a resulting success state based on those outputs: an initial state of n rows in the destination table before the step has executed, and 0 rows after. This is measured by examining the row count in the table before and after execution and comparing the values with the expected result. For the archive File System task all of the states would be measured using a different mechanism, and so on and so forth.
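To make that concrete, here's a minimal sketch of the kind of row count check involved, written in C# against ADO.NET. The connection string and table name are purely illustrative and would be replaced with your own:

```csharp
using System;
using System.Data.SqlClient;

class RowCountCheck
{
    // Hypothetical connection string - substitute your own environment.
    const string ConnStr =
        "Data Source=.;Initial Catalog=Staging;Integrated Security=SSPI;";

    // Returns the current number of rows in the given table.
    static int CountRows(string table)
    {
        using (SqlConnection conn = new SqlConnection(ConnStr))
        using (SqlCommand cmd = new SqlCommand("SELECT COUNT(*) FROM " + table, conn))
        {
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }

    static void Main()
    {
        int before = CountRows("dbo.OrderStaging"); // initial state: n rows
        // ... execute the truncate step of the package here ...
        int after = CountRows("dbo.OrderStaging");  // expected state: 0 rows

        Console.WriteLine(after == 0
            ? "PASS: table emptied (was " + before + " rows)"
            : "FAIL: " + after + " rows remain");
    }
}
```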
Essentially this means that, whatever the task, there may be numerous methods of gathering the data necessary to confirm whether the test has been successful. Simplifying the process of measuring the results of a test would make a standard testing mechanism far easier to implement.
Packages currently provide the ability to perform some manual logging, as I've posted about in the past. This can be used to establish whether tasks have completed successfully or not, but where a measurement is needed to confirm this, that type of logging is lacking. For example, truncating a table will not provide you with a row count confirming the number of rows affected or the number of rows left in the table, whilst a delete statement would. It would not be wise to change all truncates to deletes just to allow this information to bubble up to the logging process and so capture the state of the task before and after execution.
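To see the gap, consider how ADO.NET reports the two statements back to a calling application: ExecuteNonQuery returns the rows affected for a DELETE but, per the documentation, returns -1 for statements such as TRUNCATE TABLE, so the state around a truncate has to be captured by counting rows yourself. The table and connection details below are again hypothetical:

```csharp
using System;
using System.Data.SqlClient;

class LoggingGap
{
    static void Main()
    {
        using (SqlConnection conn = new SqlConnection(
            "Data Source=.;Initial Catalog=Staging;Integrated Security=SSPI;"))
        {
            conn.Open();

            // DELETE reports how many rows it removed...
            SqlCommand delete = new SqlCommand(
                "DELETE FROM dbo.OrderStaging", conn);
            Console.WriteLine("Rows deleted: {0}", delete.ExecuteNonQuery());

            // ...whereas TRUNCATE returns -1, saying nothing about the
            // rows affected or the rows remaining in the table.
            SqlCommand truncate = new SqlCommand(
                "TRUNCATE TABLE dbo.OrderStaging", conn);
            Console.WriteLine("Truncate result: {0}", truncate.ExecuteNonQuery());
        }
    }
}
```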
There are a number of different ways of attempting strict, robust unit testing of SSIS packages, which I've generalised into three options.
The first is to create a more robust testing methodology with processes specifically for unit testing SSIS packages based on the control flow. The success criteria should be based on use cases defined at the package design stage, with each step within the control flow tested against documented pre-execution criteria, the expected output and the subsequent result. For the example above, a documented case for the truncate step might read: pre-execution criteria, destination table contains n > 0 rows; expected output, task completes successfully; result, destination table contains 0 rows.
This does not lend itself to any kind of automated testing approach and would involve manually stepping through the package control flow and logging the results.
The second option allows for the automation of tests and the capture of results, but would require another application to control the execution of the process, establishing the pre-execute criteria in addition to measuring the post-execute result.
Using the unique identifier of each control flow task, a lookup would be made during the pre-execute and post-execute phases of each step to either a database or a configuration file containing the method call that initiates the processes necessary to prepare the execution and subsequently validate and measure its success on post-execute.
A change like this would mean integrating a procedure call, perhaps using the CLR, to execute such tasks based on an internal variable indicating that unit test, or debug, mode was enabled within the package. Whilst this provides a number of advantages in automated testing and the capture of test results, a great deal of work would be required in preparing each test. This would all have to be completed in addition to the work suggested in the first option, as the pseudo code needed to design each of the test criteria would still be based on the use cases defined at the package design stage.
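As a rough sketch of what such a controlling application might look like, here is a minimal C# harness built on the SSIS object model (Microsoft.SqlServer.Dts.Runtime). The package path, the UnitTestMode variable and the PrepareTest/ValidateTest helpers are all assumptions for illustration; in a full implementation the per-step hooks would be wired into the package's event handlers rather than run around the whole execution as shown here:

```csharp
using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageTestRunner
{
    static void Main()
    {
        Application app = new Application();

        // Hypothetical package path - substitute your own.
        Package package = app.LoadPackage(@"C:\Packages\LoadOrders.dtsx", null);

        // Flag the package as running in unit test mode (assumes the
        // package declares a boolean variable named UnitTestMode).
        package.Variables["UnitTestMode"].Value = true;

        // Pre-execute: prepare the environment for each task, keyed on
        // its unique ID; the lookup to a database or configuration file
        // would live inside this helper.
        foreach (Executable executable in package.Executables)
        {
            TaskHost task = executable as TaskHost;
            if (task != null)
                PrepareTest(task.ID);
        }

        DTSExecResult result = package.Execute();

        // Post-execute: validate and measure each task's outcome.
        foreach (Executable executable in package.Executables)
        {
            TaskHost task = executable as TaskHost;
            if (task != null)
                ValidateTest(task.ID);
        }

        Console.WriteLine("Package result: {0}", result);
    }

    // Hypothetical helpers: look up and run the setup and assertion
    // methods registered against the given task ID.
    static void PrepareTest(string taskId) { }
    static void ValidateTest(string taskId) { }
}
```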
The final option would be to remove the more advanced mechanism from option two. The pre-analysis and use case definition would still be required, but in this option additional test scripts would be placed in the pre- and post-execution events of the package. This would mean embedding code into the package that would only be used during unit testing, switched off by a variable similar to that in the second option; a sketch of the kind of guard involved follows below.
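This is a compile-ready illustration rather than the generated script task wrapper; in a real script task the values would come from Dts.Variables, and all of the variable names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;

static class EmbeddedTestHook
{
    // Sketch of the check an embedded post-execute test script might
    // perform, guarded so it only runs when unit testing is enabled.
    public static void PostExecuteHook(IDictionary<string, object> variables)
    {
        // Leave normal execution untouched unless the flag is set.
        if (!(bool)variables["UnitTestMode"])
            return;

        int expected = (int)variables["ExpectedRowCount"];
        int actual = (int)variables["ActualRowCount"];

        Console.WriteLine(actual == expected
            ? "PASS: row count matched"
            : string.Format("FAIL: expected {0}, got {1}", expected, actual));
    }
}
```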
Though it would be possible to automate a great deal of the testing for the package this way, it would mean changing the structure or behaviour of the package directly, and it increases the danger of introducing the kind of problems with script tasks that have previously been seen on 32-bit to 64-bit transitions.
So there you go. There are ways of doing SSIS unit testing. There are even ways of automating the process, but it's not a quick win. It doesn't remove the need to establish good formal testing strategies, and it certainly isn't going to appear out of a box; well, not soon anyway.