Showing posts with label SSIS.

Wednesday, February 25, 2009

Wanted: Data Dude-esque Features for the Rest of the SQL Server Stack

I’m a big fan of Visual Studio Team Edition for Database Professionals (mouthful..!). But what has almost always taken the sheen off my BI deployments is the clunky way in which the other components get deployed. I like the idea of being able to perform a one-click deployment, and so do many of the production administrators I deal with.

DB Pro provides complete control over all aspects of database development and deployment, so why can’t I have the same thing for my Analysis Services deployments? Between the deployment features of Visual Studio and the Analysis Services Deployment Wizard, everything appears to be there to do exactly what I would like, so all I’m missing is that close-knit development environment integration.

Reporting Services suffices to a certain extent, apart from not being able to control the connection strings in data sources depending on the project / solution configuration. Yes, I know you can re-deploy the data sources or use the scripting capabilities, but that's yet another external mechanism when Data Dude (now in its second iteration) shows exactly how it should be done.

SSIS deployment is something I’ve never been particularly fond of but you can build the files into the folder of your choice so I’ll swallow that…

Maybe there’s something coming in SQL Server 2010 / Visual Studio 2010. One can always hope…

Friday, November 14, 2008

It's me again.....

Tada

Somebody asked me yesterday if I'd been updating this recently and of course I haven't, as I'm a slacker, so I thought I'd start the process off again by posting about one of my fave subjects: the continued battle with 32-bit / 64-bit mixed environments.

I'm currently working on a real-life SQL Server 2008 project with all the bells and whistles, so we have VS 2008 in the mix as well. Now let me set the stage....

Dev server = 64 bit

SQL Server = 64 bit

Source Database = doesn't matter (I think)

Now I'm going through a firewall, so I've had to set up an alias in Configuration Manager so I can specify the port of the SQL Server instance that is my source database. In there you have 32-bit and 64-bit configurations, and because I'm setting everything up to make sure I can talk to the server, I've gone through Management Studio.

SQL Server Management Studio connects fine, as I'm guessing it's a 32-bit process and I've set up an alias in the 32-bit bucket. I can also create an SSIS package that connects to my destination server, is able to browse the source data, and will execute from the dtexecui application. But what can't I do? Well, for starters I can't debug, as I get an error telling me that it cannot connect to the source database.

Now I create an alias in the 64-bit configuration bucket and I can debug. So what else can I do? The next thing is to set the project property Run64BitRuntime to false and try to debug with the 64-bit alias deleted. Guess what, it runs.

So the moral of the story is don't expect everything to work nicely for you out of the box all of the time, even if you're in a single technology environment.

Wednesday, October 24, 2007

Organising SSIS

I think this has probably been blogged about and documented elsewhere to death but it's well worth posting something about (even as a reminder to myself).

SSIS package configuration is great, but as always, as people use something heavily its usage gets more sophisticated and so do the problems. One thing I like is to be able to log centrally and also get configuration values centrally, but to do this I need to be able to support multiple environments (dev, test, etc.) and also multiple applications (the same variable used in different packages). This is where I've found the built-in configuration to be a little limited, so in true developer fashion we came up with an alternative configuration mechanism. I won't go into the details, but effectively we get all configuration values from a call to a database, along the lines of the sketch below.
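Purely as an illustration (the table and column names here are my own, not the actual implementation), the mechanism boils down to something like this:

-- Hypothetical central configuration table: each package asks for its settings
-- by environment and application name.
CREATE TABLE dbo.SSISConfiguration
(
    Environment     varchar(20)    NOT NULL,  -- dev, test, prod...
    ApplicationName varchar(50)    NOT NULL,  -- the SSIS "application" the package belongs to
    PropertyName    varchar(128)   NOT NULL,
    PropertyValue   nvarchar(400)  NOT NULL,
    CONSTRAINT PK_SSISConfiguration
        PRIMARY KEY (Environment, ApplicationName, PropertyName)
);

-- A package then pulls back its values with a single call at the start of execution.
SELECT PropertyName, PropertyValue
FROM dbo.SSISConfiguration
WHERE Environment = 'dev'
  AND ApplicationName = 'Warehouse Load';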

The only problem with this is that I want to be able to tell the package where the configuration database is, what environment I'm executing in or the application name I'm running under, but I can't do that without hard coding a config file into the package. Of course there are alternatives: a number of environment variables or registry entries. But really I just want to do this once and have control over everything else.

In my perfect world I would want to structure my data in something similar to this;

[image: proposed SSISRoot folder structure, with sub-folders for configuration files, source files, etc.]

This contains pretty much everything I need all in a single location. I can have multiple configuration files for different SSIS based systems, a place for source files and if I wanted to drop in XSD files I could create a new folder or just stick it in a suitable existing location. But this is when these little issues creep in.

How do I tell my package where the SSISRoot folder is? I don't want to use configuration files, as that defeats the object of having a single location where I keep my configuration files; very chicken and egg. So what alternatives are there? An environment variable can be used to point to the config file location, but then I need an environment variable for every config file I have, once again defeating the object. What about creating an environment variable that points to SSISRoot? That would deliver what I need. I tried this. I created the variable and called it from Start > Run as %SSISRoot%; it worked and I got the folder location that I wanted. The next thing to do was to put "%SSISRoot%/Config/SSISConfigFile.dtconfig" into the path for the file, however that had little success, and when I say little I mean none whatsoever.

So I'm back to square one: hard coding the package config file into the packages. This is one of those issues where I would be grateful to hear any clever workarounds that anyone has, so please let me know as this is getting increasingly frustrating.....!


Tuesday, October 09, 2007

SQLBits

Waking up early on a Saturday morning is not one of my greatest traits but I managed to overcome that weakness to attend SQLBits at the Microsoft campus in Reading at the weekend.

There was a track on BI as well as one on SQL Server 2008 so plenty for everyone to be interested in. The best moments for me were seeing my friend and colleague Sutha present on managing early arriving facts and the new spatial data types in SQL Server 2008.

I need to go through the details of Keith Burns' presentation again before I can post about it with any real conviction, but the general consensus within the room was "right here, right now I don't know what I'm going to use this stuff for, but I'm going to find a way to implement it". In other words, how we developer types define "cool".

BIDS and TFS

I'm a big fan of Team Foundation Server and how it can be used to better organise analysis, development and delivery. I've been especially impressed with a number of the features of Data Dude (or Visual Studio 2005 Team Edition for Database Professionals, for the full mouthful), certainly since SR1, and how, with some work, deployments can be made a lot more efficient.

Let me give you an example. I have multiple environments: dev, UAT and production. All of them will have the same code base but different configuration settings. For example, we will have a table containing configuration settings or source file locations, and these are all likely to change depending on what we are testing or deploying. Now Data Dude allows you to embed variables in your pre- and post-deployment scripts, meaning that I don't have to change anything when it comes to things like allowing accounts read access to data on dev that should not have read access on prod (see the sketch below).
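For illustration, a post-deployment script can key off a SQLCMD variable supplied by the build configuration. This is a minimal sketch; the variable, account and table names are made up for the example:

-- Hypothetical post-deployment script fragment. $(EnvironmentName) is a SQLCMD
-- variable supplied per build configuration; the grant only happens on dev.
IF '$(EnvironmentName)' = 'DEV'
BEGIN
    GRANT SELECT ON dbo.SourceFileLocation TO [DOMAIN\DevReaders];
END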

All pretty simple really and well worth installing for these kind of features (notice how I gloss over the cost....!)

Now to the point: "Visual Studio 2005 Team Edition for Business Intelligence Developers" has a nice ring to it, but stick it into Google and, at the time of writing, nothing comes back. I can think of a number of scenarios where non-intrusive deployment would be useful without having to rely on post-deployment XMLA, or scripts using rs.exe, or even the wonderful package deployment wizard.

Some of the things I've named may not seem too reasonable, but should you really have to go to a separate utility every time you want to release code with multiple settings depending on the server, the environment, or whether it's a 64-bit deployment or not? Analysis Services seems a prime candidate for this kind of behaviour as user access is defined in roles, but I don't want to have to define accounts within role membership and then remove them, or go to the Analysis Services Deployment Wizard and define different configuration files, when I've invested in a nice shiny Team Foundation Server and trained everyone up on its use.

Anyway, I would be most interested in hearing any thoughts on the above subject, as this kind of thing is taking up a lot of time and more and more clients are jumping onto the TFS / agile bandwagon, with one even demanding to know why this can be done with the databases and not the cubes. You see the point..!


Wednesday, September 26, 2007

Developing with MDM


There have been some recent announcements on Microsoft's acquisition of Stratature that have begun to solidify their approach and roadmap for Master Data Management. The MDM tools, as most people had previously guessed, will be very much aligned with the Office toolset and made part of SharePoint. Personally I think the key thing about this is how this data will be accessed in my day-to-day job of pulling data from here and putting it there for user consumption.

Jamie Thomson has a video on YouTube (embedded below) detailing how to consume web services in SSIS 2008. Couple this with Bulldog's (Microsoft's codename for the absorbed Stratature EDM+ product) ability to expose master data and metadata as web services, and all kinds of pre-built SOA components become possible.

Although possible with the existing set of tools in SSIS, I would be surprised if a data source component didn't appear at some point for SSIS 2008 that was specifically designed to retrieve master data from Bulldog. After the constant battles I've had getting master data out of organisations due to the cost of managing and maintaining it, having a tool that looks after the work flow and provides simple generic interfaces to the data will be very beneficial indeed.

Bring on the CTP next year....

Wednesday, May 23, 2007

Unit Testing SSIS

SSIS packages are almost mini applications in their own right. On most occasions there are one or more input items of data, which may consist of either a single variable value or an entire data set. This is followed by a transformation or validation process before a final output that could again be in a number of different formats.

Unit testing these packages should involve minimal change to the structure or behaviour of the package itself. The risk of influencing the code behaviour through the testing process is as great as an incorrect deployment or missing a particular testing scenario.

The most important factor in the process of testing a package is to understand how it will react in controlled circumstances and be in a position to test the success of the anticipated result. Testing this using the natural output of the package, for the reasons discussed previously, will provide the most robust results.

Due to some of the current debug limitations of SSIS, and taking into account the need to keep the package structure and design static, it is only really possible to effectively test the control flow of a package whilst remaining ‘hands off’.

Let's look at a simple package example;

The same type of output would be taken from the other standard control flow tasks. The Execute SQL task would have an initial state, a completion state and a resulting success state based on those outputs: an initial state of n rows in the destination table before the step executes and a completion state of 0 rows afterwards. This is measured by examining the row count in the table before and after execution and comparing the values with the expected result (see the sketch below). For the archive File System task all states would be measured using a different mechanism, and so on and so forth.
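As a rough sketch of what that measurement could look like (the table name and expected values are invented for the example), the test harness would run something like this either side of the task:

-- Hypothetical pre-execution capture, run before the Execute SQL task fires.
SELECT COUNT(*) AS RowsBefore
FROM dbo.StagingCustomer;

-- ...the Execute SQL task under test (e.g. a truncate) runs here...

-- Hypothetical post-execution check: compare the actual state with the expected result.
SELECT CASE WHEN COUNT(*) = 0
            THEN 'Pass: table emptied as expected'
            ELSE 'Fail: rows remain after execution'
       END AS TestResult
FROM dbo.StagingCustomer;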

Essentially this means that whatever the task there may be numerous methods of gathering the data necessary to confirm whether the test process has been successful. Simplifying the process of measuring the results of a test would make applying a standard testing mechanism far easier to implement.

Currently packages provide the ability to perform some manual logging, as I've posted about in the past. This can be used to establish whether tasks have completed successfully or not, but where a measurement is needed to confirm this, that type of logging is lacking. For example, truncating a table will not provide you with a row count confirming the number of rows affected or the subsequent number of rows left in the table, whilst a delete statement would. It would not be wise to change all truncates to deletes just to allow this information to bubble up to the logging process and therefore allow capture of the state of the task before and after execution.

There are a number of different ways of trying to perform strict, robust SSIS unit testing, which I've generalised into three options.

The first is to create a more robust testing methodology that has processes specifically for unit testing SSIS packages based on the control flow. The success criteria should be based on use cases defined at the package design stage, testing each of the steps within the control flow of the package with documented pre-execution criteria, expected output and the subsequent result.

This does not lend itself to any kind of automated testing approach and would involve stepping through the package control flow and logging the results.

The second option allows for the automation of tests and the capture of results but would require another application to control the execution of the process and establish pre-execute criteria in addition to measuring the post-execute result.

Using the unique identifier for each of the control flow tasks, during the pre-execute and post-execute phase of each step a lookup would be made to either a database or configuration file containing the method call that would initiate the processes necessary to prepare the execution and subsequently validate and measure the success on post-execute (a sketch of such a lookup follows).
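A minimal sketch of what that lookup store might hold, with invented names, keyed on the task's unique identifier:

-- Hypothetical test-criteria table consulted in the pre- and post-execute events.
CREATE TABLE dbo.UnitTestCriteria
(
    TaskGuid          uniqueidentifier NOT NULL PRIMARY KEY,  -- the control flow task's ID
    PackageName       nvarchar(128)    NOT NULL,
    PreExecuteMethod  nvarchar(256)    NOT NULL,  -- call that prepares the pre-execution state
    PostExecuteMethod nvarchar(256)    NOT NULL,  -- call that validates and measures the result
    ExpectedResult    nvarchar(256)    NULL       -- documented expected output from the use case
);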

A change like this would mean integrating a procedure call, perhaps using the CLR, to execute such tasks based on an internal variable indicating that unit test, or debug mode was enabled within the package. Whilst providing a number of advantages in automated testing and capture of the test results there would be a great deal of work required in the preparation of the test. This would all have to be completed in addition to that suggested in the first option as the pseudo code necessary to design each of the test criteria would be based on the use cases defined at the package design stage.

The final option would be to remove the more advanced mechanism from option 2. The pre-analysis and use case definition would still be required, but in this option additional test scripts would be placed in the pre- and post-execution events of the package. This would mean embedding code into the package that would only be used during unit testing, which could possibly be switched off using a similar variable to that in the second option.

Though it would be possible to automate a great deal of the testing for the package, it would mean changing the structure or behaviour of the package directly and increase the danger of introducing the kind of problems with script tasks that have previously been experienced on 32-bit to 64-bit transitions.

So there you go. There are ways of doing SSIS unit testing. There are even ways of automating the process but it's not a quick win. It doesn't remove the need to establish good formal testing strategies and it certainly isn't going to appear out of a box, well not soon anyway.

Monday, February 05, 2007

The Adventures of SSIS and Oracle in 64bit Land..!

There's plenty of information spread out across the web concerning migration of packages to a 64-bit platform. I thought, in the interests of fellow professionals out there, I should share some of my experiences.

Firstly, installation. Look at the C: drive on a 64-bit Windows Server 2003 installation and you'll notice two "Program Files" folders, one of which is suffixed (x86). 64-bit applications are installed in the Program Files folder and 32-bit applications are installed in the Program Files (x86) folder. So far so good and pretty simple, but this is where you will encounter the first of several problems getting Oracle connectivity to behave on a 64-bit server.

Before I go into any detail here's a list of things that you need to know when you have installed 64bit SQL Server on an x64 server.

  • Visual Studio (BI Dev Studio) is a 32bit application.
  • SQL Server Management Studio is a 32bit application.
  • dtexecui.exe is a 32bit application.
  • dtexec.exe has both 32bit and 64bit versions.
  • There are x64 and x86 versions of the Oracle provider available.
  • SQLAgent is a 64bit process.

OK, with all this information let's begin. Firstly, if you develop packages on a 32-bit platform, which is likely, and you are connecting to Oracle, use the Oracle OLEDB Provider, NOT the Microsoft provider for Oracle. For starters there is no 64-bit version of the Microsoft provider, and secondly MS has no plans to create or support one.

Now that you're using the Oracle OLEDB Provider we should be able to avoid a few problems later on, although I would like to make note of one problem I had but couldn't find a resolution for. In a project I'm working on we needed to execute an Oracle stored procedure before we could access the data. When we ran the procedure using the MS provider (MSDAORA) everything was fine, but as soon as we moved onto the Oracle provider (Oracle.OLEDB.1) the package would just hang as soon as it attempted to access Oracle. This magically rectified itself when we moved servers. Just letting you know, so beware.

Launching the 64-bit Oracle universal installer will provide you with four options: instant client, runtime, custom or administrator. Now I am not any kind of Oracle expert, so I didn't go near the "custom" option and stuck with the other three default selections, but I could only get the Administrator install to provide me with a working, registered Oracle provider. The other options may work for someone else, but on this server only Administrator worked.

With the Oracle provider installed and my connection string configured to use the Oracle.OLEDB.1 provider, there's a series of interesting little tests you can run. First create a package that just sucks some data out of an Oracle table and places it into a Row Count transformation in a data flow. We don't want to do anything else, just get the Oracle end of this process working. Execute the package in the following ways;

  1. Double click on the package and execute.
  2. Execute it through the command line interface (using the dtexec.exe in C:\Program Files (x86)\Microsoft SQL Server\..\dtexec.exe).
  3. Execute it through the command line interface (using the dtexec.exe in C:\Program Files\Microsoft SQL Server\..\dtexec.exe).
  4. Execute it as a SQL Server job.

The interesting thing is that the first two methods of executing the package will fail and the second two will be successful. Why? Well, this is because a 32-bit process cannot see the 64-bit drivers. The first method initiates dtexecui.exe, which is a 32-bit process, the same as the dtexec.exe in the (x86) program files folder. Running the process using dtexec.exe in the normal Program Files directory, or as a SQL Server job, executes the package as a 64-bit process, which quite happily sees the 64-bit Oracle drivers.

This can prove to be a problem if you want to debug your package, as Visual Studio is 32-bit and can't see the Oracle drivers, meaning you'll get messages about things not being registered even though you know you registered them. Fortunately you can install the 32-bit and 64-bit Oracle drivers on the same machine, meaning each process type then has the ability to see the Oracle drivers. Run the same four methods of executing our test package and, surprisingly, the same thing will happen: the first two will still fail and the second two will succeed.

This issue is down to a problem with the Oracle provider. The Oracle provider appears to have a particular dislike of being used by processes run from a path containing brackets, Program Files (x86) being one of them. So even though you have correctly installed the 32-bit Oracle providers on the server, you can't use them. The solution: copy the Microsoft SQL Server folder and the Visual Studio folder into a new folder on your C: drive called ProgramFilesx86, and all of a sudden your development environment can see the Oracle drivers and work. You need to use the dtexecui.exe in there as well if you want to interactively execute packages.

Remember though, this is only an issue if you need to debug packages that use Oracle providers on a 64-bit server. Quite a specific problem, but quite common nonetheless.

Now that problem is solved(ish), you can start to build beautiful, complex packages utilising every control and data flow task available to you. But then you get strange errors. Ones that are centred around script tasks especially, BUT not always.

Personally I've had a couple of issues moving packages onto 64-bit servers, excluding the Oracle problems mentioned above. First, script tasks sometimes need to be recompiled on the 64-bit hardware. This is kind of hit and miss, but it appears to be common in scenarios where you have a template package and you copy it and change the definitions in the data flows. The recommendation is to open up each package and script on 64-bit to be on the safe side, although I have had a number of occasions where we've had to delete the script task and recreate it to get the same code working correctly. Very odd.

This issue doesn't seem to be isolated to script tasks though. Execute Package tasks sometimes exhibit similar behaviour, where reopening the package on the 64-bit hardware corrects it.

Overall the benefits of SSIS and 64-bit are obvious, but be careful during development, and if you can't develop on 64-bit then try to at least have a 64-bit server for testing this kind of thing. If you do get any strange issues cropping up, try to bear in mind the apparent inconsistencies between the two platforms.

Update:

Corrected some spelling mistakes and disgustingly bad grammar..!


Monday, December 11, 2006

Making The Client Happy - The Real Reason For Yesterday's Post That Rapidly Turned Into Some Kind Of Rant...!

If there's one thing that users hate it's being kept in the dark. So communication is essential, and to be honest BI is all about communication. Articulating a series of 0s and 1s, via OLAP or reporting, into something tangible that an end user can get real benefit from is what it's all about.

So we have a process that may rely on another process outside of the control of the BI infrastructure. This is a typical scenario and, due to the 24-hour worldwide nature of a lot of the BI implementations I've worked with, that process may be smack bang in the middle of someone's working day.

Scenario established, what are the problems we have here? Well, first there's the problem of processing whilst people are working, but with the performance tuning techniques available to us in the Microsoft BI space that shouldn't have too much of an impact on your user base. Really you just want to deliver new data to the user as quickly as possible.

The real problem here is a user not knowing when their new data is available. Does that balance include yesterday's transactions yet or not? That's a typical question asked in finance departments at month end, and something that would be asked a lot more regularly than you would like.

So communication is a problem, or lack of anyway (see the way I did that, started off talking about it and brought it round again to the subject, you're impressed, I can tell..!).

What can be done about it? There are a number of clever little things we should be able to do as part of our BI profession: use our skill set and let the consumers of our data know the data is there to be consumed. First method: deploy a report to the user at the end of processing, or send an information e-mail. Database Mail in SQL Server 2005 makes this a pretty simple task these days (see the sketch below). This could be a standard mail telling the user base to go to a Reporting Services link and open a report telling them what has happened in the data processing. The problem with this, though, is that it relies on a patient user base who will happily stare at the mailbox until that little shining light of hope gets sent from "SQLSVR978BSMNT" (I do miss the days when there were fewer servers and they had names based on characters from The Simpsons or something like that).
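For instance, a notification at the end of the load could be as simple as the following Database Mail call (the profile, recipients and wording are all made up for the example):

-- Hypothetical end-of-processing notification using SQL Server 2005 Database Mail.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'BI Notifications',
    @recipients   = 'finance.users@example.com',
    @subject      = 'Overnight warehouse load complete',
    @body         = 'Last night''s data has been loaded and the cube processed. The processing summary report is available on the portal.';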

OK, so your client / user isn't patient, or it's 4:30 on the Friday before Christmas (topical, huh..!), either works, and you don't want to have to answer 20 calls from finance departments around the world demanding to know where the data is. Let's look at what they need to know. Take a typical end-to-end process; it may go something like this;

  • Extract data from one or more source systems.
  • Process said new data into the warehouse / mart / etc.
  • Process data into cube.
  • Distribute reports (or warm the cache; it's useful to have this catch-all at the end)

Now this is a high-level generalisation of what may be going on, but being able to let the users know this much gives them a massive increase in their perception of the system. Even just knowing that the source system data has not been extracted empowers the user, if they are aware of where the data is coming from (and why shouldn't they be?), to get on the phone and ask New York why their figures haven't been entered yet.

I've already talked about making this happen by sending reports out, but let's face it, this isn't the ideal way of doing it when there are numerous users and numerous steps within the process to communicate. So we make the user pro-active: go to the portal or website and find out what the hold-up is or where the problem lies. All we're doing is taking something that has been happening in customer service departments for years and applying it to the BI systems that we put in place, and to be honest, with our experience we really should know how to provide someone with beneficial information in a number of formats.

So what is my point in all this? There has been an idea bubbling around my head for a few months now around this subject: visualising end-to-end processes for clients, mostly based on SSIS and the simple logging model I've posted about previously. The idea was to use Visio to expose data in a diagrammatic format to the users, showing them something similar to the SSIS control flow, you know, greens, reds, yellows. This became all the more interesting after the recent UK BI user group at Reading, where David Parker was presenting on visualising data in Visio, essentially the same topic.

There, that's the idea, now I'm going to put it into practice. Stay tuned and if anyone else has had any experience in doing what I've said then I would be really interested in hearing.

Tuesday, October 17, 2006

Failure - Data Source Does not Exist..! - Update

I got asked to post the demo project I mentioned in the original post a little while ago, but unfortunately I had lost it. Luckily I was able to pull enough from memory to build it again and make it available for your pleasure. Whilst not built to premier coding standards, it certainly provides some direction on how this kind of thing is achievable.

The project is here, any questions then mail me at the usual address.

Apologies for the delay :)

Wednesday, October 04, 2006

Simple SSIS Logging - Part 3, Bad, Bad Rows..!

It's all well and good knowing that there has been a failure in your package which has led to a bleary-eyed support technician getting phoned at 3 in the morning, but sometimes it's not really necessary to get Barry (we'll call him Barry for the purposes of this post) out of bed. Your package may be bullet proof and work perfectly, but the nature of data is that it will change and can quite often come back and bite you on the backside.

Take, for example, early arriving facts. When you start integrating a number of systems it is possible for things to start happening that don't conform to the world according to the system owner. Take the example of an in-house order system operating completely separately to the company's CRM system. During the analysis phase of the data warehousing project it was determined that CRM was the source of Customer ID, therefore that's where customers get created and that's where customers get maintained. But a new customer phones up and makes an order before the customer's details are completed in the CRM system, due to one thing or another. Your data warehouse is quite likely to be collecting that fact row before the relevant CRM data has been correctly populated and passed on to the customer dimension. So what happens in your SSIS package when looking up the Customer key for that dimension? Well, not a hell of a lot.

To be fair there are numerous techniques for dealing with early arriving facts. Some involve the population of the Customer ID and retrieval of the key so that when the correct dimension member arrives in the dimension table, its attributes are populated automatically. The problem is that things aren't always as clear cut as that. You may have managed to capture the early arriving fact data and correctly allocate it to members in the Customer dimension, but when board reports run that specify a particular customer type the data is incorrect, the numbers don't roll up, and for 24 hours you could be running around like a headless chicken trying to figure out where the missing data has gone. Oh, that's until the next day when the customer dimension member is correctly updated and things mysteriously fix themselves.

So where does logging fit into all of this? Well knowing that we have an early arriving fact would be a good idea. So we have to log the early arriving fact and let the data pass through to allow it into the fact table.

This looks quite simple, but when you're trying to handle a number of dimensions, any of which could have early arriving facts, you can end up with a lot of data destinations. In addition, each package could have a different error table at each dimension lookup, making logging, which should really be centralised and in a defined structure, a little reactive and difficult to deal with. Dealing with this may not be as difficult as it seems though, and for the first time I'm going to mention a custom SSIS task that has been written with this in mind. A colleague of mine has written a component for just this type of situation. Whilst I would have liked to have completed all 3 parts of this Simple SSIS Logging piece without referring to any external code, this really does make life a lot easier. The full source code is available on the site and I recommend its download.

The Xmilfy component (to give it its officially christened name) will take a series of columns and convert them into a single XML column, which has some obvious benefits. In the above example the data is then written to a single table in my SSIS logging database with a column of data type XML, allowing the data to be viewed in a number of ways (a sketch of such a table follows). Extra information can be added into the data flow that specifies where the error occurred, the number of rows, etc., again creating detailed and useful reporting. Want to know the number of early arriving facts for a particular package or even a particular dimension? Easily done using the data in the table. Finally, a warning can be raised using script to inform Barry, remember him, that some data problems occurred, but now it's a warning and can be handed over to the department in charge of wielding the large stick that is used to stop sales people logging sales before the customer is correctly created. Furthermore, the system can be automated to alert said department, taking Barry completely out of the loop. But wait a second, Barry is now worried about losing his job. Luckily he's been learning SSIS in his spare time.
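As an illustration only (the column names are mine, not the component's), the central bad-row table might look something like this:

-- Hypothetical central bad-row table: the offending row is stored as a single XML value.
CREATE TABLE dbo.BadRowLog
(
    BadRowLogID int IDENTITY(1,1) PRIMARY KEY,
    PackageName nvarchar(128) NOT NULL,
    SourceTask  nvarchar(128) NOT NULL,   -- e.g. the dimension lookup that failed
    RowData     xml           NOT NULL,   -- the early arriving fact row, XML-ified
    LoggedAt    datetime      NOT NULL DEFAULT GETUTCDATE()
);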

It’s ok people, Barry is fine…!

Friday, September 29, 2006

SQL Server 2005 Metadata

Hidden in the depths of Microsoft's useful download division in Redmond is a handy pair of utilities you may have discovered that analyse dependencies within and across SSIS and Analysis Services. The first tool, the Dependency Analyzer, examines the metadata contained within SSIS packages and Analysis Services objects, collating their inter-dependencies and writing the results to a pre-defined database structure.

The second tool is the Dependency Viewer, which provides a graphical interface displaying relationships between the analysed objects.


The amount of data is considerable and has some quite powerful possibilities. There has been a distinct gap in the MS BI product set, dependency and data analysis being two key areas that their competitors eulogise about at length. This appears to be the beginning of some gap plugging before the release of Katmai (SQL Server 2007). In the meantime all the source code, examples and even some reporting interfaces are provided in the download.

I'm quite looking forward to plugging this tool into some of my end-to-end 2005 projects and looking at the results. Once I've got some good examples of how everything integrates together, I'll post some of the results.

Dynamic Partition Creation and Processing – Part 2

The first part of this post discussed how to identify whether a partition exists and how to create a partition dynamically. Although this post is a little on the suck-eggs side, I thought it would be better to write it for completeness. Processing a partition can be performed using the exact same techniques that are used for dynamic partition creation, thus the suck-eggs comment.

First let’s look at the XMLA query for processing a partition similar to the one created in the previous post.

<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Object>
        <DatabaseID>Adventure Works DW</DatabaseID>
        <CubeID>Adventure Works DW</CubeID>
        <MeasureGroupID>Measure_Group_Name</MeasureGroupID>
        <PartitionID>200609</PartitionID>
      </Object>
      <Type>ProcessFull</Type>
      <WriteBackTableCreation>UseExisting</WriteBackTableCreation>
    </Process>
  </Parallel>
</Batch>

Note there’s no mention of the partition name anywhere. The partition is processed using its ID and that is all we are really interested in. Replacing just that value in the Source expression in an SSIS Analysis Services Execute DDL Task will point us to the correct partition for processing.

Of course, if the entire cube structure is built on the same partition naming strategy then creating a single script to process all of this period's partitions is relatively easy. The added bonus of being able to process partitions in parallel can prove a huge advantage in reducing the time taken to update the cube on whatever schedule you have.

These examples are based on cubes structured with monthly or period-based partitions. Partition strategy is an important subject for both cube querying and processing performance. I'll leave that to another post, but I just wanted to state that employing a rules mechanism that, say, merges period-based partitions into a single year partition whilst creating new partitions on the fly does not need to be a massively complex process. The methods I've discussed in these two posts are probably some of the simpler ways of implementing this.

Tuesday, September 26, 2006

Dynamic Partition Creation and Processing – Part 1

Processing cubes in SSIS is pretty simple if all you want to do is pick a partition and process it, but if that partition is time based, which is quite likely, then the partition you process each day, week or month will change. So what are your options?

First, you can change your package each month, or use a variable with the partition name. Probably not the best of options, as someone is going to have to physically make the change each month or come up with some clever method of changing your variable using something like VBA, Excel, etc. Inefficient if you ask me, and defeating the object. Couple this with creating a year's worth of partitions, or creating them every time a new one is required, and the whole task becomes quite laborious.

So what else can be done? My personal favourite is to use dynamic XMLA, but this has a few pitfalls of its own. First let's look at the XMLA script for creating a partition and some of the key tags you have to be aware of.

<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Adventure Works DW</DatabaseID>
    <CubeID>Adventure Works DW</CubeID>
    <MeasureGroupID>Fact Internet Sales 1</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>Internet_Sales_2001</ID>
      <Name>Internet_Sales_2001</Name>
      <Annotations>
        <Annotation>
          <Name>AggregationPercent</Name>
          <Value>20</Value>
        </Annotation>
      </Annotations>
      <Source xsi:type="QueryBinding">
        <DataSourceID>Adventure Works DW</DataSourceID>
        <QueryDefinition>SELECT * FROM [dbo].[FactInternetSales]
        WHERE OrderDateKey &lt;= '184'</QueryDefinition>
      </Source>
      <StorageMode>Molap</StorageMode>
      <ProcessingMode>Regular</ProcessingMode>
      <Slice>[Date].[Calendar].[July 2006]</Slice>
      <ProactiveCaching>
        <SilenceInterval>-PT1S</SilenceInterval>
        <Latency>-PT1S</Latency>
        <SilenceOverrideInterval>-PT1S</SilenceOverrideInterval>
        <ForceRebuildInterval>-PT1S</ForceRebuildInterval>
        <AggregationStorage>MolapOnly</AggregationStorage>
        <Source xsi:type="ProactiveCachingInheritedBinding">
          <NotificationTechnique>Server</NotificationTechnique>
        </Source>
      </ProactiveCaching>
      <EstimatedRows>1013</EstimatedRows>
      <AggregationDesignID>AggregationDesign</AggregationDesignID>
    </Partition>
  </ObjectDefinition>
</Create>

Above is the create script of one of the partitions from the Adventure Works cube. There are some obvious elements in there but the ones that you need to be aware of when building your script dynamically are as follows;

ID – The most important one. Even though this is the same as the name of the partition, there's a couple of things to remember. Firstly, they don't have to be the same; secondly, by default it's named after the partition name. Why is this important? When you first create a measure group in your Analysis Services project (in Visual Studio) the default partition is named after the measure group name, and so, subsequently, is the ID. The ID is the unique identifier of a partition within a measure group, so if your partitions are named MeasureGroup1_200608, MeasureGroup1_200609 you may find the IDs are MeasureGroup1 and MeasureGroup1_200609, and processing a partition in XMLA is based on the partition ID.

The partition ID is something you can't change in an Analysis Services project, well I haven't found a way yet, so in my little world it can't be changed..! The first thing I would do is create a duplicate partition in the cube design and name it what I would like the ID to be. Once created, you can change the name to whatever you like; the ID won't be changed.

The Name element becomes pretty arbitrary, but you still need to ensure that every partition you create has a different name; again, if you want it to be the same as the ID then name it accordingly.

The final two are the QueryDefinition and Slice. This is where you may have to step back and think about the overall design of cubes and their dimension structures. For the purposes of this post I'm talking about partitions being sliced by date, because in my experience that's how most cubes get sliced. Where it's suitable I would try to make the key of the date dimension a meaningful value, i.e. 20060926 = 26th September 2006. Before we go into multiple discussions about why this is a good or bad idea, stop; it's something I do, I know its pros and cons, but I've found it to be beneficial on a hell of a lot of occasions, this example being one of them.

My QueryDefinition value will be the query that forms the dataset my partition is to be processed from. Generally you'll see something with a where clause like "WHERE DateKey BETWEEN 20060900 AND 20060999" if the partition is query bound. If the partition is table bound, more than likely you're using views to distinguish the different data sets. Finally, the partition slice. Even though this is documented as being for information purposes only, there have been some well-documented grey areas when not populating it. My recommendation is to put a value in anyway, even if it is just for information purposes. The slice value for, hmmmm, let's say the September 2006 partition in my cube would look something like "<Slice>[Date].[Calendar].[September 2006]</Slice>", or ideally "<Slice>[Date].[Calendar].&[200609]</Slice>". The key-based slice makes it clear how you need to be considering this kind of thing when designing your dimensions at the data mart as well as the OLAP stage, as we're making the month level of the dimension a critical value.

Looking at those key elements of the XMLA statement for creating a partition we get something like this;

<ID>200609</ID>
<Name>Measure_Group_Name_200609</Name>
<QueryDefinition>
    SELECT *
    FROM FactTable
    WHERE DateKey BETWEEN 20060900 AND 20060999
</QueryDefinition>
<Slice>[Date].[Calendar].&[200609]</Slice>

Each of the elements has the common value of "200609", which can be replaced, allowing partitions to be created before processing, in this example at month end. To actually execute this code within an SSIS package I populate the Source expression of an Analysis Services Execute DDL task. This can be done either via direct input or a variable (selected by the SourceType property). The expression takes the variable, or direct input XMLA, and evaluates it based on any parameters you need to input. The parameters in this case would replace the relevant references with 200609 to correctly create the new partition.

A couple of things to remember when building an expression of XMLA like this: the script will have to be in double quotes, and any double quotes within the script will have to be escaped as \". That's a backslash followed by a double quote.

Finally, before you run the XMLA it would really be good practice to know whether the partition already exists before you attempt to create it. AMO (Analysis Management Objects) is the answer to that. The script below, kindly mixed, matched and cribbed from those lovely Project REAL chaps, will populate an SSIS Boolean variable with the answer to whether the partition exists or not. Don't forget to add AMO as a reference in the SSIS script task.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports Microsoft.AnalysisServices

Public Class ScriptMain

    Public Sub Main()

        Dim WellIsItThere As Boolean
        Dim ASConnection As ConnectionManager
        Dim ASServer As String
        Dim ASDatabase As String
        Dim ASCube As String
        Dim ASMeasureGrp As String
        Dim ASPartition As String

        ASPartition = Now().ToString("yyyyMM")           ' This will give us "200609"
        ASConnection = Dts.Connections("AnalysisServer") ' The name of the connection manager you're going to connect to in the package
        ASServer = CStr(ASConnection.Properties("ServerName").GetValue(ASConnection))
        ASDatabase = CStr(ASConnection.Properties("InitialCatalog").GetValue(ASConnection))
        ASCube = "Adventure Works"                       ' How you populate this is really up to you
        ASMeasureGrp = "Internet Sales"                  ' And again the population of this is up to you

        WellIsItThere = IsPartitionThere(ASServer, ASDatabase, ASCube, ASMeasureGrp, ASPartition)

        Dts.Variables("PartitionExists").Value = WellIsItThere

        Dts.TaskResult = Dts.Results.Success

    End Sub

    Public Function IsPartitionThere(ByVal sServer As String, _
                                     ByVal sDatabase As String, _
                                     ByVal sCube As String, _
                                     ByVal sMeasureGroup As String, _
                                     ByVal sPartition As String) As Boolean

        ' By default, we will assume that it isn't there.
        ' Only if we get all of the way to the end and everything is found
        ' will we set it to true.
        Dim bIsPartitionThere As Boolean = False
        Dim oServer As New Microsoft.AnalysisServices.Server

        ' Connect to the server and start scanning down the object hierarchy.
        oServer.Connect(sServer)

        Dim oDB As Database = oServer.Databases.FindByName(sDatabase)

        If oDB Is Nothing Then
            MsgBox("Did not find expected database: " & sDatabase, _
                   MsgBoxStyle.OkOnly, "Error looking for partition")
            GoTo Done
        Else
            Dim oCube As Cube = oDB.Cubes.FindByName(sCube)
            If oCube Is Nothing Then
                MsgBox("Did not find expected cube: " & sCube, _
                       MsgBoxStyle.OkOnly, "Error looking for partition")
                GoTo Done
            Else
                Dim oMG As MeasureGroup = oCube.MeasureGroups.FindByName(sMeasureGroup)
                If oMG Is Nothing Then
                    MsgBox("Did not find expected Measure Group: " & sMeasureGroup, _
                           MsgBoxStyle.OkOnly, "Error looking for partition")
                    GoTo Done
                Else
                    ' -- This is the real test -- to see if the partition is really there.
                    Dim oPart As Partition = oMG.Partitions.Find(sPartition)
                    If Not oPart Is Nothing Then
                        ' This is the only place to set the value to TRUE.
                        bIsPartitionThere = True
                    End If
                End If
            End If
        End If

Done:
        oServer.Disconnect() ' Disconnect from the server -- we are done.

NoConnection:
        Return bIsPartitionThere

    End Function

End Class

Tuesday, September 12, 2006

Simple SSIS Logging - Part 2, Devil in the Detail

Ok so now you know when a package process starts, when it finishes and its final status (See Part 1). What we don’t have is the ability to accurately analyse the individual steps within a package and make effective decisions based on the information available.

First you will have to decide what you are looking to track within the system. Obviously errors are the most important element, as fast resolution is essential. Next should be warnings. The warnings generated within SSIS are key to identifying areas that may have performance issues. A common warning in SSIS is where a column is brought in from a data source but subsequently not used anywhere in the package. SSIS will raise a warning at every transformation informing the user that these unused columns may hinder performance, although they do not affect the final outcome of the package.

For these examples we’re writing our log to a database so it’s necessary to have a structure to hold it. A nice simple structure should let you know the type of alert, timestamp, source task and any other descriptive information (such as error code and error description).

When you have the table structure you can write an insert statement; the most sensible option is of course a stored procedure with a number of parameters serving your needs, something like this;

usp_WriteSSISEvent @ServerName, @EventType, @SourcePackage, @SourceTask, @ErrorCode, @ErrorDescription, @EventTimeStamp
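A minimal sketch of what the backing table and procedure could look like (the names follow the call above; the column sizes are guesses of mine):

-- Hypothetical central event log table.
CREATE TABLE dbo.SSISEventLog
(
    SSISEventLogID   int IDENTITY(1,1) PRIMARY KEY,
    ServerName       nvarchar(128)   NOT NULL,
    EventType        nvarchar(50)    NOT NULL,  -- OnError, OnWarning, OnPreExecute, OnPostExecute
    SourcePackage    nvarchar(128)   NOT NULL,
    SourceTask       nvarchar(128)   NOT NULL,
    ErrorCode        int             NULL,
    ErrorDescription nvarchar(2000)  NULL,
    EventTimeStamp   datetime        NOT NULL
);
GO

-- Hypothetical wrapper procedure called from the event handlers.
CREATE PROCEDURE dbo.usp_WriteSSISEvent
    @ServerName nvarchar(128), @EventType nvarchar(50),
    @SourcePackage nvarchar(128), @SourceTask nvarchar(128),
    @ErrorCode int = NULL, @ErrorDescription nvarchar(2000) = NULL,
    @EventTimeStamp datetime
AS
BEGIN
    INSERT INTO dbo.SSISEventLog
        (ServerName, EventType, SourcePackage, SourceTask, ErrorCode, ErrorDescription, EventTimeStamp)
    VALUES
        (@ServerName, @EventType, @SourcePackage, @SourceTask, @ErrorCode, @ErrorDescription, @EventTimeStamp);
END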

So how do these variables get populated? Well, for the most control over what gets written and where (and it's pretty simple to boot, hence the title), I'm first going to put an Execute SQL task into the OnError event handler. After setting the connection manager, set the SQL statement via an expression. If you've not done this before it's a very useful thing to know about, but to cut a long story short, it's possible to set any number of task properties via an expression.

The expression will effectively be the constructed SQL statement with your parameters populated via a combination of system variables and direct input. Here's the statement for an OnError event, based on the above stored procedure example.

"EXECUTE [usp_WriteSSISEvent]
    @ServerName = '" + @[System::MachineName] + "'
    ,@EventType = '" + "OnError" + "'
    ,@SourcePackage = '" + @[System::PackageName] + "'
    ,@SourceTask = '" + @[System::SourceName] + "'
    ,@ErrorCode = " + (DT_WSTR, 15)@[System::ErrorCode] + "
    ,@ErrorDescription = '" + @[System::ErrorDescription] + "'
    ,@EventTimeStamp = '" + (DT_WSTR, 30)GETUTCDATE() + "'"

Yeah, I know it looks a bit of a mess, but trust me, this works.

The pain with this method is that it's not possible to populate @EventType via a system variable, meaning you have to populate it manually for each event type you wish to record. The other gotcha is that the system variables differ depending on the event type. For example, of the 4 I usually use, OnError, OnWarning, OnPostExecute and OnPreExecute, only errors and warnings have the error code and description. Something to be aware of.

If you want to see it working then I have an example project all ready to send; just mail me at mailto:steve@stevemchugh.com?subject=Simple%20SSIS%20Logging%20Part%20Deux and I'll send you a copy. It has the database structure as a script, so there's not too much effort involved in having a look.

Yet Another Tip as Cunning as a Fox Who's Just Been Appointed Professor of Cunning at Oxford University

SSIS package deployment, in my opinion, is a right royal pain in the Aris (I'm sure there's some kind of wiki out there for cockney rhyming slang; there certainly is for Blackadder quotes).

This became evident whilst trying to successfully deploy password-protected packages to folder locations on SSIS servers. So again I came up with a plan to make my life a little more bearable. A plan as cunning as a fox that used to be professor of cunning at Oxford University, but has moved on, and is now working for the UN at the high commission of international cunning planning..!

All my SSIS projects of significant size now include a package called sys_DeployPackages. This does something very simple: it deploys packages. A couple of variables such as the source folder and the destination, the password I want to use to encrypt the packages, the location folder on the SSIS server, and hey presto. Put the variables into a config file and you can then deploy the same files to another server with just a double click.

All this with just the joining of the Execute Process task and dtutil.exe, a combination so cunning you could brush your teeth with it.





Another Little Tip

I don't know if you sometimes have a package that contains lots of Execute Package tasks. Lord knows I do, and often too. This is especially the case when pulling a number of tables out of a source system into a staging database.

Well, all those individual packages have to have their properties set and their names changed, and quite frankly double clicking on each of those tasks and waiting for them to spring into life is something I'm not that keen on.

As an expert in trying to make life a tad easier, I now set the package name property via an expression that cunningly grabs the Integration Server path from a variable along with the task name, the combination of which is the full location of the package on the Integration Server.

All I have to do then is copy and paste the task X amount of times and change the name of the task to the package I want to execute. Lazy I hear you say, and there was me thinking I was working “smarter”…!

Wednesday, August 30, 2006

A little tip...

...you know in SSIS when you have to create a parameter-based OLE DB Command (UPDATE tablename SET column1 = ?, column2 = ?, etc)? Well, an easy way of getting the SQL statement together without the pain of typing those ?s in yourself is this.

Right click on the table in SQL Server Management Studio and create a script to update the table you're looking to update. Then do a Ctrl+H (search and replace), which gives you the option to replace based on a wildcard search. Search for "= <*>" and replace with "= ?" and voila, one update script ready for pasting into the OLE DB Command.

Mind you it doesn’t help when you then have almost 100 columns to map out :(

Thursday, August 03, 2006

Simple SSIS Logging - Part 1, The Load ID

Logging in SSIS gives you plenty of options. You have the built-in event handling as well as the purpose-built logging options. How this is used tends to be pretty different depending on what project you're working on, and in the majority of cases you'll find yourself implementing something very similar every time; then you'll have to implement something very similar into every package on every project. This is time consuming and a pain, and you can lose out on a lot of information that, to be honest, you will need later on for administration.

Let me break it down into a few simple key elements that I believe are important in BI based SSIS projects. Firstly you need to be able to capture the start and end time of a particular process. Secondly, absolute traceability of the data from the source system interfaces through to the warehouse storage itself. Finally, organising the output from any logging into a known structure for later analysis.

The concept of a load ID is fairly familiar to anyone moving data out of a source into a staging database or, if you're skipping the middle man, straight into a warehouse. The load ID can be a unique indicator based on the specific process execution that extracted the data, or it can identify a number of loads to different tables. In the second case the source dataset itself becomes part of the identifier.

Typically the start-load process will populate a known table in a logging environment somewhere and return a Load ID to the package. On a recent project this was built into its own component that had properties such as the connection. The returned Load ID then gets tagged onto each of the records that are written in the Data Flow Task. Using this information, along with the data source or file, you have full end-to-end coverage of where a record came from and where it has ended up, which for the purposes of this post is the staging database.

Obviously there's a little more to this than just putting a script at the top and bottom of the control flow, but not much more. The structure of the table that both provides the Load ID and logs the start time, end time and status of a particular load is pretty simple.

[image: load log table structure]
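Something along these lines (the names are mine, for illustration; the image showed the original structure):

-- Hypothetical load log table: provides the Load ID and records timing and status.
CREATE TABLE dbo.LoadLog
(
    LoadID     int IDENTITY(1,1) PRIMARY KEY,              -- returned to the package at the start of the load
    SourceName nvarchar(128) NOT NULL,                     -- the source system, dataset or file being loaded
    StartTime  datetime      NOT NULL DEFAULT GETUTCDATE(),
    EndTime    datetime      NULL,                         -- populated by the end-of-load step
    LoadStatus varchar(20)   NOT NULL DEFAULT 'Running'    -- Running, Succeeded, Failed
);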

There's a bunch of other information you can attach to this table, such as the package name, source server, pretty much anything you like. The important thing is capturing the core data, and the real power comes from the reporting available from such a narrow dataset. Add the environment type and you can make comparisons between development, test and production performance (see the example query below). Add in a process identifier and you can see if one of your processes for transforming BI data is affecting a daily load task performed on one of your transactional systems.
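For example, assuming the environment and package name columns mentioned above have been added to the sketch table, a comparison report is a one-liner:

-- Hypothetical report: average load duration per environment and package.
SELECT Environment, PackageName,
       AVG(DATEDIFF(second, StartTime, EndTime)) AS AvgDurationSeconds
FROM dbo.LoadLog
WHERE LoadStatus = 'Succeeded'
GROUP BY Environment, PackageName
ORDER BY Environment, PackageName;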

Although this provides a very simple mechanism for logging and tracking SSIS packages, that's all it does. For even more detailed information, and to track problem areas within a package, a whole new level of logging is required. That I'll go into in another post. I am of the view that structuring logging in this fashion not only provides benefits for administration after us consultant types have cleared off site, but also speeds up implementation when you have a few simple SSIS components and a single logging database to deploy. All this while still being able to apply the K.I.S.S.* rule to the process.

* Keep It Simple Stupid

Wednesday, August 02, 2006

Failure - Data Source Does Not Exist..!

One of the SSIS interfaces I have been working on for a project pulls data from 10 identically structured databases. The task we had was to pull data for a particular day, supplied via a parent variable, and place the data into a single file for import into the staging database of the BI infrastructure.

The problem was that if one of the data sources was down, the WAN connection was playing up, or there was any other problem that would cause one of the data sources to fail, the entire package would fail. We didn't want this behaviour, because the client wanted as much data as possible rather than a complete failure.

For starters the OLEDB source wasn't going to help us in this situation, so we had to go for a scripted data source, changing the AcquireConnection code to use a simple Try-Catch statement. This worked well and does the trick. We can run the entire package without any problems, building up the connection string from variables that are passed into the script and connecting to the required single data source. But then another requirement popped up.

How do I know when a data source has failed? I don't want the package to fail, but a warning would be nice and would enable us to capture the event in our logging mechanism. Enter the FireWarning method. This little diamond, along with its cousins FireError, FireInformation and FireProgress, allows you to place custom entries into the SSIS logging.

I have an example package built on Adventure Works if anyone wants it and, although quite simple, I have come across a number of instances of connecting to multiple data sources (including flat file connections) where this behaviour is needed.