In the last few days I tested several ways to generate surrogate keys when loading a dimension into a data mart.
The classical approach with SQL Server is to rely on a INT IDENTITY column. It works fine, but you don’t know what surrogate key has been generated for a given application key until you read the row from the table. Not a big issue with DTS, because you can’t do very much (in a way faster than SQL itself) even if you know this number, but with SSIS if you want to push performances to the limit, you need to load a fact table within the dimension table(s) without reading what you just writed into a table.
The first step in this direction is to generate surrogate keys inside a SSIS package. If you do a complete dimension load at each run, you can solve the whole problem using a scripting component inside a data flow task. Read this article from Jamie Thomson to see how you can implement it. But life is never as easy as you could desire, so when you have to incrementally load a dimension table then you need to give to this script the right initial value (the last ID currently used instead of 0).
After several approaches discussed on the newsgroup, I came to the conclusion that the most affordable way to implement it is a not-so-desirable one:
- define a Execute SQL Task which execute a SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID FROM Dimension and put the single row ResultSet into a variable (let’s say we use User::ID_Dimension)
- define a Data Flow Task in the same way Jamie did in his article
- put the User::ID_Dimension variable into the ReadOnly Variables collection in the Script Component
- define PreExecute method inside the script to initialize the value for counter you will use to enumerate new surrogate keys (see code below)
Public
Class ScriptMain
Inherits UserComponent
Dim counter As Int32
Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
counter = counter + 1
Row.IDDimension = counter
End Sub Public Overrides Sub PreExecute()
MyBase.PreExecute()
counter = Variables.IDDimension
End Sub
End Class
That’s it. You can download the sample package here.
Now, don’t pretend this solution to be faster than a SQL Server INT IDENTITY column… by now, it isn’t, but we can’t judge this with current CTPs. But don’t desperate: it could be better in the future and, more important, if you really need to use the generated keys in other transformations you could already take advantage from this architecture.
Why it’s a not-so-desirable way? Well, I don’t like to rely on a variable defined in the control flow while all the dimension-load logic is inside a data flow task. Imagine how ugly could become a SSIS package if you load 20 or 30 dimensions; of course you can group two or more tasks together, but I consider this only as a work-around. There is another issue if you would like to update your somewhere-in-the-package variable with the last generated surrogate key, because you can’t use the same variable as ReadOnly and ReadWrite (in the Script Transform) and you can’t read a variable passed as ReadWrite outside the PostExecute method (I’m right: you can only read/write it inside PreExecute overridden method, you can read it in the PreExecute methos – as in my sample – only if variable is passed as ReadOnly).
I thought other ways to find a better solution (most of them has been suggested by kindly people on the newsgroup):
- Script transformation with two inputs: not supported (I would like to pass the real data source as Input1 and the last used ID as Input2)
- Cross Join between two inputs (defined as the previous idea)
- Custom data flow task (with one input and a property with table/column to look for the last ID) – it needs to be written
There are many other possible variations on those themes. But I have to say that SSIS could be improved very much to make life easier for the BI developer: a real “definitive” surrogate key generation task could be the right answer to this kind of needs.
Rounds a number down to the nearest integer.
INT ( <Number> )
Returns the first argument that does not evaluate to a blank value. If all arguments evaluate to blank values, BLANK is returned.
COALESCE ( <Value1>, <Value2> [, <Value2> [, … ] ] )
Returns the largest value in a column, or the larger value between two scalar expressions. Ignores logical values. Strings are compared according to alphabetical order.
MAX ( <ColumnNameOrScalar1> [, <Scalar2>] )