Tonight I’ve studied how to improve performance of SCD transformation task of SSIS. I started from the consideration that in an actual SSIS package, the dimension processing is at least one magnitude slower than fact table processing; most of the time is consumed by SCD transformation.
A simple SQL Profiler session showed me how SCD works processing data: for each processed row, a sp_executesql is called with a query that have business key fields as parameter. While the query is executed very fast on the server, the throughput is not very good because we have the latency of communication between different processes (and may be different servers). Client CPU is consumed more than server CPU, probably due directly or indirectly to the sql query string construction (it’s a SELECT field, field, field, … FROM table WHERE businesskey = @P1″.
Now, when you have a SCD you could also have a timestamp column on original data that helps you to extract and process only good changing candidates for dimension processing. But when this is not the case, the whole dimension is reprocessed against the last processed dimension, just to find new or changed attribute. This scenario is very common when data are provided by legacy system that are out of our control (otherwise we would have added the timestamp column months before…). It could be wonderful if SCD transformation would have an optional input flow to provide “lookup” sorted data, working in a way similar to the merge task. The wizard could use the business keys as lookup/join condition and SCD would not have to use sql query anymore just to solve the SCD condition to process. Unfortunately this is not the case, so by now the only option is to make your own SCD processing by hand using basic components. But this topic deserves attention by SSIS development team, because this kind of operations are very very common in a data warehouse environment.