Today I ran into a nice misfeature of SSIS that worths sharing.
If I want to use SSIS to sort a flow of data, containing both lowecase and uppercase strings, I would expect a standard behaviour or, at least, something with some “common sense”. As I am going to show, the final result is completely nonsense and will lead to incorrect computations.
Let us start with some information that I remember from the old time of ASCII code. Uppercase letters appears before the lowercase ones in the standard ASCII sequence and hence are sorted before them. Is it still true in the 2009 world with Unicode and all our new and fancy ways of storing strings? It seems so, if you run this simple VB script:
If "A" >= "a" Then MsgBox("A is greater or equal than a") Else MsgBox("A is less than a") End If
You will get the correct result: uppercase comes BEFORE lowercase characters. So, since SSIS sort component does not provide any “consider/ignore case” checkbox, it has two options for sorting strings:
- It ignores casing and sorts data like SQL does.
- It adhere to the standard sorting sequence and sorts uppercase BEFORE lowercase.
Well, armed with all this powerful knowledge, we can build a simpe package that receives three rows from this query:
SELECT Name = 'ALBERTO', Surname = 'FERRARI' UNION ALL SELECT Name = 'alberto', Surname = 'ferrari' UNION ALL SELECT Name = 'alberto', Surname = 'FERRARI'
Then it sends them into a Sort by Name, Surname and add a data viewer after the SORT, just to check what the component does:
I have been amazed to look at the data viewer:
SSIS believes that lowercase strings should be LESS than uppercase ones and produces a reverse sorted list of rows. This is pretty important because, if after the flow you trust data to be sorted, your system will fail, as it has been the case with some code I received to test just today.
The final lesson is pretty simple: do not trust common sense, for some (very) obscure reason the SSIS team decided to provide “Yet Another Sorting Method” to the IT World, generating some more confusion in the head of the poor SSIS programmer.
Returns the union of the tables whose columns match.
UNION ( <Table>, <Table> [, <Table> [, … ] ] )
Returns all the rows in a table, or all the values in a column, ignoring any filters that might have been applied.
ALL ( [<TableNameOrColumnName>] [, <ColumnName> [, <ColumnName> [, … ] ] ] )