Several people have downloaded TableDifference to handle slowly changing dimensions (SCD) faster, and some of them, especially those using it on huge tables (more than 10 million rows), have noticed memory problems. The cause is one input flow running too fast, which forces TableDifference to cache data. We know about the problem and have decided to solve it by creating a new component called “FlowSync”. You can find all the details and source code here.
The article includes a brief discussion of how SSIS handles the ProcessInput method of a component with more than one input. Here is an extract:
As you may already know, ProcessInput is called once for every buffer. In the case of a component with two or more inputs, like TableDifference or Union All, the method is called once for each buffer of each input, so the inputs are mixed together and handled by the same method. Before deciding to develop FlowSync, one candidate solution to the problem of synchronizing the inputs was to use semaphores to stop the faster input inside the ProcessInput method. It would have been a nicer solution, BUT ProcessInput is called on only ONE thread, even if the component has two input flows. So, if ProcessInput blocks, all the inputs of the component are stopped and the system ends up in a deadlock.
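To make the deadlock concrete, here is a minimal sketch in Python rather than in the SSIS object model. The names `TwoInputComponent`, `process_input` and `run_single_threaded` are hypothetical stand-ins, not SSIS APIs, and the semaphore wait uses a timeout only so that the demonstration terminates instead of hanging.

```python
import threading


class TwoInputComponent:
    """Hypothetical stand-in for a component with two inputs (not the SSIS API)."""

    def __init__(self):
        self.seen = {0: 0, 1: 0}          # buffers received per input
        self.input1_arrived = threading.Semaphore(0)
        self.stalled = False

    def process_input(self, input_id, buffer):
        if input_id == 0:
            # Input 0 is the "fast" flow: try to wait for a buffer from input 1.
            # A plain blocking acquire() would never return here, because the
            # only thread that could release the semaphore is the one waiting.
            if not self.input1_arrived.acquire(timeout=0.2):
                self.stalled = True
        else:
            self.input1_arrived.release()
        self.seen[input_id] += 1


def run_single_threaded(component, buffers):
    # One thread delivers the buffers of BOTH inputs, one call per buffer.
    for input_id, buffer in buffers:
        component.process_input(input_id, buffer)


component = TwoInputComponent()
run_single_threaded(component, [(0, "a0"), (1, "b0")])
print(component.stalled)  # True: the wait could only end by timing out
print(component.seen)     # {0: 1, 1: 1}
```

Without the timeout, the first call would block forever: the buffer from input 1 that would release the semaphore is queued behind it on the same thread.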
This is very strange, because each flow runs in a separate thread, but it seems that the two threads are synchronized onto a single one when they need to pass data to the component. So the solution was to insert the sync technique where we still have separate threads, that is, directly on the flows, with a transformation component: FlowSync.
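The idea of throttling the flows while they still run on separate threads can be sketched as follows. This is only an illustration under assumptions, not the actual FlowSync implementation: `FlowGate`, `pass_buffer`, `finish`, `run_flow` and the `max_lead` limit are hypothetical names chosen for this example.

```python
import threading
import time


class FlowGate:
    """Hypothetical throttle: no flow may run more than max_lead buffers
    ahead of the slowest flow that is still running."""

    def __init__(self, flows, max_lead):
        self.cond = threading.Condition()
        self.passed = {flow: 0 for flow in flows}
        self.done = set()
        self.max_lead = max_lead
        self.max_observed_lead = 0

    def _slowest_other(self, flow):
        counts = [n for f, n in self.passed.items()
                  if f != flow and f not in self.done]
        return min(counts) if counts else None

    def pass_buffer(self, flow):
        with self.cond:
            # Block this flow's own thread while it is too far ahead.
            while True:
                slowest = self._slowest_other(flow)
                if slowest is None or self.passed[flow] - slowest < self.max_lead:
                    break
                self.cond.wait()
            self.passed[flow] += 1
            slowest = self._slowest_other(flow)
            if slowest is not None:
                lead = self.passed[flow] - slowest
                self.max_observed_lead = max(self.max_observed_lead, lead)
            self.cond.notify_all()

    def finish(self, flow):
        # A finished flow no longer holds the others back.
        with self.cond:
            self.done.add(flow)
            self.cond.notify_all()


def run_flow(gate, flow, count, delay):
    for _ in range(count):
        time.sleep(delay)          # simulate the cost of producing a buffer
        gate.pass_buffer(flow)
    gate.finish(flow)


gate = FlowGate(["fast", "slow"], max_lead=3)
threads = [
    threading.Thread(target=run_flow, args=(gate, "fast", 20, 0.0)),
    threading.Thread(target=run_flow, args=(gate, "slow", 20, 0.005)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(gate.passed, gate.max_observed_lead)
```

Because each flow waits on its own thread, blocking the fast one does not stop the slow one, so the downstream component never has to cache more than `max_lead` buffers of the faster input.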
I would really like to see in the next version of SSIS the ability to decide, when developing a component, whether ProcessInput should be called in a multithreaded environment or not. My personal opinion is that, using threads, programs become easier to write and maintain, and TableDifference may be a good candidate to demonstrate this statement.