h1

Slowly Changing Dimension Type 1 with Merge Join.

March 19, 2012

In my last 2 posts (1 and 2) I covered different methods for implementing SCD Type 1 changes in SSIS.  Today we will cover how to use the Merge Join design pattern to load SCD Type 1 columns.

Right from the start the design is a bit different. Instead of adding your source query to an OLEDB Source component and then connecting it to the lookup or the SCD component, we create 2 OLEDB Source components. 

image

The first one is the same as in the previous two posts, it connects to our source table (UserUpdate).  However we have to modify the query a little bit to get it to work with the Merge Join component.  So lets take a look at that query.

SELECT 
[UserAK]
,[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
,CHECKSUM(
[ScreenName]
,[AccountCreateDate]
,[TimeZone]
,[Language]
,[GeographySK]
) as Hash_CD
FROM
DimUserUpdate
Order By UserAK
The addition of the ORDER BY clause is essential when using the Merge Join design pattern.
 
The second OLEDB Source component will query against our destination table, essentially replacing the lookup  component from the Lookup Conditional Split design pattern.  As in the lookup query we only need to bring back the UserAK, UserSK and the Hash_CD.  But just like the source query above we need to add the ORDER BY clause because we are going to use the Merge Join.
 
SELECT [UserSK]
,[UserAK]
,[Hash_CD]
FROM [Demo].[dbo].[DimUser]
Order By UserAK

Simply using the ORDER BY clause is not enough for SSIS to know that the datasets you are intending to enter into the Merge Join is not enough. You have to tell SSIS that the data is sorted, and on what column the sort is happening.  I agree it’s a little redundant, but that’s how it is.  You must do this for each OLEDB Source.  To do this right click on the OLEDB Source and select Show Advanced Editor from the menu.
image

Once inside the advanced editor select the Input and Output Properties Tab along the top of the window. Then under the Common Properties for the OLEDB Source Output change the IsSorted property to True.

image

That takes care of telling SSIS that the data is sorted, but now we need to tell SSIS what column it is sorted on.  To do that we drill down OLEDB Source Output to the Output Columns and select the column, UserAK, (or columns if your query is ordered by more than one column) and change its SortKeyPosition to 1

image

Now do this for the second OLEDB source and then we’ll be ready for the Merge Join component.  Please note that if you have multiple column in your order by clause they must match in both queries, and you must set the SortKeyPosition in the same order for both components.  Once the the source components are configured drag in the Merge Join component.

image

Now drag the first source component output connection to the Merge Join component. Once you’ve connected this a selection box will appear.  It will ask you to tell it what side of the join this output belongs on.  This is important, because we are basically going to be creating a Left Outer Join inside the component, so knowing which side of the join each portion belongs to is obviously essential.

image

When you add the second output from the other OLEDB Source you won’t be prompted again, it will just make it the input for the other side of the join. Now double click on the Merge Join component and open the editor. 

The first property to set is the Join Type.  Your options are Left Outer Join, Inner Join, or Full Outer Join.  We are going to be using the Left Outer Join option.  Below the Join Type is the columns from our two queries each on the side we specified when hooking up the component to the Merge Join.  The keys are already linked now we need to add the columns we want to return from both sides of the join.  Make sure to both the Hash_CD columns to make comparing them easier later on in the package.

image

The next step will be to add a conditional split to the package to determine which rows are new rows to be inserted and which rows need to be evaluated for updates.  To do this requires only one output to be created and we’ll call it NewRecords.  Here is the expression you need to add to the editor.

image

If this doesn’t make immediate sense let me explain.  Since we did a left outer join if there were no matches between the two queries on the UserAK it means that the records without a match are new records and as a result the UserSK would evaluate to NULL.  Now hook up the NewRecords output to your OLEDB Destination, set the connection to the destination table, confirm the mappings and the new records will get inserted correctly.

Now on to the second Conditional Split. Here we need to evaluate if the records that matched on UserAK have had a change in any of the columns that we are tracking changes on.  To do this we will use the Checksum values that we’ve created in our source query.  We need to write 2 SSIS expressions, 1 for each output, to determine if the records have changed.  Here are the expressions:

image

We then connect the Update output to the OLEDB Command component to update the rows that need to be updated.

Inside the OLEDB Command editor the first tab you are shown is the connection manager tab. Simply connect to the destination database and then select the component properties tab.
image_thumb20

At the bottom of this tab, under Custom Properties is the SQLCommand property. Here you will write your update statement. To do this you will have to map all of the columns to be updated in the destination to the columns coming through in the data flow. The query will look like you are mapping them to parameter values (?). Notice that even the where clause is set using the parameter indicator.

UPDATE DimUser
SET
[ScreenName] = ?
,[AccountCreateDate] = ?
,[TimeZone] = ?
,[Language] = ?
,[GeographySK] = ?

WHERE [UserSK] =?

Once the query is complete go to the Column Mapping tab.

image_thumb23

No it is just a matter of correctly mapping the parameter values to the columns in your data flow (input columns). Make sure to pay attention to your column order in the update query to map the parameter values to the appropriate input column. Remember that the last parameter value is for the where clause and it is why we brought the UserSK value from the Lookup query to begin with.

When your done the Data flow should look something like this.

image

I hope this helps some of you looking to try different ways to update SCD1 Dimensions.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: