Comparing SSIS Lookup and SQL Joins

2042

Motivating SSIS lookups

Denormalization is the stepping stone of data warehouse creation; usually performed to extract the surrogate key of a dimension to a fact table. However, we can perform denormalization in two ways viz. SQL joins or a lookup component in an ETL tool. In Microsoft SSIS, we have a Data Flow Task, which contains the Lookup component.

The three modes of SSIS Lookups

We can use the SSIS lookup component in 3 modes viz. full cache, partial cache and no-cache. Read this article to know more: SSIS Look Up Component. In the article, we have an important point to note about SSIS lookups i.e. in full cache mode the string comparison is case-sensitive, which is not true in the case of SQL joins(depends on collation).

Usually, SQL joins perform the case-insensitive comparison. Let’s take an example.

The Illustration contains a dimension table DimGeo with columns named GeoId and Geography. We have a staging table stgFactSales having a column called Geography. After denormalization using joins, we get FactSales. Note that DimGeo contains India in camel case whereas stgFactSales holds it in uppercase. Nonetheless, denormalization produces the GeoKey value as ‘1’ for India.

Now let’s go to another illustration with Lookups.

The lookup between Geography between DimGeo and stgFactSales produces the following output:

We can see that GeoKey is populated as NULL for India.

The reason for Case Sensitive comparison in SSIS Lookup Full Cache mode

In full cache mode, the .NET based SSIS Engine carries out string comparison, while SQL Server Engine takes over in partial and no-cache mode. We cannot assert that whether this nuance is a boon or a bane since this is a call of the business requirement. Lastly, interchanging the joins and lookups blindly might lead to undesirable outcomes.

Read: What is a Data Hub: Concepts and Guidelines



I am a Data Scientist with 6+ years of experience.


Leave a Reply