Azure Data Factory Managed Virtual Network (Preview)


The emergence of cloud technologies has enabled enterprises to scale their infrastructure with minimal effort. In fact, you can scale with a few clicks and with little provisioning and maintenance cost. However, the primary concern with cloud technologies is security. Since your data and infrastructure reside remotely, you don’t get the total control over security that you have with an on-premises setup. This is especially true of PaaS offerings, where the platform you use is completely managed by the cloud service provider.

In Azure, one such PaaS offering is Azure Data Factory (ADF). Azure Data Factory is Microsoft's service for creating ETL/ELT pipelines. It is serverless and easy to use, with connectors for over 90 data stores. However, by default the data movement happens over the public internet. This can expose the data movement to risks such as spoofing attacks. To deal with this, Microsoft Azure has introduced a Managed Virtual Network for Azure Data Factory. So how exactly does it work?

Integration Runtime (IR)

Before we look at how this security feature works, let us understand what an Integration Runtime is. To put it simply, an Integration Runtime is the compute engine of Azure Data Factory. It is the component that does the extraction, transformation and loading of data. It provides data flow, data movement, activity dispatch and SSIS package execution. Integration Runtime comes in three types:

  • Azure IR: Fully managed, serverless compute.
  • Self-Hosted IR: Software that you install to perform data integration securely over a private network. It is used for data sources that are on-premises or on Azure VMs, and it acts as a gateway to an on-premises data source.
  • Azure-SSIS IR: A fully managed cluster of Azure VMs for running SSIS workloads.

For more details about IR, refer to this Microsoft documentation.

The risk of spoofing attacks mentioned above applies to the Azure IR, since it runs in the public cloud environment, i.e. the heavy lifting of ETL/ELT pipelines happens over the public internet. The ADF Managed Virtual Network addresses this drawback of the Azure IR.

High-Level Architecture of ADF Managed Virtual Network

Below is the high-level architecture of an Azure Data Factory managed virtual network setup.

As you can see from the architecture, we provision an Azure IR in an ADF managed VNET. The image below depicts provisioning an Azure IR within an ADF managed VNET.

 

Furthermore, managed private endpoints are created in the managed VNET so that the Azure IR can connect privately to different Azure services such as ADLS Gen2.
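To make the two pieces described above more concrete, here is a minimal sketch in Python that only builds the JSON-style definitions for an Azure IR placed in the managed VNET and for a managed private endpoint to an ADLS Gen2 account. It does not call Azure. The field names follow the resource JSON shape as we understand it, and the managed VNET name "default", the IR name, the endpoint name and the placeholder resource ID are all assumptions for illustration, so verify them against the current Microsoft documentation before using them.

```python
import json

# Assumption: the factory's managed virtual network is named "default".
MANAGED_VNET_NAME = "default"


def azure_ir_in_managed_vnet(name: str, location: str = "AutoResolve") -> dict:
    """Build a JSON-style definition of an Azure IR that lives in the managed VNET."""
    return {
        "name": name,
        "properties": {
            # "Managed" is the type used for the Azure IR in the JSON definition.
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {"location": location},
            },
            # This reference is what places the IR inside the managed VNET.
            "managedVirtualNetwork": {
                "type": "ManagedVirtualNetworkReference",
                "referenceName": MANAGED_VNET_NAME,
            },
        },
    }


def managed_private_endpoint(name: str, target_resource_id: str, group_id: str) -> dict:
    """Build a JSON-style definition of a managed private endpoint to a target service."""
    return {
        "name": name,
        "properties": {
            "privateLinkResourceId": target_resource_id,
            # Sub-resource of the target, e.g. "dfs" for ADLS Gen2 or "blob" for Blob storage.
            "groupId": group_id,
        },
    }


if __name__ == "__main__":
    # Hypothetical names and a placeholder resource ID, for illustration only.
    ir = azure_ir_in_managed_vnet("ManagedVnetIR")
    endpoint = managed_private_endpoint(
        "AdlsGen2Endpoint",
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>",
        "dfs",
    )
    print(json.dumps(ir, indent=2))
    print(json.dumps(endpoint, indent=2))
```

Note that a private endpoint connection usually has to be approved on the target resource before traffic can flow through it, which this sketch does not cover.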

Advantages of ADF Managed Virtual Network

  • Securely connect Azure Data Factory to different sources such as Storage/ADLS Gen2.
  • No need to manage the VNET yourself.
  • No need for a Self-Hosted IR for Azure VM sources.

Conclusion

This is a brand-new feature of Azure Data Factory and it will evolve. All the images and reference material come from this blog. Moreover, this article is for informational purposes only. We make no guarantees regarding its completeness or accuracy.

Also read: Azure Data Lake Gen2 and Azure Databricks



I am a Data Scientist with 6+ years of experience.

