Azure Data Factory Interview Questions

Azure Data Factory is a cloud-based Microsoft data integration service that collects raw business data and transforms it into usable information. It is an ETL (extract, transform, and load) service that automates the movement and transformation of that raw data. This Azure Data Factory interview questions blog covers the questions most likely to be asked during Azure job interviews.

Why Azure Data Factory

  • The cloud needs an orchestration tool
  • Extract data from cloud sources
  • Transform data using Data Factory workflows
  • Load/sink the data into a data lake or data warehouse

Azure Data Factory Concepts

  • Pipeline: The list of steps we want to perform in our data transformation or copy task. The steps are called activities.
  • Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or more activities. Activities can be of the following types:
    – Data movement activities, e.g. Copy
    – Data transformation activities, e.g. Databricks, Azure Functions
    – Control flow activities, e.g. ForEach loop, Lookup, Set Variable
  • Datasets: Sources of data. In simple words, a dataset is a data structure that points to the data an activity reads or writes.
  • Linked services: Connection strings, i.e. the connection information Data Factory needs to reach external data stores and compute services (see the sketch after this list).
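To make these concepts concrete, here is a minimal sketch of a pipeline in ADF's JSON authoring format, assuming a source dataset named SqlServerTable1 and a sink dataset named LakeFolder1 (hypothetical names used only for illustration); each dataset would in turn point to a linked service.

```json
{
  "name": "CopySqlToLake",
  "properties": {
    "activities": [
      {
        "name": "CopyFromSqlServer",
        "type": "Copy",
        "inputs":  [ { "referenceName": "SqlServerTable1", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "LakeFolder1", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "AzureDataLakeStoreSink" }
        }
      }
    ]
  }
}
```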

Mapping Data Factory Concepts to SSIS

Data Factory    | SSIS
----------------|----------------------------------
Pipeline        | Package (*.dtsx) or Control Flow
Activity        | Control Flow component
Linked Services | Connections
Dataset         | Connection mapping
Workflow        | Data Flow

ADF Integration Runtime

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide the following data integration capabilities across different network environments:
– Data Flow
– Data Movement
– Activity Dispatch
– SSIS Package Execution

  • Azure Integration Runtime: The Azure Integration Runtime can copy data between cloud data stores, and it can dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
  • Self-Hosted Integration Runtime: The self-hosted Integration Runtime is software with essentially the same code as the Azure Integration Runtime, but you install it on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store in a private network, and it can dispatch transformation activities against compute resources in a private network. We use a self-hosted IR because Data Factory cannot directly access on-premises data sources, as they sit behind a firewall. A linked service selects its runtime through the connectVia property (see the sketch after this list).
  • Azure-SSIS Integration Runtime: With the Azure-SSIS Integration Runtime, you can natively execute SSIS packages in a managed environment.
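As an illustration, binding a linked service to a self-hosted IR is a matter of setting connectVia. The runtime name (MySelfHostedIR), server, and database below are placeholders, and in practice the connection string would be stored as a secure string or in Azure Key Vault.

```json
{
  "name": "OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=<onprem-server>;Database=<db>;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "MySelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```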

ETL process in Azure Data Factory

The following are the high-level steps to create a simple pipeline in ADF. The example copies data from a SQL Server database to Azure Data Lake Store; the JSON sketch after the steps shows what the resulting dataset definitions might look like.

Steps for Creating ETL

  1. Create a linked service for the source data store, which is the SQL Server database
  2. Create the source dataset using the linked service from step 1
  3. Create a linked service for the destination data store, which is Azure Data Lake Store
  4. Create the sink/target dataset using the linked service created in step 3
  5. Create the pipeline and add a Copy activity
  6. Schedule the pipeline by adding a trigger
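Continuing the earlier sketch, steps 2 and 4 could produce dataset definitions like the following. The dataset, linked service, table, and folder names are hypothetical, and newer dataset versions may split tableName into separate schema and table properties.

```json
[
  {
    "name": "SqlServerTable1",
    "properties": {
      "type": "SqlServerTable",
      "linkedServiceName": { "referenceName": "OnPremSqlServer", "type": "LinkedServiceReference" },
      "typeProperties": { "tableName": "dbo.SalesOrders" }
    }
  },
  {
    "name": "LakeFolder1",
    "properties": {
      "type": "AzureDataLakeStoreFile",
      "linkedServiceName": { "referenceName": "DataLakeStore1", "type": "LinkedServiceReference" },
      "typeProperties": { "folderPath": "raw/sales" }
    }
  }
]
```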

Schedule a pipeline

  • You can use a schedule trigger or a tumbling window trigger to schedule a pipeline.
  • A trigger executes pipelines on a schedule, e.g. daily at 6 AM (see the sketch below).
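A minimal schedule trigger for the "daily at 6 AM" case could look like the sketch below; the trigger name, start time, and pipeline reference are placeholders.

```json
{
  "name": "DailyAt6AM",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z",
        "schedule": { "hours": [6], "minutes": [0] }
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopySqlToLake", "type": "PipelineReference" } }
    ]
  }
}
```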

Pass parameters to a Pipeline

You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.
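For example, the hypothetical pipeline from earlier could declare a targetFolder parameter and reference it with the @pipeline().parameters expression; this sketch assumes the LakeFolder1 dataset declares a matching folderPath parameter.

```json
{
  "name": "CopySqlToLake",
  "properties": {
    "parameters": { "targetFolder": { "type": "String" } },
    "activities": [
      {
        "name": "CopyFromSqlServer",
        "type": "Copy",
        "inputs":  [ { "referenceName": "SqlServerTable1", "type": "DatasetReference" } ],
        "outputs": [
          {
            "referenceName": "LakeFolder1",
            "type": "DatasetReference",
            "parameters": { "folderPath": "@pipeline().parameters.targetFolder" }
          }
        ],
        "typeProperties": {
          "source": { "type": "SqlServerSource" },
          "sink": { "type": "AzureDataLakeStoreSink" }
        }
      }
    ]
  }
}
```

A trigger or a manual run then supplies the argument, e.g. "parameters": { "targetFolder": "raw/sales/2024" } in the trigger's pipeline reference.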

Can you define default values for the pipeline parameters?

You can define default values for the parameters in the pipelines.
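For instance, the parameters block of the hypothetical pipeline above could carry a default:

```json
{
  "parameters": {
    "targetFolder": { "type": "String", "defaultValue": "raw/sales" }
  }
}
```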

Can an activity output property be consumed in another activity?

An activity output can be consumed in a subsequent activity with the @activity construct.
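For example, a Set Variable activity could capture how many rows a preceding copy activity wrote. The activity and variable names are hypothetical, and rowsCopied is one of the properties the Copy activity reports in its output.

```json
{
  "name": "RecordRowsCopied",
  "type": "SetVariable",
  "dependsOn": [
    { "activity": "CopyFromSqlServer", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "variableName": "rowsCopied",
    "value": "@string(activity('CopyFromSqlServer').output.rowsCopied)"
  }
}
```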
