Database Interview Questions and Answers (Part 1)


This post covers basic to advanced relational database interview questions and answers. Some basic data warehouse concepts are also discussed in this post.

See the database types comparison later in this post to learn the differences between the different database types.

1. What is RDBMS?
  • RDBMS stands for Relational Database Management System: software that stores data in tables made up of rows and columns
  • Tables can be related to each other through keys, and the data is queried using SQL
2. What is OLTP (Online Transaction Processing)?
  • Data processing that executes transaction-focused tasks
  • This involves inserting, deleting, or updating small quantities of database data
  • These DBs are suitable for Financial, Retail and CRM transactions
3. What is Data Warehousing?
  • Process of collecting and aggregating data from multiple sources
  • Separates dimensions and facts into separate tables
  • Optimized for querying and analyzing large amounts of information
  • Supports Business Intelligence systems

Dimension = Typically textual data fields, e.g. Date, Product, Employee
Fact = Typically numerical data fields, e.g. Sales, Profit (see the sample star schema below)
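As an illustration, here is a minimal star schema sketch (table and column names are hypothetical, and exact syntax varies by database):

Create Table DimProduct (
    ProductKey INT PRIMARY KEY,
    ProductName VARCHAR(100),   -- descriptive, textual attributes
    Category VARCHAR(50)
);

Create Table FactSales (
    SaleDate DATE,
    ProductKey INT REFERENCES DimProduct(ProductKey),  -- link to the dimension
    SalesAmount DECIMAL(10,2),  -- numerical measures
    Profit DECIMAL(10,2)
);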

4. What are Primary Key and Foreign Key Columns?
  • Primary Key: This column has a unique value for each row in the table. A Primary Key column cannot have repeated values.
  • Foreign Key: This column refers to a Primary Key in another table. A Foreign Key column can have repeated values (see the sketch below).
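A minimal sketch (hypothetical table names; foreign key syntax varies slightly by database):

Create Table Customers (
    CustomerID INT PRIMARY KEY,          -- unique for each row, no repeats
    CustomerName VARCHAR(100)
);

Create Table Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT REFERENCES Customers(CustomerID)  -- foreign key, values can repeat
);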
5. Primary Key vs Unique Key Constraints?
  • Primary Key: Used to uniquely identify each row of a table. Only one Primary Key is allowed per table. It cannot have duplicate or NULL values.
  • Unique Key: Also used to uniquely identify each row of a table. Multiple unique keys can be present in a table. Unique key columns can have NULL values (see the example below).

Constraints = Rules enforced on the data columns of a table
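For example, a sketch showing both constraints on a hypothetical table (NULL handling for UNIQUE columns varies by database):

Create Table Employees (
    EmployeeID INT PRIMARY KEY,      -- only one primary key per table, no NULLs or duplicates
    Email VARCHAR(255) UNIQUE,       -- unique key, NULL values are typically allowed
    NationalID VARCHAR(20) UNIQUE    -- multiple unique keys can exist in one table
);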

6. What is a Surrogate Key Column?
  • A column or set of columns declared as the primary key instead of a “real” or natural key
  • The most common type of surrogate key is an incrementing integer, such as an auto_increment column (see the sketch below)
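A minimal MySQL-style sketch (the auto-increment syntax differs across databases, e.g. IDENTITY in SQL Server or SERIAL in PostgreSQL):

Create Table Product (
    ProductKey INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    ProductCode VARCHAR(20),                    -- natural or “real” key
    ProductName VARCHAR(100)
);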
7. What is Normalization?

A technique that groups related information into separate tables. Data normalization is used to reduce data redundancy and improve data integrity. Normalization increases the number of database joins, because related data is split into separate tables to reduce redundancy. The most commonly asked normal forms are listed below, followed by a small example.

  • 1st Normal Form (1NF):
    • The value of each attribute contains only a single value from its domain (no repeating groups)
    • Each record is unique
  • 2nd Normal Form (2NF):
    • Table is in 1NF
    • No partial dependencies: every non-key column depends on the whole primary key (automatically satisfied when the primary key is a single column)
  • 3rd Normal Form (3NF):
    • Table is in 2NF
    • There are no transitive functional dependencies (non-key columns do not depend on other non-key columns)
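For example, a table Employee(EmpID, EmpName, DeptName, DeptLocation) repeats department details on every row. Normalizing it moves the related department data into its own table (hypothetical names):

Create Table Department (
    DeptID INT PRIMARY KEY,
    DeptName VARCHAR(100),
    DeptLocation VARCHAR(100)
);

Create Table Employee (
    EmpID INT PRIMARY KEY,
    EmpName VARCHAR(100),
    DeptID INT REFERENCES Department(DeptID)  -- join needed to get department details
);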
8. What is De-normalization?
  • Combining data from multiple tables into one single table
  • Data redundancy is increased due to repeated column values
  • Denormalization decreases the number of joins needed to extract data (see the sketch below)
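Continuing the hypothetical example above, the denormalized form keeps everything in one table, so no join is needed, at the cost of repeated department values:

Create Table EmployeeDenorm (
    EmpID INT PRIMARY KEY,
    EmpName VARCHAR(100),
    DeptName VARCHAR(100),      -- repeated for every employee in the department
    DeptLocation VARCHAR(100)
);

Select EmpName, DeptName From EmployeeDenorm;  -- no join required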
9. What are ACID Properties?
  • Atomicity: Each transaction is treated as a single unit, which either succeeds completely or fails completely (a sketch follows below).
  • Consistency: Any data written to the database must be valid according to all defined rules
  • Isolation: Concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
  • Durability: Guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure
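As an illustration, a sketch of an atomic transaction (transaction syntax varies slightly by database; the Accounts table is hypothetical):

Begin Transaction;
    Update Accounts Set Balance = Balance - 100 Where AccountID = 1;
    Update Accounts Set Balance = Balance + 100 Where AccountID = 2;
Commit;    -- both updates become permanent together (atomicity, durability)
-- On any error, Rollback; undoes the whole unit instead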
10. What is Basic Query Structure?

The basic structure of an SQL query is given below. The database engine first evaluates FROM and JOIN, then WHERE; after that, the GROUP BY, HAVING, SELECT, and ORDER BY clauses are evaluated. A worked example follows the skeleton.

Query Evaluation Order

Select [Column List]
From [Table Name, Multiple Tables can be joined]
Where [Filtering Condition on Columns]
Group By [Column List on which data needs to be aggregated]
Having [Filtering Condition using Column Aggregation]
Order By [Column List to Sort Data]
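For example, in the hypothetical query below the clauses are evaluated in the order shown in the comments, which is why WHERE can only filter raw rows while HAVING filters the aggregated groups:

Select Department, Count(*) As EmpCount   -- 5. Select
From Employee                             -- 1. From
Where Salary > 50000                      -- 2. Where (row-level filter)
Group By Department                       -- 3. Group By
Having Count(*) > 10                      -- 4. Having (filter on aggregates)
Order By EmpCount Desc;                   -- 6. Order By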

11. Difference between Having and Where?

Where: The WHERE clause in a query is used to filter data. Columns of the tables used in the FROM clause of the query can be used to filter the data. This filter is evaluated for each row of data, e.g. WHERE Column1 = 'Some Value'.

Select Column1, Column2
From Table1
Where Column1 = 'Hello' OR Column2 != 0

Multiple filtering conditions can be combined using the AND and OR operators.

Having: HAVING is used to filter data using aggregate functions. Aggregate functions cannot be used for filtering in WHERE, e.g. SUM(Sales) > 0.

Select Column1, Sum(Column2)
From Table1
Group By Column1
Having Sum(Column2) > 0


Data is grouped first in this example. After grouping, filtering is applied using the Sum() aggregate function in HAVING.

Step by Step Neo4j Desktop Installation

This is a step-by-step Neo4j Desktop installation guide. Neo4j Desktop is a free edition for developers provided by Neo4j.

See the database comparison section in this post for a comparison between the different types of databases.

Installation Steps

  • Go to the Neo4j Desktop download page and click on Download
  • Go to Downloads and run the installer (Neo4j Desktop Setup 1.4.5.exe)
  • Once installation starts, you can choose whether to install the application only for yourself or for all users on this computer
  • Choose the install location and click on Install. For better performance, choose an SSD drive
Install Location
  • Installation takes 2-3 minutes to complete. Once completed, click on “Run Neo4j Desktop”.
Run Neo4j Desktop
  • Choose Private or Public network in the “Windows Firewall” popup to make Neo4j accessible over the network.
Windows Firewall Setting
  • Next, agree to the Neo4j Desktop license agreement
License Agreement
  • Choose the application data location on the local machine. Once selected, click on Confirm.
Choose Application Data Location
  • In this step, register Neo4j Desktop. Neo4j Desktop is always free; this information is gathered to understand who is interested in the application.
Software Registration
  • The installer will check system requirements and set up the runtime environment
  • Neo4j Desktop is ready for use
Neo4j Desktop


Choosing Right Database for Application

Choosing Right Database

This article gives a brief introduction to the different types of databases that are available in the market, with sample use cases. Choosing the right database for an application is one of the most important decisions to make. This comparison can help you choose the right database for the job.

The following types of databases are evaluated in this post.

Database Types

  1. Key Value
  2. Wide Column
  3. Document
  4. Relational
  5. Graph
  6. Search Engine

1. Key Value Database

Key Value Database
  • What is a key value database?
    • The simplest of the NoSQL databases
    • No SQL-style queries or joins are available for these databases
    • Key value databases use a single main identifier, the “key”, and associate it with a data point, an array of data, or a blob (the “value”)
    • Data can only be queried by passing the key using a “Get” command, e.g. Get user:24:bio
    • Key names can be URIs, hashes, filenames, or anything else that is entirely unique from other keys
    • These databases are simple and lightweight
    • Built for speed, as data is often kept in memory
    • Data type-agnostic

(Data agnostic: the method or format of data transmission is irrelevant to the device or program’s function. This means the device or program can receive data in multiple formats or from multiple sources and still process that data effectively.)

  • Why use key values database?
    • When your application needs to handle lots of small, continuous reads and writes that may be volatile. Key-value databases offer fast in-memory access.
    • Optimal for situations requiring low latency and lower operational demand than a relational database
    • For applications that don’t require frequent updates or complex queries
  • Sample Use Case
    • E-commerce shopping carts — work best for temporary, lightweight listings, which makes them perfect for storing customer shopping carts while they order products online.
    • Online session information — Key value databases make it simple to track and store activity information, including which pages were accessed, on-page actions a user took, and more.
    • Use as a cache – A good option for information that is accessed very frequently but updated less often. These databases can be used on top of another persistent data layer as a cache.

2. Wide Column Database

Wide Column Database
  • What is Wide Column Database?
    • NoSQL database that organizes data storage into flexible columns that can be spread across multiple servers or database nodes.
    • NoSQL database in which the names and format of the columns can vary across rows, even within the same table.
    • Also known as column family databases. Because data is stored in columns, queries for a particular value in a column are very fast, as the entire column can be loaded and searched quickly. Related columns can be modeled as part of the same column family.
    • CQL (Cassandra Query Language) is used as the query language. Joins are not available.
    • Decentralized and scales horizontally
    • In a wide column DB, the outer layer is the keyspace, which holds one or more column families. Each column family holds a set of ordered rows, so related data is grouped together.
  • Why Use Wide Column Database?
    • High volume of data
    • Extreme write speeds with relatively lower read velocity
    • Data extraction by columns using row keys
  • Sample Use Case
    • Log data
    • IoT (Internet of Things) sensor data
    • Time-series data, such as temperature monitoring or financial trading data
    • Attribute-based data, such as user preferences or equipment features
    • Real-time analytics
    • Geographic data

3. Document Database

Document DB
A document DB, where each document is a collection of key value pairs
  • What is a Document Database?
    • In this type of database we have documents
    • A document is a container for key value pairs
    • Documents are unstructured and do not require a schema
    • Documents are grouped together in collections
    • Fields in a collection can be indexed, and collections can be organized in logical hierarchies
    • Joins are not supported. This type of DB encourages embedding related data together in the same document in a de-normalized format
    • Reads from the front-end application are faster, but write/update operations are more complex due to the embedded data
Documents and Collections
  • Popular Databases
    • MongoDB
    • DynamoDB
    • Firestore
  • Why Use Document Database?
    • Your data schema is subject to frequent changes
    • When we don’t know how the schema is going to evolve in the future
    • When the amount of data in many applications cannot be served affordably by a SQL database
  • Sample Use Case
    • Mobile Games
    • IOT
    • Web Application Data

4. Relational Database

Relational Database
  • What is Relational Database?
    • Data is stored in Structures called Tables. Table stores data in Rows and Columns
    • Tables require a pre-defined schema in these databases
    • ACID (Atomicity, Consistency, Isolation and Durability) compliant. Data validity is guaranteed.
    • Support SQL (Structured Query Language) Language
    • Support joins between different tables (see the example at the end of this section)
  • Popular Databases
    • MySQL
    • Microsoft SQL Server
    • Oracle
  • Sample Use Case
    • Online transaction processing (Well suited for OLTP apps because they support the ability to insert, update or delete small amounts of data; they accommodate a large number of users; and they support frequent queries)
    • IoT solutions (Relational databases can offer the small footprint that is needed for an IoT workload, with the ability to be embedded into gateway devices and to manage time series data generated by IoT devices)
    • Data warehouses (Relational databases can be optimized for OLAP (online analytical processing) where historical data is analyzed for business intelligence)
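As a small illustration of SQL support and joins between tables (hypothetical schema and values):

Select o.OrderID, c.CustomerName
From Orders o
Join Customers c On o.CustomerID = c.CustomerID
Where o.OrderAmount > 100;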

5. Graph Database

Graph Database
  • What is Graph Database?
    • A database designed to treat the relationships between data as equally important as the data itself. Relationships are stored alongside the data in the model.
    • Uses the Cypher query language, which is very similar to SQL. Because relationships are already stored with the data, Cypher queries can follow them directly instead of computing joins.
    • Natively handles many-to-many relationships between data
    • Can Traverse millions of connections per second per core
  • Popular Databases
    • Neo4j (Neo4j Desktop is free to get started; see the installation guide earlier in this post)
    • Dgraph
  • Sample Use Case
    • Fraud Detection
    • Recommendation Engine
    • Network and Database Infrastructure Monitoring
    • Master Data Management
    • Social Media and Network Graph

6. Search Engine Database

Search Engine
  • What is Search Engine Database?
    • A non-relational database that is dedicated to searching data content
    • Works very similarly to a document-oriented database. Under the hood, these databases analyze the data and create an index of searchable terms.
    • Indexed terms are automatically ranked, and typos can be handled while a search is performed on the data
  • Popular Databases
    • Elasticsearch
    • Solr
    • Algolia
    • Meilisearch
  • Sample Use Case
    • Text search
    • Logging and analysis


Google Analytics BigQuery Export

This post goes through the prerequisites and steps to set up Google Analytics data export into Google BigQuery. The following prerequisites need to be met before setting up the export.

Prerequisites

  • Need to have Google Analytics 360 Account
  • Need to have Google Cloud Project with Billing Enabled
  • The user who will set up the export needs to have the following permissions:
    • Editor access in Google Analytics
    • Owner access on the Google Cloud project
  • Google Cloud project ID. This will be used during the setup process
  • For one GA (Google Analytics) property, export can be set up for only one view at a time
  • The BigQuery API is enabled in the Google Cloud project

Setup Process

  • Log into Google Analytics (analytics.google.com)
  • Select the GA (Google Analytics) view for which the export needs to be set up (top left corner). The view can be selected by clicking the small arrow as shown in the screenshot below

Select Account
Select View
  • After selecting the view, go to the Admin tab in GA (shown below)
Select Admin from bottom Left

  • Once on the Admin page, select All Products under Properties

  • Select BigQuery under “All Products” to set up the export. The following is needed to set up the export:
    • Google Cloud project ID
    • The Google Analytics view for which the export needs to be set up
    • An email address that is available in GA. This will be used to send export completion notifications
    • Streaming data export frequency (batch mode or continuous export)

Note: The setup process will not ask for a target BigQuery dataset. It is created automatically once export setup is completed. The dataset will have the same name as the GA view ID.


Export Validation

Once setup is completed, we should be able to see the following in the BigQuery console:

  • A dataset should appear in BigQuery with the same ID as the GA view
  • In GA (Admin -> All Products), we should be able to see a green check mark for BigQuery, as seen in the screenshot above
  • Tables and views may take some time to appear in BigQuery
  • Data export can take up to 24 hours (a sample validation query is shown below)
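As a quick check once tables appear, a query like the following can be run in the BigQuery console. This is a sketch that assumes the standard daily ga_sessions_ export tables; my-project and the view ID 123456789 are placeholder values:

Select date, Sum(totals.visits) As sessions
From `my-project.123456789.ga_sessions_20210801`
Group By date;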

Reference: Set up BigQuery Export – Analytics Help (google.com)


Azure Data Factory

Azure Data Factory is a cloud-based Microsoft tool that collects raw business data and transforms it into usable information. It is a data integration ETL (extract, transform, and load) service that automates the transformation of the given raw data. This post includes the most probable Azure Data Factory questions asked during Azure job interviews.

Why Azure Data Factory

  • The cloud needs an orchestration tool
  • Extract data from cloud sources
  • Transform data using Data Factory workflows
  • Load/sink data into a data lake or data warehouse

Azure Data Factory Concepts

  • Pipeline: The list of steps we want to perform in our data transformation or copy task. The steps are called activities.
  • Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or multiple activities. Activities can be of the following types:
    – Data movement, e.g. Copy
    – Data transformation activities, e.g. Databricks, Azure Functions
    – Control flow activities, e.g. ForEach loop, Lookup, Set Variable
  • Datasets: Named references to the data an activity reads or writes; in simple words, a data structure that points to our data.
  • Linked services: Connection information (much like connection strings) that Data Factory uses to connect to external resources.

Data Factory Concepts Mapping with SSIS

For those coming from a conventional SQL Server Integration Services (SSIS) background, the following table provides a high-level mapping between the components of SSIS and Azure Data Factory.

Data Factory      | SSIS
Pipeline          | Package (*.dtsx) or Control Flow
Activity          | SSIS Control Flow components
Linked Services   | Connections
Dataset           | Connection mapping
Workflow          | Dataflow

ADF Integration Runtime

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide the following data integration capabilities across different network environments.
– Dataflow
– Data Movement
– Activity Dispatch
– SSIS Package Execution

  • Azure Integration Runtime: The Azure Integration Runtime can copy data between cloud data stores, and it can dispatch activities to a variety of compute services such as Azure HDInsight or SQL Server where the transformation takes place.
  • Self-Hosted Integration Runtime: The Self-Hosted Integration Runtime is software with essentially the same code as the Azure Integration Runtime, but you install it on an on-premises machine or a virtual machine in a virtual network. A Self-Hosted IR can run copy activities between a public cloud data store and a data store in a private network. It can also dispatch transformation activities against compute resources in a private network. We use a Self-Hosted IR because Data Factory cannot directly access on-premises data sources, as they sit behind a firewall.
  • Azure SSIS Integration Runtime: With the SSIS Integration Runtime, you can natively execute SSIS packages in a managed environment.

ETL process in Azure Data Factory

The following are the high-level steps to create a simple pipeline in ADF. The example copies data from a SQL Server DB to Azure Data Lake.

Steps for Creating ETL

  1. Create a Linked Service for source data store which is SQL Server Database
  2. Create the source Dataset using the source Linked Service
  3. Create a Linked Service for destination data store which is Azure Data Lake Store
  4. Create the sink/target Dataset using the Linked Service created in Step 3
  5. Create the pipeline and add copy activity
  6. Schedule the pipeline by adding a trigger

Schedule a pipeline

  • You can use the scheduler trigger or time window trigger to schedule a pipeline.
  • Triggers execute pipelines on a schedule, e.g. daily at 6 AM

Pass parameters to a Pipeline

You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.

Can default values be defined for the Pipeline Parameters?

You can define default values for the parameters in the pipelines.

Can an activity output property be consumed in another activity?

An activity output can be consumed in a subsequent activity with the @activity construct, as sketched below.
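For illustration, a hedged sketch of the expression syntax; 'MyCopyActivity' is a hypothetical activity name, and rowsCopied is one of the properties a Copy activity typically returns:

@activity('MyCopyActivity').output
@activity('MyCopyActivity').output.rowsCopied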
