GitHub link
YouTube Video for detailed explanation

In this Microsoft Azure data engineering project, I built a data pipeline using Azure Synapse Analytics, Azure Storage, and an Azure Synapse SQL pool to analyze the 2021 Olympics dataset.

Business Overview:

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing large amounts of data. It is a broad field with applications in almost every sector. In this project, I build a pipeline in Azure using Azure Synapse Analytics, Azure Storage, and an Azure Synapse SQL pool to perform data analysis on the 2021 Olympics dataset.

Data Description:

For this project, I'm working with the 2021 Olympics dataset. It covers more than 11,000 athletes competing in 47 sports for 743 teams at the Tokyo Olympics in 2021. The dataset includes information on the participating teams, athletes, coaches, and entries by gender: their names, nationalities, the sports they compete in, and the names of their coaches. The dataset contains 5 files.

Tech Stack:

Azure Synapse Analytics:

Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or provisioned resources at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for BI and machine learning needs.

Azure Storage:

The Azure Storage platform is Microsoft's cloud storage solution for modern data storage scenarios. Azure Storage provides highly available, highly scalable, durable, and secure storage for a wide variety of data objects in the cloud.

Azure Synapse SQL Pool:

Azure Synapse Analytics is an analytics service that unifies enterprise data warehousing and big data analytics. Dedicated SQL pool is the enterprise data warehousing feature of Azure Synapse Analytics. A dedicated SQL pool represents a collection of analytic resources that are provisioned when you use Synapse SQL. Its size is determined by Data Warehousing Units (DWUs). Once a dedicated SQL pool is created, you can use simple PolyBase T-SQL queries to import big data and then use the distributed query engine to run high-performance analytics.
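
As a rough illustration of loading a CSV file into a dedicated SQL pool table, the sketch below renders a T-SQL load statement from Python. Note that it uses the COPY statement rather than PolyBase external tables, and that the table name, storage account, container, and file name are hypothetical placeholders, not values taken from this project.

```python
def build_copy_statement(table: str, source_url: str, first_row: int = 2) -> str:
    """Render a T-SQL COPY INTO statement for a comma-delimited CSV file.

    first_row=2 skips a single header row in the source file.
    """
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_url}'\n"
        "WITH (\n"
        "    FILE_TYPE = 'CSV',\n"
        f"    FIRSTROW = {first_row},\n"
        "    FIELDTERMINATOR = ','\n"
        ")"
    )


# Hypothetical table and storage URL, used only for illustration.
statement = build_copy_statement(
    "dbo.Athletes",
    "https://examplestorage.blob.core.windows.net/olympics/Athletes.csv",
)
print(statement)
```

The rendered statement would then be run against the dedicated SQL pool, for example from a SQL script in Synapse Studio. Authentication options are left out of the sketch.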

Power BI:

Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Your data might be an Excel spreadsheet, or a collection of cloud-based and on-premises hybrid data warehouses. Power BI lets you easily connect to your data sources, visualize and discover what's important, and share that with anyone or everyone you want.


Architecture Diagram:


Create a SQL pool in the Azure Synapse workspace and create tables in the SQL pool
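
The table-creation step could look roughly like the following sketch, which generates a CREATE TABLE statement for a dedicated SQL pool. The table name, column names, and column types are assumptions made for illustration, and the round-robin heap layout is just one common choice for small load tables, not necessarily what this project used.

```python
def build_create_table(table: str, columns: list[tuple[str, str]]) -> str:
    """Render a CREATE TABLE statement with a round-robin, heap layout."""
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table}\n"
        "(\n"
        f"{column_lines}\n"
        ")\n"
        "WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)"
    )


# Hypothetical schema for an athletes table; adjust to the real file headers.
athlete_columns = [
    ("PersonName", "NVARCHAR(200)"),
    ("Country", "NVARCHAR(100)"),
    ("Discipline", "NVARCHAR(100)"),
]
ddl = build_create_table("dbo.Athletes", athlete_columns)
print(ddl)
```

One such statement per source file gives each file its own target table in the pool.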

Create a pipeline to ingest data from Azure Storage into the SQL pool tables
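
To make the ingestion step more concrete, here is a minimal sketch of what a pipeline with a single Copy activity might look like as JSON, built from a Python dictionary. The pipeline and dataset names are hypothetical. The sketch also assumes that the two datasets (a delimited-text source in Azure Storage and a SQL pool table as the sink) are already defined in the workspace, and that the type names match the pipeline JSON schema as I understand it.

```python
import json


def build_copy_pipeline(name: str, source_dataset: str, sink_dataset: str) -> dict:
    """Build a pipeline definition containing one Copy activity."""
    copy_activity = {
        "name": f"Copy_{sink_dataset}",
        "type": "Copy",
        # Datasets are referenced by name; they must exist in the workspace.
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {"type": "DelimitedTextSource"},
            "sink": {"type": "SqlDWSink", "allowCopyCommand": True},
        },
    }
    return {"name": name, "properties": {"activities": [copy_activity]}}


# Hypothetical pipeline and dataset names.
pipeline = build_copy_pipeline("IngestOlympics", "AthletesCsv", "AthletesTable")
print(json.dumps(pipeline, indent=2))
```

In practice the same pipeline can be assembled in the Synapse Studio designer, with one Copy activity per source file.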

Load data from Azure Synapse into Power BI, create visualizations in Power BI, and publish the Power BI dashboard

Power BI Report