Thursday, March 28, 2024

Navigating the Data Science Seas: A Journey with Microsoft Fabric


Data science is a vast and exciting field, brimming with the potential to unlock valuable insights. But like any seafaring voyage, navigating its currents can be challenging. Data wrangling, complex models, and siloed information – these are just a few of the obstacles that data scientists encounter.

Fortunately, there's a trusty first mate to help us on this journey: Microsoft Fabric. Fabric isn't a single tool, but rather a comprehensive set of services designed to streamline the data science workflow. Let's set sail with an example to see how Fabric equips us for smoother sailing.

The mission of a data scientist is to develop a model that predicts when a customer will stop using a service (customer churn). Here's how Fabric can be your guide.

Predicting Customer Churn

Let's dive deeper and explore the steps involved in building a customer churn prediction model using Microsoft Fabric.  You can get started by signing in to http://fabric.microsoft.com, your cruise ticket for this data science journey.

Step 1: Data Discovery & Acquisition

  • Mapping the Treasure Trove: Utilise Microsoft Purview, the unified data governance service available in the Azure portal. Purview acts as your treasure map, helping you discover relevant datasets related to customer demographics, purchase history, and marketing interactions.  You can also add and register your own datasets.
  • Charting the Course: Once you've identified the datasets, leverage Azure Data Factory to orchestrate data extraction, transformation, and loading (ETL) processes. Data Factory acts as your captain, guiding the data from its source to your designated destination (e.g., OneLake). You can also skip the above two steps and chart your course directly with the existing open datasets and notebooks available in the sea of Microsoft Fabric, which is what we will do here.
  • Unveiling the Data in OneLake: As you navigate the vast ocean of data, OneLake, the central data repository within Fabric, serves as your treasure trove. Utilise the Lakehouse item, your personal submarine, to explore and interact with the datasets that are crucial for your customer churn prediction mission.  After signing in, enter the Data Science cabin as shown in the image below.




 We will be using an existing sample on Customer Churn that is available within Fabric.

Click on Use a Sample as shown below



Choose the Customer Churn Sample from the list of samples as shown below
 
This opens the customer churn notebook within Fabric.



  • Attaching the Lakehouse to Your Notebook: Effortlessly connect the Lakehouse containing your relevant datasets to your analysis Notebook. This allows you to browse and interact with the data directly within your notebook environment. 
To do this, click on the Lakehouses section in the left navigation pane of the notebook, as shown below, and create a new Lakehouse.
 


  • Prepare for sailing: Bring the right luggage by installing the required libraries. To do this, run the pip install command from the notebook as shown below.
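As a minimal sketch, the churn sample relies on the imbalanced-learn package for the SMOTE step later in the walkthrough, so the install cell looks roughly like this (your notebook may install additional libraries):

    # Install imbalanced-learn (imblearn), used later for SMOTE oversampling
    %pip install imblearn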
 

Prepare your travel documents by exploring the dataset that you are going to use: a bank dataset that contains the churn status of 10,000 customers across 14 attributes.  Run the configuration cell as shown below.
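The configuration cell simply records where the raw data should live in the lakehouse. A rough sketch, in which the folder and file names are illustrative and may differ from the sample's exact values:

    IS_CUSTOM_DATA = False            # use the built-in sample data rather than your own
    DATA_ROOT = "/lakehouse/default"  # mount point of the attached lakehouse
    DATA_FOLDER = "Files/churn"       # lakehouse folder for the raw data
    DATA_FILE = "churn.csv"           # name of the CSV file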
 

 


Prepare to combat seasickness by running the cell shown below, which downloads the dataset and uploads it to the lakehouse.
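A minimal sketch of that cell, assuming the sample CSV is reachable over HTTP; the URL below is only a placeholder, and the path follows the configuration above:

    import os
    import requests

    remote_url = "https://<sample-data-host>/bankcustomerchurn/churn.csv"  # placeholder URL
    download_path = "/lakehouse/default/Files/churn/raw"

    # Create the target folder in the lakehouse and save the downloaded file there
    os.makedirs(download_path, exist_ok=True)
    response = requests.get(remote_url, timeout=30)
    with open(os.path.join(download_path, "churn.csv"), "wb") as f:
        f.write(response.content)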
 


  • Seamless Data Reads with Pandas: OneLake and Fabric Notebooks make data exploration a breeze. You can directly read data from your chosen Lakehouse into a Pandas dataframe, a powerful data structure for analysis in Python. This simplifies data access and streamlines the initial stages of your data exploration.
Prepare your groceries by running the next two cells to create a pandas dataframe as shown below.
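In essence, these cells read the CSV that was just uploaded straight from the lakehouse Files area; a sketch, assuming the folder structure used above:

    import pandas as pd

    # Read the raw churn data from the attached lakehouse into a pandas dataframe
    df = pd.read_csv("/lakehouse/default/Files/churn/raw/churn.csv")
    df.head()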
 

 

Plan your sailing itinerary by running the next two cells as shown below
 






Step 2: Data Wrangling & Preparation
  • Setting Sail with Data Wrangler: Data Wrangler, your powerful workhorse, welcomes the acquired dataframe.  Here, you'll have an immersive experience to clean and prepare the data for analysis. This might involve handling missing values, encoding categorical variables, and feature engineering (creating new features based on existing ones).
Have the main mooring lines looped through to manoeuvre by launching Data Wrangler from the Data tab of the notebook as shown below.
 


In the next screen, choose the dataframe that you created, as shown below.

 


 

 
Now that Data Wrangler is launched, expand Find and replace and click on Drop duplicate rows as shown below.
 


This generates the code for dropping duplicate rows from the dataframe, if there are any, as shown below.
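The generated code typically looks something like the following sketch, wrapping the operation in a small function and producing a cleaned copy of the dataframe:

    # Sketch of the kind of code Data Wrangler generates for this operation
    def clean_data(df):
        # Drop duplicate rows across all columns
        df = df.drop_duplicates()
        return df

    df_clean = clean_data(df.copy())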

  


  • Exploring the Currents: Perform Exploratory Data Analysis (EDA) to understand the data's characteristics. Identify patterns and relationships between features that might influence customer churn. 
Start moving only after checking that no other boat is already manoeuvring in the same channel arm by running the next three cells as shown below 

 

Also run the five-number summary as shown below.
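A single call gives the five-number summary (minimum, quartiles, maximum) along with count, mean and standard deviation for the numeric columns:

    df_clean.describe()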



 
Explore further by plotting the distribution of exited and non-exited customers as shown below.
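A sketch of such a plot, assuming the churn label column is named "Exited" as in the bank churn sample:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Count of customers who churned (Exited = 1) versus those who stayed (Exited = 0)
    sns.countplot(x="Exited", data=df_clean)
    plt.title("Exited vs non-exited customers")
    plt.show()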


 
Plot the distribution of the numerical attributes.
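For example, histograms of all numeric columns give a quick view of their distributions:

    import matplotlib.pyplot as plt

    # One histogram per numeric attribute
    df_clean.hist(figsize=(12, 8), bins=20)
    plt.tight_layout()
    plt.show()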
 


Perform feature engineering and one-hot encoding.
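An illustrative sketch only: derive a simple ratio feature, drop identifier columns that carry no predictive signal, and one-hot encode the categorical columns. The column names (Tenure, Age, RowNumber, CustomerId, Surname, Geography, Gender) follow the bank churn sample and are assumptions here:

    import pandas as pd

    # New ratio feature combining two existing attributes
    df_clean["NewTenure"] = df_clean["Tenure"] / df_clean["Age"]

    # Identifier columns do not help the model; ignore them if they are absent
    df_clean = df_clean.drop(["RowNumber", "CustomerId", "Surname"], axis=1, errors="ignore")

    # One-hot encode the categorical attributes
    df_clean = pd.get_dummies(df_clean, columns=["Geography", "Gender"])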
 

 

As a final step of exploratory data analysis, create a delta table by running the delta table code as shown below.  You can then see the delta table named df_clean created in the lakehouse.
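A sketch of that step, relying on the spark session that Fabric notebooks provide by default:

    # Persist the cleaned dataframe as a Delta table in the attached lakehouse
    table_name = "df_clean"
    spark.createDataFrame(df_clean).write.mode("overwrite").format("delta").saveAsTable(table_name)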
 



Step 3: Building & Training the Model
  • Choosing Your Vessel: Azure Machine Learning serves as your shipbuilder. Here, you can choose and configure a machine learning algorithm suitable for churn prediction. Popular options include Logistic Regression, Random Forest, or Gradient Boosting Machines (GBMs).

Run the code in Step 4 of the notebook, as shown below, which loads the delta table and creates the experiment.
 
Now run the code that sets the experiment and autologging, imports the scikit-learn libraries, and prepares the training and test data as shown below.
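A rough sketch of what these cells do; the experiment name and the "Exited" label column are assumptions based on the sample:

    import mlflow
    from sklearn.model_selection import train_test_split

    # Point MLflow at an experiment and let it log parameters, metrics and models automatically
    mlflow.set_experiment("bank-churn-experiment")
    mlflow.autolog()

    # Read the cleaned data back from the Delta table created earlier
    df_final = spark.read.table("df_clean").toPandas()

    # Separate features and label, then split into training and test sets
    y = df_final["Exited"]
    X = df_final.drop("Exited", axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)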
 

  • Training the Crew: Split your prepared data into training and testing sets. The training set feeds the algorithm, allowing it to "learn" the patterns associated with customer churn.

Now apply SMOTE to the training dataset by running the cell below.
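SMOTE oversamples the minority (churned) class so the classifier is not biased towards the majority class. A minimal sketch, applied to the training split only:

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y_train)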
 
Now train the model with Random Forest as shown below
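A sketch of that cell; the hyperparameters shown are illustrative, not the sample's exact values:

    import mlflow
    from sklearn.ensemble import RandomForestClassifier

    # Train a Random Forest on the SMOTE-balanced data inside an MLflow run
    with mlflow.start_run(run_name="random_forest"):
        rf_model = RandomForestClassifier(max_depth=4, n_estimators=500, random_state=42)
        rf_model.fit(X_res, y_res)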
 
 
 
Train the model with LightGBM too as shown below
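And a similar sketch for LightGBM, again with illustrative hyperparameters:

    import mlflow
    from lightgbm import LGBMClassifier

    # Train a LightGBM classifier on the same balanced training data
    with mlflow.start_run(run_name="lightgbm"):
        lgbm_model = LGBMClassifier(learning_rate=0.07, n_estimators=100, random_state=42)
        lgbm_model.fit(X_res, y_res)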
 
 

  • Fine-Tuning the Sails: Use hyperparameter tuning techniques to optimize the chosen algorithm's performance. This involves adjusting its parameters to achieve the best possible performance on held-out validation data.
Track the model performance by observing the model metrics as shown below
 
Step 4: Evaluation & Deployment
  • Testing the Waters: Evaluate your model's performance on the unseen testing data. Metrics like accuracy, precision, and recall will tell you how well the model predicts churn.
Load the best model and assess the performance against the test data as shown below
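A sketch of loading a logged model back from MLflow and scoring the test set; the run ID is a placeholder you would copy from the experiment:

    import mlflow

    best_model = mlflow.sklearn.load_model("runs:/<run-id>/model")  # <run-id> is a placeholder
    y_pred = best_model.predict(X_test)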
 
  • Refinements & Improvements: Based on the evaluation results, you might need to refine your model by trying different algorithms, features, or hyperparameter settings. Iterate until you're satisfied with its performance.
Check the confusion matrix results as shown below
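For example:

    from sklearn.metrics import confusion_matrix

    # Rows are actual classes, columns are predicted classes
    cm = confusion_matrix(y_test, y_pred)
    print(cm)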
 
  • Deploying the Model: Once the model performs well, save the prediction results to a Delta table in the Lakehouse.
Save the results into the lakehouse by running the code as shown below
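A sketch of that cell; the output table name is an assumption:

    # Attach the predictions to the test features and persist them for Power BI reporting
    df_results = X_test.copy()
    df_results["predictions"] = y_pred
    spark.createDataFrame(df_results).write.mode("overwrite").format("delta").saveAsTable("customer_churn_predictions")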
 

Step 5: Visualization & Communication
  • Charting the Future: Leverage Power BI, seamlessly integrated with Fabric, to create compelling visualizations of your churn predictions. Segment customers based on their predicted churn probability, allowing for targeted interventions.
An example screenshot of the Power BI visualisation is shown below.
 

  • Sharing the Treasure: Communicate your findings to stakeholders. Use Power BI dashboards to showcase the model's effectiveness and its potential impact on reducing customer churn.
This blog post demonstrates how Microsoft Fabric acts as your comprehensive toolkit, guiding you through the entire customer churn prediction journey!




Friday, March 03, 2023

Deploy the Azure Machine Learning Model

In the previous post I discussed how to create an Azure Machine Learning model.  In this post I will discuss how to deploy this model.

Prerequisites

Before deploying a machine learning model in Azure, there are several prerequisites you need to fulfill:

Prepare your data: You should have a well-prepared and cleaned dataset that has been tested and validated.

Select your model: You need to choose an appropriate machine learning algorithm based on your problem statement and the nature of your data.

Train your model: You need to train your machine learning model on your prepared dataset.

As you have seen, we undertook all these steps and trained our model in Azure Machine Learning in the previous post.


As part of training the model, we created a training pipeline.  Now if we want to deploy the model, we need to create a real-time inference pipeline.

To do that, launch the Azure Machine Learning studio from the Azure portal as shown below.




Monday, February 27, 2023

Create a Predictive Model in Azure Machine Learning -- Designer mode

Today, after a long time, I wanted to play around with Azure Machine Learning Designer, formerly Machine Learning Studio.

So I set out to create a predictive model using the sample data that is available in Azure ML -- Automobile price data (Raw).

So I logged in to Azure Portal -- 


Creating a Machine Learning Workspace

The first step is to create a Machine Learning Resource -- 

So for that, click on the Create a resource button and search for machine learning as shown below:



Click on Azure Machine Learning

The below screen appears. 


Click on Create.  The below screen appears.



Fill in the Resource Group, Workspace Name and Region details.

Fill in other details in the other tabs if needed and click on Review and Create

Now the Azure Machine Learning Workspace is created as below



Click on the Name of the AML workspace

The below screen appears.




Click on Launch Studio.  This will launch the Azure Machine Learning (AML) interface.

Click on the Create New -- This opens up the Menu of what can be created as shown below.


Ingesting Data

Click on Pipeline.  This will create a New Pipeline Menu as shown below




As you can see, it uses the Designer authoring tool.  Now click on the plus button, which is Create a pipeline using classic pre-built components.

This will open a blank canvas in the designer as shown below:





Now we are ready to create a training model.  Click on the two arrows beside Undo as shown in the above image.  This will open a Menu which contains the components that can be used to build the predictive model as shown below:


Click on Sample data and drag the dataset named Automobile Price Data (Raw) on to the blank canvas




You can right-click the Automobile price data (Raw) component and select Preview Data to understand the dataset. 

Each row corresponds to an automobile, and the variables associated with each automobile appear as columns. There are 205 rows and 26 columns in this dataset.



Preparing Data


Now that we have chosen the dataset, we need to clean the data.
The first step in preparing the data is to eliminate columns that we do not need.
The second step is to remove the missing values.

If you look at the dataset carefully, there are many values missing from the column normalized-losses, so we need to eliminate this column. To achieve this, we can use the Select Columns in Dataset component as shown below.


Now connect the Dataset component with the Select Columns component.




Double-click on the new component, choose Edit columns, and create the rules as below to include all columns except the column named normalized-losses.



The next step is to remove the remaining missing values.  For this we will use the Clean Missing Data component.  Drag this component to the canvas.  Connect the Select Columns in Dataset component to the Clean Missing Data component as shown below.

Double-click on the Clean Missing Data component and change the settings as shown below.



Preparing Training and Test Data
Now that we have prepared and cleaned the dataset, the next step is to prepare the train and test data.  For this we will use the Split Data component.  Search for the Split Data component and drag it to the canvas.  Connect the Clean Missing Data component to the Split Data component.  Make sure that the Cleaned dataset port is connected as shown below.





Double-click the Split Data component and configure it as shown below. The 0.7 in Fraction of rows in the first output dataset means that the dataset will be split 70/30.  The 70% portion will be used for training the model and the 30% portion will be used as the test dataset.



Training a Model
The next step is to train a model.  From the data we are going to predict the price of the automobile, so we will use a linear regression model.  Search for the Linear Regression component and drag it to the canvas.  Next, search for the Train Model component and drag it to the canvas.  Connect these components as shown below, and in the Train Model component set the label column to price, since that is the value we want to predict.



Scoring and Evaluating the Model

Next add the components -- Score Model and Evaluate Model on to the canvas and connect them as shown below:



Now your Pipeline is ready.  

Next Steps --- 
In order to train this model, you need to click on the Submit button



Once you submit you will be prompted with the below configuration.



Ensure that you have configured the compute target as well, as shown below.





This will create a pipeline job and a notification will pop up at the top right corner of the page.   Since this is your first job, this might take up to 20 mins to run.  Once you get the notification that the job is completed, you can look at the job detail page as shown below.



You can then look at the scored labels and the predicted prices as shown below.



You can use the Evaluate Model component to see how well the trained model performed on the test dataset, as shown below.



You can see the error statistics above.  For each of the error statistics, smaller is better. A smaller value indicates that the predictions are closer to the actual values. 

For the coefficient of determination, the closer its value is to one (1.0), the better the predictions.
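For reference, the coefficient of determination compares the model's squared errors with those of simply predicting the mean price:

    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where y_i is the actual price, \hat{y}_i the predicted price, and \bar{y} the mean of the actual prices.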


That's it from me for today.

In my next blog post I will show you how you can deploy this model.




































Tuesday, November 22, 2022

Create a Lollipop Chart in Power BI -- Without the use of any Custom Visuals

 I have been playing around with the newly launched Error Bars functionality in Power BI.  The result is this blog post.  Here I am going to explain how you can create a lollipop chart in Power BI without using any custom visuals, Charticulator, or Deneb.

Step 1:  

Create a simple line chart -- I have sales data and region data.  So I created a line chart as shown below.



Step 2:  

Format the line chart to remove the line as shown below.  Click on the format icon -- Visual tab -- Line -- Stroke Width -- Change from 3 px to 0


The result of this action is -- the line will entirely disappear as we changed the Stroke Width from 3 px to 0 px.




Step 3:  

Add markers as shown below, ensure that the shape is a circle resembling the ball of the lollipop, and increase the size to 10 px to get reasonably sized balls.



The result is as shown below.


Step 4:  
Ensure that the Data Labels are Enabled and choose Position as Above as shown below
You can disable Y-Axis if you need to.  



Step 5:  
Here is where we will be using the Error Bars to add the Line to show the lollipop stick.
To go to the Error Bars section -- Go to the Further Analytics icon and click on Error Bars.
Enable the Error Bars.  As shown below, it needs an Upper bound and a Lower bound.  The Upper bound can be Sum of Sales and the Lower bound is always 0 (zero).  Since there is no way to input a number directly, let us create a measure that returns 0.
The new Measure is -- Lowerbound = 0
 

Step 6:
You can see faint lollipop lines.  You can increase the width of the line by clicking on the Bar and changing the Width as shown below


And voilà!  Your lollipop chart is ready!

Hope you liked this Step by Step instruction on creating the lollipop chart.











