Azure Synapse Analytics Dedicated SQL Pool (Data Warehouse) SQL Database

Azure Link For SQL

Azure Link For SQL enables near real time analytics over operational data in an external SQL Database.

Azure Link can now be used with SQL Server 2022 and Azure Database, and it works by replicating the specified tables into a Synapse SQL Pool automatically, in an incremental way. This allows to avoid creating a (possibly) complex ETL process to keep the Datawarehouse in sync.

This integration can be done with a SQL Server and with an Azure SQL Database. For SQL Server a Master key needs to be created both in the Datawarehouse and the SQL Server, and it is required to use a Self-Hosted integration runtime hosted in a machine with access to the server.

For Azure SQL Database the process is much simpler, but there’s a pricing limitation as Synapse Link can only be used in Databases with at least Standard tier (so no Basic tier) and 100 DTUs. The cost of this is 147.19 USD every month, so this limitation is something to keep in mind:

Minimum specification to be able to use Synapse link. Lowering the Data max size doesn’t have an impact on the cost.

Creating a Synapse Link for Azure SQL Database

In this case I’m going to do this example using the “Azure SQL Database” path as it’s simpler. So the first step is to create a Linked Service connecting to the Azure SQL Database as usual (in this case I’ve used System Managed Identity for authentication). As stated previously, the Database needs to be at least Standard Tier and have 100 DTUs and Managed Identity enabled.

After that we can create the Link:

During the process you can select in which SQL Pool do you want to copy, and which tables are going to replicated. It is important to keep in mind that only tables with a primary key can be used.

With the link connection created, you have the chance to further customize the process, renaming the tables, changing the distribution type and the structure type (which need to be set to ‘Heap’ whenever your table contains nvarchar(max) columns), and deciding whether you want to drop already existing tables or not. Once everything is set up the link connection can be published and started. After this, no further steps needs to be taken in Synapse. You can stop the link at any point and monitor the connection:

After this, every insertion, update or deletion in the source will be reflected in the SQL Pool in less than a minute.

Machine Learning

Building a Machine Learning Model to infer a Car Price in Azure Machine Learning Studio

The goal of this article is to build and explain a model in Azure Machine Learning Studio (ML Studio) that infers a price for a car based on some of its characteristics.

Getting the Data

The data we are going to use is provided by ML Studio as a sample, which can be found in the Datastore section inside azureml_globaldatasets:

While we could use the already created Dataset, in this case we will download the file and create our own dataset to illustrate the full process.

Once the file is downloaded to our local PC, we can upload it to our own Blob Storage. Every Azure Machine Learning workspace has an associated storage, which can be found from the Portal:

We will use the ML Studio UI to upload the file, but this can also be achieved directly through the portal as with any Blob Storage. To upload the file and create a dataset we need to use the Dataset section:

From there, we select:

  1. Create Dataset
  2. From Local Files
  3. Provide a name for the Dataset and, for this example, select Tabular
  4. In this screen we can upload the file and change where it’s going to be uploaded. Let’s use the default path for now
  5. We can see a preview of the data and the detected File Format (Parquet)
  6. We can modify the schema changing what columns we include and the datatype

Finally, we can check the details and create the dataset. Azure ML Studio provides some tools to inspect the Dataset that we can use to get a feel for the data we are dealing with:

This Profile view includes a lot of statistics about each column, which may be useful in some cases

Each row holds data for a specific car, including its price.

Design the pipeline

Now that we have the dataset ready, it’s time to train our model. Azure ML Studio offers three separate ways to do this:

  • Python Notebooks: The most versatile choice, meant for programmers.
  • Automated ML: The simplest choice. You just need to select the dataset and the intended goal, and it will automatically choose the best model for the job.
  • Designer Pipelines: A middle ground between Automated ML and Python, allowing you to build ML Pipelines yourself but in a visual way

We are going to use the pipelines, as they provide a good trade-off between understandability and control.

First, we need to create the pipeline:

After that we can add the dataset we just created by dragging from the Dataset section:

After we import the data, we are going to clean it up removing rows with empty data. However, before we do that, we want to get rid of the normalized-losses column, because there are a lot of missing values:

Here we can see that 20% of the values are missing for normalized-losses, which means we would lose 20% of the rows in the next step just from this column

To do this, we can drag a Select Columns in Dataset component from the Data Transformation section and connect the dataset to it. We will use a rule to include all columns except normalized-losses:

Now, from the Data Transformation section we can drag a Clean Missing Data component and select All Columns. This is how it would look like:

After the clean-up we are ready to start with the training process. The first step is, as usual, to split the dataset into two groups, one for training and one for testing/evaluating our model. We will use a random 70/30 split, connecting the first output of Clean Missing Data to a Split Data component and specifying 0.7 as the Fraction of rows in the first output dataset:

Now its time to choose what algorithm we are going to use. Let’s remember our problem: We want to infer a price (a real number) from a set of characteristics of a car. This is a Regression problem then. We could use many different algorithms, and the Designer includes components to compare them and see which one performs better, but for this example we will select the Linear Regression algorithm, which is simple to set up. There is a component for this algorithm under the Machine Learning Algorithms section, so we just need to drag it. We will use the default parameters.

With the algorithm and the dataset ready we can set up the training, dragging the component Train Model and connecting to it both the algorithm and the first dataset from the split (the 70% of the original). We also need to tell it that we want to infer the price:

Note that we have set Model Explanations to True. This is an optional feature that allows you to better understand how is the model predicting the price.

After this step is executed, the model is already trained. However, we will add an Score Model (which receives the model and new data and infers the desired label, in this case the price) and an Evaluate Model component to check how good the model is. For this, we will use the remaining 30% of the dataset, the Testing split, and compare the predictions of the model with the actual values. This is the final pipeline:

Training the model and setup

Now its time to execute it and see the results. For this we first need to create a Compute. For this simple scenario the most basic compute should work. After that, we can attach the compute to the pipeline:

Now we can Publish the pipeline, creating a new Pipeline Endpoint which can be accessed from outside (eg. From Azure Synapse) and Submit it. This runs the Pipeline for the first time. After some minutes we have the results:

Everything completed without errors and a new button appeared on top. This button allows us to create an Inference Pipeline from the trained model, so we can use it through the API from different places like a Python notebook. Let’s do it, selecting real time inference. This is the generated Pipeline:

The Web Service Input indicates where data enters from the API call and Web Service Output shows where the data comes out to answer the API call. This way the pipeline can be modified accordingly if needed (removing unwanted columns before feeding the input, performing some transformations…) by the user before submitting. In our case we are just going to Submit and Deploy, selecting Azure Container Instance for simplicity instead of Kubernetes. Once it’s deployed we will use the REST endpoint to query the model. If we go into the model it even provides us with the python code needed to query the endpoint:

Testing the Inference

Inside ML Studio we can create and run python notebooks using the created computes, employing different IDEs like JupyterLab or VSCode, but in this case we will run the Notebook from an Apache Spark instance in our Synapse Workspace.

After we run the Notebook, we get the expected output (keep in mind that, for this to work, none of the strings provided can be empty):

For this input, the predicted price is 14172 and the real one is 13495, which is roughly a 5% error. This could be further improved by choosing different algorithms or getting a larger dataset. If we change it so the input car has four doors, uses diesel and have six cylinders we will get a different price:

In this case the inferred car price is 26163, which match our expectations of the price rising after adding more cylinders, doors, and changing the fuel type to diesel.