
Data Transformation at Different Stages in PowerBI 

There are numerous ways to apply data transformations when creating a dashboard. One of the most important aspects is when to apply these transformations. This decision may impact not only the performance but also the complexity of the dashboard.  

Let us review four different options and analyze which one works best under which conditions.

| | Option 1 | Option 2 | Option 3 | Option 4 |
|---|---|---|---|---|
| Language | DAX (used in the PBI presentation layer) | M Language (used in the PBI data layer) | SQL (query the database from PBI) | SQL (view in the database) |
| Transformation needs user input | Yes | No | No | No |
| Complexity | Low to high | Low | High | High |
| Reusability | No | No | No | Yes |
| Sample | Average delivery time in a user-selected period | Time to deliver a product (delivery date − order date) | Time to deliver a product | Time to deliver a product |

Option 1

This is the most frequently used option. Transformations are applied at the PBI presentation layer by using DAX in measures or calculated columns. These transformations are only recommended when they depend on some user input: at the earlier stages the user's input is not yet known, so every possible choice would have to be computed in advance.

Note that these transformations are recalculated every time the dashboard UI changes and may therefore negatively affect report performance.

For example, this approach would fit the need to calculate an average value according to a time period selected by the user. 

On the other hand, this option is not recommended for transformations that do not depend on the user context. For example, in a fact table about product deliveries, the delivery time could be calculated by subtracting the order date from the delivery date.

Option 2

Transformations are applied at the PBI data layer by using the M Language in the Power Query editor. This is a very powerful tool, offering a wide range of transformations with little to no code. This option is ideal when the transformation does not involve user interaction and neither of the following two options is available.

Note that these transformations are recalculated by the Power Query Engine every time the report’s dataset is refreshed. 

For example, returning to the delivery example that was a poor fit for the first option, the "DeliveryTime" column can be added with a Power Query step. This is better than the first option, although the next two options improve on it further.

Option 3

Transformations are applied by using a custom SQL query within PowerBI when the data source is being imported into the report. This option is ideal when the given transformation does not involve user interaction and we cannot create views in the database.  

Note that these queries are executed by the database engine itself every time the report’s dataset is refreshed. 

For example, to achieve the same result as in the previous example, a custom SQL query that computes the delivery time can be supplied in the import configuration.

Option 4

Transformations are made directly in the data source by creating a view that contains the data with the desired transformations. This is the best option as long as the transformations do not involve user interaction and we have sufficient rights in the database. Also, once the view is created, it can be reused in multiple dashboards.

Note that these queries are also executed by the database engine itself every time the view is accessed. 

For example, to achieve the same result as in the previous example, a “delivery_time” view should be created with the corresponding grouping. 

Conclusion

There is one key aspect to keep in mind when deciding how and when to apply data transformations: user interaction. If the transformations depend on any user interaction, they must be performed as in the first option. Otherwise, they should be performed as in one of the other three options.

Among those, the fourth is the most preferable, but sometimes we do not have the rights required to create views in the database, in which case option three becomes useful. If option three is not available either, for example due to the type of data source, then the second option is the way to go.


Data Anonymization with Synapse Analytics  

Dataset anonymization is a transformation that can be a very beneficial part of the pre-processing stage when dealing with vast amounts of data in a project. It transforms an initial sensitive dataset by randomly inserting dummy content while preserving data coherence, producing an equivalent dataset that contains no real-world information.

This process can be applied to a dataset before any employee develops or tests solutions around it, adding an extra layer of security to the development process. Once the product seems ready, the real dataset is connected instead for production purposes.

Using this transformation in the development and testing environments is considered a best practice.

Azure Synapse Analytics 

There are many tools available for big data analytics, but we will focus on Azure Synapse Analytics, from Microsoft. Microsoft describes it as a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. On the application's main page there are three main sections: Data, Develop, and Integrate.

The first is where all the storage containers are located, from unstructured ones such as data lakes to structured ones such as SQL databases.

The Develop section is for creating, storing, and executing many kinds of data manipulation or analysis scripts in the cloud, taking advantage of Apache Spark. For example, Notebooks are an immensely powerful tool since they allow general-purpose languages such as Python, Scala, or C# to be used along with standard data science libraries such as Pandas.

Finally, the Integrate section is intended to automate complex processes related to the extraction, transformation, and storage of data. Pipelines are another powerful tool: they define sets of data processing elements connected in series, where the output of one element is the input of the next. They can also be set up to execute automatically on a schedule or in response to predefined trigger events.

Dataset anonymization tool 

Let us review an implementation done in a Synapse Analytics Notebook using Python and Pandas. It reads an Excel file from Azure Data Lake Storage Gen2 and creates another one with the result. The input file is required to have the following structure:

  • First Excel sheet: contains all the column headers in the first row, plus the values that do not require any modification. 
  • Second Excel sheet: contains the headers and desired values of the columns that need to be anonymized but are independent of the others. 
  • Remaining Excel sheets: each one contains a set of headers and desired values for columns whose values depend on each other. This is useful to maintain coherence in the resulting dataset. 

To give a more detailed explanation of the execution, an example dataset of 1000 rows related to business sales was generated using the Mockaroo web application. The following image is a preview of the sensitive data file that needs to be anonymized. 

Imagine the Global Sales Organization wants to create a series of dashboards related to this information. If they want to add an extra layer of security, these datasets should be anonymized before any dashboard development starts. The first step would be for a person with access to the real data to modify the file so that it follows the required format.

Input file formatting

Since there is no need to apply any transformation to the Start Date and End Date fields, their values can remain on the first sheet.

The second sheet includes the independent fields to be anonymized, in this case Company and Business Unit. These columns contain all the possible values for the final anonymized dataset, so this is where sensitive data can be suppressed or substituted. Since each column is independent, the columns do not need to be the same length.

Note that these two fields could be dependent on each other, but for this example they are kept independent; otherwise, they would go on the third or a higher-index sheet.

The Market, Country Code, and Currency fields are dependent on each other and are therefore placed on the third sheet. Notice that in dependent column sheets (dimensional sheets) all the columns must be the same length.

The same goes for the Spends, Income, and Profit fields, since they are dependent on each other. Their values are filled with non-real examples, keeping Income − Spends = Profit to maintain coherence.
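
As an illustration only, the following sketch builds a small workbook in this format with Pandas. The companies, markets, figures, and sheet names below are invented placeholders rather than the Mockaroo example data, and in practice the file can just as well be prepared by hand in Excel.

```python
import pandas as pd

# Sheet 1: all column headers, plus the values that stay unchanged (the dates)
fact = pd.DataFrame({
    "Company": [None] * 3, "Business Unit": [None] * 3,
    "Market": [None] * 3, "Country Code": [None] * 3, "Currency": [None] * 3,
    "Spends": [None] * 3, "Income": [None] * 3, "Profit": [None] * 3,
    "Start Date": ["2021-01-01", "2021-02-01", "2021-03-01"],
    "End Date":   ["2021-01-31", "2021-02-28", "2021-03-31"],
})

# Sheet 2: independent columns; they do not need to share size
independent = pd.DataFrame({
    "Company": pd.Series(["Company A", "Company B"]),
    "Business Unit": pd.Series(["Retail", "Wholesale", "Online"]),
})

# Sheet 3: Market / Country Code / Currency move together (same length)
dim_geo = pd.DataFrame({
    "Market": ["Spain", "United Kingdom"],
    "Country Code": ["ES", "UK"],
    "Currency": ["EUR", "GBP"],
})

# Sheet 4: Spends / Income / Profit, kept coherent (Income - Spends = Profit)
dim_financial = pd.DataFrame({
    "Spends": [100, 250],
    "Income": [180, 400],
    "Profit": [80, 150],
})

# Write the four sheets in the required order (needs an Excel writer such as openpyxl)
with pd.ExcelWriter("anonymization_input.xlsx") as writer:
    fact.to_excel(writer, sheet_name="Fact", index=False)
    independent.to_excel(writer, sheet_name="Values", index=False)
    dim_geo.to_excel(writer, sheet_name="DimGeo", index=False)
    dim_financial.to_excel(writer, sheet_name="DimFinancial", index=False)
```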

Once the input file is correctly formatted and uploaded to the Data Lake Storage Gen2, it is time to execute the Dataset Anonymization Tool Notebook.

Notebook manual execution

Once the Notebook is open, attach it to a valid execution pool, fill in the input and output file path parameters in the first cell, and press Run all.

The second cell imports two dependencies: Pandas, a standard data analysis library built on top of NumPy, and the random module.
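
A minimal sketch of these first two cells might look as follows; the parameter names and paths are illustrative assumptions, not the tool's actual ones.

```python
# Cell 1 - input/output file path parameters (illustrative values).
# Reading and writing abfss:// URLs directly from Pandas assumes fsspec/adlfs are available
# in the Spark pool; otherwise copy the file to local notebook storage first.
input_file_path = "abfss://data@mydatalake.dfs.core.windows.net/raw/sales_sensitive.xlsx"
output_file_path = "abfss://data@mydatalake.dfs.core.windows.net/anonymized/sales_anonymized.xlsx"

# Cell 2 - dependencies: Pandas for data handling, random to pick replacement values
import pandas as pd
import random
```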

From this point on, every cell follows a try-except structure to produce more comprehensible error messages. The next cell reads the input file and prints the sheet names found.
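
Continuing the sketch (and reusing input_file_path from the previous one), this cell could look roughly like this:

```python
# Cell 3 - open the workbook inside try/except and report the sheets found
try:
    excel_file = pd.ExcelFile(input_file_path)
    sheet_names = excel_file.sheet_names
    print("Sheets found:", sheet_names)
except Exception as error:
    raise RuntimeError(f"Could not read input file '{input_file_path}': {error}")
```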

Then, Pandas data frames are created to store the content of each Excel sheet, and the first sheet's data is printed.
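
A possible shape for this cell, again only as a sketch that reuses excel_file and sheet_names from above:

```python
# Cell 4 - one DataFrame per sheet: fact data, independent values, dimensional groups
try:
    fact_df = excel_file.parse(sheet_names[0])                      # headers + unchanged columns
    values_df = excel_file.parse(sheet_names[1])                    # independent columns to anonymize
    dim_dfs = [excel_file.parse(name) for name in sheet_names[2:]]  # dependent column groups
    print(fact_df.head())
except Exception as error:
    raise RuntimeError(f"Failed to load the sheets into data frames: {error}")
```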

The next cell contains two functions that help detect formatting errors in the input file. The first checks that every column header has a matching header on the first sheet. The second verifies the format of the dimensional sheets (index >= 3).
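
The two checks might be implemented roughly like this; the function names are hypothetical and the notebook's own logic may differ.

```python
# Cell 5 - format checks
def check_headers_match(fact_df, other_df, sheet_name):
    """Every header on an auxiliary sheet must also exist on the first sheet."""
    missing = [col for col in other_df.columns if col not in fact_df.columns]
    if missing:
        raise ValueError(f"Sheet '{sheet_name}' has headers not present on sheet 1: {missing}")

def check_dimensional_sheet(dim_df, sheet_name):
    """Dimensional sheets (index >= 3) must have columns of equal length."""
    lengths = {col: int(dim_df[col].dropna().shape[0]) for col in dim_df.columns}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Sheet '{sheet_name}' columns differ in length: {lengths}")
```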

Now all the previously created data frames are processed in order to clean, organize, and check their values. First, the values sheet (index = 2) is processed, storing the name, size, and unique values of each column. The output in the image shows part of the information related to the Company field.
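
For the independent columns this could amount to something like the following sketch (reusing values_df from above):

```python
# Cell 6 - independent columns: keep the column name, size and pool of unique values
independent_values = {}
for column in values_df.columns:
    pool = values_df[column].dropna().unique().tolist()
    independent_values[column] = pool
    print(f"{column}: {len(pool)} unique values, e.g. {pool[:3]}")
```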

A similar process is applied to each dimensional sheet, but allowing duplicate values. Some information related to the DimFinancial sheet can be seen in the output.
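
A matching sketch for the dimensional sheets, keeping whole rows together so related values stay coherent:

```python
# Cell 7 - dependent column groups: keep complete rows, duplicates allowed
dimensional_rows = []
for dim_df in dim_dfs:
    rows = dim_df.dropna(how="all").to_dict("records")
    dimensional_rows.append(rows)
    print(f"{list(dim_df.columns)}: {len(rows)} candidate rows")
```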

The next and most important step incorporates the previously stored values into the fact data frame by iterating over its rows and assigning a random value, or group of values, to each field. A preview of the result is then printed.
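
The core loop could look roughly like this, using random.choice to pick either a single value or a whole dimensional row for each fact row; it continues from the earlier sketches.

```python
# Cell 8 - build the anonymized fact table row by row
# cast to object so mixed replacement values can be assigned without dtype issues
anonymized_df = fact_df.copy().astype("object")

for idx in anonymized_df.index:
    # independent columns: any value from that column's pool
    for column, pool in independent_values.items():
        anonymized_df.at[idx, column] = random.choice(pool)
    # dependent column groups: one whole row, so coherence is preserved
    for rows in dimensional_rows:
        chosen = random.choice(rows)
        for column, value in chosen.items():
            anonymized_df.at[idx, column] = value

print(anonymized_df.head())
```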

Finally, the resulting data frame is written to an Excel file and stored at the output file path set in the very first cell.
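
And the final cell of the sketch could simply be:

```python
# Cell 9 - write the single-sheet result to the output path set in the first cell
# (writing straight to an abfss path again assumes fsspec/adlfs; otherwise
#  write locally and upload with the workspace utilities)
anonymized_df.to_excel(output_file_path, index=False)
```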

The output file has a single sheet, and it contains the freshly anonymized data.

Now this file can act as consistent dummy data for developing solutions without exposing real-world sensitive information.

Execution through a Pipeline

To automate its execution, the Notebook can be run from a Synapse Analytics Pipeline. Add a Synapse Notebook activity and set the correct input and output file path parameters.

The example can be run by pressing the Debug button. If something goes wrong, the execution will end with a Failed status. The error message can be seen by hovering over the name and clicking on the message icon.

Otherwise, if everything goes as expected the Notebook will finish its execution with a Success status. 

If the anonymized dataset needs to be stored in a SQL database, a previously created Pipeline, pl_upload_CSV_to_DW, can be helpful. It takes the input folder path, input CSV file name, table schema, and table name as parameters. It can be added as another step of the previous Pipeline.

Notice that the parameters section shows code expressions. This is because the main Pipeline is parameterized, and those snippets are references to the main Pipeline's parameters, which can be displayed by clicking on the blank area in the parameters section.

Of course, the Notebook activity parameters should also be updated, since we want them to reference the main Pipeline parameters.

This time, the value field contains a slightly more complex expression. The main parameters hold the folder paths and file names as separate values, while the Notebook parameters expect full file paths as single parameters, so the expression builds the full file path by concatenating the folder path and file name in the right format.