Merge Data into a Single Table using Azure Data Factory


I recently came across a scenario where the source files in an Azure Blob Storage container had slightly different schemas, i.e., one set of files had an extra column compared to the other set. The challenge was to identify these differing files and load them into a single target table on the fly.

How do you handle files with slightly different structures (coming from the same source/repository) and merge their data into a single table using Azure Data Factory?

For example, File1 has columns – Name, Age, Department.

Name Age Department
John 34 IT
Mathew 45 Finance

File2 has columns – Id, Name, Age, Department

ID Name Age Department
12 Steve 40 HR
17 Fleming 44 Finance

Data in the target table (say, EmployeeDetails) should look like this:

ID    Name     Age  Department
12    Steve    40   HR
17    Fleming  44   Finance
NULL  John     34   IT
NULL  Mathew   45   Finance
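
In SQL terms, the desired end state is roughly the following. This is a minimal sketch: the column types are assumptions, and rows coming from File1 simply leave ID as NULL since that column does not exist in the file.

CREATE TABLE EmployeeDetails (
    ID         INT,            -- not present in File1, so those rows stay NULL here
    Name       VARCHAR(100),
    Age        INT,
    Department VARCHAR(100)
);

-- File2 rows (four columns) map to all four target columns.
INSERT INTO EmployeeDetails (ID, Name, Age, Department) VALUES (12, 'Steve', 40, 'HR');
INSERT INTO EmployeeDetails (ID, Name, Age, Department) VALUES (17, 'Fleming', 44, 'Finance');

-- File1 rows (three columns) are mapped explicitly, leaving ID as NULL.
INSERT INTO EmployeeDetails (Name, Age, Department) VALUES ('John', 34, 'IT');
INSERT INTO EmployeeDetails (Name, Age, Department) VALUES ('Mathew', 45, 'Finance');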

Solution Summary

  1. Count the columns in the source file.
  2. Check the value of the first row and first column of the source file.
  3. Add two Copy Data activities, one for each file type, and map the source and target columns explicitly.
  4. Based on the checks in #1 and #2, decide which Copy Data activity to use for a given file.

Detailed Explanation of the Steps Involved in the Pipeline

All of the files have the extension .prd, and the requirement is to load them into Azure SQL Data Warehouse (Azure Synapse).

  1. Load the files from Amazon S3 to Azure Blob Storage using a Copy Data activity. No special settings are needed for the source and sink datasets here; just set the encoding to UTF-8.
  2. There will be three tables to assist in loading (a rough SQL sketch of the helper tables follows this list).
    • TableFileName – holds the name of each file present in blob storage along with the value of the first column of its first row.

For example, if a file (titled File1) has 1, Lokesh, 35, Health as its first row, the table will contain:

File1 1

 

    • TableErrorFileName – holds the names of files that failed to load.
    • StagingTable – the actual target table that receives the data from the source files.
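
A minimal SQL sketch of the two helper tables, using hypothetical column names (FileName, FirstColValue) and types for illustration only:

CREATE TABLE TableFileName (
    FileName      VARCHAR(255),  -- name of the file found in blob storage
    FirstColValue VARCHAR(100)   -- first-row, first-column value, or a hard-coded flag
);

CREATE TABLE TableErrorFileName (
    FileName VARCHAR(255)        -- name of a file that failed to load
);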
  3. Before loading, clear or truncate all three tables, for example:
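
A simple way to do this is a Stored Procedure or Script activity that runs statements along these lines (a sketch, assuming TRUNCATE is acceptable for all three tables):

TRUNCATE TABLE TableFileName;
TRUNCATE TABLE TableErrorFileName;
TRUNCATE TABLE StagingTable;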
  4. Child Pipeline

a. The first step in this pipeline is a Get Metadata activity whose dataset is mapped to the blob storage location where the source files are kept, with the Field list set to Child Items.


b. The second step is a ForEach loop container.


The ForEach container includes the following activities:


  • "Get Metadata….2" – set to a parameterized dataset that points to an individual file in blob storage. This activity is used to get the column count of the given source file.

 


  • "Lookup First Row and Column" – points to a parameterized dataset. This activity retrieves the value of the first column of the first row of the source file and also prevents an empty or missing file name from being passed to the subsequent step. A missing file name would be hard to handle, so a dummy file called test.txt is substituted, which prevents the operation from failing.


Expression – @if(greaterOrEquals(activity('Get MetadataFileDetails2').output.columnCount,2),item().name,'test.txt')

If the file has at least two columns, the actual file name is passed on; otherwise the dummy test.txt is used.

  • "If First Row n Col is …" – the next activity is an If Condition activity; its job is to determine the file type based on the value of the first column.


Expression – @bool(startswith(activity('Lookup 1st Row n Col').output.firstRow.Prop_0,'9'))

This evaluates to true when the first column of the first row starts with '9'.

  • True condition – insert the file name and the first-row, first-column value into TableFileName (which was created earlier; see the list of assisting tables above).


  • False condition – insert the file name and a hard-coded flag value into TableFileName. This makes it possible to distinguish between the two different file types.


  • If the If Condition activity fails, insert the file name into the same table, with "failure" concatenated to the first-column value. A sketch of these three inserts follows.

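Assuming the inserts are issued through a Stored Procedure or Script activity, the three branches could boil down to statements like these; the parameter names, the column names, and the 'FLAG' value are placeholders, not the exact values used in the pipeline.

-- True branch: store the actual first-row, first-column value.
INSERT INTO TableFileName (FileName, FirstColValue) VALUES (@FileName, @FirstColValue);

-- False branch: store a hard-coded flag to mark the other file type.
INSERT INTO TableFileName (FileName, FirstColValue) VALUES (@FileName, 'FLAG');

-- Failure branch: store the file name with 'failure' concatenated to the first-column value.
INSERT INTO TableFileName (FileName, FirstColValue) VALUES (@FileName, CONCAT(@FirstColValue, 'failure'));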

5. (Back in the main pipeline) "Lookup_G…Files" – this Lookup activity queries the table TableFileName for one of the two possible file types.


6. The other Lookup ("Lookup_I…Files") queries for the second file type. Because these lookups return multiple rows, the "First row only" property is not set. The two queries could look roughly like this:
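
A hedged sketch of the two Lookup queries, reusing the placeholder column names and flag value from above; the actual predicates depend on how the file types were recorded in TableFileName.

-- One lookup: files whose recorded value starts with '9' (True branch of the If Condition).
SELECT FileName FROM TableFileName WHERE FirstColValue LIKE '9%';

-- The other lookup: files recorded with the hard-coded flag (False branch).
SELECT FileName FROM TableFileName WHERE FirstColValue = 'FLAG';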

7. Each of these lookups then feeds its output to a ForEach loop container. Each container holds a Copy Data activity that transfers the data into the target table.


8. Eventually, the data is inserted into the respective target table, and the names of any failed files are inserted into the error table TableErrorFileName. One important setting here is that the column mapping must be specified explicitly.


The Get Metadata activity in ADF provides important metadata about a file or folder of files, which can be leveraged to make decisions about how the data is actually loaded. For instance, in this case we made use of the "Column Count" property. I hope this gave you an end-to-end look at how to merge data into a single table using Azure Data Factory and saved you from piecing it together from ten different bits of documentation.


