Getting started with BETL in Azure

1. Prerequisites

For this getting started guide, you will need the following resources:

  • Azure SQL database containing BETL
  • Azure SQL source database (e.g. AdventureWorks)
  • Azure SQL target database. For this example let’s call this the raw data warehouse (rdw).
  • Azure Data Factory
  • Azure Key Vault (for storing secrets, for example the connection strings). You can skip the Key Vault for the getting started guide.

You can create these resources manually or by using infrastructure-as-code scripts (e.g. Bicep).

1.1 Deploy using Bicep code

Here you can find the Bicep code that I used to create the resources for this getting started guide: https://github.com/basvdberg/cicd_azure_devops_adf_and_sql.

* Bicep does not publish the BETL database (see the installation guide on how to publish it).

1.2 Manual deploy

You can easily create all resources using the Azure portal. When creating an Azure SQL database, there is a checkbox that lets you load it with sample data (use this for the source database).

1.3 Setting permissions

Log in to your Azure SQL server (e.g. sqls-betl-dev.database.windows.net) using SQL Server Management Studio (SSMS) and Azure AD – Universal with MFA authentication (this is required for setting permissions). Use the admin login that you specified in the Bicep parameters.

Grant the data factory access to the SQL databases using its managed identity. Because the USE statement is not allowed in Azure SQL Database, you will have to change the connection for each database and run the statements below.

-- database master
create user [<adf-name>] from external provider

-- database sqldb_betl
create user [<adf-name>] from external provider
ALTER ROLE db_owner ADD MEMBER [<adf-name>]

-- database sqldb_aw 
create user [<adf-name>] from external provider 
ALTER ROLE db_owner ADD MEMBER [<adf-name>]

-- database sqldb_rdw
create user [<adf-name>] from external provider 
ALTER ROLE db_owner ADD MEMBER [<adf-name>]
Don’t forget to replace <adf-name> with the name of your data factory.
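If you want to check that this worked, the following query (plain T-SQL, nothing BETL specific) lists the users created from an external provider in the current database:

-- list Azure AD based principals in the current database
select name, type_desc, create_date
from sys.database_principals
where type in ('E', 'X')   -- E = external user, X = external group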

That should be enough to get started with ingesting source data. (I use the term ingest because it is also used in Azure Data Factory (ADF); it means loading data into some target environment.)

2. Example architecture

The goal of this getting started guide is to generate the ETL and DDL for building the first two layers in a data warehouse, which I call staging and raw data warehouse (but feel free to choose your own names).

  • Note that this setup is not mandatory for using BETL. You can choose your own setup of layers. For example, a common setup is to not have an intermediate layer between the raw data warehouse and the data marts.

3. Ingest object tree of current database

First we need to extract the metadata that tells us which tables exist and in which server, database and schema they live. This is called the object tree.

Let’s have a look at the current metadata repository. The following command shows you the BETL metadata repository. Run it inside the betl database.

exec dbo.info
-- or if you prefer to type less:
info
It will return:
  • the objects (of different object types, e.g. servers, databases, schemas, tables, views and users). On a fresh install it will give you only 3 objects. The first is the LOCALHOST server object and the second is of type user; this object stores user-bound properties (e.g. log_level). The third is the default unknown record that is present in many tables (key = -1). Instead of a null foreign key, you can use -1 to reference unknown objects.
  • column metadata relating to objects (column name, data type, etc.). This is empty on a clean install.
  • properties that can be bound to objects. For example the include_staging property can be used to mark tables to be included in our staging layer. Properties can be set on any object in the object tree (e.g. at server, database, schema or table level) and are inherited. For example, if you set the include property to true for a schema, then it is set for all tables and views in that schema. You can still exclude a single table by setting its include property to false while the schema's include property is set to true.
(figure: the BETL metadata repository)
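If you prefer to query the repository directly instead of calling dbo.info, you can look at the obj_ext view that is also used later in this guide. On a fresh install it should return just the three objects mentioned above:

-- inspect the object tree directly (fresh install: LOCALHOST, the user object and the unknown record)
select obj_id, obj_name, obj_type
from obj_ext
order by obj_id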

If you want to go back to the initial state from right after BETL was installed, you can always run the following:

exec dbo.clear_meta_data

Let’s get back to our first step: ingesting the object tree. This can be done with the following code:

declare @obj_tree_param ObjTreeTableParam 

insert into @obj_tree_param 
select  *
from dbo.obj_tree_ms_sql_server

exec [dbo].[ingest_obj_tree] @obj_tree_param

-- see what happened
exec dbo.info
We only ingested the object tree of the current database. Next step: ingest the object tree of the source database.

4. Ingest the object tree of an external database

(figure: object tree)

Because Azure SQL Database does not support cross-database queries, the solution for Azure is a little different from the on-premises one. The main difference is that for Azure we have to use an ETL tool like Azure Data Factory, whereas on premises we can build the solution using only T-SQL and, in some cases, a linked server.
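To illustrate the limitation: a cross-database metadata query like the sketch below (the database name is just an example) runs fine on an on-premises SQL Server, but fails in Azure SQL Database, which is why ADF takes over the extraction there.

-- works on premise using three-part names, but not in Azure SQL Database
select s.name as schema_name, t.name as table_name
from AdventureWorksLT.sys.tables t        -- AdventureWorksLT is an illustrative database name
join AdventureWorksLT.sys.schemas s on s.schema_id = t.schema_id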

5. Staging

This pipeline is called pl_staging_aw and is stored here: https://github.com/basvdberg/BETL/tree/main/adf_betl_getting_started

This is the pipeline that imports the tables into our staging layer. There is actually not much code generation going on here; we simply combine the BETL metadata with the generic copy task of ADF.

  • Go to https://adf.azure.com/ and connect to your Azure Data Factory. If you used the Bicep code to deploy ADF, then it is linked to the repo on GitHub. (I use the develop branch for working on these demo pipelines; you should connect to the main branch to get the latest working version of the pipelines.)
  1. Ingest object tree. As discussed above, but in this case for the AdventureWorks database. Have a look at the metadata by executing the stored procedure info or by running select * from obj_ext. Different database vendors require adjustments to this query (e.g. Oracle, PostgreSQL, MySQL, DB2).
  2. Set exclude property. We manage the configuration of what needs to be done with properties, which are linked to the object tree. The entire database is included and two tables are excluded (just three simple statements). * Of course you should only have to do this once, but for demo purposes it is included in the pipeline.
  3. Lookup staging tables returns the list of tables that is picked up by the ForEach loop in the next step. It should contain all objects that we want in our staging layer (a sketch of this lookup query follows below).
  4. Inside the ForEachTable loop there are two steps:
    1. First the target table is dropped if it exists.
    2. Then the generic copy task first creates the table and then copies the data.

  5. Refresh staging metadata. There is one more thing to do after the tables are copied into staging: refresh the metadata of our staging database, so that we can use it in the next step.

This pipeline shows the generic power of ADF combined with good metadata.
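As promised, here is a sketch of what the Lookup staging tables query could look like. It is modelled after the RDW lookup query shown later in this guide; the database name and filters are assumptions rather than the actual pipeline code (the real pipeline also applies the include/exclude properties).

-- hypothetical sketch of the "Lookup staging tables" query (not the actual pipeline code)
select o.obj_id, o.schema_name, o.obj_name table_name, o.full_obj_name
from obj_ext o
where o.is_definition = 0
and o.obj_type = 'table'
and o.db_name = 'sqldb-aw'   -- assumption: source database name from the Bicep deployment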

6. Raw data warehouse (RDW)

The next layer I call the raw data warehouse. It is modelled exactly like the source system, but we add a few things:

  1. History. If something changes in the source, we keep both the current and the previous values, including the time period during which a record is valid.
  2. Deletes. If something is deleted in the source, we mark it as deleted instead of deleting it.
  3. Metadata. Every record links to the dbo.transfer table in BETL. This table holds metadata and statistics for the transfer. A transfer is part of a batch and logging is done per batch. This metadata can also be used for lineage.

Let’s have a look at the overall orchestration pipeline for our source system AdventureWorks, called pl_aw. This is sometimes called the master pipeline.

Steps:

  1. Refresh staging data as we have seen above.
  2. Refresh RDW data, divided into
    1. data definition language (ddl)
    2. ETL code for the actual data loading from staging into RDW

6.1 Data definition language

6.1.1 Historic tables

This step makes sure that the RDW tables are created, including extra columns for recording history and other metadata. Primary keys and identity columns are also created here. The DDL step can be skipped for the test, acceptance and production environments, because DDL changes are normally made via the release process. In our development environment we want DDL changes to be applied automatically, so we can review and test them before including them in a release.

Pipeline name: pl_rdw_aw_ddl_his_tables.
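One of the steps below, create_rdw_schema_if_not_exists, essentially boils down to an idempotent CREATE SCHEMA. A minimal T-SQL sketch (not the literal pipeline code):

-- create the rdw schema only when it does not exist yet
-- (create schema must be the only statement in its batch, hence sp_executesql)
if not exists (select 1 from sys.schemas where name = 'rdw')
	exec sp_executesql N'create schema rdw'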

  • set obj_tree_sql. Sets a variable to be used in step 3.
  • create_rdw_schema_if_not_exists. Performs a create schema statement if needed (see the sketch above).
  • refresh_obj_tree_staging_and_rdw. We have to refresh the object tree metadata of our source and target database. Our source was already refreshed in the previous pipeline, but we don’t want that as a dependency in this pipeline. The pipeline must be robust (e.g. it should also work when something in staging changed after the staging pipeline ran).
  • update_rdw_his_table_def. This is an interesting step. The object tree metadata represents observed objects in our databases. The historic tables do not exist yet, but we want to define their metadata so that we can use it for creating the objects. Because the metadata of a definition is almost identical to observed metadata, we use the same metadata tables (dbo.Obj and dbo.Col_h). Definitions have the is_definition boolean set to 1. If you want to see this metadata, run these queries after you have executed the pipeline (with happy green checkmarks):
select * from dbo.obj
order by obj_name

and

select * from dbo.col_ext
where schema_name = 'rdw' and column_type ='attribute'
order by column_name
  • Lookup_rdw_table_definitions. Loads the just-created definition metadata, so it can be used in the next ForEach loop.

  • Inside the ForEach loop we have the following two steps:
    • generate_create_table_ddl. This is where the template engine comes into play. Templates are written in the Handlebars language, which adds templating functionality to T-SQL. In our case the template looks as follows:
-- begin drop_and_create_table {{schema_name}}.{{obj_name}}[{{obj_id}}]
IF OBJECT_ID('{{schema_name}}.{{obj_name}}', 'U') IS NOT NULL 
	DROP TABLE {{schema_name}}.{{obj_name}} 

CREATE TABLE {{schema_name}}.{{obj_name}} (
	{{#each columns}}
		[{{column_name}}] {{data_type}}{{data_size}} {{is_nullable}} {{#if column_type_id in (200)}}IDENTITY{{/if}}{{default_value}}{{#unless @last}},{{/unless}}
	{{/each}}
	)


if len('{{#each columns}}{{#if primary_key_sorting}}*{{/if}}{{/each}}') > 0 
	exec sp_executesql N'
		IF NOT EXISTS (SELECT * FROM sys.indexes WHERE name = N''PK_{{schema_name}}_{{obj_name}}_{{obj_id}}'')
		ALTER TABLE {{schema_name}}.{{obj_name}} ADD CONSTRAINT
			PK_{{schema_name}}_{{obj_name}}_{{obj_id}} PRIMARY KEY CLUSTERED 
			(
			{{#each columns}}
				{{#if primary_key_sorting}}
					[{{column_name}}] {{primary_key_sorting}}
					{{#unless @last}},{{/unless}}
				{{/if}}
			{{/each}}
			) 
		'
-- end drop_and_create_table{{schema_name}}.{{obj_name}}[{{obj_id}}]

Handlebars is pretty simple. You can look up the syntax at handlebarsjs.com. {{<keyword>}} defines a Handlebars expression, where <keyword> can refer to a metadata item, for example schema_name or obj_name. The template is always parsed within the scope of exactly one object in the object tree. You can also refer to the previous and next objects (e.g. the previous of a latest view is the corresponding table). Functions are prefixed with #, for example conditional statements (#if) or iteration (#each). Note that the parser does not implement the complete Handlebars language. Try it for yourself:

exec dbo.parse_handlebars 'rdw.aw_SalesLT_Customer_his', 'drop_and_create_table'

This will give you:

-- begin drop_and_create_table rdw.aw_SalesLT_Customer_his[148]
IF OBJECT_ID('rdw.aw_SalesLT_Customer_his', 'U') IS NOT NULL 
	DROP TABLE rdw.aw_SalesLT_Customer_his 

CREATE TABLE rdw.aw_SalesLT_Customer_his (
	
		[aw_saleslt_customer_his_sid] int NOT NULL IDENTITY,
		[_eff_dt] datetime NOT NULL ,
		[_delete_dt] datetime NULL ,
		[CustomerID] int NOT NULL ,
		[NameStyle] bit NULL ,
		[Title] nvarchar(8) NULL ,
		[FirstName] nvarchar(50) NULL ,
		[MiddleName] nvarchar(50) NULL ,
		[LastName] nvarchar(50) NULL ,
		[Suffix] nvarchar(10) NULL ,
		[CompanyName] nvarchar(128) NULL ,
		[SalesPerson] nvarchar(256) NULL ,
		[EmailAddress] nvarchar(50) NOT NULL ,
		[Phone] nvarchar(25) NULL ,
		[PasswordHash] varchar(128) NULL ,
		[PasswordSalt] varchar(10) NULL ,
		[rowguid] uniqueidentifier NOT NULL ,
		[ModifiedDate] datetime NULL ,
		[_transfer_id] int NOT NULL ,
		[_record_dt] datetime NOT NULL  DEFAULT ((getdate())),
		[_record_user] sysname NOT NULL  DEFAULT ((suser_sname())))


if len('**') > 0 
	exec sp_executesql N'
		IF NOT EXISTS (SELECT * FROM sys.indexes WHERE name = N''PK_rdw_aw_SalesLT_Customer_his_148'')
		ALTER TABLE rdw.aw_SalesLT_Customer_his ADD CONSTRAINT
			PK_rdw_aw_SalesLT_Customer_his_148 PRIMARY KEY CLUSTERED 
			(
			
					[aw_saleslt_customer_his_sid] DESC
					,
					[_eff_dt] DESC
					
			) 
		'
-- end drop_and_create_tablerdw.aw_SalesLT_Customer_his[148]

  • exec_create_table_ddl. This simply executes the generated SQL.

6.1.2 DDL latest views

The historization technique, including latest views, is borrowed from anchor modelling. In short, you only add an effective datetime stamp and you only insert records; never delete or update anything. Most of the time you need the most recent version of a record, so for your convenience we create a latest view that always returns this current version. A descending primary key on the effective datetime ensures good performance.
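To make the difference concrete, this is how you would typically query the two objects once they exist. The column names come from the generated Customer table shown earlier; the customer id filter is just an illustration.

-- current version of each customer (latest view)
select CustomerID, EmailAddress, ModifiedDate
from rdw.aw_SalesLT_Customer

-- full change history of one customer, newest version first
select CustomerID, _eff_dt, _delete_dt, EmailAddress
from rdw.aw_SalesLT_Customer_his
where CustomerID = 1   -- illustrative customer id
order by _eff_dt desc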

The next pipeline is called pl_rdw_aw_ddl_latest_views. It has almost the same setup as the previous pipeline, except that this time we create views instead of tables. Have a look at the static.Template table to see the different templates:

-- begin {{template_name}} {{schema_name}}.{{obj_name}}[{{obj_id}}]  

DROP VIEW IF EXISTS {{schema_name}}.[{{obj_name}}]

exec sp_executesql N'
CREATE VIEW {{schema_name}}.[{{obj_name}}] AS 
SELECT 
{{#each columns}}
	[{{column_name}}]{{#unless @last}},{{/unless}}
{{/each}}
FROM {{src_schema_name}}.[{{src_obj_name}}] h
WHERE h._delete_dt is null  -- do not show deleted records in latest view
and h._eff_dt = ( 
		select max(_eff_dt) max_eff_dt
		FROM {{src_schema_name}}.[{{src_obj_name}}] latest
		where 
		{{#each columns}}
		{{#if column_type_id in (100)}}
			h.[{{column_name}}] = latest.[{{column_name}}]{{#unless @last}}AND{{/unless}}
		{{/if}}
		{{/each}}
	) 
'
-- end {{template_name}} {{schema_name}}.{{obj_name}}[{{obj_id}}]  

Note

DDL pipelines only need to run in our development environment. All other environments receive their DDL changes via the release mechanism. Of course, you need to do a schema compare with Visual Studio to get your DDL changes into your database project and under version control (git).

6.2 RDW ETL

The final step is to load the data from staging into the RDW layer. Once the DDL part is done, the loading is pretty straightforward:

  • Lookup rdw tables. This simply selects, from the BETL metadata, all RDW tables that need to be processed:
select o.obj_id, o.schema_name, o.obj_name table_name
,o.full_obj_name
from obj_ext o
where 
is_definition=0
and obj_type = 'table'
and db_name = 'sqldb-rdw'
and schema_name = 'rdw'
and obj_name like 'aw_%'
  • For each RDW table do:
    • generate_insert_rdw_sql. We generate a SQL statement that compares staging with RDW. The template is called rdw_insert and generates this SQL:
-- begin rdw_insert rdw.aw_SalesLT_Customer_his[148]  
-- exec [dbo].[parse_handlebars] 148, ''rdw_insert''
-- obj_id is historic rdw table ( we need access to src and trg)
-- E.g. staging.Customer_exp (src) -> rdw.Customer_h (obj_id) -> rdw.Customer ( trg)

declare @now as datetime = getdate() 


INSERT INTO rdw.[aw_SalesLT_Customer_his]( 

	[CustomerID],
	[NameStyle],
	[Title],
	[FirstName],
	[MiddleName],
	[LastName],
	[Suffix],
	[CompanyName],
	[SalesPerson],
	[EmailAddress],
	[Phone],
	[PasswordHash],
	[PasswordSalt],
	[rowguid],
	[ModifiedDate], _delete_dt
, _eff_dt
, _transfer_id
)
select q.*, @now _eff_dt , <<_transfer_id>> _transfer_id -- please replace transfer_id runtime 
FROM ( 
	select changed_and_new.*, null delete_dt
	FROM (
		select
			[CustomerID],
			[NameStyle],
			[Title],
			[FirstName],
			[MiddleName],
			[LastName],
			[Suffix],
			[CompanyName],
			[SalesPerson],
			[EmailAddress],
			[Phone],
			[PasswordHash],
			[PasswordSalt],
			[rowguid],
			[ModifiedDate] 
		FROM staging_aw.[SalesLT_Customer]
		EXCEPT 
		SELECT  
		
			[CustomerID],
			[NameStyle],
			[Title],
			[FirstName],
			[MiddleName],
			[LastName],
			[Suffix],
			[CompanyName],
			[SalesPerson],
			[EmailAddress],
			[Phone],
			[PasswordHash],
			[PasswordSalt],
			[rowguid],
			[ModifiedDate] 
		FROM rdw.[aw_SalesLT_Customer]
	) changed_and_new
	union all 

	select deleted.*, @now delete_dt
	FROM (
		select 
		
			[CustomerID],
			[NameStyle],
			[Title],
			[FirstName],
			[MiddleName],
			[LastName],
			[Suffix],
			[CompanyName],
			[SalesPerson],
			[EmailAddress],
			[Phone],
			[PasswordHash],
			[PasswordSalt],
			[rowguid],
			[ModifiedDate] 
		FROM rdw.[aw_SalesLT_Customer] trg
		where not exists ( 
			select 1 
			FROM staging_aw.[SalesLT_Customer] src
			where 
				src.[CustomerID]= trg.[CustomerID] AND
				src.[EmailAddress]= trg.[EmailAddress] AND
				src.[rowguid]= trg.[rowguid]  
		) 
	) deleted
) q 

select @@rowcount rec_cnt_new
-- end rdw_insert rdw.aw_SalesLT_Customer_his[148]  

The next step is exec_insert_rdw, which executes the generated SQL. Because we generated an INSERT INTO statement we cannot use the copy task; we just need to execute the SQL statement. However, there is currently no task in ADF to simply execute a SQL statement, so we abuse the Lookup component for this. The only requirement is that the query returns a result, so we return the row count. Note that the generated SQL code still contains the <<_transfer_id>> placeholder, which you replace in ADF with the current transfer id.
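In plain T-SQL, the replace-and-execute step boils down to something like the sketch below. In the real pipeline this happens with ADF expressions; the variable names and the transfer id are illustrative.

-- stand-in for the output of generate_insert_rdw_sql
declare @generated_sql nvarchar(max) = N'select <<_transfer_id>> as _transfer_id'
-- illustrative: the current transfer id provided by BETL
declare @transfer_id int = 123

-- replace the placeholder and execute the statement
set @generated_sql = replace(@generated_sql, '<<_transfer_id>>', cast(@transfer_id as nvarchar(20)))
exec sp_executesql @generated_sql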

6.3 Logging

BETL has extensive logging capabilities.

Everything is related to a batch, which in turn can be related to a parent batch. This way we can zoom in on what happened inside our stored procedures. A setting called log_level controls the amount of logging. It can be set to one of the following values:

  • error
  • warn
  • info (default)
  • debug
  • verbose

When troubleshooting you can increase the logging by setting this user-bound property (using setp, or simply by executing dbo.verbose or dbo.debug). dbo.reset resets the logging to its default value, info.

For every ETL process we store the transfer details in dbo.transfer. This includes the record counts.
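For example, when troubleshooting a load you can temporarily raise the log level and afterwards inspect the transfer statistics. dbo.verbose, dbo.reset and dbo.transfer are the BETL objects mentioned above; run this inside the betl database.

-- raise the log level for your user while troubleshooting
exec dbo.verbose

-- ... run the pipeline or stored procedures you want to inspect ...

-- back to the default log level (info)
exec dbo.reset

-- look at the transfer details and record counts
select * from dbo.transfer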

Using this logging info you can build a pretty good monitoring dashboard to show:

  • the status of the daily run: are there any errors? You can detect them using the log_type, which can be one of the following values (you might want to make errors red):

    • header
    • footer
    • sql
    • var
    • error
    • warn
    • step
    • progress

  • You can also show trends in the number of records processed for each step. Big differences in trends should also be investigated.

6.4 Naming convention

History tables are suffixed, for example with _his. I’ve created the following naming convention (feel free to create your own). Objectives: group related tables (coming from the same source system) together by name, group layers by name, and use latest views by default.

Object names follow the pattern <schema_name>.<obj_name>, always in lower case, with db_name = sqldb_rdw:

  • staging table: staging_<source_system_abbreviation>.<source_system_schema_if_not_dbo>_<object_name>
    example: staging_aw.saleslt_customer
  • rdw history table: rdw.<source_system_abbreviation>_<source_system_schema_if_not_dbo>_<object_name>_his
    example: rdw.aw_sales_lt_customer_his
  • rdw latest view: rdw.<source_system_abbreviation>_<source_system_schema_if_not_dbo>_<object_name>
    example: rdw.aw_sales_lt_customer