Data warehouse automation

  • Why is DWA framework x better than DWA framework y ?
  • How do you compare DWA frameworks ?

To answer these questions we first have to define what a DWA framework exactly is. This article will lay a foundation to compare DWA frameworks.

A DWA framework or DWA software generates the code for building your data warehouse that would otherwise have to be written by hand and/or by using a visual designer.

Code can be divided in :

  • ETL code. Code that read data (extract) and writes it to a destination (load). during this copy the data can also be transformed.
  • DDL. Data definition language. Code that defines the structure of the data. For example in databases, schemas, tables, columns, datatypes, constraints (e.g. primary keys) and associations ( e.g. foreign keys).
  • Orchestration. Code that defines a batch process in which certain activities are executed in parallel or sequential. For example: the loading of a staging layer in the data warehouse.

A Code generator needs at least two thing to generate code:

  • Code templates
  • Meta data

In my experience the differences of DWA frameworks consists of differences in how you store and maintain the meta data and what kind of templates you use. Before we dive into that, first I want to discuss design time code generation.

Design time versus runtime code generation

Design time code generation means that ETL, DDL and Orchestration is generated during development. This means that when changes are made in meta data or in templates, you need to build and generate the code. Depending on how your framework is setup you might need to generate only the changed entities or the entire project. The generation might need a lot of manual intervention ( e.g. running scripts in visual studio) , or can be completely automated.

The advantage of runtime code generation is that meta data changes and template changes are effective immediately. So it saves time during development. It’s also possible to define dynamic ETL in your production environment ( make your ETL flexible for DDL changes for example). For example you might want to allow columns to be added, deleted or changed in your staging layer of your dwh.

runtime time code generation
design time code generation








Code templates

Your meta data should represent what needs to be done. Your templates should represent how this should be done using the technology that you have. For example: the meta data contains all tables that we need to copy into our staging layer. We need to be able to specify which tables and columns we want and possibly exclude (large) columns that are not needed. The following templates are mostly used in the staging layer.

  • truncate insert full. Truncate target and insert the complete table.
  • truncate insert delta. Truncate target and insert all records that have changed since the last time we looked. (e.g. by using a last modified timestamp in your source system).

Some frameworks mix ETL, DDL and Orchestration. For example by including meta data or orchestration in the template. The downside of this is that your templates become less generic, contains more redundancy and are more difficult to reuse. * it can also be the choice of the developer to make this mistake, so it’s not the fault of the framework. However a good framework helps in seperating etl, ddl and orchestration.

strict seperation of ETL, DDL
and Orchestration templates
templates that contain both ETL, DDL
and orchestration










Another downside is that your templates become more difficult to read and to maintain. It should be easy to read a template, so that you or your successor can understand its meaning.

easy readable templates
difficult readable templates












Using short generic templates not only improves readability, but also increase reuse and improve overall quality of your code. For example: truncate insert full is better than truncate insert source system x, because the first template can be used in many situations. When you want to change the template ( for example: by improving the logging) you only need to change 1 template and it’s effective everywhere.

generic and modular
templates
too specific templates








Meta data

Object tree

The meta data should describe what needs to be done. We need to describe the data objects that we want to have in our data warehouse. Data objects are usually hierarchically organized, e.g. server->database->schema->table->column or file share->folder->filename->column. we need to describe the source objects. For example in our source systems, or in the source layer that we are reading from and the target objects, for example in the staging layer.

Mapping

The mapping describes that data object x needs to be copied to data object y. For every column we need to known how it’s mapped to a target column.

Business rules

In our mapping we might want to apply some transformations. These transformations also need to be described in a generic reusable way and not be mixed with the template code. For example, when a source system uses an integer for date representation and we want to convert it to the datetime data type we can define a business rule that tells us what to do when the value is for example zero. I encountered this in a source system where dates occured like e.g. 2001-03-00. This meant that they did not know the day, so the day should be set to the first of the month or else we cannot use our date time functions. Using business rules we can describe this logic once and apply it everywhere where it’s needed.

Properties

A DWA framework needs configuration. For example we might want to say that the DDL in the development environment can include drop statements, so that changes in the object tree are automatically implemented in the physical tables. Other examples are: specify which tables to include in a source database ( by specifying the entire database and then exclude some tables) or specify which tables are copied using a truncate insert and which tables are copied using delta insert.

useful functionality of properties:

  • Link to object tree
  • inheritance ( specify a property on a high level and use it for all children and descendants)
  • Specify the default by inheritance and specify exceptions ( e.g. I want all tables in all schemas in database x, but not schema y and not table z )