Sunday, June 29, 2008

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:

* extracting data from source operational or archive systems which are the primary source of data for the data warehouse
* transforming the data - which may involve cleaning, filtering, validating and applying business rules
* loading the data into a data warehouse or any other database or application that houses data
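
As a rough illustration of these three steps, here is a minimal Python sketch of an ETL job. The file name, table name, column names and the validation rule are made-up examples for illustration, not a reference to any particular tool:

import csv
import sqlite3

# Minimal ETL sketch: extract from a CSV file, apply simple business
# rules, load into SQLite. All names and rules are made-up examples.

def extract(path):
    # extraction: read raw rows from the source extract
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # transformation: cleanse, filter and validate the records
    cleaned = []
    for row in rows:
        if not row.get("CUST_NO"):
            continue  # validation: reject records without a business key
        row["NAME"] = row["NAME"].strip().upper()  # simple cleansing rule
        cleaned.append(row)
    return cleaned

def load(rows, conn):
    # loading: insert the transformed rows into the target table
    conn.executemany(
        "INSERT INTO STG_CUSTOMER (CUST_NO, NAME) VALUES (:CUST_NO, :NAME)",
        rows)
    conn.commit()

conn = sqlite3.connect("dwh.db")
conn.execute("CREATE TABLE IF NOT EXISTS STG_CUSTOMER (CUST_NO TEXT, NAME TEXT)")
load(transform(extract("customers_extract.csv")), conn)

In a real environment the transformations would typically be handled by a dedicated ETL tool, but the extract-transform-load split stays the same.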

The ETL process is also very often referred to as a Data Integration process, and an ETL tool as a Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.

ETL Tools

At present the most popular and widely used ETL tools and applications on the market are:
# IBM WebSphere DataStage (formerly Ascential DataStage and Ardent DataStage)
# Informatica PowerCenter
# Oracle Warehouse Builder
# Ab Initio
# Pentaho Data Integration - Kettle Project (open source ETL)
# SAS ETL studio
# Cognos Decisionstream
# Business Objects Data Integrator (BODI)
# Microsoft SQL Server Integration Services (SSIS)

ETL process references

Sample business scenarios with real-life ETL process implementations, each of which describes a typical data warehousing problem, are provided in the ETL training section.

Detailed step-by-step instructions on how the ETL process can be implemented in IBM WebSphere DataStage can be found in the Implementing ETL in DataStage tutorial lesson.

The ETL process can also be successfully implemented using an open source ETL tool - Kettle. There is a separate Pentaho Data Integration section on our pages: Kettle ETL transformations.

ETL and Data Warehouse training with examples


The ETL and Data Warehousing examples section is structured as a number of scenarios, each of which describes a typical business intelligence and data warehousing problem.
You may consider this module as an ETL and Data Warehouse cookbook with a series of recipes which show how to manage and implement the ETL process in a data warehouse environment.

Our intention is to show and analyze the most common real-life problems you may encounter when designing ETL processes to feed data warehouses in various organizations, using a wide range of ETL tools. We discuss and propose an optimal solution for each of the scenarios and show how to organize the data flow process using various ETL applications.

Going through the sample implementations of the business scenarios is also a good way to compare ETL tools and see the different approaches to designing the data integration process. It also helps identify the strong and weak points of the various ETL tools.

ETL business scenarios

We present a number of sample business scenarios which are very often encountered across organizations and need to be addressed. The ETL data integration process plays the major role in handling those problems.
Sample ETL process implementations:
# Surrogate key generation example which includes information on business keys and surrogate keys and shows how to design an ETL process to manage surrogate keys in a data warehouse environment. Sample design in Pentaho Data Integration
# Header and trailer processing - considerations on processing files arranged in blocks consisting of a header record, body items and a trailer. Files of this type usually come from mainframes; the pattern also applies to EDI and EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration
# Loading customers - the customers data extract is placed on an FTP server. It needs to be copied to an ETL server and loaded into the data warehouse. The transformation applies certain business rules and implements SCD to deal with the dimension changes. Sample implementation in Teradata MultiLoad
# Site traffic Data Warehouse - a guide for creation of a data warehouse and data marts for website traffic analysis and reporting. Sample design in Pentaho Kettle
# Data Quality Tests - ETL process design aimed at testing and cleansing data in a Data Warehouse. Sample outline in PDI
# XML ETL - ETL processes to handle XML files
# Data allocation - an ETL process case study for allocating data. Sample Cognos implementation

Processing a header and trailer text file


Goal

Process a text file which contains records arranged in blocks consisting of a header record, details (items, body) and a trailer. The aim is to normalize records and load them into a relational database structure.

Scenario overview and details

Typically, header and item processing needs to be implemented when processing files that originate from mainframe systems, as well as EDI transmission files, SWIFT schemas and EPIC files.

The input file in our scenario is a stream of records representing invoices. Each invoice consists of:
- A header containing invoice number, dates, customer reference and other details
- One or more items representing ordered products, including an item number, quantity, price and value
- A trailer which contains summary information for all the items

The records are distinguished by the first character in each line: H stands for headers, I for items and T for trailers.
Lines within each section of the input file are fixed length (so, for example, all headers have the same number of characters, but header length differs from item length).


Input data:
# Text file in a header-trailer format. The file presented in our example is divided into headers, items and trailers.
- A header starts with the letter H and contains an invoice number, a customer number, a date and the invoice currency
- Every item starts with the letter I, followed by a product ID, product age, quantity and net value
- The trailer starts with a T and contains two values which act as a checksum: the total number of invoice lines and the total net value.

Output data:
# INVC_HEADER and INVC_LINE relational tables: one with invoice headers and the other with the invoice lines
# rejects_080629.txt - a text file with the records rejected due to loading errors
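
Below is a minimal Python sketch of the normalization described above. The record-type dispatch on the first character follows the scenario, but the fixed-width field positions and the checksum tolerance are assumptions made for illustration, since the exact column layout is not specified here:

# Sketch of normalizing a header/items/trailer invoice file.
# Field positions are assumed for illustration only.

def process_invoice_file(path):
    headers, lines, rejects = [], [], []
    invc_no, item_count, net_total = None, 0, 0.0

    with open(path) as f:
        for raw in f:
            rec = raw.rstrip("\n")
            if rec.startswith("H"):
                # header: invoice number, customer number, date, currency
                invc_no = rec[1:9]
                headers.append({"INVC_NO": invc_no, "CUST_NO": rec[9:17],
                                "INVC_DATE": rec[17:25], "CURRENCY": rec[25:28]})
                item_count, net_total = 0, 0.0
            elif rec.startswith("I"):
                # item: product ID, product age, quantity, net value
                value = float(rec[17:27])
                lines.append({"INVC_NO": invc_no, "PROD_ID": rec[1:9],
                              "PROD_AGE": rec[9:11].strip(),
                              "QTY": int(rec[11:17]), "NET_VALUE": value})
                item_count += 1
                net_total += value
            elif rec.startswith("T"):
                # trailer checksum: compare the line count and total net value
                if item_count != int(rec[1:7]) or abs(net_total - float(rec[7:17])) > 0.01:
                    rejects.append(rec)  # checksum mismatch -> rejects file
            else:
                rejects.append(rec)  # unknown record type
    return headers, lines, rejects

The headers and lines lists correspond to the INVC_HEADER and INVC_LINE tables, and the rejects list would be written out to the rejects_080629.txt file.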

Generate surrogate key

Goal

Fill in a data warehouse dimension table with data which comes from different source systems and assign a unique record identifier (surrogate key) to each record.
Scenario overview and details

To illustrate this example, we will use two made-up sources of information providing data for the customers dimension. Each extract contains customer records with a business key (natural key) assigned to them.

In order to isolate the data warehouse from source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key.
A surrogate key is a unique, one-field numeric key which is shorter, easier to maintain and understand, and more resilient to source system changes than a business key. Also, if the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major effort.

The surrogate key generation mechanism may vary depending on the requirements; however, the inputs and outputs usually fit into the design shown below:
Inputs:
- an input represented by an extract from the source system
- a data warehouse reference table for identifying the existing records
- maximum key lookup

Outputs:
- output table or file with newly assigned surrogate keys
- new maximum key
- updated reference table with new records
Proposed solution

Assumptions:
- The surrogate key field for our made-up example is WH_CUST_NO.
- To make the example clearer, we will use SCD 1 to handle changing dimensions. This means that new records overwrite the existing data.
The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains mappings between natural keys and surrogate keys
- MAX_KEY - a sequence number which represents the last assigned key

Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY sequence increased


The design of an ETL process for generating surrogate keys will be as follows:
# The loading process will be executed twice - once for each of the input files
# Check if the lookup reference data is correct and available:
- CUST_REF table
- max_key sequence
# Read the extract and first check whether a record already exists. If it does, assign the existing surrogate key to it and update the descriptive data in the main dimension table.
# If it is a new record, then:
- generate a new surrogate key and assign it to the record. The new key is created by incrementing the old maximum key by 1.
- insert a new record into the customer dimension table
- insert a new record into the mapping table (which stores the business key to surrogate key mappings)
- update the new maximum key
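
A minimal Python sketch of the steps above follows. It keeps the CUST_REF mapping and the MAX_KEY sequence in plain in-memory structures purely for illustration; in a real warehouse these would be database tables, and the extract field names are assumptions:

# Sketch of the surrogate key assignment steps above (SCD type 1).
# cust_ref stands in for the CUST_REF mapping table and max_key for
# the MAX_KEY sequence; both are plain Python objects here.

def assign_surrogate_keys(extract_rows, cust_ref, max_key, d_customer):
    # process one source extract and return the new maximum key
    for row in extract_rows:
        bkey = row["CUST_NO"]  # natural (business) key from the source
        if bkey in cust_ref:
            # existing record: reuse its surrogate key and overwrite
            # the descriptive attributes (SCD type 1)
            wh_key = cust_ref[bkey]
        else:
            # new record: increment the old maximum key by 1
            max_key += 1
            wh_key = max_key
            cust_ref[bkey] = wh_key  # add the new mapping
        d_customer[wh_key] = dict(row, WH_CUST_NO=wh_key)
    return max_key

# Run once per input file, carrying the maximum key forward.
cust_ref, d_customer, max_key = {}, {}, 0
extract1 = [{"CUST_NO": "A100", "NAME": "ACME"}]
extract2 = [{"CUST_NO": "A100", "NAME": "ACME CORP"},  # update, same key
            {"CUST_NO": "B200", "NAME": "BETA"}]       # gets a new key
max_key = assign_surrogate_keys(extract1, cust_ref, max_key, d_customer)
max_key = assign_surrogate_keys(extract2, cust_ref, max_key, d_customer)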
