Sunday, June 29, 2008

Datastage tutorial and training

Datastage guide Table of contents:

# Lesson 1. Datastage-modules - the lesson contains an overview of the datastage components and modules with screenshots.
# Lesson 2. Designing jobs - datastage palette - a list of all stages and activities used in Datastage
# Lesson 3. Extracting and loading data - ODBC and ORACLE stages - description and use of the ODBC and ORACLE stages (ORAOCI9) used for data extraction and data load. Covers ODBC input and output links, Oracle update actions and best practices.
# Lesson 4. Extracting and loading data - sequential files - description and use of the sequential files (flat files, text files, CSV files) in datastage.
# Lesson 5. Transforming and filtering data - use of transformers to perform data conversions, mappings, validations and data refining. Design examples of the most commonly used datastage jobs.
# Lesson 6. Performing lookups in Datastage - how to use hash files and database stages as a lookup source.
# Lesson 7. Implementing ETL process in Datastage - step by step guide on how to implement the ETL process efficiently in Datastage. Contains tips on how to design and run a set of jobs executed on a daily basis.
# Lesson 8. SCD implementation in Datastage - the lesson illustrates how to implement SCDs (Slowly Changing Dimensions) in Datastage; contains job designs, screenshots and sample data. All the Slowly Changing Dimensions types are described in separate articles below:
# SCD Type 1
# SCD Type 2
# SCD Type 3 and 4
# Lesson 9. Datastage jobs real-life solutions - a set of examples of job designs resolving real-life problems implemented in production data warehouse environments in various companies.
# Lesson 10. Header and trailer file processing - a sample Datastage job which processes a text file organized in a header and trailer format. This format is used, for example, by mainframe, EDI or EPIC files.

Lesson 1:
Datastage modules

The DataStage components:

* Administrator - administers DataStage projects, manages global settings and interacts with the system. Administrator is used to specify general server defaults, add and delete projects, set up project properties, and it provides a command interface to the datastage repository.
With Datastage Administrator users can set job monitoring limits, user privileges, job scheduling options and parallel job defaults.
* Manager - the main interface to the Datastage Repository, which allows browsing and editing its contents. It displays table and file layouts, routines, transforms and jobs defined in the project. It is mainly used to store and manage reusable metadata.
* Designer - used to create DataStage jobs which are compiled into executable programs. It is a graphical, user-friendly application which applies a visual data flow method to develop job flows for extracting, cleansing, transforming, integrating and loading data. It's the module mainly used by Datastage developers.
* Director - manages the running, validating, scheduling and monitoring of DataStage jobs. It's mainly used by operators and testers.

Lesson 3:
Designing jobs - ODBC and ORACLE stages


ODBC stages are used to allow Datastage to connect to any data source that supports the Open Database Connectivity (ODBC) standard.
ODBC stages are mainly used to extract or load data. However, an ODBC stage may also be very helpful when aggregating data and as a lookup stage (in that case it can play the role of an aggregator stage or a hash file and can be used instead).
Each ODBC stage can have any number of inputs or outputs.
The input links specify the data which is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in various ways: using an SQL statement constructed by DataStage, a user-defined SQL query or a stored procedure.
Output links specify the data that is extracted (they correspond to SQL SELECT statements). The data on an output link is passed through the ODBC connector and processed by the underlying database.
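
For illustration, the SQL behind ODBC links might look as follows (the CUSTOMERS table and its columns are hypothetical, used here only as an example):

    -- Input link (write): DataStage binds the link columns to parameters,
    -- e.g. for the 'Insert rows without clearing' action:
    INSERT INTO CUSTOMERS (CUST_ID, CUST_NAME, CITY)
    VALUES (?, ?, ?);

    -- Output link (read): corresponds to a SELECT statement,
    -- which can also be replaced by a user-defined query:
    SELECT CUST_ID, CUST_NAME, CITY
    FROM CUSTOMERS
    WHERE STATUS = 'ACTIVE';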

If the processing target is an Oracle database, it may be worth considering use of the ORACLE (ORAOCI9) stage. It has significantly better performance than the ODBC stage and allows setting more configuration options and parameters native to the Oracle database.
There's a very useful option to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a big chunk of data into an oracle table, it may increase performance to drop indexes in the 'before SQL' tab and to recreate the indexes and analyze the table in the 'after SQL' tab (with an 'ANALYZE TABLE xxx COMPUTE STATISTICS' SQL statement).
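
A sketch of such before/after SQL for a bulk load (the index and table names are hypothetical):

    -- 'Before SQL' tab: drop the index to speed up the bulk insert
    DROP INDEX idx_customers_name;

    -- 'After SQL' tab: rebuild the index and refresh optimizer statistics
    CREATE INDEX idx_customers_name ON CUSTOMERS (CUST_NAME);
    ANALYZE TABLE CUSTOMERS COMPUTE STATISTICS;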

Update actions in Oracle stage
The destination table can be updated using various update actions in the Oracle stage. Be aware that it's crucial to select the key columns properly, as they determine which columns will appear in the WHERE clause of the generated SQL statement. Update actions available from the drop-down list (a sketch of the SQL generated by one of them follows the list):

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.
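
For illustration, with a hypothetical CUSTOMERS table whose key column is CUST_ID, the 'Update existing rows or insert new rows' action issues SQL roughly equivalent to:

    -- the key column selected in the stage drives the WHERE clause
    UPDATE CUSTOMERS
    SET CUST_NAME = ?, CITY = ?
    WHERE CUST_ID = ?;

    -- issued only when the UPDATE did not affect any rows:
    INSERT INTO CUSTOMERS (CUST_ID, CUST_NAME, CITY)
    VALUES (?, ?, ?);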

Lesson 4:
Designing jobs - sequential (text) files


Sequential File stages are used to interact with text files, which may involve extracting data from or writing data to a text file. Sequential File stages can read files and create (overwrite) or append data to a text file. The file can reside on any drive that is local or mapped as a shared folder, or even on an FTP server (combined with an FTP stage). Each Sequential File stage can have any number of inputs or outputs. However, trying to write to a sequential file simultaneously from two data streams will surely cause an error.
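
A hypothetical example of a comma-delimited file with a header row that a Sequential File stage might read (the layout and values are invented for illustration):

    CUST_ID,CUST_NAME,CITY
    1001,John Smith,London
    1002,Anna Kowalska,Warsaw

In the stage properties, settings such as the delimiter, the quote character and the 'first line is column names' option have to match the actual file layout.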

Lesson 5:
Designing jobs - transforming and filtering data


It's a very common situation and a good practice to design datastage jobs in which the data flow goes in the following way:
EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a transformer stage. A transformer stage doesn't extract or write data to a target database. It handles the extracted data, performs conversions, mappings and validations, passes values and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups); there can be only one primary input and any number of reference inputs. A sketch of typical transformer logic follows.
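
The kind of conversions, validations and filtering a transformer performs can be expressed, purely for illustration, as equivalent SQL (the staging table and its columns are hypothetical):

    -- equivalent of typical transformer derivations and a constraint:
    -- trim names, substitute missing cities, reject rows without a key
    SELECT CUST_ID,
           TRIM(CUST_NAME) AS CUST_NAME,
           NVL(CITY, 'UNKNOWN') AS CITY
    FROM STG_CUSTOMERS
    WHERE CUST_ID IS NOT NULL;

In a real job the same logic would be written as derivation expressions and a link constraint directly in the transformer's editor.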
