Sunday, June 29, 2008

Study Guide for DataStage Certification

This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification exam.

Regular readers may be feeling a sense of déjà vu. Haven't we seen this post before? I originally posted this in 2006 and this is the Director's Cut - I've added some deleted scenes, a commentary for DataStage 7.0 and 8.0 users and generally improved the entry. By reposting I retain the links from other sites, such as my DataStage Certification Squidoo lens, which point to my certification blog entries and the IBM certification pages.

This post lists each of the headings from the IBM exam objectives and describes how to prepare for each section.

Before you start, work out how to add environment variables to a job as job parameters, as they are handy for the exercises and testing throughout this guide. See the DataStage Designer Guide for details.

Section 1 - Installation and Configuration

Versions: Version 7.5.1 and 7.5.2 are the best to study and run exercises on. Version 6.x is risky but mostly the same as 7. Version 8.0 is no good for any type of installation and configuration preparation as it has a new approach to installation and user security.

Reading: Read the Installation and Upgrade Guide for DataStage, especially the section on parallel installation. Read the pre-requisites for each type of install, such as users and groups, the compiler, project locations and kernel settings for each platform. Make sure you know what goes into the dsenv file. Read the section on DataStage for USS as you might get one USS question. Do a search for threads on dsenv on the dsxchange forum to become familiar with how this file is used in different production environments.
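To make the dsenv reading concrete: dsenv is just a shell script of environment settings that the engine sources at startup, typically holding library paths for database clients. A minimal fragment might look like this (the paths are illustrative only, not a recommendation):

    # $DSHOME/dsenv - sourced by the DataStage engine at startup
    ORACLE_HOME=/opt/oracle/product/10.2; export ORACLE_HOME
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib; export LD_LIBRARY_PATH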

Exercise: installing your own DataStage Server Enterprise Edition is the best preparation - getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the DataStage Administrator, create some users and roles and give them access to DataStage functions.

Section 4 - Parallel Architecture (10%)

Section 9 - Monitoring and Troubleshooting (10%)

I've moved sections 4 and 9 up to the front as you need to study them before you run exercises and read about parallel stages in the other sections. Understanding how to use and monitor parallel jobs is worth a whopping 20%, so it's a good area to know well.

Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best definition of the parallel architecture with better diagrams.

Reading: read the opening chapters of the Parallel Job Developer's Guide on what the parallel engine and job partitioning are all about. Read about each partitioning type. Read how sequential file stages partition or repartition data and why datasets don't. The Parallel Job Advanced Developer's Guide has sections on environment variables that help with job monitoring - read about every variable with the word SCORE in it. The DataStage Director Guide describes how to monitor running jobs - use the right mouse click menu on the job monitor window to see the extra parallel information.

Exercises: turn on various monitoring environment variables such as APT_PM_SHOW_PIDS and APT_DUMP_SCORE so you can see what happens during your exercises. They show you what really runs in a job - the extra processes that get added across parallel nodes.
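If you want these on for a whole session rather than one job, export them before starting your exercises or add them through the Administrator; the usual switch-on values are shown below (any non-empty value generally works - an assumption worth verifying against the Advanced Developer's Guide):

    export APT_DUMP_SCORE=True     # writes the job score (operators, nodes, datasets) to the log
    export APT_PM_SHOW_PIDS=True   # reports the process IDs of the player processes in the log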

Try creating one node, two node and four node config files and see how jobs behave under each one. Try the remaining exercises on a couple of different configurations by adding the configuration file environment variable to the job as a parameter. Try some pooling options. I have to admit I guessed my way through some of the pooling questions as I didn't do many exercises.
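For reference, a configuration file is plain text with one block per logical node; a two node file looks something like this (the hostname and directories are placeholders you would swap for your own). Point the $APT_CONFIG_FILE job parameter at each file in turn to switch configurations:

    {
        node "node1"
        {
            fastname "devserver"
            pools ""
            resource disk "/data/ds/resource" {pools ""}
            resource scratchdisk "/data/ds/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "devserver"
            pools ""
            resource disk "/data/ds/resource" {pools ""}
            resource scratchdisk "/data/ds/scratch" {pools ""}
        }
    }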

Generate a set of rows into a sequential file for testing the various partitioning types: one column with unique ids 1 to 100 and a second column with repeating codes such as A, A, A, A, A, B, B, B, B, B and so on. Write a job that reads the input, sends it through a partitioning stage such as a Transformer and writes it to a Peek stage. The Director log shows which rows went where. Also view the Director monitor and expand the row counts on each instance of each stage to see how stages are split and run on each node and how many rows each instance gets.

Use a filter stage to split the rows down two paths and bring them back together with a funnel stage, then replace the funnel with a collector stage. Compare the two.

Test yourself on estimating how many processes a job will create, then check the result after the job has run using the Director monitor or the log messages. Make a habit of doing this throughout your exercises across all sections.
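As a rough rule of thumb for the estimate (assuming no operator combination and one player per operator per node): a job with three operators on a four node configuration gives 1 conductor + 4 section leaders + (3 x 4) players = 17 processes. Operator combination and stages forced to run sequentially pull that number down, which is exactly what the score shows you.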

Section 2 - Metadata

Section 3 - Persistent Storage

I've merged these into one as both sections cover sequential files, datasets, XML files and COBOL files.

Versions: you can study this on DataStage 6, 7 or 8 as it is a narrow focus on DataStage Designer metadata. DataStage 8 will have slightly different CFF options but not enough to cause a problem.

Reading: read the section in the DataStage Developer's Guide on Orchestrate schemas and partial schemas. Read the plugin guide for the Complex Flat File stage to understand how COBOL metadata is imported (if you don't have any COBOL copybooks around you will just have to read about them rather than doing exercises). Quickly scan through the NLS guide - but don't expect any hard questions on it.
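If you have not seen one before, an Orchestrate schema is a plain text record definition; a partial schema is the same idea with only the columns you care about defined and the rest carried through intact. A simple schema file looks something like this (field names invented):

    record
    (
        id: int32;
        code: string[1];
        descr: nullable string[max=50];
    )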

Exercises: step through the IBM XML tutorial to learn the tricky parts of reading XML files. Find an XML file and run various exercises reading it and writing it to a sequential file. Switch between different key fields to see the impact of the key on the flattening of the XML hierarchy. Don't worry too much about XML Transform.

Import a table definition from a database using both the Orchestrate import and the plugin import and compare the resulting table definitions. Run an exercise on column propagation using the Peek stage, where a partial schema is written to the Peek stage to reveal the propagated columns.

Create a job using the Row Generator stage. Define some columns, then on the Columns tab double-click a column to bring up the advanced column properties. Use some of the properties to generate values for different data types. Get to know the advanced properties page.

Create a really large sequential file, dataset and fileset and use each as a reference in a lookup stage. Monitor the Resource and Scratch directories as the job runs to see how these lookup sources are prepared prior to and during a job run. Get to know the difference between the lookup fileset and other sources for a lookup stage.

Section 5 - Databases (15%)

One of the more difficult topics if you get questions about a database you are not familiar with. I got one database parallel connectivity question that I still can't find the answer to in any of the manuals.

Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not have enough database stages and DataStage 8 has a new approach to database connections.

Reading: read the plugin guide for each enterprise database stage: Oracle, SQL Server, DB2 and ODBC. In version 8 read the improved Connectivity Guides for these targets. If you have time you can dig deeper: the Parallel Job Developer's Guide and/or the Advanced Developer's Guide has a section on the Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface libraries. Look for the section called "Operator action" and read it for each stage. It has interesting bits like whether the stage can run in parallel and how it converts data and handles record sizes.

Exercise: add each enterprise database stage to a parallel job as both an input and an output stage, then work through all the different read and write options. You don't need a working connection or access to that database; you just need the stage installed so you can add it to your job. Look at the differences between insert/update/add/load and the other write methods, and at the options specific to each database stage. If you have time and a database, try some loads into a database table.

Section 6 - Data Transformation (15%)

If you've used DataStage for longer than a year this is probably the topic you are going to ace - as long as you have used stage variables at some point.

Versions: should be okay studying on versions 6, 7 or 8. Transformation stages such as Transformer, Filter and Modify have not changed much.

Reading: there is more value in using the transformation stages than in reading about them - the Transformer stage is far easier to navigate and understand when you are hands-on in it. If you have to make do with reading, visit dsxchange and look for threads on stage variables, the FAQ on the parallel number generator, removing duplicates using a Transformer, and parallel forum questions on null handling. The threads are full of practical examples, which makes them better than the manuals. Also read the Parallel Job Advanced Developer's Guide section on "Specifying your own parallel stages".

Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and Filter Stage.

Create some mixed-up source data with duplicate rows and a multiple-field key. Try to remove the duplicates using a Sort stage followed by a Transformer, with a combination of stage variables holding the prior row's key value to compare to the new row's key value, as in the sketch below.
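A sketch of the classic pattern, assuming the data is hash partitioned and sorted on the key so duplicates arrive together: stage variables evaluate top to bottom on every row, so compare the incoming key against the saved value before you overwrite it. The variable names here are my own:

    svIsDuplicate:  If in.key = svPreviousKey Then 1 Else 0    (declared first, so it runs first)
    svPreviousKey:  in.key                                     (declared second, saves the key for the next row)

    Output link constraint:  svIsDuplicate = 0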

Process some data that has nulls in it. Use the null column in a Transformer concatenation, with and without a NullToValue function and with and without a reject link on the Transformer. This shows you how rows get dropped and/or trapped by a Transformer. Explore the right mouse click menu in the Transformer, output some of the DS macro values and system variables to a Peek stage and think of uses for them in various data warehouse scenarios. Ignore DS routines; they're not on the test.
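For example, a pair of derivations along these lines (column names invented) behave very differently: the first drops the row, or sends it down the reject link if one exists, whenever first_name is null; the second lets every row through:

    in.first_name : " " : in.last_name
    NullToValue(in.first_name, "UNKNOWN") : " " : in.last_name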

Don't spend much time on the Modify stage - it would take forever to memorize all its functions. Just do an exercise on handle_null, string_trim and converting a string to a number. It can be tricky to get working and you might not even get a question on it.
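Three specifications cover those exercises, along the lines of the following (column names are placeholders, and the last line relies on the stage's default string-to-number conversion - check the exact syntax against the Modify stage chapter of the Parallel Job Developer's Guide):

    new_col:int32 = handle_null(old_col, -1)
    trimmed_col = string_trim[' ', end](padded_col)
    num_col:int32 = char_col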

Section 7 - Combining and Sorting Data (10%)

Section 10 - Job Design (10%)

I've combined these since they overlap. Don't underestimate these sections: they cover a very narrow range of functionality, so they are an easy set of questions to prepare for and get right. There are easy points on offer here.

Versions: any version 7.x is best; version 6 has a completely different lookup stage; version 8 can be used, but remember that the Range lookup functionality is new.

Reading: the Parallel Job Developer's Guide has a table showing the differences between the Lookup, Merge and Join stages. Try to memorize the parts of that table about inputs, outputs and reject links. This is also a good place to learn some more environment variables: read the Parallel Job Advanced Developer's Guide looking for any with the word SORT or SCORE in them.

Exercises: Compare the way join, lookup and merge work. Create a job that switches between each type.

Add various combinations of the SORT and COMBINE_OPERATORS environment variables to your job and examine the job score to see what sorts get added at run time. A simple job with a Remove Duplicates or Join stage and no explicit sorting will have a sort inserted - even if the data comes from a sorted database source. Use the SORT variables to turn this sort insertion off. Watch what happens to Transformer, Lookup and Copy stages in the SCORE log entry when operator combination is turned on or off.
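On 7.5 the variables I used for this exercise were along these lines (names worth double-checking against the Advanced Developer's Guide):

    export APT_NO_SORT_INSERTION=True    # stops tsort operators being inserted at run time
    export APT_DISABLE_COMBINATION=True  # runs every operator as its own process, visible in the score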

Create an annotation and a description annotation and explore the differences between the two. Use the Multiple Compile tool in the DataStage Manager (version 6, 7) or Designer (version 8). Create and use shared containers.

Section 8 - Automation and Production Deployment (10%)

Versions: most deployment methods have remained the same across versions 6, 7 and 8. Version 8 has the same import and export functions. DataStage 8 parameter sets will not be in the version 7 exam.

Reading: you don't need to install or use the Version Control tool to pass this section, but you should read the PDF that comes with it to understand the IBM recommendations for deployment - it covers the move from development to test to production. Read the Server Job Developer's Guide section on command line calls to DataStage such as dsjob, dssearch and dsadmin.

Exercises: practice a few dsjob and dssearch commands. Practice saving a job log to a text file. Create some job-specific environment variables.
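A few commands to get you started - the project and job names are placeholders:

    dsjob -run -jobstatus -param SourceDir=/tmp devproject TestJob
    dsjob -logsum devproject TestJob > TestJob_log.txt
    dsjob -report devproject TestJob XML > TestJob_report.xml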
