Saturday, October 18, 2008

DS Faqs

32. Where are flat files actually stored? What is the path?
· Normally flat files are stored on FTP servers or in local folders; .csv, .xls and .txt file formats are commonly used for flat files.
· Flat files store the data, and the path can be given in the General tab of the Sequential File stage.
33. How does DataStage handle user security?
· We have to create users in the Administrator and give the necessary privileges to those users.
34. How can I convert server jobs into parallel jobs?
· You cannot convert a server job into a parallel job directly; you have to rebuild the whole job.
· I have never tried doing this, however I have some information which will save you a lot of time: you can convert your server job into a server shared container. The server shared container can then be used in parallel jobs as a shared container.
35. If I add a new environment variable in Windows, how can I access it in DataStage?
· You can add it in the Designer window under the job properties: in the Parameters tab, add a new environment variable or use an existing one.
· You can view all the environment variables in the Designer; check Job Properties. You can add and access environment variables from Job Properties.
36. What is a data set, and what is a file set?
· I assume you are referring to the Lookup File Set, which is used only by Lookup stages. Data set: DataStage parallel extender jobs use data sets to manage data within a job; you can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs. File set: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are together called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a single file and you need to distribute files among nodes to prevent overruns.
· File Set stage: it allows you to read data from or write data to a file set. The stage can have a single input link, a single output link and a single reject link, and it executes only in parallel mode.
· Data sets are used to bring data into parallel jobs in much the same way ODBC is used in server jobs.
37. How does the hashed file do a lookup in server jobs? How does it compare the key values?
· A hashed file is used for two purposes: 1. to remove duplicate records, and 2. as a reference for lookups. Each record in a hashed file has a hashed key, a key header and a data portion. Because the hashing algorithm locates a record directly from its key value, the lookup is fast.
38. There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of each type?
· The three types of stages are: i) Custom, ii) Build, iii) Wrapped.
39. If data is partitioned in your job on key 1 and you then aggregate on key 2, what issues could arise?
· Rows that share the same value of key 2 may end up in different partitions, so the aggregation can produce partial (incorrect) group results. The data should be repartitioned (for example hash-partitioned) on key 2, and normally sorted on it, before the Aggregator.
40. How can I specify a filter command for processing data while defining sequential file output data?
· The Sequential File stage has a filter command option where you can give a Unix command (such as sort or grep) that the data is piped through (see the sketch below).
· We also have before-job and after-job subroutines, with which we can execute Unix commands; the sort or filter command can be used there as well.
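A minimal sketch of what such a filter could look like, assuming a comma-delimited file and a filter box that simply pipes the data through any Unix command reading stdin and writing stdout; the command itself is only an illustration:
# drop comment lines and sort on the first comma-separated field
grep -v '^#' | sort -t',' -k1,1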
41. What is the meaning of the following? 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis. 3) Use the power of the DBMS.
· The question is not entirely clear, even so I will try to answer:
If you have an SMP machine you can use the IPC, Link Collector and Link Partitioner stages for performance tuning.
If you have a cluster or MPP machine you can use parallel jobs.
· The third point is about tuning job performance by using the power of the database: you can improve the performance of the SQL used in the jobs by analyzing tables, creating indexes and creating partitions.
42. How can you implement complex jobs in DataStage?
· What do you mean by a complex job?
If you use more than about 15 stages and some 10 lookup tables in a single job, you can call it a complex job.
· A complex design means one with many joins and lookups; that job design is called a complex job. We can easily implement any complex design in DataStage by following a few simple tips that also help performance. There is no hard limit on the number of stages in a job, but for better performance use at most about 20 stages per job; if you need more, split the work into another job. Likewise, use no more than about 7 lookups on a single Transformer, otherwise add another Transformer.
· If a job has involved logic it is called a complex job; these are really just scenarios, so if you have faced any particular complexity while creating a job, please share it with us.
43. How can I extract data from DB2 (on IBM iSeries) to the data warehouse using DataStage as the ETL tool? Do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much if anybody can provide an answer.
· You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers we usually install; use the CD provided with the DB2 installation, which has the ODBC drivers for DB2) and then try it out.
· If your source system is a mainframe you can use the LOAD and UNLOAD utilities: unload the records from the mainframe system and then export them to your own system (Windows).
44. How can we pass parameters to a job using a file?
· You can do this by reading the parameter values from a Unix file and then invoking the execution of the DataStage job; the job has the parameters defined, and their values are passed in by the Unix script (see the sketch below).
· You can create a Unix shell script which passes the parameters to the job and also writes logs for the whole run of the job.
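A minimal shell sketch of this approach, assuming a parameter file of name=value pairs and a job whose parameter names match; params.txt, MyProject and MyJob are placeholders:
#!/bin/sh
# build a -param argument for every name=value line in the file
# (values containing spaces would need extra quoting)
PARAMS=""
while read line
do
  PARAMS="$PARAMS -param $line"
done < params.txt
# run the job and log the outcome
dsjob -run -jobstatus $PARAMS MyProject MyJob
echo "`date` MyJob finished with status $?" >> run_myjob.log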
45. What is the DS Administrator used for? Did you use it?
· The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.
46. What about system variables?
· DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE The internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@FALSE The compiler replaces the value with 0.
@FM A field mark, Char(254).
@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME The user login name.
@MONTH The current month extracted from the value in @DATE.
@NULL The null value.
@NULL.STR The internal representation of the null value, Char(128).
@PATH The pathname of the current DataStage project.
@SCHEMA The schema name of the current DataStage project.
@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
@TM A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE The compiler replaces the value with 1.
@USERNO The user number.
@VM A value mark (a delimiter used in UniVerse files), Char(253).
@WHO The name of the current DataStage project directory.
@YEAR The current year extracted from @DATE.
REJECTED Can be used in the constraint expression of an output link of a Transformer stage. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.
47. What are job parameters?
· These parameters are used to change run-time values of the job without editing the design.
Edit > Job Properties: in the Parameters tab we can define the name, prompt, type and default value.

DS FAQs

25. Give one real-time situation where the Link Partitioner stage is used.
· If we want to send data from the source to the targets more quickly, we use the Link Partitioner stage in server jobs; it can create a maximum of 64 partitions. It is an active stage. Normally we cannot connect two active stages, but this stage is allowed to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions. The Link Collector is also an active stage, so to avoid connecting an active Transformer directly to the Link Collector we use the Inter Process Communication (IPC) stage. As the IPC stage is passive, the data can be collected through it by the Link Collector. But we can use inter-process communication only when the target is a passive stage.
26. How do I create a DataStage engine stop/start script? My idea is roughly the following:
#!/bin/bash
# dsadm user, su - root (password encrypted)
DSHOMEBIN=/Ascential/DataStage/home/dsadm/Ascential/DataStage/DSEngine/bin
# if ps -ef | grep DataStage shows a client connection, kill -9 that PID
uv -admin -stop > /dev/null
uv -admin -start > /dev/null
# verify the processes, check the connection
echo "Started properly"
# run it as dsadm
· Go to the path /DATASTAGE/PROJECTS/DSENGINE/BIN and run:
uv -admin -stop
uv -admin -start
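A fuller hedged sketch of such a script, assuming the engine lives under /Ascential/DataStage/DSEngine and the script runs as the dsadm user; the dsapi_slave check for client connections is an assumption about a typical install:
#!/bin/sh
# stop/start the DataStage engine - paths and user are placeholders, run as dsadm
DSHOME=/Ascential/DataStage/DSEngine
cd $DSHOME
. ./dsenv                                  # load the DataStage environment
# refuse to stop while client connections (dsapi_slave processes) are active
if ps -ef | grep dsapi_slave | grep -v grep > /dev/null
then
  echo "Active client connections found - log them out first"
  exit 1
fi
bin/uv -admin -stop
sleep 30
bin/uv -admin -start
echo "DataStage engine restarted"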
27. What does the Separation option in a static hashed file mean?
· The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on characters and their position in the record ids.
When a hashed file is created, Separation and Modulo respectively specify the group buffer size and the number of buffers (groups) allocated for the file. When a static hashed file is created, DataStage creates a file that contains the number of groups specified by Modulo.
Size of hashed file = Modulus (number of groups) x Separation (group buffer size)
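As a quick worked example, assuming (as in UniVerse) that Separation is expressed in 512-byte blocks: a static hashed file created with Modulus 7 and Separation 4 starts out at 7 groups x 4 blocks x 512 bytes = 14,336 bytes.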
28. What is the difference between a sequential file and a data set? When do you use the Copy stage?
· A Sequential File stage stores a relatively small amount of data in a text file with any extension (e.g. .txt), whereas a Data Set stores a huge amount of data in DataStage's internal format and opens only with the .ds extension.
· The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the order of columns.
29. How do you find errors in a job sequence?
· Using the DataStage Director log we can find the errors in a job sequence.
30. What is job control? How can it be used? Explain with steps.
· JCL stands for Job Control Language; job control is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar, select Job Properties and enter the parameters as:
Parameter Prompt Type
STEP_ID STEP_ID String
Source SRC String
DSN DSN String
Username unm String
Password pwd String
After editing the above, go to the job control tab, select the jobs from the list box and run the job.
31. How do you implement a Type 2 slowly changing dimension in DataStage? Give an example.
· Slowly changing dimensions are a common problem in data warehousing. For example: there is a customer called Lisa in a company ABC and she lives in New York. Later she moves to Florida. The company must now modify her address. In general there are three ways to solve this problem:
Type 1: The new record replaces the original record; no trace of the old record remains. Type 2: A new record is added to the customer dimension table, so the customer is treated essentially as two different people. Type 3: The original record is modified to reflect the change.
In Type 1 the new value overwrites the existing one, which means no history is maintained; the record of where she lived before is lost. It is simple to use.
In Type 2 a new record is added, therefore both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track the historical changes.
In Type 3 there are two columns, one to hold the original value and one to hold the current value; for example a new column is added which shows the original address as New York and the current address as Florida. This keeps part of the history and the table size does not grow, but when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.

FAQs

13. What is Runtime Column Propagation and how do you use it?
· If the data flowing through your job has more columns than are defined in the metadata, then, if runtime column propagation is enabled, DataStage will propagate those extra columns through the rest of the job.
14. Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (maybe Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on UNIX/Linux?
· Your source systems can be anything (Oracle, SQL Server, DB2, flat files, etc.), but your target system for the complete data warehouse should be a single database (Oracle, SQL Server, DB2, ...).
· In the Server edition you can have both on Windows, but with PX the target should be on UNIX.
15. How do you find the process id? Explain with steps.
· You can find it in UNIX by using the ps -ef command; it displays all the processes currently running on the system along with their process ids.
· From the DS Director, follow the path:
Job > Cleanup Resources.
There you can also see the PID; it displays all the currently running processes.
· Depending on your environment, you may have lots of process ids. From one of the DataStage docs: you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command doesn't make sense, you'll need some background theory on how processes work in UNIX (or in the MKS environment when running on Windows). Also from the DataStage docs (I haven't tried this one yet, but it looks interesting): APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process id. Good luck.
16. If we are using two sources having the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?
· Use a Change Capture stage and send its output into a Transformer.
Write a routine that aborts the job and call it from the Transformer, triggered when @INROWNUM = 1.
So if the data does not match, a change row is passed into the Transformer and the job is aborted.
17. Can you tell me for what purpose .dsx files are used in DataStage?
· .dsx is the standard file extension for exported DataStage components. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop the job in our test environment and, after testing, we export the file and save it as x.dsx. This can be done using the DataStage Manager.
18. How can we create read-only jobs in DataStage?
· In the Export dialog click on the Options tab; there, under the include options, you will find a read-only DataStage option. Just enable that.
19. What are environment variables? What are they used for?
· Basically an environment variable is a predefined variable that we can use while creating a DS job. It can be set either at project level or at job level; once set, the variable is available in that project/job.
We can also define new environment variables; for that, go to the DS Administrator.
I hope you understand; for further details refer to the DS Administrator guide.
20. How do you call procedures in DataStage?
· Use the Stored Procedure (STP) stage.
21. How do you run a job from the command prompt in UNIX?
· Using the dsjob command with the appropriate options, for example:
dsjob -run -jobstatus projectname jobname
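A hedged sketch that builds on this command: with -jobstatus the dsjob exit code reflects how the job finished (on a typical install 1 means finished OK, 2 finished with warnings, 3 aborted); MyProject and MyJob are placeholders.
#!/bin/sh
dsjob -run -jobstatus MyProject MyJob
status=$?
if [ $status -eq 1 -o $status -eq 2 ]
then
  echo "Job finished (status $status)"
else
  echo "Job failed or aborted (status $status)"
fi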
22. How do you know the number of records in a sequential file before running a server job?
· If your environment is UNIX, you can check it with the wc -l filename command, as sketched below.
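A tiny sketch, assuming the file has a one-line header that should not be counted; input.txt is a placeholder:
# count data records, excluding the header line
cnt=`wc -l < input.txt`
cnt=`expr $cnt - 1`
echo "Records to process: $cnt"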
23. What is the DataStage engine? What is its purpose?
· The DataStage server contains the DataStage engine; the DS server interacts with the client components and the Repository. The engine is used to develop and run jobs: only when the engine is up can we develop the jobs.
24. How do you implement slowly changing dimensions in DataStage?
· In Informatica PowerCenter there is a wizard to implement slowly changing dimensions. DataStage does not have that type of wizard, so SCDs have to be implemented with manual logic.

Datastage Faqs

6. How do you read data from Excel files? Explain with steps.
· Reading data from an Excel file:
* Save the file as .csv (comma separated values).
* Use a flat file (Sequential File) stage in the DataStage job canvas.
* Double-click on the flat file stage and assign the input file to the .csv file you stored.
* Import metadata for the file (once you have imported or typed the metadata, click View Data to check the data values).
Then do the rest of the transformation as needed.
-Debasis
· Create a new DSN for the Excel driver and choose the workbook from which you want the data.
Select the ODBC stage and access the Excel data through it, i.e. import the Excel sheet using the new DSN created for Excel.
7. How can we generate a surrogate key in server/parallel jobs?
· In parallel jobs we can use the Surrogate Key Generator stage.
· In server jobs we can use the built-in routine KeyMgtGetNextValue.
· You can also generate the surrogate key in the database using a sequence generator.
8. What is an environment variable?
· Basically an environment variable is a predefined variable that we can use while creating a DS job. It can be set either at project level or at job level; once set, the variable is available in that project/job.
We can also define new environment variables; for that we can go to the DS Administrator.
I hope you understand; for further details refer to the DS Administrator guide.
· These are variables used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without this you cannot run a parallel job) or to increase the sequential file or data set read/write buffers.
ex: $APT_CONFIG_FILE
Like the above there are many more environment variables. Go to Job Properties, click on the Parameters tab and then click on "Add Environment Variable" to see most of them; a sketch of overriding one at run time follows.
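A hedged sketch of that override: once $APT_CONFIG_FILE has been added to the job's parameters it can normally be supplied from the command line as well; the project, job and path below are placeholders.
dsjob -run -jobstatus \
  -param '$APT_CONFIG_FILE=/opt/ds/configs/four_node.apt' \
  MyProject MyParallelJob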
9. How can we create environment variables in DataStage?
· We can create environment variables by using the DataStage Administrator.
· This mostly comes under the Administrator. As a designer you can only add them to a job directly: Designer > View > Job Properties > Parameters > Add Environment Variable > (under User Defined) > add.
10. I have a few questions: 1. What are the various processes which start when the DataStage engine starts? 2. What changes need to be done on the database side if I have to use the DB2 stage? 3. Is the DataStage engine responsible for compilation, execution or both?
· Three services start when the DataStage engine starts:
1. DSRPC
2. DataStage Engine Resources
3. DataStage Telnet Services
11. How do you write and execute routines for PX jobs in C++?
· You define and store the routines in the DataStage Repository (e.g. in the Routines folder), and these routines are compiled with a C++ compiler.
· You have to write the routine in C++ (g++ on Unix), then create an object file and provide the path to that object file in your routine definition, for example as sketched below.
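A minimal sketch of the compile step on UNIX; the file names are placeholders and the compiler and flags should match what DataStage itself is configured to use (see the $APT_COMPILER / $APT_COMPILEOPT environment variables), so treat this as an illustration only:
# compile the routine source into a position-independent object file
g++ -O -fPIC -c my_parallel_routine.cpp -o my_parallel_routine.o
# the full path to my_parallel_routine.o then goes into the
# Parallel Routine definition in the Designer/Manager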
12. How do you eliminate duplicate rows in DataStage?
· You can remove duplicate rows in more than one way:
1. In DS there is a stage called Remove Duplicates where you can specify the key.
2. Alternatively, you can specify the key in the stage itself so that the stage removes the duplicate rows based on that key at processing time.
· By using the Hashed File stage in DS Server we can eliminate the duplicates.
· Use a Sort stage and set the property Allow Duplicates to false.
OR
In the Input tab of any stage choose hash partitioning, specify the key and check the Unique checkbox. If you are working with server jobs, you can use a hashed file to eliminate duplicate rows.

Datastage faqs

1. How do you read data from Excel files? My problem is that my data file has commas in the data, but our delimiter is also a comma. How do I read the data? Explain with steps.
A) 1. Create a DSN for your Excel file by picking the Microsoft Excel Driver.
2. Take ODBC as the source stage.
3. Configure the ODBC stage with the DSN details.
4. While importing metadata for the Excel sheet, make sure you select the system tables check box.
Note: In the Excel sheet the first line should contain the column names.
2. What is a job sequence used for? What are batches? What is the difference between a job sequence and batches?
· A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.
3. Disadvantages of a staging area
· I think the disadvantage of a staging area is disk space, as we have to dump the data into a local area. As far as I know, there is no other disadvantage of a staging area.
4. What is meant by performance tuning techniques? Example?
· Performance tuning means taking action to increase the performance of a slowly running job, for example:
1) use the Link Partitioner and Link Collector stages to speed up throughput
2) use sorted data for aggregation
3) sort at the source side and aggregate at the target side
4) tune the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects
5) do not use an IPC stage at the target side
5. How do you distinguish the surrogate key in different dimension tables?
· The surrogate key is the key field in the dimension.

Monday, October 6, 2008

How to fetch the flat file data into Datastage?



For example, if the source is a COBOL file, copy that file to your local machine and then give that path in the source stage to fetch the data into DataStage.

Thursday, August 21, 2008

History Of Datastage

History
DataStage was conceived at VMark, a spin off from Prime Computers that developed two notable products: UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996[1]. Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity. He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation[2]. This tag was used to name DataStage and subsequently used in related products QualityStage, ProfileStage, MetaStage and AuditStage. Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November and was generally available in January of 1997.

VMark acquired UniData in October of 1997 and renamed itself to Ardent Software[3]. In 1999 Ardent Software was acquired by Informix[4] the database software vendor. In April of 2001 IBM acquired Informix and took just the database business leaving the data integration tools to be spun off as an independent software company called Ascential Software[5]. In March of 2005 IBM acquired Ascential Software[6] and made DataStage part of the WebSphere family as WebSphere DataStage. In 2006 the product was released as part of the IBM Information Server under the Information Management family but was still known as WebSphere DataStage. In 2008 the suite was renamed to InfoSphere Information Server and the product was renamed to InfoSphere DataStage.





DataStage Editions
Enterprise Edition: a name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages, formerly known as Mercator.
DataStage SOA: Real Time Integration pack can turn server or parallel jobs into SOA services.
1) How can you handle the after-job subroutine in the Transformer stage?
ANS: In the Transformer stage click on the stage properties tab in the toolbar (left corner tab). It displays a stage properties page like all the other job properties pages. There you can specify your before- and after-stage subroutines.
2) What is DataStage Parallel Extender?
ANS1: It is a newer version of DataStage, called PX, which processes data using partitioning/collecting methods (algorithms); it is very fast compared to the older (Server) version. Partitioning can be defined via a configuration file which stores the number of nodes and resource pool information. There are various new stages added in the PX version, for example Data Set, File Set, Row Generator, Lookup and many more. Cheers.
ANS2: Parallel Extender provides parallel processing of data extraction and transformation applications. There are two types of parallel processing: 1) pipeline parallelism, 2) partition parallelism.

Thursday, July 31, 2008

Rendering Oracle HTML DB Reports as PDFs Using Formatting Objects Processor

Process Overview :
The process of rendering Oracle HTML DB report output to PDF requires a few different pieces to be in place:

Oracle HTML DB 1.6 - This version of HTML DB is required to generate report output in an XML format that can be understood by FOP.
Java Platform - an environment in which a JavaServer Page (JSP) can perform an XML/XSL transformation.
JavaScript - on the page producing the Oracle HTML DB report, a certain amount of JavaScript is required to pass the report output over to the JSP.
Extensible Stylesheet Language Transformation (XSLT) - an XSLT document is required by the JSP to transform the report in XML format to PDF.
Process Example

To illustrate this concept, we will add an "Output to PDF" link to the template of an HTML DB region that contains a report. When users click on this link, a new browser window will pop up containing the PDF version of the same report output. Although we will be using a simple query on the HR.EMPLOYEES table, the formatting code provided with this Technical Note is not query-specific and should work with any queries that you create. The first few steps of this example are performed outside the Oracle HTML DB environment to put the various supporting pieces in place.
Step 1: Add FOP Libraries to the Java Environment

In order to render the received XML report output into a specific format (PDF in this case), your Java environment needs access to the FOP libraries. In this example we will use Oracle Application Server Containers for J2EE (OC4J) 10g for that environment, so the libraries will be added to $ORACLE_HOME/oc4j/j2ee/home/config/application.xml.

Download and unzip the FOP libraries from http://xml.apache.org/fop/
Place the unzipped fop-0.20.5 directory in $ORACLE_HOME/oc4j/j2ee
Include the following lines in $ORACLE_HOME/oc4j/j2ee/home/config/application.xml:




Step 2: Add the Rendering Components Specific to Oracle HTML DB to the Java Environment
Two files specific to this Oracle HTML DB PDF rendering process are required: htmldb_fop_render.jsp and htmldb_example.xslt. Simply put, htmldb_fop_render.jsp receives the XML report output that is POSTed by the JavaScript soon to be added below and performs an XML/XSL transformation using htmldb_example.xslt and the FOP libraries. Although these two files can be placed elsewhere, putting them both in $ORACLE_HOME/oc4j/j2ee/home/default-web-app is easiest for now.

Step 3a: Create an Oracle HTML DB Report

Create a report on scalar data types that will be used for this example. As mentioned previously, the formatting code added above isn't report specific, so any report of scalar datatypes will do.

Log into your Oracle HTML DB workspace.
Go to the Application Builder home by clicking the name of an application to which you wish to add this example report.
Click Create Page.
Select Page with Component and click Next.
Select Report and click Next.
Select SQL Report and click Next.
Finish the report wizard as desired while providing a query to be used. The simple query below would suffice for these purposes:
select first_name "First", last_name "Last", email "Email ID",
job_id "Job", hire_date "Date of Hire", salary "Compensation"
from hr.employees

Please note that if your HTML DB application schema does not have access to the HR schema, simply use another query for this example.
Step 3b: Add a Page-Level Item and Branch to Oracle HTML DB Report
To illustrate the ability to reference page-level items from the XSLT used in this transformation, add an item and a branch to return users to the current page.

From the Page Definition view of the newly created page, click the Create icon in the Items section.
Select Text and click Next.
Select Text Field (always submits page when Enter pressed) and click Next.
Name the item "MY_REPORT_PARAMETER" and click Next.
Provide a Label value of "My Report Parameter" and click Next.
Accept the remaining defaults in the Create Page Item wizard and return to the Page Definition view.
Click the Create icon in the Branches section.
Accept all the defaults of the Create Branch wizard while providing &APP_PAGE_ID. for the target page.
Click Create Branch.
Lastly, include MY_REPORT_PARAMETER in your report region by changing its query to:
select first_name "First", last_name "Last", email "Email ID",
job_id "Job", hire_date "Date of Hire", salary "Compensation"
from hr.employees
where salary > :MY_REPORT_PARAMETER or :MY_REPORT_PARAMETER IS null

Step 4: Add PDF Link to Report Screen

From the Page Definition view of your report page, click the name of the region template displayed in the Templates section of the Shared Components column.
Add the link definition below somewhere to the Template field of this screen. Placing the link immediately before the #CLOSE# substitution string tends to render nicely.
Output to PDF Document
Click Apply Changes.
Step 5: Modify and Upload the JavaScript File
Download this JavaScript file and change the path shown in this line:
var g_Render_URL = 'http://<hostname>:<port>/htmldb_fop_render.jsp';

to match the location of the JSP you added in Step 2, above. For example:
var g_Render_URL = 'http://myhostname.com:8888/htmldb_fop_render.jsp';

Go back into Oracle HTML DB and click the Application Builder tab at the top of the screen.
Click the Shared Components icon.
Click the Static Files link in the Files section.
Click Create.
Click Browse to navigate to your modified htmldb_pdf.js file.
Click Upload.
Step 6: Reference the JavaScript Functions From Your Report Page
Return to the Page Definition view of the page containing your report region.
Call the uploaded JavaScript from the header of this page by adding the following line to the "HTML Header" field:


Click Apply Changes.
Step 7: Run the Report Page and Click the "Output to PDF Document" Link
Summary

This Technical Note has offered a simple example of using Oracle HTML DB 1.6 and a Java environment to render HTML DB report output as a PDF document. The process illustrated can be summarized as follows:
HTML DB generates HTML report output in an HTML DB application page.
End users click a link on the page.
The link calls JavaScript that tells the HTML DB engine to send the report output to a JSP in a specific XML format.
The JSP performs an XML/XSL transformation using the provided XSLT and the FOP libraries.
FOP renders the document as PDF.

Using this concept, Oracle HTML DB developers can now use FOP to control all aspects of generated report output down to the layout within specific formats such as PDF, PCL, PS, SVG, Print, AWT, MIF and TXT.
Additional Resources & Information

The example shown above is a simple one in which HTML DB report output is sent to a specific PDF format. The layout of the rendered PDF document is controlled almost entirely by the htmldb_example.xslt stylesheet. Developers wishing to customize this resulting output can do so by making changes to that htmldb_example.xslt file. To learn more about transformation using such XSLT stylesheets, see the following resources:

Sunday, June 29, 2008

Datastage certification Dump

QUESTION 1
Which three stages support the dynamic (runtime) definition of the physical column
metadata? (Choose three.)
A. the Sequential stage
B. the Column Export stage
C. the CFF stage
D. the DRS stage
E. the Column Import stage
Answer: A,B,E

QUESTION 2
A Varchar(10) field named SourceColumn is mapped to a Char(25) field named
TargetColumn in a Transformer stage. The APT_STRING_PADCHAR environment
variable is set in Administrator to its default value. Which technique describes how to
write the derivation so that values in SourceColumn are padded with spaces in
TargetColumn?
A. Include APT_STRING_PADCHAR in your job as a job parameter. Specify the C/C++
end of string character (0x0) as its value.
B. Map SourceColumn to TargetColumn. The Transformer stage will automatically pad
with spaces.
C. Include APT_STRING_PADCHAR in your job as a job parameter. Specify a space as
its value.
D. Concatenate a string of 25 spaces to SourceColumn in the derivation for
TargetColumn.
Answer: C

QUESTION 3
Which three privileges must the user possess when running a parallel job? (Choose
three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX/etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A,B,E

QUESTION 4
In a Teradata environment, which stage invokes Teradata supplied utilities?
A. Teradata API
B. DRS Teradata
C. Teradata Enterprise
D. Teradata Multiload
Answer: D

QUESTION 5
When importing a COBOL file definition, which two are required? (Choose two.)
A. The file you are importing is accessible from your client workstation.
B. The file you are importing contains level 01 items.
C. The column definitions are in a COBOL copybook file and not, for example, in a
COBOL source file.
D. The file does not contain any OCCURS DEPENDING ON clauses.
Answer: A,B

QUESTION 6
Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B,C

QUESTION 7
Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for Auto-purge of job log entries
Answer: C,D,E

QUESTION 8
Which three are keyless partitioning methods? (Choose three.)
A. Entire
B. Modulus
C. Round Robin
D. Random
E. Hash
Answer: A,C,D

QUESTION 9
Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A,C

QUESTION 10
Which three are valid ways within a Job Sequence to pass parameters to Activity stages?
(Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage
Answer: A,B,D

QUESTION 11
A client requires that a database table be loaded using two jobs. The first job writes to a
dataset. The second job reads the dataset and loads the table. The two jobs are connected
in a Job Sequence. What are three benefits of this approach? (Choose three.)
A. The time it takes to load the table is reduced.
B. The database table can be reloaded after a failure without re-reading the source data.
C. The dataset can be used by other jobs even if the database load fails.
D. The dataset can be read if the database is not available.
E. The data in the dataset can be archived and shared with other external applications.
Answer: B,C,D

QUESTION 12
You are reading customer data using a Sequential File stage and transforming it using the
Transformer stage. The Transformer is used to cleanse the data by trimming spaces from
character fields in the input. The cleansed data is to be written to a target DB2 table.
Which partitioning method would yield optimal performance without violating the
business requirements?

A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire
Answer: B

QUESTION 13
Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)
B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)
Answer: B,C,E

QUESTION 14
An Aggregator stage using a Hash technique processes a very large number of rows
during month end processing. The job occasionally aborts during these large runs with an
obscure memory error. When the job is rerun, processing the data in smaller amounts
corrects the problem. Which change would correct the problem?
A. Set the Combinability option on the Stage Advanced tab to Combinable allowing the
Aggregator to use the memory associated with other operators.
B. Change the partitioning keys to produce more data partitions.
C. Add a Sort stage prior to the Aggregator and change to a sort technique on the Stage
Properties tab of the Aggregator stage.
D. Set the environment variable APT_AGG_MAXMEMORY to a larger value.
Answer: C

QUESTION 15
Which three actions are performed using stage variables in a parallel Transformer stage?
(Choose three.)
A. A function can be executed once per record.
B. A function can be executed once per run.
C. Identify the first row of an input group.
D. Identify the last row of an input group.
E. Lookup up a value from a reference dataset.
Answer: A,B,C

DataStage solutions, knowledge base, FAQ and best practices

1.What are the ways to execute datastage jobs?

A job can be run using a few different methods:

* from Datastage Director (menu Job -> Run now...)
* from command line using a dsjob command
* Datastage routine can run a job (DsRunJob command)
* by a job sequencer
2.How to invoke a Datastage shell command?

Datastage shell commands can be invoked from :

* Datastage administrator (projects tab -> Command)
* Telnet client connected to the datastage server

3.How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If that doesn't help, go to Job -> Cleanup Resources, select the process which holds the lock and click Logout.

If it still doesn't help go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

4. How to release a lock held by jobs?

Go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

5. How to run and schedule a job from command line?

To run a job from command line use a dsjob command

Command Syntax: dsjob [-file
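As a hedged sketch (the project, job and dsenv path are placeholders, and dsjob usually needs the DataStage environment sourced first), the job can be run once from the shell and then scheduled with cron:
# run once from the shell
. /Ascential/DataStage/DSEngine/dsenv
dsjob -run -jobstatus MyProject MyJob

# schedule the same command nightly at 02:00 via crontab -e
0 2 * * * . /Ascential/DataStage/DSEngine/dsenv && dsjob -run MyProject MyJob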

6. User privileges for the default DataStage roles?


The role privileges are:

* DataStage Developer - user with full access to all areas of a DataStage project
* DataStage Operator - has privileges to run and manage deployed DataStage jobs
* -none- - no permission to log on to DataStage

7. What is a command to analyze hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:

* FILE.STAT command
* ANALYZE.FILE command
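Both commands are typed at the DataStage (UniVerse) command shell (Administrator -> Command, or the shell on the server engine); the hashed file name below is a placeholder:
FILE.STAT MyHashedFile
ANALYZE.FILE MyHashedFile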

8. Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can unregister and register system libraries used by Datastage.

9. How to send notifications from Datastage as a text message (sms) to a cell phone

There are a few possible methods of sending SMS messages from Datastage. However, there is no easy way to do this directly from Datastage and all the methods described below will require some effort.

The easiest way of doing that from the Datastage standpoint is to configure an SMTP (email) server as a mobile phone gateway. In that case, a Notification Activity can be used to send a message with a job log and any desired details. The DSSendMail before-job or after-job subroutine can also be used to send SMS messages.
If configured properly, the recipient's email address will have the following format: 600123456@oursmsgateway.com

If there is no possibility of configuring a mail server to send text messages, you can work around it by using an external application run directly from the operating system. There is a whole range of unix scripts and applications for sending SMS messages.
In that solution, you will need to create a script which takes care of sending the messages and invoke it from Datastage using the ExecDOS or ExecSH subroutines, passing the required parameters (like the phone number and message body), as sketched below.

Please keep in mind that all these solutions may require contacting the local cellphone provider first and, depending on the country, it may not be free of charge; in some cases the provider may not support the capability at all.
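A minimal sketch of that wrapper, assuming a hypothetical command-line client for the gateway (send_sms does not exist out of the box); it would be called from an ExecSH before/after-job subroutine with the phone number and message as arguments:
#!/bin/sh
# usage: notify_sms.sh <phone_number> <message>   (send_sms is hypothetical)
PHONE=$1
MESSAGE=$2
send_sms --to "$PHONE" --text "$MESSAGE"
echo "`date` sms sent to $PHONE: $MESSAGE" >> /tmp/ds_sms.log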

10.Error in Link collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

11.What is the DataStage equivalent to like option in ORACLE

The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';

Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'

12.what is the difference between logging text and final text message in terminator stage

Every stage has a 'Logging Text' area on their General tab which logs an informational message when the stage is triggered or started.

* Informational - is a green line, a DSLogInfo() type message.
* The Final Warning Text - the red fatal message, which is included in the sequence abort message.

13.Error in STPstage - SOURCE Procedures must have an output link

The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage.
To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

14. How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.

It can be designed in the following way:

* source odbc stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
* A transformer which passes that record through. If required, add pl/sql procedure parameters as columns on the right-hand side of the transformer's mapping
* Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in the procedure name and select Transform as procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

15. Is it possible to run a server job in parallel?

Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance box.
The job can now be run simultaneously from one or many sequence jobs. When that happens Datastage will create new entries in Director and the new instance will be named with an automatically generated suffix (for example the second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by Datastage on the next run.
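A hedged sketch of starting two instances of such a multi-instance job from the command line, assuming the usual jobname.invocationid convention with dsjob; the project, job and invocation ids are placeholders:
dsjob -run MyProject JOB_0100.LOAD_EUROPE
dsjob -run MyProject JOB_0100.LOAD_ASIA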

16.Datastage routine to open a text file with error catching

Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

17. Datastage routine which reads the first line from a text file

Note: work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32]," ","A") ;* read the first 32 characters
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

18.How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run all to see the results.
Datastage will remember all the test arguments during future tests.

19.When hashed files should be used? What are the benefits or using them?

Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value pairs.
Hashed files are especially useful for storing dictionary-type reference data (customer details, countries, exchange rates). Stored this way it can be shared across the project and accessed from different jobs.

20.How to construct a container and deconstruct it or switch between local and shared?

To construct a container go to Datastage designer, select the stages that would be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared.
Local containers will only be visible in the current job, while shared containers can be re-used. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.

21.Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is to map correctly oracle NUMBER(x,y) format.

The best way to do that in Datastage is to convert oracle NUMBER format to Datastage Decimal type and to fill in Length and Scale column accordingly.

There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle char to datastage char.

22.How to adjust commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)

Starting from Datastage 7.x it can be set up in properties of ODBC or ORACLE stage in Transaction handling -> Rows per transaction.
If set to 0 the commit will be issued at the end of a successful transaction.

23.What is the use of INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:

* @INROWNUM counts incoming rows to a transformer in a datastage job
* @OUTROWNUM counts outgoing rows from a transformer in a datastage job

These variables can be used to generate sequences, primary keys, id's, numbering rows and also for debugging and error tracing.
They play a similar role to sequences in Oracle.

24.Datastage trim function cuts out more characters than expected

By default the datastage trim function works this way:
Trim("  a  b  c  d  ") will return "a b c d" (multiple internal spaces reduced to one), while in many other programming/scripting languages "a  b  c  d" (with only the leading and trailing spaces removed) would be expected.
That is because by default the R parameter is assumed: R removes leading and trailing occurrences of the character and reduces multiple occurrences to a single occurrence.

To get "a  b  c  d" as a result, use the trim function in the following way: Trim("  a  b  c  d  "," ","B")

25. Database update actions in ORACLE stage

The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as it will determine which column will appear in the WHERE part of the SQL statement. Available actions:

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

26. Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion )

Some useful iconv and oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

That expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100,"MD2")

Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

27.ERROR 81021 Calling subroutine DSR_RECORD ACTION=2


Error message:

DataStage Repository Interface:
Error calling subroutine: DSR_RECORD (Action=2);
check DataStage is set up correctly in project
Development (Internal Error (81021))

Datastage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.


The problem appears when a job sequence is used and it contains many stages (usually more than 10) and very often when a network connection is slow.

Basically the cause of a problem is a failure between DataStage client and the server communication.

The solution to the issue is:
# Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.
# execute the DS.REINDEX ALL command from the Datastage shell - if the above does not help

28. How to check Datastage internal error descriptions


# To check the description of a number go to the datastage shell (from administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a good starting point for further analysis.

29.Error timeout waiting for mutex


The error message usually looks like follows:
... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many transformers.

There are a few things to consider to work around the problem:
- increase the buffer size (up to to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of nulls and try to avoid language specific chars which may cause the problem.
- try to simplify the job as much as possible (especially if it’s very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

30. ERROR 30107 Subroutine failed to complete successfully


Error message:


Error calling subroutine:
DSR_RECORD (Action=2); or *DataStage*DSR_SELECT (Action=7);
check DataStage is set up correctly in project Development
(Subroutine failed to complete successfully(30107))

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers


The problem appears when a project is moved from one environment to another (for example when deploying a project from a development environment to production).

The solution to the issue is:
# Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell

31.Datastage Designer hangs when editing job activity properties


The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application freezes and the only way to close it is from the Windows Task Manager.

The solution of the problem is very simple. Just Download and install the “XP SP2 patch” for the Datastage client.
It can be found on the IBM client support site (need to log in):
https://www.ascential.com/eservice/public/welcome.do

Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section.
Sometimes users face problems when trying to log in (for example when the license doesn’t cover the IBM Active Support), then it may be necessary to contact the IBM support which can be reached at WDISupport@us.ibm.com

Advantages and Disadvantages of using DataStage ETL tool

Major business and technical advantages and disadvantages of using DataStage ETL tool


Business advantages of using DataStage as an ETL tool:
# Significant ROI (return on investment) over hand-coding
# Learning curve - quick development and reduced maintenance with GUI tool
# Development Partnerships - easy integration with top market products interfaced with the datawarehouse, such as SAP, Cognos, Oracle, Teradata, SAS
# Single vendor solution for bulk data transfer and complex transformations (DataStage versus DataStage TX)
# Transparent and wide range of licensing options


Technical advantages of using DataStage tool to implement the ETL processes
# Single interface to integrate heterogeneous applications
# Flexible development environment - it enables developers to work in their preferred style, reduces training needs and enhances reuse. ETL developers can build data integrations quickly through a graphical work-as-you-think solution which comes by default with a wide range of extensible objects and functions
# Team communication and job documentation are supported by a self-documenting engine which describes data flows and transformations in HTML format.
# Ability to join data both at the source and at the integration server, and to apply any business rule from within a single interface without having to write any procedural code.
# Common data infrastructure for data movement and data quality (metadata repository, parallel processing framework, development environment)
# With Datastage Enterprise Edition users can use the parallel processing engine, which provides high performance and scalability and helps get the most out of the hardware investment and resources.
# The Datastage server performs very well on both Windows and UNIX servers.


Major Datastage weaknesses and disadvantages
# Big architectural differences between the Server and Enterprise editions, which means that migration from Server to Enterprise edition may require a considerable amount of time and resources.
# There is no automated error handling and recovery mechanism - for example no way to automatically time out zombie jobs or kill locking processes. However, on the operator level, these errors can be easily resolved.
# No UNIX Datastage client - the client software is available only under Windows and there are different clients for different Datastage versions. The good thing is that they can still be installed on the same Windows PC and switched with the Multi-Client Manager program.
# Might be expensive as a solution for a small or mid-sized company.

Data warehouse

Data warehouse

A Data Warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM or ERP systems). It may also gather manual inputs from users determining criteria and parameters for grouping or classifying records.
That database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems, which is ensured by a strategy implemented in an ETL process.

A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded into the data warehouse.
A data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed over time (time-variant data).
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.


Keeping the data warehouse filled with very detailed and not efficiently selected data may cause the database to grow to a huge size, which becomes difficult to manage and unusable. To significantly reduce the number of rows in the data warehouse, the data is aggregated, which leads to easier data maintenance and more efficient browsing and data analysis. A hedged SQL sketch of such an aggregation step is shown below.
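
As a simple, hedged illustration of that aggregation step, the SQL below rolls detailed rows up into a monthly summary table. All table and column names are made up for the example, and the Oracle-style TRUNC function is just one possible way to derive the month.

-- Hypothetical example: roll detailed sales facts up into a monthly summary table.
-- All table and column names are assumptions; TRUNC(date, 'MM') is Oracle-style.
INSERT INTO SALES_MONTHLY_SUMMARY (CUST_NO, SALES_MONTH, TOTAL_QTY, TOTAL_NET_VALUE)
SELECT CUST_NO,
       TRUNC(SALES_DATE, 'MM') AS SALES_MONTH,
       SUM(QTY),
       SUM(NET_VALUE)
FROM   SALES_DETAIL
GROUP BY CUST_NO, TRUNC(SALES_DATE, 'MM');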

A well designed and maintained Data Warehouse can significantly improve the quality and accessibility of the company data and increase the amount of information delivered to the end users.

Key Data Warehouse systems and the most widely used database engines for storing and serving data for enterprise business intelligence and performance management:
# Teradata
# Oracle
# Microsoft SQL Server
# IBM DB2
# SAS

Datastage tutorial and training

Datastage tutorial and training

Datastage guide Table of contents:

# Lesson 1. Datastage-modules - the lesson contains an overview of the datastage components and modules with screenshots.
# Lesson 2. Designing jobs - datastage palette - a list of all stages and activities used in Datastage
# Lesson 3. Extracting and loading data - ODBC and ORACLE stages - description and use of the ODBC and ORACLE stages (ORAOCI9) used for data extraction and data load. Covers ODBC input and output links, Oracle update actions and best practices.
# Lesson 4. Extracting and loading data - sequential files - description and use of the sequential files (flat files, text files, CSV files) in datastage.
# Lesson 5. Transforming and filtering data - use of transformers to perform data conversions, mappings, validations and data refining. Design examples of the most commonly used Datastage jobs.
# Lesson 6. Performing lookups in Datastage - how to use hash files and database stages as a lookup source.
# Lesson 7. Implementing ETL process in Datastage - step by step guide on how to implement the ETL process efficiently in Datastage. Contains tips on how to design and run a set of jobs executed on a daily basis.
# Lesson 8. SCD implementation in Datastage - the lesson illustrates how to implement SCDs (slowly changing dimensions) in Datastage and contains job designs, screenshots and sample data. All the Slowly Changing Dimension types are described in separate articles below:
# SCD Type 1
# SCD Type 2
# SCD Type 3 and 4
# Lesson 9. Datastage jobs real-life solutions - a set of examples of job designs resolving real-life problems implemented in production datawarehouse environments in various companies.
# Lesson 10. Header and trailer file processing - a sample Datastage job which processes a text file organized in a header and trailer format. This format is typical of, for example, mainframe, EDI or EPIC files

Lesson 1:
Datastage modules

The DataStage components:

* Administrator - administers DataStage projects, manages global settings and interacts with the system. Administrator is used to specify general server defaults, add and delete projects, set up project properties and provide a command interface to the DataStage repository.
With Datastage Administrator users can set job monitoring limits, user privileges, job scheduling options and parallel job defaults.
* Manager - the main interface to the Datastage Repository, which allows browsing and editing it. It displays table and file layouts, routines, transforms and jobs defined in the project. It is mainly used to store and manage reusable metadata.
* Designer - used to create DataStage jobs which are compiled into executable programs. It is a graphical, user-friendly application which applies a visual data flow method to develop job flows for extracting, cleansing, transforming, integrating and loading data. It’s the module mainly used by Datastage developers.
* Director - manages running, validating, scheduling and monitoring DataStage jobs. It’s mainly used by operators and testers.

Lesson 3:
Designing jobs - ODBC and ORACLE stages


ODBC stages are used to allow Datastage to connect to any data source that supports the Open Database Connectivity (ODBC) standard.
ODBC stages are mainly used to extract or load data. However, an ODBC stage may also be very helpful when aggregating data and as a lookup stage (in that case it can play the role of an Aggregator stage or a hashed file and can be used instead).
Each ODBC stage can have any number of inputs or outputs.
The input links specify the data which is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in various ways: using an SQL statement constructed by DataStage, a user-defined SQL query or a stored procedure.
Output links specify the data that is extracted (they correspond to the SQL SELECT statement). The data on an output link is passed through the ODBC connector and processed by the underlying database. A rough SQL sketch of both link types is shown below.
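
As a rough illustration only (the CUSTOMERS and STG_CUSTOMERS tables and their columns are invented for this example), the SQL behind an output link and an input link could look like this:

-- Output link (extract) - corresponds to a generated or user-defined SELECT:
SELECT CUST_NO, CUST_NAME, CUST_COUNTRY
FROM   CUSTOMERS
WHERE  CUST_COUNTRY = 'PL';

-- Input link (load) - corresponds to an INSERT (or UPDATE/DELETE) issued per row,
-- with the link columns bound to the ODBC parameter markers:
INSERT INTO STG_CUSTOMERS (CUST_NO, CUST_NAME, CUST_COUNTRY)
VALUES (?, ?, ?);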

If the processing target is an Oracle database, it may be worth considering the ORACLE (ORAOCI9) stage. It offers significantly better performance than the ODBC stage and allows setting up more configuration options and parameters native to the Oracle database.
There’s a very useful option to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a big chunk of data into an Oracle table, it may increase performance to drop indexes in the ‘before SQL’ tab and to recreate the indexes and analyze the table in the ‘after SQL’ tab ('ANALYZE TABLE xxx COMPUTE STATISTICS' SQL statement).
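
A hedged sketch of what could go into those tabs when bulk-loading a table (the index and table names are made up; the ANALYZE statement is the one quoted above):

-- 'Before SQL' tab: drop the index so the bulk load is not slowed down.
DROP INDEX IDX_SALES_CUST_NO;

-- 'After SQL' tab: recreate the index and refresh the optimizer statistics.
CREATE INDEX IDX_SALES_CUST_NO ON SALES (CUST_NO);
ANALYZE TABLE SALES COMPUTE STATISTICS;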

Update actions in Oracle stage
The destination table can be updated using various update actions in the Oracle stage. Be aware that it's crucial to select the key columns properly, as this determines which columns appear in the WHERE part of the generated SQL statement (a hedged SQL sketch of the upsert action follows the list). Update actions available from the drop-down list:

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.
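
As a rough illustration of the 'Update existing rows or insert new rows' action, the generated statements could look like the sketch below. The CUSTOMERS table and its columns are invented for the example, and the ? characters stand for the parameter markers bound to the link columns.

-- The UPDATE, keyed on the key columns selected in the stage, is issued first:
UPDATE CUSTOMERS
SET    CUST_NAME = ?, CUST_COUNTRY = ?
WHERE  CUST_NO = ?;

-- If the UPDATE succeeds (a row is updated), the INSERT is skipped;
-- otherwise the row is added with an INSERT:
INSERT INTO CUSTOMERS (CUST_NO, CUST_NAME, CUST_COUNTRY)
VALUES (?, ?, ?);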

Lesson 4:
Designing jobs - sequential (text) files


Sequential File stages are used to interact with text files, which may involve extracting data from or writing data to a text file. Sequential File stages can read files and create (overwrite) or append data to a text file. The file can reside on any local drive, on a drive mapped as a shared folder, or even on an FTP server (combined with an FTP stage). Each Sequential File stage can have any number of inputs or outputs. However, trying to write to a sequential file simultaneously from two data streams will surely cause an error.

Lesson 5:
Designing jobs - transforming and filtering data


It's a very common situation and a good practice to design datastage jobs in which data flow goes in the following way:
EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a Transformer stage. A Transformer stage doesn't extract data or write data to a target database. It handles extracted data, performs conversions, mappings and validations, passes values and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups) and there can only be one primary input and any number of reference inputs.

ETL process and concepts

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks (a minimal SQL sketch follows the list):

* extracting data from source operational or archive systems which are the primary source of data for the data warehouse
* transforming the data - which may involve cleaning, filtering, validating and applying business rules
* loading the data into a data warehouse or any other database or application that houses data
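
As a deliberately simplified, SQL-only sketch of those three steps (all table and column names are invented for the example), a single INSERT ... SELECT can stand in for extract, transform and load:

-- All names (STG_ORDERS, DW_ORDERS, ...) are assumptions for illustration only.
INSERT INTO DW_ORDERS (ORDER_NO, CUST_NO, ORDER_DATE, NET_VALUE)   -- load
SELECT ORDER_NO,
       CUST_NO,
       ORDER_DATE,
       ROUND(NET_VALUE, 2)                                         -- transform: apply a business rule
FROM   STG_ORDERS                                                  -- extract: staging copy of the source
WHERE  ORDER_NO IS NOT NULL                                        -- cleanse and validate
  AND  NET_VALUE >= 0;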

The ETL process is also very often referred to as a Data Integration process, and an ETL tool as a Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.

ETL Tools

At present the most popular and widely used ETL tools and applications on the market are:
# IBM Websphere DataStage (Formerly known as Ascential DataStage and Ardent DataStage)
# Informatica PowerCenter
# Oracle Warehouse Builder
# Ab Initio
# Pentaho Data Integration - Kettle Project (open source ETL)
# SAS ETL studio
# Cognos Decisionstream
# Business Objects Data Integrator (BODI)
# Microsoft SQL Server Integration Services (SSIS)

ETL process references

Sample business scenarios with real-life ETL process implementations, each of which describes a typical data warehousing problem, are provided by the ETL training section.

Detailed step-by-step instructions on how the ETL process can be implemented in IBM Websphere Datastage can be found in the Implementing ETL in DataStage tutorial lesson.

The ETL process can also be successfully implemented using an open source ETL tool - Kettle. There is a separate Pentaho Data Integration section on our pages: Kettle ETL transformations.

ETL and Data Warehouse training with examples


The ETL and Data Warehousing examples section is structured as a number of scenarios, each of which describes a typical business intelligence and data warehousing problem.
You may consider this module as an ETL and Data Warehouse cookbook with a series of recipes which show how to manage and implement the ETL process in a data warehouse environment.

Our intention is to show and analyze the most common real-life problems you may encounter when designing ETL processes to feed data warehouses in various organizations using a wide range of ETL tools. We discuss and propose an optimal solution for each scenario and show how to organize the data flow process using various ETL applications.

Going through the sample implementations of the business scenarios is also a good way to compare ETL tools and see the different approaches to designing the data integration process. This also gives an idea and helps identify strong and weak points of various ETL tools.
ETL business scenarios

We present a number of sample business scenarios which are very often encountered across organizations and need to be addressed. The ETL data integration process plays a major role in handling those problems.
Sample ETL process implementations:
# Surrogate key generation example which includes information on business keys and surrogate keys and shows how to design an ETL process to manage surrogate keys in a data warehouse environment. Sample design in Pentaho Data Integration
# Header and trailer processing - considerations on processing files arranged in blocks consisting of a header record, body items and a trailer. This type of file usually comes from mainframes; it also applies to EDI and EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration
# Loading customers - the customers data extract is placed on an FTP server. It needs to be copied to an ETL server and loaded into the data warehouse. The transformation applies certain business rules and implements SCD to deal with the dimension changes. Sample implementation in Teradata MultiLoad
# Site traffic Data Warehouse - a guide for creation of a data warehouse and data marts for website traffic analysis and reporting. Sample design in Pentaho Kettle
# Data Quality Tests - ETL process design aimed to test and cleanse data in a Data Warehouse. Sample outline in PDI
# XML ETL - ETL processes to handle XML files
# Data allocation ETL process case study for allocating data. Sample Cognos implementation

Processing a header and trailer textfile


Goal

Process a text file which contains records arranged in blocks consisting of a header record, details (items, body) and a trailer. The aim is to normalize records and load them into a relational database structure.

Scenario overview and details

Typically, header and item processing needs to be implemented for files that originate from mainframe systems, as well as EDI transmission files, SWIFT schemas and EPIC files.

The input file in our scenario is a stream of records representing invoices. Each invoice consists of:
- A header containing invoice number, dates, customer reference and other details
- One or more items representing ordered products, including an item number, quantity, price and value
- A trailer which contains summary information for all the items

The records are distinguished by the first character in each line: H stands for headers, I for items and T for the trailers.
The lines within a given section of the input file are fixed length (so, for example, all headers have the same number of characters, but that length differs from the item lines).


Input data:
# Text file in a Header-trailer format. The file presented in our example is divided into headers, items and trailers.
- A header starts with a letter H and then contains an invoice number, a customer number, a date and invoice currency
- Every item starts with a letter I which is followed by product ID, product age, quantity and net value
- The trailer starts with a T and contains two values which act as a checksum: the total number of invoice lines and a total net value.
Output data:
# INVC_HEADER and INVC_LINE relational tables: one with the invoice headers and the other with the invoice lines (a hedged DDL sketch follows this list)
# rejects_080629.txt - a text file with the loading errors (the rejected records)
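
Since the exact table layouts are not given here, the DDL below is only a hedged sketch of what INVC_HEADER and INVC_LINE might look like; the column names and data types are assumptions derived from the header and item descriptions above.

-- Column names and data types are assumptions based on the descriptions above.
CREATE TABLE INVC_HEADER (
    INVOICE_NO    VARCHAR(20)   NOT NULL PRIMARY KEY,
    CUSTOMER_NO   VARCHAR(20)   NOT NULL,
    INVOICE_DATE  DATE          NOT NULL,
    CURRENCY      CHAR(3)       NOT NULL
);

CREATE TABLE INVC_LINE (
    INVOICE_NO    VARCHAR(20)   NOT NULL REFERENCES INVC_HEADER (INVOICE_NO),
    LINE_NO       INTEGER       NOT NULL,
    PRODUCT_ID    VARCHAR(20)   NOT NULL,
    QUANTITY      DECIMAL(12,3),
    NET_VALUE     DECIMAL(14,2),
    PRIMARY KEY (INVOICE_NO, LINE_NO)
);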

Generate surrogate key

Goal

Fill in a data warehouse dimension table with data which comes from different source systems and assign a unique record identifier (surrogate key) to each record.
Scenario overview and details

To illustrate this example, we will use two made-up sources of information to provide data for the customers dimension. Each extract contains customer records with a business key (natural key) assigned to it.

In order to isolate the data warehouse from source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key.
A unique and common surrogate key is a one-field numeric key which is shorter, easier to maintain and understand, and more independent from changes in the source systems than a business key. Also, if the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require a major effort.

Surrogate key generation mechanism may vary depending on the requirements, however the inputs and outputs usually fit into the design shown below:
Inputs:
- an input represented by an extract from the source system
- datawarehouse table reference for identifying the existing records
- maximum key lookup

Outputs:
- output table or file with newly assigned surrogate keys
- new maximum key
- updated reference table with new records
Proposed solution

Assumptions:
- The surrogate key field for our made up example is WH_CUST_NO.
- To make the example clearer, we will use SCD 1 to handle changing dimensions. This means that new records overwrite the existing data.
The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains mapping between natural keys and surrogate keys
- MAX_KEY - a sequence number which represents last key assignment

Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY sequence increased


The design of an ETL process for generating surrogate keys will be as follows (a hedged SQL sketch of the key-assignment logic follows the list):
# The loading process will be executed twice - once for each of the input files
# Check if the lookup reference data is correct and available:
- CUST_REF table
- max_key sequence
# Read the extract and first check if the record already exists. If it does, assign the existing surrogate key to it and update the descriptive data in the main dimension table.
# If it is a new record, then:
- populate a new surrogate key and assign it to the record. The new key will be populated by incrementing the old maximum key by 1.
- insert a new record into the dimension table (D_CUSTOMER)
- insert a new record into the mapping table (which stores business and surrogate keys mapping)
- update the new maximum key
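
The SQL below is only a hedged, set-based sketch of the key-assignment steps above. It reuses the CUST_REF, D_CUSTOMER and MAX_KEY names from this example, while the staging table STG_CUSTOMERS, its columns, and the assumption that MAX_KEY is a one-row table with a LAST_KEY column are invented for illustration.

-- 1. Assign new surrogate keys: old maximum key + 1, +2, ... for business keys
--    not yet in the CUST_REF mapping table (the extract is assumed deduplicated).
INSERT INTO CUST_REF (CUST_BUSINESS_KEY, WH_CUST_NO)
SELECT s.CUST_BUSINESS_KEY,
       (SELECT LAST_KEY FROM MAX_KEY) + ROW_NUMBER() OVER (ORDER BY s.CUST_BUSINESS_KEY)
FROM   STG_CUSTOMERS s
WHERE  NOT EXISTS (SELECT 1 FROM CUST_REF r
                   WHERE r.CUST_BUSINESS_KEY = s.CUST_BUSINESS_KEY);

-- 2. Insert the new records into the dimension with their freshly assigned keys
--    (existing records would simply be overwritten, as per SCD type 1).
INSERT INTO D_CUSTOMER (WH_CUST_NO, CUST_BUSINESS_KEY, CUST_NAME)
SELECT r.WH_CUST_NO, s.CUST_BUSINESS_KEY, s.CUST_NAME
FROM   STG_CUSTOMERS s
JOIN   CUST_REF r ON r.CUST_BUSINESS_KEY = s.CUST_BUSINESS_KEY
WHERE  NOT EXISTS (SELECT 1 FROM D_CUSTOMER d WHERE d.WH_CUST_NO = r.WH_CUST_NO);

-- 3. Remember the new maximum key for the next run.
UPDATE MAX_KEY
SET    LAST_KEY = (SELECT MAX(WH_CUST_NO) FROM CUST_REF);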

What is meant by Business Intelligence

What is Business Intelligence?
Business intelligence is a broad set of applications, technologies and knowledge for gathering and analyzing data for the purpose of helping users make better business decisions.
The main challenge of Business Intelligence is to gather and serve organized information regarding all relevant factors that drive the business and enable end-users to access that knowledge easily and efficiently and in effect maximize the success of an organization.

Business intelligence produces analyses and provides in-depth knowledge about performance indicators such as the company's customers, competitors, business counterparts, economic environment and internal operations to help make effective, good-quality business decisions.

From a technical standpoint, the most important areas that Business Intelligence covers are:

DW - Data warehouses - architecture, design, managing, processing and loading (including the ETL process)
Data cleansing and data quality management
Data mining, statistical analysis, forecasting
OLAP - Online Analytical Processing and multidimensional analysis
MIS - Management Information Systems
Reporting - Information visualization and Dashboards
CRM - Customer Relationship Management
DSS - Decision Support Systems

Etl Tools Info portal
The ETL-Tools.Info portal provides information about different business intelligence tools and data warehousing solutions, with a main focus on the ETL process and tools. On our pages you will find general, high-level overviews of various Business Intelligence applications and architectures, as well as technical documents with detailed, low-level descriptions of the solutions.
Particular attention is paid to the Datastage ETL tool, and on the portal you will find a number of Datastage examples, Datastage tutorials, best practices and resolved problems with many real-life examples.
There's also a wide range of information on the rapidly growing Open Source Business Intelligence (OSBI) market, with emphasis on applications from the Pentaho BI family and a Pentaho tutorial.
Recently we also added a new section - the SAS Guide with a tutorial, which illustrates the SAS vision of Business Intelligence.

Tutorials for Datastage

1. Create custom operators for WebSphere DataStage

This is my favourite tutorial because it covers a misunderstood and underutilised part of the product. It’s written by Blayne Chard, a software engineer from IBM. Unlike the other tutorials on this page, this one requires an IBM ID to access. If you are like me, it usually takes a few goes before you can remember your login!

In this tutorial you learn:

1. How to write a simple DataStage operator
2. How to set up the development environment to compile and run a DataStage operator
3. The basics of the Orchestrate Shell (OSH) scripting language for DataStage jobs
4. How to load your operator into the DataStage Designer so you can use it on any job you create

The DataStage Transformer lets you call a custom routine instead of the standard list of functions on the right mouse click menu. This lets you take code that looks complex in the Transformer and put it into a module to simplify testing and re-use.

Custom routines used to be huge in DataStage, I’ve been to sites with hundreds of them, but they were for DataStage Server Edition. This edition supported routines written in Universe BASIC via a routine editor in the DataStage Manager tool. They were easy to write and easy to test and easy to re-use between jobs.

DataStage Enterprise Edition made things a bit harder. Custom routines are C++ components built and compiled externally to DataStage. There is a lot more fiddling to get the routines recognised by a DataStage transformer, and C++ tends to be a more challenging language than BASIC.

This is a comprehensive tutorial taking you through the steps of creating and using a routine. Fantastic.

Compile your operator

Compiling your operator is a straightforward process once the environment has been set up. To compile the operator, make sure that you have run setup.bat in the current command window, then type the compile commands given in the tutorial.




2. Modify Stage Video Tutorial

The Modify Stage is a tricky parallel stage that converts fields from one metadata definition to another. It does not have any of the fancy GUI help of the Transformer such as right mouse click menus or syntax checking. You need to look up the functions in the Parallel Job Developers Guide, guess what the syntax is and use trial and error half a dozen times until you get it working!

This tutorial saves you a lot of the trial and error.

The tutorial does require premium membership to the dsxchange at US$99 a year, but if it saves you a couple of days of work that should be enough for your employer to splash out for membership.

3. WebSphere DataStage Parallel Job Tutorial Version 8

The mother of all DataStage tutorials - this is the one that comes on the Information Server installation CDs (or is that DVDs? Or Blu-rays?) It’s available from the IBM publications centre, and you can download it free of charge as a 1.05 MB PDF file.

The IBM WebSphere DataStage Parallel Job Tutorial demonstrates how to design and run IBM WebSphere DataStage® parallel jobs.

The exercises, sample data, and sample import files for the parallel job tutorial are in the TutorialData\DataStage\parallel_tutorial folder on the suite CD, DVD, or downloaded installation image.

It’s a simple tutorial covering opening and running a job, designing a job, using the Transformer, loading a target table and processing in parallel. Not a lot of interest to experienced DataStage users but useful to a newcomer.

4. Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition

I also reviewed this tutorial previously in DataStage Tip: Extracting database data 250% faster.

As with XML the DB2 connectivity can be a tricky beast to get running. Remote DB2 connections are even trickier. This tutorial will save you a lot of trouble and swearing as you get the wee DB2 beastie running.

5. Transform and integrate data using WebSphere DataStage XML and Web services packs

Getting started with XML processing in DataStage usually involves a lot of fumbling around before you even get a job running. By the end of it you are usually happy to never see another XML file as long as you live. The trick is that DataStage needs to read and validate the entire XML document before it starts processing any part of it as it needs to know how to flatten it into relational data. This is different to database and sequential file sources where DataStage reads and validates one row at a time.

As you can imagine this initial read and validation of the entire file puts a limit on what DataStage can handle in terms of volume. It can process a very large number of small XML files but not one very large XML file. The platform will eventually hit a RAM or resource limit.

This tutorial is invaluable for getting started with XML files in DataStage covering importing XML metadata and using it as a source or target.

DataStage XML Job Example

6. A flexible data integration architecture using WebSphere DataStage and WebSphere Federation Server

IBM wants us to combine the different products on the Information Server for a best of breed data integration platform. One of the themes of the new analytical warehouse concepts such as the IBM Dynamic Warehouse is the use of unstructured or semi structured text. DataStage and the Federation Server are a good combination for bringing this into a warehouse.

IBM are calling this a T-ETL architecture where the Federation server is a data pre-processor.

The T-ETL architecture proposed uses federation to join, aggregate, and filter data before it enters WebSphere DataStage, with WebSphere DataStage using its parallel engine to perform more complex transformations and the maintenance of the target.

Federation DataStage Job Example

This is a generous tutorial with a lot of use cases for T-ETL, ETL and ELT.


The strength of WebSphere DataStage as a data consolidation tool lies in its flexibility -- it is able to support the many different consolidation scenarios and flavors of ETL, including ETL, ELT, and TEL (against a single data source). Combining WebSphere DataStage with WebSphere Federation Server opens up a whole new area of consolidation scenarios, namely T-ETL, against homogenous or heterogeneous data sources.

7. Access application data using WebSphere Federation Server and WebSphere DataStage

I reviewed this tutorial in a previous blog post: DataStage Tip: Free IBM DataStage SAP Tutorial

As with the last tutorial it uses a combination of DataStage and the Federation Server to access atypical enterprise data, in this case SAP.

The Top 7 Online DataStage Tutorials

These are the top 7 online DataStage Tutorials for learning DataStage, becoming an expert or studying for DataStage certification.

IBM has been putting new content into the crevices and caves of the ibm.com website for some months now and some trawling has turned up half a dozen handy tutorials.

Here are the top 7 DataStage tutorials; if you keep reading you will see a review of each tutorial and why you would need it:

1. Create custom operators for WebSphere DataStage
2. Modify Stage Video Tutorial (requires dsxchange premium membership)
3. WebSphere DataStage Parallel Job Tutorial Version 8
4. Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition
5. Transform and integrate data using WebSphere DataStage XML and Web services packs
6. A flexible data integration architecture using WebSphere DataStage and WebSphere Federation Server
7. Access application data using WebSphere Federation Server and WebSphere DataStage

Study Guide for DataStage Certification

This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification exam.

Regular readers may be feeling a sense of déjà vu. Haven’t we seen this post before? I originally posted this in 2006 and this is the Director's Cut - I’ve added some deleted scenes, a commentary for DataStage 7.0 and 8.0 users and generally improved the entry. By reposting I retain the links from other sites, such as my DataStage Certification Squidoo lens with links to my certification blog entries and IBM certification pages.

This post shows all of the headings from the IBM exam Objectives and describes how to prepare for that section.

Before you start, work out how to add environment variables to a job as job parameters, as they are handy for exercises and testing. See the DataStage Designer Guide for details.

Section 1 - Installation and Configuration

Versions: versions 7.5.1 and 7.5.2 are the best to study and run exercises on. Version 6.x is risky but mostly the same as 7. Version 8.0 is no good for any type of installation and configuration preparation as it has a new approach to installation and user security.

Reading: Read the Installation and Upgrade Guide for DataStage, especially the section on parallel installation. Read the pre-requisites for each type of install such as users and groups, the compiler, project locations, kernel settings for each platform. Make sure you know what goes into the dsenv file. Read the section on DataStage for USS as you might get one USS question. Do a search for threads on dsenv on the dsxchange forum to become familiar with how this file is used in different production environments.

Exercise: installing your own DataStage Server Enterprise Edition is the best exercise - getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the DataStage Administrator and create some users and roles and give them access to DataStage functions.

Section 4 - Parallel Architecture (10%)

Section 9 - Monitoring and Troubleshooting (10%)

I’ve moved sections 4 and 9 up to the front, as you need to study them before you run exercises and read about parallel stages in the other sections. Understanding how to use and monitor parallel jobs is worth a whopping 20%, so it’s a good area to know well.

Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best definition of the parallel architecture with better diagrams.

Reading: Parallel Job Developers Guide opening chapters on what the parallel engine and job partitioning is all about. Read about each partitioning type. Read how sequential file stages partition or repartition data and why datasets don’t. The Parallel Job Advanced Developers Guide has sections on environment variables to help with job monitoring, read about every parameter with the word SCORE in it. The DataStage Director Guide describes how to run job monitoring - use the right mouse click menu on the job monitor window to see extra parallel information.

Exercises: Turn on various monitoring environment variables such as APT_PM_SHOW_PIDS and APT_DUMP_SCORE so you can see what happens during your exercises. It shows you what really runs in a job - the extra processes that get added across parallel nodes.

Try creating one-node, two-node and four-node config files and see how jobs behave under each one. Try the remaining exercises on a couple of different configurations by adding the configuration environment variable to the job. Try some pooling options. I've got to admit I guessed my way through some of the pooling questions as I didn’t do many exercises.

Generate a set of rows into a sequential file for testing out various partitioning types. One column with unique ids 1 to 100 and a second column with repeating codes such as A, A, A, A, A, B, B, B, B, B etc. Write a job that reads from the input, sends it through a partitioning stage such as a transformer and writes it to a peek stage. The Director logs shows which rows went where. You should also view the Director monitor and expand and show the row counts on each instance of each stage in the job to see how stages are split and run on each node and how many rows each instance gets.

Use a filter stage to split the rows down two paths and bring them back together with a funnel stage, then replace the funnel with a collector stage. Compare the two.

Test yourself on estimating how many processes will be created by a job and check the result after the job has run using the Director monitor or log messages. Do this throughout all your exercises across all sections as a habit.

Section 2 - Metadata

Section 3 - Persistent Storage

I’ve merged these into one. Both sections talk about sequential files, datasets, XML files and Cobol files.

Versions: you can study this on DataStage 6, 7 or 8 as it is a narrow focus on DataStage Designer metadata. DataStage 8 will have slightly different CFF options but not enough to cause a problem.

Reading: read the section in the DataStage Developers Guide on Orchestrate schemas and partial schemas. Read the plugin guide for the Complex Flat File Stage to understand how Cobol metadata is imported (if you don’t have any cobol copybooks around you will just have to read about them and not do any exercises). Quickly scan through the NLS guide - but don’t expect any hard questions on this.

Exercises: Step through the IBM XML tutorial to get the tricky part on reading XML files. Find an XML file and do various exercises reading it and writing it to a sequential file. Switch between different key fields to see the impact of the key on the flattening of the XML hierarchy. Don’t worry too much about XML Transform.

Import from a database using the Orchestrate Import and the Plugin Import and compare the table definitions. Run an exercise on column propagation using the Peek stage where a partial schema is written to the Peek stage to reveal the propagated columns.

Create a job using the Row Generator stage. Define some columns; on the Columns tab, double-click a column to bring up the advanced column properties. Use some properties to generate values for different data types. Get to know the advanced properties page.

Create a really large sequential file, dataset and fileset and use each as a reference in a lookup stage. Monitor the Resource and Scratch directories as the job runs to see how these lookup sources are prepared prior to and during a job run. Get to know the difference between the lookup fileset and other sources for a lookup stage.

Section 5 - Databases (15%)

One of the more difficult topics if you get questions for a database you are not familiar with. I got one database parallel connectivity question that I still can’t find the answer to in any of the manuals.

Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not have enough database stages and DataStage 8 has a new approach to database connections.

Reading: read the plugin guide for each enterprise database stage: Oracle, SQL Server, DB2 and ODBC. In version 8 read the improved Connectivity Guides for these targets. If you have time you can dig deeper, the Parallel Job Developers Guide and/or the Advanced Developers Guide has a section on the Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface libraries. Look for the section called "Operator action" and read it for each stage. It’s got interesting bits like whether the stage can run in parallel, how it converts data and handles record sizes.

Exercise: Add each Enterprise database stage to a parallel job as both an input and output stage. Go in and fiddle around with all the different types of read and write options. You don’t need to get a connection working or have access to that database, you just need to have the stage installed and add it to your job. Look at the differences between insert/update/add/load etc. Look at the different options for each database stage. If you have time and a database try some loads to a database table.

Section 6 - Data Transformation (15%)

If you’ve used DataStage for longer than a year this is probably the topic you are going to ace - as long as you have made some use of stage variables.

Versions: should be okay studying on versions 6, 7 or 8. Transformation stages such as Transformer, Filter and Modify have not changed much.

Reading: there is more value in using the transformation stages than reading about it. It’s hard to read about it and take it in as the Transformer stage is easier to navigate and understand if you are using it. If you have to make do with reading then visit the dsxchange and look for threads on stage variables, the FAQ on the parallel number generator, removing duplicates using a transformer and questions in the parallel forum on null handling. This will be better than reading the manuals as they will be full of practical examples. Read the Parallel Job Advanced Developers Guide section on "Specifying your own parallel stages".

Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and Filter Stage.

Create some mixed up source data with duplicate rows and a multiple field key. Try to remove duplicates using a Transformer with a sort stage and combination of stage variables to hold the prior row key value to compare to the new row key value.

Process some data that has nulls in it. Use the null column in a Transformer concatenate function with and without a nulltovalue function and with and without a reject link from the Transformer. This gives you an understanding of how rows get dropped and/or trapped from a transformer. Explore the right mouse click menu in the Transformer, output some of the DS Macro values and System Variables to a peek stage and think of uses for them in various data warehouse scenarios. Ignore DS Routine, it’s not on the test.

Don’t spend much time on the Modify stage - it would take forever memorizing functions. Just do an exercise on handle_null, string_trim and convert string to number. Can be tricky getting it working and you might not even get a question about it.

Section 7 - Combining and Sorting Data (10%)

Section 10 - Job Design (10%)

I’ve combined these since they overlap. Don’t underestimate this section, it covers a very narrow range of functionality so it is an easy set of questions to prepare for and get right. There are easy points on offer in this section.

Versions: Any version 7.x is best, version 6 has a completely different lookup stage, version 8 can be used but remember that the Range lookup functionality is new.

Reading: The Parallel Job Developers Guide has a table showing the differences between the lookup, merge and join stages. Try to memorize the parts of this table about inputs and outputs and reject links. This is a good place to learn about some more environment variables. Read the Parallel Job Advanced Developers Guide looking for any environment variables with the word SORT or SCORE in them.

Exercises: Compare the way join, lookup and merge work. Create a job that switches between each type.

Add various combinations of the SORT and COMBINE_OPERATORS environment variables to your job and examine the score of the job to see what sorts get added to your job at run time. A simple job with a Remove Duplicates or Join stage and no sorting will add a sort into the job - even if the data comes from a sorted database source. Use the SORT variables to turn off this sort insertion. See what happens to Transformer, Lookup and Copy stages when COMBINE_OPERATORS is turned on or off using the SCORE log entry.

Create an annotation and a description annotation and explore the differences between the two. Use the Multiple Compile tool in the DataStage Manager (version 6, 7) or Designer (version 8). Create and use shared containers.

Section 8 - Automation and Production Deployment (10%)

Versions: most deployment methods have remained the same from version 6, 7 and 8. Version 8 has the same import and export functions. DataStage 8 parameter sets will not be in the version 7 exam.

Reading: you don’t need to install or use the Version Control tool to pass this section; however, you should read the PDF that comes with it to understand the IBM recommendations for deployment. It covers the move from dev to test to prod. Read the Server Job Developers Guide section on command-line calls to DataStage such as dsjob, dssearch and dsadmin.

Exercises: practice a few dsjob and dssearch commands. Practice saving a log to a text file. Create some job specific environment variables.
