32. Where are flat files actually stored? What is the path?
· Flat files are normally stored on FTP servers or in local folders; the common flat-file formats are .CSV, .XLS and .TXT.
· Flat files store the data, and the path is given on the General tab of the Sequential File stage.
33.How does DataStage handle the user security?
· We have to create users in the Administrator and give the necessary privileges to those users.
34.How I can convert Server Jobs into Parallel Jobs?
· You cannot convert a server job into a parallel job; you have to rebuild the whole job.
· I have never tried doing this; however, there is a workaround that can save a lot of time: convert the server job into a server shared container. The server shared container can then be used in parallel jobs as a shared container.
35.If I add a new environment variable in Windows, how can I access it in DataStage?
· You can reference it in the Designer, under the job properties; there you can add a new environment variable or use an existing one.
· You can view all the environment variables in the Designer under Job Properties, and you can add and access environment variables from Job Properties.
36. What is a data set, and what is a file set?
· I assume you are referring to the Lookup File Set, which is used only by Lookup stages. Data set: DataStage parallel extender jobs use data sets to manage data within a job; you can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs. File set: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are together called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file, and you may need to distribute files among nodes to prevent overruns.
· File set: it allows you to read data from, or write data to, a file set. The stage can have a single input link, a single output link, and a single reject link, and it executes only in parallel mode. The data files and the file that lists them are called a file set; this is useful because some operating systems impose a 2 GB limit on file size and you need to distribute files among nodes to prevent overruns.
Data sets are used to bring data into parallel jobs much as ODBC is used in server jobs.
37. How does the hashed file do lookups in server jobs? How does it compare the key values?
· A hashed file is used for two purposes: 1. to remove duplicate records, and 2. as a reference for lookups. Each record in a hashed file has three parts: a hashed key, a key header, and a data portion. Because the hashing algorithm locates records directly from the key value, the lookup is fast.
38. There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of using each type?
· These are the three different stage types: i) Custom, ii) Build, iii) Wrapped.
39.If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
· Because the data is partitioned on key 1, rows with the same value of key 2 can end up in different partitions, so the aggregation can produce split (incorrect) results unless the data is repartitioned on key 2 before the Aggregator; the extra repartitioning also adds to the execution time.
40.How can I specify a filter command for processing data while defining sequential file output data?
· There are before-job and after-job subroutines with which we can execute Unix commands; here we can use the sort command or a filter command, as in the sketch below.
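As a hedged illustration of the kind of Unix filter that could be plugged in (whether in the Sequential File stage's filter option or via a before/after-job subroutine that calls the shell), the file path and sort key below are made up for the example:
# Hypothetical filter: drop comment lines and sort the extract on its first
# comma-separated column before the job reads it.
grep -v '^#' /tmp/customer_extract.csv | sort -t',' -k1,1 > /tmp/customer_extract.sorted.csv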
41. What is the meaning of the following: 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis. 3) Use the power of the DBMS.
· The question is not entirely clear, but I will try to answer:
If you have an SMP machine you can use the IPC, Link Collector and Link Partitioner stages for performance tuning.
If you have a cluster or MPP machine you can use parallel jobs.
· The third point is about tuning job performance: 'use the power of the DBMS' means you can improve job performance by using database features such as analyzing tables, creating indexes and creating partitions, which improve the performance of the SQL used in the jobs.
42. How can you implement complex jobs in DataStage?
· What do you mean by complex jobs? If you use more than 15 stages in a job, or 10 lookup tables, you can call it a complex job.
· A complex design means having more joins and more lookups; such a job design is called a complex job. We can implement any complex design in DataStage by following simple guidelines that also improve performance. There is no hard limit on the number of stages in a job, but for better performance use at most about 20 stages per job; if a design exceeds 20 stages, split it into another job. Use no more than about 7 lookups per Transformer; beyond that, add another Transformer.
· A job with involved logic can also be called a complex job; put simply, these are scenario-based designs.
43.How can I extract data from DB2 (on IBM iSeries) to the data warehouse via Datastage as the ETL tool. I mean do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much if anybody could provide an answer.
· You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers we usually install; use the CD provided with the DB2 installation, which has the ODBC drivers for DB2) and then try it out.
· If your source system is a mainframe, you can use the LOAD and UNLOAD utilities: the records are loaded/unloaded on the mainframe system, and from there you have to export them to your own system (Windows).
44. How can we pass parameters to a job by using a file?
· You can do this by passing the parameters from a Unix file and then calling the execution of the DataStage job; the job has the parameters defined, and their values are passed in from Unix.
· You can create a UNIX shell script that passes the parameters to the job, and the script can also create logs for the whole run of the job, as in the sketch below.
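A minimal shell sketch of this idea follows. The parameter file path, parameter names, project and job names are all hypothetical, and the script assumes dsjob is on the PATH and that parameter values contain no spaces; check the dsjob syntax for your release.
#!/bin/sh
# Read name=value pairs (one per line) from a parameter file and build
# a -param argument for each of them.
PARAM_FILE=/home/dsadm/params/daily_load.txt   # hypothetical path
PARAM_ARGS=""
while read line; do
    [ -n "$line" ] && PARAM_ARGS="$PARAM_ARGS -param $line"
done < "$PARAM_FILE"

# Run the job, wait for its status, and log the whole run.
dsjob -run -jobstatus $PARAM_ARGS MyProject MyJob > /tmp/MyJob_run.log 2>&1
echo "dsjob exit code: $?" >> /tmp/MyJob_run.log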
45. What is the DS Administrator used for? Did you use it?
The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.
46. What about system variables?
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE The internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@FALSE The compiler replaces the value with 0.
@FM A field mark, Char(254).
@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.
@LOGNAME The user login name.
@MONTH The current month extracted from the value in @DATE.
@NULL The null value.
@NULL.STR The internal representation of the null value, Char(128).
@PATH The pathname of the current DataStage project.
@SCHEMA The schema name of the current DataStage project.
@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).
@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.
@TIME The internal time when the program started. See the Time function.
@TM A text mark (a delimiter used in UniVerse files), Char(251).
@TRUE The compiler replaces the value with 1.
@USERNO The user number.
@VM A value mark (a delimiter used in UniVerse files), Char(253).
@WHO The name of the current DataStage project directory.
@YEAR The current year extracted from @DATE.
REJECTED Can be used in the constraint expression of an output link of a Transformer stage. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.
47.what are the Job parameters?
· These parameters are used to provide administrative control and to change run-time values of the job.
Edit > Job Properties (Parameters tab): there we can define the name, prompt, type and value.
Saturday, October 18, 2008
DS FAQs
25. Give one real-time situation where the Link Partitioner stage is used.
· If we want to send more data from the source to the targets quickly, we use the Link Partitioner stage in server jobs; it can create a maximum of 64 partitions, and it is an active stage. Normally we cannot connect two active stages, but this stage is an accepted exception and can connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions. The Link Collector is also an active stage, so to avoid connecting an active Transformer directly to the Link Collector we use the Inter-Process Communication (IPC) stage. Because the IPC stage is passive, the data can then be collected by the Link Collector. But we can use inter-process communication only when the target is a passive stage.
26. How do I create a DataStage engine stop/start script? My idea is roughly: run as dsadm (su to root if needed, with an encrypted password), set DSHOMEBIN to the engine bin directory under /Ascential/DataStage, check with ps -ef | grep DataStage whether any client connections exist and kill those PIDs, run uv -admin -stop and uv -admin -start (redirecting output to /dev/null), verify the processes and connections, and echo "Started properly".
· Go to the engine bin directory (e.g. /DATASTAGE/PROJECTS/DSEngine/bin) and run: uv -admin -stop, then uv -admin -start.
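A hedged sketch of such a script is shown below. The install path and the dsapi_slave process name are assumptions that vary by release; the uv -admin -stop/-start commands are the ones given above.
#!/bin/sh
# Minimal DataStage engine stop/start sketch -- run as dsadm.
DSHOME=/Ascential/DataStage/DSEngine        # assumed install path
cd $DSHOME
. ./dsenv                                   # load the engine environment

# List any client connections still attached (process name may vary by release)
ps -ef | grep dsapi_slave | grep -v grep

bin/uv -admin -stop
sleep 30
bin/uv -admin -start
echo "Started properly"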
27.What does separation option in static hash-file mean?
· The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on characters and their position in the record IDs.
When a hashed file is created, Separation and Modulo respectively specify the group buffer size and the number of groups (buffers) allocated for the file. When a static hashed file is created, DataStage creates a file that contains the number of groups specified by the modulo.
Size of hashed file = modulo (number of groups) x separation (group buffer size)
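As a rough worked example (assuming, as in UniVerse static hashed files, that the separation is counted in 512-byte blocks): a file created with modulo 997 and separation 4 has 997 groups of 4 x 512 = 2048 bytes each, i.e. roughly 997 x 2048 bytes, or about 2 MB, before any overflow.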
28. What is the difference between a sequential file and a data set? When do you use the Copy stage?
· A sequential file stores a small amount of data and can have any extension, such as .txt, whereas a data set stores a huge amount of data and opens only with the .ds extension.
· The Sequential File stage stores a small amount of data, with any extension, in order to access the file, whereas the Data Set stage is used to store a huge amount of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets; each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the order of columns.
29. How to find errors in a job sequence?
· Using the DataStage Director we can find the errors in a job sequence.
30. What is job control? How is it used? Explain with steps.
· Job control (Job Control Language, JCL) is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar, select Job Properties, and enter the parameters as:
parameter / prompt / type
STEP_ID / STEP_ID / string
Source / SRC / string
DSN / DSN / string
Username / unm / string
Password / pwd / string
After editing the above, go to the Job Control tab, select the jobs from the list box, and run the job.
31. How to implement a Type 2 slowly changing dimension in DataStage? Give an example.
· Slowly changing dimensions are a common problem in data warehousing. For example: a customer called Lisa in company ABC lives in New York; later she moves to Florida, and the company must now modify her address. In general there are three ways to solve this problem.
Type 1: the new record replaces the original record, with no trace of the old record at all. Type 2: a new record is added to the customer dimension table, so the customer is treated essentially as two different people. Type 3: the original record is modified to reflect the change.
In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where she last stayed is lost, but it is simple to use.
In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, and storage and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track historical changes.
In Type 3 there are two columns, one to hold the original value and the other the current value; for example, a new column is added which shows the original address as New York and the current address as Florida. This keeps some part of the history and the table size is not increased, but when the customer moves from Florida to Texas the New York information is lost. So Type 3 should only be used if changes will occur only a finite number of times.
FAQs
13.What is Runtime Column Propagation and how to use it?
· If your job encounters columns that are not defined in the metadata and runtime column propagation is enabled, those extra columns are propagated through the rest of the job.
14. Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on UNIX/Linux?
· Your source systems can be anything (Oracle, SQL Server, DB2, flat files, etc.), but your target system for the complete data warehouse should be a single database (Oracle, SQL Server, DB2, ...).
· In the Server edition you can have both on Windows, but in PX the target should be on UNIX.
15. How to find the process ID? Explain with steps.
· In UNIX you can use the ps -ef command; it displays all the processes currently running on the system along with their process IDs.
· From the DataStage Director, follow the path:
Job > Cleanup Resources.
There you can also see the PIDs; it displays all the currently running processes.
· Depending on your environment, you may have lots of process IDs. From one of the DataStage docs: you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command doesn't make sense, you'll need some background on how processes work in UNIX (or in the MKS environment when running on Windows). Also from the DataStage docs (I haven't tried this one yet, but it looks interesting): APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process ID.
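As a minimal command-line sketch of the above (the account name dsuser is just a placeholder):
# Processes owned by the DataStage account
ps -ef | grep dsuser | grep -v grep

# Or search for DataStage processes by name
ps -ef | grep -i datastage | grep -v grep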
16. If we are using two sources that have the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?
· Use a Change Capture stage and send its output into a Transformer.
Write a routine that aborts the job and call it from a derivation triggered when
@INROWNUM = 1.
So if the data does not match, a row is passed to the Transformer and the job is aborted.
17. Can you tell me for what purpose .dsx files are used in DataStage?
· .dsx is the standard file extension of exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop the job in our test environment and, after testing, export the file and save it as x.dsx. This can be done using DataStage Manager.
18. How can we create read-only jobs in DataStage?
· In the export dialog there is an option: click the Options tab, and under the include options you will find a read-only setting; just enable that.
19. What are environment variables? What are they used for?
· Basically, an environment variable is a predefined variable that we can use while creating a DS job. It can be set at either project level or job level; once set, the variable is available in the project/job.
We can also define new environment variables; for that we go to the DS Administrator.
For further details refer to the DS Administrator guide.
20. How do you call procedures in DataStage?
· Use the Stored Procedure stage.
21. How do you run a job from the command prompt in UNIX?
· Use the dsjob command with the appropriate options, for example:
dsjob -run -jobstatus projectname jobname
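A small sketch of running a job from the UNIX prompt and checking the result follows; the install path, project and job names are placeholders, and the exact exit-code conventions should be checked in your release's dsjob documentation.
DSHOME=/Ascential/DataStage/DSEngine   # assumed install path
. $DSHOME/dsenv                        # load the engine environment

# -jobstatus makes dsjob wait for the job to finish and reflect its status
$DSHOME/bin/dsjob -run -jobstatus MyProject MyJob
echo "dsjob exit code: $?"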
22. How do you know the number of records in a sequential file before running a server job?
· If your environment is UNIX, you can check with the wc -l filename command.
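For example (the file name is a placeholder; subtract one only if the file really has a single header row):
# Total lines in the file
wc -l < /data/in/customers.txt

# Data records only, assuming one header line
echo $(( $(wc -l < /data/in/customers.txt) - 1 ))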
23. What is the DataStage engine? What is its purpose?
· The DataStage server contains the DataStage engine; the DS server interacts with the client components and the Repository. The DS engine is used to develop and run jobs; only when the engine is running can we develop jobs.
24. How to implement slowly changing dimensions in DataStage?
· In Informatica PowerCenter there is a way to implement slowly changing dimensions through a wizard. DataStage does not have that type of wizard, so SCDs have to be implemented with manual logic.
Datastage Faqs
6. How to read data from XL (Excel) files? Explain with steps.
· Reading data from an Excel file:
* Save the file as .csv (comma-separated values).
* Use a flat file stage in the DataStage job canvas.
* Double-click the flat file stage and assign the .csv file you saved as its input file.
* Import the metadata for the file (once you have imported or typed the metadata, click View Data to check the data values).
Then do the rest of the transformations as needed.
-Debasis
· Create a new DSN for the Excel driver and choose the workbook from which you want the data.
Select the ODBC stage and access the Excel file through it, i.e. import the Excel sheet using the new DSN created for Excel.
7.how can we generate a surrogate key in server/parallel jobs?
· In parallel jobs we can use the Surrogate Key Generator stage.
· In server jobs we can use a built-in routine called KeyMgtGetNextValue.
· You can also generate the surrogate key in the database using the sequence generator.
8.what is an environment variable?
· Basically, an environment variable is a predefined variable that we can use while creating a DS job. It can be set at either project level or job level; once set, the variable is available in the project/job.
We can also define new environment variables; for that we go to the DS Administrator.
For further details refer to the DS Administrator guide.
· These are variables used at the project or job level. We can use them to configure the job, i.e. associate the configuration file (without this you cannot run your job) or increase the sequential or data set read/write buffers.
ex: $APT_CONFIG_FILE
There are many such environment variables. Go to Job Properties, click the Parameters tab, then click "Add Environment Variable" to see most of them.
9. How can we create environment variables in DataStage?
· We can create environment variables by using DataStage Administrator.
· This mostly comes under the Administrator's role. From the Designer you can add one directly via Designer > View > Job Properties > Parameters > Add Environment Variable > User Defined > Add.
10. I have a few questions: 1. What are the various processes that start when the DataStage engine starts? 2. What changes need to be made on the database side if I have to use the DB2 stage? 3. Is the DataStage engine responsible for compilation, execution, or both?
· Three services start when the DataStage engine starts:
1. DSRPC
2. DataStage Engine Resources
3. DataStage Telnet Service
11. How do you write and execute routines for PX jobs in C++?
· You define and store the routines in the DataStage Repository (e.g. in the Routines folder), and the routines are compiled with a C++ compiler.
· You have to write the routine in C++ (g++ on Unix), then create an object file and provide the object file's path in your routine definition, as sketched below.
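A minimal compile sketch follows; the source file name is hypothetical, and the compiler and flags must match whatever compiler your DataStage installation is configured to use (on many Unix installs that is g++, as the answer says).
# Compile the routine source to an object file (no linking)
g++ -c -O2 -fPIC my_px_routine.cpp -o my_px_routine.o

# The path to the resulting my_px_routine.o is what you enter in the
# parallel routine definition in the DataStage Designer/Manager.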
12. How to eliminate duplicate rows in DataStage?
· You can remove duplicate rows in more than one way:
1. DS has a stage called Remove Duplicates where you can specify the key.
2. Alternatively, you can specify the key on another stage; the stage itself then removes the duplicate rows based on the key at processing time.
· By using the Hashed File stage in DS Server we can eliminate duplicates.
· Using a Sort stage, set the property Allow Duplicates to false.
OR
On any stage's Input tab choose hash partitioning, specify the key, and check the Unique checkbox. If you are working with server jobs, you can use a hashed file to eliminate duplicate rows.
Datastage faqs
1. How to read data from XL (Excel) files? My problem is that my data contains commas, which clash with the delimiter we are using. How do we read the data? Explain with steps.
A) 1. Create a DSN for your XL file using the Microsoft Excel driver.
2. Use an ODBC stage as the source.
3. Configure the ODBC stage with the DSN details.
4. While importing metadata for the XL sheet, make sure you select the system tables check box.
Note: in the XL sheet the first line should contain the column names.
2. What is a job sequence used for? What are batches? What is the difference between a job sequence and a batch?
· A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.
3. Disadvantages of a staging area
· I think the main disadvantage of a staging area is disk space, since we have to dump the data into a local area. As far as I know, there is no other disadvantage of a staging area.
4. What is meant by performance tuning techniques? Give an example.
· Performance tuning means taking action to increase the performance of a slowly running job, for example:
1) use the Link Partitioner and Link Collector to speed up performance
2) use sorted data for aggregation
3) sort at the source side and aggregate at the target side
4) tune the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects
5) do not use an IPC stage at the target side
5. How do you distinguish the surrogate key in different dimension tables?
· The surrogate key is the key field in each dimension table.
Monday, October 6, 2008
How to fetch the flat file data into Datastage?
For example, if the source is a COBOL file, copy that file to your local machine, then give that path in the source stage and fetch the data into DataStage.
Thursday, August 21, 2008
History Of Datastage
History
DataStage was conceived at VMark, a spin off from Prime Computers that developed two notable products: UniVerse database and the DataStage ETL tool. The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996[1]. Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity. He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation[2]. This tag was used to name DataStage and subsequently used in related products QualityStage, ProfileStage, MetaStage and AuditStage. Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development. The product was in alpha testing in October, beta testing in November and was generally available in January of 1997.
VMark acquired UniData in October of 1997 and renamed itself to Ardent Software[3]. In 1999 Ardent Software was acquired by Informix[4] the database software vendor. In April of 2001 IBM acquired Informix and took just the database business leaving the data integration tools to be spun off as an independent software company called Ascential Software[5]. In March of 2005 IBM acquired Ascential Software[6] and made DataStage part of the WebSphere family as WebSphere DataStage. In 2006 the product was released as part of the IBM Information Server under the Information Management family but was still known as WebSphere DataStage. In 2008 the suite was renamed to InfoSphere Information Server and the product was renamed to InfoSphere DataStage.
DataStage Editions
Enterprise Edition: the name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
DataStage TX: for processing complex transactions and messages, formerly known as Mercator.
DataStage SOA: Real Time Integration pack can turn server or parallel jobs into SOA services.