Sunday, June 29, 2008

Real-Time Examples for DataStage

1. How do we read data from Excel (XL) files? My data file has commas inside the data, but the delimiter we are using is '|'. How do we read the data? Explain with steps.
• 1. Create a DSN for your Excel file, picking the Microsoft Excel driver.
2. Use an ODBC stage as the source.
3. Configure the ODBC stage with the DSN details.
4. While importing metadata for the Excel sheet, make sure you select the "system tables" check box.
Note: In the Excel sheet, the first row should contain the column names.
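If the pipe-delimited extract lands as a plain text file, the point about embedded commas can be checked in the shell; the file name and contents below are hypothetical:

```shell
# Hypothetical pipe-delimited extract; the "name" field contains commas.
printf 'id|name|city\n1|Smith, John|New York\n2|Lee, Ann|Boston\n' > /tmp/data.txt

# With '|' as the field separator, embedded commas are plain data.
awk -F'|' 'NR > 1 { print $2 }' /tmp/data.txt
```

Because the field separator is '|', the commas inside "Smith, John" never split a field.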
2. What is a job sequence used for? What are batches? What is the difference between a job sequence and a batch?
• A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.
3. Disadvantages of a staging area
• I think the main disadvantage of a staging area is disk space, since we have to dump the data into a local area. As far as I know, there are no other disadvantages of a staging area.
4. What is meant by performance tuning techniques? Give an example.
• Performance tuning means taking action to improve the performance of a slowly running job, for example:
1) Use Link Partitioner and Link Collector stages to speed up throughput.
2) Use sorted data for aggregation.
3) Sort at the source side and aggregate at the target side.
4) Tune the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects.
5) Do not use an IPC stage at the target side.
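Tip 2 above (feeding sorted input to an aggregation) can be sketched in the shell; the file names and the choice of the first pipe-delimited column as the grouping key are assumptions:

```shell
# Unsorted pipe-delimited input (hypothetical sample data).
printf '20|b\n10|a\n30|c\n' > /tmp/input.txt

# Pre-sort on the grouping key so a downstream Aggregator can group
# rows as they arrive instead of buffering the whole input.
sort -t'|' -k1,1 /tmp/input.txt > /tmp/sorted.txt
cat /tmp/sorted.txt
```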
5. How do we distinguish the surrogate key in different dimension tables?
• The surrogate key is the key field in each dimension table.
6. How to read data from Excel files? Explain with steps.

To read data from an Excel file:
* Save the file as .csv (comma-separated values).
* Use a flat-file (Sequential File) stage in the DataStage job canvas.
* Double-click the stage and assign the saved .csv file as its input file.
* Import metadata for the file (once you have imported or typed in the metadata, click View Data to check the data values).
Then do the rest of the transformations as needed.
-Debasis
• Create a new DSN for the Excel driver and choose the workbook you want the data from.
Then select an ODBC stage and access the Excel sheet through it, i.e., import the sheet using the new DSN created for it.
7. How can we generate a surrogate key in server/parallel jobs?
• In parallel jobs we can use the Surrogate Key Generator stage.
• In server jobs we can use a built-in routine called KeyMgtGetNextValue.
• You can also generate the surrogate key in the database using a sequence generator.
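The idea behind a surrogate key (a generated running number, independent of the business data) can be sketched outside DataStage with a one-line awk filter; the file name and data are hypothetical:

```shell
# Hypothetical dimension load file, one business value per line.
printf 'red\ngreen\nblue\n' > /tmp/dim_input.txt

# Prefix every row with a running number, mimicking what a surrogate
# key generator does for each incoming record.
awk '{ printf "%d|%s\n", NR, $0 }' /tmp/dim_input.txt
```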
8. What is an environment variable?
• Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We can set it at either the project level or the job level; once set, the variable is available in that project/job.
We can also define new environment variables; for that, go to the DataStage Administrator.
For further details, refer to the DataStage Administrator guide.
• These are variables used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without this you cannot run a parallel job) or to increase the sequential-file or dataset read/write buffers.
ex: $APT_CONFIG_FILE
There are many such environment variables. Go to Job Properties, click the Parameters tab, then click "Add Environment Variable" to see most of them.
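As a small sketch, the same variable can also be set in the shell before launching a job from the command line; the path below is purely a made-up example:

```shell
# $APT_CONFIG_FILE tells the parallel engine which configuration file
# (node/resource layout) to use. The path is a hypothetical example.
APT_CONFIG_FILE=/opt/datastage/configurations/default.apt
export APT_CONFIG_FILE
echo "$APT_CONFIG_FILE"
```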
9. How can we create environment variables in DataStage?
• We can create environment variables by using the DataStage Administrator.
• This mostly comes under the Administrator's role. As a designer, we can only add one directly via Designer > View > Job Properties > Parameters > Add Environment Variable > (under User Defined) > Add.
10. A few questions:
1. What are the various processes which start when the DataStage engine starts?
2. What changes need to be made on the database side if I have to use the DB2 stage?
3. Is the DataStage engine responsible for compilation, execution, or both?
• Three processes start when the DataStage engine starts:
1. DSRPC
2. DataStage Engine Resources
3. DataStage Telnet Services
11. How to write and execute routines for PX jobs in C++?
• You define and store the routines in the DataStage repository (e.g. in the Routines folder). These routines are compiled and executed with a C++ compiler.
• You have to write the routine in C++ (g++ on UNIX), then create an object file and provide that object file's path in your routine definition.
12. How to eliminate duplicate rows in DataStage?
• You can remove duplicate rows in more than one way:
1. There is a stage called Remove Duplicates where you can specify the key.
2. Alternatively, you can specify the key on the stage itself, and it removes duplicate rows based on that key at processing time.
• By using a Hashed File stage in DataStage Server we can eliminate duplicates.
• Use a Sort stage and set the property Allow Duplicates to false.
OR
On any stage's Input tab, choose hash partitioning, specify the key, and check the Unique check box.
• If you are working with server jobs, you can use a hashed file to eliminate duplicate rows.
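Outside DataStage, the same key-based deduplication can be sketched in the shell; the sample file and the choice of the first pipe-delimited field as the key are assumptions:

```shell
# Sample data with a duplicate key 'a' in the first field.
printf 'a|1\nb|2\na|3\n' > /tmp/dups.txt

# Keep one row per distinct value of the first pipe-delimited field,
# analogous to a Remove Duplicates stage keyed on that column.
sort -t'|' -k1,1 -u /tmp/dups.txt
```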
13. What is Runtime Column Propagation and how do we use it?
• If the data contains more columns than are defined in the metadata, and runtime column propagation is enabled, those extra columns are propagated through the rest of the job.
14. Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (Oracle, SQL Server, etc.) be on Windows, or should one of the systems be on UNIX/Linux?
• Your source systems can be anything (Oracle, SQL Server, DB2, flat files, etc.), but your complete data warehouse should be on one target platform (Oracle, SQL Server, DB2, ...).
• In the server edition you can have both on Windows, but in PX the target should be on UNIX.
15. How to find the process ID? Explain with steps.
• In UNIX you can find it using the ps -ef command; it displays all the processes currently running on the system along with their process IDs.
• From the DataStage Director, follow the path:
Job > Cleanup Resources.
There you can also see the PID; it also displays all the currently running processes.
• Depending on your environment, you may have lots of process IDs. From one of the DataStage docs, you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command doesn't make sense, you'll need some background theory about how processes work in UNIX (or the MKS environment when running on Windows). Also from the DataStage docs (I haven't tried this one yet, but it looks interesting): APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process ID. Good luck.
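A sketch of pulling just the PID column out of the ps -ef output; the account name dsadm is an assumption, substitute your own DataStage user:

```shell
# List processes owned by the DataStage account. The [d] trick stops
# grep from matching its own command line.
ps -ef | grep '[d]sadm'

# In ps -ef output, column 1 is the owner and column 2 is the PID,
# so the PIDs alone can be extracted with awk:
ps -ef | awk '$1 == "dsadm" { print $2 }'
```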
16. If we have two sources with the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?
• Use a Change Capture stage and send its output into a Transformer.
Write a routine to abort the job, and call it from the Transformer when
@INROWNUM = 1.
So if the data does not match, rows are passed into the Transformer and the job is aborted.
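When the two sources can be landed as flat files, a quick shell-level sanity check is also possible before (or instead of) the Change Capture approach; the file names and contents are hypothetical:

```shell
# Two hypothetical extracts with identical metadata.
printf '1|a\n2|b\n' > /tmp/src1.txt
printf '1|a\n2|b\n' > /tmp/src2.txt

# cmp -s is silent and exits with status 0 only when the files are
# byte-identical; a non-zero status could be used to stop a run.
if cmp -s /tmp/src1.txt /tmp/src2.txt; then
    echo "sources match"
else
    echo "sources differ"
fi
```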
17. Can you tell me for what purpose .dsx files are used in DataStage?
• .dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in .dsx format. A standard usage is to develop the job in our test environment and, after testing, export the file and save it as x.dsx. This can be done using DataStage Manager.
18. How can we create read-only jobs in DataStage?
• In the Export dialog there is an option for this: click the Options tab, and under the Include options you will find a read-only setting for DataStage jobs; just enable that.
19. What are environment variables? What are they used for?
• Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We can set it at either the project level or the job level; once set, the variable is available in that project/job.
We can also define new environment variables; for that, go to the DataStage Administrator.
For further details, refer to the DataStage Administrator guide.
20. How do you call procedures in DataStage?
• Use the Stored Procedure stage.
21. How to run a job from the command prompt in UNIX?
• Use the dsjob command with the appropriate options:
dsjob -run -jobstatus projectname jobname
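The call above can be wrapped in a small script that reacts to the result; the project/job names are placeholders, and this sketch assumes dsjob is on the PATH and that with -jobstatus the exit code reflects the job's final state (1 = finished OK, 2 = finished with warnings):

```shell
# Minimal wrapper around dsjob (DataStage client assumed on PATH).
run_ds_job() {
    rc=0
    dsjob -run -jobstatus "$1" "$2" || rc=$?
    # With -jobstatus, dsjob waits for the job and exits with its
    # final status: 1 = finished OK, 2 = finished with warnings.
    if [ "$rc" -eq 1 ] || [ "$rc" -eq 2 ]; then
        echo "OK (status $rc)"
    else
        echo "FAILED (status $rc)"
        return 1
    fi
}

# Example call (placeholder project and job names):
# run_ds_job dstage1 LoadCustomers
```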
22. How to know the number of records in a sequential file before running a server job?
• If your environment is UNIX, you can check with the wc -l filename command.
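A sketch with a hypothetical file (note that wc -l counts newline-terminated lines, so a header row, if present, must be subtracted):

```shell
# Hypothetical three-record sequential file.
printf 'a\nb\nc\n' > /tmp/customers.txt

# Reading via stdin redirection prints the bare count, no file name.
wc -l < /tmp/customers.txt
```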
23. What is the DataStage engine? What is its purpose?
• The DataStage server contains the DataStage engine. The server interacts with the client components and the repository. The engine is used to develop and run jobs; only when the engine is running can we develop jobs.
24. How to implement slowly changing dimensions in DataStage?
• In Informatica PowerCenter, there is a wizard to implement slowly changing dimensions. DataStage does not have that type of wizard, so SCDs have to be implemented with manual logic.
25. Give one real-time situation where the Link Partitioner stage is used.
• If we want to move data from the source to the targets more quickly, we use the Link Partitioner stage in server jobs; it can create a maximum of 64 partitions. It is an active stage. Normally we cannot connect two active stages, but this stage is allowed to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions; this is also an active stage. So, to avoid connecting an active Transformer directly to the Link Collector, we use an Inter-Process Communication (IPC) stage. As IPC is a passive stage, the data can then be collected by the Link Collector. But we can use IPC only when the target side is a passive stage.
26. How do I create a DataStage engine stop/start script?
Actually my idea is roughly as below (a sketch, to be run as the dsadm user; the path and process names may differ per install):
#!/bin/bash
# Engine bin directory (reconstructed from the original post; adjust
# for your installation).
DSHOMEBIN=/home/dsadm/Ascential/DataStage/DSEngine/bin
# Kill any remaining client connections (dsapi_slave processes)
# before stopping the engine.
for pid in $(ps -ef | grep '[d]sapi_slave' | awk '{print $2}'); do
    kill -9 "$pid"
done
"$DSHOMEBIN/uv" -admin -stop > /dev/null
"$DSHOMEBIN/uv" -admin -start > /dev/null
# Verify the engine's RPC daemon came back before declaring success.
if ps -ef | grep -q '[d]srpcd'; then
    echo "Started properly"
fi
• Go to the engine bin directory (e.g. /DATASTAGE/PROJECTS/DSENGINE/BIN) and run:
uv -admin -stop
uv -admin -start
27. What does the Separation option in a static hashed file mean?
• The different hashing algorithms are designed to distribute records evenly among the groups of the file, based on the characters and their positions in the record IDs.
When a hashed file is created, Separation and Modulo specify, respectively, the group buffer size and the number of groups allocated for the file. When a static hashed file is created, DataStage creates a file that contains the number of groups specified by the modulus.
Size of hashed file = modulus (no. of groups) * separation (buffer size)
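A worked example of the formula in shell arithmetic, assuming (as in UniVerse hashed files) that the separation is counted in 512-byte disk blocks; the modulus and separation values are made up:

```shell
# Example values: 101 groups, each group buffer 4 x 512 bytes.
modulus=101
separation=4

# Approximate on-disk size of the static hashed file, in bytes.
echo $(( modulus * separation * 512 ))
```

With these example values the file works out to 206848 bytes, about 202 KB.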
