Sunday, June 29, 2008

DataStage FAQs

Types Of Lookups:

In a lookup, the first link is the 'primary' link. The other links are called 'lookup' links.

When the lookup links come from a stage other than a database stage, all data from the lookup links is read into memory. Then, for each row from the primary link, the lookup is performed.

If the source of the lookup is a database, there can be two types of lookups: Normal and Sparse.

Normal Lookup --- All the data from the database is read into memory, and then the lookup is performed.

Sparse Lookup --- For each incoming row from the primary link, the SQL is fired on the database at run time.
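
A minimal sketch of the difference in SQL, assuming a hypothetical DIM_CUSTOMER reference table (table and column names are illustrative only):

    -- Normal lookup: one query at job start; the result set is cached in memory
    SELECT cust_id, cust_name FROM dim_customer;

    -- Sparse lookup: one keyed query per incoming primary row
    SELECT cust_id, cust_name
    FROM dim_customer
    WHERE cust_id = ?;  -- ? is bound from the primary-link row at run time

A sparse lookup usually pays off only when the reference table is much larger than the number of incoming rows.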

1) What is a view?

A view is a virtual (logical) copy of the original table: it holds only the schema/query definition, not the data itself.

2) Difference between a materialized view and a view?

A view is just a set of SQL statements that join one or more tables and present the data. Views do not hold the data themselves; they only point to it.

A materialized view, a concept used mainly in data warehousing, does contain the data itself, which makes the data easier and faster to access. The main purpose of a materialized view is to precompute calculations and present data from multiple tables using joins.
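
A short illustration in Oracle-style SQL, assuming hypothetical CUSTOMERS and ORDERS tables: the view stores only its defining query, while the materialized view stores the joined result and can be refreshed.

    -- Plain view: only the definition is stored; the join runs on every query
    CREATE VIEW v_customer_orders AS
    SELECT c.cust_id, c.cust_name, o.order_id, o.amount
    FROM customers c
    JOIN orders o ON o.cust_id = c.cust_id;

    -- Materialized view: the joined result set is stored physically
    CREATE MATERIALIZED VIEW mv_customer_orders
    REFRESH COMPLETE ON DEMAND AS
    SELECT c.cust_id, c.cust_name, o.order_id, o.amount
    FROM customers c
    JOIN orders o ON o.cust_id = c.cust_id;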

3) How do you remove duplicates in DataStage?

Other than the Remove Duplicates stage, we can also use an Aggregator stage to count the number of records that exist for the key columns. If more than one record exists for a key, those records are considered duplicates. Using a Transformer we can set a stage variable 'COUNT' and check whether COUNT > 1; if so, a constraint can reject the duplicate records into a reject file. (The same counting idea is sketched in SQL below.)
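
For comparison, the counting logic expressed in SQL, assuming a hypothetical STG_ORDERS staging table keyed on order_id:

    -- Keys that occur more than once are the duplicate candidates
    SELECT order_id, COUNT(*) AS cnt
    FROM stg_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;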

4) How do you remove locked jobs in DataStage?

You can clear the locks from the DataStage Administrator; alternatively, just restart the DataStage services from the 'Administrative Tools' section of the Control Panel.

5) Did you parameterize the jobs or hard-code the values in them?

Always parameterize the job. The values come either from Job Properties or from a 'Parameter Manager', a third-party tool. There is no reason to hard-code parameters into your jobs. The variables most often parameterized in a job are: DB DSN name, username, password, and the dates against which the data is to be checked.

6) What is the Hash File stage and what is it used for?
The Hash File stage stores data keyed on one or more columns and is commonly used as the reference for lookups. We can also use it to avoid/remove duplicate rows by specifying the hash key on a particular field, since only one row per key value is retained.

7) What other performance tunings have you done in your last project to increase the performance of slowly running jobs?
1. Staged the data coming from ODBC/OCI/DB2UDB stages (or any database) on the server using Hash/Sequential files, both for optimum performance and for data recovery in case the job aborts.

2. Tuned the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects.

3. Tuned the 'Project Tunables' in Administrator for better performance.

4. Used sorted data for the Aggregator stage.

5. Sorted the data as much as possible in the database and reduced the use of the DS Sort stage for better job performance.

6. Dropped unused data from the source as early as possible in the job.

7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.

8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.

9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.

10. Before writing a routine or a transform, make sure the required functionality does not already exist in one of the standard routines supplied in the SDK or DS Utilities categories.

11. Constraints are generally CPU intensive and take a significant amount of time to process. This is especially the case if the constraint calls routines or external macros; if it is inline code, the overhead is minimal.

12. Try to put the constraints in the 'Selection' criteria of the job itself. This eliminates unnecessary records before joins are even made.

13. Tuning should occur on a job-by-job basis.

14. Use the power of the DBMS.

15. Try not to use a Sort stage when you can use an ORDER BY clause in the database.

16. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE ... (see the sketch after this list).

17. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
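
As a small illustration of points 12, 15 and 16, assuming a hypothetical ORDERS source table (names are illustrative only): pushing the filter and the sort into the source query replaces a Transformer constraint and a Sort stage downstream.

    -- Instead of selecting everything and filtering/sorting inside DataStage:
    SELECT order_id, cust_id, amount, order_date
    FROM orders;

    -- push the WHERE and ORDER BY into the source query:
    SELECT order_id, cust_id, amount, order_date
    FROM orders
    WHERE order_date >= DATE '2008-01-01'   -- replaces a Transformer constraint
    ORDER BY cust_id;                       -- replaces a DS Sort stage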

8) What will you do in a situation where somebody wants to send you a file to use as an input or reference, and then run the job?

A. Under Windows: use the 'WaitForFileActivity' under the Sequencers and then run the job. You can also schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file. Once the file has arrived, start the job or sequencer that depends on it.

9) What are the command line functions that import and export the DS jobs?
A. dsimport.exe - imports the DataStage components.
B. dsexport.exe - exports the DataStage components.

10) What is the utility you use to schedule jobs on a UNIX server, other than using Ascential Director?
Use the crontab utility along with the 'dsjob' command (with the proper parameters passed), typically wrapped in a shell script.

11) Difference between a hash file and a sequential file?

A hash file stores data based on a hash algorithm and a key value; a sequential file is just a flat file with no key column. A hash file can be used as the reference for a lookup; a sequential file cannot.

12) How can we implement a lookup in DataStage server jobs?
We can use a hash file as a lookup in server jobs. The hash file needs at least one key column to be created.

or

By using hashed files you can implement the lookup in DataStage; hashed files store data based on a hashing algorithm and key values.

13) What are job parameters?
These parameters are used to provide administrative access and to change run-time values of a job.

Edit > Job Properties

In the Parameters tab we can define the name, prompt, type and value.

14) What's the difference between an operational data store (ODS) and a data warehouse?
Data that is volatile belongs in the ODS, while non-volatile, historical, time-variant data is data warehouse data. In simple terms, an ODS holds dynamic data.

or

A data warehouse is a decision-support database for organisational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (Operational Data Store) is an integrated collection of related information; it typically contains at most 90 days of information.

or

An ODS (operational data store) is part of the transactional database layer. It keeps integrated data from different transactional databases and allows common operations across the organisation, e.g. banking transactions.

or

An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level or atomic (indivisible) data such as transactions and prices as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the Data warehouse.

How do you pass a filename as a parameter to a job?
During job development we can create a parameter 'FILE_NAME', and its value can be passed when running the job.

or

1. Go to DataStage Administrator->Projects->Properties->Environment->UserDefined. Here you can see a grid where you can enter your parameter name and the corresponding file path.

2. Go to the Stage tab of the job, select the NLS tab, click 'Use Job Parameter' and select the parameter name you defined above. The selected parameter name appears in the text box beside the 'Use Job Parameter' button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

How do you eliminate duplicate rows?

Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.

or

    DELETE FROM emp
    WHERE rowid NOT IN (SELECT MAX(rowid)   -- or MIN(rowid)
                        FROM emp
                        GROUP BY column_name);

or

Duplicates can also be eliminated by loading the corresponding data into a hash file. Specify the columns on which you want to eliminate duplicates as the keys of the hash file, so only one row per key survives.
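
A more portable variant of the SQL above, assuming a hypothetical EMP table with a unique emp_id column, keeps one row per key with a window function instead of Oracle's rowid:

    -- Delete every row except the first one per column_name value
    DELETE FROM emp
    WHERE emp_id IN (
        SELECT emp_id FROM (
            SELECT emp_id,
                   ROW_NUMBER() OVER (PARTITION BY column_name
                                      ORDER BY emp_id) AS rn
            FROM emp
        ) t
        WHERE t.rn > 1
    );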

What about System variables?

DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char(254).

@IM An item mark, Char(255).

@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.

@LOGNAME The user login name.

@MONTH The current month extracted from the value in @DATE.

@NULL The null value.

@NULL.STR The internal representation of the null value, Char(128).

@PATH The pathname of the current DataStage project.

@SCHEMA The schema name of the current DataStage project.

@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).

@SYSTEM.RETURN.CODE
Status codes returned by system processes or commands.

@TIME The internal time when the program started. See the Time function.

@TM A text mark (a delimiter used in UniVerse files), Char(251).

@TRUE The compiler replaces the value with 1.

@USERNO The user number.

@VM A value mark (a delimiter used in UniVerse files), Char(253).

@WHO The name of the current DataStage project directory.

@YEAR The current year extracted from @DATE.

REJECTED Can be used in the constraint expression of an output link of a Transformer stage. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

How do you merge two files in DS?

Either use a copy command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.

How do we automate DataStage jobs?

We can call a DataStage batch job from the command prompt using 'dsjob' (e.g. dsjob -run -param NAME=VALUE project jobname), passing all the parameters on the command line. Wrap this in a shell script and call the script from any scheduler available on the market.
The second option is to schedule these jobs using the DataStage Director.

What's the difference between a DataStage developer and a DataStage designer? What are the skills required for each?

A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs: he deals with the blueprints and decides which stages are required to develop the code.

Importance of the surrogate key in data warehousing?

A surrogate key is a primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. a surrogate key is not affected by changes going on in the database.
It is a numeric key: a primary key in the dimension table and a foreign key in the fact table. It is used to handle missing data and complex situations in DataStage.

or

The concept of a surrogate key comes into play when there is a slowly changing dimension in a table.
In such a case we need a key by which we can identify the changes made to the dimensions.
These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3.
Surrogate keys are system-generated: mainly just a sequence of numbers, though they can also be alphanumeric. (A minimal sequence-based sketch follows.)
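
A minimal sketch of a system-generated surrogate key in Oracle-style SQL, assuming a hypothetical DIM_CUSTOMER dimension (all names are illustrative): a sequence supplies the key, independently of the natural/business key.

    -- Sequence that generates the surrogate key values
    CREATE SEQUENCE customer_sk_seq START WITH 1 INCREMENT BY 1;

    -- Dimension keyed on the surrogate; the natural key is kept as an attribute
    CREATE TABLE dim_customer (
        customer_sk   NUMBER PRIMARY KEY,   -- surrogate key (FK in the fact table)
        customer_code VARCHAR2(20),         -- natural/business key from the source
        customer_name VARCHAR2(100)
    );

    INSERT INTO dim_customer (customer_sk, customer_code, customer_name)
    VALUES (customer_sk_seq.NEXTVAL, 'C1001', 'Acme Ltd');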

What is the difference between DataStage and Informatica?
There are good articles on these differences that help to get an idea; basically, it depends on what you are trying to accomplish.
