Sunday, June 29, 2008

Datastage basics

DATA STAGE
Question : How do you rename all of the jobs to support your new File-naming conventions?
Answer : Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx file. Write a Perl program that does a simple rename of the strings by looking them up in the Excel file. Then import the new .dsx file, preferably into a new project for testing. Recompile all jobs. Be cautious: the job names may also have changed in your job-control jobs or Sequencer jobs, so you have to make the necessary changes to those Sequencers as well.
Question : Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of Delete logic?
Answer : There is no TRUNCATE on ODBC stages; the 'Clear the table' option issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions: TRUNCATE requires ALTER TABLE permission, whereas DELETE does not.
Question : Tell me one situation from your last project where you faced a problem, and how did you solve it?
Answer : A. Jobs that read data directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
B. A job aborted in the middle of loading some 500,000 rows. The options were either to clean/delete the loaded data and then run the fixed job, or to run the job again from the row at which it aborted. To make sure the load was proper, we opted for the former.
Question : The above might raise another question: why do we have to load the dimension tables first, then the fact tables?
Answer : As we load the dimension tables, the primary keys are generated, and these keys are the foreign keys in the fact tables.
Question : How will you determine the sequence of jobs to load into the data warehouse?
Answer : First we execute the jobs that load the data into the Dimension tables, then the Fact tables, then the Aggregator tables (if any).
Question : What are the command line functions that import and export the DS jobs?
Answer : A. dsimport.exe - imports the DataStage components.
B. dsexport.exe - exports the DataStage components.
Question : What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Answer : Use the crontab utility along with the dsjob command, with the proper parameters passed.
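For example (a minimal sketch; the install path, project, and job names are hypothetical), a crontab entry to run a job at 2 AM daily might look like:
0 2 * * * . /opt/Ascential/DataStage/DSEngine/dsenv; /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -jobstatus MyProject MyJob
Sourcing dsenv first sets up the environment variables the DataStage engine commands need.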
Question : How would you call an external Java function that is not supported by DataStage?
Answer : Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.
Question : What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?
Answer : A. Under Windows: Use the 'Wait For File' activity in the Sequencer and then run the job. You may be able to schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer that depends on the file.
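A minimal UNIX polling sketch (the file path, interval, project, and job names are all hypothetical):
while [ ! -f /data/incoming/input.dat ]; do sleep 60; done
dsjob -run -jobstatus MyProject MyJob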
Question : Read the String functions in DS
Answer : Functions like [] -> sub-string function and ':' -> concatenation operator
Syntax: string [ [ start, ] length ]
string [ delimiter, instance, repeats ]
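A quick illustration of both forms and the concatenation operator (the values are just examples):
Str = "123-456-789"
Sub = Str[1,3]          ;* substring form: returns "123"
Fld = Str["-",2,1]      ;* delimiter form: returns the 2nd "-"-delimited field, "456"
Cat = Sub : "-XYZ"      ;* ':' concatenation: returns "123-XYZ"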
Question : How did you connect with DB2 in your last project?
Answer : Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or DB2-UDB) depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, since native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - an ODBC driver used to connect to AS400/DB2.
Question : What are Sequencers?
Answer : Sequencers are job control programs that execute other jobs with preset Job parameters.
Question : How did you handle an 'Aborted' sequencer?
Answer : In almost all cases we had to delete the data it had inserted from the database manually, fix the job, and then run the job again.
Question : What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
Answer :
1. Staged the data coming from ODBC/OCI/DB2-UDB stages (or any database) on the server using hashed/sequential files, for optimum performance and for data recovery in case a job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the DB and reduced the use of DS-Sort, for better job performance.
6. Removed unused data from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures on the DB for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, used standard logic to run jobs in parallel.
10. Before writing a routine or a transform, made sure the required functionality was not already available in one of the standard routines supplied in the SDK or DS utilities categories. Constraints are generally CPU-intensive and take a significant amount of time to process; this may be the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal.
11. Tried to have the constraints in the 'Selection' criteria of the jobs itself. This eliminates unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database; bulk loaders are generally faster than ODBC or OLE.
Question : How did you handle reject data?
Answer : Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. A Reject link has to be defined for every Output link from which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected.
Question : If worked with DS6.0 and latest versions what are Link-Partitioner and Link-Collector used for?
Answer : Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.
Question : What are Routines and where/how are they written and have you written any routines before?
Answer : Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them. The following are different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines
Question : What are OConv() and IConv() functions and where are they used?
Answer : IConv() - Converts a string to an internal storage format.
OConv() - Converts an expression to an output format.
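For example (a minimal sketch; DataStage stores dates internally as a day number):
InDate = Iconv("12/31/2007", "D/MDY[2,2,4]")   ;* string to internal day number
OutStr = Oconv(InDate, "D-YMD[4,2,2]")         ;* back out as "2007-12-31"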
Question : How did you connect to DB2 in your last project?
Answer : Using DB2 ODBC drivers
Question : How good are you with your PL/SQL?
Answer : On the scale of 1-10 say 8.5-9
Question : Did you work in UNIX environment?
Answer : Yes. One of the most important requirements.
Question : What other ETL tools have you worked with?
Answer : Informatica and also DataJunction if it is present in your Resume.
Question : What are Static Hash files and Dynamic Hash files?
Answer : As the names themselves suggest what they mean. In general we use Type-30 dynamic hashed files. The data file has a default size limit of 2 GB, and the overflow file is used if the data exceeds that size.
Question : What is Hash file stage and what is it used for?
Answer : Used for look-ups. It is like a reference table. It is also used in place of ODBC/OCI tables for better performance.

Question : Have you ever been involved in updating the DS version (e.g., DS 5.X)? If so, tell us some of the steps you have taken in doing so.
Answer : Yes. The following are some of the steps I have taken in doing so:
1) Definitely take a backup of the whole project(s) by exporting each project as a .dsx file.
2) See that you use the same parent folder for the new version, so that your old jobs with hard-coded file paths continue to work.
3) After installing the new version, import the old project(s); you will have to compile them all again. You can use the 'Compile All' tool for this.
4) Make sure that all your DB DSNs are created with the same names as the old ones. This step applies when moving DS from one machine to another.
5) If you are just upgrading your DB, say from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.
Question : Did you parameterize the job or hard-code the values in the jobs?
Answer : Always parameterized the job. Either the values come from Job Properties or from a 'Parameter Manager' - a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates against which the data is to be looked up.
Question : How do you merge two files in DS?
Answer : Either use a copy/concatenate command as a Before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
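For instance (hypothetical file names), when the metadata matches, the Before-job subroutine can run an OS command:
ExecSH (UNIX): cat file1.txt file2.txt > merged.txt
ExecDOS (Windows): copy /b file1.txt+file2.txt merged.txt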
1. What about System variables?
2. How can we create Containers?
3. How can we improve the performance of DataStage jobs?
4. What are the Job parameters?
5. What is the difference between a routine, a transform and a function?
6. What are all the third party tools used in DataStage?
7. How can we implement Lookup in DataStage Server jobs?
8. How can we implement Slowly Changing Dimensions in DataStage?
9. How can we join an Oracle source and a Sequential file?
10. What are the iconv and oconv functions?
11. Difference between Hashfile and Sequential File?
PART II
1 Dimension Modelling types along with their significance

Data Modelling is Broadly classified into 2 types.
a) E-R Diagrams (Entity-Relationships).
b) Dimensional Modelling.
2 Dimensional modelling is again sub divided into 2 types.

a) Star Schema - Simple & much faster. Denormalized form.
b) Snowflake Schema - Complex with more granularity. More normalized form.
3 Importance of Surrogate Key in Data warehousing?

Surrogate Key is a Primary Key for a Dimension table. The main importance of using it is that it is independent of the underlying database, i.e., a Surrogate Key is not affected by changes going on in the database.
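In Server jobs a surrogate key is commonly generated with the SDK key-management routine; a minimal sketch (the sequence name here is hypothetical):
SKey = KeyMgtGetNextValue("CUSTOMER_DIM")   ;* returns the next key for this sequence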


4 Differentiate Database data and Data warehouse data?

Data in a Database is:
a) Detailed or transactional.
b) Both readable and writable.
c) Current.
Data in a Data Warehouse is:
a) Summarized.
b) Primarily read-only.
c) Historical (snapshots over time).


5 What is the flow of loading data into fact & dimensional tables?

Fact table - Table with a collection of foreign keys corresponding to the primary keys in the dimension tables. Consists of fields with numeric values (measures).
Dimension table - Table with a unique primary key per record; the fact table's foreign keys reference these primary keys, so the dimension tables are loaded first and the fact tables afterwards.


6 Orchestrate Vs Datastage Parallel Extender?

Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities that are now available as DataStage Parallel Extender.


7 Differentiate Primary Key and Partition Key?

Primary Key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. Partition Key is just a part of the Primary Key. There are several methods of partitioning, such as Hash, Range, and Round-Robin.


8 How do you execute datastage job from command line prompt?

Using "dsjob" command as follows.
dsjob -run -jobstatus projectname jobname
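If the job takes parameters, they can be supplied on the same command line (the parameter names below are hypothetical):
dsjob -run -param SrcDir=/data/in -param RunDate=2008-06-29 -jobstatus projectname jobname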


9 What are Stage Variables, Derivations and Constants?


Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value to a target column.
Derivation - Expression that specifies the value to be passed on to the target column.
Constant - A condition that is either TRUE or FALSE and specifies the flow of data on a link.


10 What is the default cache size? How do you change the cache size if needed?

The default cache size is 256 MB. We can increase it by going into the DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.
11 What are types of Hashed File?

Hashed File is classified broadly into 2 types.
a) Static - Sub divided into 17 types based on Primary Key Pattern.
b) Dynamic - sub divided into 2 types: i) Generic ii) Specific.


12 Containers : Usage and Types?

Container is a collection of stages used for the purpose of Reusability.
There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project.


13 Compare and Contrast ODBC and Plug-In stages?

ODBC : a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.
Plug-In: a) Good Performance.
b) Database specific (only one database).
c) Cannot handle Stored Procedures.


14 How to run a Shell Script within the scope of a Data stage job?

By using "ExcecSH" command at Before/After job properties.


15 How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

We use a) "Iconv" function - Internal Convertion.
b) "Oconv" function - External Convertion.
Function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv...
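For example, an input value of "06/29/2008" run through the expression above returns "2008-29-06":
Oconv(Iconv("06/29/2008", "D/MDY[2,2,4]"), "D-YDM[4,2,2]")   ;* "2008-29-06"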


16 Types of Parallel Processing?

Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massively Parallel Processing.


17 What does a Config File in parallel extender consist of?

Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.


18 Functionality of Link Partitioner and Link Collector?

Link Partitioner : It splits data into various partitions (data flows) using various partition methods.
Link Collector : It collects the data coming from the partitions, merges it into a single data flow, and loads it into the target.


19 What is Modulus and Splitting in Dynamic Hashed File?

In a Hashed File, the size of the file keeps changing randomly.
If the size of the file increases it is called "Modulus".
If the size of the file decreases it is called "Splitting".


20 Types of views in Datastage Director?

There are 3 types of views in Datastage Director
a) Job View - Dates of Jobs Compiled.
b) Log View - Status of Job last run
c) Status View - Warning messages, event messages, program-generated messages.






DWHs are typically read-only and batch-updated on a schedule, whereas ODSs are maintained in nearer real time and trickle-fed constantly.
