Sunday, June 29, 2008

Datastage certification Dump

QUESTION 1
Which three stages support the dynamic (runtime) definition of the physical column
metadata? (Choose three.)
A. the Sequential stage
B. the Column Export stage
C. the CFF stage
D. the DRS stage
E. the Column Import stage
Answer: A,B,E

QUESTION 2
A Varchar(10) field named SourceColumn is mapped to a Char(25) field named
TargetColumn in a Transformer stage. The APT_STRING_PADCHAR environment
variable is set in Administrator to its default value. Which technique describes how to
write the derivation so that values in SourceColumn are padded with spaces in
TargetColumn?
A. Include APT_STRING_PADCHAR in your job as a job parameter. Specify the C/C++
end of string character (0x0) as its value.
B. Map SourceColumn to TargetColumn. The Transformer stage will automatically pad
with spaces.
C. Include APT_STRING_PADCHAR in your job as a job parameter. Specify a space as
its value.
D. Concatenate a string of 25 spaces to SourceColumn in the derivation for
TargetColumn.
Answer: C

QUESTION 3
Which three privileges must the user possess when running a parallel job? (Choose
three.)
A. read access to APT_ORCHHOME
B. execute permissions on local copies of programs and scripts
C. read/write permissions to the UNIX /etc directory
D. read/write permissions to APT_ORCHHOME
E. read/write access to disk and scratch disk resources
Answer: A,B,E

QUESTION 4
In a Teradata environment, which stage invokes Teradata supplied utilities?
A. Teradata API
B. DRS Teradata
C. Teradata Enterprise
D. Teradata Multiload
Answer: D

QUESTION 5
When importing a COBOL file definition, which two are required? (Choose two.)
A. The file you are importing is accessible from your client workstation.
B. The file you are importing contains level 01 items.
C. The column definitions are in a COBOL copybook file and not, for example, in a
COBOL source file.
D. The file does not contain any OCCURS DEPENDING ON clauses.
Answer: A,B

QUESTION 6
Which two tasks will create DataStage projects? (Choose two.)
A. Export and import a DataStage project from DataStage Manager.
B. Add new projects from DataStage Administrator.
C. Install the DataStage engine.
D. Copy a project in DataStage Administrator.
Answer: B,C

QUESTION 7
Which three defaults are set in DataStage Administrator? (Choose three.)
A. default prompting options, such as Autosave job before compile
B. default SMTP mail server name
C. project level default for Runtime Column Propagation
D. project level defaults for environment variables
E. project level default for Auto-purge of job log entries
Answer: C,D,E

QUESTION 8
Which three are keyless partitioning methods? (Choose three.)
A. Entire
B. Modulus
C. Round Robin
D. Random
E. Hash
Answer: A,C,D

QUESTION 9
Which two must be specified to manage Runtime Column Propagation? (Choose two.)
A. enabled in DataStage Administrator
B. attached to a table definition in DataStage Manager
C. enabled at the stage level
D. enabled with environmental parameters set at runtime
Answer: A,C

QUESTION 10
Which three are valid ways within a Job Sequence to pass parameters to Activity stages?
(Choose three.)
A. ExecCommand Activity stage
B. UserVariables Activity stage
C. Sequencer Activity stage
D. Routine Activity stage
E. Nested Condition Activity stage
Answer: A,B,D

QUESTION 11
A client requires that a database table be loaded using two jobs. The first job writes to a
dataset. The second job reads the dataset and loads the table. The two jobs are connected
in a Job Sequence. What are three benefits of this approach? (Choose three.)
A. The time it takes to load the table is reduced.
B. The database table can be reloaded after a failure without re-reading the source data.
C. The dataset can be used by other jobs even if the database load fails.
D. The dataset can be read if the database is not available.
E. The data in the dataset can be archived and shared with other external applications.
Answer: B,C,D

QUESTION 12
You are reading customer data using a Sequential File stage and transforming it using the
Transformer stage. The Transformer is used to cleanse the data by trimming spaces from
character fields in the input. The cleansed data is to be written to a target DB2 table.
Which partitioning method would yield optimal performance without violating the
business requirements?

A. Hash on the customer ID field
B. Round Robin
C. Random
D. Entire
Answer: B

QUESTION 13
Which three are valid trigger expressions in a stage in a Job Sequence? (Choose three.)
A. Equality(Conditional)
B. Unconditional
C. ReturnValue(Conditional)
D. Difference(Conditional)
E. Custom(Conditional)
Answer: B,C,E

QUESTION 14
An Aggregator stage using a Hash technique processes a very large number of rows
during month end processing. The job occasionally aborts during these large runs with an
obscure memory error. When the job is rerun, processing the data in smaller amounts
corrects the problem. Which change would correct the problem?
A. Set the Combinability option on the Stage Advanced tab to Combinable allowing the
Aggregator to use the memory associated with other operators.
B. Change the partitioning keys to produce more data partitions.
C. Add a Sort stage prior to the Aggregator and change to a sort technique on the Stage
Properties tab of the Aggregator stage.
D. Set the environment variable APT_AGG_MAXMEMORY to a larger value.
Answer: C

QUESTION 15
Which three actions are performed using stage variables in a parallel Transformer stage?
(Choose three.)
A. A function can be executed once per record.
B. A function can be executed once per run.
C. Identify the first row of an input group.
D. Identify the last row of an input group.
E. Look up a value from a reference dataset.
Answer: A,B,C

Datastage solutions, knowledge base, FAQ and best practices

1. What are the ways to execute Datastage jobs?

A job can be run using a few different methods:

* from Datastage Director (menu Job -> Run now...)
* from command line using a dsjob command
* Datastage routine can run a job (DsRunJob command)
* by a job sequencer
2. How to invoke a Datastage shell command?

Datastage shell commands can be invoked from:

* Datastage administrator (projects tab -> Command)
* Telnet client connected to the datastage server

3. How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select the process which holds the lock and click Logout.

If it still doesn't help, go to the Datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to option 4 (Administer processes/locks), then try invoking one of the clear locks commands (options 7-10).

4. How to release a lock held by jobs?

Go to the Datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to option 4 (Administer processes/locks), then try invoking one of the clear locks commands (options 7-10).

5. How to run and schedule a job from the command line?

To run a job from the command line use the dsjob command.

Command syntax (general form): dsjob [-file <file> | -server <server> -user <user> -password <password>] <command>, where the most common command is -run followed by the project and job names. Scheduling is usually handled either by the DataStage Director scheduler or by calling dsjob from cron.
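
A hedged example of a typical invocation and of a cron-based schedule is sketched below; the project name, job name, parameter and installation paths are made up, so adjust them to your environment (dsjob lives under the DSEngine/bin directory and needs the dsenv environment sourced first):

# run the job, wait for it to finish and return the job status as the exit code
dsjob -run -mode NORMAL -wait -jobstatus -param SRC_DIR=/data/incoming dwh_project LoadCustomers

# a sample crontab entry which sources dsenv and starts the job at 2:30 am every day
30 2 * * * . /opt/Ascential/DataStage/DSEngine/dsenv && /opt/Ascential/DataStage/DSEngine/bin/dsjob -run dwh_project LoadCustomers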

6. User privileges for the default DataStage roles?


The role privileges are:

* DataStage Developer - user with full access to all areas of a DataStage project
* DataStage Operator - has privileges to run and manage deployed DataStage jobs
* -none- - no permission to log on to DataStage

7. What is the command to analyze a hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:

* FILE.STAT command
* ANALYZE.FILE command
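
As a rough illustration, both commands are typed at the Datastage command shell and take the hashed file name as an argument. The file name below is made up, and it is assumed the hashed file was created in the project (account), so a VOC pointer to it exists:

ANALYZE.FILE H_CUSTOMER_LOOKUP
FILE.STAT H_CUSTOMER_LOOKUP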

8. Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can unregister and register system libraries used by Datastage.

9. How to send notifications from Datastage as a text message (sms) to a cell phone

There are a few possible methods of sending sms messages from Datastage. However, there is no easy way to do this directly from Datastage and all the methods described below will require some effort.

The easiest way of doing that from the Datastage standpoint is to configure an SMTP (email) server as a mobile phone gateway. In that case, a Notification Activity can be used to send a message with a job log and any desired details. The DSSendMail Before-job or After-job subroutine can also be used to send sms messages.
If configured properly, the recipient's email address will have the following format: 600123456@oursmsgateway.com

If there is no possibility of configuring a mail server to send text messages, you can work around it by using an external application run directly from the operating system. There are plenty of Unix scripts and applications available for sending sms messages.
In that solution, you will need to create a batch script which will take care of sending the messages and invoke it from Datastage using the ExecDOS or ExecSH subroutines, passing the required parameters (like the phone number and message body).
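
As a rough sketch of that workaround, the script below is entirely hypothetical: its name, location and the sendsms command stand in for whatever SMS gateway client is actually available on the server. It would be invoked from Datastage with the ExecSH subroutine, passing an input value such as: /opt/etl/scripts/send_job_sms.sh 600123456 "Nightly load finished with warnings"

#!/bin/sh
# send_job_sms.sh - hypothetical wrapper called from Datastage via ExecSH
# $1 - phone number, $2 - message body
PHONE=$1
MESSAGE=$2
# sendsms is a placeholder for the gateway client installed in your environment
sendsms -n "$PHONE" -m "$MESSAGE"
exit $?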

Please keep in mind that all these solutions may require contacting the local cellphone provider first and, depending on the country, it may not be free of charge and in some cases the provider may not support the capability at all.

10. Error in Link Collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

11. What is the DataStage equivalent of the LIKE operator in ORACLE?

The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';

Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'

12. What is the difference between the logging text and the final text message in the Terminator stage?

Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.

* Informational - a green line, a DSLogInfo() type message.
* Final Warning Text - the red fatal message, which is included in the sequence abort message.

13. Error in STP stage - SOURCE Procedures must have an output link

The error appears in the Stored Procedure (STP) stage when there are no output links going out of that stage.
To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

14. How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.

It can be designed in the following way:

* A source ODBC stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
* A transformer which passes that record through. If required, add the pl/sql procedure parameters as columns on the right-hand side of the transformer's mapping
* A Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. On the input tab select 'execute procedure for each row' (it will be run once).

15. Is it possible to run a server job in parallel?

Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance button.
The job can now be run simultaneously from one or many sequence jobs. When that happens, Datastage will create new entries in Director and each new instance will be named with an automatically generated suffix (for example, the second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). These entries can be deleted at any time and will be automatically recreated by Datastage on the next run.
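
For reference, a multiple-instance job can also be started from the command line by appending an invocation id to the job name. The sketch below uses made-up project, job and invocation names:

# start two instances of the same multi-instance job in parallel
dsjob -run -wait -jobstatus myproject JOB_0100.load_customers &
dsjob -run -wait -jobstatus myproject JOB_0100.load_products &
wait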

16. Datastage routine to open a text file with error catching

Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

17. Datastage routine which reads the first line from a text file

Note: work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32], " ", "A")  ;* take the first 32 characters and remove all spaces
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

18. How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run all to see the results.
Datastage will remember all the test arguments during future tests.

19. When should hashed files be used? What are the benefits of using them?

Hashed files are the best way to store data for lookups. They're very fast when looking up key-value pairs.
Hashed files are especially useful for storing dictionary-like reference data (customer details, countries, exchange rates). Stored this way, the data can be shared across the project and accessed from different jobs.

20. How to construct a container and deconstruct it or switch between local and shared?

To construct a container go to Datastage Designer, select the stages to be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared.
A local container will only be visible in the current job, while a shared container can be re-used across jobs. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.

21. Corresponding Datastage data types to ORACLE types?

Most of the Datastage variable types map very well to Oracle types. The biggest problem is mapping the Oracle NUMBER(x,y) format correctly.

The best way to do that in Datastage is to convert the Oracle NUMBER format to the Datastage Decimal type and to fill in the Length and Scale columns accordingly.

There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle char to datastage char.

22. How to adjust the commit interval when loading data into the database?

In earlier versions of Datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)

Starting from Datastage 7.x it can be set up in the properties of the ODBC or ORACLE stage, in Transaction handling -> Rows per transaction.
If set to 0, the commit will be issued only at the end of a successful job run (all rows are committed in a single transaction).

23. What is the use of the INROWNUM and OUTROWNUM Datastage variables?

@INROWNUM and @OUTROWNUM are internal Datastage variables which do the following:

* @INROWNUM counts incoming rows to a transformer in a Datastage job
* @OUTROWNUM counts outgoing rows from a transformer in a Datastage job

These variables can be used to generate sequences, primary keys, IDs, to number rows and also for debugging and error tracing. For example, an output column derivation of @OUTROWNUM in a Transformer will number the output rows 1, 2, 3 and so on.
They play a similar role to sequences in Oracle.

24. Datastage Trim function cuts out more characters than expected

By default the Datastage Trim function works this way:
Trim(" a  b  c  d ") will return "a b c d" (internal runs of spaces are reduced to a single space), while in many other programming/scripting languages only the leading and trailing spaces would be removed.
That is because by default an R parameter is assumed, which removes leading and trailing occurrences of the character and reduces multiple internal occurrences to a single occurrence.

To trim only the leading and trailing spaces and leave the inside of the string untouched, use the Trim function in the following way: Trim(" a  b  c  d ", " ", "B")

25. Database update actions in ORACLE stage

The destination table can be updated using various Update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as they will determine which columns appear in the WHERE clause of the SQL statement. Available actions:

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

26. Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion code)

Some useful iconv and oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

That expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100,"MD2")

Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

27. ERROR 81021 Calling subroutine DSR_RECORD ACTION=2


Error message:

DataStage Repository Interface:
Error calling subroutine: DSR_RECORD (Action=2);
check DataStage is set up correctly in project
Development (Internal Error (81021))

Datastage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.


The problem appears when a job sequence is used and it contains many stages (usually more than 10) and very often when a network connection is slow.

Basically the cause of the problem is a communication failure between the DataStage client and the server.

The solution to the issue is:
# Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.
# If the above does not help, execute the DS.REINDEX ALL command from the Datastage shell.

28. How to check Datastage internal error descriptions


# To check the description of a number go to the datastage shell (from administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a good starting point for further analysis.

29. Error timeout waiting for mutex


The error message usually looks like follows:
... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many transformers.

There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of nulls and try to avoid language specific chars which may cause the problem.
- try to simplify the job as much as possible (especially if it’s very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

30. ERROR 30107 Subroutine failed to complete successfully


Error message:


Error calling subroutine:
DSR_RECORD (Action=2); or *DataStage*DSR_SELECT (Action=7);
check DataStage is set up correctly in project Development
(Subroutine failed to complete successfully(30107))

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers


The problem appears when a project is moved from one environment to another (for example when deploying a project from a development environment to production).

The solution to the issue is:
# Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell

31. Datastage Designer hangs when editing job activity properties


The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application freezes and the only way to close it is from the Windows Task Manager.

The solution to the problem is very simple. Just download and install the “XP SP2 patch” for the Datastage client.
It can be found on the IBM client support site (need to log in):
https://www.ascential.com/eservice/public/welcome.do

Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section.
Sometimes users face problems when trying to log in (for example when the license doesn’t cover the IBM Active Support), then it may be necessary to contact the IBM support which can be reached at WDISupport@us.ibm.com

Advantages and Disadvantages of using DataStage ETL tool

Major business and technical advantages and disadvantages of using DataStage ETL tool


Business advantages of using DataStage as an ETL tool:
# Significant ROI (return on investment) over hand-coding
# Learning curve - quick development and reduced maintenance with GUI tool
# Development Partnerships - easy integration with top market products interfaced with the datawarehouse, such as SAP, Cognos, Oracle, Teradata, SAS
# Single vendor solution for bulk data transfer and complex transformations (DataStage versus DataStage TX)
# Transparent and wide range of licensing options


Technical advantages of using DataStage tool to implement the ETL processes
# Single interface to integrate heterogeneous applications
# Flexible development environment - it enables developers to work in their desired style, reduces training needs and enhances reuse. ETL developers can follow data integrations quickly through a graphical work-as-you-think solution which comes by default with a wide range of extensible objects and functions
# Team communication and documentation of the jobs are supported by a self-documenting engine which renders data flows and transformations in HTML format.
# Ability to join data both at the source, and at the integration server and to apply any business rule from within a single interface without having to write any procedural code.
# Common data infrastructure for data movement and data quality (metadata repository, parallel processing framework, development environment)
# With Datastage Enterprise Edition users can use the parallel processing engine which provides high performance and scalability. It helps get the most out of the hardware investment and resources.
# The Datastage server performs very well on both Windows and Unix servers.


Major Datastage weaknesses and disadvantages
# Big architectural differences between the Server and Enterprise editions, which means that migrating jobs from the Server to the Enterprise edition may require a considerable amount of time and effort.
# There is no automated error handling and recovery mechanism - for example there is no way to automatically time out zombie jobs or kill locking processes. However, at the operator level, these errors can be easily resolved.
# No Unix Datastage client - the client software is available only under Windows and there are different clients for different Datastage versions. The good thing is that they can still be installed on the same Windows PC and switched with the Multi-Client Manager program.
# Might be expensive as a solution for a small or mid-sized company.

Data warehouse

A Data Warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM or ERP systems). It may also gather manual inputs from users determining criteria and parameters for grouping or classifying records.
That database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems. This is ensured by a strategy implemented in an ETL process.

A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated, and it becomes ready to be loaded into the data warehouse.
A data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed over time (it is time-variant).
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.


Keeping the data warehouse filled with very detailed and inefficiently selected data may cause the database to grow to a huge size which is difficult to manage and use. To significantly reduce the number of rows in the data warehouse, the data is aggregated, which leads to easier data maintenance and more efficient browsing and data analysis.

A well designed and maintained Data Warehouse can significantly improve the quality and accessibility of the company data and increase the amount of information delivered to the end users.

Key Data Warehouse systems and the most widely used database engines for storing and serving data for the enterprise business intelligence and performance management:
# Teradata
# Oracle
# Microsoft SQL Server
# IBM DB2
# SAS

Datastage tutorial and training

Datastage guide Table of contents:

# Lesson 1. Datastage-modules - the lesson contains an overview of the datastage components and modules with screenshots.
# Lesson 2. Designing jobs - datastage palette - a list of all stages and activities used in Datastage
# Lesson 3. Extracting and loading data - ODBC and ORACLE stages - description and use of the ODBC and ORACLE stages (ORAOCI9) used for data extraction and data load. Covers ODBC input and output links, Oracle update actions and best practices.
# Lesson 4. Extracting and loading data - sequential files - description and use of the sequential files (flat files, text files, CSV files) in datastage.
# Lesson 5. Transforming and filtering data - use of transformers to perform data conversions, mappings, validations and data refining. Design examples of the most commonly used datastage jobs.
# Lesson 6. Performing lookups in Datastage - how to use hash files and database stages as a lookup source.
# Lesson 7. Implementing ETL process in Datastage - step by step guide on how to implement the ETL process efficiently in Datastage. Contains tips on how to design and run a set of jobs executed on a daily basis.
# Lesson 8. SCD implementation in Datastage - the lesson illustrates how to implement SCD's (slowly changing dimensions) in Datastage, contains job designs, screenshots and sample data. All the Slowly Changing Dimensions types are described in separate articles below:
# SCD Type 1
# SCD Type 2
# SCD Type 3 and 4
# Lesson 9. Datastage jobs real-life solutions - a set of examples of job designs resolving real-life problems implemented in production datawarehouse environments in various companies.
# Lesson 10. Header and trailer file processing - a sample Datastage job which processes a textfile organized in a header and trailer format. This format may be represented for example by mainframe, EDI or EPIC files

Lesson 1:
Datastage modules

The DataStage components:

* Administrator - Administers DataStage projects, manages global settings and interacts with the system. Administrator is used to specify general server defaults, add and delete projects, set up project properties and provides a command interface to the datastage repository.
With Datastage Administrator users can set job monitoring limits, user privileges, job scheduling options and parallel job defaults.
* Manager - the main interface to the Datastage Repository, which allows browsing and editing it. It displays table and file layouts, routines, transforms and jobs defined in the project. It is mainly used to store and manage reusable metadata.
* Designer - used to create DataStage jobs which are compiled into executable programs. It is a graphical, user-friendly application which applies a visual data flow method to develop job flows for extracting, cleansing, transforming, integrating and loading data. It’s a module mainly used by Datastage developers.
* Director - manages running, validating, scheduling and monitoring DataStage jobs. It’s mainly used by operators and testers.

Lesson 3:
Designing jobs - ODBC and ORACLE stages


ODBC stages are used to allow Datastage to connect to any data source that supports the Open Database Connectivity (ODBC) standard.
ODBC stages are mainly used to extract or load data. However, the ODBC stage may also be very helpful when aggregating data and as a lookup stage (in that case it can play the role of an Aggregator stage or a hash file and can be used instead).
Each ODBC stage can have any number of inputs or outputs.
The input links specify the data which is written to the database (they act as INSERT, UPDATE or DELETE statements in SQL). Input link data can be defined in various ways: using an SQL statement constructed by DataStage, a user-defined SQL query or a stored procedure.
Output links specify the data that is extracted (they correspond to the SQL SELECT statement). The data on an output link is passed through the ODBC connector and processed by the underlying database.

If a processing target is an Oracle database, it may be worth considering use of ORACLE (ORAOCI9) stage. It has a significantly better performance than ODBC stage and allows setting up more configuration options and parameters native to the Oracle database.
There’s a very useful option to issue SQL before or after the main dataflow operations (Oracle stage properties -> Input -> SQL). For example, when loading a big chunk of data into an Oracle table, it may increase performance to drop indexes in the ‘before SQL’ tab and create indexes and analyze the table in the ‘after SQL’ tab ('ANALYZE TABLE xxx COMPUTE STATISTICS' SQL statement).

Update actions in Oracle stage
The destination table can be updated using various Update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as they will determine which columns appear in the WHERE clause of the SQL statement. Update actions available from the drop-down list:

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

Lesson 4:
Designing jobs - sequential (text) files


Sequential File stages are used to interact with text files, which may involve extracting data from or writing data to a text file. Sequential File stages can read files, create (overwrite) or append data to a text file. The file can be located on any drive that is local or mapped as a shared folder, or even on an FTP server (combined with an FTP stage). Each Sequential File stage can have any number of inputs or outputs. However, trying to write to a sequential file simultaneously from two data streams will surely cause an error.

Lesson 5:
Designing jobs - transforming and filtering data


It's a very common situation and a good practice to design datastage jobs in which data flow goes in the following way:
EXTRACT SOURCE -> DATA VALIDATION, REFINING, CLEANSING -> MAPPING -> DESTINATION
The data refining, validation and mapping part of the process is mainly handled by a transformer stage. Transformer stage doesn't extract or write data to a target database. It handles extracted data, performs conversions, mappings, validations, passes values and controls the data flow.
Transformer stages can have any number of input and output links. Input links can be primary or reference (used for lookups) and there can only be one primary input and any number of reference inputs.

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:

* extracting data from source operational or archive systems which are the primary source of data for the data warehouse
* transforming the data - which may involve cleaning, filtering, validating and applying business rules
* loading the data into a data warehouse or any other database or application that houses data

The ETL process is also very often referred to as Data Integration process and ETL tool as a Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.

ETL Tools

At present the most popular and widely used ETL tools and applications on the market are:
# IBM Websphere DataStage (Formerly known as Ascential DataStage and Ardent DataStage)
# Informatica PowerCenter
# Oracle Warehouse Builder
# Ab Initio
# Pentaho Data Integration - Kettle Project (open source ETL)
# SAS ETL studio
# Cognos Decisionstream
# Business Objects Data Integrator (BODI)
# Microsoft SQL Server Integration Services (SSIS)

ETL process references

Sample business scenarios with real-life ETL process implementations, each of which describes a typical data warehousing problem, are covered in the ETL training section.

A detailed step-by-step instructions on how the ETL process can be implemented in IBM Websphere Datastage can be found in the Implementing ETL in DataStage tutorial lesson.

The ETL process can also be successfully implemented using an open source ETL tool - Kettle. There is a separate Pentaho Data Integration section on our pages: Kettle ETL transformations.

ETL and Data Warehouse training with examples


The ETL and Data Warehousing examples section is structured as a number of scenarios, each of which describes a typical business intelligence and data warehousing problem.
You may consider this module as an ETL and Data Warehouse cookbook with a series of recipes which show how to manage and implement the ETL process in a data warehouse environment.

Our intention is to show and analyze the most common real life problems you may encounter when designing ETL processes to feed data warehouses in various organizations, using a wide range of ETL tools. We discuss and propose an optimal solution for each of the scenarios and show how to organize the data flow process using various ETL applications.

Going through the sample implementations of the business scenarios is also a good way to compare ETL tools and see the different approaches to designing the data integration process. This also gives an idea and helps identify strong and weak points of various ETL tools.
ETL business scenarios

We present a number of sample business scenarios which are very often encountered across organizations and need to be addressed. The ETL data integration process plays the major role in handling those problems.
Sample ETL process implementations:
# Surrogate key generation example which includes information on business keys and surrogate keys and shows how to design an ETL process to manage surrogate keys in a data warehouse environment. Sample design in Pentaho Data Integration
# Header and trailer processing - considerations on processing files arranged in blocks consisting of a header record, body items and a trailer. These types of files usually come from mainframes; the format also applies to EDI and EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration
# Loading customers - the customers data extract is placed on an FTP server. It needs to be copied to an ETL server and loaded into the data warehouse. The transformation applies certain business rules and implements SCD to deal with the dimension changes. Sample implementation in Teradata MultiLoad
# Site traffic Data Warehouse - a guide for creation of a data warehouse and data marts for website traffic analysis and reporting. Sample design in Pentaho Kettle
# Data Quality Tests - ETL process design aimed to test and cleanse data in a Data Warehouse. Sample outline in PDI
# XML ETL - ETL processes to handle XML files
# Data allocation ETL process case study for allocating data. Sample Cognos implementation

Processing a header and trailer textfile


Goal

Process a text file which contains records arranged in blocks consisting of a header record, details (items, body) and a trailer. The aim is to normalize records and load them into a relational database structure.

Scenario overview and details

Typically, header and item processing needs to be implemented when processing files that originate from mainframe systems, as well as EDI transmission files, SWIFT schemas and EPIC files.

The input file in our scenario is a stream of records representing invoices. Each invoice consists of:
- A header containing invoice number, dates, customer reference and other details
- One or more items representing ordered products, including an item number, quantity, price and value
- A trailer which contains summary information for all the items

The records are distinguished by the first character in each line: H stands for headers, I for items and T for the trailers.
The lines within each section of the input file are of fixed length (so, for example, all headers have the same number of characters, but the length differs between headers, items and trailers).


Input data:
# A text file in a header-trailer format. The file presented in our example is divided into headers, items and trailers.
- A header starts with a letter H and then contains an invoice number, a customer number, a date and the invoice currency
- Every item starts with a letter I which is followed by a product ID, product age, quantity and net value
- The trailer starts with a T and contains two values which act as a checksum: the total number of invoice lines and the total net value

Output data:
# INVC_HEADER and INVC_LINE relational tables: one with the invoice headers and the other with the invoice lines
# rejects_080629.txt - a text file with the loading errors (a desired output for rejected records)

Generate surrogate key

Goal

Fill in a data warehouse dimension table with data which comes from different source systems and assign a unique record identifier (surrogate key) to each record.
Scenario overview and details

To illustrate this example, we will use two made up sources of information to provide data about customers dimension. Each extract contains customer records with a business key (natural key) assigned to it.

In order to isolate the data warehouse from source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key.
A unique and common surrogate key is a one-field numeric key which is shorter, easier to maintain and understand, and more independent of changes in the source systems than a business key. Also, if the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major effort.

Surrogate key generation mechanism may vary depending on the requirements, however the inputs and outputs usually fit into the design shown below:
Inputs:
- an input represented by an extract from the source system
- datawarehouse table reference for identifying the existing records
- maximum key lookup

Outputs:
- output table or file with newly assigned surrogate keys
- new maximum key
- updated reference table with new records
Proposed solution

Assumptions:
- The surrogate key field for our made up example is WH_CUST_NO.
- To make the example clearer, we will use SCD 1 to handle changing dimensions. This means that new records overwrite the existing data.
The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains mapping between natural keys and surrogate keys
- MAX_KEY - a sequence number which represents last key assignment

Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY sequence increased


The design of an ETL process for generating surrogate keys will be as follows:
# The loading process will be executed twice - once for each of the input files
# Check if the lookup reference data is correct and available:
- CUST_REF table
- MAX_KEY sequence
# Read the extract and first check if a record already exists. If it does, assign the existing surrogate key to it and update the descriptive data in the main dimension table.
# If it is a new record, then:
- populate a new surrogate key and assign it to the record. The new key will be populated by incrementing the old maximum key by 1.
- insert a new record into the customer dimension table (D_CUSTOMER)
- insert a new record into the mapping table (which stores business and surrogate keys mapping)
- update the new maximum key

What is meant by Business Intelligence

What is Business Intelligence?
Business intelligence is a broad set of applications, technologies and knowledge for gathering and analyzing data for the purpose of helping users make better business decisions.
The main challenge of Business Intelligence is to gather and serve organized information regarding all relevant factors that drive the business and enable end-users to access that knowledge easily and efficiently and in effect maximize the success of an organization.

Business intelligence produces analysis and provides in-depth knowledge about performance indicators such as a company's customers, competitors, business counterparts, economic environment and internal operations, to help make effective, good quality business decisions.

From a technical standpoint, the most important areas that Business Intelligence covers are:

DW - Data warehouses - architecture, design, managing, processing and loading (including the ETL process)
Data cleansing and data quality management
Data mining, statistical analysis, forecasting
OLAP - Online Analytical Processing and multidimensional analysis
MIS - Management Information Systems
Reporting - Information visualization and Dashboards
CRM - Customer Relationship Management
DSS - Decision Support Systems

Etl Tools Info portal
ETL-Tools.Info portal provides information about different business intelligence tools and datawarehousing solutions, with a main focus on the ETL process and ETL tools. On our pages you will find an overview and general, high-level information on various Business Intelligence applications and architectures, as well as technical documents with a detailed, low-level description of the solutions.
Particular attention is paid to the Datastage ETL tool, and on the portal you will find a number of Datastage examples, Datastage tutorials, best practices and resolved problems with many real-life examples.
There's also a wide range of information on the rapidly growing Open Source Business Intelligence (OSBI) market, with an emphasis on applications from the Pentaho BI family, along with a Pentaho tutorial.
Recently we also added a new section - the SAS Guide with a tutorial, which illustrates SAS's vision of Business Intelligence.

Tutorials for Datastage

1. Create custom operators for WebSphere DataStage

This is my favourite tutorial because this is a misunderstood and underutilised part of the product. It’s written by Blayne Chard, a software engineer from IBM. Unlike the other tutorials on this page, this one requires an IBM ID to access. If you are like me it usually takes a few goes before you can remember your login!

In this tutorial you learn:

1. How to write a simple DataStage operator
2. How to set up the development environment to compile and run a DataStage operator
3. The basics of the Orchestrate Shell (OSH) scripting language for DataStage jobs
4. How to load your operator into the DataStage Designer so you can use it on any job you create

The DataStage Transformer lets you call a custom routine instead of the standard list of functions on the right mouse click menu. This lets you take code that looks complex in the Transformer and put it into a module to simplify testing and re-use.

Custom routines used to be huge in DataStage, I’ve been to sites with hundreds of them, but they were for DataStage Server Edition. This edition supported routines written in Universe BASIC via a routine editor in the DataStage Manager tool. They were easy to write and easy to test and easy to re-use between jobs.

DataStage Enterprise Edition made things a bit harder. Custom routines are C++ components built and compiled externally to DataStage. There was a lot more fiddling to get the routines recognised by a DataStage Transformer, and C++ tends to be a more challenging language than BASIC.

This is a comprehensive tutorial taking you through the steps of creating and using a routine. Fantastic.

compile your operator

Compiling your operator is a straightforward process once the environment has been set up. To compile the operator, make sure that you have run setup.bat in the current command window, then type the compile commands given in the tutorial.

2. Modify Stage Video Tutorial

The Modify Stage is a tricky parallel stage that converts fields from one metadata definition to another. It does not have any of the fancy GUI help of the Transformer such as right mouse click menus or syntax checking. You need to look up the functions in the Parallel Job Developers Guide, guess what the syntax is and use trial and error half a dozen times until you get it working!

This tutorial saves you a lot of the trial and error.

The tutorial does require premium membership to the dsxchange at US$99 a year but if it saves you a couple days of work that should be enough for your employer to splash out for membership.

3. WebSphere DataStage Parallel Job Tutorial Version 8

The mother of all DataStage tutorials - this is the one that comes on the Information Server installation CDs (or is that DVDs? Or BlueRays?) It’s available from the IBM publications centre and you can download it free of charge as a 1.05M PDF file.

The IBM WebSphere DataStage Parallel Job Tutorial demonstrates how to design and run IBM WebSphere DataStage parallel jobs.

The exercises, sample data, and sample import files for the parallel job tutorial are in the TutorialData\DataStage\parallel_tutorial folder on the suite CD, DVD, or downloaded installation image.

It’s a simple tutorial covering opening and running a job, designing a job, using the Transformer, loading a target table and processing in parallel. Not a lot of interest to experienced DataStage users but useful to a newcomer.

4. Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition

I also reviewed this tutorial previously in DataStage Tip: Extracting database data 250% faster.

As with XML the DB2 connectivity can be a tricky beast to get running. Remote DB2 connections are even trickier. This tutorial will save you a lot of trouble and swearing as you get the wee DB2 beastie running.

5. Transform and integrate data using WebSphere DataStage XML and Web services packs

Getting started with XML processing in DataStage usually involves a lot of fumbling around before you even get a job running. By the end of it you are usually happy to never see another XML file as long as you live. The trick is that DataStage needs to read and validate the entire XML document before it starts processing any part of it as it needs to know how to flatten it into relational data. This is different to database and sequential file sources where DataStage reads and validates one row at a time.

As you can imagine this initial read and validation of the entire file puts a limit on what DataStage can handle in terms of volume. It can process a very large number of small XML files but not one very large XML file. The platform will eventually hit a RAM or resource limit.

This tutorial is invaluable for getting started with XML files in DataStage covering importing XML metadata and using it as a source or target.

DataStage XML Job Example

6. A flexible data integration architecture using WebSphere DataStage and WebSphere Federation Server

IBM wants us to combine the different products on the Information Server for a best of breed data integration platform. One of the themes of the new analytical warehouse concepts such as the IBM Dynamic Warehouse is the use of unstructured or semi structured text. DataStage and the Federation Server are a good combination for bringing this into a warehouse.

IBM are calling this a T-ETL architecture where the Federation server is a data pre-processor.

The T-ETL architecture proposed uses federation to join, aggregate, and filter data before it enters WebSphere DataStage, with WebSphere DataStage using its parallel engine to perform more complex transformations and the maintenance of the target.

Federation DataStage Job Example

This is a generous tutorial with a lot of use cases for T-ETL, ETL and ELT.


The strength of WebSphere DataStage as a data consolidation tool lies in its flexibility -- it is able to support the many different consolidation scenarios and flavors of ETL, including ETL, ELT, and TEL (against a single data source). Combining WebSphere DataStage with WebSphere Federation Server opens up a whole new area of consolidation scenarios, namely T-ETL, against homogenous or heterogeneous data sources.

7. Access application data using WebSphere Federation Server and WebSphere DataStage

I reviewed this tutorial in a previous blog post: DataStage Tip: Free IBM DataStage SAP Tutorial

As with the last tutorial it uses a combination of DataStage and the Federation Server to access atypical enterprise data, in this case SAP.

The Top 7 Online DataStage Tutorials

These are the top 7 online DataStage Tutorials for learning DataStage, becoming an expert or studying for DataStage certification.

IBM has been putting new content into the crevices and caves of the ibm.com website for some months now and some trawling has turned up half a dozen handy tutorials.

Here are the top 7 DataStage tutorials; if you keep reading you will see a review of each tutorial and why you would need it:

1. Create custom operators for WebSphere DataStage
2. Modify Stage Video Tutorial (requires dsxchange premium membership)
3. WebSphere DataStage Parallel Job Tutorial Version 8
4. Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition
5. Transform and integrate data using WebSphere DataStage XML and Web services packs
6. A flexible data integration architecture using WebSphere DataStage and WebSphere Federation Server
7. Access application data using WebSphere Federation Server and WebSphere DataStage

Study Guide for DataStage Certification

This entry is a comprehensive guide to preparing for the DataStage 7.5 Certification exam.

Regular readers may be feeling a sense of déjà vu. Haven’t we seen this post before? I originally posted this in 2006 and this is the Director's Cut - I’ve added some deleted scenes, a commentary for DataStage 7.0 and 8.0 users and generally improved the entry. By reposting I retain the links from other sites such as my DataStage Certification Squidoo lens with links to my certification blog entries and IBM certification pages.

This post shows all of the headings from the IBM exam Objectives and describes how to prepare for that section.

Before you start, work out how you add environment variables to a job as job parameters, as they are handy for exercises and testing. See the DataStage Designer Guide for details.

Section 1 - Installation and Configuration

Versions: Version 7.5.1 and 7.5.2 are the best to study and run exercises on. Version 6.x is risky but mostly the same as 7. Version 8.0 is no good for any type of installation and configuration preparation as it has a new approach to installation and user security.

Reading: Read the Installation and Upgrade Guide for DataStage, especially the section on parallel installation. Read the pre-requisites for each type of install such as users and groups, the compiler, project locations, kernel settings for each platform. Make sure you know what goes into the dsenv file. Read the section on DataStage for USS as you might get one USS question. Do a search for threads on dsenv on the dsxchange forum to become familiar with how this file is used in different production environments.

Exercise: installing your own DataStage Server Enterprise Edition is the best exercise - getting it to connect to Oracle, DB2 and SQL Server is also beneficial. Run the DataStage Administrator and create some users and roles and give them access to DataStage functions.

Section 4 - Parallel Architecture (10%)

Section 9 - Monitoring and Troubleshooting (10%)

I’ve moved sections 4 and 9 up to the front as you need to study them before you run exercises and read about parallel stages in the other sections. Understanding how to use and monitor parallel jobs is worth a whopping 20% so it’s a good one to know well.

Versions: you can study this using DataStage 6, 7 and 8. Version 8 has the best definition of the parallel architecture with better diagrams.

Reading: Parallel Job Developers Guide opening chapters on what the parallel engine and job partitioning is all about. Read about each partitioning type. Read how sequential file stages partition or repartition data and why datasets don’t. The Parallel Job Advanced Developers Guide has sections on environment variables to help with job monitoring, read about every parameter with the word SCORE in it. The DataStage Director Guide describes how to run job monitoring - use the right mouse click menu on the job monitor window to see extra parallel information.

Exercises: Turn on various monitoring environment variables such as APT_PM_SHOW_PIDS and APT_DUMP_SCORE so you can see what happens during your exercises. It shows you what really runs in a job - the extra processes that get added across parallel nodes.

Try creating one node, two node and four node config files and see how jobs behave under each one. Try the remaining exercises on a couple of different configurations by adding the configuration environment variable to the job. Try some pooling options. I have to admit I guessed my way through some of the pooling questions as I didn't do many exercises.

Generate a set of rows into a sequential file for testing out the various partitioning types: one column with unique ids 1 to 100 and a second column with repeating codes such as A, A, A, A, A, B, B, B, B, B etc. Write a job that reads from the input, sends it through a partitioning stage such as a Transformer and writes it to a Peek stage. The Director log shows which rows went where. You should also view the Director monitor and expand it to show the row counts for each instance of each stage, to see how stages are split across nodes and how many rows each instance gets.

Use a filter stage to split the rows down two paths and bring them back together with a funnel stage, then replace the funnel with a collector stage. Compare the two.

Test yourself on estimating how many processes will be created by a job and check the result after the job has run using the Director monitor or log messages. Do this throughout all your exercises across all sections as a habit.

Section 2 - Metadata

Section 3 - Persistent Storage

I’ve merged these into one. Both sections talk about sequential files, datasets, XML files and Cobol files.

Versions: you can study this on DataStage 6, 7 or 8 as it is a narrow focus on DataStage Designer metadata. DataStage 8 will have slightly different CFF options but not enough to cause a problem.

Reading: read the section in the DataStage Developers Guide on Orchestrate schemas and partial schemas. Read the plugin guide for the Complex Flat File stage to understand how COBOL metadata is imported (if you don't have any COBOL copybooks around you will just have to read about them and not do any exercises). Quickly scan through the NLS guide - but don't expect any hard questions on this.

Exercises: Step through the IBM XML tutorial to get the tricky part on reading XML files. Find an XML file and do various exercises reading it and writing it to a sequential file. Switch between different key fields to see the impact of the key on the flattening of the XML hierarchy. Don’t worry too much about XML Transform.

Import from a database using the Orchestrate Import and the Plugin Import and compare the table definitions. Run an exercise on column propagation using the Peek stage where a partial schema is written to the Peek stage to reveal the propagated columns.

Create a job using the Row Generator stage. Define some columns, and on the columns tab double-click a column to bring up the advanced column properties. Use some properties to generate values for different data types. Get to know the advanced properties page.

Create a really large sequential file, dataset and fileset and use each as a reference in a lookup stage. Monitor the Resource and Scratch directories as the job runs to see how these lookup sources are prepared prior to and during a job run. Get to know the difference between the lookup fileset and other sources for a lookup stage.

Section 5 - Databases (15%)

One of the more difficult topics if you get questions for a database you are not familiar with. I got one database parallel connectivity question that I still can’t find the answer to in any of the manuals.

Versions: DataStage 7.x any version or at a pinch DataStage 8. Earlier versions do not have enough database stages and DataStage 8 has a new approach to database connections.

Reading: read the plugin guide for each enterprise database stage: Oracle, SQL Server, DB2 and ODBC. In version 8 read the improved Connectivity Guides for these targets. If you have time you can dig deeper, the Parallel Job Developers Guide and/or the Advanced Developers Guide has a section on the Oracle/DB2/Informix/Teradata/Sybase/SQL Server interface libraries. Look for the section called "Operator action" and read it for each stage. It’s got interesting bits like whether the stage can run in parallel, how it converts data and handles record sizes.

Exercise: Add each Enterprise database stage to a parallel job as both an input and output stage. Go in and fiddle around with all the different types of read and write options. You don’t need to get a connection working or have access to that database, you just need to have the stage installed and add it to your job. Look at the differences between insert/update/add/load etc. Look at the different options for each database stage. If you have time and a database try some loads to a database table.

Section 6 - Data Transformation (15%)

If you've used DataStage for longer than a year this is probably the topic you are going to ace - as long as you have made some use of stage variables.

Versions: should be okay studying on versions 6, 7 or 8. Transformation stages such as Transformer, Filter and Modify have not changed much.

Reading: there is more value in using the transformation stages than in reading about them. The Transformer stage is easier to navigate and understand when you are actually using it than when you are just reading about it. If you have to make do with reading, then visit the dsxchange and look for threads on stage variables, the FAQ on the parallel number generator, removing duplicates using a Transformer, and questions in the parallel forum on null handling. These threads will be better than the manuals as they are full of practical examples. Read the Parallel Job Advanced Developers Guide section on "Specifying your own parallel stages".

Exercises: Focus on Transformer, Modify Stage (briefly), Copy Stage (briefly) and Filter Stage.

Create some mixed-up source data with duplicate rows and a multi-field key. Try to remove the duplicates using a Sort stage followed by a Transformer, with a combination of stage variables holding the prior row's key value to compare against the current row's key value.
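A rough sketch of the stage variable logic (evaluated top to bottom for each row; the link, column and variable names here are invented for illustration), assuming the data arrives sorted and partitioned on the key:

svIsNewKey: If In.KeyCol <> svPrevKey Then 1 Else 0
svPrevKey: In.KeyCol
Output link constraint: svIsNewKey = 1

The ordering matters - svIsNewKey must sit above svPrevKey so the comparison happens before the previous key value is overwritten with the current one.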

Process some data that has nulls in it. Use the null column in a Transformer concatenate function with and without a nulltovalue function and with and without a reject link from the Transformer. This gives you an understanding of how rows get dropped and/or trapped from a transformer. Explore the right mouse click menu in the Transformer, output some of the DS Macro values and System Variables to a peek stage and think of uses for them in various data warehouse scenarios. Ignore DS Routine, it’s not on the test.
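As a rough illustration of the difference (link and column names invented, with Col1 assumed nullable), compare two derivations for the same target column:

In.Col1 : "-" : In.Col2 --- a null in Col1 makes the whole expression null, so the row is dropped or sent down the reject link
NullToValue(In.Col1, "unknown") : "-" : In.Col2 --- the null is substituted first, so the row survives

The colon here is the concatenation operator.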

Don't spend much time on the Modify stage - it would take forever to memorize all its functions. Just do an exercise on handle_null, string_trim and converting a string to a number. It can be tricky to get working and you might not even get a question about it.

Section 7 - Combining and Sorting Data (10%)

Section 10 - Job Design (10%)

I've combined these since they overlap. Don't overlook this section: it covers a very narrow range of functionality, so it is an easy set of questions to prepare for and get right - there are easy points on offer here.

Versions: Any version 7.x is best, version 6 has a completely different lookup stage, version 8 can be used but remember that the Range lookup functionality is new.

Reading: The Parallel Job Developers Guide has a table showing the differences between the lookup, merge and join stages. Try to memorize the parts of this table about inputs and outputs and reject links. This is a good place to learn about some more environment variables. Read the Parallel Job Advanced Developers Guide looking for any environment variables with the word SORT or SCORE in them.

Exercises: Compare the way join, lookup and merge work. Create a job that switches between each type.

Add various combinations of the SORT and COMBINE_OPERATORS environment variables to your job and examine the score of the job to see what sorts get added to your job at run time. A simple job with a Remove Duplicates or Join stage and no sorting will add a sort into the job - even if the data comes from a sorted database source. Use the SORT variables to turn off this sort insertion. See what happens to Transformer, Lookup and Copy stages when COMBINE_OPERATORS is turned on or off using the SCORE log entry.

Create an annotation and a description annotation and explore the differences between the two. Use the Multiple Compile tool in the DataStage Manager (version 6, 7) or Designer (version 8). Create and use shared containers.

Section 8 - Automation and Production Deployment (10%)

Versions: most deployment methods have remained the same across versions 6, 7 and 8. Version 8 has the same import and export functions. DataStage 8 parameter sets will not be in the version 7 exam.

Reading: you don't need to install or use the Version Control tool to pass this section; however, you should read the PDF that comes with it to understand the IBM recommendations for deployment. It covers the move from dev to test to prod. Read the Server Job Developers Guide section on command line calls to DataStage such as dsjob, dssearch and dsadmin.

Exercises: practice a few dsjob and dssearch commands. Practice saving a log to a text file. Create some job specific environment variables.
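As a rough sketch of the sort of commands worth practising (the project, job and file names here are invented; check the exact options in the Server Job Developers Guide):

dsjob -run -jobstatus MyProject MyJob --- runs a job and waits so it can return the job status
dsjob -logsum MyProject MyJob > myjob_log.txt --- saves a summary of the job log to a text file
dsjob -ljobs MyProject --- lists the jobs in a project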

FAQs for the .NET Framework

1. Introduction

1.1 What is .NET?

.NET is a general-purpose software development platform, similar to Java. At its core is a virtual machine that turns intermediate language (IL) into machine code. High-level language compilers for C#, VB.NET and C++ are provided to turn source code into IL. C# is a new programming language, very similar to Java. An extensive class library is included, featuring all the functionality one might expect from a contemporary development platform - Windows GUI development (Windows Forms), database access (ADO.NET), web development (ASP.NET), web services, XML etc.

See also Microsoft's definition.

1.2 When was .NET announced?

Bill Gates delivered a keynote at Forum 2000, held June 22, 2000, outlining the .NET 'vision'. The July 2000 PDC had a number of sessions on .NET technology, and delegates were given CDs containing a pre-release version of the .NET framework/SDK and Visual Studio.NET.
1.3 What versions of .NET are there?

The final versions of the 1.0 SDK and runtime were made publicly available around 6pm PST on 15-Jan-2002. At the same time, the final version of Visual Studio.NET was made available to MSDN subscribers.

.NET 1.1 was released in April 2003, and was mostly bug fixes for 1.0.

.NET 2.0 was released to MSDN subscribers in late October 2005, and was officially launched in early November.
1.4 What operating systems does the .NET Framework run on?

The runtime supports Windows Server 2003, Windows XP, Windows 2000, NT4 SP6a and Windows ME/98. Windows 95 is not supported. Some parts of the framework do not work on all platforms - for example, ASP.NET is only supported on XP and Windows 2000/2003. Windows 98/ME cannot be used for development.

IIS is not supported on Windows XP Home Edition, and so cannot be used to host ASP.NET. However, the ASP.NET Web Matrix web server does run on XP Home.

The .NET Compact Framework is a version of the .NET Framework for mobile devices, running Windows CE or Windows Mobile.

The Mono project has a version of the .NET Framework that runs on Linux.
1.5 What tools can I use to develop .NET applications?

There are a number of tools, described here in ascending order of cost:
The .NET Framework SDK is free and includes command-line compilers for C++, C#, and VB.NET and various other utilities to aid development.
SharpDevelop is a free IDE for C# and VB.NET.
Microsoft Visual Studio Express editions are cut-down versions of Visual Studio, for hobbyist or novice developers. There are different versions for C#, VB, web development etc. Originally the plan was to charge $49, but MS has decided to offer them as free downloads instead, at least until November 2006.
Microsoft Visual Studio Standard 2005 is around $300, or $200 for the upgrade.
Microsoft Visual Studio Professional 2005 is around $800, or $550 for the upgrade.
At the top end of the price range are the Microsoft Visual Studio Team Edition for Software Developers 2005 with MSDN Premium and Team Suite editions.

You can see the differences between the various Visual Studio versions here.
1.6 Why did they call it .NET?

I don't know what they were thinking. They certainly weren't thinking of people using search tools. It's meaningless marketing nonsense.
2. Terminology

2.1 What is the CLI? Is it the same as the CLR?

The CLI (Common Language Infrastructure) is the definition of the fundamentals of the .NET framework - the Common Type System (CTS), metadata, the Virtual Execution System (VES) and its use of intermediate language (IL), and the support of multiple programming languages via the Common Language Specification (CLS). The CLI is documented through ECMA - see http://msdn.microsoft.com/net/ecma/ for more details.

The CLR (Common Language Runtime) is Microsoft's primary implementation of the CLI. Microsoft also have a shared source implementation known as ROTOR, for educational purposes, as well as the .NET Compact Framework for mobile devices. Non-Microsoft CLI implementations include Mono and DotGNU Portable.NET.
2.2 What is the CTS, and how does it relate to the CLS?

CTS = Common Type System. This is the full range of types that the .NET runtime understands. Not all .NET languages support all the types in the CTS.

CLS = Common Language Specification. This is a subset of the CTS which all .NET languages are expected to support. The idea is that any program which uses CLS-compliant types can interoperate with any .NET program written in any language. This interop is very fine-grained - for example a VB.NET class can inherit from a C# class.
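As a small sketch of what this means in practice (the class and members here are made up), marking an assembly as CLS-compliant makes the C# compiler warn when a public member uses a type that is in the CTS but outside the CLS:

using System;

[assembly: CLSCompliant(true)]

public class Measurements
{
    // uint (System.UInt32) is a CTS type but not a CLS type, so exposing it
    // publicly from a CLS-compliant assembly draws a compiler warning.
    public uint RawValue() { return 42; }

    // int (System.Int32) is CLS-compliant, so callers written in any
    // .NET language can use this member.
    public int Value() { return 42; }
}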
2.3 What is IL?

IL = Intermediate Language. Also known as MSIL (Microsoft Intermediate Language) or CIL (Common Intermediate Language). All .NET source code (of any language) is compiled to IL during development. The IL is then converted to machine code at the point where the software is installed, or (more commonly) at run-time by a Just-In-Time (JIT) compiler.
2.4 What is C#?

C# is a new language designed by Microsoft to work with the .NET framework. In their "Introduction to C#" whitepaper, Microsoft describe C# as follows:

"C# is a simple, modern, object oriented, and type-safe programming language derived from C and C++. C# (pronounced “C sharp”) is firmly planted in the C and C++ family tree of languages, and will immediately be familiar to C and C++ programmers. C# aims to combine the high productivity of Visual Basic and the raw power of C++."

Substitute 'Java' for 'C#' in the quote above, and you'll see that the statement still works pretty well :-).

If you are a C++ programmer, you might like to check out my C# FAQ.
2.5 What does 'managed' mean in the .NET context?

The term 'managed' is the cause of much confusion. It is used in various places within .NET, meaning slightly different things.

Managed code: The .NET framework provides several core run-time services to the programs that run within it - for example exception handling and security. For these services to work, the code must provide a minimum level of information to the runtime. Such code is called managed code.

Managed data: This is data that is allocated and freed by the .NET runtime's garbage collector.

Managed classes: This is usually referred to in the context of Managed Extensions (ME) for C++. When using ME C++, a class can be marked with the __gc keyword. As the name suggests, this means that the memory for instances of the class is managed by the garbage collector, but it also means more than that. The class becomes a fully paid-up member of the .NET community with the benefits and restrictions that brings. An example of a benefit is proper interop with classes written in other languages - for example, a managed C++ class can inherit from a VB class. An example of a restriction is that a managed class can only inherit from one base class.
2.6 What is reflection?

All .NET compilers produce metadata about the types defined in the modules they produce. This metadata is packaged along with the module (modules in turn are packaged together in assemblies), and can be accessed by a mechanism called reflection. The System.Reflection namespace contains classes that can be used to interrogate the types for a module/assembly.

Using reflection to access .NET metadata is very similar to using ITypeLib/ITypeInfo to access type library data in COM, and it is used for similar purposes - e.g. determining data type sizes for marshaling data across context/process/machine boundaries.

Reflection can also be used to dynamically invoke methods (see System.Type.InvokeMember), or even create types dynamically at run-time (see System.Reflection.Emit.TypeBuilder).
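For instance, here is a minimal C# sketch (the Greeter type and its method are invented for illustration) that lists a type's methods and then invokes one of them by name at run-time:

using System;
using System.Reflection;

public class Greeter
{
    public string Greet( string name ) { return "Hello, " + name; }
}

public class ReflectionDemo
{
    public static void Main()
    {
        Type t = typeof(Greeter);

        // Interrogate the metadata: list the public methods of the type.
        foreach( MethodInfo m in t.GetMethods() )
            Console.WriteLine( m.Name );

        // Create an instance and invoke a method dynamically by name.
        object obj = Activator.CreateInstance( t );
        object result = t.InvokeMember( "Greet", BindingFlags.InvokeMethod,
            null, obj, new object[] { "world" } );
        Console.WriteLine( result );
    }
}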
3. Assemblies

3.1 What is an assembly?

An assembly is sometimes described as a logical .EXE or .DLL, and can be an application (with a main entry point) or a library. An assembly consists of one or more files (dlls, exes, html files etc), and represents a group of resources, type definitions, and implementations of those types. An assembly may also contain references to other assemblies. These resources, types and references are described in a block of data called a manifest. The manifest is part of the assembly, thus making the assembly self-describing.

An important aspect of assemblies is that they are part of the identity of a type. The identity of a type is the assembly that houses it combined with the type name. This means, for example, that if assembly A exports a type called T, and assembly B exports a type called T, the .NET runtime sees these as two completely different types. Furthermore, don't get confused between assemblies and namespaces - namespaces are merely a hierarchical way of organising type names. To the runtime, type names are type names, regardless of whether namespaces are used to organise the names. It's the assembly plus the typename (regardless of whether the type name belongs to a namespace) that uniquely identifies a type to the runtime.

Assemblies are also important in .NET with respect to security - many of the security restrictions are enforced at the assembly boundary.

Finally, assemblies are the unit of versioning in .NET - more on this below.
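A quick way to see that the assembly really is part of a type's identity is to print a type's assembly-qualified name - a minimal sketch:

using System;

public class TypeIdentityDemo
{
    public static void Main()
    {
        // The assembly-qualified name combines the type name with the name,
        // version, culture and public key token of the assembly that houses it.
        Console.WriteLine( typeof(string).AssemblyQualifiedName );
    }
}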
3.2 How can I produce an assembly?

The simplest way to produce an assembly is directly from a .NET compiler. For example, the following C# program:
public class CTest
{
    public CTest() { System.Console.WriteLine( "Hello from CTest" ); }
}

can be compiled into a library assembly (dll) like this:
csc /t:library ctest.cs

You can then view the contents of the assembly by running the "IL Disassembler" tool that comes with the .NET SDK.

Alternatively you can compile your source into modules, and then combine the modules into an assembly using the assembly linker (al.exe). For the C# compiler, the /target:module switch is used to generate a module instead of an assembly.
3.3 What is the difference between a private assembly and a shared assembly?

Location and visibility: A private assembly is normally used by a single application, and is stored in the application's directory, or a sub-directory beneath. A shared assembly is normally stored in the global assembly cache, which is a repository of assemblies maintained by the .NET runtime. Shared assemblies are usually libraries of code which many applications will find useful, e.g. the .NET framework classes.
Versioning: The runtime enforces versioning constraints only on shared assemblies, not on private assemblies.

3.4 How do assemblies find each other?

By searching directory paths. There are several factors which can affect the path (such as the AppDomain host, and application configuration files), but for private assemblies the search path is normally the application's directory and its sub-directories. For shared assemblies, the search path is normally the same as the private assembly path plus the shared assembly cache.
3.5 How does assembly versioning work?

Each assembly has a version number called the compatibility version. Also each reference to an assembly (from another assembly) includes both the name and version of the referenced assembly.

The version number has four numeric parts (e.g. 5.5.2.33). Assemblies with either of the first two parts different are normally viewed as incompatible. If the first two parts are the same, but the third is different, the assemblies are deemed as 'maybe compatible'. If only the fourth part is different, the assemblies are deemed compatible. However, this is just the default guideline - it is the version policy that decides to what extent these rules are enforced. The version policy can be specified via the application configuration file.

Remember: versioning is only applied to shared assemblies, not private assemblies.
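The compatibility version itself is set with an assembly-level attribute, normally placed in AssemblyInfo.cs - a minimal sketch (the version number is just an example):

using System.Reflection;

// Sets the four-part compatibility version of the assembly. References from
// other assemblies record this number, and the version policy is applied
// against it when the assembly is shared.
[assembly: AssemblyVersion( "5.5.2.33" )]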
3.6 How can I develop an application that automatically updates itself from the web?
For .NET 1.x, use the Updater Application Block. For .NET 2.x, use ClickOnce.
4. Application Domains

4.1 What is an application domain?

An AppDomain can be thought of as a lightweight process. Multiple AppDomains can exist inside a Win32 process. The primary purpose of the AppDomain is to isolate applications from each other, and so it is particularly useful in hosting scenarios such as ASP.NET. An AppDomain can be destroyed by the host without affecting other AppDomains in the process.

Win32 processes provide isolation by having distinct memory address spaces. This is effective, but expensive. The .NET runtime enforces AppDomain isolation by keeping control over the use of memory - all memory in the AppDomain is managed by the .NET runtime, so the runtime can ensure that AppDomains do not access each other's memory.

One non-obvious use of AppDomains is for unloading types. Currently the only way to unload a .NET type is to destroy the AppDomain it is loaded into. This is particularly useful if you create and destroy types on-the-fly via reflection.

Microsoft have an AppDomain FAQ.
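As a small sketch of the unloading point (the domain name and the comment about loading are just illustrative):

using System;

public class UnloadDemo
{
    public static void Main()
    {
        AppDomain ad = AppDomain.CreateDomain( "Scratch domain" );

        // ... load assemblies and create types inside ad, e.g. via reflection ...

        // Unloading the domain is currently the only way to unload the types
        // that were loaded into it.
        AppDomain.Unload( ad );
    }
}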
4.2 How does an AppDomain get created?

AppDomains are usually created by hosts. Examples of hosts are the Windows Shell, ASP.NET and IE. When you run a .NET application from the command-line, the host is the Shell. The Shell creates a new AppDomain for every application.

AppDomains can also be explicitly created by .NET applications. Here is a C# sample which creates an AppDomain, creates an instance of an object inside it, and then executes one of the object's methods:
using System;
using System.Runtime.Remoting;
using System.Reflection;

public class CAppDomainInfo : MarshalByRefObject
{
    public string GetName() { return AppDomain.CurrentDomain.FriendlyName; }
}

public class App
{
    public static int Main()
    {
        AppDomain ad = AppDomain.CreateDomain( "Andy's new domain" );
        CAppDomainInfo adInfo = (CAppDomainInfo)ad.CreateInstanceAndUnwrap(
            Assembly.GetCallingAssembly().GetName().Name, "CAppDomainInfo" );
        Console.WriteLine( "Created AppDomain name = " + adInfo.GetName() );
        return 0;
    }
}


4.3 Can I write my own .NET host?

Yes. For an example of how to do this, take a look at the source for the dm.net moniker developed by Jason Whittington and Don Box. There is also a code sample in the .NET SDK called CorHost.

Unix Basics

Basic UNIX commands
Note: not all of these are actually part of UNIX itself, and you may not find them on all UNIX machines. But they can all be used on turing in essentially the same way, by typing the command and hitting return. Note that some of these commands are different on non-Solaris machines - see SunOS differences.
If you've made a typo, the easiest thing to do is hit CTRL-u to cancel the whole line. But you can also edit the command line (see the guide to More UNIX).
UNIX is case-sensitive.
Files
• ls --- lists your files
ls -l --- lists your files in 'long format', which contains lots of useful information, e.g. the exact size of the file, who owns the file and who has the right to look at it, and when it was last modified.
ls -a --- lists all files, including the ones whose filenames begin in a dot, which you do not always want to see.
There are many more options, for example to list files by size, by date, recursively etc.
• more filename --- shows the first part of a file, just as much as will fit on one screen. Just hit the space bar to see more or q to quit. You can use /pattern to search for a pattern.
• emacs filename --- is an editor that lets you create and edit a file. See the emacs page.
• mv filename1 filename2 --- moves a file (i.e. gives it a different name, or moves it into a different directory - see below)
• cp filename1 filename2 --- copies a file
• rm filename --- removes a file. It is wise to use the option rm -i, which will ask you for confirmation before actually deleting anything. You can make this your default by making an alias in your .cshrc file.
• diff filename1 filename2 --- compares files, and shows where they differ
• wc filename --- tells you how many lines, words, and characters there are in a file
• chmod options filename --- lets you change the read, write, and execute permissions on your files. The default is that only you can look at them and change them, but you may sometimes want to change these permissions. For example, chmod o+r filename will make the file readable for everyone, and chmod o-r filename will make it unreadable for others again. Note that for someone to be able to actually look at the file the directories it is in need to be at least executable. See help protection for more details.
• File Compression
o gzip filename --- compresses files, so that they take up much less space. Usually text files compress to about half their original size, but it depends very much on the size of the file and the nature of the contents. There are other tools for this purpose, too (e.g. compress), but gzip usually gives the highest compression rate. Gzip produces files with the ending '.gz' appended to the original filename.
o gunzip filename --- uncompresses files compressed by gzip.
o gzcat filename --- lets you look at a gzipped file without actually having to gunzip it (same as gunzip -c). You can even print it directly, using gzcat filename | lpr
• printing
o lpr filename --- print. Use the -P option to specify the printer name if you want to use a printer other than your default printer. For example, if you want to print double-sided, use 'lpr -Pvalkyr-d', or if you're at CSLI, you may want to use 'lpr -Pcord115-d'. See 'help printers' for more information about printers and their locations.
o lpq --- check out the printer queue, e.g. to get the number needed for removal, or to see how many other files will be printed before yours will come out
o lprm jobnumber --- remove something from the printer queue. You can find the job number by using lpq. Theoretically you also have to specify a printer name, but this isn't necessary as long as you use your default printer in the department.
o genscript --- converts plain text files into postscript for printing, and gives you some options for formatting. Consider making an alias like alias ecop 'genscript -2 -r \!* | lpr -h -Pvalkyr' to print two pages on one piece of paper.
o dvips filename --- print .dvi files (i.e. files produced by LaTeX). You can use dviselect to print only selected pages. See the LaTeX page for more information about how to save paper when printing drafts.
Directories
Directories, like folders on a Macintosh, are used to group files together in a hierarchical structure.
• mkdir dirname --- make a new directory
• cd dirname --- change directory. You basically 'go' to another directory, and you will see the files in that directory when you do 'ls'. You always start out in your 'home directory', and you can get back there by typing 'cd' without arguments. 'cd ..' will get you one level up from your current position. You don't have to walk along step by step - you can make big leaps or avoid walking around by specifying pathnames.
• pwd --- tells you where you currently are.
Finding things
• ff --- find files anywhere on the system. This can be extremely useful if you've forgotten in which directory you put a file, but do remember the name. In fact, if you use ff -p you don't even need the full name, just the beginning. This can also be useful for finding other things on the system, e.g. documentation.
• grep string filename(s) --- looks for the string in the files. This can be useful for a lot of purposes, e.g. finding the right file among many, figuring out which is the right version of something, and even doing serious corpus work. grep comes in several varieties (grep, egrep, and fgrep) and has a lot of very flexible options. Check out the man pages if this sounds good to you.
About other people
• w --- tells you who's logged in, and what they're doing. Especially useful: the 'idle' part. This allows you to see whether they're actually sitting there typing away at their keyboards right at the moment.
• who --- tells you who's logged on, and where they're coming from. Useful if you're looking for someone who's actually physically in the same building as you, or in some other particular location.
• finger username --- gives you lots of information about that user, e.g. when they last read their mail and whether they're logged in. Often people put other practical information, such as phone numbers and addresses, in a file called .plan. This information is also displayed by 'finger'.
• last -1 username --- tells you when the user last logged on and off and from where. Without any options, last will give you a list of everyone's logins.
• talk username --- lets you have a (typed) conversation with another user
• write username --- lets you exchange one-line messages with another user
• elm --- lets you send e-mail messages to people around the world (and, of course, read them). It's not the only mailer you can use, but the one we recommend. See the elm page, and find out about the departmental mailing lists (which you can also find in /user/linguistics/helpfile).
About your (electronic) self
• whoami --- returns your username. Sounds useless, but isn't. You may need to find out who it is who forgot to log out somewhere, and make sure *you* have logged out.
• finger & .plan files
of course you can finger yourself, too. That can be useful e.g. as a quick check whether you got new mail. Try to create a useful .plan file soon. Look at other people's .plan files for ideas. The file needs to be readable for everyone in order to be visible through 'finger'. Do 'chmod a+r .plan' if necessary. You should realize that this information is accessible from anywhere in the world, not just to other people on turing.
• passwd --- lets you change your password, which you should do regularly (at least once a year). See the LRB guide and/or look at help password.
• ps -u yourusername --- lists your processes. Contains lots of information about them, including the process ID, which you need if you have to kill a process. Normally, when you have been kicked out of a dialin session or have otherwise managed to get yourself disconnected abruptly, this list will contain the processes you need to kill. Those may include the shell (tcsh or whatever you're using), and anything you were running, for example emacs or elm. Be careful not to kill your current shell - the one with the number closer to the one of the ps command you're currently running. But if it happens, don't panic. Just try again :) If you're using an X-display you may have to kill some X processes before you can start them again. These will show only when you use ps -efl, because they're root processes.
• kill PID --- kills (ends) the processes with the ID you gave. This works only for your own processes, of course. Get the ID by using ps. If the process doesn't 'die' properly, use the option -9. But attempt without that option first, because it doesn't give the process a chance to finish possibly important business before dying. You may need to kill processes for example if your modem connection was interrupted and you didn't get logged out properly, which sometimes happens.
• quota -v --- show what your disk quota is (i.e. how much space you have to store files), how much you're actually using, and in case you've exceeded your quota (which you'll be given an automatic warning about by the system) how much time you have left to sort them out (by deleting or gzipping some, or moving them to your own computer).
• du filename --- shows the disk usage of the files and directories in filename (without argument the current directory is used). du -s gives only a total.
• last yourusername --- lists your last logins. Can be a useful memory aid for when you were where, how long you've been working for, and keeping track of your phonebill if you're making a non-local phonecall for dialling in.
Connecting to the outside world
• nn --- allows you to read news. It will first let you read the news local to turing, and then the remote news. If you want to read only the local or remote news, you can use nnl or nnr, respectively. To learn more about nn type nn, then :man, then =.*, then Z, then hit the space bar to step through the manual. Or look at the man page. Or check out the hypertext nn FAQ - probably the easiest and most fun way to go.
• rlogin hostname --- lets you connect to a remote host
• telnet hostname --- also lets you connect to a remote host. Use rlogin whenever possible.
• ftp hostname --- lets you download files from a remote host which is set up as an ftp-server. This is a common method for exchanging academic papers and drafts. If you need to make a paper of yours available in this way, you can (temporarily) put a copy in /user/ftp/pub/TMP. For more permanent solutions, ask Emma. The most important commands within ftp are get for getting files from the remote machine, and put for putting them there (mget and mput let you specify more than one file at once). Sounds straightforward, but be sure not to confuse the two, especially when your physical location doesn't correspond to the direction of the ftp connection you're making. ftp just overwrites files with the same filename. If you're transferring anything other than ASCII text, use binary mode.
• lynx --- lets you browse the web from an ordinary terminal. Of course you can see only the text, not the pictures. You can type any URL as an argument to the G command. When you're doing this from any Stanford host you can leave out the .stanford.edu part of the URL when connecting to Stanford URLs. Type H at any time to learn more about lynx, and Q to exit.
Miscellaneous tools
• webster word --- looks up the word in an electronic version of Webster's dictionary and returns the definition(s)
• date --- shows the current date and time.
• cal --- shows a calendar of the current month. Use e.g., 'cal 10 1995' to get that for October 95, or 'cal 1995' to get the whole year.
password
When you received your username and password from the consulting desk, the password was a generated password. You should personalize your own password so that it is easy to remember without being something others could easily guess. It should not be a word that is in a dictionary (as a program can be written to try all the words in a dictionary in seconds). Your password has to be at least 6 characters. It can be letters (UPPER case or lower case) or numbers. CHANGE IT!
DO NOT give your password to others. This is an account and you are responsible for what is done on this account. If you give the password to someone else and they do something wrong, your account information will be pointing to you. DO NOT give your password to others.
If or when you change your password, it might take a few minutes for the new password to be activated. There are several changes that must take place on several computers. Be patient. If you just changed your password and the new password doesn't work yet, the old password is still active and it will work.
type this in --- this is what it will do
sunburst% passwd --- you will be prompted to enter your old password
old password --- enter your old password here
new password --- you will be asked to enter a new password
new password --- type it again for verification
________________________________________
ls (list schema)
The ls command displays the files and directories in your current working directory. If there are no visible files in your directory, the UNIX prompt will be returned. ls does not display hidden files which begin with a . (period). To display all files you must include the option -a.
type this in --- this is what it will do
sunburst% ls --- displays the files and directories in your current directory
sunburst% ls -a --- displays all files and directories (even the hidden ones)
sunburst% ls -l --- displays all files and directories with a full description
sunburst% dir --- same as ls -l (an alias for DOS people)
sunburst% ls -R --- displays all files and directories in the current directory and below (recursive)
You should not delete or change any hidden files. These files are created for administrative use of your e-mail, news readers, login, etc.
________________________________________
mkdir (make directory)
The mkdir command will create a subdirectory inside of your current working directory. If you are logged into sunburst go ahead and try the command as follows.
type this in --- this is what will happen
sunburst% mkdir tutor_stuff --- this will create a subdirectory called tutor_stuff
Now let us look to make sure the directory was actually created.
sunburst% ls --- should display the files and subdirectories in your current working directory (including the subdirectory you just created)
________________________________________
cd (change directories)
The cd command will change the current working directory to the location or path you type.
sunburst% cd tutor_stuff --- change the working directory to tutor_stuff
sunburst% cd .. --- cd followed by .. will take you to the parent of the current directory
sunburst% cd --- cd by itself will take you to your home directory, no matter where you currently are
sunburst% cd tutor_stuff --- change the working directory to tutor_stuff
If you type in the ls command here (make sure you are inside of tutor_stuff) you should see that the directory (you just created) is empty. Try it and see! Now that you are in the directory called tutor_stuff create another subdirectory called more_stuff. We will use this subdirectory later.
________________________________________
touch (touches a file and updates it)
The touch command will update the last time a file was edited. If you touch a file that doesn't yet exist, it will create a file with zero length. You probably won't use it much, but it will allow us to easily create files that we will use and delete later.
sunburst% cd ~/tutor_stuff --- makes sure everyone is in the right place
sunburst% touch testfile --- creates a file named testfile
sunburst% ls --- check to see if the file is there
Create another file of your own choice. Make sure you remember the name so you can delete it later.
________________________________________
pwd (print working directory)
The pwd command shows you where you are (path wise) on the computer (server - not your desktop). You have probably noticed that every time you change directories you get a description of the path. Sometimes it is necessary to know your current path without changing to a different directory.
sunburst% cd --- takes you to your home directory
sunburst% pwd --- prints your working directory; you should see something like /home/coyote/username
sunburst% cd tutor_stuff --- takes you to the tutor_stuff subdirectory
sunburst% pwd --- you should see /home/coyote/username/tutor_stuff
See if you can get back to the more_stuff subdirectory. Next, have the full pathname printed to the screen using the pwd command. (Can you guess what the full pathname will be?)
________________________________________
Miscellaneous stuff about directories and pathnames
Follow the directions and make sure you understand what is happening.
sunburst% cd --- takes you to your home directory
sunburst% ls --- lists files and directories in the current directory
sunburst% ls tutor_stuff --- lists files and directories inside of tutor_stuff
sunburst% ls what --- "what" is not a directory, so this will say "what not found"
sunburst% ls ~ --- lists files and directories in your home directory (no matter where you currently are in the system)
The tilde (~) is a reference to your home directory.
________________________________________
cp (copy file)
You can make an exact copy of a file using the copy (cp) command.
sunburst% cd ~/tutor_stuff --- makes sure everyone is in the right place
sunburst% cp testfile file2 --- makes a copy of testfile and calls it file2
sunburst% dir --- make sure the files are really there
See if you can make a copy of file2 (call it file3).
________________________________________
mv (move file)
The mv command moves files and directories around in the file system. You can also use the mv command to rename a file. To make sure these examples work you must be inside of tutor_stuff with the testfile and file2 created.
sunburst% mv testfile file1 --- renames testfile with the name file1
sunburst% mv file1 .. --- moves file1 to its parent directory
sunburst% mv file2 more_stuff --- moves file2 into the subdirectory more_stuff
See if you can move the file2 that is in more_stuff back to tutor_stuff. Also move the file1 that is in home directory back to tutor_stuff. If you do this you should have file1, file2, and file3 inside of tutor_stuff. Use the dir or ls commands to make sure you moved the files to the right spots.
________________________________________
rm (remove file)
The rm command removes or deletes files from the file system. You must be careful. DO NOT delete anything you might need later. You can't get it back.
sunburst% rm file1 --- removes file1
sunburst% rm file3 --- removes file3
Check to see that file2 still exists and that file1 and file3 are gone. Then remove file2.
________________________________________
rmdir (remove directory)
The rmdir command will remove an empty directory. If you have followed closely up to this point you should not have any files in the subdirectory more_stuff, so let's remove that directory.
sunburst% cd ~/tutor_stuff --- makes sure everyone is in the right place
sunburst% rmdir more_stuff --- removes the directory called more_stuff
See if you can remove the directory tutor_stuff. You need to make sure that there are no files or directories inside. (Remember you created another file of your choice - with a touch command earlier.)
________________________________________
chmod (changing a file mode)
The chmod command lets the owner of a file change the permissions settings for a file or directory. These permissions are displayed with the dir or ls -l listings. The permissions are in the form of
-rwxrwxrwx
This collection of letters tells us how the permissions for a file (starting with a -) or directory (starting with a d) are set. The first three letters give the permissions for the owner of the file. The middle group (next three letters) give us the permissions for the group to which the owner belongs. The last three letters tell the permissions for everyone else. The permissions can be set for directories as well as files but have slightly different meanings.
files
• r- read permission
• w-write permission
• x-execute permission
directories
• r-permission to list files in a directory
• w-permission to delete files or move files
• x-permission to access to files in the directory
permissions (owner, group, everyone)
- rwx rwx rwx
codes: r - Read Permission = 4, w - Write Permission = 2, x - Execute Permission = 1


If you wanted to set a file with full permissions for yourself or owner (4+2+1)=7, read and write permissions for the group (4+2)=6 and read only for everyone 4=4, the command would be
chmod 764 filename.
Depending on what you are doing with (and from) the files, the permissions need to be set appropriately for owner, group and everyone.
________________________________________
du (disk usage)
The du command prints the number of kilobytes you have in each subdirectory. This can be useful to see which directory has the most files if you have gone over quota. From your home directory type
sunburst% du
________________________________________
quota (displays user quota and usage)
Each user on sunburst is allocated a certain amount of disk space (usually about 5 megabytes). If you exceed this quota some of your files must be removed within a certain time period. To check your quota and see how much you are using type
sunburst% quota -v
________________________________________
clear (clears screen)
The command clear just clears your terminal screen. Go ahead and try this. You should see a clear terminal window with sunburst% at the top.
________________________________________
Ways to display file contents
In order for us to experiment with the next five commands, we need to create an example file. To do this you can either type the list below exactly as it appears or you can highlight (or select) the list below and cut-and-paste it into UNIX at the sunburst% prompt. Either way will work the same. In this example we are sending the input from the keyboard into the file demofile.
---------------------------list below
cat > demofile
one
two
three
four
five
six
seven
eight
nine
ten
eleven
twelve
thirteen
fourteen
fifteen
sixteen
seventeen
eighteen
nineteen
twenty
twenty-one
twenty-two
twenty-three
twenty-four
twenty-five
twenty-six
------------------list above
AFTER the list is either typed into UNIX or pasted from Netscape into UNIX you must stop the input with a return and a control-d.
________________________________________
cat (concatenate)
The cat command will display the entire contents of the file on the terminal screen. If it is more than one page it will all scroll by quickly.
sunburst% cat demofile --- displays the entire file contents to the terminal screen
________________________________________
more (display one screen at a time)
To display the entire file one screen at a time use the more command instead of cat. After each screen is drawn you must hit the space bar to go on to the next screen.
sunburst% more demofile --- displays the example file one screen at a time
________________________________________
head (display first 10 lines of a file)
The head command will display the first 10 lines of a file.
sunburst% head demofile --- should display one - ten in our demofile
________________________________________
tail (displays last 10 lines of a file)
The tail command will display the last 10 lines of a file.
sunburst% tail demofile --- should display seventeen through twenty-six in our file
________________________________________
wc (word count)
The wc command will tell you how many lines, words and characters are in a file.
sunburst% wc demofile --- would return 26 lines, 26 words and about 170 letters
If you pasted the info from Netscape there would be a few more letters because of additional spaces used in formatting HTML (approximately 383 letters).
________________________________________
* (wild card - matches zero or more characters)
The * is a wild card that will replace anything in its place. Below are some examples of its use.
sunburst% dir *.gif --- displays all files ending with .gif
sunburst% dir test* --- displays all files beginning with test
________________________________________
? (wild card - matches just one character)
The ? is a wild card that will match exactly one character. Below is an example of its use.
sunburst% dir test?.gif --- displays all files starting with test, followed by exactly one character, and ending with .gif (it would not find test10.gif)
________________________________________
file name conventions
Unix file names should contain only letters, numbers, underscore and period (. often referred to as dot). The file names can contain up to 256 characters. Try to give your file names meaning. Don't use all 256 characters.
Directories are basically a special type of file, so the rules for naming files apply to directories also.
________________________________________
!!
This is called the bang bang command. It will repeat the last command entered into UNIX from your keyboard. This is like the F3 command from DOS.
Try it!
A variation of this is ! followed by a letter (or letters). This will re-issue the last UNIX command that started with that letter (or letters). Let's look at the following example:
sunburst% dir --- would display the current working directory
sunburst% !! --- would repeat the last command
sunburst% mkdir stuff --- creates the directory stuff
sunburst% clear --- clears the terminal window
sunburst% cd stuff --- changes the working directory to stuff
sunburst% !c --- would re-issue the last command starting with c, i.e. cd stuff again; this wouldn't work because you are already there
sunburst% !cl --- would re-issue the last UNIX command starting with cl, which would be clear
________________________________________
More help directly from UNIX here
________________________________________
man
Read the UNIX online manual page for a specific command.
sunburst% man cp --- displays the online manual page for the copy command
Notice the --More--(26%) message at the bottom of the screen (if you type the above command). If you do not want to view the remaining pages you can [interrupt Process] with the Control-C command.
________________________________________
whatis
The whatis command will display a one-line explanation of a command.
sunburst% whatis cp --- displays a one-line description of the cp command
________________________________________
apropos
If you don't remember the UNIX command but you know you want to copy a file you can use the apropos command to find reference to the copy command within the online manual.
sunburst% apropos copy --- displays references to copy within the online manuals
________________________________________
A little fun?
________________________________________
finger
If you want to find out some information about another user (such as: are they logged on, when did they read their mail last, do they have a plan) you can use the finger command. The finger command reads some of those invisible files in your directory we were talking about earlier to display the information you will see on your screen.
sunburst% finger username@host --- displays information about the person you are checking on
________________________________________
talk
If you fingered someone and know that they are currently logged into sunburst you can talk to them. The talk command will split your window into an upper portion and a lower portion. You will type in one and read the other person's responses in the other portion. It is kind of like a telephone (well, no sound). You can do this with anyone in the world who is logged on, if you know their e-mail address.
After you issue the talk command the other person should see a message that says you are requesting to talk to them. If they want to talk to you they need to type in a corresponding message on their computer screen.
