
DataStage Designer

What is Data warehouse ?

What is Operational Databases ?

Data Extraction ?

Data Aggregation ?

Data Transformation ?

Advantages of Data warehouse ?

DataStage ?

Client Component ?

Server Component ?

DataStage Jobs ?

DataStage NLS ?

Stages

Passive Stage ? Active Stage ?

Server Job Stages

Parallel Job Stage

Links ?

Parallel Processing

Types of Parallelism

Plug in Stage?

Difference Between Lookup and Join: What is Staging Variable?

What are Routines?

What are the Job parameters?

What are Stage Variables, Derivations and Constants?

why fact table is in normal form?

What are an Entity, Attribute and Relationship?

What is Metastage?

In how many places can you call Routines?

What about System variables?

What are all the third party tools used in DataStage?

What is the difference between change capture and change apply stages

DataStage Engine Commands

What is the difference between Transform and Routine in DataStage?

Where can you output data using the peek stage?

What is complex stage? In which situation we are using this one?

What is Ad-hoc query?

What is Version Control?

How Version Control Works?

Benefits of Using Version Control

Lookup types in Datastage 8

DataStage Designer
A data warehouse is a central integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query analysis.

This database can be accessed by all users, ensuring that each group in an organization is accessing valuable, stable data. Operational databases are usually accessed by many concurrent users. The data in the database changes quickly and often. It is very difficult to obtain an accurate picture of the contents of the database at any one time. Because operational databases are task oriented, for example, stock inventory systems, they are likely to contain dirty data. The high throughput of data into operational databases makes it difficult to trap mistakes or incomplete entries. However, you can cleanse data before loading it into a data warehouse, ensuring that you store only good complete records.

Data extraction is the process used to obtain data from operational sources, archives, and external data sources. Data aggregation summarizes that data: the summed (aggregated) total is stored in the data warehouse. Because the number of records stored in the data warehouse is greatly reduced, it is easier for the end user to browse and analyze the data. Transformation is the process that converts data to a required definition and value. Data is transformed using routines based on a transformation rule; for example, product codes can be mapped to a common format using a transformation rule that applies only to product codes. After data has been transformed it can be loaded into the data warehouse in a recognized and required format.

Capitalizes on the potential value of the organization's information. Improves the quality and accessibility of data. Combines valuable archive data with the latest data in operational sources. Increases the amount of information available to users. Reduces the requirement of users to access operational data. Reduces the strain on IT departments, as they can produce one database to serve all user groups. Allows new reports and studies to be introduced without disrupting operational systems. Promotes users to be self-sufficient.

DataStage simplifies the design and processing required to build a data warehouse. It is an ETL tool: it Extracts data from any number or type of database; Transforms data (DataStage has a set of predefined transforms and functions you can use to convert your data, and you can easily extend the functionality by defining your own transforms to use); and Loads the data warehouse.

DataStage consists of a number of client components and server components. DataStage server and parallel jobs are compiled and run on the DataStage server. The job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.

DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.

DataStage Designer -> A design interface used to create DataStage applications (known as jobs).
DataStage Director -> A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.
DataStage Manager -> A user interface used to view and edit the contents of the Repository.
DataStage Administrator -> A user interface used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria.

Repository -> A central store that contains all the information required to build a data mart or data warehouse.
DataStage Server -> Runs executable jobs that extract, transform, and load data into a data warehouse.
DataStage Package Installer -> A user interface used to install packaged DataStage jobs and plug-ins.
Basic types of DataStage jobs:

Server Jobs -> These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.
Parallel Jobs -> These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
Mainframe Jobs -> These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.
Shared Containers -> These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs or parallel jobs and edited as required.
Job Sequences -> A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.
Built-in Stages -> Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of job have these stages.
Plug-in Stages -> Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.
Job Sequence Stages -> Special built-in stages which allow you to define sequences of activities to run. Only job sequences have these.
DataStage has built-in National Language Support (NLS). With NLS installed, DataStage can do the following: Process data in a wide range of languages. Accept data in any character set into most DataStage fields. Use local formats for dates, times, and money (server jobs).

Sort data according to local rules. Convert data between different encodings of the same language (for example, for Japanese it can convert JIS to EUC).
A job consists of stages linked together which describe the flow of data from a data source to a data target (for example, a final data warehouse). The different types of job have different stage types. The stages that are available in the DataStage Designer depend on the type of job that is currently open in the Designer.
A passive stage handles access to databases for the extraction or writing of data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.

Database
ODBC. -> Extracts data from or loads data into databases that support the industry standard Open Database Connectivity API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniVerse. -> Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniData. -> Extracts data from or loads data into UniData databases. This is a passive stage.
Oracle 7 Load. -> Bulk loads an Oracle 7 database. Previously known as ORABULK.
Sybase BCP Load. -> Bulk loads a Sybase 6 database. Previously known as BCPLoad.
File
Hashed File. -> Extracts data from or loads data into databases that contain hashed files. Also acts as an intermediate stage for quick lookups. This is a passive stage.

Sequential File. -> Extracts data from, or loads data into, operating system text files. This is a passive stage.
Processing
Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job. This is an active stage.
BASIC Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job. This is an active stage.
Folder. -> Folder stages are used to read or write data as files in a directory located on the DataStage server.
Inter-process. -> Provides a communication channel between DataStage processes running simultaneously in the same job. This is a passive stage.
Link Partitioner. -> Allows you to partition a data set into up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.
Link Collector. -> Collects partitioned data from up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.
RealTime
RTI Source. -> Entry point for a job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.
RTI Target. -> Exit point for a job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.
Containers

Server Shared Container. -> Represents a group of stages and links. The group is replaced by a single Shared Container stage in the Diagram window.
Local Container. -> Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window.
Container Input and Output. -> Represent the interface that links a container stage to the rest of the job design.
Databases
DB2/UDB Enterprise. -> Allows you to read and write a DB2 database.
Informix Enterprise. -> Allows you to read and write an Informix XPS database.
Oracle Enterprise. -> Allows you to read and write an Oracle database.
Teradata Enterprise. -> Allows you to read and write a Teradata database.
Development/Debug Stages
Row Generator. -> Generates a dummy data set.
Column Generator. -> Adds extra columns to a data set.
Head. -> Copies the specified number of records from the beginning of a data partition.
Peek. -> Prints column values to the screen as records are copied from its input data set to one or more output data sets.
Sample. -> Samples a data set.
Tail. -> Copies the specified number of records from the end of a data partition.
Write Range Map. -> Enables you to carry out range map partitioning on a data set.

File Stages
Complex Flat File. -> Allows you to read or write complex flat files on a mainframe machine. This is intended for use on USS systems.
Data Set. -> Stores a set of data.
External Source. -> Allows a parallel job to read an external data source.
External Target. -> Allows a parallel job to write to an external data source.
File Set. -> A set of files used to store data.
Lookup File Set. -> Provides storage for a lookup table.
SAS Data Set. -> Provides storage for SAS data sets.
Sequential File. -> Extracts data from, or writes data to, a text file.

Processing Stages
Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.
Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.
Change Apply. -> Applies a set of captured changes to a data set.
Change Capture. -> Compares two data sets and records the differences between them.
Compare. -> Performs a column by column compare of two pre-sorted data sets.
Compress. -> Compresses a data set.
Copy. -> Copies a data set.
Decode. -> Uses an operating system command to decode a previously encoded data set.
Difference. -> Compares two data sets and works out the difference between them.
Encode. -> Encodes a data set using an operating system command.
Expand. -> Expands a previously compressed data set.
External Filter. -> Uses an external program to filter a data set.
Filter. -> Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.
Funnel. -> Copies multiple data sets to a single data set.
Generic. -> Allows Orchestrate experts to specify their own custom commands.
Lookup. -> Performs table lookups.
Merge. -> Combines data sets.
Modify. -> Alters the record schema of its input data set.
Remove Duplicates. -> Removes duplicate entries from a data set.

SAS (Statistical Analysis System). -> Allows you to run SAS applications from within the DataStage job.
Sort. -> Sorts input columns.
Switch. -> Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
Surrogate Key. -> Generates one or more surrogate key columns and adds them to an existing data set.
Real Time
RTI Source. -> Entry point for a job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.
RTI Target. -> Exit point for a job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.

Restructure
Column Export. -> Exports a column of another type to a string or binary column.
Column Import. -> Imports a column from a string or binary column.
Combine Records. -> Combines several columns associated by a key field to build a vector.
Make Subrecord. -> Combines a number of vectors to form a subrecord.
Make Vector. -> Combines a number of fields to form a vector.
Promote Subrecord. -> Promotes the members of a subrecord to a top level field.
Split Subrecord. -> Separates a number of subrecords into top level fields.
Split Vector. -> Separates a number of vector members into separate columns.
Other Stages
Parallel Shared Container. -> Represents a group of stages and links. The group is replaced by a single Parallel Shared Container stage in the Diagram window. Parallel Shared Container stages are handled differently to other stage types; they do not appear on the palette.

Local Container. -> Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window.
Container Input and Output. -> Represent the interface that links a container stage to the rest of the job design.
Links join the various stages in a job together and are used to specify how data flows when the job is run.
Linking Server Stages ->
Stream. A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.
Reference. A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.
Linking Parallel Stages ->
Stream. A link representing the flow of data. This is the principal type of link, and is used by all stage types.
Reference. A link representing a table lookup. Reference links can only be input to Lookup stages; they can only be output from certain types of stage.
Reject. Some parallel job stages allow you to output records that have been rejected for some reason onto an output link.

Parallel processing is the ability to carry out multiple operations or tasks simultaneously.

Pipeline Parallelism -> If we run a job on a system with at least three processors, the stage reading would start on one processor and start filling a pipeline with the data it had read. The transformation stage would start running on the second processor as soon as there was data in the pipeline, process it, and start filling another pipeline. The target stage would start running on the third processor as soon as there was data in its pipeline.
Partitioning Parallelism -> Using partitioning parallelism, the same job would effectively be run simultaneously by several processors, each handling a separate subset of the total data.
BULK COPY PROGRAM: Microsoft SQL Server and Sybase have a utility called BCP (Bulk Copy Program). This command line utility copies SQL Server data to or from an operating system file in a user-specified format. BCP uses the bulk copy API in the SQL Server client libraries. By using BCP, you can load large volumes of data into a table without recording each insert in a log file. You can run BCP manually from a command line using command line options (switches). A format (.fmt) file describes the layout of the operating system file.
The Orabulk stage is a plug-in stage supplied by Ascential. The Orabulk plug-in is installed automatically when you install DataStage. An Orabulk stage generates control and data files for bulk loading into a single table on an Oracle target database. The files are suitable for loading into the target database using the Oracle command sqlldr. One input link provides a sequence of rows to load into an Oracle table. The meta data for each input column determines how it is loaded. One optional output link provides a copy of all input rows to allow easy combination of this stage with other stages.
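For illustration only (the server, database, table, and file names below are placeholder values), a typical bcp invocation to bulk load a table from a character-mode data file might look like this:

bcp MyDatabase.dbo.MyTable in datafile.txt -c -t, -S myserver -U myuser -P mypassword

Here -c selects character format, -t sets the field terminator, and -S/-U/-P give the server and login details.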

Lookup and Join perform equivalent operations: combining two or more input datasets based on one or more specified keys. Lookup requires all but one (the first or primary) input to fit into physical memory; Join requires all inputs to be sorted. When one unsorted input is very large or sorting isn't feasible, Lookup is the preferred solution; when all inputs are of manageable size or are pre-sorted, Join is the preferred solution.
Stage variables are the temporary variables created in the Transformer for calculations.
Routines are the functions which we develop in BASIC code for required tasks which DataStage does not fully support (complex logic).
Job parameters are used to provide administrative access and to change run-time values of the job. Edit > Job Parameters: in the Parameters tab we can define the name, prompt, type, and value.
Stage Variable - an intermediate processing variable that retains its value during a read and does not pass the value into a target column. Derivation - an expression that specifies the value to be passed on to the target column.
A fact table consists of measurements of business requirements and foreign keys of dimension tables as per business rules; keeping it in normal form avoids redundancy in the measures.
An entity represents a chunk of information; in relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity; in relational databases, an attribute maps to a column. A relationship describes how entities are associated with one another; in relational databases it is usually implemented through a foreign key.
MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating re-keying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire business intelligence life cycle.

You can call a routine in four places: (i) Transform of a routine: (A) Date Transformation, (B) Upstring Transformation; (ii) Transform of the Before & After Subroutines; (iii) XML transformation; (iv) Web-based transformation.
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE The internal date when the program started. See the Date function.
@DAY The day of the month extracted from the value in @DATE.
@FALSE The compiler replaces the value with 0.
@FM A field mark, Char(254).
@IM An item mark, Char(255).
@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.
@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.

Autosys, TNG, Event Coordinator, Maestro scheduler, and Control-M job scheduler are the third party tools used in DataStage projects.
The Change Capture stage is used to get the difference between two sources, i.e. the after dataset and the before dataset. The source which is used as a reference to capture the changes is called the after dataset; the source in which we are looking for the change is called the before dataset. Change Capture adds one field called "change_code" to its output. By this change code one can recognize which kind of change occurred: delete, insert, or update.
The following commands can be taken as DS Engine commands, used to start and stop the DS Engine:

DSHOME/bin/uv -admin -start
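The matching shutdown command (assuming the same DSHOME location as above) is:

DSHOME/bin/uv -admin -stop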

Routines are used to return values; a Transform cannot return values.
You can see Peek output in the DataStage Director: look at the Director log.
A Complex Flat File stage can be used to read the data at the initial level. By using CFF, we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data. We can select the required columns and can omit the remaining. We can collect the rejects (badly formatted records) by setting the reject property of the stage.

Ad hoc querying is a term in information science. Many application software systems have an underlying database which can be accessed by only a limited number of queries and reports. Typically these are available via some sort of menu, and will have been carefully designed, pre-programmed and optimized for performance by expert programmers. By contrast, "ad hoc" reporting systems allow the users themselves to create specific, customized queries. Typically this would be via a user-friendly GUI-based system without the need for the in-depth knowledge of SQL or database schema that a programmer would have. Because such reporting has the potential to severely degrade the performance of an operational system, it is usually run against a data warehouse rather than the production database.
Version Control allows you to: Store different versions of DataStage jobs. Run different versions of the same job. Revert to a previous version of a job. View version histories. Ensure that everyone is using the same version of a job. Protect jobs by making them read-only. Store all changes in one centralized place.

Version Control utilizes the DataStage repository and uses a specially created DataStage project (normally called VERSION) to store its information. This special project stores all changes made to all the projects it tracks. Version Control is effective because it captures entire component releases, making it possible to view all changes between release levels. Version Control also provides these benefits: version tracking, a central code repository, DataStage integration, and team coordination.

Two types of Lookup: Range Lookup and Caseless Lookup


JOB SEQUENCE

Job Sequence?

Activity Stages?

Triggers?

Job Sequence Properties?

Job Report

How do you generate Sequence number in Datastage?

Sequencers are job control programs that execute other jobs with preset Job parameters.

JOB SEQUENCE
DataStage provides a graphical Job Sequencer which allows you to specify a sequence of server jobs or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job. The activity stages are:
Job. Specifies a DataStage server or parallel job.
Routine. Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
ExecCommand. Specifies an operating system command to execute.
Email Notification. Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
Wait-for-file. Waits for a specified file to appear or disappear.
Exception Handler. There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers) or if the job aborts and the Automatically handle job runs that fail option is set for that job.
Nested Conditions. Allows you to further branch the execution of a sequence depending on a condition.
Sequencer. Allows you to synchronize the control flow of multiple activities in a job sequence.
Terminator. Allows you to specify that, if certain situations occur, the jobs a sequence is running shut down cleanly.
Start Loop and End Loop. Together these two stages allow you to implement a For...Next or For...Each loop within your sequence.
User Variable. Allows you to define variables within a sequence. These variables can then be used later on in the sequence, for example to set job parameters.

The control flow in the sequence is dictated by how you interconnect activity icons with triggers. There are three types of trigger:
Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types: OK (activity succeeds), Failed (activity fails), Warnings (activity produced warnings), ReturnValue (a routine or command has returned a value), Custom (allows you to define a custom expression), User status (allows you to define a custom status message to write to the log).
Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.
Job Sequence properties: General, Parameters, Job Control, Dependencies, NLS.
The job reporting facility allows you to generate an HTML report of a server, parallel, or mainframe job or shared container. You can view this report in a standard Internet browser (such as Microsoft Internet Explorer) and print it from the browser. The report contains an image of the job design followed by information about the job or container and its stages, and hotlinks facilitate navigation through the report. The report is not dynamic; if you change the job design you will need to regenerate the report.


Sequence numbers can be generated using the routines KeyMgtGetNextVal and KeyMgtGetNextValConn. This can also be done with an Oracle sequence.
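A minimal Oracle sketch of that alternative (the sequence name here is illustrative only):

CREATE SEQUENCE cust_key_seq START WITH 1 INCREMENT BY 1;
SELECT cust_key_seq.NEXTVAL FROM dual;

The NEXTVAL reference can then be embedded in the user-defined INSERT SQL of the target database stage.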

A Sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The Sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire; and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE.

Scenarios

Suppose we have 3 jobs in a sequencer; while running, job 1 fails, but we still have to run job 2 and job 3. How can we do that?

how do you remove duplicates using transformer stage in datastage.

how you will call shell scripts in sequencers in datastage

What are the Environmental variables in Datastage?

How to extract job parameters from a file?

How to get the unique records on multiple columns by using the Sequential File stage only? If a column contains data like abc, aaa, xyz, pwe, xok, abc, xyz, abc, pwe, abc, pwe, xok, xyz, xxx, abc, roy, pwe, aaa, xxx, xyz, roy, xok ... how to send the unique data to one source and the remaining data to another source?

How do you reduce warnings? Is there any possibility to generate an alphanumeric surrogate key?

How to lock/unlock jobs as the DataStage admin? How to enter a log record in an auditing table whenever a job finishes? What is an audit table? Have you used an audit table in your project?

Can we use Round Robin for the Aggregator? Is there any benefit underlying? How many reject links can the Merge stage have? I have 3 jobs A, B and C, which are dependent on each other. I want to run the A and C jobs daily and the B job only on Sunday. How can we do it?

How to generate a surrogate key without using the Surrogate Key stage? What is the push and pull technique? I want to import two sequential files to my desktop using the push technique; what should I do?

What is a .dsx file? How to capture rejected data by using the Join stage (not the Lookup stage)? Please let me know.

What is APT_DUMP_SCORE? There are two tables, Country and State: table 1 has cid, cname and table 2 has sid, sname, cid. Based on cid, I want to display the countries having more than 25 states.

what is the difference between 7.1,7.5.2,8.1 versions in datastage?

what is normalization and denormalization?

What is the difference between junk dimensions and conformed dimensions? 30 jobs are running in UNIX and I want to find my job; how do I do this? Give me the command.

How do u convert the columns to rows in DataStage?

What is environment variables? Where the DataStage stored his repository?

How one source columns or rows to be loaded in to two different tables?

How do you register plug-ins?

How many number of ways that you can implement SCD2 ? Explain them

A sequential file has 8 records with one column; below are the values in the column, separated by spaces: 1 1 2 2 3 4 5 6. In a parallel job, after reading the sequential file, 2 more sequential files should be created, one with duplicate records and the other without duplicates. File 1 records separated by...

how to perform left outer join and right outer join in lookup stage

What are the ways to read multiple files with the Sequential File stage if the files are different?

What happens if the job fails at night?

If there are 10000 records and while loading, if the session fails in between, how will you load the remaining data?

Tell me one situation from your last project, where you had faced problem and How did u solve it?

How to handle Date convertions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

what is trouble shooting in server jobs ? what are the diff kinds of errors encountered while running any job?

What validations do you perform after creating jobs in the Designer? What are the different types of errors you faced during loading and how did you solve them?

If the size of the Hash file exceeds 2GB..What happens? Does it overwrite the current rows?

What is the purpose of Debugging stages? In real time Where we will use?

How do you delete the header and footer on the source sequential file, and how do you create a header and footer on the target sequential file using DataStage?

Using server job, how to transform data in XML file into sequential file?? i have used XML input, XML transformer and a sequential file.

How to develop the SCD using LOOKUP stage?

The source has 10000 records and the job failed after 5000 records were loaded; the status of the job is Aborted. Instead of removing the 5000 records from the target, how can I resume the load?

If we are using two sources having the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?

Scenario based question: suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and, after running the job, only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job aborts. How can you sort out the problem? Tell me the environment in your last projects: give the OS of the server and the OS of the client of your most recent project.

Where does a UNIX script of DataStage execute, on the client machine or on the server? If it executes on the server, how will it execute?

What are the Repository Tables in DataStage and What are they?

How does the hashed file do a lookup in server jobs? How does it compare the key values? How to extract data from more than one heterogeneous source, for example one sequential file, Sybase, and Oracle in a single job?

how can you do incremental load in datastage?

Job run reports generated by sequence jobs do not show the final error message

Scenarios
To run a job even if its previous job in the sequence has failed, you need to go to the TRIGGER tab of that particular job activity in the sequence itself. There you will find three fields:

Name: This is the name of the next link (the link going to the next job; e.g. for job activity 1 the link name will be the link going to job activity 2).

Expression Type: This will allow you to trigger your next job activity based on the status you
want. For example, if job 1 fails and you want to run job 2 and job 3, then go to the trigger properties of job 1 and select the expression type "Failed - (Conditional)". This way you can run your job 2 even if your job 1 is aborted. There are many other options available.

Expression: This is editable for some options; for expression type "Failed", for example, you cannot
change this field. I think this will solve your problem.

Double-click on the Transformer stage and go to Stage Properties (the first icon in the header line). Go to Inputs -> Partitioning and select a partitioning technique (other than Auto). Then enable Perform Sort, enable Unique, and select the required column name(s). The output will now contain only unique values, so the duplicates are removed.

Shell scripts can be called in sequences by using the Execute Command activity. In this activity, type the following command: bash /path/of/your/script/scriptname.sh. The bash command is used to run the shell script.

The environment variables in DataStage are paths and settings that the system can use as shortcuts while running programs, instead of resolving them repeatedly. In most cases, environment variables are defined when the software is installed. Could we use the dsjob command on a Linux or UNIX platform to extract parameters from a job?
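Yes - as a sketch (the project and job names below are placeholders, and this assumes the dsjob client is available on the engine host), the parameters defined for a job can be listed from the command line with:

dsjob -lparams MyProject MyJob

and the details of a particular parameter with dsjob -paraminfo MyProject MyJob ParamName.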

In the Sequential File stage there is an option called Filter; in this filter we can use whatever UNIX commands we want. Go to the Sequential File stage Properties -> Output -> Options -> set Filter.
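For example (a sketch using standard UNIX commands in that Filter box): sort | uniq -u passes only the values that occur exactly once, while sort | uniq -d passes only the values that occur more than once, so two Sequential File stages reading the same file with these two filters split the unique data and the remaining data.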

By using the Sort stage: go to Properties -> set the Sorting Keys (key = column name) and set the option Allow Duplicates = false.
In order to reduce warnings you need to get a clear idea about the particular warning; if you can fix it on the code or design side, do so. Otherwise go to the Director, select the warning, right-click and add a rule to the message handler, then click OK. From the next run onward you should not see that warning.

It is not possible to generate an alphanumeric surrogate key in DataStage. To unlock jobs, this approach should work: 1. Open the Administrator. 2. Go to the Projects tab. 3. Click on the Command button. 4. Give the LIST.READU command and press Execute (it gives you the status of all the jobs; note the PID (process ID) of the jobs which you want to unlock). 5. Close that and come back to the command window. 6. Give the DS.TOOLS command and execute it. 7. Read the options given there and type option 4. 8. Then choose 6 or 7 depending on your requirement. 9. Give the PID that you noted before. 10. Then answer "yes". 11. Generally it may not work the first time; if so, press 7 again and give the PID again, and it will work.

Some companies use shell scripts to load logs into the audit table, and some companies load logs into the audit table using DataStage jobs. These jobs are developed by us.

An audit table is essentially a log table; every job should have an audit table.

Yes we can use Round Robin in Aggregator. It is used for Partitioning and Collecting.

We can have n-1 reject links for the Merge stage.

First you have to schedule jobs A and C Monday to Saturday in one sequence. Next take the three jobs, according to their dependency, in one more sequence and schedule that sequence only on Sunday.
Sequence numbers can be generated in the Transformer by using the system variables, i.e. @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS.
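As an illustration of the formula above (assuming 4 partitions, so @NUMPARTITIONS = 4): the first row on partition 0 gets 0 + (1 - 1) * 4 = 0, the first row on partition 1 gets 1 + (1 - 1) * 4 = 1, the second row on partition 0 gets 0 + (2 - 1) * 4 = 4, and so on, so no two partitions ever generate the same number.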

Push means the source team sends the data; pull means the developer extracts the data from the source.
A .dsx file is the DataStage project (or job) export/backup file. When we want to load the project on another system or server, we take the .dsx file and import it on the other system/server.

We cannot capture the reject data by using the Join stage; for that we can use a Transformer stage after the Join stage.
APT_DUMP_SCORE is a reporting environment variable, used to show how the data is processed and how processes are combined.

Join these two tables on cid and get all the columns in the output. Then, in an Aggregator stage, count rows with the key column cid. Then use a Filter or Transformer to get the records with count > 25.

The main difference is that in 7.5 we can open a job only once at a time, but in 8.1 we can open a job multiple times in read-only mode; another difference is that 8.1 has the Slowly Changing Dimension stage and a new repository.
Normalization is controlled by eliminating redundant data, whereas denormalization is controlled by adding redundant data.
JUNK DIMENSION: a dimension which cannot be used to describe the facts is known as a junk dimension (a junk dimension provides additional information to the main dimension). Example: customer address.
CONFORMED DIMENSION: a dimension table which can be shared by multiple fact tables is known as a conformed dimension. Example: the Time dimension.

ps -ef|grep USER_ID|grep JOB_NAME
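If the job name alone is not selective enough, DataStage server jobs usually show up as DSD.RUN phantom processes, so a variation such as ps -ef | grep DSD.RUN | grep JOB_NAME can also be tried (a sketch; the exact process names can differ between versions and between server and parallel jobs).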

Using Pivot Stage .

Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level; once we set a specific variable, that variable will be available in the project/job.
DataStage stores its repository in an IBM UniVerse database.
For columns - we can directly map the single source columns to two different targets. For rows - we have to put some constraint (condition).

Using the DataStage Manager: Tools -> Register Plug-in -> set the specific path -> OK.

There are 3 ways to construct SCD2 in DataStage 8.0.1: 1) using the SCD stage (a processing stage); 2) using the Change Capture and Change Apply stages; 3) using a source file, Lookup, Transformer, Filter, and Surrogate Key Generator stages.

Suppose we have the data 1 1 2 2 3 4 5 6. By using Sort we can send the duplicates into one link and the non-duplicates into another link: in the Sort stage, by using the key change column we can identify the duplicates, and by using a Transformer we can then separate the duplicate rows from the non-duplicate rows.

In the Lookup stage properties you have the Constraints option. If you click on the Constraints button you get options like Continue, Drop, Fail and Reject for the lookup failure condition. If you select Continue, a left outer join operation is performed. If you select Drop, an inner join operation is performed.

This can be achieved by selecting the File Pattern read method and giving the path pattern of the files in the Sequential File stage.
You can define a job sequence to send an email using the notification activity (SMTP) if the job fails, or log the failure to a log file using DSLogFatal/DSLogEvent from a controlling job or an after-job routine, or use dsjob -log from the command line interface.
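As a further sketch (the project and job names are placeholders), the failure messages can afterwards be reviewed from the command line with:

dsjob -logsum MyProject MyJob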

Different companies use different strategies to recover failed loads: 1) You can use restart options so the job recovers from the last checkpoint. 2) Use a temporary table before every target and load it with the keys; when a job fails, you can identify the rows that were not loaded from the source by using these keys in a SQL override. 3) You can delete the rows that were loaded into the target by date, and restart the job from the beginning.

a) We had a big job with around 40 stages. The job was taking too long to compile and run, so we broke the job into 3 smaller jobs. After this, we observed that the performance improved slightly and maintenance of the jobs became easier. b) We were facing problems deleting records using the OEE stage. We wrote a bulk delete statement instead of record-by-record deletes; it improved the performance of our job and the deletion time reduced to 5 minutes, where the same job had earlier been taking 25 minutes.
I will explain how to convert a mm/dd/yyyy format to yyyy-dd-mm. The format is Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]"): here Iconv(FieldName, "D/MDY[2,2,4]") first converts the given date into the internal date format, and then Oconv(internal_date, "D-YDM[4,2,2]") converts the internal date into the required yyyy-dd-mm format.
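For instance (an illustrative value only): if FieldName holds "12/31/2024", then Iconv("12/31/2024", "D/MDY[2,2,4]") returns the internal day number for that date, and applying Oconv with "D-YDM[4,2,2]" to that number returns "2024-31-12", which is the required yyyy-dd-mm layout.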

Troubleshooting in DataStage server jobs involves monitoring the job log for fatal errors and taking appropriate actions to resolve them. Various errors can be encountered while running DS jobs, for example: a) ORA-1400 errors; b) invalid user id or password, login denied (from the OCI stage); c) "Dataset does not exist" (parallel jobs); d) a job may fail for a lookup failure, saying "lookup failed on a key column" (if the "Fail" setting is chosen in the Lookup stage for lookup failures); etc.
I performed the following validations: 1) all letters should be in lowercase; 2) the email id field should not contain more than 255 characters; 3) it should not contain special characters except underscore. While loading I sometimes came across the following errors: 1) "unknown field name ..." because the metadata was not properly loaded - I reloaded the metadata and it worked fine; 2) "data truncation" warnings, because the data type size in DataStage was less than the size of the data type in the database.

When you create a hashed file, by default its directory contains two files: DATA.30 and OVER.30. If the data exceeds the specified limit, the extra data is written into OVER.30; beyond that it depends on storage capacity.
The main use of the debugging stages (Row Generator, Peek, Tail, Head, etc.) is that they are helpful for monitoring jobs and for generating mock data when we do not have real data to test with.

In the Designer palette, under Development/Debug, we can find Head and Tail; by using these we can do it.
I will explain the stages used, in order: FOLDER STAGE -> XML INPUT STAGE -> TRANSFORMER -> SEQUENTIAL FILE. The Folder stage is used to read the folder which has the XML file (give the wildcard as *.xml). In the XML Input stage, load the columns from the XML importer and select only the values, then map the same in the Transformer. That's it.

We can implement SCD by using the Lookup stage, but only for SCD1, not for SCD2. We have to take the source (file or DB) and a data set as the reference link (for the lookup), and then the Lookup stage; in it we compare the source with the data set and set the lookup failure condition to Continue. After that, in the Transformer we apply the conditions, and then we take two targets, one for inserts and one for updates, where we manually write the SQL insert and update statements. If you see the design, you can easily understand it.

But we keep the extract, transform and load processes separate. Generally the load job never fails unless there is a data issue, and all data issues are cleared earlier, in the transform step. There are some DB tools that do this automatically. If you want to do it manually, keep track of the number of records in a hashed file or text file and update that file as you insert each record. If the job fails in the middle, read the number from the file and process the records from that point onward, ignoring the record numbers before it. Try the @INROWNUM variable for better results.

Use a Change Capture stage and output it into a Transformer. Write a routine to abort the job, initiated from the first row (@INROWNUM = 1). So if the data does not match, a change row is passed into the Transformer and the job is aborted.

Suppose a job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this situation you should go to the Director and check what type of problem is showing: a data type problem, a warning message, a job failure, or a job abort. If the job failed, it means a data type problem or a missing column action. So you should go to the Run window -> Tracing -> Performance, or in your target table -> General -> Action, where there are two options: (i) On Fail - Commit / Continue, (ii) On Skip - Commit / Continue. First check how much data has already been loaded, then select the On Skip option and Continue; for the remaining data that was not loaded, select On Fail and Continue. Run the job again and you should get a successful finish.

The server is UNIX, and the client machine, i.e. the machine where you design the job, is Windows XP Professional.

Datastage jobs are executed in the server machines only. There is nothing that is stored in the client machine.

A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad-hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
In DataStage, under I/O and Transfer, in the Interface tab (Input, Output and Transfer pages) you will have 4 tabs, and the last one is Build; under that you can find the TABLE NAME.

The DataStage client components are: Administrator - administers DataStage projects and conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled into executable programs; Director - used to run and monitor the DataStage jobs; Manager - allows you to view and edit the contents of the repository.
A hashed file is used for two purposes: 1. to remove duplicate records, 2. as a reference for lookups. The hashed file contains 3 parts: each record has a hashed key, a key header, and a data portion. By using the hashing algorithm and the key value, the lookup is faster.
You can convert all heterogeneous sources into sequential files and join them using Merge, or you can write a user-defined query in the source itself to join them.
Incremental load means daily load. Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it through job parameters, and set the second argument to the current date and time.
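A minimal sketch of such a user-defined extraction query (the table, column, and parameter names are hypothetical; #...# is how DataStage references job parameters inside user-defined SQL):

SELECT * FROM src_orders WHERE last_update_ts > '#LastRunDateTime#' AND last_update_ts <= '#CurrentRunDateTime#'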

IBM InfoSphere DataStage: a sequence job collects job run information after each job activity is run. This information can be written to the job log or sent by email using the Notification Activity stage. If any stages or links in a job activity produce warning or error messages during the job run, the last warning or error message is retrieved and added to the report.

DataStage Important Interview Questions And Answers

What is DatawareHouse? Concept of Dataware house?

What type of data available in Datawarehouse?

What is Node? What is Node Configuration?

What are the types of nodes in datastage?

What is the use of Nodes

Fork-join

Execution flow

Conductor

Section

Player

What are descriptor file and data file in Dataset.

What is Job Commit ( in Datastage).

What is Iconv and Oconv functions

How to Improve Performance of Datastage Jobs?

Difference between Server Jobs and Parallel Jobs

Difference between Datastage and Informatica.

What is complier ? Compliation Process in datastage

What is Modelling Of Datastage?

Types Of Modelling ?

What is DataMart, Importance and Advantages?

Data Warehouse vs. Data Mart

What are different types of error in datastage?

What are the client components in DataStage 7.5x2 version?

Difference Between 7.5x2 And 8.0.1?

What is IBM Infosphere? And History

What is Datastage Project Contains?

What is Difference Between Hash And Modulus Technique?

What are Features of Datastage?

ETL Project Phase?

What is RCP?

What are the Roles and Responsibilities of a Software Engineer?

Server Component of DataStage 7.5x2 version?

How to create Group ID in Sort Stage?

What is Fastly Changing Dimension?

Force Compilation ?

How many rows are sorted in the Sort stage by default in server jobs? When do we go for a Sequential File stage and when for a Data Set in DataStage?

what is the diff b/w switch and filter stage in datastage?

specify data stage strength?

symmetric multiprocessing (SMP)

Briefly state different between data ware house & data mart?

What are System variables?

What are Sequencers?

What's the difference between an operational data store (ODS) and a data warehouse?

What is the difference between Hashfile and Sequential File?

What is OCI? Which algorithm you used for your hashfile?

how to perform left outer join and right outer join in lookup stage

What is the difference between DataStage and DataStage Scripting?

Orchestrate vs DataStage Parallel Extender? The above might raise another question: why do we have to load the dimension tables first, then the fact tables?

how to create batches in Datastage from command prompt

How will the performance affect if we use more number of Transformer stages in Datastage parallel jobs?

What various validations do you perform on the data after extraction?

What are PROFILE STAGE, QUALITY STAGE, and AUDIT STAGE in DataStage? Please explain in detail.

How do you fix the error "OCI has fetched truncated data" in DataStage

Why is a hashed file faster than a sequential file and the ODBC stage?

how to fetch the last row from a particular column.. Input file may be sequential file...

What is project life cycle and how do you implement it?

What is the alternative way where we can do job control??

Is it possible for two users to access the same job at a time in DataStage? How to kill a job in DataStage?

What is Integrated & Unit testing in DataStage? How do you clean the DataStage repository?

give one real time situation where link partitioner stage used?

What are the transaction size and array size in the OCI stage? How can these be used?

How do you do Usage analysis in datastage ?

A data warehouse is a database which is used to store heterogeneous sources of data, with characteristics like: a) Subject Oriented b) Historical Information c) Integrated d) Non-Volatile e) Time Variant. The source will be an Online Transaction Processing (OLTP) system; the warehouse collects its data from OLTP systems. An OLTP system typically maintains the data for only 30-90 days and is time sensitive, so if we would like to store the data for a long period we need a permanent database, that is, an archival database (AD). Data in the data warehouse comes from the client systems; the data that you use to manage your business is very important, and manipulations must be done according to the client requirements.
A node is a logical CPU in DataStage. Each node in a configuration file is distinguished by a virtual name and defines a number, speed, CPUs, memory availability, etc. Node configuration is a technique of creating logical CPUs. The degree of parallelism of parallel jobs depends on the number of nodes you define in your configuration file; nodes are just processes logically created by the OS. Basically two types of nodes exist: a) Conductor node: the DataStage engine is loaded into the conductor node. b) Processing nodes: one section leader is created per node, and section leaders fork the player processes.
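A minimal example of such a configuration file (a sketch only; the host name and disk paths are placeholders) defining two logical nodes on a single server:

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}

Pointing APT_CONFIG_FILE at a file like this makes parallel jobs run with two-way partition parallelism.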

In a Grid environment a node is the place where the jobs are executed. Nodes are like processors; if we have more nodes when running the job, the performance will be better because the job runs in parallel and is more efficient.
Fork-join: a job is split into N sub-jobs which are served by each of the N servers. After service, a sub-job waits until all other sub-jobs have also been processed. The sub-jobs are then rejoined and leave the system.
Execution flow: actual data flows from player to player; the conductor and section leaders are only used to control process execution through control and message channels.
* Conductor is the initial framework process. It creates the Section Leader (SL) processes (one per node), consolidates messages to the DataStage log, and manages orderly shutdown. The conductor node has the start-up process. The Conductor also communicates with the players.

* Section Leader is a process that forks player processes (one per stage) and manages up/down communications. SLs communicate between the conductor and player processes only. For a given parallel configuration file, one section leader will be started for each logical node.

* Players are the actual processes associated with the stages. It sends stderr and stdout to the SL, establishes connections to other players for data flow, and cleans up on completion. Each player has to be able to communicate with every other player. There are separate communication channels (pathways) for control, errors, messages and data. The data channel does not go through the section leader/conductor as this would limit scalability. Data flows directly from upstream operator to downstream operator.

Descriptor and data files are the Dataset files. The descriptor file contains the schema details and the address of the data, and the data file contains the data in the native format.
In the DRS stage we have a Transaction Isolation setting (set to Read Committed), and we set the Array Size and Transaction Size, for example to 10 and 2000, so that it will commit every 2000 records.
Iconv and Oconv functions are used to convert date (and other) formats: Iconv() converts a string to the internal storage format, and Oconv() converts an expression to an output format.

Performance of the job is really important to maintain. Some of the precautions for getting good performance from jobs are as follows: avoid relying on a single flow for performance or tuning tests; try to work in increments; and isolate problem jobs and solve them incrementally.

For that: a) Avoid using the Transformer stage wherever possible. For example, if you are using a Transformer stage only to change column names or to drop columns, use a Copy stage instead; it will give better performance. b) Take care to choose the correct partitioning technique, according to the job and the requirement. c) Use user-defined queries for extracting the data from databases. d) If the data is small, use SQL join statements rather than a Lookup stage. e) If you have a large number of stages in the job, divide the job into multiple jobs.
Server jobs work only if the server edition of DataStage has been installed on your system. Server jobs do not support the parallelism and partitioning techniques, and they generate BASIC programs after job compilation. Parallel jobs work if you have installed Enterprise Edition; these work on DataStage servers that are SMP (Symmetric Multi-Processing), MPP (Massively Parallel Processing), or cluster systems. Parallel jobs generate OSH (Orchestrate Shell) programs after job compilation, and provide different stages such as Data Sets, Lookup stages, etc. Server jobs work in a sequential way while parallel jobs work in a parallel fashion (the Parallel Extender works on the principle of pipeline and partition parallelism) for input/output processing.

The difference between DataStage and Informatica is that DataStage has partitioning, parallelism, Lookup, Merge, etc., but Informatica does not have the same concept of partitioning and parallelism, and its flat-file lookup is really poor.
Compilation is the process of converting the GUI design into machine-understandable code. In this process it checks all the link requirements, the mandatory stage property values, and whether there are any logical errors, and the compiler produces OSH code.
Modeling is a logical and physical representation of the source system. There are two modeling tools: ERWIN and ER-Studio. In the source system there will be an ER model, and in the target system there will be an ER model and a dimensional model. Dimension: the table which is designed from the client's perspective; we can look at the data in many ways through the dimension tables.

And there are two types of modeling approaches: Forward Engineering (F.E.) and Reverse Engineering (R.E.). F.E. is the process of starting from scratch, for example for the banking sector: any bank which requires a data warehouse. R.E. is the process of altering an existing model for another bank.

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a data mart can expect to have data presented in terms that are familiar.
There are many reasons to create a data mart, and it has considerable importance and advantages. It is easy to access frequently needed data from the database when required by the client. We can give a group of users access to view the data mart when it is required, and of course performance will be good. It is easy to create and maintain a data mart, it is related to a specific business, and it is lower cost to create a data mart than to create a data warehouse with a huge amount of space.

A data warehouse tends to be a strategic but somewhat unfinished concept; its design tends to start from an analysis of what data already exists and how it can be collected in such a way that it can later be used. A data warehouse is a central aggregation of data (which can be distributed physically). A data mart tends to be tactical and aimed at meeting an immediate need; its design tends to start from an analysis of user needs. A data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. You may get many errors in DataStage while compiling or running jobs. Some of them are: a) Source file not found, if you are trying to read a file that does not exist under that name. b) Fatal errors. c) Data type mismatches between stages. d) Field size errors. e) Metadata mismatch. f) Data type size differences between source and target. g) Column mismatch. h) Process time-out, which can occur when the server is busy.

In DataStage version 7.5x2 there are 4 client components: 1) DataStage Designer 2) DataStage Director 3) DataStage Manager 4) DataStage Administrator. In the Designer we create, compile, and run jobs. In the Director we can view jobs, view logs, run batch jobs, unlock jobs, schedule jobs, monitor jobs, and do message handling.

1) In DataStage 7.5x2 there are 4 client components: a) DataStage Designer b) DataStage Director c) DataStage Manager d) DataStage Administrator. 2) In DataStage version 8.0.1 there are 5 components: a) DataStage Designer b) DataStage Director c) DataStage Administrator d) Web Console e) Information Analyzer. In version 8 the DataStage Manager functionality is integrated into the DataStage Designer.

2) The DataStage 7.5x2 version is OS dependent; that is, operating-system users are the DataStage users.

DataStage is a product owned by IBM. It is an ETL tool and it is platform independent. ETL stands for Extraction, Transformation, and Loading. DataStage was introduced by a company called VMark under the name DataIntegrator in the UK in 1997; it was later acquired by other companies and finally reached IBM in 2006. DataStage got its parallel capabilities when it was integrated with Orchestrate, and its platform-independent capabilities when it was integrated with the MKS Toolkit.

DataStage is a comprehensive ETL tool, used to extract, transform, and load data. DataStage projects are worked on through the DataStage client tools; we log in to the DataStage Designer in order to enter a project and design DataStage jobs. Jobs are maintained according to the project standards. Every project contains the DataStage jobs, built-in components, table definitions, the repository, and the other components required for the project.

Hash and Modulus are key-based partitioning techniques, used for different purposes. If the key column's data type is textual, we use the hash partitioning technique for the job; if the key column's data type is numeric, we use the modulus technique. If one key column is numeric and another is textual, we still use hash; only if all the key columns are numeric do we use modulus (a rough SQL illustration of the modulus idea follows this answer). Key features of DataStage: 1) Any to any: DataStage can extract data from any source and load it into any target. 2) Platform independent: a job developed on one platform can run on any other platform; for example, a job designed for uniprocessor-level processing can run on an SMP machine. 3) Node configuration: a technique for creating logical CPUs; a node is a logical CPU. 4) Partition parallelism: a technique for distributing the data across the nodes based on partitioning techniques. The key-based partitioning techniques are 1) Hash 2) Modulus 3) Range 4) DB2.
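As a rough illustration of the modulus idea only (this is plain SQL showing the arithmetic, not how DataStage is configured), each row can be assigned to one of four hypothetical nodes by taking the numeric key modulo the node count; the employees table and emp_id column are made up:
SELECT emp_id, MOD(emp_id, 4) AS target_node
FROM employees
Rows with the same remainder always land on the same node, which is exactly what key-based partitioning needs.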

The four phases are 1) Data Profiling 2) Data Quality 3) Data Transformation 4) Metadata Management. Data Profiling: data profiling is performed in 5 steps and analyses whether the source data is clean or dirty. The 5 steps are a) Column Analysis b) Primary Key Analysis c) Foreign Key Analysis d) Cross-domain Analysis e) Baseline Analysis. After the analysis, if the data is good there is no problem; if the data is dirty, it is sent for cleansing, which is done in the second phase. Data Quality: after receiving the dirty data, this phase cleanses it. RCP stands for Runtime Column Propagation. When we run DataStage jobs, the columns may change from one stage to another, and at that point we may be carrying unnecessary columns that are not required into a stage. If we want only the required columns to be loaded into the target, we can control this with RCP; by enabling RCP appropriately, we send only the required columns to the target.

Roles and responsibilities of a software engineer: 1) Preparing questions 2) Logical designs (i.e. flow charts) 3) Physical designs (i.e. coding) 4) Unit testing 5) Performance tuning 6) Peer review 7) Design Turnover Document (also called Detailed Design Document or Technical Design Document) 8) Doing backups 9) Job sequencing (usually for a senior developer). There are three architecture components in DataStage 7.5x2: Repository: an environment where we create, design, compile, and run jobs; some components it contains are jobs, table definitions, shared containers, routines, etc. Server (engine): runs the executable jobs that extract, transform, and load data into a data warehouse. DataStage Package Installer: a user interface used to install packaged DataStage jobs and plug-ins.

Group IDs can be created in two different ways, using either a) the Key Change column or b) the Cluster Key Change column. Both options create group IDs: when either option is set to true, the data is divided into groups based on the key column, the first row of every group gets 1, and the remaining rows in each group get 0. Which option to use depends on the data coming from the source: if the incoming data is not sorted, use the Key Change column to create group IDs; if the incoming data is already sorted, use the Cluster Key Change column. A dimension whose entities change rapidly is called a rapidly changing dimension; the best example is ATM transactions. For parallel jobs there is also a force compile option. The compilation of parallel jobs is by default optimized so that Transformer stages are only recompiled if they have changed since the last compilation; the force compile option overrides this and causes all Transformer stages in the job to be recompiled. To select this option, choose File > Force Compile.

10,000

When the memory or size requirement is large, go for a Data Set; a sequential file does not support more than 2 GB. Filter: 1) we can write multiple conditions on multiple fields; 2) it supports one input link and n output links. Switch: 1) conditions are written on a single field (column); 2) it supports one input link and up to 128 output links.

The major strengths of DataStage are partitioning, pipelining, node configuration, handling huge volumes of data, and platform independence.

symmetric multiprocessing (SMP) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture.

A data warehouse is made up of many data marts. A DWH contains many subject areas, whereas a data mart generally focuses on one subject area. For example, if there is a DWH for a bank, there can be one data mart for accounts, one for loans, and so on. These are high-level definitions: a data mart (DM) is the access layer of the data warehouse (DW) environment that is used to get data out to the users; the DM is a subset of the DW, usually oriented to a specific business line or team. System variables comprise a set of variables that are used to get system information; they can be accessed from a transformer or a routine, they are read-only, and they start with an @. A Sequencer allows you to synchronize the control flow of multiple activities in a job sequence; it can have multiple input triggers as well as multiple output triggers.

A data warehouse is a decision-support database for organizational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data. An ODS (Operational Data Store) is an integrated collection of related information; it typically contains at most around 90 days of information. The ODS is part of the transactional layer: it keeps integrated data from different transactional databases and allows common operations across the organization, for example banking transactions. In simple terms, ODS data is dynamic data. A hash file stores data based on a hash algorithm and a key value, whereas a sequential file is just a file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.

If you mean the Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows one to use operations like logon, execute, parse, etc. from a C or C++ program. It uses the GENERAL or SEQ.NUM. algorithm.

In the Lookup stage properties you have a Constraints option. If you click the Constraints button, you get the options Continue, Drop, Fail, and Reject. If you select Continue, a left outer join operation is performed; if you select Drop, an inner join operation is performed (an SQL analogy follows this answer). DataStage jobs, when compiled, generate OSH. OSH is the abbreviation of Orchestrate Shell scripting language; when a DataStage job is run, the generated OSH is executed in the back end.
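As a plain SQL analogy for the Continue and Drop options (the source_data and reference_data tables and columns are hypothetical; this only illustrates the join semantics, it is not DataStage code):
-- Continue: behaves like a left outer join, unmatched source rows pass through with NULL reference columns
SELECT s.key_col, s.src_value, r.ref_value
FROM source_data s
LEFT JOIN reference_data r ON r.key_col = s.key_col
-- Drop: behaves like an inner join, unmatched source rows are dropped
SELECT s.key_col, s.src_value, r.ref_value
FROM source_data s
INNER JOIN reference_data r ON r.key_col = s.key_col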

Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (the beta version of 6.0) to incorporate the parallel processing capabilities. The DataStage vendor then purchased Orchestrate, integrated it with DataStage XE, and released a new version, DataStage 6.0, i.e. Parallel Extender. In a dimensional model, fact tables are dependent on the dimension tables; this means that a fact table contains foreign keys to dimension tables. This is the reason dimension tables are loaded first and the fact tables afterwards.

From the command prompt, batches can be created in the following way: a) Create a batch file, say RunbatchJobs.bat. b) Open this file in Notepad. c) Write the dsjob command with the proper syntax for each job you want to run. d) If there are four jobs to be run in the batch, use the dsjob command four times, with a different job name on each line. e) Save the file and close it. f) Next time, whenever you want to run the jobs, just run the batch file RunbatchJobs.bat and all the jobs will run one by one. Traditionally, batch programs are created in the following way: a batch program runs a batch of jobs by means of server routine code written in the job control section. To generate a batch program: a) Open the DataStage Director. b) Go to Tools -> Batch -> New. c) A new window will open with the Job Control tab selected. d) Write the routine code and save it. You may run multiple jobs in a batch by making use of this. Transformer stages compile into C++ whereas other stages compile into OSH (Orchestrate scripting language). If the number of Transformers is high, the first thing impacted is compilation time: it takes longer to compile the Transformer stages. Practically, the Transformer stage does not really have a performance impact on DataStage jobs; rather, if the number of stages in your job is high (not necessarily Transformer stages), performance will be impacted. Hence, try to implement the job logic using the minimum number of stages in your DataStage jobs.

NULL check, metadata check, duplicate check, invalid-value check. Profile stage: a profiling tool to investigate data sources to see inherent structures, frequencies of phrases, identify data types, etc. In addition it can, based on the real data rather than the metadata, suggest a data model for the union of your data sources; this data model would be in 3NF. QualityStage: is now embedded in Information Server and provides functionality for fuzzy matching of records and for standardizing record fields based on predefined rules. Audit stage: is now part of Information Analyzer. This part of IA can, based on predefined rules, expose exceptions in your data from the required format, contents, and relationships. The decimal error occurs when the Oracle stage tries to fetch a value like 34.55676776... into a column whose data type is decimal(10,2). The solution is to either truncate or round the data to 2 decimal positions (a small sketch follows this answer).
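A hedged, Oracle-flavoured sketch of the two fixes, assuming a hypothetical transactions table with an amount column; in a real job the expression would go into the user-defined SELECT of the Oracle stage:
SELECT ROUND(amount, 2) AS amount_rounded,
       TRUNC(amount, 2) AS amount_truncated
FROM transactions
ROUND rounds to the nearest 2-decimal value, while TRUNC simply cuts off the extra digits.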

A hash file is a DataStage internal file. Data is stored in memory and organized on a key column, so retrieval is faster compared with hitting the database.

Develop a job: source Sequential File --> Transformer --> output stage. In the Transformer, define a stage variable, say rowcount, with the following derivation: go to DS Functions and click DSGetLinkInfo; you will get DSGetLinkInfo(DSJ.ME,%Arg2%,%Arg3%,%Arg4%). Arg2 is your source stage name, Arg3 is your source link name, and for Arg4 click DS Constant and select DSJ.LINKROWCOUNT. The derivation becomes DSGetLinkInfo(DSJ.ME,"source","link",DSJ.LINKROWCOUNT). Create a constraint @INROWNUM = rowcount and map the required columns to the output link. The project life cycle is related to the SDLC (software development life cycle), which involves four stages: 1) Analysis 2) Development 3) Testing 4) Implementation; this covers the entire project life cycle. Job control can be done using DataStage job sequencers, DataStage custom routines, scripting, and scheduling tools such as Autosys.

No chance; you have to kill the job process. You can also do it by using the DataStage Director's Clean Up Resources option. Unit testing: in a DataStage scenario, unit testing is the technique of testing an individual DataStage job for its functionality. Integration testing: when two or more jobs are tested collectively for their functionality, that is called integration testing. Remove log files periodically, and use the command CLEAR.FILE &PH&.

If we want to send data from the source to the targets more quickly, we use the Link Partitioner stage in server jobs; we can make a maximum of 64 partitions, and it is an active stage. Normally we cannot connect two active stages, but this stage is allowed to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions. The Link Collector is also an active stage, so to avoid connecting an active Transformer stage directly to the Link Collector we use the Inter-Process Communication (IPC) stage. As this is a passive stage, data passed through it can be collected by the Link Collector. However, we can use the IPC stage only when the target is a passive stage.

Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and later of the Plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab on the Input page. Rows per transaction - The number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table. Array Size - The number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.

1. If you want to know whether a job is part of a sequence, then in the Manager right-click the job and select Usage Analysis; it will show all of the job's dependents. 2. To find how many jobs are using a particular table. 3. To find how many jobs are using a particular routine. In this way you can find all the dependents of a particular object. It is nested: you can move forward and backward and see all the dependents.

SQL

SQL SELECT DISTINCT

SQL AND & OR Operators

SQL ORDER BY

SQL UPDATE

SQL DELETE

SQL SUBQUERY

SQL CASE

SQL TOP

SQL LIKE

SQL IN

SQL BETWEEN

SQL Alias

SQL Joins

SQL INNER JOIN

SQL LEFT JOIN

SQL RIGHT JOIN

SQL FULL JOIN

SQL UNION

SQL INTERSECT

SQL MINUS

SQL LIMIT

SQL CREATE DATABASE

SQL CREATE TABLE

SQL Constraints

SQL NOT NULL

SQL UNIQUE

SQL PRIMARY KEY

SQL FOREIGN KEY

SQL CHECK

SQL DEFAULT

SQL CREATE INDEX

SQL ALTER TABLE

SQL AUTO INCREMENT

SQL Views

SQL Date Functions

SQL NULL Values

SQL ISNULL VALUES

SQL COALESCE FUNCTION

SQL IFNULL VALUES

SQL NVL Function

SQL NULLIF FUNCTION

SQL RANK FUNCTION

SQL RUNNING TOTAL

SQL PERCENT TOTAL

SQL CUMULATIVE PERCENT TOTAL

SQL Functions

SQL AVG() Function

SQL COUNT() Function

SQL FIRST() Function

SQL MAX() Function

SQL MIN() Function

SQL SUM() Function

SQL GROUP BY Statement

SQL HAVING Clause

SQL Upper() Function/UCASE

SQL lower() Function/LCASE

SQL MID() Function

SQL LENGTH() Function

SQL ROUND() Function

SQL NOW() Function

STRING FUNCTION

Concatenate Function

Substring Function

INSTR Function

Trim Function

Length Function

Replace Function

DATE FUNCTION (SQL SERVER)

DATEADD FUNCTION

DATEDIFF FUNCTION

DATEPART FUNCTION

GETDATE FUNCTION

SYSDATE FUNCTION

SQL
In a table, some of the columns may contain duplicate values. This is not a problem; however, sometimes you will want to list only the different (distinct) values in a table. The DISTINCT keyword can be used to return only distinct (different) values. SELECT DISTINCT column_name(s) FROM table_name The AND operator displays a record if both the first condition and the second condition are true. The OR operator displays a record if either the first condition or the second condition is true. AND: SELECT * FROM Persons WHERE FirstName='Tove' AND LastName='Svendson' OR: SELECT * FROM Persons WHERE FirstName='Tove' OR FirstName='Ola'

The ORDER BY keyword is used to sort the result-set by a specified column. The ORDER BY keyword sorts the records in ascending order by default; if you want to sort the records in descending order, use the DESC keyword. SQL ORDER BY Syntax SELECT column_name(s) FROM table_name ORDER BY column_name(s) ASC|DESC The UPDATE statement is used to update records in a table. UPDATE table_name SET column1=value, column2=value2,... WHERE some_column=some_value
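Two short illustrative examples, assuming the Persons sample table used in this document; the literal values are made up:
SELECT * FROM Persons ORDER BY LastName DESC
UPDATE Persons SET Address='Nissestien 67', City='Sandnes' WHERE LastName='Tjessem' AND FirstName='Jakob'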

The DELETE statement is used to delete records in a table. DELETE FROM table_name WHERE some_column=some_value
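A short illustrative example, again assuming the Persons sample table; the name values are made up:
DELETE FROM Persons WHERE LastName='Tjessem' AND FirstName='Jakob'
Note that DELETE FROM Persons without a WHERE clause would remove all rows from the table.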

It is possible to embed a SQL statement within another. When this is done in the WHERE or the HAVING clause, we have a subquery construct. The syntax is as follows: SELECT "column_name1" FROM "table_name1" WHERE "column_name2" [Comparison Operator] (SELECT "column_name3" FROM "table_name2" WHERE [Condition])
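A worked subquery example, assuming the Store_Information table used in the CASE example below; it lists the stores whose sales are above the average:
SELECT store_name, Sales
FROM Store_Information
WHERE Sales > (SELECT AVG(Sales) FROM Store_Information)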

Case is used to provide if-then-else type of logic to SQL. Its syntax is: SELECT CASE ("column_name") WHEN "condition1" THEN "result1" WHEN "condition2" THEN "result2" ... [ELSE "resultN"] END FROM "table_name" "condition" can be a static value or an expression. The ELSE clause is optional. Example :- SELECT store_name, CASE store_name WHEN 'Los Angeles' THEN Sales * 2 WHEN 'San Diego' THEN Sales * 1.5 ELSE Sales END "New Sales", Date FROM Store_Information

The TOP clause is used to specify the number of records to return. SQL Server syntax: SELECT TOP number column_name(s) FROM table_name The Oracle equivalent uses ROWNUM: SELECT column_name(s) FROM table_name WHERE ROWNUM <= number The LIKE operator is used in a WHERE clause to search for a specified pattern in a column. Cities starting with 's': SELECT * FROM Persons WHERE City LIKE 's%' Cities ending with 's': SELECT * FROM Persons WHERE City LIKE '%s' Cities not containing 'tav': SELECT * FROM Persons WHERE City NOT LIKE '%tav%'

The IN operator allows you to specify multiple values in a WHERE clause. SQL IN Syntax SELECT column_name(s) FROM table_name WHERE column_name IN (value1,value2,...) The BETWEEN operator is used in a WHERE clause to select a range of data between two values. SQL BETWEEN Syntax SELECT column_name(s) FROM table_name WHERE column_name BETWEEN value1 AND value2
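Two small examples, assuming the Persons sample table used elsewhere in this document (the values are illustrative):
SELECT * FROM Persons WHERE LastName IN ('Hansen','Pettersen')
SELECT * FROM Persons WHERE P_Id BETWEEN 1 AND 3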

With SQL, an alias name can be given to a table or to a column. SQL Alias Syntax for Tables SELECT column_name(s) FROM table_name AS alias_name SQL Alias Syntax for Columns SELECT column_name AS alias_name FROM table_name SQL joins are used to query data from two or more tables, based on a relationship between certain columns in these tables. The INNER JOIN keyword return rows when there is at least one match in both tables. SQL INNER JOIN Syntax SELECT column_name(s) FROM table_name1 INNER JOIN table_name2 ON table_name1.column_name=table_name2.column_name
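A minimal worked INNER JOIN example, assuming the Persons and Orders tables defined in the constraint examples later in this document (P_Id and OrderNo as defined there):
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons
INNER JOIN Orders
ON Persons.P_Id = Orders.P_Id
ORDER BY Persons.LastName
The same pattern applies to the LEFT, RIGHT, and FULL joins described below; only the join keyword changes.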

The LEFT JOIN keyword returns all rows from the left table (table_name1), even if there are no matches in the right table (table_name2). SQL LEFT JOIN Syntax SELECT column_name(s) FROM table_name1 LEFT JOIN table_name2 ON table_name1.column_name=table_name2.column_name

The RIGHT JOIN keyword returns all the rows from the right table (table_name2), even if there are no matches in the left table (table_name1). SQL RIGHT JOIN Syntax SELECT column_name(s) FROM table_name1 RIGHT JOIN table_name2 ON table_name1.column_name=table_name2.column_name

The FULL JOIN keyword return rows when there is a match in one of the tables. SQL FULL JOIN Syntax SELECT column_name(s) FROM table_name1 FULL JOIN table_name2 ON table_name1.column_name=table_name2.column_name

The UNION operator is used to combine the result-set of two or more SELECT statements. Notice that each SELECT statement within the UNION must have the same number of columns. The columns must also have similar data types. Also, the columns in each SELECT statement must be in the same order. SQL UNION Syntax SELECT column_name(s) FROM table_name1 UNION SELECT column_name(s) FROM table_name2 Similar to the UNION command, INTERSECT also operates on two SQL statements. The difference is that, while UNION essentially acts as an OR operator (value is selected if it appears in either the first or the second statement), the INTERSECT command acts as an AND operator (value is selected only if it appears in both statements). The syntax is as follows: [SQL Statement 1] INTERSECT [SQL Statement 2]

The MINUS operator operates on two SQL statements. It takes all the results from the first SQL statement and then subtracts out the ones that are present in the second SQL statement to get the final answer. If the second SQL statement includes results not present in the first SQL statement, such results are ignored. The syntax is as follows: [SQL Statement 1] MINUS [SQL Statement 2] Sometimes we may not want to retrieve all the records that satisfy the criteria specified in the WHERE or HAVING clauses. In MySQL, this is accomplished using the LIMIT keyword (a short example follows this answer). The syntax for LIMIT is as follows: [SQL Statement 1] LIMIT [N] The CREATE DATABASE statement is used to create a database. SQL CREATE DATABASE Syntax CREATE DATABASE database_name The CREATE TABLE statement is used to create a table in a database. SQL CREATE TABLE Syntax CREATE TABLE table_name ( column_name1 data_type, column_name2 data_type, column_name3 data_type, .... )
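A short LIMIT illustration (MySQL syntax), assuming the Store_Information table used elsewhere in this document; it returns the two stores with the highest sales:
SELECT store_name, Sales
FROM Store_Information
ORDER BY Sales DESC
LIMIT 2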

Constraints are used to limit the type of data that can go into a table. Constraints can be specified when a table is created (with the CREATE TABLE statement) or after the table is created (with the ALTER TABLE statement). We will focus on the following constraints: NOT NULL UNIQUE PRIMARY KEY FOREIGN KEY CHECK DEFAULT The NOT NULL constraint enforces a column to NOT accept NULL values. CREATE TABLE Persons ( P_Id int NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255), Address varchar(255), City varchar(255) )

The UNIQUE constraint uniquely identifies each record in a database table. The UNIQUE and PRIMARY KEY constraints both provide a guarantee of uniqueness for a column or set of columns. A PRIMARY KEY constraint automatically has a UNIQUE constraint defined on it. Note that you can have many UNIQUE constraints per table, but only one PRIMARY KEY constraint per table.

CREATE TABLE Persons ( P_Id int NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255), Address varchar(255), City varchar(255), CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName) ) SQL UNIQUE Constraint on ALTER TABLE ALTER TABLE Persons ADD CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName) To DROP a UNIQUE Constraint ALTER TABLE Persons DROP CONSTRAINT uc_PersonID

The PRIMARY KEY constraint uniquely identifies each record in a database table. Primary keys must contain unique values. A primary key column cannot contain NULL values. Each table should have a primary key, and each table can have only ONE primary key. CREATE TABLE Persons ( P_Id int NOT NULL PRIMARY KEY, LastName varchar(255) NOT NULL, FirstName varchar(255), Address varchar(255), City varchar(255) ) SQL PRIMARY KEY Constraint on ALTER TABLE ALTER TABLE Persons ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName) To DROP a PRIMARY KEY Constraint ALTER TABLE Persons DROP CONSTRAINT pk_PersonID

A FOREIGN KEY in one table points to a PRIMARY KEY in another table. CREATE TABLE Orders ( O_Id int NOT NULL PRIMARY KEY, OrderNo int NOT NULL, P_Id int FOREIGN KEY REFERENCES Persons(P_Id) ) SQL FOREIGN KEY Constraint on ALTER TABLE To create a FOREIGN KEY constraint on the "P_Id" column when the "Orders" table is already created, use the following SQL: ALTER TABLE Orders ADD CONSTRAINT fk_PerOrders FOREIGN KEY (P_Id) REFERENCES Persons(P_Id) To DROP a FOREIGN KEY Constraint ALTER TABLE Orders DROP CONSTRAINT fk_PerOrders The CHECK constraint is used to limit the value range that can be placed in a column. If you define a CHECK constraint on a single column it allows only certain values for this column. If you define a CHECK constraint on a table it can limit the values in certain columns based on values in other columns in the row. CREATE TABLE Persons ( P_Id int NOT NULL CHECK (P_Id>0), LastName varchar(255) NOT NULL, FirstName varchar(255), Address varchar(255), City varchar(255) )

SQL CHECK Constraint on ALTER TABLE To create a CHECK constraint on the "P_Id" column when the table is already created, use the following SQL: ALTER TABLE Persons ADD CONSTRAINT chk_Person CHECK (P_Id>0 AND City='Sandnes') To DROP a CHECK Constraint To drop a CHECK constraint, use the following SQL: SQL Server / Oracle / MS Access: ALTER TABLE Persons DROP CONSTRAINT chk_Person

The DEFAULT constraint is used to insert a default value into a column. The default value will be added to all new records, if no other value is specified. CREATE TABLE Persons ( P_Id int NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255), Address varchar(255), City varchar(255) DEFAULT 'Sandnes' ) SQL DEFAULT Constraint on ALTER TABLE ALTER TABLE Persons ALTER COLUMN City SET DEFAULT 'SANDNES' To DROP a DEFAULT Constraint ALTER TABLE Persons ALTER COLUMN City DROP DEFAULT

An index can be created in a table to find data more quickly and efficiently. The users cannot see the indexes, they are just used to speed up searches/queries. Note: Updating a table with indexes takes more time than updating a table without (because the indexes also need an update). So you should only create indexes on columns (and tables) that will be frequently searched against. SQL CREATE INDEX Syntax Creates an index on a table. Duplicate values are allowed: CREATE INDEX index_name ON table_name (column_name) SQL CREATE UNIQUE INDEX Syntax Creates a unique index on a table. Duplicate values are not allowed: CREATE UNIQUE INDEX index_name ON table_name (column_name) The ALTER TABLE statement is used to add, delete, or modify columns in an existing table. SQL ALTER TABLE Syntax To add a column in a table, use the following syntax: ALTER TABLE table_name ADD column_name datatype Very often we would like the value of the primary key field to be created automatically every time a new record is inserted. We would like to create an auto-increment field in a table. Use the following CREATE SEQUENCE syntax: CREATE SEQUENCE seq_person MINVALUE 1 START WITH 1 INCREMENT BY 1 CACHE 10
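A hedged usage sketch for the sequence created above (Oracle syntax), inserting into the Persons table from the earlier examples; the name values are illustrative:
INSERT INTO Persons (P_Id, LastName, FirstName)
VALUES (seq_person.NEXTVAL, 'Monsen', 'Lars')
Each call to seq_person.NEXTVAL returns the next number in the sequence, so P_Id is populated automatically.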

In SQL, a view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database. You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data were coming from one single table. SQL CREATE VIEW Syntax CREATE VIEW view_name AS SELECT column_name(s) FROM table_name WHERE condition SQL Updating a View You can update a view by using the following syntax: SQL CREATE OR REPLACE VIEW Syntax CREATE OR REPLACE VIEW view_name AS SELECT column_name(s) FROM table_name WHERE condition SQL Dropping a View You can delete a view with the DROP VIEW command. SQL DROP VIEW Syntax DROP VIEW view_name The most difficult part when working with dates is to be sure that the format of the date you are trying to insert, matches the format of the date column in the database. SQL Server comes with the following data types for storing a date or a date/time value in the database: DATE - format YYYY-MM-DD DATETIME - format: YYYY-MM-DD HH:MM:SS SMALLDATETIME - format: YYYY-MM-DD HH:MM:SS TIMESTAMP - format: a unique number

NULL values represent missing, unknown data. By default, a table column can hold NULL values. NULL means that data does not exist. NULL is not equal to 0 or an empty string; both 0 and an empty string represent a value, while NULL has no value. Any mathematical operation performed on NULL results in NULL. For example, 10 + NULL = NULL SQL IS NULL How do we select only the records with NULL values in the "Address" column? We have to use the IS NULL operator: SELECT LastName,FirstName,Address FROM Persons WHERE Address IS NULL SQL IS NOT NULL How do we select only the records with no NULL values in the "Address" column? We have to use the IS NOT NULL operator: SELECT LastName,FirstName,Address FROM Persons WHERE Address IS NOT NULL In SQL Server, the ISNULL() function is used to replace a NULL value with another value. For example, if we have the following table, Table Sales_Data store_name, Sales Store A, 300 Store B, NULL EXAMPLE :- SELECT SUM(ISNULL(Sales,100)) FROM Sales_Data; returns 400, because the NULL has been replaced by 100 via the ISNULL function.

The COALESCE function in SQL returns the first non-NULL expression among its arguments. It is the same as the following CASE statement: SELECT CASE ("column_name") WHEN "expression 1 is not NULL" THEN "expression 1" WHEN "expression 2 is not NULL" THEN "expression 2" ... [ELSE "NULL"] END FROM "table_name" EXAMPLE :- SELECT Name, COALESCE(Business_Phone, Cell_Phone, Home_Phone) Contact_Phone FROM Contact_Info;

The IFNULL() function (MySQL) takes two arguments. If the first argument is not NULL, the function returns the first argument; otherwise, the second argument is returned. This function is commonly used to replace a NULL value with another value. It is similar to the NVL function in Oracle and the ISNULL function in SQL Server. For example, if we have the following table, Table Sales_Data store_name Sales Store A 300 Store B NULL EXAMPLE :- SELECT SUM(IFNULL(Sales,100)) FROM Sales_Data; returns 400. This is because the NULL has been replaced by 100 via the IFNULL function.

NVL() is available in Oracle, but not in MySQL or SQL Server. This function is used to replace a NULL value with another value. It is similar to the IFNULL function in MySQL and the ISNULL function in SQL Server. For example, if we have the following table, Table Sales_Data store_name Sales Store A 300 Store B NULL Store C 150 EXAMPLE :- SELECT SUM(NVL(Sales,100)) FROM Sales_Data; returns 550. This is because the NULL has been replaced by 100 via the NVL function, hence the sum of the 3 rows is 300 + 100 + 150 = 550. The NULLIF() function takes two arguments. If the two arguments are equal, then NULL is returned; otherwise, the first argument is returned. It is the same as the following CASE statement: SELECT CASE ("column_name") WHEN "expression 1 = expression 2 " THEN "NULL" [ELSE "expression 1"] END FROM "table_name" EXAMPLE :- SELECT Store_name, NULLIF(Actual,Goal) FROM Sales_Data;

Displaying the rank associated with each row is a common request, and there is no straightforward way to do so in SQL. To display rank in SQL, the idea is to do a self-join, list out the results in order, and do a count on the number of records that are listed ahead of (and including) the record of interest. Let's use an example to illustrate. Say we have the following table, EXAMPLE :- SELECT a1.Name, a1.Sales, COUNT(a2.sales) Sales_Rank FROM Total_Sales a1, Total_Sales a2 WHERE a1.Sales <= a2.Sales or (a1.Sales=a2.Sales and a1.Name = a2.Name) GROUP BY a1.Name, a1.Sales ORDER BY a1.Sales DESC, a1.Name DESC;

Displaying running totals is a common request, and there is no straightforward way to do so in SQL. The idea for using SQL to display running totals is similar to that for displaying rank: first do a self-join, then list out the results in order. Whereas finding the rank requires doing a count of the records that are listed ahead of (and including) the record of interest, finding the running total requires summing the values for the records that are listed ahead of (and including) the record of interest. EXAMPLE :- SELECT a1.Name, a1.Sales, SUM(a2.Sales) Running_Total FROM Total_Sales a1, Total_Sales a2 WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name) GROUP BY a1.Name, a1.Sales ORDER BY a1.Sales DESC, a1.Name DESC;

To display percent to total in SQL, we want to leverage the ideas we used for rank/running total plus a subquery. Different from what we saw in the SQL Subquery section, here we want to use the subquery as part of the SELECT. EXAMPLE :- SELECT a1.Name, a1.Sales, a1.Sales/(SELECT SUM(Sales) FROM Total_Sales) Pct_To_Total FROM Total_Sales a1, Total_Sales a2 WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name) GROUP BY a1.Name, a1.Sales ORDER BY a1.Sales DESC, a1.Name DESC;

To display cumulative percent to total in SQL, we use the same idea as in the Percent To Total section. The difference is that we want the cumulative percent to total, not the percentage contribution of each individual row. EXAMPLE :- SELECT a1.Name, a1.Sales, SUM(a2.Sales)/(SELECT SUM(Sales) FROM Total_Sales) Pct_To_Total FROM Total_Sales a1, Total_Sales a2 WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name) GROUP BY a1.Name, a1.Sales ORDER BY a1.Sales DESC, a1.Name DESC;

SQL Aggregate Functions SQL aggregate functions return a single value, calculated from values in a column. Useful aggregate functions: AVG() - Returns the average value COUNT() - Returns the number of rows FIRST() - Returns the first value LAST() - Returns the last value MAX() - Returns the largest value MIN() - Returns the smallest value SUM() - Returns the sum SQL Scalar functions SQL scalar functions return a single value, based on the input value. Useful scalar functions: UCASE() - Converts a field to upper case LCASE() - Converts a field to lower case MID() - Extracts characters from a text field LEN() - Returns the length of a text field ROUND() - Rounds a numeric field to the number of decimals specified NOW() - Returns the current system date and time FORMAT() - Formats how a field is to be displayed The AVG() Function The AVG() function returns the average value of a numeric column. SELECT AVG(column_name) as (Alias_column_name) FROM table_name Now we want to find the customers that have an OrderPrice value higher than the average OrderPrice value. We use the following SQL statement: SELECT Customer FROM Orders WHERE OrderPrice>(SELECT AVG(OrderPrice) FROM Orders) The COUNT() function returns the number of rows that match a specified criteria. SQL COUNT(column_name) Syntax SELECT COUNT(column_name) FROM table_name

SQL COUNT(*) Syntax The COUNT(*) function returns the number of records in a table: SELECT COUNT(*) FROM table_name SQL COUNT(DISTINCT column_name) Syntax The COUNT(DISTINCT column_name) function returns the number of distinct values of the specified column: SELECT COUNT(DISTINCT column_name) FROM table_name The FIRST() function returns the first value of the selected column. SQL FIRST() Syntax SELECT FIRST(OrderPrice) AS FirstOrderPrice FROM Orders

The MAX() Function The MAX() function returns the largest value of the selected column. SQL MAX() Syntax SELECT MAX(column_name) as (Alias_Column_name) FROM table_name The MIN() Function The MIN() function returns the smallest value of the selected column. SQL MIN() Syntax SELECT MIN(column_name) as (Alias_Column_name) FROM table_name

The SUM() Function The SUM() function returns the total sum of a numeric column. SQL SUM() Syntax SELECT SUM(column_name) as (Alias_Column_name) FROM table_name The GROUP BY Statement The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns. SQL GROUP BY Syntax SELECT column_name, aggregate_function(column_name) FROM table_name WHERE column_name operator value GROUP BY column_name The HAVING Clause The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate functions. SQL HAVING Syntax SELECT column_name, aggregate_function(column_name) FROM table_name WHERE column_name operator value GROUP BY column_name HAVING aggregate_function(column_name) operator value
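A combined GROUP BY / HAVING example, assuming the Orders table with the Customer and OrderPrice columns referenced in the AVG() example above; the threshold 1500 is arbitrary:
SELECT Customer, SUM(OrderPrice) AS TotalOrderPrice
FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice) > 1500
The WHERE clause filters rows before grouping, while HAVING filters the grouped results.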

The UPPER() function converts the value of a field to uppercase. Syntax for SQL Server: SELECT UPPER(column_name) FROM table_name The LOWER() function converts the value of a field to lowercase. Syntax for SQL Server: SELECT LOWER(column_name) FROM table_name The MID() function is used to extract characters from a text field. SQL MID() Syntax SELECT MID(column_name,start[,length]) FROM table_name Example SELECT MID(City,1,4) as SmallCity FROM Persons The LENGTH() Function The LENGTH() function returns the length of the value in a text field. SQL LENGTH() Syntax SELECT LENGTH(column_name) FROM table_name The ROUND() Function The ROUND() function is used to round a numeric field to the number of decimals specified. SQL ROUND() Syntax SELECT ROUND(column_name,decimals) FROM table_name
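A small scalar-function illustration, assuming the Persons sample table used in this document (Oracle/MySQL syntax; SQL Server uses LEN instead of LENGTH):
SELECT UPPER(LastName) AS LastNameUpper, LENGTH(City) AS CityLength
FROM Persons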

STRING FUNCTION
Concatenate: it is often necessary to combine (concatenate) the results from several different fields. Each database provides a way to do this: MySQL: CONCAT() Oracle: CONCAT(), || SQL Server: + Example :- MySQL/Oracle: SELECT CONCAT(Column1,Column2) FROM Geography WHERE Column2 = 'Boston'; Oracle: SELECT Column1 || ' ' || Column2 FROM Geography WHERE Column2 = 'Boston'; SQL Server: SELECT Column1 + ' ' + Column2 FROM Geography WHERE Column2 = 'Boston'; Substring: is used to grab a portion of the stored data. This function is called differently in the different databases: MySQL: SUBSTR(), SUBSTRING() Oracle: SUBSTR() SQL Server: SUBSTRING() Example 1 :- SELECT SUBSTR(store_name, 3) FROM Geography WHERE store_name = 'Los Angeles'; Example 2 :- SELECT SUBSTR(store_name,2,4) FROM Geography WHERE store_name = 'San Diego';

INSTR: is used to find the starting location of a pattern in a string. This function is available in MySQL and Oracle, though they have slightly different syntaxes: MySQL: INSTR(str, pattern): finds the starting location of pattern in string str. Oracle: INSTR(str, pattern, [starting position, [nth location]]) Example 1 :- SELECT INSTR(store_name,'o') FROM Geography WHERE store_name = 'Los Angeles'; Example 2 :- SELECT INSTR(store_name,'p') FROM Geography WHERE store_name = 'Los Angeles'; Example 3 :- SELECT INSTR(store_name,'e', 1, 2) FROM Geography WHERE store_name = 'Los Angeles'; Trim: is used to remove a specified prefix or suffix from a string. The most common pattern removed is white space. This function is called differently in different databases: MySQL: TRIM(), RTRIM(), LTRIM() Oracle: RTRIM(), LTRIM() SQL Server: RTRIM(), LTRIM() Example 1 :- SELECT TRIM(' Sample '); Example 2 :- SELECT LTRIM(' Sample '); Example 3 :- SELECT RTRIM(' Sample ');

Length: is used to get the length of a string. This function is called differently for the different databases: MySQL: LENGTH() Oracle: LENGTH() SQL Server: LEN() Example 1 :- SELECT LENGTH(store_name) FROM Geography WHERE store_name = 'Los Angeles'; Example 2 :- SELECT region_name, LENGTH(region_name) FROM Geography;

Replace: is used to update the content of a string. The function call is REPLACE() in MySQL, Oracle, and SQL Server. The syntax of the Replace function is REPLACE(str1, str2, str3): in str1, find where str2 occurs and replace it with str3. Example :- SELECT REPLACE(region_name, 'ast', 'astern') FROM Geography;

DATE FUNCTION (SQL SERVER)


DATEADD: is used to add an interval to a date. This function is available in SQL Server. The usage for the DATEADD function is DATEADD (datepart, number, expression) Example :- SELECT DATEADD(day, 10,'2000-01-05 00:05:00.000'); DATEDIFF: is used to calculate the difference between two dates, and is used in MySQL and SQL Server. Example :- SELECT DATEDIFF(day, '2000-01-10','2000-01-05'); DATEPART: is a SQL Server function that extracts a specific part of a date/time value. Its syntax is as follows: DATEPART (datepart, expression)

Example :- SELECT DATEPART (yyyy,'2000-01-20'); Example :- SELECT DATEPART(dy, '2000-02-10');

GETDATE(): is used to retrieve the current database system time in SQL Server. Its syntax is GETDATE() Example :- SELECT GETDATE(); SYSDATE: is used to retrieve the current database system time in Oracle and MySQL. Example :- SELECT SYSDATE FROM DUAL;

Troubleshooting
Installation log files are located in %TEMP%\ibm_is_logs
