
Primary Index

1. The Primary Index (PI) is the key to data distribution in Teradata. It determines where a row will reside.
2. The PI provides the fastest physical path to retrieve the data and is incredibly important to joins.
3. Selecting a proper PI avoids data storage skew.
4. Teradata can hash two very different values and sometimes produce the same Row Hash. This is
called a Hash Collision. It is sometimes called a Synonym.
Criteria to select Primary Index column for a given table
1. Identify index candidates that maximize one-AMP operations.
2. Columns most frequently used for access (Value and Join).
3. Identify index candidates that optimize parallel processing.
4. Columns that provide good distribution.
Unique Primary Index (UPI)
1. Is unique and can't have duplicates. Duplicate rows will be rejected, and no duplicate checking is required
2. Is always one AMP operation
Non-Unique Primary Index (NUPI)
1. Values for the selected column can be non-unique.
2. Use when the NUPI column may be more effective for query access and joins.
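A minimal DDL sketch of both choices, using a hypothetical Employee table and column names:

CREATE TABLE Employee
( emp_no INTEGER NOT NULL
, dept_no INTEGER
, emp_name VARCHAR(50))
UNIQUE PRIMARY INDEX (emp_no); -- UPI: duplicates rejected, one-AMP access

CREATE TABLE Emp_History
( emp_no INTEGER NOT NULL
, change_dt DATE)
PRIMARY INDEX (emp_no); -- NUPI: duplicates allowed; equal values hash to the same AMP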
Skew Factor
The unevenness of a table's data distribution among the AMPs is called the Skew Factor. With a Non-Unique PI we generally
get duplicate values, and the more duplicate values there are, the more rows share the same row hash, so all of those rows
land on the same AMP. This makes the data distribution unequal: one AMP stores more data while the others store less. When
the full table is accessed, the AMP holding the most data takes the longest time and keeps the other AMPs waiting, which
wastes parallel processing capacity.
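One common way to check how evenly a PI distributes rows is to count rows per AMP using the hash functions described in the HASH FUNCTIONS section below; the table and column names here (Order_Table, cust_id) are only illustrative:

SELECT HASHAMP (HASHBUCKET (HASHROW (cust_id))) AS AMP_No
, COUNT(*) AS Row_Cnt
FROM Order_Table
GROUP BY 1
ORDER BY 2 DESC;

A large gap between the highest and lowest Row_Cnt indicates skew.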
Hashing
1. Hashing is a mathematical process where an Index (UPI, NUPI) is converted into a 32-bit row hash value.
2. Teradata takes that Primary Index value and runs it through a Hashing Algorithm. The output of the Hashing
Algorithm is a 32-bit Row Hash.
3. The 32-bit Row Hash will point to a certain spot on the Hash Map, which will indicate which AMP will hold the
row. This 32-bit Row Hash will always remain with the Row as part of a Row Identifier (Row ID).
4. The first 16 bits of the Row Hash (Destination Selection Word) are used to locate an entry in the Hash Map.
This entry is called a Hash Map Bucket. The only thing that resides inside a Hash Map Bucket is the AMP
number where the row will reside. The row, along with its Row Hash, is delivered to that AMP.
5. The AMP will then assign a Uniqueness Value to the Row Hash: a 1 if the Row Hash is unique, a 2 if
it is the second occurrence, a 3 if the third, etc.
6. The 32-bit row hash and the 32-bit uniqueness value make up the 64-bit Row ID. The Row ID is how tables
are sorted on an AMP
HASH FUNCTIONS
HASHROW : returns the row hash value for a given value
HASHBUCKET : the grouping of a specific hash value
HASHAMP : the AMP that is associated with the hash bucket
HASHBAKAMP : the fallback AMP that is associated with the hash bucket
SELECT HASHROW ('Teradata') AS "Hash Value"
, HASHBUCKET (HASHROW ('Teradata')) AS "Bucket Num"
, HASHAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Num"
, HASHBAKAMP (HASHBUCKET (HASHROW ('Teradata'))) AS "AMP Fallback Num" ;
Binary Search
1. When an AMP searches for a row using a Primary Index, it can perform a Binary Search because each table is
sorted by Primary Index Row-ID, and all Row-IDs are made up of zeros and ones.
2. The AMP can go to the middle of the rows and pick a row. The system will say either "Too high", "Too low" or "Got it".
3. If the system says "Too low" or "Too high", the AMP will go halfway up or down the file and
check again, repeating until it finds the row.
Partition Primary Index ( PPI)

Data in a PPI table is always distributed by the PI column, then partitioned by the PPI column, and then sorted by
row hash within each partition. PPIs are best for queries that specify range constraints.

The partition column can be different than the PI column, but you can NOT have a UNIQUE PRIMARY INDEX on a table
that is partitioned by something not included in the Primary Index.

PPI reduces the number of rows to be processed by using partition elimination. The process of accessing only chunks
of data along the partitioning attributes is often referred to as partition elimination.
PPI avoids full table scans without the overhead of a secondary index, and allows for instantaneous dropping of
old data and rapid addition of newer data.
Partitioning doesn't affect distribution. Partitioning only affects how each AMP sorts the rows they get.
To handle queries when you partition by a column that is not part of the Primary Index you can assign a
Unique Secondary Index or you can include the partition column in your SQL
A partitioned table will always add two bytes to every row as part of the Row-ID.
If a table is partitioned, the partition number is placed in front of the Row-ID for each row.
This combination of the Partition number, Row-Hash, and Uniqueness value are now called the ROW KEY.
Instead of sorting by the Row-ID we are merely first sorting by the Partition Number. We are really just sorting
by the Row Key!
If a table is NOT partitioned the Partition Number is merely set to ZERO!
While accessing PPI the cylinder index of the AMP is queried to find out on which cylinder the first data block
of the accessed partition is located.
For a NUPI, many rows can have the same primary index and will hash to the same AMP. However, each of these
rows could belong to a different partition.
If a UPI were allowed without the partition column being part of the primary index, any update or insert
statement would require Teradata to check each partition to avoid the creation of duplicates. This is very
inefficient from a performance point of view.
If the primary index does not include the partitioning columns, each time a primary index access is
required, the responsible AMP has to scan all its partitions for this particular primary index. This will not be the
case if you include the partition columns in the primary index.
When another table (without PPI) is joined with a PPI table on a PI=PI condition, the rows are not ordered the
same way, and the task, in effect, becomes a set of sub-joins, one for
each partition of the PPI table. This type of join is called a sliding window join.
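As an illustrative sketch of partition elimination (assuming an Order_Table partitioned by Order_Date as in the examples below), a range constraint on the partitioning column lets each AMP read only the qualifying partitions:

SELECT *
FROM Order_Table
WHERE Order_Date BETWEEN DATE '2011-01-01' AND DATE '2011-01-31';
-- Only the partitions covering January 2011 are scanned, not the full table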

Types of Partitions
Partitioning with CASE_N
PRIMARY INDEX (Customer_Number)
PARTITION BY CASE_N
( Order_Total < 1000, Order_Total < 5000
, Order_Total < 10000, Order_Total < 50000, NO CASE, UNKNOWN );
Partitioning with RANGE_N
PRIMARY INDEX (Customer_Number)
PARTITION BY RANGE_N ( Order_Date BETWEEN DATE '2010-01-01' AND DATE '2011-12-01' EACH INTERVAL '1'
MONTH );
OR
PARTITION BY RANGE_N ( Order_Date BETWEEN DATE '2010-01-01' AND DATE '2010-12-01' EACH INTERVAL '7' DAY,
DATE '2011-01-01' AND DATE '2011-12-01' EACH INTERVAL '7' DAY );
Queries
SEL * FROM dbc.indices WHERE indextype = 'Q';
SELECT PARTITION AS Partition_Number, COUNT(*) AS Rows_In_Partition FROM Order_Table GROUP BY 1;
Restrictions
If a table is populated then there are restrictions to the ALTER Table command with PPI Tables:
You can't ALTER the primary index columns.
You can ALTER the table to change to a UNIQUE PRIMARY INDEX only if the NUPI already had a Unique Secondary
Index.
You can't add or drop the NO RANGE or UNKNOWN partitions. You can only ADD RANGE or DROP RANGE at the ends.
If the table is empty then we can also change the NO RANGE or UNKNOWN partitions.
ALTER TABLE Order_Table MODIFY PRIMARY INDEX
DROP RANGE BETWEEN DATE '2004-01-01' AND DATE '2004-12-31' EACH INTERVAL '1' MONTH
ADD RANGE BETWEEN DATE '2006-01-01' AND DATE '2006-12-31' EACH INTERVAL '1' MONTH
WITH DELETE ; -- or: WITH INSERT INTO Order_Table_Backup ;
NO CASE OR UNKNOWN : NULLs and out-of-range data are kept in the same partition.
NO CASE, UNKNOWN : NULLs and out-of-range data are kept in different partitions.
Advantages

Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.

Only the rows of the qualified partitions in a query need to be accessed, avoiding full table scans.
Provides an access path to the rows in the base table while still providing efficient join strategies.

Limitations of Partitioned Primary Index (PPI) :

The primary index of a PPI table has to be a Non-Unique PI if the PPI column is not part of the index, since enforcing a
Unique PI would require checking for a duplicate key value in each partition, which would be very expensive.

The Primary Index of a PPI table can be Unique if the PPI column is part of the UPI. This confines the uniqueness
check to a single partition.

PPI can be defined on Global temporary tables and Volatile tables and cannot be defined on compressed join
indices

PPI table rows occupy two extra bytes compared to NPPI table rows, as these extra bytes store the partition
number for each row. PPI table rows are four bytes wider if value compression is specified for the table.
It is beneficial to collect stats on the partition column. Collecting stats on the system-derived column PARTITION is faster
because rather than reading all the base table rows for collecting information, it usually just scans the cylinder index
for that PPI table.
HELP STATISTICS tablename COLUMN PARTITION; -- lists partitions in the table and their details
COLLECT STATISTICS ON tablename COLUMN PARTITION; -- refreshes partition details
Multi-Level Partitioning

For multi-level partitioning, you do not have to include all partition levels in the WHERE condition in order to be able
to eliminate partitions. Each of the partition levels can be addressed independently.

You can have up to 15 levels of partitions within partitions.

Partitioning merely tells each AMP how to sort its rows for the table.

So think of Multi-Level partitioning as a table with multiple sort keys. The first partition statement is how the
data is sorted first. The second partition statement is the second sort key.

The entire purpose of partitioning is to eliminate the Full Table Scan. Instead of reading all rows in a table, each
AMP merely has to read one or more of its partitions.
Ex:- CREATE TABLE ORDER_DATA ( ORDER_NUM INTEGER NOT NULL
,CUST_NUM INTEGER, ORDER_DATE DATE, ORDER_TOT DECIMAL(10,2)) PRIMARY INDEX (ORDER_NUM)
PARTITION BY (RANGE_N ( ORDER_DATE BETWEEN DATE '2012-01-01' AND DATE '2012-12-31' EACH INTERVAL '1'
DAY), CASE_N(ORDER_TOT<1000, ORDER_TOT<5000, ORDER_TOT<6000, NO CASE, UNKNOWN));
Teradata Space
TERADATA has three kinds of Storage Space namely PERMANENT, SPOOL and TEMPORARY.
Permanent Space is used for storing permanent data like Permanent Tables, Secondary Indexes and Permanent
Journals. Whenever a table is created it occupies Permanent Space, and whenever data is deleted or database
objects like tables and indexes are dropped, Permanent Space is released. Permanent Space is allocated at the time of
creating a USER/DATABASE.
Perm space is always calculated by adding all the space on all the AMPs. At the time of delivery, the DBC user owns all the
perm space.
Spool Space is all the available, unoccupied Permanent Space. Spool
Space is used for carrying out all the intermediary SQL operations, like creating derived tables, performing
aggregations, or storing the result sets of joins. Whenever a SQL query exceeds the available SPOOL Space, the
query is aborted.
Users run out of spool space if they exceed their limit on a per-AMP basis. Different users can have different
spool space limits, and spool is released when the query ends.
Temporary Space is used for storing GLOBAL TEMPORARY TABLES in Teradata. For such tables the table definition is
stored, but the table data is truncated once the session is over. Temp space is also unused Permanent Space.
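A sketch of how the allocations can be inspected, assuming the DBC.DiskSpaceV view available in recent releases:

SELECT DatabaseName
, SUM(MaxPerm) AS Max_Perm
, SUM(CurrentPerm) AS Current_Perm
, SUM(MaxSpool) AS Max_Spool
, SUM(CurrentTemp) AS Current_Temp
FROM DBC.DiskSpaceV
GROUP BY 1
ORDER BY 1;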
Types of Locks in Teradata
Locking in Teradata is automatic and cannot be turned off for normal tables. There are four types of locks that are used
and they are:

EXCLUSIVE : The resource is temporarily owned. Not available to other users until the lock is released. Exclusive locks
are placed only on a database or table when the object is going through a structural change. An
Exclusive lock restricts access to the object by any other user.
WRITE : A Write lock happens on an INSERT, DELETE, or UPDATE request. A Write lock restricts access by
other users.
READ : This is placed in response to a SELECT request. A Read lock restricts access by users who require
Exclusive or Write locks. This lock can also be explicitly placed using the LOCKING modifier.
ACCESS : Affectionately called a Dirty Read lock. Allows a SELECT to read data that is locked for WRITE. It is a
very minimal form of locking. Placed in response to a user-defined LOCKING FOR ACCESS phrase.
This chart shows the automatic locking in Teradata respective to SQL commands:

Type of Lock | Caused by                      | Locks Blocked for Other Users
EXCLUSIVE    | DDL                            | EXCLUSIVE, WRITE, READ, ACCESS
WRITE        | INSERT, UPDATE, DELETE         | EXCLUSIVE, WRITE, READ
READ         | SELECT                         | EXCLUSIVE, WRITE
ACCESS       | SELECT with LOCKING FOR ACCESS | EXCLUSIVE

The resource that is locked depends on the SQL command requested by the user. The lock
may be set at the database, view, table, or row level.

Locked at | Resource(s) unavailable to other users
DATABASE  | All tables, views, macros and triggers owned by the database/user.
VIEW      | All tables referenced in the View.
TABLE     | All rows in the table.
ROW       | All rows with the same row hash.

All SQL commands automatically request a lock. The Teradata RDBMS attempts to lock the resource at the lowest level
possible; the lowest level is a row lock. However, Teradata places more importance on performance than resource
availability, which implies that the optimizer has the last say in the locking level that is used.
For instance, an UPDATE has the option of locking at the table or row level. The optimizer knows that when an entire
table is locked, all other users must wait to read even a single row from the table. However, when only a row is WRITE
locked, other users still have access to the table, and only have to wait if they need to read the row currently locked.
Therefore, row-level locks are normally preferable so that rows have maximum availability for users.
However, if the optimizer knows that all rows in a table are going to be changed, it could follow row locking to
allow as much access as possible, but eventually all rows are locked anyway. It also knows that locking and then
reading rows one at a time takes longer than locking the table once, reading all rows as fast as possible, and then
releasing all locks at once. Therefore, the normal row-level lock is escalated to a table-level lock for speed on a
full table scan. Additionally, locking the table eliminates the potential for a deadlock between multiple user
requests.
These are the various syntax formats of the LOCKING modifier:
LOCKING [<table-name>] FOR <desired-locking> [NOWAIT]
LOCKING ROW FOR <desired-locking>
LOCKING DATABASE <database-name> FOR <desired-locking>
LOCKING VIEW <view-name> FOR <desired-locking>
LOCKING TABLE <table-name> FOR <desired-locking>
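For example, a dirty read of a hypothetical Employee table can be requested explicitly:

LOCKING TABLE Employee FOR ACCESS
SELECT * FROM Employee; -- reads through WRITE locks; may return uncommitted data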

No Primary index tables ( from release 13)


The primary index is the main idea behind evenly distributing data on a Teradata system. In a NOPI table, rows are
basically distributed randomly across the AMPs; hashing of a primary index does not take place. To identify each row
uniquely, a ROWID is still added to the rows: Teradata uses a hash bucket owned by the responsible AMP and adds a
uniqueness value to generate the ROWID.

CREATE TABLE <TABLE>( PK INTEGER NOT NULL ) NO PRIMARY INDEX ;


CREATE TABLE Table1 AS Table2 WITH DATA NO PRIMARY INDEX
SELECT * FROM DBC.TABLES WHERE TABLEKIND = 'O';
A DBSControl flag determines whether a PI or NOPI table is created when we don't specify any of the following in the
CREATE TABLE DDL:

PRIMARY INDEX clause.

NO PRIMARY index clause.

PRIMARY KEY OR UNIQUE constraints.


This DBS Control field is Field 53 and is named 'Primary Index Default'.
Possible values for this field are

D --> Teradata Default. Currently Default is same as Option P.

P --> "First Column NUPI". Creates tables with 1st column as NUPI.

N --> "No Primary Index". Creates tables without PI.


However, note that if this option is chosen as N and we create a table without an explicit PI, but the table has a UNIQUE or
PRIMARY KEY constraint defined, then the UNIQUE or PRIMARY KEY takes precedence over the 'N' option and the table is created
with a Unique Primary Index.

Consider using NOPI tables during the ETL process in cases where Teradata has to do full table scans anyway, such as SQL
transformations carried out on each row.
In a NOPI table, hashing and redistribution are not needed; the table is ready once the rows have been distributed
randomly across the AMPs. No sorting is needed. Further, as rows are assigned randomly to the AMPs, your data will
always be distributed evenly across all AMPs and no skewing will occur. This makes loading faster; we can say
only the acquisition phase of the loading utilities is executed.
Another advantage of NOPI tables is that records are always appended to the end of the table's data blocks,
thus avoiding any overhead normally caused by sorting the data by row hash into the data blocks. For example,
in case you INSERT...SELECT huge amounts of rows into your NOPI table, this will reduce I/Os significantly
compared against primary index tables.
NOPI tables being bulk loaded are never skewed. Still, if you INSERT...SELECT from a primary index table into
a NOPI table, local copying of the rows will be applied, so any existing skew carries over. Basically, NOPI tables
are not designed to be production tables.

There are some further restrictions if you decide to use no primary index tables. Here are the most important:

MultiLoad is not supported for NOPI tables, as MultiLoad makes use of the PI for its operation

Only MULTISET tables can be created

No identity columns can be used

NOPI tables cannot be partitioned with a PPI

No statements with an update character are allowed (UPDATE, MERGE INTO, UPSERT); you can still use
INSERT, DELETE and SELECT

No Permanent Journal possible

Cannot be defined as Queue tables

Update triggers cannot update a NOPI table (probably introduced with a later release)

No hash indexes are allowed (use join indexes instead)


Although above restrictions apply to NOPI tables, you still can use the below features as usual:

Fallback protection of the table

Secondary Indexes (USI, NUSI)

Join Indexes

CHECK and UNIQUE constraints

Triggers

Collection of statistics
Secondary Indexes
Secondary Indexes provide an alternate path to the data, and should be used on queries that run thousands of times.
Teradata runs extremely well without secondary indexes; they require extra perm space to store subtables and overhead
for their maintenance.

When a USI is designated on a table, each AMP will build a subtable to point back to the base table. When a
Non-Unique Secondary Index (NUSI) is designated on a table, each AMP will build a subtable. The NUSI
subtable is said to be AMP local because each AMP will create its secondary index subtable to point to its own
base rows.
You can have up to 32 secondary indexes for a table.
Secondary Indexes provide an alternate path to the data and uses permanent storage space
Every secondary index defined causes each AMP to create a subtable.
USI subtables are hash distributed. USI queries are Two-AMP operations.
NUSI subtables are AMP local. NUSI queries are All-AMP operations, but not Full Table Scans. For a NUSI, if an
AMP contains duplicate values, only one subtable row holds the multiple base row IDs.
Value-Ordered NUSIs can be any non-unique index of integer type.
Always Collect Statistics on all NUSI indexes.
The PE will decide if a NUSI is strongly selective and worth using over a Full Table Scan.
Use the Explain function to see if a NUSI is being utilized or if bitmapping is taking place
A secondary index subtable row contains the secondary index value, the secondary index row ID, and the base table (primary index) row ID
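A minimal sketch of both kinds of secondary index, against a hypothetical Employee table:

CREATE UNIQUE INDEX (emp_ssn) ON Employee; -- USI: two-AMP access path
CREATE INDEX (dept_no) ON Employee; -- NUSI: all-AMP operation, AMP-local subtables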

Value Ordered NUSI


When a Value-Ordered Non-Unique Secondary Index (Value-Ordered NUSI) is designated on a table, each AMP will
build a subtable. In a Value-Ordered NUSI, instead of the subtable being sorted by the hash of the secondary index value,
it is sorted numerically by the SI column. Value-Ordered NUSIs are efficient for processing queries with range conditions
and inequality conditions on the secondary index column.
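A sketch of the DDL (table and column assumed): adding ORDER BY VALUES makes the subtable value-ordered instead of hash-ordered:

CREATE INDEX (order_date) ORDER BY VALUES ON Order_Table;
-- Subtable rows are sorted by order_date values, which benefits range predicates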
Advantages:

A secondary index might be created and dropped dynamically

A table may have up to 32 secondary indexes.

Secondary index can be created on any column. Either Unique or Non-Unique

It is used as an alternate path for less frequently used access. For example, defining a SI on a non-indexed column can
improve performance if that column is used in the join or filter condition of a given query.
Disadvantages

Since subtables are to be created, there is always an overhead of additional space.

They require additional I/Os to maintain their sub tables.

The Optimizer may, or may not, use a NUSI, depending on its selectivity.

If the base table is Fallback, the secondary index sub table is Fallback as well.

If statistics are not collected accordingly, then the optimizer would go for Full Table Scan.
NUSI bitmapping is used when multiple NUSIs are combined with an AND condition. It identifies the common row IDs
before retrieving the base table rows.
Joins and Join Indexes in Teradata
Teradata's Optimizer has the ability to interpret a user's join types and then decide on the best
join strategy to take in order to complete the query. Basically, joins are combining rows from two or more tables.
The Key Things about Teradata and Joins

Each AMP holds a portion of a table.

Teradata uses the Primary Index to distribute the rows among the AMPs.

Each AMP keeps their tables separated from other tables like someone might keep clothes in a dresser drawer.

Each AMP sorts their tables by Row ID. For a JOIN to take place the two rows being joined must find a way to
get to the same AMP.

If the rows are not naturally on the same AMP then Teradata will perform two strategies to get them placed
together. Teradata will redistribute one or both of the tables in spool or it will copy the smaller table to all of
the AMPs.
Teradata determines the type of join strategy to be used based on the query, taking performance factors into
account. Some common join types in Teradata are:

Inner join (can also be "self join" in some cases)

Outer Join (Left, Right, Full)

Exclusion

Cross join (Cartesian product join)


Merge Join

Merge join is a concept in which rows to be joined must be present on the same AMP. If the rows to be joined are
not on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that
happen, based on the row hash of the columns involved in the join's WHERE clause.

If two tables to be joined have the same Primary Index, then the matching records are already present on the same AMP and redistribution of records is not required.

There are four row-placement scenarios for a Merge Join:

Case 1: If the joining columns are UPI = UPI, the records to be joined are already present on the same AMP and
redistribution is not required. This is the most efficient and fastest join strategy.

Case 2: If the joining columns are UPI = non-index column, the records of the second table have to be redistributed
across the AMPs based on the row hash of the join column.

Case 3: If the joining columns are non-index column = non-index column, both tables have to be
redistributed so that matching data lies on the same AMP. This strategy is time consuming, since complete
redistribution of both tables takes place across all the AMPs.

Case 4: For a join on UPI = non-index column, if the referenced table (the second table in the join) is
very small, then this table is duplicated/copied to every AMP. An illustration of Case 1 follows.
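A hypothetical sketch of Case 1: both tables declare the same PI column, so matching rows are already co-located and the merge join needs no spool redistribution:

-- Both tables are assumed to have PRIMARY INDEX (emp_no)
SELECT e.emp_name, p.phone_no
FROM Employee e
INNER JOIN Emp_Phone p
ON e.emp_no = p.emp_no;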
Nested Join
Nested Join is one of the most precise join plans suggested by the Optimizer. A Nested Join works on a UPI/USI used in the
join statement and is used to retrieve a single row from the first table. It then checks for matching rows in the second
table using an index (primary or secondary) and returns the matching results.
SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM EMPLOYEE EMP, DEPARTMENT DEP
WHERE EMP.Enum = DEP.Enum AND EMP.Enum = 2345; -- this results in a nested join
Hash join
Hash join is one of the plans suggested by the Optimizer based on the join conditions. A Hash Join is a close relative of the
Merge Join in its functionality. In the case of a merge join, the join happens on the same AMP. In a Hash Join, one or both
tables fit completely inside an AMP's memory: the AMP chooses to hold small tables in its memory for joins happening
on row hash.
Advantages of Hash joins are:
They are faster than Merge joins since the large table doesn't need to be sorted.
Since the join happens between a table in AMP memory and a table in unsorted spool, it happens quickly.
Hash Join gets its name from the fact that the smaller table is built as a hash table, and potential matching rows from
the second table are searched for by hashing against the smaller table. Usually the optimizer will first identify the smaller
table and then sort it by the join column row hash sequence. If the smaller table is really small and can fit in memory,
the performance will be best. Otherwise, the sorted smaller table will be duplicated to all the AMPs. Then the larger
table is processed one row at a time, doing a binary search of the smaller table for a match.
Exclusion Join
These types of joins are suggested by the optimizer when the following are used in queries: NOT IN, EXCEPT, MINUS, and SET
subtraction operations.
SELECT EMP.Ename, DEP.Deptno, EMP.Salary
FROM EMPLOYEE EMP
WHERE EMP.Enum NOT IN ( SELECT Enum FROM DEPARTMENT DEP WHERE Enum IS NOT NULL );
Please make sure to add an additional WHERE filter with <column> IS NOT NULL, since the presence of a NULL in a NOT IN
<column> list will return no results.
Product Joins

Product Joins compare every row of one table to every row of another table. They are called product joins
because they are a product of the number of rows in table one multiplied by the number of rows in table two.
For example, if one table has five rows and the other table has five rows, then the Product Join will compare 5
x 5 or 25 rows with a potential of 25 rows coming back.

To avoid a product join, check your syntax to ensure that the join is based on an EQUALITY condition. A
Product Join always results when the join condition is based on inequality. The reason the optimizer chooses
Product Joins for join conditions other than equality is that hash values cannot be compared for greater-than
or less-than comparisons.

A Product Join, Merge Join, or Exclusion Merge Join always requires spool files.

To implement a Product Join, Teradata identifies the smaller table and duplicates it in spool on all AMPs, then joins
each spool row of the smaller table to every row of the larger table.
Join processing
Each AMP performs join processing in parallel.

Optimizer chooses the best join strategy based on
available indexes and data demographics (Collect Statistics/Dynamic Sampling).

Rows must be on the same AMP for matching.

Teradata temporarily moves rows to the same AMP if they are not already there for the join. This is called row
redistribution.

Join Indexes
Join Index is an index structure that stores and maintains results from joining two or more tables

Join Indexes provide the means of improving performance on any type of recurring query that involves joins
and/or aggregate functions. A Join Index pre-joins tables and physically keeps the result on disk.
The closest option to a materialized view in Teradata is the JOIN index.
A join index is an index structure that can contain columns from one or more tables.
Note that once a join index is created, it is available only to the optimizer; it is the optimizer that decides whether to use
the join index or not. The index can never be directly accessed by the user.
A JOIN index helps in joining tables by providing the needed data from the index itself and also by avoiding
redistribution of data in many cases.
Once a JOIN index is created we don't need to maintain it; the RDBMS does that automatically, which means
that when the base rows change, the join index is changed automatically.
When creating the JOIN index we specify a primary index. A primary index gets assigned irrespective of
whether we explicitly specify one or not. The primary index is used to redistribute the index rows across the AMPs.
The index rows on the AMPs are sequenced by the hash value of the primary index of the join index.
However, this type of sequencing is not beneficial for range processing, hence we have an option to use an ORDER
BY clause to override the default sequencing.
A join index defined with an outer join covers both inner join and outer join queries.

Following are the types of JOIN indexes:

Multiple table Join index: This type of index is used to pre-join the tables, which can help prevent
redistribution of data

Single table Join index: This type of Join index is used to rehash and redistribute the rows of a single table
based on specified columns

Aggregate Join index: An aggregate join index is used to create a summary table.

Sparse Join Indexes are a type of Join Index which contain a WHERE clause that reduces the number of rows
which would otherwise be included in the index. All types of join indexes, including single-table, multi-table,
simple or aggregate, can be sparse.
CREATE JOIN INDEX OrderByCustomer
AS SELECT
departmentname, d.DEPARTMENTNO, employeeid, salary, hiredate
FROM department d JOIN employee e ON d.departmentno = e.departmentno
PRIMARY INDEX (departmentno);
CREATE INDEX (hiredate) ORDER BY VALUES ON OrderByCustomer;
COLLECT STATISTICS ON OrderByCustomer INDEX (hiredate);
A single-table JI is used to rehash and redistribute the rows of a table by a column other than the primary index. Assume a
scenario where we join two tables and one of them needs to get distributed on the join column so that the join
can be performed. This would be time consuming if the table is very huge. However, we can create a single-table join
index on this table with the column used for redistribution as the primary index of the join index. The rows will thus be
pre-distributed, so there won't be any redistribution while performing the join, which will speed up the join. A sketch of
this follows.
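The names here are assumed: pre-distribute Order_Table on the join column cust_id, so joins to a customer table find the index rows already in place:

CREATE JOIN INDEX Ord_By_Cust AS
SELECT order_id, cust_id, order_total
FROM Order_Table
PRIMARY INDEX (cust_id); -- index rows are pre-hashed on the join column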
Hash Index

Hash Indexes minimize disk I/Os by offering an alternate access path to the data records. A query can be
covered if the hash index has all the columns it needs. If the index is not covering, the base table rows can still
be accessed, as each index row carries the ROWID.

A HI allows you to define a distribution key, which cannot be done with a secondary index.

Columns used for data distribution have to be part of the columns which make up the hash index.

Maintained automatically by the system, hence it has maintenance overhead.


Limitations

A hash index cannot have a partitioned primary index

A hash index cannot have a non-unique secondary index.

Hash indexes cannot be specified for NOPI or column-partitioned base tables as they are designed around
the Teradata hashing algorithm for data partitioning (like ROWID pointers).

A hash index cannot be column partitioned

A hash index must have a primary index but a single-table join index can be created with or without a primary
index

Difference between HI and single table join index

A hash index cannot have a partitioned primary index, but a single-table join index can

A hash index must have a primary index, but a single-table join index can be created with or without a
primary index if it is column-partitioned.
CREATE HASH INDEX HIOrder (O_CustKey, O_OrderDate, O_TotalPrice)
ON OrderTbl BY (O_CustKey) ORDER BY (O_CustKey);
Tables
Global Temporary Tables (GTT)

When they are created, the definition goes into the Data Dictionary.
When materialized, the data goes into temp space.
Data is active until the session ends; the definition remains until it is dropped with a DROP TABLE
statement. If dropped from some other session then it should be DROP TABLE ALL;
Can collect stats on a GTT.
It is used whenever there is a need for a temporary table with the same table definition for all users.

Volatile Temporary Tables (VTT)

The table definition is stored in the system cache.
Data is stored in spool space; that's why both data and table definition are active only until the session ends.
No collect stats for a VTT. If you are using a volatile table, you cannot put default values on a column
( while creating the table ).
The LOG option allows a Volatile Table to use the Transient Journal during transactions
ON COMMIT { PRESERVE | DELETE } ROWS
LOG | NO LOG
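A minimal sketch of both flavors (table and column names assumed):

CREATE GLOBAL TEMPORARY TABLE gt_sales
( sale_id INTEGER, amt DECIMAL(10,2))
PRIMARY INDEX (sale_id)
ON COMMIT PRESERVE ROWS; -- definition stays in the Data Dictionary

CREATE VOLATILE TABLE vt_sales, NO LOG
( sale_id INTEGER, amt DECIMAL(10,2))
PRIMARY INDEX (sale_id)
ON COMMIT PRESERVE ROWS; -- definition and data vanish when the session ends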

The transient journal maintains a copy of the before-images of all rows affected by the transaction. In the event of
transaction failure, the before-images are reapplied to the affected tables and then deleted from the journal, and a
rollback operation is completed. In the event of transaction success, the before-images for the transaction are
discarded from the journal at the point of transaction commit.
The main difference between the Permanent Journal and the Transient Journal is that the Transient Journal is used to
rollback a transaction in case of a failure, and is automatic, while the Permanent Journal is used to recover all or some
of the database from a specified point in time, and is user created.
Fallback protects your data by storing a second copy of each row of a table on an alternative "fallback AMP". If one
AMP fails, the system accesses the fallback rows to meet the request. Fallback tables allow users to access data even
if one AMP fails.
The purpose of a permanent journal is to maintain a sequential history of all changes made to the rows of one or more
tables. Permanent journals help protect user data when users commit, uncommit, or abort transactions. A permanent
journal can capture a snapshot of rows before a change, after a change, or both. Permanent journaling is usually used
to protect data. Unlike the automatic (transient) journal, the contents of a permanent journal remain until you drop them.
The MERGEBLOCKRATIO option provides a way to combine existing small data blocks into a single larger data block
during full table modification operations for permanent tables and permanent journal tables. This option is not
available for volatile and global temporary tables. The file system uses the merge block ratio that you specify to
reduce the number of data blocks within a table that would otherwise consist mainly of small data blocks
Data compression: Compression in Teradata plays a very important role in saving space and increasing the
performance of SQL queries. In Teradata, COMPRESSION can be implemented in three ways:
Single Value or Multi-Value Compression (MVC): MVC uses a dictionary to maintain data values and their
corresponding bit patterns. While saving, Teradata replaces the exact value with the bit pattern, hence
occupying much less space. MVC works at the column level and must be defined explicitly for each column for which
COMPRESSION is required. The limitation of MVC is that you should know the values that are expected in the column.
CREATE TABLE EMPLOYEES
(
EMP_NAME CHAR(50) COMPRESS ('RAJ','KEVIN','OBAMA'),
EMP_LAST_DATE DATE COMPRESS,
EMP_DEPT CHAR(30) COMPRESS ('HR','IT','FS')
)
PRIMARY INDEX (EMP_NAME);
Algorithmic Compression (ALC): This type of compression uses an algorithm to COMPRESS the data while storing it, and
the reverse algorithm to DECOMPRESS the data while displaying it. Using ALC is a more resource-intensive process.

CREATE TABLE EMPLOYEES
(
EMP_NAME CHAR(50) COMPRESS USING ALGO_NAME DECOMPRESS USING REV_ALGO_NAME,
EMP_LAST_DATE DATE,
EMP_DEPT CHAR(30)
)
PRIMARY INDEX (EMP_NAME);
Block Level Compression (BLC): This type of compression is used to compress data at the block or table level,
not at the column level. Cold data, i.e. data which is not accessed frequently, is ideal for compression using BLC.
BLC is a very resource-intensive process and may take some time for compression and decompression. However, the
space saving which can be achieved using this method is phenomenal.
Turn BLC ON
SET QUERY_BAND = 'BLOCKCOMPRESSION=YES;' FOR SESSION;
Insert into an empty table
INSERT INTO EMPLOYEE_BKP SELECT * FROM EMPLOYEE;
Turn BLC OFF
SET QUERY_BAND = 'BLOCKCOMPRESSION=NO;' FOR SESSION;
Advantages
Allows more rows per block
Reduces the number of I/Os
Implemented at the column level
Beneficial for I/O-intensive workloads
Improvement gained through the more-rows-per-block concept is significant in the Full Table Scan operations.
Compression is transparent to applications.
Performance Tuning
Explain
The Explain facility provides an English-like translation of the plan the SQL optimizer develops to service a request.
The execution cost and row count estimates depend upon the statistics.
The Teradata optimizer is a cost-based optimizer: it looks for the lowest-cost plan. It does not store the plan but
dynamically generates it.
As data demographics change, so may the plan.
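Usage is simply prefixing the request with the keyword (table name hypothetical):

EXPLAIN
SELECT * FROM Employee WHERE dept_no = 100;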
Join Preparation:
Redistribution is needed as join steps are done by the AMPs holding the rows to be joined
You will see something like the following in the explain output for sorting:
sort to order by hash code, sort to order by row hash, sort to partition by row key, etc.
Row retrieval strategy:
You will see something like the following in the explain output for row retrieval:
by way of all-row scan, by way of row-hash match scan, by way of the primary index, by way of hash value, etc.
Join Type:
Finally, if the operation is a join operation, the explain output will tell you exactly which kind of join strategy was
chosen: using a product join, using a single partition hash join, using a merge join, using a rowkey based merge join
etc.
Confidence level:
HIGH CONFIDENCE: Statistics are available on an index or column.

JOIN INDEX CONFIDENCE: The join is based on a unique index.

LOW CONFIDENCE: Statistics are not collected, but the WHERE condition is on an indexed column, so estimates
can be based on random sampling of the index; or statistics are available but are combined with AND/OR
conditions on non-indexed columns.
NO CONFIDENCE: Statistics are not collected and the condition is on a non-indexed column. Estimates are based on
a random AMP sampling of row counts.
Low and No confidence indicate a need to collect statistics on the indexes or columns involved in the restricting
conditions.
Difference between GROUP BY and DISTINCT
DISTINCT
1. It reads each row on the AMP
2. Hashes the column value identified in the DISTINCT clause of the SELECT statement
3. Then redistributes the rows according to the hash value to the appropriate AMP
4. Once redistribution is completed, it sorts the data to group duplicates on each AMP, removes all the duplicates
on each AMP and sends the original/unique values
P.S.: There are cases when DISTINCT fails with "Error: 2646 No more Spool Space". In such cases try using GROUP BY.
GROUP BY
1. It reads all the rows that are part of the GROUP BY
2. It removes all duplicates on each AMP for the given set of values using the "BUCKETS" concept
3. Hashes the unique values on each AMP
4. Then redistributes them to the appropriate AMPs
5. Once redistribution is completed, it sorts the data to group duplicates on each AMP, removes all the duplicates
on each AMP and sends the original/unique values
Hence it is better to go for:
GROUP BY - when there are many duplicates
DISTINCT - when there are few or no duplicates
GROUP BY - when SPOOL space is exceeded
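Both forms return the same result set; only the internal strategy differs (table name assumed):

SELECT DISTINCT dept_no FROM Employee; -- redistributes first, then removes duplicates
SELECT dept_no FROM Employee GROUP BY 1; -- removes duplicates locally first, then redistributes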

To include the stats collection recommendations in the explain plan:

DIAGNOSTIC HELPSTATS ON FOR SESSION;
At the end of the explain text, the recommended statistics for collection appear as follows:
/*BEGIN RECOMMENDED STATS ->
16) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_SIZE". (HighConf)
17) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_CODE". (HighConf)
18) "COLLECT STATISTICS ADW.PRODUCT COLUMN P_DESC". (HighConf) */
If you want explain to stop showing recommendations for collection of stats, then use the following
DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;
Diagnostic HELPSTATS has some drawbacks:

It does not give any indication of stale stats.

Stats recommended by diagnostic HELPSTATS should be chosen carefully.

Care should be taken, as too many stats on a given table can impact batch script runtimes and
increase the overhead of stats maintenance.

If recommended stats don't show any improvement in performance, DROP them!


Explain terminology                       | Meaning
We do a SMS                               | combining rows using unions
BMSMS                                     | NUSI bitmap operation
Two AMP retrieve                          | selected based on USI
enhanced by dynamic partition elimination | product join partition elimination
Row key based                             | hash join by partition

DBQL (Database Query Log)

DBQL captures important information about queries that run on your system. With DBQL, you can find out everything
from who uses the most CPU and when, to which step in a particular query was skewed and how much CPU each step
used, and tons more. This information is critical in order to know what is going on with your system, and is even
more important for upgrade situations.
There are several parameters that can help us in understanding SQL Query Performance in Teradata. AMPCPUTime,
TotalIOCount, SpoolUsage are three main parameters to determine SQL Query performance. Provide proper privileges
to your administrative user for query logging. Determine what type of information you want to collect.
DBQL tables include:
DBC.DBQLogTbl (default table, core performance data of the query)
DBC.DBQLSqlTbl (full SQL text of the query)
DBC.DBQLObjTbl (objects accessed by the query)
DBC.DBQLStepTbl (step processing performance by the query)
DBC.DBQLExplainTbl (explain text of the query)
DBC.DBQLSummaryTbl (summary construct typically for tactical queries)
Example:
SET QUERY_BAND = 'Version=1;' FOR SESSION;
SELECT
AMPCPUTIME,
(FIRSTRESPTIME-STARTTIME DAY(2) TO SECOND(6)) RUNTIME,
SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
CAST(100-((AMPCPUTIME/(HASHAMP()+1))*100/NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
DBC.DBQLOGTBL
WHERE
QUERYBAND = 'Version=1;';
The above query gives you detailed insight into how good or bad each step of your query is:
The total CPU Usage
The Spool Space needed
The LHR (ratio between CPU and IO usage)
The CPU Skew
The Skew Impact on the CPU
The goal is to reduce total CPU usage, consumed spool space and the skew impact on the CPU. Further, the LHR is
optimally around 1.00.

You can add or remove columns per your requirements. However, the ones listed above are important parameters for
determining any query's performance in Teradata. If AMPCPUTIME is high, you have to tune your query to make
sure it performs well.
Three points to consider while running the above mentioned query:

You may not see results immediately after running your SQL queries. There is a few minutes' delay before query
information arrives in the DBQL tables.

The above mentioned query may take some time to give output. The reason behind this is the not-so-suitable
index columns of these tables. When we check the PRIMARY INDEX columns for both tables, we
observe that the PI is the same: both tables have ProcID, CollectTimeStamp as PI. However, the value of
CollectTimeStamp can be different for the same query in both tables, hence joining on the basis of the second
column is not advisable. Therefore, you cannot leverage the PI completely here, and the query may take some
time to give results.

To get the SessionID, just run the SEL SESSION; command in the same session in which you are running your
queries. So from now on, never say that the query which took the maximum time is the worst. Fetch the query's DBQL
stats and check for the worst query yourself.
DBQL Views
DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement,
CPU, IO, and other fields associated with a particular query.
DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length it is split across
multiple rows, which is denoted by a column in this table. If you join this to the main Query Log table, care must be
taken if you are aggregating any metrics from the Query Log table, although more often than not, if you are joining the
Query Log table to the SQL table you are not doing any aggregation.
DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables,
columns, and indexes referenced by a particular query. These tables can be joined together in DBC via QueryID and
ProcID.
SET QUERY_BAND = 'PROJECT=TeraTuningBlog;TASK=QB_example;' FOR SESSION;

SELECT queryband, NumResultRows, NumSteps, TotalIOCount, AMPCPUTime, ParserCPUTime, NumOfActiveAMPs,
MaxCPUAmpNumber, MinAmpIO, MaxAmpIO, MaxIOAmpNumber, SpoolUsage
FROM dbc.dbqlogtbl
WHERE TRIM(queryband) LIKE '%QUERY1=%' AND queryText LIKE '%SELECT%';
Viewpoint

The Teradata Viewpoint Workload Designer portlet lets users define Active System Management rules (such as
filters and throttles) according to which workload is managed.
Provides systems management via a web browser. Provides a single operational view for multiple systems. Highly
customizable and can be personalized.
Teradata Viewpoint portlets are the replacement for Teradata Manager and PMON.
Teradata Viewpoint provides System Overview, Workload Management, Session Management, Utilities,
Application, Node Overview and Trends views.
Teradata Viewpoint shows the session ID, user ID, runtime, expected row count, spool space occupied and
approximate completion time of a query. Viewpoint shows the details of active sessions only.

Statistics
COLLECT STATISTICS scans columns and indexes of a table and records the demographics of the data.
COLLECT STATISTICS is used to provide the Teradata Optimizer with as much information on the data as possible. The
Optimizer uses this information to determine how many rows exist and which rows qualify for given values.
Collecting statistics can improve the execution of SQL: the optimizer has more details about each column or
index, and can therefore determine a better join plan for resolving the query.
Collect stats derives the data demographics of the table. These demographics are useful for the optimizer when
deciding the execution plan of a given query, which in turn improves performance. It collects information like:

the total row count of the table,

how many distinct values there are in the column,

how many rows per value, whether the column is indexed,

and if so, unique or non-unique, etc.


Features

Teradata uses a cost-based optimizer, and cost estimates are done based on statistics. So if you don't have
statistics collected, the optimizer will use a Dynamic AMP Sampling method to get the stats. If your table is big
and the data is unevenly distributed, dynamic sampling may not get the right information and your performance
will suffer.

Collected statistics are stored in the DBC.TVFields or DBC.Indexes tables. However, these two tables cannot be
queried.

Run the HELP STATISTICS command on the table,

e.g. HELP STATISTICS TABLE_NAME ;
This will give you the date and time when stats were last collected. You will also see the stats for the columns (for
which stats were defined) on the table.

Typically, re-collect stats when roughly 10% of the data has changed (measured by the delta in perm space since stats
were last collected).

Recollect stats that have aged 60-90 days (say, the last time stats were collected was 2 months ago).

Collecting stats can be pretty resource consuming for large tables, so it is always advisable to schedule the job
at an off-peak period.

The optimizer will prefer a FTS over a NUSI when there are no statistics defined on the NUSI columns.
Here are some excellent guidelines on when to collect statistics:

All Non-Unique indices

Non-index join columns

The Primary Index of small tables

Primary Index of a Join Index

Secondary Indices defined on any join index

Join index columns, and any additional columns, that frequently appear in
WHERE search conditions

Columns that frequently appear in WHERE search conditions or in the WHERE clause of joins.
Statistics are especially informative if index values are distributed unevenly.
When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether
indexing or a full search of all table rows is more efficient.

If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or
building a bitmap would be the most efficient method of qualifying the data rows.
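Typical collection statements following these guidelines might look like this (object names assumed):

COLLECT STATISTICS ON Employee COLUMN (dept_no); -- NUSI / join column
COLLECT STATISTICS ON Employee INDEX (emp_no); -- index statistics
COLLECT STATISTICS ON Employee COLUMN (last_name, first_name); -- multi-column WHERE combination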
Without COLLECT STATISTICS the Optimizer assumes:

Non-unique indexes are highly non-unique. (Lots of rows per value).

Non-Index columns are even more non-unique than non-unique indexes. (Lots of rows per value)

Teradata derives row counts from a random AMP sample for: small tables (less than 1000 rows per
AMP) and unevenly distributed tables (skewed row distribution due to the PI).

Random AMP sample: look at data from one AMP of the table, and from this estimate the total rows in the table.
Random AMP samples may not represent the true total number of rows in the table because the rows in the
table may not be distributed evenly; this occurs often with small tables. As of 9/2000, per table, the random
AMP sample uses the same AMP for each SQL query.
Hints:
The columns that are part of a join should be of the same data type (CHAR, INTEGER, ...). Why?
When joining columns from two tables, the optimizer makes sure the data types are the same, or else it will translate the
column of the driving table to match that of the derived table.
Do not use functions like SUBSTR, COALESCE, CASE ... on the indices used as part of a join. Why?
They add to the cost factor, resulting in performance issues, and the optimizer will not be able to use statistics on
columns that are wrapped in functions.
Use NOT NULL wherever possible!
The reason is that all the NULL values may get sorted to one poor AMP, resulting in the infamous "NO SPOOL SPACE"
error as that AMP cannot accommodate any more NULL values.
Optimization Rules
Ensure completeness and correctness of Teradata Statistics: Use DIAGNOSTIC HELPSTATS ON FOR SESSION
and EXPLAIN your SQL statement. At the end of the explain output a list of statistics will be added which the optimizer
would consider helpful in creating a better execution plan. Add them one by one and re-check the execution plan
The Primary Index (PI) Choice: Use primary indexes for joins whenever possible, and specify in the where clause
all the columns for the primary indexes. Joining on the complete set of primary index columns is the least resource
intense join possibility.
Teradata Indexing Techniques: Using Teradata Indexing Techniques may be another option to improve your SQL
statement. For example, secondary indexes could be especially helpful if you have highly selective WHERE conditions.
You could try as well join indexes or even work with partitioning. Whenever working with indexing techniques you
have to keep the overall data warehouse architecture in mind and how your solution fits into this architecture.
Query Rewriting: Many times, query performance can be improved by rewriting the query in a different way.
Examples like using DISTINCT instead of GROUP BY on columns with many different values come to mind.
A UNION could be used to break up a large SQL statement into several smaller ones, which may be executed in parallel.
Real-Time Monitoring: Watch your query running in real time. Observing your query while it is running in Viewpoint
or PMON helps you to find the critical steps of your query. Most performance issues are caused either by query
steps with heavy skewing on the AMPs or by a wrong execution plan caused by stale or missing statistics.
Comparison of Resource Usage: Another very important task in performance optimization is to measure the
resources used before and after the optimization. Plain query run times can be misleading, as they heavily depend on
the current load on the Teradata server and on workload management blocking that you may not even notice.
Here is one example query which only needs the DBC.DBQLOGTBL table. Set a different QUERYBAND for each version
of the query you are running.
Detect Skewing
The PDM (Physical Data Model) is one of the most obvious areas to investigate for skewing problems. A bad Primary
Index choice could cause uneven data distribution and impact query performance.
SELECT
TABLENAME,
SUM(CURRENTPERM) CURRENTPERM,
CAST((100-(AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS DECIMAL(5,2)) AS SKEWFACTOR_PERCENT
FROM DBC.TABLESIZE
WHERE DATABASENAME = 'the_database'
GROUP BY 1
ORDER BY 1;

Detect Teradata skewing by analyzing the joins: During query execution we may have to fight dynamic
skewing caused by the uneven distribution of spool files. The principle of dynamic skewing is simple: whenever a join
takes place, the rows to be joined have to be co-located on the same AMP.
Detect Teradata skewing by analyzing column values: While the join skew described above can be detected
quite easily by analyzing the query and having some common knowledge about the data content, there
exists another hidden skewing risk caused by data demographics: frequent column values in an evenly distributed
table.
Teradata Utilities
Transferring large amounts of data can be done using various Teradata application utilities, which reside on the host
computer (mainframe or workstation), i.e. BTEQ, FastLoad, MultiLoad, TPump and FastExport.

BTEQ (Basic Teradata Query) supports all 4 DMLs: SELECT, INSERT, UPDATE and DELETE. BTEQ also supports
IMPORT/EXPORT protocols.
FastLoad, MultiLoad and TPump transfer data from the host to Teradata.
FastExport is used to export data from Teradata to the host.

BTEQ (Basic Teradata Query)

The BTEQ tool was the original way that SQL was submitted to Teradata. It is a Teradata native utility.
BTEQ can be used to submit SQL in either a batch or interactive environment.
BTEQ outputs a report format, where Queryman outputs data in a format more like a spreadsheet.
BTEQ is also an excellent tool for importing and exporting data.
Placing the semi-colon at the beginning of the next line (followed by another statement) will bundle those statements
together as one transaction.
It enables users on a workstation to easily access one or more Teradata Database systems for ad hoc queries,
report generation, data movement (suitable for small volumes) and database administration.

BTEQ Modes
Record Mode (also called DATA mode): This is set by .EXPORT DATA. This will bring data back as a flat file.
Field Mode (also called REPORT mode): This is set by .EXPORT REPORT. This is the default mode for BTEQ and brings
the data back as if it were a standard SQL SELECT statement. The output of this BTEQ export returns the column
headers for the fields, white space, etc.
Indicator Mode: This is set by .EXPORT INDICDATA. This mode writes the data in data mode, but also provides host
operating systems with the means of recognizing missing or unknown data (NULL) fields. This is important if the data
is to be loaded into another Relational Database System (RDBMS).
DIF Mode: Known as Data Interchange Format, which allows users to export data from Teradata to be directly utilized
by spreadsheet applications like Excel, FoxPro and Lotus.
Return Code | Description
00 | Job completed with no errors.
02 | User alert to log on to the Teradata DBS.
04 | Warning error.
08 | User error.
12 | Severe internal error.
Override Code
.QUIT 15
.EXIT 15
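A common use of the return code override is to abort a script when a statement fails; a minimal sketch with hypothetical logon values:

.LOGON tdpid/tduser,tdpassword
SELECT COUNT(*) FROM dept;
.IF ERRORCODE > 0 THEN .QUIT 15
.QUIT 0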
BTEQ Export: All BTEQ export processes should use the CLOSE option of the EXPORT command and SET RETRY OFF to
ensure that the process aborts immediately upon a DBMS restart. If not, the export will reconnect sessions when
Teradata is available again, retransmitting rows already sent.
BTEQ Import: All BTEQ import processes populating empty tables should be preceded by a delete of that table for
restartability. Import will not automatically reconnect sessions after a Teradata restart; the job must be manually
restarted.
Create the BTEQ Script
To create and edit a BTEQ script we can use an editor on our client workstation. For example, on a UNIX workstation we can
use a text editor.
.SET SESSION TRANSACTION ANSI
.LOGON TDUSER/tdpassword

SELECT
  emp_name        -- Name of employee of Dept table
, department_name -- Name of Department of Dept table
FROM dept;

.QUIT

and save it with the name test.scr
Step 2:
To execute the script, start BTEQ, then enter the following BTEQ command to submit a BTEQ script:
Format
.run file = <bteqscriptname>
Example: .run file=test.scr

Teradata Fast Load

Main use: to load empty tables at high speed.


The target tables must be empty in order to use FastLoad
Supports inserts only - it is not possible to perform updates or deletes in FastLoad
Although FastLoad uses multiple sessions to load the data, only one target table can be processed at a time
The maximum number of concurrent Teradata FastLoad tasks can be adjusted by a system administrator
FastLoad runs in two operating modes: Interactive and Batch
An ERRLIMIT count should be specified in all FastLoad scripts
Duplicate rows will not be loaded
LOGON 127.0.0.1/username,password;
BEGIN LOADING DB.FLOAD_TEST ERRORFILES db1.fload_test_err1, db1.fload_test_err2;
DEFINE
   in_transno   (INTEGER),
   in_transdate (CHAR(10), NULLIF='0000-00-00'),
   in_accno     (INTEGER),
   in_trans_id  (CHAR(10)),
   in_trans_amt (DECIMAL(12,2))
FILE = TestFloadData;
INSERT INTO DB.FLOAD_TEST
VALUES (:in_transno, :in_transdate (FORMAT 'YYYY-MM-DD'), :in_accno, :in_trans_id, :in_trans_amt);
END LOADING;
LOGOFF;
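Assuming the script above is saved as fload_test.fl (a hypothetical file name), it would typically be run from the command line with the standalone FastLoad client, for example:

fastload < fload_test.fl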

Fastload Performance:
CHECKPOINTS: FastLoad provides the capability to issue a checkpoint in the 2nd phase of the FastLoad. This checkpoint
is an increment of rows being loaded into the table. A checkpoint is issued after each increment of rows. If a FastLoad
process gets aborted in the 2nd phase, then the FastLoad can be restarted from the last checkpoint completed.
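For example (a sketch; the interval of 50000 rows is an arbitrary choice), the checkpoint interval is declared on the BEGIN LOADING statement:

BEGIN LOADING DB.FLOAD_TEST
ERRORFILES db1.fload_test_err1, db1.fload_test_err2
CHECKPOINT 50000;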
TABLE UPDATE PROCESS:
The FastLoad table update process is typically composed of 3 steps (a SQL sketch follows the list):
1) Fastload: Reads the UNIX records and loads them into a temporary database table.
2) Delete: Deletes the rows from the permanent table where the primary indexes of the temporary and permanent
tables match.
3) Insert: Inserts the rows from the temporary table into the permanent table.
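Steps 2 and 3 might look like the following (a sketch with hypothetical tables DB.TRANS as the permanent table and DB.TRANS_STG as the temporary one, sharing the primary index column transno), typically run in BTEQ after the FastLoad completes:

DELETE FROM DB.TRANS
WHERE transno IN (SELECT transno FROM DB.TRANS_STG);

INSERT INTO DB.TRANS
SELECT * FROM DB.TRANS_STG;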
Restrictions
No Secondary Indexes are allowed on the Target Table
No Referential Integrity is allowed.
No Triggers are allowed at load time
Duplicate Rows (in Multi-Set Tables) are not supported.
No AMPs may go down (i.e., go offline) while FastLoad is processing
No more than one data type conversion is allowed per column during a FastLoad
Error and Log Tables

Log Table: FastLoad needs a place to record information on its progress during a load. It uses the table called Fastlog
in the SYSADMIN database. This table contains one row for every FastLoad running on the system. In order for your
FastLoad to use this table, you need INSERT, UPDATE and DELETE privileges on that table.
Empty Target Table: We have already mentioned the absolute need for the target table to be empty.
Two Error Tables: Each FastLoad requires two error tables. These are error tables that will only be populated should
errors occur during the load process. These are required by the FastLoad utility, which will automatically create them
for you; all you must do is to name them. The first error table is for any translation errors or constraint violations. For
example, a row with a column containing a wrong data type would be reported to the first error table. The second error
table is for errors caused by duplicate values for Unique Primary Indexes (UPI). FastLoad will load just one
occurrence for every UPI. The other occurrences will be stored in this table. However, if the entire row is a duplicate,
FastLoad counts it but does not store the row.
When CHECKPOINT is requested, it allows FastLoad to resume loading from the first row following the last successful
CHECKPOINT.
Fast Export

FastExport is known for its lightning speed when it comes to exporting vast amounts of data from Teradata
and transferring the data into flat files on either a mainframe or network-attached computer. In addition,
FastExport has the ability to use OUTMOD routines, which provide the user the capability to write, select,
validate, and preprocess the exported data.

A good rule of thumb is that if you have more than half a million rows of data to export to either a flat file
format or with NULL indicators, then FastExport is the best choice to accomplish this task.

FastExport is extremely attractive for exporting data because it takes full advantage of multiple sessions,
which leverages Teradata parallelism. FastExport can also export from multiple tables during a single
operation.
How FastExport Works
When FastExport is invoked, the utility logs onto the Teradata database, retrieves the rows that are specified in the
SELECT statement and puts them into SPOOL. From there, it must build blocks to send back to the client. In
comparison, BTEQ starts sending rows immediately for storage into a file.
If the output data is sorted, FastExport may be required to redistribute the selected data two times across the AMP
processors in order to build the blocks in the correct sequence. Remember, a lot of rows fit into a 64K block and both
the rows and the blocks must be sequenced. While all of this redistribution is occurring, BTEQ continues to send rows,
and FastExport falls behind in the processing. However, when FastExport starts sending the rows back a block at a
time, it quickly overtakes and passes BTEQ's row-at-a-time processing.
The other advantage is that if BTEQ terminates abnormally, all of your rows (which are in SPOOL) are discarded. You
must rerun the BTEQ script from the beginning. However, if FastExport terminates abnormally, all the selected rows are
in worktables and it can continue sending them where it left off. Pretty smart and very fast!
Restrictions

FastExport only supports the SELECT statement.

FastExport EXPORTS data from Teradata.

Choose FastExport over BTEQ when exporting more than half a million rows

FastExport supports multiple SELECT statements and multiple tables in a single run

FastExport supports conditional logic, conditional expressions, arithmetic calculations, and data conversions.

FastExport does NOT support error files or error limits.

FastExport supports user-written routines (INMODs and OUTMODs), which let you select, validate and preprocess the exported data
The Teradata RDBMS will only support a maximum of 15 simultaneous FastLoad, MultiLoad, or FastExport utility jobs.
This maximum value is determined and configured in the DBS Control record. This value can be set from 0 to 15. When
Teradata is initially installed, this value is set at 5. The reason for this limitation is that FastLoad, MultiLoad, and
FastExport all use large blocks to transfer data. If more than 15 simultaneous jobs were supported, a saturation point
could be reached on the availability of resources.
FastExport has two modes: RECORD or INDICATOR. In the mainframe world, only use RECORD mode. In the UNIX or
LAN environment, RECORD mode is the default, but you can use INDICATOR mode if desired. The difference
between the two modes is that INDICATOR mode will set the indicator bits to 1 for column values containing NULLs.
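A minimal FastExport sketch (the log table, target table, output file name and session count are all hypothetical); RECORD mode and TEXT format match the flat-file case described above:

.LOGTABLE db1.fexp_log;
.LOGON 127.0.0.1/username,password;
.BEGIN EXPORT SESSIONS 8;
.EXPORT OUTFILE trans.out
MODE RECORD FORMAT TEXT;
SELECT transno, trans_amt
FROM db1.trans;
.END EXPORT;
.LOGOFF;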

MLoad

Main use: Load, update and delete large tables in Teradata in a bulk mode
Efficient in loading very large tables
Multiple tables can be loaded at a time.
Updates data in a database in a block mode (one physical write can update multiple rows)

Uses table-level locks


Resource consumption: loading at the highest possible throughput
Duplicate rows allowed
Can perform DML Operations on up to five (5) empty or populated target tables at a time

Overview

MultiLoad is faster than BTEQ for updating a populated table. BTEQ updates 1 row at a time, where MultiLoad
updates blocks of rows at a time. When MultiLoad is compared to the FastLoad/delete/insert method, MultiLoad is
faster for volumes above 10,000 records. For volumes less than 10,000 records, the difference is seconds, and is
negligible.

MultiLoad's speed is not affected by the number of rows already in the target table. The speed is affected by
the number of update records, and can be affected by the number of error records written to the error
journals.

MultiLoad delete is faster than the normal DELETE command, since the deletion happens in data blocks of 64K bytes,
whereas the DELETE command deletes data row by row.

Whenever we define a SI, an SI subtable is created on each AMP. For a USI the subtable rows are hash
distributed, and hence the actual data row pointed to by a USI subtable row in one AMP may not be in the same
AMP as the subtable. So the AMPs have to communicate, which is not supported by MultiLoad. For a NUSI the
subtable will store references to only those actual data rows that exist in the same AMP as the subtable; they all
point to the data in their own AMP, hence the AMPs don't need to communicate here. Thus the AMPs work in
parallel with a NUSI, and hence MLoad supports that.

We can load SET and MULTISET tables using MLoad, but when loading into a MULTISET table using MLOAD,
duplicate rows will not be rejected

MultiLoad supports the following five format options: BINARY, FASTLOAD, TEXT, UNFORMAT and VARTEXT
MultiLoad provides two types of operations via modes:
MultiLoad IMPORT mode supports up to twenty (20) INSERTs, UPDATEs or DELETEs on up to five target tables.
For UPDATEs or DELETEs to be successful in IMPORT mode, they must reference the Primary Index in the WHERE
clause.
MultiLoad DELETE mode is used to perform a global (all-AMP) delete on just one table. The reason to use .BEGIN
DELETE MLOAD is that it bypasses the Transient Journal (TJ) and can be RESTARTed if an error causes it to terminate
prior to finishing. When performing in DELETE mode, the DELETE SQL statement cannot reference the Primary Index in
the WHERE clause. This is due to the fact that a primary index access is to a specific AMP; this is a global operation.
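A minimal DELETE-mode sketch (the table name and predicate are hypothetical; trans_date is assumed to be a DATE column); note that the WHERE clause deliberately avoids the Primary Index:

.LOGTABLE db1.del_log;
.LOGON 127.0.0.1/username,password;
.BEGIN DELETE MLOAD TABLES db1.trans_hist;
DELETE FROM db1.trans_hist
WHERE trans_date < DATE '2004-01-01';
.END MLOAD;
.LOGOFF;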
Restrictions:

Unique Secondary Indexes are not supported on a Target Table But unlike FastLoad, it does support the use of
Non-Unique Secondary Indexes (NUSIs) because the index subtable row is on the same AMP as the data row.

Referential Integrity is not supported

Triggers are not supported at load time

No concatenation of input files is allowed

The host will not process aggregates, arithmetic functions or exponentiation

IMPORT tasks require use of the Primary Index (PI) in UPDATE and DELETE conditions.

MultiLoad Utility doesn't support the SELECT statement.


Multiload Phases
Preliminary Phase

Checks that the SQL syntax and MultiLoad commands are valid.

All MultiLoad sessions with Teradata need to be established. The general rule of thumb for the number of
sessions to use on smaller systems is the following: use the number of AMPs plus two more. Of the extra two
sessions, the first is a control session to handle the SQL and logging, and the second is a backup or
alternate for logging.

The final task of the Preliminary Phase is to apply utility locks to the target tables
DML Transaction Phase

Teradata's Parsing Engine (PE) parses the DML and generates a step-by-step plan to execute the request.
Acquisition Phase

With the PE's plan stored on each AMP, MultiLoad is now ready to receive the INPUT data.

MultiLoad now acquires the data in large, unsorted 64K blocks from the host and sends it to the AMPs.

Each receiving AMP hashes each row on the primary index and sends it over the BYNET, but the rows are not
yet inserted into the target table. The AMP puts all of the hashed rows it has received from other AMPs into the
worktables.
Application Phase

The purpose of this phase is to write, or APPLY, the specified changes to both the target tables and NUSI
subtables.

Every hash-sequence sorted block from Phase 3 and each block of the base table is read only once, to reduce
I/O operations and gain speed. Then, all matching rows in the base block are inserted, updated or deleted
before the entire block is written back to disk, one time.
Clean Up Phase

This being the case, all empty error tables, worktables and the log table are dropped. All locks, both Teradata
and MultiLoad, are released.
MLoad also uses 2 error tables (ET and UV), 1 work table and 1 log table:
1. ET TABLE - Data errors: MultiLoad uses the ET table, also called the Acquisition Phase error table, to store data
errors found during the acquisition phase of a MultiLoad import task. It contains constraint violations.
2. UV TABLE - UPI violations: MultiLoad uses the UV table, also called the Application Phase error table, to store data
errors found during the application phase of a MultiLoad import or delete task. It contains Unique Primary Index
violations.
3. WORK TABLE - WT: MLoad loads the selected records into the work table. The worktables are created in a database
using PERM space.
4. LOG TABLE: A log table maintains a record of all checkpoints related to the load job; it is mandatory to
specify a log table in an MLoad job. This table will be useful in case you have a job abort or restart due to any reason.
Mload Options
MARK DUPLICATE INSERT ROWS: This option logs an entry for all duplicate INSERT rows in the UV error table. Use
this when you want to know about the duplicates.

IGNORE DUPLICATE INSERT ROWS: This tells MultiLoad to IGNORE duplicate INSERT rows because you do not
want to see them.

MARK DUPLICATE UPDATE ROWS: This logs the existence of every duplicate UPDATE row.

IGNORE DUPLICATE UPDATE ROWS: This eliminates the listing of duplicate update row errors.
MARK MISSING UPDATE ROWS: This option ensures a listing of data rows that had to be INSERTed since there
was no row to UPDATE.

IGNORE MISSING UPDATE ROWS: This tells MultiLoad NOT to list UPDATE rows as an error. This is a good
option when doing an UPSERT since UPSERT will INSERT a new row.

MARK MISSING DELETE ROWS: This option makes a note in the ET error table that a row to be deleted is
missing.

IGNORE MISSING DELETE ROWS: This option says, "Do not tell me that a row to be deleted is missing."

DO INSERT FOR MISSING UPDATE ROWS: This is required to accomplish an UPSERT. It tells MultiLoad that if
the row to be updated does not exist in the target table, then INSERT the entire row from the data source.
MLOAD CHECKPOINT: MultiLoad will check the Restart Logtable and automatically resume the load process from the
last successful CHECKPOINT before the failure occurred. MultiLoad uses neither the Transient Journal nor rollbacks
during a failure. That is why you must designate a Logtable at the beginning of your script. The default CHECKPOINT
interval is 15 minutes; if you specify CHECKPOINT as 60 or less, minutes are assumed, while a value greater than 60 is
taken as a number of records between checkpoints.
/* Simple MLoad script */
.LOGTABLE SQL01.CDW_Log;
.LOGON TDATA/SQL01,SQL0;
/* Sets up a Logtable and logs on to Teradata */
.BEGIN IMPORT MLOAD TABLES SQL01.Employee_Dept1
WORKTABLES SQL01.CDW_WT
ERRORTABLES SQL01.CDW_ET
            SQL01.CDW_UV;
/* Begins the load process by naming the target table,
   work table and error tables; notice NO comma between
   the error tables */
.LAYOUT FILEIN;
.FIELD Employee_No * CHAR(11);
.FIELD Last_Name * CHAR(20);
.FILLER Junk_stuff * CHAR(100);
.FIELD Dept_No * CHAR(6);
/* Names the LAYOUT of the INPUT record and defines its
   structure; notice the dots before FIELD and FILLER and
   the semi-colons after each definition */
.DML LABEL INSERTS
DO INSERT FOR MISSING UPDATE ROWS; /* names the DML label; the DO INSERT clause is optional */
INSERT INTO SQL01.Employee_Dept1
(Employee_No
,Last_Name
,Dept_No)
VALUES
(:Employee_No
,:Last_Name
,:Dept_No);
/* Tells MultiLoad to INSERT a row into the target table and
   defines the row format; lists, in order, the VALUES (each
   one preceded by a colon) to be INSERTed */
.IMPORT INFILE CDW_Join_Export.txt
FORMAT TEXT
LAYOUT FILEIN
APPLY INSERTS;
/* Names the import file and its format type; cites the LAYOUT
   to use and tells MLoad to APPLY the INSERTs */
.END MLOAD;
.LOGOFF;
/* Ends MultiLoad and logs off all sessions */

Teradata Parallel Data Pump (TPump)

Main use: to load or update a small amount of target table rows

Sends data to the database statement by statement, which is much slower than using bulk mode

TPump does NOT move data in large blocks. Instead, it loads data one row at a time, using row hash locks

Resource consumption: loading speed can be adjusted using a built-in resource consumption management
utility. The throughput can be turned down in peak periods.

TPump does not support MULTI-SET tables.

Can accomplish near real-time updates from source systems into the Teradata data warehouse.

Throttle-switch capability: you can throttle the rate of updates up and down

FastLoad can only load one table and MultiLoad can load up to five tables. But, when it pulls data from a single
source, TPump can load more than 60 tables at a time! And the number of concurrent instances in such
situations is unlimited

TPump allows both Unique and Non-Unique Secondary Indexes (USIs and NUSIs)
Following are the limitations of Teradata TPUMP Utility:

Use of SELECT statement is not allowed.

Concatenation of Data Files is not supported.

TPump will not process aggregates, arithmetic functions or exponentiation.

No more than four IMPORT commands may be used in a single load task

Dates before 1900 or after 1999 must be represented by the yyyy format for the year portion of the date, not
the default format of yy.

On some network-attached systems, the maximum file size when using TPump is 2GB.

TPump performance will be diminished if Access Logging is used


TPUMP allows near real-time updates from transactional systems into the Data Warehouse.
It can perform Insert, Update and Delete operations, or a combination, from the same source.
It can be used as an alternative to MLOAD for low-volume batch maintenance of large databases.
TPUMP allows target tables to have Secondary Indexes, Join Indexes, Hash Indexes, Referential Integrity,
Populated or Empty Tables, Multiset or Set Tables, or Triggers defined on the tables.
TPUMP can have many sessions, as it doesn't have a session limit.

TPUMP uses row hash locks thus allowing concurrent updates on the same table
TPump uses only one Error Table per target table, not two. If you name the table, TPump will create it automatically.
Entries are made to this table whenever errors occur during the load process. Like MultiLoad, TPump offers the option
to either MARK errors (include them in the error table) or IGNORE errors. The default is to MARK; when doing an
UPSERT, this default does not apply. It is the errors that occur when the data is being moved, such as data translation
problems, that TPump will report. The error table stores a portion of the actual offending row for debugging.
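A minimal TPump sketch (the table, input file and log/error table names are hypothetical; SESSIONS, PACK (statements packed per request) and RATE (statements per minute) are illustrative throttle settings):

.LOGTABLE db1.tpump_log;
.LOGON 127.0.0.1/username,password;
.BEGIN LOAD SESSIONS 4 PACK 20 RATE 1000
ERRORTABLE db1.emp_err;
.LAYOUT FILEIN;
.FIELD Emp_No * CHAR(11);
.FIELD Last_Name * CHAR(20);
.DML LABEL INSERTS;
INSERT INTO db1.emp (Emp_No, Last_Name)
VALUES (:Emp_No, :Last_Name);
.IMPORT INFILE emp.dat
FORMAT TEXT
LAYOUT FILEIN
APPLY INSERTS;
.END LOAD;
.LOGOFF;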
