
What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of reporting,
analysis and data mining or knowledge discovery. Beyond that, a data warehouse can also be used for
data integration, master data management, etc.
A data warehouse is a system that retrieves and consolidates data periodically from the
source systems into a dimensional or normalized data store. It usually keeps years of
history and is queried for business intelligence or other analytical activities. It is typically
updated in batches, not every time a transaction happens in the source system.

What are the benefits of a data warehouse?
A data warehouse helps to integrate data (see Data integration) and store it historically so that we can
analyze different aspects of the business, including performance analysis, trends and predictions, over a given
time frame, and use the results of our analysis to improve the efficiency of business processes.
Difference between data warehousing and business intelligence?
Les Barbusinski's answer: Data warehousing deals with all aspects of managing the development, implementation and
operation of a data warehouse or data mart including meta data management, data acquisition, data cleansing, data
transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security
management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an
organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency,
effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and
exceptions, etc. Typically, the term business intelligence is used to encompass OLAP, data visualization, data mining and
query/reporting tools.

What is Data Mining?
Data mining is the process of exploring data to find the patterns and relationships that describe the data
and to predict the unknown or future values of the data. The key value of data mining is the ability to
understand why some things happened in the past and the ability to predict what will happen in the
future. Popular applications of data mining include fraud detection (e.g. in the credit card industry).

What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis
system built on that data.
OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other
hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.
What is ER model?
ER model or entity-relationship model is a particular methodology of data modeling wherein the goal of
modeling is to normalize the data by reducing redundancy. ER modeling is used for normalizing the
OLTP database design.
This is different than dimensional modeling where the main goal is to improve the data retrieval
mechanism.
Reasons for Normalising a Data Warehouse
First, no data redundancy.
Second, to keep the data warehouse real time. Because there is only one place to update (or insert), the update
would be quick and efficient.
Third, to manage master data. The idea is that rather than having MDM as a separate system, the
master tables in the normalised warehouse become the master store.
Fourth, to enable the enterprise to maintain consistency between multiple dimensional data marts.
Fifth, to make data integration easier. Because in a normalised DW each data item is located in only
one place, it is easier to update the target. Also, since the source systems are usually normalised, it is
easier to map them to a normalised DW because both the source and the target are normalised.

Advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes in the
requirements with minimal changes to the data model, b) performance, e.g. you can query
it faster than a normalised model, c) it is quicker and simpler to develop than a normalised DW
and easier to maintain.


What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store transactional
measurements and the foreign keys from the dimension tables that qualify the data. The goal of
a dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data
retrieval.
What is dimension?
A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say 20 kg, it does not mean anything. But if I say "20 kg of rice
(product) was sold to Ramesh (customer) on 5th April (date)", then that gives a meaningful sense.
Product, customer and date are dimensions that qualify the measure, 20 kg.
Dimensions are mutually independent. Technically speaking, a dimension is a data element that
categorizes each item in a data set into non-overlapping regions.
What is Fact?
A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical
values that can be aggregated.
What is Star-schema?
This schema is used in data warehouse models where one centralized fact table references a number of
dimension tables, so that the primary keys from all the dimension tables flow into the fact table (as
foreign keys) alongside the measures. The entity-relationship diagram looks like a star, hence the
name. This schema is de-normalized and results in simpler joins and less complex queries, as well as
faster results.
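As a rough illustration, a minimal star schema for a sales process might be declared as below. The table and column names are made up for the example, not taken from any particular system.

CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date DATE);

-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product (product_key),
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    date_key     INTEGER REFERENCES dim_date (date_key),
    sales_qty    DECIMAL(18,2)    -- the measure
);

A typical query joins the fact table to one or more dimensions and aggregates the measure, e.g. total sales_qty per product_name.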


What is snow-flake schema?
This is another logical arrangement of tables in dimensional modeling where a centralized fact table
references a number of dimension tables; however, those dimension tables are further normalized
into multiple related tables.
Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales
quantity will be the measure here, and keys from the customer, product and time dimension tables will flow
into the fact table. Additionally, all the products can be further grouped under different product families
stored in a different table, so that the primary key of the product family table also goes into the product table as
a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked
into product family.
This schema is normalized and results in more complex joins and more complex queries, as well as slower
results.

Note
Snow-flaking increases the degree of normalization in the design.
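Continuing the illustrative sales example above, snow-flaking the product dimension into a separate product family table might look roughly like this (again, names are made up for illustration only):

CREATE TABLE dim_product_family (
    product_family_key  INTEGER PRIMARY KEY,
    product_family_name VARCHAR(100)
);

-- The product dimension now references the product family table instead of
-- carrying the family attributes itself (this replaces the flat dim_product above).
CREATE TABLE dim_product (
    product_key        INTEGER PRIMARY KEY,
    product_name       VARCHAR(100),
    product_family_key INTEGER REFERENCES dim_product_family (product_family_key)
);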

What is snow flaking? What are the advantages and disadvantages? {M}
Answer: In dimensional modelling, snow flaking is breaking a dimension into several tables
by normalising it. The advantages are: a) performance when processing dimensions in
SSAS, b) flexibility if the sub dimension is used in several places, e.g. city is used in dim customer
and dim supplier (or in an insurance DW: dim policy holder and dim broker), c) one place to
update, and d) the DW load is quicker as there is less duplication of data. The
disadvantages are: a) it is more difficult to navigate the star*, i.e. you need to join a few tables, b)
worse "sum group by"* query performance (compared to a pure star*), c) less flexibility in
accommodating requirements, i.e. with a shared snow-flaked city table the city attributes for dim supplier have to be the
same as the city attributes for dim customer, and d) the DW load is more complex, as you have to
integrate the city data into one shared table.
*: a "star" is a fact table with all its dimensions; "navigating" means browsing/querying;
"sum group by" is a SQL select statement with a group by clause; a "pure star" is a fact table
with all its dimensions where none of the dims are snow-flaked.

The Main Weakness of Snowflake Schemas
inability to store the history of attributes in sub-dimension tables
http://dwbi1.wordpress.com/2012/07/16/the-main-weakness-of-snowflake-schemas/#comments
Delete All Rows in the Dimension Table
The words "delete" and "dim table" should not be in the same sentence!
Say it's a Member dimension. In the source system we have a Membership table, like this:

And in the data warehouse we have the Member dimension, like this:

There are 2 types of changes happening in the source Membership table:
Change to an existing row
New row created
Let's make these 2 types of changes: change Ayane's name to Eyane and create a new member, Agus Salim, like
this:

When we truncate the Member dimension and reload all rows, this is what we get:

All members get new surrogate keys.
G48's surrogate key changed from 1 to 3.
G49's surrogate key changed from 2 to 4.
That is the issue about "And they are having problems because the existing rows get new surrogate keys" that I
mentioned at the beginning of this article.
The problem with G48's SK changing from 1 to 3 is that the fact table row for G48 is still referring to SK = 1. Now the
fact table SKs don't match the dim table SKs, causing issues when you join them.
As I said above, we should not delete from the dim table. Instead, we should update changed rows, and insert new
rows. After the update, the dim table should look like this:

So once again, the words "delete" and "dim table" should not be in the same sentence!
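A minimal sketch of that update-and-insert approach, using a SQL Server style MERGE; dim_member (assumed to have an identity surrogate key) and stg_membership are illustrative names, not the exact tables from the post:

-- Update changed rows in place (Type 1 on the name), keeping existing surrogate keys,
-- and insert rows that do not exist yet; the new surrogate keys come from the
-- identity column on dim_member (assumed).
MERGE dim_member AS d
USING stg_membership AS s
    ON d.member_bk = s.member_bk
WHEN MATCHED AND d.member_name <> s.member_name THEN
    UPDATE SET member_name = s.member_name
WHEN NOT MATCHED BY TARGET THEN
    INSERT (member_bk, member_name)
    VALUES (s.member_bk, s.member_name);

Because existing rows are updated in place, G48 keeps surrogate key 1 and G49 keeps 2, so the fact table rows still join correctly.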


What is the difference between OLAP and data warehouse?
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of
analyzing the data, managing aggregations, and partitioning information into cubes for in-depth
visualization.
What is ODS?
Operational Data Store. A database structure that is a repository for near real-time operational data
rather than long-term data.

ODS: an Operational Data Store, which contains data. The ODS comes after the staging area.
E.g. let's consider that we have day-level granularity in the OLTP and year-level
granularity in the data warehouse.
If the business (manager) asks for week-level granularity, then we would have to go to the OLTP and
summarize the day level to the week level, which would be painstaking. So what we do is
maintain week-level granularity in the ODS for the data, for about 30 to 90 days.
Note: ODS information contains cleansed data only, i.e. after the staging area.


Difference between a ODS and Staging Area
In general data warehousing scenarios, there is a staging area first, which extracts the data
dump from the source systems. Depending on the need and requirements, there will be an ODS in
the second stage and then a data mart.


Scenario Where ODS comes before staging area
The context here is that the staging area takes the ODS as one of its source systems. In this way the ETL can benefit from the
data integration, sanitization and transformation which the ODS might already have been doing.
The ODS should be mature and stable, and as good as being part of your production system.

Staging Area:
It comes after the extract has finished; it holds the extracted data before it is transformed and loaded. The staging area consists of:
1. Metadata.
2. The work area where we apply our complex business rules.
3. A place to hold the data and do calculations.
In other words, we can say that it is a temporary work area.

Why do you need a staging area? {M}
Answer: Because:
a) Some data transformations/manipulations from the source system to the DWH can't be done on the fly, but
require several stages and therefore need to be landed on disk first.
b) The time to extract data from the source system is limited (e.g. we were only given a 1 hour window), so
we just get everything we need out first and process it later.
c) For traceability and consistency, i.e. some data transforms are simple and some are complex, but for
consistency we put all of them on stage first, then pick them up from stage for further processing.
d) Some data is required by more than 1 part of the warehouse (e.g. ODS and DDS) and we want to
minimise the impact on the source system's workload. So rather than reading twice from the source
system, we land the data on the staging area and then both the ODS and the DDS read the data from staging.

What is data modeling?
A Data model is a conceptual representation of data structures (tables) required for a database and is
very powerful in expressing and communicating the business requirements.

Tell us something about data modeling tools?
Data modeling tools transform business requirements into a logical data model, and a logical data model into a
physical data model. From the physical data model, these tools can be instructed to generate the SQL code for
creating the database entities.


Difference between Views and Materialized Views in Oracle?
Views evaluate the data in the tables underlying the view definition at the time the view is queried. It
is a logical view of your tables, with no data stored anywhere else. The upside of a view is that it will
always return the latest data to you. The downside of a view is that its performance depends on how
good a select statement the view is based on. If the select statement used by the view joins many
tables, or uses joins based on non-indexed columns, the view could perform poorly.
Materialized views are similar to regular views, in that they are a logical view of your data (based on
a select statement), however, the underlying query resultset has been saved to a table. The upside
of this is that when you query a materialized view, you are querying a table, which may also be
indexed. In addition, because all the joins have been resolved at materialized view refresh time, you
pay the price of the join once (or as often as you refresh your materialized view), rather than each
time you select from the materialized view. In addition, with query rewrite enabled, Oracle can
optimize a query that selects from the source of your materialized view in such a way that it instead
reads from your materialized view. In situations where you create materialized views as forms of
aggregate tables, or as copies of frequently executed queries, this can greatly speed up the
response time of your end user application. The downside though is that the data you get back from
the materialized view is only as up to date as the last time the materialized view has been refreshed.
Materialized views can be set to refresh manually, on a set schedule, or based on the database
detecting a change in data from one of the underlying tables. Materialized views can be
incrementally updated by combining them with materialized view logs, which act as change data
capture sources on the underlying tables.
Materialized views are most often used in data warehousing / business intelligence
applications where querying large fact tables with thousands of millions of rows would result in
query response times that resulted in an unusable application.
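For illustration, an aggregate materialized view in Oracle might be created like this. The object names (mv_sales_by_product, fact_sales, dim_product) are made up; a fast/incremental refresh would additionally require materialized view logs, as mentioned above.

CREATE MATERIALIZED VIEW mv_sales_by_product
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT p.product_name,
       SUM(s.sales_qty) AS total_qty
FROM   fact_sales s
JOIN   dim_product p ON p.product_key = s.product_key
GROUP BY p.product_name;

With query rewrite enabled, a query that aggregates sales_qty by product against the base tables can be answered from the materialized view instead.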

Recognizing Change
USE THE HASHROW FUNCTION TO QUICKLY IDENTIFY ALTERED DATA.

THREE TYPES OF SCDS
SCDs have been broadly classified into three types:
TYPE 1. Old data is deleted and the new data is inserted. This method is used where the data tracking history is
not required, historical data is inaccurate, and the latest loaded data is always valid and accurate.
TYPE 2. The history of data is tracked. This is typically done by creating a new surrogate key for the changed data
record or by using effective date ranges. The current record is identified by the max surrogate for the business key. A
flag is assigned for the current dimension record, or the record with the highest expiration date is picked. The Type 2
option provides more business value than the other two options and is widely used in data warehouses and data
marts since reports can be run with AS-IS and AS-WAS scenarios. This is the most common SCD.
TYPE 3. The data is versioned. Additional columns are created for tracking the column changes (current and
previous column value). This type is best used in situations where a limited history of data must be tracked (e.g., only
the current and previous value is important).
LOADING DATA
When the complete set of source data is provided, identifying and updating the columns that have changed is a
challenge that lies on the extract, transform and load (ETL) side. A typical approach is to identify the changed
columns and rows, inside or outside the database, and load the data into tables.
However, Teradata's HASHROW function provides a simpler, faster method of identifying this data. It matches the
hash row of the columns between the stage and the target, inside the database. In addition, it identifies the rows that
have changed. If the columns and rows do not match, they are candidates for updating. In that case, the previous
record is closed and a new record is created.
LOADING TYPE 1
To load Type 1 to a table, issue a DELETE FROM <target_table> ALL statement to delete the old data, then use an
INSERT INTO <target_table> SELECT <columns> FROM <stage_table> to add the new data. The processing
time is very fast. Data is not journaled, because it is being loaded into an empty table. It is recommended to have the
same primary indexes (PIs) on the staging table as the target table for faster data transfer.
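Using the staging and target tables from the sample SQL further below (SKU_STG / SKU_TGT), a Type 1 full refresh is essentially the following sketch, assuming the two tables have the same column layout:

DELETE FROM TGT_DB.SKU_TGT ALL;

INSERT INTO TGT_DB.SKU_TGT
SELECT * FROM STG_DB.SKU_STG;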
LOADING TYPE 2
Multiple strategies can be adopted to identify changed data.
One strategy involves the source data provider marking the changed data with appropriate flags for data processing.
Since only a small portion of changed data is loaded, the processing time is short, but the burden of process and accuracy
lies with the data provider. Because only the changed data is provided, it reduces the ETL window time. Another strategy is to have a
complete refresh of the data provided. Other approaches:
If the data provided is a flat file, the data is compared with the previous file. By comparing the two, the changed rows
are identified using utilities such as file diff.
The checksum between the previous and current rows is compared.
The ETL vendor best practices are used to handle SCD (by creating various mapping for identifying the changes,
caching the target table data, comparing data, identifying the process to be used, loading the data, etc.).
The data is loaded into the staging area, and the comparison operation is performed inside the database.
Performing checks (column comparisons to determine whether changes have been made) on large data inside the
Teradata Database, rather than outside the database, is faster because the database is optimized for these
operations. The HASHROW function provides a simple, fast method of identifying changed data. The columns in the
HASHROW function are checked for any changes. If the data types are different, then cast the columns to the
appropriate data types. Always keep the PI of the stage and target tables the same for faster performance.
This is a sample SQL for identifying changed data:
SELECT A.SKU_ID,
       CASE
           WHEN B.SKU_ID IS NULL THEN 'INSERT'
           WHEN HASHROW(A.SKU_QTY, A.SKU_DESC, A.SKU_COLR, A.SKU_CRE_DTE) =
                HASHROW(B.SKU_QTY, B.SKU_DESC, B.SKU_COLR, B.SKU_CRE_DTE) THEN 'IGNORE'
           ELSE 'UPDATE'
       END AS PROCESS_TYPE
FROM STG_DB.SKU_STG A
LEFT OUTER JOIN TGT_DB.SKU_TGT B
    ON A.SKU_ID = B.SKU_ID /* join on key columns */
ORDER BY A.SKU_ID, PROCESS_TYPE;
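As a sketch only, the PROCESS_TYPE logic above could be applied to a Type 2 target by closing the changed rows and inserting the new versions. The effective-date columns (EFF_START_DTE, EFF_END_DTE) are assumed here; they are not part of the article's sample tables.

/* Close the current version of rows whose hash no longer matches the stage. */
UPDATE T
FROM TGT_DB.SKU_TGT T, STG_DB.SKU_STG A
SET EFF_END_DTE = CURRENT_DATE
WHERE T.SKU_ID = A.SKU_ID
  AND T.EFF_END_DTE = DATE '9999-12-31'
  AND HASHROW(A.SKU_QTY, A.SKU_DESC, A.SKU_COLR, A.SKU_CRE_DTE)
   <> HASHROW(T.SKU_QTY, T.SKU_DESC, T.SKU_COLR, T.SKU_CRE_DTE);

/* Insert brand-new SKUs and the new versions of changed SKUs. */
INSERT INTO TGT_DB.SKU_TGT
    (SKU_ID, SKU_QTY, SKU_DESC, SKU_COLR, SKU_CRE_DTE, EFF_START_DTE, EFF_END_DTE)
SELECT A.SKU_ID, A.SKU_QTY, A.SKU_DESC, A.SKU_COLR, A.SKU_CRE_DTE,
       CURRENT_DATE, DATE '9999-12-31'
FROM STG_DB.SKU_STG A
LEFT OUTER JOIN TGT_DB.SKU_TGT B
    ON  A.SKU_ID = B.SKU_ID
    AND B.EFF_END_DTE = DATE '9999-12-31'
WHERE B.SKU_ID IS NULL
   OR HASHROW(A.SKU_QTY, A.SKU_DESC, A.SKU_COLR, A.SKU_CRE_DTE)
   <> HASHROW(B.SKU_QTY, B.SKU_DESC, B.SKU_COLR, B.SKU_CRE_DTE);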



LOADING TYPE 3
The HASHROW function can also be used with Type 3 to load tables more quickly. To load them, use the Type 2
methodology to identify the changed columns. When the PROCESS_TYPE is INSERT, insert the data into the
CURRENT_<column>. When PROCESS_TYPE is UPDATE, update the PREVIOUS_<column> value to the
CURRENT_<column> and update the CURRENT_<column> with the new data.
SOLUTION FOR HIGH DATA VOLUMES
Identifying and updating SCD might be a simple task when the number of columns is small. However, on larger
numbers of columns, it can get complicated and decrease performance on higher volumes of data. Teradata's
HASHROW function solves this problem by matching the hash row of the columns between the stage and the target.
This allows for a simpler, faster way to identify and update changing data.
Thanks Ramakrishna for the information. I tried to use the same approach in my project, but I'm facing
an issue with the HASHROW function. In some cases, even though the data has changed, the same hashrow is
generated. Example: consider four columns on which we are calculating the hash. Data1: 1,2,3,4. Data2:
1,2,4,3. Although the data has changed between Data1 and Data2 (3 is changed to 4 and 4 is
changed to 3), the same hashrow value is generated. The HASHROW output in both cases is D8-F0-00-91.
Please help me solve this scenario. Thanks, Shravan
1/30/2013 5:11:48 AM
Anonymous



I know the post is old but thought I would reply for anyone who would be searching in future. You can
use concatenation of columns to avoid issues mentioned in the above post.
10/4/2012 12:37:10 PM
Anonymous



What you can do is append a placeholder value to the columns you want to hash on, to eliminate
synonyms: SELECT x, y, HASHROW('1' || x, '2' || y) FROM Original_Data; SELECT x, y, HASHROW('1' || x, '2' ||
y) FROM Altered_Data; - Mike Ong
6/6/2012 2:19:36 AM
Anonymous

Teradata Temporal Feature
Teradata has been playing an important role in the DB/DW market with its prebuilt solutions for PB-level data volumes. The
temporal feature is available in Teradata version 13.10 onward. To understand this feature, we need to
look at the new field type PERIOD, which represents a span of time. It has a beginning bound (defined by the value of a
beginning element) and an ending bound (defined by the value of an ending element). Beginning and ending
elements can be DATE, TIME, or TIMESTAMP types, but both must be the same
type. To create a table using the temporal feature:
CREATE MULTISET TABLE Employee (
    Surrogate_Key     INTEGER,
    Employee_Num      VARCHAR(50) NOT NULL,
    Employee_Name     VARCHAR(200),
    Cell_Phone_Number VARCHAR(50),
    Effective_Date    PERIOD(DATE) NOT NULL AS VALIDTIME
) PRIMARY INDEX (Surrogate_Key);

To insert a new record into this temporal-feature-enabled table:

INSERT INTO Employee
    (Surrogate_Key, Employee_Num, Employee_Name, Cell_Phone_Number, Effective_Date)
VALUES
    (100, '99999999999', 'Employee', '888-888-888-888',
     PERIOD(DATE '2009-10-01', UNTIL_CHANGED));
In the DW design model, SCD type 2 is quite popular for tracking and analyzing historical information. In most
cases, a business dimension (e.g. employee) is only valid for a specified period and expires at a point in time, so
the Teradata PERIOD data type can exactly satisfy the need.
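To query such a table, Teradata's period functions and predicates can be used; for example, this illustrative query against the table above finds the rows whose validity period contains a given date:

SELECT Employee_Num,
       Employee_Name,
       BEGIN(Effective_Date) AS Effective_From,
       END(Effective_Date)   AS Effective_To
FROM   Employee
WHERE  Effective_Date CONTAINS DATE '2010-01-01';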

What is a hierarchy in data warehouse terms?
A: Hierarchies are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy
might aggregate data from the month level to the quarter level to the year level.

Factless Fact Tables
A factless fact table is a fact table that does not contain any facts (measures). It contains only dimension keys
and captures events that happen at an informational level but are not included in any calculations;
it is just information about an event that happens over a period.
Common examples of factless fact tables include:
Identifying product promotion events (to determine promoted products that didn't sell)
Tracking student attendance or registration events
Tracking insurance-related accident events




What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then
make it 3rd NF. {M}
Answer: No column is transitively dependent on the PK. For example, column2 is dependent on column1
(the PK) and column3 is dependent on column2; in this case column3 is transitively dependent on column1. To
make it 3rd NF we need to split it into 2 tables: table1, which has column1 & column2, and table2, which
has column2 and column3.
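As an illustration (hypothetical tables): dept_name depends on dept_id, which depends on the primary key emp_id, so the first table below is not in 3rd NF; splitting the department attributes out fixes it.

-- Not in 3NF: dept_name is transitively dependent on emp_id via dept_id.
CREATE TABLE employee_flat (
    emp_id    INTEGER PRIMARY KEY,
    emp_name  VARCHAR(100),
    dept_id   INTEGER,
    dept_name VARCHAR(100)
);

-- In 3NF: the transitive dependency is removed by splitting into two tables.
CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name VARCHAR(100),
    dept_id  INTEGER
);

CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name VARCHAR(100)
);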

Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}
Answer: There are many ways, but it should not be too far from this order: 1. Understand the business
process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add
the measures to the fact tables (from Kimball's Toolkit book, chapter 2). Steps 3 and 4 could be reversed
(add the facts first, then create the dims), but steps 1 & 2 must be done in that order. Understanding the
business process must always be first, and declaring the grain must always be second.

How do you join 2 fact tables? {H}
Answer: It's a trap question. You don't usually join 2 fact tables, especially if they have different grains.
When designing a dimensional model, you include all the necessary measures in the same fact table. If
the measure you need is located in another fact table, then there's something wrong with the design.
You need to add that measure to the fact table you are working with. But what if the measure has a
different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact
table you are working with has a lower grain? Then you need to get the business logic for allocating the
measure.
It is possible to join 2 fact tables, i.e. using the common dim keys, but the performance is usually
horrible, hence people don't do this in practice, except for small fact tables (<100k rows). For example: if
FactTable1 has dim1key, dim2key, dim3key and FactTable2 has dim1key and dim2key, then you could
join them like this:
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from
( select dim1key, dim2key, sum(measure1) as measure1
  from FactTable1
  group by dim1key, dim2key
) f1
join FactTable2 f2
  on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

How do we build a real time data warehouse? {H}
Answer: If the candidate asks "Do you mean real time or near real time?" it may indicate that they have a
good amount of experience dealing with this in the past. There are two ways we build a real time data
warehouse (and this is applicable for both a Normalised DW and a Dimensional DW):
a) By storing previous periods' data in the warehouse, then putting a view on top of it pointing to the
source system's current period data. The current period is usually 1 day in a DW, but in some industries, e.g.
online trading and ecommerce, it is 1 hour.
b) By storing previous periods' data in the warehouse, then using some kind of synchronous mechanism to
propagate the current period's data. Examples of synchronous data propagation mechanisms are SQL Server
2008's Change Tracking or the old-school trigger.
A near real time DW is built using an asynchronous data propagation mechanism, aka mini batch (2-5 min
frequency) or micro batch (30 s to 1.5 min frequency).
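Option (a) can be sketched as a simple union view; all object names below are illustrative, not from any particular system:

CREATE VIEW v_sales_all AS
SELECT date_key, product_key, sales_qty
FROM   dw_fact_sales               -- previous periods, loaded in batch into the warehouse
UNION ALL
SELECT date_key, product_key, sales_qty
FROM   src_sales_current_period;   -- current period, read directly from the source system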

What is the purpose of having a multidimensional database? {L}
Answer: Many candidates don't know what a multidimensional database (MDB) is. They have heard
about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying an MDB is an OLAP
database. Many will say "Oh I see" but actually they are still puzzled, so it will take a good few
moments before they are back to earth again. So ask again: "What is the purpose of having an OLAP
database?" The answer is performance and easier data exploration. An MDB (aka a cube) is a hundred
times faster than a relational DB for returning an aggregate. An MDB is also very easy to navigate, drilling
up and down the hierarchies and across attributes, exploring the data.


Fact table columns are usually numeric. In what case does a fact table have a varchar column? {M}
Answer: a degenerate dimension.
Purpose: to check if the candidate has ever been involved in the detailed design of warehouse tables. Follow up
with the next question.
What is a degenerate dimension? Give me an example. {L}
Answer: A dimension which stays in the fact table. It is usually the reference number of the
transaction. For example: transaction ID, payment reference and order ID.

How to execute a data warehouse project
Some people create the data model based on the source systems, but I usually create the data model
based on the reporting/cube requirements, i.e. I identify what data elements are required for the
cubes/reports. These data elements are the ones I create in the dimensional model. Based on that I
specify the requirements for the ETL, i.e. where these data elements should be sourced from. After it
goes live, there are often new reporting/cube requirements for new data elements (new attributes, new
measures), so we add them to the dimensional model and source them in the ETL design. So no, I don't
bring everything from the source systems, but only the data elements required for the
analytics/reporting/dashboards.
As for conceptual model to logical model to physical model, I usually start with a list of entities and how
they are connected to each other. Based on this list of entities I draw the conceptual model at entity
level. Then I identify the attributes and measures for each entity and create the logical dimensional
model. Translating to the physical model is usually an exercise of a) identifying the appropriate data types,
b) identifying appropriate indexes, and c) identifying appropriate partitioning criteria. If there is an ODS or NDS in
the warehouse, I usually design it based on the source systems, but I try to make it truly 3NF (lots of
tables from the source systems are 2NF or 1NF).

Explain what the ETL process is. How many steps does ETL contain? Explain with an example.
ETL is the extract, transform, load process: you extract data from the source, apply the
business rules to it, and then load it into the target.
The steps are:
1. Define the source (create the ODBC connection to the source DB).
2. Define the target (create the ODBC connection to the target DB).
3. Create the mapping (you apply the business rules here by adding transformations, and define how the
data flow will go from the source to the target).
4. Create the session (a set of instructions that runs the mapping).
5. Create the workflow (instructions that run the session).
Explain the various methods of getting incremental records or delta records from the
source systems.
One foolproof method is to maintain a field called 'Last Extraction Date' and then impose a condition in the
code saying 'current_extraction_date > last_extraction_date'.
First method: if there is a column in the source which identifies the record inserted/updated date, then it is easy
to put a filter condition in the source qualifier.
Second method: if there is no column in the source to identify the record inserted date, then we need to do a
target lookup based on the primary key to determine the new records, and then insert them.
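The first method amounts to a filter like the following sketch; src_orders, updated_dt and etl_control are illustrative names, not from any particular system:

SELECT o.*
FROM   src_orders o
WHERE  o.updated_dt > (SELECT last_extraction_date
                       FROM   etl_control
                       WHERE  table_name = 'src_orders');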

How to calculate fact table granularity?
Granularity is the level of detail that the fact table describes. For example, if we are doing time-based
analysis, the granularity may be day-based, month-based or year-based.

FALLBACK or NO FALLBACK option while creating a table (DDL)?
FALLBACK requests that a second copy of each row inserted into the table is kept as a duplicate copy on
another AMP in the same cluster. This way we keep a copy of the data inserted into the table, while
NO FALLBACK does not store any duplicate rows.
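FALLBACK / NO FALLBACK is specified as a table-level option in the CREATE TABLE DDL; for example (table and column names are illustrative):

-- A second copy of each row is kept on another AMP in the same cluster.
CREATE MULTISET TABLE sales_fact, FALLBACK (
    sale_id INTEGER,
    amount  DECIMAL(18,2)
) PRIMARY INDEX (sale_id);

-- No second copy is kept.
CREATE MULTISET TABLE sales_stage, NO FALLBACK (
    sale_id INTEGER,
    amount  DECIMAL(18,2)
) PRIMARY INDEX (sale_id);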

You have to write a BTEQ script which drops a table and creates a table. How do you write this
script so that it does not return an error if the table being dropped does not exist?
We can do it by setting the severity of error 3807 (object does not exist) to zero before our drop statement and then setting it back to 8 after
dropping the table,
e.g.
.SET ERRORLEVEL (3807) SEVERITY 0;
DROP TABLE EMPLOYEE;
.SET ERRORLEVEL (3807) SEVERITY 8;

Types of indexes:
Primary Index
Secondary Indexes
Join Indexes
Hash Indexes
Partitioned Primary Indexes.

How will you choose a PI?
Access demographics: choose the column most frequently used for access, to maximize the
number of one-AMP operations.
Distribution demographics: better distribution of data optimizes parallel processing.
Volatility: changing the PI value may cause the row itself to be moved to another AMP; a stable PI reduces data
movement overhead.
