As shown in the figure, the DBMS is a central system which provides a common
interface between the data and the various front-end programs in the application. It
also provides a central location for the whole data in the application to reside.
Due to its centralized nature, the database system can overcome the disadvantages
of the file-based system as discussed below.
• Reduced Data Redundancy
Since all the data resides in one central database, the various programs in the
application can access data in different data files. Hence data present in
one file need not be duplicated in another, which reduces data redundancy.
However, this does not mean that all redundancy can be eliminated. There may
be business or technical reasons for retaining some amount of redundancy. Any
such redundancy should be carefully controlled and the DBMS should be
aware of it.
• Data Consistency
Controlled redundancy also improves data consistency: with a single copy of each
data item, updates cannot leave multiple copies disagreeing with each other.
• Data Integration
Since related data is stored in one single database, enforcing data integrity is much
easier. Moreover, the functions in the DBMS can be used to enforce the integrity rules
with minimum programming in the application programs.
• Data Sharing
Related data can be shared across programs since the data is stored in a centralized
manner. Even new applications can be developed to operate against the same data.
• Enforcement of Standards
The application programmer need not build functions for handling issues
like concurrent access, security and data integrity; these are provided by the
DBMS. The programmer only needs to implement the application's business
rules. This makes application development easier, and adding new functional
modules is also easier than in file-based systems.
• Better Controls
Better controls can be achieved due to the centralized nature of the system.
• Data Independence
- The external level, which is the level of the application programs or the end
user.
• Reduced Maintenance
Maintenance effort is lower and simpler, again due to the centralized nature of
the system.
• Data Definition
The DBMS provides functions to define the structure of the data in the
application. These include defining and modifying the record structure, the
type and size of fields and the various constraints/conditions to be satisfied by
the data in each field.
• Data Manipulation
The DBMS contains functions which handle the security and integrity of data
in the application. These can be easily invoked by the application, so the
application programmer need not code these functions in the application
programs.
Maintaining the Data Dictionary which contains the data definition of the
application is also one of the functions of a DBMS.
• Performance
Thus the DBMS provides an environment that is both convenient and efficient to use
when there is a large volume of data and many transactions to be processed.
Typically there are three types of users for a DBMS. They are :
1. The End User who uses the application. Ultimately, this is the user who
actually puts the data in the system into use in business. This user need not
know anything about the organization of data in the physical level. She also
need not be aware of the complete data in the system. She needs to have
access and knowledge of only the data she is using.
2. The Application Programmer who develops the application programs. She has
more knowledge about the data and its structure since she has to manipulate
the data using her programs. Like the end user, she need not have access to,
or knowledge of, the complete data in the system.
3. The Database Administrator (DBA) who is like the super-user of the system.
The role of the DBA is very important and is defined by the following
functions.
The DBA defines the schema which contains the structure of the data in the
application. The DBA determines what data needs to be present in the system
and how this data has to be represented and organized.
The DBA needs to interact continuously with the users to understand the data
in the system and its use.
The DBA determines the access restrictions to be enforced and defines security
checks accordingly. Data integrity checks are also defined by the DBA.
The DBA also defines procedures for backup and recovery. Defining backup
procedures includes specifying what data is to be backed up, the periodicity of
taking backups and also the medium and storage place for the backup data.
• Monitoring Performance
The DBA has to continuously monitor the performance of the queries and take
measures to optimize all the queries in the application.
Database systems can be categorized according to the data structures and operators
they present to the user. The oldest systems fall into the inverted list, hierarchic
and network categories. These are the pre-relational models.
• In the Network Model, a parent can have several children and a child can
also have many parent records. Records are physically linked through linked
lists. IDMS from Computer Associates International Inc. is an example of a
Network DBMS.
• In the Relational Model, unlike the Hierarchical and Network models, there
are no physical links. All data is maintained in the form of tables consisting of
rows and columns. Data in two tables is related through common columns
and not physical links or pointers. Operators are provided for operating on
rows in tables. Unlike the other two types of DBMS, there is no need to
traverse pointers in a Relational DBMS. This makes querying much easier
in a Relational DBMS than in a Hierarchical or Network DBMS.
This, in fact, is a major reason for the relational model becoming more
programmer friendly and much more dominant and popular in both industrial
and academic scenarios. Oracle, Sybase, DB2, Ingres, Informix and MS-SQL
Server are a few of the popular Relational DBMSs.
CUSTOMER
CUST. NO.  CUSTOMER NAME    ADDRESS     CITY
15371      Nanubhai & Sons  L. J. Road  Mumbai
...        ...              ...         ...

CONTACTS
CUST. NO.  CONTACT        DESIGNATION
15371      Nanubhai       Owner
15371      Rajesh Munim   Accountant
...        ...            ...

ORDERS
ORDER NO.  ORDER DATE    CUSTOMER NO.
3216       24-June-1997  15371
...        ...           ...

PARTS
PARTS NO.  PARTS DESC             PART PRICE
S3         Amkette 3.5" Floppies  400.00
...        ...                    ...

ORDERS-PARTS
ORDER NO.  PART NO.  QUANTITY
3216       C1        300
3216       S3        120
...        ...       ...

SALES-HISTORY
The recent developments in the area have shown up in the form of certain object and
object/relational DBMS products. Examples of such systems are GemStone and
Versant ODBMS. Research has also proceeded on to a variety of other schemes
including the multi-dimensional approach and the logic-based approach.
This chapter discusses the issues related to how the data is physically stored on the
disk and some of the access mechanisms commonly used for retrieving this data.
The Internal Level is the level which deals with the physical storage of data. While
designing this layer, the main objective is to optimize performance by minimizing the
number of disk accesses during the various database operations.
The figure shows the process of database access in general. The DBMS views the
database as a collection of records. The File Manager of the underlying Operating
System views it as a set of pages and the Disk Manager views it as a collection of
physical locations on the disk.
When the DBMS makes a request for a specific record to the File Manager, the latter
maps the record to a page containing it and requests the Disk Manager for the
specific page. The Disk Manager determines the physical location on the disk and
retrieves the required page.
2.1 Clustering
In the above process, if the page containing the requested record is already in the
memory, retrieval from the disk is not necessary. In such a situation, time taken for
the whole operation will be less. Thus, if records which are frequently used together
are placed physically together, more records will be in the same page. Hence the
number of pages to be retrieved will be less and this reduces the number of disk
accesses which in turn gives a better performance.
Assume that the Customer record size is 128 bytes and the typical size of a page
retrieved by the File Manager is 1 Kb (1024 bytes).
If there is no clustering, it can be assumed that the Customer records are stored at
random physical locations. In the worst-case scenario, each record may be placed in
a different page. Hence a query to retrieve 100 records with consecutive Cust_Ids
(say, 10001 to 10100), will require 100 pages to be accessed which in turn translates
to 100 disk accesses.
But if the records are clustered, a page can contain 8 records. Hence the number of
pages to be accessed for retrieving the 100 consecutive records will be ceil(100/8) =
13, i.e., only 13 disk accesses will be required to obtain the query results. Thus, in
the given example, clustering improves the speed by a factor of about 7.7.
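The page-access arithmetic above can be sketched in a few lines (a toy calculation, using the record and page sizes assumed in the text):

```python
# Sketch of the clustering arithmetic: 128-byte records, 1 KB pages,
# 100 consecutive records requested.
import math

RECORD_SIZE = 128          # bytes per Customer record
PAGE_SIZE = 1024           # bytes per page fetched by the File Manager
N_RECORDS = 100            # consecutive records requested

records_per_page = PAGE_SIZE // RECORD_SIZE   # 8 records fit in one page

# Worst case without clustering: every record sits on a different page.
unclustered_accesses = N_RECORDS              # 100 disk accesses

# With clustering, consecutive records share pages.
clustered_accesses = math.ceil(N_RECORDS / records_per_page)  # ceil(100/8) = 13

speedup = unclustered_accesses / clustered_accesses
print(records_per_page, clustered_accesses, round(speedup, 1))  # 8 13 7.7
```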
Q: When will clustering not yield any improvement?
A: When the record size and page size are such that a page can contain only one
record.
Q: Can a file be clustered on more than one field at a time?
A: No
• Intra-file Clustering – Clustered records belong to the same file (table), as in the
above example.
• Inter-file Clustering – Clustered records belong to different files (tables). This type
of clustering may be required to enhance the speed of queries retrieving related
records from more than one table. Here, interleaving of records is used.
2.2 Indexing
Indexing is another common method for making retrievals faster.
Consider the example of CUSTOMER table used above. The following query is based
on Customer's city.
Here a sequential search on the CUSTOMER table has to be carried out and all
records with the value 'Delhi' in the Cust_City field have to be retrieved. The time
taken for this operation depends on the number of pages to be accessed. If the
records are stored randomly, the number of page accesses depends on the volume
of data. If the records are stored physically together, the number of pages also
depends on the size of each record.
If such queries based on Cust_City field are very frequent in the application, steps
can be taken to improve the performance of these queries. Creating an Index on
Cust_City is one such method. This results in the scenario as shown below.
A new index file is created. The number of records in the index file is same as that of
the data file. The index file has two fields in each record. One field contains the value
of the Cust_City field and the second contains a pointer to the actual data record in
the CUSTOMER table.
Whenever a query based on the Cust_City field occurs, a search is carried out on
the index file. This search will be much faster than a sequential search on the
CUSTOMER table itself, because the index records are much smaller, so each page
can hold many more of them and far fewer pages need to be read.
When the records with value 'Delhi' in the Cust_City field in the index file are located,
the pointer in the second field of the records can be followed to directly retrieve the
corresponding CUSTOMER records.
Thus the access involves a Sequential access on the index file and a Direct access on
the actual data file.
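The access path just described — a sequential search on the small index followed by direct access to the data records — can be sketched with in-memory lists standing in for the index and data files (all names and values here are illustrative):

```python
# Toy sketch of the access path: sequential search on a (city, pointer)
# index, then direct access into the data file via the stored pointers.
customer_file = [                      # the data file; list position = record address
    ("10001", "Nanubhai & Sons", "Mumbai"),
    ("10002", "Gupta", "Delhi"),
    ("10003", "Srinivasan", "Madras"),
    ("10004", "Mehta", "Delhi"),
]

# Index file: one (Cust_City, pointer) entry per data record.
city_index = [(rec[2], addr) for addr, rec in enumerate(customer_file)]

def find_by_city(city):
    # Sequential search on the (much smaller) index records ...
    pointers = [ptr for key, ptr in city_index if key == city]
    # ... then direct access on the data file via the stored pointers.
    return [customer_file[ptr] for ptr in pointers]

print(find_by_city("Delhi"))
# [('10002', 'Gupta', 'Delhi'), ('10004', 'Mehta', 'Delhi')]
```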
Retrieval Speed v/s Update Speed: Though indexes help make retrievals faster,
they slow down updates on the table, since updates on the base table require the
index to be updated as well.
It is possible to create an index with multiple fields i.e., index on field combinations.
Multiple indexes can also be created on the same table simultaneously though there
may be a limit on the maximum number of indexes that can be created on a table.
b) When the data table is small and the index record is of almost the same size as of
the actual data record.
Q: Can a clustering based on one field and indexing on another field exist on the
same table simultaneously ?
A: Yes
2.3 Hashing
Hashing is yet another method used for making retrievals faster. This method
provides direct access to a record on the basis of the value of a specific field called
the hash field. When a new record is inserted, it is physically stored at an address
computed by applying a mathematical function (the hash function) to the value
of the hash field. Thus, for every new record,
hash_address = f(hash_field), where f is the hash function.
Later, when a record is to be retrieved, the same hash function is used to compute
the address where the record is stored. Retrievals are faster since a direct access is
provided and there is no search involved in the process.
A typical example of a hash function is a numeric hash field, say an id, taken
modulo a very large prime number.
Q: Can there be more than one hash field for a file?
A: No
As hashing relates the field value to the address of the record, multiple hash fields
would map a record to multiple addresses at the same time. Hence there can be only
one hash field per file.
Collisions : Consider the example of the CUSTOMER table given earlier while
discussing clustering. Let CUST_ID be the hash field and the hash function be
defined as ((CUST_ID mod 10000)*64 + 1025). The records with CUST_ID 10001,
10002, 10003 etc. will be stored at addresses 1089, 1153, 1217 etc. respectively.
It is possible that two records hash to the same address leading to a collision. In the
above example, records with CUST_ID values 20001, 20002, 20003 etc. will also
map on to the addresses 1089, 1153, 1217 etc. respectively. And same is the case
with CUST_ID values 30001, 30002, 30003 etc.
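The hash function used in this example can be written out directly; a small sketch (plain Python, with the CUST_ID values from the text):

```python
# The hash function from the text:
#   hash_address = (CUST_ID mod 10000) * 64 + 1025
def hash_address(cust_id):
    return (cust_id % 10000) * 64 + 1025

print([hash_address(c) for c in (10001, 10002, 10003)])  # [1089, 1153, 1217]

# Collision: CUST_IDs 10001, 20001 and 30001 all map to the same address.
assert hash_address(10001) == hash_address(20001) == hash_address(30001) == 1089
```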
Two common techniques for handling collisions are described below.
1. Linear Search:
While inserting a new record, if it is found that the location at the hash address is
already occupied by a previously inserted record, search for the next free location
available in the disk and store the new record at this location. A pointer from the first
record at the original hash address to the new record will also be stored. During
retrieval, the hash address is computed to locate the record. When it is seen that the
record is not available at the hash address, the pointer from the record at that
address is followed to locate the required record.
In this method, the overhead incurred is the time taken by the linear search to
locate the next free location while inserting a record.
2. Collision Chain:
Here, the hash address location contains the head of a list of pointers linking
together all records which hash to that address.
In this method, an overflow area needs to be used if the number of records mapping
on to the same hash address exceeds the number of locations linked to it.
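A collision chain can be sketched with a dictionary of lists standing in for the hash-address locations (the records and the hash function are the illustrative ones used above):

```python
# Minimal sketch of a collision chain: each hash address holds the head of
# a chain of all records hashing to that address (here, a dict of lists).
from collections import defaultdict

def hash_address(cust_id):
    return (cust_id % 10000) * 64 + 1025

buckets = defaultdict(list)

def insert(record):                      # record = (cust_id, name)
    buckets[hash_address(record[0])].append(record)

def retrieve(cust_id):
    # Direct access to the chain, then a short walk along it.
    for rec in buckets[hash_address(cust_id)]:
        if rec[0] == cust_id:
            return rec
    return None

for r in [(10001, "Nanubhai"), (20001, "Gupta"), (10002, "Apte")]:
    insert(r)

print(retrieve(20001))   # (20001, 'Gupta') - found on the chain at address 1089
```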
Attribute – A field or a column in a relation. e.g., Ord_Date, Item#, CustName etc.
• No Duplicate Tuples – A relation cannot contain two or more tuples which have the
same values for all the attributes, i.e., in any relation, every row is unique.
• Tuples are unordered – The order of rows in a relation is immaterial.
• Attributes are unordered – The order of columns in a relation is immaterial.
• Attribute Values are Atomic – Each tuple contains exactly one value for each
attribute.
It may be noted that many of the properties of relations follow from the fact that
the body of a relation is a mathematical set.
• The Database must not contain any unmatched Foreign Key values. This is called
the referential integrity rule.
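The referential integrity rule can be demonstrated with SQLite as a stand-in DBMS (the table and column names here are illustrative, and SQLite enforces foreign keys only when the pragma is set):

```python
# Hedged sketch of the referential integrity rule using SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")     # SQLite enforces FKs only when asked
con.execute("CREATE TABLE customers (cust INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE orders (
                 ord INTEGER PRIMARY KEY,
                 cust INTEGER REFERENCES customers(cust))""")

con.execute("INSERT INTO customers VALUES (1)")
con.execute("INSERT INTO orders VALUES (100, 1)")       # matched FK: accepted

try:
    con.execute("INSERT INTO orders VALUES (101, 99)")  # unmatched FK
except sqlite3.IntegrityError as e:
    print("rejected:", e)                               # the DBMS refuses the row
```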
Unlike the case of Primary Keys, there is no integrity rule saying that no component
of the foreign key can be null. This can be logically explained with the help of the
following example:
Employee
In the case example given, Cust# in Ord_Aug cannot accept Null if the business rule
insists that the Customer No. needs to be stored for every order placed.
The next issue related to foreign key references is the handling of deletes and
updates of the parent record.
In the case example, can we delete the record with Cust# value 002, 003 or 005?
The default answer is NO, as long as there is a foreign key reference to these records
from some other table. Here, the records are referenced from the order records in
the Ord_Aug relation; hence the strategy is Restrict: the deletion of the parent
record is disallowed.
Nullify: Update the referencing to Null and then delete/update the parent record. In
the above example of Employee and Account relations, an account record may have
to be deleted if the account is to be closed. For example, if Employee Raj decides to
close his account, Account record with Acc# 120002 has to be deleted. But this
deletion is not possible as long as the Employee record of Raj references it. Hence
the strategy can be to update the EmpAcc# field in the employee record of Raj to
Null and then delete the Account parent record of 120002. After the deletion the data
in the tables will be as follows:
Employee
Emp# EmpName EmpCity EmpAcc#
Account
Descr          Price
101-Keyboard   2000
Mouse           800
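In many DBMSs the nullify strategy can be declared rather than hand-coded; a sketch with SQLite as a stand-in, using names loosely following the Employee/Account example (Acc# 120002 as in the text):

```python
# Sketch of the "nullify" strategy declared as ON DELETE SET NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE account (acc INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE employee (
                 emp INTEGER PRIMARY KEY,
                 empacc INTEGER REFERENCES account(acc) ON DELETE SET NULL)""")

con.execute("INSERT INTO account VALUES (120002)")
con.execute("INSERT INTO employee VALUES (1, 120002)")   # Raj's record

con.execute("DELETE FROM account WHERE acc = 120002")    # close the account
print(con.execute("SELECT empacc FROM employee WHERE emp = 1").fetchone())
# (None,) - the referencing field was nullified, then the parent deleted
```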
Note: The union operation shown above logically implies retrieval of records of
orders placed in July or in August.
Ord#  OrdDate   Cust#  CustName    City
103   21-08-94  003    Gupta       Delhi
104   28-08-94  002    Srinivasan  Madras
105   30-08-94  005    Apte        Bombay

Note: The above join operation logically implies retrieval of the details of all orders
together with the details of the corresponding customers who placed the orders.

Such a join operation, where only those rows having corresponding rows in both
the relations are retrieved, is called the natural join or inner join. This is the most
common join operation.
EMPLOYEE
A join can be formed between the two relations based on the common column Acc#.
The result of the (inner) join is :
Note that, from each table, only those records which have corresponding records in
the other table appear in the result set. This means that result of the inner join
shows the details of those employees who hold an account along with the account
details.
The other type of join is the outer join which has three variations – the left outer
join, the right outer join and the full outer join. These three joins are explained as
follows:
The left outer join retrieves all rows from the left-side (of the join operator) table. If
there are corresponding or related rows in the right-side table, the correspondence
will be shown. Otherwise, columns of the right-side table will take null values.
EMPLOYEE left outer join ACCOUNT gives:
The right outer join retrieves all rows from the right-side (of the join operator) table.
If there are corresponding or related rows in the left-side table, the correspondence
will be shown. Otherwise, columns of the left-side table will take null values.
(Assume that Acc# 120004 belongs to someone who is not an employee and hence
the details of the Account holder are not available here)
The full outer join retrieves all rows from both the tables. If there is a
correspondence or relation between rows from the tables of either side, the
correspondence will be shown. Otherwise, related columns will take null values.
EMPLOYEE full outer join ACCOUNT gives:
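The left outer join can be tried out with SQLite as a stand-in (tiny illustrative tables; SQLite has long supported LEFT OUTER JOIN, while RIGHT and FULL outer joins appear only in recent versions, so only the left form is shown):

```python
# Left outer join: every employee row is kept; where no account matches,
# the right-side columns take NULL (None in Python).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp TEXT, acc INTEGER);
    CREATE TABLE account  (acc INTEGER, balance INTEGER);
    INSERT INTO employee VALUES ('Raj', 120002), ('Mita', NULL);
    INSERT INTO account  VALUES (120002, 5000), (120004, 7000);
""")

rows = con.execute("""
    SELECT e.emp, e.acc, a.balance
    FROM employee e LEFT OUTER JOIN account a ON e.acc = a.acc
""").fetchall()
print(sorted(rows))
# [('Mita', None, None), ('Raj', 120002, 5000)]
```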
a1 b1 c1
a2 b2 c2
a3 b3 c3
8. DIVIDE
Thus the result contains those values from R1 whose corresponding R2 values in R3
include all R2 values.
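DIVIDE can be sketched with Python sets (the supplier/part values are illustrative, not from the case example):

```python
# Set-based sketch of relational DIVIDE: R1 (pairs) divided by R2 yields
# those left-hand values related to every value in R2.
def divide(pairs, divisor):
    # pairs: set of (x, y); divisor: set of y values
    candidates = {x for x, _ in pairs}
    return {x for x in candidates
            if divisor <= {y for px, y in pairs if px == x}}  # subset test

supplies = {("S1", "P1"), ("S1", "P2"), ("S2", "P1")}
print(divide(supplies, {"P1", "P2"}))   # {'S1'} - only S1 supplies all parts
```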
b. Data Definition Language – Consists of SQL statements for defining the schema
(Creating, Modifying and Dropping tables, indexes, views etc.)
c. Data Control Language – Consists of SQL statements for providing and revoking
access permissions to users
Tables used:
Ord_Items
Ord_Aug
Customers
General form:
Query 1:
FROM items;
Result
Query 2:
SELECT cust#,custname
FROM customers;
Result
Query 3:
FROM ord_items;
Result
Query 4:
SELECT ord# "Order", orddate "Ordered On"   <---- In the result set the column headings will appear as
FROM ord_aug;                                     "Order" and "Ordered On" instead of ord# and orddate
Result
Query 5:
FROM items
WHERE price>2000;
Result
Query 6:
SELECT custname
FROM customers
WHERE city<>'Bombay';
Result
Query 7:
SELECT custname
FROM customers
WHERE UPPER(city)<>'BOMBAY';
Result
Query 8:
SELECT *
FROM ord_aug
WHERE orddate > '15-AUG-94';   <----- Illustrates the use of 'date' fields. In SQL, a separate
                                      datatype (eg: date, datetime etc.) is available to store
                                      data which is of type date.
Result
Query 9:
SELECT *
FROM ord_items
Result
Query 10:
SELECT custname
FROM customers
Result
Query 11:
SELECT custname
FROM customers
WHERE custname LIKE 'S%' ; <------------ LIKE 'S%' - 'S' followed by zero or more characters
Result
Query 12:
SELECT *
FROM ord_items
Result
Query 13:
SELECT custname
FROM customers
Query 14:
SELECT *
FROM customers
WHERE city='Bombay'
ORDER BY custname;   <-------------------- Records in the result set are displayed in
                                           the ascending order of custname
Result
Query 15:
SELECT *
Result
Query 16:
SELECT descr, price
FROM items
ORDER BY 2;
Result
Query 17:
WHERE city='Delhi'
Result
Query 18:
Result
Query 19:
Result
Query 20:
FROM items
WHERE price > (SELECT AVG(price) FROM items); <------ Inner SELECT statement
Result
Query 21:
WHERE custname='Shah');
Result
Arithmetic Expressions
()
Query 22:
FROM items
WHERE price >= 4000
ORDER BY 3;
Result
Query 23:
SELECT descr
Result
Numeric Functions
Query 24:
FROM ord_items
WHERE item#='HW2';
Result
Query 25:
FROM ord_items
WHERE item#='HW2';
Result
SQRT(n)
ROUND(n,m)
TRUNC(n,m)
'm' indicates the number of digits after decimal points in the result.
Date Arithmetic
Date – Date
Query 26:
FROM ord_aug;
Result
Date Functions
MONTHS_BETWEEN(date1, date2)
SYSDATE
Query 27:
SELECT ord#,
MONTHS_BETWEEN(SYSDATE,orddate)
FROM ord_aug;
Result
Query 28:
Result
Note:
MM - month (01-12)
HH:MI:SS - hours:minutes:seconds
|| - Concatenate operator
Query 29:
FROM customers;
Result
INITCAP(string)
UPPER(string)
LOWER(string)
SUBSTR(string,start,no. of characters)
Group Functions
Group functions are functions which act on the entire column of selected rows.
Query 30:
Result
SUM
AVG
COUNT
MAX
MIN
Query 31:
Result
Query 32:
Result
Query 33:
FROM ord_items
GROUP BY item#
HAVING COUNT(*)>2;
Result
General forms:
Query 35: Insert values of item# & descr columns for a new row
Query 37: Inserts a new row with the date field being specified in non DD-MON-YY
format
General form:
UPDATE <table-name>
UPDATE items
UPDATE ord_items
General form:
DDL statements are those which are used to create, modify and drop the definitions
or structures of various tables, views, indexes and other elements of the DBMS.
General form:
(<table-element (comma)list>*);
* - table element may be attribute with its data-type and size or any integrity
constraint on attributes.
Some CREATE TABLE statements on the Case Example
Query:
custname CHAR(30) ,
city CHAR(20));
- This query creates a table CUSTOMERS with 3 fields: cust#, custname and city.
Cust# cannot be null.
Query:
CREATE TABLE ord_sep           <------ Creates a new table ord_sep, which has the same structure
AS SELECT * from ord_aug;             as ord_aug. The data in ord_aug is copied to the new table
                                      ord_sep.
- This query creates table ORD_SEP as a copy of ORD_AUG. Copies structure as well
as data.
Query:
- This query creates table ORD_SEP as a copy of ORD_AUG, but does not copy any
data as the WHERE clause is never satisfied.
General form:
Query:
ALTER TABLE customers
MODIFY custname CHAR(35); <------------- Modifies the data type/size of an attribute in the table
- This query changes the custname field to a character field of length 35. Used for
modifying field lengths and attributes.
Query:
- This query adds two new fields - phone & credit_rating to the customers table.
General form:
Example:
Query:
A view is a virtual relation created with attributes from one or more base tables.
SELECT * FROM myview1; at any given time will evaluate the view-defining query in
the CREATE VIEW statement and display the result.
Query:
AS SELECT
ord#, orddate, ord_aug.cust#, custname
- This query defines a view consisting of ord#, cust#, and custname using a join of
ORD_AUG and CUSTOMERS tables.
Query:
FROM ord_items;
- This query defines a view with columns item# and qty from the ORD_ITEMS table,
and renames these columns as ItemNo. and Quantity respectively.
Query:
FROM items
- This query defines the view as defined. WITH CHECK OPTION ensures that if this
view is used for updation, the updated values do not cause the row to fall outside the
view.
Query:
CREATE INDEX i_city            <------ Creates a new index named i_city. The new
ON customers (city);                  index file (table) will have the values of the city
                                      column of the Customers table
Query:
CREATE UNIQUE INDEX i_custname <------ Creates an index which allows only unique values for
ON customers (custname);              custnames
Query:
Query:
DCL statements are those which are used to control access permissions on the
tables, indexes, views and other elements of the DBMS.
Query:
ON customers
TO ashraf;
Query:
GRANT SELECT <-------------- Grants SELECT permission on the table customers to the user
'sunil'. User 'sunil' does not have permission to insert, update,
delete or perform any other operation on customers table.
ON customers
TO sunil;
Query:
REVOKE SELECT
ON customers
FROM sunil;
Recovery and Concurrency in a DBMS are part of the general topic of transaction
management. Hence we shall begin the discussion by examining the fundamental
notion of a transaction.
5.1 Transaction
The procedure for transferring an amount of Rs. 100/- from the account of one
customer to another is given.
Here, it has to be noted that the single operation “amount transfer” involves two
database updates – updating the record of from_cust and updating the record of
to_cust. In between these two updates the database is in an inconsistent (or
incorrect in this example) state. i.e., if only one of the updates is performed, one
cannot say by seeing the database contents whether the amount transfer operation
has been done or not. Hence to guarantee database consistency it has to be ensured
that either both updates are performed or none are performed. If, after one update
and before the next update, something goes wrong due to problems like a system
crash, an overflow error, or a violation of an integrity constraint etc., then the first
update needs to be undone.
This is true with all transactions. Any transaction takes the database from one
consistent state to another. It need not necessarily preserve consistency of database
at all intermediate points. Hence it is important to ensure that either a transaction
executes in its entirety or is totally cancelled. The set of programs which handles this
forms the transaction manager in the DBMS. The transaction manager uses COMMIT
and ROLLBACK operations for ensuring atomicity of transactions.
ROLLBACK – The ROLLBACK operation indicates that the transaction has been
unsuccessful, which means that all updates done by the transaction till then need to
be undone to bring the database back to a consistent state. To help undo updates
already made, a system log or journal is maintained by the transaction manager.
The before- and after-images of the updated tuples are recorded in the log.
Isolation: Transactions are isolated from one another. i.e., A transaction's updates
are concealed from all others until it commits (or rolls back).
Durability: Once a transaction commits, its updates survive in the database even if
there is a subsequent system crash.
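The amount-transfer example can be sketched with SQLite as a stand-in: a simulated crash between the two updates is answered with a ROLLBACK, leaving both balances untouched (account names and figures are illustrative):

```python
# COMMIT/ROLLBACK atomicity: the Rs. 100 transfer either applies to both
# accounts or to neither.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE acct (cust TEXT PRIMARY KEY, amt INTEGER)")
con.execute("INSERT INTO acct VALUES ('from_cust', 1000), ('to_cust', 500)")
con.commit()

try:
    con.execute("UPDATE acct SET amt = amt - 100 WHERE cust = 'from_cust'")
    raise RuntimeError("crash between the two updates")   # simulated failure
    con.execute("UPDATE acct SET amt = amt + 100 WHERE cust = 'to_cust'")
    con.commit()
except RuntimeError:
    con.rollback()      # undo the first update; the database stays consistent

print(con.execute("SELECT amt FROM acct ORDER BY cust").fetchall())
# [(1000,), (500,)] - both balances unchanged
```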
5.2 Recovery from System Failures
System failures (also called soft crashes) are those failures like power outage which
affect all transactions in progress, but do not physically damage the database.
During a system failure, the contents of the main memory are lost. Thus the
contents of the database buffers which contain the updates of transactions are lost.
(Note: Transactions do not directly write on to the database. The updates are written
to database buffers and, at regular intervals, transferred to the database.) At restart,
the system has to ensure that the ACID properties of transactions are maintained
and the database remains in a consistent state. To attain this, the strategy to be
followed for recovery at restart is as follows:
• An online logfile or journal – The logfile maintains the before- and after-images of
the tuples updated during a transaction. This helps in carrying out the UNDO and
REDO operations as required. Typical entries made in the logfile are :
T1 does not enter the recovery procedure at all, since its updates were all written to
the database at time tc as part of the checkpoint process.
5.4 Concurrency
Concurrency refers to multiple transactions accessing the same database at the same
time. In a system which allows concurrency, some kind of control mechanism has to
be in place to ensure that concurrent transactions do not interfere with each other.
Three typical problems which can occur due to concurrency are explained here.
• there is a record R, with a field, say Amt, having value 1000 before time t1.
o Both transactions A & B fetch this value at t1 and t2 respectively.
o Transaction A updates the Amt field in R to 800 at time t3.
o Transaction B updates the Amt field in R to 1200 at time t4.
Thus, after time t4, the Amt field in record R has value 1200. The update made by
Transaction A at time t3 is overwritten by Transaction B at time t4 and is lost.)
• there is a record R, with a field, say Amt, having value 1000 before time t1.
o Transaction B fetches this value and updates it to 800 at time t1.
o Transaction A fetches R with Amt field value 800 at time t2.
o Transaction B rolls back and its update is undone at time t3. The Amt
field takes the initial value 1000 during rollback.
Transaction A continues processing with the Amt field value 800, unaware of B's
rollback; it is thus working with a value that was never committed.)
5.5 Locking
1. shared (S lock)
2. and exclusive (X Lock).
Normally, locks are implicit. A FETCH request is an implicit request for a shared lock
whereas an UPDATE request is an implicit request for an exclusive lock.
Explicit lock requests need to be issued if a different kind of lock is required during
an operation. For example, if an X lock is to be acquired before a FETCH, it has to
be explicitly requested.
5.6 Deadlocks
Locking can be used to solve the problems of concurrency. However, locking can also
introduce the problem of deadlock as shown in the example below.
Deadlock is a situation in which two or more transactions are in a simultaneous wait
state, each of them waiting for one of the others to release a lock before it can
proceed.
If a deadlock occurs, the system may detect it and break it. Detecting involves
detecting a cycle in the “Wait-For Graph” (a graph which shows 'who is waiting for
whom'). Breaking a deadlock implies choosing one of the deadlocked transactions as
the victim and rolling it back, thereby releasing all its locks. This may allow some
other transaction(s) to proceed.
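Deadlock detection by finding a cycle in the Wait-For Graph can be sketched as a depth-first search (a toy version; real DBMSs maintain this graph incrementally):

```python
# Deadlock detection sketch: find a cycle in the Wait-For Graph
# ("who is waiting for whom") by depth-first search.
def has_deadlock(waits_for):
    # waits_for: dict mapping a transaction to the transactions it waits on
    visiting, done = set(), set()

    def dfs(t):
        if t in visiting:          # reached a transaction already on this path
            return True            # -> a cycle, i.e. a deadlock
        if t in done:
            return False
        visiting.add(t)
        if any(dfs(u) for u in waits_for.get(t, ())):
            return True
        visiting.remove(t)
        done.add(t)
        return False

    return any(dfs(t) for t in list(waits_for))

print(has_deadlock({"A": ["B"], "B": ["A"]}))   # True  - A and B wait on each other
print(has_deadlock({"A": ["B"], "B": []}))      # False - B can proceed
```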
6. Query Optimization
6.1 Overview
Let us look at a query being evaluated in two different ways to see the dramatic
effect of query optimization.
Assumptions:
T1 = ORDTBL X ORD_ITEMS
(Perform the Product operation as the first step towards joining the two tables)
- 10000 X 100 tuple reads (1000000 tuple reads -> generates 1000000 tuples as
intermediate result)
- 1000000 tuples written to disk (Assuming that 1000000 tuples in the intermediate
result cannot be held in the memory. 1000000 tuple writes to a temporary space in
the disk.)
T3 = T2[ORDDATE, ITEM#, QTY]
(Projection performed as the final step. No more tuple I/Os)
Total no. of tuple i/o s = 1000000 reads + 1000000 writes + 1000000 reads
= 3000000 tuple i/o s
Query Evaluation – Method 2
T1 = ORD_ITEMS WHERE <restriction>
(One scan of ORD_ITEMS: 10,000 tuple reads, selecting 50 tuples which are held in
memory)
T2 = ORDTBL JOIN T1
(One scan of ORDTBL: 100 tuple reads)
10,100 tuple I/O's (of Method 2) v/s 3,000,000 tuple I/O's (of Method 1)!
Here it needs to be noted that in the Method 2 of evaluation, the first operation to be
performed was a 'Select' which filters out 50 tuples from the 10,000 tuples in the
ORD_ITEMS table. Thus this operation causes elimination of 9950 tuples. Thus
elimination in the initial steps would help optimization.
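The tuple-I/O arithmetic of the two methods can be checked in a few lines (using the counts assumed in the text: 100 ORDTBL tuples, 10,000 ORD_ITEMS tuples, 50 of which survive the restriction):

```python
# Tuple-I/O counts for the two evaluation methods described above.
ORDTBL, ORD_ITEMS, MATCHING = 100, 10_000, 50

# Method 1: product first, then restrict, then project.
product = ORDTBL * ORD_ITEMS                 # 1,000,000 intermediate tuples
method1 = product + product + product        # read inputs, write temp, re-read temp
# = 3,000,000 tuple I/Os

# Method 2: restrict ORD_ITEMS first (one scan), keep the 50 survivors in
# memory, then join against ORDTBL (one scan).
method2 = ORD_ITEMS + ORDTBL                 # 10,100 tuple I/Os

print(method1, method2)                      # 3000000 10100
```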
1.  select CITY, COUNT(*) from CUSTTBL          select CITY, COUNT(*) from CUSTTBL
    where CITY != 'BOMBAY'               v/s    group by CITY
    group by CITY;                              having CITY != 'BOMBAY';
Here the second version is faster. In the first form of the query, a function to_char is
applied on an attribute and hence needs to be evaluated for each tuple in the table.
The time for this evaluation will be thus proportional to the cardinality of the relation.
In the second form, a function to_date is applied on a constant and hence needs to
be evaluated just once, irrespective of the cardinality of the relation. Moreover, if the
attribute ORDDATE is indexed, the index will not be used in the first case, since the
attribute appears in an expression and its value is not directly used.
a) Cast into some Internal Representation – This step involves representing each
SQL query into some internal representation which is more suitable for machine
manipulation. The internal form typically chosen is a query tree as shown below.
b) Convert to Canonical Form – In this second step, the optimizer makes use of some
transformation laws or rules for sequencing the internal operations involved. Some
examples are given below.
(Note: In all these examples the second form will be more efficient irrespective of
the actual data values and physical access paths that exist in the stored database. )
Rule 1:
Rule 3:
(A[projection_1])[projection_2]  ≡  A[projection_2]
If there is a sequence of successive projections applied on the same relation, all but
the last one can be ignored. i.e., The entire operation is equivalent to applying the
last projection alone.
Rule 4:
(A WHERE restriction)[projection]
Restrictions when applied first, cause eliminations and hence better performance.
The basic strategy here is to consider the query expression as a set of low-level
implementation procedures predefined for each operation. For eg., there will be a set
of procedures for implementing the restriction operation: one (say, procedure 'a') for
the case where the restriction attribute is indexed, one (say, procedure 'b') where
the restriction attribute is hashed and so on.
Each such procedure has an associated cost measure indicating the cost, typically in
terms of disk I/Os.
The optimizer chooses one or more candidate procedures for each low-level
operations in the query. The information about the current state of the database
(existence of indexes, current cardinalities etc.) which is available from the system
catalog will be used to make this choice of candidate procedures.
d) Generate Query Plans and Choose the Cheapest – In this last step, query plans are
generated by combining a set of candidate implementation procedures. This can be
explained with the following example(A trivial one but illustrative enough).
Operation    Condition Existing    Implementation Procedure
Join                               d
Join                               e
Projection                         f
Projection                         g
Now the various query plans for the original query expression can be generated by
making permutations of implementation procedures available for different
operations. Thus the query plans can be
– adf
– adg
– aef
– aeg
– bdf
...
...
It has to be noted that in reality, the number of such query plans possible can be too
many and hence generating all such plans and then choosing the cheapest will be
expensive by itself. Hence a heuristic reduction of search space rather than
exhaustive search needs to be done. Considering the above example, one such
heuristic method can be as follows:
If the system knows that the restriction attribute is neither indexed nor hashed, then
only the query plans involving implementation procedure 'c' (and not 'a' or 'b')
need to be considered, and the cheapest plan can be chosen from the reduced set of
query plans.
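Plan generation by permuting candidate procedures, and the heuristic pruning described above, can be sketched as follows (procedure names 'a'–'g' follow the example; the candidate sets themselves are assumptions):

```python
# Sketch of plan generation: one candidate procedure per low-level operation,
# combined by permutation, then pruned heuristically.
from itertools import product

candidates = {
    "restriction": ["a", "b", "c"],
    "join":        ["d", "e"],
    "projection":  ["f", "g"],
}

plans = ["".join(p) for p in product(*candidates.values())]
print(plans[:4], "...", len(plans), "plans")
# ['adf', 'adg', 'aef', 'aeg'] ... 12 plans

# Heuristic pruning: if the restriction attribute is neither indexed nor
# hashed, only procedure 'c' applies, shrinking the search space.
pruned = [p for p in plans if p[0] == "c"]
print(len(pruned))   # 4
```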
Some of the query optimization measures used in Oracle are the following:
– Indexes are unnecessary for small tables: if the size of the actual data record is
not much larger than the index record, the search time in the index table and in the
data table will be comparable. Hence indexes will not make much difference to query
performance.
– Indexes/clusters are used when retrieving less than about 25% of the rows; the
overhead of searching the index file outweighs the benefit when more rows are
retrieved.
– Indexes are not used in queries containing NULL / NOT NULL conditions. Index
tables do not hold NULL entries, hence there is no point searching the index for
these.