SCHOOL OF INFORMATION SCIENCES AND TECHNOLOGY

BTech (Hons) COMPUTER SCIENCE


ICS 124: DATABASE DESIGN CONCEPTS
INTRODUCTION

What is a database?
A database is a collection of related data.
Data is known facts that can be recorded and that have implicit meaning.
A database has the following implicit properties:
A database represents some aspect of the real world, sometimes called the miniworld or
the universe of discourse (UoD). Changes to the miniworld are reflected in the database.
A database is a logically coherent collection of data with some inherent meaning. A
random assortment of data cannot correctly be referred to as a database.
A database is designed, built, and populated with data for a specific purpose. It has an
intended group of users and some preconceived applications in which these users are
interested.
A database management system (DBMS) is a collection of programs that enables users to create
and maintain a database. The DBMS is hence a general-purpose software system that facilitates
the processes of defining, constructing, manipulating, and sharing databases among various users
and applications.
Defining a database involves specifying the data types, structures, and constraints for the data to
be stored in the database.
Constructing the database is the process of storing the data itself on some storage medium that is
controlled by the DBMS.
Manipulating a database includes such functions as querying the database to retrieve specific
data, updating the database to reflect changes in the miniworld, and generating reports from the
data.
Sharing a database allows multiple users and programs to access the database concurrently.
Protection includes both system protection against hardware or software malfunction (or
crashes), and security protection against unauthorized or malicious access.
A database system is the database and DBMS software together.
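The defining, constructing, and manipulating steps described above can be sketched in a few lines. This is only an illustrative sketch: it assumes SQLite (through Python's standard sqlite3 module) as a stand-in for a DBMS, and the student table, its columns, and the second student are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the DBMS in this sketch.
conn = sqlite3.connect(":memory:")

# Defining: specify the data types, structure, and constraints.
conn.execute(
    "CREATE TABLE student ("
    " rollno INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " age INTEGER CHECK (age > 0))"
)

# Constructing: store the data itself under the control of the DBMS.
conn.executemany(
    "INSERT INTO student (rollno, name, age) VALUES (?, ?, ?)",
    [(1, "Rahat", 35), (2, "Asha", 22)],  # hypothetical sample rows
)

# Manipulating: update the data to reflect a change in the miniworld,
# then query it.
conn.execute("UPDATE student SET age = 36 WHERE rollno = 1")
rows = conn.execute(
    "SELECT rollno, name, age FROM student ORDER BY rollno"
).fetchall()
print(rows)
```

Sharing and protection are not shown here; a single in-memory connection has only one user.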

File systems
File processing systems were an early attempt to computerize the manual filing system that we are
all familiar with. A file system is a method for storing and organizing computer files and the data
they contain to make it easy to find and access them. File systems may use a storage device such
as a hard disk or CD-ROM and involve maintaining the physical location of the files.
In our own home, we probably have some sort of filing system, which contains receipts,
guarantees, invoices, bank statements, and such like. When we need to look something up, we go
to the filing system and search through the system starting from the first entry until we find what
we want. Alternatively, we may have an indexing system that helps to locate what we want more
quickly. For example, we may have divisions in the filing system or separate folders for different
types of item that are in some way logically related.
The manual filing system works well when the number of items to be stored is small. It even
works quite adequately when there are large numbers of items and we have only to store and
retrieve them. However, the manual filing system breaks down when we have to cross-reference
or process the information in the files. For example, a typical real estate agent's office might have
a separate file for each property for sale or rent, each potential buyer and renter, and each
member of staff.
Clearly, the manual system is inadequate for this type of work. The file-based system was
developed in response to the needs of industry for more efficient data access. In early processing
systems, an organization's information was stored as groups of records in separate files.
In the traditional approach, information was stored in flat files maintained by the
file system under the operating system's control. Here, flat files are files containing records
that have no structured relationship among them. The file handling taught in C/C++ is
an example of a file processing system. Application programs written in languages such as
C/C++ go through the file system to access these flat files, as shown.

Characteristics of File Processing System


The important characteristics of a file processing system are listed below:
It is a group of files storing the data of an organization.
Each file is independent of the others.
Each file is called a flat file.
Each file contains and processes information for one specific function, such as accounting or
inventory.
Files are designed by using programs written in programming languages such as COBOL, C, and
C++.
The physical implementation and access procedures are written into the application programs;
therefore, physical changes result in intensive rework on the part of the programmer.
As systems became more complex, file processing systems offered little flexibility, presented
many limitations, and were difficult to maintain.

Limitations of the File Processing System / File-Based Approach


The following problems are associated with the file-based approach:
1. Separated and Isolated Data: To make a decision, a user might need data from two separate
files. First, analysts and programmers had to evaluate the files to determine the specific data
required from each file and the relationships between the data; only then could applications be
written in a programming language to process and extract the needed data. Imagine the work
involved if data from several files was needed.
2. Duplication of data: Often the same information is stored in more than one file. Uncontrolled
duplication of data is undesirable for several reasons:
Duplication is wasteful. It costs time and money to enter the data more than once.
It takes up additional storage space, again with associated costs.
Duplication can lead to loss of data integrity; in other words, the data is no longer consistent.
For example, consider the duplication of data between the Payroll and Personnel departments. If
a member of staff moves to a new house and the change of address is communicated only to
Personnel and not to Payroll, the person's pay slip will be sent to the wrong address. A more
serious problem occurs if an employee is promoted with an associated increase in salary. Again,
the change is notified to Personnel but does not filter through to Payroll. Now the
employee is receiving the wrong salary. When this error is detected, it will take time and effort to
resolve. Both these examples illustrate inconsistencies that may result from the duplication of
data. As there is no automatic way for Personnel to update the data in the Payroll files, it is
not difficult to foresee such inconsistencies arising. Even if Payroll is notified of the changes, it is
possible that the data will be entered incorrectly.
3. Data Dependence: In file processing systems, files and records were described by specific
physical formats that were coded into the application program by programmers. If the format of a
certain record was changed, the code in every program that used that format had to be updated.
Furthermore, instructions for data storage and access were written into the application's code.
Therefore, changes in storage structure or access methods could greatly affect the processing or
results of an application.
In other words, in the file-based approach application programs are data dependent: they depend
on how the data is physically represented on disk and on the access technique used to reach it.
When either of these changes, the application programs are affected and need modification.
For example, if the physical format of the master/transaction file is changed by modifying the
field or record delimiter, the application programs that depend on it must be modified.
Let us consider a student file, where information about students is stored in a text file and each
field is separated by a blank space, as shown below:
1 Rahat 35 Thapar
Now, suppose the field delimiter changes from a blank space to a semicolon, as shown below:
1; Rahat; 35; Thapar
Then the application programs using this file must be modified, because they must now split the
fields on a semicolon, whereas earlier the delimiter was a blank space.
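The delimiter example can be sketched as a small program. This is a hypothetical illustration, not code from any particular system: the function name and the field names (rollno, name, age, place) are assumptions, since the notes do not label the four fields.

```python
def parse_student(line):
    # The blank-space delimiter is hard-coded into the program.
    # This is what makes the program data dependent.
    rollno, name, age, place = line.split(" ")
    return {"rollno": int(rollno), "name": name, "age": int(age), "place": place}


old_record = "1 Rahat 35 Thapar"
new_record = "1; Rahat; 35; Thapar"

# The program works with the format it was written for.
parsed = parse_student(old_record)

# After the delimiter changes, the unchanged program fails:
# splitting on a blank space leaves "1;" as the first field,
# which cannot be converted to an integer.
try:
    parse_student(new_record)
    broke = False
except ValueError:
    broke = True

print(parsed, broke)
```

Every program that reads the file carries its own copy of this parsing logic, so each one has to be found and changed when the format changes.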
4. Difficulty in representing data from the user's view: To create useful applications for the
user, data from various files must often be combined. In file processing it was difficult to
determine relationships between isolated data in order to meet user requirements.
5. Data Inflexibility: Program-data interdependency and data isolation limited the flexibility of
file processing systems in responding to users' ad hoc information requests.
6. Incompatible file formats: As the structure of files is embedded in the application programs,
the structures are dependent on the application programming language. For example, the
structure of a file generated by a COBOL program may be different from the structure of a file
generated by a C program. The direct incompatibility of such files makes them difficult to
process jointly.
7. Data Security: The security of data is low in a file-based system because data maintained
in flat files is easily accessible. For example, consider a banking system. The Customer
Transaction file has details about the total available balance of all customers. A customer wants
information about his account balance. In a file system it is difficult to give the customer access
to only his own data in the file. Thus, enforcing security constraints for the entire file or for
certain data items is difficult.
8. Transactional Problems: The file-based approach does not satisfy the transaction
properties of Atomicity, Consistency, Isolation, and Durability, commonly known as the
ACID properties.
For example, suppose that, in a banking system, a transaction transfers Rs. 1000 from account A
to account B, with the initial values of A and B being Rs. 5000 and Rs. 10000 respectively. If a
system crash occurred after the withdrawal of Rs. 1000 from account A but before the deposit of
the amount in account B, the system would be left in an inconsistent state. This means that a
transaction should not execute partially but wholly. This concept is known as the atomicity of a
transaction (either 0% or 100% of the transaction is performed). It is difficult to achieve this
property in a file-based system.
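For contrast, the following sketch shows how a DBMS transaction keeps the transfer atomic. It assumes SQLite through Python's sqlite3 module as the DBMS; the account table and the transfer function are hypothetical, and the "crash" is only simulated by raising an exception, which is much milder than a real system failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 5000), ("B", 10000)])
conn.commit()


def transfer(conn, src, dst, amount, crash_midway=False):
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if crash_midway:
                raise RuntimeError("simulated crash after the withdrawal")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        pass  # the partial withdrawal has already been rolled back


# Failed transfer: neither account changes (0% of the transaction).
transfer(conn, "A", "B", 1000, crash_midway=True)
after_crash = dict(conn.execute("SELECT name, balance FROM account"))

# Successful transfer: both accounts change (100% of the transaction).
transfer(conn, "A", "B", 1000)
after_success = dict(conn.execute("SELECT name, balance FROM account"))

print(after_crash, after_success)
```

A program writing directly to flat files would have to implement this undo logic itself.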
9. Concurrency problems: When multiple users access the same piece of data during the same
interval of time, this is called concurrency. When two or more users read the data
simultaneously there is no problem, but when they try to update a file simultaneously, problems
may result.
For example:
Let us consider a scenario in which, in transaction T1, a user transfers an amount of Rs. 1000 from
Account A to Account B (the initial value of A is Rs. 5000 and of B is Rs. 8000). Meanwhile,
another transaction, T2, which displays the sum of accounts A and B, is also executed. If both
transactions run in parallel, the result may be inconsistent, as shown below:

The above schedule leaves the database in an inconsistent state: it shows Rs. 12,000 as the sum
of accounts A and B instead of Rs. 13,000. The problem occurs because the concurrently running
transaction T2 reads A and B at an intermediate point and computes their sum, which yields an
inconsistent value.
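One interleaving that produces these figures can be traced step by step. The sketch below is an assumption about the order of operations, chosen to be consistent with the Rs. 12,000 result quoted above; it simulates the two transactions in a single thread rather than running them concurrently.

```python
accounts = {"A": 5000, "B": 8000}

# Interleaved schedule: T1 transfers 1000 from A to B, T2 sums A and B.
accounts["A"] -= 1000  # T1: debit A, so A becomes 4000
t2_a = accounts["A"]   # T2: reads A after the debit
t2_b = accounts["B"]   # T2: reads B before the credit
accounts["B"] += 1000  # T1: credit B, so B becomes 9000

# What T2 reports, based on its intermediate reads.
interleaved_sum = t2_a + t2_b

# What T2 would report if it ran entirely before or after T1.
serial_sum = accounts["A"] + accounts["B"]

print(interleaved_sum, serial_sum)
```

The transfer itself is correct; only T2's view is wrong, because it observed the accounts halfway through T1.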
10. Poor data modeling of the real world: The file-based system is not able to represent
complex data and interfile relationships, which results in poor data modeling properties.

The Database Approach


In the database approach, a single repository of data is maintained that is defined once and then
is accessed by various users. The following are the main characteristics of the database approach:
Self-describing nature of a database system
A fundamental characteristic of the database approach is that the database system contains not
only the database itself but also a complete definition or description of the database structure and
constraints. This definition is stored in the DBMS catalog, which contains information such as
the structure of each file, the type and storage format of each data item, and various constraints
on the data. The information stored in the catalog is called meta-data, and it describes the
structure of the primary database.
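The catalog can be inspected like any other data. The sketch below assumes SQLite through Python's sqlite3 module, where the catalog is exposed as the sqlite_master table and the PRAGMA table_info statement; other DBMSs expose their catalogs differently. The student table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# sqlite_master is SQLite's catalog table: one row per table, index, or view.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(student)")
]

print(catalog)
print(columns)
```

Here the meta-data (table and column definitions) is retrieved without the program having any built-in knowledge of the table's structure.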
Insulation between programs and data, and data abstraction
The structure of data files is stored in the DBMS catalog separately from the access programs.
We call this property program-data independence. The characteristic that allows program-data
independence and program-operation independence is called data abstraction. A DBMS provides
users with a conceptual representation of data that does not include many of the details of how
the data is stored or how the operations are implemented. Informally, a data model is a type of
data abstraction that is used to provide this conceptual representation. The data model uses
logical concepts, such as objects, their properties, and their interrelationships, that may be easier
for most users to understand than computer storage concepts. Hence, the data model hides
storage and implementation details that are not of interest to most database users.
Support of multiple views of the data
A database typically has many users, each of whom may require a different perspective or view
of the database. A view may be a subset of the database or it may contain virtual data that is
derived from the database files but is not explicitly stored. Some users may not need to be aware
of whether the data they refer to is stored or derived. A multiuser DBMS whose users have a
variety of distinct applications must provide facilities for defining multiple views.
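Both kinds of view, a subset of the stored data and virtual data derived from it, can be sketched as follows. The example assumes SQLite through Python's sqlite3 module; the table, the view names, the sample row, and the fixed reference year 2024 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    " rollno INTEGER PRIMARY KEY, name TEXT, address TEXT, birth_year INTEGER)"
)
conn.execute("INSERT INTO student VALUES (5, 'Rahat', 'Amritsar', 2000)")

# A view that is a subset of the database: it hides address and birth year.
conn.execute("CREATE VIEW library_student AS SELECT rollno, name FROM student")

# A view containing virtual data: age is derived on request, never stored.
# The year 2024 is a fixed assumption to keep the sketch deterministic.
conn.execute(
    "CREATE VIEW student_age AS"
    " SELECT rollno, 2024 - birth_year AS age FROM student"
)

subset = conn.execute("SELECT * FROM library_student").fetchall()
derived = conn.execute("SELECT * FROM student_age").fetchall()

print(subset, derived)
```

A user querying student_age does not need to know whether age is stored or computed.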
Sharing of data and multiuser transaction processing
A multiuser DBMS, as its name implies, must allow multiple users to access the database at the
same time. This is essential if data for multiple applications is to be integrated and maintained in
a single database. The DBMS must include concurrency control software to ensure that several
users trying to update the same data do so in a controlled manner so that the result of the updates
is correct.
Roles in the database environment
Database Administrators
In any organization where many persons use the same resources, there is a need for a chief
administrator to oversee and manage these resources. In a database environment, the primary
resource is the database itself, and the secondary resource is the DBMS and related software.
Administering these resources is the responsibility of the database administrator (DBA). The
DBA is responsible for authorizing access to the database, for coordinating and monitoring its
use, and for acquiring software and hardware resources as needed. The DBA is accountable for
problems such as breach of security or poor system response time. In large organizations, the
DBA is assisted by a staff that helps carry out these functions.
Database Designers
Database designers are responsible for identifying the data to be stored in the database and for
choosing appropriate structures to represent and store this data. These tasks are mostly
undertaken before the database is actually implemented and populated with data. It is the
responsibility of database designers to communicate with all prospective database users in order
to understand their requirements, and to come up with a design that meets these requirements. In
many cases, the designers are on the staff of the DBA and may be assigned other staff
responsibilities after the database design is completed. Database designers typically interact with
each potential group of users and develop views of the database that meet the data and
processing requirements of these groups. Each view is then analyzed and integrated with the
views of other user groups. The final database design must be capable of supporting the
requirements of all user groups.
End Users
End users are the people whose jobs require access to the database for querying, updating, and
generating reports; the database primarily exists for their use. There are several categories of end
users:
Casual end users occasionally access the database, but they may need different information
each time. They use a sophisticated database query language to specify their requests and are
typically middle- or high-level managers or other occasional browsers.
Naive or parametric end users make up a sizable portion of database end users. Their main job
function revolves around constantly querying and updating the database, using standard types of
queries and updates (called canned transactions) that have been carefully programmed and tested.
The tasks that such users perform are varied:
Bank tellers check account balances and post withdrawals and deposits.
Reservation clerks for airlines, hotels, and car rental companies check availability for a given
request and make reservations.
Clerks at receiving stations for courier mail enter package identifications via bar codes and
descriptive information through buttons to update a central database of received and in-transit
packages.

Sophisticated end users include engineers, scientists, business analysts, and others who
thoroughly familiarize themselves with the facilities of the DBMS so as to implement their
applications to meet their complex requirements.
Stand-alone users maintain personal databases by using ready-made program packages that
provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a tax
package that stores a variety of personal financial data for tax purposes.
A typical DBMS provides multiple facilities to access a database. Naive end users need to learn
very little about the facilities provided by the DBMS; they have to understand only the user
interfaces of the standard transactions designed and implemented for their use. Casual users learn
only a few facilities that they may use repeatedly. Sophisticated users try to learn most of the
DBMS facilities in order to achieve their complex requirements. Stand-alone users typically
become very proficient in using a specific software package.
System Analysts and Application Programmers (Software Engineers)
System analysts determine the requirements of end users, especially naive and parametric end
users, and develop specifications for canned transactions that meet these requirements.
Application programmers implement these specifications as programs; then they test, debug,
document, and maintain these canned transactions. Such analysts and programmers (commonly
referred to as software engineers) should be familiar with the full range of capabilities provided
by the DBMS to accomplish their tasks.
In addition to those who design, use, and administer a database, others are associated with the
design, development, and operation of the DBMS software and system environment. These
persons are typically not interested in the database itself. We call them the "workers behind the
scene," and they include the following categories.
DBMS system designers and implementers are persons who design and implement the DBMS
modules and interfaces as a software package. A DBMS is a very complex software system that
consists of many components, or modules, including modules for implementing the catalog,
processing query language, processing the interface, accessing and buffering data, controlling
concurrency, and handling data recovery and security. The DBMS must interface with other
system software, such as the operating system and compilers for various programming
languages.
Tool developers include persons who design and implement tools, that is, the software packages that
facilitate database system design and use and that help improve performance. Tools are optional
packages that are often purchased separately. They include packages for database design,
performance monitoring, natural language or graphical interfaces, prototyping, simulation, and
test data generation. In many cases, independent software vendors develop and market these
tools.
Operators and maintenance personnel are the system administration personnel who are
responsible for the actual running and maintenance of the hardware and software environment for
the database system.
Although these categories of workers behind the scene are instrumental in making the database
system available to end users, they typically do not use the database for their own purposes.

Advantages and disadvantages of using databases


1. Controlling Redundancy: In a file system, each application has its own private files, which
cannot be shared between multiple applications. This can often lead to considerable
redundancy in the stored data, which wastes storage space. With a centralized database,
most of this can be avoided. Not all redundancy can or should be eliminated; sometimes
there are sound business and technical reasons for maintaining multiple copies of the
same data. In a database system, however, this redundancy can be controlled.
For example: In the case of a college database, there may be a number of applications, such as
General Office, Library, Account Office, and Hostel. Each of these applications may maintain the
following information in its own private files:

It is clear from the above file systems that some common data about the student, such as Rollno,
Name, Class, Phone_No, and Address, has to be recorded in each application. This causes
redundancy, which wastes storage space and makes the data difficult to
maintain. With a centralized database, however, data can be shared by a number of applications,
and the whole college can maintain its computerized data with the following database:

It is clear from the above database that Rollno, Name, Class, Father_Name, Address, Phone_No,
and Date_of_birth, which are stored repeatedly in each application of the file system, need not be
stored repeatedly in the database, because every other application can access this information by
joining relations on the common column, Rollno. Suppose a user of the Library system needs the
Name and Address of a particular student: by joining the Library and General Office relations on
the column Rollno, he or she can easily retrieve this information.
Thus, we can say that a centralized DBMS reduces the redundancy of data to a great extent
but cannot eliminate it, because Rollno is still repeated in all the relations.
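The join described above can be sketched as follows, assuming SQLite through Python's sqlite3 module. The column lists are reduced for brevity, and the book_title column and the sample rows are hypothetical, since the notes do not list the Library columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Name and address are stored once, in the General Office relation.
conn.execute(
    "CREATE TABLE general_office ("
    " rollno INTEGER PRIMARY KEY, name TEXT, class TEXT, address TEXT)"
)
# The Library relation repeats only the common column, rollno.
conn.execute(
    "CREATE TABLE library ("
    " rollno INTEGER REFERENCES general_office(rollno), book_title TEXT)"
)
conn.execute("INSERT INTO general_office VALUES (5, 'Rahat', 'BTech', 'Amritsar')")
conn.execute("INSERT INTO library VALUES (5, 'Database Systems')")

# A Library user retrieves the student's name and address through a join.
rows = conn.execute(
    "SELECT l.book_title, g.name, g.address"
    " FROM library AS l JOIN general_office AS g ON l.rollno = g.rollno"
    " WHERE l.rollno = ?",
    (5,),
).fetchall()

print(rows)
```

Only rollno appears in both relations, which is the controlled redundancy the text refers to.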
2. Integrity can be enforced: Integrity of data means that the data in the database is always
accurate, such that incorrect information cannot be stored in it. In order to maintain the integrity
of data, integrity constraints are enforced on the database. A DBMS should provide
capabilities for defining and enforcing these constraints.
For example: Let us consider the college database, and suppose that the college has only
BTech, MTech, MSc, BCA, BBA, and BCOM classes. If a user enters the class MCA,
this incorrect information must not be stored in the database, and the user must be told that this
is an invalid data entry. To enforce this, an integrity constraint must be applied to the class
attribute of the student entity. In a file system, this constraint must be enforced in every
application separately (because all applications have a class field).
In a DBMS, this integrity constraint is applied only once, on the class field of the General
Office table (because the class field appears only once in the whole database), and all other
applications get the class information about the student from the General Office table, so the
integrity constraint applies to the whole database. We can conclude that integrity constraints can
be enforced more easily in a centralized DBMS than in a file system.
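One common way to express such a constraint is a CHECK clause on the column. The sketch below assumes SQLite through Python's sqlite3 module; the table layout and the sample students are hypothetical, while the list of valid classes is the one given in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The constraint is declared once, on the class column of the table.
conn.execute(
    "CREATE TABLE general_office ("
    " rollno INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " class TEXT NOT NULL"
    " CHECK (class IN ('BTech', 'MTech', 'MSc', 'BCA', 'BBA', 'BCOM')))"
)

# A valid class is accepted.
conn.execute("INSERT INTO general_office VALUES (1, 'Rahat', 'BTech')")

# An invalid class (MCA) is rejected by the DBMS itself,
# whichever application attempts the insert.
try:
    conn.execute("INSERT INTO general_office VALUES (2, 'Asha', 'MCA')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT rollno, class FROM general_office").fetchall()
print(rejected, stored)
```

The application only has to report the error to the user; it does not need its own copy of the list of valid classes.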
3. Inconsistency can be avoided: When the same data is duplicated and a change made at
one site is not propagated to the other site, the two entries for the same data no longer
agree. At such times the data is said to be inconsistent.
So, if the redundancy is removed, the chance of having inconsistent data is also removed.
Let us again consider the college system, and suppose that the General_Office file indicates
that Roll_Number 5 lives in Amritsar but the Library file indicates that
Roll_Number 5 lives in Jalandhar. In this state the two entries for the same
object do not agree with each other (that is, one has been updated and the other has not). At such
a time the database is said to be inconsistent.
An inconsistent database is capable of supplying incorrect or conflicting information, so there
should be no inconsistency in the database. Inconsistency can be avoided far more easily
in a centralized system than in a file system.
Let us consider again the example of the college system, and suppose that Roll Number 5 moves
from Amritsar to Jalandhar. The address information of Roll Number 5 must then be updated
wherever roll number and address occur together in the system. In a file system, the information
must be updated separately in each application; if we make the update in only three places and
forget to make it in a fourth application, the system as a whole shows inconsistent results for
Roll Number 5.
In a DBMS, roll number and address occur together only once, in the General_Office
table. So a single update is needed, and every other application then retrieves the address
information from General_Office, which has been updated. All applications therefore get the
current information from a single update operation, which is in effect propagated
to the whole database and all other applications automatically. This property is called
Propagation of Update.
We can say that the redundancy of data greatly affects its consistency. If redundancy is low, it
is easy to maintain consistency of data. Thus, a DBMS can avoid inconsistency to a great
extent.
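Propagation of update can be sketched with the same college example. The code assumes SQLite through Python's sqlite3 module; the reduced table layouts, the book_title and room_no columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE general_office (rollno INTEGER PRIMARY KEY, address TEXT)")
conn.execute("CREATE TABLE library (rollno INTEGER, book_title TEXT)")
conn.execute("CREATE TABLE hostel (rollno INTEGER, room_no TEXT)")
conn.execute("INSERT INTO general_office VALUES (5, 'Amritsar')")
conn.execute("INSERT INTO library VALUES (5, 'Database Systems')")
conn.execute("INSERT INTO hostel VALUES (5, 'H-12')")

# Each application looks the address up in General_Office instead of
# keeping its own copy.
library_query = (
    "SELECT g.address FROM library AS l"
    " JOIN general_office AS g ON l.rollno = g.rollno WHERE l.rollno = 5"
)
hostel_query = (
    "SELECT g.address FROM hostel AS h"
    " JOIN general_office AS g ON h.rollno = g.rollno WHERE h.rollno = 5"
)

# A single update, made in one place only.
conn.execute("UPDATE general_office SET address = 'Jalandhar' WHERE rollno = 5")

# Both applications now see the new address without being changed.
library_address = conn.execute(library_query).fetchone()[0]
hostel_address = conn.execute(hostel_query).fetchone()[0]

print(library_address, hostel_address)
```

Nothing is actually copied to the other applications; they see the new value because there is only one stored copy to read.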
4. Data can be shared: As explained earlier, data such as Name, Class, and Father_Name in
General_Office is shared by multiple applications in a centralized DBMS, unlike in a file
system, so new applications can be developed to operate against the same stored data. These
applications may be developed without having to create any new stored files.

5. Standards can be enforced: Since a DBMS is a central system, standards can be enforced
easily, whether at company, department, national, or international level.
Standardized data is very helpful during migration or interchange of data. A file system is made
up of independent applications, so standards cannot easily be enforced across them.
6. Restricting unauthorized access: When multiple users share a database, it is likely that some
users will not be authorized to access all information in the database. For example, account office
data is often considered confidential, and hence only authorized persons are allowed to access
such data. In addition, some users may be permitted only to retrieve data, whereas others are
allowed both to retrieve and to update. Hence, the type of access operation (retrieval or update)
must also be controlled. Typically, users or user groups are given account numbers protected by
passwords, which they can use to gain access to the database. A DBMS should provide a security
and authorization subsystem, which the DBA uses to create accounts and to specify account
restrictions. The DBMS should then enforce these restrictions automatically.
7. Solving Enterprise Requirements Rather than Individual Requirements: Since many types of
users with varying levels of technical knowledge use a database, a DBMS should provide a variety
of user interfaces. The overall requirements of the enterprise are more important than the
requirements of individual users. So, the DBA can structure the database system to provide an
overall service that is "best for the enterprise".
For example: A representation can be chosen for the data in storage that gives fast access for the
most important application at the cost of poorer performance in some other application. A
file system, by contrast, favors individual requirements over enterprise requirements.
8. Providing Backup and Recovery: A DBMS must provide facilities for recovering from
hardware or software failures. The backup and recovery subsystem of the DBMS is responsible
for recovery. For example, if the computer system fails in the middle of a complex update
program, the recovery subsystem is responsible for making sure that the database is restored to
the state it was in before the program started executing.
9. Cost of developing and maintaining the system is lower: It is much easier to respond to
unanticipated requests when data is centralized in a database than when it is stored in a
conventional file system. Although the initial cost of setting up a database can be large, the
cost of developing and maintaining application programs can be far lower than for a similar
service using conventional systems. The productivity of programmers can be higher when using
the nonprocedural languages that have been developed with DBMSs than when using procedural
languages.
10. Data Model can be developed: The centralized system is able to represent complex data
and interfile relationships, which results in better data modeling properties. The data modeling
properties of the relational model are based on entities and their relationships, which are
discussed in detail in chapter 4 of the book.
11. Concurrency Control: A DBMS provides mechanisms that give multiple users concurrent
access to data.

Disadvantages of DBMS
The disadvantages of the database approach are summarized as follows:
1. Complexity: The provision of the functionality that is expected of a good DBMS makes the
DBMS an extremely complex piece of software. Database designers, developers, database
administrators and end-users must understand this functionality to take full advantage of it.
Failure to understand the system can lead to bad design decisions, which can have serious
consequences for an organization.
2. Size: The complexity and breadth of functionality makes the DBMS an extremely large piece
of software, occupying many megabytes of disk space and requiring substantial amounts
of memory to run efficiently.
3. Performance: Typically, a File Based system is written for a specific application, such as
invoicing. As a result, performance is generally very good. However, the DBMS is written to be
more general, to cater for many applications rather than just one. The effect is that some
applications may not run as fast as they used to.
4. Higher impact of a failure: The centralization of resources increases the vulnerability of the
system. Since all users and applications rely on the availability of the DBMS, the failure of any
component can bring operations to a halt.
5. Cost of DBMS: The cost of DBMS varies significantly, depending on the environment and
functionality provided. There is also the recurrent annual maintenance cost.
6. Additional Hardware costs: The disk storage requirements for the DBMS and the database
may necessitate the purchase of additional storage space. Furthermore, to achieve the required
performance it may be necessary to purchase a larger machine, perhaps even a machine
dedicated to running the DBMS. The procurement of additional hardware results in further
expenditure.
7. Cost of Conversion: In some situations, the cost of the DBMS and extra hardware may be
insignificant compared with the cost of converting existing applications to run on the new DBMS
and hardware. This cost also includes the cost of training staff to use these new systems and
possibly the employment of specialist staff to help with conversion and running of the system.
This cost is one of the main reasons why some organizations feel tied to their current systems
and cannot switch to modern database technology.

Database Architecture
DBMSs do not all conform to the same architecture.

The three-level architecture forms the basis of modern database architectures.

This is in agreement with the ANSI/SPARC study group on Database Management Systems.

(ANSI/SPARC is the American National Standards Institute/Standards Planning and
Requirements Committee.)

The architecture for DBMSs is divided into three general levels:

external

conceptual

internal

Three level database architecture

Figure 1: Three level architecture


1. the external level : concerned with the way individual users see the data
2. the conceptual level : can be regarded as a community user view, that is, a formal description
of the data of interest to the organization, independent of any storage considerations.
3. the internal level : concerned with the way in which the data is actually stored

Figure 2 : How the three level architecture works


External View
A user is anyone who needs to access some portion of the data. Users may range from application
programmers to casual users with ad hoc queries. Each user has a language at his/her disposal.
The application programmer may use a high-level language (e.g. COBOL) while the casual user
will probably use a query language.
Regardless of the language used, it will include a data sublanguage (DSL), which is that subset of
the language concerned with storage and retrieval of information in the database. The DSL
may or may not be apparent to the user.
A DSL is a combination of two languages:

a data definition language (DDL) - provides for the definition or description of database
objects

a data manipulation language (DML) - supports the manipulation or processing of
database objects.

Each user sees the data in terms of an external view. This is defined by an external schema,
consisting basically of descriptions of each of the various types of external record in that
external view, together with a definition of the mapping between the external schema and the
underlying conceptual schema.

Conceptual View

An abstract representation of the entire information content of the database.

It is in general a view of the data as it actually is, that is, it is a `model' of the `real world'.

It consists of multiple occurrences of multiple types of conceptual record, defined in the
conceptual schema.

To achieve data independence, the definitions of conceptual records must involve
information content only:

storage structure is ignored

access strategy is ignored

In addition to definitions, the conceptual schema contains authorization and validation
procedures.

Internal View
The internal view is a low-level representation of the entire database consisting of multiple
occurrences of multiple types of internal (stored) records.
It is however at one remove from the physical level, since it does not deal in terms of physical
records or blocks, nor with any device-specific constraints such as cylinder or track sizes. Details
of the mapping to physical storage are highly implementation specific and are not expressed in the
three-level architecture.
The internal view is described by the internal schema, which defines:

the various types of stored record

what indices exist

how stored fields are represented

what physical sequence the stored records are in

In effect the internal schema is the storage structure definition.


Mappings

The conceptual/internal mapping:

defines conceptual and internal view correspondence

specifies mapping from conceptual records to their stored counterparts

An external/conceptual mapping:

defines a particular external and conceptual view correspondence

A change to the storage structure definition means that the conceptual/internal mapping
must be changed accordingly, so that the conceptual schema may remain invariant,
achieving physical data independence.

A change to the conceptual definition means that the external/conceptual mapping must
be changed accordingly, so that the external schema may remain invariant, achieving
logical data independence.
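
The two kinds of data independence can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the staff table, the staff_names view, the index name and the sample rows are all hypothetical, a SQL view stands in for an external schema, and adding an index stands in for a change to the storage structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: a hypothetical staff relation with made-up rows.
conn.execute("CREATE TABLE staff (emp_no INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                 [(10, "Gordon", 2.0), (11, "Andrew", 3.0)])

# External level: one user's view, defined by a mapping onto the conceptual schema.
conn.execute("CREATE VIEW staff_names AS SELECT emp_no, name FROM staff")

def external_rows():
    """Return what the external user sees, in a fixed order."""
    return conn.execute(
        "SELECT emp_no, name FROM staff_names ORDER BY emp_no").fetchall()

before = external_rows()

# Change at the internal level: add an index (a storage structure).
conn.execute("CREATE INDEX idx_staff_name ON staff (name)")
after_storage_change = external_rows()

# Change at the conceptual level: add a new attribute to the relation.
conn.execute("ALTER TABLE staff ADD COLUMN dob TEXT")
after_conceptual_change = external_rows()

# The external user's result is the same before and after both changes.
print(before == after_storage_change == after_conceptual_change)
```

In this sketch neither change forces the external user's query to be rewritten, which is the point of keeping the mappings separate from the schemas.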

Database languages
Once the design of a database is completed and a DBMS is chosen to implement the database,
the first order of the day is to specify conceptual and internal schemas for the database and any
mappings between the two. In many DBMSs where no strict separation of levels is maintained,
one language, called the data definition language (DDL), is used by the DBA and by database
designers to define both schemas. The DBMS will have a DDL compiler whose function is to
process DDL statements in order to identify descriptions of the schema constructs and to store
the schema description in the DBMS catalog. In DBMSs where a clear separation is maintained
between the conceptual and internal levels, the DDL is used to specify the conceptual schema
only. Another language, the storage definition language (SDL), is used to specify the internal
schema. The mappings between the two schemas may be specified in either one of these
languages. For a true three-schema architecture, we would need a third language, the view
definition language (VDL), to specify user views and their mappings to the conceptual schema,
but in most DBMSs the DDL is used to define both conceptual and external schemas.
Once the database schemas are compiled and the database is populated with data, users must have
some means to manipulate the database. Typical manipulations include retrieval, insertion,
deletion, and modification of the data. The DBMS provides a set of operations or a language
called the data manipulation language (DML) for these purposes.
In current DBMSs, the preceding types of languages are usually not considered distinct languages;
rather, a comprehensive integrated language is used that includes constructs for conceptual
schema definition, view definition and data manipulation. Storage definition is typically kept
separate, since it is used for defining physical storage structures to fine tune the performance
of the database system, which is usually done by the DBA staff. A typical example of a
comprehensive database language is the SQL relational database language, which represents a
combination of DDL, VDL, and DML, as well as statements for constraint specification, schema
evolution, and other features. The SDL was a component in early versions of SQL but has been
removed from the language to keep it at the conceptual and external levels only.
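
As a rough illustration of how one integrated language covers several roles, the sketch below runs a few SQL statements through Python's built-in sqlite3 module and labels each with the role it plays. The project table, the project_titles view and the data are hypothetical, and the labels follow the categories described above rather than any official SQL classification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

statements = [
    # (role, SQL text)
    ("DDL", "CREATE TABLE project (proj_no INTEGER PRIMARY KEY, title TEXT NOT NULL)"),
    ("VDL", "CREATE VIEW project_titles AS SELECT title FROM project"),
    ("DML", "INSERT INTO project VALUES (1, 'Payroll')"),
    ("DML", "UPDATE project SET title = 'Payroll v2' WHERE proj_no = 1"),
]
for role, sql in statements:
    conn.execute(sql)

# Retrieval is also a DML operation; here it goes through the user view.
titles = [row[0] for row in conn.execute("SELECT title FROM project_titles")]
print(titles)
```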

Categories of Data Models


Many data models have been proposed, which we can categorize according to the types of
concepts they use to describe the database structure. High-level or conceptual data models
provide concepts that are close to the way many users perceive data, whereas low-level or
physical data models provide concepts that describe the details of how data is stored in the
computer. Concepts provided by low-level data models are generally meant for computer
specialists, not for typical end users. Between these two extremes is a class of representational
(or implementation) data models, which provide concepts that may be understood by end users
but that are not too far removed from the way data is organized within the computer.
Representational data models hide some details of data storage but can be implemented on a
computer system in a direct way. Conceptual data models use concepts such as entities,
attributes, and relationships. An entity represents a real-world object or concept, such as an
employee or a project, that is described in the database. An attribute represents some property of
interest that further describes an entity, such as the employee's name or salary. A relationship
among two or more entities represents an association among them, for example, a
works-on relationship between an employee and a project. Representational or implementation
data models are the models used most frequently in traditional commercial DBMSs. These
include the widely used relational data model, as well as the so-called legacy data models (the
network and hierarchical models) that have been widely used in the past. Representational data
models represent data by using record structures and hence are sometimes called record-based
data models. We can regard object data models as a new family of higher-level implementation
data models that are closer to conceptual data models. Object data models are also frequently
utilized as high-level conceptual models, particularly in the software engineering domain.
Physical data models describe how data is stored as files in the computer by representing
information such as record formats, record orderings, and access paths. An access path is a
structure that makes the search for particular database records efficient.

Conceptual modelling
The Conceptual Design phase takes the high-level data model and converts it into a conceptual
schema, which is specific to a particular DBMS class (e.g. relational). For a relational system,
such as Oracle, an appropriate conceptual schema would be a set of relations.

Finally, in the Physical Design phase the conceptual schema is converted into database internal
structures. This is specific to a particular DBMS product.
Basics
Entity Relationship (ER) modelling

is a design tool

is a graphical representation of the database system

provides a high-level conceptual data model

supports the user's perception of the data

is DBMS and hardware independent

has many variants

is composed of entities, attributes, and relationships

Entities

An entity is any object in the system that we want to model and store information about

Individual objects are called entities

Groups of the same type of objects are called entity types or entity sets

Entities are represented by rectangles (either with round or square corners)

Figure: Entities

There are two types of entities: weak and strong entity types.

Attribute

All the data relating to an entity is held in its attributes.

An attribute is a property of an entity.

Each attribute can have any value from its domain.


Each entity within an entity type:

may have any number of attributes

can have different attribute values from those of any other entity

has the same number of attributes

Attributes can be

simple or composite

single-valued or multi-valued

Attributes can be shown on ER models

They appear inside ovals and are attached to their entity.

Note that entity types can have a large number of attributes... If all are shown then the
diagrams would be confusing. Only show an attribute if it adds information to the ER
diagram, or clarifies a point.

Figure : Attributes
Keys

A key is a data item that allows us to uniquely identify individual occurrences of an entity
type.

A candidate key is an attribute or set of attributes that uniquely identifies individual
occurrences of an entity type.

An entity type may have one or more possible candidate keys; the one which is selected
is known as the primary key.

A composite key is a candidate key that consists of two or more attributes

The name of each primary key attribute is underlined.
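
A minimal sketch of testing whether a set of attributes could serve as a key, assuming some sample occurrences are available. The is_unique helper and the driver data are hypothetical. Sample data can only rule a key out, never prove it, and a true candidate key must also be minimal, which this sketch does not check.

```python
def is_unique(rows, attrs):
    """Return True if no two rows share the same values for attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

# Hypothetical driver occurrences.
drivers = [
    {"emp_no": 1, "name": "Ann", "tel_no": "555-0101"},
    {"emp_no": 2, "name": "Ann", "tel_no": "555-0102"},
    {"emp_no": 3, "name": "Bob", "tel_no": "555-0102"},
]

emp_no_unique = is_unique(drivers, ["emp_no"])             # single attribute
name_unique = is_unique(drivers, ["name"])                 # two drivers share a name
composite_unique = is_unique(drivers, ["name", "tel_no"])  # composite of two attributes
print(emp_no_unique, name_unique, composite_unique)
```

Here emp_no alone identifies every occurrence, name alone does not, and the pair (name, tel_no) happens to be unique in this sample, so it would be a possible composite key only if the requirements guarantee it.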

Relationships

A relationship type is a meaningful association between entity types

A relationship is an association of entities where the association includes one entity from
each participating entity type.

Relationship types are represented on the ER diagram by a series of lines.

As always, there are many notations in use today...

In the original Chen notation, the relationship is placed inside a diamond, e.g. managers
manage employees:

Figure : Chen's notation for relationships

For this module, we will use an alternative notation, where the relationship is a label on
the line. The meaning is identical.

Figure : Relationships used in this document


Degree of a Relationship

The number of participating entities in a relationship is known as the degree of the
relationship.

If there are two entity types involved it is a binary relationship type

Figure : Binary Relationships

If there are three entity types involved it is a ternary relationship type

Figure : Ternary relationship

It is possible to have an n-ary relationship (e.g. quaternary or unary).

A unary relationship is also known as a recursive relationship.


Figure : Recursive relationship

It is a relationship where the same entity participates more than once in different roles.

In the example above we are saying that employees are managed by employees.

If we wanted more information about who manages whom, we could introduce a second
entity type called manager.

Degree of a Relationship

It is also possible to have entities associated through two or more distinct relationships.

Figure : Multiple relationships

In the representation we use it is not possible to have attributes as part of a relationship.


To support this other entity types need to be developed.

Replacing ternary relationships


When ternary relationships occur in an ER model they should always be removed before
finishing the model. Sometimes the relationships can be replaced by a series of binary
relationships that link pairs of the original ternary relationship.

Figure : A ternary relationship example

This can result in the loss of some information: it is no longer clear which sales assistant
sold a customer a particular product.

Try replacing the ternary relationship with an entity type and a set of binary relationships.

Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as
a noun.

The relationship sells can become the entity type sale.

Figure : Replacing a ternary relationship

So a sales assistant can be linked to a specific customer and both of them to the sale of a
particular product.

This process also works for higher order relationships.

Cardinality

Relationships are rarely one-to-one

For example, a manager usually manages more than one employee

This is described by the cardinality of the relationship, for which there are four possible
categories.

One to one (1:1) relationship

One to many (1:m) relationship

Many to one (m:1) relationship

Many to many (m:n) relationship

On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's
foot" end represents many.

A one to one relationship - a man can only marry one woman, and a woman can only
marry one man, so it is a one to one (1:1) relationship

Figure : One to One relationship example

A one to many relationship - one manager manages many employees, but each employee
only has one manager, so it is a one to many (1:m) relationship

Figure : One to Many relationship example

A many to one relationship - many students study one course. They do not study more
than one course, so it is a many to one (m:1) relationship

Figure : Many to One relationship example

A many to many relationship - One lecturer teaches many students and a student is taught
by many lecturers, so it is a many to many (m:n) relationship

Figure : Many to Many relationship example


Optionality
A relationship can be optional or mandatory.

If the relationship is mandatory

an entity at one end of the relationship must be related to an entity at the other end.

The optionality can be different at each end of the relationship

For example, a student must be on a course, so the relationship `student studies course'
is mandatory.

But a course can exist before any students have enrolled. Thus the relationship `course
is_studied_by student' is optional.

To show optionality, put a circle or `O' at the `optional end' of the relationship.

As the optional relationship is `course is_studied_by student', and the optional part of this
is the student, then the `O' goes at the student end of the relationship connection.

Figure : Optionality example

It is important to know the optionality because you must ensure that whenever you create
a new entity it has the required mandatory links.

Entity Sets
Sometimes it is useful to try out various examples of entities from an ER model. One reason for
this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set
diagram' to show entity examples graphically. Consider the example of `course is_studied_by
student'.

Figure : Entity set example

Confirming Correctness

Figure : Entity set confirming errors

Use the diagram to show all possible relationship scenarios.

Go back to the requirements specification and check to see if they are allowed.

If not, then put a cross through the forbidden relationships

This allows you to show the cardinality and optionality of the relationship

Deriving the relationship parameters


To check we have the correct parameters (sometimes also known as the degree) of a relationship,
ask two questions:
1. One course is studied by how many students? Answer = `zero or more'.

This gives us the degree at the `student' end.


The answer `zero or more' needs to be split into two parts.

The `more' part means that the cardinality is `many'.

The `zero' part means that the relationship is `optional'.

If the answer was `one or more', then the relationship would be `mandatory'.

2. One student studies how many courses? Answer = `One'


This gives us the degree at the `course' end of the relationship.

The answer `one' means that the cardinality of this relationship is 1, and is
`mandatory'

If the answer had been `zero or one', then the cardinality of the relationship would
have been 1, and be `optional'.
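
The splitting of an answer into cardinality and optionality can be sketched as a small function. The name relationship_end and the exact answer phrasings it accepts are assumptions made for illustration.

```python
def relationship_end(answer):
    """Split an answer such as 'zero or more' into (cardinality, optionality)."""
    minimum, _, maximum = answer.lower().partition(" or ")
    if not maximum:              # a single word such as 'one'
        maximum = minimum
    cardinality = "many" if maximum == "more" else "1"
    optionality = "optional" if minimum == "zero" else "mandatory"
    return cardinality, optionality

# Question 1: one course is studied by how many students?
student_end = relationship_end("zero or more")
# Question 2: one student studies how many courses?
course_end = relationship_end("One")
print(student_end, course_end)
```

With these two answers the student end comes out as many and optional, and the course end as 1 and mandatory, matching the reasoning above.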

Redundant relationships
Some ER diagrams end up with a relationship loop.

Check to see if it is possible to break the loop without losing information.

Given three entities A, B, C, where there are relations A-B, B-C, and C-A, check if it is
possible to navigate between A and C via B. If it is possible, then A-C was a redundant
relationship.

Always check carefully for ways to simplify your ER diagram. It makes it easier to read
the remaining information.

Redundant relationships example

Consider entities `customer' (customer details), `address' (the address of a customer) and
`distance' (distance from the company to the customer address).

Figure : Redundant relationship


Splitting n:m Relationships
A many to many relationship in an ER model is not necessarily incorrect. It can be replaced
using an intermediate entity. This should only be done where:

the m:n relationship hides an entity


the resulting ER diagram is easier to understand.

Splitting n:m Relationships - Example


Consider the case of a car hire company. Customers hire cars: one customer hires many cars and
a car is hired by many customers.

Figure : Many to Many example


The many to many relationship can be broken down to reveal a `hire' entity, which contains an
attribute `date of hire'.

Figure : Splitting the Many to Many example
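
A minimal sketch of the split, using Python's built-in sqlite3 module. Only the `hire' entity and its `date of hire' attribute come from the example; the other column names, the choice of primary key for hire, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE car      (reg_no  TEXT PRIMARY KEY, make TEXT);
-- The intermediate entity: one row per hire event.
CREATE TABLE hire (
    cust_no      INTEGER REFERENCES customer(cust_no),
    reg_no       TEXT    REFERENCES car(reg_no),
    date_of_hire TEXT,
    PRIMARY KEY (cust_no, reg_no, date_of_hire)
);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO car VALUES ('ABC123', 'Ford'), ('XYZ789', 'Fiat');
INSERT INTO hire VALUES (1, 'ABC123', '2001-01-01'),
                        (1, 'XYZ789', '2001-02-01'),
                        (2, 'ABC123', '2001-03-01');
""")

# One customer hires many cars ...
cars_hired_by_ann = [r[0] for r in conn.execute(
    "SELECT reg_no FROM hire WHERE cust_no = 1 ORDER BY reg_no")]
# ... and one car is hired by many customers.
customers_of_abc123 = [r[0] for r in conn.execute(
    "SELECT cust_no FROM hire WHERE reg_no = 'ABC123' ORDER BY cust_no")]
print(cars_hired_by_ann, customers_of_abc123)
```

Each side of the original m:n relationship is now a 1:m relationship into hire, and the date of hire has a natural home.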


Constructing an ER model
Before beginning to draw the ER model, read the requirements specification carefully. Document
any assumptions you need to make.
1. Identify entities - list all potential entity types. These are the objects of interest in the
system. It is better to put too many entities in at this stage and then discard them later if
necessary.
2. Remove duplicate entities - Check whether they are really separate entity types or just two
names for the same thing.

Also do not include the system as an entity type

e.g. if modelling a library, the entity types might be books, borrowers, etc.

The library is the system, thus should not be an entity type.

3. List the attributes of each entity (all properties to describe the entity which are relevant to
the application).

Ensure that the entity types are really needed.

are any of them just attributes of another entity type?

if so keep them as attributes and cross them off the entity list.

Do not have attributes of one entity as attributes of another entity!

4. Mark the primary keys.

Which attributes uniquely identify instances of that entity type?

This may not be possible for some weak entities.

5. Define the relationships


Examine each entity type to see its relationship to the others.

6. Describe the cardinality and optionality of the relationships


Examine the constraints between participating entities.

7. Remove redundant relationships


Examine the ER model for redundant relationships.

ER modelling is an iterative process, so draw several versions, refining each one until you are
happy with it. Note that there is no one right answer to the problem, but some solutions are better
than others!
Entity Relationship Modelling - 2
Country Bus Company
A Country Bus Company owns a number of buses. Each bus is allocated to a particular route,
although some routes may have several buses. Each route passes through a number of towns.
One or more drivers are allocated to each stage of a route, which corresponds to a journey
through some or all of the towns on a route. Some of the towns have a garage where buses are
kept. Each bus is identified by its registration number and can carry a different number of
passengers, since the vehicles vary in size and can be single or double-decked. Each
route is identified by a route number and information is available on the average number of
passengers carried per day for each route. Drivers have an employee number, name, address, and
sometimes a telephone number.
Entities

Bus - Company owns buses and will hold information about them.

Route - Buses travel on routes, which will need to be described.

Town - Buses pass through towns, so we need to know about them.

Driver - Company employs drivers; personnel will hold their data.

Stage - Routes are made up of stages.

Garage - Garages house buses, and we need to know where they are.

Relationships

A bus is allocated to a route and a route may have several buses.

Bus-route (m:1) is serviced by

A route comprises one or more stages.

route-stage (1:m) comprises

One or more drivers are allocated to each stage.

driver-stage (m:1) is allocated

A stage passes through some or all of the towns on a route.

stage-town (m:n) passes-through

A route passes through some or all of the towns

route-town (m:n) passes-through

Some of the towns have a garage

garage-town (1:1) is situated

A garage keeps buses and each bus has one `home' garage

garage-bus (m:1) is garaged

Draw E-R Diagram

Figure : Bus Company


Attributes

Bus (reg-no,make,size,deck,no-pass)

Route (route-no,avg-pass)

Driver (emp-no,name,address,tel-no)

Town (name)

Stage (stage-no)

Garage (name,address)

Problems with ER Models


There are several problems that may arise when designing a conceptual data model. These are
known as connection traps.
There are two main types of connection traps:
1. fan traps
2. chasm traps
Fan traps
A fan trap occurs when a model represents a relationship between entity types, but the pathway
between certain entity occurrences is ambiguous. It occurs when 1:m relationships fan out from a
single entity.

Figure : Fan Trap


A single site contains many departments and employs many staff. However, which staff work in
a particular department?
The fan trap is resolved by restructuring the original ER model to represent the correct
association.

Figure : Resolved Fan Trap


Chasm traps
A chasm trap occurs when a model suggests the existence of a relationship between entity types,
but the pathway does not exist between certain entity occurrences.
It occurs where there is a relationship with partial participation, which forms part of the pathway
between entities that are related.

Figure : Chasm Trap

A single branch is allocated many staff who oversee the management of properties for
rent. Not all staff oversee property and not all property is managed by a member of staff.

What properties are available at a branch?

The partial participation of Staff and Property in the oversees relation means that some
properties cannot be associated with a branch office through a member of staff.

We need to add the missing relationship which is called `has' between the Branch and the
Property entities.

You need to therefore be careful when you remove relationships which you consider to be
redundant.

Figure : Resolved Chasm Trap
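
The chasm trap can be illustrated with a few hypothetical occurrences held in plain Python dictionaries. The identifiers B1, S1, S2, P1 and P2 are invented for this sketch.

```python
# Hypothetical occurrences: branch -> staff, staff -> properties overseen.
branch_staff = {"B1": ["S1", "S2"]}
staff_property = {"S1": ["P1"], "S2": []}   # S2 oversees no property
all_properties = {"P1", "P2"}               # P2 is not overseen by any staff

def properties_via_staff(branch):
    """Properties reachable from a branch by going through its staff."""
    return {p
            for s in branch_staff.get(branch, [])
            for p in staff_property.get(s, [])}

reachable = properties_via_staff("B1")
lost = all_properties - reachable           # the chasm: P2 cannot be placed

# Resolution: record the direct `has' relationship between Branch and Property.
branch_has_property = {"B1": {"P1", "P2"}}
resolved = branch_has_property["B1"]
print(reachable, lost, resolved)
```

Going through staff alone, P2 cannot be tied to any branch; with the direct `has' relationship every property can be found from its branch.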


Enhanced ER Models (EER)
The basic concepts of ER modelling are not powerful enough for some complex applications. We
require some additional semantic modelling concepts:

Specialisation

Generalisation

Categorisation

Aggregation

First we need some new entity constructs.

Superclass - an entity type that includes distinct subclasses that require to be represented
in a data model.

Subclass - an entity type that has a distinct role and is also a member of a superclass.

Figure : Superclass and subclasses


Subclasses need not be mutually exclusive; a member of staff may be a manager and a sales
person.
The purpose of introducing superclasses and subclasses is to avoid describing types of staff with
possibly different attributes within a single entity. This could waste space and you might want to
make some attributes mandatory for some types of staff but other staff would not need these
attributes at all.
Specialisation
This is the process of maximising the differences between members of an entity by identifying
their distinguishing characteristics.

Staff(staff_no,name,address,dob)

Manager(bonus)

Secretary(wp_skills)

Sales_personnel(sales_area, car_allowance)

Figure : Specialisation in action


Here we have shown that the manages relationship is only applicable to the Manager
subclass, whereas the works_for relationship is applicable to all staff.

It is possible to have subclasses of subclasses.

Generalisation
Generalisation is the process of minimising the differences between entities by identifying
common features.
This is the identification of a generalised superclass from the original subclasses. This is the
process of identifying the common attributes and relationships.
For instance, taking:
car(regno,colour,make,model,numSeats)
motorbike(regno,colour,make,model,hasWindshield)
And forming:
vehicle(regno,colour,make,model,numSeats,hasWindshield)
In this case vehicle has numSeats which would be NULL if the vehicle was a motorbike, and has
hasWindshield which would be NULL if it was a car.
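
The generalised vehicle can be sketched as a Python dataclass in which the subclass-specific attributes are optional. None plays the role of NULL here, the attribute names are adapted to Python naming style, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Vehicle:
    # Attributes common to car and motorbike.
    regno: str
    colour: str
    make: str
    model: str
    # Attributes that only apply to one of the original entity types.
    num_seats: Optional[int] = None          # None (NULL) for a motorbike
    has_windshield: Optional[bool] = None    # None (NULL) for a car

car = Vehicle("ABC123", "red", "Ford", "Fiesta", num_seats=5)
motorbike = Vehicle("XYZ789", "black", "Honda", "CB500", has_windshield=True)
print(car.has_windshield, motorbike.num_seats)
```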
Mapping ER Models into Relations
What is a relation?
A relation is a table that holds the data we are interested in. It is two-dimensional and has rows
and columns.
Each entity type in the ER model is mapped into a relation.

The attributes become the columns.

The individual entities become the rows.

Figure : a relation

Relations can be represented textually as:


tablename(primary key, attribute 1, attribute 2, ... , foreign key)
If matric_no was the primary key, and there were no foreign keys, then the table above could be
represented as:
student(matric_no, name, address, date_of_birth)
When referring to relations or tables, cardinality is considered to be the number of rows in the
relation or table, and arity is the number of columns in a table or attributes in a relation.
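
As a small worked example, assume the student relation is held as a list of Python tuples. The three rows are invented for illustration.

```python
columns = ("matric_no", "name", "address", "date_of_birth")
student = [
    (1001, "Ann", "1 High St", "1980-05-01"),
    (1002, "Bob", "2 Low Rd", "1981-07-12"),
    (1003, "Cat", "3 Mid Ave", "1979-11-30"),
]

cardinality = len(student)   # number of rows
arity = len(columns)         # number of columns (attributes)
print(cardinality, arity)
```

Note that this use of cardinality (a row count) is different from the cardinality of a relationship in an ER model.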
Foreign keys
A foreign key is an attribute (or group of attributes) that is the primary key of another relation.

Roughly, each foreign key represents a relationship between two entity types.

They are added to relations as we go through the mapping process.

They allow the relations to be linked together.

A relation can have several foreign keys.

It will generally have a foreign key from each table that it is related to.

Foreign keys are usually shown in italics or with a wiggly underline.

Preparing to map the ER model


Before we start the actual mapping process we need to be certain that we have simplified the ER
model as much as possible.
This is the ideal time to check the model, as it is really the last chance to make changes to the ER
model without causing major complications.
Mapping 1:1 relationships
Before tackling a 1:1 relationship, we need to know its optionality.
There are three possibilities. The relationship can be:
1. mandatory at both ends
2. mandatory at one end and optional at the other
3. optional at both ends
Mandatory at both ends
If the relationship is mandatory at both ends it is often possible to subsume one entity type into
the other.

The choice of which entity type subsumes the other depends on which is the most
important entity type (more attributes, better key, semantic nature of them).

The result of this amalgamation is that all the attributes of the `swallowed up' entity
become attributes of the more important entity.

The key of the subsumed entity type becomes a normal attribute.

If there are any attributes in common, the duplicates are removed.

The primary key of the new combined entity is usually the same as that of the original
more important entity type.

When not to combine


There are a few reasons why you might not combine a 1:1 mandatory relationship.

the two entity types represent different entities in the `real world'.

the entities participate in very different relationships with other entities.

efficiency considerations when fast responses are required or different patterns of
updating occur to the two different entity types.

If not combined...
If the two entity types are kept separate then the association between them must be represented
by a foreign key.

The primary key of one entity type becomes the foreign key in the other.

It does not matter which way around it is done but you should not have a foreign key in
each entity.

Example

Two entity types; staff and contract.

Each member of staff must have one contract and each contract must have one member of
staff associated with it.

It is therefore a mandatory relationship at both ends.

Figure : 1:1 mandatory relationship

These two entity types could be amalgamated into one.

Staff(emp_no, name, cont_no, start, end, position, salary)

or kept apart and a foreign key used

Staff(emp_no, name, contract_no)


Contract(cont_no, start, end, position, salary)

or

Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)
Mandatory <->Optional
The entity type of the optional end may be subsumed into the mandatory end as in the previous
example.
It is better NOT to subsume the mandatory end into the optional end as this will create null
entries.

Figure : 1:1 with 1 optional end


Suppose we add to the specification that each staff member may have at most one contract (thus
making the relationship optional at one end). The foreign key can then be placed in either relation:

Map the foreign key into Staff - the key is null for staff without a contract.

Staff(emp_no, name, contract_no)


Contract(cont_no, start, end, position, salary)

Map the foreign key into Contract - emp_no is mandatory thus never null.

Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)
Example
Consider this example:

Staff Gordon, empno 10, contract no 11.

Staff Andrew, empno 11, no contract.

Contract 11, from 1st Jan 2001 to 10th Jan 2001, lecturer, on 2.00 a year.

Foreign key in Staff:

Contract Table:
Cont_no   Start          End             Position   Salary
11        1st Jan 2001   10th Jan 2001   Lecturer   2.00

Staff Table:
Empno   Name     Contract No
10      Gordon   11
11      Andrew   NULL

However, Foreign key in Contract:

Contract Table:
Cont_no   Start          End             Position   Salary   Empno
11        1st Jan 2001   10th Jan 2001   Lecturer   2.00     10

Staff Table:
Empno   Name
10      Gordon
11      Andrew
As you can see, both ways store the same information, but the second way has no NULLs.
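
The comparison can be sketched with Python's built-in sqlite3 module. The table names carry a suffix (1 or 2) only so that both options fit in one database, the start, end and salary columns are left out for brevity, and the UNIQUE constraint used to keep the relationship 1:1 is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option 1: foreign key in Staff (nullable, because a contract is optional).
CREATE TABLE contract1 (cont_no INTEGER PRIMARY KEY, position TEXT);
CREATE TABLE staff1 (emp_no INTEGER PRIMARY KEY, name TEXT,
                     contract_no INTEGER REFERENCES contract1(cont_no));
INSERT INTO contract1 VALUES (11, 'Lecturer');
INSERT INTO staff1 VALUES (10, 'Gordon', 11), (11, 'Andrew', NULL);

-- Option 2: foreign key in Contract (every contract has a staff member).
CREATE TABLE staff2 (emp_no INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contract2 (cont_no INTEGER PRIMARY KEY, position TEXT,
                        emp_no INTEGER NOT NULL UNIQUE REFERENCES staff2(emp_no));
INSERT INTO staff2 VALUES (10, 'Gordon'), (11, 'Andrew');
INSERT INTO contract2 VALUES (11, 'Lecturer', 10);
""")

# Count the NULL foreign key values under each option.
nulls_option1 = conn.execute(
    "SELECT COUNT(*) FROM staff1 WHERE contract_no IS NULL").fetchone()[0]
nulls_option2 = conn.execute(
    "SELECT COUNT(*) FROM contract2 WHERE emp_no IS NULL").fetchone()[0]
print(nulls_option1, nulls_option2)
```

Option 2 also lets the foreign key be declared NOT NULL, so the DBMS itself can enforce that every contract belongs to a member of staff.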
Mandatory <->Optional - Subsume?
The reasons for not subsuming are the same as before with the following additional reason.

very few of the entities from the mandatory end are involved in the relationship. This
could cause a lot of wasted space with many blank or null entries.

Figure : 1 optional end

If only a few lecturers manage courses and Course is subsumed into Lecturer then there
would be many null entries in the table.

Lecturer(lect_no, l_name, cno, c_name, type, yr_vetted, external)

It would be better to keep them separate.

Lecturer(lect_no, l_name)
Course(cno, c_name, type, yr_vetted, external, lect_no)
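
The wasted space can be made concrete by counting null cells under each design. This is only a rough sketch in plain Python: the four lecturers and the single course are hypothetical sample data, and the helper names subsume_course_into_lecturer and count_nulls are invented for this illustration.

```python
# Hypothetical sample data: four lecturers, only one of whom manages a course.
lecturers = [(1, "Lecturer A"), (2, "Lecturer B"), (3, "Lecturer C"), (4, "Lecturer D")]
# Course rows follow Course(cno, c_name, type, yr_vetted, external, lect_no).
courses = [("C1", "Course One", "core", 2001, "no", 2)]


def subsume_course_into_lecturer(lecturers, courses):
    """Build rows for Lecturer(lect_no, l_name, cno, c_name, type, yr_vetted, external)."""
    course_by_lecturer = {row[-1]: row[:-1] for row in courses}
    no_course = (None,) * 5  # the five course columns are left null
    return [lect + course_by_lecturer.get(lect[0], no_course) for lect in lecturers]


def count_nulls(rows):
    """Count the None cells across all rows."""
    return sum(cell is None for row in rows for cell in row)


combined = subsume_course_into_lecturer(lecturers, courses)
nulls_when_subsumed = count_nulls(combined)
nulls_when_separate = count_nulls(lecturers) + count_nulls(courses)
print(nulls_when_subsumed, nulls_when_separate)  # prints: 15 0
```

With this sample data, three of the four lecturers carry five null course columns each when the relations are merged, while the separate relations contain none.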
Summary...
So for 1:1 optional relationships, take the primary key from the 'mandatory end' and add it to the
'optional end' as a foreign key.
So, given entity types A and B, where A <-> B is a relationship in which the A end is optional, the
result would be:
A (primary key, attribute, ..., foreign key to B)
B (primary key, attribute, ...)
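
The summary rule can also be written as a small schema-level function. This is a sketch only: the Relation class and the function map_one_to_one_optional are hypothetical helpers invented for this illustration, and the Staff and Contract schemas are taken from the earlier example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Relation:
    name: str
    primary_key: List[str]
    attributes: List[str] = field(default_factory=list)
    foreign_keys: List[str] = field(default_factory=list)

    def __str__(self):
        columns = self.primary_key + self.attributes + self.foreign_keys
        return f"{self.name}({', '.join(columns)})"


def map_one_to_one_optional(optional_end, mandatory_end):
    """Add the mandatory end's primary key to the optional end as a foreign key."""
    mapped = Relation(
        optional_end.name,
        list(optional_end.primary_key),
        list(optional_end.attributes),
        optional_end.foreign_keys + mandatory_end.primary_key,
    )
    return mapped, mandatory_end


staff = Relation("Staff", ["emp_no"], ["name"])
contract = Relation("Contract", ["cont_no"], ["start", "end", "position", "salary"])

mapped_contract, mapped_staff = map_one_to_one_optional(contract, staff)
print(mapped_contract)  # Contract(cont_no, start, end, position, salary, emp_no)
print(mapped_staff)     # Staff(emp_no, name)
```

Here Contract plays the role of A (the optional end) and Staff the role of B, so the output matches the relations derived by hand in the example.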
Optional at both ends...
Entity types in such relationships cannot be amalgamated, as you could not select a primary key.
Instead, they are kept as separate relations and one foreign key is used as before.

Figure : 2 optional ends

Each staff member may lease up to one car

Each car may be leased by at most one member of staff

If these were combined together...

Staff_car(emp_no, name, reg_no, year, make, type, colour)


what would be the primary key?

If emp_no is used then all the cars which are not being leased will not have a key.

Similarly, if the reg_no is used, all the staff not leasing a car will not have a key.

A compound key (emp_no, reg_no) will not work either, since no part of a primary key may be null.
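
One way to keep the two relations separate is sketched below, assuming SQLite through Python's sqlite3 module. Placing the foreign key in Car rather than Staff is an assumption made for this sketch (the notes only say that one foreign key is used), and all staff names, registration numbers and car details are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (emp_no INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE car (
        reg_no TEXT PRIMARY KEY NOT NULL,
        year INTEGER, make TEXT, type TEXT, colour TEXT,
        -- NULL means the car is not leased; UNIQUE allows at most one car per member of staff.
        emp_no INTEGER UNIQUE REFERENCES staff(emp_no)
    );
    INSERT INTO staff VALUES (1, 'Staff A');
    INSERT INTO staff VALUES (2, 'Staff B');
    INSERT INTO car VALUES ('REG1', 2000, 'Make X', 'saloon', 'red', 1);
    INSERT INTO car VALUES ('REG2', 2001, 'Make Y', 'estate', 'blue', NULL);
    INSERT INTO car VALUES ('REG3', 2001, 'Make Y', 'estate', 'grey', NULL);
""")

# Every car and every member of staff still has its own non-null key.
unleased_cars = conn.execute(
    "SELECT COUNT(*) FROM car WHERE emp_no IS NULL"
).fetchone()[0]
staff_without_car = conn.execute(
    "SELECT COUNT(*) FROM staff "
    "WHERE emp_no NOT IN (SELECT emp_no FROM car WHERE emp_no IS NOT NULL)"
).fetchone()[0]

# Leasing a second car to the same member of staff violates the UNIQUE constraint.
try:
    conn.execute("UPDATE car SET emp_no = 1 WHERE reg_no = 'REG2'")
    second_lease_rejected = False
except sqlite3.IntegrityError:
    second_lease_rejected = True

print(unleased_cars, staff_without_car, second_lease_rejected)  # prints: 2 1 True
```

SQLite treats NULLs as distinct under a UNIQUE constraint, which is why several unleased cars can coexist in this sketch.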

Mapping 1:m relationships


To map 1:m relationships, the primary key on the 'one side' of the relationship is added to the
'many side' as a foreign key.
For example, the 1:m relationship 'course-student':

Figure : Mapping 1:m relationships

Assuming that the entity types have the following attributes:

Course(course_no, c_name)
Student(matric_no, st_name, dob)

Then after mapping, the following relations are produced:

Course(course_no, c_name)
Student(matric_no, st_name, dob, course_no)

If an entity type participates in several 1:m relationships, then you apply the rule to each
relationship, and add foreign keys as appropriate.
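
The mapped Course and Student relations can be tried out as follows. This is a minimal sketch assuming SQLite through Python's sqlite3 module; the column types and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE course (course_no TEXT PRIMARY KEY NOT NULL, c_name TEXT);
    CREATE TABLE student (
        matric_no INTEGER PRIMARY KEY,
        st_name TEXT,
        dob TEXT,
        course_no TEXT REFERENCES course(course_no)  -- key of the 'one side'
    );
    INSERT INTO course VALUES ('C1', 'Course One');
    INSERT INTO student VALUES (1001, 'Student A', '1980-01-01', 'C1');
    INSERT INTO student VALUES (1002, 'Student B', '1981-02-02', 'C1');
""")

# One course row is shared by many student rows.
students_per_course = conn.execute(
    "SELECT c.c_name, COUNT(*) FROM course c "
    "JOIN student s ON s.course_no = c.course_no "
    "GROUP BY c.course_no, c.c_name"
).fetchall()

# A student cannot reference a course that does not exist.
try:
    conn.execute("INSERT INTO student VALUES (1003, 'Student C', '1982-03-03', 'C9')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(students_per_course, orphan_rejected)  # prints: [('Course One', 2)] True
```

Unlike the 1:1 case, the foreign key carries no UNIQUE constraint here, so many students may share the same course_no.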

Mapping n:m relationships


If your ER model contains n:m relationships, these are mapped in the following manner.

A new relation is produced which contains the primary keys from both sides of the
relationship.

These primary keys together form the composite primary key of the new relation.

Figure : Mapping n:m relationships

Thus

Student(matric_no, st_name, dob)


Module(module_no, m_name, level, credits)

becomes

Student(matric_no, st_name, dob)


Module(module_no, m_name, level, credits)
Studies(matric_no, module_no)

This is equivalent to introducing a new entity type, Study, linked to Student and Module by two 1:m
relationships:

Figure : After mapping an n:m relationship


Student(matric_no, st_name, dob)
Module(module_no, m_name, level, credits)
Study()
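
The Student, Module and Studies relations from the n:m mapping can be sketched in the same way. This assumes SQLite through Python's sqlite3 module; the column types and every sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE student (matric_no INTEGER PRIMARY KEY, st_name TEXT, dob TEXT);
    CREATE TABLE module (
        module_no TEXT PRIMARY KEY NOT NULL,
        m_name TEXT, level INTEGER, credits INTEGER
    );
    CREATE TABLE studies (
        matric_no INTEGER NOT NULL REFERENCES student(matric_no),
        module_no TEXT NOT NULL REFERENCES module(module_no),
        PRIMARY KEY (matric_no, module_no)  -- composite primary key
    );
    INSERT INTO student VALUES (1001, 'Student A', '1980-01-01');
    INSERT INTO student VALUES (1002, 'Student B', '1981-02-02');
    INSERT INTO module VALUES ('M1', 'Module One', 1, 15);
    INSERT INTO module VALUES ('M2', 'Module Two', 1, 15);
    INSERT INTO studies VALUES (1001, 'M1');
    INSERT INTO studies VALUES (1001, 'M2');
    INSERT INTO studies VALUES (1002, 'M1');
""")

# One student can study many modules, and one module can have many students.
modules_for_1001 = [row[0] for row in conn.execute(
    "SELECT module_no FROM studies WHERE matric_no = 1001 ORDER BY module_no")]
students_on_m1 = [row[0] for row in conn.execute(
    "SELECT matric_no FROM studies WHERE module_no = 'M1' ORDER BY matric_no")]

# The composite key rejects recording the same student-module pair twice.
try:
    conn.execute("INSERT INTO studies VALUES (1001, 'M1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(modules_for_1001, students_on_m1, duplicate_rejected)
# prints: ['M1', 'M2'] [1001, 1002] True
```

Each column of the composite key is also a foreign key in this sketch, so a Studies row can only pair a student and a module that both exist.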
