
BM307 File Organization

Gazi University
Computer Engineering Department

9/24/2014

Index

Sequential File Organization
  Binary Search
  Interpolation Search
  Self-Organizing Sequential Search

Direct File Organization
  Locating Information
  Hashing Functions
  Collision Resolution
  Coalesced Hashing

File Organization

Goal: organizing files efficiently in terms of both space and performance

File Organization
  sequential
  indexed sequential
  direct

File Access
  sequential
  sequential & direct
  direct (random)

File Access Types

Sequential: accessing multiple records (often an entire file), usually according to a predefined order

Direct (random): locating a single record

Question: How can we have an effective organization?
Answer: matching the type of organization with the type of intended access

Sequential File Organization

Background
  Fields (e.g., employee name, number)
  Records: contain data about individual entities
  Files (e.g., employee list)
  Primary key: field(s) which uniquely distinguish a record from all others
  Secondary key: all the remaining fields

Sequential File Organization

A file consists of records of the same format:
  fixed-length records
  variable-length records

Sequential file organization: the (i+1)st element of a file is stored immediately after the ith element.

Sequential File Organization

Sequential access: moving from one record in the file to the next by incrementing the address of the current record by the record size

Direct access: processing a single record directly if we know its subscript

Sequential File Organization

Probe: an access to a distinct location

Sequential Search
  In an entire file of N records, N/2 probes are needed on average
  The entire file must be probed for an unsuccessful retrieval
  Computational complexity: O(N)
  Appropriate when N is small

Performance improvement?

Sorting

Eg. - Sequential Search
  100,000 records, each record 400 bytes, block size 2400 bytes
  Sequential search time for retrieving 10,000 records?
  Each probe reads one block of data
  (100,000 * 400) / 2400 = 16,667 blocks
  Reading time for one block: 0.84 ms (IBM 3380)
  Time requirement for each record: (16,667 / 2) * 0.84 ms = 7 sec
  For 10,000 records: 7 sec * 10,000 = 19 hours
  Better organization is needed!!
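The arithmetic on this slide can be checked with a short script; all figures (record counts, block size, the IBM 3380 read time) are taken from the slide itself:

```python
# Back-of-the-envelope check of the sequential-search timing example.
RECORDS = 100_000
RECORD_SIZE = 400      # bytes per record
BLOCK_SIZE = 2400      # bytes per block
BLOCK_READ_MS = 0.84   # one block read on an IBM 3380, per the slide

blocks = RECORDS * RECORD_SIZE / BLOCK_SIZE        # ~16,667 blocks in the file
avg_retrieval_ms = (blocks / 2) * BLOCK_READ_MS    # average search reads half the file
total_hours = avg_retrieval_ms * 10_000 / (1000 * 3600)

print(round(blocks))            # 16667
print(round(avg_retrieval_ms))  # 7000 (= 7 seconds per record)
print(round(total_hours, 1))    # 19.4 (~19 hours for 10,000 records)
```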

Sequential File Organization

Binary Search
  Requires a sorted file
  Compares the key of the sought record with the key of the middle record of the file
  Half of the file is eliminated in each turn
  Computational complexity: O(log2 N)
  Eg.: the key of the sought record is 17

Sequential File Organization

Binary Search (Algorithm)
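The algorithm figure on this slide does not survive in the extracted text; a minimal sketch of binary search over a sorted in-memory table:

```python
# Binary search over a sorted list of keys; returns the index of the
# sought key, or -1 if the retrieval is unsuccessful.
def binary_search(keys, sought):
    lower, upper = 0, len(keys) - 1
    while lower <= upper:
        middle = (lower + upper) // 2
        if keys[middle] == sought:
            return middle
        if keys[middle] < sought:
            lower = middle + 1   # discard the lower half
        else:
            upper = middle - 1   # discard the upper half
    return -1

keys = [3, 5, 9, 11, 17, 21, 28, 35]
print(binary_search(keys, 17))   # 4
print(binary_search(keys, 18))   # -1
```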

Sequential File Organization

Interpolation Search
  Uses the approximate relative position of the sought key
  Eg.: searching a name in a telephone book
  Chooses the next position for a comparison based upon the estimated position of the sought key relative to the remainder of the file to be searched

  NEXT := LOWER + (UPPER - LOWER) * (key[sought] - key[LOWER]) / (key[UPPER] - key[LOWER])

  Worst-case computational complexity: O(N)
  Average-case computational complexity: O(log2 log2 N)
  Its performance improves as the distribution of keys becomes more uniform
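The interpolation step above can be sketched as follows (integer keys assumed; NEXT is rounded down with floor division):

```python
# Interpolation search over a sorted list of numeric keys; the probe
# position is estimated from the sought key's value relative to the
# keys at the current bounds.
def interpolation_search(keys, sought):
    lower, upper = 0, len(keys) - 1
    while lower <= upper and keys[lower] <= sought <= keys[upper]:
        if keys[upper] == keys[lower]:   # all remaining keys equal; avoid /0
            break
        nxt = lower + (upper - lower) * (sought - keys[lower]) // (keys[upper] - keys[lower])
        if keys[nxt] == sought:
            return nxt
        if keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return lower if lower <= upper and keys[lower] == sought else -1

keys = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(interpolation_search(keys, 70))   # 6
print(interpolation_search(keys, 35))   # -1
```

For perfectly uniform keys like these, the first probe lands directly on the sought record — the fewer-probes advantage the following slides discuss.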

Binary search should be preferred when the data is stored in primary memory.

Why?

Interpolation search should be preferred when the data is stored in auxiliary memory.

Why?

Binary search should be preferred when the data is stored in primary memory:
the additional calculations needed for the interpolation search cancel any savings gained from fewer probes.

Interpolation search should be preferred when the data is stored in auxiliary memory:
an access of auxiliary storage takes an order of magnitude longer than the additional calculations.

Sequential File Organization

Self-Organizing Sequential Search
  Modifies the order of records
  Moves the most frequently retrieved records toward the beginning of the file

Most popular algorithms:
  Move_to_front
  Transpose
  Count

Sequential File Organization

Move_to_front
  The sought record is moved to the front position of the file
  Potential for big mistakes if a record is accessed, moved to the front of the file, and then rarely if ever accessed again!
  A linked implementation is preferable even though it takes more storage
  Appropriate when space is not limited and locality of access is important
  Essentially the same as the LRU (least recently used) paging algorithm used by operating systems

Eg. - Move_to_front

The records are accessed in the order of the letters of "fileediting" (f, i, l, e, e, d, i, t, i, n, g):

abcdefghijklmnoprqstvwyz
fabcdeghijklmnoprqstvwyz
ifabcdeghjklmnoprqstvwyz
lifabcdeghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
elifabcdghjkmnoprqstvwyz
delifabcghjkmnoprqstvwyz
idelfabcghjkmnoprqstvwyz
tidelfabcghjkmnoprqsvwyz
itdelfabcghjkmnoprqsvwyz
nitdelfabcghjkmoprqsvwyz
gnitdelfabchjkmoprqsvwyz
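The table above can be reproduced with a small sketch, using a Python list to stand in for the file:

```python
# Move_to_front: after each access, the sought record is moved to the
# front of the file.
def move_to_front(records, sought):
    records.remove(sought)
    records.insert(0, sought)

records = list("abcdefghijklmnoprqstvwyz")   # initial order from the slide
for letter in "fileediting":
    move_to_front(records, letter)
print("".join(records))   # gnitdelfabchjkmoprqsvwyz, the slide's final line
```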

Sequential File Organization

Transpose
  Interchanges the sought record with its immediate predecessor
  More stable than the Move_to_front algorithm
  A record needs to be accessed many times before it is moved to the front of the list
  Easily implemented
  Does not need additional space
  Should be used when space is at a premium

Eg. - Transpose

The records are accessed in the order of the letters of "fileediting":

abcdefghijklmnoprqstvwyz
abcdfeghijklmnoprqstvwyz
abcdfegihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcedfgihjklmnoprqstvwyz
abcdefgihjklmnoprqstvwyz
abcdefighjklmnoprqstvwyz
abcdefighjklmnoprqtsvwyz
abcdeifghjklmnoprqtsvwyz
abcdeifghjklnmoprqtsvwyz
abcdeigfhjklnmoprqtsvwyz
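A minimal sketch of the transpose rule, shown on a short record list rather than the slide's full table:

```python
# Transpose: the sought record swaps places with its immediate predecessor,
# so a record must be accessed repeatedly to migrate toward the front.
def transpose(records, sought):
    i = records.index(sought)
    if i > 0:
        records[i - 1], records[i] = records[i], records[i - 1]

records = list("abcde")
transpose(records, "e")   # -> abced
transpose(records, "e")   # -> abecd
print("".join(records))   # abecd
```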

Sequential File Organization

Count
  Keeps a count of the number of accesses of each record
  The file is always ordered in decreasing order of frequency of access
  Requires extra storage to keep the count
  Use it only when the counts are needed for another purpose
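A minimal sketch of the count rule; the counts dictionary is the extra storage the slide mentions, and records with equal counts keep their relative order:

```python
# Count: each record carries an access count, and the file is kept in
# decreasing order of that count after every access.
def access(records, counts, sought):
    counts[sought] += 1
    # Python's sort is stable, so equal-count records keep their order
    records.sort(key=lambda r: counts[r], reverse=True)

records = list("abcd")
counts = {r: 0 for r in records}
for r in ["c", "c", "b"]:
    access(records, counts, r)
print("".join(records))   # cbad
```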

Direct File Organization

Ideally, we want to go directly to the address where the record is stored

A key can be a unique address: one probe

[Figure: one-to-one correspondence between the key space (0 to 999-99-9999) and the address space]

More address space than needed
  Eg.: 1 billion addresses for 300 million people

Direct File Organization

Converting information into a unique address

Eg.: Airline reservation system
  Flight numbers from 1 to 999
  Days of the year numbered from 1 to 366
  Flight number and day of the year could be concatenated to determine the location
  Location = flight number || day of the year, address range 001001-999366 (???367 - ???999 would not exist)
  Location = day of the year || flight number, address range 001001-366999
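The two concatenation orders can be sketched as follows (function names are illustrative; || is digit concatenation, implemented as multiply-and-add):

```python
# Two ways to concatenate flight number (1..999) and day of year (1..366)
# into a six-digit address.
def flight_then_day(flight: int, day: int) -> int:
    # range 001001..999366: day values 367..999 never occur, so most of
    # each 1000-slot stripe is wasted
    return flight * 1000 + day

def day_then_flight(flight: int, day: int) -> int:
    # range 001001..366999: a denser use of the address space
    return day * 1000 + flight

print(flight_then_day(123, 45))   # 123045
print(day_then_flight(123, 45))   # 45123
```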

Direct File Organization

The key converts to a probable address

If we remove most of the empty space in the address space, we have lost the 1-1 correspondence between keys and addresses

Hashing functions are used to map the wider range of key values into the narrower range of address values:
  Hash(key) -> probable address

The initial probable address is called the home address

A hashing function should:
  evenly distribute the keys among the addresses
  execute efficiently

Direct File Organization

A collision occurs when two distinct keys map to the same address

[Figure: key space (0 to 999-99-9999) hashed into a smaller address space (0 to 1200)]

Hashing is then composed of two aspects:
  the function
  the collision resolution method

Direct File Organization


Hashing Functions


Direct File Organization

Hashing Functions

Squaring
  Taking the square of a key and then substringing or truncating a portion of the result

Radix conversion
  The key is considered to be in a base other than 10 and is then converted into a number in base 10
  Eg.: base 11
    1234 = 1*11^3 + 2*11^2 + 3*11^1 + 4*11^0 = 1331 + 242 + 33 + 4 = 1610
  Substringing or truncation could then be used
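Both functions can be sketched in a few lines; the 3-digit address width and the middle-digit choice for squaring are illustrative assumptions, while the radix example matches the slide:

```python
# Sketches of the squaring and radix-conversion hashing functions.

def squaring_hash(key: int, digits: int = 3) -> int:
    # square the key, then take a substring of middle digits of the result
    sq = str(key * key)
    mid = len(sq) // 2
    start = max(0, mid - digits // 2)
    return int(sq[start:start + digits])

def radix_hash(key: int, base: int = 11) -> int:
    # reinterpret the decimal digits of the key as digits in another base
    value = 0
    for d in str(key):
        value = value * base + int(d)
    return value

print(radix_hash(1234))       # 1610, matching the slide's base-11 example
print(squaring_hash(123))     # 512 (middle digits of 123^2 = 15129)
```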

Direct File Organization

Hashing Functions

Polynomial hashing
  The key is divided by a polynomial
  f(information area) -> cyclic check bytes

Alphabetic keys
  Alphabetic or alphanumeric key values can be input to a hashing function if the values are interpreted as integers

Direct File Organization

Collisions
  For a given set of data, one hashing function may distribute the keys more evenly over the address space than another
  A hashing function that has a large number of collisions is said to exhibit primary clustering
  It is better to have a slightly more expensive hashing function for data that need to be stored on auxiliary storage
  Another method for reducing collisions is reducing the packing factor

  Packing factor = (number of records stored) / (total number of storage locations)

  A lower packing factor means fewer collisions at the cost of more storage

Direct File Organization

Collision Resolution
  Collision resolution with links
  Collision resolution without links
    static positioning of records
    dynamic positioning of records
  Collision resolution with pseudolinks

Direct File Organization

Collision resolution with links
  If multiple synonyms occur for a particular home address, we form a chain of synonym records
  Disadvantage: extra storage is needed

Collision resolution without links
  We can use implied links by applying a convention, or set of rules, for deciding where to go next
  A simple convention is to look at the next location in memory
  Advantage: no extra storage is needed

Direct File Organization

Coalesced Hashing
  Occurs when we attempt to insert a record with a home address that is already occupied by a record from a chain with a different home address
  The two chains, with records having different home addresses, coalesce or grow together
  X, D, Y were inserted

Direct File Organization

Coalesced Hashing (Eg.)
  Hash(key) = key mod 11
  Keys: 27, 18, 29, 28, 39, 13, 16
  Average # of probes: 1.8
  42 & 17 added
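The example can be reproduced with a sketch of late-insertion coalesced hashing. One assumption (not stated on the slide, but consistent with its 1.8 figure): empty slots are taken from the bottom of the table upward, the classic choice:

```python
# Late-insertion coalesced hashing (LISCH-style): a colliding key is
# appended to the END of the probe chain, in the highest-indexed empty slot.
M = 11
table = [None] * M   # stored keys
link = [None] * M    # index of the next record in the chain, if any

def insert(key):
    h = key % M
    if table[h] is None:
        table[h] = key
        return
    i = h
    while link[i] is not None:   # walk to the end of the chain
        i = link[i]
    j = M - 1
    while table[j] is not None:  # highest-indexed empty slot
        j -= 1
    table[j] = key
    link[i] = j

def probes(key):
    i, count = key % M, 1
    while table[i] != key:
        i, count = link[i], count + 1
    return count

keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
for k in keys:
    insert(k)
avg = sum(probes(k) for k in keys) / len(keys)
print(round(avg, 1))   # 1.8, matching the slide
```

Note that 42's home address (9) is already occupied by 39 from the chain of address 6 — the chains coalesce, which is why retrieving 17 later takes 4 probes.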

Direct File Organization

Coalesced Hashing - Discussion
  Packing factor of the final table = 9/11 (82%)
  One method of reducing coalescing is to reduce the packing factor
  It would be advisable to place the most frequently accessed records early in the insertion process
  Deleting a record is complicated
  If coalescing has occurred, a simple deletion procedure is to move a record later in the probe chain into the position of the deleted record
  Final table after deleting 39 [figure not reproduced]

Direct File Organization

Coalesced Hashing - Variants
  Table organization (whether or not a separate overflow area is used)
  The manner of linking a colliding item into a chain
  The manner of choosing unoccupied locations

Table Organization
  Table = primary area + overflow area
  Address factor = (primary area) / (total table size)
  Best performance when the address factor is 0.86

Direct File Organization

Coalesced Hashing - Variants

Late Insertion Standard Coalesced Hashing (LISCH)
  New records are inserted at the end of a probe chain
  Lacks a cellar

Late Insertion Coalesced Hashing (LICH)
  Uses a cellar
  Eg.: keys 27, 18, 29, 28, 39, 13, 16, 42, 17; hashing function: key mod 7
  Average # of probes: 1.3 (it was 1.8 for LISCH)
  In general, for a 90 percent packing factor, using a cellar will reduce the number of probes by about 6 percent compared with LISCH

Direct File Organization

Coalesced Hashing - Variants

Early Insertion Standard Coalesced Hashing (EISCH)
  Inserts a new record into a position on the probe chain immediately after the record stored at its home address
  Insertion of the record with key 17 according to the EISCH algorithm: Hash(key) = key mod 11

Direct File Organization

Coalesced Hashing - Variants

Random Early Insertion Standard Coalesced Hashing (REISCH)
  Chooses a random unoccupied location for the new insertion
  Gives only a 1% improvement over EISCH

Random Late Insertion Standard Coalesced Hashing (RLISCH)

Bidirectional Late Insertion Standard Coalesced Hashing (BLISCH)
  Chooses the overflow location for a collision insertion by alternating the selection between the top and bottom of the table

Bidirectional Early Insertion Standard Coalesced Hashing (BEISCH)

Direct File Organization

Coalesced Hashing - Comparison