
FINDING CANDIDATE KEYS FOR RELATIONAL DATA BASES

Raymond Fadous, John Forsyth

Computer Science Department, Michigan State University, East Lansing, Michigan 48824

The candidate keys, as defined by E. F. Codd [4], are important in the process of reducing a normalized relation into second and third normal forms.

Given a set of functional relations, Delobel and Casey [6] transformed this set into a Boolean function and showed that the prime implicants of this function that have no primed variables are exactly the candidate keys. Starting only with the functional relations (dependencies), a new approach is proposed for finding all the candidate keys of a normalized relation without using a Boolean function. The algorithm depends on an implication matrix, its transitive closure, and a systematic method for introducing attributes to form keys. This algorithm is suitable for hand computation as well as computer implementation.


INTRODUCTION

E. F. Codd [2] proposed the relational data base model in an effort to provide "data independence" for applications programs. The term "data independence" means that the correctness of applications programs is not dependent upon the details of any particular hierarchy between data items which are stored in the data base. An applications programmer working with a relational data base may view the data as being stored in tabular form; each unique collection of data items which describes a single entity occupies a unique row of some table. A number of different collections of tables may be used to represent any given data base. Certain collections might be preferable to others if they eliminate exceptional conditions, or anomalies, which can occur when items in the data base are added, deleted, or changed. Several researchers have investigated normal forms for relational data bases and characterized their role in eliminating addition, deletion, and insertion anomalies. A good overview of these concepts can be obtained by reading Codd [3-4] and Kent [7].

In order to establish an appropriate normal form for a relational data base, one must identify minimal subsets of attributes, called candidate keys, which uniquely determine the values of the remaining attributes, so that the rows of the tables in the data base have the required property of uniqueness. Delobel and Casey [6] developed an algorithm for finding all candidate keys in a relational data base, given the set of fundamental functional relations defined on the data base. Their method maps the functional relations to a Boolean function and produces the candidate keys from those prime implicants of the Boolean function which cover the implicant which consists of all of the uncomplemented variables of the function. This paper describes an alternative algorithm, whose correctness is shown to derive directly from the fundamental properties of the functional relations as given by Armstrong [1].
Experience with applying this new algorithm to several examples has shown that it is easier to apply than methods commonly used for finding prime implicants. The remainder of the discussion develops each of these ideas in detail. Section 2 defines the terms and notations to be used; the functional relations and the implication matrix derived from them are of special importance. Section 3 characterizes the mathematical properties of the functional forms and implication matrix which are critical to the algorithm presented in Section 4, which finds all of the candidate keys. Section 5 presents an example of the algorithm. Conclusions are offered in Section 6.

2. BASIC CONCEPTS AND NOTATION

Definition 1: Given sets $D_1, D_2, \ldots, D_n$, called domains, not necessarily distinct, a relation $R$ of degree $n$ is defined as a subset of the Cartesian product $D_1 \times D_2 \times \cdots \times D_n$. That is, $R$ is a set of elements each of the form $(d_1, d_2, \ldots, d_n)$ where each $d_i$ is an element of $D_i$. The set $D_i$ is called the $i$-th domain of $R$. Each element $(d_1, \ldots, d_n)$ is called an $n$-tuple or simply a tuple of $R$. Each $d_i$ is the $i$-th component of a tuple.

Definition 2: An attribute is a name assigned to a domain of a relation. Any value associated with an attribute is called an attribute value.

While the domains of a relation need not be distinct, the attribute names assigned to them must all be distinct. The relations are time-varying relations; tuples may be updated, deleted, and inserted in a relation.

In the discussion which follows, the symbol $\Omega$ will be used to denote the set of all $n$ attributes $A_1, A_2, \ldots, A_n$ on which a relation $R$ of degree $n$ is defined. The empty set is denoted by $\emptyset$. The union and intersection operations between sets will be denoted by $\cup$ and $\cap$, respectively. If $A$ is a set, then $|A|$ is the number of elements in $A$. The usual set notation for a set $\Omega$ with $n$ elements $A_1, A_2, \ldots, A_n$ is $\Omega = \{A_1, A_2, \ldots, A_n\}$; we will also use $\Omega = A_1 A_2 \ldots A_n$. Two sets $X$ and $Y$ are disjoint if $X \cap Y = \emptyset$. A partition on $\Omega$ is a collection of non-empty disjoint subsets of $\Omega$ whose union is $\Omega$.

Definition 3: The set of attributes $B$ in a relation $R$, defined on the set of attributes $\Omega$, is dependent on the set of attributes $A$ in $R$ if there exists a function $f: A \to B$, from the values of $A$ into the values of $B$, which holds true for every set of tuples that are valid in the relation $R$. We simply write $A \to B$ and also say that $A$ implies $B$. If $B$ is not dependent on $A$, we write $A \nrightarrow B$.

Definition 4: Let $A \subseteq \Omega$, $A \neq \emptyset$ and $A \to \Omega$. If there is no $B$, non-empty proper subset of $A$, such that $B \to \Omega$, then $A$ is called a candidate key of $R$.

Definition 5: Let $L_i$ and $R_i$, $i = 1, 2, \ldots, m$, be non-empty subsets in $\Omega$; then the set $\{L_i \to R_i\}$ is called the set of functional relations in $\Omega$.

Armstrong [1] has shown that the following properties P1 - P5 are sufficient to find the set of all functional relations that can be derived from a given set of functional relations. Let $A$, $B$, $C$, and $D$ be any non-empty subsets of $\Omega$; then,

P1. Reflexivity: $A \to A$;
P2. Transitivity: If $A \to B$ and $B \to C$, then $A \to C$;
P3. Augmentation: If $A \to B$, then $A' \to B$ for any $A' \supseteq A$;
P4. Projectivity: If $A \to B$, then $A \to B'$ for any $B' \subseteq B$ and $B' \neq \emptyset$;
P5. Additivity: If $A \to B$ and $C \to D$, then $A \cup C \to B \cup D$.

The purpose of this paper is to show an algorithm which generates all candidate keys which can be found using only the given functional relations and properties P1 - P5.

The starting point is a set of functional relations specified by the data base administrator or the user. Let $\{L_i \to R_i\}$, $i = 1, 2, \ldots, m$, be the set of functional relations in $\Omega$. Without loss of generality, we can impose the restrictions that $L_i \neq L_j$ for $i \neq j$ and that the intersection $L_i \cap R_i$, $i = 1, 2, \ldots, m$, is empty. Note that if $AB \to BC$, then it is true that $AB \to C$, but it is not necessarily true that $A \to BC$ or that $A \to C$.
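The way properties P1 - P5 let one derive everything a set of attributes implies can be illustrated with a small sketch. This is the standard attribute-closure iteration, not a procedure stated in the paper; the function name `closure` and the representation of attributes as single characters are assumptions made for illustration only.

```python
def closure(attrs, relations):
    """Return every attribute implied by `attrs` under the functional relations.

    `relations` is a list of (left, right) pairs; each side is an iterable of
    single-character attribute names (a hypothetical encoding).
    """
    result = set(attrs)              # P1: a set implies itself
    changed = True
    while changed:
        changed = False
        for left, right in relations:
            # P3: if `left` is contained in `result`, then `result` implies `right`;
            # P5 with P1 then gives `result` -> `result` united with `right`.
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result


relations = [("AB", "BC")]
print(sorted(closure("AB", relations)))  # AB -> BC, hence AB -> C
print(sorted(closure("A", relations)))   # A alone implies nothing beyond itself
```

The second call reflects the note above: from $AB \to BC$ one cannot conclude $A \to BC$ or $A \to C$.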

The implication matrix $P$ of a relation $R$ of degree $n$, defined on $\Omega = A_1 A_2 \ldots A_n$, is an $m \times n$ matrix whose rows are labeled $L_1, L_2, \ldots, L_m$ and whose columns are labeled $A_1, A_2, \ldots, A_n$ such that:

$$P[L_i, A_j] = \begin{cases} 1 & \text{if } A_j \in (L_i \cup R_i) \\ 0 & \text{otherwise} \end{cases}$$

for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$.
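A minimal sketch of this definition follows, assuming attributes are single characters and each functional relation is a `(left, right)` pair of strings; the name `implication_matrix` is hypothetical. The sample data are the relations of Example 1, with the second relation read as $AB \to CF$, which is what row AB of the printed matrix encodes.

```python
def implication_matrix(attributes, relations):
    """Build P: one row per relation L_i -> R_i, one column per attribute.

    Entry is 1 when the attribute belongs to L_i or R_i, 0 otherwise.
    Rows are keyed by the left-hand side label (assumed distinct).
    """
    matrix = {}
    for left, right in relations:
        covered = set(left) | set(right)
        matrix[left] = [1 if a in covered else 0 for a in attributes]
    return matrix


attributes = "ABCDEFG"
relations = [("ABC", "DEG"), ("AB", "CF"), ("CD", "EF"), ("EG", "AC")]
P = implication_matrix(attributes, relations)
for label, row in P.items():
    print(f"{label:>3}  {''.join(map(str, row))}")
```

A dictionary keyed by the row label is used because the paper requires $L_i \neq L_j$ for $i \neq j$, so labels identify rows uniquely.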

Example 1: Consider the following functional relations where $\Omega = ABCDEFG$ and

1. $ABC \to DEG$
2. $AB \to CF$
3. $CD \to EF$
4. $EG \to AC$

Then, the implication matrix $P$ is:

            A B C D E F G
      ABC   1 1 1 1 1 0 1
P =   AB    1 1 1 0 0 1 0
      CD    0 0 1 1 1 1 0
      EG    1 0 1 0 1 0 1

Since row labels of $P$ are not isomorphic to subsets of the column labels, we need an extension of the usual method of finding the transitive closure of an implication matrix. The transitive closure, $P^*$, of $P$ is defined as follows:

1. Put $P^* = P$.
2. For every two distinct rows $L_i$ and $L_j$ in $P^*$, if for every attribute $A_k$ in $L_j$ the entry $L_i A_k = 1$, then copy all 1 entries in row $L_j$ into the corresponding entries in row $L_i$. Note that we need change only the 0 entries in row $L_i$ to 1 if the corresponding entries in row $L_j$ are 1; the original 1 entries in row $L_i$ will remain. This changes $P^*$.
3. Repeat part 2 above until $P^*$ cannot be changed any further.

Example 2: We will find $P^*$ of $P$ in Example 1:

            A B C D E F G
      ABC   1 1 1 1 1 1 1
P* =  AB    1 1 1 1 1 1 1
      CD    0 0 1 1 1 1 0
      EG    1 0 1 0 1 0 1

The entries changed from 0 to 1 in the process of taking the transitive closure are F in row ABC and D, E, G in row AB.
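The three-step definition of $P^*$ can be sketched directly in code. This is an illustrative transcription under the assumption that rows are stored as lists of 0/1 keyed by their label and that attributes are single characters; the name `transitive_closure` is hypothetical. The input is the matrix $P$ of Example 1.

```python
def transitive_closure(attributes, matrix):
    """Compute P* from P following steps 1-3 of the definition."""
    index = {a: k for k, a in enumerate(attributes)}
    star = {label: row[:] for label, row in matrix.items()}    # step 1: P* = P
    changed = True
    while changed:                                             # step 3: repeat until stable
        changed = False
        for li, row_i in star.items():
            for lj, row_j in star.items():
                if li == lj:
                    continue
                # step 2: every attribute of L_j has a 1 in row L_i
                if all(row_i[index[a]] == 1 for a in lj):
                    for k, bit in enumerate(row_j):
                        if bit == 1 and row_i[k] == 0:
                            row_i[k] = 1                       # copy the 1 entries of row L_j
                            changed = True
    return star


attributes = "ABCDEFG"
P = {
    "ABC": [1, 1, 1, 1, 1, 0, 1],
    "AB":  [1, 1, 1, 0, 0, 1, 0],
    "CD":  [0, 0, 1, 1, 1, 1, 0],
    "EG":  [1, 0, 1, 0, 1, 0, 1],
}
P_star = transitive_closure(attributes, P)
for label, row in P_star.items():
    print(f"{label:>3}  {''.join(map(str, row))}")
```

Only 0 entries are ever overwritten, so the loop terminates after at most $m \cdot n$ changes.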


Before we proceed with the proof of the basic lemmas and theorems that support the algorithm, we will introduce the following notation. Given the functional relations $L_i \to R_i$, $i = 1, 2, \ldots, m$, we form the implication matrix $P$, and then its transitive closure $P^*$. It is clear from the definition of $P^*$ that, possibly, we get a new set of functional relations. We will call these new functional relations $L_i \to T_i$, $i = 1, 2, \ldots, m$, where $T_i \supseteq R_i$ and $L_i \cap T_i = \emptyset$. Now in $P^*$, it might be the case that some rows are not all 1. So, the set of attributes that correspond to 0 entries in row $L_i$ will be denoted by $T_i'$. Note that $T_i' = \emptyset$ if row $L_i$ is all 1 in $P^*$. Also note that $\Omega = L_i T_i T_i'$ is a partition on $\Omega$, if $T_i' \neq \emptyset$. We are also assuming that $L_i$ and $T_i$ are not empty.

3. MATHEMATICAL PRELIMINARIES

Lemma 1: In $P^*$, if $L_i \to T_i$ and $L_i \cup T_i \neq \Omega$, then $L_i$ can be extended by the subset $T_i'$ such that $L_i \cup T_i' \to \Omega$.

Proof: Since $L_i \to T_i$ and $L_i \to L_i$, then $L_i \to L_i \cup T_i$ (by P5). Also $T_i' \to T_i'$ (by P1), hence $L_i \cup T_i' \to L_i \cup T_i \cup T_i' = \Omega$ (by P5).

Lemma 2: In $P^*$, if row $L_j$ is all 1 and $L_j \subseteq L_i \cup T_i$, then row $L_i$ is also all 1.

Proof: Since row $L_j$ is all 1, this implies that $L_j \to \Omega$. Also, since $L_i \to T_i$, then $L_i \to L_i \cup T_i$ (by P5). But $L_i \cup T_i \to L_j$ (by P4) and so $L_i \to L_j \to \Omega$ (by P2).

Lemma 3: In $P^*$, if $L_j$ is a subset of $L_i \cup T_i$ and if row $L_i$ is not all 1, hence row $L_j$ is not all 1, then for $T_i'$ and $T_j'$ such that $L_i \cup T_i' \to \Omega$ and $L_j \cup T_j' \to \Omega$, the intersection of $T_i'$ and $T_j'$ is not empty.

Proof: It follows from the transitive property P2 that any 1 entry in row $L_j$ is a 1 entry in row $L_i$, and hence any 0 entry in row $L_i$ is a 0 entry in row $L_j$. But $T_i'$ corresponds to the 0 entries in row $L_i$, hence $T_i' \cap T_j' \neq \emptyset$.

Example 3: $\Omega = ABCDE$ with

1. $AB \to C$
2. $AC \to D$

           A B C D E
P* =  AB   1 1 1 1 0
      AC   1 0 1 1 0

Then $T_1' = E$, $T_2' = BE$ and $T_1' \cap T_2' \neq \emptyset$.

Lemma 4: In $P^*$, if $L_j$ is a subset of $L_i \cup T_i$ and if row $L_i$ is not all 1, then for any $\alpha_j \subseteq T_j'$ such that $L_j \cup \alpha_j \to \Omega$, the intersection of $T_i'$ and $\alpha_j$ is not empty.

Proof:

The proof is by contradiction. If $\alpha_j \cap T_i' = \emptyset$, then $\alpha_j \subseteq L_i \cup T_i$ since $L_i T_i T_i'$ is a partition on $\Omega$. Also $L_j \subseteq L_i \cup T_i$, hence $L_j \cup \alpha_j \subseteq L_i \cup T_i$. But this would imply that $L_i \to L_i \cup T_i \to L_j \cup \alpha_j \to \Omega$, which is a contradiction since $L_i \nrightarrow \Omega$. Therefore $\alpha_j \cap T_i' \neq \emptyset$.

Lemma 5: In $P^*$, if $L_j$ is not a subset of $L_i \cup T_i$ and if row $L_j$ is all 1 and row $L_i$ is not all 1, then for $T_i'$ such that $L_i \cup T_i' \to \Omega$, the intersection of $T_i'$ and $L_j$ is not empty. Further, in $P^*$, if $L_j$ is not a subset of $L_i \cup T_i$ and if row $L_j$ is not all 1 and row $L_i$ is not all 1, then for $T_i'$ such that $L_i \cup T_i' \to \Omega$, the intersection of $T_i'$ and $L_j$ is not empty.

Proof: If the intersection of $T_i'$ and $L_j$ is empty, this implies that the 1 entries in row $L_i$ include $L_j$, and hence $L_j$ would be a subset of $L_i \cup T_i$, which contradicts the hypothesis. Therefore $T_i' \cap L_j \neq \emptyset$. The same proof applies to the second statement.

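Example 4: $\Omega = ABCDE$ with

1. $A \to BC$
2. $AD \to E$

           A B C D E
P* =  A    1 1 1 0 0
      AD   1 1 1 1 1

Then $T_1' = DE$, $L_2 = AD$ and $T_1' \cap L_2 \neq \emptyset$.

Example 5: $\Omega = ABCDE$ with

1. $A \to BC$
2. $BD \to A$

Then $T_1' = DE$, $L_2 = BD$ and $T_1' \cap L_2 \neq \emptyset$.

Lemma 6: In $P^*$, if $L_j$ is a subset of $L_i \cup T_i$, then the subset of $\Omega$ dependent on $L_i$ is the same subset dependent on the union $L_i \cup L_j$.

Proof: In $P^*$, $L_i \to L_i \cup T_i$ and so, by definition of $T_i$, if $Z_i \subseteq \Omega$ is such that $Z_i \supset L_i \cup T_i$, then $L_i \nrightarrow Z_i$. Similarly, $L_j \to L_j \cup T_j$. But, by hypothesis, $L_j \subseteq L_i \cup T_i$ and so $L_i \to L_i \cup T_i \to L_j \to L_j \cup T_j$ by projectivity and transitivity respectively, which implies that $L_j \cup T_j \subseteq L_i \cup T_i$. Also $L_i \to T_i$ and $L_j \to T_j$, and so $L_i \cup L_j \to L_i \cup T_i \cup L_j \cup T_j = L_i \cup T_i$ since $L_j \cup T_j \subseteq L_i \cup T_i$. Therefore $L_i$ and $L_i \cup L_j$ imply the same subset, $L_i \cup T_i$, in $\Omega$.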

Example 6: Refer back to Example 3. Note that $L_1 = AB \to ABCD$ and $L_1 \cup L_2 = ABC \to ABCD$.

Lemma 7: In $P^*$, if $L_j$ is not a subset of $L_i \cup T_i$, then there exists a subset $\alpha \subseteq L_j$, possibly $\alpha = \emptyset$ if row $L_i$ is all 1, such that $L_i \cup \alpha$ and $L_i \cup L_j$ imply the same subset in $\Omega$.

Proof: Using Lemma 6, if $L_j \subseteq (L_i \cup \alpha) \cup T_i$ where $L_i \cup \alpha \to T_i$, then the subset in $\Omega$ dependent on $L_i \cup \alpha$ is the same subset dependent on the union $(L_i \cup \alpha) \cup L_j = L_i \cup L_j$, since $\alpha \subseteq L_j$.

Example 7: Refer to Example 5. Note that $\alpha = D \subseteq L_2 = BD$ and that $A \cup D$ and $A \cup BD$ imply the same subset in $\Omega$.

Now, if $K = L_i \cup \alpha_i$ is a candidate key, then $K$ is said to be derived from $L_i$, and so we have the following theorem.

Theorem 1: Any candidate key that can be derived from $L_i \cup L_j$ can also be derived from $L_i$ or $L_j$ separately.

Proof: By Lemmas 6 and 7, we have in general, for some $\alpha \subseteq L_j$, where $\alpha$ could possibly be empty, that $L_i \cup \alpha$ and $L_i \cup L_j$ imply the same subset in $\Omega$. If $L_i \cup \alpha \to \Omega$, then $L_i \cup L_j \to \Omega$, but $L_i \cup L_j \supseteq L_i \cup \alpha$ and hence $L_i \cup L_j$ cannot be a candidate key whenever $L_i \cup \alpha$ is, unless $L_i \cup \alpha = L_i \cup L_j$. If $L_i \cup \alpha \nrightarrow \Omega$, then, by Lemma 2, $L_i \cup \alpha$ and $L_i \cup L_j$ can be extended by the same minimum subset $\beta$ so that $L_i \cup (\alpha \cup \beta) \to \Omega$ and $(L_i \cup L_j) \cup \beta \to \Omega$. But $\alpha \subseteq L_j$ and, for any $\beta$, we have $\alpha \cup \beta \subseteq L_j \cup \beta$, hence $L_i \cup (\alpha \cup \beta) \subseteq (L_i \cup L_j) \cup \beta$, and therefore $(L_i \cup L_j) \cup \beta$ can never be a candidate key whenever $L_i \cup (\alpha \cup \beta)$ is, unless $(L_i \cup L_j) \cup \beta = L_i \cup (\alpha \cup \beta)$.
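The partition $\Omega = L_i T_i T_i'$ used throughout this section can be read off a row of $P^*$. The sketch below is illustrative only: the helper name `partition` and the row encoding are assumptions, and the data are the $P^*$ rows of Example 3. It also checks the conclusion of Lemma 3 on that example.

```python
def partition(attributes, label, row):
    """Split the attributes into (L_i, T_i, T_i') from one row of P*.

    L_i  : the row label,
    T_i  : attributes with a 1 entry that are not in L_i,
    T_i' : attributes with a 0 entry.
    """
    left = set(label)
    implied = {a for a, bit in zip(attributes, row) if bit == 1} - left
    missing = {a for a, bit in zip(attributes, row) if bit == 0}
    return left, implied, missing


attributes = "ABCDE"
P_star = {"AB": [1, 1, 1, 1, 0], "AC": [1, 0, 1, 1, 0]}   # Example 3

L1, T1, T1p = partition(attributes, "AB", P_star["AB"])
L2, T2, T2p = partition(attributes, "AC", P_star["AC"])

print(sorted(T1p), sorted(T2p))
# Lemma 3: L2 is a subset of L1 united with T1, so T1' and T2' must intersect.
print(L2 <= (L1 | T1), sorted(T1p & T2p))
```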

The previous Lemmas 1 - 7 and Theorem 1 show that to find the candidate keys in a relation $R$, we need only consider the functional relations $L_i \to T_i$, $i = 1, 2, \ldots, m$, and that taking the union of $L_i$ and $L_j$ will not result in any new candidate key that cannot be found from $L_i$ or $L_j$. The algorithm uses the transitive property repeatedly to find the minimum $\alpha_i$ such that $L_i \cup \alpha_i \to \Omega$. Before giving the algorithm, we need the following theorem, which asserts that for each $\alpha_i$ found as explained in the algorithm, $L_i \cup \alpha_i \to \Omega$.

Theorem 2: If $L \to \Omega$, and if $T_i' \neq \emptyset$, then $L_i \cup (T_i' \cap L) \to \Omega$.

Proof: By Lemmas 3 - 5, $T_i' \cap L \neq \emptyset$. Partition $L$ into $L^1 \cup L^2$ such that $L^1 \subseteq L_i \cup T_i$ and $L^2 \subseteq T_i'$. This is possible since $L_i T_i T_i'$ is a partition on $\Omega$ and $L^2 = T_i' \cap L \neq \emptyset$. Now $L_i \to L_i \cup T_i \to L^1$ by projectivity (P4). Also, $T_i' \cap L^2 = L^2 \to L^2$ by reflexivity (P1). Since $T_i' \cap L^2 \subseteq T_i' \cap L$, $T_i' \cap L \to L^2$ by augmentation (P3). So, by additivity (P5), $L_i \cup (T_i' \cap L) \to L^1 \cup L^2 = L$, and since $L \to \Omega$, $L_i \cup (T_i' \cap L) \to \Omega$ by transitivity (P2).

4. FINDING THE CANDIDATE KEYS

1. Form $P^*$ with row labels $L_i$, $i = 1, \ldots, m$.
2. $T1 = \emptyset$ and $T2 = \emptyset$.
3. For $i = 1$ to $m$ enter into T1 each $L_i \cup T_i'$ where $|T_i'| \geq 0$.
4. If $L_i \cup T_i' \subseteq L_j \cup T_j'$ and if $|T_j'| \leq 1$ and $|T_i'| \leq 1$, then delete $L_j \cup T_j'$ from T1.
5. If $|T_i'| \leq 1$ for all remaining entries in T1, then terminate the algorithm. T1 contains all candidate keys. Else
   a. For all $i$ in T1 such that $|T_i'| \geq 2$ form $\alpha_{ij} = T_i' \cap (L_j \cup T_j')$ for all $j \neq i$ where $|T_j'| \geq 0$.
   b. For each $i$, delete any $\alpha_{ij}$ such that $\alpha_{ij} \supset \alpha_{ik}$, $j \neq k$.
   c. For each remaining $\alpha_{ij}$ enter $L_i \cup \alpha_{ij}$ into T2. Then delete from T2 any set which is a superset of any other set in T1.
   d. For all $i$ in T1 such that $|T_i'| \geq 2$ and all $L_j \cup \alpha_{jk}$ in T2 for which $L_i \neq L_j$ form $\alpha_{ik} = T_i' \cap (L_j \cup \alpha_{jk})$.
   e. For each $i$, delete any $\alpha_{ik}$ such that $\alpha_{ik} \supset \alpha_{ik'}$, $k \neq k'$.
   f. Enter into T2 any $L_i \cup \alpha_{ik}$ which is not a superset of a set already in T2. Then delete from T2 any set which is a superset of any other set in T2. If any new entries are thus created in T2, repeat from step 5d. Otherwise go to step 6.
6. Copy from T1 into T2 any sets $L_i \cup T_i'$ in T1 where $|T_i'| \leq 1$.
7. Delete from T2 any set which is a superset of another set in T2. The remaining sets in T2 are all of the candidate keys and the algorithm terminates.
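When applying the steps above by hand, it can help to have an independent check of the result. The sketch below is not the algorithm of this section: it is a brute-force search, exponential in the number of attributes, that tests Definition 4 directly using the standard attribute-closure iteration. The names `closure` and `candidate_keys` are hypothetical, and the data are the functional relations of Example 1 with the second relation read as $AB \to CF$.

```python
from itertools import combinations


def closure(attrs, relations):
    """All attributes implied by `attrs` (standard closure iteration)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in relations:
            if set(left) <= result and not set(right) <= result:
                result |= set(right)
                changed = True
    return result


def candidate_keys(attributes, relations):
    """Enumerate subsets by increasing size and keep the minimal ones implying everything."""
    everything = set(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue        # contains a smaller key, so it is not minimal
            if closure(candidate, relations) == everything:
                keys.append(candidate)
    return keys


relations = [("ABC", "DEG"), ("AB", "CF"), ("CD", "EF"), ("EG", "AC")]
keys = candidate_keys("ABCDEFG", relations)
print(["".join(sorted(k)) for k in keys])
```

Because subsets are visited in order of increasing size, any set that properly contains a key is skipped, so every set that is kept satisfies the minimality condition of Definition 4.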

It remains only to show that the algorithm is complete, in the sense that it finds all the candidate keys. Note that if in $P^*$, $L_i \to \Omega$, then for $|\alpha_i| = 0$, $L_i$ is a candidate key only if no subset of $L_i$ implies $\Omega$. Therefore, to show completeness we need only show that if $L_i \cup \alpha_i$ is a candidate key, $|\alpha_i| \geq 1$, then $L_i \cup \alpha_i$ is found by the algorithm. The following completeness theorem will show this. Assume that the set of functional relations has more than one element in it, otherwise it is a trivial case.

Theorem 3: When the algorithm terminates, every set remaining in T2 is a candidate key. Further, no candidate keys are missed (not in T2). That is, if $L_i \cup \alpha_i$, $|\alpha_i| \geq 1$, is a candidate key, then $\alpha_i = T_i' \cap (L_j \cup \beta_j)$, $|\beta_j| \geq 0$, for some $j \neq i$.

Proof: By construction, the entries in T1 satisfy $L_i \cup T_i' \to \Omega$. By Theorem 2, all entries in T2 satisfy $L_i \cup \alpha_i \to \Omega$, and step 7 of the algorithm assures that after termination no subset of a set in T2 implies $\Omega$; so the first statement of this theorem is true.

To prove the second statement, note that if $L_i \cup \alpha_i$ is a candidate key, then $L_i \cup \alpha_i \to \Omega$. But since $L_i \nrightarrow \Omega$, this implies that there exists a subset $S_i \subseteq L_i \cup T_i \cup \alpha_i$ such that $S_i \to T_i'$, where $S_i = \alpha_i \cup [S_i \cap (L_i \cup T_i)]$ and $S_i = \alpha_i$ only if $S_i \cap (L_i \cup T_i) = \emptyset$. Note that there is always at least one such $S_i$, namely $S_i = L_i \cup \alpha_i$. Also $\alpha_i = S_i \cap T_i'$. So, if $S_i \to T_i'$, we must show that there exists a $j \neq i$ such that $S_i = L_j \cup \beta_j \to T_i'$, except in the trivial case when there is only one functional relation, which would give $S_i = L_i \cup \alpha_i \to T_i'$.

If $L_j \to T_i'$, then let $S_i = L_j$. We do this for each $L_j$ such that $L_j \to T_i'$ and then choose the minimal $\alpha_i$ (no superset of $\alpha_i$ is chosen), where $\alpha_i = T_i' \cap L_j$ and $L_i \cup \alpha_i \to \Omega$. If $L_j \nrightarrow T_i'$, then for some $\beta_j$, $L_j \cup \beta_j \to T_i'$, and for some $\beta_j'$ we have $L_j \cup \beta_j \cup \beta_j' \to T_i' \cup L_j \cup \beta_j \cup \beta_j' = \Omega$ by augmentation (P3) and additivity (P5), where $\beta_j' \subseteq L_i \cup T_i$. Therefore $\beta_j' \cap T_i' = \emptyset$ since $L_i T_i T_i'$ is a partition on $\Omega$. If we let $S_i = L_j \cup \beta_j \cup \beta_j'$, then $T_i' \cap (L_j \cup \beta_j \cup \beta_j') = T_i' \cap (L_j \cup \beta_j) = T_i' \cap S_i = \alpha_i$. Again, we do this for each $L_j \cup \beta_j$ such that $L_j \cup \beta_j \to T_i'$. But this is exactly how the algorithm iteratively finds the different $\alpha_i$ so that $L_i \cup \alpha_i \to \Omega$ and then chooses the minimal of these $\alpha_i$.

5. EXAMPLE

Example 8: We will find the candidate keys of the relation in Example 1. It will be presented in a somewhat optimized manner, but the steps still agree with the algorithm.

T1: $AB$; $ABC$; $CD \cup ABG$; $EG \cup BDF$

We can delete $ABC \supseteq AB$ from T1, so

T1: $AB$; $CD \cup ABG$; $EG \cup BDF$

Apply 5a - 5f of the algorithm as follows:

$CD \cup [ABG \cap (EG \cup BDF)] = CD \cup BG$
$CD \cup (ABG \cap AB) = CD \cup AB$
$EG \cup [BDF \cap (CD \cup ABG)] = EG \cup BD$
$EG \cup (BDF \cap AB) = EG \cup B$

Therefore T2 holds:

T2: $EG \cup B$; $CD \cup AB$; $CD \cup BG$.

Now intersect every subset $T_i'$, where $|T_i'| \geq 2$, in T1 with every subset in T2, then add this intersection to $L_i$ of $L_i \cup T_i'$ and put the resulting subset in T2 if it is not a superset of a set already in T2. We get:

$BDF \cap CDAB = BD$ gives $EG \cup BD$ (superset, do not put in T2)
$BDF \cap CDBG = BD$ gives $EG \cup BD$ (superset, do not put in T2)
$ABG \cap EGB = BG$ gives $CD \cup BG$ (already in T2)

We have intersected every $T_i'$, $|T_i'| \geq 2$, in T1 with every subset in T2 and no new subsets were added to T2, so we go to step 6 of the algorithm. Applying step 6, we get

T2: $AB$; $EGB$; $CDAB$; $CDBG$

then applying step 7, we get the candidate keys:

T2: $AB$; $EGB$; $CDBG$.

6. CONCLUSION

In this paper we have shown that the functional relations enjoy a rich algebraic and set theoretic structure and that the candidate keys could be studied within the framework of the implication matrix, which is a familiar algebraic approach. The algorithm as presented in Section 4 is not written in a ready-to-program notation, so care must be taken in choosing the appropriate notation for actual programming. Also, optimization is possible and Example 8 in Section 5 is only one such way. We have not programmed the algorithm, so no comparison criteria are possible between our approach and that of Delobel and Casey. Using our new approach, we are currently studying the effect of adding or deleting attributes on the original candidate keys and hope to report on our findings in a future paper.

REFERENCES

[1] W. W. Armstrong, "Dependency Structures of Data Base Relationships", Information Processing 74, North Holland Publishing Company, 1974, 580 - 583.
[2] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Vol. 13, No. 6, 1970, 377 - 387.
[3] E. F. Codd, "Normalized Data Base Structure: A Brief Tutorial", Proc. 1971 ACM-SIGFIDET Workshop on Data Description, Access and Control, available from ACM, New York.
[4] E. F. Codd, "Further Normalization of the Data Base Relational Model", Courant Computer Science Symposia 6, "Data Base Systems", Prentice Hall, Englewood Cliffs, N. J., 1971, 33 - 64.
[5] C. J. Date, "Relational Data Base Systems: A Tutorial", Fourth International Symposium on Computer and Information Sciences, Miami Beach, Florida, Plenum Press.
[6] C. Delobel, R. G. Casey, "Decomposition of a Data Base and the Theory of Boolean Switching Functions", IBM Journal of Research and Development, Vol. 17, No. 5, Sept. 1973, 374 - 386.
[7] W. Kent, "A Primer of Normal Forms", Technical Report TR02.600, IBM System Development Division, San Jose, California, December 17, 1973.
