
Introduction to Databases

Daniela Puiu
Applications Specialist
Center for the Study of Biological Complexity, VCU
dpuiu@vcu.edu
804-827-0952

General Concepts
Database definition
Organized collection of logically related data

Data
Known facts
Types: text, graphics, images, sound, videos

Database management system (DBMS)


Software package for defining and managing a database

Database Examples
Class roster
Hospital patients
Literature (published articles in a certain field)
Genomic information
Protein structure
Taxonomy
Single nucleotide polymorphism

Example: Microbial Database


Data about the protein coding regions in the microbial genomes sequenced so far.
Organism: Name, Accession number, Genome size, GC%, Release date, Genome center, Sequence
Gene (protein coding regions): Name, Accession number, Organism, Location on the chromosome (start, end), Strand, Size, Product, Sequence

Database Models
Flat files (1960s)
Hierarchical (1960s)
Network (1970s)
Relational (1980s)
Object oriented (1990s)
Object relational (1990s)
Web enabled (1990s)

Database Types (cont.)


Type        Typical number of users  Typical architecture              Typical size
Personal    1                        Desktop/Laptop/PDA                MB
Workgroup   5-25                     Client/server: 2 tier             MB-GB
Department  25-100                   Client/server: 3 tier             GB
Enterprise  >100                     Client/server: distributed        GB-TB
Internet    >1000                    Web server & application servers  MB-GB

Flat Files
Characteristics:
Data is stored as records in regular files
Records usually have a simple structure and a fixed number of fields
May support indexing of record fields for fast access
No mechanisms for relating data between files
Special programs are needed to access and manipulate the data
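The characteristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tab-separated record layout and the helpers `build_index` and `fetch` are hypothetical, not the format or API of any real tool.

```python
import io

# Hypothetical flat file: one record per line, three tab-separated fields
# (gene name, start, end). An in-memory file stands in for a file on disk.
FLAT_FILE = io.StringIO(
    "thrL\t190\t255\n"
    "thrA\t337\t2799\n"
)

def build_index(handle):
    """Map the first field of each record to the record's offset in the file."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        index[line.split("\t")[0]] = offset
    return index

def fetch(handle, index, key):
    """Jump straight to a record through the index instead of scanning the file."""
    handle.seek(index[key])
    name, start, end = handle.readline().rstrip("\n").split("\t")
    return name, int(start), int(end)

index = build_index(FLAT_FILE)
record = fetch(FLAT_FILE, index, "thrA")
```

Note that the "special program" here has to know the field order and types itself; nothing in the file describes them.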

Flat Files Example


Microbial database:
GenBank format: Escherichia coli K12, Streptococcus pneumoniae R6
FASTA format: multiple files
Escherichia coli K12: genome, genes, gene positions
Streptococcus pneumoniae R6: genome, genes, gene positions

Data manipulation:
Sequence extraction, search
Indexing
Format conversion
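As a rough sketch of what such data manipulation programs do, the Python below parses FASTA-formatted text and extracts a subsequence. The function names and the toy records are made up for illustration, and the 1-based inclusive coordinates are an assumption about how positions are counted.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

def extract(sequence, start, end):
    """Extract a subsequence using 1-based, inclusive coordinates (assumed)."""
    return sequence[start - 1:end]

# Toy input; headers and sequences are invented for this example.
EXAMPLE = """>geneA
MKRI
SLLV
>geneB
MRVL
"""
genes = parse_fasta(EXAMPLE)
```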

Relational Database
Characteristics:
Data is organized into tables: rows & columns
Each row represents an instance of an entity
Each column represents an attribute of an entity
Metadata describes each table column
Relationships between entities are represented by values stored in the columns of the corresponding tables (keys)
Accessible through Structured Query Language (SQL)

Enterprise data model


Graphical representation of the high level entities
Example: Microbial database
Each organism has multiple corresponding genes (one-to-many relation)

Organism (1) ---- (m) Gene

Metadata
Data that describes the properties or characteristics of other data
Does not include sample data
Allows database designers and users to understand the meaning of the data

Metadata & Data Table


Organism
Metadata:
Name       Type          Max Length  Description
Name       Alphanumeric  100         Organism name
Size       Integer       10          Genome length (bases)
Gc         Float         5           Percent GC
Accession  Alphanumeric  10          Accession number
Release    Date          8           Release date
Center     Alphanumeric  100         Genome center name
Sequence   Alphanumeric  Variable    Sequence

Data:
Name                         Size       Gc  Accession  Release     Center                 Sequence
Escherichia coli K12         4,640,000  50  NC_000913  09/05/1997  Univ. Wisconsin        AGCTTTTCATT
Streptococcus pneumoniae R6  2,040,000  40  NC_003098  09/07/2001  Eli Lilly and Company  TTGAAAGAAAA
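One way to see how metadata differs from data is to write the Organism metadata down as a structure and check a data row against it. The sketch below is illustrative only: the mapping of Alphanumeric/Integer/Float/Date to Python types, the `validate` helper, and the deliberately broken `bad_row` are all assumptions made for the example.

```python
from datetime import date

# Metadata for the Organism table:
# (column name, Python type, max length or None for variable length).
ORGANISM_METADATA = [
    ("Name", str, 100),
    ("Size", int, 10),
    ("Gc", float, 5),
    ("Accession", str, 10),
    ("Release", date, 8),
    ("Center", str, 100),
    ("Sequence", str, None),
]

def validate(row, metadata):
    """Return the names of columns whose value breaks the declared type or length."""
    problems = []
    for (name, expected_type, max_length), value in zip(metadata, row):
        if not isinstance(value, expected_type):
            problems.append(name)
        elif (max_length is not None
              and expected_type in (str, int)  # length is only checked for text and integers
              and len(str(value)) > max_length):
            problems.append(name)
    return problems

good_row = ("Escherichia coli K12", 4640000, 50.0, "NC_000913",
            date(1997, 9, 5), "Univ. Wisconsin", "AGCTTTTCATT")
# Invented faulty row: Size given as text, Accession longer than 10 characters.
bad_row = ("Escherichia coli K12", "4,640,000", 50.0, "NC_000913-too-long",
           date(1997, 9, 5), "Univ. Wisconsin", "AGCTTTTCATT")
```

The metadata list contains no sample data, yet it is enough to tell a well-formed row from a malformed one.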

Metadata & Data Table (cont.)


Gene
Metadata:
Name        Type          Max Length  Description
Name        Alphanumeric  100         Gene name
Accession   Alphanumeric  10          Gene accession number
OAccession  Alphanumeric  10          Organism accession number
Start       Integer       10          Gene start
End         Integer       10          Gene end
Strand      Character     1           Gene strand
Product     Alphanumeric  1000        Gene annotation
Sequence    Alphanumeric  Variable    Gene sequence

Data:
Name           Accession  OAccession  Start  End    Strand  Product                     Sequence
thrL           16127995   NC_000913   190    255    +       thr operon leader peptide   MKRI
thrA           16127996   NC_000913   337    2799   +       homoserine dehydrogenase I  MRVL
transposase_A  15902058   NC_003098   20207  20554  +       transposase                 MWYN

Relationships
Used to connect tables
Field(s) that have the same value in the related tables
Organism.Accession=Gene.OAccession
Organism.Accession: unique, primary key
Gene.OAccession: not unique, secondary key (a foreign key that refers to Organism.Accession)
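A DBMS can enforce such a relationship. The sketch below uses Python's built-in `sqlite3` module purely as a convenient stand-in for a DBMS; the trimmed-down table definitions, the `try_insert` helper, and the accession `NC_999999` are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE Organism (Name TEXT, Accession TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE Gene (Name TEXT, Accession TEXT PRIMARY KEY, "
    "OAccession TEXT REFERENCES Organism(Accession))"
)
conn.execute("INSERT INTO Organism VALUES ('Escherichia coli K12', 'NC_000913')")
# Two genes share one OAccession value: the "many" side of the relation.
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?, ?)",
    [("thrL", "16127995", "NC_000913"), ("thrA", "16127996", "NC_000913")],
)

def try_insert(sql):
    """Return True if the statement succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# A second organism with the same accession violates the primary key.
duplicate_organism = try_insert("INSERT INTO Organism VALUES ('copy', 'NC_000913')")
# A gene pointing at an organism that does not exist violates the foreign key.
orphan_gene = try_insert("INSERT INTO Gene VALUES ('geneX', '1', 'NC_999999')")
gene_count = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
```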

SQL
ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. SQL statements are used to retrieve and update data in a database. Includes:
Data Manipulation Language (DML)
Data Definition Language (DDL)

Data Manipulation Language


Syntax for executing queries, updating, inserting, and deleting records.
SELECT - extracts data from one or more tables
INSERT INTO - inserts new data into a table
UPDATE - updates data in a table
DELETE FROM - deletes data from a table

DML Example
Select all Escherichia coli K12 genes that lie in the 1 Mb to 2 Mb region of the chromosome:
SELECT *
FROM Organism, Gene
WHERE Organism.Name='Escherichia coli K12'
  AND Organism.Accession=Gene.OAccession
  AND Gene.Start>=1000000
  AND Gene.End<=2000000;
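To try a query of this shape without installing a database server, one option is Python's built-in `sqlite3` module. The sketch below is not the course setup: the tables are cut down to the columns the query needs, and the rows `geneX` and `geneY` are hypothetical, added only so the position filter has something to accept and reject. `Start` and `End` are double-quoted because they can clash with SQL keywords in some systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Organism (Name TEXT, Accession TEXT PRIMARY KEY);
CREATE TABLE Gene (Name TEXT, Accession TEXT, OAccession TEXT,
                   "Start" INTEGER, "End" INTEGER);
INSERT INTO Organism VALUES ('Escherichia coli K12', 'NC_000913');
INSERT INTO Organism VALUES ('Streptococcus pneumoniae R6', 'NC_003098');
INSERT INTO Gene VALUES ('thrL', '16127995', 'NC_000913', 190, 255);
INSERT INTO Gene VALUES ('transposase_A', '15902058', 'NC_003098', 20207, 20554);
-- Hypothetical rows, not real annotations.
INSERT INTO Gene VALUES ('geneX', 'X1', 'NC_000913', 1500000, 1501000);
INSERT INTO Gene VALUES ('geneY', 'Y1', 'NC_003098', 1500000, 1501000);
""")

# String literals are passed as parameters; numbers are written without commas.
QUERY = """
SELECT Gene.Name
FROM Organism, Gene
WHERE Organism.Name = ?
  AND Organism.Accession = Gene.OAccession
  AND Gene."Start" >= 1000000
  AND Gene."End" <= 2000000
"""
names = [row[0] for row in conn.execute(QUERY, ("Escherichia coli K12",))]
```

Only `geneX` passes both the organism condition and the position condition; `thrL` is outside the region and `geneY` belongs to the other organism.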

DML Example (cont.)


INSERT INTO Gene (Name, Accession, OAccession, Start, End, Strand, Product, Sequence)
VALUES ('thrL', '16127995', 'NC_000913', 190, 255, '+', 'thr operon leader peptide', 'MKRI');
UPDATE Gene SET Start=160 WHERE Accession='16127995';
DELETE FROM Gene WHERE Accession='16127995';
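The three statements can be exercised end to end with Python's built-in `sqlite3` module, used here only as a stand-in DBMS. In this sketch the column list names all eight Gene columns, text values are passed as parameters, and the WHERE clauses filter on the gene's own accession number (16127995 for thrL); those choices are assumptions of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Gene (Name TEXT, Accession TEXT, OAccession TEXT, '
    '"Start" INTEGER, "End" INTEGER, Strand TEXT, Product TEXT, Sequence TEXT)'
)

# INSERT: one value per listed column, in the same order.
conn.execute(
    'INSERT INTO Gene (Name, Accession, OAccession, "Start", "End", '
    "Strand, Product, Sequence) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("thrL", "16127995", "NC_000913", 190, 255, "+",
     "thr operon leader peptide", "MKRI"),
)
after_insert = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]

# UPDATE: change one column of the rows matched by WHERE.
conn.execute('UPDATE Gene SET "Start" = 160 WHERE Accession = ?', ("16127995",))
after_update = conn.execute(
    'SELECT "Start" FROM Gene WHERE Accession = ?', ("16127995",)
).fetchone()[0]

# DELETE: remove the rows matched by WHERE.
conn.execute("DELETE FROM Gene WHERE Accession = ?", ("16127995",))
after_delete = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
```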

Data Definition Language


Syntax for creating, editing, and deleting:
Databases
Tables
Views
Indexes
Constraints
Users
Privileges

DDL Examples
CREATE DATABASE Microbial;
CREATE TABLE Organism (
  Name varchar(100),
  Size int(10),
  Gc decimal(5),
  Accession varchar(10),
  Release date,
  Center varchar(100));
ALTER TABLE Organism ADD Sequence text;
DROP TABLE Organism;
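DDL syntax differs between systems, so the following is a sketch for SQLite only, driven from Python's built-in `sqlite3` module. SQLite has no CREATE DATABASE statement (opening a connection plays that role), column definitions are separated by commas, and `Release` is double-quoted because it can clash with an SQL keyword. The `table_columns` helper is hypothetical.

```python
import sqlite3

# Opening a connection stands in for CREATE DATABASE in SQLite.
conn = sqlite3.connect(":memory:")

def table_columns(name):
    """Return the column names of a table, or [] if the table does not exist."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({name})")]

conn.execute(
    'CREATE TABLE Organism (Name VARCHAR(100), Size INTEGER, Gc DECIMAL(5), '
    'Accession VARCHAR(10), "Release" DATE, Center VARCHAR(100))'
)
created = table_columns("Organism")

# ALTER TABLE adds the variable-length Sequence column afterwards.
conn.execute("ALTER TABLE Organism ADD COLUMN Sequence TEXT")
altered = table_columns("Organism")

# DROP TABLE removes the table definition together with its data.
conn.execute("DROP TABLE Organism")
dropped = table_columns("Organism")
```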

DBMS
Software package for defining and managing a database. Examples:
Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase
Open source: MySQL, PostgreSQL

DBMS Advantages
Program-data independence
Minimal data redundancy
Improved data consistency & quality
  Access control
  Transaction control
Improved accessibility & data sharing
Increased productivity of application development
Enforced standards
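Transaction control is one of the more concrete advantages in this list: a group of changes either all take effect or none do. A minimal sketch with Python's built-in `sqlite3` module follows; the two-column Gene table and the row named `duplicate` are simplifications invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (Name TEXT, Accession TEXT PRIMARY KEY)")
conn.execute("INSERT INTO Gene VALUES ('thrL', '16127995')")
conn.commit()

try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT INTO Gene VALUES ('thrA', '16127996')")
        # Reusing an existing primary key raises IntegrityError,
        # which aborts the whole transaction, including the thrA insert.
        conn.execute("INSERT INTO Gene VALUES ('duplicate', '16127995')")
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in conn.execute("SELECT Name FROM Gene ORDER BY Name")]
```

After the failed transaction the table still holds only the row that was committed before it, so the data stays consistent.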

Web Databases
Data is accessible through the Internet
Have different underlying database models
Example: biological databases
Molecular data: NCBI, Swissprot, PDB, GO
Protein interaction: DIP, BIND
Organism specific: Mouse, Worm, Yeast
Literature: PubMed
Disease

CSBC Resources
Database and software list
Molecular databases: GenBank, EMBL, NR, NT, RefSeq, Swissprot
DBMS: MS Excel, MS Access, MySQL, PostgreSQL

Computer resources
watson.vcu.edu: 8 processor Sun server
medusa.vcu.edu: 64 processor Beowulf cluster
