CPT105/Databases

Data and Information

Data is unprocessed items, whereas information is processed data, organized, meaningful and useful.

The importance of data for companies can hardly be overestimated. Data must

have integrity, and
be kept secure.

File Processing versus Databases

A database contains all data, which many programs or users share. This data is processed by some DBMS (Database Management System) into Information. The DBMS is also used to create and to maintain (adding, changing, deleting records) the database. Users may only work with the front-end, a program that interacts with the DBMS.

Before working with databases, the file processing approach was used. Each departmant had its own files. Data redundancy (wastes space and time and increases the probability for errors) and isolated data---the inability to match data maintained in different departments---were the disadvantages ofthis approach.

Some features of DBMSs

A data dictionary or repository contains information about each file, as number of fields, field names, type of each filed,validitation rules for validitation checks, default entries in each field, number of records.
Tools for file retrieval (queries) and maintenance (adding, changing, removing records).
A query language is used to retrieve data.Very often it is SQL (structured query language), a special programming language. Examples of database programs that use SQL are Access or Oracle. Many DBMSs have also a QBE (query be example) feature.
Forms to enter or change data.
Reports and report generators.
Security (Access privileges), backup, and recovery features.

Relational databases

Nowadays the most commonly used database models are:

relational: as described below. Examples are Access, MySql, Visual FoxPro.
object-oriented: Uses the object-oriented paradigm introduced recently for programming languages. You can store more types of data there. Examples of object-relational databases (hybrids) are DB2 and Oracle.
multidimensional. Tables used in relational databases are two-dimensional, but multidimensinal databases represent data in a multidimensional "hypercube". A hospital database could have the dimensions time, procedure type, patient, hospital, physician, and diagnosis. These databases can consolidate data very quickly, and are used in data warehousing and data mining. Examples are D³ and Oracle Express.

We will concentrate in the following on relational databases, that have been introduced by E.F. Codd in 1970 at IBM. A relational database is a collection of related tables. Each table has different records as rows and fields as columns. See here an example of a table, let's call it the book table:

Author's last name	Author's first name	Title	Year	Pages	Publisher	ISBN
Sayers	Dorothy	Gaudy Night	1935	446	Hodder and Stoughton	0450021548
Grimes	Martha	Biting the Moon	1999	436	Headline	0747268061
Bernhard	William	Deadly Justice	1993	306	Ballantine Books	0345380274
Bernhard	William	Cruel Justice	1996	466	Ballantine Books	0345408039

A primary key is any field that uniquely identifies the record. Though in this small table, the title field looks like a good candidate, the only primary key of the table is the ISBN number.

Now look at the author table:

Author's last name	Author's first name	Born at	Author number
Sayers	Dorothy	1893	1
Grimes	Martha	?	2
Bernhard	William	?	3
Grafton	Sue	1940	4

Again we introduced a primary key---the author number.

How could we link these two tables? Adding an ISDN number field into the second table is not possible, but adding the author number field into the first is. If we do this, then the author number field in the book table is called a foreign key. Both tables are related, since they have a common field which is the primary key in one and a foreign key in the other.

Why do we use 2 related tables? Couldn't we squeeze the same information into one larger table? We could, but assume we have 100 authors, and assume each author has written 30 books. Assume both tables have 10 fields. Then the author table has 100*10 = 1000 data items and the book table has 3000*10 = 30,000 data items Together we have 31,000 items, compared to maybe 3000*19 = 57,000 items (remember, one field at least would be in common). Things would get much worse if we have more related tables.

Therefore one reason for using relational databases is to avoid data redundancy. The second reason is that if we have to change one data item value in one of the related tables (like second name), we would have to change many data items in the large supertable. Unless hierarchical or network databses, relational databases do not have to be application specific. They provide data independence---the ability to exist independently of specific applications.

Organizing your Data in

D a t a b a s e s

Data and Information

File Processing versus Databases