Logical database design with entity-relationship model 84
Logical database design with Unified Modeling Language 99
Physical database design 101
For more information 106
Practice exam questions 108
Answers to practice exam questions 110
Exam topics that this chapter covers
Working with DB2 UDB objects:
Ability to demonstrate use of DB2 UDB data types
Knowledge to identify characteristics of a table, view, or index
When you design any sort of database, you need to answer many different questions. The same is true when you are designing a DB2 database. How will you organize your data? How will you create relationships between tables? How should you define the columns in your tables? What kind of table space should you use?
To design a database, you perform two general tasks. The first task is logical data modeling, and the second task is physical data modeling. In logical data modeling, you design a model of the data without paying attention to specific functions and capabilities of the DBMS that will store the data. In fact, you could even build a logical data model without knowing which DBMS you will use. Next comes the task of physical data modeling. This is when you move closer to a physical implementation. The primary purpose of the physical design stage is to optimize performance while ensuring the integrity of the data.
This chapter begins with an introduction to the task of logical data modeling. The logical data modeling section focuses on the entity-relationship model and provides an overview of the Unified Modeling Language (UML). The chapter ends with the task of physical database design.
After completing the logical and physical design of your database, you implement the design. You can read about this task in "Chapter 7. Implementing your database design."
Logical database design with entity-relationship model
Before you implement a database, you should plan or design it so that it satisfies all requirements. This section introduces the first task of designing a database: logical design.
Modeling your data
Designing and implementing a successful database, one that satisfies the needs of an organization, requires a logical data model. Logical data modeling is the process of documenting the comprehensive business information requirements in an accurate and consistent format. Analysts who do data modeling define the data items and the business rules that affect those data items. The process of data modeling acknowledges that business data is a vital asset that the organization needs to understand and carefully manage. This section contains information that was adapted from Handbook of Relational Database Design.
Consider the following business facts that a manufacturing company needs to represent in its data model:
Customers purchase products.
Products consist of parts.
Suppliers manufacture parts.
Warehouses store parts.
Transportation vehicles move the parts from suppliers to warehouses and then to manufacturers.
These are all business facts that a manufacturing company's logical data model needs to include. Many people inside and outside the company rely on information that is based on these facts. Many reports include data about these facts.
Any business, not just manufacturing companies, can benefit from the task of data modeling. Database systems that supply information to decision makers, customers, suppliers, and others are more successful if their foundation is a sound data model.
An overview of the data modeling process
You might wonder how people build data models. Data analysts can perform the task of data modeling in a variety of ways. (This process assumes that a data analyst is performing the steps, but some companies assign this task to other people in the organization.) Many data analysts follow these steps:
Build critical user views.
Analysts begin building a logical data model by carefully examining a single business activity or function. They develop a user view, which is the model or representation of critical information that the business activity requires. (In a later stage, the analyst combines each individual user view with all the other user views into a consolidated logical data model.) This initial stage of the data modeling process is highly interactive. Because data analysts cannot fully understand all areas of the business that they are modeling, they work closely with the actual users. Working together, analysts and users define the major entities (significant objects of interest) and determine the general relationships between these entities.
Add keys to user views.
Next, analysts add key detailed information items and the most important business rules. Key business rules affect insert, update, and delete operations on the data.
Example: A business rule might require that each customer entity have at least one unique identifier. Any attempt to insert or update a customer identifier that matches another customer identifier is not valid. In a data model, a unique identifier is called a primary key, which you read about in "Primary keys" on page 50.
Add detail to user views and validate them.
After the analysts work with users to define the key entities and relationships, they add other descriptive details that are less vital. They also associate these descriptive details, called attributes, to the entities.
Example: A customer entity probably has an associated phone number. The phone number is a nonkey attribute of the customer entity.
Analysts also validate all the user views that they have developed. To validate the views, analysts use the normalization process (which you can read about later in this chapter) and process models. Process models document the details of how the business will use the data. You can read more about process models and data models in other books on those subjects.
Determine additional business rules that affect attributes.
Next, analysts clarify the data-driven business rules. Data-driven business rules are constraints on particular data values, which you read about in "Referential integrity and referential constraints" on page 54. These constraints need to be true, regardless of any particular processing requirements. Analysts define these constraints during the data design stage, rather than during application design. The advantage to defining data-driven business rules is that programmers of many applications don't need to write code to enforce these business rules.
Example: Assume that a business rule requires that a customer entity have a phone number, an address, or both. If this rule doesn't apply to the data itself, programmers must develop, test, and maintain applications that verify the existence of one of these attributes.
Data-driven business requirements have a direct relationship with the data, thereby relieving programmers from extra work.
Integrate user views.
In this last phase of data modeling, analysts combine into a consolidated logical data model the different user views that they have built. If other data models already exist in the organization, the analysts integrate the new data model with the existing ones. At this stage, analysts also strive to make their data model flexible so that it can support the current business environment and possible future changes.
Example: Assume that a retail company operates in a single country and that business plans include expansion to other countries. Armed with knowledge of these plans, analysts can build the model so that it is flexible enough to support expansion into other countries.
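The steps above mention two rules that can live with the data rather than in every application: a customer identifier must be unique, and a customer must have a phone number, an address, or both. The following is a minimal sketch of that idea, not a DB2 implementation. It assumes Python's built-in sqlite3 module as a stand-in DBMS, and the table name, column names, and sample values are hypothetical.

```python
import sqlite3

# SQLite (in memory) is only a stand-in here; it is not DB2.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE CUSTOMER (
        CUSTOMER_NUMBER  INTEGER PRIMARY KEY,
        CUSTOMER_PHONE   TEXT,
        CUSTOMER_ADDRESS TEXT,
        -- Data-driven rule: a phone number, an address, or both.
        CHECK (CUSTOMER_PHONE IS NOT NULL OR CUSTOMER_ADDRESS IS NOT NULL)
    )
    """
)


def try_insert(number, phone, address):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO CUSTOMER VALUES (?, ?, ?)", (number, phone, address)
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, "4085550100", None),  # phone only: accepted
    try_insert(2, None, "12 Main St"),  # address only: accepted
    try_insert(3, None, None),          # neither: rejected by the CHECK rule
    try_insert(1, "4085550199", None),  # duplicate identifier: rejected
]
print(results)
```

Because the constraints are part of the table definition, every program that inserts customers gets the same checks without carrying its own validation code.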
Recommendations for logical data modeling
To build sound data models, analysts follow a well-planned methodology, which includes these tasks:
Work interactively with the users as much as possible.
Use diagrams to represent as much of the logical data model as possible.
Build a data dictionary to supplement the logical data model diagrams. (A data dictionary is a repository of information about an organization's application programs, databases, logical data models, users, and authorizations. A data dictionary can be manual or automated.)
Data modeling: Some practical examples
"An overview of the data modeling process" on page 85 summarizes the key activities in data modeling. This section shows how you might perform these activities in real life.
You begin by defining your entities, the significant objects of interest. Entities are the things about which you want to store information. For example, you might want to define an entity, called EMPLOYEE, for employees because you need to store information about everyone who works for your organization. You might also define an entity, called DEPARTMENT, for departments.
Next, you define primary keys for your entities. A primary key is a unique identifier for an entity. In the case of the EMPLOYEE entity, you probably need to store lots of information. However, most of this information (such as gender, birth date, address, and hire date) would not be a good choice for the primary key. In this case, you could choose a unique employee ID or number (EMPLOYEE_NUMBER) as the primary key. In the case of the DEPARTMENT entity, you could use a unique department number (DEPARTMENT_NUMBER) as the primary key.
After you have decided on the entities and their primary keys, you can define the relationships that exist between the entities. The relationships are based on the primary keys. If you have an entity for EMPLOYEE and another entity for DEPARTMENT, the relationship that exists is that employees are assigned to departments. You can read more about this topic in the next section.
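As a rough illustration of how a relationship rests on primary keys, the sketch below declares EMPLOYEE and DEPARTMENT with the keys named above and ties them together with a reference from employee to department. It assumes Python's built-in sqlite3 module as a stand-in (not DB2), and the key values "D11", "Z99", "000010", and "000020" are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE DEPARTMENT (
        DEPARTMENT_NUMBER TEXT PRIMARY KEY
    );
    CREATE TABLE EMPLOYEE (
        EMPLOYEE_NUMBER   TEXT PRIMARY KEY,
        -- "Employees are assigned to departments."
        DEPARTMENT_NUMBER TEXT NOT NULL
            REFERENCES DEPARTMENT (DEPARTMENT_NUMBER)
    );
    """
)


def accepted(sql, params):
    """Return True if the statement succeeds, False on a constraint error."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


ok_department = accepted("INSERT INTO DEPARTMENT VALUES (?)", ("D11",))
ok_employee = accepted("INSERT INTO EMPLOYEE VALUES (?, ?)", ("000010", "D11"))
# No department "Z99" exists, so this assignment is refused.
orphan_employee = accepted("INSERT INTO EMPLOYEE VALUES (?, ?)", ("000020", "Z99"))
print(ok_department, ok_employee, orphan_employee)
```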
After defining the entities, their primary keys, and their relationships, you can define additional attributes for the entities. In the case of the EMPLOYEE entity, you might define the following additional attributes:
Office phone number
You can read more about defining attributes later in this chapter.
Finally, you normalize the data, a task that is outlined in "Normalizing your entities to avoid redundancy" on page 94.
Defining entities for different types of relationships
In a relational database, you can express several types of relationships. Consider the possible relationships between employees and departments. If a given employee can work in only one department, this relationship is one-to-one for employees. One department usually has many employees; this relationship is one-to-many for departments. Relationships can be one-to-many, many-to-one, one-to-one, or many-to-many.
The type of a given relationship can vary, depending on the specific environment. If employees of a company belong to several departments, the relationship between employees and departments is many-to-many.
You need to define separate entities for different types of relationships. When modeling relationships, you can use diagram conventions to depict relationships by using different styles of lines to connect the entities.
One-to-one relationships
When you are doing logical database design, one-to-one relationships are bidirectional relationships, which means that they are single-valued in both directions. For example, an employee has a single resume; each resume belongs to only one person. Figure 4.1 illustrates that a one-to-one relationship exists between the two entities. In this case, the relationship reflects the rules that an employee can have only one resume and that a resume can belong to only one employee.
Figure 4.1 Assigning one-to-one facts to an entity
One-to-many and many-to-one relationships
A one-to-many relationship occurs when one entity has a multivalued relationship with another entity. In Figure 4.2, you see that a one-to-many relationship exists between the two entities, employee and department. This figure reinforces the business rules that a department can have many employees, but that each individual employee can work for only one department.
Figure 4.2 Assigning many-to-one facts to an entity
Many-to-many relationships
A many-to-many relationship is a relationship that is multivalued in both directions. Figure 4.3 illustrates this kind of relationship. An employee can work on more than one project, and a project can have more than one employee assigned. If you look at this book's example tables (in "Appendix A. Example tables in this book"), you can find answers for the following questions:
What does Wing Lee work on?
Who works on project number OP2012?
Figure 4.3 Assigning many-to-many facts to an entity
Both questions yield multiple answers. Wing Lee works on project numbers OP2011 and OP2012. The employees who work on project number OP2012 are Ramlal Mehta and Wing Lee.
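One common way to represent a many-to-many relationship is a third entity that holds one instance per employee-project pair. The sketch below is an illustration under assumptions, not the layout of the book's example tables: it uses Python's built-in sqlite3 module as a stand-in for DB2, it uses employee names as keys purely for brevity, the entity name EMPLOYEE_PROJECT is hypothetical, and it loads only the assignments mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (EMPLOYEE_NAME TEXT PRIMARY KEY);
    CREATE TABLE PROJECT (PROJECT_NUMBER TEXT PRIMARY KEY);
    -- One row per (employee, project) pair resolves the many-to-many
    -- relationship into two one-to-many relationships.
    CREATE TABLE EMPLOYEE_PROJECT (
        EMPLOYEE_NAME  TEXT REFERENCES EMPLOYEE (EMPLOYEE_NAME),
        PROJECT_NUMBER TEXT REFERENCES PROJECT (PROJECT_NUMBER),
        PRIMARY KEY (EMPLOYEE_NAME, PROJECT_NUMBER)
    );
    INSERT INTO EMPLOYEE VALUES ('Wing Lee'), ('Ramlal Mehta');
    INSERT INTO PROJECT VALUES ('OP2011'), ('OP2012');
    INSERT INTO EMPLOYEE_PROJECT VALUES
        ('Wing Lee', 'OP2011'),
        ('Wing Lee', 'OP2012'),
        ('Ramlal Mehta', 'OP2012');
    """
)

# What does Wing Lee work on?
wing_lee_projects = [
    row[0]
    for row in conn.execute(
        "SELECT PROJECT_NUMBER FROM EMPLOYEE_PROJECT "
        "WHERE EMPLOYEE_NAME = ? ORDER BY PROJECT_NUMBER",
        ("Wing Lee",),
    )
]

# Who works on project number OP2012?
op2012_staff = [
    row[0]
    for row in conn.execute(
        "SELECT EMPLOYEE_NAME FROM EMPLOYEE_PROJECT "
        "WHERE PROJECT_NUMBER = ? ORDER BY EMPLOYEE_NAME",
        ("OP2012",),
    )
]
print(wing_lee_projects, op2012_staff)
```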
Applying business rules to relationships
Whether a given relationship is one-to-one, one-to-many, many-to-one, or many-to-many, your relationships need to make good business sense. Therefore, database designers and data analysts can be more effective when they have a good understanding of the business. If they understand the data, the applications, and the business rules, they can succeed in building a sound database design.
When you define relationships, you have a big influence on how smoothly your business runs. If you don't do a good job at this task, your database and associated applications are likely to have many problems, some of which may not manifest themselves for years.
Defining attributes for the entities
When you define attributes for the entities, you generally work with the data administrator (DA) to decide on names, data types, and appropriate values for the attributes.
Most organizations have naming conventions. In addition to following these conventions, DAs also base attribute definitions on class words. A class word is a single word that indicates the nature of the data that the attribute represents.
Example: The class word NUMBER indicates an attribute that identifies the number of an entity. Attribute names that identify the numbers of entities should therefore include the class word of NUMBER. Some examples are EMPLOYEE_NUMBER, PROJECT_NUMBER, and DEPARTMENT_NUMBER.
When an organization does not have well-defined guidelines for attribute names, the DAs try to determine how the database designers have historically named attributes. Problems occur when multiple individuals are inventing their own naming schemes without consulting one another.
Choosing data types for attributes
In addition to choosing a name for each attribute, you must specify a data type. Most organizations have well-defined guidelines for using the different data types. Here is an overview of the main data types that you can use for the attributes of your entities.
String data contains a combination of letters, numbers, and special characters. Some of the string data types are listed below:
CHARACTER: Fixed-length character strings. The common short name for this data type is CHAR.
VARCHAR: Varying-length character strings.
CLOB: Varying-length character strings, typically used when a character string might exceed the limits of the VARCHAR data type.
GRAPHIC: Fixed-length graphic strings that contain double-byte characters.
VARGRAPHIC: Varying-length graphic strings that contain double-byte characters.
DBCLOB: Varying-length strings of double-byte characters.
BLOB: Varying-length binary strings.
Numeric data contains digits. The numeric data types are listed below:
SMALLINT: for small integers.
INTEGER: for large integers.
DECIMAL(p,s) or NUMERIC(p,s), where p is precision and s is scale: for packed decimal numbers with precision p and scale s. Precision is the total number of digits and scale is the number of digits to the right of the decimal point.
REAL: for single-precision floating-point numbers.
DOUBLE: for double-precision floating-point numbers.
Datetime data consists of values that represent dates, times, or timestamps. The datetime data types are listed below:
DATE: Dates with a three-part value that represents a year, month, and day.
TIME: Times with a three-part value that represents a time of day in hours, minutes, and seconds.
TIMESTAMP: Timestamps with a seven-part value that represents a date and time by year, month, day, hour, minute, second, and microsecond.
Examples: You might use the following data types for attributes of the EMPLOYEE entity:
The data types that you choose are business definitions of the data type. During physical database design you might need to change data type definitions or use a subset of these data types. The database or the host language might not support all of these definitions, or you might make a different choice for performance reasons.
For example, you might need to represent monetary amounts, but DB2 and many host languages do not have a data type MONEY. In the United States, a natural choice for the SQL data type in this situation is DECIMAL(10,2) to represent dollars. But you might also consider the INTEGER data type for fast, efficient performance.
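The trade-off between a decimal type and an integer type for money can be sketched outside the database. The example below is an analogy only, assuming Python's decimal module as a rough counterpart to a DECIMAL(10,2) attribute and a plain integer count of cents as a counterpart to an INTEGER attribute. The amounts are made up.

```python
from decimal import Decimal

# DECIMAL(10,2)-style value: exact decimal arithmetic, two digits of scale.
price = Decimal("19.99")
total_decimal = price * 3

# INTEGER-style alternative: store whole cents, convert only for display.
price_cents = 1999
total_cents = price_cents * 3

# Binary floating point (in the spirit of REAL or DOUBLE) cannot represent
# 0.10 exactly, so sums of money can drift.
float_sum = 0.10 + 0.20
decimal_sum = Decimal("0.10") + Decimal("0.20")

print(total_decimal, total_cents, float_sum == 0.30, decimal_sum == Decimal("0.30"))
```

Both the decimal and the integer representation keep amounts exact. The integer form is compact and fast, but every program that reads it must know that the unit is cents.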
"Determining column attributes" on page 223 provides additional details about selecting data types when you define columns.
Deciding what values are appropriate for attributes
When you design a database, you need to decide what values are acceptable for the various attributes of an entity. For example, you would not want to allow numeric data in an attribute for a person's name. The data types that you choose limit the values that apply to a given attribute, but you can also use other mechanisms. These other mechanisms are domains, null values, and default values.
A domain describes the conditions that an attribute value must meet to be a valid value. Sometimes the domain identifies a range of valid values. By defining the domain for a particular attribute, you apply business rules to ensure that the data will make sense.
A domain might state that a phone number attribute must be a 10-digit value that contains only numbers. You would not want the phone number to be incomplete, nor would you want it to contain alphabetic or special characters and thereby be invalid. You could choose to use either a numeric data type or a character data type. However, the domain states the business rule that the value must be a 10-digit value that consists of numbers.
A domain might state that a month attribute must be a 2-digit value from 01 to 12. Again, you could choose to use datetime, character, or numeric data types for this value, but the domain demands that the value must be in the range of 01 through 12. In this case, incorporating the month into a datetime data type is probably the best choice. This decision should be reviewed again during physical database design.
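The two domains described above can be written down as small validation rules. This is a sketch in Python with hypothetical function names. It treats both values as character data, which is only one of the representations the text allows.

```python
import re


def valid_phone_number(value: str) -> bool:
    """Domain rule from the text: a 10-digit value that contains only numbers."""
    return re.fullmatch(r"[0-9]{10}", value) is not None


def valid_month(value: str) -> bool:
    """Domain rule from the text: a 2-digit value from 01 through 12."""
    return re.fullmatch(r"0[1-9]|1[0-2]", value) is not None


print(valid_phone_number("4085550100"), valid_phone_number("408-555-01"))
print(valid_month("01"), valid_month("12"), valid_month("13"))
```

The rule is the same whichever data type is chosen later. Only the way the check is expressed changes.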
When you are designing attributes for your entities, you will sometimes find that an attribute does not have a value for every instance of the entity. For example, you might want an attribute for a person's middle name, but you can't require a value because some people have no middle name. For these occasions, you can define the attribute so that it can contain null values.
A null value is a special indicator that represents the absence of a value. The value can be absent because it is unknown, not yet supplied, or nonexistent. The DBMS treats the null value as an actual value, not as a zero value, a blank, or an empty string.
Just as some attributes should be allowed to contain null values, other attributes should not contain null values.
Example: For the EMPLOYEE entity, you might not want to allow the attribute EMPLOYEE_LAST_NAME to contain a null value.
You can read more about null values in "Chapter 7. Implementing your database design."
In some cases, you may not want a given attribute to contain a null value, but you don't want to require that the user or program always provide a value. In this case, a default value might be appropriate.
A default value is a value that applies to an attribute if no other valid value is available.
Example: Assume that you don't want the EMPLOYEE_HIRE_DATE attribute to contain null values and that you don't want to require users to provide this data. If data about new employees is generally added to the database on the employee's first day of employment, you could define a default value of the current date.
You can read more about default values in "Chapter 7. Implementing your database design."
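The three cases discussed above (an attribute that may be null, one that may not, and one that falls back to a default) can be tried out in a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for DB2, so the column types and the CURRENT_DATE default follow SQLite's behavior. EMPLOYEE_MIDDLE_NAME is a hypothetical attribute name and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE EMPLOYEE (
        EMPLOYEE_NUMBER      TEXT PRIMARY KEY,
        EMPLOYEE_LAST_NAME   TEXT NOT NULL,  -- must always have a value
        EMPLOYEE_MIDDLE_NAME TEXT,           -- may be null
        EMPLOYEE_HIRE_DATE   TEXT NOT NULL DEFAULT CURRENT_DATE
    )
    """
)

# Neither a middle name nor a hire date is supplied.
conn.execute(
    "INSERT INTO EMPLOYEE (EMPLOYEE_NUMBER, EMPLOYEE_LAST_NAME) "
    "VALUES ('000010', 'Lee')"
)
middle_name, hire_date = conn.execute(
    "SELECT EMPLOYEE_MIDDLE_NAME, EMPLOYEE_HIRE_DATE FROM EMPLOYEE"
).fetchone()

# Omitting the last name violates NOT NULL, because no default is defined.
try:
    conn.execute("INSERT INTO EMPLOYEE (EMPLOYEE_NUMBER) VALUES ('000020')")
    last_name_required = False
except sqlite3.IntegrityError:
    last_name_required = True

print(middle_name, hire_date, last_name_required)
```

The middle name comes back as a null (None in Python), while the hire date is filled in with the date on which the row was inserted.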
Normalizing your entities to avoid redundancy
After you define entities and decide on attributes for the entities, you normalize entities to avoid redundancy. An entity is normalized if it meets a set of constraints for a particular normal form, which this section describes. Normalization helps you avoid redundancies and inconsistencies in your data. This section summarizes rules for first, second, third, and fourth normal forms of entities, and it describes reasons why you should or shouldn't follow these rules.
The rules for normal form are cumulative. In other words, for an entity to satisfy the rules of second normal form, it also must satisfy the rules of first normal form. An entity that satisfies the rules of fourth normal form also satisfies the rules of first, second, and third normal form.
In this section, you will see many references to the word instance. In the context of logical data modeling, an instance is one particular occurrence. An instance of an entity is a set of data values for all of the attributes that correspond to that entity.
Example: Figure 4.4 shows one instance of the EMPLOYEE entity.
Figure 4.4 One instance of an entity
First normal form
A relational entity satisfies the requirement of first normal form if every instance of an entity contains only one value, never multiple repeating attributes. Repeating attributes, often called a repeating group, are different attributes that are inherently the same. In an entity that satisfies the requirement of first normal form, each attribute is independent and unique in its meaning and its name.
Example: Assume that an entity contains the following attributes:
This situation violates the requirement of first normal form, because JANUARY_SALARY_AMOUNT, FEBRUARY_SALARY_AMOUNT, and MARCH_SALARY_AMOUNT are essentially the same attribute, EMPLOYEE_MONTHLY_SALARY_AMOUNT.
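One way to remove such a repeating group is to keep a single salary attribute and add an attribute that says which month it belongs to. The sketch below shows that reshaping in plain Python. The SALARY_MONTH attribute, the employee number, and the amounts are assumptions made for illustration.

```python
# One instance with a repeating group (violates first normal form).
unnormalized = {
    "EMPLOYEE_NUMBER": "000010",
    "JANUARY_SALARY_AMOUNT": 4000,
    "FEBRUARY_SALARY_AMOUNT": 4000,
    "MARCH_SALARY_AMOUNT": 4200,
}

# Maps each repeating attribute to the month it stands for.
MONTH_COLUMNS = {
    "JANUARY_SALARY_AMOUNT": 1,
    "FEBRUARY_SALARY_AMOUNT": 2,
    "MARCH_SALARY_AMOUNT": 3,
}


def to_first_normal_form(instance):
    """Return one instance per (employee, month) with a single salary attribute."""
    return [
        {
            "EMPLOYEE_NUMBER": instance["EMPLOYEE_NUMBER"],
            "SALARY_MONTH": month,
            "EMPLOYEE_MONTHLY_SALARY_AMOUNT": instance[column],
        }
        for column, month in MONTH_COLUMNS.items()
    ]


rows = to_first_normal_form(unnormalized)
for row in rows:
    print(row)
```

After the change, adding a fourth month means adding an instance, not adding another attribute to the entity.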
Second normal form
An entity is in second normal form if each attribute that is not in the primary key provides a fact that depends on the entire key. (For a quick refresher on keys, see "Keys" on page 49.)
A violation of the second normal form occurs when a nonprimary key attribute is a fact about a subset of a composite key.
Example: An inventory entity records quantities of specific parts that are stored at particular warehouses. Figure 4.5 shows the attributes of the inventory entity.
Figure 4.5 A primary key that violates second normal form
Here, the primary key consists of the PART and the WAREHOUSE attributes together. Because the attribute WAREHOUSE_ADDRESS depends only on the value of WAREHOUSE, the entity violates the rule for second normal form. This design causes several problems:
Each instance for a part that this warehouse stores repeats the address of the warehouse.
If the address of the warehouse changes, every instance referring to a part that is stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent. Different instances could show different addresses for the same warehouse.
If at any time the warehouse has no stored parts, the address of the warehouse might not exist in any instances in the entity.
Figure 4.6 Two entities that satisfy second normal form
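The split into two entities can be sketched in plain Python. The attribute name QUANTITY, the names part_stock and warehouse, and all sample values are assumptions, because the figures are not reproduced here. The point is only that facts about the whole key stay together while facts about WAREHOUSE alone are stored once.

```python
# Instances of the unnormalized inventory entity; the key is (PART, WAREHOUSE).
inventory = [
    {"PART": "P1", "WAREHOUSE": "W1", "QUANTITY": 10, "WAREHOUSE_ADDRESS": "1 Dock Rd"},
    {"PART": "P2", "WAREHOUSE": "W1", "QUANTITY": 5, "WAREHOUSE_ADDRESS": "1 Dock Rd"},
    {"PART": "P1", "WAREHOUSE": "W2", "QUANTITY": 7, "WAREHOUSE_ADDRESS": "9 Pier Ave"},
]

# Facts that depend on the entire key stay with the key.
part_stock = [
    {name: row[name] for name in ("PART", "WAREHOUSE", "QUANTITY")}
    for row in inventory
]

# Facts that depend on WAREHOUSE alone move to their own entity,
# stored once per warehouse.
warehouse = {row["WAREHOUSE"]: row["WAREHOUSE_ADDRESS"] for row in inventory}

# An address change is now a single update, so instances cannot disagree.
warehouse["W1"] = "2 Dock Rd"

print(part_stock)
print(warehouse)
```

A warehouse that currently stores no parts can also keep its address, because the address no longer lives in the inventory instances.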
Third normal form
An entity is in third normal form if each nonprimary key attribute provides a fact that is independent of other nonkey attributes and depends only on the key.
A violation of the third normal form occurs when a nonprimary key attribute is a fact about another nonkey attribute.
Example: The first entity in Figure 4.7 contains the attributes EMPLOYEE_NUMBER and DEPARTMENT_NUMBER. Suppose that a program or user adds an attribute, DEPARTMENT_NAME, to the entity. The new attribute depends on DEPARTMENT_NUMBER, whereas the primary key is on the EMPLOYEE_NUMBER attribute. The entity now violates third normal form.
Figure 4.7 The update of an unnormalized entity. Information in the entity has become inconsistent.
Changing the DEPARTMENT_NAME value based on the update of a single employee, David Brown, does not change the DEPARTMENT_NAME value for other employees in that department. The updated version of the entity in Figure 4.7 illustrates the resulting inconsistency. Additionally, updating the DEPARTMENT_NAME in this table does not update it in any other table that might contain a DEPARTMENT_NAME column.
You can normalize the entity by modifying the EMPLOYEE_DEPARTMENT entity and creating two new entities: EMPLOYEE and DEPARTMENT. Figure 4.8 shows the new entities. The DEPARTMENT entity contains attributes for DEPARTMENT_NUMBER and DEPARTMENT_NAME. Now, an update such as changing a department name is much easier. You need to make the update only to the DEPARTMENT entity.
Figure 4.8 Normalized entities: EMPLOYEE, DEPARTMENT, and EMPLOYEE_DEPARTMENT
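The inconsistency and its cure can be reproduced with a few lines of plain Python. The employee numbers, the department number "D11", and the department names are made up for this sketch and are not taken from the figures.

```python
# Unnormalized: DEPARTMENT_NAME repeats for every employee in the department.
unnormalized = [
    {"EMPLOYEE_NUMBER": "000010", "DEPARTMENT_NUMBER": "D11", "DEPARTMENT_NAME": "Manufacturing"},
    {"EMPLOYEE_NUMBER": "000020", "DEPARTMENT_NUMBER": "D11", "DEPARTMENT_NAME": "Manufacturing"},
]

# Changing the name through a single employee leaves the other instance as it was.
unnormalized[0]["DEPARTMENT_NAME"] = "Installation"
names_for_d11 = {
    row["DEPARTMENT_NAME"]
    for row in unnormalized
    if row["DEPARTMENT_NUMBER"] == "D11"
}

# Normalized: the name is a fact about the department and is stored once.
employee_department = [("000010", "D11"), ("000020", "D11")]
department = {"D11": "Manufacturing"}
department["D11"] = "Installation"  # one update covers every employee

resolved_names = {
    employee: department[dept] for employee, dept in employee_department
}
print(names_for_d11)
print(resolved_names)
```

In the unnormalized version, department D11 ends up with two different names. In the normalized version, every employee resolves to the same, updated name.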
Fourth normal form
An entity is in fourth normal form if no instance contains two or more independent, multivalued facts about an entity.
Example: Consider the EMPLOYEE entity. Each instance of EMPLOYEE could have both SKILL_CODE and LANGUAGE_CODE. An employee can have several skills and know several languages. Two relationships exist, one between employees and skills, and one between employees and languages. An entity is not in fourth normal form if it represents both relationships, as Figure 4.9 shows.
Figure 4.9 An entity that violates fourth normal form
Instead, you can avoid this violation by creating two entities, each of which represents one of the relationships, as Figure 4.10 shows.
Figure 4.10 Entities that are in fourth normal form
If, however, the facts are interdependent (that is, the employee applies certain languages only to certain skills), you should not split the entity.
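The cost of keeping two independent multivalued facts in one entity can be shown with a small count. This sketch assumes one common way of populating the combined entity, in which every skill is paired with every language. The figures may lay the data out differently, and the codes used here are made up.

```python
from itertools import product

employee = "000010"
skill_codes = ["S1", "S2", "S3"]
language_codes = ["EN", "FR"]

# Combined entity: independent facts force every skill/language combination.
combined = [(employee, s, l) for s, l in product(skill_codes, language_codes)]

# Fourth normal form: one entity per relationship.
employee_skill = [(employee, s) for s in skill_codes]
employee_language = [(employee, l) for l in language_codes]

# The split loses nothing: the combinations can be rebuilt on demand.
rebuilt = {
    (e1, s, l)
    for (e1, s) in employee_skill
    for (e2, l) in employee_language
    if e1 == e2
}

print(len(combined), len(employee_skill) + len(employee_language))
print(rebuilt == set(combined))
```

With three skills and two languages, the combined entity needs six instances and the split entities need five. Adding one more language adds three instances to the combined entity but only one to the split design.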
You can put any data into fourth normal form. A good rule to follow when doing logical database design is to arrange all the data in entities that are in fourth normal form. Then decide whether the result gives you an acceptable level of performance. If the performance is not acceptable, denormalizing your design is a good approach to improving performance. You can read about this next step in "Denormalizing tables to improve performance" on page 102.