In the previous chapter, we discussed the definitions of various database objects using the Data Definition Language (DDL). In this chapter, we start manipulating the database objects using the portion of SQL known as Data Manipulation Language (DML). We will be populating (inserting) data into the database and retrieving the data using many powerful methods. Depending on the sophistication of the database users, they can use SQL to query the database. The majority of the SQL statements within a DB2 application involve DML statements. Therefore, application developers must understand the various methods of inserting, updating, and retrieving data from the database. We will start with simple retrieval statements and gradually introduce more complex methods of data manipulation. The DB2 Certification database will be used for most of the examples. There are four main DML SQL statements we consider: SELECT, INSERT, UPDATE, and DELETE.
SQL is based on mathematical principles, specifically on set theory and relational algebra. The data is stored in the database as unordered sets of data records. SQL is a set-oriented language, and many of its language elements are directly related to relational algebraic terms, such as permutation, projection, restriction, and join.
A set of data is represented in a DB2 database as a table or a view and is stored in a DB2 table without regard to order. To retrieve data in a particular order, an ORDER BY phrase must be added to a SELECT statement. Similarly, if the data is to be grouped, then a GROUP BY phrase must be added to the statement.
Now, let's review the DB2CERT database design defined in the previous chapter and manipulate data using various SQL statements. There are three main tables: CANDIDATE, TEST, and TEST_CENTER. Each table represents a set of records that correspond to a test candidate (person), a test, and a test center (location).
There is an associative table, known as the TEST_TAKEN table, that is used to reflect the relationships among the three main tables. The TEST_TAKEN table is used to schedule the test candidates and also to maintain their test scores.
The longest SQL statement is 32,765 bytes.
Remember that to execute any operation, the user must have the necessary privileges.
Retrieving the Entire Table
The most basic of all retrieval command involves the SELECT statement with no operators other than the name of the table. This SQL statement retrieves all of the candidates who have taken a DB2 Certification exam. The information requested is contained in the table TEST_TAKEN.
SELECT * FROM db2cert.test_taken
SQL is a data access language that consists of language statements and clauses. There are many optional clauses that can be used to modify the output. The output of a SELECT statement is known as a result set or result table. The results from the SELECT statement are shown next. In this example, using an asterisk or star character (*) for column selection, the columns will be returned to the user in the order in which they are defined in the table.
CID TCID NUMBER DATE_TAKEN START_TIME FINISH_TIME SCORE PASS_FAIL SEAT_NO --- ---- ------ ---------- ---------- ----------- ----- --------- ------- 111 TX01 500 01/01/2000 11:30:00 12:30:00 65 Y 1 111 TX01 501 02/02/2000 10:30:00 11:45:00 73 Y 1 111 TX01 502 03/03/2000 12:30:00 13:30:00 67 Y 1 222 TR01 500 01/01/2000 14:00:00 15:30:00 55 N 2 222 TR01 502 01/02/2000 09:00:00 10:15:00 53 N 2 222 TR01 502 02/18/2000 10:00:00 11:30:00 75 Y 2 333 TX01 500 03/01/2000 11:30:00 13:00:00 82 Y 2 333 TX01 501 12/29/2000 14:00:00 - - - 1 333 TX01 502 03/02/2000 14:00:00 14:30:00 92 Y 1 9 record(s) selected.
In SQL, the * is used to indicate that all columns of a table are being referenced. In this example, the SQL statement refers to all of the columns defined for the DB2CERT.TEST_TAKEN table. If the table is altered and a new column is added to the table definition, the result set would contain the new column.
Adding a new column to an existing table will result in default values being populated for the existing rows.
Since the output of the SQL statement using * character varies according to the table definition, it is recommended that you specify all of the column names you want to see in the SELECT statement. We could have obtained the same result as the statement using SELECT * with the following SQL statement.
SELECT cid, tcid, number,date_taken,start_time, finish_time,score,seat_no FROM db2cert.test_taken;
The * character is used to refer to all of the columns defined for a table. The order of the columns in the result table is the same order as specified in the CREATE TABLE or CREATE VIEW statement.
What is returned to the user is known as the result set. If the result set is large it is advisable to be sure to filter the data using a WHERE predicate.
The FROM clause is required for the DML SQL statement, since it describes the location (table or view) of the data. Our example references a single table called DB2CERT.TEST_TAKEN. The SELECT and FROM clauses are required in all data retrieval statements. The list of columns following the SELECT keyword is referred to as the select list.
Projecting Columns from a Table
Projection is a relational operation that allows you to retrieve a subset of the defined columns from a table. The next example restricts the output from the SELECT command so that only the candidate ID, test center, and test number attributes from the TEST_TAKEN table are shown.
SELECT cid,tcid,number FROM db2cert.test_taken
The output of this SELECT statement is shown below.
CID TCID NUMBER --- ---- ------ 111 TX01 500 111 TX01 501 111 TX01 502 222 TR01 500 222 TR01 502 222 TR01 502 333 TX01 500 333 TX01 501 333 TX01 502 9 record(s) selected.
The order of the columns in the result table will always match the order in the select list. The order of the columns as they were defined in the CREATE TABLE or CREATE VIEW statement is ignored when a select list is provided in the SQL statement. In this example, the order of the columns is similar to the order in the CREATE TABLE statement, since the CID column was defined prior to TCID and NUMBER columns.
Changing the Order of the Columns
Permutation is the relational operation that allows you to change the order of the columns in your result table. Permutation is used every time you select columns in an order different than the order defined in the CREATE TABLE statement. For example, to display the test center ID prior to the candidate IDs and the test number, you could execute the following:
SELECT tcid,cid,number FROM db2cert.test_taken
The result of this SELECT statement specifies a select list in a different order than was defined in the table definition.
TCID CID NUMBER ---- --- ------ TX01 111 500 TX01 111 501 TX01 111 502 TR01 222 500 TR01 222 502 TR01 222 502 TX01 333 500 TX01 333 501 TX01 333 502 9 record(s) selected.
We refer to the output of a SELECT statement as the result table because the output of all SELECT statements can be considered a relational table.
Restricting Rows from a Table
Restriction is a relational operation that will filter the resulting rows of a table. Restriction can be accomplished through the use of predicates defined in a SQL WHERE clause. To restrict the result set, we need to add a WHERE clause to the SQL statement.
A predicate is a condition placed on the data. The result of the condition is TRUE, FALSE, or UNKNOWN.
The WHERE clause specifies conditions or predicates that must be evaluated by DB2 before the result table is returned to the end user. There are many valid types of predicates that can be used. In the following example, the equality (=) predicate is used to restrict the records to only those candidates who have taken a DB2 Certification test at the test center TR01.
SELECT tcid,cid FROM db2cert.test_taken WHERE tcid ='TR01'
The WHERE clause also accepts other comparison operators, such as greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), and not equal to (<>).This statement is an example of a basic predicate. A basic predicate compares two values. In addition, more complex predicates, such as LIKE, BETWEEN, and IN, are also valid and will be discussed later.
Trying to execute this example SQL statement will result in a data type compatibility error, because the column TCID is defined with a user-defined data type (UDT).
Predicate Evaluation for UDTs
The column TCID was defined as a user-defined data type. To make the comparison in the WHERE clause valid, a casting function needs to be used. This is not mandatory if the comparison is to a literal or host variable. This technique is however demonstrated below:
SELECT tcid, cid FROM db2cert.test_taken WHERE tcid=CAST('TR01' AS db2cert.center_id)
Predicate evaluation requires that the data types be compatible (same data type or a compatible data type). We can accomplish the data type conversion (cast) using one of two methods:
- Use the CAST expression.
- Use a casting function.
Refer to Chapter 15, "Object Relational Programming," for more information on casting for UDTs.
Although not always mandatory, casting is recommended to ensure compatible data types are compared, including length and scale. This allows DB2 to resolve these as indexable rather than stage 2 predicates.
Restricting Rows Using Multiple Conditions
It is possible to combine multiple conditions (predicates) in a single SQL statement. The predicates can be combined using Boolean operators, such as the AND or OR operators. These operators allow you to combine multiple conditions in a single SQL statement. The order of the predicate evaluation will not affect the result set (known as set closure).
The next example retrieves the records for the test candidates who took a test at test center TR01 and achieved a score greater than 65. The rows that satisfy the predicates are known as the qualifying rows. The following example is an SQL statement using multiple predicates:
SELECT tcid,cid,score FROM db2cert.test_taken WHERE tcid= CAST('TR01' AS db2cert.center_id) AND score > 65
Selecting Columns from Multiple Tables
There are basically two operations that combine columns from multiple tables in a single SQL statement. These operations are
- Cartesian product
A Cartesian product is a relational operation that will merge all the values from one table with all the values from another table. This operation is not used frequently because the result table can be very large. The number of rows in the result table is always equal to the product of the number of rows in the qualifying rows for each of the tables being accessed.
The DB2 optimizer may choose to utilize a Cartesian product of unrelated tables if this is deemed to be an efficient method of accessing multiple tables. An example would be two single-row tables that are joined with a large table. The cross-product of 1 ∴ 1 = 1; thus the large table access is deferred as late as possible, with a potential increase in the restrictive predicates that can be applied without incurring the overhead of a large Cartesian result. This method of table access is typical in processing queries against a star schema data model.
The following example is a Cartesian product of all test numbers and test names from the TEST table, with all candidates from the TEST_TAKEN table. First, select from the TEST table.
SELECT number, name FROM db2cert.test NUMBER NAME ------ -------------------------------------------------- 500 DB2 Fundamentals 501 DB2 Administration 502 DB2 Application Development 3 record(s) selected. Next select of all candidates from the test_taken table. SELECT cid,tcid FROM db2cert.test_taken CID TCID --- ---- 222 TR01 222 TR01 222 TR01 111 TX01 111 TX01 111 TX01 333 TX01 333 TX01 333 TX01 9 record(s) selected.
Then combine the two tables to form a Cartesian product result table.
SELECT db2cert.test_taken.number,cid,tcid FROM db2cert.test_taken,db2cert.test NUMBER CID TCID ------ --- ---- 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 27 record(s) selected.
There are two tables referenced in the FROM clause of this query. The tables are separated by commas. There is no relationship expression in the WHERE clause. This type of query results in a Cartesian product.
The result table is a representation of all possible combinations of the input tables. The TEST table has three rows, and the TEST_TAKEN table has nine rows. Therefore, the SELECT statement shown above returns 27 rows. Note the first column name in this query. It is necessary to fully qualify the column name by providing the schema name and table name with the column name because this column exists in both the TEST table and TEST_TAKEN table. In this case, we needed to specify that the NUMBER column is to be retrieved from the DB2CERT.TEST_TAKEN table and not from the DB2CERT.TEST table.
By adding a predicate to a Cartesian product SQL query, the result table can represent a more useful representation of the data. In the example below, the query represents all of the tests that were taken by the candidate whose ID is 111.
SELECT db2cert.test_taken.number,cid,tcid FROM db2cert.test_taken,db2cert.test WHERE cid='111' NUMBER CID TCID ------ --- ---- 500 111 TX01 500 111 TX01 500 111 TX01 501 111 TX01 501 111 TX01 501 111 TX01 502 111 TX01 502 111 TX01 502 111 TX01 9 record(s) selected.
Adding a WHERE clause to your query does not always provide the desired result. In the example shown above, you want to know all of the tests that were taken by the candidate whose ID is 111, and the query returns nine rows. But we know from previous queries that the candidate took only three tests. The query in the example above has a WHERE clause to filter out the candidate whose ID is 111 from the TEST_TAKEN table, but there was no filter on the TEST_CENTER table. Therefore, the result of the query would always be a multiple of the number of testing centers. Usually, when multiple tables are referenced, you should include a cross-table relationship using a table merge or join method, as shown in the following example. We will examine table join methods further in the next section.
SELECT db2cert.test_taken.number,cid,tcid FROM db2cert.test_taken,db2cert.test WHERE cid= '111' AND db2cert.test_taken.number=db2cert.test.number NUMBER CID TCID ------ --- ---- 500 111 TX01 501 111 TX01 502 111 TX01 3 record(s) selected.
To avoid data redundancy, it is recommended that the database tables be normalized. Following a normalization process, a number of related tables will exist. To satisfy some of the required queries, the tables must be reconstructed. The tables are reconstructed temporarily, using a table join strategy to produce a single-result table.
The result tables in the previous examples usually provided candidate ID numbers and not the complete name of the test candidates. The candidate IDs are stored in the TEST_TAKEN table, and the full names are stored in the CANDIDATE table. To obtain the name of a candidate, the data must be retrieved from the CANDIDATE table, using a relationship or join strategy.
Consider an example that will list the names and phone numbers of candidates who were registered to take a DB2 Certification test in 2000. To accomplish this, we need to select data from two different tables:
Let's retrieve a list of candidate names, phone numbers, and IDs from the CANDIDATE table. The candidate names were stored in multiple columns to allow for easy retrieval by last name.
SELECT fname,initial,lname,hphone,cid FROM db2cert.candidate
The output of this example follows. Pay special attention to the values in the CID column. It will be used as the join column in the next example.
FNAME INITIAL LNAME HPHONE CID -------- ------- ---------- ---------- --- Bill B Wong 1115551234 111 George Baklarz 2226543455 222 Susan M Visser 4442314244 333 Glen R Sheffield 5552143244 444 Jim G Stittle 6662341234 555 Kevin W Street 7773142134 666 Bert F Nicol 8886534534 777 Paul C Zikopoulos 9992112212 888 8 record(s) selected.
Now, let's retrieve the ID numbers of those candidates who were registered to take the test in 2000.
SELECT distinct cid FROM db2cert.test_taken WHERE YEAR(date_taken) = 2000 CID --- 222 111 333 3 record(s) selected.
The candidate IDs in the TEST_TAKEN table must correspond to a candidate ID in the CANDIDATE table because of the declarative referential integrity constraints. The parent table in the relationship is the CANDIDATE table and the child table (dependent table) is the TEST_TAKEN table.
The result table from the above query does not include the test candidate 444, since that candidate did not have a test scheduled for 2000. We need to join the two result tables based on the candidate ID values. This column is known as the join column.
Query performance can significantly improve if the join columns are appropriately indexed.
The following single query will satisfy the end-user requirement.
SELECT distinct fname,initial,lname,hphone FROM db2cert.test_taken,db2cert.candidate WHERE YEAR(date_taken) = 2000 AND db2cert.test_taken.cid=db2cert.candidate.cid
A table join requires a predicate that includes an expression based on columns from the tables referenced in the FROM clause. This is known as a join predicate. The FROM clause has not changed from the Cartesian product examples. The only difference is in the join predicate (test_taken.cid = candidate.cid).
An alternative form of the FROM clause for joins involves explicitly coding the JOIN keyword between the tables (rather than using a comma) and coding the join predicates in the ON clause rather than in the WHERE clause. This method of coding will be covered in more detail in Chapter 4, "Advanced SQL Coding."
The table names needed to be explicitly stated because there is a column named CID in both of the referenced tables. When multiple tables are being accessed in a single query, any selected columns that occur in more than one table must be qualified with the table name.
An error will occur if the columns being referenced are ambiguous (not properly qualified).
The maximum number of base tables that can be in a single SQL statement is 225; however, you are limited to 15 tables in a join (the FROM clause).
The 15-table limit can be increased by updating the macro SPRMMXT to a value greater than 15.
This kind of join operation shown in the above example is also known as an inner join. An inner join displays only the rows that are present in both of the joined tables.
Using Correlation Names
If each of the columns needed to be fully qualified with the table name, such as tableschema.tablename.columnname, the queries would become very large and cumbersome to work with. Fortunately, there is an easier way to qualify the ambiguous columns resulting from a multitable SELECT statement.
The columns can be qualified using a correlation name. A correlation name is a temporary alias for the tables referenced in an SQL statement. We rewrite the previous query using correlated names as follows:
SELECT distinct fname,initial,lname,hphone FROM db2cert.test_taken tt, db2cert.candidate c WHERE YEAR(date_taken) = 2000 AND tt.cid = c.cid
The correlation name immediately follows the name of the table as stated in the FROM clause. In this example, the correlated name for the TEST_TAKEN table is tt, and the correlated name for the CANDIDATE table is c.
Correlation names can be optionally prefixed with the AS keyword, such as DB2CERT.CANDITATE AS C. This is purely for readability.
The correlated names are accessible within the SQL statement only. Following the execution of the SQL statement, the correlation name is no longer defined. Once a correlation name has been defined, it can be referenced in the rest of the query instead of the table name. However, the table name can still be referenced.
Use simple, easy-to-remember correlation names. Table initials are good candidates for correlation names.
Sorting Your Output
We have been retrieving data from one or more tables. The order of the result table has not been specified in any of the SQL statements. Therefore, the data is retrieved in an undetermined order if there is no ORDER BY clause in the SQL statement.
The only guaranteed way to return data in the required sequence is with the ORDER BY clause. Any data retrieval that is currently returned in sequence without this clause is purely based upon the data retrieval method at the time. A future access path change may not return the data in the same sequence.
The following example produces a list of the test candidates in alphabetical order by last name for the candidates who have taken a DB2 Certification test at the TR01 test center.
SELECT lname,initial,fname FROM db2cert.candidate c,db2cert.test_taken tt WHERE c.cid=tt.cid AND tcid='TR01' ORDER BY lname
This example contains a new clause, ORDER BY. After the ORDER BY clause, you can list the columns that will specify the sort order and the type of sort.
Appropriate indexing may allow DB2 to avoid sorting the data to match the ORDER BY clause. If the data is already sequenced via the index, DB2 may choose to utilize the index and avoid sorting the data. DB2 cannot avoid a sort for an ORDER BY involving columns from more than one table. Where possible, include columns from only one table in the ORDER BY to give DB2 greater opportunity for sort avoidance.
The SQL can be modified so that the output is changed to descending order by last name and a secondary order column on the first name in ascending order.
SELECT lname,fname,hphone FROM db2cert.candidate c,db2cert.test_taken tt WHERE c.cid=tt.cid AND tcid= 'TR01' ORDER BY lname DESC, fname
In this example, the DESC keyword that follows the lname column indicates that the result table should be in descending order based on the last name. More than one record can have the same last name. This situation is quite common. There is a second column specified in the ORDER BY clause, fname. There is no keyword specifying the sort sequence based on the fname column. Therefore, the default ordering sequence (ascending) is used.
The next example contains three columns lname, fname, and hphone. You can reference the column that should be used to sort the data using the column name or by specifying its position in the select list. Using the column position is useful when the column in the select list is made up of derived columns (calculated columns) that have no explicit name.
SELECT lname,fname,hphone FROM db2cert.candidate c,db2cert.test_taken tt WHERE c.cid=tt.cid AND tcid='TR01' ORDER BY 1 DESC, 2
In this example, the sort order is specified using the column position. Therefore, the query result is exactly the same as the in previous example.
You can also assign an alternative column name using column renaming. The assigned name can then be referenced in the ORDER BY clause, and it appears as the column heading where appropriate.
Although useful, coding a column position rather than a column (or renamed column) in the ORDER BY clause may lead to a different sequence if another column is inadvertently added to the query that alters the column positioning.
The ORDER BY clause must succeed all subselects in the FULLSELECT statement. The only clauses that can succeed the ORDER BY clause are the for FOR FETCH ONLY, FOR READ ONLY, OPTIMIZE, WITH QUERYNO, FETCH FIRST n ROWS clauses.
There are some cases when you will need to perform calculations on the data. The SQL language has some basic mathematical and string functions built in. Mathematical operations include standard addition, subtraction, multiplication, and division.
The calculation can be defined in the WHERE clause of the SQL statement or the select list. Suppose that you need to calculate a passing rate for a DB2 test. The passing rate is defined as the percentage of candidates that pass the test (totalpassed*100/totaltaken). The following SQL statement will accomplish this for us for test number 500:
SELECT number, totalpassed*100/totaltaken FROM test WHERE number='500'
In this example, the second column of the output list is a calculated column. Remember that you must use the column position if you want to use this calculated column for the ORDER BY clause unless you name it (as we now discuss).
Occasionally, the results of a derived column may not display as expected. The example using totalpassed*100/totaltaken will result in a value of 66 being retrieved. Since both the totalpassed and totaltaken columns are integers, the final result is also an integer, and the fractional part is discarded. If this is not your desired result, you should use other functions (like decimal) to change the way the calculation is performed or displayed.
Naming Derived/Calculated Columns
You can specify a column name for any expression. When you provide the derived (calculated) column with a name, the ORDER BY clause can reference the derived name to allow for a more readable SQL statement.
The following SQL calculates the percentage of people that have passed the DB2 Certification exams and orders the output in descending order of the passing rate.
SELECT number,totalpassed*100/totaltaken AS PassedRate FROM test ORDER BY PassedRate DESC
The AS clause is used to rename the default name of an element in the select list. In this example, we are giving the name of PassedRate to the result of the division of columns totalpassed by totaltaken. The named column is used in the query to specify the column that should be used for sorting the output.
The AS keyword is optional, although it should be added for readability. Without the AS clause, it may be unclear whether the intention was to rename a column or a comma was missed between two retrieved columns.
In DB2, there are different types of functions provided. For example, two types of functions provided by DB2 are scalar and column functions (a third type of function called a table function is discussed in Chapter 15, "Object Relational Programming."
Scalar functions (also known as row functions) provide a result for each row of the result table. A scalar function can be used any place an expression is allowed.
Column functions (also known as vector functions) work with a group of rows to provide a result. The group is specified using a FULLSELECT and optionally grouped using the GROUP BY clause.
In this section, we introduce you to some of the SQL functions provided with DB2. SQL functions are categorized by their implementation type. Either the functions are built in or they are extensions of DB2 and are known as user-defined functions (UDFs).
Built-in functions are defined within the SQL standards, and they are provided by DB2. These can be either scalar or column functions.
UDFs are not defined within the SQL standards because they are extensions of the current SQL language. These functions can be developed by a DB2 administrator or application developer. UDFs can be either scalar or table functions, but not column functions. Once the UDFs have been created, they can be invoked by any end user with the proper privileges. For more information about UDFs, refer to Chapter 15.
Scalar functions are applied to each row of data, and there is a per-row result provided. If we want to retrieve only the first three digits of telephone numbers for each candidate, we could use a scalar function. The function that is used is called SUBSTR. The arguments for this function include a string data type column, a beginning offset, and length. The output data type and attribute of the function depend on the input data type and attribute. The following example retrieves the telephone area code for the column wphone.
SELECT lname, SUBSTR(CHAR(wphone),1,3) FROM db2cert.candidate
The SUBSTR function is a scalar function. In this example, SUBSTR returns a character string of three characters. The result string corresponds to the first three characters of the wphone column. This function is known as a string function because it works with any string data type. If we wanted to provide the output column with a meaningful name, we could provide an alias, as was done for calculated columns.
In the example above, the substring starts from the beginning of the string, because we indicate one (1) as the second parameter of the function. The length of the resulting string is indicated in the third argument. In our example, the length is three. Note that the data type of the wphone column is phone, so a casting function is used to convert the phone data type to the char data type.
The following query will provide the month when the exam was taken. The input for this function is a DATE string, and the output is an integer.
SELECT fname, MONTH (date_taken) FROM candidate c, test_taken tt WHERE c.cid=tt.cid
Column functions provide a single result for a group of qualifying rows for a specified table or view. Many common queries can be satisfied using column functions where they include common tasks, such as finding the smallest value, the largest value, or the average value for a group of data records. In the following example you can obtain the maximum length of time of any of the DB2 Certification exams:
SELECT MAX("length") FROM test
Because the length column shares its name with the length scalar function, the column name is encased within double-quotes to distinguish this as a column and not a function.
If we added a WHERE clause to this example, the maximum would represent the maximum length for the qualifying rows, since the predicate is used to filter the data prior to the application of the MAX function.
This next example calculates the average of the number of seats for all of the test centers. Notice the column function AVG is used in this example:
SELECT AVG(noseats) FROM test_center
DB2 provides many more built-in functions. If you are interested in calculating statistical information, you can use statistical functions, such as VARIANCE, STDDEV, or a sampling of these functions.
MAX and MIN functions can be either column or scalar functions depending on the input arguments.
Many queries require some level of aggregated data. This is accomplished in SQL through the use of the GROUP BY clause. The following SQL obtains the average number of seats for each country:
SELECT country, AVG(noseats) FROM test_center GROUP BY country
This SQL statement obtains the average number of seats per country, and the GROUP BY clause tells DB2 to group together the rows that have the same values in the columns indicated in the GROUP BY list. In our example, we are grouping countries into subsets. As the subsets are created, DB2 calculates the average of each of those groups or subsets, in this case, by each country.
When you combine column functions and other elements, such as column names, scalar functions, or calculated columns, you must use the GROUP BY clause. In this case, you must include every element that is not a column function in the GROUP BY list. The only elements that can be omitted in the GROUP BY list are constant values.
The next SQL statement obtains a list that includes the average cut score and minimum test length for the DB2 Certification exams. We group this list by the type of exam, as follows:
SELECT type, AVG(cut_score), MIN("length") FROM test GROUP BY type
Appropriate indexing can allow DB2 to avoid a sort to group the data rows to match the GROUP BY clause.
It is possible to sort the output of the previous example using an ORDER BY clause.
GROUP BY may return data in the same order as an ORDER BY, but this is not guaranteed and is based upon the access path. The only method to guarantee data sequence is with an ORDER BY.
Restricting the Use of Sets of Data
Up to now, we have discussed how to restrict output based on row conditions. With SQL, it is also possible to restrict that output using column functions and the GROUP BY clause. Suppose you want a list of all the test centers that have administered more than five DB2 Certification exams. To make it easier to understand, let's first get the number of tests that have been taken in each test center.
SELECT tcid, count(*) FROM test_taken GROUP BY tcid
We use the COUNT column function to get the total number of tests that have been taken in each test center. When you use an asterisk (*) with the COUNT function, you are indicating that you want the number of rows in a table that meet the criteria established in the SQL statement. In this example, we are grouping by TCID because we have a number of occurrences for all the test centers in the TEST_TAKEN table. The TEST_TAKEN table has an entry for every DB2 Certification exam that has been taken. Finally, the output is restricted to only those test centers that have administered more than four exams.
SELECT tcid FROM test_taken GROUP BY tcid HAVING COUNT(*) > 4
This example introduces the HAVING clause. The HAVING clause is equivalent to the WHERE clause for groups and column functions. The HAVING clause will restrict the result set to only the groups that meet the condition specified in it. In our example, only the test centers that have administered more than four DB2 Certification exams will be displayed.
When you execute a query, you might get duplicate rows in the answer set. The SQL language provides a special clause to remove the duplicate rows from your output. The following SQL generates a list of names and phone numbers for all the candidates who have taken a test. In the following example, we eliminate the duplicate rows from our output list using the DISTINCT clause.
SELECT DISTINCT fname,wphone,hphone FROM candidate c,test_taken tt WHERE c.cid=tt.cid
The DISTINCT clause can also be used with the COUNT function. When you use DISTINCT inside a COUNT function, it will not count the duplicate entries for a particular column. The following example allows you to count how many different test centers have candidates registered.
SELECT COUNT(DISTINCT tcid) FROM test_taken
This example provides the number of test centers that are registered in the TEST_TAKEN table. Remember that all the candidates who have registered for DB2 Certification exams are stored in this table. Make sure that you understand the difference between COUNT(*), COUNT(colname), and COUNT(DISTINCT colname). They are very similar in syntax but differ in function.
COUNT(*) returns a count of all rows that qualify against the WHERE clause. COUNT(colname) returns a count of all rows that qualify against the WHERE clause, with null occurrences of colname removed. COUNT(DISTINCT colname) counts distinct occurrences of colname, with nulls removed.
Searching for String Patterns
SQL has a powerful predicate that allows you to search for patterns in character string columns. This is the LIKE predicate. Suppose you want to generate a list of the candidates whose first name starts with the letter G.
SELECT fname,lname,wphone,hphone FROM candidate WHERE fname LIKE 'G%' ORDER BY lname,fname
In this query, we use a wildcard character with the LIKE predicate. In SQL, the percent character (%) is a substitute for zero or more characters. The search string G% can be substituted with names like George, Gary, Ginger, and so on (since the percent character can substitute zero or more characters, the search string can also be a single letter G).
The percent character can be used any place in the search string. It also can be used as many times as you need it. The percent sign is not case-sensitive, so it can take the place of uppercase or lowercase letters. However, the constant characters included in your search string are case-sensitive.
Another wildcard character used with the LIKE predicate is the underline character (_). This character substitutes one and only one character. The underline character can take the place of any character. However, the underline character cannot be substituted for an empty character.
If the pattern needs to search for occurrences of the wildcard characters % and _ as their actual values, then the ESCAPE clause is used to specify a character that precedes the percent or underscore in the pattern.
The previous SQL can be modified to include all candidates' names and the telephone numbers for those candidates whose name has a lowercase letter "a" as its second letter.
SELECT fname,lname,wphone,hphone FROM candidate WHERE fname LIKE '_a%' ORDER BY lname,fname
This example uses two wildcard characters that work with the LIKE predicate. The search string in this example can include names, such as Paul, Gabriel, or Natalie. (The first character may be any character, the lowercase letter "a" is the second character in the string, and the string ends with any number of characters.)
When the pattern in a LIKE predicate is a fixed-length host variable, the correct length must be specified for the string to be returned. 'G%' assigned to an 8-byte variable (LIKE :variable) will search for all occurrences of 'G% ' (G, followed by any character, followed by 6 blank characters). To find rows that begin with a G, then 'G%%%%%%%' should be assigned to the fixed-length variable.
Searching for Data in Ranges
SQL also offers us a range operator. This operator is used to restrict rows that are in a particular range of values. Consider the requirement to list those candidates whose scores in the DB2 Certification exam are between 60 and 75.
SELECT DISTINCT fname,lname,wphone,hphone FROM candidate c, test_taken tt WHERE c.cid=tt.cid AND score BETWEEN 60 AND 75
The BETWEEN predicate includes the values that you specify for searching your data. An important fact about the BETWEEN predicate is that it can work with character ranges as well.
In addition to the score requirement, this example modifies the SQL to include only those candidates whose last name begins with a letter between B and G.
SELECT DISTINCT fname,lname,wphone,hphone FROM candidate c, test_taken tt WHERE c.cid=tt.cid AND score BETWEEN 60 AND 75 AND lname BETWEEN 'B' AND 'GZ'
In this example, the second BETWEEN predicate contains character values. We need to specify the GZ value to include all the possible names that start with the letter G. This was done assuming that the letter Z is the last possible value in the alphabet.
The arguments of the BETWEEN clause are not interchangeable; the first must specify the low value, and the second, the high value. BETWEEN 1 AND 2 will return all values within the range (inclusive of 1 and 2). BETWEEN 2 AND 1 will return zero rows.
Searching for Null Values
Null values represent an unknown value for a particular occurrence of an entity. We can use a null value in the cases where we don't know a particular value of a column. Let's say that we want a list of all those candidates whose score is not yet input. This condition is represented with a null value.
SELECT fname,lname,wphone,hphone FROM candidate c, test_taken tt WHERE c.cid=tt.cid AND score IS NULL
The IS predicate is used to search for the null value in this example. Remember that the null value means "unknown." Because it has no particular value, it can't be compared with other values. You can't use conditional operands, such as equal (=) or greater than (>), with null values.
Searching for Negative Conditions
The BETWEEN, IS, and LIKE predicates always look for the values that meet a particular condition. These predicates can also be used to look for values that don't meet a particular criterion.
The NOT predicate can be used to look for the opposite condition, combined with the LIKE, BETWEEN, and IS predicate, to accomplish negative searches, as shown in the following example. This example has a LIKE predicate combined with the NOT predicate. We want a list of those candidates whose last names do not start with the letter S.
SELECT DISTINCT fname,lname,wphone,hphone FROM candidate WHERE lname NOT LIKE 'S%' ORDER BY lname,fname
The next example has a BETWEEN predicate combined with the NOT predicate. We want the list of those candidates whose score, in any test, is not in the range 60 to 75.
SELECT DISTINCT fname,lname,wphone,hphone FROM candidate c, test_taken tt WHERE c.cid=tt.cid AND integer(score) NOT BETWEEN 60 and 75
In this example, the NOT predicate will exclude all the values that are in the range 60 to 75.
Negation can also be applied to the null value. This SQL produces a report that searches for those candidates that have a seat number assigned. This is expressed with a NOT NULL value.
SELECT DISTINCT fname,lname,wphone,hphone FROM candidate c, test_taken tt WHERE c.cid=tt.cid AND seat_no IS NOT NULL
The NOT operator can also be used to negate the standard comparison operators, =, <, <=, >, and >=.
When dealing with nulls, NOT or negation may not return the opposite of the positive logic. For example, WHERE SEAT_NO = 1 will return only the rows for seat number 1. Any value other than 1 is discarded, as these rows are FALSE, including nulls, since these are unknown. WHERE SEAT_NO <> 1 excludes rows where the seat number is 1, and also discards nulls because these are unknown.
Searching for a Set of Values
In SQL, it is possible to establish a restriction condition based on a set of values. Suppose that you need a list of the test centers that have candidates registered for the DB2 Fundamentals test and for the DB2 Application Development test. This can be queried with the following statement:
SELECT DISTINCT name,phone FROM test_center tc, test_taken tt WHERE tc.tcid=tt.tcid AND (number = '500' or number ='502')
To simplify building multiple OR conditions when multiple values for the same column are being compared, it is possible to rewrite the statement using the IN clause.
SELECT DISTINCT name,phone FROM test_center tc, test_taken tt WHERE tc.tcid=tt.tcid AND number IN ('500','502')
The IN clause is used to denote a set of values. In this example, we use a constant set of values.
You can also use the NOT predicate with the IN clause. In this case, the condition will be true when a value is not present in the set of values provided to the IN clause. You can use as many values as you wish in the IN clause, within the defined limits of the size of a SQL statement.
Advanced Selection Functionality
In Chapter 4, "Advanced SQL Coding," we will look at more of the functionality and power of the SELECT statement. We will cover topics such as
- Inner and outer joins
- Nested table expressions
- CASE expressions
- Row expressions