Introduction to Structured Query Language


 

Objectives

This is the first lecture from the series describins structured query language. In the next weeks your objectives will be to:

  1. To learn Structured Query Language (SQL)

    SQL consists of three parts:

    1. Data definition language (DDL) that is used to create/remove/modify databases, create/remove users, create/remove/modify tables, create/remove indices, etc.

    2. Data manipulation language (DML) that is used to select (and display or somehow process) data from database, update data, delete data, etc.

    3. Data control language (DCL) that is used to prevent users/other systems from making harm to the database. It is used to control privilages, define what to do when unexpected happens, etc.

  2. Gain practical skill on how to apply SQL to build, manage and use datanases.

This lecture covers basics of data manipolation part of SQL. Specifically, it will concentrate on "SELECT" queries used to retrieve data from relational databases. Other parts of SQL will be covered in further lectures.

Basics of Data Manipulation

The most important, and probably the most common use of SQL is to select desired subset of data in the database. The selected data can be displayed, exported to file, or processed by some other applications. The most important is that you learn how to retrieve subset of data  that you want. In the other words, you need to learn how to create queries for selecting right data from database.  Not surprisingly, the SQL command that is used to select data is called select. In general, in the select command you specify what fields from your database you want to select, from what tables, and under what conditions. Additionally, you may specify how you want to aggregate or sort your results.

Let us consider a very simple data table of patients' information (Figure 1).

Figure 1: Patient table.

 

The simplest query one can create is to select all data from our the table with patients. The query that selects everything is (Q1):

SELECT * FROM patient;

In SQL, an asterisk "*" stands for all fields. So, the query directly translates to select all fields from table patient. The result of select, delete, update, and other SQL commands are always applied to all records in a considered table. This means that out query will produce result depicted in Figure 2.

Figure 2: Result of the first query.

A good news is that although we are using Access to show tables and results of queries, SQL is fairy (but not completely) standard, so the same query will work in other database systems (oracle, postgres, mysql, ...).

Suppose we are interested only in patient whose record number is AAA-123. To limit record which are displayed SQL used where clause. Our query looks like this (Q2):

SELECT * FROM patient WHERE RecordNumber = 'AAA-123';

The query will give answer shown in Figure 3:

Figure 3: Result of the second query.

Similarly we can create query that displays patients whose last name starts with "B" or "S" (Q4):

SELECT * FROM patient WHERE LastName like 'B*' OR LastName like 'S*';

or patients whose record number starts with "X" and last name starts with "B" (Q5):

SELECT * FROM patient WHERE RecordNumber like 'X*' AND LastName like 'B*';

Let's now consider an additional table with patients' addresses show in Figure 4. The following query displays patients with their addresses (Q6):

SELECT * FROM patient INNER JOIN address ON patient.RecordNumber = Address.RecordNumber;

Here is result:

Figure 4: Result of the sixth query.

Similarly, the query that displays only patients' names, and states is (Q7):

SELECT patient.FirstName, patient.LastName, address.state FROM patient

INNER JOIN address ON patient.RecordNumber = Address.RecordNumber;

and produces result:

Figure 5: The result of the seventh query.

 

Another important feature of queries is the ability to aggregate data. Aggregation of data in SQL is based on the "group by" command. Here is an example query that returns the number of addresses per patient from address table (Q8):

SELECT RecordNumber, count(*) FROM address GROUP BY RecordNumber;

and here is its result:

Figure 6: The result of the eight query.

By combining ideas from the seventh and eight query, we are now able to create query that displays patient's names, and numbers of addresses they have (Q9):

SELECT first(Patient.FirstName) as FirstName, first(Patient.LastName) as LastName, count(*) as NumberOfAddresses FROM patient

INNER JOIN address ON Patient.RecordNumber = Address.RecordNumber

GROUP BY Address.RecordNumber;

Result of the query is:

Figure 7: Result of the ninth query.

 

More SQL

Everything that you learned when creating queries in Access, applies to queries in SQL. In fact, when creating queries in Access you implicitly create SQL queries through its graphical interface. There advantages and disadvantages of using the graphical interface. You have access to expression builder, you do not need to know SQL syntax, etc. On the other hand, many database systems do not have such a graphical interface, or those they have are different, some complicated queries cannot be created, and so on.

Recall from last week tables with claims and icd9 codes. Now, we will build SQL queries to answer some of previously asked questions without using graphical interface.

  • Let's start with the simplest query: "identify patients that have diabetes in the above database." The query that answers this question is:

SELECT claims.ptid
FROM claims INNER JOIN icd9 ON claims.dignss_c = icd9.icd9
WHERE icd9.icddescription like '*diab*'
GROUP BY claims.ptid;

       You can see recording how this query was created here.

  • Another query on the same dataset responds to the question: "Calculate the average cost of each diagnosis sorted from most expensive to least expensive.  Exclude all bills with negative or 0 values." The query that answers this question is:

SELECT dignss_c, avg(blld_amt) FROM claims
WHERE blld_amt > 0
GROUP BY dignss_c
ORDER BY avg(blld_amt) desc;

     You can see recording how this query was created here.

  • Let's now answer some more complicated questions. In the section on the complex queries, one of the questions was to: "For each person, find the description of the diagnoses with maximum cost." There is no one the best way of solving complex problems. The same relates to creating complex queries. To answer this question one may work in three steps: (1) identify the maximum cost of diagnosis for each person; (2) identify diagnoses (can be more than one) whose cost equals to the maximum; and (3) find their descriptions. Temporary results in these steps can be stored in additional tables. The first query that finds the maximum billed amount per patient is:

SELECT ptid, max(blld_amt) as max_amt INTO temp1 FROM claims
group by ptid;

       Note that this is a SELECT INTO query that works similarly to SELECT queries, but results are inserted into a new table. The new table created here is called temp1.In the second step we use the result of the first query and join it with the claims table to get diagnoses for which the amount is maximum:

SELECT claims.ptid, claims.dignss_c INTO temp2 from claims
INNER JOIN temp1 on claims.blld_amt = temp1.max_amt AND claims.ptid = temp1.ptid;

       In the last step we need to identify names of diagnoses from the ICD9 codes table. This can be done by the query:

SELECT temp2.ptid, icd9.icddescription FROM temp2
INNER JOIN icd9 ON icd9.icd9 = temp2.dignss_c
GROUP BY temp2.ptid, icd9.icddescription
ORDER BY temp2.ptid;

       The query does two additional tricks. It sorts output by patients' ids.  It also removed duplicates by grouping by both fields. It may happen (and actually does in the data) that the same patient was billed more than one the same for the same, an the amount was maximum. We want to get rid of this. In general, if you group by all displayed fields, you simply remove duplicates.

Obviously, if you like to make things complicated, here is a complete one query that does the same thing:

SELECT t2.ptid, icd9.icddescription
FROM
(SELECT claims.ptid, claims.dignss_c FROM
(SELECT ptid, max(blld_amt) as max_amt FROM claims GROUP BY ptid) as t1
INNER JOIN claims
ON (claims.blld_amt=t1.max_amt) AND (claims.ptid=t1.ptid)) AS t2
INNER JOIN icd9 ON icd9.icd9 = t2.dignss_c
GROUP BY t2.ptid, icd9.icddescription
ORDER BY t2.ptid;

 

When you write complex queries, it is often a good idea to write your query in steps, and store intermediate results in temporary tables. In most cases, understandability of queries is the most important. If you understand a query, you know if it will work or not, and why. If you create a very complicated query, you can only guess if it works, or test it on a few examples. This is not sufficient!

 

Implicit Joins

There are situations, when it is easier not to explicitly state JOIN in your queries, but use an implicit join. While some criticize implicit joins, for the completeness of your SQL education, you should be able to use them.  It is matter of personal preference which one to use. Implicit joins may in many cases significantly simplify your queries, particularly when joining multiple tables.

In implicit joins, in the FROM part of query you list all tables to be joined, and conditions (relations) on which you want to join tables are moved to the WHERE part of query. For example:

SELECT * FROM patient INNER JOIN address ON patient.RecordNumber = Address.RecordNumber;

is equivalent to:

SELECT * FROM patient, address WHERE patient.RecordNumber = Address.RecordNumber;

Another example is:

SELECT * FROM patient INNER JOIN address ON patient.RecordNumber = Address.RecordNumber WHERE patient.LastName = 'Smith';

is equivalent to:

SELECT * FROM patient, address WHERE patient.RecordNumber = Address.RecordNumber AND patient.LastName = 'Smith';

 

Readings

From "Taylor AG.  SQL for dummies:  A reference for the rest of us"  read the following chapters:

  • Chapter 1: Introduction

  • Chapter 2:  SQL fundamentals

  • Chapter 3:  The components of SQL

  • Chapter 4:  Building and maintaining a simple database

  • Chapter 6:  Manipulating database data

  • Chapter 9:  Zeroing in on data you want

  • Chapter 10: Relational operators

  • Chapter 11:  Delving into nested queries

Presentations

There are several files available for this lecture:

  1. Listen to lecture on SQL Data Manipulation lecture (SWF file)

  2. Download slides for data manipulation

  3. Download Access 2007 database used in examples, or its Access 2003 version.

Narrated slides and videos require Flash.

Analyze Data

In this assignment you are asked to create SQL queries. Please write your queries "by hand" as SQL commands and do not use query graphical view in Access.

  1. Create two tables in Access.  Each table should contain at least three records and two fields.  Write an SQL command that would combine the two tables using the Union command.   What is the result?

  2. Download claims and icd9 tables, and write SQL queries that:

    1. Display patients with diabetes
    2. For each patient, display average charge
    3. For each patient, display the most expensive diagnoses
    4. For each diagnosis, display patients for which cost for this diagnosis was maximum.
  3. Download simple EMR database (Access 2007 or Access 2003). Write an SQL query that lists all diagnoses (text) for parients whose last name is Li.

Submit your responses as a word document attached to an email to your instructor.  Keep a copy of all of your work till end of semester.   

Recently Asked Questions

Ask a question and we will answer it within the next 48 hours.  If you have no questions, please review the answer to the questions asked by others: 
 

Question: In Q#1 do you mean the commands used in both questions we had, also do you mean the commands used to get the tables as in the result showm at the end? is that right?  Answer: Please use your answers to the question 1 in complex queries.   This question was asked on 12/6/2005 4:32:32 PM and answered on 12/6/2005 5:14:59 PM.

Question: Examples would be EXTREMELY useful for this lecture. Are you planning on providing any?  Answer: Examples of commands are posted in the lectures. I will add some more also over time.   This question was asked on 12/4/2005 2:53:22 PM and answered on 12/4/2005 8:37:38 PM.

Question: When will the final presentations/assignment for this lecture be posted?  Answer: On 12/2    This question was asked on 11/30/2005 8:19:00 PM and answered on 12/1/2005 6:07:44 PM.

Question: When will lectures 10 and 11 be completed as class is this Friday, 12/2? Per syllabus on 11/29/05: "Dec 2 Lecture on SQL (data definition) Lecture on SQL (data manipulation). These lectures are under development." And there isn't "Analyze Data" for either lecture. Will there be assignments that will be due 12/9/05 from these lectures? Thank you.   Answer: I am still working on these lectures and they will be complete on 12/2   This question was asked on 11/29/2005 10:04:57 PM and answered on 12/1/2005 6:07:25 PM.

 


Suggested Improvements

Comment: If questions and answers are going to be retained and used on these online courses, it would be advisable to ensure that questions and answers have dates which include the year and or the semester so students know to which semester/year the questions and answers apply.   This comment was left on 12/2/2006 3:22:19 PM.

Comment: I never received feedback on my complex queries homework so it was somewhat difficult to complete this homework. I could not do the last question on the complex queries homework, therefore, could not do it on this homework. I need feedback faster and I need concrete answers and more examples because the answers I am getting to my questions are very vague and confusing.    This comment was left on 11/6/2008 9:16:02 AM.

Comment: This lecture was ok. I wish we would have gone more in depth on how to write SQL instead of just having to read an entire book for a homework assignement.   This comment was left on 11/15/2008 9:21:08 PM.

Comment: the lectures helped    This comment was left on 11/26/2008 5:47:00 PM.

Comment: At first it was a little difficult for me to grasp the concepts, but after reading some of the SQL for dummies book, I was able to better understand. I think if the concepts were defined first, for example, the definitions of '*' or ';', it would make it a bit easier for students to understand.   This comment was left on 11/1/2009 7:50:17 PM.

Comment: NEED MORE EXAMPLES & VIDEOS. More guidance when learning a language for the first time!!   This comment was left on 11/8/2009 2:49:21 PM.

More


This page is part of the course on Healthcare Databases, the lecture on SQL.  Copyright © 2005 Farrokh Alemi, Ph.D. Created on January 9th 2005. Most recent revision 11/21/2008 by Janusz Wojtusiak.