Learning SQL, a bottom up approach to learning to write queries

By May 15, 2018SQL Server

What prompted me to write this series?

Whether you realize it or not, working with data has become commonplace in the professional world. It’s easy to think of software developers or database administrators when you think of people who interact with databases, but this is only half the story. I’ve seen everyone from marketing gurus digging through interfaces building pseudo queries to finance wonks bending Excel and Access to their will. Data is an ever growing part of the IT landscape and it’s no longer confined to the safety of our databases. Unfortunately, we make learning databases a lot harder than we need to. For the last few years I’ve been responsible for training new developers as part of their onboarding process at my company where I teach them about databases and writing SQL to retrieve and change data.

One thing I’ve noticed time and time again is that the education around databases is far from practical and leaves a lot of gaps. This means that people never reach a point of feeling comfortable with their grasp of the subject. I ask “So how much experience do you have with databases?” or “Are you familiar with how to utilize a database in your application?” Ultimately, I receive a pretty generic response about crows foot diagrams or something about normalization. The way I was taught in school was the same way, a very top down perspective that focused way too much on learning normal forms and the proper way to create diagrams, and not nearly enough time on what happens if you have bad normalization or don’t think about intention when creating tables. This is a disservice because for most people, by the time in your career where you’re trusted enough to make database design decisions, you’re quite a long way from learning about them in school.

I’ve been refining my part of the on-boarding process over the last few years and I’ve distilled it down to a 3-4 hour bootcamp that I now facilitate for our new employees. I’ve had good results and they always say things like “I wish I had something like this in school.” So I’m going to take this as an opportunity to share my content here and hope that it helps someone. This will be a 4 (technically 5 counting this one) part series that will cover getting you started from selecting what data you need, combining multiple datasets together, applying filters where needed, and even doing computations of the data.

A quick overview:

  • Learning SQL, a bottom up approach to learning to write queries. (you are here)
  • SELECT-ing what you need and what you don’t.
  • When WHERE is what you need, how to use it, and why it’s awesome.
  • Why JOINs aren’t rocket surgery.
  • Figuring out GROUP BY and why it can be confusing.

Quick Disclaimer: If you’re comfortable with the academic side of databases and are really itching to get started, you can skip this primer and go to the next article. This post is to act as a refresher and make sure that everyone has access to the prerequisite information for the rest of the series. Originally this was part of the first post but it kept getting larger and so I decided it deserved it’s own post. I’ll continue to refer back to it in later articles so keep the link handy.

Reliving database 101 – The Foundation

As I said earlier, most of the problems I observe are a result of a very superficial understanding of databases. I have had fresh out of college students ask me, “Why would you even use a database?” Which at first blush feels like an odd question to a database administrator, but think about how most computer science curriculums are structured. Most of the assignments you do in school involve taking input from a file or from the user. If you have a good, teacher maybe using an API of some kind. I’m going to model some of the conversations I’ve had with new employees that have helped me get to the heart of some common misunderstandings.

Q: Who uses databases?

  • Literally everyone. I challenge you to find a company that is more than a small mom and pop operation or a lemonade stand that doesn’t rely on a database.

Q: Why are they so ubiquitous? What do they do better than shared files like we used in school?

  • Great question, hypothetical and inquisitive student. In a very reductionist bullet point, they’re really good at making it easy to store and change data. They also make it very easy to share that data with other people and trust that people only see complete changes. Finally they take care of controlling who can change data, and what happens if people try to do it at the same time. This is an incredibly simplistic explanation of modern database systems and I’m sure this will make most data professionals cringe but I think you get the point.
  • Additionally most databases provide a programming language specifically for accessing and manipulating data. Coincidentally the point of this series.

Q: Okay, I’ll bite that everyone uses databases. People have data and need to store it. What’s the deal with normalization?

  • Man, you’re on a roll with these questions. Normalization is just a well defined set of rules for how data should be stored to make your database (and life) better. I feel like we focus too much on the what and not nearly enough on the why. I believe firmly that if you ever have to deal with poor normalization, you’ll never have to be coerced into following it. I’ll give a brief overview here but I’m not going to belabor it. It will make more sense when we’re going over actual query writing.
  • A very brief explanation of normalization. All of these are cumulative, you don’t get to 3rd normal form unless you’re also 2nd and 1st.
    • 1NF – Every column should represent a single atomic value. Don’t put more than 1 value in a column. (like first and last name)
    • 2NF – 1NF + Everything in the table should relate back to the thing that identifies the record. Don’t cohabitate discrete pieces of information in the same table.
    • 3NF – 2NF + 1NF + All type values should be pulled out into their own tables so that they can be stored independently of the data.

Q: That was a lot and I don’t really get it.

  • Technically that’s not a question, but that’s fine. This is just setting the groundwork for the training. Just remember it’s here if you need to come back and review.

Q:Should we talk about ACID?

  • This is another one of those foundational things that is very important, and provides a formative foundation, but like normalization, most people really understand it better once they have context. You can read about driving a car for years, but it all really comes together once you get behind the wheel. That’s no excuse for not knowing the rules, but it’s understandable if you’re not an expert before you start writing queries.
    • Atomicity
      • The atomicity property identifies that the transaction is atomic. An atomic transaction is either fully completed, or is not begun at all. Any updates that a transaction might affect on a system are completed in their entirety. If for any reason an error occurs and the transaction is unable to complete all of its steps, the system is returned to the state it was in before the transaction was started. An example of an atomic transaction is an account transfer transaction. The money is removed from account A then placed into account B. If the system fails after removing the money from account A, then the transaction processing system will put the money back into account A, thus returning the system to its original state. This is known as a rollback.
    • Consistency
      • A transaction enforces consistency in the system state by ensuring that at the end of any transaction the system is in a valid state. If the transaction completes successfully, then all changes to the system will have been properly made, and the system will be in a valid state. If any error occurs in a transaction, then any changes already made will be automatically rolled back. This will return the system to its state before the transaction was started. Since the system was in a consistent state when the transaction was started, it will once again be in a consistent state.
        Looking again at the account transfer system, the system is consistent if the total of all accounts is constant. If an error occurs and the money is removed from account A and not added to account B, then the total in all accounts would have changed. The system would no longer be consistent. By rolling back the removal from account A, the total will again be what it should be, and the system back in a consistent state.Here is the textbook definition in readers digest version.
    • Isolation
      • When a transaction runs in isolation, it appears to be the only action that the system is carrying out at one time. If there are two transactions that are both performing the same function and are running at the same time, transaction isolation will ensure that each transaction thinks it has exclusive use of the system. This is important in that as the transaction is being executed, the state of the system may not be consistent. The transaction ensures that the system remains consistent after the transaction ends, but during an individual transaction, this may not be the case. If a transaction was not running in isolation, it could access data from the system that may not be consistent. By providing transaction isolation, this is prevented from happening.
    • Durability
      • A transaction is durable in that once it has been successfully completed, all of the changes it made to the system are permanent. There are safeguards that will prevent the loss of information, even in the case of system failure. By logging the steps that the transaction performs, the state of the system can be recreated even if the hardware itself has failed. The concept of durability allows the developer to know that a completed transaction is a permanent part of the system, regardless of what happens to the system later on..

Q: So now that that’s out of the way we can start writing queries?

  • Sort of, congratulations on completing database 101 “the good parts” but remember, we’re going to actually get started in the next blog post. We do have a few things left to resolve.

Some assumptions as we get started

The programming language we’re going to use is SQL which stands for Structured Query Language. SQL is the language that we use to tell the database what we want it to return to us, or how we want it to change the data stored within. I write all of this based off of my experience with SQL Server, because that is what I do for a living, but I believe that most of the things I will show you fall under the ANSI standard and should be compatible regardless of the platform you are using.

I’m making a few assumptions. I’m also going to assume that you have a basic understanding of what a database is used for and what you want to get, but you’re not quite sure how to get it. I will make references or comparisons to programming methodologies as examples but often it’s just to give context, you should be fine without it and can just skip it. I’m also making an assumption for now that you are actually able to query a database and just need the know-how. I’ll write an article at a future date to help you get setup with a local database with some test data. For now I will simply provide the scripts with the sample data that I am using here in my blog and you can run them on your server to use as a basis for following along.

The order of things

Now we’re ready to start talking about writing queries. I want to lay the groundwork for something that was the most confusing thing for me when first starting out writing queries. I’m a programmer by trade and have experience with languages like Java, C, C#, and Powershell. I’m used to code being executed the way that I write it, and explicitly writing code to make things happen. This type of programming is called Imperative programming and is a common mainstay of modern programming languages. Most mainstream languages are either all or mostly imperative in nature. It was shocking and sometimes befuddling to find out that SQL does not work the same way. This is because SQL is a declarative programming language. Instead of explicitly telling the server where to go, how to find, what methods to use, and what algorithms to sort, you simply say go, find, or sort, and it does.

The first thing we need to understand is the processing order of our queries. I won’t belabor this too much, but I will refer to it again later and so I wanted to put it here now. It’s okay if you don’t understand this right now, it will make more sense later. You will see this material again.

Logical query processing order

  1. FROM
  2. JOIN ON
  3. OUTER
  4. WHERE
  5. GROUP BY
  6. CUBE | ROLLUP
  7. HAVING
  8. SELECT
  9. DISTINCT
  10. ORDER BY
  11. TOP

Again, this doesn’t have to make sense yet, but just understand that when the query is being processed, it doesn’t execute the way it is written, instead it follows this order. This will help you later to understand why certain operations don’t work like you would expect them to. I’ll include information about what each phase of the process can do in a later post, but for now just commit to coming back to refer to this and we can move on.

Preparing the environment

I’m going to provide a few files to run that will help you get started, but a quick overview won’t hurt.

We’re going to be creating a few tables:

  • TblIndividual
    • This is a table that contains an example set of people that we use for our example data. This is all generated data.
  • TblAddresses
    • This is a table containing addresses belonging to our people; we will use this for demonstrating joins and other operations.
  • TblAddressTypes
    • This is our final table and is used to demonstrate following normal forms and how to do complex, multi-table joins.

Here is a sample of one of the tables. The script attached below has the full command to create and populate the tables, but this is an example of what our individual table looks like.

CREATE TABLE TblIndividuals
(
IndividualID INT NOT NULL PRIMARY KEY IDENTITY(1, 1)
, Prefix VARCHAR(50) NULL
, FirstName VARCHAR(50) NOT NULL
, MiddleName VARCHAR(50) NULL
, LastName VARCHAR(50) NOT NULL
, MaidenName VARCHAR(50) NULL
, Suffix VARCHAR(50) NULL
, Gender VARCHAR(1) NULL
, SSN VARCHAR(9) NULL
, GraduationYear INT NULL
, GPA VARCHAR(50) NULL
, Major VARCHAR(100) NULL
, Class INT NULL

We will be using this table and the others to learn how to write queries to access and modify the data stored within. We will then learn how to join the tables together to create a single result set out of multiple sets of data.

In my next post, we’ll start using all of this information to actually write some queries. Attached below is a zip file with the script needed to follow along with all of our examples. It will create a few tables in tempdb for you to use and test with. tempdb is reset when you restart the box, so you may need to rerun these scripts if that happens. These should work on SQL Server 2008r2 forward and I have tested them on SQL Server 2012, 2016, and 2017.

A big thanks to Lucas Flores for help with editing and format!

Attachments

Leave a Reply