What is SQL?
SQL (you can say just the letters or pronounce it “sequel”) stands for “Structured Query Language”, and it’s an old paradigm that hasn’t lost its vigor after several decades at the forefront of database technologies.
There are several flavors of SQL: MySQL, MSSQL, postGreSQL, etc. They have the basics of the language in common, but will have some differences. All of these SQL variants support “relational” databases that are table-based. A relational database is one that assumes that the various tables in the database are related to one another and that one table will have data that has something to do with another table. For example, a common use case is that you have a “customer” table that has all the customer personal info, like name and phone number, with each customer identified by a code (say, a sequential number we call customer_id). Then, in a different table, called “orders”, we can say that a given order belongs to customer 201357, instead of including all their info.
Why does this matter? It saves space. Instead of repeating “Jane Doe, (215) 555-1212” for each order that Jane makes, we can just have that information in one place, in her row in the Customers table. Then, with each order she makes, we indicate it’s her order by putting her customer id alongside her order details. Also, if Jane changes her phone number, we only have to find and replace in one location, instead of many. Finally, it allows us to have a pretty simple “many to many” relationship. For example, let’s say Jane has multiple shipping addresses registered – one for work, one for her home, one for her daughter who’s away at college, etc. We can then use a “shipping_address” table to list all of them, indicating in each one that they belong to customer 201357.
If you really want to get into relational database design techniques, you can read up on the various degrees of “normalization”, which are intended to remove as much needless repetition as possible. But for now, let’s leave that aside. Suffice it to say that depending on the use case, you might have to combine data from multiple tables in order to get the information you want.
SQL databases are made up of tables, and each table has columns and rows. Columns are fields (like “name”, “age”, “address”, “BMI”, “start_date”, etc.). Rows are observations or data points (like orders, patients, inventory items, bills, etc.)
How can I get data from SQL?
First, you’ll need to connect to a SQL database. This depends on several factors, including what flavor of SQL you’re using, where it’s located, and what kind of computer you’re using to connect to it. This goes beyond the scope of this article. If you’re working with a SQL database set up for you by a data professional, they can help you manage the connection details and figure out how to connect.
Then, you’ll need to issue queries to extract data.
Regardless of the SQL flavor, you’re going to have to answer three questions to construct the most basic SQL query:
- What’s the table name where the data lives?
- What data fields do you want to pull out? (study id, mrn, age when measure was taken, score or value, etc.)? If you want everything, you’ll use an asterisk to indicate that.
- Under what conditions? (Only males, only the rows with non-empty values for a certain field, only the rows where patient age is above 12, etc.). If you have no conditions, and just want to pull from every record in the table, you can omit this.
You put the query together like this: SELECT + FROM + WHERE. While SQL keywords can be any case, conventionally we put these words in all caps to make them stand out.
SELECT field_i_want, other_field_i_want FROM table_im_interested_in WHERE age > 12;
SELECT * FROM my_table;
SELECT * FROM nutritional_value WHERE food IN ("apple", "banana", "grapefruit");
Notice here that I’m assuming interest in just one table. But data is connected across tables! That’s true, and you can certainly learn how to join tables in SQL to get that connected information. However, if you already use a statistical programming language like R or Python, you might find it more helpful just to download entire tables to your computing environment and then combine data there. That won’t be true if you’re dealing with millions of rows of data – you’ll be better off shaping your data in SQL. But for general use, when you’re dealing with thousands or tens of thousands of records, you’re probably better off doing your data combining and reshaping in your favorite programming environment.
Where can I learn SQL?
My recommendation is to go to Khan Academy’s SQL instruction. The first chapter is sufficient for most consumers of database data – you’ll learn by doing, issuing queries, examining database features, and learning about SQL commands. It really is fairly simple.