Optimising SQL: Subqueries vs. Joins
TL;DR: joins are the winner in this battle. Why? Explore the nuances of SQL optimisation through a detailed comparison of subqueries and joins.
Joins and subqueries are both used to combine data from different tables into a single result set. If you are starting your journey with SQL, you might choose subqueries, as they are often the most intuitive and logical way to achieve this. On the other hand, more experienced SQL users understand that the key difference between subqueries and joins lies in their impact on query performance, with joins often being more efficient than subqueries.
Let's explore why this is the case in this article:
- What is a subquery?
- What is a join?
- Comparing subqueries and joins
- Advantages of subqueries
- Advantages of joins
- Preferred scenarios
- The impact on query performance
- Final thoughts
What Is a Subquery?
A SQL subquery, also known as an inner query, is essentially a query within another SQL query. To be more precise, a subquery is defined as follows:
A subquery is a SELECT statement that is coded within another SQL statement. For this to work, you must enclose the subquery in parentheses [1].
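When a subquery is coded within another subquery, it is called a nested subquery. Conceptually, a subquery that does not reference the outer query executes first, and its result is passed to the outer query, although the optimiser is free to choose a different execution order.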
A subquery can return:
- a single value,
- a list of values (a result set that has a single column), or
- a table of values (a result set that has multiple columns).
A subquery can be coded, or introduced, anywhere a single value, a list of values, or a table of values is allowed. There are four ways to introduce a subquery:
- In a WHERE clause as a search condition
- In a HAVING clause as a search condition
- In the FROM clause as a table specification
- In the SELECT clause as a column specification
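The syntax of a subquery is the same as for a standard SELECT statement. However, an ORDER BY clause in a subquery is rarely useful, and some database systems (SQL Server, for example) reject it unless it is paired with a row-limiting clause such as TOP.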
Subqueries are often used for comparison purposes, to compute aggregate values, or to determine the existence of rows in a table.
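To make the four placements concrete, here is a minimal sketch. It assumes an in-memory SQLite database driven from Python's built-in sqlite3 module, and the customers and orders tables (and their data) are hypothetical, invented only for illustration. Other database systems may differ in syntax details.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")

# 1. WHERE clause: the subquery returns a list of values for IN.
in_where = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders)
    ORDER BY name
""").fetchall()

# 2. HAVING clause: the subquery returns a single value (the overall average).
in_having = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > (SELECT AVG(amount) FROM orders)
    ORDER BY customer_id
""").fetchall()

# 3. FROM clause: the subquery returns a table of values (a derived table).
in_from = conn.execute("""
    SELECT MAX(t.total)
    FROM (SELECT customer_id, SUM(amount) AS total
          FROM orders
          GROUP BY customer_id) AS t
""").fetchone()

# 4. SELECT clause: the subquery returns a single value for each outer row.
in_select = conn.execute("""
    SELECT c.name,
           (SELECT COUNT(*) FROM orders AS o
            WHERE o.customer_id = c.customer_id) AS order_count
    FROM customers AS c
    ORDER BY c.name
""").fetchall()

print(in_where, in_having, in_from, in_select, sep="\n")
```

Note that the fourth query references the outer table inside the subquery, which makes it a correlated subquery.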
What Is a Join?
A join combines columns from two or more tables into a result set based on the join conditions you specify.
Joins are fundamental to relational database operations as they enable you to query data from multiple tables as if it were from one single table.
I elaborated on joins in my article All About Joins. Therefore, to gain a comprehensive understanding of what joins are, what types of joins exist, and how to work with them, I recommend referring to that article. For the purposes of this article, I will briefly touch on the following points:
- A join condition names a column in each of the tables involved in the join and indicates how the two columns should be compared. In most cases, you use the equal operator to retrieve rows with matching columns. However, you can also use any other comparison operator in a join condition.
- Joins merge tables horizontally and are specified in the FROM clause of a SQL statement.
- There are several types of joins, including INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN. INNER JOIN (or simply JOIN) and LEFT JOIN are the most commonly used joins in practice.
- Tables are typically joined on the relationship between the primary key in one table and a foreign key in the other table. However, you can also join tables based on relationships not defined in the database. These are called ad-hoc relationships.
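As a quick illustration of the two most common join types, the sketch below runs an INNER JOIN and a LEFT JOIN over the same data. It assumes SQLite through Python's sqlite3 module, and the customers and orders tables are hypothetical examples rather than anything from a real system.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")

# INNER JOIN keeps only the rows that satisfy the join condition,
# so the customer without orders does not appear.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN keeps every row from the left table and fills the
# right-hand columns with NULL (None in Python) when nothing matches.
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.name, o.amount
""").fetchall()

print(inner_rows)
print(left_rows)
```

The join condition here follows the primary key to foreign key relationship between the two tables, which is the typical case described above.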
Comparing Subqueries and Joins
Advantages of Subqueries
- You can use a subquery to pass an aggregate value to the main query.
- A subquery tends to be more intuitive when it uses an ad-hoc relationship between the two tables.
- Long, complex queries can sometimes be easier to code using subqueries. Subqueries can simplify complex queries by breaking them down into more manageable parts.
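The first advantage, passing an aggregate value to the main query, might look like the sketch below. It assumes SQLite via Python's sqlite3 module, and the products table and its prices are hypothetical.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    INSERT INTO products VALUES
        (1, 'Keyboard', 40.0),
        (2, 'Monitor', 200.0),
        (3, 'Mouse', 20.0),
        (4, 'Laptop', 900.0);
""")

# The subquery computes one aggregate value (the average price, 290.0 here)
# and the outer query compares each row against it.
above_average = conn.execute("""
    SELECT name, price
    FROM products
    WHERE price > (SELECT AVG(price) FROM products)
    ORDER BY name
""").fetchall()

print(above_average)
```

The query reads almost like the question it answers ("which products cost more than the average?"), which is a large part of why subqueries feel intuitive in cases like this.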
Advantages of Joins
- The SELECT clause of a join can include columns from both tables. Joins allow for more flexible result sets, combining multiple columns from different tables.
- A join tends to be more intuitive when it uses an existing relationship between the two tables, such as a primary key to foreign key relationship.
- Joins are more readable and maintainable when dealing with multiple tables, as the relationships are explicitly defined in the FROM clause.
- Joins are often more efficient than subqueries, especially for joining large tables, as databases are optimised for join operations.
Preferred Scenarios
Subqueries:
- Preferred when you need to select a single value or a set of values to use in an outer query condition.
- Useful for data that requires aggregation before being joined or compared to another table.
- Ideal for situations where the intermediate result set is small or needs to be reused multiple times within the outer query.
Joins:
- Preferred for retrieving data from multiple tables in a single query operation, especially when dealing with large datasets.
- More efficient for straightforward queries where data from different tables need to be combined based on direct relationships.
- Ideal for scenarios requiring data from multiple tables to be presented in a single result set, without the need for complex subquery logic.
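The two approaches also combine well. The sketch below illustrates the "aggregation before being joined" scenario: a subquery in the FROM clause totals the orders per customer, and a join attaches the customer name. It assumes SQLite via Python's sqlite3 module, with hypothetical customers and orders tables.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")

# The derived table "t" aggregates first (one row per customer),
# then the join combines that small result with the customers table.
totals = conn.execute("""
    SELECT c.name, t.total
    FROM customers AS c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM orders
          GROUP BY customer_id) AS t
      ON t.customer_id = c.customer_id
    ORDER BY c.name
""").fetchall()

print(totals)
```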
The Impact on Query Performance
The impact on query performance of SQL subqueries and joins can vary significantly depending on the structure of the databases, the specific queries, and the SQL database management system being used.
Different database systems have different strengths and weaknesses in how they handle subqueries and joins. It's essential to consult the documentation and performance tuning guides for your specific database system. However, in general:
- Subqueries might not always be well-optimised by the SQL optimiser, especially if they are deeply nested or complex. This can lead to inefficient execution plans where the database does more work than necessary.
- On the other hand, database systems are typically optimised to handle joins efficiently, especially when the tables involved are properly indexed. Joins are a fundamental part of relational databases and are usually well-optimised by the database's query planner.
Data volume matters too: the larger the dataset, the more pronounced the performance differences are likely to be. Joins on large datasets can be very fast if the data is well-indexed, but slow if not. Subqueries on large datasets can degrade performance if they result in multiple full table scans.
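One way to see what the database actually does is to ask it for the execution plan. The sketch below assumes SQLite, whose EXPLAIN QUERY PLAN statement reports the chosen plan, and compares the plan for the same join before and after an index is added. The tables, the data, and the index name idx_orders_customer are hypothetical, and the exact plan text varies between SQLite versions, so treat the printed output as something to inspect rather than rely on. Other systems offer their own variants of EXPLAIN.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")


def plan(sql):
    """Return the plan steps SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable description.
    return [row[-1] for row in rows]


query = """
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE c.customer_id = 1
"""

plan_before = plan(query)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)

print("without index:", plan_before)
print("with index:   ", plan_after)
```

Running the same comparison on realistic data volumes, and on the database system you actually use, is what tells you whether an index or a rewrite is worth it.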
Subqueries, especially nested or correlated subqueries, can be expensive in terms of performance. A correlated subquery (one that references a column from the outer query) can be particularly costly because it may need to be executed repeatedly, once for each row processed by the outer query.
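Here is a sketch of a correlated subquery next to one possible join-based rewrite. It assumes SQLite via Python's sqlite3 module and a hypothetical orders table. Both queries find the orders whose amount is above that customer's own average. Whether the rewrite is actually faster depends on the database system and the data, because many optimisers can transform correlated subqueries on their own.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL
    );
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0), (4, 2, 100.0);
""")

# Correlated subquery: the inner query references o.customer_id, so it is
# logically evaluated once for each row of the outer query.
correlated = conn.execute("""
    SELECT o.order_id
    FROM orders AS o
    WHERE o.amount > (SELECT AVG(i.amount)
                      FROM orders AS i
                      WHERE i.customer_id = o.customer_id)
    ORDER BY o.order_id
""").fetchall()

# Join rewrite: compute every customer's average once in a derived table,
# then join it back to the orders.
joined = conn.execute("""
    SELECT o.order_id
    FROM orders AS o
    JOIN (SELECT customer_id, AVG(amount) AS avg_amount
          FROM orders
          GROUP BY customer_id) AS a
      ON a.customer_id = o.customer_id
    WHERE o.amount > a.avg_amount
    ORDER BY o.order_id
""").fetchall()

print(correlated, joined)
```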
In some cases, a subquery is the only way to achieve a certain result. In others, it makes the query more readable and logically structured by replacing complex joins and unions, with little or no performance degradation. However, if a subquery can be rewritten as a join, it often should be, as joins are generally more efficiently handled by database systems.
In summary, both subqueries and joins have their place in SQL querying, and their impact on performance can vary widely based on the query's structure, the size of the data, and the database's optimisation capabilities.
Final Thoughts
The choice between subqueries and joins largely hinges on the specific requirements of the query, the relationships between the data, and considerations of performance. However, joins are typically more efficient and straightforward for combining multiple tables.
Many subqueries can be restated as joins, and most joins can be restated as subqueries. If a subquery can be rewritten as a join, it often should be, as joins are generally handled more efficiently by database systems. Alongside this, it's important to ensure tables are properly indexed.
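One caveat when restating a subquery as a join: the two are not always drop-in replacements. The sketch below, which assumes SQLite via Python's sqlite3 module and hypothetical customers and orders tables, shows that an IN subquery returns each customer once, while a plain join returns one row per matching order unless the duplicates are removed.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 300.0);
""")

# Subquery version: customers that have at least one order.
with_subquery = conn.execute("""
    SELECT customer_id, name
    FROM customers
    WHERE customer_id IN (SELECT customer_id FROM orders)
    ORDER BY customer_id
""").fetchall()

# Naive join version: Ada appears twice because she has two orders.
naive_join = conn.execute("""
    SELECT c.customer_id, c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# Equivalent join version: DISTINCT removes the duplicate rows.
distinct_join = conn.execute("""
    SELECT DISTINCT c.customer_id, c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(with_subquery, naive_join, distinct_join, sep="\n")
```

Checking that the rewritten query returns the same rows as the original is a sensible first step before comparing their performance.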
Testing and analysing are key to mastering the art of SQL optimisation. Happy querying and experimenting!
REFERENCES
- Joel Murach, Murach’s MySQL: Training & Reference. First Edition (Fresno, CA: Mike Murach & Associates, Incorporated, 2012).
- Alan Beaulieu, Learning SQL: Generate, Manipulate, and Retrieve Data. Third Edition (O'Reilly Media, Inc., 2020).