SQL for Data Analysis: Complete Data Analyst Guide

Foundations

Analytical SQL: essential foundations

SQL (Structured Query Language) is the standard language for querying relational databases. But in a data analysis context, queries go far beyond a simple SELECT * FROM table: they combine aggregations, multiple joins, subqueries, and analytical functions to answer precise business questions.

Basic aggregation functions — COUNT, SUM, AVG, MIN, MAX — combined with GROUP BY and HAVING form the backbone of analysis. JOIN (INNER, LEFT, RIGHT, FULL OUTER) combines multiple tables. WHERE filters rows before aggregation; HAVING filters after.

SQL vs NoSQL for analysis

Modern analytical data warehouses (BigQuery, Snowflake, Redshift, DuckDB) all use SQL as their primary interface, even if they store data in columnar format rather than row-based. SQL knowledge is therefore transferable across tools. NoSQL databases (MongoDB, Cassandra) have their own query languages, but SQL remains the analytical standard.

Analytical SQL vs transactional SQL

Transactional SQL (OLTP) is optimized for fast reads and writes on few rows. Analytical SQL (OLAP) works on millions of rows, prioritizing reads and aggregations. Analytical queries use different patterns: no frequent UPDATE/INSERT, but complex SELECT statements with window calculations.

Advanced functions

Window Functions

Window functions are the most powerful SQL feature for data analysis. They calculate a value for each row based on a set of related rows (the 'window'), without reducing the number of rows in the result — unlike GROUP BY.

Basic syntax: FUNCTION() OVER (PARTITION BY column ORDER BY column ROWS/RANGE BETWEEN ... AND ...). The OVER clause defines the window: PARTITION BY groups rows (like GROUP BY but without aggregating them), ORDER BY orders within the partition, and the frame clause delimits which rows are included in the calculation.

Ranking functions: ROW_NUMBER, RANK, DENSE_RANK

ROW_NUMBER() assigns a unique number to each row within the partition. RANK() assigns the same rank to ties but creates 'gaps' (1, 1, 3). DENSE_RANK() assigns the same rank to ties without gaps (1, 1, 2). Typical use case: 'for each customer, get their TOP 1 purchase by amount' — solution: ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) and filter WHERE rn = 1.

LAG and LEAD: comparing with the previous period

LAG(column, n) returns the value of the row n positions before in the window. LEAD(column, n) returns the value n positions after. Classic analytical use case: calculating monthly growth. SELECT month, revenue, LAG(revenue, 1) OVER (ORDER BY month) AS previous_revenue, revenue - LAG(revenue, 1) OVER (ORDER BY month) AS variation. No more self-join on the same table with a date offset.

Rolling aggregates: SUM, AVG, COUNT with OVER

Aggregation functions become window functions when you add OVER. SUM(revenue) OVER (PARTITION BY year ORDER BY month) calculates a monthly running total per year. AVG(score) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) calculates a 7-day moving average. These calculations were previously impossible without a self-join or application code.

Universal adoption

Window functions are defined in the SQL:2003 standard and supported by PostgreSQL, MySQL 8+, MariaDB 10.2+, SQLite 3.25+, BigQuery, Snowflake, Redshift, DuckDB, and almost all modern analytical warehouses. They are available everywhere you work.

SQL:2003 Standard — ISO/IEC 9075

Code structure

CTEs: Common Table Expressions for readable queries

A CTE (WITH ... AS (...)) is a named subquery that can be referenced multiple times in the main query. It dramatically improves the readability and maintainability of complex analytical queries.

Key advantage over nested subqueries: the CTE is defined once at the top and referenced by name, which allows chaining transformation steps logically. A complex analytical query can be broken down into 5-6 successive CTEs, each performing one precise operation.

Recursive CTEs: hierarchies and graphs

A recursive CTE (WITH RECURSIVE) references itself to traverse hierarchical structures: org charts, nested categories, dependency graphs. Structure: an 'anchor' part (base case) and a recursive part (WITH RECURSIVE cte AS (SELECT ... UNION ALL SELECT ... FROM cte JOIN ...)). Use case: find all subordinates of a manager at n levels deep.

CTE vs subquery vs view

CTE = temporary within the query, not stored. Subquery = embedded in the WHERE or FROM, less readable when complex. View = stored in the database, accessible from multiple queries. For readability of a complex one-off query: CTE. To reuse the same logic from multiple places: view.

Performance

Optimizing analytical SQL queries

A SQL query can return the correct result in 50ms or 5 minutes depending on whether it uses indexes correctly. Understanding the execution plan is the key skill for optimization.

EXPLAIN and EXPLAIN ANALYZE

EXPLAIN shows the execution plan without running the query. EXPLAIN ANALYZE executes the query and shows real times vs estimates. Elements to watch: Seq Scan (sequential full table read, expensive), Index Scan (reading via index, fast), Hash Join vs Nested Loop (join strategies), and the number of rows processed at each step.

Indexing strategy

A B-tree index speeds up queries on filtered columns (WHERE) and joined columns (JOIN ... ON). Practical rules: index frequently filtered columns, foreign keys, ORDER BY columns. Avoid indexes on low-cardinality columns (boolean columns, statuses with few values). Partial indexes (WHERE active = true) are useful for tables where a subset is frequently queried.

Common SQL anti-patterns

SELECT * on a large table retrieves unnecessary columns and overloads the network. Functions on indexed columns in WHERE (WHERE YEAR(date) = 2026) prevent index use — prefer WHERE date BETWEEN '2026-01-01' AND '2026-12-31'. Correlated subqueries (re-executed for each row of the main query) are often replaceable with a JOIN or CTE.

Use cases

Common SQL analytical patterns

Certain SQL patterns recur systematically in the daily work of a Data Analyst. Knowing them by heart significantly accelerates productivity.

Cohort analysis

Cohort analysis groups users by acquisition period and tracks their behavior over time. SQL pattern: (1) CTE 'first_activity' with MIN(date) per user. (2) CTE 'activity' that joins each event to its cohort. (3) SELECT cohort, months_since_acquisition, COUNT(DISTINCT user_id). This pattern measures retention, churn, and long-term value by cohort.

Running totals and market share

SUM(revenue) OVER (ORDER BY date) for a running total. SUM(revenue) / SUM(SUM(revenue)) OVER () * 100 for each row's share of the total (divide by the grand total via a window without PARTITION BY). NTILE(4) OVER (ORDER BY revenue) to segment into quartiles.

SQL in the modern stack

dbt transforms SQL into a transformation framework: models are .sql files, versioned in Git, with automatic tests (unique, not_null, referential integrity) and generated documentation. dbt allows industrializing analytical SQL logic with software engineering practices.

Method

Anchoring analytical SQL with spaced repetition

SQL is a practical skill: mastery comes from repeating patterns, not passive reading. Flashcards are particularly effective for memorizing syntaxes (window functions, recursive CTEs, EXPLAIN ANALYZE) and subtle distinctions (RANK vs DENSE_RANK, LAG vs LEAD, CTE vs subquery).

Essential SQL cards to master

OVER (PARTITION BY ... ORDER BY ...) syntax, difference between ROW_NUMBER / RANK / DENSE_RANK, LAG and LEAD with n offset, WITH ... AS CTE structure, reading an EXPLAIN ANALYZE, and the 3 common anti-patterns. These are the fundamentals of every Data Analyst or Data Engineer interview.

Frequently asked questions about SQL and data analysis

What is the difference between GROUP BY and a window function?

GROUP BY reduces the number of rows in the result by aggregating: one row per group. A window function (OVER) calculates a value for each row considering other rows in the window, without reducing the row count. GROUP BY: 100 orders → 12 rows (one per month). Window function: 100 orders → 100 rows, each with its monthly running total.

What is the difference between RANK and DENSE_RANK?

RANK() and DENSE_RANK() both assign the same rank to ties. The difference: with RANK, after two tied rows at rank 1, the next value is rank 3 (a 'gap'). With DENSE_RANK, the next value is rank 2 ('dense', no gap). ROW_NUMBER() assigns a unique number to each row even if values are identical.

What are LAG and LEAD used for?

LAG(column, n) returns the value of the row n positions before in the window. LEAD(column, n) returns the value n positions after. Typical use: calculating period-over-period change — revenue - LAG(revenue, 1) OVER (ORDER BY month) — without needing a self-join on the same table.

What is a CTE and why use it?

A CTE (Common Table Expression) is a named subquery defined with WITH name AS (...) before the main query. It improves readability by breaking a complex query into named steps, can be referenced multiple times, and allows creating recursive CTEs for hierarchies. It replaces illegible nested subqueries.

How do you optimize a slow SQL query?

First reflex: EXPLAIN ANALYZE to see the execution plan and identify costly Seq Scans. Check that WHERE and JOIN columns are indexed. Avoid functions on indexed columns in WHERE. Replace SELECT * with only needed columns. For analytical warehouses (BigQuery, Snowflake), check table partitioning and clustering.

What is the difference between WHERE and HAVING?

WHERE filters rows BEFORE aggregation (GROUP BY). HAVING filters groups AFTER aggregation. Example: WHERE revenue > 100 keeps order rows over 100. HAVING SUM(revenue) > 1000 keeps only groups whose total exceeds 1000. WHERE is more performant because it reduces data before aggregation.

What is dbt and why use it with SQL?

dbt (data build tool) is a framework that turns analytical SQL into versionable code. Each model is a .sql file, organized in a dependency DAG. dbt adds: automatic tests (uniqueness, not_null, referential integrity), generated documentation, configurable materialization (view, table, incremental), and visual lineage. It is the standard tool for data transformation in modern data stacks.

Previous article: Business Intelligence and KPI

Next article: RAG - Retrieval Augmented Generation

SQL for data analysis:
from basic SELECT to window functions

What you will learn