SQL GROUP BY -- Complete Guide to Data Aggregation and Analysis
GROUP BY collapses rows that share a value into a single summary row. Combined with aggregate functions like COUNT, SUM, and AVG, it turns raw transaction data into reports: revenue by region, orders by month, average salary by department. Every reporting query you write uses GROUP BY.
SQL GROUP BY groups rows with identical values in specified columns and applies aggregate functions (COUNT, SUM, AVG, MIN, MAX) to each group. It returns one row per group. Use WHERE to filter rows before grouping and HAVING to filter groups after aggregation. Every non-aggregated column in SELECT must appear in the GROUP BY clause -- violating this rule is the most common GROUP BY error.
How GROUP BY Works
Conceptually, GROUP BY works in three steps. First, the database takes the rows that survived the WHERE filter. Second, it sorts or hashes those rows into buckets based on the GROUP BY columns -- all rows with the same value in those columns land in the same bucket. Third, it runs the aggregate function on each bucket and produces one output row per bucket.
The critical rule: every column in SELECT must either appear in GROUP BY or be inside an aggregate function. If you select department and employee_name but only group by department, the database has three employee names per group and no rule for which one to return. Most databases raise an error. MySQL with default settings silently picks an arbitrary value, which is worse.
GROUP BY with Aggregate Functions
The five core aggregate functions cover most reporting needs:
-- COUNT: how many orders per customer
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
-- SUM: total revenue by product category
SELECT p.category, SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category;
-- AVG: average order value by region
SELECT region, AVG(total) AS avg_order_value
FROM orders
GROUP BY region;
-- MIN / MAX: salary range by department
SELECT department,
MIN(salary) AS lowest_salary,
MAX(salary) AS highest_salary
FROM employees
GROUP BY department;Multi-column GROUP BY creates finer granularity. Grouping by category, region produces one row for each unique category-region combination:
SELECT category, region,
COUNT(*) AS transaction_count,
SUM(amount) AS total_sales
FROM sales
GROUP BY category, region
ORDER BY category, total_sales DESC;WHERE vs HAVING
WHERE filters rows before grouping. HAVING filters groups after aggregation. They are not interchangeable:
-- Combined example: active employees only, departments with > 5 people
SELECT department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;Why not just use HAVING for everything? Because WHERE filters rows before the grouping work happens, reducing the dataset the database has to process. Filtering with HAVING after grouping wastes effort -- the database groups rows only to throw them away. Use WHERE for row-level conditions and HAVING exclusively for conditions on aggregate results.
Query optimizers can push WHERE predicates down to the scan level, reducing I/O before any grouping occurs. HAVING predicates, by definition, can only be evaluated after grouping is complete. Always prefer WHERE when the filter does not depend on an aggregate result.
— PostgreSQL Documentation, Table Expressions: GROUP BY and HAVING
Advanced Grouping
GROUP BY with date functions
Date functions inside GROUP BY produce time-period reports. Monthly revenue, quarterly headcount, yearly churn -- they all follow the same pattern:
-- Monthly revenue report
SELECT YEAR(order_date) AS yr,
MONTH(order_date) AS mo,
SUM(total) AS monthly_revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY yr, mo;
-- PostgreSQL equivalent using DATE_TRUNC
SELECT DATE_TRUNC('month', order_date) AS month,
SUM(total) AS monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;Custom buckets with CASE
CASE inside GROUP BY lets you create arbitrary groupings that do not exist as columns in the data:
SELECT
CASE
WHEN total < 50 THEN 'Small (<$50)'
WHEN total < 200 THEN 'Medium ($50-$199)'
ELSE 'Large ($200+)'
END AS order_size,
COUNT(*) AS order_count,
AVG(total) AS avg_value
FROM orders
GROUP BY
CASE
WHEN total < 50 THEN 'Small (<$50)'
WHEN total < 200 THEN 'Medium ($50-$199)'
ELSE 'Large ($200+)'
END;ROLLUP, CUBE, and GROUPING SETS
These extensions produce multi-level aggregations in a single query. They are essential for reporting dashboards that show both detail and totals.
ROLLUP adds subtotals and a grand total. The hierarchy follows the column order in the ROLLUP list:
-- MySQL syntax
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY category, region WITH ROLLUP;
-- PostgreSQL syntax
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY ROLLUP(category, region);GROUPING SETS lets you define exactly which groupings to compute:
-- PostgreSQL: three separate groupings in one query
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
(category), -- totals by category
(region), -- totals by region
() -- grand total
);CUBE computes every possible combination of the listed columns:
-- SQL Server
SELECT category, region, SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE(category, region);Practical Patterns
Top N by category
SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category
ORDER BY total_sales DESC
LIMIT 5;Percentage of total
Compute each category's share using a subquery or window function for the denominator:
SELECT category,
SUM(amount) AS category_sales,
ROUND(SUM(amount) * 100.0 / (SELECT SUM(amount) FROM sales), 1) AS pct_of_total
FROM sales
GROUP BY category
ORDER BY category_sales DESC;Year-over-year comparison
SELECT YEAR(order_date) AS yr,
MONTH(order_date) AS mo,
SUM(total) AS revenue,
LAG(SUM(total)) OVER (
PARTITION BY MONTH(order_date)
ORDER BY YEAR(order_date)
) AS prev_year_revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY yr, mo;Groups above average
SELECT category, AVG(amount) AS avg_sale
FROM sales
GROUP BY category
HAVING AVG(amount) > (SELECT AVG(amount) FROM sales)
ORDER BY avg_sale DESC;GROUP BY queries aggregate metrics like "total revenue" or "customer count" -- but those metrics need agreed-upon definitions. Dawiso's business glossary provides canonical metric definitions, ensuring that GROUP BY SUM(revenue) means the same thing across every team's reports. Through the data catalog, analysts can verify which source tables define "revenue" and what transformations were applied before aggregation.
Performance
GROUP BY performance depends on how much data the database has to process and how it executes the grouping.
Filter with WHERE first. Rows removed by WHERE never reach the GROUP BY phase. A query that filters orders to the last quarter before grouping by category does far less work than one that groups all orders and then filters the groups.
Index the GROUP BY columns. An index on the grouping columns can enable a sorted aggregation -- the database reads rows in order and collapses them without a separate sort step. Without an index, the database typically uses a hash aggregation, which builds a hash table in memory.
-- Index supports both WHERE and GROUP BY
CREATE INDEX idx_orders_date_category
ON orders(order_date, category);
-- This query can scan the index in order
SELECT category, SUM(total) AS revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY category;Use EXPLAIN to check the strategy. Look for HashAggregate (hash-based, good for many groups) vs GroupAggregate (sort-based, good for few groups or when data is already sorted). If you see a Sort node before GroupAggregate, an index on the GROUP BY columns could eliminate it.
Materialized views for repeated aggregations. If the same GROUP BY query runs hourly for a dashboard, consider a materialized view that pre-computes the result and refreshes on a schedule.
Hash aggregation allocates memory proportional to the number of groups. When the hash table exceeds work_mem (PostgreSQL) or equivalent memory limits, it spills to disk, causing a significant performance cliff. For queries with millions of distinct groups, increase the memory limit or pre-filter with WHERE to reduce group count.
— PostgreSQL Documentation, Resource Consumption: work_mem
Common Errors
Non-aggregate column not in GROUP BY
The single most common GROUP BY error. If you select a column that is not aggregated and not in GROUP BY, the database does not know which of the group's values to return:
-- ERROR in PostgreSQL, SQL Server, and strict MySQL
SELECT department, employee_name, SUM(salary)
FROM employees
GROUP BY department;
-- Fix: either add to GROUP BY or aggregate it
SELECT department, COUNT(*) AS headcount, SUM(salary) AS total_salary
FROM employees
GROUP BY department;MySQL's permissive ONLY_FULL_GROUP_BY
Older MySQL versions (and current ones with ONLY_FULL_GROUP_BY disabled) silently pick an arbitrary value for non-grouped columns. This produces nondeterministic results -- the same query can return different employee names on different runs. Always enable ONLY_FULL_GROUP_BY or use a database that enforces the rule by default.
NULL values form their own group
Rows where the GROUP BY column is NULL are collected into a single group. This is standard behavior, but it surprises people who expect NULLs to be excluded. If you do not want a NULL group, filter them out with WHERE:
SELECT region, COUNT(*) AS order_count
FROM orders
WHERE region IS NOT NULL
GROUP BY region;Confusing WHERE and HAVING
Using HAVING where WHERE would work is not an error, but it is wasteful. HAVING runs after grouping, so it processes more data. Use WHERE for row-level conditions (e.g., WHERE status = 'active') and HAVING only when the condition involves an aggregate (e.g., HAVING COUNT(*) > 5).