Skip to main content
sql group bydata aggregationdatabase analyticssql grouping

SQL GROUP BY -- Complete Guide to Data Aggregation and Analysis

GROUP BY collapses rows that share a value into a single summary row. Combined with aggregate functions like COUNT, SUM, and AVG, it turns raw transaction data into reports: revenue by region, orders by month, average salary by department. Every reporting query you write uses GROUP BY.

TL;DR

SQL GROUP BY groups rows with identical values in specified columns and applies aggregate functions (COUNT, SUM, AVG, MIN, MAX) to each group. It returns one row per group. Use WHERE to filter rows before grouping and HAVING to filter groups after aggregation. Every non-aggregated column in SELECT must appear in the GROUP BY clause -- violating this rule is the most common GROUP BY error.

How GROUP BY Works

Conceptually, GROUP BY works in three steps. First, the database takes the rows that survived the WHERE filter. Second, it sorts or hashes those rows into buckets based on the GROUP BY columns -- all rows with the same value in those columns land in the same bucket. Third, it runs the aggregate function on each bucket and produces one output row per bucket.

GROUP BY DATA FLOWRaw RowsSales$60kSales$55kMarketing$65kMarketing$70kMarketing$62kEngineering$90kEngineering$85kEngineering$95kSales$58kGROUP BYdepartmentSales (3 rows)$60k + $55k + $58kMarketing (3 rows)$65k + $70k + $62kEngineering (3 rows)$90k + $85k + $95kSUM(salary)Result (3 rows)depttotalSales$173kMarketing$197kEngin.$270kEvery non-aggregated column in SELECT must appear in GROUP BY -- otherwise the database cannot pick a single value for the row
Click to enlarge

The critical rule: every column in SELECT must either appear in GROUP BY or be inside an aggregate function. If you select department and employee_name but only group by department, the database has three employee names per group and no rule for which one to return. Most databases raise an error. MySQL with default settings silently picks an arbitrary value, which is worse.

GROUP BY with Aggregate Functions

The five core aggregate functions cover most reporting needs:

-- COUNT: how many orders per customer SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id; -- SUM: total revenue by product category SELECT p.category, SUM(oi.quantity * oi.unit_price) AS total_revenue FROM order_items oi JOIN products p ON oi.product_id = p.product_id GROUP BY p.category; -- AVG: average order value by region SELECT region, AVG(total) AS avg_order_value FROM orders GROUP BY region; -- MIN / MAX: salary range by department SELECT department, MIN(salary) AS lowest_salary, MAX(salary) AS highest_salary FROM employees GROUP BY department;

Multi-column GROUP BY creates finer granularity. Grouping by category, region produces one row for each unique category-region combination:

SELECT category, region, COUNT(*) AS transaction_count, SUM(amount) AS total_sales FROM sales GROUP BY category, region ORDER BY category, total_sales DESC;

WHERE vs HAVING

WHERE filters rows before grouping. HAVING filters groups after aggregation. They are not interchangeable:

WHERE VS HAVING EXECUTIONStep 1: WHEREFilters individual rows✓ status = 'active'✗ status = 'inactive'Step 2: GROUP BYCollapses into groupsSales: 8 emps, $420kMktg: 3 emps, $195kStep 3: HAVINGFilters groups✓ Sales (8 > 5)✗ Mktg (3 < 5)WHERE status = 'active' ... GROUP BY department ... HAVING COUNT(*) > 5WHERE removes inactive employees first, then GROUP BY creates department groups, then HAVING keeps only large departments
Click to enlarge
-- Combined example: active employees only, departments with > 5 people SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary FROM employees WHERE status = 'active' GROUP BY department HAVING COUNT(*) > 5 ORDER BY avg_salary DESC;

Why not just use HAVING for everything? Because WHERE filters rows before the grouping work happens, reducing the dataset the database has to process. Filtering with HAVING after grouping wastes effort -- the database groups rows only to throw them away. Use WHERE for row-level conditions and HAVING exclusively for conditions on aggregate results.

Query optimizers can push WHERE predicates down to the scan level, reducing I/O before any grouping occurs. HAVING predicates, by definition, can only be evaluated after grouping is complete. Always prefer WHERE when the filter does not depend on an aggregate result.

— PostgreSQL Documentation, Table Expressions: GROUP BY and HAVING

Advanced Grouping

GROUP BY with date functions

Date functions inside GROUP BY produce time-period reports. Monthly revenue, quarterly headcount, yearly churn -- they all follow the same pattern:

-- Monthly revenue report SELECT YEAR(order_date) AS yr, MONTH(order_date) AS mo, SUM(total) AS monthly_revenue FROM orders GROUP BY YEAR(order_date), MONTH(order_date) ORDER BY yr, mo; -- PostgreSQL equivalent using DATE_TRUNC SELECT DATE_TRUNC('month', order_date) AS month, SUM(total) AS monthly_revenue FROM orders GROUP BY DATE_TRUNC('month', order_date) ORDER BY month;

Custom buckets with CASE

CASE inside GROUP BY lets you create arbitrary groupings that do not exist as columns in the data:

SELECT CASE WHEN total < 50 THEN 'Small (<$50)' WHEN total < 200 THEN 'Medium ($50-$199)' ELSE 'Large ($200+)' END AS order_size, COUNT(*) AS order_count, AVG(total) AS avg_value FROM orders GROUP BY CASE WHEN total < 50 THEN 'Small (<$50)' WHEN total < 200 THEN 'Medium ($50-$199)' ELSE 'Large ($200+)' END;

ROLLUP, CUBE, and GROUPING SETS

These extensions produce multi-level aggregations in a single query. They are essential for reporting dashboards that show both detail and totals.

ROLLUP adds subtotals and a grand total. The hierarchy follows the column order in the ROLLUP list:

-- MySQL syntax SELECT category, region, SUM(amount) AS total_sales FROM sales GROUP BY category, region WITH ROLLUP; -- PostgreSQL syntax SELECT category, region, SUM(amount) AS total_sales FROM sales GROUP BY ROLLUP(category, region);

GROUPING SETS lets you define exactly which groupings to compute:

-- PostgreSQL: three separate groupings in one query SELECT category, region, SUM(amount) AS total_sales FROM sales GROUP BY GROUPING SETS ( (category), -- totals by category (region), -- totals by region () -- grand total );

CUBE computes every possible combination of the listed columns:

-- SQL Server SELECT category, region, SUM(amount) AS total_sales FROM sales GROUP BY CUBE(category, region);

Practical Patterns

Top N by category

SELECT category, SUM(amount) AS total_sales FROM sales GROUP BY category ORDER BY total_sales DESC LIMIT 5;

Percentage of total

Compute each category's share using a subquery or window function for the denominator:

SELECT category, SUM(amount) AS category_sales, ROUND(SUM(amount) * 100.0 / (SELECT SUM(amount) FROM sales), 1) AS pct_of_total FROM sales GROUP BY category ORDER BY category_sales DESC;

Year-over-year comparison

SELECT YEAR(order_date) AS yr, MONTH(order_date) AS mo, SUM(total) AS revenue, LAG(SUM(total)) OVER ( PARTITION BY MONTH(order_date) ORDER BY YEAR(order_date) ) AS prev_year_revenue FROM orders GROUP BY YEAR(order_date), MONTH(order_date) ORDER BY yr, mo;

Groups above average

SELECT category, AVG(amount) AS avg_sale FROM sales GROUP BY category HAVING AVG(amount) > (SELECT AVG(amount) FROM sales) ORDER BY avg_sale DESC;

GROUP BY queries aggregate metrics like "total revenue" or "customer count" -- but those metrics need agreed-upon definitions. Dawiso's business glossary provides canonical metric definitions, ensuring that GROUP BY SUM(revenue) means the same thing across every team's reports. Through the data catalog, analysts can verify which source tables define "revenue" and what transformations were applied before aggregation.

Performance

GROUP BY performance depends on how much data the database has to process and how it executes the grouping.

Filter with WHERE first. Rows removed by WHERE never reach the GROUP BY phase. A query that filters orders to the last quarter before grouping by category does far less work than one that groups all orders and then filters the groups.

Index the GROUP BY columns. An index on the grouping columns can enable a sorted aggregation -- the database reads rows in order and collapses them without a separate sort step. Without an index, the database typically uses a hash aggregation, which builds a hash table in memory.

-- Index supports both WHERE and GROUP BY CREATE INDEX idx_orders_date_category ON orders(order_date, category); -- This query can scan the index in order SELECT category, SUM(total) AS revenue FROM orders WHERE order_date >= '2025-01-01' GROUP BY category;

Use EXPLAIN to check the strategy. Look for HashAggregate (hash-based, good for many groups) vs GroupAggregate (sort-based, good for few groups or when data is already sorted). If you see a Sort node before GroupAggregate, an index on the GROUP BY columns could eliminate it.

Materialized views for repeated aggregations. If the same GROUP BY query runs hourly for a dashboard, consider a materialized view that pre-computes the result and refreshes on a schedule.

Hash aggregation allocates memory proportional to the number of groups. When the hash table exceeds work_mem (PostgreSQL) or equivalent memory limits, it spills to disk, causing a significant performance cliff. For queries with millions of distinct groups, increase the memory limit or pre-filter with WHERE to reduce group count.

— PostgreSQL Documentation, Resource Consumption: work_mem

Common Errors

Non-aggregate column not in GROUP BY

The single most common GROUP BY error. If you select a column that is not aggregated and not in GROUP BY, the database does not know which of the group's values to return:

-- ERROR in PostgreSQL, SQL Server, and strict MySQL SELECT department, employee_name, SUM(salary) FROM employees GROUP BY department; -- Fix: either add to GROUP BY or aggregate it SELECT department, COUNT(*) AS headcount, SUM(salary) AS total_salary FROM employees GROUP BY department;

MySQL's permissive ONLY_FULL_GROUP_BY

Older MySQL versions (and current ones with ONLY_FULL_GROUP_BY disabled) silently pick an arbitrary value for non-grouped columns. This produces nondeterministic results -- the same query can return different employee names on different runs. Always enable ONLY_FULL_GROUP_BY or use a database that enforces the rule by default.

NULL values form their own group

Rows where the GROUP BY column is NULL are collected into a single group. This is standard behavior, but it surprises people who expect NULLs to be excluded. If you do not want a NULL group, filter them out with WHERE:

SELECT region, COUNT(*) AS order_count FROM orders WHERE region IS NOT NULL GROUP BY region;

Confusing WHERE and HAVING

Using HAVING where WHERE would work is not an error, but it is wasteful. HAVING runs after grouping, so it processes more data. Use WHERE for row-level conditions (e.g., WHERE status = 'active') and HAVING only when the condition involves an aggregate (e.g., HAVING COUNT(*) > 5).

Dawiso
Built with love for our users
Make Data Simple for Everyone.
Try Dawiso for free today and discover its ease of use firsthand.
© Dawiso s.r.o. All rights reserved