Data Analysis with PostgreSQL Window Functions: A Titanic Case Study

Neetu Rathaur
Jan 12
6 min read

When working with data analytics, one of the biggest challenges in SQL is answering questions like:

What is the rank of each record?

What is the running total over time?

How does today’s value compare with yesterdays?

What is each row’s value compared to the group average?

Traditional SQL aggregation using GROUP BY often not enough, because it collapses rows.

This is where window functions come in picture.

What Is a Window Function?

A window function performs a calculation across a set of related rows (called a window) while keeping each individual row visible in the result. Let’s understand with an example of Employee table with columns empId, empName, deptNo, salary.

empId	empName	deptNo	Salary
1001	Alex	2	5000
1202	Jose	2	3000
1103	George	3	4000
1091	Ethan	1	5000
1201	Charlie	3	2000
1121	Alice	2	3000
1108	Eddey	1	4000
1015	Jade	1	2000

Question:

👉 Show each employee with the average salary of their department?

· Solution using GROUP BY

 SELECT deptNo, AVG(Salary)

          FROM employee

         GROUP BY deptNo;

deptNo	avg
1	3666
2	3666
3	3000

Problem with above result is employee-level details (Name, Id) are lost.

· Solution using a Window Function

SELECT

  Id,

  Name,

  deptNo,

  Salary,

  AVG(Salary) OVER (PARTITION BY deptNo) AS dept_avg_salary

FROM employee;

Result will be

Name	deptNo	Salary	dept_avg_salary
Ethan	1	5000	3666
Eddey	1	4000	3666
Jade	1	2000	3666
Alex	2	5000	3666
Jose	2	3000	3666
Alice	2	3000	3666
George	3	4000	3000
Charlie	3	2000	3000

From above result what we observed? Number of records are same from source to result. Here we can see employee names and their dept with average salaries.

We can say that average is calculated per row. Here you can observe that because there were 3 departments, and we wrote over clause as, OVER(PARTITION BY deptNo)

So data is divided into 3 windows (set of data) and each window (department wise) has all data belongs to it.

This is exactly, calculating across related rows while keeping each row visible.

There are following window function

1. ROW_NUMBER() - Assigns a unique number to each row.

2. RANK() and DENSE_RANK()

3. Running Totals - Calculates cumulative totals

4. LAG() and LEAD() — Comparing Rows

Lets understand more with a real data . I have a table with data of passengers were on titanic. We can see the table description with the query SELECT * FROM information_schema.columns WHERE table_name = 'passenger';

Table has 21 columns.

Ranking Passengers by Fare Within Each Class

SELECT

    name,

    pclass,

    fare,

    RANK() OVER (

        PARTITION BY pclass

        ORDER BY fare DESC

    ) AS fare_rank

FROM passenger;

Understand What’s happening in query

PARTITION BY pclass → create a separate window for each class and assigning separate ranking for each class. Example class1 has ranks 1 to 80 and class 2 has rank 1 to 100.

ORDER BY fare DESC → highest fare gets rank 1

Rows are not grouped, only ranked

Observe above result from query here you can see fare_rank column has been created, that assign a rank according to fare. You must be wondering why there is rank 5 after 1. It is because rank() function assign rank row wise, there is 4 rows are with same fare has same rank as 1 and row 5 has less fair so assigned with 5 instead of 2.

Difference Between ROW_NUMBER, RANK, and DENSE_RANK

SELECT

    name,

    pclass,

    fare,

    ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY fare DESC) AS row_num,

    RANK()       OVER (PARTITION BY pclass ORDER BY fare DESC) AS rank,

    DENSE_RANK() OVER (PARTITION BY pclass ORDER BY fare DESC) AS dense_rank

FROM passenger;

Now observe above query result, it has row_num, rank and dense_rank.

ROW_NUMBER assigns a unique, sequential number to each row based on the descending order of fare. Even if multiple rows have the same fare, each row still receives a different number.

RANK assigns the same rank to rows with the same fare, but the next rank reflects the total number of rows ranked so far. This causes gaps in the ranking sequence.

DENSE_RANK also assigns the same rank to rows with the same fare; however, it always assigns the next rank as the immediate next sequence number, regardless of how many rows share the previous rank. As a result, there are no gaps in the ranking.

If we want to retrieve the top 10 records for each class, we need to apply a WHERE condition. However, since window functions are evaluated after the WHERE clause, we cannot filter directly on a window function result.

To achieve this, we must first calculate the window function in a subquery and then apply the WHERE condition in the outer query. The example below demonstrates this approach.

With ranked_passenger AS

         (SELECT

                  name,

                  pclass,

                  fare,

                  ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY fare DESC) AS row_num,

                  RANK()       OVER (PARTITION BY pclass ORDER BY fare DESC) AS rank,

                  DENSE_RANK() OVER (PARTITION BY pclass ORDER BY fare DESC) AS dense_rank

         FROM passenger)

select * from ranked_passenger where row_num<=10;

Let’s do some analysis

1. Passenger Class Impact on Ticket Fare

Let’s analyze how fare differs by class. Normally, we’d use GROUP BY. With window functions, we can keep each passenger row:

SELECT

          name,

              pclass,

            fare,

             AVG(fare) OVER (PARTITION BY pclass) AS avg_fare_in_class

    FROM passenger;

This is extremely useful in analytics dashboards.

Insight:

First-class passengers paid significantly higher fares

Third-class fares cluster near the minimum

Window functions help us compare individual passenger’s vs class averages.

2. Average survival Rate by Gender (Window Function Style)

SELECT

    name,

    sex,

    survived,

    AVG(survived) OVER (PARTITION BY sex) AS survival_rate

FROM passenger;

Insight:

Females show a much higher survival rate

Confirms the “women and children first” policy

This query is useful for feature engineering, model input preparation and row-level reporting

3. Identifying High-Value Outliers Within Each Class

SELECT

    name,

    embarked,

    fare,

    SUM(fare) OVER (

        PARTITION BY embarked

        ORDER BY fare

        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

    ) AS running_fare_total

· Passengers paying the same fare receive the same rank

· If multiple passengers share a rank, the next rank is skipped

· This helps preserve the idea of relative standing within a class

Even though 69 is far less than 512, Passenger ‘Sage, Miss. Constance Gladys’ is still a top-paying outlier in 3rd class.

Why This Identifies High-Value Outliers

· Each passenger is evaluated relative to their peers

· Outliers identified naturally as those with top ranks per class

· This avoids bias caused by class-level price differences

Filtering Top Outliers (Example)

To fetch only the top 5 fare-paying passengers per class:

WITH ranked_passengers AS (

    SELECT

        name,

        pclass,

        fare,

        RANK() OVER (

            PARTITION BY pclass

            ORDER BY fare DESC

        ) AS fare_rank_in_class

    FROM passenger

)

SELECT *

FROM ranked_passengers

WHERE fare_rank_in_class <= 5;

Analytical Use of the above query is to detect pricing outliers, Identify elite passengers, Useful for segmentation.

4. Age-Based Survival Patterns

Age was one of the most critical factors influencing survival on the Titanic. The famous rule “women and children first” suggests that younger passengers may have had a higher chance of survival. In this section, we analyze how survival rates vary across different age groups.

Before analyzing survival patterns, we group passengers into age bands, which makes trends easier to interpret.

Age based grouping Matters because raw ages are noisy and harder to compare. Age grouping reveals macro-level survival patterns, and this mirrors how analysts approach demographic data in real projects

Now, calculate survival statistics for each age group.