top of page

Welcome
to NumpyNinja Blogs

NumpyNinja: Blogs. Demystifying Tech,

One Blog at a Time.
Millions of views. 

Data Analysis with PostgreSQL Window Functions: A Titanic Case Study


When working with data analytics, one of the biggest challenges in SQL is answering questions like:

What is the rank of each record?

What is the running total over time?

How does today’s value compare with yesterdays?

What is each row’s value compared to the group average?

Traditional SQL aggregation using GROUP BY often not enough, because it collapses rows.

This is where window functions come in picture.

 

What Is a Window Function?

A window function performs a calculation across a set of related rows (called a window) while keeping each individual row visible in the result. Let’s understand with an example of Employee table with columns empId, empName, deptNo, salary.

empId

empName

deptNo

Salary

1001

Alex

2

5000

1202

Jose

2

3000

1103

George

3

4000

1091

Ethan

1

5000

1201

Charlie

3

2000

1121

Alice

2

3000

1108

Eddey

1

4000

1015

Jade

1

2000

 

Question:

👉 Show each employee with the average salary of their department?

·      Solution using GROUP BY

 SELECT deptNo, AVG(Salary)

          FROM employee

         GROUP BY deptNo;

deptNo

avg

1

3666

2

3666

3

3000

Problem with above result is employee-level details (Name, Id) are lost.

·      Solution using a Window Function

SELECT

  Id,

  Name,

  deptNo,

  Salary,

  AVG(Salary) OVER (PARTITION BY deptNo) AS dept_avg_salary

FROM employee;

  Result will be

Name

deptNo

Salary

dept_avg_salary

Ethan

1

5000

3666

Eddey

1

4000

3666

Jade

1

2000

3666

Alex

2

5000

3666

Jose

2

3000

3666

Alice

2

3000

3666

George

3

4000

3000

Charlie

3

2000

3000

 

                  From above result what we observed? Number of records are same from source to result. Here we can see employee names and their dept with average salaries.

We can say that average is calculated per row. Here you can observe that because there were 3 departments, and we wrote over clause as, OVER(PARTITION BY deptNo)

So data is divided into 3 windows (set of data) and each window (department wise) has all data belongs to it.

                  This is exactly, calculating across related rows while keeping each row visible.

There are following window function

1.   ROW_NUMBER() - Assigns a unique number to each row.

2.   RANK() and DENSE_RANK()

3.   Running Totals - Calculates cumulative totals

4.   LAG() and LEAD() — Comparing Rows

 

Lets understand more with a real data . I have a table with data of passengers were on titanic. We can see the table description with the query SELECT * FROM information_schema.columns WHERE table_name = 'passenger';


Table has 21 columns.

Ranking Passengers by Fare Within Each Class

SELECT

    name,

    pclass,

    fare,

    RANK() OVER (

        PARTITION BY pclass

        ORDER BY fare DESC

    ) AS fare_rank

FROM passenger;

 Understand What’s happening in query

PARTITION BY pclass → create a separate window for each class and assigning separate ranking for each class. Example class1 has ranks 1 to 80 and class 2 has rank 1 to 100.

ORDER BY fare DESC → highest fare gets rank 1

Rows are not grouped, only ranked



Observe above result from query here you can see fare_rank column has been created, that assign a rank according to fare. You must be wondering why there is rank 5 after 1. It is because rank() function assign rank row wise, there is 4 rows are with same fare has same rank as 1 and row 5 has less fair so assigned with 5 instead of 2.

Difference Between ROW_NUMBER, RANK, and DENSE_RANK

SELECT

    name,

    pclass,

    fare,

    ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY fare DESC) AS row_num,

    RANK()       OVER (PARTITION BY pclass ORDER BY fare DESC) AS rank,

    DENSE_RANK() OVER (PARTITION BY pclass ORDER BY fare DESC) AS dense_rank

FROM passenger;

 

    Now observe above query result, it has row_num, rank and dense_rank.

    ROW_NUMBER assigns a unique, sequential number to each row based on the descending order of fare. Even if multiple rows have the same fare, each row still receives a   different number.

   RANK assigns the same rank to rows with the same fare, but the next rank reflects the total number of rows ranked so far. This causes gaps in the ranking sequence.

   DENSE_RANK also assigns the same rank to rows with the same fare; however, it always assigns the next rank as the immediate next sequence number, regardless of how many   rows share the previous rank. As a result, there are no gaps in the ranking.

If we want to retrieve the top 10 records for each class, we need to apply a WHERE condition. However, since window functions are evaluated after the WHERE clause, we cannot filter directly on a window function result.

To achieve this, we must first calculate the window function in a subquery and then apply the WHERE condition in the outer query. The example below demonstrates this approach.

With ranked_passenger AS

         (SELECT

                  name,

                  pclass,

                  fare,

                  ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY fare DESC) AS row_num,

                  RANK()       OVER (PARTITION BY pclass ORDER BY fare DESC) AS rank,

                  DENSE_RANK() OVER (PARTITION BY pclass ORDER BY fare DESC) AS dense_rank

         FROM passenger)

select * from ranked_passenger where row_num<=10;

Let’s do some analysis

1.   Passenger Class Impact on Ticket Fare

Let’s analyze how fare differs by class. Normally, we’d use GROUP BY. With window functions, we can keep each passenger row:

SELECT

          name,

              pclass,

            fare,

             AVG(fare) OVER (PARTITION BY pclass) AS avg_fare_in_class

    FROM passenger;

 

This is extremely useful in analytics dashboards.

Insight:

First-class passengers paid significantly higher fares

Third-class fares cluster near the minimum

Window functions help us compare individual passenger’s vs class averages.


2.   Average survival Rate by Gender (Window Function Style)

SELECT

    name,

    sex,

    survived,

    AVG(survived) OVER (PARTITION BY sex) AS survival_rate

FROM passenger;

Insight:

Females show a much higher survival rate

Confirms the “women and children first” policy

This query is useful for feature engineering, model input preparation and row-level reporting

 

3.   Identifying High-Value Outliers Within Each Class

SELECT

    name,

    embarked,

    fare,

    SUM(fare) OVER (

        PARTITION BY embarked

        ORDER BY fare

        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

    ) AS running_fare_total

 

·      Passengers paying the same fare receive the same rank

·      If multiple passengers share a rank, the next rank is skipped

·      This helps preserve the idea of relative standing within a class


 


Even though 69 is far less than 512, Passenger ‘Sage, Miss. Constance Gladys’ is still a top-paying outlier in 3rd class.

Why This Identifies High-Value Outliers

·      Each passenger is evaluated relative to their peers

·      Outliers identified naturally as those with top ranks per class

·      This avoids bias caused by class-level price differences


Filtering Top Outliers (Example)

To fetch only the top 5 fare-paying passengers per class:

WITH ranked_passengers AS (

    SELECT

        name,

        pclass,

        fare,

        RANK() OVER (

            PARTITION BY pclass

            ORDER BY fare DESC

        ) AS fare_rank_in_class

    FROM passenger

)

SELECT *

FROM ranked_passengers

WHERE fare_rank_in_class <= 5;

Analytical Use of the above query is to detect pricing outliers, Identify elite passengers, Useful for segmentation.


4.   Age-Based Survival Patterns

Age was one of the most critical factors influencing survival on the Titanic. The famous rule “women and children first” suggests that younger passengers may have had a higher chance of survival. In this section, we analyze how survival rates vary across different age groups.

Before analyzing survival patterns, we group passengers into age bands, which makes trends easier to interpret.

Age based grouping Matters because raw ages are noisy and harder to compare. Age grouping reveals macro-level survival patterns, and this mirrors how analysts approach demographic data in real projects

         Now, calculate survival statistics for each age group.

                 

                 

Above result shows survival rates per age group and comparison of each age group’s survival rate with the previous age group.

Insights from LAG Analysis

·      The largest survival drop occurs when moving from children to teens

·      Survival probability steadily decreases with age

·      The decline becomes more gradual in older age groups

Similarly, we can analyze how each age group compares to the next one.

                          


Insights from LEAD Analysis

·      Children had a much higher survival advantage compared to teens

·      Adults and middle-aged passengers faced consistently worsening odds

·      Seniors had the lowest survival probability, with no further group to compare

Key Takeaways

·      Window functions preserve row-level detail while enabling deep analytics

·      PARTITION BY enables fair, segment-level comparisons

·      Ranking and running totals uncover hidden patterns

·      LAG() and LEAD() enable trend and transition analysis

·      Window functions turn SQL into a true analytical language

Final Thought

If GROUP BY tells you what happened, window functions tell you why—row by row, segment by segment.


Reffrences

+1 (302) 200-8320

NumPy_Ninja_Logo (1).png

Numpy Ninja Inc. 8 The Grn Ste A Dover, DE 19901

© Copyright 2025 by Numpy Ninja Inc.

  • Twitter
  • LinkedIn
bottom of page