The beauty of OLAP SQL

In my daily job, I develop backends for analytical dashboards that show graphs to users to help them better understand their data. I want to share with you a real case scenario using SQL to write powerful OLAP queries.

Imagine you sell Pizza brands, and you are interested in comparing with your competitors to see how much of the pizza shelf your products are occupying. Some interesting questions for you would be:

What is your share of assortment in the pizza category? Is 50% of distributed pizza products on Walmart yours? Is it only 5%?
On average, how many of your pizza products are distributed in Kroger? You have 5 different products but only 2 are available on average? Maybe more/less?
How is your share of the assortment evolving in time? Are you +10% or -5% with respect to the last year? last month?
Who are the competitors detaining the majority of the shelf? How well do you compare with them?
Who is putting new products the market? Who is putting less?
Which of your brands is successful? Which one needs to be taken out of the market? or enhanced with ads/better prices, etc…

The data looks like the following (you can find it in a parquet file in here):

┌───────┬──────────────┬──────────┬───────┬──────────┬─────────┬──────────────────┐
│ p_id  ┆ manufacturer ┆ category ┆ brand ┆ retailer ┆ period  ┆ distributed_days │
│ ---   ┆ ---          ┆ ---      ┆ ---   ┆ ---      ┆ ---     ┆ ---              │
│ i64   ┆ i64          ┆ i64      ┆ i64   ┆ i64      ┆ str     ┆ i64              │
╞═══════╪══════════════╪══════════╪═══════╪══════════╪═════════╪══════════════════╡
│ 6581  ┆ 61           ┆ 20       ┆ 273   ┆ 1        ┆ 2023_52 ┆ 31407            │
│ 8563  ┆ 34           ┆ 4        ┆ 274   ┆ 1        ┆ 2023_52 ┆ 27515            │
│ 8411  ┆ 11           ┆ 6        ┆ 276   ┆ 1        ┆ 2023_52 ┆ 30693            │
│ 20163 ┆ 21           ┆ 4        ┆ 277   ┆ 1        ┆ 2023_52 ┆ 61               │
│ 5669  ┆ 56           ┆ 4        ┆ 278   ┆ 1        ┆ 2023_52 ┆ 20665            │
│ …     ┆ …            ┆ …        ┆ …     ┆ …        ┆ …       ┆ …                │
│ 8932  ┆ 2            ┆ 23       ┆ 2     ┆ 1        ┆ 2024_1  ┆ 25584            │
│ 8927  ┆ 2            ┆ 2        ┆ 2     ┆ 1        ┆ 2024_1  ┆ 27163            │
│ 8923  ┆ 2            ┆ 14       ┆ 2     ┆ 1        ┆ 2024_1  ┆ 21691            │
│ 8918  ┆ 2            ┆ 3        ┆ 2     ┆ 1        ┆ 2024_1  ┆ 31407            │
│ 8910  ┆ 2            ┆ 3        ┆ 2     ┆ 1        ┆ 2024_1  ┆ 17               │
└───────┴──────────────┴──────────┴───────┴──────────┴─────────┴──────────────────┘

This is anonymized data and represents a subset of scraped products from retailers websites (Walmart, Kroger, …). In this specific case, we are interested in analyzing the share of assortment of each brand manufacturer with respect to different variables.

Each line contains data about a product:

Unique ID of the product
The Manufacturer’s ID
Categorization of the product : Toothpaste, Deodorants, Sauces, …
Brand of the product, for example: Dove, Pepsi, KitKat, …
Retailer and the week where data was scraped. Data here represents Last week of 2023 and first week of 2024.
Distributed days: The number of the days we’ve seen the product on the website during the period and over all shops of that retailer. For example if Walmart has 1000 shops in the US and the product was available on all shops during the entire week, the number of distributed days would be 1000 * 7 = 7000 days. We will see how is this used to compute share of assortment later.

In the following I will be using DuckDB. An amazing and powerful embedded query engine.

1) How many unique products distributed by manufacturer for both weeks over all retailers?

This an easy query, we just have to take distinct products and group by the manufacturer

SELECT
    manufacturer,
    COUNT(DISTINCT p_id) AS nb_unique_products -- the number of unique products
FROM
    product_details
GROUP BY
    ALL -- A powerful construct provided by DuckDB
ORDER BY
    nb_unique_products DESC
LIMIT
    10;
-- ┌──────────────┬────────────────────┐
-- │ manufacturer │ nb_unique_products │
-- │    int64     │       int64        │
-- ├──────────────┼────────────────────┤
-- │            7 │               3463 │ <-- The big players
-- │           34 │               2688 │
-- │            2 │               2368 │
-- │           26 │               1227 │ 
-- │           39 │                978 │
-- │            6 │                783 │
-- │           58 │                727 │
-- │           21 │                499 │
-- │           57 │                372 │ <-- Let's say this is me
-- │            3 │                333 │ 
-- ├──────────────┴────────────────────┤
-- │ 10 rows                 2 columns │
-- └───────────────────────────────────┘

2) How are my products distributed over the different categories?

Filter on my products only, and then count the number of unique products grouping by the product category.

It’s good to see the percentage each category accounts for. A window function is an amazing tool. It applies an aggregate like computation over a window of rows.

Important: The rows considered by a window function are those produced WHERE, GROUP BY, and HAVING clauses if any. You can think of it as if the window function was run the last.

nb_unique_products / SUM(nb_unique_products) OVER () means:

Take the current row’s nb_unique_productsand divide it by the sum of nb_unique_products of all rows in the window. The window here is all categories.

SELECT
    category,
    COUNT(DISTINCT p_id) AS nb_unique_products,
    ROUND(
        100 * nb_unique_products / SUM(nb_unique_products) OVER (),
        2
    ) AS "% of products"
FROM
    product_details
WHERE
    manufacturer = 57
GROUP BY
    category
ORDER BY
    nb_unique_products DESC

-- ┌──────────┬────────────────────┬───────────────┐
-- │ category │ nb_unique_products │ % of products │
-- │  int64   │       int64        │    double     │
-- ├──────────┼────────────────────┼───────────────┤
-- │        4 │                203 │         54.57 │< products concentrated on
-- │        3 │                 50 │         13.44 │< these two categories.
-- │       18 │                 23 │          6.18 │
-- │       16 │                 22 │          5.91 │
-- │       22 │                 22 │          5.91 │
-- │        2 │                 22 │          5.91 │
-- │        1 │                 18 │          4.84 │
-- │       11 │                 10 │          2.69 │
-- │        5 │                  2 │          0.54 │
-- └──────────┴────────────────────┴───────────────┘

Now, imagine you are responsible for categories 1, 2, 3 and 4 only. You don’t want to see other categories. You might be tempted by doing:

SELECT
    category,
    COUNT(DISTINCT p_id) AS nb_unique_products,
    ROUND(
        100 * nb_unique_products / SUM(nb_unique_products) OVER (),
        2
    ) AS "% of products"
FROM
    product_details
WHERE
    manufacturer = 57
    AND category IN (1, 2, 3, 4) -- You added this
GROUP BY
    category
ORDER BY
    nb_unique_products DESC;

-- which gives:
-- ┌──────────┬────────────────────┬───────────────┐
-- │ category │ nb_unique_products │ % of products │
-- │  int64   │       int64        │    double     │
-- ├──────────┼────────────────────┼───────────────┤
-- │        4 │                203 │         69.28 │
-- │        3 │                 50 │         17.06 │
-- │        2 │                 22 │          7.51 │
-- │        1 │                 18 │          6.14 │
-- └──────────┴────────────────────┴───────────────┘

What happened? Share of products went from: 54.57% to 69.28%. As I said, window function is evaluated on rows after the WHERE and GROUP BY clauses were executed. In this case, it means you are considering the share on only these categories (1, 2, 3 and 4).

There are two ways of doing this:

Use a sub query:

SELECT
  *
FROM
 (<PREVIOUS QUERY>)
WHERE
  category IN (1, 2, 3, 4);

Use QUALIFY provided by DuckDB:

SELECT
  category,
  COUNT(DISTINCT p_id) AS nb_unique_products,
  ROUND(
      100 * nb_unique_products / SUM(nb_unique_products) OVER (),
      2
  ) AS "% of products"
FROM
  product_details
WHERE
  manufacturer = 57 
GROUP BY
  category
QUALIFY -- At last, filter on window function result using the following:
  category IN (1, 2, 3, 4)
ORDER BY
  nb_unique_products DESC
-- ┌──────────┬────────────────────┬───────────────┐
-- │ category │ nb_unique_products │ % of products │
-- │  int64   │       int64        │    double     │
-- ├──────────┼────────────────────┼───────────────┤
-- │        4 │                203 │         54.57 │
-- │        3 │                 50 │         13.44 │
-- │        2 │                 22 │          5.91 │
-- │        1 │                 18 │          4.84 │
-- └──────────┴────────────────────┴───────────────┘

3) Since I am competing on categories 3 and 4, who am I competing with?

We have seen that we are distributing products mainly in two catagories (for example: Toothpaste and Mouthwash) I should care more about competitors on these categories only.

Let’s start simple:

SELECT 
    category,
    manufacturer,
    COUNT( DISTINCT p_id) as nb_unique_products
FROM product_details
WHERE category IN (3, 4)
GROUP BY ALL
ORDER BY category, nb_unique_products DESC
LIMIT 5;
-- ┌──────────┬──────────────┬────────────────────┐
-- │ category │ manufacturer │ nb_unique_products │
-- │  int64   │    int64     │       int64        │
-- ├──────────┼──────────────┼────────────────────┤
-- │        3 │            2 │                196 │
-- │        3 │           39 │                185 │
-- │        3 │           41 │                124 │
-- │        3 │           67 │                119 │
-- │        3 │           34 │                 92 │
-- └──────────┴──────────────┴────────────────────┘

This gives us the number of unique products distributed by all manufacturers in categories 3 and 4.

Let’s try to only get the top 3 manufacturers on each category, which are competitors we need to worry about the most.

SELECT
    category,
    manufacturer,
    COUNT(DISTINCT p_id) as nb_unique_products, -- over the category+manufacturer,
    row_number() OVER (
        PARTITION BY category -- rank inside the category
        ORDER BY -- more products = better position 
            nb_unique_products DESC
    ) as position,
    ROUND(
        100 * nb_unique_products / SUM(nb_unique_products) OVER (
            PARTITION BY category
        ),
        2
    ) as "% of the market"
FROM
    product_details
WHERE
    category IN (3, 4)
GROUP BY
    category,
    manufacturer 
QUALIFY 
    position <= 3 -- top 3 competitors
    OR manufacturer = 57 -- and me please
ORDER BY
    category,
    position
-- ┌──────────┬──────────────┬────────────────────┬──────────┬─────────────────┐
-- │ category │ manufacturer │ nb_unique_products │ position │ % of the market │
-- │  int64   │    int64     │       int64        │  int64   │     double      │
-- ├──────────┼──────────────┼────────────────────┼──────────┼─────────────────┤
-- │        3 │            2 │                196 │        1 │           18.23 │
-- │        3 │           39 │                185 │        2 │           17.21 │
-- │        3 │           41 │                124 │        3 │           11.53 │
-- │        3 │           57 │                 50 │        7 │            4.65 │< this is me
-- │        4 │           34 │                452 │        1 │           23.12 │
-- │        4 │            7 │                303 │        2 │            15.5 │
-- │        4 │            2 │                278 │        3 │           14.22 │
-- │        4 │           57 │                203 │        5 │           10.38 │< this is me
-- └──────────┴──────────────┴────────────────────┴──────────┴─────────────────┘

So, this allows you to see for example that you are 5th on category 4 and allows you to see who are the ones with the highest share on it.

Let’s deconstruct this query:

Filter on categories 3 and 4: WHERE
Get the number of unique products per manufacturer per category: GROUP BY
Rank each manufacturer on the category: row_number() over a window partitioned by category and ordered by the number of unique products we compute at step 1. We have to partition by category to not take both categories into account. So each manufacturer will have 2 positions: one on category 2 and another one on category 3
Just like step 3, compute the share of products for a manufacturer on a given category.
QUALIFY to only return top 3 manufacturers

4) Distributed products at multi-levels?

Imagine you want the number of your unique products:

Per retailer per category
Per retailer for all categories
Per category for all retailers
For all retailers for all categories

You can compute each one independently and union everything but there is a simpler way.

SELECT
    COALESCE(retailer, 0) AS retailer, -- convert nulls to a readable id
    COALESCE(category, 0) AS category, -- convert nulls to a readable id
    COUNT(DISTINCT p_id) AS nb_unique_products
FROM
    product_details
GROUP BY
    GROUPING SETS (
        -- per retailer per category
        (category, retailer),
        -- per category for all retailers
        (category),
        -- per retailer for all categories
        (retailer),
        -- for all categories for all retailers
        () 
    )
ORDER BY
    ALL
-- ┌──────────┬──────────┬────────────────────┐
-- │ retailer │ category │ nb_unique_products │
-- │  int64   │  int64   │       int64        │
-- ├──────────┼──────────┼────────────────────┤
-- │        0 │        0 │              17428 │
-- │        0 │        1 │                136 │
-- │        0 │        2 │               2105 │
-- │        0 │        3 │               1075 │
-- │        0 │        4 │               1955 │
-- │        · │        · │                 ·  │
-- │        · │        · │                 ·  │
-- │        · │        · │                 ·  │
-- │        2 │       11 │                346 │
-- ├──────────┴──────────┴────────────────────┤
-- │ x rows (xx shown)              3 columns │
-- └──────────────────────────────────────────┘

Grouping sets is equivalent to the union of different GROUP BYs.

The result will contain NULL for the dimension not included in the grouping set. So results for the group (category) will have retailer set as NULL.

If you are using all possible grouping sets, than you can simplify syntax by using CUBE:

SELECT
    COALESCE(retailer, 0) AS retailer,
    COALESCE(category, 0) AS category,
    COUNT(DISTINCT p_id) AS nb_unique_products
FROM
    product_details
GROUP BY
    -- This is equivalent to all possible grouping sets:
    -- (category, retailer), (category), (retailer), (category), ()
    CUBE(retailer, category)
ORDER BY
    ALL

ROLLUP is another grouping construct but it works differently than CUBE by restricting the possible grouping sets. ROLLUP(x, y) is defined as:

(x, y)
(x)
()

(y) would be included by CUBE but not by ROLLUP.

This is more useful in cases when there is a hierarchical relationship between the different dimensions as in:

SUM(population) GROUP BY ROLLUP(county, region, city)

In which case the grouping set (region) won’t make a lof of sense as a region is linked to one country and that would be equivalent to (country, region).

We have seen a query to look at the number of unique distributed products in categories 3 and 4.

Having more unique distributed products is not completely correlated with your share of assortment.

Imagine you have 15 unique frozen food products and a competitor who has only 3 unique frozen food products, and there is only you too on the market.

If at a given retailer (Walmart for e.g.), they always put 2 of you competitor’s products and 1 of your products. In this case you will only have 1/3=33% of SOA while your competitor will have 2/3=77% of the frozen food shelf. Although you detain a big share on the number of unique products 15/18=83%, you are being surpassed by your competitor.

Share of assortment can be defined as:

SOA = SUM(distributed days of my products) / SUM(distributed days of my products + competitors' products)

Here is a query to get top 3 competitors in categories 3 and 4 with respect to SOA:

WITH soa_per_category_per_manufacturer AS (
    SELECT
        category,
        manufacturer,
        SUM(distributed_days) as _total_ddays, -- total ddays per each manufacturer
        ROUND(
            100 * _total_ddays / SUM(_total_ddays) OVER (PARTITION BY category),
            2
        ) as soa -- this is exactly the SOA formula,
    FROM
        product_details
    WHERE
        category IN (3, 4)
    GROUP BY
        category,
        manufacturer
)
SELECT
    * EXCLUDE(_total_ddays), -- Thanks DuckDB for EXCLUDE :) 
    row_number() OVER (
        PARTITION BY category
        ORDER BY
            soa DESC
    ) as position -- order then by soa on the category
FROM
    soa_per_category_per_manufacturer 
QUALIFY 
    position <= 3 -- top 3 competitors
    OR manufacturer = 57 -- and me please

-- ┌──────────┬──────────────┬────────┬──────────┐
-- │ category │ manufacturer │  soa   │ position │
-- │  int64   │    int64     │ double │  int64   │
-- ├──────────┼──────────────┼────────┼──────────┤
-- │        3 │           39 │  22.12 │        1 │
-- │        3 │            2 │  16.04 │        2 │
-- │        3 │           67 │  14.36 │        3 │
-- │        3 │           57 │   3.95 │        8 │< I am no longer 7th on the market
-- │        4 │           34 │   24.6 │        1 │
-- │        4 │            7 │  17.27 │        2 │
-- │        4 │            2 │   10.9 │        3 │
-- │        4 │           57 │  10.36 │        5 │< Still 5th
-- └──────────┴──────────────┴────────┴──────────┘

This query uses CTEs which are a way to organize your queries to make them readable and in some cases avoid recomputing the same results over and over if materialized and used in many places (check Materialized CTEs).

6) Temporal evolution!

Imagine now you want to compare SOA on your categories between the selected week 2024_1 with respect to previous week 2023_52.

WITH soa_per_period_category_per_manufacturer AS (
    SELECT
        period,
        category,
        manufacturer,
        SUM(distributed_days) as _total_ddays,
        ROUND(
            100 * _total_ddays / SUM(_total_ddays) OVER (
                PARTITION BY 
                period, -- notice adding period here :)
                category
            ),
            2
        ) as soa,
    FROM
        product_details
    GROUP BY
        period -- and here too :),
        category,
        manufacturer
),
soa_variation AS (
    SELECT
        * EXCLUDE (_total_ddays),
        -- notice the use of a defined window to
        -- simplify the query especially if it's used in many places.
        soa - lag(soa, 1, 0) OVER period_window as variation_soa,
    FROM
        soa_per_period_category_per_manufacturer
        -- define the window and assign a name to it
        WINDOW period_window AS (
            PARTITION BY category,
            -- Notice that we need manufacturer here
            -- since the variation is by manufacturer by category
            manufacturer
            ORDER BY
                -- this is very important as variation is
                -- current soa - previous soa. The order
                -- of the values in the window should be like that
                -- to be able to use soa - lag(soa, 1, 0).
                -- notice also that '2023_52' < '2024_1'
                period 
        )
        -- take this week only after the window function
        -- as tha variation is linked to it
        QUALIFY period = '2024_1'
    ORDER BY
        period
)
SELECT
    -- thanks DuckDB for EXCLUDE/REPLACE :)
    * EXCLUDE (period) REPLACE (ROUND(variation_soa, 2) as variation_soa),
    row_number() OVER (
        PARTITION BY category
        ORDER BY
            variation_soa DESC
    ) as position -- order then by soa on the category
FROM
    -- take first from each category
    soa_variation QUALIFY position = 1
ORDER BY
    category
LIMIT
    10;

-- ┌──────────┬──────────────┬────────┬───────────────┬──────────┐
-- │ category │ manufacturer │  soa   │ variation_soa │ position │
-- │  int64   │    int64     │ double │    double     │  int64   │
-- ├──────────┼──────────────┼────────┼───────────────┼──────────┤
-- │        1 │           39 │  34.19 │          3.11 │        1 │< 39 is getting better
-- │        2 │           39 │  11.68 │          1.37 │        1 │< over multiple categories
-- │        3 │           39 │   24.8 │          5.64 │        1 │
-- │        4 │            6 │  11.84 │          2.41 │        1 │
-- │        5 │           39 │  40.07 │          6.52 │        1 │
-- │        6 │           39 │  42.08 │         10.42 │        1 │< That's a good improvement
-- │        7 │           34 │   6.29 │          3.65 │        1 │
-- │        8 │            6 │  51.29 │          0.86 │        1 │
-- │        9 │            3 │  52.69 │          9.33 │        1 │
-- │       10 │            3 │  42.24 │          6.81 │        1 │
-- ├──────────┴──────────────┴────────┴───────────────┴──────────┤
-- │ 10 rows                                           5 columns │
-- └─────────────────────────────────────────────────────────────┘

This query get’s the manufacturer with the best variation over each category. Notice the multiple CTEs to organize the query.

lag(n) is a window function to allow using the value of the previous n row inside the window.

6) Not all retailers are equal!

Imagine you have SOA by retailer, and you want an average but not just a normal average, because retailers don’t have the same importance (number of shops, zones, contracts, …) you would like to compute the weighted average over the different retailers.

Imagine you have these weights (you can find the data here):

select * from retailer_w;
-- ┌──────────┬────────┐
-- │ retailer │   w    │
-- │  int32   │ double │
-- ├──────────┼────────┤
-- │       2  │    1.0 │
-- │       1  │    3.0 │< 3 times more important than retailer 2
-- └──────────┴────────┘

Average SOA would be then:

avg_soa = SUM(retailer_i * w_i) / SUM(w_i) for all (retailer_i, w_i)

WITH soa_per_retailer_per_manufacturer AS (
    SELECT
        retailer,
        manufacturer,
        SUM(distributed_days) as _total_ddays,
        -- total ddays per each manufacturer
        ROUND(
            100 * _total_ddays / SUM(_total_ddays) OVER (PARTITION BY retailer),
            2
        ) as soa -- this is exactly the SOA formula,
    FROM
        product_details
    GROUP BY
        retailer,
        manufacturer
)
SELECT
    manufacturer,
    ROUND(
        -- we are grouping my manufacturer so we can just
        -- do this to have the weighted average
        SUM(soa * w) / SUM(w),
        2
    ) as avg_soa,
FROM
    soa_per_retailer_per_manufacturer
    JOIN retailer_w USING(retailer)
GROUP BY
    manufacturer
ORDER BY avg_soa DESC
LIMIT 5;
-- ┌──────────────┬─────────┐
-- │ manufacturer │ avg_soa │
-- │    int64     │ double  │
-- ├──────────────┼─────────┤
-- │            7 │   22.52 │
-- │           34 │   16.03 │
-- │            2 │    12.8 │
-- │           26 │    7.56 │
-- │           39 │    6.72 │
-- └──────────────┴─────────┘

The beauty of OLAP SQL

1) How many unique products distributed by manufacturer for both weeks over all retailers?

2) How are my products distributed over the different categories?

3) Since I am competing on categories 3 and 4, who am I competing with?

4) Distributed products at multi-levels?

6) Temporal evolution!

6) Not all retailers are equal!

That’s it. Hope you had fun reading this :)

Taki

1) How many unique products distributed by manufacturer for both weeks over all retailers?

2) How are my products distributed over the different categories?

3) Since I am competing on categories 3 and 4, who am I competing with?

4) Distributed products at multi-levels?

5) Since I am competing on categories 3 and 4, what’s my share of assortment (SOA)?

6) Temporal evolution!

6) Not all retailers are equal!

That’s it. Hope you had fun reading this :)

Taki