Behavioral Time Series Segmentation in ClickHouse

At Contentsquare, one of our key features is our segmentation engine. It allows our customers to filter their population through multiples criteria such as "I want all users that saw the page P, clicked on the button B, between date D1 and date D2, on mobile on Safari and located in France."

Two ClickHouse functions were a real game changer, allowing us to introduce segmentation based on time series. It means we can now build filters based on a sequence of actions (for example: Went to the product page then clicked on the "add to cart" button then went to the checkout page) done by a user.

Disclaimer: The ClickHouse version used for this blog post is 21.3.13

Matching Funnel Pattern

Let’s say we want to identify users which follow a specific sequence of actions, like reaching specific pages or doing some specific events. To do this, we will use windowFunnel, an aggregated parametric function.

The function looks like this:

windowFunnel($SizeOfTheSlidingWindow)($TimeStampReferenceField, $condition1, $condition2, ...')

As you can see, there are several parameters:

  • SizeOfTheSlidingWindow is the size of the sliding window used to match your event chain.
  • TimeStampReferenceFields is a timestamp reference to compute. It can be a DATE, DATETIME or an UInt64 field. Your field should carry a notion of "ordering" for your actions. It can be the date of your event, your page view order. It will be used by ClickHouse to understand the rank of a row into your dataset.
  • conditionX is a chain of conditions you want to match.

The result of this function is the maximum number of conditions matched during the sliding window. The window is triggered every time your first condition ($condition1 in this example) is encountered.

Let's use it on some data to have a better understanding:

user_name (String)view_number (UInt64)path (String)
Alice1/homepage
Alice2/product
Alice3/checkout
Alice4/homepage
Bob1/homepage

Where user_name is the name of a user, view_number represents the order a user went into a specific page (here our user first went to "/homepage", then to "/product", then to "payment" and then on "/homepage" again) and path the URI of a web page.

In other words, windowFunnel(2)(view_number, path='/homepage', path='/product') means:

Given the chain of conditions "path equals /homepage" and "path equals /product", I want to know how many conditions my users match in a maximum of two pages after matching my first condition".

Which, once used into an SQL query looks like this:

SELECT
    user_name,
    windowFunnel(2)(view_number, path = '/homepage', path = '/product') as total
FROM shop
GROUP BY user_name

And the result is:

user_nametotal
Bob1
Alice2

As you can see, Bob matches only one condition while Alice matches both of them during the required window. Actually, Alice triggered the sliding window twice as she matches our first condition (path=/homepage) two times. For her first view and for the fourth.

user_nameview_numberpath
Alice1/homepage← Start of the first sliding window
Alice2/product
Alice3/checkout← End of the first sliding window
Alice4/homepage← Start of the second sliding window.

The resulting value of the first window is 2 as Alice matches two conditions and the value of the second is one because she matches only our first condition. The function returns 2 because the result is the longest matching chain and so, the highest number.

Now, let's see how we can use it for more complex use cases. We will use the following dataset:

user_nameview_numberpath
Alice1/homepage
Alice2/product
Alice3/checkout
John1/homepage
John2/product
John3/about
Bob1/homepage
Bob2/checkout
Bob3/product
June1/homepage
June2/product
June3/about
June4/checkout

Here we want to identify users that browsed a few items before going to "/checkout". This means they reached "/homepage", then several "/product" pages and then "/checkout". We will arbitrarily define that we need a sliding window of 10, meaning that there will be a maximum of 8 pages between "/homepage" and "/checkout".

We will be using the windowFunnel function in a HAVING clause because it is unfortunately not available in a WHERE clause (as every aggregated function).

SELECT user_name 
FROM shop 
GROUP BY user_name 
HAVING windowFunnel(9)(view_number, path='/homepage', path='/product', path='/checkout') = 3

Why this = 3 at the end of our statement? Remember, the function returns the number of matching conditions. If we want our users to match all conditions, we should set the result equal to the number of such conditions.

And the results are Alice ("/homepage" > "/product" > "/checkout") and June ("homepage" > "/product" > "/about" > "/checkout").

Note that June matches, even if she has an intermediate page ("/about") that was not part of our conditions.

This is because the windowFunnel function ignores non-matching conditions by default. It means we can have any number of non-matching events in our windows. The method only cares that the conditions we defined happen in order.

If we wanted to be stricter and exclude people that visited pages outside of our conditions, we could have done it either by:

  • Reducing the window size. In our case, specifying windowFunnel(2) would exclude June from our analysis.
  • Specifying the strict_order for the function windowFunnel(9, ‘strict_order’). This forbids the insertion of others events. You can check the documentation to find the list of other options for this function.

The parameter tuning really depends of your use case and requirements. But there is another way if you want to do what we call "Directly Then" filtering.

Describing a sequence of actions

We were already pretty happy when we implemented such features. But our users were now asking for more complex segmentation use cases:

We want to be able to describe an exact list of steps mixing up pageviews and events happening on the same pageview.

Let’s increase complexity step by step and introduce the sequenceMatch function:

sequenceMatch('$Pattern')($TimeStampReferenceField, $condition1, $condition2)

This looks pretty familiar, right? As for the windowFunnel it requires a timestamp reference field and a list of conditions. The parameter is a pattern instead of the length of a sliding window. A pattern looks like this: '(?1)(?2)' where (?1) is your first condition, (?2) the second etc. Think about it as a placeholder for your conditions.

But we can also define a time frame between two events by using (?t operator value). t is a magic value to specify the time frame between two events. For example, (?1)(?t<=1)(?2) means we want to have no more than 1 increment on our time reference (one second, one minute, one something, it is up to your field and schema design) between the condition 1 and the condition 2.

We even submitted a Pull Request to add support for the == operator (this will be available for everyone in version v21.7) and another one to improve the performance.

Keeping our previous dataset, let's see how it works:

SELECT user_name
FROM shop
GROUP BY user_name
HAVING sequenceMatch('(?1)(?2)(?3)')(view_number, path = '/homepage', path = '/product', path = '/checkout')

Here we specify a list of conditions without any restriction regarding the time. It returns Alice and June as both match all our conditions.

SELECT user_name
FROM shop
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<=1)(?2)(?t<=1)(?3)')(view_number, path = '/homepage', path = '/product', path = '/checkout')

For this one, we explicitly defined a maximum difference of one pageview between each step. Now, only Alice matches as June has an intermediary page between "/product" and "/checkout".

SELECT user_name
FROM shop
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<=1)(?2)(?t>=2)(?3)')(view_number, path = '/homepage', path = '/product', path = '/checkout')

This time, we defined a difference of at least 2 pages between our last conditions. Only June matches this pattern, thanks to "/about" happening between "/product" and "/checkout".

Let’s increase the complexity of our dataset and requirements by introducing some events. Our dataset will look like this:

user_nameview_numberpathevents.typeevents.time
Alice1/homepage['click','hover'][1,2]
Alice2/product['click'][1]
Alice3/checkout['click','hover','add_to_cart','pay'][1,2,3,4]
Bob1/homepage['hover'][1]
Bob2/checkout['add_to_cart','pay'][1,2]
June1/checkout['click','hover'][1,2]
June2/checkout['add_to_cart','pay'][3,4]

If we want to get the users which have the events click, hover, add_to_cart, pay in this exact order we can use the following query:

SELECT user_name
FROM 
(
    SELECT 
        user_name, 
        event_type = 'click' AS cond1, 
        event_type = 'hover' AS cond2, 
        event_type = 'add_to_cart' AS cond3, 
        event_type = 'pay' AS cond4, 
        event_time
    FROM shop
    ARRAY JOIN 
        events.type AS event_type, 
        events.time AS event_time
)
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<=1)(?2)(?t<=1)(?3)(?t<=3)(?4)')(event_time, cond1, cond2, cond3, cond4)

Nothing too fancy here. Only Alice and June match this criteria. Actually, the windowFunnel could have done the trick.

But this syntax allows us more complex filtering:

We want this sequence of events happening on the same '/checkout' page.

However, adding a condition based on the path does not help much here:

SELECT user_name
FROM 
(
    WITH 
        (path = '/checkout') AND (event_type = 'click') AS cond1, 
        (path = '/checkout') AND (event_type = 'hover') AS cond2, 
        (path = '/checkout') AND (event_type = 'add_to_cart') AS cond3, 
        (path = '/checkout') AND (event_type = 'pay') AS cond4
    SELECT 
        user_name, 
        cond1, 
        cond2, 
        cond3, 
        cond4, 
        event_time
    FROM shop
    ARRAY JOIN 
        events.type AS event_type, 
        events.time AS event_time
)
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<=1)(?2)(?t<=1)(?3)(?t<=3)(?4)')(event_time, cond1, cond2, cond3, cond4)

It still returns Alice and June. By flattening our data, we are losing granularity and we do not know which events belong to which pageview. It means we are not sure about our context. Let's imagine a form. Our users can start filling two fields, leave our website and then come back on the same page later. They fill two others fields and our current query will think it is a continuous sequence of actions. This is not what our business user requested.

It is time to update our query to match this new requirement:

SELECT user_name
FROM 
(
    SELECT 
        user_name, 
        view_number, 
        sequenceMatch('(?1)(?t<=1)(?2)(?t<=1)(?3)(?t<=3)(?4)')(event_time, cond1, cond2, cond3, cond4) AS goal
    FROM 
    (
        WITH 
            (path = '/checkout') AND (event_type = 'click') AS cond1, 
            (path = '/checkout') AND (event_type = 'hover') AS cond2, 
            (path = '/checkout') AND (event_type = 'add_to_cart') AS cond3, 
            (path = '/checkout') AND (event_type = 'pay') AS cond4
        SELECT 
            user_name, 
            cond1, 
            cond2, 
            cond3, 
            cond4, 
            event_time, 
            view_number
        FROM shop
        ARRAY JOIN 
            events.type AS event_type, 
            events.time AS event_time
    )
    GROUP BY 
        user_name, 
        view_number
    HAVING goal = 1
)

It works but we had to add another layer of subquery and another field to read and use into our GROUP BY clause. It will quickly become overly complicated to use into another context and increase the maintenance cost.

But then, our users asked for a third requirement that made us think of a better solution:

We want to be able to filter users based not only on events happening on the same page but through their whole journey.

For example users that looked at a product then clicked on "add to cart", then went to the /checkout page and then clicked on "pay".

To do this we need to find a way to create a time field allowing us to know that:

  • Those events belongs to the same page.
  • We are on a new page.

We will also need to qualify the reach of the page as an event.

To do so, we will change a bit our ARRAY JOIN to compute a time based on the page number and add 0 as event type when a new page is reached:

    ARRAY JOIN 
        range(view_number * 100000, ((view_number * 100000) + length(events.type)) + 1, 1) AS artificial_time, 
        arrayConcat(['page_view'], events.type) AS event_type

We multiply each view_number by an arbitrary number and we increment this value for each event belonging to the pageview. We had to be very careful to have this number greater than the maximum number of events. Otherwise our data may collapse and return wrong results. We also prepended an events.type of 0 before any page even to symbolise the "page view" event.

To make it easy to visualize, here is the result for our dataset:

user_nameview_numberartificial_timeevent_type
Alice1100000'page_view'
Alice1100001'click'
Alice1100002'hover'
Alice2200000'page_view'
Alice2200001'click'
Alice3300000'page_view'
Alice3300001'click'
Alice3300002'hover'
Alice3300003'add_to_cart'
Alice3300004'pay'
Bob1100000'page_view'
Bob1100001'hover'
Bob2200000'page_view'
Bob2200001'add_to_cart'
Bob2200002'pay'
June1100000'page_view'
June1100001'click'
June1100002'hover'
June2200000'page_view'
June2200001'add_to_cart'
June2200002'pay'

We could now rewrite our previous query without all the subqueries and GROUP BY clauses by using the above tricks:

SELECT user_name
FROM 
(
    WITH 
        (path = '/checkout') AND (event_type = 'page_view') AS cond0, 
        (path = '/checkout') AND (event_type = 'click') AS cond1, 
        (path = '/checkout') AND (event_type = 'hover') AS cond2, 
        (path = '/checkout') AND (event_type = 'add_to_cart') AS cond3, 
        (path = '/checkout') AND (event_type = 'pay') AS cond4
    SELECT 
        user_name, 
        cond0, 
        cond1, 
        cond2, 
        cond3, 
        cond4, 
        artificial_time
    FROM shop
    ARRAY JOIN 
        range(view_number * 100000, ((view_number * 100000) + length(events.type)) + 1, 1) AS artificial_time, 
        arrayConcat(['page_view'], events.type) AS event_type
)
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<=100000)(?2)(?t<=1)(?3)(?t<=1)(?4)(?t<=1)(?5)')(artificial_time, cond0, cond1, cond2, cond3, cond4)

Here we used 100000 as our arbitrary multiplier and used it both in the range(view_number * 100000) and sequenceMatch((?1)(?t<100000)(?2)) parts of the query. If we wanted to filter based on the first event happening on our page, we would use (?1)(?t<=1)(?2) instead.

To do cross-page sequential behavior, we just have to play with the conditions and our temporal value:

SELECT user_name
FROM 
(
    WITH 
        (path = '/checkout') AND (event_type = 'page_view') AS cond0, 
        (path = '/checkout') AND (event_type = 'click') AS cond1, 
        (path = '/checkout') AND (event_type = 'hover') AS cond2, 
        (path = '/checkout') AND (event_type = 'add_to_cart') AS cond3, 
        (path = '/checkout') AND (event_type = 'pay') AS cond4
    SELECT 
        user_name, 
        cond0, 
        cond1, 
        cond2, 
        cond3, 
        cond4, 
        artificial_time
    FROM shop
    ARRAY JOIN 
        range(view_number * 100000, ((view_number * 100000) + length(events.type)) + 1, 1) AS artificial_time, 
        arrayConcat(['page_view'], events.type) AS event_type
)
GROUP BY user_name
HAVING sequenceMatch('(?1)(?t<100000)(?2)(?t<=1)(?3)(?t<100000)(?4)(?t<=1)(?5)(?t<=1)(?6)')(artificial_time, cond0, cond1, cond2, cond0, cond3, cond4)

The result of this query is June.

Conclusion

As you can see, ClickHouse is an amazing database with some powerful functions for time series. There is so much more to explore, we are only scratching the surface here. Now, it is up to you to use those time series functions on your own databases and solve new challenges!

I would like to thank the whole Contentsquare R&D which worked very hard to put those use cases into production and specifically Ryad Zenine, Pawel Gontarz, Yunbo Wang and Christophe Kalenzaga for their help, support and patience on this challenging project.