Behavioral Time Series Segmentation in ClickHouse
At Contentsquare, one of our key features is our segmentation engine. It allows our customers to filter their population through multiples criteria such as “I want all users that saw the page P, clicked on the button B, between date D1 and date D2, on mobile on Safari and located in France.”
Two ClickHouse functions were a real game changer, allowing us to introduce segmentation based on time series. It means we can now build filters based on a sequence of actions (for example: Went to the product page then clicked on the “add to cart” button then went to the checkout page) done by a user.
Disclaimer: The ClickHouse version used for this blog post is 21.3.13
Matching Funnel Pattern
Let’s say we want to identify users which follow a specific sequence of actions, like reaching specific pages or doing some specific events.
To do this, we will use windowFunnel
, an aggregated parametric function.
The function looks like this:
As you can see, there are several parameters:
- SizeOfSlidingWindow is the size of the sliding window used to match your event chain.
- TimeReferenceFields is a timestamp reference to compute. It can be a
DATE
,DATETIME
or anUInt64
field. Your field should carry a notion of “ordering” for your actions. It can be the date of your event, your page view order. It will be used by ClickHouse to understand therank
of a row into your dataset. - conditionX is a chain of conditions you want to match.
The result of this function is the maximum number of conditions matched during the sliding window. The window is triggered every time your first condition ($condition1
in this example) is encountered.
Let’s use it on some data to have a better understanding:
user_name (String) | view_number (UInt64) | path (String) |
---|---|---|
Alice | 1 | /homepage |
Alice | 2 | /product |
Alice | 3 | /checkout |
Alice | 4 | /homepage |
Bob | 1 | /homepage |
Where user_name
is the name of a user, view_number
represents the order a user went into a specific page (here our user first went to “/homepage”, then to “/product”, then to “payment” and then on “/homepage” again) and path
the URI of a web page.
In other words,
windowFunnel(2)(view_number, path='/homepage', path='/product')
means:
Given the chain of conditions “path equals /homepage” and “path equals /product”, I want to know how many conditions my users match in a maximum of two pages after matching my first condition”. Which, once used into an SQL query looks like this:
And the result is:
user_name | total |
---|---|
Bob | 1 |
Alice | 2 |
As you can see, Bob matches only one condition while Alice matches both of them during the required window. Actually, Alice triggered the sliding window twice as she matches our first condition (path=/homepage
) two times. For her first view and for the fourth.
user_name | view_number | path | |
---|---|---|---|
Alice | 1 | /homepage | ← Start of the first sliding window |
Alice | 2 | /product | |
Alice | 3 | /checkout | ← End of the first sliding window |
Alice | 4 | /homepage | ← Start of the second sliding window. |
The resulting value of the first window is 2 as Alice matches two conditions and the value of the second is one because she matches only our first condition. The function returns 2 because the result is the longest matching chain and so, the highest number.
Now, let’s see how we can use it for more complex use cases. We will use the following dataset:
user_name | view_number | path |
---|---|---|
Alice | 1 | /homepage |
Alice | 2 | /product |
Alice | 3 | /checkout |
John | 1 | /homepage |
John | 2 | /product |
John | 3 | /about |
Bob | 1 | /homepage |
Bob | 2 | /checkout |
Bob | 3 | /product |
June | 1 | /homepage |
June | 2 | /product |
June | 3 | /about |
June | 4 | /checkout |
Here we want to identify users that browsed a few items before going to “/checkout”. This means they reached “/homepage”, then several “/product” pages and then “/checkout”. We will arbitrarily define that we need a sliding window of 10, meaning that there will be a maximum of 8 pages between “/homepage” and “/checkout”.
We will be using the windowFunnel
function in a HAVING
clause because it is unfortunately not available in a WHERE
clause (as every aggregated function).
Why this = 3
at the end of our statement?
Remember, the function returns the number of matching conditions. If we want our users to match all conditions, we should set the result equal to the number of such conditions.
And the results are Alice (“/homepage” > “/product” > “/checkout”) and June (“homepage” > “/product” > “/about” > “/checkout”).
Note that June matches, even if she has an intermediate page (“/about”) that was not part of our conditions.
This is because the windowFunnel
function ignores non-matching conditions by default. It means we can have any number of non-matching events in our windows. The method only cares that the conditions we defined happen in order.
If we wanted to be stricter and exclude people that visited pages outside of our conditions, we could have done it either by:
- Reducing the window size. In our case, specifying
windowFunnel(2)
would exclude June from our analysis. - Specifying the
strict_order
for the functionwindowFunnel(9, ‘strict_order’)
. This forbids the insertion of others events. You can check the documentation to find the list of other options for this function.
The parameter tuning really depends of your use case and requirements. But there is another way if you want to do what we call “Directly Then” filtering.
Describing a sequence of actions
We were already pretty happy when we implemented such features. But our users were now asking for more complex segmentation use cases:
We want to be able to describe an exact list of steps mixing up pageviews and events happening on the same pageview.
Let’s increase complexity step by step and introduce the sequenceMatch
function:
This looks pretty familiar, right?
As for the windowFunnel
it requires a timestamp reference field and a list of conditions. The parameter is a pattern instead of the length of a sliding window.
A pattern looks like this: '(?1)(?2)'
where (?1)
is your first condition, (?2)
the second etc. Think about it as a placeholder for your conditions.
But we can also define a time frame between two events by using (?t operator value)
. t
is a magic value to specify the time frame between two events.
For example, (?1)(?t<=1)(?2)
means we want to have no more than 1 increment on our time reference (one second, one minute, one something, it is up to your field and schema design) between the condition 1 and the condition 2.
We even submitted a Pull Request to add support for the ==
operator (this will be available for everyone in version v21.7
) and another one to improve the performance.
Keeping our previous dataset, let’s see how it works:
Here we specify a list of conditions without any restriction regarding the time. It returns Alice and June as both match all our conditions.
For this one, we explicitly defined a maximum difference of one pageview between each step. Now, only Alice matches as June has an intermediary page between “/product” and “/checkout”.
This time, we defined a difference of at least 2 pages between our last conditions. Only June matches this pattern, thanks to “/about” happening between “/product” and “/checkout”.
Let’s increase the complexity of our dataset and requirements by introducing some events. Our dataset will look like this:
user_name | view_number | path | events.type | events.time |
---|---|---|---|---|
Alice | 1 | /homepage | [‘click’,‘hover’] | [1,2] |
Alice | 2 | /product | [‘click’] | [1] |
Alice | 3 | /checkout | [‘click’,‘hover’,‘add_to_cart’,‘pay’] | [1,2,3,4] |
Bob | 1 | /homepage | [‘hover’] | [1] |
Bob | 2 | /checkout | [‘add_to_cart’,‘pay’] | [1,2] |
June | 1 | /checkout | [‘click’,‘hover’] | [1,2] |
June | 2 | /checkout | [‘add_to_cart’,‘pay’] | [3,4] |
If we want to get the users which have the events click
, hover
, add_to_cart
, pay
in this exact order we can use the following query:
Nothing too fancy here. Only Alice and June match this criteria. Actually, the windowFunnel
could have done the trick.
But this syntax allows us more complex filtering:
We want this sequence of events happening on the same ‘/checkout’ page.
However, adding a condition based on the path does not help much here:
It still returns Alice and June. By flattening our data, we are losing granularity and we do not know which events belong to which pageview. It means we are not sure about our context. Let’s imagine a form. Our users can start filling two fields, leave our website and then come back on the same page later. They fill two others fields and our current query will think it is a continuous sequence of actions. This is not what our business user requested.
It is time to update our query to match this new requirement:
It works but we had to add another layer of subquery and another field to read and use into our GROUP BY
clause. It will quickly become overly complicated to use into another context and increase the maintenance cost.
But then, our users asked for a third requirement that made us think of a better solution:
We want to be able to filter users based not only on events happening on the same page but through their whole journey.
For example, users that looked at a product then clicked on “add to cart”, then went to the /checkout page and then clicked on “pay”.
To do this we need to find a way to create a time field allowing us to know that:
- Those events belongs to the same page.
- We are on a new page.
We will also need to qualify the reach of the page as an event.
To do so, we will change a bit our ARRAY JOIN
to compute a time based on the page number and add 0 as event type when a new page is reached:
We multiply each view_number
by an arbitrary number and we increment this value for each event belonging to the pageview. We had to be very careful to have this number greater than the maximum number of events. Otherwise our data may collapse and return wrong results.
We also prepended an events.type
of 0 before any page even to symbolise the “page view” event.
To make it easy to visualize, here is the result for our dataset:
user_name | view_number | artificial_time | event_type |
---|---|---|---|
Alice | 1 | 100000 | ’page_view’ |
Alice | 1 | 100001 | ’click’ |
Alice | 1 | 100002 | ’hover’ |
Alice | 2 | 200000 | ’page_view’ |
Alice | 2 | 200001 | ’click’ |
Alice | 3 | 300000 | ’page_view’ |
Alice | 3 | 300001 | ’click’ |
Alice | 3 | 300002 | ’hover’ |
Alice | 3 | 300003 | ’add_to_cart’ |
Alice | 3 | 300004 | ’pay’ |
Bob | 1 | 100000 | ’page_view’ |
Bob | 1 | 100001 | ’hover’ |
Bob | 2 | 200000 | ’page_view’ |
Bob | 2 | 200001 | ’add_to_cart’ |
Bob | 2 | 200002 | ’pay’ |
June | 1 | 100000 | ’page_view’ |
June | 1 | 100001 | ’click’ |
June | 1 | 100002 | ’hover’ |
June | 2 | 200000 | ’page_view’ |
June | 2 | 200001 | ’add_to_cart’ |
June | 2 | 200002 | ’pay’ |
We could now rewrite our previous query without all the subqueries and GROUP BY clauses by using the above tricks:
Here we used 100000 as our arbitrary multiplier and used it both in the range(view_number * 100000)
and sequenceMatch((?1)(?t<100000)(?2))
parts of the query. If we wanted to filter based on the first event happening on our page, we would use (?1)(?t<=1)(?2)
instead.
To do cross-page sequential behavior, we just have to play with the conditions and our temporal value:
The result of this query is June.
Conclusion
As you can see, ClickHouse is an amazing database with some powerful functions for time series. There is so much more to explore, we are only scratching the surface here. Now, it is up to you to use those time series functions on your own databases and solve new challenges!
I would like to thank the whole Contentsquare R&D which worked very hard to put those use cases into production and specifically Ryad Zenine, Pawel Gontarz, Yunbo Wang and Christophe Kalenzaga for their help, support and patience on this challenging project.