Refreshing Select Partitions
We have a dataset with orders and we want to aggregate data while having decent performance. Orders have a creation time, so we can use partitioning by time to optimize pre-aggregations build and refresh time. The problem is that the order's status can change after a long period. In this case, we want to rebuild only partitions associated with this order.
In the recipe below, we'll learn how to use the
refresh_key
together with the
FITER_PARAMS for
partition separately.
Let's explore the orders cube data that contains various information about
orders, including number and status:
| id | number | status | created_at | updated_at |
|---|---|---|---|---|
| 1 | 1 | processing | 2021-08-10 14:26:40 | 2021-08-10 14:26:40 |
| 2 | 2 | completed | 2021-08-20 13:21:38 | 2021-08-22 13:10:38 |
| 3 | 3 | shipped | 2021-09-01 10:27:38 | 2021-09-02 01:12:38 |
| 4 | 4 | completed | 2021-09-20 10:27:38 | 2021-09-20 10:27:38 |
In our case, each order has created_at and updated_at properties. The
updated_at property is the last order update timestamp. To create a
pre-aggregation with partitions, we need to specify the
partition_granularity property.
Partitions will be split monthly by the created_at dimension.
cubes:
- name: orders
# ...
pre_aggregations:
- name: orders
type: rollup
dimensions:
- number
- status
- created_at
- updated_at
time_dimension: created_at
granularity: day
partition_granularity: month # this is where we specify the partition
refreshKey:
sql: SELECT max(updated_at) FROM public.orders # check for updates of the updated_at propertyAs you can see, we defined custom a
refresh_key
that will check for new values of the updated_at property. The refresh key is
evaluated for each partition separately. For example, if we update orders from
August and update their updated_at property, the current refresh key will
update for all partitions. There is how it looks in the Cube logs:
Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
SELECT max(updated_at) FROM public.orders
--
Performing query completed: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6 (15ms)
Performing query: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
Performing query: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
select min(("orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "orders"
--
Executing SQL: 5b4c517f-b496-4c69-9503-f8cd2b4c73b6
--
select max(("orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "orders"
--Note that the query for two partitions is the same. It's the reason why all partitions will be updated.
How do we fix this and update only the partition for August? We can use the
FILTER_PARAMS for
that!
Let's update our pre-aggregation definition:
cubes:
- name: orders
# ...
pre_aggregations:
- name: orders
type: rollup
dimensions:
- number
- status
- created_at
- updated_at
time_dimension: created_at
granularity: day
partition_granularity: month # this is where we specify the partition
refreshKey:
sql: >
SELECT max(updated_at)
FROM public.orders
WHERE {FILTER_PARAMS.orders.created_at.filter('created_at')}Cube will filter data by the created_at property and then apply the refresh
key for the updated_at property. Here's how it looks in the Cube logs:
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
select min(("updated_orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "updated_orders"
--
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
select max(("updated_orders".created_at::timestamptz AT TIME ZONE 'UTC')) from public.orders AS "updated_orders"
--
Performing query completed: e1155b2f-859b-4e61-a760-17af891f5f0b (10ms)
Performing query completed: e1155b2f-859b-4e61-a760-17af891f5f0b (13ms)
Performing query: e1155b2f-859b-4e61-a760-17af891f5f0b
Performing query: e1155b2f-859b-4e61-a760-17af891f5f0b
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
SELECT max(updated_at) FROM public.orders WHERE created_at >= '2021-08-01T00:00:00.000Z'::timestamptz AND created_at <= '2021-08-31T23:59:59.999Z'::timestamptz
--
Executing SQL: e1155b2f-859b-4e61-a760-17af891f5f0b
--
SELECT max(updated_at) FROM public.orders WHERE created_at >= '2021-09-01T00:00:00.000Z'::timestamptz AND created_at <= '2021-09-30T23:59:59.999Z'::timestamptzNote that Cube checks the refresh key value using a date range over the
created_at property. With this refresh key, only one partition will be
updated.
We have received orders from two partitions of a pre-aggregation and only one of them has been updated when an order changed its status:
// orders before update:
[
{
"orders.number": "1",
"orders.status": "processing",
"orders.created_at": "2021-08-10T14:26:40.000",
"orders.updated_at": "2021-08-10T14:26:40.000"
},
{
"orders.number": "2",
"orders.status": "completed",
"orders.created_at": "2021-08-20T13:21:38.000",
"orders.updated_at": "2021-08-20T13:21:38.000"
},
{
"orders.number": "3",
"orders.status": "shipped",
"orders.created_at": "2021-09-01T10:27:38.000",
"orders.updated_at": "2021-09-01T10:27:38.000"
},
{
"orders.number": "4",
"orders.status": "completed",
"orders.created_at": "2021-09-20T10:27:38.000",
"orders.updated_at": "2021-09-20T10:27:38.000"
}
]// orders after update:
[
{
"orders.number": "1",
"orders.status": "shipped",
"orders.created_at": "2021-08-10T14:26:40.000",
"orders.updated_at": "2021-09-30T06:45:28.000"
},
{
"orders.number": "2",
"orders.status": "completed",
"orders.created_at": "2021-08-20T13:21:38.000",
"orders.updated_at": "2021-08-20T13:21:38.000"
},
{
"orders.number": "3",
"orders.status": "shipped",
"orders.created_at": "2021-09-01T10:27:38.000",
"orders.updated_at": "2021-09-01T10:27:38.000"
},
{
"orders.number": "4",
"orders.status": "completed",
"orders.created_at": "2021-09-20T10:27:38.000",
"orders.updated_at": "2021-09-20T10:27:38.000"
}
]Please feel free to check out the
full source code
or run it with the docker-compose up command. You'll see the result, including
queried data, in the console.
Did you find this page useful?