Data modeling concepts
Cube borrows a lot of terminology from data science and OLAP theory, and this document is intended for both newcomers and regular users to refresh their understanding.
We'll use a sample e-commerce database with two tables, orders and
line_items to illustrate the concepts throughout this page:
orders
| id | status | completed_at | created_at |
|---|---|---|---|
| 1 | completed | 2019-01-05 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
| 2 | shipped | 2019-01-17 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
| 3 | completed | 2019-01-27 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
| 4 | shipped | 2019-01-09 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
| 5 | processing | 2019-01-29 00:00:00.000000 | 2019-01-02 00:00:00.000000 |
line_items
| id | product_id | order_id | quantity | price | created_at |
|---|---|---|---|---|---|
| 1 | 31 | 1 | 1 | 275 | 2019-01-31 00:00:00.000000 |
| 2 | 49 | 2 | 6 | 248 | 2021-01-20 00:00:00.000000 |
| 3 | 89 | 3 | 6 | 197 | 2021-11-25 00:00:00.000000 |
| 4 | 71 | 4 | 8 | 223 | 2019-12-23 00:00:00.000000 |
| 5 | 64 | 5 | 5 | 75 | 2019-04-20 00:00:00.000000 |
A cube represents a dataset in Cube, and is conceptually similar to a view in
SQL. Cubes are typically declared in separate files with one
cube per file. Within each cube are definitions of dimensions
and measures. Typically, a cube points to a single table in
your database using the sql_table property:
cubes:
- name: orders
sql_table: ordersYou can also use the sql property to accommodate more complex SQL queries:
cubes:
- name: orders
sql: >
SELECT *
FROM orders, line_items
WHERE orders.id = line_items.order_idDimensions represent the properties of a single data point in the cube.
The orders table contains only dimensions, so representing them in the
orders cube is straightforward:
cubes:
- name: orders
# ...
dimensions:
- name: id
sql: id
type: number
# Here we explicitly let Cube know this field is the primary key
# This is required for de-duplicating results when using joins
primary_key: true
- name: status
sql: status
type: stringThe line_items table also has a couple of dimensions which can be
represented as follows:
cubes:
- name: line_items
# ...
dimensions:
- name: id
sql: id
type: number
# Again, we explicitly let Cube know this field is the primary key
# This is required for de-duplicating results when using joins
primary_key: true
- name: order_id
sql: order_id
type: numberDimensions can be of different types, and you can find them all here.
Time-based properties should be represented as dimensions with type time. Time
dimensions allow grouping the result set by a unit of time (e.g. hours, days,
weeks). In analytics, this is also known as "granularity".
We can add the necessary time dimensions to both data models as follows:
cubes:
- name: orders
# ...
dimensions:
- name: created_at
sql: created_at
type: time
- name: completed_at
sql: completed_at
type: timecubes:
- name: line_items
# ...
dimensions:
- name: created_at
sql: created_at
type: timeTime dimensions are essential to enabling performance boosts such as partitioned pre-aggregations and incremental refreshes.
Measures represent the properties of a set of data points in the cube. To
add a measure called count to our orders cube, for example, we can do the
following:
cubes:
- name: orders
# ...
measures:
- name: count
type: countIn our LineItems cube, we can also create a measure to sum up the total value
of line items sold:
cubes:
- name: line_items
# ...
measures:
- name: total
sql: price
type: sumMeasures can be of different types, and you can find them all here.
Joins define the relationships between cubes, which then allows accessing and
comparing properties from two or more cubes at the same time. In Cube, all joins
are LEFT JOINs.
An INNER JOIN can be replicated with Cube; when making a Cube query, add a
filter for IS NOT NULL on the required column.
In the following example, we are left-joining the line_items cube onto our
orders cube:
cubes:
- name: orders
# ...
joins:
- name: line_items
# Here we use the `CUBE` global to refer to the current cube,
# so the following is equivalent to `orders.id = line_items.order_id`
sql: "{CUBE}.id = {line_items.order_id}"
relationship: many_to_oneThere are the three types of join relationships:
one_to_one, one_to_many, and many_to_one.
Segments are filters that are predefined in the data model instead of a Cube query. They allow simplifying Cube queries and make it easy to re-use common filters across a variety of queries.
To add a segment which limits results to completed orders, we can do the following:
cubes:
- name: orders
# ...
segments:
- name: only_completed
sql: "{CUBE}.status = 'completed'"Pre-aggregations are a powerful way of caching frequently-used, expensive
queries and keeping the cache up-to-date on a periodic basis. Within a data
schema, they are defined under the pre_aggregations property:
cubes:
- name: orders
# ...
pre_aggregations:
- name: main
measures:
- count
dimensions:
- status
time_dimension: created_at
granularity: dayA more thorough introduction can be found in Getting Started with Pre-Aggregations.
Did you find this page useful?