Data modeling concepts

Cube borrows a lot of terminology from data science and OLAP theory, and this document is intended for both newcomers and regular users to refresh their understanding.

We'll use a sample e-commerce database with two tables, orders and line_items to illustrate the concepts throughout this page:

orders

id	status	completed_at	created_at
1	completed	2019-01-05 00:00:00.000000	2019-01-02 00:00:00.000000
2	shipped	2019-01-17 00:00:00.000000	2019-01-02 00:00:00.000000
3	completed	2019-01-27 00:00:00.000000	2019-01-02 00:00:00.000000
4	shipped	2019-01-09 00:00:00.000000	2019-01-02 00:00:00.000000
5	processing	2019-01-29 00:00:00.000000	2019-01-02 00:00:00.000000

line_items

id	product_id	order_id	quantity	price	created_at
1	31	1	1	275	2019-01-31 00:00:00.000000
2	49	2	6	248	2021-01-20 00:00:00.000000
3	89	3	6	197	2021-11-25 00:00:00.000000
4	71	4	8	223	2019-12-23 00:00:00.000000
5	64	5	5	75	2019-04-20 00:00:00.000000

Cubes

A cube represents a dataset in Cube, and is conceptually similar to a view in SQL. Cubes are typically declared in separate files with one cube per file. Within each cube are definitions of dimensions and measures. Typically, a cube points to a single table in your database using the sql_table property:

YAML

cubes:
  - name: orders
    sql_table: orders

You can also use the sql property to accommodate more complex SQL queries:

YAML

cubes:
  - name: orders
    sql: >
      SELECT *
      FROM orders, line_items
      WHERE orders.id = line_items.order_id

Dimensions

Dimensions represent the properties of a single data point in the cube. The orders table contains only dimensions, so representing them in the orders cube is straightforward:

YAML

cubes:
  - name: orders
    # ...

    dimensions:
      - name: id
        sql: id
        type: number
        # Here we explicitly let Cube know this field is the primary key
        # This is required for de-duplicating results when using joins
        primary_key: true

      - name: status
        sql: status
        type: string

The line_items table also has a couple of dimensions which can be represented as follows:

YAML

cubes:
  - name: line_items
    # ...

    dimensions:
      - name: id
        sql: id
        type: number
        # Again, we explicitly let Cube know this field is the primary key
        # This is required for de-duplicating results when using joins
        primary_key: true

      - name: order_id
        sql: order_id
        type: number

Dimensions can be of different types, and you can find them all here.

Time Dimensions

Time-based properties should be represented as dimensions with type time. Time dimensions allow grouping the result set by a unit of time (e.g. hours, days, weeks). In analytics, this is also known as "granularity".

We can add the necessary time dimensions to both data models as follows:

YAML

cubes:
  - name: orders
    # ...

    dimensions:
      - name: created_at
        sql: created_at
        type: time

      - name: completed_at
        sql: completed_at
        type: time

YAML

cubes:
  - name: line_items
    # ...

    dimensions:
      - name: created_at
        sql: created_at
        type: time

Time dimensions are essential to enabling performance boosts such as partitioned pre-aggregations and incremental refreshes.

Measures

Measures represent the properties of a set of data points in the cube. To add a measure called count to our orders cube, for example, we can do the following:

YAML

cubes:
  - name: orders
    # ...

    measures:
      - name: count
        type: count

In our LineItems cube, we can also create a measure to sum up the total value of line items sold:

YAML

cubes:
  - name: line_items
    # ...

    measures:
      - name: total
        sql: price
        type: sum

Measures can be of different types, and you can find them all here.

Joins

Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube, all joins are LEFT JOINs.

An INNER JOIN can be replicated with Cube; when making a Cube query, add a filter for IS NOT NULL on the required column.

In the following example, we are left-joining the line_items cube onto our orders cube:

YAML

cubes:
  - name: orders
    # ...

    joins:
      - name: line_items
        # Here we use the `CUBE` global to refer to the current cube,
        # so the following is equivalent to `orders.id = line_items.order_id`
        sql: "{CUBE}.id = {line_items.order_id}"
        relationship: many_to_one

There are the three types of join relationships: one_to_one, one_to_many, and many_to_one.

Segments

Segments are filters that are predefined in the data model instead of a Cube query. They allow simplifying Cube queries and make it easy to re-use common filters across a variety of queries.

To add a segment which limits results to completed orders, we can do the following:

YAML

cubes:
  - name: orders
    # ...

    segments:
      - name: only_completed
        sql: "{CUBE}.status = 'completed'"

Pre-aggregations

Pre-aggregations are a powerful way of caching frequently-used, expensive queries and keeping the cache up-to-date on a periodic basis. Within a data schema, they are defined under the pre_aggregations property:

YAML

cubes:
  - name: orders
    # ...

    pre_aggregations:
      - name: main
        measures:
          - count
        dimensions:
          - status
        time_dimension: created_at
        granularity: day

A more thorough introduction can be found in Getting Started with Pre-Aggregations.

Did you find this page useful?