¬ę Back to help center

Data modelling

We explain dimensional modelling in Contiamo via an example.

Contents

Data modelling

1. Introduction

When you create a new custom data source, the data in your CSV files may need to be transformed before being uploaded. Contiamo uses a dimensional data model that ensures optimal functionality and performance. We illustrate this with an example.

2. Example

Let's consider a retail operation with sales split across two stores (A and B) and two categories (electronics and jewelry). The typical sales report might be organized as follows:

Date Store A Electronics Store A Jewelry Store A Total Store B Electronics Store B Jewelry Store B Total Total Sales
2015-01 75,000 35,000 110,000 50,000 38,000 88,000 198,000
2015-02 69,000 12,000 81,000 42,000 20,000 62,000 143,000
2015-03 73,000 18,000 91,000 56,000 20,000 76,000 167,000
... ... ... ... ... ... ... ...

This format is not adapted to the way Contiamo works. In order to make the most of Contiamo, the data should be formatted as follows:

Date Sales Store Category
2015-01 75,000 A Electronics
2015-01 35,000 A Jewelry
2015-01 50,000 B Electronics
2015-01 38,000 B Jewelry
2015-02 69,000 A Electronics
... ... ... ...

2.1. Metrics vs. dimensions

In this example, there is one metric: sales. It measures the currency amount of sales across the company. The metric is split along dimensions: stores and categories.

Let us upload the tables above to two data contracts called 'Right' and 'Wrong'. Here are the resulting metrics:

data_modelling_metrics

With the right approach, only one metric, sales, is created (ignore the greyed out 'count' fields). With the wrong approach, many metrics are created, which would quickly become unmanageable as the number of stores and categories increased, or new dimensions were added.

In other words, with the wrong approach, the dimensionality of the data manifests itself in the multiplication of metrics. With the right approach, the dimensionality is simply embodied in the data's dimensions:

data_modelling_right_dimensions

We now show how the right approach to modelling makes it much easier to perform two basic tasks: drilling and filtering.

3. Drilling down

Let us plot sales by store. With the right approach, we select the Sales metric, then add Store as a dimension:

data_modelling_right_by_store

With the wrong approach, we would select the two metrics Store A total and Store B total:

data_modelling_wrong_by_store

The result is obviously the same. So far so good. But what if we wanted to see sales by category? With the right approach, we remove the Store dimension and replace it with the Category dimension:

data_modelling_right_by_category

With the wrong approach, we would have to manually select all the metrics for all categories, and then create an ad-hoc subtotal calculation for each category. With more stores and categories, this approach would quickly lead to death by a thousand clicks.

data_modelling_wrong_by_category

4. Filtering

Let us plot total sales restricted to electronics. With the right approach, we can simply add the Category dimension as a filter:

data_modelling_right_filter

With the wrong approach, we would have to select the metrics measuring electronics sales for every single store, and add them all in a calculation. With dozens of stores and categories, this would quickly become impossible.

5. Do not fear redundancy

We have seen that by modelling the dimensionality of the data as dimensions in Contiamo, tasks such as drilling into the data or filtering it flow naturally in Contiamo. However, one natural worry about this approach is data redundancy.

Let us consider another example:

Date Televisions Computers Electronics Total Rings Necklaces Jewelery Total Total Sales
2015-01 75,000 35,000 110,000 50,000 38,000 88,000 198,000
2015-02 69,000 12,000 81,000 42,000 20,000 62,000 143,000
2015-03 73,000 18,000 91,000 56,000 20,000 76,000 167,000
... ... ... ... ... ... ... ...

The right way to model the data in Contiamo is to upload it as follows:

Date Sales Subcategory Category
2015-01 75,000 Televisions Electronics
2015-01 35,000 Computers Electronics
2015-01 50,000 Rings Jewelry
2015-01 38,000 Necklaces Jewelry
2015-02 69,000 Televisions Electronics
... ... ... ...

Here, the word 'Electronics' could be repeated thousands of times, if not millions for item-level data. Moreover, since the 'Televisions' subcategory is always part of 'Electronics', specifying 'Electronics' each time may seem unnecessary.

This is nonetheless the right approach, as it unlocks all the functionality of the query engine in Contiamo while ensuring fast query performance, all for a minuscule cost in data storage.

6. How to transform the data

We include an example Python script using the Pandas package to transform the data.

import pandas as pd

# Read CSV file
df = pd.read_csv('sales_example.csv')

# Drop redundant columns
df.drop(['Store A Total', 'Store B Total', 'Total Sales'], axis=1, inplace=True)

# Melt the data
dfm = pd.melt(df, id_vars=['Date'], var_name='Dimensions', value_name='Sales')

# Split the new column to extract store and category information
labels = dfm['Dimensions'].str.split(expand=True)

# Add store and category information to dataframe
dfm['Store'] = labels[1]
dfm['Category'] = labels[2]

# Remove intermediate column
dfm.drop(['Dimensions'], axis=1, inplace=True)

# Write CSV file
dfm.to_csv('sales_formatted.csv', index=False)

If you have data that you are not sure how to format for upload, get in touch with us: support@contiamo.com.