Data modelling
We explain dimensional modelling in Contiamo via an example.
Contents
Data modelling
1. Introduction
When you create a new custom data source, the data in your CSV files may need to be transformed before being uploaded. Contiamo uses a dimensional data model that ensures optimal functionality and performance. We illustrate this with an example.
2. Example
Let's consider a retail operation with sales split across two stores (A and B) and two categories (electronics and jewelry). The typical sales report might be organized as follows:
Date | Store A Electronics | Store A Jewelry | Store A Total | Store B Electronics | Store B Jewelry | Store B Total | Total Sales |
---|---|---|---|---|---|---|---|
2015-01 | 75,000 | 35,000 | 110,000 | 50,000 | 38,000 | 88,000 | 198,000 |
2015-02 | 69,000 | 12,000 | 81,000 | 42,000 | 20,000 | 62,000 | 143,000 |
2015-03 | 73,000 | 18,000 | 91,000 | 56,000 | 20,000 | 76,000 | 167,000 |
... | ... | ... | ... | ... | ... | ... | ... |
This format is not adapted to the way Contiamo works. In order to make the most of Contiamo, the data should be formatted as follows:
Date | Sales | Store | Category |
---|---|---|---|
2015-01 | 75,000 | A | Electronics |
2015-01 | 35,000 | A | Jewelry |
2015-01 | 50,000 | B | Electronics |
2015-01 | 38,000 | B | Jewelry |
2015-02 | 69,000 | A | Electronics |
... | ... | ... | ... |
2.1. Metrics vs. dimensions
In this example, there is one metric: sales. It measures the currency amount of sales across the company. The metric is split along dimensions: stores and categories.
Let us upload the tables above to two data contracts called 'Right' and 'Wrong'. Here are the resulting metrics:
With the right approach, only one metric, sales, is created (ignore the greyed out 'count' fields). With the wrong approach, many metrics are created, which would quickly become unmanageable as the number of stores and categories increased, or new dimensions were added.
In other words, with the wrong approach, the dimensionality of the data manifests itself in the multiplication of metrics. With the right approach, the dimensionality is simply embodied in the data's dimensions:
We now show how the right approach to modelling makes it much easier to perform two basic tasks: drilling and filtering.
3. Drilling down
Let us plot sales by store. With the right approach, we select the Sales metric, then add Store as a dimension:
With the wrong approach, we would select the two metrics Store A total and Store B total:
The result is obviously the same. So far so good. But what if we wanted to see sales by category? With the right approach, we remove the Store dimension and replace it with the Category dimension:
With the wrong approach, we would have to manually select all the metrics for all categories, and then create an ad-hoc subtotal calculation for each category. With more stores and categories, this approach would quickly lead to death by a thousand clicks.
4. Filtering
Let us plot total sales restricted to electronics. With the right approach, we can simply add the Category dimension as a filter:
With the wrong approach, we would have to select the metrics measuring electronics sales for every single store, and add them all in a calculation. With dozens of stores and categories, this would quickly become impossible.
5. Do not fear redundancy
We have seen that by modelling the dimensionality of the data as dimensions in Contiamo, tasks such as drilling into the data or filtering it flow naturally in Contiamo. However, one natural worry about this approach is data redundancy.
Let us consider another example:
Date | Televisions | Computers | Electronics Total | Rings | Necklaces | Jewelery Total | Total Sales |
---|---|---|---|---|---|---|---|
2015-01 | 75,000 | 35,000 | 110,000 | 50,000 | 38,000 | 88,000 | 198,000 |
2015-02 | 69,000 | 12,000 | 81,000 | 42,000 | 20,000 | 62,000 | 143,000 |
2015-03 | 73,000 | 18,000 | 91,000 | 56,000 | 20,000 | 76,000 | 167,000 |
... | ... | ... | ... | ... | ... | ... | ... |
The right way to model the data in Contiamo is to upload it as follows:
Date | Sales | Subcategory | Category |
---|---|---|---|
2015-01 | 75,000 | Televisions | Electronics |
2015-01 | 35,000 | Computers | Electronics |
2015-01 | 50,000 | Rings | Jewelry |
2015-01 | 38,000 | Necklaces | Jewelry |
2015-02 | 69,000 | Televisions | Electronics |
... | ... | ... | ... |
Here, the word 'Electronics' could be repeated thousands of times, if not millions for item-level data. Moreover, since the 'Televisions' subcategory is always part of 'Electronics', specifying 'Electronics' each time may seem unnecessary.
This is nonetheless the right approach, as it unlocks all the functionality of the query engine in Contiamo while ensuring fast query performance, all for a minuscule cost in data storage.
6. How to transform the data
We include an example Python script using the Pandas package to transform the data.
import pandas as pd
# Read CSV file
df = pd.read_csv('sales_example.csv')
# Drop redundant columns
df.drop(['Store A Total', 'Store B Total', 'Total Sales'], axis=1, inplace=True)
# Melt the data
dfm = pd.melt(df, id_vars=['Date'], var_name='Dimensions', value_name='Sales')
# Split the new column to extract store and category information
labels = dfm['Dimensions'].str.split(expand=True)
# Add store and category information to dataframe
dfm['Store'] = labels[1]
dfm['Category'] = labels[2]
# Remove intermediate column
dfm.drop(['Dimensions'], axis=1, inplace=True)
# Write CSV file
dfm.to_csv('sales_formatted.csv', index=False)
If you have data that you are not sure how to format for upload, get in touch with us: support@contiamo.com.