Writing Schemas

Schemas are data models written in the Pantheon Schema Language (PSL).

This article will show you how to use PSL to define schemas. Schemas allow you to logically integrate and run transformations on your data.

Schema definitions consist of two main parts:

  • Schema tables, or views of tables and their columns that are mapped from underlying physical data sources. They make up the physical mapping of the schema definition.

  • Measures and dimensions, which are derived from schema tables and schema columns. They make up the logical model that sits on top of the physical mapping within the schema definition.

The two parts combined within a schema form the abstraction layer that sits on top of your physical data sources, unifies them, and creates a single point for accessing and managing all your data.

Schema tables and schema columns are views of physical tables and columns in physical data sources. To define schema tables, you first need to understand the structure of the underlying physical data source(s) you will reference.

To view a complete listing of available data sources and their constituent physical tables and columns, click on Schemas > Inventory in the upper right-hand corner of the Schema Editor UI.

Schemas are written in PSL with only a handful of keywords and parameters. Let us write a schema definition.

Starting a Schema Definition

Click on Edit in a schema to get started.

Every schema starts with the schema keyword and the name of the schema. Schemas can refer to columns and tables in more than one data source, although the dataSource parameter can be used to define a default data source.

For example, our schema is named "simple-sales" and based on the "foodmart" data source.

schema simple-sales(dataSource = "foodmart") {
}

After defining your schema, start by writing the physical mapping.

Physical Mapping

The physical mapping defines schema tables and schema columns that represent a view on top of physical tables and physical columns. These schema tables and columns can be 1:1 representations of the underlying physical tables and columns, a filtered subset of the data, or views defined after running one or more data transformations on the column or table levels.

Transformations in Data Hub are performed via standardized SQL syntax that works with any data source regardless of its type. Transformations on the column level require SQL snippets, while transformations on the table level require complete SQL statements.

Once defined the schema tables and columns can be queried directly using the SQL query type.

Schema Table

table defines a schema table derived from a physical table. The below example names the schema table "date" derived from the physical table "time_by_day":

table date(physicalTable = "TIME_BY_DAY") {
}

If the name of your schema table is the same as the name of the physical table, you can omit the physicalTable parameter. For example:

table store {
}

where the schema table "store" has the same name as the physical table "store".

Note, however, that PSL is case-sensitive, so make sure you define the physical table name if there is a difference in case.

When defining schema tables, you can create new views of the underlying physical data by performing data transformations. To perform data transformations on a schema table, use the parameter sql followed by a complete SQL statement written in standard syntax. For example, say we wish to transform two physical tables with a UNION ALL in order to define a new schema table:

table sales(sql = """SELECT * FROM "Test"."SALES_FACT_1998" WHERE PROMOTION_ID = 0""") {

Above we define the schema table "sales" by transforming it to include only sales without a promotion. Note the "Test" prefix here which specifies the name of the data source.

Schema Column

column defines a schema column within a schema table. Schema columns are nested under schema tables. Here we define the schema column "time_id" that derives from a physical column with the same name, so we omit the physical column name.

table date(physicalTable = "TIME_BY_DAY") {
column time_id(expression = "TIME_ID")
}

In case you wish to give your schema column a name different from the name of the physical column, use the parameter expression to reference the physical column name. Below we rename the physical column "table.city" to the schema column "table.location."

table customer(physicalTable = "CUSTOMER") {
column location(expression = "CITY")
}

Note that the schema table "customer" omits the physicalTable parameter as the name of the physical table is the same.

In addition to renaming, the parameter expression allows you perform other types of data transformations on schema columns by using standard SQL snippets. For example:

table customer {
column location(expression = "CITY")
column birth_year(expression = "EXTRACT(year from BIRTHDATE)")
}

where we use the SQL function EXTRACT to transform the physical column "birthdate" to retrieve only the "year" portion of the value into the resulting computed schema column "birth_year".

To map all physical columns from a physical table to a schema table, use columns *, as below:

table store {
columns *
}

You can likewise define a new schema column by joining it with another schema column from a different schema table. The physical tables that the two schema tables derive from must both contain physical columns that can be joined. To join two schema columns, use the tableRef parameter. For example:

column time_id(expression = "TIME_ID", tableRef = "date.time_id")

where we join the schema column "date.time_id" (that we defined above) with the new schema column "sales.time_id". Note that schema columns can have the same name when they are nested under different schema tables.

Let us put the above schema table and schema column definitions together:

table sales(sql = """SELECT * FROM "Test"."SALES_FACT_1998" WHERE PROMOTION_ID = 0""") {
column time_id (expression = "TIME_ID", tableRef = "date.time_id")
}
table customer(physicalTable = "CUSTOMER") {
column location(expression = "CITY")
column birth_year(expression = "EXTRACT(year from BIRTHDATE)")
}
table store(physicalTable = "STORE) {
columns *
}
table date(physicalTable = "TIME_BY_DAY") {
column time_id(expression = "TIME_ID")
}

Finally, let us add a few extra columns:

table sales(sql = """SELECT * FROM "Test"."SALES_FACT_1998" WHERE PROMOTION_ID = 0""") {
column time_id(expression = "TIME_ID", tableRef = "date.time_id")
column customer_id(expression = "CUSTOMER_ID", tableRef = "customer.customer_id")
column store_id(expression = "STORE_ID", tableRef = "store.STORE_ID")
column store_sales(expression = "STORE_SALES")
column store_cost(expression = "STORE_COST")
column unit_sales(expression = "UNIT_SALES")
}
table customer(physicalTable = "CUSTOMER") {
column customer_id(expression = "CUSTOMER_ID")
column location(expression = "CITY")
column fullname(expression = "FULLNAME")
column birth_year(expression = "EXTRACT(year from BIRTHDATE)")
}
table store(physicalTable = "STORE") {
columns *
}
table date(physicalTable = "TIME_BY_DAY") {
column time_id(expression = "TIME_ID")
column year(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 0 FOR 5)")
column month(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 6 FOR 2)")
column day(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 10 FOR 2)")
}

Note that above we have added a number of computed columns with SQL snippets that follow the expression parameter.

At this stage, our schema definition enables you to run SQL queries against the data source referenced, in this case the "foodmart" data source. To extend the usability of schema definitions and allow for additional types of queries, we will now add dimensions and measures to the schema.

Logical Model

Schema tables allow you to combine data sources, perform transformations on data by creating abstracted views of physical data, and execute SQL queries. The logical model that comes on top consists of dimensions and measures that allow you to aggregate data and add multi-dimensionality to your schema. By extension, the logical model within your schema allows you to perform additional query types, namely aggregate and record queries.

Measures are aggregates of schema columns that quantify your data and return numerical values. Dimensions are qualitative and provide context necessary to understand the meaning of measures. Dimensions are derived from schema tables and represent an additional way to categorize your data.

Dimensions

Dimensions are defined by the dimension keyword that names the dimension and references the schema table from which the dimension is derived. Dimensions contain dimension attributes, which we add to queries to break down measures and provide additional meaning. Attributes are nested under dimensions. For example:

dimension Date (table = "date") {
attribute Year (column = "year")
attribute Month (column = "month")
attribute Day (column = "day")
}

where we define the dimension "Date" based on the schema table "foodmart.date" with the attributes "Year," "Month," and "Day."

Dimensions always require a schema table reference. Attributes may omit the schema column reference if the attribute name is the same as the schema column name, e.g.:

dimension Customer(table = "customer") {
attribute fullName(column = "fullname")
attribute location
attribute birthYear(column = "birth_year")
}

where the attribute "Customer.location" corresponds to the schema column "customer.location" with the same name.

Measures

The keyword measure names measures and references the schema columns they aggregate. For example:

measure cost(column = "sales.store_cost")
measure unitSales(column = "sales.unit_sales")

where the measures "cost" and "unitSales" correspond to the schema columns "sales.store_cost" and "sales.unit_sales," respectively.

Measures aggregate schema via the aggregate parameter and functions such as sum, avg, count. If the aggregate parameter is omitted, the default aggregation function is sum. For example:

measure revenue(column = "sales.store_sales", aggregate = "sum")

Additionally, measures can apply filtering or arithmetic calculations to other previously defined measures. (For a complete list of functions applicable to measures, see PSL Reference > Measures)

Measures that reference other measures and apply filtering by using the filter parameter:

measure revenueForSeattle(measure = "revenue", filter = "Store.city = 'Seattle'")

where the measure "revenueForSeattle" is a view of the previously defined measure "revenue" filtered by the dimensions "Store.city" and its value "Seattle."

Measures that calculate values using the calculation parameter:

measure costPerUnit(calculation = "cost / unitSales")

where we calculate the measure "costPerUnit" by dividing the values of the measure "cost" by the measure "unitSales."

Let us put together all the dimensions and measures we have defined so far:

dimension Date (table = "date") {
attribute Year (column = "year")
attribute Month (column = "month")
attribute Day (column = "day")
}
dimension Customer(table = "customer") {
attribute fullName(column = "fullname")
attribute location
attribute birthYear(column = "birth_year")
}
measure cost(column = "sales.store_cost")
measure unitSales(column = "sales.unit_sales")
measure revenue(column = "sales.store_sales", aggregate = "sum")
measure revenueForSeattle(measure = "revenue", filter= "Store.city = 'Seattle'")
measure costPerUnit(calculation = "cost / unitSales")

Finally, let us add another dimension with a few more attributes:

dimension Date (table = "date") {
attribute Year (column = "year")
attribute Month (column = "month")
attribute Day (column = "day")
}
dimension Store(table = "store") {
attribute city(column = "STORE_CITY")
attribute country(column = "STORE_COUNTRY")
attribute name(column = "STORE_NAME")
attribute type(column = "STORE_TYPE")
}
dimension Customer(table = "customer") {
attribute fullName(column = "fullname")
attribute location
attribute birthYear(column = "birth_year")
}
measure cost(column = "sales.store_cost")
measure unitSales(column = "sales.unit_sales")
measure revenue(column = "sales.store_sales", aggregate = "sum")
measure revenueForSeattle(measure = "revenue", filter= "Store.city = 'Seattle'")
measure costPerUnit(calculation = "cost / unitSales")

Sample Schema

Let us now put it all together. Here we have our first complete schema definition.

schema simple-sales(dataSource = "Test") {
dimension Date (table = "date") {
attribute Year (column = "year")
attribute Month (column = "month")
attribute Day (column = "day")
}
dimension Store(table = "store") {
attribute city(column = "STORE_CITY")
attribute country(column = "STORE_COUNTRY")
attribute name(column = "STORE_NAME")
attribute type(column = "STORE_TYPE")
}
dimension Customer(table = "customer") {
attribute fullName(column = "fullname")
attribute location
attribute birthYear(column = "birth_year")
}
measure cost(column = "sales.store_cost")
measure unitSales(column = "sales.unit_sales")
measure revenue(column = "sales.store_sales", aggregate = "sum")
measure revenueForSeattle(measure = "revenue", filter= "Store.city = 'Seattle'")
measure costPerUnit(calculation = "cost / unitSales")
table sales(sql = """SELECT * FROM "Test"."SALES_FACT_1998" WHERE PROMOTION_ID = 0""") {
column time_id(expression = "TIME_ID", tableRef = "date.time_id")
column customer_id(expression = "CUSTOMER_ID", tableRef = "customer.customer_id")
column store_id(expression = "STORE_ID", tableRef = "store.STORE_ID")
column store_sales(expression = "STORE_SALES")
column store_cost(expression = "STORE_COST")
column unit_sales(expression = "UNIT_SALES")
}
table customer(physicalTable = "CUSTOMER") {
column customer_id(expression = "CUSTOMER_ID")
column location(expression = "CITY")
column fullname(expression = "FULLNAME")
column birth_year(expression = "EXTRACT(year from BIRTHDATE)")
}
table store(physicalTable = "STORE") {
columns *
}
table date(physicalTable = "TIME_BY_DAY") {
column time_id(expression = "TIME_ID")
column year(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 0 FOR 5)")
column month(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 6 FOR 2)")
column day(expression = "SUBSTRING(CAST(THE_DATE AS varchar) FROM 10 FOR 2)")
}
}

Combining Multiple Data Sources

The schema we defined above uses one data source and combines multiple tables. PSL allows you to combine as many data sources as you wish, on the schema or schema table level.

To define schema tables derived from physical tables in other available data sources, use the dataSource parameter after the table definition, e.g:

table date(dataSource = "Source2", physicalTable = "date")

where we reference the physical table "date" from the data source "Source2" and define the schema table "date."

As a reminder, you can do the same on the level of the schema itself:

schema simple-sales(dataSource = "foodmart") {
}

Nesting Schemas

Schemas can import other schemas. We call these nested schemas.

Import

To import a schema within another schema, use the import keyword in combination with the name of a previously defined schema:

import sales
schema simple-sales(dataSource = "foodmart") {
}

Above we import the schema "Sales" into our schema "simple-sales." Everything that lives inside a schema is by default available for querying. Note, however, that imports live outside a schema.

Expose

To allow querying of dimensions, measures, and schema tables from an imported schema, use the expose keyword.

import sales
schema simple-sales {dataSource = "foodmart"
// Logical model
dimension { ...
attribute
}
measure ( ... )
// Expose dimension
expose dimension sales.Store
// Expose measure
expose measure sales.sales
// Expose schema table
expose table sales.events
// Expose table with alias to avoid name conflict
expose table sales.customer_id as sales-customer_id
// Physical mapping
table { ...
column ( ... )
}
}

Hierarchies and Levels

Schema dimensions may also define related sets of hierarchies and levels, in addition to attributes. Dimensions can accommodate different structures from multiple hierarchies of levels all the way down to a simple set of attributes. A simple set of attributes represents data in a non-hierarchical format when all attributes are considered at the same level.

Below is another sample schema that includes levels, again based on the "foodmart" data source.

schema sales (dataSource = "foodmart") {
dimension Date (table = "time_by_day") {
level year (column = "the_year")
level month (column = "the_month")
level date (column = "the_date")
}
dimension Store(table = "store") {
level region(column = "region_id", table = "region") {
attribute city(column = "sales_city")
attribute province(column = "sales_state_province")
attribute region(column = "sales_region")
attribute country(column = "sales_country")
}
level store(column = "store_id") {
attribute name(column = "store_name")
attribute type(column = "store_type")
}
}
dimension Customer(table = "customer") {
level country
level province(column = "state_province")
level city
level customer(column = "customer_id") {
attribute firstName(column = "fname")
attribute lastName(column = "lname")
attribute birthDate(column = "birthdate")
attribute birthYear
attribute gender
}
}
measure sales(column = "sales_fact_1998.store_sales")
measure cost(column = "sales_fact_1998.store_cost")
measure unitSales(column = "sales_fact_1998.unit_sales")
table customer {
column birthYear(expression = "year(birthdate)")
}
table store {
column region_id (tableRef = "region.region_id")
}
table sales_fact_1998 {
column time_id (tableRef = "time_by_day.time_id")
column customer_id (tableRef = "customer.customer_id")
column store_id (tableRef = "store.store_id")
column store_sales
column store_cost
column unit_sales
}
}

Within the above schema "sales" we defined three dimensions:

  • Date with levels:

    • year

    • month

    • date

  • Store with levels:

    • region

    • store

  • Customer with levels:

    • country

    • province

    • customer

Some of the levels also define attributes within the levels. All attributes defined on the same level have an equal place in the dimension hierarchy, i.e. they require the same level of measure aggregation.

A level itself implicitly defines an attribute with the same name. For example, you can query both "Store.region" and "Store.region.city".

A dimension can span multiple schema tables, as in the case of the dimension "Store" in the schema above.