Structural versus semantic integration
Overview
Flow is a data integration and querying platform, built on top of the Taxi schema language.
The idea behind Taxi (and semantic types in general) is to allow producers of data to define both the structural contract (the labels assigned to fields and shapes of objects), and the semantic contract (what each field actually means).
A schema plays multiple roles:
-
Labels define how information is addressed within a message - e.g.,
firstName
,lastName
-
Structure defines how a message will be laid out - what’s nested where. Sometimes data is flat (e.g., tables or CSV files), at other times there’s a nested structure (such as JSON)
-
Semantics define the contract of what information is in place at a certain location
-
Encoding defines how data is written to the wire. (Not all schemas cover this)
Most schema languages focus entirely on addressing labelling and structure, but offer little to model the semantics of a contract. And, when you’re trying to consume data, the semantics are the most important thing.
Producers of data need to be careful about labels, structure and semantics of the data they emit. For consumers, the only thing that really matters is the content. Labelling and structure are just addresses to access the information that consumers need. |
Structure-first schemas
Most data schemas (and subsequent integration techniques) focus on describing and connecting data based on the structure of information.
Here are some examples of linking data based on structure, shown in a couple of different languages:
// Server-side contracts that the producers of data have defined:
interface Actor {
name: string
id : number
agentId: number
}
interface Agent {
name: string
id : number
}
// As a consumer - load an actor, then fetch it's agent using the agentId
function loadAgentsAndActors() {
const actor = await loadActor();
// Here, we're also coupled to another API call
const agent = await loadAgent(actor.agentId)
const castMember = {
actorName : actor.name,
agentName : agent.name
}
}
In this example, we wanted to consume data with two attributes: an actor’s name, and
their agent’s name. We also wanted to apply labels that were meaningful to us as consumers (actorName
and agentName
) that
differed from the way producers originally labelled the data (simply - name
).
The only way we could build this was by becoming tightly coupled to the producer’s contract.
That’s a problem because tight coupling makes it hard for producers to evolve their APIs, and brittle for consumers when they do.
Integration based on structure is brittle - changes by producers have expensive knock-on effects to consumers. |
Semantic integration
Taxi and Flow provide a different style of schema and integration, by allowing schemas to define their semantic contract as well.
Semantic types describe the meaning of content in a field, rather than just how it’s written out by computers.
Field Name | System Type | Semantic Type |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
The column on the left - Field name
- describes how a producer has named the field in one specific system.
The column in the middle - System Type
- describes how a system might read or write a field.
The column on the right - Semantic Type
- describes the actual meaning of the text in a business sense.
By defining a semantic contract, we introduce a way for producers and consumers to discuss data that’s not coupled to structure.
Semantic contracts provide a shared language for producers and consumers to discuss data that’s not coupled to structure, so it’s not as brittle when structural contracts change. |
Semantic contracts allow you to request data based on its meaning, irrespective of the shape producers are emitting it in.
Let’s look at our example query again, this time run through TaxiQL, a semantic query language:
given { ActorId = 1 }
find {
actorName : ActorName
agentName : AgentName
}
It’s important to call out a few details:
There’s no FROM
There’s no FROM clause here - nothing that specifies which producer, and where, the data should come from. That’s left to the query layer to determine, which means that consumers remain protected when producers inevitably change.
There’s no JOIN
Likewise, we haven’t specified how to navigate infrastructure and data contracts to describe how to link two ideas (Actor and Agent) together. In fact, for all we know, this could be coming from one table, or some complex lookup linking several entities together.
As a consumer, understanding the specifics of how data has been structured is an implementation detail we shouldn’t be concerned with.
Field names are defined by the consumer
The producing systems named these fields something else. The consumer has been able to define the shape of the data they need, and left it to the querying layer to solve. There’s enough semantic metadata present for the query layer to understand what data is needed.
This is possible because we’ve separated the structural concept from the semantic concept. Field names are now entirely up to the consumer to decide - the way it should be.
Semantic querying is centered around what data the consumer needs, and how the consumer needs it. Producer concepts are abstracted away. |
Enhance structural contracts with semantic metadata
It’s not a question of structure OR semantics, or choosing one schema language to rule them all. Taxi is a great schema language, but it’s not intended as a replacement for your existing schemas.
Modern data architectures are polyglot, with a mix of APIs, databases, streaming sources, and flat files. The key to making semantic schemas work across this ecosystem is to enhance existing schema structure with semantic metadata.
Taxi supports this, with a series of extensions.
For example, in Swagger / OpenAPI:
// Partial OpenAPI example:
paths:
/pets:
get:
description: |
Returns all pets from the system that the user has access to.
operationId: findPets
parameters:
- name: tags
in: query
description: Tags to filter by.
required: false
style: form
schema:
type: array
items:
# ------Below is a Taxi extension for semantic types-----------
x-taxi-type:
name: petstore.Tag
# ----------End the extension----------------------------------
type: string
Or in Protobuf:
// An example of enriching a structural contract with semantic metadata
syntax = "proto3";
import "taxilang/dataType.proto"; // Import the DataType extension
package tutorial;
message Person {
string name = 1 [(dataType='PersonName')];
int32 id = 2 [(dataType='PersonId')];
string email = 3 [(dataType='PersonEmail')];
}