Publish schemas to Flow
Overview
Flow works by understanding where and how data is exposed across your organisation. Flow is technology-agnostic, and is designed for organisations with a wide variety of technology stacks. As a result, there are lots of different ways to make schemas available to Flow, each with its own set of pros and cons.
In this guide, we’ll explore the different techniques, so you can choose the right mix that works across your organisation.
We’ll also provide recommendations. These are a mixture of principles that we’ve kept in mind when designing Flow, and techniques we’ve seen customers implement. These aren’t hard rules - like any advice, it’s up to you to choose what works.
Publish a schema to Flow
Each system needs to decide how its schema information will be made available to Flow, and in what form.
In a large organisation, it’s typical to mix and match these approaches to fit what works for each team. Flow is designed to support multiple methods at the same time.
A schema needs to describe a few important elements:
The structural contract of data
What’s the shape of the data that the system is exposing (output) or expecting (input)? This includes field names, nested objects, and expected parameters for operations.
Most schema languages (e.g., OpenAPI / Protobuf / SQL) describe this really well.
// An example of a clear structural contract, taken from the protobuf docs
syntax = "proto3";
package tutorial;
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}
The semantic contract of data
What’s the meaning of each field that’s being exposed (output) or received (input)? This is really critical when mapping data between systems, as it’s how we ensure we’re passing the correct information into the correct field.
There’s much less support for semantic contracts in standard schema languages. As a result, it’s common for someone to build ad hoc maps in Word documents, wikis, or spreadsheets.
This is where languages like Taxi really shine, as they let you bake in the semantic contract as well as the structural contract.
// A contract that contains semantic data
model Person {
  name : PersonName inherits String
  id : PersonId inherits Int
  email : EmailAddress inherits String
}
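To illustrate why the semantic contract matters when mapping data between systems, here is a minimal Python sketch. It is not how Flow is implemented: the two schemas, the field names (`customer_id`, `contact_email`) and the `map_by_semantic_type` helper are all hypothetical. Each schema maps a field name to a semantic type, and a record is translated by matching types rather than field names.

```python
# Hypothetical schemas: field name -> semantic type.
# The type names echo the Taxi example above; the field names are made up.
CRM_SCHEMA = {"name": "PersonName", "id": "PersonId", "email": "EmailAddress"}
BILLING_SCHEMA = {"customer_id": "PersonId", "contact_email": "EmailAddress"}


def map_by_semantic_type(record, source_schema, target_schema):
    """Build a target record by matching semantic types, not field names."""
    # Index the source values by their semantic type.
    values_by_type = {
        source_schema[field]: value
        for field, value in record.items()
        if field in source_schema
    }
    # Fill each target field from the source value of the same semantic type.
    return {
        target_field: values_by_type[semantic_type]
        for target_field, semantic_type in target_schema.items()
        if semantic_type in values_by_type
    }


crm_record = {"name": "Jimmy", "id": 42, "email": "jimmy@example.com"}
billing_record = map_by_semantic_type(crm_record, CRM_SCHEMA, BILLING_SCHEMA)
print(billing_record)
```

Because the mapping is keyed on the semantic type, renaming `contact_email` in the target system only requires a change to that system's schema, not to any mapping document.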
Enrich structural contracts with semantics
While you can use Taxi to define your schemas outright, this isn’t common practice, given the prevalence of feature-rich, well-supported schema languages. Therefore, a good practice is to combine the two, using Taxi extensions inside existing schema languages.
// An example of enriching a structural contract with semantic metadata
syntax = "proto3";
import "taxilang/dataType.proto"; // Import the DataType extension
package tutorial;
message Person {
  string name = 1 [(dataType) = 'PersonName'];
  int32 id = 2 [(dataType) = 'PersonId'];
  string email = 3 [(dataType) = 'PersonEmail'];
}
Systems push schemas to Flow
The preferred way of exposing schema data to Flow is to have systems (or CI/CD tooling) publish the schema directly to Flow.
By making the system (and its team) responsible for publishing its own definition, the schema documentation lives as close as possible to the system itself, so it has the best chance of being up to date. The team that maintains the system can evolve any schema documentation along with the system itself.
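As a rough sketch of what a publishing step in a service or CI/CD job might look like, the Python below builds an HTTP request carrying a schema. Everything specific here is an assumption made for illustration: the URL, the JSON payload shape, and the `build_publish_request` helper are hypothetical and are not taken from Flow's API. Check Flow's own reference for the real publication mechanism.

```python
import json
import urllib.request

# Hypothetical endpoint: this URL is a placeholder, not a documented Flow API.
FLOW_URL = "http://flow.example.internal/api/schemas"


def build_publish_request(source_name, version, schema_text, flow_url=FLOW_URL):
    """Build (but do not send) an HTTP request that submits one schema."""
    # Hypothetical payload shape: who is publishing, which version, and the schema.
    payload = {"source": source_name, "version": version, "schema": schema_text}
    return urllib.request.Request(
        flow_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_publish_request(
    "person-service",
    "1.4.0",
    "model Person { name : PersonName inherits String }",
)
# A real job would now send it, e.g. with urllib.request.urlopen(request).
print(request.get_method(), request.full_url)
```

Keeping this step in the same repository and pipeline as the service means a schema change and its publication ship together.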
Automatically adapt to change
Flow automates the integration between services, by leveraging the metadata present in the schemas that are published to it.
As schemas change, Flow automatically adapts its integrations accordingly.
To make the most of this capability, it’s ideal to have systems publish their own schemas automatically. The greater the separation between a change happening in a system and the team responsible for updating its schema, the greater the chance of the schema being incorrect.
Of course, this is no different from manual integration without Flow - if documentation isn’t maintained, then integration becomes error-prone.
Generate schema definitions from code
Generating schemas directly from code is a great way of ensuring that schemas evolve with the code, as they’re generated at run-time.
As a schema language, Taxi has great support for generating schemas directly from Kotlin and Java, with support for other frameworks planned.
Using this approach, services generate their own schemas, and publish them directly to Flow on startup.
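The idea of deriving a schema from code at run-time can be sketched in a few lines. The Python below is only an illustration of the technique, not Taxi's Kotlin or Java generator: the `generate_schema` helper and the `BASE_TYPES` mapping are assumptions, and the output simply mirrors the inline Taxi form used earlier in this guide.

```python
from dataclasses import dataclass, fields
from typing import NewType

# Semantic types, declared in application code.
PersonName = NewType("PersonName", str)
PersonId = NewType("PersonId", int)
EmailAddress = NewType("EmailAddress", str)

# Assumed mapping from Python primitives to the base types in the Taxi example.
BASE_TYPES = {str: "String", int: "Int"}


@dataclass
class Person:
    name: PersonName
    id: PersonId
    email: EmailAddress


def generate_schema(model_class):
    """Render a dataclass as Taxi-style text, one line per field."""
    lines = [f"model {model_class.__name__} {{"]
    for field in fields(model_class):
        semantic_type = field.type
        # NewType records the wrapped primitive in __supertype__.
        base_type = BASE_TYPES[semantic_type.__supertype__]
        lines.append(f"  {field.name} : {semantic_type.__name__} inherits {base_type}")
    lines.append("}")
    return "\n".join(lines)


schema_text = generate_schema(Person)
print(schema_text)
```

Since the schema text is produced from the class itself, adding or renaming a field changes the generated schema the next time the service starts.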
Pros:
- Strong chance of schema staying up to date
- Schema is edited by the same domain experts who build the application
Cons:
- Requires the application to "have knowledge" of Taxi for code generation
- Requires the application to "have knowledge" of Flow for publication
Augment existing schemas with semantic metadata
Many applications and systems already publish schemas using a rich schema language, such as Swagger / OpenAPI, Protobuf, JsonSchema, etc.
Generally speaking, these schema languages only describe the structural contract of the data, but not the semantic contract.
Therefore, the ideal is to enhance existing schemas with this additional metadata.
The Taxi project has growing support for embedding semantic metadata inside existing schema languages.
Schema Format | Taxi Support |
---|---|
OpenAPI | Supported |
Swagger | Supported |
Protobuf | In development |
Avro | Planned |
JsonSchema | In development |
In these cases, a great solution is to simply enhance the existing schemas with additional metadata.
Pros:
- Strong chance of schema staying up to date
- Schema is edited by the same domain experts who build the application
- No knowledge of Taxi inside code
- Schema publication can be performed either at runtime, or in a CI/CD job
Cons:
- Not available for all schema languages
Flow polls systems for updates
Flow’s schema server can be configured to poll for schemas from a variety of back-end sources:
- File systems
- Git Repositories
- OpenAPI endpoints
- HTTP servers
This is a strong option for scenarios where systems can’t publish their own schemas (e.g., databases), or for data sources that are otherwise structureless (e.g., CSV files).
Additionally, using a git-backed repository for a shared glossary / taxonomy is a great way to allow decentralised authorship of the core set of glossary terms.
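For a feel of what polling a file-system source involves, here is a minimal Python sketch that detects changed schema files between two polling cycles by hashing their contents. It is not Flow's schema server: the `snapshot` and `changed_files` helpers are hypothetical, and matching on `*.taxi` files is an assumption made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(schema_dir, pattern="*.taxi"):
    """Return {relative file name: SHA-256 of contents} for every schema file."""
    root = Path(schema_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def changed_files(previous, current):
    """Names that were added, removed, or modified between two snapshots."""
    return sorted(
        name
        for name in previous.keys() | current.keys()
        if previous.get(name) != current.get(name)
    )


# Simulate two polling cycles against a temporary directory.
with tempfile.TemporaryDirectory() as schema_dir:
    schema_file = Path(schema_dir) / "person.taxi"
    schema_file.write_text("model Person { name : PersonName inherits String }")
    first_poll = snapshot(schema_dir)

    schema_file.write_text("model Person { id : PersonId inherits Int }")
    second_poll = snapshot(schema_dir)

changes = changed_files(first_poll, second_poll)
print(changes)
```

Comparing content hashes rather than modification times keeps the check stable when files are re-checked-out from git without their contents changing.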
Store schemas separately from systems
Sometimes it’s not possible to have systems publish their own schemas. There are a variety of reasons for this:
- Database schemas, which can’t automatically be pushed
- Legacy or external systems, which can’t be modified to publish their own schemas
- Schemaless content, such as CSV files
In these cases, it’s possible to store schemas in a git repository, and have Flow’s schema server poll the repository.
The disadvantage here is that it’s easy for the schema definition to drift from the actual schema as the system changes.
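One way to catch this kind of drift early is a scheduled check that compares the separately stored schema with the live system. The Python sketch below does this for a database table, using an in-memory SQLite database as a stand-in. The `find_drift` helper, the `person` table and the `DOCUMENTED_COLUMNS` mapping are hypothetical and are not part of Flow.

```python
import sqlite3

# Hypothetical hand-maintained schema: column name -> semantic type.
DOCUMENTED_COLUMNS = {"id": "PersonId", "name": "PersonName", "email": "EmailAddress"}


def find_drift(connection, table, documented_columns):
    """Compare a table's real columns with the documented ones."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    # The table name is interpolated directly, so only pass trusted names.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    actual = {row[1] for row in rows}
    documented = set(documented_columns)
    return {
        "undocumented": sorted(actual - documented),
        "missing": sorted(documented - actual),
    }


connection = sqlite3.connect(":memory:")
# The table has gained "phone" and lost "email" since the schema was written.
connection.execute("CREATE TABLE person (id INTEGER, name TEXT, phone TEXT)")
drift = find_drift(connection, "person", DOCUMENTED_COLUMNS)
connection.close()
print(drift)
```

Running a check like this in the schema repository's CI job turns silent drift into a failing build that someone has to resolve.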
Pros:
- Good fall-back option when no other options are available
- Requires no changes to publishing systems
Cons:
- Requires careful change planning to ensure schemas don’t get out of sync with applications
- Schemas are not necessarily maintained by the same team, which can lead to loss of domain knowledge