Publish schemas to Flow

Overview

Flow works by understanding where and how data is exposed across your organisation. Flow is technology-agnostic, and is designed for organisations with a wide variety of technology stacks. As a result, there are lots of different ways to make schemas available to Flow, each with its own set of pros and cons.

In this guide, we’ll explore the different techniques, so you can choose the right mix that works across your organisation.

We’ll also provide recommendations. These are a mixture of principals that we’ve kept in mind when designing Flow, and techniques we’ve seen customers implement. These aren’t hard rules - like any advice, it’s up to you to choose what works.

Publish a schema to Flow

Each system needs to decide how its schema information will be made available to Flow, and in what form.

In a large organisation, it’s typical to mix and match these approaches, to fit in what works for each team. Flow is designed to support multiple different methods at the same time.

A schema needs to describe a few important elements:

The structural contract of data

What’s the shape of the data that the system is exposing (output) or expecting (input)? This includes field names, nested objects, and expected parameters for operations.

Most schema languages (eg., OpenAPI / Protobuf / SQL) describe this really well.

// An example of a clear structural contract, taken from the protobuf docs
syntax = "proto3";
package tutorial;

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

The semantic contract of data

What’s the meaning of each field that’s being exposed (output) or received (input)? This is really critical when mapping data between systems, as it’s how we ensure we’re passing the correct information into the correct field.

There’s much less support for semantic contracts in standard schema languages. As a result it’s common for someone to build adhoc maps in Word Documents, Wikis, or Spreadsheets.

This is where languages like Taxi really shine, as they let you bake in the semantic contract as well as the structural contract.

// A contract that contains semantic data
model Person {
   name : PersonName inherits String
   id : PersonId inherits Int
   email : EmailAddress inherits String
}

Enrich structural contracts with semantics

While you can use Taxi to define your schemas outright, this isn’t common practice, given the prevalence of feature-rich, well supported schema languages. Therefore, a good practice is to combine the two, using taxi extensions inside existing schema languages.

// An example of enriching a structural contract with semantic metadata
syntax = "proto3";
import "taxilang/dataType.proto"; // Import the DataType extension
package tutorial;


message Person {
  string name = 1 [(dataType='PersonName')];
  int32 id = 2  [(dataType='PersonId')];
  string email = 3 [(dataType='PersonEmail')];
}

Systems push schemas to Flow

The preferred way of exposing schema data to Flow is to have systems (or CI/CD tooling) publish the schema directly to Flow.

By making the system (and its team) responsible for publishing its own definition, the schema documentation lives as close as possible to the system itself, so has the best chance of being up to date. The team that maintains the system can evolve any schema documentation along with the system itself.

Automatically adapt to change

Flow automates the integration between services, by leveraging the metadata present in the schemas that are published to it.

As schemas change, Flow automatically adapts its integrations accordingly.

In order to make the most of this capability, it’s ideal to have systems automatically publishing their own schemas. The larger the separation between a change happening in a system, and the team responsible for updating the schema, the greater the chance of schemas being incorrect.

Of course, this is no different from manual integration without Flow - if documentation isn’t maintained, then integration becomes error-prone.

Generate schema definitions from code

Generating schemas directly from code is a great way of ensuring that schemas evolve with the code, as they’re generated at run-time.

As a schema language, Taxi has great support for generating schemas directly from Kotlin and Java with other framework support planned.

Using this approach, services generate their own schemas, and publish them directly to Flow on startup.

Pros:

  • Strong chance of schema staying up to date

  • Schema is edited by the same domain experts who build the application

Cons:

  • Requires the application to "have knowledge" of Taxi for code generation

  • Requires the application to "have knowledge" of Flow for publication

Augment existing schemas with semantic metadata

Many applications and systems already publish schemas using a rich schema language, such as Swagger / OpenAPI, Protobuf, JsonSchema, etc.

Generally speaking, these schema languages only describe the structural contract of the data, but not the semantic contract

Therefore, the ideal is to enhance existing schemas with this additional metadata.

The Taxi project has growing support for embedding semantic metadata inside existing schema languages.

Schema Format Taxi Support

OpenAPI

Supported

Swagger

Supported

Protobuf

In development

Avro

Planned

JsonSchema

In development

In these cases, a great solution is to simply enhance the existing schemas with additional metadata.

Pros:

  • Strong chance of schema staying up to date

  • Schema is edited by the same domain experts who build the application

  • No knowledge of Taxi inside code

  • Schema publication can be performed either at runtime, or in a CI/CD job

Cons:

  • Not available for all schema languages

Flow polls systems for updates

Flow’s schema server can be configured to poll sources for schemas, using a variety of back-end storages:

  • File systems

  • Git Repositories

  • OpenAPI endpoints

  • HTTP servers

This is a strong option for scenarios where systems can’t publish their own schemas (eg., databases), or for data sources that are otherwise structureless (eg., CSV files).

Additionally, using a git-backed repository for a shared glossary / taxonomy is a great way to allow decentralized authorship of the core set of glossary terms.

Store schemas separately from systems

Sometimes it’s not possible to have systems publish their own code; there’s a variety of reasons for this:

  • Database schemas - which can’t automatically be pushed

  • Legacy or external systems, which can’t be modified to publish their own schemas

  • Schemaless content - such as CSV files

In these cases, it’s possible to store schemas in a git repository, and have Flow’s schema server manually poll the repository.

The disadvantages here are that it’s easy for the schema definition to drift from the actual schema as the system changes.

Pros:

  • Good fall-back option when no other options are available

  • Requires no changes to publishing systems

Cons:

  • Requires careful change planning to ensure schemas don’t get out of sync with applications

  • Schemas are not necessarily maintained by the same team, which can lead to loss of domain knowledge