Here you'll find a variety of articles I've written over the years (see the chronological list). The site is organized as a very simple blog (old school), and below are a few of the first entries, listed in chronological order. You can find site keywords and more information about me in the upper right.
A workflow for data engineering, machine learning, or other business processes is typically described as a graph of tasks that are chained together by dependencies and consequences. When a task completes, it may cause another task (or many tasks) to execute. The graph of tasks and their connections describes the work accomplished by the workflow.
An example workflow
How a workflow is described depends on the workflow orchestration system. Common approaches are:
the workflow and tasks are stored in a database - typically defined via an API
the workflow is defined by code using an API and possibly code annotations
the workflow is its own artifact (e.g., a file) defined by a DSL (Domain Specific Language), specified in either a common syntax (e.g., YAML, JSON, or XML) or a custom one.
And within specific categories, you can also consider whether code annotations are used to describe the workflow (which implies not using a DSL):
Annotation API by Category
What is workflow as code annotations?
Many programming languages allow code annotations, and some enable these annotations to define or change the behavior of the annotated code. In Python, annotations (decorators) are often a convenient way to package registration and the wrapping of “core behavior” for interacting with a more complicated system. For a workflow, this is a convenient mechanism for registering a Python function as a workflow task and packaging it into the workflow. For example, in a system like Metaflow:
The workflow is defined by extending the class FlowSpec.
A step in the workflow is defined with the @step annotation on a workflow class method.
A step declares its dependent task (or tasks) by calling self.next(), as in the sketch below.
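A minimal sketch of that style is below. It assumes Metaflow, which is where FlowSpec, @step, and self.next() come from; the step names and logic are purely illustrative.

```python
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):
    """A linear three-step flow; step names and logic are illustrative."""

    @step
    def start(self):
        # Metaflow flows begin with a step named 'start'.
        self.records = ["a", "b", "c"]
        self.next(self.transform)

    @step
    def transform(self):
        # This step runs only after 'start' completes.
        self.records = [r.upper() for r in self.records]
        self.next(self.end)

    @step
    def end(self):
        # Metaflow flows finish with a step named 'end'.
        print(f"processed {len(self.records)} records")


if __name__ == "__main__":
    ExampleFlow()
```

Note that the chaining lives inside the steps themselves via self.next(), so the full graph only becomes apparent by following the code.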
There are many different takes on code annotations, but they share a common pattern:
There is a mechanism for identifying a function or method as a task in a workflow.
There is either an API for chaining steps within the task definition, or separate code that defines the workflow (possibly with annotations) - a minimal sketch of this pattern follows.
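To make the pattern concrete without tying it to any particular system, here is a hypothetical, stripped-down sketch: a decorator registers functions as tasks, and separate code defines the chaining.

```python
from typing import Callable, Dict, List

# Hypothetical registry of tasks; not any particular system's API.
TASKS: Dict[str, Callable[[], None]] = {}


def task(fn: Callable[[], None]) -> Callable[[], None]:
    """Annotation that identifies a function as a workflow task."""
    TASKS[fn.__name__] = fn
    return fn


@task
def extract() -> None:
    print("extracting")


@task
def transform() -> None:
    print("transforming")


@task
def load() -> None:
    print("loading")


def run(chain: List[str]) -> None:
    """Separate code that defines how the registered tasks are chained."""
    for name in chain:
        TASKS[name]()


if __name__ == "__main__":
    run(["extract", "transform", "load"])
```

Real systems add dependency declarations, retries, and metadata, but this registration-plus-chaining shape is the common core.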
What is workflow as a DSL?
A DSL (Domain Specific Language) is typically a separate artifact (a file) that describes an object (i.e., the workflow) using a “language” encoded in some syntax. In workflow orchestration systems, there is a high prevalence of using a generic syntax like YAML or JSON to encode the workflow description. Consequently, the DSL is a specific structure in that syntax that encodes the workflow, tasks, and the connections between the tasks.
For example, Argo Workflows are YAML files, so the example workflow graph shown at the beginning of this article can be encoded directly as an Argo Workflow.
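A minimal sketch of what such an encoding looks like is below; the task names, dependency structure, and container image are illustrative stand-ins rather than the exact graph from the figure.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-workflow-
spec:
  entrypoint: main
  templates:
    # The DAG template encodes the graph of tasks and their dependencies.
    - name: main
      dag:
        tasks:
          - name: extract
            template: run-step
          - name: transform
            template: run-step
            dependencies: [extract]
          - name: load
            template: run-step
            dependencies: [transform]
    # Each task references an invokable implementation; here, a container.
    - name: run-step
      container:
        image: alpine:3.19
        command: [sh, -c, "echo running step"]
```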
A workflow orchestration system has to process the workflow DSL artifact into an internal representation, resolve the workflow graph, and ensure all the parts are properly specified. Often, the tasks are references to implementations that are invokable in some system. In the case of Argo Workflows, the tasks are container invocations (i.e., containers run as Kubernetes pods).
Which is better?
The answer is somewhat subjective. If your whole world is Python, the code annotation approach is very attractive. Systems that use this approach often make it very easy to get things working quickly.
When your world gets a little more complicated, it isn’t a stretch to imagine how a task in a workflow might call a remote implementation. This enables the workflow system to stay within a particular programming paradigm (e.g., Python with code annotations) while allowing interactions with other components or systems that are implemented differently.
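As a sketch of that idea (the service endpoint and payload here are hypothetical), a task written in Python can simply delegate its real work to a remote implementation over HTTP:

```python
import requests  # assumes the 'requests' package is available

# Hypothetical endpoint; the real implementation could be in any language.
SCORING_SERVICE_URL = "https://scoring.example.internal/v1/score"


def score_batch(batch_id: str) -> dict:
    """A workflow task whose real work happens in a remote service."""
    response = requests.post(
        SCORING_SERVICE_URL,
        json={"batch_id": batch_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

The orchestration stays in Python; whatever sits behind the endpoint can be implemented in anything.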
On the other hand, a DSL is processable on its own and typically agnostic to task implementation. You can get an idea of the shape of the workflow (e.g., the workflow graph) without running any code or locating the code for each task. That’s an attractive approach for someone who might not be an expert on a particular code base or programming language.
The challenges of workflows as code:
you have to execute the code to truly understand the structure of the workflow,
it requires a correctly configured environment, which is typically the domain of a developer,
everything is packaged as code - which is great until it isn’t,
as the number of workflows and variety of environments expands over time, technical debt can make these workflows brittle.
In contrast, the challenges of workflows as a DSL:
the workflow isn’t code - it is something else you need to learn,
understanding the syntax and semantics may be challenging (e.g., love or hate YAML?),
synchronizing workflows and task implementations may be challenging and requires extra coordination.
The common thread here is the need for coordination. A workflow is an orchestration of tasks and those tasks define an API. Regardless of how you define the workflow, you need to be careful about how task implementations evolve. That means your organization has to continually curate their workflows to be successful with either approach.
Conclusions
There is simply nothing terribly wrong with either approach for authoring workflows. If your part of the organization is primarily developers who work in a particular language (e.g., Python), then you may be better off using code annotations. The process for keeping the workflows and tasks compatible with each other is the same as any other software engineering challenge; solutions for this are well known.
On the other hand, if your organization has a heterogeneous environment with tasks implemented in a variety of languages and different kinds of consumers of the workflows themselves, you are likely better off with a system that has a DSL somewhere in the mix. The DSL acts as an intermediary between the developers of the tasks, the way they are orchestrated, and the different business consumers within your organization.
As a final note, using a DSL opens the possibility of authoring tools, or of generating workflows from diagrams, which may help “cross chasms” between parts of an organization with different skill sets. Generating workflows via a DSL is also a way to add dynamic and generative approaches to MLOps. So having a generative metalanguage over your organization's task primitives may also be helpful with “agentic AI” systems, where the workflow is not just the means but also an outcome that can be executed to accomplish a goal.
Workflow orchestration is a common problem in business automation that has an essential place
in the development and use of ML models. While systems for running workflows have been available for
many years, these systems have a variety of areas of focus. Earlier systems were often focused on
business process automation. Newer systems are developed specifically for the challenges
of orchestrating the tasks of data science and machine learning applications. Depending on their focus,
these systems have different communities of use, features, and deployment characteristics specific to their targeted domain.
This article provides a general overview of what constitutes a workflow orchestration system and follows
with a survey of trends in the available systems that covers:
origins and activity
how workflows are specified
deployment options
What is workflow orchestration?
A workflow is an organization of a set of tasks that encapsulates a repeatable pattern of activity, one that typically provides services, transforms materials, or processes information. The origin of the term dates back to the 1920s, primarily in the context of manufacturing. In modern parlance, we can think of a workflow as akin to a “flow chart of things needed to be accomplished” for a specific purpose within an organization. In more recent years, “workflow orchestration” or “workflow management” systems have been developed to track and execute workflows for specific domains.
In the recent past, companies used workflow orchestration for various aspects of business automation. This has enabled companies to move from paper-based or human-centric processes to ones where the rules by which actions are taken are dictated by workflows encoded in these systems. While ensuring consistency, this also gives the organization a way to track metadata around tasks and ensure completion.
Within data platforms, data science, and more recent machine learning endeavours, workflow orchestration has become a fundamental tool for scaling processes and ensuring quality outcomes. When the uses of these systems are considered, earlier systems were focused on business processes whilst later ones focus on data engineering, data science, and machine learning. Each of the systems surveyed was categorized into one of the following areas of focus:
Business Processing - oriented for generic business process workflows
Science - specifically focused on scientific data processing, HPC, and modeling or inference for science
Data Science / ML - processing focused on data science or machine learning
Data Engineering - processing specific to data manipulation, ETL, and other forms of data management
Operations - processes for managing computer systems, clusters, databases, etc.
Of the systems surveyed, the breakdown of categories is shown below:
Systems by Category
While many of these systems can be used for different purposes, each has specializations for specific domains based on their community of use. An effort has been made to place each system into a single category based on the use cases, documentation, and marketing associated with the system.
Origins of projects
Project Creation by Category
All the systems surveyed appeared after 2005, just after the “dot-com era” and at the start of “big data”. In the above figure, the start and end dates are shown for each category: each column starts at the earliest project formation and ends at the most recent. This gives a visual representation of activity and possible innovation in each category.
While business process automation has been and continues to be a focus of workflow system development, you can see some evolution of development from data engineering or operations to data science and machine learning. Meanwhile, the creation of new science-oriented systems appears to have stagnated. This may be due to the adoption of data engineering and machine learning methods in scientific contexts, reducing the need for specialized systems.
Activity
Active Projects by Category
As is often the case with open-source software, even if associated with a commercial endeavour, some of the projects appear to have been abandoned. The above chart shows roughly a 20-25% rate of abandonment for workflow systems, with the notable exception of science-oriented systems. In addition, it should be noted that some of these active projects are merely being maintained whilst others are being actively developed by a vibrant community.
For science, while there may not be many new science-oriented workflow systems being created in recent years, most of those that exist are still actively being used.
SaaS Offered
In addition, some of these projects have commercial SaaS offerings, which is another indicator of viability. The largest share is in Data Science / ML, at 35% of those surveyed. This likely correlates with the current investment in machine learning and AI technologies.
SaaS Available by Category
Workflow specification
Workflow Graph
Most workflows are conceptualized as a “graph of tasks” where there is a single starting point that may branch out to any number of tasks. Each subsequent task has a dependency on a preceding task, which creates a link between the tasks. This continues through to “leaf” tasks at the very end of the workflow. In some systems, these are all connected to an explicit end of the workflow.
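That structure can be captured with nothing more than a mapping from each task to the tasks it depends on. Here is a small sketch (the task names are illustrative) that also derives an execution order an orchestrator could use:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the tasks it depends on; "ingest" is the single starting
# point and "report" / "publish" are the leaf tasks.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "evaluate": {"train"},
    "report": {"evaluate"},
    "publish": {"evaluate"},
}

# One valid execution order that respects every dependency.
print(list(TopologicalSorter(workflow).static_order()))
# e.g. ['ingest', 'clean', 'train', 'evaluate', 'report', 'publish']
```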
Many systems differ on how a workflow is described. Some have a DSL (Domain Specific Language) that is used to encode the workflow. Others have an API that is used by code to create the workflow via program execution. Others have a hybrid mechanism that uses code annotation features of a specific programming language to describe the workflow. The use of annotations simplifies the description of
a workflow via an API and serves as a middle ground between the API and a DSL.
In the following chart, the use of a DSL and its encoding format is shown. If the DSL and format are compared against project creation dates, you can see that a DSL is more prominent in Business Processing and Science workflow systems, which generally have an earlier origin (~2005). Whereas Data Engineering and Data Science / ML systems tend to use code, or annotations on code, rather than a DSL to describe the workflow.
Further, there is a strong trend toward using YAML as the syntax for describing the graph of tasks in the workflow DSL. This is almost exclusively true for those surveyed in the Data Science / ML category. It should be noted that there is some use of specialized syntaxes (Custom), most often in the Science category, where the DSL uses a custom syntax that must be learned by the user.
DSL Format by Category
Meanwhile, using annotations in code to describe workflows is a growing trend. In those surveyed, it appears that as systems evolved from focusing on data engineering to data science and ML, the use of code annotations has increased. This is also likely due in part to the dominance of Python as the programming language of choice for machine learning applications and the fondness of Python users for annotation schemes.
Annotation API by Category
When it comes to describing tasks, systems that use annotations have a clear advantage in terms of simplicity. In those systems, a task is typically a function with an annotation. Subsequently, the system orchestrates execution of that function within the deployment environment.
In general, tasks are implemented as code in some programming language. Some workflow systems are agnostic to the choice of programming language as they use containers for invocation, a service request (e.g., an HTTP request to a service), or some other orthogonal invocation. Many systems are specifically designed to be opinionated about the choice of language, either by the API provided or due to the way the workflow is described through code annotations.
The following chart shows the distribution of task languages in the surveyed systems. The dominance of Python is clear from this chart due to its prevalence in data engineering, data science, and machine learning. Many of the uses of Java are from systems that are focused on business processing workflows.
Task Language
Deployment
As with any software, these workflow systems must be deployed on infrastructure. Unsurprisingly, there is a strong trend towards containers and container orchestration. Many still leave the deployment considerations up to the user to decide and craft.
Deployments
When only the Data Engineering and Data Science / ML categories are considered, you can see the increasing trend of the use of Kubernetes as the preferred deployment.
Deployments - Data Science/ML Only
Conclusions
Overall, when you look at the activity and project creation over all the categories, two things seem to be clear:
There is a healthy ecosystem of workflow systems for a variety of domains of use.
There is no clearly dominant system.
While particular communities or companies behind certain systems might argue otherwise, there is clearly a lot of
choice and activity in this space. There are certainly outlier systems that have smaller usage, support, and
active development. In a particular category, there are probably certain systems you might have on a short list of “winners”.
In fact, what is missing here is any community sentiment around the various systems. There are systems that are in use by a lot of companies (e.g., Airflow) simply because they have been around for a while. Their “de facto” use doesn’t mean that all of their users’ needs are being met, nor that those users are satisfied with their experience using the system. These users may simply not have a choice or sufficient reason to change; working well enough means there is enough momentum to make change costly.
Rather, the point here is that there is a lot of choice given the activity in workflow systems. That variety of choice means there is a lot of opportunity for innovation by users and developers, as well as for companies who have a workflow system product.
And that is a very good thing.
Data
All the systems considered were either drawn from curated lists of systems or found via GitHub tags such as workflow-engine or workflow.
Whilst not a complete list, it does consist of 80 workflow systems or engines.
Each system’s documentation and GitHub project were examined to determine various properties. Some of these values may be subjective. An effort was made to make consistent judgements about the categories of use. Meanwhile, a valiant attempt was made to understand the features of each system by finding evidence for them in the documentation or examples. As such, some things may have been missed if they were hard to find. Then again, that is not unlike a user’s experience with the product: if a feature is hard to find, they may assume it doesn’t exist.
I recently read “Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation” (MedGraphRAG), where the authors use a careful and thoughtful construction of a knowledge graph, curated from various textual sources and extracted via orchestration of an LLM. Queries against this knowledge graph are used to create a prompt for a pre-trained LLM, where a clever use of tagging allows the answer to be traced back to the sources.
The paper details several innovations:
using layers in the graph to represent different sources of data,
allowing the layers to represent data with different update cadences and characteristics (e.g., patient data, medical research texts, or reference material)
careful use of tags to enable the traceability of the results back to the sources.
I highly recommend you read through the paper as the results surpass SOTA and the techniques are mostly sound.
When digging into the “how”, especially given the associated GitHub project, I am continually nagged by thoughts about using an LLM for Named Entity Recognition (NER) and Relation Extraction (RE) tasks. In particular:
How often does such an LLM miss reporting entities or relations entirely (omissions)?
What kinds of errors does such an LLM make and how often (misinformation)?
If we use an LLM to generate the knowledge graph, and it has problems from (1) and (2), how well does an LLM answer questions given information from the knowledge graph (circularity)?
The success the authors demonstrate using the MedGraphRAG technique to answer various medical diagnosis questions is one measure for (3). As with all inference, incorrect answers will happen. Tracing down the “why” for an incorrect answer relies on understanding whether the fault lies in the input (the prompt generated from the knowledge graph) or in the inference drawn by the LLM. This means we must understand whether something is “wrong” or “missing” in the knowledge graph itself.
To answer this, I went on a tear for the last few weeks of reading whatever I could on NER and RE evaluation, datasets, and random
blog posts to update myself on the latest research. There are some good NER datasets out there and some for RE as well. I am certain there
are many more resources out there that I haven’t encountered, but I did find this
list on Github which led me to the conclusion that I really need to
focus on RE.
In going through how the MedGraphRAG knowledge graph is constructed, there are many pre- and post-processing steps that need to be applied to their datasets. Not only do they need to process various medical texts to extract entities and relations, they also need to chunk or summarize these texts in a way that respects topic boundaries. This helps the text fit into the limits of the prompt. The authors use “proposition transfer”, which serves as a critical step regarding topic boundaries, and that process also uses an LLM, bringing another circularity and more questions about correctness.
All things considered, the paper demonstrates how a well-constructed knowledge graph can be used to contextualize queries for better answers that are traceable back to the sources supporting each answer. To put such a technique into production, you need to be able to evaluate that the entities and relations extracted are correct and that you aren’t missing important information, and you need to do this every time you update your knowledge graph. For that, you need some mechanism for evaluating an external LLM’s ability to perform a relation extraction (RE) task.
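As a sketch of the most basic form such an evaluation could take (the triples below are made up), exact-match precision, recall, and F1 over (subject, relation, object) triples against a human-annotated gold set:

```python
def triple_scores(predicted: set[tuple[str, str, str]],
                  gold: set[tuple[str, str, str]]) -> dict[str, float]:
    """Exact-match precision/recall/F1 over (subject, relation, object) triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative, made-up triples.
gold = {("aspirin", "treats", "headache"), ("aspirin", "is_a", "NSAID")}
predicted = {("aspirin", "treats", "headache"), ("aspirin", "causes", "nausea")}
print(triple_scores(predicted, gold))  # precision 0.5, recall 0.5, f1 0.5
```

Exact matching is, of course, too strict for free-text relations, which is part of why human evaluators keep coming up.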
Experiments with llama3
Maybe I shouldn’t be surprised, but there are subtle nuances in the prompts that can generate vastly different outcomes for relation extraction tasks. I ran some experiments with llama3.1 locally (8B parameters) just to test various things. At one point during ad hoc testing, one of the responses said something along the lines of “there is more, but I omitted them”, and adding “be as comprehensive and complete as possible” to the prompt fixed that problem.
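The shape of those ad hoc experiments was roughly the following. This sketch assumes the ollama Python client and a locally pulled llama3.1 model, and the prompt wording is just one illustrative variant:

```python
import ollama  # assumes the ollama Python client and a locally pulled llama3.1 model

PROMPT = (
    "Extract all entities and the relations between them from the text below. "
    "Be as comprehensive and complete as possible.\n\n"
    "TEXT:\n{text}"
)


def extract_relations(text: str) -> str:
    """One round-trip to the local model; the raw response still has to be parsed."""
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response["message"]["content"]
```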
Everyone who has put something into production knows that a very subtle change can have drastic and unintended
outcomes. When constructing a knowledge graph from iterative interactions with an external LLM, we need some way to
know that our new prompt that fixes one problem hasn’t created a hundred problems elsewhere. That is usually the
point of unit and system testing (and I already hear the groans from the software engineers).
In the case of the MedGraphRAG implementation, they use the CAMEL-AI libraries
in Python to extract “entities” and “relations”. That library instructs the LLM to produce a particular syntax
that reduces to typed entities and relation triples (i.e., subject, relation, object triples) which is then
parsed by the library. I am certainly curious as to when that fails to parse, since escaping text is always a place where errors proliferate.
Meanwhile, in my own experimentation, I simply asked llama to output YAML and was surprised that it did something
close to what might be parsable. A few more instructions were sufficient to pass the results into a YAML
parser:
A node should be formatted in YAML syntax with the following rules:
* All nodes must be listed under a single 'nodes' property.
* All relationships must be listed under a single 'relations' property.
* The 'nodes' and 'relations' properties may not repeat at the top level.
Note: there are so many ways I can imagine this breaking. So, we will have to see what happens when I run a lot of text through this kind of prompt.
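For the parsing side, here is a sketch assuming PyYAML is installed and the model's output follows the rules above (the example output is made up):

```python
import yaml  # assumes PyYAML is installed


def parse_graph(raw: str) -> dict:
    """Parse the model's YAML output and check for the structure the prompt asked for."""
    data = yaml.safe_load(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a YAML mapping at the top level")
    missing = {"nodes", "relations"} - data.keys()
    if missing:
        raise ValueError(f"missing top-level properties: {sorted(missing)}")
    return data


# Illustrative output in the shape the prompt asks for.
example = """
nodes:
  - {name: aspirin, type: drug}
  - {name: headache, type: symptom}
relations:
  - {subject: aspirin, relation: treats, object: headache}
"""
graph = parse_graph(example)
print(len(graph["nodes"]), "nodes,", len(graph["relations"]), "relations")
```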
I did spend some time experimenting on whether I could prompt llama to produce a property graph. That is, could it separate the properties of an entity from its relations? Or could it identify a property of a relation? I wasn’t particularly successful (yet), but that is a topic for a different blog post.
A gem of a paper on relation extraction
In my wanderings looking for more research in this area, I found this paper titled “Revisiting Relation Extraction in the era of Large Language Models”, which addresses the question at the heart of knowledge graph construction with an LLM. While NER and entity resolution are critical steps, a knowledge graph wouldn’t be a graph if the LLM does not handle the RE task well. This is another paper that I highly recommend you read.
The authors give a good outline of the elements of an evaluation of RE with an LLM. They compare the results of various models
and LLM techniques against human annotated datasets for relations. They also detail the need for human evaluators for determining “correctness”
given the various challenges already present in the thorny problems of RE.
In some contexts, the datasets were not well suited to be run through an LLM for RE tasks. The authors say at one point,
“These results highlight a remaining limitation of in-context learning with large language models: for datasets with long
texts or a large number of targets, it is not possible to fit detailed instructions in the prompt.”
This is a problem that the MedGraphRAG technique solved using proposition transfer but doing so muddies the RE task with yet
another LLM task.
An idea
I’ve recently become involved in the ML Commons efforts, where I am particularly interested in datasets. I think the challenge of collecting, curating, or contributing to datasets that support LLM evaluation for knowledge graph construction would be a particularly useful one to take on.
This effort at ML Commons could focus on a variety of challenges:
Collecting datasets: identification of existing datasets or corpora that can be used for NER and RE tasks in various domains
Standardized metadata: helping to standardize the metadata and structure of these datasets to allow more automated use for evaluation
Annotation: annotation of datasets with entities and relations to provide a baseline for comparison
Conformance levels: enable different levels of conformance to differentiate between the “basics” and more complex RE outcomes.
Tools: tooling for dataset curation and LLM evaluation
One area of innovation here would be the ability to label outcomes from an LLM not just in terms of omissions or misinformation, but also in terms of whether it can identify more subtle relations, inverse relations, etc. That would allow a consumer of these models to understand what they should expect and what they may have to do afterwards to the knowledge graph as a secondary inference.
I will post more on this effort when and if it becomes an official work item. I hope it does.