Alex Miłowski, geek

Welcome!

Here you'll find a variety of articles I've written over the years (see the chronological list). The site is organized as a very simple blog (old school), and below are a few of the first entries listed in chronological order. You can find site keywords and more information about me in the upper right.

ready? set? go!

Workflows - DSL or code?

A workflow for data engineering, machine learning, or other business processes is typically described as a graph of tasks that are chained together by dependencies and consequences. When a task completes, it may cause another task (or many tasks) to execute. The graph of tasks and their connections is a description of the work accomplished by the workflow.

A workflow with a start node connected to task A, A connected to B and C, C connected to D, and B and D connected to the end.

An example workflow

How a workflow is described depends on the workflow orchestration system. Common approaches are:

  • the workflow and tasks are stored in a database - typically defined via an API
  • the workflow is defined by code using an API and possibly code annotations
  • the workflow is its own artifact (e.g., a file) defined by a DSL (Domain Specific Language) specified in a common syntax (e.g., YAML, JSON, or XML) or a custom one.

In my survey of workflow orchestration systems, I noted a trend of moving from a DSL to code for defining workflows. You can see this trend in the chart below:

DSL Format by Category

And within specific categories, you can also consider whether code annotations are used to describe the workflow (which implies not using a DSL):

Annotation API by Category

What is workflow as code annotations?

Many programming languages allow code annotations, and some enable these annotations to define or change the behavior of the annotated code. In Python, this is often a convenient way to package registration and wrap “core behavior” for interacting with a more complicated system. For a workflow, this is a convenient mechanism for registering a Python function as a task and packaging it into the workflow.

For example, in this Metaflow example from their tutorial, you can see the workflow is defined by three mechanisms:

  1. The workflow is defined by extending the class FlowSpec.
  2. A step in the workflow is defined with the @step annotation on a workflow class method.
  3. A step defines its dependent task(s) by calling self.next().
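
Putting those three mechanisms together, a minimal sketch of such a flow might look like the following. It mirrors the example graph above rather than the tutorial's actual code, and the step bodies are placeholders:

# A minimal sketch (not the tutorial's code) of a Metaflow flow that mirrors
# the example graph above; the step bodies are placeholders.
from metaflow import FlowSpec, step

class ExampleFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a)

    @step
    def a(self):
        # A fans out to B and C
        self.next(self.b, self.c)

    @step
    def b(self):
        self.next(self.join)

    @step
    def c(self):
        self.next(self.d)

    @step
    def d(self):
        self.next(self.join)

    @step
    def join(self, inputs):
        # B and D converge here before the workflow ends
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ExampleFlow()

Running it (e.g., python example_flow.py run) lets Metaflow resolve the graph from the @step methods and the self.next() calls.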

There are many different approaches to code annotations, but they share a common pattern:

  • There is a mechanism for identifying a function or method as a task in a workflow.
  • Steps are chained either via an API within the task definition or in separate code that defines the workflow (possibly also with annotations).

What is workflow as a DSL?

A DSL (Domain Specific Language) is typically a separate artifact (a file) that describes an object (i.e., the workflow) using a “language” encoded in some syntax. In workflow orchestration systems, there is a high prevalence of using a generic syntax like YAML or JSON to encode the workflow description. Consequently, the DSL is a specific structure in that syntax that encodes the workflow, tasks, and the connections between the tasks.

For example, Argo Workflows are YAML files. The example workflow graph shown at the beginning of this article can be encoded as an Argo Workflow in YAML as follows:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-
spec:
  entrypoint: start
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: start
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "C"
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
      - name: end
        depends: "B && D"
        template: echo
        arguments:
          parameters: [{name: message, value: end}]

A workflow orchestration system has to process the workflow DSL artifact into an internal representation, resolve the workflow graph, and ensure all the parts are properly specified. Often, the tasks are references to implementations that are invokable in some system. In the case of Argo Workflows, the tasks are container invocations (i.e., Kubernetes batch jobs).
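
As an illustration of that first processing step (and not how Argo itself is implemented), here is a minimal sketch that reads a workflow like the YAML above and builds a task-to-dependencies mapping; the file name and the simplistic handling of the depends expression are assumptions for this example:

# Sketch: turn the DSL artifact into an internal representation
# (a mapping of task name -> set of upstream task names).
import yaml  # PyYAML

def load_dag(path):
    with open(path) as f:
        workflow = yaml.safe_load(f)

    # Find the template named by the entrypoint (the 'start' DAG above).
    entrypoint = workflow["spec"]["entrypoint"]
    dag = next(t for t in workflow["spec"]["templates"]
               if t["name"] == entrypoint)["dag"]

    # 'depends' is a boolean expression; splitting on '&&' is only
    # sufficient for simple cases like the example above.
    graph = {}
    for task in dag["tasks"]:
        depends = task.get("depends", "")
        graph[task["name"]] = {d.strip() for d in depends.split("&&") if d.strip()}
    return graph

if __name__ == "__main__":
    # e.g. {'A': set(), 'B': {'A'}, 'C': {'A'}, 'D': {'C'}, 'end': {'B', 'D'}}
    print(load_dag("example-workflow.yaml"))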

Which is better?

The answer is somewhat subjective. If your whole world is Python, the code annotation approach is very attractive. Systems that use this approach often make it very easy to get things working quickly.

When your world gets a little more complicated, it isn’t a stretch to imagine how a task in a workflow might call a remote implementation. This enables the workflow system to stay within a particular programming paradigm (e.g., Python with code annotations) while allowing interactions with other components or systems that are implemented differently.

On the other hand, a DSL is separately processable and typically agnostic to task implementation. You can get an idea of the shape of the workflow (e.g., the workflow graph) without running any code or locating the code for each task. That’s an attractive approach for someone who might not be an expert on a particular code base or programming language.

The challenges of workflows as code:

  • you have to execute the code to truly understand the structure of the workflow,
  • it requires a correctly configured environment, which is typically the domain of a developer,
  • everything is packaged as code - which is great until it isn’t,
  • as the number of workflows and variety of environments expands over time, technical debt can make these workflows brittle.

In contrast, the challenges of workflows as a DSL:

  • the workflow isn’t code - it is something else you need to learn,
  • understanding the syntax and semantics may be challenging (e.g., love or hate YAML?),
  • synchronizing workflows and task implementations may be challenging and requires extra coordination.

The common thread here is the need for coordination. A workflow is an orchestration of tasks and those tasks define an API. Regardless of how you define the workflow, you need to be careful about how task implementations evolve. That means your organization has to continually curate their workflows to be successful with either approach.

Conclusions

There is simply nothing terribly wrong with either approach for authoring workflows. If your part of the organization is primarily developers who work in a particular language (e.g., Python), then you may be better off using code annotations. The process for keeping the workflows and tasks compatible with each other is the same as any other software engineering challenge; solutions for this are well known.

On the other hand, if your organization has a heterogeneous environment with tasks implemented in a variety of languages and different kinds of consumers of the workflows themselves, you are likely better off with a system that has a DSL somewhere in the mix. The DSL acts as an intermediary between the developers of the tasks, the way they are orchestrated, and the different business consumers within your organization.

As a final note, a DSL opens up the possibility of authoring tools, or of generating workflows from diagrams, which may help “cross chasms” between parts of an organization with different skill sets. Generating workflows via a DSL is a way to add dynamic and generative approaches to MLOps. So having a generative metalanguage over your organization’s workflow task primitives may also be helpful with “agentic AI” systems, where the workflow is not just the means but also an outcome that can be executed to accomplish a goal.


A survey of workflow orchestration systems

Introduction

Workflow orchestration is a common problem in business automation that has an essential place in the development and use of ML models. While systems for running workflows have been available for many years, these systems have a variety of areas of focus. Earlier systems were often focused on business process automation. Newer systems are developed specifically for the challenges of orchestrating the tasks of data science and machine learning applications. Depending on their focus, these systems have different communities of use, features, and deployment characteristics specific to their targeted domain.

This article provides a general overview of what constitutes a workflow orchestration system and follows with a survey of trends in the available systems that covers:

  • origins and activity
  • how workflows are specified
  • deployment options

What is workflow orchestration?

A workflow is an organization of a set of tasks that encapsulates a repeatable pattern of activity that typically provides services, transforms materials, or processes information [1]. The origin of the term dates back to the 1920s, primarily in the context of manufacturing. In modern parlance, we can think of a workflow as akin to a “flow chart of things needed to be accomplished” for a specific purpose within an organization. In more recent years, “workflow orchestration” or “workflow management” systems have been developed to track and execute workflows for specific domains.

In the recent past, companies used workflow orchestration for various aspects of business automation. This has enabled companies to go from paper-based or human-centric processes to ones where the rules by which actions are taken are dictated by workflows encoded in these systems. While ensuring consistency, this also gives the organization a way to track metadata around tasks and ensure completion.

Within data platforms, data science, and more recent machine learning endeavours, workflow orchestration has become a fundamental tool for scaling processes and ensuring quality outcomes. When the uses of the systems are considered, earlier systems were focused on business processes whilst later ones are focused on data engineering, data science, and machine learning. Each of these systems was categorized into one of the following areas of focus:

  • Business Processing - oriented for generic business process workflows
  • Science - specifically focused on scientific data processing, HPC, and modeling or inference for science
  • Data Science / ML - processing focused on data science or machine learning
  • Data Engineering - processing specific to data manipulation, ETL, and other forms of data management
  • Operations - processes for managing computer systems, clusters, databases, etc.

Of the systems surveyed, the breakdown of categories is shown below:

Systems by Category

While many of these systems can be used for different purposes, each has specializations for specific domains based on their community of use. An effort has been made to place a system into a single category based on the use cases, documentation, and marketing associated with the system.

Origins of projects

Project Creation by Category

All the systems surveyed appeared after 2005, just after the “dot-com era” and at the start of “big data”. In the above figure, the start and end dates are shown for each category. Each column starts at the earliest project formation and ends at the last project formation. This gives a visual representation of activity and possible innovation in each category.

While business process automation has been and continues to be a focus of workflow system development, you can see some evolution of development from data engineering or operations to data science and machine learning. Meanwhile, the creation of new science-oriented systems appears to have stagnated. This may be due to the use of data engineering and machine learning methods in scientific contexts, reducing the need for specialized systems.

Activity

Active Projects by Category

As is often the case with open-source software, even if associated with a commercial endeavour, some of the projects appear to have been abandoned. In the above chart, there tends to be a 20-25% rate of abandonment for workflow systems, with the notable exception of science-oriented systems. In addition, it should be noted that some of these active projects are just being maintained whilst others are being actively developed by a vibrant community.

For science, while there may not be many new science-oriented workflow systems being created in recent years, most of those that exist are still actively being used.

SaaS Offered

In addition, some of these projects have commercial SaaS offerings that also indicate viability. The largest share is in Data Science / ML, at 35% of those surveyed. This has a likely correlation with the current investment in machine learning and AI technologies.

SaaS Available by Category

Workflow specification

Workflow Graph

Most workflows are conceptualized as a “graph of tasks” where there is a single starting point that may branch out to any number of tasks. Each subsequent task has a dependency on a preceding task, which creates a link between tasks. This continues through to “leaf” tasks at the very end of the workflow. In some systems, these are all connected to an end of the workflow.

Many systems differ on how a workflow is described. Some have a DSL (Domain Specific Language) that is used to encode the workflow. Others have an API that is used by code to create the workflow via program execution. Others have a hybrid mechanism that uses code annotation features of a specific programming language to describe the workflow. The use of annotations simplifies the description of a workflow via an API and serves as a middle ground between the API and a DSL.

In the following chart, the use of a DSL and the encoding format is shown. If the DSL and format are compared to project creation, you can see that a DSL is more prominent in Business Processing and Science workflow systems, which generally have an earlier origin (~2005). Whereas Data Engineering and Data Science / ML systems tend to use code or annotations on code rather than a DSL to describe the workflow.

Further, there is a strong trend towards using YAML as a syntax for describing the graph of tasks in the workflow DSL. This is almost exclusively true for those surveyed in the Data Science / ML category. It should be noted that there is some use of specialized syntaxes (Custom), most often in the Science category, where the DSL uses a specialized syntax that must be learned by the user.

DSL Format by Category

Meanwhile, using annotations in code to describe workflows is a growing trend. In those surveyed, it appears that as systems evolved from focusing on data engineering to data science and ML, the use of code annotations has increased. This is also likely due in part to the dominance of Python as the programming language of choice for machine learning applications and the fondness of Python users for annotation schemes.

Annotation API by Category

When it comes to describing tasks, systems that use annotations have a clear advantage in terms of simplicity. In those systems, a task is typically a function with an annotation. Subsequently, the system orchestrates execution of that function within the deployment environment.
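
For example, a minimal sketch using Prefect's @task and @flow decorators (the task body is a placeholder, and the wiring simply mirrors an A/B/C/D style graph like the one in the previous article) shows how little is needed beyond annotating a function:

# A minimal sketch using Prefect's @task and @flow decorators; the task
# body is a placeholder and the ordering mirrors an A/B/C/D style graph.
from prefect import flow, task

@task
def echo(message: str) -> str:
    print(message)
    return message

@flow
def example():
    a = echo.submit("A")
    b = echo.submit("B", wait_for=[a])
    c = echo.submit("C", wait_for=[a])
    d = echo.submit("D", wait_for=[c])
    echo.submit("end", wait_for=[b, d])

if __name__ == "__main__":
    example()

The trade-off discussed in the previous article applies here as well: the orchestration system only discovers the full graph by executing the flow function.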

In general, tasks are implemented as code in some programming language. Some workflow systems are agnostic to the choice of programming language as they use containers for invocation, a service request (e.g., an HTTP request to a service), or some other orthogonal invocation. Many systems are specifically designed to be opinionated about the choice of language, either by the API provided or due to the way the workflow is described through code annotations.

The following chart shows the distribution of task languages in the surveyed systems. The dominance of Python is clear from this chart due to its prevalence in data engineering, data science, and machine learning. Many of the uses of Java are in systems that are focused on business processing workflows.

Task Language

Deployment

As with any software, these workflow systems must be deployed on infrastructure. Unsurprisingly, there is a strong trend towards containers and container orchestration. Many still leave the deployment considerations up to the user to decide and craft.

Deployments

When only the Data Engineering and Data Science / ML categories are considered, you can see the increasing trend of the use of Kubernetes as the preferred deployment.

Deployments - Data Science/ML Only

Conclusions

Overall, when you look at the activity and project creation over all the categories, two things seem to be clear:

  1. There is a healthy ecosystem of workflow systems for a variety of domains of use.
  2. There is no clearly dominant system.

While particular communities or companies behind certain systems might argue otherwise, there is clearly a lot of choice and activity in this space. There are certainly outlier systems that have smaller usage, support, and active development. In a particular category, there are probably certain systems you might have on a short list of “winners”.

In fact, what is missing here is any community sentiment around various systems. There are systems that are in use by a lot of companies (e.g., Airflow) simply because they have been around for a while. Their “de facto” use doesn’t mean that all of their users’ needs are being met, nor that users are satisfied with their experience using the system. These users may simply not have a choice or sufficient reason to change; working well enough means there is enough momentum to make change costly.

Rather, the point here is that there is a lot of choice given the activity in workflow systems. That variety of choice means there is a lot of opportunity for innovation by users or developers as well as for companies who have a workflow system product. And that is a very good thing.

Data

All the systems considered were either drawn from curated lists of systems or by GitHub tags such as workflow-engine or workflow. Whilst not a complete list, it does consist of 80 workflow systems or engines.

Each system’s documentation and GitHub project were examined to determine various properties. Some of these values may be subjective. An effort was made to make consistent judgements for categories of use. Meanwhile, a valiant attempt was made to understand the features of each system by finding evidence in their documentation or examples. As such, some things may have been missed if they were hard to find. Though that is not unlike a user’s experience with the product: if a feature is hard to discover, they may assume it doesn’t exist.

The data is available here: workflow-orchestration-data.csv

References


  1. “Workflow”, Wikipedia, https://en.wikipedia.org/wiki/Workflow ↩︎


Circularity in LLM-curated knowledge graphs

I recently read “Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation” (MedGraphRAG), where the authors carefully construct a knowledge graph, curated from various textual sources and extracted via a thoughtful orchestration of an LLM. Queries against this knowledge graph are used to create a prompt for a pre-trained LLM, where a clever use of tagging allows the answer to be traced back to the sources.

The paper details several innovations:

  • using layers in the graph to represent different sources of data,
  • allowing the layers to represent data with different update cadences and characteristics (e.g., patient data, medical research texts, or reference material)
  • careful use of tags to enable the traceability of the results back to the sources.

I highly recommend you read through the paper as the results surpass SOTA and the techniques are mostly sound.

When digging into the “how”, especially given the associated GitHub project, I am continually nagged by thoughts around using an LLM for Named Entity Recognition (NER) and Relation Extraction (RE) tasks. In particular:

  1. How often does such an LLM miss reporting entities or relations entirely (omissions)?
  2. What kinds of errors does such an LLM make and how often (misinformation)?
  3. If we use an LLM to generate the knowledge graph, and it has problems from (1) and (2), how well does an LLM answer questions given information from the knowledge graph (circularity)?

The success demonstrated by the authors of the MedGraphRAG technique as used for answering various medical diagnosis questions is one measure for (3). As with all inference, incorrect answers will happen. Tracing down the “why” for the incorrect answer relies on understanding whether it is the input (the prompt generated from the knowledge graph) or the inference drawn by the LLM. This means we must understand whether something is “wrong” or “missing” in the knowledge graph itself.

To answer this, I went on a tear for the last few weeks, reading whatever I could on NER and RE evaluation, datasets, and random blog posts to update myself on the latest research. There are some good NER datasets out there and some for RE as well. I am certain there are many more resources out there that I haven’t encountered, but I did find this list on GitHub, which led me to the conclusion that I really need to focus on RE.

In going through how the MedGraphRAG knowledge graph is constructed, there are many pre- and post-processing steps that need to be applied to their datasets. Not only do they need to process various medical texts to extract entities and relations, but they also need to chunk or summarize these texts in a way that respects topic boundaries. This helps the text fit into the limits of the prompt. The authors use “proposition transfer”, which serves as a critical step regarding topic boundaries, and that process also uses an LLM, bringing another circularity and more questions about correctness.

All things considered, the paper demonstrates how a well-constructed knowledge graph can be used to contextualize queries for better answers that are traceable back to the sources supporting that answer. To put such a technique into production, you need to be able to evaluate whether the entities and relations extracted are correct and that you aren’t missing important information, and you need to do this every time you update your knowledge graph. For that, you need some mechanism for evaluating an external LLM’s capabilities and how well it performs a relation extraction (RE) task.

Experiments with llama3

Maybe I shouldn’t be surprised, but there are subtle nuances in the prompts that can generate vastly different outcomes for relation extraction tasks. I ran some experiments running llama3.1 locally (8B parameters) just to test various things. At one point during ad hoc testing, one of the responses said something along the lines of “there is more, but I omitted them” and adding “be as comprehensive and complete as possible” to the prompt fixed that problem.

Everyone who has put something into production knows that a very subtle change can have drastic and unintended outcomes. When constructing a knowledge graph from iterative interactions with an external LLM, we need some way to know that our new prompt that fixes one problem hasn’t created a hundred problems elsewhere. That is usually the point of unit and system testing (and I already hear the groans from the software engineers).

In the case of the MedGraphRAG implementation, they use the CAMEL-AI libraries in Python to extract “entities” and “relations”. That library instructs the LLM to produce a particular syntax that reduces to typed entities and relation triples (i.e., subject, relation, object triples), which is then parsed by the library. I am certainly curious as to when that fails to parse, as escaping text is always a place where errors proliferate.

Meanwhile, in my own experimentation, I simply asked llama to output YAML and was surprised that it did something close to what might be parsable. A few more instructions were sufficient to pass the results into a YAML parser:

A node should be formatted in YAML syntax with the following rules:

 * All nodes must be listed under a single 'nodes' property. 
 * All relationships must be listed under a single 'relations' property.
 * The 'nodes' and 'relations' properties may not repeat at the top level.

Note:

There are so many ways I can imagine this breaking. So, we will have to see what happens when I run a lot of text through this kind of prompt.
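
For what it's worth, here is a minimal sketch of the parsing side; the sample text stands in for a model response, and the node/relation fields shown are assumptions rather than a fixed schema:

# Sketch: validate and parse the kind of YAML the prompt above asks for.
import yaml  # PyYAML

sample_response = """
nodes:
  - id: aspirin
    type: Drug
  - id: headache
    type: Symptom
relations:
  - subject: aspirin
    relation: treats
    object: headache
"""

def parse_graph(text):
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("response is not a YAML mapping")
    for key in ("nodes", "relations"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or malformed '{key}' list")
    node_ids = {node["id"] for node in data["nodes"]}
    # Flag relations that refer to entities the model never declared.
    dangling = [r for r in data["relations"]
                if r["subject"] not in node_ids or r["object"] not in node_ids]
    return data["nodes"], data["relations"], dangling

nodes, relations, dangling = parse_graph(sample_response)
print(len(nodes), "nodes,", len(relations), "relations,", len(dangling), "dangling")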

I did spend some time experimenting on whether I could prompt llama to produce a property graph. That is, could it separate properties of an entity from relations, or could it identify a property of a relation? I wasn’t particularly successful (yet), but that is a topic for a different blog post.

A gem of a paper on relation extraction

In my wanderings looking for more research in this area, I found this paper titled “Revisiting Relation Extraction in the era of Large Language Models”, which addresses the question at the heart of knowledge graph construction with an LLM. While NER and entity resolution are critical steps, a knowledge graph wouldn’t be a graph if the LLM does not handle the RE task well. This is another paper that I highly recommend you read.

The authors give a good outline of the elements of an evaluation of RE with an LLM. They compare the results of various models and LLM techniques against human annotated datasets for relations. They also detail the need for human evaluators for determining “correctness” given the various challenges already present in the thorny problems of RE.

In some contexts, the datasets were not well suited to be run through an LLM for RE tasks. The authors say at one point,

“These results highlight a remaining limitation of in-context learning with large language models: for datasets with long texts or a large number of targets, it is not possible to fit detailed instructions in the prompt.”

This is a problem that the MedGraphRAG technique solved using proposition transfer, but doing so muddies the RE task with yet another LLM task.

An idea

I’ve recently become involved in the ML Commons efforts, where I am particularly interested in datasets. I think collecting, curating, or contributing to datasets that support LLM evaluation for knowledge graph construction would be particularly useful.

This effort at ML Commons could focus on a variety of challenges:

  • Collecting datasets: identification of existing datasets or corpora that can be used for NER and RE tasks in various domains
  • Standardized metadata: helping to standardize the metadata and structure of these datasets to allow more automated use for evaluation
  • Annotation: annotation of datasets with entities and relations to provide a baseline for comparison
  • Conformance levels: enable different levels of conformance to differentiate between the “basics” and more complex RE outcomes.
  • Tools: tooling for dataset curation and LLM evaluation

One area of innovation here would be the ability to label outcomes from an LLM not just in terms of omissions or misinformation but also whether they can identify more subtle relations, inverse relations, etc. That would allow a consumer of these models to understand what they should expect and what they may have to do afterwards to the knowledge graph as a secondary inference.

I will post more on this effort when and if it becomes an official work item. I hope it does.
