Alex Miłowski, geek

Circularity in LLM-curated knowledge graphs

I recently read “Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation” (MedGraphRAG), in which the authors construct a knowledge graph, curated from various textual sources, through a careful orchestration of an LLM. Queries against this knowledge graph are used to build a prompt for a pre-trained LLM, and a clever use of tagging allows the answer to be traced back to the sources.

The paper details several innovations:

  • using layers in the graph to represent different sources of data,
  • allowing the layers to represent data with different update cadences and characteristics (e.g., patient data, medical research texts, or reference material), and
  • careful use of tags to enable the traceability of the results back to the sources.

I highly recommend you read through the paper as the results surpass SOTA and the techniques are mostly sound.

When digging into the “how”, especially given the associated GitHub project, I am continually nagged by thoughts about using an LLM for Named Entity Recognition (NER) and Relation Extraction (RE) tasks. In particular:

  1. How often does such an LLM miss reporting entities or relations entirely (omissions)?
  2. What kinds of errors does such an LLM make and how often (misinformation)?
  3. If we use an LLM to generate the knowledge graph, and it has problems from (1) and (2), how well does an LLM answer questions given information from the knowledge graph (circularity)?

The success the authors demonstrate when using MedGraphRAG to answer various medical diagnosis questions is one measure of (3). As with all inference, incorrect answers will happen. Tracing down the “why” of an incorrect answer relies on understanding whether the problem lies in the input (the prompt generated from the knowledge graph) or in the inference drawn by the LLM. This means we must understand whether something is “wrong” or “missing” in the knowledge graph itself.

To answer this, I went on a tear for the last few weeks, reading whatever I could on NER and RE evaluation, datasets, and random blog posts to update myself on the latest research. There are some good NER datasets available and some for RE as well. I am certain there are many more resources that I haven’t encountered, but I did find this list on GitHub, which led me to the conclusion that I really need to focus on RE.

In going through how the MedGraphRAG knowledge graph is constructed, there are many pre- and post-processing steps that need to be applied to their datasets. Not only do they need to process various medical texts to extract entities and relations, but they also need to chunk or summarize these texts in a way that respects topic boundaries. This helps the text fit into the limits of the prompt. The authors use “proposition transfer”, which serves as a critical step for topic boundaries, and that process also uses an LLM, bringing another circularity and more questions about correctness.

All things considered, the paper demonstrates how a well-constructed knowledge graph can be used to contextualize queries for better answers that are traceable back to the sources supporting those answers. To put such a technique into production, you need to be able to evaluate whether the extracted entities and relations are correct and whether important information is missing, and you need to do this every time you update your knowledge graph. For that, you need some mechanism for evaluating an external LLM’s capabilities and the quality of its relation extraction (RE) output.

Experiments with llama3

Maybe I shouldn’t be surprised, but subtle nuances in the prompt can generate vastly different outcomes for relation extraction tasks. I ran some experiments with llama3.1 (8B parameters) locally just to test various things. At one point during ad hoc testing, one of the responses said something along the lines of “there is more, but I omitted them”, and adding “be as comprehensive and complete as possible” to the prompt fixed that problem.
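
As a rough illustration (and not the exact prompt I used), here is a minimal sketch assuming a local Ollama server running llama3.1 on its default port, with that “comprehensive and complete” instruction appended to the extraction prompt:

# Minimal sketch: ask a local llama3.1 (via Ollama's REST API) for entities and
# relations, with the "comprehensive and complete" nudge appended to the prompt.
# Assumes Ollama is running locally on its default port; requires `requests`.
import requests

PROMPT = """Extract the entities and relations from the following text.
Be as comprehensive and complete as possible.

Text:
{text}
"""

def extract(text: str, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT.format(text=text), "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    print(extract("Aspirin is a nonsteroidal anti-inflammatory drug (NSAID)."))

prompting llama3.1 via Ollama (sketch)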

Everyone who has put something into production knows that a very subtle change can have drastic and unintended outcomes. When constructing a knowledge graph from iterative interactions with an external LLM, we need some way to know that our new prompt that fixes one problem hasn’t created a hundred problems elsewhere. That is usually the point of unit and system testing (and I already hear the groans from the software engineers).
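
A sketch of what that might look like for prompts: for a fixed input text, a revised prompt should still yield a known set of core triples. The extract_relations callable below is a stand-in for whatever wraps your LLM call and parsing, and the example text and expected triples are illustrative only.

# A prompt regression check: a stand-in extract_relations callable is tested
# against a fixed input text and a known set of core (subject, relation, object)
# triples. The text and triples here are illustrative only.
from typing import Callable, Iterable

Triple = tuple[str, str, str]

TEXT = "Metformin is a first-line treatment for type 2 diabetes."
EXPECTED: set[Triple] = {("Metformin", "treats", "type 2 diabetes")}

def check_prompt_regression(extract_relations: Callable[[str], Iterable[Triple]]) -> None:
    extracted = set(extract_relations(TEXT))
    missing = EXPECTED - extracted
    assert not missing, f"prompt change lost expected triples: {missing}"

# e.g., in a pytest suite: check_prompt_regression(my_pipeline.extract_relations)

a prompt regression check (sketch)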

In the case of the MedGraphRAG implementation, they use the CAMEL-AI libraries in Python to extract “entities” and “relations”. That library instructs the LLM to produce a particular syntax that reduces to typed entities and relation triples (i.e., subject, relation, object), which is then parsed by the library. I am certainly curious about when that fails to parse, as escaping text is always a place where errors proliferate.

Meanwhile, in my own experimentation, I simply asked llama to output YAML and was surprised that it produced something close to parsable. A few more instructions were sufficient for the results to pass through a YAML parser:

A node should be formatted in YAML syntax with the following rules:

 * All nodes must be listed under a single 'nodes' property. 
 * All relationships must be listed under a single 'relations' property.
 * The 'nodes' and 'relations' properties may not repeat at the top level.

Note:

There are so many ways I can imagine this breaking. So, we will have to see what happens when I run a lot of text through this kind of prompt.
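
For what it’s worth, the parsing side can at least be defensive about the ways it might break. A minimal sketch, assuming only the rules above (a single top-level ‘nodes’ list and a single top-level ‘relations’ list); the shape of the individual entries is an assumption about what the model returns:

# Defensive parsing of the YAML the model was instructed to produce: a single
# top-level 'nodes' list and a single top-level 'relations' list. Anything that
# doesn't match is reported rather than silently dropped. Requires PyYAML.
import yaml

def parse_graph_yaml(text: str) -> tuple[list, list, list[str]]:
    errors: list[str] = []
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as e:
        return [], [], [f"YAML parse error: {e}"]
    if not isinstance(data, dict):
        return [], [], ["top level is not a mapping"]
    nodes = data.get("nodes", [])
    relations = data.get("relations", [])
    if not isinstance(nodes, list):
        errors.append("'nodes' is not a list")
        nodes = []
    if not isinstance(relations, list):
        errors.append("'relations' is not a list")
        relations = []
    extra = set(data) - {"nodes", "relations"}
    if extra:
        errors.append(f"unexpected top-level properties: {sorted(extra)}")
    return nodes, relations, errors

defensive parsing of the YAML response (sketch)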

I did spend some time experimenting with whether I could prompt llama to produce a property graph. That is, could it separate the properties of an entity from its relations, or could it identify a property of a relation? I wasn’t particularly successful (yet), but that is a topic for a different blog post.

A gem of a paper on relation extraction

In my wanderings looking for more research in this area, I found this paper titled “Revisiting Relation Extraction in the era of Large Language Models” which addresses the question at the heart of knowledge graph construction with an LLM. While NER and entity resolution are critical steps, a knowledge graph wouldn’t be a graph if the LLM does not handle the RE task well. This is another paper that I highly recommend you read.

The authors give a good outline of the elements of an evaluation of RE with an LLM. They compare the results of various models and LLM techniques against human-annotated datasets of relations. They also detail the need for human evaluators to determine “correctness”, given the challenges already present in the thorny problems of RE.
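
The mechanical part of such a comparison is easy to state, even if deciding what counts as a match is not. A minimal sketch that scores predicted (subject, relation, object) triples against a human-annotated gold set using exact matching; real evaluations need entity normalization and, as the authors note, often human judgment:

# Score predicted relation triples against human-annotated gold triples using
# exact matching; precision, recall, and F1 over the triple sets.

def score_relations(predicted: set[tuple[str, str, str]],
                    gold: set[tuple[str, str, str]]) -> dict[str, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scoring relation triples against a gold set (sketch)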

In some contexts, the datasets were not well suited to be run through an LLM for RE tasks. The authors say at one point,

“These results highlight a remaining limitation of in-context learning with large language models: for datasets with long texts or a large number of targets, it is not possible to fit detailed instructions in the prompt.”

This is a problem that the MedGraphRAG technique solved using proposition transfer, but doing so muddies the RE task with yet another LLM task.

An idea

I’ve recently become involved in the ML Commons efforts, where I am particularly interested in datasets. I think collecting, curating, or contributing to datasets that support LLM evaluation for knowledge graph construction would be particularly useful.

This effort at ML Commons could focus on a variety of challenges:

  • Collecting datasets: identification of existing datasets or corpora that can be used for NER and RE tasks in various domains
  • Standardized metadata: helping to standardize the metadata and structure of these datasets to allow more automated use for evaluation
  • Annotation: annotation of datasets with entities and relations to provide a baseline for comparison
  • Conformance levels: enable different levels of conformance to differentiate between the “basics” and more complex RE outcomes.
  • Tools: tooling for dataset curation and LLM evaluation

One area of innovation here would be the ability to label outcomes from an LLM not just in terms of omissions or misinformation but also in terms of whether it can identify more subtle relations, inverse relations, etc. That would allow a consumer of these models to understand what they should expect and what they may have to do afterwards to the knowledge graph as a secondary inference.
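
To make that concrete, the labels might look something like the following sketch; the taxonomy is my own invention, not something ML Commons has defined:

# A hypothetical labeling scheme for RE outcomes: beyond omissions and
# misinformation, labels capture subtler failure (and success) modes.
from dataclasses import dataclass
from enum import Enum, auto

class OutcomeLabel(Enum):
    CORRECT = auto()
    OMITTED_ENTITY = auto()          # entity never reported
    OMITTED_RELATION = auto()        # relation never reported
    HALLUCINATED = auto()            # entity or relation not supported by the text
    WRONG_RELATION_TYPE = auto()     # right entities, wrong relation
    INVERTED_DIRECTION = auto()      # subject and object swapped
    MISSED_INVERSE = auto()          # stated relation found, implied inverse not
    MISSED_SUBTLE_RELATION = auto()  # relation requiring inference across sentences

@dataclass
class LabeledOutcome:
    triple: tuple[str, str, str] | None  # None for pure omissions
    label: OutcomeLabel
    note: str = ""

a hypothetical labeling scheme for RE outcomes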

I will post more on this effort when and if it becomes an official work item. I hope it does.


GQL: Schemas and Types

In GQL, a graph is a set of zero or more nodes and edges where:

A Node has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names.
An Edge has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names,
  • two endpoint nodes in the same graph,
  • an indication of whether the edge is directed or undirected,
  • when the edge is directed, one endpoint node is the source and the other is the destination.

While data can be unstructured in a graph such that nodes and edges are created in an ad hoc manner, graphs are often carefully crafted and structured to represent the relationships and properties of the subject. As such, having a schema that matches the rules used to structure the graph is useful for validation, queries, and optimizations by a database engine.

What is a schema for a graph?

A graph schema should attempt to answer basic questions like:

  • What kinds of nodes are allowed in the graph?
  • What kinds of relationships (i.e., edges) are allowed between nodes?
  • What are the named properties of nodes and edges?
  • What are the allowed values of properties?
  • What properties are optional?

Additionally, it would be good to define more ontology-oriented questions for validity like:

  • the cardinality of edges and nodes,
  • inverse relations (e.g., a parent relation from one node implies a child relation in the reverse direction)
  • cross-node or cross-edge property constraints

Such additional criteria are often described as a constraint language that is separable from a schema language. That is, they can often be viewed as an additional layer and not the domain of the schema language.

Schemas in GQL

In GQL, the graph type (see §4.13.2 “Graph types and graph element types”) is the main construct for defining a schema. A graph type describes the nodes and edges allowed in a graph. It is created with a create graph type statement (see §12.6):

CREATE GRAPH TYPE MyGraph AS {
  // definitions go here
}

graph type creation

Once created, the graph type can be used to type specific instances of graphs in your site (i.e., a portion of your graph database). In this example, the syntax uses a nested graph type specification (see §18.1) that contains element type specifications in a comma-separated list.

Each element type specification is either:

  • a node type specification (see §18.2)
  • an edge type specification (see §18.3)

For each of these, there are two ways to describe the type:

  • a “pattern” that is similar in syntax to node or edge patterns in queries,
  • a “phrase” that is more explicit and verbose.

The simplest and most compact form of a node type specification is as a pattern:

(:label {
  prop1 :: type1,
  prop2 :: type2,
  // etc.
})

node type pattern basics

The label set defines the “keys” by which the type can be referenced and by which matching nodes can be queried. A node type can also define additional non-keyed labels, and the union of these and the keyed labels is the total set of labels for the type. For example, we could model animal nodes as the following:

(:cat => :mammal:animal),
(:dog => :mammal:animal)

animals as nodes

The node types for cat and dog are keyed with unique labels but also have the labels mammal and animal on the same nodes.

Similarly, we can specify edges with patterns:

(:Person)~[:sibling]~(:Person),
(:Person)-[:children]->(:Person)

edge patterns

The first edge pattern is an undirected edge that is keyed with sibling and has two endpoints whose type is keyed with Person. The second edge pattern is a directed edge that is keyed with children and has a source and destination type that is keyed with Person.

A worked example

If we imagine we’re trying to structure a graph to represent data retrieved from resources using the Person type from schema.org, we can see how these declarations fit together as well as the “rough edges” of graph typing.

Here is a complete example where the properties and relations have been limited to make the schema smaller:

CREATE GRAPH TYPE People AS {
   (:Thing {
      name :: STRING NOT NULL,
      url :: STRING
   }),
   (:Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   }),
   (:Person)-[:knows]->(:Person),
   (:Person)-[:children]->(:Person),
   (:Person)-[:parent]->(:Person),
   (:Person)~[:sibling]~(:Person),
   (:Person)~[:spouse { started :: DATE NOT NULL, ended :: DATE }]~(:Person)
}

schema via patterns

This same schema can be described via phrase declarations:

CREATE GRAPH TYPE PeopleViaPhrase AS {
   NODE :Thing {
      name :: STRING NOT NULL,
      url :: STRING
   },
   NODE :Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   } AS Person,
   DIRECTED EDGE knows {} CONNECTING (Person -> Person),
   DIRECTED EDGE children {} CONNECTING (Person -> Person),
   DIRECTED EDGE parent {} CONNECTING (Person -> Person),
   UNDIRECTED EDGE sibling {} CONNECTING (Person ~ Person),
   UNDIRECTED EDGE spouse 
      { started :: DATE NOT NULL, ended :: DATE } 
      CONNECTING (Person ~ Person)
}

schema via phrases

Note:

It is not clear to me why there are two ways to do the same thing. In the phrase version, you’ll see AS Person added to the declaration of the Person node type. This seems necessary as the endpoint pair phrase requires a local alias. Otherwise, the outcome is effectively the same. The question remains: what can you do with one form that you can’t do with the other?

Curious things

I’ve noticed a few things:

  1. You’d like to see type extension: If Person extends Thing, then it inherits the properties of Thing instead of re-declaring the property set. If you think about that a bit, there be dragons. Having type extension in a schema language really requires having constructs and semantics for both extension and restriction. There is a whole section (§4.13.2.7 “Structural consistency of element types”) that explains what could be interpreted as a subtype. Yet, that structural consistency doesn’t appear to apply to schemas.
  2. The way local aliases are specified and used seems a bit inconsistent. You only have a local alias for a type when you declare it, yet there are specific situations where you need a local alias (e.g., the endpoint pair phrase). It also appears useful when you have a key label set with more than one label, in that you can use the local alias to avoid having to repeat the multiple labels for each edge type.
  3. Edges to anything: It isn’t clear how you can describe an edge which has an endpoint (e.g., a destination) that is any node.
  4. Node type unions: It isn’t clear to me how you specify an edge whose destination is a union of node types. While you may be able to work around this with a common label, the set of destination node types may be disjoint.
  5. Lack of graph constraints: There are graph type level constraints such as limiting the cardinality of edges that would be very useful (e.g., node type X can only have one edge of type Y).
  6. Optional keywords - why?: CREATE PROPERTY GRAPH and CREATE GRAPH, NODE TYPE and NODE, and EDGE TYPE and EDGE are all substitutable syntax without any change in semantics. I feel like we should have dropped PROPERTY and TYPE from the grammar.
  7. Synonyms - why?: NODE vs VERTEX and EDGE vs RELATIONSHIP - as synonyms, they mean the same thing, and so why didn’t they just pick one?

Concluding remarks

As with many schema languages, there are things left yet to do. As a first version, a graph type provides some basic mechanisms for describing a graph as a set of “leaf types”. The modeling language you want to use to describe your graph may simply be out of scope for GQL. Meanwhile, you can simply use another tool (pick your favorite UML or ERM tool) to generate GQL graph types useful in your database.
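
To illustrate that workflow, here is a small sketch (mine, not a real exporter) that emits the node type pattern syntax shown earlier from a simple in-memory description of a model:

# Emit GQL node type patterns (the pattern syntax shown earlier) from a simple
# dictionary description of a model. A real UML/ERM exporter would do much more;
# this only shows that generating the GQL text itself is mechanical.

def node_type_pattern(key_label: str, extra_labels: list[str],
                      properties: dict[str, str]) -> str:
    labels = f":{key_label}"
    if extra_labels:
        labels += " => :" + ":".join(extra_labels)
    props = ",\n".join(f"      {name} :: {dtype}" for name, dtype in properties.items())
    return f"   ({labels} {{\n{props}\n   }})"

model = {
    "Thing": ([], {"name": "STRING NOT NULL", "url": "STRING"}),
    "Person": (["Thing"], {"name": "STRING NOT NULL", "familyName": "STRING NOT NULL"}),
}

body = ",\n".join(node_type_pattern(key, extra, props)
                  for key, (extra, props) in model.items())
print(f"CREATE GRAPH TYPE GeneratedPeople AS {{\n{body}\n}}")

generating a graph type from a model description (sketch)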

Presently, the type system in GQL provides a minimum bar for validation. More importantly, these types give the database system a contract for the kinds of nodes, edges, and properties that are to be created, updated, and queried. This affords the database engine the ability to optimize how information is stored, indexed, and accessed.

That should probably be the primary takeaway here:

The graph schema is for your database and not for you. It doesn’t describe everything you need to understand about your graph’s structure.


GQL: Let's variable

When trying to understand a new language, I always like to start with how variables are defined and examine how scoping works with respect to variable usage. This helps with understanding the basics of how you can store values, use them in expressions, and where their scope of use ends. While this may seem basic, understanding how a language implements scoping rules isn’t always simple.

Execution context

At the center of evaluation in GQL is the “execution context” (§ 4.8) that has:

  • a working record, which is a record,
  • a working table, which is a binding table,
  • an execution outcome

A statement in GQL may modify this context by setting the working table, working record, or the execution outcome. During the execution of a construct, new “child contexts” may be established, based on the current context, and used for nested evaluation. In the end, the “current working table” can be used to formulate the query result, which is represented in the execution outcome.

For more on that, see the “General Rules” in §14.10 <primitive result statement>, which give the conditions under which:

The current execution outcome is set to a successful outcome with the current working table as its result.

The crucial point here is that the current “in scope” variable names are drawn, in various ways, from the fields of the working record and working table of the current context.

Calling procedures

Consider the following GQL program, which is an inline procedure call (see § 15 “Procedure calling”). An inline procedure call is like a closure where the arguments to the procedure are a list of variables that must be available in the incoming working record. When that list is omitted completely, all the fields of the incoming working record are available. Note that when an empty list of variables is specified (i.e., ()), the effective incoming variables are an empty sequence.

CALL (x, y) {
   VALUE z = x + y
   RETURN z
}

An inline procedure call

For a procedure call statement to evaluate, we need a complete context where x and y are defined so that we can expect a successful outcome:

VALUE x = 40
VALUE y = 2
VALUE k = 12
CALL (x, y) {
   VALUE z = x + y
   RETURN z
}

An inline procedure call in context

Each statement in this program can change the context, using the incoming working record and working table to produce an outgoing working record and table. For this query, the context is transformed as follows:

Level | Statement                    | Incoming                             | Outgoing
1     | VALUE x = 40                 | record: unit; table: unit            | record: x=40; table: unit
1     | VALUE y = 2                  | record: x=40; table: unit            | record: x=40, y=2; table: unit
1     | VALUE k = 12                 | record: x=40, y=2; table: unit       | record: x=40, y=2, k=12; table: unit
1     | CALL (x,y) { ... }           | record: x=40, y=2, k=12; table: unit | record: x=40, y=2, k=12; table: 1: z=42
1.1   | (x,y) { ... }                | record: x=40, y=2; table: unit       | record: x=40, y=2; table: 1: z=42
1.1.1 | { VALUE z = x + y RETURN z } | record: x=40, y=2; table: unit       | record: x=40, y=2; table: 1: z=42
1.1.1 | VALUE z = x + y              | record: x=40, y=2; table: unit       | record: x=40, y=2, z=42; table: unit
1.1.1 | RETURN z                     | record: x=40, y=2, z=42; table: unit | record: x=40, y=2, z=42; table: 1: z=42

In the above table, I have attempted to outline how the execution context changes when executing the GQL program with values for x and y. The columns have the following definitions:

  • The “level” column shows the execution context as it is changed for “child” constructs.
  • The “statement” column is the portion of the query being evaluated.
  • The “incoming” column is the record and table of the context coming into the statement.
  • The “outgoing” column is the record and table of the context resulting from the statement’s evaluation.

The evaluation can be explained as follows:

  • A value variable definition (§ 10.3) (in rows 1–3 and 7) adds a single variable to the incoming working record of the current context. It is an error if the name is already in the working record.
  • The call procedure statement (§ 15.1) (in row 4) invokes a procedure call; the outgoing working table is the incoming working table amended with the outgoing working table of the called procedure.
  • The inline procedure call (§ 15.2) (in row 5) changes the incoming record to contain only the variables listed in the binding variable list (e.g., (x, y)).
  • The nested procedure specification (§ 9.1) / procedure body (§ 9.2) (in row 6) creates a new execution context whose working record contains the results of the value variable definitions; otherwise, the working record remains unchanged. In this case, the definition of z causes a new child execution context. Also, the outgoing declared type of the procedure body is the declared type of the last statement in the procedure.
  • The value variable definition (§ 10.3) (in row 7) only declares z in the context. In this case, it is a new child context, so z won’t be available outside the nested procedure specification.
  • The return statement (§ 14.11) (in row 8) changes the working table of the current execution context. This establishes the declared type of the procedure body, which will be used to merge the results into the calling context.
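
To check my own reading of those rules, here is a small sketch in Python of the trace above. It is only a mental model of this particular example (value definitions plus a call over a unit table), not an implementation of the spec’s machinery:

# A toy model of the trace above: a context is a working record (dict) plus a
# working table (list of rows). Only intended to mirror this specific example.
from dataclasses import dataclass, field

@dataclass
class Context:
    record: dict = field(default_factory=dict)          # working record
    table: list = field(default_factory=lambda: [{}])   # unit table: one empty row

def value_def(ctx: Context, name: str, value) -> None:
    # VALUE name = expr: adds one field to the working record (error if present)
    if name in ctx.record:
        raise ValueError(f"{name} is already defined")
    ctx.record[name] = value

def call(ctx: Context, scope: list[str], body) -> None:
    # CALL (scope) { body }: the child context sees only the listed variables;
    # the caller's working table is amended with the child's outgoing table
    # (here the incoming table is the unit table, so that is just the child's table).
    child = Context(record={name: ctx.record[name] for name in scope})
    body(child)
    ctx.table = child.table

ctx = Context()
value_def(ctx, "x", 40)
value_def(ctx, "y", 2)
value_def(ctx, "k", 12)

def body(child: Context) -> None:
    value_def(child, "z", child.record["x"] + child.record["y"])  # VALUE z = x + y
    child.table = [{"z": child.record["z"]}]                      # RETURN z

call(ctx, ["x", "y"], body)
print(ctx.record)  # {'x': 40, 'y': 2, 'k': 12}
print(ctx.table)   # [{'z': 42}]

a toy model of the context transformations (sketch)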

Note:

While I believe this is correct, I may have missed some nuance. I am working through how binding tables are transformed and how values interact with them.

The result of this query is:

  • The working record: x=40, y=2, k=12
  • The working table: 1: z=42
  • The outcome: successful, unit

Note:

The overall query didn’t return anything and so the result is successful but empty.

Let statements

If a value variable definition adds a field to the working record, then what does a let statement (§ 14.7) do? The short answer is that a let statement changes the working table. Consider the example “Query with let statement” below, where we bind the variable friendsCount:

MATCH (n {name: 'John'})-[:FRIEND]-(friend)
  LET friendsCount = count(friend)
  RETURN n, friendsCount

Query with let statement

The first thing to note is that every instance of variable = expression is a shorthand for VALUE variable = expression. You can see this in the grammar below:

---
title: simple query statement
---
stateDiagram-v2
  direction LR
  
  state LetVariableDefinition {
     direction LR

     state fork <<fork>>
     state join <<join>>

     RegularIdentifier1 : regular identifier
     EQUALS1 : =
     ValueExpression1 : value expression
     fork --> RegularIdentifier1
     RegularIdentifier1 --> EQUALS1 
     EQUALS1 --> ValueExpression1
     ValueExpression1 --> join

     RegularIdentifier2 : regular identifier
     typed : #colon;#colon; | TYPED
     EQUALS2 : =
     ValueExpression2 : value expression
     ValueType : value type
     fork --> VALUE
     VALUE --> RegularIdentifier2 
     state value_type_fork <<fork>>
     state value_type_join <<join>>
     RegularIdentifier2 --> value_type_fork
     RegularIdentifier2 --> value_type_join
     value_type_fork --> typed
     typed --> ValueType
     ValueType --> value_type_join
     value_type_fork --> ValueType
     value_type_join --> EQUALS2
     EQUALS2 --> ValueExpression2
     ValueExpression2 --> join
  }

  LET --> LetVariableDefinition
  LetVariableDefinition --> LetVariableDefinition
 

The semantics of a let statement first transforms these shorthands into value variable definitions:

LET friendsCount = count(friend) → LET VALUE friendsCount = count(friend)

The second transformation is from the let statement into a call procedure statement with an inline procedure call where the variable scope clause (§ 15.2) is a list of all the binding variable references used in the right-hand side of the let variable definitions:

LET VALUE friendsCount = count(friend) → CALL (friend) {
                                            VALUE friendsCount = count(friend)
                                            RETURN friendsCount
                                         }

The body of the procedure is a value variable definition for each let variable definition. The procedure ends with a return statement that enumerates the left-hand side of the let variable definitions.

The resulting query is shown in “query executed” below, where the let statement has been replaced with a call procedure statement.

MATCH (n {name: 'John'})-[:FRIEND]-(friend)
  CALL (friend) {
    VALUE friendsCount = count(friend)
    RETURN friendsCount
  }
  RETURN n, friendsCount

query executed

When tracing through the execution contexts, you’ll see how friendsCount is added to the binding table via the call procedure statement that replaced the let statement. At the end, the values in the context’s working table are used for the result statement.

Note:

I see a problem here with friend that I can’t quite sort out. The friend variable reference is to a node from the match statement, so that’s in the binding table. The binding variable references are allowed to be the union of the field names of the working record and working table. Meanwhile, § 15.1 states that the incoming working record’s declared type is the incoming working record amended by the declared type of the incoming working table, which is some sort of record union. But it isn’t clear to me yet how the call procedure statement scopes references from the binding table, nor exactly how that amending works with the current working record used for variable scoping.

Can variables refer to each other?

The short answer is “yes” but only in declaration order. It is easy to see that this should work:

VALUE x = 1
VALUE y = 2
VALUE z = x + y

And this should not:

VALUE z = x + y
VALUE x = 1
VALUE y = 2

For a let statement, there is a similar ordering requirement:

LET x = 1, y = 2, z = x + y

which becomes:

CALL {
   VALUE x = 1
   VALUE y = 2
   VALUE z = x + y
   RETURN x, y, z
}

The only difference between let and value bindings is whether the field names end up in the working record or the working table.

Concluding thoughts

There was a lot of background material to get to the conclusion of “expected semantics” for variable binding and use. The distinction between the “working record” and “working table” is essential to understand but has some hidden complexity. The challenge there is understanding how simple value bindings are handled versus variables that refer to patterns of results (e.g., a set of matching nodes) where their enumeration could be large; a topic for another post.

This is my first foray into how various expressions affect the context. There is more complexity that I need to sort for myself in terms of how values are added to binding tables when you have fields whose value space is a set of nodes from a match pattern. That is, I need to have a deeper understanding of exactly how working tables are amended via joins or other operators.

Feedback and questions are always welcome!
