Difference between revisions of "Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation"

From LIMSWiki


==Methods==
===Problem statement===
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====


In the pursuit of FAIRness in empirical data and metadata in a knowledge graph, it is important not only that the terms employed in data and metadata statements possess identifiers from controlled vocabularies, such as ontologies, which ensures terminological interoperability, but also that the semantic graph patterns underlying each statement are themselves identifiable. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).</blockquote>
|-
|}
|}
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="800px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.</blockquote>
|-
|}
|}
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.
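As a minimal sketch of this idea, assume each RDF graph carries a metadata record naming the graph pattern it follows. Graphs can then be grouped by that identifier to find candidates for direct interoperability. The pattern identifiers and graph names below are hypothetical placeholders, not identifiers from any existing registry.

```python
from collections import defaultdict

# Hypothetical metadata records: graph name -> graph-pattern identifier.
graph_metadata = {
    "graph_A": "PATTERN:weight-measurement-obi",
    "graph_B": "PATTERN:weight-measurement-oboe",
    "graph_C": "PATTERN:weight-measurement-obi",
}

def group_by_pattern(metadata):
    """Group graph names by the graph-pattern identifier in their metadata."""
    groups = defaultdict(list)
    for graph_name, pattern_id in metadata.items():
        groups[pattern_id].append(graph_name)
    return {pattern_id: sorted(names) for pattern_id, names in groups.items()}

# Graphs listed under the same key share a graph pattern.
interoperable = group_by_pattern(graph_metadata)
```

Graphs that end up under different keys, such as the OBI-based and OBOE-based variants here, would need a mapping between patterns before they can be combined.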
====Challenge 2: Overcoming barriers in graph query language adoption====
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs are RDF/OWL and labeled property graphs, the latter as implemented in systems like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, necessitates the utilization of a query language. SPARQL [37] is one such language for RDF/OWL, while Cypher [38] is employed for Neo4j.
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.
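One possible reading of such reusable query patterns is a registry that maps a graph-pattern identifier to a parameterized query, so that users supply only the values of interest. The sketch below assumes this reading. The pattern identifier, the <code>ex:</code> properties, and the function name are hypothetical and do not come from the prototype.

```python
from string import Template

# Hypothetical registry: graph-pattern identifier -> SPARQL query template.
# The ex: properties are placeholders, not terms from a real ontology.
QUERY_PATTERNS = {
    "PATTERN:weight-measurement": Template(
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?value ?unit WHERE {\n"
        "  <$subject> ex:hasWeight ?w .\n"
        "  ?w ex:hasValue ?value ;\n"
        "     ex:hasUnit ?unit .\n"
        "}"
    ),
}

def build_query(pattern_id, **bindings):
    """Fill the query template registered for a graph pattern."""
    return QUERY_PATTERNS[pattern_id].substitute(**bindings)

query = build_query("PATTERN:weight-measurement",
                    subject="http://example.org/appleX")
```

The user never writes SPARQL here. Choosing the graph pattern selects the matching query, which is one way of tying representation and querying together.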
====Challenge 3: Addressing complexities in making statements about statements====
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of <u>mass measurement assay</u> (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., "apple X"), along with specifying the resultant data encapsulated in a particular weight measurement assertion.</blockquote>
|-
|}
|}
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase complexity of the represented graph.
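The reification step can be sketched as follows, with triples modelled as plain Python tuples. The <code>rdf:subject</code>, <code>rdf:predicate</code>, and <code>rdf:object</code> properties belong to the RDF vocabulary, while all <code>ex:</code> identifiers are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(triple, statement_id):
    """Describe one triple via a statement resource, as in RDF reification."""
    s, p, o = triple
    # In practice an rdf:type rdf:Statement triple is usually added as well.
    return [
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    ]

# Placeholder identifiers under ex: are hypothetical.
triple = ("ex:rightHand", "ex:hasPart", "ex:rightThumb")
reified = reify(triple, "ex:statement1")

# A statement about the statement can now point at ex:statement1.
annotation = ("ex:statement1", "ex:assertedBy", "ex:someAuthor")
```

Each reified triple costs three extra triples, so a statement composed of several triples grows quickly. This is the added complexity noted above.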
In cases like this, the adoption of Named Graphs is an alternative to the RDF reification and RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.
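Both mechanisms can be illustrated with plain Python data structures. This is a sketch under simplifying assumptions: triples and quads are tuples, property-graph elements are dictionaries, and every identifier and the <code>subgraph_id</code> key are hypothetical.

```python
def to_quads(triples, graph_uri):
    """Turn triples into quads by appending the Named Graph URI."""
    return [(s, p, o, graph_uri) for (s, p, o) in triples]

def tag_subgraph(elements, subgraph_id, key="subgraph_id"):
    """Property-graph analogue: add the same property-value pair to
    every node and relation that belongs to the subgraph."""
    return [{**element, key: subgraph_id} for element in elements]

# Hypothetical identifiers for illustration only.
measurement = [
    ("ex:appleX", "ex:hasWeight", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
]
quads = to_quads(measurement, "ex:measurementGraph")

# One triple about the whole subgraph, using the graph URI as subject.
about_graph = ("ex:measurementGraph", "ex:documentedBy", "ex:assayGraph")

nodes = tag_subgraph([{"label": "apple X"}, {"label": "weight X"}],
                     "ex:measurementGraph")
```

In contrast to reification, the number of additional statements does not grow with the size of the subgraph: a single triple about the graph URI covers all of its quads.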
==Results==
===Semantic unit===
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces "semantic units." Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *<u>SEMUNIT:weight statement unit</u>* as it is illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘<u>weight</u>’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘<u>scalar measurement datum</u>’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, lacks semantic meaningfulness for a domain expert who has no background in semantics.
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="300px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 5.''' Classification of different categories of semantic units.</blockquote>
|-
|}
|}
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with "apple X" as the subject and "gram X" alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *<u>SEMUNIT:weight statement unit</u>* class, with "apple X" identified as the subject. Notably, the UPRI of *’<u>weight statement unit</u>’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).</blockquote>
|-
|}
|}
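The relationship between the two layers can be sketched with a small data class. This is not the prototype's implementation. The data-graph properties and all <code>ex:</code> identifiers are hypothetical, while <code>SEMUNIT:hasSemanticUnitSubject</code> and the weight statement unit class are the terms used in this paper.

```python
from dataclasses import dataclass, field

@dataclass
class StatementUnit:
    upri: str        # identifies both the unit resource and its data graph
    unit_class: str  # semantic unit class that the resource instantiates
    subject: str
    data_graph: list = field(default_factory=list)

    def semantic_units_graph(self):
        """Triples of the semantic-units graph layer for this unit."""
        return [
            (self.upri, "rdf:type", self.unit_class),
            (self.upri, "SEMUNIT:hasSemanticUnitSubject", self.subject),
        ]

# ex: identifiers and the data-graph properties are hypothetical placeholders.
unit = StatementUnit(
    upri="ex:weightStatementUnit1",
    unit_class="SEMUNIT:weight statement unit",
    subject="ex:appleX",
    data_graph=[
        ("ex:appleX", "ex:hasWeight", "ex:weightX"),
        ("ex:weightX", "ex:hasValue", "204.56"),
        ("ex:weightX", "ex:hasUnit", "ex:gramX"),
    ],
)

# A statement about the statement: the UPRI stands for the data graph.
provenance = (unit.upri, "ex:createdBy", "ex:someCurator")
```

Because the UPRI denotes the unit resource and its data graph at once, the last triple says something about all three data-graph triples without reifying any of them.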
====Statement unit: A proposition in the knowledge graph====
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for "apple X" illustrated in Fig. 6 represents a statement unit.
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.
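The partition property can be stated as a simple check: no triple is assigned twice, and the statement units together cover the whole data graph layer. The following sketch assumes triples are hashable tuples. The example triples and unit identifiers are hypothetical.

```python
def is_partition(data_graph, unit_graphs):
    """Check that every triple of the data graph layer belongs to
    exactly one statement unit and that the units cover the layer."""
    seen = set()
    for triples in unit_graphs.values():
        for triple in triples:
            if triple in seen:  # triple was already assigned
                return False
            seen.add(triple)
    return seen == set(data_graph)

# Hypothetical triples and unit identifiers.
t1 = ("ex:appleX", "ex:hasWeight", "ex:weightX")
t2 = ("ex:weightX", "ex:hasValue", "204.56")
t3 = ("ex:rightHand", "ex:hasPart", "ex:rightThumb")
data_graph = [t1, t2, t3]

ok = is_partition(data_graph, {"ex:unit1": [t1, t2], "ex:unit2": [t3]})
overlap = is_partition(data_graph, {"ex:unit1": [t1, t2], "ex:unit2": [t2, t3]})
```

Here <code>ok</code> is true, while <code>overlap</code> is false because one triple is claimed by two units.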
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *<u>SEMUNIT:''hasSemanticUnitSubject''</u>*, which is documented in the semantic-units graph of the statement unit.
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like "This right hand has as a part this right thumb"—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like "This right hand has part this right thumb on January 29th 2022," "Anna gives Bob a book," and "Carla travels by train from Paris to Berlin on the 29th of June 2022," each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘<u>weight X</u>’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, "''b'' located_in ''c'' at ''t''" mandates at least two triples for accurate representation in RDF.
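The difference between binary and ''n''-ary propositions can be made concrete with tuples standing in for triples. The sketch uses a common modelling approach in which an instance node connects all participants. The role properties and <code>ex:</code> identifiers are hypothetical and are not taken from any of the ontologies cited here.

```python
# Binary relation: one triple suffices, the property is the predicate.
has_part = [("ex:rightHand", "ex:hasPart", "ex:rightThumb")]

def nary_statement(relation_node, relation_class, roles):
    """Model an n-ary proposition through an instance node that links
    all participants, one triple per role."""
    triples = [(relation_node, "rdf:type", relation_class)]
    triples += [(relation_node, role, filler) for role, filler in roles.items()]
    return triples

# "Anna gives Bob a book" with hypothetical role properties.
giving = nary_statement(
    "ex:giving1", "ex:Giving",
    {"ex:agent": "ex:Anna", "ex:recipient": "ex:Bob", "ex:theme": "ex:book1"},
)
```

In this sketch the ternary proposition needs four triples, all of which would belong to the same statement unit.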
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.
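To give a rough impression of what such a pattern specification does, the sketch below checks a candidate subgraph against a set of required properties. It is a deliberately simplified stand-in written in plain Python, not SHACL or LinkML, and the pattern and property names are hypothetical.

```python
def matches_pattern(triples, subject, required_properties):
    """Return True if the subgraph mentions the subject, uses only the
    required properties, and uses each of them at least once."""
    used = {p for (_, p, _) in triples}
    has_subject = any(s == subject for (s, _, _) in triples)
    return has_subject and used == set(required_properties)

# Hypothetical pattern for a weight statement unit.
WEIGHT_PATTERN = {"ex:hasWeight", "ex:hasValue", "ex:hasUnit"}

candidate = [
    ("ex:appleX", "ex:hasWeight", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
    ("ex:weightX", "ex:hasUnit", "ex:gramX"),
]
valid = matches_pattern(candidate, "ex:appleX", WEIGHT_PATTERN)
incomplete = matches_pattern(candidate[:2], "ex:appleX", WEIGHT_PATTERN)
```

A real shape language additionally constrains cardinalities, datatypes, and the classes of the connected nodes, which this toy check ignores.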
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource as a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analogous to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource as a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see "typed statement unit" and "dynamic label" in Fig. 9, later).
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="500px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 7.''' Examples for two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘<u>apple X</u>’ (NCIT:C71985) instance. The subject, "apple X," is connected to its class through the property ''<u>type</u>'' (RDF:type), while its label "apple X" is conveyed via the property ''<u>label</u>'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘<u>apple</u>’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.</blockquote>
|-
|}
|}
====Compound unit: A collection of propositions====
Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *<u>SEMUNIT:hasAssociatedSemanticUnit</u>* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 8.''' Example of a compound unit, denoted as *‘<u>apple X item unit</u>’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.</blockquote>
|-
|}
|}
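How a compound unit relates to its associated units can be sketched as follows. The property <code>SEMUNIT:hasAssociatedSemanticUnit</code> is the one named above. The unit identifiers, the compound unit class, and the data-graph triples are hypothetical placeholders.

```python
def compound_data_graph(associated_units):
    """Merge the data graphs of the associated statement units."""
    merged = []
    for triples in associated_units.values():
        for triple in triples:
            if triple not in merged:
                merged.append(triple)
    return merged

def compound_semantic_units_graph(compound_upri, compound_class, associated_units):
    """Triples linking the compound unit to its associated semantic units."""
    triples = [(compound_upri, "rdf:type", compound_class)]
    triples += [
        (compound_upri, "SEMUNIT:hasAssociatedSemanticUnit", unit_upri)
        for unit_upri in associated_units
    ]
    return triples

# Hypothetical unit identifiers and data-graph triples.
units = {
    "ex:weightUnit1": [("ex:appleX", "ex:hasWeight", "ex:weightX")],
    "ex:identUnit1": [("ex:appleX", "rdfs:label", "apple X")],
}
merged = compound_data_graph(units)
links = compound_semantic_units_graph("ex:appleXItemUnit", "ex:itemUnit", units)
```

The compound unit stores only the linking triples in its semantic-units graph. Its data graph is derived on demand from the associated units rather than stored a second time.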
===Typed statement unit===





Revision as of 19:07, 16 June 2024

Full article title: Semantic units: Organizing knowledge graphs into semantically meaningful units of representation
Journal: Journal of Biomedical Semantics
Author(s): Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert
Author affiliation(s): TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology
Primary contact: Email: lars dot m dot vogt at googlemail dot com
Year published: 2024
Volume and issue: 15
Article #: 7
DOI: 10.1186/s13326-024-00310-5
ISSN: 2041-1480
Distribution license: Creative Commons Attribution 4.0 International
Website: https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5
Download: https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf (PDF)

Abstract

Background: In today’s landscape of data management, knowledge graphs and ontologies are becoming increasingly important as mechanisms aligned with the FAIR Guiding Principles, which ask that research data and metadata be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.

Results: We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and semantically meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.

Conclusions: Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive information, and that it also supports the differentiation of multiple frames of reference within the graph.

Keywords: FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity

Background

In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and management in science and industry. The growing volume of produced data requires systems capable of collecting, integrating, and analyzing extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of "data visiting" rather than conventional "data sharing." [6]

From the standpoint of data representation and management, meeting these demands relies on adherence to the FAIR Guiding Principles, which ask for research data and metadata to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of information (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).

To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable[a] data and metadata, complemented by human-readable interfaces and search capabilities.

Knowledge graphs can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of semantic search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]

The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional relational or other NoSQL models. These include

  • an intuitive way for modelling relations;
  • the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;
  • incorporation of machine-actionable knowledge representation formalisms like ontologies and rules;
  • deployment of graph analytics and machine learning (ML); and
  • utilization of specialized graph query languages that support not only standard relational operators such as joins, unions, and projections but also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22]

Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]

Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.

The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the Subject position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.

The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of phenotypes into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own "Named Graph" and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the Resource Description Framework (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units (statement units and compound units) are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables their efficient referencing within triples, facilitating an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.



Figure 1. Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.

In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.

Conventions used in this paper

In this paper, the term "knowledge graph" denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an n-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.

Throughout this text, regular underlining is employed for indicating ontology classes, while italicized and underlined text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., isAbout [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *SEMUNIT:metric measurement statement unit*. These will be part of a future Semantic Unit ontology. We use regular underlined text in single quotes to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.

The term "resource" is employed to signify something uniquely designated, for example by a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something you want to talk about. In RDF, the Subject and the Predicate in a triple are always resources, whereas the Object can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the Predicate position in a triple, instances referring to individuals (= particulars), and classes representing universals or kinds.

To maintain clarity, resources are represented with human-readable labels in both the text and all figures, under the implicit assumption that each property, instance, and class possesses its own UPRI. Additionally, the term "triple" refers specifically to a triple statement, while "statement" pertains to a natural language statement, establishing a clear distinction between the two.

Methods

Problem statement

Challenge 1: Ensuring schematic interoperability for FAIR empirical data

In the pursuit of FAIRness in empirical data and metadata in a knowledge graph, it is important not only that the terms employed in data and metadata statements possess identifiers from controlled vocabularies, such as ontologies, ensuring terminological interoperability, but also that the semantic graph patterns underlying each statement are specified and identifiable. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.

Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).


Fig2 Vogt JofBiomedSem24 15.png

Figure 2. Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).

Fig3 Vogt JofBiomedSem24 15.png

Figure 3. Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.

Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.

Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.
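The idea of referencing a graph-pattern identifier in each graph's metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph names and the pattern identifiers (`pattern:obi-weight`, `pattern:oboe-weight`) are hypothetical and do not come from the paper or from any ontology.

```python
from collections import defaultdict

# Hypothetical metadata: each graph records the identifier of the graph
# pattern it was modelled with (all identifiers are illustrative).
graph_metadata = {
    "graph:apple-x-weight": "pattern:obi-weight",
    "graph:apple-y-weight": "pattern:obi-weight",
    "graph:pear-z-weight": "pattern:oboe-weight",
}

def group_by_pattern(metadata):
    """Group graph identifiers by the graph-pattern identifier they reference."""
    groups = defaultdict(list)
    for graph_id, pattern_id in metadata.items():
        groups[pattern_id].append(graph_id)
    return dict(groups)

# Graphs listed under the same key are candidates for direct interoperability.
interoperable = group_by_pattern(graph_metadata)
```

Graphs that share a key can be compared or merged without a schema crosswalk, whereas graphs under different keys would first need a mapping between their patterns.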

Challenge 2: Overcoming barriers in graph query language adoption

Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs are RDF/OWL and labeled property graphs, such as those managed in Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, necessitates the use of a query language. SPARQL [37] is such a language for RDF/OWL, while Cypher [38] is employed for Neo4j.

Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.

This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.
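One way to picture a reusable query pattern linked to a graph pattern is a registry that maps a graph-pattern identifier to a parameterized SPARQL template. The sketch below is hypothetical throughout: the pattern identifier, the `ex:` namespace, and the property names are assumptions made for illustration, and the paper does not prescribe this mechanism.

```python
from string import Template

# Hypothetical registry: graph-pattern identifier -> reusable SPARQL template.
# The ex: namespace and property names are illustrative assumptions.
QUERY_PATTERNS = {
    "pattern:weight": Template(
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?value ?unit WHERE {\n"
        "  <$subject> ex:hasWeightValue ?value ;\n"
        "             ex:hasWeightUnit ?unit .\n"
        "}"
    ),
}

def build_query(pattern_id, **params):
    """Fill in the query template registered for the given graph pattern."""
    return QUERY_PATTERNS[pattern_id].substitute(**params)

query = build_query("pattern:weight", subject="http://example.org/appleX")
```

A user then only supplies the subject of interest; the query text itself is maintained once per graph pattern.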

Challenge 3: Addressing complexities in making statements about statements

The RDF triple syntax of Subject, Predicate, and Object allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).


Fig4 Vogt JofBiomedSem24 15.png

Figure 4. A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of mass measurement assay (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., "apple X"), along with specifying the resultant data encapsulated in a particular weight measurement assertion.

In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its Subject, Predicate, and Object. Alternatively, the RDF-star approach can be employed [40, 41]. Both methods increase the complexity of the represented graph.
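The growth in triples caused by reification can be made concrete with a small sketch. It uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`); the `ex:` identifiers and the `ex:assertedBy` property are hypothetical. Besides the three describing triples, the sketch also emits the customary `rdf:type rdf:Statement` triple.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(statement_id, triple):
    """Describe one triple through a statement resource (RDF reification)."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]

# Hypothetical example triple and a statement made about it.
triple = ("ex:rightHand", "ex:hasPart", "ex:rightThumb")
reified = reify("ex:statement1", triple)
annotation = ("ex:statement1", "ex:assertedBy", "ex:anna")
```

A single annotated triple thus costs four describing triples plus the annotation itself, and a statement composed of several triples multiplies that overhead.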

In cases like this, the adoption of Named Graphs is an alternative to the RDF reification and RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.
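Both mechanisms can be sketched with plain Python data structures. The identifiers and the property key `subgraphId` are hypothetical; a real triple store or property graph database would handle this through its own API.

```python
def to_quads(triples, graph_uri):
    """Turn triples into quads by appending the Named Graph URI."""
    return [(s, p, o, graph_uri) for (s, p, o) in triples]

def tag_subgraph(elements, subgraph_id, key="subgraphId"):
    """Return copies of node/relation property maps tagged with a subgraph ID.

    Mimics the labeled-property-graph approach: the same property-value pair
    is added to every node and relation belonging to the subgraph.
    """
    return [{**props, key: subgraph_id} for props in elements]

triples = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", 204.56),
]
quads = to_quads(triples, "ex:graph1")

nodes = [{"label": "apple X"}, {"label": "weight X"}]
tagged = tag_subgraph(nodes, "ex:graph1")
```

In both cases, a statement about the subgraph only needs to reference `ex:graph1`, without describing every member triple individually.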

Results

Semantic unit

We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces "semantic units." Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.

Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *SEMUNIT:weight statement unit*, as illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘weight’ (PATO:0000128) isQualityMeasuredAs (IAO:0000417) ‘scalar measurement datum’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, is not meaningful to a domain expert who has no background in semantics.

Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.


Fig5 Vogt JofBiomedSem24 15.png

Figure 5. Classification of different categories of semantic units.

The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the Subject or Object position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.


Fig6 Vogt JofBiomedSem24 15.png

Figure 6. Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with "apple X" as the subject and "gram X" alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *SEMUNIT:weight statement unit* class, with "apple X" identified as the subject. Notably, the UPRI of *’weight statement unit’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).
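The two layers of a statement unit can be sketched as a small data structure. This is a simplified illustration, not the authors' implementation: the `ex:` identifiers and data graph properties are hypothetical stand-ins for the ontology terms in Fig. 6, and using the UPRI as the graph name of the data graph is one possible way to realize the shared identifier. Only `SEMUNIT:hasSemanticUnitSubject` and the class name are taken from the text.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticUnit:
    upri: str          # identifies both the unit resource and its data graph
    unit_class: str    # semantic unit class instantiated by the unit
    subject: str       # subject resource of the statement
    data_triples: list = field(default_factory=list)

    def data_graph(self):
        """Data graph layer: the unit's triples, named by the unit's UPRI."""
        return [(s, p, o, self.upri) for (s, p, o) in self.data_triples]

    def semantic_units_graph(self):
        """Semantic-units graph layer: triples about the unit resource."""
        return [
            (self.upri, "rdf:type", self.unit_class),
            (self.upri, "SEMUNIT:hasSemanticUnitSubject", self.subject),
        ]

unit = SemanticUnit(
    upri="ex:weightStatementUnit1",
    unit_class="SEMUNIT:weight statement unit",
    subject="ex:appleX",
    data_triples=[
        ("ex:appleX", "ex:hasQuality", "ex:weightX"),
        ("ex:weightX", "ex:isQualityMeasuredAs", "ex:datumX"),
        ("ex:datumX", "ex:hasValue", 204.56),
        ("ex:datumX", "ex:hasUnit", "ex:gram"),
    ],
)
```

Because the unit resource and its data graph share `ex:weightStatementUnit1`, any triple with that identifier as Subject or Object is a statement about the whole weight measurement.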

Statement unit: A proposition in the knowledge graph

A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for "apple X" illustrated in Fig. 6 represents a statement unit.

Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.
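The partition property stated above (every data graph triple belongs to exactly one statement unit, and the units together cover the whole data graph) can be checked mechanically. The following sketch assumes triples are hashable tuples and that statement units are given as a mapping from a unit identifier to its triples; all identifiers are hypothetical.

```python
def is_partition(data_graph, statement_units):
    """Return True if the statement units partition the data graph.

    data_graph: iterable of triples in the data graph layer.
    statement_units: dict mapping a unit identifier to its triples.
    """
    seen = set()
    for triples in statement_units.values():
        for triple in set(triples):
            if triple in seen:      # triple assigned to more than one unit
                return False
            seen.add(triple)
    return seen == set(data_graph)  # no triple missing, none extra

data_graph = [
    ("ex:hand", "ex:hasPart", "ex:thumb"),
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", 204.56),
]
units = {
    "ex:unit1": [data_graph[0]],
    "ex:unit2": [data_graph[1], data_graph[2]],
}
valid = is_partition(data_graph, units)
overlapping = is_partition(data_graph, {**units, "ex:unit3": [data_graph[0]]})
```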

We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.

To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *SEMUNIT:hasSemanticUnitSubject*, which is documented in the semantic-units graph of the statement unit.

In scenarios where the proposition within the data graph is grounded in a binary relation (a divalent predicate, as in "This right hand has as a part this right thumb"), the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where Predicates of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in n-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like "This right hand has part this right thumb on January 29th 2022," "Anna gives Bob a book," and "Carla travels by train from Paris to Berlin on the 29th of June 2022," each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘weight X’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the arity of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the n-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies [48]. For instance, "b located_in c at t" mandates at least two triples for accurate representation in RDF.
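How an n-ary proposition expands into several triples around an instance resource can be sketched for "Anna gives Bob a book". The modelling below follows a common n-ary relation pattern and is an assumption for illustration; the `ex:` class, role properties, and identifiers are hypothetical and not taken from the paper.

```python
def nary_to_triples(relation_node, relation_class, roles):
    """Expand an n-ary proposition into binary triples.

    relation_node: instance resource standing for the verb or predicate.
    roles: dict mapping a role property to the resource or literal filling it.
    """
    triples = [(relation_node, "rdf:type", relation_class)]
    triples += [(relation_node, role, filler) for role, filler in roles.items()]
    return triples

# "Anna gives Bob a book" as a ternary relation (hypothetical vocabulary).
giving = nary_to_triples(
    "ex:giving1",
    "ex:Giving",
    {"ex:agent": "ex:anna", "ex:recipient": "ex:bob", "ex:theme": "ex:book1"},
)
```

All four resulting triples would belong to the same statement unit, since none of them conveys the proposition on its own.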

The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as has part, type, and develops from.
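What a statement unit pattern constrains can be illustrated with a deliberately simplified stand-in for a shape specification. This is not SHACL or LinkML; it is a plain-Python sketch in which a hypothetical "shape" lists the properties a weight statement unit must contain and how often, with all property names invented for the example.

```python
from collections import Counter

# Hypothetical shape: property -> required number of occurrences in the unit.
WEIGHT_STATEMENT_SHAPE = {
    "ex:hasQuality": 1,
    "ex:isQualityMeasuredAs": 1,
    "ex:hasValue": 1,
    "ex:hasUnit": 1,
}

def conforms(triples, shape):
    """Check that a unit's triples use exactly the properties the shape requires."""
    counts = Counter(predicate for (_, predicate, _) in triples)
    required_ok = all(counts[p] == n for p, n in shape.items())
    no_extras = set(counts) <= set(shape)
    return required_ok and no_extras

weight_unit = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:isQualityMeasuredAs", "ex:datumX"),
    ("ex:datumX", "ex:hasValue", 204.56),
    ("ex:datumX", "ex:hasUnit", "ex:gram"),
]
complete = conforms(weight_unit, WEIGHT_STATEMENT_SHAPE)
incomplete = conforms(weight_unit[:3], WEIGHT_STATEMENT_SHAPE)
```

A real shape language additionally constrains node types, value datatypes, and how the triples connect, which this sketch omits.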

A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit[b] is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see "typed statement unit" and "dynamic label" in Fig. 9, later).


Fig7 Vogt JofBiomedSem24 15.png

Figure 7. Examples of two different types of identification units. A) Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘apple X’ (NCIT:C71985) instance. The subject, "apple X," is connected to its class through the property type (RDF:type), while its label "apple X" is conveyed via the property label (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. B) Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘apple’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.

Compound unit: A collection of propositions

Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *SEMUNIT:hasAssociatedSemanticUnit* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.


Fig8 Vogt JofBiomedSem24 15.png

Figure 8. Example of a compound unit, denoted as *‘apple X item unit’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.
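The relation between a compound unit and its associated statement units can be sketched as follows. The property `SEMUNIT:hasAssociatedSemanticUnit` is taken from the text; the unit identifiers, the class name, and the data graph triples are hypothetical, and representing graphs as Python collections is an assumption made for illustration.

```python
def build_compound_unit(upri, unit_class, statement_units):
    """Return the semantic-units graph and merged data graph of a compound unit.

    statement_units: dict mapping a statement unit's UPRI to its data triples.
    """
    semantic_units_graph = [(upri, "rdf:type", unit_class)]
    semantic_units_graph += [
        (upri, "SEMUNIT:hasAssociatedSemanticUnit", unit_upri)
        for unit_upri in statement_units
    ]
    # The compound unit's data graph is the union of its units' data graphs.
    merged_data_graph = set()
    for triples in statement_units.values():
        merged_data_graph |= set(triples)
    return semantic_units_graph, merged_data_graph

statement_units = {
    "ex:weightUnit1": [("ex:appleX", "ex:hasQuality", "ex:weightX")],
    "ex:hasPartUnit1": [("ex:appleX", "ex:hasPart", "ex:stemX")],
}
su_graph, data_graph = build_compound_unit(
    "ex:appleXItemUnit", "ex:ItemUnit", statement_units
)
```

The compound unit stores no data triples of its own; its data graph is derived on demand from the units it references.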

Typed statement unit

Footnotes

  1. Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.[1]
  2. Analogous to class identification units, one could specify property identification units that have property resources as their subject.

References

  1. Weiland, C.; Islam, S.; Broder, D. et al. (19 August 2022). "FDO Machine Actionability, Version 2.1". Google Docs. FDO Forum. https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.