Difference between revisions of "Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation"
Shawndouglas (talk | contribs) (Created stub. Saving and adding more.) |
Shawndouglas (talk | contribs) (Saving and adding more.) |
}} | }} | ||
==Abstract== | ==Abstract== | ||
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]]ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs. | '''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs. | ||
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource. | '''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. 
Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource. | ||
==Background== | ==Background== | ||
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of "data visiting" rather than conventional "[[data sharing]]." [6] | |||
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]). | |||
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.<ref name="WEilandFDO22">{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}</ref>}} data and metadata, complemented by human-readable interfaces and search capabilities. | |||
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15] | |||
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include | |||
* an intuitive way for modelling relations; | |||
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; | |||
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; | |||
* deployment of graph analytics and [[machine learning]] (ML); and | |||
* utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22] | |||
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27] | |||
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset. | |||
The rest of the paper is organized as such. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below. | |||
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own "Named Graph" and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables their efficient referencing within triples, facilitating an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs. (Fig. 1) This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers. 
| |||
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]] | |||
{{clear}} | |||
{| | |||
| style="vertical-align:top;" | | |||
{| border="0" cellpadding="5" cellspacing="0" width="600px" | |||
|- | |||
| style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.</blockquote> | |||
|- | |||
|} | |||
|} | |||
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs. | |||
===Conventions used in this paper=== | |||
In this paper, the term "knowledge graph" denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications. | |||
Throughout this text, <u>regular underlining</u> is employed for indicating ontology classes, while ''<u>italicsUnderlined</u>'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''<u>isAbout</u>'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *<u>SEMUNIT:metric measurement statement unit</u>*. These will be part of a future Semantic Unit ontology. We use "<u>regular underlined</u>" to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class. | |||
The term "resource" is employed to signify something uniquely designated, such as a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something and represents something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds. | |||
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term "triple" refers specifically to a triple statement, while "statement" pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two. | |||
==Methods== | |||
==Footnotes== | |||
{{reflist|group=lower-alpha}} | |||
==References== | ==References== | ||
[[Category:LIMSwiki journal articles on data management and sharing]] | [[Category:LIMSwiki journal articles on data management and sharing]] | ||
[[Category:LIMSwiki journal articles on FAIR data principles]] | [[Category:LIMSwiki journal articles on FAIR data principles]] | ||
[[Category:LIMSwiki journal articles on health informatics]] |
Revision as of 17:55, 16 June 2024
Full article title | Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |
---|---|
Journal | Journal of Biomedical Semantics |
Author(s) | Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |
Author affiliation(s) | TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology |
Primary contact | Email: lars dot m dot vogt at googlemail dot com |
Year published | 2024 |
Volume and issue | 15 |
Article # | 7 |
DOI | 10.1186/s13326-024-00310-5 |
ISSN | 2041-1480 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 |
Download | https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf (PDF) |
This article should be considered a work in progress and incomplete until this notice is removed. |
Abstract
Background: In today’s landscape of data management, the importance of knowledge graphs and ontologies is escalating, as they are critical mechanisms for meeting the FAIR Guiding Principles, which ask that research data and metadata be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder fully exploiting the potential of applying FAIR concepts to research objects using knowledge graphs.
Results: We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and semantically meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.
Conclusions: Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive information, and that it also supports the differentiation of multiple frames of reference within the graph.
Keywords: FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity
Background
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and management in science and industry. The growing volume of produced data requires systems capable of collecting, integrating, and analyzing extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of "data visiting" rather than conventional "data sharing." [6]
From the standpoint of data representation and management, meeting these demands relies on adherence to the FAIR Guiding Principles, which ask for research data and metadata to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of information (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable[a] data and metadata, complemented by human-readable interfaces and search capabilities.
Knowledge graphs can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of semantic search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional relational or NoSQL models. These include:
- an intuitive way of modeling relations;
- the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;
- incorporation of machine-actionable knowledge representation formalisms like ontologies and rules;
- deployment of graph analytics and machine learning (ML); and
- utilization of specialized graph query languages that support not only standard relational operators such as joins, unions, and projections but also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22]
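The navigational operators named in the last point can be illustrated with a minimal Python sketch. It assumes a toy graph stored as a set of (subject, predicate, object) tuples; the resource names and the helper `reachable` are hypothetical and do not come from the paper. The function follows one predicate over paths of arbitrary length, similar in spirit to a SPARQL property path such as `partOf+`.

```python
from collections import deque

# Toy graph: hypothetical triples (subject, predicate, object).
TRIPLES = {
    ("finger", "partOf", "hand"),
    ("hand", "partOf", "arm"),
    ("arm", "partOf", "body"),
    ("hand", "hasColor", "pale"),
}


def reachable(triples, start, predicate):
    """Return every node reachable from `start` via one or more `predicate` edges."""
    # Index outgoing edges for the chosen predicate only.
    edges = {}
    for s, p, o in triples:
        if p == predicate:
            edges.setdefault(s, set()).add(o)

    # Breadth-first traversal; `seen` also guards against cycles.
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(reachable(TRIPLES, "finger", "partOf")))  # ['arm', 'body', 'hand']
```

A single join would only return the direct parent ("hand"). The traversal returns all ancestors without the query author having to know the path length in advance.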
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the Subject position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.
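The subject-based grouping mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples, the resource names, and both helper functions are hypothetical, and the second helper follows the common reading of a characteristic set as the set of predicates a subject emits, which may differ in detail from the cited works.

```python
from collections import defaultdict

# Hypothetical data-layer triples (subject, predicate, object).
TRIPLES = [
    ("apple1", "type", "Apple"),
    ("apple1", "hasWeight", "204.56 g"),
    ("apple2", "type", "Apple"),
    ("apple2", "hasWeight", "180.10 g"),
    ("lab1", "type", "Laboratory"),
]


def subgraphs_by_subject(triples):
    """Group triples into subgraphs that share the same Subject resource."""
    groups = defaultdict(list)
    for triple in triples:
        groups[triple[0]].append(triple)
    return dict(groups)


def characteristic_sets(triples):
    """Map each subject to the frozenset of predicates it emits."""
    predicates = defaultdict(set)
    for s, p, _ in triples:
        predicates[s].add(p)
    return {s: frozenset(ps) for s, ps in predicates.items()}


groups = subgraphs_by_subject(TRIPLES)
sets_ = characteristic_sets(TRIPLES)
print(len(groups["apple1"]))                 # 2
print(sets_["apple1"] == sets_["apple2"])    # True
```

Subjects with identical predicate sets can be stored and queried together, which is where the reported space and query gains come from. The grouping is purely structural, though, and says nothing about which triples together form a meaningful statement.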
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of phenotypes into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own "Named Graph" and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the Resource Description Framework (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units, statement units and compound units, are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables the units to be referenced within triples, providing an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.
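Figure 1. Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.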
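The idea of statement units can be made concrete with a small Python sketch. It is not the authors' implementation. The identifiers (`ex:...`), the property names, the parthood unit label, and the helper `is_partition` are all assumptions made for illustration; only the class label for the measurement unit is borrowed from the conventions described later in the paper. The sketch shows three things: a statement unit that bundles several triples, a check that the units partition the data layer, and a statement about a statement that refers to the unit through its own identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatementUnit:
    upri: str            # identifier of the unit itself (hypothetical values below)
    unit_class: str      # semantic unit class the unit instantiates
    triples: frozenset   # data-layer triples that belong to this unit


# Hypothetical data layer: one weight measurement (three triples) and one parthood triple.
DATA = [
    ("ex:apple1", "hasQuality", "ex:weight1"),
    ("ex:weight1", "hasValue", "204.56"),
    ("ex:weight1", "hasUnit", "ex:gram"),
    ("ex:stem1", "partOf", "ex:apple1"),
]

UNIT_WEIGHT = StatementUnit(
    "ex:unit/1", "SEMUNIT:metric measurement statement unit", frozenset(DATA[:3])
)
UNIT_PART = StatementUnit(
    "ex:unit/2", "parthood statement unit (hypothetical label)", frozenset(DATA[3:])
)
UNITS = [UNIT_WEIGHT, UNIT_PART]


def is_partition(all_triples, units):
    """True if every data-layer triple belongs to exactly one statement unit."""
    assigned = [t for unit in units for t in unit.triples]
    no_overlap = len(assigned) == len(set(assigned))
    full_cover = set(assigned) == set(all_triples)
    return no_overlap and full_cover


# A statement about a statement: the unit's identifier sits in the Subject position.
META = [(UNIT_WEIGHT.upri, "ex:createdBy", "ex:alice")]

print(is_partition(DATA, UNITS))  # True
```

Because the unit has its own identifier, provenance or access information attaches to the whole measurement statement rather than to any one of its three triples.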
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.
Conventions used in this paper
In this paper, the term "knowledge graph" denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an n-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.
Throughout this text, regular underlining is employed for indicating ontology classes, while italicsUnderlined text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., isAbout [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *SEMUNIT:metric measurement statement unit*. These will be part of a future Semantic Unit ontology. We use "regular underlined" to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.
The term "resource" is employed to signify something that is uniquely designated, for example by a Uniform Resource Identifier (URI), and about which informative statements are made. It thus stands for, and represents, something you want to talk about. In RDF, the Subject and the Predicate in a triple are always resources, whereas the Object can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the Predicate position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.
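The positional constraints described above can be written down as a small type sketch in Python. The class names `Resource`, `Literal`, and `Triple` and the example identifiers are hypothetical and are not taken from any RDF library. The sketch only encodes the rule as stated here: Subject and Predicate are resources, while the Object is either a resource or a literal.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    uri: str  # unique designation of the resource


@dataclass(frozen=True)
class Literal:
    value: str  # plain value, not a resource


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: Resource
    obj: Union[Resource, Literal]

    def __post_init__(self):
        # Enforce the positional rules at construction time.
        if not isinstance(self.subject, Resource):
            raise TypeError("Subject must be a resource")
        if not isinstance(self.predicate, Resource):
            raise TypeError("Predicate must be a resource")
        if not isinstance(self.obj, (Resource, Literal)):
            raise TypeError("Object must be a resource or a literal")


# Object as a literal and as a resource (all identifiers are made up).
t1 = Triple(Resource("ex:apple1"), Resource("ex:hasLabel"), Literal("apple"))
t2 = Triple(Resource("ex:stem1"), Resource("ex:partOf"), Resource("ex:apple1"))
print(type(t1.obj).__name__, type(t2.obj).__name__)  # Literal Resource
```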
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, under the implicit assumption that each property, instance, and class possesses its own UPRI. Additionally, the term "triple" refers specifically to a triple statement, while "statement" refers to a natural language statement, establishing a clear distinction between the two.
Methods
Footnotes
- Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.[1]
References
- Weiland, C.; Islam, S.; Broder, D. et al. (19 August 2022). "FDO Machine Actionability, Version 2.1". Google Docs. FDO Forum. https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.