Difference between revisions of "Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation"

From LIMSWiki


==Background==
In an era marked by the exponential generation of data<ref>{{Cite journal |last=Adam, K.; Hammad, I.; Fakhreldin, M.A.I. et al. |year=2015 |title=Big Data Analysis and Storage |url=http://umpir.ump.edu.my/id/eprint/7341 |journal=Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering |pages=648–59}}</ref><ref>{{Cite web |last=Marr, B. |date=21 May 2018 |title=How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read |work=Forbes |url=https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/ |accessdate=22 May 2024}}</ref><ref>{{Cite web |date=2017 |title=Data Never Sleeps 5 |url=https://www.domo.com/learn/infographic/data-never-sleeps-5 |publisher=Domo, Inc}}</ref>, both technically and socially intricate challenges have emerged<ref>{{Cite journal |last=Idrees |first=Sheikh Mohammad |last2=Alam |first2=M. Afshar |last3=Agarwal |first3=Parul |date=2019-12 |title=A study of big data and its challenges |url=http://link.springer.com/10.1007/s41870-018-0185-1 |journal=International Journal of Information Technology |language=en |volume=11 |issue=4 |pages=841–846 |doi=10.1007/s41870-018-0185-1 |issn=2511-2104}}</ref>, necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. 
The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges.<ref>{{Cite web |last=United Nations |date=2015 |title=Transforming our world: the 2030 Agenda for Sustainable Development |url=https://wedocs.unep.org/20.500.11822/9814 |publisher=United Nations Environment Programme |accessdate=22 May 2024}}</ref> Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of "data visiting" rather than conventional "[[data sharing]]."<ref>{{Cite web |last=Mons, B. |date=December 2018 |title=Message from President Barend Mons (2018-2023) |url=https://codata.org/about-codata/message-from-president-merce-crosas/message-from-president-barend-mons-2018-2023/ |publisher=Committee on Data (CODATA) |accessdate=22 May 2024}}</ref>


From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike.<ref>{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-03-15 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=https://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}</ref> Failure to achieve FAIRness risks transforming big data into opaque dark data.<ref>{{Cite journal |last=Heidorn |first=P. 
Bryan |date=2008-09 |title=Shedding Light on the Dark Data in the Long Tail of Science |url=https://muse.jhu.edu/article/262029 |journal=Library Trends |language=en |volume=57 |issue=2 |pages=280–299 |doi=10.1353/lib.0.0036 |issn=1559-0682}}</ref> Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science<ref>{{Cite journal |last=Baker |first=Monya |date=2016-05-26 |title=1,500 scientists lift the lid on reproducibility |url=https://www.nature.com/articles/533452a |journal=Nature |language=en |volume=533 |issue=7604 |pages=452–454 |doi=10.1038/533452a |issn=0028-0836}}</ref> but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology<ref>{{Cite journal |last=Lin |first=Dawei |last2=Crabtree |first2=Jonathan |last3=Dillo |first3=Ingrid |last4=Downs |first4=Robert R. |last5=Edmunds |first5=Rorie |last6=Giaretta |first6=David |last7=De Giusti |first7=Marisa |last8=L’Hours |first8=Hervé |last9=Hugo |first9=Wim |last10=Jenkyns |first10=Reyna |last11=Khodiyar |first11=Varsha |date=2020-05-14 |title=The TRUST Principles for digital repositories |url=https://www.nature.com/articles/s41597-020-0486-7 |journal=Scientific Data |language=en |volume=7 |issue=1 |pages=144 |doi=10.1038/s41597-020-0486-7 |issn=2052-4463 |pmc=PMC7224370 |pmid=32409645}}</ref>).


To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested.<ref>{{Cite web |title=The Internet of FAIR Data & Services |url=https://www.go-fair.org/resources/internet-fair-data-services/ |publisher=GO FAIR |accessdate=22 May 2024}}</ref> Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles.<ref>{{Cite book |last=European Commission. Directorate General for Research and Innovation. |date=2016 |title=Realising the European open science cloud: first report and recommendations of the Commission high level expert group on the European open science cloud. |url=https://data.europa.eu/doi/10.2777/940154 |publisher=Publications Office |place=LU |doi=10.2777/940154}}</ref><ref>{{Citation |last=Hasnain |first=Ali |last2=Rebholz-Schuhmann |first2=Dietrich |date=2018 |editor-last=Gangemi |editor-first=Aldo |editor2-last=Gentile |editor2-first=Anna Lisa |editor3-last=Nuzzolese |editor3-first=Andrea Giovanni |editor4-last=Rudolph |editor4-first=Sebastian |editor5-last=Maleshkova |editor5-first=Maria |title=Assessing FAIR Data Principles Against the 5-Star Open Data Principles |url=https://link.springer.com/10.1007/978-3-319-98192-5_60 |work=The Semantic Web: ESWC 2018 Satellite Events |language=en |publisher=Springer International Publishing |place=Cham |volume=11155 |pages=469–477 |doi=10.1007/978-3-319-98192-5_60 |isbn=978-3-319-98191-8 |accessdate=2024-06-17}}</ref> The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or 
rule-based data transformations such as unit conversion for defined types of elements.<ref name="WEilandFDO22">{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}</ref>}} data and metadata, complemented by human-readable interfaces and search capabilities.


[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata.<ref>{{Cite journal |last=Vogt |first=Lars |last2=Baum |first2=Roman |last3=Bhatty |first3=Philipp |last4=Köhler |first4=Christian |last5=Meid |first5=Sandra |last6=Quast |first6=Björn |last7=Grobe |first7=Peter |date=2019-01-01 |title=SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming |url=https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589 |journal=Database |language=en |volume=2019 |pages=baz067 |doi=10.1093/database/baz067 |issn=1758-0463 |pmc=PMC6686081 |pmid=31392324}}</ref> Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics.<ref>{{Cite journal |last=Bonatti |first=Piero Andrea |last2=Decker |first2=Stefan |last3=Polleres |first3=Axel |last4=Presutti |first4=Valentina |date=2019 |title=Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371) |url=https://drops.dagstuhl.de/entities/document/10.4230/DagRep.8.9.29 |language=en |pages=83 pages, 5326322 bytes |doi=10.4230/dagrep.8.9.29}}</ref>


The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include
*an intuitive way for modelling relations;
*the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;
*incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules;
*deployment of graph analytics and [[machine learning]] (ML); and
*utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths.<ref>{{Cite journal |last=Hogan |first=Aidan |last2=Blomqvist |first2=Eva |last3=Cochez |first3=Michael |last4=D’amato |first4=Claudia |last5=Melo |first5=Gerard De |last6=Gutierrez |first6=Claudio |last7=Kirrane |first7=Sabrina |last8=Gayo |first8=José Emilio Labra |last9=Navigli |first9=Roberto |last10=Neumaier |first10=Sebastian |last11=Ngomo |first11=Axel-Cyrille Ngonga |date=2022-05-31 |title=Knowledge Graphs |url=https://dl.acm.org/doi/10.1145/3447772 |journal=ACM Computing Surveys |language=en |volume=54 |issue=4 |pages=1–37 |doi=10.1145/3447772 |issn=0360-0300}}</ref><ref>{{Citation |last=Abiteboul |first=Serge |date=1997 |editor-last=Afrati |editor-first=Foto |editor2-last=Kolaitis |editor2-first=Phokion |title=Querying semi-structured data |url=http://link.springer.com/10.1007/3-540-62222-5_33 |work=Database Theory — ICDT '97 |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |volume=1186 |pages=1–18 |doi=10.1007/3-540-62222-5_33 |isbn=978-3-540-62222-2 |accessdate=2024-06-17}}</ref><ref>{{Cite journal |last=Angles |first=Renzo |last2=Gutierrez |first2=Claudio |date=2008-02 |title=Survey of graph database models |url=https://dl.acm.org/doi/10.1145/1322432.1322433 |journal=ACM Computing Surveys |language=en |volume=40 |issue=1 |pages=1–39 |doi=10.1145/1322432.1322433 |issn=0360-0300}}</ref><ref>{{Cite journal |last=Angles |first=Renzo |last2=Arenas |first2=Marcelo |last3=Barceló |first3=Pablo |last4=Hogan |first4=Aidan |last5=Reutter |first5=Juan |last6=Vrgoč |first6=Domagoj |date=2018-09-30 |title=Foundations of Modern Query Languages for Graph Databases |url=https://dl.acm.org/doi/10.1145/3104031 |journal=ACM Computing Surveys |language=en |volume=50 |issue=5 |pages=1–40 |doi=10.1145/3104031 
|issn=0360-0300}}</ref><ref>{{Cite web |last=Hitzler, P.; Krötzsch, M.; Parsia, B. et al. |date=11 December 2012 |title=OWL 2 Web Ontology Language Primer (Second Edition) |url=https://www.w3.org/TR/owl2-primer/ |publisher=World Wide Web Consortium}}</ref><ref>{{Cite journal |last=Stutz |first=Philip |last2=Strebel |first2=Daniel |last3=Bernstein |first3=Abraham |date=2016 |title=Signal/collect12: processing large graphs in seconds |url=https://www.zora.uzh.ch/id/eprint/119576 |doi=10.5167/UZH-119576}}</ref><ref>{{Cite journal |last=Wang |first=Quan |last2=Mao |first2=Zhendong |last3=Wang |first3=Bin |last4=Guo |first4=Li |date=2017-12-01 |title=Knowledge Graph Embedding: A Survey of Approaches and Applications |url=http://ieeexplore.ieee.org/document/8047276/ |journal=IEEE Transactions on Knowledge and Data Engineering |volume=29 |issue=12 |pages=2724–2743 |doi=10.1109/TKDE.2017.2754499 |issn=1041-4347}}</ref>
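
The navigational operators mentioned in the last list item can be illustrated with a minimal sketch. It assumes a toy graph of plain Python tuples; the <code>ex:</code> identifiers and the function name <code>reachable</code> are hypothetical and not taken from the paper. The traversal mimics what a SPARQL 1.1 property path such as <code>ex:partOf+</code> expresses: follow one predicate over paths of arbitrary length.

```python
from collections import deque

# Hypothetical toy graph as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:finger", "ex:partOf", "ex:hand"),
    ("ex:hand", "ex:partOf", "ex:arm"),
    ("ex:arm", "ex:partOf", "ex:body"),
    ("ex:hand", "ex:hasQuality", "ex:weight1"),
]


def reachable(triples, start, predicate):
    """Return every node reachable from `start` via one or more `predicate` edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            # Follow only edges that leave the current node with the requested predicate.
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


if __name__ == "__main__":
    print(sorted(reachable(TRIPLES, "ex:finger", "ex:partOf")))
```

A relational join can express a path of fixed length; the breadth-first loop above stands in for the recursion that graph query languages provide as a built-in operator.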
 
Moreover, the inherent semantic transparency of knowledge graphs can improve both the transparency of data-based decision-making and the communication of data and knowledge within research and science in general.<ref>{{Cite journal |last=Stocker |first=Markus |last2=Oelen |first2=Allard |last3=Jaradeh |first3=Mohamad Yaser |last4=Haris |first4=Muhammad |last5=Oghli |first5=Omar Arab |last6=Heidari |first6=Golsa |last7=Hussein |first7=Hassan |last8=Lorenz |first8=Anna-Lena |last9=Kabenamualu |first9=Salomon |last10=Farfar |first10=Kheir Eddine |last11=Prinz |first11=Manuel |date=2023-01-11 |editor-last=Magagna |editor-first=Barbara |title=FAIR scientific information with the Open Research Knowledge Graph |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/FC-221513 |journal=FAIR Connect |volume=1 |issue=1 |pages=19–21 |doi=10.3233/FC-221513}}</ref><ref>{{Cite journal |last=Aisopos |first=Fotis |last2=Jozashoori |first2=Samaneh |last3=Niazmand |first3=Emetis |last4=Purohit |first4=Disha |last5=Rivas |first5=Ariam |last6=Sakor |first6=Ahmad |last7=Iglesias |first7=Enrique |last8=Vogiatzis |first8=Dimitrios |last9=Menasalvas |first9=Ernestina |last10=Rodriguez Gonzalez |first10=Alejandro |last11=Vigueras |first11=Guillermo |date=2023-05-08 |editor-last=Kondylakis |editor-first=Haridimos |editor2-last=Rao |editor2-first=Praveen |editor3-last=Stefanidis |editor3-first=Kostas |title=Knowledge graphs for enhancing transparency in health data ecosystems |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-223294 |journal=Semantic Web |volume=14 |issue=5 |pages=943–976 |doi=10.3233/SW-223294}}</ref><ref>{{Cite journal |last=Cifuentes-Silva |first=Francisco |last2=Fernández-Álvarez |first2=Daniel |last3=Labra-Gayo |first3=Jose Emilio |date=2020-06-03 |title=National Budget as Linked Open Data: New Tools for Supporting the 
Sustainability of Public Finances |url=https://www.mdpi.com/2071-1050/12/11/4551 |journal=Sustainability |language=en |volume=12 |issue=11 |pages=4551 |doi=10.3390/su12114551 |issn=2071-1050}}</ref><ref>{{Cite journal |last=Rajabi |first=Enayat |last2=Kafaie |first2=Somayeh |date=2022-09-28 |title=Knowledge Graphs and Explainable AI in Healthcare |url=https://www.mdpi.com/2078-2489/13/10/459 |journal=Information |language=en |volume=13 |issue=10 |pages=459 |doi=10.3390/info13100459 |issn=2078-2489}}</ref><ref>{{Cite journal |last=Tiddi |first=Ilaria |last2=Schlobach |first2=Stefan |date=2022-01 |title=Knowledge graphs as tools for explainable machine learning: A survey |url=https://linkinghub.elsevier.com/retrieve/pii/S0004370221001788 |journal=Artificial Intelligence |language=en |volume=302 |pages=103627 |doi=10.1016/j.artint.2021.103627}}</ref>


Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.


The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance<ref>{{Cite journal |last=Hogan |first=Aidan |last2=Arenas |first2=Marcelo |last3=Mallea |first3=Alejandro |last4=Polleres |first4=Axel |date=2014-08 |title=Everything you always wanted to know about blank nodes |url=https://linkinghub.elsevier.com/retrieve/pii/S1570826814000481 |journal=Journal of Web Semantics |language=en |volume=27-28 |pages=42–69 |doi=10.1016/j.websem.2014.06.004}}</ref><ref>{{Cite web |last=Neumann, T.; Moerkotte, G. |title=Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins |work=Proceedings of the 2011 IEEE 27th International Conference on Data Engineering |url=https://ieeexplore.ieee.org/document/5767868/ |doi=10.1109/icde.2011.5767868 |accessdate=}}</ref> (see also the related concept of RDF molecules<ref>{{Cite journal |last=Papastefanatos |first=George |last2=Meimaris |first2=Marios |last3=Vassiliadis |first3=Panos |date=2022-02 |title=Relational schema optimization for RDF-based knowledge graphs |url=https://linkinghub.elsevier.com/retrieve/pii/S0306437921000223 |journal=Information Systems |language=en |volume=104 |pages=101754 |doi=10.1016/j.is.2021.101754}}</ref><ref>{{Cite journal |last=Collarana |first=Diego |last2=Galkin |first2=Mikhail |last3=Traverso-Ribón |first3=Ignacio |last4=Vidal |first4=Maria-Esther |last5=Lange |first5=Christoph |last6=Auer |first6=Sören |date=2017-06-19 |title=MINTE: semantically integrating RDF graphs |url=https://dl.acm.org/doi/10.1145/3102254.3102280 
|journal=Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics |language=en |publisher=ACM |place=Amantea Italy |pages=1–11 |doi=10.1145/3102254.3102280 |isbn=978-1-4503-5225-3}}</ref>), but they do not fully mitigate the challenges outlined below.
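
The subject-based grouping described above can be sketched in a few lines. This is an illustration only, under the assumption that triples are held as plain Python tuples; the <code>ex:</code> identifiers and both function names are hypothetical. The first function collects the triples that share a subject, and the second derives the set of predicates of such a group, which is the signature the cited work on characteristic sets uses for cardinality estimation.

```python
from collections import defaultdict

# Hypothetical example triples; none of these identifiers come from the paper.
TRIPLES = [
    ("ex:apple1", "rdf:type", "ex:Apple"),
    ("ex:apple1", "ex:hasQuality", "ex:weight1"),
    ("ex:weight1", "rdf:type", "ex:Weight"),
    ("ex:weight1", "ex:hasValue", "204.56"),
]


def group_by_subject(triples):
    """Partition triples into subgraphs that share the same subject resource."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return dict(groups)


def predicate_signature(subgraph):
    """Return the set of predicates emitted by the subject of a subgraph."""
    return frozenset(p for _, p, _ in subgraph)


if __name__ == "__main__":
    for subject, subgraph in group_by_subject(TRIPLES).items():
        print(subject, sorted(predicate_signature(subgraph)))
```

Note that a grouping of this kind is purely structural: the weight statement about the apple is split across two subject groups, which hints at why such partitions do not by themselves yield semantically meaningful units.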


The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity.<ref>{{Cite journal |last=Vogt |first=Lars |date=2019-12 |title=Organizing phenotypic data—a semantic data model for anatomy |url=https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-019-0204-6 |journal=Journal of Biomedical Semantics |language=en |volume=10 |issue=1 |pages=12 |doi=10.1186/s13326-019-0204-6 |issn=2041-1480 |pmc=PMC6585074 |pmid=31221226}}</ref> Each such subgraph is organized in its own "Named Graph" and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. 
The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit allows every unit to be referenced efficiently within triples, which provides a straightforward way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.
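
The idea of a subgraph that carries its own identifier can be sketched as follows. This is a simplified illustration, not the authors' implementation: triples are extended to quads whose fourth element names the graph they belong to, the unit's identifier doubles as that graph name, and all <code>ex:</code> identifiers, property names, and function names are hypothetical. Only the class label is borrowed from the conventions used in this paper.

```python
# Hypothetical identifier of one statement unit; it also names the graph
# that holds the unit's triples.
UNIT = "ex:unit/weight-statement-1"

# Quads: (subject, predicate, object, graph).
QUADS = [
    # Content of the statement unit: "apple1 weighs 204.56 grams".
    ("ex:apple1", "ex:hasQuality", "ex:weight1", UNIT),
    ("ex:weight1", "ex:hasValue", "204.56", UNIT),
    ("ex:weight1", "ex:hasUnit", "ex:gram", UNIT),
    # Statements about the unit itself, kept in a separate (hypothetical) graph.
    (UNIT, "rdf:type", "SEMUNIT:metric measurement statement unit", "ex:unit-metadata"),
    (UNIT, "ex:createdBy", "ex:researcher42", "ex:unit-metadata"),
]


def unit_content(quads, unit_id):
    """Return the triples that make up the subgraph of a semantic unit."""
    return [(s, p, o) for s, p, o, g in quads if g == unit_id]


def statements_about(quads, unit_id):
    """Return the triples in which the unit's identifier is the subject."""
    return [(s, p, o) for s, p, o, g in quads if s == unit_id]


if __name__ == "__main__":
    print(unit_content(QUADS, UNIT))
    print(statements_about(QUADS, UNIT))
```

Because the whole statement is addressed through one identifier, provenance or other metadata can be attached with a single triple rather than by annotating each of the three content triples separately.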






===Conventions used in this paper===
In this paper, the term "knowledge graph" denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic)<ref>{{Citation |last=Ceusters |first=Werner |date=2022 |editor-last=Elkin |editor-first=Peter L. |title=The Place of Referent Tracking in Biomedical Informatics |url=https://link.springer.com/10.1007/978-3-031-11302-4_6 |work=Terminology, Ontology and their Implementations |language=en |publisher=Springer International Publishing |place=Cham |pages=39–46 |doi=10.1007/978-3-031-11302-4_6 |isbn=978-3-031-11301-7 |accessdate=2024-06-17}}</ref><ref>{{Cite journal |last=Ceusters |first=Werner |last2=Elkin |first2=Peter |last3=Smith |first3=Barry |date=2007-12 |title=Negative findings in electronic health records and biomedical ontologies: A realist approach |url=https://linkinghub.elsevier.com/retrieve/pii/S1386505607000408 |journal=International Journal of Medical Informatics |language=en |volume=76 |pages=S326–S333 |doi=10.1016/j.ijmedinf.2007.02.003 |pmc=PMC2211452 |pmid=17369081}}</ref>, but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.




==Methods==
===Problem statement===
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====
In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, it is not enough for the terms employed in data and metadata statements to possess identifiers from controlled vocabularies, such as ontologies, which ensures terminological interoperability. The semantic graph patterns underlying each statement must also be explicitly defined and identifiable. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI)<ref>{{Cite journal |last=Bandrowski |first=Anita |last2=Brinkman |first2=Ryan |last3=Brochhausen |first3=Mathias |last4=Brush |first4=Matthew H. |last5=Bug |first5=Bill |last6=Chibucos |first6=Marcus C. |last7=Clancy |first7=Kevin |last8=Courtot |first8=Mélanie |last9=Derom |first9=Dirk |last10=Dumontier |first10=Michel |last11=Fan |first11=Liju |date=2016-04-29 |editor-last=Xue |editor-first=Yu |title=The Ontology for Biomedical Investigations |url=https://dx.plos.org/10.1371/journal.pone.0154556 |journal=PLOS ONE |language=en |volume=11 |issue=4 |pages=e0154556 |doi=10.1371/journal.pone.0154556 |issn=1932-6203 |pmc=PMC4851331 |pmid=27128319}}</ref> of the Open Biological and Biomedical Ontology Foundry (OBO).</blockquote>
|-
|}
|}
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="800px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE)<ref>{{Cite journal |last=Madin |first=Joshua |last2=Bowers |first2=Shawn |last3=Schildhauer |first3=Mark |last4=Krivov |first4=Sergeui |last5=Pennington |first5=Deana |last6=Villa |first6=Ferdinando |date=2007-10 |title=An ontology for describing and synthesizing ecological observation data |url=https://linkinghub.elsevier.com/retrieve/pii/S1574954107000362 |journal=Ecological Informatics |language=en |volume=2 |issue=3 |pages=279–296 |doi=10.1016/j.ecoinf.2007.05.004}}</ref>, an ontology frequently used in the ecology community.</blockquote>
|-
|}
|}
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.
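The two criteria can be illustrated with a minimal Python sketch. All names in it are hypothetical: the pattern identifiers (<code>pattern:obi-weight</code>, <code>pattern:oboe-weight</code>), the graph identifiers, and the metadata layout are illustrative assumptions rather than terms from an existing vocabulary. Each identifiable statement subgraph records its statement class and the identifier of the graph pattern it follows, so that potentially interoperable subgraphs can be found by comparing pattern identifiers.

```python
from collections import defaultdict

# Hypothetical metadata: each statement subgraph records its statement class
# and the identifier of the graph pattern (e.g., a shape) it conforms to.
statement_graphs = [
    {"graph": "ex:g1", "statement_class": "weight statement", "pattern": "pattern:obi-weight"},
    {"graph": "ex:g2", "statement_class": "weight statement", "pattern": "pattern:obi-weight"},
    {"graph": "ex:g3", "statement_class": "weight statement", "pattern": "pattern:oboe-weight"},
]


def group_by_pattern(graphs):
    """Group subgraph identifiers by the graph-pattern identifier they reference."""
    groups = defaultdict(list)
    for entry in graphs:
        groups[entry["pattern"]].append(entry["graph"])
    return dict(groups)


interoperable = group_by_pattern(statement_graphs)
print(interoperable)
```

In this sketch, <code>ex:g1</code> and <code>ex:g2</code> share a pattern identifier and can be treated as schematically interoperable, whereas <code>ex:g3</code> models the same kind of statement with a different pattern and would first need to be converted.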
====Challenge 2: Overcoming barriers in graph query language adoption====
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs are RDF/OWL and labeled property graphs, such as those managed in Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, necessitates the utilization of a query language. SPARQL<ref>{{Cite web |last=Harris, S.; Seaborne, A. |date=21 March 2013 |title=SPARQL 1.1 Query Language |url=https://www.w3.org/TR/sparql11-query/ |publisher=World Wide Web Consortium}}</ref> is an example of such a language for RDF/OWL, while Cypher<ref>{{Cite web |date=2024 |title=The Neo4j Operations Manual v5 |url=https://neo4j.com/docs/operations-manual/current/ |publisher=Neo4j, Inc}}</ref> is employed for Neo4j.
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs.<ref>{{Cite web |last=Booth, D.; Wallace, E. |date=2019 |title=Session X: EasyRDF |work=2nd U.S. Semantic Technologies Symposium 2019 |url=https://us2ts.org/2019/posts/program-session-x.html}}</ref> Furthermore, query languages are not aware of graph patterns.
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.
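One way to picture such a reusable query pattern is a registry that maps a graph-pattern identifier to a query template. The following Python sketch is only an illustration under stated assumptions: the pattern identifier <code>pattern:simple-weight</code>, the <code>ex:</code> properties, and the function <code>build_query</code> are hypothetical, and the simplified weight pattern is not the OBI or OBOE pattern from Figs. 2 and 3.

```python
from string import Template

# Hypothetical registry: graph-pattern identifier -> reusable SPARQL template.
# The ex: properties are placeholders, not terms from a real ontology.
QUERY_PATTERNS = {
    "pattern:simple-weight": Template(
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?value ?unit WHERE {\n"
        "  <$subject> ex:hasWeight ?w .\n"
        "  ?w ex:hasValue ?value ;\n"
        "     ex:hasUnit ?unit .\n"
        "}"
    ),
}


def build_query(pattern_id, subject_iri):
    """Fill the query template registered for the given graph pattern."""
    return QUERY_PATTERNS[pattern_id].substitute(subject=subject_iri)


query = build_query("pattern:simple-weight", "http://example.org/appleX")
print(query)
```

A user of such a registry only supplies the pattern identifier and the subject of interest, without having to write the SPARQL query by hand.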
====Challenge 3: Addressing complexities in making statements about statements====
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of <u>mass measurement assay</u> (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., "apple X"), along with specifying the resultant data encapsulated in a particular weight measurement assertion.</blockquote>
|-
|}
|}
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed.<ref>{{Cite web |last=Hartig, O. |date=2017 |title=Foundations of RDF⋆ and SPARQL⋆ (An Alternative Approach to Statement-Level Metadata in RDF) |work=Alberto Mendelzon Workshop on Foundations of Data Management |url=https://www.semanticscholar.org/paper/Foundations-of-RDF%E2%8B%86-and-SPARQL%E2%8B%86-(An-Alternative-to-Hartig/36e70ee51cb7b7ec12faac934ae6b6a4d9da15a8}}</ref> Both methods increase the complexity of the represented graph.
In cases like this, the adoption of Named Graphs is an alternative to RDF reification or RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.
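The quad idea can be sketched with plain Python tuples, without relying on any particular triple store. All <code>ex:</code> identifiers and the helper <code>triples_in</code> are hypothetical and serve only to show how a Named Graph identifier can become the subject of a statement about a statement.

```python
# A quad is a triple plus the identifier of the Named Graph it belongs to.
# All ex: identifiers are hypothetical placeholders.
quads = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX", "ex:measurementGraph"),
    ("ex:weightX", "ex:hasValue", "204.56", "ex:measurementGraph"),
    ("ex:weightX", "ex:hasUnit", "ex:gram", "ex:measurementGraph"),
    # A statement about the statement: the Named Graph identifier is the subject.
    ("ex:measurementGraph", "ex:documentedBy", "ex:assayGraph", "ex:metadataGraph"),
]


def triples_in(graph_id, all_quads):
    """Return the triples whose fourth element is the given Named Graph."""
    return [(s, p, o) for s, p, o, g in all_quads if g == graph_id]


measurement = triples_in("ex:measurementGraph", quads)
print(measurement)
```

The same lookup would work in a labeled property graph if the graph identifier were stored as a property value on every node and relation of the subgraph.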
==Results==
===Semantic unit===
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces "semantic units." Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *<u>SEMUNIT:weight statement unit</u>* as it is illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘<u>weight</u>’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘<u>scalar measurement datum</u>’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, lacks semantic meaningfulness for a domain expert who has no background in semantics.
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="300px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 5.''' Classification of different categories of semantic units.</blockquote>
|-
|}
|}
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with "apple X" as the subject and "gram X" alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *<u>SEMUNIT:weight statement unit</u>* class, with "apple X" identified as the subject. Notably, the UPRI of *’<u>weight statement unit</u>’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).</blockquote>
|-
|}
|}
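The relation between the two layers can be sketched in Python as follows. The class <code>StatementUnit</code>, the <code>ex:</code> identifiers, and the property <code>ex:createdBy</code> are hypothetical; only <code>SEMUNIT:hasSemanticUnitSubject</code> and the class label are taken from the text. The sketch assumes that the data graph is stored as a Named Graph whose name equals the UPRI of the semantic unit resource.

```python
from dataclasses import dataclass, field


@dataclass
class StatementUnit:
    """Hypothetical container for one statement unit."""
    upri: str        # identifies both the unit resource and its data graph
    unit_class: str  # class the unit instantiates
    subject: str     # subject of the statement
    data_graph: list = field(default_factory=list)  # data graph layer triples

    def semantic_units_graph(self):
        """Triples that belong to the semantic-units graph layer."""
        return [
            (self.upri, "rdf:type", self.unit_class),
            (self.upri, "SEMUNIT:hasSemanticUnitSubject", self.subject),
        ]

    def data_quads(self):
        """Data graph triples as quads named by the unit's UPRI."""
        return [(s, p, o, self.upri) for (s, p, o) in self.data_graph]


unit = StatementUnit(
    upri="ex:weightStatementUnit1",
    unit_class="SEMUNIT:weight statement unit",
    subject="ex:appleX",
    data_graph=[
        ("ex:appleX", "ex:hasQuality", "ex:weightX"),
        ("ex:weightX", "ex:hasValue", "204.56"),
        ("ex:weightX", "ex:hasUnit", "ex:gramX"),
    ],
)

# A statement about the statement: the shared UPRI is used as the subject.
provenance = (unit.upri, "ex:createdBy", "ex:researcher1")
```

Because the resource and its data graph share one identifier, the <code>provenance</code> triple refers to the whole weight statement rather than to any single triple in it.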
====Statement unit: A proposition in the knowledge graph====
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for "apple X" illustrated in Fig. 6 represents a statement unit.
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.
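The partition property can be stated as a simple check: no triple is assigned to two statement units, and the union of all units equals the data graph. The Python sketch below illustrates this with hypothetical <code>ex:</code> identifiers; the function <code>is_partition</code> is an assumption made for illustration.

```python
def is_partition(data_graph, statement_units):
    """Check that the statement units partition the data graph.

    data_graph: iterable of triples in the data graph layer.
    statement_units: dict mapping a unit identifier to its list of triples.
    """
    assigned = [t for triples in statement_units.values() for t in triples]
    no_overlap = len(assigned) == len(set(assigned))  # each triple in one unit only
    complete = set(assigned) == set(data_graph)       # merging units gives the graph
    return no_overlap and complete


data_graph = [
    ("ex:hand", "ex:hasPart", "ex:thumb"),
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
    ("ex:weightX", "ex:hasUnit", "ex:gramX"),
]
units = {
    "ex:hasPartUnit1": data_graph[:1],
    "ex:weightUnit1": data_graph[1:],
}
print(is_partition(data_graph, units))
```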
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *<u>SEMUNIT:''hasSemanticUnitSubject''</u>*, which is documented in the semantic-units graph of the statement unit.
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like "This right hand has as a part this right thumb"—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like "This right hand has part this right thumb on January 29th 2022," "Anna gives Bob a book," and "Carla travels by train from Paris to Berlin on the 29th of June 2022," each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘<u>weight X</u>’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, "''b'' located_in ''c'' at ''t''" mandates at least two triples for accurate representation in RDF.
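The difference between binary and ''n''-ary propositions can be made concrete with a short Python sketch. It assumes the common modelling approach of introducing an instance resource that links all participants of the relation; the <code>ex:</code> identifiers, role properties, and function names are hypothetical.

```python
def binary_statement(subject, predicate, obj):
    """A binary relation fits into a single triple."""
    return [(subject, predicate, obj)]


def nary_statement(node, node_type, roles):
    """An n-ary relation expressed through an instance resource (node).

    roles maps a property to the resource or literal filling that role.
    """
    triples = [(node, "rdf:type", node_type)]
    triples.extend((node, prop, filler) for prop, filler in roles.items())
    return triples


has_part = binary_statement("ex:rightHand", "ex:hasPart", "ex:rightThumb")
giving = nary_statement(
    "ex:giving1",
    "ex:Giving",
    {"ex:giver": "ex:Anna", "ex:recipient": "ex:Bob", "ex:gift": "ex:book1"},
)
print(len(has_part), len(giving))
```

Here the has-part statement needs one triple, while "Anna gives Bob a book" needs several triples that only carry meaning together, which is why they would form one statement unit.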
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analogous to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see "typed statement unit" and "dynamic label" in Fig. 9, later).
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="500px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 7.''' Examples for two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘<u>apple X</u>’ (NCIT:C71985) instance. The subject, "apple X," is connected to its class through the property ''<u>type</u>'' (RDF:type), while its label "apple X" is conveyed via the property ''<u>label</u>'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘<u>apple</u>’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.</blockquote>
|-
|}
|}
====Compound unit: A collection of propositions====
Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *<u>SEMUNIT:hasAssociatedSemanticUnit</u>* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.


[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 8.''' Example of a compound unit, denoted as *‘<u>apple X item unit</u>’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.</blockquote>
|-
|}
|}


===Typed statement unit===
A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):


#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.
 
 
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *’<u>weight statement unit</u>’* from Figure 6, serving as the reference statement unit for this *‘<u>typed statement unit</u>’*, and five instances of *<u>SEMUNIT:named-individual identification unit</u>*. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*<u>SEMUNIT:weight statement unit</u>*). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*<u>SEMUNIT:weight statement unit</u>*). This pattern template is employed for graphical displays of information from the reference statement unit.</blockquote>
|-
|}
|}
 
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.
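A display pattern for textual output can be thought of as a string template keyed by statement unit class. The Python sketch below is a hedged illustration: the template wording, the slot names, and the function <code>dynamic_label</code> are assumptions and do not reproduce the actual template shown in Fig. 9B.

```python
# Hypothetical display pattern for one statement unit class; the wording of
# the template is an assumption made for illustration.
DYNAMIC_LABEL_TEMPLATES = {
    "SEMUNIT:weight statement unit": "{subject} has a weight of {value} {unit}",
}


def dynamic_label(unit_class, labels):
    """Fill the class's template with labels taken from identification units."""
    return DYNAMIC_LABEL_TEMPLATES[unit_class].format(**labels)


# Labels as they might be provided by the associated identification units.
labels = {"subject": "apple X", "value": "204.56", "unit": "gram"}
label = dynamic_label("SEMUNIT:weight statement unit", labels)
print(label)
```

A dynamic mind-map pattern could follow the same principle, with the template describing nodes and edges instead of a sentence.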
 
===Item unit===
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property *<u>SEMUNIT:hasSemanticUnitSubject</u>*. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item''  concept in the Wikibase data model<ref name="MWWikibase24">{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}</ref>, but adapt the concept to statement units rather than triples.
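Deriving item units amounts to grouping statement units by their subject. The following Python sketch shows this under simple assumptions: statement units are represented as dictionaries, and the <code>ex:</code> identifiers and function names are hypothetical. The two <code>SEMUNIT</code> properties are those named in the text.

```python
from collections import defaultdict

# Hypothetical statement units, reduced to their UPRI and subject.
statement_units = [
    {"upri": "ex:su1", "subject": "ex:appleX"},
    {"upri": "ex:su2", "subject": "ex:appleX"},
    {"upri": "ex:su3", "subject": "ex:treeY"},
]


def build_item_units(units):
    """Group statement units by subject; one item unit per subject."""
    by_subject = defaultdict(list)
    for unit in units:
        by_subject[unit["subject"]].append(unit["upri"])
    return dict(by_subject)


def item_unit_triples(item_upri, subject, member_upris):
    """Semantic-units graph triples for one item unit."""
    triples = [(item_upri, "SEMUNIT:hasSemanticUnitSubject", subject)]
    triples += [
        (item_upri, "SEMUNIT:hasAssociatedSemanticUnit", member)
        for member in member_upris
    ]
    return triples


item_units = build_item_units(statement_units)
apple_triples = item_unit_triples(
    "ex:appleXItemUnit", "ex:appleX", item_units["ex:appleX"]
)
print(apple_triples)
```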
 
===Item group unit===
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.
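The connectedness condition can be checked with a breadth-first search over the subjects of the item units. In the Python sketch below, each link is a (subject, object) pair taken from a statement unit; the function <code>forms_item_group</code> and the <code>ex:</code> identifiers are hypothetical, and links are treated as undirected for the purpose of the check.

```python
from collections import deque


def forms_item_group(item_subjects, links):
    """Return True if at least two item units are connected via the links.

    item_subjects: subjects of the candidate item units.
    links: (subject, object) pairs from statement units.
    """
    subjects = set(item_subjects)
    if len(subjects) < 2:
        return False
    adjacency = {s: set() for s in subjects}
    for s, o in links:
        if s in subjects and o in subjects:
            adjacency[s].add(o)
            adjacency[o].add(s)
    start = next(iter(subjects))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == subjects


links = [("ex:hand", "ex:thumb"), ("ex:thumb", "ex:nail")]
print(forms_item_group(["ex:hand", "ex:thumb", "ex:nail"], links))
```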
 
===Granularity tree unit===
We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''<u>before</u>'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]
 
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.
 
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the root of the tree: the subject of those statement units whose objects are subjects of other statement units within the same granularity tree unit, but whose own subject is not the object of any other statement unit in that unit.
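Determining the subject of a granularity tree unit thus reduces to finding the one resource that occurs as a subject but never as an object. A minimal Python sketch, assuming each statement unit is reduced to a (subject, object) pair and using hypothetical <code>ex:</code> identifiers and a hypothetical function name:

```python
def granularity_tree_subject(statements):
    """Return the root: the only subject that is never an object.

    statements: (subject, object) pairs of the statement units in the tree.
    """
    subjects = {s for s, _ in statements}
    objects = {o for _, o in statements}
    roots = subjects - objects
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots.pop()


# Parthood statements, listed in arbitrary order.
parthood = [
    ("ex:arm", "ex:hand"),
    ("ex:body", "ex:arm"),
    ("ex:hand", "ex:thumb"),
]
print(granularity_tree_subject(parthood))
```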
 
===Granular item group unit===
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.
 
===Context unit===
The ''<u>isAbout</u>'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement *‘<u>description item unit</u>’* ''<u>isAbout</u>'' ‘<u>multicellular organism</u>’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).
 
===Dataset unit===
A dataset unit is an ordered set of semantic units. They can be employed to aggregate all data contributed by a specific institution in a collaborative project, document the state of a particular object at a given time, or store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as reference identifier.
 
===List unit===
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. A set unit, in turn, is an unordered list unit in which each resource is listed only once, adhering to a uniqueness restriction.
 
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a *<u>SEMUNIT:''child''</u>* relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''<u>index</u>'' (RDF:index).
 
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.
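The list units described above can be sketched in Python with membership entries that carry a child resource and, for ordered lists, an index. The dictionary layout, function names, and <code>ex:</code> identifiers are illustrative assumptions, not a prescribed serialization.

```python
def make_memberships(resources, ordered=False):
    """Create one membership entry per resource; ordered lists get an index."""
    memberships = []
    for position, resource in enumerate(resources, start=1):
        entry = {"child": resource}
        if ordered:
            entry["index"] = position
        memberships.append(entry)
    return memberships


def members_in_order(memberships):
    """Recover the sequence of an ordered list unit from its indices."""
    return [m["child"] for m in sorted(memberships, key=lambda m: m["index"])]


def is_set_unit(memberships):
    """A set unit lists each resource only once."""
    children = [m["child"] for m in memberships]
    return len(children) == len(set(children))


authors = ["ex:authorB", "ex:authorA", "ex:authorC"]
author_list = make_memberships(authors, ordered=True)
print(members_in_order(author_list))
```

Because the order is stored in the index rather than in the storage order of the entries, the author sequence survives any reordering of the membership statements in the graph.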
 
==Discussion==
===Benefits of organizing a knowledge graph into semantic units===
====Semantic units enhance data management flexibility through modularity====
The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.
 
====Semantic units operate at a higher level of abstraction than individual triples====
Semantically, they encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, all while maintaining the centrality of the statement unit itself and its triples in a single location within the graph.
 
The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches for graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research exploration.
 
===Semantic units as a framework for knowledge graph alignment===
The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.
 
===Managing restricted access to sensitive data===
The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.
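
As a rough illustration of class-based access restriction, the following Python sketch filters statement units by their statement unit class. It is a minimal sketch under stated assumptions, not part of the prototype: the in-memory `StatementUnit` structure, the `ex:` identifiers, the role names, and the two statement unit class names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping: statement unit classes flagged as sensitive,
# each with the set of user roles that may read instances of that class.
RESTRICTED_CLASSES = {
    "SEMUNIT:patient_diagnosis_statement_unit": {"clinician"},
}


@dataclass(frozen=True)
class StatementUnit:
    upri: str          # identifier of the statement unit (made-up values below)
    unit_class: str    # the statement unit class it instantiates
    triples: tuple     # the data graph of the statement unit


def visible_units(units, role):
    """Return the statement units that a user with the given role may read."""
    visible = []
    for unit in units:
        allowed_roles = RESTRICTED_CLASSES.get(unit.unit_class)
        # Unrestricted classes are visible to everyone.
        if allowed_roles is None or role in allowed_roles:
            visible.append(unit)
    return visible


units = [
    StatementUnit("ex:su1", "SEMUNIT:weight_measurement_statement_unit",
                  (("ex:p1", "ex:hasWeight", "70 kg"),)),
    StatementUnit("ex:su2", "SEMUNIT:patient_diagnosis_statement_unit",
                  (("ex:p1", "ex:hasDiagnosis", "ex:d1"),)),
]

print([u.upri for u in visible_units(units, "guest")])
print([u.upri for u in visible_units(units, "clinician")])
```

Because the restriction is attached to the class rather than to individual triples, a new sensitive statement type only requires one additional entry in the mapping.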
 
===Semantic units: A framework for nested and overlapping knowledge graph modules===
====Semantic units identify five levels of representational granularity====
Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.
 
 
[[File:Fig10 Vogt JofBiomedSem24 15.png|700px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation yields distinct levels, namely triples, statement units, item units, item group units, and the knowledge graph as a whole, providing a nuanced hierarchy of representational granularity within a knowledge graph.</blockquote>
|-
|}
|}
 
The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.
 
====Semantic units identify granularity trees====
Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]
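
The way has-part statement units could be collected into a parthood granularity tree unit can be sketched in Python as follows. This is only an illustration under assumptions: the tuple layout, the `ex:` identifiers, and the function name `parthood_tree_unit` are hypothetical, and the traversal simply follows has-part statements downward from a chosen root.

```python
from collections import defaultdict

# Hypothetical has-part statement units: (statement unit UPRI, whole, part).
has_part_units = [
    ("ex:su1", "ex:organism", "ex:head"),
    ("ex:su2", "ex:head", "ex:eye"),
    ("ex:su3", "ex:organism", "ex:leg"),
]


def parthood_tree_unit(root, units):
    """Collect the UPRIs of all has-part statement units reachable from root."""
    by_whole = defaultdict(list)
    for upri, whole, part in units:
        by_whole[whole].append((upri, part))

    members, stack, seen = [], [root], {root}
    while stack:
        whole = stack.pop()
        for upri, part in by_whole[whole]:
            members.append(upri)
            if part not in seen:  # guard against revisiting a resource
                seen.add(part)
                stack.append(part)
    return sorted(members)


print(parthood_tree_unit("ex:organism", has_part_units))
print(parthood_tree_unit("ex:head", has_part_units))
```

Choosing a different root yields a smaller tree, which mirrors how the same statement unit can be reached from several compound units.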
 
===Semantic units identify context-dependent subgraphs===
Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.
 
====Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units====
The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.
 
Moreover, if a knowledge graph should cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model this without having a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘''author_A -asserts-> statement_unit_Y''’; ‘''statement_unit_X -hasMetadata-> statement_unit_Z''’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).
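
The melting point example can be sketched in Python to show how a statement unit's identifier lets other triples refer to the statement as a whole. This is a minimal sketch under assumptions: triples are plain tuples, and the `ex:` identifiers and the predicates `ex:asserts`, `ex:resultOf`, and `ex:hasMeltingPoint` are hypothetical names chosen for illustration.

```python
# Data graph of one statement unit (ontological layer),
# keyed by the unit's hypothetical UPRI.
data_graphs = {
    "ex:statement_unit_Y": [("ex:lead", "ex:hasMeltingPoint", "327.5 °C")],
}

# Semantic-units graph (discursive layer): triples that mention
# the statement unit by its UPRI instead of repeating its content.
semantic_units_graph = [
    ("ex:statement_unit_Y", "rdf:type", "SEMUNIT:statement_unit"),
    ("ex:author_A", "ex:asserts", "ex:statement_unit_Y"),
    ("ex:statement_unit_Y", "ex:resultOf", "ex:experiment_X"),
]


def statements_about(unit_upri, graph):
    """Return discursive triples that use the given unit as subject or object."""
    return [t for t in graph if unit_upri in (t[0], t[2])]


for triple in statements_about("ex:statement_unit_Y", semantic_units_graph):
    print(triple)
```

The data graph itself stays untouched: the assertion and the experiment are attached to the unit's identifier, not to the individual triple about lead.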
 
 
[[File:Fig11 Vogt JofBiomedSem24 15.png|900px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="900px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.'' [61]</blockquote>
|-
|}
|}
 
The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.
 
 
[[File:Fig12 Vogt JofBiomedSem24 15.png|1300px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1300px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of <u>journal article</u> (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *<u>SEMUNIT:''hasLinkedSemanticUnit''</u>* property. The interconnected hierarchy extends to an <u>investigation</u> (OBI:0000066) instance, resulting in a <u>data set</u> (IAO:0000100) instance with a <u>description</u> (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of <u>multicellular organism</u> (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.</blockquote>
|-
|}
|}
 
===Implementation===
====Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications====
To initiate the structuring of a knowledge graph into semantic units, first, a layer of abstraction beyond the triple level must be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to one data graph of a statement unit. In RDF/OWL, statement units can be conceptualized like nanopublications.
 
Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format and semantics and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:
 
#the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;
#the assertion Named Graph, containing the assertion modelled as a graph;
#the provenance Named Graph, containing metadata about the assertion; and
#the publicationInfo Named Graph, containing metadata about the nanopublication itself.
 
The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).
 
A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.
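
The mapping of statement units and compound units onto the four Named Graphs can be sketched in Python with plain dictionaries. This is an illustrative sketch, not a nanopublication library: the graph-naming scheme (`#head`, `#assertion`, and so on), the `ex:` identifiers, and the property `SEMUNIT:hasAssociatedSemanticUnit` are assumptions, while the `np:` terms follow the nanopublication schema.

```python
def statement_unit_nanopub(upri, unit_class, data_graph, provenance, pubinfo):
    """Arrange a statement unit as the four Named Graphs of a nanopublication."""
    head = [
        (upri, "rdf:type", "np:Nanopublication"),
        (upri, "rdf:type", unit_class),
        (upri, "np:hasAssertion", upri + "#assertion"),
        (upri, "np:hasProvenance", upri + "#provenance"),
        (upri, "np:hasPublicationInfo", upri + "#pubinfo"),
    ]
    return {
        upri + "#head": head,                      # semantic-units graph
        upri + "#assertion": list(data_graph),     # data graph of the unit
        upri + "#provenance": list(provenance),    # metadata about the assertion
        upri + "#pubinfo": list(pubinfo),          # metadata about the nanopublication
    }


def compound_unit_nanopub(upri, unit_class, member_upris):
    """Compound unit: members are listed in the head graph, assertion stays empty."""
    nanopub = statement_unit_nanopub(upri, unit_class, [], [], [])
    nanopub[upri + "#head"] += [
        (upri, "SEMUNIT:hasAssociatedSemanticUnit", member)
        for member in member_upris
    ]
    return nanopub


pub = statement_unit_nanopub(
    "ex:su1", "SEMUNIT:has_part_statement_unit",
    [("ex:organism", "ex:hasPart", "ex:head")], [], [],
)
item_unit = compound_unit_nanopub("ex:iu1", "SEMUNIT:item_unit", ["ex:su1", "ex:su2"])
print(sorted(pub))
print(item_unit["ex:iu1#head"][-2:])
```

Since the compound unit only references its members by identifier, the triples of each statement unit remain stored in exactly one assertion graph.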
 
====Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs====
In Neo4j, a labeled property graph database, the assignment of UPRIs to all nodes and relations through a ‘''UPRI:upri''’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘''statement_unit_UPRI:upri''’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘''compound_unit_UPRI:upri''’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘''compound_unit_UPRI''’ property can incorporate an array of UPRIs representing different semantic units.
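
The property-value tagging described above can be sketched in plain Python, with nodes and relations modeled as dictionaries rather than through a Neo4j driver. This is a minimal sketch under assumptions: the property names follow the text, all UPRI values are made up, and both properties are stored as arrays here so that an element can belong to several units.

```python
def tag(element, key, upri):
    """Append a unit UPRI to an array-valued property, avoiding duplicates."""
    values = element.setdefault(key, [])
    if upri not in values:
        values.append(upri)


def members(elements, key, upri):
    """Return the UPRIs of all nodes and relations tagged with the given unit."""
    return [e["UPRI"] for e in elements if upri in e.get(key, [])]


# Two nodes and one relation forming a single has-part statement.
organism = {"UPRI": "ex:n1"}
head = {"UPRI": "ex:n2"}
has_part = {"UPRI": "ex:r1", "start": "ex:n1", "end": "ex:n2", "type": "hasPart"}
elements = [organism, head, has_part]

for element in elements:
    tag(element, "statement_unit_UPRI", "ex:su1")
    # The same statement unit is associated with two compound units.
    tag(element, "compound_unit_UPRI", "ex:item_unit_1")
    tag(element, "compound_unit_UPRI", "ex:granularity_tree_1")

print(members(elements, "statement_unit_UPRI", "ex:su1"))
print(has_part["compound_unit_UPRI"])
```

In an actual graph database the `members` lookup would correspond to a query that matches on the property value; the dictionary version only shows the data layout.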
 
A first demonstration application has been developed by one of the authors, illustrating how semantic units can manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application uses the organization of the graph into semantic units to build a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.
 
 
[[File:Fig13 Vogt JofBiomedSem24 15.png|1000px]]
{{clear}}
{|
| style="vertical-align:top;" |
{| border="0" cellpadding="5" cellspacing="0" width="1000px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.</blockquote>
|-
|}
|}
 
====Strategies for implementation====
Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:
 
#'''Develop from scratch''': In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.
#'''Transfer an existing knowledge graph''': If an existing knowledge graph needs restructuring into semantic units, the next step is to craft queries that transfer all triples into corresponding statement units, based on the graph schemata identified in the first step. The main challenge is maintaining disjointness of triples between statement units.
#'''A hybrid approach''': For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.
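
Two of the rules mentioned in the strategies above, grouping statement units by shared subject into item units and checking that no triple ends up in more than one statement unit, can be sketched in Python. The dictionary layout, the `ex:` identifiers, and the function names are hypothetical; this is an illustration, not the prototype's code.

```python
from collections import defaultdict

# Hypothetical statement units: UPRI -> subject resource and data graph.
statement_units = {
    "ex:su1": {"subject": "ex:organism",
               "triples": [("ex:organism", "ex:hasPart", "ex:head")]},
    "ex:su2": {"subject": "ex:organism",
               "triples": [("ex:organism", "ex:hasWeight", "ex:w1"),
                           ("ex:w1", "ex:hasValue", "70")]},
    "ex:su3": {"subject": "ex:head",
               "triples": [("ex:head", "ex:hasPart", "ex:eye")]},
}


def item_units(units):
    """Group statement unit UPRIs by subject resource (one item unit per subject)."""
    groups = defaultdict(list)
    for upri, unit in units.items():
        groups[unit["subject"]].append(upri)
    return dict(groups)


def shared_triples(units):
    """Return triples occurring in more than one statement unit (should be empty)."""
    owners = defaultdict(set)
    for upri, unit in units.items():
        for triple in unit["triples"]:
            owners[triple].add(upri)
    return {t: sorted(o) for t, o in owners.items() if len(o) > 1}


print(item_units(statement_units))
print(shared_triples(statement_units))
```

A non-empty result from `shared_triples` would indicate that the transfer queries assigned a triple to two statement units, which violates the partition requirement.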
 
====Semantic units as FAIR Digital Objects====
The concept of FAIR Digital Objects, as proposed by the European Commission Expert Group on FAIR Data, stands at the core of achieving the FAIR Principles [67], emphasizing persistent identifiers, comprehensive metadata, and contextual documentation for reliable discovery, citation, and reuse. The concept of semantic units aligns with that of FAIR Digital Objects. Each semantic unit inherently possesses a UPRI, serving as a ready-made persistent identifier. Accessibility and searchability are ensured through established query languages and protocols like SPARQL and Cypher, with RDF, JSON, and other formats supporting data export. When knowledge graphs adhere to controlled vocabularies and ontologies, and when they employ standard graph-patterns using tools like SHACL [51], ShEx [68, 69], or OTTR [70, 71], the data within the data graphs of semantic units may more easily achieve semantic interoperability.
 
Moreover, semantic units can provide provenance—crucial for tracking a semantic unit’s history—through utilizing property-value pairs for labeled property knowledge graphs or a designated provenance Named Graph for RDF/OWL knowledge graphs. The provenance metadata of a semantic unit encompasses details like the creator, creation date, application used, title, contributing users, and last-update, focusing solely on the semantic unit itself, not the original data production process.
 
Access control metadata can specify any licenses as well as access control restrictions.
 
==Conclusion and future work==
In conclusion, the adoption of semantic units in structuring knowledge graphs may be useful to address the challenges faced in knowledge representation mentioned in the introduction. By encapsulating each statement within its dedicated statement unit, accompanied by a corresponding statement unit class and data schema (e.g., as a SHACL shape), a robust foundation for FAIR data and metadata is established, supporting schematic interoperability. Because statement units partition the knowledge graph so that every triple belongs to exactly one statement unit and every statement unit’s subgraph is identifiable and referenceable through its UPRI, data in a knowledge graph is linked to graph patterns, which are identifiable as a whole. By providing each schema its own UPRI, each semantic unit can specify its underlying schema in its metadata. Identifying semantically interoperable semantic units is then straightforward, and schema crosswalks between different schemata can increase schematic interoperability. [72] (This addresses Challenge 1.)
 
Graph query languages can use the graph patterns (semantic units), and therefore allow access to knowledge graph content through higher levels of abstraction than basic triples. (This addresses Challenge 2.) Further, we have shown how semantic units can organize knowledge graphs in different layers and make statements about statements. (This addresses Challenge 3.)
 
Future research involves extending the semantic units approach to incorporate question units and a nuanced categorization of assertional, contingent, prototypical, and universal statement units. This extension will encompass formal semantics for the latter, including provisions for negations and cardinality restrictions. Additionally, we are exploring novel approaches to knowledge graph exploration based on semantic units.
 
==Abbreviations, acronyms, and initialisms==
 
*'''BFO''': Basic Formal Ontology
*'''CRUD''': Create, Read, Update, Delete
*'''FAIR''': Findable, Accessible, Interoperable, and Reusable
*'''HTTP''': Hypertext Transfer Protocol
*'''HTTPS''': Hypertext Transfer Protocol Secure
*'''IAO''': Information Artifact Ontology
*'''ID''': Identifier
*'''JSON''': JavaScript Object Notation
*'''LinkML''': Linked Data Modeling Language
*'''NCIT''': National Cancer Institute Thesaurus
*'''NoSQL''': Not only Structured Query Language
*'''OBI''': Ontology for Biomedical Investigations
*'''OBOE''': Extensible Observation Ontology
*'''OBO Foundry''': Open Biological and Biomedical Ontology Foundry
*'''OTTR''': Reasonable Ontology Templates
*'''OWL''': Web Ontology Language
*'''PATO''': Phenotype and Trait Ontology
*'''RDF''': Resource Description Framework
*'''RDFS''': RDF-Schema
*'''RO''': OBO Relations Ontology
*'''SHACL''': Shapes Constraint Language
*'''ShEx''': Shape Expressions
*'''SIO''': Semanticscience Integrated Ontology
*'''SPARQL''': SPARQL Protocol and RDF Query Language
*'''TI''': Time Ontology in OWL
*'''TRUST''': Transparency, Responsibility, User Focus, Sustainability, and Technology
*'''UBERON''': Uber-anatomy ontology
*'''UO''': Units of Measurement Ontology
*'''UPRI''': Unique Persistent and Resolvable Identifier
*'''XSD''': Extensible Markup Language Schema Definition
 
==Footnotes==
{{reflist|group=lower-alpha}}
==Acknowledgements==
We thank Werner Ceusters, Nico Matentzoglu, Manuel Prinz, Marcel Konrad, Philip Strömert, Roman Baum, Björn Quast, Peter Grobe, István Míko, Manfred Jeusfeld, Manolis Koubarakis, Javad Chamanara, and Kheir Eddine for discussing some of the presented ideas. We also thank the anonymous reviewers for their suggestions and feedback. We are solely responsible for all the arguments and statements in this paper.
===Author contributions===
L.V. developed the concept of semantic units and wrote the initial manuscript text. All authors reviewed and revised the manuscript.
===Funding===
Open Access funding enabled and organized by Projekt DEAL. Lars Vogt received funding from the ERC H2020 Project ‘ScienceGraph’ (819536).
===Conflict of interest===
The authors declare no competing interests.


==References==

Latest revision as of 22:35, 17 June 2024

Full article title: Semantic units: Organizing knowledge graphs into semantically meaningful units of representation
Journal: Journal of Biomedical Semantics
Author(s): Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert
Author affiliation(s): TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology
Primary contact: Email: lars dot m dot vogt at googlemail dot com
Year published: 2024
Volume and issue: 15
Article #: 7
DOI: 10.1186/s13326-024-00310-5
ISSN: 2041-1480
Distribution license: Creative Commons Attribution 4.0 International
Website: https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5
Download: https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf (PDF)

Abstract

Background: In today’s landscape of data management, the importance of knowledge graphs and ontologies is escalating as critical mechanisms aligned with the FAIR Guiding Principles ask that research data and metadata be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.

Results: We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and semantically meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.

Conclusions: Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive information, and that it also supports the differentiation of multiple frames of reference within the graph.

Keywords: FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity

Background

In an era marked by the exponential generation of data[1][2][3], both technically and socially intricate challenges have emerged[4], necessitating innovative approaches to data representation and management in science and industry. The growing volume of produced data requires systems capable of collecting, integrating, and analyzing extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges.[5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of "data visiting" rather than conventional "data sharing."[6]

From the standpoint of data representation and management, meeting these demands relies on adherence to the FAIR Guiding Principles, which ask for research data and metadata to be readily findable, accessible, interoperable, and reusable for machines and humans alike.[7] Failure to achieve FAIRness risks transforming big data into opaque dark data.[8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science[9] but also addresses broader concerns regarding the trustworthiness of information (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology[10]).

To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested.[11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles.[12][13] The key lies in furnishing comprehensive, machine-actionable[a] data and metadata, complemented by human-readable interfaces and search capabilities.

Knowledge graphs can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata.[15] Knowledge graphs are particularly applied in the context of semantic search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics.[16]

The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional relational or other NoSQL models. These include

  • an intuitive way for modelling relations;
  • the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;
  • incorporation of machine-actionable knowledge representation formalisms like ontologies and rules;
  • deployment of graph analytics and machine learning (ML); and
  • utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths.[17][18][19][20][21][22][23]

Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general.[24][25][26][27][28]

Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.

The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the Subject position, have demonstrated noteworthy enhancements in space and query performance[29][30] (see also the related concept of RDF molecules[31][32]), but they do not fully mitigate the challenges outlined below.

The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of phenotypes into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity.[33] Each such subgraph is organized in its own "Named Graph" and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the Resource Description Framework (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units (statement units and compound units) are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables their referencing within triples, facilitating an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.



Figure 1. Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.

In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.

Conventions used in this paper

In this paper, the term "knowledge graph" denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an n-tuples syntax and more advanced logics (e.g., First Order Logic)[34][35], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.

Throughout this text, regular underlined text indicates ontology classes, while italic underlined text refers to properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., isAbout [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with asterisks (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *SEMUNIT:metric measurement statement unit*. These will be part of a future Semantic Unit ontology. We use single quotation marks around regular underlined text to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.

The term "resource" signifies something that is uniquely designated, for example by a Uniform Resource Identifier (URI), and about which informative statements are made. It thus stands for something you want to talk about. In RDF, the Subject and the Predicate in a triple are always resources, whereas the Object can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the Predicate position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.

To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term "triple" refers specifically to a triple statement, while "statement" pertains to a natural language statement, establishing a clear distinction between the two.

Methods

Problem statement

Challenge 1: Ensuring schematic interoperability for FAIR empirical data

In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, it is not enough for the terms employed in data and metadata statements to carry identifiers from controlled vocabularies such as ontologies, which ensures terminological interoperability. The same attention must be paid to the semantic graph patterns underlying each statement. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.

Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).



Figure 2. Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI)[36] of the Open Biological and Biomedical Ontology Foundry (OBO).


Figure 3. Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE)[37], an ontology frequently used in the ecology community.

Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.

Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.
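
As a minimal sketch of this idea, the following Python snippet groups subgraphs by the graph-pattern identifier recorded in their metadata, so that candidates for schematic interoperability can be found by a simple lookup. The identifiers (`ex:graph1`, `ex:pattern/weight-obi`, and so on) and the record layout are illustrative assumptions and are not taken from any published vocabulary.

```python
from collections import defaultdict

# Hypothetical metadata records: each subgraph names the graph pattern it follows.
subgraph_metadata = [
    {"subgraph": "ex:graph1", "pattern_id": "ex:pattern/weight-obi"},
    {"subgraph": "ex:graph2", "pattern_id": "ex:pattern/weight-oboe"},
    {"subgraph": "ex:graph3", "pattern_id": "ex:pattern/weight-obi"},
]


def group_by_pattern(records):
    """Group subgraph identifiers by the graph-pattern identifier in their metadata."""
    groups = defaultdict(list)
    for record in records:
        groups[record["pattern_id"]].append(record["subgraph"])
    return dict(groups)


# Subgraphs listed under the same key share a graph pattern.
interoperable = group_by_pattern(subgraph_metadata)
```

Subgraphs that end up under different keys, such as the OBI-style and OBOE-style weight patterns here, would need a mapping between patterns before their data can be combined.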

Challenge 2: Overcoming barriers in graph query language adoption

Another significant challenge arises when searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs are RDF/OWL and labeled property graphs such as Neo4j. Interacting directly with these graphs, including CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, requires a query language: SPARQL[38] for RDF/OWL, for example, and Cypher[39] for Neo4j.

Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs.[40] Furthermore, query languages are not aware of graph patterns.

This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.

Challenge 3: Addressing complexities in making statements about statements

The RDF triple syntax of Subject, Predicate, and Object allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).



Figure 4. A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of mass measurement assay (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., "apple X"), along with specifying the resultant data encapsulated in a particular weight measurement assertion.

In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its Subject, Predicate, and Object. Alternatively, the RDF-star approach can be employed.[41] Both methods increase the complexity of the represented graph.
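
The growth in graph size can be made concrete with a short sketch. The function below uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) and represents triples as plain Python tuples. Besides the three describing triples it also emits the usual `rdf:type` triple that types the statement resource. The example triple, the statement identifier, and the `ex:assertedBy` property are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_id, triple):
    """Describe `triple` through a statement resource using the RDF reification vocabulary."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


# Hypothetical triple and statement identifier, for illustration only.
triple = ("ex:appleX", "ex:hasWeightInGrams", "204.56")
reified = reify("ex:statement1", triple)

# A statement about the statement (hypothetical property).
about = ("ex:statement1", "ex:assertedBy", "ex:alice")
```

A single triple thus requires several extra triples before anything can be said about it, and a statement made of several triples would need this treatment for each of them.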

In cases like this, the adoption of Named Graphs is an alternative to RDF reification or RDF-star. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.
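
Both variants can be sketched in a few lines of Python. Triples and quads are modelled as tuples, and property-graph nodes and relations as dictionaries. The graph URI, the predicates, and the property key `subgraph_id` are assumptions chosen for illustration.

```python
def to_quads(triples, graph_uri):
    """Attach the Named Graph URI as a fourth element to every triple."""
    return [(s, p, o, graph_uri) for (s, p, o) in triples]


def tag_subgraph(elements, subgraph_id, key="subgraph_id"):
    """Property-graph variant: add the same property-value pair to every node and relation."""
    return [{**element, key: subgraph_id} for element in elements]


# Hypothetical, simplified triples for a weight statement.
triples = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
]
quads = to_quads(triples, "ex:graph/weight-1")

# Hypothetical property-graph nodes belonging to the same subgraph.
nodes = [{"id": "appleX"}, {"id": "weightX"}]
tagged_nodes = tag_subgraph(nodes, "ex:graph/weight-1")
```

In both cases the subgraph can afterwards be retrieved by filtering on a single identifier, and that identifier can itself appear in further triples.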

Results

Semantic unit

We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces "semantic units." Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.

Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *SEMUNIT:weight statement unit* as it is illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘weight’ (PATO:0000128) isQualityMeasuredAs (IAO:0000417) ‘scalar measurement datum’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, lacks semantic meaningfulness for a domain expert who has no background in semantics.

Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.



Figure 5. Classification of different categories of semantic units.

The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the Subject or Object position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.



Figure 6. Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with "apple X" as the subject and "gram X" alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *SEMUNIT:weight statement unit* class, with "apple X" identified as the subject. Notably, the UPRI of *‘weight statement unit’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).

Statement unit: A proposition in the knowledge graph

A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for "apple X" illustrated in Fig. 6 represents a statement unit.

Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.
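
The partition property can be checked mechanically. The sketch below assumes that the data graph layer is available as a collection of triples and that each statement unit is given as a list of its triples. The identifiers and predicates are hypothetical.

```python
def is_partition(data_graph, statement_units):
    """Return True if the statement units partition the data graph layer."""
    seen = set()
    for triples in statement_units.values():
        if not triples:
            return False  # every statement unit contains one or more triples
        for triple in triples:
            if triple in seen:
                return False  # a triple belongs to exactly one statement unit
            seen.add(triple)
    # Merging all statement units must reproduce the complete data graph.
    return seen == set(data_graph)


# Hypothetical data graph with four triples.
t1 = ("ex:appleX", "ex:hasQuality", "ex:weightX")
t2 = ("ex:weightX", "ex:hasValue", "204.56")
t3 = ("ex:weightX", "ex:hasUnit", "ex:gramX")
t4 = ("ex:appleX", "ex:hasPart", "ex:stemX")
data_graph = [t1, t2, t3, t4]

valid = is_partition(data_graph, {"ex:weightUnit": [t1, t2, t3], "ex:partUnit": [t4]})
overlapping = is_partition(data_graph, {"ex:a": [t1, t2, t3], "ex:b": [t3, t4]})
incomplete = is_partition(data_graph, {"ex:weightUnit": [t1, t2, t3]})
```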

We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.

To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *SEMUNIT:hasSemanticUnitSubject*, which is documented in the semantic-units graph of the statement unit.
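
A minimal sketch of how such a weight statement unit could be laid out is given below, using quads for the data graph and plain triples for the semantic-units graph. The UPRI, the simplified predicates in the data graph, and the `ex:createdBy` property are assumptions made for illustration. Only *SEMUNIT:hasSemanticUnitSubject* and the class *SEMUNIT:weight statement unit* come from the text.

```python
UNIT = "ex:weight-statement-unit-1"  # hypothetical UPRI of the statement unit

# Data graph: quads whose fourth element is the unit's UPRI
# (simplified, hypothetical predicates rather than the full OBI pattern).
data_graph = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX", UNIT),
    ("ex:weightX", "ex:hasValue", "204.56", UNIT),
    ("ex:weightX", "ex:hasUnit", "ex:gramX", UNIT),
]

# Semantic-units graph: triples with the unit resource in the Subject position.
semantic_units_graph = [
    (UNIT, "rdf:type", "SEMUNIT:weight statement unit"),
    (UNIT, "SEMUNIT:hasSemanticUnitSubject", "ex:appleX"),
    (UNIT, "ex:createdBy", "ex:alice"),  # a statement about the statement
]


def data_graph_of(upri, quads):
    """Return the triples of the data graph identified by `upri`."""
    return [(s, p, o) for (s, p, o, g) in quads if g == upri]


def subject_of(upri, triples):
    """Return the subject of the semantic unit, or None if it is not recorded."""
    for s, p, o in triples:
        if s == upri and p == "SEMUNIT:hasSemanticUnitSubject":
            return o
    return None
```

Because the same identifier names both the unit resource and its data graph, `data_graph_of(UNIT, data_graph)` and `subject_of(UNIT, semantic_units_graph)` can be answered from one UPRI.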

In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like "This right hand has as a part this right thumb"—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where Predicates of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in n-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like "This right hand has part this right thumb on January 29th 2022," "Anna gives Bob a book," and "Carla travels by train from Paris to Berlin on the 29th of June 2022," each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘weight X’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the n-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the n-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, "b located_in c at t" mandates at least two triples for accurate representation in RDF.

The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as has part, type, and develops from.

A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit[b] is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see "typed statement unit" and "dynamic label" in Fig. 9, later).



Figure 7. Examples of two different types of identification units. A) Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘apple X’ (NCIT:C71985) instance. The subject, "apple X," is connected to its class through the property type (RDF:type), while its label "apple X" is conveyed via the property label (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. B) Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘apple’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.

Compound unit: A collection of propositions

Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *SEMUNIT:hasAssociatedSemanticUnit* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.



Figure 8. Example of a compound unit, denoted as *‘apple X item unit’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.
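
The indirect data graph of a compound unit can be derived by following *SEMUNIT:hasAssociatedSemanticUnit* links and merging the data graphs found along the way. The sketch below assumes these links are available as a dictionary and that they contain no cycles. All unit identifiers and triples are hypothetical.

```python
def merged_data_graph(unit, associated_units, data_graphs):
    """Merge the data graphs of `unit` and of all semantic units associated with it.

    Assumes the association links are acyclic.
    """
    triples = set(data_graphs.get(unit, ()))
    for child in associated_units.get(unit, ()):
        triples |= merged_data_graph(child, associated_units, data_graphs)
    return triples


# Hypothetical hasAssociatedSemanticUnit links of a compound unit.
associated_units = {
    "ex:appleX-item-unit": ["ex:weight-unit", "ex:part-unit"],
}

# Hypothetical data graphs of the two statement units.
data_graphs = {
    "ex:weight-unit": [
        ("ex:appleX", "ex:hasQuality", "ex:weightX"),
        ("ex:weightX", "ex:hasValue", "204.56"),
    ],
    "ex:part-unit": [("ex:appleX", "ex:hasPart", "ex:stemX")],
}

item_graph = merged_data_graph("ex:appleX-item-unit", associated_units, data_graphs)
```

The compound unit stores no triples of its own in the data graph layer. Its data graph is computed from its associated units whenever it is needed.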

Typed statement unit

A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):

  1. A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.
  2. Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.



Figure 9. Typed statement unit with dynamic label and dynamic mind-map pattern. A) Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *‘weight statement unit’* from Figure 6, serving as the reference statement unit for this *‘typed statement unit’*, and five instances of *SEMUNIT:named-individual identification unit*. B) Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*SEMUNIT:weight statement unit*). This dynamic label template is utilized for textual displays of information from the reference statement unit. C) Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*SEMUNIT:weight statement unit*). This pattern template is employed for graphical displays of information from the reference statement unit.

Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.
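
A dynamic label can be thought of as a string template whose slots are filled with labels from the identification units. The sketch below is one possible reading of that mechanism. The template wording, the slot names, and the identifiers are hypothetical and are not taken from Fig. 9B.

```python
# Hypothetical display template attached to the weight statement unit class.
WEIGHT_LABEL_TEMPLATE = "{subject} has a weight of {value} {unit}"

# Labels as they would be provided by the identification units (hypothetical).
labels = {"ex:appleX": "apple X", "ex:gramX": "gram"}

# Slot values taken from the data graph of the reference statement unit.
slots = {"subject": "ex:appleX", "value": "204.56", "unit": "ex:gramX"}


def render_dynamic_label(template, slots, labels):
    """Fill the template, replacing resource identifiers by their labels.

    Literals and resources without a known label are inserted unchanged.
    """
    resolved = {name: labels.get(value, value) for name, value in slots.items()}
    return template.format(**resolved)


dynamic_label = render_dynamic_label(WEIGHT_LABEL_TEMPLATE, slots, labels)
```

A mind-map pattern could follow the same principle, with the template describing nodes and edges instead of a sentence.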

Item unit

An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property *SEMUNIT:hasSemanticUnitSubject*. Conceptually, item units align with the graph-per-resource data management pattern [52] or the previously mentioned characteristic set or RDF molecule, and they are akin to the Item concept in the Wikibase data model[42], but adapt the concept to statement units rather than triples.
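
Since item units are defined by a shared subject, they can be derived by grouping statement units on that subject. The following sketch assumes a mapping from statement unit identifiers to their subjects, as it could be read from *SEMUNIT:hasSemanticUnitSubject* triples. All identifiers are hypothetical.

```python
from collections import defaultdict


def build_item_units(unit_subjects):
    """Map each subject to the sorted list of statement units that have it as their subject."""
    items = defaultdict(list)
    for unit, subject in unit_subjects.items():
        items[subject].append(unit)
    return {subject: sorted(units) for subject, units in items.items()}


# Hypothetical statement units and their subjects.
unit_subjects = {
    "ex:weight-unit-1": "ex:appleX",
    "ex:part-unit-1": "ex:appleX",
    "ex:weight-unit-2": "ex:appleY",
}

item_units = build_item_units(unit_subjects)
```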

Item group unit

An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.
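
One way to find candidate item group units is to treat item units as nodes, link two of them whenever a statement unit of one has the subject of the other among its objects, and collect the connected components with at least two members. Taking maximal connected components is an assumption of this sketch, since the text only requires that the merged graph be connected. The identifiers are hypothetical.

```python
from collections import defaultdict, deque


def item_groups(statement_units):
    """Return sets of subjects whose item units are connected through statement units.

    `statement_units` is a list of (subject, objects) pairs.
    """
    subjects = {subject for subject, _ in statement_units}
    neighbours = defaultdict(set)
    for subject, objects in statement_units:
        for obj in objects:
            if obj in subjects and obj != subject:
                neighbours[subject].add(obj)
                neighbours[obj].add(subject)

    groups, visited = [], set()
    for start in sorted(subjects):
        if start in visited:
            continue
        component, queue = set(), deque([start])
        visited.add(start)
        while queue:  # breadth-first search over linked item units
            node = queue.popleft()
            component.add(node)
            for other in neighbours[node]:
                if other not in visited:
                    visited.add(other)
                    queue.append(other)
        if len(component) >= 2:  # an item group unit needs at least two item units
            groups.append(component)
    return groups


# Hypothetical statement units: (subject, objects).
units = [
    ("ex:organism", ["ex:head"]),
    ("ex:head", ["ex:eye"]),
    ("ex:eye", ["blue"]),
    ("ex:appleX", ["204.56", "ex:gramX"]),
]
groups = item_groups(units)
```

In this example the organism, its head, and the eye form one group, while the item unit of the apple stays on its own because none of its objects is the subject of another item unit.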

Granularity tree unit

We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like before (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]
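
For a finite relation, the defining properties can be tested directly. The sketch below uses the standard order-theoretic definition of a partial order (reflexive, antisymmetric, and transitive) and represents a relation as a set of pairs. The parthood example is hypothetical and includes the reflexive and transitive pairs explicitly, which a knowledge graph would typically leave to inference.

```python
def is_partial_order(elements, relation):
    """Check reflexivity, antisymmetry, and transitivity of a finite relation."""
    reflexive = all((x, x) in relation for x in elements)
    antisymmetric = all(x == y or (y, x) not in relation for (x, y) in relation)
    transitive = all(
        (x, z) in relation
        for (x, y) in relation
        for (w, z) in relation
        if y == w
    )
    return reflexive and antisymmetric and transitive


elements = {"eye", "head", "organism"}

# Hypothetical part-of relation, closed under reflexivity and transitivity.
part_of = {
    ("eye", "eye"), ("head", "head"), ("organism", "organism"),
    ("eye", "head"), ("head", "organism"), ("eye", "organism"),
}

closed_is_order = is_partial_order(elements, part_of)
unclosed_is_order = is_partial_order(elements, part_of - {("eye", "organism")})
```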

Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.

Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the root of the tree: the subject of those statement units whose objects are subjects of other statement units in the same granularity tree unit, but whose own subject is not an object of any of them.
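
Read this way, the subject of a granularity tree unit is the one resource that occurs as a subject but never as an object within the tree. A minimal sketch, assuming each statement unit of the tree is reduced to a (subject, object) pair and using hypothetical identifiers:

```python
def granularity_tree_subject(statement_units):
    """Return the root of a granularity tree given (subject, object) pairs.

    The root occurs as a subject but never as an object within the tree.
    """
    subjects = {subject for subject, _ in statement_units}
    objects = {obj for _, obj in statement_units}
    roots = subjects - objects
    if len(roots) != 1:
        raise ValueError("statement units do not form a single rooted tree")
    return next(iter(roots))


# Hypothetical has-part statement units of one parthood granularity tree.
has_part_units = [
    ("ex:organism", "ex:head"),
    ("ex:organism", "ex:trunk"),
    ("ex:head", "ex:eye"),
]

tree_subject = granularity_tree_subject(has_part_units)
```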

Granular item group unit

A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.

Context unit

The isAbout property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement *‘description item unit’* isAbout ‘multicellular organism’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).

Dataset unit

A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution in a collaborative project, document the state of a particular object at a given time, or store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.

List unit

In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. A set unit is an unordered list unit where each resource is listed only once, adhering to a uniqueness restriction.

From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a *SEMUNIT:child* relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property index (RDF:index).
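
The sketch below models membership statement units as small records holding the list unit's UPRI, the member resource reached through *SEMUNIT:child*, and an index for ordered lists. The record layout and all identifiers are assumptions made for illustration.

```python
# Hypothetical membership statement units of an ordered list unit (authors of a paper).
membership_units = [
    {"list": "ex:authors-of-paper-1", "child": "ex:bob", "index": 2},
    {"list": "ex:authors-of-paper-1", "child": "ex:anna", "index": 1},
    {"list": "ex:authors-of-paper-1", "child": "ex:carla", "index": 3},
]


def ordered_members(list_upri, units):
    """Return the members of an ordered list unit, sorted by their index."""
    members = [unit for unit in units if unit["list"] == list_upri]
    return [unit["child"] for unit in sorted(members, key=lambda unit: unit["index"])]


def satisfies_set_restriction(list_upri, units):
    """Check the uniqueness restriction of a set unit: no resource is listed twice."""
    children = [unit["child"] for unit in units if unit["list"] == list_upri]
    return len(children) == len(set(children))


authors = ordered_members("ex:authors-of-paper-1", membership_units)
unique = satisfies_set_restriction("ex:authors-of-paper-1", membership_units)
```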

List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.

Discussion

Benefits of organizing a knowledge graph into semantic units

Semantic units enhance data management flexibility through modularity

The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.

Semantic units operate at a higher level of abstraction than individual triples

Semantically, they encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, all while maintaining the centrality of the statement unit itself and its triples in a single location within the graph.

The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches for graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research.

Semantic units as a framework for knowledge graph alignment

The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.

Managing restricted access to sensitive data

The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.

Semantic units: A framework for nested and overlapping knowledge graph modules

Semantic units identify five levels of representational granularity

Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.



Figure 10. Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, item group units, and the knowledge graph as a whole, providing a nuanced hierarchy of representational granularity within a knowledge graph.

The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.

Semantic units identify granularity trees

Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]
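As a rough illustration of how a parthood granularity tree could be derived from has-part statement units, consider the sketch below. It is an assumption-laden toy: the UPRIs, the `HAS_PART_UNITS` list, and the function name are invented for the example, and the input is assumed to be acyclic, as expected for a partial order relation.

```python
from collections import defaultdict

# Hypothetical has-part statement units: (statement unit UPRI, whole, part).
HAS_PART_UNITS = [
    ("ex:su1", "ex:organism", "ex:head"),
    ("ex:su2", "ex:head", "ex:eye"),
    ("ex:su3", "ex:organism", "ex:trunk"),
]


def build_granularity_tree(units, root):
    """Return the nested parthood tree below root and the statement units that form it."""
    children = defaultdict(list)
    for upri, whole, part in units:
        children[whole].append((upri, part))

    member_units = []

    def descend(node):
        # Assumes the has-part statements contain no cycles.
        subtree = {}
        for upri, part in children.get(node, []):
            member_units.append(upri)
            subtree[part] = descend(part)
        return subtree

    return {root: descend(root)}, member_units


tree, tree_unit_members = build_granularity_tree(HAS_PART_UNITS, "ex:organism")
```

Here `tree_unit_members` plays the role of the granularity tree unit (the statement units it collects), while the resources appearing in `tree` correspond to the entities of the granular item group unit.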

Semantic units identify context-dependent subgraphs

Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.

Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units

The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.

Moreover, if a knowledge graph is to cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model this without having a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘author_A -asserts-> statement_unit_Y’; ‘statement_unit_X -hasMetadata-> statement_unit_Z’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).



Figure 11. A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt et al. [61]

The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.



Figure 12. Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of journal article (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *SEMUNIT:hasLinkedSemanticUnit* property. The interconnected hierarchy extends to an investigation (OBI:0000066) instance, resulting in a data set (IAO:0000100) instance with a description (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of multicellular organism (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.

Implementation

Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications

To initiate the structuring of a knowledge graph into semantic units, first, a layer of abstraction beyond the triple level must be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to one data graph of a statement unit. In RDF/OWL, statement units can be conceptualized like nanopublications.

Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format and semantics and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:

  1. the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;
  2. the assertion Named Graph, containing the assertion modelled as a graph;
  3. the provenance Named Graph, containing metadata about the assertion; and
  4. the publicationInfo Named Graph, containing metadata about the nanopublication itself.

The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).

A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.
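The mapping of semantic units onto the four Named Graphs can be made concrete with a small serializer that emits TriG text. This is a sketch under stated assumptions rather than a reference implementation: the `np:` terms come from the nanopublication schema, whereas the `sub:` and `ex:` prefixes, the function names, and the example triples are hypothetical, and terms are passed in as ready-made prefixed names or quoted literals.

```python
NP_NS = "http://www.nanopub.org/nschema#"  # nanopublication schema namespace


def _graph_block(name, triples):
    """Render one Named Graph as a TriG block."""
    lines = [f"    {s} {p} {o} ." for s, p, o in triples]
    return name + " {\n" + "\n".join(lines) + "\n}"


def semantic_unit_to_trig(semantic_units_graph, assertion, provenance, pubinfo):
    """Serialize a semantic unit as the four Named Graphs of a nanopublication."""
    # The head links the other three graphs and, following the text above,
    # also carries the unit's semantic-units graph.
    head = [
        ("sub:unit", "a", "np:Nanopublication"),
        ("sub:unit", "np:hasAssertion", "sub:assertion"),
        ("sub:unit", "np:hasProvenance", "sub:provenance"),
        ("sub:unit", "np:hasPublicationInfo", "sub:pubinfo"),
    ] + list(semantic_units_graph)
    prefixes = [
        f"@prefix np: <{NP_NS}> .",
        "@prefix sub: <http://example.org/unit1#> .",  # hypothetical base IRI of this unit
        "@prefix ex: <http://example.org/> .",         # hypothetical vocabulary
    ]
    blocks = [
        _graph_block("sub:head", head),
        _graph_block("sub:assertion", assertion),
        _graph_block("sub:provenance", provenance),
        _graph_block("sub:pubinfo", pubinfo),
    ]
    return "\n".join(prefixes) + "\n\n" + "\n\n".join(blocks) + "\n"


trig = semantic_unit_to_trig(
    semantic_units_graph=[("sub:unit", "a", "ex:MeltingPointStatementUnit")],
    assertion=[("ex:lead", "ex:hasMeltingPointCelsius", '"327.5"')],
    provenance=[("sub:assertion", "ex:resultOf", "ex:experimentX")],
    pubinfo=[("sub:unit", "ex:createdBy", "ex:authorA")],
)
```

For a compound unit, the same function could be called with an empty `assertion` list and a `semantic_units_graph` that links the compound unit's UPRI to the UPRIs of its member units.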

Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs

In Neo4j, a labeled property graph, the assignment of UPRIs to all nodes and relations through a ‘UPRI:upri’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘statement_unit_UPRI:upri’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘compound_unit_UPRI:upri’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘compound_unit_UPRI’ property can incorporate an array of UPRIs representing different semantic units.
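The property-value scheme can be illustrated without a running database by treating nodes and relations as plain dictionaries of properties. The sketch below is only an approximation of what the prototype might do: the helper names and example UPRIs are hypothetical, and the two Cypher strings at the end are shown for orientation and are not executed.

```python
# In-memory stand-ins for Neo4j nodes and relations: each is a dict of properties.
organism_node = {"UPRI": "ex:n1", "label": "organism"}
head_node = {"UPRI": "ex:n2", "label": "head"}
has_part_rel = {"UPRI": "ex:r1", "type": "HAS_PART"}
graph = [organism_node, head_node, has_part_rel]


def tag_statement_unit(elements, statement_unit_upri):
    """Attach the statement unit's UPRI to every node and relation of the unit."""
    for element in elements:
        element["statement_unit_UPRI"] = statement_unit_upri


def add_to_compound_unit(elements, compound_unit_upri):
    """Record membership in a compound unit; a list is used because units may overlap."""
    for element in elements:
        memberships = element.setdefault("compound_unit_UPRI", [])
        if compound_unit_upri not in memberships:
            memberships.append(compound_unit_upri)


def members(elements, compound_unit_upri):
    """Return the UPRIs of all nodes and relations belonging to a compound unit."""
    return [e["UPRI"] for e in elements
            if compound_unit_upri in e.get("compound_unit_UPRI", [])]


tag_statement_unit(graph, "ex:su1")
add_to_compound_unit(graph, "ex:item_unit_organism")
add_to_compound_unit(graph, "ex:granularity_tree_1")

# Equivalent lookups in Cypher, given here as plain strings (not executed).
FIND_STATEMENT_UNIT_NODES = "MATCH (n) WHERE n.statement_unit_UPRI = $upri RETURN n"
FIND_COMPOUND_UNIT_NODES = "MATCH (n) WHERE $upri IN n.compound_unit_UPRI RETURN n"
```

Keeping the compound unit memberships in an array means that a statement unit can be added to a further compound unit without rewriting its existing memberships.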

One of the authors has developed a prototype application for demonstration purposes, illustrating how semantic units can be used to manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application uses the organization of the graph into semantic units to build a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.



Figure 13. User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.

Strategies for implementation

Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:

  1. Develop from scratch: In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.
  2. Transfer an existing knowledge graph: If an existing knowledge graph needs restructuring into semantic units, the next step is to craft queries that transfer all triples into corresponding statement units, based on the graph schemata identified in the first step. The main challenge is maintaining disjointness of triples between statement units.
  3. A hybrid approach: For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.
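Two of the rules mentioned in the strategies above lend themselves to a short sketch: grouping statement units that share a subject into an item unit, and checking that no triple ends up in more than one statement unit. The data layout, UPRIs, and function names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical statement units: UPRI -> subject resource and the triples of its data graph.
STATEMENT_UNITS = {
    "ex:su1": {"subject": "ex:organism",
               "triples": [("ex:organism", "ex:hasPart", "ex:head")]},
    "ex:su2": {"subject": "ex:organism",
               "triples": [("ex:organism", "ex:hasWeight", "ex:w1"),
                           ("ex:w1", "ex:hasValue", "70")]},
    "ex:su3": {"subject": "ex:head",
               "triples": [("ex:head", "ex:hasPart", "ex:eye")]},
}


def item_units_by_subject(statement_units):
    """Group statement unit UPRIs by subject; each group corresponds to one item unit."""
    groups = defaultdict(list)
    for upri, unit in statement_units.items():
        groups[unit["subject"]].append(upri)
    return dict(groups)


def overlapping_triples(statement_units):
    """Return triples assigned to more than one statement unit (expected to be empty)."""
    owners = defaultdict(list)
    for upri, unit in statement_units.items():
        for triple in unit["triples"]:
            owners[triple].append(upri)
    return {triple: upris for triple, upris in owners.items() if len(upris) > 1}


item_units = item_units_by_subject(STATEMENT_UNITS)
conflicts = overlapping_triples(STATEMENT_UNITS)
```

When transferring an existing knowledge graph, a check like `overlapping_triples` could be run after each batch of transfer queries to detect violations of disjointness early.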

Semantic units as FAIR Digital Objects

The concept of FAIR Digital Objects, as proposed by the European Commission Expert Group on FAIR Data, stands at the core of achieving the FAIR Principles [67], emphasizing persistent identifiers, comprehensive metadata, and contextual documentation for reliable discovery, citation, and reuse. The concept of semantic units aligns with that of FAIR Digital Objects. Each semantic unit inherently possesses a UPRI, serving as a ready-made persistent identifier. Accessibility and searchability are ensured through established query languages like SPARQL and Cypher, with RDF, JSON, and other formats supporting data export. When knowledge graphs adhere to controlled vocabularies and ontologies, and when they employ standard graph patterns using tools like SHACL [51], ShEx [68, 69], or OTTR [70, 71], the data within the data graphs of semantic units may more easily achieve semantic interoperability.

Moreover, semantic units can provide provenance—crucial for tracking a semantic unit’s history—through utilizing property-value pairs for labeled property knowledge graphs or a designated provenance Named Graph for RDF/OWL knowledge graphs. The provenance metadata of a semantic unit encompasses details like the creator, creation date, application used, title, contributing users, and last-update, focusing solely on the semantic unit itself, not the original data production process.

Access control metadata can specify any licenses as well as access control restrictions.

Conclusion and future work

In conclusion, the adoption of semantic units in structuring knowledge graphs may be useful to address the challenges faced in knowledge representation mentioned in the introduction. By encapsulating each statement within its dedicated statement unit, accompanied by a corresponding statement unit class and data schema (e.g., as a SHACL shape), a robust foundation for FAIR data and metadata is established, supporting schematic interoperability. Because statement units partition the knowledge graph so that every triple belongs to exactly one statement unit and every statement unit’s subgraph is identifiable and referenceable through its UPRI, data in a knowledge graph is linked to graph patterns, which are identifiable as a whole. By providing each schema its own UPRI, each semantic unit can specify its underlying schema in its metadata. Identifying semantically interoperable semantic units is then straightforward, and schema crosswalks between different schemata can increase schematic interoperability. [72] (This addresses Challenge 1.)

Graph query languages can use the graph patterns (semantic units), and therefore allow access to knowledge graph content through higher levels of abstraction than basic triples. (This addresses Challenge 2.) Further, we have shown how semantic units can organize knowledge graphs in different layers and make statements about statements. (This addresses Challenge 3.)

Future research involves extending the semantic units approach to incorporate question units and a nuanced categorization of assertional, contingent, prototypical, and universal statement units. This extension will encompass formal semantics for the latter, including provisions for negations and cardinality restrictions. Additionally, we are exploring novel approaches to knowledge graph exploration based on semantic units.

Abbreviations, acronyms, and initialisms

  • BFO: Basic Formal Ontology
  • CRUD: Create, Read, Update, Delete
  • FAIR: Findable, Accessible, Interoperable, and Reusable
  • HTTP: Hypertext Transfer Protocol
  • HTTPS: Hypertext Transfer Protocol Secure
  • IAO: Information Artifact Ontology
  • ID: Identifier
  • JSON: JavaScript Object Notation
  • LinkML: Linked Data Modeling Language
  • NCIT: National Cancer Institute Thesaurus
  • NoSQL: Not only Structured Query Language
  • OBI: Ontology for Biomedical Investigations
  • OBOE: Extensible Observation Ontology
  • OBO Foundry: Open Biological and Biomedical Ontology Foundry
  • OTTR: Reasonable Ontology Templates
  • OWL: Web Ontology Language
  • PATO: Phenotype and Trait Ontology
  • RDF: Resource Description Framework
  • RDFS: RDF-Schema
  • RO: OBO Relations Ontology
  • SHACL: Shapes Constraint Language
  • ShEx: Shape Expressions
  • SIO: Semanticscience Integrated Ontology
  • SPARQL: SPARQL Protocol and RDF Query Language
  • TI: Time Ontology in OWL
  • TRUST: Transparency, Responsibility, User Focus, Sustainability, and Technology
  • UBERON: Uber-anatomy ontology
  • UO: Units of Measurement Ontology
  • UPRI: Unique Persistent and Resolvable Identifier
  • XSD: Extensible Markup Language Schema Definition

Footnotes

  1. Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.[14]
  2. Analogous to class identification units, one could specify property identification units that have property resources as their subject.

Acknowledgements

We thank Werner Ceusters, Nico Matentzoglu, Manuel Prinz, Marcel Konrad, Philip Strömert, Roman Baum, Björn Quast, Peter Grobe, István Míko, Manfred Jeusfeld, Manolis Koubarakis, Javad Chamanara, and Kheir Eddine for discussing some of the presented ideas. We also thank the anonymous reviewers for their suggestions and feedback. We are solely responsible for all the arguments and statements in this paper.

Author contributions

L.V. developed the concept of semantic units and wrote the initial manuscript text. All authors reviewed and revised the manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL. Lars Vogt received funding from the ERC H2020 Project ‘ScienceGraph’ (819536).

Conflict of interest

The authors declare no competing interests.

References

  1. Adam, K.; Hammad, I.; Fakhreldin, M.A.I. et al. (2015). "Big Data Analysis and Storage". Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering: 648–59. http://umpir.ump.edu.my/id/eprint/7341. 
  2. Marr, B. (21 May 2018). "How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read". Forbes. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/. Retrieved 22 May 2024. 
  3. "Data Never Sleeps 5". Domo, Inc. 2017. https://www.domo.com/learn/infographic/data-never-sleeps-5. 
  4. Idrees, Sheikh Mohammad; Alam, M. Afshar; Agarwal, Parul (1 December 2019). "A study of big data and its challenges" (in en). International Journal of Information Technology 11 (4): 841–846. doi:10.1007/s41870-018-0185-1. ISSN 2511-2104. http://link.springer.com/10.1007/s41870-018-0185-1. 
  5. United Nations (2015). "Transforming our world: the 2030 Agenda for Sustainable Development". United Nations Environment Programme. https://wedocs.unep.org/20.500.11822/9814. Retrieved 22 May 2024. 
  6. Mons, B. (December 2018). "Message from President Barend Mons (2018-2023)". Committee on Data (CODATA). https://codata.org/about-codata/message-from-president-merce-crosas/message-from-president-barend-mons-2018-2023/. Retrieved 22 May 2024. 
  7. Wilkinson, Mark D.; Dumontier, Michel; Aalbersberg, IJsbrand Jan; Appleton, Gabrielle; Axton, Myles; Baak, Arie; Blomberg, Niklas; Boiten, Jan-Willem et al. (15 March 2016). "The FAIR Guiding Principles for scientific data management and stewardship" (in en). Scientific Data 3 (1): 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC PMC4792175. PMID 26978244. https://www.nature.com/articles/sdata201618. 
  8. Heidorn, P. Bryan (1 September 2008). "Shedding Light on the Dark Data in the Long Tail of Science" (in en). Library Trends 57 (2): 280–299. doi:10.1353/lib.0.0036. ISSN 1559-0682. https://muse.jhu.edu/article/262029. 
  9. Baker, Monya (26 May 2016). "1,500 scientists lift the lid on reproducibility" (in en). Nature 533 (7604): 452–454. doi:10.1038/533452a. ISSN 0028-0836. https://www.nature.com/articles/533452a. 
  10. Lin, Dawei; Crabtree, Jonathan; Dillo, Ingrid; Downs, Robert R.; Edmunds, Rorie; Giaretta, David; De Giusti, Marisa; L’Hours, Hervé et al. (14 May 2020). "The TRUST Principles for digital repositories" (in en). Scientific Data 7 (1): 144. doi:10.1038/s41597-020-0486-7. ISSN 2052-4463. PMC PMC7224370. PMID 32409645. https://www.nature.com/articles/s41597-020-0486-7. 
  11. "The Internet of FAIR Data & Services". GO FAIR. https://www.go-fair.org/resources/internet-fair-data-services/. Retrieved 22 May 2024. 
  12. European Commission. Directorate General for Research and Innovation. (2016). Realising the European open science cloud: first report and recommendations of the Commission high level expert group on the European open science cloud.. LU: Publications Office. doi:10.2777/940154. https://data.europa.eu/doi/10.2777/940154. 
  13. Hasnain, Ali; Rebholz-Schuhmann, Dietrich (2018), Gangemi, Aldo; Gentile, Anna Lisa; Nuzzolese, Andrea Giovanni et al., eds., "Assessing FAIR Data Principles Against the 5-Star Open Data Principles" (in en), The Semantic Web: ESWC 2018 Satellite Events (Cham: Springer International Publishing) 11155: 469–477, doi:10.1007/978-3-319-98192-5_60, ISBN 978-3-319-98191-8, https://link.springer.com/10.1007/978-3-319-98192-5_60. Retrieved 2024-06-17 
  14. Weiland, C.; Islam, S.; Broder, D. et al. (19 August 2022). "FDO Machine Actionability, Version 2.1". Google Docs. FDO Forum. https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo. 
  15. Vogt, Lars; Baum, Roman; Bhatty, Philipp; Köhler, Christian; Meid, Sandra; Quast, Björn; Grobe, Peter (1 January 2019). "SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming" (in en). Database 2019: baz067. doi:10.1093/database/baz067. ISSN 1758-0463. PMC PMC6686081. PMID 31392324. https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589. 
  16. Bonatti, Piero Andrea; Decker, Stefan; Polleres, Axel; Presutti, Valentina (2019) (in en). Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371). 83 pages. doi:10.4230/dagrep.8.9.29. https://drops.dagstuhl.de/entities/document/10.4230/DagRep.8.9.29. 
  17. Hogan, Aidan; Blomqvist, Eva; Cochez, Michael; D’amato, Claudia; Melo, Gerard De; Gutierrez, Claudio; Kirrane, Sabrina; Gayo, José Emilio Labra et al. (31 May 2022). "Knowledge Graphs" (in en). ACM Computing Surveys 54 (4): 1–37. doi:10.1145/3447772. ISSN 0360-0300. https://dl.acm.org/doi/10.1145/3447772. 
  18. Abiteboul, Serge (1997), Afrati, Foto; Kolaitis, Phokion, eds., "Querying semi-structured data", Database Theory — ICDT '97 (Berlin, Heidelberg: Springer Berlin Heidelberg) 1186: 1–18, doi:10.1007/3-540-62222-5_33, ISBN 978-3-540-62222-2, http://link.springer.com/10.1007/3-540-62222-5_33. Retrieved 2024-06-17 
  19. Angles, Renzo; Gutierrez, Claudio (1 February 2008). "Survey of graph database models" (in en). ACM Computing Surveys 40 (1): 1–39. doi:10.1145/1322432.1322433. ISSN 0360-0300. https://dl.acm.org/doi/10.1145/1322432.1322433. 
  20. Angles, Renzo; Arenas, Marcelo; Barceló, Pablo; Hogan, Aidan; Reutter, Juan; Vrgoč, Domagoj (30 September 2018). "Foundations of Modern Query Languages for Graph Databases" (in en). ACM Computing Surveys 50 (5): 1–40. doi:10.1145/3104031. ISSN 0360-0300. https://dl.acm.org/doi/10.1145/3104031. 
  21. Hitzler, P.; Krötzsch, M.; Parsia, B. et al. (11 December 2012). "OWL 2 Web Ontology Language Primer (Second Edition)". World Wide Web Consortium. https://www.w3.org/TR/owl2-primer/. 
  22. Stutz, Philip; Strebel, Daniel; Bernstein, Abraham (2016). Signal/collect12: processing large graphs in seconds. doi:10.5167/UZH-119576. https://www.zora.uzh.ch/id/eprint/119576. 
  23. Wang, Quan; Mao, Zhendong; Wang, Bin; Guo, Li (1 December 2017). "Knowledge Graph Embedding: A Survey of Approaches and Applications". IEEE Transactions on Knowledge and Data Engineering 29 (12): 2724–2743. doi:10.1109/TKDE.2017.2754499. ISSN 1041-4347. http://ieeexplore.ieee.org/document/8047276/. 
  24. Stocker, Markus; Oelen, Allard; Jaradeh, Mohamad Yaser; Haris, Muhammad; Oghli, Omar Arab; Heidari, Golsa; Hussein, Hassan; Lorenz, Anna-Lena et al. (11 January 2023). Magagna, Barbara. ed. "FAIR scientific information with the Open Research Knowledge Graph". FAIR Connect 1 (1): 19–21. doi:10.3233/FC-221513. https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/FC-221513. 
  25. Aisopos, Fotis; Jozashoori, Samaneh; Niazmand, Emetis; Purohit, Disha; Rivas, Ariam; Sakor, Ahmad; Iglesias, Enrique; Vogiatzis, Dimitrios et al. (8 May 2023). Kondylakis, Haridimos; Rao, Praveen; Stefanidis, Kostas et al., eds. "Knowledge graphs for enhancing transparency in health data ecosystems". Semantic Web 14 (5): 943–976. doi:10.3233/SW-223294. https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-223294. 
  26. Cifuentes-Silva, Francisco; Fernández-Álvarez, Daniel; Labra-Gayo, Jose Emilio (3 June 2020). "National Budget as Linked Open Data: New Tools for Supporting the Sustainability of Public Finances" (in en). Sustainability 12 (11): 4551. doi:10.3390/su12114551. ISSN 2071-1050. https://www.mdpi.com/2071-1050/12/11/4551. 
  27. Rajabi, Enayat; Kafaie, Somayeh (28 September 2022). "Knowledge Graphs and Explainable AI in Healthcare" (in en). Information 13 (10): 459. doi:10.3390/info13100459. ISSN 2078-2489. https://www.mdpi.com/2078-2489/13/10/459. 
  28. Tiddi, Ilaria; Schlobach, Stefan (1 January 2022). "Knowledge graphs as tools for explainable machine learning: A survey" (in en). Artificial Intelligence 302: 103627. doi:10.1016/j.artint.2021.103627. https://linkinghub.elsevier.com/retrieve/pii/S0004370221001788. 
  29. Hogan, Aidan; Arenas, Marcelo; Mallea, Alejandro; Polleres, Axel (1 August 2014). "Everything you always wanted to know about blank nodes" (in en). Journal of Web Semantics 27-28: 42–69. doi:10.1016/j.websem.2014.06.004. https://linkinghub.elsevier.com/retrieve/pii/S1570826814000481. 
  30. Neumann, T.; Moerkotte, G. "Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins". Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. doi:10.1109/icde.2011.5767868. https://ieeexplore.ieee.org/document/5767868/. 
  31. Papastefanatos, George; Meimaris, Marios; Vassiliadis, Panos (1 February 2022). "Relational schema optimization for RDF-based knowledge graphs" (in en). Information Systems 104: 101754. doi:10.1016/j.is.2021.101754. https://linkinghub.elsevier.com/retrieve/pii/S0306437921000223. 
  32. Collarana, Diego; Galkin, Mikhail; Traverso-Ribón, Ignacio; Vidal, Maria-Esther; Lange, Christoph; Auer, Sören (19 June 2017). "MINTE: semantically integrating RDF graphs" (in en). Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (Amantea Italy: ACM): 1–11. doi:10.1145/3102254.3102280. ISBN 978-1-4503-5225-3. https://dl.acm.org/doi/10.1145/3102254.3102280. 
  33. Vogt, Lars (1 December 2019). "Organizing phenotypic data—a semantic data model for anatomy" (in en). Journal of Biomedical Semantics 10 (1): 12. doi:10.1186/s13326-019-0204-6. ISSN 2041-1480. PMC PMC6585074. PMID 31221226. https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-019-0204-6. 
  34. Ceusters, Werner (2022), Elkin, Peter L., ed., "The Place of Referent Tracking in Biomedical Informatics" (in en), Terminology, Ontology and their Implementations (Cham: Springer International Publishing): 39–46, doi:10.1007/978-3-031-11302-4_6, ISBN 978-3-031-11301-7, https://link.springer.com/10.1007/978-3-031-11302-4_6. Retrieved 2024-06-17 
  35. Ceusters, Werner; Elkin, Peter; Smith, Barry (1 December 2007). "Negative findings in electronic health records and biomedical ontologies: A realist approach" (in en). International Journal of Medical Informatics 76: S326–S333. doi:10.1016/j.ijmedinf.2007.02.003. PMC PMC2211452. PMID 17369081. https://linkinghub.elsevier.com/retrieve/pii/S1386505607000408. 
  36. Bandrowski, Anita; Brinkman, Ryan; Brochhausen, Mathias; Brush, Matthew H.; Bug, Bill; Chibucos, Marcus C.; Clancy, Kevin; Courtot, Mélanie et al. (29 April 2016). Xue, Yu. ed. "The Ontology for Biomedical Investigations" (in en). PLOS ONE 11 (4): e0154556. doi:10.1371/journal.pone.0154556. ISSN 1932-6203. PMC PMC4851331. PMID 27128319. https://dx.plos.org/10.1371/journal.pone.0154556. 
  37. Madin, Joshua; Bowers, Shawn; Schildhauer, Mark; Krivov, Sergeui; Pennington, Deana; Villa, Ferdinando (1 October 2007). "An ontology for describing and synthesizing ecological observation data" (in en). Ecological Informatics 2 (3): 279–296. doi:10.1016/j.ecoinf.2007.05.004. https://linkinghub.elsevier.com/retrieve/pii/S1574954107000362. 
  38. Harris, S.; Seaborne, A. (21 March 2013). "SPARQL 1.1 Query Language". World Wide Web Consortium. https://www.w3.org/TR/sparql11-query/. 
  39. "The Neo4j Operations Manual v5". Neo4j, Inc. 2024. https://neo4j.com/docs/operations-manual/current/. 
  40. Booth, D.; Wallace, E. (2019). "Session X: EasyRDF". 2nd U.S. Semantic Technologies Symposium 2019. https://us2ts.org/2019/posts/program-session-x.html. 
  41. Hartig, O. (2017). "Foundations of RDF⋆ and SPARQL⋆ (An Alternative Approach to Statement-Level Metadata in RDF)". Alberto Mendelzon Workshop on Foundations of Data Management. https://www.semanticscholar.org/paper/Foundations-of-RDF%E2%8B%86-and-SPARQL%E2%8B%86-(An-Alternative-to-Hartig/36e70ee51cb7b7ec12faac934ae6b6a4d9da15a8. 
  42. "Wikibase/DataModel - Overview of the data model". MediaWiki.org. 7 April 2024. https://www.mediawiki.org/wiki/Wikibase/DataModel#Item. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage was substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.