<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.limswiki.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Shawndouglas</id>
	<title>LIMSWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.limswiki.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Shawndouglas"/>
	<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php/Special:Contributions/Shawndouglas"/>
	<updated>2026-04-06T16:39:32Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.1</generator>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LIMS_Q%26A:What_types_of_testing_occur_within_a_medical_microbiology_laboratory%3F&amp;diff=64503</id>
		<title>LIMS Q&amp;A:What types of testing occur within a medical microbiology laboratory?</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LIMS_Q%26A:What_types_of_testing_occur_within_a_medical_microbiology_laboratory%3F&amp;diff=64503"/>
		<updated>2024-06-21T02:31:10Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Shawndouglas moved page LIMS Q&amp;amp;A:LIMS Q&amp;amp;A:What types of testing occur within a medical microbiology laboratory? to LIMS Q&amp;amp;A:What types of testing occur within a medical microbiology laboratory? without leaving a redirect&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:US Navy 070905-N-0194K-029 Lt. Paul Graf, a microbiology officer aboard Military Sealift Command hospital ship USNS Comfort (T-AH 20), examines wound cultures in the ship's microbiology laboratory.jpg|right|380px]]&lt;br /&gt;
'''Title''': ''What types of testing occur within a medical microbiology laboratory?''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': April 2024&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
The medical [[microbiology]] [[laboratory]] has a variety of testing and workflow requirements that set it apart from other biomedical labs. These often complex requirements stem in part from the challenges of analyzing [[microorganism]]s at the microscopic level, as well as from the vital role medical microbiology labs play in [[public health]]. As societal and economic pressures such as [[COVID-19]] and hiring challenges have forced these labs to adopt more automated methods in their work, the way these labs operate and perform their tests has changed. Regardless, their mandate remains the same: detect, identify, and characterize microorganisms to improve patient outcomes and keep infectious agents from spreading across entire populations.&lt;br /&gt;
&lt;br /&gt;
This brief topical article will examine the typical types of testing that occur in medical microbiology labs, while also touching upon a few elements of technology and automation, and how they have changed the way these labs perform their activities.&lt;br /&gt;
&lt;br /&gt;
==The medical microbiology lab in general==&lt;br /&gt;
A medical [[microbiology]] [[laboratory]] helps detect, identify, and characterize [[microorganism]]s for both individual patient treatment and broader population disease prevention and control. In the course of its work towards aiding in the diagnosis of individual patients' ailments, the lab may identify infectious agents of concern and trends in those infections as part of a greater [[public health]] effort. By extension, medical microbiology laboratories are also responsible for reporting those identifications and trends to various public health agencies (city, county, state, and federal). These reports are then used by [[Public health laboratory|public health laboratories]], in tandem with medical microbiology labs, to track incidences and attempt to identify outbreaks.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; In particular, the medical microbiology lab is uniquely suited to confirming infectious disease cases as part of outbreak investigations, with its analytical and interpretive &amp;quot;methods that are not commonly available in a routine laboratory setting.&amp;quot;&amp;lt;ref name=&amp;quot;ECDCCore10&amp;quot;&amp;gt;{{cite web |url=https://www.ecdc.europa.eu/sites/default/files/media/en/publications/Publications/1006_TER_Core_functions_of_reference_labs.pdf |format=PDF |title=Core functions of microbiology reference laboratories for communicable diseases |author=European Centre for Disease Prevention and Control |date=June 2010 |publisher=European Centre for Disease Prevention and Control |isbn=9789291932115 |doi=10.2900/29017 |accessdate=24 April 2024}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A standard consolidated medical microbiology laboratory will have the facilities for rapid microbiology, [[Microscope|microscopy]], [[Cell culture|cell culturing]], serology, molecular biology, parasitology, virology, communicable disease management (i.e., public health or reference activities&amp;lt;ref name=&amp;quot;ECDCCore10&amp;quot; /&amp;gt;) and more, and it also may have the facilities for environmental microbiology.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot;&amp;gt;{{Cite journal |last=Vandenberg |first=Olivier |last2=Durand |first2=Géraldine |last3=Hallin |first3=Marie |last4=Diefenbach |first4=Andreas |last5=Gant |first5=Vanya |last6=Murray |first6=Patrick |last7=Kozlakidis |first7=Zisis |last8=van Belkum |first8=Alex |date=2020-03-18 |title=Consolidation of Clinical Microbiology Laboratories and Introduction of Transformative Technologies |url=https://journals.asm.org/doi/10.1128/CMR.00057-19 |journal=Clinical Microbiology Reviews |language=en |volume=33 |issue=2 |pages=e00057–19 |doi=10.1128/CMR.00057-19 |issn=0893-8512 |pmc=PMC7048017 |pmid=32102900}}&amp;lt;/ref&amp;gt; A variety of specimen types will be tested, including urine, blood, stool, tissues, and precious fluids, as well as skin, mucosal, and genital swabs.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Culture-based and other microbiology test methods have largely been performed manually up until recently. As Antonios ''et al.'' noted at the end of 2021, &amp;quot;the introduction of automation in microbiology was considered difficult to apply for several reasons such as the complexity and variability of sample types, the variations of specimens processing, the doubtful cost-effectiveness especially for small and average-sized laboratories, and the perception that machines could not exercise the critical decision-making skills required to process microbiological samples.&amp;quot;&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot;&amp;gt;{{Cite journal |last=Antonios |first=Kritikos |last2=Croxatto |first2=Antony |last3=Culbreath |first3=Karissa |date=2021-12-30 |title=Current State of Laboratory Automation in Clinical Microbiology Laboratory |url=https://academic.oup.com/clinchem/article/68/1/99/6490228 |journal=Clinical Chemistry |language=en |volume=68 |issue=1 |pages=99–114 |doi=10.1093/clinchem/hvab242 |issn=0009-9147}}&amp;lt;/ref&amp;gt; However, economic, employment, and other societal drivers have necessarily brought [[laboratory automation]] and [[large language model]]s (LLMs) more fully to the medical microbiology lab in recent years.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot;&amp;gt;{{cite web |url=https://www.europeanpharmaceuticalreview.com/article/166302/enhancing-rapid-microbiology-methods-how-ai-is-shaping-microbiology/ |title=Enhancing rapid microbiology methods: how AI is shaping microbiology |author=Sandle, T. 
|work=European Pharmaceutical Review |date=22 December 2021 |accessdate=17 April 2024}}&amp;lt;/ref&amp;gt; This has allowed these labs to move from a traditional partial-day work schedule toward around-the-clock operation through, for example, the use of automated front-end plating systems.&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Whether manual or automated, successful medical microbiology workflows rely on specific quality controls, reporting, instruments, and test methods to achieve overall laboratory and healthcare objectives. The next section will specifically examine the types of testing that occur within a medical microbiology laboratory.&lt;br /&gt;
&lt;br /&gt;
==Medical microbiology testing==&lt;br /&gt;
Within the scope of detecting, identifying, and characterizing microorganisms, medical microbiology labs depend on a variety of scientific subspecialties (e.g., bacteriology, mycology, virology) and test methods to achieve their goals. What follows are examples of the more common detection, identification, and characterization activities conducted in these labs.&lt;br /&gt;
&lt;br /&gt;
===Detection of microbial growth===&lt;br /&gt;
By detecting the telltale signs of living microorganisms, such as growth (i.e., an increase in the number of cells), microbiologists can then make an initial diagnosis of microbiological infection and take a deeper dive into identifying the microorganism(s). (Note that measuring microbial growth is not a direct proxy for measuring microbial metabolism, however.&amp;lt;ref&amp;gt;{{Cite journal |last=Braissant |first=Olivier |last2=Astasov-Frauenhoffer |first2=Monika |last3=Waltimo |first3=Tuomas |last4=Bonkat |first4=Gernot |date=2020-11-17 |title=A Review of Methods to Determine Viability, Vitality, and Metabolic Rates in Microbiology |url=https://www.frontiersin.org/articles/10.3389/fmicb.2020.547458/full |journal=Frontiers in Microbiology |volume=11 |pages=547458 |doi=10.3389/fmicb.2020.547458 |issn=1664-302X |pmc=PMC7705206 |pmid=33281753}}&amp;lt;/ref&amp;gt;) Growth can be demonstrated in multiple ways, including&amp;lt;ref name=&amp;quot;:0&amp;quot;&amp;gt;{{Cite book |last=Washington, J.A. |date=1996 |editor-last=Baron |editor-first=Samuel |title=Medical microbiology |chapter=Chapter 10: Principles of Diagnosis |edition=4th ed |publisher=University of Texas Medical Branch at Galveston |place=Galveston, Tex |isbn=978-0-9631172-1-2 |pmid=21413287}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
*confirming turbidity, gas, or discrete colonies in broth;&lt;br /&gt;
*confirming discrete colonies on agar plates;&lt;br /&gt;
*confirming cytopathic effects or inclusions that distort the structures of cells in culture; and&lt;br /&gt;
*confirming &amp;quot;genus- or species-specific antigens or nucleotide sequences&amp;quot;&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; in the specimen, culture medium, or culture system.&lt;br /&gt;
&lt;br /&gt;
Cell culturing plays an important role, as hinted at above. Those cultures can occur in liquid broth, agar plates, or some other enhanced culture medium, as found with blood cultures in specialized bottles or tubes. Cultures are incubated to allow time for any microorganisms to multiply. Then signs of growth are sought out.&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; However, detecting this growth is rarely straightforward and has its own set of complications.&amp;lt;ref&amp;gt;{{Cite journal |last=Zengler |first=Karsten |date=2009-12 |title=Central Role of the Cell in Microbial Ecology |url=https://journals.asm.org/doi/10.1128/MMBR.00027-09 |journal=Microbiology and Molecular Biology Reviews |language=en |volume=73 |issue=4 |pages=712–729 |doi=10.1128/MMBR.00027-09 |issn=1092-2172 |pmc=PMC2786577 |pmid=19946138}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;ŹródłowskiClass20&amp;quot;&amp;gt;{{Cite journal |last=Źródłowski |first=Tomasz |last2=Sobońska |first2=Joanna |last3=Salamon |first3=Dominika |last4=McFarlane |first4=Isabel M. |last5=Ziętkiewicz |first5=Mirosław |last6=Gosiewski |first6=Tomasz |date=2020-02-29 |title=Classical Microbiological Diagnostics of Bacteremia: Are the Negative Results Really Negative? What is the Laboratory Result Telling Us About the “Gold Standard”? |url=https://www.mdpi.com/2076-2607/8/3/346 |journal=Microorganisms |language=en |volume=8 |issue=3 |pages=346 |doi=10.3390/microorganisms8030346 |issn=2076-2607 |pmc=PMC7143506 |pmid=32121353}}&amp;lt;/ref&amp;gt; This may necessitate other methods such as Gram staining or [[wikipedia:Fluorescence in situ hybridization|fluorescence ''in situ'' hybridization]] (FISH) for quicker and more accurate detection of growth.&amp;lt;ref name=&amp;quot;ŹródłowskiClass20&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Taxonomic identification and overall characterization===&lt;br /&gt;
As an extension of detecting microbial growth, microbiologists can examine the growth characteristics of the microorganism(s) in order to identify what type of bacteria or fungi is growing. The identification of viruses, on the other hand, is typically done by examining the cytopathic effects or inclusions that affect cells in culture, or through detection of antigens or nucleotides specific to a viral genus or species.&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; Databases are commonly used as part of the identification process of microorganisms.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; The sources used for these databases highlight some of the identification techniques employed. For example, a &amp;quot;biochemical reaction&amp;quot; database implies microorganisms identified with techniques such as [[polymerase chain reaction]] (PCR), [[ELISA|enzyme-linked immunosorbent assay]] (ELISA), fatty acid profiling (using [[gas chromatography]] [GC] and [[mass spectrometry]] [MS]), and metabolic/chemo profiling (using [[high-performance liquid chromatography]] [HPLC] and MS).&amp;lt;ref name=&amp;quot;MooreBiochem21&amp;quot;&amp;gt;{{cite web |url=https://www.news-medical.net/life-sciences/Biochemical-Tests-for-Microbial-Identification.aspx |title=Biochemical Tests for Microbial Identification |author=Moore, S. 
|work=News-Medical Life Sciences |date=14 January 2021 |accessdate=26 April 2024}}&amp;lt;/ref&amp;gt; A &amp;quot;nucleic acid sequence&amp;quot; database implies microorganisms identified with PCR for a single pathogen&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt;, or [[DNA microarray]]s, metagenomics analysis, and [[DNA sequencing#High-throughput methods|next-generation sequencing]] (NGS) for identifying multiple pathogens at the same time.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Yadav |first=Brijesh Singh |last2=Ronda |first2=Venkateswarlu |last3=Vashista |first3=Dinesh P. |last4=Sharma |first4=Bhaskar |date=2013-01 |title=Sequencing and Computational Approaches to Identification and Characterization of Microbial Organisms |url=http://journals.sagepub.com/doi/10.4137/BECB.S10886 |journal=Biomedical Engineering and Computational Biology |language=en |volume=5 |pages=BECB.S10886 |doi=10.4137/BECB.S10886 |issn=1179-5972 |pmc=PMC4147756 |pmid=25288901}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All of these techniques have their place in the microbiology lab, with genotypic methods in particular proving useful &amp;quot;for assessing sterility test and media fill failures, and for tracking the route of contamination as part of a contamination control strategy.&amp;quot;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt; This type of contamination tracking and tracing is enabled by genotypic methods that allow microorganisms to be &amp;quot;characterized,&amp;quot; i.e., grouped together based upon the shared characteristics of their DNA fragment patterns or antigenic profiles.&amp;lt;ref&amp;gt;{{Cite journal |last=Kim |first=Young Ran |last2=Lee |first2=Shee Eun |last3=Kim |first3=Choon Mee |last4=Kim |first4=Soo Young |last5=Shin |first5=Eun Kyoung |last6=Shin |first6=Dong Hyeon |last7=Chung |first7=Sun Sik |last8=Choy |first8=Hyon E. |last9=Progulske-Fox |first9=Ann |last10=Hillman |first10=Jeffrey D. |last11=Handfield |first11=Martin |date=2003-10 |title=Characterization and Pathogenic Significance of Vibrio vulnificus Antigens Preferentially Expressed in Septicemic Patients |url=https://journals.asm.org/doi/10.1128/IAI.71.10.5461-5471.2003 |journal=Infection and Immunity |language=en |volume=71 |issue=10 |pages=5461–5471 |doi=10.1128/IAI.71.10.5461-5471.2003 |issn=0019-9567 |pmc=PMC201039 |pmid=14500463}}&amp;lt;/ref&amp;gt; Other aspects of a culture may be characterized as well in order to provide a more accurate &amp;quot;description&amp;quot; of the microorganism for future identification efforts.&amp;lt;ref&amp;gt;{{Citation |last=Trüper |first=Hans G. |last2=Krämer |first2=Johannes |date=1981 |editor-last=Starr |editor-first=Mortimer P. |editor2-last=Stolp |editor2-first=Heinz |editor3-last=Trüper |editor3-first=Hans G. |editor4-last=Balows |editor4-first=Albert |editor5-last=Schlegel |editor5-first=Hans G. 
|title=Principles of Characterization and Identification of Prokaryotes |url=http://link.springer.com/10.1007/978-3-662-13187-9_6 |work=The Prokaryotes |language=en |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |pages=176–193 |doi=10.1007/978-3-662-13187-9_6 |isbn=978-3-662-13189-3 |accessdate=2024-04-26}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Other analyses and techniques===&lt;br /&gt;
Medical microbiology labs will perform antibiogram and antimicrobial susceptibility testing (AST) as part of their public health function. An antibiogram is a cumulative summary or &amp;quot;overall profile of [''in vitro''] susceptibility testing results for a specific microorganism to an array of antimicrobial drugs,&amp;quot; often given in tabular form.&amp;lt;ref name=&amp;quot;UnivMNHowTo20&amp;quot;&amp;gt;{{cite web |url=https://arsi.umn.edu/sites/arsi.umn.edu/files/2020-02/How_to_Use_a_Clinical_Antibiogram_26Feb2020_Final.pdf |format=PDF |title=How to Use a Clinical Antibiogram |author=Antimicrobial Resistance and Stewardship Initiative, University of Minnesota |date=February 2020 |accessdate=17 April 2024}}&amp;lt;/ref&amp;gt; Given that antibiotic resistance remains one of the primary challenges for global public health, determining how susceptible a microorganism is to certain antimicrobials before a physician prescribes or administers an antibiotic is of significant value.&amp;lt;ref name=&amp;quot;GajicAnti22&amp;quot;&amp;gt;{{Cite journal |last=Gajic |first=Ina |last2=Kabic |first2=Jovana |last3=Kekic |first3=Dusan |last4=Jovicevic |first4=Milos |last5=Milenkovic |first5=Marina |last6=Mitic Culafic |first6=Dragana |last7=Trudic |first7=Anika |last8=Ranin |first8=Lazar |last9=Opavski |first9=Natasa |date=2022-03-23 |title=Antimicrobial Susceptibility Testing: A Comprehensive Review of Currently Used Methods |url=https://www.mdpi.com/2079-6382/11/4/427 |journal=Antibiotics |language=en |volume=11 |issue=4 |pages=427 |doi=10.3390/antibiotics11040427 |issn=2079-6382 |pmc=PMC9024665 |pmid=35453179}}&amp;lt;/ref&amp;gt; Microbiology labs use multiple approaches to the susceptibility testing behind antibiograms, including broth and agar dilution, gradient strip tests, disk diffusion tests, chromogenic and colorimetric tests, PCR, DNA microarrays, and other methods.&amp;lt;ref name=&amp;quot;GajicAnti22&amp;quot; 
/&amp;gt; The nuances of antibiograms and susceptibility testing drive reporting requirements, particularly those of the standard CLSI M39, ''Analysis and Presentation of Cumulative Antimicrobial Susceptibility Test Data''.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot;&amp;gt;{{Cite journal |last=Rhoads |first=Daniel D. |last2=Sintchenko |first2=Vitali |last3=Rauch |first3=Carol A. |last4=Pantanowitz |first4=Liron |date=2014-10 |title=Clinical Microbiology Informatics |url=https://journals.asm.org/doi/10.1128/CMR.00049-14 |journal=Clinical Microbiology Reviews |language=en |volume=27 |issue=4 |pages=1025–1047 |doi=10.1128/CMR.00049-14 |issn=0893-8512 |pmc=PMC4187636 |pmid=25278581}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Simner |first=Patricia J. |last2=Hindler |first2=Janet A. |last3=Bhowmick |first3=Tanaya |last4=Das |first4=Sanchita |last5=Johnson |first5=J. Kristie |last6=Lubers |first6=Brian V. |last7=Redell |first7=Mark A. |last8=Stelling |first8=John |last9=Erdman |first9=Sharon M. |date=2022-10-19 |editor-last=Humphries |editor-first=Romney M. |title=What’s New in Antibiograms? Updating CLSI M39 Guidance with Current Trends |url=https://journals.asm.org/doi/10.1128/jcm.02210-21 |journal=Journal of Clinical Microbiology |language=en |volume=60 |issue=10 |pages=e02210–21 |doi=10.1128/jcm.02210-21 |issn=0095-1137 |pmc=PMC9580356 |pmid=35916520}}&amp;lt;/ref&amp;gt; This highlights the importance of the lab not only accurately performing these analyses but also properly reporting the results for consistent and rapid interpretation.&lt;br /&gt;
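At its core, a cumulative antibiogram is a tally of percent-susceptible results for each organism-drug pair. The following is a minimal, hypothetical sketch of that tally (invented organism, drug, and result values; real antibiograms must follow CLSI M39 rules such as deduplicating to the first isolate per patient and applying minimum isolate counts):

```python
# Sketch: computing a cumulative antibiogram (percent susceptible) from
# individual isolate results. Data and field names here are illustrative only.
from collections import defaultdict

def cumulative_antibiogram(isolates):
    """Return {(organism, drug): percent susceptible} from isolate records."""
    tallies = defaultdict(lambda: [0, 0])  # (organism, drug) -> [susceptible, total]
    for organism, drug, result in isolates:
        counts = tallies[(organism, drug)]
        counts[1] += 1
        if result == "S":  # "S" = susceptible; "I"/"R" count toward the total only
            counts[0] += 1
    return {key: round(100.0 * s / total, 1) for key, (s, total) in tallies.items()}

# Toy input: three E. coli isolates tested against ciprofloxacin, etc.
results = [
    ("E. coli", "ciprofloxacin", "S"),
    ("E. coli", "ciprofloxacin", "R"),
    ("E. coli", "ciprofloxacin", "S"),
    ("E. coli", "ampicillin", "R"),
    ("S. aureus", "oxacillin", "S"),
]
table = cumulative_antibiogram(results)
# table[("E. coli", "ciprofloxacin")] is 66.7 with this toy data
```

A real implementation would sit on top of the lab's information system and apply the M39 deduplication and reporting rules before publishing the table.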
&lt;br /&gt;
Digital image analysis is another important technique used in the medical microbiology lab. This work has traditionally been performed with analog microscopy techniques to identify and characterize microorganisms, even into the early 2010s when digital imaging analysis was becoming more viable.&amp;lt;ref&amp;gt;{{Cite journal |last=Pasulka |first=Alexis L. |last2=Hood |first2=Jonathan F. |last3=Michels |first3=Dana E. |last4=Wright |first4=Mason D. |date=2023-01-19 |title=Flexible and open-source programs for quantitative image analysis in microbial ecology |url=https://www.frontiersin.org/articles/10.3389/fmars.2023.1052119/full |journal=Frontiers in Marine Science |volume=10 |pages=1052119 |doi=10.3389/fmars.2023.1052119 |issn=2296-7745}}&amp;lt;/ref&amp;gt; In 2014, Rhoads ''et al.'' characterized automated or semi-automated methods in image interpretation as not being widely implemented in the medical microbiology lab, while at the same time recognizing those methods' potential for screening slides for identifications or characterizations, as well as improving standardization and turnaround time for analyzed specimens.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; Since then, laboratory automation, LLMs, and [[artificial intelligence]] (AI) tools—as well as the [[COVID-19]] [[pandemic]]—have pushed the microbiology imaging paradigm forward sufficiently to arguably make digital image analysis more mainstream.&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Burns |first=Bethany L. |last2=Rhoads |first2=Daniel D. |last3=Misra |first3=Anisha |date=2023-09-21 |editor-last=Humphries |editor-first=Romney M. 
|title=The Use of Machine Learning for Image Analysis Artificial Intelligence in Clinical Microbiology |url=https://journals.asm.org/doi/10.1128/jcm.02336-21 |journal=Journal of Clinical Microbiology |language=en |volume=61 |issue=9 |pages=e02336–21 |doi=10.1128/jcm.02336-21 |issn=0095-1137 |pmc=PMC10575257 |pmid=37395657}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Yakimovich |first=Artur |date=2024-02-28 |editor-last=Imperiale |editor-first=Michael J. |title=Toward the novel AI tasks in infection biology |url=https://journals.asm.org/doi/10.1128/msphere.00591-23 |journal=mSphere |language=en |volume=9 |issue=2 |pages=e00591–23 |doi=10.1128/msphere.00591-23 |issn=2379-5042 |pmc=PMC10900907 |pmid=38334404}}&amp;lt;/ref&amp;gt; The introduction of automated microscopes &amp;quot;designed to collect high‑resolution image data from microscopic slides&amp;quot; and &amp;quot;high‑resolution image analysis systems that can detect small and mixed colonies, which a human eye cannot&amp;quot;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt; are examples of how modern medical microbiology labs are approaching their imaging work.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
With not only its goal of detecting, identifying, and characterizing [[microorganism]]s for improved patient outcomes, but also its public health component of infectious agent detection and trend analysis, the medical microbiology lab plays a pivotal role in disease detection and prevention. The technology, methods, and requirements associated with these efforts are in turn sophisticated, as one might expect when dealing with infectious agents at the micro scale. From cell cultures and digital imaging to genotypic analyses and AST, the complexities of this lab become obvious. Numerous methods exist for detecting microbial growth, a precursor to confirming the presence of microorganisms in specimens. From there, identification using growth characteristics, cytopathic effects, and the detection of antigens or nucleotides can provide greater insight. The characterization of microorganisms and their telltale signs, using techniques like PCR and MS, further enhances the knowledge about them stored in databases. Antibiograms and AST, meanwhile, are important components of responsible antibiotic use in the global population. Finally, imaging methods are both important to and challenging for the medical microbiology lab, requiring more advanced automated systems to assist with identification and characterization in these often understaffed labs.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles (all)]]&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles on medical microbiology]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LIMS_Q%26A:What_types_of_testing_occur_within_a_medical_microbiology_laboratory%3F&amp;diff=64502</id>
		<title>LIMS Q&amp;A:What types of testing occur within a medical microbiology laboratory?</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LIMS_Q%26A:What_types_of_testing_occur_within_a_medical_microbiology_laboratory%3F&amp;diff=64502"/>
		<updated>2024-06-21T02:30:57Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Shawndouglas moved page LIMA Q&amp;amp;A:What types of testing occur within a medical microbiology laboratory? to LIMS Q&amp;amp;A:LIMS Q&amp;amp;A:What types of testing occur within a medical microbiology laboratory? without leaving a redirect: Misnamed the namespace&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:US Navy 070905-N-0194K-029 Lt. Paul Graf, a microbiology officer aboard Military Sealift Command hospital ship USNS Comfort (T-AH 20), examines wound cultures in the ship's microbiology laboratory.jpg|right|380px]]&lt;br /&gt;
'''Title''': ''What types of testing occur within a medical microbiology laboratory?''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': April 2024&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
The medical [[microbiology]] [[laboratory]] has a variety of testing and workflow requirements that set it apart from other biomedical labs. These often complex requirements stem in part from the challenges of analyzing [[microorganism]]s at the microscopic level, as well as from the vital role medical microbiology labs play in [[public health]]. As societal and economic pressures such as [[COVID-19]] and hiring challenges have forced these labs to adopt more automated methods in their work, the way these labs operate and perform their tests has changed. Regardless, their mandate remains the same: detect, identify, and characterize microorganisms to improve patient outcomes and keep infectious agents from spreading across entire populations.&lt;br /&gt;
&lt;br /&gt;
This brief topical article will examine the typical types of testing that occur in medical microbiology labs, while also touching upon a few elements of technology and automation, and how they have changed the way these labs perform their activities.&lt;br /&gt;
&lt;br /&gt;
==The medical microbiology lab in general==&lt;br /&gt;
A medical [[microbiology]] [[laboratory]] helps detect, identify, and characterize [[microorganism]]s for both individual patient treatment and broader population disease prevention and control. In the course of its work towards aiding in the diagnosis of individual patients' ailments, the lab may identify infectious agents of concern and trends in those infections as part of a greater [[public health]] effort. By extension, medical microbiology laboratories are also responsible for reporting those identifications and trends to various public health agencies (city, county, state, and federal). These reports are then used by [[Public health laboratory|public health laboratories]], in tandem with medical microbiology labs, to track incidences and attempt to identify outbreaks.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; In particular, the medical microbiology lab is uniquely suited to confirming infectious disease cases as part of outbreak investigations, with its analytical and interpretive &amp;quot;methods that are not commonly available in a routine laboratory setting.&amp;quot;&amp;lt;ref name=&amp;quot;ECDCCore10&amp;quot;&amp;gt;{{cite web |url=https://www.ecdc.europa.eu/sites/default/files/media/en/publications/Publications/1006_TER_Core_functions_of_reference_labs.pdf |format=PDF |title=Core functions of microbiology reference laboratories for communicable diseases |author=European Centre for Disease Prevention and Control |date=June 2010 |publisher=European Centre for Disease Prevention and Control |isbn=9789291932115 |doi=10.2900/29017 |accessdate=24 April 2024}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A standard consolidated medical microbiology laboratory will have the facilities for rapid microbiology, [[Microscope|microscopy]], [[Cell culture|cell culturing]], serology, molecular biology, parasitology, virology, communicable disease management (i.e., public health or reference activities&amp;lt;ref name=&amp;quot;ECDCCore10&amp;quot; /&amp;gt;) and more, and it also may have the facilities for environmental microbiology.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot;&amp;gt;{{Cite journal |last=Vandenberg |first=Olivier |last2=Durand |first2=Géraldine |last3=Hallin |first3=Marie |last4=Diefenbach |first4=Andreas |last5=Gant |first5=Vanya |last6=Murray |first6=Patrick |last7=Kozlakidis |first7=Zisis |last8=van Belkum |first8=Alex |date=2020-03-18 |title=Consolidation of Clinical Microbiology Laboratories and Introduction of Transformative Technologies |url=https://journals.asm.org/doi/10.1128/CMR.00057-19 |journal=Clinical Microbiology Reviews |language=en |volume=33 |issue=2 |pages=e00057–19 |doi=10.1128/CMR.00057-19 |issn=0893-8512 |pmc=PMC7048017 |pmid=32102900}}&amp;lt;/ref&amp;gt; A variety of specimen types will be tested, including urine, blood, stool, tissues, and precious fluids, as well as skin, mucosal, and genital swabs.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Culture-based and other microbiology test methods have largely been performed manually up until recently. As Antonios ''et al.'' noted at the end of 2021, &amp;quot;the introduction of automation in microbiology was considered difficult to apply for several reasons such as the complexity and variability of sample types, the variations of specimens processing, the doubtful cost-effectiveness especially for small and average-sized laboratories, and the perception that machines could not exercise the critical decision-making skills required to process microbiological samples.&amp;quot;&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot;&amp;gt;{{Cite journal |last=Antonios |first=Kritikos |last2=Croxatto |first2=Antony |last3=Culbreath |first3=Karissa |date=2021-12-30 |title=Current State of Laboratory Automation in Clinical Microbiology Laboratory |url=https://academic.oup.com/clinchem/article/68/1/99/6490228 |journal=Clinical Chemistry |language=en |volume=68 |issue=1 |pages=99–114 |doi=10.1093/clinchem/hvab242 |issn=0009-9147}}&amp;lt;/ref&amp;gt; However, economic, employment, and other societal drivers have necessarily brought [[laboratory automation]] and [[large language model]]s (LLMs) more fully to the medical microbiology lab in recent years.&amp;lt;ref name=&amp;quot;VandenbergConsol20&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot;&amp;gt;{{cite web |url=https://www.europeanpharmaceuticalreview.com/article/166302/enhancing-rapid-microbiology-methods-how-ai-is-shaping-microbiology/ |title=Enhancing rapid microbiology methods: how AI is shaping microbiology |author=Sandle, T. 
|work=European Pharmaceutical Review |date=22 December 2021 |accessdate=17 April 2024}}&amp;lt;/ref&amp;gt; This has allowed these labs to move from a traditional partial-day work schedule to a 24-hour work schedule through, for example, the use of automated front-end plating systems.&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Whether manual or automated, successful medical microbiology workflows rely on specific quality controls, reporting, instruments, and test methods to achieve overall laboratory and healthcare objectives. The next section will specifically examine the types of testing that occur within a medical microbiology laboratory.&lt;br /&gt;
&lt;br /&gt;
==Medical microbiology testing==&lt;br /&gt;
Within the scope of detecting, identifying, and characterizing microorganisms, medical microbiology labs depend on a variety of scientific subspecialties (e.g., bacteriology, mycology, virology) and test methods to achieve their goals. What follows are examples of the more common detection, identification, and characterization activities conducted in these labs.&lt;br /&gt;
&lt;br /&gt;
===Detection of microbial growth===&lt;br /&gt;
By detecting the telltale signs of living microorganisms, such as growth (i.e., an increase in the number of cells), microbiologists can then make an initial diagnosis of microbiological infection and take a deeper dive into identifying the microorganism(s). (Note that measuring microbial growth is not a direct proxy for measuring microbial metabolism, however.&amp;lt;ref&amp;gt;{{Cite journal |last=Braissant |first=Olivier |last2=Astasov-Frauenhoffer |first2=Monika |last3=Waltimo |first3=Tuomas |last4=Bonkat |first4=Gernot |date=2020-11-17 |title=A Review of Methods to Determine Viability, Vitality, and Metabolic Rates in Microbiology |url=https://www.frontiersin.org/articles/10.3389/fmicb.2020.547458/full |journal=Frontiers in Microbiology |volume=11 |pages=547458 |doi=10.3389/fmicb.2020.547458 |issn=1664-302X |pmc=PMC7705206 |pmid=33281753}}&amp;lt;/ref&amp;gt;) Growth can be demonstrated in multiple ways, including&amp;lt;ref name=&amp;quot;:0&amp;quot;&amp;gt;{{Cite book |last=Washington, J.A. |date=1996 |editor-last=Baron |editor-first=Samuel |title=Medical microbiology |chapter=Chapter 10: Principles of Diagnosis |edition=4th ed |publisher=University of Texas Medical Branch at Galveston |place=Galveston, Tex |isbn=978-0-9631172-1-2 |pmid=21413287}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
*confirming turbidity, gas, or discrete colonies in broth;&lt;br /&gt;
*confirming discrete colonies on agar plates;&lt;br /&gt;
*confirming cytopathic effects or inclusions that distort the structures of cells in culture; and&lt;br /&gt;
*confirming &amp;quot;genus- or species-specific antigens or nucleotide sequences&amp;quot;&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; in the specimen, culture medium, or culture system.&lt;br /&gt;
&lt;br /&gt;
Cell culturing plays an important role, as hinted at above. Those cultures can occur in liquid broth, agar plates, or some other enhanced culture medium, as found with blood cultures in specialized bottles or tubes. Cultures are incubated to allow time for any microorganisms to multiply. Then signs of growth are sought out.&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; However, detecting this growth is rarely straightforward and has its own set of complications.&amp;lt;ref&amp;gt;{{Cite journal |last=Zengler |first=Karsten |date=2009-12 |title=Central Role of the Cell in Microbial Ecology |url=https://journals.asm.org/doi/10.1128/MMBR.00027-09 |journal=Microbiology and Molecular Biology Reviews |language=en |volume=73 |issue=4 |pages=712–729 |doi=10.1128/MMBR.00027-09 |issn=1092-2172 |pmc=PMC2786577 |pmid=19946138}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;ŹródłowskiClass20&amp;quot;&amp;gt;{{Cite journal |last=Źródłowski |first=Tomasz |last2=Sobońska |first2=Joanna |last3=Salamon |first3=Dominika |last4=McFarlane |first4=Isabel M. |last5=Ziętkiewicz |first5=Mirosław |last6=Gosiewski |first6=Tomasz |date=2020-02-29 |title=Classical Microbiological Diagnostics of Bacteremia: Are the Negative Results Really Negative? What is the Laboratory Result Telling Us About the “Gold Standard”? |url=https://www.mdpi.com/2076-2607/8/3/346 |journal=Microorganisms |language=en |volume=8 |issue=3 |pages=346 |doi=10.3390/microorganisms8030346 |issn=2076-2607 |pmc=PMC7143506 |pmid=32121353}}&amp;lt;/ref&amp;gt; This may necessitate other methods such as Gram staining or [[wikipedia:Fluorescence in situ hybridization|fluorescence ''in situ'' hybridization]] (FISH) for quicker and more accurate detection of growth.&amp;lt;ref name=&amp;quot;ŹródłowskiClass20&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Taxonomic identification and overall characterization===&lt;br /&gt;
As an extension of detecting microbial growth, microbiologists can examine the growth characteristics of the microorganism(s) in order to identify what type of bacterium or fungus is growing. The identification of viruses, on the other hand, is typically done by examining the cytopathic effects or inclusions that affect cells in culture, or through detection of antigens or nucleotides specific to a viral genus or species.&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt; Databases are commonly used as part of the identification process of microorganisms.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; The sources used for these databases highlight some of the identification techniques employed. For example, a &amp;quot;biochemical reaction&amp;quot; database implies microorganisms identified with techniques such as [[polymerase chain reaction]] (PCR), [[ELISA|enzyme-linked immunosorbent assay]] (ELISA), fatty acid profiling (using [[gas chromatography]] [GC] and [[mass spectrometry]] [MS]), and metabolic/chemo profiling (using [[high-performance liquid chromatography]] [HPLC] and MS).&amp;lt;ref name=&amp;quot;MooreBiochem21&amp;quot;&amp;gt;{{cite web |url=https://www.news-medical.net/life-sciences/Biochemical-Tests-for-Microbial-Identification.aspx |title=Biochemical Tests for Microbial Identification |author=Moore, S. 
|work=News-Medical Life Sciences |date=14 January 2021 |accessdate=26 April 2024}}&amp;lt;/ref&amp;gt; A &amp;quot;nucleic acid sequence&amp;quot; database implies microorganisms identified with PCR for a single pathogen&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt;, or [[DNA microarray]]s, metagenomics analysis, and [[DNA sequencing#High-throughput methods|next-generation sequencing]] (NGS) for identifying multiple pathogens at the same time.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Yadav |first=Brijesh Singh |last2=Ronda |first2=Venkateswarlu |last3=Vashista |first3=Dinesh P. |last4=Sharma |first4=Bhaskar |date=2013-01 |title=Sequencing and Computational Approaches to Identification and Characterization of Microbial Organisms |url=http://journals.sagepub.com/doi/10.4137/BECB.S10886 |journal=Biomedical Engineering and Computational Biology |language=en |volume=5 |pages=BECB.S10886 |doi=10.4137/BECB.S10886 |issn=1179-5972 |pmc=PMC4147756 |pmid=25288901}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
All of these techniques have their place in the microbiology lab, with genotypic methods in particular proving useful &amp;quot;for assessing sterility test and media fill failures, and for tracking the route of contamination as part of a contamination control strategy.&amp;quot;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt; This type of contamination tracking and tracing is enabled by genotypic methods that allow microorganisms to be &amp;quot;characterized,&amp;quot; i.e., grouped together based upon the shared characteristics of their DNA fragment patterns or antigenic profiles.&amp;lt;ref&amp;gt;{{Cite journal |last=Kim |first=Young Ran |last2=Lee |first2=Shee Eun |last3=Kim |first3=Choon Mee |last4=Kim |first4=Soo Young |last5=Shin |first5=Eun Kyoung |last6=Shin |first6=Dong Hyeon |last7=Chung |first7=Sun Sik |last8=Choy |first8=Hyon E. |last9=Progulske-Fox |first9=Ann |last10=Hillman |first10=Jeffrey D. |last11=Handfield |first11=Martin |date=2003-10 |title=Characterization and Pathogenic Significance of Vibrio vulnificus Antigens Preferentially Expressed in Septicemic Patients |url=https://journals.asm.org/doi/10.1128/IAI.71.10.5461-5471.2003 |journal=Infection and Immunity |language=en |volume=71 |issue=10 |pages=5461–5471 |doi=10.1128/IAI.71.10.5461-5471.2003 |issn=0019-9567 |pmc=PMC201039 |pmid=14500463}}&amp;lt;/ref&amp;gt; Other aspects of a culture may be characterized as well in order to provide a more accurate &amp;quot;description&amp;quot; of the microorganism for future identification efforts.&amp;lt;ref&amp;gt;{{Citation |last=Trüper |first=Hans G. |last2=Krämer |first2=Johannes |date=1981 |editor-last=Starr |editor-first=Mortimer P. |editor2-last=Stolp |editor2-first=Heinz |editor3-last=Trüper |editor3-first=Hans G. |editor4-last=Balows |editor4-first=Albert |editor5-last=Schlegel |editor5-first=Hans G. 
|title=Principles of Characterization and Identification of Prokaryotes |url=http://link.springer.com/10.1007/978-3-662-13187-9_6 |work=The Prokaryotes |language=en |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |pages=176–193 |doi=10.1007/978-3-662-13187-9_6 |isbn=978-3-662-13189-3 |accessdate=2024-04-26}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Other analyses and techniques===&lt;br /&gt;
Medical microbiology labs will perform antibiogram and antimicrobial susceptibility testing (AST) as part of their public health function. An antibiogram is a cumulative summary or &amp;quot;overall profile of [''in vitro''] susceptibility testing results for a specific microorganism to an array of antimicrobial drugs,&amp;quot; often given in a tabular form.&amp;lt;ref name=&amp;quot;UnivMNHowTo20&amp;quot;&amp;gt;{{cite web |url=https://arsi.umn.edu/sites/arsi.umn.edu/files/2020-02/How_to_Use_a_Clinical_Antibiogram_26Feb2020_Final.pdf |format=PDF |title=How to Use a Clinical Antibiogram |author=Antimicrobial Resistance and Stewardship Initiative, University of Minnesota |date=February 2020 |accessdate=17 April 2024}}&amp;lt;/ref&amp;gt; Given that antibiotic resistance remains one of the primary challenges for global public health, determining how susceptible a microorganism is to certain antimicrobials before a physician prescribes or administers an antibiotic is of significant value.&amp;lt;ref name=&amp;quot;GajicAnti22&amp;quot;&amp;gt;{{Cite journal |last=Gajic |first=Ina |last2=Kabic |first2=Jovana |last3=Kekic |first3=Dusan |last4=Jovicevic |first4=Milos |last5=Milenkovic |first5=Marina |last6=Mitic Culafic |first6=Dragana |last7=Trudic |first7=Anika |last8=Ranin |first8=Lazar |last9=Opavski |first9=Natasa |date=2022-03-23 |title=Antimicrobial Susceptibility Testing: A Comprehensive Review of Currently Used Methods |url=https://www.mdpi.com/2079-6382/11/4/427 |journal=Antibiotics |language=en |volume=11 |issue=4 |pages=427 |doi=10.3390/antibiotics11040427 |issn=2079-6382 |pmc=PMC9024665 |pmid=35453179}}&amp;lt;/ref&amp;gt; Susceptibility testing approaches common to microbiology labs, which feed into antibiograms, include broth and agar dilution, gradient strip tests, disk diffusion tests, chromogenic and colorimetric tests, PCR, DNA microarrays, and other methods.&amp;lt;ref name=&amp;quot;GajicAnti22&amp;quot; 
/&amp;gt; The nuances of antibiograms and susceptibility testing drive reporting requirements, particularly those of the CLSI standard M39, ''Analysis and Presentation of Cumulative Antimicrobial Susceptibility Test Data''.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot;&amp;gt;{{Cite journal |last=Rhoads |first=Daniel D. |last2=Sintchenko |first2=Vitali |last3=Rauch |first3=Carol A. |last4=Pantanowitz |first4=Liron |date=2014-10 |title=Clinical Microbiology Informatics |url=https://journals.asm.org/doi/10.1128/CMR.00049-14 |journal=Clinical Microbiology Reviews |language=en |volume=27 |issue=4 |pages=1025–1047 |doi=10.1128/CMR.00049-14 |issn=0893-8512 |pmc=PMC4187636 |pmid=25278581}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Simner |first=Patricia J. |last2=Hindler |first2=Janet A. |last3=Bhowmick |first3=Tanaya |last4=Das |first4=Sanchita |last5=Johnson |first5=J. Kristie |last6=Lubers |first6=Brian V. |last7=Redell |first7=Mark A. |last8=Stelling |first8=John |last9=Erdman |first9=Sharon M. |date=2022-10-19 |editor-last=Humphries |editor-first=Romney M. |title=What’s New in Antibiograms? Updating CLSI M39 Guidance with Current Trends |url=https://journals.asm.org/doi/10.1128/jcm.02210-21 |journal=Journal of Clinical Microbiology |language=en |volume=60 |issue=10 |pages=e02210–21 |doi=10.1128/jcm.02210-21 |issn=0095-1137 |pmc=PMC9580356 |pmid=35916520}}&amp;lt;/ref&amp;gt; This highlights the importance of the lab not only accurately performing these analyses but also properly reporting the results for consistent and rapid interpretation.&lt;br /&gt;
&lt;br /&gt;
Digital image analysis is another important technique used in the medical microbiology lab. This work has traditionally been performed with analog microscopy techniques to identify and characterize microorganisms, even into the early 2010s when digital imaging analysis was becoming more viable.&amp;lt;ref&amp;gt;{{Cite journal |last=Pasulka |first=Alexis L. |last2=Hood |first2=Jonathan F. |last3=Michels |first3=Dana E. |last4=Wright |first4=Mason D. |date=2023-01-19 |title=Flexible and open-source programs for quantitative image analysis in microbial ecology |url=https://www.frontiersin.org/articles/10.3389/fmars.2023.1052119/full |journal=Frontiers in Marine Science |volume=10 |pages=1052119 |doi=10.3389/fmars.2023.1052119 |issn=2296-7745}}&amp;lt;/ref&amp;gt; In 2014, Rhoads ''et al.'' characterized automated or semi-automated methods in image interpretation as not being widely implemented in the medical microbiology lab, while at the same time recognizing those methods' potential for screening slides for identifications or characterizations, as well as improving standardization and turnaround time for analyzed specimens.&amp;lt;ref name=&amp;quot;RhoadsClin14&amp;quot; /&amp;gt; Since then, laboratory automation, LLMs, and [[artificial intelligence]] (AI) tools—as well as the [[COVID-19]] [[pandemic]]—have pushed the microbiology imaging paradigm forward sufficiently to arguably make digital image analysis more mainstream.&amp;lt;ref name=&amp;quot;AntoniosCurrent21&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Burns |first=Bethany L. |last2=Rhoads |first2=Daniel D. |last3=Misra |first3=Anisha |date=2023-09-21 |editor-last=Humphries |editor-first=Romney M. 
|title=The Use of Machine Learning for Image Analysis Artificial Intelligence in Clinical Microbiology |url=https://journals.asm.org/doi/10.1128/jcm.02336-21 |journal=Journal of Clinical Microbiology |language=en |volume=61 |issue=9 |pages=e02336–21 |doi=10.1128/jcm.02336-21 |issn=0095-1137 |pmc=PMC10575257 |pmid=37395657}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Yakimovich |first=Artur |date=2024-02-28 |editor-last=Imperiale |editor-first=Michael J. |title=Toward the novel AI tasks in infection biology |url=https://journals.asm.org/doi/10.1128/msphere.00591-23 |journal=mSphere |language=en |volume=9 |issue=2 |pages=e00591–23 |doi=10.1128/msphere.00591-23 |issn=2379-5042 |pmc=PMC10900907 |pmid=38334404}}&amp;lt;/ref&amp;gt; The introduction of automated microscopes &amp;quot;designed to collect high‑resolution image data from microscopic slides&amp;quot; and &amp;quot;high‑resolution image analysis systems that can detect small and mixed colonies, which a human eye cannot&amp;quot;&amp;lt;ref name=&amp;quot;SandleEnhanc21&amp;quot; /&amp;gt; are examples of how modern medical microbiology labs are approaching their imaging work.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
With not only its goal of detecting, identifying, and characterizing [[microorganism]]s for improved patient outcomes, but also its public health component of infectious agent detection and trend analysis, the medical microbiology lab plays a pivotal role in disease detection and prevention. The technology, methods, and requirements associated with these efforts are correspondingly sophisticated, as one might expect when dealing with infectious agents at the micro scale. From cell cultures and digital imaging to genotypic analyses and AST, the complexities of this lab become apparent. Numerous methods exist for detecting microbial growth, a precursor to confirming the presence of microorganisms in specimens. From there, identification using growth characteristics, cytopathic effects, and the detection of antigens or nucleotides can provide greater insight. The characterization of microorganisms and their telltale signs, using techniques such as PCR and MS, further enhances the recorded knowledge we have of them. Antibiograms and AST are also important components of responsible antibiotic use in the global population. Finally, imaging methods are both important to and challenging for the medical microbiology lab, requiring more advanced automated systems to assist with identification and characterization in these often understaffed labs.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles (all)]]&lt;br /&gt;
[[Category:LIMS Q&amp;amp;A articles on medical microbiology]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Template:LIMS_Selection_Guide_for_Manufacturing_Quality_Control/Standards_and_regulations_affecting_manufacturing_labs/Regulations_and_laws_around_the_world&amp;diff=64501</id>
		<title>Template:LIMS Selection Guide for Manufacturing Quality Control/Standards and regulations affecting manufacturing labs/Regulations and laws around the world</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Template:LIMS_Selection_Guide_for_Manufacturing_Quality_Control/Standards_and_regulations_affecting_manufacturing_labs/Regulations_and_laws_around_the_world&amp;diff=64501"/>
		<updated>2024-06-20T22:57:31Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* 2.3.4 Other industries and regulations */ Fixed broken image&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===2.2 Regulations and laws around the world===&lt;br /&gt;
As the end of Chapter 1 highlighted, today's regulatory focus on product safety, quality, and efficacy is largely built on the past failures, injuries, and deaths that demonstrated the regulatory need.&amp;lt;ref&amp;gt;{{Cite book |last=Center for Policy Alternatives at the Massachusetts Institute of Technology |year=1980 |title=Benefits of Environmental, Health, and Safety Regulation |url=https://books.google.com/books?id=VadeKZOzcmwC&amp;amp;pg=PA1 |publisher=U.S. Government Printing Office |pages=100}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;AschConsum88&amp;quot;&amp;gt;{{Cite book |last=Asch |first=Peter |date=1988 |title=Consumer safety regulation: putting a price on life and limb |url=https://books.google.com/books?id=Pi_nCwAAQBAJ&amp;amp;pg=PA1 |publisher=Oxford University Press |place=New York |pages=3–14 |isbn=978-0-19-504972-5}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite book |last=Dwyer |first=Tom |date=1991 |title=Life and death at work: industrial accidents as a case of socially produced error |series=Plenum studies in work and industry |publisher=Plenum Press |place=New York |isbn=978-0-306-43949-0}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite book |last=CoVan |first=James |date=1995 |title=Safety engineering |series=New dimensions in engineering |publisher=Wiley |place=New York |isbn=978-0-471-55612-1}}&amp;lt;/ref&amp;gt; For example, the consumption of raw milk, particularly milk from unscrupulous dairy farms, was associated with a growing number of health issues in the mid- to late 1800s. In the U.S. Northeast during the 1860s, recognition grew of the threat to human health posed by tainted milk from dairy cows fed solely on distillery byproducts. Not only was the milk from such cows thin and low in nutrients, but it was also adulterated with questionable substances to give it a better appearance. As a result, many children and adults fell ill or died from consuming the product. 
The efforts of Dr. Henry Coit and others in the late 1800s to develop a certification program for milk—which included laboratory testing among other activities—eventually helped plant the seeds for a national food and beverage safety program.&amp;lt;ref&amp;gt;{{Cite book |last=Lytton |first=Timothy D. |date=2019 |title=Outbreak: foodborne illness and the struggle for food safety |chapter=Chapter 2: The Gospel of Clean Milk |publisher=The University of Chicago Press |place=Chicago ; London |pages=24-64 |isbn=978-0-226-61154-9}}&amp;lt;/ref&amp;gt; By 1939, the U.S. Public Health Service had drafted the Model Milk Health Ordinance &amp;quot;in order to encourage a greater uniformity of milk-control practice in the United States.&amp;quot;&amp;lt;ref name=&amp;quot;PHSMilk39&amp;quot;&amp;gt;{{cite web |url=http://resource.nlm.nih.gov/101528318 |title=Milk ordinance and code: Recommended by the United States Public Health Service, 1939 |author=U.S. Public Health Service |publisher=U.S. Government Printing Office |date=1939 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
While regulation can at times be overbearing and harmful, well-crafted regulations can clearly benefit society. This can be seen with manufacturing regulations grounded in safety, quality, and efficacy principles. How those regulations are implemented around the world may differ slightly, however, which should not be surprising given the cultural, political, and functional differences across regions and nations of the world.&amp;lt;ref name=&amp;quot;BuzbyFood03&amp;quot;&amp;gt;{{cite web |url=https://www.ers.usda.gov/amber-waves/2003/november/food-safety-and-trade-regulations-risks-and-reconciliation/ |title=Food Safety and Trade: Regulations, Risks, and Reconciliation |author=Buzby, J.C.; Mitchell, L. |work=Amber Waves |publisher=U.S. Department of Agriculture, Economic Research Service |date=01 November 2003 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; &lt;br /&gt;
&lt;br /&gt;
The following subsections examine some of the more critical regulations that apply to a wide variety of manufacturing industries, from various parts of the world.&lt;br /&gt;
&lt;br /&gt;
====2.2.1 Food and beverage====&lt;br /&gt;
[[File:Food-safety-modernization-act-fsma.png|right|300px]]The safety and quality of food is a high priority for most countries around the world, though how that safety and quality is regulated and legislated varies, sometimes significantly. The following subsections briefly address the primary regulations and legislation enacted in seven major countries and supranational unions around the world. (It is beyond the scope of this guide to address them all.) Similarities among the countries may be seen in their goals, but it should be noted that differences—significant and nuanced—exist among them all in regard to regulatory approaches to sampling, testing, risk, and importing of products.&amp;lt;ref name=&amp;quot;BuzbyFood03&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;WestGlobal18&amp;quot;&amp;gt;{{cite web |url=https://www.brookings.edu/research/global-manufacturing-scorecard-how-the-us-compares-to-18-other-nations/ |title=Global manufacturing scorecard: How the US compares to 18 other nations |author=West, D.M.; Lansang, C. |work=Brookings |date=10 July 2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;GAOFoodSafety05&amp;quot;&amp;gt;{{cite web |url=https://www.gao.gov/products/gao-05-212 |title=Food Safety: Experiences of Seven Countries in Consolidating Their Food Safety Systems |author=U.S. Government Accountability Office |date=February 2005 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;WhitworthReport22&amp;quot;&amp;gt;{{cite web |url=https://www.foodsafetynews.com/2022/02/report-finds-food-testing-policies-different-between-countries/ |title=Report finds food testing policies different between countries |author=Whitworth, J. |work=Food Safety News |date=22 February 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; &lt;br /&gt;
&lt;br /&gt;
'''2.2.1.1 Food Safety Act 1990 and Food Standards Act 1999 - ''United Kingdom'''''&lt;br /&gt;
&lt;br /&gt;
The [[wikipedia:Food Safety Act 1990|Food Safety Act of 1990]] and [[wikipedia:Food Standards Agency|Food Standards Act of 1999]] represent the core of food safety regulation in the United Kingdom, though there are other pieces of legislation that also have an impact.&amp;lt;ref name=&amp;quot;SBCFood22&amp;quot;&amp;gt;{{cite web |url=https://www.scarborough.gov.uk/home/business-licensing-and-grants/food-hygeine/food-safety-regulations |archiveurl=https://web.archive.org/web/20230203164750/https://www.scarborough.gov.uk/home/business-licensing-and-grants/food-hygeine/food-safety-regulations |title=Food safety regulations |publisher=Scarborough Borough Council |date=10 November 2022 |archivedate=03 February 2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;FSAKey22&amp;quot;&amp;gt;{{cite web |url=https://www.food.gov.uk/about-us/key-regulations |title=Key regulations |publisher=Food Standards Agency |date=30 August 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The Food Safety Act of 1990 encourages entities to &amp;quot;not include anything in food, remove anything from food, or treat food in any way which means it would be damaging to the health of people eating it&amp;quot;; serve or sell food that is of a quality that &amp;quot;consumers would expect&amp;quot;; and ensure food is labeled, advertised, and presented clearly and truthfully.&amp;lt;ref name=&amp;quot;SBCFood22&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;FSAKey22&amp;quot; /&amp;gt; The Food Standards Act of 1999 later created the UK's Food Standards Agency (FSA) &amp;quot;to protect public health from risks which may arise in connection with the consumption of food (including risks caused by the way in which it is produced or supplied) and otherwise to protect the interests of consumers in relation to food.&amp;quot;&amp;lt;ref name=&amp;quot;FSA99Sec1&amp;quot;&amp;gt;{{cite web |url=https://www.legislation.gov.uk/ukpga/1999/28/section/1 |title=1999 c. 
28, The Food Standards Agency, Section 1 |work=legislation.gov.uk |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; One of the ways the FSA does this is through enforcing food safety regulation at the local level, including within food production facilities, as well as setting ingredient and nutrition labeling policy.&amp;lt;ref name=&amp;quot;FSAAbout&amp;quot;&amp;gt;{{cite web |url=https://www.gov.uk/government/organisations/food-standards-agency |title=Food Standards Agency |work=Gov.uk |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Regulations and guidance from the FSA address not only labelling but also radioactivity monitoring, meat processing, manure management, ''Salmonella'' testing, temperature control, dairy hygiene, and more.&amp;lt;ref name=&amp;quot;FSAGuidReg&amp;quot;&amp;gt;{{cite web |url=https://www.gov.uk/search/guidance-and-regulation?organisations%5B%5D=food-standards-agency&amp;amp;parent=food-standards-agency |title=Guidance and regulation: Food Standards Agency (FSA) |work=Gov.uk |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.2 Food Safety and Standards Act of 2006 - ''India'''''&lt;br /&gt;
&lt;br /&gt;
This act was enacted in 2006 to both consolidate existing food-related law and to establish the Food Safety and Standards Authority of India (FSSAI), which develops regulations and standards of practice for the manufacture, storage, distribution, and packaging of food.&amp;lt;ref name=&amp;quot;PRSImplement&amp;quot;&amp;gt;{{cite web |url=https://prsindia.org/policy/report-summaries/implementation-food-safety-and-standards-act-2006 |title=Implementation of Food Safety and Standards Act, 2006 |work=PRS Legislative Research |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;FSSAIFood&amp;quot;&amp;gt;{{cite web |url=https://fssai.gov.in/cms/food-safety-and-standards-act-2006.php |title=Food Safety and Standards Act, 2006 |publisher=Food Safety and Standards Authority of India |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; However, an audit of FSSAI by the Comptroller and Auditor General of India (CAG) in December 2017 revealed some deficiencies in the FSSAI's activities, including an overall &amp;quot;low quality&amp;quot; of food testing laboratories in the country.&amp;lt;ref name=&amp;quot;PRSImplement&amp;quot; /&amp;gt; Nonetheless, the FSSAI remains the primary regulatory watchdog, developing standards and guidelines for food and enforcing those standards. This includes setting limits for food additives, contaminants, pesticides, drugs, heavy metals, and more, as well as defining quality control mechanisms, accreditation requirements, sampling and analytical techniques, and more.&amp;lt;ref name=&amp;quot;FSSAIFood&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.3 Food Safety Law - ''China'''''&lt;br /&gt;
&lt;br /&gt;
The [[wikipedia:Food safety in China|Food Safety Law]] is described as &amp;quot;the fundamental law regulating food safety in China.&amp;quot;&amp;lt;ref name=&amp;quot;UNEPFood15&amp;quot;&amp;gt;{{cite web |url=https://leap.unep.org/countries/cn/national-legislation/food-safety-law-2015 |title=Food Safety Law (2015) |author=Food and Agriculture Organization of the United Nations |work=Law and Environment Assistance Platform |publisher=United Nations Environmental Programme |date=24 April 2015 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Enacted in 2009 and revised in 2015, the Law &amp;quot;builds up the basic legal framework for food safety supervision and management&amp;quot; and  &amp;quot;introduces many new regulatory requirements,&amp;quot; including &amp;quot;not only general requirements applicable to food and food additives, but also specific requirements for food-related products and other product categories.&amp;quot;&amp;lt;ref name=&amp;quot;UNEPFood15&amp;quot; /&amp;gt; Among these activities, the Law describes how food testing laboratories shall conduct their activities, from accreditation and sampling to testing and reporting.&amp;lt;ref name=&amp;quot;USDAChina15&amp;quot;&amp;gt;{{cite web |url=https://apps.fas.usda.gov/newgainapi/api/report/downloadreportbyfilename?filename=Amended%20Food%20Safety%20Law%20of%20China_Beijing_China%20-%20Peoples%20Republic%20of_5-18-2015.pdf |format=PDF |title=China's Food Safety Law (2015) |author=Foreign Agriculture Service Staff |publisher=U.S. Department of Agriculture |work=GAIN Repo |date=18 May 2015 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.4 Food Sanitation Act and Food Safety Basic Act - ''Japan'''''&lt;br /&gt;
&lt;br /&gt;
The Food Sanitation Act of 1947 and the Food Safety Basic Act of 2003 represent the most important pieces of food-related legislation in Japan, though there are others. The Food Sanitation Act was originally enacted &amp;quot;to prevent sanitation hazards resulting from eating and drinking by enforcing regulations and other measures necessary from the viewpoint of public health, to ensure food safety and thereby to protect citizens' health.&amp;quot;&amp;lt;ref name=&amp;quot;JLTFood47&amp;quot;&amp;gt;{{cite web |url=https://www.japaneselawtranslation.go.jp/en/laws/view/3687/en |title=Food Sanitation Act (Act No. 233 of 1947) |work=Japanese Law Translation |date=24 December 1947 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The Food Safety Basic Act recognized the effects of &amp;quot;internationalization&amp;quot; and changing dietary habits, as well as scientific and technological shifts in food production, as a primary driver for modernizing food safety and sustainability in the country, and it also created the Food Safety Commission of Japan.&amp;lt;ref name=&amp;quot;FSCFoodSafe03&amp;quot;&amp;gt;{{cite web |url=https://www.fsc.go.jp/english/basic_act/fs_basic_act.pdf |format=PDF |title=Food Safety Basic Act |publisher=Food Safety Commission of Japan |date=23 May 2003 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Between the two pieces of legislation, standards and specifications for food and food additives, as well as associated tools and packaging, are addressed, as are inspection standards, production standards, hygiene management, and individual food and ingredient safety.&amp;lt;ref name=&amp;quot;BMFoodJapan18&amp;quot;&amp;gt;{{cite web |url=https://resourcehub.bakermckenzie.com/en/resources/asia-pacific-food-law-guide/asia-pacific/japan/topics/food-product-and-safety-regulation |title=Japan: Food product and safety regulation |work=Asia Pacific Food Law Guide |author=Baker McKenzie |date=2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.5 Food Safety Modernization Act (FSMA) and other acts - ''United States'''''&lt;br /&gt;
&lt;br /&gt;
The [[wikipedia:FDA Food Safety Modernization Act|Food Safety Modernization Act]] of the United States was signed into law in January 2011, giving the US Food and Drug Administration (FDA) more regulatory authority to address the way food is grown, harvested, and processed.&amp;lt;ref name=&amp;quot;WeinrothHist18&amp;quot;&amp;gt;{{Cite journal |last=Weinroth |first=Margaret D |last2=Belk |first2=Aeriel D |last3=Belk |first3=Keith E |date=2018-11-09 |title=History, development, and current status of food safety systems worldwide |url=https://academic.oup.com/af/article/8/4/9/5087923 |journal=Animal Frontiers |language=en |volume=8 |issue=4 |pages=9–15 |doi=10.1093/af/vfy016 |issn=2160-6056 |pmc=PMC6951898 |pmid=32002225}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;FDAFood22&amp;quot;&amp;gt;{{cite web |url=https://www.fda.gov/animal-veterinary/animal-food-feeds/food-safety-modernization-act-and-animal-food |title=Food Safety Modernization Act and Animal Food |publisher=U.S. Food and Drug Administration |date=20 October 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; It has been described by the FDA as &amp;quot;the most sweeping reform of our food safety laws in more than 70 years.&amp;quot;&amp;lt;ref name=&amp;quot;FDAFood22&amp;quot; /&amp;gt; The FSMA, at its base, has five key aspects, addressing preventive controls, inspection and compliance, safety of food imports, mandatory recall response, and food partnership enhancement.&amp;lt;ref name=&amp;quot;FDAFood22&amp;quot; /&amp;gt; However, FSMA continues to evolve, with additional rules getting added since its enactment, including rules about record management, good manufacturing practice (GMP) for human food and animal feed, and laboratory accreditation (referred to as the [[LII:FDA Food Safety Modernization Act Final Rule on Laboratory Accreditation for Analyses of Foods: Considerations for Labs and Informatics Vendors|LAAF Rule]]).&amp;lt;ref name=&amp;quot;FDAFSMA22&amp;quot;&amp;gt;{{cite web 
|url=https://www.fda.gov/food/food-safety-modernization-act-fsma/fsma-rules-guidance-industry#rules |title=FSMA Rules &amp;amp; Guidance for Industry |publisher=U.S. Food and Drug Administration |date=20 October 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Another important regulatory body in the US is the Food Safety and Inspection Service (FSIS), which is overseen by the US Department of Agriculture (USDA). The FSIS and its authority to regulate are derived from three different acts: the Federal Meat Inspection Act of 1906, the Poultry Products Inspection Act of 1957, and the Egg Products Inspection Act of 1970.&amp;lt;ref name=&amp;quot;USDAOurHist18&amp;quot;&amp;gt;{{cite web |url=https://www.fsis.usda.gov/about-fsis/history |title=Our History |author=Food Safety and Inspection Service |publisher=U.S. Department of Agriculture |date=21 February 2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The FSIS has developed its own regulatory requirements for meat, poultry, and egg products, including for inspections, imports and exports, labeling, and laboratory testing.&amp;lt;ref name=&amp;quot;9CFR412&amp;quot;&amp;gt;{{cite web |url=https://www.ecfr.gov/current/title-9/chapter-III/subchapter-E/part-412 |title=9 CFR Part 412 - Label Approval |work=Code of Federal Regulations |date=31 October 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;FSISFedReg&amp;quot;&amp;gt;{{cite web |url=https://www.fsis.usda.gov/policy/federal-register-rulemaking/federal-register-rules |title=Federal Register Rules |publisher=Food Safety and Inspection Service |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;NALFoodSafe&amp;quot;&amp;gt;{{cite web |url=https://www.nal.usda.gov/human-nutrition-and-food-safety/food-safety-standards |title=Food Safety Standards |author=National Agricultural Library |publisher=U.S. Department of Agriculture |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.6 General Food Law Regulation (GFLR) - ''European Union'''''&lt;br /&gt;
&lt;br /&gt;
The GFLR was enacted across the European Union in 2002 as part of Regulation (EC) No 178/2002, and it is described as &amp;quot;the foundation of food and feed law&amp;quot; for the EU.&amp;lt;ref name=&amp;quot;EUGeneral&amp;quot;&amp;gt;{{cite web |url=https://food.ec.europa.eu/horizontal-topics/general-food-law_en |title=General Food Law |work=Food Safety |publisher=European Commission |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Along with setting requirements and procedures for food and feed safety, the GFLR also mandated the creation of the European Food Safety Authority (EFSA), an independent body assigned to developing sound scientific advice about and providing support towards the goals of food, beverage, and feed safety in the EU.&amp;lt;ref name=&amp;quot;WeinrothHist18&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;EUGeneral&amp;quot; /&amp;gt; As such, the EFSA develops broad and sector-specific guidance&amp;lt;ref name=&amp;quot;EFSAGuidance&amp;quot;&amp;gt;{{cite web |url=https://www.efsa.europa.eu/en/methodology/guidance |title=Guidance and other assessment methodology documents |publisher=European Food Safety Authority |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;, as well as other rules related to scientific assessment of food safety matters, e.g., Regulation (EC) No 2073/2005 on microbiological criteria for foodstuffs.&amp;lt;ref name=&amp;quot;EU2073-2005&amp;quot;&amp;gt;{{cite web |url=https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32005R2073 |title=Commission Regulation (EC) No 2073/2005 of 15 November 2005 on microbiological criteria for foodstuffs |work=EUR-Lex |date=03 August 2020 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The EFSA also develops food classification standardization tools such as the Standard Sample Description (SSD2) data model, to better ensure an appropriate &amp;quot;format for describing food and feed samples and analytical results that is used by EFSA’s data providers.&amp;quot;&amp;lt;ref 
name=&amp;quot;EFSAFoodClass&amp;quot;&amp;gt;{{cite web |url=https://www.efsa.europa.eu/en/data/data-standardisation |title=Food classification standardisation – The FoodEx2 system |publisher=European Food Safety Authority |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.1.7 Safe Food for Canadians Act (SFCA) - ''Canada'''''&lt;br /&gt;
&lt;br /&gt;
In November 2012, the SFCA was enacted to place regulatory &amp;quot;focus on prevention to ensure a food that is imported, exported or shipped from one province to another, is manufactured, stored, packaged and labelled in a way that does not present a risk of contamination.&amp;quot;&amp;lt;ref name=&amp;quot;ManitobaSafe&amp;quot;&amp;gt;{{cite web |url=https://www.gov.mb.ca/agriculture/food-safety/at-the-food-processor/safe-food-for-canadians-act.html |title=Safe Food for Canadians Act |publisher=Manitoba Government |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;JLWSafeFood19&amp;quot;&amp;gt;{{cite web |url=https://laws-lois.justice.gc.ca/eng/acts/s-1.1/index.html |title=Safe Food for Canadians Act (S.C. 2012, c. 24) |work=Justice Laws Website |publisher=Government of Canada |date=17 June 2019 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Though Canadian Food Inspection Agency (CFIA) enforcement of the SFCA's regulations didn't start until January 2019&amp;lt;ref name=&amp;quot;ManitobaSafe&amp;quot; /&amp;gt;, the consolidation of 14 sets of existing food regulations by the SFCA has managed to improve consistency, reduce administrative burden, and enable food business innovation.&amp;lt;ref name=&amp;quot;GoCUnder18&amp;quot;&amp;gt;{{cite book |last=Canadian Food Inspection Agency |year=2018 |title=Understanding the Safe Food for Canadians Regulations: A handbook for food businesses |url=https://inspection.canada.ca/food-safety-for-industry/toolkit-for-food-businesses/sfcr-handbook-for-food-businesses/eng/1481560206153/1481560532540?chap=0 |publisher=Government of Canada |isbn=9780660269856}}&amp;lt;/ref&amp;gt; An interpretive guide published by the CFIA, ''Understanding the Safe Food for Canadians Regulations: A handbook for food businesses'', summarizes and explains some of the nuances of the SFCA and its 16 parts on matters such as trade, licensing, preventive controls, packaging and labeling, and traceability.&amp;lt;ref 
name=&amp;quot;GoCUnder18&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====2.2.2 Materials====&lt;br /&gt;
[[File:Assessing prototype reference material for testing emissions of VOCs (5940985174).jpg|left|330px]]&lt;br /&gt;
&lt;br /&gt;
As with materials standards, there are too many materials regulations to list within the scope of this guide. However, several examples of materials-related regulations from various parts of the world are given below.&lt;br /&gt;
&lt;br /&gt;
'''2.2.2.1 21 CFR Part 175 and 176 - ''United States'''''&lt;br /&gt;
&lt;br /&gt;
These two regulations from the U.S. Code of Federal Regulations system relate specifically to the materials used to package food. (While the food and beverage industry is the recipient and end user of such materials, materials researchers and manufacturers are responsible, in part, for developing and producing materials that meet such regulations.) Part 175 dictates what substances can be used as components of adhesives and coatings in food packaging materials&amp;lt;ref name=&amp;quot;ECFR21Part175&amp;quot;&amp;gt;{{cite web |url=https://www.ecfr.gov/current/title-21/chapter-I/subchapter-B/part-175?toc=1 |title=Title 21, Chapter I, Subchapter B, Part 175 |work=Code of Federal Regulations |publisher=Office of the Federal Register; U.S. Government Publishing Office |date=24 April 2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;, and Part 176 addresses what substances can be used as components of paper and paperboard in food packaging materials.&amp;lt;ref name=&amp;quot;ECFR21Part176&amp;quot;&amp;gt;{{cite web |url=https://www.ecfr.gov/current/title-21/chapter-I/subchapter-B/part-176?toc=1 |title=Title 21, Chapter I, Subchapter B, Part 176 |work=Code of Federal Regulations |publisher=Office of the Federal Register; U.S. Government Publishing Office |date=24 April 2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.2.2 Building Standard Law - ''Japan'''''&lt;br /&gt;
&lt;br /&gt;
Japan's Building Standard Law, broadly speaking, sets out minimum standards that must be followed in the construction of buildings in the country, in order to protect the health and property of those who use them. Because the country's infrastructure is affected by snow accumulation, earthquakes, and tsunamis, a modern approach to building regulation was needed when the law was enacted in 1950. One section of the law addresses the quality of materials used in building construction, particularly in critical structural elements such as the foundation, load-bearing walls, and columns. The regulation mandates that such materials conform to the Japanese Industrial Standard or Japanese Agricultural Standard, or that they be specifically approved by the Minister. This means manufacturers of structural steel, high-strength bolts, concrete, wood-based composite panels, membrane materials, and more are largely beholden to those standards throughout the manufacturing process.&amp;lt;ref name=&amp;quot;TomohiroIntro13&amp;quot;&amp;gt;{{cite web |url=https://www.bcj.or.jp/upload/international/baseline/BSLIntroduction201307_e.pdf |format=PDF |title=Introduction to the Building Standard Law, Building Regulation in Japan |author=Tomohiro, H. |publisher=Building Center of Japan |date=July 2013 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.2.3 The Furniture and Furnishings (Fire) (Safety) Regulations 1988 - ''United Kingdom'''''&lt;br /&gt;
&lt;br /&gt;
In 1988, the U.K. enacted The Furniture and Furnishings (Fire) (Safety) Regulations, which &amp;quot;were introduced to help reduce the risks of injury or loss of life through fires in the home spread by upholstered furniture.&amp;quot;&amp;lt;ref name=&amp;quot;OPSSUpdat19&amp;quot;&amp;gt;{{cite web |url=https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/822072/furniture-fire-regulations-2016-consultation-government-response-july-2019.pdf |format=PDF |title=Updating The Furniture and Furnishings (Fire) (Safety) Regulations 1988 - Government response to consultation |author=Office for Product Safety &amp;amp; Standards |publisher=Crown |date=July 2019 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The regulations set out general testing requirements for foam fillings, non-foam fillings, composite fillings, and cover materials, requiring that they meet various material and combustion requirements.&amp;lt;ref name=&amp;quot;BCNewUphol22&amp;quot;&amp;gt;{{cite web |url=https://www.businesscompanion.info/en/quick-guides/product-safety/new-upholstered-furniture |title=New upholstered furniture |work=Business Companion |publisher=Chartered Trading Standards Institute |date=December 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Various updates have been made to the regulations over time. In 2019, the government, noting &amp;quot;growing evidence linking the specific flame-retardant chemicals most often used in furniture to serious long-term health impacts,&amp;quot; acknowledged that updates to regulations on the materials used in furniture making were required, which would in turn further dictate manufacturing processes.&amp;lt;ref name=&amp;quot;OPSSUpdat19&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.2.4 National Environment Protection (Used Packaging Materials) Measure 2011 - ''Australia'''''&lt;br /&gt;
&lt;br /&gt;
In September 2011, Australia's National Environment Protection Council enacted the National Environment Protection (Used Packaging Materials) Measure 2011. Signatories of the associated Australian Packaging Covenant agreed to &amp;quot;a voluntary system of industry self regulation&amp;quot; to better ensure &amp;quot;improved environmental outcomes&amp;quot; with regard to packaging material use and re-use.&amp;lt;ref name=&amp;quot;FRLNation11&amp;quot;&amp;gt;{{cite web |url=https://www.legislation.gov.au/Details/F2011L02093 |title=National Environment Protection (Used Packaging Materials) Measure 2011 |work=Federal Register of Legislation |publisher=Australian Government |date=16 September 2011 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; However, a 2021 review found that the regulation was lacking in several areas, including monitoring and enforcement. A December 2022 response by the government agreed, noting, &amp;quot;[l]ack of compliance monitoring and enforcement activity has undermined the effectiveness of, and confidence in, the mandatory co-regulatory arrangement and enabled some businesses to avoid their obligations.&amp;quot; Updates to the regulations are expected by 2025, &amp;quot;to ensure that all packaging available in Australia is designed to be recovered, reused, recycled and reprocessed safely in line with circular economy principles. Reforming the regulation of packaging in Australia presents a significant opportunity to improve the way our packaging is designed ...&amp;quot;&amp;lt;ref name=&amp;quot;DCCEEWAustr&amp;quot;&amp;gt;{{cite web |url=https://www.dcceew.gov.au/environment/protection/waste/plastics-and-packaging/packaging-covenant |title=Australian Packaging Covenant |publisher=Department of Climate Change, Energy, the Environment and Water |date=2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; As this language suggests, such regulatory changes, should they come, will further dictate the materials designed and used in manufacturing processes in the country.&lt;br /&gt;
&lt;br /&gt;
'''2.2.2.5 Surface Coating Materials Regulations (SOR/2016-193) - ''Canada'''''&lt;br /&gt;
&lt;br /&gt;
Enacted in 2016, these regulations—enabled by Canada's Consumer Product Safety Act—dictate the amount of lead and mercury allowed in surface coating materials, stickers, and films under certain circumstances. The testing for those [[heavy metals]] must be done by manufacturers (or their third-party labs) &amp;quot;in accordance with a method that conforms to good laboratory practices.&amp;quot;&amp;lt;ref name=&amp;quot;JLWSurface22&amp;quot;&amp;gt;{{cite web |url=https://laws-lois.justice.gc.ca/eng/regulations/SOR-2016-193/page-1.html |title=Surface Coating Materials Regulations |work=Justice Laws Website |publisher=Government of Canada |date=19 December 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====2.2.3 Pharmaceutical and medical devices====&lt;br /&gt;
[[File:Generic Drug Research (5896) (8493711422).jpg|right|340px]]As noted in the section on pharmaceutical and medical device standards, regulations in this industry are vital for ensuring the safety of consumers and the efficacy of the drugs and devices they use, and they are typically among the most rigorous in the broad world of manufacturing. That said, as serious as pharmaceutical regulation can be, governments can often be slow to act on their regulatory duties&amp;lt;ref name=&amp;quot;MPWGSRegul11&amp;quot;&amp;gt;{{Cite book |url=https://publications.gc.ca/collections/collection_2012/bvg-oag/FA1-2011-2-4-eng.pdf |format=PDF |last=Office of the Auditor General of Canada |year=2011 |title=Report of the Auditor General of Canada to the House of Commons |chapter=Chapter 4: Regulating Pharmaceutical Drugs - Health Canada |publisher=Minister of Public Works and Government Services |pages=1–38 |isbn=9781100194028}}&amp;lt;/ref&amp;gt; or have complex governance structures and difficult-to-enforce rules.&amp;lt;ref name=&amp;quot;BasakIndian23&amp;quot;&amp;gt;{{cite web |url=https://www.outlookindia.com/national/indian-pharma-irregularities-hint-at-lack-of-drug-regulation-policies-and-better-oversight-news-261625 |title=Indian Pharma Irregularities Point To Lack Of Drug Regulation Policies And Better Oversight |author=Basak, S. |work=Outlook |publisher=Outlook Publishing India Pvt. Ltd |date=13 February 2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;BhattIncon23&amp;quot;&amp;gt;{{Cite journal |last=Bhatt |first=Neha |date=2023-01-13 |title=Inconsistent drug regulation spells danger for India’s global pharma ambitions |url=https://www.bmj.com/lookup/doi/10.1136/bmj.p23 |journal=BMJ |language=en |pages=p23 |doi=10.1136/bmj.p23 |issn=1756-1833}}&amp;lt;/ref&amp;gt; As these regulatory bodies and their regulations improve, manufacturers of pharmaceuticals and medical devices find themselves needing to be even more efficient in their activities and focused on good manufacturing practices.&lt;br /&gt;
&lt;br /&gt;
The following represent examples of pharmaceutical and medical device regulation around the world. Note that GMP and cGMP are discussed further in the next section.&lt;br /&gt;
&lt;br /&gt;
'''2.2.3.1 Current Good Manufacturing Practice (cGMP) regulations - ''United States and other countries'''''&lt;br /&gt;
&lt;br /&gt;
For more on this, see the next section on other industries and regulations.&lt;br /&gt;
&lt;br /&gt;
'''2.2.3.2 Drugs and Cosmetics Act of 1940 - ''India'''''&lt;br /&gt;
&lt;br /&gt;
This act represents the primary regulatory control over the manufacture of pharmaceuticals and cosmetics in India. The act &amp;quot;creates a web of regulatory authorities to govern the process at both the central and the state levels.&amp;quot;&amp;lt;ref name=&amp;quot;BasakIndian23&amp;quot; /&amp;gt; Like other pharmaceutical regulations in other countries, the act attempts to ensure that drugs, cosmetics, medical devices, and diagnostic devices sold in India are safe, standardized to a certain quality, and effective at what they are said to do. Public health activists, lawyers, and specialized agencies such as the WHO, however, have criticized India's regulatory efforts in recent years for not keeping up with the rapidly growing industry while exhibiting poor transparency, unnecessary complexity, and insufficient statutory backing (as seen with the Central Drugs Standard Control Organisation or CDSCO).&amp;lt;ref name=&amp;quot;BasakIndian23&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;BhattIncon23&amp;quot; /&amp;gt; The country has been attempting to reform the regulatory system by, for example, updating the Indian Pharmacopoeia and the Drugs and Cosmetics Act with more modern, relevant information and improved labeling mechanisms for tracking and tracing drugs.&amp;lt;ref&amp;gt;{{Cite journal |last=Kukade, T.; Punnen, D.; Antani, M. |date=24 June 2022 |title=Mid-Year Regulatory Update 2022: Pharmaceuticals in India |url=https://www.natlawreview.com/article/mid-year-regulatory-update-2022-pharmaceuticals-india |journal=The National Law Review |volume=XII |issue=175}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.3.3 Food and Drugs Act - ''Canada'''''&lt;br /&gt;
&lt;br /&gt;
Canada's Food and Drugs Act enables its Food and Drug Regulations, which, among other things, help to &amp;quot;ensure the pharmaceutical drugs offered for sale in Canada are safe, effective and of high quality.&amp;quot;&amp;lt;ref name=&amp;quot;HCDrug22&amp;quot;&amp;gt;{{cite web |url=https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/legislation-guidelines.html |title=Drug products legislation and guidelines |author=Health Canada |publisher=Government of Canada |date=20 June 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Part C of the regulations addresses drugs in a broad sense, with Division 2 of that part regulating manufacturing controls, quality testing, and records management.&amp;lt;ref name=&amp;quot;JLWFoodDrug23&amp;quot;&amp;gt;{{cite web |url=https://lois.justice.gc.ca/eng/regulations/C.R.C.,_c._870/index.html |title=Food and Drug Regulations (C.R.C., c. 870) |work=Justice Laws Website |publisher=Government of Canada |date=15 February 2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; These regulations are overseen and updated by Health Canada &amp;quot;through a combination of scientific review, monitoring, compliance, and enforcement activities.&amp;quot;&amp;lt;ref name=&amp;quot;MPWGSRegul11&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.3.4 Pharmaceutical Affairs Act (PAA) and Medical Devices Act (MDA) - ''South Korea'''''&lt;br /&gt;
&lt;br /&gt;
Overseen by the Ministry of Food and Drug Safety (MFDS), these acts help ensure safe and effective pharmaceuticals and medical devices for the people of South Korea. The country's PAA classifies pharmaceuticals as either pharmaceutical ingredients or drug products, which in turn are broken down into new drugs, pharmaceuticals requiring specific data submissions, and generic drugs.&amp;lt;ref name=&amp;quot;PBMKorea18&amp;quot;&amp;gt;{{cite web |url=https://www.pacificbridgemedical.com/regulation/korea-medical-device-pharmaceutical-regulations/ |title=Korea Medical Device and Pharmaceutical Regulations |publisher=Pacific Bridge Medical |date=12 August 2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The PAA's goal is described as being &amp;quot;to prescribe matters necessary to deal with pharmaceutical affairs smoothly, thereby contributing to the improvement of the national public health.&amp;quot;&amp;lt;ref name=&amp;quot;KLRIPharm17&amp;quot;&amp;gt;{{cite web |url=https://elaw.klri.re.kr/eng_service/lawView.do?hseq=40196&amp;amp;lang=ENG |title=Pharmaceutical Affairs Act |publisher=Korea Legislation Research Institute |date=25 August 2017 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The MDA's goal is &amp;quot;to promote the efficient management of medical devices and further contribute to the improvement of public health by providing for matters concerning the manufacturing, import, distribution, etc. of medical devices.&amp;quot;&amp;lt;ref name=&amp;quot;KLRIMedDev18&amp;quot;&amp;gt;{{cite web |url=https://elaw.klri.re.kr/eng_service/lawView.do?hseq=48691&amp;amp;lang=ENG |title=Medical Devices Act |publisher=Korea Legislation Research Institute |date=12 July 2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Along with managing the regulations from these acts, the MFDS is also responsible for setting the manufacturing standards and specifications for pharmaceuticals, monitoring certain pre- and post-manufacturing procedures, enforcing good manufacturing practice, reinforcing safety controls, and more.&amp;lt;ref name=&amp;quot;PBMKorea18&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.2.3.5 Pharmaceutical and Medical Device Act (PMD Act) - ''Japan'''''&lt;br /&gt;
&lt;br /&gt;
The PMD Act represents one of the primary pieces of pharmaceutical and medical device regulation in Japan. It replaced Japan's Pharmaceutical Affairs Law in November 2014, with the primary goal of using regulation to improve public health by assuring the quality, safety, and efficacy of pharmaceuticals, medical devices, and cosmetics, as well as preventing the expansion of activities harmful to those efforts. Under the PMD Act, manufacturers are expected to demonstrate how their products conform to Japan's regulations and how their operations are guided by a well-defined QMS.&amp;lt;ref name=&amp;quot;JPMAPharm20&amp;quot;&amp;gt;{{cite book |url=https://www.jpma.or.jp/english/about/parj/eki4g6000000784o-att/2020e_ch02.pdf |format=PDF |title=Pharmaceutical Administration and Regulations in Japan |chapter=Chapter 2: Pharmaceutical Laws and Regulations |author=Regulatory Information Task Force |publisher=Japan Pharmaceutical Manufacturers Association |pages=15–56 |year=2020 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;BSIJapan16&amp;quot;&amp;gt;{{cite web |url=https://www.bsigroup.com/meddev/LocalFiles/en-US/Brochures/bsi-md-japan-brochure.pdf |format=PDF |title=Japan Pharmaceutical and Medical Device Act |publisher=BSI Group America, Inc |date=November 2016 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The act has multiple chapters addressing the manufacturing of drugs, &amp;quot;quasi-drugs,&amp;quot; cosmetics, cellular and tissue-based products, medical devices, and ''in vitro'' diagnostic devices, as well as separate chapters on the safety of and standards applicable to drugs.&amp;lt;ref name=&amp;quot;JPMAPharm20&amp;quot; /&amp;gt; Manufacturers of medical devices should turn to several sets of standards published by Japan's Ministry of Health, Labour and Welfare (MHLW)—MO No. 169 – ''Standards for Manufacturing Control and Quality Control for Medical Devices and In Vitro Diagnostic Reagents'' and MO No. 135 – ''Standards for Good Vigilance Practice (GVP)''—in order to better comply with the PMD Act.&amp;lt;ref name=&amp;quot;BSIJapan16&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====2.2.4 Other industries and regulations====&lt;br /&gt;
[[File:Logo gmp RGB red.png|left|220px]]As with standards, manufacturing regulations exist beyond the food and beverage, materials, and pharmaceutical industries. From electronics and automotive parts to cosmetics and chemicals, governments around the world place restrictions on and make recommendations for the testing of a variety of manufactured goods. What follows are a few examples of regulations on other industries from various parts of the world.&lt;br /&gt;
&lt;br /&gt;
'''2.2.4.1 Good manufacturing practice (GMP) and current good manufacturing practice (cGMP) - ''United States and other countries'''''&lt;br /&gt;
&lt;br /&gt;
As a broad concept, [[good manufacturing practice]] or GMP is an organized set of standards and guidelines that allows manufacturers of nearly any product to better ensure their products are produced and packaged to a consistent level of quality. GMP tends to cover nearly every step of production, from planning recipes and choosing starting materials to training personnel and documenting processes.&amp;lt;ref name=&amp;quot;ISPEGMP&amp;quot;&amp;gt;{{cite web |url=https://ispe.org/initiatives/regulatory-resources/gmp |title=Good Manufacturing Practice (GMP) Resources |publisher=International Society for Pharmaceutical Engineering, Inc |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The concept of GMP is often spoken of in terms of pharmaceutical and medical device manufacturing&amp;lt;ref name=&amp;quot;PBMKorea18&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;ISPEGMP&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;WHOMedicines15&amp;quot;&amp;gt;{{cite web |url=https://www.who.int/news-room/questions-and-answers/item/medicines-good-manufacturing-processes |title=Medicines: Good manufacturing practices |publisher=World Health Organization |date=20 November 2015 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;, though it is applicable to nearly any other production industry.&amp;lt;ref name=&amp;quot;CEReg07&amp;quot;&amp;gt;{{cite web |url=https://www.controleng.com/articles/regulated-or-not-know-good-manufacturing-practices-gmp/ |title=Regulated or not? Know good manufacturing practices (GMP) |author=''Control Engineering'' Staff |work=Control Engineering |date=14 July 2007 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;FDAGMPCosm22&amp;quot;&amp;gt;{{cite web |url=https://www.fda.gov/cosmetics/cosmetics-guidance-documents/good-manufacturing-practice-gmp-guidelinesinspection-checklist-cosmetics |title=Good Manufacturing Practice (GMP) Guidelines/Inspection Checklist for Cosmetics |publisher=U.S. Food and Drug Administration |date=25 February 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Closely related is the term &amp;quot;current good manufacturing practice&amp;quot; or cGMP. The two terms are largely interchangeable, though the latter is preferred in most U.S. regulatory language. A more nuanced view holds that cGMP represents the implementation of the newest, most up-to-date technologies in order to meet GMP requirements.&amp;lt;ref name=&amp;quot;PSDiff21&amp;quot;&amp;gt;{{cite web |url=https://www.pharmaspecialists.com/2021/10/difference-between-gmp-and-cgmp.html#gsc.tab=0 |title=Difference Between GMP and cGMP |work=Pharma Specialists |date=13 October 2021 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;MoravekTheDiff&amp;quot;&amp;gt;{{cite web |url=https://www.moravek.com/the-differences-between-gmp-and-cgmp/ |title=The Differences Between GMP and cGMP |work=Moravek Blog |publisher=Moravek, Inc |date=January 2021 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the United States, cGMP—in the context of pharmaceuticals—is enshrined in numerous sections of Title 21 of the Code of Federal Regulations, including &amp;quot;in parts 1-99, 200-299, 300-499, 600-799, and 800-1299.&amp;quot;&amp;lt;ref name=&amp;quot;FDAcGMP22&amp;quot;&amp;gt;{{cite web |url=https://www.fda.gov/drugs/pharmaceutical-quality-resources/current-good-manufacturing-practice-cgmp-regulations |title=Current Good Manufacturing Practice (CGMP) Regulations |author=U.S. Food and Drug Administration |date=16 November 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; The FDA describes these regulations as containing &amp;quot;minimum requirements for the methods, facilities, and controls used in manufacturing, processing, and packing of a drug product.&amp;quot;&amp;lt;ref name=&amp;quot;FDAcGMP22&amp;quot; /&amp;gt; Title 21 also states safety, efficacy, and labeling requirements for manufactured drugs. These regulations require careful attention by manufacturers, lest they face seizure of their products, legal injunction, and criminal prosecution.&amp;lt;ref name=&amp;quot;FDAFactsAb21&amp;quot;&amp;gt;{{cite web |url=https://www.fda.gov/drugs/pharmaceutical-quality-resources/facts-about-current-good-manufacturing-practices-cgmps |title=Facts About the Current Good Manufacturing Practices (cGMPs) |author=U.S. Food and Drug Administration |date=01 June 2021 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the context of food, cGMP principles were first introduced in the U.S. in 1969 as 21 CFR Part 110, though the concept of cGMP was modernized in 2015, in 21 CFR Part 117. This led to not only broad food- and beverage-based cGMPs but also cGMPs specific to a type of ingestible, including dietary supplements, infant formula, low-acid canned food, and bottled water.&amp;lt;ref name=&amp;quot;FDACurrentGood20&amp;quot;&amp;gt;{{cite web |url=https://www.fda.gov/food/guidance-regulation-food-and-dietary-supplements/current-good-manufacturing-practices-cgmps-food-and-dietary-supplements |title=Current Good Manufacturing Practices (CGMPs) for Food and Dietary Supplements |publisher=U.S. Food and Drug Administration |date=31 January 2020 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; &lt;br /&gt;
&lt;br /&gt;
GMP and cGMP contexts also exist for other manufacturing industries outside of pharmaceuticals and food, including automotive parts, medical devices, clothing, and more.&amp;lt;ref name=&amp;quot;DomingoTheComp22&amp;quot;&amp;gt;{{cite web |url=https://qvalon.com/blog/the-complete-guide-to-good-manufacturing-practices-gmp-by-qvalon/ |title=The Complete Guide to Good Manufacturing Practices (GMP) by QVALON |author=Domingo, J. |publisher=QVALON Inc |date=28 January 2022 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Additionally, these concepts are not limited to the U.S. For example, the World Health Organization has its own GMP/cGMP guidelines for pharmaceuticals and biological medicines, with more than 100 countries reportedly incorporating those guidelines into their national medicine regulations.&amp;lt;ref name=&amp;quot;WHOGMP18&amp;quot;&amp;gt;{{cite web |url=https://www.who.int/teams/health-product-policy-and-standards/standards-and-specifications/gmp |title=Health products policy and standards - Good Manufacturing Practices |publisher=World Health Organization |date=28 September 2018 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.3.4.2 Registration, Evaluation, Authorization, and Restriction of Chemicals (REACH) Regulation - ''European Union'''''&lt;br /&gt;
&lt;br /&gt;
For the E.U.'s chemical manufacturers and the other manufacturing industries that depend on those chemicals, REACH represents one of the most expansive and complicated pieces of regulation in the E.U. Affecting potentially more than 140,000 substances, the regulation makes manufacturers of those substances largely responsible for identifying hazards, identifying mitigation methods for those hazards, and managing associated risks. Of particular note are &amp;quot;substances of very high concern&amp;quot; (SVHCs) and how they are authorized and restricted. The regulation means manufacturers must pay closer attention to what substances go into their products and report on SVHCs making up more than 0.1 percent (by weight) of the product. It may also mean a manufacturer needs to redesign its products with substitute substances.&amp;lt;ref name=&amp;quot;ECIAREACH21&amp;quot;&amp;gt;{{cite web |url=https://www.ecianow.org/reach |title=REACH (Registration, Evaluation, Authorization, and Restriction of Chemicals) |publisher=Electronic Components Industry Association |date=2021 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
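The 0.1 percent (by weight) SVHC reporting threshold amounts to a simple weight-fraction check. The following Python sketch illustrates the arithmetic only; the function name, substance names, and weights are hypothetical examples, and a real assessment would follow ECHA guidance and the current candidate list.

```python
# Illustrative sketch: flag hypothetical SVHCs that exceed REACH's
# 0.1% weight-by-weight reporting threshold in a finished product.
# Substance names and weights below are invented examples.

def svhc_over_threshold(components, product_weight_g, threshold_pct=0.1):
    """Return substances whose weight share exceeds threshold_pct."""
    flagged = []
    for name, weight_g in components.items():
        share = 100.0 * weight_g / product_weight_g  # percent by weight
        if share > threshold_pct:
            flagged.append((name, round(share, 3)))
    return flagged

product_weight_g = 500.0
candidate_svhcs = {"substance A": 0.8, "substance B": 0.3}  # grams

print(svhc_over_threshold(candidate_svhcs, product_weight_g))
```

With these made-up figures, a 500 g product containing 0.8 g of a candidate substance sits at 0.16 percent by weight and would be flagged, while one at 0.06 percent would not.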
&lt;br /&gt;
'''2.3.4.3 Resolução de diretoria colegiada - RDC nº 529 - ''Brazil'''''&lt;br /&gt;
&lt;br /&gt;
This Brazilian resolution, effective August 2021, revised the country's List of Prohibited Substances in Personal Hygiene, Cosmetics and Perfume Products to include additional substances. While it's not obvious which original legislation houses that list, the resolution makes clear that manufacturers of personal hygiene, cosmetic, and perfume products must be mindful of the ingredients they use in their manufacturing processes.&amp;lt;ref name=&amp;quot;HavensBrazil21&amp;quot;&amp;gt;{{cite web |url=https://www.ul.com/news/brazil-revises-list-prohibited-substances-cosmetics |title=Brazil Revises List of Prohibited Substances in Cosmetics |author=Havens, R. |work=UL Solutions |date=02 September 2021 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''2.3.4.4 Restriction of Hazardous Substances in Electrical and Electronic Equipment (RoHS) Directive - ''European Union'''''&lt;br /&gt;
&lt;br /&gt;
The E.U. uses the RoHS Directive to restrict the use of specific hazardous materials in the manufacture of electrical and electronic equipment (EEE), namely &amp;quot;lead, cadmium, mercury, hexavalent chromium, polybrominated biphenyls (PBB) and polybrominated diphenyl ethers (PBDE), bis(2-ethylhexyl) phthalate (DEHP), butyl benzyl phthalate (BBP), dibutyl phthalate (DBP) and diisobutyl phthalate (DIBP).&amp;quot;&amp;lt;ref name=&amp;quot;ECRestrict&amp;quot;&amp;gt;{{cite web |url=https://environment.ec.europa.eu/topics/waste-and-recycling/rohs-directive_en |title=Restriction of Hazardous Substances in Electrical and Electronic Equipment (RoHS) |publisher=European Commission |date=2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; By extension, the E.U. is encouraging safer alternatives for the manufacturing of EEE and fewer hazardous substances making their way into ecosystems from waste streams.&lt;br /&gt;
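As a rough illustration of how such restrictions translate into a screening check, the Python sketch below assumes the directive's commonly cited maximum concentration values (0.1 percent by weight in homogeneous material, and 0.01 percent for cadmium); the measured values are invented, and a real compliance assessment would rely on accredited testing against the directive itself.

```python
# Illustrative RoHS screening check. Limits assume the commonly cited
# maximum concentrations by weight in homogeneous material: 0.01% for
# cadmium and 0.1% for the other listed substances. The sample data
# below is a hypothetical example, not real test results.

ROHS_LIMITS_PCT = {
    "lead": 0.1, "mercury": 0.1, "cadmium": 0.01,
    "hexavalent chromium": 0.1, "PBB": 0.1, "PBDE": 0.1,
    "DEHP": 0.1, "BBP": 0.1, "DBP": 0.1, "DIBP": 0.1,
}

def rohs_violations(measured_pct):
    """Return substances whose measured concentration exceeds its limit."""
    return sorted(
        name for name, pct in measured_pct.items()
        if pct > ROHS_LIMITS_PCT.get(name, 0.1)
    )

sample = {"lead": 0.04, "cadmium": 0.02}  # percent by weight
print(rohs_violations(sample))  # cadmium exceeds its 0.01% limit
```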
&lt;br /&gt;
'''2.3.4.5  Road Vehicle Standards (RVS) legislation - ''Australia'''''&lt;br /&gt;
&lt;br /&gt;
Coming into effect in July 2021 as a replacement for the Motor Vehicle Standards Act 1989, the RVS legislation brought regulation of road vehicle manufacturing into the twenty-first century, with updates to recalls, model reports, testing, and component type approvals. These regulations are backed by the Australian Design Rules (ADRs), which are national standards for vehicle safety.&amp;lt;ref name=&amp;quot;DITRDCARoad23&amp;quot;&amp;gt;{{cite web |url=https://www.infrastructure.gov.au/infrastructure-transport-vehicles/vehicles/road-vehicle-standards-laws |title=Road Vehicle Standards laws |author=Department of Infrastructure, Transport, Regional Development, Communications and the Arts |publisher=Australian Government |date=2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;DITRDCAusDes23&amp;quot;&amp;gt;{{cite web |url=https://www.infrastructure.gov.au/infrastructure-transport-vehicles/vehicles/vehicle-design-regulation/australian-design-rules |title=Australian Design Rules |author=Department of Infrastructure, Transport, Regional Development, Communications and the Arts |publisher=Australian Government |date=2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt; Approved testing facilities are required to verify that vehicles and their components comply with ADRs and other relevant standards.&amp;lt;ref name=&amp;quot;DITRDCTesting23&amp;quot;&amp;gt;{{cite web |url=https://www.infrastructure.gov.au/infrastructure-transport-vehicles/vehicles/rvs/testing-facilities |title=Testing facilities |author=Department of Infrastructure, Transport, Regional Development, Communications and the Arts |publisher=Australian Government |date=2023 |accessdate=05 May 2023}}&amp;lt;/ref&amp;gt;&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Template:COVID-19_Testing,_Reporting,_and_Information_Management_in_the_Laboratory/Final_thoughts_and_additional_resources/Public_health_laboratory_informatics_vendors&amp;diff=64500</id>
		<title>Template:COVID-19 Testing, Reporting, and Information Management in the Laboratory/Final thoughts and additional resources/Public health laboratory informatics vendors</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Template:COVID-19_Testing,_Reporting,_and_Information_Management_in_the_Laboratory/Final_thoughts_and_additional_resources/Public_health_laboratory_informatics_vendors&amp;diff=64500"/>
		<updated>2024-06-20T19:24:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* 5.7 Public health laboratory informatics vendors */ Fixed broken link from Vendor moves&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===5.7 Public health laboratory informatics vendors===&lt;br /&gt;
This is not a complete list but rather a representative sampling of vendors who explicitly discuss how their laboratory informatics solution helps public health laboratories.&lt;br /&gt;
&lt;br /&gt;
* [[Vendor:Abbott Informatics Corporation|Abbott Informatics Corporation]]&lt;br /&gt;
* [[Vendor:BGASoft, Inc.|BGASoft, Inc.]]&lt;br /&gt;
* [[Vendor:CliniSys Group Limited|CliniSys Group Limited]]&lt;br /&gt;
* [[Vendor:Common Cents Systems, Inc.|Common Cents Systems, Inc.]]&lt;br /&gt;
* [[Vendor:Deutsche Telekom Healthcare Solutions Netherlands B.V.|Deutsche Telekom Healthcare]]&lt;br /&gt;
* [[Vendor:Eusoft Srl|Eusoft Srl]]&lt;br /&gt;
* [[Vendor:LabLynx, Inc.|LabLynx, Inc.]]&lt;br /&gt;
* [[Vendor:LabWare, Inc.|LabWare, Inc.]]&lt;br /&gt;
* [[Vendor:Orchard Software Corporation|Orchard Software Corporation]]&lt;br /&gt;
* [[Vendor:Polisystem Informatica Srl|Polisystem Informatica Srl]]&lt;br /&gt;
* [[Vendor:Promium, LLC|Promium, LLC]]&lt;br /&gt;
* [[Vendor:Sunquest Information Systems, Inc.|Sunquest Information Systems, Inc.]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Template:COVID-19_Testing,_Reporting,_and_Information_Management_in_the_Laboratory/Final_thoughts_and_additional_resources/Final_thoughts&amp;diff=64499</id>
		<title>Template:COVID-19 Testing, Reporting, and Information Management in the Laboratory/Final thoughts and additional resources/Final thoughts</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Template:COVID-19_Testing,_Reporting,_and_Information_Management_in_the_Laboratory/Final_thoughts_and_additional_resources/Final_thoughts&amp;diff=64499"/>
		<updated>2024-06-20T19:22:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Fixed broken file link.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==5. Final thoughts and additional resources==&lt;br /&gt;
===5.1 Final thoughts===&lt;br /&gt;
[[File:COVID-21 Palmerston North posters MRD.jpg|right|280px]]Since it began, the [[COVID-19]] [[pandemic]] has brought with it numerous challenges for society to face. How prepared are state and national governments to truly assist their citizens in the face of a crisis? How does the widening divide between the &amp;quot;haves&amp;quot; and &amp;quot;have nots,&amp;quot; and the economic structures that underpin it, reveal the fragility of our society? What more can be done to fund epidemiology research? How can we improve our healthcare system to be better equipped to handle communicable disease response and better funded to provide more social services to a broader base of people? And what lessons can be learned from the successes and failures of providing accurate, responsive laboratory testing during pandemics?&lt;br /&gt;
&lt;br /&gt;
We've learned that the family of [[coronavirus]]es can be disruptive to humanity, having had past brushes with [[SARS]] and [[MERS]], yet we arguably [https://www.newscientist.com/article/mg24532724-700-we-were-warned-so-why-couldnt-we-prevent-the-coronavirus-outbreak/ haven't done enough] to research these and similar viruses to be more prepared. We were perhaps [https://doi.org/10.1098/rstb.2004.1487 fortunate in some ways] that SARS wasn't worse than it proved to be. However, responses by the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and other organizations and agencies around the world during the SARS and MERS outbreaks laid the foundations for [[laboratory]] testing a novel coronavirus like [[SARS-CoV-2]]. [[Reverse transcription polymerase chain reaction|Reverse transcription PCR]] (RT-PCR) is again proving to be a useful diagnostic tool for identifying the virus in patient specimens. Other methods such as [[Lateral flow test|lateral flow assays]] (LFA) borrow from more rapid methods of identification, and other more rapid methods of testing such as antigen testing and reverse transcription [[loop-mediated isothermal amplification]] (RT-LAMP) lend additional support to testing. And while confusing—particularly given the unknowns surrounding the predictive ability of antibodies conferring immunity—serology antibody tests appear to have their place as well. &lt;br /&gt;
&lt;br /&gt;
These and related tests can be complex, as evidenced by the CLIA approval status of a strong majority of emergency use authorized (EUA) test kits. Performing these tests on complex instruments and then effectively using the data they provide require clear [[workflow]]s that can be at least partially automated. This is particularly vital given the paltry 13 percent of CLIA-certified U.S. labs that are certified to perform moderate- and high-complexity testing. Additionally, given the value of test result data to government agencies, [[Epidemiology|epidemiological]] researchers, and patients, it's important that reporting is clear, timely, and moderated. Laboratory informatics systems such as [[laboratory information management system]]s (LIMS) and [[laboratory information system]]s (LIS) can go a long way towards ensuring that laboratory testing and reporting of communicable diseases go smoothly.&lt;br /&gt;
&lt;br /&gt;
Choosing just any [[Informatics (academic field)|informatics]] system and implementing it haphazardly in the laboratory doesn't automatically ensure improvements, however. Many elements of the system should be carefully considered. Does the system have a provider portal that is flexible in its ability to handle providers from many different healthcare facility types entering test orders and reviewing results? How well does it address the workflow of COVID-19 and other types of respiratory illness testing? Does it interface with the instruments you're using to test such illnesses, and at a reasonable cost? How well does it handle internal and external reporting requirements, as well as any data visualization and dashboarding you require? During outbreaks and pandemics, the system should improve your laboratory workflow, not slow you down. This includes the element of reporting, which is not only critical but also challenging even in relatively peaceful times of health. And how interoperable is the system with other clinical systems such as [[electronic health record]]s (EHR) and [[radiology information system]]s (RIS)? As we found out, academic and research laboratories wanting to assist with testing have at times been locked out due to their informatics system not interfacing cleanly with a hospital EHR.&lt;br /&gt;
&lt;br /&gt;
Hopefully this guide has provided important background in several areas, from COVID-19's historical impact and challenging health issues, to the current state of laboratory testing, reporting, and informatics applications being applied to fight its spread. As noted in the beginning, this pandemic and how humanity is dealing with it is rapidly changing us, as we try to keep up with ways to fend it off. That means information changes rapidly. An effort will be made to update this content as new information comes to light. In the meantime, stay safe and consider your informatics solutions with care.&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Comprehensive_Guide_to_Developing_and_Implementing_a_Cybersecurity_Plan&amp;diff=64498</id>
		<title>LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Comprehensive_Guide_to_Developing_and_Implementing_a_Cybersecurity_Plan&amp;diff=64498"/>
		<updated>2024-06-20T18:36:48Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Fixed link to LIMSpec&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Innovation &amp;amp; Research Symposium Cisco and Ecole Polytechnique 9-10 April 2018 Artificial Intelligence &amp;amp; Cybersecurity (40631791164).jpg|right|370px]]&lt;br /&gt;
'''Title''': ''Comprehensive Guide to Developing and Implementing a Cybersecurity Plan''&lt;br /&gt;
&lt;br /&gt;
'''Edition''': Second&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': March 2023&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Look across the internet and you will find a wealth of [[information]] about [[cybersecurity]] and the cybersecurity plan. However, much of that information is either disparate or, if comprehensive, difficult to access or expensive to acquire. In particular, a walk-through of the various steps involved with how an organization or individual develops, enforces, and maintains a cybersecurity plan is difficult to come by. This guide attempts to fill that gap, including not only a 10-step walk-through but also insight into regulations, standards, and cybersecurity standards frameworks, as well as how they all fit together with cybersecurity planning. Additionally, this document provides access to ''[[:File:An Example Cybersecurity Plan - Shawn Douglas - v1.1.pdf|An Example Cybersecurity Plan]]'', a companion document that provides a representative example of the 10-step walk-through put to use. This guide also includes a slightly simplified version of many of the security controls found in the National Institute of Standards and Technology's (NIST) Special Publication 800-53, Rev. 5, with additional resources to provide context, and mappings to [[Book:LIMSpec 2022 R2|LIMSpec]], an evolving set of specifications for laboratory informatics solutions and their development. The guide attempts to be helpful to most any organization attempting to navigate the challenges of cybersecurity planning, with a slight bias towards [[Laboratory|laboratories]] implementing and updating information systems.&lt;br /&gt;
&lt;br /&gt;
The second edition updates citations and statistics, as well as grammar. The first edition was released months prior to the NIST 800-53 update from Rev. 4 to 5; this edition is updated throughout to address the changes in that framework to Rev. 5, including Appendix 1.&lt;br /&gt;
&lt;br /&gt;
The table of contents for ''Comprehensive Guide to Developing and Implementing a Cybersecurity Plan'' is as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
1. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/What is a cybersecurity plan and why do you need it?|What is a cybersecurity plan and why do you need it?]]&lt;br /&gt;
&lt;br /&gt;
:1.1 Cybersecurity planning and its value&lt;br /&gt;
&lt;br /&gt;
2. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/What are the major regulations and standards dictating cybersecurity action?|What are the major regulations and standards dictating cybersecurity action?]]&lt;br /&gt;
&lt;br /&gt;
:2.1 Cybersecurity standards frameworks&lt;br /&gt;
&lt;br /&gt;
3. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/Fitting a cybersecurity standards framework into a cybersecurity plan|Fitting a cybersecurity standards framework into a cybersecurity plan]]&lt;br /&gt;
&lt;br /&gt;
:3.1 How do cybersecurity controls and frameworks guide plan development?&lt;br /&gt;
&lt;br /&gt;
4. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/NIST Special Publication 800-53, Revision 5 and the NIST Cybersecurity Framework|NIST Special Publication 800-53, Revision 5 and the NIST Cybersecurity Framework]]&lt;br /&gt;
&lt;br /&gt;
:4.1 NIST Cybersecurity Framework&lt;br /&gt;
&lt;br /&gt;
5. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/Develop and create the cybersecurity plan|Develop and create the cybersecurity plan]]&lt;br /&gt;
&lt;br /&gt;
:5.1 Develop strategic cybersecurity goals and define success&lt;br /&gt;
:5.2 Define scope and responsibilities&lt;br /&gt;
:5.3 Identify cybersecurity requirements and objectives&lt;br /&gt;
:5.4 Establish performance indicators and associated time frames&lt;br /&gt;
:5.5 Identify key stakeholders&lt;br /&gt;
:5.6 Determine resource needs&lt;br /&gt;
:5.7 Develop a communications plan&lt;br /&gt;
:5.8 Develop a response and continuity plan&lt;br /&gt;
:5.9 Establish how the overall cybersecurity plan will be implemented&lt;br /&gt;
:5.10 Review progress&lt;br /&gt;
&lt;br /&gt;
6. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/Closing remarks|Closing remarks]]&lt;br /&gt;
&lt;br /&gt;
:6.1 Recap and closing&lt;br /&gt;
&lt;br /&gt;
Appendix 1. [[LII:Comprehensive Guide to Developing and Implementing a Cybersecurity Plan/A simplified description of NIST Special Publication 800-53 controls, with ties to LIMSpec|A simplified description of NIST Special Publication 800-53 controls, with ties to LIMSpec]]&lt;br /&gt;
&lt;br /&gt;
:Appendix 1.1 Access control&lt;br /&gt;
:Appendix 1.2 Awareness and training&lt;br /&gt;
:Appendix 1.3 Audit and accountability&lt;br /&gt;
:Appendix 1.4 Assessment, authorization, and monitoring&lt;br /&gt;
:Appendix 1.5 Configuration management&lt;br /&gt;
:Appendix 1.6 Contingency planning&lt;br /&gt;
:Appendix 1.7 Identification and authentication&lt;br /&gt;
:Appendix 1.8 Incident response&lt;br /&gt;
:Appendix 1.9 Maintenance&lt;br /&gt;
:Appendix 1.10 Media protection&lt;br /&gt;
:Appendix 1.11 Physical and environmental protection&lt;br /&gt;
:Appendix 1.12 Planning&lt;br /&gt;
:Appendix 1.13 Program management&lt;br /&gt;
:Appendix 1.14 Personnel security&lt;br /&gt;
:Appendix 1.15 Personally identifiable information processing and transparency&lt;br /&gt;
:Appendix 1.16 Risk assessment&lt;br /&gt;
:Appendix 1.17 System and services acquisition&lt;br /&gt;
:Appendix 1.18 System and communications protection&lt;br /&gt;
:Appendix 1.19 System and information integrity&lt;br /&gt;
:Appendix 1.20 Supply chain risk management&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Considerations_in_the_Automation_of_Laboratory_Procedures&amp;diff=64497</id>
		<title>LII:Considerations in the Automation of Laboratory Procedures</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Considerations_in_the_Automation_of_Laboratory_Procedures&amp;diff=64497"/>
		<updated>2024-06-20T03:07:04Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* How does this discussion relate to previous work? */ Fixed broken URL&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title''': ''Considerations in the Automation of Laboratory Procedures''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Joe Liscouski, with editorial modifications by Shawn Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': January 2021&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
Scientists have been dealing with the issue of [[laboratory automation]] for decades, and during that time the meaning of those words has expanded from the basics of connecting an instrument to a computer, to the possibility of a fully integrated [[Informatics (academic field)|informatics]] infrastructure beginning with [[Sample (material)|sample]] preparation and continuing on to the [[laboratory information management system]] (LIMS), [[electronic laboratory notebook]] (ELN), and beyond. Throughout this evolution there has been one underlying concern: how do we go about doing this?&lt;br /&gt;
&lt;br /&gt;
The answer to that question has changed from a focus on hardware and programming, to today’s need for a lab-wide informatics strategy. We’ve moved from the bits and bytes of assembly language programming to managing terabytes of files and data structures.&lt;br /&gt;
&lt;br /&gt;
The high-end of the problem—the large informatics database systems—has received significant industry-wide attention in the last decade. The stuff on the lab bench, while the target of a lot of individual products, has been less organized and more experimental. Failed or incompletely met promises have to yield to planned successes. How we do it needs to change. This document is about the considerations required when making that change. The haphazard &amp;quot;let's try this&amp;quot; method has to give way to more engineered solutions and a realistic appraisal of the human issues, as well as the underlying technology management and planning.&lt;br /&gt;
&lt;br /&gt;
Why is this important? Whether you are conducting intense laboratory experiments to produce data and [[information]] or making chocolate chip cookies in the kitchen, two things remain important: productivity and the quality of the products. In either case, if the productivity isn’t high enough, you won’t be able to justify your work; if the quality isn’t there, no one will want what you produce. Conducting laboratory work and making cookies have a lot in common. Your laboratories exist to answer questions. What happens if I do this? What is the purity of this material? What is the structure of this compound? The range of laboratories asking these questions is extensive, basically covering the entire array of lab bench and scientific work, including chemistry, life sciences, physics, and electronics labs. The more efficiently we answer those questions, the more likely it will be that these labs will continue operating and that you’ll achieve the goals your organization has set. At some point, it comes down to performance against goals and the return on the investment organizations make in lab operations.&lt;br /&gt;
&lt;br /&gt;
In addition to product quality and productivity, there are a number of other points that favor automation over manual implementations of lab processes. They include:&lt;br /&gt;
&lt;br /&gt;
* lower costs per test;&lt;br /&gt;
* better control over expenditures;&lt;br /&gt;
* a stronger basis for better [[workflow]] planning;&lt;br /&gt;
* reproducibility;&lt;br /&gt;
* predictability; and&lt;br /&gt;
* tighter adherence to procedures, i.e., consistency.&lt;br /&gt;
&lt;br /&gt;
Lists similar to the one above appear in justifications for lab automation (and cookie production) without further comment; it’s just assumed that everyone agrees and that the reasoning is obvious. Since we are going to use those items to justify the cost and effort that goes into automation, we should take a closer look at them.&lt;br /&gt;
&lt;br /&gt;
Let’s begin with reproducibility, predictability, and consistency: closely related concerns that reflect automation’s ability to produce the same product with the desired characteristics over and over. For data and information, that means the same analysis on the same materials will yield the same results, that all the steps are documented, and that the process is under control. The variability that creeps into the execution of a process by people is eliminated. That variability in human labor can result from the quality of training, equipment setup and calibration, readings from analog devices (e.g., meters, pipette meniscus, charts, etc.), and a long list of other potential issues.&lt;br /&gt;
&lt;br /&gt;
Concerns with reproducibility, predictability, and consistency are common to production environments, general lab work, manufacturing, and even food service. There are several pizza restaurants in our area using one of two methods of making the pies. Both start the preparation the same way, spreading dough and adding cheese and toppings, but the differences are in how they are cooked. One method uses standard ovens (e.g., gas, wood, or electric heating); the pizza goes in, the cook watches it, and then removes it when the cooking is completed. This leads to a lot of variability in the product: some a function of the cook’s attention, some depending on requests for over- or under-cooking the crust, and some based on &amp;quot;have it your way&amp;quot; customization. The second method uses a metal conveyor belt to move the pie through an oven. The oven temperature is set, as is the speed of the belt, and as long as the settings are the same, you get a reproducible, consistent product order after order. It’s a matter of priorities: manual versus automated, consistent product quality versus how the cook feels that day. In the end, reducing variability and being able to demonstrate consistent, accurate results gives people confidence in your product.&lt;br /&gt;
&lt;br /&gt;
Automation also yields lower costs per test, better control over expenditures, and better workflow planning. Automated processes are more cost-efficient since sample throughput is higher and labor costs are reduced. The cost per test and material usage are predictable since variability in the components used in testing is reduced or eliminated, and workflow planning improves because the time per test is known and work can be better scheduled. Additionally, process scale-up should be easier if there is high demand for particular procedures. However, there is a lot of work to consider before automation is realizable, and that is where this discussion is headed.&lt;br /&gt;
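The cost-per-test point can be made concrete with back-of-the-envelope arithmetic. All figures in this Python sketch are invented for illustration; the model simply spreads hourly labor across throughput and adds per-test consumables.

```python
# Back-of-the-envelope comparison of manual vs. automated cost per test.
# All figures are hypothetical, chosen only to illustrate the arithmetic.

def cost_per_test(labor_rate_hr, tests_per_hr, consumables_per_test):
    """Spread hourly labor across throughput, then add consumables."""
    return labor_rate_hr / tests_per_hr + consumables_per_test

manual = cost_per_test(labor_rate_hr=40.0, tests_per_hr=4,
                       consumables_per_test=2.50)
automated = cost_per_test(labor_rate_hr=40.0, tests_per_hr=20,
                          consumables_per_test=2.00)

print(round(manual, 2), round(automated, 2))
```

With these made-up numbers, the fivefold throughput gain drops the labor share of the cost per test from 10.00 to 2.00 per test, even before any consumable savings.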
&lt;br /&gt;
==How does this discussion relate to previous work?==&lt;br /&gt;
This work follows on the heels of two previous works:&lt;br /&gt;
&lt;br /&gt;
* ''[https://www.pda.org/bookstore/product-detail/4297-computerized-systems-in-modern-lab Computerized Systems in the Modern Laboratory: A Practical Guide]'' (2015): This book presents the range of informatics technologies, their relationship to each other, and the role they play in laboratory work. It differentiates a LIMS from an ELN and [[scientific data management system]] (SDMS) for example, contrasting their use and how they would function in different lab working environments. In addition, it covers topics such as support and regulatory issues.&lt;br /&gt;
&lt;br /&gt;
* ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'' (2018): This webinar series complements the above text. It begins by introducing the major topics in informatics (e.g., LIMS, ELN, etc.) and then discusses their use from a strategic viewpoint. Where and how do you start planning? What is your return on investment? What should get implemented first, and then what are my options? The series then moves on to developing an [[information management]] strategy for the lab, taking into account budgets, support, ease of implementation, and the nature of your lab’s work.&lt;br /&gt;
&lt;br /&gt;
The material in this write-up picks up where the webinar series ends. The final session covers lab processes, and this work continues that thread, going into more depth on a basic issue: how do you move from manual methods to automated systems?&lt;br /&gt;
&lt;br /&gt;
Productivity has always been an issue in laboratory work. Until the 1950s, a lab had little choice but to add more people if more work needed to be done. Since then, new technologies have afforded wider options, including new instrument technologies. The execution of the work was still done by people, but the tools were better. Now we have other options. We just have to figure out when, if, and how to use them.&lt;br /&gt;
&lt;br /&gt;
===Before we get too far into this...===&lt;br /&gt;
With elements such as productivity, return on investment (ROI), [[data quality]], and [[data integrity]] as driving factors in this work, you shouldn’t be surprised if a lot of the material reads like a discussion of manufacturing methodologies; we’ve already seen some examples. We are talking about scientific work, but the factors that drive those elements in labs have very close parallels in product manufacturing. The work we are describing here will be referenced as &amp;quot;scientific manufacturing&amp;quot;: manufacturing or production in support of scientific programs.{{efn|The term &amp;quot;scientific manufacturing&amp;quot; was first mentioned to the author by Mr. Alberto Correia, then of Cambridge Biomedical, Boston, MA.}}&lt;br /&gt;
&lt;br /&gt;
The key points of a productivity conversation in both lab and material production environments are almost exact overlays; the only significant difference is that the results of the efforts are data and information in one case, and a physical item you might sell in the other. Product quality and integrity are valued considerations in both. For scientists, this may require an adjustment to their perspectives when dealing with automation. On the plus side, the lessons learned in product manufacturing can be applied to lab bench work, making the path to implementation a bit easier while providing a framework for understanding what a successful automation effort looks like. People with backgrounds in product manufacturing can be a useful resource in the lab, with a bit of an adjustment in perspective on their part.&lt;br /&gt;
&lt;br /&gt;
==Transitioning from typical lab operations to automated systems==&lt;br /&gt;
Transitioning a lab from its current state of operations to one that incorporates automation can raise a number of questions, as well as people’s anxiety levels. Several questions should be considered to set expectations for automated systems, how they will impact jobs, and how new technologies will be introduced. They include:&lt;br /&gt;
&lt;br /&gt;
* What will happen to people’s jobs as a result of automation?&lt;br /&gt;
* What is the role of [[artificial intelligence]] (AI) and [[machine learning]] (ML) in automation?&lt;br /&gt;
* Where do we find the resources to carry out automation projects/programs?&lt;br /&gt;
* What equipment would we need for automated processes, and will it be different from what we currently have?&lt;br /&gt;
* What role does a [[laboratory execution system]] (LES) play in laboratory automation?&lt;br /&gt;
* How do we go about planning for automation?&lt;br /&gt;
&lt;br /&gt;
===What will happen to people’s jobs as a result of automation?===&lt;br /&gt;
Stories are appearing in print, online, and in television news reporting about the potential for automation to replace human effort in the labor force. It is often presented as an all-or-none situation: either people will continue working in their occupations, or automation (e.g., mechanical, software, AI, etc.) will replace them. The storyline is that people are expensive and automated work can be less costly in the long run. If commercial manufacturing is a guide, automation is a preferred option from both a productivity and an ROI perspective. To achieve productivity gains from automation similar to those seen in commercial manufacturing, some basic requirements and conditions have to be met:&lt;br /&gt;
&lt;br /&gt;
* The process has to be well documented and understood, down to the execution of each step without variation, while error detection and recovery have to be designed in.&lt;br /&gt;
* The process has to remain static and be expected to continue over enough execution cycles to make it economically attractive to design, build, and maintain.&lt;br /&gt;
* Automation-compatible equipment has to be available. Custom-built components are going to be expensive and could represent a barrier to successful implementation.&lt;br /&gt;
* There has to be a driving need to justify the cost of automation; economics, the volume of work that has to be addressed, working with hazardous materials, and lack of educated workers are just a few of the factors that would need to be considered.&lt;br /&gt;
&lt;br /&gt;
There are places in laboratory work where production-scale automation has been successfully implemented; life sciences applications for processes based on microplate technologies are one example. When we look at the broad scope of lab work across disciplines, most lab processes don’t lend themselves to that level of automation, at least not yet. We’ll get into this in more detail later. But that brings us back to the starting point: what happens to people's jobs?&lt;br /&gt;
&lt;br /&gt;
In the early stages of manufacturing automation, as well as in fields such as mining where work was labor-intensive and repetitive, people did lose jobs when new methods of production were introduced. That shift from a human workforce to automated task execution is expanding as system designers probe markets from retail to transportation.&amp;lt;ref name=&amp;quot;FreyTheFuture13&amp;quot;&amp;gt;{{cite web |url=https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf |format=PDF |title=The Future of Employment: How Susceptible Are Jobs to Computerisation? |author=Frey, C.B.; Osborne, M.A. |publisher=Oxford Martin School, University of Oxford |date=17 September 2013 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; Lower-skilled occupations gave way first, and we find ourselves facing automation efforts that are moving up the skills ladder; the most recent example is automated driving, a technology that has yet to be fully embraced but is moving in that direction. That leaves us with the problem of providing displaced workers with a means of employment that gives them at least a living income, along with the purpose, dignity, and self-worth they’d like to have. This is going to require significant education, and people are going to have to come to grips with the realization that education never stops.&lt;br /&gt;
&lt;br /&gt;
Due to the push for increased productivity, lab work has seen some similar developments in automation. The development of automated pipettes, titration stations, auto-injectors, computer-assisted instrumentation, and automation built to support microplate technologies represents just a few places where specific tasks have been addressed. However, these developments haven’t moved people out of the workplace as has happened in manufacturing, mining, and similar industries. In some cases they’ve changed the work, replacing repetitive, time-consuming tasks with equipment that frees lab personnel to take on different tasks. In other cases the technology addresses work that couldn’t be performed cost-effectively with human effort; without automation, that work might simply not be feasible, whether due to the volume of work (whose delivery might be limited by the availability of the right people, equipment, and facilities) or the need to work with hazardous materials. Automation may avoid the need to hire new people while giving those currently working more challenging tasks.&lt;br /&gt;
&lt;br /&gt;
As noted in the previous paragraph, much of the automation in lab work is at the task level: equipment designed to carry out a specific function, such as Karl Fischer titrations. Some equipment designed around microplate formats can function both at the task level and as part of a user-integrated robotics system. This gives the planner useful options for introducing automation, making it easier for personnel to get accustomed to automation before moving into scientific manufacturing.&lt;br /&gt;
&lt;br /&gt;
Overall, laboratory people shouldn’t be losing their jobs as a result of lab automation, but they do have to be open to changes in their jobs, and that could require an investment in their education. Take someone whose current job is to carry out a lab procedure: someone who understands all aspects of the work, including troubleshooting equipment, reagents, and any special problems that may crop up. Someone else may have developed the procedure, but that person is the expert in its execution.&lt;br /&gt;
&lt;br /&gt;
First, you need these experts to help plan and test the automated systems if you decide to undertake such a project. They are also the best people to educate as automated systems managers; they know how the process is supposed to work and should be in a position to detect problems. If the system crashes, you’ll need someone who can cover the work while problems are being addressed. Second, if lab personnel get the idea that they are watching their replacement being installed, they may leave before the automated systems are ready. In the event of a delay, you’ll have a backlog and no one to handle it.&lt;br /&gt;
&lt;br /&gt;
Beyond that, people will be freed from the routine of carrying out processes and will be able to take on work that had been put on a back burner. As we move toward automated systems, jobs will expand to accommodate typical lab work, as well as the management, planning, maintenance, and evolution of laboratory automation and computing.&lt;br /&gt;
&lt;br /&gt;
Automation in lab work is not an &amp;quot;all or none&amp;quot; situation. Processes can be structured so that the routine work is done by systems, and the analyst can spend time reviewing the results, looking for anomalies and interesting patterns, while being able to make decisions about the need for and nature of follow-on efforts.&lt;br /&gt;
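As a concrete illustration of that division of labor, consider automated screening of results so the analyst reviews only the anomalies. The specification limits, field names, and sample values below are invented for illustration, not drawn from any particular lab system.&lt;br /&gt;

```python
# Hypothetical sketch: flag results outside specification limits so an
# analyst reviews only the anomalies, not every record.

SPEC_LIMITS = {"pH": (6.5, 7.5), "assay_pct": (98.0, 102.0)}  # assumed limits

def flag_anomalies(results):
    """Return (sample_id, test, value) for every value outside its spec range."""
    flagged = []
    for record in results:
        for test, (low, high) in SPEC_LIMITS.items():
            value = record.get(test)
            if value is not None and not (low <= value <= high):
                flagged.append((record["sample_id"], test, value))
    return flagged

batch = [
    {"sample_id": "S-001", "pH": 7.0, "assay_pct": 99.4},
    {"sample_id": "S-002", "pH": 8.1, "assay_pct": 100.2},  # pH out of range
]
print(flag_anomalies(batch))  # [('S-002', 'pH', 8.1)]
```

The routine check runs without human effort; the analyst's time goes to the flagged entries and to deciding what follow-up work is needed.&lt;br /&gt;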
&lt;br /&gt;
===What is the role of AI and ML in automation?===&lt;br /&gt;
When we discuss automation here, we are referencing basic robotics and programming. AI may, and likely will, play a role in the work, but we have to get the foundations right before we consider the next step; we need to put in the human intelligence first. Part of the issue with AI is that we don’t really know what it is.&lt;br /&gt;
&lt;br /&gt;
Science fiction aside, many of today's applications of AI have a limited role in lab work. Here are some examples:&lt;br /&gt;
&lt;br /&gt;
* Having a system that can bring up all relevant information on a research question—a sort of super Google—or a variation of IBM’s Watson could have significant benefits.&lt;br /&gt;
* Analyzing complex data or large volumes of data could be beneficial, e.g., the analysis of radio astronomy data to find fast radio bursts (FRBs). After discovering 21 FRB signals upon analyzing five hours of data, researchers at Green Bank Telescope used AI to analyze 400 terabytes of older data and detected another 100.&amp;lt;ref name=&amp;quot;HsuIsIt18&amp;quot;&amp;gt;{{cite web |url=https://www.nbcnews.com/mach/science/it-aliens-scientists-detect-more-mysterious-radio-signals-distant-galaxy-ncna912586 |title=Is it aliens? Scientists detect more mysterious radio signals from distant galaxy |author=Hsu, J. |work=NBC News MACH |date=24 September 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &amp;quot;[A] team at Glasgow University has paired a machine-learning system with a robot that can run and analyze its own chemical reaction. The result is a system that can figure out every reaction that's possible from a given set of starting materials.&amp;quot;&amp;lt;ref name=&amp;quot;TimmerAIPlus18&amp;quot;&amp;gt;{{cite web |url=https://arstechnica.com/science/2018/07/ai-plus-a-chemistry-robot-finds-all-the-reactions-that-will-work/5/ |title=AI plus a chemistry robot finds all the reactions that will work |author=Timmer, J. |work=Ars Technica |date=18 July 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
* HelixAI is using Amazon's Alexa as a digital assistant for laboratory work.&amp;lt;ref name=&amp;quot;HelixAIHome&amp;quot;&amp;gt;{{cite web |url=http://www.askhelix.io/ |title=HelixAI - Voice Powered Digital Laboratory Assistants for Scientific Laboratories |publisher=HelixAI |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that the points above are research-based applications, not routine production environments where regulatory issues are important. Research applications might be more forgiving of AI systems because the results are evaluated by human intelligence, and problematic results can be subjected to further verification. Data entry systems such as voice entry, however, have to be carefully tested, and the results of that data entry verified and shown to be correct.&lt;br /&gt;
&lt;br /&gt;
Pharma IQ continues to publish material on advanced topics in laboratory informatics, including articles on how labs are benefiting from new technologies&amp;lt;ref name=&amp;quot;PharmaIQNewsAutom18&amp;quot;&amp;gt;{{cite web |url=https://www.pharma-iq.com/pre-clinical-discovery-and-development/news/automation-iot-and-the-future-of-smarter-research-environments |title=Automation, IoT and the future of smarter research environments |author=PharmaIQ News |work=PharmaIQ |date=20 August 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; and survey reports such as ''AI 2020: The Future of Drug Discovery''. In that report they note&amp;lt;ref name=&amp;quot;PharmaIQTheFuture17&amp;quot;&amp;gt;{{cite web |url=https://www.pharma-iq.com/pre-clinical-discovery-and-development/whitepapers/the-future-of-drug-discovery-ai-2020 |title=The Future of Drug Discovery: AI 2020 |author=PharmaIQ |publisher=PharmaIQ |date=14 November 2017 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;94% of pharma professionals expect that intelligent technologies will have a noticeable impact on the pharmaceutical industry over the next two years.&amp;quot;&lt;br /&gt;
* &amp;quot;Almost one fifth of pharma professionals believe that we are on the cusp of a revolution.&amp;quot;&lt;br /&gt;
* &amp;quot;Intelligent automation and predictive analytics are expected to have the most significant impact on the industry.&amp;quot;&lt;br /&gt;
* &amp;quot;However, a lack of understanding and awareness about the benefits of AI-led technologies remain a hindrance to their implementation.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Note that these are expectations, not a reflection of current reality. That same report makes comments about the impact of AI on headcount disruption, asking, &amp;quot;Do you expect intelligent enterprise technologies{{efn|Intelligent enterprise technologies referenced in the report include robotic process automation, machine learning, artificial intelligence, the internet of things, predictive analysis, and cognitive computing.}} to significantly cut and/or create jobs in pharma through 2020?&amp;quot; Among the responses, 47 percent said they expected those technologies to do both, 40 percent said they will create new job opportunities, and 13 percent said there will be no dramatic change, with zero percent saying they expected solely job losses.&amp;lt;ref name=&amp;quot;PharmaIQTheFuture17&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
While there are high levels of expectation and hope for results, we need to approach the idea of AI in labs with some caution. We read about examples based on machine learning (ML), for example using computer systems to recognize cats in photos or faces in a crowd. We don’t know how these systems accomplish their tasks, and we can’t analyze their algorithms and decision-making. That leaves us with trying to test quality in, which at best is an uncertain process with qualified results (“it has worked so far”). One problem with testing AI systems based on ML is that they continually evolve, so testing may affect the ML processes by introducing a bias. It may also force continued, redundant testing, because something we thought was evaluated was changed by the “experiences” the AI based its learning on. As one example, could the AI modify the science through process changes without our knowing, because it didn’t understand the science or the goals of the work?&lt;br /&gt;
&lt;br /&gt;
AI is a black box with ever-changing contents. That shouldn’t be taken as a condemnation of AI in the lab, but rather as a challenge to human intelligence in evaluating, proving, and applying the technology. That application includes defining the operating boundaries of an AI system. Rather than creating a master AI for a complete process, we may elect to divide the AI’s area of operation into multiple, independent segments, with segment integration occurring in later stages once we are confident in their ability to work and they show clear evidence of system stability. In all of this we need to remember that our goal is the production of high-quality data and information in a controlled, predictable environment, not gee-whiz technology. One place where AI (or clever programming) could be of use is in better workflow planning, which takes into account current workloads and assignments, factors in the inevitable panic-level testing needs, and, perhaps in a QC/production environment, anticipates changes in analysis requirements based on changes in production operations.&lt;br /&gt;
&lt;br /&gt;
Throughout this section I've treated “AI” as “artificial intelligence,” its common meaning. There may be a better way of looking at it for lab use, as noted in this excerpt from the October 2018 issue of ''Wired''&amp;lt;ref name=&amp;quot;RossettoFight18&amp;quot;&amp;gt;{{cite journal |title=Fight the Dour |journal=Wired |author=Rossetto, L. |issue=October |pages=826–7 |year=2018 |url=https://www.magzter.com/stories/Science/WIRED/Fight-The-Dour}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Augmented intelligence. Not “artificial,” but how Doug Engelbart{{efn|[https://en.wikipedia.org/wiki/Douglas_Engelbart Doug Engelbart] founded the field of human-computer interaction and is credited with the invention of the computer mouse and the “Mother of All Demos” in 1968.}} envisioned our relationship with computers: AI doesn’t replace humans. It offers idiot-savant assistants that enable us to become the best humans we can be.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Augmented intelligence (AuI) is a better term for what we might experience in lab work, at least in the near future. It suggests something that is both more realistic and attainable, with a synergism that would make it, and automation, attractive to lab management and personnel: a tool they can work with to improve lab operations, one that doesn’t carry the specter of something going on that they don’t understand or control. OPUS/SEARCH from Bruker might be just such an entry in this category.&amp;lt;ref name=&amp;quot;BrukerOPUS&amp;quot;&amp;gt;{{cite web |url=https://www.bruker.com/en/products-and-solutions/infrared-and-raman/opus-spectroscopy-software/search-identify.html |title=OPUS Package: SEARCH &amp;amp; IDENT |publisher=Bruker Corporation |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; AuI may serve as a first-pass filter for large data sets—as in the radio astronomy and chemistry examples noted earlier—reducing those sets of data and information to smaller collections that human intelligence can and should evaluate. However, that does put a burden on the AuI to avoid excessive false positives or negatives, something that can be adjusted over time.&lt;br /&gt;
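To make the first-pass-filter idea concrete, here is a minimal sketch; the candidate records, scores, and threshold are invented for illustration, and the scoring model itself is assumed to exist elsewhere.&lt;br /&gt;

```python
# Illustrative sketch: an AuI first-pass filter that reduces a large set of
# machine-scored candidates to a short list for human review.

def first_pass_filter(candidates, threshold):
    """Keep candidates whose model confidence meets the threshold,
    highest-confidence first, for a human reviewer."""
    kept = [c for c in candidates if c["score"] >= threshold]
    return sorted(kept, key=lambda c: c["score"], reverse=True)

# Invented example scores, standing in for e.g. candidate FRB detections
candidates = [
    {"id": "c1", "score": 0.97},
    {"id": "c2", "score": 0.41},
    {"id": "c3", "score": 0.88},
]
for c in first_pass_filter(candidates, threshold=0.8):
    print(c["id"], c["score"])
```

Lowering the threshold admits more false positives but misses fewer real events; that tuning is exactly the adjustment over time mentioned above, and it stays under human control.&lt;br /&gt;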
&lt;br /&gt;
Beyond that, there is the possibility of more cooperative work between people and AuI systems. An article in ''Scientific American'' titled “My Boss the Robot”&amp;lt;ref name=&amp;quot;BourneMyBoss13&amp;quot;&amp;gt;{{cite journal |title=My Boss the Robot |journal=Scientific American |author=Bourne, D. |volume=308 |issue=5 |pages=38–41 |year=2013 |doi=10.1038/scientificamerican0513-38 |pmid=23627215}}&amp;lt;/ref&amp;gt; describes the advantage of a human-robot team, with the robot doing the heavy work and the human—under the robot's guidance—doing the work he was more adept at, versus a team of experts given the same task. The task, welding a Humvee frame, was completed by the human-machine pair in 10 hours at a cost of $1,150; the team of experts took 89 hours at a cost of $7,075. That might translate into laboratory work by having a robot do routine, highly repetitive tasks while the analyst oversees the operation and does higher-level analysis of the results.&lt;br /&gt;
&lt;br /&gt;
Certainly, AI/AuI is going to change over time as programming and software technology becomes more sophisticated and capable; today’s example of AuI might be seen as tomorrow’s clever software. However, a lot depends on the experience of the user.&lt;br /&gt;
&lt;br /&gt;
There is something important to ask about laboratory technology development, and AI in particular: will the direction of development be the result of someone’s innovation that people look at and embrace, or will it be the result of a deliberate choice by lab people saying, “this is where we need to go; build systems that will get us there”? The difference is important, and lab managers and personnel need to be in control of the planning and implementation of systems.&lt;br /&gt;
&lt;br /&gt;
===Where do we find the resources to carry out automation projects/programs?===&lt;br /&gt;
Given the potential scope of work, you may need people with skills in programming, robotics, instrumentation, and possibly mechanical or electrical engineering if off-the-shelf components aren’t available. The biggest need is for people who can do the planning and optimization that is needed as you move from manual to semi- or fully-automated systems, particularly specialists in process engineering who can organize and plan the work, including the process controls and provision for statistical process control.&lt;br /&gt;
&lt;br /&gt;
We need to develop people who are well versed in laboratory work and the technologies that can be applied to that work, as assets in laboratory automation development and planning. In the past, this role has been filled by lab personnel having an interest in the subject, IT people willing to extend their responsibilities, and/or outside consultants. A 2017 report by Salesforce Research states that &amp;quot;77% of IT leaders believe IT functions as an extension/partner of business units rather than as a separate function.&amp;quot;&amp;lt;ref name=&amp;quot;SalesForceSecondAnn17&amp;quot;&amp;gt;{{cite web |url=https://a.sfdcstatic.com/content/dam/www/ocms/assets/pdf/misc/2017-state-of-it-report-salesforce.pdf |format=PDF |title=Second Annual State of IT |author=SalesForce Research |publisher=SalesForce |date=2017 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; The report makes no mention of laboratory work or manufacturing aside from those being functions within the businesses surveyed. Unless a particular effort is made, IT personnel rarely have the backgrounds needed to meet the needs of lab work. In many cases, they will try to fit lab needs into software they are already familiar with, rather than extend their backgrounds into new computational environments. Office and pure database applications are easily handled, but when we get to the lab bench, it's another matter entirely.&lt;br /&gt;
&lt;br /&gt;
The field is getting complex enough that we need people whose responsibilities span both science and technology. This subject is discussed in the webinar series ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'', Part 5 &amp;quot;Supporting Laboratory Systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===What equipment would we need for automated processes, and will it be different from what we currently have?===&lt;br /&gt;
This is an interesting issue and it directly addresses the commitment labs have to automation, particularly robotics. In the early days of lab automation when Zymark (Zymate and Benchmate), [[Vendor:PerkinElmer Inc.|Perkin Elmer]], and Hewlett Packard (ORCA) were the major players in the market, the robot had to adapt to equipment that was designed for human use: standard laboratory equipment. They did that through special modifications and the use of different grippers to handle test tubes, beakers, and flasks. While some companies wanted to test the use of robotics in the lab, they didn’t want to invest in equipment that could only be used with robots; they wanted lab workers to pick up where the robots left off in case the robots didn’t work.&lt;br /&gt;
&lt;br /&gt;
Since then, equipment has evolved to support automation more directly. In some cases it is a device (e.g., a balance, pH meter, etc.) that has front panel human operator capability and rear connectors for computer communications. Liquid handling systems have seen the most advancement through the adoption of microplate formats and equipment designed to work with them. However, the key point is standardization of the sample containers. Vials and microplates lend themselves to a variety of automation devices, from sample processing to auto-injectors/samplers. The issue is getting the samples into those formats.&lt;br /&gt;
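Part of what makes standardized containers automation-friendly is that they can be addressed programmatically. The sketch below (illustrative only, not any vendor's API) shows the arithmetic behind 96-well microplate addressing, where rows A–H and columns 1–12 identify each well.&lt;br /&gt;

```python
# Illustrative sketch (not a vendor API): convert between the standard
# "A1".."H12" well names of a 96-well microplate and (row, column) indices,
# the kind of mapping a liquid handler or autosampler relies on.

ROWS, COLS = 8, 12  # 96-well format: rows A-H, columns 1-12

def well_to_index(name):
    """'A1' -> (0, 0); 'H12' -> (7, 11)."""
    row = ord(name[0].upper()) - ord("A")
    col = int(name[1:]) - 1
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError(f"{name} is not on a 96-well plate")
    return row, col

def index_to_well(row, col):
    """(0, 0) -> 'A1'; (7, 11) -> 'H12'."""
    return f"{chr(ord('A') + row)}{col + 1}"

print(well_to_index("B7"))   # (1, 6)
print(index_to_well(7, 11))  # H12
```

Because every vendor agrees on this geometry, the same addressing works across plate readers, washers, and robotic arms; that interoperability is what the standardization argument above is pointing at.&lt;br /&gt;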
&lt;br /&gt;
One point that labs, in any scientific discipline, have to come to grips with is the commitment to automation. That commitment isn’t going to be made on a lab-wide basis, but on a procedure-by-procedure basis. Full automation may not be appropriate for all lab work; partial automation may be a better choice, and in some cases no automation may be required (we’ll get into that later). The point that needs to be addressed is the choice of equipment. In most cases, equipment is designed for use by people, with options for automation and electronic communications. However, if you want to maximize throughput, you may have to follow examples from manufacturing and commit to equipment that is only used by automation. That will mean a redesign of the equipment, a shared risk for both the vendors and the users. The upside is that equipment can be specifically designed for a task, be more efficient, have the links needed for integration, use less material, and likely take up less space. One example is the microplate, which allows for tens, hundreds, or thousands (depending on the plate used) of sample cells in a small space. What used to take many cubic feet of space as test tubes (the precursor to microplates) is now a couple of cubic inches, using much less material and working space. Note, however, that while microplates are used by lab personnel, their use in automated systems provides greater efficiency and productivity.&lt;br /&gt;
&lt;br /&gt;
The idea of equipment used only in an automated process isn’t new. The development and commercialization of segmented flow analyzers—initially by Technicon in the form of the AutoAnalyzers for general use, and the SMA (Sequential Multiple Analyzer) and SMAC (Sequential Multiple Analyzer with Computer) in clinical markets—improved a lab's ability to process samples. These systems were phased out in favor of newer equipment that consumed less material. Products like these are now provided by Seal Analytical&amp;lt;ref name=&amp;quot;SealAnal&amp;quot;&amp;gt;{{cite web |url=https://seal-analytical.com/Products/tabid/55/language/en-US/Default.aspx |title=Seal Analytical - Products |publisher=Seal Analytical |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; for environmental work, and by Bran+Luebbe (a division of SPX Process Equipment in Germany).&amp;lt;ref name=&amp;quot;BranLuebbe&amp;quot;&amp;gt;{{cite web |url=https://www.spxflow.com/bran-luebbe/ |title=Bran+Luebbe |publisher=SPX FLOW, Inc |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The issue in committing to automated equipment is that vendors and users will have to agree on equipment specifications and use them within procedures. One place this has been done successfully is in clinical chemistry labs. What other industry workflows could benefit? Do the vendors lead, or do the users drive the issue? Vendors need to be convinced that there is a viable market for a product before making an investment, and users need to be equally convinced that they will succeed in applying those products. In short, procedures that are important to a particular industry have to be identified, and both users and vendors have to come together to develop automated procedure and equipment specifications for products. This has been done successfully in clinical chemistry markets, to the extent that equipment is marketed as validated for particular procedures.&lt;br /&gt;
&lt;br /&gt;
===What role does a LES play in laboratory automation?===&lt;br /&gt;
Before ELNs settled into their current role in laboratory work, the initial implementations differed considerably from what we have now. LabTech Notebook was released in 1986 (and discontinued in 2004) to provide communications between computers and devices that used RS-232 serial communications. In the early 2000s, SmartLab from Velquest was the first commercial product to carry the &amp;quot;electronic laboratory notebook&amp;quot; identifier. That product became a stand-alone entry in the laboratory execution system (LES) market; since its release, the same conceptual functionality has been incorporated into LIMS and into ELNs that fit the more current expectation of an ELN.&lt;br /&gt;
&lt;br /&gt;
At its core, an LES presents scripted test procedures that an analyst follows to carry out a laboratory method, essentially functioning as the programmed execution of a lab process. Each step in a process is described and followed exactly, and provision is made within the script for data collection. In addition, the LES can or will (depending on the implementation; &amp;quot;can&amp;quot; in the case of SmartLab) check that the analyst is qualified to carry out the work and that the equipment and reagents are current, calibrated, and suitable for use. The system can also provide access to help files that an analyst can reference if there are questions about how to carry out a step or resolve issues. Beyond that, the software can work with lab instruments and automatically acquire data, either through direct interfaces (e.g., balances, pH meters, etc.) or by parsing PDF files of instrument reports.&lt;br /&gt;
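The scripted-execution idea can be sketched in a few lines of code. The checks, step names, and data fields below are hypothetical, not taken from SmartLab or any commercial LES.&lt;br /&gt;

```python
# Hypothetical sketch of LES-style scripted execution: verify the analyst and
# instrument before running, then log each step of the method as it is done.
from datetime import datetime, timezone

def run_procedure(steps, analyst, instrument, log):
    # Pre-execution checks, as an LES would enforce
    if "moisture_assay" not in analyst["qualified_methods"]:
        raise PermissionError("Analyst not qualified for this method")
    if instrument["calibration_due"] < datetime.now(timezone.utc):
        raise RuntimeError("Instrument calibration has expired")
    results = {}
    for step in steps:
        value = step["action"]()  # perform the step (or prompt the analyst)
        results[step["name"]] = value
        log.append((datetime.now(timezone.utc).isoformat(), step["name"], value))
    return results

log = []
analyst = {"name": "J. Doe", "qualified_methods": ["moisture_assay"]}
instrument = {"id": "BAL-03",
              "calibration_due": datetime(2099, 1, 1, tzinfo=timezone.utc)}
steps = [
    {"name": "tare_balance", "action": lambda: 0.0},
    {"name": "weigh_sample", "action": lambda: 1.0042},  # grams, simulated
]
results = run_procedure(steps, analyst, instrument, log)
print(results["weigh_sample"])  # 1.0042
```

The same pattern scales to real procedures: pre-execution checks gate the run, and the timestamped step log becomes the documented evidence a regulatory inspector would review.&lt;br /&gt;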
&lt;br /&gt;
There are two reasons that these systems are attractive. First, they provide for a rigorous execution of a process with each step being logged as it is done. Second, that log provides a regulatory inspector with documented evidence that the work was done properly, making it easier for the lab to meet any regulatory burden.&lt;br /&gt;
&lt;br /&gt;
Since the initial development of SmartLab, the product has changed ownership and is currently in the hands of [[Vendor:Dassault Systèmes SA|Dassault Systèmes]] as part of the BIOVIA product line. As noted above, LIMS and ELN vendors have incorporated similar functionality into their products. Using those features requires “scripting” (in reality, software development), but it does allow access to the database structures within those products. The SmartLab software needed programmed interfaces to other vendors' LIMS and ELNs to gain access to the same information.&lt;br /&gt;
&lt;br /&gt;
====What does this have to do with automation?====&lt;br /&gt;
When we think about automated systems, particularly full-automation with robotic support, it is a programmed process from start to finish. The samples are introduced at the start, and the process continues until the final data/information is reported and stored. These can be large scale systems using microplate formats, including tape-based systems from Douglas Scientific&amp;lt;ref name=&amp;quot;DouglasScientificArrayTape&amp;quot;&amp;gt;{{cite web |url=https://www.douglasscientific.com/Products/ArrayTape.aspx |title=Array Tape Advanced Consumable |publisher=Douglas Scientific |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;, programmable autosamplers such as those from [[Vendor:Agilent Technologies, Inc.|Agilent]]&amp;lt;ref name=&amp;quot;Agilent1200Series&amp;quot;&amp;gt;{{cite web |url=https://www.agilent.com/cs/library/usermanuals/Public/G1329-90012_StandPrepSamplers_ebook.pdf |format=PDF |title=Agilent 1200 Series Standard and Preparative Autosamplers - User Manual |publisher=Agilent Technologies |date=November 2008 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;, or systems built around robotics arms from a variety of vendors that move samples from one station to another.&lt;br /&gt;
&lt;br /&gt;
Both LES and the automation noted in the previous paragraph have the following point in common: there is a strict process that must be followed, with no provision for variation. The difference is that in one case that process is implemented completely through the use of computers, as well as electronic and mechanical equipment. In the other case, the process is being carried out by lab personnel using computers, as well as electronic and mechanical lab equipment. In essence, people take the place of mechanical robots, which conjures up all kinds of images going back to the 1927 film ''Metropolis''.{{efn|See [[wikipedia:Metropolis (1927 film)|''Metropolis'' (1927 film)]] on Wikipedia.}} Though the LES represents a step toward more sophisticated automation, both methods still require:&lt;br /&gt;
&lt;br /&gt;
* programming, including “scripting” (the LES methods are a script that has to be followed);&lt;br /&gt;
* validated, proven processes; and&lt;br /&gt;
* qualified staff, though the qualifications differ. (In both cases they have to be fully qualified to carry out the process in question. However, in the full-automation case, they will require more education on running, managing, and troubleshooting the systems.)&lt;br /&gt;
&lt;br /&gt;
In the case of full automation, there has to be sufficient justification for automating the process, including a sufficient sample load. The LES-human implementation can be run for a single sample if needed, and the operating personnel can be trained on multiple procedures, switching tasks as needed. Electro-mechanical automation would require a change in programming, verification that the system is operating properly, and possibly equipment re-configuration. Which method is better for a particular lab depends on trade-offs between sample load, throughput requirements, cost, and flexibility. People are adaptable, easily moving between tasks, whereas equipment has to be adapted to a task.&lt;br /&gt;
&lt;br /&gt;
===How do we go about planning for automation?===&lt;br /&gt;
There are three forms of automation to be considered:&lt;br /&gt;
&lt;br /&gt;
# No automation – Instead, the lab relies on lab personnel to carry out all steps of a procedure.&lt;br /&gt;
# Partial automation – Automated equipment is used to carry out steps in a procedure. Given the current state of laboratory systems, this is the most prevalent form, since most lab equipment has computer components in it to facilitate its use.&lt;br /&gt;
# Full automation - The entire process is automated. The definition of “entire” is open to each lab’s interpretation and may vary from one process to another. For example, some samples may need some handling before they are suitable for use in a procedure. That might be a selection process from a freezer, grinding materials prior to a solvent extraction, and so on, representing cases where the equipment available isn’t suitable for automated equipment interaction. One goal is to minimize this effort since it can put a limit on the productivity of the entire process. This is also an area where negotiation between the lab and the sample submitter can be useful. Take plastic pellets for example, which often need to be ground into a coarse powder before they can be analyzed; having the submitter provide them in this form will reduce the time and cost of the analysis. Standardizing on the sample container can also facilitate the analysis (having the lab provide the submitter with standard sample vials using barcodes or RFID chips can streamline the process).&lt;br /&gt;
&lt;br /&gt;
One common point that these three forms share is a well-described method (procedure, process) that needs to be addressed. That method should be fully developed, tested, and validated. This is the reference point for evaluating any form of automation (Figure 1).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 1.''' Items to be considered in automating systems&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The documentation for the chosen method should include the bulleted list of items from Figure 1, as they describe the science aspects of the method. The last four points are important. The method should be validated since the manual procedure is a reference point for determining if the automated system is producing useful results. The reproducibility metric offers a means of evaluating at least one expected improvement in an automated system; you’d expect less variability in the results. This requires a set of reference sample materials that can be repeatedly evaluated to compare the manual and automated systems, and to periodically test the methods in use to ensure that there aren’t any trends developing that would compromise the method’s use. Basically, this amounts to statistical quality control on the processes.&lt;br /&gt;
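As a minimal illustration of that reproducibility comparison (all measurement values below are hypothetical), the following sketch compares the spread of repeated measurements of one reference sample under manual and automated execution:&lt;br /&gt;

```python
# Hypothetical repeated measurements of one reference sample; the point is
# comparing variability between implementations, not the specific numbers.
from statistics import mean, stdev

manual = [10.2, 9.8, 10.5, 9.6, 10.4, 10.1, 9.7, 10.3]
automated = [10.1, 10.0, 10.2, 9.9, 10.1, 10.0, 10.2, 10.0]

for label, results in (("manual", manual), ("automated", automated)):
    print(f"{label}: mean = {mean(results):.2f}, sd = {stdev(results):.3f}")

# A smaller standard deviation for the automated implementation is the
# expected improvement; re-measuring the same reference material
# periodically guards against trends developing in either method.
```

Plotting these statistics over time on a control chart provides the statistical quality control described above.&lt;br /&gt;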
&lt;br /&gt;
The next step is to decide what improvements you are looking for in an automated system: increased throughput, lower cost of operation, the ability to off-load human work, reduced variability, etc. In short, what are your goals?&lt;br /&gt;
&lt;br /&gt;
That brings us to the matter of project planning. We’re not going to go into a lot of depth in this piece about project planning, as there are a number of references{{efn|See for example https://www.projectmanager.com/project-planning; the simplest thing to do is put “project planning” in a search engine and browse the results for something interesting.}} on the subject, including material produced by the former Institute for Laboratory Automation.{{efn|See for example https://theinformationdrivenlaboratory.wordpress.com/category/resources/; note that any references to the ILA should be ignored as the original site is gone, with the domain name perhaps having been leased by another organization that has no affiliation with the original Institute for Laboratory Automation.}} There are some aspects of the subject that we do need to touch on, however, and they include:&lt;br /&gt;
&lt;br /&gt;
* justifying the project and setting expectations and goals;&lt;br /&gt;
* analyzing the process;&lt;br /&gt;
* scheduling automation projects; and&lt;br /&gt;
* budgeting.&lt;br /&gt;
&lt;br /&gt;
====Justification, expectations, and goals====&lt;br /&gt;
Basically, why are you doing this, and what do you expect to gain? What arguments are you going to use to justify the work and expense involved in the project? How will you determine if the project is successful?&lt;br /&gt;
&lt;br /&gt;
Fundamentally, automation efforts are about productivity and the bulleted items noted in the introduction of this piece, repeated below with additional commentary:&lt;br /&gt;
&lt;br /&gt;
* Lower costs per test, and better control over expenditure: These can result from a reduction in labor and materials costs, including more predictable and consistent reagent usage per test.&lt;br /&gt;
* Stronger basis for better workflow planning: Informatics systems can provide better management over workloads and resource allocation, while key performance indicators can show where bottlenecks are occurring or if samples are taking too long to process. These can be triggers for procedure automation to improve throughput.&lt;br /&gt;
* Reproducibility: The test results from automated procedures can be expected to be more reproducible by eliminating the variability that is typical of steps executed by people. Small variation in dispensing reagents, for example, could be eliminated.&lt;br /&gt;
* Predictability: The time to completion for a given test is more predictable in automated programs; once the process starts it keeps going, without the interruptions that are common in human-centered activities.&lt;br /&gt;
* Tighter adherence to procedures: Automated procedures have no choice but to be consistent in procedure execution; that is what programming and automation are about.&lt;br /&gt;
&lt;br /&gt;
Of these, which are important to your project? If you achieved these goals, what would it mean to your lab’s operations and the organization as a whole? This is part of the justification for carrying out the projects.&lt;br /&gt;
&lt;br /&gt;
As noted earlier, there are several things to consider in order to justify a project. First, there has to be a growing need that supports a procedure’s automation, one that can’t be satisfied by other means, which could include adding people, equipment, and lab space, or outsourcing the work (with the added burden of ensuring data quality and integrity, and integrating that work with the lab’s data/information). Second, the cost of the project must be balanced by its benefits. This includes any savings in cost, people (not reducing headcount, but avoiding new hires), material, and equipment, as well as improvements in the timeliness of results and overall lab operations. Third, when considering project justification, the automated process’s useful lifetime has to be long enough to justify the development work. And finally, the process has to be stable so that you aren’t in a constant re-development situation (this differs from periodic upgrades and performance improvements, EVOP in manufacturing terms). One common point of failure in projects is change in the underlying procedures; if the basic process model changes, you are trying to hit a moving target. That ruins schedules and causes budgets to inflate.&lt;br /&gt;
&lt;br /&gt;
This may seem like a lot to think about for something that could be as simple as moving from manual pipettes to automatic units, but in that case it just means the total effort to do the work will be small. However, it is still important since it impacts data quality and integrity, and your ability to defend your results should they be challenged. And, by the way, the issue of automated pipettes isn’t simple; there is a lot to consider in properly specifying and using these products.{{Efn|As a starting point, view the [https://www.artel.co/ Artel, Inc. site] as one source. Also, John Bradshaw gave an [https://www.artel.co/learning_center/2589/ informative presentation] on “The Importance of Liquid Handling Details and Their Impact on your Assays” at the 2012 European Lab Automation Conference, Hamburg, Germany.}}&lt;br /&gt;
&lt;br /&gt;
====Analyzing the process====&lt;br /&gt;
Assuming that you have a well-described, thoroughly tested, and validated procedure, that process has to be analyzed for optimization and suitability for automation. This is an end-to-end evaluation, not just an examination of isolated steps. This is an important point: looking at a single step without taking into account the rest of the process may improve that portion of the process but have consequences elsewhere.&lt;br /&gt;
&lt;br /&gt;
Take a common example: working in a testing environment where samples are being submitted by outside groups (Figure 2).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 2.''' Lab sample processing, initial data entry through results&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Most LIMS will permit sample submitters (with appropriate permissions) to enter the sample description information directly into the LIMS, reducing some of the clerical burden. Standardizing on sample containers, with barcodes, reduces the effort and cost in some aspects of sample handling. A barcode scanner could be used to scan samples as they arrive into the lab, letting the system know that they are ready to be tested.&lt;br /&gt;
&lt;br /&gt;
That brings us to an evaluation of the process as a whole, as well as an examination of the individual steps in the procedure. As shown in Figure 1, automation can be done in one of two ways: automating the full process or automating individual steps. Your choice depends on several factors, not the least of which is your comfort level and confidence in adopting automation as a strategy for increasing productivity. For some, concentrating on improvements in individual steps is an attractive approach. The cost and risk may be lower, and if a problem occurs you can always fall back to a fully manual implementation until it is resolved.&lt;br /&gt;
&lt;br /&gt;
Care does have to be taken in choosing which steps to improve. From one perspective, you’d want to do the step-wise implementation of automation as close to the end of the process as possible. The problem with doing it earlier is that you may create a backup in later stages of the process. Optimizing step 2, for example, doesn’t do you much good if step 3 is overloaded and requires more people, or additional (possibly unplanned) automation to relieve a bottleneck there. In short, before you automate or improve a given step, you need to be sure that downstream processing can absorb the increase in materials flow. In addition, optimizing all the individual steps, one at a time, doesn’t necessarily add up to a well-designed full-system automation; the transitions between steps may not be as effective or efficient as they would be if the system were designed as a whole. If the end of the process is carried out by commercial instrumentation, absorbing more work is easier, since most of these systems are automated with computer data acquisition and processing, and many have auto-samplers available to accumulate samples that can be processed automatically. Some of those auto-samplers have built-in robotics for common sample handling functions. If the workload builds, additional instruments can pick up the load, and equipment such as [[Vendor:Baytek International, Inc.|Baytek International’s]] TurboTube&amp;lt;ref name=&amp;quot;BaytekiPRO&amp;quot;&amp;gt;{{cite web |url=https://www.baytekinternational.com/products/ipro-interface/89-products |title=iPRO Interface - Products |publisher=Baytek International, Inc |accessdate=05 February 2021}}&amp;lt;/ref&amp;gt; can accumulate sample vials in a common system and route them to individual instruments for processing.&lt;br /&gt;
&lt;br /&gt;
Another consideration for partial automation is where the process is headed in the future. If the need for the process persists over a long period of time, will you eventually get to the point of needing to redo the automation as a fully integrated stream? If so, is it better to take the plunge early on instead of continually expending resources on upgrades?&lt;br /&gt;
&lt;br /&gt;
Other considerations include the ability to re-purpose equipment. If a process isn’t used full-time (a justification for partial automation) the same components may be used in improving other processes. Ideally, if you go the full-process automation route, you’ll have sufficient sample throughput to keep it running for an extended period of time, and not have to start and stop the system as samples accumulate. A smoothly running slower automation process is better than a faster system that lies idle for significant periods of time, particularly since startup and shutdown procedures may diminish the operational cost savings in both equipment use and people’s time.&lt;br /&gt;
&lt;br /&gt;
All these points become part of both the technical justification and budget requirements.&lt;br /&gt;
&lt;br /&gt;
'''Analyzing the process: Simulation and modeling'''&lt;br /&gt;
&lt;br /&gt;
Simulation and modeling have been part of science and engineering for decades, supported by increasingly powerful computing hardware and software. Continuous-systems simulations have shown us the details of how machinery works, how chemical reactions occur, and how chromatographic systems and other instrumentation behave.&amp;lt;ref name=&amp;quot;JoyceComputer18&amp;quot;&amp;gt;{{cite journal |title=Computer Modeling and Simulation |journal=Lab Manager |author=Joyce, J. |volume=13 |issue=9 |pages=32–35 |year=2018 |url=https://www.labmanager.com/laboratory-technology/computer-modeling-and-simulation-1826}}&amp;lt;/ref&amp;gt; There is another aspect to modeling and simulation that is appropriate here.&lt;br /&gt;
&lt;br /&gt;
Discrete-event simulation (DES) is used to model and understand processes in business and manufacturing applications, evaluating the interactions between service providers and customers, for example. One application of DES is to determine the best way to distribute incoming customers to a limited number of servers, taking into account that not all customers have the same needs; some will tie up a service provider much longer than others, as represented by the classic bank teller line problem. That is one question that discrete simulations can analyze. This form of simulation and modeling is appropriate to event-driven processes where the action occurs in discrete steps (like materials moving from one workstation to another) rather than as a continuous function of time (most naturally occurring systems fall into the latter category, e.g., heat flow and models using differential equations).&lt;br /&gt;
&lt;br /&gt;
The processes in your lab can be described and analyzed via DES systems.&amp;lt;ref name=&amp;quot;CostigliolaSimul17&amp;quot;&amp;gt;{{cite journal |title=Simulation Model of a Quality Control Laboratory in Pharmaceutical Industry |journal=IFAC-PapersOnLine |author=Costigliola, A.; Ataíde, F.A.P.; Vieira, S.M. et al. |volume=50 |issue=1 |pages=9014-9019 |year=2017 |doi=10.1016/j.ifacol.2017.08.1582}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;MengImprov13&amp;quot;&amp;gt;{{cite journal |title=Improving Medical Laboratory Operations via Discrete-event Simulation |journal=Proceedings of the 2013 INFORMS Healthcare Conference |author=Meng, L.; Liu, R.; Essick, C. et al. |year=2013 |url=https://www.researchgate.net/publication/263238201_Improving_Medical_Laboratory_Operations_via_Discrete-event_Simulation}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;JunApplic99&amp;quot;&amp;gt;{{cite journal |title=Application of discrete-event simulation in health care clinics: A survey |journal=Journal of the Operational Research Society |volume=50 |pages=109–23 |year=1999 |doi=10.1057/palgrave.jors.2600669}}&amp;lt;/ref&amp;gt; Those laboratory procedures are a sequence of steps, each having a precursor, variable duration, and following step until the end of the process is reached; this is basically the same as a manufacturing operation where modeling and simulation have been used successfully for decades. DES can be used to evaluate those processes and ask questions that can guide you on the best paths to take in applying automation technologies and solving productivity or throughput problems. For example:&lt;br /&gt;
&lt;br /&gt;
* What happens if we tighten up the variability in a particular step; how will that affect the rest of the system?&lt;br /&gt;
* What happens at the extremes of the variability in process steps; does it create a situation where samples pile up?&lt;br /&gt;
* How much of a workload can the process handle before one step becomes saturated with work and the entire system backs up?&lt;br /&gt;
* Can you introduce an alternate path to process those samples and avoid problems (e.g., if samples are held for too long in one stage, do they deteriorate)?&lt;br /&gt;
* Can the output of several parallel slower procedures be merged into a feed stream for a common instrumental technique?&lt;br /&gt;
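To make the saturation question concrete, here is a rough DES sketch (all rates are assumed for illustration) of a single shared instrument servicing samples in arrival order; as the average gap between arriving samples approaches the average service time, waiting time grows sharply:&lt;br /&gt;

```python
# Minimal single-server discrete-event sketch: samples arrive at random
# intervals and wait if the instrument is busy. All timing parameters are
# illustrative, not measured from a real lab.
import random

def avg_wait(interarrival_mean, service_mean, n_samples=20000, seed=1):
    rng = random.Random(seed)
    t_arrive = 0.0    # clock time of the current sample's arrival
    t_free = 0.0      # clock time when the instrument next becomes free
    total_wait = 0.0
    for _ in range(n_samples):
        t_arrive += rng.expovariate(1.0 / interarrival_mean)
        start = max(t_arrive, t_free)          # wait if instrument is busy
        total_wait += start - t_arrive
        t_free = start + rng.expovariate(1.0 / service_mean)
    return total_wait / n_samples

# Service takes 10 minutes on average; vary the sample arrival rate.
for gap in (20.0, 12.5, 11.0):
    print(f"mean arrival gap {gap:>5} min -> mean wait {avg_wait(gap, 10.0):6.1f} min")
```

The run shows how modestly increasing the load on an already-busy step causes waiting time to balloon, which is the kind of answer these questions are after, obtained without touching the real process.&lt;br /&gt;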
&lt;br /&gt;
In complex procedures some steps may be sensitive to small delays, and DES can help test and uncover them. Note that setting up these models will require the collection of a lot of data about the processes and their timing, so this is not something to be taken casually.&lt;br /&gt;
&lt;br /&gt;
Previous research&amp;lt;ref name=&amp;quot;JoyceComputer18&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;CostigliolaSimul17&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;MengImprov13&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;JunApplic99&amp;quot; /&amp;gt; suggests only a few of the areas where simulation can be effective, including one case where an entire lab’s operations were evaluated. Models that extensive can be used not only to look at procedures, but also at the introduction of informatics systems. This may appear to be a significant undertaking, and it can be, depending on the complexity of the lab processes. However, simple processes can be initially modeled on spreadsheets to see if a more significant effort is justified. Operations research, of which DES is a part, has been usefully applied in production operations to increase throughput and improve ROI. It might be successfully applied to some routine, production-oriented lab work.&lt;br /&gt;
&lt;br /&gt;
Most lab processes are linear in their execution, one step following another, with the potential for loop-backs should problems be recognized with samples, reagents (e.g., out-of-date or suspect material that requires obtaining new stock), or equipment (e.g., not functioning properly, out of calibration, busy due to other work). On one level, the modeling of a manually implemented process should appear to be simple: each step takes a certain amount of time, and if you add up the times, you have a picture of the process execution through time. However, the reality is quite different if you take into account the problems (and their resolution) that can occur in each of those steps. The data collection used to model the procedure can change how that picture looks and your ability to improve it. By monitoring the process over a number of iterations, you can find out how much variation there is in the execution time for each step and whether the variation is normally distributed or skewed (e.g., if one step is skewed, how does it impact others?).&lt;br /&gt;
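As a rough sketch of that picture (the step times and their distributions are assumed, not measured), summing a few variable steps shows how the total execution time spreads more widely than any single step:&lt;br /&gt;

```python
# All step-time parameters below are assumptions for illustration: two
# roughly normal steps plus one right-skewed (lognormal) review step.
import random

rng = random.Random(42)

def run_once():
    prep = rng.gauss(30, 3)                  # sample prep, minutes
    analysis = rng.gauss(45, 5)              # instrument run, minutes
    review = rng.lognormvariate(2.3, 0.5)    # review/rework, skewed tail
    return prep + analysis + review

totals = sorted(run_once() for _ in range(10000))
print(f"median total time: {totals[5000]:.1f} min")
print(f"95th percentile  : {totals[9500]:.1f} min")
# The gap between the median and the 95th percentile is the variability
# that a simple sum of average step times hides.
```

Replacing these assumed distributions with ones fitted to recorded step times turns the sketch into the kind of model Figures 3a and 3b describe.&lt;br /&gt;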
&lt;br /&gt;
Questions to ask about potential problems that could occur at each step include:&lt;br /&gt;
&lt;br /&gt;
* How often do problems with reagents occur and how much of a delay does that create?&lt;br /&gt;
* Is instrumentation always in calibration (do you know?), are there operational problems with devices and their control systems (what are the ramifications?), are procedures delayed due to equipment being in use by someone else, and how long does it take to make changeovers in operating conditions?&lt;br /&gt;
* What happens to the samples; do they degrade over time? What impact does this have on the accuracy of results and their reproducibility?&lt;br /&gt;
* How often are workflows interrupted by the need to deal with high-priority samples, and what effect does it have on the processing of other samples?&lt;br /&gt;
&lt;br /&gt;
Just the collection of data can suggest useful improvements before automation is even considered, and perhaps negate the need for it. The answer to a lab’s productivity problem might be as simple as adding another instrument if that is the bottleneck. The data might also suggest that an underutilized device would be more productive if sample preparation for different procedures’ workflows were organized differently. Underutilization might be a consequence of the amount of time needed to prepare the equipment for service: doing so for one sample might be disproportionately time-consuming (and expensive) and cause other samples to wait until there were enough of them to justify the preparation. It could also suggest that some lab processes should be outsourced to groups that have a more consistent sample flow and turn-around time (TAT) for that technique. Some of these points are illustrated in Figures 3a and 3b below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3a Liscouski ConsidAutoLabProc21.png|574px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;574px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3a.''' Simplified process views versus some modeling considerations. Note that the total procedure execution time is affected by the variability in each step, plus equipment and material availability delays; these can change from one day to the next in manual implementations.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3b Liscouski ConsidAutoLabProc21.png|544px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;544px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3b.''' The execution times of each step include the variable execution times of potential issues that can occur in each stage. Note that because each factor has a different distribution curve, the total execution time has a much wider variability than the individual factors.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
How does the simulation system work? Once you have all the data set up, the simulation runs thousands of times, using random number generators to pick values for the execution times of each component in each step. For example, if there is a one-in-ten chance a piece of equipment will be in use when needed, 10% of the runs will show that, with each one picking a delay time based on the input delay distribution function. With a large number of runs, you can see where delays exist and how they impact the overall process’s behavior. You can also adjust the factors (e.g., what happens if equipment delays are cut in half?) and see the effect of doing that. By testing the system, you can make better judgments on how to apply your resources.&lt;br /&gt;
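A minimal sketch of that mechanic (the busy probability and delay range are assumed values): one random draw decides whether the shared equipment is busy on a given run, and a second draw picks the length of the resulting delay:&lt;br /&gt;

```python
# Monte Carlo sketch of equipment-contention delay; p_busy and the 5-30
# minute wait range are assumptions to be replaced with measured values.
import random

def simulate(runs=10000, p_busy=0.10, seed=7):
    """Return the fraction of runs delayed and the mean delay per run."""
    rng = random.Random(seed)
    delayed, total_delay = 0, 0.0
    for _ in range(runs):
        if rng.random() < p_busy:               # equipment already in use
            delayed += 1
            total_delay += rng.uniform(5, 30)   # assumed 5-30 min wait
    return delayed / runs, total_delay / runs

frac, mean_delay = simulate()
print(f"delayed runs: {frac:.1%}, mean delay: {mean_delay:.2f} min/run")

# Halving the chance of finding the equipment busy models added capacity:
_, improved = simulate(p_busy=0.05)
print(f"with p_busy halved, mean delay drops to {improved:.2f} min/run")
```

Comparing the two outputs is the kind of what-if adjustment described above, here applied to the chance of finding the equipment busy rather than the length of the delay.&lt;br /&gt;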
&lt;br /&gt;
Some of the issues that surface may be things that lab personnel know about and just deal with. It isn’t until the problems are examined that their impact on operations is fully realized and addressed. Modeling and simulation may appear to be overkill for lab process automation, something reserved for large-scale production projects. However, the physical size of the project is not the key factor; it is the complexity of the system that matters, and the potential for optimization.&lt;br /&gt;
&lt;br /&gt;
One benefit of a well-structured simulation of lab processes is that it would provide a solid basis for making recommendations for project approval and budgeting. The most significant element in modeling and simulation is the initial data collection, asking lab personnel to record the time it takes to carry out steps. This isn’t likely to be popular if they don’t understand why it is being done and what the benefits will be to them and the lab; accurate information is essential. This is another case where “bad data is worse than no data.”&lt;br /&gt;
&lt;br /&gt;
'''Guidelines for process automation'''&lt;br /&gt;
&lt;br /&gt;
There are two types of guidelines that will be of interest to those conducting automation work: those that help you figure out what to do and how to do it, and those that must be met to satisfy regulatory requirements (whether evaluated by internal or external groups).&lt;br /&gt;
&lt;br /&gt;
The first type is going to depend on the nature of the science and the automation being done to support it. Equipment vendor community support groups can be of assistance, as can professional groups like the Pharmaceutical Research and Manufacturers of America (PhRMA), International Society for Pharmaceutical Engineering (ISPE), and Parenteral Drug Association (PDA) in the pharmaceutical and biotechnology industries, with similar organizations in other industries and other countries. This may seem like a large jump from laboratory work, but it is appropriate when we consider the ramifications of full-process automation. You are essentially developing a manufacturing operation on a lab bench, and the same concerns that apply to large-scale production also apply here; you have to ensure that the process is maintained and in control. The same is true of manual or semi-automated lab work, but it is more critical in fully automated systems because of the potentially high volume of results that can be produced.&lt;br /&gt;
&lt;br /&gt;
The second type is going to consist of regulatory guidelines from organizations appropriate to your industry, such as the [[Food and Drug Administration]] (FDA), [[United States Environmental Protection Agency|Environmental Protection Agency]] (EPA), and [[International Organization for Standardization]] (ISO), as well as international guidance (e.g., [[Good Automated Manufacturing Practice|GAMP]], [[Good Automated Laboratory Practices|GALP]]). The interesting point is that we are looking at a potentially complete automation scheme for a procedure; does that come under manufacturing or laboratory guidance? The likelihood is that laboratory guidelines will apply, since the work is being done within the lab's footprint; however, there are things that can be learned from their manufacturing counterparts that may assist in project management and documentation. One interesting consideration is what happens when fully automated testing, such as that done by on-line analyzers, becomes integrated with both the lab and production or process control data/information streams. Which regulatory guidelines apply then? It may come down to who is responsible for managing and supporting those systems.&lt;br /&gt;
&lt;br /&gt;
====Scheduling automation projects====&lt;br /&gt;
There are two parts to the scheduling issue: how long is it going to take to complete the project (dependent on the process and people), and when do you start? The second point will be addressed here.&lt;br /&gt;
&lt;br /&gt;
The timing of an automated process coming online is important. If it comes online too soon, there may not be enough work to justify its use, and startup/shutdown procedures may create more work than the system saves. If it comes too late, people will be frustrated with a heavy workload while the system that was supposed to provide relief is still under development.&lt;br /&gt;
&lt;br /&gt;
In Figure 4, the blue line represents the growing need for sample/material processing using a given laboratory procedure. Ideally, you’d like the automated version to be available when that blue line crosses the “automation needed on-line” level of processing requirements; this is the point where the current (perhaps manual) implementation can no longer meet the demands of sample throughput requirements.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 4.''' Timing the development of an automated system&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Those throughput limits are something you are going to have to evaluate and measure on a regular basis and use to make adjustments to the planning process (accelerating or slowing it as appropriate). How fast is the demand growing and at what point will your current methods be overwhelmed? Hiring more people is one option, but then the lab's operating expenses increase due to the cost of people, equipment, and lab space.&lt;br /&gt;
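As a rough planning aid, that crossing point in Figure 4 can be estimated with a few lines of code. The sketch below (Python; all throughput and growth figures are invented for illustration) assumes simple linear demand growth:&lt;br /&gt;

```python
# Hypothetical sketch: estimate how many months remain before demand for a
# procedure outgrows current (manual) capacity. All numbers are examples;
# substitute your own measured throughput and demand-growth figures.

def months_until_overwhelmed(current_per_month, growth_per_month, capacity_per_month):
    """Return the first month in which linearly growing demand exceeds
    capacity; 0 if capacity is already short, None if demand isn't growing."""
    if current_per_month >= capacity_per_month:
        return 0
    if growth_per_month <= 0:
        return None
    gap = capacity_per_month - current_per_month
    return gap // growth_per_month + 1

# Example: 400 samples/month today, growing by 25/month, manual limit of 550.
print(months_until_overwhelmed(400, 25, 550))  # 7 months of runway
```

A number like this, revisited whenever demand figures are updated, indicates how much planning runway remains before the current implementation is overwhelmed.&lt;br /&gt;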
&lt;br /&gt;
Once we have an idea of when something has to be working, we can begin the process of planning. Note that the planning can begin at any point; it would be good to get the preliminaries done as soon as a manual process is finalized so that you have an idea of what you’ll be getting into. Those preliminaries include looking at equipment that might be used (keeping track of its development), training requirements, developer resources, and implementation strategies, all of which would be updated as new information becomes available. The “we’ll-get-to-it-when-we-need-it” approach is just going to create a lot of stress and frustration.&lt;br /&gt;
&lt;br /&gt;
You need to put together a first-pass project plan so that you can detail what you know, and more importantly what you don’t know. The goal is to have enough information, updated as noted above, so that you can determine if an automated solution is feasible, make an informed initial choice between full and partial automation, and have a timeline for implementation. Any time estimate is going to be subject to change as you gather information and refine your implementation approach. The point of the timeline is to figure out how long the yellow box in Figure 4 is because that is going to tell you how much time you have to get the plan together and working; it is a matter of setting priorities and recognizing what they are. The time between now and the start of the yellow box is what you have to work with for planning and evaluating plans, and any decisions that are needed before you begin, including corporate project management requirements and approvals.&lt;br /&gt;
&lt;br /&gt;
Those plans have to include time for validation and the evaluation of the new implementation against the standard implementation. Does it work? Do we know how to use and maintain it? Are people educated in its use? Is there documentation for the project?&lt;br /&gt;
&lt;br /&gt;
====Budgeting====&lt;br /&gt;
At some point, all the material above and following this section comes down to budgeting: how much will it cost to implement a program and is it worth it? Of the two points, the latter is the one that is most important. How do you go about that? (Note: Some of this material is also covered in the webinar series ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'' in the section on ROI.)&lt;br /&gt;
&lt;br /&gt;
What a lot of this comes down to is explaining and justifying the choices you’ve made in your project proposal. We’re not going to go into a lot of depth, but just note some of the key issues:&lt;br /&gt;
&lt;br /&gt;
* Did you choose full or partial automation for your process?&lt;br /&gt;
* What drove that choice? If, in your view, partial automation would be less expensive than full automation of the process, how long will it be until the next upgrade to another stage is needed?&lt;br /&gt;
* How independent are the potential, sequential implementation efforts that may be undertaken in the future? Will there be a need to connect them, and if so, how will the incremental costs compare to just doing it once and getting it over with?&lt;br /&gt;
&lt;br /&gt;
There is a tendency in lab work to treat problems and the products that might be used to address them in isolation. You see the need for a LIMS or ELN, or an instrument data system, and the focus is on those issues. Effective decisions have to consider both the immediate and longer-term aspects of a problem. If you want to get access to a LIMS, have you considered how it will affect other aspects of lab work, such as connecting instruments to it?&lt;br /&gt;
&lt;br /&gt;
The same holds true for partial automation as a solution to a lab process productivity problem. While you are addressing a particular step, should you also be looking at the potential for synergy in addressing other concerns? Modeling and simulations of processes can help resolve that issue.&lt;br /&gt;
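For example, even a very small discrete-events-style simulation can expose whether automating one step will actually raise throughput. The sketch below (Python, standard library only; step times are invented) models a two-step serial process where each step handles one sample at a time:&lt;br /&gt;

```python
# Minimal sketch of a two-step lab process (e.g., preparation then analysis),
# each step a single station processing one sample at a time. Times are
# hypothetical; replace them with measured values from your own process.

def completion_time(n_samples, prep_time, analysis_time):
    """Total time to push n_samples through prep followed by analysis."""
    prep_free = 0.0      # when the prep station next becomes free
    analysis_free = 0.0  # when the analysis station next becomes free
    for _ in range(n_samples):
        prep_done = prep_free + prep_time
        prep_free = prep_done
        # Analysis starts when both the sample and the station are ready.
        analysis_free = max(analysis_free, prep_done) + analysis_time
    return analysis_free

baseline = completion_time(100, prep_time=10, analysis_time=6)    # prep-limited
faster_prep = completion_time(100, prep_time=2, analysis_time=6)  # analysis-limited
print(baseline, faster_prep)
```

Here, automating preparation alone helps, but throughput then becomes capped by the analysis step; that is exactly the kind of synergy question a simulation can answer before money is spent.&lt;br /&gt;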
&lt;br /&gt;
Have you factored in the cost of support and education? The support issue needs to address the needs of lab personnel in managing the equipment and the options for vendor support, as well as the impact on IT groups. Note that the IT group will require access to vendor support, as well as being educated on their role in any project work.&lt;br /&gt;
&lt;br /&gt;
What happens if you don’t automate? One way to justify the cost of a project is to help people understand what the lab’s operations will be like without it. Will more people, equipment, space, or added shifts be needed? At what cost? What would the impact be on those who need the results and how would it affect their programs?&lt;br /&gt;
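One way to frame that justification numerically is a simple break-even comparison. The sketch below is Python, and every figure in it is a hypothetical placeholder for your lab's own numbers:&lt;br /&gt;

```python
# Hypothetical break-even sketch for the "what happens if we don't automate?"
# question. Replace every figure with your lab's actual costs.

def breakeven_months(automation_cost, monthly_manual_cost, monthly_automated_cost):
    """Months of operation before automation's up-front cost is recovered;
    None if automation never pays back on operating costs alone."""
    savings = monthly_manual_cost - monthly_automated_cost
    if savings <= 0:
        return None
    return automation_cost / savings

# Example: a $180k system vs. $20k/month in added staffing without it, with
# $5k/month to operate and support the automated process.
print(breakeven_months(180_000, 20_000, 5_000))  # 12.0 months
```

A calculation like this says nothing about the softer impacts (turnaround time, staff morale, the programs that depend on results), but it anchors the discussion with those who approve budgets.&lt;br /&gt;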
&lt;br /&gt;
==Build, buy, or cooperate?==&lt;br /&gt;
In this write-up and some of the referenced materials, we’ve noted several times the benefits that clinical labs have gained through automation, although crediting it all to automation alone isn’t fair. What the clinical laboratory industry did was recognize the need for automation to solve problems with the operational costs of running labs, and further recognize that its members could benefit by coming together and cooperatively addressing lab operational problems.&lt;br /&gt;
&lt;br /&gt;
It’s that latter point that made the difference, resulting in standardized communications and purpose-built commercial equipment that could be used to implement automation in their labs. They also had common sample types, common procedures, and common data processing. That same commonality applies to segments of industrial and academic lab work. Take the life sciences as an example. Where possible, that industry has standardized on microplates for sample processing. The result is a wide selection of instruments and robotics built around that sample-holding format that greatly improves lab economics and throughput. While it isn’t the answer to everything, it’s a good answer to a lot of things.&lt;br /&gt;
&lt;br /&gt;
If your industry segment came together and recognized that you used common procedures, how would you benefit by creating a common approach to automation instead of each lab doing it on its own? It would open the development of common products or product variations from vendors and relieve each lab of the need to develop its own answer. The result could be more effective and easily supportable solutions.&lt;br /&gt;
&lt;br /&gt;
==Project planning==&lt;br /&gt;
Once you’ve decided on the project you are going to undertake, the next stage is looking at the steps needed to manage your project (Figure 5).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Liscouski ConsidAutoLabProc21.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 5.''' Steps in a laboratory automation project. This diagram is modeled after the GAMP V for systems validation.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The planning begins with the method description from Figure 1, which describes the science behind the project, and the specification of how the automation is expected to be put into effect: as full-process automation, or as a specific step or steps in the process. The provider of those documents is considered the “customer,” consistent with GAMP V nomenclature (Figure 6); that consistency is important due to the need for system-wide validation protocols.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Liscouski ConsidAutoLabProc21.png|749px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;749px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 6.''' GAMP V model for showing customer and supplier roles in specifying and evaluating project components for computer hardware and software.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
From there the “supplier” (e.g., internal development group, consultant, IT services, etc.) responds with a functional specification that is reviewed by the customer. The “analysis, prototyping, and evaluation” step, represented in the third box of Figure 5, is not the same as the process analysis noted earlier in this piece. The earlier section was to help you determine what work needed to be done and documented in the user requirements specification. The analysis and associated tasks here are specific to the implementation of this project. The colored arrows refer to the diagram in Figure 7. That process defines the equipment needed, dependencies, and options/technologies for automation implementations, including robotics, instrument design requirements, pre-built automation (e.g., titrators, etc.) and any custom components. The documentation and specifications are part of the validation protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Liscouski ConsidAutoLabProc21.png|650px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;650px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 7.''' Defining dependencies and qualification of equipment&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The prototyping function is an important part of the overall process. It is rare that someone will look at a project and come up with a working solution on the first pass. There is always tinkering, and modifications occur as you move from a blank slate to a working system. You make notes along the way about what should be done differently in the final product, and about places where improvements or adjustments are needed. These all become part of the input to the system design specification that will be reviewed and approved by the customer and supplier. The prototype can be considered a proof of concept or a demonstration of what will occur in the finished product. Remember also that prototypes would not have to be validated since they wouldn’t be used in a production environment; they are simply a test bed used prior to the development of a production system.&lt;br /&gt;
&lt;br /&gt;
The component design specifications are the refined requirements for elements that will be used in the final design. Those refinements could point to updated models of components or equipment used, modifications needed, or recommendations for products with capabilities other than those used in the prototype.&lt;br /&gt;
&lt;br /&gt;
The boxes on the left side of Figure 5 are documents that go into increasing depth as the system is designed and specified. The details in those items will vary with the extent of the project. The right side of the diagram is a series of increasingly sophisticated tests and evaluations against the corresponding steps on the left side, culminating in the final demonstration that the system works, has been validated, and is accepted by the customer. It also means that lab and support personnel are educated in their roles.&lt;br /&gt;
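That left-to-right pairing can be written down as a simple lookup. The document and qualification names below follow common GAMP usage, but the exact set for any project comes from your own quality system:&lt;br /&gt;

```python
# Sketch of the specification-to-verification pairing implied by the V-model.
# Names follow common GAMP convention; treat this as illustrative, not a
# definitive list for your project.

V_MODEL_PAIRING = {
    "User requirements specification": "Performance qualification (PQ)",
    "Functional specification": "Operational qualification (OQ)",
    "Design specification": "Installation qualification (IQ)",
}

def verification_for(document):
    """Which test/qualification stage verifies a given specification?"""
    return V_MODEL_PAIRING.get(document, "No matching verification stage defined")

print(verification_for("Functional specification"))  # Operational qualification (OQ)
```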
&lt;br /&gt;
==Conclusions (so far)==&lt;br /&gt;
“Laboratory automation” has to give way to “laboratory automation engineering.” From the initial need to the completion of the validation process, we have to plan, design, and implement successful systems on a routine basis. Just as the manufacturing industries transitioned from cottage industries to production lines and then to integrated production-information systems, the execution of laboratory science has to tread a similar path if the demands for laboratory results are going to be met in a financially responsible manner. The science is fundamental; however, we need to pay attention now to efficient execution.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
'''AI''': Artificial intelligence&lt;br /&gt;
&lt;br /&gt;
'''AuI''': Augmented intelligence&lt;br /&gt;
&lt;br /&gt;
'''DES''': Discrete-events simulation&lt;br /&gt;
&lt;br /&gt;
'''ELN''': Electronic laboratory notebook&lt;br /&gt;
&lt;br /&gt;
'''EPA''': Environmental Protection Agency&lt;br /&gt;
&lt;br /&gt;
'''FDA''': Food and Drug Administration&lt;br /&gt;
&lt;br /&gt;
'''FRB''': Fast radio bursts&lt;br /&gt;
&lt;br /&gt;
'''GALP''': Good automated laboratory practices&lt;br /&gt;
&lt;br /&gt;
'''GAMP''': Good automated manufacturing practice&lt;br /&gt;
&lt;br /&gt;
'''ISO''': International Organization for Standardization&lt;br /&gt;
&lt;br /&gt;
'''LES''': Laboratory execution system&lt;br /&gt;
&lt;br /&gt;
'''LIMS''': Laboratory information management system&lt;br /&gt;
&lt;br /&gt;
'''ML''': Machine learning&lt;br /&gt;
&lt;br /&gt;
'''ROI''': Return on investment&lt;br /&gt;
&lt;br /&gt;
'''SDMS''': Scientific data management system&lt;br /&gt;
&lt;br /&gt;
'''TAT''': Turn-around time&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==About the author==&lt;br /&gt;
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and quality control environments.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Considerations_in_the_Automation_of_Laboratory_Procedures&amp;diff=64496</id>
		<title>LII:Considerations in the Automation of Laboratory Procedures</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Considerations_in_the_Automation_of_Laboratory_Procedures&amp;diff=64496"/>
		<updated>2024-06-20T03:03:32Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* Project planning */ Fixed error in Figure 7&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title''': ''Considerations in the Automation of Laboratory Procedures''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Joe Liscouski, with editorial modifications by Shawn Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': January 2021&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
Scientists have been dealing with the issue of [[laboratory automation]] for decades, and during that time the meaning of those words has expanded from the basics of connecting an instrument to a computer, to the possibility of a fully integrated [[Informatics (academic field)|informatics]] infrastructure beginning with [[Sample (material)|sample]] preparation and continuing on to the [[laboratory information management system]] (LIMS), [[electronic laboratory notebook]] (ELN), and beyond. Throughout this evolution there has been one underlying concern: how do we go about doing this?&lt;br /&gt;
&lt;br /&gt;
The answer to that question has changed from a focus on hardware and programming, to today’s need for a lab-wide informatics strategy. We’ve moved from the bits and bytes of assembly language programming to managing terabytes of files and data structures.&lt;br /&gt;
&lt;br /&gt;
The high-end of the problem—the large informatics database systems—has received significant industry-wide attention in the last decade. The stuff on the lab bench, while the target of a lot of individual products, has been less organized and more experimental. Failed or incompletely met promises have to yield to planned successes. How we do it needs to change. This document is about the considerations required when making that change. The haphazard &amp;quot;let's try this&amp;quot; method has to give way to more engineered solutions and a realistic appraisal of the human issues, as well as the underlying technology management and planning.&lt;br /&gt;
&lt;br /&gt;
Why is this important? Whether you are conducting intense laboratory experiments to produce data and [[information]] or making chocolate chip cookies in the kitchen, two things remain important: productivity and the quality of the products. In either case, if the productivity isn’t high enough, you won’t be able to justify your work; if the quality isn’t there, no one will want what you produce. Conducting laboratory work and making cookies have a lot in common. Your laboratories exist to answer questions. What happens if I do this? What is the purity of this material? What is the structure of this compound? The field of laboratories asking these questions is extensive, basically covering the entire array of lab bench and scientific work, including chemistry, life sciences, physics, and electronics labs. The more efficiently we answer those questions, the more likely it will be that these labs will continue operating and that you’ll achieve the goals your organization has set. At some point, it comes down to performance against goals and the return on the investment organizations make in lab operations.&lt;br /&gt;
&lt;br /&gt;
In addition to product quality and productivity, there are a number of other points that favor automation over manual implementations of lab processes. They include:&lt;br /&gt;
&lt;br /&gt;
* lower costs per test;&lt;br /&gt;
* better control over expenditures;&lt;br /&gt;
* a stronger basis for better [[workflow]] planning;&lt;br /&gt;
* reproducibility;&lt;br /&gt;
* predictability; and&lt;br /&gt;
* tighter adherence to procedures, i.e., consistency.&lt;br /&gt;
&lt;br /&gt;
Lists similar to the one above can be found, without further comment, in justifications for lab automation (and cookie production). It’s just assumed that everyone agrees and that the reasoning is obvious. Since we are going to use those items to justify the cost and effort that goes into automation, we should take a closer look at them.&lt;br /&gt;
&lt;br /&gt;
Let’s begin with reproducibility, predictability, and consistency: very similar concerns that reflect automation’s ability to produce the same product with the desired characteristics over and over. For data and information, that means that the same analysis on the same materials will yield the same results, that all the steps are documented, and that the process is under control. The variability that creeps into the execution of a process by people is eliminated. That variability in human labor can result from the quality of training, equipment setup and calibration, readings from analog devices (e.g., meters, pipette meniscus, charts, etc.), and more; there is a long list of potential issues.&lt;br /&gt;
&lt;br /&gt;
Concerns with reproducibility, predictability, and consistency are common to production environments, general lab work, manufacturing, and even food service. There are several pizza restaurants in our area using one of two methods of making the pies. Both start the preparation the same way, spreading dough and adding cheese and toppings, but the differences are in how they are cooked. One method uses standard ovens (e.g., gas, wood, or electric heating); the pizza goes in, the cook watches it, and then removes it when the cooking is completed. This leads to a lot of variability in the product, some a function of the cook’s attention, some depending on requests for over- or under-cooking the crust. Some is based on &amp;quot;have it your way&amp;quot; customization. The second method uses a metal conveyor belt to move the pie through an oven. The oven temperature is set, as is the speed of the belt, and as long as the settings are the same, you get a reproducible, consistent product order after order. It’s a matter of priorities: manual versus automated, consistent product quality versus how the cook feels that day. In the end, reducing variability and being able to demonstrate consistent, accurate results gives people confidence in your product.&lt;br /&gt;
&lt;br /&gt;
Lower costs per test, better control over expenditures, and better workflow planning also benefit from automation. Automated processes are more cost-efficient since sample throughput is higher and labor costs are reduced. The cost per test and the material usage are predictable since variability in the components used in testing is reduced or eliminated, and workflow planning is improved: since the time per test is known, work can be better scheduled. Additionally, process scale-up should be easier if there is high demand for particular procedures. However, there is a lot of work to consider before automation is realizable, and that is where this discussion is headed.&lt;br /&gt;
&lt;br /&gt;
==How does this discussion relate to previous work?==&lt;br /&gt;
This work follows on the heels of two previous works:&lt;br /&gt;
&lt;br /&gt;
* ''[https://www.pda.org/bookstore/product-detail/2684-computerized-systems-in-modern-lab Computerized Systems in the Modern Laboratory: A Practical Guide]'' (2015): This book presents the range of informatics technologies, their relationship to each other, and the role they play in laboratory work. It differentiates a LIMS from an ELN and [[scientific data management system]] (SDMS) for example, contrasting their use and how they would function in different lab working environments. In addition, it covers topics such as support and regulatory issues.&lt;br /&gt;
&lt;br /&gt;
* ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'' (2018): This webinar series complements the above text. It begins by introducing the major topics in informatics (e.g., LIMS, ELN, etc.) and then discusses their use from a strategic viewpoint. Where and how do you start planning? What is your return on investment? What should get implemented first, and then what are my options? The series then moves on to developing an [[information management]] strategy for the lab, taking into account budgets, support, ease of implementation, and the nature of your lab’s work.&lt;br /&gt;
&lt;br /&gt;
The material in this write-up picks up where the webinar series ends. The last session covers lab processes, and this piece picks up that thread and goes into more depth concerning a basic issue: how do you move from manual methods to automated systems?&lt;br /&gt;
&lt;br /&gt;
Productivity has always been an issue in laboratory work. Until the 1950s, a lab had little choice but to add more people if more work needed to be done. Since then, new technologies have afforded wider options, including new instrument technologies. The execution of the work was still done by people, but the tools were better. Now we have other options. We just have to figure out when, if, and how to use them.&lt;br /&gt;
&lt;br /&gt;
===Before we get too far into this...===&lt;br /&gt;
With elements such as productivity, return on investment (ROI), [[data quality]], and [[data integrity]] as driving factors in this work, you shouldn’t be surprised if a lot of the material reads like a discussion of manufacturing methodologies; we’ve already seen some examples. We are talking about scientific work, but the same things that drive the elements noted in labs have very close parallels in product manufacturing. The work we are describing here will be referenced as &amp;quot;scientific manufacturing,&amp;quot; manufacturing or production in support of scientific programs.{{efn|The term &amp;quot;scientific manufacturing&amp;quot; was first mentioned to the author by Mr. Alberto Correia, then of Cambridge Biomedical, Boston, MA.}}&lt;br /&gt;
&lt;br /&gt;
The key points of a productivity conversation in both lab and material production environments are almost exact overlays; the only significant difference is that the results of the efforts are data and information in one case, and a physical item you might sell in the other. Product quality and integrity are valued considerations in both. For scientists, this may require an adjustment to their perspectives when dealing with automation. On the plus side, the lessons learned in product manufacturing can be applied to lab bench work, making the path to implementation a bit easier while providing a framework for understanding what a successful automation effort looks like. People with backgrounds in product manufacturing can be a useful resource in the lab, with a bit of an adjustment in perspective on their part.&lt;br /&gt;
&lt;br /&gt;
==Transitioning from typical lab operations to automated systems==&lt;br /&gt;
Transitioning a lab from its current state of operations to one that incorporates automation can raise a number of questions, and people’s anxiety levels. There are several questions that should be considered to set expectations for automated systems and how they will impact jobs and the introduction of new technologies. They include:&lt;br /&gt;
&lt;br /&gt;
* What will happen to people’s jobs as a result of automation?&lt;br /&gt;
* What is the role of [[artificial intelligence]] (AI) and [[machine learning]] (ML) in automation?&lt;br /&gt;
* Where do we find the resources to carry out automation projects/programs?&lt;br /&gt;
* What equipment would we need for automated processes, and will it be different from what we currently have?&lt;br /&gt;
* What role does a [[laboratory execution system]] (LES) play in laboratory automation?&lt;br /&gt;
* How do we go about planning for automation?&lt;br /&gt;
&lt;br /&gt;
===What will happen to people’s jobs as a result of automation?===&lt;br /&gt;
Stories are appearing in print, online, and in television news reporting about the potential for automation to replace human effort in the labor force. It seems like an all-or-nothing situation: either people will continue working in their occupations, or automation (e.g., mechanical, software, AI, etc.) will replace them. The storyline is that people are expensive and automated work can be less costly in the long run. If commercial manufacturing is a guide, automation is a preferred option from both a productivity and an ROI perspective. In order to make the productivity gains from automation similar to those seen in commercial manufacturing, there are some basic requirements and conditions that have to be met:&lt;br /&gt;
&lt;br /&gt;
* The process has to be well documented and understood, down to the execution of each step without variation, while error detection and recovery have to be designed in.&lt;br /&gt;
* The process has to remain static and be expected to continue over enough execution cycles to make it economically attractive to design, build, and maintain.&lt;br /&gt;
* Automation-compatible equipment has to be available. Custom-built components are going to be expensive and could represent a barrier to successful implementation.&lt;br /&gt;
* There has to be a driving need to justify the cost of automation; economics, the volume of work that has to be addressed, working with hazardous materials, and lack of educated workers are just a few of the factors that would need to be considered.&lt;br /&gt;
&lt;br /&gt;
There are places in laboratory work where production-scale automation has been successfully implemented; life sciences applications for processes based on microplate technologies are one example. When we look at the broad scope of lab work across disciplines, most lab processes don’t lend themselves to that level of automation, at least not yet. We’ll get into this in more detail later. But that brings us back to the starting point: what happens to people's jobs?&lt;br /&gt;
&lt;br /&gt;
In the early stages of manufacturing automation, as well as in fields such as mining where work was labor-intensive and repetitive, people did lose jobs when new methods of production were introduced. That shift from a human workforce to automated task execution is expanding as system designers probe markets from retail to transportation.&amp;lt;ref name=&amp;quot;FreyTheFuture13&amp;quot;&amp;gt;{{cite web |url=https://www.oxfordmartin.ox.ac.uk/downloads/academic/The_Future_of_Employment.pdf |format=PDF |title=The Future of Employment: How Susceptible Are Jobs to Computerisation? |author=Frey, C.B.; Osborne, M.A. |publisher=Oxford Martin School, University of Oxford |date=17 September 2013 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; Lower-skilled occupations gave way first, and we now find ourselves facing automation efforts that are moving up the skills ladder; the most recent example is automated driving, a technology that has yet to be fully embraced but is moving in that direction. The problem that leaves us with is providing displaced workers with a means of employment that gives them at least a living income, and the purpose, dignity, and self-worth that they’d like to have. This is going to require significant education, and people are going to have to come to grips with the realization that education never stops.&lt;br /&gt;
&lt;br /&gt;
Due to the push for increased productivity, lab work has seen some similar developments in automation. The development of automated pipettes, titration stations, auto-injectors, computer-assisted instrumentation, and automation built to support microplate technologies represents just a few places where specific tasks have been addressed. However, these developments haven’t moved people out of the workplace as has happened in manufacturing, mining, etc. In some cases they’ve changed the work, replacing repetitive, time-consuming tasks with equipment that allows lab personnel to take on different tasks. In other cases the technology addresses work that couldn’t be performed in a cost-effective manner with human effort; without automation, that work might just not be feasible due to the volume of work (whose delivery might be limited by the availability of the right people, equipment, and facilities) or the need to work with hazardous materials. Automation may prevent the need for hiring new people while giving those currently working more challenging tasks.&lt;br /&gt;
&lt;br /&gt;
As noted in the previous paragraph, much of the automation in lab work is at the task level: equipment designed to carry out a specific function such as Karl Fischer titrations. Some equipment designed around microplate formats can function both at the task level and as part of a user-integrated robotics system. This gives the planner useful options for introducing automation, making it easier for personnel to get accustomed to automation before moving into scientific manufacturing.&lt;br /&gt;
&lt;br /&gt;
Overall, laboratory people shouldn’t be losing their jobs as a result of lab automation, but they do have to be open to changes in their jobs, and that could require an investment in their education. Take someone whose current job is to carry out a lab procedure, someone who understands all aspects of the work, including troubleshooting equipment, reagents, and any special problems that may crop up. Someone else may have developed the procedure, but that person is the expert in its execution.&lt;br /&gt;
&lt;br /&gt;
First, you need these experts to help plan and test the automated systems if you decide to undertake such a project. They would also be the best people to educate as automated systems managers; they know how the process is supposed to work and should be in a position to detect problems. If the system crashes, you’ll need someone who can cover the work while problems are being addressed. Second, if lab personnel get the idea that they are watching their replacement being installed, they may leave before the automated systems are ready. In the event of a delay, you’ll have a backlog and no one to handle it.&lt;br /&gt;
&lt;br /&gt;
Beyond that, people will be freed from the routine of carrying out processes and be able to take on work that had been put on a back burner. As we move toward automated systems, jobs will expand to accommodate typical lab work as well as the management, planning, maintenance, and evolution of laboratory automation and computing.&lt;br /&gt;
&lt;br /&gt;
Automation in lab work is not an &amp;quot;all or none&amp;quot; situation. Processes can be structured so that the routine work is done by systems, and the analyst can spend time reviewing the results, looking for anomalies and interesting patterns, while being able to make decisions about the need for and nature of follow-on efforts.&lt;br /&gt;
&lt;br /&gt;
===What is the role of AI and ML in automation?===&lt;br /&gt;
When we discuss automation here, we are referencing basic robotics and programming. AI may, and likely will, play a role in the work, but first we have to get the foundations right before we consider the next step; we need to put in the human intelligence first. Part of the issue with AI is that we don’t have a settled definition of what it is.&lt;br /&gt;
&lt;br /&gt;
Science fiction aside, many of today's applications of AI play only a limited role in lab work. Here are some examples:&lt;br /&gt;
&lt;br /&gt;
* Having a system that can bring up all relevant information on a research question—a sort of super Google—or a variation of IBM’s Watson could have significant benefits.&lt;br /&gt;
* Analyzing complex data or large volumes of data could be beneficial, e.g., the analysis of radio astronomy data to find fast radio bursts (FRBs). After discovering 21 FRB signals upon analyzing five hours of data, researchers at the Green Bank Telescope used AI to analyze 400 terabytes of older data and detected another 100.&amp;lt;ref name=&amp;quot;HsuIsIt18&amp;quot;&amp;gt;{{cite web |url=https://www.nbcnews.com/mach/science/it-aliens-scientists-detect-more-mysterious-radio-signals-distant-galaxy-ncna912586 |title=Is it aliens? Scientists detect more mysterious radio signals from distant galaxy |author=Hsu, J. |work=NBC News MACH |date=24 September 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
* &amp;quot;[A] team at Glasgow University has paired a machine-learning system with a robot that can run and analyze its own chemical reaction. The result is a system that can figure out every reaction that's possible from a given set of starting materials.&amp;quot;&amp;lt;ref name=&amp;quot;TimmerAIPlus18&amp;quot;&amp;gt;{{cite web |url=https://arstechnica.com/science/2018/07/ai-plus-a-chemistry-robot-finds-all-the-reactions-that-will-work/5/ |title=AI plus a chemistry robot finds all the reactions that will work |author=Timmer, J. |work=Ars Technica |date=18 July 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
* HelixAI is using Amazon's Alexa as a digital assistant for laboratory work.&amp;lt;ref name=&amp;quot;HelixAIHome&amp;quot;&amp;gt;{{cite web |url=http://www.askhelix.io/ |title=HelixAI - Voice Powered Digital Laboratory Assistants for Scientific Laboratories |publisher=HelixAI |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that the points above are research-based applications, not applications in routine production environments where regulatory issues are important. Research applications might be more forgiving of AI systems because the results are evaluated by human intelligence, and problematic results can be subjected to further verification. Data entry systems such as voice entry, however, have to be carefully tested, and the results of that data entry verified and shown to be correct.&lt;br /&gt;
&lt;br /&gt;
Pharma IQ continues to publish material on advanced topics in laboratory informatics, including articles on how labs are benefiting from new technologies&amp;lt;ref name=&amp;quot;PharmaIQNewsAutom18&amp;quot;&amp;gt;{{cite web |url=https://www.pharma-iq.com/pre-clinical-discovery-and-development/news/automation-iot-and-the-future-of-smarter-research-environments |title=Automation, IoT and the future of smarter research environments |author=PharmaIQ News |work=PharmaIQ |date=20 August 2018 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; and survey reports such as ''AI 2020: The Future of Drug Discovery''. In that report they note&amp;lt;ref name=&amp;quot;PharmaIQTheFuture17&amp;quot;&amp;gt;{{cite web |url=https://www.pharma-iq.com/pre-clinical-discovery-and-development/whitepapers/the-future-of-drug-discovery-ai-2020 |title=The Future of Drug Discovery: AI 2020 |author=PharmaIQ |publisher=PharmaIQ |date=14 November 2017 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;94% of pharma professionals expect that intelligent technologies will have a noticeable impact on the pharmaceutical industry over the next two years.&amp;quot;&lt;br /&gt;
* &amp;quot;Almost one fifth of pharma professionals believe that we are on the cusp of a revolution.&amp;quot;&lt;br /&gt;
* &amp;quot;Intelligent automation and predictive analytics are expected to have the most significant impact on the industry.&amp;quot;&lt;br /&gt;
* &amp;quot;However, a lack of understanding and awareness about the benefits of AI-led technologies remain a hindrance to their implementation.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Note that these are expectations, not a reflection of current reality. That same report makes comments about the impact of AI on headcount disruption, asking, &amp;quot;Do you expect intelligent enterprise technologies{{efn|Intelligent enterprise technologies referenced in the report include robotic process automation, machine learning, artificial intelligence, the internet of things, predictive analysis, and cognitive computing.}} to significantly cut and/or create jobs in pharma through 2020?&amp;quot; Among the responses, 47 percent said they expected those technologies to do both, 40 percent said they will create new job opportunities, and 13 percent said there will be no dramatic change, with zero percent saying they expected solely job losses.&amp;lt;ref name=&amp;quot;PharmaIQTheFuture17&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
While there are high expectations and hopes for results, we need to approach the idea of AI in labs with some caution. We read about examples based on machine learning (ML), for example using computer systems to recognize cats in photos, to recognize faces in a crowd, etc. We don’t know how they accomplish their tasks, and we can’t analyze their algorithms and decision-making. That leaves us with testing as the basis for quality, which at best is an uncertain process with qualified results (it has worked so far). One problem with testing AI systems based on ML is that they are going to continually evolve, so testing may affect the ML processes by introducing a bias. It may also necessitate continued, redundant testing, because something we thought was evaluated was changed by the “experiences” the AI based its learning on. As one example, could the AI modify the science through process changes without our knowing, because it didn’t understand the science or the goals of the work?&lt;br /&gt;
&lt;br /&gt;
AI is a black box with ever-changing contents. That shouldn’t be taken as a condemnation of AI in the lab, but rather as a challenge to human intelligence in evaluating, proving, and applying the technology. That application includes defining the operating boundaries of an AI system. Rather than creating a master AI for a complete process, we may elect to divide the AI’s area of operation into multiple, independent segments, with segment integration occurring in later stages once we are confident in their ability to work and they show clear evidence of systems stability. In all of this we need to remember that our goal is the production of high-quality data and information in a controlled, predictable environment, not gee-whiz technology. One place where AI (or clever programming) could be of use is in better workflow planning, which takes into account current workloads and assignments, factors in the inevitable panic-level testing needs, and, perhaps in a QC/production environment, anticipates changes in analysis requirements based on changes in production operations.&lt;br /&gt;
&lt;br /&gt;
Throughout this section I've treated “AI” as “artificial intelligence,” its common meaning. There may be a better way of looking at it for lab use, as noted in this excerpt from the October 2018 issue of ''Wired''&amp;lt;ref name=&amp;quot;RossettoFight18&amp;quot;&amp;gt;{{cite journal |title=Fight the Dour |journal=Wired |author=Rossetto, L. |issue=October |pages=826–7 |year=2018 |url=https://www.magzter.com/stories/Science/WIRED/Fight-The-Dour}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Augmented intelligence. Not “artificial,” but how Doug Engelbart{{efn|[https://en.wikipedia.org/wiki/Douglas_Engelbart Doug Engelbart] helped found the field of human-computer interaction and is credited with inventing the computer mouse and giving the “Mother of All Demos” in 1968.}} envisioned our relationship with computers: AI doesn’t replace humans. It offers idiot-savant assistants that enable us to become the best humans we can be.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Augmented intelligence (AuI) is a better term for what we might experience in lab work, at least in the near future. It suggests something that is both more realistic and attainable, with the synergism that would make it, and automation, attractive to lab management and personnel: a tool they can work with to improve lab operations, without the specter of something going on that they don’t understand or control. OPUS/SEARCH from Bruker might be just such an entry in this category.&amp;lt;ref name=&amp;quot;BrukerOPUS&amp;quot;&amp;gt;{{cite web |url=https://www.bruker.com/en/products-and-solutions/infrared-and-raman/opus-spectroscopy-software/search-identify.html |title=OPUS Package: SEARCH &amp;amp; IDENT |publisher=Bruker Corporation |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; AuI may serve as a first-pass filter for large data sets, as in the radio astronomy and chemistry examples noted earlier, reducing those sets of data and information to smaller collections that human intelligence can and should evaluate. However, that does put a burden on the AuI to avoid excessive false positives or negatives, something that can be adjusted over time.&lt;br /&gt;
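The first-pass filter idea can be illustrated with a simple score cutoff. Everything below (the items, scores, and the 0.7 and 0.6 cutoffs) is invented for the sketch; it only shows how adjusting a threshold trades review workload against false negatives.&lt;br /&gt;

```python
# Illustrative only: an "AuI" first-pass filter that scores items and passes
# only those above a cutoff to a human reviewer. Data and scores are invented.
candidates = [
    ("obs-1", 0.91, True),   # (id, model score, truly interesting?)
    ("obs-2", 0.40, False),
    ("obs-3", 0.75, True),
    ("obs-4", 0.55, False),
    ("obs-5", 0.62, True),
]

def first_pass(items, cutoff):
    """Return the ids the filter flags for human review."""
    return [cid for cid, score, _ in items if score >= cutoff]

def false_negatives(items, cutoff):
    """Truly interesting items the filter would discard."""
    return [cid for cid, score, truth in items if truth and score < cutoff]

# A strict cutoff shrinks the review load but misses obs-5;
# lowering the cutoff recovers the miss at the cost of more items to review.
print(first_pass(candidates, 0.7), false_negatives(candidates, 0.7))
print(first_pass(candidates, 0.6), false_negatives(candidates, 0.6))
```

In practice the "score" would come from whatever model the lab trusts; the point is that the cutoff is an operational parameter the lab can tune over time, as suggested above.&lt;br /&gt;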
&lt;br /&gt;
Beyond that there is the possibility of more cooperative work between people and AuI systems. An article in ''Scientific American'' titled “My Boss the Robot”&amp;lt;ref name=&amp;quot;BourneMyBoss13&amp;quot;&amp;gt;{{cite journal |title=My Boss the Robot |journal=Scientific American |author=Bourne, D. |volume=308 |issue=5 |pages=38–41 |year=2013 |doi=10.1038/scientificamerican0513-38 |pmid=23627215}}&amp;lt;/ref&amp;gt; describes the advantage of a human-robot team, with the robot doing the heavy work and the human, under the robot’s guidance, doing the work they were more adept at, versus a team of experts with the same task. The task, welding a Humvee frame, was completed by the human-machine pair in 10 hours at a cost of $1,150; the team of experts took 89 hours at a cost of $7,075. That might translate into laboratory terms by having a robot do routine, highly repetitive tasks while the analyst oversees the operation and does higher-level analysis of the results.&lt;br /&gt;
&lt;br /&gt;
Certainly, AI/AuI is going to change over time as programming and software technology becomes more sophisticated and capable; today’s example of AuI might be seen as tomorrow’s clever software. However, a lot depends on the experience of the user.&lt;br /&gt;
&lt;br /&gt;
There is an important question to ask about laboratory technology development, and AI in particular: will the direction of development be the result of someone’s innovation that people look at and embrace, or will it be the result of a deliberate choice by lab people saying “this is where we need to go; build systems that will get us there”? The difference is important, and lab managers and personnel need to be in control of the planning and implementation of systems.&lt;br /&gt;
&lt;br /&gt;
===Where do we find the resources to carry out automation projects/programs?===&lt;br /&gt;
Given the potential scope of work, you may need people with skills in programming, robotics, instrumentation, and possibly mechanical or electrical engineering if off-the-shelf components aren’t available. The biggest need is for people who can do the planning and optimization that is needed as you move from manual to semi- or fully-automated systems, particularly specialists in process engineering who can organize and plan the work, including the process controls and provision for statistical process control.&lt;br /&gt;
&lt;br /&gt;
We need to develop people who are well versed in laboratory work and the technologies that can be applied to that work, as assets in laboratory automation development and planning. In the past, this role has been filled by lab personnel having an interest in the subject, IT people willing to extend their responsibilities, and/or outside consultants. A 2017 report by Salesforce Research states &amp;quot;77% of IT leaders believe IT functions as an extension/partner of business units rather than as a separate function.&amp;quot;&amp;lt;ref name=&amp;quot;SalesForceSecondAnn17&amp;quot;&amp;gt;{{cite web |url=https://a.sfdcstatic.com/content/dam/www/ocms/assets/pdf/misc/2017-state-of-it-report-salesforce.pdf |format=PDF |title=Second Annual State of IT |author=SalesForce Research |publisher=SalesForce |date=2017 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; The report makes no mention of laboratory work or manufacturing aside from those being functions within the businesses surveyed. Unless a particular effort is made, IT personnel rarely have the backgrounds needed to meet the needs of lab work. In many cases, they will try to fit lab needs into software they are already familiar with, rather than extend their backgrounds into new computational environments. Office and pure database applications are easily handled, but when we get to the lab bench, it's another matter entirely.&lt;br /&gt;
&lt;br /&gt;
The field is getting complex enough that we need people whose responsibilities span both science and technology. This subject is discussed in the webinar series ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'', Part 5 &amp;quot;Supporting Laboratory Systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===What equipment would we need for automated processes, and will it be different from what we currently have?===&lt;br /&gt;
This is an interesting issue, and it directly addresses the commitment labs have to automation, particularly robotics. In the early days of lab automation, when Zymark (Zymate and Benchmate), [[Vendor:PerkinElmer Inc.|Perkin Elmer]], and Hewlett Packard (ORCA) were the major players in the market, the robot had to adapt to equipment that was designed for human use: standard laboratory equipment. They did that through special modifications and the use of different grippers to handle test tubes, beakers, and flasks. While some companies wanted to test the use of robotics in the lab, they didn’t want to invest in equipment that could only be used with robots; they wanted lab workers to be able to pick up where the robots left off in case the robots didn’t work.&lt;br /&gt;
&lt;br /&gt;
Since then, equipment has evolved to support automation more directly. In some cases a device (e.g., a balance, pH meter, etc.) has front-panel controls for a human operator and rear connectors for computer communications. Liquid handling systems have seen the most advancement through the adoption of microplate formats and equipment designed to work with them. The key point, however, is standardization of the sample containers. Vials and microplates lend themselves to a variety of automation devices, from sample processing to auto-injectors/samplers. The issue is getting the samples into those formats.&lt;br /&gt;
&lt;br /&gt;
One point that labs in any scientific discipline have to come to grips with is the commitment to automation. That commitment isn’t going to be made on a lab-wide basis, but on a procedure-by-procedure basis. Full automation may not be appropriate for all lab work, whereas partial automation may be a better choice, and in some cases no automation may be required (we’ll get into that later). The point that needs to be addressed is the choice of equipment. In most cases, equipment is designed for use by people, with options for automation and electronic communications. However, if you want to maximize throughput, you may have to follow examples from manufacturing and commit to equipment that is only used by automation. That will mean a redesign of the equipment, a shared risk for both the vendors and the users. The upside is that equipment can be specifically designed for a task, be more efficient, have the links needed for integration, use less material, and likely take up less space. One example is the microplate, which allows for tens, hundreds, or thousands (depending on the plate used) of sample cells in a small space. What used to take many cubic feet of space as test tubes (the precursor to microplates) is now a couple of cubic inches, using much less material and working space. Note, however, that while microplates are used by lab personnel, their use in automated systems provides greater efficiency and productivity.&lt;br /&gt;
&lt;br /&gt;
The idea of equipment used only in an automated process isn’t new. The development and commercialization of segmented flow analyzers—initially by Technicon in the form of the AutoAnalyzer for general use, and the SMA (Sequential Multiple Analyzer) and SMAC (Sequential Multiple Analyzer with Computer) for clinical markets—improved a lab's ability to process samples. These systems were phased out in favor of new equipment that consumed less material. Products like these are provided by Seal Analytical&amp;lt;ref name=&amp;quot;SealAnal&amp;quot;&amp;gt;{{cite web |url=https://seal-analytical.com/Products/tabid/55/language/en-US/Default.aspx |title=Seal Analytical - Products |publisher=Seal Analytical |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt; for environmental work and by Bran+Luebbe (a division of SPX Process Equipment in Germany).&amp;lt;ref name=&amp;quot;BranLuebbe&amp;quot;&amp;gt;{{cite web |url=https://www.spxflow.com/bran-luebbe/ |title=Bran+Luebbe |publisher=SPX FLOW, Inc |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The issue in committing to automated equipment is that vendors and users will have to agree on equipment specifications and use them within procedures. One place this has been done successfully is in clinical chemistry, where equipment is marketed as validated for particular procedures. What other industry workflows could benefit? Do the vendors lead, or do the users drive the issue? Vendors need to be convinced that there is a viable market for a product before making an investment, and users need to be equally convinced that they will succeed in applying those products. In short, procedures that are important to a particular industry have to be identified, and users and vendors have to come together to develop automated procedure and equipment specifications for products.&lt;br /&gt;
&lt;br /&gt;
===What role does a LES play in laboratory automation?===&lt;br /&gt;
Before ELNs settled into their current role in laboratory work, the initial implementations differed considerably from what we have now. LabTech Notebook was released in 1986 (discontinued in 2004) to provide communications between computers and devices that used RS-232 serial communications. In the early 2000s SmartLab from Velquest was the first commercial product to carry the &amp;quot;electronic laboratory notebook&amp;quot; identifier. That product became a stand-alone entry in the laboratory execution system (LES) market; since its release, the same conceptual functionality has been incorporated into LIMS and ELNs that fit the more current expectation for an ELN.&lt;br /&gt;
&lt;br /&gt;
At its core, an LES provides scripted test procedures that an analyst follows to carry out a laboratory method, essentially functioning as the programmed execution of a lab process. Each step in a process is described, followed exactly, and provision is made within the script for data collection. In addition, the LES can/will (depending on the implementation; &amp;quot;can&amp;quot; in the case of SmartLab) check that the analyst is qualified to carry out the work and that the equipment and reagents are current, calibrated, and suitable for use. The systems can also have access to help files that an analyst can reference if there are questions about how to carry out a step or resolve issues. Beyond that, the software has the ability to work with lab instruments and automatically acquire data, either through direct interfaces (e.g., balances, pH meters, etc.) or through parsing PDF files of instrument reports.&lt;br /&gt;
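As a minimal sketch of that scripted-execution idea (this is not SmartLab or any actual product; the step list, instrument records, analyst roster, and function names are all invented for illustration), the qualification and calibration checks plus the step-by-step audit log might look like:&lt;br /&gt;

```python
from datetime import date, datetime

# Hypothetical scripted method: each step is described, requires an
# instrument, and is logged with a timestamp as it completes.
PROCEDURE = [
    {"step": "Verify balance calibration", "requires": "balance"},
    {"step": "Weigh 2.00 g of sample", "requires": "balance"},
    {"step": "Record pH of solution", "requires": "ph_meter"},
]

INSTRUMENTS = {  # instrument -> calibration expiry date (assumed data)
    "balance": date(2030, 1, 1),
    "ph_meter": date(2030, 6, 1),
}

QUALIFIED_ANALYSTS = {"jdoe"}  # analysts trained on this method


def run_procedure(analyst, today=None):
    """Execute the scripted method, refusing to start if preconditions fail."""
    today = today or date.today()
    if analyst not in QUALIFIED_ANALYSTS:
        raise PermissionError(f"{analyst} is not qualified for this method")
    log = []
    for step in PROCEDURE:
        if INSTRUMENTS[step["requires"]] < today:
            raise RuntimeError(f"{step['requires']} calibration expired")
        # In a real LES the analyst performs the step and enters data here;
        # the system timestamps each completed step for the audit trail.
        log.append((step["step"], datetime.now().isoformat(timespec="seconds")))
    return log


audit_trail = run_procedure("jdoe", today=date(2025, 1, 1))
for step_name, when in audit_trail:
    print(step_name, when)
```

The resulting log is the kind of documented, step-by-step evidence that makes regulatory inspection easier, as discussed below.&lt;br /&gt;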
&lt;br /&gt;
There are two reasons that these systems are attractive. First, they provide for a rigorous execution of a process with each step being logged as it is done. Second, that log provides a regulatory inspector with documented evidence that the work was done properly, making it easier for the lab to meet any regulatory burden.&lt;br /&gt;
&lt;br /&gt;
Since the initial development of SmartLab, that product has changed ownership and is currently in the hands of [[Vendor:Dassault Systèmes SA|Dassault Systèmes]] as part of the BIOVIA product line. As noted above, LIMS and ELN vendors have incorporated similar functionality into their products. Using those features requires “scripting” (in reality, software development), but it does allow access to the database structures within those products. The SmartLab software needed programmed interfaces to other vendors' LIMS and ELNs to gain access to the same information.&lt;br /&gt;
&lt;br /&gt;
====What does this have to do with automation?====&lt;br /&gt;
When we think about automated systems, particularly full automation with robotic support, we are describing a programmed process from start to finish. The samples are introduced at the start, and the process continues until the final data/information is reported and stored. These can be large-scale systems using microplate formats, including tape-based systems from Douglas Scientific&amp;lt;ref name=&amp;quot;DouglasScientificArrayTape&amp;quot;&amp;gt;{{cite web |url=https://www.douglasscientific.com/Products/ArrayTape.aspx |title=Array Tape Advanced Consumable |publisher=Douglas Scientific |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;, programmable autosamplers such as those from [[Vendor:Agilent Technologies, Inc.|Agilent]]&amp;lt;ref name=&amp;quot;Agilent1200Series&amp;quot;&amp;gt;{{cite web |url=https://www.agilent.com/cs/library/usermanuals/Public/G1329-90012_StandPrepSamplers_ebook.pdf |format=PDF |title=Agilent 1200 Series Standard and Preparative Autosamplers - User Manual |publisher=Agilent Technologies |date=November 2008 |accessdate=04 February 2021}}&amp;lt;/ref&amp;gt;, or systems built around robotic arms from a variety of vendors that move samples from one station to another.&lt;br /&gt;
&lt;br /&gt;
Both LES and the automation noted in the previous paragraph have a point in common: there is a strict process that must be followed, with no provision for variation. The difference is that in one case the process is implemented completely through computers and electronic and mechanical equipment, while in the other the process is carried out by lab personnel using computers and electronic and mechanical lab equipment. In essence, people take the place of mechanical robots, which conjures up all kinds of images going back to the 1927 film ''Metropolis''.{{efn|See [[wikipedia:Metropolis (1927 film)|''Metropolis'' (1927 film)]] on Wikipedia.}} Though the LES represents a step toward more sophisticated automation, both methods still require:&lt;br /&gt;
&lt;br /&gt;
* programming, including “scripting” (the LES methods are a script that has to be followed);&lt;br /&gt;
* validated, proven processes; and&lt;br /&gt;
* qualified staff, though the qualifications differ. (In both cases they have to be fully qualified to carry out the process in question. However in the full automation case, they will require more education on running, managing, and troubleshooting the systems.)&lt;br /&gt;
&lt;br /&gt;
In the case of full automation, there has to be sufficient justification for automating the process, including a sufficient sample processing volume. The LES-human implementation can be run for a single sample if needed, and the operating personnel can be trained on multiple procedures, switching tasks as needed. Electro-mechanical automation would require a change in programming, verification that the system is operating properly, and possibly equipment re-configuration. Which method is better for a particular lab depends on trade-offs between sample load, throughput requirements, cost, and flexibility. People are adaptable, easily moving between tasks, whereas equipment has to be adapted to a task.&lt;br /&gt;
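One way to frame that justification, using entirely invented numbers, is a break-even calculation: automation typically trades a higher fixed cost for a lower per-sample cost, so sample volume decides which approach is cheaper. The figures below are illustrative assumptions, not benchmarks.&lt;br /&gt;

```python
# Invented figures for illustration: annual fixed cost plus per-sample cost.
MANUAL_FIXED, MANUAL_PER_SAMPLE = 5_000, 42.0   # labor-dominated costs
AUTO_FIXED, AUTO_PER_SAMPLE = 150_000, 6.0      # equipment-dominated costs

def annual_cost(fixed, per_sample, samples):
    """Total yearly cost for a given sample volume."""
    return fixed + per_sample * samples

def break_even_samples():
    """Sample volume at which automation's annual cost matches manual's."""
    return (AUTO_FIXED - MANUAL_FIXED) / (MANUAL_PER_SAMPLE - AUTO_PER_SAMPLE)

print(f"break-even at about {break_even_samples():.0f} samples/year")
for samples in (1_000, 10_000):
    auto = annual_cost(AUTO_FIXED, AUTO_PER_SAMPLE, samples)
    manual = annual_cost(MANUAL_FIXED, MANUAL_PER_SAMPLE, samples)
    print(f"{samples} samples/year -> "
          f"{'automated' if auto < manual else 'manual'} is cheaper")
```

A real analysis would also weigh throughput, flexibility, and staffing, as noted above; this sketch only captures the cost dimension of the trade-off.&lt;br /&gt;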
&lt;br /&gt;
===How do we go about planning for automation?===&lt;br /&gt;
There are three forms of automation to be considered:&lt;br /&gt;
&lt;br /&gt;
# No automation – Instead, the lab relies on lab personnel to carry out all steps of a procedure.&lt;br /&gt;
# Partial automation – Automated equipment is used to carry out steps in a procedure. Given the current state of laboratory systems, this is the most prevalent form, since most lab equipment has computer components that facilitate its use.&lt;br /&gt;
# Full automation – The entire process is automated. The definition of “entire” is open to each lab's interpretation and may vary from one process to another. For example, some samples may need some handling before they are suitable for use in a procedure. That might be a selection process from a freezer, grinding materials prior to a solvent extraction, and so on, representing cases where the equipment available isn’t suitable for automated equipment interaction. One goal is to minimize this effort, since it can put a limit on the productivity of the entire process. This is also an area where negotiation between the lab and the sample submitter can be useful. Take plastic pellets, for example, which often need to be ground into a coarse powder before they can be analyzed; having the submitter provide them in this form will reduce the time and cost of the analysis. Standardizing the sample container can also facilitate the analysis (having the lab provide the submitter with standard sample vials using barcodes or RFID chips can streamline the process).&lt;br /&gt;
&lt;br /&gt;
One common point that these three forms share is a well-described method (procedure, process) that needs to be addressed. That method should be fully developed, tested, and validated. This is the reference point for evaluating any form of automation (Figure 1).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 1.''' Items to be considered in automating systems&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The documentation for the chosen method should include the bulleted list of items from Figure 1, as they describe the science aspects of the method. The last four points are important. The method should be validated since the manual procedure is a reference point for determining if the automated system is producing useful results. The reproducibility metric offers a means of evaluating at least one expected improvement in an automated system; you’d expect less variability in the results. This requires a set of reference sample materials that can be repeatedly evaluated to compare the manual and automated systems, and to periodically test the methods in use to ensure that there aren’t any trends developing that would compromise the method’s use. Basically, this amounts to statistical quality control on the processes.&lt;br /&gt;
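The reproducibility comparison and statistical quality control described above can be sketched with invented replicate data. The %RSD metric and mean ± 3 standard deviation control limits are standard choices, but the numbers here are purely illustrative, not real method data.&lt;br /&gt;

```python
import statistics

# Hypothetical replicate results (mg/L) for the same reference material,
# measured by the manual method and by the automated system.
manual = [10.2, 9.7, 10.5, 9.9, 10.4, 9.6, 10.3]
automated = [10.1, 10.0, 10.2, 10.1, 9.9, 10.0, 10.1]

def rsd_percent(values):
    """Relative standard deviation (%), a common reproducibility metric."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# You'd expect the automated system to show less variability.
print(f"manual    %RSD: {rsd_percent(manual):.2f}")
print(f"automated %RSD: {rsd_percent(automated):.2f}")

# Simple Shewhart-style control limits from the validated manual method:
# flag any future reference-sample result outside mean +/- 3 std deviations.
mean, sd = statistics.mean(manual), statistics.stdev(manual)
lower, upper = mean - 3 * sd, mean + 3 * sd
new_result = 10.8
in_control = lower <= new_result <= upper
print(f"result {new_result} in control: {in_control}")
```

Tracking reference-sample results against such limits over time is one way to catch the trends mentioned above before they compromise the method's use.&lt;br /&gt;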
&lt;br /&gt;
The next step is to decide what improvements you are looking for in an automated system: increased throughput, lower cost of operation, the ability to off-load human work, reduced variability, etc. In short, what are your goals?&lt;br /&gt;
&lt;br /&gt;
That brings us to the matter of project planning. We’re not going to go into a lot of depth in this piece about project planning, as there are a number of references{{efn|See for example https://www.projectmanager.com/project-planning; the simplest thing to do is put “project planning” in a search engine and browse the results for something interesting.}} on the subject, including material produced by the former Institute for Laboratory Automation.{{efn|See for example https://theinformationdrivenlaboratory.wordpress.com/category/resources/; note that any references to the ILA should be ignored as the original site is gone, with the domain name perhaps having been leased by another organization that has no affiliation with the original Institute for Laboratory Automation.}} There are some aspects of the subject that we do need to touch on, however, and they include:&lt;br /&gt;
&lt;br /&gt;
* justifying the project and setting expectations and goals;&lt;br /&gt;
* analyzing the process;&lt;br /&gt;
* scheduling automation projects; and&lt;br /&gt;
* budgeting.&lt;br /&gt;
&lt;br /&gt;
====Justification, expectations, and goals====&lt;br /&gt;
Basically, why are you doing this, and what do you expect to gain? What arguments will you use to justify the work and expense involved in the project? How will you determine whether the project is successful?&lt;br /&gt;
&lt;br /&gt;
Fundamentally, automation efforts are about productivity and the bulleted items noted in the introduction of this piece, repeated below with additional commentary:&lt;br /&gt;
&lt;br /&gt;
* Lower costs per test, and better control over expenditure: These can result from a reduction in labor and materials costs, including more predictable and consistent reagent usage per test.&lt;br /&gt;
* Stronger basis for better workflow planning: Informatics systems can provide better management over workloads and resource allocation, while key performance indicators can show where bottlenecks are occurring or if samples are taking too long to process. These can be triggers for procedure automation to improve throughput.&lt;br /&gt;
* Reproducibility: The test results from automated procedures can be expected to be more reproducible by eliminating the variability that is typical of steps executed by people. Small variation in dispensing reagents, for example, could be eliminated.&lt;br /&gt;
* Predictability: The time to completion for a given test is more predictable in automated programs; once the process starts, it keeps going without the interruptions that are common in human-centered activities.&lt;br /&gt;
* Tighter adherence to procedures: Automated procedures have no choice but to be consistent in procedure execution; that is what programming and automation are about.&lt;br /&gt;
&lt;br /&gt;
Of these, which are important to your project? If you achieved these goals, what would it mean to your lab's operations and the organization as a whole? This is part of the justification for carrying out the project.&lt;br /&gt;
&lt;br /&gt;
As noted earlier, there are several things to consider in order to justify a project. First, there has to be a growing need that supports a procedure's automation, one that can’t be satisfied by other means, which could include adding people, equipment, and lab space, or outsourcing the work (with the added burden of ensuring data quality and integrity, and integrating that work with the lab’s data and information). Second, the cost of the project must be balanced by its benefits. This includes any savings in cost, people (not reducing headcount, but avoiding new hires), material, and equipment, as well as improved timeliness of results and overall lab operations. Third, when considering project justification, the automated process’s useful lifetime has to be long enough to justify the development work. And finally, the process has to be stable so that you aren’t in a constant re-development situation (this differs from periodic upgrades and performance improvements, EVOP in manufacturing terms). One common point of failure in projects is change in the underlying procedures; if the basic process model changes, you are trying to hit a moving target. That ruins schedules and causes budgets to inflate.&lt;br /&gt;
&lt;br /&gt;
This may seem like a lot to think about for something as simple as moving from manual pipettes to automatic units, but in that case the total effort to do the work will be correspondingly small. However, it is still important, since it impacts data quality and integrity, and your ability to defend your results should they be challenged. And, by the way, the issue of automated pipettes isn’t simple; there is a lot to consider in properly specifying and using these products.{{Efn|As a starting point, view the [https://www.artel.co/ Artel, Inc. site] as one source. Also, John Bradshaw gave an [https://www.artel.co/learning_center/2589/ informative presentation] on “The Importance of Liquid Handling Details and Their Impact on your Assays” at the 2012 European Lab Automation Conference, Hamburg, Germany.}}&lt;br /&gt;
&lt;br /&gt;
====Analyzing the process====&lt;br /&gt;
Assuming that you have a well-described, thoroughly tested, and validated procedure, that process has to be analyzed for optimization and suitability for automation. This is an end-to-end evaluation, not just an examination of isolated steps. This is an important point: looking at a single step without taking into account the rest of the process may improve that portion of the process but have consequences elsewhere.&lt;br /&gt;
&lt;br /&gt;
Take a common example: working in a testing environment where samples are being submitted by outside groups (Figure 2).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 2.''' Lab sample processing, initial data entry through results&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Most LIMS will permit sample submitters (with appropriate permissions) to enter the sample description information directly into the LIMS, reducing some of the clerical burden. Standardizing on sample containers, with barcodes, reduces the effort and cost in some aspects of sample handling. A barcode scanner could be used to scan samples as they arrive into the lab, letting the system know that they are ready to be tested.&lt;br /&gt;
&lt;br /&gt;
That brings us to an evaluation of the process as a whole, as well as an examination of the individual steps in the procedure. As shown in Figure 1, automation can be done in one of two ways: automating the full process or automating individual steps. Your choice depends on several factors, not the least of which is your comfort level and confidence in adopting automation as a strategy for increasing productivity. For some, concentrating on improvements in individual steps is an attractive approach. The cost and risk may be lower, and if a problem occurs you can always fall back to a fully manual implementation until it is resolved.&lt;br /&gt;
&lt;br /&gt;
Care does have to be taken in choosing which steps to improve. From one perspective, you’d want to do the step-wise implementation of automation as close to the end of the process as possible. The problem with doing it earlier is that you may create a backup in later stages of the process. Optimizing step 2, for example, doesn’t do you much good if step 3 is overloaded and requires more people, or additional (possibly unplanned) automation to relieve a bottleneck there. In short, before you automate or improve a given step, you need to be sure that downstream processing can absorb the increase in materials flow. In addition, optimizing all the individual steps, one at a time, doesn’t necessarily add up to a well-designed full-system automation; the transitions between steps may not be as effective or efficient as they would be if the system had been designed as a whole. If the end of the process is carried out by commercial instrumentation, the ability to absorb more work is easier to come by, since most of these systems are automated with computer data acquisition and processing, and many have auto-samplers available to accumulate samples that can be processed automatically. Some of those auto-samplers have built-in robotics for common sample handling functions. If the workload builds, additional instruments can pick up the load, and equipment such as [[Vendor:Baytek International, Inc.|Baytek International’s]] TurboTube&amp;lt;ref name=&amp;quot;BaytekiPRO&amp;quot;&amp;gt;{{cite web |url=https://www.baytekinternational.com/products/ipro-interface/89-products |title=iPRO Interface - Products |publisher=Baytek International, Inc |accessdate=05 February 2021}}&amp;lt;/ref&amp;gt; can accumulate sample vials in a common system and route them to individual instruments for processing.&lt;br /&gt;
&lt;br /&gt;
Another consideration for partial automation is where the process is headed in the future. If the need for the process persists over a long period of time, will you eventually get to the point of needing to redo the automation to an integrated stream? If so, is it better to take the plunge early on instead of continually expending resources to upgrade it?&lt;br /&gt;
&lt;br /&gt;
Other considerations include the ability to re-purpose equipment. If a process isn’t used full-time (a justification for partial automation) the same components may be used in improving other processes. Ideally, if you go the full-process automation route, you’ll have sufficient sample throughput to keep it running for an extended period of time, and not have to start and stop the system as samples accumulate. A smoothly running slower automation process is better than a faster system that lies idle for significant periods of time, particularly since startup and shutdown procedures may diminish the operational cost savings in both equipment use and people’s time.&lt;br /&gt;
&lt;br /&gt;
All these points become part of both the technical justification and budget requirements.&lt;br /&gt;
&lt;br /&gt;
'''Analyzing the process: Simulation and modeling'''&lt;br /&gt;
&lt;br /&gt;
Simulation and modeling have been part of science and engineering for decades, supported by ever more powerful computing hardware and software. Continuous-systems simulations have shown us the details of how machinery works, how chemical reactions occur, and how chromatographic systems and other instrumentation behave.&amp;lt;ref name=&amp;quot;JoyceComputer18&amp;quot;&amp;gt;{{cite journal |title=Computer Modeling and Simulation |journal=Lab Manager |author=Joyce, J. |volume=13 |issue=9 |pages=32–35 |year=2018 |url=https://www.labmanager.com/laboratory-technology/computer-modeling-and-simulation-1826}}&amp;lt;/ref&amp;gt; There is another aspect to modeling and simulation that is appropriate here.&lt;br /&gt;
&lt;br /&gt;
Discrete-event simulation (DES) is used to model and understand processes in business and manufacturing applications, evaluating the interactions between service providers and customers, for example. One application of DES is to determine the best way to distribute incoming customers to a limited number of servers, taking into account that not all customers have the same needs; some will tie up a service provider much longer than others, as represented by the classic bank teller line problem. This form of simulation and modeling is appropriate to event-driven processes where the action occurs in discrete steps (like materials moving from one workstation to another) rather than as a continuous function of time (most naturally occurring systems fall into the latter category, e.g., heat flow and models using differential equations).&lt;br /&gt;
&lt;br /&gt;
The processes in your lab can be described and analyzed via DES systems.&amp;lt;ref name=&amp;quot;CostigliolaSimul17&amp;quot;&amp;gt;{{cite journal |title=Simulation Model of a Quality Control Laboratory in Pharmaceutical Industry |journal=IFAC-PapersOnLine |author=Costigliola, A.; Ataíde, F.A.P.; Vieira, S.M. et al. |volume=50 |issue=1 |pages=9014-9019 |year=2017 |doi=10.1016/j.ifacol.2017.08.1582}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;MengImprov13&amp;quot;&amp;gt;{{cite journal |title=Improving Medical Laboratory Operations via Discrete-event Simulation |journal=Proceedings of the 2013 INFORMS Healthcare Conference |author=Meng, L.; Liu, R.; Essick, C. et al. |year=2013 |url=https://www.researchgate.net/publication/263238201_Improving_Medical_Laboratory_Operations_via_Discrete-event_Simulation}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;JunApplic99&amp;quot;&amp;gt;{{cite journal |title=Application of discrete-event simulation in health care clinics: A survey |journal=Journal of the Operational Research Society |volume=50 |pages=109–23 |year=1999 |doi=10.1057/palgrave.jors.2600669}}&amp;lt;/ref&amp;gt; Those laboratory procedures are a sequence of steps, each having a precursor, variable duration, and following step until the end of the process is reached; this is basically the same as a manufacturing operation where modeling and simulation have been used successfully for decades. DES can be used to evaluate those processes and ask questions that can guide you on the best paths to take in applying automation technologies and solving productivity or throughput problems. For example:&lt;br /&gt;
&lt;br /&gt;
* What happens if we tighten up the variability in a particular step; how will that affect the rest of the system?&lt;br /&gt;
* What happens at the extremes of the variability in process steps; does it create a situation where samples pile up?&lt;br /&gt;
* How much of a workload can the process handle before one step becomes saturated with work and the entire system backs up?&lt;br /&gt;
* Can you introduce an alternate path to process those samples and avoid problems (e.g., if samples are held for too long in one stage, do they deteriorate)?&lt;br /&gt;
* Can the output of several parallel slower procedures be merged into a feed stream for a common instrumental technique?&lt;br /&gt;
&lt;br /&gt;
In complex procedures, some steps may be sensitive to small delays, and DES can help uncover and test for them. Note that setting up these models requires collecting a lot of data about the processes and their timing, so this is not something to be taken on casually.&lt;br /&gt;
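As an illustration of how such a model can be built with nothing more than a general-purpose language, the following sketch simulates samples arriving at a single workstation. All arrival and service figures are hypothetical, and the code is meant only to show the shape of such a simulation, not a validated model of any real process.&lt;br /&gt;

```python
# A minimal single-workstation queue simulation using only the standard
# library. All arrival and service figures are hypothetical, chosen to
# illustrate how a step saturates as service time approaches arrival time.
import random

def simulate_station(arrival_interval, service_time, n_samples=500, seed=7):
    """Return the average wait (minutes) per sample at one workstation.

    Inter-arrival and service times are drawn from exponential
    distributions with the given means.
    """
    rng = random.Random(seed)
    clock = 0.0            # time the current sample arrives
    server_free_at = 0.0   # when the workstation next becomes idle
    total_wait = 0.0
    for _ in range(n_samples):
        clock += rng.expovariate(1.0 / arrival_interval)   # next arrival
        start = max(clock, server_free_at)                 # queue if busy
        total_wait += start - clock
        server_free_at = start + rng.expovariate(1.0 / service_time)
    return total_wait / n_samples

# Comfortable load: a sample every 10 minutes on average, 6 minutes of work.
print(f"light load, avg wait: {simulate_station(10, 6):.1f} min")
# Near saturation: a sample every 10 minutes, 9.5 minutes of work each.
print(f"heavy load, avg wait: {simulate_station(10, 9.5):.1f} min")
```

Raising the service time from 6 to 9.5 minutes pushes utilization toward 95%, and the average wait should grow sharply; this is exactly the kind of saturation question listed above.&lt;br /&gt;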
&lt;br /&gt;
Previous research&amp;lt;ref name=&amp;quot;JoyceComputer18&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;CostigliolaSimul17&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;MengImprov13&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;JunApplic99&amp;quot; /&amp;gt; suggests only a few of the ways simulation can be effective, including one case where an entire lab's operations were evaluated. Models that extensive can be used not only to look at procedures, but also at the introduction of informatics systems. This may appear to be a significant undertaking, and it can be, depending on the complexity of the lab processes. However, simple processes can be initially modeled in spreadsheets to see if a more significant effort is justified. Operations research, of which DES is a part, has been usefully applied in production operations to increase throughput and improve ROI. It might be successfully applied to some routine, production-oriented lab work.&lt;br /&gt;
&lt;br /&gt;
Most lab processes are linear in their execution, one step following another, with the potential for loop-backs should problems be recognized with samples, reagents (e.g., out-of-date or suspect-looking materials that need to be replaced), or equipment (e.g., not functioning properly, out of calibration, or busy with other work). On one level, modeling a manually implemented process should appear simple: each step takes a certain amount of time, and if you add up the times, you have a picture of the process execution through time. However, the reality is quite different once you take into account the problems (and their resolution) that can occur in each of those steps. The data collected to model the procedure can change how that picture looks and your ability to improve it. By monitoring the process over a number of iterations, you can find out how much variation there is in the execution time for each step, and whether that variation follows a normal distribution or is skewed (e.g., if one step is skewed, how does it impact the others?).&lt;br /&gt;
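The widening effect this produces in the total execution time can be illustrated with a small Monte Carlo sketch; the step names and timing figures below are hypothetical placeholders, not measurements from any real lab.&lt;br /&gt;

```python
# A small Monte Carlo sketch of per-step timing variability. The step names
# and (mean, standard deviation) timings below are hypothetical placeholders,
# not measurements from a real procedure.
import random
import statistics

STEPS = {
    "log-in and label": (5.0, 1.0),      # minutes: (mean, std. dev.)
    "sample preparation": (20.0, 4.0),
    "instrument run": (30.0, 2.0),
    "review and report": (10.0, 3.0),
}

def simulate_totals(runs=10000, seed=42):
    """Draw each step's duration from a normal distribution and sum them."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        totals.append(sum(max(0.0, rng.gauss(mean, sd))
                          for mean, sd in STEPS.values()))
    return totals

totals = simulate_totals()
print(f"mean total time:  {statistics.mean(totals):.1f} min")
print(f"spread (std dev): {statistics.stdev(totals):.1f} min")
# The total's spread is wider than any single step's, since the step
# variances add together.
```

Even this toy version shows why measuring per-step variation matters: the spread of the total is driven by the sum of the step variances, not by any one step alone.&lt;br /&gt;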
&lt;br /&gt;
Questions to ask about potential problems that could occur at each step include:&lt;br /&gt;
&lt;br /&gt;
* How often do problems with reagents occur and how much of a delay does that create?&lt;br /&gt;
* Is instrumentation always in calibration (do you know?), are there operational problems with devices and their control systems (what are the ramifications?), are procedures delayed due to equipment being in use by someone else, and how long does it take to make changeovers in operating conditions?&lt;br /&gt;
* What happens to the samples; do they degrade over time? What impact does this have on the accuracy of results and their reproducibility?&lt;br /&gt;
* How often are workflows interrupted by the need to deal with high-priority samples, and what effect does it have on the processing of other samples?&lt;br /&gt;
&lt;br /&gt;
The collection of data alone can suggest useful improvements before automation is even considered, perhaps even negating the need for it. The answer to a lab’s productivity problem might be as simple as adding another instrument if that is the bottleneck. The data might also suggest that an underutilized device could be more productive if sample preparation for different procedures' workflows were organized differently. Underutilization might be a consequence of the amount of time needed to prepare the equipment for service: doing so for one sample might be disproportionately time-consuming (and expensive) and cause other samples to wait until there are enough of them to justify the preparation. It could also suggest that some lab processes should be outsourced to groups that have a more consistent sample flow and turnaround time (TAT) for that technique. Some of these points are illustrated in Figures 3a and 3b below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3a Liscouski ConsidAutoLabProc21.png|574px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;574px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3a.''' Simplified process views versus some modeling considerations. Note that the total procedure execution time is affected by the variability in each step, plus equipment and material availability delays; these can change from one day to the next in manual implementations.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3b Liscouski ConsidAutoLabProc21.png|544px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;544px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3b.''' The execution times of each step include the variable execution times of potential issues that can occur in each stage. Note that because each factor has a different distribution curve, the total execution time has a much wider variability than the individual factors.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
How does the simulation system work? Once you have all the data set up, the simulation runs thousands of times, using random number generators to pick out values for the execution-time variables of each component in each step. For example, if there is a one-in-ten chance that a piece of equipment will be in use when needed, 10% of the runs will reflect that, each one picking a delay time based on the input delay distribution function. With a large number of runs, you can see where delays exist and how they impact the overall process's behavior. You can also adjust the factors (e.g., what happens if equipment delays are cut in half?) and see the effect of doing so. By testing the system, you can make better judgments on how to apply your resources.&lt;br /&gt;
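A minimal sketch of this style of simulation might look like the following; the one-in-ten busy probability and all step timings are assumed values for the example, not real process data.&lt;br /&gt;

```python
# An illustrative Monte Carlo run of the approach described above: each pass
# through the process draws random step times, and a random draw decides
# whether a delay factor (equipment already in use) occurs. The 10% busy
# probability and all timings are assumed values for the example.
import random
import statistics

def simulate(runs=20000, busy_prob=0.10, seed=1):
    """Return simulated total times for a hypothetical two-step process."""
    rng = random.Random(seed)
    times = []
    for _ in range(runs):
        t = rng.gauss(15.0, 2.0)               # step 1: sample preparation
        if rng.random() < busy_prob:           # chance the instrument is busy
            t += rng.uniform(5.0, 30.0)        # wait for it to free up
        t += rng.gauss(25.0, 3.0)              # step 2: instrument run
        times.append(max(0.0, t))
    return times

baseline = simulate()
improved = simulate(busy_prob=0.05)   # "what if contention were cut in half?"
print(f"baseline mean: {statistics.mean(baseline):.1f} min")
print(f"improved mean: {statistics.mean(improved):.1f} min")
```

Re-running with an adjusted factor, as in the `busy_prob=0.05` case, is the "what if equipment delays are cut in half" experiment: the model answers it without touching the real process.&lt;br /&gt;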
&lt;br /&gt;
Some of the issues that surface may be things that lab personnel know about and just deal with. It isn’t until the problems are examined that their impact on operations is fully realized and addressed. Modeling and simulation may appear to be overkill for lab process automation, something reserved for large-scale production projects, but the physical size of the project is not the key factor; it is the complexity of the system that matters, along with the potential for optimization.&lt;br /&gt;
&lt;br /&gt;
One benefit of a well-structured simulation of lab processes is that it would provide a solid basis for making recommendations for project approval and budgeting. The most significant element in modeling and simulation is the initial data collection, asking lab personnel to record the time it takes to carry out steps. This isn’t likely to be popular if they don’t understand why it is being done and what the benefits will be to them and the lab; accurate information is essential. This is another case where “bad data is worse than no data.”&lt;br /&gt;
&lt;br /&gt;
'''Guidelines for process automation'''&lt;br /&gt;
&lt;br /&gt;
There are two types of guidelines that will be of interest to those conducting automation work: those that help you figure out what to do and how to do it, and those that must be met to satisfy regulatory requirements (whether evaluated by internal or external groups or organizations).&lt;br /&gt;
&lt;br /&gt;
The first type is going to depend on the nature of the science and the automation being done to support it. Equipment vendor community support groups can be of assistance, as can professional groups like the Pharmaceutical Research and Manufacturers of America (PhRMA), International Society for Pharmaceutical Engineering (ISPE), and Parenteral Drug Association (PDA) in the pharmaceutical and biotechnology industries, with similar organizations in other industries and other countries. This may seem like a large jump from laboratory work, but it is appropriate when we consider the ramifications of full-process automation. You are essentially developing a manufacturing operation on a lab bench, and the same concerns that apply to large-scale production also apply here; you have to ensure that the process is maintained and in control. The same is true of manual or semi-automated lab work, but it is more critical in fully automated systems because of the potentially high volume of results that can be produced.&lt;br /&gt;
&lt;br /&gt;
The second set is going to consist of regulatory guidelines from groups appropriate to your industry, such as the [[Food and Drug Administration]] (FDA), [[United States Environmental Protection Agency|Environmental Protection Agency]] (EPA), and [[International Organization for Standardization]] (ISO), as well as international guidance such as [[Good Automated Manufacturing Practice|GAMP]] and [[Good Automated Laboratory Practices|GALP]]. The interesting point is that we are looking at a potentially complete automation scheme for a procedure; does that come under manufacturing or laboratory? The likelihood is that laboratory guidelines will apply, since the work is being done within the lab's footprint; however, there are things that can be learned from their manufacturing counterparts that may assist in project management and documentation. One interesting consideration is what happens when fully automated testing, such as on-line analyzers, becomes integrated with both the lab and production or process control data/information streams. Which regulatory guidelines apply? It may come down to who is responsible for managing and supporting those systems.&lt;br /&gt;
&lt;br /&gt;
====Scheduling automation projects====&lt;br /&gt;
There are two parts to the schedule issue: how long it is going to take to complete the project (dependent on the process and people), and when you should start. The second point will be addressed here.&lt;br /&gt;
&lt;br /&gt;
The timing of an automated process coming online is important. If it comes online too soon, there may not be enough work to justify its use, and startup/shutdown procedures may create more work than the system saves. If it comes too late, people will be frustrated with a heavy workload while the system that was supposed to provide relief is under development.&lt;br /&gt;
&lt;br /&gt;
In Figure 4, the blue line represents the growing need for sample/material processing using a given laboratory procedure. Ideally, you’d like the automated version to be available when that blue line crosses the “automation needed on-line” level of processing requirements; this is the point where the current (manual?) implementation can no longer meet the demands of sample throughput requirements.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Liscouski ConsidAutoLabProc21.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 4.''' Timing the development of an automated system&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Those throughput limits are something you are going to have to evaluate and measure on a regular basis and use to make adjustments to the planning process (accelerating or slowing it as appropriate). How fast is the demand growing and at what point will your current methods be overwhelmed? Hiring more people is one option, but then the lab's operating expenses increase due to the cost of people, equipment, and lab space.&lt;br /&gt;
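As a toy illustration of that evaluation, the following sketch projects when linearly growing demand would cross a throughput limit; the linear-growth assumption and all sample counts are hypothetical figures chosen for the example.&lt;br /&gt;

```python
# A toy projection of when growing demand crosses the current throughput
# limit (the crossing point in Figure 4). The linear-growth assumption and
# all sample counts are hypothetical.
def months_until_capacity(current_load, growth_per_month, capacity):
    """Months until monthly demand reaches capacity under linear growth."""
    if current_load >= capacity:
        return 0  # already at or past the limit
    months = 0
    load = current_load
    while load < capacity:
        load += growth_per_month
        months += 1
    return months

# Assumed figures: 300 samples/month today, growing by 25 samples/month,
# with the manual process topping out at 450 samples/month.
lead_time = months_until_capacity(300, 25, 450)
print(f"Demand reaches capacity in roughly {lead_time} months")
```

The point is not the arithmetic but the comparison: if the projected lead time is shorter than the expected development time for the automated system, the project is already behind schedule.&lt;br /&gt;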
&lt;br /&gt;
Once we have an idea of when something has to be working, we can begin the process of planning. Note that the planning can begin at any point; it would be good to get the preliminaries done as soon as a manual process is finalized so that you have an idea of what you’ll be getting into. Those preliminaries include looking at equipment that might be used (keeping track of its development), training requirements, developer resources, and implementation strategies, all of which would be updated as new information becomes available. The “we’ll-get-to-it-when-we-need-it” approach is just going to create a lot of stress and frustration.&lt;br /&gt;
&lt;br /&gt;
You need to put together a first-pass project plan so that you can detail what you know, and more importantly what you don’t know. The goal is to have enough information, updated as noted above, so that you can determine if an automated solution is feasible, make an informed initial choice between full and partial automation, and have a timeline for implementation. Any time estimate is going to be subject to change as you gather information and refine your implementation approach. The point of the timeline is to figure out how long the yellow box in Figure 4 is because that is going to tell you how much time you have to get the plan together and working; it is a matter of setting priorities and recognizing what they are. The time between now and the start of the yellow box is what you have to work with for planning and evaluating plans, and any decisions that are needed before you begin, including corporate project management requirements and approvals.&lt;br /&gt;
&lt;br /&gt;
Those plans have to include time for validation and the evaluation of the new implementation against the standard implementation. Does it work? Do we know how to use and maintain it? And are people educated in its use? Is there documentation for the project?&lt;br /&gt;
&lt;br /&gt;
====Budgeting====&lt;br /&gt;
At some point, all the material above and following this section comes down to budgeting: how much will it cost to implement a program and is it worth it? Of the two points, the latter is the one that is most important. How do you go about that? (Note: Some of this material is also covered in the webinar series ''[[LII:A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work|A Guide for Management: Successfully Applying Laboratory Systems to Your Organization's Work]]'' in the section on ROI.)&lt;br /&gt;
&lt;br /&gt;
What a lot of this comes down to is explaining and justifying the choices you’ve made in your project proposal. We’re not going to go into a lot of depth, but just note some of the key issues:&lt;br /&gt;
&lt;br /&gt;
* Did you choose full or partial automation for your process?&lt;br /&gt;
* What drove that choice? If, in your view, partial automation would be less expensive than full automation of a process, how long will it be until the next stage needs to be automated?&lt;br /&gt;
* How independent are the potential, sequential implementation efforts that may be undertaken in the future? Will there be a need to connect them, and if so, how will the incremental costs compare to just doing it once and getting it over with?&lt;br /&gt;
&lt;br /&gt;
There is a tendency in lab work to treat problems and the products that might be used to address them in isolation. You see the need for a LIMS or ELN, or an instrument data system, and the focus is on those issues. Effective decisions have to consider both the immediate and longer-term aspects of a problem. If you want to gain access to a LIMS, have you considered how it will affect other aspects of lab work, such as connecting instruments to it?&lt;br /&gt;
&lt;br /&gt;
The same holds true for partial automation as a solution to a lab process productivity problem. While you are addressing a particular step, should you also be looking at the potential for synergy by addressing other concerns? Modeling and simulation of processes can help resolve that question.&lt;br /&gt;
&lt;br /&gt;
Have you factored in the cost of support and education? The support issue needs to address the needs of lab personnel in managing the equipment and the options for vendor support, as well as the impact on IT groups. Note that the IT group will require access to vendor support, as well as education on its role in any project work.&lt;br /&gt;
&lt;br /&gt;
What happens if you don’t automate? One way to justify the cost of a project is to help people understand what the lab’s operations will be like without it. Will more people, equipment, space, or added shifts be needed? At what cost? What would the impact be on those who need the results and how would it affect their programs?&lt;br /&gt;
&lt;br /&gt;
==Build, buy, or cooperate?==&lt;br /&gt;
In this write-up and some of the referenced materials, we’ve noted several times the benefits that clinical labs have gained through automation, although crediting it all to automation alone isn’t fair. What the clinical laboratory industry did was recognize that automation was needed to solve problems with the operational costs of running labs, and that labs could benefit further by coming together and cooperatively addressing lab operational problems.&lt;br /&gt;
&lt;br /&gt;
It’s that latter point that made the difference and resulted in standardized communications, and purpose-built commercial equipment that could be used to implement automation in their labs. They also had common sample types, common procedures, and data processing. That same commonality applies to segments of industrial and academic lab work. Take life sciences as an example. Where possible, that industry has standardized on micro-plates for sample processing. The result is a wide selection of instruments and robotics built around that sample-holding format that greatly improves lab economics and throughput. While it isn’t the answer to everything, it’s a good answer to a lot of things.&lt;br /&gt;
&lt;br /&gt;
If your industry segment came together and recognized that you used common procedures, how would you benefit by creating a common approach to automation instead of each lab going it alone? It would open the development of common products or product variations from vendors and relieve each lab of developing its own solution. The result could be more effective and more easily supportable systems.&lt;br /&gt;
&lt;br /&gt;
==Project planning==&lt;br /&gt;
Once you’ve decided on the project you are going to undertake, the next stage is looking at the steps needed to manage your project (Figure 5).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Liscouski ConsidAutoLabProc21.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 5.''' Steps in a laboratory automation project. This diagram is modeled after the GAMP V for systems validation.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The planning begins with the method description from Figure 1, which describes the science behind the project and the specification of how the automation is expected to be put into effect: as full-process automation, a specific step, or steps in the process. The provider of those documents is considered the “customer” and is consistent with GAMP V nomenclature (Figure 6); that consistency is important due to the need for system-wide validation protocols.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Liscouski ConsidAutoLabProc21.png|749px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;749px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 6.''' GAMP V model for showing customer and supplier roles in specifying and evaluating project components for computer hardware and software.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
From there the “supplier” (e.g., internal development group, consultant, or IT services) responds with a functional specification that is reviewed by the customer. The “analysis, prototyping, and evaluation” step, represented in the third box of Figure 5, is not the same as the process analysis noted earlier in this piece. The earlier analysis was meant to help you determine what work needed to be done and document it in the user requirements specification. The analysis and associated tasks here are specific to the implementation of this project. The colored arrows refer to the diagram in Figure 7. That process defines the equipment needed, dependencies, and options/technologies for automation implementations, including robotics, instrument design requirements, pre-built automation (e.g., titrators), and any custom components. The documentation and specifications are part of the validation protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Liscouski ConsidAutoLabProc21.png|650px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;650px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 7.''' Defining dependencies and qualification of equipment&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The prototyping function is an important part of the overall process. It is rare that someone will look at a project and come up with a working solution on the first pass. Tinkering and modifications always occur as you move from a blank slate to a working system. You make notes along the way about what should be done differently in the final product, and about places where improvements or adjustments are needed. These all become part of the input to the system design specification that will be reviewed and approved by the customer and supplier. The prototype can be considered a proof of concept or a demonstration of what will occur in the finished product. Remember also that prototypes would not have to be validated, since they wouldn’t be used in a production environment; they are simply a test bed used prior to the development of a production system.&lt;br /&gt;
&lt;br /&gt;
The component design specifications are the refined requirement for elements that will be used in the final design. Those refinements could point to updated models of components or equipment used, modifications needed, or recommendations for products with capabilities other than those used in the prototype.&lt;br /&gt;
&lt;br /&gt;
The boxes on the left side of Figure 5 are documents that go into increasing depth as the system is designed and specified. The details in those items will vary with the extent of the project. The right side of the diagram is a series of increasingly sophisticated testing and evaluation steps against the corresponding items on the left side, culminating in the final demonstration that the system works, has been validated, and is accepted by the customer. It also means that lab and support personnel are educated in their roles.&lt;br /&gt;
&lt;br /&gt;
==Conclusions (so far)==&lt;br /&gt;
“Laboratory automation” has to give way to “laboratory automation engineering.” From the initial need to the completion of the validation process, we have to plan, design, and implement successful systems on a routine basis. Just as the manufacturing industries transitioned from cottage industries to production lines and then to integrated production-information systems, the execution of laboratory science has to tread a similar path if the demands for laboratory results are going to be met in a financially responsible manner. The science is fundamental; however, we need to pay attention now to efficient execution.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
'''AI''': Artificial intelligence&lt;br /&gt;
&lt;br /&gt;
'''AuI''': Augmented intelligence&lt;br /&gt;
&lt;br /&gt;
'''DES''': Discrete-events simulation&lt;br /&gt;
&lt;br /&gt;
'''ELN''': Electronic laboratory notebook&lt;br /&gt;
&lt;br /&gt;
'''EPA''': Environmental Protection Agency&lt;br /&gt;
&lt;br /&gt;
'''FDA''': Food and Drug Administration&lt;br /&gt;
&lt;br /&gt;
'''FRB''': Fast radio burst&lt;br /&gt;
&lt;br /&gt;
'''GALP''': Good automated laboratory practices&lt;br /&gt;
&lt;br /&gt;
'''GAMP''': Good automated manufacturing practice&lt;br /&gt;
&lt;br /&gt;
'''ISO''': International Organization for Standardization&lt;br /&gt;
&lt;br /&gt;
'''LES''': Laboratory execution system&lt;br /&gt;
&lt;br /&gt;
'''LIMS''': Laboratory information management system&lt;br /&gt;
&lt;br /&gt;
'''ML''': Machine learning&lt;br /&gt;
&lt;br /&gt;
'''ROI''': Return on investment&lt;br /&gt;
&lt;br /&gt;
'''SDMS''': Scientific data management system&lt;br /&gt;
&lt;br /&gt;
'''TAT''': Turn-around time&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==About the author==&lt;br /&gt;
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and quality control environments.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Choosing_and_Implementing_a_Cloud-based_Service_for_Your_Laboratory&amp;diff=64495</id>
		<title>LII:Choosing and Implementing a Cloud-based Service for Your Laboratory</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Choosing_and_Implementing_a_Cloud-based_Service_for_Your_Laboratory&amp;diff=64495"/>
		<updated>2024-06-19T22:59:06Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Cloud computing icon.svg|right|400px]]&lt;br /&gt;
'''Title''': ''Choosing and Implementing a Cloud-based Service for Your Laboratory''&lt;br /&gt;
&lt;br /&gt;
'''Edition''': Second edition&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': August 2023&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This guide examines the state of [[cloud computing]] and the security mechanisms inherent to it, especially in regards to how it relates to today's [[Laboratory|laboratories]]. While cloud computing and cloud-based applications can enhance the activities of many types of labs, a methodical and meticulous approach to [[cybersecurity]] is required to not only get the most out of a cloud solution but also mitigate future data catastrophes. This means understanding [[risk management]], regulatory considerations, deployment approaches, and the potential value of managed security services in the cloud. Additionally, the essential links between laboratory [[quality assurance]], the shared responsibility model, and cybersecurity in the lab are emphasized. Of course, it's also vital to understand what to look for in cloud providers, as well as how to approach finding them. In that regard, this guide adds value by more closely examining major public/hybrid cloud and managed security service providers (Appendix 1 and 2), as well as providing example request for information (RFI) templates for both provider types (Appendix 3). While this guide can prove useful to even non-laboratory organizations looking to dip into cloud services, it focuses heavily on laboratories implementing and updating information systems in the cloud.&lt;br /&gt;
&lt;br /&gt;
The second edition of this guide updates grammar and phrasing, tweaks a variety of historical statistics, tweaks information about container security, updates a few trends in hybrid and multicloud, updates information about cybersecurity insurance for cloud, updates information about the DoD JEDI project and the replacement JWCC project, and adds a subsection to Chapter 1 about edges and edge computing.&lt;br /&gt;
&lt;br /&gt;
The table of contents for ''Choosing and Implementing a Cloud-based Service for Your Laboratory'' is as follows: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
1. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/What is cloud computing?|What is cloud computing?]]&lt;br /&gt;
&lt;br /&gt;
:1.1 History and evolution&lt;br /&gt;
:1.2 Cloud computing services and deployment models&lt;br /&gt;
::1.2.1 Platform-as-a-service vs. serverless computing&lt;br /&gt;
::1.2.2 Hybrid cloud vs. multicloud vs. distributed cloud&lt;br /&gt;
::1.2.3 Edge computing?&lt;br /&gt;
:1.3 The relationship between cloud computing and the open source paradigm&lt;br /&gt;
&lt;br /&gt;
2. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Standards and security in the cloud|Standards and security in the cloud]]&lt;br /&gt;
&lt;br /&gt;
:2.1 Standards and regulations influencing cloud computing&lt;br /&gt;
:2.2 Security in the cloud&lt;br /&gt;
::2.2.1 The shared responsibility model&lt;br /&gt;
::2.2.2 Public cloud&lt;br /&gt;
::2.2.3 Hybrid cloud and multicloud&lt;br /&gt;
::2.2.4 Container security and other concerns&lt;br /&gt;
::2.2.5 Software as a service&lt;br /&gt;
&lt;br /&gt;
3.  [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Organizational cloud computing risk management|Organizational cloud computing risk management]]&lt;br /&gt;
&lt;br /&gt;
:3.1 Five risk categories to consider&lt;br /&gt;
:3.2 Risk management and cybersecurity frameworks&lt;br /&gt;
:3.3 A brief note on cloud-inclusive cybersecurity insurance&lt;br /&gt;
&lt;br /&gt;
4. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Cloud computing in the laboratory|Cloud computing in the laboratory]]&lt;br /&gt;
&lt;br /&gt;
:4.1 Benefits&lt;br /&gt;
:4.2 Regulatory considerations&lt;br /&gt;
:4.3 Deployment approaches&lt;br /&gt;
::4.3.1 Hybrid cloud, multicloud, and the vendor lock-in conundrum&lt;br /&gt;
&lt;br /&gt;
5. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Managed security services and quality assurance|Managed security services and quality assurance]]&lt;br /&gt;
&lt;br /&gt;
:5.1 The provision of managed security services&lt;br /&gt;
::5.1.1 Managed security services in the cloud&lt;br /&gt;
:5.2 Managed security services and the laboratory&lt;br /&gt;
::5.2.1 The quality assurance officer&lt;br /&gt;
::5.2.2 The shared responsibility model in the scope of security management and quality assurance&lt;br /&gt;
:5.3 Choosing a provider for managed security services&lt;br /&gt;
::5.3.1 Using a request for information (RFI) process&lt;br /&gt;
&lt;br /&gt;
6. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Considerations when choosing and implementing a cloud solution|Considerations when choosing and implementing a cloud solution]]&lt;br /&gt;
&lt;br /&gt;
:6.1 What are the various characteristics of an average cloud provider?&lt;br /&gt;
:6.2 What should your lab look for in a cloud provider?&lt;br /&gt;
::6.2.1 Service-level agreements&lt;br /&gt;
:6.3 What questions should you ask yourself?&lt;br /&gt;
:6.4 What questions should be asked of a cloud provider?&lt;br /&gt;
::6.4.1 Using a request for information (RFI) process&lt;br /&gt;
&lt;br /&gt;
7. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Final thoughts and additional resources|Final thoughts and additional resources]]&lt;br /&gt;
&lt;br /&gt;
:7.1 Final thoughts&lt;br /&gt;
:7.2 Key reading and reference material&lt;br /&gt;
:7.3 Associations, organizations, and interest groups&lt;br /&gt;
:7.4 Consultancy and support services&lt;br /&gt;
&lt;br /&gt;
Appendix 1. Top public and hybrid/multicloud services&lt;br /&gt;
&lt;br /&gt;
:[[Alibaba Cloud]]&lt;br /&gt;
:[[Amazon Web Services]]&lt;br /&gt;
:[[Cisco Cloudcenter and UCS Director]]&lt;br /&gt;
:[[Dell Technologies Cloud]]&lt;br /&gt;
:[[DigitalOcean]]&lt;br /&gt;
:[[Google Cloud]]&lt;br /&gt;
:[[HPE GreenLake]]&lt;br /&gt;
:[[IBM Cloud]]&lt;br /&gt;
:[[Linode]]&lt;br /&gt;
:[[Microsoft Azure]]&lt;br /&gt;
:[[Oracle Cloud Infrastructure]]&lt;br /&gt;
:[[OVHcloud]]&lt;br /&gt;
:[[Tencent Cloud]]&lt;br /&gt;
:[[VMware Cloud]]&lt;br /&gt;
&lt;br /&gt;
Appendix 2. Top managed security services&lt;br /&gt;
&lt;br /&gt;
:[[Accenture Security Managed Security]]&lt;br /&gt;
:[[AT&amp;amp;T Cybersecurity]]&lt;br /&gt;
:[[Atos Managed Security Services]]&lt;br /&gt;
:[[BT Cyber Security Platform]]&lt;br /&gt;
:[[Cisco Cloudcenter and UCS Director|Cisco Active Threat Analytics]]&lt;br /&gt;
:[[Cyderes Managed Services]]&lt;br /&gt;
:[[Foresite Managed Cybersecurity]]&lt;br /&gt;
:[[IBM Cloud|IBM Managed Security Services]]&lt;br /&gt;
:[[NTT Managed Security Services]]&lt;br /&gt;
:[[Orange Cyberdefense]]&lt;br /&gt;
:[[Secureworks Managed Security Services]]&lt;br /&gt;
:[[Trustwave Managed Security Services]]&lt;br /&gt;
:[[Verizon Managed Security Services]]&lt;br /&gt;
:[[Wipro Managed Security Services]]&lt;br /&gt;
&lt;br /&gt;
Appendix 3. RFI questions for cloud providers and MSSPs&lt;br /&gt;
&lt;br /&gt;
:[[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/RFI questions for cloud providers|RFI questions for cloud providers]]&lt;br /&gt;
:[[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/RFI questions for MSSPs|RFI questions for MSSPs]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Choosing_and_Implementing_a_Cloud-based_Service_for_Your_Laboratory&amp;diff=64494</id>
		<title>LII:Choosing and Implementing a Cloud-based Service for Your Laboratory</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Choosing_and_Implementing_a_Cloud-based_Service_for_Your_Laboratory&amp;diff=64494"/>
		<updated>2024-06-19T22:58:38Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Prior image deleted on Wikipedia&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Cloud computing.svg|right|400px]]&lt;br /&gt;
'''Title''': ''Choosing and Implementing a Cloud-based Service for Your Laboratory''&lt;br /&gt;
&lt;br /&gt;
'''Edition''': Second edition&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': August 2023&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This guide examines the state of [[cloud computing]] and the security mechanisms inherent to it, especially in regards to how it relates to today's [[Laboratory|laboratories]]. While cloud computing and cloud-based applications can enhance the activities of many types of labs, a methodical and meticulous approach to [[cybersecurity]] is required to not only get the most out of a cloud solution but also mitigate future data catastrophes. This means understanding [[risk management]], regulatory considerations, deployment approaches, and the potential value of managed security services in the cloud. Additionally, the essential links between laboratory [[quality assurance]], the shared responsibility model, and cybersecurity in the lab are emphasized. Of course, it's also vital to understand what to look for in cloud providers, as well as how to approach finding them. In that regard, this guide adds value by more closely examining major public/hybrid cloud and managed security service providers (Appendix 1 and 2), as well as providing example request for information (RFI) templates for both provider types (Appendix 3). While this guide can prove useful to even non-laboratory organizations looking to dip into cloud services, it focuses heavily on laboratories implementing and updating information systems in the cloud.&lt;br /&gt;
&lt;br /&gt;
The second edition of this guide updates grammar and phrasing, tweaks a variety of historical statistics, tweaks information about container security, updates a few trends in hybrid and multicloud, updates information about cybersecurity insurance for cloud, updates information about the DoD JEDI project and the replacement JWCC project, and adds a subsection to Chapter 1 about edges and edge computing.&lt;br /&gt;
&lt;br /&gt;
The table of contents for ''Choosing and Implementing a Cloud-based Service for Your Laboratory'' is as follows: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
1. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/What is cloud computing?|What is cloud computing?]]&lt;br /&gt;
&lt;br /&gt;
:1.1 History and evolution&lt;br /&gt;
:1.2 Cloud computing services and deployment models&lt;br /&gt;
::1.2.1 Platform-as-a-service vs. serverless computing&lt;br /&gt;
::1.2.2 Hybrid cloud vs. multicloud vs. distributed cloud&lt;br /&gt;
::1.2.3 Edge computing?&lt;br /&gt;
:1.3 The relationship between cloud computing and the open source paradigm&lt;br /&gt;
&lt;br /&gt;
2. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Standards and security in the cloud|Standards and security in the cloud]]&lt;br /&gt;
&lt;br /&gt;
:2.1 Standards and regulations influencing cloud computing&lt;br /&gt;
:2.2 Security in the cloud&lt;br /&gt;
::2.2.1 The shared responsibility model&lt;br /&gt;
::2.2.2 Public cloud&lt;br /&gt;
::2.2.3 Hybrid cloud and multicloud&lt;br /&gt;
::2.2.4 Container security and other concerns&lt;br /&gt;
::2.2.5 Software as a service&lt;br /&gt;
&lt;br /&gt;
3.  [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Organizational cloud computing risk management|Organizational cloud computing risk management]]&lt;br /&gt;
&lt;br /&gt;
:3.1 Five risk categories to consider&lt;br /&gt;
:3.2 Risk management and cybersecurity frameworks&lt;br /&gt;
:3.3 A brief note on cloud-inclusive cybersecurity insurance&lt;br /&gt;
&lt;br /&gt;
4. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Cloud computing in the laboratory|Cloud computing in the laboratory]]&lt;br /&gt;
&lt;br /&gt;
:4.1 Benefits&lt;br /&gt;
:4.2 Regulatory considerations&lt;br /&gt;
:4.3 Deployment approaches&lt;br /&gt;
::4.3.1 Hybrid cloud, multicloud, and the vendor lock-in conundrum&lt;br /&gt;
&lt;br /&gt;
5. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Managed security services and quality assurance|Managed security services and quality assurance]]&lt;br /&gt;
&lt;br /&gt;
:5.1 The provision of managed security services&lt;br /&gt;
::5.1.1 Managed security services in the cloud&lt;br /&gt;
:5.2 Managed security services and the laboratory&lt;br /&gt;
::5.2.1 The quality assurance officer&lt;br /&gt;
::5.2.2 The shared responsibility model in the scope of security management and quality assurance&lt;br /&gt;
:5.3 Choosing a provider for managed security services&lt;br /&gt;
::5.3.1 Using a request for information (RFI) process&lt;br /&gt;
&lt;br /&gt;
6. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Considerations when choosing and implementing a cloud solution|Considerations when choosing and implementing a cloud solution]]&lt;br /&gt;
&lt;br /&gt;
:6.1 What are the various characteristics of an average cloud provider?&lt;br /&gt;
:6.2 What should your lab look for in a cloud provider?&lt;br /&gt;
::6.2.1 Service-level agreements&lt;br /&gt;
:6.3 What questions should you ask yourself?&lt;br /&gt;
:6.4 What questions should be asked of a cloud provider?&lt;br /&gt;
::6.4.1 Using a request for information (RFI) process&lt;br /&gt;
&lt;br /&gt;
7. [[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/Final thoughts and additional resources|Final thoughts and additional resources]]&lt;br /&gt;
&lt;br /&gt;
:7.1 Final thoughts&lt;br /&gt;
:7.2 Key reading and reference material&lt;br /&gt;
:7.3 Associations, organizations, and interest groups&lt;br /&gt;
:7.4 Consultancy and support services&lt;br /&gt;
&lt;br /&gt;
Appendix 1. Top public and hybrid/multicloud services&lt;br /&gt;
&lt;br /&gt;
:[[Alibaba Cloud]]&lt;br /&gt;
:[[Amazon Web Services]]&lt;br /&gt;
:[[Cisco Cloudcenter and UCS Director]]&lt;br /&gt;
:[[Dell Technologies Cloud]]&lt;br /&gt;
:[[DigitalOcean]]&lt;br /&gt;
:[[Google Cloud]]&lt;br /&gt;
:[[HPE GreenLake]]&lt;br /&gt;
:[[IBM Cloud]]&lt;br /&gt;
:[[Linode]]&lt;br /&gt;
:[[Microsoft Azure]]&lt;br /&gt;
:[[Oracle Cloud Infrastructure]]&lt;br /&gt;
:[[OVHcloud]]&lt;br /&gt;
:[[Tencent Cloud]]&lt;br /&gt;
:[[VMware Cloud]]&lt;br /&gt;
&lt;br /&gt;
Appendix 2. Top managed security services&lt;br /&gt;
&lt;br /&gt;
:[[Accenture Security Managed Security]]&lt;br /&gt;
:[[AT&amp;amp;T Cybersecurity]]&lt;br /&gt;
:[[Atos Managed Security Services]]&lt;br /&gt;
:[[BT Cyber Security Platform]]&lt;br /&gt;
:[[Cisco Cloudcenter and UCS Director|Cisco Active Threat Analytics]]&lt;br /&gt;
:[[Cyderes Managed Services]]&lt;br /&gt;
:[[Foresite Managed Cybersecurity]]&lt;br /&gt;
:[[IBM Cloud|IBM Managed Security Services]]&lt;br /&gt;
:[[NTT Managed Security Services]]&lt;br /&gt;
:[[Orange Cyberdefense]]&lt;br /&gt;
:[[Secureworks Managed Security Services]]&lt;br /&gt;
:[[Trustwave Managed Security Services]]&lt;br /&gt;
:[[Verizon Managed Security Services]]&lt;br /&gt;
:[[Wipro Managed Security Services]]&lt;br /&gt;
&lt;br /&gt;
Appendix 3. RFI questions for cloud providers and MSSPs&lt;br /&gt;
&lt;br /&gt;
:[[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/RFI questions for cloud providers|RFI questions for cloud providers]]&lt;br /&gt;
:[[LII:Choosing and Implementing a Cloud-based Service for Your Laboratory/RFI questions for MSSPs|RFI questions for MSSPs]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:The_Practical_Guide_to_the_U.S._Physician_Office_Laboratory&amp;diff=64493</id>
		<title>LII:The Practical Guide to the U.S. Physician Office Laboratory</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:The_Practical_Guide_to_the_U.S._Physician_Office_Laboratory&amp;diff=64493"/>
		<updated>2024-06-19T18:55:59Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Header info&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title''': '''The Practical Guide to the U.S. Physician Office Laboratory'''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Rebecca A. Fein, M.S.A.H.I., M.B.A., with editorial modifications by Shawn E. Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': May 2014&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
This guide intends to give the reader practical information on the [[physician office laboratory]] (POL), assisting the reader with the decision-making processes related to becoming affiliated with a POL. &lt;br /&gt;
&lt;br /&gt;
This guide provides a discussion on the history and trends related to the POL market, as well as testing considerations, staffing requirements, regulatory issues, related technology, and economic considerations.&lt;br /&gt;
 &lt;br /&gt;
No recommendations are made, though appropriate best practices are mentioned.&lt;br /&gt;
&lt;br /&gt;
==What is a physician office laboratory?==&lt;br /&gt;
The definition of a [[physician office laboratory]] varies from state to state. Some states define the POL by the actual number of physicians in the practice, and others do not. When setting up a POL, the proprietor should consult their individual state regulatory body to ensure full compliance.&lt;br /&gt;
&lt;br /&gt;
For the purpose of this paper, the definition provided by the State of New York is used, as New York regulations are strict, and using a stringent guideline for the rest of this guide is preferred:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;''In order to qualify as a physician office laboratory (POL), individual health care providers must operate the practice or be part of a legally constituted, independently owned, and managed partnership or group practice. Laboratories that are owned, managed and/or operated by managed care organizations, [[Hospital|hospitals]] or consulting firms do not qualify for the POL exception and should apply for a [[clinical laboratory]] permit through the Clinical Laboratory Evaluation Program.&amp;lt;ref name=&amp;quot;NYSPOL&amp;quot;&amp;gt;{{cite web |url=http://www.wadsworth.org/labcert/polep/ |title=Physician Office Laboratory Evaluation Program (POLEP) |author=New York State Department of Health, Wadsworth Center |date=2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;''&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Based on the New York definition, a POL is thus a [[laboratory]] that provides the physician in-office laboratory testing services, thereby allowing the physician to have quicker access to laboratory results. Depending on the rules and regulations in any given state, the lab can be owned and operated either by a single physician practice or by more than one physician practice. In many states, additional regulations prohibit the acceptance of specimens from outside the clinician's practice.&lt;br /&gt;
&lt;br /&gt;
===Types of POLs and their workflow===&lt;br /&gt;
POLs come in many different types. Nearly any practice can operate a POL. However, the most common types include gastroenterology, family practice, internal medicine, obstetrics and gynecology, and hematology and oncology practices. This is likely due to the need for these specialties to get quick results for treatment plan decisions. &lt;br /&gt;
&lt;br /&gt;
According to United Healthcare's Oxford's In-Office Laboratory Testing and Procedures List, the specialists that use an in-office laboratory include&amp;lt;ref name=&amp;quot;UHOxInOffice&amp;quot;&amp;gt;{{cite web |url=https://www.oxhp.com/secure/policy/oxfords_in_office_laboratory_testing_and_procedures_list.pdf |format=PDF |title=Oxford's in-office laboratory testing and procedures list |author=UnitedHealthcare Oxford |date=01 July 2012 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
* '''Primary care physicians and specialists''': In some parts of the world, this type of doctor is referred to as a general practitioner. This doctor is the first point of contact for a patient and coordinates referrals to other physicians as necessary (dependent upon the insurance plan a patient has). Often these doctors are family medicine or internal medicine specialists by training. In some cases, these doctors are OB/GYNs, nephrologists, allergists, pediatricians, or emergency medicine specialists.&lt;br /&gt;
&lt;br /&gt;
* '''Dermatologists / dermatopathologists''': A dermatologist is a doctor who specializes in skin conditions. Dermatologists can also board certify as dermatopathologists, who are trained in both dermatology and pathology. A dermatopathologist examines tissue samples taken as part of a biopsy, for example.&lt;br /&gt;
&lt;br /&gt;
* '''Rheumatologists''': Rheumatologists deal with issues related to joints, soft tissue, autoimmune diseases, vasculitis, and hereditary connective tissue disorders. &lt;br /&gt;
&lt;br /&gt;
* '''Urologists''': Urologists examine issues related to the male and female urinary tract, as well as the male reproductive tract.&lt;br /&gt;
&lt;br /&gt;
* '''Pediatricians''': Pediatricians are trained to treat medical conditions found in infants, children, and adolescents. Generally, they do not see patients over the age of 18.&lt;br /&gt;
&lt;br /&gt;
* '''Pulmonologists''': Pulmonologists specialize in conditions related to the respiratory tract.&lt;br /&gt;
&lt;br /&gt;
* '''Hematologists / oncologists / pediatric hematologists''': Hematologists specialize in blood-related conditions, oncologists in cancers, and pediatric hematologists in childhood blood disorders.&lt;br /&gt;
&lt;br /&gt;
* '''Obstetricians / gynecologists''': Obstetricians specialize in conditions related to pregnancy. Often they are also gynecologists, specializing in women’s reproductive health conditions.&lt;br /&gt;
&lt;br /&gt;
* '''Reproductive endocrinologists / infertility specialists''': This is a sub-branch of the above specialty, addressing both male and female infertility issues.&lt;br /&gt;
&lt;br /&gt;
Other specialists such as surgeons are not as likely to have a physician office lab. Surgeries often take place in a hospital, and therefore tests would be processed through the hospital lab.&lt;br /&gt;
&amp;lt;br /&amp;gt;&amp;amp;nbsp;&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:'''Common clinical laboratory workflow''':&amp;lt;br /&amp;gt;&lt;br /&gt;
[[File:NormWorkflow.png|700px]]&lt;br /&gt;
&lt;br /&gt;
:'''POL workflow''':&amp;lt;br /&amp;gt;&lt;br /&gt;
[[File:POLWorkflow.png|700px]]&lt;br /&gt;
&lt;br /&gt;
The difference between these two workflows mostly comes down to the time spent transporting the specimen to an outside lab and waiting for processing. The in-office lab saves time in those parts of the process. &lt;br /&gt;
&lt;br /&gt;
===History and market trends of the POL===&lt;br /&gt;
In ancient times, patient diagnosis was predicated on what the physician could observe during an examination of the patient and, in some cases, of samples from the patient.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot;&amp;gt;{{cite journal |url=http://www.academia.dk/Blog/wp-content/uploads/KlinLab-Hist/LabHistory1.pdf |format=PDF |journal=Medical Laboratory Observer |title=A brief history of medical diagnosis and the birth of the clinical laboratory: Part 1—Ancient times through the 19th century |author=Berger, D. |volume=31 |issue=7 |pages=28–30, 32, 34–40 |date=July 1999 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Beginning in the Middle Ages and continuing through the eighteenth century, bedside medicine was the predominant form of practice. Patients were diagnosed at the bedside, similar to what modern clinicians call point-of-care testing (POCT).&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
During the eighteenth and nineteenth centuries, with the rise of hospital medicine, laboratory medicine started to play a bigger role in the diagnosis of patients.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Although instruments such as the stethoscope and thermometer came into use at the end of the nineteenth century, the clinical laboratory did not become a standard part of medical practice until the twentieth century.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot;&amp;gt;{{cite journal |url=http://www.academia.dk/Blog/wp-content/uploads/KlinLab-Hist/LabHistory2.pdf |format=PDF |journal=Medical Laboratory Observer |title=A brief history of medical diagnosis and the birth of the clinical laboratory: Part 2—Laboratory science and professional certification in the 20th century |author=Berger, D. |volume=31 |issue=8 |pages=32–34, 36, 38 |date=August 1999 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Early diagnostic testing====&lt;br /&gt;
Diagnosis of patients via techniques such as the examination of bodily fluids to predict disease can be traced back to Hippocrates in ancient Greece. Hippocrates instituted diagnostic criteria that included listening to the patient’s lungs, examining the patient’s urine, and observing the patient’s skin color.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Blood in urine was linked to kidney failure in 50 AD, followed by another early clinician, Galen, identifying diabetes as “diarrhea of urine” and noting a normal relationship between fluid intake and urinary output in 180 AD.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In 900 AD, Isaac Judaeus established a protocol for using urine in patient diagnoses.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt; By 1300 AD, visual examination of the urine (uroscopy) had become so popular that it was nearly universal in Europe.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The seventeenth century saw many innovations in diagnostic techniques, a result of advances in literature on the structure of the body and the formation of scientific societies.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt; Innovations from this century include the first attempts to use pulse and temperature as indicators of illness; intravenous drug injections; and the identification of the sweet taste of urine in patients with diabetes.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the eighteenth century, Dr. William Hewson began to identify ways to measure coagulants in blood tests, an event that set the stage for modern [[diagnostic laboratory]] practice. During this time period, the ability to use temperature and blood pressure as diagnostic indicators was refined, allowing James Currie to treat his typhoid fever patients by putting them in a cold bath.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other advances out of the eighteenth century include Sir John Floyer’s pulse measuring technique, Tichy’s urine analysis technique, Dobson’s ability to prove the sweetness of both blood and urine in patients with diabetes was caused by sugar, and Home's development of a yeast test for sugar in urine.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nineteenth century is sometimes referred to as the era of public health; during this time, independent laboratories started to develop.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt; In the United States, laboratory medicine was initially viewed with skepticism, even as a force destructive to medical knowledge. As a result, many American physicians went to Europe for training in laboratory techniques.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
As older physicians retired from practice and faculty positions, opposition to laboratory practice faded, making room for bacteriological advances like pasteurization.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt; The nineteenth century also saw aseptic methodologies produce fewer deaths after surgery, resulting in a greater emphasis on hygienic practices. This is also the period when the x-ray and microscopy became more important to the practice of medicine.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Around 1850, the laboratory became popularized, and the first hospital laboratories began to appear. Prior to this, most laboratory tests were performed at the bedside or by the physician in the office laboratory.&amp;lt;ref name=&amp;quot;BergerP1&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Diagnostic testing in the twentieth century====&lt;br /&gt;
Although the nineteenth century was a time of advancement for clinical laboratory practice, the blossoming of the clinical laboratory did not occur until the twentieth century.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt; In the early twentieth century, laboratories began to stratify into the different types of sub-specialties seen in practice today: [[Public health laboratory|public health]], forensic, and clinical.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In 1928, when Alexander Fleming accidentally discovered penicillin, he ushered in the age of antibiotics, which allowed for new treatment options for infections, especially when combined with Domagk’s discovery that sulphonamides had antibacterial properties and did not harm humans.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the twentieth century, laboratory medicine personnel needed to be certified and licensed as the movement to ensure quality in medicine came to the laboratory. Organizations were founded to accommodate this, including the American Society for Clinical Pathology (ASCP), founded in 1922 to offer certifications to pathologists.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By the end of the 1950s, the clinical laboratory had earned the respect of other medical professionals and the public, ensuring professional legitimacy.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt; This was accomplished by the discoveries made in the clinical laboratory, leading to new treatment options for patients, who would have endured difficult illnesses, or died, without such interventions.&amp;lt;ref name=&amp;quot;BergerP2&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The initial creation of [[Centers for Medicare and Medicaid Services|Medicare]] in 1965 was seen as an opportunity for free money by the healthcare industry as a whole. As costs increased, loopholes were found to get more reimbursements; however, these loopholes would be closed, and the resourceful provider would find other ways to continue the practice of charging more to get a higher reimbursement.&amp;lt;ref name=&amp;quot;BergerP3&amp;quot;&amp;gt;{{cite journal |url=http://www.academia.dk/Blog/wp-content/uploads/KlinLab-Hist/LabHistory3.pdf |format=PDF |journal=Medical Laboratory Observer |title=A brief history of medical diagnosis and the birth of the clinical laboratory: Part 3—Medicare, government regulation and competency certification |author=Berger, D. |volume=31 |issue=10 |pages=40–42, 44 |date=October 1999 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The government soon discovered the potential for fraud and abuse inherent in Medicare. The need for regulations in order to prevent such abuses became apparent as early as 1967.&amp;lt;ref name=&amp;quot;BergerP3&amp;quot; /&amp;gt; The [[Clinical Laboratory Improvement Amendments|Clinical Laboratory Improvement Act]] came into effect that year as a tool to regulate laboratories practicing across state lines, and it was amended in 1988 to include nearly all laboratories practicing in the United States.&amp;lt;ref name=&amp;quot;BergerP3&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In 1989, an estimated 98,400 POLs were operating in the United States. Estimates from the time varied from 20,000 to 200,000 due to the lack of a standard definition for a POL and the reliance on physicians to self-report the status of their labs.&amp;lt;ref name=&amp;quot;HHS89&amp;quot;&amp;gt;{{cite journal |url=https://oig.hhs.gov/oei/reports/oai-05-88-00330.pdf |title=Quality assurance in physician office labs |author=Kusserow, R. P. |publisher=U.S. Department of Health and Human Services, Office of Analysis and Inspections |version=OAI-0588-00330 |date=March 1989 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; Some of these issues persist today, as states often have different definitions for a POL.&lt;br /&gt;
&lt;br /&gt;
In 1989, limited regulatory controls existed for POLs, resulting in wide variations in the complexity of testing among POLs.&amp;lt;ref name=&amp;quot;HHS89&amp;quot; /&amp;gt; Kusserow, writing for the [[United States Department of Health and Human Services|HHS]] Office of Analysis and Inspections, noted the following during this time period:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;''Physicians operating office laboratories conduct approximately 25 percent of all laboratory testing in the country. Sixteen States have laws pertaining to them. About $20 billion is spent nationally on laboratory services annually, of which POLs receive $5 billion. Each year Medicare pays POLs over $400 million.&amp;lt;ref name=&amp;quot;HHS89&amp;quot; /&amp;gt;''&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Kusserow also found the average 1985 Medicare Part B payment to POLs was $7 per test, compared to $10 per test for an independent lab and $19 per test for a lab classified as &amp;quot;other&amp;quot;.&amp;lt;ref name=&amp;quot;HHS89&amp;quot; /&amp;gt; The trend is clear: the POL was the cheaper option compared to other types of labs.&lt;br /&gt;
&lt;br /&gt;
Since 1995, with a better understanding and acceptance of laboratory regulations and the list of waived tests growing from 8 to 40&amp;lt;ref name=&amp;quot;Bachman04&amp;quot;&amp;gt;{{cite journal |url=http://laboratory-manager.advanceweb.com/Article/Prosperity-in-the-POL.aspx |format=PDF |journal=ADVANCE for Administrators of the Laboratory |title=Prosperity in the POL |author=Bachman, A. |volume=13 |issue=12 |page=66 |date=2004 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;, the number of POLs in the United States increased to 120,399, or 49% of all U.S. laboratories, as of December 2013.&amp;lt;ref name=&amp;quot;CMS13LabTypes&amp;quot;&amp;gt;{{cite web |url=https://www.cms.gov/Regulations-and-Guidance/Legislation/CLIA/downloads/factype.pdf |format=PDF |title=Laboratories by type of facility |author=Centers for Medicare and Medicaid Services, Division of Laboratory Services |date=December 2013 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; Additionally, 60% of POLs in the United States today run [[Clinical Laboratory Improvement Amendments]] (CLIA) waived tests, and 24% hold provider performed microscopy (PPM) certificates.&amp;lt;ref name=&amp;quot;CMS13Enroll&amp;quot;&amp;gt;{{cite web |url=http://www.cms.gov/Regulations-and-Guidance/Legislation/CLIA/Downloads/statupda.pdf |format=PDF |title=Enrollment, CLIA exempt states, and certification of accreditation by organization |author=Centers for Medicare and Medicaid Services, Division of Laboratory Services |date=December 2013 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
According to Bachman, POL growth is expected to continue due to an aging population of baby boomers with money to finance laboratory testing and an increased interest in and awareness of healthcare topics. Other factors to watch in the future are a softening stance on testing by payors (one example being the addition of an initial physical exam given to new Medicare beneficiaries for preventative care) and additions to the list of CLIA waived tests.&amp;lt;ref name=&amp;quot;Bachman04&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Why do POLs exist?===&lt;br /&gt;
Historically, POLs are a subset of point-of-care testing (POCT), laboratory testing done where the patient is located. These laboratories initially came about because, even as clinical medicine and its techniques grew more sophisticated, physicians still needed to perform tests to diagnose patients.&lt;br /&gt;
&lt;br /&gt;
POLs also exist because the industry was looking for a cheaper way to test. Running a POL was found to be an effective way to provide clinicians with information for treatment plans while saving insurers money.&lt;br /&gt;
&lt;br /&gt;
In some cases, POLs opened to provide additional revenue streams for a physician’s practice. The reasons for the POL's existence are varied, but the central reason is to provide quality diagnoses, treatment, and care to patients in the healthcare system.&lt;br /&gt;
&lt;br /&gt;
===Advantages and disadvantages of running a POL===&lt;br /&gt;
In the early days of the POL, lack of regulation proved to be a disadvantage; however, over time, [[:Category:Regulatory information|regulations]] have caught up for the better. &lt;br /&gt;
&lt;br /&gt;
Advantages include:&lt;br /&gt;
* quicker access to test results for the clinician, leading to more treatment options for the patient; &lt;br /&gt;
* greater efficiency of the clinical workflow;&lt;br /&gt;
* cheaper testing, though subject to individual test and pricing information; and&lt;br /&gt;
* patient comfort and happiness, including time saved by having to go only to one location.&lt;br /&gt;
&lt;br /&gt;
Disadvantages include: &lt;br /&gt;
* the physician office being the only point of access, with some physicians unwilling to release patient information to an outside party (such as a hospital or competing clinician), though this disadvantage may have been mitigated by regulatory changes in April 2014 that allow patients direct access to their laboratory results;&lt;br /&gt;
* patients not feeling comfortable with the physician's office being the central repository of their information, and physicians not seeing the value in having a lab in their practice; and&lt;br /&gt;
* the cost of meeting compliance requirements for local, state, and federal regulations, especially in states with stricter requirements. &lt;br /&gt;
&lt;br /&gt;
These lists are of course limited; one could weigh advantages and disadvantages endlessly if the appropriate time were spent to fully evaluate the endeavor. Part of that evaluation depends on the individual practice in question.&lt;br /&gt;
&lt;br /&gt;
===How the POL integrates with the entire practice===&lt;br /&gt;
POLs can integrate with an entire practice in a variety of ways. First, POLs can store laboratory data in a form more readily exchanged between the laboratory and the patient's broader [[electronic health record]] (EHR). A slight disconnect often exists between [[reference lab]]s and physician offices. By placing a lab in the physician office, tighter integration of patient and testing data is achieved, a benefit for both the patient and the practitioner.&lt;br /&gt;
&lt;br /&gt;
This tighter integration may save patients follow-up visits for diagnoses that can be made in the office laboratory. For example, in diagnosing a urinary tract infection, the physician office can significantly reduce turnaround time by testing the sample in-house rather than sending it to an outside lab.&lt;br /&gt;
&lt;br /&gt;
Additionally, the POL can allow the financial departments of a practice to track costs and revenue by using laboratory data. For example, during flu season the physician can budget more money for gloves if their data indicates they're seeing more patients during this time.&lt;br /&gt;
&lt;br /&gt;
Laboratory data can also assist with trends related to a population. If a POL notices an unexpected trend in disease among the patient population, the lab data can help the entire organization decide how best to address the related issues through community education or some other outreach program.&lt;br /&gt;
&lt;br /&gt;
===POL or reference lab?===&lt;br /&gt;
On average, most POL testing is simple and basic, falling under what are known as Clinical Laboratory Improvement Amendments (CLIA) waived tests. These tests will be discussed further in the next section, but for now, know that they are performed near the patient and are simple to run. As previously discussed, bringing simple laboratory testing to the physician office benefits both the patient and the physician, making the POL more attractive.&lt;br /&gt;
&lt;br /&gt;
For some physicians a POL is best because of their location; they may operate in a rural environment and would not have access to laboratory services if they did not do it themselves. For others, the expense of creating such a lab has swayed them towards using a reference lab, which can perform complex tests and, in many cases, has a staff available 24/7. And while placing a POL in the physician office may integrate patient and lab data better, software-based offerings like Health Gorilla that provide real-time results may be sufficient for physicians that prefer to use a reference lab.&lt;br /&gt;
&lt;br /&gt;
In the end, the decision to set up a POL or use a reference lab is based on a review of the advantages and disadvantages, finding a balance of what is best for both the patients' interest and the practice's long-term stability.&lt;br /&gt;
&lt;br /&gt;
==Testing and associated reporting==&lt;br /&gt;
CLIA lays out seven criteria for determining the complexity of a test, including the origin of the test.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot;&amp;gt;{{cite web |url=http://wwwn.cdc.gov/clia/Resources/TestComplexities.aspx |title=Clinical Laboratory Improvement Amendments (CLIA): Test complexities |author=Centers for Disease Control and Prevention |date=31 May 2013 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; For example, if a new test is developed, or an existing test is modified, and then used at that laboratory, the test is automatically rated as high complexity.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; The complexity of a test determines the requirements the laboratory must meet to maintain regulatory compliance: the more complex the test, the stricter the requirements.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Test complexity has three levels: high, moderate, and waived. Waived tests are simple to perform and have a relatively low risk of an incorrect test result.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; Moderately complex tests include tests like provider performed microscopy (PPM), which requires the use of a microscope during the office visit.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; Providers that want to perform PPM tests must be qualified to do so under CLIA regulations.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
High-complexity tests require the most regulation. These tests are the most complicated and run the highest risk of an inaccurate result, as determined during the FDA pre-market approval process.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; Tests may come from the manufacturer with their complexity level on them, or one can search the FDA database to determine the complexity of the test.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; It is important to understand the complexity level of the testing provided in order to ensure full compliance with CLIA.&lt;br /&gt;
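The determination logic described above can be sketched in a few lines of code. This is only an illustration of the rule that a lab-developed or modified test defaults to high complexity, while other tests take the complexity assigned during FDA review; the test names, the lookup table, and the `categorize` helper are hypothetical, not a real FDA database interface.

```python
# Illustrative sketch of CLIA test complexity determination as described above.
# The FDA_COMPLEXITY table and its entries are hypothetical assumptions; in
# practice one would consult the actual FDA test categorization database.

FDA_COMPLEXITY = {
    # test name -> complexity tier assigned during FDA review (illustrative)
    "urine pregnancy (visual read)": "waived",
    "provider performed microscopy": "moderate",
    "cytogenetics panel": "high",
}

def categorize(test_name, lab_developed_or_modified=False):
    """Return the CLIA complexity tier for a test.

    Per the rule above, a test that a laboratory develops itself or
    modifies is automatically rated high complexity; otherwise the tier
    comes from the FDA categorization done at pre-market review.
    """
    if lab_developed_or_modified:
        return "high"
    return FDA_COMPLEXITY.get(test_name, "unknown (search the FDA database)")

print(categorize("urine pregnancy (visual read)"))       # waived
print(categorize("cytogenetics panel"))                  # high
print(categorize("urine pregnancy (visual read)", True)) # high: modified in-house
```

Note how the final call shows the automatic escalation: even a normally waived test becomes high complexity once the laboratory modifies it.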
&lt;br /&gt;
Commonly performed tests, according to the United Healthcare guide, include&amp;lt;ref name=&amp;quot;UHOxInOffice&amp;quot; /&amp;gt;:&lt;br /&gt;
* urine analysis&lt;br /&gt;
* urine pregnancy&lt;br /&gt;
* blood occult&lt;br /&gt;
* glucose blood&lt;br /&gt;
* pathology consultation during surgery&lt;br /&gt;
* crystal identification by microscope&lt;br /&gt;
* sperm identification and analyses&lt;br /&gt;
* bilirubin total&lt;br /&gt;
* blood gasses&lt;br /&gt;
* complete blood count&lt;br /&gt;
* bone marrow smear&lt;br /&gt;
* blood bank services&lt;br /&gt;
* transfusion medicine&lt;br /&gt;
&lt;br /&gt;
Reimbursement levels for tests are dependent upon the reimbursement guide put out by the individual insurance company. Billing personnel in POL-related offices are ultimately responsible for finding out what the reimbursement is and which tests are permitted for the POL's level of certification.&lt;br /&gt;
&lt;br /&gt;
As noted, the predominant form of testing in the POL is waived complexity testing. See Table 1 in [[#Appendix|the appendix]] for the complete list of waived tests as of May 2014.&lt;br /&gt;
&lt;br /&gt;
===Reporting===&lt;br /&gt;
Just as POLs manage a set of commonly performed tests, a set of corresponding reports provides the results of those tests. The results will pass through a set of validation and quality control checks (discussed in section 8.2) before being fashioned into a final report for the ordering physician. For example, if a complete blood count is ordered by the physician, a corresponding patient report is produced by the laboratory, often through a [[laboratory information system]] (LIS). The patient report contains patient, physician, and sample demographics, as well as the results and whether they are above, below, or within recommended limits. Other types of reports may be generated in the laboratory, including daily summary, test total, and various quality control reports.&lt;br /&gt;
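The flagging step described above (marking each result as above, below, or within recommended limits) can be illustrated with a short sketch. The reference ranges, analyte names, and report fields here are hypothetical assumptions for illustration, not values from any actual LIS.

```python
# Hypothetical sketch of how an LIS might flag results against reference
# ranges when assembling a patient report; ranges and analytes are illustrative.

REFERENCE_RANGES = {
    # analyte -> (low, high) in customary units (illustrative values only)
    "WBC (10^3/uL)": (4.5, 11.0),
    "Hemoglobin (g/dL)": (12.0, 17.5),
}

def flag(analyte, value):
    """Return 'low', 'high', or 'normal' relative to the reference range."""
    low, high = REFERENCE_RANGES[analyte]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"

def build_report(patient, results):
    """Assemble a minimal patient report: demographics plus flagged results."""
    return {
        "patient": patient,
        "results": [
            {"analyte": a, "value": v, "flag": flag(a, v)} for a, v in results
        ],
    }

report = build_report(
    {"name": "Jane Doe", "dob": "1980-01-01"},
    [("WBC (10^3/uL)", 13.2), ("Hemoglobin (g/dL)", 13.9)],
)
for r in report["results"]:
    print(r["analyte"], r["value"], r["flag"])
```

A production LIS would also carry sample demographics, ordering-physician details, and quality control metadata on the report, but the comparison against limits reduces to this kind of range check.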
&lt;br /&gt;
==Staffing and certification requirements for the POL==&lt;br /&gt;
Regulatory requirements regarding laboratory staff vary depending upon the level of testing performed and the state the POL is operating in, and due diligence is required to ensure the POL is meeting those requirements. The previous section noted most POLs perform waived tests, and as such, this section focuses on the requirements for staffing a CLIA waiver-certified laboratory. Typically these labs operate only during office hours, and therefore they do not have 24/7 staffing.&lt;br /&gt;
&lt;br /&gt;
The highest level of laboratory management is typically the person holding the title of laboratory director. Laboratory directors are responsible for the administration and operations of the laboratory, including the hiring of all personnel and ensuring testing procedures are done in the correct manner.&amp;lt;ref name=&amp;quot;AAFPDirDuties&amp;quot;&amp;gt;{{cite web |url=http://www.aafp.org/practice-management/regulatory/clia/pol-director-duties.html |title=Physician office laboratory (POL) director duties |author=American Academy of Family Physicians |date=2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; For the waiver-certified laboratory, anyone may be a laboratory director; however, the Joint Commission recommends laboratory directors meet the same minimum requirements of those testing in moderately complex laboratories.&amp;lt;ref name=&amp;quot;Olea12&amp;quot;&amp;gt;{{cite web |url=http://www.jointcommission.org/assets/1/18/CLIA_required_personnel_qualifications.pdf |title=CLIA required personnel qualifications |author=Olea, S. |date=2012 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; Some states require the laboratory director of a moderately complex laboratory to be state-licensed and also to be a medical doctor or hold a degree in a laboratory science, such as chemistry.&amp;lt;ref name=&amp;quot;AAFPPersReqs&amp;quot;&amp;gt;{{cite web |url=http://www.aafp.org/practice-management/regulatory/clia/personnel.html |title=Personnel requirements |author=American Academy of Family Physicians |date=2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In general, the laboratory staff of a POL may consist of the laboratory director and a mix of laboratory technicians, laboratory technologists, laboratory clinical consultants, and, in some cases, a laboratory manager. The presence of some of these roles varies depending on whether the laboratory is certified for waived, moderately complex, or highly complex testing. Additional educational qualifications and certifications for the laboratory technician, histotechnician, laboratory director, and other staff members may exist based on the lab's complexity level. Directors should check state requirements for the type of lab in question to ensure full compliance.&lt;br /&gt;
&lt;br /&gt;
==Regulatory requirements and considerations==&lt;br /&gt;
When setting up a POL, the proprietor must comply with individual state, local, and federal regulations; this guide is no substitute for professional compliance guidance. Three key regulations were chosen for discussion in this guide: the [[Clinical Laboratory Improvement Amendments]] (CLIA), the [[Health Insurance Portability and Accountability Act]] (HIPAA), and the Patient Protection and Affordable Care Act (PPACA). These were chosen because they are the most common and most important regulations affecting the POL and the physician office.&lt;br /&gt;
&lt;br /&gt;
===CLIA===&lt;br /&gt;
The most impactful regulation for the physician office laboratory at the time of this writing is CLIA. This U.S. federal statute was implemented in 1988 to remove obsolete laboratory requirements and introduce new requirements to improve the quality of the modern clinical laboratory. &lt;br /&gt;
&lt;br /&gt;
As previously noted, most POLs are CLIA waiver labs, and much like the previous section, the discussion here centers mostly on those requirements. Waived tests have a low risk of an incorrect result; this includes the tests the Food and Drug Administration (FDA) has approved for consumers to use in their homes.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; Tests performed under this provision are done at laboratories that have registered as required by CLIA and obtained a certificate of waiver. These labs are not inspected on a routine basis like labs certified to perform moderate- and high-complexity testing. Laboratories that wish to change their status from waived to one of the other statuses must comply with the CLIA requirements for registration, inspection, and proficiency testing as outlined in the law.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt; Waived laboratories, as previously mentioned, are not required to undergo proficiency testing, and anyone can qualify to be the laboratory director.&amp;lt;ref name=&amp;quot;CDCTestCom&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===HIPAA===&lt;br /&gt;
POLs are required to comply with HIPAA and must provide safeguards for the security and privacy of the data collected and maintained in the laboratory. HIPAA was passed in 1996 in an attempt to provide better guidance regarding the privacy and security of data, portability of health insurance, and better accountability for violations related to these topics. Laboratories are required to implement measures that prevent unauthorized disclosure of and access to data in the laboratory. &lt;br /&gt;
&lt;br /&gt;
Prior to 2014, most laboratories were exempt from the HIPAA requirement to provide patients with lab results or other protected health information.&amp;lt;ref name=&amp;quot;HHS14&amp;quot;&amp;gt;{{cite web |url=http://www.hhs.gov/news/press/2014pres/02/20140203a.html |title=HHS strengthens patients' right to access lab test reports |author=U.S. Department of Health and Human Services |date=03 February 2014}}&amp;lt;/ref&amp;gt; However, in February 2014, the [[United States Department of Health and Human Services|Department of Health and Human Services]] wanted to provide patients the opportunity to become better members of their own care team by giving them more information about their health.&amp;lt;ref name=&amp;quot;HHS14&amp;quot; /&amp;gt; This resulted in the amendment of CLIA 1988 to require a laboratory to give a patient, or their designated representative, lab results within 30 days of said individual sending a written request.&amp;lt;ref name=&amp;quot;HHS14&amp;quot; /&amp;gt;  Laboratories are still required to ensure those accessing this data have authorization to do so, as the original requirements to keep data secure and private remain the same. &lt;br /&gt;
&lt;br /&gt;
The cost of compliance with this new rule across the laboratory industry in general is estimated at $59 million over the first five years.&amp;lt;ref name=&amp;quot;Conn14&amp;quot;&amp;gt;{{cite web |url=http://www.modernhealthcare.com/article/20140203/NEWS/302039958 |title=HHS issues rule granting patients direct access to lab test results |author=Conn, J. |publisher=Modern Healthcare |date=03 February 2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; Laboratories like Quest Diagnostics support the rule because it will allow them to give patients their lab results without prior approval from the patient’s provider.&amp;lt;ref name=&amp;quot;Conn14&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As of this writing, it remains unclear how this impacts the POL. Since the POL is located at the physician office, access to results is most likely determined by the provider’s regular procedures for releasing protected health information (PHI). The POL could provide forms to patients for release of PHI, just as any other lab can, but it is unclear how this rule change will impact the POL in the long term.&lt;br /&gt;
&lt;br /&gt;
===PPACA===&lt;br /&gt;
As of this writing, the most difficult regulation to assess is the Patient Protection and Affordable Care Act (PPACA), also known as the ACA. The PPACA does not negate the obligations of the laboratory under CLIA and HIPAA. According to the Clinical Laboratory Coalition, laboratory testing informs about 70% of a clinician’s medical decision-making process, yet laboratory services account for less than two percent of Medicare spending.&amp;lt;ref name=&amp;quot;CLC12&amp;quot;&amp;gt;{{cite web |url=http://www.aab.org/images/aab/SGR%20Fix%202012%20Talking%20Points.pdf |format=PDF |title=Protect access to laboratory services for Medicare beneficiaries |author=Clinical Laboratory Coalition |date=2012 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;''The Patient Protection and Affordable Care Act (PPACA) included a direct cut to the Medicare Clinical Laboratory Fee Schedule of 1.75 percent each year for 5 years. This 9 percent cut is the largest cut among all Part B providers and started in 2011. In PPACA, clinical laboratories also received another cut in the form of a productivity adjustment, resulting in an additional 11 percent cut over 10 years.''&amp;lt;br /&amp;gt;&amp;amp;nbsp;&amp;lt;br /&amp;gt;&lt;br /&gt;
''The laboratory-specific cut and the productivity adjustment will already result in a cumulative 20 percent cut over 10 years. Laboratories are also facing up to a 2 percent cut to the fee schedule as a result of sequestration, which begins in January 2013.&amp;lt;ref name=&amp;quot;CLC12&amp;quot; /&amp;gt;''&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The laboratory space in general may face challenges from the accountable care organization (ACO) model under the PPACA, due to a decrease in laboratory testing volume.&amp;lt;ref name=&amp;quot;HughesCamm&amp;quot;&amp;gt;{{cite web |url=http://www.law360.com/articles/500623/clinical-labs-under-aca-challenge-and-opportunity |title=Clinical labs under ACA: Challenge and opportunity |author=Hughes, D.; Cammarata, B. |publisher=Law360 |date=16 January 2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; Under the ACO model, unnecessary or redundant testing would be discouraged.&amp;lt;ref name=&amp;quot;HughesCamm&amp;quot; /&amp;gt; This could be a good thing for the POL market, as waived testing would be done in-house, close to the patient. It could also be a problem for the POL market, as physicians may need to recalculate if operating a POL makes economic sense for their practice.&lt;br /&gt;
&lt;br /&gt;
==Economic considerations==&lt;br /&gt;
Economics are important to every aspect of a medical practice, and the POL is no exception. Four economic considerations should be made regarding the POL: profitability and sustainability, insurance reimbursements, billing, and return on investment (ROI). For example, at the end of the previous section, the economics surrounding the PPACA were discussed. Other considerations include assessing the financial penalties for non-compliance.&lt;br /&gt;
&lt;br /&gt;
===Profitability and sustainability of the POL===&lt;br /&gt;
Maintaining a profitable and sustainable lab is important. This is why the ROI calculation is provided in section 7.4 for examination. Performing this calculation can help determine how long it will take the lab to become profitable and whether it will be sustainable over time. Like most parts of a business, the laboratory becomes profitable by taking in more revenue than it pays out in expenses.&lt;br /&gt;
&lt;br /&gt;
===Insurance reimbursements===&lt;br /&gt;
Insurance reimbursements vary by insurance company and plan. Checking with the insurance companies and plans accepted by the POL is advised; these numbers often change, whether due to regulations or insurance company needs. It is important to keep current on this subject, as failing to do so can result in less reimbursement than expected. If the reimbursement for a test is cut from $25 to $5, and one fails to keep current with reimbursements from the insurance providers, the shock could ripple through the laboratory or provider practice.&lt;br /&gt;
&lt;br /&gt;
The April 2014 passage of the “Doc Fix,” a one-year delay of the cuts mandated by the Sustainable Growth Rate (SGR) Medicare physician payment formula, saved providers from this type of shock&amp;lt;ref name=&amp;quot;HIMSS14&amp;quot;&amp;gt;{{cite web |url=http://www.himss.org/News/NewsDetail.aspx?ItemNumber=28914 |title=After passage by Congress, President signs SGR &amp;quot;Doc Fix&amp;quot; &amp;amp; ICD-10 delay |author=HIMSS |date=01 April 2014 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;, as without this legislation providers were set to experience a 30% reduction in their reimbursement rates.&amp;lt;ref name=&amp;quot;CLC12&amp;quot; /&amp;gt; Laboratory personnel, especially billing personnel, would be wise to keep up with trends in this area.&lt;br /&gt;
&lt;br /&gt;
===Billing===&lt;br /&gt;
Billing requires medical codes such as the [[International Statistical Classification of Diseases and Related Health Problems|International Classification of Diseases]], Ninth Revision, Clinical Modification (ICD-9-CM) codes, soon to be replaced by ICD-10-CM codes, as well as procedure codes from either the Current Procedural Terminology (CPT) code set or the Healthcare Common Procedure Coding System (HCPCS). Logical Observation Identifiers Names and Codes (LOINC) may also be used, but mapping to another code set such as CPT is more common. Ordering physician, patient name, medical record number (MRN), and other demographic information may also appear on the bill. Laboratory-specific information such as the CLIA certificate number and modifiers indicating whether the test is CLIA-waived is also present.&lt;br /&gt;
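As a hypothetical illustration of how these codes might come together on a claim, consider the following Python sketch. The field names, the patient, the CLIA number, and the claim structure are invented for illustration only, though CPT 82962 (blood glucose, listed in the appendix) and the Medicare QW modifier for CLIA-waived tests are real codes.&lt;br /&gt;

```python
from dataclasses import dataclass, field

@dataclass
class LabClaimLine:
    """Hypothetical sketch of minimal fields a POL claim line might carry."""
    patient_name: str
    mrn: str                 # medical record number
    ordering_physician: str
    icd9_cm: str             # diagnosis code (ICD-10-CM after the transition)
    cpt_code: str            # procedure code (CPT or HCPCS)
    clia_number: str         # the lab's CLIA certificate number
    modifiers: list = field(default_factory=list)  # e.g., "QW" for CLIA-waived tests

line = LabClaimLine(
    patient_name="Jane Doe",
    mrn="000123",
    ordering_physician="Dr. A. Fein",
    icd9_cm="250.00",        # ICD-9-CM: diabetes mellitus without complication
    cpt_code="82962",        # blood glucose by FDA home-use monitoring device
    clia_number="99D9999999",
    modifiers=["QW"],
)
print(line.cpt_code, line.modifiers)  # 82962 ['QW']
```
&lt;br /&gt;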
&lt;br /&gt;
Laboratory billing for Medicare went through an important simplification process in 2003, allowing for a more standardized process and eliminating the confusion office staff were experiencing in their attempts to comply with billing rules.&amp;lt;ref name=&amp;quot;BakerMcKenzie&amp;quot;&amp;gt;{{cite web |url=http://www.acpinternist.org/archives/2003/12/baker.htm |title=Recent CMS lab test standards simplify billing rules |author=Baker, B.; McKenzie, C. |publisher=ACP Observer |date=December 2003 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; The rules changed the billing for 23 laboratory tests that cover nearly two-thirds of all laboratory testing.&amp;lt;ref name=&amp;quot;BakerMcKenzie&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Return on investment===&lt;br /&gt;
The return on investment (ROI) metric is important to the POL. An example of ROI in action: someone invests in a stock and gets 10% of the money back every year; that 10% is the ROI from a purely financial standpoint. The formulas for the ROI calculations are listed below:&lt;br /&gt;
&lt;br /&gt;
:''Simple ROI = Financial Gain/Total Investment''&lt;br /&gt;
&lt;br /&gt;
:''Discounted ROI = Net Present Value of Benefits/Total Present Value of Costs''&amp;lt;ref name=&amp;quot;ITEcoCorp&amp;quot;&amp;gt;{{cite web |url=http://iteconcorp.com/ROICalc.html |title=Computing the ROI for IT projects and other investments |author=IT Economics Corporation |date=2010 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Simple ROI calculation is primarily used for short-term calculations related to an investment of one year or less, while the Discounted ROI is more accurate for long-term analysis and calculations.&amp;lt;ref name=&amp;quot;ITEcoCorp&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The factors included in the ROI calculation will vary, depending on the type of calculation. The Simple ROI, for example, only examines net financial gain divided by the total investment amount. The person using this formula would say, after receiving $110,000 back on a $100,000 investment, their net gain is $10,000 and their Simple ROI is 10%.&lt;br /&gt;
&lt;br /&gt;
Before performing either calculation, it is important to measure current performance and then measure performance again after the investment.&lt;br /&gt;
&lt;br /&gt;
The POL example below uses the IT Economics Corporation ROI calculation formula&amp;lt;ref name=&amp;quot;ITEcoCorp&amp;quot; /&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
'''Simple ROI calculation for Fein and Douglas Associates' POL'''&lt;br /&gt;
&lt;br /&gt;
'''Year 1'''&lt;br /&gt;
&lt;br /&gt;
''$100,000 benefit to the practice - $100,000 outlay of resources to establish the lab = $0 net savings divided by $100,000 outlay of resources, yielding a Simple ROI of 0%''&lt;br /&gt;
&lt;br /&gt;
'''Year 2'''&lt;br /&gt;
&lt;br /&gt;
''$400,000 benefit to the practice - $100,000 outlay of resources to maintain the laboratory = $300,000 net savings divided by $100,000 outlay of resources, yielding a Simple ROI of 300%''&lt;br /&gt;
&lt;br /&gt;
'''Year 3'''&lt;br /&gt;
&lt;br /&gt;
''$700,000 benefit to the practice - $100,000 outlay of resources to maintain the laboratory = $600,000 net savings divided by $100,000 outlay of resources, yielding a Simple ROI of 600%.''&lt;br /&gt;
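&lt;br /&gt;
The yearly Simple ROI figures above can be reproduced with a few lines of Python (a minimal sketch; the function name is ours, not part of the IT Economics Corporation formula):&lt;br /&gt;

```python
def simple_roi(total_benefit, total_outlay):
    """Simple ROI: net gain (benefit minus outlay) divided by the outlay, as a percentage."""
    return 100.0 * (total_benefit - total_outlay) / total_outlay

# The three years from the Fein and Douglas Associates example
print(simple_roi(100_000, 100_000))  # 0.0
print(simple_roi(400_000, 100_000))  # 300.0
print(simple_roi(700_000, 100_000))  # 600.0
```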
&lt;br /&gt;
The Simple ROI calculation becomes less accurate as time goes on, because it does not take into account the time value of money or other time-dependent factors, such as equipment depreciation.&lt;br /&gt;
&lt;br /&gt;
The Discounted ROI methodology takes into account that a dollar received in 1988 is worth more than one received in 2008. It is more complicated than the Simple ROI, as it requires calculating the present value of costs and benefits, and it also requires knowing the organization's discount rate. The organization's CFO is typically the best person to contact for that information.&lt;br /&gt;
&lt;br /&gt;
Note how taking the same laboratory and performing a Discounted ROI calculation shows a more accurate result, as seen below&amp;lt;ref name=&amp;quot;ITEcoCorp&amp;quot; /&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
'''Discounted ROI calculation for Fein and Douglas Associates' POL'''&lt;br /&gt;
&lt;br /&gt;
For this example, the assumption of an average 6% discount rate was used.&lt;br /&gt;
&lt;br /&gt;
'''Year 1'''&lt;br /&gt;
&lt;br /&gt;
Present Value (PV) of Benefits: $100,000 received at the end of year 1 = $94,340&lt;br /&gt;
&lt;br /&gt;
PV of Costs: $100,000 paid at the end of year 1 = $94,340&lt;br /&gt;
&lt;br /&gt;
''$94,340 PV of benefits to the practice - $94,340 PV of costs to establish the lab = $0 Net PV of benefits divided by $94,340 PV of costs, yielding a Discounted ROI of 0%''&lt;br /&gt;
&lt;br /&gt;
Note how the Simple ROI and the Discounted ROI calculations are the same during the first year because the costs and benefits all occur in the same year, at the same time.&lt;br /&gt;
&lt;br /&gt;
'''Year 2'''&lt;br /&gt;
&lt;br /&gt;
Present Value (PV) of Benefits: $100,000 received at the end of year 1 = $94,340; PV of another $300,000 received at the end of year 2 = $267,000; Net PV of Benefits for year 2 thus = $361,340&lt;br /&gt;
&lt;br /&gt;
PV of Costs: $100,000 paid at the end of year 2 = $94,340&lt;br /&gt;
&lt;br /&gt;
''$361,340 PV of benefits to the practice - $94,340 PV of costs to maintain the lab = $267,000 Net PV of benefits divided by $94,340 PV of costs, yielding a Discounted ROI of 283%''&lt;br /&gt;
&lt;br /&gt;
Note how the Simple ROI for year 2 was 300%, a full 17 percentage points greater than the Discounted ROI.&lt;br /&gt;
&lt;br /&gt;
'''Year 3'''&lt;br /&gt;
&lt;br /&gt;
Present Value (PV) of Benefits: $100,000 received at the end of year 1 = $94,340; PV of another $300,000 received at the end of year 2 = $267,000; PV of another $300,000 received at the end of year 3 = $251,886; Net PV of Benefits for year 3 thus = $613,226&lt;br /&gt;
&lt;br /&gt;
PV of Costs: $100,000 paid at the end of year 3 = $94,340&lt;br /&gt;
&lt;br /&gt;
''$613,226 PV of benefits to the practice - $94,340 PV of costs to maintain the lab = $518,886 Net PV of benefits divided by $94,340 PV of costs, yielding a Discounted ROI of 550%''&lt;br /&gt;
&lt;br /&gt;
Discounted ROI at the end of year 3 is 550%, not the 600% shown by the Simple ROI calculation. In this case the overage is 50 percentage points in year 3, as opposed to 17 in year 2. The Simple ROI calculation becomes less accurate in later years, meaning that by year four, the overage would be greater than 50 percentage points.&amp;lt;ref name=&amp;quot;ITEcoCorp&amp;quot; /&amp;gt;&lt;br /&gt;
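&lt;br /&gt;
The year 3 discounted figures can be reproduced in Python as follows. Note that, following the worked numbers above, each year's $100,000 outlay is discounted by a single period, while benefits are discounted from the year they are received. This is a minimal sketch of the arithmetic, not the IT Economics Corporation's own tooling.&lt;br /&gt;

```python
def pv(amount, rate, periods):
    """Present value of a future amount at a given annual discount rate."""
    return amount / (1 + rate) ** periods

rate = 0.06  # the 6% discount rate assumed in the example

# Benefits received at the end of years 1, 2, and 3
pv_benefits = pv(100_000, rate, 1) + pv(300_000, rate, 2) + pv(300_000, rate, 3)
# Year 3's $100,000 outlay, discounted one period per the worked example
pv_costs = pv(100_000, rate, 1)

discounted_roi = 100.0 * (pv_benefits - pv_costs) / pv_costs
print(round(discounted_roi))  # 550
```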
&lt;br /&gt;
==Managing data and test results==&lt;br /&gt;
Data management in the physician office laboratory has six important aspects:&lt;br /&gt;
&lt;br /&gt;
* Overall workflow&lt;br /&gt;
* Order entry&lt;br /&gt;
* Testing, including associated results and reports&lt;br /&gt;
* Quality control&lt;br /&gt;
* Integration with instruments and software&lt;br /&gt;
* Integration with external reference laboratory results&lt;br /&gt;
&lt;br /&gt;
In any modern laboratory, the common way to integrate data and workflow across the enterprise is through a data management system.&lt;br /&gt;
&lt;br /&gt;
===Data management systems and test workflow===&lt;br /&gt;
From a basic clinical and research laboratory perspective, there are two common data management systems to choose from: a [[laboratory information system]] (LIS) and a [[laboratory information management system]] (LIMS). Generally speaking, a LIS will be found more often in a clinical laboratory like the POL, whereas a LIMS will be more common in a research laboratory.&amp;lt;ref name=&amp;quot;Friedman08&amp;quot;&amp;gt;{{cite web |url=http://labsoftnews.typepad.com/lab_soft_news/2008/11/liss-vs-limss-its-time-to-consider-merging-the-two-types-of-systems.html |title=LIS vs. LIMS: It's time to blend the two types of lab information systems |author=Friedman, B. |publisher=Lab Soft News |date=04 November 2008 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The functionality is slightly different between the LIS and LIMS. The LIS tends to be more patient-centric, exhibiting features that focus on subjects and specimens, while the LIMS tends to be more group-centric, focusing on batches and samples. Both of these systems may have the ability to interface with a [[hospital information system]] (HIS), [[electronic health record]] (EHR), [[practice management system]] (PMS), or other types of systems commonly found in healthcare settings.&lt;br /&gt;
&lt;br /&gt;
The overall workflow was discussed in a previous section, but to refresh the reader's mind, here is the chart again:&lt;br /&gt;
&amp;lt;br /&amp;gt;&amp;amp;nbsp;&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:'''POL workflow''':&amp;lt;br /&amp;gt;&lt;br /&gt;
[[File:POLWorkflow.png|700px]]&lt;br /&gt;
&lt;br /&gt;
Notice the flow starts with the doctor ordering the test and ends with the discussion of options with the patient. POL workflow may vary depending upon specialty, but the steps identified in this guide are the most common. Each part of the workflow is important and dependent upon the others. Without a doctor's order, it does not matter if the specimen is collected and sent to the lab. Without a specimen to test, the rest of the workflow is unable to proceed.&lt;br /&gt;
&lt;br /&gt;
Since the first step is a doctor ordering the test, it is important to understand how the order entry component of a LIS operates. Each vendor will have a slightly different display, but the process is typically the same. The doctor enters the order for the test, often through a drop-down menu of tests in the patient's chart in the provider's EHR system. The LIS integrates with the EHR, allowing the laboratory staff to receive the order, collect the samples, perform the tests, and then transfer the results back into the EHR for the physician to review, often in the form of a [[LIS feature#Custom reporting|report]]. If the physician is not using an electronic system, then this process begins with a written or printed order and finishes with a written or printed report of the results for the patient file.&lt;br /&gt;
&lt;br /&gt;
Many LIS vendors will custom configure the system with tests, including reference ranges, for the POL as part of the purchase agreement. These interfaces will often have a drop-down menu as well, so that the person performing the test can select the test and compare the sample result to the reference range for that particular test.&lt;br /&gt;
&lt;br /&gt;
===Quality control===&lt;br /&gt;
Aside from workflow, order entry, testing, and reporting, another important aspect of managing data is [[LIS feature#QA/QC functions|quality control]] (QC), which allows the handler of the data to ensure it still meets the definition of quality data. In healthcare informatics, quality data is defined as data that is clear, complete, relevant, timely, and accurate in presentation.&amp;lt;ref name=&amp;quot;WHO03&amp;quot;&amp;gt;{{cite web |url=http://www.wpro.who.int/publications/docs/Improving_Data_Quality.pdf |title=Data quality: A guide for developing countries |author=World Health Organization, Regional Office for the Western Pacific |publisher=World Health Organization |date=2003 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Prior to the use of computers, the entire process of [[Informatics (academic field)|informatics]] involved paper records. In many cases, these records were handwritten, requiring personnel to read the handwriting of others in order to assess the quality of the data provided.&amp;lt;ref name=&amp;quot;Sinard06&amp;quot;&amp;gt;{{cite book |url=http://www.springer.com/medicine/pathology/book/978-0-387-28057-8 |title=Practical pathology informatics: Demystifying informatics for the practicing anatomic pathologist |author=Sinard, J. |publisher=Springer Science+Business Media |year=2006 |isbn=9780387280585}}&amp;lt;/ref&amp;gt; Resistance to using a computer for laboratory tasks typically comes from a belief that the old ways are better; however, in most cases the computer can greatly aid in the prevention of errors.&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
LIS and LIMS vendors often include a variety of quality control functions in their software. QC tests can be run on specimens, quality control charts and reports can be created, proficiency testing functions can be implemented, and certificates of analysis (COAs) can be produced.&amp;lt;ref name=&amp;quot;HullWrayEtAl&amp;quot;&amp;gt;{{cite journal |url=http://www.ncbi.nlm.nih.gov/pubmed/21631414 |journal=Combinatorial Chemistry &amp;amp; High Throughput Screening |title=Tracking and controlling everything that affects quality is the key to a quality management system |author=Hull, C.; Wray, B.; Winslow, F.; Villicich, M. |volume=14 |issue=9 |pages=772–780 |date=2011 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
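&lt;br /&gt;
As a simple illustration of the kind of check such QC functions automate, the following Python sketch flags a control result against Levey-Jennings-style control limits. The function, the thresholds shown, and the glucose figures are illustrative only, not any vendor's implementation; a real system layers on additional rules, charting, and documentation.&lt;br /&gt;

```python
def qc_flag(value, mean, sd):
    """Flag a QC result against control limits (a common Levey-Jennings
    convention: warn beyond 2 SD, reject beyond 3 SD). The mean and SD come
    from the lab's own historical runs of the control material."""
    z = (value - mean) / sd
    if abs(z) > 3:
        return "reject"
    if abs(z) > 2:
        return "warn"
    return "in control"

# Glucose control with an established mean of 100 mg/dL and SD of 3 mg/dL
print(qc_flag(104, 100, 3))  # in control
print(qc_flag(107, 100, 3))  # warn
print(qc_flag(112, 100, 3))  # reject
```
&lt;br /&gt;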
&lt;br /&gt;
It's worth noting, though, the use of computers and information management systems does not completely rule out error. For example, the drop-down menu still allows users to select either the wrong test or a test that looks for the same result but is less effective than another test.&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt; These types of errors are more often than not controlled in a system where the laboratory personnel are knowledgeable about testing and are able to educate the ordering physician on such matters.&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other business processes can benefit from quality control measures, such as the application of Lean Six Sigma — an approach to reducing waste and limiting defects in a process — to the laboratory. It is important to examine the needs of the laboratory to ensure appropriate quality control techniques are implemented.&lt;br /&gt;
&lt;br /&gt;
===Integration with instruments and software===&lt;br /&gt;
Since workflow is the single most important consideration in the design of the LIS&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt;, it is important that the LIS allows instruments and other software to [[LIS feature#Instrument interfacing and management|interface]] with it. These interfaces are generally implemented using standard communication protocols and messaging formats, like [[Health Level 7]] (HL7).&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Because the majority of POLs perform waived testing, and those simple tests often don't require advanced equipment, interfacing with a data management system may not be a concern. But for those POLs that employ [[laboratory automation]], the ability of the equipment to talk to the LIS is vital, often via HL7.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;''The types of information communicated between these systems include process control and status information for each device or analyzer, each specimen, specimen container, and container carrier, information and detailed data related to patients, orders, and results, and information related to specimen flow algorithms and automated decision making.''&amp;lt;ref name=&amp;quot;HL711&amp;quot;&amp;gt;{{cite web |url=http://www.hl7.org/implement/standards/product_brief.cfm?product_id=203 |title=HL7 version 2.7 standard: Chapter 13 - Clinical laboratory automation |author=Health Level Seven International |date=2011 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A software interface between the LIS and the EHR is often referred to as a result interface, and it typically uses the HL7 messaging protocol and standard communication protocols like TCP/IP. These interfaces are not turn-key, however, requiring a comprehensive planning and implementation process.&amp;lt;ref name=&amp;quot;Kasoff12&amp;quot;&amp;gt;{{cite web |url=http://www.mlo-online.com/articles/201202/connecting-your-lis-and-ehr.php |title=Connecting your LIS and EHR |author=Kasoff, J. |publisher=Medical Laboratory Observer |date=February 2012 |accessdate=14 May 2014}}&amp;lt;/ref&amp;gt; After a successful implementation, the interface allows information from a completed test to be reported back to the EHR, where the physician can readily obtain a copy of the patient test report. It can also allow for billing batches and admission/discharge/transfer (ADT) reporting. These same interfaces can be used for communicating with other reference laboratories, in addition to the various hospital systems.&amp;lt;ref name=&amp;quot;Sinard06&amp;quot; /&amp;gt;&lt;br /&gt;
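&lt;br /&gt;
As a rough illustration of what such a result message looks like on the wire, the following Python sketch assembles a toy HL7 v2-style ORU^R01 message from pipe-delimited segments. The application names, IDs, and patient values are invented; production interfaces add MLLP framing over TCP/IP, acknowledgments, and many more fields and segments.&lt;br /&gt;

```python
def build_oru_message(patient_id, patient_name, test_code, test_name,
                      value, units, ref_range):
    """Assemble a toy HL7 v2-style ORU^R01 result message (carriage-return
    delimited segments, pipe-delimited fields)."""
    segments = [
        # MSH-2 encoding characters end with an ampersand, written via chr(38)
        "MSH|^~\\" + chr(38) + "|POL_LIS|POL|EHR|CLINIC|20140514||ORU^R01|MSG0001|P|2.3",
        f"PID|1||{patient_id}||{patient_name}",
        f"OBR|1||LAB0001|{test_code}^{test_name}",
        f"OBX|1|NM|{test_code}^{test_name}||{value}|{units}|{ref_range}||||F",
    ]
    return "\r".join(segments)

# LOINC 2345-7 is glucose in serum or plasma; the result values are invented
msg = build_oru_message("000123", "DOE^JANE", "2345-7", "GLUCOSE",
                        "98", "mg/dL", "70-99")
print(msg.split("\r")[3])  # the OBX result segment
```
&lt;br /&gt;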
&lt;br /&gt;
==Getting help with your POL==&lt;br /&gt;
It is important that those involved with the POL know where to get help when issues arise. Often, the first call goes to the [[:Category:Vendors|vendor]] of the software, instruments, testing supplies, and other products and services the laboratory utilizes. If the vendor is unable to assist, consultants may prove to be a valuable asset to the POL. Consultants can help staff sort through regulatory compliance, financial planning, systems planning, and laboratory design. They can also assist with filling any gap between a vendor support agreement and the needs of the POL, if such a gap exists in the contract.&lt;br /&gt;
&lt;br /&gt;
Another route for support with your POL may be a professional organization or trade association. The American College of Physicians, for example, provides numerous printable resources to practices. The American Society for Clinical Laboratory Science offers professional development courses, educational courses, and certification help to its members. &lt;br /&gt;
&lt;br /&gt;
A directory of consultants, organizations, and other tools is available as an [[#Addendum|addendum]] to this guide.&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
POLs have existed in various forms since the beginning of medicine. With accountable care organizations (ACOs), patient-centered medical homes, direct-to-consumer laboratory testing, and telemedicine, an increased opportunity exists for laboratory medicine to be added to practices. Laboratory medicine influences 70% of all medical decisions, and the POL allows the physician to get results faster than having to send out for testing at a reference laboratory.&lt;br /&gt;
&lt;br /&gt;
The decision to operate (or continue operating) a POL is an important one not to be entered into lightly. Owners must consider regulatory compliance, business processes, technology choices, and economic considerations. However, while a daunting task, ultimately choosing to operate a POL may prove to be a rewarding experience.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;references /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Appendix==&lt;br /&gt;
'''Table 1.''' Tests Granted Waived Status under CLIA (This list includes updates from Change Request 8439.)&lt;br /&gt;
{| class=&amp;quot;wikitable collapsible&amp;quot; border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; style=&amp;quot;width:50%&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  ! colspan=&amp;quot;1&amp;quot; style=&amp;quot;color:brown; background-color:#ffffee;| CPT Code(s)&lt;br /&gt;
  ! colspan=&amp;quot;1&amp;quot; style=&amp;quot;color:brown; background-color:#ffffee;| Test Name&lt;br /&gt;
  ! colspan=&amp;quot;1&amp;quot; style=&amp;quot;color:brown; background-color:#ffffee;| Manufacturer&lt;br /&gt;
  ! colspan=&amp;quot;1&amp;quot; style=&amp;quot;color:brown; background-color:#ffffee;| Use&lt;br /&gt;
 |- &lt;br /&gt;
  | '''81002'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Dipstick or tablet reagent urinalysis – non-automated for bilirubin, glucose, hemoglobin, ketone, leukocytes, nitrite, pH, protein, specific gravity, and urobilinogen&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Screening of urine to monitor/diagnose various diseases/conditions, such as diabetes, the state of the kidney or urinary tract, and urinary tract infections &lt;br /&gt;
 |- &lt;br /&gt;
  | '''81025'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Urine pregnancy tests by visual color comparison&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Diagnosis of pregnancy &lt;br /&gt;
 |- &lt;br /&gt;
  | '''82270&amp;lt;br /&amp;gt;82272&amp;lt;br /&amp;gt;(Contact your Medicare carrier for claims instructions.)'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Fecal occult blood&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Detection of blood in feces from whatever cause, benign or malignant (colorectal cancer screening)&lt;br /&gt;
 |- &lt;br /&gt;
  | '''82962'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Blood glucose by glucose monitoring devices cleared by the FDA for home use&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Monitoring of blood glucose levels &lt;br /&gt;
 |- &lt;br /&gt;
  | '''83026'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Hemoglobin by copper sulfate – non-automated&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Monitors hemoglobin level in blood &lt;br /&gt;
 |- &lt;br /&gt;
  | '''84830'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Ovulation tests by visual color comparison for human luteinizing hormone&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Detection of ovulation (optimal for conception) &lt;br /&gt;
 |- &lt;br /&gt;
  | '''85013'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Blood count; spun microhematocrit&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Screen for anemia &lt;br /&gt;
 |- &lt;br /&gt;
  | '''85651'''&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Erythrocyte sedimentation rate – non-automated&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Various&lt;br /&gt;
  | style=&amp;quot;background-color:white;&amp;quot; |Nonspecific screening test for inflammatory activity; increased in the majority of infections and in most cases of carcinoma and leukemia&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
	&lt;br /&gt;
(Recreated from http://www.cms.gov/Regulations-and-Guidance/Legislation/CLIA/downloads/waivetbl.pdf)&lt;br /&gt;
&lt;br /&gt;
==Addendum==&lt;br /&gt;
&lt;br /&gt;
As part of the &amp;quot;Getting help with your POL&amp;quot; section, an addendum to this white paper is included, containing information about conferences, consultants, organizations, and published materials that could potentially help those operating and working in a physician office laboratory. &lt;br /&gt;
&lt;br /&gt;
You can find that content [[LII:The Practical Guide to the U.S. Physician Office Laboratory/Addendum|on this page]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications|Practical Guide to the U.S. Physician Office Laboratory]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Elements_of_Laboratory_Technology_Management&amp;diff=64492</id>
		<title>LII:Elements of Laboratory Technology Management</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Elements_of_Laboratory_Technology_Management&amp;diff=64492"/>
		<updated>2024-06-19T18:49:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Year clarification&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title''': ''Elements of Laboratory Technology Management''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Joe Liscouski, with editorial modifications by Shawn Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': Originally published 2014; republished here February 2021&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
This discussion is less about specific technologies than it is about the ability to use advanced [[laboratory]] technologies effectively.  When we say “effectively,” we mean that those products and technologies should be used successfully to address needs in your lab, and that they improve the lab’s ability to function. If they don't do that, you’ve wasted your money. Additionally, if the technology in question hasn’t been deployed according to a deliberate plan, your funded projects may not achieve everything they could. Optimally, when applied thoughtfully, the available technologies should result in the transformation of lab work from a labor-intensive effort to one that is intellectually intensive, making the most effective use of people and resources.&lt;br /&gt;
&lt;br /&gt;
People come to the subject of [[laboratory automation]] from widely differing perspectives. To some it’s about robotics, to others it’s about [[laboratory informatics]], and even others view it as simply data acquisition and [[Data analysis|analysis]]. It all depends on what your interests are, and more importantly what your immediate needs are.&lt;br /&gt;
&lt;br /&gt;
People began working in this field in the 1940s and 1950s, with the work focused on analog electronics to improve instrumentation; this was the first phase of lab automation. Most notable was the development of scanning [[spectrophotometer]]s and process [[Chromatography|chromatographs]]. Those who first encountered this equipment didn’t think much of it, considering it simply the way things had always been. Others, who had to use products like the Spectronic 20{{Efn|The Spectronic 20 was developed by Bausch &amp;amp; Lomb in 1954 and is currently owned and marketed in updated versions by ThermoFisher.}} (a single-beam manual spectrophotometer) to develop visible spectra one wavelength measurement at a time, appreciated the automation of scanning instruments.&lt;br /&gt;
&lt;br /&gt;
Mercury switches and timers triggered by cams on a rotating shaft provided chromatographs with the ability to automatically take [[Sample (material)|samples]], actuate back flush valves, and take care of other functions without operator intervention. This left the analyst with the task of measuring peaks, developing calibration curves, and performing calculations, at least until data systems became available.&lt;br /&gt;
&lt;br /&gt;
The direction of laboratory automation changed significantly when computer chips became available. In the 1960s, companies such as [[Vendor:PerkinElmer Inc.|PerkinElmer]] were experimenting with the use of computer systems for data acquisition as precursors to commercial products. The availability of general-purpose computers such as the PDP-8 and PDP-12 series (along with the Lab 8e) from Digital Equipment, with other models available from other vendors, made it possible for researchers to connect their instruments to computers and carry out experiments. The development of microprocessors from Intel (4004, 8008) led to the evolution of “intelligent” laboratory equipment ranging from processor-controlled stirring hot-plates to chromatographic integrators.&lt;br /&gt;
&lt;br /&gt;
As researchers learned to use these systems, their application rapidly progressed from data acquisition to interactive control of the experiments, including data storage, analysis, and reporting. Today, the product set available for laboratory applications includes data acquisition systems, [[laboratory information management system]]s (LIMS), [[electronic laboratory notebook]]s (ELNs), laboratory robotics, and specialized components to help researchers, scientists, and technicians apply modern technologies to their work.&lt;br /&gt;
&lt;br /&gt;
While there is a lot of technology available, the question remains &amp;quot;how do you go about using it?&amp;quot; Not only do we need to know how to use it, but we also must do so while avoiding our own biases about how computer systems operate. Our familiarity with using computer systems in our daily lives may cause us to assume they are doing what we need them to do, without questioning how it actually gets done. “The vendor knows what they are doing” is a poor reason for not testing and evaluating control parameters to ensure they are suitable and appropriate for your work.&lt;br /&gt;
&lt;br /&gt;
==Moving from lab functions and requirements to practical solutions==&lt;br /&gt;
Before we can begin to understand the application of the tools and technologies that are available, we have to know what we want to accomplish, specifically what problems we want to solve. We can divide laboratory functions into two broad classes: management and work execution. Figure 1 addresses management functions, whereas Figure 2 addresses work execution functions, all common to laboratories. You can add to them based on your own experience.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Liscouski ElementsLabTechMan14.png|849px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;849px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 1.''' A breakdown of management-level functions in a typical laboratory&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Liscouski ElementsLabTechMan14.png|880px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;880px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 2.''' A breakdown of work-level functions in a typical laboratory&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Vendors have been developing products to address these work areas, and there are a lot of products available. Many of them are &amp;quot;point&amp;quot; solutions: products that are focused on one aspect of work without an effort to integrate them with others. That isn’t surprising since there isn’t an architectural basis for integration aside from specific hardware systems (e.g., FireWire, USB) or vendor-specific software systems (e.g., office product suites). Another issue in scientific work is that the vendor may only be interested in solving a particular problem, with most of the emphasis on an instrument or technique. They may provide the software needed to support their hardware, with data transfer and integration left to the user.&lt;br /&gt;
&lt;br /&gt;
As you work through this document, you’ll find a map of management responsibilities and technologies. How do you connect the above map of functions to the technologies? Applying software and hardware solutions to your lab's needs requires deliberate planning. The days of purchasing point solutions to problems have passed. Today's lab managers need to think more broadly about product usage and how components of lab software systems work together. The point of this document is to help you understand what you need to think about in that regard.&lt;br /&gt;
&lt;br /&gt;
Given those summaries of lab activities, how do we apply available technologies to improve lab operations? Most of the answers fall under the heading of &amp;quot;laboratory automation,&amp;quot; so we’ll begin by looking at what that is.&lt;br /&gt;
&lt;br /&gt;
==What is laboratory automation?==&lt;br /&gt;
This isn’t a trivial question; your answer may depend on the field you are working in, your experience, and your current interests. To some it means robotics, to others it is a LIMS (or their clinical counterpart, the [[laboratory information system]]s or LIS). The ELN and instrument data systems (IDS) are additional elements worth noting. These are examples of product classes and technologies used in lab automation, but they don’t define the field. Wikipedia provides the following as a definition&amp;lt;ref name=&amp;quot;WPLabAuto14&amp;quot;&amp;gt;{{cite web |url=https://en.wikipedia.org/wiki/Laboratory_automation |archiveurl=https://en.wikipedia.org/w/index.php?title=Laboratory_automation&amp;amp;oldid=636846823 |title=Laboratory automation |work=Wikipedia |archivedate=05 December 2014}}&amp;lt;/ref&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Laboratory automation is a multi-disciplinary strategy to research, develop, optimize and capitalize on technologies in the laboratory that enable new and improved processes. Laboratory automation professionals are academic, commercial and government researchers, scientists and engineers who conduct research and develop new technologies to increase productivity, elevate experimental data quality, reduce lab process cycle times, or enable experimentation that otherwise would be impossible. &lt;br /&gt;
&lt;br /&gt;
The most widely known application of laboratory automation technology is laboratory robotics. More generally, the field of laboratory automation comprises many different automated laboratory instruments, devices, software algorithms, and methodologies used to enable, expedite and increase the efficiency and effectiveness of scientific research in laboratories.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And McDowall offered this definition in 1993&amp;lt;ref name=&amp;quot;McDowallAMatrix93&amp;quot;&amp;gt;{{cite journal |title=A Matrix for the Development of a Strategic Laboratory Information Management System |journal=Analytical Chemistry |author=McDowall, R.D. |volume=65 |issue=20 |pages=896A–901A |year=1993 |doi=10.1021/ac00068a725}}&amp;lt;/ref&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Apparatus, instrumentation, communications or computer applications designed to mechanize or automate the whole or specific portions of the analytical process in order for a laboratory to provide timely and quality information to an organization&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These definitions emphasize equipment and products, and that is where typical approaches to lab automation and the work we are doing part company. Products and technologies are important, but what is more important is figuring out how to use them effectively. The lack of consistent success in the application of lab automation technologies appears to stem from this focus on technologies and equipment—“what will this product do for my lab?”—rather than methodologies for determining what is needed and how to implement solutions.&lt;br /&gt;
&lt;br /&gt;
Having a useful definition of laboratory automation is crucial since how we approach the work depends on how we see the field developing. The definition the Institute for Laboratory Automation (ILA) bases its work on is this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Laboratory automation is the process of determining needs and requirements, planning projects and programs, evaluating products and technologies, and developing and implementing projects according to a set of methodologies that results in successful systems that increase productivity, improve the effectiveness and efficiency of laboratory operations, reduce operating costs, and provide higher-quality data.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The field includes the use of data acquisition, analysis, robotics, sample preparation, laboratory informatics, information technology, computer science, and a wide range of technologies and products from widely varying disciplines, used in the implementation of projects.&lt;br /&gt;
&lt;br /&gt;
===Why &amp;quot;process&amp;quot; is important===&lt;br /&gt;
Lab automation isn’t about stuff, but how we use stuff. The “process” component of the ILA definition is central to what we do. To quote Frank Zenie{{Efn|Mr. Zenie often introduced robotics courses at Zymark with that statement.}} (one of the founders of Zymark Corporation), “you don’t automate a thing, you automate a process.” You don’t automate an instrument; you automate the process of using one. Autosamplers are a good example of successful automation: they address the process of selecting a sample vial, withdrawing fluid, positioning a needle into an injection port, injecting the sample, preparing the syringe for the next injection, and indexing the sample vials when needed.&lt;br /&gt;
&lt;br /&gt;
A number of people have studied the structure of science and the relationship between disciplines. Lab automation is less about science and more about how the work of science is done. Before lab automation can be considered for a project, the underlying science has to be done. In other words, the process that automation is going to be applied to must exist first. It also has to be the right process for consideration. This is a point that needs attention.&lt;br /&gt;
&lt;br /&gt;
If you are going to spend resources on a project, you have to make sure you have a well-characterized process and that the process is both optimal and suitable for automation. This means:&lt;br /&gt;
&lt;br /&gt;
* The process is well documented, people that use the process have been trained on that documented process, and their work is monitored to determine any shortcuts, workarounds, or other variations from that documented process. Differences between the published process and the one actually in use can have a significant impact on the success of a particular project design.&lt;br /&gt;
* The process’s “readiness for automation” has been determined. The equipment used is suitable for automation or the changes needed to make it suitable are known, and it can be done at reasonable cost. Any impact on warranties has been determined and found to be acceptable.&lt;br /&gt;
* If several process options exist (e.g., different test protocols for the same test question), they are evaluated for their ability to meet the needs of the science and to be successfully implemented. Other options, such as outsourcing, should be considered to make the best use of resources; it may be more cost-effective to outsource than automate.&lt;br /&gt;
&lt;br /&gt;
When looking at laboratory processes, it's useful to recognize they may operate on different levels. There may be high-level processes that address the lab's reason for existence and cover the mechanics of how the lab functions, as well as low-level processes that address individual functions of the lab. This process view is important when we consider products and technologies, as the products have to fit the process, which forms the general basis for product requirements. Discussions about LIMS and ELNs are one example, with questions asked about whether one or both are required in the lab, or whether an ELN can replace a LIMS. &lt;br /&gt;
&lt;br /&gt;
These and other similar questions reflect both vendor influence and a lack of understanding of the technologies and their application. Some differentiate the two types of technology based on “structured” (as found in LIMS) vs. “unstructured” (as found in ELNs) data. Broadly speaking, LIMS come with a well-defined, extensible database structure, while ELNs are viewed as unstructured since you can put almost anything into an ELN and organize the contents as you see fit. But this characterization doesn’t work well either, since a user might consider the contents, along with an index, as having a structure. This is more of an information technology approach than one that addresses laboratory needs. In the end, an understanding of lab processes is still required to resolve most issues.&lt;br /&gt;
&lt;br /&gt;
LIMS are well-defined entities{{Efn|See [https://www.astm.org/DATABASE.CART/HISTORICAL/E1578-06.htm ASTM E1578 - 06].}} and the only one of the two to carry an objective industry-standard description. LIMS are also designed to manage the processes surrounding laboratory testing in a wide variety of industries (e.g., from analytical, physical, and environmental testing, to clinical testing, which is usually associated with the LIS). The lab behavior process model is essentially the same across industries and disciplines. Basically, if you are running an analytical lab and need to manage samples and test results while answering questions about the status of testing on a larger scale than what you can memorize{{Efn|This is not a recommended method.}}, a LIMS is a good answer.&lt;br /&gt;
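&lt;br /&gt;
As a rough illustration (in Python, with invented names; no particular LIMS works this way internally), the sample-and-status bookkeeping described above can be sketched as:&lt;br /&gt;

```python
# Hypothetical sketch of the bookkeeping a LIMS automates: registering
# samples, recording test results, and answering status questions at a
# scale beyond what anyone can memorize. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    tests: dict = field(default_factory=dict)  # test name -> result or None

lab: dict[str, Sample] = {}

def log_in(sample_id: str, tests: list[str]) -> None:
    """Register a sample and the tests ordered for it."""
    lab[sample_id] = Sample(sample_id, {t: None for t in tests})

def record_result(sample_id: str, test: str, value: float) -> None:
    lab[sample_id].tests[test] = value

def pending() -> list[tuple[str, str]]:
    """Answer the status question: which tests are still outstanding?"""
    return [(s.sample_id, t) for s in lab.values()
            for t, r in s.tests.items() if r is None]

log_in("S-001", ["pH", "assay"])
record_result("S-001", "pH", 6.8)
print(pending())  # the assay on S-001 is still outstanding
```

A real LIMS layers auditing, security, instrument links, and reporting on top of this core model, but the underlying process, samples moving through ordered tests toward a reportable status, is the same across industries.&lt;br /&gt;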
&lt;br /&gt;
At the time of this writing, there is no standardized definition of an ELN, though a forthcoming update of [[ASTM E1578]] intends to rectify that. However, given the current hype about ELNs, any product with that designation is going to get noticed. Let’s avoid the term and replace it with a set of software types that addresses similar functionality:&lt;br /&gt;
&lt;br /&gt;
1. Scripted execution systems – These are software systems that guide an analyst in the conduct of a procedure (process); examples include Velquest (now owned by [[Vendor:Accelrys, Inc.|Accelrys]]) products and the scripting notebook function (vendor's description) in [[Vendor:LabWare, Inc.|LabWare LIMS]].&lt;br /&gt;
&lt;br /&gt;
2. Journal or diary systems – These are software systems that provide a record of laboratory work; a word processing system might fill this need, although there are products with features specifically designed to assist lab work.&lt;br /&gt;
&lt;br /&gt;
3. Application- or discipline-specific record keeping systems – These are software systems designed for biology, chemistry, mathematics, and other areas, containing features that allow you to record data and text in a variety of forms geared toward the needs of specific areas of science.&lt;br /&gt;
&lt;br /&gt;
This is not an exhaustive list of forms or functionality, but it is sufficient to make a point. The first, scripted execution, is designed around a process or, more specifically, designed to give the user a mechanism to describe the sequential steps in a process so that they can be repeated under strict controls. These do not replace a LIMS but can be used synergistically with one, or with software that duplicates LIMS capabilities (some have suggested [[enterprise resource planning]] or ERP systems as a substitute). The other two types are repositories of lab information: equations, data, details of procedures, etc. There is no general underlying process as there is with LIMS. They can provide a researcher with a means of describing experiments, collecting data, and performing analyses, which you can correctly view as processes, but they are unique to that researcher or lab and not based on any generalized industry model, as we see in testing labs.&lt;br /&gt;
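&lt;br /&gt;
To make the distinction concrete, here is a minimal sketch, not modeled on any vendor's product, of the scripted execution idea: the procedure is expressed as an ordered list of named steps, and each execution leaves an audit trail that can be reviewed against the documented process:&lt;br /&gt;

```python
# Illustrative scripted-execution skeleton: the procedure is data, each
# step runs in sequence, and every action is timestamped for review.
# Step names and values are invented for the example.
from datetime import datetime, timezone

def run_procedure(steps, audit_trail):
    """Execute named steps sequentially, logging each for later review."""
    for name, action in steps:
        result = action()
        audit_trail.append((datetime.now(timezone.utc).isoformat(), name, result))
    return audit_trail

steps = [
    ("weigh sample", lambda: 0.5012),        # grams; illustrative value
    ("dilute to volume", lambda: "100 mL"),
    ("measure absorbance", lambda: 0.437),
]
trail = run_procedure(steps, [])
for timestamp, name, result in trail:
    print(timestamp, name, result)
```

The point of the sketch is the strict control: the analyst cannot reorder or skip steps, and the record of what was done is produced as a side effect of doing it.&lt;br /&gt;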
&lt;br /&gt;
Why is this important? These descriptions illustrate something fundamental: process dictates needs, and needs set requirements for products and technology. The “do we need a LIMS or ELN” question is meaningless without an understanding of the processes that operate in your laboratory.&lt;br /&gt;
&lt;br /&gt;
===The role of processes in integration===&lt;br /&gt;
From the standpoint of equipment, laboratories are often a collection of instruments, computer systems, sample preparation stations, and other test or measurement facilities. One goal frequently stated by lab managers is that “ideally we’d like all this to be integrated.” The purpose of integration is to streamline the operations, reduce human labor, and provide a more efficient way of doing work. You are integrating the equipment and systems used to execute one or more processes. However, without a thorough evaluation of the processes in the lab, there is no basis for integration.&lt;br /&gt;
&lt;br /&gt;
Highlighting the connection between processes and integration, we see why defining what laboratory automation is remains important. One definition can lead to purchasing products and limiting the scope of automation to individual tasks. Another will take you through an evaluation of how your lab works, how you want it to work, and how to produce a framework for getting you there.&lt;br /&gt;
&lt;br /&gt;
==The elements of laboratory technology management==&lt;br /&gt;
If lab automation is a process, we need to look at the elements that can be used to make that process work. The first thing that is needed is a structure that shows how elements of laboratory automation relate to each other and act as guides for someone coming into the field. That structure also serves as a framework for organizing knowledge about the field. The major elements of the structure are shown in Figure 3.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Liscouski ElementsLabTechMan14.png|570px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;570px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3.''' The elements of lab technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Why is it important to recognize these laboratory automation management elements? As automation technology gains broader appeal across multiple industries, its use in labs is inevitable. Vendors are putting chips and programmed intelligence into every product with the goal of making them easier to use, while reducing the role of human judgment (which can lead to an accumulation of errors in tasks like pipetting) and the amount of work people have to do to get work done. There are few negatives to this philosophy, aside from one point: if we don’t understand how these systems work and haven’t planned for their use, we won’t get the most benefit from them, and we may blindly accept results from systems without really questioning their suitability or the results they produce. One of the main points of this work is that the use of automation technologies should be planned and managed. That brings us to the first major element: management issues.&lt;br /&gt;
&lt;br /&gt;
===Management issues===&lt;br /&gt;
The first thing we need to address is who “management” is. Unless the lab is represented solely by you, there are presumably layers of management. Depending on the size of the organization, this may mean one individual or a group of individuals who are responsible for various management aspects of the laboratory. Broadly speaking, those individuals will have a skillset that can appropriately address laboratory technology management, including project and program management, people skills, workflow modeling, and regulatory knowledge, to name a few (Figure 4). The need for these skillsets may vary slightly depending on the type of manager.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Liscouski ElementsLabTechMan14.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 4.''' The skills required of managers when addressing laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Aside from reviewing and approving programs (&amp;quot;programs&amp;quot; are efforts that cover multiple projects), “senior management” is responsible for setting the policies and practices that govern the conduct of laboratory programs. This is part of an overall architecture for managing lab operations, and has two components (Figure 5): setting policies and practices and developing operational models.{{Efn|This topic will be given light treatment in this work, but will be covered in more detail elsewhere.}} &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Liscouski ElementsLabTechMan14.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 5.''' The role of managers when addressing laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Before the incorporation of [[Informatics (academic field)|informatics]] into labs, senior management’s involvement wasn’t necessary. However, the storage of intellectual property in electronic media has forced a significant change in lab work. Before informatics, labs were isolated from the rest of the organization, with formal contact made through the delivery of results, reports, and presentations. The desire for more effective, streamlined, and integrated information technology operations, and the development of information technology support groups, means that labs are now also part of the corporate picture. In organizations that have multiple labs, more efficient use of resources results in a desire to reduce duplication of work. You might easily justify two labs having their own spectrophotometers, but duplicate LIMS doing the same thing is going to require some explaining.&lt;br /&gt;
&lt;br /&gt;
With the addition of informatics in the lab, management involvement is critical to the success of laboratory programs. Among the reasons often cited for the failure of laboratory programs is a lack of management involvement and, by extension, a lack of oversight, though these failures are usually stated without clarifying what management involvement should actually look like. To be clear, senior management's role is to ensure that programs:&lt;br /&gt;
&lt;br /&gt;
* are common across all labs, such that all programs are conducted according to the same set of guidelines;&lt;br /&gt;
* are well-designed, supportable, and can be upgraded;&lt;br /&gt;
* are consistent with good project management practices;&lt;br /&gt;
* are conducted in a way that allows the results to be reused elsewhere in the company;&lt;br /&gt;
* are well-documented; and&lt;br /&gt;
* lead to successful results.&lt;br /&gt;
&lt;br /&gt;
When work in lab automation began, it was usually the effort of one or two individuals in a lab or company. Today, we need a cooperative effort from management, lab staff, IT support, and, if available, laboratory automation engineers (LAEs). One of the reasons management must establish policies and practices (Figure 6) is to enable people to effectively work together, so they are working from the same set of ground rules and expectations and producing consistent and accurate results. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Liscouski ElementsLabTechMan14.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 6.''' The business' policies and practices requiring focus by managers when addressing laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Lab management&amp;quot; is responsible for understanding how their lab needs to operate in order to meet the lab's goals. Before automation became a factor, lab management’s primary concern was managing laboratory personnel and helping them get their work done. In the early stages of lab automation, the technologies were treated as add-ons that assisted personnel in getting work done. Today, lab managers need to move beyond that mindset and look at how their role must shift towards planning for and managing automation systems that take on more of the work in the lab. As a result, lab managers need to take on the role of technology planners in addition to managing people. The implementation of those plans may be carried out by others (e.g., laboratory automation engineers [LAEs], IT specialists), but defining the objectives and how the lab will function with a combination of people and systems is a task squarely in the lap of lab management, requiring the use of operational workflow models (Figure 7) to define the technologies and products suitable for their lab's work.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Liscouski ElementsLabTechMan14.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 7.''' The importance of operational workflow models for lab managers when addressing laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Problems can occur when different operational groups have conflicting sets of priorities. Take for example the case of labs and IT support. At one point, lab computing only affected the lab systems, but today's labs share resources with other groups, requiring &amp;quot;IT management&amp;quot; to ensure that work in one place doesn’t adversely impact another. Most issues around elements such as networks and security can be handled by implementing effective network design and bridges to isolate network segments when necessary. More often than not, problems are likely to occur in the choice and maintenance of products, and in the policies IT implements to provide cost-effective support. Operating system upgrades are one place issues can occur, if those changes cause products used in lab work to break because the vendor is slow in responding to OS changes. Another place that issues can occur is in product selection; IT managers may want to minimize the number of vendors they have to deal with and prefer products that use the same database system deployed elsewhere. That policy may adversely limit the products that the lab can choose from. From the lab's perspective, they need the best products in order to get their work done; from the IT group's perspective, those products drive up support costs. The way to avoid these and other issues is for senior managers to determine the priorities and keep inter-department politics out of the discussion.&lt;br /&gt;
&lt;br /&gt;
===Classes of laboratory automation implementation===&lt;br /&gt;
The second element of laboratory technology management to address is laboratory automation implementation. There are three classes of implementation to address (Figure 8):&lt;br /&gt;
&lt;br /&gt;
* ''Computer-controlled experiments'': This includes data collection in high-energy physics, LabVIEW-implemented systems, instrument command and control, and robotics. This implementation class involves systems where the computer is an integral part of the experiment, doing data collection and/or experiment control.&lt;br /&gt;
* ''Computer-assisted lab work/experiments'': This includes work that could be done without a computer, but machines and software have been added to improve the process. Examples include [[chromatography data system]]s (CDS), ELNs used for documentation, and classic LIMS.&lt;br /&gt;
* ''Scientific manufacturing'': This implementation class focuses on production systems, including high-throughput screening, lights-out lab automation, process analytical technologies, and quality by design (QbD) initiatives.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Liscouski ElementsLabTechMan14.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 8.''' The three classes of laboratory automation implementation to consider for laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Computer-controlled experiments====&lt;br /&gt;
This is where the second phase of laboratory automation began: when people began using digital computers with connections to their instrumentation. We moved in stages from simple data acquisition, acquiring a few points to show that we could accurately represent voltages from the equipment, to collecting multiple data streams over time and storing the results on disks. The next step consisted of automatically collecting the data, processing it, storing it, and reporting results from equipment such as chromatographs, spectrophotometers, and other devices.&lt;br /&gt;
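&lt;br /&gt;
That acquire–process–report progression can be illustrated with a small sketch; the simulated Gaussian detector signal and trapezoidal peak integration are assumptions chosen for the example, not a description of any particular instrument or data system:&lt;br /&gt;

```python
# Illustrative acquire -> process -> report pipeline. A simulated
# detector trace (Gaussian peak on a flat baseline) stands in for
# real digitized hardware voltages.
import math

def acquire(n_points=200, dt=0.1):
    """Acquire: simulate digitized detector voltages sampled every dt seconds."""
    return [0.01 + 1.0 * math.exp(-((i * dt - 10.0) ** 2) / 2.0)
            for i in range(n_points)]

def peak_area(signal, dt=0.1, baseline=0.01):
    """Process: integrate the baseline-corrected signal (trapezoid rule)."""
    corrected = [v - baseline for v in signal]
    return sum((a + b) / 2.0 * dt for a, b in zip(corrected, corrected[1:]))

signal = acquire()
area = peak_area(signal)
print(f"peak area = {area:.3f} V*s")  # report; analytically ~sqrt(2*pi) ~ 2.507
```

Early systems performed only the first step; the progression the text describes amounts to folding the remaining steps (processing, storage, reporting) into the same automated chain.&lt;br /&gt;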
&lt;br /&gt;
Software development moved from assembly language programming to higher-level languages, and then to specialized systems that provide a graphical interface to the programmer. Products like LabVIEW{{Efn|LabVIEW is a product of [https://www.ni.com/en-us/shop/labview.html National Instruments]; similar products are available from other vendors.}} allow the developer to use block diagrams to describe the programming and processing that have to be done, and provide the user with an attractive interface with which to work. This is a far cry from embedding machine language programming in BASIC code, as was done in some earlier PC systems.&lt;br /&gt;
&lt;br /&gt;
Robots offer another example of this class of work, where computers control the movement of equipment and materials through a process that prepares samples for analysis, and may include the analysis itself.&lt;br /&gt;
&lt;br /&gt;
While commercial products have taken over much of the work of interfacing, data acquisition, and data processing (in some cases, the instrument-computer combination is almost indistinguishable from the instrument itself), the ability to deal with instrument interfacing and programming is still an essential skillset for those working in research and applications where commercial systems have yet to be developed.&lt;br /&gt;
&lt;br /&gt;
It’s interesting that people often look at modern laboratory instrumentation and say that everything has gone digital. That’s far from the case. They may point to a single pan balance or thermometer with a digital readout as examples of a “digital” instrument, not realizing that the packaging contains sensors, analog-digital (A/D) converters, and computer control systems to manage the device and its communications. The appearance of a “digital” device masks what is going on inside; we still need to understand the transition from the analog world into the digital domain.&lt;br /&gt;
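&lt;br /&gt;
The analog-to-digital transition described above can be sketched in a few lines. This is a minimal illustration, assuming a 12-bit converter and a 5 V reference (illustrative values, not any particular instrument), of how a “digital” readout maps raw converter counts back to a physical quantity:&lt;br /&gt;

```python
# Minimal sketch of the analog-to-digital transition: a "digital" balance or
# thermometer internally converts a sensor voltage to raw ADC counts, and its
# firmware maps those counts back to a value for the display. The 12-bit depth
# and 5.0 V reference are illustrative assumptions, not a real device spec.

def counts_to_voltage(raw_counts, v_ref=5.0, bits=12):
    """Map a raw ADC count onto the 0..v_ref voltage range."""
    full_scale = 2 ** bits - 1  # 4095 for a 12-bit converter
    return raw_counts / full_scale * v_ref

# A mid-scale count on a 12-bit, 5 V converter reads back as roughly 2.5 V.
voltage = counts_to_voltage(2048)
```

Even this toy version makes the point: the number on the display is the product of a sensor, a converter, and software, and each stage can introduce error worth understanding.&lt;br /&gt;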
&lt;br /&gt;
====Computer-assisted lab work/experiments====&lt;br /&gt;
There is a difference between work that can be done by computer and work that has to be done by computer; we just looked at the latter case. There’s a lot of work that goes on in laboratories that could be done and has been done by people, but in today’s labs we prefer to do that work with the aid of automated systems. For example, the management of samples in a testing laboratory used to be done by people logging samples in and keeping record books of the work that is scheduled and has been performed. Today, that work is better managed by a LIMS (or a LIS in the clinical world). The analysis of instrumental data used to be done by hand, and now it is more commonly done by instrument data systems that are faster, more accurate, and permit more complex analysis at a lower cost.&lt;br /&gt;
&lt;br /&gt;
Do robots fit in this category? One could argue they are simply doing work that was performed manually in the past. The reason we might not consider robots in this class is that in many cases the equipment used by the robots is different from the equipment used by human beings. As such, the two really aren’t interchangeable. If the LIMS or instrument data system were down, people could pick up the work manually; however, that may not be the case if a robot goes offline. It’s a small point, and you can go either way on this.&lt;br /&gt;
&lt;br /&gt;
The key element for this class of implementation is that the use of computers—and, if you prefer, robotics—is an option and not a requirement. Yet that option improves productivity, reduces cost, and provides better quality and more consistent data.&lt;br /&gt;
&lt;br /&gt;
====Scientific manufacturing====&lt;br /&gt;
This isn’t so much a new implementation class as it is a formal recognition that much of what goes on in a laboratory mirrors work that’s done in manufacturing; some work in quality control is so routine that it matches assembly line work of the 1960s. The major difference, however, is that a lab's major production output is knowledge, information, and data (K/I/D). The work in this category is going to expand as a natural consequence of increasing automation, which must be addressed. If this is the direction things are going to go, then we need to do it right.&lt;br /&gt;
&lt;br /&gt;
Recognizing this point has significant consequences. Rather than just letting things evolve, we can take advantage of the situation and deliberately bring applications suited to this level of automation into useful practice. This means we must:&lt;br /&gt;
&lt;br /&gt;
* convert laboratory methods to fully automated systems; &lt;br /&gt;
* deliberately design and manage equipment and control, acquisition, and processing systems to meet the needs of this kind of application;&lt;br /&gt;
* train people to work in a more complex environment than they have been accustomed to; and &lt;br /&gt;
* build the automation infrastructure (i.e., interfacing and data standards) needed to make these systems realizable and effective, without taking on significant cost.&lt;br /&gt;
&lt;br /&gt;
In short, this means opening up another dimension of laboratory work as a natural evolution of work practices. If you look at the direction things are going in the lab, where large sample volume processing is necessary (e.g., high-throughput screening), this is simply a reflection of reality.&lt;br /&gt;
&lt;br /&gt;
If we look at quality control applications and manufacturing processes, we basically see one production process ([[quality control]] or QC) layered on another (production), where ultimately the two merge into continuous production and testing. This is a logical conclusion to work described by process analytical technologies and quality-by-design.&lt;br /&gt;
&lt;br /&gt;
This doesn’t diminish the science used in laboratory work; rather, it adds a level of sophistication that hasn’t been widely dealt with: thinking beyond the basic science process to its implementation in a continuous automated system. This is a much more complex undertaking since 1. data will be created at a high rate and 2. we want to be sure that this is high-quality data and not just the production of junk at a high rate.&lt;br /&gt;
&lt;br /&gt;
This type of thinking is not limited to quality control work. It can be readily applied to research as well, where economical high-volume experiments can be used to support statistical experimental design methodologies and more exhaustive sample processing, as well as today’s screening applications in life sciences. It is also readily applied to environmental monitoring and regulatory evaluations. And while this kind of thinking may be new to scientific applications, it isn’t new technology. Work that has been done in automated manufacturing can serve as a template for the work that has to be done in laboratory process automation.&lt;br /&gt;
&lt;br /&gt;
Let's return to the first two bullets in this section: laboratory method conversion and system management and planning. If you gave four labs a method description and asked them to automate it, you’d get four varied implementations: four groups doing the same thing independently. If we are going to turn lab automation into the useful tool it can be, we need to take a different approach: cooperative development of automated systems.&lt;br /&gt;
&lt;br /&gt;
In order to be useful, a description of a fully automated system needs more than a method description. It needs equipment lists, source code, etc., in sufficient detail that you can purchase the needed equipment and put the system together expecting it to work. However, we can do better than that. In a given industry, where labs are doing the same testing on the same types of samples, we should be able to have them come together and design and test automated systems to meet the need. Once that is done, vendors can pick up the description and build products suitable for carrying out the analysis or test. The problem labs face is getting the work done at a reasonable cost. If there isn’t a competitive advantage to having a unique test, cooperate so that standardized modules for testing can be developed.&lt;br /&gt;
&lt;br /&gt;
This changes the process of lab automation from a build-it-from-scratch mentality to the planned connection of standardized automated components into a functioning system.&lt;br /&gt;
&lt;br /&gt;
===Experimental methods===&lt;br /&gt;
There is relatively little that can be said about experimental methods at this point. Aside from the clinical industry, not enough work has been done to give really good examples of intentionally designed automated systems that can be purchased, installed in a lab, and expected to function.{{Efn|Having a data system connected to, or in control of, a process is not the same as full automation. For example, there are automated Karl Fischer systems (for water analysis), but they only address titration activities and not sample preparation. A vendor can only take things so far in commercial products unless labs describe a larger role for automation, one that will vary by application. The point is we need a formalized description of that larger context.}} There are some examples, including ELISA robotics analysis systems from Caliper Life Sciences and pressurized liquid extraction systems from Fluid Management Systems. However, many laboratory methods are designed with the assumption that people will be doing the work manually, and any addition of automation would require conversion of that method (Figure 9).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Liscouski ElementsLabTechMan14.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 9.''' The consideration of a lab's experimental methods for laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
What we need to begin doing is looking at the development of automated methods as a distinct task, similar to the publication of manual methods by [[ASTM International]] (ASTM), the [[United States Environmental Protection Agency|Environmental Protection Agency]] (EPA), and the United States Pharmacopeia (USP), with the difference being that automation is not viewed as mimicking human actions but as well-designed and optimized systems that support the science and any associated production processes (i.e., a scientific manufacturing implementation that includes integration with informatics systems). We need to think “bigger,” not limiting our vision to just the immediate task but looking at how it fits into lab-wide operations.&lt;br /&gt;
&lt;br /&gt;
===Lab-specific technologies and information technology===&lt;br /&gt;
This section quickly covers the next two elements of laboratory technology management at the same time, as many aspects of these elements have already been covered elsewhere. See Figure 10 and Figure 11 for a deeper dive into these two elements.{{Efn|For more information about lab-specific technologies, please refer to ''[https://www.researchgate.net/publication/275351757_Computerized_Systems_in_the_Modern_Laboratory_A_Practical_Guide Computerized Systems in the Modern Laboratory: A Practical Guide]''.}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig10 Liscouski ElementsLabTechMan14.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 10.''' Reviewing the many aspects of lab-specific technologies for laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11 Liscouski ElementsLabTechMan14.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 11.''' Consideration of information technologies for laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
There is one point that needs to be made with respect to information technologies. While lab managers do not need to understand the implementation details of those technologies, they do need to be aware of the potential they offer in the development of a structure for laboratory automation implementations. Management is responsible for lab automation planning, including choosing the best technologies; in other words, management must manage the “big picture” of how technologies are used to meet their lab's purpose.&lt;br /&gt;
&lt;br /&gt;
In particular, managers should pay close attention to the role of client-server systems and virtualization, since they offer design alternatives that impact the choice of products and the options for managing technology. This is one area where good relationships with IT departments are essential. We’ll be addressing these and other information technologies in more detail in other publications.&lt;br /&gt;
&lt;br /&gt;
===Systems integration===&lt;br /&gt;
Systems integration is the final element of laboratory technology management, one that has been dealt with at length in other areas.&amp;lt;ref name=&amp;quot;TriggTheIntegArch14&amp;quot;&amp;gt;{{cite web |url=http://www.theintegratedlab.com/ |archiveurl=https://web.archive.org/web/20141218063422/http://www.theintegratedlab.com/ |title=The Integrated Lab |author=Trigg, J. |date=2014 |archivedate=18 December 2014}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;LiscouskiInteg12&amp;quot;&amp;gt;{{cite journal |title=Integrating Systems |journal=Lab Manager |author=Liscouski, J. |volume=7 |issue=1 |pages=26–9 |year=2012 |url=https://www.labmanager.com/computing-and-automation/integrating-systems-17595}}&amp;lt;/ref&amp;gt; Many of the points noted above, particularly in the management sections, demand that attention be paid to integration in order to develop systems that work well. When systems are planned, they need to be planned with an eye toward integrating the components, something today’s technologies are largely not yet capable of supporting (aside from those built around microplates and clinical chemistry applications). This isn’t going to happen magically, nor is it the province of vendors to define it. This is a realm the user community has to address by defining the standards and methodologies for integration (Figure 12). The planning that managers do as part of technology management has to be done with an understanding of the role integration plays and an ability to choose solutions that lead to well-designed integrated systems. The concepts behind scientific manufacturing depend on it, just as integration is required in any efficient production process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig12 Liscouski ElementsLabTechMan14.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 12.''' Addressing systems integration for laboratory technology management&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The purpose of integration in the lab is to make it easier to connect systems. For example, a CDS may need to pass information and data to a LIMS or ELN and then on to other groups. The resulting benefits of this ability to integrate systems include:&lt;br /&gt;
&lt;br /&gt;
* smoother workflows, meaning less manual effort and the avoidance of duplicate data entry and data entry errors, something striven for and being accomplished in production environments, including manufacturing, video production, and graphic design;&lt;br /&gt;
* an easier path to meeting regulatory requirements, as systems with integration built in by vendors are easier to validate and maintain;&lt;br /&gt;
* reduced cost of development and support;&lt;br /&gt;
* reduction in duplication of records via better data management; and&lt;br /&gt;
* more flexibility, as integrated systems built on modular components will make it easier to upgrade or update systems, and meet changing requirements.&lt;br /&gt;
&lt;br /&gt;
The inability to integrate systems and components through vendor-provided mechanisms results in higher development and support costs, increased regulatory burden, and reduced likelihood that projects will be successful.&lt;br /&gt;
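&lt;br /&gt;
As a concrete, purely hypothetical illustration of the CDS-to-LIMS handoff mentioned above, the sketch below passes a result as a structured record so it is entered once and never re-keyed; the function and field names are assumptions for illustration, not any vendor’s API:&lt;br /&gt;

```python
# Hypothetical sketch of vendor-supported integration: a CDS result is
# captured once as a structured record, then ingested by a LIMS without
# manual re-entry. All names are illustrative, not a real product interface.

def export_cds_result(sample_id, analyte, concentration, units):
    """Package a chromatography result for any downstream consumer."""
    return {"sample_id": sample_id, "analyte": analyte,
            "concentration": concentration, "units": units}

def lims_ingest(record, lims_db):
    """File the record under its sample ID; no duplicate data entry."""
    lims_db.setdefault(record["sample_id"], []).append(record)

lims_db = {}
lims_ingest(export_cds_result("S-1042", "caffeine", 3.2, "mg/L"), lims_db)
```

The point of the sketch is the contract between the two systems: as long as both sides agree on the record’s structure, the result flows through without the manual transcription that introduces errors.&lt;br /&gt;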
&lt;br /&gt;
====What is an integrated system in the laboratory?====&lt;br /&gt;
Phrases like “integrated system” are used so commonly that it seems as though there should be instant recognition of what they are. While the words may bring a concept to mind, do we have the same concept in mind? For the sake of this discussion, the concept of an integrated system has several characteristics. First, in an integrated system, a given piece of information is entered once and then becomes available throughout the system, restricted only by access privileges. The word “system” in this case is the summation of all the information handling equipment in the lab. It may extend beyond the lab if process connections to other departments are needed. Second, the movement of materials (e.g., during sample preparation), information, and data is continuous from the start of a process through to the end of that process, without the need for human effort. The sequence doesn’t have to wait for someone to do a manual portion of the process in order for it to continue, aside from policy conditions that require checks, reviews, and approvals before subsequent steps are taken.&lt;br /&gt;
&lt;br /&gt;
An integrated system should also result in a better place for personnel to work, since humans would no longer be depended upon for repetitive work. Leaving repetitive work to personnel has several drawbacks. First, people get bored and make mistakes (some minor, some not, both of which contribute to variability in results), and second, the progress of work (productivity) is dependent on human effort, which may limit the number of hours a process can operate. More broadly, it's also a poor way of using intelligent, educated personnel.&lt;br /&gt;
&lt;br /&gt;
====A brief historical note====&lt;br /&gt;
For those who are new to the field, we’ve been working on system integration for a long time, with not nearly as much to show for it as we’d expect, particularly when compared to other fields that have seen an infusion of computer technologies. During the 1980s, the pages of ''Analytical Chemistry'' saw the initial ideas that would shape the development of automation in chemistry. Dr. Ray Dessy’s (then at Virginia Polytechnic Institute) articles on LIMS, robotics, networking, and instrument data systems laid out the promise and expectations for electronic systems used to acquire and manage the flow of data and information throughout the lab.&lt;br /&gt;
&lt;br /&gt;
That concept—the computer-integrated lab—was the basis of work by instrument and computer vendors, resulting in proof-of-concept displays and exhibits at PITTCON and other trade shows. After more than 30 years, we are still waiting for that potential to be realized, and we may not be much closer today than we were then. What we have seen is an increase in the sophistication of the tools available for lab work, including client-server chromatography systems and ELNs in their varied forms. In each case, we keep running into the same problem: an inability to connect things into working systems. The result is the use of product-specific code and workarounds for moving and parsing data streams. These are fixes, not solutions. Solutions require careful design, not just for the short-term &amp;quot;what do we need today&amp;quot; but also for the long term: robust designs that permit graceful upgrades and improvements without the need to start over from scratch.&lt;br /&gt;
&lt;br /&gt;
====The cost of the lack of progress====&lt;br /&gt;
Every day the scientists and technicians in your labs are working to produce the knowledge, information, and data (K/I/D) your company depends upon to meet its goals. That K/I/D is recorded in notebooks and electronic systems. How well are those systems going to support your need for access today, tomorrow, or over the next 20 or more years? This is the minimum most companies require for guaranteed access to data.&lt;br /&gt;
&lt;br /&gt;
The systems being put in place to manage laboratory K/I/D are complex. Most laboratory data management systems (i.e., LIMS, ELN, IDS) are a combination of four separate products: hardware, operating system, database management system, and the application you and your staff use, each from a different company with its own product life cycle. This means that changes can occur at any of those levels, asynchronously, without consideration for the impact they have on your ability to work.&lt;br /&gt;
&lt;br /&gt;
Lab managers are usually trained in the sciences and personnel aspects of laboratory management. They are rarely trained in technology management and planning for laboratory robotics and informatics, the tools used today to get laboratory work done and manage the results. The consequences of inadequate planning can be significant:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;In January 2006, the FBI ended the LIMS project, and in March 2006 the FBI and JusticeTrax agreed to terminate the contract for the convenience of the government. The FBI agreed to pay a settlement of $523,932 to the company in addition to the money already spent on developing the system and obtaining hardware. Therefore, the FBI spent a total of $1,380,151 on the project. With only the hardware usable, the FBI lost $1,175,015 on the unsuccessful LIMS project.&amp;lt;ref name=&amp;quot;OIGTheFed06&amp;quot;&amp;gt;{{cite web |url=https://oig.justice.gov/reports/FBI/a0633/index.htm |title=Executive Summary |work=The Federal Bureau of Investigation's Implementation of the Laboratory Information Management System - Audit Report 06-33 |author=Office of the Inspector General |date=June 2006}}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Other instances of problems during laboratory informatics projects include:&lt;br /&gt;
&lt;br /&gt;
* A 2006 Association for Laboratory Automation survey on industrial laboratory automation asked respondents whether their company’s or organization’s senior management felt its investment in laboratory automation had succeeded in delivering the expected benefits (56% agreed), produced mixed results (43%), or had not delivered the expected benefits (1%). In other words, 44% of investments failed to fully realize expectations.&amp;lt;ref name=&amp;quot;Hamilton2006&amp;quot;&amp;gt;{{cite journal |title=2006 ALA Survey on Industrial Laboratory Automation |journal=SLAS Technology |author=Hamilton, S.D. |volume=12 |issue=4 |pages=239–46 |year=2007 |doi=10.1016/j.jala.2007.04.003}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* A long-circulated statistic says that some 60 percent of LIMS installations fail.&amp;lt;ref name=&amp;quot;SCWChoosing02&amp;quot;&amp;gt;{{cite web |url=http://www.scientific-computing.com/features/feature.php?feature_id=132 |archiveurl=https://web.archive.org/web/20071019002613/http://www.scientific-computing.com/features/feature.php?feature_id=132 |title=Choosing the right client |author=Scientific Computing World |work=Scientific Computing World |date=September/October 2002 |archivedate=19 October 2007}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
*  The Standish Group's CHAOS Report (1995) on project failures (looking at ERP implementations) shows that over half will fail, and 31.1% of projects will be canceled before they ever get completed. Further results indicate 52.7% of projects will cost over 189% of their original estimates.&amp;lt;ref name=&amp;quot;SGITheCHAOS95&amp;quot;&amp;gt;{{cite web |url=https://www.csus.edu/indiv/r/rengstorffj/obe152-spring02/articles/standishchaos.pdf |format=PDF |title=The CHAOS Report |publisher=Standish Group International, Inc |date=1995}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From a more anecdotal standpoint, we’ve received a number of emails discussing the results of improperly managing projects. Stories that stand out among them include:&lt;br /&gt;
&lt;br /&gt;
* An anonymous LIMS customer was given a fixed-price quote of around $90,000 and was then hit with several hundred thousand dollars in extras after the contract was signed.&lt;br /&gt;
* Some years back, an anonymous major pharmaceutical company implemented a heavily customized LIMS that was generally considered successful, until it came time to upgrade. They couldn’t do it and went back to square one, requiring the purchase of another system.&lt;br /&gt;
* An anonymous business reported robotics system failures totaling over $700,000.&lt;br /&gt;
* Some report that vendors are using customer sites as test beds for software development.&lt;br /&gt;
* A group of three different types of labs with differing requirements tried to use the same system to reduce costs; nearly $500,000 was spent before the project was cancelled.&lt;br /&gt;
&lt;br /&gt;
In addition to those costs, there are the costs of missed opportunities, project delays, departmental and employee frustration, and the fact that the problems you wanted to solve are still sitting there.&lt;br /&gt;
&lt;br /&gt;
The causes for failures are varied, but most include factors that could have been avoided by making sure those involved were properly trained. Poor planning, unrealistic goals, inadequate specifications (including a lack of regulatory compliance requirements), project management difficulties, scope creep, and lack of experienced resources can all play a part in a failed laboratory technology project. The lack of features that permit the easy development of integrated systems can also be added to that list. That missing element can cause projects to balloon in scope, requiring people to take on work that they may not be properly prepared for, or projects that are not technically feasible, something developers don’t realize until they are deeply involved in the work.&lt;br /&gt;
&lt;br /&gt;
The methods people use today to achieve integration result in cost overruns, project failures, and systems that can’t be upgraded or modified without significant risk to the integrity of the existing system. One individual reported that his company’s survey of customers found that systems were integrated in ways that prevented upgrades or updates; the coding was specific to a particular version of software, and any changes could mean scrapping the current system and starting over.&lt;br /&gt;
&lt;br /&gt;
One way of achieving “integration” is similar to how one might integrate household wiring by hard-wiring all the lamps, appliances, TVs, etc. to the electrical cables. Everything is integrated, but change isn’t possible without shutting off the power to everything, going into the wall to make the wiring changes, and then repairing the walls and turning things back on. When considering systems integration, that’s not the model we want; however, from the comments we’ve received, it is the way people are implementing software. We’re looking for the ability to connect things in ways that permit change, like the wiring in most households: plug and unplug. That level of compatibility and integration results from the development of standards for power distribution and for connections: the design of plugs and sockets for specific voltages, phasing, and polarity, so that the right type of power is supplied to the right devices.&lt;br /&gt;
&lt;br /&gt;
Of course, there are other ways to practically connect systems, new and old. The Universal Serial Bus (USB) standard allows the same connector to be used for connecting portable storage, cameras, scanners, printers, and other communications devices with a computer. Another older example can be found with modular telephone jacks and tone dialing, which evolved to the more mobile system we have today. However, we probably wouldn't have the level of sophistication we have now if we relied on rotary dials and hard-wired phones.&lt;br /&gt;
&lt;br /&gt;
These are just a few examples of component connections that can lead to systems integration. When we consider integrating systems in the lab, we need to look at connectivity and modularity (allowing us to make changes without tearing the entire system apart) as goals.&lt;br /&gt;
&lt;br /&gt;
====What do we need to build integrated systems?====&lt;br /&gt;
The lab systems we have today are not built for system-wide integration. They are built by vendors and developers to accomplish a specific set of tasks; connections to other systems are either not considered or avoided for competitive reasons. If we want to consider the possibility of building integrated systems, there are at least five elements that are needed:&lt;br /&gt;
&lt;br /&gt;
# Education&lt;br /&gt;
# User community commitment&lt;br /&gt;
# Standards (e.g., file formatting, messaging, interconnection)&lt;br /&gt;
# Modular systems&lt;br /&gt;
# Stable operating system environment&lt;br /&gt;
&lt;br /&gt;
'''Education''': Facilities with integrated systems are built by people trained to do it. This has been discussed within the concept of LAEs, published in 2006.&amp;lt;ref name=&amp;quot;LiscouskiAreYou06&amp;quot;&amp;gt;{{cite journal |title=Are You a Laboratory Automation Engineer? |journal=SLAS Technology |author=Liscouski, J.G. |volume=11 |issue=3 |pages=157-162 |year=2006 |doi=10.1016/j.jala.2006.04.002}}&amp;lt;/ref&amp;gt;{{Efn|You can also find the expanded version of the paper ''Are You a Laboratory Automation Engineer'' [[LII:Are You a Laboratory Automation Engineer?|here]].}} However, the educational issues don’t stop there. Laboratory management needs to understand their role in technology management. It isn’t enough to understand the science and how to manage people, as was the case 30 or 40 years ago. Managers have to understand how the work gets done and what technology is used to do it. The effective use (or the unintended misuse) of technologies can have as big an impact on productivity as anything else. The science also has to be adjusted for advanced lab technologies. Method development should be done with an eye toward method execution, asking &amp;quot;can this technique be automated?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
'''User community commitment''': Vendors and developers aren’t going to provide the facilities needed for integration unless the user community demands them. Suppliers are going to have to spend resources in order to meet the demands for integration, and they aren’t going to do this unless there is a clear market need and users force them to meet that need. If we continue with “business as usual” practices of force fitting things together and not being satisfied with the result, where is the incentive for vendors to spend development money? The choices come down to these: you only purchase products that meet your needs for integration, you spend resources trying to integrate systems that aren’t designed for it, or your labs continue to operate as they have for the last 30 years, with incremental improvements.&lt;br /&gt;
&lt;br /&gt;
'''Standards''': Building systems that can be integrated depends upon two elements in particular: standardized file formats, and messaging or interconnection systems that permit one vendor’s software package to communicate with another’s.&lt;br /&gt;
&lt;br /&gt;
First, the output of an instrument should be packaged in an industry standardized file format that allows it to be used with any appropriate application. The structure of that file format should be published and include the instrument output plus other relevant information such as date, time, instrument ID, sample ID (read via barcode or other mechanism), instrument parameters, etc. Digital cameras have a similar setup for their raw data files: the pixel data and the camera metadata that tells you everything about the camera used to take the shot.&lt;br /&gt;
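&lt;br /&gt;
As an illustration of such a self-describing package, here is a minimal sketch of an instrument data record that carries the detector output together with the metadata described above. All field names and values are hypothetical; they are not taken from any published standard.&lt;br /&gt;

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List

# Hypothetical, vendor-neutral container for one instrument run: the raw
# detector output plus the contextual metadata (date/time, instrument ID,
# sample ID, instrument parameters) noted above.
@dataclass
class InstrumentDataFile:
    instrument_id: str
    sample_id: str                  # e.g., read via barcode
    timestamp: str                  # ISO 8601 acquisition date/time
    parameters: Dict[str, object]   # instrument settings at acquisition
    readings: List[float]           # the raw detector output

    def to_json(self) -> str:
        """Serialize to a published, self-describing structure."""
        return json.dumps(asdict(self), indent=2)

# Example record; any application that knows the published structure can
# read it back without the originating vendor's software.
record = InstrumentDataFile(
    instrument_id="GC-07",
    sample_id="S-2024-0113",
    timestamp="2024-06-21T02:31:10Z",
    parameters={"column": "DB-5", "flow_mL_min": 1.2},
    readings=[0.01, 0.02, 0.85, 0.12],
)
parsed = json.loads(record.to_json())
```

Because the structure is published, the round trip from serialized form back to usable data needs nothing vendor-specific, which is the point of a standardized format.&lt;br /&gt;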
&lt;br /&gt;
In the 1990s, the Analytical Instrument Association (AIA) (now the Analytical and Life Science Systems Association) had a program underway to develop a set of file format standards for chromatography and mass spectrometry. The program made progress and was turned over to ASTM, where momentum stalled. It was a good first attempt, but several problems with it bear noting. The first problem is found in the name of the standard: the Analytical Data Interchange (ANDI) standard.&amp;lt;ref name=&amp;quot;ANDISF03Arch&amp;quot;&amp;gt;{{cite web |url=http://andi.sourceforge.net/ |archiveurl=https://web.archive.org/web/20030804151150/http://andi.sourceforge.net/ |title=Analytical Data Interchange |work=SourceForge |author=Julian, R.K. |date=22 July 2003 |archivedate=04 August 2003}}&amp;lt;/ref&amp;gt; It was viewed as a means of transferring data between instrument systems and served as a secondary file format, with each instrument vendor’s own format remaining primary. This has regulatory implications, since the [[Food and Drug Administration]] (FDA) requires that the primary data be stored and used to support submissions. It also means that files would have to be converted between formats as data moved between systems.&lt;br /&gt;
&lt;br /&gt;
A standardized file format would be ideal for an instrumental technique. Data collected from an instrument would be stored in that format, which each vendor would implement and use. In fact, it would be feasible to have a circuit board in an instrument that would function as a network node. It would collect and store instrument data and forward it to another computer for long-term storage, analysis, and reporting, thus separating data collection and use. A similar situation currently exists with instrument vendors that use networked data collection modules. The issue is further complicated by the nature of analytical work. A data file is meaningless without its associated reference material: standards, calibration files, etc., that are used to develop calibration curves and evaluate qualitative and quantitative results.&lt;br /&gt;
&lt;br /&gt;
While file format standards are essential, so is a second-order description: sample set descriptors that provide a context for each sample’s data file (e.g., a sample set might be a sample tray in an autosampler, and the descriptor would be a list of the tray’s contents). Work is underway for the development of another standard for laboratory data: ASTM WK23265 - New Specification for Analytical Information Markup Language. Its description indicates that it does take the context of the sample—its relationship to other samples in a run or tray—into account as part of the standard description.&amp;lt;ref name=&amp;quot;ASTMWK23265Arch&amp;quot;&amp;gt;{{cite web |url=https://www.astm.org/DATABASE.CART/WORKITEMS/WK23265.htm |archiveurl=https://web.archive.org/web/20130813072546/https://www.astm.org/DATABASE.CART/WORKITEMS/WK23265.htm |title=ASTM WK23265 - New Specification for Analytical Information Markup Language |publisher=ASTM International |archivedate=13 August 2013}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
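&lt;br /&gt;
As a rough sketch of what such a second-order descriptor could contain (the structure and field names here are hypothetical, not drawn from the ASTM work item):&lt;br /&gt;

```python
# Hypothetical sample set descriptor: a run-level record giving each data
# file its context (position in the tray, role in the run).
tray_descriptor = {
    "run_id": "RUN-042",
    "instrument_id": "GC-07",
    "positions": [
        {"pos": 1, "role": "blank",    "file": "run042_p01.dat"},
        {"pos": 2, "role": "standard", "file": "run042_p02.dat", "conc_ppm": 10.0},
        {"pos": 3, "role": "standard", "file": "run042_p03.dat", "conc_ppm": 50.0},
        {"pos": 4, "role": "sample",   "file": "run042_p04.dat", "sample_id": "S-101"},
    ],
}

# A single data file is meaningless on its own; the descriptor tells an
# application which files are calibration standards and which are unknowns.
standards = [p for p in tray_descriptor["positions"] if p["role"] == "standard"]
unknowns = [p for p in tray_descriptor["positions"] if p["role"] == "sample"]
```

With this context, software can locate the standards needed to build a calibration curve before it ever opens a sample’s data file.&lt;br /&gt;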
&lt;br /&gt;
The second problem with the AIA’s program was that it was vendor-driven with little user participation. The transfer to ASTM should have resolved this, but by that point user interest had waned. People had to buy systems and they couldn’t wait for standards to be developed and implemented. The transition from proprietary file formats to standardized formats has to be addressed in any standards program.&lt;br /&gt;
&lt;br /&gt;
The third issue with the program involved standards testing. Before you ask customers to commit their work to a vendor’s implementation of a standard, they should have assurance, through an independent third party, that things work as expected.&lt;br /&gt;
&lt;br /&gt;
'''Modular systems''': The previous section notes that vendors have to assume that their software may be running in a stand-alone environment in order to ensure that all of the needed facilities are available to meet the user's needs. This can lead to duplication of functions. A multi-user instrument data system and a LIMS both have a need for sample login. If both systems exist in the lab, you’ll have two sample login systems. The issue can be compounded even further with the addition of more multi-instrument packages.&lt;br /&gt;
&lt;br /&gt;
Why not break down the functionality in a lab and use one sample login module? It is simply a multi-user database system. If we were to do a functional analysis of the elements needed in a lab, with an eye toward eliminating redundancy and duplication while designing components as modules, integration would be a simpler issue. A modular approach—a system with a login module, lab management module, modules for data acquisition, chromatographic analysis, spectra analysis, etc.—would provide a more streamlined design, with the ability to upgrade functionality as needed. For example, a new approach to chromatographic peak detection, peak deconvolution, could be integrated into an analysis method without having to reconstruct the entire data system.&lt;br /&gt;
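&lt;br /&gt;
As a minimal sketch of the shared-module idea (the class and its methods are hypothetical, not any vendor’s API), one sample login module that both a LIMS and an instrument data system could call might look like this:&lt;br /&gt;

```python
# Hypothetical shared sample-login module: one multi-user registry that a
# LIMS and an instrument data system could both call, instead of each
# carrying its own duplicate login function.
class SampleLoginModule:
    def __init__(self):
        self._samples = {}
        self._next = 1

    def login(self, submitter: str, description: str) -> str:
        """Register a sample and return its lab-wide ID."""
        sample_id = f"S-{self._next:05d}"
        self._next += 1
        self._samples[sample_id] = {
            "submitter": submitter,
            "description": description,
        }
        return sample_id

    def lookup(self, sample_id: str) -> dict:
        """Any client system resolves the same ID to the same record."""
        return self._samples[sample_id]

# An ID issued through one client (say, the LIMS) is immediately visible
# to the other (the instrument data system), because there is one module.
shared = SampleLoginModule()
sid = shared.login("QC lab", "stability batch 12")
```

The design choice being illustrated is the elimination of redundancy: one login module, one ID space, many client applications.&lt;br /&gt;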
&lt;br /&gt;
When people talk about modular applications, the phrase “LEGO-like” comes to mind. It is a good illustration of what we’d like to accomplish. The easily connectable blocks and components can be assembled into a wide variety of items, all based on a simple standardized connection concept. There are two differences that we need to understand. First, with LEGOs, almost everything connects; in the lab, connections need to make sense. Secondly, LEGOs are a single-vendor solution; unless you’re the vendor, that isn’t a good model. A LEGO-like multi-source model (including open source) of well-structured, well-designed, and well-supported modules that could be connected or configured by the user would be an interesting approach to the development of integrable systems.&lt;br /&gt;
&lt;br /&gt;
Modularity would also be of benefit when upgrading or updating systems. With functions distributed over several modules, the amount of testing and validation needed would be reduced. It should also be easier to add functionality. This isn’t fantasy; this is what LAE looks like when you consider the entire lab environment rather than implementing products task-by-task in isolation.&lt;br /&gt;
&lt;br /&gt;
'''Stable operating system environment''': The foundation of an integrated system must be a stable operating environment. Operating system upgrades that require changes in applications coding are disruptive and lead to a loss of performance and integrity. It may be necessary to forgo the bells and whistles of some commercial operating systems in favor of open-source software that provides required stability. Upgrades should be improvements in quality and functionality where that change in functionality has a clear benefit to the user.&lt;br /&gt;
&lt;br /&gt;
The elements noted above are just introductory commentary; each could fill a healthy document by itself. At some point, these steps are going to have to be taken. Until they are, and they result in tools you can use, labs—your labs—are going to be committing the results of your work into products and formats you have little control over. That should not be an acceptable situation; the use of proprietary file formats that limit your ability to work with your company’s data should end and be replaced with industry-standard formats that give you the flexibility to work as you choose, with whatever products you need.&lt;br /&gt;
&lt;br /&gt;
We need to be deliberate in how we approach this problem. When discussing file format standards, it was noted that the data file for a single sample is useless by itself. If you had the file for a chromatogram for instance, you could display it and look at the conditions used to collect it; however, interpretation requires data from other files, so standards for file sets have to be developed. That wasn’t a consideration in the original AIA work on chromatography and mass spectrometry (though it was in work done on Atomic Absorption, Emission and Mass Spectroscopy Data Interchange Specification standards for the Army Corps of Engineers, 1995).&lt;br /&gt;
&lt;br /&gt;
The first step in this process is for lab managers and IT professionals to become educated in laboratory automation and what it takes to get the job done. The role of management can’t be overstated; they have to sign off on the direction work takes and support it for the long haul. The education needs to focus on the management and implementation of automation technologies, not just the underlying science. After all, it is the exclusive focus on the science that leads to the silo-like implementations we have today. The user community's active participation in the process is central to success, and unless that group is educated in the work, the effect of that participation will be limited.&lt;br /&gt;
&lt;br /&gt;
Secondly, we need to renew the development of industry-standard file formats, not just from the standpoint of encapsulating data files, but formats that ensure that the data is usable. The initial focus for each technique needs to be a review of how laboratory data is used, particularly with the advent of hyphenated techniques (e.g., [[Gas chromatography–mass spectrometry]] or GC-MS), with that review serving as a basis for defining the layers of standards needed to develop a usable product. This is a complex undertaking but worth the effort. If you’re not sure, consider how much your lab’s data is worth and the impact of its loss.&lt;br /&gt;
&lt;br /&gt;
In the short term, we need to start pushing vendors—you have the buying power—to develop products with the characteristics needed to allow you to work with and control the results of your lab’s work. Products need to be developed to meet your needs, not the vendor's. Product criteria need to be set with the points above in mind, not on a company-by-company basis but as a community; you’re more likely to get results with a community effort.&lt;br /&gt;
&lt;br /&gt;
Overcoming the barriers to the integration of laboratory systems is going to take a change in mindset on the part of lab management and those working in the labs. That change will result in a significant evolution in the way labs work, yielding higher productivity and a better working environment, with an improvement in the return on your company’s investment in your lab's operations. Laboratory systems need to be designed to be effective. The points noted here are one basis for that design.&lt;br /&gt;
&lt;br /&gt;
===Summary===&lt;br /&gt;
That is a brief tour of what the major elements of laboratory technology management look like right now. The diagrams will change, and details will be left to additional layers to keep the structure easy to understand and use. One thing that was sacrificed in order to facilitate clarity is the relationship between technologies. For example, a robotics system might use data acquisition and control components in its operations, which could be noted by a link between those elements.&lt;br /&gt;
&lt;br /&gt;
There is room for added complexity to the map. Someone may ask where [[bioinformatics]] or some other subject resides. That as well as other points—and there are a number of them—would be addressed in successive levels, giving the viewer the ability to drill down to whatever level of detail they need. The best way to view this is as an electronic map that can be explored by clicking on subjects for added information and relationships.&lt;br /&gt;
&lt;br /&gt;
An entire view of the diagram of the elements of laboratory technology can be [https://web.archive.org/web/20151025114744/http://www.institutelabauto.org/publications/SOLA-11x17.pdf found here] (as a PDF).&lt;br /&gt;
&lt;br /&gt;
==Skills required for working with lab technologies==&lt;br /&gt;
While this subject could arguably have been discussed in the management section above, we needed to wait until the major elements were described before taking up this critical point. In particular, we had to address the idea behind &amp;quot;scientific manufacturing.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Lab automation has an identity problem. Many people don’t recognize it as a field. It appears to be a collection of products and technologies that people can use as needed. Emphasis has shifted from one technology to another depending on what is new, hot, or interesting, with conferences and papers discussing that technology until something else comes along. Robotics, LIMS, and neural networks have all had their periods of intense activity, and now the spotlight is on ELNs, integration, and paperless labs. &lt;br /&gt;
&lt;br /&gt;
Lab automation needs to be addressed as a multi-disciplinary field, working in all scientific disciplines, by lab personnel, consultants, developers, and those in IT support groups. That means addressing three broad groups of people: scientists and technicians (i.e., the end users), LAEs (i.e., those designing and implementing systems for the end users), and the technology developers (Figure 13).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig13 Liscouski ElementsLabTechMan14.png|714px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;714px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 13.''' Groups that need to be addressed when discussing laboratory automation&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Discussions concerning lab automation and the use of advanced technologies in lab work are usually done from the standpoint of the technologies themselves: what they are, what they do, benefits, etc. Missing from these conversations is an appreciation of how well the scientists and technicians—the end users—in the lab can use these tools, and of how those tools will change the nature of laboratory work.&lt;br /&gt;
&lt;br /&gt;
The application of analog electronic systems to laboratory work began in the early part of the twentieth century. For the most part, those systems made it easier for a scientist to make measurements. Recording spectrophotometers replaced wavelength-by-wavelength manual measurements, and process chromatographs automated sample taking, back-flush valves, attenuation changes, etc. They made it easier to collect measurements but did not change the analyst's job of data analysis. After all, analysts still had to look at each curve or chromatogram, make judgments, and apply their skills to making sense of the experiment. At this point, scientists were in charge of executing the science, while analog electronics made the science easier to deal with.&lt;br /&gt;
&lt;br /&gt;
When processor-based systems were added to the lab’s tool set, things moved in a different direction. The computers could then perform the data acquisition, display, and analysis. This left the science to be performed by a program, with the analyst able to adjust the behavior of the program by setting numerical parameters. This represents a major departure in the nature of laboratory work, from scientist being completely responsible for the execution of lab procedures to allowing a computer-based system to take over control of all or a portion of the work.&lt;br /&gt;
&lt;br /&gt;
For many labs, the use of increasingly sophisticated technologies is just a way for individuals to do tasks better, faster, and at lower cost. In others, the technology takes over a task and frees the analyst to do other things. We’ve been in a slow transition from people driving work to technology driving work. As the use of lab technologies moves further into automation, the practice of laboratory work is going to change substantially until we get to the point where scientific manufacturing and production is the dominant function: automation applied from sample acceptance to the final test or experimental result.&lt;br /&gt;
&lt;br /&gt;
====Development of manufacturing and production stages====&lt;br /&gt;
We can get a sense of how work will change by looking at the development of manufacturing and production systems, where we see a transition from manual methods to fully automated production, in the end driven by the same issues as laboratories: a need or desire for high productivity, lower costs, and improved and consistent product results. The major difference is that in labs, the “product” isn’t a widget, it is information and data. In product manufacturing, we also see a reduction in manpower as a goal; in labs, it is a shift from manual effort to using that same energy to understand data and improve the science. One significant benefit from a shift to automation is that lab staff will be able to redesign lab processes—the science behind lab work—to function better in an automated environment; most of the processes and equipment in place today assume manual labor and are not well-designed for automated control.&lt;br /&gt;
&lt;br /&gt;
That said, we’re going to—by analogy—look at a set of manufacturing and production stages associated with wood working, e.g., making the components of door frames or moldings. The trim you see on wood windows and the framing on cabinet doors are examples of shaping wood. When we look at the stages of manufacturing or production, we have to consider five attributes of that production effort: relative production cost, productivity, required skills, product quality, and flexibility.&lt;br /&gt;
&lt;br /&gt;
Initially, hand planes were used to remove wood and form trim components. Multiple passes were needed, each deepening the grooves and shaping the wood. It took practice, skill, and patience to do the work well and avoid waste. This was the domain of the craftsman, the skilled woodworker, who represents the first stage of production evolution. In terms of our evaluation, Figure 14 shows the characteristics of this first stage (we’ll fill in the table as we go along).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig14 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 14.''' The first stage of an evolving manufacturing and production process, using wood working as an example&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The next stage sees the craftsman turn to hand-operated, electric motor-driven routers to shape the wood. Instead of multiple passes with a hand plane, a motor-driven set of bits removes material, leaving the finished product. A variety of cutting bits allow the craftsman to create different shapes. For example, a matching set of bits may be specially designed for the router to create the interlocking rails and stiles that frame cabinet doors.&lt;br /&gt;
&lt;br /&gt;
Figure 15 shows the impact of this equipment on this second evolutionary stage of production. While still available to the home woodworker, the use of this equipment implies that the craftsman is going to be producing the shaped wood in quantity, so we are moving beyond the level of production found in a cottage industry to the seeds of a growth industry. The cost of good quality routers and bits is modest, though using them effectively requires an investment in developing skills. Used well (and safely), they can produce good products; they can also produce a lot of waste if the individual isn’t properly schooled.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig15 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 15.''' The second stage of an evolving manufacturing and production process, using wood working as an example&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The third stage sees automated elements work their way into the wood working mix, with the multi-headed numerically controlled router. Instead of one hand-operated router-bit combination, there are four router-bit assemblies directed by a computer program to follow a specific path such that highly repeatable and precise cuts can be made. Of course, with the addition of multiple heads and software, the complexity of the product increases.&lt;br /&gt;
&lt;br /&gt;
Figure 16 shows the impact of this equipment on the third evolutionary stage of production. We’ve moved from the casual woodworker to a full production operation. The cost of the equipment is significant, and the operators—both the program designer and the machine operator—have to be skilled in the use of the equipment to reduce mistakes and waste material. The “Less Manual Skill” notation under &amp;quot;Skills&amp;quot; indicates a transition point where we have moved almost entirely from the craftsman or woodworker to the skilled operator, requiring different skill sets than previous production methods. One of the side-effects of higher production is that if you make a design error, you can make out-of-specification product rather quickly.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig16 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 16.''' The third stage of an evolving manufacturing and production process, using wood working as an example&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
From there, it's not a far jump to the final stage: a fully automated assembly line. Its inclusion completes the chart that we’ve been developing (Figure 17).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig17 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 17.''' The fourth and final stage of an evolving manufacturing and production process, using wood working as an example&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
When we take the information from Figure 17, we can summarize the entire process as follows (Figure 18):&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig18 Liscouski ElementsLabTechMan14.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 18.''' All stages of a wood working manufacturing and production process, and their attributes, summarized&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
When we look at that summary, we can't help but notice that it translates fairly well when we replace &amp;quot;wood working&amp;quot; with &amp;quot;laboratory work,&amp;quot; moving from the entirely manual processes of the skilled technician or scientist to the fully automated scientific manufacturing and production process of the skilled operator or system supervisor. We visualize that in Figure 19. (The image in the last column of Figure 19 is of an automated extraction system from Fluid Management Systems, Watertown, MA.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig19 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 19.''' All four stages of an evolving scientific manufacturing and production process, this time using lab work&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====What does this all mean for laboratory workers====&lt;br /&gt;
The skills needed today and in the future to work in a modern lab have changed significantly, and they will continue to change as automation takes hold. We’ve seen these changes occur already. Clinical chemistry, high-throughput screening (HTS), and automated bioassays using microplates are some examples.&lt;br /&gt;
&lt;br /&gt;
The discussion here mirrors the development in the woodworking example. We’ll look at the changes in skills, using chromatography as an example. The following material is applicable to any laboratory environment, be it electronics, forensics, physical properties testing, etc. Chromatography is being used because of its wide application in lab work.&lt;br /&gt;
&lt;br /&gt;
'''Stage 1: The analyst using manual methods'''&lt;br /&gt;
&lt;br /&gt;
By “manual methods” we mean 100% manual work, including having the chromatographic detector output (an analog signal) recorded on a standard strip chart recorder and the pen trace getting analyzed by the hand, eye, and skill of the analyst. The process begins with the analyst finding out what samples need to be processed, finding those samples, and preparing them for injection into the instrument. The instrument has to be set up for the analysis, which includes installing the proper columns, adjusting flow rates, confirming component temperatures, and making sure that the instrument is working properly.&lt;br /&gt;
&lt;br /&gt;
As each injection is done, the starting point for the data—a pen trace of the analog signal—is noted on the strip chart. This process is repeated for each sample and reference standard. Depending on the type of analysis, each sample’s data may take up to several feet of chart paper. The recording is a continuous trace and is a faithful representation of the detector output, without any filtering aside from attenuator adjustments (range selections to keep the signal recording within the limits of the paper; some peaks may peg the pen at the top of the chart because of their size, in which case that data is lost) and electrical or mechanical noise reduction.&lt;br /&gt;
&lt;br /&gt;
When all the injections have been completed, the analyst begins the evaluation of each sample’s data. That includes:&lt;br /&gt;
&lt;br /&gt;
* inspecting the chromatogram for anomalies, including peaks that weren’t expected (possible contaminants), separations that aren’t as clear as they should be, noise, baseline drifts, and any other unusual conditions that would indicate a problem with that sample or the entire run of samples;&lt;br /&gt;
* taking the measurements needed for qualitative and/or quantitative analysis;&lt;br /&gt;
* developing the calibration curves; and&lt;br /&gt;
* making the calculations needed to complete the analysis.&lt;br /&gt;
&lt;br /&gt;
The analysis would also cover any in-process control samples and address issues with problem samples. The final step would be the administrative work, including checks of the work by another analyst, reporting results, and updating work request lists.&lt;br /&gt;
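&lt;br /&gt;
The calibration-curve and calculation steps in the list above can be illustrated with a simple least-squares fit; the concentrations and peak areas here are invented for the example.&lt;br /&gt;

```python
# Least-squares calibration line fitted to reference-standard responses,
# then inverted to quantify an unknown.
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

conc = [10.0, 20.0, 40.0]      # standard concentrations (ppm)
area = [105.0, 210.0, 420.0]   # measured peak areas
m, b = linear_fit(conc, area)

# Invert the calibration to turn an unknown's peak area into a result.
unknown_ppm = (315.0 - b) / m
```

In the manual era this fit was done graphically on the chart paper itself; the arithmetic is the same either way.&lt;br /&gt;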
&lt;br /&gt;
'''Stage 2: Point automation, applied to specific tasks'''&lt;br /&gt;
&lt;br /&gt;
The next major development in this evolution is the introduction of automated injectors. Instead of the analyst spending the day injecting samples into the instrument's injection port, a piece of equipment does it, ushering in the first expansion of the analyst’s skill set. (In the previous stage, the analysis time is generally too short to allow the analyst to do anything else, so the analyst's day is spent injecting and waiting.) Granted, this doesn't represent a major change, but it is a change. It requires the analyst to confirm that the samples and standards are in the right order, that the right number of injections per sample is set, and that duplicate vials are put in the tray (duplicate injections are used to confirm that problems don't occur during the injection process). The analyst also has to ensure that the auto-injector is connected to the strip chart recorder so that the injection timing mark is made automatically.&lt;br /&gt;
&lt;br /&gt;
This simple change of adding an auto-injector to the process has some impact on the analyst's skill set. The same holds true for the use of automatic integrators and sample preparation systems; in addition to understanding the science, the lab work takes on the added dimension of managing systems, trading labor for systems supervision with a gain in productivity.&lt;br /&gt;
&lt;br /&gt;
'''Stage 3: Sequential-step automation'''&lt;br /&gt;
&lt;br /&gt;
The addition of data systems to the sample analysis process train (from worklist generation to sample preparation to instrument analysis sequence) further reduces the amount of work the analyst does in sample analysis and changes the nature of the work performed. Starting with the simple integrators and moving on to advanced computer systems, the data system works with the auto-injector to start the instrument analysis phase of work, acquire the signal from the detector, convert it to a digital form, process that data (e.g., peak detection, area and peak height calculations, retention time), and perform the calculations needed for quantitative analysis. This yields less work with higher productivity.&lt;br /&gt;
&lt;br /&gt;
While systems like this are common in labs today, there are problems, which we’ll address shortly.&lt;br /&gt;
&lt;br /&gt;
'''Stage 4: Production-level automation (scientific manufacturing and production)'''&lt;br /&gt;
&lt;br /&gt;
Depending on what is necessary for sample preparation, it may not be much of a stretch to have automated sample prep, injection, data collection, analysis, and reporting (with automated updates into a LIMS) performed in a small footprint with equipment available today. One vendor has an auto-injection system that is capable of dissolving material, performing extractions, mixing, and barcode reading, as well as other functions. Connect that to a chromatograph and data station, with programmed connection to a LIMS, and you have the basis of an automated sample preparation–chromatographic system. However, there are some issues that have to be noted and addressed.&lt;br /&gt;
&lt;br /&gt;
The goal with such a system has to be high-volume, automated sample processing with the generation of high-quality data. The intent is to reduce the amount of work the analyst has to perform, ideally so that the system can run unattended. Note that “high-quality” in this case means having a high level of confidence in the results. There is more to that than the ability to do calculations for quantitative analysis or having a validated system; you have to validate the right system.&lt;br /&gt;
&lt;br /&gt;
Computer systems used in chromatographic analysis can be tuned to control how peaks are detected, what is rejected as noise, and how separations are identified so that baselines can be properly drawn and peak areas allocated. The analyst needs to evaluate the impact of these parameters for each analytical procedure and make sure that the proper settings are used.&lt;br /&gt;
&lt;br /&gt;
As previously noted regarding manual processes, the inspection of the chromatogram for elements that don’t match the expectations for a well-characterized sample (the number of peaks that should be there, the type of separations between peaks, etc.) is vital. This screening has to be applied to every sample, whether by human eye or automated system; the latter gives lower labor costs and higher productivity. If we are going to build fully automated production systems, we have to be able to describe a screening template that is applied to every sample to either confirm that the sample fits the standard criteria or flag it for further evaluation. That “further evaluation” may be frustrated if the data system does not retain sufficient data, requiring the sample to be rerun.&lt;br /&gt;
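&lt;br /&gt;
One way to picture such a screening template is as a declarative list of expected peaks that every sample's results are checked against, with any mismatch flagged for further evaluation. The sketch below is purely illustrative; the peak names, retention-time windows, and tolerances are all assumptions.&lt;br /&gt;

```python
# Hypothetical screening template for a well-characterized sample type:
# an expected peak count and a retention-time window for each peak.
# Samples that do not fit are flagged rather than passed silently.
# All names and windows here are invented for illustration.

EXPECTED_PEAKS = [          # (name, retention-time window in minutes)
    ("solvent front", (0.8, 1.2)),
    ("analyte",       (4.5, 5.1)),
    ("internal std",  (7.0, 7.6)),
]

def screen_sample(observed_retention_times):
    """Return a list of problems; an empty list means the sample fits."""
    problems = []
    if len(observed_retention_times) != len(EXPECTED_PEAKS):
        problems.append("unexpected peak count: %d vs %d expected"
                        % (len(observed_retention_times), len(EXPECTED_PEAKS)))
    for rt, (name, (lo, hi)) in zip(sorted(observed_retention_times),
                                    EXPECTED_PEAKS):
        if not (lo <= rt <= hi):
            problems.append("peak near %.2f min outside window for %s"
                            % (rt, name))
    return problems

clean = screen_sample([1.0, 4.8, 7.3])    # fits the template -> []
flagged = screen_sample([1.0, 4.8, 6.2])  # internal std shifted -> flagged
```

A check like this only works if the data system retains enough raw data for the "further evaluation" that a flagged sample triggers.&lt;br /&gt;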
&lt;br /&gt;
The data acquired by the computer system undergoes several levels of filtering and processing before you see the final results. The sampling algorithms don’t give us the level of detail found in the analog chromatogram, and the visual display of a chromatogram is going to be limited by the data collected and the resolution of the display. The stair-stepping of a digitized chromatogram is an example: an analog chromatogram is a smooth line, while its digitized counterpart is not. Small details and anomalies that could be evidence of contamination may be missed because of the processing.&lt;br /&gt;
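&lt;br /&gt;
The loss of detail can be made concrete with a toy example: a smooth "analog" trace containing a contaminant blip far narrower than the sampling interval, which simply falls between sample points at a coarse sampling rate. The signal shapes and numbers here are invented for illustration.&lt;br /&gt;

```python
import math

# Illustration (with made-up signal shapes) of how digitization can hide
# a narrow anomaly: a contaminant blip narrower than the sampling
# interval falls between sample points and vanishes from the data.

def analog_signal(t):
    """A smooth 'analog' trace: one broad expected peak at t = 5.0 min
    plus a very narrow contaminant blip near t = 2.05 min."""
    broad = math.exp(-((t - 5.0) ** 2) / 0.5)
    narrow = 0.5 * math.exp(-((t - 2.05) ** 2) / 0.0001)
    return broad + narrow

def digitize(n_points, interval):
    """Sample the analog trace every `interval` minutes."""
    return [analog_signal(i * interval) for i in range(n_points)]

coarse = digitize(100, 0.1)   # t = 0.0, 0.1, ... straddles the blip at 2.05
fine = digitize(1000, 0.01)   # t = 0.00, 0.01, ... lands on the blip

coarse_max_near_2 = max(coarse[15:25])   # essentially baseline: blip missed
fine_max_near_2 = max(fine[150:250])     # ~0.5: blip captured
```

The coarse digitization shows nothing unusual near t = 2 minutes, while the finer sampling reveals the contaminant, which is why the sampling and processing parameters deserve the same scrutiny as the separation itself.&lt;br /&gt;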
&lt;br /&gt;
Additionally, the entire process needs to be continually monitored and evaluated to make sure that it is working properly. This is process-level statistical quality control, and it includes options for evolutionary operations updates, small changes to improve process performance. Standard samples have to be run to test the system. If screening templates are used, samples designed to exhibit problems have to be run to trigger those templates to make sure that problem samples are detected. These in-process checks have to include every phase of the process and be able to evaluate all potential risks. The intent is to build confidence in the data by building confidence in the system used to produce it.&lt;br /&gt;
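&lt;br /&gt;
Process-level statistical quality control of this kind can be sketched as a simple control chart: limits derived from validation-phase runs of a standard sample, with later standard-sample results checked against them. The limits and readings below are invented, and real SPC adds run rules beyond a single three-sigma check.&lt;br /&gt;

```python
import statistics

# Minimal sketch of statistical quality control for an automated
# analysis stream: results for a standard sample are tracked against
# control limits (mean +/- 3 standard deviations from validation runs),
# and any excursion flags the run for review. Numbers are invented.

def control_limits(baseline_results):
    """Derive 3-sigma control limits from validation-phase results."""
    mean = statistics.mean(baseline_results)
    sigma = statistics.stdev(baseline_results)
    return mean - 3 * sigma, mean + 3 * sigma

def out_of_control(readings, lower, upper):
    """Return indices of standard-sample results outside the limits."""
    return [i for i, x in enumerate(readings) if not (lower <= x <= upper)]

baseline = [99.8, 100.1, 100.0, 99.9, 100.2]   # validation-phase standards
lower, upper = control_limits(baseline)

# daily standard-sample results; the last one signals a process problem
flagged = out_of_control([100.0, 99.9, 100.1, 102.5], lower, upper)
# flagged -> [3]
```

The same idea extends to the screening templates: problem samples designed to trigger them have to be run periodically to prove the detection logic still works.&lt;br /&gt;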
&lt;br /&gt;
The goals of higher productivity can be achieved for sample processing, but in doing so, the work of the analyst will change from carrying out a procedure to managing and continuously tuning a system that is doing the work for them. The science has to be well-understood, as does the implementation of that science. As we shift to more automation, and use analytical techniques to monitor in-process production systems, more emphasis has to be placed on characterizing all the assumptions and possible failure points of the technique and building in tests to ensure that the data being used to evaluate and control a production process is sound.&lt;br /&gt;
&lt;br /&gt;
During the development of the process described above, the analyst has to first determine the sequence of steps, demonstrate that they work as expected, and prepare the documentation needed to support the process and guide someone through its execution. This includes the details of how the autosampler programming is developed, stored, and maintained. The same holds true for the data system parameters, screening templates, and processing routines. This is process engineering. Then, when the system is being used, the analyst has to ensure that the proper programming is loaded into each component and that it is set up and ready for use.&lt;br /&gt;
&lt;br /&gt;
This is a very simple example of what is possible and an illustration of the changes that could occur in the work of lab professionals. The performance of a system such as that described could be doubled by implementing a second process stream without significantly increasing the analyst workload.&lt;br /&gt;
&lt;br /&gt;
The key element is the skill level of those working in the lab (Figure 20): are they capable of meeting the challenge? Much of what has been described is process engineering, and there are people in manufacturing and production who are good at that. We need to combine process engineering skills with the science. Developing automation teams is one approach, but no matter how you address the idea, those working in labs need an additional layer of skills, beyond what they have been exposed to in formal education settings.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig20 Liscouski ElementsLabTechMan14.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 20.''' The skills and education required at the various stages of the lab-based scientific manufacturing and production process&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Of the sciences, clinical chemistry has moved the furthest into the advanced application of laboratory automation (Figure 20), transforming lab work in the process: lab staff have moved from executing procedures manually to managing systems. The following quote from Diana Mass of Associated Laboratory Consultants {{Efn|Formerly Professor and Director of Clinical Laboratory Sciences Program, Arizona State University. Quote from private communications, used with permission.}} helps delineate the difference in lab work styles:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;What I have observed is that automation has replaced some of the routine repetitive steps in performing analysis; however, the individual has to be even more knowledgeable to troubleshoot sophisticated instrumentation. Even if the equipment is simple to operate, the person has to know how to evaluate quality control results and have a quality assurance system in place to ensure quality test information.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And here's a quote from Martha Casassa, Laboratory Director of Braintree Rehabilitation Hospital{{Efn|Also from private communications, used with permission.}}, who has experience in both clinical and non-clinical labs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Having a background both clinical (as a medical technologist) and non-clinical (chemistry major and managing a non-clinical research lab), I can attest to the training/education being different. I was much more prepared coming through the clinical experience to handle automation and computers and the subsequent troubleshooting and repair necessary as well as the maintenance and upkeep of the systems. During my non-clinical training the emphasis was not so much on theory as practical application in manual methods. I learned assays on some automated equipment, but that education was more to obtain an end-product than to really understand the system and how it produced that product. On the clinical side I learned not only how to get the end-product, but the way it was produced so I could identify issues sooner, produce quality results, and more effectively troubleshoot.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The bottom line is simple: if people are going to be effective working in modern labs, they need to understand both the science and the way science is done using the tools of lab automation. We have a long way to go before we get there. A joint survey by the ILA and Lab Managers Association{{Efn|Originally available on [https://web.archive.org/web/20150215144434/http://www.institutelabauto.org/research/index.htm the ILA site], but no longer available.}} yielded the following:&lt;br /&gt;
&lt;br /&gt;
* Lab automation is essential for most labs, but not all.&lt;br /&gt;
* The skill set necessary to work with automation has changed significantly.&lt;br /&gt;
* Entry-level scientists are generally capable of working with the hardware and software.&lt;br /&gt;
* Entry-level technicians often are not.&lt;br /&gt;
* In general, applicants for positions are not well qualified to work with automation.&lt;br /&gt;
&lt;br /&gt;
How well-educated in the use of automated systems are those working in the lab? The following text was used earlier: &amp;quot;When processor-based systems were added to the lab’s tool set, things moved in a different direction. The computers could then perform the data acquisition, display, and analysis. This left the science to be performed by a program, with the analyst able to adjust the behavior of the program by setting numerical parameters.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Let's follow that thought down a different path. In chromatography, those numerical parameters were used to determine the start and end of a peak, and how baselines were drawn. In some cases, an inappropriate set of parameters would reduce a data set to junk. Do people understand what the parameters are in an IDS and how to use them? Many industrial labs and schools have people using an IDS with no understanding of what is happening to their data. Others such as Hinshaw and Stevenson ''et al.'' have commented on this phenomenon in the past:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;Chromatographers go to great lengths to prepare, inject, and separate their samples, but they sometimes do not pay as much attention to the next step: peak detection and measurement ... Despite a lot of exposure to computerized data handling, however, many practicing chromatographers do not have a good idea of how a stored chromatogram file—a set of data points arrayed in time—gets translated into a set of peaks with quantitative attributes such as area, height, and amount.&amp;lt;ref name=&amp;quot;HinshawFinding14&amp;quot;&amp;gt;{{cite journal |title=Finding a Needle in a Haystack |journal=LCGC Europe |author=Hinshaw, J.V. |volume=27 |issue=11 |pages=584–89 |year=2014 |url=https://www.chromatographyonline.com/view/finding-needle-haystack-0}}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;At this point, I noticed that the discussion tipped from an academic recitation of technical needs and possible solutions to a session driven primarily by frustrations. Even today, the instruments are often more sophisticated than the average user, whether he/she is a technician, graduate student, scientist, or principal investigator using [[chromatography]] as part of the project. Who is responsible for generating good data? Can the designs be improved to increase data integrity?&amp;lt;ref name=&amp;quot;StevensonTheFuture11&amp;quot;&amp;gt;{{cite web |url=https://americanlaboratory.com/913-Technical-Articles/34439-The-Future-of-GC-Instrumentation-From-the-35th-International-Symposium-on-Capillary-Chromatography-ISCC/ |title=The Future of GC Instrumentation From the 35th International Symposium on Capillary Chromatography (ISCC) |author=Stevenson, R.L.; Lee, M.; Gras, R. |work=American Laboratory |date=01 September 2011}}&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In yet another example, at the European Lab Automation 2012 Meeting, one liquid-handling equipment vendor gave a presentation on how improper calibration and use of liquid handling systems would yield poor data.&amp;lt;ref name=&amp;quot;BradshawTheImpo12&amp;quot;&amp;gt;{{cite web |url=https://d1wfu1xu79s6d2.cloudfront.net/wp-content/uploads/2013/10/The-Importance-of-Liquid-Handling-Details-and-Their-Impact-on-Your-Assays.pdf |format=PDF |title=The Importance of Liquid Handling Details and Their Impact on Your Assays |author=Bradshaw, J.T. |work=European Lab Automation Conference 2012 |publisher=Artel, Inc |date=30 May 2012 |accessdate=11 February 2021}}&amp;lt;/ref&amp;gt; Discussion with other vendors supported that point, citing poor training as the cause.&lt;br /&gt;
&lt;br /&gt;
One of the problems that has developed is “push-button science,” or to be more precise, the execution of tasks by pushing a button: put the sample in, push a button to get the measurements, and get a printout. Measurements are being made, but those using the equipment don’t understand what is being done or whether it is being done properly, exemplifying a “trust the vendor” or “the vendor is the expert” mindset. Those points run against the concepts of validation described by the FDA and the [[International Organization for Standardization]] (ISO), among others. People need to approach equipment with a healthy skepticism, not assuming that it is working but being able to demonstrate that it is working. Trusting the system is based on experience and proof that the system works as expected, not assumptions.&lt;br /&gt;
&lt;br /&gt;
Education on the use of current lab systems is sorely needed in today’s working environment. What are the needs going to be in the future as we move from people using equipment in workstations to the integrated workflows of scientific manufacturing and production processes?&lt;br /&gt;
&lt;br /&gt;
==Closing==&lt;br /&gt;
The direction of laboratory automation has changed significantly over time, and with it so has the work associated with a lab. However, many have looked at that automation as just a means to an end, when in reality laboratory automation is a process, like most any other in a manufacturing and production setting. As a process, we need to look at the elements that can be used to make that process work effectively: management issues, implementation issues, experimental methods, lab-specific technologies, broader information technologies, and systems integration issues. Addressing those early, as part of a planning process, better ensures successful implementation of automation in the lab setting. &lt;br /&gt;
&lt;br /&gt;
But there's more to it than that: the personnel doing the work must have the skills and education to fully realize lab automation's benefits. In fact, this applies to more than the scientists and technicians doing the lab work; it also includes those designing and implementing the technology in the lab, as well as the vendors actually making the components. Everyone must be communicating ideas, making suggestions, and working together in order to make automation work well in the lab. The addition of technology alone does not make a lab more efficient and cost-effective; that requires knowledge of production and of how the underlying technology actually works. When implemented well, automation may come with lower costs and better productivity, but it also comes with a demand for higher-level, specialized skills. In other words, lab personnel must understand both the science and the way science is done when using the tools of laboratory automation.&lt;br /&gt;
&lt;br /&gt;
We need lab personnel who are competent users of modern lab instrumentation systems, robotics, and informatics (LIMS, ELNs, SDMS, CDS, etc.), the tools used to do lab work. They should understand the science behind the techniques and how the systems are used in their execution. If a computer system is used to do data capture and processing, that understanding includes:&lt;br /&gt;
&lt;br /&gt;
* how the data capture is accomplished,&lt;br /&gt;
* how it is processed,&lt;br /&gt;
* what the control parameters are and how the current set in use was arrived at (not “that’s what came from the vendor”), and&lt;br /&gt;
* how to detect and correct problems.&lt;br /&gt;
&lt;br /&gt;
They should also understand statistical process control so that the behavior of automated systems can be monitored, with potential problems detected and corrected before they become significant. Rather than simply being part of the execution of a procedure, they manage the process. We're talking about LAEs.&lt;br /&gt;
&lt;br /&gt;
Those LAEs must be capable of planning, implementing, and supporting lab systems, as well as developing products and technologies for labs.{{Efn|See ''[[LII:Are You a Laboratory Automation Engineer?|Are You a Laboratory Automation Engineer?]]'' for more details.}} This type of knowledge isn't limited to the people developing systems; those supporting them also require those capabilities. After all, the implementation of laboratory systems is an engineering program and should be approached in the same manner as any systems development activity.&lt;br /&gt;
&lt;br /&gt;
The use of advanced technology products isn’t going to improve until we have people who are fully competent to work with them, understand their limitations, and drive vendors to create better products.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
&lt;br /&gt;
'''AIA''': Analytical Instrument Association&lt;br /&gt;
&lt;br /&gt;
'''CDS''': Chromatography data system&lt;br /&gt;
&lt;br /&gt;
'''ELN''': Electronic laboratory notebook&lt;br /&gt;
&lt;br /&gt;
'''FDA''': Food and Drug Administration&lt;br /&gt;
&lt;br /&gt;
'''GC-MS''': Gas chromatography–mass spectrometry&lt;br /&gt;
&lt;br /&gt;
'''IDS''': Instrument data system&lt;br /&gt;
&lt;br /&gt;
'''ISO''': International Organization for Standardization&lt;br /&gt;
&lt;br /&gt;
'''K/D/I''': Knowledge, data, and information&lt;br /&gt;
&lt;br /&gt;
'''LAE''': Laboratory automation engineering (or engineer)&lt;br /&gt;
&lt;br /&gt;
'''LIMS''': Laboratory information management system&lt;br /&gt;
&lt;br /&gt;
'''LIS''': Laboratory information system&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==About the author==&lt;br /&gt;
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and quality control environments.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=LII:Are_You_a_Laboratory_Automation_Engineer%3F&amp;diff=64491</id>
		<title>LII:Are You a Laboratory Automation Engineer?</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=LII:Are_You_a_Laboratory_Automation_Engineer%3F&amp;diff=64491"/>
		<updated>2024-06-19T18:48:13Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Year clarification&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Title''': ''Are You a Laboratory Automation Engineer?''&lt;br /&gt;
&lt;br /&gt;
'''Author for citation''': Joe Liscouski, with editorial modifications by Shawn Douglas&lt;br /&gt;
&lt;br /&gt;
'''License for content''': [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
&lt;br /&gt;
'''Publication date''': Originally published 2006; republished here February 2021&lt;br /&gt;
&lt;br /&gt;
==Summary==&lt;br /&gt;
The technology and techniques that are used in automating [[laboratory]] work have been documented for over 40 years. Work done under the heading of “[[laboratory automation]]” has progressed from pioneering work in data acquisition and instrument control, to instrumentation with integral computing and communications. If the field is to move forward, we need to organize the practices and education of those applying automation and computing technologies to laboratory work. This note{{Efn|A portion of this material [http://dx.doi.org/10.1016%2Fj.jala.2006.04.002 was published] as a guest editorial in the ''Journal of the Association for Laboratory Automation'' (now ''SLAS Technology''), June 2006, volume 11, number 3, pages 157-162.}} is an opening to a dialog on the definition and development of the field of “laboratory automation engineering” (LAE).&lt;br /&gt;
&lt;br /&gt;
In this article the following points will be considered:&lt;br /&gt;
&lt;br /&gt;
* Definition of laboratory automation engineering (LAE)&lt;br /&gt;
* Laboratory automation as an engineering discipline&lt;br /&gt;
* Developments in laboratory automation&lt;br /&gt;
* Benefits of formalizing laboratory automation engineering&lt;br /&gt;
* Sub-disciplines within LAE&lt;br /&gt;
* Skills required for LAE&lt;br /&gt;
* What comes next?&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
There are many of us working on the use of automation technologies in scientific disciplines such as chemistry, high-throughput screening, physics, quality control, electronics, oceanography, and materials testing. Yet, given the diversity of applications, we have many characteristics in common. There are enough common traits and skills that we can declare a field of laboratory automation engineering (LAE). The establishment of this field of work benefits both the practitioner and those in need of their skills. In addition, we may finally come to the full realization of the rewards we expect from automation systems.&lt;br /&gt;
&lt;br /&gt;
Laboratory automation engineering can be defined as the application of automation, information, computing, and networking technologies to solve problems in, or improve the practice of, a scientific discipline. As such, it fits the accepted definition of &amp;quot;engineering&amp;quot;: the discipline dealing with the art or science of applying scientific knowledge to practical problems.&lt;br /&gt;
&lt;br /&gt;
The common characteristics of LAE are:&lt;br /&gt;
&lt;br /&gt;
* It is practiced in facilities where the focus is scientific in nature, including research and development, testing, and quality control activities.&lt;br /&gt;
* The end product of the work consists of [[information]] (e.g., initial and processed test data) and/or systems for managing and improving people’s ability to work with information (intermediate steps result in prepared [[Sample (material)|samples]] or specimens).&lt;br /&gt;
* The automation practice draws upon robotics, mechanical and electrical engineering, data and [[Information management|information handling and management]], information and networking technology, and the science that the automation is addressing.&lt;br /&gt;
* It requires the practitioners to be knowledgeable in both automation technologies and the discipline in which they are being applied.&lt;br /&gt;
&lt;br /&gt;
LAE differs from other forms of automation engineering found in manufacturing and product production in several ways. First, modern manufacturing and production facilities are designed with automation as an integral part of the process. Laboratory automation is a replacement process: manual techniques, once they mature, are replaced with automated systems. Economics, and the need to improve process uniformity, drive the replacement. If a testing or quality control lab could be designed initially as a fully automated facility, it would be difficult to distinguish the automated facility from any other manufacturing facility aside from the point that its “product” is data rather than tangible materials. The second difference is the point just made: the results are ultimately data.{{Efn|As a point of fact, and as noted later in this piece, the results of laboratory work are knowledge, information, and data. “Data” as used here is a short-hand for all three. For a full discussion of this point from the author's point of view, please refer to ''[https://isbnsearch.org/isbn/0471594229 Laboratory and Scientific Computing: A Strategic Approach]'', J. Liscouski, Wiley Interscience, 1995.}} LAE would result in systems and methods (e.g., high-throughput screening and combinatorial methods) that enable science to be pursued in ways that could not be done without automation.&lt;br /&gt;
&lt;br /&gt;
==Developments in laboratory automation==&lt;br /&gt;
Laboratory automation was initiated by scientists with a need to do things differently. Whether it was to improve the efficiency of the work, to substitute intelligent control and acquisition systems for manual labor, or because it was the best or only way to implement an experimental system, automation eventually moved into the laboratory. Those doing the work had to be as well-versed in computing hardware and software as they were in their science. The choice of hardware and software, as well as the development of the system, was in their hands.&amp;lt;ref name=&amp;quot;HallockCreative05&amp;quot;&amp;gt;{{cite journal |title=Creative Combustion: A History of the Association for Laboratory Automation |journal=SLAS Technology |author=Hallock, N. |volume=10 |issue=6 |pages=423–31 |year=2005 |doi=10.1016/j.jala.2005.10.001}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;LiscouskiLab85&amp;quot;&amp;gt;{{cite journal |title=Laboratory Automation |journal=JCOMP |author=Liscouski, J. |volume=25 |issue=3 |pages=288–92 |year=1985}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first generation of laboratory automation systems was concerned with interfacing instruments, sensors, and controllers to computers. Then the control of experiments through robotics and direct interfacing with computer systems arrived. A variety of experimental control software packages became available, as did [[laboratory information management system]]s (LIMS) and, in the last few years, [[electronic laboratory notebook]]s (ELNs).&amp;lt;ref name=&amp;quot;HallockCreative05&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;LiscouskiLab85&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Visit a conference in any discipline with a discussion on automation and it would appear that automation has seriously taken hold: most things are already automated or soon will be. One problem with this product array is the ability to integrate products from different vendors. Individual procedures are being addressed, but the movement of data and information within the lab, and your ability to work with it, is not as robust as it could be. Some vendors answer this need with comprehensive ELN systems in which their products are the hub of a lab's operations. Data is collected from instruments with local computing capability (e.g., the simple data translation of balances or similar devices) into their systems for storage, cataloging, analysis, etc. The movement is primarily one-way, but this is only the current generation of systems. The capabilities of today’s products generate new possibilities requiring upgrades to, or replacement of, these products.&lt;br /&gt;
&lt;br /&gt;
The early automation models were driven by limitations in computing power, data storage, and communications. There was one experimental setup, with one system to provide automation.&amp;lt;ref name=&amp;quot;HallockCreative05&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;LiscouskiLab85&amp;quot; /&amp;gt; The next generation of laboratory automation has to replace previous models, and develop implementations that can be gracefully upgraded rather than replaced. This will require increased modularity and less dependence upon local-to-the-instrument systems for automation. Achieving it requires engineering systems, not just building products.&lt;br /&gt;
&lt;br /&gt;
Today, data communication is readily available thanks to the development of communication standards. Storage is cheap and there is an abundance of computing capability available. [[Data analysis]] does not have to be done at the instrument. Instead, the data file with instrument parameters can be sent to a central system for cataloging, storage, analysis, and reporting, while still making the raw data readily available for further analysis as new data treatment techniques are developed. The dots in the automation map will be connected with bi-directional communications. The field needs people who are trained to do the work and to take advantage of the increased capabilities available to them.&lt;br /&gt;
&lt;br /&gt;
Beyond the issues of experimental work lay the requirements of corporate information architectures. Laboratory systems are part of the corporate network, whether it is a for-profit company, a university, or a research organization. Laboratory computing does not stop at the door. The purpose of a laboratory is to create knowledge, information, and data. The purpose of laboratory automation is to assist in that development, making the process more successful, effective, and cost-efficient. This can be accomplished by engineering systems that have security built-in to protect them, while at the same time giving scientists access to their data wherever they need it. It also means designing systems that permit a graceful evolution and continual improvement without disrupting the lab workings.&lt;br /&gt;
&lt;br /&gt;
Too often we find corporate information technology (IT) departments at odds with those using computers in laboratory settings. On one side you hear “IT doesn’t understand our needs,” and on the other “we’re responsible for all corporate computing.” Both sides have a point. IT departments are responsible for corporate computing, and labs are part of the corporation. However, IT departments usually do not understand the computing needs of scientists; they do not speak the same language. IT-speak and science-speak do not readily connect, and the issue will worsen as more instruments become network-ready. Specialized software packages used in laboratory work raise support issues for IT departments. IT would be concerned about who was going to support these programs and the effect they may have on the security of the corporation’s information network. LAEs have the ability to bridge that gap.&lt;br /&gt;
&lt;br /&gt;
==Benefits of formalizing laboratory automation engineering==&lt;br /&gt;
People have been successfully doing lab automation work for several decades, so why do we need a change in the way it is done? While there certainly have been successful projects, there have also been projects ranging from less-than-complete successes to outright cancellations. In the end, the potential for laboratory automation to assist or even revolutionize laboratory work is too great to have projects be hit-or-miss successes. Even initially apparent successes may lead to long-term problems when the choice of software platform leads to an eventual dead end, or the lab finds itself locked into a product set that does not give it the flexibility needed as lab requirements change over time.&lt;br /&gt;
&lt;br /&gt;
To date, laboratory automation has moved along a developmental path that has been traveled in other areas of technology applications. In commercial chemical synthesis, aerospace, computers, civil engineering, and other fields, progress has been made first by individuals, followed by small groups practicing within a particular field. As the need to apply these technologies on a broader scale grew, training became formalized, as did methods of design and development. The methodology shifted from an “art” to actual engineering. The end result has been the ability to create buildings, bridges, aircraft, etc. on a scale that independent groups working on their own could not hope to achieve.&lt;br /&gt;
&lt;br /&gt;
In the fields noted in the previous paragraph, and in others, “engineering” a project means that trained people have analyzed, understood, and defined it. It also means that plans have been laid out to meet the project goals, and that risks have been identified and evaluated. It is a deliberate, confident movement from requirements to completion. This is what is needed to advance the application of automation technologies to scientific and laboratory problems. Hence, a movement to an engineering discipline for laboratory automation is necessary.&lt;br /&gt;
&lt;br /&gt;
The benefits{{Efn|I’d like to thank Mark Russo, executive editor of the JALA, for his comments and, in this section in particular, for his insights and the material he contributed.}} of formalizing LAE to the individual practitioner, laboratory management, and the field are outlined below.&lt;br /&gt;
&lt;br /&gt;
Benefits to the individual:&lt;br /&gt;
&lt;br /&gt;
* provides a thorough and systematic education in a broadly practiced field;&lt;br /&gt;
* informs of past activities regarding what has worked and what has failed;&lt;br /&gt;
* provides evidence of relevant knowledge and training through a degree or certificate program; and&lt;br /&gt;
* lends a sense of identity and community with others on the same career path.&lt;br /&gt;
&lt;br /&gt;
Benefits to laboratory management:&lt;br /&gt;
&lt;br /&gt;
* provides a basis for evaluating people’s credentials and ability to work on laboratory automation projects;&lt;br /&gt;
* provides a basis for employee evaluation;&lt;br /&gt;
* limits the need for on-the-job training, reducing the impact on budgets;&lt;br /&gt;
* lends to faster project implementation with a higher expectation of success; and&lt;br /&gt;
* lends expertise to the design of laboratories that incorporate automation into their initial blueprint, rather than retrofitting it onto existing manual procedures, reducing the cost and improving the efficiency of a laboratory’s operations.&lt;br /&gt;
&lt;br /&gt;
Benefits to the field of laboratory automation:&lt;br /&gt;
&lt;br /&gt;
* creates a foundation of documented knowledge that LAEs can use for learning and to improve the effectiveness of the profession;&lt;br /&gt;
* encourages a community of people that can drive the organized development of systems and technologies that will provide advancements to the practice of science, while creating re-useable resources instead of re-inventing systems from scratch on similar projects;&lt;br /&gt;
* provides a basis for research into new technologies that can significantly improve scientists’ ability to do their work, encouraging a move from incremental advancement in automation systems to leaps in advancement; and&lt;br /&gt;
* promotes a community of like-minded individuals that can discuss, and where appropriate, develop positions on key issues in the field (e.g., the impact of regulatory requirements and standards development) and develop position papers and guidelines on those points as warranted.&lt;br /&gt;
&lt;br /&gt;
==Sub-disciplines within LAE==&lt;br /&gt;
Laboratory automation has the ability to allow us to do science that would be both physically and economically prohibitive without it. As noted above, high-throughput screening and combinatorial methods depend on automation. In physics, the data rates would be impossible to handle.&amp;lt;ref name=&amp;quot;MateyHist99&amp;quot;&amp;gt;{{cite journal |title=History of Laboratory Automation - Session BA01.05, Cent. Symposium: 20th Century Developments in Instrumentation &amp;amp; Measurements |journal=Centennial Meeting of the American Physical Society |author=Matey, J.R. |year=1999 |url=http://flux.aps.org/meetings/YR99/CENT99/abs/S350.html#SBA01.001}}&amp;lt;/ref&amp;gt; In most industrial analytical chemistry laboratories, automation keeps budgets and testing schedules workable. As the discipline develops, we are going to become more dependent on automation systems for data access, [[data mining]], and [[Data visualization|visualization]].&lt;br /&gt;
&lt;br /&gt;
Given the level of sophistication of existing systems, and the demands that will be placed on them in the future, there will be a need for specialization, even within the field of LAE. Sub-disciplines within LAE include:&lt;br /&gt;
&lt;br /&gt;
* Sample handling, experiments, and testing – Automation is applied to the movement, manipulation, and analysis of samples and objects of interest. A common type of automation is robotics, which includes special-purpose devices such as autosamplers, as well as user-defined configurable systems.&lt;br /&gt;
* Data acquisition and analysis – This includes any sensor-based data entry, with subsequent data evaluation.&lt;br /&gt;
* Data, information, and knowledge management – This includes managing the access to and storage of those objects, as well as understanding the uses of LIMS and ELNs.&lt;br /&gt;
&lt;br /&gt;
These sub-disciplines are not mutually exclusive but have considerable overlap (Figure 1) and share underlying technologies such as computing (which includes programming), networking, and communications. In some applications it may be difficult to separate “data acquisition” from experiment control; however, that separation could lead to insights in system design. The sub-disciplines also have distinctly different sets of skill requirements and, as you move from left to right in Figure 1, less involvement with the actual laboratory science.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Liscouski AreYouLabAutoEng06.png|419px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;419px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 1.''' The three sub-disciplines within laboratory automation engineering&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
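As an illustration of the “data acquisition and analysis” sub-discipline described above, the following minimal Python sketch polls a simulated sensor, timestamps each reading, and reduces the series to summary statistics. All names here (e.g., read_sensor) are hypothetical stand-ins for a real instrument interface, not part of any actual API.

```python
# Illustrative sketch of sensor-based data entry with subsequent
# data evaluation. read_sensor() is a hypothetical stand-in for a
# real instrument call (e.g., a detector voltage).
import random
import time
from statistics import mean, stdev

def read_sensor():
    # Simulated instrument reading: nominal 1.0 with Gaussian noise.
    return 1.0 + random.gauss(0, 0.05)

def acquire(n_points, interval_s=0.0):
    """Collect n_points timestamped readings at a fixed interval."""
    readings = []
    for _ in range(n_points):
        readings.append((time.time(), read_sensor()))
        time.sleep(interval_s)
    return readings

def analyze(readings):
    """Reduce the acquired series to simple summary statistics."""
    values = [v for _, v in readings]
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}
```

In a real system, the analysis step would run on a central server rather than at the instrument, consistent with the architecture described earlier, with the raw timestamped readings retained for later re-analysis.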
&lt;br /&gt;
A systems approach to laboratory automation is essential in the implementation of projects in modern laboratories; without it, we are going to continually repeat the “islands of automation” practices so common in the past. Project designers must look beyond the immediate needs and make provision for integration with other systems, in particular bi-directional communication with knowledge, information, and data management systems.&lt;br /&gt;
&lt;br /&gt;
==Skills required for LAE==&lt;br /&gt;
In the mid-1960s, “programming” instruments was a mechanical problem handled with needle-nosed pliers, screwdrivers, and a stopwatch, as you adjusted cams and micro-switches.{{Efn|For example, process chromatographs made by Mine Safety Appliances and Fisher Control used cams and micro-switches to control sampling and back-flush valves.}} This process is more complicated today. Automation systems are built around computer components; therefore, a strong background in computer science and programming is necessary for success. In addition, an engineering background, along with management, regulatory, and strong organizational skills, is expected. An understanding of the basic science to which automation is being applied is also crucial. As noted below under communications, it is up to the LAE to understand the science and the scientists in order to convert their needs into a functional requirements document and then implement working systems.&lt;br /&gt;
&lt;br /&gt;
Those working in LAE should have the following basic skills in their backgrounds. These would be common across all sub-disciplines:&lt;br /&gt;
&lt;br /&gt;
* project and change management&lt;br /&gt;
* strong communications skills&lt;br /&gt;
* strong understanding of regulatory issues (e.g., [[Food and Drug Administration|FDA]], [[International Organization for Standardization|ISO]])&lt;br /&gt;
* general systems theory and beyond&lt;br /&gt;
* process analysis and development&lt;br /&gt;
&lt;br /&gt;
===Project and change management===&lt;br /&gt;
In addition to the skills you might expect project management to cover, change management also deserves attention; the two are treated together here to emphasize their mutual importance. Until we can anticipate automation issues within a lab and build them into the lab as it is constructed, laboratory automation is going to be both a replacement for an existing process (a project) and a new set of processes imposed on those working in that lab (a change). Even when we reach a stage where pre-built automation systems exist, there will be both step-wise and radical changes in the applied technologies as new techniques are developed.&lt;br /&gt;
&lt;br /&gt;
The LAE practitioner will have to deal with the normal issues of budgets, schedules, and documentation common to projects. The need for thorough documentation cannot be stressed enough. Well-written documentation of the project goals, structure, milestones, the reasoning behind choices of materials, equipment, and alternative ways of doing things, and the final acceptance criteria is necessary to provide a basis for regulatory review; it also leads to a supportable system that can be upgraded and provides an objective reference for determining when the project is complete. As part of their practice, LAEs should adopt practices from other engineering disciplines, such as the development of a functional requirements specification (FRS), a user requirements specification (URS), and design specifications prior to system development. The LAE should also be familiar with the concept of project life cycle management&amp;lt;ref name=&amp;quot;M123Project&amp;quot;&amp;gt;{{cite web |url=https://www.method123.com/project-lifecycle.php |title=Project Management Life Cycle |work=Method123}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;WPSystems&amp;quot;&amp;gt;{{cite web |url=https://en.wikipedia.org/wiki/Systems_development_life_cycle |title=Systems development life cycle |work=Wikipedia}}&amp;lt;/ref&amp;gt; found in a number of industries, paying particular attention to software life cycle management.&amp;lt;ref name=&amp;quot;AdobeLesson2.2&amp;quot;&amp;gt;{{cite web |url=http://www.adobe.com/education/webtech/CS/unit_planning1/pm_home_id.htm |archiveurl=https://web.archive.org/web/20060110072101/http://www.adobe.com/education/webtech/CS/unit_planning1/pm_home_id.htm |title=Lesson 2.2 Project Management |work=Adobe Web Tech Curriculum |date=2005 |archivedate=10 January 2006}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More interestingly, the LAE will have to be able to respond to statements like “I know we’ve never done anything like this before, but I still need to know how much it will cost and when it will be done” and questions like “OK, maybe you don’t know why it isn’t working, but can you tell me when it will be fixed?” The politics of managing changes and adding requirements to projects (with appropriate documentation, change orders, and budget and schedule adjustments) are part of professional work.&lt;br /&gt;
&lt;br /&gt;
Change management also addresses “people issues,” and they can be among the more stressful. Moving from an existing way of doing work to an automated one means that people’s jobs will change. For some it will be a minor adjustment to the way they work, while for others their job may change dramatically (and change raises people’s anxiety levels). Those working in LAE are going to have to face that reality head-on because it will affect their ability to get work done. If those working in the lab believe that your project will cause them to lose their jobs, or to undergo a significant change (particularly one viewed as negative), they may limit your ability to complete the project. The fact that this point deals with people does not mean it cannot affect what may be viewed as a technology project, nor can the issue simply be handed off to lab managers to solve.&lt;br /&gt;
&lt;br /&gt;
Paying attention to “people issues” can:&lt;br /&gt;
&lt;br /&gt;
* point to deficiencies in project planning,&lt;br /&gt;
* avoid delays in the final acceptance of the project, and&lt;br /&gt;
* help you uncover important factors in designing a system.&lt;br /&gt;
&lt;br /&gt;
Regarding the first point, one of the lessons learned in the installation of early LIMS was the need for both training and the availability of practice databases so that those working with the system could familiarize themselves with a non-production database before using the live system. Having the project plan reviewed by those in the lab who will be affected by the system can lead to the discovery of issues that need to be addressed in training, scheduling, and potential conflicts in the lab’s operations. By extension, if deficiencies are caught early, delays in final project acceptance may be avoided. &lt;br /&gt;
&lt;br /&gt;
As for the final point, laboratory work contains elements of art as well as science. People’s familiarity with the samples they are processing can lead to undocumented shortcuts and adjustments to procedures, which can make the difference between your implementation of an automated process working or failing. In the late 1980s, Shoshana Zuboff documented the problems that occurred in the automation of pulp mills when the plant personnel’s experience was ignored.&amp;lt;ref name=&amp;quot;ZuboffInTheAge88&amp;quot;&amp;gt;{{cite book |title=In the Age of the Smart Machine |author=Zuboff, S. |publisher=Butterworth–Heinemann |year=1988 |isbn=0434924865}}&amp;lt;/ref&amp;gt; The parallel is clear: if you ignore people’s experience in doing the work you are about to automate, you run a significant risk of failure.&lt;br /&gt;
&lt;br /&gt;
Project and change management must also plan for the stresses that develop during a project's implementation. As noted earlier, laboratory automation today is largely a replacement of existing manual procedures. LIMS are installed to facilitate the bookkeeping of service laboratories, improve the lab’s ability to respond to questions about test results, and increase the efficiency of the lab’s operations. Robotic sample preparation systems are introduced into a lab because the existing manual processes are no longer economical, or because there is a need for better data precision or higher sample throughput.&lt;br /&gt;
&lt;br /&gt;
These replacement activities are going to introduce stressors on the lab, which need to be taken into account in the planning process. First, there will be some stress on the lab workers and budget during the development of the system. That stress will come in the form of the cost of running the lab while the development is taking place, people’s attitudes if they feel their jobs will be adversely affected, potential disruption of lab operations, and the problems associated with installing systems in labs that are usually short on space. Second, once the automation system has been developed, it will have to be evaluated against the existing process to show that it meets the design criteria. Once that has been successfully accomplished, a change-over process has to be designed to switch from one method of operation to another. Both of these activities will increase the workload in the lab since, for a time, both the automated system and the process it is replacing will run in parallel.&lt;br /&gt;
&lt;br /&gt;
===Strong communications skills===&lt;br /&gt;
The previous section mentioned managing people issues, and effective communication is part of that. Strong communication also helps make sure people understand what work is being done, what the benefits and costs are, and what the schedules and risks look like. Those aspects, as well as the ramifications of any changes, have to be clearly understood. A lack of clear communication is often the basis of misdirection, delays, and frustration around projects. Communication is a two-way street: the LAE needs to be able to discuss the project, as well as listen to and understand the needs of those working in the lab.&lt;br /&gt;
&lt;br /&gt;
===Strong understanding of regulatory issues===&lt;br /&gt;
No matter what industry in which you are working, meeting regulatory requirements is a fact of life. Companies are under increasing pressure from regulatory agencies to have their internal practices conform to a standard. This is no longer a specific regulatory- or standards-based issue that is focused on manufacturing systems, but rather an organization-wide set of activities encompassing financial and accounting practices, human resources, etc. Name a department and it will be affected by a regulation.&lt;br /&gt;
&lt;br /&gt;
While the initial reaction may be that this adds one more thing to do, for the most part what the regulations require is simply part of a good system design process. The requirements for the validation of a system, for example, call for the designer to show that the system works, that it is supportable and well-documented, and that it uses equipment suited to the designer’s purpose and sourced from reliable vendors. Tinkering a system together is no longer an accepted practice; it may work for prototyping, but not for production use. The point of any applicable regulations and standards is to ensure that any implemented system can stand up to long-term use and that it fits a well-defined, documented need. In short, regulations and standards help ensure a system is well-engineered.&lt;br /&gt;
&lt;br /&gt;
===General systems theory and beyond===&lt;br /&gt;
Frank Zenie (one of the founders of Zymark Corporation{{Efn|Zymark was acquired by Caliper Life Sciences in 2003.}}) usually made the statement that “you can only automate a process, you can’t automate a thing” as part of his introductory comments for Zymark’s robotics courses. Recognizing that a process exists and that it can be automated is the initial step in defining a project. General systems theory{{Efn|See &amp;quot;[http://pespmc1.vub.ac.be/SYSTHEOR.html What Is Systems Theory?]&amp;quot; for an introduction. In particular, review the work of George Klir, including ''[https://www.worldcat.org/title/approach-to-general-systems-theory/oclc/51307 An Approach to General Systems Theory]'' and ''[https://doi.org/10.1007/978-1-4615-1331-5 Facets of Systems Science]''.}} provides the tools for documenting a process, as well as the triggers and the state changes that occur as the process develops. It is particularly useful in working with inter-related systems.&lt;br /&gt;
&lt;br /&gt;
Recognizing that a process exists and can be well described is one part of the problem. Another is its ability to be automated. Labs and lab equipment are designed for people. Using that same equipment and tools with automated systems, including robotics, may require a significant amount of re-engineering, and bring into question the economics and perceived benefits of a project.&lt;br /&gt;
&lt;br /&gt;
Process engineering should also include statistical process control, statistical quality control, and productivity measurements. Robotics systems are small-scale manufacturing systems, even if the end result is data. As portions of laboratory work become automated and integrated, the systems will behave like manufacturing processes and should fall under the same types of control. Productivity measurements are also important: automation systems are put in place to achieve a number of goals, including improving the efficiency of operations. Productivity measurements test the system’s ability to meet those goals and, if successful, will lead to the development of additional projects.&lt;br /&gt;
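The statistical process control mentioned above can be illustrated with a minimal sketch: an individuals (X) control chart with three-sigma limits, where sigma is estimated from the average moving range. The function names and sample data are illustrative, not drawn from any particular SPC library.

```python
# Minimal individuals-chart sketch for a stream of lab measurements.
# Sigma is estimated from the average moving range divided by
# d2 = 1.128, the standard constant for subgroups of size 2.
from statistics import mean

def control_limits(values):
    """Return (center, lcl, ucl) for an individuals control chart."""
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma = mean(moving_ranges) / 1.128
    return center, center - 3 * sigma, center + 3 * sigma

def out_of_control(values):
    """Flag results falling outside the three-sigma limits."""
    center, lcl, ucl = control_limits(values)
    return [v for v in values if v < lcl or v > ucl]
```

A robotic sample preparation line could apply the same logic to, say, dispensed volumes or recovery rates, treating the automated lab process exactly like the manufacturing process the text compares it to.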
&lt;br /&gt;
Much of what has been written so far could easily be recognized as part of an engineering or engineering management program, and that is exactly the point: laboratory automation is an engineering activity. The elements that differentiate it from other engineering activities include that:&lt;br /&gt;
&lt;br /&gt;
* it takes place in a laboratory or scientific environment, as noted earlier;&lt;br /&gt;
* automation is an enabling technology opening the door to new scientific methodologies, including discovery-based science{{Efn|As suggested by a reviewer.}};&lt;br /&gt;
* automation is usually a replacement for existing manual operations, which will require the LAE to be sensitive to change management issues;&lt;br /&gt;
* the scope of activities can include materials handling (robotics), data acquisition, analysis, reporting, and integration with database systems; and&lt;br /&gt;
* LAEs need to be knowledgeable in automation technologies and their application, and have strong backgrounds in the science practiced in the lab.&lt;br /&gt;
&lt;br /&gt;
This last point is significant; the LAE cannot expect the lab personnel to describe their requirements in engineering terms. Yes, lab personnel can explain what they would like to accomplish; however, it is the LAE’s responsibility to determine how to do it and understand the ramifications in system design and implementation. The LAE functions in part as a translator, converting a set of needs of scientists into a project plan, and ultimately into a functioning system. In some respects this is similar to traditional software engineering.&lt;br /&gt;
&lt;br /&gt;
All the work associated with the three sub-disciplines reflects the transition from working with “things” and materials, making measurements, and creating descriptions of those items, to managing and working with the resulting knowledge, information, and data. The lists found in Figure 2 summarize the skills needed in addition to those noted above. Software engineering is repeated in the lists for additional emphasis.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Liscouski AreYouLabAutoEng06.png|410px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;410px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 2.''' Summary of skills required for each of the three sub-disciplines within laboratory automation engineering&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==What comes next?==&lt;br /&gt;
There are two major tasks that need to be undertaken, both under the heading of education. First is the development of a curriculum for laboratory automation engineering at the university level. This will require support both from organizations like the Association for Laboratory Automation (ALA) and from industry. A university is not going to create a program, even one that can be assembled to a large extent from existing programs, unless it has some assurance that its graduates will find jobs. The availability of employment opportunities would attract students, and the process would move on from there. One issue that needs to be addressed is how much laboratory science knowledge an LAE has to have in order to be effective. A closely tied issue is whether we are talking about an undergraduate program with particular science specializations or a graduate program.&lt;br /&gt;
&lt;br /&gt;
Undergraduate and graduate science programs also need to address the inclusion of automation in their courses. The point is not to teach science students engineering but to acquaint them with how the work of modern science is done. Just as computer literacy is part of high school curricula, automation literacy should be part of the course of science studies. James Sterling’s 2004 journal article on laboratory automation curriculum provides one example of what can be done.&amp;lt;ref name=&amp;quot;SterlingLab04&amp;quot;&amp;gt;{{cite journal |title=Laboratory Automation Curriculum at Keck Graduate Institute |journal=SLAS Technology |author=Sterling, J.D. |volume=9 |issue=5 |pages=331–35 |year=2004 |doi=10.1016/j.jala.2004.07.005}}&amp;lt;/ref&amp;gt; The ALA can take the lead and develop materials (e.g., course outlines, source material, etc.) to assist instructors who want to include this material in their programs.&lt;br /&gt;
&lt;br /&gt;
While a university program could address the long-term need, there is also a need to provide a means for those already working in the field to augment their backgrounds. Expanding the short-course structure already put in place by the ALA could satisfy that need. Another possibility is the development of certification programs similar to those used in computer science, working in conjunction with short courses and the development of an “Institute for Laboratory Automation” with an organized summer program.&lt;br /&gt;
&lt;br /&gt;
The second task is the organization of a “body of knowledge” about laboratory automation engineering that would include compilations of relevant texts, web sites, knowledge bases, etc., while encouraging those working in the field to contribute material to its development. The initial step would be the development of a framework to organize material and then to begin populating it. Once a framework is in place, publications like the ''Journal of the Association for Laboratory Automation'' could have authors key their papers to that structure. The details of the framework need input from others in the field and more depth of evaluation than this introductory piece can afford. &lt;br /&gt;
&lt;br /&gt;
In the development of this framework (Figure 3), the following points should be considered. First, there are fundamental techniques, skills, and technologies in laboratory automation that cut across scientific disciplines pretty much intact, including analog data acquisition, fundamentals of robotics, interfacing techniques, project management, etc. While there may be application-specific caveats (e.g., “Considerations in analog interfacing in secure biohazard facilities”), for the most part they are common elements and can be treated as a common block of knowledge. Second, once we get beyond basics and into applications of techniques in scientific disciplines, we may see the same basic outline structure duplicated with parallel sections. Robotics in chemistry may be similar to robotics in another discipline but with differences in equipment details and implementations. The outline will appear similar, with differences in the details. The framework should consider linkages between parallel outlines where similar needs generate similar systems development. In short, we should make it easy to recognize cross-fertilization between similar technologies in different disciplines.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Liscouski AreYouLabAutoEng06.png|400px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | STYLE=&amp;quot;vertical-align:top;&amp;quot;|&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;400px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot;| &amp;lt;blockquote&amp;gt;'''Figure 3.''' Example of an organized and interconnected LAE knowledge framework&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
Laboratory automation is still in the early phases of its development. Some may say that significant progress has been made, but in comparison to the rate of development of automation and information technologies in other areas, our progress has been slow and incremental (the internet, for example, is a little more than a decade old).&lt;br /&gt;
&lt;br /&gt;
What can and must be done is getting the user community to embrace the establishment and development of a discipline focused on envisioning, creating, and improving the tools and techniques of laboratory automation. That in turn will allow us to realize the promise that proponents of automation have long held: giving people the opportunity to do better science. Laboratory automation needs to be driven by people who want to do good work and are trained to do it.&lt;br /&gt;
&lt;br /&gt;
An LAE’s employment will be driven by market demand. However, the skill sets should be transferable. Just as computer science professionals have flexibility in where their skills are applied, LAEs should enjoy the same ability to move from one scientific application to another. The major difference will be learning the underlying science.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
'''ELN''': Electronic laboratory notebook&lt;br /&gt;
&lt;br /&gt;
'''FDA''': Food and Drug Administration&lt;br /&gt;
&lt;br /&gt;
'''FRS''': Functional requirements specification&lt;br /&gt;
&lt;br /&gt;
'''IT''': Information technology&lt;br /&gt;
&lt;br /&gt;
'''ISO''': International Organization for Standardization&lt;br /&gt;
&lt;br /&gt;
'''LAE''': Laboratory automation engineering (or engineer)&lt;br /&gt;
&lt;br /&gt;
'''LIMS''': Laboratory information management system&lt;br /&gt;
&lt;br /&gt;
'''URS''': User requirements specification&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==Acknowledgements==&lt;br /&gt;
I’d like to thank Mark Russo (Bristol-Myers Squibb) for his comments and support in the development of this article.&lt;br /&gt;
&lt;br /&gt;
==About the author==&lt;br /&gt;
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and quality control environments.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!---Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LII:Guides, white papers, and other publications]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64490</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64490"/>
		<updated>2024-06-17T22:35:54Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data&amp;lt;ref&amp;gt;{{Cite journal |last=Adam, K.; Hammad, I.; Fakhreldin, M.A.I. et al. |year=2015 |title=Big Data Analysis and Storage |url=http://umpir.ump.edu.my/id/eprint/7341 |journal=Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering |pages=648–59}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Marr, B. |date=21 May 2018 |title=How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read |work=Forbes |url=https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/ |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |date=2017 |title=Data Never Sleeps 5 |url=https://www.domo.com/learn/infographic/data-never-sleeps-5 |publisher=Domo, Inc}}&amp;lt;/ref&amp;gt;, both technically and socially intricate challenges have emerged&amp;lt;ref&amp;gt;{{Cite journal |last=Idrees |first=Sheikh Mohammad |last2=Alam |first2=M. Afshar |last3=Agarwal |first3=Parul |date=2019-12 |title=A study of big data and its challenges |url=http://link.springer.com/10.1007/s41870-018-0185-1 |journal=International Journal of Information Technology |language=en |volume=11 |issue=4 |pages=841–846 |doi=10.1007/s41870-018-0185-1 |issn=2511-2104}}&amp;lt;/ref&amp;gt;, necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. 
The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges.&amp;lt;ref&amp;gt;{{Cite web |last=United Nations |date=2015 |title=Transforming our world: the 2030 Agenda for Sustainable Development |url=https://wedocs.unep.org/20.500.11822/9814 |publisher=United Nations Environment Programme |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt; Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot;&amp;lt;ref&amp;gt;{{Cite web |last=Mons, B. |date=December 2018 |title=Message from President Barend Mons (2018-2023) |url=https://codata.org/about-codata/message-from-president-merce-crosas/message-from-president-barend-mons-2018-2023/ |publisher=Committee on Data (CODATA) |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike.&amp;lt;ref&amp;gt;{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-03-15 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=https://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}&amp;lt;/ref&amp;gt; Failure to achieve FAIRness risks transforming big data into opaque dark data.&amp;lt;ref&amp;gt;{{Cite journal |last=Heidorn |first=P. 
Bryan |date=2008-09 |title=Shedding Light on the Dark Data in the Long Tail of Science |url=https://muse.jhu.edu/article/262029 |journal=Library Trends |language=en |volume=57 |issue=2 |pages=280–299 |doi=10.1353/lib.0.0036 |issn=1559-0682}}&amp;lt;/ref&amp;gt; Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science&amp;lt;ref&amp;gt;{{Cite journal |last=Baker |first=Monya |date=2016-05-26 |title=1,500 scientists lift the lid on reproducibility |url=https://www.nature.com/articles/533452a |journal=Nature |language=en |volume=533 |issue=7604 |pages=452–454 |doi=10.1038/533452a |issn=0028-0836}}&amp;lt;/ref&amp;gt; but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology&amp;lt;ref&amp;gt;{{Cite journal |last=Lin |first=Dawei |last2=Crabtree |first2=Jonathan |last3=Dillo |first3=Ingrid |last4=Downs |first4=Robert R. |last5=Edmunds |first5=Rorie |last6=Giaretta |first6=David |last7=De Giusti |first7=Marisa |last8=L’Hours |first8=Hervé |last9=Hugo |first9=Wim |last10=Jenkyns |first10=Reyna |last11=Khodiyar |first11=Varsha |date=2020-05-14 |title=The TRUST Principles for digital repositories |url=https://www.nature.com/articles/s41597-020-0486-7 |journal=Scientific Data |language=en |volume=7 |issue=1 |pages=144 |doi=10.1038/s41597-020-0486-7 |issn=2052-4463 |pmc=PMC7224370 |pmid=32409645}}&amp;lt;/ref&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested.&amp;lt;ref&amp;gt;{{Cite web |title=The Internet of FAIR Data &amp;amp; Services |url=https://www.go-fair.org/resources/internet-fair-data-services/ |publisher=GO FAIR |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt; Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles.&amp;lt;ref&amp;gt;{{Cite book |last=European Commission. Directorate General for Research and Innovation. |date=2016 |title=Realising the European open science cloud: first report and recommendations of the Commission high level expert group on the European open science cloud. |url=https://data.europa.eu/doi/10.2777/940154 |publisher=Publications Office |place=LU |doi=10.2777/940154}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Citation |last=Hasnain |first=Ali |last2=Rebholz-Schuhmann |first2=Dietrich |date=2018 |editor-last=Gangemi |editor-first=Aldo |editor2-last=Gentile |editor2-first=Anna Lisa |editor3-last=Nuzzolese |editor3-first=Andrea Giovanni |editor4-last=Rudolph |editor4-first=Sebastian |editor5-last=Maleshkova |editor5-first=Maria |title=Assessing FAIR Data Principles Against the 5-Star Open Data Principles |url=https://link.springer.com/10.1007/978-3-319-98192-5_60 |work=The Semantic Web: ESWC 2018 Satellite Events |language=en |publisher=Springer International Publishing |place=Cham |volume=11155 |pages=469–477 |doi=10.1007/978-3-319-98192-5_60 |isbn=978-3-319-98191-8 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt; The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on 
description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata.&amp;lt;ref&amp;gt;{{Cite journal |last=Vogt |first=Lars |last2=Baum |first2=Roman |last3=Bhatty |first3=Philipp |last4=Köhler |first4=Christian |last5=Meid |first5=Sandra |last6=Quast |first6=Björn |last7=Grobe |first7=Peter |date=2019-01-01 |title=SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming |url=https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589 |journal=Database |language=en |volume=2019 |pages=baz067 |doi=10.1093/database/baz067 |issn=1758-0463 |pmc=PMC6686081 |pmid=31392324}}&amp;lt;/ref&amp;gt; Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics.&amp;lt;ref&amp;gt;{{Cite journal |last=Bonatti |first=Piero Andrea |last2=Decker |first2=Stefan |last3=Polleres |first3=Axel |last4=Presutti |first4=Valentina |date=2019 |title=Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371) |url=https://drops.dagstuhl.de/entities/document/10.4230/DagRep.8.9.29 |language=en |pages=83 pages, 5326322 bytes |doi=10.4230/dagrep.8.9.29}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
&lt;br /&gt;
*an intuitive way for modelling relations;&lt;br /&gt;
*the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;&lt;br /&gt;
*incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules;&lt;br /&gt;
*deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
*utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths.&amp;lt;ref&amp;gt;{{Cite journal |last=Hogan |first=Aidan |last2=Blomqvist |first2=Eva |last3=Cochez |first3=Michael |last4=D’amato |first4=Claudia |last5=Melo |first5=Gerard De |last6=Gutierrez |first6=Claudio |last7=Kirrane |first7=Sabrina |last8=Gayo |first8=José Emilio Labra |last9=Navigli |first9=Roberto |last10=Neumaier |first10=Sebastian |last11=Ngomo |first11=Axel-Cyrille Ngonga |date=2022-05-31 |title=Knowledge Graphs |url=https://dl.acm.org/doi/10.1145/3447772 |journal=ACM Computing Surveys |language=en |volume=54 |issue=4 |pages=1–37 |doi=10.1145/3447772 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Citation |last=Abiteboul |first=Serge |date=1997 |editor-last=Afrati |editor-first=Foto |editor2-last=Kolaitis |editor2-first=Phokion |title=Querying semi-structured data |url=http://link.springer.com/10.1007/3-540-62222-5_33 |work=Database Theory — ICDT '97 |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |volume=1186 |pages=1–18 |doi=10.1007/3-540-62222-5_33 |isbn=978-3-540-62222-2 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Angles |first=Renzo |last2=Gutierrez |first2=Claudio |date=2008-02 |title=Survey of graph database models |url=https://dl.acm.org/doi/10.1145/1322432.1322433 |journal=ACM Computing Surveys |language=en |volume=40 |issue=1 |pages=1–39 |doi=10.1145/1322432.1322433 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Angles |first=Renzo |last2=Arenas |first2=Marcelo |last3=Barceló |first3=Pablo |last4=Hogan |first4=Aidan |last5=Reutter |first5=Juan |last6=Vrgoč |first6=Domagoj |date=2018-09-30 |title=Foundations of Modern Query Languages for Graph Databases |url=https://dl.acm.org/doi/10.1145/3104031 
|journal=ACM Computing Surveys |language=en |volume=50 |issue=5 |pages=1–40 |doi=10.1145/3104031 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Hitzler, P.; Krötzsch, M.; Parsia, B. et al. |date=11 December 2012 |title=OWL 2 Web Ontology Language Primer (Second Edition) |url=https://www.w3.org/TR/owl2-primer/ |publisher=World Wide Web Consortium}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Philip |first=Stutz |last2=Daniel |first2=Strebel |last3=Abraham |first3=Bernstein |date=2016 |title=Signal/collect12: processing large graphs in seconds |url=https://www.zora.uzh.ch/id/eprint/119576 |doi=10.5167/UZH-119576}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Wang |first=Quan |last2=Mao |first2=Zhendong |last3=Wang |first3=Bin |last4=Guo |first4=Li |date=2017-12-01 |title=Knowledge Graph Embedding: A Survey of Approaches and Applications |url=http://ieeexplore.ieee.org/document/8047276/ |journal=IEEE Transactions on Knowledge and Data Engineering |volume=29 |issue=12 |pages=2724–2743 |doi=10.1109/TKDE.2017.2754499 |issn=1041-4347}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
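&lt;br /&gt;
The last point can be made concrete with a small, self-contained sketch (hypothetical data and names, not tied to any particular query language): one_hop plays the role of a plain relational selection, while reachable acts as a navigational operator that follows an edge label over arbitrary-length paths, akin to a SPARQL property path such as knows+:&lt;br /&gt;

```python
# Hypothetical edge set of (subject, predicate, object) triples.
edges = {
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "knows", "d"),
}

def one_hop(start, pred):
    """Relational-style selection: direct neighbours only."""
    return {o for (s, p, o) in edges if s == start and p == pred}

def reachable(start, pred):
    """Navigational operator: everything reachable via pred over paths
    of arbitrary length (breadth-first expansion until a fixed point)."""
    found, frontier = set(), {start}
    while frontier:
        step = {o for (s, p, o) in edges if s in frontier and p == pred}
        frontier = step - found
        found |= frontier
    return found

assert one_hop("a", "knows") == {"b"}
assert reachable("a", "knows") == {"b", "c", "d"}
```
The recursive expansion in reachable is exactly what relational joins cannot express with a fixed number of operators, which is why graph query languages add path operators on top of joins, unions, and projections.&lt;br /&gt;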
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general.&amp;lt;ref&amp;gt;{{Cite journal |last=Stocker |first=Markus |last2=Oelen |first2=Allard |last3=Jaradeh |first3=Mohamad Yaser |last4=Haris |first4=Muhammad |last5=Oghli |first5=Omar Arab |last6=Heidari |first6=Golsa |last7=Hussein |first7=Hassan |last8=Lorenz |first8=Anna-Lena |last9=Kabenamualu |first9=Salomon |last10=Farfar |first10=Kheir Eddine |last11=Prinz |first11=Manuel |date=2023-01-11 |editor-last=Magagna |editor-first=Barbara |title=FAIR scientific information with the Open Research Knowledge Graph |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&amp;amp;doi=10.3233/FC-221513 |journal=FAIR Connect |volume=1 |issue=1 |pages=19–21 |doi=10.3233/FC-221513}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Aisopos |first=Fotis |last2=Jozashoori |first2=Samaneh |last3=Niazmand |first3=Emetis |last4=Purohit |first4=Disha |last5=Rivas |first5=Ariam |last6=Sakor |first6=Ahmad |last7=Iglesias |first7=Enrique |last8=Vogiatzis |first8=Dimitrios |last9=Menasalvas |first9=Ernestina |last10=Rodriguez Gonzalez |first10=Alejandro |last11=Vigueras |first11=Guillermo |date=2023-05-08 |editor-last=Kondylakis |editor-first=Haridimos |editor2-last=Rao |editor2-first=Praveen |editor3-last=Stefanidis |editor3-first=Kostas |editor4-last=Stefanidis |editor4-first=Kostas |editor5-last=Kondylakis |editor5-first=Haridimos |title=Knowledge graphs for enhancing transparency in health data ecosystems1 |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&amp;amp;doi=10.3233/SW-223294 |journal=Semantic Web |volume=14 |issue=5 |pages=943–976 |doi=10.3233/SW-223294}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Cifuentes-Silva |first=Francisco |last2=Fernández-Álvarez |first2=Daniel |last3=Labra-Gayo |first3=Jose Emilio 
|date=2020-06-03 |title=National Budget as Linked Open Data: New Tools for Supporting the Sustainability of Public Finances |url=https://www.mdpi.com/2071-1050/12/11/4551 |journal=Sustainability |language=en |volume=12 |issue=11 |pages=4551 |doi=10.3390/su12114551 |issn=2071-1050}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Rajabi |first=Enayat |last2=Kafaie |first2=Somayeh |date=2022-09-28 |title=Knowledge Graphs and Explainable AI in Healthcare |url=https://www.mdpi.com/2078-2489/13/10/459 |journal=Information |language=en |volume=13 |issue=10 |pages=459 |doi=10.3390/info13100459 |issn=2078-2489}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Tiddi |first=Ilaria |last2=Schlobach |first2=Stefan |date=2022-01 |title=Knowledge graphs as tools for explainable machine learning: A survey |url=https://linkinghub.elsevier.com/retrieve/pii/S0004370221001788 |journal=Artificial Intelligence |language=en |volume=302 |pages=103627 |doi=10.1016/j.artint.2021.103627}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance&amp;lt;ref&amp;gt;{{Cite journal |last=Hogan |first=Aidan |last2=Arenas |first2=Marcelo |last3=Mallea |first3=Alejandro |last4=Polleres |first4=Axel |date=2014-08 |title=Everything you always wanted to know about blank nodes |url=https://linkinghub.elsevier.com/retrieve/pii/S1570826814000481 |journal=Journal of Web Semantics |language=en |volume=27-28 |pages=42–69 |doi=10.1016/j.websem.2014.06.004}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Neumann, T.; Moerkotte, G. |title=Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins {{!}} IEEE Conference Publication {{!}} IEEE Xplore |work=Proceedings of the 2011 IEEE 27th International Conference on Data Engineering |url=https://ieeexplore.ieee.org/document/5767868/ |doi=10.1109/icde.2011.5767868 |accessdate=}}&amp;lt;/ref&amp;gt; (see also the related concept of RDF molecules&amp;lt;ref&amp;gt;{{Cite journal |last=Papastefanatos |first=George |last2=Meimaris |first2=Marios |last3=Vassiliadis |first3=Panos |date=2022-02 |title=Relational schema optimization for RDF-based knowledge graphs |url=https://linkinghub.elsevier.com/retrieve/pii/S0306437921000223 |journal=Information Systems |language=en |volume=104 |pages=101754 |doi=10.1016/j.is.2021.101754}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Collarana |first=Diego |last2=Galkin |first2=Mikhail |last3=Traverso-Ribón |first3=Ignacio |last4=Vidal |first4=Maria-Esther |last5=Lange |first5=Christoph |last6=Auer |first6=Sören |date=2017-06-19 
|title=MINTE: semantically integrating RDF graphs |url=https://dl.acm.org/doi/10.1145/3102254.3102280 |journal=Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics |language=en |publisher=ACM |place=Amantea Italy |pages=1–11 |doi=10.1145/3102254.3102280 |isbn=978-1-4503-5225-3}}&amp;lt;/ref&amp;gt;), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity.&amp;lt;ref&amp;gt;{{Cite journal |last=Vogt |first=Lars |date=2019-12 |title=Organizing phenotypic data—a semantic data model for anatomy |url=https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-019-0204-6 |journal=Journal of Biomedical Semantics |language=en |volume=10 |issue=1 |pages=12 |doi=10.1186/s13326-019-0204-6 |issn=2041-1480 |pmc=PMC6585074 |pmid=31221226}}&amp;lt;/ref&amp;gt; Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. 
The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables them to be referenced within triples, providing an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
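&lt;br /&gt;
As a rough, purely illustrative sketch (hypothetical identifiers; not the authors' implementation), the partition property of statement units, whereby each triple belongs to exactly one statement unit, can be expressed with plain tuples standing in for triples:&lt;br /&gt;

```python
# Triples as (subject, predicate, object) tuples; all names are hypothetical.
triples_weight = [
    ("apple_X", "has_quality", "weight_1"),
    ("weight_1", "has_value", "204.56"),
    ("weight_1", "has_unit", "gram"),
]
triples_parthood = [
    ("stem_1", "part_of", "apple_X"),
]

# A statement unit is the smallest independent, semantically meaningful
# proposition; each unit owns its triples and is itself identified by a resource.
statement_units = {
    "unit:weight_measurement_1": triples_weight,
    "unit:parthood_1": triples_parthood,
}

def is_partition(units, graph):
    """True if every triple in the graph belongs to exactly one unit."""
    owned = [t for ts in units.values() for t in ts]
    return sorted(owned) == sorted(graph) and len(owned) == len(set(owned))

# The two statement units partition the four-triple graph.
assert is_partition(statement_units, triples_weight + triples_parthood)
```
Note that a multi-triple proposition (the weight measurement) and a single-triple proposition (the parthood statement) each form one unit, matching the paper's definition of a statement unit as one or more triples.&lt;br /&gt;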
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic)&amp;lt;ref&amp;gt;{{Citation |last=Ceusters |first=Werner |date=2022 |editor-last=Elkin |editor-first=Peter L. |title=The Place of Referent Tracking in Biomedical Informatics |url=https://link.springer.com/10.1007/978-3-031-11302-4_6 |work=Terminology, Ontology and their Implementations |language=en |publisher=Springer International Publishing |place=Cham |pages=39–46 |doi=10.1007/978-3-031-11302-4_6 |isbn=978-3-031-11301-7 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Ceusters |first=Werner |last2=Elkin |first2=Peter |last3=Smith |first3=Barry |date=2007-12 |title=Negative findings in electronic health records and biomedical ontologies: A realist approach |url=https://linkinghub.elsevier.com/retrieve/pii/S1386505607000408 |journal=International Journal of Medical Informatics |language=en |volume=76 |pages=S326–S333 |doi=10.1016/j.ijmedinf.2007.02.003 |pmc=PMC2211452 |pmid=17369081}}&amp;lt;/ref&amp;gt;, but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, such as a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something and represents something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
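&lt;br /&gt;
A minimal illustration of these positional rules (hypothetical CURIE-style names; a crude string heuristic stands in for a proper RDF term model): the ''Object'' position may hold a resource or a literal, while ''Subject'' and ''Predicate'' are always resources:&lt;br /&gt;

```python
# Hypothetical CURIE-style names; a string containing ":" crudely marks a resource.
triples = [
    ("ex:weight_1", "rdf:type", "ex:WeightMeasurement"),  # Object is a resource
    ("ex:weight_1", "ex:has_value", 204.56),              # Object is a literal
]

def object_is_literal(triple):
    """Crude stand-in for a real RDF term model: anything that is not a
    CURIE-like string counts as a literal here."""
    _s, _p, o = triple
    return not (isinstance(o, str) and ":" in o)

# Subject and Predicate are resources in both triples; only the Object varies.
assert [object_is_literal(t) for t in triples] == [False, True]
```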
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its own unique, persistent, and resolvable identifier (UPRI). Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, not only must the terms employed in data and metadata statements possess identifiers from controlled vocabularies, such as ontologies, ensuring terminological interoperability, but the semantic graph patterns underlying each statement must be shared as well. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI)&amp;lt;ref&amp;gt;{{Cite journal |last=Bandrowski |first=Anita |last2=Brinkman |first2=Ryan |last3=Brochhausen |first3=Mathias |last4=Brush |first4=Matthew H. |last5=Bug |first5=Bill |last6=Chibucos |first6=Marcus C. |last7=Clancy |first7=Kevin |last8=Courtot |first8=Mélanie |last9=Derom |first9=Dirk |last10=Dumontier |first10=Michel |last11=Fan |first11=Liju |date=2016-04-29 |editor-last=Xue |editor-first=Yu |title=The Ontology for Biomedical Investigations |url=https://dx.plos.org/10.1371/journal.pone.0154556 |journal=PLOS ONE |language=en |volume=11 |issue=4 |pages=e0154556 |doi=10.1371/journal.pone.0154556 |issn=1932-6203 |pmc=PMC4851331 |pmid=27128319}}&amp;lt;/ref&amp;gt; of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE)&amp;lt;ref&amp;gt;{{Cite journal |last=Madin |first=Joshua |last2=Bowers |first2=Shawn |last3=Schildhauer |first3=Mark |last4=Krivov |first4=Sergeui |last5=Pennington |first5=Deana |last6=Villa |first6=Ferdinando |date=2007-10 |title=An ontology for describing and synthesizing ecological observation data |url=https://linkinghub.elsevier.com/retrieve/pii/S1574954107000362 |journal=Ecological Informatics |language=en |volume=2 |issue=3 |pages=279–296 |doi=10.1016/j.ecoinf.2007.05.004}}&amp;lt;/ref&amp;gt;, an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
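The idea of referencing a graph-pattern identifier from a graph's metadata can be sketched as follows (plain Python; the pattern identifiers and metadata keys are invented for illustration and not drawn from any standard):&lt;br /&gt;

```python
# Sketch: each stored graph carries the identifier of the graph pattern
# it instantiates; graphs sharing a pattern identifier are candidates
# for interoperable processing. Pattern IDs below are hypothetical.

graphs = [
    {"graph_id": "g1", "pattern_id": "OBI-weight-measurement"},
    {"graph_id": "g2", "pattern_id": "OBOE-observation"},
    {"graph_id": "g3", "pattern_id": "OBI-weight-measurement"},
]

def interoperable_groups(graphs):
    """Group graphs by the graph-pattern identifier in their metadata."""
    groups = {}
    for g in graphs:
        groups.setdefault(g["pattern_id"], []).append(g["graph_id"])
    return groups
```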
&lt;br /&gt;
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL or labeled property graphs like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, requires the use of a query language: SPARQL&amp;lt;ref&amp;gt;{{Cite web |last=Harris, S.; Seaborne, A. |date=21 March 2013 |title=SPARQL 1.1 Query Language |url=https://www.w3.org/TR/sparql11-query/ |publisher=World Wide Web Consortium}}&amp;lt;/ref&amp;gt; for RDF/OWL, and Cypher&amp;lt;ref&amp;gt;{{Cite web |date=2024 |title=The Neo4j Operations Manual v5 |url=https://neo4j.com/docs/operations-manual/current/ |publisher=Neo4j, Inc}}&amp;lt;/ref&amp;gt; for Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs.&amp;lt;ref&amp;gt;{{Cite web |last=Booth, D.; Wallace, E. |date=2019 |title=Session X: EasyRDF |work=2nd U.S. Semantic Technologies Symposium 2019 |url=https://us2ts.org/2019/posts/program-session-x.html}}&amp;lt;/ref&amp;gt; Furthermore, these query languages are not aware of the graph patterns used to model statements, so users must know the exact pattern underlying a statement to formulate a query that retrieves it.&lt;br /&gt;
&lt;br /&gt;
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
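One way to picture such reusable query patterns is a lookup from a statement class to a parameterized query template, so that users supply only the resource of interest rather than writing the query themselves. This is a sketch only; the class name, template text, and prefix are hypothetical:&lt;br /&gt;

```python
# Sketch: reusable query templates keyed by a statement unit class.
# The template encodes the graph pattern; a user supplies only the
# subject URI. All names here are illustrative.

QUERY_TEMPLATES = {
    "weight statement unit": (
        "SELECT ?value ?unit WHERE {{\n"
        "  <{subject}> ex:hasWeightMeasurement ?m .\n"
        "  ?m ex:hasValue ?value ; ex:hasUnit ?unit .\n"
        "}}"
    ),
}

def build_query(statement_class, subject):
    """Instantiate the stored query pattern for a given subject URI."""
    return QUERY_TEMPLATES[statement_class].format(subject=subject)
```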
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed.&amp;lt;ref&amp;gt;{{Cite web |last=Hartig, O. |date=2017 |title=Foundations of RDF⋆ and SPARQL⋆ (An Alternative Approach to Statement-Level Metadata in RDF) |work=Alberto Mendelzon Workshop on Foundations of Data Management |url=https://www.semanticscholar.org/paper/Foundations-of-RDF%E2%8B%86-and-SPARQL%E2%8B%86-(An-Alternative-to-Hartig/36e70ee51cb7b7ec12faac934ae6b6a4d9da15a8}}&amp;lt;/ref&amp;gt; Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, the adoption of Named Graphs is an alternative compared to RDF reification or RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
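The quad mechanism can be sketched in plain Python: the Named Graph URI is carried as a fourth element of each triple, and a subgraph is recovered by filtering on it. The URIs and graph names below are invented for illustration:&lt;br /&gt;

```python
# Sketch: Named Graphs turn triples into quads by adding the graph URI
# as a fourth element. Filtering on that element recovers a subgraph.

quads = [
    ("ex:appleX",  "ex:hasWeight", "ex:weightX", "ex:measurementGraph"),
    ("ex:weightX", "ex:hasValue",  204.56,       "ex:measurementGraph"),
    ("ex:assayX",  "ex:hasInput",  "ex:appleX",  "ex:provenanceGraph"),
]

def named_graph(quads, graph_uri):
    """Return the triples belonging to one Named Graph."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph_uri]
```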
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
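A semantic unit's dual nature, a resource in the graph and an identifiable subgraph, can be sketched as a small data structure (plain Python; the UPRI, class label, and triples are invented for illustration):&lt;br /&gt;

```python
from dataclasses import dataclass, field

@dataclass
class SemanticUnit:
    """Sketch: a semantic unit is represented by its own resource (a
    UPRI), instantiates a semantic unit class, and identifies a set of
    triples that form its data graph."""
    upri: str          # identifies both the unit and its data graph
    unit_class: str    # e.g., a hypothetical "weight statement unit"
    data_graph: set = field(default_factory=set)

weight_unit = SemanticUnit(
    upri="ex:weightStatementUnit1",
    unit_class="weight statement unit",
    data_graph={
        ("ex:appleX",  "ex:hasWeight", "ex:weightX"),
        ("ex:weightX", "ex:hasValue",  "204.56"),
        ("ex:weightX", "ex:hasUnit",   "ex:gram"),
    },
)
```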
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* as it is illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, lacks semantic meaningfulness for a domain expert who has no background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
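The partition property described above can be checked mechanically (a sketch in plain Python with invented triples): every triple of the data graph layer must belong to exactly one statement unit, and the union of all statement unit subgraphs must reproduce the data graph:&lt;br /&gt;

```python
# Sketch: verify that statement units partition a data graph.

def is_partition(data_graph, statement_units):
    """True if every triple occurs in exactly one statement unit and
    the units jointly cover the whole data graph."""
    seen = []
    for unit in statement_units:
        seen.extend(unit)
    return sorted(seen) == sorted(data_graph)

data_graph = [
    ("ex:appleX",  "ex:hasWeight", "ex:weightX"),
    ("ex:weightX", "ex:hasValue",  "204.56"),
    ("ex:appleX",  "rdf:type",     "ex:apple"),
]
units = [
    [("ex:appleX",  "ex:hasWeight", "ex:weightX"),
     ("ex:weightX", "ex:hasValue",  "204.56")],   # weight statement unit
    [("ex:appleX",  "rdf:type",     "ex:apple")], # identification unit
]
```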
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
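The contrast between binary and ''n''-ary propositions can be made concrete (a sketch; the URIs and property names are invented): a binary relation fits a single triple, while an ''n''-ary one introduces an instance resource that carries the statement's predicate:&lt;br /&gt;

```python
# Sketch: a binary proposition needs one triple; an n-ary one is
# represented through an instance resource ("ex:weightX" below) that
# stands in for the statement's verb or predicate.

binary_unit = [
    ("ex:rightHand", "ex:hasPart", "ex:rightThumb"),
]

nary_unit = [
    ("ex:appleX",  "ex:hasQuality", "ex:weightX"),  # instance resource
    ("ex:weightX", "ex:hasValue",   "204.56"),
    ("ex:weightX", "ex:hasUnit",    "ex:gram"),
]
```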
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that identifies a resource as a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analogous to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that identifies a resource as a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples for two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. This data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as *‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
A typed statement unit assigns a human-readable label to a statement unit. It is a compound unit comprising the following statement units (see Fig. 9A):&lt;br /&gt;
&lt;br /&gt;
#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.&lt;br /&gt;
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* from Figure 6, serving as the reference statement unit for this *‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’*, and five instances of *&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;*. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This pattern template is employed for graphical displays of information from the reference statement unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.&lt;br /&gt;
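A dynamic label can be pictured as a template whose slots are filled with the labels contributed by the identification units (a sketch; the template text and slot names are invented for illustration):&lt;br /&gt;

```python
# Sketch: a dynamic label template for a hypothetical weight statement
# unit. Slot names correspond to resources of the reference statement
# unit; the labels are supplied by its identification units.

DYNAMIC_LABEL = "{subject} has a weight of {value} {unit}"

def render_label(template, labels):
    """Fill the template slots with labels from identification units."""
    return template.format(**labels)

label = render_label(
    DYNAMIC_LABEL,
    {"subject": "apple X", "value": "204.56", "unit": "gram"},
)
```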
&lt;br /&gt;
===Item unit===&lt;br /&gt;
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:hasSemanticUnitSubject&amp;lt;/u&amp;gt;*. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item''  concept in the Wikibase data model&amp;lt;ref name=&amp;quot;MWWikibase24&amp;quot;&amp;gt;{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}&amp;lt;/ref&amp;gt;, but adapt the concept to statement units rather than triples.&lt;br /&gt;
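Grouping statement units into item units by their shared subject can be sketched as follows (plain Python; the unit records are illustrative):&lt;br /&gt;

```python
# Sketch: an item unit collects all statement units sharing one subject.

statement_units = [
    {"id": "su1", "subject": "ex:appleX", "class": "weight statement unit"},
    {"id": "su2", "subject": "ex:appleX", "class": "color statement unit"},
    {"id": "su3", "subject": "ex:treeY",  "class": "location statement unit"},
]

def item_units(statement_units):
    """Map each subject to the statement units that describe it."""
    items = {}
    for su in statement_units:
        items.setdefault(su["subject"], []).append(su["id"])
    return items
```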
&lt;br /&gt;
===Item group unit===&lt;br /&gt;
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.&lt;br /&gt;
&lt;br /&gt;
===Granularity tree unit===&lt;br /&gt;
We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''&amp;lt;u&amp;gt;before&amp;lt;/u&amp;gt;'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]&lt;br /&gt;
&lt;br /&gt;
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.&lt;br /&gt;
&lt;br /&gt;
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be identified as its root: it is the subject of statement units whose objects serve, in turn, as subjects of further statement units within the same granularity tree unit, while never itself appearing as the object of any of them.&lt;br /&gt;
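Because the tree is directed from root to leaves, the root can be found mechanically as the node that occurs as a subject of the ordering triples but never as an object (a sketch over invented parthood triples):&lt;br /&gt;

```python
# Sketch: find the root of a granularity tree built from a partial
# order relation such as has-part. The root appears as a subject but
# never as an object of the ordering triples.

def tree_root(has_part_triples):
    subjects = {s for (s, _, _) in has_part_triples}
    objects = {o for (_, _, o) in has_part_triples}
    roots = subjects - objects
    return roots.pop() if len(roots) == 1 else None

triples = [
    ("ex:hand",  "ex:hasPart", "ex:thumb"),
    ("ex:hand",  "ex:hasPart", "ex:palm"),
    ("ex:thumb", "ex:hasPart", "ex:thumbNail"),
]
```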
&lt;br /&gt;
===Granular item group unit===&lt;br /&gt;
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.&lt;br /&gt;
&lt;br /&gt;
===Context unit===&lt;br /&gt;
The ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement *‘&amp;lt;u&amp;gt;description item unit&amp;lt;/u&amp;gt;’* ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' ‘&amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt;’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).&lt;br /&gt;
&lt;br /&gt;
===Dataset unit===&lt;br /&gt;
A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution to a collaborative project, to document the state of a particular object at a given time, or to store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.&lt;br /&gt;
&lt;br /&gt;
===List unit===&lt;br /&gt;
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. Conversely, a set unit is an unordered list unit where each resource is listed only once, adhering to a uniqueness restriction.&lt;br /&gt;
&lt;br /&gt;
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a *&amp;lt;u&amp;gt;SEMUNIT:''child''&amp;lt;/u&amp;gt;* relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''&amp;lt;u&amp;gt;index&amp;lt;/u&amp;gt;'' (RDF:index).&lt;br /&gt;
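The membership mechanism described above can be sketched in plain Python, with membership statement units modeled as records carrying the list unit’s UPRI, the member resource, and an index; all UPRIs are hypothetical:&lt;br /&gt;

```python
# Membership statement units of an ordered list unit: each links
# the list unit's UPRI via a child relation to a member resource
# and carries an index so the sequence can be reconstructed.
# UPRIs are invented for illustration.
membership = [
    {"list": "upri:authors-1", "child": "upri:author-doe", "index": 2},
    {"list": "upri:authors-1", "child": "upri:author-roe", "index": 1},
]

def members_in_order(statement_units, list_upri):
    """Reconstruct the ordered members of a list unit from its
    indexed membership statement units."""
    rows = [m for m in statement_units if m["list"] == list_upri]
    return [m["child"] for m in sorted(rows, key=lambda m: m["index"])]

print(members_in_order(membership, "upri:authors-1"))
```

For an unordered list unit, the index would simply be omitted; a set unit would additionally enforce that each member resource occurs at most once.&lt;br /&gt;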
&lt;br /&gt;
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.&lt;br /&gt;
&lt;br /&gt;
==Discussion==&lt;br /&gt;
===Benefits of organizing a knowledge graph into semantic units===&lt;br /&gt;
====Semantic units enhance data management flexibility through modularity====&lt;br /&gt;
The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.&lt;br /&gt;
&lt;br /&gt;
====Semantic units operate at a higher level of abstraction than individual triples====&lt;br /&gt;
Semantically, semantic units encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, all while the statement unit itself and its triples are maintained in a single location within the graph.&lt;br /&gt;
&lt;br /&gt;
The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches to graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research.&lt;br /&gt;
&lt;br /&gt;
===Semantic units as a framework for knowledge graph alignment===&lt;br /&gt;
The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.&lt;br /&gt;
&lt;br /&gt;
===Managing restricted access to sensitive data===&lt;br /&gt;
The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.&lt;br /&gt;
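As a sketch of how such class-based restrictions might be enforced at query time, the following Python fragment filters statement units by their ontology class; the class names, UPRIs, and clearance flag are all hypothetical:&lt;br /&gt;

```python
# Class-based access control: each statement unit instantiates an
# ontology class, and classes holding sensitive data are withheld
# from roles without clearance. All names are hypothetical.
SENSITIVE_CLASSES = {"DiagnosisStatementUnit", "PatientNameStatementUnit"}

statement_units = [
    {"upri": "upri:su-1", "class": "WeightStatementUnit"},
    {"upri": "upri:su-2", "class": "DiagnosisStatementUnit"},
]

def readable_units(units, has_clearance):
    """Return the UPRIs of statement units the requesting role may
    read, withholding sensitive classes without clearance."""
    if has_clearance:
        return [u["upri"] for u in units]
    return [u["upri"] for u in units if u["class"] not in SENSITIVE_CLASSES]

print(readable_units(statement_units, has_clearance=False))
```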
&lt;br /&gt;
===Semantic units: A framework for nested and overlapping knowledge graph modules===&lt;br /&gt;
====Semantic units identify five levels of representational granularity====&lt;br /&gt;
Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig10 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.&lt;br /&gt;
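The roll-up from statement units to item units described above can be sketched in a few lines of Python; the statement unit contents and UPRIs here are invented for illustration:&lt;br /&gt;

```python
# Rolling triples up the representational-granularity hierarchy:
# triples belong to statement units, and statement units sharing a
# subject form an item unit. All UPRIs and names are hypothetical.
statement_units = [
    {"upri": "upri:su-mp", "subject": "lead",
     "triples": [("lead", "meltingPoint", "value-1"),
                 ("value-1", "hasValue", "327.5")]},
    {"upri": "upri:su-type", "subject": "lead",
     "triples": [("lead", "type", "ChemicalElement")]},
    {"upri": "upri:su-gold", "subject": "gold",
     "triples": [("gold", "type", "ChemicalElement")]},
]

def item_units(units):
    """An item unit comprises all statement units about one subject."""
    items = {}
    for su in units:
        items.setdefault(su["subject"], []).append(su["upri"])
    return items

print(item_units(statement_units))
```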
&lt;br /&gt;
====Semantic units identify granularity trees====&lt;br /&gt;
Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]&lt;br /&gt;
&lt;br /&gt;
===Semantic units identify context-dependent subgraphs===&lt;br /&gt;
Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.&lt;br /&gt;
&lt;br /&gt;
====Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units====&lt;br /&gt;
The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.&lt;br /&gt;
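The mechanism can be illustrated with a small Python sketch: once a statement unit’s data graph is referenceable through its UPRI, that UPRI can appear as subject or object of further triples. All identifiers below are hypothetical:&lt;br /&gt;

```python
# A statement unit's UPRI standing in for its whole data graph,
# enabling statements about the statement. Names are hypothetical.
data_graphs = {
    "upri:su-mp": [("lead", "meltingPoint", "327.5 degC")],
}
# Discursive-layer triples referring to the statement unit by UPRI:
meta_triples = [
    ("author_A", "asserts", "upri:su-mp"),
    ("upri:su-mp", "resultOf", "experiment_X"),
]

def statements_about(unit_upri, triples):
    """Return all meta-statements in which the statement unit's UPRI
    takes part, as subject or as object."""
    return [t for t in triples if unit_upri in (t[0], t[2])]

print(len(statements_about("upri:su-mp", meta_triples)))  # 2
```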
&lt;br /&gt;
Moreover, if a knowledge graph should cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model this without having a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘''author_A -asserts-&amp;gt; statement_unit_Y''’; ‘''statement_unit_X -hasMetadata-&amp;gt; statement_unit_Z''’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.'' [61]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig12 Vogt JofBiomedSem24 15.png|1300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of &amp;lt;u&amp;gt;journal article&amp;lt;/u&amp;gt; (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *&amp;lt;u&amp;gt;SEMUNIT:''hasLinkedSemanticUnit''&amp;lt;/u&amp;gt;* property. The interconnected hierarchy extends to an &amp;lt;u&amp;gt;investigation&amp;lt;/u&amp;gt; (OBI:0000066) instance, resulting in a &amp;lt;u&amp;gt;data set&amp;lt;/u&amp;gt; (IAO:0000100) instance with a &amp;lt;u&amp;gt;description&amp;lt;/u&amp;gt; (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of &amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt; (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Implementation===&lt;br /&gt;
====Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications===&lt;br /&gt;
To initiate the structuring of a knowledge graph into semantic units, a layer of abstraction beyond the triple level must first be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to the data graph of one statement unit. In RDF/OWL, statement units can be conceptualized as nanopublications.&lt;br /&gt;
&lt;br /&gt;
Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format with well-defined semantics, and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:&lt;br /&gt;
&lt;br /&gt;
#the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;&lt;br /&gt;
#the assertion Named Graph, containing the assertion modelled as a graph;&lt;br /&gt;
#the provenance Named Graph, containing metadata about the assertion; and&lt;br /&gt;
#the publicationInfo Named Graph, containing metadata about the nanopublication itself.&lt;br /&gt;
&lt;br /&gt;
The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).&lt;br /&gt;
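The four-graph layout of a statement unit serialized as a nanopublication can be sketched as a plain Python mapping from graph role to triples; the identifiers and triple contents below are hypothetical placeholders, not the nanopublication vocabulary itself:&lt;br /&gt;

```python
# A statement unit laid out following the nanopublication schema:
# four named graphs keyed by role. All identifiers and triples are
# invented for illustration.
nanopub = {
    # head graph: connects the other graphs to the nanopub's ID,
    # and here plays the role of the semantic-units graph
    "head": [("np:pub-1", "hasAssertion", "np:pub-1#assertion"),
             ("np:pub-1", "hasProvenance", "np:pub-1#provenance"),
             ("np:pub-1", "hasPublicationInfo", "np:pub-1#pubinfo")],
    # assertion graph: the statement unit's data graph
    "assertion": [("lead", "meltingPoint", "327.5 degC")],
    # provenance graph: metadata about the assertion
    "provenance": [("np:pub-1#assertion", "resultOf", "experiment_X")],
    # publicationInfo graph: metadata about the nanopub itself
    "pubinfo": [("np:pub-1", "createdBy", "author_A")],
}

print(sorted(nanopub))
```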
&lt;br /&gt;
A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.&lt;br /&gt;
&lt;br /&gt;
====Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs====&lt;br /&gt;
In Neo4j, a labeled property graph, the assignment of UPRIs to all nodes and relations through a ‘''UPRI:upri''’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘''statement_unit_UPRI:upri''’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘''compound_unit_UPRI:upri''’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘''compound_unit_UPRI''’ property can incorporate an array of UPRIs representing different semantic units.&lt;br /&gt;
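The property-value scheme just described can be simulated without a database: the sketch below models labeled-property-graph nodes as Python dicts carrying the three properties, then retrieves members of a statement unit and of a compound unit. All UPRIs are hypothetical:&lt;br /&gt;

```python
# Nodes of a labeled property graph, each carrying its own UPRI,
# the UPRI of its statement unit, and an array of compound unit
# UPRIs (a node may belong to several). UPRIs are hypothetical.
nodes = [
    {"UPRI": "upri:n1", "statement_unit_UPRI": "upri:su-1",
     "compound_unit_UPRI": ["upri:item-lead", "upri:group-metals"]},
    {"UPRI": "upri:n2", "statement_unit_UPRI": "upri:su-1",
     "compound_unit_UPRI": ["upri:item-lead"]},
    {"UPRI": "upri:n3", "statement_unit_UPRI": "upri:su-2",
     "compound_unit_UPRI": ["upri:item-gold"]},
]

def nodes_of_statement_unit(ns, su_upri):
    """All nodes belonging to one statement unit's data graph."""
    return [n["UPRI"] for n in ns if n["statement_unit_UPRI"] == su_upri]

def nodes_of_compound_unit(ns, cu_upri):
    """All nodes of a compound unit, via the UPRI array."""
    return [n["UPRI"] for n in ns if cu_upri in n["compound_unit_UPRI"]]

print(nodes_of_statement_unit(nodes, "upri:su-1"))
```

In Neo4j itself, the equivalent lookups would be Cypher queries matching on these properties; the dict-based version merely mirrors the indexing logic.&lt;br /&gt;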
&lt;br /&gt;
An initial software application for demonstration purposes has been developed by one of the authors, illustrating how semantic units can manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application employs the organization of the graph into semantic units within a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig13 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Strategies for implementation====&lt;br /&gt;
Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:&lt;br /&gt;
&lt;br /&gt;
#'''Develop from scratch''': In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.&lt;br /&gt;
#'''Transfer an existing knowledge graph''': If there is an existing knowledge graph that needs restructuring into semantic units, the next step is crafting queries to transfer all triples into corresponding statement units, based on the graph schemata identified in the first step. The main challenge is maintaining disjointness of triples between statement units.&lt;br /&gt;
#'''A hybrid approach''': For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.&lt;br /&gt;
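The transfer strategy, and in particular its disjointness requirement, can be sketched in Python. Here the per-class graph schemata are reduced to a predicate whitelist per statement unit class (real schemata would be SHACL shapes); class names, predicates, and entities are hypothetical:&lt;br /&gt;

```python
# Transferring existing triples into statement units: each triple
# must match exactly one statement unit class's schema, so that the
# resulting statement units partition the graph. Schemata here are
# simplified to predicate whitelists; all names are hypothetical.
schemata = {
    "WeightStatementUnit": {"hasWeight", "hasUnit"},
    "ParthoodStatementUnit": {"hasPart"},
}

def partition_into_statement_units(triples):
    """Assign every triple to exactly one statement unit class,
    raising if a triple matches zero or several schemata (which
    would break the partition)."""
    units = {cls: [] for cls in schemata}
    for t in triples:
        matches = [c for c, preds in schemata.items() if t[1] in preds]
        if len(matches) != 1:
            raise ValueError(f"triple {t} matches {len(matches)} schemata")
        units[matches[0]].append(t)
    return units

units = partition_into_statement_units(
    [("organism", "hasPart", "head"), ("organism", "hasWeight", "w1")])
print({cls: len(ts) for cls, ts in units.items()})
```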
&lt;br /&gt;
====Semantic units as FAIR Digital Objects====&lt;br /&gt;
The concept of FAIR Digital Objects, as proposed by the European Commission Expert Group on FAIR Data, stands at the core of achieving the FAIR Principles [67], emphasizing persistent identifiers, comprehensive metadata, and contextual documentation for reliable discovery, citation, and reuse. The concept of semantic units aligns with that of FAIR Digital Objects. Each semantic unit inherently possesses a UPRI, serving as a ready-made persistent identifier. Accessibility and searchability are ensured through established query languages like SPARQL and Cypher, with RDF, JSON, and other formats supporting data export. When knowledge graphs adhere to controlled vocabularies and ontologies, and when they employ standard graph patterns using tools like SHACL [51], ShEx [68, 69], or OTTR [70, 71], the data within the data graphs of semantic units may more easily achieve semantic interoperability.&lt;br /&gt;
&lt;br /&gt;
Moreover, semantic units can provide provenance—crucial for tracking a semantic unit’s history—through utilizing property-value pairs for labeled property knowledge graphs or a designated provenance Named Graph for RDF/OWL knowledge graphs. The provenance metadata of a semantic unit encompasses details like the creator, creation date, application used, title, contributing users, and last-update, focusing solely on the semantic unit itself, not the original data production process.&lt;br /&gt;
&lt;br /&gt;
Access control metadata can specify any licenses as well as access control restrictions.&lt;br /&gt;
&lt;br /&gt;
==Conclusion and future work==&lt;br /&gt;
In conclusion, the adoption of semantic units in structuring knowledge graphs may be useful to address the challenges faced in knowledge representation mentioned in the introduction. By encapsulating each statement within its dedicated statement unit, accompanied by a corresponding statement unit class and data schema (e.g., as a SHACL shape), a robust foundation for FAIR data and metadata is established, supporting schematic interoperability. Because statement units partition the knowledge graph so that every triple belongs to exactly one statement unit and every statement unit’s subgraph is identifiable and referenceable through its UPRI, data in a knowledge graph is linked to graph patterns, which are identifiable as a whole. By providing each schema its own UPRI, each semantic unit can specify its underlying schema in its metadata. Identifying semantically interoperable semantic units is then straightforward, and schema crosswalks between different schemata can increase schematic interoperability. [72] (This addresses Challenge 1.)&lt;br /&gt;
&lt;br /&gt;
Graph query languages can use the graph patterns (semantic units), and therefore allow access to knowledge graph content through higher levels of abstractions than basic triples. (This addresses Challenge 2.) Further, we have shown how semantic units can organize knowledge graphs in different layers and make statements about statements. (This addresses Challenge 3.)&lt;br /&gt;
&lt;br /&gt;
Future research involves extending the semantic units approach to incorporate question units and a nuanced categorization of assertional, contingent, prototypical, and universal statement units. This extension will encompass formal semantics for the latter, including provisions for negations and cardinality restrictions. Additionally, we are exploring novel approaches to knowledge graph exploration based on semantic units.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
&lt;br /&gt;
*'''BFO''': Basic Formal Ontology&lt;br /&gt;
*'''CRUD''': Create, Read, Update, Delete&lt;br /&gt;
*'''FAIR''': Findable, Accessible, Interoperable, and Reusable&lt;br /&gt;
*'''HTTP''': Hypertext Transfer Protocol&lt;br /&gt;
*'''HTTPS''': Hypertext Transfer Protocol Secure&lt;br /&gt;
*'''IAO''': Information Artifact Ontology&lt;br /&gt;
*'''ID''': Identifier&lt;br /&gt;
*'''JSON''': JavaScript Object Notation&lt;br /&gt;
*'''LinkML''': Linked Data Modeling Language&lt;br /&gt;
*'''NCIT''': National Cancer Institute Thesaurus&lt;br /&gt;
*'''NoSQL''': Not only Structured Query Language&lt;br /&gt;
*'''OBI''': Ontology for Biomedical Investigations&lt;br /&gt;
*'''OBOE''': Extensible Observation Ontology&lt;br /&gt;
*'''OBO Foundry''': Open Biological and Biomedical Ontology Foundry&lt;br /&gt;
*'''OTTR''': Reasonable Ontology Templates&lt;br /&gt;
*'''OWL''': Web Ontology Language&lt;br /&gt;
*'''PATO''': Phenotype and Trait Ontology&lt;br /&gt;
*'''RDF''': Resource Description Framework&lt;br /&gt;
*'''RDFS''': RDF-Schema&lt;br /&gt;
*'''RO''': OBO Relations Ontology&lt;br /&gt;
*'''SHACL''': Shape Constraint Language&lt;br /&gt;
*'''ShEx''': Shape Expression&lt;br /&gt;
*'''SIO''': Semanticscience Integrated Ontology&lt;br /&gt;
*'''SPARQL''': SPARQL Protocol and RDF Query Language&lt;br /&gt;
*'''TI''': Time Ontology in OWL&lt;br /&gt;
*'''TRUST''': Transparency, Responsibility, User Focus, Sustainability, and Technology&lt;br /&gt;
*'''UBERON''': Uber-anatomy ontology&lt;br /&gt;
*'''UO''': Units of Measurement Ontology&lt;br /&gt;
*'''UPRI''': Unique Persistent and Resolvable Identifier&lt;br /&gt;
*'''XSD''': Extensible Markup Language Schema Definition&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==Acknowledgements==&lt;br /&gt;
We thank Werner Ceusters, Nico Matentzoglu, Manuel Prinz, Marcel Konrad, Philip Strömert, Roman Baum, Björn Quast, Peter Grobe, István Míko, Manfred Jeusfeld, Manolis Koubarakis, Javad Chamanara, and Kheir Eddine for discussing some of the presented ideas. We also thank the anonymous reviewers for their suggestions and feedback. We are solely responsible for all the arguments and statements in this paper.&lt;br /&gt;
&lt;br /&gt;
===Author contributions===&lt;br /&gt;
L.V. developed the concept of semantic units and wrote the initial manuscript text. All authors reviewed and revised the manuscript.&lt;br /&gt;
&lt;br /&gt;
===Funding===&lt;br /&gt;
Open Access funding enabled and organized by Projekt DEAL. Lars Vogt received funding by the ERC H2020 Project ‘ScienceGraph’ (819536).&lt;br /&gt;
&lt;br /&gt;
===Conflict of interest===&lt;br /&gt;
The authors declare no competing interests.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64489</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64489"/>
		<updated>2024-06-17T20:00:35Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data&amp;lt;ref&amp;gt;{{Cite journal |last=Adam, K.; Hammad, I.; Fakhreldin, M.A.I. et al. |year=2015 |title=Big Data Analysis and Storage |url=http://umpir.ump.edu.my/id/eprint/7341 |journal=Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering |pages=648–59}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Marr, B. |date=21 May 2018 |title=How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read |work=Forbes |url=https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/ |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |date=2017 |title=Data Never Sleeps 5 |url=https://www.domo.com/learn/infographic/data-never-sleeps-5 |publisher=Domo, Inc}}&amp;lt;/ref&amp;gt;, both technically and socially intricate challenges have emerged&amp;lt;ref&amp;gt;{{Cite journal |last=Idrees |first=Sheikh Mohammad |last2=Alam |first2=M. Afshar |last3=Agarwal |first3=Parul |date=2019-12 |title=A study of big data and its challenges |url=http://link.springer.com/10.1007/s41870-018-0185-1 |journal=International Journal of Information Technology |language=en |volume=11 |issue=4 |pages=841–846 |doi=10.1007/s41870-018-0185-1 |issn=2511-2104}}&amp;lt;/ref&amp;gt;, necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. 
The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges.&amp;lt;ref&amp;gt;{{Cite web |last=United Nations |date=2015 |title=Transforming our world: the 2030 Agenda for Sustainable Development |url=https://wedocs.unep.org/20.500.11822/9814 |publisher=United Nations Environment Programme |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt; Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot;&amp;lt;ref&amp;gt;{{Cite web |last=Mons, B. |date=December 2018 |title=Message from President Barend Mons (2018-2023) |url=https://codata.org/about-codata/message-from-president-merce-crosas/message-from-president-barend-mons-2018-2023/ |publisher=Committee on Data (CODATA) |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike.&amp;lt;ref&amp;gt;{{Cite journal |last=Wilkinson |first=Mark D. |last2=Dumontier |first2=Michel |last3=Aalbersberg |first3=IJsbrand Jan |last4=Appleton |first4=Gabrielle |last5=Axton |first5=Myles |last6=Baak |first6=Arie |last7=Blomberg |first7=Niklas |last8=Boiten |first8=Jan-Willem |last9=da Silva Santos |first9=Luiz Bonino |last10=Bourne |first10=Philip E. |last11=Bouwman |first11=Jildau |date=2016-03-15 |title=The FAIR Guiding Principles for scientific data management and stewardship |url=https://www.nature.com/articles/sdata201618 |journal=Scientific Data |language=en |volume=3 |issue=1 |pages=160018 |doi=10.1038/sdata.2016.18 |issn=2052-4463 |pmc=PMC4792175 |pmid=26978244}}&amp;lt;/ref&amp;gt; Failure to achieve FAIRness risks transforming big data into opaque dark data.&amp;lt;ref&amp;gt;{{Cite journal |last=Heidorn |first=P. 
Bryan |date=2008-09 |title=Shedding Light on the Dark Data in the Long Tail of Science |url=https://muse.jhu.edu/article/262029 |journal=Library Trends |language=en |volume=57 |issue=2 |pages=280–299 |doi=10.1353/lib.0.0036 |issn=1559-0682}}&amp;lt;/ref&amp;gt; Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science&amp;lt;ref&amp;gt;{{Cite journal |last=Baker |first=Monya |date=2016-05-26 |title=1,500 scientists lift the lid on reproducibility |url=https://www.nature.com/articles/533452a |journal=Nature |language=en |volume=533 |issue=7604 |pages=452–454 |doi=10.1038/533452a |issn=0028-0836}}&amp;lt;/ref&amp;gt; but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology&amp;lt;ref&amp;gt;{{Cite journal |last=Lin |first=Dawei |last2=Crabtree |first2=Jonathan |last3=Dillo |first3=Ingrid |last4=Downs |first4=Robert R. |last5=Edmunds |first5=Rorie |last6=Giaretta |first6=David |last7=De Giusti |first7=Marisa |last8=L’Hours |first8=Hervé |last9=Hugo |first9=Wim |last10=Jenkyns |first10=Reyna |last11=Khodiyar |first11=Varsha |date=2020-05-14 |title=The TRUST Principles for digital repositories |url=https://www.nature.com/articles/s41597-020-0486-7 |journal=Scientific Data |language=en |volume=7 |issue=1 |pages=144 |doi=10.1038/s41597-020-0486-7 |issn=2052-4463 |pmc=PMC7224370 |pmid=32409645}}&amp;lt;/ref&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested.&amp;lt;ref&amp;gt;{{Cite web |title=The Internet of FAIR Data &amp;amp; Services |url=https://www.go-fair.org/resources/internet-fair-data-services/ |publisher=GO FAIR |accessdate=22 May 2024}}&amp;lt;/ref&amp;gt; Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles.&amp;lt;ref&amp;gt;{{Cite book |last=European Commission. Directorate General for Research and Innovation. |date=2016 |title=Realising the European open science cloud: first report and recommendations of the Commission high level expert group on the European open science cloud. |url=https://data.europa.eu/doi/10.2777/940154 |publisher=Publications Office |place=LU |doi=10.2777/940154}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Citation |last=Hasnain |first=Ali |last2=Rebholz-Schuhmann |first2=Dietrich |date=2018 |editor-last=Gangemi |editor-first=Aldo |editor2-last=Gentile |editor2-first=Anna Lisa |editor3-last=Nuzzolese |editor3-first=Andrea Giovanni |editor4-last=Rudolph |editor4-first=Sebastian |editor5-last=Maleshkova |editor5-first=Maria |title=Assessing FAIR Data Principles Against the 5-Star Open Data Principles |url=https://link.springer.com/10.1007/978-3-319-98192-5_60 |work=The Semantic Web: ESWC 2018 Satellite Events |language=en |publisher=Springer International Publishing |place=Cham |volume=11155 |pages=469–477 |doi=10.1007/978-3-319-98192-5_60 |isbn=978-3-319-98191-8 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt; The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on 
description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata.&amp;lt;ref&amp;gt;{{Cite journal |last=Vogt |first=Lars |last2=Baum |first2=Roman |last3=Bhatty |first3=Philipp |last4=Köhler |first4=Christian |last5=Meid |first5=Sandra |last6=Quast |first6=Björn |last7=Grobe |first7=Peter |date=2019-01-01 |title=SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming |url=https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589 |journal=Database |language=en |volume=2019 |pages=baz067 |doi=10.1093/database/baz067 |issn=1758-0463 |pmc=PMC6686081 |pmid=31392324}}&amp;lt;/ref&amp;gt; Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics.&amp;lt;ref&amp;gt;{{Cite journal |last=Bonatti |first=Piero Andrea |last2=Decker |first2=Stefan |last3=Polleres |first3=Axel |last4=Presutti |first4=Valentina |date=2019 |title=Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371) |url=https://drops.dagstuhl.de/entities/document/10.4230/DagRep.8.9.29 |language=en |pages=83 pages, 5326322 bytes |doi=10.4230/dagrep.8.9.29}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
&lt;br /&gt;
*an intuitive way for modelling relations;&lt;br /&gt;
*the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge;&lt;br /&gt;
*incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules;&lt;br /&gt;
*deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
*utilization of specialized graph query languages that support not only standard relational operators such as joins, unions, and projections, but also navigational operators for recursively searching for entities through arbitrary-length paths.&amp;lt;ref&amp;gt;{{Cite journal |last=Hogan |first=Aidan |last2=Blomqvist |first2=Eva |last3=Cochez |first3=Michael |last4=D’amato |first4=Claudia |last5=Melo |first5=Gerard De |last6=Gutierrez |first6=Claudio |last7=Kirrane |first7=Sabrina |last8=Gayo |first8=José Emilio Labra |last9=Navigli |first9=Roberto |last10=Neumaier |first10=Sebastian |last11=Ngomo |first11=Axel-Cyrille Ngonga |date=2022-05-31 |title=Knowledge Graphs |url=https://dl.acm.org/doi/10.1145/3447772 |journal=ACM Computing Surveys |language=en |volume=54 |issue=4 |pages=1–37 |doi=10.1145/3447772 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Citation |last=Abiteboul |first=Serge |date=1997 |editor-last=Afrati |editor-first=Foto |editor2-last=Kolaitis |editor2-first=Phokion |title=Querying semi-structured data |url=http://link.springer.com/10.1007/3-540-62222-5_33 |work=Database Theory — ICDT '97 |publisher=Springer Berlin Heidelberg |place=Berlin, Heidelberg |volume=1186 |pages=1–18 |doi=10.1007/3-540-62222-5_33 |isbn=978-3-540-62222-2 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Angles |first=Renzo |last2=Gutierrez |first2=Claudio |date=2008-02 |title=Survey of graph database models |url=https://dl.acm.org/doi/10.1145/1322432.1322433 |journal=ACM Computing Surveys |language=en |volume=40 |issue=1 |pages=1–39 |doi=10.1145/1322432.1322433 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Angles |first=Renzo |last2=Arenas |first2=Marcelo |last3=Barceló |first3=Pablo |last4=Hogan |first4=Aidan |last5=Reutter |first5=Juan |last6=Vrgoč |first6=Domagoj |date=2018-09-30 |title=Foundations of Modern Query Languages for Graph Databases |url=https://dl.acm.org/doi/10.1145/3104031 
|journal=ACM Computing Surveys |language=en |volume=50 |issue=5 |pages=1–40 |doi=10.1145/3104031 |issn=0360-0300}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Hitzler, P.; Krötzsch, M.; Parsia, B. et al. |date=11 December 2012 |title=OWL 2 Web Ontology Language Primer (Second Edition) |url=https://www.w3.org/TR/owl2-primer/ |publisher=World Wide Web Consortium}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Philip |first=Stutz |last2=Daniel |first2=Strebel |last3=Abraham |first3=Bernstein |date=2016 |title=Signal/collect12: processing large graphs in seconds |url=https://www.zora.uzh.ch/id/eprint/119576 |doi=10.5167/UZH-119576}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Wang |first=Quan |last2=Mao |first2=Zhendong |last3=Wang |first3=Bin |last4=Guo |first4=Li |date=2017-12-01 |title=Knowledge Graph Embedding: A Survey of Approaches and Applications |url=http://ieeexplore.ieee.org/document/8047276/ |journal=IEEE Transactions on Knowledge and Data Engineering |volume=29 |issue=12 |pages=2724–2743 |doi=10.1109/TKDE.2017.2754499 |issn=1041-4347}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
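&lt;br /&gt;
To illustrate the last point, the navigational queries that graph query languages offer natively (e.g., SPARQL property paths such as a transitive ''partOf'', or Cypher variable-length patterns) amount to a reachability search over triples along paths of arbitrary length. The following plain-Python sketch is purely illustrative and uses invented data and predicate names:&lt;br /&gt;

```python
# Illustrative sketch (not from the article): arbitrary-length path search
# over a set of triples, mimicking what SPARQL property paths or Cypher
# variable-length patterns provide natively in graph query languages.

def reachable(triples, start, predicate):
    """Return every node reachable from `start` via one or more `predicate` edges."""
    found, frontier = set(), {start}
    while frontier:
        step = {o for (s, p, o) in triples
                if p == predicate and s in frontier and o not in found}
        found |= step
        frontier = step
    return found

# Hypothetical parthood hierarchy.
triples = {
    ("finger", "partOf", "hand"),
    ("hand", "partOf", "arm"),
    ("arm", "partOf", "body"),
}

print(reachable(triples, "finger", "partOf"))  # the set {'hand', 'arm', 'body'}
```

A relational query engine would need recursive joins to answer the same question, which is why such navigational operators are listed here as a distinctive advantage of graph models.&lt;br /&gt;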
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general.&amp;lt;ref&amp;gt;{{Cite journal |last=Stocker |first=Markus |last2=Oelen |first2=Allard |last3=Jaradeh |first3=Mohamad Yaser |last4=Haris |first4=Muhammad |last5=Oghli |first5=Omar Arab |last6=Heidari |first6=Golsa |last7=Hussein |first7=Hassan |last8=Lorenz |first8=Anna-Lena |last9=Kabenamualu |first9=Salomon |last10=Farfar |first10=Kheir Eddine |last11=Prinz |first11=Manuel |date=2023-01-11 |editor-last=Magagna |editor-first=Barbara |title=FAIR scientific information with the Open Research Knowledge Graph |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&amp;amp;doi=10.3233/FC-221513 |journal=FAIR Connect |volume=1 |issue=1 |pages=19–21 |doi=10.3233/FC-221513}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Aisopos |first=Fotis |last2=Jozashoori |first2=Samaneh |last3=Niazmand |first3=Emetis |last4=Purohit |first4=Disha |last5=Rivas |first5=Ariam |last6=Sakor |first6=Ahmad |last7=Iglesias |first7=Enrique |last8=Vogiatzis |first8=Dimitrios |last9=Menasalvas |first9=Ernestina |last10=Rodriguez Gonzalez |first10=Alejandro |last11=Vigueras |first11=Guillermo |date=2023-05-08 |editor-last=Kondylakis |editor-first=Haridimos |editor2-last=Rao |editor2-first=Praveen |editor3-last=Stefanidis |editor3-first=Kostas |editor4-last=Stefanidis |editor4-first=Kostas |editor5-last=Kondylakis |editor5-first=Haridimos |title=Knowledge graphs for enhancing transparency in health data ecosystems1 |url=https://www.medra.org/servlet/aliasResolver?alias=iospress&amp;amp;doi=10.3233/SW-223294 |journal=Semantic Web |volume=14 |issue=5 |pages=943–976 |doi=10.3233/SW-223294}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Cifuentes-Silva |first=Francisco |last2=Fernández-Álvarez |first2=Daniel |last3=Labra-Gayo |first3=Jose Emilio 
|date=2020-06-03 |title=National Budget as Linked Open Data: New Tools for Supporting the Sustainability of Public Finances |url=https://www.mdpi.com/2071-1050/12/11/4551 |journal=Sustainability |language=en |volume=12 |issue=11 |pages=4551 |doi=10.3390/su12114551 |issn=2071-1050}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Rajabi |first=Enayat |last2=Kafaie |first2=Somayeh |date=2022-09-28 |title=Knowledge Graphs and Explainable AI in Healthcare |url=https://www.mdpi.com/2078-2489/13/10/459 |journal=Information |language=en |volume=13 |issue=10 |pages=459 |doi=10.3390/info13100459 |issn=2078-2489}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Tiddi |first=Ilaria |last2=Schlobach |first2=Stefan |date=2022-01 |title=Knowledge graphs as tools for explainable machine learning: A survey |url=https://linkinghub.elsevier.com/retrieve/pii/S0004370221001788 |journal=Artificial Intelligence |language=en |volume=302 |pages=103627 |doi=10.1016/j.artint.2021.103627}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance&amp;lt;ref&amp;gt;{{Cite journal |last=Hogan |first=Aidan |last2=Arenas |first2=Marcelo |last3=Mallea |first3=Alejandro |last4=Polleres |first4=Axel |date=2014-08 |title=Everything you always wanted to know about blank nodes |url=https://linkinghub.elsevier.com/retrieve/pii/S1570826814000481 |journal=Journal of Web Semantics |language=en |volume=27-28 |pages=42–69 |doi=10.1016/j.websem.2014.06.004}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite web |last=Neumann, T.; Moerkotte, G. |title=Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins {{!}} IEEE Conference Publication {{!}} IEEE Xplore |work=Proceedings of the 2011 IEEE 27th International Conference on Data Engineering |url=https://ieeexplore.ieee.org/document/5767868/ |doi=10.1109/icde.2011.5767868 |accessdate=}}&amp;lt;/ref&amp;gt; (see also the related concept of RDF molecules&amp;lt;ref&amp;gt;{{Cite journal |last=Papastefanatos |first=George |last2=Meimaris |first2=Marios |last3=Vassiliadis |first3=Panos |date=2022-02 |title=Relational schema optimization for RDF-based knowledge graphs |url=https://linkinghub.elsevier.com/retrieve/pii/S0306437921000223 |journal=Information Systems |language=en |volume=104 |pages=101754 |doi=10.1016/j.is.2021.101754}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Collarana |first=Diego |last2=Galkin |first2=Mikhail |last3=Traverso-Ribón |first3=Ignacio |last4=Vidal |first4=Maria-Esther |last5=Lange |first5=Christoph |last6=Auer |first6=Sören |date=2017-06-19 
|title=MINTE: semantically integrating RDF graphs |url=https://dl.acm.org/doi/10.1145/3102254.3102280 |journal=Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics |language=en |publisher=ACM |place=Amantea Italy |pages=1–11 |doi=10.1145/3102254.3102280 |isbn=978-1-4503-5225-3}}&amp;lt;/ref&amp;gt;), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity.&amp;lt;ref&amp;gt;{{Cite journal |last=Vogt |first=Lars |date=2019-12 |title=Organizing phenotypic data—a semantic data model for anatomy |url=https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-019-0204-6 |journal=Journal of Biomedical Semantics |language=en |volume=10 |issue=1 |pages=12 |doi=10.1186/s13326-019-0204-6 |issn=2041-1480 |pmc=PMC6585074 |pmid=31221226}}&amp;lt;/ref&amp;gt; Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. 
The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables efficient referencing of those units within triples, facilitating the making of statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
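&lt;br /&gt;
As an illustrative sketch (not the authors' implementation), the core bookkeeping behind statement units can be mimicked in a few lines of Python: each unit bundles the triples of one proposition and receives its own identifier standing in for a UPRI, and that identifier can then itself appear in further triples. All names and class identifiers below are hypothetical:&lt;br /&gt;

```python
import uuid

# Hypothetical sketch: a statement unit bundles the triples of one
# semantically meaningful proposition and is represented by its own
# resource, which instantiates a corresponding semantic unit class.

def make_statement_unit(triples, unit_class):
    """Wrap a list of triples as a statement unit with its own identifier."""
    unit_id = f"unit:{uuid.uuid4()}"  # stand-in for a unique, persistent, resolvable ID
    # The unit instantiates its semantic unit class via its own triple.
    metadata = [(unit_id, "rdf:type", unit_class)]
    return {"id": unit_id, "triples": list(triples), "metadata": metadata}

# A weight measurement as a one-triple statement unit (names are invented).
weight = make_statement_unit(
    [("ex:appleX", "ex:hasWeightInGrams", 204.56)],
    "semunit:MetricMeasurementStatementUnit",
)

# Because the unit has its own identifier, statements about the statement
# become ordinary triples, e.g., provenance for the whole measurement:
provenance = (weight["id"], "prov:wasAttributedTo", "ex:observer1")
```

&lt;br /&gt;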
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic)&amp;lt;ref&amp;gt;{{Citation |last=Ceusters |first=Werner |date=2022 |editor-last=Elkin |editor-first=Peter L. |title=The Place of Referent Tracking in Biomedical Informatics |url=https://link.springer.com/10.1007/978-3-031-11302-4_6 |work=Terminology, Ontology and their Implementations |language=en |publisher=Springer International Publishing |place=Cham |pages=39–46 |doi=10.1007/978-3-031-11302-4_6 |isbn=978-3-031-11301-7 |accessdate=2024-06-17}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Ceusters |first=Werner |last2=Elkin |first2=Peter |last3=Smith |first3=Barry |date=2007-12 |title=Negative findings in electronic health records and biomedical ontologies: A realist approach |url=https://linkinghub.elsevier.com/retrieve/pii/S1386505607000408 |journal=International Journal of Medical Informatics |language=en |volume=76 |pages=S326–S333 |doi=10.1016/j.ijmedinf.2007.02.003 |pmc=PMC2211452 |pmid=17369081}}&amp;lt;/ref&amp;gt;, but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, for example by a Uniform Resource Identifier (URI), about which informative statements are made; it stands for something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (i.e., particulars), and classes representing universals or kinds.&lt;br /&gt;
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
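&lt;br /&gt;
The positional constraints just described can be summarized in a minimal, hypothetical Python sketch, in which prefixed identifier strings stand in for resources with UPRIs:&lt;br /&gt;

```python
# Toy check of the RDF positional constraints described above:
# Subject and Predicate must be resources, while the Object may be
# either a resource or a literal. Prefixed identifier strings
# (e.g., "ex:appleX") stand in for resources in this sketch.

def is_resource(term):
    return isinstance(term, str) and ":" in term

def is_valid_triple(s, p, o):
    literal = isinstance(o, (str, int, float, bool))
    return is_resource(s) and is_resource(p) and (is_resource(o) or literal)

print(is_valid_triple("ex:appleX", "ex:hasWeightInGrams", 204.56))  # True
print(is_valid_triple(204.56, "ex:hasWeightInGrams", "ex:appleX"))  # False: literal Subject
```

&lt;br /&gt;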
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, it is important not only that the terms employed in data and metadata statements possess identifiers from controlled vocabularies such as ontologies, ensuring terminological interoperability, but also that the semantic graph patterns underlying each statement do so. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI)&amp;lt;ref&amp;gt;{{Cite journal |last=Bandrowski |first=Anita |last2=Brinkman |first2=Ryan |last3=Brochhausen |first3=Mathias |last4=Brush |first4=Matthew H. |last5=Bug |first5=Bill |last6=Chibucos |first6=Marcus C. |last7=Clancy |first7=Kevin |last8=Courtot |first8=Mélanie |last9=Derom |first9=Dirk |last10=Dumontier |first10=Michel |last11=Fan |first11=Liju |date=2016-04-29 |editor-last=Xue |editor-first=Yu |title=The Ontology for Biomedical Investigations |url=https://dx.plos.org/10.1371/journal.pone.0154556 |journal=PLOS ONE |language=en |volume=11 |issue=4 |pages=e0154556 |doi=10.1371/journal.pone.0154556 |issn=1932-6203 |pmc=PMC4851331 |pmid=27128319}}&amp;lt;/ref&amp;gt; of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE)&amp;lt;ref&amp;gt;{{Cite journal |last=Madin |first=Joshua |last2=Bowers |first2=Shawn |last3=Schildhauer |first3=Mark |last4=Krivov |first4=Sergeui |last5=Pennington |first5=Deana |last6=Villa |first6=Ferdinando |date=2007-10 |title=An ontology for describing and synthesizing ecological observation data |url=https://linkinghub.elsevier.com/retrieve/pii/S1574954107000362 |journal=Ecological Informatics |language=en |volume=2 |issue=3 |pages=279–296 |doi=10.1016/j.ecoinf.2007.05.004}}&amp;lt;/ref&amp;gt;, an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
&lt;br /&gt;
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
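These two criteria can be sketched in plain Python, using tuples in place of RDF triples and no RDF library; the pattern identifiers and prefixed URIs below are illustrative placeholders, not actual OBI or OBOE terms:

```python
# Minimal sketch: each statement's subgraph carries the identifier of the
# graph pattern it instantiates, so graphs modelling the same kind of
# statement can be recognized as potentially interoperable.
# All identifiers (PATTERN_OBI_WEIGHT, ex:..., etc.) are hypothetical.

PATTERN_OBI_WEIGHT = "pattern:obi-weight-measurement"
PATTERN_OBOE_WEIGHT = "pattern:oboe-weight-observation"

def tag_subgraph(triples, pattern_id):
    """Bundle a statement's triples with its graph-pattern identifier."""
    return {"pattern": pattern_id, "triples": frozenset(triples)}

def potentially_interoperable(g1, g2):
    """Two subgraphs are candidates for direct integration only if they
    instantiate the same graph pattern."""
    return g1["pattern"] == g2["pattern"]

obi_graph = tag_subgraph(
    [("ex:appleX", "obi:hasQuality", "ex:weightX"),
     ("ex:weightX", "iao:isQualityMeasuredAs", "ex:datumX")],
    PATTERN_OBI_WEIGHT)
oboe_graph = tag_subgraph(
    [("ex:obs1", "oboe:ofEntity", "ex:appleX")],
    PATTERN_OBOE_WEIGHT)

print(potentially_interoperable(obi_graph, obi_graph))   # True
print(potentially_interoperable(obi_graph, oboe_graph))  # False
```

In a real knowledge graph the pattern identifier would live in the graph's metadata rather than in a Python dictionary, but the comparison logic is the same.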
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises when searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL and labeled property graphs such as Neo4j. Interacting directly with these graphs, including the CRUD operations of creating (writing), reading (searching), updating, and deleting statements, requires a query language: SPARQL&amp;lt;ref&amp;gt;{{Cite web |last=Harris, S.; Seaborne, A. |date=21 March 2013 |title=SPARQL 1.1 Query Language |url=https://www.w3.org/TR/sparql11-query/ |publisher=World Wide Web Consortium}}&amp;lt;/ref&amp;gt; for RDF/OWL, and Cypher&amp;lt;ref&amp;gt;{{Cite web |date=2024 |title=The Neo4j Operations Manual v5 |url=https://neo4j.com/docs/operations-manual/current/ |publisher=Neo4j, Inc}}&amp;lt;/ref&amp;gt; for Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, their complexity creates an entry barrier to seamless interaction with knowledge graphs.&amp;lt;ref&amp;gt;{{Cite web |last=Booth, D.; Wallace, E. |date=2019 |title=Session X: EasyRDF |work=2nd U.S. Semantic Technologies Symposium 2019 |url=https://us2ts.org/2019/posts/program-session-x.html}}&amp;lt;/ref&amp;gt; Furthermore, these query languages are not aware of the graph patterns used to model statements.&lt;br /&gt;
&lt;br /&gt;
This challenge may be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
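The idea of reusable query patterns can be sketched as a registry that maps each graph-pattern identifier to a parameterized SPARQL template, so users retrieve statements without composing SPARQL themselves. The pattern identifier and every property name inside the template are illustrative placeholders, not verified ontology terms:

```python
# Hypothetical registry of reusable query patterns keyed by graph-pattern ID.
# Filling in a concrete subject yields a ready-to-run SPARQL query string.
from string import Template

QUERY_PATTERNS = {
    "pattern:obi-weight-measurement": Template("""
        SELECT ?value ?unit WHERE {
          <$subject> obi:hasQuality ?q .
          ?q iao:isQualityMeasuredAs ?datum .
          ?datum ex:hasValue ?value ;
                 ex:hasUnit ?unit .
        }""")
}

def build_query(pattern_id, subject):
    """Instantiate the reusable query pattern for a concrete subject URI."""
    return QUERY_PATTERNS[pattern_id].substitute(subject=subject)

q = build_query("pattern:obi-weight-measurement", "https://ex.org/appleX")
print("<https://ex.org/appleX>" in q)  # True
```

Because the template is tied to a specific graph pattern, the query is guaranteed to match the shape of every statement unit that instantiates that pattern.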
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, the adoption of Named Graphs offers an alternative to RDF reification and RDF-star. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element of each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
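The Named Graph approach can be sketched with plain 4-tuples (quads): the fourth element names the subgraph, and that name can itself appear as the subject of further triples, which is exactly what a statement about a statement requires. All URIs below are illustrative:

```python
# Sketch of quads: (subject, predicate, object, graph). The graph URI G1
# identifies the measurement subgraph and also serves as the subject of a
# metadata triple stored in a separate metadata graph. URIs are hypothetical.

G1 = "ex:measurementGraph1"

quads = [
    ("ex:appleX", "ex:hasWeight", "204.56", G1),
    ("ex:appleX", "ex:weightUnit", "ex:gram", G1),
    # A statement about the whole measurement subgraph:
    (G1, "ex:wasGeneratedBy", "ex:weighingProcess1", "ex:metadataGraph"),
]

def triples_in(graph_uri, quads):
    """Return the triples belonging to one named graph."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph_uri]

print(len(triples_in(G1, quads)))              # 2
print(any(s == G1 for (s, p, o, g) in quads))  # True: the graph is itself a subject
```

No reification triples are needed: the graph name does the work that the statement resource does in RDF reification, without touching the original triples.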
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized into a semantic unit instantiating the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*, as illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032) from that subgraph. Without the context of the other triples in the subgraph, that triple is not semantically meaningful to a domain expert who has no background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
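The two-layer structure can be sketched as follows: a semantic unit's UPRI denotes both the unit resource and its data graph, and any triple that mentions the unit resource belongs to the semantic-units graph layer. The class names, property names, and URIs are illustrative placeholders:

```python
# Sketch of the data graph layer vs. the semantic-units graph layer.
# A unit's UPRI identifies the unit resource AND its data graph.
from dataclasses import dataclass

@dataclass
class SemanticUnit:
    upri: str          # shared by the unit resource and its data graph
    unit_class: str
    subject: str
    data_graph: list   # triples belonging to the data graph layer

    def semantic_units_graph(self):
        """Triples of the semantic-units graph layer for this unit."""
        return [
            (self.upri, "rdf:type", self.unit_class),
            (self.upri, "semunit:hasSemanticUnitSubject", self.subject),
        ]

unit = SemanticUnit(
    upri="ex:weightStatementUnit1",
    unit_class="semunit:WeightStatementUnit",
    subject="ex:appleX",
    data_graph=[("ex:appleX", "ex:hasWeight", "204.56")],
)

# A triple's layer follows from whether it mentions the unit resource:
layer = lambda t: "semantic-units" if unit.upri in (t[0], t[2]) else "data"
print(layer(unit.semantic_units_graph()[0]))  # semantic-units
print(layer(unit.data_graph[0]))              # data
```

Because the UPRI doubles as the data graph's name, a statement about the unit resource is simultaneously a statement about the content of its data graph, as Fig. 6 illustrates.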
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
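Short of a full SHACL or LinkML specification, the role of a statement unit pattern can be sketched as a shape-like dictionary that lists the predicates a conforming data graph must contain; every class name and predicate below is an illustrative placeholder:

```python
# Hypothetical shape specifications: each statement unit class declares the
# predicates its data graph must provide, and a validator checks a concrete
# subgraph against that specification.

SHAPES = {
    "semunit:WeightStatementUnit": {
        "required_predicates": {"ex:hasWeightValue", "ex:hasWeightUnit"},
    },
    "semunit:HasPartStatementUnit": {
        "required_predicates": {"ro:hasPart"},
    },
}

def conforms(triples, unit_class):
    """True if the subgraph provides every predicate the shape requires."""
    present = {p for (_, p, _) in triples}
    return SHAPES[unit_class]["required_predicates"] <= present

weight_graph = [
    ("ex:appleX", "ex:hasWeightValue", "204.56"),
    ("ex:appleX", "ex:hasWeightUnit", "ex:gram"),
]
print(conforms(weight_graph, "semunit:WeightStatementUnit"))   # True
print(conforms(weight_graph, "semunit:HasPartStatementUnit"))  # False
```

A real SHACL shape would additionally constrain node types, cardinalities, and datatypes; the sketch only captures the core idea of validating a statement unit against its class's pattern.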
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analog to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples for two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. This data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers for collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as *‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):&lt;br /&gt;
&lt;br /&gt;
#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.&lt;br /&gt;
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* from Figure 6, serving as the reference statement unit for this *‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’*, and five instances of *&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;*. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This pattern template is employed for graphical displays of information from the reference statement unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.&lt;br /&gt;
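The template mechanism can be sketched as a mapping from statement unit classes to label templates whose slots are filled with the labels contributed by the identification units; the class name, template wording, and label values below are all illustrative:

```python
# Hypothetical dynamic-label templates keyed by statement unit class.
# The identification units supply the human-readable labels that fill
# the template's slots.

DYNAMIC_LABELS = {
    "semunit:WeightStatementUnit": "{subject} has a weight of {value} {unit}",
}

def dynamic_label(unit_class, labels):
    """Render the human-readable label of a typed statement unit."""
    return DYNAMIC_LABELS[unit_class].format(**labels)

text = dynamic_label("semunit:WeightStatementUnit",
                     {"subject": "apple X", "value": "204.56", "unit": "gram"})
print(text)  # apple X has a weight of 204.56 gram
```

A dynamic mind-map pattern would work the same way, except that the template describes a small diagram layout rather than a sentence.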
&lt;br /&gt;
===Item unit===&lt;br /&gt;
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:hasSemanticUnitSubject&amp;lt;/u&amp;gt;*. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item'' concept in the Wikibase data model&amp;lt;ref name=&amp;quot;MWWikibase24&amp;quot;&amp;gt;{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}&amp;lt;/ref&amp;gt;, but adapt the concept to statement units rather than triples.&lt;br /&gt;
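Grouping statement units into item units amounts to bucketing them by subject. A minimal sketch, with each statement unit given as a (subject, triples) pair and all URIs illustrative:

```python
# Sketch of item units: statement units that share a subject are collected
# into one item unit per entity.
from collections import defaultdict

def group_into_item_units(statement_units):
    """Map each subject to the statement units that describe it."""
    items = defaultdict(list)
    for subject, triples in statement_units:
        items[subject].append(triples)
    return dict(items)

units = [
    ("ex:appleX", [("ex:appleX", "ex:hasWeight", "204.56")]),
    ("ex:appleX", [("ex:appleX", "rdf:type", "ncit:Apple")]),
    ("ex:handX",  [("ex:handX", "ro:hasPart", "ex:thumbX")]),
]
items = group_into_item_units(units)
print(len(items["ex:appleX"]))  # 2
```

Each resulting bucket corresponds to one item unit, whose subject is the shared subject resource.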
&lt;br /&gt;
===Item group unit===&lt;br /&gt;
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.&lt;br /&gt;
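The connectivity requirement can be verified with a breadth-first walk over shared resources in the merged subgraph. A sketch with triples as plain tuples (URIs illustrative):

```python
# Sketch of the connectivity check for an item group unit: merging the item
# units' subgraphs must yield a connected graph.
from collections import deque

def is_connected(triples):
    """True if the graph formed by the triples is connected."""
    if not triples:
        return True
    adjacency = {}
    for s, _, o in triples:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    seen, queue = set(), deque([next(iter(adjacency))])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(adjacency[node] - seen)
    return seen == set(adjacency)

linked = [("ex:appleX", "ex:partOf", "ex:basketY"),
          ("ex:basketY", "ex:locatedIn", "ex:kitchenZ")]
unlinked = linked + [("ex:carA", "ex:hasColor", "ex:red")]
print(is_connected(linked))    # True
print(is_connected(unlinked))  # False
```

A set of item units whose merged subgraph fails this check does not qualify as a single item group unit.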
&lt;br /&gt;
===Granularity tree unit===&lt;br /&gt;
We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''&amp;lt;u&amp;gt;before&amp;lt;/u&amp;gt;'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]&lt;br /&gt;
&lt;br /&gt;
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.&lt;br /&gt;
&lt;br /&gt;
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the subject of those statement units whose objects recur as subjects of other statement units in the same granularity tree unit, while never itself occurring as one of their objects (i.e., the root of the tree).&lt;br /&gt;
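Identifying the root resource of a granularity tree reduces to finding the resource that appears as a subject but never as an object. A sketch over a parthood hierarchy (URIs illustrative):

```python
# Sketch: the subject of a granularity tree unit is the unique resource that
# never occurs in object position within the tree's statement units.

def tree_root(triples):
    """Return the unique resource that is never an object, i.e., the root."""
    subjects = {s for (s, _, _) in triples}
    objects = {o for (_, _, o) in triples}
    roots = subjects - objects
    if len(roots) != 1:
        raise ValueError("not a single-rooted granularity tree")
    return roots.pop()

parthood = [
    ("ex:handX", "ro:hasPart", "ex:thumbX"),
    ("ex:handX", "ro:hasPart", "ex:indexFingerX"),
    ("ex:thumbX", "ro:hasPart", "ex:thumbNailX"),
]
print(tree_root(parthood))  # ex:handX
```

The same computation applied to class-subclass or before relations would recover the top class of a taxonomy or the first step of a process specification, respectively.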
&lt;br /&gt;
===Granular item group unit===&lt;br /&gt;
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.&lt;br /&gt;
&lt;br /&gt;
===Context unit===&lt;br /&gt;
The ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement *‘&amp;lt;u&amp;gt;description item unit&amp;lt;/u&amp;gt;’* ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' ‘&amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt;’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).&lt;br /&gt;
&lt;br /&gt;
===Dataset unit===&lt;br /&gt;
A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution in a collaborative project, document the state of a particular object at a given time, or store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.&lt;br /&gt;
&lt;br /&gt;
===List unit===&lt;br /&gt;
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. A set unit, in turn, is an unordered list unit in which each resource is listed only once, adhering to a uniqueness restriction.&lt;br /&gt;
&lt;br /&gt;
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a *&amp;lt;u&amp;gt;SEMUNIT:''child''&amp;lt;/u&amp;gt;* relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''&amp;lt;u&amp;gt;index&amp;lt;/u&amp;gt;'' (RDF:index).&lt;br /&gt;
&lt;br /&gt;
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.&lt;br /&gt;
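The membership machinery of list units can be sketched with (index, member) pairs standing in for indexed membership statement units; the list content and URIs are illustrative:

```python
# Sketch of list units: ordered list units carry an index per membership
# statement unit; a set unit additionally enforces member uniqueness.

def ordered_members(membership_units):
    """Return the members of an ordered list unit sorted by index."""
    return [m for _, m in sorted(membership_units)]

def is_set_unit(membership_units):
    """True if no member occurs more than once (uniqueness restriction)."""
    members = [m for _, m in membership_units]
    return len(members) == len(set(members))

authors = [(2, "ex:personB"), (1, "ex:personA"), (3, "ex:personC")]
print(ordered_members(authors))  # ['ex:personA', 'ex:personB', 'ex:personC']
print(is_set_unit(authors))      # True
```

A cardinality restriction would simply bound the number of membership statement units, turning the list into a closed collection under a localized closed-world assumption.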
&lt;br /&gt;
==Discussion==&lt;br /&gt;
===Benefits of organizing a knowledge graph into semantic units===&lt;br /&gt;
====Semantic units enhance data management flexibility through modularity====&lt;br /&gt;
The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.&lt;br /&gt;
&lt;br /&gt;
====Semantic units operate at a higher level of abstraction than individual triples====&lt;br /&gt;
Semantically, semantic units encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, all while maintaining the centrality of the statement unit itself and its triples in a single location within the graph.&lt;br /&gt;
&lt;br /&gt;
The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches to graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research.&lt;br /&gt;
&lt;br /&gt;
===Semantic units as a framework for knowledge graph alignment===&lt;br /&gt;
The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.&lt;br /&gt;
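The stepwise comparison can be sketched in Python as follows; the dictionary structures and the Jaccard-style overlap score are illustrative assumptions, not an algorithm specified in the text.&lt;br /&gt;

```python
# Sketch of one step of the alignment idea: two item units are candidate
# matches when their subjects share a type, scored here by the overlap of
# their statement unit classes. All names are hypothetical.

def statement_classes(item_unit):
    """Collect the classes of an item unit's statement units."""
    return {su["class"] for su in item_unit["statement_units"]}

def align_item_units(a, b):
    """Coarse alignment score: 0.0 for differing subject types, otherwise
    the Jaccard overlap of statement unit classes."""
    if a["subject_type"] != b["subject_type"]:
        return 0.0
    ca, cb = statement_classes(a), statement_classes(b)
    union = ca.union(cb)
    if not union:
        return 1.0
    return len(ca.intersection(cb)) / len(union)

u1 = {"subject_type": "UBERON:0000468",
      "statement_units": [{"class": "has_part"}, {"class": "weight"}]}
u2 = {"subject_type": "UBERON:0000468",
      "statement_units": [{"class": "has_part"}]}
print(align_item_units(u1, u2))  # 0.5
```

Aligned statement units could then be compared triple by triple in a final step, as described above.&lt;br /&gt;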
&lt;br /&gt;
===Managing restricted access to sensitive data===&lt;br /&gt;
The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.&lt;br /&gt;
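A minimal sketch of such class-based filtering, with hypothetical statement unit class names and a simple clearance set standing in for a real access-control policy:&lt;br /&gt;

```python
# Illustrative sketch: filter out statement units whose class is flagged
# as sensitive before returning a subgraph to a user. The class names and
# policy model are assumptions for demonstration only.

SENSITIVE_CLASSES = {"patient_identifier_statement_unit",
                     "diagnosis_statement_unit"}

def visible_statement_units(statement_units, user_clearances):
    """Return only those statement units the user may see."""
    allowed = []
    for su in statement_units:
        cls = su["class"]
        if cls in SENSITIVE_CLASSES and cls not in user_clearances:
            continue  # access restricted for this class
        allowed.append(su)
    return allowed

units = [
    {"upri": "ex:su1", "class": "weight_statement_unit"},
    {"upri": "ex:su2", "class": "diagnosis_statement_unit"},
]
print([su["upri"] for su in visible_statement_units(units, set())])
# ['ex:su1']
```

Because restrictions attach to statement unit classes rather than to individual triples, the policy can be enforced dynamically as the graph grows.&lt;br /&gt;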
&lt;br /&gt;
===Semantic units: A framework for nested and overlapping knowledge graph modules===&lt;br /&gt;
====Semantic units identify five levels of representational granularity====&lt;br /&gt;
Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig10 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.&lt;br /&gt;
&lt;br /&gt;
====Semantic units identify granularity trees====&lt;br /&gt;
Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]&lt;br /&gt;
&lt;br /&gt;
===Semantic units identify context-dependent subgraphs===&lt;br /&gt;
Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.&lt;br /&gt;
&lt;br /&gt;
====Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units====&lt;br /&gt;
The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
Moreover, if a knowledge graph is to cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model these without a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘''author_A -asserts-&amp;gt; statement_unit_Y''’; ‘''statement_unit_X -hasMetadata-&amp;gt; statement_unit_Z''’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.'' [61]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig12 Vogt JofBiomedSem24 15.png|1300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of &amp;lt;u&amp;gt;journal article&amp;lt;/u&amp;gt; (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *&amp;lt;u&amp;gt;SEMUNIT:''hasLinkedSemanticUnit''&amp;lt;/u&amp;gt;* property. The interconnected hierarchy extends to an &amp;lt;u&amp;gt;investigation&amp;lt;/u&amp;gt; (OBI:0000066) instance, resulting in a &amp;lt;u&amp;gt;data set&amp;lt;/u&amp;gt; (IAO:0000100) instance with a &amp;lt;u&amp;gt;description&amp;lt;/u&amp;gt; (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of &amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt; (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Implementation===&lt;br /&gt;
====Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications===&lt;br /&gt;
To initiate the structuring of a knowledge graph into semantic units, first, a layer of abstraction beyond the triple level must be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to one data graph of a statement unit. In RDF/OWL, statement units can be conceptualized like nanopublications.&lt;br /&gt;
&lt;br /&gt;
Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format with formal semantics, and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:&lt;br /&gt;
&lt;br /&gt;
#the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;&lt;br /&gt;
#the assertion Named Graph, containing the assertion modelled as a graph;&lt;br /&gt;
#the provenance Named Graph, containing metadata about the assertion; and&lt;br /&gt;
#the publicationInfo Named Graph, containing metadata about the nanopublication itself.&lt;br /&gt;
&lt;br /&gt;
The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus to other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).&lt;br /&gt;
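This mapping of a statement unit onto the four Named Graphs can be sketched as follows; Python dictionaries stand in for actual Named Graphs, and the URIs, predicates, and triples are placeholders rather than prescribed vocabulary.&lt;br /&gt;

```python
# Sketch of the four-Named-Graph layout of a nanopublication used to
# implement a statement unit. All identifiers are illustrative.

def statement_unit_as_nanopub(np_uri, su_upri, assertion_triples):
    """Package a statement unit's data graph as a nanopublication."""
    return {
        # head: connects the other three graphs to the nanopub's
        # identifier and carries the statement unit's semantic-units graph
        "head": [
            (np_uri, "np:hasAssertion", np_uri + "#assertion"),
            (np_uri, "np:hasProvenance", np_uri + "#provenance"),
            (np_uri, "np:hasPublicationInfo", np_uri + "#pubinfo"),
            (su_upri, "rdf:type", "SEMUNIT:statement_unit"),
        ],
        # assertion: the statement unit's data graph
        "assertion": assertion_triples,
        # provenance: metadata about the assertion
        "provenance": [(np_uri + "#assertion",
                        "prov:wasDerivedFrom", "ex:experimentX")],
        # publicationInfo: metadata about the nanopublication itself
        "publicationInfo": [(np_uri, "dct:created", "2024-01-01")],
    }

np_ = statement_unit_as_nanopub(
    "ex:np1", "ex:su1",
    [("ex:lead", "ex:hasMeltingPoint", "327.5 °C")])
print(sorted(np_))  # ['assertion', 'head', 'provenance', 'publicationInfo']
```

A compound unit would follow the same schema with an empty assertion graph, as described below.&lt;br /&gt;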
&lt;br /&gt;
A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.&lt;br /&gt;
&lt;br /&gt;
====Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs====&lt;br /&gt;
In Neo4j, a labeled property graph, the assignment of UPRIs to all nodes and relations through a ‘''UPRI:upri''’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘''statement_unit_UPRI:upri''’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘''compound_unit_UPRI:upri''’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘''compound_unit_UPRI''’ property can incorporate an array of UPRIs representing different semantic units.&lt;br /&gt;
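A rough sketch of this tagging convention follows; Python dictionaries stand in for Neo4j nodes and relations, the property names follow the text, and everything else is illustrative.&lt;br /&gt;

```python
# Sketch: tag a node or relation with the UPRI of its statement unit and
# an array of compound unit UPRIs, mirroring the property-value pairs
# described for a Neo4j implementation. Example UPRIs are hypothetical.

def tag_entity(node_or_rel, statement_upri, compound_upris):
    """Record statement and compound unit membership on a graph entity."""
    node_or_rel["statement_unit_UPRI"] = statement_upri
    # a statement unit may belong to several compound units, so this
    # property holds an array of UPRIs rather than a single value
    node_or_rel.setdefault("compound_unit_UPRI", [])
    for upri in compound_upris:
        if upri not in node_or_rel["compound_unit_UPRI"]:
            node_or_rel["compound_unit_UPRI"].append(upri)

node = {"UPRI": "ex:organism1", "label": "multicellular organism"}
tag_entity(node, "ex:su1", ["ex:item1", "ex:group1"])
tag_entity(node, "ex:su1", ["ex:granTree1"])
print(node["compound_unit_UPRI"])  # ['ex:item1', 'ex:group1', 'ex:granTree1']
```

In an actual deployment these property-value pairs would be written via Cypher, but the bookkeeping logic is the same.&lt;br /&gt;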
&lt;br /&gt;
An initial software application, developed by one of the authors for demonstration purposes, illustrates how semantic units can be used to manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application employs the organization of the graph into semantic units within a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig13 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Strategies for implementation====&lt;br /&gt;
Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:&lt;br /&gt;
&lt;br /&gt;
#'''Develop from scratch''': In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.&lt;br /&gt;
#'''Transfer an existing knowledge graph''': If there is an existing knowledge graph that needs restructuring into semantic units, the next step is crafting queries to transfer all triples into corresponding statement units, based on the graph schemata identified in the first step. The main challenge is maintaining disjointness of triples between statement units.&lt;br /&gt;
#'''A hybrid approach''': For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.&lt;br /&gt;
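The grouping rule named in the first strategy (statement units sharing the same subject resource form an item unit) can be sketched as follows; the data structures and generated item unit UPRIs are hypothetical.&lt;br /&gt;

```python
# Sketch: group statement units into item units by shared subject
# resource, one item unit per subject. Names are illustrative only.
from collections import defaultdict
from itertools import count

def group_into_item_units(statement_units):
    """Map each subject resource to an item unit that collects the UPRIs
    of all statement units about that subject."""
    buckets = defaultdict(list)
    for su in statement_units:
        buckets[su["subject"]].append(su["upri"])
    ids = count(1)
    return {subject: {"upri": "ex:item" + str(next(ids)),
                      "statement_unit_upris": upris}
            for subject, upris in buckets.items()}

sus = [
    {"upri": "ex:su1", "subject": "ex:organism1"},
    {"upri": "ex:su2", "subject": "ex:organism1"},
    {"upri": "ex:su3", "subject": "ex:head1"},
]
items = group_into_item_units(sus)
print(len(items))  # 2
```

Analogous rules, contingent on the compound unit type, would organize item units into item group units and granularity tree units.&lt;br /&gt;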
&lt;br /&gt;
====Semantic units as FAIR Digital Objects====&lt;br /&gt;
The concept of FAIR Digital Objects, as proposed by the European Commission Expert Group on FAIR Data, stands at the core of achieving the FAIR Principles [67], emphasizing persistent identifiers, comprehensive metadata, and contextual documentation for reliable discovery, citation, and reuse. The concept of semantic units aligns with that of FAIR Digital Objects. Each semantic unit inherently possesses a UPRI, serving as a ready-made persistent identifier. Accessibility and searchability are ensured through established query languages like SPARQL and Cypher, with RDF, JSON, and other formats supporting data export. When knowledge graphs adhere to controlled vocabularies and ontologies, and when they employ standard graph-patterns using tools like SHACL [51], ShEx [68, 69], or OTTR [70, 71], the data within the data graphs of semantic units may more easily achieve semantic interoperability.&lt;br /&gt;
&lt;br /&gt;
Moreover, semantic units can provide provenance—crucial for tracking a semantic unit’s history—by utilizing property-value pairs for labeled property knowledge graphs or a designated provenance Named Graph for RDF/OWL knowledge graphs. The provenance metadata of a semantic unit encompasses details like the creator, creation date, application used, title, contributing users, and last update, focusing solely on the semantic unit itself, not the original data production process.&lt;br /&gt;
&lt;br /&gt;
Access control metadata can specify any licenses as well as access control restrictions.&lt;br /&gt;
&lt;br /&gt;
==Conclusion and future work==&lt;br /&gt;
In conclusion, the adoption of semantic units in structuring knowledge graphs may help address the knowledge representation challenges mentioned in the introduction. By encapsulating each statement within its dedicated statement unit, accompanied by a corresponding statement unit class and data schema (e.g., as a SHACL shape), a robust foundation for FAIR data and metadata is established, supporting schematic interoperability. Because statement units partition the knowledge graph so that every triple belongs to exactly one statement unit and every statement unit’s subgraph is identifiable and referenceable through its UPRI, data in a knowledge graph is linked to graph patterns, which are identifiable as a whole. By giving each schema its own UPRI, each semantic unit can specify its underlying schema in its metadata. Identifying semantically interoperable semantic units is then straightforward, and schema crosswalks between different schemata can increase schematic interoperability. [72] (This addresses Challenge 1.)&lt;br /&gt;
&lt;br /&gt;
Graph query languages can operate on these graph patterns (semantic units) and therefore allow access to knowledge graph content at higher levels of abstraction than basic triples. (This addresses Challenge 2.) Further, we have shown how semantic units can organize knowledge graphs into different layers and make statements about statements. (This addresses Challenge 3.)&lt;br /&gt;
&lt;br /&gt;
Future research involves extending the semantic units approach to incorporate question units and a nuanced categorization of assertional, contingent, prototypical, and universal statement units. This extension will encompass formal semantics for the latter, including provisions for negations and cardinality restrictions. Additionally, we are exploring novel approaches to knowledge graph exploration based on semantic units.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
&lt;br /&gt;
*'''BFO''': Basic Formal Ontology&lt;br /&gt;
*'''CRUD''': Create, Read, Update, Delete&lt;br /&gt;
*'''FAIR''': Findable, Accessible, Interoperable, and Reusable&lt;br /&gt;
*'''HTTP''': Hypertext Transfer Protocol&lt;br /&gt;
*'''HTTPS''': Hypertext Transfer Protocol Secure&lt;br /&gt;
*'''IAO''': Information Artifact Ontology&lt;br /&gt;
*'''ID''': Identifier&lt;br /&gt;
*'''JSON''': JavaScript Object Notation&lt;br /&gt;
*'''LinkML''': Linked Data Modeling Language&lt;br /&gt;
*'''NCIT''': National Cancer Institute Thesaurus&lt;br /&gt;
*'''NoSQL''': Not only Structured Query Language&lt;br /&gt;
*'''OBI''': Ontology for Biomedical Investigations&lt;br /&gt;
*'''OBOE''': Extensible Observation Ontology&lt;br /&gt;
*'''OBO Foundry''': Open Biological and Biomedical Ontology Foundry&lt;br /&gt;
*'''OTTR''': Reasonable Ontology Templates&lt;br /&gt;
*'''OWL''': Web Ontology Language&lt;br /&gt;
*'''PATO''': Phenotype and Trait Ontology&lt;br /&gt;
*'''RDF''': Resource Description Framework&lt;br /&gt;
*'''RDFS''': RDF-Schema&lt;br /&gt;
*'''RO''': OBO Relations Ontology&lt;br /&gt;
*'''SHACL''': Shape Constraint Language&lt;br /&gt;
*'''ShEx''': Shape Expression&lt;br /&gt;
*'''SIO''': Semanticscience Integrated Ontology&lt;br /&gt;
*'''SPARQL''': SPARQL Protocol and RDF Query Language&lt;br /&gt;
*'''TI''': Time Ontology in OWL&lt;br /&gt;
*'''TRUST''': Transparency, Responsibility, User Focus, Sustainability, and Technology&lt;br /&gt;
*'''UBERON''': Uber-anatomy ontology&lt;br /&gt;
*'''UO''': Units of Measurement Ontology&lt;br /&gt;
*'''UPRI''': Unique Persistent and Resolvable Identifier&lt;br /&gt;
*'''XSD''': Extensible Markup Language Schema Definition&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==Acknowledgements==&lt;br /&gt;
We thank Werner Ceusters, Nico Matentzoglu, Manuel Prinz, Marcel Konrad, Philip Strömert, Roman Baum, Björn Quast, Peter Grobe, István Míko, Manfred Jeusfeld, Manolis Koubarakis, Javad Chamanara, and Kheir Eddine for discussing some of the presented ideas. We also thank the anonymous reviewers for their suggestions and feedback. We are solely responsible for all the arguments and statements in this paper.&lt;br /&gt;
&lt;br /&gt;
===Author contributions===&lt;br /&gt;
L.V. developed the concept of semantic units and wrote the initial manuscript text. All authors reviewed and revised the manuscript.&lt;br /&gt;
&lt;br /&gt;
===Funding===&lt;br /&gt;
Open Access funding enabled and organized by Projekt DEAL. Lars Vogt received funding by the ERC H2020 Project ‘ScienceGraph’ (819536).&lt;br /&gt;
&lt;br /&gt;
===Conflict of interest===&lt;br /&gt;
The authors declare no competing interests.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64488</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64488"/>
		<updated>2024-06-17T19:21:04Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way for modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support not only standard relational operators, such as joins, unions, and projections, but also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22]&lt;br /&gt;
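&lt;br /&gt;
As a plain-Python sketch (all URIs below are hypothetical placeholders), such a navigational operator can be emulated by recursively following edges of a single predicate, analogous to a SPARQL 1.1 property path such as ''rdfs:subClassOf*'':&lt;br /&gt;

```python
# Triples are modelled as plain (subject, predicate, object) tuples;
# the URIs are hypothetical placeholders, not taken from any ontology.
triples = {
    ("ex:GrannySmith", "rdfs:subClassOf", "ex:Apple"),
    ("ex:Apple", "rdfs:subClassOf", "ex:Fruit"),
    ("ex:Fruit", "rdfs:subClassOf", "ex:PlantProduct"),
}

def reachable(start, predicate, graph):
    """Return all nodes reachable from `start` via zero or more edges
    labeled `predicate` (the analogue of a recursive path query)."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in graph:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen

superclasses = reachable("ex:GrannySmith", "rdfs:subClassOf", triples)
```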
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables referencing them within triples, facilitating an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but the supporting tools and applications needed to turn them into well-supported, scalable, and easily usable knowledge graph applications are missing or not widely used.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, for example by a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness in empirical data and metadata in a knowledge graph, it is important not only for the terms employed in data and metadata statements to possess identifiers from controlled vocabularies, such as ontologies, ensuring terminological interoperability, but also for the semantic graph patterns underlying each statement to be shared and identifiable. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
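&lt;br /&gt;
This approach can be sketched in plain Python (all URIs and the graph-pattern identifier are hypothetical placeholders): a graph's metadata carries the identifier of the graph pattern applied to it, and graphs declaring the same identifier are candidates for direct interoperation:&lt;br /&gt;

```python
# A weight measurement statement as a set of (s, p, o) tuples;
# all URIs are hypothetical placeholders.
weight_measurement = {
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:isQualityMeasuredAs", "ex:datumX"),
    ("ex:datumX", "ex:hasValue", "204.56"),
    ("ex:datumX", "ex:hasUnit", "ex:gram"),
}

# Each graph's metadata references the identifier of the graph pattern
# that was applied when modelling the statement:
metadata_a = {"graph": "ex:graph-001", "pattern": "ex:pattern/weight-measurement"}
metadata_b = {"graph": "ex:graph-002", "pattern": "ex:pattern/weight-measurement"}
metadata_c = {"graph": "ex:graph-003", "pattern": "ex:pattern/oboe-observation"}

def potentially_interoperable(meta_x, meta_y):
    """Graphs sharing a graph-pattern identifier are candidates for direct
    interoperation; differing identifiers signal a possible schematic
    interoperability conflict (cf. the OBI vs. OBOE modellings above)."""
    return meta_x["pattern"] == meta_y["pattern"]
```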
&lt;br /&gt;
Practically implementing these principles entails meeting two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL or labeled property graphs like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, necessitates the utilization of a query language. SPARQL [37] is an example for RDF/OWL, while Cypher [38] is employed for Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.&lt;br /&gt;
&lt;br /&gt;
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
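&lt;br /&gt;
One possible shape of such a mechanism, sketched in plain Python under the assumption of a hypothetical pattern registry (the pattern identifier and the SPARQL body are illustrative placeholders, not part of any standard):&lt;br /&gt;

```python
# Hypothetical registry linking a reusable query template to the graph
# pattern it matches; the SPARQL text is an illustrative placeholder.
QUERY_PATTERNS = {
    "ex:pattern/weight-measurement": (
        "SELECT ?value ?unit WHERE {{\n"
        "  <{subject}> ex:hasQuality ?q .\n"
        "  ?q ex:isQualityMeasuredAs ?d .\n"
        "  ?d ex:hasValue ?value ; ex:hasUnit ?unit .\n"
        "}}"
    ),
}

def build_query(pattern_id, subject):
    """Instantiate the stored query template for a concrete subject
    resource, so a user never has to author SPARQL by hand."""
    return QUERY_PATTERNS[pattern_id].format(subject=subject)

query = build_query("ex:pattern/weight-measurement", "ex:appleX")
```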
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, the adoption of Named Graphs is an alternative to RDF reification or RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
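&lt;br /&gt;
The quad mechanism can be illustrated with a plain-Python sketch (all URIs are hypothetical placeholders):&lt;br /&gt;

```python
# The Named Graph's URI (a hypothetical placeholder) becomes the fourth
# element of each of its triples, turning them into quads:
named_graph = "ex:unit/weight-statement-1"

triples = [
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
]

quads = [(s, p, o, named_graph) for s, p, o in triples]

# Because the Named Graph is itself a resource, it can occupy the Subject
# position of a new triple, i.e., a statement about the statement:
about_the_statement = (named_graph, "ex:wasMeasuredBy", "ex:personA")
```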
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* as it is illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032), which is a single triple from that subgraph. That triple, without the context of the other triples in the subgraph, lacks semantic meaningfulness for a domain expert who has no background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
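&lt;br /&gt;
A minimal plain-Python sketch of this two-layer arrangement (resource names are hypothetical placeholders; the property and class names merely follow the paper's SEMUNIT naming style):&lt;br /&gt;

```python
# One UPRI (a placeholder here) names both the semantic unit's resource
# and that unit's data graph:
upri = "ex:unit/weight-1"

data_graph = {  # data graph layer
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
}

semantic_units_graph = {  # semantic-units graph layer
    (upri, "rdf:type", "semunit:weight statement unit"),
    (upri, "semunit:hasSemanticUnitSubject", "ex:appleX"),
    ("ex:personA", "ex:asserted", upri),  # a statement about the unit's content
}

def layer_of(triple, unit_upris=frozenset({upri})):
    """A triple mentioning a semantic unit's UPRI in Subject or Object
    position belongs to the semantic-units graph layer."""
    return "semantic-units" if {triple[0], triple[2]} & unit_upris else "data"
```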
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
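&lt;br /&gt;
The partition property can be expressed as a small plain-Python check (URIs are hypothetical placeholders):&lt;br /&gt;

```python
# Hypothetical statement units and their subgraphs of (s, p, o) tuples:
statement_units = {
    "ex:unit/haspart-1": {("ex:handX", "ex:hasPart", "ex:thumbX")},
    "ex:unit/weight-1": {("ex:appleX", "ex:hasQuality", "ex:weightX"),
                         ("ex:weightX", "ex:hasValue", "204.56")},
}

data_graph_layer = set().union(*statement_units.values())

def is_partition(units, data_graph):
    """True if the units' subgraphs are pairwise disjoint (each triple in
    exactly one unit) and their union reproduces the data graph layer."""
    all_triples = [t for subgraph in units.values() for t in subgraph]
    no_overlap = len(all_triples) == len(set(all_triples))
    return no_overlap and set(all_triples) == data_graph
```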
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
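&lt;br /&gt;
The contrast between binary and ''n''-ary propositions can be sketched in plain Python (all resource names are hypothetical placeholders):&lt;br /&gt;

```python
# A binary relation fits into a single triple:
binary_statement = [("ex:rightHandX", "ex:hasPart", "ex:rightThumbX")]

# A ternary relation ("b located_in c at t") needs an instance resource
# ("ex:locatedInX", a placeholder) standing in for the statement's
# predicate, plus triples attaching the participants to it:
ternary_statement = [
    ("ex:b", "ex:participatesIn", "ex:locatedInX"),
    ("ex:locatedInX", "rdf:type", "ex:LocatedInRelation"),
    ("ex:locatedInX", "ex:hasLocation", "ex:c"),
    ("ex:locatedInX", "ex:atTime", "ex:t"),
]
```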
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
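&lt;br /&gt;
A deliberately simplified, plain-Python stand-in for such a shape check (the predicate sets are hypothetical placeholders, not actual SHACL or LinkML specifications):&lt;br /&gt;

```python
# Hypothetical predicate sets standing in for full shape specifications
# of two statement unit classes:
PATTERNS = {
    "semunit:weight statement unit": {"ex:hasQuality", "ex:hasValue", "ex:hasUnit"},
    "semunit:has-part statement unit": {"ex:hasPart"},
}

def conforms(data_graph, unit_class):
    """A data graph can instantiate a statement unit class only if it
    exhibits every predicate the class's pattern requires."""
    predicates = {p for _, p, _ in data_graph}
    return PATTERNS[unit_class] <= predicates

weight_graph = {
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
    ("ex:weightX", "ex:hasUnit", "ex:gram"),
}
```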
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analog to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples for two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. This data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
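To make the structure of the two identification unit subtypes concrete, the following sketch models them in plain Python, with tuples standing in for RDF triples. All UPRIs, property abbreviations (e.g., DCTERMS:identifier), and function names are illustrative assumptions, not part of the semantic units specification.

```python
# Sketch: identification units as lists of (subject, property, object)
# tuples. The helper names and prefixes are illustrative only.

def named_individual_identification_unit(subject_upri, label, class_upri):
    """Triples identifying a resource as a named individual."""
    return [
        (subject_upri, "RDF:type", "OWL:NamedIndividual"),
        (subject_upri, "RDF:type", class_upri),   # class membership
        (subject_upri, "RDFS:label", label),      # human-readable label
    ]

def class_identification_unit(class_upri, label, identifier,
                              ontology_uri=None, ontology_version_uri=None):
    """Triples identifying a resource as a class, optionally recording
    the ontology (and version) from which the class term was imported."""
    triples = [
        (class_upri, "RDF:type", "OWL:Class"),
        (class_upri, "RDFS:label", label),
        (class_upri, "DCTERMS:identifier", identifier),
    ]
    if ontology_uri:
        triples.append((class_upri, "RDFS:isDefinedBy", ontology_uri))
    if ontology_version_uri:
        triples.append((class_upri, "OWL:versionIRI", ontology_version_uri))
    return triples

apple_x = named_individual_identification_unit(
    "ex:apple_X", "apple X", "NCIT:C71985")
apple_cls = class_identification_unit(
    "NCIT:C71985", "apple", "NCIT:C71985",
    ontology_uri="http://purl.obolibrary.org/obo/ncit.owl")
```

These labels are exactly what a display layer would draw on when rendering a typed statement unit for a human reader.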
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers for collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property ''&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;'' (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as ‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
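The semantic-units graph of a compound unit, as described above, can be sketched as a UPRI plus links to its associated semantic units via ''SEMUNIT:hasAssociatedSemanticUnit'' (the property name comes from the text; the data structures and example UPRIs are illustrative).

```python
# Sketch: the semantic-units graph triples of a compound unit, modeled
# as (subject, property, object) tuples. Names are illustrative.

def compound_unit(upri, unit_class, associated_upris):
    """Return the semantic-units graph triples of a compound unit."""
    triples = [(upri, "RDF:type", unit_class)]
    for member in associated_upris:
        triples.append((upri, "SEMUNIT:hasAssociatedSemanticUnit", member))
    return triples

item = compound_unit(
    "ex:apple_X_item_unit", "SEMUNIT:ItemUnit",
    ["ex:weight_statement_unit", "ex:color_statement_unit"])
```

Merging the data graphs of the two referenced statement units would then yield the compound unit's (indirect) data graph, as shown in Fig. 8.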
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):&lt;br /&gt;
&lt;br /&gt;
#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.&lt;br /&gt;
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the ‘&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’ from Figure 6, serving as the reference statement unit for this ‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’, and five instances of ''&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;''. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (''&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;''). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (''&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;''). This pattern template is employed for graphical displays of information from the reference statement unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.&lt;br /&gt;
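As a rough illustration of how such a display pattern might operate, the sketch below fills a dynamic label template associated with a statement unit class, using labels that in a real system would be supplied by the identification units of a typed statement unit. The template text, class name, and labels are invented for the example.

```python
# Sketch: a dynamic label as a string template attached to a statement
# unit class, filled with labels from identification units (illustrative).
from string import Template

# Template for a hypothetical SEMUNIT:WeightStatementUnit class.
DYNAMIC_LABELS = {
    "SEMUNIT:WeightStatementUnit":
        Template("$subject has a weight of $value $unit"),
}

def render_dynamic_label(unit_class, labels):
    """Generate the human-readable label of a typed statement unit."""
    return DYNAMIC_LABELS[unit_class].substitute(labels)

text = render_dynamic_label(
    "SEMUNIT:WeightStatementUnit",
    {"subject": "apple X", "value": "212.45", "unit": "gram"})
```

A dynamic mind-map pattern would work analogously, except that the template would describe a graph layout rather than a sentence.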
&lt;br /&gt;
===Item unit===&lt;br /&gt;
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property ''&amp;lt;u&amp;gt;SEMUNIT:hasSemanticUnitSubject&amp;lt;/u&amp;gt;''. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item'' concept in the Wikibase data model&amp;lt;ref name=&amp;quot;MWWikibase24&amp;quot;&amp;gt;{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}&amp;lt;/ref&amp;gt;, but adapt the concept to statement units rather than triples.&lt;br /&gt;
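Deriving item units therefore amounts to grouping statement units by their shared subject, in line with the graph-per-resource pattern. The sketch below does this in plain Python, reducing each statement unit to a (UPRI, subject) pair; this simplification and all names are illustrative.

```python
# Sketch: building item units by grouping statement units on their
# common subject resource. Statement units are simplified to
# (unit_upri, subject_upri) pairs for illustration.
from collections import defaultdict

def build_item_units(statement_units):
    """Map each subject resource to the statement units about it."""
    items = defaultdict(list)
    for unit_upri, subject in statement_units:
        items[subject].append(unit_upri)
    return dict(items)

items = build_item_units([
    ("ex:weight_su", "ex:apple_X"),
    ("ex:color_su", "ex:apple_X"),
    ("ex:location_su", "ex:basket_1"),
])
```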
&lt;br /&gt;
===Item group unit===&lt;br /&gt;
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.&lt;br /&gt;
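The connectedness requirement above can be checked mechanically: merge the item units' subgraphs and verify that the linking statements join every item-unit subject into one component. The sketch below reduces item units to their subjects and linking statement units to (subject, object) pairs; the simplification and names are illustrative.

```python
# Sketch: verifying that the merged subgraphs of an item group unit
# form a connected graph, via breadth-first search (illustrative model).

def is_connected_item_group(subjects, links):
    """True if the item-unit subjects are connected by linking statements."""
    adjacency = {s: set() for s in subjects}
    for subj, obj in links:
        if subj in adjacency and obj in adjacency:
            adjacency[subj].add(obj)
            adjacency[obj].add(subj)
    # traverse from an arbitrary subject and see what is reachable
    seen, frontier = set(), [next(iter(subjects))]
    while frontier:
        node = frontier.pop()
        if node not in seen:
            seen.add(node)
            frontier.extend(adjacency[node] - seen)
    return seen == set(subjects)

# apple X is located in basket 1, so the two item units form one group
connected = is_connected_item_group(
    {"ex:apple_X", "ex:basket_1"},
    [("ex:apple_X", "ex:basket_1")])
```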
&lt;br /&gt;
===Granularity tree unit===&lt;br /&gt;
We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''&amp;lt;u&amp;gt;before&amp;lt;/u&amp;gt;'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]&lt;br /&gt;
&lt;br /&gt;
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.&lt;br /&gt;
&lt;br /&gt;
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the subject of those statement units that share their objects with the subjects of other statement units within the same granularity tree unit, but whose own subject does not appear among the objects of those statement units, i.e., the root of the tree.&lt;br /&gt;
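This root-finding criterion can be sketched directly: among the statement units of a granularity tree unit, the subject is the resource that never occurs as an object. Parthood statement units are reduced to (subject, object) pairs here; the pairs and UPRIs are illustrative.

```python
# Sketch: identifying the subject (root) of a granularity tree unit
# from its partial-order statement units (illustrative model).

def granularity_tree_subject(parthood_statements):
    """Return the root subject of a granularity tree unit."""
    subjects = {s for s, _ in parthood_statements}
    objects = {o for _, o in parthood_statements}
    roots = subjects - objects          # subjects never used as objects
    if len(roots) != 1:
        raise ValueError("a granularity tree must have exactly one root")
    return roots.pop()

root = granularity_tree_subject([
    ("ex:organism", "ex:head"),   # organism has-part head
    ("ex:head", "ex:eye"),        # head has-part eye
])
```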
&lt;br /&gt;
===Granular item group unit===&lt;br /&gt;
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.&lt;br /&gt;
&lt;br /&gt;
===Context unit===&lt;br /&gt;
The ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement ‘&amp;lt;u&amp;gt;description item unit&amp;lt;/u&amp;gt;’ ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' ‘&amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt;’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).&lt;br /&gt;
&lt;br /&gt;
===Dataset unit===&lt;br /&gt;
A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution in a collaborative project, document the state of a particular object at a given time, or store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.&lt;br /&gt;
&lt;br /&gt;
===List unit===&lt;br /&gt;
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. A set unit, in turn, is an unordered list unit in which each resource is listed only once, adhering to a uniqueness restriction.&lt;br /&gt;
&lt;br /&gt;
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a ''&amp;lt;u&amp;gt;SEMUNIT:child&amp;lt;/u&amp;gt;'' relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''&amp;lt;u&amp;gt;index&amp;lt;/u&amp;gt;'' (RDF:index).&lt;br /&gt;
&lt;br /&gt;
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.&lt;br /&gt;
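The membership-plus-index mechanism described above can be sketched as follows; the ''SEMUNIT:child'' and index property names come from the text, while the data structures, UPRIs, and helper functions are illustrative.

```python
# Sketch: an ordered list unit as indexed membership statement units,
# and a set unit derived from it (illustrative model).

def ordered_list_unit(list_upri, members):
    """Membership statement units of an ordered list unit, indexed from 1."""
    return [
        {"subject": list_upri, "property": "SEMUNIT:child",
         "object": member, "index": i}
        for i, member in enumerate(members, start=1)
    ]

def as_set_unit(membership_statements):
    """Unordered set unit: drop order, enforce uniqueness of members."""
    return {m["object"] for m in membership_statements}

authors = ordered_list_unit(
    "ex:publication_author_list",
    ["ex:author_A", "ex:author_B", "ex:author_A"])
```

Adding a cardinality restriction on the number of membership statements would then close the collection, supporting the localized closed-world assumption mentioned above.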
&lt;br /&gt;
==Discussion==&lt;br /&gt;
===Benefits of organizing a knowledge graph into semantic units===&lt;br /&gt;
====Semantic units enhance data management flexibility through modularity====&lt;br /&gt;
The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.&lt;br /&gt;
&lt;br /&gt;
====Semantic units operate at a higher level of abstraction than individual triples====&lt;br /&gt;
Semantically, semantic units encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, while the statement unit itself and its triples remain stored in a single location within the graph.&lt;br /&gt;
&lt;br /&gt;
The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches for graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research.&lt;br /&gt;
&lt;br /&gt;
===Semantic units as a framework for knowledge graph alignment===&lt;br /&gt;
The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.&lt;br /&gt;
&lt;br /&gt;
===Managing restricted access to sensitive data===&lt;br /&gt;
The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.&lt;br /&gt;
&lt;br /&gt;
===Semantic units: A framework for nested and overlapping knowledge graph modules===&lt;br /&gt;
====Semantic units identify five levels of representational granularity====&lt;br /&gt;
Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig10 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.&lt;br /&gt;
&lt;br /&gt;
====Semantic units identify granularity trees====&lt;br /&gt;
Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]&lt;br /&gt;
&lt;br /&gt;
====Semantic units identify context-dependent subgraphs====&lt;br /&gt;
Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.&lt;br /&gt;
&lt;br /&gt;
====Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units====&lt;br /&gt;
The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
Moreover, if a knowledge graph is to cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model this without having a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘''author_A -asserts-&amp;gt; statement_unit_Y''’; ‘''statement_unit_X -hasMetadata-&amp;gt; statement_unit_Z''’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.'' [61]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig12 Vogt JofBiomedSem24 15.png|1300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of &amp;lt;u&amp;gt;journal article&amp;lt;/u&amp;gt; (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the ''&amp;lt;u&amp;gt;SEMUNIT:hasLinkedSemanticUnit&amp;lt;/u&amp;gt;'' property. The interconnected hierarchy extends to an &amp;lt;u&amp;gt;investigation&amp;lt;/u&amp;gt; (OBI:0000066) instance, resulting in a &amp;lt;u&amp;gt;data set&amp;lt;/u&amp;gt; (IAO:0000100) instance with a &amp;lt;u&amp;gt;description&amp;lt;/u&amp;gt; (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of &amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt; (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Implementation===&lt;br /&gt;
====Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications====&lt;br /&gt;
To initiate the structuring of a knowledge graph into semantic units, first, a layer of abstraction beyond the triple level must be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to one data graph of a statement unit. In RDF/OWL, statement units can be conceptualized as nanopublications.&lt;br /&gt;
&lt;br /&gt;
Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format and semantics and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:&lt;br /&gt;
&lt;br /&gt;
#the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;&lt;br /&gt;
#the assertion Named Graph, containing the assertion modelled as a graph;&lt;br /&gt;
#the provenance Named Graph, containing metadata about the assertion; and&lt;br /&gt;
#the publicationInfo Named Graph, containing metadata about the nanopublication itself.&lt;br /&gt;
&lt;br /&gt;
The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).&lt;br /&gt;
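This mapping of a statement unit onto the four Named Graphs can be sketched as follows. Dicts of triple lists stand in for Named Graphs, and the NP:, PROV:, and DCTERMS: property names are simplified stand-ins for the nanopublication and provenance vocabularies; all example data is invented.

```python
# Sketch: arranging a statement unit's graphs into the four Named
# Graphs of the nanopublication schema (illustrative model).

def statement_unit_as_nanopub(np_upri, data_graph, semantic_units_graph,
                              provenance, publication_info):
    """Package a statement unit following the nanopublication schema."""
    return {
        # head graph: the semantic-units graph plus links to the parts
        "head": semantic_units_graph + [
            (np_upri, "NP:hasAssertion", np_upri + "#assertion"),
            (np_upri, "NP:hasProvenance", np_upri + "#provenance"),
            (np_upri, "NP:hasPublicationInfo", np_upri + "#pubinfo"),
        ],
        "assertion": data_graph,            # the statement unit's data graph
        "provenance": provenance,           # metadata about the assertion
        "publicationInfo": publication_info # metadata about the nanopub
    }

np = statement_unit_as_nanopub(
    "ex:np1",
    data_graph=[("ex:apple_X", "ex:hasWeight", "212.45 g")],
    semantic_units_graph=[("ex:weight_su", "RDF:type",
                           "SEMUNIT:StatementUnit")],
    provenance=[("ex:np1#assertion", "PROV:wasAttributedTo", "ex:author_A")],
    publication_info=[("ex:np1", "DCTERMS:created", "2024-06-21")])
```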
&lt;br /&gt;
A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.&lt;br /&gt;
&lt;br /&gt;
====Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs====&lt;br /&gt;
In Neo4j, a labeled property graph, the assignment of UPRIs to all nodes and relations through a ‘''UPRI:upri''’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘''statement_unit_UPRI:upri''’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘''compound_unit_UPRI:upri''’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘''compound_unit_UPRI''’ property can incorporate an array of UPRIs representing different compound units.&lt;br /&gt;
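The tagging scheme just described can be sketched with in-memory dicts standing in for Neo4j nodes and relations. The property names follow the text; everything else (the helper functions, UPRIs) is illustrative, not the prototype's actual code.

```python
# Sketch: tagging labeled-property-graph elements with the UPRIs of
# the semantic units they belong to (dicts stand in for Neo4j elements).

def tag_with_statement_unit(element, statement_unit_upri):
    """Each element belongs to exactly one statement unit."""
    element["statement_unit_UPRI"] = statement_unit_upri

def tag_with_compound_unit(element, compound_unit_upri):
    """A statement unit may belong to several compound units, so the
    property holds an array of UPRIs."""
    element.setdefault("compound_unit_UPRI", []).append(compound_unit_upri)

node = {"UPRI": "ex:apple_X"}
tag_with_statement_unit(node, "ex:weight_su")
tag_with_compound_unit(node, "ex:apple_X_item_unit")
tag_with_compound_unit(node, "ex:fruit_item_group_unit")
```

In an actual Neo4j deployment the same effect would be achieved with Cypher `SET` clauses writing these properties on matched nodes and relationships.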
&lt;br /&gt;
An initial software application for demonstration purposes has been developed by one of the authors, illustrating how semantic units can manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application employs the organization of the graph into semantic units within a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig13 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Strategies for implementation====&lt;br /&gt;
Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:&lt;br /&gt;
&lt;br /&gt;
#'''Develop from scratch''': In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.&lt;br /&gt;
#'''Transfer an existing knowledge graph''': If there is an existing knowledge graph that needs restructuring into semantic units, crafting queries to transfer all triples into corresponding statement units, based on the graph schemata identified in the first step, is the next step. The main challenge is maintaining disjointness of triples between statement units.&lt;br /&gt;
#'''A hybrid approach''': For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.&lt;br /&gt;
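The transfer strategy's core step, assigning every triple of an existing graph to exactly one statement unit, can be sketched as follows. A predicate-to-class lookup table stands in for the SHACL-based schemata identified in the first step; all class and predicate names are invented for the example.

```python
# Sketch: partitioning an existing triple set into disjoint statement
# units via a predicate-to-schema lookup (illustrative stand-in for
# SHACL shape matching).

SCHEMA_BY_PREDICATE = {
    "ex:hasWeight": "SEMUNIT:WeightStatementUnit",
    "ex:hasColor": "SEMUNIT:ColorStatementUnit",
}

def partition_into_statement_units(triples):
    """Assign every triple to one statement unit; keep units disjoint."""
    units, leftovers = [], []
    for triple in triples:
        unit_class = SCHEMA_BY_PREDICATE.get(triple[1])
        if unit_class is None:
            leftovers.append(triple)  # needs a schema before migration
        else:
            units.append({"class": unit_class, "triples": [triple]})
    return units, leftovers

units, leftovers = partition_into_statement_units([
    ("ex:apple_X", "ex:hasWeight", "212.45 g"),
    ("ex:apple_X", "ex:unknownProp", "x"),
])
```

Because each triple is routed to exactly one unit (or set aside), the disjointness requirement noted under the transfer strategy is preserved by construction; real schemata would match multi-triple graph patterns rather than single predicates.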
&lt;br /&gt;
====Semantic units as FAIR Digital Objects====&lt;br /&gt;
The concept of FAIR Digital Objects, as proposed by the European Commission Expert Group on FAIR Data, stands at the core of achieving the FAIR Principles [67], emphasizing persistent identifiers, comprehensive metadata, and contextual documentation for reliable discovery, citation, and reuse. The concept of semantic units aligns with that of FAIR Digital Objects. Each semantic unit inherently possesses a UPRI, serving as a ready-made persistent identifier. Accessibility and searchability are ensured through established query languages like SPARQL and Cypher, with RDF, JSON, and other formats supporting data export. When knowledge graphs adhere to controlled vocabularies and ontologies, and when they employ standard graph-patterns using tools like SHACL [51], ShEx [68,69], or OTTR [70,71], the data within the data graphs of semantic units may more easily achieve semantic interoperability.&lt;br /&gt;
&lt;br /&gt;
Moreover, semantic units can provide provenance—crucial for tracking a semantic unit’s history—through utilizing property-value pairs for labeled property knowledge graphs or a designated provenance Named Graph for RDF/OWL knowledge graphs. The provenance metadata of a semantic unit encompasses details like the creator, creation date, application used, title, contributing users, and last-update, focusing solely on the semantic unit itself, not the original data production process.&lt;br /&gt;
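The provenance fields listed above can be pictured as plain property-value pairs attached to a semantic unit's UPRI. The sketch below is purely illustrative: the field names, identifiers, and helper function are assumptions for demonstration, not a fixed vocabulary from the article.

```python
# Illustrative only: provenance metadata of a semantic unit held as
# property-value pairs, keyed by the unit's UPRI. Field names and all
# identifier values are hypothetical placeholders.

provenance = {
    "creator": "orcid:0000-0000-0000-0000",    # hypothetical ORCID
    "created": "2024-06-17",
    "application": "SemanticUnitsApp 0.1",     # hypothetical tool name
    "title": "Weight measurement of Mouse1",
    "contributors": ["orcid:0000-0000-0000-0001"],
    "last_update": "2024-06-18",
}

def describe(unit_upri, prov):
    """Attach provenance pairs to a semantic unit, keyed by its UPRI."""
    return {unit_upri: prov}

record = describe("https://example.org/su/42", provenance)
```

In an RDF/OWL knowledge graph the same pairs would live in a designated provenance Named Graph rather than a dictionary; the dictionary form mirrors the labeled-property-graph case.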
&lt;br /&gt;
Access control metadata can specify licenses as well as any access restrictions.&lt;br /&gt;
&lt;br /&gt;
==Conclusion and future work==&lt;br /&gt;
In conclusion, the adoption of semantic units in structuring knowledge graphs may be useful to address the challenges faced in knowledge representation mentioned in the introduction. By encapsulating each statement within its dedicated statement unit, accompanied by a corresponding statement unit class and data schema (e.g., as a SHACL shape), a robust foundation for FAIR data and metadata is established, supporting schematic interoperability. Because statement units partition the knowledge graph so that every triple belongs to exactly one statement unit and every statement unit’s subgraph is identifiable and referenceable through its UPRI, data in a knowledge graph is linked to graph patterns, which are identifiable as a whole. By providing each schema its own UPRI, each semantic unit can specify its underlying schema in its metadata. Identifying semantically interoperable semantic units is then straightforward, and schema crosswalks between different schemata can increase schematic interoperability. [72] (This addresses Challenge 1.)&lt;br /&gt;
&lt;br /&gt;
Graph query languages can use the graph patterns (semantic units), and therefore allow access to knowledge graph content through higher levels of abstractions than basic triples. (This addresses Challenge 2.) Further, we have shown how semantic units can organize knowledge graphs in different layers and make statements about statements. (This addresses Challenge 3.)&lt;br /&gt;
&lt;br /&gt;
Future research involves extending the semantic units approach to incorporate question units and a nuanced categorization of assertional, contingent, prototypical, and universal statement units. This extension will encompass formal semantics for the latter, including provisions for negations and cardinality restrictions. Additionally, we are exploring novel approaches to knowledge graph exploration based on semantic units.&lt;br /&gt;
&lt;br /&gt;
==Abbreviations, acronyms, and initialisms==&lt;br /&gt;
* '''BFO''': Basic Formal Ontology&lt;br /&gt;
* '''CRUD''': Create, Read, Update, Delete&lt;br /&gt;
* '''FAIR''': Findable, Accessible, Interoperable, and Reusable&lt;br /&gt;
* '''HTTP''': Hypertext Transfer Protocol&lt;br /&gt;
* '''HTTPS''': Hypertext Transfer Protocol Secure&lt;br /&gt;
* '''IAO''': Information Artifact Ontology&lt;br /&gt;
* '''ID''': Identifier&lt;br /&gt;
* '''JSON''': JavaScript Object Notation&lt;br /&gt;
* '''LinkML''': Linked Data Modeling Language&lt;br /&gt;
* '''NCIT''': National Cancer Institute Thesaurus&lt;br /&gt;
* '''NoSQL''': Not only Structured Query Language&lt;br /&gt;
* '''OBI''': Ontology for Biomedical Investigations&lt;br /&gt;
* '''OBOE''': Extensible Observation Ontology&lt;br /&gt;
* '''OBO Foundry''': Open Biological and Biomedical Ontology Foundry&lt;br /&gt;
* '''OTTR''': Reasonable Ontology Templates&lt;br /&gt;
* '''OWL''': Web Ontology Language&lt;br /&gt;
* '''PATO''': Phenotype and Trait Ontology&lt;br /&gt;
* '''RDF''': Resource Description Framework&lt;br /&gt;
* '''RDFS''': RDF-Schema&lt;br /&gt;
* '''RO''': OBO Relations Ontology&lt;br /&gt;
* '''SHACL''': Shapes Constraint Language&lt;br /&gt;
* '''ShEx''': Shape Expressions&lt;br /&gt;
* '''SIO''': Semanticscience Integrated Ontology&lt;br /&gt;
* '''SPARQL''': SPARQL Protocol and RDF Query Language&lt;br /&gt;
* '''TI''': Time Ontology in OWL&lt;br /&gt;
* '''TRUST''': Transparency, Responsibility, User Focus, Sustainability, and Technology&lt;br /&gt;
* '''UBERON''': Uber-anatomy ontology&lt;br /&gt;
* '''UO''': Units of Measurement Ontology&lt;br /&gt;
* '''UPRI''': Unique Persistent and Resolvable Identifier&lt;br /&gt;
* '''XSD''': Extensible Markup Language Schema Definition&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==Acknowledgements==&lt;br /&gt;
We thank Werner Ceusters, Nico Matentzoglu, Manuel Prinz, Marcel Konrad, Philip Strömert, Roman Baum, Björn Quast, Peter Grobe, István Míko, Manfred Jeusfeld, Manolis Koubarakis, Javad Chamanara, and Kheir Eddine for discussing some of the presented ideas. We also thank the anonymous reviewers for their suggestions and feedback. We are solely responsible for all the arguments and statements in this paper.&lt;br /&gt;
&lt;br /&gt;
===Author contributions===&lt;br /&gt;
L.V. developed the concept of semantic units and wrote the initial manuscript text. All authors reviewed and revised the manuscript.&lt;br /&gt;
&lt;br /&gt;
===Funding===&lt;br /&gt;
Open Access funding enabled and organized by Projekt DEAL. Lars Vogt received funding by the ERC H2020 Project ‘ScienceGraph’ (819536).&lt;br /&gt;
&lt;br /&gt;
===Conflict of interest===&lt;br /&gt;
The authors declare no competing interests.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Template:Article_of_the_week&amp;diff=64487</id>
		<title>Template:Article of the week</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Template:Article_of_the_week&amp;diff=64487"/>
		<updated>2024-06-17T15:05:39Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Updated article of the week text&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Soto-Perdomo SoftwareX2023 24.jpg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:OptiGUI DataCollector: A graphical user interface for automating the data collecting process in optical and photonics labs|OptiGUI DataCollector: A graphical user interface for automating the data collecting process in optical and photonics labs]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
OptiGUI DataCollector is a Python 3.8-based graphical user interface (GUI) that facilitates automated data collection in optics and photonics research and development equipment. It provides an intuitive and easy-to-use platform for controlling a wide range of optical instruments, including [[spectrometer]]s and lasers. OptiGUI DataCollector is a flexible and modular framework that enables simple integration with different types of devices. It simplifies experimental workflow and reduces human error by automating parameter control, data acquisition, and [[Data analysis|analysis]]. OptiGUI DataCollector is currently focused on optical mode conversion utilizing fiber optic technologies ... ('''[[Journal:OptiGUI DataCollector: A graphical user interface for automating the data collecting process in optical and photonics labs|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Recently featured'':&lt;br /&gt;
{{flowlist |&lt;br /&gt;
* [[Journal:Ten simple rules for managing laboratory information|Ten simple rules for managing laboratory information]]&lt;br /&gt;
* [[Journal:Hierarchical AI enables global interpretation of culture plates in the era of digital microbiology|Hierarchical AI enables global interpretation of culture plates in the era of digital microbiology]]&lt;br /&gt;
* [[Journal:Critical analysis of the impact of AI on the patient–physician relationship: A multi-stakeholder qualitative study|Critical analysis of the impact of AI on the patient–physician relationship: A multi-stakeholder qualitative study]]&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:OptiGUI_DataCollector:_A_graphical_user_interface_for_automating_the_data_collecting_process_in_optical_and_photonics_labs&amp;diff=64486</id>
		<title>Journal:OptiGUI DataCollector: A graphical user interface for automating the data collecting process in optical and photonics labs</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:OptiGUI_DataCollector:_A_graphical_user_interface_for_automating_the_data_collecting_process_in_optical_and_photonics_labs&amp;diff=64486"/>
		<updated>2024-06-17T15:04:46Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* bstract */ Fix&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = OptiGUI DataCollector: A graphical user interface for automating the data collecting process in optical and photonics labs&lt;br /&gt;
|journal      = ''SoftwareX''&lt;br /&gt;
|authors      = Soto-Perdomo, Juan; Morales-Guerra, Juan; Arango, Juan D.; Villada, Sebastian M.; Torres, Pedro; Reyes-Vera, Erick&lt;br /&gt;
|affiliations = Instituto Tecnológico Metropolitano, Universidad Nacional de Colombia&lt;br /&gt;
|contact      = Email: juansoto319998 at correo dot itm dot edu dot co&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2023&lt;br /&gt;
|vol_iss      = '''24'''&lt;br /&gt;
|at           = 101521&lt;br /&gt;
|doi          = [https://doi.org/10.1016/j.softx.2023.101521 10.1016/j.softx.2023.101521]&lt;br /&gt;
|issn         = 2352-7110&lt;br /&gt;
|license      = [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://www.sciencedirect.com/science/article/pii/S2352711023002170 https://www.sciencedirect.com/science/article/pii/S2352711023002170]&lt;br /&gt;
|download     = [https://www.sciencedirect.com/science/article/pii/S2352711023002170/pdfft?md5=e8d1b4827091c5d820ccca056a49015b&amp;amp;pid=1-s2.0-S2352711023002170-main.pdf https://www.sciencedirect.com/science/article/pii/S2352711023002170/pdfft] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
OptiGUI DataCollector is a Python 3.8-based graphical user interface (GUI) that facilitates automated data collection in optics and photonics research and development equipment. It provides an intuitive and easy-to-use platform for controlling a wide range of optical instruments, including [[spectrometer]]s and lasers. OptiGUI DataCollector is a flexible and modular framework that enables simple integration with different types of devices. It simplifies experimental workflow and reduces human error by automating parameter control, data acquisition, and [[Data analysis|analysis]]. OptiGUI DataCollector is currently focused on optical mode conversion utilizing fiber optic technologies but can be expanded to other [[research]] and development (R&amp;amp;D) processes.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': laboratory automation, device integration, software framework, graphical user interface, optical fiber, optical mode conversion&lt;br /&gt;
&lt;br /&gt;
==Motivation and significance==&lt;br /&gt;
Experiments in today's scientific [[research]] are becoming increasingly complicated and involve several hardware devices that work together in a coordinated manner. These experiments can contain a variety of instruments such as sensors, analyzers, detectors, and actuators, each of which has a specific purpose in the experiment.&amp;lt;ref&amp;gt;{{Cite journal |last=Arango |first=Juan |last2=Aristizabal |first2=Victor |last3=Vélez |first3=Francisco |last4=Carrasquilla |first4=Juan |last5=Gomez |first5=Jorge |last6=Quijano |first6=Jairo |last7=Herrera-Ramirez |first7=Jorge |date=2023-06 |title=Synthetic dataset of speckle images for fiber optic temperature sensor |url=https://linkinghub.elsevier.com/retrieve/pii/S2352340923002536 |journal=Data in Brief |language=en |volume=48 |pages=109134 |doi=10.1016/j.dib.2023.109134 |pmc=PMC10139894 |pmid=37122920}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:0&amp;quot;&amp;gt;{{Cite journal |last=Valencia-Garzón |first=Sebastian |last2=Reyes-Vera |first2=Erick |last3=Galvis-Arroyave |first3=Jorge |last4=Montoya |first4=Jose P. |last5=Gomez-Cardona |first5=Nelson |date=2022-11-27 |title=Metrological Characterization of a CO2 Laser-Based System for Inscribing Long-Period Gratings in Optical Fibers |url=https://www.mdpi.com/2410-390X/6/4/79 |journal=Instruments |language=en |volume=6 |issue=4 |pages=79 |doi=10.3390/instruments6040079 |issn=2410-390X}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:1&amp;quot;&amp;gt;{{Cite journal |last=Del Villar |first=Ignacio |last2=Montoya-Cardona |first2=Jorge |last3=Imas |first3=José J. |last4=Reyes-Vera |first4=Erick |last5=Zamarreño |first5=Carlos R. |last6=Matias |first6=Ignacio R. |last7=Cruz |first7=Jose L. 
|date=2023-07-01 |title=Tunable Sensitivity in Long Period Fiber Gratings During Mode Transition With Low Refractive Index Intermediate Layer |url=https://ieeexplore.ieee.org/document/9970326/ |journal=Journal of Lightwave Technology |volume=41 |issue=13 |pages=4219–4229 |doi=10.1109/JLT.2022.3226800 |issn=0733-8724}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Reyes-Vera |first=Erick |last2=Botero-Valencia |first2=Juan S. |last3=Arango-Bustamante |first3=Karen |last4=Zuluaga |first4=Alejandra |last5=Naranjo |first5=Tonny W. |date=2022-04-29 |title=Microscopic Imaging and Labeling Dataset for the Detection of Pneumocystis jirovecii Using Methenamine Silver Staining Method |url=https://www.mdpi.com/2306-5729/7/5/56 |journal=Data |language=en |volume=7 |issue=5 |pages=56 |doi=10.3390/data7050056 |issn=2306-5729}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Muñoz-Hernández |first=Tatiana |last2=Reyes-Vera |first2=Erick |last3=Torres |first3=Pedro |date=2019-08-19 |title=Tunable Whispering Gallery Mode Photonic Device Based on Microstructured Optical Fiber with Internal Electrodes |url=https://www.nature.com/articles/s41598-019-48598-z |journal=Scientific Reports |language=en |volume=9 |issue=1 |pages=12083 |doi=10.1038/s41598-019-48598-z |issn=2045-2322 |pmc=PMC6700125 |pmid=31427674}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Arango |first=J D |last2=Aristizabal |first2=V H |last3=Carrasquilla |first3=J F |last4=Gomez |first4=J A |last5=Quijano |first5=J C |last6=Velez |first6=F J |last7=Herrera-Ramirez |first7=J |date=2021-12-01 |title=Deep learning classification and regression models for temperature values on a simulated fibre specklegram sensor |url=https://iopscience.iop.org/article/10.1088/1742-6596/2139/1/012001 |journal=Journal of Physics: Conference Series |volume=2139 |issue=1 |pages=012001 |doi=10.1088/1742-6596/2139/1/012001 |issn=1742-6588}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Montoya 
|first=Manuel |last2=Lopera |first2=Maria J. |last3=Gómez-Ramírez |first3=Alejandra |last4=Buitrago-Duque |first4=Carlos |last5=Pabón-Vidal |first5=Adriana |last6=Herrera-Ramirez |first6=Jorge |last7=Garcia-Sucerquia |first7=Jorge |last8=Trujillo |first8=Carlos |date=2023-06 |title=FocusNET: An autofocusing learning‐based model for digital lensless holographic microscopy |url=https://linkinghub.elsevier.com/retrieve/pii/S0143816623000751 |journal=Optics and Lasers in Engineering |language=en |volume=165 |pages=107546 |doi=10.1016/j.optlaseng.2023.107546}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Galvis-Arroyave |first=J L |last2=Villegas-Aristizabal |first2=J |last3=Montoya-Cardona |first3=J |last4=Montoya-Villada |first4=S |last5=Reyes-Vera |first5=E |date=2020-05-01 |title=Experimental characterization of a tuneable all-fiber mode converter device for mode-division multiplexing systems |url=https://iopscience.iop.org/article/10.1088/1742-6596/1547/1/012004 |journal=Journal of Physics: Conference Series |volume=1547 |issue=1 |pages=012004 |doi=10.1088/1742-6596/1547/1/012004 |issn=1742-6588}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{Cite journal |last=Gómez-Cardona |first=Nelson |last2=Jiménez-Durango |first2=Cristian |last3=Usuga-Restrepo |first3=Juan |last4=Torres |first4=Pedro |last5=Reyes-Vera |first5=Erick |date=2021-02 |title=Thermo-optically tunable polarization beam splitter based on selectively gold-filled dual-core photonic crystal fiber with integrated electrodes |url=http://link.springer.com/10.1007/s11082-020-02718-6 |journal=Optical and Quantum Electronics |language=en |volume=53 |issue=2 |pages=68 |doi=10.1007/s11082-020-02718-6 |issn=0306-8919}}&amp;lt;/ref&amp;gt; Efficiently controlling these experiments demands software that can effectively coordinate the functioning of these devices, not only by delivering orders but also by ensuring accurate data collection and control of experimental conditions.&amp;lt;ref 
name=&amp;quot;:2&amp;quot;&amp;gt;{{Cite journal |last=Binder |first=Jan M. |last2=Stark |first2=Alexander |last3=Tomek |first3=Nikolas |last4=Scheuer |first4=Jochen |last5=Frank |first5=Florian |last6=Jahnke |first6=Kay D. |last7=Müller |first7=Christoph |last8=Schmitt |first8=Simon |last9=Metsch |first9=Mathias H. |last10=Unden |first10=Thomas |last11=Gehring |first11=Tobias |date=2017 |title=Qudi: A modular python suite for experiment control and data processing |url=https://linkinghub.elsevier.com/retrieve/pii/S2352711017300055 |journal=SoftwareX |language=en |volume=6 |pages=85–90 |doi=10.1016/j.softx.2017.02.001}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:3&amp;quot;&amp;gt;{{Cite journal |last=Bromig |first=Lukas |last2=Leiter |first2=David |last3=Mardale |first3=Alexandru-Virgil |last4=von den Eichen |first4=Nikolas |last5=Bieringer |first5=Emmeran |last6=Weuster-Botz |first6=Dirk |date=2022-01 |title=The SiLA 2 Manager for rapid device integration and workflow automation |url=https://linkinghub.elsevier.com/retrieve/pii/S2352711022000103 |journal=SoftwareX |language=en |volume=17 |pages=100991 |doi=10.1016/j.softx.2022.100991}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:4&amp;quot;&amp;gt;{{Cite journal |last=Colle |first=Jean-Yves |last2=Rautio |first2=Jouni |last3=Freis |first3=Daniel |date=2021-12 |title=A modular LabVIEW application frame for Knudsen Effusion Mass Spectrometry instrument control |url=https://linkinghub.elsevier.com/retrieve/pii/S2352711021001412 |journal=SoftwareX |language=en |volume=16 |pages=100875 |doi=10.1016/j.softx.2021.100875}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:5&amp;quot;&amp;gt;{{Cite journal |last=Killoran |first=Nathan |last2=Izaac |first2=Josh |last3=Quesada |first3=Nicolás |last4=Bergholm |first4=Ville |last5=Amy |first5=Matthew |last6=Weedbrook |first6=Christian |date=2019-03-11 |title=Strawberry Fields: A Software Platform for Photonic Quantum Computing 
|url=https://quantum-journal.org/papers/q-2019-03-11-129/ |journal=Quantum |language=en |volume=3 |pages=129 |doi=10.22331/q-2019-03-11-129 |issn=2521-327X}}&amp;lt;/ref&amp;gt; The software must be able to handle the communication protocols and data formats of many devices, as well as manage potential conflicts or dependencies between them.&lt;br /&gt;
&lt;br /&gt;
In addition to accurate [[Information management|management]], fast data processing and [[Data visualization|visualization]] are critical for effective data interpretation. Because experiments create significant amounts of real-time data, the software must be able to process the data rapidly and efficiently. This may include data filtering, [[Data cleansing|standardization]], [[Data analysis|analysis]], and visualization to obtain useful insights from the experiment. Real-time data visualization can also provide researchers with quick feedback, allowing them to make informed decisions and modify experimental conditions as needed.&lt;br /&gt;
&lt;br /&gt;
Each experiment requires a unique combination of hardware devices depending on the specific needs of the research.&amp;lt;ref name=&amp;quot;:2&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;:3&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;:4&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;:5&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;:6&amp;quot;&amp;gt;{{Cite web |last=Gomez Labat, J. |date=2016 |title=Desarrollo de una interfaz gráfica de usuario para el control de analizadores de espectros ópticos mediante Matlab |work=Academica-e |url=https://academica-e.unavarra.es/xmlui/handle/2454/21691 |publisher=Universidad Pública de Navarra - Nafarroako Unibertsitate Publikoa |archiveurl=https://web.archive.org/web/20231210081252/https://academica-e.unavarra.es/xmlui/handle/2454/21691 |archivedate=10 December 2023}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=&amp;quot;:7&amp;quot;&amp;gt;{{Citation |last=Harun S.W., Emami S.D., Arof H., Hajireza P., Ahmad H. |first= |last2= |first2= |last3= |first3= |last4= |first4= |last5= |first5= |date=2011-01-21 |editor-last=de Asmundis |editor-first=Riccardo |title=LabVIEW Applications for Optical Amplifier Automated Measurements, Fiber-Optic Remote Test and Fiber Sensor Systems |url=http://www.intechopen.com/books/modeling-programming-and-simulations-using-labview-software/labview-applications-for-optical-amplifier-automated-measurements-fiber-optic-remote-test-and-fiber- |work=Modeling, Programming and Simulations Using LabVIEW&amp;amp;#8482; Software |language=en |publisher=InTech |doi=10.5772/13247 |isbn=978-953-307-521-1 |accessdate=}}&amp;lt;/ref&amp;gt; Two studies discussed in this passage highlight the development of software interfaces for fiber optic devices. In the first study, Labat&amp;lt;ref name=&amp;quot;:6&amp;quot; /&amp;gt; created a graphical user interface (GUI) in Matlab that allows for remote control, data acquisition, and data visualization from optical spectrum analyzers. 
The GUI was implemented to analyze the transmission spectrum of long-period fiber gratings (LPFGs), with various parameters such as operating wavelength and center settings. In the second study, Harun ''et al.''&amp;lt;ref name=&amp;quot;:7&amp;quot; /&amp;gt; developed an automated system in LabVIEW for self-calibration and measurement of fiber optic devices such as sensors and Erbium Doped Fiber Amplifiers (EDFAs). Their system allows for up to an 80% reduction in data acquisition time while providing precise and consistent measurements with a low uncertainty value of ±0.012 dB.&lt;br /&gt;
&lt;br /&gt;
Based on the above, this work developed a computerized instrumentation system that allows constant monitoring, processing, and data collection from research and development (R&amp;amp;D) equipment. We therefore propose an object-oriented GUI capable of executing tasks dynamically without requiring system modifications that may disturb the measurements. Executing various tasks requires reliable models that optimize the functions of each instrument, making efficient use of R&amp;amp;D personnel's working time and providing a user-friendly interface that facilitates manipulation of the instruments. With object-oriented programming, the problem to be solved is modeled through a series of interactions between the optical equipment, reusing lines of code through inheritance and the implementation of polymorphism.&lt;br /&gt;
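The inheritance-and-polymorphism pattern described here can be sketched in a few lines. The class and method names below are illustrative inventions, not taken from the OptiGUI source: a shared base class supplies common behavior, and each device overrides one method so the GUI can drive all instruments through a single interface.

```python
# Minimal sketch (hypothetical names, not OptiGUI's actual classes) of
# the object-oriented pattern the text describes: instruments inherit
# shared behavior and are driven polymorphically by the GUI.

class Instrument:
    def __init__(self, name):
        self.name = name

    def connect(self):            # shared behavior, inherited by all devices
        return f"{self.name} connected"

    def acquire(self):            # polymorphic: each device overrides this
        raise NotImplementedError

class TunableLaser(Instrument):
    def acquire(self):
        return "wavelength sweep"

class PowerMeter(Instrument):
    def acquire(self):
        return "optical power reading"

# The GUI can iterate over heterogeneous devices without type checks.
readings = [dev.acquire() for dev in (TunableLaser("laser"), PowerMeter("meter"))]
```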
&lt;br /&gt;
==Software description==&lt;br /&gt;
OptiGUI DataCollector is a GUI created using [[Python (programming language)|Python 3.8]], the PyQt5 library, and QTDesigner. The GUI follows an object-oriented programming paradigm to create a unique framework. It allows the user to intuitively operate the basic functions of the most commonly used instruments in a photonics [[laboratory]], minimizing the time required to obtain data through continuous interaction with the instruments. In addition, the GUI design allows easy reuse of existing modules and implementation with other hardware controllers, allowing the use of equipment from other manufacturers.&lt;br /&gt;
&lt;br /&gt;
The user can acquire optical power and image data for different wavelengths and polarization states of light, as well as subject optical devices to different temperature changes to analyze variations in the characteristics of light as it propagates through an optical fiber. In addition, OptiGUI DataCollector allows exporting the data collected during the experiment for further study. This UI allows R&amp;amp;D personnel to automate and streamline their studies of fiber optic devices.&lt;br /&gt;
&lt;br /&gt;
===Framework components===&lt;br /&gt;
Python modules are used to build the GUI. These modules are designed to work together so that the code can focus on performing the required functions. The user interface (UI) was created using an iterative methodology, as detailed in the schematic shown in Fig. 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Soto-Perdomo SoftwareX2023 24.jpg|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Diagram of the framework components with their respective input and output parameters.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In addition, the following five hardware components controlled by the UI are shown in the diagram:&lt;br /&gt;
&lt;br /&gt;
*'''Tunable laser''': A tunable laser is a component of equipment that enables the user to alter the operating wavelengths to suit a variety of optical applications. This is accomplished by specifying a starting and ending wavelength, as well as defining the step to be used for the wavelength sweep. It also enables the user to adjust the output power of the light source.&lt;br /&gt;
*'''Power meter''': The user may select either the optical power meter or the optical spectrum analyzer. In both instances, the user can configure the interface by providing the initial wavelength, the final wavelength, and the acquisition interval. These parameters should, ideally, correspond to the configuration of the adjustable laser. It is crucial to note that the optical power meter measures the output power of the light source at a specific wavelength, whereas the optical spectrum analyzer measures the power spectrum. Consequently, the choice between the two will depend on the user’s particular measurement needs.&lt;br /&gt;
*'''Rotational stage''': This component facilitates the activation of the rotational stage. This rotatable platform allows for a degree-by-degree adjustment of the angle at which certain optical elements are positioned. The user can therefore adjust the initial angle, the final angle, and the step size.&lt;br /&gt;
*'''Visible or IR camera''': This component is a power on/off switch for a camera. The camera is a commonly used device for capturing images in different applications, such as capturing images of the spatial distribution of modes propagating in an optical fiber.&lt;br /&gt;
*'''Temperature controlled system''': This module contains the hardware components necessary for closed-loop temperature control. It is composed of an Arduino Mega 2560 coupled to a 12 V (40 W) ceramic cartridge heater for heating the fiber (via the RAMPS 1.4 power control board) and a 100 K NTC thermistor for measuring the temperature. The ceramic cartridge heater is encapsulated in an aluminum block containing the fiber to ensure uniform heat distribution. To monitor the actual temperature, a thermistor is also housed in the same block.&lt;br /&gt;
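The sweep configuration shared by the tunable laser and power meter (a start wavelength, an end wavelength, and a step) can be sketched as follows. The function name and the example values are assumptions for illustration; the article does not publish this routine.

```python
# Hypothetical sketch of the wavelength-sweep setup described for the
# tunable laser: given start, end, and step (all in nm), enumerate the
# wavelengths the sweep would visit. Integer step counting avoids
# floating-point drift across the range.

def wavelength_sweep(start_nm, end_nm, step_nm):
    """Return the list of wavelengths visited by the sweep, inclusive
    of the end point when the step divides the range evenly."""
    n_steps = int(round((end_nm - start_nm) / step_nm))
    return [round(start_nm + i * step_nm, 6) for i in range(n_steps + 1)]

# Example values in the telecom C-band (illustrative, not from the paper):
sweep = wavelength_sweep(1530.0, 1530.4, 0.1)
```

The same triple of parameters would configure the power meter or spectrum analyzer so that acquisitions line up with the laser's sweep, as the text recommends.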
&lt;br /&gt;
The Arduino script is programmed to preprocess and convert the thermistor signal to °C. PID control is then implemented to maintain a constant temperature set point. The Arduino continuously receives this set point via the serial port, which is dynamically adjusted based on the user-specified temperature range in °C via the Python script. Essentially, the Arduino's main function is to maintain temperature control at the specified set point.&lt;br /&gt;
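The PID loop described above runs on the Arduino in the real system; the sketch below is a generic, illustrative PID implementation in Python, not the authors' firmware. The gains and set point are placeholder assumptions.

```python
# Hedged sketch of the PID temperature control described in the text.
# This is a textbook PID, not the authors' Arduino code; kp, ki, kd,
# and the 50 degC set point are placeholder values.

class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target temperature in degC
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured_c, dt):
        """Return a heater drive value from the temperature error."""
        error = self.setpoint - measured_c
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=0.1, kd=0.5, setpoint=50.0)  # placeholder gains
drive = pid.update(measured_c=25.0, dt=1.0)       # one control step
```

In the described setup, the drive value would modulate the ceramic cartridge heater through the RAMPS 1.4 board, while the thermistor reading supplies measured_c on each loop iteration.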
&lt;br /&gt;
===Software architecture===&lt;br /&gt;
As shown in Fig. 2, OptiGUI DataCollector is a software that incorporates five distinct optical laboratory instruments and enables their control from a single interactive UI. The program's GUI allows the user to automate and control the optical setup intuitively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Soto-Perdomo SoftwareX2023 24.jpg|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' The main window of the user interface.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
This window allows the user to configure parameters such as optical power, wavelength, temperature, and any other variables required to change the operation of each device. These values can be saved and reloaded for subsequent experiments, making the program extremely versatile and useful for R&amp;amp;D activities.&lt;br /&gt;
&lt;br /&gt;
Communication between the optical equipment and the UI is performed using instrument-specific protocols and libraries. The SDK ([[software development kit]]) consists of a compiler, debugger, documentation, drivers, and network protocols for a particular hardware component. A Python-based, object-oriented API ([[application programming interface]]) supports the project's scalability by facilitating packet exchange. Using a simple access protocol, the user and the instrument exchange information.&lt;br /&gt;
&lt;br /&gt;
The UI consists of seven modules, each of which performs a specific purpose, as described below:&lt;br /&gt;
&lt;br /&gt;
'''Module 1. Tunable_Laser.py''': The TCP/IP protocol over Ethernet is used to communicate with the tunable laser. This protocol allows the simultaneous exchange of data between the UI and the server (Laser PyApex). This is made possible by the SDK from Apex Technologies, written in Python 3, which comprises a collection of example codes for controlling the Apex1000 device.&lt;br /&gt;
&lt;br /&gt;
The tunable laser module is responsible for setting the characteristic attributes of the laser, such as input power in dBm and operating wavelengths in nm. In addition, it has a drop-down menu for selecting the laser source reference.&lt;br /&gt;
&lt;br /&gt;
'''Module 2. Degrees_Polarization.py''': To establish communication between the UI and a Thorlabs rotational stage, the thorlabs-apt-device library version 0.3.3 is used, which implements the Advanced Positioning Technology (APT) protocol for this type of device. With this library, the Degrees_Polarization.py module was created to govern the rotational stage using four distinct attributes: initial angle, final angle, steps, and rotational direction. This module allows communication with any device using a DC servo motor driver from Thorlabs. Additionally, the module includes error checking and validation to guarantee correct input and avoid errors during execution. The GUI offers visual feedback and user-friendly controls for simple rotational stage operation.&lt;br /&gt;
&lt;br /&gt;
'''Module 3. Power_Meter.py''': An optical power meter is also used in this UI. Its drivers are written in LabVIEW, a development environment that uses a graphical programming language to design systems. To integrate Python with LabVIEW, the “outside-in” approach is used, in which Python instructs LabVIEW to perform an action and return its result.&lt;br /&gt;
&lt;br /&gt;
To use this integration approach, the “autoliv” and “ActiveX” libraries are used to retrieve and manipulate data acquired with LabVIEW in Python. The interface also includes a plotter that displays the transmission spectrum plot, relating wavelength to optical power. Each power value corresponding to a wavelength is added to a list to visualize the optical device's transmission spectrum.&lt;br /&gt;
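&lt;br /&gt;
The “outside-in” call into LabVIEW might look roughly like the following (Windows only, via LabVIEW's ActiveX automation server and the pywin32 package). The VI path and terminal name are placeholders, and the spectrum helper simply pairs each wavelength with its measured power.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of Python driving a LabVIEW VI through ActiveX automation.
# VI path and terminal names are placeholders for illustration.
def read_power_via_labview(vi_path):
    """Ask LabVIEW to run a VI that reads the optical power meter."""
    import win32com.client  # pywin32 (Windows only)
    labview = win32com.client.Dispatch("LabVIEW.Application")
    vi = labview.GetVIReference(vi_path)
    names = ["power_dbm"]   # assumed output terminal name on the VI
    values = [0.0]
    vi.Call(names, values)  # LabVIEW executes the VI and fills the outputs
    return values[0]

def build_spectrum(wavelengths, powers):
    """Pair wavelengths (nm) with measured powers (dBm) for plotting."""
    return list(zip(wavelengths, powers))
```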
&lt;br /&gt;
'''Module 4. Image_Capture.py''': This module contains two classes for controlling a camera from the system UI. One is responsible for activating the camera in the system and capturing frames, while the other checks the data and stores it in a list data structure. This module also makes use of the OpenCV library to read and save .tiff pictures. Furthermore, the pixel data is saved as a data structure in an array using the numpy library. The module also enables the user to view the images in a window that is generated within the UI.&lt;br /&gt;
&lt;br /&gt;
The OpenCV library provides a wide range of features for processing and analyzing camera images, such as image filtering, object detection, and image segmentation. The acquired frames can also be analyzed in real time using the library's functions, allowing the user to perform tasks such as object tracking or motion detection.&lt;br /&gt;
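&lt;br /&gt;
The two roles described above (grabbing frames; validating and storing them) can be sketched with OpenCV as follows; the camera index and filename pattern are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of frame capture and .tiff storage with OpenCV, mirroring the
# two classes described above. Camera index and filenames are illustrative.
def tiff_name(prefix, index):
    """Build the output filename for frame number 'index'."""
    return f"{prefix}_{index:04d}.tiff"

def capture_frames(count, prefix="frame", camera_index=0):
    """Grab 'count' frames, keep the valid ones, and save them as .tiff."""
    import cv2  # OpenCV (this function requires a camera)
    frames = []
    cam = cv2.VideoCapture(camera_index)
    for i in range(count):
        ok, frame = cam.read()
        if ok:                      # validation step: keep only good reads
            frames.append(frame)    # pixel data arrives as a numpy array
            cv2.imwrite(tiff_name(prefix, i), frame)
    cam.release()
    return frames
```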
&lt;br /&gt;
'''Module 5. Temperature_Control''': This module is designed to connect to the Arduino Mega through serial communication using the pySerial library. A new object named “mega” is created, which specifies the serial port attributes, the baud rate (115200), and a timeout of one second.&lt;br /&gt;
&lt;br /&gt;
To initiate the connection and send the temperature setpoint, the Python function “mega.write()” is used. On the other hand, to receive the actual temperature read by the thermistor, the “mega.readline()” function is employed.&lt;br /&gt;
&lt;br /&gt;
Here, it is worth noting that the pySerial library enables communication between the Arduino Mega and Python, allowing the integration of different technologies and enabling software developers to design more comprehensive and efficient solutions. After establishing a connection between the Python script and the Arduino board, the script runs through a user-defined range of temperatures. Each set point is sent individually to the Arduino; once the board signals that the desired temperature has been reached, a stabilization interval elapses, after which the image and actual temperature are captured and saved in the dataset. This procedure is repeated for all temperature set points.&lt;br /&gt;
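&lt;br /&gt;
A condensed sketch of this serial exchange is shown below; the set point and reply formats are assumptions, since the Arduino firmware defines the actual protocol.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the serial sweep over temperature set points via pySerial.
# The message formats sent to and read from the Arduino are assumptions.
import time

def setpoint_range(start, stop, step):
    """Temperature set points from start to stop (inclusive) in 'step' increments."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]

def run_sweep(port, setpoints, settle_seconds=30):
    """Send each set point, wait for stabilization, then read the temperature back."""
    import serial  # pySerial (this function requires the hardware)
    mega = serial.Serial(port, baudrate=115200, timeout=1)
    readings = []
    for sp in setpoints:
        mega.write(f"{sp:.1f}\n".encode("ascii"))  # send the set point
        time.sleep(settle_seconds)                 # stabilization interval
        reply = mega.readline().decode("ascii").strip()
        readings.append(float(reply) if reply else None)
    mega.close()
    return readings
```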
&lt;br /&gt;
'''Module 6. Execution.py''': This module serves as the central control hub for the entire system. It handles all the initialization tasks required to start the program and connects all the modules described above. The module provides access to different system attributes, including wavelengths, camera control, angles, rotation direction, and PLAY and STOP functions.&lt;br /&gt;
&lt;br /&gt;
These attributes play a critical role in the operation of the system. For example, the wavelengths tell the system which range of electromagnetic radiation to scan, while the camera control attribute indicates whether the camera will be used. The angles and direction of rotation control how the system moves, allowing it to collect data from different angles. Finally, the PLAY and STOP functions allow users to start or stop the system as needed.&lt;br /&gt;
&lt;br /&gt;
'''Module 7. Principal_Inteface.py''': This is the interface where applications and users interact. It contains all the PyQt5 widgets needed to create the application window. As shown in Fig. 3, the components are organized in a hierarchical structure, where the left-hand side is composed of elementary widgets to control the camera and the rotational stage, respectively. In the middle of the main window, there is a container widget for displaying images and graphs. On the right-hand side are the power meter and the tunable laser control element, and at the bottom of the window are the temperature control elements. The PLAY button starts the program by sending an input function that obtains the system parameters. Having a GUI makes it easy for users to interact with the system and verify its operation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Soto-Perdomo SoftwareX2023 24.jpg|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' The main window of the user interface.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Software functionalities===&lt;br /&gt;
The system flowchart, shown in Fig. 4, describes the process that starts with the PLAY button and allows the activation of the Tunable_Laser.py module. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Soto-Perdomo SoftwareX2023 24.jpg|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' Structural flowchart of the main modules of the GUI.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The system has two independent case studies that are described below:&lt;br /&gt;
&lt;br /&gt;
*'''Case 1''': In this situation, the system determines if camera verification is enabled. If it is not, the Power_Meter.py module is run to measure the power of the laser signal. Next, the system checks if the user has provided the temperature data (temperature range and step size). If so, the Temperature_Control.py module is run to regulate the temperature of the medium and record the transmission spectrum as a function of temperature. If no temperature information is provided, the Temperature_Control.py module is skipped, and the transmission spectrum at a given temperature is displayed on the main screen.&lt;br /&gt;
&lt;br /&gt;
*'''Case 2''': In this case, the system checks if the camera verification is active. If so, the Image_Capture.py module is activated to capture the sample image. Then, the system checks if the user has entered the temperature data (temperature range and step size). If so, the Temperature_Control.py module is executed to control the sample temperature and record the sample image as a function of temperature. If no temperature data is entered, the Temperature_Control.py module is skipped and the sample image is shown on the main screen. If no temperature values are entered but the camera check is active, the Degrees_Polarization.py module is activated to measure the polarization angle of the sample.&lt;br /&gt;
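&lt;br /&gt;
The branching in the two cases reduces to a small decision function; the sketch below uses module filenames as stand-ins for the actual calls.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the Case 1 / Case 2 branching: which modules run depends on
# whether the camera check is active and whether temperature data exists.
def plan_modules(camera_active, has_temperature_data):
    """Return the ordered list of modules the run will execute."""
    plan = ["Tunable_Laser.py"]
    if camera_active:
        plan.append("Image_Capture.py")            # Case 2
        if has_temperature_data:
            plan.append("Temperature_Control.py")  # image vs. temperature
        else:
            plan.append("Degrees_Polarization.py") # polarization angle only
    else:
        plan.append("Power_Meter.py")              # Case 1
        if has_temperature_data:
            plan.append("Temperature_Control.py")  # spectrum vs. temperature
    return plan
```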
&lt;br /&gt;
In any scenario, the user can modify the input parameters of the system and examine the results in real time, simplifying decision-making and data interpretation. The system is an effective instrument for performing optical measurements over a range of temperatures and polarization angles simultaneously.&lt;br /&gt;
&lt;br /&gt;
==Illustrative examples==&lt;br /&gt;
In this section, one case of how the developed software can be used for data acquisition is presented, demonstrating the operation of the GUI with the included modules (other illustrative examples are provided as supplementary material). This example uses an experimental setup to investigate the transmission spectra of an LPFG-based modal converter.&lt;br /&gt;
&lt;br /&gt;
The validation experiment was conducted using a tunable laser from APEX Technologies, specifically the AP3350A or AP3352A models. The laser allowed adjustments to both the wavelength, spanning from 1526 nm to 1608 nm, and the output power, which ranged from −30 dBm to +13 dBm. In addition, a Thorlabs linear polarizer was mounted on a motorized rotation stage (PRM1Z8) to adjust the polarization of the light at the system's output. This stage is equipped with a DC servo motor controller that allows Python-based computer control, making it feasible to rotate the polarization plane of the light as it exits the LPFGs. The computerized controller in Python simplifies the adjustment process and allows for improved measurement repeatability. The Thorlabs Motorized Rotation Stage PRM1Z8 enables high-precision rotation with a maximum speed of 25 degrees per second and an accuracy of 0.1%.&lt;br /&gt;
&lt;br /&gt;
On the other hand, identical wavelength values are assigned to both the tunable laser and the EXFO1600 high-speed power meter employed in this experiment. As illustrated in Fig. 4, the power meter is located at the output of the LPFG and is connected to the GUI through a serial connection. In addition, the use of a high-speed power meter such as the EXFO1600 enables fast data capture and accurate measurement of the LPFG output power.&lt;br /&gt;
&lt;br /&gt;
To examine the impact of temperature on the fiber, a PID control system was developed. This feedback control system comprises a 12 V heater and a temperature sensor (a 100 K thermistor). A microcontroller (Arduino Mega 2560) performs PID control with the heater and thermistor, managing the temperature based on a set point once the PID control has been implemented. Finally, Python was used to automate the serial transmission of a series of temperature set points to the Arduino, after confirming that the temperature had stabilized at the desired level.&lt;br /&gt;
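&lt;br /&gt;
The PID computation the Arduino applies can be sketched in Python as follows (the firmware itself runs on the microcontroller); the gains and the 0–255 PWM clamp are illustrative, not the values used in the experiment.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of a discrete PID step: the error between set point and the
# thermistor reading drives the heater duty cycle. Gains are illustrative.
def make_pid(kp, ki, kd):
    """Return a PID step function closed over its integral/derivative state."""
    state = {"integral": 0.0, "previous_error": 0.0}
    def step(setpoint, measured, dt):
        error = setpoint - measured
        state["integral"] += error * dt
        derivative = (error - state["previous_error"]) / dt
        state["previous_error"] = error
        output = kp * error + ki * state["integral"] + kd * derivative
        # clamp to a 0..255 PWM duty cycle as the heater drive
        return max(0.0, min(255.0, output))
    return step
```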
&lt;br /&gt;
===Measuring transmission spectra at different temperatures using an LPFG===&lt;br /&gt;
The LPFG transmission spectra were measured at different temperatures using a feedback-controlled heating system and an optical power meter, as shown in the prior Fig. 3. The power meter utilized the same wavelength values assigned to the tunable laser, ensuring precision and consistency in the measurements. The transmission spectrum was measured once the required temperature setting was achieved, and this procedure was repeated at different temperatures to evaluate the changes in the transmission spectrum. The obtained results allowed for quantification of the temperature dependency of LPFG transmission, with data only collected when the temperature increased.&lt;br /&gt;
&lt;br /&gt;
In this context, the GUI was used with the Tunable_Laser.py, Power_Meter.py, and Temperature_Control modules to reconstruct the LPFG transmission spectra while varying the temperature from 28 °C to 88 °C. To reconstruct the transmission spectra of the LPFG, a laser power of 13 dBm was assigned, while the wavelength varied from 1527 nm to 1601 nm with steps of 0.3 nm. Based on this, power data were collected using a power meter to reconstruct the LPFG transmission spectra for seven different temperatures. As shown in Fig. 5(a), the resonant wavelength of the LPFG exhibited a shift towards longer wavelengths, as expected.&amp;lt;ref name=&amp;quot;:0&amp;quot; /&amp;gt;&amp;lt;ref name=&amp;quot;:1&amp;quot; /&amp;gt;&lt;br /&gt;
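&lt;br /&gt;
For reference, the wavelength grid visited during one spectrum acquisition in this example can be enumerated as follows; a small epsilon guards the floating-point step count.&lt;br /&gt;
&lt;br /&gt;
```python
# Enumerate the sweep used in this example: wavelength stepped from
# 1527 nm toward 1601 nm in 0.3 nm increments at fixed 13 dBm power.
def wavelength_grid(start_nm=1527.0, stop_nm=1601.0, step_nm=0.3):
    """Wavelengths (nm) visited during one spectrum acquisition."""
    # epsilon absorbs floating-point error before truncating the count
    count = int((stop_nm - start_nm) / step_nm + 1e-9) + 1
    return [round(start_nm + i * step_nm, 3) for i in range(count)]
```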
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Soto-Perdomo SoftwareX2023 24.jpg|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' (a) Transmission spectra of the LPFG at different temperatures, and (b) sensitivity curve of the LPFG-based temperature sensor.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
To verify that the experiment could be repeated with the same results, measurements were performed five times for each temperature value. The experiments were conducted in randomized order to reduce the effect of any external influences that could impact the accuracy and reproducibility of the results. Fig. 5(b) shows the sensitivity curve of the LPFG-based temperature sensor, which demonstrates a sensitivity very close to 108.57 pm/°C. In Fig. 5(b), the mean value is shown, and the bars surrounding each point illustrate the range of error that can be expected from the measurement. In all cases, the range of error is quite small, suggesting that the experiment can be repeated with a high degree of accuracy; the software that was developed allows a greater degree of control over the experiment, further enhancing its repeatability. It is important to note that the time required to acquire data is around 18 minutes for each data set.&lt;br /&gt;
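&lt;br /&gt;
The reported sensitivity corresponds to the slope of a linear fit of resonant wavelength against temperature. A plain least-squares sketch is shown below; the data passed to it would be the measured (temperature, resonant wavelength) pairs, which are not reproduced here.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of extracting a sensitivity like that in Fig. 5(b): a linear
# least-squares fit of resonant wavelength vs. temperature, with the
# slope converted from nm/°C to pm/°C.
def sensitivity_pm_per_c(temps_c, resonant_nm):
    """Slope of the wavelength-vs-temperature fit, in pm/°C."""
    n = len(temps_c)
    mean_t = sum(temps_c) / n
    mean_w = sum(resonant_nm) / n
    num = sum((t - mean_t) * (w - mean_w) for t, w in zip(temps_c, resonant_nm))
    den = sum((t - mean_t) ** 2 for t in temps_c)
    slope_nm_per_c = num / den
    return slope_nm_per_c * 1000.0  # convert nm/°C to pm/°C
```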
&lt;br /&gt;
==Impact==&lt;br /&gt;
OptiGUI DataCollector offers researchers in photonics and optoelectronics areas a variety of advantages. By automating repetitive and time-consuming tasks such as data acquisition, processing, and analysis, researchers can save significant time and effort when conducting experiments. This increased productivity can result in accelerated research progress and more accurate and reliable findings. In addition, the software enables researchers to conduct complex investigations that may be difficult or impossible to conduct manually.&lt;br /&gt;
&lt;br /&gt;
Another benefit of automation is the reduction of human errors, which can have a significant impact on the reliability of experimental results. By automating data collection and analysis, OptiGUI DataCollector reduces the possibility of errors, resulting in more precise and reliable findings. In addition, OptiGUI DataCollector's easy-to-use interface and automation tools make photonics and optoelectronics research accessible to a broader spectrum of researchers, including those with limited technical expertise.&lt;br /&gt;
&lt;br /&gt;
==Conclusions==&lt;br /&gt;
To summarize, OptiGUI DataCollector is a flexible and powerful experiment control software application that offers significant benefits to photonics and optoelectronics R&amp;amp;D personnel. Its modular design simplifies the construction of new experiments and reduces the effort required to set one up. The software's automation capabilities accelerate data acquisition, processing, and analysis, saving researchers time and effort. The software reduces human error, improves reproducibility, and provides flexibility and customization options. OptiGUI DataCollector can accelerate research progress and improve the reliability of findings in R&amp;amp;D labs that focus on photonics, optics, and optoelectronics.&lt;br /&gt;
&lt;br /&gt;
==Code metadata==&lt;br /&gt;
&lt;br /&gt;
*Current code version:	v1.0&lt;br /&gt;
*Permanent link to code/repository used for this code version:	https://github.com/ElsevierSoftwareX/SOFTX-D-23-00293&lt;br /&gt;
*Code Ocean compute capsule: none&lt;br /&gt;
*Legal Code License: GNU (GPL)&lt;br /&gt;
*Code versioning system used: Git&lt;br /&gt;
*Software code languages, tools, and services used: Python 3.8, LabVIEW&lt;br /&gt;
*Compilation requirements, operating environments &amp;amp; dependencies: See User Manual in GitHub repository&lt;br /&gt;
*If available Link to developer documentation/manual: See User Manual in GitHub repository&lt;br /&gt;
*Support email for questions: juansoto319998 at correo dot itm dot edu dot co&lt;br /&gt;
&lt;br /&gt;
==Supplementary data==&lt;br /&gt;
&lt;br /&gt;
*[https://ars.els-cdn.com/content/image/1-s2.0-S2352711023002170-mmc1.mp4 Video S1]. This video (.mp4) contains additional examples of OptiGUI Data Collector.&lt;br /&gt;
*[https://ars.els-cdn.com/content/image/1-s2.0-S2352711023002170-mmc2.pdf Illustrative Examples] (.pdf)&lt;br /&gt;
&lt;br /&gt;
==Acknowledgements==&lt;br /&gt;
The authors acknowledge the support of Instituto Tecnologico Metropolitano, through project P20212, and the Universidad Nacional de Colombia, through Hermes project 47472.&lt;br /&gt;
&lt;br /&gt;
===Data availability===&lt;br /&gt;
Data will be made available on request.&lt;br /&gt;
&lt;br /&gt;
===Competing interests===&lt;br /&gt;
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. In the original, Figure 3 and 4 are mentioned out of order; they have been swapped for this version.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on laboratory informatics]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on software]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Main_Page/Featured_article_of_the_week/2024&amp;diff=64485</id>
		<title>Main Page/Featured article of the week/2024</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Main_Page/Featured_article_of_the_week/2024&amp;diff=64485"/>
		<updated>2024-06-17T15:03:07Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added last week's article of the week&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ombox&lt;br /&gt;
| type      = notice&lt;br /&gt;
| text      = If you're looking for other &amp;quot;Article of the Week&amp;quot; archives: [[Main Page/Featured article of the week/2014|2014]] - [[Main Page/Featured article of the week/2015|2015]] - [[Main Page/Featured article of the week/2016|2016]] - [[Main Page/Featured article of the week/2017|2017]] - [[Main Page/Featured article of the week/2018|2018]] - [[Main Page/Featured article of the week/2019|2019]] - [[Main Page/Featured article of the week/2020|2020]] - [[Main Page/Featured article of the week/2021|2021]] - [[Main Page/Featured article of the week/2022|2022]] - [[Main Page/Featured article of the week/2023|2023]] - 2024&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Featured article of the week archive - 2024==&lt;br /&gt;
Welcome to the LIMSwiki 2024 archive for the Featured Article of the Week.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--        HEADER        --&amp;gt;&lt;br /&gt;
{| id=&amp;quot;mp-upper&amp;quot; style=&amp;quot;width: 85%; margin:4px 0 0 0; background:none; border-spacing: 0px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
&amp;lt;!--        SUBHEADERS       --&amp;gt;&lt;br /&gt;
| class=&amp;quot;MainPageBG&amp;quot; style=&amp;quot;width:50%; border:1px solid #cedff2; background:#f5faff; vertical-align:top;&amp;quot;|&lt;br /&gt;
{| id=&amp;quot;mp-right&amp;quot; style=&amp;quot;width:100%; vertical-align:top; background:#f5faff;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Below this line begin pasting previous news --&amp;gt;&lt;br /&gt;
&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: June 10–16:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig2 Berezin PLoSCompBio23 19-12.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Ten simple rules for managing laboratory information|Ten simple rules for managing laboratory information]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Information]] is the cornerstone of [[research]], from experimental data/[[metadata]] and computational processes to complex inventories of reagents and equipment. These 10 simple rules discuss best practices for leveraging [[laboratory information management system]]s (LIMS) to transform this large information load into useful scientific findings. The development of [[mathematical model]]s that can predict the properties of biological systems is the holy grail of [[computational biology]]. Such models can be used to test biological hypotheses, guide the development of biomanufactured products, engineer new systems meeting user-defined specifications, and much more ... ('''[[Journal:Ten simple rules for managing laboratory information|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: June 03–09:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Signoroni NatComm23 14.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Hierarchical AI enables global interpretation of culture plates in the era of digital microbiology|Hierarchical AI enables global interpretation of culture plates in the era of digital microbiology]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Full [[laboratory automation]] is revolutionizing work habits in an increasing number of clinical [[microbiology]] facilities worldwide, generating huge streams of [[Imaging|digital images]] for interpretation. Contextually, [[deep learning]] (DL) architectures are leading to paradigm shifts in the way computers can assist with difficult visual interpretation tasks in several domains. At the crossroads of these epochal trends, we present a system able to tackle a core task in clinical microbiology, namely the global interpretation of diagnostic [[Bacteria|bacterial]] [[Cell culture|culture]] plates, including presumptive [[pathogen]] identification. This is achieved by decomposing the problem into a hierarchy of complex subtasks and addressing them with a multi-network architecture we call DeepColony ... ('''[[Journal:Hierarchical AI enables global interpretation of culture plates in the era of digital microbiology|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: May 27–June 02:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Čartolovni DigitalHealth2023 9.jpeg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Critical analysis of the impact of AI on the patient–physician relationship: A multi-stakeholder qualitative study|Critical analysis of the impact of AI on the patient–physician relationship: A multi-stakeholder qualitative study]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
This qualitative study aims to present the aspirations, expectations, and critical analysis of the potential for [[artificial intelligence]] (AI) to transform the patient–physician relationship, according to multi-stakeholder insight. This study was conducted from June to December 2021, using an anticipatory ethics approach and sociology of expectations as the theoretical frameworks. It focused mainly on three groups of stakeholders, namely physicians (''n'' = 12), patients (''n'' = 15), and healthcare managers (''n'' = 11), all of whom are directly related to the adoption of AI in medicine (''n'' = 38). In this study, interviews were conducted with 40% of the patients in the sample (15/38), as well as 31% of the physicians (12/38) and 29% of health managers in the sample (11/38) ... ('''[[Journal:Critical analysis of the impact of AI on the patient–physician relationship: A multi-stakeholder qualitative study|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: May 20–26:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Niszczota EconBusRev23 9-2.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Judgements of research co-created by generative AI: Experimental evidence|Judgements of research co-created by generative AI: Experimental evidence]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
The introduction of [[ChatGPT]] has fuelled a public debate on the appropriateness of using generative [[artificial intelligence]] (AI) ([[large language model]]s or LLMs) in work, including a debate on how they might be used (and abused) by researchers. In the current work, we test whether delegating parts of the research process to LLMs leads people to distrust researchers and devalues their scientific work. Participants (''N'' = 402) considered a researcher who delegates elements of the research process to a PhD student or LLM and rated three aspects of such delegation. Firstly, they rated whether it is morally appropriate to do so. Secondly, they judged whether—after deciding to delegate the research process—they would trust the scientist (who decided to delegate) to oversee future projects ... ('''[[Journal:Judgements of research co-created by generative AI: Experimental evidence|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: May 13–19:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Bispo-Silva Geosciences23 13-11.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Geochemical biodegraded oil classification using a machine learning approach|Geochemical biodegraded oil classification using a machine learning approach]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Chromatography|Chromatographic]] oil analysis is an important step for the identification of biodegraded petroleum via peak visualization and interpretation of phenomena that explain the oil geochemistry. However, analyses of chromatogram components by geochemists are comparative, visual, and consequently slow. This article aims to improve the chromatogram analysis process performed during geochemical interpretation by proposing the use of [[convolutional neural network]]s (CNN), which are deep learning techniques widely used by big tech companies. Two hundred and twenty-one (221) chromatographic oil images from different worldwide basins (Brazil, USA, Portugal, Angola, and Venezuela) were used. The [[open-source software]] Orange Data Mining was used to process images by CNN. The CNN algorithm extracts, pixel by pixel, recurring features from the images through convolutional operations ... ('''[[Journal:Geochemical biodegraded oil classification using a machine learning approach|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: May 06–12:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Mishra JofNepMedAss23 61-258.png|220px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Knowledge of internal quality control for laboratory tests among laboratory personnel working in a biochemistry department of a tertiary care center: A descriptive cross-sectional study|Knowledge of internal quality control for laboratory tests among laboratory personnel working in a biochemistry department of a tertiary care center: A descriptive cross-sectional study]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
The [[clinical laboratory]] holds a central position in patient care, and as such, ensuring accurate [[laboratory]] test results is a necessity. Internal [[quality control]] (QC) ensures day-to-day laboratory consistency. However, unless practiced, the success of laboratory [[quality management system]]s (QMSs) cannot be achieved. This depends on the efforts and commitment of laboratory personnel for its implementation. Hence, the aim of this study was to find out the knowledge of internal QC for laboratory tests among laboratory personnel working in the Department of Biochemistry, B.P. Koirala Institute of Health Sciences (BPKIHS), a tertiary care center ... ('''[[Journal:Knowledge of internal quality control for laboratory tests among laboratory personnel working in a biochemistry department of a tertiary care center: A descriptive cross-sectional study|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: April 29–May 05:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Karaattuthazhathu NatJLabMed23 12-2.png|260px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Sigma metrics as a valuable tool for effective analytical performance and quality control planning in the clinical laboratory: A retrospective study|Sigma metrics as a valuable tool for effective analytical performance and quality control planning in the clinical laboratory: A retrospective study]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
For the release of precise and accurate reports of [[Medical test|routine tests]], it's necessary to follow a proper [[quality management system]] (QMS) in the [[clinical laboratory]]. As one of the most popular QMS tools for process improvement, Six Sigma techniques and tools have been accepted widely in the [[laboratory]] testing process. Six Sigma gives an objective assessment of analytical methods and instrumentation, measuring the outcome of a process on a scale of 0 to 6. Poor outcomes are measured in terms of defects per million opportunities (DPMO). To do the performance assessment of each clinical laboratory [[analyte]] by Six Sigma analysis and to plan and chart out a better, customized [[quality control]] (QC) plan for each analyte, according to its own sigma value ... ('''[[Journal:Sigma metrics as a valuable tool for effective analytical performance and quality control planning in the clinical laboratory: A retrospective study|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: April 22–28:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Tomich Sustain23 15-8.png|260px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Why do we need food systems informatics? Introduction to this special collection on smart and connected regional food systems|Why do we need food systems informatics? Introduction to this special collection on smart and connected regional food systems]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Public interest in where food comes from and how it is produced, processed, and distributed has increased over the last few decades, with even greater focus emerging during the [[COVID-19]] [[pandemic]]. Mounting evidence and experience point to disturbing weaknesses in our food systems’ abilities to support human livelihoods and wellbeing, and alarming long-term trends regarding both the environmental footprint of food systems and mounting vulnerabilities to shocks and stressors. How can we tackle the “wicked problems” embedded in a food system? More specifically, how can convergent research programs be designed and resulting knowledge implemented to increase inclusion, sustainability, and resilience within these complex systems ... ('''[[Journal:Why do we need food systems informatics? Introduction to this special collection on smart and connected regional food systems|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: April 15–21:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Tab1 Williamson F1000Res2023 10.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Data management challenges for artificial intelligence in plant and agricultural research|Data management challenges for artificial intelligence in plant and agricultural research]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Artificial intelligence]] (AI) is increasingly used within plant science, yet it is far from being routinely and effectively implemented in this domain. Particularly relevant to the development of novel food and agricultural technologies is the development of validated, meaningful, and usable ways to integrate, compare, and [[Data visualization|visualize]] large, multi-dimensional datasets from different sources and scientific approaches. After a brief summary of the reasons for the interest in data science and AI within plant science, the paper identifies and discusses eight key challenges in [[Information management|data management]] that must be addressed to further unlock the potential of AI in crop and agronomic research, and particularly the application of [[machine learning]] (ML), which holds much promise for this domain ... ('''[[Journal:Data management challenges for artificial intelligence in plant and agricultural research|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: April 08–14:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Manisha HighConComp2023 3-3.jpg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:A blockchain-driven IoT-based food quality traceability system for dairy products using a deep learning model|A blockchain-driven IoT-based food quality traceability system for dairy products using a deep learning model]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Food [[traceability]] is a critical factor that can ensure food safety while enhancing the credibility of the manufactured product, thus achieving heightened user satisfaction and loyalty. The perishable food supply chain (PFSC) requires paramount care for ensuring [[Quality (business)|quality]] owing to the limited product life. The PFSC comprises multiple organizations with varied interests that are more likely to be hesitant to share traceability details with one another owing to a lack of trust, which can be overcome by using blockchain. In this research, an efficient scheme using a blockchain-enabled deep [[wikipedia:Residual neural network|residual network]] (BC-DRN) is developed to provide food traceability for dairy products. Here, food traceability is determined by using various modular tools ... ('''[[Journal:A blockchain-driven IoT-based food quality traceability system for dairy products using a deep learning model|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: April 01–07:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Patel JofClinDiagRes2023 17-9.jpg|140px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Effect of good clinical laboratory practices (GCLP) quality training on knowledge, attitude, and practice among laboratory professionals: Quasi-experimental study|Effect of good clinical laboratory practices (GCLP) quality training on knowledge, attitude, and practice among laboratory professionals: Quasi-experimental study]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Good clinical laboratory practices (GCLP) play a vital role in early and accurate diagnosis, providing high-quality data and timely [[Sample (material)|sample]] processing. Adhering to a robust [[quality management system]] (QMS) that complies with GCLP standards is crucial for [[laboratory]] personnel in a [[clinical laboratory]] to deliver outstanding healthcare services and reliable, reproducible reports. [The aim of this study is to] assess the knowledge, attitude, and practice (KAP) of laboratory professionals towards [[Quality (business)|quality]] in the laboratory through GCLP training... ('''[[Journal:Effect of good clinical laboratory practices (GCLP) quality training on knowledge, attitude, and practice among laboratory professionals: Quasi-experimental study|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: March 25–31:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Scroggie DigDisc2023 2.gif|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:GitHub as an open electronic laboratory notebook for real-time sharing of knowledge and collaboration|GitHub as an open electronic laboratory notebook for real-time sharing of knowledge and collaboration]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Electronic laboratory notebook]]s (ELNs) have expanded the utility of the paper [[laboratory notebook]] beyond that of a simple record keeping tool. Open ELNs offer additional benefits to the scientific community, including increased transparency, reproducibility, and [[Data integrity|integrity]]. A key element underpinning these benefits is facile and expedient knowledge sharing which aids communication and collaboration. In previous projects, we have used [[LabTrove]] and [[Vendor:LabArchives, LLC|LabArchives]] as open ELNs, in partnership with GitHub (an open-source web-based platform originally developed for collaborative coding) for communication and discussion. Here we present our personal experiences using GitHub as the central platform for many aspects of the scientific process ... ('''[[Journal:GitHub as an open electronic laboratory notebook for real-time sharing of knowledge and collaboration|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: March 18–24:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Nieminen GigaScience2023 12.jpeg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:SODAR: Managing multiomics study data and metadata|SODAR: Managing multiomics study data and metadata]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Scientists employing omics in [[Life sciences industry|life science]] studies face challenges such as the modeling of multiassay studies, recording of all relevant parameters, and managing many [[Sample (material)|samples]] with their [[metadata]]. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter. We introduce SODAR, the system for [[omics]] data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface (GUI) for managing multiassay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format ... ('''[[Journal:SODAR: Managing multiomics study data and metadata|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: March 11–17:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Chang HealthInfoRes2023 29-4.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Benefits of information technology in healthcare: Artificial intelligence, internet of things, and personal health records|Benefits of information technology in healthcare: Artificial intelligence, internet of things, and personal health records]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
Systematic evaluations of the benefits of [[health information technology]] (HIT) play an essential role in enhancing healthcare [[Quality (business)|quality]] by improving outcomes. However, there is limited empirical evidence regarding the benefits of IT adoption in healthcare settings. This study aimed to review the benefits of [[artificial intelligence]] (AI), the [[internet of things]] (IoT), and [[personal health record]]s (PHR), based on scientific evidence. The literature published in peer-reviewed journals between 2016 and 2022 was searched for systematic reviews and meta-analysis studies using the PubMed, Cochrane, and Embase databases. Manual searches were also performed using the reference lists of systematic reviews and eligible studies from major [[health informatics]] journals. The benefits of each HIT were assessed from multiple perspectives across four outcome domains ... ('''[[Journal:Benefits of information technology in healthcare: Artificial intelligence, internet of things, and personal health records|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: March 04–10:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Villegas-Pérez Foods23 12-22.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:A quality assurance discrimination tool for the evaluation of satellite laboratory practice excellence in the context of European regulatory meat inspection for Trichinella spp.|A quality assurance discrimination tool for the evaluation of satellite laboratory practice excellence in the context of European regulatory meat inspection for ''Trichinella spp.'']]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Trichinosis|Trichinellosis]] is a parasitic foodborne zoonotic disease transmitted by ingestion of raw or undercooked meat containing the first larval stage (L1) of the nematode. To ensure the [[Quality (business)|quality]] and safety of food intended for human consumption, meat inspection for detection of ''Trichinella'' spp. larvae is a mandatory procedure per European Union (E.U.) regulations. The implementation of [[quality assurance]] (QA) practices in [[Laboratory|laboratories]] that are responsible for ''Trichinella'' spp. detection is essential given that the detection of this parasite is still a pivotal threat to public health, and it is included in List A of Annex I, Directive 2003/99/EC, which determines the agents to be monitored on a mandatory basis ... ('''[[Journal:A quality assurance discrimination tool for the evaluation of satellite laboratory practice excellence in the context of European regulatory meat inspection for Trichinella spp.|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: February 26–March 03:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Pineda-Pampliega EFSAJournal2023 20-S2.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Developing a framework for open and FAIR data management practices for next generation risk- and benefit assessment of fish and seafood|Developing a framework for open and FAIR data management practices for next generation risk- and benefit assessment of fish and seafood]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Risk assessment|Risk and risk–benefit assessments]] of food are complex exercises, in which access to and use of several disconnected individual stand-alone [[database]]s is required to obtain hazard and exposure information. Data obtained from such databases ideally should be in line with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR principles]], i.e., the data must be findable, accessible, interoperable, and reusable. However, cases are often encountered in which one or more of these principles are not followed. In this project, we set out to assess if existing commonly used databases in risk assessment are in line with the FAIR principles ... ('''[[Journal:Developing a framework for open and FAIR data management practices for next generation risk- and benefit assessment of fish and seafood|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: February 19–25:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig2 Henke JMIRMedInfo2023 11.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM: Algorithm development and validation|An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM: Algorithm development and validation]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
In the Medical Informatics in Research and Care in University Medicine (MIRACUM) consortium, an IT-based clinical trial recruitment support system was developed based on the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). Currently, OMOP CDM is populated with German Fast Healthcare Interoperability Resources (FHIR) data using an extract-transform-load (ETL) process, which was designed as a bulk load. However, the computational effort that comes with an everyday full load is not efficient for daily recruitment ... ('''[[Journal:An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM: Algorithm development and validation|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: February 12–18:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig3 Johnson JofCannRes23 5.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Potency and safety analysis of hemp-derived delta-9 products: The hemp vs. cannabis demarcation problem|Potency and safety analysis of hemp-derived delta-9 products: The hemp vs. cannabis demarcation problem]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Hemp]]-derived [[Tetrahydrocannabinol|delta-9-tetrahydrocannabinol]] (Δ&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;-THC) products are freely available for sale across much of the USA, but the federal legislation allowing their sale places only minimal requirements on companies. Products must contain no more than 0.3% Δ&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;-THC by dry weight, but no limit is placed on overall dosage, and there is no requirement that products derived from hemp-based Δ&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;-THC be tested. However, some states—such as Colorado—specifically prohibit products created by “chemically modifying” a natural hemp component. Fifty-three hemp-derived Δ&amp;lt;sup&amp;gt;9&amp;lt;/sup&amp;gt;-THC products were ordered and submitted to InfiniteCAL [[laboratory]] for analysis ... ('''[[Journal:Potency and safety analysis of hemp-derived delta-9 products: The hemp vs. cannabis demarcation problem|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: February 05–11:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Sbailò npjCompMat22 8.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding|The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
We present the Novel Materials Discovery (NOMAD) [[Artificial intelligence|Artificial Intelligence]] (AI) Toolkit, a web-browser-based infrastructure for the interactive AI-based analysis of [[materials science]] data under FAIR (findable, accessible, interoperable, and reusable) data principles. The AI Toolkit readily operates on FAIR data stored in the central server of the NOMAD Archive, the largest database of materials science data worldwide, as well as locally stored, user-owned data. The NOMAD Oasis, a local, stand-alone server, can also be used to run the AI Toolkit. By using [[Jupyter Notebook]]s that run in a web-browser, the NOMAD data can be queried and accessed; [[data mining]], [[machine learning]] (ML), and other AI techniques can then be applied to analyze them ... ('''[[Journal:The NOMAD Artificial Intelligence Toolkit: Turning materials science data into knowledge and understanding|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: January 29–February 04:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Naphade JofClinDiagRes2023 17-2.jpg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Quality control in the clinical biochemistry laboratory: A glance|Quality control in the clinical biochemistry laboratory: A glance]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
[[Quality control]] (QC) is a process designed to ensure reliable test results. It is part of overall [[laboratory]] quality management in terms of accuracy, reliability, and timeliness of reported test results. Two types of QC are exercised in [[Clinical chemistry|clinical biochemistry]]: internal QC (IQC) and external [[quality assurance]] (QA). IQC represents the quality methods performed every day by laboratory personnel with the laboratory’s materials and equipment. It primarily checks the precision (i.e., repeatability or reproducibility) of the test method. External quality assurance service (EQAS) is performed periodically (i.e., every month, every two months, twice a year) by the laboratory personnel, who primarily check the accuracy of the laboratory’s analytical methods ... ('''[[Journal:Quality control in the clinical biochemistry laboratory: A glance|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: January 22–28:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Ghiringhelli SciData23 10.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Shared metadata for data-centric materials science|Shared metadata for data-centric materials science]]&amp;quot;'''&lt;br /&gt;
&lt;br /&gt;
The expansive production of data in [[materials science]], as well as their widespread [[Data sharing|sharing]] and repurposing, requires educated support and stewardship. In order to ensure that this need helps rather than hinders scientific work, the implementation of the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR data principles]] (that ask for data and information to be findable, accessible, interoperable, and reusable) must not be too narrow. At the same time, the wider materials science community ought to agree on the strategies to tackle the challenges that are specific to its data, both from computations and experiments. In this paper, we present the result of the discussions held at the workshop on “Shared Metadata and Data Formats for Big-Data Driven Materials Science.” ... ('''[[Journal:Shared metadata for data-centric materials science|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: January 15–21:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig2 Jadhav IntJofMolSci23 24-9.png|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:A metabolomics and big data approach to cannabis authenticity (authentomics)|A metabolomics and big data approach to cannabis authenticity (authentomics)]]&amp;quot;''' &lt;br /&gt;
&lt;br /&gt;
With the increasing accessibility of [[cannabis]] ([[Cannabis sativa|''Cannabis sativa'' L.]], also known as marijuana and [[hemp]]), its products are being developed as [[Cannabis concentrate|extracts]] for both recreational and [[Cannabis (drug)|therapeutic]] use. This has led to increased scrutiny by [[Regulatory compliance|regulatory bodies]], who aim to understand and regulate the complex chemistry of these products to ensure their safety and efficacy. Regulators use targeted analyses to track the concentration of key bioactive [[Metabolomics|metabolites]] and potentially harmful [[Contamination|contaminants]], such as [[heavy metals]] and other impurities. However, the complexity of cannabis' metabolic pathways requires a more comprehensive approach. A non-targeted metabolomic analysis of cannabis products is necessary to generate data that can be used to determine their authenticity and efficacy ... ('''[[Journal:A metabolomics and big data approach to cannabis authenticity (authentomics)|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: January 08–14:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:GA Ishii SciTechAdvMatMeth2023 3-1.jpg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Integration of X-ray absorption fine structure databases for data-driven materials science|Integration of X-ray absorption fine structure databases for data-driven materials science]]&amp;quot;''' &lt;br /&gt;
&lt;br /&gt;
With the aim of introducing data-driven science and establishing an infrastructure for making [[X-ray absorption fine structure]] (XAFS) [[Spectroscopy|spectra]] findable and reusable, we have integrated XAFS databases in Japan. This integrated database (MDR XAFS DB) enables cross searching of spectra from more than 2,000 [[Sample (material)|samples]] and more than 700 unique materials with machine-readable [[metadata]]. The introduction of a materials dictionary with approximately 6,000 synonyms has improved the search performance, and links with large external databases have been established. In order to compare spectra in the database, the energy calibration policies of each institution were compiled, and the energy calibration methods across institutions were shown ... ('''[[Journal:Integration of X-ray absorption fine structure databases for data-driven materials science|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;br /&amp;gt;&amp;lt;h2 style=&amp;quot;font-size:105%; font-weight:bold; text-align:left; color:#000; padding:0.2em 0.4em; width:50%;&amp;quot;&amp;gt;Featured article of the week: January 01–07:&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;float: left; margin: 0.5em 0.9em 0.4em 0em;&amp;quot;&amp;gt;[[File:Fig1 Heavey ForSciIntSyn2023 7.jpg|240px]]&amp;lt;/div&amp;gt;&lt;br /&gt;
'''&amp;quot;[[Journal:Management and disclosure of quality issues in forensic science: A survey of current practice in Australia and New Zealand|Management and disclosure of quality issues in forensic science: A survey of current practice in Australia and New Zealand]]&amp;quot;''' &lt;br /&gt;
&lt;br /&gt;
The investigation of [[Quality (business)|quality]] issues detected within the [[Forensic science|forensic]] process is a critical feature in robust [[quality management system]]s (QMSs) to provide assurance of the validity of reported [[laboratory]] results and inform strategies for [[Continual improvement process|continuous improvement]] and innovation. A survey was conducted to gain insight into the current state of practice in the management and handling of quality issues amongst the government service provider agencies of Australia and New Zealand. The results demonstrate the value of standardized quality system structures for the recording and management of quality issues, but also areas where inconsistent reporting increases the risk of overlooking important data to inform continuous improvement ... ('''[[Journal:Management and disclosure of quality issues in forensic science: A survey of current practice in Australia and New Zealand|Full article...]]''')&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;div align=&amp;quot;center&amp;quot;&amp;gt;[[Main Page|— Return to the front page —]]&amp;lt;/div&amp;gt;&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64484</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64484"/>
		<updated>2024-06-16T20:16:55Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest in the hands of domain experts or their institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way for modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support not only standard relational operators such as joins, unions, and projections but also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22] &lt;br /&gt;
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables their efficient referencing within triples, facilitating making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs. (Fig. 1) This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, for example by a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
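As a minimal illustration of these conventions, a triple can be sketched as a plain ''Subject''/''Predicate''/''Object'' tuple (all resource names below are hypothetical stand-ins, not real ontology terms):&lt;br /&gt;

```python
# A triple as a (Subject, Predicate, Object) tuple; names are illustrative.
triples = [
    ("apple_X", "rdf:type", "apple"),        # Object is a resource
    ("apple_X", "rdfs:label", '"apple X"'),  # Object is a literal
]

# Subjects and Predicates are always resources; Objects may be literals.
resources = {s for (s, _, _) in triples} | {p for (_, p, _) in triples}
print(sorted(resources))  # ['apple_X', 'rdf:type', 'rdfs:label']
```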
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, it is important not only that the terms employed in data and metadata statements possess identifiers from controlled vocabularies, such as ontologies, ensuring terminological interoperability, but also that the semantic graph patterns underlying each statement be identified and consistently applied. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
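The schematic interoperability conflict illustrated by Figs. 2 and 3 can be mimicked with a toy example: the same weight measurement encoded under two different graph patterns shares no triples at all under a naive triple-level comparison (all names below are simplified, hypothetical stand-ins for the OBI- and OBOE-style patterns):&lt;br /&gt;

```python
# The same statement, "apple X weighs 204.56 g", under two different graph
# patterns; names are simplified, hypothetical stand-ins.
obi_style = {
    ("apple_X", "has_quality", "weight_X"),
    ("weight_X", "is_quality_measured_as", "datum_X"),
    ("datum_X", "has_value", "204.56"),
    ("datum_X", "has_unit", "gram"),
}
oboe_style = {
    ("observation_X", "of_entity", "apple_X"),
    ("observation_X", "has_measurement", "measurement_X"),
    ("measurement_X", "of_characteristic", "weight"),
    ("measurement_X", "has_value", "204.56"),
    ("measurement_X", "uses_standard", "gram"),
}

# Naive triple-level comparison of the two semantically equivalent graphs:
print(len(obi_style & oboe_style))  # 0: no triples in common
```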
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
&lt;br /&gt;
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
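A minimal sketch of these two criteria, with hypothetical statement and pattern identifiers: every statement subgraph is tagged with the identifier of the graph pattern it instantiates, and statements sharing a pattern identifier are grouped as potentially interoperable:&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical statement subgraphs, each tagged with the identifier of the
# graph pattern (e.g., a shape specification) it instantiates.
statements = [
    {"id": "stmt-1", "pattern": "weight-measurement-v1"},
    {"id": "stmt-2", "pattern": "parthood-v1"},
    {"id": "stmt-3", "pattern": "weight-measurement-v1"},
]

# Criterion 1: every statement is classified via its pattern identifier.
# Criterion 2: each statement subgraph is identifiable via its own "id".
by_pattern = defaultdict(list)
for stmt in statements:
    by_pattern[stmt["pattern"]].append(stmt["id"])

# Statements sharing a pattern identifier are potentially interoperable.
print(by_pattern["weight-measurement-v1"])  # ['stmt-1', 'stmt-3']
```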
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL or labeled property graphs like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, necessitates the utilization of a query language. SPARQL [37] is the standard query language for RDF/OWL, while Cypher [38] is employed for Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.&lt;br /&gt;
&lt;br /&gt;
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
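One conceivable shape of such a solution, sketched with an illustrative (not normative) SPARQL template keyed by a hypothetical graph-pattern identifier:&lt;br /&gt;

```python
# Hypothetical registry pairing graph-pattern identifiers with reusable
# query templates, so users select a pattern rather than hand-write SPARQL.
QUERY_TEMPLATES = {
    "weight-measurement-v1": (
        "SELECT ?value ?unit WHERE { "
        "?subject :hasQuality ?q . "
        "?q :isQualityMeasuredAs ?datum . "
        "?datum :hasValue ?value ; :hasUnit ?unit . }"
    ),
}

def query_for(pattern_id: str) -> str:
    """Return the reusable query template registered for a graph pattern."""
    return QUERY_TEMPLATES[pattern_id]

print(query_for("weight-measurement-v1").startswith("SELECT"))  # True
```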
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, Named Graphs offer an alternative to RDF reification and RDF-star approaches. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
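The difference in overhead between the two approaches can be sketched with plain tuples (all resource names are illustrative):&lt;br /&gt;

```python
# One data triple about which a further statement ("measured by Anna")
# should be made; all names are illustrative.
triple = ("apple_X", "has_weight", "204.56")

# RDF reification: a statement resource plus three triples that restate
# the Subject, Predicate, and Object, before the statement about the
# statement can even be attached.
reified = [
    ("stmt_1", "rdf:type", "rdf:Statement"),
    ("stmt_1", "rdf:subject", triple[0]),
    ("stmt_1", "rdf:predicate", triple[1]),
    ("stmt_1", "rdf:object", triple[2]),
    ("stmt_1", "measured_by", "Anna"),
]

# Named Graph: the graph URI becomes a fourth element (a quad), and the
# statement about the statement references that URI directly.
quads = [
    ("apple_X", "has_weight", "204.56", "graph_1"),
    ("graph_1", "measured_by", "Anna", "metadata_graph"),
]

print(len(reified), len(quads))  # 5 2
```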
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* as illustrated in Fig. 6 (later). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032) from that subgraph. That triple, without the context of the other triples in the subgraph, is not semantically meaningful to a domain expert without a background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
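This layering can be sketched in a few lines, assuming hypothetical names: a single UPRI identifies both the unit's resource and its data graph, and statements about that data graph accumulate in the unit's semantic-units graph:&lt;br /&gt;

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one UPRI names both the semantic unit's resource
# and its data graph; triples about the unit live in a separate layer.
@dataclass
class SemanticUnit:
    upri: str                                       # shared identifier
    data_graph: set = field(default_factory=set)    # data graph layer
    semantic_units_graph: set = field(default_factory=set)

    def annotate(self, predicate: str, obj: str) -> None:
        """Make a statement about the unit's data graph as a whole."""
        self.semantic_units_graph.add((self.upri, predicate, obj))

unit = SemanticUnit("semunit:weight-stmt-1")
unit.data_graph.add(("apple_X", "has_weight", "204.56"))
unit.annotate("rdf:type", "semunit:WeightStatementUnit")
unit.annotate("semunit:hasSemanticUnitSubject", "apple_X")

print(len(unit.data_graph), len(unit.semantic_units_graph))  # 1 2
```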
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
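The partition property can be checked mechanically on a toy data graph (triples and unit names are illustrative): the statement units' triple sets are pairwise disjoint, and their union reconstructs the full data graph layer:&lt;br /&gt;

```python
# Toy data graph partitioned into statement units (names illustrative).
statement_units = {
    "weight-stmt-1": {
        ("apple_X", "has_quality", "weight_X"),
        ("weight_X", "has_value", "204.56"),
        ("weight_X", "has_unit", "gram"),
    },
    "parthood-stmt-1": {
        ("hand_X", "has_part", "thumb_X"),
    },
}

# Merging all statement units yields the complete data graph layer.
data_graph = set().union(*statement_units.values())

# Partition check: sizes add up only if no triple is in two units.
assert sum(len(t) for t in statement_units.values()) == len(data_graph)
print(len(data_graph))  # 4
```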
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
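The contrast between binary and ''n''-ary propositions can be sketched with illustrative triples: a binary has-part statement fits into one triple, whereas a ternary located-in-at-time statement is routed through an instance resource and needs several:&lt;br /&gt;

```python
# Binary relation: one triple suffices ("this hand has part this thumb").
binary = [("right_hand_X", "has_part", "right_thumb_X")]

# Ternary relation "b located_in c at t": the predicate is represented by
# an instance resource, requiring several triples (names illustrative).
ternary = [
    ("location_X", "rdf:type", "located_in"),
    ("location_X", "has_locatum", "b"),
    ("location_X", "has_location", "c"),
    ("location_X", "at_time", "t"),
]

print(len(binary), len(ternary))  # 1 4
```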
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
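In the spirit of a SHACL or LinkML shape, a drastically simplified check might verify that a statement unit's data graph provides every predicate its statement unit class prescribes (all predicate and class names below are hypothetical):&lt;br /&gt;

```python
# Hypothetical, drastically simplified shape check: a statement unit class
# prescribes the predicates its data graph must contain.
REQUIRED_PREDICATES = {
    "weight-statement-unit": {"has_quality", "has_value", "has_unit"},
}

def conforms(unit_class: str, triples: set) -> bool:
    """True if the triples provide every predicate the class requires."""
    present = {p for (_, p, _) in triples}
    return REQUIRED_PREDICATES[unit_class] <= present

triples = {
    ("apple_X", "has_quality", "weight_X"),
    ("weight_X", "has_value", "204.56"),
    ("weight_X", "has_unit", "gram"),
}
print(conforms("weight-statement-unit", triples))  # True
```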
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analog to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples of two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers for collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property ''&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;'' (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as ‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):&lt;br /&gt;
&lt;br /&gt;
#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.&lt;br /&gt;
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the ‘&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’ from Figure 6, serving as the reference statement unit for this ‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’, and five instances of ''&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;''. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (''&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;''). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (''&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;''). This pattern template is employed for graphical displays of information from the reference statement unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.&lt;br /&gt;
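As a minimal illustration of such a display pattern (the template string and the labels below are hypothetical), a dynamic label is produced by filling the template of the reference statement unit class with the labels supplied by the identification units:&lt;br /&gt;

```python
# Sketch of a dynamic label (hypothetical template and labels): the
# display-pattern template of the statement unit class is filled with the
# human-readable labels supplied by the identification units.

template = "{subject} has a weight of {value} {unit}"

labels = {"subject": "hand X", "value": "0.5", "unit": "kilogram"}

print(template.format(**labels))  # hand X has a weight of 0.5 kilogram
```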
&lt;br /&gt;
===Item unit===&lt;br /&gt;
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property ''&amp;lt;u&amp;gt;SEMUNIT:hasSemanticUnitSubject&amp;lt;/u&amp;gt;''. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item'' concept in the Wikibase data model&amp;lt;ref name=&amp;quot;MWWikibase24&amp;quot;&amp;gt;{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}&amp;lt;/ref&amp;gt;, but adapt the concept to statement units rather than triples.&lt;br /&gt;
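The graph-per-resource idea behind item units can be sketched as a grouping of statement units by their subject (all identifiers are hypothetical):&lt;br /&gt;

```python
from collections import defaultdict

# Sketch of the graph-per-resource pattern (hypothetical identifiers):
# statement units that share a subject are collected into one item unit.

statement_units = {
    "su1": [(":appleX", "rdf:type", ":Apple")],
    "su2": [(":appleX", ":hasWeight", ":weightX")],
    "su3": [(":weightX", ":hasUnit", ":kilogram")],
}

def build_item_units(units):
    item_units = defaultdict(list)
    for unit_id, triples in units.items():
        subject = triples[0][0]  # each statement unit has one subject
        item_units[subject].append(unit_id)
    return dict(item_units)

print(build_item_units(statement_units))
# {':appleX': ['su1', 'su2'], ':weightX': ['su3']}
```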
&lt;br /&gt;
===Item group unit===&lt;br /&gt;
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.&lt;br /&gt;
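That connectedness claim can be checked mechanically; the following sketch (hypothetical identifiers) merges the triples of two item units and verifies connectivity with a breadth-first search over the undirected structure:&lt;br /&gt;

```python
from collections import deque

# Sketch (hypothetical identifiers): merging the subgraphs of the item
# units of an item group unit yields one connected graph, which a
# breadth-first search over the undirected triple structure can verify.

item_unit_triples = [
    (":appleX", ":hasWeight", ":weightX"),  # item unit of :appleX...
    (":weightX", ":hasUnit", ":kilogram"),  # ...links to that of :weightX
]

def is_connected(triples):
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    adjacency = {n: set() for n in nodes}
    for s, _p, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)
    seen, queue = set(), deque([next(iter(nodes))])
    while queue:
        n = queue.popleft()
        if n not in seen:
            seen.add(n)
            queue.extend(adjacency[n] - seen)
    return seen == nodes

print(is_connected(item_unit_triples))  # True
```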
&lt;br /&gt;
===Granularity tree unit===&lt;br /&gt;
We can further identify types of statement units that depend on partial order relations (i.e., relations that are transitive, reflexive, and antisymmetric), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''&amp;lt;u&amp;gt;before&amp;lt;/u&amp;gt;'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]&lt;br /&gt;
&lt;br /&gt;
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.&lt;br /&gt;
&lt;br /&gt;
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the subject of those statement units that share their objects with the subjects of other statement units in the same granularity tree unit, but do not share their subject with the objects of any of them; in other words, it is the root of the tree.&lt;br /&gt;
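Put operationally, the subject of a granularity tree unit is the one subject that never appears as an object within the tree; a minimal sketch with hypothetical identifiers:&lt;br /&gt;

```python
# Sketch (hypothetical identifiers): in a granularity tree unit built
# from parthood statement units, the root subject is the one that never
# appears as an object of another statement unit in the same tree.

parthood_units = [
    (":organism", ":hasPart", ":head"),
    (":organism", ":hasPart", ":trunk"),
    (":head", ":hasPart", ":eye"),
]

def tree_subject(units):
    subjects = {s for (s, _p, _o) in units}
    objects = {o for (_s, _p, o) in units}
    roots = subjects - objects
    assert len(roots) == 1, "a granularity tree has exactly one root"
    return roots.pop()

print(tree_subject(parthood_units))  # :organism
```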
&lt;br /&gt;
===Granular item group unit===&lt;br /&gt;
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.&lt;br /&gt;
&lt;br /&gt;
===Context unit===&lt;br /&gt;
The ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement ‘&amp;lt;u&amp;gt;description item unit&amp;lt;/u&amp;gt;’ ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' ‘&amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt;’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).&lt;br /&gt;
&lt;br /&gt;
===Dataset unit===&lt;br /&gt;
A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution in a collaborative project, to document the state of a particular object at a given time, or to store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.&lt;br /&gt;
&lt;br /&gt;
===List unit===&lt;br /&gt;
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. Conversely, a set unit is an unordered list unit where each resource is listed only once, adhering to a uniqueness restriction.&lt;br /&gt;
&lt;br /&gt;
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a ''&amp;lt;u&amp;gt;SEMUNIT:child&amp;lt;/u&amp;gt;'' relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''&amp;lt;u&amp;gt;index&amp;lt;/u&amp;gt;'' (RDF:index).&lt;br /&gt;
&lt;br /&gt;
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.&lt;br /&gt;
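A minimal sketch of an ordered list unit (the property names and UPRIs below are hypothetical): membership statements link the list unit’s UPRI to each member, and sorting by the index literal recovers the intended order:&lt;br /&gt;

```python
# Sketch of an ordered list unit (hypothetical identifiers): each
# membership statement links the list unit's UPRI to a member, and an
# index literal fixes the order, regardless of storage order.

list_unit_upri = ":authorList1"
membership_units = [
    (list_unit_upri, ":child", ":authorB", 2),
    (list_unit_upri, ":child", ":authorA", 1),
    (list_unit_upri, ":child", ":authorC", 3),
]

def ordered_members(units):
    return [member for (_lst, _p, member, idx)
            in sorted(units, key=lambda u: u[3])]

print(ordered_members(membership_units))
# [':authorA', ':authorB', ':authorC']
```

A set unit would additionally enforce the uniqueness restriction, e.g., by rejecting a member that already occurs in the list.&lt;br /&gt;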
&lt;br /&gt;
==Discussion==&lt;br /&gt;
===Benefits of organizing a knowledge graph into semantic units===&lt;br /&gt;
====Semantic units enhance data management flexibility through modularity====&lt;br /&gt;
The organization of a knowledge graph into distinct subgraphs, each associated with a particular semantic unit, introduces modularity in a graph. Each semantic unit, represented in the graph by a dedicated resource classified as an instance of a specific semantic unit class, serves as a structured module that encapsulates complexity. This modular approach allows for the encapsulation of subgraphs, and may add flexibility in data management as larger parts of a graph can be manipulated jointly.&lt;br /&gt;
&lt;br /&gt;
====Semantic units operate at a higher level of abstraction than individual triples====&lt;br /&gt;
Semantically, they encapsulate the contents of their data graphs, representing statements or sets of semantically and ontologically related statements. The specification of relations between semantic units further extends the flexibility of data management. A given semantic unit from a finer level of representational granularity can be associated with multiple units from a coarser level. Consequently, a statement unit may be linked to more than one compound unit, all while maintaining the centrality of the statement unit itself and its triples in a single location within the graph.&lt;br /&gt;
&lt;br /&gt;
The modular nature introduced by semantic units may streamline partition-based querying of knowledge graphs. While other approaches for graph partitioning have shown success [59], employing semantic units for partitioning and establishing modularity in the graph is an avenue for future research.&lt;br /&gt;
&lt;br /&gt;
===Semantic units as a framework for knowledge graph alignment===&lt;br /&gt;
The instantiation of semantic units belonging to the same class inherently implies a semantic similarity across instances. This characteristic lays the groundwork for a systematic approach to aligning and comparing knowledge graphs that share a common set of semantic unit classes. The alignment process could operate in a stepwise manner across various levels of representational granularity. In the initial step, alignment focuses on item group units, leveraging their types of associated item units and their alignment for comparison. The latter alignment hinges on the types of subjects and the types of associated statement units, allowing for further alignment based on class. Ultimately, individual triples within the aligned statement units undergo comparison, marking a comprehensive strategy to enhance existing methods for knowledge graph alignment, subgraph-matching, graph comparison, and graph similarity measures.&lt;br /&gt;
&lt;br /&gt;
===Managing restricted access to sensitive data===&lt;br /&gt;
The classification of statement units into corresponding ontology classes may serve as a framework for identifying subgraphs within a knowledge graph housing sensitive data that warrants restricted access. By identifying statement units containing sensitive information by class, access restrictions can be dynamically enforced based on specific criteria.&lt;br /&gt;
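A simple sketch of such class-based filtering (the class names and the clinician criterion are hypothetical):&lt;br /&gt;

```python
# Sketch (hypothetical class names): statement units tagged with their
# ontology class can be filtered before a subgraph is handed to a user,
# enforcing access restrictions dynamically.

SENSITIVE_CLASSES = {":PatientDiagnosisStatementUnit"}

statement_units = [
    {"upri": ":su1", "class": ":WeightStatementUnit"},
    {"upri": ":su2", "class": ":PatientDiagnosisStatementUnit"},
]

def visible_units(units, user_is_clinician):
    """Hide statement units of sensitive classes from unprivileged users."""
    return [u for u in units
            if user_is_clinician or u["class"] not in SENSITIVE_CLASSES]

print([u["upri"] for u in visible_units(statement_units, False)])
# [':su1']
```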
&lt;br /&gt;
===Semantic units: A framework for nested and overlapping knowledge graph modules===&lt;br /&gt;
====Semantic units identify five levels of representational granularity====&lt;br /&gt;
Semantic units introduce a structured framework encompassing five levels of representational granularity within a knowledge graph: triples, statement units, item units, item group units, and the knowledge graph as a whole (refer to Fig. 10). While triples represent the lowest level of abstraction, semantic units provide coarser levels, organizing the semantic-units graph layer (i.e., the discursive layer of a knowledge graph) and, indirectly, the knowledge graph’s data graph layer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig10 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The hierarchical organization of triples into statement units (→ smallest units of propositions that are semantically meaningful for a human reader), further into item units (→ comprising all the information from the knowledge graph about a particular entity), and eventually into item group units (→ collections of semantically interrelated entities) could enhance human readability and usability. This structural hierarchy supports users in seamlessly navigating across the graph, zooming in and out of different levels of representational granularity.&lt;br /&gt;
&lt;br /&gt;
====Semantic units identify granularity trees====&lt;br /&gt;
Granularity trees offer a perspective that is orthogonal to representational granularity, structuring the data graph layer and thus the ontological layer of a knowledge graph into distinct granularity perspectives. Consider the example of a multicellular organism’s description, including a has-part statement unit stating that the organism has a head as its part. This unit is associated with the item unit of the organism itself, which is linked to additional item units about the organism’s other parts, constituting an item group unit. Moreover, since has-part is a partial order relation [55], the has-part statement unit is associated with a parthood granularity tree unit and its corresponding granular item group unit. Consequently, the statement unit is associated with at least four different compound units that can be communicated to the user alongside the statement itself, showcasing the versatility enabled by semantic units in exploring contextualized subgraphs. [54]&lt;br /&gt;
&lt;br /&gt;
===Semantic units identify context-dependent subgraphs===&lt;br /&gt;
Semantic units empower the organization of item group units into context units, each defining a specific frame of reference. Intersections between context units are discerned through is-about statements (see also Fig. 12), facilitating traversal across diverse frames of reference. Context units contribute to structuring the data graph layer and thus the ontological layer of a knowledge graph into different frames of reference.&lt;br /&gt;
&lt;br /&gt;
====Statements about statements and documenting ontological and discursive information in knowledge graphs using semantic units====&lt;br /&gt;
The introduction of semantic units provides a framework for making statements about statements in a knowledge graph. Each semantic unit, equipped with its unique UPRI and represented in the semantic-units graph layer, facilitates assertions about statement units. This structured approach offers the potential for cross-database and cross-knowledge-graph statements when semantic units are implemented as nanopublications or FAIR Digital Objects, addressing the challenge of making statements about statements in knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
Moreover, if a knowledge graph is to cover contextual assertions such as “Author A asserts that the melting point of lead is at 327.5 °C” or “The assertion about the melting point of lead being at 327.5 °C is a result of experiment X,” it becomes challenging to model this without having a formalism for representing such discursive contextual information and its relationship to empirical data (see also Ingvar Johansson’s distinction between use and mention of linguistic entities [60]). Statement units with their data graphs contribute ontological information, nested within compound units of coarser representational granularity. In the semantic-units graph, propositions are represented as nodes, forming a significant portion of the discursive layer. Additionally, context units allow the explicit documentation of different frames of reference within both the ontological and discursive layers. The ability of statement units to establish relations between resources or even between other statement units (e.g., ‘''author_A -asserts-&amp;gt; statement_unit_Y''’; ‘''statement_unit_X -hasMetadata-&amp;gt; statement_unit_Z''’) facilitates the documentation of connections between the empirical and discursive layers. For instance, an item group unit focusing on the contents of a scholarly publication can encapsulate information about the associated research activity, its inputs, outputs, research methods, and objectives (see Fig. 11).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.'' [61]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The proposed model may find application within a knowledge graph centered around scholarly publications. For example, the representation in Fig. 12 combines the discursive and the ontological layers and represents the connections between different frames of reference.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig12 Vogt JofBiomedSem24 15.png|1300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of &amp;lt;u&amp;gt;journal article&amp;lt;/u&amp;gt; (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *&amp;lt;u&amp;gt;SEMUNIT:''hasLinkedSemanticUnit''&amp;lt;/u&amp;gt;* property. The interconnected hierarchy extends to an &amp;lt;u&amp;gt;investigation&amp;lt;/u&amp;gt; (OBI:0000066) instance, resulting in a &amp;lt;u&amp;gt;data set&amp;lt;/u&amp;gt; (IAO:0000100) instance with a &amp;lt;u&amp;gt;description&amp;lt;/u&amp;gt; (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of &amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt; (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Implementation===&lt;br /&gt;
====Implementing semantic units in RDF/OWL-based knowledge graphs using nanopublications===&lt;br /&gt;
To initiate the structuring of a knowledge graph into semantic units, first, a layer of abstraction beyond the triple level must be created. This is accomplished by partitioning the knowledge graph into a set of statement units, where each triple belongs exclusively to one data graph of a statement unit. In RDF/OWL, statement units can be conceptualized like nanopublications.&lt;br /&gt;
&lt;br /&gt;
Nanopublications are RDF graphs that serve as the smallest published information units extracted from literature and enriched with provenance and attribution information. [62,63,64,65] Leveraging Named Graphs and Semantic Web technologies, each nanopublication models a particular assertion, such as a scientific claim, in a machine-readable format and semantics and is accessible and citable through a unique identifier. Each nanopublication is organized into four Named Graphs:&lt;br /&gt;
&lt;br /&gt;
#the head Named Graph, connecting the other three Named Graphs to the nanopublication’s unique identifier;&lt;br /&gt;
#the assertion Named Graph, containing the assertion modelled as a graph;&lt;br /&gt;
#the provenance Named Graph, containing metadata about the assertion; and&lt;br /&gt;
#the publicationInfo Named Graph, containing metadata about the nanopublication itself.&lt;br /&gt;
&lt;br /&gt;
The assertion Named Graph would contain the data graph of a statement unit, whereas the head Named Graph would contain its semantic-units graph. Triples in the provenance Named Graph can potentially link to other semantic units and thus other nanopublications that contain detailed metadata descriptions (e.g., a metadata graph as shown in Fig. 4).&lt;br /&gt;
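Assembled as plain data (all URIs below are hypothetical), a statement unit packaged according to this nanopublication schema would look roughly like this:&lt;br /&gt;

```python
# Sketch (hypothetical URIs): a statement unit packaged in the four
# Named Graphs of a nanopublication. The assertion graph holds the
# unit's data graph; the head graph holds its semantic-units graph.

nanopub = {
    "head": [
        (":np1", ":hasAssertion", ":np1-assertion"),
        (":np1", ":hasProvenance", ":np1-provenance"),
        (":np1", ":hasPublicationInfo", ":np1-pubinfo"),
    ],
    "assertion": [
        (":handX", ":hasQuality", ":weightX"),  # the unit's data graph
    ],
    "provenance": [
        (":np1-assertion", ":wasDerivedFrom", ":experimentX"),
    ],
    "publicationInfo": [
        (":np1", ":createdBy", ":authorA"),
    ],
}

print(sorted(nanopub))  # the four Named Graphs
```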
&lt;br /&gt;
A compound unit, being a collection of two or more semantic units, can be organized in an RDF/OWL-based knowledge graph by linking the compound unit’s UPRI to the UPRIs of its associated semantic units. Following the nanopublication schema, this can be implemented by employing the compound unit’s semantic-units graph as the head Named Graph of a corresponding nanopublication, leaving the nanopublication’s assertion Named Graph empty. The head Named Graph thus specifies all statement and compound units associated with this compound unit.&lt;br /&gt;
&lt;br /&gt;
====Implementing semantic units in Neo4j-based knowledge graphs using UPRIs and corresponding property-value pairs====&lt;br /&gt;
In Neo4j, a labeled property graph, the assignment of UPRIs to all nodes and relations through a ‘''UPRI:upri''’ property-value pair is an essential prerequisite for implementing semantic units. To identify all triples affiliated with the same statement unit, a ‘''statement_unit_UPRI:upri''’ property-value pair must be added to each node and relation belonging to the statement unit, with the statement unit’s UPRI serving as the value. Building on this primary abstraction layer of statement units, a secondary abstraction layer of compound units can be organized. The nodes and relations associated with all triples within a compound unit are endowed with a ‘''compound_unit_UPRI:upri''’ property-value pair, having the compound unit’s UPRI as their value. Since a particular statement unit may be associated with multiple compound units, its ‘''compound_unit_UPRI''’ property can incorporate an array of UPRIs representing different semantic units.&lt;br /&gt;
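This tagging step can be sketched as a parameterized Cypher statement embedded in Python (the property names, labels, and UPRIs are hypothetical illustrations, not the prototype’s actual code):&lt;br /&gt;

```python
# Sketch (hypothetical property names and UPRIs) of the tagging described
# above, as a parameterized Cypher string one might send to Neo4j. The
# statement unit's UPRI is written onto the nodes and relationship of one
# of its triples; compound-unit UPRIs accumulate in an array property.

cypher = """
MATCH (s {upri: $subject_upri})-[r]->(o {upri: $object_upri})
SET s.statement_unit_upri = $su_upri,
    r.statement_unit_upri = $su_upri,
    o.statement_unit_upri = $su_upri,
    s.compound_unit_upri = coalesce(s.compound_unit_upri, []) + $cu_upri
"""

params = {
    "subject_upri": "upri:appleX",
    "object_upri": "upri:weightX",
    "su_upri": "upri:su1",
    "cu_upri": "upri:itemUnit1",
}

# A driver call would then be: session.run(cypher, **params)
print(sorted(params))
```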
&lt;br /&gt;
An initial demonstration application has been developed by one of the authors, illustrating how semantic units can manage a knowledge graph. [66] Built upon Neo4j as the persistence-layer technology, the application sources its content via a web interface and user input. This small-scale knowledge graph application is designed for documenting assertions from scholarly publications, offering users an exemplary platform to describe some of the contents (and not merely bibliographic metadata) found in a scholarly publication. Each described paper stands as its own item group unit, featuring assertions covered by statement units linked to item units and granularity tree units. The prototype encompasses versioning of semantic units and automatic tracking of their editing histories and provenance. The application employs the organization of the graph into semantic units within a navigation tree, facilitating exploration of a given item group unit through its associated item units (see Fig. 13). The showcase is built using Python and Flask/Jinja2 and is openly available at https://github.com/LarsVogt/Knowledge-Graph-Building-Blocks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig13 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Strategies for implementation====&lt;br /&gt;
Given that only statement units store information, while compound units act as their containers, the first step of implementing semantic units should focus on identifying the statement unit classes required for representing the types of statements integral to the knowledge graph’s coverage. Each statement unit class requires an assigned graph schema, preferably articulated using a shapes constraint language like SHACL. [51] In this initial step, statement types that are grounded in partial order relations must be identified as well (required for identifying granularity tree units). From here, three distinct implementation strategies are available:&lt;br /&gt;
&lt;br /&gt;
#'''Develop from scratch''': In cases where no knowledge graph exists yet, the focus should be on developing a knowledge graph application that organizes incoming information into statement units in accordance with their assigned graph schemata. Rules for organizing statement units into compound units, contingent on the compound unit type, must be established. For example, statement units sharing the same subject resource form a corresponding item unit.&lt;br /&gt;
#'''Transfer an existing knowledge graph''': If an existing knowledge graph needs restructuring into semantic units, the next step is crafting queries that transfer all triples into corresponding statement units, based on the graph schemata identified in the first step. The main challenge is maintaining disjointness of triples between statement units.&lt;br /&gt;
#'''A hybrid approach''': For scenarios where restructuring an entire knowledge graph seems impractical or undesirable, but there is a desire to organize newly added information into semantic units, a hybrid approach is possible. This involves developing input workflows to ensure that all incoming data conforms to the semantic units structure.&lt;br /&gt;
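The grouping rule given in the first strategy (statement units sharing the same subject resource form a corresponding item unit) can be sketched as follows. The one-statement-unit-per-subject-and-predicate split and the example triples are simplifying assumptions for illustration; real statement units may comprise several triples, as dictated by their assigned graph schema.&lt;br /&gt;

```python
from collections import defaultdict

# Hedged sketch: partition incoming triples into statement units (here,
# naively one unit per (subject, predicate) pair) and group statement
# units that share a subject into item units.

def partition(triples):
    statement_units = defaultdict(list)
    for s, p, o in triples:
        statement_units[(s, p)].append((s, p, o))
    item_units = defaultdict(list)
    for (s, _p), unit in statement_units.items():
        item_units[s].append(unit)
    return statement_units, item_units

triples = [
    ("ex:appleX", "ex:weighs", "204.56"),
    ("ex:appleX", "rdf:type", "ex:Apple"),
    ("ex:orchard1", "ex:contains", "ex:appleX"),
]
su, iu = partition(triples)
# Each triple belongs to exactly one statement unit, and ex:appleX's two
# statement units together form one item unit.
```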
&lt;br /&gt;
====Semantic units as FAIR Digital Objects====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig13_Vogt_JofBiomedSem24_15.png&amp;diff=64483</id>
		<title>File:Fig13 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig13_Vogt_JofBiomedSem24_15.png&amp;diff=64483"/>
		<updated>2024-06-16T20:13:13Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 13.''' User interface of a prototype web application that implements semantic units. On the left is a navigation tree that leverages the organization of the underlying Neo4j knowledge graph into different item group, item, and statement units. Currently selected is the infectious agent population item group. On the right, all statements belonging to the selected item group are displayed.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig13_Vogt_JofBiomedSem24_15.png&amp;diff=64482</id>
		<title>File:Fig13 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig13_Vogt_JofBiomedSem24_15.png&amp;diff=64482"/>
		<updated>2024-06-16T20:12:06Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig12_Vogt_JofBiomedSem24_15.png&amp;diff=64481</id>
		<title>File:Fig12 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig12_Vogt_JofBiomedSem24_15.png&amp;diff=64481"/>
		<updated>2024-06-16T20:11:52Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 12.''' Detail from the RDF graph illustrating the contents of a scholarly publication. The data schema employed aligns with the schema shown in Figure 11, tailored to accommodate semantic units. The publication’s content is encapsulated within a dedicated publication item group unit instance through various interconnected semantic units. The publication itself is denoted as an instance of &amp;lt;u&amp;gt;journal article&amp;lt;/u&amp;gt; (IAO:0000013). The publication item group unit encompasses multiple item units related to the research activity, interconnected through the *&amp;lt;u&amp;gt;SEMUNIT:''hasLinkedSemanticUnit''&amp;lt;/u&amp;gt;* property. The interconnected hierarchy extends to an &amp;lt;u&amp;gt;investigation&amp;lt;/u&amp;gt; (OBI:0000066) instance, resulting in a &amp;lt;u&amp;gt;data set&amp;lt;/u&amp;gt; (IAO:0000100) instance with a &amp;lt;u&amp;gt;description&amp;lt;/u&amp;gt; (SIO:000136) instance as its part. This description, in turn, has the multicellular organism item unit describing the organism as its part, which has an instance of &amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt; (UBERON:0000468) as its subject. The blue arrow signifies the representation of the data graph (dark blue box with shadow) by this specific item unit (bordered box in the same color). The ontological layer is constituted by the data graphs of the semantic units, while their semantic-units graphs collectively form the discursive layer. Distinct context units demarcate the reference frames of the publication, research-activity, and research-subject, delineated by is-about statements. For reasons of clarity of presentation, the associated statement units are not shown in the discursive layer.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig12_Vogt_JofBiomedSem24_15.png&amp;diff=64480</id>
		<title>File:Fig12 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig12_Vogt_JofBiomedSem24_15.png&amp;diff=64480"/>
		<updated>2024-06-16T20:01:10Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64479</id>
		<title>File:Fig11 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64479"/>
		<updated>2024-06-16T19:57:54Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* Summary */ Added summary&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 11.''' A semantic schema for modelling the contents of scholarly publications. The depicted semantic schema outlines the modelling structure for encapsulating the components of scholarly publications. It delineates the relationship between a research activity, its associated input and output, and the underlying specification of its process plan, manifested in the form of a research method and research objective. The model draws inspiration from Vogt ''et al.''&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64478</id>
		<title>File:Fig11 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64478"/>
		<updated>2024-06-16T19:56:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 11.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64477</id>
		<title>File:Fig11 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig11_Vogt_JofBiomedSem24_15.png&amp;diff=64477"/>
		<updated>2024-06-16T19:55:45Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig10_Vogt_JofBiomedSem24_15.png&amp;diff=64476</id>
		<title>File:Fig10 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig10_Vogt_JofBiomedSem24_15.png&amp;diff=64476"/>
		<updated>2024-06-16T19:45:45Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 10.''' Five levels of representational granularity. The integration of semantic units into a knowledge graph introduces a semantic-units graph layer, enriching the existing data graph layer. This augmentation includes distinct levels, namely triples, statement units, item units, and item group units, providing a nuanced hierarchy of representational granularity within a knowledge graph.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig10_Vogt_JofBiomedSem24_15.png&amp;diff=64475</id>
		<title>File:Fig10 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig10_Vogt_JofBiomedSem24_15.png&amp;diff=64475"/>
		<updated>2024-06-16T19:44:46Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64474</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64474"/>
		<updated>2024-06-16T19:28:53Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest in the hands of domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way for modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support not only standard relational operators such as joins, unions, and projections, but also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22]&lt;br /&gt;
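The navigational operators named in the last point, such as SPARQL 1.1 property paths, can be imitated outside a triple store with a plain reachability search. The toy parthood graph below is an invented example.&lt;br /&gt;

```python
from collections import deque

# Hedged sketch: an analog of a SPARQL property path such as ex:partOf+,
# finding every entity reachable from a start node over one predicate,
# regardless of path length.

def reachable(triples, start, predicate):
    edges = {}
    for s, p, o in triples:
        if p == predicate:
            edges.setdefault(s, []).append(o)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

triples = [
    ("ex:cell", "ex:partOf", "ex:tissue"),
    ("ex:tissue", "ex:partOf", "ex:organ"),
    ("ex:organ", "ex:partOf", "ex:organism"),
]
print(reachable(triples, "ex:cell", "ex:partOf"))
# all three transitive parthood ancestors are found
```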
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
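For concreteness, a characteristic set as described above (the set of predicates of all triples sharing the same resource in the ''Subject'' position) can be computed directly; the example triples are invented.&lt;br /&gt;

```python
from collections import defaultdict

# Hedged sketch: compute the characteristic set of each subject, i.e.,
# the set of predicates over all triples with that resource as Subject
# (cf. references [28, 29] in the text).

def characteristic_sets(triples):
    cs = defaultdict(set)
    for s, p, _o in triples:
        cs[s].add(p)
    return cs

triples = [
    ("ex:appleX", "rdf:type", "ex:Apple"),
    ("ex:appleX", "ex:weighs", "204.56"),
    ("ex:appleY", "rdf:type", "ex:Apple"),
]
cs = characteristic_sets(triples)
# ex:appleX and ex:appleY end up with different characteristic sets,
# so they would fall into different structural groups.
```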
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables each unit to be referenced within triples, providing an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs. (Fig. 1) This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but the supporting tools and applications needed to turn them into well-supported, scalable, and easily usable knowledge graph applications are missing or not widely used.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, for example by a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something one wants to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
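The constraints just stated (''Subject'' and ''Predicate'' are always resources, while the ''Object'' may be a resource or a literal) can be captured in a minimal sketch; the prefix-based resource test below is a deliberate simplification of real IRI handling.&lt;br /&gt;

```python
# Hedged sketch: a minimal RDF-style triple model enforcing that Subject
# and Predicate are resources, while the Object may also be a literal.
# "Resource-ness" is approximated by a CURIE-like check, which is a
# simplification of real IRI handling.

def is_resource(term):
    return isinstance(term, str) and ":" in term and not term.startswith('"')

def make_triple(subject, predicate, obj):
    if not is_resource(subject):
        raise ValueError("Subject must be a resource")
    if not is_resource(predicate):
        raise ValueError("Predicate must be a resource")
    # The Object may be a resource or a literal, so no check is needed.
    return (subject, predicate, obj)

t = make_triple("ex:appleX", "ex:weighs", '"204.56"')  # literal Object is fine
```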
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, under the implicit assumption that each property, instance, and class possesses its own UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness for empirical data and metadata in a knowledge graph, it is important that not only the terms employed in data and metadata statements possess identifiers from controlled vocabularies such as ontologies, ensuring terminological interoperability, but also the semantic graph patterns underlying each statement. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
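To make the conflict concrete, the sketch below models the same human-readable statement with two structurally different triple patterns. These patterns only gesture at the OBI- and OBOE-style models compared in Figs. 2 and 3; they are invented and are not the actual patterns from those ontologies.&lt;br /&gt;

```python
# Hedged sketch: the same statement ("apple X weighs 204.56 g") modelled
# with two structurally different, invented triple patterns. A naive
# structural comparison cannot see that they say the same thing.

pattern_a = {  # loosely OBI-flavoured, invented for illustration
    ("ex:appleX", "ex:hasQuality", "ex:weight1"),
    ("ex:weight1", "ex:hasValue", "204.56"),
    ("ex:weight1", "ex:hasUnit", "ex:gram"),
}

pattern_b = {  # loosely OBOE-flavoured, invented for illustration
    ("ex:obs1", "ex:ofEntity", "ex:appleX"),
    ("ex:obs1", "ex:hasMeasurement", "ex:m1"),
    ("ex:m1", "ex:hasValue", "204.56"),
    ("ex:m1", "ex:usesStandard", "ex:gram"),
}

print(pattern_a == pattern_b)  # False: schematically non-interoperable
```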
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
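&lt;br /&gt;
The proposed bookkeeping can be sketched as follows, assuming hypothetical graph and pattern identifiers:&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical metadata records: each graph references the identifier of
# the graph pattern used to model its statements.
graph_metadata = {
    "ex:graph1": {"pattern": "ex:weight-pattern-v1"},
    "ex:graph2": {"pattern": "ex:weight-pattern-v1"},
    "ex:graph3": {"pattern": "ex:observation-pattern-v2"},
}

def interoperable_groups(metadata):
    """Group graphs by shared graph-pattern identifier; any group with
    more than one member contains potentially interoperable graphs."""
    groups = defaultdict(list)
    for graph_uri, meta in metadata.items():
        groups[meta["pattern"]].append(graph_uri)
    return {p: sorted(gs) for p, gs in groups.items() if len(gs) > 1}

groups = interoperable_groups(graph_metadata)
```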
&lt;br /&gt;
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL and labeled property graphs like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, requires the use of a query language: SPARQL [37] in the case of RDF/OWL, and Cypher [38] in the case of Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.&lt;br /&gt;
&lt;br /&gt;
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
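&lt;br /&gt;
One possible shape of such reusable query patterns, sketched with a hypothetical template registry keyed by graph-pattern identifier:&lt;br /&gt;

```python
# Hypothetical reusable query templates, one per graph pattern, so users
# select a pattern and fill in slots instead of writing raw SPARQL.
QUERY_TEMPLATES = {
    "ex:weight-pattern-v1": (
        "SELECT ?value ?unit WHERE {{\n"
        "  <{subject}> ex:hasQuality ?q .\n"
        "  ?q ex:measuredAs ?d .\n"
        "  ?d ex:hasValue ?value ; ex:hasUnit ?unit .\n"
        "}}"
    ),
}

def build_query(pattern_id, **slots):
    """Instantiate the reusable query pattern linked to a graph pattern."""
    return QUERY_TEMPLATES[pattern_id].format(**slots)

q = build_query("ex:weight-pattern-v1", subject="ex:appleX")
```

Because the template is tied to the graph pattern, a query built this way is guaranteed to match the structure of every graph modelled with that pattern.&lt;br /&gt;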
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, the adoption of Named Graphs is an alternative to RDF reification and RDF-star. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element of each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward: the resource identifier is added as the value of a corresponding property-value pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
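&lt;br /&gt;
The difference between the two approaches can be illustrated with plain tuples (hypothetical ex: URIs; the rdf: terms are those of the RDF vocabulary):&lt;br /&gt;

```python
# The triple to annotate: ex:appleX ex:hasWeightInGrams "204.56".
# RDF reification introduces a statement resource plus triples naming the
# original triple's parts, to which further statements can then attach:
reified = [
    ("ex:stmt1", "rdf:type", "rdf:Statement"),
    ("ex:stmt1", "rdf:subject", "ex:appleX"),
    ("ex:stmt1", "rdf:predicate", "ex:hasWeightInGrams"),
    ("ex:stmt1", "rdf:object", "204.56"),
    ("ex:stmt1", "ex:resultOf", "ex:massMeasurementAssay1"),  # the statement about the statement
]

# Named Graph alternative: the graph URI becomes the fourth element of the
# triple (a quad), and the annotation targets that graph URI instead:
quads = [
    ("ex:appleX", "ex:hasWeightInGrams", "204.56", "ex:weightGraph1"),
    ("ex:weightGraph1", "ex:resultOf", "ex:massMeasurementAssay1", "ex:metadata"),
]
```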
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
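&lt;br /&gt;
A minimal sketch of this idea, assuming triples as plain string tuples and hypothetical SEMUNIT-style class names:&lt;br /&gt;

```python
class SemanticUnit:
    """A subgraph identified by its own UPRI and typed by a semantic unit
    class (sketch; not an actual RDF implementation)."""
    def __init__(self, upri, unit_class, data_graph):
        self.upri = upri                      # names the unit AND its data graph
        self.unit_class = unit_class          # e.g., a statement unit class
        self.data_graph = frozenset(data_graph)

    def semantic_units_graph(self):
        """Triples belonging to the semantic-units graph layer."""
        return {(self.upri, "rdf:type", self.unit_class)}

weight_unit = SemanticUnit(
    "ex:weightStatementUnit1",
    "semunit:WeightStatementUnit",
    {("ex:appleX", "ex:hasQuality", "ex:weightX"),
     ("ex:weightX", "ex:measuredAs", "ex:datumX")},
)
```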
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized into a semantic unit instantiating the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*, as illustrated in Fig. 6 (below). The statement unit models a single, human-readable statement, as opposed to the individual triple ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032), which is a single triple from that subgraph. Without the context of the other triples in the subgraph, that triple lacks semantic meaning for a domain expert who has no background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
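&lt;br /&gt;
This partition property is directly checkable; a sketch with triples as plain tuples (hypothetical URIs):&lt;br /&gt;

```python
def is_partition(data_graph, statement_units):
    """True if every triple of the data graph layer belongs to exactly one
    statement unit and the units jointly cover the whole data graph."""
    seen = []
    for unit in statement_units:
        seen.extend(unit)
    no_overlap = len(seen) == len(set(seen))   # each triple in exactly one unit
    full_cover = set(seen) == set(data_graph)  # merging the units restores the graph
    return no_overlap and full_cover

unit1 = {("ex:appleX", "ex:hasQuality", "ex:weightX")}
unit2 = {("ex:appleX", "rdf:type", "ex:Apple")}
data_graph = unit1 | unit2
```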
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate.&lt;br /&gt;
&lt;br /&gt;
However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6.&lt;br /&gt;
&lt;br /&gt;
The composition of statement units, whether consisting of one or more triples, is contingent upon the relation underlying the proposition, the arity of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
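&lt;br /&gt;
The contrast between binary and ''n''-ary propositions can be sketched as follows (hypothetical predicates; the ternary encoding mirrors the instance-resource approach described above):&lt;br /&gt;

```python
# A proposition grounded in a binary relation fits a single triple:
binary_unit = [("ex:rightHandX", "ex:hasPart", "ex:rightThumbX")]

def ternary_located_in(b, c, t, stmt):
    """Encode "b located_in c at t" via an instance resource that stands
    for the statement itself (hypothetical predicates)."""
    return [
        (stmt, "rdf:type", "ex:LocatedInStatement"),
        (stmt, "ex:locatedEntity", b),
        (stmt, "ex:location", c),
        (stmt, "ex:atTime", t),
    ]

nary_unit = ternary_located_in("ex:rightThumbX", "ex:rightHandX",
                               "2022-01-29", "ex:locStmt1")
```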
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
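&lt;br /&gt;
A toy stand-in for such a shape specification (not actual SHACL or LinkML syntax), checking only that a unit’s data graph uses the predicates its statement unit class prescribes:&lt;br /&gt;

```python
# Hypothetical stand-in for a shape: the predicates a weight statement
# unit's data graph must use (real shapes constrain far more than this).
SHAPES = {
    "semunit:WeightStatementUnit": {"ex:hasQuality", "ex:measuredAs",
                                    "ex:hasValue", "ex:hasUnit"},
}

def conforms(unit_class, data_graph):
    """True if the data graph uses exactly the predicates its shape requires."""
    used = {p for (_s, p, _o) in data_graph}
    return used == SHAPES[unit_class]

conforming = {
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:measuredAs", "ex:datumX"),
    ("ex:datumX", "ex:hasValue", "204.56"),
    ("ex:datumX", "ex:hasUnit", "ex:gram"),
}
incomplete = {("ex:appleX", "ex:hasQuality", "ex:weightX")}
```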
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose, providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource to be a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analog to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource to be a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples of two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers of collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
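&lt;br /&gt;
A sketch of a compound unit that derives its data graph by merging those of its associated units (hypothetical names mirroring *&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;*):&lt;br /&gt;

```python
from collections import namedtuple

# Minimal statement unit: a UPRI plus its set of (s, p, o) triples.
Unit = namedtuple("Unit", ["upri", "data_graph"])

class CompoundUnit:
    """Container of associated semantic units (sketch)."""
    def __init__(self, upri, associated_units):
        self.upri = upri
        self.associated_units = list(associated_units)

    def semantic_units_graph(self):
        """Triples linking the compound unit to its associated units."""
        return {(self.upri, "semunit:hasAssociatedSemanticUnit", u.upri)
                for u in self.associated_units}

    def data_graph(self):
        """The compound unit's data graph, derived by merging the data
        graphs of its associated units."""
        merged = set()
        for u in self.associated_units:
            merged |= set(u.data_graph)
        return merged

u1 = Unit("ex:unit1", {("ex:appleX", "rdf:type", "ex:Apple")})
u2 = Unit("ex:unit2", {("ex:appleX", "ex:hasQuality", "ex:weightX")})
item = CompoundUnit("ex:appleXItemUnit", [u1, u2])
```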
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as *‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
A typed statement unit assigns a human-readable label to a statement unit. A typed statement unit is a compound unit comprising the following statement units (see Fig. 9A):&lt;br /&gt;
&lt;br /&gt;
#A statement unit that is not an instance of a named-individual or a class identification unit. It functions as the reference statement unit of the typed statement unit, and its subject is also the subject of the typed statement unit.&lt;br /&gt;
#Identification units specifying the class affiliations of all the resources that are referenced in the data graph of the reference statement unit, together with their human-readable labels.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* from Figure 6, serving as the reference statement unit for this *‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’*, and five instances of *&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;*. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This pattern template is employed for graphical displays of information from the reference statement unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Each statement unit class has at least one display pattern associated with it. A display pattern acts as a template that takes as input the labels provided by the identification units associated with a typed statement unit and generates a human-readable dynamic label for the textual (see Fig. 9B) or a dynamic mind-map pattern for the graphical representation (see Fig. 9C) of the statement of its reference statement unit. Thus, a dynamic label and a dynamic mind-map pattern of a typed statement unit are derived from the corresponding templates provided by its reference statement unit, taking the human-readable labels provided by its identification units as input.&lt;br /&gt;
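&lt;br /&gt;
The template mechanism can be sketched as simple string formatting (a hypothetical label template; the actual dynamic labels are illustrated in Fig. 9B):&lt;br /&gt;

```python
# Hypothetical dynamic-label template for the weight statement unit class;
# slot values come from the labels provided by the identification units.
DYNAMIC_LABELS = {
    "semunit:WeightStatementUnit": "{subject} has a weight of {value} {unit}",
}

def render_label(unit_class, labels):
    """Instantiate the display pattern of a statement unit class with the
    human-readable labels supplied by its identification units."""
    return DYNAMIC_LABELS[unit_class].format(**labels)

label = render_label("semunit:WeightStatementUnit",
                     {"subject": "apple X", "value": "204.56", "unit": "gram"})
```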
&lt;br /&gt;
===Item unit===&lt;br /&gt;
An item unit encompasses all statement and typed statement units that share a common subject, i.e., they form a group of statements relating to the same entity. The subject resource becomes the subject of the item unit, and the resource representing an item unit in the semantic-units graph relates to its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:hasSemanticUnitSubject&amp;lt;/u&amp;gt;*. Conceptually, item units align with the ''graph-per-resource'' data management pattern [52] or the previously mentioned ''characteristic set'' or ''RDF molecule'', and they are akin to the ''Item''  concept in the Wikibase data model&amp;lt;ref name=&amp;quot;MWWikibase24&amp;quot;&amp;gt;{{cite web |url=https://www.mediawiki.org/wiki/Wikibase/DataModel#Item |title=Wikibase/DataModel - Overview of the data model |work=MediaWiki.org |date=07 April 2024}}&amp;lt;/ref&amp;gt;, but adapt the concept to statement units rather than triples.&lt;br /&gt;
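&lt;br /&gt;
Grouping statement units into item units by shared subject can be sketched as follows (hypothetical unit format: a subject paired with its triples):&lt;br /&gt;

```python
from collections import defaultdict

def item_units(statement_units):
    """Group statement units by their shared subject resource; each group
    forms the content of one item unit (sketch)."""
    groups = defaultdict(list)
    for subject, triples in statement_units:
        groups[subject].append(triples)
    return dict(groups)

statement_units = [
    ("ex:appleX", {("ex:appleX", "rdf:type", "ex:Apple")}),
    ("ex:appleX", {("ex:appleX", "ex:hasQuality", "ex:weightX")}),
    ("ex:pearY", {("ex:pearY", "rdf:type", "ex:Pear")}),
]
grouped = item_units(statement_units)
```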
&lt;br /&gt;
===Item group unit===&lt;br /&gt;
An item group unit is composed of a minimum of two item units. The subgraphs of the item units belonging to the same item group unit are connected through statement units that share their subject with the subject of one item unit and one of their objects with the subject of another item unit. As a result, merging the subgraphs of all the item units of an item group unit forms a connected graph.&lt;br /&gt;
&lt;br /&gt;
===Granularity tree unit===&lt;br /&gt;
We can further identify types of statement units that depend on partial order relations (i.e., relations that are reflexive, antisymmetric, and transitive), forming partial orders. Examples include class-subclass relations in ontologies, parthood relations in descriptive statements, and sequential relations like ''&amp;lt;u&amp;gt;before&amp;lt;/u&amp;gt;'' (RO:0002083) in process specifications. Partial order relations give rise to granular partitions that form granularity trees [53,54,55] and contribute to defining granularity perspectives. [56,57,58]&lt;br /&gt;
&lt;br /&gt;
Granularity perspectives identify specific types of semantically meaningful tree-like subgraphs within a knowledge graph, supporting graph exploration by modularization in addition to statement, item, and item group units.&lt;br /&gt;
&lt;br /&gt;
Due to the nested structure of a granularity tree and its inherent directionality from root to leaves, the subject of a granularity tree unit can be specified as the subject of those statement units that share their objects with the subjects of other statement units, but never share their subject with the objects of other statement units within the same granularity tree unit; in other words, it is the root of the tree.&lt;br /&gt;
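&lt;br /&gt;
Under this definition, the subject of a granularity tree unit can be computed as the one resource that appears in subject position but never in object position (a sketch with parthood triples as tuples; hypothetical URIs):&lt;br /&gt;

```python
def tree_subject(statement_unit_triples):
    """Return the subject of a granularity tree unit: the tree's root,
    i.e., the one resource never appearing in object position."""
    subjects = {s for (s, _p, _o) in statement_unit_triples}
    objects = {o for (_s, _p, o) in statement_unit_triples}
    roots = subjects - objects
    if len(roots) != 1:
        raise ValueError("a granularity tree has exactly one root")
    return roots.pop()

parthood = {
    ("ex:rightHand", "ex:hasPart", "ex:rightThumb"),
    ("ex:rightHand", "ex:hasPart", "ex:indexFinger"),
    ("ex:rightThumb", "ex:hasPart", "ex:thumbNail"),
}
```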
&lt;br /&gt;
===Granular item group unit===&lt;br /&gt;
A granular item group unit encompasses all statement units and item units whose subjects belong to the same granularity tree unit. The item units belonging to a granular item group unit can be systematically arranged within a nested hierarchy dictated by the underlying granularity tree. This additional organization offers improved explorability for users of a knowledge graph application.&lt;br /&gt;
&lt;br /&gt;
===Context unit===&lt;br /&gt;
The ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' property (IAO:0000136) connects an information artifact to an entity about which the artifact provides information. Using this property in a knowledge graph changes the frame of reference from the discursive layer to the ontological layer. An is-about statement thus divides a knowledge graph into two subgraphs, each forming a context unit that belongs to one of these two layers. Is-about statement units relate resources from the semantic-units graph with resources from the data graph of a knowledge graph. For example, in documenting a research activity that results in the creation of a dataset describing the anatomy of a multicellular organism, the statement *‘&amp;lt;u&amp;gt;description item unit&amp;lt;/u&amp;gt;’* ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' ‘&amp;lt;u&amp;gt;multicellular organism&amp;lt;/u&amp;gt;’ (UBERON:0000468) marks a transition in the frame of reference from the research activity’s outcome to the multicellular organism being described (see also Fig. 12 further below).&lt;br /&gt;
&lt;br /&gt;
===Dataset unit===&lt;br /&gt;
A dataset unit is an ordered set of semantic units. Dataset units can be employed to aggregate all data contributed by a specific institution in a collaborative project, to document the state of a particular object at a given time, or to store and make accessible the results of a specific search query. Knowledge graph users have the flexibility to specify dataset units for their individual needs, utilizing the unit’s UPRI as a reference identifier.&lt;br /&gt;
&lt;br /&gt;
===List unit===&lt;br /&gt;
In certain instances, it becomes necessary to articulate statements about a specific collection of particular resources. To achieve this, such a collection can be modelled as a list unit. We distinguish unordered list units from ordered list units, with the latter organizing resources in a specific sequence, such as the authors of a scholarly publication. Additionally, a set unit is an unordered list unit in which each resource is listed only once, adhering to a uniqueness restriction.&lt;br /&gt;
&lt;br /&gt;
From a technical standpoint, a list unit contains membership statement units, each delineating a resource belonging to the list by linking the UPRI of the list unit through a *&amp;lt;u&amp;gt;SEMUNIT:''child''&amp;lt;/u&amp;gt;* relation to the respective resource. In the case of an ordered list unit, each membership statement unit must be indexed through a data property ''&amp;lt;u&amp;gt;index&amp;lt;/u&amp;gt;'' (RDF:index).&lt;br /&gt;
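&lt;br /&gt;
A sketch of an ordered list unit as indexed membership statements (hypothetical encoding; the index is carried here as a fourth tuple element rather than an RDF data property):&lt;br /&gt;

```python
def ordered_list_unit(list_upri, members):
    """Membership statements of an ordered list unit: each links the list
    to a member via a child relation and carries a 1-based index."""
    return [(list_upri, "semunit:child", member, index)
            for index, member in enumerate(members, start=1)]

# E.g., the ordered author list of a scholarly publication:
authors = ordered_list_unit("ex:authorList1",
                            ["ex:vogt", "ex:kuhn", "ex:hoehndorf"])
```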
&lt;br /&gt;
List units can be employed as arrays and may incorporate cardinality restrictions, thereby characterizing a closed collection of entities and enabling a localized closed-world assumption.&lt;br /&gt;
&lt;br /&gt;
==Discussion==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig9_Vogt_JofBiomedSem24_15.png&amp;diff=64473</id>
		<title>File:Fig9 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig9_Vogt_JofBiomedSem24_15.png&amp;diff=64473"/>
		<updated>2024-06-16T19:16:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 9.''' Typed statement unit with dynamic label and dynamic mind-map pattern. '''A)''' Typed statement unit exemplified for a weight statement. This typed statement unit consolidates the data graphs of six statement units, including the *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* from Figure 6, serving as the reference statement unit for this *‘&amp;lt;u&amp;gt;typed statement unit&amp;lt;/u&amp;gt;’*, and five instances of *&amp;lt;u&amp;gt;SEMUNIT:named-individual identification unit&amp;lt;/u&amp;gt;*. '''B)''' Dynamic label: Illustrated is an example of the dynamic label associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This dynamic label template is utilized for textual displays of information from the reference statement unit. '''C)''' Dynamic mind-map pattern: Depicted is an example of the dynamic mind-map pattern associated with the reference statement unit class (*&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*). This pattern template is employed for graphical displays of information from the reference statement unit.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig8_Vogt_JofBiomedSem24_15.png&amp;diff=64472</id>
		<title>File:Fig8 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig8_Vogt_JofBiomedSem24_15.png&amp;diff=64472"/>
		<updated>2024-06-16T19:13:15Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 8.''' Example of a compound unit, denoted as *‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig9_Vogt_JofBiomedSem24_15.png&amp;diff=64471</id>
		<title>File:Fig9 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig9_Vogt_JofBiomedSem24_15.png&amp;diff=64471"/>
		<updated>2024-06-16T19:13:07Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64470</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64470"/>
		<updated>2024-06-16T19:07:19Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way for modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22] &lt;br /&gt;
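The navigational operators mentioned in the last bullet can be illustrated with a minimal sketch in plain Python, analogous to a SPARQL property path such as ex:partOf+ that matches entities connected through one or more edges of the same predicate. The triples and identifiers here are invented for illustration.&lt;br /&gt;

```python
# Minimal sketch of a navigational operator: collect every entity
# reachable from a start node via an arbitrary-length path over one
# predicate, i.e., the behavior of a "one-or-more" property path.
from collections import deque

triples = {
    ("ex:cell", "ex:partOf", "ex:tissue"),
    ("ex:tissue", "ex:partOf", "ex:organ"),
    ("ex:organ", "ex:partOf", "ex:organism"),
    ("ex:organ", "ex:hasQuality", "ex:weight"),
}

def reachable(start, predicate, graph):
    """All nodes reachable from `start` through 1+ `predicate` edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen
```

A relational join can only follow a fixed number of hops; the breadth-first traversal above follows paths of any length, which is what distinguishes graph query languages from standard relational operators.&lt;br /&gt;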
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables their efficient referencing within triples, facilitating an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs. (Fig. 1) This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, such as a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something and represents something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
===Problem statement===&lt;br /&gt;
====Challenge 1: Ensuring schematic interoperability for FAIR empirical data====&lt;br /&gt;
&lt;br /&gt;
In the pursuit of FAIRness in empirical data and metadata in a knowledge graph, it is important not only for the terms employed in data and metadata statements to possess identifiers from controlled vocabularies, such as ontologies, ensuring terminological interoperability, but also for the semantic graph patterns underlying each statement to be identifiable. These patterns specify the relationships among the terms in a statement, facilitating schematic interoperability.&lt;br /&gt;
&lt;br /&gt;
Due to the expressivity of RDF and OWL, statements can be modelled in multiple, often not directly interoperable ways within a knowledge graph. Distinguishing between RDF graphs with different structures that essentially model the same underlying data statement poses a challenge. Consequently, the presence of schematic interoperability conflicts becomes unavoidable, especially when data are represented using diverse graph patterns (cf. Figs. 2 and 3).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig2 Vogt JofBiomedSem24 15.png|900px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;900px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) [35] of the Open Biological and Biomedical Ontology Foundry (OBO).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
[[File:Fig3 Vogt JofBiomedSem24 15.png|800px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;800px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE) [36], an ontology frequently used in the ecology community.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Therefore, to maintain interoperability in the representation of empirical data statements within an RDF graph, it can be beneficial to restrict the graph patterns employed for their semantic modelling. Statements of the same type, such as all weight measurements, would employ identical graph patterns to maintain interoperability. Each of these patterns would be assigned an identifier. When representing empirical data in the form of an RDF graph, the graph’s metadata should reference that graph-pattern identifier. This approach enables the identification of potentially interoperable RDF graphs sharing common graph-pattern identifiers.&lt;br /&gt;
&lt;br /&gt;
Practically implementing these principles entails two criteria. Firstly, all statements within a knowledge graph must be categorized into statement classes, each associated with a specified graph pattern, typically in the form of a shape specification. Secondly, the subgraph corresponding to a particular statement must be distinctly identifiable.&lt;br /&gt;
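The two criteria just stated can be sketched as a simple registry in plain Python. All names here are invented placeholders (the pattern IDs, class labels, and graph IDs do not come from the paper); a real implementation would back each pattern with a shape specification such as a SHACL shape.&lt;br /&gt;

```python
# Toy registry: each statement class is associated with exactly one
# graph-pattern identifier (criterion 1), and each statement's subgraph
# is identifiable on its own and records the pattern it instantiates
# (criterion 2). All identifiers are illustrative placeholders.
pattern_registry = {
    "weight statement": "pattern:weight-v1",
    "parthood statement": "pattern:parthood-v1",
}

subgraphs = {
    "graph:001": {"class": "weight statement", "pattern": "pattern:weight-v1"},
    "graph:002": {"class": "weight statement", "pattern": "pattern:weight-v1"},
    "graph:003": {"class": "parthood statement", "pattern": "pattern:parthood-v1"},
}

def potentially_interoperable(graph_a, graph_b, graphs):
    """Two subgraphs are candidates for direct comparison iff they
    reference the same graph-pattern identifier."""
    return graphs[graph_a]["pattern"] == graphs[graph_b]["pattern"]
```

Sharing a pattern identifier is what allows two independently produced subgraphs (e.g., two weight measurements) to be recognized as structurally comparable without inspecting their triples.&lt;br /&gt;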
&lt;br /&gt;
====Challenge 2: Overcoming barriers in graph query language adoption====&lt;br /&gt;
Another significant challenge arises in the context of searching for specific information in a knowledge graph. The prevalent formats for knowledge graphs include RDF/OWL or labeled property graphs like Neo4j. Interacting directly with these graphs, encompassing CRUD operations for creating (= writing), reading (= searching), updating, and deleting statements in the knowledge graph, requires a query language. SPARQL [37] is the standard query language for RDF/OWL, while Cypher [38] is employed for Neo4j.&lt;br /&gt;
&lt;br /&gt;
Although these query languages empower users to formulate detailed and intricate queries, the challenge lies in their complexity, creating an entry barrier for seamless interactions with knowledge graphs [39]. Furthermore, query languages are not aware of graph patterns.&lt;br /&gt;
&lt;br /&gt;
This challenge may potentially be addressed by providing reusable query patterns that link to specific graph patterns, thereby integrating representation and querying.&lt;br /&gt;
&lt;br /&gt;
====Challenge 3: Addressing complexities in making statements about statements====&lt;br /&gt;
The RDF triple syntax of ''Subject'', ''Predicate'', and ''Object'' allows expressing a statement about another statement by creating a triple that relates a statement, composed of one or more triples, to a value, resource, or another statement. The scenario may arise where such statements about statements must be modelled. For instance, metadata for a measurement may relate two distinct subgraphs: one representing the measurement itself (as seen in Fig. 2) and another documenting the underlying measuring process (as seen in Fig. 4).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig4 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In RDF reification, a statement resource is defined to represent a particular triple by describing it via three additional triples that specify its ''Subject'', ''Predicate'', and ''Object''. Alternatively, the RDF-star approach can be employed. [40, 41] Both methods increase the complexity of the represented graph.&lt;br /&gt;
&lt;br /&gt;
In cases like this, the adoption of Named Graphs offers an alternative to RDF reification and RDF-star. Within RDF-based knowledge graphs, a Named Graph resource identifies a set of triples by incorporating the URI of the Named Graph as a fourth element to each triple, transforming them into quads. In labeled property graphs, on the other hand, assigning a resource for identifying subgraphs within the overall data graph is straightforward and can be achieved by incorporating the resource identifier as the value of a corresponding property-value pair, subsequently adding this pair to all relations and nodes belonging to the same subgraph.&lt;br /&gt;
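The quad mechanism can be sketched in plain Python: each triple gains the graph's identifier as a fourth element, and a statement about a graph simply uses that identifier as its ''Subject''. The identifiers (graph:measurement1, ex:appleX, and so on) are invented for illustration.&lt;br /&gt;

```python
# Toy quad store: (Subject, Predicate, Object, named-graph ID).
# The third quad is a statement *about* the measurement graph,
# made by using the graph's own identifier as its Subject.
quads = [
    ("ex:appleX", "ex:hasWeight", "204.56 g", "graph:measurement1"),
    ("ex:appleX", "rdf:type", "ex:Apple", "graph:measurement1"),
    ("graph:measurement1", "ex:producedBy", "ex:massMeasurementAssay7", "graph:metadata1"),
]

def triples_of(graph_id, store):
    """Extract the triples belonging to one named graph."""
    return [(s, p, o) for (s, p, o, g) in store if g == graph_id]
```

Unlike reification, no per-triple bookkeeping triples are needed: the graph identifier attaches the provenance statement to the whole subgraph at once.&lt;br /&gt;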
&lt;br /&gt;
==Results==&lt;br /&gt;
===Semantic unit===&lt;br /&gt;
We developed an approach for organizing knowledge graphs into distinct layers of subgraphs using graph patterns. Unlike traditional methods of partitioning a knowledge graph that (i) rely on technical aspects such as shared graph-topological properties of its triples with the goal of (federated) reasoning and query optimization (see characteristic sets [29, 30], RDF molecules [31, 42], and other approaches [43,44,45]), that (ii) partition a knowledge graph into small blocks for embedding and entity alignment learning to scale knowledge graph fusion [46], or that (iii) partition knowledge extractions, allowing reasoning over them in parallel to speed up knowledge graph construction [47], our approach introduces &amp;quot;semantic units.&amp;quot; Semantic units prioritize structuring a knowledge graph into identifiable sets of triples, as subgraphs that represent units of representation possessing semantic significance for human readers. Technically, a semantic unit is a subgraph within a knowledge graph, represented in the graph by its own resource—designated as a UPRI—and embodied in the graph as a node. This resource is classified as an instance of a specific semantic unit class.&lt;br /&gt;
&lt;br /&gt;
Semantic units focus on creating units that are semantically meaningful to domain experts. For instance, the graph in Fig. 2 exemplifies a subgraph that can be organized in a semantic unit that instantiates the class *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;*, as illustrated later in Fig. 6. The statement unit models a single, human-readable statement, as opposed to an individual triple from that subgraph, such as ‘&amp;lt;u&amp;gt;weight&amp;lt;/u&amp;gt;’ (PATO:0000128) ''isQualityMeasuredAs'' (IAO:0000417) ‘&amp;lt;u&amp;gt;scalar measurement datum&amp;lt;/u&amp;gt;’ (IAO:0000032). That triple, taken without the context of the other triples in the subgraph, is not semantically meaningful for a domain expert who has no background in semantics.&lt;br /&gt;
&lt;br /&gt;
Beyond statement units, which constitute the smallest semantically meaningful statements (e.g., a weight measurement), collections of statement units can form compound units representing a coarser level of representational granularity. The classification of semantic units thus distinguishes two fundamental categories: statement units and compound units, each with its respective subcategories. For a detailed classification of semantic units, refer to Fig. 5.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig5 Vogt JofBiomedSem24 15.png|300px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;300px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 5.''' Classification of different categories of semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The structuring of a knowledge graph into semantic units involves introducing an additional layer of triples to the existing graph. To distinguish these two layers, we label the pre-existing graph as the data graph layer, while the newly added triples constitute the semantic-units graph layer. For clarity across the graph, the resource representing a semantic unit, along with all triples featuring this resource in the ''Subject'' or ''Object'' position, is assigned to the semantic-units graph layer. Extending this distinction from the graph as a whole to individual semantic units, each semantic unit is associated with both a data graph and a semantic-units graph. The data graph of a particular semantic unit shares the same UPRI as its semantic unit resource. This alignment enables reference to the UPRI, concurrently denoting the semantic unit as a resource and its corresponding data graph. This interconnectedness empowers users to make statements about the content encapsulated within the semantic unit’s data graph, as shown in Fig. 6.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig6 Vogt JofBiomedSem24 15.png|1000px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;1000px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
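The layer-assignment rule described above can be sketched in plain Python, treating triples as 3-tuples of strings: a triple belongs to the semantic-units graph layer exactly when a semantic unit resource occupies its ''Subject'' or ''Object'' position. All prefixes and UPRIs below (ex:, semunit:, ex:unit1) are hypothetical placeholders, not identifiers from the paper.

```python
# Sketch of the layer-assignment rule for the two graph layers.
# A triple is assigned to the semantic-units graph layer iff a
# semantic unit resource appears in its subject or object position;
# all remaining triples form the data graph layer.

def split_layers(triples, semantic_unit_resources):
    """Partition triples into (data_layer, semantic_units_layer)."""
    data_layer, su_layer = set(), set()
    for s, p, o in triples:
        if s in semantic_unit_resources or o in semantic_unit_resources:
            su_layer.add((s, p, o))
        else:
            data_layer.add((s, p, o))
    return data_layer, su_layer

# Hypothetical mini-graph loosely echoing Fig. 6.
triples = {
    ("ex:appleX", "ex:hasWeight", "ex:weightX"),                   # data graph
    ("ex:weightX", "ex:hasValue", "204.56"),                       # data graph
    ("ex:unit1", "rdf:type", "semunit:WeightStatementUnit"),       # semantic-units layer
    ("ex:unit1", "semunit:hasSemanticUnitSubject", "ex:appleX"),   # semantic-units layer
}
data_layer, su_layer = split_layers(triples, {"ex:unit1"})
```

Because the rule is purely positional, the two layers are disjoint and jointly exhaustive by construction.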
&lt;br /&gt;
====Statement unit: A proposition in the knowledge graph====&lt;br /&gt;
A statement unit is characterized as the fundamental unit of information encapsulating the smallest, independent proposition (i.e., statement) with semantic meaning for human comprehension (see also [32]). For instance, the weight measurement statement for &amp;quot;apple X&amp;quot; illustrated in Fig. 6 represents a statement unit.&lt;br /&gt;
&lt;br /&gt;
Structuring a knowledge graph into statement units results in a partition of its graph. Each triple within the data graph layer of the knowledge graph is associated with exactly one statement unit, and merging the subgraphs of all statement units results in the complete data graph of a knowledge graph. This partitioning only applies to the data graph layer.&lt;br /&gt;
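The partition property can be made concrete with a small check in plain Python, modeling each statement unit as a frozenset of triples. The triples and identifiers below are hypothetical stand-ins, not the paper's actual data.

```python
# Sketch of the partition property: every triple of the data graph
# layer belongs to exactly one statement unit, and merging all
# statement-unit subgraphs reproduces the complete data graph.

def is_partition(data_graph, statement_units):
    """True iff the statement units partition the data graph."""
    seen = set()
    for unit in statement_units:
        if unit.intersection(seen):   # a triple assigned to two units
            return False
        seen |= unit
    return seen == data_graph         # no triple left unassigned

weight_unit = frozenset({
    ("ex:appleX", "ex:hasWeight", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
})
part_unit = frozenset({
    ("ex:handX", "ex:hasPart", "ex:thumbX"),
})
data_graph = set(weight_unit) | set(part_unit)
```

Leaving a triple uncovered, or assigning one triple to two units, both violate the partition.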
&lt;br /&gt;
We can understand each statement unit to specify a particular proposition by establishing a relationship between a resource serving as the subject and either a literal or another resource, denoted as the object of the predicate. Every statement unit encompasses a single subject and one or more objects.&lt;br /&gt;
&lt;br /&gt;
To illustrate, a has-part statement unit features a subject and one object. Conversely, a weight measurement statement unit consists of a subject, as well as two objects: the weight value and the weight unit (refer to Fig. 6). The resource signifying a statement unit in the graph establishes a connection with its subject through the property *&amp;lt;u&amp;gt;SEMUNIT:''hasSemanticUnitSubject''&amp;lt;/u&amp;gt;*, which is documented in the semantic-units graph of the statement unit.&lt;br /&gt;
&lt;br /&gt;
In scenarios where the proposition within the data graph is grounded in a binary relation—a divalent predicate like &amp;quot;This right hand has as a part this right thumb&amp;quot;—the associated statement unit typically comprises a single triple. This alignment arises from the nature of RDF, where ''Predicates'' of triples are inherently binary relations. In such cases, the RDF property concurrently embodies the statement’s verb or predicate. However, numerous propositions are grounded in ''n''-ary relations, making a single triple insufficient for their representation. Examples encompass the weight measurement statement in Fig. 6 and statements like &amp;quot;This right hand has part this right thumb on January 29th 2022,&amp;quot; &amp;quot;Anna gives Bob a book,&amp;quot; and &amp;quot;Carla travels by train from Paris to Berlin on the 29th of June 2022,&amp;quot; each necessitating more than one triple. In these cases, the statement’s verb or predicate is often represented not by a property within a single triple but instead by an instance resource, as exemplified by ‘&amp;lt;u&amp;gt;weight X&amp;lt;/u&amp;gt;’ (PATO:0000128) in Fig. 6. The composition of statement units, whether consisting of one or more triples, is contingent upon the relation of the underlying proposition, the ''n''-aryness of its predicate, and the incorporation of optional objects. Types of statement units can be distinguished based on the ''n''-ary verb or predicate that characterizes their underlying proposition. Notably, numerous object properties of the Basic Formal Ontology 2 denote ternary relations, particularly those entailing temporal dependencies. [48] For instance, &amp;quot;''b'' located_in ''c'' at ''t''&amp;quot; mandates at least two triples for accurate representation in RDF.&lt;br /&gt;
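The contrast between binary and ''n''-ary propositions can be illustrated with two hypothetical statement units, again with triples as plain tuples. The URIs are invented for illustration and the modeling of the weight measurement is only loosely patterned on Fig. 6.

```python
# A binary relation fits in one triple; an n-ary one needs an
# instance resource to carry the verb, plus further triples for
# the remaining participants. All identifiers are hypothetical.

# Binary: "This right hand has as a part this right thumb."
has_part_unit = {
    ("ex:rightHandX", "ex:hasPart", "ex:rightThumbX"),
}

# n-ary: the weight measurement of apple X. The verb "weighs" is not
# an RDF property here but the instance resource ex:weightX, linked
# to its measurement, value, and unit by additional triples.
weight_unit = {
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:isQualityMeasuredAs", "ex:measurementX"),
    ("ex:measurementX", "ex:hasValue", "204.56"),
    ("ex:measurementX", "ex:hasUnit", "ex:gram"),
}
```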
&lt;br /&gt;
The determination of which triples belong to a statement unit necessitates case-by-case specification by human domain experts. The statement unit patterns can then be specified using languages like LinkML [49, 50] or the Shapes Constraint Language SHACL [51]. These languages enable the definition of graph patterns to represent specific propositions, subsequently constituting a statement unit. Each statement unit instantiates a designated statement unit class, a classification defined by the specific verb or predicate characterizing the propositions modelled by its instances. We can distinguish different subcategories of statement units based on the underlying predicate, such as ''has part'', ''type'', and ''develops from''.&lt;br /&gt;
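To give a rough feel for what such a graph pattern does, the sketch below expresses a statement-unit pattern as triple templates with variables and matches it against a toy graph. This is a simplified stand-in for what SHACL or LinkML would express declaratively, not an implementation of either language, and all identifiers are hypothetical.

```python
# Loose sketch of a statement-unit graph pattern: triple templates
# whose "?"-prefixed slots are variables. match() returns a binding
# of variables to resources if the pattern is instantiated in the
# graph, or None otherwise.

def match(pattern, graph, binding=None):
    binding = dict(binding or {})
    if not pattern:
        return binding
    head, rest = pattern[0], pattern[1:]
    for triple in graph:
        trial = dict(binding)
        ok = True
        for slot, value in zip(head, triple):
            if slot.startswith("?"):
                if trial.setdefault(slot, value) != value:
                    ok = False   # variable already bound differently
                    break
            elif slot != value:
                ok = False       # constant term does not match
                break
        if ok:
            result = match(rest, graph, trial)
            if result is not None:
                return result
    return None

weight_pattern = [
    ("?apple", "ex:hasQuality", "?weight"),
    ("?weight", "ex:hasValue", "?value"),
]
graph = {
    ("ex:appleX", "ex:hasQuality", "ex:weightX"),
    ("ex:weightX", "ex:hasValue", "204.56"),
}
bindings = match(weight_pattern, graph)
```

A subgraph that satisfies the pattern then constitutes one statement unit of the corresponding class.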
&lt;br /&gt;
A distinctive category within the statement units, denoted as identification units, serves a specific purpose: providing details about a particular named individual or class resource. Two principal subtypes define this category. A named individual identification unit is a statement unit that serves to identify a resource as a named individual, adding information such as the resource’s label, type, and its class membership (refer to Fig. 7A). A class identification unit{{Efn|Analogous to class identification units, one could specify property identification units that have property resources as their subject.}} is a statement unit that serves to identify a resource as a class and provides details including its label, identifier, and optionally, the URIs of both the ontology and the specific version from which the class term has been imported (refer to Fig. 7B). Both types of identification units are important for providing human-readable displays of statement units, as they provide the labels for the resources used in them (see &amp;quot;typed statement unit&amp;quot; and &amp;quot;dynamic label&amp;quot; in Fig. 9, later).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig7 Vogt JofBiomedSem24 15.png|500px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;500px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 7.''' Examples of two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
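How identification units support human-readable display can be sketched in plain Python: the rdfs:label triples they carry form a label map, and a renderer substitutes labels for UPRIs when printing a statement. The identification-unit contents below are hypothetical illustrations.

```python
# Sketch: identification units carry type and label triples for a
# resource; collecting their rdfs:label triples yields a label map
# used to display other statement units in human-readable form.

identification_units = [
    # hypothetical named-individual identification unit for apple X
    {("ex:appleX", "rdf:type", "ncit:C71985"),
     ("ex:appleX", "rdfs:label", "apple X")},
    # hypothetical class identification unit for the class 'apple'
    {("ncit:C71985", "rdfs:label", "apple")},
]

def label_map(units):
    """Collect rdfs:label triples from identification units."""
    return {s: o for unit in units for s, p, o in unit if p == "rdfs:label"}

def render(triple, labels):
    """Replace UPRIs by their labels where one is known."""
    return " ".join(labels.get(term, term) for term in triple)

labels = label_map(identification_units)
text = render(("ex:appleX", "rdf:type", "ncit:C71985"), labels)
```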
&lt;br /&gt;
====Compound unit: A collection of propositions====&lt;br /&gt;
Compound units are containers for collections of associated semantic units, each possessing semantic significance for a human reader. Each compound unit possesses a UPRI and instantiates a corresponding compound unit class. The connection between the resource representing the compound unit and those representing its associated semantic units is detailed through the property *&amp;lt;u&amp;gt;SEMUNIT:hasAssociatedSemanticUnit&amp;lt;/u&amp;gt;* (see Fig. 8). The subsequent sections introduce distinct subcategories of compound units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8 Vogt JofBiomedSem24 15.png|700px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;700px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 8.''' Example of a compound unit, denoted as *‘&amp;lt;u&amp;gt;apple X item unit&amp;lt;/u&amp;gt;’*, that encompasses multiple statement units. Compound units, by virtue of merging the data graphs of their associated statement units, indirectly manifest a data graph (here, highlighted by the blue arrow). Notably, the compound unit possesses a semantic-units graph (depicted in the peach-colored box) delineating the associated semantic units.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
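The way a compound unit indirectly manifests a data graph can be sketched in plain Python: it owns no data-graph triples itself, its semantic-units graph records the association links, and its data graph arises by merging the data graphs of the associated units. All UPRIs below are hypothetical placeholders.

```python
# Sketch of a compound unit: its semantic-units graph holds the
# hasAssociatedSemanticUnit triples, while its data graph is the
# union of the data graphs of its associated semantic units.

statement_units = {
    "ex:unit1": {("ex:appleX", "ex:hasWeight", "ex:weightX")},
    "ex:unit2": {("ex:appleX", "rdf:type", "ncit:C71985")},
}

def compound_unit(upri, associated, units):
    """Return (semantic_units_graph, merged_data_graph)."""
    su_graph = {(upri, "semunit:hasAssociatedSemanticUnit", a)
                for a in associated}
    data_graph = set().union(*(units[a] for a in associated))
    return su_graph, data_graph

su_graph, data_graph = compound_unit(
    "ex:appleItemUnit", ["ex:unit1", "ex:unit2"], statement_units)
```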
&lt;br /&gt;
===Typed statement unit===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig8_Vogt_JofBiomedSem24_15.png&amp;diff=64469</id>
		<title>File:Fig8 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig8_Vogt_JofBiomedSem24_15.png&amp;diff=64469"/>
		<updated>2024-06-16T19:05:04Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig7_Vogt_JofBiomedSem24_15.png&amp;diff=64468</id>
		<title>File:Fig7 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig7_Vogt_JofBiomedSem24_15.png&amp;diff=64468"/>
		<updated>2024-06-16T19:02:29Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 7.''' Examples of two different types of identification units. '''A)''' Named-individual identification unit. The data graph within the unbordered box delineates the class-affiliation of the ‘&amp;lt;u&amp;gt;apple X&amp;lt;/u&amp;gt;’ (NCIT:C71985) instance. The subject, &amp;quot;apple X,&amp;quot; is connected to its class through the property ''&amp;lt;u&amp;gt;type&amp;lt;/u&amp;gt;'' (RDF:type), while its label &amp;quot;apple X&amp;quot; is conveyed via the property ''&amp;lt;u&amp;gt;label&amp;lt;/u&amp;gt;'' (RDFS:label). The unbordered blue box designates the data graph associated with this named-individual identification unit. '''B)''' Class identification unit. The data graph of this unit, represented by the unbordered blue box, captures the label and identifier of the class ‘&amp;lt;u&amp;gt;apple&amp;lt;/u&amp;gt;’ (NCIT:C71985), the unit’s designated subject. Optionally, it includes the URI details of the ontology and the ontology version from which the class is derived. The bordered blue box designates the resource of this class identification unit.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig7_Vogt_JofBiomedSem24_15.png&amp;diff=64467</id>
		<title>File:Fig7 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig7_Vogt_JofBiomedSem24_15.png&amp;diff=64467"/>
		<updated>2024-06-16T18:56:29Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig6_Vogt_JofBiomedSem24_15.png&amp;diff=64466</id>
		<title>File:Fig6 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig6_Vogt_JofBiomedSem24_15.png&amp;diff=64466"/>
		<updated>2024-06-16T18:42:55Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 6.''' Example of a statement unit. The illustration displays a statement unit exemplifying a has-weight relation. The data graph, denoted within the blue box at the bottom, articulates the statement with &amp;quot;apple X&amp;quot; as the subject and &amp;quot;gram X&amp;quot; alongside the numerical value 204.56 as the objects. The peach-colored box encompasses the semantic-units graph, housing triples that encapsulate the semantic unit’s representation. It explicitly denotes the resource embodying the statement unit (bordered blue box), an instance of the *&amp;lt;u&amp;gt;SEMUNIT:weight statement unit&amp;lt;/u&amp;gt;* class, with &amp;quot;apple X&amp;quot; identified as the subject. Notably, the UPRI of *’&amp;lt;u&amp;gt;weight statement unit&amp;lt;/u&amp;gt;’* is also the UPRI of the semantic unit’s data graph (the unbordered subgraph in the blue box).&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig6_Vogt_JofBiomedSem24_15.png&amp;diff=64465</id>
		<title>File:Fig6 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig6_Vogt_JofBiomedSem24_15.png&amp;diff=64465"/>
		<updated>2024-06-16T18:40:47Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig5_Vogt_JofBiomedSem24_15.png&amp;diff=64464</id>
		<title>File:Fig5 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig5_Vogt_JofBiomedSem24_15.png&amp;diff=64464"/>
		<updated>2024-06-16T18:36:11Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 5.''' Classification of different categories of semantic units.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig5_Vogt_JofBiomedSem24_15.png&amp;diff=64463</id>
		<title>File:Fig5 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig5_Vogt_JofBiomedSem24_15.png&amp;diff=64463"/>
		<updated>2024-06-16T18:35:39Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig4_Vogt_JofBiomedSem24_15.png&amp;diff=64462</id>
		<title>File:Fig4 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig4_Vogt_JofBiomedSem24_15.png&amp;diff=64462"/>
		<updated>2024-06-16T18:26:30Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 4.''' A detailed machine-actionable representation of the metadata relating to a weight measurement datum. This detailed illustration presents a machine-actionable representation of a mass measurement process employing a balance. It documents metadata associated with a weight measurement datum, articulated as an RDF graph. The graph establishes connections between an instance of &amp;lt;u&amp;gt;mass measurement assay&amp;lt;/u&amp;gt; (OBI:0000445) and instances of various other classes from diverse ontologies. Noteworthy details include the identification of the measurement conductor, the location and timing of the measurement, the protocol followed, and the specific device utilized (i.e., a balance). Additionally, the graph outlines the material entity serving as the subject and input for the measurement process (i.e., &amp;quot;apple X&amp;quot;), along with specifying the resultant data encapsulated in a particular weight measurement assertion.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig4_Vogt_JofBiomedSem24_15.png&amp;diff=64461</id>
		<title>File:Fig4 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig4_Vogt_JofBiomedSem24_15.png&amp;diff=64461"/>
		<updated>2024-06-16T18:22:00Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig3_Vogt_JofBiomedSem24_15.png&amp;diff=64460</id>
		<title>File:Fig3 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig3_Vogt_JofBiomedSem24_15.png&amp;diff=64460"/>
		<updated>2024-06-16T18:16:22Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 3.''' Alternative machine-actionable representation of the data statement from Fig. 2. This graph represents the same data statement as shown in Fig. 2 Top, but applies a semantic graph model that is based on the Extensible Observation Ontology (OBOE), an ontology frequently used in the ecology community.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig3_Vogt_JofBiomedSem24_15.png&amp;diff=64459</id>
		<title>File:Fig3 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig3_Vogt_JofBiomedSem24_15.png&amp;diff=64459"/>
		<updated>2024-06-16T18:14:24Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig2_Vogt_JofBiomedSem24_15.png&amp;diff=64458</id>
		<title>File:Fig2 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig2_Vogt_JofBiomedSem24_15.png&amp;diff=64458"/>
		<updated>2024-06-16T18:14:09Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 2.''' Comparison of a human-readable statement with its machine-actionable representation as a semantic graph following the RDF syntax. Top: A human-readable statement concerning the observation that a specific apple (X) weighs 204.56 grams. Bottom: The corresponding representation of the same statement as a semantic graph, adhering to RDF syntax and following the established pattern for measurement data from the Ontology for Biomedical Investigations (OBI) of the Open Biological and Biomedical Ontology Foundry (OBO).&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig2_Vogt_JofBiomedSem24_15.png&amp;diff=64457</id>
		<title>File:Fig2 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig2_Vogt_JofBiomedSem24_15.png&amp;diff=64457"/>
		<updated>2024-06-16T18:12:16Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64456</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64456"/>
		<updated>2024-06-16T17:55:49Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: /* Conventions used in this paper */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest in the hands of domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way of modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22] &lt;br /&gt;
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables them to be referenced within triples, providing an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use '&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;' to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, such as a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something and represents something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to presentation, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references, and that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64455</id>
		<title>Journal:Semantic units: Organizing knowledge graphs into semantically meaningful units of representation</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=Journal:Semantic_units:_Organizing_knowledge_graphs_into_semantically_meaningful_units_of_representation&amp;diff=64455"/>
		<updated>2024-06-16T17:55:02Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Saving and adding more.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox journal article&lt;br /&gt;
|name         = &lt;br /&gt;
|image        = &lt;br /&gt;
|alt          = &amp;lt;!-- Alternative text for images --&amp;gt;&lt;br /&gt;
|caption      = &lt;br /&gt;
|title_full   = Semantic units: Organizing knowledge graphs into semantically meaningful units of representation&lt;br /&gt;
|journal      = ''Journal of Biomedical Semantics''&lt;br /&gt;
|authors      = Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|affiliations = TIB Leibniz Information Centre for Science and Technology, Vrije Universiteit, King Abdullah University of Science and Technology&lt;br /&gt;
|contact      = Email: lars dot m dot vogt at googlemail dot com&lt;br /&gt;
|editors      = &lt;br /&gt;
|pub_year     = 2024&lt;br /&gt;
|vol_iss      = '''15'''&lt;br /&gt;
|at           = 7&lt;br /&gt;
|doi          = [https://doi.org/10.1186/s13326-024-00310-5 10.1186/s13326-024-00310-5]&lt;br /&gt;
|issn         = 2041-1480&lt;br /&gt;
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
|website      = [https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5 https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-024-00310-5]&lt;br /&gt;
|download     = [https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf https://jbiomedsem.biomedcentral.com/counter/pdf/10.1186/s13326-024-00310-5.pdf] (PDF)&lt;br /&gt;
}}&lt;br /&gt;
{{ombox	 &lt;br /&gt;
| type      = notice	 &lt;br /&gt;
| image     = [[Image:Emblem-important-yellow.svg|40px]]	 &lt;br /&gt;
| style     = width: 500px;	 &lt;br /&gt;
| text      = This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.	 &lt;br /&gt;
}}&lt;br /&gt;
==Abstract==&lt;br /&gt;
'''Background''': In today’s landscape of [[Information management|data management]], the importance of [[knowledge graph]]s and [[Ontology (information science)|ontologies]] is escalating as critical mechanisms aligned with the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]] ask that research data and [[metadata]] be more findable, accessible, interoperable, and reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of applying FAIR concepts to research objects using knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
'''Results''': We introduce “semantic units” as a conceptual solution, although currently exemplified only in a limited prototype. Semantic units structure a knowledge graph into identifiable and [[Semantics|semantically]] meaningful subgraphs by adding another layer of triples on top of the conventional data layer. Semantic units and their subgraphs are represented by their own resource that instantiates a corresponding semantic unit class. We distinguish statement and compound units as basic categories of semantic units. A statement unit is the smallest independent proposition that is semantically meaningful for a human reader. Depending on the relation of its underlying proposition, it consists of one or more triples. Organizing a knowledge graph into statement units results in a partition of the graph, with each triple belonging to exactly one statement unit. A compound unit, on the other hand, is a semantically meaningful collection of statement and compound units that form larger subgraphs. Some semantic units organize the graph into different levels of representational granularity, others orthogonally into different types of granularity trees or different frames of reference, structuring and organizing the knowledge graph into partially overlapping, partially enclosed subgraphs, each of which can be referenced by its own resource.&lt;br /&gt;
&lt;br /&gt;
'''Conclusions''': Semantic units, applicable in RDF/OWL and labeled property graphs, offer support for making statements about statements and facilitate graph-alignment, subgraph-matching, knowledge graph profiling, and management of access restrictions to sensitive data. Additionally, we argue that organizing the graph into semantic units promotes the differentiation of ontological and discursive [[information]], and that it also supports the differentiation of multiple frames of reference within the graph.&lt;br /&gt;
&lt;br /&gt;
'''Keywords''': FAIR data and metadata, knowledge graph, OWL, RDF, semantic unit, graph organization, granularity tree, representational granularity&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
In an era marked by the exponential generation of data [1,2,3], both technically and socially intricate challenges have emerged [4], necessitating innovative approaches to data representation and [[Information management|management]] in science and industry. The growing volume of produced data requires systems capable of collecting, [[Data integration|integrating]], and [[Data analysis|analyzing]] extensive datasets from diverse sources, a critical requirement in addressing contemporary global challenges. [5] Notably, data stewardship should rest within the hands of the domain experts or institutions to ensure technical autonomy, aligning with the concept of &amp;quot;data visiting&amp;quot; rather than conventional &amp;quot;[[data sharing]].&amp;quot; [6]&lt;br /&gt;
&lt;br /&gt;
From the standpoint of data representation and management, meeting these demands relies on adherence to the [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|FAIR Guiding Principles]], which ask for research data and [[metadata]] to be readily findable, accessible, interoperable, and reusable for machines and humans alike. [7] Failure to achieve FAIRness risks transforming big data into opaque dark data. [8] Establishing the FAIRness of these research objects not only contributes to a solution for the reproducibility crisis in science [9] but also addresses broader concerns regarding the trustworthiness of [[information]] (see also the TRUST Principles of transparency, responsibility, user focus, sustainability, and technology [10]).&lt;br /&gt;
&lt;br /&gt;
To capitalize on the transformative potential of the FAIR Principles, the idea of an internet of FAIR data and services was suggested. [11] Such a framework would seamlessly scale with the demands of big data, enabling relevant data-rich institutions, research projects, and citizen-science initiatives to make their research objects universally accessible in adherence to the FAIR Guiding Principles. [12, 13] The key lies in furnishing comprehensive, machine-actionable{{Efn|Machine-actionable data and metadata are machine-interpretable and belong to a type for which operations have been specified in symbolic grammar, such as logical reasoning based on description logics for statements formalized in the Web Ontology Language (OWL) or rule-based data transformations such as unit conversion for defined types of elements.&amp;lt;ref name=&amp;quot;WEilandFDO22&amp;quot;&amp;gt;{{cite web |url=https://docs.google.com/document/d/1hbCRJvMTmEmpPcYb4_x6dv1OWrBtKUUW5CEXB2gqsRo |title=FDO Machine Actionability, Version 2.1 |author=Weiland, C.; Islam, S.; Broder, D. et al. |work=Google Docs |publisher=FDO Forum |date=19 August 2022}}&amp;lt;/ref&amp;gt;}} data and metadata, complemented by human-readable interfaces and search capabilities.&lt;br /&gt;
&lt;br /&gt;
[[Knowledge graph]]s can contribute to the needed technical frameworks, offering a structure for managing and representing FAIR data and metadata. [14] Knowledge graphs are particularly applied in the context of [[Semantics|semantic]] search based on entities and relations, deep reasoning, disambiguation of natural language, machine reading, and entity consolidation for big data and text analytics. [15]&lt;br /&gt;
&lt;br /&gt;
The distinctive graph-based abstractions inherent in knowledge graphs yield advantages over traditional [[Relational database|relational]] or other NoSQL models. These include&lt;br /&gt;
* an intuitive way of modelling relations;&lt;br /&gt;
* the flexibility to defer data schema definitions to accommodate evolving knowledge, which is especially important when dealing with incomplete knowledge; &lt;br /&gt;
* incorporation of machine-actionable knowledge representation formalisms like [[Ontology (information science)|ontologies]] and rules; &lt;br /&gt;
* deployment of graph analytics and [[machine learning]] (ML); and&lt;br /&gt;
* utilization of specialized graph query languages that support, in addition to standard relational operators such as joins, unions, and projections, also navigational operators for recursively searching for entities through arbitrary-length paths. [16,17,18,19,20,21,22] &lt;br /&gt;
&lt;br /&gt;
Moreover, the inherent semantic transparency of knowledge graphs can improve the transparency of data-based decision-making and improve the communication of data and knowledge within research and science in general. [23,24,25,26,27]&lt;br /&gt;
&lt;br /&gt;
Despite offering an appropriate technical foundation, the utilization of a knowledge graph for storing data and metadata does not inherently ensure the achievement of the FAIR Guiding Principles. Realizing FAIR research objects necessitates adherence to specific guidelines, encompassing the consistent application of adequate semantic data models tailored to distinct types of data and metadata statements. This approach is pivotal for ensuring seamless interoperability across a dataset.&lt;br /&gt;
&lt;br /&gt;
The rest of the paper is organized as follows. In the Problem statement section, we discuss three specific challenges that, from our perspective, can be effectively addressed by systematically organizing a knowledge graph into well-defined subgraphs. Prior attempts at this, such as defining a characteristic set as a subgraph based on triples that share the same resource in the ''Subject'' position, have demonstrated noteworthy enhancements in space and query performance [28, 29] (see also the related concept of RDF molecules [30, 31]), but they do not fully mitigate the challenges outlined below.&lt;br /&gt;
&lt;br /&gt;
The Results section introduces a novel concept: the partitioning and structuring of a knowledge graph into semantic units, identifiable subgraphs represented in the graph with their own resource. Semantic units are semantically meaningful units of representation, which will contribute to overcoming the challenges at hand. The concept builds upon an idea originally proposed for structuring descriptions of [[phenotype]]s into distinct subgraphs, each of which models a descriptive statement like a particular weight measurement or a particular parthood statement for a given anatomical entity. [32] Each such subgraph is organized in its own &amp;quot;Named Graph&amp;quot; and functions as the smallest semantically meaningful unit in a phenotype description. Generalizing and extending this concept, we present semantic units as accessible, searchable, identifiable, and reusable data items in their own right, forming units of representation implemented through graphs based on the [[Resource Description Framework]] (RDF) and the Web Ontology Language (OWL) or labeled property graphs. Two basic categories of semantic units—statement units and compound units—are introduced, supplementing the well-established triples and the overall graph in FAIR knowledge graphs. These units offer a structure that organizes a knowledge graph into five levels of representational granularity, from individual triples to the graph as a whole. In further refinement, additional subcategories of semantic units are proposed for enhanced graph organization. The incorporation of unique, persistent, and resolvable identifiers (UPRIs) for each semantic unit enables them to be referenced within triples, providing an efficient way of making statements about statements. The introduction of semantic units adds further layers of triples to the well-established RDF and OWL layer for knowledge graphs (Fig. 1). This augmentation aims to enhance the usability of knowledge graphs for both domain experts and developers.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 Vogt JofBiomedSem24 15.png|600px]]&lt;br /&gt;
{{clear}}&lt;br /&gt;
{| &lt;br /&gt;
 | style=&amp;quot;vertical-align:top;&amp;quot; |&lt;br /&gt;
{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; width=&amp;quot;600px&amp;quot;&lt;br /&gt;
 |-&lt;br /&gt;
  | style=&amp;quot;background-color:white; padding-left:10px; padding-right:10px;&amp;quot; |&amp;lt;blockquote&amp;gt;'''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
 |- &lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
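The partition of the data layer into statement units, each identified by its own resource, can be mimicked in a minimal, dependency-free Python sketch. This is a hedged illustration under our own assumptions, not the authors' implementation; all example.org IRIs, unit names, and predicates are hypothetical:

```python
# Hedged sketch (not the authors' implementation): "statement units" modeled
# as named subgraphs. Each unit is a list of triples stored under its own
# identifying resource (a stand-in for a UPRI), so the unit itself can be
# the subject of further triples. All example.org IRIs are hypothetical.

EX = "http://example.org/"  # hypothetical namespace

# Data layer, partitioned: unit IRI -> (subject, predicate, object) triples.
statement_units = {
    EX + "unit-001": [
        # A measurement proposition expressed in a single triple.
        (EX + "mouse1", EX + "hasWeightInGrams", 25.3),
    ],
    EX + "unit-002": [
        # A parthood proposition that needs two triples to be expressed.
        (EX + "tail1", EX + "partOf", EX + "mouse1"),
        (EX + "tail1", "rdf:type", EX + "Tail"),
    ],
}

# Statements *about* statement units, referencing the units' identifiers:
metadata = [
    (EX + "unit-001", "rdf:type", EX + "MeasurementStatementUnit"),
    (EX + "unit-002", "rdf:type", EX + "ParthoodStatementUnit"),
]

# The units partition the graph: every data triple belongs to exactly one
# statement unit, and no triple is shared between two units.
all_triples = [t for triples in statement_units.values() for t in triples]
assert len(all_triples) == len(set(all_triples))
print(f"{len(statement_units)} units covering {len(all_triples)} triples")
```

The final assertion checks the partition property stated in the abstract: collecting the triples of all units yields no duplicates, so each triple belongs to exactly one statement unit.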
&lt;br /&gt;
In the Discussion section, we discuss the benefits we see from organizing knowledge graphs into distinct knowledge graph modules (i.e., semantic units) in terms of increasing data management flexibility and explorability of the graph. We also discuss possible strategies for implementing semantic units for RDF/OWL-based and labeled-property-graph-based knowledge graphs.&lt;br /&gt;
&lt;br /&gt;
===Conventions used in this paper===&lt;br /&gt;
In this paper, the term &amp;quot;knowledge graph&amp;quot; denotes a machine-actionable semantic graph employed for the documentation, organization, and representation of data and metadata. It is essential to note that our discussion of semantic units is situated within the context of RDF-based triple stores, OWL, and Description Logics serving as a formal framework for inferencing, alongside labeled property graphs as an alternative to triple stores. We deliberately focus on these technologies as they constitute the primary technologies and logical frameworks within the knowledge graph domain, benefiting from widespread community support and established standards. We are aware of the fact that alternative technologies and frameworks exist that support an ''n''-tuples syntax and more advanced logics (e.g., First Order Logic) [33, 34], but supporting tools and applications are missing or are not widely used to turn them into well-supported, scalable, and easily usable knowledge graph applications.&lt;br /&gt;
&lt;br /&gt;
Throughout this text, &amp;lt;u&amp;gt;regular underlining&amp;lt;/u&amp;gt; is employed for indicating ontology classes, while ''&amp;lt;u&amp;gt;italicsUnderlined&amp;lt;/u&amp;gt;'' text is reserved for referencing properties. Identification (ID) numbers, formed by the ontology prefix followed by a colon and a number, uniquely specify each resource (e.g., ''&amp;lt;u&amp;gt;isAbout&amp;lt;/u&amp;gt;'' [IAO:0000136]). When a term is not yet covered in any ontology, we denote the corresponding class with an asterisk (*). New classes and properties that relate to semantic units will use the ontology prefix SEMUNIT, as in the class *&amp;lt;u&amp;gt;SEMUNIT:metric measurement statement unit&amp;lt;/u&amp;gt;*. These will be part of a future Semantic Unit ontology. We use &amp;quot;&amp;lt;u&amp;gt;regular underlined&amp;lt;/u&amp;gt;&amp;quot; to indicate instances of classes, with the label referring to the class label and the ID to the ID of the class.&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;resource&amp;quot; is employed to signify something uniquely designated, such as a Uniform Resource Identifier (URI), about which informative statements are made. It thus stands for something and represents something you want to talk about. In RDF, the ''Subject'' and the ''Predicate'' in a triple are always resources, whereas the ''Object'' can be either a resource or a literal. Resources encompass properties, instances, and classes, with properties occupying the ''Predicate'' position in a triple, instances referring to individuals (=particulars), and classes representing universals or kinds.&lt;br /&gt;
&lt;br /&gt;
To maintain clarity, resources are represented with human-readable labels in both the text and all figures, opting for the implicit assumption that each property, instance, and class possesses its UPRI. Additionally, the term &amp;quot;triple&amp;quot; refers specifically to a triple statement, while &amp;quot;statement&amp;quot; pertains to a [[Natural language processing|natural language statement]], establishing a clear distinction between the two.&lt;br /&gt;
&lt;br /&gt;
==Methods==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
{{reflist|group=lower-alpha}}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist|colwidth=30em}}&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
This presentation is faithful to the original, with only a few minor changes to layout, though grammar and word usage were substantially updated for improved readability. In some cases, important information was missing from the references; that information was added.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--Place all category tags here--&amp;gt;&lt;br /&gt;
[[Category:LIMSwiki journal articles (added in 2024)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles (all)]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on data management and sharing]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on FAIR data principles]]&lt;br /&gt;
[[Category:LIMSwiki journal articles on health informatics]]&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
	<entry>
		<id>https://www.limswiki.org/index.php?title=File:Fig1_Vogt_JofBiomedSem24_15.png&amp;diff=64454</id>
		<title>File:Fig1 Vogt JofBiomedSem24 15.png</title>
		<link rel="alternate" type="text/html" href="https://www.limswiki.org/index.php?title=File:Fig1_Vogt_JofBiomedSem24_15.png&amp;diff=64454"/>
		<updated>2024-06-16T17:44:06Z</updated>

		<summary type="html">&lt;p&gt;Shawndouglas: Added summary.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Summary==&lt;br /&gt;
{{Information&lt;br /&gt;
|Description='''Figure 1.''' Semantic units introduce additional layers atop the RDF/OWL layer of triples within a knowledge graph. The figure illustrates a partitioning of the triple layer into statement units, wherein each triple aligns with exactly one statement unit, and each statement unit contains one or more triples. Statement units can be organized into diverse types of semantically meaningful collections, denoted as compound units. Compound units serve as the basis for defining several layers that contribute to the enhanced structuring and organization of the knowledge graph in semantically meaningful ways.&lt;br /&gt;
|Source={{cite journal |title=Semantic units: Organizing knowledge graphs into semantically meaningful units of representation |journal=Journal of Biomedical Semantics |author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert |volume=15 |at=7 |year=2024 |doi=10.1186/s13326-024-00310-5}}&lt;br /&gt;
|Author=Vogt, Lars; Kuhn, Tobias; Hoehndorf, Robert&lt;br /&gt;
|Date=2024&lt;br /&gt;
|Permission=[http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== Licensing ==&lt;br /&gt;
{{cc-by-4.0}}&lt;/div&gt;</summary>
		<author><name>Shawndouglas</name></author>
	</entry>
</feed>