Difference between revisions of "LII:Organizational Memory and Laboratory Knowledge Management: Its Impact on Laboratory Information Flow and Electronic Notebooks"
'''Publication date''': April 2024
==Introduction==
LLM systems saw rapid development and deployment across almost every industry throughout 2023. Unless something drastic happens, development will only accelerate, given the potential impact on business operations and the interest of technology-driven companies. The scientific community needs to ensure not only that its unique needs (once they’ve been defined) are included in LLM development and are met, but also that the resultant output reflects empirical rigor.
==Organizational memory==
*'''Infrastructure''': Establish robust IT infrastructure capable of handling large datasets with high levels of security and accessibility. Cooperation between laboratory or scientific personnel and IT support is needed concerning access to instrument database structures, LIMS, SDMS, etc. First, a choice must be made on what to include and how to go about it without compromising lab operations or integrity.
*'''Data governance''': Develop a clear policy for data management, including [[quality control]] (QC), privacy, and sharing protocols.
*'''AI integration''': Choose and customize AI tools for data analysis, [[natural language processing]] (NLP), predictive analytics, etc., that suit the laboratory's specific needs.
*'''Training''': Ensure staff are trained in the technical skills to use the system and understand the importance of data entry and curation. Lab personnel need to understand what is being done and why. Care must be taken to ensure that only reviewed and approved material is made available to the OM system so that premature release or the release of work in progress does not occur; this work may be updated over time and the organization will want to avoid the inclusion of out-of-date material. This should be done with the cooperation of lab personnel and not as part of a corporate mandate so that researchers and scientists maintain control over their work and ensure [[data integrity]] and governance.
*'''[[Continual improvement process|Continuous improvement]]''': Regularly update the system with new data and continuously improve the AI models as more data is collected.
===What information would we put into an OM system?===
Put simply, everything could feasibly be included in such a system. This could include monthly reports, research reports, project plans, monthly summaries, test results, vendor proposals, hazardous materials records (including disposal information and health concerns such as caution statements, treatment for exposure, etc.), inventory records, production records, and anything else that might contain potentially useful information across the company. That will require a lot of organization, but what would it mean to have all that data and information continuously searchable by an intelligent assistant? Again, as noted previously, security is paramount. (Note that personnel information is intentionally omitted due to privacy and confidentiality issues.)
==Organizational memory and scientific information flow==
{| border="0" cellpadding="5" cellspacing="0" width="400px"
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |<blockquote>'''Figure 1.''' Basic K/I/D model. Databases for knowledge, information, and data (K/I/D) are represented as ovals, and the processes acting on them as arrows. A more detailed description of this model appeared originally in ''Computerized Systems in the Modern Laboratory: A Practical Guide'', though a slightly modified version of that is included in the Supplemental information as Attachment 1.</blockquote>
|-
|}
synthesis. In turn, the IDS may be a component of an LES’s work.

While Attachment 1 of the Supplemental information expands upon the K/I/D flow model in Figure 1, it does not give much attention to the synthesis process or the knowledge database(s), other than noting that they exist and contain useful OM data and information. When the models for a laboratory's K/I/D flow were first developed in the 1980s, they viewed the management of K/I/D as largely a human-driven effort, with software providing organizational assistance. With the advent of AI and OM systems, we are hoping for a significant advancement in the organization, access, and utilization of the results of scientific work.

The basic K/I/D flow model can in reality become quite complex despite its apparent simplicity, especially as we look at organizational behavior models. For example, the following diagram (Figure 3) shows three research groups working independently from a common data set (genomics research is one example), each building its own project-specific knowledgebase (a smaller OM) that will become part of a larger structure (i.e., the organizational knowledgebase) as work progresses. This staging of project- and organization-wide knowledgebases is important since it gives the researchers and project managers control over their work until they are ready to report it.
*In this case, the test method descriptions are part of the research knowledge database. This is just an example; some organizations may prefer a different structure. The key is that the diagram can be used to detail information flow to make connections and integrate systems where appropriate.
These models were originally intended to show the interactions between systems in a lab and to also extend them to show the movement of information between departments, for example, between a QC lab and production management; there the transformation between “information” in the QC lab and “data” in production takes place as it does in Figure 4, and for the same reasons. Other organizations can use similar models to detail their workflow.
===Productivity, integration, data governance, and OM===
The 1-10-100 rule can also have an impact on data integrity. The [[Food and Drug Administration]]’s (FDA's) inspection program frequently identified data integrity violations, including<ref name="NeumeyerData20">{{cite web |url=https://www.americanpharmaceuticalreview.com/Featured-Articles/565600-Data-Integrity-2020-FDA-Data-Integrity-Observations-in-Review/ |title=Data Integrity: 2020 FDA Data Integrity Observations in Review |author=Neumeyer, M. |work=American Pharmaceutical Review |date=23 June 2020 |accessdate=10 April 2024}}</ref>:
*Deletion or manipulation of data,
*Aborted sample analysis without justification,
*Invalidated out-of-specification (OOS) results without justification,
*Destruction or loss of data,
*Failure to document work contemporaneously, and
*Uncontrolled documentation.
This naturally has an impact on OM; you want to keep errors from getting into that system since the consequences can be significant.
===AI considerations===
The AI world includes a variety of domains, including NLP, [[machine learning]] (ML), recognition, recommender systems, autonomous vehicles, AI in gaming, fuzzy logic, and sentiment analysis.

An AI-driven OM would draw from knowledge management systems and AI applications in expert systems, NLP, and ML. All of these are used to manage and utilize organizational knowledge effectively. The following table (Table 2) shows a comparison of AI learning modes.
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="70%"
|-
| colspan="4" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 2.''' Comparison of three types of AI learning modes.
|-
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Learning type
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Description
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Common algorithms
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Deep learning models
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Supervised learning
| style="background-color:white; padding-left:10px; padding-right:10px;" |Uses labeled data (known input and output models) to train models.
| style="background-color:white; padding-left:10px; padding-right:10px;" |• K-Nearest Neighbor (KNN)<br />• Linear Regression (LR)<br />• Support Vector Machine (SVM)<br />• Decision Trees (DT)<br />• Random Forests (RF)
| style="background-color:white; padding-left:10px; padding-right:10px;" |[[Convolutional neural network]]s (CNNs) for image recognition, time series analysis, [[recurrent neural network]]s (RNNs) for sequential data like text or speech, and transformers for NLP
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Unsupervised learning
| style="background-color:white; padding-left:10px; padding-right:10px;" |Finds hidden patterns in unlabeled data.
| style="background-color:white; padding-left:10px; padding-right:10px;" |• K-means clustering<br />• Hierarchical clustering<br />• [[Principal component analysis]] (PCA)
| style="background-color:white; padding-left:10px; padding-right:10px;" |Autoencoders for dimensionality reduction and anomaly detection and [[generative adversarial network]]s (GANs) for generating realistic data
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Semi-supervised learning
| style="background-color:white; padding-left:10px; padding-right:10px;" |Combines labeled and unlabeled data. Useful when labeling is costly/limited.
| style="background-color:white; padding-left:10px; padding-right:10px;" |Variations of supervised and unsupervised algorithms
| style="background-color:white; padding-left:10px; padding-right:10px;" |[[Variational autoencoder]]s (VAEs) and [[deep reinforcement learning]] (DRL)
|-
|}
|}
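
To make the first two rows of Table 2 more concrete, the following is a minimal sketch in plain Python that contrasts a supervised method (k-nearest neighbor) with an unsupervised one (k-means clustering) on the same small data set. The readings, labels, and starting centers are invented purely for illustration and are not drawn from any laboratory system; a production system would more likely rely on an established ML library.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Supervised: classify `query` by majority vote of its k nearest labeled points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def kmeans_1d(values, centers, iterations=10):
    """Unsupervised: group unlabeled values around the given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical assay readings (arbitrary units) with analyst-assigned labels.
labeled = [(1.0, "pass"), (1.2, "pass"), (1.1, "pass"),
           (4.8, "fail"), (5.1, "fail"), (5.0, "fail")]
# The same readings with the labels stripped off.
unlabeled = [reading for reading, _ in labeled]

prediction = knn_predict(labeled, 4.5)                          # uses the labels
centers, clusters = kmeans_1d(unlabeled, centers=[0.0, 6.0])    # ignores the labels
```

The supervised function needs the analyst's labels to say anything about a new reading, while the clustering function recovers the two groups from the readings alone but cannot name them; that difference is the practical meaning of "labeled" versus "unlabeled" in the table.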
===Planning for OM===
A discussion about planning for developing an OM system is too ambitious for this document. Additionally, the material on this topic is growing rapidly; any details here will be quickly superseded by more comprehensive treatments. For introductory material, you might look at:
*{{cite web |url=https://www.algolia.com/blog/ai/what-does-it-take-to-build-and-train-a-large-language-model-an-introduction/ |title=What does it take to build and train a large language model? An introduction |author=Caruana, V. |publisher=Algolia |date=01 November 2023}}
*{{cite web |url=https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-%20insights/the-organization-of-the-future-enabled-by-gen-ai-driven-by-people |title=The organization of the future: Enabled by gen AI, driven by people |author=Durth, S.; Hancock, B.; Maor, D. et al. |work=McKinsey & Company |date=19 September 2023}}
*{{cite web |url=https://www.ibm.com/blog/artificial-intelligence-strategy/ |title=How to build a successful AI strategy |author=Finio, M. |work=IBM Blog |publisher=IBM |date=20 December 2023}}
*{{cite web |url=https://hbr.org/2019/07/building-the-ai-powered-organization |title=Building the AI-Powered Organization |author=Fountaine, T.; McCarthy, B.; Saleh, T. |work=Harvard Business Review |date=July 2019}}
*{{Cite journal |last=Jaillant |first=Lise |last2=Rees |first2=Arran |date=2023-05-31 |title=Applying AI to digital archives: trust, collaboration and shared professional ethics |url=https://academic.oup.com/dsh/article/38/2/571/6832097 |journal=Digital Scholarship in the Humanities |language=en |volume=38 |issue=2 |pages=571–585 |doi=10.1093/llc/fqac073 |issn=2055-7671}}
*{{Cite journal |last=Jarrahi |first=Mohammad Hossein |last2=Askay |first2=David |last3=Eshraghi |first3=Ali |last4=Smith |first4=Preston |date=2023-01 |title=Artificial intelligence and knowledge management: A partnership between human and AI |url=https://linkinghub.elsevier.com/retrieve/pii/S0007681322000222 |journal=Business Horizons |language=en |volume=66 |issue=1 |pages=87–99 |doi=10.1016/j.bushor.2022.03.002}}
==The implications for ELNs==
For an OM system to be effective, it has to be updated with new material regularly. Part of the value of such a system comes from continued updates and re-evaluation of older material against the new input. Does recent work remove a stumbling block to earlier efforts? Are there contradictions? One source of that input is ELNs used for scientific work and engineering, both in the lab and in the field. Updating the OM could be a separate effort from recording material in an ELN, requiring an interface between the two, as we have with IDSs and LIMS. Still, it would be far more effective if the update process were built into the ELNs, e.g., via an “update now” button that keeps control over when and what is updated in the user’s hands. The OM wouldn't be peeking into work as it is executed but taking it in when the researcher or lab personnel deem it appropriate, after it is reviewed and approved, for example, to prevent errors from creeping into the system. Selected methods to capture OM-relevant work must also be validated and documented in standard operating procedures (SOPs).
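
The “update now” idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the three status values, and the rule that only the entry's author may trigger the update are all hypothetical and do not describe any existing ELN or OM product.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    APPROVED = "approved"

@dataclass
class NotebookEntry:
    entry_id: str
    author: str
    text: str
    status: Status = Status.DRAFT

@dataclass
class OrganizationalMemory:
    """Stand-in for the OM store; a real system would index text for an AI assistant."""
    records: dict = field(default_factory=dict)

    def ingest(self, entry):
        # Re-submitting an entry replaces the older version, limiting stale material.
        self.records[entry.entry_id] = entry.text

def update_now(entry, requested_by, om):
    """Push one entry to the OM only when its author asks and it has been approved."""
    if requested_by != entry.author:
        return False  # control stays in the author's hands (assumed policy)
    if entry.status is not Status.APPROVED:
        return False  # drafts and unapproved work never leave the notebook
    om.ingest(entry)
    return True

# Hypothetical entries: one still in progress, one reviewed and approved.
om = OrganizationalMemory()
draft = NotebookEntry("EXP-001", "alice", "Preliminary titration results")
final = NotebookEntry("EXP-002", "alice", "Validated HPLC method", Status.APPROVED)

draft_sent = update_now(draft, "alice", om)    # refused: not approved
other_sent = update_now(final, "bob", om)      # refused: not the author
final_sent = update_now(final, "alice", om)    # accepted
```

The point of the design is that the OM never pulls from the notebook; material moves only through an explicit, checked push, which is also the step an SOP would need to describe and validate.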
This may change the definition of what an ELN is. To date, contenders for the role of electronic notebooks for laboratories, scientists, and engineers have been software tools that are independent entities (along with IDS, LIMS, SDMS, etc.) that connect to other systems to take on specialized tasks such as molecular drawing, statistics, reaction databases, chemical structure databases, and so on. If we are going to proceed with the development of OMs, we need a systems approach that integrates ELNs and other notebooks, the updating of OMs, and the use of AI to answer questions and assist research, rather than creating one more piece in a chain of software tools.

We need a system that supports:
*Document preparation for either an individual or a team to work cooperatively;
*Communications capabilities for email and the exchange of documents;
*An AI assistant, along with supporting database structures;
*Control over files for users and groups, keeping some private, and others shared, with controls over access privileges;
*Scalability, so that it can be used by a single researcher/user or teams of people; and
*Protection from vendor dominance in a facility, though a balance can be struck in achieving smoothly integrated solutions that follow coherent processes with connected technologies.
These are some basic criteria, though users will likely want to add more. The key distinction is that we aren’t looking at a single product, but a software system that has file structures, applications access, communications capability, and is independent of operating systems and hardware. The hierarchy would be ''notebook platform'' > ''operating systems'' > ''hardware''. This allows the latter two elements to change while the notebook platform is stable (though continuing to improve), consistent, and scalable, depending on user needs.

One reason that this is being addressed is that we need to change the practice of software development for laboratory and scientific use. Most software is developed by vendors based on their understanding of market needs, and what it may take to support instruments and applications environments. Users may be asked for their input, but by and large users take what vendors provide and see if it fits their needs, or can be modified to fit their needs, with both financial and operational consequences. The user pays for the initial modifications, does what is needed to meet the regulatory environments, and may need to reinvest in those modifications every time the vendor produces an update or new version.

Considering the impact AI, OMs, ELNs, and other electronic notebooks can have, we need the user community to step up and guide the development of systems rather than adapting to them.
The acceptance of ELNs was slow initially but appears to be increasing. The promise is there, but the adoption rate is below what one would have hoped for. There are several factors that can inhibit the use of ELNs:
*'''Cost''': While some ELN software is free, vendors may limit data storage, file size, and the number of users.<ref name="SchmerkerSwitch20">{{cite web |url=https://www.idtdna.com/pages/community/blog/post/thinking-about-making-the-switch-to-an-electronic-lab-notebook-here-are-some-pros-and-cons |title=Switch to an Electronic Lab Notebook? Pros and Cons |author=Schmerker, J. |publisher=Integrated DNA Technologies |date=22 June 2020 |accessdate=10 April 2024}}</ref>
*'''Portability''': If an ELN maker goes out of business or raises its prices, the information stored in that company’s product might only be exportable as a PDF, which can’t be transferred to another product.<ref name="KanzaElectronic17">{{Cite journal |last=Kanza |first=Samantha |last2=Willoughby |first2=Cerys |last3=Gibbins |first3=Nicholas |last4=Whitby |first4=Richard |last5=Frey |first5=Jeremy Graham |last6=Erjavec |first6=Jana |last7=Zupančič |first7=Klemen |last8=Hren |first8=Matjaž |last9=Kovač |first9=Katarina |date=2017-12 |title=Electronic lab notebooks: can they replace paper? |url=https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0221-3 |journal=Journal of Cheminformatics |language=en |volume=9 |issue=1 |pages=31 |doi=10.1186/s13321-017-0221-3 |issn=1758-2946 |pmc=PMC5443717 |pmid=29086051}}</ref> There is also the problem of an individual moving within or between organizations: will their work be transportable?
*'''Ease-of-use''': The ease-of-use, or lack thereof, can be a significant barrier. If the ELN is not user-friendly, it can deter researchers from adopting it.<ref name="KanzaElectronic17" />
*'''Accessibility''': There can be issues with accessibility across different devices and operating systems.
*'''Privacy''': By putting their work in an ELN, will researchers lose control over that work, and will they have to be concerned about premature release or disclosure?
*'''Scalability''': Can an ELN be used on a laptop, tablet, and desktop system, both within the organization’s facilities and outside of them (e.g., during fieldwork or travel between facilities), with appropriate security?
Factors that can increase the acceptance of ELNs:
*Early-career researchers, who grew up with digital technology, expect and embrace electronic solutions.
*Researchers who deal with increasing volumes of data can't help but find traditional paper notebooks less practical.
*Concerns about reproducibility and stricter data management requirements from funding agencies motivate improvements in lab work documentation.
*The ELN market now includes more intuitive tools, such as [[Cloud computing|cloud-based]] products, which are easier to adopt without extensive IT support.
Those lists indicate that the market wants electronic notebooks and that cultural changes have occurred to favor them, but that the “right” product has yet to emerge. It may be that the right product isn’t a traditional software entity but a system. An open-source system with user community support could provide the basis for developing a notebook that meets the needs of the scientific or engineering community.
===A potential platform for building an electronic notebook system===
One example of such a system is Nextcloud<ref name="NextcloudHome">{{cite web |url=https://nextcloud.com/ |title=Nextcloud |publisher=Nextcloud GmbH |accessdate=10 April 2024}}</ref>, an open-source system. (See Supplemental information, Attachment 2 for a summary of Nextcloud.) The key characteristics that make it (or a similar product) an attractive platform to build upon are:
*It is an open-source package. This protects against losing access to the system if the vendor goes out of business, provides for community support, and protects against loss of data. The vendor has been responsive to user input and provides frequent updates to the product. The pricing ranges from a free version to an enterprise pricing tier.
*It is scalable and can be run on a laptop for a single user, or as part of a networked multi-user configuration.
*It provides multi-user access with security controls at the file and folder level, and it supports collaborative editing of documents.
*It has built-in team communications facilities, including messaging, email, calendars, etc.
*It supports a built-in AI assistant.
*It permits migration of researchers' work as they move through an organization, or from one organization to another. In the latter case, security controls would prevent unauthorized disclosure of confidential information. A scientist could begin using the system early in their career and continue using it as their career develops.
*It supports electronic signatures.
*It is operating system- and hardware-independent.
The system also has a rudimentary audit trail facility, though it would benefit from additional work.

By itself, Nextcloud or a similar system could provide a good tool for recording observations, experiment/project planning, and recording data and information. However, to be a truly useful notebook application, it would need to be able to reach into laboratory data aggregators and generators, data systems, and instrument systems. Table 3 describes a layered structure that addresses those needs. The system is scalable and can be modified as needed.
{|
| style="vertical-align:top;" |
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="70%"
|-
| colspan="2" style="background-color:white; padding-left:10px; padding-right:10px;" |'''Table 3.''' Layered notebook structure.
|-
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Layer
! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" |Description
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Nextcloud layer: The electronic notebook
| style="background-color:white; padding-left:10px; padding-right:10px;" |• Document preparation (e.g., reports, presentations, general purpose graphics, images, spreadsheets, etc.)<br />• Communications (e.g., mail, video, messaging, etc.)<br />• Document, file sharing<br />• Scheduling, calendars
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Interface layer
| style="background-color:white; padding-left:10px; padding-right:10px;" |• Standardized connections to lower layers
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Data aggregator and analysis layer
| style="background-color:white; padding-left:10px; padding-right:10px;" |• LIMS<br />• SDMS<br />• Analysis/Database/Statistics/Chemical modelling solution
|-
| style="background-color:white; padding-left:10px; padding-right:10px;" |Data and information generators layer
| style="background-color:white; padding-left:10px; padding-right:10px;" |• IDS<br />• LES<br />• Manual methods such as testing and observation<br />• Robotics sample preparation, procedure execution<br />• External sources
|-
|}
|}
This is a logical layering and may not be confined to one computer. Most implementations would have this as a networked structure, with the interface layer providing connections to different systems.

Let's examine the four layers from Table 3:

1. '''Electronic notebook layer''': The electronic notebook would be part of the document processing system in Nextcloud or Nextcloud-like systems. This would manage scheduling, security, communications, report preparation, the notebook(s), etc. This would also be a source of material for the OM.

2. '''Interface layer''': Ideally, this would provide vendor-neutral communications to underlying software, essentially acting as [[middleware]], which would convert requests from the notebook to specific applications. This would permit easy reconfiguration if applications change, or if a researcher moved from one facility to another. The same results could be achieved using copy/paste or export/import facilities from the applications; however, a standardized layer would provide an easier-to-use interface.

3. '''Data aggregator layer''': The data aggregator level would connect to the data/information generator layers, with results working through the system to the notebook. LIMS would be an integral part of the notebook system and provide a means of ordering tests and managing results. The Nextcloud layer could be on a laptop, or a multi-user distributed system depending on the workplace’s requirements. All instrument connections would be through the LIMS and SDMS applications. This would make it easier to manage changes in the lower layer.

4. '''Data and information generators layer''': This is the most volatile level in the system, the place where procedures are executed, new ones come into practice, and old ones retire. This level is best managed by a LIMS-SDMS combination and would provide a common point of reference for all user testing.
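
The interface layer described above can be pictured as a thin adapter registry. The sketch below is hypothetical throughout: the class names, the vendor field names (<code>TST_NM</code>, <code>RSLT</code>, <code>UOM</code>), and the neutral record shape are invented for illustration and do not correspond to Nextcloud or to any real LIMS API.

```python
from typing import Protocol

class DataSource(Protocol):
    """What the interface layer expects from any lower-layer system."""
    def fetch_results(self, sample_id: str) -> dict: ...

class HypotheticalLimsAdapter:
    """Wraps a made-up LIMS whose native records use vendor-specific field names."""
    def __init__(self, native_records):
        self._native = native_records

    def fetch_results(self, sample_id):
        raw = self._native[sample_id]
        # Translate vendor fields into the neutral shape the notebook expects.
        return {
            "sample_id": sample_id,
            "test": raw["TST_NM"],
            "value": raw["RSLT"],
            "units": raw["UOM"],
        }

class InterfaceLayer:
    """Routes notebook requests to whichever adapter is registered under a name."""
    def __init__(self):
        self._sources = {}

    def register(self, name, source: DataSource):
        self._sources[name] = source

    def request(self, name, sample_id):
        return self._sources[name].fetch_results(sample_id)

# The notebook only ever talks to the interface layer, never to the LIMS directly.
interface = InterfaceLayer()
interface.register(
    "lims",
    HypotheticalLimsAdapter({"S-100": {"TST_NM": "pH", "RSLT": 7.2, "UOM": "pH units"}}),
)
result = interface.request("lims", "S-100")
```

If the laboratory later replaces its LIMS, only a new adapter needs to be written and registered under the same name; the notebook-side call stays unchanged, which is the reconfiguration benefit the interface layer is meant to provide.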
==In closing...==
The development of LLM and AI systems provides a means of improving the ROI for laboratory and scientific work. To be fully useful, there needs to be a method of updating the LLM database with new material. One approach is to build an ELN or other electronic notebook system that can feed the database as well as support research, development, and other laboratory activities. This work describes one such system.
{{reflist|group=lower-alpha}}
==Supplemental information==
*Attachment 1: ''[[LII:Laboratory Informatics: Information and Workflows|Laboratory Informatics: Information and Workflows]]'', on LIMSwiki
*Attachment 2: "[https://www.limsforum.com/nextcloud-com-and-laboratory-informatics/111171/ Nextcloud.com and Laboratory Informatics]," at LIMSforum
==Abbreviations, acronyms, and initialisms==
*'''AI''': [[artificial intelligence]]
*'''CNN''': [[convolutional neural network]]
*'''DCMI''': Dublin Core Metadata Initiative
*'''DMS''': [[document management system]]
*'''DRL''': [[deep reinforcement learning]]
*'''ELN''': [[electronic laboratory notebook]]
*'''ES''': external source
*'''GAN''': [[generative adversarial network]]
*'''IDS''': instrument data system
*'''K/I/D''': knowledge, information, and data
*'''LES''': [[laboratory execution system]]
*'''LIMS''': [[laboratory information management system]]
*'''LIS''': [[laboratory information system]]
*'''LLM''': [[large language model]]
*'''ML''': [[machine learning]]
*'''NLP''': [[natural language processing]]
*'''OM''': organizational memory
*'''OOS''': out-of-specification
*'''PCA''': [[principal component analysis]]
*'''PSK''': project-specific knowledge
*'''QC''': [[quality control]]
*'''RNN''': [[recurrent neural network]]
*'''ROI''': return on investment
*'''SDMS''': [[scientific data management system]]
*'''SOP''': [[standard operating procedure]]
*'''VAE''': [[variational autoencoder]]
==About the author==
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and QC environments.
==References==
Latest revision as of 18:40, 11 April 2024
Title: Organizational Memory and Laboratory Knowledge Management: Its Impact on Laboratory Information Flow and Electronic Notebooks
Author for citation: Joe Liscouski, with editorial modifications by Shawn Douglas
License for content: Creative Commons Attribution-ShareAlike 4.0 International
Publication date: April 2024
Introduction
Beginning in the 1960s, the application of computing to laboratory work was focused on productivity: the reduction of the amount of work and cost needed to generate results, along with an improvement in the return on investment (ROI). This was very much a bottom-up approach, addressing the most labor-intensive issues and moving progressively to higher levels of data and information processing and productivity.
The efforts began with work at Perkin-Elmer, Nelson Analytical, Spectra Physics, Digital Equipment Corporation, and many others on the computer-controlled recording and processing of instrument data. Once we learned how to acquire the data, robotic tools were introduced to help process samples and make them ready for introduction into instruments; combined with a computer connection for data acquisition, this further increased productivity. That was followed by an emphasis on the storage, management, and analysis of that data through the application of laboratory information management systems (LIMS) and other software. With the recent development of artificial intelligence (AI) systems and large language models (LLMs), we are ready to consider the next stage in automation and systems application: organizational memory and laboratory knowledge management.
This piece discusses the convergence of a set of technologies and their application to scientific work. The development of software systems like ChatGPT, Gemini, and others[1] means that with a bit of effort the ROI in research and testing can be greatly improved.
The initial focus herein is on using LLMs to create an effective organizational memory (OM) and how that OM can benefit scientific organizations. Following that, we'll examine how that potential technology impacts information flow, integration, and productivity, as well as what it could mean for developing electronic laboratory notebooks (ELNs). We’ll also have to extend that discussion to having AI and OM systems work with LIMS, scientific data management systems (SDMS), instrument data systems (IDSs), engineering tools, and field work found in various industries.
This work is not a "how to acquire and implement" article but rather a prompt for "something to think about and pursue" if it makes sense within your organization. The idea is the creation of an effective OM (i.e., an extensive document and information database) that fills a gap in scientific and laboratory informatics[a], one that can be used effectively with an AI tool to search, organize, synthesize, and present material in an immediately applicable way. We need to seriously think about what we want from these systems and what our requirements are for them before the rapid pace of development produces products that need extensive modifications to be useful in scientific, laboratory, field, and engineering work.
Why should you read this?
Most of the products used in scientific work (whether in the lab, field, office, etc.) are designed for a specific application (working with instruments, for example) or adapted from general-purpose tools used in various industries and settings. The ideas discussed here need further development, as do the tools specifically for the needs of the scientific community. Still, that work needs to begin as a community effort to gain possible benefits. We need to guide the development of technologies so that they meet the needs of the scientific community rather than try to adapt them once they are delivered to the general marketplace.
LLM systems have shown rapid development and deployment in almost every facet of industries throughout 2023. Unless something drastic happens, development will only accelerate, given the potential impact on business operations and interest of technology-driven companies. The scientific community needs to not only ensure its unique needs (once they’ve been defined) are included in LLM development and are met, but also that the resultant output reflects empirical rigor.
Organizational memory
A researcher came into our analytical lab and asked about some results reported a few years earlier. One chemist recalled the project as well as the person in charge of that work, who had since left the company. The researcher thought he had a better approach to the problem being studied in the original work and was asked to investigate it. The bad news is that all the work, both analytical and previous research notes, was written into paper laboratory notebooks (1960s). Because of their age, they had left the library and were stored in banker’s boxes in a trailer in the parking lot. There, they were subject to water damage and rodents. Most of that material was unusable, and the investigation was dropped.
Many laboratories have similar stories to the above, lamenting the loss of knowledge within the overall organization due to poor knowledge management practices. Knowledge management has been a human activity for thousands of years, since the first pictographs were placed on cave walls. The technology being used, the amount of knowledge generated, and our ability to work with it have changed over many centuries. Today, the subject of organizational knowledge management has seen evolving interest as organizations have moved from disparate archives and libraries of physical documents to more organized "computer-based organizational memories"[2] where a higher level of productivity can be had.
Walsh and Ungson define organizational memory as "stored information from an organization's history that can be brought to bear on present decisions," with that information being "stored as a consequence of implementing decisions to which they refer, by individual recollections, and through shared interpretations."[2] Until recently, many electronic approaches to OM development have relied on document management systems (DMSs) with keyword indexing and local search engines for retrieval. While those are a start, we need more; search engines still rely too heavily on people to sort through their output to find and organize relevant material.
Recently, particularly in 2023, AI systems like the notable ChatGPT[3] have offered a means of searching, organizing, and presenting material in a form that requires little additional human effort. Initial versions have had several issues (e.g., "hallucinations," a tame way of saying the AI fabricates and falsifies data and information[4]), but as new models and tools are developed to better address these issues[5][6], sufficient improvement may be shown so that those AI systems eventually may deliver on their potential. Outside of ChatGPT, there are similar systems available (e.g., Microsoft Copilot and Google Gemini), and more are likely under development. Our intent is not to make a comparison since any effort will quickly become outdated.
Why are organizational memory systems important?
Research and development and supporting laboratory activities can be an expensive operation. ROI is one measure of the wisdom behind the investment in that work, which can be substantively affected by the informatics environment within the laboratory and the larger organization of which it is a part. We'll take a brief look at three approaches to OM systems: paper-based, electronic, and AI-driven systems.
1. Paper-based systems: Paper-based systems pose a high risk of knowledge loss. While paper notebooks are in active use, the user knows the contents and can find material quickly. However, once the notebook is filled and put first in a library and then in an archive, the memory of what is in it fades. Once the original contributor leaves his post (due to promotion, transfers, or outside employment), you’re left depending on someone's recall or brute force searching to retrieve the contents. The cost of using that paper-based work and trying to gain benefit from it increases significantly, and the benefit is questionable depending on the ability of the information to be found, understood, and put to use. All of this assumes that the material hasn’t been damaged or lost. Paper-based lab notebooks create a knowledge bottleneck. Digital solutions are needed for secure, long-term storage and efficient searchability of experimental data.
2. Electronic systems and search engines: Analytical and experimental reports, as well as other organizational documents, can be entered into a DMS with suitable keyword entries (i.e., metadata)[7], indexed, and searched via search engines local to the organization or lab. The problem with this approach is that you get a list of reference documents that must be reviewed manually to ferret out and organize relevant content, which is time-consuming and expensive. This work has to be prioritized along with other demands on people’s time. If a LIMS (whether a true LIMS or a LIMS-like spreadsheet implementation) or an SDMS is used, the search may not include material in these systems but may be limited to descriptions in reports. Until the advent of popularized AI in 2023, readily available capabilities faced limitations. Only organizations with substantial budgets and resources could independently pursue more comprehensive technologies.
3. AI-driven systems: Building upon electronic systems with query capability, we can use the stored documents to train and update an AI assistant (a special purpose variation of ChatGPT, Watsonx 5, or other AI, for example). Variations can be created that are limited to private material to provide data security, and later they may be extended to public documents on the internet with controls to avoid information leakage. Based on the material available to date and at least one user’s experience using ChatGPT v4, the results of a search question provided by the AI system were more comprehensive, better organized, and presented in a readable and useable fashion that made it immediately useful, instead of simply providing a starting point for further research work. One change noted from earlier AI models is a lower tendency to provide false references, and the references provided are seemingly more relevant, summarized, and accurate. (Note: Any information an AI provides should be checked for accuracy before use.) An additional benefit is that its incorporation becomes synergistic as more material is provided. Connecting an AI to a LIMS or SDMS would provide additional benefits. However, extreme care must be taken to prevent premature disclosure of results before they are signed off, and data security has to be a high priority.
Examples of some of these OM system approaches include:
- NASA Lessons Learned: A publicly searchable "database of lessons learned from contributors across NASA and other organizations," containing "the official, reviewed learned lessons from NASA programs and projects"[8]
- Xerox's Eureka: A service technician database that is credited with improving service and reducing equipment downtime[9]
- Salesforce's Einstein and Service Cloud: An AI-driven database for customer service issues used to improve operations internally and externally[10]
There are likely many more examples of OM work going on in companies that is kept confidential. Large biopharma operations, for example, are expected to be working on methods of organizing and mining their extensive internal research databases.
Of the three approaches noted above, the last provides the best opportunity to increase the ROI on laboratory work. It reduces the amount of additional human effort needed to make use of lab results. Yet how it is implemented can make a significant difference in the results. There are several key benefits to implementing such an AI-driven systems approach:
- Such a system can capture and retain past work, putting it in an environment where its value or utility will continue to be enhanced by making it available to more sophisticated analysis methods, as they are developed, and more projects, as they become defined. This includes mitigating the effects of staff turnover (i.e., forgetting what had been done) and improving data organization. However, for this to be most effective, several steps must be taken. Digital repositories must be created on centralized databases where all lab reports, results, and presentations are stored. Additionally, any entered data should be governed by standardized formats and protocols to ensure consistency and retrievability. Finally, metadata tagging needs to be robust, allowing data and information to be tagged with keywords, project names, dates, etc. for easier searching and retrieval. (This metadata approach may be driven by initiatives such as the Dublin Core Metadata Initiative [DCMI].[7])
- Such a system can broaden the scope of material that can be used to analyze past and current work, drawing upon internal and external resources while enforcing proper security controls.
- Such a system has the ability to analyze or re-analyze past work. Working with the amount of data generated by laboratory work can be daunting. An AI organizational memory system should be capable of continuously analyzing incoming data and re-analyzing past data to gain new insights, particularly if it can access external data with appropriate security protocols. This would include the ability to notify researchers of relevant new findings (e.g., via RSS feeds and database integrations) or remind them of past work that could be applied to current projects.
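The metadata tagging mentioned in the first benefit above can be sketched in a few lines. This is a minimal illustration only, assuming a small in-memory collection: the `OMRecord` class, its field names (loosely modeled on Dublin Core elements such as title, creator, date, and subject), the `find_records` helper, and the sample records are all hypothetical and not taken from any particular DMS.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class OMRecord:
    """Hypothetical OM entry; fields loosely follow Dublin Core element names."""
    title: str
    creator: str
    created: date
    subject: set = field(default_factory=set)  # keyword tags
    project: str = ""


def find_records(records, keyword=None, project=None):
    """Return records matching a keyword (case-insensitive) and/or a project name."""
    hits = []
    for rec in records:
        tags = {s.lower() for s in rec.subject}
        if keyword is not None and keyword.lower() not in tags:
            continue
        if project is not None and rec.project != project:
            continue
        hits.append(rec)
    return hits


# Invented sample records for illustration only.
records = [
    OMRecord("Polymer stability report", "A. Chemist", date(2021, 3, 4),
             {"polymer", "stability"}, "Project-A"),
    OMRecord("HPLC method validation", "B. Analyst", date(2022, 7, 19),
             {"HPLC", "validation"}, "Project-B"),
]

matches = find_records(records, keyword="hplc")
```

Consistent tags of this kind are what make later retrieval possible, whether the query comes from a conventional search engine or an AI assistant working over the same repository.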
In addition, a well-designed OM system can improve:
- Knowledge retention: Organizations recognize that employee turnover is inevitable. When employees leave, they take their knowledge and experience with them. Organizational memory helps capture this invaluable tacit knowledge, ensuring that critical information, technical nuances, and expertise remain within the company.
- Efficiency and productivity: Having an accessible repository of past projects, decisions, and outcomes allows current employees to learn from previous successes and mistakes. This can significantly reduce redundant efforts, accelerate training, and improve decision-making processes.
- Innovation and competitive advantage: Companies can foster innovation by effectively utilizing past knowledge. Understanding historical context, past experiments, and the evolution of products or strategies can inspire new ideas and prevent reinvention of the wheel. This ongoing learning can be a significant competitive advantage, by reducing the risk of making uninformed or repetitive mistakes and by providing a deeper understanding of previous obstacles and potential solutions.
- Risk management: Organizational memory can play a crucial role in risk management. Companies can better anticipate and mitigate risks by maintaining records of past incidents, responses, and outcomes. This is particularly important in regulated industries with extensive compliance requirements.
- Cultural continuity: Organizational memory contributes to the building and preserving of institutional culture. Stories, successes, failures, and milestones form a narrative that helps inculcate values, mission, and vision among employees.
(Note: Some information in the previous two sets of bullet points was suggested by ChatGPT v4. While ChatGPT was used in the research phase of this piece, primarily for making inquiries about topics and testing ideas, the writing is the author’s effort and responsibility.)
Implementing a modern OM system, particularly an AI-driven one, involves numerous considerations that should be addressed beforehand. One significant issue is the impact on personnel. What we are discussing is the development of a tool that can be used by researchers, scientists, and organizations to further their work and take advantage of past efforts. People are often possessive about their work even though they understand it belongs to those paying for its execution. They don't want their work released prematurely, nor do they want to feel that someone or something is watching their work as it develops. The development of a system needs to emphasize that this is a tool, perhaps a guide, but not an evaluator or potential replacement. Trust-building through shared ethical principles can facilitate collaboration among lab members.
Other considerations that should be made before implementing AI-driven OM systems include:
- Infrastructure: Establish robust IT infrastructure capable of handling large datasets with high levels of security and accessibility. Cooperation between laboratory or scientific personnel and IT support is needed concerning access to instrument database structures, LIMS, SDMS, etc. First, a choice must be made on what to include and how to go about it without compromising lab operations or integrity.
- Data governance: Develop a clear policy for data management, including quality control (QC), privacy, and sharing protocols.
- AI integration: Choose and customize AI tools for data analysis, natural language processing (NLP), predictive analytics, etc., that suit the laboratory's specific needs.
- Training: Ensure staff are trained in the technical skills to use the system and understand the importance of data entry and curation. Lab personnel need to understand what is being done and why. Care must be taken to ensure that only reviewed and approved material is made available to the OM system so that premature release or the release of work in progress does not occur; this work may be updated over time and the organization will want to avoid the inclusion of out-of-date material. This should be done with the cooperation of lab personnel and not as part of a corporate mandate so that researchers and scientists maintain control over their work and ensure data integrity and governance.
- Continuous improvement: Regularly update the system with new data and continuously improve the AI models as more data is collected.
- Security: Unless care is taken, an AI system can expose confidential information to the outside world. Take measures to ensure that internal and external sources of information are separated and that internal sources are protected against intrusion and leaking.
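The point made under "Training," that only reviewed and approved material should reach the OM and that out-of-date versions should be avoided, can be sketched as a simple release gate. This is an illustrative sketch under stated assumptions: the `Document` class, the workflow status values ("draft", "in_review", "approved"), and `select_for_om` are hypothetical, and a real system would draw this state from its document management or LIMS workflow.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    status: str   # hypothetical workflow states: "draft", "in_review", "approved"
    version: int


def select_for_om(documents):
    """Keep only the latest approved version of each document."""
    latest = {}
    for doc in documents:
        if doc.status != "approved":
            continue  # drafts and work in progress never reach the OM
        current = latest.get(doc.doc_id)
        if current is None or doc.version > current.version:
            latest[doc.doc_id] = doc  # supersede older approved versions
    return sorted(latest.values(), key=lambda d: d.doc_id)


docs = [
    Document("R-1", "approved", 1),
    Document("R-1", "approved", 2),
    Document("R-1", "draft", 3),      # newer, but not yet signed off
    Document("R-2", "in_review", 1),
]
released = select_for_om(docs)
```

Here only version 2 of "R-1" is released: the unapproved draft and the document still in review stay under the researchers' control.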
What information would we put into an OM system?
Put simply, everything could feasibly be included in such a system. This could include monthly reports, research reports, project plans, monthly summaries, test results, vendor proposals, hazardous materials records (including disposal information and health concerns such as caution statements, treatment for exposure, etc.), inventory records, production records, and anything else that might contain potentially useful information across the company. That will require a lot of organization, but what would it mean to have all that data and information continuously searchable by an intelligent assistant? Again, as noted previously, security is paramount. (Note that personnel information is intentionally omitted due to privacy and confidentiality issues.)
Organizational memory and scientific information flow
The introduction of laboratory informatics into scientific work is often on an "as needed" basis. An instrument is purchased, and, in most cases, it is either accompanied by an external computer or has one within it. Regardless, the end result is the same: a computer is in the lab, and the subject of scientific and laboratory informatics begins to take shape. As the work develops, more computerized equipment is put in place, and the informatics landscape grows. The point is that computer systems are set in place to support software tools to solve particular problems, such as data management, inventory management, etc., but these aren’t planned acquisitions that are designed to fit into a pre-described informatics architecture. If we are going to begin thinking in terms of OM and its effective advancement and use, an architecture is called for to make sure that the OM system is fed the materials it needs and that the AI component has material to work with. (Note: Our emphasis is going to be on the OM; the AI is just a tool for accessing, extracting, and working with the OM contents.)
Scientific and laboratory information flow
The basic flow model of a lab's knowledge, information, and data (K/I/D) is represented in Figure 1.
Figure 2 zooms into the top of that model and highlights the position of AI-driven OM within the greater K/I/D model.
The K/I/D model of Figure 1 also highlights the three associated databases for knowledge, data, and information. Each has its own technologies, as seen in Table 1.
In regards to Table 1, there are a few points worth noting. First is the number of places that "instrument data systems" is found. An IDS is a tool that participates in several sub-processes during "measurements & experiments," "conversion," "data [storage]," and "analysis"; some of these are not obvious. During measurements and experiments, the IDS is connected to the measuring device's analog output stream and converts the continuous signal flow into a series of discrete numerical values via an analog-to-digital converter. A conversion process then turns those values into a set of descriptive numbers that are used in a later process. For example, in instrumental techniques whose output is a series of peaks, the stream of converted analog measurements becomes peak position, height, area, width, etc., that are used in the quantitative analysis of samples. Then those measurements and converted values are stored within the IDS's database. Finally, the descriptive values from several samples and reference standards are further processed to calculate the results of the analysis of each sample for components of interest. Those values are stored in a connected information database like a LIMS. Additionally, some databased data, information, and process elements may be viewable within a spreadsheet application for greater flexibility.
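The conversion step described above, in which a stream of digitized readings becomes peak position, height, area, and width, can be sketched as follows. This is a deliberately simplified illustration, assuming a uniformly sampled signal, a flat baseline, and a fixed threshold; actual IDS software uses far more sophisticated baseline correction and peak detection. The function name `find_peaks`, its parameters, and the sample signal are hypothetical.

```python
def find_peaks(signal, dt, threshold):
    """Reduce a digitized signal to a list of peak descriptors.

    signal:    detector readings, already converted from analog to digital
    dt:        sampling interval (e.g., in seconds)
    threshold: readings above this value are treated as part of a peak
    """
    signal = list(signal)
    peaks = []
    start = None
    # A trailing sentinel below any threshold closes a peak that runs to the end.
    for i, value in enumerate(signal + [float("-inf")]):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            segment = signal[start:i]
            height = max(segment)
            apex = start + segment.index(height)
            # Trapezoidal area between the first and last samples of the peak.
            area = sum((a + b) / 2.0 * dt for a, b in zip(segment, segment[1:]))
            peaks.append({
                "position": apex * dt,
                "height": height,
                "area": area,
                "width": (i - 1 - start) * dt,  # span of samples above threshold
            })
            start = None
    return peaks


# Invented readings containing two simple triangular peaks.
readings = [0, 0, 1, 3, 1, 0, 0, 2, 4, 2, 0]
peaks = find_peaks(readings, dt=0.5, threshold=0.5)
```

The descriptors returned here correspond to the "descriptive numbers" that are later combined with reference standards to calculate sample results.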
Note that laboratory execution systems (LESs) are also found in several processes because they are supervisory sub-processes monitoring all or part of, for example, an analytical or material synthesis. In turn, the IDS may be a component of an LES’s work.
While Supplemental information, Attachment 1 expands upon the K/I/D flow model in Figure 1, it does not give much attention to the synthesis process nor the knowledge database(s), other than noting that they exist and contain useful OM data and information. When the models for a laboratory's K/I/D flow were first developed in the 1980s, they essentially viewed the management of K/I/D as largely a human-driven effort, with software providing organizational assistance. With the advent of AI and OM systems, we are hoping for a significant advancement in organization, access, and utilization of the results of scientific work.
The basic K/I/D flow model can become quite complex in practice despite its apparent simplicity, especially as we look at organizational behavior models. For example, the following diagram (Figure 3) shows three research groups working independently from a common data set (genomics research is one example), each building its own project-specific knowledge base (a smaller OM) that will become part of a larger structure (i.e., the organizational knowledge base) as work progresses. This staging of project- and organization-wide knowledge bases is important since it gives the researchers and project managers control over their work until they are ready to report it.
As we add more components and intra-organizational interactions, the need for an OM system becomes more important. The following diagram (Figure 4) shows the addition of a dedicated testing group to Figure 3.
There are a couple of points worth noting about Figure 4:
- Test requests are sent to the “information” database, a LIMS, where they would be loaded and scheduled, and when the testing is completed, updated with the associated test results.
- The test results, upon completion, are sent to the research "data" database, not an "information" database. From the standpoint of the testing group, the test results are information (e.g., "sample A has X ppm of a chemical"), but that is just a data point in the research system.
- In this case, the test method descriptions are part of the research knowledge database. This is just an example; some organizations may prefer a different structure. The key is that the diagram can be used to detail information flow to make connections and integrate systems where appropriate.
These models were originally intended to show the interactions between systems in a lab and to also extend them to show the movement of information between departments, for example, between a QC lab and production management; there the transformation between “information” in the QC lab and “data” in production takes place as it does in Figure 4, and for the same reasons. Other organizations can use similar models to detail their workflow.
Productivity, integration, data governance, and OM
The reason we were concerned with information flow in the prior subsection was to look for ways to streamline lab operations and increase productivity. Increasing productivity usually means dependence on electronic systems and automation; removing humans from processes increases throughput, as well as the reliability of information and data. It also minimizes errors and the cost of fixing them, as long as we are dealing with well-designed and validated systems.
Labovitz et al.[11] created the 1-10-100 rule, which states that data entry errors multiply costs exponentially according to the stage at which they are identified and corrected. If it costs you a dollar to fix a data entry as soon as it's made, the cost will be ten dollars at the next step of the process, perhaps when it is used as part of a calculation. If the error persists and is reported as part of an analytical sample report, it may cost $100 to fix, plus the embarrassment caused by the error. Those dollar figures are in 1992 valuations; $100 in 1992 is equal to $223 in 2024.[12]
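The arithmetic of the 1-10-100 rule, including the inflation adjustment quoted above, can be worked through in a short sketch. The stage labels, constant names, and `correction_cost` function are hypothetical; the multipliers and the $100-to-$223 conversion are taken from the figures in the preceding paragraph.

```python
# Cost multipliers from the 1-10-100 rule; the stage labels are illustrative.
STAGE_MULTIPLIERS = {
    "at_entry": 1,         # caught as soon as the entry is made
    "downstream_use": 10,  # caught when the value is used, e.g., in a calculation
    "reported": 100,       # caught after appearing in a sample report
}

# Ratio implied by the text: $100 in 1992 is roughly $223 in 2024.
INFLATION_1992_TO_2024 = 223 / 100


def correction_cost(stage, base_cost=1.0, adjust_to_2024=False):
    """Estimated cost of fixing one data entry error found at the given stage."""
    cost = base_cost * STAGE_MULTIPLIERS[stage]
    if adjust_to_2024:
        cost *= INFLATION_1992_TO_2024
    return cost


cost_1992 = correction_cost("reported")
cost_2024 = correction_cost("reported", adjust_to_2024=True)
```

Under these assumptions, an error that would have cost one dollar to fix at entry costs about $223 in 2024 terms once it reaches a report, before counting any reputational damage.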
The 1-10-100 rule can also have an impact on data integrity. The Food and Drug Administration’s (FDA's) inspection program frequently identified data integrity violations, including[13]:
- Deletion or manipulation of data,
- Aborted sample analysis without justification,
- Invalidated out-of-specification (OOS) results without justification,
- Destruction or loss of data,
- Failure to document work contemporaneously, and
- Uncontrolled documentation.
This naturally has an impact on OM; you want to prevent errors from getting into that system since the consequences can be significant.
With the consideration of OM, the models take on additional benefit by providing a means of looking at the definition of “knowledge” in each organization and detailing what elements should be in a local OM system and what should move into an organization-wide OM system, as shown in Figure 3.
Earlier, we noted that an OM should contain "everything." To some, that is overkill; the raw data collected by an instrument may have little use outside the lab or an IDS, but a lot depends on the lab's data archiving practices. If they aren't well structured, "everything" isn't a bad idea, as the data is somewhere. This is particularly important as lab instrumentation changes, with older systems being retired and new ones that may be incompatible with older equipment introduced. Whether or not you build an OM, a comprehensive data management architecture is needed.
AI considerations
The AI world includes a variety of domains, including NLP, machine learning (ML), recognition, recommender systems, autonomous vehicles, AI in gaming, fuzzy logic, and sentiment analysis.
An AI-driven OM would draw from knowledge management systems and AI applications in expert systems, NLP, and ML. All of these are used to manage and utilize organizational knowledge effectively. The following table (Table 2) shows a comparison of AI learning modes.
[Table 2. Comparison of AI learning modes]
Planning for OM
A discussion about planning for developing an OM system is too ambitious for this document. Additionally, the material on this topic is growing rapidly; any details here will be quickly superseded by more comprehensive treatments. For introductory material, you might look at:
- Caruana, V. (1 November 2023). "What does it take to build and train a large language model? An introduction". Algolia. https://www.algolia.com/blog/ai/what-does-it-take-to-build-and-train-a-large-language-model-an-introduction/.
- Durth, S.; Hancock, B.; Maor, D. et al. (19 September 2023). "The organization of the future: Enabled by gen AI, driven by people". McKinsey & Company. https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-organization-of-the-future-enabled-by-gen-ai-driven-by-people.
- Finio, M. (20 December 2023). "How to build a successful AI strategy". IBM Blog. IBM. https://www.ibm.com/blog/artificial-intelligence-strategy/.
- Fountaine, T.; McCarthy, B.; Saleh, T. (July 2019). "Building the AI-Powered Organization". Harvard Business Review. https://hbr.org/2019/07/building-the-ai-powered-organization.
- Jaillant, Lise; Rees, Arran (31 May 2023). "Applying AI to digital archives: trust, collaboration and shared professional ethics" (in en). Digital Scholarship in the Humanities 38 (2): 571–585. doi:10.1093/llc/fqac073. ISSN 2055-7671. https://academic.oup.com/dsh/article/38/2/571/6832097.
- Jarrahi, Mohammad Hossein; Askay, David; Eshraghi, Ali; Smith, Preston (1 January 2023). "Artificial intelligence and knowledge management: A partnership between human and AI" (in en). Business Horizons 66 (1): 87–99. doi:10.1016/j.bushor.2022.03.002. https://linkinghub.elsevier.com/retrieve/pii/S0007681322000222.
The implications for ELNs
For an OM system to be effective, it has to be updated with new material regularly. Part of the value of such a system comes from continued updates and re-evaluation of older material against the new input. Does recent work remove a stumbling block to earlier efforts? Are there contradictions? One source of that input is ELNs used for scientific work and engineering, both in the lab and in the field. Updating the OM could be a separate effort from recording material in an ELN, requiring an interface between the two, as we have with IDSs and LIMS. Still, it would be far more effective if the update process were built into the ELNs, e.g., via an “update now” button that keeps control over when and what is updated in the user’s hands. The OM wouldn't be peeking into work as it is executed but taking it in when the researcher or lab personnel deem it appropriate, after it is reviewed and approved, for example, to prevent errors from creeping into the system. Selected methods to capture OM-relevant work must also be validated and documented in standard operating procedures (SOPs).
This may change the definition of what an ELN is. To date, contenders for the role of electronic notebooks for laboratories, scientists, and engineers have been software tools that are independent entities (along with IDS, LIMS, SDMS, etc.) that connect to other systems to take on specialized tasks such as molecular drawing, statistics, reaction databases, chemical structure databases, and so on. If we are going to proceed with the development of OMs, we need a systems approach that integrates ELNs and other notebooks, the updating of OMs, and the use of AI to answer questions and assist research, rather than just creating one more piece in a chain of software tools.
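The user-controlled "update now" idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the class names, the three-state review workflow, and the in-memory store are all hypothetical stand-ins, and a real ELN or OM would have its own data model, validation, and SOP-driven review process.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical review states for a notebook entry."""
    DRAFT = "draft"
    REVIEWED = "reviewed"
    APPROVED = "approved"


@dataclass
class NotebookEntry:
    entry_id: str
    text: str
    status: Status = Status.DRAFT


@dataclass
class OrganizationalMemory:
    """Stand-in for an OM store; here it is just a dictionary."""
    records: dict = field(default_factory=dict)

    def ingest(self, entry: NotebookEntry) -> None:
        self.records[entry.entry_id] = entry.text


def update_now(entry: NotebookEntry, om: OrganizationalMemory) -> bool:
    """User-triggered push: only approved entries reach the OM."""
    if entry.status is not Status.APPROVED:
        return False  # work in progress stays private to the notebook
    om.ingest(entry)
    return True


if __name__ == "__main__":
    om = OrganizationalMemory()
    entry = NotebookEntry("EXP-001", "Assay repeated with new buffer.")
    print(update_now(entry, om))   # draft entry is refused
    entry.status = Status.APPROVED
    print(update_now(entry, om))   # approved entry is accepted
```

The design choice being illustrated is that the OM never pulls from the notebook; the researcher pushes, and the approval check is the gate that keeps unreviewed material out.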
We need a system that supports:
- Document preparation for either an individual or a team to work cooperatively;
- Communications capabilities for email and the exchange of documents;
- An AI assistant, along with supporting database structures;
- Control over files for users and groups, keeping some private, and others shared, with controls over access privileges;
- Scalability, so that it can be used by a single researcher/user or teams of people; and
- Protection from vendor dominance in a facility, though a balance can be struck in achieving smoothly integrated solutions that follow coherent processes with connected technologies.
These are some basic criteria, though users will likely want to add more. The key distinction is that we aren’t looking at a single product, but a software system that has file structures, applications access, communications capability, and is independent of operating systems and hardware. The hierarchy would be notebook platform > operating systems > hardware. This allows the latter two elements to change while the notebook platform is stable (though continuing to improve), consistent, and scalable, depending on user needs.
One reason that this is being addressed is that we need to change the practice of software development for laboratory and scientific use. Most software is developed by vendors based on their understanding of market needs, and what it may take to support instruments and applications environments. Users may be asked for their input, but by and large users take what vendors provide and see if it fits their needs, or can be modified to fit their needs, with both financial and operational consequences. The user pays for the initial modifications, does what is needed to meet the regulatory environments, and may need to reinvest in those modifications every time the vendor produces an update or new version.
Considering the impact AI, OMs, ELNs, and other electronic notebooks can have, we need the user community to step up and guide the development of systems rather than adapting to them.
The acceptance of ELNs was slow initially but appears to be increasing. The promise is there, but the adoption rate is below what one would have hoped for. There are several factors that can inhibit the use of ELNs:
- Cost: While some ELN software is free, vendors may limit data storage, file size, and the number of users.[14]
- Portability: If an ELN maker goes out of business or raises its prices, the only way to get information out of that company’s product might be a PDF export, which can’t be transferred to another product.[15] There is also the problem of an individual moving between groups within an organization or between organizations: will their work be transportable?
- Ease-of-use: The ease-of-use, or lack thereof, can be a significant barrier. If the ELN is not user-friendly, it can deter researchers from adopting it.[15]
- Accessibility: There can be issues with accessibility across different devices and operating systems.
- Privacy: By putting their work in an ELN, will researchers lose control over that work, and will they have to be concerned about premature release or disclosure?
- Scalability: Can an ELN be used on a laptop, tablet, and desktop system, both within the organization’s facilities and outside of them (with appropriate security), such as during fieldwork or travel between facilities?
Factors that can increase the acceptance of ELNs:
- Early-career researchers, who grew up with digital technology, expect and embrace electronic solutions.
- Researchers that deal with increasing volumes of data can't help but find traditional paper notebooks less practical.
- Concerns about reproducibility and stricter data management requirements from funding agencies motivate improvements in lab work documentation.
- The ELN market now includes more intuitive tools, such as cloud-based products, which are easier to adopt without extensive IT support.
We have in those lists an indication that the market wants electronic notebooks, that cultural changes have occurred to favor them, but that the “right” product has yet to emerge. It may be that the right product isn’t a traditional software entity but a system. An open-source system with user community support could provide the basis for developing a notebook that meets the needs of the scientific or engineering community.
A potential platform for building an electronic notebook system
One example of such a system is Nextcloud[16], an open-source system. (See Supplemental information, Attachment 2 for a summary of Nextcloud.) The key characteristics that make it (or a similar product) an attractive platform to build upon are:
- It is an open-source package. This protects against losing access to the system should the vendor go out of business, provides for community support, and protects against loss of data. The vendor has been responsive to user input and provides frequent updates to the product. The pricing ranges from a free version to an enterprise pricing tier.
- It is scalable and can be run on a laptop for a single user, or as part of a networked multi-user configuration.
- It provides multi-user access with security controls at the file and folder level, and it supports collaborative editing of documents.
- It has built-in team communications facilities, including messaging, email, calendars, etc.
- It supports a built-in AI assistant.
- It permits migration of researchers' work as they move through an organization, or from one organization to another. In the latter case, security controls would prevent unauthorized disclosure of confidential information. A scientist could begin using the system early in their career and continue using it as their career develops.
- It supports electronic signatures.
- It is operating system- and hardware-independent.
The system also has a rudimentary audit trail facility, though it would benefit from additional work.
By itself, Nextcloud or a similar system could provide a good tool for recording observations, experiment/project planning, and recording data and information. However, to be a truly useful notebook application, it would need to be able to reach into laboratory data aggregators and generators, data systems, and instrument systems. Table 3 describes a layered structure that addresses those needs. The system is scalable and can be modified as needed.
[Table 3. Layered structure for an electronic notebook system]
This is a logical layering and may not be confined to one computer. Most implementations would have this as a networked structure, with the interface layer providing connections to different systems.
Let's examine the four layers from Table 3:
1. Electronic notebook layer: The electronic notebook would be part of the document processing system in Nextcloud or Nextcloud-like systems. This would manage scheduling, security, communications, report preparation, the notebook(s), etc. This would also be a source of material for the OM.
2. Interface layer: Ideally, this would provide vendor-neutral communications to underlying software, essentially acting as middleware, which would convert requests from the notebook to specific applications. This would permit easy reconfiguration if applications change, or if a researcher moved from one facility to another. The same results could be achieved using copy/paste or export/import facilities from the applications; however, a standardized layer would provide an easier-to-use interface.
3. Data aggregator layer: The data aggregator layer would connect to the data and information generators layer, with results working through the system to the notebook. LIMS would be an integral part of the notebook system and provide a means of ordering tests and managing results. The Nextcloud layer could be on a laptop, or a multi-user distributed system depending on the workplace’s requirements. All instrument connections would be through the LIMS and SDMS applications. This would make it easier to manage changes in the lower layer.
4. Data and information generators layer: This is the most volatile layer in the system, the place where procedures are executed, new ones come into practice, and old ones retire. This layer is best managed by a LIMS-SDMS combination and would provide a common point of reference for all user testing.
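The role of the interface layer as vendor-neutral middleware can be sketched as a small adapter pattern in Python. This is only a sketch under stated assumptions: the class and method names are hypothetical, the "LIMS" here is an in-memory dictionary rather than a real product, and no actual Nextcloud, LIMS, or SDMS API is being represented.

```python
from abc import ABC, abstractmethod


class LabSystemAdapter(ABC):
    """Hypothetical contract every underlying system adapter fulfills."""

    @abstractmethod
    def fetch_results(self, sample_id: str) -> dict:
        """Return results for a sample in a common, notebook-friendly form."""


class InMemoryLimsAdapter(LabSystemAdapter):
    """Stand-in for a vendor-specific LIMS connector."""

    def __init__(self, results: dict):
        self._results = results

    def fetch_results(self, sample_id: str) -> dict:
        # Return a copy so the notebook cannot alter the source records.
        return dict(self._results.get(sample_id, {}))


class InterfaceLayer:
    """Routes notebook requests to whichever adapter is registered."""

    def __init__(self):
        self._adapters = {}

    def register(self, name: str, adapter: LabSystemAdapter) -> None:
        self._adapters[name] = adapter

    def request_results(self, system: str, sample_id: str) -> dict:
        if system not in self._adapters:
            raise LookupError(f"no adapter registered for {system!r}")
        return self._adapters[system].fetch_results(sample_id)


if __name__ == "__main__":
    layer = InterfaceLayer()
    layer.register("lims", InMemoryLimsAdapter({"S-42": {"pH": 7.1}}))
    print(layer.request_results("lims", "S-42"))
```

Because the notebook only talks to `InterfaceLayer`, replacing one application with another means registering a different adapter under the same name, which is the easy reconfiguration the interface layer is meant to provide.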
In closing...
The development of LLM and AI systems provides a means of improving the ROI for laboratory and scientific work. To be fully useful, there needs to be a method of updating the LLM database with new material. One approach is to build an ELN or other electronic notebook system that can feed the database as well as support research, development, and other laboratory activities. This work describes one such system.
Acknowledgements
I’d like to thank Gretchen Boria for her help in improving this article and her contributions to it.
Footnotes
- ↑ By addressing both "scientific" and "laboratory," we recognize that not all scientific work occurs in a laboratory.
Supplemental information
- Attachment 1: Laboratory Informatics: Information and Workflows, on LIMSwiki
- Attachment 2: "Nextcloud.com and Laboratory Informatics," at LIMSforum
Abbreviations, acronyms, and initialisms
- AI: artificial intelligence
- CNN: convolutional neural network
- DCMI: Dublin Core Metadata Initiative
- DMS: document management system
- DRL: deep reinforcement learning
- ELN: electronic laboratory notebook
- ES: external source
- GAN: generative adversarial network
- IDS: instrument data system
- K/I/D: knowledge, information, and data
- LES: laboratory execution system
- LIMS: laboratory information management system
- LIS: laboratory information system
- LLM: large language model
- ML: machine learning
- NLP: natural language processing
- OM: organizational memory
- OOS: out-of-specification
- PCA: principal component analysis
- PSK: project-specific knowledge
- QC: quality control
- RNN: recurrent neural network
- ROI: return on investment
- SDMS: scientific data management system
- SOP: standard operating procedure
- VAE: variational autoencoder
About the author
Initially educated as a chemist, author Joe Liscouski (joe dot liscouski at gmail dot com) is an experienced laboratory automation/computing professional with over forty years of experience in the field, including the design and development of automation systems (both custom and commercial systems), LIMS, robotics and data interchange standards. He also consults on the use of computing in laboratory work. He has held symposia on validation and presented technical material and short courses on laboratory automation and computing in the U.S., Europe, and Japan. He has worked/consulted in pharmaceutical, biotech, polymer, medical, and government laboratories. His current work centers on working with companies to establish planning programs for lab systems, developing effective support groups, and helping people with the application of automation and information technologies in research and QC environments.
References
- ↑ Malhotra, T. (30 January 2024). "This AI Paper Unveils the Future of MultiModal Large Language Models (MM-LLMs) – Understanding Their Evolution, Capabilities, and Impact on AI Research". Marktechpost. Marktechpost Media, LLC. https://www.marktechpost.com/2024/01/30/this-ai-paper-unveils-the-future-of-multimodal-large-language-models-mm-llms-understanding-their-evolution-capabilities-and-impact-on-ai-research/. Retrieved 10 April 2024.
- ↑ 2.0 2.1 Walsh, James P.; Ungson, Gerardo Rivera (1 January 1991). "Organizational Memory". The Academy of Management Review 16 (1): 57. doi:10.2307/258607. http://www.jstor.org/stable/258607?origin=crossref.
- ↑ "ChatGPT 3.5". OpenAI OpCo, LLC. https://chat.openai.com/. Retrieved 10 April 2024.
- ↑ Emsley, Robin (19 August 2023). "ChatGPT: these are not hallucinations – they’re fabrications and falsifications" (in en). Schizophrenia 9 (1): 52, s41537–023–00379-4. doi:10.1038/s41537-023-00379-4. ISSN 2754-6993. PMC PMC10439949. PMID 37598184. https://www.nature.com/articles/s41537-023-00379-4.
- ↑ Fabbro, R. (29 March 2024). "Microsoft is apprehending AI hallucinations — and not just its own". Quartz. https://qz.com/microsoft-azure-ai-hallucinations-chatbots-1851374390. Retrieved 10 April 2024.
- ↑ Maurin, N. (15 March 2024). "The bank quant who wants to stop gen AI hallucinating". Risk.net. https://www.risk.net/risk-management/7959062/the-bank-quant-who-wants-to-stop-gen-ai-hallucinating. Retrieved 10 April 2024.
- ↑ 7.0 7.1 "About DCMI". Association for Information Science and Technology. https://www.dublincore.org/about/. Retrieved 10 April 2024.
- ↑ "NASA Lessons Learned". NASA. 26 July 2023. https://www.nasa.gov/nasa-lessons-learned/. Retrieved 10 April 2024.
- ↑ Doyle, K. (22 January 2016). "Xerox’s Eureka: A 20-Year-Old Knowledge Management Platform That Still Performs". Field Service Digital. ServiceMax. https://fsd.servicemax.com/2016/01/22/xeroxs-eureka-20-year-old-knowledge-management-platform-still-performs/. Retrieved 10 April 2024.
- ↑ Hiter, S. (9 April 2024). "Salesforce and AI: How Salesforce’s Einstein Transforms Sales". e-Week. https://www.eweek.com/artificial-intelligence/how-salesforce-drives-business-through-ai/. Retrieved 11 April 2024.
- ↑ Labovitz, George; Chang, Yu Sang; Rosansky, Victor (1992). Making quality work: a leadership guide for the results-driven manager. Essex Junction, VT: Omneo. ISBN 978-0-939246-54-0.
- ↑ "U.S. Inflation Calculator". CoinNews Media Group, LLC. https://www.usinflationcalculator.com/. Retrieved 10 April 2024.
- ↑ Neumeyer, M. (23 June 2020). "Data Integrity: 2020 FDA Data Integrity Observations in Review". American Pharmaceutical Review. https://www.americanpharmaceuticalreview.com/Featured-Articles/565600-Data-Integrity-2020-FDA-Data-Integrity-Observations-in-Review/. Retrieved 10 April 2024.
- ↑ Schmerker, J. (22 June 2020). "Switch to an Electronic Lab Notebook? Pros and Cons". Integrated DNA Technologies. https://www.idtdna.com/pages/community/blog/post/thinking-about-making-the-switch-to-an-electronic-lab-notebook-here-are-some-pros-and-cons. Retrieved 10 April 2024.
- ↑ 15.0 15.1 Kanza, Samantha; Willoughby, Cerys; Gibbins, Nicholas; Whitby, Richard; Frey, Jeremy Graham; Erjavec, Jana; Zupančič, Klemen; Hren, Matjaž et al. (1 December 2017). "Electronic lab notebooks: can they replace paper?" (in en). Journal of Cheminformatics 9 (1): 31. doi:10.1186/s13321-017-0221-3. ISSN 1758-2946. PMC PMC5443717. PMID 29086051. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0221-3.
- ↑ "Nextcloud". Nextcloud GmbH. https://nextcloud.com/. Retrieved 10 April 2024.