Difference between revisions of "User:Shawndouglas/sandbox/sublevel5"

{{ombox
| type      = notice
| style    = width: 960px;
| text      = This is sublevel5 of my sandbox, where I play with features and test MediaWiki code. If you wish to leave a comment for me, please see [[User_talk:Shawndouglas|my discussion page]] instead.<p></p>
}}


==Sandbox begins below==
{{Infobox journal article
|name        =
|image        =
|alt          = <!-- Alternative text for images -->
|caption      =
|title_full  = Big data management for healthcare systems: Architecture, requirements, and implementation
|journal      = ''Advances in Bioinformatics''
|authors      = El aboudi, Naoual; Benhlima, Laila
|affiliations = Mohammed V University
|contact      = Email: nawal dot elaboudi at gmail dot com
|editors      = Fdez-Riverola, Florentino
|pub_year    = 2018
|vol_iss      = '''2018'''
|pages        = 4059018
|doi          = [https://doi.org/10.1155/2018/4059018 10.1155/2018/4059018]
|issn        = 1687-8035
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]
|website      = [https://www.hindawi.com/journals/abi/2018/4059018/ https://www.hindawi.com/journals/abi/2018/4059018/]
|download    = [http://downloads.hindawi.com/journals/abi/2018/4059018.pdf http://downloads.hindawi.com/journals/abi/2018/4059018.pdf] (PDF)
}}
{{ombox
| type      = content
| style    = width: 500px;
| text      = This article should not be considered complete until this message box has been removed. This is a work in progress.
}}
==Abstract==
The growing amount of data in the healthcare industry has made the adoption of big data techniques inevitable as a means of improving the quality of healthcare delivery. Despite the integration of big data processing approaches and platforms into existing [[Information management|data management]] architectures for healthcare systems, these architectures face difficulties in preventing emergency cases. The main contribution of this paper is an extensible big data architecture based on both stream computing and batch computing, intended to further enhance the reliability of healthcare systems by generating real-time alerts and making accurate predictions about patient health condition. Based on the proposed architecture, a prototype implementation has been built for healthcare systems in order to generate real-time alerts. The suggested prototype is based on the Spark and MongoDB tools.
 
==Introduction==
The proportion of elderly people in society is growing worldwide<ref name="WHOGlobal11">{{cite web |url=http://www.who.int/ageing/publications/global_health/en/ |title=Global Health and Aging |editor=World Health Organization; National Institute of Aging |publisher=WHO |date=October 2011}}</ref>; this phenomenon—referred to by the World Health Organization as "humanity’s aging"<ref name="WHOGlobal11" />—has many implications for healthcare services, especially in terms of cost. In the face of such a situation, relying on classical systems may result in a decline in quality of life for millions of people. Seeking to overcome this problem, a variety of different healthcare systems have been designed. Their common principle is transferring, on a periodic basis, medical parameters like blood pressure, heart rate, glucose level, body temperature, and ECG signals to an automated system aimed at monitoring patients' health condition in real time. Such systems provide quick assistance when needed, since data is analyzed continuously. Automating health monitoring favors a proactive approach that relieves medical facilities by saving costs related to [[Hospital|hospitalization]], and it also enhances healthcare services by improving waiting times for consultations. Recently, the number of data sources in the healthcare industry has grown rapidly as a result of the widespread use of mobile and wearable sensor technologies, which have flooded the healthcare arena with a huge amount of data. Therefore, it becomes challenging to perform healthcare [[data analysis]] based on traditional methods, which are unfit to handle the high volume of diversified medical data. In general, the healthcare domain has four categories of analytics: descriptive, diagnostic, predictive, and prescriptive analytics. A brief description of each is given below.
 
'''Descriptive analytics''' refers to describing current situations and reporting on them. Several techniques are employed to perform this level of analytics. For instance, descriptive statistics tools like histograms and charts are among the techniques used in descriptive analytics.
 
'''Diagnostic analytics''' aims to explain why certain events occurred and what factors triggered them. For example, diagnostic analytics attempts to understand the reasons behind the regular readmission of some patients by using several methods such as clustering and decision trees.
 
'''Predictive analytics''' reflects the ability to predict future events; it also helps in identifying trends and determining probabilities of uncertain outcomes. An illustration of its role is to predict whether or not a patient will have complications. Predictive models are often built using machine learning techniques.
 
'''Prescriptive analytics''' proposes suitable actions leading to optimal decision-making. For instance, prescriptive analytics may suggest rejecting a given treatment when there is a high probability of a harmful side effect. Decision trees and Monte Carlo simulation are examples of methods applied to perform prescriptive analytics. Figure 1 illustrates analytics phases for the healthcare domain.<ref name="GandomiBeyond15">{{cite journal |title=Beyond the hype: Big data concepts, methods, and analytics |journal=International Journal of Information Management |author=Gandomi, A.; Haider, M. |volume=35 |issue=2 |pages=137–44 |year=2015 |doi=10.1016/j.ijinfomgt.2014.10.007}}</ref> The integration of big data technologies into healthcare analytics may lead to better performance of medical systems.
 
 
[[File:Fig1 Elaboudi AdvInBioinfo2018 2018.png|510px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="510px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' Analytics for the healthcare domain</blockquote>
|-
|}
|}
 
In fact, big data refers to large datasets that combine the following characteristics<ref name="ChenBig14">{{cite journal |title=Big Data: A Survey |journal=Mobile Networks and Applications |author=Chen, M.; Mao, S.; Liu, Y. |volume=19 |issue=2 |pages=171–209 |year=2014 |doi=10.1007/s11036-013-0489-0}}</ref>: volume, which refers to high amounts of data; velocity, which means that data is generated at a rapid pace; variety, which emphasizes that data comes in different formats; and veracity, which means that data originates from trustworthy sources.
 
Another characteristic of big data is variability, which indicates variations that occur in data flow rates. Indeed, velocity does not provide a consistent description of the data due to its periodic peaks and troughs. Another important aspect of big data is complexity; it arises from the fact that big data is often produced through many sources, which implies performing many operations over the data. These operations include identifying relationships and cleansing and transforming data flowing from different origins.
 
Moreover, Oracle decided to introduce value as a key attribute of big data. According to Oracle, big data has a “low value density,” which means that raw data has a low value compared to its high volume. Nevertheless, analysis of large volumes of data may lead to obtaining high value.
 
In the context of healthcare, high volumes of data are generated by multiple medical sources, including, for example, biomedical images, lab test reports, physicians' written notes, and health condition parameters allowing real-time patient health monitoring. In addition to its huge volume and its diversity, healthcare data flows at high speed. As a result, big data approaches offer tremendous opportunities for improving the efficiency of healthcare systems.
 
The contribution of this research paper is to propose an extensible big data architecture for healthcare applications formed by several components capable of storing, processing, and analyzing significant amounts of data in both real-time and batch modes. This paper demonstrates the potential of using big data analytics in the healthcare domain to find useful information in highly valuable data.
 
The paper has been organized as follows: In the next section, a background of big data computing approaches and big data platforms is provided. Recent contributions on big data for healthcare systems are reviewed in the section after. Then, in the section "An extensible big data architecture for healthcare," the components of the proposed big data architecture for healthcare are described. The implementation process is reported in the penultimate section, followed by conclusions, along with recommendations for future research.
 
==Background==
===An overview of big data approaches===
Big data technologies have received great attention due to their successful handling of high volume data compared to traditional approaches. A big data framework supports all kinds of data—including structured, semistructured, and unstructured data—while providing several features. Those features include predictive model design and big data mining tools that allow better decision-making processes through the selection of relevant [[information]].
 
Big data processing can be performed in two ways: batch processing and stream processing.<ref name="ShahrivariBeyond14">{{cite journal |title=Beyond Batch Processing: Towards Real-Time and Streaming Big Data |journal=Computers |author=Shahrivari, S. |volume=3 |issue=4 |pages=117-129 |year=2014 |doi=10.3390/computers3040117}}</ref> The first method is based on analyzing data over a specified period of time; it is adopted when there are no constraints regarding response time. On the other hand, stream processing is suitable for applications requiring real-time feedback. Batch processing aims to process a high volume of data by collecting and storing batches to be analyzed in order to generate results.
 
Batch processing mode requires ingesting all data before processing it within a specified time. MapReduce represents a widely adopted solution in the field of batch computing<ref name="DeanMap08">{{cite journal |title=MapReduce: Simplified data processing on large clusters |journal=Communications of the ACM |author=Dean, J.; Ghemawat, S. |volume=51 |issue=1 |pages=107-113 |year=2008 |doi=10.1145/1327452.1327492}}</ref>; it operates by splitting data into small pieces that are distributed to multiple nodes in order to obtain intermediate results. Once data processing by the nodes is finished, the outcomes are aggregated in order to generate the final results. Seeking to optimize the use of computational resources, MapReduce allocates processing tasks to nodes close to the data location. This model has had a lot of success in many applications, especially in the fields of [[bioinformatics]] and healthcare. A batch processing framework has many strengths, such as the ability to access all data and to perform many complex computation operations, but its latency is measured in minutes or more.
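
To make the MapReduce model above concrete, the following is a minimal, single-process Python sketch of the map, shuffle, and reduce phases. It is illustrative only; a real deployment would distribute these steps across Hadoop nodes, and the patient readings shown are made-up values.

<syntaxhighlight lang="python">
# Toy, single-process illustration of the MapReduce programming model.
from collections import defaultdict

records = [
    ("P-001", 72), ("P-002", 110), ("P-001", 75), ("P-002", 104),
]

# Map phase: emit (key, value) pairs; here the key is a patient ID.
mapped = [(patient_id, heart_rate) for patient_id, heart_rate in records]

# Shuffle phase: group intermediate pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values into a final result (average heart rate).
results = {key: sum(values) / len(values) for key, values in grouped.items()}
print(results)  # {'P-001': 73.5, 'P-002': 107.0}
</syntaxhighlight>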
 
Stream processing offers another methodology to analysts. In real applications such as healthcare, intelligent transportation, and finance, a high amount of data is produced in a continuous manner. When the need to process such data streams in real time arises, data analysis must take into consideration the continuous evolution of data and the permanent change in the statistical characteristics of data streams, referred to as concept drift.<ref name="TatbulStreaming10">{{cite journal |title=Streaming data integration: Challenges and opportunities |journal=Proceedings from the 26th IEEE International Conference on Data Engineering Workshops |author=Tatbul, N. |pages=155-158 |year=2010 |doi=10.1109/ICDEW.2010.5452751}}</ref> Indeed, storing a large amount of data for further processing may be challenging in terms of memory resources. Moreover, real applications tend to produce noisy data containing missing values and redundant features, making data analysis complicated, as it requires significant computational time. Stream processing reduces this computational burden by performing simple and fast computations for one data element or for a window of recent data, and such computations take seconds at most.
 
Big data stream mining methods—including classification, frequent pattern mining, and clustering—relieve computational effort through rapid extraction of the most relevant information; this objective is often achieved by mining data in a distributed manner. Those methods belong to one of the two following classes: data-based techniques and task-based techniques.<ref name="SinghASurvey15">{{cite journal |title=A survey on platforms for big data analytics |journal=Journal of Big Data |author=Singh, D.; Reddy, C.K. |volume=2 |page=8 |year=2015 |doi=10.1186/s40537-014-0008-6}}</ref> Data-based techniques summarize the entire dataset or select a subset of the continuous flow of streaming data to be processed. Sampling is one of these techniques; it consists of choosing a small subset of data to be processed according to a statistical criterion. Another data-based method is load shedding, which drops a part of the data, while the sketching technique establishes a random projection on the feature set. The synopsis data structure and aggregation methods also belong to the family of data-based techniques, the former summarizing data streams and the latter representing a number of elements in one element by using a statistical measure.
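
As an illustration of the data-based family, the sketch below implements reservoir sampling, one common way to realize the sampling technique mentioned above; it keeps a fixed-size, uniformly random subset of an unbounded stream. The stream contents and reservoir size are illustrative.

<syntaxhighlight lang="python">
# Reservoir sampling (Algorithm R): a data-based stream reduction technique.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Any iterable can stand in for the stream; here, 100,000 simulated readings.
print(reservoir_sample(range(100000), k=10))
</syntaxhighlight>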
 
Task-based techniques update existing methods or design new ones to reduce the computational time of data stream processing. They are categorized into approximation algorithms, which generate outputs with an acceptable error margin; sliding windows, which analyze recent data under the assumption that it is more useful than older data; and algorithm output granularity, which processes data according to the available memory and time constraints.
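
A minimal sketch of the sliding window idea follows: only the most recent readings are retained, bounding both memory use and computation time. The window size, the statistic computed, and the sample readings are illustrative choices.

<syntaxhighlight lang="python">
# Sliding window over a stream of readings; old values are evicted automatically.
from collections import deque

WINDOW_SIZE = 10
window = deque(maxlen=WINDOW_SIZE)

def process(reading):
    window.append(reading)
    # A simple statistic computed over recent data only.
    return sum(window) / len(window)

for heart_rate in [70, 72, 71, 140, 138, 75, 73, 74, 72, 71, 70, 69]:
    print(process(heart_rate))
</syntaxhighlight>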
 
Big data approaches are essential for modern healthcare analytics; they allow real-time extraction of relevant information from a large amount of patient data. As a result, alerts are generated when the prediction model detects possible complications. This process helps to prevent health emergencies from occurring; it also assists medical professionals in [[Clinical decision support system|decision-making]] regarding disease diagnosis and provides special care recommendations.
 
===Big data processing frameworks===
Concerning batch processing mode, the MapReduce framework is widely adopted; it allows distributed analysis of big data on a cluster of machines. Simple computations are performed through two functions: map and reduce. MapReduce relies on a master/slave architecture, with the master node dividing data into blocks, structuring it into a set of key/value pairs as input for the map tasks, and allocating processing tasks to slave nodes. Each map worker reads its assigned input data and writes the generated results of the map task into intermediate files. The reduce workers then take the results generated by the map tasks as input to the reduce tasks. Finally, the results are written into final output files. MapReduce runs on Hadoop, an open-source framework that stores and analyzes data in a parallel manner across clusters.
 
The entire framework is composed of two main components: Hadoop MapReduce and a distributed file system. The Hadoop Distributed File System (HDFS) stores data by replicating it across many nodes. Hadoop MapReduce, on the other hand, implements the MapReduce programming model; its master node stores [[metadata]] information such as the locations of replicated blocks, and it identifies the data nodes from which missing blocks can be recovered in case of failure. Data is split into several blocks, and processing operations are performed on the same machine that holds the data. With Hadoop, other tools for data storage can be used instead of HDFS, such as HBase, Cassandra, and relational databases. [[Data warehouse|Data warehousing]] may be performed by other tools, for instance, Pig and Hive, while Apache Mahout is employed for machine learning purposes. When stream processing is required, Hadoop may not be a suitable choice since all input data must be available before starting MapReduce tasks.
 
Recently, Storm from Twitter, S4 from Yahoo, Spark, and other programs were presented as solutions for processing incoming stream data. Each solution has its own peculiarities.
 
'''Storm''' is an open-source framework for analyzing data in real time, and it is composed of "spouts" and "bolts."<ref name="EvansApache15">{{cite journal |title=Apache Storm, a Hands on Tutorial |journal=Proceedings of the 2015 IEEE International Conference on Cloud Engineering |author=Evans, R. |page=2 |year=2015 |doi=10.1109/IC2E.2015.67}}</ref> Spouts can produce data or load data from an input queue, while bolts process input streams and generate output streams. In Storm programming, a combination of spouts and bolts forms what is called a topology. Storm has three types of nodes: the master node (or nimbus), the worker nodes, and ZooKeeper. The master node distributes and coordinates the execution of the topology, the worker nodes are responsible for executing spouts and bolts, and ZooKeeper handles distributed coordination.
 
'''S4''' is a distributed stream processing engine, inspired by the MapReduce model, designed to process data streams.<ref name="NeumeyerS410">{{cite journal |title=S4: Distributed Stream Computing Platform |journal=Proceedings from the 2010 IEEE International Conference on Data Mining Workshops |author=Neumeyer, L.; Robbins, B.; Nair, A.; Kesari, A. |pages=170-177 |year=2010 |doi=10.1109/ICDMW.2010.172}}</ref> It was implemented by Yahoo in Java. Data streams are fed to S4 as events.
 
'''Spark''' can be applied to both batch and stream processing; therefore, Spark may be considered a powerful framework compared with other tools such as Hadoop and Storm.<ref name="ZahariaApache16">{{cite journal |title=Apache Spark: A Unified Engine For Big Data Processing |journal=Communications of the ACM |author=Zaharia, M.; Xin, R.S.; Wendell, P. et al. |volume=59 |issue=11 |pages=56–65 |year=2016 |doi=10.1145/2934664}}</ref> It can access several data sources like HDFS, Cassandra, and HBase. Spark provides several interesting features, for example, iterative machine learning algorithms through the MLlib library, which provides efficient, high-speed algorithms; structured data analysis using Hive; graph processing based on GraphX; and Spark SQL, which retrieves data from many sources and manipulates it using the SQL language. Before processing data streams, Spark divides them into small portions and transforms them into a sequence of RDDs (resilient distributed datasets) called a DStream (discretized stream).
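
The following is a minimal PySpark sketch of the classic DStream API described above, assuming a hypothetical TCP source that emits one "patientId,heartRate" line at a time. It illustrates how a stream is cut into micro-batches and processed with MapReduce-style operations; it is not the authors' implementation.

<syntaxhighlight lang="python">
# Minimal Spark Streaming (DStream) sketch; source host/port and window sizes are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamSketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Hypothetical source: a socket emitting "patientId,heartRate" lines.
lines = ssc.socketTextStream("localhost", 9999)
readings = lines.map(lambda line: line.split(",")) \
                .map(lambda parts: (parts[0], float(parts[1])))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

# Average heart rate per patient over a 60-second window, recomputed every 10 seconds.
windowed = readings.groupByKeyAndWindow(60, 10).mapValues(avg)
windowed.pprint()

ssc.start()
ssc.awaitTermination()
</syntaxhighlight>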
 
'''Apache Flink''' is an open-source solution that analyzes data in both batch and real-time modes.<ref name="FriedmanIntro16">{{cite book |title=Introduction to Apache Flink: Stream Processing for Real Time and Beyond |author=Friedman, E.; Tzoumas, K. |publisher=O'Reilly Media |year=2016 |isbn=9781491976586}}</ref> The programming models of Flink and MapReduce share many similarities. Flink allows iterative processing and real-time computation on stream data collected by tools such as Flume and Kafka. Apache Flink provides several features like FlinkML, a machine learning library providing many learning algorithms for fast and scalable big data applications.
 
'''MongoDB''' is a NoSQL database capable of storing a significant amount of data. MongoDB relies on the JSON (JavaScript Object Notation) standard in order to store records. JSON is an open, human- and machine-readable format that makes data interchange easier than classical formats such as rows and tables. In addition, JSON scales better since join-based queries are not needed, because the relevant data of a given record is contained in a single JSON document. Spark is easily integrated with MongoDB.<ref name="PluggeTheDef15">{{cite book |chapter=Introduction to MongoDB |title=The Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB |author=Plugge, E.; Hows, D.; Membrey, P.; Hawkins, T. |publisher=APress |year=2015 |isbn=9781491976586}}</ref>
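
A minimal pymongo sketch of this document model is shown below; the connection string, database, collection, and field names are illustrative assumptions rather than details from the paper.

<syntaxhighlight lang="python">
# Storing and retrieving a sensor reading as a self-contained JSON-like document.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["healthcare"]
readings = db["vital_signs"]

# Each record carries all of its own data, so no join is needed to retrieve it later.
readings.insert_one({
    "patient_id": "P-001",
    "timestamp": "2018-05-14T10:32:00Z",
    "heart_rate": 96,
    "blood_pressure": {"systolic": 128, "diastolic": 84},
})

# Fetch the most recent reading for one patient.
latest = readings.find({"patient_id": "P-001"}).sort("timestamp", -1).limit(1)
for doc in latest:
    print(doc)
</syntaxhighlight>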
 
Table 1 summarizes big data processing solutions.
 
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 1.''' Big data processing solutions
|-
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Framework
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Type
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Latency
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Developed by
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Stream primitive
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Stream source
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Hadoop
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Batch
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Minutes or more
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Yahoo
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Key-value
  | style="background-color:white; padding-left:10px; padding-right:10px;"|HDFS
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Storm
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Streaming
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Subseconds
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Twitter
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Tuples
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Spouts
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Spark (streaming)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Batch/streaming
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Few seconds
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Berkeley AMPLab
  | style="background-color:white; padding-left:10px; padding-right:10px;"|DStream
  | style="background-color:white; padding-left:10px; padding-right:10px;"|HDFS
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|S4
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Streaming
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Few seconds
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Yahoo
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Events
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Networks
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Flink
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Batch/streaming
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Few seconds
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Apache Software Foundation
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Key-value
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Kafka
|-
|}
|}
 
==Big data-based healthcare systems==
The potential offered by big data approaches in healthcare analytics has attracted the attention of many researchers. Huang ''et al.''<ref name="HuangPromises15">{{cite journal |title=Promises and Challenges of Big Data Computing in Health Sciences |journal=Big Data Research |author=Huang, T.; Lan, L.; Fang, X. et al. |volume=2 |issue=1 |pages=2–11 |year=2015 |doi=10.1016/j.bdr.2015.02.002}}</ref>, for example, present recent advances in big data for [[health informatics]] and their role in tackling disease management, including the diagnosis, prevention, and treatment of several illnesses. Their study demonstrates that data privacy and security represent challenging issues in healthcare systems.
 
Raghupathi and Raghupathi<ref name="RaghupathiBig14">{{cite journal |title=Big data analytics in healthcare: Promise and potential |journal=Health Information Science and Systems |author=Raghupathi, W.; Raghupathi, V. |volume=2 |page=3 |year=2014 |doi=10.1186/2047-2501-2-3 |pmid=25825667 |pmc=PMC4341817}}</ref> have also examined the architectural framework and challenges of big data healthcare analytics. In another study<ref name="OlaronkeBigData16">{{cite journal |title=Big data in healthcare: Prospects, challenges and resolutions |journal=Proceedings from the 2016 Future Technologies Conference |author=Olaronke, I.; Oluwaseun, O. |pages=1152-7 |year=2016 |doi=10.1109/FTC.2016.7821747}}</ref>, the importance of security and privacy issues in successfully implementing big data healthcare systems is demonstrated. Belle ''et al.''<ref name="BelleBigData15">{{cite journal |title=Big Data Analytics in Healthcare |journal=BioMed Research International |author=Belle, A.; Thiagarajan, R.; Soroushmehr, S.M.R. et al. |volume=2015 |page=370194 |year=2015 |doi=10.1155/2015/370194}}</ref> discuss the role of big data in improving the quality of care delivery by aggregating and processing the large volume of data generated by healthcare systems.
 
Sun and Reddy<ref name="SunBigData13">{{cite journal |title=Big data analytics for healthcare |journal=Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining |author=Sun, J.; Reddy, C.K. |volume=2013 |page=1525 |year=2013 |doi=10.1145/2487575.2506178}}</ref> discuss data mining techniques for healthcare analytics, especially those used in healthcare applications like survival analysis and patient similarity. Bochicchio ''et al.''<ref name="BochicchioABigData16">{{cite journal |title=A Big Data Analytics Framework for Supporting Multidimensional Mining over Big Healthcare Data |journal=Proceedings from the 15th IEEE International Conference on Machine Learning and Applications |author=Bochicchio, M.; Cuzzocrea, A.; Vaira, L. |page=508-13 |year=2016 |doi=10.1109/ICMLA.2016.0090}}</ref> propose a big data healthcare analytics framework for supporting multidimensional mining over big healthcare data. The objective of this framework is analyzing the huge volume of data by applying data mining methods. Sakr and Elgammal<ref name="SakrTowards16">{{cite journal |title=Towards a Comprehensive Data Analytics Framework for Smart Healthcare Services |journal=Big Data Research |author=Sakr, S.; Elgammal, A. |volume=4 |page=44–58 |year=2016 |doi=10.1016/j.bdr.2016.05.002}}</ref> discuss a composite big data healthcare analytics framework called SmartHealth, whose goal is to overcome the challenges raised by healthcare big data via ICT technologies.
 
Li ''et al.''<ref name="LiWiki14">{{cite book |chapter=Chapter 4: Wiki-Health: A Big Data Platform for Health Sensor Data Management |title=Cloud Computing Applications for Quality Health Care Delivery |author=Li, Y.; Wu, C.; Guo, L. et al. |publisher=IGI Global |pages=19 |year=2014 |isbn=9781466661189 |doi=10.4018/978-1-4666-6118-9.ch004}}</ref> present a more focused framework called Wiki-Health, a big data platform that processes data produced by health sensors. This platform is formed by the following three layers: application, query and analysis, and data storage. The application layer ensures data access, data collection, security, and data sharing. The query and analysis layer provides data management and data analysis, while the data storage layer manages data storage. Challenges regarding the design of such platforms, especially in terms of data privacy and data security, are highlighted by Poh ''et al.''<ref name="PohChallenges14">{{cite journal |title=Challenges in designing an online healthcare platform for personalised patient analytics |journal=Proceedings of the 2014 IEEE Symposium on Computational Intelligence in Big Data |author=Poh, N.; Tirunagari, S.; Windridge, D. |pages=1–6 |year=2014 |doi=10.1109/CIBD.2014.7011526}}</ref> Baldominos ''et al.''<ref name="BaldominosDataCare18">{{cite journal |title=DataCare: Big data analytics solution for intelligent healthcare management |journal=International Journal of Interactive Multimedia and Artificial Intelligence |author=Baldominos, A.; de Rada, F.; Saez, Y. |volume=4 |issue=7 |pages=13–20 |year=2018 |doi=10.9781/ijimai.2017.03.002}}</ref> also designed an intelligent big data healthcare management solution, aimed at retrieving and aggregating data and predicting future values.
 
Based on big data technologies, a few data processing systems for the healthcare domain have been designed in order to handle the important amount of data streams generated by medical devices; a brief description of the major ones is provided in the next section.
 
A Borealis-based heart rate variability monitor, as discussed by Jiang ''et al.''<ref name="JiangDSMS11">{{cite journal |title=DSMS in ubiquitous-healthcare: A Borealis-based heart rate variability monitor |journal=Proceedings from the 2011 4th International Conference on Biomedical Engineering and Informatics |author=Jiang, X.; Yoo, S.; Choi, J. |pages=2144-7 |year=2011 |doi=10.1109/BMEI.2011.6098425}}</ref>, belongs to the category of big data processing systems for healthcare; it processes data originating from various sources in order to perform desired monitoring activities. It is composed of a stream transmitter that acts as an interface between the sensors collecting data and the Borealis application. It encapsulates the collected data into the Borealis format in order to obtain a single stream. Then, the final stream is transferred to the Borealis application for processing purposes. This system also includes a graphical user interface (GUI) that allows physicians to select those patients whose health condition is going to be the subject of close monitoring. Moreover, the [[Interface (computing)|graphical interface]] permits the medical staff to choose the parameters they want to focus on for a monitoring task. Furthermore, it allows [[Data visualization|visualization]] of the Borealis application's outcomes. However, the system has many drawbacks. For instance, it does not include a machine learning component capable of making accurate predictions about patient health condition. It also lacks an alarm component, which would enhance emergency case detection.
 
A Hadoop-based medical emergency management system using [[Internet of things|internet of things]] (IoT) technology relies on sensors measuring medical parameters through different processes [24]. Those sensors may be devices mounted on a patient's body or other types of medical devices capable of remote measurement. Before being transferred to the component called the intelligent building (IB), the collected data flows through the primary medical device (PMD). Next, the IB starts by aggregating the input stream using its collection unit; the resulting data is then transferred to the Hadoop Processing Unit (HPU) to perform statistical analyses of the parameters measured by the sensors, based on the MapReduce paradigm. The map function aims to verify sensor readings; this verification occurs by comparing them with their corresponding normal thresholds. If readings are considered to be normal, they are stored in a database without further processing. On the other hand, if they are abnormal, an alert is triggered and transmitted to the application layer. Meanwhile, when sensors return values that are neither clearly normal nor abnormal, it is necessary to analyze them closely. Results of such analyses are collected by the aggregation result unit through a reducer from different data nodes; then, they are sent to the final decision server. Finally, the decision server receives the current results, applies machine learning classifiers and medical expert knowledge to past patient data for more accurate decisions, and generates outputs based on the Hadoop Processing Unit results. This system is based on a Hadoop ecosystem, which is adapted for batch processing; however, it does not support stream processing. Therefore, it would be preferable to use Spark in order to improve the system's performance in terms of processing time, using data stream mining approaches.
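
The map-side logic described above can be sketched as follows. This is an illustrative reconstruction rather than the original system's code; the normal ranges and the margin used to flag "uncertain" readings are assumed values.

<syntaxhighlight lang="python">
# Classify each sensor reading for downstream handling: store, alert, or analyze closely.
NORMAL_RANGES = {"heart_rate": (60, 100), "body_temp": (36.1, 37.8)}  # assumed thresholds
UNCERTAIN_MARGIN = 0.10  # readings within 10% of a bound need closer analysis

def map_reading(sensor, value):
    low, high = NORMAL_RANGES[sensor]
    if low <= value <= high:
        return (sensor, value, "normal")      # store without further processing
    if value < low * (1 - UNCERTAIN_MARGIN) or value > high * (1 + UNCERTAIN_MARGIN):
        return (sensor, value, "abnormal")    # trigger an alert to the application layer
    return (sensor, value, "uncertain")       # forward for closer analysis

print(map_reading("heart_rate", 96))   # ('heart_rate', 96, 'normal')
print(map_reading("heart_rate", 125))  # ('heart_rate', 125, 'abnormal')
print(map_reading("heart_rate", 104))  # ('heart_rate', 104, 'uncertain')
</syntaxhighlight>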
 
Liu ''et al.'' proposed a prototype of a healthcare big data processing system based on Spark<ref name="LiuAProto15">{{cite journal |title=A prototype of healthcare big data processing system based on Spark |journal=Proceedings from the 8th International Conference on Biomedical Engineering and Informatics |author=Liu, W.; Li, Q.; Cai, Y.; Li, X. |pages=516–20 |year=2015 |doi=10.1109/BMEI.2015.7401559}}</ref> to analyze the high amount of data generated by healthcare systems. It is formed by two logical parts: a big data application service and a big data supporting platform that performs data analysis. The first logical part visualizes the processing results and plays the role of an interface between applications and data warehouse big data tools such as Hive or Spark SQL. The second is responsible for computing operations and distributed storage, allowing high storage capacity. This solution is based on Spark, which is very promising since it handles batch computing, stream computing, and ad hoc queries. The system nevertheless has drawbacks; for instance, it does not include big data mining and big data analytics in the experimental platform, which hampers the prediction capabilities that are vital for improving the quality of patient outcomes.
 
In this paper, we continue to emphasize the added value of big data technologies for healthcare analytics by presenting an extensible big data architecture for healthcare analytics that combines the advantages of both batch and stream computing to generate real-time alerts and make accurate predictions about patient health condition. In this research, an architecture for the management and analysis of medical data was designed based on big data methods and can be implemented via a combination of several big data technologies. Designing systems capable of handling both batch and real-time processing is a complex task and requires an effective conceptual architecture for implementing the system.
 
==An extensible big data architecture for healthcare==
We are developing a system that has the advantage of being generic and can deal with various situations such as early disease diagnosis and emergency detection. In this study, we propose a new architecture aimed at handling medical big data originating from heterogeneous sources in different formats. Data management in this architecture is illustrated through the following scenario.
 
New medical data is sent simultaneously to both the batch layer and the streaming layer. In batch mode, data is stored in data nodes and then transmitted to a semantic module, which assigns meaning to the data using an ontology store. After that, cleaning and filtering operations are applied to the resulting data before processing it. In the next step, the prepared data is analyzed through different phases: feature selection and feature extraction. Finally, the prepared data is used to design models predicting patients' future health condition. This mode is invoked periodically, on an offline basis. In the stream scenario, data comes from multiple sources, such as medical sensors connected to a patient's body, measuring several medical parameters like blood pressure. The collected data is then synchronized based on time, and its missing values are handled.
 
Based on the sliding window technique, the adaptive preprocessor splits data into blocks and then extracts relevant information for the predictor component in order to build a predictive model for every window of tuples. Figure 2 represents the layer architecture of the proposal; a minimal sketch of this windowed prediction step is given after the figure.
 
 
[[File:Fig2 Elaboudi AdvInBioinfo2018 2018.png|340px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="340px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 2.''' The layer architecture</blockquote>
|-
|}
|}
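
Below is a minimal sketch of the windowed prediction step described above, assuming scikit-learn's incremental SGDClassifier as the predictor; the window size, feature layout, and labels are illustrative assumptions rather than details from the paper.

<syntaxhighlight lang="python">
# Buffer streaming readings into fixed-size windows and update a predictor per window.
import numpy as np
from sklearn.linear_model import SGDClassifier

WINDOW = 20                    # illustrative window size (number of readings)
CLASSES = np.array([0, 1])     # 0 = stable, 1 = at risk (illustrative labels)

model = SGDClassifier()        # incremental linear classifier as a stand-in predictor
buffer_X, buffer_y = [], []

def on_new_reading(features, label):
    """Buffer readings into a window; update the model once the window is full."""
    buffer_X.append(features)
    buffer_y.append(label)
    if len(buffer_X) == WINDOW:
        model.partial_fit(np.array(buffer_X), np.array(buffer_y), classes=CLASSES)
        buffer_X.clear()
        buffer_y.clear()

# Simulated usage: each reading is (heart_rate, systolic_bp) with a known outcome label.
rng = np.random.default_rng(0)
for _ in range(3 * WINDOW):
    hr, bp = rng.normal(80, 15), rng.normal(125, 20)
    on_new_reading([hr, bp], int(hr > 100 or bp > 150))

print(model.predict([[118.0, 160.0]]))  # score a new reading with the latest model
</syntaxhighlight>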
 
===Batch processing layer===
Batch computing is performed on data extracted from the prepared data store through different phases.
 
====Data acquisition====
When continuously monitoring a patient's health condition, several types of data are generated. Medical data may include structured data from the traditional [[electronic health record]] (EHR), semistructured data such as logs produced by some medical devices, and unstructured data generated by biomedical imagery. The following are examples of sources of medical data.
 
'''Electronic health records''' provide a complete patient medical history stored in a digital format. The EHR is composed of a multitude of medical data describing the patient’s health status, like demographics, medications, diagnoses, [[laboratory]] tests, doctor’s notes, radiology documents, clinical information, and payment notes. Thus, EHRs represent a valuable source of information for the purpose of healthcare analytics. Furthermore, EHRs allow exchanging data between professional healthcare communities.
 
'''Biomedical imaging''' is considered a powerful tool for disease detection and care delivery. Nevertheless, processing the resulting images is challenging, as they include noisy data that needs to be discarded in order to help physicians make accurate decisions.
 
'''Social network analyses''' require gathering data from social media like social networking sites. The next step consists of extracting knowledge that could affect healthcare predictive analysis, such as discovering infectious illnesses. In general, social network data is marked by uncertainty, which makes its use in designing predictive models risky.
 
'''Device sensors''' of different types are employed in healthcare monitoring solutions. Those devices are essential in monitoring a patient's health, as they measure a wide range of medical indicators such as body temperature, blood pressure, respiratory rate, heart rate, and cardiovascular status. In order to ensure efficient health monitoring, a patient's living area may be full of devices like surveillance cameras, microphones, and pressure sensors. Consequently, data volume generated by health monitoring systems tends to increase tremendously, which requires adopting sophisticated methods during the processing phase.
 
'''Mobile phones''' represent some of the most popular technological devices in the world. Compared to their early beginnings, mobile phones have transformed from a basic communication tool into a complex device offering many features and services. They are currently equipped with several sensors like satellite positioning services, accelerometers, and cameras. Due to their multiple capabilities and wide use, mobile phones are ideal candidates for health data collection, allowing the design of many successful healthcare applications for activities such as monitoring pregnancy<ref name="BachiriMobile16">{{cite journal |title=Mobile personal health records for pregnancy monitoring functionalities: Analysis and potential |journal=Computer Methods and Programs in Biomedicine |author=Bachiri, M.; Idri, A.; Fernández-Alemán, J.L.; Toval, A. |volume=134 |pages=121–35 |year=2016 |doi=10.1016/j.cmpb.2016.06.008}}</ref>, tracking child nutrition<ref name="GuyonMobile16">{{cite journal |title=Mobile-Based Nutrition and Child Health Monitoring to Inform Program Development: An Experience From Liberia |journal=Global Health: Science and Practice |author=Guyon, A.; Bock, A.; Buback, L.; Knittel, B. |volume=4 |issue=4 |pages=661–70 |year=2016 |doi=10.9745/GHSP-D-16-00189}}</ref>, and monitoring heart rate.<ref name="PelegrisANovel10">{{cite journal |title=A novel method to detect heart beat rate using a mobile phone |journal=Conference Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society |author=Pelegris, P.; Banitsas, K.; Orbach, T.; Marias, K. |pages=5488–91 |year=2010 |doi=10.1109/IEMBS.2010.5626580 |pmid=21096290}}</ref>
 
The objective of the data acquisition phase is to read the data gathered from healthcare sensors in several formats and then direct the data through a semantic module before it is normalized. The semantic module is based on ontologies, which constitute efficient tools when it comes to representing actionable knowledge in the field of biomedicine. In fact, ontologies have the ability to extract biomedical knowledge in a formal, powerful, and incremental way. They also allow automation and interoperability between different clinical information systems. Automation has a major benefit: it helps medical personnel in processing large amounts of patients’ data, especially considering that these personnel are often overwhelmed by a series of healthcare tasks. Introducing automation in the healthcare setting contributes to providing assistance to human medical staff, which enhances their overall performance. It should be highlighted that automation will help humans perform their duties rather than replace them.
 
Interoperability is an important issue when dealing with medical data. In fact, healthcare databases lack homogeneity, as they adopt different structures and terminologies. Therefore, it is difficult to share information and integrate healthcare data. In this context, ontologies may play a determinant role by establishing a common structure and semantics, which allows sharing and reuse of data across different systems. In other words, by defining a standard ontology format, it becomes possible to map heterogeneous databases to a common structure and terminology. For instance, the Web Ontology Language (OWL) represents the standard interchange format for ontology data and employs XML syntax.
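
As a small illustration of this idea, the rdflib sketch below maps a record from a local schema onto terms of a hypothetical shared ontology and serializes it as RDF/XML, the XML-based syntax commonly used to exchange OWL ontologies. All names and URIs are illustrative, not from the paper.

<syntaxhighlight lang="python">
# Express one patient record using a shared (hypothetical) ontology vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

ONTO = Namespace("http://example.org/health-ontology#")  # hypothetical shared vocabulary

g = Graph()
patient = URIRef("http://example.org/patients/P-001")
g.add((patient, RDF.type, ONTO.Patient))
g.add((patient, ONTO.hasHeartRate, Literal(96, datatype=XSD.integer)))

# Serializing to RDF/XML yields an interchange document that any system
# referencing the same ontology can parse without knowing the source schema.
print(g.serialize(format="xml"))
</syntaxhighlight>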
 
 
==References==
{{Reflist|colwidth=30em}}
 
==Notes==
This presentation is faithful to the original, with only a few minor changes to presentation. Grammar was cleaned up for smoother reading. In some cases important information was missing from the references, and that information was added. The [https://link.springer.com/chapter/10.1007%2F978-3-540-28608-0_17 original reference] the author used for "Baldominos ''et al.''" was incorrect; the presumably correct citation was added in its place.
 
<!--Place all category tags here-->
[[Category:LIMSwiki journal articles (added in 2018)‎]]
[[Category:LIMSwiki journal articles (all)‎]]
[[Category:LIMSwiki journal articles on big data]]
[[Category:LIMSwiki journal articles on data management and sharing]]
[[Category:LIMSwiki journal articles on health informatics]]
[[Category:LIMSwiki journal articles on information technology]]

Latest revision as of 18:25, 10 January 2024



The limit of detection (LOD or LoD) is the lowest signal, or the lowest corresponding quantity to be determined (or extracted) from the signal, that can be observed with a sufficient degree of confidence or statistical significance. However, the exact threshold (level of decision) used to decide when a signal significantly emerges above the continuously fluctuating background noise remains arbitrary and is a matter of policy and often of debate among scientists, statisticians and regulators depending on the stakes in different fields.

==Significance in analytical chemistry==

In analytical chemistry, the detection limit, lower limit of detection, also termed LOD for limit of detection or analytical sensitivity (not to be confused with statistical sensitivity), is the lowest quantity of a substance that can be distinguished from the absence of that substance (a blank value) with a stated confidence level (generally 99%).[1][2][3] The detection limit is estimated from the mean of the blank, the standard deviation of the blank, the slope (analytical sensitivity) of the calibration plot, and a defined confidence factor (e.g., 3.2 is one of the most accepted values for this arbitrary factor).[4] Another consideration that affects the detection limit is the adequacy and accuracy of the model used to predict concentration from the raw analytical signal.[5]

As a typical example, consider a calibration plot following a linear equation, taken here as the simplest possible model:

<math>y = a + bx</math>

where <math>y</math> corresponds to the signal measured (e.g. voltage, luminescence, energy, etc.), <math>a</math> is the value at which the straight line cuts the ordinate axis, <math>b</math> is the sensitivity of the system (i.e., the slope of the line, or the function relating the measured signal to the quantity to be determined), and <math>x</math> is the value of the quantity (e.g. temperature, concentration, pH, etc.) to be determined from the signal.[6] The LOD for <math>x</math> is calculated as the value of <math>x</math> at which <math>y</math> equals the average value of the blanks <math>\bar{y}_\text{blank}</math> plus <math>k</math> times its standard deviation <math>\sigma_\text{blank}</math> (or, if zero, the standard deviation corresponding to the lowest value measured), where <math>k</math> is the chosen confidence factor (e.g. for a confidence of 95% it can be considered that <math>k = 3.2</math>, determined from the limit of blank).[4]

Thus, in this didactic example:

<math>x_\text{LOD} = \frac{\left(\bar{y}_\text{blank} + k\,\sigma_\text{blank}\right) - a}{b}</math>
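
A short numeric sketch of these formulas is given below; the blank readings, calibration intercept, slope, and confidence factor are made-up illustrative values.

<syntaxhighlight lang="python">
# Estimate the LOD from replicate blank signals and a linear calibration y = a + bx.
import numpy as np

blanks = np.array([0.021, 0.025, 0.019, 0.023, 0.022, 0.020, 0.024])  # blank signals
a, b = 0.010, 0.85   # intercept and slope of the calibration line (illustrative)
k = 3.2              # confidence factor cited in the text

y_lod = blanks.mean() + k * blanks.std(ddof=1)   # signal at the detection limit
x_lod = (y_lod - a) / b                          # corresponding quantity (e.g., concentration)
print(y_lod, x_lod)
</syntaxhighlight>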

There are a number of concepts derived from the detection limit that are commonly used. These include the instrument detection limit (IDL), the method detection limit (MDL), the practical quantitation limit (PQL), and the limit of quantitation (LOQ). Even when the same terminology is used, there can be differences in the LOD according to nuances of what definition is used and what type of noise contributes to the measurement and calibration.[7]

The figure below illustrates the relationship between the blank, the limit of detection (LOD), and the limit of quantitation (LOQ) by showing the probability density function for normally distributed measurements at the blank, at the LOD defined as 3 × standard deviation of the blank, and at the LOQ defined as 10 × standard deviation of the blank. (The identical spread along Abscissa of these two functions is problematic.) For a signal at the LOD, the alpha error (probability of false positive) is small (1%). However, the beta error (probability of a false negative) is 50% for a sample that has a concentration at the LOD (red line). This means a sample could contain an impurity at the LOD, but there is a 50% chance that a measurement would give a result less than the LOD. At the LOQ (blue line), there is minimal chance of a false negative.


==Instrument detection limit==

Most analytical instruments produce a signal even when a blank (matrix without analyte) is analyzed. This signal is referred to as the noise level. The instrument detection limit (IDL) is the analyte concentration that is required to produce a signal greater than three times the standard deviation of the noise level. This may be practically measured by analyzing 8 or more standards at the estimated IDL then calculating the standard deviation from the measured concentrations of those standards.

The detection limit (according to IUPAC) is the smallest concentration, or the smallest absolute amount, of analyte that has a signal statistically significantly larger than the signal arising from the repeated measurements of a reagent blank.

Mathematically, the analyte's signal at the detection limit (<math>S_\text{dl}</math>) is given by:

<math>S_\text{dl} = S_\text{reag} + 3\sigma_\text{reag}</math>

where <math>S_\text{reag}</math> is the mean value of the signal for a reagent blank measured multiple times, and <math>\sigma_\text{reag}</math> is the known standard deviation for the reagent blank's signal.

Other approaches for defining the detection limit have also been developed. In atomic absorption spectrometry, the detection limit for a given element is usually determined by analyzing a diluted solution of this element and recording the corresponding absorbance at a given wavelength. The measurement is repeated 10 times. The 3σ of the recorded absorbance signal can be considered the detection limit for the specific element under the experimental conditions: selected wavelength, type of flame or graphite furnace, chemical matrix, presence of interfering substances, instrument, and so on.

==Method detection limit==

Often there is more to the analytical method than just performing a reaction or submitting the analyte to direct analysis. Many analytical methods developed in the laboratory, especially those involving the use of a delicate scientific instrument, require sample preparation, or a pretreatment of the samples prior to analysis. For example, it might be necessary to heat a sample that is to be analyzed for a particular metal with the addition of acid first (a digestion process). The sample may also be diluted or concentrated prior to analysis by means of a given instrument. Additional steps in an analysis method add additional opportunities for error. Since detection limits are defined in terms of error, this will naturally increase the measured detection limit. This "global" detection limit (including all the steps of the analysis method) is called the method detection limit (MDL). The practical way to determine the MDL is to analyze seven samples with concentrations near the expected limit of detection. The standard deviation is then determined. The one-sided Student's t-distribution value is determined and multiplied by the determined standard deviation. For seven samples (with six degrees of freedom), the t value for a 99% confidence level is 3.14. Rather than performing the complete analysis of seven identical samples, if the instrument detection limit is known, the MDL may be estimated by multiplying the instrument detection limit, or lower level of detection, by the dilution factor applied prior to analyzing the sample solution with the instrument. This estimation, however, ignores any uncertainty that arises from performing the sample preparation and will therefore probably underestimate the true MDL.
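
A brief sketch of this MDL calculation follows; the seven replicate concentrations are made-up illustrative values, while the Student's t value is computed for six degrees of freedom at the 99% confidence level (approximately 3.14, as stated above).

<syntaxhighlight lang="python">
# MDL from seven replicates analyzed near the expected limit of detection.
import numpy as np
from scipy import stats

replicates = np.array([0.52, 0.49, 0.55, 0.47, 0.51, 0.53, 0.48])  # e.g., in µg/L

s = replicates.std(ddof=1)                      # sample standard deviation
t = stats.t.ppf(0.99, df=len(replicates) - 1)   # one-sided 99% Student's t (~3.14 for df = 6)
mdl = t * s
print(round(t, 2), round(mdl, 3))
</syntaxhighlight>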

==Limit of each model==

The issue of the limit of detection, or limit of quantification, is encountered in all scientific disciplines. This explains the variety of definitions and the diversity of jurisdiction-specific solutions developed to address preferences. In the simplest cases, as in nuclear and chemical measurements, definitions and approaches have probably received the clearest and simplest solutions. In biochemical tests and in biological experiments that depend on many more intricate factors, the situation involving false positive and false negative responses is more delicate to handle. In many other disciplines, such as geochemistry, seismology, astronomy, dendrochronology, climatology, the life sciences in general, and many other fields impossible to enumerate extensively, the problem is wider and deals with extracting a signal out of a background of noise. It involves complex statistical analysis procedures, and therefore it also depends on the models used,[5] the hypotheses, and the simplifications or approximations to be made to handle and manage uncertainties. When the data resolution is poor and different signals overlap, different deconvolution procedures are applied to extract parameters. The use of different phenomenological, mathematical, and statistical models may also complicate the exact mathematical definition of the limit of detection and how it is calculated. This explains why it is not easy to come to a general consensus, if any, about the precise mathematical definition of the expression of the limit of detection. However, one thing is clear: it always requires a sufficient number of data points (or accumulated data) and a rigorous statistical analysis to achieve better statistical significance.

==Limit of quantification==

The limit of quantification (LoQ, or LOQ) is the lowest value of a signal (or concentration, activity, response...) that can be quantified with acceptable precision and accuracy.

The LoQ is the limit at which the difference between two distinct signals / values can be discerned with a reasonable certainty, i.e., when the signal is statistically different from the background. The LoQ may be drastically different between laboratories, so another detection limit is commonly used that is referred to as the Practical Quantification Limit (PQL).


==References==

  1. IUPAC, Compendium of Chemical Terminology, 2nd ed. (the "Gold Book") (1997). Online corrected version (2006–): "detection limit".
  2. "Guidelines for Data Acquisition and Data Quality Evaluation in Environmental Chemistry". Analytical Chemistry 52 (14): 2242–49. 1980. doi:10.1021/ac50064a004.
  3. Saah, A.J.; Hoover, D.R. (1998). "Sensitivity and specificity revisited: significance of the terms in analytic and diagnostic language". Ann Dermatol Venereol 125 (4): 291–4. PMID 9747274. https://pubmed.ncbi.nlm.nih.gov/9747274.
  4. "Limit of blank, limit of detection and limit of quantitation". The Clinical Biochemist Reviews 29 Suppl 1 (1): S49–S52. August 2008. PMC 2556583. PMID 18852857. https://www.ncbi.nlm.nih.gov/pmc/articles/2556583.
  5. "R: "Detection" limit for each model". search.r-project.org. https://search.r-project.org/CRAN/refmans/bioOED/html/calculate_limit.html.
  6. "Signal enhancement on gold nanoparticle-based lateral flow tests using cellulose nanofibers". Biosensors & Bioelectronics 141: 111407. September 2019. doi:10.1016/j.bios.2019.111407. PMID 31207571. http://ddd.uab.cat/record/218082.
  7. Long, Gary L.; Winefordner, J.D. "Limit of detection: A closer look at the IUPAC definition". Anal. Chem. 55 (7): 712A–724A. doi:10.1021/ac00258a724.

==Further reading==

  • "Limits for qualitative detection and quantitative determination. Application to radiochemistry". Analytical Chemistry 40 (3): 586–593. 1968. doi:10.1021/ac60259a007. ISSN 0003-2700. 
