Journal:A view of programming scalable data analysis: From clouds to exascale

Full article title	A view of programming scalable data analysis: From clouds to exascale
Journal	Journal of Cloud Computing
Author(s)	Talia, Domenico
Author affiliation(s)	DIMES at Università della Calabria
Primary contact	Email: talia at dimes dot unical dot it
Year published	2019
Volume and issue	8
Page(s)	4
DOI	10.1186/s13677-019-0127-x
ISSN	2192-113X
Distribution license	Creative Commons Attribution 4.0 International
Website	https://link.springer.com/article/10.1186/s13677-019-0127-x
Download	https://link.springer.com/content/pdf/10.1186%2Fs13677-019-0127-x.pdf (PDF)

Abstract

Scalability is a key feature for big data analysis and machine learning frameworks and for applications that need to analyze very large and real-time data available from data repositories, social media, sensor networks, smartphones, and the internet. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high-performance computing (HPC) systems and cloud computing systems, whereas in the near future exascale systems will be used to implement extreme-scale data analysis. This paper discusses how cloud computing currently supports the development of scalable data mining solutions and outlines the main challenges that must be addressed and solved to implement innovative data analysis applications on exascale systems.

Keywords: big data analysis, cloud computing, exascale computing, data mining, parallel programming, scalability

Introduction

Solving problems in science and engineering was the first motivation for inventing computers. Much later, computer science remains the main area in which innovative solutions and technologies are being developed and applied. Owing in part to the extraordinary advancement of computer technology, data are now generated as never before. In fact, the amount of structured and unstructured digital data is going to increase beyond any estimate. Databases, file systems, data streams, social media, and data repositories are increasingly pervasive and decentralized.

As the data scale increases, we must address new challenges and attack ever-larger problems. New discoveries will be achieved and more accurate investigations can be carried out due to the increasingly widespread availability of large amounts of data. Scientific sectors that fail to make full use of the volume of digital data available today risk losing out on the significant opportunities that big data can offer.

To benefit from big data availability, specialists and researchers need advanced data analysis tools and applications running on scalable architectures allowing for the extraction of useful knowledge from such huge data sources. High-performance computing (HPC) systems and cloud computing systems today are capable platforms for addressing both the computational and data storage needs of big data mining and parallel knowledge discovery applications. These computing architectures are needed to run data analysis because complex data mining tasks involve data- and compute-intensive algorithms that require large, reliable, and effective storage facilities together with high-performance processors to obtain results in a timely fashion.

Now that data sources have become pervasively huge, reliable and effective programming tools and applications for data analysis are needed to extract value and find useful insights in them. New ways to correctly and proficiently compose different distributed models and paradigms are required, and interaction between hardware resources and programming levels must be addressed. Users, professionals, and scientists working in the area of big data need advanced data analysis programming models and tools coupled with scalable architectures to support the extraction of useful information from such massive repositories. The scalability of a parallel computing system is a measure of its capacity to reduce program execution time in proportion to the number of its processing elements. (The appendix of this article introduces and discusses in detail scalability in parallel systems.) According to this definition, scalable data analysis refers to the ability of a hardware/software parallel system to exploit increasing computing resources effectively in the analysis of (very) large datasets.
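The definition of scalability given above can be made concrete with a small numerical sketch. It assumes the standard speedup and efficiency measures, S(p) = T(1)/T(p) and E(p) = S(p)/p, which are not spelled out in this section; the function names and the timing values are hypothetical and serve only as an illustration.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p) (assumed standard definition)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """Efficiency E(p) = S(p) / p; 1.0 means time falls in exact proportion to p."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t_serial, t_parallel) / p


# Hypothetical execution times in seconds, keyed by number of processing elements.
timings = {1: 120.0, 2: 60.0, 4: 40.0, 8: 30.0}

# Map each p to its (speedup, efficiency) pair relative to the one-element run.
report = {
    p: (speedup(timings[1], t), efficiency(timings[1], t, p))
    for p, t in timings.items()
}

for p, (s, e) in report.items():
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```

In these made-up timings the system scales perfectly up to two processing elements and then loses efficiency, which is the kind of behavior a scalable data analysis system tries to postpone to much larger values of p.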

Today, complex analysis of real-world massive data sources requires using high-performance computing systems such as massively parallel machines or clouds. However, in the coming years, as parallel technologies advance, exascale computing systems will be exploited for implementing scalable big data analysis in all areas of science and engineering.^[1] To reach this goal, new design and programming challenges must be addressed and solved. As such, the focus of this paper is on discussing current cloud-based design and programming solutions for data analysis and suggesting new programming requirements and approaches to be conceived for meeting big data analysis challenges on future exascale platforms.

Current cloud computing platforms and parallel computing systems represent two different technological solutions for addressing the computational and data storage needs of big data mining and parallel knowledge discovery applications. Indeed, parallel machines offer high-end processors with the main goal of supporting HPC applications, whereas cloud systems implement a computing model in which dynamically scalable virtualized resources are provided to users and developers as a service over the internet. In fact, clouds do not mainly target HPC applications; they represent scalable computing and storage delivery platforms that can be adapted to the needs of different classes of people and organizations by exploiting a service-oriented architecture (SOA) approach. Clouds offer large facilities to many users who were unable to own their own parallel/distributed computing systems to run applications and services. In particular, big data analysis applications that require accessing and manipulating very large datasets with complex mining algorithms will significantly benefit from the use of cloud platforms.

Although not many cloud-based data analysis frameworks are available today for end users, within a few years they will become common.^[2] Some current solutions are based on open-source systems, such as Apache Hadoop and Mahout, Spark, and SciDB, while others are proprietary solutions provided by companies such as Google, Microsoft, EMC, Amazon, BigML, Splunk Hunk, and InsightsOne. As more such platforms emerge, researchers and professionals will port increasingly powerful data mining programming tools and frameworks to the cloud to exploit complex and flexible software models such as the distributed workflow paradigm. The growing utilization of the service-oriented computing model could accelerate this trend.

From the definition of the term "big data," which refers to datasets so large and complex that traditional hardware and software data processing solutions are inadequate to manage and analyze them, we can infer that conventional computer systems are not powerful enough to process and mine big data^[3], and they are not able to scale with the size of the problems to be solved. As mentioned before, to cope with the limits of sequential machines, advanced systems like HPC, cloud computing, and even more scalable architectures are used today to analyze big data. Starting from this scenario, exascale computing systems will represent the next computing step.^[4]^[5] Exascale systems are high-performance computing systems capable of at least one exaFLOPS, so their implementation represents a significant research and technology challenge. Their design and development are currently under investigation with the goal of building, by 2020, high-performance computers composed of a very large number of multi-core processors and expected to deliver a performance of 10¹⁸ operations per second. Cloud computing systems used today are able to store very large amounts of data; however, they do not provide the high performance expected from massively parallel exascale systems. This is the main motivation for developing exascale systems. Exascale systems will represent the most advanced model of supercomputers. They have been conceived for single-site supercomputing centers, not for distributed infrastructures that could use multi-clouds or fog computing systems for decentralizing computing and pervasive data management, and later be interconnected with exascale systems that could be used as a backbone for very large scale data analysis.

The development of exascale systems spurs a need to address and solve issues and challenges at both the hardware and software level. Indeed, it requires the design and implementation of novel software tools and runtime systems able to manage a high degree of parallelism, reliability, and data locality in extreme scale computers.^[6] New programming constructs and runtime mechanisms are needed that can adapt to the most appropriate degree of parallelism and communication decomposition to make data analysis tasks scalable and reliable. Their dependence on parallelism grain size and data analysis task decomposition must be deeply studied. This is needed because parallelism exploitation depends on several features like parallel operations, communication overhead, input data size, I/O speed, problem size, and hardware configuration. Moreover, reliability and reproducibility are two additional key challenges to be addressed. At the programming level, constructs for handling and recovering from communication, data access, and computing failures must be designed. At the same time, reproducibility in scalable data analysis requires rich information to ensure similar results in environments that may dynamically change. All these factors must be taken into account in designing data analysis applications and tools that will be scalable on exascale systems.
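The interplay between problem size, communication overhead, and the most appropriate degree of parallelism can be illustrated with a deliberately simple cost model. This is only a sketch under stated assumptions: the formula T(p) = work/p + comm_cost · log2(p), the function names, and all numeric values are hypothetical and are not taken from the article or from any real system.

```python
import math


def modeled_time(work: float, p: int, comm_cost: float) -> float:
    """Toy model (an assumption, not a measured law): perfectly divisible
    computation plus a communication term that grows with log2(p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return work / p + comm_cost * math.log2(p)


def best_parallelism(work: float, comm_cost: float, candidates: list[int]) -> int:
    """Return the candidate number of processing elements with the lowest modeled time."""
    return min(candidates, key=lambda p: modeled_time(work, p, comm_cost))


# Hypothetical candidate degrees of parallelism.
candidates = [1, 2, 4, 8, 16, 32, 64]

# Same communication cost, two different problem sizes (arbitrary time units).
small = best_parallelism(work=12.0, comm_cost=1.0, candidates=candidates)
large = best_parallelism(work=1024.0, comm_cost=1.0, candidates=candidates)

print(f"small problem: best p = {small}")
print(f"large problem: best p = {large}")
```

Under this model the small problem stops benefiting from additional processing elements early, because communication dominates, while the large problem keeps improving up to the largest candidate. A runtime that adapts the degree of parallelism would need to weigh this kind of trade-off using measured rather than assumed costs.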

Moreover, reliable and effective methods for storing, accessing, and communicating data; intelligent techniques for massive data analysis; and software architectures enabling the scalable extraction of knowledge from data are needed.^[3] To reach this goal, models and technologies enabling cloud computing systems and HPC architectures must be extended/adapted or completely changed to be reliable and scalable on the very large number of processors/cores that compose extreme scale platforms and to support the implementation of clever data analysis algorithms that ought to be scalable and dynamic in resource usage. Exascale computing infrastructures will play the role of an extraordinary platform for addressing both the computational and data storage needs of big data analysis applications. However, as mentioned before, to have a complete scenario, efforts must be made to implement big data analytics algorithms, architectures, programming tools, and applications on exascale systems.^[7]

Pursuing this objective within a few years, scalable data access and analysis systems will become the most used platforms for big data analytics on large-scale clouds. In the long term, new exascale computing infrastructures will appear as viable platforms for big data analytics in the next decades, and data mining algorithms, tools, and applications will be ported to such platforms for implementing extreme data discovery solutions.

In this paper we first discuss cloud-based scalable data mining and machine learning solutions, then we examine the main research issues that must be addressed for implementing massively parallel data mining applications on exascale computing systems. Data-related issues are discussed together with communication, multi-processing, and programming issues. We then introduce issues and systems for scalable data analysis on clouds and then discuss design and programming issues for big data analysis in exascale systems. We close by outlining some open design challenges.

Data analysis on cloud computing platforms

Abbreviations

APGAS: asynchronous partitioned global address space

BSP: bulk synchronous parallel

CAF: Co-Array Fortran

DAaaS: data analysis as a service

DAIaaS: data analysis infrastructure as a service

DAPaaS: data analysis platform as a service

DASaaS: data analysis software as a service

DMCF: Data Mining Cloud Framework

ECL: Enterprise Control Language

ESnet: Energy Sciences Network

GA: global array

HPC: high-performance computing

IaaS: infrastructure as a service

JS4Cloud: JavaScript for Cloud

PaaS: platform as a service

PGAS: partitioned global address space

RDD: resilient distributed dataset

SaaS: software as a service

SOA: service-oriented architecture

TBB: threading building blocks

VL4Cloud: Visual Language for Cloud

XaaS: everything as a service

References

1. Petcu, D.; Iuhasz, G.; Pop, D. et al. (2015). "On Processing Extreme Data". Scalable Computing: Practice and Experience 16 (4). doi:10.12694/scpe.v16i4.1134.
2. Tardieu, O.; Herta, B.; Cunningham, D. et al. (2016). "X10 and APGAS at Petascale". ACM Transactions on Parallel Computing (TOPC) 2 (4): 25. doi:10.1145/2894746.
3. Talia, D. (2015). "Making knowledge discovery services scalable on clouds for big data mining". Proceedings from the Second IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM): 1–4. doi:10.1109/ICSDM.2015.7298015.
4. Amarasinghe, S.; Campbell, D.; Carlson, W. et al. (14 September 2009). "ExaScale Software Study: Software Challenges in Extreme Scale Systems". DARPA IPTO. pp. 153. doi:10.1.1.205.3944. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.205.3944.
5. Zaharia, M.; Xin, R.S.; Wendell, P. et al. (2016). "Apache Spark: A unified engine for big data processing". Communications of the ACM 59 (11): 56–65. doi:10.1145/2934664.
6. Maheshwari, K.; Montagnat, J. (2010). "Scientific Workflow Development Using Both Visual and Script-Based Representation". 6th World Congress on Services: 328–35. doi:10.1109/SERVICES.2010.14.
7. Reed, D.A.; Dongarra, J. (2015). "Exascale computing and big data". Communications of the ACM 58 (7): 56–68. doi:10.1145/2699414.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Some grammar and punctuation was cleaned up to improve readability. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance.