Apache Spark is a fast, general-purpose, open-source cluster computing framework for big data processing. The main advantage of the platform is that it provides a high-level API for combining a variety of computation methods and algorithms that previously required separate distributed systems (e.g. text mining, machine learning, streaming, graph algorithms), while automating and hiding significant low-level details from its users. By supporting this wide variety of workloads on a single platform, Spark makes it easy and inexpensive to combine different processing engines and reduces the burden of maintaining separate tools. One of its main features is its ability to run computations in memory, while it also efficiently supports complex applications that run on disk.
This 3-hour tutorial provides a detailed description of the core part of the Apache Spark platform, followed by a short hands-on session.
Specifically, the first part of the tutorial (2 hours) aims at: a) providing a thorough understanding of Apache Spark’s internals, pinpointing those aspects of the platform that can affect the performance of applications built on top of it, b) describing the available data transformations and actions, and finally c) presenting an overview of the Apache Spark ecosystem.
After a short break, a hands-on session (1 hour) will help participants write their first Apache Spark applications through a small number of simple tasks. For this purpose, participants are kindly asked to bring their own laptops.
Panagiotis Papadakos is a postdoctoral researcher at FORTH-ICS. He holds a PhD in Computer Science from the University of Crete. His main research interests lie in the areas of Exploratory Search, the Semantic Web, Recommendation Engines and Big Data Processing, with an emphasis on preference-based interactive exploration of multi-dimensional information spaces.
Vangelis Kritsotakis is a technical staff member at FORTH-ICS. He received his MSc degree in Internet Computing from the University of Surrey and his BSc degree in Mathematics & Computer Science from the University of Sussex. His research interests lie in the areas of Biomedical Information Systems, Semantic Web, Information Modelling and Data Integration, Service Oriented Technologies and Big Data Processing.