Ophidia – High Performance Data Mining & Analytics for eScience

Ophidia is a CMCC Foundation research project addressing big data challenges for eScience. It supports data-intensive analysis by exploiting advanced parallel computing techniques and smart data distribution methods. It relies on an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets over multiple nodes. Although the Ophidia analytics framework was designed primarily to meet Climate Change analysis requirements, it can also be used in other scientific domains where data is multidimensional.

Ophidia key features:

1. Designed for eScience
The n-dimensionality of scientific datasets requires tools that support specific data types (e.g. arrays) and primitives (e.g. slicing, dicing, pivoting, drill-down, roll-up) to properly enable data access and analysis. Compared with general-purpose multidimensional analytics systems, scientific data has a higher computing demand, which calls for efficient parallel/distributed solutions to meet (near) real-time analytics requirements. Ophidia supports different data formats (e.g. NetCDF, FITS, SAC), which allows it to manage data from different scientific domains.
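
As an illustration of the primitives named above, the following is a minimal sketch in plain Python on a toy three-dimensional datacube. It is not Ophidia code: the cube layout (time, lat, lon), the synthetic values and all function names are assumptions made for the example.

```python
# Toy datacube with dimensions (time, lat, lon), stored as nested lists.
# Values are synthetic: value = 100*t + 10*i + j (an assumption for the example).
N_TIME, N_LAT, N_LON = 4, 2, 3
cube = [[[100 * t + 10 * i + j for j in range(N_LON)]
         for i in range(N_LAT)]
        for t in range(N_TIME)]


def slice_time(cube, t):
    """Slicing: fix the time dimension to one index, which drops that dimension."""
    return cube[t]


def dice(cube, t_range, lat_range, lon_range):
    """Dicing: keep a half-open sub-range [start, stop) along every dimension."""
    return [[row[lon_range[0]:lon_range[1]]
             for row in plane[lat_range[0]:lat_range[1]]]
            for plane in cube[t_range[0]:t_range[1]]]


def roll_up_time(cube, agg=max):
    """Roll-up: aggregate along time, leaving a (lat, lon) grid."""
    n_lat, n_lon = len(cube[0]), len(cube[0][0])
    return [[agg(plane[i][j] for plane in cube) for j in range(n_lon)]
            for i in range(n_lat)]


def pivot_time_innermost(cube):
    """Pivoting: reorder dimensions from (time, lat, lon) to (lat, lon, time)."""
    n_lat, n_lon = len(cube[0]), len(cube[0][0])
    return [[[plane[i][j] for plane in cube] for j in range(n_lon)]
            for i in range(n_lat)]


if __name__ == "__main__":
    print(slice_time(cube, 1))
    print(dice(cube, (0, 2), (0, 1), (1, 3)))
    print(roll_up_time(cube))
    print(pivot_time_innermost(cube)[1][2])
```

Drill-down is the inverse of roll-up (returning from the aggregated grid to the finer-grained data), so it is not shown separately.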

2. Server-side approach
Most scientists currently use a workflow based on the “search, locate, download, and analyze” sequence of steps. This workflow will not be feasible at large scale, for several reasons: (i) ever-larger scientific datasets, (ii) time- and resource-consuming data downloads, and (iii) increased problem size and complexity requiring bigger computing facilities. At large scale, scientific discovery will need to rely on data-intensive facilities close to data storage, parallel I/O systems, and server-side analysis capabilities. Such an approach moves the analysis (and its complexity) from the user’s desktop to the data centers and, accordingly, shifts the infrastructure focus from data sharing to data analysis.

3. Parallel and distributed
The Ophidia analytics platform provides several MPI-based parallel operators that manipulate, as a whole, the entire set of fragments associated with a datacube. Relevant examples include datacube sub-setting (slicing and dicing), datacube aggregation, array-based primitives at the datacube level, datacube duplication, datacube pivoting, and NetCDF file import and export. To address scalability and enable parallelism, a datacube in Ophidia is physically split horizontally into several chunks (called fragments) that are distributed across multiple I/O nodes. Each I/O node hosts a set of I/O servers optimized to manage n-dimensional arrays. Each I/O server, in turn, manages a set of databases, each consisting of one or more fragments. Tuning the levels of this hierarchy (I/O nodes, I/O servers per node, databases per server, fragments per database) affects performance: for a given datacube, the higher the product of the four levels, the smaller each fragment.
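
The relationship between the hierarchy levels and fragment size can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the four levels are I/O nodes, I/O servers per node, databases per server and fragments per database, that rows are split evenly, and that fragments fill databases in order. The class and function names are hypothetical and are not part of Ophidia.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hierarchy:
    """Assumed four-level storage hierarchy (names are hypothetical)."""
    io_nodes: int
    servers_per_node: int
    databases_per_server: int
    fragments_per_database: int

    def total_fragments(self) -> int:
        # The number of fragments is the product of the four levels.
        return (self.io_nodes * self.servers_per_node
                * self.databases_per_server * self.fragments_per_database)


def rows_per_fragment(total_rows: int, h: Hierarchy) -> int:
    """Upper bound on rows per fragment under an even horizontal split."""
    return math.ceil(total_rows / h.total_fragments())


def locate(fragment_id: int, h: Hierarchy):
    """Map a fragment index to (node, server, database, slot), filling in order."""
    if not 0 <= fragment_id < h.total_fragments():
        raise ValueError("fragment_id out of range")
    slot = fragment_id % h.fragments_per_database
    db_global = fragment_id // h.fragments_per_database
    database = db_global % h.databases_per_server
    server_global = db_global // h.databases_per_server
    server = server_global % h.servers_per_node
    node = server_global // h.servers_per_node
    return node, server, database, slot


if __name__ == "__main__":
    small = Hierarchy(2, 1, 1, 4)   # product = 8 fragments
    large = Hierarchy(4, 2, 2, 4)   # product = 64 fragments
    for h in (small, large):
        print(h.total_fragments(), rows_per_fragment(1_000_000, h))
    print(locate(63, large))
```

With one million rows, raising the product from 8 to 64 shrinks each fragment from 125,000 rows to 15,625 rows, which is the trade-off the paragraph above describes.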

4. Declarative
Ophidia exploits a declarative (query-like) approach to express and define data analysis tasks. Its declarative data analytics language allows the user to create, manage, manipulate and analyze datacubes by describing “what” is to be performed rather than “how”, leaving the selection of the implementation strategy to the system. Moreover, Ophidia supports complex workflows (operational chains), with a specific syntax and a dedicated scheduler that exploits inter- and intra-task parallelism.
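
To show what inter-task parallelism means for a workflow, here is a minimal sketch that groups the tasks of a dependency graph into stages whose members could run concurrently. It uses Python's standard `graphlib` module (Python 3.9+); the workflow, the task names and the `parallel_stages` helper are hypothetical and do not reproduce Ophidia's workflow syntax or scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the tasks it depends on.
workflow = {
    "import": [],
    "subset_summer": ["import"],
    "subset_winter": ["import"],
    "max_summer": ["subset_summer"],
    "max_winter": ["subset_winter"],
    "export": ["max_summer", "max_winter"],
}


def parallel_stages(deps):
    """Group tasks into stages; tasks within a stage do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as completed
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(workflow), start=1):
        print(number, stage)
```

The user only declares the tasks and their dependencies; the order of execution and the opportunities for concurrency (here, the two subsetting branches) are derived by the scheduler.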

5. Extensible
Ophidia provides a large set of operators (50+) and primitives (100+) covering various types of data manipulation, such as sub-setting (slicing and dicing), data reduction, duplication, pivoting, and file import and export. The framework is also highly customizable: a minimal set of APIs makes it possible to develop your own operators or primitives to provide new algorithms and functionality.

Contacts:

Sandro Fiore – ASC Division
[email protected]
