03 - EUDAT: Towards A European Collaborative Data Infrastructure

Michał Jankowski (PSNC), Damien Lecarpentier (CSC -IT), David Vicente (BSC)

In recent years significant investments have been made by the European
Commission and European member states to create a pan-European e-Infrastructure supporting multiple research communities. As a result, a European e-Infrastructure ecosystem is currently taking shape, with communication networks, distributed grids and HPC facilities providing European researchers from all fields with state-of-the-art instruments and services that support the deployment of new research facilities on a pan-European level. However, the accelerated proliferation of data – newly
available from powerful new scientific instruments, simulations and the digitization of existing resources – has created a new impetus for increasing efforts and investments in order to tackle the specific challenges of data management, and to ensure a coherent approach to research data access and preservation.

EUDAT is a pan-European initiative that started in October 2011 and aims to help overcome these challenges by laying the foundations of a Collaborative Data Infrastructure (CDI) in which centres offering community-specific support services to their users can rely on a set of common data services shared between different research communities. Although research communities from different disciplines have different ambitions and approaches – particularly with respect to data organization and content – they also share many basic service requirements. This commonality makes it possible for EUDAT to establish common data services, designed to support multiple research communities, as part of this CDI.

During the first 18 months of the project, EUDAT has reviewed the approaches and requirements of a first subset of communities from linguistics (CLARIN), solid earth sciences (EPOS), climate sciences (ENES), environmental sciences (LIFEWATCH), and biological and medical sciences (VPH), and shortlisted four generic services to be deployed as shared services on the EUDAT infrastructure: data replication from site to site, data staging to compute facilities, metadata, and easy storage. A number of enabling services, such as distributed authentication and authorization, persistent identifiers, hosting of services, workspaces and a centre registry, were also discussed. Multi-disciplinary teams are currently developing service prototypes together with a pre-production operational infrastructure, which at the moment comprises five sites (RZG, CINECA, SARA, CSC, and FZJ) offering 480 TB of online storage and 4 PB of near-line (tape) storage. Additional services are being considered for inclusion in the second development round, which covers the next 18 months. They concern data annotation, handling of (near) real-time data, crowd-sourcing, and web services.

The services being designed in EUDAT will thus be of interest to a broad range of communities that lack their own robust data infrastructures, or that are simply looking for additional storage and/or computing capacity to better access, use, re-use, and preserve their data. Although EUDAT initially focused on a subset of research communities, discussions with other research communities – from environmental sciences, biomedical science, physics, social sciences and humanities – have already begun, and EUDAT aims to involve these communities in the design of the infrastructure and its services.

While the first months of the project mostly focussed on investigating communities’ requirements and designing first prototypes, increasing effort will be devoted in the coming months to the operation of the infrastructure and its evolution. Among other things, this implies early definition of future partnership and business models for adopting, supporting and sustaining common services developed for, and partly operated by, the different research communities.

About the poster content:

The poster presents the concept of a layered Collaborative Data
Infrastructure (CDI) that includes Common Data Services - a common basis
for various community-specific mechanisms and functionalities included in
Community Support

The poster briefly discusses the main service cases and scenarios and data
aspects currently addressed by CDI, including data replication for safe
data staging (transmission of data among EUDAT storage and HPC/HTC
and simple storage services as well as metadata handling and
AAI-related mechanisms.
The poster also shows which communities and data centres are already involved
in the project and explains how new communities can join the
