Updated 15/02/2024
DTE Infrastructure Component

RUCIO

Federated Data Infrastructure

Description

Rucio provides a service that manages data locality: a scalable solution for managing the dynamic locality of files in a heterogeneous, federated storage Data Lake.

Rucio manages data locality within a heterogeneous, federated storage ensemble: the Data Lake. It does this by creating and removing file replicas so that both the declarative rules and the capacity limits of the available storage services are satisfied. Rules specify potentially complex criteria on where data should be located and may optionally apply for only a limited duration; users can add or remove rules at any time. Rucio can also attach rules automatically to new data, triggering immediate data management activity. These controls allow Rucio to support complex and dynamic data placement requirements.
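The rule-driven behaviour described above can be illustrated with a small, self-contained sketch. This is not Rucio's implementation or API: the `Rule` class, the `plan_replicas` function, the per-site "free slots" capacity model, and all site and file names are hypothetical, chosen only to show how declarative rules with an optional expiry could be turned into replica creations and removals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical declarative rule: keep `copies` replicas on allowed sites."""
    file_name: str
    copies: int
    allowed_sites: frozenset
    expires_at: Optional[float] = None  # None means the rule never expires

    def active(self, now: float) -> bool:
        return self.expires_at is None or now < self.expires_at


def plan_replicas(rules, replicas, free_slots, now):
    """Return (to_create, to_remove) lists of (file_name, site) pairs.

    replicas:   dict file_name -> set of sites currently holding a copy
    free_slots: dict site -> number of additional files the site can hold
    """
    free = dict(free_slots)  # copy, so the caller's data is left untouched
    state = {name: set(sites) for name, sites in replicas.items()}
    protected = set()        # replicas that some active rule still needs
    to_create = []

    for rule in rules:
        if not rule.active(now):
            continue
        held = state.get(rule.file_name, set()) & rule.allowed_sites
        # Existing (or already planned) replicas count towards the rule first.
        keep = sorted(held)[: rule.copies]
        protected.update((rule.file_name, site) for site in keep)
        missing = rule.copies - len(keep)
        # Fill the gap on allowed sites that still have capacity.
        for site in sorted(rule.allowed_sites - held):
            if missing <= 0:
                break
            if free.get(site, 0) > 0:
                free[site] -= 1
                state.setdefault(rule.file_name, set()).add(site)
                to_create.append((rule.file_name, site))
                protected.add((rule.file_name, site))
                missing -= 1

    # Existing replicas that no active rule protects are removal candidates.
    to_remove = sorted(
        (name, site)
        for name, sites in replicas.items()
        for site in sites
        if (name, site) not in protected
    )
    return to_create, to_remove
```

In this toy model, a rule asking for two copies of a file that currently has one copy on an allowed site produces one planned creation on the first allowed site with spare capacity. Once the rule's expiry time has passed, every replica it protected becomes a removal candidate, which mirrors the time-limited placement described in the text.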

Rucio provides a scalable solution that has operated in production for over a decade, managing petabytes of data and millions of files. It supports globally distributed data centres, with corresponding monitoring and analytics.

When deployed, the Rucio software provides a service that allows a group of researchers to manage non-trivial amounts of data. For any given file, dataset (a collection of files), or container (a collection of files and datasets), it reports where that data is currently available. It also supports dynamic, time-limited data placement, in which data is made available for a set period (e.g., to support a computational workflow). It optimises the use of available storage by operating a cache, on the assumption that previously used data is more likely to be needed again. Desired data locality is expressed as declarative rules, which may be applied both to existing datasets and to anticipated future data.
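The file, dataset and container hierarchy, and the "where is this data available" lookup, can be sketched as follows. This is an illustrative model only, not the Rucio client API: `resolve_files`, `locate`, the dictionary-based catalogue, and the file and site names in the description are all assumptions made for the example.

```python
def resolve_files(name, catalogue, _seen=None):
    """Return the set of file names reachable from `name`.

    catalogue: dict mapping a dataset or container name to its children.
    Any name that is not a key in `catalogue` is treated as a file.
    """
    seen = set() if _seen is None else _seen
    if name in seen:  # already visited: avoids cycles and repeated work
        return set()
    seen.add(name)
    if name not in catalogue:
        return {name}
    files = set()
    for child in catalogue[name]:
        files |= resolve_files(child, catalogue, seen)
    return files


def locate(name, catalogue, replicas):
    """Map each file under `name` to the sorted list of sites holding a replica.

    replicas: dict file_name -> set of sites with a copy of that file.
    Files with no known replica map to an empty list.
    """
    return {
        file_name: sorted(replicas.get(file_name, ()))
        for file_name in sorted(resolve_files(name, catalogue))
    }
```

Calling `locate` on a container expands it recursively into its files and reports, for each one, the sites that currently hold a replica. The same call works unchanged for a dataset or for a single file, which is the uniform view the text describes.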

Target Audience

In principle, all DT users who manage their data through the Data Lake concept are using Rucio. Depending on the use case, DT users may interact with Rucio directly or through an intermediate service.

License

Apache 2.0

Created by

Release Notes

Rucio is a fully established project, independent of the interTwin project. The software is production-ready, at TRL 9, and hardened by over a decade of mission-critical production use. The project's software runs in multiple deployments, operated by different user communities.

In interTwin, we track the latest Rucio releases to take advantage of new features.

Future Plans

Federated data management within interTwin’s DTE blueprint was developed based on the feature set of Rucio.  The Rucio project has a well-established support process that is science-agnostic and community-driven.  Deployments of the interTwin DTE blueprint that take advantage of Rucio will be supported through the Rucio project.

As the interTwin project sees the initial adoption of Rucio by new communities, some additional features have been identified as useful. Not all of these have been implemented within the interTwin project. Those that remain unimplemented have been documented and made available as possible topics for future projects.