Rucio manages data locality within a heterogeneous, federated storage ensemble: the Data Lake. It does this by creating and removing file replicas so that the declarative rules currently in force are satisfied while the capacity limits of the available storage services are respected. Rules specify potentially complex criteria for where data should be located and may optionally apply for only a limited duration; users may add or remove rules at any time. Rucio can also attach rules automatically to new data, triggering immediate data management activity. These controls allow Rucio to support complex and dynamic data placement requirements.
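As an illustration of how declarative rules can drive replica creation and removal, the following is a minimal sketch of a toy reconciliation step. It is not Rucio's implementation or API: the `Rule` class, the `reconcile` function, and the site names are all hypothetical, and storage capacity limits are ignored for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: keep `copies` replicas of `file` among `sites`."""
    file: str
    copies: int
    sites: frozenset
    expires_at: Optional[float] = None  # None means the rule never expires

    def active(self, now: float) -> bool:
        return self.expires_at is None or now < self.expires_at


def reconcile(rules, replicas, now):
    """Return (to_create, deletable) as sets of (file, site) pairs.

    `replicas` is the set of (file, site) pairs that currently exist.
    Replicas not protected by any active rule are reported as deletable.
    """
    present = set(replicas)   # existing replicas plus those planned so far
    protected = set()
    to_create = set()
    for rule in rules:
        if not rule.active(now):
            continue
        # Replicas (existing or planned) that already count towards this rule.
        have = sorted(s for s in rule.sites if (rule.file, s) in present)
        keep = have[: rule.copies]
        protected.update((rule.file, s) for s in keep)
        # Fill any shortfall from candidate sites that lack a replica.
        missing = rule.copies - len(keep)
        candidates = sorted(s for s in rule.sites if (rule.file, s) not in present)
        for site in candidates[:missing]:
            to_create.add((rule.file, site))
            present.add((rule.file, site))
            protected.add((rule.file, site))
    deletable = set(replicas) - protected
    return to_create, deletable


rules = [
    Rule("raw.dat", copies=2, sites=frozenset({"SITE_A", "SITE_B", "SITE_C"})),
    Rule("tmp.dat", copies=1, sites=frozenset({"SITE_A"}), expires_at=100.0),
]
replicas = {("raw.dat", "SITE_A"), ("tmp.dat", "SITE_A")}

# While both rules are active, one extra copy of raw.dat is needed.
print(reconcile(rules, replicas, now=50.0))
# Once the time-limited rule expires, its replica is no longer protected.
print(reconcile(rules, replicas, now=150.0))
```

In this sketch, adding or removing a rule simply changes the input to the next reconciliation pass, which is one way to picture how rules can be changed at any time without describing individual transfers.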
Rucio provides a scalable solution that has operated in production for over a decade, managing petabytes of data and millions of files. Rucio supports globally distributed data centres, with corresponding monitoring and analytics.
When deployed, the Rucio software provides a service that allows a group of researchers to manage non-trivial amounts of data. For any given file, dataset (a collection of files), or container (a collection of datasets and other containers), it provides information on where that data is currently available. It also supports dynamic, time-limited data placement, with data being made available for some period (e.g., to support some computational workflow). It optimises the use of available storage by operating a cache, on the assumption that previously used data is more likely to be needed again. Desired data locality is expressed in terms of declarative rules. These rules may be applied both to existing datasets and to anticipated future data.
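To make the idea of asking "where is this data?" for a file or a nested collection more concrete, here is a small sketch that expands a name into the files it covers and reports the sites holding each one. The data structures and the function names `resolve_files` and `locate` are hypothetical and do not reflect Rucio's client API; the sketch also assumes the hierarchy contains no cycles.

```python
def resolve_files(name, contents):
    """Expand a file or collection name into the set of files it covers.

    `contents` maps each collection name to the names it directly contains.
    A name that does not appear as a key is treated as a file.
    """
    if name not in contents:
        return {name}
    files = set()
    for child in contents[name]:
        files |= resolve_files(child, contents)
    return files


def locate(name, contents, replicas):
    """Map each file under `name` to the sorted list of sites holding it."""
    return {
        f: sorted(replicas.get(f, ()))
        for f in sorted(resolve_files(name, contents))
    }


# Hypothetical hierarchy: one container holding two datasets.
contents = {
    "container": ["ds1", "ds2"],
    "ds1": ["f1", "f2"],
    "ds2": ["f2", "f3"],
}
# Hypothetical replica catalogue: file name -> sites with a copy.
replicas = {"f1": {"SITE_B", "SITE_A"}, "f2": {"SITE_A"}}

print(locate("container", contents, replicas))
# {'f1': ['SITE_A', 'SITE_B'], 'f2': ['SITE_A'], 'f3': []}
```

A file with an empty site list, such as `f3` here, is known to the catalogue but currently has no available replica.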