Data Management Policy

Introduction

PIC provides its users with services and support to manage, analyze and share their research data. 

PIC recognizes that good practice in data management is essential for an efficient and effective research process. Preserving the data that is used to produce scientific results is key to ensuring research integrity and reproducibility.

We provide data storage services optimized for different needs, as well as tools and support that enable users to analyze, manage and share their data effectively.

 

Definitions

Project: represents a group of users using shared resources at PIC within the framework of a single research entity, e.g.: a scientific collaboration, a university department or a research institution.

Project/group contact: the person, typically a member of PIC’s staff, acting as the liaison with the centre and providing support to the members of a Project. The list of contacts for each project can be found on PIC’s web page.

Data: primary information or content of interest, typically stored at PIC in file format.

Metadata: contextual information that describes the characteristics, properties and usage of the data.

File sizes

  • Small: KBs to a few MBs, e.g. source code and configuration files
  • Medium: Several MBs to hundreds of MBs
  • Large: GBs and higher

 

Datacenter and User Responsibilities

Users are the owners of the data and have the ultimate responsibility for managing it. PIC provides services and support that help them in this task, but users are responsible for:

  • Determining what data requires backup protection. Critical data should never have a single copy. PIC is not responsible for data loss.
  • Configuring appropriate access control for stored data. 
  • Following the applicable use policies. 
  • Defining a file naming convention, versioning their data and using appropriate metadata.
  • Defining a Data Management Plan (DMP) which is coherent with this policy. Users are encouraged to use tools such as https://dmp.csuc.cat/ for managing their DMP.

The datacenter is responsible for operating the infrastructure that supports research data storage, which implies:

  • Keeping the storage platform monitored and solving any operational problems that occur
  • Providing accounting information for each research project
  • Deploying different storage types with different characteristics, and supporting research projects by providing the information they need to choose the appropriate storage type for their requirements
  • Informing research groups in advance of any maintenance tasks that affect the availability of the storage services
  • Applying security measures to ensure platform integrity and protection 
  • Managing the backup infrastructure to guarantee data integrity and availability, and ensuring regular backups according to the backup policy of each project, storage type or dataset. All backups are stored at the PIC facility and there is no off-site backup.

 

Data Access Control

The PIC data centre is intended primarily for fundamental research and is therefore designed for open research data. PIC does not allow the storage or analysis of personally identifiable information on its infrastructure.

Files are protected using POSIX file permissions (and ACLs) based on user and group IDs. It is the user’s responsibility to configure file permissions and umasks to match their access protection needs.
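As an illustration of the paragraph above, the following minimal Python sketch shows one way to apply POSIX permissions and a umask from a script. The function names are hypothetical, and the choice of modes (group-shared directory, owner-only file) is an assumption made for the example, not a PIC requirement. ACLs are not covered.

```python
import os
import stat


def make_group_shared_dir(path: str) -> None:
    """Create a directory usable by owner and group, closed to everyone else."""
    os.makedirs(path, exist_ok=True)
    # rwx for owner and group; the setgid bit asks the file system to give
    # new entries the directory's group (behaviour on Linux file systems).
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG | stat.S_ISGID)


def write_private_file(path: str, text: str) -> None:
    """Create a new file that only the owner can read and write."""
    previous_umask = os.umask(0o077)  # mask out all group/other permission bits
    try:
        # The umask only affects files created by this call; an existing
        # file keeps whatever mode it already had.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    finally:
        os.umask(previous_umask)  # restore the process-wide umask
```

The umask is changed only for the duration of the write and then restored, so other files created by the same process keep the user's normal defaults.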

 

Storage Resources

PIC provides various storage solutions, each targeting different performance and volume needs. Users should select the most appropriate system for their project needs. Most users will use a combination of various file systems, depending on data type or lifecycle stage.

Storage for internal access

Home File System

Purpose

The home file system is intended to hold source code, configuration files, etc. It is optimized for small to medium-sized files and is not meant to store data files.

Backup

Home directories keep two separate copies of the files and use a snapshot capability to preserve an 8-day history of their directories. If recovery is needed, users must request it via email by contacting their designated group representative at PIC.

Quota

By default, each user has a directory with a 25 GB hard quota in the home file system. Alternative quotas may be applied to specific groups or users upon bilateral agreement. This should be requested via email to your group contact at PIC.
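Users who want a rough idea of how close they are to the quota can sum the sizes of their files. The sketch below is only an estimate under stated assumptions: the function names are hypothetical, the 25 GB figure is interpreted in binary units, and the file system's own accounting (which remains authoritative) may count usage differently.

```python
import os

# Default home quota from this policy; binary units are an assumption.
HOME_QUOTA_BYTES = 25 * 1024**3


def directory_usage_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular entries found under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symbolic links, so link targets
                # outside the tree are not counted.
                total += os.lstat(full_path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def quota_fraction_used(root: str, quota_bytes: int = HOME_QUOTA_BYTES) -> float:
    """Return usage as a fraction of the quota (1.0 means the quota is full)."""
    return directory_usage_bytes(root) / quota_bytes
```

For example, `quota_fraction_used(os.path.expanduser("~"))` returns a value between 0 and 1 for a home directory that is within the default quota.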

 

Software Area File System

Purpose

Primarily intended for sharing software installations within a project. It is a high-IOPS file system optimized for small files.

Backup

Software Area directories keep two separate copies of the files and use a snapshot capability to preserve an 8-day history of their directories. If recovery is needed, users must request it via email by contacting their designated group representative at PIC.

Quota

By default, the software area directory for a project has a 50 GB hard quota. Alternative quotas may be applied to specific groups upon bilateral agreement.

 

Internal Data File System 

Purpose

Intended for storing input and output data files for analysis jobs sent to the processing cluster at PIC and for sharing data among project members. The project directory has two sub-directories intended for different purposes:

  • Common: area with group read/write access, intended for sharing common files among project members.
  • Scratch: area with group read/write access. Single copy and no backup. Should only hold temporary and non-critical data.

Backup

Common directories keep two separate copies of the files and use a snapshot capability to preserve a 15-day history of their directories. If recovery is needed, users must request it via email by contacting their designated group representative at PIC.

Scratch directories keep only one copy of the files and are not backed up.

Quota

The default quotas for new projects on the different file system directories are: 5 TB in Scratch, 5 TB in Common. Projects with larger quota needs can request them via email to their group contact at PIC.

 

External Data File System 

Purpose

Intended for storing large volumes of input and output data files for analysis jobs sent to the processing cluster at PIC and for sharing data with external partners. It is implemented on dCache, a highly reliable and scalable file system designed to handle immutable files as efficiently as possible. It is a write-once-read-many (WORM) storage system and therefore does not support updating existing files.
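Because existing files cannot be updated, a common way to work with write-once storage is to write each revision under a new, versioned name. The sketch below illustrates that pattern with ordinary local file calls. The function names and the `_vNNN` naming scheme are hypothetical, and whether a given access path to the External Data File System supports these exact calls is an assumption to check with the group contact.

```python
import os


def next_version_path(directory: str, stem: str, suffix: str) -> str:
    """Return the first unused path of the form <stem>_vNNN<suffix>."""
    version = 1
    while True:
        candidate = os.path.join(directory, f"{stem}_v{version:03d}{suffix}")
        if not os.path.exists(candidate):
            return candidate
        version += 1


def write_new_version(directory: str, stem: str, suffix: str, payload: bytes) -> str:
    """Write payload to a new versioned file and return the path used."""
    path = next_version_path(directory, stem, suffix)
    # Mode "xb" raises FileExistsError rather than overwriting, so an
    # existing version is never modified in place.
    with open(path, "xb") as handle:
        handle.write(payload)
    return path
```

Calling `write_new_version` twice with the same stem leaves both versions in place, which keeps earlier results available for reproducibility.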

The External Data File System is accessible from remote locations through high-performance data transfer links (10 Gbps or faster). Multiple protocols are available, including HTTP/WebDAV and XRootD.

Backup

The External Data File System (dCache) is not backed up. The default configuration keeps one copy per file, but it can be modified to keep multiple copies, depending on the directory. Projects requiring special data redundancy should coordinate the appropriate replication configuration through their contact person at PIC.

Quota

A project can store from hundreds to thousands of TB of data in the External Data File System, depending on the availability of resources and the particular agreement of the project with PIC. Allocations will be defined on a case-by-case basis.

 

Long Term Data Archive

Purpose

Intended for preserving data that is not frequently accessed (once a year or less) or is accessed sequentially. This system is optimized for large files, ideally 10 GB or larger.
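Since the archive favours large files, datasets made of many small files are usually bundled before being archived. The following sketch shows one way to do this with Python's standard `tarfile` module. The function name is hypothetical, and the choice of an uncompressed tar file is an assumption for illustration, not a PIC requirement.

```python
import os
import tarfile


def bundle_directory(source_dir: str, archive_path: str) -> int:
    """Pack source_dir into one uncompressed tar file; return its size in bytes.

    archive_path must be located outside source_dir.
    """
    # Store entries relative to the directory's own name, not its full path.
    top_level_name = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w") as archive:
        archive.add(source_dir, arcname=top_level_name)
    return os.path.getsize(archive_path)
```

The returned size can be compared against the 10 GB guideline to decide whether several directories should be combined into a single bundle.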

If this type of storage is needed, the project’s designated contact person must request it via email through their group representative at PIC.

Backup

By default, a single copy of the data is written to tape. Multiple copies within PIC can be configured upon request to protect against data loss due to tape media problems. All tapes are stored on-site at PIC, and there is no off-site backup.

Quota

Tape allocations will be defined on a per-project basis.