February 2024 Research Storage Incident
Updates
20-Mar-2024, 4PM: All restores are complete apart from the single directory containing 9+ million files inside /cs/natlang, and a few files that could not be recovered because of Unicode characters in their filenames. We are still investigating how to handle these cases, since they can be restored locally but not to the new NFS server.
Linux Users
If you use any of the shares listed below, please reboot to refresh your NFS automounts table. Paths remain the same.
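After rebooting, you can quickly confirm that the shares your group uses are reachable. The sketch below is a hypothetical check (not an RCG-supplied tool): it simply lists each path, which prompts the automounter to mount it, and reports any failures. The path list is copied from the table further down, with wildcard entries such as /cs/natlang-* omitted; trim it to the shares you actually use.

```python
#!/usr/bin/env python3
"""Post-reboot sanity check: confirm the restored shares are reachable.

Hypothetical helper -- the paths below are copied from the table in this
notice; edit the list to match the shares your group actually uses.
"""
import os

SHARES = [
    "/cs/compbio2",
    "/cs/ddmlab",
    "/cs/ghassan",
    "/cs/oschulte",
    "/gruvi/Data",
    "/gruvi/home",
    "/gruvi/usr",
    "/ensc/BORG",
    "/ensc/IMAGEBORG",
    "/rcg/afh",
]

for share in SHARES:
    try:
        # Listing the directory prompts the automounter to mount the share.
        os.listdir(share)
        print(f"OK      {share}")
    except OSError as err:
        print(f"FAILED  {share}: {err}")
```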
macOS Users
Please note the "New macOS Path" below. You will now connect directly to the new file server rather than going through rcga-bluebell.
To connect to a share, switch to the Finder => [Command]+[K] => [copy and paste New macOS Path below] => [Return].
Use your SFU computing ID and password if prompted. Enter the computing ID on its own (e.g., jsixpack), not jsixpack@sfu.ca.
Windows Users
Please note the "New Windows Path" below. You will now connect directly to the new file server rather than going through rcga-bluebell.
To connect to a share, [Windows key]+[R] => [copy and paste New Windows Path below] => [Enter].
Use your SFU computing ID and password if prompted. Enter the computing ID on its own (e.g., jsixpack), not jsixpack@sfu.ca.
Linux Path | Old CIFS Path | New macOS Path | New Windows Path |
---|---|---|---|
/cs/compbio2 | n/a | smb://bbysvm-nfs1.its.sfu.ca/Computational_Biology_Lab | \\bbysvm-nfs1.its.sfu.ca\Computational_Biology_Lab |
/cs/ddmlab | \\rcga-bluebell\ddmlab | smb://bbysvm-nfs1.its.sfu.ca/Database_and_Data_Mining_Lab | \\bbysvm-nfs1.its.sfu.ca\Database_and_Data_Mining_Lab |
/cs/ghassan | \\rcga-bluebell\ghassan | smb://bbysvm-nfs1.its.sfu.ca/Ghassan_Hamarneh_Lab | \\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab |
/cs/ghassan{1,2,3} | \\rcga-bluebell\ghassan{1,2,3} | smb://bbysvm-nfs1.its.sfu.ca/Ghassan_Hamarneh_Lab | \\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab |
/cs/natlang-* | n/a | smb://bbysvm-nfs1.its.sfu.ca/Natural_Language_Lab | \\bbysvm-nfs1.its.sfu.ca\Natural_Language_Lab |
/cs/oschulte | \\rcga-bluebell\Schulte Research | smb://bbysvm-nfs1.its.sfu.ca/Oliver_Schulte_Lab | \\bbysvm-nfs1.its.sfu.ca\Oliver_Schulte_Lab |
/gruvi/Data | \\rcga-bluebell\Gruvi_Data | smb://bbysvm-nfs1.its.sfu.ca/GrUViLab/Data | \\bbysvm-nfs1.its.sfu.ca\GrUViLab\Data |
/gruvi/home | \\rcga-bluebell\Gruvi_home | smb://bbysvm-nfs1.its.sfu.ca/GrUViLab/home | \\bbysvm-nfs1.its.sfu.ca\GrUViLab\home |
/gruvi/usr | \\rcga-bluebell\Gruvi_usr | smb://bbysvm-nfs1.its.sfu.ca/GrUViLab/usr | \\bbysvm-nfs1.its.sfu.ca\GrUViLab\usr |
/ensc/BORG | \\rcga-bluebell\BORG | smb://bbysvm-nfs1.its.sfu.ca/Biomedical_Optics_Research_Group/BORG | \\bbysvm-nfs1.its.sfu.ca\Biomedical_Optics_Research_Group\BORG |
/ensc/IMAGEBORG | \\rcga-bluebell\IMAGEBORG | smb://bbysvm-nfs1.its.sfu.ca/Biomedical_Optics_Research_Group/IMAGEBORG | \\bbysvm-nfs1.its.sfu.ca\Biomedical_Optics_Research_Group\IMAGEBORG |
/kin/ipml-* | \\rcga-bluebell\IPM-* | smb://bbysvm-nfs1.its.sfu.ca/Biomedical_Physiology_and_Kinesiology | \\bbysvm-nfs1.its.sfu.ca\Biomedical_Physiology_and_Kinesiology |
/rcg/afh | \\rcga-bluebell\afh-share | smb://bbysvm-nfs1.its.sfu.ca/Technology_in_Context_Design_Lab | \\bbysvm-nfs1.its.sfu.ca\Technology_in_Context_Design_Lab |
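If you have scripts, drive mappings, or saved shortcuts that still reference the old \\rcga-bluebell paths, the following sketch shows one way to rewrite them using the mapping above. It is a hypothetical example rather than an official migration tool; only shares with an Old CIFS Path are included, and the wildcard IPM-* entries are omitted.

```python
#!/usr/bin/env python3
r"""Translate old \\rcga-bluebell UNC paths to the new file server.

Hypothetical example based on the mapping table above. Only shares with an
"Old CIFS Path" entry are included; the wildcard IPM-* shares are omitted.
"""

OLD_TO_NEW = {
    r"\\rcga-bluebell\ddmlab": r"\\bbysvm-nfs1.its.sfu.ca\Database_and_Data_Mining_Lab",
    r"\\rcga-bluebell\ghassan": r"\\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab",
    r"\\rcga-bluebell\ghassan1": r"\\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab",
    r"\\rcga-bluebell\ghassan2": r"\\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab",
    r"\\rcga-bluebell\ghassan3": r"\\bbysvm-nfs1.its.sfu.ca\Ghassan_Hamarneh_Lab",
    r"\\rcga-bluebell\Schulte Research": r"\\bbysvm-nfs1.its.sfu.ca\Oliver_Schulte_Lab",
    r"\\rcga-bluebell\Gruvi_Data": r"\\bbysvm-nfs1.its.sfu.ca\GrUViLab\Data",
    r"\\rcga-bluebell\Gruvi_home": r"\\bbysvm-nfs1.its.sfu.ca\GrUViLab\home",
    r"\\rcga-bluebell\Gruvi_usr": r"\\bbysvm-nfs1.its.sfu.ca\GrUViLab\usr",
    r"\\rcga-bluebell\BORG": r"\\bbysvm-nfs1.its.sfu.ca\Biomedical_Optics_Research_Group\BORG",
    r"\\rcga-bluebell\IMAGEBORG": r"\\bbysvm-nfs1.its.sfu.ca\Biomedical_Optics_Research_Group\IMAGEBORG",
    r"\\rcga-bluebell\afh-share": r"\\bbysvm-nfs1.its.sfu.ca\Technology_in_Context_Design_Lab",
}


def translate(path: str) -> str:
    """Return the new UNC path for an old one, or the input unchanged."""
    for old, new in OLD_TO_NEW.items():
        if path.lower() == old.lower():
            return new
        if path.lower().startswith(old.lower() + "\\"):
            # Keep any trailing sub-path beyond the share name.
            return new + path[len(old):]
    return path


if __name__ == "__main__":
    print(translate(r"\\rcga-bluebell\Gruvi_Data\datasets\scan01"))
    # -> \\bbysvm-nfs1.its.sfu.ca\GrUViLab\Data\datasets\scan01
```

macOS users can apply the same mapping by replacing the leading \\bbysvm-nfs1.its.sfu.ca\ with smb://bbysvm-nfs1.its.sfu.ca/ and using forward slashes, as shown in the New macOS Path column.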
20-Mar-2024, 8AM: All restores are complete apart from the single large directory in /srv/natlang and a few files that could not be recovered because of Unicode characters in their filenames; we have since discovered that these files can be restored locally. We are waiting on Servers & Storage to complete the final round of security configuration changes before resuming normal operations.
18-Mar-2024, 8AM: /srv/compbio: 33 of 50TB have been restored. /srv/hamarneh: restore completed. We will be bringing as many volumes as possible back online today.
15-Mar-2024, 8AM: /srv/compbio: 23 of 50TB have been restored. /srv/hamarneh: 3 of 8TB have been restored.
14-Mar-2024, 5PM: /srv/compbio: 20 of 50TB have been restored. /srv/hamarneh: 3 of 8TB have been restored.
14-Mar-2024, 8AM: /srv/compbio: 15 of 50TB have been restored. /srv/hamarneh: 2.9 of 8TB have been restored. We are waiting on Servers & Storage configuration changes so we can bring restored volumes back online for users.
13-Mar-2024, 8AM: /srv/natlang has been restored except for the aforementioned single large directory that caused the restore to halt. /srv/hamarneh and /srv/compbio are still being restored.
12-Mar-2024, 5PM: The /srv/ddmlab restore process has completed, except for a few files with extended characters in their names. /srv/oschulte and /srv/hamarneh are now being restored.
12-Mar-2024, 8AM: 0.7TB of the 1TB /srv/ddmlab volume has been restored (164,000 of 166,000 files).
11-Mar-2024, 5PM: /srv/ddmlab is being restored.
11-Mar-2024, 8AM: /srv/borg has been restored but access via CIFS and NFS still needs to be arranged. The restore of /srv/gruvi is still in progress. To work around the large directory limitations of the backup system, /srv/natlang is still being restored one directory at a time in the background.
08-Mar-2024, 5PM: 16TB of the 28TB /srv/borg volume has been restored. The issue regarding VPN access to research storage has been resolved.
08-Mar-2024, 8AM: 10TB of the 28TB /srv/borg volume has been restored. /srv/gruvi, which has been available in read/write mode via NFS up until this point, has been switched to read-only mode while its restore process completes. We are still waiting to hear back from Servers and Storage re: the current inability to mount restored volumes via CIFS over the SFU VPN.
07-Mar-2024, 5PM: We have started the restore process for /srv/borg.
07-Mar-2024, 8AM: The restore process for /srv/natlang has been paused while we wait for additional RAM to complete the process. We have started the restore process for /srv/gruvi in the meantime.
06-Mar-2024, 5PM: The restore process for /srv/natlang is finally making some progress now that the aforementioned large directory has been temporarily set aside. We are waiting for additional resources to be provisioned before attempting to run two restores in parallel.
06-Mar-2024, 8AM: A single directory containing tens of millions of files has been identified as the cause of false "disk full" errors during the /srv/natlang restore process. We have advised Servers & Storage, and are attempting to run another restore in parallel while the /srv/natlang restore (without the troublesome directory) completes.
05-Mar-2024, 5PM: See previous update.
05-Mar-2024, 8AM: The backup system is still in the process of re-scanning the 96 million files in /srv/natlang.
04-Mar-2024, 5PM: The backup administrator has had to restart the /srv/natlang restore process to deal with unconventional characters in filenames.
04-Mar-2024, 8AM: The tape backup system is still scanning the files on /srv/natlang to be restored.
01-Mar-2024, 1PM: Test restore complete. Initiated restore of /srv/natlang, which may take significantly longer due to the large quantity of small files and overall size (60TB, 96 million files).
01-Mar-2024, 8AM: 8.0TB of the 13TB test restore has been completed (2.0 of 2.2 million files). As previously mentioned, the restore process has made very little progress over the last several hours due to frequent tape changes.
29-Feb-2024, 5PM: 24 hours into the test restore of a single lab's data, 5TB of 13TB has been restored (1.3 million of 2.2 million files).
29-Feb-2024, 8AM: 12 hours into the test restore of a single lab's data, 2.8TB of 13TB has been restored (700,000 of 2,200,000 files). The backup administrator advises that the process will slow down significantly as it progresses since newer files are more likely to be spread across many different tapes.
28-Feb-2024, 5PM: Backup attempt of remaining online volume unsuccessful. New storage has been provisioned. Restoring a single volume from tape backup as an initial test of new storage. Currently running at ~120GB/hour. Distributed e-mail re: storage incident to affected faculty members.
28-Feb-2024, 9AM: The final attempt to revive the storage node has failed. It is still not known why multiple disk failures occurred at once. RCG has been advised to restore from tape. Performing last-minute backup for one volume that still appears responsive.
27-Feb-2024, 5PM: Waiting for communications approval; held up by staff busy with datacenter moves. Still waiting on provisioning of new storage.
27-Feb-2024, 12PM: Servers & Storage is still trying to revive the affected storage node. Waiting on new storage provisioning. Preparing communication with all affected clients. Established incident update page.
26-Feb-2024, 4PM: Working with Servers & Storage to provision new storage to prepare for "full restore from tape backup" scenario.
26-Feb-2024, 1PM: Surveying the extent of the damage and communicating with immediately affected clients. Servers & Storage still attempting to revive storage node.
26-Feb-2024, 11AM: Servers & Storage advises this may be a fatal hardware failure, and we should plan on restoring from backups.
26-Feb-2024, 10AM: Servers & Storage reports the storage node has crashed and refuses to come back online.
26-Feb-2024, 9AM: Storage node throwing hard errors. Contacted Servers & Storage.
Background
What/When
At approximately 9:00AM 26-Feb-2024, one of the research storage nodes began throwing timeouts and read/write errors before dropping offline altogether.
What should I do?
Do not attempt to save new data to the paths and shares listed in the table below. Adding or updating data is likely to fail. [The table has since been moved to the top of this page, and it is now safe to write data to these shares again.]
Are there any backups? Has any data been lost?
The last successful tape backup of the affected data concluded 25-Feb-2024. Data saved after that backup and before the hardware failure on 26-Feb-2024 has likely been lost.
When will my data be available again?
We do not have an ETA. The restore process, which began 28-Feb-2024, depends on the speed of the tape backup system. The data to be restored totals 182TB across 267 million files. [The final restore from tape took place 19-Mar-2024.]
Who
The following shares are affected:
[Table moved above. N.B. /srv/polab was previously listed but this was an error.]