Current issues

None. If you notice something wrong, please notify us.

Future events

None.

Past events

  • 2026-03-19 Hercules2
    The Hercules frontend is currently unavailable. All existing jobs continue to run normally.
  • 2026-01-17 NIC5

    The persistent cache backup device for the /home filesystem on NIC5 failed.

    As a result:

    • write-back caching was disabled,
    • write latency increased significantly,
    • write performance was strongly degraded.

    To reduce risk of data loss and further degradation:

    • compute nodes were drained,
    • new jobs were not started.

    User data remained accessible via the login node, but users were strongly advised to avoid write operations during this period.

    Updates

    • 2026-01-22 14:00: The faulty controller and its battery were replaced. Up to four more hours were still needed for the battery to charge fully before job submission could be reopened.
    • 2026-01-23 16:00: The replacement battery also failed. A new battery was expected on Tuesday 27th. In the meantime, /home write performance remained severely degraded and job submission stayed closed.
    • 2026-01-27 16:00: The new controller battery was received. An engineer was scheduled on site for Wednesday 28th at 11:00 to install it.
  • 2026-01-12 Lyra
    The cluster was unavailable from 13:00 to 16:30. If you had jobs running during that interval, please check their status.
  • 2025-12-06 NIC5
    Major electrical power maintenance in the datacenter. Please plan your work accordingly.
  • 2025-11-25 Hercules2
    The cluster is back online.
  • 2025-11-25 Hercules2
    The cluster is unavailable.
  • 2025-10-15 Lyra
    All compute nodes are back online and the cluster is working normally. During the unplanned downtime a maintenance operation was performed to switch the VMs to EPYC-Genoa CPUs.
  • 2025-10-14 Lyra
    Due to a power issue all compute nodes are currently switched off.
  • 2025-09-25 Common Storage Migration
    10:00–13:00: Namur’s hardware for the common storage is physically moved to the new datacenter. Disruptions are possible on all sites.
  • 2025-08-18 Hercules2
    2025-08-18 / 2025-08-31: maintenance.
  • 2025-06-23 Lemaitre4
    Maintenance week with global scratch (/globalscratch, $GLOBALSCRATCH) cleanup.
  • 2025-06-04 NIC5
    Due to a cooling problem, we had to drain the cluster and block the submission of new jobs. A technical intervention is scheduled tomorrow.
  • 2025-04-30 NIC5
    Due to a cooling issue some nodes had to be switched off. Impacted jobs should have been requeued.
  • 2025-04-14 Common storage
    The Common storage has been migrated to the new infrastructure. Details here.
  • 2025-02-04 NIC5
    An unplanned service outage occurred around 10:30. Running jobs have been requeued.
  • 2025-02-02 Lemaitre3
    The global filesystem is experiencing instabilities, with recurrent crashes of its services; many nodes must be restarted periodically. We are investigating the issue. UPDATE: we have restarted the whole cluster and checked everything. We are monitoring operations this afternoon and will move back to green tomorrow morning if no further errors are seen. Many jobs have been cancelled. We are sorry for the inconvenience.
  • 2024-09-21 NIC5
    At 3:47 AM, the master node of NIC5, along with all services running on it (including the Slurm controller), experienced a failure. Normal operations were restored by 7:00 AM. No impact of this outage has been observed on active jobs.
  • 2024-09-01 Lemaitre3
    Deactivation of the full system, which will become completely unavailable.
  • 2024-08-08 Dragon2
    Due to unforeseen problems during the Dragon2 maintenance, Dragon2 is currently in a degraded state. The cluster is up and you can submit jobs, with precautions explained in an email associated with the event.
  • 2024-07-29 Dragon1/2
    2024-07-29 / 2024-08-05: Maintenance week with cleaning of global scratches.
  • 2024-07-01 Lemaitre3
    Cleaning of the global scratch, deactivation of slurm, and freezing of the home directories (read-only).
  • 2024-07-01 Lemaitre4
    Some short disruptions of services to be expected from time to time during the maintenance week.
  • 2024-06-24 NIC5
    Start of the urgent unplanned maintenance; NIC5 unavailable until 13:00. Due to network problems perturbing access to /home or /CECI on some compute nodes, we had to drain the cluster during the weekend so that it would be empty of jobs on Monday morning, allowing a reboot of the InfiniBand switches. NIC5 was back at 13:00 as forecast.
  • 2024-06-10 Hercules2
    Planned maintenance week.
  • 2024-05-13 NIC5
    The second /scratch server is up, and the faulty disk has been replaced and is slowly rebuilding. To ensure data safety, the size and number of jobs per user are strictly limited until tonight.
  • 2024-05-12 NIC5
    One of the two /scratch file servers is down. Data are safe and available, but performance is degraded. Submission of new jobs is temporarily suspended.
  • 2024-04-08 Hercules2
    Hercules2 is back in service.
  • 2024-04-05 UNamur CÉCI gateway
    The UNamur CÉCI gateway is back online.
  • 2024-04-04 Hercules2
    Due to a power outage, the GPU nodes on Hercules2 are unavailable. They are expected to be back in service in the next few days.
  • 2024-04-04 Hercules2
    Due to a power outage, Hercules2 is down. The service is expected to resume Monday April 8th.
  • 2024-04-04 UNamur CÉCI gateway
    Due to a power outage, the UNamur CÉCI gateway is down.
  • 2024-03-19 Lemaitre3 and Lemaitre4
    Planned power cut.
  • 2024-02-19 Lucia
    Planned maintenance (7:00-19:00).
  • 2024-01-31 Lemaitre3
    Planned power outage (7:00-19:00).
  • 2024-01-29 Manneback
    Planned maintenance week (New date!).
  • 2023-10-12 NIC5
    The scheduled maintenance went well and ended sooner than expected.
  • 2023-10-12 NIC5
    The CECI common file system gateway of NIC5 has been rebooted. Access to all /CECI partitions has been restored.
  • 2023-10-12 NIC5
    The CECI common file system gateway of NIC5 failed. As a consequence, access to all /CECI partitions was lost. Jobs using one of these partitions may have failed.
  • 2023-10-02 NIC5 and CECI websites
    NIC5 and CECI websites inaccessible due to a networking issue (10:00–12:00).
  • 2023-09-24 Hercules
    Home filesystem back online.
  • 2023-09-23 Hercules
    Home filesystem unavailable, preventing login.
  • 2023-09-20 Lemaitre3
    The BeeGFS global scratch /scratch is back online after replacement of the failing hardware.
  • 2023-09-20 Lemaitre3
    The BeeGFS global scratch /scratch is currently unavailable.
  • 2023-09-17 Lemaitre3 and gwceci.cism.ucl.ac.be
    Network connectivity has been restored.
  • 2023-09-16 Lemaitre3 and gwceci.cism.ucl.ac.be
    UCLouvain HPC infrastructure inaccessible due to a networking issue.
  • 2023-09-05 Hercules2
    Workaround implemented to mitigate the slowdowns.
  • 2023-09-05 Hercules2
    Cluster stability issues detected due to a defective network device.
  • 2023-08-10 NIC5
    NIC5 is up and running again.
  • 2023-08-10 NIC5
    Login node memory replacement and reboot.
  • 2023-08-06 NIC5
    Hardware memory problem detected on the login node.
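
Several of the events above ask users to check the status of jobs that ran during an outage window. On Slurm-based clusters this can be done with sacct; the sketch below is illustrative only (the dates shown match the 2026-01-12 Lyra outage, and the field list is an arbitrary selection — adjust both to your case):

```shell
# Illustrative sketch: list your jobs that overlapped an outage window.
# The time range below corresponds to the 2026-01-12 Lyra outage (13:00-16:30);
# replace it with the window of the event you are checking.
sacct --starttime=2026-01-12T13:00 --endtime=2026-01-12T16:30 \
      --format=JobID,JobName,State,ExitCode,Start,End
```

Jobs shown as FAILED, NODE_FAIL, or CANCELLED during the window likely need to be resubmitted or requeued.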