Current issues

None. If you notice something wrong, please notify us.

Future events

None.

Past events

  • 2026-03-19 Hercules2
    The Hercules frontend is currently unavailable. All existing jobs continue to run normally.
  • 2026-01-17 NIC5

    The persistent cache backup device for the /home filesystem on NIC5 failed.

    As a result:

    • write-back caching was disabled,
    • write latency increased significantly,
    • write performance was strongly degraded.

    To reduce risk of data loss and further degradation:

    • compute nodes were drained,
    • new jobs were not started.

    User data remained accessible via the login node, but users were strongly advised to avoid write operations during this period.

    Updates

    • 2026-01-22 14:00: The faulty controller and its battery were replaced. Up to four more hours were still needed for the battery to charge fully before job submission could be reopened.
    • 2026-01-23 16:00: The replacement battery also failed. A new battery was expected on Tuesday 27th. In the meantime, /home write performance remained severely degraded and job submission stayed closed.
    • 2026-01-27 16:00: The new controller battery was received. An engineer was scheduled on site for Wednesday 28th at 11:00 to install it.
  • 2026-01-12 Lyra
    The cluster was unavailable from 13:00 to 16:30. If you had jobs running during that interval, please check their status.
  • 2025-12-06 NIC5
    Major electrical power maintenance in the datacenter. Please plan your work accordingly.
  • 2025-11-25 Hercules2
    The cluster is back online.
  • 2025-11-25 Hercules2
    The cluster is unavailable.
  • 2025-10-15 Lyra
    All compute nodes are back online and the cluster is working normally. During the unplanned downtime a maintenance operation was performed to switch the VMs to EPYC-Genoa CPUs.
  • 2025-10-14 Lyra
    Due to a power issue all compute nodes are currently switched off.
  • 2025-09-25 Common Storage Migration
    10:00–13:00: Namur’s hardware for the common storage is physically moved to the new datacenter. Disruptions are possible on all sites.
  • 2025-08-18 Hercules2
    2025-08-18 / 2025-08-31: maintenance.
  • 2025-06-23 Lemaitre4
    Maintenance week with global scratch (/globalscratch, $GLOBALSCRATCH) cleanup.
  • 2025-06-04 NIC5
    Due to a cooling problem, we had to drain the cluster and block the submission of new jobs. A technical intervention is scheduled tomorrow.
  • 2025-04-30 NIC5
    Due to a cooling issue some nodes had to be switched off. Impacted jobs should have been requeued.
  • 2025-04-14 Common storage
    The Common storage has been migrated to the new infrastructure. Details here.
  • 2025-02-04 NIC5
    An unplanned service outage occurred around 10:30. Running jobs have been requeued.
  • 2025-02-02 Lemaitre3
    The global filesystem is experiencing instabilities, with recurrent crashes of its services; many nodes must be restarted periodically. We are investigating the issue. UPDATE: we have restarted the whole cluster and checked everything. We are monitoring operations this afternoon and will move back to green tomorrow morning if no further errors are seen. Many jobs have been cancelled. We are sorry for the inconvenience.
  • 2024-09-21 NIC5
    At 3:47 AM, the master node of NIC5, along with all services running on it (including the Slurm controller), experienced a failure. Normal operations were restored by 7:00 AM. No impact of this outage has been observed on active jobs.
  • 2024-09-01 Lemaitre3
    Deactivation of the full system, which will become completely unavailable.
  • 2024-08-08 Dragon2
    Due to unforeseen problems during the Dragon2 maintenance, Dragon2 is currently in a degraded state. The cluster is up and you can submit jobs, with precautions explained in an email associated with the event.
  • 2024-07-29 Dragon1/2
    2024-07-29 / 2024-08-05: Maintenance week with cleaning of global scratches.
  • 2024-07-01 Lemaitre3
    Cleaning of the global scratch, deactivation of slurm, and freezing of the home directories (read-only).
  • 2024-07-01 Lemaitre4
    Some short disruptions of services to be expected from time to time during the maintenance week.
  • 2024-06-24 NIC5
    Start of the urgent unplanned maintenance; NIC5 unavailable until 13:00. Due to network problems perturbing access to /home or /CECI on some compute nodes, we had to drain the cluster during the weekend so that it would be empty of jobs on Monday morning, allowing a reboot of the InfiniBand switches. NIC5 was back at 13:00 as forecast.
  • 2024-06-10 Hercules2
    Planned maintenance week.
  • 2024-05-13 NIC5
    The second /scratch server is up, and the faulty disk has been replaced and is slowly rebuilding. To ensure data safety, the size and number of jobs per user are strictly limited until tonight.
  • 2024-05-12 NIC5
    One of the two /scratch file servers is down. Data are safe and available, but performance is degraded. Submission of new jobs is temporarily suspended.
  • 2024-04-08 Hercules2
    Hercules2 is back in service.
  • 2024-04-05 UNamur CÉCI gateway
    The UNamur CÉCI gateway is back online.
  • 2024-04-04 Hercules2
    Due to a power outage, the GPU nodes on Hercules2 are unavailable. They are expected to be back in service in the next few days.
  • 2024-04-04 Hercules2
    Due to a power outage, Hercules2 is down. The service is expected to resume Monday April 8th.
  • 2024-04-04 UNamur CÉCI gateway
    Due to a power outage, the UNamur CÉCI gateway is down.
  • 2024-03-19 Lemaitre3 and Lemaitre4
    Planned power cut.
  • 2024-02-19 Lucia
    Planned maintenance (7:00-19:00).
  • 2024-01-31 Lemaitre3
    Planned power outage (7:00-19:00).
  • 2024-01-29 Manneback
    Planned maintenance week (New date!).
  • 2023-10-12 NIC5
    The scheduled maintenance went well and ended sooner than expected.
  • 2023-10-12 NIC5
    The CECI common file system gateway of NIC5 has been rebooted. Access to all /CECI partitions has been restored.
  • 2023-10-12 NIC5
    The CECI common file system gateway of NIC5 failed. As a consequence, access to all /CECI partitions was lost. Jobs using one of these partitions may have failed.
  • 2023-10-02 NIC5 and CECI websites
    NIC5 and CECI websites inaccessible due to a networking issue (10:00–12:00).
  • 2023-09-24 Hercules
    Home filesystem back online.
  • 2023-09-23 Hercules
    Home filesystem unavailable, preventing login.
  • 2023-09-20 Lemaitre3
    The BeeGFS global scratch /scratch is back online after replacement of the failing hardware.
  • 2023-09-20 Lemaitre3
    The BeeGFS global scratch /scratch is currently unavailable.
  • 2023-09-17 Lemaitre3 and gwceci.cism.ucl.ac.be
    Network connectivity has been restored.
  • 2023-09-16 Lemaitre3 and gwceci.cism.ucl.ac.be
    UCLouvain HPC infrastructure inaccessible due to a networking issue.
  • 2023-09-05 Hercules2
    Workaround implemented to mitigate the slowdowns.
  • 2023-09-05 Hercules2
    Cluster stability issues detected due to a defective network device.
  • 2023-08-10 NIC5
    NIC5 is up and running again.
  • 2023-08-10 NIC5
    Login node memory replacement and reboot.
  • 2023-08-06 NIC5
    Hardware memory problem detected on the login node.
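
Several of the events above ask users to check the status of jobs that ran during an outage window. On Slurm-based clusters this can be done with sacct; the sketch below is illustrative only (the dates shown match the 2026-01-12 Lyra outage, and the field list is an arbitrary selection — adjust both to your case):

```shell
# Illustrative sketch: list your jobs that overlapped an outage window.
# The time range below corresponds to the 2026-01-12 Lyra outage (13:00-16:30);
# replace it with the window of the event you are checking.
sacct --starttime=2026-01-12T13:00 --endtime=2026-01-12T16:30 \
      --format=JobID,JobName,State,ExitCode,Start,End
```

Jobs shown as FAILED, NODE_FAIL, or CANCELLED during the window likely need to be resubmitted or requeued.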