2025-11-06 - Terminating Kubernetes Node Caused LDAP Database Downtime

Executive summary

On November 6, 2025, some customers on the us001 cluster were unable to log in to their portal.

Customers experienced this issue as either:

  • “Unable to establish an LDAP connection” message when attempting to log in

  • Connect jobs failing due to a bridge being offline

While investigating the issue, we identified that the RI Cloud Elastic Kubernetes Service (EKS) cluster for us001 was misbehaving. Specifically, a single node had an internal issue that prevented pods from terminating. Ordinarily this would not be a problem, because our services are architected for high availability (HA): when a node fails for any reason, Kubernetes spins up a new node and moves the pods there. On this occasion, however, because Kubernetes was unable to terminate the pods on the failing node, it also could not reschedule them onto a healthy node (a remediation sketch follows the list below). This caused some tenant services to fail, including:

  • LDAP pods that are themselves HA, but that RapidIdentity is not configured to use in an HA manner

  • individual bridges (since they are not automatically HA)

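To illustrate this failure mode, here is a minimal sketch (not our actual remediation tooling) using the official Kubernetes Python client: it finds pods on a given node that are stuck in Terminating and force-deletes them so the scheduler can recreate them elsewhere. The node name and kubeconfig are illustrative assumptions.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (use load_incluster_config()
    # when running inside the cluster).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    BAD_NODE = "ip-10-0-1-23.ec2.internal"  # hypothetical node name

    # Pods scheduled onto the unhealthy node.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={BAD_NODE}"
    ).items

    for pod in pods:
        # A non-empty deletionTimestamp means the pod is already being deleted;
        # if the kubelet on the node is unresponsive, the pod can remain in
        # Terminating indefinitely instead of being cleaned up.
        if pod.metadata.deletion_timestamp is not None:
            print(f"Force-deleting {pod.metadata.namespace}/{pod.metadata.name}")
            v1.delete_namespaced_pod(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                grace_period_seconds=0,  # skip the graceful-shutdown wait
            )

Force deletion removes the stuck pod object from the API server. In particular, a pod managed by a StatefulSet cannot be replaced until the old pod object is fully removed, so a pod stuck in Terminating blocks its replacement entirely; force deletion should only be used once the node is confirmed to be down, to avoid running two copies against the same data.
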
In reviewing the timeline of the incident, we see that the underlying node was running fine until Kubernetes decided the node was unhealthy and attempted to eject it from the cluster. We have since connected to that node, and it appears to be in good health, so we have not been able to pinpoint exactly why it failed. We do not believe RapidIdentity was at fault, because the same behavior appeared across multiple services on the node (ingress controllers, LDAP, log aggregators, etc.).
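
The post-incident health check amounted to reading the node's reported conditions and its recent events. A minimal sketch of that kind of check, again using the Kubernetes Python client, is below; the node name is a hypothetical placeholder and this is not the exact procedure we followed.

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    NODE = "ip-10-0-1-23.ec2.internal"  # hypothetical node name

    # Node conditions (Ready, MemoryPressure, DiskPressure, ...) as reported
    # by the kubelet and the node controller.
    node = v1.read_node(NODE)
    for cond in node.status.conditions:
        print(f"{cond.type}={cond.status} ({cond.reason}), "
              f"last transition {cond.last_transition_time}")

    # Recent events recorded against the node object.
    events = v1.list_event_for_all_namespaces(
        field_selector=f"involvedObject.kind=Node,involvedObject.name={NODE}"
    ).items
    for ev in events:
        print(f"{ev.last_timestamp} {ev.reason}: {ev.message}")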

We believe it was either a temporary networking issue on the underlying EC2 hardware or a problem in the EKS/Kubernetes control plane or services. Because this appears to be an AWS “behind the scenes” issue, there is little we can do beyond staying up to date with EKS patches, which we already do (at least monthly) during our scheduled maintenance windows.

Lastly, we have had a task in our backlog to reach out to customers and remediate the non-HA LDAP configuration. This work has been re-prioritized: roughly 30 affected customers have already been fixed, leaving about 10 to reconfigure over the coming days. While this will not eliminate the problem entirely, it should significantly reduce downtime for those affected customers, including customers on any other clusters that may experience a similar failure in the future.