Background
Yesterday morning, I pushed a prod release to our Spring Boot service that handles order packing and pickup operations. It was a fairly large release, containing mostly tech debt fixes and an upgrade to Spring Boot 4, along with upgrades to most of our other dependencies. During follow-up monitoring after the release, things looked OK. But a short time later, we started noticing app instance restarts.
Investigation
I started with an internal tool that uses Copilot to analyze the production AKS pod logs. It revealed that our instances were restarting due to out-of-memory (OOM) errors. It recommended bumping our total memory config from 1250 MB to 1500 MB, claiming that doing so would automatically scale the metaspace needed to load the additional classes required by Spring Boot 4. I pushed this hotfix to prod, but it did not solve the OOM issues. We made the call to roll back the release and take a step back to investigate.
Digging In
The internal tool that analyzes the AKS pod logs doesn’t tie into source code that you may have on your local machine, so its ability to fully determine root cause is limited. Since we are not yet allowed to use APIs to access Kibana in our organization, and since we had reverted the service in prod (which destroys all the previous AKS pod logs), I had to manually export the log data from Kibana to a CSV. I was then able to provide that to Copilot on my machine, while pointing it to the source code. It determined that there were two causes of the restart loop:
- The out-of-memory (OOM) errors mentioned previously, related to the metaspace
- A conflict between the Quartz version upgrade and how we deploy the service: some pod instances run with Kafka consumers enabled, and others run with them disabled. The Kafka-consumer-enabled instances are the only ones we allow to run Quartz, so that we don’t weigh down the API instances that must maintain low latency for user requests. It found that the non-Quartz instances, which have Quartz disabled in our settings, were not starting Quartz but were still attempting job cleanup, and that this was a behavior of the newer version of Quartz.
It was obviously correct about point 1, since we could see it clearly in the logs, and the internal log tool’s summary showed a few dozen of those restarts. But restarts caused by Quartz are something we had observed with previous versions of Quartz, and they only happen occasionally, and only to some instances during startup, never in a constant loop. In fact, we used Resilience4j’s retry mechanism to account for this.
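To illustrate why an occasional startup failure does not turn into a restart loop once retries are in place, here is a minimal sketch of the retry idea in plain Java. This is not the Resilience4j API and not our actual code. The class name StartupRetry, the retry helper, and the simulated "scheduler not ready" failure are all hypothetical stand-ins.

```java
import java.util.concurrent.Callable;

public class StartupRetry {

    /**
     * Calls the action up to maxAttempts times, sleeping waitMillis between
     * failed attempts. Rethrows the last failure if every attempt fails.
     */
    public static <T> T retry(Callable<T> action, int maxAttempts, long waitMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical startup step that fails twice, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("scheduler not ready");
            }
            return "started";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A transient failure is absorbed within the retry budget, so the instance comes up without a restart. Only a failure that persists through every attempt would surface, which matches what we had seen in practice: occasional hiccups, never a constant loop.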
After I pushed back on point 2, it of course agreed with me (typical LLM behavior). It then found that the Quartz issue was not repetitive and suggested the fix for point 1 only. This was hallucination number 1. The second hallucination came after I told Copilot to create a JIRA bug ticket to track this work. It recommended adding a specific JAVA_OPTS variable to explicitly set our metaspace size to 300 MB, which should account for the additional classes being loaded into memory. But after I told it to review our nonprod logs (since we aren’t seeing the restarts there), it changed its mind and instead said we should use a different JAVA_OPTS argument to set a maximum limit on classes loaded, and that this should work more effectively.
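One way to sanity-check a metaspace recommendation like this, rather than taking the number on faith, is to ask the JVM what it is actually using. The sketch below reads the memory pool that HotSpot JVMs expose under the name "Metaspace" through the standard java.lang.management API. The class and method names (MetaspaceReport, metaspacePool, describe) are hypothetical, and the pool name is an assumption that may not hold on non-HotSpot JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

public class MetaspaceReport {

    /** Finds the memory pool named "Metaspace", if this JVM exposes one under that name. */
    public static Optional<MemoryPoolMXBean> metaspacePool() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(pool -> pool.getName().equals("Metaspace"))
                .findFirst();
    }

    /** Converts bytes to whole mebibytes, rounding down. */
    public static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    /** Formats a usage snapshot; a negative max means no limit is defined. */
    public static String describe(MemoryUsage usage) {
        String max = usage.getMax() < 0 ? "unbounded" : toMb(usage.getMax()) + " MB";
        return "used=" + toMb(usage.getUsed()) + " MB"
                + ", committed=" + toMb(usage.getCommitted()) + " MB"
                + ", max=" + max;
    }

    public static void main(String[] args) {
        System.out.println(metaspacePool()
                .map(pool -> describe(pool.getUsage()))
                .orElse("no pool named Metaspace on this JVM"));
    }
}
```

Logging this shortly after startup in both prod and nonprod would show whether the pool is really near a limit and whether a limit is set at all. That is the kind of evidence I would want before accepting either of the suggested JAVA_OPTS changes.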
The point of all this is that I think many engineers using AI to code right now worry that they will no longer have jobs because AI is going to replace them. But I have encountered many situations like this one, where the only way to use AI effectively is to still be an engineer with the ability to question things, spot problems, and push back on the LLM when your gut tells you to. We’ve already seen large companies (such as Amazon) suffer outages due to engineers pushing code that was generated and reviewed by AI alone. It seems like right now we have two large camps of engineers: those who love using AI to code and praise it up and down without question, and those who are alarmists and are trying to resist it. I think the prudent stance is somewhere in the middle, which is where I stand. I truly believe that engineers with that mindset will be the most successful: the ones who view AI as a great tool, but not something to which we should relinquish all control.