In Soluto we use a microservices architecture, whereby each service can ship its logs to our third party logs provider and query them to create alerts and dashboards and to improve visibility.
As logs multiplied, the issues started…
The continuous growth of our company has presented a new challenge to our logs collection system. The weekly addition of new services to our backend architecture led to an exponential growth of our log volume. All good up to here. However, this reached a point where our logs quota was reached and our logs shipping was blocked by our provider on a daily basis.
This meant no logs collection, no visibility to our services, and a lot of hassle in getting things up and running again, every single day.
When the logs quota is reached, the on-call company engineer receives an alert, and needs to figure out what caused the writing of so many logs? In most cases, the reason would be that a service is ‘acting up’ or a database is down. All services that use this resource would then start writing error logs such as “Can’t write to database”, “Connection timed out” etc’.
For example, if a service handles 1000 messages from a queue, and for some reason, cannot write them to the database, it’ll end up writing 1000 repetitive error messages about the same issue – not that useful.
At this point the on-call company engineer has two possible solutions: either to increase the logs quota or to find the offender microservice and fix it to go back to a normal state.
Increasing the logs quota never helped us for more than a few days.
As established in the Parkinsons’s Law, increasing the number of resources will only lead to people maximizing their use of it. Therefore raising the quota will encourage inefficient use of resources and the same problem will recur with a bigger bill as a bonus.
We therefore usually opt for the second solution: the on-call engineer contacts the relevant service owner who identifies the source of the problem, fix it and reduces the logs count – profit. Unfortunately though, finding the offender microservice and asking the responsible team to reduce the logs is not a straightforward task, and causes friction in the team and frustration for the engineer. It is also heavy in resources, as it requires both the on-call engineer and the additional team to intervene.
This is how dealing with such an incident would look like in our internal communication channels:
This issue was a recurring one in Soluto for several years, but not anymore since Kubernetes came along.
Since the introduction of Kubernetes two years ago, our services are gradually migrated onto the new orchestrator, and new services are exclusively created on it.
In Kubernetes, log shipping is greatly simplified thanks to the fact that the services write their logs to stdout while Fluentd collects them behind the scenes and sends them to our third party logs provider. Having the log shipping placed in a centralized service has provided us much better control over it.
The adoption of Kubernetes and Fluentd and our constant effort to reduce toil have provided an opportunity to reevaluate the logs quota issue. We wanted to implement deduping to handle repetitive logs. After performing analysis to ensure we choose the best path for solving the problem. We found that Fluentd could offer a solution with throttling.
We then discovered fluentd-plugin-throttle, a cool fluentd plugin that allows logs throttling by configurable groups. It felt promising but one key feature that was missing was the capacity to ignore some logs (in our case, logs of type “Information” or “Debug” are sent to a different logs storage, and shouldn’t be throttled). We’ve added this capability to the plugin, opened a PR (which hasn’t been merged yet), installed it from the local fork, and defined a configuration to filter logs by pod groups.
Here is a snippet of our Fluentd configuration:
<source> @type tail read_from_head true tag "kubernetes.*" </source> <filter kubernetes.**> @type kubernetes_metadata </filter> <filter kubernetes.**> @type throttle group_key kubernetes.pod_name group_bucket_period_s 60 group_bucket_limit 60 <ignore> key level regex /^([Ii]nfo|[Ii]nformation|[Dd]ebug)$/ </ignore> </filter>
As you can see, we collect all the logs from the container files, enrich them with the kubernetes_metadata plugin, then use the throttle plugin to define the logs grouping. Finally, we set the throttling to 60 logs/min.
We’re enjoying great results who took effect immediately: logs are being throttled at all times, without the developers even noticing. The logs often contain a lot of redundant information, and 60 messages per minute are enough in order to understand the root cause of any issue that may arise and to take appropriate action.
Putting it simply: we found a solution to our problem, the on-call engineer is no longer getting annoyed by those alerts, other company engineers get the flag on time rather than when the service is already blocked and well…. Everyone is happy.
You’re welcome to try it as well! Happy throttling!