Finding Ideal JVM Thread Pool Size With Kubernetes and Docker

In this post, I explain why we use JVM thread pools, what are the parameters that we should take into account when we decide a thread pool and it’s size. And how these parameters are affected when we run our application with Docker and Kubernetes.

Why do we use thread pools?

We use thread pools to align our system workload with system resources efficiently. This system workload should be tasks that can run independently, for example, every Http request to a web application can be in this category, we can process every request without thinking another Http request.

We expect both a good throughput and good responsiveness from our applications. To achieve this, firstly we should divide our application work into independent tasks and then we should run these tasks with efficient usage of system resources like CPU, RAM (Utilization).

With the usage of thread pools, we aim to run these individual tasks concurrently with using system resources efficiently.

What kind of thread pool types does Java has?

Java’s Executors class provides some different kind of thread pools;

We don’t need a pool size for newSingleThreadExecutor(), because there is only 1 thread for this pool, therefore every task that we submit to this thread pool is working sequentially, there is no concurrency and if we have tasks that can run independently, this configuration is not good in terms of our application throughput and responsiveness.

We also don’t need a pool size for newCachedThreadPool(), because this pool creates a new thread or uses an existing thread for each task that’s submitted to the pool. This pool usage can be meaningful for our independent tasks for some scenarios, such as if our tasks are short living tasks. If our tasks are not short living, using this kind of thread pool can cause creating many threads on the application. If we create threads above than a threshold then we can’t use CPU resource efficiently because CPU spends most of its time with thread or context switching other than the real work. And again this causes a decrease in our application responsiveness and throughput.

We need a thread pool size for the newFixedThreadPool​(int nThreads), and we should select ideal size to increase our application throughput and responsiveness(I assume that we have tasks that can be run independently). The point is selecting not too much and not too small. The former one causes CPU to spend too much time for thread switching instead of a real task, also causes too much memory usage problems, the latter one causes idle CPU while we have tasks that should be processed.

How can we find an ideal thread pool size on a bare metal server?

Brian Goetz who is one of the architects of JVM has a book called Java Concurrency In Practice. The book has a chapter about thread pools, he expresses sizing thread pools in a very clear way and suggests a formula to find an ideal size for our applications. Finding a thread pool size completely depends on our application nature and system resources that our application runs on.

The number of threads that can be run simultaneously is bounded with the number of CPU cores in a system. If you have a 4-core CPU, this means that you can have only 4 threads that can run simultaneously. If you have a 1-core CPU, this means that every task (so thread) that runs on this system is sequentially executed. But having 1-core CPU doesn’t mean that you can’t design concurrent tasks that run on this system. I’ll explain why below.

If our application’s tasks are doing many IO operations, such as database access, file system access, networking, etc. this means that our threads will have some wait time for these IO operations and while they waiting for IO operations they don’t use CPU resources. If we select thread pool size that’s the same with system CPU count for this kind of application, this means that we don’t use our CPU resources efficiently, because when some threads pass to wait state, there will be no another thread that claims this idle CPU resource. And this thread pool size will be too small, you should select a larger thread pool size.

If our application’s tasks are doing compute-intensive operations, this means that they use CPU resources without waiting for any external resources. If we select thread pool size that’s much greater than the system CPU count for this kind of application, this means that we don’t use our CPU resources efficiently. Because there will be many threads are waiting for CPU resources, and CPU will try to make many switch between these threads and this will cause wasting CPU time with too much thread switch operation. And this thread pool size will be too much, you should select a smaller thread pool size.

Brian Goetz’s formula to find the optimal thread pool size for keeping the processor at the desired utilization is:

Number of Threads = Number of Available CPU Cores * Target CPU Utilization * (1 + Wait Time / Compute Time)

Number of Available CPU Cores can be found with the call Runtime.getRuntime().availableProcessors().

Target CPU Utilization can have a value in a range [0..1]. When we select this utilization value we should think about the other processes that run on our system and use the same CPU resources with our application. For example, JVM threads also use the same CPU resources with the application.

(Wait Time / Compute Time) value is dependent on our application’s task’s nature. We should run some profiling/instrumentation tests to find these values. Also, tracing tools(like datadog) have some functionalities to show these kinds of data, I mean you can see the distribution of task’s completion time.

Let’s say we have 24-core CPU, we aim %80 CPU utilization and we have compute-intensive tasks so our (Wait Time / Compute Time) = 110. Then we find our thread pool size as (24 * 0.8 * (1 + 110)) = 21.

If we have IO heavy tasks and our (Wait Time / Compute Time) = 101. Then we find our thread pool size as (24 * 0.8 * (1 + 101)) = 211.

What If we have a Docker container runs on a Kubernetes cluster?

If we run our application as a docker container on a Kubernetes cluster then finding available CPU cores becomes quite complicated.

In a Kubernetes cluster, there can be many docker containers that run on a node(host). If the Runtime.getRuntime().availableProcessors() call returns the number of CPU cores of a node, then we can have some problem. Because the node’s CPU cores can be shared among many different applications/containers. This causes us to use too much thread pool size for our application’s thread pool size.

Finding CPU core count wrongly in a containerized environment is also a very important problem for the JVM itself because it also sets it’s thread pool size according to this information, to manage garbage collection threads, parallel stream processing threads, etc.

JVM’s solution to this problem is using container configuration and limits while determining the CPU core count for an application that runs on a containerized environment.

To understand how JVM’s solution fits into Docker and Kubernetes environment, let’s try to understand how they manage CPU resources of containers and pods.

How does Docker manage CPU resource usage of containers?

For Docker;

By default, each container’s access to the host machine’s CPU cycles is unlimited. You can set various constraints to limit a given container’s access to the host machine’s CPU cycles.

Docker has --cpu-shares and --cpu-quota (or --cpus for Docker 1.13 and higher) parameters for the docker run command to configure or/and limit a container’s CPU resource usage.

You can check the details of these parameters from the Docker web site, for a summary;

--cpu-shares is a relative CPU request weight parameter for a container and is used to distribute available CPU cycles between containers. Its default value is 1024 and its minimum value is 2.

This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need.

So --cpu-shares doesn’t put a limit to use of CPU by containers, it just tries to distribute CPU resources in a fair way when containers want to use %100 of CPU.

--cpu-quota is a hard limit for CPU resource usage of a container. By using --cpu-quota you can determine the total amount of CPU time that a container can use every --cpu-period. Default --cpu-period is 100ms. If a container exceeds this limit, it will be throttled. You can set --cpu-quota in a microsecond format.

How does Kubernetes manage CPU resource usage of containers?

Kubernetes has two parameters to manage CPU resource usage of containers;

spec.containers[].resources.requests.cpu is corresponding parameter of Kubernetes for the Docker --cpu-shares parameter.

Requested CPU parameter is used by the Kubernetes scheduler to schedule a pod to an appropriate node, it means a node that has requested CPU resource as allocatable CPU resource.

Requested CPU parameter isn’t only a relative CPU request weight value for Kubernetes. A node should have the requested CPU resource as an allocatable CPU resource to accept the pod. A pod’s total CPU request is guaranteed by the Kubernetes scheduler when it selects a node for the pod.

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

Kubernetes uses millicore(or millicpu) resolution for CPU, 1000m equal to 1 CPU core.

spec.containers[].resources.limits.cpu is corresponding parameter of Kubernetes for the Docker --cpu-quota parameter.

Kubernetes scheduler doesn’t use this parameter to schedule a pod to a node.

Kubernetes CPU resource usage parameters are not directly mapped to Docker’s parameters, Kubernetes makes some conversion.

How does JVM find CPU core count on a containerized environment?

JVM container support was enabled by default with the JDK10. You can disable it by setting -XX:-UseContainerSupport parameter to false.

Number of CPU core count is found with the --cpu-quota parameter as (--cpu-quota / --cpu-period). If --cpu-shares parameter is set then the number of CPU core count is found as (--cpu-shares / 1024). 1024 is taken as the default and standard unit for --cpu-shares.

If both --cpu-shares and --cpu-quota are set then a minimum of two number of CPU core counts is taken.

Also a flag -XX:ActiveProcessorCount was added with JDK10 to override any computed CPU core count or directly queried CPU core count from the operating system.

With the JDK11 another flag is added to enable people to prefer --cpu-quota instead of --cpu-shares when both of them are set. This flag is -XX:+PreferContainerQuotaForCPUCount and it’s true by default.

Tradeoffs when using Kubernetes

Which parameter do we prefer to find CPU core count on a containerized environment that is orchestrated with Kubernetes? CPU shares or CPU quota?

You may not want to set CPU quota to utilize Kubernetes node resources as much as possible, because CPU quota is a hard limit and your container can’t take more CPU time than it’s CPU quota. But if you don’t set CPU quota, and if there is enough CPU on the node, your container can take whatever it wants as a CPU time. If there is not enough CPU time then CPU is distributed according to CPU shares values of containers. In this case, JVM will try to find a CPU core count with CPU shares value of the container.

If you set CPU share value to a small value then you will get a small number of CPU core count, this means that small thread pool size, for JVM and your application. If you set CPU share value to a larger number, then you can get a larger CPU core count, so thread pool size. But in this case, you can have scheduling problems at Kubernetes’s side because Kubernetes uses CPU shares to calculate the available allocatable CPU resource of a node. You can end up with a case that there are too few pods on a node, and they may not use CPU. Node can’t accept additional pods because scheduled pods require too many CPU and Kubernetes guarantee requested CPU to pods even they don’t use all of this requested CPU.

It depends on the nature of your application but for a normal web application, with some benchmarking, preferring CPU quota makes sense to me. If we need more scale then using more replica can be a solution.

References

Comments

comments powered by Disqus