
Load Testing for each Microservice


We load test our system to evaluate its performance under real-life load conditions and to test its limits.

We have implemented load tests for all of our microservices, exercising each exposed API route of the application.

These API routes are as follows:

  1. History Route (Registry and DB Service)
  2. Landing Route (Frontend Service)
  3. Login Route (Auth Service)
  4. Metadata Route (Metadata Service)
  5. Widget Route (Data Service)

For each microservice, we primarily observe the “Aggregate Graph” and “Response Time Graph” results to understand the throughput and error rate during the load test. We repeat this analysis after incrementally increasing the replica count to 1, 3, and 5 instances per service in each service’s Kubernetes Deployment configuration.

Where needed, we have included additional graphs to analyze a service’s performance.
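For reference, scaling a service between runs means changing the `replicas` field in its Kubernetes Deployment manifest. The sketch below shows the general shape of such a manifest; the service name, image, and port are placeholders, not our actual configuration.

```yaml
# deployment.yaml (illustrative sketch; names and values are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metadata-service                 # hypothetical service name
spec:
  replicas: 3                            # varied across test runs: 1, 3, 5
  selector:
    matchLabels:
      app: metadata-service
  template:
    metadata:
      labels:
        app: metadata-service
    spec:
      containers:
        - name: metadata-service
          image: metadata-service:latest # placeholder image
          ports:
            - containerPort: 8080        # placeholder port
```

A running deployment can also be rescaled without editing the manifest, for example `kubectl scale deployment metadata-service --replicas=5` (service name hypothetical).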

History (Registry + DB Service)

Replica Count 1

250,000 requests with 0% error rate and low response time.

Replica Count 3

Load: 50,000 samples (2500 users with a loop count of 200).
Throughput increased to 501.7/sec with a low error rate of 0.04.

Replica Count 5

Load balanced among 3 instances of each service.
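The spreading of requests across pod replicas is handled by the Kubernetes Service in front of each Deployment, which forwards traffic to any ready pod matching its selector. A minimal sketch, assuming placeholder names and ports:

```yaml
# service.yaml (illustrative sketch; names and ports are placeholders)
apiVersion: v1
kind: Service
metadata:
  name: history-service      # hypothetical name
spec:
  selector:
    app: history-service     # matches the pod labels set in the Deployment
  ports:
    - port: 80               # port exposed inside the cluster
      targetPort: 8080       # placeholder container port
```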

Landing (UI) Service

Replica Count 1

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 3

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 5

(Aggregate Graph and Response Time Graph screenshots)

Login (Auth + User Service)

Replica Count 1

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 3

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 5

(Aggregate Graph and Response Time Graph screenshots)

Metadata (Metadata Service)

Replica Count 1

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 3

(Aggregate Graph and Response Time Graph screenshots)

Replica Count 5

(Aggregate Graph and Response Time Graph screenshots)

Widget (Data Service)

Replica Count 1

We tested our data service with a replica count of 1 and a load of 50 users requesting the graph. With a single instance, we observed that the data service eventually becomes unavailable (after 28 requests in this case). This happens because the data service maxes out the RAM allocated to its pod (5000Mi in our case) and restarts, causing subsequent requests to this pod to be rejected.
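The 5000Mi figure is the memory limit set on the data service’s container; when usage exceeds it, Kubernetes kills and restarts the container. A minimal sketch of how such a limit is declared in the pod spec (container name, image, and request value are placeholders):

```yaml
# Fragment of the data service Deployment pod spec (illustrative sketch)
containers:
  - name: data-service              # hypothetical container name
    image: data-service:latest      # placeholder image
    resources:
      requests:
        memory: "2000Mi"            # placeholder request
      limits:
        memory: "5000Mi"            # the limit referenced above; exceeding it
                                    # gets the container OOM-killed and restarted
```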

Response time dropped as a result of the pod restarting.

Replica Count 3

Increasing the replica count to 3 improved the throughput from 15.4/min to 20.2/min and decreased the error rate significantly, from 44% to 6%, because a single pod restarting no longer makes the whole service unavailable.

Response time remained approximately the same throughout execution.

Replica Count 5

(load test result screenshots)
