What is load balancing?
Load balancing is an important concept in distributed systems. It is used to distribute incoming traffic across multiple servers. Let us understand this through an analogy.
Analogy of restaurant
Think of a restaurant that has only one counter for taking orders and only one staff member to prepare the meals. If this were a popular restaurant, there would be a long queue of customers waiting a long time for their meals. How would you scale your restaurant operations? You would need to add more staff to make more meals in the kitchen. You would then need an intelligent mechanism for assigning meal orders to each staff member so that none of them is overwhelmed with orders and none is idle. This ensures optimum utilization of the kitchen staff. Once you achieve this, meals are delivered faster, customers are happier, and you have scaled your restaurant operations.
Now, in the software world you can apply the same logic. If you have just one server, it can serve only a limited number of requests. As the number of requests increases, the server starts getting overwhelmed and you suffer from long wait times and dropped requests. To scale, you add more servers, and when you do, you want to make sure that incoming requests are distributed among them in such a manner that no server gets overwhelmed. This is where load balancing comes into the picture.
Load balancing is a method used to distribute incoming network traffic across multiple servers to ensure that no single server becomes overwhelmed by the load and you are able to scale your operations. The primary goal of load balancing is to improve the availability, reliability, and performance of a service or application by ensuring that all servers are used optimally.
Key concepts around load balancing
- Distribution of workload - The primary aspect of load balancing is distributing incoming traffic / workload among multiple servers.
- Load balancers - The primary component that achieves this distribution of traffic / workload is the load balancer. There are many different types of load balancers. We will go through them in detail in the later sections.
- Load balancing algorithms - Load balancing of traffic requires an algorithm. Over the years many different types of load balancing algorithms have evolved. We will take a close look at the evolution of these algorithms in the later sections.
Why do we need load balancing
- Scalability: A single server can only handle a limited amount of traffic. Load balancers allow you to distribute traffic across multiple servers, enabling horizontal scaling.
- Availability: By distributing traffic across multiple servers, load balancers prevent a single point of failure. If one server goes down, the load balancer can route traffic to healthy servers.
- Efficiency: Load balancers can optimize resource utilization by distributing requests based on the current load or server health.
- Performance: Load balancers reduce response times by routing requests to the least loaded servers, enhancing user experience.
Evolution of Load Balancers
Client Server world
In the early days, as networked systems started to grow, we deployed our server applications on one central server machine, and multiple clients would connect to this server to access the business operations and data it hosted. These were the client-server days. As the number of users accessing the server increased, the server started becoming a bottleneck.

Hardware load balancers
As the single server started getting overwhelmed by the increasing number of clients connecting to it, there was a need for multiple servers to serve the clients. One of the earliest solutions was the hardware load balancer, which sat between the clients and the servers and distributed the requests among two or more servers. This allowed more clients to be served. These hardware load balancers were simplistic in that they distributed requests to the servers in a round-robin fashion.

F5 Networks and Cisco LocalDirector were among the earliest hardware load balancers.
Software load balancers
Systems continued to scale, but hardware-based load balancers were expensive and inflexible. This spurred the rise of software load balancers as an alternative. These software load balancers allowed for more adaptable and scalable solutions, making it easier for organizations to implement load balancing without investing in expensive hardware.

This is when software load balancers like HAProxy and NGINX emerged.
Layer 4 load balancers
Layer 4 load balancers were introduced alongside the hardware load balancers to distribute traffic based on transport-layer information. Their routing decisions are based on information defined at Layer 4, the transport layer of the OSI (Open Systems Interconnection) model. This means they can route incoming traffic based on the IP addresses and TCP/UDP ports of a connection, without inspecting the content of the request.
Layer 4 load balancers are useful where scalability and speed are paramount and routing decisions can be made based on the IP address, protocol and port.
AWS NLB, NGINX and F5 Networks BIG-IP LTM support Layer 4 load balancing.
Layer 7 load balancers
As applications evolved into multi-tier systems where different servers handled different parts of the application, it was no longer enough to distribute traffic based on Layer 4 information (IP address, protocol and port). Not all requests were the same, and some needed to be directed to specific servers based on the content of the request. This is when load balancers evolved to operate at Layer 7. These load balancers could inspect the content of the request (such as URLs, HTTP headers or cookies) to decide the target server. It was now possible to direct requests for static content to one type of server and requests for dynamic content to, say, application servers.
Because these load balancers can inspect the request content, they enable advanced features like SSL termination, content-based routing, and URL rewriting.
Layer 7 load balancers are often used in microservices architectures, where they can route traffic to specific services based on the request path or method, enforce security policies, or manage API traffic.
Both NGINX and F5 Networks added Layer 7 load balancing features.
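The content-based routing described in this section can be sketched in a few lines. This is only an illustration: the path prefixes, the pool names and the `choose_pool` helper are hypothetical, and a real Layer 7 load balancer would typically also look at headers and cookies.

```python
# Hypothetical routing table: URL path prefix -> name of a backend pool.
ROUTES = [
    ("/static/", "static-pool"),  # images, CSS, JavaScript
    ("/api/", "app-pool"),        # dynamic requests for application servers
]
DEFAULT_POOL = "web-pool"         # assumed fallback when no prefix matches


def choose_pool(path: str) -> str:
    """Return the backend pool for a request path (first matching prefix wins)."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


static_target = choose_pool("/static/logo.png")  # "static-pool"
api_target = choose_pool("/api/orders/42")       # "app-pool"
other_target = choose_pool("/about")             # "web-pool"
```

A Layer 4 load balancer could not make this decision, because the request path is only visible once the HTTP request itself is inspected.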
Global load balancers
As internet traffic grew and the user base widened, users from all across the globe needed to access applications. A user could be located anywhere in the world, and reaching servers on a different continent introduced network latency. The problem became how to reduce this latency, which created the need to direct traffic to different data centers based on the user's geographic location. To solve this, global load balancers and DNS-based load balancers were introduced. These load balancers could detect the user's geographic location and direct the traffic to the closest data center, reducing latency and improving performance.
This introduced interesting problems, such as how to keep data synchronized between data centers, along with the complexities of managing geo-replication and eventual consistency.
Akamai's and Google's global load balancing offerings both help direct traffic, based on the user's geographic location, to the nearest data center or cluster of servers.
Elastic load balancers
As cloud computing gained traction, responding to demand in real time became important. Cloud computing brought real-time elasticity in scaling servers, which required load balancers to cope with servers being added and removed on the fly, to monitor those servers continuously, and to integrate with the cloud platform. To meet this need, cloud providers introduced Elastic Load Balancers, designed to scale automatically with traffic demand, a key feature for cloud-based applications.
AWS introduced Elastic Load Balancing (ELB).
Service Mesh & Internal load balancers
Next came the popularity of the microservices architecture, which evolved the load balancing landscape further. Microservices are small, independent services that need to communicate with each other over the internal network within the same data center or virtual private cloud. Traditional load balancers were too centralized for service-to-service communication. Service meshes and internal load balancers were introduced to handle traffic between microservices. They operate at a finer granularity, balancing traffic between individual services within the cluster.
Service Mesh architectures introduced some new complexities in networking and observability, requiring sophisticated tools for management.
Istio, which uses the Envoy proxy as its data plane, was among the first widely used service mesh platforms.
Comparison of the Load balancer types
| Type | Layer | Use Case | Example Providers |
|---|---|---|---|
| Hardware Load Balancer | Layer 4 / Layer 7 (dedicated appliance) | High-performance network traffic distribution in large enterprises | F5 Networks, Citrix NetScaler |
| Software Load Balancer | Layer 4 / Layer 7 (software on standard servers) | Flexibility and cost-effectiveness in cloud/on-premise environments | HAProxy, NGINX, Traefik |
| Layer 4 Load Balancer | Transport Layer | Load balancing TCP/UDP traffic, DNS, VoIP traffic | AWS NLB, Azure Load Balancer |
| Layer 7 Load Balancer | Application Layer | Web applications needing content-based routing (e.g., HTTP/S, REST APIs) | AWS ALB, NGINX, HAProxy |
| Cloud Load Balancer | Layer 4 / Layer 7 (managed service) | Managed, scalable traffic distribution in the cloud | AWS ELB, Google Cloud Load Balancer |
| Global Server Load Balancer | DNS / global routing | Multi-region disaster recovery, global traffic distribution | AWS Route 53, Cloudflare |
| Internal Load Balancer | Layer 4 / Layer 7 (private network) | Distributing traffic between services inside a data center or virtual private cloud | AWS ALB, F5 BIG-IP |
Load Balancer vs API Gateway
A load balancer distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed. An API Gateway, by contrast, is a software layer that acts as an intermediary between a client and a collection of backend services. It provides a single point of entry for all client interactions with a microservices-based application.
API Gateways are responsible for handling API traffic, including:
- Routing API requests dynamically to the appropriate backend services based on service discovery.
- Applying security policies like authentication, authorization, and encryption
- Monitoring all API traffic
- Performing tasks like rate limiting, caching, and request/response transformation
- Validating that client requests contain the required information in the correct format
- Combining responses from multiple services into a single response for the client.
- Facilitating protocol translations
Load Balancer vs Proxies
Load balancers and proxies are networking components that play an important role in routing traffic between clients and servers, but they serve different purposes and operate in different ways.
Load Balancers - are positioned between the client and multiple backend servers. A load balancer directs traffic to one of the backend servers based on the load balancing algorithm and operates transparently, meaning clients may not be aware of its presence. Its primary purpose is to distribute client requests across multiple servers to balance the load.
Proxy - on the other hand, is an intermediary server that sits between users and servers, with its primary focus being security, caching or traffic filtering. Proxies are of two types - Forward Proxy and Reverse Proxy.
Forward Proxy
A Forward Proxy acts on behalf of clients, masking their identity and routing the requests on behalf of them.
- It is often used to filter traffic, enforce policies, or provide anonymity.
- A forward proxy can hide the client's IP address
- Traffic is routed from clients to external servers.
Reverse Proxy
A Reverse Proxy acts on behalf of servers, handling client requests on behalf of one or more servers.
- It often provides caching, SSL termination, or application firewall capabilities.
- A reverse proxy protects servers by hiding their internal structure and routing requests to the correct server.
- Traffic is routed from clients to internal servers.
Evolution of load balancing algorithms
We will now look at how the challenges of growing traffic, variations in server capacity, and changes in application architecture have shaped the evolution of load balancing algorithms. Each algorithm reflects an improvement that addresses a specific need in distributing workloads efficiently. We will look at some of the key algorithms, the problem each one solved, and the context in which it emerged.
Round-Robin
In the early days of web applications, systems began deploying multiple servers to handle growing traffic. A simple and effective way to distribute traffic was needed, and the Round-Robin algorithm was one of the first techniques used.

In the Round-Robin algorithm the servers are arranged in a fixed sequence, like a circle, and each incoming request is sent to the next server in the sequence. Once all servers have received a request, the cycle repeats. This algorithm is simple to implement, but it ignores both server capacity and current load.
Let's understand this through the analogy of a restaurant. If a restaurant has three chefs - A, B, and C, then we can assign orders to the chefs one after the other as the orders come in, then cycle back and keep doing this over and over. So the first order goes to chef A, the second to chef B, the third to chef C, and then, cycling back, the fourth goes to chef A, the fifth to chef B and so on. This is called a Round-Robin way of allocating orders to the three chefs.
It is fine to use Round-Robin approach when the traffic is fairly consistent and all the servers are also similar in performance.
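The rotation described in this section can be sketched as follows. This is a minimal illustration rather than a production implementation; the `RoundRobinBalancer` class and the chef names are made up for the example.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed, repeating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)  # endless A, B, C, A, B, C, ...

    def next_server(self):
        return next(self._rotation)


balancer = RoundRobinBalancer(["chef-A", "chef-B", "chef-C"])
assignments = [balancer.next_server() for _ in range(5)]
# assignments == ["chef-A", "chef-B", "chef-C", "chef-A", "chef-B"]
```

Note that the balancer keeps no information about the servers beyond their order, which is exactly why it cannot react to a slow or overloaded server.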
Weighted Round-Robin
In most cases different servers are procured at different times and with different configurations. Some may have more processing power while others might have more optimized storage or memory. This results in servers having differing capabilities for processing incoming requests. To account for this difference in server processing capabilities, the regular Round-Robin algorithm was tweaked by introducing a weight for each server. This is called the Weighted Round-Robin approach.

So, in the Weighted Round-Robin approach each server is assigned a weight based on its capacity or capability to handle requests. Servers with higher weights receive more requests than servers with lower weights. This makes it more efficient than pure Round-Robin for heterogeneous server environments. It still doesn’t account for real-time changes in server performance or request complexity.
Let's expand this to our restaurant. If our chefs A, B, and C have different levels of experience, we can assign more orders to the most experienced chef, say A, and fewer to B and C. This requires a bit more intelligence during order assignment, as we now take each chef's experience into account. So the first order goes to chef A, the second order again goes to chef A, the third to chef B, and the fourth to chef C. Cycling back, the fifth and sixth orders go to chef A, the seventh to chef B, the eighth to chef C, and the whole cycle repeats. Here chef A keeps getting more orders than chefs B and C because of his greater experience. This is the Weighted Round-Robin way of allocating orders to three chefs with differing experience levels.
This is ideal when server capacities are known in advance and differ from each other, but the load is fairly stable.
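A simple way to sketch Weighted Round-Robin is to repeat each server in the rotation as many times as its weight. The weights below (2 for chef A, 1 each for B and C) are assumptions chosen to reproduce the order sequence from the restaurant example, and `weighted_round_robin` is a hypothetical helper.

```python
from itertools import cycle


def weighted_round_robin(weights):
    """Return an endless iterator that repeats each server `weight` times per cycle."""
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    if not schedule:
        raise ValueError("at least one server needs a positive weight")
    return cycle(schedule)


picker = weighted_round_robin({"chef-A": 2, "chef-B": 1, "chef-C": 1})
first_eight = [next(picker) for _ in range(8)]
# first_eight == ["chef-A", "chef-A", "chef-B", "chef-C"] * 2
```

This naive expansion sends a heavy server its requests in a burst. Some implementations interleave the picks more evenly, but the share each server receives per cycle is the same.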
Least Connections
So far we have been able to take server capability or performance into account when adjusting the traffic distribution among the servers. However, not every request is the same: some requests require a longer processing time while others take less. This difference introduces a varying load on the servers. To account for this varying load, the Least Connections algorithm was introduced.

In the Least Connections algorithm we track the build-up of connections / pending requests on each server and adjust our distribution of load based on this factor. We direct each new incoming request to the server that has the fewest active connections / pending requests. This allows us to distribute traffic more dynamically, based on real-time server load.
Let us expand this to our restaurant analogy. In our restaurant not all orders are for the same thing. They may vary among pasta, sandwich, pizza and burger. As we distribute these incoming orders among our chefs, it is likely that chef A cooking pasta has more orders building up while chef B cooking sandwiches has fewer. In this case the next incoming order can be given to chef B, who has the fewest pending orders to fulfill.
This algorithm is effective where requests vary in complexity, causing some servers to accumulate more active connections than others.
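The bookkeeping behind Least Connections can be sketched as a counter per server. The class and method names here are hypothetical, and the sketch assumes the caller reports when a request finishes.

```python
class LeastConnectionsBalancer:
    """Tracks active requests per server and picks the least busy one."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when the request has completed.
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["chef-A", "chef-B"])
first = lb.acquire()    # chef-A (both idle, first listed wins)
second = lb.acquire()   # chef-B (chef-A already has one order)
lb.release(second)      # chef-B finishes the quick sandwich order
third = lb.acquire()    # chef-B again, chef-A is still busy with pasta
```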
Least Response Time
The Least Connections approach does not account for the processing time or resource demands of individual requests (e.g., long-running queries vs. simple requests). A server could have fewer active connections but still be slow to respond to requests due to resource exhaustion or heavy CPU load. To account for this, we can instead look at the response times of these servers and adjust our traffic distribution accordingly. This is what the Least Response Time algorithm does.

So in the Least Response Time algorithm we track the amount of time each server takes to respond to requests. The server that is taking the least time to respond gets the next incoming request. Looking at the number of active connections together with the response time allows us to distribute traffic among the servers more accurately.
Looking at our restaurant analogy, chef A (pasta) and chef B (pizza) might have a similar build-up of orders, but if chef A is known to fulfill his orders faster than chef B, we can line up the next incoming order with chef A.
Least Response Time improves upon Least Connections and distributes traffic more accurately by factoring in both the number of active connections and the response times of the servers. However, it is slightly more complex to implement.
This is useful where server response times vary due to varying workloads and where quick response time is critical.
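One possible way to combine the two signals is to score each server as (active connections + 1) multiplied by its average response time and pick the lowest score. This scoring formula, the moving-average smoothing factor and all names below are assumptions made for illustration, since real load balancers differ in how they weigh the two factors.

```python
class LeastResponseTimeBalancer:
    """Picks the server with the lowest (active + 1) * average response time."""

    def __init__(self, servers, smoothing=0.5):
        self.smoothing = smoothing  # weight given to the newest measurement
        self.avg_ms = {server: 0.0 for server in servers}
        self.active = {server: 0 for server in servers}

    def _score(self, server):
        # A server with no measurements yet scores 0, so it gets tried first.
        return (self.active[server] + 1) * self.avg_ms[server]

    def acquire(self):
        server = min(self.avg_ms, key=self._score)
        self.active[server] += 1
        return server

    def release(self, server, elapsed_ms):
        self.active[server] -= 1
        previous = self.avg_ms[server]
        if previous == 0.0:
            self.avg_ms[server] = elapsed_ms
        else:
            a = self.smoothing
            self.avg_ms[server] = (1 - a) * previous + a * elapsed_ms


lb = LeastResponseTimeBalancer(["chef-A", "chef-B"])
first = lb.acquire()        # chef-A: nothing measured yet, first listed wins
lb.release(first, 200.0)    # chef-A took 200 ms
second = lb.acquire()       # chef-B: still unmeasured, so it is tried next
lb.release(second, 50.0)    # chef-B took 50 ms
third = lb.acquire()        # chef-B: score 50 beats chef-A's 200
```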
Weighted Least Connections
Another variation on Least Connections is also possible. Since we often know beforehand that certain servers have more capacity or are more performant than others, we can combine this knowledge with the number of active connections to distribute the load better. This is called the Weighted Least Connections algorithm.

Weighted Least Connections is an improvement on Least Connections algorithm. It combines the concept of weights from Weighted Round-Robin algorithm with the Least Connections Algorithm. Here we assign weights to servers based on their performance and then consider these weights along with the number of active connections to derive which server should receive the next incoming request.
Applying this to our restaurant analogy, we have chefs A, B, and C with varying experience, chef C having the most. We want chef C to take up more orders than chefs A and B because of his greater experience. As the orders come in, chefs A and B might have fewer pending orders, but we can still push the incoming order to chef C, who has more pending orders, because chef C can handle more.
So traffic is directed to the server with the fewest active connections relative to its capacity. This is effective in heterogeneous environments where some servers are more powerful than others. However, this also is more complex to configure and requires ongoing monitoring and adjustment of server weights.
This works well in environments with varying server capacities and fluctuating traffic loads.
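"Fewest active connections relative to capacity" can be expressed as the ratio of active connections to weight. The sketch below assumes that ratio as the selection rule; the numbers and the `pick_weighted_least_connections` helper are illustrative only.

```python
def pick_weighted_least_connections(active, weights):
    """Return the server with the lowest active-connections-to-weight ratio."""
    return min(active, key=lambda server: active[server] / weights[server])


# Chef C has the most pending orders but also three times the capacity.
active = {"chef-A": 2, "chef-B": 2, "chef-C": 3}
weights = {"chef-A": 1, "chef-B": 1, "chef-C": 3}

choice = pick_weighted_least_connections(active, weights)
# chef-C: 3 / 3 = 1.0, which is lower than 2 / 1 = 2.0 for chefs A and B
```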
IP Hash
Many applications need to maintain a session between client and server to function correctly. All the algorithms above focus on server capacity or traffic distribution without considering who the client is. When sessions must be maintained, every request from a client should be directed to the same server for the application to function correctly. This is often called sticky sessions. To handle such scenarios we have another algorithm called IP Hash.

The IP Hash algorithm was developed to ensure consistent routing of requests, so that all requests from a client go to the same server. In this approach the client's IP address is run through a hash function to determine which server will handle the requests from this client. This ensures that all requests coming from the same IP address are always sent to the same server.
Applying this to our restaurant analogy, specific chefs are assigned to specific customers. A customer may order multiple dishes over the course of a dinner, but all of their orders go to the same chef. This allows the chef to pay special attention to the orders coming from this diner or table.
The IP Hash algorithm ensures session persistence. It does have drawbacks: if a server fails, the clients whose IP addresses hash to that server must be remapped, and redistributing their traffic can be problematic.
This is useful when session persistence is important, such as in applications with stateful sessions.
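A minimal sketch of IP Hash is shown below. The choice of SHA-256 and the modulo step are assumptions for the example (any stable hash would do), and `pick_by_ip_hash` is a hypothetical helper. Python's built-in `hash()` is avoided because it is randomized per process for strings, which would break stickiness across restarts.

```python
import hashlib


def pick_by_ip_hash(client_ip, servers):
    """Map a client IP to a server using a stable hash of the address."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["server-a", "server-b", "server-c"]
first = pick_by_ip_hash("203.0.113.7", servers)
second = pick_by_ip_hash("203.0.113.7", servers)
# first == second: the same client always lands on the same server
```

Because the index depends on `len(servers)`, adding or removing a server changes the result for many clients at once, which is the weakness discussed above.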
Consistent Hashing
In today's environment of elastic scaling, it is very common for servers to go down and for new servers to be added. With the IP Hash algorithm, every such change alters the mapping for many clients at once, which is difficult to manage. Consistent Hashing was introduced to improve upon IP Hash by handling server failures and the addition of new servers gracefully.

In this algorithm the servers are placed on a circular ring, each at a position determined by a hash function. The hash of a client's IP address also falls somewhere on the ring, and the request is handled by the next server found moving clockwise from that position. Each server is therefore responsible for the range of hash values between the previous server and itself. When a server is removed, only the clients in its range move to the next server on the ring, and when a server is added, it takes over only part of one neighbor's range.
This is slightly difficult to explain using our restaurant analogy. However, we can think of it this way: if a chef assigned to a customer or table needs to take a break, another chef handling a different table can easily take over the orders from this customer as well, without disrupting the larger work distribution.
Consistent Hashing minimizes traffic redistribution when servers join or leave the pool, making it ideal for elastic environments like cloud computing. This is more complex to implement.
This is commonly used in distributed systems and caching solutions like CDNs, where server pools are dynamic.
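The ring can be sketched with a sorted list of hash positions. This is an illustrative sketch, not a reference implementation: the `ConsistentHashRing` class, the use of SHA-256, and the 100 virtual points per server (a common refinement to even out the ranges) are all assumptions.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Stable position on the ring for a key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        self.replicas = replicas  # virtual points per server (assumed value)
        self._ring = []           # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            bisect.insort(self._ring, (ring_position(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def lookup(self, client_ip):
        if not self._ring:
            raise LookupError("no servers on the ring")
        # (position,) sorts just before any (position, server) entry, so this
        # finds the first ring point at or after the client's position.
        index = bisect.bisect_left(self._ring, (ring_position(client_ip),))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring


ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
clients = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
before = {ip: ring.lookup(ip) for ip in clients}
ring.remove("server-c")
after = {ip: ring.lookup(ip) for ip in clients}
moved = [ip for ip in clients if before[ip] != after[ip]]
# Only clients that were on server-c appear in `moved`.
```

The final comparison shows the key property: removing a server relocates only the clients that were mapped to it, while everyone else keeps their server.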
Random with Two Choices (Power of Two Choices)
Another variation on the Least Connections algorithm avoids comparing every server in the pool. Purely random assignment is simple and needs no global view of the pool, but it can lead to suboptimal load distribution. As a compromise, two servers are picked at random from the pool and the Least Connections criterion is applied to just those two. This is called the Random with Two Choices algorithm, also known as the Power of Two Choices.
In this case two servers are randomly selected, and the request is sent to the server with fewer active connections. This is simple yet surprisingly effective at balancing loads without requiring detailed knowledge of all server states. However, it is less precise than some of the more sophisticated algorithms.
This algorithm is popular in large-scale distributed systems where simplicity and scalability are key.
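The selection step can be sketched in a few lines. The `pick_two_choices` helper and the connection counts are made up for the example, and the sketch assumes the pool has at least two servers.

```python
import random


def pick_two_choices(active, rng=random):
    """Sample two distinct servers and return the one with fewer active connections."""
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second


active = {"server-a": 1, "server-b": 2, "server-c": 9}
picks = [pick_two_choices(active) for _ in range(200)]
# server-c never appears in picks: whichever server it is paired with is less busy
```

Even though only two servers are inspected per request, the busiest server in the pool can never be chosen, which is a large part of why the technique works well in practice.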
Adaptive Load Balancing
Most of the algorithms we have seen so far do not respond well to real-time changes in system performance. To adapt better to such changes, Adaptive Load Balancing was introduced.

Adaptive Load Balancing dynamically adjusts traffic distribution based on real-time feedback from the servers. It uses machine learning or real-time monitoring to adjust the distribution of traffic based on server performance, request characteristics and the overall health of the system and servers. This approach, however, requires more computational resources and sophisticated monitoring.
It is suitable for modern cloud-native applications, microservices, and platforms that demand real-time optimization.
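As a very rough sketch of the feedback loop (without any machine learning), the balancer below recomputes each server's weight from a moving average of its observed latency and stops sending traffic to servers reported as unhealthy. The class, the 100 ms starting estimate, the smoothing factor and the inverse-latency weighting are all assumptions made for illustration.

```python
import random


class AdaptiveBalancer:
    """Weights servers by the inverse of their recently observed latency."""

    def __init__(self, servers, smoothing=0.3):
        self.smoothing = smoothing
        self.latency_ms = {s: 100.0 for s in servers}  # assumed neutral start
        self.healthy = {s: True for s in servers}

    def report(self, server, elapsed_ms, ok=True):
        # Feedback from monitoring: fold the new measurement into the average.
        a = self.smoothing
        self.latency_ms[server] = (1 - a) * self.latency_ms[server] + a * elapsed_ms
        self.healthy[server] = ok

    def weights(self):
        return {
            s: (1.0 / max(self.latency_ms[s], 0.001) if self.healthy[s] else 0.0)
            for s in self.latency_ms
        }

    def pick(self, rng=random):
        current = self.weights()
        if not any(current.values()):
            raise RuntimeError("no healthy servers available")
        return rng.choices(list(current), weights=list(current.values()), k=1)[0]


lb = AdaptiveBalancer(["server-a", "server-b"])
for _ in range(10):
    lb.report("server-a", 400.0)  # server-a is consistently slow
    lb.report("server-b", 50.0)   # server-b is consistently fast
current = lb.weights()            # server-b now has the larger weight
lb.report("server-a", 400.0, ok=False)
picks = [lb.pick() for _ in range(50)]  # all traffic goes to server-b
```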
Benefits of load balancers
Load balancers offer several benefits crucial for maintaining the performance, availability and security of modern applications and services.
Scalability
Load balancers enable horizontal scaling of applications and enable elastic scaling in cloud.
- Horizontal Scaling: Load balancers enable horizontal scaling by distributing traffic across multiple servers or services. This allows you to add or remove servers based on demand or server health, ensuring your application can handle increased traffic.
- Elastic Scaling in the Cloud: In cloud environments, load balancers can automatically adjust to traffic patterns by adding or removing instances dynamically, ensuring efficient use of resources.
High availability and redundancy
Load balancers help us to redirect traffic to healthy servers.
- Fault Tolerance: Load balancers prevent a single point of failure by distributing traffic across multiple servers. If one server fails, the load balancer can redirect traffic to other healthy servers, ensuring continuous availability.
- Health Monitoring: Load balancers often include health checks to monitor the status of backend servers. If a server fails a health check, the load balancer stops sending traffic to it until it recovers.
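The health-check behavior described above can be sketched as a failure counter per server. The `HealthTracker` class and the threshold of three consecutive failures are assumptions for the example, and how the check itself is performed (for instance an HTTP probe) is left out.

```python
class HealthTracker:
    """Marks a server unhealthy after several consecutive failed checks."""

    def __init__(self, servers, fail_threshold=3):
        self.fail_threshold = fail_threshold  # assumed value
        self.failures = {server: 0 for server in servers}

    def record_check(self, server, passed):
        # A passing check resets the counter, so the server recovers at once.
        self.failures[server] = 0 if passed else self.failures[server] + 1

    def healthy_servers(self):
        return [s for s, count in self.failures.items() if count < self.fail_threshold]


tracker = HealthTracker(["server-a", "server-b"])
for _ in range(3):
    tracker.record_check("server-b", passed=False)
during_outage = tracker.healthy_servers()    # ["server-a"]
tracker.record_check("server-b", passed=True)
after_recovery = tracker.healthy_servers()   # ["server-a", "server-b"]
```

Any of the balancing algorithms can then choose only from `healthy_servers()` instead of the full pool.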
Improved Performance
Load balancers help in optimal utilization of resources and reduce latency.
- Optimized Resource Utilization: By distributing traffic based on server load or performance metrics, load balancers ensure that no single server is overwhelmed, leading to faster response times and better overall performance.
- Low Latency Routing: Some load balancers can route traffic to the closest or fastest server, reducing latency and improving the user experience.
Security
Load balancers give us an opportunity to apply security to traffic even before it hits our servers.
- DDoS Mitigation: Load balancers can soften the impact of Distributed Denial of Service (DDoS) attacks by spreading the traffic across multiple servers so that no single server absorbs it all. Many also support rate limiting or filtering to drop malicious traffic before it reaches the servers.
- SSL Offloading: Load balancers can handle SSL/TLS encryption and decryption, reducing the CPU load on backend servers and allowing them to focus on processing requests.
- Access Control: Load balancers can enforce access control policies, blocking or allowing traffic based on IP addresses, geolocation, or other criteria.
Efficient Traffic Management
Load balancers help with intelligent traffic management based on source and content information.
- Content-Based Routing: Layer 7 load balancers can route traffic based on content, such as URLs, headers, or cookies. This allows you to direct requests to specific services or servers based on the type of request.
- Session Persistence: Load balancers can maintain session persistence (also known as sticky sessions), ensuring that a user's requests are consistently routed to the same server. This is useful for stateful applications where user data is stored in server memory.
Cost Efficiency
Load balancers help reduce costs through horizontal scaling of inexpensive servers.
- Optimized Resource Usage: By efficiently distributing traffic, load balancers reduce the need for over-provisioning servers, leading to cost savings. In cloud environments, they can help you make better use of pay-as-you-go resources.
- Reduced Downtime Costs: Load balancers minimize downtime by rerouting traffic away from failed servers, reducing the potential revenue loss or reputational damage associated with outages.
Simplified Infrastructure Management
Load balancers help with infrastructure management through central management and automation.
- Centralized Management: Load balancers allow centralized management of traffic, simplifying the deployment and maintenance of backend servers.
- Integration with Automation Tools: Modern load balancers can integrate with DevOps tools enabling different deployment patterns.
Support for Modern Architectures
Load balancers enable modern architectures that need highly scalable, granular service-to-service communication.
- Microservices and Containers: Load balancers are essential in microservices and containerized environments, where they help route traffic to the appropriate service instance, ensuring smooth communication between services.
- API Management: Load balancers can be integrated with API gateways and help manage API traffic, ensuring that APIs remain responsive and available during high traffic.
Scenarios to use load balancers
- Web Servers: A load balancer can distribute web traffic across multiple servers, ensuring high availability and improved performance during traffic spikes.
- Application Servers: Microservice-based architectures use load balancers to distribute requests across different instances of the same service.
- Containers: In containerized deployments, load balancers distribute requests across multiple container instances of the same service.
- Database Servers: Some databases are distributed, and load balancers can route read requests to replica servers to reduce load on the primary database.
- Cloud Environments: In cloud-based architectures, load balancers are crucial to scaling applications dynamically by adding or removing instances based on load.
When not to use load balancer
While load balancers are powerful tools, they are not always the right solution for every scenario. There are certain situations where using a load balancer might introduce unnecessary complexity, cost, or performance overhead. Here are scenarios when you may want to avoid using a load balancer.
- Simple or Low-Traffic Applications - If you have a simple application with low traffic that one server can comfortably handle, a load balancer may be overkill. Introducing a load balancer would add complexity and cost without significant benefits. For example, an internal tool used by a handful of users might not need a load balancer, as a single server is sufficient.
- Single Server Setup - If your application runs on a single server, there’s nothing to balance the load across, making a load balancer unnecessary. For example, a small, self-contained application running on a single server doesn’t benefit from a load balancer.
- Budget Constraints - Load balancers, especially managed ones in the cloud or dedicated hardware devices, can add to the operational costs. If you’re working within a very limited budget, and the cost of maintaining or subscribing to a load balancer outweighs the benefits, it might not be a good fit. For example, startups or small businesses with minimal traffic might prioritize cost savings over advanced traffic distribution capabilities.
- Applications with Minimal Fault Tolerance Requirements - If the application can tolerate occasional downtime or you have alternative fault-tolerance mechanisms in place (e.g., regular backups, failover strategies), then a load balancer might not be necessary. For example, internal development and QA environments where downtime is acceptable for short periods.
- Stateless Services with Alternative Routing - If your service is stateless and can make use of DNS round-robin, load balancers may not be needed. DNS-based solutions are simpler and can distribute traffic without the overhead of managing a load balancer. For example, a CDN (Content Delivery Network) that uses DNS to route traffic to geographically distributed servers may not need an additional load balancer layer.
- Low Latency/Performance-Critical Applications - Load balancers introduce some latency, even if minimal. For applications where that overhead matters, direct routing without a load balancer might be preferred. For example, high-frequency trading systems where every millisecond matters may avoid load balancers to minimize latency.
- Specialized Backend Infrastructure - Some backend systems or databases are not designed to work with load balancers or might require highly specialized configurations that are better handled by other redundancy mechanisms. Introducing a load balancer could complicate the architecture and cause unexpected issues. For example, databases that require master-slave configurations might not be compatible with traditional load balancing strategies.
- Already Distributed Systems - In some distributed systems, traffic distribution is handled internally by the application itself. Peer-to-peer networks and decentralized systems, for instance, distribute load and requests naturally without the need for an external load balancer. For example, blockchain networks or torrent systems distribute tasks among nodes inherently, making load balancers redundant.
- Homogeneous Traffic to a Single Service - If your application handles homogeneous traffic and there’s no need to distinguish between different types of requests, a load balancer may not add value. For example, a data ingestion pipeline that processes similar types of data streams in a predictable manner.