
Do you ever wonder how you receive your friend’s message nearly as soon as they send it, even though applications such as Instagram, Facebook, and WhatsApp have millions or even billions of users? The magic behind this seamless communication lies in the intricate design of their real-time messaging systems.
In the world of real-time communication, designing a scalable and efficient chat application is a fascinating challenge. The key question is: how do you ensure that when Jean, connected to WebSocket server 1000, sends a message to Jack, who is on WebSocket server 1, Jack receives that message promptly, ideally well within 10 seconds?
Imagine you’ve built a real-time chat app with thousands of users connected across multiple WebSocket servers. Users expect near-instantaneous message delivery, but achieving this across a distributed system with numerous servers is no small feat.
In this blog post, we’ll explore various solutions to this problem and discuss their pros and cons.
The Problem Statement
The core challenge is maintaining low-latency message delivery in a distributed environment with multiple WebSocket servers: every message must be routed efficiently and reliably to the server that holds the recipient’s connection.
Proposed Solution: Third Service with Cache
Overview
My proposed solution involves a third service that acts as a directory, using a cache to maintain WebSocket server IDs, chat sessions, and user associations. The cache maps each chat ID to its participant user IDs, and each user ID to the WebSocket server currently holding that user’s connection, enabling efficient event dispatch as soon as a message is received.
Pros
- Centralized management of user connections.
- Efficient routing with low latency due to caching.
- Scalable and easy to implement.
Cons
- Potential single point of failure if not properly managed.
- Requires robust cache management strategies.
Implementation Steps
- Track Connections: On user connection, WebSocket servers update the directory service with the user’s server ID.
- Query Directory: When Jean sends a message, Server 1000 queries the directory service to find Jack’s server ID.
- Route and Deliver: The message is routed to Server 1 for delivery to Jack.
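The steps above can be sketched with a plain in-memory mapping standing in for the cache (in production the directory would typically live in Redis or a similar store). The class and function names here are illustrative, not part of any existing API:

```python
class ConnectionDirectory:
    """Maps user IDs to the WebSocket server currently holding their connection."""

    def __init__(self):
        self._user_to_server = {}  # user_id -> server_id

    def register(self, user_id, server_id):
        # Called by a WebSocket server when a user connects.
        self._user_to_server[user_id] = server_id

    def unregister(self, user_id):
        # Called on disconnect, so stale routes are not used.
        self._user_to_server.pop(user_id, None)

    def lookup(self, user_id):
        # Returns the server ID for a user, or None if the user is offline.
        return self._user_to_server.get(user_id)


def route_message(directory, sender_id, recipient_id, text):
    """Look up the recipient's server and build a routing envelope for it."""
    server_id = directory.lookup(recipient_id)
    if server_id is None:
        return None  # recipient offline; the caller may queue for later delivery
    return {"to_server": server_id, "from": sender_id,
            "to": recipient_id, "text": text}
```

In the Jean-and-Jack scenario, Server 1000 would call `route_message` with Jack’s user ID, receive `"ws-1"` as the destination, and forward the envelope there.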
Solution 1: Distributed Message Broker
Overview
One effective approach is to use a distributed message broker like Kafka, RabbitMQ, or NATS. Here’s how it works:
- Each WebSocket server publishes messages to the message broker.
- The broker handles the routing and delivery of messages to the appropriate servers.
Pros
- Reliable message delivery.
- Scalable architecture.
Cons
- Added complexity in managing the message broker.
- Potential latency due to message broker overhead.
Implementation Steps
- Publish Messages: When Jean sends a message, WebSocket Server 1000 publishes it to the message broker.
- Route Messages: The message broker routes the message to WebSocket Server 1.
- Deliver Messages: WebSocket Server 1 delivers the message to Jack.
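To make the flow concrete, here is a toy in-memory pub/sub broker standing in for Kafka, RabbitMQ, or NATS: each WebSocket server subscribes to its own topic, and a sender’s server publishes to the recipient server’s topic. The topic names and classes are assumptions for illustration only:

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for Kafka/RabbitMQ/NATS: topics fan out to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> [callback, ...]

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # A real broker would also persist and retry; here we just fan out.
        for callback in self._subscribers[topic]:
            callback(message)


broker = InMemoryBroker()
delivered = []

# WebSocket Server 1 subscribes to its own topic and delivers to local users.
broker.subscribe("server-1", lambda msg: delivered.append(msg))

# WebSocket Server 1000 publishes Jean's message to Jack's server topic.
broker.publish("server-1", {"from": "jean", "to": "jack", "text": "hello"})
```

The broker’s overhead (the con noted above) comes from the extra network hop and any persistence it performs, which this in-memory sketch elides.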
Solution 2: Global User Directory Service
Overview
A global directory service, essentially a hardened version of the proposed solution above, can track which WebSocket server each user is connected to. This service can use a distributed cache like Redis or a database.
Pros
- Centralized tracking of user connections.
- Simplified message routing.
Cons
- Single point of failure if not properly managed.
- Potential performance bottleneck.
Implementation Steps
- Track Connections: On user connection, WebSocket servers update the directory service with the user’s server ID.
- Query Directory: When Jean sends a message, Server 1000 queries the directory service to find Jack’s server ID.
- Route and Deliver: The message is routed to Server 1 for delivery to Jack.
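One way to soften the cons above is to give each directory entry a time-to-live that WebSocket servers refresh with periodic heartbeats, so entries for crashed servers expire instead of lingering as stale routes. A minimal sketch, with an injectable clock so the behavior is easy to test; the class name and the 30-second TTL are assumptions:

```python
import time

class TTLDirectory:
    """Directory whose entries expire unless refreshed by server heartbeats."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # user_id -> (server_id, expires_at)

    def heartbeat(self, user_id, server_id):
        # Servers re-register their connected users periodically,
        # which refreshes the entry's expiry time.
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() > expires_at:
            # The owning server stopped heartbeating; treat the route as stale.
            del self._entries[user_id]
            return None
        return server_id
```

With Redis, the same effect is commonly achieved by setting a key expiry and refreshing it on each heartbeat.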
Solution 3: Client-Side Acknowledgment
Overview
Ensure message delivery through client-side acknowledgment. Messages are stored until receipt is acknowledged by the recipient.
Pros
- Guarantees message delivery.
- Resilient to server failures.
Cons
- Increased complexity in handling acknowledgments.
- Potential for duplicate messages.
Implementation Steps
- Send Message: Jean sends a message, which is stored persistently until acknowledged.
- Acknowledge Receipt: Jack acknowledges receipt of the message.
- Retry Mechanism: If no acknowledgment is received, the system retries message delivery.
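The steps above can be sketched as a small pending store that keeps each message until it is acknowledged and re-delivers anything still outstanding. The names are illustrative, and the duplicate-delivery con is visible directly in the code: retries can resend a message the recipient already received, so clients must de-duplicate by message ID:

```python
class PendingStore:
    """Holds messages until the recipient acknowledges them."""

    def __init__(self):
        self._pending = {}  # message_id -> message

    def send(self, message_id, message, deliver):
        # Persist first, then attempt delivery.
        self._pending[message_id] = message
        deliver(message_id, message)

    def acknowledge(self, message_id):
        # Called when the recipient's client confirms receipt.
        self._pending.pop(message_id, None)

    def retry_unacked(self, deliver):
        # Re-deliver anything still pending. Recipients must de-duplicate
        # by message_id, since a retry may race with a late acknowledgment.
        for message_id, message in list(self._pending.items()):
            deliver(message_id, message)
```

In practice the store would be a durable database or log, and `retry_unacked` would run on a timer with exponential backoff rather than being called directly.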
Solution 4: WebSocket Server Clustering
Overview
Cluster WebSocket servers and use consistent hashing to distribute users. Servers within the cluster can route messages directly to each other.
Pros
- Decentralized and scalable.
- Reduces single points of failure.
Cons
- Requires complex cluster management.
- Potential latency in inter-server communication.
Implementation Steps
- Cluster Management: Manage a cluster of WebSocket servers.
- Consistent Hashing: Distribute users across servers using consistent hashing.
- Direct Routing: Servers route messages directly to the appropriate server in the cluster.
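Consistent hashing is what makes this decentralized scheme workable: any server can compute which peer owns a user, with no directory lookup, and adding or removing a server only remaps the users that were on it. A compact sketch using only the standard library (the replica count and hash choice are illustrative, not a recommendation):

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Consistent hash ring: each user maps to the first server clockwise."""

    def __init__(self, servers, replicas=100):
        # Virtual nodes ("replicas") smooth out the distribution of users.
        self._ring = []  # sorted list of (hash, server_id)
        for server in servers:
            for i in range(replicas):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, user_id):
        # Find the first virtual node at or after the user's hash, wrapping around.
        idx = bisect_right(self._ring, (self._hash(user_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

The defining property: if a server leaves the ring, only users who hashed to that server move; everyone else keeps their assignment, so most WebSocket connections survive a membership change untouched.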
Solution 5: Geo-Distributed Data Centers
Overview
Deploy WebSocket servers in multiple geographically distributed data centers. Use DNS routing to direct users to the nearest data center.
Pros
- Low latency and high availability.
- Improved user experience.
Cons
- High infrastructure costs.
- Complex synchronization between data centers.
Implementation Steps
- Deploy Globally: Set up WebSocket servers in multiple data centers.
- DNS Routing: Use DNS to route users to the nearest data center.
- Sync Data: Synchronize user session information across data centers.
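In practice the "route to the nearest data center" decision is made by GeoDNS or anycast, not by application code, but the selection logic itself is simple: pick the region with the lowest measured round-trip time. A tiny illustrative helper (the region names and latency figures are made up):

```python
def nearest_datacenter(client_latencies):
    """Pick the data center with the lowest measured round-trip time.

    client_latencies: mapping of data-center name -> RTT in milliseconds,
    e.g. as measured by a connectivity probe from the client.
    """
    return min(client_latencies, key=client_latencies.get)
```

Some clients complement DNS-based routing with such probes, since DNS resolvers are not always located near the user.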
Additional Considerations
Latency Optimization
- Use fast, in-memory databases for the user directory service.
- Implement efficient serialization formats (e.g., Protocol Buffers).
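To see why binary serialization helps, compare a JSON-encoded chat message with a hand-rolled binary frame. The frame below is only a stand-in for Protocol Buffers (a fixed-width header packed with `struct`, followed by the text payload), but it shows the size reduction that motivates the bullet above:

```python
import json
import struct

message = {"sender_id": 42, "recipient_id": 7,
           "timestamp": 1700000000, "text": "hello"}

# Text encoding: human-readable, but larger and slower to parse.
json_bytes = json.dumps(message).encode()

# Binary framing: two 4-byte IDs, an 8-byte timestamp, a 2-byte length,
# then the UTF-8 text. Network byte order ("!") for portability.
text_bytes = message["text"].encode()
frame = struct.pack("!IIQH", message["sender_id"], message["recipient_id"],
                    message["timestamp"], len(text_bytes)) + text_bytes
```

Real Protocol Buffers additionally give you schema evolution and generated parsers, which a hand-rolled frame like this does not.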
Scalability
- Shard the user directory service to distribute the load.
- Implement auto-scaling groups for WebSocket servers.
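Sharding the directory usually means hashing the user ID to pick a fixed shard, so all reads and writes for a given user always land on the same node. A minimal sketch (the shard count and hash choice are assumptions):

```python
import hashlib

def shard_for(user_id, num_shards):
    """Route directory reads/writes for a user to a stable shard index."""
    digest = hashlib.sha1(user_id.encode()).digest()
    # Take 8 bytes of the digest for a well-distributed integer, then reduce.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note that plain modulo sharding reshuffles most keys when `num_shards` changes; if you expect to resize the shard set often, the consistent-hashing ring from Solution 4 is the better fit.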
Fault Tolerance
- Redundancy for critical services to avoid single points of failure.
- Leader-election algorithms for managing primary and secondary instances.
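The lease pattern behind most leader election can be sketched in a few lines: one instance holds a lease that it must renew before expiry, and a standby takes over only once the lease lapses. This toy, single-process version (with an injectable clock) is a stand-in for what etcd, ZooKeeper, or Consul provide in production; all names and the 10-second lease are assumptions:

```python
import time

class LeaseRegistry:
    """Toy lease-based leader election: one holder at a time, expiring leases."""

    def __init__(self, lease_seconds=10.0, clock=time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, instance_id):
        # Succeeds if no one holds the lease, the lease expired,
        # or the caller is renewing its own lease.
        now = self._clock()
        if self._holder is None or now > self._expires_at or self._holder == instance_id:
            self._holder = instance_id
            self._expires_at = now + self._lease_seconds
            return True
        return False

    def leader(self):
        # Only report a leader while its lease is still valid.
        if self._holder is not None and self._clock() <= self._expires_at:
            return self._holder
        return None
```

A real implementation must make `try_acquire` an atomic compare-and-set on shared storage; that atomicity is precisely what coordination services exist to provide.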
Monitoring and Alerts
- Set up monitoring and alerting for message delivery times and server health.
- Use tools like Prometheus, Grafana, and the ELK stack for real-time monitoring.
Conclusion
Designing a scalable real-time chat system involves balancing reliability, scalability, and performance. Each solution has its trade-offs, and the best approach depends on your specific requirements and constraints. By combining elements from these solutions, you can build a robust system that ensures low-latency message delivery across a distributed network of WebSocket servers.
Have you faced similar challenges? How did you solve them? Share your thoughts and solutions in the comments below. Let’s solve this problem together!