Slack
Slack serves 20+ million daily active users across 750,000+ organizations, delivering billions of messages daily with real-time presence and sub-100ms message delivery. This document outlines the architecture that powers enterprise team communication at scale.
WebSocket Gateway Real-time
Real-time Messaging Event Distribution
Slack's core real-time messaging system delivers billions of messages daily.
Validation Rate Limit, Perms
Store Message Vitess/MySQL
Fanout Service Recipient List
WebSocket Connections Per User
Push Notifications Mobile/Desktop
Batch Delivery Offline Users
Ordering Per-channel Sequence
Acknowledgment Delivery Confirm
Real-time Features:
Sub-100ms Delivery : P99 message latency
WebSocket Connections : Persistent bi-directional channels
Ordered Delivery : Consistent message ordering per channel
Reconnection Handling : Seamless recovery from disconnects
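To make per-channel ordering concrete, here is a minimal sketch, assuming each channel carries its own monotonically increasing sequence number and that clients buffer out-of-order arrivals. `ChannelSequencer` and `OrderedReceiver` are hypothetical names for illustration, not Slack's actual components.

```python
from collections import defaultdict
from typing import Dict, List


class ChannelSequencer:
    """Hands out the next sequence number for each channel (hypothetical helper)."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = defaultdict(int)

    def assign(self, channel_id: str) -> int:
        # Sequences start at 1 and are independent per channel.
        self._last[channel_id] += 1
        return self._last[channel_id]


class OrderedReceiver:
    """Client-side buffer for one channel that releases messages in sequence order."""

    def __init__(self) -> None:
        self._expected = 1
        self._pending: Dict[int, str] = {}

    def receive(self, seq: int, text: str) -> List[str]:
        # Ignore anything already delivered; buffer everything else.
        if seq >= self._expected:
            self._pending[seq] = text
        # Release the longest contiguous run starting at the expected sequence.
        released = []
        while self._expected in self._pending:
            released.append(self._pending.pop(self._expected))
            self._expected += 1
        return released
```

A receiver that sees sequence 2 before sequence 1 holds it back until the gap is filled, and the same buffer absorbs duplicate deliveries after a reconnect.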
Manages millions of concurrent connections.
Connection flow (participants: Client, WebSocket Gateway, Auth Service, RTM Service, Redis Cluster): WebSocket Connect, Validate Token, User Identity, Register Connection, Subscribe to Events, Event Stream, Real-time Events. Keep-alive ping/pong every 30 seconds.
Gateway Architecture:
Connection Management : 500K+ connections per server
Protocol : WebSocket with custom binary protocol
Heartbeat : Keep-alive for connection health
Graceful Degradation : Fallback to long-polling
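The heartbeat can be sketched as a small tracker that records the last pong per connection and flags connections that have gone quiet. The 30-second interval comes from the connection flow above. The tolerance of two missed intervals and the `HeartbeatTracker` name are assumptions made for illustration.

```python
from typing import Dict, List


class HeartbeatTracker:
    """Tracks the last pong per connection and reports stale ones (illustrative)."""

    PING_INTERVAL = 30.0  # seconds between keep-alive pings

    def __init__(self, missed_allowed: int = 2) -> None:
        # Assumption: a connection is stale after more than this many silent intervals.
        self._timeout = self.PING_INTERVAL * missed_allowed
        self._last_pong: Dict[str, float] = {}

    def record_pong(self, conn_id: str, now: float) -> None:
        self._last_pong[conn_id] = now

    def stale(self, now: float) -> List[str]:
        # Connections whose last pong is older than the timeout, in sorted order.
        return sorted(
            conn_id
            for conn_id, seen in self._last_pong.items()
            if now - seen > self._timeout
        )
```

Passing `now` explicitly keeps the logic deterministic and easy to test. A real gateway would feed it a monotonic clock and close or downgrade the connections that `stale` returns.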
Supports channels with 10K+ members efficiently.
Public Channels Org-wide Visibility
Private Channels Invite Only
Direct Messages 1:1 or Group
Shared Channels Cross-organization
Channel Metadata Name, Topic, Purpose
Message History Searchable Archive
Permissions Read, Write, Admin
Workspace Policies Default Access
Channel Override Specific Rules
Real-time user status across the platform.
User Activity Keystrokes, Clicks
Calendar Integration Meetings
Presence Compute Status Algorithm
Timeout Logic Away Detection
Presence Features:
Real-time Updates : Instant status propagation
Multi-device : Aggregate presence across devices
Custom Status : Emoji and text status
DND Mode : Notification suppression
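Multi-device aggregation and away detection can be illustrated with a short sketch: each connected device reports its last activity time, and the user's presence is the most active state across devices. The 10-minute away timeout and the three-state model are assumptions for illustration, not documented Slack values.

```python
from enum import IntEnum
from typing import Dict


class Presence(IntEnum):
    OFFLINE = 0
    AWAY = 1
    ACTIVE = 2


AWAY_TIMEOUT = 600.0  # seconds of inactivity before a device counts as away (assumed)


def device_presence(last_activity: float, now: float) -> Presence:
    """Presence of a single connected device based on its last activity time."""
    return Presence.ACTIVE if now - last_activity < AWAY_TIMEOUT else Presence.AWAY


def aggregate(devices: Dict[str, float], now: float) -> Presence:
    """Most active state across connected devices; no devices means offline.

    `devices` maps a device id to the timestamp of its last activity.
    """
    return max(
        (device_presence(seen, now) for seen in devices.values()),
        default=Presence.OFFLINE,
    )
```

With this rule a user who is idle on a phone but typing on a laptop shows as active, which matches the "aggregate presence across devices" behavior listed above.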
Vitess Benefits:
Horizontal Scaling : Shard by workspace/channel
Connection Pooling : Efficient MySQL connections
Query Routing : Automatic shard selection
Online Resharding : Zero-downtime splits
Presence Node 1 User Status
Presence Node 2 User Status
Session Node 1 User Sessions
Session Node 2 User Sessions
File Indexer Content Extraction
User Indexer Profile Search
Full-text Search Messages, Files
Filters Channel, User, Date
Highlighting Match Context
Search Indexer Solr Updates
Notification Service Push/Email
Analytics Pipeline Usage Metrics
Webhook Delivery App Integrations
Kafka : Millions of events per second
Real-time Indexing : Sub-second search updates
Webhook Delivery : Reliable app notifications
Analytics : Real-time usage tracking
Tier 1: Edge SSL Termination
Tier 2: Gateway Protocol Handling
Horizontal Scale Add Servers
Consistent Hashing User Routing
Sticky Sessions Connection Affinity
500K Connections Per Server
Millions Total Connections
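Consistent hashing for user routing can be sketched with a hash ring of virtual nodes. This is a generic illustration of the technique, not Slack's implementation. The server names, the virtual-node count, and the `HashRing` class are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes (generic sketch)."""

    def __init__(self, servers: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # kept sorted by hash
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, server: str) -> None:
        # Each server owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def route(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("no gateway servers registered")
        # First ring point at or after the user's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

The property that matters for connection affinity is that adding a server only moves users onto the new server. Everyone else keeps the gateway they were already mapped to, so reconnect churn during scale-out stays limited.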
Announcements 10K+ Members
Lazy Loading On-demand Fetch
Notification Only No Real-time
> 1000 Members
Batched Fanout Chunked Delivery
Priority Queue Active Users First
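One way to read the large-channel strategy is as a size-based choice of delivery mode, followed by chunked fanout that serves active users first. The sketch below assumes that reading. The strategy labels, the batch size of 500, and both function names are hypothetical.

```python
from typing import Iterator, List, Set


def delivery_strategy(member_count: int) -> str:
    """Pick a delivery mode from channel size (thresholds taken from the text above)."""
    if member_count >= 10_000:
        return "notification_only"  # announcement-scale: no real-time fanout
    if member_count > 1_000:
        return "batched_fanout"
    return "direct_fanout"  # assumed default for smaller channels


def fanout_batches(
    members: List[str], active: Set[str], batch_size: int = 500
) -> Iterator[List[str]]:
    """Yield recipient chunks with currently active members first."""
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(members, key=lambda member: member not in active)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

Sorting on a boolean key is a simple stand-in for a priority queue with two priority levels. A real system with more levels would likely use a heap or separate queues.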
Shard by Workspace Primary Strategy
Shard by Channel Large Workspaces
Shard by Time Message Archives
Shard Lookup Mapping Table
Cross-shard Scatter-gather
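The lookup-table and scatter-gather ideas can be sketched in plain Python. This does not use or imitate the Vitess API. It only illustrates routing by workspace with per-channel overrides for large workspaces, plus a naive cross-shard query. `ShardRouter`, `scatter_gather`, and the shard names are assumptions.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class ShardRouter:
    """Maps a workspace (and optionally a channel) to a shard name (illustrative)."""

    def __init__(
        self,
        workspace_to_shard: Dict[str, str],
        channel_overrides: Optional[Dict[Tuple[str, str], str]] = None,
    ) -> None:
        self._workspace_to_shard = workspace_to_shard
        # Large workspaces may place individual channels on their own shards.
        self._channel_overrides = channel_overrides or {}

    def shard_for(self, workspace_id: str, channel_id: Optional[str] = None) -> str:
        if channel_id is not None:
            override = self._channel_overrides.get((workspace_id, channel_id))
            if override is not None:
                return override
        return self._workspace_to_shard[workspace_id]


def scatter_gather(
    shards: Iterable[str], query: Callable[[str], List[str]]
) -> List[str]:
    """Run the same query against every shard and concatenate the results."""
    results: List[str] = []
    for shard in shards:
        results.extend(query(shard))
    return results
```

A production scatter-gather would issue the per-shard queries concurrently and merge sorted results. The sequential loop here only shows the shape of the operation.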
Data Loss Prevention Content Policies
Message Retention Custom Policies
Multi-factor Auth TOTP, SMS
Role-based Access Admin, Member, Guest
Channel Permissions Public, Private
Workspace Policies Admin Controls
Enterprise Key Management : Customer-controlled keys
Data Residency : Region-specific data storage
Audit Logs : Comprehensive activity tracking
DLP Integration : Third-party DLP support
SOC 2 Type II : Security controls audit
GDPR : European data protection
HIPAA : Healthcare compliance (Enterprise Grid)
FedRAMP : Government authorization
Jaeger Distributed Tracing
AlertManager Notifications
Anomaly Detection ML-based
Message Delivery Latency : P50, P95, P99
WebSocket Connection Health : Success rate, reconnections
API Latency : Endpoint-level response times
Search Latency : Query response times
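For readers less familiar with P50, P95, and P99, the sketch below computes percentiles from raw latency samples using the nearest-rank method. Production monitoring systems usually estimate these from histograms or sketches instead, so treat this as a definition rather than a description of Slack's tooling. The `percentile` helper is illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

A P99 of 100 ms means that 99% of messages were delivered in 100 ms or less, which is how the sub-100ms delivery target is expressed.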
Deployment pipeline (feature-branch to main): Feature Dev, Code Changes, Unit Tests, Integration Tests, Build Artifacts, Canary Deploy, Production Rollout.
Gradual Rollout 10%, 50%, 100%
Auto Rollback Error Threshold
Feature Flags LaunchDarkly
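Gradual rollout and auto rollback can be sketched with deterministic user bucketing and an error-rate check. This is a generic illustration and does not use the LaunchDarkly SDK. The 1% error threshold, the hashing scheme, and the function names are assumptions.

```python
import hashlib

ROLLOUT_STAGES = (10, 50, 100)  # percentages from the rollout stages above


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Buckets 0..percent-1 are enabled, so raising percent only ever adds users.
    return bucket < percent


def should_rollback(errors: int, requests: int, threshold: float = 0.01) -> bool:
    """Trigger a rollback when the observed error rate exceeds the threshold."""
    if requests == 0:
        return False  # no traffic yet, nothing to judge
    return errors / requests > threshold
```

Because the bucket depends only on the feature and user, a user who sees the feature at 10% keeps it at 50% and 100%, which avoids the feature flickering on and off between stages.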
Kubernetes : Container orchestration
AWS : Primary cloud provider
Terraform : Infrastructure as code
Consul : Service discovery
Network Partition AZ Failure
Service Failure Dependency Outage
Load Testing Traffic Spike
Message Delivery Success Rate
Practices:
GameDay Exercises : Quarterly failure simulations
Chaos Monkey : Random service termination
Load Testing : 10x normal traffic simulation
AZ Failover : Regular availability zone drills
User Events Actions, Navigation
Message Events Sends, Reads
Search Events Queries, Clicks
Search Ranking Relevance ML
Spam Detection Abuse Prevention
Search Ranking : Personalized result ordering
Channel Suggestions : Recommend relevant channels
Spam Detection : Automated abuse prevention
Smart Notifications : Intelligent alert timing
Emoji Predictions : Suggested reactions
Slack Infrastructure Cost Distribution: Compute & Real-time 35%, Storage 25%, Networking 20%, Search Infrastructure 15%, Operations 5%.
Spot Instances Batch Processing
Reserved Capacity Real-time Services
Auto-scaling Traffic Patterns
Tiered Storage Hot/Cold Messages
Archive Policy Old Messages
Protocol Optimization Binary WebSocket
Batch API Calls Reduce Round Trips
Message Compression : 40% storage reduction
Connection Multiplexing : Efficient WebSocket usage
Tiered Storage : Archive old messages to cold storage
Reserved Instances : Predictable baseline costs
WebRTC Integration : Native audio/video calls
AI Assistance : Smart message suggestions
Workflow Automation : No-code automation tools
Edge Computing : Lower latency for global users
Salesforce Integration : Deeper CRM integration
Canvas : Rich document collaboration
Clips : Async video messaging
Huddles : Lightweight audio calls
Multi-Cloud : Resilience through cloud diversity
Global Expansion : New regions for data residency
Zero-Trust Security : Enhanced security model
Sustainable Computing : Carbon-neutral operations
Slack's architecture demonstrates how to build a real-time collaboration platform at scale. The combination of WebSocket-based real-time messaging, Vitess-powered database sharding, and efficient presence tracking enables Slack to deliver reliable communication for millions of teams.
The platform continues to evolve with deeper enterprise integrations, enhanced AI capabilities, and improved collaboration features, all while maintaining the real-time responsiveness that users depend on for productive teamwork.
Further iterations may be needed; the current data is as close as I could get.