Discord's 70% Cost Reduction Moving to Bare Metal
Introduction
Discord, the popular voice, video, and text communication platform for gamers and communities, faced a critical infrastructure challenge: their voice services were becoming prohibitively expensive in the cloud.
With millions of concurrent voice users and hundreds of millions of messages daily, Discord needed to rethink their infrastructure strategy.
The Voice Infrastructure Challenge
Voice communication is fundamentally different from typical web services:
- Real-time requirements: Sub-50ms latency is critical
- High bandwidth: Voice data requires significant network capacity
- CPU intensive: Audio encoding/decoding is computationally expensive
- Predictable load: Gaming patterns are relatively consistent
The Cloud Cost Problem
Discord's voice infrastructure costs on cloud providers were escalating:
- Per-minute charges added up quickly at scale
- Network egress fees were enormous for voice traffic
- CPU costs for encoding were higher than expected
- Unpredictable billing made budgeting difficult
The Decision to Move
In 2023, Discord's infrastructure team made a bold decision: move voice services from cloud to bare-metal servers in their own data centers.
Key Motivations
- Cost: Voice services were consuming 40% of the infrastructure budget
- Performance: Latency could be improved with dedicated hardware
- Control: Fine-tune hardware for specific workloads
- Predictability: Fixed costs for capacity planning
The Migration Strategy
Phase 1: Proof of Concept (1 month)
- Built small bare-metal cluster
- Migrated 5% of voice traffic
- Measured performance and costs
- Proved concept viability
Phase 2: Regional Deployment (3 months)
- Deployed servers in 15 regions globally
- Gradual traffic migration
- Real-time monitoring of quality metrics
- A/B testing cloud vs. bare-metal
Phase 3: Full Migration (2 months)
- Migrated remaining traffic
- Maintained cloud as backup during the cutover
- Optimized routing and capacity
- Decommissioned cloud infrastructure afterward
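The gradual traffic shift in these phases (5% in the proof of concept, then more) can be illustrated with a percentage-based rollout. The source does not describe how Discord split traffic, so this is only a minimal sketch of one common approach. The function name, the session-ID format, and the hashing scheme are all assumptions.

```python
import hashlib


def routes_to_bare_metal(session_id: str, rollout_percent: float) -> bool:
    """Return True if this session falls inside the bare-metal cohort.

    Hypothetical helper: hashing the session ID gives a stable bucket in
    0..9999, so a session keeps its assignment while the percentage is
    unchanged, and raising the percentage only ever adds sessions.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100


# Simulate the 5% proof-of-concept stage over made-up session IDs.
sessions = [f"session-{i}" for i in range(20_000)]
share = sum(routes_to_bare_metal(s, 5) for s in sessions) / len(sessions)
print(f"Share routed to bare metal: {share:.1%}")
```

Because the bucket depends only on the session ID, lowering the percentage rolls sessions back to the cloud in a predictable way, which is one way to keep a migration reversible.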
Technical Architecture
Hardware Specifications
Each voice server cluster consists of:
- AMD EPYC 7763 processors (optimized for audio encoding)
- 256GB DDR4 RAM (handling many concurrent connections)
- 10Gbps network cards (low-latency networking)
- NVMe SSDs (fast local caching)
Software Stack
- WebRTC for real-time communication
- Rust-based voice servers (performance and safety)
- Custom routing algorithms (optimal server selection)
- Prometheus + Grafana for monitoring
Global Distribution
- 15 regions worldwide
- 50+ data centers for low latency
- Anycast networking for optimal routing
- Automated failover between regions
The Results
Cost Savings
Before (Cloud):
- Voice infrastructure: ~$14M/year
- Network egress: ~$6M/year
- Total: ~$20M/year
After (Bare Metal):
- Hardware (amortized): ~$3M/year
- Colocation: ~$2M/year
- Network: ~$1M/year
- Total: ~$6M/year
Annual Savings: ~$14M (70% reduction)
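The arithmetic behind the headline figure can be checked directly. The sketch below uses only the line items listed above, in millions of US dollars per year. The function and dictionary key names are illustrative.

```python
def savings_summary(before: dict[str, float], after: dict[str, float]) -> tuple[float, float]:
    """Return (absolute savings, fractional reduction) for two cost breakdowns."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    saved = total_before - total_after
    return saved, saved / total_before


# Figures from the breakdown above, in $M per year.
cloud = {"voice_infrastructure": 14.0, "network_egress": 6.0}
bare_metal = {"hardware_amortized": 3.0, "colocation": 2.0, "network": 1.0}

saved, reduction = savings_summary(cloud, bare_metal)
print(f"Annual savings: ~${saved:.0f}M ({reduction:.0%} reduction)")
# prints "Annual savings: ~$14M (70% reduction)"
```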
Performance Improvements
- 35% lower latency on average
- Uptime improved from 99.9% to 99.99%
- Better audio quality due to dedicated resources
- Faster connection times for users
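To put the uptime figures in perspective, an availability percentage can be converted into allowed downtime per year. This is plain arithmetic rather than anything specific to Discord, and the function name is illustrative.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into minutes of downtime per year."""
    minutes_per_year = 365 * 24 * 60  # 525,600, ignoring leap years
    return minutes_per_year * (1 - availability_percent / 100)


for availability in (99.9, 99.99):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Going from 99.9% to 99.99% shrinks the yearly downtime budget from roughly 8.8 hours to under an hour.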
Operational Benefits
- Predictable costs for budgeting
- Fine-grained control over hardware
- Faster debugging with full stack access
- Better optimization possibilities
Challenges Overcome
1. Global Distribution
Challenge: Deploying to 15 regions simultaneously
Solution: Phased regional rollout with automated provisioning
2. Network Routing
Challenge: Optimal server selection for users
Solution: Custom routing algorithms based on latency and load
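Discord's actual routing algorithm is not described here, so the following is only a minimal sketch of latency- and load-aware server selection. The scoring formula, the thresholds, and the server names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceServer:
    name: str
    latency_ms: float  # measured round-trip time from the user
    load: float        # fraction of capacity in use, 0.0 to 1.0


def pick_server(
    servers: list[VoiceServer],
    max_load: float = 0.85,
    load_penalty_ms: float = 40.0,
) -> VoiceServer:
    """Pick the server with the lowest latency-plus-load score.

    Servers at or above max_load are skipped. Among the rest, each unit of
    load counts as load_penalty_ms of extra latency (hypothetical weighting).
    """
    candidates = [s for s in servers if s.load < max_load]
    if not candidates:
        raise ValueError("no voice server has spare capacity")
    return min(candidates, key=lambda s: s.latency_ms + load_penalty_ms * s.load)


servers = [
    VoiceServer("fra-1", latency_ms=18.0, load=0.90),  # closest, but overloaded
    VoiceServer("ams-1", latency_ms=24.0, load=0.30),
    VoiceServer("lon-1", latency_ms=22.0, load=0.70),
]
print(pick_server(servers).name)  # prints "ams-1"
```

In this example the nearest server is excluded for being overloaded, and the lightly loaded server wins over a slightly closer but busier one.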
3. Capacity Planning
Challenge: Predicting growth and seasonal spikes
Solution: Real-time analytics and automatic scaling triggers
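A scaling trigger of this kind can be reduced to a headroom check. The sketch below is an assumption-heavy illustration. The per-server capacity, the 25% headroom, and the fleet size are made-up numbers, and only the 4M peak figure comes from this article.

```python
import math


def servers_needed(peak_concurrent_users: int, users_per_server: int, headroom: float = 0.25) -> int:
    """Servers required to carry the peak plus a safety margin."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    return math.ceil(peak_concurrent_users * (1 + headroom) / users_per_server)


def should_add_capacity(current_servers: int, peak_concurrent_users: int, users_per_server: int) -> bool:
    """Trigger when the fleet can no longer cover peak plus headroom."""
    return current_servers < servers_needed(peak_concurrent_users, users_per_server)


# 4M peak users is from the article; 10,000 users per server is hypothetical.
print(servers_needed(4_000_000, 10_000))            # prints 500
print(should_add_capacity(450, 4_000_000, 10_000))  # prints True
```

With owned hardware the trigger has to fire well before capacity runs out, since adding servers means procurement lead time rather than an API call.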
4. Monitoring at Scale
Challenge: Tracking metrics across distributed infrastructure
Solution: Centralized monitoring with Prometheus and custom dashboards
Engineering Insights
Why Voice is Different
Voice services have unique characteristics that make them well suited to bare metal:
- Predictable CPU usage: Easier to capacity plan
- High network throughput: Cloud egress fees add up quickly
- Latency sensitive: Dedicated hardware helps
- Stateful connections: Long-lived sessions make per-instance cloud pricing inefficient
Hardware Optimization
Discord's team optimized hardware choices:
- AMD EPYC processors for better price/performance on audio
- High-frequency RAM for WebRTC workloads
- Low-latency NICs for real-time communication
- Local NVMe for session caching
Software Optimization
Moving to bare metal enabled new optimizations:
- Direct access to network hardware
- Custom kernel tuning for WebRTC
- NUMA-aware scheduling
- Specialized audio codec configurations
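As an illustration of NUMA-aware scheduling, the sketch below assigns each worker the cores of a single NUMA node so its threads stay close to the memory they use. It is not Discord's implementation. The two-node topology is hypothetical, and on Linux the real layout can be read from /sys/devices/system/node. Pinning relies on os.sched_setaffinity, which is only available on some platforms.

```python
import os


def plan_worker_affinity(numa_nodes: dict[int, list[int]], workers: int) -> list[set[int]]:
    """Give each worker the cores of one NUMA node, round-robin across nodes."""
    if not numa_nodes:
        raise ValueError("at least one NUMA node is required")
    node_ids = sorted(numa_nodes)
    return [set(numa_nodes[node_ids[i % len(node_ids)]]) for i in range(workers)]


def pin_current_process(cores: set[int]) -> bool:
    """Pin the calling process to the given cores; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cores)
    return True


# Hypothetical topology: two NUMA nodes with four cores each.
topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
plan = plan_worker_affinity(topology, workers=4)
print(plan)
```

Each worker process would call pin_current_process with its entry from the plan at startup.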
Impact on Users
User Experience Improvements
- Faster connection to voice channels
- Clearer audio quality with fewer drops
- Lower latency in conversations
- More reliable service overall
Scale Achievements
Discord now handles:
- 4M+ concurrent voice users at peak
- Billions of messages daily
- Hundreds of servers globally
- Sub-40ms latency in most regions
Lessons Learned
1. Voice/Video is Expensive in Cloud
Real-time media services have economics that heavily favor owned infrastructure at scale.
2. Hardware Specialization Matters
Choosing processors and hardware specifically for your workload can yield massive performance gains.
3. Migration Can Be Low-Risk
Gradual rollout and maintaining cloud backup made migration safe and reversible.
4. Expertise is Essential
Success required deep understanding of networking, WebRTC, and systems engineering.
5. Monitoring is Critical
Comprehensive metrics enabled confident migration and ongoing optimization.
Community Response
Discord's engineering team shared their journey through:
- Technical blog posts
- Conference presentations
- Open-source tools
- Community discussions
The developer community praised their transparency and technical depth, inspiring similar migrations at other real-time service companies.
Future Plans
Discord continues to optimize their infrastructure:
- Exploring newer CPU architectures
- Improving routing algorithms
- Expanding to more regions
- Sharing more learnings publicly
Conclusion
Discord's migration proves that even for demanding real-time services, bare-metal infrastructure can deliver superior economics and performance compared to cloud solutions.
Their success demonstrates that with proper planning, expertise, and execution, companies can achieve massive cost savings while actually improving service quality.
Key Metrics
💰 70% cost reduction
⚡ 35% lower latency
👥 4M+ concurrent voice users
🌍 15 regions globally
📈 99.99% uptime achieved
