Discord

Gaming Communication Platform

Discord's 70% Cost Reduction Moving to Bare Metal

2024-01-20 · 7 min read
Total Savings: 70% cost reduction

Introduction

Discord, the popular voice, video, and text communication platform for gamers and communities, faced a critical infrastructure challenge: their voice services were becoming prohibitively expensive in the cloud.

With millions of concurrent voice users and hundreds of millions of messages daily, Discord needed to rethink their infrastructure strategy.

The Voice Infrastructure Challenge

Voice communication is fundamentally different from typical web services:

  • Real-time requirements: sub-50ms latency is critical
  • High bandwidth: Voice data requires significant network capacity
  • CPU intensive: Audio encoding/decoding is computationally expensive
  • Predictable load: Gaming patterns are relatively consistent

The Cloud Cost Problem

Discord's voice infrastructure costs on cloud providers were escalating:

  • Per-minute charges added up quickly at scale
  • Network egress fees were enormous for voice traffic
  • CPU costs for encoding were higher than expected
  • Unpredictable billing made budgeting difficult

The Decision to Move

In 2023, Discord's infrastructure team made a bold decision: move voice services from the cloud to bare-metal servers that it owns and runs in colocation facilities.

Key Motivations

  1. Cost: Voice services consumed 40% of the infrastructure budget
  2. Performance: Latency could be improved with dedicated hardware
  3. Control: Fine-tune hardware for specific workloads
  4. Predictability: Fixed costs for capacity planning

The Migration Strategy

Phase 1: Proof of Concept (1 month)

  • Built small bare-metal cluster
  • Migrated 5% of voice traffic
  • Measured performance and costs
  • Proved concept viability

Phase 2: Regional Deployment (3 months)

  • Deployed servers in 15 regions globally
  • Gradual traffic migration
  • Real-time monitoring of quality metrics
  • A/B testing cloud vs. bare-metal
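
One common way to run a gradual, measurable traffic shift like the one in Phases 1 and 2 is deterministic hash bucketing. The sketch below is an illustration under that assumption, not Discord's documented mechanism; `assign_backend` and `rollout_percent` are hypothetical names.

```python
import hashlib


def assign_backend(user_id: str, rollout_percent: float) -> str:
    """Return "bare-metal" or "cloud" for a user (hypothetical helper).

    The user ID is hashed into one of 10,000 buckets, so a user always
    lands on the same backend for a given rollout percentage, and raising
    the percentage only ever moves users from cloud to bare metal.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "bare-metal" if bucket < rollout_percent * 100 else "cloud"


# Start at 5% (as in the proof of concept), then raise the percentage per region.
print(assign_backend("user-12345", 5))
```

Because assignment is stable per user, quality metrics for the cloud and bare-metal cohorts can be compared directly, and rolling back is a matter of lowering the percentage.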

Phase 3: Full Migration (2 months)

  • Migrated remaining traffic
  • Maintained cloud as backup
  • Optimized routing and capacity
  • Decommissioned cloud infrastructure

Technical Architecture

Hardware Specifications

Each voice server cluster consists of:

  • AMD EPYC 7763 processors (chosen for price/performance on audio encoding)
  • 256 GB DDR4 RAM (handling many concurrent connections)
  • 10 Gbps network cards (low-latency networking)
  • NVMe SSDs (fast local caching)

Software Stack

  • WebRTC for real-time communication
  • Rust-based voice servers (performance and safety)
  • Custom routing algorithms (optimal server selection)
  • Prometheus + Grafana for monitoring

Global Distribution

  • 15 regions worldwide
  • 50+ data centers for low latency
  • Anycast networking for optimal routing
  • Automated failover between regions

The Results

Cost Savings

Before (Cloud):

  • Voice infrastructure: ~$14M/year
  • Network egress: ~$6M/year
  • Total: ~$20M/year

After (Bare Metal):

  • Hardware (amortized): ~$3M/year
  • Colocation: ~$2M/year
  • Network: ~$1M/year
  • Total: ~$6M/year

Annual Savings: ~$14M (70% reduction)
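
The headline percentage follows directly from the line items above. As a quick check, the sketch below recomputes it; the dollar figures are the article's approximate numbers, and the function name is illustrative.

```python
# Annual figures in millions of USD, copied from the lists above.
cloud_costs = {"voice infrastructure": 14, "network egress": 6}
bare_metal_costs = {"hardware (amortized)": 3, "colocation": 2, "network": 1}


def summarize_savings(before: dict, after: dict) -> tuple:
    """Return (total before, total after, savings, percent reduction)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    savings = total_before - total_after
    percent = 100 * savings / total_before
    return total_before, total_after, savings, percent


total_before, total_after, savings, percent = summarize_savings(
    cloud_costs, bare_metal_costs
)
print(f"${total_before}M -> ${total_after}M: saves ${savings}M/year ({percent:.0f}%)")
# prints "$20M -> $6M: saves $14M/year (70%)"
```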

Performance Improvements

  • 35% lower latency on average
  • Uptime improved from 99.9% to 99.99%
  • Better audio quality due to dedicated resources
  • Faster connection times for users
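
To put the uptime figures in perspective, an uptime percentage can be converted into the downtime it allows per year. This is a simple arithmetic sketch; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Going from 99.9% to 99.99% shrinks the yearly downtime allowance from roughly 8.8 hours to about 53 minutes.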

Operational Benefits

  • Predictable costs for budgeting
  • Fine-grained control over hardware
  • Faster debugging with full stack access
  • Better optimization possibilities

Challenges Overcome

1. Global Distribution

Challenge: Deploying to 15 regions simultaneously
Solution: Phased regional rollout with automated provisioning

2. Network Routing

Challenge: Optimal server selection for users
Solution: Custom routing algorithms based on latency and load
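
The article does not describe the routing algorithm itself. As a minimal sketch of what "based on latency and load" could mean, the example below scores each candidate server by measured latency plus a load penalty. The `VoiceServer` type, the `pick_server` function, the weights, and the server names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceServer:
    name: str
    latency_ms: float  # measured round-trip time from the user
    load: float        # fraction of capacity in use, 0.0 to 1.0


def pick_server(
    servers: List[VoiceServer],
    max_load: float = 0.9,
    load_penalty_ms: float = 100.0,
) -> Optional[VoiceServer]:
    """Pick the server with the lowest latency-plus-load score.

    Servers at or above max_load are skipped. The score adds a penalty
    proportional to load, so a slightly farther but idle server can win
    over a nearby one that is almost full.
    """
    candidates = [s for s in servers if s.load < max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.latency_ms + load_penalty_ms * s.load)


servers = [
    VoiceServer("fra-1", latency_ms=18, load=0.85),  # score 103
    VoiceServer("ams-1", latency_ms=24, load=0.30),  # score 54
    VoiceServer("lhr-1", latency_ms=12, load=0.95),  # skipped: over max_load
]
best = pick_server(servers)
print(best.name)  # ams-1
```

Returning `None` when every server is saturated leaves the fallback decision (for example, failing over to another region) to the caller.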

3. Capacity Planning

Challenge: Predicting growth and seasonal spikes
Solution: Real-time analytics and automatic scaling triggers
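
On bare metal, "scaling" means ordering and racking hardware ahead of demand, so a trigger is usually a headroom check against projected peak load. The sketch below shows one way such a check could look; the function names, the 30% headroom default, and the users-per-server figure are hypothetical and not taken from Discord.

```python
def servers_needed(
    peak_users: int, users_per_server: int, headroom_percent: int = 30
) -> int:
    """Servers required to carry peak_users plus spare headroom."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    required_capacity = peak_users * (100 + headroom_percent)
    per_server = users_per_server * 100
    # Integer ceiling division avoids floating-point rounding surprises.
    return (required_capacity + per_server - 1) // per_server


def should_provision(
    current_servers: int,
    peak_users: int,
    users_per_server: int,
    headroom_percent: int = 30,
) -> bool:
    """True when the current fleet cannot cover projected peak plus headroom."""
    needed = servers_needed(peak_users, users_per_server, headroom_percent)
    return current_servers < needed


# Hypothetical input: 4M peak users at an assumed 10,000 users per server.
print(servers_needed(4_000_000, 10_000))  # 520
```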

4. Monitoring at Scale

Challenge: Tracking metrics across distributed infrastructure
Solution: Centralized monitoring with Prometheus and custom dashboards

Engineering Insights

Why Voice is Different

Voice services have unique characteristics that make them ideal for bare-metal:

  1. Predictable CPU usage - easier to capacity plan
  2. High network throughput - cloud egress fees hurt
  3. Latency sensitive - dedicated hardware helps
  4. Stateful connections - long-lived sessions keep servers occupied, so per-instance cloud pricing is inefficient

Hardware Optimization

Discord's team optimized hardware choices:

  • AMD EPYC processors for better price/performance on audio
  • High-frequency RAM for WebRTC workloads
  • Low-latency NICs for real-time communication
  • Local NVMe for session caching

Software Optimization

Moving to bare-metal enabled new optimizations:

  • Direct access to network hardware
  • Custom kernel tuning for WebRTC
  • NUMA-aware scheduling
  • Specialized audio codec configurations

Impact on Users

User Experience Improvements

  • Faster connection to voice channels
  • Clearer audio quality with fewer drops
  • Lower latency in conversations
  • More reliable service overall

Scale Achievements

Discord now handles:

  • 4M+ concurrent voice users at peak
  • Billions of messages daily
  • Hundreds of servers globally
  • Sub-40ms latency in most regions

Lessons Learned

1. Voice/Video is Expensive in Cloud

Real-time media services have economics that heavily favor owned infrastructure at scale.

2. Hardware Specialization Matters

Choosing processors and other hardware specifically for your workload can yield substantial performance gains; in this case, average latency dropped 35%.

3. Migration Can Be Low-Risk

Gradual rollout and maintaining cloud backup made migration safe and reversible.

4. Expertise is Essential

Success required deep understanding of networking, WebRTC, and systems engineering.

5. Monitoring is Critical

Comprehensive metrics enabled confident migration and ongoing optimization.

Community Response

Discord's engineering team shared their journey through:

  • Technical blog posts
  • Conference presentations
  • Open-source tools
  • Community discussions

The developer community praised their transparency and technical depth, inspiring similar migrations at other real-time service companies.

Future Plans

Discord continues to optimize their infrastructure:

  • Exploring newer CPU architectures
  • Improving routing algorithms
  • Expanding to more regions
  • Sharing more learnings publicly

Conclusion

Discord's migration shows that, even for demanding real-time services, bare-metal infrastructure can deliver better economics and performance than cloud solutions.

Their success demonstrates that with proper planning, expertise, and execution, companies can achieve massive cost savings while actually improving service quality.

Key Metrics

💰 70% cost reduction
35% lower latency
👥 4M+ concurrent voice users
🌍 15 regions globally
📈 99.99% uptime achieved

