1. The Final Bottleneck: When the "Write Engine" Runs Dry
We’ve used Replication to handle millions of Reads and PITR to drive RPO toward zero. However, there is a brutal physical limit that a Single-leader architecture cannot overcome: every single Write command must pass through one server.
When a product scales up and hits a threshold of, say, 10,000 Write QPS (Queries Per Second)—think Grab during peak hours or an e-wallet during a Flash Sale—the Primary DB’s CPU and Disk IOPS will hit a 100% ceiling. At this stage, throwing money at a bigger server (Vertical Scaling) is no longer viable, because even the world's best hardware has its limits.
The ultimate solution for Big Tech is Database Sharding: breaking a massive data block into smaller pieces (Shards) and distributing them across multiple independent server clusters.
The PM Trade-off: Sharding allows a system to scale capacity and Write QPS almost infinitely by adding more servers. However, the price is a 10x increase in operational complexity.
2. Life or Death in the Shard Key: The Hotspot Disaster
Sharding doesn't work automatically. The system (Routing Layer) needs a rule to know: "Which shard should this record be stored in?" That rule is determined by the Shard Key. Choose the wrong Shard Key, and the entire architecture collapses.
Scenario: Building a transaction history system for an e-wallet (100M rows/day).
The Mistake (Vague/Bad): Choosing Created_At (Timestamp) as the Shard Key, creating a new Shard for each month.
The Consequence (Hotspot): 100% of the Write traffic for the current month will flood the single Shard for that month. This shard "burns up" while previous months' shards sit idle. You’ve paid for 10x more servers, but your Write bottleneck remains unchanged.
Best Practice: Even Distribution (User ID Hash-based)
Practical Decision (Specific/Good): Using a hash function: Hash(User_ID) % Number_of_Shards.
Business Impact: The Hash algorithm randomly and evenly distributes users across shards. The Write load is perfectly balanced. More importantly, it ensures Data Locality: a user's entire history remains within a single shard, making "View History" operations extremely fast.
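The routing rule above fits in a few lines. This is a minimal sketch: the shard count and the choice of MD5 are illustrative assumptions, not a prescription for any particular database.

```python
import hashlib

NUM_SHARDS = 8  # illustrative cluster size

def shard_for(user_id: str) -> int:
    """Route a record to a shard by hashing its User_ID.

    A stable hash (MD5 here, an arbitrary choice) guarantees the same
    user always lands on the same shard -- which is exactly what
    preserves Data Locality for "View History" queries.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All of one user's writes and reads go to a single shard:
assert shard_for("user-42") == shard_for("user-42")
```

Because the hash output is effectively uniform, ten thousand users spread across eight shards end up within a few percent of each other, which is the "perfectly balanced" Write load described above.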
3. The Dark Side of Sharding: Trading Speed for Operational Pain
A PM/TPM should never force a tech team to use Sharding unless the system is truly at a dead end, because its side effects can kill product velocity.
The Collapse of JOINs (Scatter-Gather): If the Users table is on Shard 1, but the Orders table is scattered across Shards 2 and 3, you can no longer use a classic JOIN. The application must send queries to all shards (Scatter), pull the data into the App server's RAM, and manually stitch them together (Gather). Complex reporting becomes dozens of times slower.
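The Scatter-Gather pattern can be sketched as below; the per-shard callables and the in-memory "stitch" step are stand-ins for real database connections.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, query, order_key):
    """Fan a query out to every shard (scatter), then merge the
    partial results in application memory (gather).

    `shards` is a list of callables, each simulating one shard's
    query endpoint. Note the cost: every shard does work for every
    query, and the app server pays for the final sort -- this is why
    cross-shard reporting is so much slower than a single-node JOIN.
    """
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard(query), shards))
    merged = [row for part in partials for row in part]
    return sorted(merged, key=order_key)
```

Even in this toy form you can see the problem: the work grows with the number of shards, not with the size of the answer.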
Loss of Distributed Transactions: If User A (Shard 1) sends $20 to User B (Shard 2), the deduction from Shard 1 and the addition to Shard 2 must happen simultaneously. Maintaining transactions across shards (Two-Phase Commit) is heavy and prone to "stuck money" errors. Engineers often have to rewrite logic using complex SAGA Patterns.
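The core idea of a SAGA is compensation instead of rollback. A minimal sketch, with the three steps passed in as callables (real implementations also persist saga state so a crash mid-flow can be recovered):

```python
def transfer_saga(debit, credit, refund):
    """Minimal SAGA sketch for a cross-shard transfer.

    `debit` runs on the sender's shard, `credit` on the receiver's.
    They cannot share one database transaction (they live on
    different shards), so if the credit step fails we execute the
    compensating action `refund` instead of a ROLLBACK.
    """
    debit()                 # step 1: local transaction on Shard 1
    try:
        credit()            # step 2: local transaction on Shard 2
    except Exception:
        refund()            # compensation: undo step 1
        raise
```

The "stuck money" failure mode is exactly what happens when step 1 commits but neither step 2 nor the compensation runs, which is why production sagas log every step durably.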
The Resharding Nightmare: When your 3 shards are full and you need to move to 5, your hash algorithm changes from % 3 to % 5. This means terabytes of data are suddenly at the "wrong address" and must be migrated. Moving data on a live system without downtime is a technical "wonder" that can take a Data Platform team months to execute.
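The scale of that migration is easy to underestimate. The toy count below (key range is illustrative) shows why plain `% N` routing is so punishing: changing the modulus relocates most keys, not just the overflow.

```python
def moved_fraction(num_keys, old_shards, new_shards):
    """Fraction of keys whose shard assignment changes when the
    routing modulus changes from old_shards to new_shards."""
    moved = sum(1 for k in range(num_keys)
                if k % old_shards != k % new_shards)
    return moved / num_keys

# Going from 3 shards to 5 relocates roughly 80% of all keys.
print(moved_fraction(1_000_000, 3, 5))
```

This is why many sharded stores route by consistent hashing or pre-split key ranges instead of a bare modulus: those schemes move only about `1 / new_shards` of the keys when capacity is added.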
4. Auditing Sharded Clusters with the PEUF Framework
Don't let your sharded system become a black box. Anticipate these edge cases:
[P] Permission / Product Logic (Feature Limits):
Risk: Product demands a "Real-time Global Leaderboard." With Sharding, sorting billions of rows across different shards every second is nearly impossible.
Solution: Compromise on the feature—make it "Updated every 5 minutes," then use an async flow to push aggregated data to a centralized Cache (Redis) for reading.
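The compromise can be sketched as a periodic job: scatter a cheap per-shard top-N, merge in memory, publish to a cache. The in-memory dict stands in for Redis, and the callables stand in for shard queries.

```python
import heapq

def refresh_leaderboard(shard_top_n, cache, n=10):
    """Merge each shard's local top-N into a global top-N and
    publish it to a cache for all reads.

    `shard_top_n` is a list of callables, one per shard, each
    returning [(user, score), ...]. The expensive global sort over
    billions of rows never happens; readers only ever hit the cache,
    which is at most `n * refresh_interval` stale.
    """
    merged = [row for shard in shard_top_n for row in shard(n)]
    cache["leaderboard"] = heapq.nlargest(n, merged, key=lambda r: r[1])
```

Run this from a scheduler every 5 minutes and the product gets its leaderboard without a single real-time cross-shard sort.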
[E] Extreme (The "Justin Bieber" Effect - Data Skew):
Risk: You hash by User_ID. Unfortunately, a KOL account (like Justin Bieber) has 100 million followers interacting every second. Even with a perfect hash, the shard containing that KOL's data will still crash due to Data Skew.
Solution: Design a "KOL Routing" mechanism. Move massive accounts out of the standard shard pool and onto dedicated infrastructure.
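One minimal way to implement "KOL Routing" is an explicit override table consulted before the hash. The account names, shard labels, and cluster name below are invented for illustration.

```python
import hashlib

NUM_SHARDS = 8
# Accounts large enough to melt a regular shard get pinned to
# dedicated infrastructure (values here are made-up labels).
KOL_OVERRIDES = {"justin_bieber": "kol-cluster-1"}

def route(user_id: str) -> str:
    """Check the override table first; fall back to hash routing.

    The override table is tiny (a handful of giant accounts), so the
    lookup is free, while everyone else keeps the even distribution
    of the standard Hash(User_ID) % NUM_SHARDS rule.
    """
    if user_id in KOL_OVERRIDES:
        return KOL_OVERRIDES[user_id]
    digest = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16)
    return f"shard-{digest % NUM_SHARDS}"
```

The design point: Data Skew is a property of your data, not your hash, so the fix is a data-aware exception list, not a better hash function.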
[U] Unavailability (Asynchronous Backups):
Risk: You back up Shard 1 at 12:00 and Shard 2 at 12:05. If the system crashes and you restore via PITR, that 5-minute gap creates an inconsistency: money was deducted on Shard 1 but never added on Shard 2.
Solution: The backup infrastructure for sharded clusters must take Distributed Snapshots that are consistent against a Global Clock, so every shard is restored to the exact same moment in time.
[F] Fraud (Crashing via Scatter-Gather):
Risk: Attackers create bots to search using keywords that don't include the Shard Key (e.g., searching by Note_Content instead of User_ID). The router is forced to fan out the search to all 50 shards at once, exhausting the entire system's connection pool.
Solution: Set extremely strict Rate Limits for "Cross-shard Queries" or synchronize searchable fields to a system like Elasticsearch.
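The rate limit in that solution can be sketched as a token bucket that applies only to queries missing the Shard Key; the capacity and refill rate below are arbitrary assumptions.

```python
import time

class CrossShardLimiter:
    """Token bucket that throttles scatter-gather (cross-shard) queries.

    Queries that carry the Shard Key route to one shard and bypass
    the bucket entirely; only fan-out queries spend tokens. A bot
    spamming Note_Content searches is therefore cut off long before
    it can drain the connection pools of all 50 shards.
    """
    def __init__(self, capacity=5, refill_per_sec=1.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self, has_shard_key: bool) -> bool:
        if has_shard_key:
            return True  # single-shard query: cheap, never throttled
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In production this bucket would be keyed per client (per API key or IP), but the asymmetry is the point: targeted queries flow freely, fan-out queries are a scarce resource.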
5. Series Summary: Data Architecture Mindset
Through this 4-part series, we’ve built a Scaling Journey that every Senior PM/TPM should master:
HA & Monitoring: Identify SPOFs. Don't wait for a crash to fix it; use core metrics to trigger automation.
Replication (Single-leader): Separate Reads from Writes to scale query performance. Use "Read-your-own-writes" to mask Replication Lag.
Backup & PITR: Understand RPO/RTO. Never trust Replication for human-caused disasters; WAL/Binlog is your only true time machine.
Sharding: The ultimate weapon to break physical Write limits. Your choice of Shard Key (Locality vs. Hotspot) determines the operational fate of the product.
Building a tech product isn't just about UI/UX. Mastering the foundations of data architecture is how you protect revenue, negotiate fairly with Engineering, and build a sustainable product at a scale of millions of users.
Don't let the illusion of Replication cost you all your data after a single wrong command: replicas faithfully copy mistakes, too. Master RPO, RTO, and PITR mechanisms to design a battle-tested Disaster Recovery strategy.