The Data Massagist by Pablo Junco

Scaling Databases Is Easy. Architecture at 100TB Is Not.

April 24, 2026 · 13 min read
MS Fabric MS SQL
This content is mirrored from LinkedIn and may contain formatting inconsistencies. For the full experience — including comments and reactions — read the original on LinkedIn.

Created on 2026-04-24 02:59

Published on 2026-04-24 05:47

Every major cloud provider can take you to 100TB and beyond. That milestone is no longer a differentiator—it’s table stakes. What separates leading organizations is how they get there.

The real question is not scale. It is architecture.

For CIOs and CDOs, this is less a technical choice and more a business decision shaped by:

  • Time-to-market

  • Transformation risk

  • Cost of operations

  • Readiness for AI

The most effective organizations scale without rewriting applications, without adding unnecessary complexity, and without accumulating long-term technical debt.

Three Models of Scale, Only One Fits Your Workload

Wearing my CTO hat, I see that scaling is not just about capacity—it’s about recognizing the architectural patterns that emerge. In my experience, most enterprise architectures ultimately converge into three distinct models:

  • Model 1: Scale by Abstraction — a single logical database that grows transparently

  • Model 2: Scale by Distribution — partitioned systems designed for global scale

  • Model 3: Scale for Analytics — platforms built for data, not transactions

Understanding where each model applies is critical to avoiding costly missteps.

Model 1: Scale by Abstraction (Low Effort, High Impact)

In my opinion, this is the fastest path to scale for operational systems.

  • Single logical database

  • No application redesign required

  • No need for partitioning strategies

  • Storage scales transparently

Business impact:

  • Low-risk modernization

  • Faster time-to-value

  • Predictable operations

This model is ideal when the goal is to scale existing systems without disrupting the business.

Model 2: Scale by Distribution (High Effort, Maximum Scale)

This model is designed for global, internet-scale applications—but it comes at a cost.

  • Data is partitioned across nodes

  • Requires a well-defined partitioning strategy

  • Applications must become “scale-aware”

Business impact:

  • Enables massive global scale

  • Introduces engineering complexity

  • Requires long-term architectural commitment

This is not a lift-and-shift model. It is a redesign.

Model 3: Scale for Analytics (Different Problem, Different System)

This is where many organizations get it wrong.

Data platforms are not operational databases—they are built for a different purpose:

  • Petabyte-scale storage

  • Support for structured and unstructured data

  • Optimized for analytics, AI, and BI

Designed for:

  • Data lakes and lakehouses

  • Enterprise analytics

  • AI and machine learning pipelines

Not designed for:

  • Transactional workloads

  • ACID-compliant application systems

  • Real-time operational processing

The distinction matters: this is scale for insight, not for operations.

The Microsoft Perspective: A Portfolio Approach

As a Principal Solution Engineer for Data Platform at Microsoft, my scope goes beyond analytics to include the full portfolio of Azure-managed databases. From the Microsoft SQL family to open-source engines like Azure-managed PostgreSQL, MySQL, and MariaDB, I see firsthand how comprehensive—and competitive—this portfolio is in today’s market.

Databases on Microsoft Azure

Over the years, I have learned that no single system solves all scaling challenges. The most effective strategies combine all three models:

  • Model 1 (Operational scale): Azure SQL Hyperscale, Azure HorizonDB

  • Model 2 (Global distribution): Azure Cosmos DB

  • Model 3 (Analytics and AI): Microsoft Fabric, ADLS, Azure Synapse

Each serves a different purpose—and together, they form a complete data strategy.

Model 1 (Operational Scale)

You may have noticed I introduced two newer additions to the Microsoft portfolio: Azure SQL Database Hyperscale and Azure HorizonDB. Both represent Model 1—but with two distinct paths and outcomes. While they share a common foundation, they are designed to solve different business problems. Let’s take a closer look.

Azure SQL Hyperscale — System of Record

Azure SQL Database Hyperscale was highlighted by Priya Sathy at FabCon Atlanta as the foundation for large-scale operational workloads in the modern data platform. Built on the SQL Server engine, it delivers high availability by design—even in the face of infrastructure failures.

At first glance, this may seem familiar. Hyperscale has been generally available since May 2019, initially supporting up to 100 TB—and expanded to 128 TB in November 2024. So, what’s actually new?

The real innovation is not just scale. For me, it’s how that scale is delivered.

Hyperscale removes many of the traditional constraints of cloud databases by decoupling compute and storage, enabling true elasticity without architectural disruption. With support for large compute configurations—now extending to 160 and 192 vCores (in public preview)—it can handle high-throughput workloads without requiring T‑SQL rewrites or application redesign.

Key capabilities:

  • Scale existing SQL workloads to 100+ TB seamlessly

  • No application or schema redesign required

  • Full SQL Server compatibility with ACID guarantees and built-in governance

Business impact:

  • Protects existing investments (no engine change, no refactoring)

  • Minimizes migration risk and accelerates time-to-value

  • Enables modernization through: automatic storage growth up to 128 TB, independent scaling of compute and storage, and fast backups with near-instant restore capabilities

Therefore, it’s about enabling enterprise-scale OLTP and translytical workloads with minimal friction, while preserving the SQL Server experience organizations already trust.

Azure HorizonDB — System of Innovation

Azure HorizonDB, by contrast, is a new PostgreSQL database service designed to power mission-critical applications at any scale, turning data from a passive asset into a true competitive advantage. It offers seamless integration within the Microsoft ecosystem, easy access to advanced AI and analytics services, and helps eliminate the burden of complex integration work.

Azure HorizonDB was announced on November 18, 2025 as a private preview, and I do expect it to go to public preview around June 2026—likely during the upcoming Microsoft Build event—as we are already seeing strong customer interest driven by demand for a next‑generation hyperscale PostgreSQL platform.

What we are hearing consistently from customers is a clear need to move beyond the current generation of managed PostgreSQL services. While solutions like AWS Aurora and Google AlloyDB have brought important innovations, they still present limitations in areas such as true hyperscale elasticity, deep AI integration, and seamless alignment with end‑to‑end data platforms.

Azure HorizonDB is being designed specifically to address these gaps. It introduces a cloud‑native architecture with independent scale‑out compute and storage, optimized for modern workloads—including AI‑driven applications—along with native integration into services like Microsoft Fabric and Azure AI. This enables scenarios such as real-time analytics, operational workloads, and AI pipelines to be connected without the complexity of moving or duplicating data.

From a customer perspective, this represents a shift from “managed PostgreSQL” to a fully integrated, AI‑first, hyperscale data platform, capable of supporting both transactional and intelligent application workloads at enterprise scale.

From a business value perspective, it enables AI-first architectures, accelerates innovation cycles, and simplifies the integration of AI capabilities into modern applications, allowing organizations to move faster from experimentation to production.

In summary, Azure HorizonDB enables:

  • GenAI applications (RAG, copilots, semantic search)

  • Intelligent SaaS (personalization, recommendations)

  • Cloud-native microservices

  • Real-time AI-driven applications
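The retrieval step behind RAG and semantic search can be sketched in plain Python. This is a toy illustration of cosine-similarity ranking over embedding vectors, which is the operation a pgvector-style index inside a PostgreSQL service would accelerate at scale. The document names and three-dimensional vectors are invented for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embedding store": in a pgvector-backed table this would be a vector
# column with an index; here it is just an in-memory dict (illustrative data).
documents = {
    "returns-policy": [0.9, 0.1, 0.0],
    "shipping-faq":   [0.2, 0.8, 0.1],
    "api-reference":  [0.1, 0.1, 0.9],
}

def semantic_search(query_vec, k=2):
    """Rank documents by similarity to the query embedding (the RAG retrieval step)."""
    ranked = sorted(documents.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(semantic_search([0.85, 0.15, 0.05]))  # most similar documents first
```

In a real deployment the vectors live in a table and the database index prunes the scan; the ranking logic stays the same.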

Model 2: Scale by Distribution

This model is fundamentally about global distribution at scale. Unlike Model 1, where the focus is on abstraction over a single logical database, this model assumes from the start that data is spread across regions, partitions, and nodes—and that applications must be designed accordingly.

In Microsoft’s portfolio, the reference implementation of this model is Azure Cosmos DB.

Azure Cosmos DB is designed for scenarios where:

  • Data must be globally distributed with low-latency access

  • Applications require elastic, virtually unlimited scale

  • Consistency levels can be tuned per workload

  • High availability is expected by design, not as an add-on

At its core, Azure Cosmos DB introduces a shift in responsibility. The platform handles global replication, partitioning, and failover, but the application must be partition-aware.

This means:

  • Choosing a correct partition key becomes a critical architectural decision

  • Data modeling must align with access patterns from day one

  • Cross-partition queries are possible but intentionally constrained for scale
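A toy model makes the trade-off concrete. The sketch below hash-routes items to a fixed set of partitions by a partition key (a hypothetical tenantId field; the hashing and partition count are simplified stand-ins, not Cosmos DB's actual algorithm). A query that supplies the partition key touches one partition; a query that does not must fan out to all of them.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministically map a partition-key value to a physical partition."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

partitions = {i: [] for i in range(NUM_PARTITIONS)}

def insert(item: dict):
    """Route each item to its partition by the partition key (tenantId)."""
    partitions[partition_for(item["tenantId"])].append(item)

def query_by_tenant(tenant_id: str):
    """Single-partition query: only one partition is scanned."""
    target = partition_for(tenant_id)
    return [it for it in partitions[target] if it["tenantId"] == tenant_id], 1

def query_all(predicate):
    """Cross-partition query: every partition must be scanned (fan-out)."""
    results = [it for items in partitions.values() for it in items if predicate(it)]
    return results, NUM_PARTITIONS

for i in range(100):
    insert({"tenantId": f"tenant-{i % 10}", "order": i})

_, scanned = query_by_tenant("tenant-3")
print(f"partition-key query scanned {scanned} partition(s)")
_, scanned = query_all(lambda it: it["order"] > 90)
print(f"cross-partition query scanned {scanned} partition(s)")
```

This is why the partition key is an architectural decision: it determines which of your queries stay on the cheap single-partition path.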

From a business perspective, the value is clear:

  • Enables true global applications with predictable performance

  • Supports hyperscale ingestion and transaction workloads

  • Provides built-in multi-region resiliency

  • Reduces operational burden of managing distributed infrastructure

However, this scale comes with trade-offs. It requires a higher level of architectural discipline and a commitment to designing for distribution upfront. Unlike Model 1, where scale can often be achieved without changing the application, Model 2 demands that the application evolve with the data model.

In practice, Azure Cosmos DB is the right choice when:

  • You are building globally distributed SaaS platforms

  • Consistent latency across geographies is a hard requirement

  • You need elastic scale without operational bottlenecks

  • The application is designed cloud-native from the beginning

In summary, Model 2 is not just about scaling databases—it is about designing for distribution as a first principle.

Model 3: Analytics and AI (Not Operational Databases)

First, it is important to be precise—and a bit purist here: Azure Data Lake Storage (ADLS) Gen2, Azure Synapse Analytics, and Microsoft Fabric are not traditional databases. They are analytics data platforms, not transactional systems.

This is important because Model 3 is often misinterpreted as simply “a database at larger scale”—and it is not.

In essence, Model 3 represents a different kind of scale: petabyte-level storage, massive parallel processing, AI and machine learning enablement, and enterprise-wide data unification. It is not a system of record, not a transactional database, and not designed for application workloads. This is scale for analytics and AI, not for operations.

When used correctly, Model 3 enables global enterprise data platforms, AI training and feature engineering, cross-domain data integration, and both real-time and batch analytics at unprecedented volume. Ultimately, it is where data is transformed into intelligence—after it has been decoupled from operational systems.

Azure Data Lake Storage Gen2 (ADLS)

ADLS is the foundation layer of enterprise-scale data storage.

  • No practical account size limit (petabyte-scale and beyond)

  • Stores trillions of files across structured, semi-structured, and unstructured formats

  • Designed for extreme durability and scale

But the key point is architectural:

  • It is not a database

  • There is no transactional engine (no OLTP, no ACID semantics for applications)

  • It does not serve operational workloads

Instead, ADLS becomes the storage backbone for:

  • Microsoft Fabric OneLake

  • Azure Synapse Analytics

  • Azure Databricks and Spark-based processing

In simple terms: this is analytics-scale storage, not an operational database.
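One concrete consequence: on a data lake, "schema" often lives in the folder layout rather than in a database engine. The sketch below builds and parses a hive-style partitioned path (year=/month=/day=), a layout convention commonly used on ADLS; the zone and dataset names are illustrative.

```python
from datetime import date

def lake_path(zone: str, dataset: str, d: date) -> str:
    """Build a hive-style partitioned path, a common layout on analytics storage."""
    return f"{zone}/{dataset}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

def parse_partitions(path: str) -> dict:
    """Recover partition values from a path segment by segment."""
    return dict(part.split("=") for part in path.strip("/").split("/") if "=" in part)

p = lake_path("raw", "sales_orders", date(2026, 4, 24))
print(p)                    # raw/sales_orders/year=2026/month=04/day=24/
print(parse_partitions(p))  # partition values recovered from the path
```

Query engines such as Spark and Synapse use exactly this kind of path convention to prune partitions, which is part of why layout discipline matters more here than in a database.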

Azure Synapse Analytics

Azure Synapse sits on top of ADLS and extends it with distributed compute capabilities.

  • Built for large-scale analytics and data warehousing

  • Can query 100TB+ to petabyte-scale datasets

  • Uses massively parallel processing (MPP) engines for performance

However:

  • It is not an OLTP system

  • It is not designed for transactional applications

  • It does not replace operational databases

Azure Synapse is an analytics engine, not a system of record.
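The MPP idea itself is simple to illustrate. In the toy sketch below, each "node" computes a partial (count, sum) over its slice of the data and a coordinator combines the partials into the final average. Real engines distribute this across machines and handle data shuffles, but the scatter-gather shape is the same.

```python
def scatter(rows, num_nodes):
    """Distribute rows round-robin across compute nodes (toy data distribution)."""
    slices = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        slices[i % num_nodes].append(row)
    return slices

def partial_aggregate(rows):
    """Each node computes a local (count, sum) over its own slice."""
    return len(rows), sum(rows)

def gather(partials):
    """Coordinator combines the partials into the final average."""
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count

rows = list(range(1, 1001))  # stand-in for a large fact-table column
partials = [partial_aggregate(s) for s in scatter(rows, num_nodes=8)]
print(gather(partials))  # 500.5, the same answer a single-node scan would give
```

The point is that the aggregation decomposes: each node only ever sees its slice, so adding nodes scales the scan, which is what makes petabyte-level analytics tractable.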

Microsoft Fabric (Lakehouse + Unified Analytics Platform)

Microsoft Fabric brings ADLS and Synapse capabilities together into a unified analytics experience.

  • Unifies data engineering, data science, real-time analytics, and BI

  • Built on OneLake as a single logical data foundation

  • Enables end-to-end AI and analytics workflows

But again, it is critical to separate intent:

  • Fabric is designed for insight generation, not transaction processing

  • It operates at enterprise-scale analytics, not application-level consistency

A New Chapter: Microsoft Fabric Enters the Database Arena

Ah—and this is where things start to get interesting. Microsoft Fabric is no longer just an analytics platform. It is evolving into a broader data foundation with the introduction of database capabilities delivered as a SaaS experience.

Within Fabric, a new Databases engine brings together multiple paradigms into a unified model:

  • A full SQL database engine supporting transactional and analytical workloads

  • Native PostgreSQL capabilities for modern, cloud-native applications

  • A roadmap that includes HorizonDB, adding AI-native database capabilities to the platform

  • A unified experience that combines elements of traditional SQL databases and Cosmos DB within Fabric

This is not just an incremental addition. As stated by Shireesh Thota (CVP, Microsoft Azure Databases), Microsoft Fabric Databases is designed as a SaaS-native, serverless, and autonomous experience, simplifying provisioning, management, and scaling while maintaining enterprise-grade control.

Key capabilities include:

  • Serverless architecture for reduced operational overhead

  • Enterprise security for mission-critical workloads

  • Native AI integration, including vector data and RAG patterns

  • OneLake integration for a consistent, analytics-ready data layer

  • Unified billing model for simplified cost management at scale

  • Deep integration with tools such as VS Code, GitHub, and Azure OpenAI Service

  • Broad connectivity across existing systems, including SQL Server and Azure SQL databases

This matters because it starts to blur the traditional boundaries between operational databases and analytics platforms.

What was once a clear separation—Model 1 for operational systems, Model 2 for distributed systems, and Model 3 for analytics platforms—is now beginning to converge.

Fabric is positioning itself as a unified data layer where:

  • Data is stored, processed, and analyzed within a single ecosystem

  • AI capabilities are embedded directly into the data platform

  • Architectural fragmentation is significantly reduced

For leaders, the implications are strategic:

  • Fewer systems to integrate and operate

  • Faster movement from data to insight to action

  • A more coherent and scalable path to AI adoption

We are entering a phase where the question is no longer simply which database or platform to choose, but how to simplify the overall data architecture without sacrificing the specialization required for each workload.

By the way, Microsoft is doing a strong job of anticipating this convergence early, but it is not alone in shaping the direction of the market. Databricks is clearly following a similar trajectory with the announcement of Lakebase, a new operational database designed for AI agents and modern applications. Lakebase, also available in Azure Databricks, integrates PostgreSQL directly with the lakehouse, aiming to unify operational workloads with analytics and AI in a single, coherent architecture.

Final Insight & Executive Takeaway

At scale, architecture is strategy. The question is not whether you can reach 100TB—it’s whether you are choosing the right model to get there without adding unnecessary risk or complexity.

  • Model 1 powers operational systems with speed and simplicity

  • Model 2 enables globally distributed, internet-scale applications

  • Model 3 unlocks analytics and AI across the enterprise

Most organizations will require all three. The competitive advantage comes from aligning each workload to the right model—deliberately, not by default.

For many enterprises, the fastest and lowest-risk path forward starts with Model 1. If your priority is to scale existing SQL workloads without redesign, it is worth seriously evaluating Azure SQL Database Hyperscale—especially if your requirements include:

  • High compute density (e.g., up to 192 vCores)

  • Named replicas for read scale and isolation

  • Columnstore for real-time analytics

  • Integrated vector search and AI scoring

  • Built-in monitoring and governance

  • Enterprise-grade availability (99.995% SLA)

All delivered within a single database, using the same SQL model your teams already know.

That combination is not just about scale—it is about removing friction. It allows organizations to modernize faster, reduce architectural sprawl, and introduce AI capabilities without replatforming.

The broader takeaway is simple:

  • Azure SQL DB Hyperscale protects your past while extending it

  • AI-native databases like Azure HorizonDB build your future

  • Data platforms like Microsoft Fabric unlock enterprise-wide value

What we are seeing across the industry is a broader structural shift. The traditional separation between operational databases and data platforms is being actively dismantled. Microsoft is driving this convergence through Microsoft Fabric’s emerging database layer, bringing SQL, PostgreSQL, Cosmos-like capabilities, and AI-native features into a unified SaaS experience. At the same time, Databricks is extending the lakehouse model into operational territory with Azure Databricks Lakebase.

Not all 100TB solutions are created equal. The winners are those who scale with clarity, not complexity.


Let’s talk!
Let's have cafecito together.

If you’re a Chief Data Officer (CDO), a data leader, or simply someone who believes in the power of preparing data for AI—you’re already a Data Massagist.

Whether you have an idea, a challenge, or just want a fresh perspective, let’s connect. I’m always open to collaborating, learning, and helping others move forward.

You can find me on LinkedIn (feel free to connect and send me a message), or book time with me directly for a virtual coffee (or "cafecito").