A Valkey cluster has properties: memory, replicas, TLS settings, failover behavior. AWS ElastiCache
and Kubernetes Valkey operators describe the same thing using different representations.
AWS measures memory in node types (cache.t3.medium). Kubernetes
measures it in resource limits (3Gi). Same property, different representations.
AWS has multi_az_enabled. Kubernetes
doesn't think in availability zones. Instead, it has a different high-availability model based on pod
anti-affinity and topology spread constraints.
Mapping coherently between these different configuration spaces is a genuinely interesting problem, and it's the one this post digs into.
This post describes a particular subset of internals of a cross-cloud infrastructure compiler: the component that transforms infrastructure definitions from one cloud platform to another while preserving service semantics.
Throughout this post, I'll use a Valkey cache cluster as the running example. Valkey is the open-source fork of Redis, and it's a good example because it has enough configuration surface (memory, replication, TLS, auth, failover) to exercise every kind of transform without drowning in detail.
When I show transformations, I'm taking an AWS ElastiCache Valkey cluster defined in Terraform and compiling it to run on Kubernetes via a Valkey Operator, preserving memory, replication, TLS, and authentication semantics while changing every underlying resource.
Here's what that actually looks like:
resource "aws_elasticache_replication_group" "valkey" {
replication_group_id = "myapp-cache"
engine = "valkey"
node_type = "cache.t3.medium"
num_cache_clusters = 3
transit_encryption_enabled = true
auth_token = var.valkey_auth_token
automatic_failover_enabled = true
multi_az_enabled = true
}
apiVersion: rds.valkey.buf.red/v1alpha1
kind: Valkey
metadata:
name: myapp-cache
spec:
version: "8.0"
arch: cluster # native failover
replicas:
shards: 3
replicasOfShard: 1
resources:
limits:
memory: 3Gi
access:
enableTLS: true
node_type → resources.limits.memory (field transform).
num_cache_clusters: 3 → shards: 3 (structural: cluster mode uses sharding).
multi_az_enabled → no equivalent (lossy: K8s uses pod anti-affinity instead).
auth_token → ACL User resource (reference transform: value never in output).
After building transformations for dozens of services, a pattern emerged. There are two fundamentally different kinds of transforms:
Field transforms are value conversions: cache.t3.medium → 4096 → 4Gi.† Same data structure,
different encoding. This includes configuration values, naming schemes (ARN → K8s
namespace/name), and addressing formats.
Structural transforms are the tricky ones. An RDS cluster's instances[...] array
is a list of objects, each with its own instance_class and promotion_tier.
CloudNativePG represents the same concept as a single integer: spec.instances: 3.
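To make the field/structural distinction concrete, here's a minimal Kotlin sketch of a field transform, assuming a hypothetical node-type lookup table (the table contents and function names are illustrative, not the compiler's actual API):

// Hypothetical field transform: AWS node type -> canonical MB -> K8s quantity.
// The node-type table is illustrative, not exhaustive.
val nodeTypeMemoryMb = mapOf(
    "cache.t3.micro" to 512,
    "cache.t3.small" to 1536,
    "cache.t3.medium" to 4096,
    "cache.m5.large" to 6144,
)

fun nodeTypeToMemoryMb(nodeType: String): Int =
    nodeTypeMemoryMb[nodeType] ?: error("Unknown node type: $nodeType")

fun memoryMbToK8sQuantity(mb: Int): String =
    if (mb % 1024 == 0) "${mb / 1024}Gi" else "${mb}Mi"

fun main() {
    val mb = nodeTypeToMemoryMb("cache.t3.medium")   // 4096
    println(memoryMbToK8sQuantity(mb))               // "4Gi"
}

The shape (one memory value) survives; only the encoding changes.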
The problem: how do you take an AWS ElastiCache Valkey cluster and run it on a customer's private Kubernetes via a Valkey Operator, preserving replication, TLS, and auth while changing every underlying resource?
The naive approach is templating: write Terraform for AWS, write different Terraform for GCP, maintain both forever. This doesn't scale. N clouds times M services means N×M maintenance, and it grows with every feature you add.
The compiler approach works differently: parse infrastructure into a typed intermediate representation, transform through cloud-agnostic dialects, emit to any target.†
This is similar to how LLVM works for programming languages: a common IR that different frontends target and different backends consume. But the closer analogy is actually MLIR, which I'll discuss shortly.
The LLVM analogy is instructive. Before LLVM, every language needed a separate backend for every CPU: C×ARM, C×x86, Rust×ARM, Rust×x86. N×M combinations. LLVM introduced a common IR; now languages compile to LLVM IR, LLVM compiles to any architecture. O(N+M) instead of O(N×M).
Infrastructure has the same structure. AWS Terraform, GCP Terraform, K8s manifests, Helm charts: these are all "source languages." Different clouds and operators are "target architectures." A common IR factors out the complexity.
MLIR is the closer analogy. MLIR was designed for the same architectural challenge: multiple levels of abstraction that need to interoperate. It has "dialects," domain-specific IRs for TensorFlow ops, linear algebra, GPU kernels. That pattern inspired ours.†
MLIR uses multiple specialized IRs (dialects) with well-defined transformations between them, rather than one universal IR. Our service dialects (Valkey, PostgreSQL, etc.) borrow this organizational pattern.
MLIR's "progressive lowering" also shaped the architecture. You don't transform directly from TensorFlow to machine code; you lower through intermediate dialects: TensorFlow → Linalg → Affine → LLVM → machine code. Each step is a smaller, more tractable transformation. Our raise/lower pattern works the same way: AWS → Dialect → K8s, with the dialect as the intermediate level.
There's an important difference though. MLIR transformations are provably correct; formal semantics guarantee that lowering preserves computation. Infrastructure compilation can't make that claim. ElastiCache and a K8s Valkey Operator are semantically similar (both provide a Valkey cluster) but they're not identical in features, configuration, or edge behavior. Our transformations are heuristic. We use the dialect pattern but we don't get MLIR's correctness guarantees.
Preserving semantics across fundamentally different implementations is the hard part, not syntax translation.
The foundation is STIR (STack Intermediate Representation), a graph-based IR where nodes are infrastructure constructs and edges are relationships. I covered STIR in detail in How Tensor9 Models Your Stack; the short version: infrastructure is a graph, not a tree, so the IR should be too.
Lifters parse source formats (Terraform HCL, Helm charts, K8s manifests) into the graph; emitters walk the graph and generate target formats. The correctness test is round-trip fidelity: lift source into STIR, emit back to source, compare.† If they match, the lifter and emitter preserve structure correctly.
All compilation happens locally. No API calls to Tensor9 or cloud providers during transformation. The compiler runs entirely in your CI pipeline or on your laptop.
Important caveat: round-trip tests validate structural preservation, not behavioral equivalence. Whether compiled infrastructure performs identically under load is a different question. ElastiCache and self-hosted Valkey have different failover timing, different connection handling, different memory management. Behavioral testing happens through integration tests and staged rollouts.†
We provide a behavioral test harness that runs the same workload against source and target deployments, comparing latency distributions, failover timing, and error rates. It won't catch everything, but it surfaces the big differences before production.
STIR can represent any infrastructure format. But to actually compile ElastiCache to a K8s Valkey Operator, you need something higher-level: an abstraction that captures the service's semantics independent of how it's implemented. We call these dialects.
A dialect is a cloud-agnostic IR for a service category. The Valkey dialect captures what a Valkey cluster needs (replicas, TLS, auth, resources) without caring whether it's backed by ElastiCache or a K8s operator.
The compiler has dialects for: Containers (EKS/GKE/AKS/Kubernetes), Functions (Lambda/Cloud Functions/Knative), PostgreSQL, MySQL, MongoDB, Caching (ElastiCache/Memorystore/Redis), Message Streaming (Kafka/MSK/Event Hubs), Object Storage (S3/GCS/MinIO), Search (OpenSearch), Load Balancers, IAM, DNS, and Networking.† Each captures the essential semantics of its service category.
Full list here: Service Equivalents Registry
Fair question: who maintains 12+ dialects across multiple providers? This is a real scaling challenge; each AWS release, each Helm chart update, each new cloud provider creates work. Our bet: dialects capture stable semantics (a cache needs memory, replicas, TLS), not provider-specific implementation details. Those details live in raisers and lowerers, which are smaller and more mechanical. The maintenance burden is real; I've traded architectural complexity for reduced operational complexity. Whether that tradeoff holds depends on scale.†
Custom dialects are supported for internal services. You define the schema, write a raiser from your source format, and a lowerer to your target. The compiler handles the rest. It's not trivial, but it's documented.
Dialects sit between cloud-specific implementations:
AwsValkeyRaiser takes an aws_elasticache_replication_group and creates a ValkeyCluster on the Valkey dialect.
K8sValkeyLowerer takes a ValkeyCluster and creates a kubernetes_manifest for the Valkey Operator CRD.
Note that raisers delete the original resources. After raising, the dialect surface is the source of truth. Lowerers reconstruct cloud-specific resources from the dialect representation, not from the original.†
This is intentional. The raiser isn't annotating the original; it's replacing it with a canonical form. Round-trip preservation happens via source map metadata, not by keeping the original around.
Let's look at the Valkey dialect in detail. A dialect has three components: a schema (the canonical fields), raisers (source format → dialect), and lowerers (dialect → target). Here's the ElastiCache source we're raising from:
resource "aws_elasticache_replication_group" "cache" {
replication_group_id = "myapp-cache"
engine = "valkey"
engine_version = "7.2"
node_type = "cache.t3.medium"
num_cache_clusters = 3
transit_encryption_enabled = true
auth_token = var.auth_token
}
Notice what's not in a dialect: AWS-specific concepts like node_type,
subnet_group_name, or parameter_group_name. And no K8s-specific
concepts like storageClassName or podAntiAffinity. The dialect
captures what a Valkey cluster is, not how any particular platform implements it.
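As a rough sketch (illustrative field names, not the actual schema), the Valkey dialect's surface might look like this; the conventions it follows are the design lessons described below:

// Illustrative sketch of a dialect surface, not the real schema.
data class ValkeyCluster(
    val name: String,
    val version: String,              // e.g. "8.0"
    val shards: Int,                  // 1 = non-sharded
    val replicasPerShard: Int,
    val memoryMb: Int,                // canonical unit, not a node type
    val cpuMillis: Int?,              // null = let the platform decide
    val tls: Boolean,
    val authEnabled: Boolean,
    val persistence: Boolean,
) {
    // Derived, never stored separately (see the design notes below).
    val totalNodes: Int get() = shards * (1 + replicasPerShard)
}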
Dialects also capture semantic relationships: shards and replicas are mutually constraining, auth implies TLS for secure configurations, persistence affects recovery behavior. These relationships matter during lowering. If you're targeting a platform that doesn't support sharding, the compiler needs to handle that gracefully.
After getting dialect design wrong several times, here's what we learned:
Use canonical units: memoryMb, not instanceType or resourcesLimitsMemory.
Don't store what you can derive: totalNodes can be computed from shards × (1 + replicasPerShard), so don't store it separately.
Distinguish unset from zero: cpuMillis: null means "let the platform decide" vs cpuMillis: 1000 meaning "explicitly request 1 core."
Bad dialects leak implementation details. If I find myself adding fields like
awsNodeType to a dialect, something is wrong.
Service configuration is a structured space where each field represents a property of the service.
Different providers use different representations for the same thing.
AWS expresses memory as a node_type string
(cache.t3.medium); K8s expresses it as resources.limits.memory
in bytes. Same property, different shape.
Two types of transforms require different strategies. The first are field transforms: value transforms where the shape is preserved. The same conceptual field exists in both systems, just expressed differently (like measuring temperature in Celsius versus Fahrenheit).
Same configuration, different representations
The second type are structural transforms: shape transforms where the semantics are preserved but the structure changes. An array collapses to a scalar, or a nested tree flattens; the information is preserved through a projection.
Structure preserved through dialect: three replicas in, three replicas out
Looking at the full picture, multiple fields transform together. Some are lossless, some normalized, some lossy:
The compiler models this explicitly as a configuration space: a collection of transform objects, each describing how one field transforms between representations. Not all fields transform the same way:
| Type | Meaning | Example |
|---|---|---|
| Lossless | Direct semantic equivalence. Perfect round-trip. | transit_encryption_enabled → tls → tls.enabled |
| Normalized | Different units, same information. Reversible with lookup. | node_type (cache.t3.medium) → memoryMb (4096) → memory (4Gi) |
| Lossless (aided) | Preserved for same-cloud via metadata. Lost in cross-cloud. | replication_group_id stored in source map, recovered for AWS→AWS |
| Lossy (no equivalent) | Dimension collapses in cross-cloud. No target equivalent. | multi_az_enabled: AWS concept, no K8s equivalent |
| Lossy (non-canonical) | Field exists only in one provider, no canonical mapping. | subnet_group_name: AWS VPC concept, no canonical mapping |
| Synthetic | No origin: value must be synthesized for target. | Aurora auto-scales storage, but CNPG needs explicit PVC size |
With these classifications, the compiler can report the fidelity of any source→target transformation. For AWS ElastiCache → Valkey Operator: roughly 50% of dimensions are lossless (TLS, auth, replicas), 25% are normalized (node type → memory), and 25% are lossy (multi-AZ, subnet groups, security groups).
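A sketch of how that fidelity breakdown could be computed from the classified transforms (the enum and the sample field list are illustrative, not the compiler's internal types):

// Illustrative fidelity summary over a (partial) classified configuration space.
enum class TransformKind { LOSSLESS, NORMALIZED, LOSSLESS_AIDED, LOSSY, SYNTHETIC }

data class FieldTransform(val sourceField: String, val kind: TransformKind)

fun fidelityReport(space: List<FieldTransform>): Map<TransformKind, Int> =
    space.groupingBy { it.kind }
        .eachCount()
        .mapValues { (_, n) -> n * 100 / space.size }   // percentage of dimensions

fun main() {
    val space = listOf(
        FieldTransform("transit_encryption_enabled", TransformKind.LOSSLESS),
        FieldTransform("num_cache_clusters", TransformKind.LOSSLESS),
        FieldTransform("node_type", TransformKind.NORMALIZED),
        FieldTransform("multi_az_enabled", TransformKind.LOSSY),
    )
    println(fidelityReport(space))  // {LOSSLESS=50, NORMALIZED=25, LOSSY=25}
}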
The lossy dimensions aren't failures; they're explicit reporting. multi_az_enabled
doesn't map to Kubernetes because Kubernetes has a fundamentally different high-availability
model. The compiler tells you this upfront, rather than silently ignoring the field or
making up a mapping that doesn't preserve semantics.†
The lossy fields are handled by resolvers, which we'll get into later.
The examples so far show AWS ElastiCache compiling to a single Kubernetes target. But Kubernetes has multiple Valkey implementations, each with different configuration spaces and operational characteristics. The compiler can target any of them, and the choice matters.
Consider two production-grade Valkey operators:
apiVersion: cache.cs.sap.com/v1alpha1
kind: Valkey
spec:
replicas: 3
sentinel:
enabled: true # external failover
resources:
limits:
memory: 3Gi
tls:
enabled: true
apiVersion: rds.valkey.buf.red/v1alpha1
kind: Valkey
spec:
arch: cluster # native failover
replicas:
shards: 3
replicasOfShard: 1
resources:
limits:
memory: 3Gi
access:
enableTLS: true
Same source (AWS ElastiCache with 3 nodes, 3GB memory, TLS). Same dialect (ValkeyCluster). Different targets with different configuration spaces and different operational characteristics.
| Dimension | Sentinel Mode | Cluster Mode |
|---|---|---|
| replicas | Single integer (total nodes) | Shards × replicas per shard |
| failover | External Sentinel quorum | Built-in gossip protocol |
| sharding | Not supported | 16384 hash slots across shards |
| scaling | Add replicas only | Add shards or replicas |
| client library | Sentinel-aware required | Cluster-aware required |
The configuration space differences translate to operational differences that the compiler can surface explicitly:
| Characteristic | Sentinel Mode | Cluster Mode |
|---|---|---|
| Failover time | 10–60 seconds | 1–6 seconds |
| Failure detection | Sentinel quorum (SDOWN→ODOWN)† | Node gossip (cluster-node-timeout) |
| Data model | All keys on primary | Keys sharded by hash slot |
| Multi-key operations | Always work | Same-slot only (or hash tags) |
| Horizontal scaling | Read replicas only | Add shards for write capacity |
These details matter operationally: failover timing affects outage duration, and cluster mode determines whether your application needs hash tags for multi-key operations. The compiler surfaces these differences in the compilation report.
The dialect (ValkeyCluster) captures what the service is: memory,
replicas, TLS, authentication. The lowerer chooses how it's implemented.
Different lowerers target different operators:
K8sValkeySentinelLowerer → Bitnami/SAP operator (Sentinel mode)
K8sValkeyClusterLowerer → chideat operator (Cluster mode)
The compilation report includes target-specific warnings:
Target: chideat::1.0.0::valkey
⚠ OPERATIONAL: Cluster mode requires cluster-aware client library
Your application must use a Redis/Valkey client that supports
CLUSTER SLOTS and automatic redirect handling.
⚠ OPERATIONAL: Multi-key operations require same hash slot
Commands like MGET, MSET, SUNION across keys will fail unless
keys share a hash slot. Consider hash tags: {user:123}:profile
✓ ADVANTAGE: Failover time ~1-6 seconds (vs 10-60s Sentinel)
Cluster nodes detect failures via gossip and self-promote.
No external Sentinel quorum required.
✓ ADVANTAGE: Horizontal write scaling via shards
Add shards to increase write throughput. Sentinel mode
can only add read replicas.
Target selection follows the same negotiation model as field-level configuration. Vendors express requirements; customers express constraints; the compiler finds the intersection or reports why it can't.
# Vendor tuning
valkey:
target:
preferred: [cluster, sentinel]
requirements:
min_failover_seconds: 10
# Customer config
valkey:
target:
allowed: [sentinel] # app not cluster-aware
constraints:
client_library: "jedis-3.x"
The compiler evaluates this negotiation: the vendor prefers cluster mode and requires failover within 10 seconds, but the customer only allows sentinel mode, whose typical failover is 10–60 seconds. No target satisfies both sides' constraints as stated.
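As a simplified sketch of that evaluation (hypothetical types; the real negotiation weighs more dimensions than failover time, and the conflict below is reported as a warning rather than a hard failure):

// Hypothetical target negotiation: intersect vendor preference with customer allowance,
// then check the vendor's failover requirement against the candidate's envelope.
data class TargetProfile(val mode: String, val typicalFailoverSeconds: IntRange)

val profiles = listOf(
    TargetProfile("cluster", 1..6),
    TargetProfile("sentinel", 10..60),
)

fun negotiateTarget(
    vendorPreferred: List<String>,
    customerAllowed: Set<String>,
    maxFailoverSeconds: Int,
): Result<TargetProfile> {
    val candidates = vendorPreferred.filter { it in customerAllowed }
    if (candidates.isEmpty()) return Result.failure(IllegalStateException("No allowed target"))
    val chosen = profiles.first { it.mode == candidates.first() }
    return if (chosen.typicalFailoverSeconds.last <= maxFailoverSeconds) Result.success(chosen)
    else Result.failure(IllegalStateException(
        "${chosen.mode} typical failover ${chosen.typicalFailoverSeconds}s exceeds ${maxFailoverSeconds}s"))
}

// negotiateTarget(listOf("cluster", "sentinel"), setOf("sentinel"), 10) -> failure,
// which the compiler surfaces as the target conflict shown below.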
This surfaces as a compilation issue, not a silent degradation:
⚠ TARGET CONFLICT: valkey.myapp-cache
Vendor requires: failover ≤ 10 seconds
Customer allows: sentinel mode only (client constraint)
Sentinel mode typical failover: 10-60 seconds
Options:
1. Customer upgrades to cluster-aware client (enables cluster mode)
2. Vendor relaxes failover requirement to 60 seconds
3. Customer accepts risk of longer failover window
Severity: WARNING (compilation can proceed with accepted risk)
Neither party makes this decision alone. The vendor can't force cluster mode on a customer whose application doesn't support it. The customer can't demand sub-10-second failover while constraining to sentinel mode. The compiler makes the trade-off explicit so both parties can negotiate with full information.
"Configuration space" includes the operational envelope of what's possible on each target, beyond just field mappings. The compiler surfaces this so both parties can make informed decisions.
When lowering from a dialect to a specific cloud, some fields can't be deterministically
derived. ElastiCache needs a node_type, but the Valkey dialect only has
memoryMb. Multiple node types have the same memory: which one?
Resolvers handle this. A resolver implements a resolution chain, a priority-ordered fallback strategy:
For same-cloud round-trips (AWS → Dialect → AWS), the raiser stores the original
node_type in source map metadata. The lowerer checks metadata first,
finds it, and emits the exact original value. Isomorphic.
For cross-cloud transformations (AWS → Dialect → K8s → Dialect → AWS), there's no metadata. The resolver falls through to Vendor config (if the vendor specified preferences), then customer config (if the customer has infrastructure constraints), then heuristics (pick a reasonable node type for the memory size).†
Secrets are handled differently: they're never stored in source map metadata or resolver state. Secret references (AWS Secrets Manager ARNs, K8s Secret names) transform through a separate subsystem that ensures credentials never appear in Terraform state files.
Every resolved value carries provenance: where it came from. This matters for debugging ("why did the compiler pick this node type?") and for compliance ("prove that customer constraints were respected").
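Here's a minimal Kotlin sketch of such a resolution chain with provenance attached, assuming hypothetical types and an illustrative node-type table:

// Illustrative resolver chain for ElastiCache node_type.
// Each source is tried in priority order; the first hit wins and records where it came from.
enum class Provenance { SOURCE_MAP, VENDOR_CONFIG, CUSTOMER_CONFIG, HEURISTIC }

data class Resolved<T>(val value: T, val provenance: Provenance)

fun resolveNodeType(
    sourceMapNodeType: String?,          // present only for same-cloud round-trips
    vendorPreference: String?,
    customerAllowed: List<String>,
    memoryMb: Int,
): Resolved<String> {
    sourceMapNodeType?.let { return Resolved(it, Provenance.SOURCE_MAP) }
    vendorPreference?.takeIf { it in customerAllowed }
        ?.let { return Resolved(it, Provenance.VENDOR_CONFIG) }
    customerAllowed.firstOrNull()
        ?.let { return Resolved(it, Provenance.CUSTOMER_CONFIG) }
    // Heuristic fallback: smallest node type with enough memory (table is illustrative).
    val byMemory = listOf("cache.t3.micro" to 512, "cache.t3.medium" to 4096, "cache.m5.large" to 6144)
    val pick = byMemory.firstOrNull { it.second >= memoryMb }?.first ?: byMemory.last().first
    return Resolved(pick, Provenance.HEURISTIC)
}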
A valid concern: if vendor preferences or customer constraints change between compilations, outputs differ. The compiler addresses this through resolution manifests, a record of every resolved value and its source. Re-running compilation with the same manifest produces identical output. The manifest also enables auditing: you can trace exactly why each field has its value.†
Escape hatch: you can edit the manifest to override any resolved value before re-compiling. The output Terraform is also human-readable and can be modified directly for emergency fixes, though you lose round-trip guarantees.
What happens when resolution fails? The compiler doesn't silently pick a default. It surfaces a structured error explaining the conflict: what was needed, what was allowed, and how to fix it.
For example: "Your customer config only allows t3.micro and t3.small, but the vendor needs 16GB of memory. Either allow larger node types or reduce the memory requirement." The user sees the conflict clearly and can make an informed decision.
The resolver chain reflects something important: vendors and customers have different concerns, and they're intentionally asymmetric.
The two parties provide different kinds of configuration: vendors specify what the application needs, customers specify what the infrastructure allows.
The customer provides the foundation that vendors must work within.
The customer is saying: "Here's my infrastructure and what I'm willing to accept."
The vendor tunes service behavior on top of the customer's base.
The vendor is saying: "Given what the customer allows, here's how I want to configure this."
For negotiable fields like node tier, both sides provide ranked preferences:
// Customer's acceptable tiers (in preference order)
@CustomerField
val nodeTierPreference: List<NodeTier> = listOf(Small, Medium) // no Large

// Vendor's preferred tiers (in preference order)
@VendorField
val nodeTierTuning: List<NodeTier> = listOf(Large, Medium, Small)
Resolution: pick the highest vendor preference that appears in the customer's allowed set. Here, the vendor wants Large → Medium → Small. The customer allows Small and Medium. Result: Medium.
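In code, that rule is a one-liner (hypothetical helper, shown in the same Kotlin style as the field definitions above):

// Pick the highest-ranked vendor preference that the customer allows; null means no overlap.
fun <T> resolvePreference(vendorRanked: List<T>, customerAllowed: List<T>): T? =
    vendorRanked.firstOrNull { it in customerAllowed }

// resolvePreference(listOf(Large, Medium, Small), listOf(Small, Medium)) == Medium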
This is not symmetric negotiation where both parties have equal say. The customer provides constraints (the ceiling); the vendor tunes within those constraints. Vendors cannot override customer constraints; they can only choose within them.†
This asymmetry is intentional. Customers own their infrastructure; vendors own their applications. Each has authority over their domain.
When the resolver needs a value, it checks each source in priority order: source map metadata, then vendor config, then customer config, then heuristics.
Vendor requirements establish a floor. Customer constraints establish a ceiling. The resolver finds the best value in the overlap, or fails if there is no overlap.†
For regulated environments, every resolution decision is logged with its source (metadata, vendor config, customer constraint, or heuristic). This audit trail lets compliance teams verify that transformations respected policy. Not covered here, but essential for SOC2/HIPAA.
Consider a vendor whose application needs 16GB memory for production workloads. The customer has cost constraints and only allows smaller instance types (up to 6GB).
The compiler detects this conflict: vendor requires 16GB minimum, customer's largest allowed option provides only 6GB. No overlap exists. Resolution fails, correctly. The compiler can't magically make 6GB work for a 16GB requirement. Instead, it surfaces the conflict clearly with actionable suggestions.
If the customer adds larger instance types to their allowed list, negotiation can succeed. The compiler picks the best option that satisfies both parties: the highest vendor preference that falls within customer constraints.
If the resulting choice is below the vendor's stated requirements (say, 13GB when vendor wanted 16GB), the compiler emits a warning. It proceeds (that's the customer's choice) but makes the trade-off visible.
What happens when you round-trip? This is really asking whether the compiler respects the original intent. A compiler that silently drops information without telling you is dangerous.
Let me formalize this with two different roundtrip scenarios, each with different expectations.
AWS → Dialect → AWS should produce an identical graph. This is the baseline requirement: if you're not changing clouds, the compiler shouldn't change your infrastructure. The raiser extracts all configuration into the dialect, stores cloud-specific details in source map metadata, and the lowerer reconstructs the original exactly.
This works because the source map preserves everything the dialect doesn't capture natively. The dialect knows about replicas, memory, and TLS, but it doesn't know about AWS-specific subnet groups or security group IDs. Those get stashed in metadata and restored on the way back.
Cross-cloud transformation is different. AWS → Dialect → K8s → Dialect → AWS is lossy. Some AWS-specific concepts don't exist in the operator, and vice versa. The configuration space contracts when you move to the intersection of what both platforms support, then expands again with platform-specific defaults on the way back.
Consider multi_az_enabled. AWS ElastiCache uses this to spread replicas
across availability zones for fault tolerance. Kubernetes doesn't have availability zones
in the same sense. Instead, it has node affinity, pod anti-affinity, and topology spread constraints.
These aren't equivalent concepts; they're different approaches to the same goal. The compiler
can't preserve multi_az_enabled because the target platform doesn't have that dimension.†
This is correct behavior. Cross-cloud transformation is inherently lossy. The compiler makes this explicit: you can query the configuration space to see exactly which dimensions are preserved and which require resolver intervention.†
Think of it like translating between languages. Some concepts don't translate directly. The compiler tells you what's lost, rather than pretending everything maps 1:1.
Everything described so far works without internet connectivity during compilation. The compiler runs locally: no API calls to Tensor9, no cloud provider APIs, no external services. Local-only compilation is the default architecture, not a special mode.
For fully disconnected environments (FedRAMP, defense, on-premises), the workflow extends to artifact distribution:
Release bundles package everything needed for deployment: container images (as Docker tars), Terraform configurations, and metadata. Vendors create bundles in their connected environment; customers apply them in their disconnected environment.
Terraform variables handle customer-specific configuration. The compiled
output references ${var.appliance_registry_uri} instead of hardcoded registry
URLs. Customers provide their registry URI, namespace, and other infrastructure details
when applying the bundle. Vendors don't need to know customer infrastructure.
Clean separation: vendors package and sign; customers verify and apply. The bundle is the interface between connected and disconnected worlds.
This architecture was designed with regulated and disconnected environments in mind. Artifact bundling, signing, and verification are covered in a separate post.
Now let's see how all these pieces fit together. The compiler processes infrastructure in four distinct phases, each with a clear responsibility. Understanding the pipeline helps explain why certain errors occur where they do, and why some transformations are possible while others aren't.
The Terraform lifter parses HCL and creates a StirGraph. Each resource becomes a node; each reference becomes an edge. The graph preserves the full structure of the original code (modules, variables, outputs, and all).
This phase is purely syntactic. The lifter doesn't know what an aws_elasticache_replication_group
means. It just knows it's a resource with certain attributes that reference other
resources. Errors here are parse errors: malformed HCL, invalid syntax, unresolved variables.
The raiser transforms cloud-specific resources to their dialect representation. This is
where semantic understanding enters the picture. The raiser knows that an ElastiCache
cluster with num_cache_clusters = 3 means "three replicas," and that this
maps to a replicas field in the Valkey dialect.
For an ElastiCache cluster, the raiser extracts semantic fields (replicas, TLS, auth), normalizes cloud-specific values (node type → memory in MB), and stores AWS-specific metadata in source map for potential round-trip recovery. Errors here are semantic: unsupported resource types, invalid configurations, missing required fields.
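A condensed sketch of this phase, reusing the ValkeyCluster and nodeTypeToMemoryMb sketches from earlier (hypothetical signatures; the real raiser handles far more fields and error cases, and the shard/replica split here is simplified):

// Hypothetical raiser sketch: ElastiCache resource attributes -> ValkeyCluster dialect node.
// AWS-only fields are stashed in source-map metadata for same-cloud round-trips.
fun raiseElastiCache(attrs: Map<String, Any?>, sourceMap: MutableMap<String, Any?>): ValkeyCluster {
    sourceMap["aws.node_type"] = attrs["node_type"]
    sourceMap["aws.subnet_group_name"] = attrs["subnet_group_name"]
    return ValkeyCluster(
        name = attrs["replication_group_id"] as String,
        version = attrs["engine_version"] as? String ?: "8.0",
        shards = (attrs["num_cache_clusters"] as Number).toInt(),   // structural mapping simplified
        replicasPerShard = 1,
        memoryMb = nodeTypeToMemoryMb(attrs["node_type"] as String),
        cpuMillis = null,
        tls = attrs["transit_encryption_enabled"] as? Boolean ?: false,
        authEnabled = attrs["auth_token"] != null,
        persistence = false,
    )
}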
The lowerer reads the dialect representation and creates target-specific resources. Unlike raising, lowering sometimes requires decisions that weren't explicit in the dialect (which Helm chart version? what namespace?). Resolvers handle those decisions.
For the Valkey Operator, the lowerer generates a kubernetes_manifest with the Valkey CRD.
Memory becomes resource limits, replicas map directly, TLS and auth
settings map to chart values. Errors here are typically configuration conflicts: the
dialect says one thing, but the target platform can't accommodate it.
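And the corresponding lowerer sketch, producing the operator CRD shape shown earlier (again hypothetical; anything the dialect doesn't determine, like the namespace, would go through resolvers):

// Hypothetical lowerer sketch: ValkeyCluster dialect node -> Valkey Operator CRD as a map,
// which the emitter then serializes into a kubernetes_manifest resource.
fun lowerToValkeyOperator(cluster: ValkeyCluster, namespace: String): Map<String, Any> = mapOf(
    "apiVersion" to "rds.valkey.buf.red/v1alpha1",
    "kind" to "Valkey",
    "metadata" to mapOf("name" to cluster.name, "namespace" to namespace),
    "spec" to mapOf(
        "version" to cluster.version,
        "arch" to "cluster",
        "replicas" to mapOf(
            "shards" to cluster.shards,
            "replicasOfShard" to cluster.replicasPerShard,
        ),
        "resources" to mapOf("limits" to mapOf("memory" to memoryMbToK8sQuantity(cluster.memoryMb))),
        "access" to mapOf("enableTLS" to cluster.tls),
    ),
)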
The emitter walks the StirGraph and generates valid Terraform HCL. This phase is again purely syntactic: it doesn't interpret meaning, just serializes the graph to text. The output can be applied to a Kubernetes cluster with the Helm provider configured.†
The emitted Terraform is formatted and commented for human readability. You can review, diff, and debug it like hand-written code. No minified blobs.
Errors here are rare. If the graph is well-formed, emission should succeed. When they do occur, they're usually bugs in the compiler itself rather than problems with the input infrastructure.
| Preserved | Transformed | Lost |
|---|---|---|
| 3 replicas; TLS enabled; auth enabled; ~13GB memory | node_type → resources.limits; auth_token → auth.password; automatic_failover → sentinel.enabled | multi_az_enabled; subnet_group_name; security_group_ids |
Compilation can fail for many reasons: missing required fields, incompatible configurations, features that don't exist on the target platform. The compiler surfaces these clearly rather than failing silently, because silent failures in infrastructure are the ones that wake you up at 3 AM.
Error handling follows the same approach as good compilers: collect as many issues as possible in a single pass, categorize them by severity, and provide actionable information about how to fix them. Nobody wants to fix one error, re-run, find another error, fix it, re-run, and repeat twenty times.
Not all issues are equal. Some prevent compilation entirely; others are warnings that you might choose to accept. The compiler distinguishes three levels: errors, which block compilation; warnings, which proceed once the risk is accepted; and informational notes, which record operational changes and advantages.
After compiling hundreds of real-world Terraform stacks, patterns emerge. Most issues fall into a few categories:
One category is lossy fields (multi_az_enabled has no Kubernetes equivalent). The compiler tells you what's lost and whether it affects your workload.
Issues are collected and presented together at the end. Many issues are related: fixing one might resolve others. Issues are deduplicated: if the same warning applies to 15 resources, it shows once with a count. The goal is a summary you can actually read, not a wall of repetitive text.
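A sketch of the deduplication idea (illustrative types, not the compiler's actual reporting model):

// Illustrative issue dedup: group identical messages, show each once with a count.
data class Issue(val severity: String, val message: String, val resource: String)

fun summarize(issues: List<Issue>): List<String> =
    issues.groupBy { it.severity to it.message }
        .map { (key, group) ->
            val (severity, message) = key
            val suffix = if (group.size > 1) " (${group.size} resources)" else " (${group.first().resource})"
            "$severity: $message$suffix"
        }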
The Valkey examples throughout this post are clean, but caches are relatively simple. Let me walk through three more dialects that handle messier transformations: RabbitMQ, Kafka, and PostgreSQL. Each shows the same patterns (dialects, transforms, resolvers, negotiation) applied to services with more complexity.
RabbitMQ adds another dimension: messaging topology. The dialect captures cluster configuration (nodes, memory, storage), but also virtual hosts, users, and HA policies that define message durability and replication.
resource "aws_mq_broker" "rabbitmq" {
broker_name = "myapp-mq"
engine_type = "RabbitMQ"
engine_version = "3.12.0"
host_instance_type = "mq.m5.large"
deployment_mode = "CLUSTER_MULTI_AZ"
user {
username = "admin"
password = var.mq_password
}
}
Amazon MQ abstracts away clustering details that become explicit in K8s:
| Amazon MQ | Dialect | RabbitMQ Operator | Type |
|---|---|---|---|
| mq.m5.large | nodeMemoryMb: 8192 | resources.limits.memory: 8Gi | Lossless |
| deployment_mode: CLUSTER_MULTI_AZ | nodeCount: 3 | spec.replicas: 3 | Lossless |
| (implicit mirroring) | haMode: Quorum | spec.override.additionalConfig: quorum | Normalized |
| maintenance_window_start_time | (no equivalent) | (use K8s PodDisruptionBudget) | Lossy |
| logs.audit: true | auditLogging: true | (external log aggregation) | Lossy |
Amazon MQ's "classic" HA (mirrored queues) is deprecated in RabbitMQ 3.13+. The compiler normalizes to quorum queues when targeting modern RabbitMQ versions, noting the behavioral change in the compilation report.†
Quorum queues have different performance characteristics than mirrored queues: better consistency guarantees but higher latency. The report notes this trade-off.
RabbitMQ lowering requires resolving several ambiguous values.
The resolution chain for haMode:
// Resolution chain for RabbitMQ HA mode
1. Customer config: Preferred HA mode? → "quorum" (if specified)
2. Target version: RabbitMQ 3.13+? → prefer quorum (classic deprecated)
3. Source hint: Amazon MQ was using mirroring → "quorum" (modern equivalent)
4. Heuristic: Default to quorum for new deployments → "quorum"
Vendors and customers negotiate RabbitMQ configuration through complementary constraints:
// Vendor config
rabbitmq:
min_nodes: 3 # Minimum for quorum queues
min_memory_mb: 4096 # Memory-intensive workload
required_features:
- tls # Must have encryption
- quorum_queues # Data durability requirement
// Customer config
rabbitmq:
max_nodes: 5 # Cost ceiling
max_memory_mb: 8192 # Resource limit
storage_class: "fast-ssd"
erlang_cookie_secret: "rabbitmq-cookie" # Existing secret
The compiler validates requirements: 3 nodes (within ceiling), quorum queues (enabled), TLS (enabled). The customer's existing Erlang cookie secret is used for cluster authentication rather than generating a new one.
Amazon MQ → RabbitMQ Operator is lossy; several features don't translate. The reverse direction, RabbitMQ Operator → Amazon MQ, loses a different set: custom plugins (Amazon MQ has a fixed plugin set), sidecar containers, custom Erlang VM settings, and any operator-specific CRD fields.
What if the customer requires classic mirrored queues on a modern RabbitMQ version?
⚠ CONFIGURATION CONFLICT: rabbitmq.message-broker
Vendor requires: quorum_queues (data durability)
Customer config: ha_mode: classic_mirrored
Conflict: Classic mirrored queues are deprecated in RabbitMQ 3.13+
and cannot provide the durability guarantees vendor requires.
Resolution options:
- Customer updates to quorum queues (recommended)
- Vendor accepts classic mirrored queues on older RabbitMQ version
- Customer accepts reduced durability guarantees
Severity: ERROR (compilation cannot proceed)
Kafka is more complex. The dialect captures: cluster identity, broker configuration (count, resources, storage), topic defaults, security (TLS, SASL), ZooKeeper vs KRaft mode, and replication settings.
resource "aws_msk_cluster" "kafka" {
cluster_name = "myapp-kafka"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_group_info {
instance_type = "kafka.m5.large"
ebs_volume_size = 500
}
encryption_info {
encryption_in_transit {
client_broker = "TLS"
}
}
}
MSK has tight AWS integration that doesn't translate directly to Strimzi:
| MSK | Dialect | Strimzi | Type |
|---|---|---|---|
| kafka.m5.large | brokerMemoryMb: 8192 | resources.limits.memory: 8Gi | Lossless |
| number_of_broker_nodes: 3 | brokerCount: 3 | spec.kafka.replicas: 3 | Lossless |
| client_authentication.sasl.iam | saslMechanism: "IAM" | authentication.type: scram-sha-512 | Normalized |
| enhanced_monitoring: PER_TOPIC | (no equivalent) | (use Strimzi metrics) | Lossy |
| serverless mode | brokerCount: 3 (default) | spec.kafka.replicas: 3 | Operational |
MSK's IAM authentication is AWS-specific. When lowering to Strimzi, the compiler normalizes to SCRAM-SHA-512 and generates the required K8s Secrets. The compilation report notes this: "MSK IAM auth converted to SCRAM-SHA-512. Update client configs."†
MSK Serverless compiles but with significant operational warnings: auto-scaling becomes manual, pay-per-use becomes fixed capacity. The compilation report lists all operational changes.
Kafka lowering has its own resolution challenges.
The resolution chain for controllerMode:
// Resolution chain for Kafka controller mode
1. Customer config: Preferred mode? → "kraft" (if specified)
2. Target capability: Strimzi version supports KRaft? → true (v0.36+)
3. Source hint: MSK was ZooKeeper-based → prefer ZooKeeper for compatibility
4. Heuristic: Default to KRaft for new deployments → "kraft"
Vendors and customers negotiate Kafka configuration through complementary constraints:
// Vendor config
kafka:
min_brokers: 3 # Minimum for HA
min_replication_factor: 3 # Data durability requirement
required_features:
- tls # Must have encryption in transit
- sasl # Must have authentication
// Customer config
kafka:
max_brokers: 6 # Cost ceiling
allowed_sasl_mechanisms:
- scram-sha-512 # We support this
storage_class: "fast-ssd"
controller_mode: "kraft" # Prefer KRaft
The compiler validates that vendor requirements are satisfiable within customer constraints: 3 brokers (within ceiling), SCRAM-SHA-512 (allowed), TLS enabled. The customer's KRaft preference is honored since the target supports it.
MSK → Strimzi is lossy; some information doesn't survive the trip. Strimzi → MSK would lose different things: KRaft mode (MSK is ZooKeeper-only for now), custom JVM settings, sidecar containers. Each direction has its own lossy dimensions.
MSK Serverless compiles to Strimzi, but with significant operational warnings:
⚠ OPERATIONAL CHANGES: kafka.event-stream
Source: MSK Serverless
Target: Strimzi Kafka cluster (3 brokers, 8Gi memory each)
The following operational characteristics change significantly:
⚠ SCALING: MSK Serverless auto-scales transparently.
Strimzi requires manual broker count changes.
Action: Monitor throughput and scale proactively.
⚠ BILLING: MSK Serverless is pay-per-use.
Strimzi runs fixed capacity 24/7.
Action: Right-size brokers for expected load.
⚠ MAINTENANCE: MSK Serverless handles broker patching.
Strimzi requires planned rolling updates.
Action: Schedule maintenance windows.
⚠ CAPACITY: MSK Serverless has no partition limits per broker.
Strimzi performance degrades beyond ~4000 partitions/broker.
Action: Plan partition strategy for growth.
Severity: WARNING (compilation proceeds, review operational changes)
The PostgreSQL dialect captures: identity, compute resources (CPU, memory), storage (size, IOPS), replication (replicas, sync mode), version, extensions, connection limits, and backup configuration.
resource "aws_rds_cluster" "db" {
cluster_identifier = "myapp-db"
engine = "aurora-postgresql"
engine_version = "15.4"
database_name = "myapp"
master_username = "admin"
master_password = var.db_password
storage_encrypted = true
allocated_storage = 100
iops = 3000
}
resource "aws_rds_cluster_instance" "writer" {
identifier = "myapp-db-writer"
cluster_identifier = aws_rds_cluster.db.id
instance_class = "db.r5.large"
engine = aws_rds_cluster.db.engine
}
resource "aws_rds_cluster_instance" "reader" {
identifier = "myapp-db-reader"
cluster_identifier = aws_rds_cluster.db.id
instance_class = "db.r5.large"
engine = aws_rds_cluster.db.engine
}
Notice what's absent: no db_instance_class, no storage_type,
no parameter_group_name. Those are AWS-specific. And no
storageClassName or podAntiAffinity. Those are K8s-specific.
The dialect captures what a PostgreSQL cluster is, not how any platform implements it.
RDS Aurora and CloudNativePG have different feature sets:
| RDS Aurora | Dialect | CloudNativePG | Type |
|---|---|---|---|
| db.r6g.xlarge | memoryMb: 32768 | resources.limits.memory: 32Gi | Lossless |
| engine_version: "15.4" | version: "15.4" | imageName: ...postgresql:15.4 | Lossless |
| multi_az: true | (no equivalent) | (HA via replica count) | Lossy |
| performance_insights_enabled | (no equivalent) | (use pg_stat_statements) | Lossy |
| (auto-scales) | storageSizeGb: 100 | storage.size: 100Gi | Synthetic |
Aurora auto-scales storage; you don't specify a size. CloudNativePG requires an explicit PVC size. The lowerer synthesizes a value based on current usage plus headroom, or uses a customer-provided default. The compilation report flags this as a synthetic field.†
Aurora's storage auto-scaling is genuinely better for most workloads. When compiling to K8s, you trade that convenience for explicit capacity planning.
When lowering to CloudNativePG, several fields can't be deterministically derived.
The resolution chain for storageClassName:
// Resolution chain for PostgreSQL storage class
1. Metadata: Did a previous compilation set this? → "gp3-encrypted"
2. Customer config: What storage classes are allowed? → ["gp3-encrypted", "io2-fast"]
3. Dialect requirement: IOPS ≥ 3000 → filters to ["io2-fast"]
4. Heuristic: Pick first allowed → "io2-fast"
Vendors and customers negotiate PostgreSQL configuration through complementary constraints:
// Vendor config (SvcDefVendor)
postgres:
min_memory_mb: 16384 # Need at least 16GB for our workload
required_extensions:
- postgis # Spatial queries
- pg_stat_statements # Query monitoring
min_connections: 200 # Our connection pool size
// Customer config (SvcTuning)
postgres:
max_memory_mb: 32768 # Cost constraint
allowed_extensions:
- postgis
- pg_stat_statements
- pgvector # We also allow this
storage_class: "gp3-encrypted"
The compiler resolves memoryMb to 16384 (vendor minimum within customer ceiling),
enables the required extensions (all allowed), and uses the customer's storage class.
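That memory resolution is essentially a floor/ceiling check; a simplified sketch (hypothetical function, not the compiler's API):

// Floor/ceiling resolution: vendor minimum within customer maximum, else a conflict.
fun resolveMemoryMb(vendorMinMb: Int, customerMaxMb: Int): Int {
    require(vendorMinMb <= customerMaxMb) {
        "Vendor requires ${vendorMinMb}MB but customer allows at most ${customerMaxMb}MB"
    }
    return vendorMinMb  // smallest value satisfying the vendor, respecting the customer's cost ceiling
}

// resolveMemoryMb(16384, 32768) == 16384
// resolveMemoryMb(16384, 6144) fails, surfaced as a compilation conflict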
What if the customer doesn't allow a required extension?
⚠ CONFIGURATION CONFLICT: postgres.myapp-db
Vendor requires extension: postgis
Customer allowed_extensions: [pg_stat_statements, pgvector]
Resolution: Extension 'postgis' is required for spatial queries.
Either add to allowed_extensions or discuss with vendor
about alternative approaches.
Severity: ERROR (compilation cannot proceed)
AWS, GCP, Azure, and Kubernetes are different representations of the same underlying service properties (memory, replicas, TLS, authentication).
Some transformations preserve all information (same-cloud round-trips). Some collapse properties (cross-cloud, where features don't exist on the target). The compiler reports which is which.
Compilation is necessary but not sufficient. Producing correct Terraform is perhaps 40% of the multi-cloud deployment problem. The harder 60% is operational.
These are real concerns that I'll get into some other time. The compiler can tell you that Cluster mode has 1–6 second failover; it can't tell you whether your application handles redirects correctly.
-mtp 2026-01-30