A Valkey cluster has properties: memory, replicas, TLS settings, failover behavior. AWS ElastiCache
and Kubernetes Valkey operators describe the same thing using different representations.
AWS measures memory in node types (cache.t3.medium). Kubernetes
measures it in resource limits (3Gi). Same property, different representations.
AWS has multi_az_enabled. Kubernetes
doesn't think in availability zones. Instead, it has a different high-availability model based on pod
anti-affinity and topology spread constraints.
Mapping coherently between these different configuration spaces is a genuinely interesting problem, and it's the one this post digs into.
This post describes a particular subset of internals of a cross-cloud infrastructure compiler: the component that transforms infrastructure definitions from one cloud platform to another while preserving service semantics.
Throughout this post, I'll use a Valkey cache cluster as the running example. Valkey is the open-source fork of Redis, and it's a good example because it has enough configuration surface (memory, replication, TLS, auth, failover) to exercise every kind of transform without drowning in detail.
When I show transformations, I'm taking an AWS ElastiCache Valkey cluster defined in Terraform and compiling it to run on Kubernetes via a Valkey Operator, preserving memory, replication, TLS, and authentication semantics while changing every underlying resource.
Here's what that actually looks like:
resource "aws_elasticache_replication_group" "valkey" {
replication_group_id = "myapp-cache"
engine = "valkey"
node_type = "cache.t3.medium"
num_cache_clusters = 3
transit_encryption_enabled = true
auth_token = var.valkey_auth_token
automatic_failover_enabled = true
multi_az_enabled = true
}
apiVersion: rds.valkey.buf.red/v1alpha1
kind: Valkey
metadata:
name: myapp-cache
spec:
version: "8.0"
arch: cluster # native failover
replicas:
shards: 3
replicasOfShard: 1
resources:
limits:
memory: 3Gi
access:
enableTLS: true
node_type → resources.limits.memory (field transform).
num_cache_clusters: 3 → shards: 3 (structural: cluster mode uses sharding).
multi_az_enabled → no equivalent (lossy: K8s uses pod anti-affinity instead).
auth_token → ACL User resource (reference transform: value never in output).
After building transformations for dozens of services, a pattern emerged. There are two fundamentally different kinds of transforms:
Field transforms are value conversions: cache.t3.medium → 4096 → 4Gi.† Same data structure,
different encoding. This includes configuration values, naming schemes (ARN → K8s
namespace/name), and addressing formats.
Structural transforms are the tricky ones. An RDS cluster's instances[...] array
is a list of objects, each with its own instance_class and promotion_tier.
CloudNativePG represents the same concept as a single integer: spec.instances: 3.
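To make the field/structural distinction concrete, here's a minimal Kotlin sketch of a field transform, assuming a hypothetical node-type lookup table (the table contents and function names are illustrative, not the compiler's actual API):

// Hypothetical field transform: AWS node type -> canonical MB -> K8s quantity.
// The node-type table is illustrative, not exhaustive.
val nodeTypeMemoryMb = mapOf(
    "cache.t3.micro" to 512,
    "cache.t3.small" to 1536,
    "cache.t3.medium" to 4096,
    "cache.m5.large" to 6144,
)

fun nodeTypeToMemoryMb(nodeType: String): Int =
    nodeTypeMemoryMb[nodeType] ?: error("Unknown node type: $nodeType")

fun memoryMbToK8sQuantity(mb: Int): String =
    if (mb % 1024 == 0) "${mb / 1024}Gi" else "${mb}Mi"

fun main() {
    val mb = nodeTypeToMemoryMb("cache.t3.medium")   // 4096
    println(memoryMbToK8sQuantity(mb))               // "4Gi"
}

The shape (one memory value) survives; only the encoding changes.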
The problem: how do you take an AWS ElastiCache Valkey cluster and run it on a customer's private Kubernetes via a Valkey Operator, preserving replication, TLS, and auth while changing every underlying resource?
The naive approach is templating: write Terraform for AWS, write different Terraform for GCP, maintain both forever. This doesn't scale. N clouds times M services means N×M maintenance, and it grows with every feature you add.
The compiler approach works differently: parse infrastructure into a typed intermediate representation, transform through cloud-agnostic dialects, emit to any target.†
This is similar to how LLVM works for programming languages: a common IR that different frontends target and different backends consume. But the closer analogy is actually MLIR, which I'll discuss shortly.
The LLVM analogy is instructive. Before LLVM, every language needed a separate backend for every CPU: C×ARM, C×x86, Rust×ARM, Rust×x86. N×M combinations. LLVM introduced a common IR; now languages compile to LLVM IR, LLVM compiles to any architecture. O(N+M) instead of O(N×M).
Infrastructure has the same structure. AWS Terraform, GCP Terraform, K8s manifests, Helm charts: these are all "source languages." Different clouds and operators are "target architectures." A common IR factors out the complexity.
MLIR is the closer analogy. MLIR was designed for the same architectural challenge: multiple levels of abstraction that need to interoperate. It has "dialects," domain-specific IRs for TensorFlow ops, linear algebra, GPU kernels. That pattern inspired ours.†
MLIR uses multiple specialized IRs (dialects) with well-defined transformations between them, rather than one universal IR. Our service dialects (Valkey, PostgreSQL, etc.) borrow this organizational pattern.
MLIR's "progressive lowering" also shaped the architecture. You don't transform directly from TensorFlow to machine code; you lower through intermediate dialects: TensorFlow → Linalg → Affine → LLVM → machine code. Each step is a smaller, more tractable transformation. Our raise/lower pattern works the same way: AWS → Dialect → K8s, with the dialect as the intermediate level.
There's an important difference though. MLIR transformations are provably correct; formal semantics guarantee that lowering preserves computation. Infrastructure compilation can't make that claim. ElastiCache and a K8s Valkey Operator are semantically similar (both provide a Valkey cluster) but they're not identical in features, configuration, or edge behavior. Our transformations are heuristic. We use the dialect pattern but we don't get MLIR's correctness guarantees.
Preserving semantics across fundamentally different implementations is the hard part, not syntax translation.
The foundation is STIR (STack Intermediate Representation), a graph-based IR where nodes are infrastructure constructs and edges are relationships. I covered STIR in detail in How Tensor9 Models Your Stack; the short version: infrastructure is a graph, not a tree, so the IR should be too.
Lifters parse source formats (Terraform HCL, Helm charts, K8s manifests) into the graph; emitters walk the graph and generate target formats. The correctness test is round-trip fidelity: lift source into STIR, emit back to source, compare.† If they match, the lifter and emitter preserve structure correctly.
All compilation happens locally. No API calls to Tensor9 or cloud providers during transformation. The compiler runs entirely in your CI pipeline or on your laptop.
Important caveat: round-trip tests validate structural preservation, not behavioral equivalence. Whether compiled infrastructure performs identically under load is a different question. ElastiCache and self-hosted Valkey have different failover timing, different connection handling, different memory management. Behavioral testing happens through integration tests and staged rollouts.†
We provide a behavioral test harness that runs the same workload against source and target deployments, comparing latency distributions, failover timing, and error rates. It won't catch everything, but it surfaces the big differences before production.
STIR can represent any infrastructure format. But to actually compile ElastiCache to a K8s Valkey Operator, you need something higher-level: an abstraction that captures the service's semantics independent of how it's implemented. We call these dialects.
A dialect is a cloud-agnostic IR for a service category. The Valkey dialect captures what a Valkey cluster needs (replicas, TLS, auth, resources) without caring whether it's backed by ElastiCache or a K8s operator.
The compiler has dialects for: Containers (EKS/GKE/AKS/Kubernetes), Functions (Lambda/Cloud Functions/Knative), PostgreSQL, MySQL, MongoDB, Caching (ElastiCache/Memorystore/Redis), Message Streaming (Kafka/MSK/Event Hubs), Object Storage (S3/GCS/MinIO), Search (OpenSearch), Load Balancers, IAM, DNS, and Networking.† Each captures the essential semantics of its service category.
Full list here: Service Equivalents Registry
Fair question: who maintains 12+ dialects across multiple providers? This is a real scaling challenge; each AWS release, each Helm chart update, each new cloud provider creates work. Our bet: dialects capture stable semantics (a cache needs memory, replicas, TLS), not provider-specific implementation details. Those details live in raisers and lowerers, which are smaller and more mechanical. The maintenance burden is real; I've traded architectural complexity for reduced operational complexity. Whether that tradeoff holds depends on scale.†
Custom dialects are supported for internal services. You define the schema, write a raiser from your source format, and a lowerer to your target. The compiler handles the rest. It's not trivial, but it's documented.
Dialects sit between cloud-specific implementations:
AwsValkeyRaiser takes an aws_elasticache_replication_group and creates a ValkeyCluster on the Valkey dialect.
K8sValkeyLowerer takes a ValkeyCluster and creates a kubernetes_manifest for the Valkey Operator CRD.
Note that raisers delete the original resources. After raising, the dialect surface is the source of truth. Lowerers reconstruct cloud-specific resources from the dialect representation, not from the original.†
This is intentional. The raiser isn't annotating the original; it's replacing it with a canonical form. Round-trip preservation happens via source map metadata, not by keeping the original around.
Let's look at the Valkey dialect in detail. A dialect has three components: a schema (the canonical fields), raisers (source format → dialect), and lowerers (dialect → target). Here's the ElastiCache source we're raising from:
resource "aws_elasticache_replication_group" "cache" {
replication_group_id = "myapp-cache"
engine = "valkey"
engine_version = "7.2"
node_type = "cache.t3.medium"
num_cache_clusters = 3
transit_encryption_enabled = true
auth_token = var.auth_token
}
Notice what's not in a dialect: AWS-specific concepts like node_type,
subnet_group_name, or parameter_group_name. And no K8s-specific
concepts like storageClassName or podAntiAffinity. The dialect
captures what a Valkey cluster is, not how any particular platform implements it.
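As a rough sketch (illustrative field names, not the actual schema), the Valkey dialect's surface might look like this; the conventions it follows are the design lessons described below:

// Illustrative sketch of a dialect surface, not the real schema.
data class ValkeyCluster(
    val name: String,
    val version: String,              // e.g. "8.0"
    val shards: Int,                  // 1 = non-sharded
    val replicasPerShard: Int,
    val memoryMb: Int,                // canonical unit, not a node type
    val cpuMillis: Int?,              // null = let the platform decide
    val tls: Boolean,
    val authEnabled: Boolean,
    val persistence: Boolean,
) {
    // Derived, never stored separately (see the design notes below).
    val totalNodes: Int get() = shards * (1 + replicasPerShard)
}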
Dialects also capture semantic relationships: shards and replicas are mutually constraining, auth implies TLS for secure configurations, persistence affects recovery behavior. These relationships matter during lowering. If you're targeting a platform that doesn't support sharding, the compiler needs to handle that gracefully.
After getting dialect design wrong several times, here's what we learned:
Use canonical units: memoryMb, not instanceType or resourcesLimitsMemory.
Don't store what you can derive: totalNodes can be computed from shards × (1 + replicasPerShard), so don't store it separately.
Distinguish unset from zero: cpuMillis: null means "let the platform decide" vs cpuMillis: 1000 meaning "explicitly request 1 core."
Bad dialects leak implementation details. If I find myself adding fields like
awsNodeType to a dialect, something is wrong.
Service configuration is a structured space where each field represents a property of the service.
Different providers use different representations for the same thing.
AWS expresses memory as a node_type string
(cache.t3.medium); K8s expresses it as resources.limits.memory
in bytes. Same property, different shape.
Two types of transforms require different strategies. The first are field transforms: value transforms where the shape is preserved. The same conceptual field exists in both systems, just expressed differently (like measuring temperature in Celsius versus Fahrenheit).
Same configuration, different representations
The second type are structural transforms: shape transforms where the semantics are preserved but the structure changes. An array collapses to a scalar, or a nested tree flattens; the information is preserved through a projection.
Structure preserved through dialect: three replicas in, three replicas out
Looking at the full picture, multiple fields transform together. Some are lossless, some normalized, some lossy:
The compiler models this explicitly as a configuration space: a collection of transform objects, each describing how one field transforms between representations. Not all fields transform the same way:
| Type | Meaning | Example |
|---|---|---|
| Lossless | Direct semantic equivalence. Perfect round-trip. | transit_encryption_enabled → tls → tls.enabled |
| Normalized | Different units, same information. Reversible with lookup. | node_type (cache.t3.medium) → memoryMb (4096) → memory (4Gi) |
| Lossless (aided) | Preserved for same-cloud via metadata. Lost in cross-cloud. | replication_group_id stored in source map, recovered for AWS→AWS |
| Lossy (no equivalent) | Dimension collapses in cross-cloud. No target equivalent. | multi_az_enabled: AWS concept, no K8s equivalent |
| Lossy (non-canonical) | Field exists only in one provider, no canonical mapping. | subnet_group_name: AWS VPC concept, no canonical mapping |
| Synthetic | No origin: value must be synthesized for target. | Aurora auto-scales storage, but CNPG needs explicit PVC size |
With these classifications, the compiler can report the fidelity of any source→target transformation. For AWS ElastiCache → Valkey Operator: roughly 50% of dimensions are lossless (TLS, auth, replicas), 25% are normalized (node type → memory), and 25% are lossy (multi-AZ, subnet groups, security groups).
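A sketch of how that fidelity breakdown could be computed from the classified transforms (the enum and the sample field list are illustrative, not the compiler's internal types):

// Illustrative fidelity summary over a (partial) classified configuration space.
enum class TransformKind { LOSSLESS, NORMALIZED, LOSSLESS_AIDED, LOSSY, SYNTHETIC }

data class FieldTransform(val sourceField: String, val kind: TransformKind)

fun fidelityReport(space: List<FieldTransform>): Map<TransformKind, Int> =
    space.groupingBy { it.kind }
        .eachCount()
        .mapValues { (_, n) -> n * 100 / space.size }   // percentage of dimensions

fun main() {
    val space = listOf(
        FieldTransform("transit_encryption_enabled", TransformKind.LOSSLESS),
        FieldTransform("num_cache_clusters", TransformKind.LOSSLESS),
        FieldTransform("node_type", TransformKind.NORMALIZED),
        FieldTransform("multi_az_enabled", TransformKind.LOSSY),
    )
    println(fidelityReport(space))  // {LOSSLESS=50, NORMALIZED=25, LOSSY=25}
}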
The lossy dimensions aren't failures; they're explicit reporting. multi_az_enabled
doesn't map to Kubernetes because Kubernetes has a fundamentally different high-availability
model. The compiler tells you this upfront, rather than silently ignoring the field or
making up a mapping that doesn't preserve semantics.†
The lossy fields are handled by resolvers, which we'll get into later.
The examples so far show AWS ElastiCache compiling to a single Kubernetes target. But Kubernetes has multiple Valkey implementations, each with different configuration spaces and operational characteristics. The compiler can target any of them, and the choice matters.
Consider two production-grade Valkey operators:
apiVersion: cache.cs.sap.com/v1alpha1
kind: Valkey
spec:
replicas: 3
sentinel:
enabled: true # external failover
resources:
limits:
memory: 3Gi
tls:
enabled: true
apiVersion: rds.valkey.buf.red/v1alpha1
kind: Valkey
spec:
arch: cluster # native failover
replicas:
shards: 3
replicasOfShard: 1
resources:
limits:
memory: 3Gi
access:
enableTLS: true
Same source (AWS ElastiCache with 3 nodes, 3GB memory, TLS). Same dialect (ValkeyCluster). Different targets with different configuration spaces and different operational characteristics.
| Dimension | Sentinel Mode | Cluster Mode |
|---|---|---|
| replicas | Single integer (total nodes) | Shards × replicas per shard |
| failover | External Sentinel quorum | Built-in gossip protocol |
| sharding | Not supported | 16384 hash slots across shards |
| scaling | Add replicas only | Add shards or replicas |
| client library | Sentinel-aware required | Cluster-aware required |
The configuration space differences translate to operational differences that the compiler can surface explicitly:
| Characteristic | Sentinel Mode | Cluster Mode |
|---|---|---|
| Failover time | 10–60 seconds | 1–6 seconds |
| Failure detection | Sentinel quorum (SDOWN→ODOWN)† | Node gossip (cluster-node-timeout) |
| Data model | All keys on primary | Keys sharded by hash slot |
| Multi-key operations | Always work | Same-slot only (or hash tags) |
| Horizontal scaling | Read replicas only | Add shards for write capacity |
These details matter operationally: failover timing affects outage duration, and cluster mode determines whether your application needs hash tags for multi-key operations. The compiler surfaces these differences in the compilation report.
The dialect (ValkeyCluster) captures what the service is: memory,
replicas, TLS, authentication. The lowerer chooses how it's implemented.
Different lowerers target different operators:
K8sValkeySentinelLowerer → Bitnami/SAP operator (Sentinel mode)
K8sValkeyClusterLowerer → chideat operator (Cluster mode)
The compilation report includes target-specific warnings:
Target: chideat::1.0.0::valkey
⚠ OPERATIONAL: Cluster mode requires cluster-aware client library
Your application must use a Redis/Valkey client that supports
CLUSTER SLOTS and automatic redirect handling.
⚠ OPERATIONAL: Multi-key operations require same hash slot
Commands like MGET, MSET, SUNION across keys will fail unless
keys share a hash slot. Consider hash tags: {user:123}:profile
✓ ADVANTAGE: Failover time ~1-6 seconds (vs 10-60s Sentinel)
Cluster nodes detect failures via gossip and self-promote.
No external Sentinel quorum required.
✓ ADVANTAGE: Horizontal write scaling via shards
Add shards to increase write throughput. Sentinel mode
can only add read replicas.
Target selection follows the same negotiation model as field-level configuration. Vendors express requirements; customers express constraints; the compiler finds the intersection or reports why it can't.
# Vendor tuning
valkey:
target:
preferred: [cluster, sentinel]
requirements:
min_failover_seconds: 10
# Customer config
valkey:
target:
allowed: [sentinel] # app not cluster-aware
constraints:
client_library: "jedis-3.x"
The compiler evaluates this negotiation: the vendor prefers cluster mode and requires failover within 10 seconds, but the customer only allows sentinel mode, whose typical failover is 10–60 seconds. No target satisfies both sides' constraints as stated.
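As a simplified sketch of that evaluation (hypothetical types; the real negotiation weighs more dimensions than failover time, and the conflict below is reported as a warning rather than a hard failure):

// Hypothetical target negotiation: intersect vendor preference with customer allowance,
// then check the vendor's failover requirement against the candidate's envelope.
data class TargetProfile(val mode: String, val typicalFailoverSeconds: IntRange)

val profiles = listOf(
    TargetProfile("cluster", 1..6),
    TargetProfile("sentinel", 10..60),
)

fun negotiateTarget(
    vendorPreferred: List<String>,
    customerAllowed: Set<String>,
    maxFailoverSeconds: Int,
): Result<TargetProfile> {
    val candidates = vendorPreferred.filter { it in customerAllowed }
    if (candidates.isEmpty()) return Result.failure(IllegalStateException("No allowed target"))
    val chosen = profiles.first { it.mode == candidates.first() }
    return if (chosen.typicalFailoverSeconds.last <= maxFailoverSeconds) Result.success(chosen)
    else Result.failure(IllegalStateException(
        "${chosen.mode} typical failover ${chosen.typicalFailoverSeconds}s exceeds ${maxFailoverSeconds}s"))
}

// negotiateTarget(listOf("cluster", "sentinel"), setOf("sentinel"), 10) -> failure,
// which the compiler surfaces as the target conflict shown below.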
This surfaces as a compilation issue, not a silent degradation:
⚠ TARGET CONFLICT: valkey.myapp-cache
Vendor requires: failover ≤ 10 seconds
Customer allows: sentinel mode only (client constraint)
Sentinel mode typical failover: 10-60 seconds
Options:
1. Customer upgrades to cluster-aware client (enables cluster mode)
2. Vendor relaxes failover requirement to 60 seconds
3. Customer accepts risk of longer failover window
Severity: WARNING (compilation can proceed with accepted risk)
Neither party makes this decision alone. The vendor can't force cluster mode on a customer whose application doesn't support it. The customer can't demand sub-10-second failover while constraining to sentinel mode. The compiler makes the trade-off explicit so both parties can negotiate with full information.
"Configuration space" includes the operational envelope of what's possible on each target, beyond just field mappings. The compiler surfaces this so both parties can make informed decisions.
When lowering from a dialect to a specific cloud, some fields can't be deterministically
derived. ElastiCache needs a node_type, but the Valkey dialect only has
memoryMb. Multiple node types have the same memory: which one?
Resolvers handle this. A resolver implements a resolution chain, a priority-ordered fallback strategy:
For same-cloud round-trips (AWS → Dialect → AWS), the raiser stores the original
node_type in source map metadata. The lowerer checks metadata first,
finds it, and emits the exact original value. Isomorphic.
For cross-cloud transformations (AWS → Dialect → K8s → Dialect → AWS), there's no metadata. The resolver falls through to Vendor config (if the vendor specified preferences), then customer config (if the customer has infrastructure constraints), then heuristics (pick a reasonable node type for the memory size).†
Secrets are handled differently: they're never stored in source map metadata or resolver state. Secret references (AWS Secrets Manager ARNs, K8s Secret names) transform through a separate subsystem that ensures credentials never appear in Terraform state files.
Every resolved value carries provenance: where it came from. This matters for debugging ("why did the compiler pick this node type?") and for compliance ("prove that customer constraints were respected").
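Here's a minimal Kotlin sketch of such a resolution chain with provenance attached, assuming hypothetical types and an illustrative node-type table:

// Illustrative resolver chain for ElastiCache node_type.
// Each source is tried in priority order; the first hit wins and records where it came from.
enum class Provenance { SOURCE_MAP, VENDOR_CONFIG, CUSTOMER_CONFIG, HEURISTIC }

data class Resolved<T>(val value: T, val provenance: Provenance)

fun resolveNodeType(
    sourceMapNodeType: String?,          // present only for same-cloud round-trips
    vendorPreference: String?,
    customerAllowed: List<String>,
    memoryMb: Int,
): Resolved<String> {
    sourceMapNodeType?.let { return Resolved(it, Provenance.SOURCE_MAP) }
    vendorPreference?.takeIf { it in customerAllowed }
        ?.let { return Resolved(it, Provenance.VENDOR_CONFIG) }
    customerAllowed.firstOrNull()
        ?.let { return Resolved(it, Provenance.CUSTOMER_CONFIG) }
    // Heuristic fallback: smallest node type with enough memory (table is illustrative).
    val byMemory = listOf("cache.t3.micro" to 512, "cache.t3.medium" to 4096, "cache.m5.large" to 6144)
    val pick = byMemory.firstOrNull { it.second >= memoryMb }?.first ?: byMemory.last().first
    return Resolved(pick, Provenance.HEURISTIC)
}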
A valid concern: if vendor preferences or customer constraints change between compilations, outputs differ. The compiler addresses this through resolution manifests, a record of every resolved value and its source. Re-running compilation with the same manifest produces identical output. The manifest also enables auditing: you can trace exactly why each field has its value.†
Escape hatch: you can edit the manifest to override any resolved value before re-compiling. The output Terraform is also human-readable and can be modified directly for emergency fixes, though you lose round-trip guarantees.
What happens when resolution fails? The compiler doesn't silently pick a default. It surfaces a structured error explaining the conflict: what was needed, what was allowed, and how to fix it.
For example: "Your customer config only allows t3.micro and t3.small, but the vendor needs 16GB of memory. Either allow larger node types or reduce the memory requirement." The user sees the conflict clearly and can make an informed decision.
The resolver chain reflects something important: vendors and customers have different concerns, and they're intentionally asymmetric.
The two parties provide different kinds of configuration: vendors specify what the application needs, customers specify what the infrastructure allows.
The customer provides the foundation that vendors must work within.
The customer is saying: "Here's my infrastructure and what I'm willing to accept."
The vendor tunes service behavior on top of the customer's base.
The vendor is saying: "Given what the customer allows, here's how I want to configure this."
For negotiable fields like node tier, both sides provide ranked preferences:
// Customer's acceptable tiers (in preference order)
@CustomerField
val nodeTierPreference: List<NodeTier> = listOf(Small, Medium) // no Large

// Vendor's preferred tiers (in preference order)
@VendorField
val nodeTierTuning: List<NodeTier> = listOf(Large, Medium, Small)
Resolution: pick the highest vendor preference that appears in the customer's allowed set. Here, the vendor wants Large → Medium → Small. The customer allows Small and Medium. Result: Medium.
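In code, that rule is a one-liner (hypothetical helper, shown in the same Kotlin style as the field definitions above):

// Pick the highest-ranked vendor preference that the customer allows; null means no overlap.
fun <T> resolvePreference(vendorRanked: List<T>, customerAllowed: List<T>): T? =
    vendorRanked.firstOrNull { it in customerAllowed }

// resolvePreference(listOf(Large, Medium, Small), listOf(Small, Medium)) == Medium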
This is not symmetric negotiation where both parties have equal say. The customer provides constraints (the ceiling); the vendor tunes within those constraints. Vendors cannot override customer constraints; they can only choose within them.†
This asymmetry is intentional. Customers own their infrastructure; vendors own their applications. Each has authority over their domain.
When the resolver needs a value, it checks each source in priority order: source map metadata, then vendor config, then customer config, then heuristics.
Vendor requirements establish a floor. Customer constraints establish a ceiling. The resolver finds the best value in the overlap, or fails if there is no overlap.†
For regulated environments, every resolution decision is logged with its source (metadata, vendor config, customer constraint, or heuristic). This audit trail lets compliance teams verify that transformations respected policy. Not covered here, but essential for SOC2/HIPAA.
Consider a vendor whose application needs 16GB memory for production workloads. The customer has cost constraints and only allows smaller instance types (up to 6GB).
The compiler detects this conflict: vendor requires 16GB minimum, customer's largest allowed option provides only 6GB. No overlap exists. Resolution fails, correctly. The compiler can't magically make 6GB work for a 16GB requirement. Instead, it surfaces the conflict clearly with actionable suggestions.
If the customer adds larger instance types to their allowed list, negotiation can succeed. The compiler picks the best option that satisfies both parties: the highest vendor preference that falls within customer constraints.
If the resulting choice is below the vendor's stated requirements (say, 13GB when vendor wanted 16GB), the compiler emits a warning. It proceeds (that's the customer's choice) but makes the trade-off visible.
What happens when you round-trip? This is really asking whether the compiler respects the original intent. A compiler that silently drops information without telling you is dangerous.
Let me formalize this with two different roundtrip scenarios, each with different expectations.
AWS → Dialect → AWS should produce an identical graph. This is the baseline requirement: if you're not changing clouds, the compiler shouldn't change your infrastructure. The raiser extracts all configuration into the dialect, stores cloud-specific details in source map metadata, and the lowerer reconstructs the original exactly.
This works because the source map preserves everything the dialect doesn't capture natively. The dialect knows about replicas, memory, and TLS, but it doesn't know about AWS-specific subnet groups or security group IDs. Those get stashed in metadata and restored on the way back.
Cross-cloud transformation is different. AWS → Dialect → K8s → Dialect → AWS is lossy. Some AWS-specific concepts don't exist in the operator, and vice versa. The configuration space contracts when you move to the intersection of what both platforms support, then expands again with platform-specific defaults on the way back.
Consider multi_az_enabled. AWS ElastiCache uses this to spread replicas
across availability zones for fault tolerance. Kubernetes doesn't have availability zones
in the same sense. Instead, it has node affinity, pod anti-affinity, and topology spread constraints.
These aren't equivalent concepts; they're different approaches to the same goal. The compiler
can't preserve multi_az_enabled because the target platform doesn't have that dimension.†
This is correct behavior. Cross-cloud transformation is inherently lossy. The compiler makes this explicit: you can query the configuration space to see exactly which dimensions are preserved and which require resolver intervention.†
Think of it like translating between languages. Some concepts don't translate directly. The compiler tells you what's lost, rather than pretending everything maps 1:1.
Everything described so far works without internet connectivity during compilation. The compiler runs locally: no API calls to Tensor9, no cloud provider APIs, no external services. Local-only compilation is the default architecture, not a special mode.
For fully disconnected environments (FedRAMP, defense, on-premises), the workflow extends to artifact distribution:
Release bundles package everything needed for deployment: container images (as Docker tars), Terraform configurations, and metadata. Vendors create bundles in their connected environment; customers apply them in their disconnected environment.
Terraform variables handle customer-specific configuration. The compiled
output references ${var.appliance_registry_uri} instead of hardcoded registry
URLs. Customers provide their registry URI, namespace, and other infrastructure details
when applying the bundle. Vendors don't need to know customer infrastructure.
Clean separation: vendors package and sign; customers verify and apply. The bundle is the interface between connected and disconnected worlds.
This architecture was designed with regulated and disconnected environments in mind. Artifact bundling, signing, and verification are covered in a separate post.
Now let's see how all these pieces fit together. The compiler processes infrastructure in four distinct phases, each with a clear responsibility. Understanding the pipeline helps explain why certain errors occur where they do, and why some transformations are possible while others aren't.
The Terraform lifter parses HCL and creates a StirGraph. Each resource becomes a node; each reference becomes an edge. The graph preserves the full structure of the original code (modules, variables, outputs, and all).
This phase is purely syntactic. The lifter doesn't know what an aws_elasticache_replication_group
means. It just knows it's a resource with certain attributes that reference other
resources. Errors here are parse errors: malformed HCL, invalid syntax, unresolved variables.
The raiser transforms cloud-specific resources to their dialect representation. This is
where semantic understanding enters the picture. The raiser knows that an ElastiCache
cluster with num_cache_clusters = 3 means "three replicas," and that this
maps to a replicas field in the Valkey dialect.
For an ElastiCache cluster, the raiser extracts semantic fields (replicas, TLS, auth), normalizes cloud-specific values (node type → memory in MB), and stores AWS-specific metadata in source map for potential round-trip recovery. Errors here are semantic: unsupported resource types, invalid configurations, missing required fields.
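A condensed sketch of this phase, reusing the ValkeyCluster and nodeTypeToMemoryMb sketches from earlier (hypothetical signatures; the real raiser handles far more fields and error cases, and the shard/replica split here is simplified):

// Hypothetical raiser sketch: ElastiCache resource attributes -> ValkeyCluster dialect node.
// AWS-only fields are stashed in source-map metadata for same-cloud round-trips.
fun raiseElastiCache(attrs: Map<String, Any?>, sourceMap: MutableMap<String, Any?>): ValkeyCluster {
    sourceMap["aws.node_type"] = attrs["node_type"]
    sourceMap["aws.subnet_group_name"] = attrs["subnet_group_name"]
    return ValkeyCluster(
        name = attrs["replication_group_id"] as String,
        version = attrs["engine_version"] as? String ?: "8.0",
        shards = (attrs["num_cache_clusters"] as Number).toInt(),   // structural mapping simplified
        replicasPerShard = 1,
        memoryMb = nodeTypeToMemoryMb(attrs["node_type"] as String),
        cpuMillis = null,
        tls = attrs["transit_encryption_enabled"] as? Boolean ?: false,
        authEnabled = attrs["auth_token"] != null,
        persistence = false,
    )
}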
The lowerer reads the dialect representation and creates target-specific resources. Unlike raising, lowering sometimes requires decisions that weren't explicit in the dialect (which Helm chart version? what namespace?). Resolvers handle those decisions.
For the Valkey Operator, the lowerer generates a kubernetes_manifest with the Valkey CRD.
Memory becomes resource limits, replicas map directly, TLS and auth
settings map to chart values. Errors here are typically configuration conflicts: the
dialect says one thing, but the target platform can't accommodate it.
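And the corresponding lowerer sketch, producing the operator CRD shape shown earlier (again hypothetical; anything the dialect doesn't determine, like the namespace, would go through resolvers):

// Hypothetical lowerer sketch: ValkeyCluster dialect node -> Valkey Operator CRD as a map,
// which the emitter then serializes into a kubernetes_manifest resource.
fun lowerToValkeyOperator(cluster: ValkeyCluster, namespace: String): Map<String, Any> = mapOf(
    "apiVersion" to "rds.valkey.buf.red/v1alpha1",
    "kind" to "Valkey",
    "metadata" to mapOf("name" to cluster.name, "namespace" to namespace),
    "spec" to mapOf(
        "version" to cluster.version,
        "arch" to "cluster",
        "replicas" to mapOf(
            "shards" to cluster.shards,
            "replicasOfShard" to cluster.replicasPerShard,
        ),
        "resources" to mapOf("limits" to mapOf("memory" to memoryMbToK8sQuantity(cluster.memoryMb))),
        "access" to mapOf("enableTLS" to cluster.tls),
    ),
)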
The emitter walks the StirGraph and generates valid Terraform HCL. This phase is again purely syntactic: it doesn't interpret meaning, just serializes the graph to text. The output can be applied to a Kubernetes cluster with the Helm provider configured.†
The emitted Terraform is formatted and commented for human readability. You can review, diff, and debug it like hand-written code. No minified blobs.
Errors here are rare. If the graph is well-formed, emission should succeed. When they do occur, they're usually bugs in the compiler itself rather than problems with the input infrastructure.
| Preserved | Transformed | Lost |
|---|---|---|
| 3 replicas; TLS enabled; auth enabled; ~13GB memory | node_type → resources.limits; auth_token → auth.password; automatic_failover → sentinel.enabled | multi_az_enabled; subnet_group_name; security_group_ids |
Compilation can fail for many reasons: missing required fields, incompatible configurations, features that don't exist on the target platform. The compiler surfaces these clearly rather than failing silently, because silent failures in infrastructure are the ones that wake you up at 3 AM.
Error handling follows the same approach as good compilers: collect as many issues as possible in a single pass, categorize them by severity, and provide actionable information about how to fix them. Nobody wants to fix one error, re-run, find another error, fix it, re-run, and repeat twenty times.
Not all issues are equal. Some prevent compilation entirely; others are warnings that you might choose to accept. The compiler distinguishes three levels: errors, which block compilation; warnings, which proceed once the risk is accepted; and informational notes, which record operational changes and advantages.
After compiling hundreds of real-world Terraform stacks, patterns emerge. Most issues fall into a few categories:
One category is lossy fields (multi_az_enabled has no Kubernetes equivalent). The compiler tells you what's lost and whether it affects your workload.
Issues are collected and presented together at the end. Many issues are related: fixing one might resolve others. Issues are deduplicated: if the same warning applies to 15 resources, it shows once with a count. The goal is a summary you can actually read, not a wall of repetitive text.
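A sketch of the deduplication idea (illustrative types, not the compiler's actual reporting model):

// Illustrative issue dedup: group identical messages, show each once with a count.
data class Issue(val severity: String, val message: String, val resource: String)

fun summarize(issues: List<Issue>): List<String> =
    issues.groupBy { it.severity to it.message }
        .map { (key, group) ->
            val (severity, message) = key
            val suffix = if (group.size > 1) " (${group.size} resources)" else " (${group.first().resource})"
            "$severity: $message$suffix"
        }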
The Valkey examples throughout this post are clean, but caches are relatively simple. Let me walk through three more dialects that handle messier transformations: RabbitMQ, Kafka, and PostgreSQL. Each shows the same patterns (dialects, transforms, resolvers, negotiation) applied to services with more complexity.
RabbitMQ adds another dimension: messaging topology. The dialect captures cluster configuration (nodes, memory, storage), but also virtual hosts, users, and HA policies that define message durability and replication.
resource "aws_mq_broker" "rabbitmq" {
broker_name = "myapp-mq"
engine_type = "RabbitMQ"
engine_version = "3.12.0"
host_instance_type = "mq.m5.large"
deployment_mode = "CLUSTER_MULTI_AZ"
user {
username = "admin"
password = var.mq_password
}
}
Amazon MQ abstracts away clustering details that become explicit in K8s:
| Amazon MQ | Dialect | RabbitMQ Operator | Type |
|---|---|---|---|
| mq.m5.large | nodeMemoryMb: 8192 | resources.limits.memory: 8Gi | Lossless |
| deployment_mode: CLUSTER_MULTI_AZ | nodeCount: 3 | spec.replicas: 3 | Lossless |
| (implicit mirroring) | haMode: Quorum | spec.override.additionalConfig: quorum | Normalized |
| maintenance_window_start_time | (no equivalent) | (use K8s PodDisruptionBudget) | Lossy |
| logs.audit: true | auditLogging: true | (external log aggregation) | Lossy |
Amazon MQ's "classic" HA (mirrored queues) is deprecated in RabbitMQ 3.13+. The compiler normalizes to quorum queues when targeting modern RabbitMQ versions, noting the behavioral change in the compilation report.†
Quorum queues have different performance characteristics than mirrored queues: better consistency guarantees but higher latency. The report notes this trade-off.
RabbitMQ lowering requires resolving several ambiguous values.
The resolution chain for haMode:
// Resolution chain for RabbitMQ HA mode
1. Customer config: Preferred HA mode? → "quorum" (if specified)
2. Target version: RabbitMQ 3.13+? → prefer quorum (classic deprecated)
3. Source hint: Amazon MQ was using mirroring → "quorum" (modern equivalent)
4. Heuristic: Default to quorum for new deployments → "quorum"
Vendors and customers negotiate RabbitMQ configuration through complementary constraints:
// Vendor config
rabbitmq:
min_nodes: 3 # Minimum for quorum queues
min_memory_mb: 4096 # Memory-intensive workload
required_features:
- tls # Must have encryption
- quorum_queues # Data durability requirement
// Customer config
rabbitmq:
max_nodes: 5 # Cost ceiling
max_memory_mb: 8192 # Resource limit
storage_class: "fast-ssd"
erlang_cookie_secret: "rabbitmq-cookie" # Existing secret
The compiler validates requirements: 3 nodes (within ceiling), quorum queues (enabled), TLS (enabled). The customer's existing Erlang cookie secret is used for cluster authentication rather than generating a new one.
Amazon MQ → RabbitMQ Operator is lossy; several features don't translate. The reverse direction, RabbitMQ Operator → Amazon MQ, loses a different set: custom plugins (Amazon MQ has a fixed plugin set), sidecar containers, custom Erlang VM settings, and any operator-specific CRD fields.
What if the customer requires classic mirrored queues on a modern RabbitMQ version?
⚠ CONFIGURATION CONFLICT: rabbitmq.message-broker
Vendor requires: quorum_queues (data durability)
Customer config: ha_mode: classic_mirrored
Conflict: Classic mirrored queues are deprecated in RabbitMQ 3.13+
and cannot provide the durability guarantees vendor requires.
Resolution options:
- Customer updates to quorum queues (recommended)
- Vendor accepts classic mirrored queues on older RabbitMQ version
- Customer accepts reduced durability guarantees
Severity: ERROR (compilation cannot proceed)
Kafka is more complex. The dialect captures: cluster identity, broker configuration (count, resources, storage), topic defaults, security (TLS, SASL), ZooKeeper vs KRaft mode, and replication settings.
resource "aws_msk_cluster" "kafka" {
cluster_name = "myapp-kafka"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_group_info {
instance_type = "kafka.m5.large"
ebs_volume_size = 500
}
encryption_info {
encryption_in_transit {
client_broker = "TLS"
}
}
}
MSK has tight AWS integration that doesn't translate directly to Strimzi:
| MSK | Dialect | Strimzi | Type |
|---|---|---|---|
| kafka.m5.large | brokerMemoryMb: 8192 | resources.limits.memory: 8Gi | Lossless |
| number_of_broker_nodes: 3 | brokerCount: 3 | spec.kafka.replicas: 3 | Lossless |
| client_authentication.sasl.iam | saslMechanism: "IAM" | authentication.type: scram-sha-512 | Normalized |
| enhanced_monitoring: PER_TOPIC | (no equivalent) | (use Strimzi metrics) | Lossy |
| serverless mode | brokerCount: 3 (default) | spec.kafka.replicas: 3 | Operational |
MSK's IAM authentication is AWS-specific. When lowering to Strimzi, the compiler normalizes to SCRAM-SHA-512 and generates the required K8s Secrets. The compilation report notes this: "MSK IAM auth converted to SCRAM-SHA-512. Update client configs."†
MSK Serverless compiles but with significant operational warnings: auto-scaling becomes manual, pay-per-use becomes fixed capacity. The compilation report lists all operational changes.
Kafka lowering has its own resolution challenges.
The resolution chain for controllerMode:
// Resolution chain for Kafka controller mode
1. Customer config: Preferred mode? → "kraft" (if specified)
2. Target capability: Strimzi version supports KRaft? → true (v0.36+)
3. Source hint: MSK was ZooKeeper-based → prefer ZooKeeper for compatibility
4. Heuristic: Default to KRaft for new deployments → "kraft"
Vendors and customers negotiate Kafka configuration through complementary constraints:
// Vendor config
kafka:
min_brokers: 3 # Minimum for HA
min_replication_factor: 3 # Data durability requirement
required_features:
- tls # Must have encryption in transit
- sasl # Must have authentication
// Customer config
kafka:
max_brokers: 6 # Cost ceiling
allowed_sasl_mechanisms:
- scram-sha-512 # We support this
storage_class: "fast-ssd"
controller_mode: "kraft" # Prefer KRaft
The compiler validates that vendor requirements are satisfiable within customer constraints: 3 brokers (within ceiling), SCRAM-SHA-512 (allowed), TLS enabled. The customer's KRaft preference is honored since the target supports it.
MSK → Strimzi is lossy; some information doesn't survive the trip. Strimzi → MSK would lose different things: KRaft mode (MSK is ZooKeeper-only for now), custom JVM settings, sidecar containers. Each direction has its own lossy dimensions.
MSK Serverless compiles to Strimzi, but with significant operational warnings:
⚠ OPERATIONAL CHANGES: kafka.event-stream
Source: MSK Serverless
Target: Strimzi Kafka cluster (3 brokers, 8Gi memory each)
The following operational characteristics change significantly:
⚠ SCALING: MSK Serverless auto-scales transparently.
Strimzi requires manual broker count changes.
Action: Monitor throughput and scale proactively.
⚠ BILLING: MSK Serverless is pay-per-use.
Strimzi runs fixed capacity 24/7.
Action: Right-size brokers for expected load.
⚠ MAINTENANCE: MSK Serverless handles broker patching.
Strimzi requires planned rolling updates.
Action: Schedule maintenance windows.
⚠ CAPACITY: MSK Serverless has no partition limits per broker.
Strimzi performance degrades beyond ~4000 partitions/broker.
Action: Plan partition strategy for growth.
Severity: WARNING (compilation proceeds, review operational changes)
The PostgreSQL dialect captures: identity, compute resources (CPU, memory), storage (size, IOPS), replication (replicas, sync mode), version, extensions, connection limits, and backup configuration.
resource "aws_rds_cluster" "db" {
cluster_identifier = "myapp-db"
engine = "aurora-postgresql"
engine_version = "15.4"
database_name = "myapp"
master_username = "admin"
master_password = var.db_password
storage_encrypted = true
allocated_storage = 100
iops = 3000
}
resource "aws_rds_cluster_instance" "writer" {
identifier = "myapp-db-writer"
cluster_identifier = aws_rds_cluster.db.id
instance_class = "db.r5.large"
engine = aws_rds_cluster.db.engine
}
resource "aws_rds_cluster_instance" "reader" {
identifier = "myapp-db-reader"
cluster_identifier = aws_rds_cluster.db.id
instance_class = "db.r5.large"
engine = aws_rds_cluster.db.engine
}
Notice what's absent: no db_instance_class, no storage_type,
no parameter_group_name. Those are AWS-specific. And no
storageClassName or podAntiAffinity. Those are K8s-specific.
The dialect captures what a PostgreSQL cluster is, not how any platform implements it.
RDS Aurora and CloudNativePG have different feature sets:
| RDS Aurora | Dialect | CloudNativePG | Type |
|---|---|---|---|
| db.r6g.xlarge | memoryMb: 32768 | resources.limits.memory: 32Gi | Lossless |
| engine_version: "15.4" | version: "15.4" | imageName: ...postgresql:15.4 | Lossless |
| multi_az: true | (no equivalent) | (HA via replica count) | Lossy |
| performance_insights_enabled | (no equivalent) | (use pg_stat_statements) | Lossy |
| (auto-scales) | storageSizeGb: 100 | storage.size: 100Gi | Synthetic |
Aurora auto-scales storage; you don't specify a size. CloudNativePG requires an explicit PVC size. The lowerer synthesizes a value based on current usage plus headroom, or uses a customer-provided default. The compilation report flags this as a synthetic field.†
Aurora's storage auto-scaling is genuinely better for most workloads. When compiling to K8s, you trade that convenience for explicit capacity planning.
When lowering to CloudNativePG, several fields can't be deterministically derived.
The resolution chain for storageClassName:
// Resolution chain for PostgreSQL storage class
1. Metadata: Did a previous compilation set this? → "gp3-encrypted"
2. Customer config: What storage classes are allowed? → ["gp3-encrypted", "io2-fast"]
3. Dialect requirement: IOPS ≥ 3000 → filters to ["io2-fast"]
4. Heuristic: Pick first allowed → "io2-fast"
Vendors and customers negotiate PostgreSQL configuration through complementary constraints:
// Vendor config (SvcDefVendor)
postgres:
min_memory_mb: 16384 # Need at least 16GB for our workload
required_extensions:
- postgis # Spatial queries
- pg_stat_statements # Query monitoring
min_connections: 200 # Our connection pool size
// Customer config (SvcTuning)
postgres:
max_memory_mb: 32768 # Cost constraint
allowed_extensions:
- postgis
- pg_stat_statements
- pgvector # We also allow this
storage_class: "gp3-encrypted"
The compiler resolves memoryMb to 16384 (vendor minimum within customer ceiling),
enables the required extensions (all allowed), and uses the customer's storage class.
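That memory resolution is essentially a floor/ceiling check; a simplified sketch (hypothetical function, not the compiler's API):

// Floor/ceiling resolution: vendor minimum within customer maximum, else a conflict.
fun resolveMemoryMb(vendorMinMb: Int, customerMaxMb: Int): Int {
    require(vendorMinMb <= customerMaxMb) {
        "Vendor requires ${vendorMinMb}MB but customer allows at most ${customerMaxMb}MB"
    }
    return vendorMinMb  // smallest value satisfying the vendor, respecting the customer's cost ceiling
}

// resolveMemoryMb(16384, 32768) == 16384
// resolveMemoryMb(16384, 6144) fails, surfaced as a compilation conflict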
What if the customer doesn't allow a required extension?
⚠ CONFIGURATION CONFLICT: postgres.myapp-db
Vendor requires extension: postgis
Customer allowed_extensions: [pg_stat_statements, pgvector]
Resolution: Extension 'postgis' is required for spatial queries.
Either add to allowed_extensions or discuss with vendor
about alternative approaches.
Severity: ERROR (compilation cannot proceed)
AWS, GCP, Azure, and Kubernetes are different representations of the same underlying service properties (memory, replicas, TLS, authentication).
Some transformations preserve all information (same-cloud round-trips). Some collapse properties (cross-cloud, where features don't exist on the target). The compiler reports which is which.
Compilation is necessary but not sufficient. Producing correct Terraform is perhaps 40% of the multi-cloud deployment problem. The harder 60% is operational.
These are real concerns that I'll get into some other time. The compiler can tell you that Cluster mode has 1–6 second failover; it can't tell you whether your application handles redirects correctly.
-mtp 2026-01-30