Platform Engineer Guide¶
This guide is for platform engineers responsible for enabling self-service infrastructure in InfraKitchen. It covers the key concepts, patterns, and best practices needed to deliver secure, consistent, and auditable infrastructure workflows at scale.
✅ Quick Checklist¶
- Create cloud integrations (AWS/Azure/GCP + any ancillary: Datadog, MongoDB).
- Create Git provider integration(s).
- Register workspaces (Git repos for generated resource code).
- Define hierarchy (Account → Environment/Region → Network → Platform → App Layer).
- Import or create templates (abstract first, then concrete).
- Add initial source code versions (tag/branch + module path).
- Enable approval flow (optional policy).
- Establish naming + tagging conventions.
- Create abstract resources (Account / Environment).
- Provision first concrete resource (e.g. VPC) with dry run.
- Cascade child resources (e.g. VPC → EKS → Namespace).
🔗 Integrations¶
1.1 Cloud Provider¶
Configure cloud provider integrations to enable connectivity, management, and automation of workloads across environments.
- Use least‑privilege roles/service principals.
- Separate read vs write if future compliance requires.
- Rotate credentials regularly. Updating integration does not break existing resources.
1.2 Git Provider¶
Add GitHub/Bitbucket/Azure DevOps integration with repo write + PR permissions.
- PAT / App / Service Connection must allow branch create + merge.
- Avoid using personal tokens for long‑term automation.
1.3 Auth¶
Register SSO (GitHub/Microsoft) and tooling (Datadog, MongoDB) as needed for downstream templates.
For more details, see Integrations Page.
🗂️ Workspaces¶
Register a pre‑created repository as a workspace to hold generated resource code.
- Keep main branch protected. Approvals occur via PR.
- Repository layout auto:
<template_key>/<resource_name_slug>/. - Use CI hooks (lint, policy, security scan) on PRs for infra quality gates.
Branch prefixes:
update_for create/update.delete_for destroy (code removal).
Workspace usage:
- Review PR for every update/destroy (branch diff shows adds/removals).
- CI gates (format, security scan, policy).
- Merge = approval; close unmerged = reject.
- Drift avoidance: do not hand‑edit generated folders. Push changes via resource update.
For more details, see Workspace Page.
🏗️ Template Hierarchy Design¶
Start abstract → concrete:
aws_account(abstract)aws_environment_region(abstract)aws_vpceks_clusterkubernetes_namespace. Add others (rds_instance, s3_bucket, alb) under appropriate parent.
Rules:
- Avoid circular parent refs.
- Mark purely structural nodes as abstract.
- Disable templates before large refactors. Re‑enable after validation.
📝 Creating Templates¶
Option A: Import From Repository¶
Use existing Terraform/OpenTofu module:
- Provide integration, repo, path, tag/branch.
- Auto analysis extracts variables & outputs.
- Creates initial source code version.
Option B: Manual Template Then Version¶
- Create template metadata (name, key, abstract flag, parents).
- Add source code version linking repo + module path + tag.
Revision number increments on metadata edits. Resource compatibility unaffected unless parent links change (which gate new resources but not existing ones).
For more details, see Templates Page.
🏷️ Source Code Version Strategy¶
- Tag per release (e.g.
v1.2.0). - Keep variable schema changes backward compatible.
- Test new versions in staging environment resource before promoting to production.
Change process:
- Add new version.
- Create staging resource update (dry run → apply).
- If safe, update production resource referencing new version.
🔄 Resource Lifecycle (Engineer View)¶
| Phase | Action | Key Safety Controls |
|---|---|---|
| Create | Resource record only | Parent state check, variable validation, inheritance applied. |
| Dry Run | Plan | Review adds/changes/destroys. Prevents surprise costs. |
| Provision | Apply | Isolated execution. Outputs stored. |
| Update | Temp state + plan | Diff review + approval (if enabled). |
| Destroy | Infra removal, record retained | Child dependency + output usage checks. |
| Delete | DB record purge | Only after destroyed. Audit archived. |
| Recreate | Re-provision after reject/destroy | Revision increment for traceability. |
🏢 Abstract Resources First¶
Create top‑level abstract resources:
prod-accountprod-us-east-1(environment/region)
Attach:
- Dependency Tags (e.g.
BillingCode=PROD-123inherited). - Dependency Configs (e.g.
aws_region=us-east-1,name_prefix=prod).
These automatically seed child variable defaults and enforce uniform tagging.
🚀 Provision Concrete Resources¶
Sequence Example:
- Create VPC resource under
prod-us-east-1(inherit region + tags). - Dry run → review CIDR correctness + cost tags.
- Provision → capture outputs (
vpc_id,subnet_ids). - Create EKS cluster resource referencing VPC outputs.
- Provision namespaces referencing cluster outputs.
For more details, see Resource Page.
✏️ Updating Resource¶
Workflow:
- Request update → temp state.
- Adjust variables or source code version.
- Dry run (plan) → ensure only expected changes.
- Approval (if enabled) → apply.
- Outputs refreshed. Revision number increments.
🗑️ Destroy & Delete Policy¶
Recommended:
- Enforce “Destroy before Delete” for audit retention.
- Batch cleanup: destroy leaf resources first; walk upward.
Blocked scenarios:
- Active child
provisionedresources. - Downstream dependencies (outputs referenced). Resolve by destroying or detaching dependents.
🛡️ Governance & RBAC¶
Roles (simplified):
- Owner/Admin: approve + destructive actions.
- Write: create/update resources.
- Read: view only.
💰 Cost & Compliance Tips¶
| Goal | Mechanism |
|---|---|
| Cost Allocation | Mandatory inherited tags from abstract account/environment. |
| Least Privilege | Separate integration per account. Scope permissions. |
| Change Audit | Use workspaces + approval flow + revision numbers. |
| Cleanup | Scheduled review of provisioned but idle resources. Run destroy sequences. |
| Consistency | Abstract configs for shared values (CIDR ranges, region, naming prefix). |
🧰 Troubleshooting Quick Reference¶
| Symptom | Likely Cause | Action |
|---|---|---|
| Plan shows unexpected destroys | Variable mismatch or source version drift | Re-check inherited configs. Compare previous revision. |
Resource stuck in_progress |
Long apply or state lock | Inspect logs. Wait for known long tasks. |
| Cannot destroy | Active children or output dependencies | Destroy children first. Clear references. |
| Variables missing | Source code version analysis failed | Re-analyze. Confirm path & variables.tf presence. |
| PR not created | Existing branch/PR | Approve or close current PR. Retrigger update. |
Status error after apply |
Terraform failure | Review logs. Fix variables/integration. Retry. |
📚 Summary¶
Establish hierarchy & governance first (abstract resources + tags/configs), then incrementally introduce concrete templates and controlled version updates. Use dry runs, approvals, and workspaces as guardrails. Keep naming/tagging uniform, rotate credentials, and leverage revision + audit trails for accountability.