Cluster Infrastructure Services

Ominis Cluster Manager uses a two-tier infrastructure model: cluster-wide services (shared) and tenant services (isolated). This document provides a deep dive into the cluster infrastructure layer that powers all tenant deployments.

Introduction

Cluster vs Tenant Infrastructure Model

The Ominis platform uses a two-tier infrastructure approach to balance efficiency, isolation, and operational simplicity:

Cluster Infrastructure (this document):

  • Deployed once per Kubernetes cluster
  • Shared by all tenants
  • Namespaces: cert-manager, authentik, vaultwarden, flow-proxy, homer, excalidraw
  • Examples: TLS certificate management, identity provider, HTTP ingress, SIP monitoring
  • Repository: cluster-infra

Tenant Infrastructure:

  • Deployed per customer/tenant
  • Isolated in dedicated namespaces
  • Namespace pattern: client-{tenant-name} (e.g., client-demo-client)
  • Examples: API servers, queue pods, IVR pods, databases, tenant-specific ingress
  • Repository: cluster-manager

Why Separate?

This separation provides several key benefits:

Benefit | Description
--- | ---
Cost Efficiency | One cert-manager instance serves all tenants instead of N instances
Operational Simplicity | Single upgrade, single monitoring stack, centralized configuration
Resource Optimization | Shared ingress controller reduces pod overhead
Faster Onboarding | New tenants start immediately without waiting for infrastructure
Security Isolation | Tenant workloads isolated via namespaces and network policies
Centralized Management | Identity, secrets, TLS certificates managed in one place

Relationship to Tenant Infrastructure

This documentation focuses on cluster-level services. For tenant-specific infrastructure (API servers, queues, databases), see Helm Infrastructure Deployment.

Cluster Architecture Overview

The cluster infrastructure consists of six core services that work together to provide shared capabilities:

Service Dependencies

Understanding service dependencies is critical for deployment order and troubleshooting; a readiness check is sketched just after the list below:

Key Dependencies:

  • Cert-Manager is the foundation (no dependencies)
  • Authentik requires cert-manager for TLS
  • Vaultwarden requires cert-manager for TLS
  • Flow-Proxy discovers ingress resources dynamically
  • Homer and Excalidraw are independent (optional)
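A minimal readiness check before installing the TLS-dependent services, assuming the letsencrypt-prod ClusterIssuer defined later in this document:

# Cert-manager must be Available before anything that requests certificates
kubectl wait --for=condition=Available deployment/cert-manager \
  -n cert-manager --timeout=120s

# The ClusterIssuer should report Ready=True before deploying Authentik or Vaultwarden
kubectl get clusterissuer letsencrypt-prod \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'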

Service Catalog

1. Cert-Manager: Automated TLS Certificate Management

Purpose: Automated TLS certificate management via the ACME protocol, using Let's Encrypt as the certificate authority.

Architecture

Cert-Manager uses Kubernetes Custom Resource Definitions (CRDs) to automate certificate lifecycle:

  • ClusterIssuer: Cluster-wide certificate authority (Let's Encrypt)
  • Certificate: Defines a certificate request
  • CertificateRequest: Generated automatically by Certificate
  • Challenge: ACME HTTP-01 or DNS-01 validation

Certificate Lifecycle
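A Certificate spawns a CertificateRequest, which drives an ACME Order and its Challenges; once validation succeeds, cert-manager stores the signed key pair in the target Secret. One way to watch this chain for a given certificate (namespace and name are placeholders):

# Follow the resource chain for one certificate
kubectl get certificate,certificaterequest,order,challenge -n <namespace>

# Inspect progress and events at each stage
kubectl describe certificate <name> -n <namespace>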

Configuration

ClusterIssuer Definition:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Let's Encrypt production server
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@ominis.ai
    # Private key for ACME account
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      # HTTP-01 challenge using Traefik ingress
      - http01:
          ingress:
            class: traefik

Certificate Request Example:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: demo-client-api-tls
  namespace: client-demo-client
spec:
  secretName: demo-client-api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - demo-client-api.app.ominis.ai

Integration with Tenant Ingress

Tenants consume cert-manager via ingress annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: client-demo-client
  annotations:
    # This annotation triggers cert-manager
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - demo-client-api.app.ominis.ai
      secretName: demo-client-api-tls # Cert-manager creates this
  rules:
    - host: demo-client-api.app.ominis.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8000

Use Cases

Use Case | Description
--- | ---
Tenant API Ingress | demo-client-api.app.ominis.ai automatically gets TLS
Documentation Site | docs.app.ominis.ai with Let's Encrypt certificate
Internal Services | Any service with ingress gets free TLS
Automatic Renewal | Certificates renewed about 30 days before expiry (cert-manager default)
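Each Certificate reports its planned renewal in status.renewalTime; a quick way to audit expiry and renewal across a namespace (placeholder namespace):

kubectl get certificate -n <namespace> \
  -o custom-columns=NAME:.metadata.name,EXPIRES:.status.notAfter,RENEWAL:.status.renewalTime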

Troubleshooting

Certificate not issued:

# Check certificate status
kubectl describe certificate demo-client-api-tls -n client-demo-client

# Check challenges
kubectl get challenges -n client-demo-client

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100

Common issues:

  • DNS not configured: Ensure DNS points to cluster ingress IP
  • HTTP-01 challenge failed: Check Traefik routing and firewall
  • Rate limit exceeded: Let's Encrypt has rate limits (50 certs/week per domain)

2. Authentik: Identity and Access Management

Purpose: Enterprise-grade identity provider for SSO, OAuth, OIDC, LDAP, and SAML.

Architecture

Authentik is a flows-based authentication system:

  • PostgreSQL Backend: Stores users, groups, applications
  • Flows: Customizable authentication pipelines
  • Providers: OAuth2/OIDC, SAML, LDAP
  • Policies: Fine-grained access control
  • Applications: Integrated services

Features

Feature | Description
--- | ---
Single Sign-On | Users log in once, access multiple services
OAuth2/OIDC Provider | Standard protocol for API authentication
LDAP Server | Bridge to legacy systems
SAML Provider | Enterprise SSO integration
User Management | Self-service enrollment, password reset
Multi-Factor Auth | TOTP, WebAuthn, SMS

Configuration Example

Helm Values:

authentik:
  # PostgreSQL backend
  postgresql:
    enabled: true
    persistence:
      enabled: true
      size: 10Gi

  # Ingress configuration
  ingress:
    enabled: true
    ingressClassName: traefik
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: auth.ominis.ai
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: auth-ominis-ai-tls
        hosts:
          - auth.ominis.ai

  # Secret key for session encryption
  secret_key: "changeme-random-secret-key"

  # Email configuration
  email:
    host: smtp.sendgrid.net
    port: 587
    username: apikey
    password: "SG.xxx"
    from: noreply@ominis.ai

Application Integration

Adding an Application:

  1. Create OAuth2/OIDC Provider in Authentik UI
  2. Configure redirect URIs
  3. Get client ID and secret
  4. Configure application to use Authentik

Example: Grafana Integration:

[auth.generic_oauth]
enabled = true
name = Authentik
client_id = grafana-client-id
client_secret = your-client-secret
scopes = openid profile email
auth_url = https://auth.ominis.ai/application/o/authorize/
token_url = https://auth.ominis.ai/application/o/token/
api_url = https://auth.ominis.ai/application/o/userinfo/

Use Cases

Use Case | Description
--- | ---
Internal Tool SSO | Grafana, Homer UI, admin dashboards
API OAuth | Secure API endpoints with OAuth2
LDAP Bridge | Connect legacy systems requiring LDAP
Multi-Tenant Isolation | User groups per tenant

3. Vaultwarden: Password Management

Purpose: Self-hosted password manager compatible with Bitwarden clients.

Architecture

  • Rust-based: Lightweight, efficient alternative to official Bitwarden
  • SQLite/PostgreSQL: Flexible storage backends
  • Client Compatible: Works with all Bitwarden clients (browser, mobile, CLI)
  • API Compatible: REST API for automation
  • Organizations: Team password sharing

Terraform Provider Integration

Vaultwarden includes a custom Terraform provider for infrastructure-as-code:

Provider Configuration:

terraform {
  required_providers {
    vaultwarden = {
      source  = "ominis-ai/vaultwarden"
      version = "~> 1.0"
    }
  }
}

provider "vaultwarden" {
  endpoint = "https://vault.ominis.ai"
  email    = "admin@ominis.ai"
  password = var.admin_password
}

Manage Organizations:

resource "vaultwarden_organization" "platform" {
  name = "Platform Team"
}

resource "vaultwarden_organization_collection" "credentials" {
  organization_id = vaultwarden_organization.platform.id
  name            = "Production Credentials"
}

resource "vaultwarden_login_item" "database" {
  organization_id = vaultwarden_organization.platform.id
  collection_id   = vaultwarden_organization_collection.credentials.id
  name            = "PostgreSQL Root"
  username        = "postgres"
  password        = random_password.db_password.result
  uris            = ["postgres://postgres.prod.svc.cluster.local:5432"]
  notes           = "Production database root credentials"
}
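The standard Terraform workflow applies; a sketch, assuming the admin password is passed as a variable:

terraform init
terraform plan -var "admin_password=<admin-password>"
terraform apply -var "admin_password=<admin-password>"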

Deployment Configuration

Helm Values:

vaultwarden:
  # Image
  image:
    repository: vaultwarden/server
    tag: latest

  # Persistence
  persistence:
    enabled: true
    size: 1Gi

  # Ingress
  ingress:
    enabled: true
    className: traefik
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: vault.ominis.ai
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: vault-ominis-ai-tls
        hosts:
          - vault.ominis.ai

  # Environment variables
  env:
    DOMAIN: "https://vault.ominis.ai"
    SIGNUPS_ALLOWED: "false"
    INVITATIONS_ALLOWED: "true"
    ADMIN_TOKEN: "changeme-admin-token"

Use Cases

Use Case | Description
--- | ---
Team Credentials | Share API keys, passwords securely
API Key Management | Store tokens for external services
Certificate Storage | TLS certificates and private keys
SSH Key Management | Store SSH keys for server access
Infrastructure Secrets | Terraform-managed secret lifecycle

Backup Strategy

Database Backup:

# Backup Vaultwarden data
kubectl exec -n vaultwarden vaultwarden-0 -- \
tar -czf - /data > vaultwarden-backup-$(date +%Y%m%d).tar.gz

# Upload to S3
aws s3 cp vaultwarden-backup-$(date +%Y%m%d).tar.gz \
s3://backups/vaultwarden/

4. Flow-Proxy: HTTP Reverse Proxy (Traefik)

Purpose: Dynamic HTTP reverse proxy and Kubernetes ingress controller.

Architecture

Traefik is a cloud-native ingress controller:

  • Dynamic Configuration: Auto-discovers Kubernetes ingress resources
  • Middleware System: Composable request/response transformations
  • TLS Termination: Integrates with cert-manager
  • Load Balancing: Round-robin, weighted, sticky sessions
  • Observability: Metrics, tracing, access logs

Ingress Pattern

Standard Tenant Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  namespace: client-example
  annotations:
    # Cert-manager integration
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # Traefik middleware (optional)
    traefik.ingress.kubernetes.io/router.middlewares: "default-security-headers@kubernetescrd"
spec:
  # Traefik ingress class
  ingressClassName: traefik
  tls:
    - hosts:
        - example-api.app.ominis.ai
      secretName: example-api-tls
  rules:
    - host: example-api.app.ominis.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8000

Middleware System

Traefik middleware enables request/response transformations:

Security Headers Middleware:

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: security-headers
  namespace: default
spec:
  headers:
    customResponseHeaders:
      X-Frame-Options: "SAMEORIGIN"
      X-Content-Type-Options: "nosniff"
      X-XSS-Protection: "1; mode=block"
      Referrer-Policy: "no-referrer-when-downgrade"
      Permissions-Policy: "geolocation=(), microphone=(), camera=()"

Rate Limiting Middleware:

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: rate-limit
  namespace: default
spec:
  rateLimit:
    average: 100
    burst: 50
    period: 1m

Compression Middleware:

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: compress
  namespace: default
spec:
  compress: {}

Apply Middleware to Ingress:

metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: >-
      default-security-headers@kubernetescrd,
      default-rate-limit@kubernetescrd,
      default-compress@kubernetescrd
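Note the reference format: Traefik names middleware defined via Kubernetes CRDs as {namespace}-{name}@kubernetescrd, which is why the security-headers middleware in the default namespace is referenced as default-security-headers@kubernetescrd.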

Configuration

Helm Values:

flow-proxy:
  # Deployment
  deployment:
    replicas: 2

  # Service (LoadBalancer for external access)
  service:
    type: LoadBalancer
    annotations:
      # OVH LoadBalancer
      service.beta.kubernetes.io/ovh-loadbalancer-flavor: "small"

  # Ports
  ports:
    web:
      port: 80
      exposedPort: 80
    websecure:
      port: 443
      exposedPort: 443
      tls:
        enabled: true

  # TLS options
  tlsOptions:
    default:
      minVersion: VersionTLS12
      cipherSuites:
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384

  # Access logs
  logs:
    access:
      enabled: true
      format: json

  # Metrics
  metrics:
    prometheus:
      enabled: true

Use Cases

Use Case | Description
--- | ---
Tenant API Routing | Route {tenant}-api.app.ominis.ai to tenant pods
Static Site Hosting | Serve documentation, landing pages
WebSocket Proxying | Real-time connections for IVR, monitoring
Path-Based Routing | /api → API service, /docs → docs service

Dashboard Access

# Port-forward to dashboard
kubectl port-forward -n flow-proxy \
svc/traefik-dashboard 9000:9000

# Access dashboard
open http://localhost:9000/dashboard/

5. Homer: SIP Capture and VoIP Monitoring

Purpose: SIP protocol capture, analysis, and monitoring using HEP (Homer Encapsulation Protocol).

Architecture

Homer provides end-to-end VoIP monitoring:

  • Heplify: Capture agent (DaemonSet on all nodes)
  • Homer-App: Web UI for call flow analysis
  • PostgreSQL: Timeseries storage for SIP messages
  • HEP Protocol: Encapsulated SIP packet transport

Components

Heplify DaemonSet

Heplify runs on every node to capture SIP traffic:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heplify
  namespace: homer
spec:
  selector:
    matchLabels:
      app: heplify
  template:
    metadata:
      labels:
        app: heplify
    spec:
      # Host network to capture traffic
      hostNetwork: true
      hostPID: true
      containers:
        - name: heplify
          image: sipcapture/heplify:latest
          securityContext:
            privileged: true
            capabilities:
              add:
                - NET_ADMIN
                - NET_RAW
          args:
            # Capture on all interfaces
            - -i
            - any
            # Homer server endpoint
            - -hs
            - homer-app.homer.svc.cluster.local:9060
            # Capture SIP protocol
            - -m
            - SIP
            # Port range for SIP
            - -pr
            - "5060-5061"
            # Capture SIP and SDP
            - -dim
            - sip,sdp
          resources:
            requests:
              memory: "64Mi"
              cpu: "100m"
            limits:
              memory: "128Mi"
              cpu: "500m"

FreeSWITCH HEP Configuration

Configure FreeSWITCH to send SIP traces to Homer:

sofia.conf.xml:

<configuration name="sofia.conf" description="sofia Endpoint">
  <global_settings>
    <!-- Enable HEP capture -->
    <param name="capture-server" value="udp:homer-app.homer.svc.cluster.local:9060"/>
  </global_settings>

  <profiles>
    <profile name="internal">
      <!-- Profile-specific HEP -->
      <settings>
        <param name="sip-capture" value="yes"/>
        <param name="capture-server" value="udp:homer-app.homer.svc.cluster.local:9060"/>
      </settings>
    </profile>
  </profiles>
</configuration>
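After changing sofia.conf.xml, the configuration must be reloaded and the profile restarted before HEP capture takes effect. A sketch using fs_cli inside a queue pod (pod and namespace names are placeholders):

kubectl exec -n <tenant-namespace> <queue-pod> -- fs_cli -x "reloadxml"
kubectl exec -n <tenant-namespace> <queue-pod> -- fs_cli -x "sofia profile internal restart"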

Analysis Features

Feature | Description
--- | ---
Call Flow Diagrams | Ladder diagrams showing SIP message flow
Search & Filter | Find calls by caller ID, method, response code
Statistics | Call success rate, average duration, error rates
Real-Time Monitoring | Live SIP traffic analysis
Export | PCAP export for Wireshark analysis

Use Cases

Use Case | Description
--- | ---
Call Troubleshooting | Debug failed calls with full SIP trace
Quality Monitoring | Track call success rates, errors
Regulatory Compliance | SIP message logging for audits
Performance Analysis | Identify bottlenecks in call flow

6. Excalidraw: Collaborative Whiteboarding

Purpose: Real-time collaborative whiteboard for architecture diagrams and brainstorming.

Architecture

  • React-based: Modern web application
  • Real-Time Collaboration: WebSocket-based multi-user editing
  • Export Formats: PNG, SVG, JSON
  • Embedding: Diagrams can be embedded in documentation
  • Privacy: Self-hosted, data stays in cluster

Deployment Configuration

Helm Values:

excalidraw:
  # Deployment
  replicas: 2
  image:
    repository: excalidraw/excalidraw
    tag: latest

  # Persistence for saved diagrams
  persistence:
    enabled: true
    size: 5Gi
    storageClass: ""

  # Ingress
  ingress:
    enabled: true
    className: traefik
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: draw.ominis.ai
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: draw-ominis-ai-tls
        hosts:
          - draw.ominis.ai

  # Resources
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"

Use Cases

Use Case | Description
--- | ---
Architecture Diagrams | Network topology, system design
Incident Documentation | Draw current vs desired state
Team Brainstorming | Real-time collaborative design sessions
Documentation Assets | Export diagrams for docs

Deployment Order

Services must be deployed in order due to dependencies:

Rationale:

  1. Cert-Manager first: Required for TLS certificates
  2. Authentik second: Needs cert-manager for TLS
  3. Vaultwarden third: Needs cert-manager for TLS
  4. Flow-Proxy fourth: Discovers ingress resources after they exist
  5. Homer & Excalidraw: Independent, can be deployed in parallel

Deployment Sequence
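In short: cert-manager → authentik / vaultwarden (parallel) → flow-proxy → homer / excalidraw (parallel, optional).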

Deployment Procedures

Prerequisites Check

Before deployment, verify cluster readiness:

# Verify Kubernetes access
kubectl cluster-info

# Check Kubernetes version (1.24+)
kubectl version

# Verify Helm installation
helm version

# Check storage class exists
kubectl get storageclass

# Verify DNS is configured
dig app.ominis.ai

# Check LoadBalancer support (for flow-proxy)
kubectl get svc -A | grep LoadBalancer

Step 1: Deploy Cert-Manager

Cert-Manager is the foundation for TLS automation:

# Add Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install cert-manager with CRDs
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.13.0 \
--set installCRDs=true

# Wait for readiness
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/instance=cert-manager \
-n cert-manager \
--timeout=300s

# Verify installation
kubectl get pods -n cert-manager

Create ClusterIssuer:

# Create Let's Encrypt production issuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@ominis.ai
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik
EOF

# Verify ClusterIssuer
kubectl get clusterissuer letsencrypt-prod

Step 2: Deploy Authentik

cd /home/matt/projects/fml/cluster-infra

# Install Authentik
helm install authentik helm-charts/authentik/ \
--namespace authentik \
--create-namespace \
-f helm-charts/authentik/values.yaml

# Wait for pods
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=authentik \
-n authentik \
--timeout=300s

# Verify deployment
kubectl get pods -n authentik
kubectl get certificate -n authentik

# Get bootstrap password
kubectl get secret -n authentik authentik-bootstrap \
-o jsonpath='{.data.password}' | base64 -d

Access Authentik UI:

# Open browser
open https://auth.ominis.ai

# Default credentials:
# Username: akadmin
# Password: (from bootstrap secret)

Step 3: Deploy Vaultwarden

# Install Vaultwarden
helm install vaultwarden helm-charts/vaultwarden/ \
--namespace vaultwarden \
--create-namespace \
-f helm-charts/vaultwarden/values.yaml

# Wait for readiness
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=vaultwarden \
-n vaultwarden \
--timeout=180s

# Get admin token
kubectl get secret -n vaultwarden vaultwarden-admin \
-o jsonpath='{.data.token}' | base64 -d

# Verify
kubectl get pods -n vaultwarden
kubectl get certificate -n vaultwarden

Access Vaultwarden:

open https://vault.ominis.ai

Step 4: Deploy Flow-Proxy (Traefik)

# Install Traefik
helm install flow-proxy helm-charts/flow-proxy/ \
--namespace flow-proxy \
--create-namespace \
-f helm-charts/flow-proxy/values.yaml

# Wait for LoadBalancer IP
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=traefik \
-n flow-proxy \
--timeout=300s

# Get external IP
kubectl get svc -n flow-proxy traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Verify ingress controller
kubectl get pods -n flow-proxy
kubectl get svc -n flow-proxy

Configure DNS:

# Point wildcard DNS to LoadBalancer IP
# A record: *.app.ominis.ai → <EXTERNAL_IP>
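Once the record exists, resolution can be spot-checked against the LoadBalancer IP (the hostname below is an arbitrary name under the wildcard):

# Should print the flow-proxy LoadBalancer IP
dig +short anything.app.ominis.ai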

Step 5: Deploy Homer (Optional)

# Install Homer
helm install homer helm-charts/homer/ \
--namespace homer \
--create-namespace \
-f helm-charts/homer/values.yaml

# Deploy Heplify DaemonSet
kubectl apply -f manifests/homer/heplify-daemonset.yaml

# Verify
kubectl get pods -n homer
kubectl get daemonset -n homer heplify

Access Homer UI:

open https://homer.ominis.ai

Step 6: Deploy Excalidraw (Optional)

# Install Excalidraw
helm install excalidraw helm-charts/excalidraw/ \
--namespace excalidraw \
--create-namespace \
-f helm-charts/excalidraw/values.yaml

# Wait for readiness
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=excalidraw \
-n excalidraw \
--timeout=180s

# Verify
kubectl get pods -n excalidraw
kubectl get certificate -n excalidraw

Access Excalidraw:

open https://draw.ominis.ai

Verification & Testing

Cert-Manager Verification

Test Certificate Creation:

# Create test certificate
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert
  namespace: default
spec:
  secretName: test-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - test.app.ominis.ai
EOF

# Check certificate status
kubectl describe certificate test-cert -n default

# Wait for certificate to be ready
kubectl wait --for=condition=ready certificate test-cert -n default --timeout=120s

# Verify secret created
kubectl get secret test-tls -n default

Authentik Testing

# Access Authentik UI
open https://auth.ominis.ai

# Get bootstrap credentials
kubectl get secret -n authentik authentik-bootstrap \
-o jsonpath='{.data.password}' | base64 -d

# Test OAuth2 endpoint
curl https://auth.ominis.ai/application/o/.well-known/openid-configuration

Traefik Dashboard

# Port-forward to dashboard
kubectl port-forward -n flow-proxy \
svc/traefik-dashboard 9000:9000

# Access dashboard
open http://localhost:9000/dashboard/

# Check ingress routes
kubectl get ingressroute -A

Homer Access

# Check Homer UI
open https://homer.ominis.ai

# Check HEP traffic
kubectl logs -n homer -l app=heplify --tail=100

# Check Homer database
kubectl exec -n homer homer-postgres-0 -- \
psql -U homer -c "SELECT COUNT(*) FROM hep_proto_1_call;"

Operational Procedures

Upgrade Cluster Service

# Update Helm repository
helm repo update

# Preview changes (requires helm-diff plugin)
helm diff upgrade cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.14.0

# Perform upgrade
helm upgrade cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.14.0 \
--reuse-values

# Verify upgrade
kubectl get pods -n cert-manager
kubectl rollout status deployment -n cert-manager cert-manager

# Rollback if needed
helm rollback cert-manager -n cert-manager

Monitor Service Health

Check All Cluster Services:

# One-liner to check all namespaces
for ns in cert-manager authentik vaultwarden flow-proxy homer excalidraw; do
  echo "=== $ns ==="
  kubectl get pods -n "$ns"
done

# Check ingress status
kubectl get ingress -A

# Check certificate status
kubectl get certificate -A

# Check PVC usage
kubectl get pvc -A

Check Resource Usage:

# Top pods per namespace
kubectl top pods -n authentik
kubectl top pods -n vaultwarden
kubectl top pods -n flow-proxy

# Check logs
kubectl logs -n cert-manager -l app=cert-manager --tail=50
kubectl logs -n flow-proxy -l app.kubernetes.io/name=traefik --tail=50

Backup and Disaster Recovery

Backup Procedures

Vaultwarden Database:

# Backup PostgreSQL database
kubectl exec -n vaultwarden vaultwarden-postgres-0 -- \
pg_dump -U vaultwarden vaultwarden | gzip > vaultwarden-backup-$(date +%Y%m%d).sql.gz

# Upload to S3
aws s3 cp vaultwarden-backup-$(date +%Y%m%d).sql.gz \
s3://ominis-backups/vaultwarden/

# Automated daily backup (CronJob)
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vaultwarden-backup
  namespace: vaultwarden
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h vaultwarden-postgres -U vaultwarden vaultwarden | \
                    gzip > /backup/vaultwarden-\$(date +%Y%m%d).sql.gz
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
EOF

Authentik Database:

# Backup Authentik
kubectl exec -n authentik authentik-postgres-0 -- \
pg_dump -U authentik authentik | gzip > authentik-backup-$(date +%Y%m%d).sql.gz

# Upload to S3
aws s3 cp authentik-backup-$(date +%Y%m%d).sql.gz \
s3://ominis-backups/authentik/

Certificate Secrets:

# Export all TLS certificates
kubectl get secret -A \
-l cert-manager.io/certificate-name \
-o yaml > certificates-backup-$(date +%Y%m%d).yaml

# Upload to S3
aws s3 cp certificates-backup-$(date +%Y%m%d).yaml \
s3://ominis-backups/certificates/

Disaster Recovery

Restore Cluster Services:

# 1. Deploy cert-manager first
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.13.0 \
--set installCRDs=true

# 2. Restore ClusterIssuer
kubectl apply -f manifests/cert-manager/cluster-issuer.yaml

# 3. Deploy Authentik
helm install authentik helm-charts/authentik/ \
--namespace authentik \
--create-namespace

# 4. Restore Authentik database (-i passes the dump via stdin)
kubectl exec -i -n authentik authentik-postgres-0 -- \
  psql -U authentik -d authentik < authentik-backup.sql

# 5. Deploy other services
helm install vaultwarden helm-charts/vaultwarden/ --namespace vaultwarden --create-namespace
helm install flow-proxy helm-charts/flow-proxy/ --namespace flow-proxy --create-namespace

# 6. Restore Vaultwarden (-i passes the dump via stdin)
kubectl exec -i -n vaultwarden vaultwarden-postgres-0 -- \
  psql -U vaultwarden -d vaultwarden < vaultwarden-backup.sql

# 7. Verify certificate renewal
kubectl get certificate -A

Troubleshooting

Certificate Not Issued

Symptoms:

  • Certificate status: False
  • Ingress not accessible via HTTPS

Debug Steps:

# Check certificate status
kubectl describe certificate <name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100

# Check ACME challenges
kubectl get challenges -A

# Describe challenge
kubectl describe challenge <challenge-name> -n <namespace>

Common Issues:

Issue | Solution
--- | ---
DNS not configured | Ensure DNS points to LoadBalancer IP
HTTP-01 challenge failed | Check Traefik routing, firewall rules
Rate limit exceeded | Let's Encrypt: 50 certs/week per domain
Invalid email | Update ClusterIssuer with valid email

Ingress Not Working

Symptoms:

  • Service not accessible externally
  • 404 errors

Debug Steps:

# Check Traefik logs
kubectl logs -n flow-proxy -l app.kubernetes.io/name=traefik --tail=100

# Verify ingress resource
kubectl describe ingress <name> -n <namespace>

# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Check pod status
kubectl get pods -n <namespace>

# Port-forward to test service directly
kubectl port-forward -n <namespace> svc/<service> 8080:80
curl http://localhost:8080

Common Issues:

Issue | Solution
--- | ---
Service has no endpoints | Check pod labels match service selector
Wrong port in ingress | Verify service port matches ingress backend
Missing ingress class | Add ingressClassName: traefik
DNS not updated | Wait for DNS propagation or check records

Service Unavailable

Symptoms:

  • Pods not starting
  • CrashLoopBackOff

Debug Steps:

# Check pod status
kubectl get pods -n <namespace>

# Check pod logs
kubectl logs -n <namespace> <pod-name>

# Check previous pod logs (if restarting)
kubectl logs -n <namespace> <pod-name> --previous

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Common Issues:

Issue | Solution
--- | ---
Image pull errors | Check image name, registry credentials
Insufficient resources | Increase node capacity or reduce requests
Volume mount issues | Check PVC exists and is bound
Configuration errors | Review ConfigMaps and Secrets

Homer Not Capturing Traffic

Symptoms:

  • No SIP calls in Homer UI
  • Heplify logs show no traffic

Debug Steps:

# Check Heplify DaemonSet
kubectl get daemonset -n homer heplify

# Check Heplify logs
kubectl logs -n homer -l app=heplify --tail=100

# Verify FreeSWITCH HEP configuration
kubectl exec -n <tenant-namespace> <queue-pod> -- \
fs_cli -x "sofia status profile internal"

# Check Homer database
kubectl exec -n homer homer-postgres-0 -- \
psql -U homer -c "SELECT COUNT(*) FROM hep_proto_1_call;"

Common Issues:

Issue | Solution
--- | ---
FreeSWITCH not configured | Add sip-capture=yes to Sofia profile
Wrong Homer endpoint | Update capture-server parameter
Network policy blocking | Allow traffic from pods to Homer
Heplify not running | Check DaemonSet status, hostNetwork=true

Integration with Tenant Infrastructure

How Tenants Consume Cluster Services

Tenants consume cluster services through standard Kubernetes patterns:

1. TLS Certificates

Tenants request certificates via ingress annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: client-demo-client
  annotations:
    # Reference ClusterIssuer (cluster service)
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: traefik # Use cluster ingress controller
  tls:
    - hosts:
        - demo-client-api.app.ominis.ai
      secretName: demo-client-api-tls # Cert-manager creates this
  rules:
    - host: demo-client-api.app.ominis.ai
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 8000

Flow:

  1. Tenant creates Ingress with cert-manager.io/cluster-issuer annotation
  2. Cert-manager detects annotation, creates Certificate resource
  3. Certificate issued via Let's Encrypt ACME challenge
  4. Secret created in tenant namespace with TLS cert/key
  5. Traefik discovers ingress and terminates TLS

2. Ingress Routing

Tenants define Ingress resources, Traefik auto-discovers:

Pattern: {tenant}-api.app.ominis.ai

# Tenant A
host: demo-client-api.app.ominis.ai

# Tenant B
host: acme-corp-api.app.ominis.ai

Traefik Configuration:

  • Automatically discovers all Ingress resources
  • Routes based on host field
  • Applies middleware if specified
  • Terminates TLS using certificate secret

3. Authentication (Future)

Tenants can integrate with Authentik for OAuth:

Traefik Middleware for OAuth:

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: oauth-auth
  namespace: client-demo-client
spec:
  forwardAuth:
    address: https://auth.ominis.ai/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-Auth-User
      - X-Auth-Email

Apply to Ingress:

metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: >-
      client-demo-client-oauth-auth@kubernetescrd

4. SIP Monitoring

Tenant queue pods send SIP traffic to Homer:

FreeSWITCH Configuration (in queue pod):

<configuration name="sofia.conf">
  <global_settings>
    <param name="capture-server" value="udp:homer-app.homer.svc.cluster.local:9060"/>
  </global_settings>
</configuration>

Heplify (DaemonSet on all nodes):

  • Captures SIP packets on all interfaces
  • Encapsulates in HEP protocol
  • Sends to Homer API
  • Labels traffic by pod name/namespace

Homer UI:

  • Tenant-scoped call analysis
  • Search by namespace or pod name
  • Full SIP trace for troubleshooting

Security Considerations

Network Policies

Tenant Isolation:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: client-demo-client
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow from same namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: client-demo-client
    # Allow from ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              name: flow-proxy
  egress:
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53
    # Allow to cluster services
    - to:
        - namespaceSelector:
            matchLabels:
              name: cert-manager
    - to:
        - namespaceSelector:
            matchLabels:
              name: homer
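A sketch for verifying isolation, assuming a second tenant namespace exists (client-other below is hypothetical); the cross-tenant request should time out:

# Confirm the policy is in place
kubectl describe networkpolicy deny-cross-tenant -n client-demo-client

# Cross-tenant request should be blocked (client-other is a hypothetical namespace)
kubectl run netpol-test --rm -it --restart=Never -n client-other \
  --image=busybox -- wget -qO- -T 5 \
  http://api.client-demo-client.svc.cluster.local:8000 || echo "blocked as expected"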

RBAC

Tenant Service Account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service-account
  namespace: client-demo-client
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-role
  namespace: client-demo-client
rules:
  # Allow managing own resources
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-rolebinding
  namespace: client-demo-client
subjects:
  - kind: ServiceAccount
    name: api-service-account
    namespace: client-demo-client
roleRef:
  kind: Role
  name: api-role
  apiGroup: rbac.authorization.k8s.io
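Scoping can be verified with kubectl auth can-i, impersonating the service account; given the Role above, the first command should print yes and the second no:

# Allowed: listing pods inside the tenant namespace
kubectl auth can-i list pods -n client-demo-client \
  --as=system:serviceaccount:client-demo-client:api-service-account

# Denied: the same verb against another namespace
kubectl auth can-i list pods -n flow-proxy \
  --as=system:serviceaccount:client-demo-client:api-service-account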

Restrictions:

  • Tenants cannot modify ClusterIssuer
  • Tenants cannot access other namespaces
  • Tenants cannot modify flow-proxy configuration
  • Service accounts scoped to tenant namespace

Secrets Management

Best Practices:

  • Kubernetes secrets encrypted at rest (enable encryption provider)
  • Vaultwarden for sensitive credentials (API keys, passwords)
  • Certificate rotation automated (cert-manager)
  • Secrets never in version control

Example: Store API Key in Vaultwarden:

# Using Bitwarden CLI
bw config server https://vault.ominis.ai
bw login admin@ominis.ai

# Create secure note (bw create item expects base64-encoded JSON)
bw get template item \
  | jq --arg org "<org-id>" --arg col "<collection-id>" \
      '.type = 2
       | .secureNote = {type: 0}
       | .name = "Demo Client API Key"
       | .notes = "Winnipeg2025"
       | .organizationId = $org
       | .collectionIds = [$col]' \
  | bw encode | bw create item

TLS Everywhere

Policy: All HTTP traffic must use TLS

Enforcement:

# Traefik redirect HTTP to HTTPS
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-https
  namespace: default
spec:
  redirectScheme:
    scheme: https
    permanent: true
---
# Apply globally to Traefik
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: flow-proxy
  namespace: flow-proxy
spec:
  values:
    additionalArguments:
      - "--entrypoints.web.http.redirections.entryPoint.to=websecure"
      - "--entrypoints.web.http.redirections.entryPoint.scheme=https"
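The redirect can be spot-checked from outside the cluster; a plain-HTTP request should answer with a 301/308 and a Location header pointing at the HTTPS URL (the hostname is an example):

# Expect a permanent redirect with Location: https://...
curl -sI http://demo-client-api.app.ominis.ai | head -n 3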

Certificate Rotation:

  • Let's Encrypt certificates are valid for 90 days
  • Cert-manager renews them roughly 30 days before expiry (the default renewBefore is one third of the certificate lifetime)
  • Renewal is automatic; no manual intervention required

Architecture Decision Records

ADR-001: Cluster-Wide vs Tenant-Scoped Services

Context: Infrastructure services (cert-manager, ingress, identity) can be deployed at two levels:

  1. Cluster-level: One instance shared by all tenants
  2. Tenant-level: Each tenant gets their own instance

Decision: Deploy cert-manager, authentik, vaultwarden, flow-proxy, homer, and excalidraw at cluster-level.

Alternatives Considered:

Approach | Pros | Cons
--- | --- | ---
Tenant-scoped everything | Perfect isolation, tenant control over versions | Massive resource overhead, complex operations
Hybrid (some cluster, some tenant) | Balanced approach | Inconsistent patterns, confusing
Cluster-wide only | Cost efficient, simple operations | Shared failure domain, version lock

Consequences:

Pros:

  • Cost Efficiency: One cert-manager vs N cert-managers saves ~500MB RAM × N tenants
  • Operational Simplicity: Single upgrade path, single monitoring stack
  • Resource Optimization: Shared Traefik reduces pod overhead (2 replicas vs 2×N)
  • Centralized Management: Identity, secrets, TLS in one place
  • Faster Onboarding: New tenants start instantly, no infrastructure wait time

Cons:

  • Shared Failure Domain: Cert-manager failure affects all tenants
  • Version Lock: All tenants on same service versions
  • Security Boundary: Requires strict RBAC and network policies
  • Scaling Limits: Cluster services must handle all tenant load

Mitigation:

  • High Availability: Run 2+ replicas for critical services (cert-manager, Traefik)
  • Network Policies: Strict namespace isolation
  • Resource Quotas: Prevent noisy neighbors
  • Staged Rollouts: Blue-green deployments for cluster service updates
  • Monitoring: Alert on cluster service health

Status: Accepted

Date: 2025-10-14


ADR-002: Traefik Over Other Ingress Controllers

Context: Kubernetes requires an ingress controller to route external HTTP traffic to services. Options include:

  • Nginx Ingress: Most popular, mature, battle-tested
  • HAProxy: High performance, low-level control
  • Traefik: Modern, cloud-native, Kubernetes-native
  • Istio: Service mesh with ingress capabilities

Decision: Use Traefik as the cluster ingress controller (flow-proxy).

Alternatives Considered:

Controller | Pros | Cons
--- | --- | ---
Nginx Ingress | Most popular, mature, extensive docs | Complex config, limited middleware, manual cert-manager setup
HAProxy | Highest performance, granular control | Steep learning curve, less Kubernetes-native
Traefik | Dynamic config, native middleware, cert-manager integration | Moderate performance vs HAProxy
Istio | Full service mesh, advanced features | Overkill for simple routing, complex

Why Traefik?:

Feature | Benefit
--- | ---
Dynamic Configuration | Auto-discovers Kubernetes Ingress resources, no reload
Middleware System | Composable transformations (headers, rate limiting, compression)
Cert-Manager Integration | First-class TLS automation, seamless
Modern Dashboard | Real-time routing visualization
Cloud-Native | Built for containers and microservices
Let's Encrypt Native | ACME protocol built-in

Trade-offs:

Aspect | Traefik | Nginx Ingress
--- | --- | ---
Configuration | Annotations | ConfigMap + Annotations
Performance | Very Good (10K req/s) | Excellent (15K req/s)
Middleware | Native, composable | Limited, via snippets
Dashboard | Yes, built-in | No
Cert-Manager | Seamless | Manual setup
Learning Curve | Moderate | Steep

Consequences:

Benefits:

  • All ingress resources use ingressClassName: traefik
  • Middleware pattern for headers, auth, rate limiting
  • Dashboard available at traefik.ominis.ai (or port-forward)
  • Annotation-based configuration (simple)
  • Auto-discovery reduces operational overhead

Drawbacks:

  • Slightly lower raw performance than Nginx (~30% fewer req/s)
  • Different annotation namespace (traefik.ingress.kubernetes.io/)
  • Team needs to learn Traefik patterns

Mitigation:

  • Use multiple Traefik replicas (2+) for high availability
  • Cache static assets at CDN layer if performance becomes bottleneck
  • Document standard middleware patterns for team

Status: Accepted

Date: 2025-10-14


Summary

Ominis Cluster Manager uses a two-tier infrastructure model for efficiency and isolation:

Cluster Services (this document):

  • 6 services deployed once per cluster
  • Shared by all tenants
  • Examples: Cert-Manager (TLS), Traefik (ingress), Authentik (SSO), Vaultwarden (secrets), Homer (SIP monitoring), Excalidraw (collaboration)

Tenant Services:

  • Deployed per customer
  • Isolated in namespaces
  • Examples: API servers, queues, databases

This approach provides:

  • ✅ Cost efficiency (shared resources)
  • ✅ Operational simplicity (centralized management)
  • ✅ Fast tenant onboarding (no infrastructure wait)
  • ✅ Security isolation (network policies, RBAC)

Next Steps:

  1. Review Helm Infrastructure for tenant deployment
  2. Explore Queue Management for tenant service example
  3. Check Testing Strategy for infrastructure testing
