DevOps & Security

Observability Bootstrapper

Adds structured logging with trace IDs, Prometheus metrics, OpenTelemetry tracing, and baseline alert rules to any service. Useful for making services observable by default instead of as an afterthought. Built for engineers shipping new services, platform teams standardizing observability across a portfolio, and SREs enforcing observability acceptance criteria on new deploys. When observability is deferred, the consequence is predictable: the first production incident reveals that the service has no structured logs, no metrics, and no traces, so debugging requires a coordinated investigation across three teams with partial information. Bootstrapping observability from the start costs a few hours; retrofitting it after an incident costs a week. A structured bootstrapper produces a working baseline (instrumented code, dashboards, alerts) that the team extends as their system evolves.

Nexus Certified · Claude Code · Codex · OpenClaw · Google Antigravity

observability · opentelemetry · metrics · logging · tracing

One-Time Purchase

$19.99

Sample Output

Observability Bootstrap — `order-processing-service` (Python/FastAPI, tier 1)

Generated for: order-processing-service · Stack: FastAPI + SQLAlchemy + OpenTelemetry 1.25 · Tier: 1

At a glance

Drop-in instrumentation for a tier-1 Python service: structured JSON logs with W3C trace/span correlation, RED metrics with bounded cardinality, OTLP traces sampled at 10% in prod (100% in dev), five baseline Prometheus alerts, and a four-panel Grafana dashboard. Exporters: stdout for logs, OTLP gRPC → OpenTelemetry Collector → Grafana Cloud for metrics and traces.

Instrumentation coverage

Logs (JSON): trace_id, span_id, service.name on every record

Metrics (3 + 2): 3 RED counters, 2 latency histograms, no high-cardinality labels

Traces (auto + manual): FastAPI + SQLAlchemy auto-instrumented; manual spans on business logic

Alerts (5): error rate, p99, order failures, DB latency, up-probe

Dashboard (4 panels): RPS, error rate, latency quantiles, outcome breakdown

Exporters: stdout (logs/JSON), OTLP → OpenTelemetry Collector (metrics + traces) → Grafana Cloud

`observability/logger.py`

import logging
import sys
from contextvars import ContextVar
from pythonjsonlogger import jsonlogger

_trace_id: ContextVar[str] = ContextVar("trace_id", default="unset")
_span_id: ContextVar[str] = ContextVar("span_id", default="unset")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = _trace_id.get()
        record.span_id = _span_id.get()
        record.service = "order-processing-service"
        return True

def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured: a second handler would emit every record twice.
        return logger
    handler = logging.StreamHandler(sys.stdout)
    formatter = jsonlogger.JsonFormatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(trace_id)s %(span_id)s %(message)s",
        rename_fields={"asctime": "ts", "levelname": "level", "name": "logger"},
    )
    handler.setFormatter(formatter)
    # The filter stamps trace_id, span_id, and service on each record before formatting.
    logger.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# ⚠️ Never log: customer_email, customer_name, card_last4, raw request/response bodies

`observability/metrics.py`

from prometheus_client import Counter, Histogram

# RED metrics. Cardinality cap: method, route template, and status class only (never user_id, order_id)
REQUEST_COUNT = Counter(
    "orders_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],  # status_class: 2xx/4xx/5xx, not raw code
)
REQUEST_LATENCY = Histogram(
    "orders_http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
ORDER_CREATED = Counter("orders_created_total", "Orders successfully created", ["payment_method"])
ORDER_FAILED = Counter("orders_failed_total", "Order creation failures", ["failure_reason"])
# failure_reason bounded to: validation_error | payment_declined | inventory_unavailable | timeout | unknown
DB_QUERY_DURATION = Histogram(
    "orders_db_query_duration_seconds",
    "Database query latency",
    ["operation", "table"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)

`observability/tracing.py`

import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def init_tracing(app, db_engine):
    env = os.getenv("ENV", "development")
    sample_rate = 1.0 if env == "development" else float(os.getenv("TRACE_SAMPLE_RATE", "0.1"))

    provider = TracerProvider(
        sampler=ParentBased(root=TraceIdRatioBased(sample_rate)),
        resource=Resource.create({"service.name": "order-processing-service", "deployment.environment": env}),
    )
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")))
    )
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument(engine=db_engine)
    # The active span follows asyncio tasks in-process via contextvars; incoming
    # W3C traceparent headers are extracted by the FastAPI instrumentation.
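
To make the 10% / 100% sampling split concrete, the sketch below shows the idea behind trace-ID ratio sampling: the keep/drop decision is a deterministic function of the trace ID, so every service that sees the same trace ID reaches the same verdict. This is a simplified illustration, not the SDK's source. `resolve_sample_rate` and `should_sample` are hypothetical names; `resolve_sample_rate` mirrors the `ENV` / `TRACE_SAMPLE_RATE` logic in `init_tracing`.

```python
import random
from typing import Optional

TRACE_ID_LOW_64_MASK = (1 << 64) - 1


def resolve_sample_rate(env: str, configured: Optional[str]) -> float:
    """Sample everything in development; otherwise use the configured ratio (default 0.1)."""
    if env == "development":
        return 1.0
    return float(configured) if configured is not None else 0.1


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_LOW_64_MASK) < bound


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the demo is repeatable
    rate = resolve_sample_rate("production", None)
    trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, rate) for t in trace_ids)
    print(f"kept {kept / len(trace_ids):.1%} of traces at rate {rate}")
```

Wrapping the ratio sampler in `ParentBased`, as `init_tracing` does, means this decision is only made for root spans; child spans follow the sampled flag of their parent.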

`observability/middleware.py`

import time
from starlette.middleware.base import BaseHTTPMiddleware
from opentelemetry import trace
from observability.logger import _trace_id, _span_id
from observability.metrics import REQUEST_COUNT, REQUEST_LATENCY

class ObservabilityMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        start = time.perf_counter()
        span = trace.get_current_span()
        ctx = span.get_span_context()

        _trace_id.set(format(ctx.trace_id, "032x") if ctx.is_valid else "unset")
        _span_id.set(format(ctx.span_id, "016x") if ctx.is_valid else "unset")

        response = await call_next(request)
        duration = time.perf_counter() - start

        route = request.scope.get("route")  # populated by the router once a route has matched
        endpoint = getattr(route, "path", "unknown")  # route template, not the raw URL path
        status_class = f"{response.status_code // 100}xx"

        REQUEST_COUNT.labels(request.method, endpoint, status_class).inc()
        REQUEST_LATENCY.labels(request.method, endpoint).observe(duration)
        return response

`alerts.yaml`

# ⚠️ Thresholds below are starting points — tune after observing your baseline p50/p99.
groups:
  - name: order-processing-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(orders_http_requests_total{status_class="5xx"}[5m]))
          / sum(rate(orders_http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          service: order-processing-service
        annotations:
          summary: "Error rate above 5% for 3 minutes"
          runbook: "https://wiki.internal/runbooks/order-processing/high-error-rate"

      - alert: P99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(orders_http_request_duration_seconds_bucket[5m])) > 2.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency > 2s — investigate DB or downstream dependencies"

      - alert: OrderFailureSpike
        expr: sum(rate(orders_failed_total[5m])) > 0.5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Order failure rate > 0.5/sec — check payment gateway and inventory service"

      - alert: DBQueryLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(orders_db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DB p95 query time > 500ms — check for slow queries or connection pool exhaustion"

      - alert: ServiceDown
        expr: up{job="order-processing-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "order-processing-service scrape target is down"
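
The two latency alerts above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. Below is a rough Python sketch of that estimate, assuming linear interpolation inside the bucket that contains the target rank. The function name and the sample counts in the note that follows are illustrative, and edge cases that Prometheus handles (such as the `+Inf` bucket) are simplified.

```python
from typing import Sequence


def estimate_quantile(q: float, upper_bounds: Sequence[float], cumulative: Sequence[float]) -> float:
    """Estimate the q-quantile from cumulative histogram bucket counts.

    upper_bounds holds the finite `le` bounds in ascending order;
    cumulative[i] is the number of observations <= upper_bounds[i].
    """
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in zip(upper_bounds, cumulative):
        if count >= rank:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside this bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return upper_bounds[-1]
```

With the request-latency buckets from `observability/metrics.py` and hypothetical cumulative counts of 100, 400, 700, 900, 960, 985, 995, 1000, the p99 estimate lands at 1.75 s. The 2 s alert threshold sits inside the 1.0 to 2.5 s bucket, so any value compared against it is interpolated; adding a bucket boundary near the threshold is one way to tighten the estimate.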

`handlers/create_order.py` — Instrumented Example

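from fastapi import APIRouter, HTTPException
from opentelemetry import trace
from observability.logger import get_logger
from observability.metrics import ORDER_CREATED, ORDER_FAILED, DB_QUERY_DURATION
# OrderRequest, db, and PaymentDeclinedError are application-specific; import them
# from your own modules. They are not part of the generated observability package.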

router = APIRouter()
logger = get_logger(__name__)
tracer = trace.get_tracer("order-processing-service")

@router.post("/orders")
async def create_order(payload: OrderRequest):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.payment_method", payload.payment_method)
        span.set_attribute("order.item_count", len(payload.items))
        # ✗ Do NOT set: span.set_attribute("order.customer_email", ...)

        logger.info("order_creation_started", extra={
            "payment_method": payload.payment_method,
            "item_count": len(payload.items),
        })

        try:
            with tracer.start_as_current_span("db.insert_order"):
                with DB_QUERY_DURATION.labels("insert", "orders").time():
                    order = await db.insert_order(payload)

            ORDER_CREATED.labels(payment_method=payload.payment_method).inc()
            logger.info("order_creation_succeeded", extra={"order_id_prefix": str(order.id)[:8]})
            return {"order_id": order.id}

        except PaymentDeclinedError:
            ORDER_FAILED.labels(failure_reason="payment_declined").inc()
            span.set_status(trace.StatusCode.ERROR, "payment declined")
            logger.warning("order_creation_failed", extra={"reason": "payment_declined"})
            raise HTTPException(status_code=402, detail="Payment declined")

        except Exception as e:
            ORDER_FAILED.labels(failure_reason="unknown").inc()
            span.record_exception(e)
            logger.error("order_creation_error", extra={"error_type": type(e).__name__})
            raise HTTPException(status_code=500, detail="Internal error")

`dashboard.json` (excerpt — full file: 847 lines)

{
  "title": "order-processing-service",
  "uid": "order-proc-v1",
  "panels": [
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "targets": [{"expr": "sum(rate(orders_http_requests_total[1m])) by (endpoint)"}]
    },
    {
      "title": "Error Rate %",
      "type": "timeseries",
      "targets": [{"expr": "sum(rate(orders_http_requests_total{status_class='5xx'}[5m])) / sum(rate(orders_http_requests_total[5m])) * 100"}]
    },
    {
      "title": "p50 / p95 / p99 Latency",
      "type": "timeseries",
      "targets": [
        {"expr": "histogram_quantile(0.50, sum by (le) (rate(orders_http_request_duration_seconds_bucket[5m])))", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, sum by (le) (rate(orders_http_request_duration_seconds_bucket[5m])))", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, sum by (le) (rate(orders_http_request_duration_seconds_bucket[5m])))", "legendFormat": "p99"}
      ]
    },
    {
      "title": "Order Outcomes",
      "type": "timeseries",
      "targets": [
        {"expr": "sum(rate(orders_created_total[5m]))", "legendFormat": "created"},
        {"expr": "sum by (failure_reason) (rate(orders_failed_total[5m]))", "legendFormat": "failed — {{failure_reason}}"}
      ]
    }
  ]
}

Alerts & Dashboards Summary

Signal (condition)	Source metric	Duration	Severity
5xx error rate > 5%	`orders_http_requests_total`	3m sustained	critical
p99 latency > 2s	`orders_http_request_duration_seconds`	5m sustained	warning
Order failure rate > 0.5/s	`orders_failed_total`	2m sustained	critical
DB p95 query > 500ms	`orders_db_query_duration_seconds`	5m sustained	warning
Scrape target down	`up{job=...}`	1m	critical

Cardinality discipline

The metrics module deliberately uses bounded labels: status_class (2xx/4xx/5xx, not the raw status code), a failure_reason enum, and endpoint taken from the route template rather than the raw URL path. Adding user_id, order_id, or full-resolution URLs to any label creates a new time series for every distinct value, which inflates Prometheus memory and storage. The codebase has no enforcement; treat the label set as a code-review item.
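
Because nothing in the generated code enforces the label set, one low-cost guard is to normalize label values before they reach a metric. The sketch below is an assumption about how a team might do that, not part of the generated package: `bounded_failure_reason` and `status_class` are hypothetical helpers, and the allowed values are copied from the comment in `observability/metrics.py`.

```python
ALLOWED_FAILURE_REASONS = frozenset(
    {"validation_error", "payment_declined", "inventory_unavailable", "timeout", "unknown"}
)


def bounded_failure_reason(reason: str) -> str:
    """Collapse any value outside the allowed enum to "unknown"."""
    return reason if reason in ALLOWED_FAILURE_REASONS else "unknown"


def status_class(status_code: int) -> str:
    """Map a raw HTTP status code to its class, e.g. 503 -> "5xx"."""
    return f"{status_code // 100}xx"
```

A call site would then read `ORDER_FAILED.labels(failure_reason=bounded_failure_reason(reason)).inc()`, so a stray free-text reason cannot create a new series.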

SLO posture

The 5xx-rate and p99 alerts are paging thresholds, not SLOs. Recommended next step is to define a 99.9% availability SLO and a 95th-percentile-under-500ms latency SLO, then derive burn-rate alerts from those rather than the static thresholds above. The Prometheus + Grafana setup here supports SLO calculations without further code changes.
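
As a worked example of the burn-rate idea: a 99.9% availability SLO leaves an error budget of 0.1% of requests, and the burn rate is the observed error ratio divided by that budget. The sketch below is illustrative arithmetic only; the function names and the 30-day SLO period are assumptions, not output of the skill.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / error_budget(slo_target)


def hours_to_exhaustion(error_ratio: float, slo_target: float, slo_period_hours: float = 30 * 24) -> float:
    """Hours until the whole budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo_target)
    return float("inf") if rate == 0 else slo_period_hours / rate


if __name__ == "__main__":
    ratio, slo = 0.05, 0.999  # the static 5% alert threshold against a 99.9% SLO
    print(f"burn rate: {burn_rate(ratio, slo):.1f}x")
    print(f"budget exhausted in: {hours_to_exhaustion(ratio, slo):.1f} h")
```

At the static 5% threshold, a 99.9% SLO is burning at roughly 50 times the sustainable rate, which would exhaust a 30-day budget in about 14.4 hours. A burn-rate alert expresses that urgency directly instead of through a fixed error percentage.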

PII discipline

Do not log or attach as span attributes: customer_email, customer_name, card_last4, raw request/response bodies. The logger and tracer above are wired to make this easy to follow, not enforced. PII-in-traces is one of the most common compliance findings on observability rollouts.
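
Since the PII rule is a convention rather than a control, a team might add a small guard at the logging layer. The following is a minimal sketch, assuming the field names listed above are the ones to block; `PiiRedactionFilter` is a hypothetical class, not something the generated `logger.py` includes.

```python
import logging

BLOCKED_FIELDS = frozenset({"customer_email", "customer_name", "card_last4"})


class PiiRedactionFilter(logging.Filter):
    """Replace blocked `extra` fields with a placeholder before formatting."""

    def filter(self, record: logging.LogRecord) -> bool:
        for field in BLOCKED_FIELDS:
            if hasattr(record, field):
                setattr(record, field, "[REDACTED]")
        return True  # never drop the record, only scrub it
```

Attaching it with `logger.addFilter(PiiRedactionFilter())` next to `CorrelationFilter` would scrub these keys when they are passed through `extra`. It does not inspect the message string, span attributes, or request bodies, so code review remains necessary.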

Generated by the ClearPoint Nexus Observability Bootstrapper skill. Tune the alert thresholds against your service's baseline before paging humans on them.

This sample illustrates the skill's output format. Names, metrics, and operational details are illustrative unless the artifact explicitly analyzes public information.


All sales final. No refunds on digital products.

Includes support for Claude Code, Codex, OpenClaw, and Google Antigravity in the same license.

Also in Infrastructure & Reliability

Bundle price: $55. Compare this skill with the full workflow bundle or Pro access.


Best for

Engineers and SRE leads shipping a new service who want structured logging, Prometheus metrics, OpenTelemetry tracing, and a baseline alert set wired up before the first production incident — not as a retrofit after one. Especially valuable for platform teams enforcing observability as an acceptance criterion on every new deploy.

Not ideal for

Services already deeply instrumented in a proprietary stack (Datadog APM with custom integrations, New Relic dashboards built over years) where the generated baseline conflicts with the existing convention. Also a poor fit for batch jobs and one-shot scripts where the observability surface is fundamentally different from a long-running service.

Included in this purchase

Claude Code, Codex, OpenClaw, and Google Antigravity skill files.
Setup guidance for the right adapter in your workspace.
One-time license for the purchased skill version.

Setup

Plan for a short setup in the repository or workspace where the skill will run. Some coding familiarity helps for implementation-heavy outputs.

Claude Code · Codex · OpenClaw · Google Antigravity


Future Updates

This purchase includes the current version of the skill. If you want future adapter updates — meaning compatibility and packaging updates as supported platforms evolve — plus new catalog additions included automatically, upgrade to Pro.
