
Token Economics

Most platforms now offer token usage dashboards. Tracking is not the problem — Oracle, Microsoft, and SAP all show you what you spent. The problem is optimization at scale across providers and business units: routing decisions, caching strategies, budget enforcement, and cost attribution as a financial controllership discipline. The FinOps Foundation documents 30–200x cost variance between optimized and unoptimized deployments. This section explores the optimization layer that sits above any single vendor’s monitoring.

Disclaimer: All pricing, calculations, and architectures are generic and synthetic. They reference publicly available provider rates and common infrastructure patterns — not the proprietary systems or intellectual property of any current or former employer.
Representative pricing as of February 2026. LLM API rates change frequently — verify with providers before use in production cost models. Sources: provider pricing pages. All figures represent marginal token costs only and do not reflect total cost of ownership including infrastructure, engineering, operations, or organizational change.
| Model | Provider | Input $/1M | Output $/1M | Context | Tier | Best For |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | 1 | Financial analysis, compliance docs |
| GPT-5.4 | OpenAI | $1.75 | $14.00 | 200K | 1 | Complex reasoning, exception triage |
| GPT-5 Mini | OpenAI | $0.25 | $2.00 | 200K | 2 | Forecasting, high-volume tasks |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | 2 | Budget checks, routing |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | 1M | 2 | High-volume ad-hoc queries |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | 128K | 2 | Classification, extraction |
| GPT-5 Nano | OpenAI | $0.05 | $0.40 | 400K | 2 | High-volume simple tasks |
Tier 1 vs Tier 2 cost difference: 75–95%

Understanding cost tiers helps match model capability to task requirements. Tier 2 models handle classification, extraction, and routine tasks at a fraction of Tier 1 pricing.
NSCP Model Policy
- Tier 1: SOX reporting, compliance, complex analysis
- Tier 2: Ad-hoc queries, dev/test, classification
- Budget: Auto-downgrade at 80% utilization
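
Read as configuration, the policy is a task-to-tier map plus one downgrade rule. A minimal sketch in Python (the task names and helper function are illustrative, not actual NSCP code):

```python
# Illustrative policy config; task names are invented for this sketch.
TIER_1_TASKS = {"sox_reporting", "compliance", "complex_analysis"}
TIER_2_TASKS = {"ad_hoc_query", "dev_test", "classification"}
DOWNGRADE_AT = 0.80  # auto-downgrade threshold from the policy above

def select_tier(task: str, budget_utilization: float) -> int:
    """Pick a model tier for a task, honoring the budget downgrade rule."""
    tier = 1 if task in TIER_1_TASKS else 2
    if tier == 1 and budget_utilization >= DOWNGRADE_AT:
        tier = 2  # past 80% of budget, Tier 1 work falls back to Tier 2
    return tier
```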
Blended Cost per 1K Tokens (80/20 input/output token split)
- Claude Sonnet 4.6: $0.0054
- GPT-5.4: $0.0042
- GPT-5 Mini: $0.0006
- DeepSeek V3.2: $0.0003
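
These blended figures follow directly from the pricing table: weight the input and output rates 80/20 and convert $/1M to $/1K. A minimal sketch:

```python
# $/1M rates (input, output) copied from the pricing table above.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-5.4": (1.75, 14.00),
    "gpt-5-mini": (0.25, 2.00),
    "deepseek-v3.2": (0.28, 0.42),
}

def blended_per_1k(model: str, input_share: float = 0.8) -> float:
    """Blended $ per 1K tokens at the given input/output split."""
    inp, out = PRICES[model]
    per_million = input_share * inp + (1 - input_share) * out
    return per_million / 1000  # $/1M -> $/1K

# blended_per_1k("claude-sonnet-4.6") -> ~0.0054, matching the list above
```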
Inference Gateway Architecture: Token Tracking Pipeline

Forward path (request):
Agent (initiates call) → Inference Gateway (auth + route) → Pre-Call Check (budget + token estimate) → Model Router (tier enforcement) → LLM API (provider)

Return path (response + logging):
LLM Response (+ usage headers) → Post-Call Log (actuals + budget) → Gateway (strip metadata) → Agent (receives result)

Downstream consumers:
- Token Logger (writes fact_token_usage)
- Token Economics Dashboard
- Budget Controller Agent
- TGC Controls (TGC-001–006)

Gateway design targets:
- Overhead: <12ms p99
- Availability: 99.95%
- Peak RPS: 247
- Cache hit rate: 44.2%
Pre-Call Checks
✓ Agent budget available?
✓ Model tier authorized?
✓ Token estimate within limit?
✓ Prompt hash — cache check
✓ Rate limit headroom?
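
A hedged sketch of that gate in Python; the budget, cache, and limiter objects are hypothetical interfaces standing in for whatever stores the gateway actually uses:

```python
import hashlib

class PreCallRejected(Exception):
    """Raised when any pre-call check fails."""

def pre_call_check(agent_id, model, prompt, budget, cache, limiter):
    """Run the five checks above; return a cached response, or None to proceed."""
    est_tokens = max(1, len(prompt) // 4)  # crude ~4-chars-per-token estimate
    if budget.remaining(agent_id) <= 0:
        raise PreCallRejected("agent budget exhausted")
    if not budget.tier_authorized(agent_id, model):
        raise PreCallRejected(f"model tier not authorized: {model}")
    if est_tokens > budget.per_call_limit(agent_id):
        raise PreCallRejected(f"token estimate {est_tokens} over per-call limit")
    key = hashlib.sha256(prompt.encode()).hexdigest()  # prompt hash
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: skip the provider call entirely
    if not limiter.has_headroom(model):
        raise PreCallRejected("no rate limit headroom")
    return None  # all checks pass; hand off to the model router
```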
Post-Call Actions
→ Write to fact_token_usage
→ Decrement agent budget
→ Update cache registry
→ Fire alerts if threshold crossed
→ Tag cost to business unit
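
And the matching post-call sketch; fact_token_usage is the table named above, while db, budget, cache, and alerts are hypothetical interfaces:

```python
def post_call_log(db, budget, cache, alerts, agent_id, business_unit,
                  model, usage, cache_key=None, response=None):
    """Record actuals and update controls after a completed provider call."""
    db.insert("fact_token_usage", {
        "agent_id": agent_id,
        "business_unit": business_unit,  # cost attribution tag
        "model": model,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cost_usd": usage["cost_usd"],
    })
    utilization = budget.decrement(agent_id, usage["cost_usd"])
    if cache_key is not None:
        cache.put(cache_key, response)  # update cache registry
    if utilization >= 0.80:  # thresholds from the budget policy below
        alerts.fire(agent_id, utilization)
```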
Calculator Inputs
- Input/Output split: 80% input / 20% output tokens
- Payback assumes an estimated platform setup cost of $5K–$60K, scaled by volume

Calculator Outputs
- Monthly Token Cost
- Cost per Run
- Cost per 1K Tokens
- Annual Projection
- Tokens per Dollar
- Monthly Token Cost by Tier
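
The arithmetic behind those outputs is simple enough to sketch; runs per month, tokens per run, and the model rates are inputs you supply:

```python
def token_cost_report(runs_per_month, tokens_per_run, in_rate, out_rate):
    """in_rate/out_rate are $/1M prices from the pricing table above."""
    blended_per_token = (0.8 * in_rate + 0.2 * out_rate) / 1_000_000
    cost_per_run = tokens_per_run * blended_per_token
    monthly = runs_per_month * cost_per_run
    return {
        "monthly_token_cost": monthly,
        "cost_per_run": cost_per_run,
        "cost_per_1k_tokens": blended_per_token * 1000,
        "annual_projection": monthly * 12,
        "tokens_per_dollar": 1 / blended_per_token,
    }

# Example: 50,000 runs/month at 3,000 tokens/run on Claude Haiku 4.5
# ($1.00 in / $5.00 out) -> monthly_token_cost = $270.00
```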
Prompt Caching
Cache frequently used system prompts and static context. Dramatically reduces input token processing cost.
Savings: 50–90% on cached portions
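
With Anthropic's API, for example, caching is requested per content block. A hedged sketch (the model id and prompt text are placeholders; verify the current syntax against the provider's documentation):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_CONTEXT = "...long, reused system prompt and policy documents..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": STATIC_CONTEXT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    }],
    messages=[{"role": "user", "content": "Summarize this quarter's exceptions."}],
)
# Subsequent calls that reuse the cached prefix pay a steeply discounted
# input rate on it, which is where the 50-90% range above comes from.
```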
Intelligent Model Routing
Route simple tasks to Tier-2 models. Reserve Tier-1 for complex reasoning, SOX compliance, and financial judgment.
Savings: 60–80% on routed calls
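
A minimal routing sketch under that policy (task labels and model choices are illustrative):

```python
TIER_1_MODEL = "claude-sonnet-4.6"
TIER_2_MODEL = "gpt-5-mini"

COMPLEX_TASKS = {"sox_compliance", "financial_judgment", "complex_reasoning"}

def route_model(task_type: str) -> str:
    """Send high-stakes work to Tier 1; default everything else to Tier 2."""
    return TIER_1_MODEL if task_type in COMPLEX_TASKS else TIER_2_MODEL
```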
Batch Processing
Group similar requests to amortize system prompt overhead. Effective for reconciliation checks and exception classification.
Savings: 20–30% on batch-eligible tasks
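
For example, a batch of reconciliation exceptions can share one system prompt instead of paying for it once per item. A sketch, with invented prompt text:

```python
def build_batch_prompts(system_prompt: str, items: list[str], batch_size: int = 20):
    """Yield one combined prompt per batch, amortizing the system prompt."""
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        numbered = "\n".join(f"{n + 1}. {item}" for n, item in enumerate(batch))
        yield f"{system_prompt}\n\nClassify each exception below:\n{numbered}"
```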
Token Budget Allocation
Set daily/monthly token budgets per agent. Controller enforces hard limits, triggers downgrades at 80%, blocks at 90%.
Prevents runaway costs
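
A sketch of the enforcement states (the class is illustrative; real budget state would live in a shared store, not process memory):

```python
class AgentBudget:
    """Tracks one agent's spend against a monthly hard limit."""

    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd

    def check(self) -> str:
        utilization = self.spent / self.limit
        if utilization >= 0.90:
            return "block"      # hard stop: reject new calls
        if utilization >= 0.80:
            return "downgrade"  # Tier 2 models only from here
        return "ok"
```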
Prompt Engineering
Optimize prompt structure to reduce tokens without degrading quality. Use structured output formats.
Typical reduction: 15–25% on input tokens
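
An invented before/after for a journal-entry task shows the pattern: the structured version trims instruction tokens and bounds the response size:

```python
# Verbose instruction: long input, open-ended (expensive) output.
VERBOSE = ("Please review the following journal entry and explain in detail "
           "whether it looks anomalous, covering every relevant consideration.")

# Structured instruction: terse input, bounded JSON output.
STRUCTURED = ('Classify the journal entry. Reply with JSON only: '
              '{"anomalous": <bool>, "reason": "<10 words max>"}')
```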
Semantic Caching
Cache responses for semantically similar queries using embedding-based similarity. Cache hits are returned with metadata marking them as served from cache.
Hit rate improvement: 10–20%
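
A brute-force sketch of the idea using numpy cosine similarity; the embeddings come from whatever embedding call you already use, and the 0.95 threshold is an assumption to tune:

```python
import numpy as np

class SemanticCache:
    """Linear-scan semantic cache; production would use a vector index."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)

    def lookup(self, query_vec: np.ndarray):
        for vec, response in self.entries:
            sim = float(vec @ query_vec /
                        (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
            if sim >= self.threshold:
                return response  # similar enough: reuse the stored answer
        return None

    def store(self, query_vec: np.ndarray, response: str) -> None:
        self.entries.append((query_vec, response))
```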
Combined Optimization Impact

| Optimization | Applicable Calls | Reduction | Monthly Reduction | Effort |
|---|---|---|---|---|
| Model Routing | 70% | 65% | $812 | Low |
| Prompt Caching | 44% (hit rate) | 70% | $218 | Med |
| Prompt Engineering | 100% | 20% | $110 | Med |
| Batch Processing | 35% | 25% | $64 | High |
| Semantic Caching | 15% | 100% | $47 | High |
| Total (combined) | ~80% | | $1,251/mo | ROI: 12x |
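
The total row is the sum of the per-lever column; a quick arithmetic check:

```python
levers = {"routing": 812, "prompt_caching": 218, "prompt_engineering": 110,
          "batching": 64, "semantic_caching": 47}
assert sum(levers.values()) == 1251  # matches the $1,251/mo total
```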
