Author profile
Steve Tran
Founder & CTO, CloudThinker
Steve Tran is the founder and CEO of CloudThinker, the AgenticOps platform for production cloud operations. He writes about AgenticOps, VibeOps, CostOps, and the practices teams use to safely run autonomous AI agents against production AWS, GCP, and Azure environments.
Before founding CloudThinker, Steve was a Solutions Architect at AWS, where he watched customer after customer hit the same operational pain points around cloud cost, incident response, and the gap between insight and action. CloudThinker is the platform he wished those customers had.
Posts by Steve (19)
Product ·
CloudThinker × Rollbar: Day-2 Operations for the Full Error Lifecycle
Most error monitoring programs do not fail at capture — they fail in the hours after, when regressions hide behind a long tail of third-party noise. A walkthrough of how CloudThinker closes the Day-2 gap across the full Rollbar lifecycle: capture, triage, correlate, reproduce, fix, verify — all in the team's existing chat, code review, and ticketing tools.
Market Insights ·
AgenticOps Needs Its Own Platform — Why a Coding Tool Can't Safely Connect to Production
Claude Code, Codex, Kiro, Cursor, and ChatGPT are excellent at intent-to-diff. They are not AgenticOps platforms, and 2025–2026 incident data makes the cost of that mismatch hard to ignore. The case for treating AgenticOps as its own discipline: the six top failure modes the published incident data points to — credential exfiltration, destructive agent actions, supply-chain compromise of AI tooling, over-privileged IAM, vulnerable agents, and sensitive data leaving the boundary on every prompt — and the nine practices CloudThinker bakes in across Connections, Sandbox, Skills, Auto Mode, and deterministic tokenization to make team-grade production access real.
Product ·
Introducing the CloudThinker CostOps Agent: From Cost Anomaly to Landed Fix, on a Daily Loop
Most FinOps tools stop at a dashboard and a list of recommendations. The new CloudThinker CostOps Agent inside CloudKeeper runs the full eight-phase loop every day across AWS and GCP: it detects the anomaly, isolates the cost driver, traces the root cause, washes the data, runs the chase, opens the Merge Request with the fix, ships it under the approval gate you choose, and learns from every approved change. Plus a side-by-side comparison against Cost Explorer, Compute Optimizer, Trusted Advisor, GCP Recommender, CloudHealth, Cloudability, Datadog Cloud Cost, Vantage, and Kubecost.
Market Insights ·
Data Sovereignty for Agentic AI in Vietnam and ASEAN BFSI: A Field Guide
For banks and insurers in Vietnam and across ASEAN, the question is no longer whether to adopt agentic AI — it is how to adopt it without unwinding the data-locality work that took the last decade to put in place. A field guide to the regulatory floor in 2026, how an agent's reasoning loop changes the data surface across storage, inference, memory, telemetry, and egress, the architecture patterns that hold up at audit, and a checklist to run before an autonomous agent reads live customer data.
Market Insights ·
Build vs Buy: The 24-Month TCO of an Agentic Operations Platform
Every engineering leader evaluating agentic operations eventually asks the same question: build it or buy CloudThinker? A structured walkthrough of the thirteen runtime primitives an internal platform actually requires, a capability-by-capability TCO comparison across pure-build, pure-buy, and hybrid scenarios, and a seven-question decision framework to take into your next architecture review.
Product ·
CloudThinker × Terraform: Day-2 Operations for the Full IaC Lifecycle
Most Terraform programs do not fail at plan — they fail in the months after, when the state file no longer describes production. A walkthrough of how CloudThinker closes the Day-2 gap across the full Terraform lifecycle: author, plan, apply, drift detect, reconcile, right-size, deprecate — all in the team's existing chat, code review, and ticketing tools.
Product ·
New CloudThinker Security Agent runs continuous agentic penetration testing from commit to deployment
Today we are announcing the CloudThinker Security Agent, an autonomous penetration testing system that runs on every commit. Six domain specialists — code, web, infrastructure, database, identity, and secrets — discover, plan, and safely validate exploits in under 15 minutes per run, with near-zero false positives.
How To ·
Best Practices: How to Build AI Skills That Actually Work for Your Business
Most teams clone public skills and wonder why they break. The real problem isn't the skill — it's missing connected intelligence: your incident history, your cost baseline, your deployment patterns. Here's how to build skills that detect, analyze, resolve, and validate — automatically — using your own practices, your own context, and the Ultra-to-Light strategy that cuts costs 40–60% over time.
Product ·
Introducing Olivier: CloudThinker's SuperPower Security Agent for Cloud
It's 2:47 AM. A GuardDuty alert fires. Your on-call engineer opens the console, cross-references CloudTrail logs, checks security groups, and tries to remember which CIS benchmark covers this. 45 minutes later, she's still context-switching. Meet Olivier — an AI security engineer with 20 purpose-built skills covering prevention, detection, response, and compliance. Your cloud runs 24/7. Your security engineer should too.
Product ·
CloudThinker Connections: How We Securely Connect to Your Infrastructure
Your databases live in private subnets. Your clusters sit behind firewalls. Your cloud accounts have strict network policies. A technical guide to four connectivity tiers — from public HTTPS to private VPN — that let AI agents reach your infrastructure without compromising your security posture.
Product ·
Inside CloudThinker's Sandbox: How We Built the Most Secure AI Execution Environment
A deep technical guide to CloudThinker's self-developed sandbox architecture — three-tier isolation, ephemeral microVMs, kernel-level syscall filtering, scoped credentials, and defense-in-depth security that makes autonomous AI operations safe for banking, healthcare, and enterprise.
Product ·
Human Expert Guidance Meets Agentic AI: The Architecture for Scalable Autonomous Operations
How organizations are building, testing, and sharing reusable AI automation assets — agents, skills, runbooks, and approval policies — to autonomously resolve 80% of common operational tasks while keeping humans in control of the remaining 20%.
Product ·
CloudThinker Makes GitLab Become Autonomous
It's 3:17 AM. Your phone lights up. PagerDuty. Again. A seemingly innocent refactor passed all tests and sailed through CI — but buried inside was a missing slash, a security misconfiguration, and a query that explodes under load. This is the story of why we built CloudThinker's GitLab integration — to make GitLab think for itself.
Market Insights ·
The Most Expensive Model Is No Longer the Best Choice
Open-source models are closing the gap. Claude Opus 4.6 scores 79.4% on SWE-bench, GPT-5.3 scores 78.2%, and GLM-5 — fully open-source under MIT — scores 77.8%. The price gap? 5-8x. The smartest teams are rethinking everything: from model-centric to system-centric AI, where Multi-Agent orchestration matters more than raw intelligence.
Case Study ·
How Diaflow Achieved Active-Active Architecture and SOC 2 Compliance in 28 Days
Diaflow, an AI-native automation platform, faced a critical scaling bottleneck: the need to simultaneously deploy multi-region infrastructure and achieve strict regulatory compliance (SOC 2, HIPAA, GDPR) to close enterprise deals. By leveraging CloudThinker’s unified AI operations, Diaflow compressed a standard 6-month roadmap into a 4-week sprint, achieving 99.9% uptime and reducing operational toil by 80%.
Product ·
Mastering Multi-Cloud CostOps: Why Multi-Cloud CostOps Matters
Three clouds. Three invoices. Three billing consoles. One frustrated CTO. The story of a startup drowning in $85K/month across AWS, Azure, and GCP — a homegrown dashboard that broke after six weeks, and the AI agent that found $28,500 in annual savings within two hours.
Product ·
Introducing CloudThinker SlackOps: The Future of Conversational Infrastructure Management
Fourteen browser tabs. Three terminal windows. Two Slack channels. One frantic on-call engineer. A 47-minute incident in which only 8 minutes went to actual investigation, and how AI agents in Slack collapsed the rest to seconds.
Product ·
The Kubernetes Agentic Operations Revolution: From Manual Management to Autonomous Intelligence with CloudThinker
The Kubernetes cluster was supposed to be self-healing. But at 3 AM on Monday, the only thing healing anything was a very tired platform engineer named Marcus. The story of a bad week across 12 clusters — and the AI agent that gave Marcus his Mondays back.
Product ·
The Database Analytics Revolution: From Manual Queries to Intelligent Insights with CloudThinker
Forty-seven pending report requests. Two analysts. Three-week turnaround. Then the CEO needed a churn analysis by Thursday. The story of a data team buried in SQL queries — and the AI agent that cleared the backlog in three days.