Digest #142: AWS Changes, CrowdStrike Incident Analysis, Terraform Plan-Light, and Scaling Prometheus with Thanos
Prevent production incidents, catch the latest OpenTofu updates, learn to build SQS with Postgres, master Bash one-liners, and optimize Kubernetes costs.
Welcome to this week’s edition of the DevOps Bulletin!
Can a good explanation prevent a prod incident? Explore the impact of clear communication on production stability. Learn why Cloudflare suggests you "Destroy on Friday" to avoid downtime, and discover the secrets behind delivering millions of notifications during the Super Bowl.
AWS is discontinuing Cloud9, CodeCommit, CloudSearch, and more. The recent CrowdStrike incident led to massive outages and costs for Delta Air Lines; find out what happened and what we can learn.
Build your own SQS or Kafka with Postgres, explore Terraform plan -light for efficient planning, and master Bash one-liners for Linux maintenance. Discover how Amazon migrated from Apache Spark to Ray on EC2, and get step-by-step guidance on upgrading the Terraform EKS module.
Highlighted projects include Git-pr for self-hosted Git collaboration, DevX for building internal developer platforms, and Tailwarden for managing Kubernetes costs.
Newsworthy Stories
Stay informed with the latest news impacting the DevOps and SRE world:
Delivering Millions of Notifications Within Seconds During the Super Bowl
AWS to Discontinue Cloud9, CodeCommit, CloudSearch, and More
CrowdStrike Incident Analysis
A NULL pointer took down 8.5M machines and cost Delta Air Lines $500M in outages. This thread explains it all.
What's your preferred logging stack in Kubernetes?
A Hacker News thread discussed Kubernetes logging solutions. Many use VictoriaLogs for its efficiency, Loki with S3 for cost, and ClickHouse for performance. Opinions on Loki versus ELK were mixed: Loki is simpler and scalable, but ELK offers better query capabilities. Quickwit and SigNoz were also recommended for improved logging and observability.
Tutorials of the week
Build your own SQS or Kafka with Postgres: Create a queue table, handle message visibility timeouts, and ensure exactly-once delivery.
Terraform plan -light: Plans only the resources whose code has changed, reducing plan times without consistency risks.
Bash-Oneliner: Handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Exabyte-Scale Migration from Apache Spark: How Amazon successfully migrated exabyte-scale data processing from Apache Spark to Ray on EC2.
Upgrading Terraform EKS from 19.21 to 20.0: Detailed steps and code examples in the blog post.
Kubernetes Migration from Day Minus One to Day Two: Planning to migrate to Kubernetes? This one covers the essential pre-migration considerations.
Orchestrating Complex Azure Infrastructure: Learn to create, use, and optimize Terraform modules, build a multi-tier application, and implement best practices for large-scale projects.
Scaling Prometheus with Thanos: Addressing Prometheus’s limitations, Thanos offers long-term metrics storage, efficient querying, and cost savings using object storage.
Reducing storage costs with s3-batch-object-store: How to cut storage costs by 70% with an open-source module for efficiently storing and retrieving objects from a single file in S3.
GitOps your Terraform resources: Flux monitors a Git repository for changes in your Terraform configurations and reconciles them by adding, modifying, or deleting Azure resources accordingly.
AWS IAM Roles Anywhere: This guide covers implementing a serverless CA, creating trust anchors, and IAM Roles Anywhere profiles.
Getting real with virtual threads: Netflix's exploration of Java 21's virtual threads revealed a complex issue where applications using Spring Boot 3 and embedded Tomcat experienced intermittent timeouts and hangs.
Monitoring Workloads on Amazon EKS: Learn how to monitor workloads using Grafana and Prometheus.
Building Image Thumbnail Generator: Creating an image thumbnail generator using AWS Lambda and S3 Event Notifications with Terraform.
HTTP/0.9 From Scratch: Exploring HTTP/0.9, the simplest version from 1991, and how to build an HTTP/0.9 server in Go to understand the protocol fundamentals.
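For the Postgres queue tutorial listed above, the visibility-timeout idea can be sketched in a few lines. This is a minimal in-memory stand-in, not the tutorial's code: a real version would presumably keep the rows in a Postgres table and claim them inside a transaction. The names VisibilityQueue, send, receive, and ack are hypothetical. Note that a visibility timeout on its own gives at-least-once delivery; a consumer that crashes before acknowledging will see the message again.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    id: int
    body: str
    visible_at: float = 0.0  # earliest time a consumer may receive this message


class VisibilityQueue:
    """Hypothetical in-memory model of a queue table with visibility timeouts."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._rows: dict[int, Message] = {}
        self._next_id = 1

    def send(self, body: str) -> int:
        """Insert a message that is visible immediately and return its id."""
        msg = Message(id=self._next_id, body=body, visible_at=self._clock())
        self._rows[msg.id] = msg
        self._next_id += 1
        return msg.id

    def receive(self, visibility_timeout: float = 30.0) -> Optional[Message]:
        """Return the oldest visible message and hide it for visibility_timeout seconds."""
        now = self._clock()
        for msg in sorted(self._rows.values(), key=lambda m: m.id):
            if msg.visible_at <= now:
                msg.visible_at = now + visibility_timeout
                return msg
        return None  # nothing is visible right now

    def ack(self, message_id: int) -> bool:
        """Delete a processed message; returns False if it was already gone."""
        return self._rows.pop(message_id, None) is not None
```

A consumer would call receive, process the message, then call ack. If it never acknowledges, the message becomes visible again once the timeout passes and another consumer can pick it up.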
Projects of the week
Highlighting cool DevOps projects to keep an eye on:
Git-pr is a self-hosted Git collaboration tool. It combines email patch workflows with GitHub-style PRs, all within your local dev environment.
DevX is a tool for building lightweight Internal Developer Platforms. It can be used to build internal standards, prevent misconfigurations early, and enable infrastructure self-service.
Salmon is a centralized environment that collects, stores, and processes alerts and execution statistics of your data pipelines.
PgManage is a modern Postgres-centric graphical database client.
CloudySetup is a CLI tool designed to streamline AWS resource management using AWS Cloud Control API.
Devzat is a custom SSH server that takes you to a chat instead of a shell prompt. Because there are SSH apps on all platforms (even on phones), you can connect to Devzat on any device!
Managing Kubernetes costs is now simpler than ever with Tailwarden!
Keep your budget in check as Kubernetes clusters dynamically scale resources. Tailwarden allows you to create customized dashboards for tracking costs by cluster, namespace, or any desired dimension.
Go beyond Kubernetes with a unified view of your entire cloud spend. Whether your infrastructure is on AWS, GCP, or Azure, you can see all your expenses in one place!