<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[OverflowByte — DevOps, Cloud & Linux for Engineers]]></title><description><![CDATA[OverflowByte is a DevOps and Cloud engineering blog covering AWS, Kubernetes, Linux, CI/CD, and AI infra — for engineers transitioning and levelling up.]]></description><link>https://blog.overflowbyte.cloud</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1724695216327/3c240076-ec75-4104-bcd0-7de50a31b284.png</url><title>OverflowByte — DevOps, Cloud &amp; Linux for Engineers</title><link>https://blog.overflowbyte.cloud</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 20:53:07 GMT</lastBuildDate><atom:link href="https://blog.overflowbyte.cloud/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How to Extract (Unzip) tar.xz File: A Complete Beginner's Guide]]></title><description><![CDATA[If you've spent any time working with Linux, downloading open-source software, managing remote servers securely (such as allowing SSH Root Login), or dealing with server backups, chances are you've en]]></description><link>https://blog.overflowbyte.cloud/how-to-extract-unzip-tar-xz-file-a-complete-beginner-s-guide</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/how-to-extract-unzip-tar-xz-file-a-complete-beginner-s-guide</guid><category><![CDATA[Linux]]></category><category><![CDATA[linux-commands]]></category><category><![CDATA[beginnersguide]]></category><category><![CDATA[Tutorial]]></category><category><![CDATA[linux-basics]]></category><category><![CDATA[#learning-in-public]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sun, 01 Mar 2026 16:13:05 GMT</pubDate><enclosure 
url="https://cdn.hashnode.com/uploads/covers/6087f0a6aaea092e0faa6232/03404fc4-d37f-4f52-b1bc-954d9fe2308f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've spent any time working with Linux, downloading open-source software, managing remote servers securely (such as <a href="https://blog.overflowbyte.cloud/how-to-allow-ssh-root-login-on-linux-securely-real-world-use-cases-best-practices">allowing SSH Root Login</a>), or dealing with <a href="https://markdowntorichtext.com/articles/server-backups-guide">server backups</a>, chances are you've encountered a <code>.tar.xz</code> <a href="https://markdowntorichtext.com/articles/what-is-tar-xz">file</a>. At first glance, it might look like just another compressed folder, but how exactly do you open it?</p>
<p>If you are wondering how to extract a <code>tar.xz</code> file safely and efficiently using the terminal or <a href="https://blog.overflowbyte.cloud/discovering-the-power-behind-popular-linux-gui-applications">graphical tools</a>, you are in the right place. This comprehensive guide will walk you through everything you need to know about the <code>tar.xz</code> format, the <a href="https://markdowntorichtext.com/articles/tar-command-guide">essential Linux commands</a>, and how to <a href="https://markdowntorichtext.com/articles/troubleshoot-archives">troubleshoot common extraction errors</a>. Let's dive in and demystify the process of working with these highly compressed archives!</p>
<hr />
<p>If you find this guide useful, you might also like:</p>
<ul>
<li><p><a href="https://markdowntorichtext.com/articles/tar-command-guide">Quick tar command reference</a> — concise examples for creating and extracting archives.</p>
</li>
<li><p><a href="https://markdowntorichtext.com/articles/install-xz-and-usage">Install and use xz on Linux</a> — how to install <code>xz</code> and use its options.</p>
</li>
<li><p><a href="https://blog.overflowbyte.cloud/discovering-the-power-behind-popular-linux-gui-applications">Discovering the Power Behind Popular Linux GUI Applications</a> — explore graphical alternatives for extraction.</p>
</li>
<li><p><a href="https://markdowntorichtext.com/articles/troubleshoot-archives">Troubleshooting archive extraction errors</a> — fixes for common problems.</p>
</li>
<li><p><a href="https://markdowntorichtext.com/articles/server-backups-guide">Server backups: best practices</a> — tips for reliable backup archives.</p>
</li>
</ul>
<hr />
<h2><strong>1. Introduction</strong></h2>
<h3><strong>What is a tar.xz file?</strong></h3>
<p>A <code>.tar.xz</code> file is an archive created by combining two different tools: <code>tar</code> (Tape Archive) and <code>xz</code> (a compression algorithm based on LZMA2). Essentially, the <code>tar</code> application bundles multiple files and directories into a single archive file, while <code>xz</code> compresses that single archive down to a much smaller size.</p>
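<p>You can see the two layers by performing the steps manually. The file and directory names below are placeholders for illustration:</p>
<pre><code class="language-bash"># Create some sample data
mkdir -p demo/docs
echo "hello" &gt; demo/docs/readme.txt

# Step 1: tar bundles the directory into a single, uncompressed archive
tar -cf demo.tar demo/

# Step 2: xz compresses it, replacing demo.tar with demo.tar.xz
xz demo.tar

ls demo.tar.xz
</code></pre>
<p>In practice, <code>tar -cJf demo.tar.xz demo/</code> performs both steps in one command.</p>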
<h3><strong>Why tar.xz is commonly used in Linux distributions</strong></h3>
<p>Over the years, Linux users relied heavily on <code>.tar.gz</code> (gzip compression) and <code>.tar.bz2</code> (bzip2 compression). However, as software packages grew larger, the need for better compression became crucial. The <code>xz</code> algorithm provides higher compression ratios compared to older formats, meaning smaller file sizes without changing the basic workflow for creating or extracting archives.</p>
<blockquote>
<p>For a comparison of formats and when to use each, see <a href="https://markdowntorichtext.com/articles/tar-gz-vs-tar-xz">tar.gz vs tar.xz: Which to choose?</a>.</p>
</blockquote>
<h3><strong>Where users typically encounter tar.xz files</strong></h3>
<p>You will frequently see <code>.tar.xz</code> files when:</p>
<ul>
<li><p><strong>Downloading Software Source Code:</strong> Unlike <a href="https://blog.overflowbyte.cloud/simple-ways-to-install-vlc-on-linux-ubuntu-fedora-centos-more">installing precompiled tools via apt</a>, major open-source projects (like the Linux Kernel, Python, and Node.js) distribute their source code in <code>.tar.xz</code> format.</p>
</li>
<li><p><strong>Automated Workflows</strong>: When setting up a <a href="https://blog.overflowbyte.cloud/beginners-guide-to-building-a-professional-cicd-pipeline-from-scratch">professional CI/CD pipeline</a>, pipeline agents regularly download <code>.tar.xz</code> toolchains for caching and build environments.</p>
</li>
<li><p><strong>Creating Backups:</strong> System administrators prefer tar.xz when backing up large directories where disk space is a primary concern.</p>
</li>
</ul>
<hr />
<h2><strong>2. Understanding tar and xz Separately</strong></h2>
<p>To quickly master the tar.xz command, it helps to understand what the individual tools do.</p>
<h3><strong>What is tar?</strong></h3>
<p><code>tar</code> stands for <strong>Tape Archive</strong>. Originally developed decades ago to write data to sequential magnetic tape drives, it is now the standard utility for collecting many individual files and wrapping them into one single file (often called a "tarball"). Importantly, <code>tar</code> by itself <strong>does not compress</strong> data—it merely packages it.</p>
<h3><strong>What is xz compression?</strong></h3>
<p><code>xz</code> is a data compression utility that uses the LZMA2 algorithm. It takes a file (like our uncompressed tarball) and shrinks its size drastically. While <code>xz</code> is considerably slower at compressing data than <code>gzip</code>, decompression is fast, which makes it a good fit for software distribution: the project compresses once, and thousands of users download and extract quickly.</p>
<h3><strong>Difference between .tar, .tar.gz, .tar.bz2, and .tar.xz</strong></h3>
<ul>
<li><p><code>.tar</code>: Just an uncompressed bundle of files.</p>
</li>
<li><p><code>.tar.gz</code> <strong>(gzip)</strong>: Fast compression, fast extraction, decent file size reduction.</p>
</li>
<li><p><code>.tar.bz2</code> <strong>(bzip2)</strong>: Slower compression, better size reduction than gzip (but largely superseded by xz).</p>
</li>
<li><p><code>.tar.xz</code> <strong>(xz)</strong>: Extremely high compression ratio, creating the smallest file sizes, with fast decompression speeds.</p>
</li>
</ul>
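<p>The compression flags map directly onto these formats: <code>-z</code> for gzip, <code>-j</code> for bzip2, and <code>-J</code> for xz. A quick side-by-side (the sample data here is a placeholder; exact sizes will vary):</p>
<pre><code class="language-bash">mkdir -p sample
head -c 100000 /dev/zero &gt; sample/data.bin

tar -czf sample.tar.gz sample/   # gzip  (.tar.gz)
tar -cJf sample.tar.xz sample/   # xz    (.tar.xz); use -cjf for bzip2
ls -l sample.tar.gz sample.tar.xz
</code></pre>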
<h2><strong>3. Prerequisites</strong></h2>
<p>Before we start typing commands to unzip a tar.xz file, let's verify that your system has the required utilities installed.</p>
<h3><strong>Required Tools</strong></h3>
<p>To extract tar.xz files via the command line, you need:</p>
<ol>
<li><p><strong>tar</strong>: The archiving tool.</p>
</li>
<li><p><strong>xz-utils</strong>: The package containing the <code>xz</code> decompression libraries.</p>
</li>
</ol>
<h3><strong>How to check if tar and xz are installed</strong></h3>
<p>Open your terminal and check the versions of these tools by running:</p>
<pre><code class="language-bash">tar --version
xz --version
</code></pre>
<p>If the system outputs version information for both, you are good to go! If you get a "command not found" error, you need to install them. If <code>xz</code> is not installed, follow this quick tutorial to <a href="https://markdowntorichtext.com/articles/install-xz-and-usage">install xz</a>.</p>
<h3><strong>Installation Commands for Different Distributions</strong></h3>
<p><strong>Ubuntu, Debian, and Linux Mint:</strong></p>
<pre><code class="language-bash">sudo apt update
sudo apt install tar xz-utils
</code></pre>
<p><strong>CentOS, RHEL, and Fedora:</strong></p>
<pre><code class="language-bash">sudo dnf install tar xz
</code></pre>
<p><em>On older CentOS/RHEL releases that predate dnf, substitute <code>yum</code> for <code>dnf</code>.</em></p>
<p><strong>Arch Linux and Manjaro:</strong></p>
<pre><code class="language-bash">sudo pacman -S tar xz
</code></pre>
<h2><strong>4. How to Extract tar.xz File (Step-by-Step)</strong></h2>
<p>Now for the main event: how to extract tar.xz files on Linux. To extract with the terminal, use <code>tar -xf archive.tar.xz</code>; see the full <a href="https://markdowntorichtext.com/articles/tar-command-guide">tar command guide</a> for more examples.</p>
<h3><strong>The Basic Extraction Command</strong></h3>
<p>If you have a file named <code>archive.tar.xz</code> in your current directory, the standard tar.xz command to extract it is:</p>
<pre><code class="language-bash">tar -xf archive.tar.xz
</code></pre>
<p>This command will quietly unpack the contents into your current working directory.</p>
<h3><strong>Explanation of Each Flag</strong></h3>
<p>Let's break down the flags used in tar commands. While modern versions of <code>tar</code> can auto-detect the <code>xz</code> compression, the traditional and explicit way to untar tar.xz includes the <code>-J</code> flag:</p>
<pre><code class="language-bash">tar -xvf archive.tar.xz
# OR explicitly:
tar -xJvf archive.tar.xz
</code></pre>
<p>Here is what these letters do:</p>
<ul>
<li><p><code>-x</code> <strong>(eXtract)</strong>: Tells tar to extract files from an archive.</p>
</li>
<li><p><code>-v</code> <strong>(Verbose)</strong>: Tells tar to list the files on the screen as it extracts them (highly recommended so you can see what's happening!).</p>
</li>
<li><p><code>-f</code> <strong>(File)</strong>: Specifies the archive file to operate on. When you combine flags (e.g., <code>-xvf</code>), <code>-f</code> must come last, because the file name has to follow it immediately.</p>
</li>
<li><p><code>-J</code> <strong>(xz)</strong>: Explicitly tells tar that the archive is compressed using the xz algorithm.</p>
</li>
</ul>
<h3><strong>Extract to a Specific Directory</strong></h3>
<p>By default, tar extracts files into the current folder. If you want to unzip tar.xz into a different location, use the <code>-C</code> (Change directory) flag:</p>
<pre><code class="language-bash">tar -xf archive.tar.xz -C /path/to/destination/
</code></pre>
<p><em>Note: The destination directory must already exist before you run this command.</em></p>
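<p>Since <code>tar</code> won't create the destination for you, a common pattern is to pair it with <code>mkdir -p</code>. The archive and paths below are built locally purely for demonstration:</p>
<pre><code class="language-bash"># Build a tiny demo archive first
mkdir -p src &amp;&amp; echo "hi" &gt; src/app.txt
tar -cJf pkg.tar.xz src/

# Create the destination, then extract into it in one go
mkdir -p ./deploy &amp;&amp; tar -xf pkg.tar.xz -C ./deploy
ls ./deploy/src/app.txt
</code></pre>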
<h3><strong>Extract Specific Files from the Archive</strong></h3>
<p>If you only need a single file (e.g., <code>readme.txt</code>) from a massive archive, you don't have to extract the whole thing. Append the internal file path to your command:</p>
<pre><code class="language-bash">tar -xf archive.tar.xz path/inside/archive/readme.txt
</code></pre>
<h3><strong>List Contents Without Extracting</strong></h3>
<p>Want to see what is inside the archive before committing to an extraction? Use the <code>-t</code> (list) flag instead of <code>-x</code>:</p>
<pre><code class="language-bash">tar -tf archive.tar.xz
</code></pre>
<h3><strong>Extract with Progress</strong></h3>
<p>If you are extracting a massive multi-gigabyte backup, the terminal might sit blank for a while. You can monitor the progress by installing the <code>pv</code> (Pipe Viewer) tool and piping the file through it:</p>
<pre><code class="language-bash">pv archive.tar.xz | tar -xJ
</code></pre>
<p><em>Pro tip: If your extraction is slowing down because of heavy disk writes, you can profile your system's disk load using monitoring tools like</em> <a href="https://blog.overflowbyte.cloud/a-beginner-guide-for-iotop-to-processes-on-your-hard-disks"><em>iotop</em></a><em>.</em></p>
<h2><strong>5. Extracting tar.xz on Different Platforms</strong></h2>
<p>While Linux handles <code>tar</code> files natively, you might find yourself needing to open these files on other operating systems.</p>
<h3><strong>Linux (CLI and GUI Methods)</strong></h3>
<p>As covered above, the standard <code>tar -xf filename.tar.xz</code> in your terminal is the preferred and fastest method. Many desktop environments support right-click extraction — see our roundup of <a href="https://blog.overflowbyte.cloud/discovering-the-power-behind-popular-linux-gui-applications">GUI tools for Linux workflows</a> for details.</p>
<h3><strong>Windows</strong></h3>
<p>Windows 10 cannot open <code>.tar.xz</code> by double-clicking (recent Windows 11 builds can, via built-in libarchive support), but either way you have two reliable options:</p>
<ol>
<li><p><strong>Using 7-Zip:</strong> Download and install <a href="https://www.7-zip.org/">7-Zip</a>. Right-click the <code>.tar.xz</code> file, hover over "7-Zip," and select "Extract Here". <em>Note: 7-Zip might extract the</em> <code>.xz</code> <em>part first, leaving you with a</em> <code>.tar</code> <em>file. Just right-click and extract the</em> <code>.tar</code> <em>file again.</em></p>
</li>
<li><p><strong>Using Windows Subsystem for Linux (WSL):</strong> If you are a developer using WSL (Ubuntu on Windows), simply open your WSL terminal, navigate to your <code>/mnt/c/</code> drive, and run standard Linux tar commands.</p>
</li>
</ol>
<h3><strong>macOS (Terminal Method)</strong></h3>
<p>macOS is built on a Unix foundation and ships with <code>tar</code> preinstalled. Open the Terminal app and run:</p>
<pre><code class="language-bash">tar -xf archive.tar.xz
</code></pre>
<p>Alternatively, Mac tools like <strong>The Unarchiver</strong> can handle these files graphically.</p>
<h2><strong>6. Common Errors and Troubleshooting</strong></h2>
<p>Even seasoned DevOps engineers encounter errors. Here is how to handle them. If you run into permission or corrupted archive errors, consult our ultimate guide on <a href="https://markdowntorichtext.com/articles/troubleshoot-archives">troubleshooting extraction errors</a>.</p>
<h3><strong>"tar: command not found" or "xz: command not found"</strong></h3>
<p><strong>Cause:</strong> The required utilities are missing from your system. <strong>Fix:</strong> Refer back to Section 3 and run the installation commands for your Linux distribution.</p>
<h3><strong>"Permission denied"</strong></h3>
<p><strong>Cause:</strong> You are trying to extract files into a directory where your current user doesn't have write permissions (e.g., <code>/opt/</code> or <code>/usr/local/</code>). <strong>Fix:</strong> Prefix your extraction command with <code>sudo</code>:</p>
<pre><code class="language-bash">sudo tar -xf archive.tar.xz -C /opt/
</code></pre>
<h3><strong>"Unexpected EOF in archive" or "Corrupted archive"</strong></h3>
<p><strong>Cause:</strong> The file download was interrupted, or the file is genuinely corrupted. <strong>Fix:</strong> Re-download the file. If you have the checksum (like an MD5 or SHA256 hash), verify that the downloaded file matches the original hash.</p>
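<p>For example, with a SHA-256 checksum file (projects usually publish one next to the download; here we generate it locally just to illustrate the workflow):</p>
<pre><code class="language-bash"># Stand-in for a downloaded file and its published checksum
echo "test data" &gt; archive.tar.xz
sha256sum archive.tar.xz &gt; archive.tar.xz.sha256

# Verify the file against the checksum; prints "archive.tar.xz: OK"
# and exits non-zero if the hashes do not match
sha256sum -c archive.tar.xz.sha256
</code></pre>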
<h2><strong>7. Advanced Usage</strong></h2>
<p>Ready to level up your Linux CLI skills? Try these advanced techniques.</p>
<h3><strong>Combining Extraction with Pipe</strong></h3>
<p>Sometimes you might download a file using <code>curl</code> or <code>wget</code> and want to extract it immediately, without saving the compressed tarball to disk first:</p>
<pre><code class="language-bash">curl -L https://example.com/software.tar.xz | tar -xJf -
</code></pre>
<h3><strong>Extracting and Moving in One Command</strong></h3>
<p>You can combine the <code>-C</code> flag with <code>--strip-components</code>. Many archives place everything inside a single top-level folder. To skip that parent folder and extract the contents directly into your target directory:</p>
<pre><code class="language-bash">tar -xf archive.tar.xz -C /var/www/html/ --strip-components=1
</code></pre>
<h3><strong>Verifying Archive Integrity</strong></h3>
<p>You can test the integrity of an xz file before unzipping it using the <code>xz</code> tool directly:</p>
<pre><code class="language-bash">xz -t archive.tar.xz
</code></pre>
<p>If the command completes silently and returns to the prompt, the file is structurally sound.</p>
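<p>Because success is silent, it helps to check the exit status explicitly. The demo archive below is created inline for illustration:</p>
<pre><code class="language-bash">mkdir -p d &amp;&amp; echo "x" &gt; d/f.txt
tar -cJf good.tar.xz d/

# A zero exit status means the xz container is intact
xz -t good.tar.xz &amp;&amp; echo "archive OK"
</code></pre>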
<h3><strong>Performance Considerations</strong></h3>
<p>Extraction is heavily reliant on CPU performance. If you are dealing with very large files, multi-threading can help: recent versions of <code>xz</code> accept <code>-T0</code> to use all available cores, and alternative implementations like <code>pixz</code> parallelize both compression and decompression.</p>
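<p>As a sketch, assuming <code>xz</code> ≥ 5.2 for threaded compression (and ≥ 5.4 for threaded decompression), you can hand <code>tar</code> a threaded compressor via <code>--use-compress-program</code>. The directory and file names here are placeholders:</p>
<pre><code class="language-bash">mkdir -p payload &amp;&amp; echo "data" &gt; payload/file.txt

# Compress using all cores (-T0 auto-detects the core count)
tar --use-compress-program='xz -T0' -cf payload.tar.xz payload/

rm -rf payload

# Threaded decompression only helps when the archive was written in
# multi-threaded (block-split) mode, as above
tar --use-compress-program='xz -d -T0' -xf payload.tar.xz
cat payload/file.txt
</code></pre>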
<h2><strong>8. Real-World Example</strong></h2>
<p>Let's look at a practical, end-to-end scenario: downloading, extracting, and compiling the popular <code>htop</code> tool from source.</p>
<p><strong>Step 1: Download the software package</strong></p>
<pre><code class="language-bash">wget https://github.com/htop-dev/htop/releases/download/3.2.2/htop-3.2.2.tar.xz
</code></pre>
<p><strong>Step 2: Extract the tar.xz file</strong></p>
<pre><code class="language-bash">tar -xvf htop-3.2.2.tar.xz
</code></pre>
<p><strong>Step 3: Navigate into the newly extracted folder</strong></p>
<pre><code class="language-bash">cd htop-3.2.2
</code></pre>
<p><strong>Step 4: Compile the source code</strong> (Assuming build tools are installed)</p>
<pre><code class="language-bash">./configure
make
sudo make install
</code></pre>
<p>By extracting the tar.xz file, you've set the stage to successfully compile a Linux package from source!</p>
<h2><strong>9. Best Practices</strong></h2>
<p>To ensure smooth operations going forward, keep these best practices in mind:</p>
<ul>
<li><p><strong>Security Tips:</strong> Never extract archives using <code>sudo</code> unless you completely trust the source. Malicious archives can be crafted with absolute paths (e.g., <code>/etc/passwd</code>) to overwrite system files, although modern <code>tar</code> versions strip leading slashes by default to prevent this.</p>
</li>
<li><p><strong>Verify Checksums:</strong> Always verify SHA-256 checksums (and GPG signatures, when the project publishes them). This ensures you haven't downloaded a corrupted package or fallen victim to a man-in-the-middle attack.</p>
</li>
<li><p><strong>Extract Safely:</strong> Always consider running <code>tar -tf archive.tar.xz</code> first to preview the directory structure. It's frustrating to extract an archive that spills hundreds of loose files into your pristine <code>~/Downloads</code> folder instead of containing them neatly inside a parent directory.</p>
</li>
</ul>
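<p>A quick pattern for that preview (the demo archive below is built inline): if every entry shares a single top-level prefix, the extraction will stay contained.</p>
<pre><code class="language-bash">mkdir -p proj &amp;&amp; touch proj/a.txt proj/b.txt
tar -cJf proj.tar.xz proj/

# Peek at the first few entries before extracting
tar -tf proj.tar.xz | head -n 5
</code></pre>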
<h2><strong>10. Conclusion</strong></h2>
<p>Knowing how to extract tar.xz files is an absolutely critical skill for anyone using Linux, whether you are a beginner fiddling with a Raspberry Pi or an intermediate user moving toward SysAdmin or DevOps roles.</p>
<p>To quickly summarize: the magic tar.xz command is just <code>tar -xf filename.tar.xz</code>. Remember to use <code>-v</code> if you want verbose output and <code>-C</code> to extract to a specific target directory.</p>
<p>Now that you've mastered how to untar tar.xz archives, you are well on your way to Linux command-line mastery. Don't be afraid to read the manual (<code>man tar</code>) to discover even more powerful tricks you can perform!</p>
<hr />
<h2><strong>Frequently Asked Questions (FAQ)</strong></h2>
<p><strong>1. Can I use the</strong> <code>unzip</code> <strong>command for a</strong> <code>.tar.xz</code> <strong>file?</strong><br />No. The <code>unzip</code> command is specifically designed for <code>.zip</code> files. For <code>.tar.xz</code> files, you must use the <code>tar</code> command.</p>
<p><strong>2. Is</strong> <code>.tar.xz</code> <strong>better than</strong> <code>.zip</code><strong>?</strong><br />In the Unix/Linux ecosystem, yes. <code>.tar.xz</code> preserves Linux file permissions, ownerships, and symbolic links perfectly, whereas the standard <code>.zip</code> format does not natively handle these attributes well. Furthermore, <code>xz</code> offers vastly superior compression ratios compared to <code>zip</code>.</p>
<p><strong>3. How do I create my own tar.xz file?</strong><br />To compress a folder into a tar.xz archive, use the <code>-c</code> (create) flag:<br /><code>tar -cJf myarchive.tar.xz /path/to/my/folder</code></p>
<p><strong>4. Why is extracting my</strong> <code>.tar.xz</code> <strong>taking so long?</strong><br />The <code>xz</code> algorithm trades CPU computing time for smaller file sizes. Sometimes extracting very heavily compressed large files simply takes time, especially on low-powered CPUs. Using the <code>-v</code> flag during extraction lets you visually confirm that progress is continuously being made.</p>
<p><strong>5. How do I delete the original archive automatically after I extract it?</strong><br />While <code>tar</code> doesn't have a built-in flag to delete the source archive post-extraction, you can chain commands together using <code>&amp;&amp;</code>:<br /><code>tar -xf archive.tar.xz &amp;&amp; rm archive.tar.xz</code></p>
]]></content:encoded></item><item><title><![CDATA[Weekly DevOps & Cloud Intelligence Report – Week 4, February 2026]]></title><description><![CDATA[Introduction
If you spent the last week heads-down in tickets and deployments, here is what you missed. The period from February 23 to March 1, 2026 was unusually dense with infrastructure-layer chang]]></description><link>https://blog.overflowbyte.cloud/weekly-devops-cloud-intelligence-report-week-4-february-2026</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/weekly-devops-cloud-intelligence-report-week-4-february-2026</guid><category><![CDATA[DevOps weekly updates]]></category><category><![CDATA[Kubernetes updates 2026]]></category><category><![CDATA[Linux security updates]]></category><category><![CDATA[Generative AI, Software Development Lifecycle, AI in Software Development, Machine Learning, Market Growth, AI Solutions, Automated Coding, Software Automation, AI-powered Development, DevOps, Code Generation, Software Testing, AI Integration, Agile Development, Market Forecast, Artificial Intelligence, Software Innovation, ]]></category><category><![CDATA[Cloud computing news]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[weekly dev journal]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sun, 01 Mar 2026 15:47:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6087f0a6aaea092e0faa6232/5ef699f6-96a9-43af-93fb-f41b2583defe.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<h2>Introduction</h2>
<p>If you spent the last week heads-down in tickets and deployments, here is what you missed. The period from February 23 to March 1, 2026 was unusually dense with infrastructure-layer changes across all three major clouds, a meaningful IaC release, and a wave of Linux kernel and userland security patches that cannot be deferred indefinitely.</p>
<p>More importantly, the underlying direction is becoming clearer: AI agents are no longer confined to code assistants. They are being wired directly into cluster management, observability pipelines, and deployment systems. That shift is not purely theoretical anymore. This week gave us concrete product releases from AWS, Azure, and Google that make it real.</p>
<p>Whether you run Kubernetes workloads, manage RHEL servers, or are planning your next certification, there is something here that affects your work. Let us break it down.</p>
<hr />
<h2>Cloud &amp; DevOps Updates</h2>
<h2>AWS: EKS Node Monitoring Agent Goes Open Source</h2>
<p>AWS open-sourced the <a href="https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source/">Amazon EKS Node Monitoring Agent</a>. This agent runs as a DaemonSet on every node in your cluster and is responsible for collecting node-level metrics and logs, which it ships into AWS observability backends like CloudWatch.[<a href="https://www.youtube.com/watch?v=m-wN4k_Cur8">youtube</a>]​</p>
<p>This matters for a specific reason: until now, the agent was a black box. You could use it but not inspect, modify, or extend it. With the source available, teams running hybrid clusters or custom node configurations can fork the agent, add their own collectors, or simply audit what is being shipped out of their nodes.</p>
<p>A practical use case: if you run EKS with GPU nodes for ML workloads and want to add DCGM (NVIDIA's Data Center GPU Manager) metrics alongside the default node telemetry, you can now build that directly into the agent rather than running a sidecar.</p>
<hr />
<h2>AWS: Nested Virtualization, EKS Auto Mode Logging, and OpenSearch Cluster Insights</h2>
<p>Three smaller but noteworthy AWS updates landed this week.</p>
<p><strong>Nested KVM/Hyper-V support on EC2</strong> means you can now spin up virtual machines inside EC2 instances. This is immediately useful for CI environments where your pipeline needs to boot a full VM to test an installer, run Packer builds, or run Windows Subsystem for Linux in an isolated environment. The new high-frequency M8azn instances give you the compute headroom to make this practical.[<a href="https://www.youtube.com/watch?v=1l6vFSax6ac">youtube</a>]​</p>
<p><strong>EKS Auto Mode now vends CloudWatch logs per capability.</strong> If you use Auto Mode and want to understand what the control plane is actually doing with storage provisioning, load balancing, or compute scaling, those logs are now separated by capability rather than mixed into a single stream. That is a significant debugging improvement.[<a href="https://www.youtube.com/watch?v=m-wN4k_Cur8">youtube</a>]​</p>
<p><strong>OpenSearch cluster insights</strong> adds automated detection of hot shards and index imbalances. If you run OpenSearch for log aggregation and have seen unexplained query latency, this feature surfaces the root cause rather than leaving you to guess from slow query logs.</p>
<hr />
<h2>Azure: AKS Gets Kubernetes 1.34 GA, Node Auto-Provisioning, and an MCP Server</h2>
<p>Azure Kubernetes Service reached general availability on Kubernetes 1.34. The practical upside: Gateway API, more granular scheduling primitives, and improved networking behavior are now production-safe on AKS without needing to pin to a preview channel.[<a href="https://www.youtube.com/watch?v=RXje4dA9e30">youtube</a>]​</p>
<p>Node auto-provisioning is also now GA across more regions including government cloud. It supports LocalDNS, encryption at host, and disk encryption sets. This is essentially Karpenter-style node lifecycle management baked into AKS, which means your cluster can scale up, select the right node SKU, and apply security baselines without a human making those decisions per incident.[<a href="https://www.reddit.com/r/AZURE/comments/1r9zwst/azure_weekly_update_20th_february_2026/">reddit</a>]​</p>
<p>The more forward-looking announcement is the <strong>AKS MCP server</strong>, released on GitHub alongside an agentic CLI cluster mode. This is worth understanding. The Model Context Protocol (MCP) is a standard that lets AI agents communicate with external systems in a structured way. Microsoft's AKS MCP server exposes cluster resources through this protocol, which means an AI agent can list deployments, scale workloads, or apply manifests as a first-class operation rather than by parsing <code>kubectl</code> output.[<a href="https://www.youtube.com/watch?v=RXje4dA9e30">youtube</a>]​</p>
<p>Whether you adopt this immediately or not, this architectural pattern—AI agent as cluster operator—is where managed Kubernetes is heading.</p>
<hr />
<h2>Google Cloud: Multi-Region Cloud Run and Gemini Cloud Assist</h2>
<p>Google Cloud preview-launched <strong>multi-region failover for Cloud Run</strong>. Until now, if your Cloud Run service had a regional outage, you needed custom traffic management to reroute. The new feature handles failover and failback automatically. This is a meaningful reliability improvement for serverless workloads that were previously one regional incident away from complete unavailability.[<a href="https://www.youtube.com/watch?v=W7CnMCsHT8c">youtube</a>]​</p>
<p><strong>Gemini Cloud Assist</strong> is entering preview for Cloud SQL and AlloyDB, analyzing slow queries and performance anomalies directly from the console. Think of it as a DBA assistant that reads your query plans and tells you what is wrong before you file a ticket.</p>
<p>Google also added a <strong>remote MCP server for Cloud Run</strong>, which enables AI agents to deploy and manage Cloud Run services programmatically via the Model Context Protocol. Same pattern as AKS MCP, different execution environment.[<a href="https://www.youtube.com/watch?v=W7CnMCsHT8c">youtube</a>]​</p>
<hr />
<h2>Terraform: v1.14.6 Released, Enterprise 1.2.0 Brings Day-2 Actions</h2>
<p>HashiCorp shipped <strong>Terraform v1.14.6</strong> on February 25, continuing the 1.14.x stabilization cycle. A 1.15.0 alpha is in testing with Windows ARM64 builds and variable deprecation metadata—a sign the next minor version is getting closer to feature freeze.[<a href="https://discuss.hashicorp.com/c/release-notifications/57">discuss.hashicorp</a>]​</p>
<p>The more significant release is <strong>Terraform Enterprise 1.2.0</strong>, which ships two production-relevant features:</p>
<p><strong>Explorer GA</strong>: a graph-style view across all workspaces and run history in your TFE instance. Before this, getting visibility into which workspace last ran, which failed, or which resources drifted required either the CLI or the API. Explorer brings that into the UI. Note that it requires a backfill run with updated agents before historical data appears.</p>
<p><strong>Day-2 Actions GA</strong>: this lets you encode operational procedures—think patching, certificate rotation, or maintenance mode changes—as Terraform-managed workflows triggered by lifecycle hooks or directly via:[<a href="https://discuss.hashicorp.com/t/terraform-enterprise-1-2-0-is-available/77169">discuss.hashicorp</a>]​</p>
<pre><code class="language-plaintext">terraform apply -invoke=&lt;action-name&gt;
</code></pre>
<p>Paired with <strong>OIDC dynamic credentials</strong> (now supported in module test runs for AWS, Azure, GCP, and Vault), you can eliminate static secrets from your CI pipelines entirely. Instead of storing an <code>AWS_ACCESS_KEY_ID</code> in your CI environment, your runner assumes a role dynamically at runtime. This should be standard practice in any CI/CD pipeline touching cloud resources.[<a href="https://discuss.hashicorp.com/t/terraform-enterprise-1-2-0-is-available/77169">discuss.hashicorp</a>]​</p>
<hr />
<h2>Kubernetes Version Cadence Across Clouds</h2>
<p>A quick alignment table for planning cluster upgrades:</p>
<table>
<thead>
<tr>
<th>Cloud</th>
<th>GA Version</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td>EKS</td>
<td>Kubernetes 1.35</td>
<td>Supported since late January 2026</td>
</tr>
<tr>
<td>AKS</td>
<td>Kubernetes 1.34</td>
<td>GA as of this week</td>
</tr>
<tr>
<td>GKE</td>
<td>Kubernetes 1.34</td>
<td>Stable channel auto-upgrade begins March 10</td>
</tr>
</tbody></table>
<p>If your clusters are running 1.31 or earlier on any of these platforms, extended support charges or deprecation warnings are either already active or imminent. Upgrade planning should be on your sprint board now.</p>
<hr />
<h2>Linux &amp; Server Management</h2>
<h2>RHEL 10 Kernel Security Update: RHSA-2026:3124</h2>
<p>Red Hat issued <strong>RHSA-2026:3124</strong> on February 23 for RHEL 10 Extended Update Support. Two CVEs require attention:[<a href="https://access.redhat.com/errata/RHSA-2026:3124">access.redhat</a>]​</p>
<p><strong>CVE-2025-38730</strong> is an io_uring bug in network buffer handling. The io_uring subsystem is heavily used for high-performance I/O in modern application stacks. A mishandled buffer here can cause data corruption or system instability—not a remote code execution, but serious enough in any production environment doing high-throughput I/O.</p>
<p><strong>CVE-2025-39760</strong> is an out-of-bounds read during USB configuration parsing that can trigger a denial of service. On cloud VMs or containers this seems irrelevant, but on bare-metal servers where USB devices are present (even passively), this is an exploitable path to crash the host kernel.</p>
<p>Both require a reboot after patching. To check your current kernel version and whether the patch applies:</p>
<pre><code class="language-plaintext">uname -r
rpm -q kernel
sudo dnf check-update kernel
</code></pre>
<p>If you are on AlmaLinux, Rocky Linux, or another RHEL rebuild, expect equivalent advisories within the week. Do not defer this past your next maintenance window.[<a href="https://access.redhat.com/errata/RHSA-2026:3124">access.redhat</a>]​</p>
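<p>A portable way to confirm whether the reboot is still pending is to compare the running kernel against the newest installed one. The version strings below are illustrative placeholders; on a real host, <code>running</code> comes from <code>uname -r</code> and <code>latest</code> from the newest entry in the installed <code>kernel</code> package list:</p>
<pre><code class="language-bash"># If the newest installed kernel differs from the running one, the
# patched kernel is installed but not yet loaded -- a reboot is pending.
running="5.14.0-570.12.1.el10.x86_64"
latest="5.14.0-570.18.1.el10.x86_64"

# sort -V orders version strings numerically, so the last line is newest.
newest=$(printf '%s\n%s\n' "$running" "$latest" | sort -V | tail -1)
if [ "$newest" != "$running" ]; then
  echo "reboot required to load $latest"
else
  echo "running kernel is current"
fi
</code></pre>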
<hr />
<h2>Multi-Distro Patch Wave: OpenSSL, ImageMagick, freerdp, libsoup</h2>
<p>This week's CERN Linux update log is a useful proxy for what is hitting RHEL-family estates broadly. Active patches include:[<a href="http://linux.web.cern">linux.web.cern</a>]​</p>
<ul>
<li><p><strong>ImageMagick</strong> (CVE-2025-62171, CVE-2026-23876): image processing vulnerabilities that matter anywhere you resize or convert user-uploaded images on the server side.</p>
</li>
<li><p><strong>OpenSSL</strong> (CVE-2025-9230): affects any service using OpenSSL for TLS. That is most things.</p>
</li>
<li><p><strong>freerdp</strong> (multiple CVEs): relevant if you have any RDP-based remote access or broker services.</p>
</li>
<li><p><strong>libsoup</strong>: an HTTP client library used widely in GNOME-stack applications and some server-side tooling.</p>
</li>
</ul>
<p>Cross-distro coverage (AlmaLinux, Debian, Fedora, Oracle Linux, RHEL, Rocky, Ubuntu, SUSE) was documented in the January security roundup and continues this month. The volume alone is the argument for automated patch pipelines. Running <code>apt upgrade</code> or <code>dnf update</code> manually once a month is no longer a defensible posture.[<a href="https://www.linuxcompatible.org/story/linux-security-roundup-for-week-2-2026/">linuxcompatible</a>]​</p>
<p>A simple Ansible ad-hoc command to get your patch status across a fleet:</p>
<pre><code class="language-plaintext">ansible all -m command -a "dnf check-update" --become
</code></pre>
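<p>For the remediation itself, a minimal playbook sketch using the standard <code>ansible.builtin.dnf</code> module; the host group and the blanket update scope are assumptions to adapt to your own inventory and change windows:</p>
<pre><code class="language-yaml"># patch.yml -- apply all pending updates on RHEL-family hosts
- hosts: all
  become: true
  tasks:
    - name: Apply all pending package updates
      ansible.builtin.dnf:
        name: "*"
        state: latest
</code></pre>
<p>Run it with <code>ansible-playbook patch.yml</code> during a maintenance window, and follow up with reboots on any host where the kernel was updated.</p>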
<hr />
<h2>Patch Tuesday Spillover into Linux Environments</h2>
<p>February's Patch Tuesday covered 59 Microsoft vulnerabilities including six actively exploited zero-days, plus critical fixes for SAP and Intel TDX.[<a href="https://thehackernews.com/2026/02/over-60-software-vendors-issue-security.html">thehackernews</a>]​</p>
<p>If you run Hyper-V hosts under Linux guests, or Intel TDX-based confidential compute environments, the Intel TDX patches have direct hypervisor-layer implications. Unpatched hypervisor code sitting under a patched Linux guest is not a safe state. Align your Windows/Intel firmware patching cycle with your Linux kernel cycle.</p>
<hr />
<h2>Career &amp; Learning Trends</h2>
<h2>The Job Market: Platform Engineering and MLOps Are the Premium Tiers</h2>
<p>According to a February 21 HackerX analysis, DevOps, SRE, and Platform Engineer roles remain among the fastest-filling positions in tech. The differentiator in 2026 is specialization: generalist DevOps profiles compete in a crowded field, but engineers who combine infrastructure with ML platform experience (GPU cluster management, model serving, MLflow, Ray, etc.) command the top of the salary range—$150k–$260k base in major US markets at mid-senior level.[<a href="https://hackerx.org/devops-job-market-2026-trends-and-opportunities/">hackerx</a>]​</p>
<p>Perforce's <strong>2026 State of DevOps Report</strong> adds a structural data point: 70% of organizations say their DevOps maturity directly affects how successful their AI initiatives are. That is not a soft correlation. Organizations that cannot reliably deploy, monitor, and roll back software struggle to operationalize models. The foundational work matters more, not less, as AI tooling advances.[<a href="https://www.perforce.com/press-releases/state-of-devops-2026">perforce</a>]​</p>
<hr />
<h2>Certifications: What the Market is Actually Rewarding</h2>
<p>KodeKloud's February 2026 certification guide reflects current hiring patterns:[<a href="https://kodekloud.com/blog/top-10-devops-certifications-courses-engineers-are-choosing/">kodekloud</a>]​</p>
<ul>
<li><p><strong>Cloud</strong> (pick your primary platform): AWS DevOps Engineer Professional, AZ-400, or Google Professional Cloud DevOps Engineer.</p>
</li>
<li><p><strong>Kubernetes</strong>: CKA remains the strongest signal, with CKS increasingly required for any platform role touching production.</p>
</li>
<li><p><strong>IaC</strong>: HashiCorp Terraform Associate is table stakes for infrastructure roles.</p>
</li>
<li><p><strong>Security</strong>: CKS and cloud-provider security specializations are growing in weight.</p>
</li>
</ul>
<p>One practical note on the CKA: the Linux Foundation updated the exam in early 2025, and the new version runs on Kubernetes 1.34. If you are preparing using older study material, you will find roughly half the exam has shifted toward Gateway API, Helm, Kustomize, CRDs, and Operators—topics that older guides barely mention. Study accordingly, and do not rely on exam dumps from the pre-2025 version.</p>
<p>The community consensus on Reddit and engineering forums remains consistent: build real projects. Deploy a multi-tier application, break the cluster, fix it under time pressure, add monitoring, write the runbook. That experience is more durable than memorizing YAML.[<a href="https://www.reddit.com/r/devops/comments/1qvmdvq/best_devops_course_to_start_learning_is_devops/">reddit</a>]​</p>
<hr />
<h2>Strategic Tech Moves</h2>
<h2>Microsoft Consolidates Security Tooling Around Defender</h2>
<p>Microsoft extended the sunset date of the <strong>legacy Azure Sentinel portal to March 31, 2027</strong>, while continuing to push teams toward the unified Defender portal for SIEM and XDR operations. The delay is a concession to enterprise migration timelines, but the direction is firm: if your security operations still live primarily in the classic Sentinel interface, you have roughly one year before it goes away.[<a href="http://learn.microsoft">learn.microsoft</a>]​</p>
<p>More broadly, Microsoft is weaving Copilot capacity into partner benefit packages alongside Defender, Entra, and Intune. For DevOps teams that also own security posture (a combination that is increasingly common in smaller engineering organizations), this matters because your cloud portal experience is being redesigned around AI assistance. Learning to use it effectively is becoming part of the job.</p>
<hr />
<h2>The "Always-On Cloud" Assumption Is Cracking</h2>
<p>A theme emerging from multiple analyst pieces this week: enterprises are starting to acknowledge that cloud availability guarantees are not the same as application availability. A January 2026 analysis found that critical and major incidents across major DevOps SaaS platforms—GitHub, Jira, Azure DevOps—jumped 69% year-over-year in 2025, with total degraded time more than doubling.[<a href="https://thehackernews.com/2026/01/high-costs-of-devops-saas-downtime.html">thehackernews</a>]​</p>
<p>The architectural response is not to avoid cloud SaaS, but to design for its failure. That means self-hosted mirrors for critical repositories, independent backup strategies that do not rely solely on vendor exports, and multi-region deployment patterns that actually get tested. For platform engineers, this is not theoretical: design your internal developer platform to survive a GitHub outage without a 48-hour recovery period.</p>
<hr />
<h2>AI &amp; Automation in DevOps</h2>
<h2>What Is Actually Shipping This Week</h2>
<p>Let us separate what is available now from what is still vaporware.</p>
<p><strong>Available now:</strong></p>
<ul>
<li><p><strong>AWS Bedrock AgentCore</strong> supports server-side tool execution. This means an agent running inside AWS can call internal APIs and trigger workflows without routing through the user's client. For DevOps, this enables patterns like: "AI agent detects a degraded service, calls an internal runbook API to restart the affected component, and logs the action to an audit trail."[<a href="https://www.youtube.com/watch?v=m-wN4k_Cur8">youtube</a>]​</p>
</li>
<li><p><strong>Azure AKS MCP server</strong> is on GitHub. You can deploy it today. AI agents that understand MCP can now perform CRUD operations on AKS resources. The blast-radius question—how much autonomy you give those agents—is yours to configure through RBAC.[<a href="https://www.youtube.com/watch?v=RXje4dA9e30">youtube</a>]​</p>
</li>
<li><p><strong>Google Cloud Run MCP server</strong> lets LLM agents deploy and manage Cloud Run services. Paired with multi-region failover, you can build an agent that detects regional degradation and triggers a redeployment to a secondary region automatically.[<a href="https://www.youtube.com/watch?v=W7CnMCsHT8c">youtube</a>]​</p>
</li>
<li><p><strong>Bedrock Converse API batch inference</strong> is available. If you run LLM pipelines for log summarization, incident triage, or documentation generation, batch mode significantly reduces cost over synchronous inference.[<a href="https://www.youtube.com/watch?v=m-wN4k_Cur8">youtube</a>]​</p>
</li>
</ul>
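<p>The blast-radius point for these MCP servers comes down to ordinary Kubernetes RBAC. A deliberately narrow, observe-only role for an agent identity might look like the sketch below; the names are illustrative, and each MCP server's own documentation defines the permissions it actually requires:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcp-agent-readonly
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments", "services", "events"]
    verbs: ["get", "list", "watch"]   # observe-only: no create/delete
</code></pre>
<p>Bind it to the agent's service account and widen permissions deliberately, one verb at a time, as trust in the automation grows.</p>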
<p><strong>Worth watching but not yet production-ready for most:</strong></p>
<ul>
<li><strong>Gemini Cloud Assist for Cloud SQL/AlloyDB</strong> is in preview. Useful for experimentation, not for automated production remediation yet.[<a href="http://docs.cloud.google">docs.cloud.google</a>]​</li>
</ul>
<hr />
<h2>Observability as a Control Plane, Not a Dashboard</h2>
<p>Dynatrace's 2026 Pulse of Agentic AI survey found that 50% of organizations already have agentic AI in production somewhere, with IT operations the strongest adoption area at 70%. IBM's 2026 observability analysis describes the trajectory as "Observability-as-Code," where you define what gets monitored and at what threshold in version-controlled configuration, and AI uses that telemetry as guardrails for autonomous decisions.</p>
<p>The practical implication for your current stack: the quality of your observability instrumentation directly determines how trustworthy your AI automation can be. An agent that triggers a rollback based on a synthetic alert misconfiguration is worse than no agent at all. Getting your metrics, logs, and traces right is now upstream work for any AI-assisted ops initiative.</p>
<p>A concrete starting point: ensure all your services emit structured logs with consistent field names (service, environment, trace_id, error_code), and that your alert thresholds are reviewed and documented. That is the foundation before any AI layer is worth configuring.</p>
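<p>Those field names can be pinned down with something as simple as a shared logging helper. A minimal shell sketch — the function name and field values are our own illustrations:</p>
<pre><code class="language-bash"># log_event SERVICE ENV TRACE_ID ERROR_CODE MESSAGE
# Emits one JSON line using the consistent field names discussed above.
log_event() {
  printf '{"service":"%s","environment":"%s","trace_id":"%s","error_code":"%s","msg":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5"
}

log_event checkout prod 4bf92f3577b34da6 E_TIMEOUT "payment API timed out"
</code></pre>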
<hr />
<h2>Key Takeaways</h2>
<ul>
<li><p><strong>Patch your RHEL 10 kernel this maintenance window.</strong> CVE-2025-38730 (io_uring) and CVE-2025-39760 (USB OOB read) both require a reboot and should not be deferred. Downstream rebuilds (AlmaLinux, Rocky) will have equivalent patches shortly.</p>
</li>
<li><p><strong>Enable OIDC dynamic credentials in your CI pipelines.</strong> Terraform Enterprise 1.2.0 makes this straightforward for AWS, Azure, GCP, and Vault. Static cloud credentials in CI environments are a liability that has no technical justification in 2026.</p>
</li>
<li><p><strong>Plan Kubernetes upgrade windows now.</strong> EKS is at 1.35, AKS at 1.34, GKE Stable channel moves to 1.34 on March 10. Clusters two or more minor versions behind are either in extended support or approaching end-of-life.</p>
</li>
<li><p><strong>If you are studying for CKA, use updated material.</strong> The exam now runs on Kubernetes 1.34 with new coverage of Gateway API, Helm, Kustomize, CRDs, and Operators. Pre-2025 guides will leave you underprepared for roughly half the exam.</p>
</li>
<li><p><strong>The MCP pattern (AKS, Cloud Run) is the AI-DevOps integration to understand this year.</strong> It is a structured protocol for AI agents to interact with infrastructure. Learn how it works architecturally before you need to make a production decision about it.</p>
</li>
<li><p><strong>Design your toolchain to survive SaaS outages.</strong> GitHub, Jira, and Azure DevOps had 69% more critical incidents in 2025 than in 2024. Self-hosted mirrors and independent backup strategies are worth the investment.</p>
</li>
<li><p><strong>Observability quality gates your AI automation.</strong> Before adding AI agents to your ops stack, audit the quality and structure of your existing telemetry. Garbage-in still means garbage-out at any level of model sophistication.</p>
</li>
</ul>
<hr />
<h2>Conclusion</h2>
<p>This week is a good illustration of why keeping up with infrastructure-layer changes matters even when your sprint is full. A kernel CVE does not wait for your planning cycle. An AKS GA feature might change how you scope your next platform migration. A Terraform Enterprise release might eliminate a security practice you have been meaning to fix for months.</p>
<p>The bigger pattern running through all of this is that the toolchain is becoming more autonomous. AI agents managing clusters, responding to incidents, and deploying services is no longer a research topic—it is landing in production releases from the three largest cloud providers simultaneously. The engineers who will use this well are not the ones who adopt it fastest. They are the ones who have solid foundations: clean telemetry, tested runbooks, hardened RBAC, and a genuine understanding of what the automation is doing and why.</p>
<p>Stay rigorous. Stay curious. The pace of change is not slowing down, and the best defense against being overwhelmed by it is building systems you actually understand.</p>
]]></content:encoded></item><item><title><![CDATA[How to Allow SSH Root Login on Linux (Securely): Real-World Use Cases & Best Practices]]></title><description><![CDATA[If you have ever set up a new Linux server—whether you are setting up an Ubuntu web server from scratch or planning a complex Docker deployment for production—you have probably encountered a common ro]]></description><link>https://blog.overflowbyte.cloud/how-to-allow-ssh-root-login-on-linux-securely-real-world-use-cases-best-practices</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/how-to-allow-ssh-root-login-on-linux-securely-real-world-use-cases-best-practices</guid><category><![CDATA[ssh]]></category><category><![CDATA[Security]]></category><category><![CDATA[sysadmin]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Ubuntu]]></category><category><![CDATA[server management]]></category><category><![CDATA[Linux]]></category><category><![CDATA[linux for beginners]]></category><category><![CDATA[General Programming]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Wed, 25 Feb 2026 04:18:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6087f0a6aaea092e0faa6232/e448b393-cb75-4cd8-b333-21bd4f5ace75.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have ever set up a new Linux server—whether you are <a href="https://blog.overflowbyte.cloud/step-by-step-guide-setting-up-a-web-server-with-virtual-hosts-on-ubuntu">setting up an Ubuntu web server from scratch</a> or planning a complex <a href="https://blog.overflowbyte.cloud/the-comprehensive-guide-to-deploying-n8n-in-production-a-docker-deployment-journey">Docker deployment for production</a>—you have probably encountered a common roadblock: the server flat-out rejects direct SSH root login.</p>
<p>Out of the box, major Linux distributions like Ubuntu, Debian, and Rocky Linux block direct SSH access for the <code>root</code> user. The industry standard is to log in with a regular user account and use the <code>sudo</code> command for administrative tasks.</p>
<p>But sometimes, you genuinely need direct root access. In this beginner-friendly guide—part of our ongoing SSH Security Series—we will explore why root login is blocked by default, when it actually makes sense to enable it, and exactly how to configure it without compromising your Linux server's security.</p>
<hr />
<h2><strong>Why is SSH Root Login Disabled by Default?</strong></h2>
<p>The <code>root</code> user possesses absolute power over the Linux operating system. If an unauthorized person gains access to it, they have total control over your server.</p>
<p>There are a few critical reasons why SSH root login is turned off by default:</p>
<ol>
<li><p><strong>Brute-Force Attacks:</strong> Every Linux server has a <code>root</code> user. Malicious bots constantly scan the internet, attempting to guess the root password. Disabling direct login stops these automated attacks at the door.</p>
</li>
<li><p><strong>Accountability:</strong> If your entire team shares a single root password, you have no way of knowing who executed a specific command. Using individual accounts with <code>sudo</code> leaves a clear audit trail. Just as you should carefully <a href="https://blog.overflowbyte.cloud/managing-aws-iam-users-made-easy-tips-on-creation-administration-and-removal">manage your AWS IAM users</a> rather than sharing an AWS root account, you should manage Linux user access individually.</p>
</li>
<li><p><strong>Accidental Damage:</strong> Logging in as a regular user adds a necessary layer of friction. Having to explicitly type <code>sudo</code> makes you pause and think before running a potentially destructive command.</p>
</li>
</ol>
<hr />
<h2><strong>When Do You Actually Need Direct Root Login?</strong></h2>
<p>You should never enable root login simply to save a few keystrokes. However, there are valid, real-world scenarios where relying on <code>sudo</code> is impractical or impossible.</p>
<h3><strong>1. Automated System Backups</strong></h3>
<p>Tools like <code>rsync</code> or BorgBackup often need to read every single file on the file system, including files owned by other users. A dedicated backup server running an automated script typically requires direct root SSH access to pull a complete, system-wide backup without interactive password prompts.</p>
<h3><strong>2. Infrastructure Automation and CI/CD</strong></h3>
<p>If you use configuration management tools like Ansible to manage infrastructure, they sometimes need to connect directly as root to run initial bootstraps. Similarly, if you are building a <a href="https://blog.overflowbyte.cloud/beginners-guide-to-building-a-professional-cicd-pipeline-from-scratch">professional CI/CD pipeline</a>, your deployment agents might temporarily need elevated privileges to copy system files or restart core services.</p>
<h3><strong>3. Deep System Diagnostics</strong></h3>
<p>When troubleshooting severe server lag or disk I/O bottlenecks, administrators rely on specialized diagnostic tools. Using utilities like <a href="https://blog.overflowbyte.cloud/a-beginner-guide-for-iotop-to-processes-on-your-hard-disks">iotop to monitor hard disk processes</a> deep within the system layer is much easier when operating natively as root, especially in isolated testing environments.</p>
<hr />
<h2><strong>The Right Way to Enable Root Login Securely</strong></h2>
<p>If your workflow demands direct root login, there is a right way and a wrong way to configure it.</p>
<p><strong>The Golden Security Rule: Never use passwords for root SSH.</strong></p>
<p>You must rely entirely on SSH cryptographic keys. Here is the step-by-step guide to setting it up safely.</p>
<h3><strong>Step 1: Generate an SSH Key Pair</strong></h3>
<p>If you do not already have one, generate an SSH key pair on your local computer (not the server). Open your terminal and run:</p>
<pre><code class="language-bash">ssh-keygen -t ed25519 -C "your_email@example.com"
</code></pre>
<p>Press Enter to accept the default file location, and be sure to set a strong passphrase for an extra layer of security.</p>
<h3><strong>Step 2: Copy the Public Key to the Server</strong></h3>
<p>Next, copy your public key to the server's root account.</p>
<p>If root login is completely disabled right now, you will need to use your regular user to set this up initially:</p>
<ol>
<li><p>SSH into the server as your regular user.</p>
</li>
<li><p>Switch to root privileges: <code>sudo su -</code></p>
</li>
<li><p>Create the SSH directory if it does not exist: <code>mkdir -p /root/.ssh &amp;&amp; chmod 700 /root/.ssh</code></p>
</li>
<li><p>Add your public key (from your local <code>~/.ssh/id_ed25519.pub</code> file) into <code>/root/.ssh/authorized_keys</code>.</p>
</li>
<li><p>Secure the file permissions: <code>chmod 600 /root/.ssh/authorized_keys</code></p>
</li>
</ol>
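<p>Steps 3–5 can also be wrapped into one idempotent helper so re-running it never duplicates the key. A sketch — the function name is our own; run it as root on the server, passing the public key line and <code>/root/.ssh</code>:</p>
<pre><code class="language-bash"># install_authorized_key "PUBLIC_KEY_LINE" TARGET_SSH_DIR
# Creates the directory, appends the key only if it is missing, and
# locks down permissions exactly as in steps 3-5 above.
install_authorized_key() {
  install -d -m 700 "$2"
  touch "$2/authorized_keys"
  grep -qxF "$1" "$2/authorized_keys" || printf '%s\n' "$1" >> "$2/authorized_keys"
  chmod 600 "$2/authorized_keys"
}

# On the server, as root:
# install_authorized_key "$(cat /tmp/id_ed25519.pub)" /root/.ssh
</code></pre>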
<h3><strong>Step 3: Edit the SSH Configuration File</strong></h3>
<p>Open the main SSH configuration file on your server using a text editor:</p>
<pre><code class="language-bash">sudo nano /etc/ssh/sshd_config
</code></pre>
<h3><strong>Step 4: Update the</strong> <code>PermitRootLogin</code> <strong>Directive</strong></h3>
<p>Search for the line that says <code>PermitRootLogin</code>. It might be commented out with a <code>#</code>. Modify it to look exactly like this:</p>
<pre><code class="language-text">PermitRootLogin prohibit-password
</code></pre>
<p>This specific, highly secure setting means that the root user is allowed to log in via SSH, but <strong>password authentication is strictly forbidden</strong>. The only way to gain access is by possessing the authorized SSH key.</p>
<h3><strong>Step 5: Restrict by IP Address (Optional but Highly Recommended)</strong></h3>
<p>If you only need root login from a specific backup server or a static office IP, you can restrict access to just that machine's IP address. First change the Step 4 directive to <code>PermitRootLogin no</code>, so root is denied everywhere by default, then add this block to the very bottom of the config file. A <code>Match</code> block overrides the global setting only for connections from the listed address:</p>
<pre><code class="language-text">Match Address 198.51.100.45
    PermitRootLogin prohibit-password
</code></pre>
<p><em>(Remember to replace the placeholder IP address with your actual trusted IP).</em></p>
<h3><strong>Step 6: Restart the SSH Service</strong></h3>
<p>Save your changes (in Nano, press <code>CTRL+O</code>, <code>Enter</code>, <code>CTRL+X</code>) and restart the SSH service so the new security rules take effect.</p>
<pre><code class="language-bash"># Validate the configuration first -- a typo in sshd_config can lock
# you out of the server:
sudo sshd -t

# For Ubuntu or Debian distributions:
sudo systemctl restart ssh

# For RHEL, CentOS, or Rocky Linux distributions:
sudo systemctl restart sshd
</code></pre>
<hr />
<h2><strong>Wrapping Up Our SSH Security Series</strong></h2>
<p>Allowing direct root login over SSH is not a cardinal sin, provided you configure it securely. By strictly disabling password authentication, relying exclusively on SSH keys, and ideally restricting access by trusted IP addresses, you can accommodate your automated DevOps workflows without giving hackers an easy way into your Linux server.</p>
<p>For more deep dives into Linux administration, cloud infrastructure, and DevOps best practices, be sure to check out our <a href="https://blog.overflowbyte.cloud/week-16-22-2026">weekly tech roundups and newsletters</a>. Keep your servers secure, and happy deploying!</p>
]]></content:encoded></item><item><title><![CDATA[Cloud, Kernel & Models: What Changed This Week (Feb 16–22, 2026)]]></title><description><![CDATA[A compact, practitioner-focused digest of the week's most impactful releases, updates, and strategic shifts across AWS, Azure, GCP, Kubernetes, Linux, CI/CD, and AI-driven infrastructure.

The One-Lin]]></description><link>https://blog.overflowbyte.cloud/week-16-22-2026</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/week-16-22-2026</guid><category><![CDATA[technology]]></category><category><![CDATA[#DevOps #DevOpsCommunity #LearnInPublic #LearningTogether #DevOpsSeries #DevOpsSeries2025 #DevOpsJourney #LinkedinLearning #TechSeries #WeeklyRecap #Cloud #Automation #Ansible #CICD #Jenkins #Linux #Docker #Kubernetes #k8s #Helm]]></category><category><![CDATA[newsletter]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[linux for beginners]]></category><category><![CDATA[k8s]]></category><category><![CDATA[General Programming]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Mon, 23 Feb 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6087f0a6aaea092e0faa6232/a6bed80a-ea97-4088-ab73-520fdd0426a0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A compact, practitioner-focused digest of the week's most impactful releases, updates, and strategic shifts across AWS, Azure, GCP, Kubernetes, Linux, CI/CD, and AI-driven infrastructure.</em></p>
<hr />
<h2>The One-Line Takeaway</h2>
<blockquote>
<p><strong>AI moved from a workload to an infrastructure primitive this week — and your toolchain, certifications, and cloud bill are all changing because of it.</strong></p>
</blockquote>
<hr />
<h2>☁️ Cloud Platforms: AWS, Azure &amp; GCP</h2>
<h2>AWS</h2>
<p>Amazon had a dense week focused on compute and AI inference.</p>
<ul>
<li><p><strong>EC2 Hpc8a instances are now GA.</strong> Built on AMD EPYC Gen 5 with 300 Gbps EFA networking, they deliver up to 40% higher performance for HPC workloads — CFD, FEA, risk simulations — without moving to GPU-heavy stacks. If you run tightly coupled simulations on EC2, this is a direct upgrade path.</p>
</li>
<li><p><strong>SageMaker Inference for custom Amazon Nova models</strong> is live. You can now deploy your own fine-tuned Nova-based models with configurable instance types, autoscaling, and concurrency controls — treating large-model inference the same as any other managed service. No custom inference server. No Kubernetes YAML sprawl. Just a policy, an endpoint, and autoscaling rules.</p>
</li>
<li><p><strong>Nested virtualization on EC2 C8i, M8i, R8i</strong> — AWS quietly unlocked nested KVM/Hyper-V support on Xeon 6–based mainstream instances, not just bare-metal. Run complex testbeds, WSL inside Windows dev boxes, or Docker-on-VM lab environments directly inside EC2 without provisioning bare-metal.</p>
</li>
</ul>
<blockquote>
<p>💡 <em>For your DevOps/Cloud transition: AWS is treating AI inference as a tunable, scalable building block — the same way Lambda abstracted functions in 2015. Start designing for it now.</em></p>
</blockquote>
<hr />
<h2>Azure</h2>
<p>Microsoft shipped a dense set of operational updates, mostly GA:</p>
<ul>
<li><p><strong>AKS Fleet Manager namespace-scoped resource placement (preview)</strong> — Multi-cluster, multi-tenant scheduling is getting more granular. If you manage multiple AKS clusters, Fleet Manager is the path to GitOps-style cross-cluster placement without custom operators.</p>
</li>
<li><p><strong>Azure Container Storage v2.1.0 GA</strong> — Full Elastic SAN integration with on-demand install. Better storage ergonomics for stateful AKS workloads.</p>
</li>
<li><p><strong>WAF Default Ruleset 2.2 GA + X-Forwarded-For–based rate limiting</strong> for Application Gateway WAF v2. Better bot and DDoS mitigation without custom rules.</p>
</li>
<li><p><strong>Serverless Workspaces in Azure Databricks GA</strong> — No cluster management for ad-hoc data engineering. Relevant if your team runs mixed ML + infra workflows.</p>
</li>
<li><p><strong>New reference architectures published:</strong> Highly available multi-region AKS deployments, and an Azure AI hub-and-spoke landing zone. If you're designing greenfield Azure environments in 2026, these are worth bookmarking before you start the Terraform.</p>
</li>
<li><p><strong>Azure Copilot Data Connector for Microsoft Sentinel (public preview)</strong> — You can now ingest Copilot activity as security events into Sentinel. AI assistant actions are officially part of your attack surface. Model them accordingly.</p>
</li>
</ul>
<hr />
<h2>Google Cloud</h2>
<p>Google Cloud's updates this week center on economics and developer experience:</p>
<ul>
<li><p><strong>Cloud Run now supports Ubuntu 24 LTS base images GA</strong> for source deployments. Standardize your Cloud Run builds on Ubuntu 24, align them with your GKE node base, and carry consistent patching across both.</p>
</li>
<li><p><strong>Expanded Compute CUDs covering Cloud Run</strong> — Flexible committed use discounts now apply across Compute Engine, GKE, <em>and</em> Cloud Run together, which simplifies cost governance for mixed serverless + container workloads.</p>
</li>
<li><p><strong>GKE Dynamic Default StorageClass</strong> — GKE now auto-selects between Persistent Disk and Hyperdisk based on node hardware in mixed-generation clusters. Your PVC manifests stay cleaner and more portable.</p>
</li>
<li><p><strong>Google Cloud Innovators Program going "Legacy"</strong> — No new members. Existing members keep their 35 monthly Skills Boost credits and Innovator badge. The program is being replaced by the <strong>GEAR (Gemini Enterprise Agent Ready)</strong> AI-agent community. If you're already in the program, keep redeeming. If you're not, expect Google's learning initiatives to be increasingly AI/agent-centric.</p>
</li>
</ul>
<hr />
<h2>🐳 Kubernetes, Containers &amp; CI/CD Tooling</h2>
<h2>Kubernetes: Patch Storm</h2>
<p>This week's Kubernetes patch wave was broad:</p>
<ul>
<li><p><strong>K8s v1.35.1, 1.34.4, 1.33.8, 1.32.12</strong> all released within the same window, mostly for stability, with notable fixes for high etcd CPU usage after restart in K3s.</p>
</li>
<li><p>If you run any of these series in production, schedule maintenance windows. This wasn't a security-critical release, but etcd stability fixes are worth treating as priority patches.</p>
</li>
</ul>
<h2>CSI External Snapshotter v8.5.0</h2>
<p><strong>VolumeGroupSnapshot moves to GA.</strong> Minimum supported Kubernetes is now 1.25. If you rely on application-consistent snapshots across multiple PVCs (e.g., a database data + WAL volume), this is the release to move to.</p>
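<p>For orientation, a group snapshot manifest follows the shape below. This is a sketch based on the pre-GA API; confirm the exact <code>apiVersion</code> (v1beta1 versus a promoted v1) against the v8.5.0 release notes before relying on it:</p>
<pre><code class="language-yaml">apiVersion: groupsnapshot.storage.k8s.io/v1beta1   # verify post-GA version
kind: VolumeGroupSnapshot
metadata:
  name: db-group-snap
  namespace: databases
spec:
  volumeGroupSnapshotClassName: csi-group-snap-class
  source:
    selector:
      matchLabels:
        app: postgres   # every PVC with this label is snapshotted together
</code></pre>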
<h2>Docker Engine 29.x</h2>
<p>The 29.x line is now on hosted runners and worth your attention:</p>
<ul>
<li><p><strong>nftables backend (experimental)</strong> replacing iptables for Docker networking.</p>
</li>
<li><p>Better encrypted overlay network stability and Swarm networking reliability.</p>
</li>
<li><p><strong>cgroup v1 deprecation</strong> — officially deprecated, supported through at least 2029. If your hosts are still on cgroup v1 kernel configs, start tracking the migration path.</p>
</li>
<li><p><strong>GitHub Actions hosted runners</strong> (Ubuntu, Windows) moved to <strong>Docker 29.1 + Compose v2.40</strong> on Feb 9. If your CI pipelines rely on deprecated Docker flags or old Compose behaviors, now is the time to test and fix.</p>
</li>
</ul>
<h2>Red Hat OpenShift 4.21 GA</h2>
<p>Built on <strong>Kubernetes 1.34 + CRI-O 1.34</strong>, this release is now generally available:</p>
<ul>
<li><p>Includes Kueue integration for batch/AI orchestration (relevant for ML pipelines on OpenShift).</p>
</li>
<li><p>CIFS/SMB CSI driver operator + Kernel Module Management operator on IBM Power.</p>
</li>
<li><p>Continued push toward unified VM + container management and AI workload support via OpenShift Platform Plus.</p>
</li>
</ul>
<blockquote>
<p><strong>Strategic signal:</strong> Red Hat is betting heavily on "one control plane for everything" — VMs, containers, edge, AI. Migration Toolkit for Virtualization (MTV) is their answer to VMware migration anxiety.</p>
</blockquote>
<h2>GitHub Actions: Big Changes for CI/CD Budgets</h2>
<p>Two things to know:</p>
<ol>
<li><p><strong>Pricing shift (effective now &amp; March 1):</strong> Hosted runner prices dropped up to <strong>39%</strong> starting Jan 1, 2026. But from <strong>March 1, 2026</strong>, self-hosted runners on private repos will incur a <strong>$0.002/min cloud platform charge</strong>. Public repos stay free. If you're on self-hosted, run the math now.</p>
</li>
<li><p><strong>Feature updates (early Feb):</strong> Custom runner autoscaling now supports containers, VMs, and bare metal with multi-label support and <strong>explicit agentic workflow support</strong> (GitHub Copilot coding agent jobs). Allowed actions allowlists are now available to <em>all</em> plans, improving supply-chain control for small teams too.</p>
</li>
</ol>
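<p>A quick back-of-the-envelope for the March 1 change. The figures below (runner count, busy hours) are hypothetical; substitute your own fleet's numbers:</p>

```shell
# Hypothetical fleet: 4 self-hosted runners on private repos, each busy
# ~6 hours/day, 30 days/month. Rate: $0.002/min from March 1, 2026.
RUNNERS=4
BUSY_HOURS_PER_DAY=6
DAYS=30
MINUTES=$((RUNNERS * BUSY_HOURS_PER_DAY * 60 * DAYS))
COST=$(awk -v m="$MINUTES" 'BEGIN { printf "%.2f", m * 0.002 }')
echo "$MINUTES runner-minutes -> \$$COST/month"
```

<p><em>Weigh the result against hosted runners, which are now up to 39% cheaper; only the math for your real runner-minutes will tell you whether self-hosting still pays off.</em></p>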
<h2>Cloudflare Terraform Provider v5.17.0</h2>
<p>Adds <code>ai_search_instance</code> and <code>ai_search_token</code> resources, plus state migration upgraders for the v4 → v5 transition. If you manage Cloudflare infra as code, you can now provision AI search infra alongside your DNS, Workers, and WAF config. The v4 → v5 migration path is also smoother now — good time to make that upgrade if you've been delaying.</p>
<h2>Datadog Feature Flags GA</h2>
<p>Datadog shipped <strong>Feature Flags as a first-class product</strong>, tying each flag directly to APM and RUM signals. You can now see in real-time whether a flag change correlates with error rate spikes or latency increases — and roll back in the same interface where you're already watching your services. This collapses the gap between release management and observability. <strong>Datadog DASH 2026 (June 9–10, NYC)</strong> is also now open for registration — the year's biggest AI + observability + security event.</p>
<hr />
<h2>🐧 Linux &amp; Server Management</h2>
<h2>Security: Patch Week Across the Board</h2>
<p>This was an active security advisory week for both major enterprise Linux families:</p>
<ul>
<li><p><strong>Ubuntu:</strong> Multiple kernel security advisories covering <strong>16.04, 18.04, 20.04, 22.04, 24.04, and 25.10</strong> were issued. If you run unattended-upgrades, it should have already pulled these. If you manage fleets manually, verify kernel package versions now.<br />→ Check: <a href="https://ubuntu.com/security/notices">https://ubuntu.com/security/notices</a></p>
</li>
<li><p><strong>RHEL 9 (RHSA-2026:2722):</strong> A moderate-impact kernel security update was released Feb 15. Standard patching cycle, but review the associated CVEs against your workload's code paths.<br />→ Check: <a href="https://access.redhat.com/errata/RHSA-2026:2722">https://access.redhat.com/errata/RHSA-2026:2722</a></p>
</li>
<li><p><strong>Fedora CVE-2025-1272 (High):</strong> Kernel lockdown mode is disabled on some Fedora builds running 6.12+, exposing Secure Boot assumptions and allowing unsigned kernel modules. If you run Fedora with Secure Boot enabled, verify your lockdown configuration explicitly — don't assume it's on.</p>
</li>
</ul>
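<p>A minimal verification sketch for the advisories above. Each check is guarded so it only runs where the relevant tooling exists (package names may differ on your distro):</p>

```shell
# Running kernel version: compare against the advisory's fixed version
KERNEL="$(uname -r)"
echo "Running kernel: $KERNEL"

# Ubuntu: any kernel packages still pending upgrade?
command -v apt >/dev/null 2>&1 && apt list --upgradable 2>/dev/null | grep -i linux-

# RHEL 9: confirm the installed kernel package against RHSA-2026:2722
command -v rpm >/dev/null 2>&1 && rpm -q kernel

# Fedora + Secure Boot (CVE-2025-1272): lockdown must NOT report [none]
[ -r /sys/kernel/security/lockdown ] && cat /sys/kernel/security/lockdown

true  # the checks are informational; a missing tool should not abort the script
```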
<h2>Kernel Direction</h2>
<ul>
<li><p><strong>Linux 6.12 LTS</strong> remains the stable long-term support kernel (supported through Dec 2026+), shipping real-time PREEMPT_RT, sched_ext, eBPF improvements, and hardware support updates. Most enterprise distros will continue riding 6.12.x for the near term.</p>
</li>
<li><p><strong>Linux 7.0 is the next major release</strong> — Torvalds has announced the version bump after 6.19, expected around April 2026.</p>
</li>
</ul>
<h2>Podman v5.8.0</h2>
<p>Better handling of multiple Quadlet files and new support for <strong>AppArmor configuration in</strong> <code>.container</code> <strong>files</strong>. If you use Podman + Quadlet for systemd-managed containers on RHEL/Fedora servers, this release makes per-container AppArmor profiles much more ergonomic.</p>
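<p>For context, a Quadlet <code>.container</code> unit is a small INI file under <code>~/.config/containers/systemd/</code>. A minimal sketch follows; the image and profile names are placeholders, and since I haven't confirmed the exact new AppArmor key, it uses the long-standing <code>PodmanArgs=</code> escape hatch (see <code>podman-systemd.unit(5)</code> for the v5.8 syntax):</p>

```ini
# ~/.config/containers/systemd/web.container — illustrative unit
[Container]
Image=docker.io/library/nginx:alpine
PublishPort=8080:80
# Pre-5.8 fallback: pass the AppArmor profile as a raw podman flag
PodmanArgs=--security-opt apparmor=my-container-profile

[Service]
Restart=always

[Install]
WantedBy=default.target
```

<p><em>After <code>systemctl --user daemon-reload</code>, the unit shows up as <code>web.service</code>.</em></p>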
<hr />
<h2>🎓 Career &amp; Learning: What the Market Wants in 2026</h2>
<h2>Skills That Are Actually Getting You Hired</h2>
<p>Based on multiple 2026 skills analyses published this week, the non-negotiable stack for a DevOps engineer role in 2026 is:</p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><th><p>Tier</p></th><th><p>Skills</p></th></tr><tr><td><p><strong>Table stakes</strong></p></td><td><p>Cloud (AWS/Azure/GCP), Kubernetes, Linux, Git</p></td></tr><tr><td><p><strong>Strong differentiator</strong></p></td><td><p>GitOps + Platform Engineering, Terraform/IaC, CI/CD (GitHub Actions, ArgoCD)</p></td></tr><tr><td><p><strong>Fast-rising demand</strong></p></td><td><p>DevSecOps, Chaos engineering, AI-augmented workflows</p></td></tr><tr><td><p><strong>Emerging expectation</strong></p></td><td><p>Prompt engineering, AIOps tooling, self-healing system design</p></td></tr></tbody></table>

<p>My current trajectory — Windows Admin → Linux → AWS/Docker/Kubernetes → DevOps — directly maps to the "table stakes + differentiator" tier. The AI-augmented layer is where to invest next.</p>
<h2>CKA: Updated for Kubernetes 1.34</h2>
<p>The Linux Foundation's CKA exam is now based on <strong>Kubernetes v1.34</strong>, with a significantly updated scope:</p>
<ul>
<li><p><strong>Exam weight distribution:</strong> Troubleshooting 30% · Cluster Architecture 25% · Networking 20% · Workloads &amp; Scheduling 15% · Storage 10%</p>
</li>
<li><p><strong>New emphasis:</strong> Helm, Kustomize, Gateway API, NetworkPolicy, CRDs, and extension interfaces (CNI/CSI/CRI) now account for roughly half the exam.</p>
</li>
<li><p>Old prep guides are outdated. <strong>freeCodeCamp just released a fully updated CKA prep course (2026)</strong> sponsored by Linux Foundation:<br />→ <a href="https://www.youtube.com/watch?v=l57xKN6OBhY">https://www.youtube.com/watch?v=l57xKN6OBhY</a></p>
</li>
</ul>
<h2>AWS Certification Shifts</h2>
<ul>
<li><p><strong>ML Specialty retires end of March 2026.</strong> If you're mid-study, plan around this.</p>
</li>
<li><p><strong>New AWS Certified Generative AI Developer–Professional</strong> is rolling out (beta from late 2025).</p>
</li>
<li><p><strong>Best DevOps path in 2026:</strong> Developer Associate → DevOps Engineer Professional (for automation/release engineering roles), or CloudOps Engineer Associate → DevOps Engineer Professional (for operations-heavy roles).</p>
</li>
</ul>
<hr />
<h2>🤖 AI in Infrastructure: It's Not "Coming" Anymore</h2>
<p>The theme this week wasn't "AI is coming to DevOps." It was "AI is already a system component — start treating it like one."</p>
<h2>AI as an Infrastructure Primitive</h2>
<ul>
<li><p>AWS SageMaker Inference for custom Nova models means you configure LLM deployments the same way you configure EC2 autoscaling groups — instance type, scaling policy, concurrency limits. Model deployment is now just another piece of your IaC.</p>
</li>
<li><p>Cloudflare AI Search in Terraform (<code>ai_search_instance</code>, <code>ai_search_token</code>) means AI search backends are provisioned alongside your security and networking config in the same <code>terraform apply</code>.</p>
</li>
</ul>
<h2>AI-Augmented CI/CD</h2>
<ul>
<li><p>GitHub Actions autoscaling runners now <strong>explicitly support agentic workflows</strong> — pipelines where a Copilot coding agent proposes changes, opens PRs, and runs tests end-to-end. This week's update bakes the required telemetry and autoscaling directly into the runner pool, not as a bolt-on.</p>
</li>
<li><p>The cloud-native community is also raising flags: "AI slop" (low-quality AI-generated code entering pipelines) and lack of auditability for AI agents in production are now active engineering concerns, not theoretical ones. <strong>Safe shutdown mechanisms and policy-as-code are going to matter.</strong></p>
</li>
</ul>
<h2>AI Observability is a Product Category Now</h2>
<ul>
<li><p>Datadog Feature Flags (above) is one piece of this. Datadog's broader direction — Toto foundation model for telemetry, BOOM benchmark for AI forecasting, LLM cost/latency tracking — shows observability vendors are treating AI workloads as a first-class monitoring target.</p>
</li>
<li><p>DASH 2026's full AI observability track (June, NYC) will likely establish the best-practice playbook for LLM/agent monitoring in production.</p>
</li>
</ul>
<h2>Security: AI Actions Are Attack Surface</h2>
<ul>
<li><p><strong>Azure Copilot → Sentinel connector (public preview):</strong> AI assistant actions are now loggable as security events. Your SIEM needs to understand what your AI tools are doing, not just your users.</p>
</li>
<li><p><strong>Google's GTIG AI Misuse Report</strong> (Feb 11): Documents how threat actors are actively exploiting AI tools for phishing, recon, and code generation. If your team is integrating AI agents into CI/CD or operations workflows, threat model them — not just the code they produce, but the actions they can take.</p>
</li>
</ul>
<hr />
<h2>✅ Actions for This Week</h2>
<p>If you're actively building toward a DevOps/Cloud engineering role, here's what to do with this week's information:</p>
<ul>
<li><p><strong>Patch your Linux systems.</strong> Ubuntu (all supported) and RHEL 9 both received kernel updates. If you self-manage any servers, this is the week to run <code>apt upgrade</code> or <code>dnf update kernel</code>.</p>
</li>
<li><p><strong>Test your CI pipelines against Docker 29.1.</strong> GitHub-hosted runners have already upgraded. Check for broken flags, removed behaviors, or lingering cgroup v1 dependencies.</p>
</li>
<li><p><strong>Review the March 1 GitHub Actions self-hosted pricing change.</strong> If you run self-hosted runners on private repos, calculate your monthly exposure now.</p>
</li>
<li><p><strong>Bookmark the updated CKA prep course.</strong> The new exam scope (Helm, Kustomize, Gateway API, NetworkPolicy) is meaningfully different from pre-2025 guides. Align your study material.</p>
</li>
<li><p><strong>Read the Datadog Feature Flags launch page.</strong> Even if you don't use Datadog, the model of "tie every flag to your observability telemetry" is becoming an industry expectation.</p>
</li>
<li><p><strong>If you're on Fedora:</strong> Explicitly verify kernel lockdown + Secure Boot status (CVE-2025-1272). Don't assume lockdown mode is active on 6.12.x builds.</p>
</li>
</ul>
<hr />
<p><em>Follow along for weekly DevOps/Cloud briefings, practical career guides, and infra deep-dives. What from this week are you acting on? Drop it in the comments.</em></p>
<p><em>#DevOps #CloudEngineering #Kubernetes #Linux #AWS #Azure #GCP #Docker #GitOps #DevSecOps #SRE #Infrastructure #CareerInTech</em></p>
]]></content:encoded></item><item><title><![CDATA[Beginner's Guide to Building a Professional CI/CD Pipeline from Scratch]]></title><description><![CDATA[Project: Week 1 → CI/CD Foundations (Node.js + GraphQL)
Repository: Push1697/devops-portfolio


In the world of DevOps, a pipeline isn't just a script that runs tests — it's the factory floor of your software delivery. A well-architected pipeline ens...]]></description><link>https://blog.overflowbyte.cloud/beginners-guide-to-building-a-professional-cicd-pipeline-from-scratch</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/beginners-guide-to-building-a-professional-cicd-pipeline-from-scratch</guid><category><![CDATA[ #githubactions]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[node]]></category><category><![CDATA[Pipeline]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Mon, 16 Feb 2026 16:39:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771259845009/3d0e553f-3b95-4e94-bbb3-f8a4bc4bd67e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Project:</strong> Week 1 → CI/CD Foundations (Node.js + GraphQL)</p>
<p><strong>Repository:</strong> <a target="_blank" href="https://github.com/Push1697/devops-portfolio">Push1697/devops-portfolio</a></p>
</blockquote>
<hr />
<p>In the world of DevOps, a pipeline isn't just a script that runs tests — it's the <strong>factory floor</strong> of your software delivery. A well-architected pipeline ensures that code flows from a developer's laptop to production reliably, securely, and rapidly.</p>
<p>This guide walks you through building a complete CI/CD pipeline for a basic Node.js application using <strong>GitHub Actions</strong>, <strong>Docker</strong>, and <strong>AWS</strong> from scratch. Every file, every keyword, every config is explained so you can build the same pipeline yourself <strong>without ever opening the GitHub repo</strong>.</p>
<p>Whether you're a beginner or refining your skills, this is the kind of pipeline you'd find behind any serious production deployment.</p>
<hr />
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#1-architecture-overview">Architecture Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#2-project-setup--build-the-application">Project Setup — Build the Application</a></p>
</li>
<li><p><a class="post-section-overview" href="#3-docker-hub-setup--create-your-token">Docker Hub Setup — Create Your Token</a></p>
</li>
<li><p><a class="post-section-overview" href="#4-github-repository-secrets">GitHub Repository Secrets</a></p>
</li>
<li><p><a class="post-section-overview" href="#5-aws-infrastructure-setup">AWS Infrastructure Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#6-the-pipeline--complete-ci-cd-pipelineyml-walkthrough">The Pipeline — Complete Walkthrough</a></p>
</li>
<li><p><a class="post-section-overview" href="#7-branch-protection--governance">Branch Protection &amp; Governance</a></p>
</li>
<li><p><a class="post-section-overview" href="#8-troubleshooting--every-error-we-hit">Troubleshooting — Every Error We Hit</a></p>
</li>
<li><p><a class="post-section-overview" href="#9-summary">Summary</a></p>
</li>
</ol>
<hr />
<h2 id="heading-1-architecture-overview">1. Architecture Overview</h2>
<p>Before diving into code, let's understand the flow. We aren't just "deploying code"; we are orchestrating a <strong>software supply chain</strong>.</p>
<h3 id="heading-the-pipeline-stages">The Pipeline Stages</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Stage</td><td>What It Does</td><td>Key Tools</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Build</strong></td><td>Clean install (<code>npm ci</code>), syntax check</td><td>Node.js 20, npm</td></tr>
<tr>
<td><strong>Test</strong></td><td>Run unit/integration tests</td><td>Jest / npm test</td></tr>
<tr>
<td><strong>Security</strong></td><td>Dependency audit + static analysis</td><td>npm audit, CodeQL</td></tr>
<tr>
<td><strong>Docker</strong></td><td>Build, scan, and push container image</td><td>Docker Buildx, Trivy</td></tr>
<tr>
<td><strong>Deploy</strong></td><td>Pull &amp; run on EC2 via SSM (with rollback)</td><td>AWS OIDC, SSM</td></tr>
</tbody>
</table>
</div><h3 id="heading-pipeline-visualization">Pipeline Visualization</h3>
<p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/parallel_running_dashboard_ghactions.png" alt="GitHub Actions Pipeline — All 5 stages passed. Build (7s) → Test (13s) and Security (1m 10s) run in parallel → Docker (1m 11s) → Deploy (8s)" /></p>
<p><em>The complete pipeline DAG in GitHub Actions. Notice how</em> <strong><em>Test</em></strong> <em>and</em> <strong><em>Security</em></strong> <em>run in parallel after Build, and Docker + Deploy only trigger on the</em> <code>main</code> branch. Total pipeline time: ~2m 49s.</p>
<blockquote>
<p><strong>Key design choice:</strong> Test and Security are independent of each other. By running them in parallel (both use <code>needs: build</code>), we cut pipeline time without sacrificing quality gates.</p>
</blockquote>
<hr />
<h2 id="heading-2-project-setup-build-the-application">2. Project Setup — Build the Application</h2>
<p>Before anything CI/CD, you need a working application. Here's exactly what to build.</p>
<h3 id="heading-step-1-initialize-the-nodejs-project">Step 1: Initialize the Node.js Project</h3>
<pre><code class="lang-bash">mkdir week1-cicd &amp;&amp; <span class="hljs-built_in">cd</span> week1-cicd
npm init -y
npm install express express-graphql graphql
</code></pre>
<p><strong>What each package does:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Package</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><code>express</code></td><td>Web framework — handles HTTP routing and middleware</td></tr>
<tr>
<td><code>express-graphql</code></td><td>Adds a <code>/graphql</code> endpoint with the GraphiQL IDE</td></tr>
<tr>
<td><code>graphql</code></td><td>Core library for defining schemas, types, and resolvers</td></tr>
</tbody>
</table>
</div><p>Your <code>package.json</code> should look like this:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"name"</span>: <span class="hljs-string">"node-ci-demo"</span>,
  <span class="hljs-attr">"version"</span>: <span class="hljs-string">"1.0.0"</span>,
  <span class="hljs-attr">"description"</span>: <span class="hljs-string">"Sample Node.js app for CI/CD demo"</span>,
  <span class="hljs-attr">"main"</span>: <span class="hljs-string">"server.js"</span>,
  <span class="hljs-attr">"scripts"</span>: {
    <span class="hljs-attr">"start"</span>: <span class="hljs-string">"node server.js"</span>
  },
  <span class="hljs-attr">"license"</span>: <span class="hljs-string">"MIT"</span>,
  <span class="hljs-attr">"dependencies"</span>: {
    <span class="hljs-attr">"express"</span>: <span class="hljs-string">"^5.2.1"</span>,
    <span class="hljs-attr">"express-graphql"</span>: <span class="hljs-string">"^0.12.0"</span>,
    <span class="hljs-attr">"graphql"</span>: <span class="hljs-string">"^15.10.1"</span>
  }
}
</code></pre>
<blockquote>
<p>💡 <strong>Important:</strong> After running <code>npm install</code>, a <code>package-lock.json</code> file is generated. <strong>You must commit this file</strong> — the pipeline uses <code>npm ci</code> which requires it.</p>
</blockquote>
<h3 id="heading-step-2-create-the-data-file-mockdatajson">Step 2: Create the Data File (<code>MOCK_DATA.json</code>)</h3>
<pre><code class="lang-json">[
  { <span class="hljs-attr">"id"</span>: <span class="hljs-number">1</span>, <span class="hljs-attr">"firstName"</span>: <span class="hljs-string">"Asha"</span>, <span class="hljs-attr">"lastName"</span>: <span class="hljs-string">"Iyer"</span>, <span class="hljs-attr">"email"</span>: <span class="hljs-string">"asha.iyer@example.com"</span>, <span class="hljs-attr">"password"</span>: <span class="hljs-string">"pass1234"</span> },
  { <span class="hljs-attr">"id"</span>: <span class="hljs-number">2</span>, <span class="hljs-attr">"firstName"</span>: <span class="hljs-string">"Noah"</span>, <span class="hljs-attr">"lastName"</span>: <span class="hljs-string">"Cole"</span>, <span class="hljs-attr">"email"</span>: <span class="hljs-string">"noah.cole@example.com"</span>, <span class="hljs-attr">"password"</span>: <span class="hljs-string">"pass2345"</span> },
  { <span class="hljs-attr">"id"</span>: <span class="hljs-number">3</span>, <span class="hljs-attr">"firstName"</span>: <span class="hljs-string">"Mina"</span>, <span class="hljs-attr">"lastName"</span>: <span class="hljs-string">"Khan"</span>, <span class="hljs-attr">"email"</span>: <span class="hljs-string">"mina.khan@example.com"</span>, <span class="hljs-attr">"password"</span>: <span class="hljs-string">"pass3456"</span> },
  { <span class="hljs-attr">"id"</span>: <span class="hljs-number">4</span>, <span class="hljs-attr">"firstName"</span>: <span class="hljs-string">"Luis Kumar"</span>, <span class="hljs-attr">"lastName"</span>: <span class="hljs-string">"Santos"</span>, <span class="hljs-attr">"email"</span>: <span class="hljs-string">"luis.santos@example.com"</span>, <span class="hljs-attr">"password"</span>: <span class="hljs-string">"pass4567"</span> }
]
</code></pre>
<h3 id="heading-step-3-create-the-server-serverjs">Step 3: Create the Server (<code>server.js</code>)</h3>
<p>The server code sets up:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Endpoint</td><td>Type</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>/</code></td><td>HTML</td><td>Interactive user search UI</td></tr>
<tr>
<td><code>/graphql</code></td><td>GraphQL</td><td>Full GraphQL API with queries + mutations</td></tr>
<tr>
<td><code>/rest/getAllUsers</code></td><td>REST</td><td>Returns all users as JSON</td></tr>
<tr>
<td><code>/api/users/search?q=</code></td><td>REST</td><td>Search users by name or email</td></tr>
</tbody>
</table>
</div><p>Here's the core structure of <code>server.js</code>:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> express = <span class="hljs-built_in">require</span>(<span class="hljs-string">"express"</span>);
<span class="hljs-keyword">const</span> graphql = <span class="hljs-built_in">require</span>(<span class="hljs-string">"graphql"</span>);
<span class="hljs-keyword">const</span> { graphqlHTTP } = <span class="hljs-built_in">require</span>(<span class="hljs-string">"express-graphql"</span>);

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> PORT = <span class="hljs-number">5000</span>;
<span class="hljs-keyword">const</span> userData = <span class="hljs-built_in">require</span>(<span class="hljs-string">"./MOCK_DATA.json"</span>);

<span class="hljs-keyword">const</span> {
  GraphQLObjectType,
  GraphQLSchema,
  GraphQLList,
  GraphQLInt,
  GraphQLString,
} = graphql;

<span class="hljs-comment">// Define the User type (what fields a "User" has)</span>
<span class="hljs-keyword">const</span> UserType = <span class="hljs-keyword">new</span> GraphQLObjectType({
  <span class="hljs-attr">name</span>: <span class="hljs-string">"User"</span>,
  <span class="hljs-attr">fields</span>: <span class="hljs-function">() =&gt;</span> ({
    <span class="hljs-attr">id</span>: { <span class="hljs-attr">type</span>: GraphQLInt },
    <span class="hljs-attr">firstName</span>: { <span class="hljs-attr">type</span>: GraphQLString },
    <span class="hljs-attr">lastName</span>: { <span class="hljs-attr">type</span>: GraphQLString },
    <span class="hljs-attr">email</span>: { <span class="hljs-attr">type</span>: GraphQLString },
    <span class="hljs-attr">password</span>: { <span class="hljs-attr">type</span>: GraphQLString },
  }),
});

<span class="hljs-comment">// Define queries (how to READ data)</span>
<span class="hljs-keyword">const</span> RootQuery = <span class="hljs-keyword">new</span> GraphQLObjectType({
  <span class="hljs-attr">name</span>: <span class="hljs-string">"RootQueryType"</span>,
  <span class="hljs-attr">fields</span>: {
    <span class="hljs-attr">getAllUsers</span>: {
      <span class="hljs-attr">type</span>: <span class="hljs-keyword">new</span> GraphQLList(UserType),
      <span class="hljs-attr">args</span>: { <span class="hljs-attr">id</span>: { <span class="hljs-attr">type</span>: GraphQLInt } },
      resolve() { <span class="hljs-keyword">return</span> userData; },         <span class="hljs-comment">// Returns all users</span>
    },
    <span class="hljs-attr">findUserById</span>: {
      <span class="hljs-attr">type</span>: UserType,
      <span class="hljs-attr">args</span>: { <span class="hljs-attr">id</span>: { <span class="hljs-attr">type</span>: GraphQLInt } },
      resolve(parent, args) {
        <span class="hljs-keyword">return</span> userData.find(<span class="hljs-function">(<span class="hljs-params">item</span>) =&gt;</span> item.id === args.id);
      },
    },
  },
});

<span class="hljs-comment">// Define mutations (how to WRITE data)</span>
<span class="hljs-keyword">const</span> Mutation = <span class="hljs-keyword">new</span> GraphQLObjectType({
  <span class="hljs-attr">name</span>: <span class="hljs-string">"Mutation"</span>,
  <span class="hljs-attr">fields</span>: {
    <span class="hljs-attr">createUser</span>: {
      <span class="hljs-attr">type</span>: UserType,
      <span class="hljs-attr">args</span>: {
        <span class="hljs-attr">firstName</span>: { <span class="hljs-attr">type</span>: GraphQLString },
        <span class="hljs-attr">lastName</span>: { <span class="hljs-attr">type</span>: GraphQLString },
        <span class="hljs-attr">email</span>: { <span class="hljs-attr">type</span>: GraphQLString },
        <span class="hljs-attr">password</span>: { <span class="hljs-attr">type</span>: GraphQLString },
      },
      resolve(parent, args) {
        <span class="hljs-keyword">const</span> newUser = {
          <span class="hljs-attr">id</span>: userData.length + <span class="hljs-number">1</span>,
          ...args,
        };
        userData.push(newUser);
        <span class="hljs-keyword">return</span> newUser;
      },
    },
  },
});

<span class="hljs-comment">// Build the schema and mount GraphQL + REST endpoints</span>
<span class="hljs-keyword">const</span> schema = <span class="hljs-keyword">new</span> GraphQLSchema({ <span class="hljs-attr">query</span>: RootQuery, <span class="hljs-attr">mutation</span>: Mutation });

app.use(<span class="hljs-string">"/graphql"</span>, graphqlHTTP({ schema, <span class="hljs-attr">graphiql</span>: <span class="hljs-literal">true</span> }));

app.get(<span class="hljs-string">"/rest/getAllUsers"</span>, <span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> { res.send(userData); });

app.get(<span class="hljs-string">"/api/users/search"</span>, <span class="hljs-function">(<span class="hljs-params">req, res</span>) =&gt;</span> {
  <span class="hljs-keyword">const</span> query = (req.query.q || <span class="hljs-string">""</span>).toLowerCase().trim();
  <span class="hljs-keyword">const</span> results = query
    ? userData.filter(<span class="hljs-function">(<span class="hljs-params">u</span>) =&gt;</span>
        [u.firstName, u.lastName, u.email]
          .join(<span class="hljs-string">" "</span>).toLowerCase().includes(query)
      )
    : userData;
  res.json(results);
});

<span class="hljs-comment">// The "/" route serves an HTML page with a search UI (omitted for brevity)</span>

app.listen(PORT, <span class="hljs-function">() =&gt;</span> { <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"Server running"</span>); });
</code></pre>
<p><strong>Test it locally:</strong></p>
<pre><code class="lang-bash">npm start
<span class="hljs-comment"># Open http://localhost:5000</span>
</code></pre>
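<p>With the server running, you can smoke-test every endpoint from a second terminal. A sketch (the GraphQL query mirrors the schema above; <code>BASE</code> assumes the default port from <code>server.js</code>):</p>

```shell
BASE="http://localhost:5000"

# REST: all users, then a filtered search
curl -s "$BASE/rest/getAllUsers" || echo "unreachable: is npm start running?"
curl -s "$BASE/api/users/search?q=asha" || echo "unreachable: is npm start running?"

# GraphQL: the same data through the /graphql endpoint
QUERY='{"query":"{ getAllUsers { id firstName email } }"}'
curl -s -X POST "$BASE/graphql" \
  -H "Content-Type: application/json" \
  -d "$QUERY" || echo "unreachable: is npm start running?"
```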
<p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/application_interface.png" alt="Application running at 15.207.102.213:5000 — The User Search interface showing 8 users, with a search bar and total count" /></p>
<p><em>The deployed application running on EC2. Notice the URL — this is the public IP of our AWS instance on port 5000, exactly what the pipeline deploys to.</em></p>
<h3 id="heading-step-4-create-the-dockerfile-multi-stage-distroless-build">Step 4: Create the Dockerfile (Multi-Stage Distroless Build)</h3>
<p>This is where most beginners just write <code>FROM node</code> and call it a day. We don't do that.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Stage 1: Dependencies — use a full Node image to install packages</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">20</span>-alpine3.<span class="hljs-number">18</span> AS builder
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package*.json ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm ci --omit=dev --ignore-scripts \
    &amp;&amp; npm cache clean --force \
    &amp;&amp; rm -rf /root/.npm /tmp/*</span>

<span class="hljs-comment"># Stage 2: Runtime — copy only what's needed into a minimal image</span>
<span class="hljs-keyword">FROM</span> gcr.io/distroless/nodejs20-debian12:nonroot
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder --chown=nonroot:nonroot /app/node_modules ./node_modules</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --chown=nonroot:nonroot server.js .</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --chown=nonroot:nonroot MOCK_DATA.json .</span>
<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">5000</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"server.js"</span>]</span>
</code></pre>
<p><strong>Line-by-line explanation:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Line</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>FROM node:20-alpine3.18 AS builder</code></td><td>Stage 1 uses Alpine Linux (small) to install deps. Named <code>builder</code> for reference</td></tr>
<tr>
<td><code>COPY package*.json ./</code></td><td>Copies both <code>package.json</code> and <code>package-lock.json</code> into the container</td></tr>
<tr>
<td><code>npm ci --omit=dev</code></td><td>Clean install, production dependencies only — skips devDependencies</td></tr>
<tr>
<td><code>--ignore-scripts</code></td><td>Skips lifecycle scripts (postinstall, etc.) — reduces attack surface</td></tr>
<tr>
<td><code>npm cache clean --force</code></td><td>Removes npm cache — smaller image layer</td></tr>
<tr>
<td><code>FROM gcr.io/distroless/nodejs20-debian12:nonroot</code></td><td>Stage 2 uses <strong>Google's Distroless</strong> — no shell, no package manager, no OS utils</td></tr>
<tr>
<td><code>COPY --from=builder</code></td><td>Copies <code>node_modules</code> from Stage 1 into the final image</td></tr>
<tr>
<td><code>--chown=nonroot:nonroot</code></td><td>Files owned by non-root user — container never runs as root</td></tr>
<tr>
<td><code>CMD ["server.js"]</code></td><td>Distroless uses exec form (no shell), so we pass the filename directly</td></tr>
</tbody>
</table>
</div><blockquote>
<p>💡 <strong>Result:</strong> Our final image is <strong>51.6 MB</strong> instead of ~1 GB with a standard <code>node:20</code> base. You can verify this in Docker Hub — smaller image = faster pulls = faster deployments.</p>
</blockquote>
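<p>To see that number yourself before wiring anything into CI, build and inspect the image locally. A sketch (the <code>node-ci-demo:local</code> tag is illustrative, and the commands are guarded so they no-op where Docker isn't installed):</p>

```shell
IMAGE="node-ci-demo:local"   # illustrative tag; rename to suit your registry
if command -v docker >/dev/null 2>&1; then
  docker build -t "$IMAGE" .                     # runs both stages above
  docker image ls "$IMAGE" --format '{{.Size}}'  # expect ~52 MB, not ~1 GB
  docker run --rm -p 5000:5000 "$IMAGE"          # same shape the pipeline deploys
else
  echo "docker not found: run this on your workstation"
fi
```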
<h3 id="heading-step-5-create-the-dockerignore">Step 5: Create the <code>.dockerignore</code></h3>
<pre><code class="lang-text">node_modules
npm-debug.log
.git
.env
tests/
*.test.js
coverage/
README.md
</code></pre>
<p><strong>Why this matters:</strong> Without <code>.dockerignore</code>, Docker copies your entire <code>node_modules</code> (which gets rebuilt inside), <code>.git</code> history, and test files into the build context — making builds slower and images larger.</p>
<h3 id="heading-step-6-commit-everything">Step 6: Commit Everything</h3>
<pre><code class="lang-bash">git add .
git commit -m <span class="hljs-string">"feat: add Node.js app with GraphQL + REST endpoints"</span>
git push origin main
</code></pre>
<blockquote>
<p>⚠️ <strong>Don't forget</strong> <code>package-lock.json</code>! The pipeline's <code>npm ci</code> command requires it. If missing, the build fails immediately.</p>
</blockquote>
<hr />
<h2 id="heading-3-docker-hub-setup-create-your-token">3. Docker Hub Setup — Create Your Token</h2>
<p>Before the pipeline can push images, you need a Docker Hub account and an access token.</p>
<h3 id="heading-step-1-navigate-to-personal-access-tokens">Step 1: Navigate to Personal Access Tokens</h3>
<p>Go to Docker Hub → <strong>Account Settings</strong> → <strong>Security</strong> → <strong>Personal access tokens</strong>.</p>
<p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/docker_pat_token.png" alt="Docker Hub — Personal Access Tokens page showing &quot;Generate new token&quot; button" /></p>
<p><em>The Docker Hub PAT settings page. If this is your first token, you'll see the "Generate new token" button.</em></p>
<h3 id="heading-step-2-create-the-access-token">Step 2: Create the Access Token</h3>
<p>Click <strong>Generate new token</strong> and fill in:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Value</td><td>Why</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Access token description</strong></td><td><code>github_actions</code></td><td>Identifies what this token is used for</td></tr>
<tr>
<td><strong>Expiration date</strong></td><td>30 days (or more)</td><td>Set based on your security policy</td></tr>
<tr>
<td><strong>Access permissions</strong></td><td><strong>Read &amp; Write</strong></td><td>Pushing images requires Write; pulling only needs Read</td></tr>
</tbody>
</table>
</div><p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/create_tocken_docker.png" alt="Docker Hub — Creating a PAT with description &quot;github_actions&quot;, 30-day expiry, Read &amp; Write permissions" /></p>
<p><em>Fill in exactly as shown. The "Read &amp; Write" permission lets your pipeline push images to Docker Hub.</em></p>
<h3 id="heading-step-3-copy-and-save-the-token">Step 3: Copy and Save the Token</h3>
<p>After clicking <strong>Generate</strong>, Docker Hub shows your token <strong>once</strong>. Copy it immediately.</p>
<pre><code class="lang-text">Example: dckr_pat_ABC123DEF456GHI789JKL0MN
</code></pre>
<blockquote>
<p>⚠️ <strong>This token will never be shown again.</strong> If you lose it, you must generate a new one.</p>
</blockquote>
<p><strong>Why a Personal Access Token (PAT) over your password?</strong></p>
<ul>
<li><p>Can be <strong>revoked</strong> without changing your Docker Hub password</p>
</li>
<li><p>Can be <strong>scoped</strong> to specific permissions (Read, Write, Delete)</p>
</li>
<li><p>Can be <strong>audited</strong> — you see when it was last used</p>
</li>
<li><p>If compromised, your account password stays safe</p>
</li>
</ul>
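<p>If you want to test the token from a terminal before wiring up the pipeline, prefer <code>--password-stdin</code> so the secret never lands in your shell history or <code>ps</code> output. A sketch, with a hypothetical username and token:</p>
<pre><code class="lang-bash"># Hypothetical values — substitute your own username and PAT
DOCKERHUB_USERNAME="deviltalks"
DOCKERHUB_TOKEN="dckr_pat_ABC123DEF456GHI789JKL0MN"

# --password-stdin reads the secret from stdin instead of the command line
printf '%s' "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
</code></pre>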
<hr />
<h2 id="heading-4-github-repository-secrets">4. GitHub Repository Secrets</h2>
<p>Now we add all credentials to GitHub so the pipeline can use them securely.</p>
<h3 id="heading-how-to-add-a-secret">How to Add a Secret</h3>
<ol>
<li><p>Go to your GitHub repo → <strong>Settings</strong> → <strong>Secrets and variables</strong> → <strong>Actions</strong></p>
</li>
<li><p>Click <strong>New repository secret</strong></p>
</li>
<li><p>Enter the <strong>Name</strong> and <strong>Value</strong></p>
</li>
<li><p>Click <strong>Add secret</strong></p>
</li>
</ol>
<h3 id="heading-secrets-you-need-to-add">Secrets You Need to Add</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Secret Name</td><td>Value</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><code>DOCKERHUB_USERNAME</code></td><td>Your Docker Hub username (e.g., <code>deviltalks</code>)</td><td>Docker login</td></tr>
<tr>
<td><code>DOCKERHUB_TOKEN</code></td><td>The PAT from Step 3 above</td><td>Docker login</td></tr>
<tr>
<td><code>AWS_ROLE_ARN</code></td><td><code>arn:aws:iam::123456789012:role/github-actions-oidc-role</code></td><td>OIDC authentication</td></tr>
<tr>
<td><code>EC2_INSTANCE_ID</code></td><td><code>i-0xxxxxxxxxxxxxxxxx</code></td><td>SSM deployment target</td></tr>
</tbody>
</table>
</div><blockquote>
<p>💡 <strong>How GitHub Secrets work:</strong> Values are encrypted at rest using libsodium sealed boxes. They are <strong>never</strong> printed in logs — even if your pipeline does <code>echo ${{ secrets.DOCKERHUB_TOKEN }}</code>, it shows <code>***</code>. They cannot be read by forks or PRs from forks.</p>
</blockquote>
<hr />
<h2 id="heading-5-aws-infrastructure-setup">5. AWS Infrastructure Setup</h2>
<p>This is the part most tutorials skip. But in real life, <strong>infrastructure is where 90% of issues happen</strong>.</p>
<h3 id="heading-51-aws-oidc-configuration-no-static-access-keys">5.1 AWS OIDC Configuration (No Static Access Keys!)</h3>
<p>We do <strong>not</strong> use <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code>. Those are long-lived credentials — if they leak, an attacker has access until you manually rotate them. That's unacceptable.</p>
<p>Instead, we use <strong>OpenID Connect (OIDC)</strong>:</p>
<pre><code class="lang-text">GitHub Actions ──► (JWT token) ──► AWS STS ──► Temporary credentials (15 min)
</code></pre>
<p>AWS says: <em>"I trust GitHub Actions from <strong>this specific repo</strong> to assume <strong>this specific role</strong>, and only for a short time."</em></p>
<p><strong>Step-by-step setup:</strong></p>
<ol>
<li><p>Go to AWS IAM → <strong>Identity Providers</strong> → <strong>Add provider</strong></p>
</li>
<li><p>Select <strong>OpenID Connect</strong></p>
</li>
<li><p>Provider URL: <code>https://token.actions.githubusercontent.com</code></p>
</li>
<li><p>Audience: <code>sts.amazonaws.com</code></p>
</li>
<li><p>Click <strong>Add provider</strong></p>
</li>
</ol>
<p><strong>Then create the IAM Role:</strong></p>
<ol>
<li><p>Go to IAM → <strong>Roles</strong> → <strong>Create role</strong></p>
</li>
<li><p>Trusted entity type: <strong>Web identity</strong></p>
</li>
<li><p>Identity provider: <code>token.actions.githubusercontent.com</code></p>
</li>
<li><p>Audience: <code>sts.amazonaws.com</code></p>
</li>
<li><p>Click <strong>Next</strong></p>
</li>
<li><p>Attach permission: <code>AmazonSSMManagedInstanceCore</code></p>
</li>
<li><p>Role name: <code>github-actions-oidc-role</code></p>
</li>
<li><p>Click <strong>Create role</strong></p>
</li>
</ol>
<p><strong>Update the Trust Policy</strong> (IAM → Roles → your role → Trust relationships → Edit):</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Principal"</span>: {
        <span class="hljs-attr">"Federated"</span>: <span class="hljs-string">"arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"</span>
      },
      <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"sts:AssumeRoleWithWebIdentity"</span>,
      <span class="hljs-attr">"Condition"</span>: {
        <span class="hljs-attr">"StringEquals"</span>: {
          <span class="hljs-attr">"token.actions.githubusercontent.com:aud"</span>: <span class="hljs-string">"sts.amazonaws.com"</span>
        },
        <span class="hljs-attr">"StringLike"</span>: {
          <span class="hljs-attr">"token.actions.githubusercontent.com:sub"</span>: <span class="hljs-string">"repo:Push1697/devops-portfolio:*"</span>
        }
      }
    }
  ]
}
</code></pre>
<blockquote>
<p>⚠️ <strong>Critical:</strong> The <code>sub</code> condition locks this to your exact repository. Without it, <em>any</em> GitHub repo could assume your AWS role.</p>
</blockquote>
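<p>The <code>StringLike</code> condition behaves like a shell glob against the token's <code>sub</code> claim. A quick sketch with a few hypothetical claims shows what the pattern does and does not admit:</p>
<pre><code class="lang-bash">pattern="repo:Push1697/devops-portfolio:*"

for sub in \
  "repo:Push1697/devops-portfolio:ref:refs/heads/main" \
  "repo:Push1697/devops-portfolio:pull_request" \
  "repo:attacker/evil-repo:ref:refs/heads/main"
do
  # An unquoted $pattern in a case label is matched as a glob
  case "$sub" in
    $pattern) echo "ALLOW  $sub" ;;
    *)        echo "DENY   $sub" ;;
  esac
done
# The first two claims print ALLOW; the attacker repo prints DENY
</code></pre>
<p>If only <code>main</code> should be able to assume the role, tighten the pattern to <code>repo:Push1697/devops-portfolio:ref:refs/heads/main</code>.</p>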
<p>Copy the Role ARN (e.g., <code>arn:aws:iam::450070307294:role/github-actions-oidc-role</code>) → save it as the <code>AWS_ROLE_ARN</code> GitHub Secret.</p>
<h3 id="heading-52-ec2-instance-preparation">5.2 EC2 Instance Preparation</h3>
<p>Your EC2 deployment target needs four things:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Requirement</td><td>How to Set It Up</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Docker installed</strong></td><td><code>sudo yum install docker -y &amp;&amp; sudo systemctl enable --now docker</code></td></tr>
<tr>
<td><strong>SSM Agent running</strong></td><td>Pre-installed on Amazon Linux. Verify: <code>systemctl status amazon-ssm-agent</code></td></tr>
<tr>
<td><strong>IAM Instance Role</strong></td><td>Attach a role with <code>AmazonSSMManagedInstanceCore</code> policy (see below)</td></tr>
<tr>
<td><strong>Security Group</strong></td><td>Allow inbound TCP on port 5000 (<code>0.0.0.0/0</code> is fine for a demo; restrict the source CIDR in production)</td></tr>
</tbody>
</table>
</div><p><strong>How to attach</strong> <code>AmazonSSMManagedInstanceCore</code> to your EC2:</p>
<ol>
<li><p>Go to AWS EC2 → <strong>Instances</strong> → Select your instance</p>
</li>
<li><p>Look for <strong>IAM Role</strong> in the details panel</p>
</li>
<li><p>If no role is attached: <strong>Actions</strong> → <strong>Security</strong> → <strong>Modify IAM role</strong></p>
</li>
<li><p>Select a role that has <code>AmazonSSMManagedInstanceCore</code> attached (or create one)</p>
</li>
<li><p>Click <strong>Update IAM role</strong></p>
</li>
</ol>
<p><strong>To verify the role's permissions:</strong></p>
<ol>
<li><p>Go to IAM → <strong>Roles</strong> → Click the role name</p>
</li>
<li><p>Under <strong>Permissions</strong>, confirm <code>AmazonSSMManagedInstanceCore</code> is listed</p>
</li>
<li><p>If missing: <strong>Add permissions</strong> → <strong>Attach policies</strong> → Search for <code>AmazonSSMManagedInstanceCore</code></p>
</li>
</ol>
<blockquote>
<p>💡 <strong>Why SSM instead of SSH?</strong> With SSM Run Command, you don't need to open port 22 to the internet. No SSH keys to manage, no bastion hosts. All commands are logged in CloudTrail. It's the modern, secure way to run commands on EC2.</p>
</blockquote>
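<p>You can exercise the same mechanism by hand with the AWS CLI. A sketch (the instance ID is a placeholder — substitute your own, and valid AWS credentials are assumed):</p>
<pre><code class="lang-bash"># Placeholder instance ID — replace with yours
INSTANCE_ID="i-0xxxxxxxxxxxxxxxxx"

aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --comment "manual smoke test" \
  --parameters 'commands=["docker ps"]'
</code></pre>
<p>No port 22, no key pair — just an API call that shows up in CloudTrail.</p>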
<hr />
<h2 id="heading-6-the-pipeline-complete-ci-cd-pipelineyml-walkthrough">6. The Pipeline — Complete <code>ci-cd-pipeline.yml</code> Walkthrough</h2>
<p>This file lives at <code>.github/workflows/ci-cd-pipeline.yml</code>. Let's dissect <strong>every single line</strong>.</p>
<h3 id="heading-61-name-triggers-amp-permissions">6.1 Name, Triggers &amp; Permissions</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">name:</span> <span class="hljs-string">week-1</span> <span class="hljs-string">CI-CD</span> <span class="hljs-string">Pipeline</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">branches:</span> [<span class="hljs-string">"main"</span>, <span class="hljs-string">"develop"</span>]
  <span class="hljs-attr">pull_request:</span>
    <span class="hljs-attr">branches:</span> [<span class="hljs-string">"main"</span>]
  <span class="hljs-attr">workflow_dispatch:</span>

<span class="hljs-attr">permissions:</span>
  <span class="hljs-attr">contents:</span> <span class="hljs-string">read</span>
  <span class="hljs-attr">id-token:</span> <span class="hljs-string">write</span>
  <span class="hljs-attr">security-events:</span> <span class="hljs-string">write</span>
</code></pre>
<p><strong>Line-by-line:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Line</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>name: week-1 CI-CD Pipeline</code></td><td>Display name shown in GitHub Actions tab</td></tr>
<tr>
<td><code>on.push.branches: ["main", "develop"]</code></td><td>Runs on every push to <code>main</code> or <code>develop</code> branches</td></tr>
<tr>
<td><code>on.pull_request.branches: ["main"]</code></td><td>Runs when a PR is opened/updated against <code>main</code> — validates before merge</td></tr>
<tr>
<td><code>workflow_dispatch:</code></td><td>Adds a <strong>"Run workflow"</strong> button in GitHub UI for manual triggers</td></tr>
<tr>
<td><code>permissions.contents: read</code></td><td>Allows the workflow to read (checkout) your repository code</td></tr>
<tr>
<td><code>permissions.id-token: write</code></td><td><strong>Critical for OIDC</strong> — lets GitHub generate a JWT token for AWS auth</td></tr>
<tr>
<td><code>permissions.security-events: write</code></td><td>Required for CodeQL to upload scan results to GitHub's Security tab</td></tr>
</tbody>
</table>
</div><blockquote>
<p>⚠️ <strong>#1 mistake beginners make:</strong> Forgetting <code>id-token: write</code>. Without it, the OIDC handshake with AWS fails and the deploy job errors out with a cryptic permissions message.</p>
</blockquote>
<h3 id="heading-62-global-environment-variables">6.2 Global Environment Variables</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">env:</span>
  <span class="hljs-attr">NODE_VERSION:</span> <span class="hljs-string">"20"</span>
  <span class="hljs-attr">APP_DIR:</span> <span class="hljs-string">"week1-cicd"</span>
  <span class="hljs-attr">IMAGE_NAME:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.DOCKERHUB_USERNAME</span> <span class="hljs-string">}}/node-ci-demo</span>
  <span class="hljs-attr">CONTAINER_NAME:</span> <span class="hljs-string">"node-ci-demo"</span>
  <span class="hljs-attr">APP_PORT:</span> <span class="hljs-string">"5000"</span>
  <span class="hljs-attr">AWS_REGION:</span> <span class="hljs-string">"ap-south-1"</span>
  <span class="hljs-attr">AWS_ROLE_ARN:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.AWS_ROLE_ARN</span> <span class="hljs-string">}}</span>
  <span class="hljs-attr">EC2_INSTANCE_ID:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.EC2_INSTANCE_ID</span> <span class="hljs-string">}}</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Variable</td><td>Value</td><td>Why It's Here</td></tr>
</thead>
<tbody>
<tr>
<td><code>NODE_VERSION</code></td><td><code>"20"</code></td><td>Used by <code>setup-node</code> — change once, applies everywhere</td></tr>
<tr>
<td><code>APP_DIR</code></td><td><code>"week1-cicd"</code></td><td>Our app lives in a subdirectory, not the repo root</td></tr>
<tr>
<td><code>IMAGE_NAME</code></td><td><code>&lt;username&gt;/node-ci-demo</code></td><td>Full Docker image name — constructed from secrets</td></tr>
<tr>
<td><code>CONTAINER_NAME</code></td><td><code>"node-ci-demo"</code></td><td>Name of the Docker container on EC2</td></tr>
<tr>
<td><code>APP_PORT</code></td><td><code>"5000"</code></td><td>Port the app listens on</td></tr>
<tr>
<td><code>AWS_REGION</code></td><td><code>"ap-south-1"</code></td><td>Mumbai region — change to your deploy region</td></tr>
<tr>
<td><code>AWS_ROLE_ARN</code></td><td>From secrets</td><td>The IAM role ARN for OIDC auth</td></tr>
<tr>
<td><code>EC2_INSTANCE_ID</code></td><td>From secrets</td><td>Target EC2 instance for deployment</td></tr>
</tbody>
</table>
</div><blockquote>
<p>💡 <strong>DRY Principle:</strong> If you ever need to change the Node version, image name, or AWS region, you change it in <strong>one place</strong> — not scattered across five jobs. That single source of truth keeps even a complex pipeline easy to manage.</p>
</blockquote>
<h3 id="heading-63-concurrency-control">6.3 Concurrency Control</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">concurrency:</span>
  <span class="hljs-attr">group:</span> <span class="hljs-string">${{</span> <span class="hljs-string">github.workflow</span> <span class="hljs-string">}}-${{</span> <span class="hljs-string">github.ref</span> <span class="hljs-string">}}</span>
  <span class="hljs-attr">cancel-in-progress:</span> <span class="hljs-literal">true</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Keyword</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>group:</code></td><td>Creates a unique group per workflow + branch combo</td></tr>
<tr>
<td><code>cancel-in-progress</code></td><td>If you push 3 commits rapidly, only the <strong>latest</strong> pipeline runs — older ones are cancelled</td></tr>
</tbody>
</table>
</div><p><strong>Why this matters:</strong> This saves GitHub Actions minutes (and money). Without it, 3 rapid pushes = 3 concurrent pipelines fighting over resources.</p>
<h3 id="heading-64-job-1-build">6.4 Job 1: Build</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">build:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">code</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Node.js</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/setup-node@v4</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">node-version:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.NODE_VERSION</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">cache:</span> <span class="hljs-string">npm</span>
          <span class="hljs-attr">cache-dependency-path:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}/package-lock.json</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">dependencies</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">ci</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Validate</span> <span class="hljs-string">server</span> <span class="hljs-string">entry</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">node</span> <span class="hljs-string">-c</span> <span class="hljs-string">server.js</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Lint</span> <span class="hljs-string">(if</span> <span class="hljs-string">configured)</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">run</span> <span class="hljs-string">lint</span> <span class="hljs-string">--if-present</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>
</code></pre>
<p><strong>Step-by-step explanation:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Step</td><td>What It Does</td><td>Why</td></tr>
</thead>
<tbody>
<tr>
<td><code>actions/checkout@v4</code></td><td>Clones your repository into the GitHub runner</td><td>Without this, the runner VM has no code</td></tr>
<tr>
<td><code>actions/setup-node@v4</code></td><td>Installs Node.js 20 on the runner</td><td><code>cache: npm</code> reuses previously downloaded packages</td></tr>
<tr>
<td><code>cache-dependency-path</code></td><td>Points to <code>week1-cicd/package-lock.json</code></td><td>Tells the cache which lockfile to hash for cache invalidation</td></tr>
<tr>
<td><code>npm ci</code></td><td>Clean install from lockfile</td><td>Deterministic — fails if lockfile is missing or out of sync</td></tr>
<tr>
<td><code>node -c server.js</code></td><td><strong>Syntax check only</strong> — parses the file without executing it</td><td>Catches typos, missing brackets, or syntax errors instantly</td></tr>
<tr>
<td><code>npm run lint --if-present</code></td><td>Runs linting <strong>only if</strong> a <code>lint</code> script exists in <code>package.json</code></td><td><code>--if-present</code> prevents failure if no linter is configured</td></tr>
</tbody>
</table>
</div><p><strong>What is</strong> <code>runs-on: ubuntu-latest</code>? This tells GitHub to run this job on a fresh Ubuntu virtual machine. Each job gets its own clean VM — nothing carries over between jobs unless you explicitly share artifacts.</p>
<p><strong>What is</strong> <code>working-directory</code>? Our app lives inside <code>week1-cicd/</code>, not the repo root. This keyword tells each step to <code>cd</code> into that folder before running the command.</p>
<p><strong>Why</strong> <code>npm ci</code> instead of <code>npm install</code>?</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><code>npm install</code></td><td><code>npm ci</code></td></tr>
</thead>
<tbody>
<tr>
<td>Uses <code>package-lock.json</code></td><td>Optional</td><td><strong>Required</strong></td></tr>
<tr>
<td>Modifies lockfile</td><td>Yes</td><td>Never</td></tr>
<tr>
<td>Deletes <code>node_modules</code> first</td><td>No</td><td><strong>Yes</strong></td></tr>
<tr>
<td>Deterministic</td><td>No</td><td><strong>Yes</strong></td></tr>
<tr>
<td>Speed</td><td>Slower</td><td><strong>Faster</strong></td></tr>
</tbody>
</table>
</div><h3 id="heading-65-job-2-test">6.5 Job 2: Test</h3>
<pre><code class="lang-yaml">  <span class="hljs-attr">test:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">build</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">code</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Node.js</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/setup-node@v4</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">node-version:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.NODE_VERSION</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">cache:</span> <span class="hljs-string">npm</span>
          <span class="hljs-attr">cache-dependency-path:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}/package-lock.json</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">dependencies</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">ci</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">tests</span> <span class="hljs-string">(if</span> <span class="hljs-string">configured)</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">test</span> <span class="hljs-string">--if-present</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>
</code></pre>
<p><strong>Key keyword —</strong> <code>needs: build</code>:</p>
<p>This creates a <strong>dependency chain</strong>. The <code>test</code> job waits for <code>build</code> to succeed before starting. If the build fails, test never runs — saving compute time and money.</p>
<blockquote>
<p>💡 <strong>Why does each job re-checkout and re-install?</strong> Each GitHub Actions job runs on a <strong>separate VM</strong>. The VM from the Build job is destroyed when it finishes. So Test needs its own checkout and install. This is by design — isolation prevents contamination between jobs.</p>
</blockquote>
<p><code>npm test --if-present</code>: Runs the test script if it exists in <code>package.json</code>, otherwise skips gracefully. This is useful in early-stage projects where tests haven't been written yet.</p>
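<p>For <code>--if-present</code> to do anything useful later, the scripts just need to exist in <code>package.json</code>. A minimal sketch — the script bodies are placeholders, so swap in whatever test runner and linter you actually use:</p>
<pre><code class="lang-json">{
  "name": "node-ci-demo",
  "scripts": {
    "start": "node server.js",
    "lint": "eslint .",
    "test": "jest"
  }
}
</code></pre>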
<h3 id="heading-66-job-3-security-defense-in-depth">6.6 Job 3: Security (Defense in Depth)</h3>
<pre><code class="lang-yaml">  <span class="hljs-attr">security:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Security</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">build</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span> <span class="hljs-string">code</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v4</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Setup</span> <span class="hljs-string">Node.js</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/setup-node@v4</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">node-version:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.NODE_VERSION</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">cache:</span> <span class="hljs-string">npm</span>
          <span class="hljs-attr">cache-dependency-path:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}/package-lock.json</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Install</span> <span class="hljs-string">dependencies</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">ci</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Run</span> <span class="hljs-string">npm</span> <span class="hljs-string">audit</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">npm</span> <span class="hljs-string">audit</span> <span class="hljs-string">--audit-level=moderate</span> <span class="hljs-string">||</span> <span class="hljs-literal">true</span>
        <span class="hljs-attr">working-directory:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">CodeQL</span> <span class="hljs-string">init</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">github/codeql-action/init@v4</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">languages:</span> <span class="hljs-string">javascript</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">CodeQL</span> <span class="hljs-string">analyze</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">github/codeql-action/analyze@v4</span>
</code></pre>
<p><strong>Notice:</strong> Both <code>test</code> and <code>security</code> have <code>needs: build</code>. They are independent of each other, so <strong>they run in parallel</strong>. This is visible in the pipeline visualization above.</p>
<p>We use <strong>three layers of security scanning</strong> (Defense in Depth):</p>
<pre><code class="lang-text">Layer 1: npm audit       → Known vulnerabilities in npm packages
Layer 2: GitHub CodeQL   → Static analysis of YOUR code (injections, logic bugs)
Layer 3: Trivy           → OS-level CVEs in the Docker image (runs in the Docker job)
</code></pre>
<p><strong>Keyword breakdown:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Keyword / Flag</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>npm audit --audit-level=moderate</code></td><td>Only flag vulnerabilities rated moderate or higher</td></tr>
<tr>
<td>`</td><td></td><td>true`</td><td>Don't fail the job on audit warnings — log them but continue</td></tr>
<tr>
<td><code>codeql-action/init</code></td><td>Downloads and initializes the CodeQL analysis engine for JavaScript</td></tr>
<tr>
<td><code>codeql-action/analyze</code></td><td>Runs the actual scan and uploads results to GitHub's Security tab</td></tr>
</tbody>
</table>
</div><blockquote>
<p>💡 <strong>Why</strong> <code>|| true</code> on npm audit? In a real production pipeline, you might want <code>npm audit</code> to fail the build. Here we use <code>|| true</code> so advisory-level warnings don't block deploys during development. In stricter environments, remove the <code>|| true</code>.</p>
</blockquote>
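<p>The mechanics are plain shell: Actions runs <code>run</code> steps with fail-fast bash (<code>-e</code>), so any non-zero exit kills the step unless something after <code>||</code> rescues it:</p>
<pre><code class="lang-bash"># `false` exits 1; `|| true` replaces that with exit 0
false || true
echo "status: $?"    # prints: status: 0

# Without the rescue, the fail-fast shell would abort the step
# at the failing command.
</code></pre>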
<h3 id="heading-67-job-4-docker-build-scan-amp-push">6.7 Job 4: Docker Build, Scan &amp; Push</h3>
<p>This is the longest job. Let's break it down step by step.</p>
<pre><code class="lang-yaml">  <span class="hljs-attr">docker:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Docker</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">needs:</span> [<span class="hljs-string">test</span>, <span class="hljs-string">security</span>]
    <span class="hljs-attr">if:</span> <span class="hljs-string">github.event_name</span> <span class="hljs-string">==</span> <span class="hljs-string">'push'</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">github.ref</span> <span class="hljs-string">==</span> <span class="hljs-string">'refs/heads/main'</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Keyword</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>needs: [test, security]</code></td><td>Waits for <strong>both</strong> Test and Security to pass before starting</td></tr>
<tr>
<td><code>if: github.event_name == 'push'</code></td><td>Only runs on direct pushes, <strong>not</strong> on pull requests</td></tr>
<tr>
<td><code>github.ref == 'refs/heads/main'</code></td><td>Only runs on the <code>main</code> branch — feature branches never push images</td></tr>
</tbody>
</table>
</div><p><strong>Step 1 → Extract Docker Metadata (Smart Tagging):</strong></p>
<pre><code class="lang-yaml">      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Extract</span> <span class="hljs-string">Docker</span> <span class="hljs-string">metadata</span>
        <span class="hljs-attr">id:</span> <span class="hljs-string">meta</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/metadata-action@v5</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">images:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.IMAGE_NAME</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">tags:</span> <span class="hljs-string">|
            type=ref,event=branch
            type=ref,event=pr
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}</span>
</code></pre>
<p>This automatically generates Docker tags for your image:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Tag Rule</td><td>Example Output</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td><code>type=ref,event=branch</code></td><td><code>main</code></td><td>Tags with branch name</td></tr>
<tr>
<td><code>type=ref,event=pr</code></td><td><code>pr-42</code></td><td>Tags for pull requests</td></tr>
<tr>
<td><code>type=sha,prefix={{branch}}-</code></td><td><code>main-c260cb1</code></td><td>Branch + short commit SHA</td></tr>
<tr>
<td><code>type=raw,value=latest</code></td><td><code>latest</code></td><td>Only on the default branch</td></tr>
</tbody>
</table>
</div><p>The SHA tag (<code>main-c260cb1</code>) is critical — it lets you trace any running container back to the <strong>exact Git commit</strong> that built it.</p>
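<p>As a quick sketch of that traceability (tag value hypothetical, matching the format above), plain shell is enough to recover the commit:</p>
<pre><code class="lang-bash"># A tag produced by the metadata step has the form &lt;branch&gt;-&lt;shortsha&gt;
TAG="main-c260cb1"        # e.g. read from `docker ps` or the Docker Hub UI
SHA="${TAG#main-}"        # strip the branch prefix -&gt; c260cb1

# The exact commit that built the running image:
echo "git show $SHA"
</code></pre>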
<p><strong>Step 2 → Docker Hub Login:</strong></p>
<pre><code class="lang-yaml">      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Log</span> <span class="hljs-string">in</span> <span class="hljs-string">to</span> <span class="hljs-string">Docker</span> <span class="hljs-string">Hub</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/login-action@v3</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">username:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.DOCKERHUB_USERNAME</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">password:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.DOCKERHUB_TOKEN</span> <span class="hljs-string">}}</span>
</code></pre>
<p>This authenticates to Docker Hub using the secrets we configured earlier. The <code>password</code> field uses the PAT (Personal Access Token), <strong>not</strong> your actual Docker Hub password.</p>
<p><strong>Step 3 → Set Up Docker Buildx:</strong></p>
<pre><code class="lang-yaml">      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Set</span> <span class="hljs-string">up</span> <span class="hljs-string">Docker</span> <span class="hljs-string">Buildx</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/setup-buildx-action@v3</span>
</code></pre>
<p><strong>What is Buildx?</strong> It's Docker's BuildKit-backed extended builder. It enables multi-platform builds and, critically, <strong>cache export and import</strong>, including the GitHub Actions cache backend. Without Buildx, the <code>type=gha</code> <code>cache-from</code> / <code>cache-to</code> backend used in the next step isn't available.</p>
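<p>Outside GitHub Actions, where the <code>type=gha</code> backend isn't available, the same caching idea can be approximated locally with a cache directory (paths here are illustrative, not part of the pipeline):</p>
<pre><code class="lang-bash"># Sketch: reuse unchanged layers between local builds via a cache dir
docker buildx build \
  --cache-from type=local,src=/tmp/.buildx-cache \
  --cache-to type=local,dest=/tmp/.buildx-cache,mode=max \
  -t node-ci-demo:dev .
</code></pre>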
<p><strong>Step 4 → Build and Push:</strong></p>
<pre><code class="lang-yaml">      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">and</span> <span class="hljs-string">push</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">docker/build-push-action@v6</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">context:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">file:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.APP_DIR</span> <span class="hljs-string">}}/Dockerfile</span>
          <span class="hljs-attr">push:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">tags:</span> <span class="hljs-string">${{</span> <span class="hljs-string">steps.meta.outputs.tags</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">labels:</span> <span class="hljs-string">${{</span> <span class="hljs-string">steps.meta.outputs.labels</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">cache-from:</span> <span class="hljs-string">type=gha</span>
          <span class="hljs-attr">cache-to:</span> <span class="hljs-string">type=gha,mode=max</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Keyword</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>context: ${{ env.APP_DIR }}</code></td><td>Build context is the <code>week1-cicd/</code> directory</td></tr>
<tr>
<td><code>file: ${{ env.APP_DIR }}/Dockerfile</code></td><td>Path to our multi-stage Dockerfile</td></tr>
<tr>
<td><code>push: true</code></td><td>Pushes the built image to Docker Hub</td></tr>
<tr>
<td><code>tags: ${{ steps.meta.outputs.tags }}</code></td><td>Uses tags generated by the metadata step</td></tr>
<tr>
<td><code>cache-from: type=gha</code></td><td><strong>Pulls cached layers</strong> from GitHub Actions cache</td></tr>
<tr>
<td><code>cache-to: type=gha,mode=max</code></td><td><strong>Saves all layers</strong> to cache for future builds</td></tr>
</tbody>
</table>
</div><blockquote>
<p>💡 <strong>Why caching matters:</strong> Without caching, Docker rebuilds every layer from scratch (~2-3 minutes). With <code>type=gha</code> caching, unchanged layers are reused — builds drop to ~30 seconds.</p>
</blockquote>
<p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/docker_image_with_tags.png" alt="Docker Hub — Repository showing tags main-c260cb1 and latest, 51.6 MB, 121 pulls" /></p>
<p><em>Docker Hub showing our pushed image with 5 tags. The</em> <code>main-c260cb1</code> tag maps to a specific Git commit. Repository size: 51.6 MB (thanks to our distroless multi-stage build).</p>
<p><strong>Step 5 → Trivy Container Scan:</strong></p>
<pre><code class="lang-yaml">      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Trivy</span> <span class="hljs-string">scan</span> <span class="hljs-string">(optional)</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">aquasecurity/trivy-action@master</span>
        <span class="hljs-attr">continue-on-error:</span> <span class="hljs-literal">true</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">image-ref:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.IMAGE_NAME</span> <span class="hljs-string">}}:latest</span>
          <span class="hljs-attr">format:</span> <span class="hljs-string">sarif</span>
          <span class="hljs-attr">output:</span> <span class="hljs-string">trivy-results.sarif</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Upload</span> <span class="hljs-string">Trivy</span> <span class="hljs-string">results</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">github/codeql-action/upload-sarif@v3</span>
        <span class="hljs-attr">if:</span> <span class="hljs-string">always()</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">sarif_file:</span> <span class="hljs-string">trivy-results.sarif</span>
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Keyword</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td><code>continue-on-error: true</code></td><td>Trivy findings don't block the pipeline — they're informational</td></tr>
<tr>
<td><code>format: sarif</code></td><td>SARIF is a standard format that GitHub understands for its Security tab</td></tr>
<tr>
<td><code>if: always()</code></td><td>Runs the upload even if the scan step errored or an earlier step failed</td></tr>
</tbody>
</table>
</div><p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/build_and_security_scan_logs.png" alt="Build logs showing Docker buildx command with cache flags, metadata labels, and an error: &quot;invalid tag node-ci-demo:main&quot;" /></p>
<p><em>An actual failed Docker build from our pipeline. The error</em> <code>invalid tag "/node-ci-demo:main"</code> happened because <code>DOCKERHUB_USERNAME</code> was empty — the secret wasn't set yet. After adding the secret, this was resolved.</p>
<h3 id="heading-68-job-5-deploy-with-automated-rollback">6.8 Job 5: Deploy with Automated Rollback</h3>
<pre><code class="lang-yaml">  <span class="hljs-attr">deploy:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">needs:</span> <span class="hljs-string">docker</span>
    <span class="hljs-attr">if:</span> <span class="hljs-string">github.event_name</span> <span class="hljs-string">==</span> <span class="hljs-string">'push'</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">github.ref</span> <span class="hljs-string">==</span> <span class="hljs-string">'refs/heads/main'</span>

    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Configure</span> <span class="hljs-string">AWS</span> <span class="hljs-string">credentials</span> <span class="hljs-string">(OIDC)</span>
        <span class="hljs-attr">uses:</span> <span class="hljs-string">aws-actions/configure-aws-credentials@v4</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">role-to-assume:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.AWS_ROLE_ARN</span> <span class="hljs-string">}}</span>
          <span class="hljs-attr">aws-region:</span> <span class="hljs-string">${{</span> <span class="hljs-string">env.AWS_REGION</span> <span class="hljs-string">}}</span>

      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Deploy</span> <span class="hljs-string">via</span> <span class="hljs-string">SSM</span> <span class="hljs-string">Run</span> <span class="hljs-string">Command</span>
        <span class="hljs-attr">run:</span> <span class="hljs-string">|
          aws ssm send-command \
            --document-name "AWS-RunShellScript" \
            --targets "Key=instanceids,Values=${{ env.EC2_INSTANCE_ID }}" \
            --parameters 'commands=["set -e","CURRENT_IMAGE=$(docker inspect -f {{.Image}} ${{ env.CONTAINER_NAME }} 2&gt;/dev/null || echo none)","docker pull ${{ env.IMAGE_NAME }}:latest","docker stop ${{ env.CONTAINER_NAME }} || true","docker rm ${{ env.CONTAINER_NAME }} || true","if docker run -d --name ${{ env.CONTAINER_NAME }} -p ${{ env.APP_PORT }}:5000 ${{ env.IMAGE_NAME }}:latest; then echo Deploy succeeded; else echo Deploy failed, rolling back; [ \"$CURRENT_IMAGE\" != none ] &amp;&amp; docker run -d --name ${{ env.CONTAINER_NAME }} -p ${{ env.APP_PORT }}:5000 $CURRENT_IMAGE; exit 1; fi"]' \
            --comment "Deploy node-ci-demo" \
            --region ${{ env.AWS_REGION }}</span>
</code></pre>
<p><strong>Step 1 — OIDC Authentication:</strong></p>
<p>The <code>aws-actions/configure-aws-credentials</code> action:</p>
<ol>
<li><p>Requests a JWT token from GitHub's OIDC provider</p>
</li>
<li><p>Sends it to AWS STS (Security Token Service)</p>
</li>
<li><p>AWS validates the token against the trust policy</p>
</li>
<li><p>Returns temporary credentials (valid for ~15 minutes)</p>
</li>
</ol>
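<p>For context, the trust-policy statement that step 3 validates against looks roughly like this (account ID is a placeholder; adjust to your own account and repo):</p>
<pre><code class="lang-json">{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::&lt;ACCOUNT_ID&gt;:oidc-provider/token.actions.githubusercontent.com"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
    "StringLike": { "token.actions.githubusercontent.com:sub": "repo:Push1697/devops-portfolio:*" }
  }
}
</code></pre>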
<p><strong>Step 2 — SSM Run Command (The Deployment Script):</strong></p>
<p>Here's what the script does, broken into readable steps:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. Strict mode — exit immediately if any command fails</span>
<span class="hljs-built_in">set</span> -e

<span class="hljs-comment"># 2. Save the currently running image hash (for rollback)</span>
CURRENT_IMAGE=$(docker inspect -f <span class="hljs-string">'{{.Image}}'</span> node-ci-demo 2&gt;/dev/null || <span class="hljs-built_in">echo</span> none)

<span class="hljs-comment"># 3. Pull the latest image from Docker Hub</span>
docker pull deviltalks/node-ci-demo:latest

<span class="hljs-comment"># 4. Stop and remove the old container (ignore errors if it doesn't exist)</span>
docker stop node-ci-demo || <span class="hljs-literal">true</span>
docker rm node-ci-demo || <span class="hljs-literal">true</span>

<span class="hljs-comment"># 5. Start the new container — if it fails, rollback!</span>
<span class="hljs-keyword">if</span> docker run -d --name node-ci-demo -p 5000:5000 deviltalks/node-ci-demo:latest; <span class="hljs-keyword">then</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"✅ Deploy succeeded"</span>
<span class="hljs-keyword">else</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"❌ Deploy failed — rolling back!"</span>
  <span class="hljs-comment"># Restore the previous working image</span>
  <span class="hljs-keyword">if</span> [ <span class="hljs-string">"<span class="hljs-variable">$CURRENT_IMAGE</span>"</span> != <span class="hljs-string">"none"</span> ]; <span class="hljs-keyword">then</span>
    docker run -d --name node-ci-demo -p 5000:5000 <span class="hljs-variable">$CURRENT_IMAGE</span>
  <span class="hljs-keyword">fi</span>
  <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>
</code></pre>
<p><strong>What makes this deployment script production-grade:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>How We Do It</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Image hash capture</strong></td><td>Saves <code>CURRENT_IMAGE</code> before doing anything destructive</td></tr>
<tr>
<td><strong>Automatic rollback</strong></td><td>If the new container fails to start, the previous image is restored</td></tr>
<tr>
<td><strong>Hands-off recovery</strong></td><td>Rollback needs no manual intervention (only a brief restart window)</td></tr>
<tr>
<td><code>set -e</code></td><td>Any command failure stops the script — no partial deploys</td></tr>
<tr>
<td><code>|| true</code> on stop/rm</td><td>Gracefully handles the first-ever deployment (no container to stop)</td></tr>
</tbody>
</table>
</div><p><img src="https://github.com/Push1697/devops-portfolio/raw/main/week1-cicd/assets/ssl_deploy_job_logs.png" alt="SSM Run Command deployment logs showing the full AWS SSM send-command JSON response with CommandId, DocumentName, masked secrets, and deployment script commands" /></p>
<p><em>AWS Systems Manager executing our deployment script via SSM Run Command. You can see the full command JSON response including the</em> <code>CommandId</code>, the deployment commands with masked secrets (<code>***</code>), the target EC2 instance, and the "Pending" status. Notice how <code>IMAGE_NAME</code> and <code>AWS_ROLE_ARN</code> are masked → GitHub never exposes secrets in logs.</p>
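<p>Note that <code>aws ssm send-command</code> is asynchronous: the workflow step succeeds as soon as the command is <em>accepted</em>, not when the script finishes. To inspect the actual outcome afterwards, one option is (the <code>CommandId</code> placeholder comes from the JSON response above):</p>
<pre><code class="lang-bash">aws ssm list-command-invocations \
  --command-id "&lt;CommandId&gt;" \
  --details \
  --region "$AWS_REGION" \
  --query 'CommandInvocations[].{Status:Status,Output:CommandPlugins[0].Output}'
</code></pre>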
<hr />
<h2 id="heading-7-branch-protection-amp-governance">7. Branch Protection &amp; Governance</h2>
<p>A pipeline is only as good as the rules protecting it. Without branch protection, anyone with repo access could push directly to <code>main</code> and bypass all checks.</p>
<h3 id="heading-setting-up-branch-protection-rules">Setting Up Branch Protection Rules</h3>
<p>Go to GitHub → <strong>Settings</strong> → <strong>Branches</strong> → <strong>Add rule</strong> → Branch name pattern: <code>main</code></p>
<p>Enable these settings:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Setting</td><td>What It Does</td></tr>
</thead>
<tbody>
<tr>
<td>✅ <strong>Require pull request before merging</strong></td><td>Forces code review — prevents accidental pushes</td></tr>
<tr>
<td>✅ <strong>Require status checks to pass</strong></td><td>Blocks merge if Build / Test / Security / Docker fails</td></tr>
<tr>
<td>✅ <strong>Require code reviews</strong> (1+ approver)</td><td>At least one peer must approve the PR</td></tr>
<tr>
<td>✅ <strong>Dismiss stale pull request approvals</strong></td><td>Re-review required if new commits are added after approval</td></tr>
<tr>
<td>✅ <strong>Require branches to be up to date</strong></td><td>Ensures checks ran against the latest <code>main</code>, not a stale base</td></tr>
<tr>
<td>✅ <strong>Restrict who can push</strong></td><td>Only specific users/teams can bypass (rarely used)</td></tr>
</tbody>
</table>
</div><p><strong>What happens when someone tries to push directly to</strong> <code>main</code>:</p>
<pre><code class="lang-text">$ git push origin main
remote: error: GH006: Protected branch update failed
remote: error: Required status check "Build" is expected
remote: error: At least 1 approving review is required
</code></pre>
<blockquote>
<p>💡 <strong>Without branch protection:</strong> Any developer with push access can push broken code directly to production. <strong>With it:</strong> Every change must pass CI checks AND be reviewed by a peer before merging.</p>
</blockquote>
<h3 id="heading-skipping-ci-when-you-need-to">Skipping CI (When You Need To)</h3>
<p>Sometimes you push documentation changes or <code>.md</code> edits that don't need a full pipeline run. Add <code>[skip ci]</code> to your commit message:</p>
<pre><code class="lang-bash">git commit -m <span class="hljs-string">"docs: update README formatting [skip ci]"</span>
git push origin main
</code></pre>
<p>GitHub Actions recognizes <code>[skip ci]</code>, <code>[ci skip]</code>, <code>[no ci]</code>, or <code>[skip actions]</code> in the commit message and <strong>will not trigger any workflows</strong> for that push.</p>
<blockquote>
<p>⚠️ <strong>Use sparingly.</strong> Only skip CI for documentation-only changes. Never skip CI for code changes — that defeats the entire purpose of the pipeline.</p>
</blockquote>
<h3 id="heading-the-complete-governance-model">The Complete Governance Model</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Protection Layer</td><td>What It Prevents</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Branch protection</strong></td><td>Direct pushes to <code>main</code> → forces PR + review</td></tr>
<tr>
<td><strong>PR status checks</strong></td><td>Merging PRs with failing Build / Test / Security</td></tr>
<tr>
<td><strong>OIDC (no static keys)</strong></td><td>Leaked AWS credentials → tokens auto-expire in 15 min</td></tr>
<tr>
<td><strong>GitHub Secrets</strong></td><td>Credentials in code → encrypted, masked in logs, no fork access</td></tr>
<tr>
<td><strong>npm audit + CodeQL</strong></td><td>Vulnerable dependencies and code-level security flaws</td></tr>
<tr>
<td><strong>Trivy</strong></td><td>OS-level vulnerabilities in the Docker image</td></tr>
<tr>
<td><strong>Automated rollback</strong></td><td>Failed deployments staying live → auto-restores previous version</td></tr>
<tr>
<td><strong>Audit trail</strong></td><td>Untraceable changes → every commit, PR, and deploy is logged</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-8-troubleshooting-every-error-we-hit">8. Troubleshooting — Every Error We Hit</h2>
<p>These aren't hypothetical — these are errors we <strong>actually encountered</strong> while building this pipeline. Every fix is documented.</p>
<h3 id="heading-error-1-npm-ci-fails-package-lockjson-not-found">❌ Error 1: <code>npm ci</code> fails — "package-lock.json not found"</h3>
<pre><code class="lang-text">npm ERR! The `npm ci` command can only install with an existing package-lock.json
</code></pre>
<p><strong>Root cause:</strong> You ran <code>npm install</code> locally but never committed <code>package-lock.json</code>.</p>
<p><strong>Fix:</strong></p>
<pre><code class="lang-bash">npm install                <span class="hljs-comment"># generates package-lock.json</span>
git add package-lock.json
git commit -m <span class="hljs-string">"chore: add lockfile for CI reproducibility"</span>
git push origin main
</code></pre>
<hr />
<h3 id="heading-error-2-docker-push-denied-requested-access-to-the-resource-is-denied">❌ Error 2: Docker push — "denied: requested access to the resource is denied"</h3>
<pre><code class="lang-text">ERROR: denied: requested access to the resource is denied
</code></pre>
<p><strong>Root cause:</strong> One (or more) of these:</p>
<ul>
<li><p><code>DOCKERHUB_TOKEN</code> is your password, not a PAT</p>
</li>
<li><p>PAT lacks Write permission</p>
</li>
<li><p><code>DOCKERHUB_USERNAME</code> doesn't match the image name prefix (e.g., image is <code>deviltalks/node-ci-demo</code> but username is <code>deviltalks2</code>)</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<ol>
<li><p>Go to Docker Hub → Account Settings → Security → Delete old token</p>
</li>
<li><p>Generate a new PAT with <strong>Read &amp; Write</strong> permissions</p>
</li>
<li><p>Update the <code>DOCKERHUB_TOKEN</code> secret in GitHub</p>
</li>
<li><p>Verify <code>IMAGE_NAME</code> starts with your exact Docker Hub username</p>
</li>
</ol>
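<p>Before re-running the pipeline, you can sanity-check the new PAT locally (username as used throughout this guide):</p>
<pre><code class="lang-bash"># --password-stdin keeps the token out of shell history
echo "$DOCKERHUB_TOKEN" | docker login -u deviltalks --password-stdin
docker push deviltalks/node-ci-demo:latest   # fails fast if the PAT lacks Write scope
</code></pre>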
<hr />
<h3 id="heading-error-3-docker-build-invalid-tag-invalid-reference-format">❌ Error 3: Docker build — "invalid tag: invalid reference format"</h3>
<pre><code class="lang-text">ERROR: failed to build: invalid tag "/node-ci-demo:main": invalid reference format
</code></pre>
<p><strong>Root cause:</strong> The <code>DOCKERHUB_USERNAME</code> secret is <strong>empty or not set</strong>. The <code>IMAGE_NAME</code> env var becomes <code>/node-ci-demo</code> (leading slash) instead of <code>deviltalks/node-ci-demo</code>.</p>
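<p>The failure mode is easy to reproduce in plain shell, using the same string concatenation the workflow performs:</p>
<pre><code class="lang-bash">DOCKERHUB_USERNAME=""                             # secret missing -&gt; empty expansion
IMAGE_NAME="${DOCKERHUB_USERNAME}/node-ci-demo"   # same pattern as the workflow env
echo "${IMAGE_NAME}:main"                         # prints /node-ci-demo:main (an invalid reference)
</code></pre>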
<p><strong>Fix:</strong></p>
<ol>
<li><p>Go to GitHub → Settings → Secrets → Actions</p>
</li>
<li><p>Verify <code>DOCKERHUB_USERNAME</code> exists and has a value</p>
</li>
<li><p>Re-trigger the pipeline:</p>
</li>
</ol>
<pre><code class="lang-bash">git commit --allow-empty -m <span class="hljs-string">"retry: fix docker username secret"</span>
git push origin main
</code></pre>
<blockquote>
<p>💡 This error is visible in the <a class="post-section-overview" href="#67-job-4-docker-build-scan--push">build logs screenshot</a> above — line 222 shows the exact error.</p>
</blockquote>
<hr />
<h3 id="heading-error-4-deploy-fails-ssm-command-failed">❌ Error 4: Deploy fails — SSM "Command failed"</h3>
<p><strong>Root cause:</strong> Usually one of:</p>
<ul>
<li><p>SSM Agent is stopped on EC2</p>
</li>
<li><p>EC2 instance doesn't have the <code>AmazonSSMManagedInstanceCore</code> IAM policy</p>
</li>
<li><p>Port 5000 is blocked in Security Group</p>
</li>
<li><p>Instance is in a different region than <code>AWS_REGION</code></p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># On the EC2 instance — restart SSM Agent:</span>
sudo systemctl restart amazon-ssm-agent
sudo systemctl status amazon-ssm-agent

<span class="hljs-comment"># Verify Docker is running:</span>
sudo systemctl status docker
</code></pre>
<p>In AWS Console:</p>
<ol>
<li><p><strong>IAM:</strong> Verify the EC2 instance role has <code>AmazonSSMManagedInstanceCore</code> attached</p>
</li>
<li><p><strong>EC2 → Security Groups:</strong> Edit inbound rules → Add TCP 5000 (ideally from your own IP; <code>0.0.0.0/0</code> only for a short-lived demo)</p>
</li>
<li><p><strong>Verify region:</strong> Make sure <code>AWS_REGION</code> in your workflow matches the instance's actual region</p>
</li>
</ol>
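<p>You can also confirm the instance is registered with SSM at all (instance ID is a placeholder):</p>
<pre><code class="lang-bash">aws ssm describe-instance-information \
  --filters "Key=InstanceIds,Values=i-0123456789abcdef0" \
  --query 'InstanceInformationList[].{Id:InstanceId,Ping:PingStatus,Agent:AgentVersion}'
</code></pre>
<p>If the instance doesn't appear in this output, SSM Run Command can never reach it, regardless of the workflow configuration.</p>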
<hr />
<h3 id="heading-error-5-oidc-fails-not-authorized-to-perform-stsassumerolewithwebidentity">❌ Error 5: OIDC fails — "Not authorized to perform sts:AssumeRoleWithWebIdentity"</h3>
<pre><code class="lang-text">Error: Not authorized to perform sts:AssumeRoleWithWebIdentity
</code></pre>
<p><strong>Root cause:</strong> The <code>sub</code> claim in your IAM trust policy doesn't match your repo.</p>
<p><strong>Fix:</strong></p>
<ol>
<li>Verify the trust policy has the correct repo:</li>
</ol>
<pre><code class="lang-json"><span class="hljs-string">"token.actions.githubusercontent.com:sub"</span>: <span class="hljs-string">"repo:Push1697/devops-portfolio:*"</span>
</code></pre>
<ol start="2">
<li><p>Verify <code>permissions.id-token: write</code> is set in your workflow</p>
</li>
<li><p>Verify the OIDC Identity Provider exists in IAM with the correct audience (<code>sts.amazonaws.com</code>)</p>
</li>
</ol>
<hr />
<h3 id="heading-error-6-docker-login-fails-username-required">❌ Error 6: Docker login fails — "Username required"</h3>
<pre><code class="lang-text">Run docker/login-action@v3
Error: Username required
</code></pre>
<p><strong>Root cause:</strong> Secret name in the workflow doesn't match what's stored in GitHub. This is case-sensitive!</p>
<p><strong>Fix:</strong></p>
<ol>
<li>Check your workflow file:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-comment"># ✅ Correct — matches the secret name exactly</span>
<span class="hljs-attr">password:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.DOCKERHUB_TOKEN</span> <span class="hljs-string">}}</span>

<span class="hljs-comment"># ❌ Wrong — different secret name</span>
<span class="hljs-attr">password:</span> <span class="hljs-string">${{</span> <span class="hljs-string">secrets.DOCKER_SECRET_KEY</span> <span class="hljs-string">}}</span>
</code></pre>
<ol start="2">
<li>In GitHub Secrets, verify the exact names: <code>DOCKERHUB_USERNAME</code> and <code>DOCKERHUB_TOKEN</code></li>
</ol>
<hr />
<h3 id="heading-error-7-trivy-upload-fails-path-does-not-exist-trivy-resultssarif">❌ Error 7: Trivy upload fails — "Path does not exist: trivy-results.sarif"</h3>
<pre><code class="lang-text">Error: Path does not exist: trivy-results.sarif
</code></pre>
<p><strong>Root cause:</strong> The Trivy scan step was skipped or failed (often because the Docker build failed first), so no SARIF file was generated.</p>
<p><strong>Fix:</strong> This is usually a downstream effect of another error. Fix the Docker build first, and the Trivy upload will work. The <code>if: always()</code> on the upload step ensures it runs even when Trivy fails, but it can't upload a file that doesn't exist.</p>
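<p>If you want the upload step to stop erroring entirely, one option is to guard it on the file actually existing, using the built-in <code>hashFiles()</code> expression (a sketch; adapt to your workflow):</p>
<pre><code class="lang-yaml">      - name: Upload Trivy results
        uses: github/codeql-action/upload-sarif@v3
        if: always() &amp;&amp; hashFiles('trivy-results.sarif') != ''
        with:
          sarif_file: trivy-results.sarif
</code></pre>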
<hr />
<h2 id="heading-9-summary">9. Summary</h2>
<p>This isn't a toy pipeline. It's a <strong>production-grade delivery system</strong> that handles:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pillar</td><td>How We Address It</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Security</strong></td><td>OIDC (no static keys), GitHub Secrets, npm audit, CodeQL, Trivy</td></tr>
<tr>
<td><strong>Reliability</strong></td><td><code>npm ci</code> for deterministic builds, automated tests</td></tr>
<tr>
<td><strong>Recoverability</strong></td><td>Automated rollback on deploy failure</td></tr>
<tr>
<td><strong>Traceability</strong></td><td>Docker tags tied to Git commit SHAs</td></tr>
<tr>
<td><strong>Efficiency</strong></td><td>Parallel jobs, Docker layer caching (<code>type=gha</code>), concurrency</td></tr>
<tr>
<td><strong>Governance</strong></td><td>Branch protection, PR reviews, status checks, audit trail</td></tr>
</tbody>
</table>
</div><p>By following this guide step by step, you've built something real — not a tutorial demo, but the same patterns used in production systems at companies shipping code daily.</p>
<hr />
<p><em>Built as part of the</em> <a target="_blank" href="https://github.com/Push1697/devops-portfolio"><em>DevOps Portfolio</em></a> <em>— Week 1: CI/CD Foundations.</em></p>
]]></content:encoded></item><item><title><![CDATA[The AI SRE Is Here — And So Are Two New Kernel CVEs | Overflowbyte Weekly · Feb 15, 2026]]></title><description><![CDATA[Last week we talked about the AI infrastructure arms race heating up. This week, it got louder — and more concrete.
The hyperscalers posted their numbers, and the capex figures are staggering. But more interestingly, the tooling is catching up. AI is...]]></description><link>https://blog.overflowbyte.cloud/the-ai-sre-is-here-and-so-are-two-new-kernel-cves-overflowbyte-weekly-feb-15-2026</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/the-ai-sre-is-here-and-so-are-two-new-kernel-cves-overflowbyte-weekly-feb-15-2026</guid><category><![CDATA[weekly dev journal]]></category><category><![CDATA[newsletter]]></category><category><![CDATA[tech digest]]></category><category><![CDATA[technology]]></category><category><![CDATA[ai-sre]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sat, 14 Feb 2026 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771362098899/baf2cf52-c83e-4031-ac84-b11dc47123ee.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Last week we talked about the AI infrastructure arms race heating up. This week, it got louder — and more concrete.</p>
<p>The hyperscalers posted their numbers, and the capex figures are staggering. But more interestingly, the <em>tooling</em> is catching up. AI is no longer just something being built on top of cloud infrastructure — it is being woven into how we <em>run</em> that infrastructure. Google's SREs are using a Gemini CLI during actual production outages. Splunk now ships an AI agent that acts as a "fellow SRE." These are not demos anymore.</p>
<p>On the ground, two new Linux kernel CVEs dropped that are specifically relevant to virtualized environments, Ubuntu pushed a heavy patch batch, and Docker shipped a sandboxed microVM environment aimed directly at running AI agents safely.</p>
<p>Here is your curated briefing for the week ending <strong>February 15, 2026</strong>.</p>
<hr />
<h2 id="heading-1-the-hyperscaler-arms-race-gets-a-price-tag">1. The Hyperscaler Arms Race Gets a Price Tag</h2>
<p>If last week was the announcement round, this week was the financial confirmation.</p>
<p><strong>AWS</strong> is targeting approximately <strong>$200 Billion</strong> in capital expenditure for 2026 — almost entirely for data centres and custom silicon powering AI workloads. They also launched <strong>Nova Forge</strong>, a new service that lets you fine-tune Amazon's own generative AI models during training, without shipping your data somewhere else. That is a meaningful enterprise unlock. <a target="_blank" href="https://www.cnbc.com/2026/02/05/aws-q4-earnings-report-2025.html">Source</a></p>
<p><strong>Google / Alphabet</strong> is on track to nearly <strong>double its 2026 capex</strong> to roughly $175–185 Billion. Google Cloud posted <strong>48% year-over-year revenue growth</strong> — its fastest pace since 2021 — and its backlog has more than doubled to $240 Billion. The supply constraint now is not demand. It is physical capacity. <a target="_blank" href="https://www.trendforce.com/news/2026/02/05/news-google-reportedly-to-nearly-double-2026-capex-as-cloud-revenue-jumps-nearly-48/">Source</a></p>
<p><strong>Microsoft Azure</strong> is up <strong>38–39%</strong> with nearly <strong>1 Gigawatt of AI infrastructure</strong> added in a single quarter. They are shipping their own silicon — Maia 200 accelerators and Cobalt CPUs — optimised for "tokens per watt per dollar." $37.5 Billion in capex, roughly two-thirds on GPUs and CPUs. <a target="_blank" href="https://futurumgroup.com/insights/microsoft-q2-fy-2026-cloud-surpasses-50b-azure-up-38-cc/">Source</a></p>
<p><strong>What this means for you:</strong> Expect AI-optimised instance types, new managed AI services, and region expansions to accelerate across all three clouds in 2026. The "multi-cloud" story is increasingly about accessing the best AI models where they live — not just load-balancing workloads. Vendor-specific AI tooling pressure is real and growing.</p>
<hr />
<h2 id="heading-2-docker-sandbox-secure-environments-for-running-ai-agents">2. Docker Sandbox: Secure Environments for Running AI Agents</h2>
<p>Docker shipped <strong>Desktop 4.59.0</strong> (Feb 2) with something worth paying attention to: <strong>Docker Sandbox</strong>, a microVM-based environment specifically designed for running coding and AI agents in isolation. Kernel bumped to 6.12.67, Compose updated to v5.0.2.</p>
<p>A quick <strong>4.60.1 bugfix</strong> followed on Feb 9 to resolve dashboard crashes after sign-in.</p>
<p>Why does Sandbox matter? Right now, most people running AI coding agents locally are doing so with far less isolation than they think. Docker Sandbox gives you a proper microVM boundary — the agent can write files, run code, make calls, and be contained without touching your host environment.</p>
<blockquote>
<p><strong>Action:</strong> Upgrade to Desktop ≥ 4.59 and start experimenting with Sandbox for any local AI agent workflows. This is Docker's answer to the "how do I run an agent without it doing something I didn't intend" problem.</p>
</blockquote>
<p>📎 <a target="_blank" href="https://docs.docker.com/desktop/release-notes/">Docker Desktop Release Notes</a></p>
<hr />
<h2 id="heading-3-linux-security-two-cves-and-a-heavy-ubuntu-patch-batch">3. Linux Security: Two CVEs and a Heavy Ubuntu Patch Batch</h2>
<h3 id="heading-cve-2026-23057-vsockvirtio-memory-leak">CVE-2026-23057 — vsock/virtio Memory Leak</h3>
<p>A bug in the Linux kernel's vsock/virtio subsystem leaks <strong>uninitialized kernel memory</strong> when certain zero-copy message patterns (<code>MSG_ZEROCOPY</code>) are used — caused by incorrect buffer coalescing on the RX queue. Exploitation requires local access and vsock loopback, but this is particularly relevant if you run workloads that use <strong>vsock for inter-VM communication</strong> — which includes several popular agent and sandbox setups.</p>
<p>Patches are in mainline. In the meantime: disable vsock loopback and <code>MSG_ZEROCOPY</code> where they are not explicitly needed, and tighten access to vsock interfaces. <a target="_blank" href="https://www.sentinelone.com/vulnerability-database/cve-2026-23057/">Source</a></p>
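<p>A quick way to audit exposure before patching, as a sketch: parse <code>/proc/modules</code> and flag any vsock drivers that are loaded. The module names here (<code>vsock_loopback</code>, <code>vmw_vsock_virtio_transport</code>) are the common mainline ones but can vary by kernel build, so verify against your own <code>lsmod</code> output.</p>

```python
# Modules of interest; names are typical for mainline kernels but can vary
# by build -- verify against your own `lsmod` output before relying on this.
VSOCK_MODULES = {"vsock", "vsock_loopback", "vmw_vsock_virtio_transport"}

def loaded_modules(proc_modules_text: str) -> set:
    """Parse /proc/modules content: one module per line, name in column one."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}

def vsock_exposure(proc_modules_text: str) -> set:
    """Return which vsock-related modules are currently loaded."""
    return loaded_modules(proc_modules_text).intersection(VSOCK_MODULES)
```

<p>On a live host, run <code>vsock_exposure(open("/proc/modules").read())</code>; an empty set means the vsock transports are not even loaded.</p>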
<h3 id="heading-cve-2026-23107-arm64-sme-null-pointer-dereference">CVE-2026-23107 — ARM64 SME NULL-Pointer Dereference</h3>
<p>An ARM64-specific bug in the Scalable Matrix Extension (SME) subsystem causes a NULL-pointer dereference when restoring ZA signal context — typically triggered by <strong>CRIU or checkpoint/restore tooling</strong>. The result is a local kernel crash and denial-of-service. Fix has landed upstream.</p>
<p>If you are not running SME workloads, disabling SME at boot is the cleanest short-term mitigation. If you rely on CRIU, restrict it to trusted admins until you can patch. <a target="_blank" href="https://www.sentinelone.com/vulnerability-database/cve-2026-23107/">Source</a></p>
<h3 id="heading-ubuntu-feb-12-mega-patch">Ubuntu Feb 12 Mega-Patch</h3>
<p>Ubuntu dropped a large security batch on <strong>February 12</strong> covering the Linux kernel across <strong>18.04, 20.04, and 22.04 LTS</strong> — each notice rolling up dozens to hundreds of CVEs. Also patched: libpng, nginx, dnsdist, HAProxy, and MUNGE.</p>
<blockquote>
<p><strong>Action:</strong> Do not defer this one. Plan maintenance windows and kernel reboots this week for any Ubuntu server that is internet-facing or running in a virtual environment.</p>
</blockquote>
<p>📎 <a target="_blank" href="https://ubuntu.com/security/notices">Ubuntu Security Notices</a></p>
<h3 id="heading-openssh-advisory-on-rhel-96-100-eus">OpenSSH Advisory on RHEL 9.6 / 10.0 EUS</h3>
<p>Red Hat issued advisories for RHEL 9.6 and 10.0 EUS covering a case where control characters or embedded nulls in usernames or URIs could lead to <strong>code execution via ProxyCommand</strong>. Rated Moderate, but if you have ProxyCommand or complex <code>ssh://</code> URIs in any automation — patch it now, it is not worth the risk.</p>
<p>📎 <a target="_blank" href="https://access.redhat.com/errata/RHSA-2026:0693">Red Hat Advisory RHSA-2026:0693</a></p>
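<p>Until you have patched, a cheap defence is to screen any username or host your automation hands to <code>ssh</code>. A minimal pre-flight check, as a sketch (not a substitute for the update):</p>

```python
import re

# Control characters and embedded NULs are the vector described in the
# advisory; a leading '-' is the classic argument-injection case on the
# ssh CLI. Reject all of them before building the command line.
_CONTROL = re.compile(r"[\x00-\x1f\x7f]")

def is_safe_ssh_token(token: str) -> bool:
    """Screen a username or hostname destined for an ssh invocation."""
    return bool(token) and not _CONTROL.search(token) and not token.startswith("-")
```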
<hr />
<h2 id="heading-4-kubernetes-136-cycle-and-when-to-upgrade">4. Kubernetes: 1.36 Cycle and When to Upgrade</h2>
<p><strong>Kubernetes 1.36.0-alpha.1</strong> was cut around February 4. The timeline:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Milestone</th><th>Date</th></tr>
</thead>
<tbody>
<tr>
<td>Code Freeze</td><td>Mid-March 2026</td></tr>
<tr>
<td>GA Release</td><td><strong>April 22, 2026</strong></td></tr>
</tbody>
</table>
</div><p>On the managed side: AKS is moving 1.35 through its preview and GA windows in Q1–Q2, and EKS lists standard support for 1.32–1.34 running through late 2026 into 2027.</p>
<p>If you are still on <strong>1.29 or 1.30</strong>, this is the same message as last week: get off them. Staying too far behind is becoming a security liability, not just an operational inconvenience. Map your upgrade path to <strong>1.33+</strong> now before the managed provider EOL catches you off guard.</p>
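<p>Because Kubernetes control planes upgrade one minor version at a time, the path from 1.29 to 1.33 is four separate maintenance events, not one. A small sketch to enumerate the hops (assumes plain <code>major.minor</code> strings sharing the same major version):</p>

```python
def upgrade_hops(current: str, target: str) -> list:
    """List the sequential minor-version upgrades from current to target.

    Control planes must move one minor version per upgrade, so each
    returned version is a separate maintenance event. Assumes plain
    "major.minor" strings with the same major version.
    """
    cur_major, cur_minor = (int(p) for p in current.split("."))
    tgt_major, tgt_minor = (int(p) for p in target.split("."))
    if (tgt_major, tgt_minor) > (cur_major, cur_minor):
        return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]
    return []
```

<p>For example, <code>upgrade_hops("1.29", "1.33")</code> returns four versions — four upgrade windows to schedule.</p>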
<p>📎 <a target="_blank" href="https://www.kubernetes.dev/resources/release/">Kubernetes Release Schedule</a></p>
<hr />
<h2 id="heading-5-the-aws-sysops-cert-has-a-new-name-and-a-new-practice-exam">5. The AWS SysOps Cert Has a New Name (and a New Practice Exam)</h2>
<p>AWS is officially rebranding the <strong>SysOps Administrator Associate</strong> exam to <strong>AWS Certified CloudOps Engineer – Associate</strong>. The rename reflects the actual reality of the job — it is not just systems administration; it is operational engineering on cloud platforms.</p>
<p>What is new this cycle:</p>
<ul>
<li><p><strong>Official Practice Exam</strong> launched on Skill Builder in January 2026</p>
</li>
<li><p>New <strong>agentic-AI classroom courses</strong> added to Skill Builder</p>
</li>
<li><p>Exam structure updated to reflect current AWS operations patterns</p>
</li>
</ul>
<p>The core certification path still holds:</p>
<pre><code class="lang-plaintext">Cloud Practitioner
  → Solutions Architect Associate
      → CloudOps Engineer Associate  ← renamed, updated
          → DevOps Engineer Professional
</code></pre>
<blockquote>
<p><strong>Action:</strong> If you were planning to sit the SysOps exam, hold for a few weeks and check the updated exam guide. The new practice exam on Skill Builder is worth running through regardless — it reflects the current question format.</p>
</blockquote>
<p>📎 <a target="_blank" href="https://aws.amazon.com/certification/coming-soon/">AWS Certification Updates</a></p>
<hr />
<h2 id="heading-6-the-ai-sre-is-no-longer-a-concept-it-is-in-production">6. The AI SRE Is No Longer a Concept — It Is In Production</h2>
<p>This is the section worth slowing down on.</p>
<p><strong>Google's SREs are using a Gemini CLI during real outages.</strong> Not in a sandbox, not in a blog post demo — in actual incident response. An InfoQ write-up this week details how Google Cloud SRE teams use an internal Gemini CLI to summarise incidents, navigate logs and metrics, and surface remediation candidates, replacing the manual "wade through five dashboards while half-asleep at 3am" workflow. <a target="_blank" href="https://www.infoq.com/news/2026/02/google-sre-gemini-cli-outage/">Source</a></p>
<p><strong>Splunk shipped an AI troubleshooting agent</strong> in its Q1 2026 Observability Cloud update. It ingests metrics, events, logs, and traces when an alert fires and proposes root causes, impact summaries, and remediation steps. Splunk is explicitly calling it "a fellow SRE." <a target="_blank" href="https://www.splunk.com/en_us/blog/observability/splunk-observability-ai-agent-monitoring-innovations.html">Source</a></p>
<p><strong>LogicMonitor's 2026 Observability &amp; AI Outlook</strong> forecasts the trajectory plainly: enterprises are drowning in telemetry but lack correlation and causality. The near-term demand is for AI-driven root-cause analysis and predictive detection. The target state — which is closer than most people realise — is <strong>autonomous remediation with human approval gates</strong>.</p>
<p>The recommended path if you want to get ahead of this:</p>
<ol>
<li><p><strong>Consolidate</strong> your observability tooling (fewer tools, better data)</p>
</li>
<li><p><strong>Standardise on OpenTelemetry</strong> as your instrumentation layer</p>
</li>
<li><p><strong>Layer AI-assisted alert correlation</strong> on top — start with one workflow, measure signal vs. noise</p>
</li>
</ol>
<blockquote>
<p><strong>Action:</strong> Pick one low-stakes alert in your environment. Wire it up to an LLM (anything from Bedrock to a local model) for summarisation and suggested action. Run it in parallel with your existing workflow for two weeks. That hands-on experience will tell you more than any vendor demo.</p>
</blockquote>
<hr />
<h2 id="heading-7-hands-on-wire-an-llm-into-your-incident-workflow">7. Hands-On: Wire an LLM Into Your Incident Workflow</h2>
<p>Inspired by what Google's SREs are actually doing, here is a minimal starting point you can build this week:</p>
<p><strong>Goal:</strong> An AI-assisted on-call helper that summarises a CloudWatch alarm and suggests next steps before you open your first dashboard.</p>
<p><strong>What you need:</strong></p>
<ul>
<li><p>An AWS account with CloudWatch and Bedrock access (Claude or Titan work fine)</p>
</li>
<li><p>A CloudWatch alarm you already have set up</p>
</li>
<li><p>About 90 minutes</p>
</li>
</ul>
<p><strong>Rough architecture:</strong></p>
<pre><code class="lang-plaintext">CloudWatch Alarm
    → SNS Topic
        → Lambda Function
            → Fetch last 15 min of relevant metrics + logs
            → Send to Bedrock with a structured prompt
            → Post summary + suggested action to Slack / PagerDuty note
</code></pre>
<p>The prompt matters more than the model. A simple structure works well:</p>
<pre><code class="lang-plaintext">You are an SRE assistant. Given the following alert context and recent metrics,
summarise what is likely happening in 3 sentences, list the top 2 likely causes,
and suggest the first diagnostic command an engineer should run.

Alert: {alarm_name} — {alarm_description}
Recent metrics: {metric_data}
Recent log errors: {log_excerpt}
</code></pre>
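<p>Inside the Lambda, assembling that prompt is a plain format string. A sketch of just the assembly step, with the Bedrock call and metric fetching omitted (field names follow the placeholders above; the 4000-character log cap is an assumed safety margin to keep the prompt inside the model's context window):</p>

```python
# Template mirrors the structured prompt above; field names match its
# placeholders. The log cap is an assumed safety margin, not a hard rule.
PROMPT_TEMPLATE = """You are an SRE assistant. Given the following alert context and recent metrics,
summarise what is likely happening in 3 sentences, list the top 2 likely causes,
and suggest the first diagnostic command an engineer should run.

Alert: {alarm_name} — {alarm_description}
Recent metrics: {metric_data}
Recent log errors: {log_excerpt}"""

def build_prompt(alarm_name, alarm_description, metric_data, log_excerpt,
                 max_log_chars=4000):
    # Keep only the tail of the logs: the most recent errors, bounded size.
    return PROMPT_TEMPLATE.format(
        alarm_name=alarm_name,
        alarm_description=alarm_description,
        metric_data=metric_data,
        log_excerpt=log_excerpt[-max_log_chars:],
    )
```

<p>The Lambda then sends <code>build_prompt(...)</code> to Bedrock and posts the response to Slack or a PagerDuty note.</p>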
<p>This is not autonomous remediation. It is <strong>decision support</strong> — and that is exactly where you should start. Build the human-in-the-loop version first, trust it, then consider automation later.</p>
<hr />
<p><em>Keep shipping, Overflowbyte</em></p>
<p><em>Sources:</em></p>
<p><a target="_blank" href="https://www.cnbc.com/2026/02/05/aws-q4-earnings-report-2025.html"><em>AWS CNBC</em></a> <em>·</em> <a target="_blank" href="https://www.trendforce.com/news/2026/02/05/news-google-reportedly-to-nearly-double-2026-capex-as-cloud-revenue-jumps-nearly-48/"><em>Google TrendForce</em></a> <em>·</em> <a target="_blank" href="https://futurumgroup.com/insights/microsoft-q2-fy-2026-cloud-surpasses-50b-azure-up-38-cc/"><em>Azure Futurum</em></a> <em>·</em> <a target="_blank" href="https://docs.docker.com/desktop/release-notes/"><em>Docker Docs</em></a> <em>·</em> <a target="_blank" href="https://www.sentinelone.com/vulnerability-database/cve-2026-23057/"><em>SentinelOne CVE-23057</em></a> <em>·</em> <a target="_blank" href="https://www.sentinelone.com/vulnerability-database/cve-2026-23107/"><em>SentinelOne CVE-23107</em></a> <em>·</em> <a target="_blank" href="https://ubuntu.com/security/notices"><em>Ubuntu Security</em></a> <em>·</em> <a target="_blank" href="https://access.redhat.com/errata/RHSA-2026:0693"><em>Red Hat Advisory</em></a> <em>·</em> <a target="_blank" href="https://www.kubernetes.dev/resources/release/"><em>Kubernetes Release</em></a> <em>·</em> <a target="_blank" href="https://aws.amazon.com/certification/coming-soon/"><em>AWS Cert</em></a> <em>·</em> <a target="_blank" href="https://www.infoq.com/news/2026/02/google-sre-gemini-cli-outage/"><em>InfoQ Gemini CLI</em></a> <em>·</em> <a target="_blank" href="https://www.splunk.com/en_us/blog/observability/splunk-observability-ai-agent-monitoring-innovations.html"><em>Splunk Blog</em></a> <em>·</em> <a target="_blank" href="https://www.logicmonitor.com/resources/2026-observability-ai-trends-outlook"><em>LogicMonitor</em></a></p>
<p>#weekly-dev-journal</p>
]]></content:encoded></item><item><title><![CDATA[Discover the Latest in DevOps: Week of February 8, 2026]]></title><description><![CDATA[Staying relevant as a DevOps engineer in 2026 means tracking two massive forces simultaneously: the relentless AI infrastructure arms race between hyperscalers and the day‑to‑day operational realities of Linux, Kubernetes, and security.
This week, we...]]></description><link>https://blog.overflowbyte.cloud/discover-the-latest-in-devops-week-of-february-8-2026</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/discover-the-latest-in-devops-week-of-february-8-2026</guid><category><![CDATA[weekly dev journal]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sun, 08 Feb 2026 09:40:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770543253669/0dd265db-252d-45ff-845e-1f37bc6286a2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Staying relevant as a DevOps engineer in 2026 means tracking two massive forces simultaneously: the relentless <strong>AI infrastructure arms race</strong> between hyperscalers and the <strong>day‑to‑day operational realities</strong> of Linux, Kubernetes, and security.</p>
<p>This week, we see the giants (AWS, Google, Azure) spending unprecedented amounts to build the "AI Supercycle," while on the ground, engineers are being given new tools to actually <em>use</em> that AI in production safely.</p>
<p>Here is your curated briefing on the most meaningful updates for the week ending <strong>February 8, 2026</strong>.</p>
<hr />
<h2 id="heading-1-ai-infrastructure-boom-hyperscalers-invest-billions-in-2026">1. AI Infrastructure Boom: Hyperscalers Invest Billions in 2026</h2>
<p>If you needed proof that 2026 is the year of AI infrastructure, the financial reports this week screamed it.</p>
<ul>
<li><p><strong>AWS</strong> is planning nearly <strong>$200 Billion</strong> in capital expenditure for 2026, pouring money into data centers and custom silicon. <a target="_blank" href="https://www.cnbc.com/2026/02/05/aws-q4-earnings-report-2025.html">Source</a></p>
</li>
<li><p><strong>Google</strong> expects its 2026 spending to <strong>double</strong>, driven by an infrastructure race that is now truly global. <a target="_blank" href="https://www.cnbc.com/2026/02/04/alphabet-resets-the-bar-for-ai-infrastructure-spending.html">Source</a></p>
</li>
<li><p><strong>Azure</strong> continues its massive build-out, with cloud revenue up 39%. <a target="_blank" href="https://erp.today/microsoft-q2-2026-results-show-ai-cloud-growth-accelerating-as-spending-surges/">Source</a></p>
</li>
</ul>
<p><strong>What this means for you:</strong> Expect more region availability for high-end AI instances, but also increasing pressure to adopt vendor-specific AI tools. The "multi-cloud" conversation is shifting from "avoiding lock-in" to "getting access to the best AI models where they live."</p>
<h3 id="heading-aws-innovations-practical-tools-for-secure-ai-operations">AWS Innovations: Practical Tools for Secure AI Operations</h3>
<p>While the executives talk billions, AWS shipped features that matter <em>now</em>. You can read the full <a target="_blank" href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-bedrock-agent-workflows-amazon-sagemaker-private-connectivity-and-more-february-2-2026/">AWS Weekly Roundup here</a>.</p>
<ul>
<li><p><strong>Bedrock Agent Workflows</strong>: Now support <strong>server-side tool use</strong>. This is huge. It means you can build an AI agent that queries your internal databases or triggers a Lambda function <em>without</em> exposing those tools to the public internet. It’s the missing link for secure "ChatOps."</p>
</li>
<li><p><strong>S3</strong> <code>UpdateObjectEncryption</code>: You can finally change server-side encryption settings on existing objects without rewriting them. If you manage petabyte-scale buckets and Compliance just asked you to rotate keys, your week just got a lot better.</p>
</li>
<li><p><strong>Network Firewall GenAI Visibility</strong>: A new feature to help you see and block/allow traffic to GenAI tools, giving security teams the control they’ve been asking for.</p>
</li>
</ul>
<hr />
<h2 id="heading-2-kubernetes-evolution-key-updates-and-future-milestones">2. Kubernetes Evolution: Key Updates and Future Milestones</h2>
<p>The heartbeat of modern infrastructure continues to beat steadily.</p>
<ul>
<li><p><strong>Kubernetes v1.36 Cycle</strong>: The release cycle has officially begun, with Alpha 1 dropping this week. The target GA date is <strong>April 22, 2026</strong>. Mark your calendars. <a target="_blank" href="https://www.kubernetes.dev/resources/release/">Release Schedule</a></p>
</li>
<li><p><strong>EKS Updates</strong>: Amazon EKS now supports <strong>Kubernetes 1.35</strong>. If you are still running 1.29 or 1.30, 2026 is the year to plan a major leap forward. The ecosystem is stabilizing around these newer versions, and staying too far behind is becoming a security liability. <a target="_blank" href="https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html">AWS EKS Docs</a></p>
</li>
</ul>
<hr />
<h2 id="heading-3-linux-security-alert-prepare-for-the-2026-secure-boot-deadline">3. Linux Security Alert: Prepare for the 2026 Secure Boot Deadline</h2>
<h3 id="heading-the-ticking-clock-secure-boot-2026">The Ticking Clock: Secure Boot 2026</h3>
<p>Red Hat issued a critical reminder this week that arguably affects every enterprise Linux admin: <strong>Microsoft's 2011 Secure Boot signing certificate expires on June 26, 2026.</strong></p>
<ul>
<li><p><strong>The Good News</strong>: Existing systems will continue to boot.</p>
</li>
<li><p><strong>The Bad News</strong>: You won't be able to sign <em>new</em> boot components (shims, kernels) with the old cert after that date.</p>
</li>
<li><p><strong>The Action Item</strong>: Do not ignore this. Start inventorying your fleet's Secure Boot state. You will need to apply vendor-provided shim and firmware updates before June. Do <em>not</em> try to manually edit UEFI databases unless you really know what you are doing—you can easily brick servers. <a target="_blank" href="https://developers.redhat.com/articles/2026/02/04/secure-boot-certificate-changes-2026-guidance-rhel-environments">Read the Red Hat Guidance</a></p>
</li>
</ul>
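<p>For the inventory step, <code>mokutil --sb-state</code> reports each host's Secure Boot status. A sketch that tallies captured output across a fleet (the exact output strings can vary by distro, so treat the matching as an assumption to verify on your own systems):</p>

```python
from collections import Counter

def secure_boot_state(mokutil_output: str) -> str:
    """Classify captured `mokutil --sb-state` output for one host."""
    text = mokutil_output.lower()
    if "disabled" in text:
        return "disabled"
    if "enabled" in text:
        return "enabled"
    return "unknown"

def fleet_summary(outputs: dict) -> Counter:
    """Tally Secure Boot state across hostname -> captured-output pairs."""
    return Counter(secure_boot_state(o) for o in outputs.values())
```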
<h3 id="heading-kernel-watch">Kernel Watch</h3>
<p>New Longterm Support (LTS) kernels dropped this week: <strong>6.12.69, 6.6.123, and 6.1.162</strong>. These are purely stability and security fixes. Patch early, sleep better. <a target="_blank" href="https://www.kernel.org/releases.html">Kernel.org Releases</a></p>
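<p>A small sketch for flagging hosts that are behind these LTS fixes, assuming kernel versions are reported as dotted triples with an optional distro suffix (as from <code>uname -r</code>):</p>

```python
# This week's LTS point releases from kernel.org, keyed by series.
LTS_PATCHED = {(6, 12): (6, 12, 69), (6, 6): (6, 6, 123), (6, 1): (6, 1, 162)}

def parse_kernel(version: str) -> tuple:
    """Turn '6.12.60-generic' or '6.12.69' into a comparable int tuple."""
    base = version.split("-")[0]
    return tuple(int(p) for p in base.split("."))

def needs_patch(running: str) -> bool:
    """True if the running kernel is in an LTS series below this week's fix."""
    v = parse_kernel(running)
    patched = LTS_PATCHED.get(v[:2])
    return patched is not None and patched > v
```

<p>Hosts on non-LTS series fall outside the table and return <code>False</code>; handle those separately.</p>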
<hr />
<h2 id="heading-4-devops-2026-essential-skills-for-the-ai-driven-era">4. DevOps 2026: Essential Skills for the AI-Driven Era</h2>
<p>What does a "Senior DevOps Engineer" look like in 2026? Recent industry reports and job market data from this week paint a clear picture.</p>
<p>The baseline has moved. <strong>Kubernetes</strong> and <strong>Cloud-Native</strong> fluency are no longer "nice-to-haves"—they are the entry ticket. The new premium skills are:</p>
<ol>
<li><p><strong>AIOps Integration</strong>: Can you wire an AI agent into your PagerDuty workflow to triage alerts before a human wakes up?</p>
</li>
<li><p><strong>Security Engineering</strong>: "DevSecOps" is just "DevOps" now. Integrating automated security scanning (SAST/DAST) into pipelines is standard.</p>
</li>
<li><p><strong>Cost Intelligence</strong>: With cloud spend soaring, engineers who can optimize usage (FinOps) can easily justify their salaries. <a target="_blank" href="https://maddevs.io/blog/devops-engineer-skills-matrix/">Skill Matrix Source</a></p>
</li>
</ol>
<p><strong>Top Certifications for 2026</strong>:</p>
<ul>
<li><p><strong>AWS DevOps Professional</strong> / <strong>Azure DevOps Expert</strong></p>
</li>
<li><p><strong>CKA (Certified Kubernetes Administrator)</strong>: Still the gold standard for hands-on skills.</p>
</li>
</ul>
<hr />
<h2 id="heading-5-hands-on-guide-building-your-first-secure-ai-agent">5. Hands-On Guide: Building Your First Secure AI Agent</h2>
<p><strong>Build Your First "Safe" AI Agent.</strong></p>
<p>Don't just read about the billions being spent. Use the new <strong>AWS Bedrock server-side tools</strong> feature to build a simple prototype:</p>
<ol>
<li><p>Create an agent that can query a <em>read-only</em> internal status page or CloudWatch metric.</p>
</li>
<li><p>Connect it to a Slack channel or CLI.</p>
</li>
<li><p>See how it feels to "chat with your infrastructure" securely.</p>
</li>
</ol>
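<p>The read-only constraint in step 1 is worth enforcing in code, not just in the prompt. A minimal sketch of a tool-layer allowlist (the action names are real IAM action strings, but the guard itself is illustrative, not a specific Bedrock API):</p>

```python
# Illustrative guardrail: the agent's tool layer refuses anything outside
# an explicit read-only allowlist, regardless of what the model asks for.
READ_ONLY_ACTIONS = {
    "cloudwatch:GetMetricData",
    "cloudwatch:DescribeAlarms",
    "logs:FilterLogEvents",
}

def authorize(action: str) -> bool:
    return action in READ_ONLY_ACTIONS

def run_tool(action: str, executor) -> str:
    """Dispatch a model-requested action only if it is allowlisted."""
    if not authorize(action):
        raise PermissionError(f"agent requested non-allowlisted action: {action}")
    return executor(action)
```

<p>Pair this with an IAM role scoped to the same actions, so even a bug in the guard cannot widen the blast radius.</p>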
<p>This is the direction the industry is moving: autonomous operations with strict guardrails. Better to build it yourself now than be surprised by it later.</p>
<hr />
<p><em>Keep shipping, Overflowbyte</em></p>
<h3 id="heading-sources">Sources</h3>
<ul>
<li><p><a target="_blank" href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-bedrock-agent-workflows-amazon-sagemaker-private-connectivity-and-more-february-2-2026/">AWS Weekly Roundup - Feb 2, 2026</a></p>
</li>
<li><p><a target="_blank" href="https://www.cnbc.com/2026/02/05/aws-q4-earnings-report-2025.html">Amazon Capex Plans - CNBC</a></p>
</li>
<li><p><a target="_blank" href="https://developers.redhat.com/articles/2026/02/04/secure-boot-certificate-changes-2026-guidance-rhel-environments">Red Hat Secure Boot Guidance</a></p>
</li>
<li><p><a target="_blank" href="https://www.kubernetes.dev/resources/release/">Kubernetes Release Cycle</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Boost LinkedIn Automation: Creating Content with n8n and Google Gemini]]></title><description><![CDATA[In the fast-paced world of social media, consistency is key. But for busy professionals and tech enthusiasts, maintaining a steady stream of high-quality LinkedIn posts can be a challenge.
In this post, I'll walk you through how I automated my Linked...]]></description><link>https://blog.overflowbyte.cloud/boost-linkedin-automation-creating-content-with-n8n-and-google-gemini</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/boost-linkedin-automation-creating-content-with-n8n-and-google-gemini</guid><category><![CDATA[General Programming]]></category><category><![CDATA[n8n]]></category><category><![CDATA[n8n Automation]]></category><category><![CDATA[AI-automation]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sun, 08 Feb 2026 09:01:05 GMT</pubDate><content:encoded><![CDATA[<p>In the fast-paced world of social media, consistency is key. But for busy professionals and tech enthusiasts, maintaining a steady stream of high-quality LinkedIn posts can be a challenge.</p>
<p>In this post, I'll walk you through how I automated my LinkedIn content creation using <strong>n8n</strong>, <strong>Google Gemini</strong>, and a <strong>human-in-the-loop</strong> workflow. This system allows me to generate engaging, platform-specific content from a simple topic, review it, add visual assets, and publish it—all without leaving the automation flow.</p>
<h2 id="heading-the-workflow-overview">The Workflow Overview</h2>
<p>The core of this automation is an <strong>n8n</strong> workflow that acts as my social media assistant. Here’s the high-level logic:</p>
<ol>
<li><p><strong>Idea Injection</strong>: I provide a simple "Topic" or "Title" via a web form.</p>
</li>
<li><p><strong>AI Processing</strong>: An AI Agent (powered by Google Gemini) researches the topic and drafts a professional LinkedIn post, complete with hashtags and a call-to-action.</p>
</li>
<li><p><strong>Human Review</strong>: The workflow pauses and presents the generated text to me. I can review it and upload a relevant image.</p>
</li>
<li><p><strong>Publishing</strong>: Once approved, the workflow automatically posts the text and image to my LinkedIn profile.</p>
</li>
</ol>
<h2 id="heading-the-automation-architecture">The Automation Architecture</h2>
<p>Here is the visual representation of the flow:</p>
<pre><code class="lang-mermaid">graph TD
    Start((Start)) --&gt; Form_Trigger[Form Trigger: Input Topic]
    Form_Trigger --&gt; Split_Input[Split Input]
    Split_Input --&gt; AI_Data_Prep[Data for AI]
    AI_Data_Prep --&gt; AI_Agent[AI Agent - Gemini]

    subgraph "AI Processing"
        AI_Agent --&gt; Gemini_Model[Google Gemini Model]
        AI_Agent --&gt; Parser[Structured Output Parser]
    end

    AI_Agent --&gt; Aggregate[Aggregate Results]
    Aggregate --&gt; Human_Loop_Form[Form: Review &amp; Upload Image]

    Human_Loop_Form --&gt; Process_Image[Process Image Data]
    Process_Image --&gt; LinkedIn_Node[Publish to LinkedIn]

    LinkedIn_Node --&gt; Done((End / Confirmation))

    style Start fill:#f9f,stroke:#333,stroke-width:2px
    style AI_Agent fill:#bbf,stroke:#333,stroke-width:2px
    style Human_Loop_Form fill:#bfb,stroke:#333,stroke-width:2px
    style LinkedIn_Node fill:#f96,stroke:#333,stroke-width:2px
</code></pre>
<h3 id="heading-the-n8n-canvas-view">The n8n Canvas View</h3>
<p>Here's what the actual workflow looks like in n8n:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770532209957/798cffe2-499e-45ca-93fa-11bfdfc4027d.png" alt class="image--center mx-auto" /></p>
<p>The workflow is organized into 4 distinct sections (marked by sticky notes):</p>
<ol>
<li><p><strong>Get the Data for Social Media Post using the Web Form</strong> (Yellow)</p>
</li>
<li><p><strong>AI Agent will do its Job</strong> (Blue)</p>
</li>
<li><p><strong>Get Image to Be Published</strong> (Light Blue)</p>
</li>
<li><p><strong>Publish on Social Media</strong> (Yellow) → <strong>Confirmation</strong> (Green)</p>
</li>
</ol>
<hr />
<h2 id="heading-step-by-step-node-by-node-implementation-guide">Step-by-Step Node-by-Node Implementation Guide</h2>
<p>Now, let me walk you through building this workflow from scratch. Each subsection represents a node you need to add to your n8n canvas.</p>
<h3 id="heading-section-1-get-the-data-for-social-media-post">Section 1: Get the Data for Social Media Post</h3>
<h4 id="heading-node-1-on-form-submission-form-trigger">Node 1: <strong>On form submission</strong> (Form Trigger)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Form Trigger</code></p>
</li>
<li><p><strong>Purpose</strong>: This is the entry point of your workflow. It creates a web form to capture the post topic.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Form Title</strong>: "Social Media Content AI Agent"</p>
</li>
<li><p><strong>Form Description</strong>: Add a helpful description explaining what the form does</p>
</li>
<li><p><strong>Form Fields</strong>:</p>
<ul>
<li><p>Field Name: <code>Post Title/Topic</code></p>
</li>
<li><p>Field Type: <code>Text</code></p>
</li>
<li><p>Placeholder: "Write a brief and clear title or main topic for the post"</p>
</li>
</ul>
</li>
<li><p><strong>Authentication</strong>: Basic Auth (optional, for security)</p>
</li>
<li><p><strong>Button Label</strong>: "Continue to Image Upload"</p>
</li>
</ul>
</li>
<li><p><strong>What happens</strong>: When submitted, this triggers the workflow and passes the topic to the next node.</p>
</li>
</ul>
<h4 id="heading-node-2-split-form-input-set-node">Node 2: <strong>Split Form Input</strong> (Set Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Set</code> (Edit Fields)</p>
</li>
<li><p><strong>Purpose</strong>: Extract and structure the data from the form submission.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Mode</strong>: Keep only specific fields</p>
</li>
<li><p><strong>Fields to Include</strong>: <code>output.platform_posts.LinkedIn.post</code></p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>On form submission</code></p>
</li>
</ul>
<h4 id="heading-node-3-split-data-set-node">Node 3: <strong>Split Data</strong> (Set Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Set</code> (Edit Fields)</p>
</li>
<li><p><strong>Purpose</strong>: Further refine the data structure.</p>
</li>
<li><p><strong>Configuration</strong>: Same as Split Form Input</p>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Split Form Input</code></p>
</li>
</ul>
<h4 id="heading-node-4-data-for-ai-set-node">Node 4: <strong>Data for AI</strong> (Set Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Set</code> (Edit Fields)</p>
</li>
<li><p><strong>Purpose</strong>: Prepare the final input data for the AI Agent.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Assignments</strong>:</p>
<ul>
<li><p><code>Post Title/Topic</code> = <code>{{ $('On form submission').item.json['Post Title/Topic'] }}</code></p>
</li>
<li><p><code>formMode</code> = <code>{{ $('On form submission').item.json.formMode }}</code></p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Split Data</code></p>
</li>
</ul>
<hr />
<h3 id="heading-section-2-ai-agent-will-do-its-job">Section 2: AI Agent Will Do Its Job</h3>
<h4 id="heading-node-5-ai-agent-ai-agent-node">Node 5: <strong>AI Agent</strong> (AI Agent Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>@n8n/n8n-nodes-langchain.agent</code></p>
</li>
<li><p><strong>Purpose</strong>: The brain of the operation. Generates platform-specific content.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Prompt Type</strong>: Define</p>
</li>
<li><p><strong>Prompt</strong>: A detailed system prompt that:</p>
<ul>
<li><p>Defines the AI's role as a content creator for your brand</p>
</li>
<li><p>Specifies platform-specific rules (LinkedIn, Instagram, Facebook, Twitter)</p>
</li>
<li><p>Includes hashtag strategies</p>
</li>
<li><p>References the input data: <code>{{ $json['Post Title/Topic'] }}</code></p>
</li>
</ul>
</li>
<li><p><strong>Has Output Parser</strong>: ✓ Enabled</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Data for AI</code></p>
</li>
<li><p><strong>Sub-nodes to connect</strong>:</p>
<ul>
<li><p><strong>Google Gemini Chat Model</strong> (Language Model)</p>
</li>
<li><p><strong>Structured Output Parser</strong> (Output Parser)</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-node-6-google-gemini-chat-model-sub-node">Node 6: <strong>Google Gemini Chat Model</strong> (Sub-node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>@n8n/n8n-nodes-langchain.lmChatGoogleGemini</code></p>
</li>
<li><p><strong>Purpose</strong>: The actual AI model that processes the prompt.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Model Name</strong>: <code>models/gemini-3-flash-preview</code> (or your preferred Gemini model)</p>
</li>
<li><p><strong>Credentials</strong>: Google Gemini API credentials</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: <code>AI Agent</code> (via Language Model connection)</p>
</li>
</ul>
<h4 id="heading-node-7-structured-output-parser-sub-node">Node 7: <strong>Structured Output Parser</strong> (Sub-node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>@n8n/n8n-nodes-langchain.outputParserStructured</code></p>
</li>
<li><p><strong>Purpose</strong>: Ensures the AI returns properly formatted JSON.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Schema Type</strong>: Manual</p>
</li>
<li><p><strong>Input Schema</strong>: A JSON schema defining the structure:</p>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-json">{
  <span class="hljs-attr">"type"</span>: <span class="hljs-string">"object"</span>,
  <span class="hljs-attr">"properties"</span>: {
    <span class="hljs-attr">"platform_posts"</span>: {
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"object"</span>,
      <span class="hljs-attr">"properties"</span>: {
        <span class="hljs-attr">"LinkedIn"</span>: {
          <span class="hljs-attr">"type"</span>: <span class="hljs-string">"object"</span>,
          <span class="hljs-attr">"properties"</span>: {
            <span class="hljs-attr">"post"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>},
            <span class="hljs-attr">"hashtags"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"array"</span>},
            <span class="hljs-attr">"call_to_action"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"string"</span>}
          }
        },
        <span class="hljs-attr">"Twitter"</span>: {...},
        <span class="hljs-attr">"Facebook"</span>: {...}
      }
    }
  }
}
</code></pre>
<ul>
<li><strong>Connect to</strong>: <code>AI Agent</code> (via Output Parser connection)</li>
</ul>
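<p>To build intuition for what this parser guarantees, here is a rough, standalone illustration in plain Node.js (the sample object and the <code>validateLinkedInPost</code> helper are mine for demonstration only; n8n applies the schema for you): it checks the same required shape as the LinkedIn branch of the schema above.</p>
<pre><code class="lang-javascript">// Standalone sketch, not n8n code: the sample object below is illustrative,
// not real model output.
const sample = {
  platform_posts: {
    LinkedIn: {
      post: "Excited to share what we are building with open-source AI.",
      hashtags: ["#AI", "#Innovation"],
      call_to_action: "What are you building? Tell me in the comments."
    }
  }
};

// Mirrors the required shape of the LinkedIn branch of the schema.
function validateLinkedInPost(output) {
  const li = output?.platform_posts?.LinkedIn;
  if (!li) return false;
  if (typeof li.post !== "string") return false;
  if (!Array.isArray(li.hashtags)) return false;
  if (typeof li.call_to_action !== "string") return false;
  return true;
}

console.log(validateLinkedInPost(sample)); // true
console.log(validateLinkedInPost({}));     // false
</code></pre>
<p>When the model's reply fails a check like this, the parser (optionally combined with the agent's retry behaviour) asks the model to correct itself instead of passing malformed JSON downstream.</p>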
<h4 id="heading-node-8-aggregate-aggregate-node">Node 8: <strong>Aggregate</strong> (Aggregate Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Aggregate</code></p>
</li>
<li><p><strong>Purpose</strong>: Combines all the AI output into a single item for easier reference.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><strong>Aggregate</strong>: All Item Data</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>AI Agent</code></p>
</li>
</ul>
<hr />
<h3 id="heading-section-3-get-image-to-be-published">Section 3: Get Image to Be Published</h3>
<h4 id="heading-node-9-upload-image-form-node">Node 9: <strong>Upload Image</strong> (Form Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Form</code></p>
</li>
<li><p><strong>Purpose</strong>: Pauses the workflow to let you review the AI-generated text and upload an image.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Operation</strong>: Wait for Form Submission</p>
</li>
<li><p><strong>Form Title</strong>: "Review the Text"</p>
</li>
<li><p><strong>Form Description</strong>: Display the AI-generated text using expressions:</p>
<pre><code class="lang-plaintext">  LinkedIn: {{ $json.data[0].output.platform_posts.LinkedIn.post }}
  Twitter: {{ $json.data[0].output.platform_posts.Twitter.post }}
</code></pre>
</li>
<li><p><strong>Form Fields</strong>:</p>
<ul>
<li><p>Field Label: <code>image</code></p>
</li>
<li><p>Field Type: <code>File</code></p>
</li>
<li><p>Accepted File Types: <code>.jpg</code> (or <code>.png</code>, <code>.jpeg</code>)</p>
</li>
<li><p>Required: ✓ Yes</p>
</li>
</ul>
</li>
<li><p><strong>Button Label</strong>: "Proceed to Next Step"</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Aggregate</code></p>
</li>
</ul>
<h4 id="heading-node-10-nest-top-meta-set-node">Node 10: <strong>Nest Top Meta</strong> (Set Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Set</code> (Edit Fields)</p>
</li>
<li><p><strong>Purpose</strong>: Preserve all form data including the binary image.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Assignments</strong>:</p>
<ul>
<li><p>Name: <code>metaTop</code></p>
</li>
<li><p>Type: Object</p>
</li>
<li><p>Value: <code>{{ $json }}</code></p>
</li>
</ul>
</li>
<li><p><strong>Options</strong>: Include Binary ✓</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Upload Image</code></p>
</li>
</ul>
<h4 id="heading-node-11-rename-image-binary-top-image-code-node">Node 11: <strong>Rename Image Binary Top Image</strong> (Code Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Code</code></p>
</li>
<li><p><strong>Purpose</strong>: Rename the binary data field for LinkedIn compatibility.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Mode</strong>: Run Once for Each Item</p>
</li>
<li><p><strong>JavaScript Code</strong>:</p>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-javascript">$input.item.binary.top = $input.item.binary.data;
<span class="hljs-keyword">delete</span> $input.item.binary.data;
<span class="hljs-keyword">return</span> $input.item;
</code></pre>
<ul>
<li><strong>Connect to</strong>: Output from <code>Nest Top Meta</code></li>
</ul>
<hr />
<h3 id="heading-section-4-publish-on-social-media">Section 4: Publish on Social Media</h3>
<h4 id="heading-node-12-publish-to-linkedin-linkedin-node">Node 12: <strong>Publish to LinkedIn</strong> (LinkedIn Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>LinkedIn</code></p>
</li>
<li><p><strong>Purpose</strong>: Posts the content to your LinkedIn profile.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Resource</strong>: Post</p>
</li>
<li><p><strong>Operation</strong>: Create</p>
</li>
<li><p><strong>Person</strong>: Your LinkedIn Person URN (e.g., <code>CryRqQfSsC</code>)</p>
</li>
<li><p><strong>Text</strong>:</p>
<pre><code class="lang-plaintext">  {{ $('AI Agent').item.json.output.platform_posts.LinkedIn.post }}
  {{ $('Aggregate').item.json.data[0].output.platform_posts.LinkedIn.call_to_action }}
</code></pre>
</li>
<li><p><strong>Share Media Category</strong>: IMAGE</p>
</li>
<li><p><strong>Binary Property Name</strong>: <code>image</code></p>
</li>
<li><p><strong>Credentials</strong>: LinkedIn OAuth2 credentials</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Rename Image Binary Top Image</code></p>
</li>
</ul>
<h4 id="heading-node-13-x-twitter-node-optional">Node 13: <strong>X</strong> (Twitter Node) - Optional</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Twitter</code></p>
</li>
<li><p><strong>Purpose</strong>: Posts to Twitter/X.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Text</strong>: <code>{{ $('Aggregate').item.json.data[0].output.platform_posts.Twitter.post }}</code></p>
</li>
<li><p><strong>Credentials</strong>: Twitter OAuth2 credentials</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Can run in parallel with LinkedIn</p>
</li>
</ul>
<h4 id="heading-node-14-edit-fields-set-node">Node 14: <strong>Edit Fields</strong> (Set Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Set</code> (Edit Fields)</p>
</li>
<li><p><strong>Purpose</strong>: Extract Twitter post ID for confirmation.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Assignments</strong>:</p>
<ul>
<li><p>Name: <code>edit_history_tweet_ids</code></p>
</li>
<li><p>Type: Array</p>
</li>
<li><p>Value: <code>{{ $json.edit_history_tweet_ids }}</code></p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>X</code> (if using Twitter)</p>
</li>
</ul>
<h4 id="heading-node-15-merge1-merge-node">Node 15: <strong>Merge1</strong> (Merge Node)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Merge</code></p>
</li>
<li><p><strong>Purpose</strong>: Combines outputs from LinkedIn (and optionally Twitter) before final confirmation.</p>
</li>
<li><p><strong>Configuration</strong>: Default merge settings</p>
</li>
<li><p><strong>Connect to</strong>:</p>
<ul>
<li><p>Input 1: Output from <code>Publish to LinkedIn</code></p>
</li>
<li><p>Input 2: Any other social media nodes</p>
</li>
</ul>
</li>
</ul>
<hr />
<h3 id="heading-section-5-confirmation-that-post-is-published">Section 5: Confirmation that Post is Published</h3>
<h4 id="heading-node-16-form-form-node-completion">Node 16: <strong>Form</strong> (Form Node - Completion)</h4>
<ul>
<li><p><strong>Node Type</strong>: <code>Form</code></p>
</li>
<li><p><strong>Purpose</strong>: Shows a success message with links to the published posts.</p>
</li>
<li><p><strong>Configuration</strong>:</p>
<ul>
<li><p><strong>Operation</strong>: Completion</p>
</li>
<li><p><strong>Completion Title</strong>: "Thanks"</p>
</li>
<li><p><strong>Completion Message</strong>:</p>
<pre><code class="lang-plaintext">  Your post has successfully been submitted to LinkedIn.

  LinkedIn: https://www.linkedin.com/feed/update/{{ $('Publish to LinkedIn').item.json.urn }}

  Thanks,
  AI Agent
</code></pre>
</li>
<li><p><strong>Form Title</strong>: "AI Agent (Job Done)"</p>
</li>
</ul>
</li>
<li><p><strong>Connect to</strong>: Output from <code>Merge1</code></p>
</li>
</ul>
<hr />
<h2 id="heading-deep-dive-how-it-works">Deep Dive: How It Works</h2>
<h3 id="heading-1-the-trigger-command-center">1. The Trigger (Command Center)</h3>
<p>Everything starts with an <strong>n8n Form Trigger</strong>. Instead of manually logging into LinkedIn, I simply open a private URL hosted by my n8n instance. This form asks for:</p>
<ul>
<li><p><strong>Post Title/Topic</strong>: e.g., "The Future of Open Source AI".</p>
</li>
<li><p><strong>Keywords</strong>: Optional context to guide the AI.</p>
</li>
</ul>
<h3 id="heading-2-the-brain-google-gemini-ai-agent">2. The Brain: Google Gemini AI Agent</h3>
<p>This is where the magic happens. The input is passed to an <strong>AI Agent</strong> node connected to the <strong>Google Gemini Chat Model</strong>.</p>
<p>I've configured the prompt to act as a "Content Creation AI" for my brand. It follows specific rules:</p>
<ul>
<li><p><strong>Tone</strong>: Professional, insightful, and value-driven.</p>
</li>
<li><p><strong>Structure</strong>: 3-4 sentences, optimized for engagement.</p>
</li>
<li><p><strong>Hashtags</strong>: A mix of general tech tags (#Innovation, #AI) and niche ones.</p>
</li>
</ul>
<p>The agent doesn't just output text; it uses a <strong>Structured Output Parser</strong> to return a clean JSON object containing the post text, hashtags, and even suggestions for other platforms like Twitter and Facebook.</p>
<h3 id="heading-3-human-in-the-loop-the-wait-node">3. Human-in-the-Loop (The "Wait" Node)</h3>
<p>Automation is great, but I don't want to blindly post AI-generated content. I need final approval.</p>
<p>The flow uses a second <strong>n8n Form</strong> node (titled "Upload Image") in the middle of the execution.</p>
<ul>
<li><p>The workflow <strong>pauses</strong> here.</p>
</li>
<li><p>It displays the AI-generated text for me to read.</p>
</li>
<li><p>It asks me to upload the final image or creative asset to go with the post.</p>
</li>
<li><p>Once I hit "Proceed," the workflow resumes.</p>
</li>
</ul>
<p>This step is crucial. It combines the speed of AI with the quality control of a human.</p>
<h3 id="heading-4-the-publisher">4. The Publisher</h3>
<p>Finally, the workflow takes the text approved in the previous step and the image I just uploaded. It formats the binary data and sends it to the <strong>LinkedIn</strong> node, which uses the LinkedIn API to create a "Share" on my personal profile.</p>
<h2 id="heading-security-keeping-credentials-safe">Security: Keeping Credentials Safe</h2>
<p>One of the most important aspects of sharing or backing up automation flows is <strong>security</strong>.</p>
<ul>
<li><p><strong>Credential Separation</strong>: n8n separates the <em>workflow logic</em> (the nodes and connections) from the <em>credentials</em> (API keys and passwords).</p>
</li>
<li><p><strong>No Hardcoded Secrets</strong>: In the workflow JSON file, you will never see my actual API keys. You will only see references like <code>linkedInOAuth2Api</code> or <code>googlePalmApi</code>.</p>
</li>
<li><p><strong>Environment Variables</strong>: For sensitive data that might be needed inside expressions, I use n8n environment variables rather than typing them directly into the node parameters.</p>
</li>
</ul>
<p>When you import my flow, n8n will ask you to set up your <em>own</em> credentials for LinkedIn and Google Gemini. Your secrets stay on your server, and mine stay on mine.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This workflow has transformed how I manage my professional presence. By automating the "drafting" phase and streamlining the "publishing" phase, I save hours of time while maintaining high content quality.</p>
<p>If you want to try this yourself, you'll need:</p>
<ol>
<li><p>A self-hosted or cloud n8n instance.</p>
</li>
<li><p>A Google Cloud Console project with the Gemini API enabled.</p>
</li>
<li><p>A LinkedIn App for API access.</p>
</li>
</ol>
<p>Happy Automating!</p>
]]></content:encoded></item><item><title><![CDATA[Managing AWS IAM Users Made Easy: Tips on Creation, Administration, and Removal]]></title><description><![CDATA[Introduction: Amazon Web Services (AWS) is a vast cloud ecosystem that offers immense flexibility and power. However, with great power comes great responsibility. Managing who can access your AWS resources and what they can do with them is crucial. T...]]></description><link>https://blog.overflowbyte.cloud/managing-aws-iam-users-made-easy-tips-on-creation-administration-and-removal</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/managing-aws-iam-users-made-easy-tips-on-creation-administration-and-removal</guid><category><![CDATA[AWS]]></category><category><![CDATA[IAM]]></category><category><![CDATA[identity-management]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Thu, 18 Dec 2025 07:50:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766044078951/53dcd5bf-7b66-4b2f-b0c3-fe25e049e5f1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I<strong>ntroduction:</strong> <a target="_blank" href="https://en.wikipedia.org/wiki/Amazon_Web_Services">Amazon Web Services (AWS)</a> is a vast cloud ecosystem that offers immense flexibility and power. However, with great power comes great responsibility. Managing who can access your AWS resources and what they can do with them is crucial. This is where <strong>AWS Identity and Access Management (IAM)</strong> comes into play.</p>
<p>In this comprehensive and conversational blog post, we will deeply dive into AWS IAM users. We’ll not only cover creating and deleting IAM users but also explore the finer points of user management, security, and best practices. By the end, you’ll be equipped with the knowledge to navigate the IAM landscape confidently.</p>
<p><strong>Chapter 1: Understanding AWS IAM Users</strong></p>
<p>AWS IAM users are the cornerstone of access control within your AWS environment. Before we dive into the practical aspects of creating and deleting IAM users, let’s ensure we have a solid understanding of what they are and why they matter.</p>
<p><strong><em>Imagine IAM Users as Real People:</em></strong> Think of IAM users as virtual individuals or entities within your AWS account. Each user is assigned a set of permissions that dictate what they can and cannot do.</p>
<p><strong>What’s an IAM User?</strong> An IAM user is an entity that represents a person or an application within your AWS account. Each IAM user has a unique set of security credentials.</p>
<p><strong>Why Are IAM Users Important?</strong> Imagine your AWS account as a bustling office building. Without IAM users, everyone has the master key to every room. IAM users provide individual keys, ensuring that only authorized personnel can access specific areas.</p>
<p><strong>Chapter 2: Creating IAM Users (Step-by-Step)</strong></p>
<p>Creating <a target="_blank" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/intro-structure.html">IAM</a> users is a fundamental task in user management. Here’s a step-by-step guide to help you do it right:</p>
<p><strong>Step 1: Access the IAM Dashboard</strong></p>
<ul>
<li>Log in to your AWS Management Console.</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*sjArFd4kOC_zaakAQwODvA.png" alt /></p>
<ul>
<li><p>Open the IAM dashboard.</p>
</li>
<li><p>Select <strong>Users</strong>: this section is where you create, view, and manage users.</p>
</li>
</ul>
<p><strong>Step 2: Adding a New User</strong></p>
<ul>
<li><p>In the navigation pane, click on “Users.”</p>
</li>
<li><p>Hit the “Create user” button.</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*LcJieRjHM1_ZofPh" alt /></p>
<p><strong>Step 3: User Details</strong></p>
<ul>
<li><p>Enter a username.</p>
</li>
<li><p>Specify the type of access: programmatic or AWS Management Console.</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*rW48yhppWOrTM22N" alt /></p>
<p>Create a user</p>
<p>Here you can grant console access to the user you are creating. You can either manage the user through the <a target="_blank" href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html">Identity Center</a> or simply create a standalone IAM user.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*Oo8fUazF3am8DF5m" alt /></p>
<p>Credentials details</p>
<blockquote>
<p><em>You can auto-generate a password, which is convenient: the user is prompted to set their own password at first sign-in.</em></p>
</blockquote>
<ul>
<li>Assign permissions by adding the user to a group or attaching policies directly.</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*0fhkzxMzyRjNCIFn" alt /></p>
<p>Assigning Permissions and user roles</p>
<p>You can assign <a target="_blank" href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_access-management.html">Permissions</a>, create groups, copy the permissions of an existing user, or attach policies directly.</p>
<p>We will cover attaching policies in a separate article, with an in-depth guide to adding roles, assigning permissions, and creating groups.</p>
<p>For now, we are moving forward without assigning any kind of roles to our IAM user.</p>
<ul>
<li>Review and create the user.</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*qmlkU26RO9UgBcqH" alt /></p>
<p>Review and create</p>
<p><em>Pro Tip:</em> When creating a user for programmatic access, don’t forget to generate an access key. This key is crucial for programmatic interactions with AWS services.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*ApgAAWUImTCykC0y" alt /></p>
<p>Login credentials</p>
<p>You can use the console sign-in URL to log in via the credentials provided.</p>
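<p>If you prefer the terminal, the same steps can be sketched with the AWS CLI. This is an illustrative sequence rather than the only way to do it; the username <code>alice</code>, the temporary password, and the group name are placeholders, and you need configured CLI credentials with IAM permissions:</p>
<pre><code class="lang-bash"># Create the user
aws iam create-user --user-name alice

# Give console access with a temporary password the user must change
aws iam create-login-profile --user-name alice \
  --password 'TempP@ssw0rd123!' --password-reset-required

# Generate an access key for programmatic access
aws iam create-access-key --user-name alice

# Prefer group membership over directly attached policies
aws iam add-user-to-group --user-name alice --group-name developers
</code></pre>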
<p><strong>Chapter 3: Deleting IAM Users (Step-by-Step)</strong></p>
<p>While creating IAM users is vital, so is cleaning up when they’re no longer needed. Here’s how you can delete IAM users responsibly:</p>
<p><strong>Step 1: Navigate to the User List</strong></p>
<ul>
<li><p>Access the IAM dashboard.</p>
</li>
<li><p>Click on “Users” in the left-hand navigation pane.</p>
</li>
</ul>
<p><strong>Step 2: Select the User</strong></p>
<ul>
<li>Click on the user you want to delete.</li>
</ul>
<p><strong>Step 3: Delete the User</strong></p>
<ul>
<li><p>On the user details page, click the “Delete user” button.</p>
</li>
<li><p>Double-check the user’s permissions and policies to avoid accidental deletions.</p>
</li>
</ul>
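<p>One gotcha worth knowing: IAM refuses to delete a user that still owns resources such as access keys, a login profile, or group memberships. Here is an illustrative CLI sequence for the cleanup order (<code>alice</code>, the key ID, and the group name are placeholders):</p>
<pre><code class="lang-bash"># List and delete the user's access keys first
aws iam list-access-keys --user-name alice
aws iam delete-access-key --user-name alice --access-key-id AKIAEXAMPLEKEYID

# Remove console access and group memberships
aws iam delete-login-profile --user-name alice
aws iam remove-user-from-group --user-name alice --group-name developers

# Now the user can be deleted
aws iam delete-user --user-name alice
</code></pre>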
<p><strong>Chapter 4: IAM User Management Best Practices</strong></p>
<p>Managing IAM users isn’t just about creating and deleting them; it’s an ongoing process. Here are some best practices to consider:</p>
<ol>
<li><p>Least Privilege: Follow the principle of least privilege, ensuring that users have only the permissions necessary for their tasks.</p>
</li>
<li><p>Regular Key Rotation: Regularly rotate access keys and passwords to enhance security.</p>
</li>
<li><p>Use Groups: Group users with similar access needs and assign permissions to groups rather than individuals.</p>
</li>
<li><p>Monitoring and Auditing: Implement robust tracking and auditing of user activities to detect and respond to suspicious behaviour.</p>
</li>
<li><p>De-provisioning: Develop and enforce de-provisioning policies to remove access promptly when users no longer require it.</p>
</li>
</ol>
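<p>To make “least privilege” concrete, here is a minimal example policy granting read-only access to a single S3 bucket (the bucket name is a placeholder):</p>
<pre><code class="lang-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
</code></pre>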
<p><strong>Chapter 5: Advanced IAM Concepts and Security</strong></p>
<p>Beyond the basics of creating and deleting IAM users, let’s explore some advanced IAM concepts:</p>
<ul>
<li><p>IAM Roles: Roles provide temporary permissions for users or services, enabling cross-account access and minimising security risks.</p>
</li>
<li><p>Multi-Factor Authentication (MFA): Implement MFA to add an extra layer of security to user accounts.</p>
</li>
<li><p>Identity Federation: Federate identities from external sources (e.g., Active Directory) to AWS IAM for streamlined access management.</p>
</li>
</ul>
<p><strong>Chapter 6: Recap and Final Thoughts</strong></p>
<p>IAM user management isn’t a one-time task; it’s an ongoing commitment to security and efficiency. To recap:</p>
<ul>
<li><p>IAM users are virtual entities representing real individuals or applications.</p>
</li>
<li><p>Creating IAM users follows a straightforward process within the IAM dashboard.</p>
</li>
<li><p>Deleting IAM users should be done carefully to avoid unintended consequences.</p>
</li>
<li><p>Best practices, such as least privilege and regular key rotation, enhance IAM security.</p>
</li>
<li><p>Advanced IAM concepts like roles, <strong>MFA</strong>, and identity federation provide additional layers of control.</p>
</li>
</ul>
<p>Remember, IAM is your key master to the AWS kingdom. Properly managing IAM users ensures that only the right people have access to the right resources. Stay secure, stay in control, and explore the AWS world with confidence.</p>
<p><strong>Conclusion:</strong> IAM users are the linchpin of secure and efficient AWS resource management. By mastering the art of creating, managing, and deleting IAM users, you not only bolster the security of your AWS environment but also ensure that your resources are used effectively.</p>
<p>Feel free to reach out with any questions or to share your own IAM user management tips in the comments below. Remember, in the realm of AWS, IAM is your trusted guardian.</p>
]]></content:encoded></item><item><title><![CDATA[Comprehensive DNS Propagation Checker and Deep Trace Tool: 2025 Guide]]></title><description><![CDATA[Headline: Stop guessing with cached local lookups. Discover how CheckYourDNS performs deep, recursive tracing from Root to Authoritative servers for instant, accurate diagnostics.

Hey fellow tech professionals!
As developers, server admins, DevOps, ...]]></description><link>https://blog.overflowbyte.cloud/comprehensive-dns-propagation-checker-and-deep-trace-tool-2025-guide</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/comprehensive-dns-propagation-checker-and-deep-trace-tool-2025-guide</guid><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Tue, 09 Dec 2025 06:40:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765262347114/cc5ec294-a04a-4788-b77f-a89589e926ac.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Headline:</strong> Stop guessing with cached local lookups. Discover how CheckYourDNS performs <strong>deep, recursive tracing</strong> from Root to Authoritative servers for instant, accurate diagnostics.</p>
<hr />
<p>Hey fellow tech professionals!</p>
<p>As developers, server admins, DevOps engineers, and SREs, we've all been there: a site is down, or a migration has just finished, and you're staring at a terminal running <code>dig</code> or refreshing a browser, wondering if you're seeing cached data or the real thing.</p>
<p>If you're spending precious time context-switching between propagation checkers, local CLI tools, and WHOIS lookups, this is for you.</p>
<h2 id="heading-the-cached-reality-trap">The "Cached" Reality Trap</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765261849201/918aecb2-eb13-4db3-a4cd-e53af37c8d0e.png" alt class="image--center mx-auto" /></p>
<p>Let's be honest: how many times have you flushed your local DNS cache just to be sure? The traditional debugging workflow looks like this:</p>
<ol>
<li><p>Check <code>ping</code> locally (Is it my ISP?)</p>
</li>
<li><p>Run <code>dig +trace</code> (Is it the authoritative server?)</p>
</li>
<li><p>Use an external propagation checker (Is it global?)</p>
</li>
<li><p>Check SSL labs (Is the certificate valid?)</p>
</li>
</ol>
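<p>On the command line, those four steps usually look something like this (<code>example.com</code> is a placeholder; exact flags vary by system):</p>
<pre><code class="lang-bash">ping -c 3 example.com          # 1. Basic reachability from my machine
dig +trace example.com         # 2. Walk the chain: root, TLD, authoritative
dig @8.8.8.8 example.com       # 3. What a public resolver sees
curl -vI https://example.com   # 4. Quick TLS/certificate sanity check
</code></pre>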
<p>If you keep doing this manually, you're leaking productivity. That's exactly why I built <a target="_blank" href="https://checkyourdns.overflowbyte.cloud"><strong>CheckYourDNS</strong></a>.</p>
<h2 id="heading-deep-tracing-the-game-changer">Deep Tracing: The Game Changer</h2>
<p>Most online tools just query their own local resolver (like Google's 8.8.8.8) and show you the result. That's fine for simple checks, but it hides the truth. What if the Root server is pointing correctly, but the TLD server is timing out?</p>
<p><strong>CheckYourDNS</strong> doesn't just "lookup" a record. It performs a <strong>Deep Recursive Trace</strong> in real-time:</p>
<pre><code class="lang-bash">➜ Searching <span class="hljs-keyword">for</span> google.com...
✓ Root Server (A.ROOT-SERVERS.NET) [14ms]
✓ TLD Server (a.gtld-servers.net) [22ms]
✓ Authoritative (ns1.google.com) [18ms]
⇒ Result: 142.250.190.46 (TTL: 300)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765261887836/ea17f648-2709-4a1c-8918-9829b8be5029.png" alt class="image--center mx-auto" /></p>
<p>This gives you instant visibility into <em>where</em> the chain is breaking vs. just "Not Found."</p>
<h2 id="heading-the-tech-stack-at-your-fingertips">The Tech Stack at Your Fingertips</h2>
<p>When you run a query on CheckYourDNS, you get comprehensive intelligence instantly:</p>
<h3 id="heading-1-dns-architecture">1. DNS Architecture</h3>
<ul>
<li><p><strong>Recursive Trace:</strong> See the exact path from Root -&gt; TLD -&gt; NameServer.</p>
</li>
<li><p><strong>A/AAAA Records:</strong> Verify IPv4 and IPv6 simultaneous connectivity.</p>
</li>
<li><p><strong>MX Transparency:</strong> See Priority levels instantly (crucial for email migrations).</p>
</li>
<li><p><strong>TXT Verification:</strong> Validate SPF, DKIM, and verification tokens for Slack/Google/Facebook in one glance.</p>
</li>
</ul>
<h3 id="heading-2-global-propagation">2. Global Propagation</h3>
<p>With one click, you can verify your domain against multiple global providers (Google, Cloudflare, Quad9, OpenDNS) to ensure your changes have rolled out worldwide.</p>
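<p>The manual equivalent is querying each provider yourself; a quick shell loop (placeholder domain) makes the comparison explicit:</p>
<pre><code class="lang-bash">for ns in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
  echo "== $ns =="
  dig @$ns example.com A +short
done
</code></pre>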
<h2 id="heading-real-world-scenarios">Real-World Scenarios</h2>
<p>Here is how CheckYourDNS shines in everyday DevOps tasks:</p>
<h3 id="heading-email-system-setup">📧 Email System Setup</h3>
<p>Setting up Google Workspace? You need to verify 5 different MX records with distinct priorities. One typo can let spam through or block legitimate mail. Our tool highlights the <strong>Priority</strong> field clearly so you can spot <code>priority: 10</code> vs <code>priority: 1</code> instantly.</p>
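<p>You can cross-check what the tool shows with a one-liner; each answer line begins with the priority number (the domain is a placeholder):</p>
<pre><code class="lang-bash">dig MX example.com +short
# e.g. "1 aspmx.l.google.com." where the leading number is the priority
</code></pre>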
<h3 id="heading-ssl-amp-migration-verification">🔒 SSL &amp; Migration Verification</h3>
<p>Switched from GoDaddy to AWS Route53? The "Deep Trace" feature confirms that the TLD servers (<code>.com</code>) are actually pointing to your new AWS nameservers before you even switch the traffic over.</p>
<h2 id="heading-roi-for-professionals">ROI for Professionals</h2>
<p>Consider this: verifying a migration manually takes about 5-10 minutes of "digging" and cross-referencing. With CheckYourDNS, it takes <strong>3 seconds</strong>.</p>
<p>In today's fast-paced infrastructure environment, efficiency isn't just nice to have; it's essential. We designed this tool to give you back what's most valuable: your time.</p>
<h3 id="heading-analyze-your-domain-nowhttpscheckyourdnsoverflowbytecloud"><a target="_blank" href="https://checkyourdns.overflowbyte.cloud">Analyze Your Domain Now</a></h3>
<hr />
<p><em>Tags: #devops #webdev #dns #troubleshooting #sysadmin #techtools</em></p>
]]></content:encoded></item><item><title><![CDATA[Resolving IP Conflicts in Talos Kubernetes: A Step-by-Step Guide]]></title><description><![CDATA[I was working on a lab and experimenting with the Talos Linux setup when I encountered an error related to Flannel as the Container Network Interface (CNI). Flannel is a straightforward overlay network provider for Kubernetes that establishes a flat,...]]></description><link>https://blog.overflowbyte.cloud/resolving-ip-conflicts-in-talos-kubernetes-a-step-by-step-guide</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/resolving-ip-conflicts-in-talos-kubernetes-a-step-by-step-guide</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[containers]]></category><category><![CDATA[talos-linux]]></category><category><![CDATA[virtual machine]]></category><category><![CDATA[linux for beginners]]></category><category><![CDATA[Docker]]></category><category><![CDATA[containerization]]></category><category><![CDATA[Core DNS in kubernetes]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Thu, 27 Nov 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764489249066/a3e0a0e6-ea4e-4f42-a6ec-98886c0a9bfa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was working on a lab and experimenting with the Talos Linux setup when I encountered an error related to Flannel as the Container Network Interface (CNI). Flannel is a straightforward overlay network provider for Kubernetes that establishes a flat, Layer 3 network for pods, enabling communication across different nodes. It assigns each node a unique subnet and uses encapsulation methods like UDP or VXLAN to route traffic between nodes, offering a basic yet easy-to-configure networking solution.</p>
<p>I recently spun up a local Kubernetes lab on my laptop to learn Talos Linux. The setup was straightforward:</p>
<ul>
<li><p><strong>Platform</strong>: VirtualBox on Windows</p>
</li>
<li><p><strong>Cluster</strong>: 2 VMs</p>
<ul>
<li><p>Control Plane: <code>talos-jk6-lje</code></p>
</li>
<li><p>Worker: <code>talos-m9z-pjn</code></p>
</li>
</ul>
</li>
<li><p><strong>OS</strong>: Talos Linux v1.6.2</p>
</li>
<li><p><strong>Kubernetes</strong>: v1.29.0</p>
</li>
<li><p><strong>CNI</strong>: Flannel (ghcr.io/siderolabs/flannel:v0.23.0)</p>
</li>
</ul>
<p>Each VM has two network adapters:</p>
<ol>
<li><p><strong>NAT</strong> (for internet access)</p>
</li>
<li><p><strong>Host-Only</strong> (<code>192.168.56.0/24</code>, for cluster communication)</p>
</li>
</ol>
<p>After bootstrapping the cluster, I ran <code>kubectl get pods -A</code> expecting everything to be green. Instead:</p>
<h2 id="heading-the-problem">The Problem</h2>
<p><strong>What I saw:</strong></p>
<ul>
<li><p>CoreDNS pods stuck in <code>ContainerCreating</code></p>
<ul>
<li>Error: <code>failed to find plugin "flannel" in path [/opt/cni/bin]</code></li>
</ul>
</li>
<li><p><code>kube-flannel</code> DaemonSet on the worker: <code>CrashLoopBackOff</code></p>
<ul>
<li>Error: <code>loadFlannelSubnetEnv failed: open /run/flannel/subnet.env: no such file or directory</code></li>
</ul>
</li>
<li><p>Running <code>kubectl get nodes -o wide</code> showed both nodes with the <strong>same InternalIP</strong>: <code>10.0.3.15</code></p>
</li>
</ul>
<p>Something was clearly wrong with the network layer.</p>
<hr />
<h2 id="heading-understanding-the-issue-with-a-simple-analogy">Understanding the Issue (With a Simple Analogy)</h2>
<p>Before diving into the technical fix, let me explain what was happening using a simple analogy that even my cousin could understand.</p>
<p>Imagine a town where two different houses accidentally put up the <strong>exact same street address</strong>.</p>
<ul>
<li><p><strong>The Nodes are Houses</strong>: You have a "Control Plane House" and a "Worker House".</p>
</li>
<li><p><strong>The IP Address is the Street Address</strong>: Both houses claim to be at "10.0.3.15".</p>
</li>
<li><p><strong>The Network is the Mail Carrier</strong>: When the mail carrier (Kubernetes/Flannel) tries to deliver a package, they see the address "10.0.3.15" and get confused. "I just passed this address! Which house is the real one?"</p>
</li>
<li><p><strong>Flannel is the Road System</strong>: Flannel tries to build a map of the town. Because of the duplicate addresses, it can't figure out which road leads where. It gives up, and the roads remain unfinished.</p>
</li>
<li><p><strong>CoreDNS is the Phonebook</strong>: The town's phonebook lives in one of the houses. Since the roads (Flannel) are broken, nobody can reach the house to get the phonebook. As a result, nobody can look up any numbers.</p>
</li>
</ul>
<p>The fix is simple: Tell the town planner (Talos) to ignore the duplicate address and use the <em>other</em> unique address (the Host-Only network) for each house.</p>
<hr />
<h2 id="heading-glossary-of-terms">Glossary of Terms</h2>
<p>Before we go further, here are a few technical terms we'll use:</p>
<ul>
<li><p><strong>CNI (Container Network Interface)</strong>: The plugin that lets Kubernetes pods talk to each other. We are using <strong>Flannel</strong>.</p>
</li>
<li><p><strong>VXLAN</strong>: A network technology that creates a "virtual tunnel" between nodes so pods can communicate across the cluster.</p>
</li>
<li><p><strong>CIDR</strong>: A way to describe a range of IP addresses (e.g., <code>192.168.56.0/24</code>).</p>
</li>
<li><p><strong>DaemonSet</strong>: A type of Kubernetes workload that ensures a copy of a pod runs on <em>every</em> node (like the Flannel network agent).</p>
</li>
</ul>
<hr />
<h2 id="heading-my-labs-network-architecture">My Lab's Network Architecture</h2>
<p>Here's how I configured the network for each VM:</p>
<ol>
<li><p><strong>NAT Network (</strong><code>10.0.3.0/24</code>): Used for internet access. VirtualBox often assigns the same IP (<code>10.0.3.15</code>) to VMs in this mode from the guest's perspective.</p>
</li>
<li><p><strong>Host-Only Network (</strong><code>192.168.56.0/24</code>): Used for communication between the Host and VMs. These IPs are unique (<code>.101</code> and <code>.102</code>).</p>
</li>
</ol>
<h3 id="heading-network-topology">Network Topology</h3>
<pre><code class="lang-mermaid">flowchart LR
  Host[(Laptop Host)]
  subgraph VBoxNet[VirtualBox Networks]
    vboxnet0{{Host-Only 192.168.56.0/24}}
    nat{{NAT 10.0.3.0/24}}
  end

  CP[Control Plane VM\n192.168.56.101\nNAT seen as 10.0.3.15]
  WK[Worker VM\n192.168.56.102\nNAT seen as 10.0.3.15]

  Host --- vboxnet0
  vboxnet0 --- CP
  vboxnet0 --- WK
  nat --- CP
  nat --- WK
</code></pre>
<h3 id="heading-network-plan-details">Network Plan Details</h3>
<ul>
<li><p><strong>Host-Only network</strong> (<code>vboxnet0</code>): <code>192.168.56.0/24</code></p>
<ul>
<li><p>Control plane: <code>192.168.56.101</code></p>
</li>
<li><p>Worker: <code>192.168.56.102</code></p>
</li>
</ul>
</li>
<li><p><strong>NAT network</strong>: both VMs surfaced the same internal address <code>10.0.3.15</code> to kubelet</p>
</li>
<li><p><strong>Pod CIDRs</strong> (flannel):</p>
<ul>
<li><p>Control plane: <code>10.244.1.0/24</code></p>
</li>
<li><p>Worker: <code>10.244.0.0/24</code></p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-ip-summary">IP Summary</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Component</th><th>Address/Range</th><th>Notes</th></tr>
</thead>
<tbody>
<tr>
<td>Control Plane NodeIP</td><td>192.168.56.101</td><td>Host-Only adapter (preferred by kubelet)</td></tr>
<tr>
<td>Worker NodeIP</td><td>192.168.56.102</td><td>Host-Only adapter (preferred by kubelet)</td></tr>
<tr>
<td>NAT (both VMs)</td><td>10.0.3.15</td><td>Undesired for cluster traffic</td></tr>
<tr>
<td>Service CIDR</td><td>10.96.0.0/12 (typical)</td><td>kube-dns at 10.96.0.10</td></tr>
<tr>
<td>PodCIDR (CP)</td><td>10.244.1.0/24</td><td>flannel allocation</td></tr>
<tr>
<td>PodCIDR (Worker)</td><td>10.244.0.0/24</td><td>flannel allocation</td></tr>
</tbody>
</table>
</div><h3 id="heading-why-nat-host-only-is-tricky">Why NAT + Host-Only is Tricky</h3>
<p>In many VirtualBox lab setups, each guest's NAT adapter presents the same address from inside the guest (here, <code>10.0.3.15</code>), because every VM gets its own isolated NAT network with identical addressing. Kubernetes picks a node IP from the available interfaces; if it picks the NAT address on both nodes, flannel sees duplicate node IPs and fails to initialize the overlay.</p>
<p>You have two options:</p>
<ol>
<li><p><strong>Remove the NAT adapter</strong> for cluster traffic entirely</p>
</li>
<li><p><strong>Keep NAT only for outbound internet</strong> and explicitly tell kubelet to use the Host-Only network (my approach)</p>
</li>
</ol>
<p>Recommended for labs:</p>
<ul>
<li><p>Keep Host-Only for all cluster traffic (stable, unique IPs)</p>
</li>
<li><p>Keep NAT only for VM outbound internet, but prevent Kubernetes from using it by pinning node IP selection</p>
</li>
</ul>
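<p>To make the "pinning" idea concrete, here's a tiny sketch (illustrative only; the addresses are the ones from this lab) that filters a node's candidate addresses down to the Host-Only subnet, the same selection the <code>validSubnets</code> fix below performs:</p>
<pre><code class="lang-bash"># Candidate addresses one node might advertise (sample data, not live output).
# The grep mimics validSubnets: keep only the Host-Only /24. A plain prefix
# match works here because the subnet boundary is octet-aligned.
printf '%s\n' 10.0.3.15 192.168.56.101 | grep '^192\.168\.56\.'
# prints: 192.168.56.101
</code></pre>
<p>Run per node, this leaves exactly one unique address per node, which is precisely the property flannel needs.</p>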
<hr />
<h2 id="heading-root-cause-analysis">Root Cause Analysis</h2>
<p>After digging through events and logs, I realized what was happening:</p>
<p>Kubernetes (specifically the Kubelet) auto-detects the node's IP address from available interfaces. In my case, it was picking the NAT interface on both VMs, which presented the same address (<code>10.0.3.15</code>) from the guest OS perspective.</p>
<p>Flannel (the CNI plugin) uses this node IP to create the overlay network (VXLAN). When it saw duplicate IPs, it couldn't establish proper routes and failed to initialize on the worker node.</p>
<h3 id="heading-how-i-diagnosed-it">How I Diagnosed It</h3>
<p>Here are the exact commands (on Windows) I ran to confirm the issue:</p>
<pre><code class="lang-powershell"><span class="hljs-comment"># Point kubectl to the cluster</span>
<span class="hljs-variable">$kc</span> = <span class="hljs-string">"c:\Users\Pushpendra\Desktop\projects\talos_linux_learning\kubeconfig"</span>

<span class="hljs-comment"># Check API and nodes</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> cluster<span class="hljs-literal">-info</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> get nodes <span class="hljs-literal">-o</span> wide

<span class="hljs-comment"># See what's failing</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> get pods <span class="hljs-literal">-A</span> <span class="hljs-literal">-o</span> wide
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> get ds <span class="hljs-literal">-o</span> wide

<span class="hljs-comment"># Inspect the problem pods</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> describe pod kube<span class="hljs-literal">-flannel</span><span class="hljs-literal">-k5sdw</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> describe pods <span class="hljs-literal">-l</span> k8s<span class="hljs-literal">-app</span>=kube<span class="hljs-literal">-dns</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> get events <span class="hljs-literal">-A</span> -<span class="hljs-literal">-sort</span><span class="hljs-literal">-by</span>=.lastTimestamp

<span class="hljs-comment"># Confirm duplicate node IPs (this was the smoking gun!)</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> get nodes <span class="hljs-literal">-o</span> jsonpath=<span class="hljs-string">"{range .items[*]}{.metadata.name}: {.status.addresses[*].type}:{.status.addresses[*].address}{'\n'}{end}"</span>
</code></pre>
<p><strong>Expected telltales:</strong></p>
<ul>
<li><p>Duplicate <code>InternalIP</code> on two nodes</p>
</li>
<li><p>kubelet events referencing flannel plugin install and missing <code>/run/flannel/subnet.env</code></p>
</li>
<li><p>CoreDNS stuck with "failed to create pod sandbox" errors</p>
</li>
</ul>
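<p>If you want the duplicate flagged automatically, feed the node/IP pairs through <code>sort | uniq -d</code>. The pairs below are hard-coded sample data mirroring the broken state (the node names are placeholders); in practice you'd pipe in the jsonpath output from above:</p>
<pre><code class="lang-bash"># Node name / InternalIP pairs (sample data mirroring the broken state)
printf '%s\n' 'talos-cp 10.0.3.15' 'talos-worker 10.0.3.15' |
  awk '{print $2}' | sort | uniq -d
# prints: 10.0.3.15
</code></pre>
<p>Any output at all means two nodes share an InternalIP; a healthy cluster prints nothing.</p>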
<hr />
<pre><code class="lang-mermaid">sequenceDiagram
  autonumber
  participant CP as Control Plane (10.0.3.15 / 192.168.56.101)
  participant WK as Worker (10.0.3.15 / 192.168.56.102)
  Note over CP,WK: BEFORE – both nodes advertise 10.0.3.15
  WK-&gt;&gt;Flannel: start
  Flannel--&gt;&gt;WK: cannot stabilize (duplicate node IP)
  WK-&gt;&gt;Kubelet: Pod sandbox create
  Kubelet--&gt;&gt;WK: fail (CNI flannel not ready)
  Note over CP,WK: AFTER – nodes advertise 192.168.56.x via validSubnets
  WK-&gt;&gt;Flannel: start
  Flannel--&gt;&gt;WK: VXLAN ready, /run/flannel/subnet.env present
  WK-&gt;&gt;Kubelet: Pod sandbox create
  Kubelet--&gt;&gt;WK: success - CoreDNS runs
</code></pre>
<hr />
<h2 id="heading-the-fix">The Fix</h2>
<p>Once I understood the problem, the solution was clear: tell Talos to explicitly use the <strong>Host-Only subnet</strong> (<code>192.168.56.0/24</code>) and ignore the NAT IP.</p>
<p>Here's what I did:</p>
<h3 id="heading-step-1-update-machine-configs">Step 1: Update Machine Configs</h3>
<p>I edited both <code>_out/controlplane.yaml</code> and <code>_out/worker.yaml</code> to add this <code>kubelet</code> configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
  <span class="hljs-attr">kubelet:</span>
    <span class="hljs-attr">nodeIP:</span>
      <span class="hljs-attr">validSubnets:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.56</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>  <span class="hljs-comment"># Allow IPs from the Host-Only network</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">'!10.0.3.15/32'</span>  <span class="hljs-comment"># Explicitly deny the duplicate NAT IP</span>
</code></pre>
<h3 id="heading-step-2-apply-the-configs">Step 2: Apply the Configs</h3>
<p>I applied the updated configs using <code>talosctl</code>, which restarted the Kubelet with the new IP selection logic:</p>
<pre><code class="lang-powershell"><span class="hljs-comment"># Set talosconfig path</span>
<span class="hljs-variable">$env:TALOSCONFIG</span> = <span class="hljs-string">"_out\talosconfig"</span>

<span class="hljs-comment"># Apply to Control Plane</span>
talosctl <span class="hljs-literal">-n</span> <span class="hljs-number">192.168</span>.<span class="hljs-number">56.101</span> apply<span class="hljs-literal">-config</span> -<span class="hljs-literal">-mode</span>=auto <span class="hljs-operator">-f</span> _out\controlplane.yaml

<span class="hljs-comment"># Apply to Worker</span>
talosctl <span class="hljs-literal">-n</span> <span class="hljs-number">192.168</span>.<span class="hljs-number">56.102</span> apply<span class="hljs-literal">-config</span> -<span class="hljs-literal">-mode</span>=auto <span class="hljs-operator">-f</span> _out\worker.yaml

<span class="hljs-comment"># Optional: reboot nodes for faster pickup</span>
talosctl <span class="hljs-literal">-n</span> <span class="hljs-number">192.168</span>.<span class="hljs-number">56.101</span> reboot
talosctl <span class="hljs-literal">-n</span> <span class="hljs-number">192.168</span>.<span class="hljs-number">56.102</span> reboot
</code></pre>
<hr />
<h2 id="heading-the-result">The Result</h2>
<p>After applying the configs and waiting a few minutes:</p>
<ol>
<li><p><strong>Unique IPs</strong>: Both nodes now advertised their Host-Only addresses (<code>192.168.56.101</code> and <code>192.168.56.102</code>)</p>
</li>
<li><p><strong>Flannel came alive</strong>: The DaemonSet rolled out successfully on both nodes</p>
</li>
<li><p><strong>CoreDNS started</strong>: Pods transitioned from <code>ContainerCreating</code> to <code>Running</code></p>
</li>
</ol>
<p>Success! 🎉</p>
<h3 id="heading-validating-the-fix">Validating the Fix</h3>
<p>I ran these commands to confirm everything was healthy:</p>
<pre><code class="lang-powershell"><span class="hljs-variable">$kc</span> = <span class="hljs-string">"c:\Users\Pushpendra\Desktop\projects\talos_linux_learning\kubeconfig"</span>

<span class="hljs-comment"># Nodes should now have unique InternalIP in 192.168.56.x</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> get nodes <span class="hljs-literal">-o</span> wide

<span class="hljs-comment"># Flannel should be Ready everywhere</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> rollout status ds/kube<span class="hljs-literal">-flannel</span> -<span class="hljs-literal">-timeout</span>=<span class="hljs-number">180</span>s

<span class="hljs-comment"># CoreDNS should converge to Ready</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> rollout status deploy/coredns -<span class="hljs-literal">-timeout</span>=<span class="hljs-number">180</span>s
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> get pods <span class="hljs-literal">-o</span> wide
</code></pre>
<hr />
<pre><code class="lang-mermaid">flowchart LR
  subgraph CP[Node: Control Plane&lt;br/&gt;NodeIP 192.168.56.101&lt;br/&gt;PodCIDR 10.244.1.0/24]
    cpfl[flannel.vxlan]
    apiserver[(kube-apiserver)]
  end
  subgraph WK[Node: Worker&lt;br/&gt;NodeIP 192.168.56.102&lt;br/&gt;PodCIDR 10.244.0.0/24]
    direction TB
    wkfl[flannel.vxlan]
    wkpods[" "]
    coredns1[(coredns&lt;br/&gt;10.244.0.2)]
    coredns2[(coredns&lt;br/&gt;10.244.0.3)]
    wkfl ~~~ wkpods
    wkpods ~~~ coredns1
    wkpods ~~~ coredns2
  end
  kubeDNS[(Service kube-dns 10.96.0.10)]
  cpfl &lt;--&gt; wkfl
  coredns1 --&gt; kubeDNS
  coredns2 --&gt; kubeDNS
  kubeDNS --&gt; apiserver
  classDef vxlan stroke-dasharray: 5 5,stroke:#6c6,stroke-width:2px
  class cpfl,wkfl vxlan
</code></pre>
<hr />
<h2 id="heading-verification">Verification</h2>
<p>To make sure everything was really working, I ran a quick DNS smoke test:</p>
<pre><code class="lang-powershell"><span class="hljs-variable">$kc</span> = <span class="hljs-string">"kubeconfig"</span>

<span class="hljs-comment"># Create a PodSecurity-friendly test pod</span>
<span class="hljs-string">@'
apiVersion: v1
kind: Pod
metadata:
  name: dns-smoke
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: bb
    image: busybox:1.36
    command: ["sh","-c","sleep 3600"]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      runAsNonRoot: true
      runAsUser: 1000
'@</span> | kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> apply <span class="hljs-operator">-f</span> -

<span class="hljs-comment"># Wait for ready</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> wait -<span class="hljs-literal">-for</span>=condition=Ready pod/dns<span class="hljs-literal">-smoke</span> -<span class="hljs-literal">-timeout</span>=<span class="hljs-number">60</span>s

<span class="hljs-comment"># Test DNS</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> exec dns<span class="hljs-literal">-smoke</span> -- nslookup kubernetes.default.svc.cluster.local

<span class="hljs-comment"># Cleanup</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> delete pod dns<span class="hljs-literal">-smoke</span>
</code></pre>
<p>The lookup returned <code>10.96.0.10</code> (the kube-dns service). Perfect! My cluster networking was fully operational.</p>
<p><strong>What this proves:</strong></p>
<ol>
<li><p><strong>Pod-to-Service Communication</strong>: The busybox pod could reach the CoreDNS service IP.</p>
</li>
<li><p><strong>CNI Overlay Health</strong>: For the packet to travel from the pod to the service, the Flannel VXLAN tunnel had to be working correctly.</p>
</li>
<li><p><strong>DNS Resolution</strong>: CoreDNS was actually running and able to answer the query.</p>
</li>
</ol>
<hr />
<h2 id="heading-optional-hardening">Optional Hardening</h2>
<p>If flannel ever chooses the wrong NIC in the future, you can pin the interface explicitly in the ConfigMap.</p>
<p><strong>Why do this?</strong> Even with the Talos fix, there's a small chance that if you add more network cards or change the VM config, the interface order could change. Pinning the interface name (e.g., <code>eth1</code>) in the Flannel config is an extra safety measure to ensure it <em>always</em> uses the correct road, no matter what.</p>
<pre><code class="lang-powershell"><span class="hljs-variable">$kc</span> = <span class="hljs-string">"c:\Users\Pushpendra\Desktop\projects\talos_linux_learning\kubeconfig"</span>

<span class="hljs-comment"># Export the flannel ConfigMap</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> get cm kube<span class="hljs-literal">-flannel</span><span class="hljs-literal">-cfg</span> <span class="hljs-literal">-o</span> yaml &gt; flannel<span class="hljs-literal">-cm</span>.yaml

<span class="hljs-comment"># Edit net-conf.json in the ConfigMap and add: "Iface": "&lt;your-host-only-iface&gt;"</span>
<span class="hljs-comment"># For example: "Iface": "eth1" (or whatever interface has 192.168.56.x)</span>

<span class="hljs-comment"># Apply the updated ConfigMap</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> apply <span class="hljs-operator">-f</span> flannel<span class="hljs-literal">-cm</span>.yaml

<span class="hljs-comment"># Restart flannel DaemonSet to pick up changes</span>
kubectl -<span class="hljs-literal">-kubeconfig</span> <span class="hljs-variable">$kc</span> <span class="hljs-literal">-n</span> kube<span class="hljs-literal">-system</span> rollout restart ds/kube<span class="hljs-literal">-flannel</span>
</code></pre>
<hr />
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<p>Here's what I learned from this debugging adventure:</p>
<ul>
<li><p><strong>Unique node IPs are table stakes</strong> for CNI overlays like Flannel—duplicate IPs break VXLAN tunnel establishment</p>
</li>
<li><p><strong>VirtualBox NAT can be tricky</strong> in multi-VM labs—it often presents the same IP (<code>10.0.3.15</code>) to multiple guests from the guest OS perspective</p>
</li>
<li><p><strong>Talos makes it easy</strong> to pin node IP selection with <code>machine.kubelet.nodeIP.validSubnets</code>—no need to manually configure networking</p>
</li>
<li><p><strong>Always check events</strong> (<code>kubectl get events -A --sort-by=.lastTimestamp</code>) when pods fail to start—they contain the real root cause</p>
</li>
<li><p><strong>kubectl describe is your friend</strong>—pod events show CNI failures, sandbox creation errors, and flannel state</p>
</li>
<li><p>For labs: use <strong>Host-Only for cluster traffic</strong> (stable, predictable) and <strong>NAT only for internet access</strong> (avoid it for node IPs)</p>
</li>
<li><p>The <code>validSubnets</code> approach lets you keep both NICs while controlling which one Kubernetes uses</p>
</li>
</ul>
<hr />
<h2 id="heading-reusable-debugging-runbook">Reusable Debugging Runbook</h2>
<p>If you hit similar issues, here's the checklist I followed:</p>
<ol>
<li><p><strong>Check node IPs</strong>: Run <code>kubectl get nodes -o wide</code>. Are the <code>INTERNAL-IP</code>s unique?</p>
</li>
<li><p><strong>Inspect failing pods</strong>: Run <code>kubectl get pods -A -o wide</code>. Look for <code>CrashLoopBackOff</code> or <code>ContainerCreating</code>.</p>
</li>
<li><p><strong>Read events</strong>: Run <code>kubectl get events -A --sort-by=.lastTimestamp</code>. This is often where the "smoking gun" error lives.</p>
</li>
<li><p><strong>Describe problematic pods</strong>: Run <code>kubectl -n kube-system describe pod &lt;pod-name&gt;</code>. Look at the "Events" section at the bottom.</p>
</li>
<li><p><strong>Check CNI logs</strong>: Run <code>kubectl -n kube-system logs &lt;flannel-pod&gt;</code>. Look for errors about "subnets" or "interfaces".</p>
</li>
<li><p><strong>Fix node IP selection</strong>: Update <code>machine.kubelet.nodeIP.validSubnets</code> in your Talos config.</p>
</li>
<li><p><strong>Apply and Validate</strong>: Use <code>talosctl apply-config</code>, then watch the rollout with <code>kubectl rollout status</code>.</p>
</li>
<li><p><strong>Smoke-test</strong>: Run a simple pod to verify DNS and network connectivity.</p>
</li>
</ol>
<hr />
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a target="_blank" href="https://www.talos.dev/">Talos Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/flannel-io/flannel">Flannel CNI</a></p>
</li>
<li><p><a target="_blank" href="https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/">Kubernetes Node IP Selection</a></p>
</li>
<li><p><a target="_blank" href="https://www.virtualbox.org/manual/ch06.html">VirtualBox Networking Modes</a></p>
</li>
</ul>
<hr />
<h2 id="heading-about-this-post">About This Post</h2>
<p>This guide documents a real troubleshooting session from my Talos Linux learning journey. If you found it helpful, feel free to share it.</p>
<p>Happy clustering! 🚀</p>
<p><em>Have you encountered similar networking gremlins in your home lab? Let me know in the comments.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Comprehensive Guide to Deploying n8n in Production: A Docker Deployment Journey]]></title><description><![CDATA[A Real-World Project: Building a Self-Hosted Workflow Automation Platform with Docker Compose, PostgreSQL, and Caddy

Introduction: Why I Built This n8n Deployment
In today's fast-paced business environment, efficiency isn't just an advantage =>it's ...]]></description><link>https://blog.overflowbyte.cloud/the-comprehensive-guide-to-deploying-n8n-in-production-a-docker-deployment-journey</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/the-comprehensive-guide-to-deploying-n8n-in-production-a-docker-deployment-journey</guid><category><![CDATA[n8n]]></category><category><![CDATA[Docker]]></category><category><![CDATA[deployment]]></category><category><![CDATA[Docker compose]]></category><category><![CDATA[Linux]]></category><category><![CDATA[docker-network]]></category><category><![CDATA[Backup]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Mon, 17 Nov 2025 17:19:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763392454133/bf7a00c4-2a9b-4ced-8f2f-7e2fcfa4b54f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>A Real-World Project: Building a Self-Hosted Workflow Automation Platform with Docker Compose, PostgreSQL, and Caddy</strong></p>
<hr />
<h2 id="heading-introduction-why-i-built-this-n8n-deployment">Introduction: Why I Built This n8n Deployment</h2>
<p>In today's fast-paced business environment, efficiency isn't just an advantage; it's a necessity. For this production-grade Docker project, I chose to deploy <strong>n8n</strong> (pronounced "n-eight-n"), a powerful, self-hostable, open-source workflow automation platform. This wasn't just a learning exercise; it was about solving a real problem: eliminating manual, repetitive tasks that drain productivity from Micro, Small, and Medium Enterprises (MSMEs).</p>
<h3 id="heading-what-is-n8n">What is n8n?</h3>
<p>n8n is a workflow automation platform that connects different apps and services to create complex, customized workflows without extensive coding. Think of it as a central nervous system for your business: it makes your applications communicate with each other, handling routine operations automatically.</p>
<h3 id="heading-why-n8n-matters-for-msmes">Why n8n Matters for MSMEs</h3>
<p>I chose n8n for this deployment project because it addresses critical business needs:</p>
<ul>
<li><p><strong>💰 Cost Efficiency</strong>: Being open-source and self-hostable means significantly lower operational costs compared to proprietary SaaS automation services like Zapier or Make.com. For budget-conscious smaller businesses, this can mean thousands of dollars in annual savings.</p>
</li>
<li><p><strong>🔒 Data Control &amp; Security</strong>: Self-hosting gives complete control over sensitive business data and credentials. In an age of data breaches and privacy concerns, knowing exactly where your data lives and who has access to it is invaluable.</p>
</li>
<li><p><strong>📈 Scalability for Growth</strong>: The containerized, microservices architecture I've implemented ensures the platform can scale from a single user to an enterprise-level operation without major re-architecture.</p>
</li>
</ul>
<h3 id="heading-real-world-automation-use-cases">Real-World Automation Use Cases</h3>
<p>Through n8n, businesses can automate:</p>
<ul>
<li><p><strong>Marketing Automation</strong>: Automatically add leads from web forms to CRM systems, notify sales teams via Slack, WhatsApp and trigger personalized welcome email sequences.</p>
</li>
<li><p><strong>Data Synchronization</strong>: Keep inventory numbers, customer lists, and project statuses consistent across Google Sheets, databases, and accounting software in real-time.</p>
</li>
<li><p><strong>Internal Operations</strong>: Automate notification systems, generate scheduled reports, perform data cleanup tasks, and manage approval workflows.</p>
</li>
</ul>
<hr />
<h2 id="heading-why-this-deployment-architecture">Why This Deployment Architecture?</h2>
<p>For my first Docker production deployment, I needed an architecture that was not only robust and secure but also manageable and educational. Here's the technical stack I chose and why each component matters.</p>
<h3 id="heading-the-power-of-docker-compose">The Power of Docker Compose</h3>
<p><strong>Docker Compose</strong> allows us to define and orchestrate multi-container applications using a single declarative configuration file. For this n8n deployment, I manage three services (the n8n application, the PostgreSQL database, and the Caddy reverse proxy) as a unified system.</p>
<p><strong>Why Docker Compose?</strong></p>
<ul>
<li><p><strong>Manageability</strong>: The entire infrastructure is defined in a single <code>docker-compose.yml</code> file, making it version-controllable, reproducible, and easy to understand. Anyone reviewing my project can see exactly how services are configured and connected.</p>
</li>
<li><p><strong>Isolation</strong>: Each service runs in its own container with defined resource boundaries and network isolation, improving security and preventing conflicts.</p>
</li>
<li><p><strong>Portability</strong>: The same configuration works on any system with Docker installed — whether it's my development machine, a production VPS, or a cloud provider.</p>
</li>
<li><p><strong>Scalability</strong>: While starting with a single instance, this containerized architecture provides a clear migration path to orchestration platforms like Kubernetes when scaling becomes necessary.</p>
</li>
</ul>
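<p>To make that layout concrete, here is a skeleton of how the three services could be declared. This is a sketch, not the full production file: the image tags, service names, and volume name are illustrative.</p>
<pre><code class="lang-yaml">services:
  postgres:
    image: postgres:16-alpine          # database, reachable only internally
    volumes:
      - db_data:/var/lib/postgresql/data
  n8n:
    image: n8nio/n8n                   # workflow engine, listens on 5678
    depends_on:
      - postgres
  caddy:
    image: caddy:alpine                # reverse proxy, the only exposed service
    ports:
      - "80:80"
      - "443:443"

volumes:
  db_data:
</code></pre>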
<h3 id="heading-choosing-postgresql-over-sqlite-for-production">Choosing PostgreSQL Over SQLite for Production</h3>
<p>One of the most important architectural decisions was selecting the database backend. While n8n defaults to SQLite for simple setups, a production environment demands more.</p>
<p><strong>Why PostgreSQL?</strong></p>
<ul>
<li><p><strong>🔄 Concurrency</strong>: SQLite locks the entire database file during write operations, which severely limits performance when multiple workflows execute simultaneously. PostgreSQL handles multiple concurrent connections and read/write operations efficiently using its Multi-Version Concurrency Control (MVCC) system.</p>
</li>
<li><p><strong>✅ Reliability &amp; ACID Compliance</strong>: PostgreSQL offers superior transaction management with full ACID (Atomicity, Consistency, Isolation, Durability) guarantees. This is crucial when dealing with workflow execution history and sensitive credential storage where data integrity cannot be compromised.</p>
</li>
<li><p><strong>📦 Data Encapsulation</strong>: PostgreSQL runs as a separate, dedicated service with its own container, providing better separation of concerns. This architecture simplifies backup and restore operations compared to file-based databases.</p>
</li>
<li><p><strong>🚀 Performance at Scale</strong>: PostgreSQL provides advanced query optimization, sophisticated indexing capabilities, and efficient resource management that becomes critical as workflow complexity and execution volume grow.</p>
</li>
</ul>
<p><strong>PostgreSQL 16 Alpine</strong> specifically offers:</p>
<ul>
<li>Latest stable release with performance improvements</li>
<li>Long-term support until November 2028</li>
<li>Smaller container image (~240MB vs ~380MB for standard images)</li>
<li>Reduced attack surface due to Alpine Linux's minimal design</li>
</ul>
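<p>For reference, pointing n8n at PostgreSQL instead of SQLite comes down to a handful of environment variables on the n8n container. The host name <code>postgres</code> and the credentials below are placeholders matching this architecture:</p>
<pre><code class="lang-yaml">environment:
  - DB_TYPE=postgresdb
  - DB_POSTGRESDB_HOST=postgres        # the database service name on the internal network
  - DB_POSTGRESDB_PORT=5432
  - DB_POSTGRESDB_DATABASE=n8n
  - DB_POSTGRESDB_USER=n8n
  - DB_POSTGRESDB_PASSWORD=change-me   # use a strong secret in production
</code></pre>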
<h3 id="heading-caddy-simplified-security-with-automatic-https">Caddy: Simplified Security with Automatic HTTPS</h3>
<p>For my first production deployment, I wanted security to be robust but not complex. <strong>Caddy</strong> emerged as the perfect reverse proxy choice.</p>
<p><strong>Why Caddy?</strong></p>
<ul>
<li><p><strong>🔐 Automatic HTTPS</strong>: When you configure a domain name, Caddy automatically obtains, installs, and renews SSL certificates from Let's Encrypt: no manual certificate management, no cron jobs, no expired certificates causing downtime.</p>
</li>
<li><p><strong>⚡ Zero-Configuration SSL</strong>: Unlike traditional web servers (Apache, Nginx) that require complex SSL configuration, Caddy makes HTTPS the default with minimal configuration.</p>
</li>
<li><p><strong>🛡️ Security by Default</strong>: Caddy includes modern security headers, HTTP/2 and HTTP/3 support, and secure TLS configurations out of the box.</p>
</li>
<li><p><strong>🔄 Graceful Reloads</strong>: Configuration changes can be applied without service interruption, which is critical for production environments.</p>
</li>
<li><p><strong>Simplicity</strong>: The <code>Caddyfile</code> configuration syntax is intuitive and readable, making it perfect for a first production project where understanding every component is important.</p>
</li>
</ul>
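<p>As a taste of that simplicity, a working <code>Caddyfile</code> for this setup can be as short as the sketch below. The domain is a placeholder; <code>n8n:5678</code> refers to the n8n service name and port on the internal Docker network:</p>
<pre><code class="lang-text">n8n.example.com {
    reverse_proxy n8n:5678
}
</code></pre>
<p>That one block is enough for Caddy to obtain a certificate for the domain and proxy all traffic to n8n.</p>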
<hr />
<h2 id="heading-prerequisites-what-you-need-before-starting">Prerequisites: What You Need Before Starting</h2>
<p>Before beginning this deployment, ensure your production server meets these requirements:</p>
<h3 id="heading-required-software">Required Software</h3>
<p><strong>Docker</strong> (Version 20.10 or higher):</p>
<pre><code class="lang-bash">docker --version
</code></pre>
<p><strong>Docker Compose</strong> (Version 2.0 or higher):</p>
<pre><code class="lang-bash">docker compose version
</code></pre>
<p><strong>Installation</strong> (if needed for Ubuntu/Debian):</p>
<pre><code class="lang-bash">curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker <span class="hljs-variable">$USER</span>  <span class="hljs-comment"># log out and back in for the group change to take effect</span>
</code></pre>
<h3 id="heading-system-requirements">System Requirements</h3>
<ul>
<li><strong>Operating System</strong>: Linux (Ubuntu 20.04+, Debian 11+, CentOS 8+)</li>
<li><strong>RAM</strong>: Minimum 2GB, Recommended 4GB+ (based on expected workflow complexity)</li>
<li><strong>Storage</strong>: Minimum 10GB free space for containers and data</li>
<li><strong>Network</strong>: Public IP address for external access</li>
</ul>
<h3 id="heading-optional-for-production-domain-deployment">Optional (For Production Domain Deployment)</h3>
<ul>
<li><strong>Domain Name</strong>: DNS A record pointing to your server's IP address</li>
<li><strong>Firewall Configuration</strong>: Ports 80 (HTTP) and 443 (HTTPS) open for incoming connections</li>
</ul>
<hr />
<h2 id="heading-architecture-overview-how-everything-connects">Architecture Overview: How Everything Connects</h2>
<p>Understanding the architecture was crucial for my learning journey. Here's how the three services interact:</p>
<pre><code>┌─────────────────────────────────────────────────────────────┐
│                         Internet                            │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           │ HTTP/HTTPS (Ports <span class="hljs-number">80</span>/<span class="hljs-number">443</span>)
                           │
                  ┌────────▼─────────┐
                  │                  │
                  │  Caddy <span class="hljs-built_in">Proxy</span>     │  ← Automatic HTTPS
                  │  (Alpine Linux)  │     SSL Termination
                  │                  │     Reverse <span class="hljs-built_in">Proxy</span>
                  └────────┬─────────┘
                           │
                           │ Internal Network (Default)
                           │ HTTP to n8n:<span class="hljs-number">5678</span>
                  ┌────────▼─────────┐
                  │                  │
                  │  n8n Application │  ← Workflow Engine
                  │  (Node.js)       │     REST API
                  │                  │     Web Interface
                  └────────┬─────────┘
                           │
                           │ Internal Network (Isolated)
                           │ PostgreSQL Protocol :<span class="hljs-number">5432</span>
                  ┌────────▼─────────┐
                  │                  │
                  │  PostgreSQL <span class="hljs-number">16</span>   │  ← Database
                  │  (Alpine Linux)  │     Data Persistence
                  │                  │     Credential Storage
                  └──────────────────┘
</code></pre><h3 id="heading-network-architecture-explained">Network Architecture Explained</h3>
<p><strong>Two Isolated Networks:</strong></p>
<ol>
<li><p><strong>Default Network</strong> (Exposed):</p>
<ul>
<li>Connects Caddy (exposed to internet) with n8n</li>
<li>Caddy receives external HTTP/HTTPS requests</li>
<li>Forwards internally to n8n on port 5678</li>
</ul>
</li>
<li><p><strong>Internal Network</strong> (Isolated):</p>
<ul>
<li>Connects n8n with PostgreSQL</li>
<li>Completely isolated from internet access</li>
<li>Database port 5432 not exposed externally</li>
<li><strong>Security benefit</strong>: Database cannot be directly attacked from internet</li>
</ul>
</li>
</ol>
<p><strong>Request Flow:</strong></p>
<pre><code>User Browser → HTTPS/HTTP
    ↓
Caddy (Ports <span class="hljs-number">80</span>/<span class="hljs-number">443</span>) → SSL Termination
    ↓
n8n (Port <span class="hljs-number">5678</span>) → Workflow Processing
    ↓
PostgreSQL (Port <span class="hljs-number">5432</span>) → Data Storage
</code></pre><h3 id="heading-data-persistence-strategy">Data Persistence Strategy</h3>
<p>All critical data is stored in local directory bind mounts under <code>./data/</code>:</p>
<pre><code class="lang-bash">/home/user/n8n/
├── docker-compose.yml      <span class="hljs-comment"># Service orchestration</span>
├── Caddyfile              <span class="hljs-comment"># Reverse proxy config</span>
├── .env                   <span class="hljs-comment"># Environment secrets</span>
└── data/                  <span class="hljs-comment"># All persistent data</span>
    ├── postgres/          <span class="hljs-comment"># Database files</span>
    │   └── pgdata/       <span class="hljs-comment"># PostgreSQL data directory</span>
    ├── n8n/              <span class="hljs-comment"># Application data</span>
    │   ├── .n8n.json    <span class="hljs-comment"># Configuration</span>
    │   ├── credentials/  <span class="hljs-comment"># Encrypted credentials</span>
    │   └── workflows/    <span class="hljs-comment"># Workflow backups</span>
    └── caddy/            <span class="hljs-comment"># Web server data</span>
        ├── data/         <span class="hljs-comment"># SSL certificates</span>
        └── config/       <span class="hljs-comment"># Runtime config</span>
</code></pre>
<p><strong>Why Local Directories Instead of Docker Volumes?</strong></p>
<ul>
<li><strong>Easy Backups</strong>: Simple filesystem operations (<code>cp</code>, <code>rsync</code>, <code>tar</code>)</li>
<li><strong>Direct Access</strong>: No need for <code>docker volume</code> commands to inspect data</li>
<li><strong>Portability</strong>: Easy migration between servers</li>
<li><strong>Transparency</strong>: Clear visibility of where data resides</li>
<li><strong>Version Control</strong>: Can selectively track configurations (excluding sensitive data)</li>
</ul>
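<p>Because everything lives under <code>./data/</code>, a backup reduces to a single archive operation. Here is a minimal sketch (paths follow the layout above; the <code>docker compose stop</code>/<code>start</code> lines are commented out so the archive step can be tried on its own, but uncomment them for a fully consistent snapshot):</p>

```shell
# Archive the entire ./data tree into a timestamped tarball under ./backups.
set -eu
STAMP=$(date +%Y%m%d-%H%M%S)
mkdir -p backups
# docker compose stop   # optional: quiesce writes for a consistent snapshot
if [ -d data ]; then
  tar -czf "backups/n8n-data-${STAMP}.tar.gz" data
  echo "wrote backups/n8n-data-${STAMP}.tar.gz"
else
  echo "no ./data directory here yet"
fi
# docker compose start
```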
<hr />
<h2 id="heading-step-1-setting-up-docker-composeyml">Step 1: Setting Up docker-compose.yml</h2>
<p>This file is the heart of the deployment, defining all services, their configurations, and how they interconnect. Let me break down each component with the reasoning behind every configuration choice.</p>
<h3 id="heading-complete-docker-composeyml">Complete docker-compose.yml</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">postgres:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:16-alpine</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_postgres</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">POSTGRES_USER:</span> <span class="hljs-string">${POSTGRES_USER}</span>
      <span class="hljs-attr">POSTGRES_PASSWORD:</span> <span class="hljs-string">${POSTGRES_PASSWORD}</span>
      <span class="hljs-attr">POSTGRES_DB:</span> <span class="hljs-string">${POSTGRES_DB}</span>
      <span class="hljs-attr">PGDATA:</span> <span class="hljs-string">/var/lib/postgresql/data/pgdata</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./data/postgres:/var/lib/postgresql/data</span>
    <span class="hljs-attr">networks:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">internal</span>
    <span class="hljs-attr">healthcheck:</span>
      <span class="hljs-attr">test:</span> [<span class="hljs-string">"CMD-SHELL"</span>, <span class="hljs-string">"pg_isready -U ${POSTGRES_USER}"</span>]
      <span class="hljs-attr">interval:</span> <span class="hljs-string">10s</span>
      <span class="hljs-attr">timeout:</span> <span class="hljs-string">5s</span>
      <span class="hljs-attr">retries:</span> <span class="hljs-number">5</span>

  <span class="hljs-attr">n8n:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">n8nio/n8n:stable</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_app</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">N8N_HOST:</span> <span class="hljs-string">${N8N_HOST}</span>
      <span class="hljs-attr">N8N_PROTOCOL:</span> <span class="hljs-string">${N8N_PROTOCOL}</span>
      <span class="hljs-attr">WEBHOOK_URL:</span> <span class="hljs-string">${N8N_PROTOCOL}://${N8N_HOST}/</span>
      <span class="hljs-attr">DB_TYPE:</span> <span class="hljs-string">postgresdb</span>
      <span class="hljs-attr">DB_POSTGRESDB_HOST:</span> <span class="hljs-string">postgres</span>
      <span class="hljs-attr">DB_POSTGRESDB_PORT:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">DB_POSTGRESDB_DATABASE:</span> <span class="hljs-string">${POSTGRES_DB}</span>
      <span class="hljs-attr">DB_POSTGRESDB_USER:</span> <span class="hljs-string">${POSTGRES_USER}</span>
      <span class="hljs-attr">DB_POSTGRESDB_PASSWORD:</span> <span class="hljs-string">${POSTGRES_PASSWORD}</span>
      <span class="hljs-attr">N8N_ENCRYPTION_KEY:</span> <span class="hljs-string">${N8N_ENCRYPTION_KEY}</span>
      <span class="hljs-attr">EXECUTIONS_DATA_PRUNE:</span> <span class="hljs-string">"true"</span>
      <span class="hljs-attr">EXECUTIONS_DATA_MAX_AGE:</span> <span class="hljs-number">168</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./data/n8n:/home/node/.n8n</span>
    <span class="hljs-attr">networks:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">internal</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">default</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-attr">postgres:</span>
        <span class="hljs-attr">condition:</span> <span class="hljs-string">service_healthy</span>

  <span class="hljs-attr">caddy:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">caddy:2-alpine</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_caddy</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"80:80"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"443:443"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./Caddyfile:/etc/caddy/Caddyfile:ro</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./data/caddy/data:/data</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./data/caddy/config:/config</span>
    <span class="hljs-attr">networks:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">default</span>
    <span class="hljs-attr">depends_on:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">n8n</span>

<span class="hljs-attr">networks:</span>
  <span class="hljs-attr">internal:</span>
    <span class="hljs-attr">driver:</span> <span class="hljs-string">bridge</span>
  <span class="hljs-attr">default:</span>
    <span class="hljs-attr">driver:</span> <span class="hljs-string">bridge</span>
</code></pre>
<h3 id="heading-postgresql-service-deep-dive">PostgreSQL Service Deep Dive</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">postgres:</span>
  <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:16-alpine</span>
  <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_postgres</span>
  <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
</code></pre>
<p><strong>Configuration Explained:</strong></p>
<ul>
<li><code>image: postgres:16-alpine</code>: Uses PostgreSQL 16 with Alpine Linux base (lightweight, security-focused)</li>
<li><code>container_name: n8n_postgres</code>: Friendly name for easier management and log identification</li>
<li><code>restart: always</code>: Container automatically restarts on failure or system reboot, which is critical for production availability</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">environment:</span>
    <span class="hljs-attr">POSTGRES_USER:</span> <span class="hljs-string">${POSTGRES_USER}</span>
    <span class="hljs-attr">POSTGRES_PASSWORD:</span> <span class="hljs-string">${POSTGRES_PASSWORD}</span>
    <span class="hljs-attr">POSTGRES_DB:</span> <span class="hljs-string">${POSTGRES_DB}</span>
    <span class="hljs-attr">PGDATA:</span> <span class="hljs-string">/var/lib/postgresql/data/pgdata</span>
</code></pre>
<p><strong>Why Each Variable Matters:</strong></p>
<ul>
<li><code>POSTGRES_USER</code>: Creates the database superuser account (loaded from <code>.env</code> for security)</li>
<li><code>POSTGRES_PASSWORD</code>: Secures database access — <strong>must be strong and unique</strong></li>
<li><code>POSTGRES_DB</code>: Database name created on first startup (default: <code>n8n_db</code>)</li>
<li><code>PGDATA</code>: Specifies exact data directory path — required when using bind mounts to avoid permission issues</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./data/postgres:/var/lib/postgresql/data</span>
</code></pre>
<p><strong>Data Persistence:</strong></p>
<ul>
<li><code>./data/postgres</code>: Local directory on host machine (created automatically)</li>
<li><code>/var/lib/postgresql/data</code>: PostgreSQL's internal data directory</li>
<li><strong>Bind Mount</strong>: Direct mapping ensures data survives container removal/recreation</li>
<li><strong>Critical</strong>: Without this, all workflow execution history would be lost on container restart!</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">networks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">internal</span>
</code></pre>
<p><strong>Network Isolation:</strong></p>
<ul>
<li>Connected <strong>only</strong> to internal network</li>
<li><strong>Not</strong> exposed to default network (internet-facing)</li>
<li>Database port 5432 never directly accessible from outside</li>
<li><strong>Security benefit</strong>: Prevents external database attacks</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">healthcheck:</span>
    <span class="hljs-attr">test:</span> [<span class="hljs-string">"CMD-SHELL"</span>, <span class="hljs-string">"pg_isready -U ${POSTGRES_USER}"</span>]
    <span class="hljs-attr">interval:</span> <span class="hljs-string">10s</span>
    <span class="hljs-attr">timeout:</span> <span class="hljs-string">5s</span>
    <span class="hljs-attr">retries:</span> <span class="hljs-number">5</span>
</code></pre>
<p><strong>Why Health Checks?</strong></p>
<ul>
<li><code>pg_isready</code>: PostgreSQL utility that checks if database accepts connections</li>
<li><code>interval: 10s</code>: Check every 10 seconds</li>
<li><code>timeout: 5s</code>: Wait maximum 5 seconds for response</li>
<li><code>retries: 5</code>: Allows up to 5 consecutive failed checks before the container is marked unhealthy</li>
<li><strong>Purpose</strong>: Prevents n8n from starting before database is fully ready, avoiding connection errors</li>
</ul>
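<p>The <code>retries</code> semantics are worth internalizing: the container is only reported unhealthy after the probe fails that many times in a row. A docker-free illustration in plain shell, where <code>probe</code> is a stand-in for <code>pg_isready</code>:</p>

```shell
# Simulate the healthcheck loop: 5 consecutive probe failures => unhealthy.
probe() { return 1; }   # stand-in for pg_isready; always fails in this demo
retries=5
status=starting
i=1
while [ "$i" -le "$retries" ]; do
  if probe; then status=healthy; break; fi
  status=unhealthy
  i=$((i + 1))
done
echo "status after $retries failed probes: $status"
```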
<h3 id="heading-n8n-application-service-deep-dive">n8n Application Service Deep Dive</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">n8n:</span>
  <span class="hljs-attr">image:</span> <span class="hljs-string">n8nio/n8n:stable</span>
  <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_app</span>
  <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
</code></pre>
<p><strong>Image Selection:</strong></p>
<ul>
<li><code>n8nio/n8n:stable</code>: Official n8n image on the stable release channel</li>
<li><strong>Why stable tag?</strong> Avoids unexpected changes from <code>latest</code> builds, ensuring predictable production behavior</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">environment:</span>
    <span class="hljs-attr">N8N_HOST:</span> <span class="hljs-string">${N8N_HOST}</span>
    <span class="hljs-attr">N8N_PROTOCOL:</span> <span class="hljs-string">${N8N_PROTOCOL}</span>
    <span class="hljs-attr">WEBHOOK_URL:</span> <span class="hljs-string">${N8N_PROTOCOL}://${N8N_HOST}/</span>
</code></pre>
<p><strong>Public Access Configuration:</strong></p>
<ul>
<li><code>N8N_HOST</code>: How n8n is accessed externally<ul>
<li>IP-based: <code>:80</code> (just the port)</li>
<li>Domain-based: <code>n8n.yourdomain.com</code></li>
</ul>
</li>
<li><code>N8N_PROTOCOL</code>: <code>http</code> (IP access) or <code>https</code> (domain with SSL)</li>
<li><code>WEBHOOK_URL</code>: Full URL for external services to send webhook callbacks<ul>
<li>Example: <code>https://n8n.yourdomain.com/</code> for webhooks from Stripe, GitHub, etc.</li>
</ul>
</li>
</ul>
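<p>The interpolation in <code>WEBHOOK_URL</code> is plain shell-style variable substitution, performed by Docker Compose when it reads <code>.env</code>. A small illustration using the domain-based example values from above:</p>

```shell
# How WEBHOOK_URL is assembled from the two variables (example values).
N8N_HOST=n8n.yourdomain.com
N8N_PROTOCOL=https
WEBHOOK_URL="${N8N_PROTOCOL}://${N8N_HOST}/"
echo "$WEBHOOK_URL"   # prints https://n8n.yourdomain.com/
```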
<pre><code class="lang-yaml">    <span class="hljs-attr">DB_TYPE:</span> <span class="hljs-string">postgresdb</span>
    <span class="hljs-attr">DB_POSTGRESDB_HOST:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">DB_POSTGRESDB_PORT:</span> <span class="hljs-number">5432</span>
    <span class="hljs-attr">DB_POSTGRESDB_DATABASE:</span> <span class="hljs-string">${POSTGRES_DB}</span>
    <span class="hljs-attr">DB_POSTGRESDB_USER:</span> <span class="hljs-string">${POSTGRES_USER}</span>
    <span class="hljs-attr">DB_POSTGRESDB_PASSWORD:</span> <span class="hljs-string">${POSTGRES_PASSWORD}</span>
</code></pre>
<p><strong>Database Integration:</strong></p>
<ul>
<li><code>DB_TYPE: postgresdb</code>: Tells n8n to use PostgreSQL instead of default SQLite</li>
<li><code>DB_POSTGRESDB_HOST: postgres</code>: Uses Docker service name (Docker's internal DNS resolves this to the container IP)</li>
<li><code>DB_POSTGRESDB_PORT: 5432</code>: Standard PostgreSQL port on internal network</li>
<li>Credentials must match PostgreSQL service configuration exactly</li>
</ul>
<pre><code class="lang-yaml">    <span class="hljs-attr">N8N_ENCRYPTION_KEY:</span> <span class="hljs-string">${N8N_ENCRYPTION_KEY}</span>
</code></pre>
<p><strong>🔐 MOST CRITICAL SECURITY PARAMETER:</strong></p>
<ul>
<li>Encrypts all sensitive credentials (API keys, passwords, OAuth tokens) stored in PostgreSQL</li>
<li>Must be set before first run</li>
<li><strong>DO NOT LOSE THIS KEY</strong>: All encrypted credentials become permanently unrecoverable if lost</li>
<li>Generate using: <code>openssl rand -base64 32</code></li>
</ul>
<pre><code class="lang-yaml">    <span class="hljs-attr">EXECUTIONS_DATA_PRUNE:</span> <span class="hljs-string">"true"</span>
    <span class="hljs-attr">EXECUTIONS_DATA_MAX_AGE:</span> <span class="hljs-number">168</span>
</code></pre>
<p><strong>Data Retention Management:</strong></p>
<ul>
<li><code>EXECUTIONS_DATA_PRUNE: "true"</code>: Enables automatic cleanup of old workflow execution logs</li>
<li><code>EXECUTIONS_DATA_MAX_AGE: 168</code>: Retention period in hours (168 hours = 7 days)</li>
<li><strong>Why this matters</strong>: Prevents the database from growing without bound — execution history can consume significant space over time</li>
<li><strong>Customization</strong>: Adjust based on compliance requirements and storage capacity</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./data/n8n:/home/node/.n8n</span>
</code></pre>
<p><strong>Application Data Storage:</strong></p>
<ul>
<li><code>./data/n8n</code>: Local directory for n8n application data</li>
<li><code>/home/node/.n8n</code>: n8n's internal data directory (runs as <code>node</code> user, UID 1000)</li>
<li><strong>Stores</strong>: Custom node packages, local file storage, configuration cache</li>
<li><strong>Note</strong>: Actual workflow definitions and credentials are in PostgreSQL, not here</li>
</ul>
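<p>Because the container runs as UID 1000, the bind-mounted directory must be writable by that user, or n8n may fail at startup with permission errors. A hedged setup sketch (run with <code>sudo</code> if the ownership change is refused):</p>

```shell
# Pre-create the n8n data directory and hand it to UID/GID 1000.
mkdir -p data/n8n
# chown by a non-root user fails when changing to a different owner
chown -R 1000:1000 data/n8n 2>/dev/null || echo "re-run with sudo if ownership is wrong"
ls -ld data/n8n
```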
<pre><code class="lang-yaml">  <span class="hljs-attr">networks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">internal</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">default</span>
</code></pre>
<p><strong>Dual Network Connection:</strong></p>
<ul>
<li><code>internal</code>: Communicates with PostgreSQL database</li>
<li><code>default</code>: Receives proxied requests from Caddy</li>
<li><strong>Bridge role</strong>: n8n sits between the internet-facing proxy and isolated database</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">depends_on:</span>
    <span class="hljs-attr">postgres:</span>
      <span class="hljs-attr">condition:</span> <span class="hljs-string">service_healthy</span>
</code></pre>
<p><strong>Startup Order Control:</strong></p>
<ul>
<li>Waits for PostgreSQL service</li>
<li><strong>Critical</strong>: <code>condition: service_healthy</code> ensures database health checks pass before n8n starts</li>
<li><strong>Prevents</strong>: Database connection errors during startup</li>
</ul>
<h3 id="heading-caddy-reverse-proxy-service-deep-dive">Caddy Reverse Proxy Service Deep Dive</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">caddy:</span>
  <span class="hljs-attr">image:</span> <span class="hljs-string">caddy:2-alpine</span>
  <span class="hljs-attr">container_name:</span> <span class="hljs-string">n8n_caddy</span>
  <span class="hljs-attr">restart:</span> <span class="hljs-string">always</span>
</code></pre>
<p><strong>Image Choice:</strong></p>
<ul>
<li><code>caddy:2-alpine</code>: Caddy 2.x with Alpine Linux base</li>
<li><strong>Benefits</strong>: Small image size (~50MB), reduced attack surface, same powerful features</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">"80:80"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">"443:443"</span>
</code></pre>
<p><strong>Port Exposure (Only Exposed Ports):</strong></p>
<ul>
<li><code>"80:80"</code>: HTTP traffic (required for Let's Encrypt validation and HTTP-to-HTTPS redirect)</li>
<li><code>"443:443"</code>: HTTPS traffic (secure encrypted connections)</li>
<li><strong>Format</strong>: <code>"host_port:container_port"</code></li>
<li><strong>Critical</strong>: These are the ONLY ports accessible from the internet; everything else stays on internal Docker networks</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./Caddyfile:/etc/caddy/Caddyfile:ro</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./data/caddy/data:/data</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">./data/caddy/config:/config</span>
</code></pre>
<p><strong>Volume Mounts Explained:</strong></p>
<ul>
<li><code>./Caddyfile:/etc/caddy/Caddyfile:ro</code>: Configuration file (<code>:ro</code> = read-only for security)</li>
<li><code>./data/caddy/data:/data</code>: SSL certificates and persistent data (Let's Encrypt certificates stored here)</li>
<li><code>./data/caddy/config:/config</code>: Runtime configuration cache</li>
<li><strong>Bind mounts</strong>: All data accessible on host for easy backup and inspection</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">networks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">default</span>
</code></pre>
<p><strong>Network Connection:</strong></p>
<ul>
<li>Connected to default network (shared with n8n, exposed to internet)</li>
<li><strong>Not</strong> connected to internal network (doesn't need database access)</li>
</ul>
<pre><code class="lang-yaml">  <span class="hljs-attr">depends_on:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">n8n</span>
</code></pre>
<p><strong>Dependency:</strong></p>
<ul>
<li>Starts after n8n application is running</li>
<li>Ensures reverse proxy target is available when Caddy starts accepting traffic</li>
</ul>
<hr />
<h2 id="heading-step-2-configuring-the-caddyfile">Step 2: Configuring the Caddyfile</h2>
<p>The Caddyfile defines how Caddy handles incoming web traffic. Its simplicity is deceptive: behind this clean syntax, Caddy automatically manages SSL certificates, security headers, and request forwarding.</p>
<h3 id="heading-flexible-caddyfile-for-ip-and-domain-access">Flexible Caddyfile for IP and Domain Access</h3>
<pre><code class="lang-caddyfile"># Option 1: IP-based access (initial deployment)
# For accessing via http://YOUR_SERVER_IP
:80 {
    reverse_proxy n8n:5678 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}

# Option 2: Domain-based access with automatic HTTPS
# Uncomment and replace with your domain when ready
# n8n.yourdomain.com {
#     reverse_proxy n8n:5678 {
#         header_up Host {host}
#         header_up X-Real-IP {remote_host}
#         header_up X-Forwarded-For {remote_host}
#         header_up X-Forwarded-Proto {scheme}
#     }
# }
</code></pre>
<h3 id="heading-configuration-breakdown">Configuration Breakdown</h3>
<p><strong>IP-Based Access Block:</strong></p>
<pre><code class="lang-caddyfile">:80 {
</code></pre>
<ul>
<li><code>:80</code>: Listens on port 80 (HTTP) without a specific hostname</li>
<li><strong>Use case</strong>: Initial deployment when accessing via your server's public IP, e.g. <code>http://203.0.113.10</code></li>
<li><strong>No SSL</strong>: Caddy only enables automatic HTTPS when a domain name is specified</li>
</ul>
<pre><code class="lang-caddyfile">    reverse_proxy n8n:5678 {
</code></pre>
<ul>
<li><code>n8n</code>: Docker service name (resolved via Docker DNS to n8n container IP)</li>
<li><code>5678</code>: n8n's internal application port</li>
<li><strong>Function</strong>: Forwards all incoming requests to n8n application</li>
</ul>
<pre><code class="lang-caddyfile">        header_up Host {host}
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}
</code></pre>
<p><strong>Header Forwarding (Why Each Matters):</strong></p>
<ul>
<li><code>Host {host}</code>: Preserves original hostname from request — n8n needs this for webhook URL generation</li>
<li><code>X-Real-IP {remote_host}</code>: Real client IP address (not Caddy's internal IP)</li>
<li><code>X-Forwarded-For {remote_host}</code>: Standard header for proxied requests, used for logging and security</li>
<li><code>X-Forwarded-Proto {scheme}</code>: Tells n8n whether original request was HTTP or HTTPS — critical for proper redirects</li>
</ul>
<p><strong>Domain-Based Access (Production):</strong>
When you have a domain pointing to your server:</p>
<pre><code class="lang-caddyfile">n8n.yourdomain.com {
    reverse_proxy n8n:5678 {
        # Same headers as above
    }
}
</code></pre>
<p><strong>What Changes When Domain is Configured:</strong></p>
<ol>
<li>Caddy automatically contacts Let's Encrypt</li>
<li>Validates domain ownership via HTTP-01 challenge</li>
<li>Obtains SSL certificate</li>
<li>Enables HTTPS on port 443</li>
<li>Automatically redirects HTTP (port 80) to HTTPS (port 443)</li>
<li>Sets up automatic renewal (Caddy renews certificates well before their 90-day expiry, typically with about 30 days remaining)</li>
</ol>
<p><strong>No manual certificate management required!</strong></p>
<hr />
<h2 id="heading-step-3-critical-security-data-encryption">Step 3: Critical Security - Data Encryption</h2>
<p>Security was a top priority in this deployment. n8n stores sensitive credentials (API keys, OAuth tokens, database passwords) that must be protected.</p>
<h3 id="heading-understanding-n8nencryptionkey">Understanding N8N_ENCRYPTION_KEY</h3>
<p><strong>What It Does:</strong></p>
<ul>
<li>Encrypts all credentials before storing in PostgreSQL database</li>
<li>Uses AES-256-GCM encryption (industry-standard, highly secure)</li>
<li>Each credential is encrypted individually with authentication</li>
</ul>
<p><strong>Why It's Critical:</strong></p>
<ul>
<li><strong>Lose this key = Lose all credentials permanently</strong></li>
<li>No recovery mechanism exists — encrypted data cannot be decrypted without the exact key</li>
<li>Changing the key invalidates all existing encrypted credentials</li>
</ul>
<h3 id="heading-generating-a-secure-encryption-key">Generating a Secure Encryption Key</h3>
<p><strong>Option 1: Base64 Encoded (Recommended)</strong></p>
<pre><code class="lang-bash">openssl rand -base64 32
</code></pre>
<p>Output example (44 characters, base64): <code>Xk9pL2mN3qR5sT7vW9yZ1bC4dF6gH8jKlMnPqRsTuVw=</code></p>
<p><strong>Option 2: Hexadecimal</strong></p>
<pre><code class="lang-bash">openssl rand -hex 32
</code></pre>
<p>Output example (64 hexadecimal characters): <code>a3f5c7b9d1e2f4a6b8c0d2e4f6a8b0c2d4e6f8a0b2c4d6e8f0a2c4e6a8c0e2f4</code></p>
<p><strong>Option 3: Alphanumeric Only</strong></p>
<pre><code class="lang-bash">cat /dev/urandom | tr -dc <span class="hljs-string">'a-zA-Z0-9'</span> | fold -w 32 | head -n 1
</code></pre>
<p>Output example: <code>8kY3mQ7nB2xR9tL6wV4pS1zF5cH0jN3g</code></p>
<p><strong>Best Practices:</strong></p>
<ul>
<li><strong>Use at least 32 bytes of randomness</strong> (256 bits of entropy; the base64 form is 44 characters long)</li>
<li>Store in password manager immediately after generation</li>
<li>Never commit to version control</li>
<li>Back up securely (encrypted backup storage recommended)</li>
</ul>
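<p>Putting the generation and storage steps together, a hypothetical one-shot helper (the variable name matches this guide's <code>.env</code>; the key length is echoed as a reminder to back the key up immediately):</p>

```shell
# Generate a 32-byte key, append it to .env, and restrict file permissions.
set -eu
KEY=$(openssl rand -base64 32)
echo "N8N_ENCRYPTION_KEY=${KEY}" >> .env
chmod 600 .env   # secrets should be readable by the owner only
echo "generated a ${#KEY}-character key; store it in your password manager now"
```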
<hr />
<h2 id="heading-step-4-environment-configuration-env-file">Step 4: Environment Configuration (.env File)</h2>
<p>The <code>.env</code> file contains all sensitive configuration. This file must <strong>never</strong> be committed to version control.</p>
<h3 id="heading-complete-env-file-structure">Complete .env File Structure</h3>
<pre><code class="lang-env"># ============================================
# n8n Public Access Configuration
# ============================================

# For IP-based access (initial deployment):
N8N_HOST=:80
N8N_PROTOCOL=http

# For domain-based access (production):
# N8N_HOST=n8n.yourdomain.com
# N8N_PROTOCOL=https

# ============================================
# PostgreSQL Database Configuration
# ============================================

POSTGRES_USER=n8n_user
POSTGRES_PASSWORD=your_strong_database_password_here
POSTGRES_DB=n8n_db

# ============================================
# n8n Security Configuration
# ============================================

# CRITICAL: Encryption key for credentials (generate with: openssl rand -base64 32)
N8N_ENCRYPTION_KEY=your_generated_encryption_key_here

# JWT Secret (must be different from encryption key)
N8N_USER_MANAGEMENT_JWT_SECRET=your_unique_jwt_secret_here

# Session duration (hours users stay logged in)
N8N_USER_MANAGEMENT_JWT_DURATION_HOURS=24
N8N_USER_MANAGEMENT_JWT_REFRESH_TIMEOUT_HOURS=24

# ============================================
# Login Security &amp; Brute Force Protection
# ============================================

# Maximum failed login attempts before lockout
N8N_LOGIN_MAX_ATTEMPTS=5

# Lockout duration in minutes
N8N_LOGIN_LOCKOUT_DURATION=30

# ============================================
# Password Policy
# ============================================

N8N_USER_MANAGEMENT_PASSWORD_MIN_LENGTH=12
N8N_USER_MANAGEMENT_PASSWORD_REQUIRE_UPPERCASE=true
N8N_USER_MANAGEMENT_PASSWORD_REQUIRE_LOWERCASE=true
N8N_USER_MANAGEMENT_PASSWORD_REQUIRE_NUMBER=true
N8N_USER_MANAGEMENT_PASSWORD_REQUIRE_SPECIAL=true

# ============================================
# Additional Security Headers &amp; CORS
# ============================================

N8N_SECURITY_HEADERS_ENABLED=true
# For domain-based deployment:
# N8N_ALLOWED_ORIGINS=https://n8n.yourdomain.com

# ============================================
# Workflow Execution Settings
# ============================================

# Maximum workflow execution timeout (seconds)
EXECUTIONS_TIMEOUT=600
EXECUTIONS_TIMEOUT_MAX=3600

# ============================================
# Logging Configuration
# ============================================

N8N_LOG_LEVEL=warn
N8N_LOG_OUTPUT=json
</code></pre>
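<p>Before starting the stack, it's worth confirming that none of the <code>your_..._here</code> placeholders above survived editing. A small hedged check (the grep pattern is an assumption based on this template's placeholder naming):</p>

```shell
# Flag any template placeholders still left in .env.
ENV_FILE=.env
if [ ! -f "$ENV_FILE" ]; then
  result="missing"
elif grep -qE 'your_[a-z_]*_here' "$ENV_FILE"; then
  result="placeholders-left"
else
  result="ok"
fi
echo "env check: $result"
```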
<h3 id="heading-configuration-variable-explanations">Configuration Variable Explanations</h3>
<p><strong>Session Management:</strong></p>
<p><code>N8N_USER_MANAGEMENT_JWT_DURATION_HOURS=24</code></p>
<ul>
<li><strong>Purpose</strong>: How long users remain logged in without activity</li>
<li><strong>24 hours</strong>: Balances security with user convenience</li>
<li><strong>Shorter values</strong> (1-8 hours): Higher security, more frequent logins</li>
<li><strong>Why 24?</strong> Provides full workday access without requiring re-authentication</li>
</ul>
<p><code>N8N_LOGIN_MAX_ATTEMPTS=5</code></p>
<ul>
<li><strong>Purpose</strong>: Limits failed login attempts before account lockout</li>
<li><strong>5 attempts</strong>: Industry standard for brute-force protection</li>
<li><strong>Why?</strong> After 5 failed attempts, the requests are far more likely automated guessing than a legitimate user mistyping</li>
<li><strong>Protection</strong>: Makes password guessing attacks impractical</li>
</ul>
<p><code>N8N_LOGIN_LOCKOUT_DURATION=30</code></p>
<ul>
<li><strong>Purpose</strong>: Lockout duration in minutes after exceeding max attempts</li>
<li><strong>30 minutes</strong>: Long enough to deter automated attacks, short enough to not permanently block legitimate users</li>
<li><strong>Why?</strong> Provides cooldown period while not creating excessive user friction</li>
</ul>
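<p>Taken together, these three variables are the main login-hardening knobs. If you want a stricter posture (say, an internet-facing instance with few users), here's a sketch with tighter values — same variables as above; the specific numbers are only a suggested starting point:</p>
<pre><code class="lang-env"># One working day; forces daily re-login
N8N_USER_MANAGEMENT_JWT_DURATION_HOURS=8
# Fewer guesses before lockout
N8N_LOGIN_MAX_ATTEMPTS=3
# Longer cooldown (minutes) after lockout
N8N_LOGIN_LOCKOUT_DURATION=60
</code></pre>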
<p><strong>Password Policy:</strong></p>
<p><code>N8N_USER_MANAGEMENT_PASSWORD_MIN_LENGTH=12</code></p>
<ul>
<li><strong>12 characters</strong>: Minimum for modern password security standards</li>
<li><strong>Why?</strong> Provides sufficient entropy against brute-force attacks</li>
<li>Each additional character exponentially increases crack time</li>
</ul>
<p><code>N8N_USER_MANAGEMENT_PASSWORD_REQUIRE_*=true</code></p>
<ul>
<li><strong>Enforces composition</strong>: Uppercase, lowercase, numbers, special characters</li>
<li><strong>Why all four?</strong> Creates passwords resistant to dictionary attacks</li>
<li><strong>Example compliant password</strong>: <code>MyN8n@Pass2024!</code></li>
</ul>
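<p>To sanity-check a candidate password against this policy before typing it into the setup screen, here's a small bash sketch — my own local mirror of the rules above, not n8n's actual validator:</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# check_password.sh — local mirror of the policy above:
# 12+ characters, with uppercase, lowercase, a digit, and a special char.
check_password() {
    local pw="$1"
    [ "${#pw}" -ge 12 ] || { echo "too short"; return 1; }
    case "$pw" in *[A-Z]*) ;; *) echo "needs uppercase"; return 1;; esac
    case "$pw" in *[a-z]*) ;; *) echo "needs lowercase"; return 1;; esac
    case "$pw" in *[0-9]*) ;; *) echo "needs a number"; return 1;; esac
    case "$pw" in *[^a-zA-Z0-9]*) ;; *) echo "needs a special char"; return 1;; esac
    echo "ok"
}
</code></pre>
<p>Example: <code>check_password 'MyN8n@Pass2024!'</code> prints <code>ok</code>.</p>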
<p><strong>Workflow Execution Safety:</strong></p>
<p><code>EXECUTIONS_TIMEOUT=600</code></p>
<ul>
<li><strong>Purpose</strong>: Maximum execution time in seconds (600 = 10 minutes)</li>
<li><strong>Why?</strong> Prevents runaway workflows from consuming excessive resources</li>
<li><strong>Customization</strong>: Adjust based on your longest legitimate workflow duration</li>
</ul>
<p><code>EXECUTIONS_DATA_MAX_AGE=168</code></p>
<ul>
<li><strong>Purpose</strong>: How long to keep execution history (168 hours = 7 days)</li>
<li><strong>Why 7 days?</strong> Balances troubleshooting needs with database size management</li>
<li><strong>Automatic cleanup</strong>: Prevents database from growing infinitely</li>
</ul>
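<p>Note that the age limit only matters if pruning is switched on. A sketch of the related settings (variable names as I understand them from the n8n docs — double-check against your n8n version):</p>
<pre><code class="lang-env"># Enable automatic deletion of old execution data
EXECUTIONS_DATA_PRUNE=true
# Delete execution data older than 7 days (value in hours)
EXECUTIONS_DATA_MAX_AGE=168
</code></pre>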
<hr />
<h2 id="heading-step-5-deployment-bringing-it-all-together">Step 5: Deployment - Bringing It All Together</h2>
<p>With all configuration in place, it's time to deploy. Here's the step-by-step process I followed:</p>
<h3 id="heading-pre-deployment-checklist">Pre-Deployment Checklist</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Verify Docker and Docker Compose are installed</span>
docker --version
docker compose version

<span class="hljs-comment"># Ensure you're in the deployment directory</span>
<span class="hljs-built_in">cd</span> /home/user/n8n

<span class="hljs-comment"># Verify all required files exist</span>
ls -la
<span class="hljs-comment"># Expected: docker-compose.yml, Caddyfile, .env</span>
</code></pre>
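<p>The same checklist can be wrapped in a small function you can paste into your shell so the check is repeatable — a sketch; the file list matches this deployment, so adjust it if yours differs:</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# preflight.sh — verify the required deployment files exist before
# running `docker compose up -d`. Pass the deployment directory as $1
# (defaults to the current directory).
preflight() {
    local dir="${1:-.}" missing=0 f
    for f in docker-compose.yml Caddyfile .env; do
        if [ ! -f "$dir/$f" ]; then
            echo "MISSING: $f"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "all required files present"
    return "$missing"
}
</code></pre>
<p>Run <code>preflight /home/user/n8n</code> before bringing the stack up.</p>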
<h3 id="heading-initial-deployment">Initial Deployment</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Start all services in detached mode (background)</span>
docker compose up -d
</code></pre>
<p><strong>What Happens:</strong></p>
<ol>
<li><p><strong>Network Creation</strong>:</p>
<pre><code>[+] Network n8n_internal   Created
[+] Network n8n_default    Created
</code></pre><p>Two isolated networks established for security</p>
</li>
<li><p><strong>Container Startup</strong>:</p>
<pre><code>[+] Container n8n_postgres   Started
[+] Container n8n_app        Starting... (waiting <span class="hljs-keyword">for</span> postgres health)
[+] Container n8n_caddy      Starting... (waiting <span class="hljs-keyword">for</span> n8n)
</code></pre></li>
<li><p><strong>Health Checks</strong>:</p>
<ul>
<li>PostgreSQL health checks begin immediately</li>
<li>After 5 successful checks (~50 seconds), marked healthy</li>
<li>n8n starts connecting to database</li>
<li>Caddy starts accepting traffic</li>
</ul>
</li>
</ol>
<h3 id="heading-verify-deployment-success">Verify Deployment Success</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Check container status</span>
docker compose ps
</code></pre>
<p><strong>Expected Output:</strong></p>
<pre><code>NAME              IMAGE                 STATUS
n8n_postgres      postgres:<span class="hljs-number">16</span>-alpine    Up (healthy)
n8n_app           n8nio/n8n:stable      Up
n8n_caddy         caddy:<span class="hljs-number">2</span>-alpine        Up
</code></pre><p><strong>All containers should show "Up" status.</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># View startup logs</span>
docker compose logs -f
</code></pre>
<p><strong>Look for:</strong></p>
<ul>
<li>PostgreSQL: <code>database system is ready to accept connections</code></li>
<li>n8n: <code>Editor is now accessible via: http://...</code></li>
<li>Caddy: <code>serving initial configuration</code></li>
</ul>
<p><strong>Press Ctrl+C to stop viewing logs (containers continue running)</strong></p>
<h3 id="heading-access-your-n8n-instance">Access Your n8n Instance</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Get your server's public IP</span>
curl -4 ifconfig.me
</code></pre>
<p><strong>Open in browser:</strong></p>
<pre><code>http:<span class="hljs-comment">//YOUR_SERVER_IP</span>
</code></pre><p><strong>Example:</strong> <code>http://203.0.113.12</code></p>
<h3 id="heading-first-time-setup">First-Time Setup</h3>
<p>When accessing n8n for the first time, you'll complete the initial owner account creation:</p>
<ol>
<li><strong>Email address</strong> for owner account</li>
<li><strong>Strong password</strong> (must meet policy requirements from <code>.env</code>)</li>
<li><strong>Workspace name</strong> (optional)</li>
<li><strong>Usage preferences</strong> (optional telemetry)</li>
</ol>
<p><strong>This is your admin (owner) account. Credentials you store in n8n are encrypted at rest with <code>N8N_ENCRYPTION_KEY</code>.</strong></p>
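<p>On that note, if you ever need to mint the encryption key yourself: as I understand it, n8n accepts any sufficiently long random string and will auto-generate a key on first start if the variable is unset, so this is optional — a quick way to generate one from <code>/dev/urandom</code>:</p>
<pre><code class="lang-bash"># Generate a 32-byte random value, base64-encoded (44 characters),
# suitable for use as N8N_ENCRYPTION_KEY in .env.
KEY=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
echo "N8N_ENCRYPTION_KEY=$KEY"
</code></pre>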
<hr />
<h2 id="heading-need-help-with-n8n-deployment-or-custom-automation">Need Help with n8n Deployment or Custom Automation?</h2>
<p>If you're looking for professional assistance with:</p>
<ul>
<li><strong>n8n Installation &amp; Configuration</strong>: Production-ready deployments with security best practices</li>
<li><strong>Custom Workflow Design</strong>: Tailored automation solutions for your specific business needs</li>
<li><strong>Migration Services</strong>: Moving from Zapier, Make.com, or other platforms to self-hosted n8n</li>
<li><strong>Ongoing n8n Management</strong>: Server maintenance, updates, monitoring, and troubleshooting</li>
<li><strong>Process Automation Consulting</strong>: Identifying automation opportunities in your business</li>
</ul>
<p><strong>I can help!</strong> With experience in server administration and proven expertise in n8n automation (40% reduction in manual tasks for my current organization), I specialize in designing and implementing workflow automation that drives real business value.</p>
<p>📧 <strong>Email</strong>: push1697@gmail.com<br />💼 <strong>LinkedIn</strong>: <a target="_blank" href="https://linkedin.com/in/pushpendra16">linkedin.com/in/pushpendra16</a><br />📱 <strong>WhatsApp</strong>: +91 8619274820<br />🌐 <strong>Location</strong>: Jaipur, Rajasthan, India (Remote services available)</p>
<hr />
<h2 id="heading-real-world-deployment-challenges-lessons-learned">Real-World Deployment Challenges: Lessons Learned</h2>
<p>During my actual deployment, I encountered several issues that taught me valuable lessons about production Docker deployments. Here's what went wrong and how I fixed it.</p>
<h3 id="heading-challenge-1-permission-errors-n8n-container">Challenge 1: Permission Errors (n8n Container)</h3>
<p><strong>Error Encountered:</strong></p>
<pre><code><span class="hljs-built_in">Error</span>: EACCES: permission denied, open <span class="hljs-string">'/home/node/.n8n/config'</span>
</code></pre><p><strong>What Happened:</strong>
The n8n container runs as user <code>node</code> with UID 1000. The mounted <code>./data/n8n</code> directory had restrictive permissions that prevented the container from writing configuration files.</p>
<p><strong>Root Cause:</strong>
When using bind mounts (local directories), the container user must have write permissions to the mounted directory. Docker doesn't automatically handle this like it does with named volumes.</p>
<p><strong>Solution:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Grant full permissions to n8n data directory</span>
chmod -R 777 data/n8n
</code></pre>
<p><strong>Better Solution (More Secure):</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Set ownership to UID 1000 (n8n container user)</span>
sudo chown -R 1000:1000 data/n8n
chmod -R 755 data/n8n
</code></pre>
<p><strong>Lesson Learned:</strong>
Always consider container user IDs when using bind mounts. Check the container's documentation for the default user UID/GID.</p>
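<p>A quick way to act on that lesson is to check a bind-mount's owner before starting the stack. A sketch (uses GNU <code>stat</code>, so Linux only; 1000 is the n8n image's default UID as noted above):</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# check_owner.sh — compare a bind-mount directory's owner UID against
# the UID the container runs as. Uses GNU stat's -c format option.
check_owner() {
    local dir="$1" want="$2" have
    have=$(stat -c '%u' "$dir") || return 2
    if [ "$have" = "$want" ]; then
        echo "ok: $dir owned by uid $want"
    else
        echo "mismatch: $dir owned by uid $have, expected $want"
        return 1
    fi
}
</code></pre>
<p>Example: <code>check_owner ./data/n8n 1000</code> before <code>docker compose up -d</code>.</p>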
<h3 id="heading-challenge-2-port-binding-error-caddy">Challenge 2: Port Binding Error (Caddy)</h3>
<p><strong>Error Encountered:</strong></p>
<pre><code><span class="hljs-built_in">Error</span>: cannot expose privileged port <span class="hljs-number">80</span>: permission denied
</code></pre><p><strong>What Happened:</strong>
My Docker installation was running in rootless mode (security-enhanced). Rootless Docker cannot bind to privileged ports (&lt; 1024) without special system configuration.</p>
<p><strong>Root Cause:</strong>
Linux restricts binding to ports below 1024 to root user. Rootless Docker intentionally runs without root privileges for enhanced security.</p>
<p><strong>Initial Workaround:</strong>
Modified <code>docker-compose.yml</code> to use non-privileged ports:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:80"</span>   <span class="hljs-comment"># Changed from 80:80</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">"8443:443"</span>  <span class="hljs-comment"># Changed from 443:443</span>
</code></pre>
<p><strong>Permanent Solution:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Allow unprivileged ports system-wide</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">'net.ipv4.ip_unprivileged_port_start=80'</span> | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

<span class="hljs-comment"># Revert docker-compose.yml to standard ports</span>
<span class="hljs-comment"># Then restart</span>
docker compose down
docker compose up -d
</code></pre>
<p><strong>Lesson Learned:</strong>
Security features (like rootless Docker) sometimes conflict with standard port conventions. Understanding the trade-offs between security and convenience is crucial for production deployments.</p>
<h3 id="heading-challenge-3-caddyfile-syntax-error">Challenge 3: Caddyfile Syntax Error</h3>
<p><strong>Error Encountered:</strong></p>
<pre><code><span class="hljs-built_in">Error</span>: unrecognized <span class="hljs-built_in">global</span> option: reverse_proxy
</code></pre><p><strong>What Happened:</strong>
Initially, I attempted to use environment variable substitution in the Caddyfile, which caused syntax confusion.</p>
<p><strong>Initial (Broken) Configuration:</strong></p>
<pre><code class="lang-caddyfile">${N8N_HOST} {
    reverse_proxy n8n:5678
}
</code></pre>
<p><strong>Root Cause:</strong>
Caddyfile doesn't use docker-compose-style <code>${VAR}</code> substitution; its own placeholder syntax is <code>{$VAR}</code>. The unexpanded <code>${N8N_HOST}</code> was treated as a literal string, so Caddy parsed the block as global options and rejected <code>reverse_proxy</code>.</p>
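<p>For completeness: if you do want the hostname to come from the environment, Caddy has its own placeholder syntax, <code>{$VAR}</code>, substituted when the Caddyfile is parsed. A sketch — this assumes <code>N8N_HOST</code> holds a bare hostname and is passed into the Caddy container's environment:</p>
<pre><code class="lang-caddyfile">{$N8N_HOST} {
    reverse_proxy n8n:5678
}
</code></pre>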
<p><strong>Solution:</strong>
Use explicit configuration based on deployment type:</p>
<p><strong>For IP-based access:</strong></p>
<pre><code class="lang-caddyfile">:80 {
    reverse_proxy n8n:5678 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}
</code></pre>
<p><strong>For domain-based access:</strong></p>
<pre><code class="lang-caddyfile">n8n.yourdomain.com {
    reverse_proxy n8n:5678 {
        # same headers
    }
}
</code></pre>
<p><strong>Lesson Learned:</strong>
Configuration file syntaxes vary between tools. What works in docker-compose.yml doesn't necessarily work in Caddyfile. Always reference official documentation for each component.</p>
<h3 id="heading-challenge-4-docker-compose-version-warning">Challenge 4: Docker Compose Version Warning</h3>
<p><strong>Warning Encountered:</strong></p>
<pre><code>WARN[<span class="hljs-number">0000</span>] the attribute <span class="hljs-string">'version'</span> is obsolete
</code></pre><p><strong>What Happened:</strong>
Modern Docker Compose (v2.x) no longer requires or uses the <code>version:</code> field at the top of docker-compose.yml.</p>
<p><strong>Original File:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">postgres:</span>
    <span class="hljs-comment"># ...</span>
</code></pre>
<p><strong>Solution:</strong>
Simply removed the version field:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">postgres:</span>
    <span class="hljs-comment"># ...</span>
</code></pre>
<p><strong>Why This Changed:</strong>
Docker Compose v2 automatically uses the latest spec features. The version field was only necessary for Docker Compose v1.x to determine feature compatibility.</p>
<p><strong>Lesson Learned:</strong>
Tools evolve. Configuration patterns that were best practices in 2020 may be obsolete in 2024. Stay current with the latest documentation.</p>
<h3 id="heading-final-working-configuration">Final Working Configuration</h3>
<p>After resolving all issues, here's what successfully deployed:</p>
<p><strong>Access Information:</strong></p>
<pre><code>http:<span class="hljs-comment">//123.164.126.34:8080  (using non-privileged ports)</span>
</code></pre><p><strong>Status:</strong></p>
<pre><code>NAME           IMAGE                 STATUS           PORTS
n8n_app        n8nio/n8n:stable      Up              <span class="hljs-number">5678</span>/tcp
n8n_caddy      caddy:<span class="hljs-number">2</span>-alpine        Up              <span class="hljs-number">0.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span>:<span class="hljs-number">8080</span>-&gt;<span class="hljs-number">80</span>/tcp, <span class="hljs-number">0.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span>:<span class="hljs-number">8443</span>-&gt;<span class="hljs-number">443</span>/tcp
n8n_postgres   postgres:<span class="hljs-number">16</span>-alpine    Up (healthy)    <span class="hljs-number">5432</span>/tcp
</code></pre><p><strong>All containers running successfully! ✅</strong></p>
<hr />
<h2 id="heading-real-world-automation-examples-ive-built">Real-World Automation Examples I've Built</h2>
<p>As someone who actively uses n8n in production environments, here are some automation workflows I've designed and implemented:</p>
<h3 id="heading-1-email-to-ticket-automation-system">1. Email-to-Ticket Automation System</h3>
<p><strong>Problem</strong>: Support requests from multiple email accounts needed manual consolidation<br /><strong>Solution</strong>: n8n workflow monitoring multiple IMAP mailboxes, creating tickets in project management system with intelligent categorization<br /><strong>Result</strong>: 60% reduction in ticket processing time, zero missed support requests</p>
<h3 id="heading-2-cross-platform-data-synchronization">2. Cross-Platform Data Synchronization</h3>
<p><strong>Problem</strong>: Customer data scattered across Google Sheets, CRM, and accounting software<br /><strong>Solution</strong>: Bi-directional sync workflows with conflict resolution and audit logging<br /><strong>Result</strong>: Single source of truth for customer data, eliminated duplicate entry work</p>
<h3 id="heading-3-ai-powered-content-moderation">3. AI-Powered Content Moderation</h3>
<p><strong>Problem</strong>: Manual review of user-generated content was time-consuming<br /><strong>Solution</strong>: n8n workflow integrating AI APIs for content analysis, automatic flagging, and notification system<br /><strong>Result</strong>: 85% of content automatically processed, moderation team focuses only on flagged items</p>
<h3 id="heading-4-automated-backup-amp-reporting-pipeline">4. Automated Backup &amp; Reporting Pipeline</h3>
<p><strong>Problem</strong>: Weekly server backups and reports required manual execution<br /><strong>Solution</strong>: Scheduled n8n workflows with error handling, Slack notifications, and report generation<br /><strong>Result</strong>: 100% backup reliability, management receives automated insights every Monday</p>
<p><strong>Want similar automation for your business?</strong> These are just examples; every business has unique processes that can benefit from intelligent automation. Let's discuss how n8n can transform your operations.</p>
<p>📧 <strong>Contact me</strong>: push1697@gmail.com</p>
<hr />
<h2 id="heading-migration-path-from-ip-to-domain-with-https">Migration Path: From IP to Domain with HTTPS</h2>
<p>One of the design goals was making it easy to transition from initial IP-based deployment to production domain-based deployment with automatic HTTPS. Here's how this migration works seamlessly.</p>
<h3 id="heading-current-state-ip-based-access">Current State (IP-Based Access)</h3>
<p><strong>Configuration:</strong></p>
<pre><code class="lang-env">N8N_HOST=:80
N8N_PROTOCOL=http
</code></pre>
<p><strong>Access:</strong> <code>http://203.0.113.12:8080</code></p>
<p><strong>Limitations:</strong></p>
<ul>
<li>No encryption (HTTP only)</li>
<li>IP address not user-friendly</li>
<li>No automatic SSL certificate management</li>
</ul>
<h3 id="heading-migration-steps-to-domain-based-https">Migration Steps to Domain-Based HTTPS</h3>
<h4 id="heading-step-1-configure-dns">Step 1: Configure DNS</h4>
<p>Point your domain's A record to your server IP:</p>
<p><strong>DNS Configuration:</strong></p>
<pre><code>Type: A
<span class="hljs-attr">Name</span>: n8n
<span class="hljs-attr">Value</span>: 203.0.113.12
<span class="hljs-attr">TTL</span>: <span class="hljs-number">3600</span>
</code></pre><p><strong>Result:</strong> <code>n8n.yourdomain.com</code> → <code>203.0.113.12</code></p>
<p><strong>Verify DNS Propagation:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check resolution</span>
nslookup n8n.yourdomain.com

<span class="hljs-comment"># Alternative verification</span>
dig n8n.yourdomain.com +short
</code></pre>
<p><strong>Expected Output:</strong> <code>203.0.113.12</code></p>
<p><strong>Wait Time:</strong> DNS propagation typically takes 5-15 minutes, can be up to 48 hours in rare cases</p>
<h4 id="heading-step-2-update-environment-configuration">Step 2: Update Environment Configuration</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># Edit .env file</span>
nano .env
</code></pre>
<p><strong>Change from IP-based:</strong></p>
<pre><code class="lang-env">N8N_HOST=:80
N8N_PROTOCOL=http
</code></pre>
<p><strong>To domain-based:</strong></p>
<pre><code class="lang-env">N8N_HOST=n8n.yourdomain.com
N8N_PROTOCOL=https
</code></pre>
<p><strong>Also update CORS if configured:</strong></p>
<pre><code class="lang-env">N8N_ALLOWED_ORIGINS=https://n8n.yourdomain.com
</code></pre>
<p><strong>Save:</strong> Ctrl+O, Enter, Ctrl+X</p>
<h4 id="heading-step-3-update-caddyfile">Step 3: Update Caddyfile</h4>
<p>Edit <code>Caddyfile</code>:</p>
<pre><code class="lang-bash">nano Caddyfile
</code></pre>
<p><strong>Comment out IP-based block:</strong></p>
<pre><code class="lang-caddyfile"># :80 {
#     reverse_proxy n8n:5678 {
#         header_up Host {host}
#         header_up X-Real-IP {remote_host}
#         header_up X-Forwarded-For {remote_host}
#         header_up X-Forwarded-Proto {scheme}
#     }
# }
</code></pre>
<p><strong>Uncomment domain-based block:</strong></p>
<pre><code class="lang-caddyfile">n8n.yourdomain.com {
    reverse_proxy n8n:5678 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
}
</code></pre>
<h4 id="heading-step-4-restart-services">Step 4: Restart Services</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># Graceful restart</span>
docker compose down
docker compose up -d
</code></pre>
<h4 id="heading-step-5-watch-caddy-obtain-ssl-certificate">Step 5: Watch Caddy Obtain SSL Certificate</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># Monitor Caddy logs</span>
docker compose logs -f caddy
</code></pre>
<p><strong>Look for these log messages:</strong></p>
<pre><code>[INFO] Obtaining SSL certificate
[INFO] Validating domain ownership via HTTP<span class="hljs-number">-01</span> challenge
[INFO] Certificate obtained successfully
[INFO] Enabling automatic HTTPS
</code></pre><p><strong>This process takes 10-60 seconds depending on Let's Encrypt response time</strong></p>
<h4 id="heading-step-6-verify-https-access">Step 6: Verify HTTPS Access</h4>
<p><strong>Open in browser:</strong></p>
<pre><code>https:<span class="hljs-comment">//n8n.yourdomain.com</span>
</code></pre><p><strong>Verify:</strong></p>
<ul>
<li>Browser shows padlock icon 🔒</li>
<li>Certificate issued by "Let's Encrypt"</li>
<li>No certificate warnings</li>
<li>HTTP automatically redirects to HTTPS</li>
</ul>
<p><strong>Check certificate details:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Command-line verification</span>
curl -vI https://n8n.yourdomain.com 2&gt;&amp;1 | grep -E <span class="hljs-string">'SSL|TLS'</span>
</code></pre>
<h3 id="heading-what-caddy-does-automatically">What Caddy Does Automatically</h3>
<ol>
<li><strong>Certificate Request</strong>: Contacts Let's Encrypt ACME API</li>
<li><strong>Domain Validation</strong>: Responds to HTTP-01 challenge on port 80</li>
<li><strong>Certificate Installation</strong>: Stores certificate in <code>./data/caddy/data</code></li>
<li><strong>HTTPS Enablement</strong>: Configures TLS with modern cipher suites</li>
<li><strong>HTTP Redirect</strong>: Automatically redirects all HTTP traffic to HTTPS</li>
<li><strong>Renewal Scheduling</strong>: Schedules automatic renewal, roughly 30 days before the certificate's 90-day expiry</li>
<li><strong>OCSP Stapling</strong>: Enables for faster certificate validation</li>
</ol>
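<p>One small optional addition I'd suggest: register a contact email for your ACME account in the Caddyfile's global options block, so Let's Encrypt can notify you about certificate problems. The address below is a placeholder:</p>
<pre><code class="lang-caddyfile">{
    email admin@yourdomain.com
}

n8n.yourdomain.com {
    reverse_proxy n8n:5678
}
</code></pre>
<p>Restart Caddy after editing (<code>docker compose restart caddy</code>).</p>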
<p><strong>No manual intervention required for renewals! Caddy handles everything.</strong></p>
<h3 id="heading-migration-benefits">Migration Benefits</h3>
<p><strong>Zero Data Loss:</strong></p>
<ul>
<li>All workflows preserved</li>
<li>All credentials remain encrypted</li>
<li>Execution history intact</li>
<li>No database migration needed</li>
</ul>
<p><strong>No Downtime Required:</strong></p>
<ul>
<li>Can be done during low-traffic period</li>
<li>Total downtime: ~10 seconds (during restart)</li>
</ul>
<p><strong>Improved Security:</strong></p>
<ul>
<li>All traffic encrypted end-to-end</li>
<li>Protection against man-in-the-middle attacks</li>
<li>Automatic security header injection</li>
</ul>
<hr />
<h2 id="heading-backup-strategy-protecting-your-work">Backup Strategy: Protecting Your Work</h2>
<p>Production deployments require reliable backup strategies. Here's the comprehensive approach I implemented.</p>
<h3 id="heading-what-needs-backing-up">What Needs Backing Up</h3>
<p><strong>Critical Data:</strong></p>
<ol>
<li><strong>PostgreSQL Database</strong> - Workflows, credentials, execution history</li>
<li><strong>n8n Data Directory</strong> - Custom nodes, file storage, local configuration</li>
<li><strong>Caddy Data</strong> - SSL certificates (can be regenerated, but backup prevents rate limits)</li>
<li><strong>Configuration Files</strong> - <code>.env</code>, <code>docker-compose.yml</code>, <code>Caddyfile</code></li>
</ol>
<h3 id="heading-manual-backup-script">Manual Backup Script</h3>
<p>Save as <code>~/n8n-backup.sh</code>:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># n8n Complete Backup Script</span>

<span class="hljs-comment"># Configuration</span>
BACKUP_DIR=~/n8n-backups
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH=<span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>/<span class="hljs-variable">$TIMESTAMP</span>"</span>
N8N_DIR=/home/user/n8n

<span class="hljs-comment"># Create backup directory</span>
mkdir -p <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>"</span>

<span class="hljs-comment"># Navigate to deployment directory</span>
<span class="hljs-built_in">cd</span> <span class="hljs-string">"<span class="hljs-variable">$N8N_DIR</span>"</span> || <span class="hljs-built_in">exit</span> 1

<span class="hljs-comment"># Load environment variables for database credentials</span>
<span class="hljs-keyword">if</span> [ -f .env ]; <span class="hljs-keyword">then</span>
    <span class="hljs-comment"># set -a exports every variable the sourced file defines;</span>
    <span class="hljs-comment"># safer than export $(grep ... | xargs), which breaks on quoted values</span>
    <span class="hljs-built_in">set</span> -a
    . ./.env
    <span class="hljs-built_in">set</span> +a
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Backup PostgreSQL database (SQL dump)</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Backing up PostgreSQL database..."</span>
docker compose <span class="hljs-built_in">exec</span> -T postgres pg_dump -U <span class="hljs-string">"<span class="hljs-variable">${POSTGRES_USER}</span>"</span> <span class="hljs-string">"<span class="hljs-variable">${POSTGRES_DB}</span>"</span> &gt; <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/database.sql"</span>

<span class="hljs-comment"># Backup configuration files</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Backing up configuration files..."</span>
cp .env <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/.env"</span>
cp docker-compose.yml <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/docker-compose.yml"</span>
cp Caddyfile <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/Caddyfile"</span>

<span class="hljs-comment"># Backup data directories (compressed)</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Backing up data directories..."</span>
tar -czf <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/data-backup.tar.gz"</span> \
  --exclude=<span class="hljs-string">'./data/postgres/pgdata/postmaster.pid'</span> \
  --exclude=<span class="hljs-string">'./data/postgres/pgdata/*.pid'</span> \
  ./data

<span class="hljs-comment"># Calculate backup size</span>
BACKUP_SIZE=$(du -sh <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>"</span> | cut -f1)

<span class="hljs-comment"># Remove old backups (keep last 7 days)</span>
RETENTION_DAYS=7
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Removing backups older than <span class="hljs-variable">$RETENTION_DAYS</span> days..."</span>
find <span class="hljs-string">"<span class="hljs-variable">$BACKUP_DIR</span>"</span> -mindepth 1 -maxdepth 1 -<span class="hljs-built_in">type</span> d -mtime +<span class="hljs-variable">$RETENTION_DAYS</span> -<span class="hljs-built_in">exec</span> rm -rf {} \;  <span class="hljs-comment"># -mindepth 1 keeps find from ever matching $BACKUP_DIR itself</span>

<span class="hljs-comment"># Log completion</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"[<span class="hljs-subst">$(date)</span>] Backup completed: <span class="hljs-variable">$BACKUP_PATH</span> (Size: <span class="hljs-variable">$BACKUP_SIZE</span>)"</span>
ls -lh <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>"</span>
</code></pre>
<p><strong>Make executable:</strong></p>
<pre><code class="lang-bash">chmod +x ~/n8n-backup.sh
</code></pre>
<p><strong>Run manually:</strong></p>
<pre><code class="lang-bash">~/n8n-backup.sh
</code></pre>
<h3 id="heading-automated-daily-backups">Automated Daily Backups</h3>
<p><strong>Set up cron job for automatic backups:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Edit crontab</span>
crontab -e

<span class="hljs-comment"># Add this line (runs daily at 2 AM)</span>
0 2 * * * ~/n8n-backup.sh &gt;&gt; ~/n8n-backup.log 2&gt;&amp;1
</code></pre>
<p><strong>Verify crontab:</strong></p>
<pre><code class="lang-bash">crontab -l
</code></pre>
<p><strong>Check backup logs:</strong></p>
<pre><code class="lang-bash">tail -f ~/n8n-backup.log
</code></pre>
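<p>Beyond checking the log, I also like to confirm the newest backup archive is actually readable. A sketch that complements the backup script above — it assumes the same layout that script produces (<code>$BACKUP_DIR/&lt;timestamp&gt;/data-backup.tar.gz</code>):</p>
<pre><code class="lang-bash">#!/usr/bin/env bash
# verify_backup.sh — find the newest backup directory and check that
# its data archive lists cleanly with tar.
verify_latest_backup() {
    local backup_dir="${1:-$HOME/n8n-backups}" latest archive
    # Timestamped directory names sort chronologically, so the last one is newest
    latest=$(ls -1d "$backup_dir"/*/ | sort | tail -n 1)
    [ -n "$latest" ] || { echo "no backups found"; return 1; }
    archive="${latest}data-backup.tar.gz"
    [ -f "$archive" ] || { echo "missing archive in $latest"; return 1; }
    if tar -tzf "$archive" | grep -q .; then
        echo "ok: $archive is readable"
    else
        echo "corrupt archive: $archive"
        return 1
    fi
}
</code></pre>
<p>Run <code>verify_latest_backup</code> after the nightly cron job, or point it at another directory with <code>verify_latest_backup /path/to/backups</code>.</p>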
<h3 id="heading-restore-from-backup">Restore from Backup</h3>
<p>Save as <code>~/n8n-restore.sh</code>:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment"># n8n Restore Script</span>

<span class="hljs-comment"># Configuration</span>
BACKUP_DATE=<span class="hljs-string">"20240101_120000"</span>  <span class="hljs-comment"># Change to your backup timestamp</span>
BACKUP_PATH=~/n8n-backups/<span class="hljs-variable">$BACKUP_DATE</span>
N8N_DIR=/home/user/n8n

<span class="hljs-comment"># Navigate to deployment directory</span>
<span class="hljs-built_in">cd</span> <span class="hljs-string">"<span class="hljs-variable">$N8N_DIR</span>"</span> || <span class="hljs-built_in">exit</span> 1

<span class="hljs-comment"># Stop all services</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Stopping services..."</span>
docker compose down

<span class="hljs-comment"># Backup current data (safety measure)</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Creating safety backup of current data..."</span>
<span class="hljs-keyword">if</span> [ -d data ]; <span class="hljs-keyword">then</span>
    mv data <span class="hljs-string">"data.old.<span class="hljs-subst">$(date +%Y%m%d_%H%M%S)</span>"</span>
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># Restore configuration files</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Restoring configuration files..."</span>
cp <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/.env"</span> ./
cp <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/docker-compose.yml"</span> ./
cp <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/Caddyfile"</span> ./

<span class="hljs-comment"># Restore data directories</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Restoring data directories..."</span>
tar -xzf <span class="hljs-string">"<span class="hljs-variable">$BACKUP_PATH</span>/data-backup.tar.gz"</span> -C <span class="hljs-string">"<span class="hljs-variable">$N8N_DIR</span>"</span>

<span class="hljs-comment"># Fix permissions</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Fixing permissions..."</span>
sudo chown -R <span class="hljs-variable">$USER</span>:<span class="hljs-variable">$USER</span> ./data
<span class="hljs-comment"># n8n's directory must end up owned by UID 1000 (the container user),</span>
<span class="hljs-comment"># so set it AFTER the broader chown, not before</span>
sudo chown -R 1000:1000 ./data/n8n
chmod -R 755 ./data

<span class="hljs-comment"># Start services</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Starting services..."</span>
docker compose up -d

<span class="hljs-comment"># Wait for services</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Waiting for services to initialize..."</span>
sleep 15

<span class="hljs-comment"># Check status</span>
docker compose ps

<span class="hljs-built_in">echo</span> <span class="hljs-string">"================================"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Restore completed from backup: <span class="hljs-variable">$BACKUP_DATE</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Previous data backed up to: data.old.*"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Verify everything works, then remove old data:"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"  rm -rf data.old.*"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"================================"</span>
</code></pre>
<p><strong>Make executable:</strong></p>
<pre><code class="lang-bash">chmod +x ~/n8n-restore.sh
</code></pre>
<p><strong>To restore:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Edit script to set BACKUP_DATE variable</span>
nano ~/n8n-restore.sh

<span class="hljs-comment"># Run restore</span>
~/n8n-restore.sh
</code></pre>
<hr />
<h2 id="heading-monitoring-and-maintenance">Monitoring and Maintenance</h2>
<h3 id="heading-viewing-logs">Viewing Logs</h3>
<p><strong>All services:</strong></p>
<pre><code class="lang-bash">docker compose logs -f
</code></pre>
<p><strong>Specific service:</strong></p>
<pre><code class="lang-bash">docker compose logs -f n8n
docker compose logs -f postgres
docker compose logs -f caddy
</code></pre>
<p><strong>Last 100 lines:</strong></p>
<pre><code class="lang-bash">docker compose logs --tail=100 n8n
</code></pre>
<p><strong>Filter by time:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Logs from last hour</span>
docker compose logs --since=1h n8n
</code></pre>
<h3 id="heading-container-health-monitoring">Container Health Monitoring</h3>
<p><strong>Quick status:</strong></p>
<pre><code class="lang-bash">docker compose ps
</code></pre>
<p><strong>Resource usage:</strong></p>
<pre><code class="lang-bash">docker stats
</code></pre>
<p><strong>Detailed container inspection:</strong></p>
<pre><code class="lang-bash">docker inspect n8n_app
</code></pre>
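<p><strong>Health status only</strong> (a quick sketch using Docker's Go-template formatting; it assumes the container defines a healthcheck):</p>
<pre><code class="lang-bash"># Print just "healthy", "unhealthy", or "starting"
docker inspect --format '{{.State.Health.Status}}' n8n_app
</code></pre>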
<h3 id="heading-database-maintenance">Database Maintenance</h3>
<p><strong>Access PostgreSQL CLI:</strong></p>
<pre><code class="lang-bash">docker compose <span class="hljs-built_in">exec</span> postgres psql -U n8n_user -d n8n_db
</code></pre>
<p><strong>Useful database commands:</strong></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Check database size</span>
<span class="hljs-keyword">SELECT</span> pg_size_pretty(pg_database_size(<span class="hljs-string">'n8n_db'</span>));

<span class="hljs-comment">-- Check table sizes</span>
<span class="hljs-keyword">SELECT</span>
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||<span class="hljs-string">'.'</span>||tablename)) <span class="hljs-keyword">AS</span> <span class="hljs-keyword">size</span>
<span class="hljs-keyword">FROM</span> pg_tables
<span class="hljs-keyword">WHERE</span> schemaname = <span class="hljs-string">'public'</span>
<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> pg_total_relation_size(schemaname||<span class="hljs-string">'.'</span>||tablename) <span class="hljs-keyword">DESC</span>;

<span class="hljs-comment">-- Vacuum and analyze (optimize performance)</span>
VACUUM <span class="hljs-keyword">ANALYZE</span>;

<span class="hljs-comment">-- Exit</span>
\q
</code></pre>
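<p>To run the cleanup on a schedule instead of typing it interactively, the same <code>VACUUM ANALYZE</code> can be issued as a one-liner, which is handy for a monthly cron job (a sketch, using the service name and credentials from this guide):</p>
<pre><code class="lang-bash"># Run VACUUM ANALYZE without opening an interactive psql session
# (-T disables the pseudo-TTY so this also works from cron)
docker compose exec -T postgres psql -U n8n_user -d n8n_db -c "VACUUM ANALYZE;"
</code></pre>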
<h3 id="heading-updating-n8n">Updating n8n</h3>
<p><strong>Check for updates:</strong></p>
<pre><code class="lang-bash">docker compose pull
</code></pre>
<p><strong>Apply updates:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create backup first</span>
~/n8n-backup.sh

<span class="hljs-comment"># Stop services</span>
docker compose down

<span class="hljs-comment"># Start with new images</span>
docker compose up -d

<span class="hljs-comment"># Verify new version</span>
docker compose <span class="hljs-built_in">exec</span> n8n n8n --version
</code></pre>
<hr />
<h2 id="heading-professional-n8n-services-available">Professional n8n Services Available</h2>
<p><strong>Don't want to manage this yourself?</strong> I offer comprehensive n8n services for businesses:</p>
<h3 id="heading-deployment-services">🚀 Deployment Services</h3>
<ul>
<li>Production-ready n8n installation with security hardening</li>
<li>AWS/GCP/Azure cloud deployment</li>
<li>High-availability configurations</li>
<li>Custom domain setup with SSL</li>
</ul>
<h3 id="heading-workflow-design-amp-implementation">🔧 Workflow Design &amp; Implementation</h3>
<ul>
<li>Business process analysis and automation strategy</li>
<li>Custom workflow development for your specific needs</li>
<li>Integration with existing tools (CRM, ERP, Marketing platforms)</li>
<li>API integration and custom node development</li>
</ul>
<h3 id="heading-managed-services">🛡️ Managed Services</h3>
<ul>
<li>24/7 monitoring and incident response</li>
<li>Regular updates and security patches</li>
<li>Performance optimization</li>
<li>Backup management and disaster recovery</li>
</ul>
<h3 id="heading-training-amp-consultation">📚 Training &amp; Consultation</h3>
<ul>
<li>Team training on n8n best practices</li>
<li>Workflow design workshops</li>
<li>Technical documentation</li>
<li>Ongoing support and troubleshooting</li>
</ul>
<p><strong>Pricing</strong>: Flexible packages available based on your requirements<br /><strong>Response Time</strong>: Within 24 hours for inquiries<br /><strong>Experience</strong>: 2+ years managing production n8n deployments</p>
<p>📧 <strong>Email</strong>: push1697@gmail.com<br />💼 <strong>LinkedIn</strong>: <a target="_blank" href="https://linkedin.com/in/pushpendra16">linkedin.com/in/pushpendra16</a><br />📱 <strong>WhatsApp</strong>: +91 8619274820</p>
<hr />
<h2 id="heading-manageability-checklist">Manageability Checklist</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Action</td><td>Tool / Method</td><td>Frequency</td><td>Benefit</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Backups</strong></td><td>Automated cron script</td><td>Daily at 2 AM</td><td>Quick recovery from failures</td></tr>
<tr>
<td><strong>Updates</strong></td><td><code>docker compose pull &amp;&amp; docker compose up -d</code></td><td>Monthly</td><td>Security patches, new features</td></tr>
<tr>
<td><strong>Log Monitoring</strong></td><td><code>docker compose logs -f</code></td><td>As needed</td><td>Debugging, performance tracking</td></tr>
<tr>
<td><strong>Health Checks</strong></td><td><code>docker compose ps</code></td><td>Weekly</td><td>Early problem detection</td></tr>
<tr>
<td><strong>Database Vacuum</strong></td><td>PostgreSQL VACUUM</td><td>Monthly</td><td>Maintain query performance</td></tr>
<tr>
<td><strong>SSL Renewal</strong></td><td>Caddy automatic</td><td>Automatic</td><td>Continuous HTTPS availability</td></tr>
<tr>
<td><strong>Disk Space</strong></td><td><code>df -h</code> &amp; <code>docker system df</code></td><td>Weekly</td><td>Prevent storage issues</td></tr>
<tr>
<td><strong>Security Audit</strong></td><td>Review <code>.env</code> settings</td><td>Quarterly</td><td>Maintain security posture</td></tr>
</tbody>
</table>
</div><hr />
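<p>The disk-space row above is easy to automate. Here's a minimal cron-able sketch; the 80% threshold is an arbitrary example value, so adjust it for your server:</p>
<pre><code class="lang-bash">#!/bin/sh
# Warn when the root filesystem crosses a usage threshold
THRESHOLD=80  # percent; example value
USAGE=$(df -P / | awk 'NR==2 {gsub(/%/, "", $5); print $5}')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "WARNING: root filesystem at ${USAGE}% capacity"
else
    echo "OK: root filesystem at ${USAGE}% capacity"
fi
</code></pre>
<p>Pair it with <code>docker system df</code> to see how much of that space is taken by images, containers, and volumes.</p>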
<h2 id="heading-key-takeaways-from-this-project">Key Takeaways from This Project</h2>
<h3 id="heading-technical-accomplishments">Technical Accomplishments</h3>
<ol>
<li><strong>Production-Ready Architecture</strong>: Deployed a multi-container application with proper network isolation and security</li>
<li><strong>Automatic HTTPS</strong>: Implemented zero-configuration SSL with automatic renewal</li>
<li><strong>Data Persistence</strong>: Configured durable storage for database and application data</li>
<li><strong>Security Best Practices</strong>: Encrypted credentials, strong password policies, session management</li>
<li><strong>Operational Excellence</strong>: Automated backups, comprehensive logging, easy updates</li>
</ol>
<h3 id="heading-lessons-learned">Lessons Learned</h3>
<p><strong>Docker Fundamentals:</strong></p>
<ul>
<li>Understanding container user IDs and filesystem permissions</li>
<li>Difference between bind mounts and named volumes</li>
<li>Importance of health checks for service dependencies</li>
<li>Network isolation for security</li>
</ul>
<p><strong>Configuration Management:</strong></p>
<ul>
<li>Keep sensitive data in <code>.env</code> files (never commit to git)</li>
<li>Each tool has its own syntax (docker-compose vs Caddyfile)</li>
<li>Version specifications matter (stable vs latest tags)</li>
<li>Documentation is your friend: always reference the official docs</li>
</ul>
<p><strong>Production Considerations:</strong></p>
<ul>
<li>Security isn't optional: encryption keys, password policies, and session management all matter</li>
<li>Backups aren't optional either: automate them from day one</li>
<li>Monitoring and logging are essential for debugging production issues</li>
<li>Always have a rollback plan (backups, version pinning)</li>
</ul>
<p><strong>Real-World Challenges:</strong></p>
<ul>
<li>Things break in unexpected ways (permission errors, port conflicts)</li>
<li>Troubleshooting skills are as important as initial setup knowledge</li>
<li>Understanding the "why" behind configurations helps fix issues faster</li>
<li>Community resources and documentation are invaluable</li>
</ul>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>This n8n deployment project represents a complete journey through modern DevOps practices: from architecture design to production deployment, and from handling real-world errors to implementing operational best practices.</p>
<p><strong>What makes this deployment production-ready:</strong></p>
<ul>
<li>✅ Robust database backend (PostgreSQL instead of SQLite)</li>
<li>✅ Automated security (Caddy with Let's Encrypt SSL)</li>
<li>✅ Data encryption (N8N_ENCRYPTION_KEY for credentials)</li>
<li>✅ Network isolation (internal network for database)</li>
<li>✅ Automated backups (daily cron job with retention policy)</li>
<li>✅ Comprehensive monitoring (logs, health checks, resource metrics)</li>
<li>✅ Easy migration path (IP to domain without data loss)</li>
<li>✅ Disaster recovery plan (restore scripts and procedures)</li>
</ul>
<p><strong>Access your deployed n8n instance:</strong></p>
<pre><code>https://n8n.yourdomain.com
</code></pre><p>Start automating workflows, connecting APIs, and building the integrations that make businesses more efficient.</p>
<p><strong>Happy Automating! 🚀</strong></p>
<hr />
<h2 id="heading-additional-resources">Additional Resources</h2>
<ul>
<li><strong>n8n Official Documentation</strong>: https://docs.n8n.io/</li>
<li><strong>Docker Documentation</strong>: https://docs.docker.com/</li>
<li><strong>PostgreSQL Documentation</strong>: https://www.postgresql.org/docs/</li>
<li><strong>Caddy Documentation</strong>: https://caddyserver.com/docs/</li>
<li><strong>n8n Community Forum</strong>: https://community.n8n.io/</li>
<li><strong>n8n Workflow Templates</strong>: https://n8n.io/workflows/</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Learn YAML Fast: Your First Step to DevOps Mastery]]></title><description><![CDATA[Series Goal: The ultimate guide to mastering YAML syntax, Docker Compose, and Kubernetes manifests for aspiring engineers.


🎯 Introduction: From Data Format to Automation Language
Imagine this: You're scrolling through a Kubernetes manifest, then h...]]></description><link>https://blog.overflowbyte.cloud/learn-yaml-fast-your-first-step-to-devops-mastery</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/learn-yaml-fast-your-first-step-to-devops-mastery</guid><category><![CDATA[YAML]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[basics]]></category><category><![CDATA[#codenewbies]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Thu, 06 Nov 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763017864876/76c3171b-ca2a-4942-bddf-70bdcc9c2fe1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>Series Goal</strong>: The ultimate guide to mastering YAML syntax, Docker Compose, and Kubernetes manifests for aspiring engineers.</p>
</blockquote>
<hr />
<h2 id="heading-introduction-from-data-format-to-automation-language">🎯 Introduction: From Data Format to Automation Language</h2>
<p>Imagine this: You're scrolling through a Kubernetes manifest, then hop over to a Docker Compose file, peek at some Ansible playbooks, and glance at a GitHub Actions workflow. What's the one thing they all have in common? </p>
<p><strong>YAML. YAML everywhere.</strong></p>
<p>It's like that friend who somehow shows up at <em>every</em> party. From Kubernetes manifests and Ansible playbooks to GitHub Actions, ArgoCD, and Docker Compose files, YAML has gone from a humble data format to the <strong>de facto language of automation</strong> and infrastructure declaration.</p>
<h3 id="heading-the-origin-story">The Origin Story</h3>
<p>YAML stands for "YAML Ain't Markup Language" (yes, it's a recursive acronym; geeks love recursion 🤓). First released in 2001, its name is a cheeky rebellion against document-centric languages like HTML or XML. The message? <em>"We're all about the data, baby."</em></p>
<p>Back in the early 2000s, XML and JSON were the cool kids on the block, dominating data serialization. But YAML had a secret weapon that the others hadn't prioritized: <strong>human-friendliness</strong>.</p>
<p>As automation tools evolved, they needed a way for humans to declare their intent, the desired state of a system, in a format that was:</p>
<ul>
<li>✅ Clear and readable</li>
<li>✅ Easy to audit</li>
<li>✅ Maintainable without a PhD in bracket-matching</li>
</ul>
<p>When Ansible adopted YAML for its automation "playbooks," and later when Kubernetes made it the default for application "manifests," YAML's fate was sealed. It quietly became the backbone of configuration management and the cloud-native ecosystem.</p>
<p><strong>Fun fact</strong>: YAML is so human-friendly that even non-technical folks can <em>almost</em> understand what's happening. Try that with XML! 😅</p>
<hr />
<h2 id="heading-a-tale-of-three-formats-the-great-config-wars">🥊 A Tale of Three Formats: The Great Config Wars</h2>
<p>Think of this as the "Game of Thrones" of data formats, except instead of dragons and ice zombies, we have angle brackets and curly braces. To understand why YAML won the throne, we need to meet the competition.</p>
<h3 id="heading-xml-extensible-markup-language">👴 XML (Extensible Markup Language)</h3>
<p><strong>The Grandfather of Data Serialization</strong></p>
<p>XML is powerful, schema-rich, and highly structured. It's also... <em>verbose</em>. Like, "I-need-three-cups-of-coffee-just-to-parse-this-visually" verbose.</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">user</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">name</span>&gt;</span>John Doe<span class="hljs-tag">&lt;/<span class="hljs-name">name</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">age</span>&gt;</span>30<span class="hljs-tag">&lt;/<span class="hljs-name">age</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">job</span>&gt;</span>Engineer<span class="hljs-tag">&lt;/<span class="hljs-name">job</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">user</span>&gt;</span>
</code></pre>
<p><strong>The Problem</strong>: XML is cluttered with explicit opening and closing tags. For deeply nested configurations (looking at you, enterprise Java configs), it becomes a nightmare to read, write, and debug.</p>
<p><strong>Verdict</strong>: Great for document schemas and legacy systems. Terrible for configs you actually want to <em>maintain</em>.</p>
<hr />
<h3 id="heading-json-javascript-object-notation">🚀 JSON (JavaScript Object Notation)</h3>
<p><strong>The API King</strong></p>
<p>JSON is the undisputed champion of APIs and web-based data interchange. It's lightweight, machine-friendly, and maps beautifully to most programming languages' data structures.</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"user"</span>: {
    <span class="hljs-attr">"name"</span>: <span class="hljs-string">"John Doe"</span>,
    <span class="hljs-attr">"age"</span>: <span class="hljs-number">30</span>,
    <span class="hljs-attr">"job"</span>: <span class="hljs-string">"Engineer"</span>
  }
}
</code></pre>
<p><strong>The Problem</strong>: JSON has two fatal flaws for human-edited configuration files:</p>
<ol>
<li><p><strong>No Comments</strong> 😱<br />That's right. You can't add comments in JSON. Try documenting why you exposed port 8080 or why that timeout is set to 30 seconds. Good luck explaining that to Future You™ or your teammates!</p>
</li>
<li><p><strong>Syntactic Noise</strong><br />While cleaner than XML, JSON still makes you jump through hoops:</p>
<ul>
<li>Curly braces everywhere: <code>{}</code></li>
<li>Quotes around every key: <code>"name"</code></li>
<li>The dreaded missing comma bug (we've all been there)</li>
<li>No trailing commas allowed (because... reasons?)</li>
</ul>
</li>
</ol>
<p><strong>Verdict</strong>: Perfect for APIs. Painful for configs you need to edit at 2 AM while debugging a production incident.</p>
<hr />
<h3 id="heading-yaml-yaml-aint-markup-language">🏆 YAML (YAML Ain't Markup Language)</h3>
<p><strong>The Human Whisperer</strong></p>
<p>YAML was designed with one mission: <strong>readability first</strong>. It achieves this through:</p>
<ul>
<li>Indentation-based structure (like Python!)</li>
<li>Minimal syntactic noise (no braces, minimal quotes)</li>
<li>Native comment support (<code>#</code>)</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment"># A simple user object</span>
<span class="hljs-attr">user:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">John</span> <span class="hljs-string">Doe</span>
  <span class="hljs-attr">age:</span> <span class="hljs-number">30</span>
  <span class="hljs-attr">job:</span> <span class="hljs-string">Engineer</span>
</code></pre>
<p><strong>The Magic</strong>: This looks more like a <em>recipe</em> than code. You can read it aloud to a rubber duck and it makes sense. Try that with XML!</p>
<p>This clean, minimal syntax makes YAML the ideal format for declarative configurations—where you tell the system <em>what</em> you want (e.g., "three copies of my application running") rather than <em>how</em> to do it.</p>
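<p>For example, here's roughly what "three copies of my application running" looks like in a trimmed Kubernetes Deployment manifest (the names and image tag are illustrative):</p>
<pre><code class="lang-yaml"># Declare the desired state; Kubernetes works out how to reach it
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # illustrative name
spec:
  replicas: 3               # "three copies of my application running"
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0 # illustrative image tag
</code></pre>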
<hr />
<h2 id="heading-the-interface-philosophy">🎭 The Interface Philosophy</h2>
<p>Here's the key insight that explains the "format wars":</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Format</td><td>Interface Type</td><td>Primary Use Case</td></tr>
</thead>
<tbody>
<tr>
<td><strong>JSON</strong></td><td>Machine-to-Machine (M2M)</td><td>APIs, data interchange</td></tr>
<tr>
<td><strong>YAML</strong></td><td>Human-to-Machine (H2M)</td><td>Config files, IaC</td></tr>
<tr>
<td><strong>XML</strong></td><td>Document-to-System</td><td>Legacy systems, schemas</td></tr>
</tbody>
</table>
</div><p><strong>DevOps and Infrastructure as Code (IaC)</strong> are all about <strong>humans writing declarative configurations</strong> that machines execute. YAML bridges this H2M gap perfectly.</p>
<hr />
<h2 id="heading-format-feature-showdown">📊 Format Feature Showdown</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>YAML</td><td>JSON</td><td>XML</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Human Readability</strong></td><td>🟢 High</td><td>🟡 Medium</td><td>🔴 Low</td></tr>
<tr>
<td><strong>Comment Support</strong></td><td>✅ Yes (<code>#</code>)</td><td>❌ No</td><td>✅ Yes (<code>&lt;!-- --&gt;</code>)</td></tr>
<tr>
<td><strong>Syntactic Overhead</strong></td><td>🟢 Low (Indentation)</td><td>🟡 Medium (Braces, Quotes)</td><td>🔴 High (Tags)</td></tr>
<tr>
<td><strong>Data Interchange Speed</strong></td><td>🟡 Good</td><td>🟢 Excellent</td><td>🔴 Poor</td></tr>
<tr>
<td><strong>Primary Use Case</strong></td><td>DevOps Config, IaC</td><td>APIs, Web Data</td><td>Legacy, Documents</td></tr>
<tr>
<td><strong>Learning Curve</strong></td><td>Gentle slope</td><td>Moderate</td><td>Mountain climb</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-the-secret-weapon-yaml-json">🤝 The Secret Weapon: YAML ❤️ JSON</h2>
<p>Here's a plot twist that most people don't know:</p>
<blockquote>
<p><strong>YAML is a superset of JSON.</strong></p>
</blockquote>
<p>Translation: Any valid JSON file is also a valid YAML file. 🤯</p>
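<p>You can test this yourself: the JSON example from earlier, pasted as-is, is already valid YAML, because YAML's "flow style" is essentially JSON:</p>
<pre><code class="lang-yaml"># Copied verbatim from the JSON example; still valid YAML
{
  "user": {
    "name": "John Doe",
    "age": 30,
    "job": "Engineer"
  }
}
</code></pre>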
<h3 id="heading-why-this-matters">Why This Matters</h3>
<p>This compatibility was a <em>strategic masterstroke</em>. It created a <strong>zero-friction adoption path</strong> for tools like Kubernetes:</p>
<ol>
<li><strong>Backend</strong>: Build with JSON-based APIs for machine-to-machine communication (fast parsing, wide support)</li>
<li><strong>Frontend</strong>: Add a YAML parser for human-friendly manifests (readable configs)</li>
<li><strong>Bonus</strong>: Since YAML parsers can handle JSON, you get both for the price of one!</li>
</ol>
<p>This allowed YAML to be adopted <strong>in addition to</strong> JSON, not <strong>instead of</strong> it. No migration pain. No breaking changes. Just more options.</p>
<p><strong>Result</strong>: YAML's explosive growth in the DevOps ecosystem! 🚀</p>
<hr />
<h2 id="heading-whats-next">🎬 What's Next?</h2>
<p>Now that we understand <em>why</em> YAML conquered the DevOps world, it's time to get our hands dirty with the syntax itself.</p>
<p><strong>Coming up in Part 2:</strong></p>
<ul>
<li>YAML syntax fundamentals (scalars, lists, dictionaries)</li>
<li>The infamous "indentation hell" and how to avoid it</li>
<li>Common gotchas that trip up beginners</li>
<li>Real-world examples from Docker Compose</li>
</ul>
<hr />
<h2 id="heading-key-takeaways">💡 Key Takeaways</h2>
<ol>
<li><strong>YAML = Human-Friendly Config Language</strong> — It's designed for people first, machines second</li>
<li><strong>Comments Matter</strong> — Infrastructure code without documentation is technical debt</li>
<li><strong>YAML ⊃ JSON</strong> — This relationship made adoption seamless</li>
<li><strong>Choose Your Format Wisely</strong> — APIs → JSON, Configs → YAML, Nostalgia → XML</li>
</ol>
<hr />
<h2 id="heading-questions-thoughts">🙋 Questions? Thoughts?</h2>
<p>Drop a comment below! Whether you're team YAML, team JSON, or team "I-still-use-XML-fight-me," I'd love to hear your experiences.</p>
<p><strong>Next Post</strong>: Part 2 - YAML Syntax Fundamentals (or: How I Learned to Stop Worrying and Love Indentation)</p>
<hr />
<p><em>Follow me for more DevOps deep dives, cloud shenanigans, and the occasional dad joke disguised as technical content!</em> 😄</p>
<hr />
<p><strong>#DevOps #YAML #CloudNative #Kubernetes #Docker #InfrastructureAsCode #TechEducation #LearningInPublic</strong></p>
]]></content:encoded></item><item><title><![CDATA[Step-by-Step Guide: Setting Up a Web Server with Virtual Hosts on Ubuntu]]></title><description><![CDATA[Ever wondered how a single server can host dozens of websites without breaking a sweat? The secret lies in Virtual Hosts – Apache's elegant solution for managing multiple domains on one machine.
In this hands-on guide, I'll walk you through setting u...]]></description><link>https://blog.overflowbyte.cloud/step-by-step-guide-setting-up-a-web-server-with-virtual-hosts-on-ubuntu</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/step-by-step-guide-setting-up-a-web-server-with-virtual-hosts-on-ubuntu</guid><category><![CDATA[Ubuntu]]></category><category><![CDATA[server]]></category><category><![CDATA[webserver]]></category><category><![CDATA[hosting]]></category><category><![CDATA[apache]]></category><category><![CDATA[linux for beginners]]></category><category><![CDATA[#basiclinux]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Tue, 30 Sep 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/xbEVM6oJ1Fs/upload/0e1cec1ebea9bffcdfa89803bbec8e8c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever wondered how a single server can host dozens of websites without breaking a sweat? The secret lies in <strong>Virtual Hosts</strong> – Apache's elegant solution for managing multiple domains on one machine.</p>
<p>In this hands-on guide, I'll walk you through setting up Apache Virtual Hosts on Ubuntu Server, from installation to deployment. By the end, you'll be hosting multiple websites like a pro!</p>
<hr />
<h2 id="heading-why-virtual-hosts-matter">Why Virtual Hosts Matter</h2>
<p>Picture this: You have a powerful server with plenty of resources, but you're only using a fraction of its capacity. Virtual Hosts allow you to:</p>
<ul>
<li><p><strong>Host multiple websites</strong> on a single server</p>
</li>
<li><p><strong>Maximize resource utilization</strong> instead of leaving cores idle</p>
</li>
<li><p><strong>Organize projects efficiently</strong> with separate configurations</p>
</li>
<li><p><strong>Save costs</strong> by consolidating infrastructure</p>
</li>
</ul>
<p>Let's dive in and unlock your server's full potential.</p>
<hr />
<h2 id="heading-step-1-installing-apache-web-server">Step 1: Installing Apache Web Server</h2>
<p>First, we need a web server. Apache is battle-tested, free, and perfect for Ubuntu systems.</p>
<h3 id="heading-update-your-package-list">Update Your Package List</h3>
<pre><code class="lang-bash">sudo apt update
</code></pre>
<p>For RedHat/CentOS users:</p>
<pre><code class="lang-bash">sudo yum update
</code></pre>
<h3 id="heading-install-apache2">Install Apache2</h3>
<pre><code class="lang-bash">sudo apt install apache2
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*ImbuPMTZfsUCRhDoXQ4jUA.png" alt /></p>
<p>If you see the message above, Apache is already installed. That's perfectly fine!</p>
<h3 id="heading-verify-the-installation">Verify the Installation</h3>
<p>Open your browser and navigate to your server's IP address or <code>localhost</code>. You should see the default Apache page:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*EKK4mzGQCNzwqj6X.png" alt /></p>
<p><strong>Pro Tip:</strong> The default Apache page is located at <code>/var/www/html/</code>. You can edit <code>index.html</code> in this directory to customize it.</p>
<hr />
<h2 id="heading-step-2-creating-your-website-directory-structure">Step 2: Creating Your Website Directory Structure</h2>
<p>Now let's set up a proper directory for our new website. We'll use <code>overflowbyte.tech</code> as an example.</p>
<h3 id="heading-create-the-domain-directory">Create the Domain Directory</h3>
<pre><code class="lang-bash">sudo mkdir -p /var/www/overflowbyte.tech/public_html
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:531/1*-EwNrHlS0rK4WqwU7SDyaw.png" alt="mkdir /var/www/overflowbyte.tech" /></p>
<h3 id="heading-create-a-simple-html-page">Create a Simple HTML Page</h3>
<p>Navigate to your new directory and create an <code>index.html</code> file:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> /var/www/overflowbyte.tech/public_html
nano index.html
</code></pre>
<p>Add this HTML content:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Welcome to overflowbyte.tech<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Success! 🎉<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>I'm running this website on an Ubuntu Server!<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<h3 id="heading-set-proper-permissions">Set Proper Permissions</h3>
<p>Grant your user ownership of the directory:</p>
<pre><code class="lang-bash">sudo chown -R <span class="hljs-variable">$USER</span>:<span class="hljs-variable">$USER</span> /var/www/overflowbyte.tech/public_html
</code></pre>
<p>Set read permissions for the web server:</p>
<pre><code class="lang-bash">sudo chmod -R 755 /var/www
</code></pre>
<p><strong>Why This Matters:</strong> Without proper permissions, Apache won't be able to serve your content, and you won't be able to modify files easily.</p>
<hr />
<h2 id="heading-step-3-configuring-virtual-hosts">Step 3: Configuring Virtual Hosts</h2>
<p>Here's where the magic happens! Virtual Host configuration files tell Apache how to handle requests for different domains.</p>
<h3 id="heading-copy-the-default-configuration">Copy the Default Configuration</h3>
<pre><code class="lang-bash">sudo cp /etc/apache2/sites-available/000-default.conf /etc/apache2/sites-available/overflowbyte.tech.conf
</code></pre>
<p><strong>Important:</strong> Ubuntu requires virtual host files to end with <code>.conf</code>.</p>
<h3 id="heading-edit-your-virtual-host-file">Edit Your Virtual Host File</h3>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> /etc/apache2/sites-available/
nano overflowbyte.tech.conf
</code></pre>
<p>Here's what the default configuration looks like (with comments removed):</p>
<pre><code class="lang-apache"><span class="hljs-section">&lt;VirtualHost *<span class="hljs-number">:80</span>&gt;</span>
    <span class="hljs-attribute">ServerAdmin</span> webmaster@localhost
    <span class="hljs-attribute"><span class="hljs-nomarkup">DocumentRoot</span></span> /var/www/html
    <span class="hljs-attribute">ErrorLog</span> <span class="hljs-variable">${APACHE_LOG_DIR}</span>/error.log
    <span class="hljs-attribute">CustomLog</span> <span class="hljs-variable">${APACHE_LOG_DIR}</span>/access.log combined
<span class="hljs-section">&lt;/VirtualHost&gt;</span>
</code></pre>
<h3 id="heading-customize-your-configuration">Customize Your Configuration</h3>
<p>Replace it with this configuration:</p>
<pre><code class="lang-apache"><span class="hljs-section">&lt;VirtualHost *<span class="hljs-number">:80</span>&gt;</span>
    <span class="hljs-attribute">ServerAdmin</span> admin@overflowbyte.tech
    <span class="hljs-attribute"><span class="hljs-nomarkup">ServerName</span></span> overflowbyte.tech
    <span class="hljs-attribute"><span class="hljs-nomarkup">DocumentRoot</span></span> /var/www/overflowbyte.tech/public_html

    <span class="hljs-attribute">ErrorLog</span> <span class="hljs-variable">${APACHE_LOG_DIR}</span>/error.log
    <span class="hljs-attribute">CustomLog</span> <span class="hljs-variable">${APACHE_LOG_DIR}</span>/access.log combined
<span class="hljs-section">&lt;/VirtualHost&gt;</span>
</code></pre>
<p><strong>Key Directives Explained:</strong></p>
<ul>
<li><p><strong>ServerAdmin:</strong> Your contact email for error notifications</p>
</li>
<li><p><strong>ServerName:</strong> The domain name this virtual host handles</p>
</li>
<li><p><strong>DocumentRoot:</strong> Path to your website's files</p>
</li>
<li><p><strong>ErrorLog/CustomLog:</strong> Logging configuration for debugging</p>
</li>
</ul>
<hr />
<h2 id="heading-step-4-enabling-your-virtual-host">Step 4: Enabling Your Virtual Host</h2>
<p>Now let's activate your new virtual host configuration.</p>
<h3 id="heading-enable-the-new-site">Enable the New Site</h3>
<pre><code class="lang-bash">sudo a2ensite overflowbyte.tech.conf
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*_PsnxdtWrq6xuXdGe00WXA.png" alt /></p>
<h3 id="heading-disable-the-default-site">Disable the Default Site</h3>
<p>To avoid conflicts, disable Apache's default configuration:</p>
<pre><code class="lang-bash">sudo a2dissite 000-default.conf
</code></pre>
<h3 id="heading-reload-apache">Reload Apache</h3>
<p>Apply your changes by reloading Apache:</p>
<pre><code class="lang-bash">sudo systemctl reload apache2
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:698/1*cVKyHPAKuXcRII7F52ftRg.png" alt /></p>
<hr />
<h2 id="heading-step-5-testing-your-virtual-host">Step 5: Testing Your Virtual Host</h2>
<h3 id="heading-add-a-host-entry-for-local-testing">Add a Host Entry (For Local Testing)</h3>
<p>Since we're testing locally, add this entry to your <code>/etc/hosts</code> file:</p>
<pre><code class="lang-plaintext">127.0.0.1  overflowbyte.tech
</code></pre>
<p>On Windows, edit: <code>C:\Windows\System32\drivers\etc\hosts</code></p>
<h3 id="heading-browse-your-website">Browse Your Website</h3>
<p>Open your browser and navigate to <code>http://overflowbyte.tech</code></p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*P0cqf6a4cjIAiLbt0RzTdA.png" alt /></p>
<p><strong>🎉 Congratulations!</strong> Your virtual host is live and serving content.</p>
<hr />
<h2 id="heading-pro-tips-for-production">Pro Tips for Production</h2>
<h3 id="heading-1-point-your-domain-to-your-server">1. <strong>Point Your Domain to Your Server</strong></h3>
<p>Update your domain's DNS records to point to your server's public IP address.</p>
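<p>Once the record has propagated, you can verify it from the shell (a quick check, assuming the <code>dig</code> tool from the <code>dnsutils</code> package; substitute your own domain):</p>
<pre><code class="lang-bash"># Query the A record for the domain
dig +short overflowbyte.tech A
# The output should be your server's public IP
</code></pre>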
<h3 id="heading-2-enable-ssl-with-lets-encrypt">2. <strong>Enable SSL with Let's Encrypt</strong></h3>
<pre><code class="lang-bash">sudo apt install certbot python3-certbot-apache
sudo certbot --apache -d overflowbyte.tech
</code></pre>
<h3 id="heading-3-create-multiple-virtual-hosts">3. <strong>Create Multiple Virtual Hosts</strong></h3>
<p>Repeat the process for each domain you want to host. Each gets its own <code>.conf</code> file!</p>
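<p>As a sketch, a second site's configuration (using a hypothetical domain <code>example.org</code>) follows the same pattern, with its own <code>ServerName</code> and <code>DocumentRoot</code>:</p>
<pre><code class="lang-apache">&lt;VirtualHost *:80&gt;
    ServerName example.org
    DocumentRoot /var/www/example.org/public_html

    ErrorLog ${APACHE_LOG_DIR}/example.org-error.log
    CustomLog ${APACHE_LOG_DIR}/example.org-access.log combined
&lt;/VirtualHost&gt;
</code></pre>
<p>Enable it with <code>sudo a2ensite example.org.conf</code> and reload Apache, exactly as before.</p>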
<h3 id="heading-4-monitor-your-logs">4. <strong>Monitor Your Logs</strong></h3>
<p>Check logs regularly for issues:</p>
<pre><code class="lang-bash">tail -f /var/log/apache2/error.log
</code></pre>
<hr />
<h2 id="heading-troubleshooting-common-issues">Troubleshooting Common Issues</h2>
<p><strong>Problem:</strong> Browser shows "Connection Refused"<br /><strong>Solution:</strong> Check if Apache is running: <code>sudo systemctl status apache2</code></p>
<p><strong>Problem:</strong> Shows default Apache page instead of your site<br /><strong>Solution:</strong> Verify your virtual host is enabled: <code>sudo a2ensite overflowbyte.tech.conf</code></p>
<p><strong>Problem:</strong> 403 Forbidden Error<br /><strong>Solution:</strong> Check directory permissions: <code>sudo chmod -R 755 /var/www/overflowbyte.tech</code></p>
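<p>Whatever the symptom, it's also worth checking that Apache can parse your configuration before reloading; <code>apache2ctl</code> ships with Apache on Ubuntu:</p>
<pre><code class="lang-bash">sudo apache2ctl configtest
# "Syntax OK" means the configuration files parse cleanly
</code></pre>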
<hr />
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>You've just learned how to:</p>
<ul>
<li><p>Install and configure Apache on Ubuntu</p>
</li>
<li><p>Create organized directory structures for multiple websites</p>
</li>
<li><p>Set up virtual hosts to manage different domains</p>
</li>
<li><p>Enable and test your configurations</p>
</li>
</ul>
<p>Virtual Hosts are the backbone of efficient web hosting. Master this skill, and you'll be able to manage entire web ecosystems from a single server.</p>
<hr />
<h2 id="heading-need-help-with-your-server-setup">Need Help with Your Server Setup?</h2>
<p>Setting up production-ready web infrastructure can be complex. If you need professional assistance with:</p>
<ul>
<li><p><strong>Server deployment and configuration</strong></p>
</li>
<li><p><strong>WordPress or application hosting</strong></p>
</li>
<li><p><strong>SSL certificate setup</strong></p>
</li>
<li><p><strong>Performance optimization</strong></p>
</li>
<li><p><strong>Migration from other providers</strong></p>
</li>
</ul>
<p><strong>I'm here to help!</strong> Reach out to me at <strong>overflowbyte.tech@yahoo.com</strong> or visit my portfolio at <a target="_blank" href="http://pushpendra.overflowbyte.cloud">pushpendra.overflowbyte.cloud</a></p>
<p>With experience in server administration and cloud infrastructure, I specialize in building reliable, scalable hosting solutions for businesses.</p>
<hr />
<p>📖 <strong>Read the original article:</strong> <a target="_blank" href="https://overflowbyte.medium.com/mastering-multiple-domains-how-to-set-up-a-web-server-with-virtual-hosts-on-ubuntu-58fd58abce9b?source=friends_link&amp;sk=cc065a4b5851c5c51b36f8e5255231c5">Mastering Multiple Domains on Medium</a></p>
<p>💼 <strong>Connect with me:</strong><br /><a target="_blank" href="https://linkedin.com/in/pushpendra16">LinkedIn</a> | <a target="_blank" href="https://github.com/push1697">GitHub</a> | <a target="_blank" href="mailto:push1697@gmail.com">Email</a></p>
<hr />
<p><em>Happy hosting! 🚀</em></p>
]]></content:encoded></item><item><title><![CDATA[Weekly Tech Dose: September 13, 2025]]></title><description><![CDATA[Ever miss a critical patch? 😰 Imagine logging in to find an unwanted guest in your system. For server admins, that's not a thriller movie—it's a preventable nightmare.
Staying updated is not just best practice, it's peace of mind.
Here's what you ne...]]></description><link>https://blog.overflowbyte.cloud/weekly-tech-dose-september-13-2025</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/weekly-tech-dose-september-13-2025</guid><category><![CDATA[weekly-tech-updates]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[securityawareness]]></category><category><![CDATA[Security]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Fri, 12 Sep 2025 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763015763501/a0584d85-79d2-40a6-b3d4-5d3327cf38ad.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever miss a critical patch? 😰 Imagine logging in to find an unwanted guest in your system. For server admins, that's not a thriller movie—it's a preventable nightmare.</p>
<p>Staying updated is not just best practice, it's peace of mind.</p>
<p>Here's what you need to know from the last 24 hours:</p>
<hr />
<h2 id="heading-google-cloud">☁️ Google Cloud</h2>
<ul>
<li><p><strong>Cloud Run GPU support is GA 🎉</strong> — Attach NVIDIA L4 Tensor Core GPUs to serverless containers. Perfect for ML inference, video encoding, and graphics-heavy workloads with autoscaling.</p>
</li>
<li><p><strong>Datastream for MongoDB is live ✅</strong> — Enabling CDC-based ingestion into BigQuery/Cloud Storage.</p>
</li>
<li><p><strong>Data transfer pricing changes</strong> ahead of new EU regulations.</p>
</li>
</ul>
<hr />
<h2 id="heading-windows-server-2025-now-generally-available">🖥️ Windows Server 2025 – Now Generally Available!</h2>
<ul>
<li><p><strong>Hybrid-first</strong> with deeper Azure Arc integration</p>
</li>
<li><p><strong>Stronger security</strong>: SMB over QUIC, Secured-Core servers, Credential Guard enhancements</p>
</li>
<li><p><strong>Optimized for AI and containers</strong> — Better GPU passthrough and Kubernetes support</p>
</li>
</ul>
<hr />
<h2 id="heading-patch-tuesday-september-2025">🔻 Patch Tuesday – September 2025</h2>
<p><strong>86 vulnerabilities patched.</strong> Among them:</p>
<p>🔸 <strong>Critical RCE</strong> in SMB/NTLM &amp; HPC Pack<br />🔸 <strong>Privilege escalation bugs</strong> in Windows Server 2025, 2022, and older versions</p>
<p>➡️ <strong>If your Windows Servers are internet-facing, patch NOW.</strong> Don't become the next headline.</p>
<hr />
<h2 id="heading-my-take">🧠 My Take</h2>
<p>Innovation in AI and cloud is thrilling, but nothing trumps security. Today's patches aren't optional—they're essential. Windows Server 2025 brings great features, but only if it's secure.</p>
<hr />
<h2 id="heading-poll-which-update-matters-most-to-your-org">🗳️ POLL: Which update matters most to your org?</h2>
<p>1️⃣ GPUs in Cloud Run (serverless AI/ML)<br />2️⃣ Windows Server 2025 migration<br />3️⃣ Applying this month's security patches</p>
]]></content:encoded></item><item><title><![CDATA[Simple Ways to Install VLC on Linux (Ubuntu, Fedora, CentOS & More)]]></title><description><![CDATA[With the rise in multimedia consumption, a reliable media player is a must for every Linux user. VLC Media Player is one of the most popular choices, offering a versatile, all-in-one solution that’s free, open-source, and compatible with almost every...]]></description><link>https://blog.overflowbyte.cloud/simple-ways-to-install-vlc-on-linux-ubuntu-fedora-centos-more</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/simple-ways-to-install-vlc-on-linux-ubuntu-fedora-centos-more</guid><category><![CDATA[Linux]]></category><category><![CDATA[vlc]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Computer Science]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Sat, 26 Oct 2024 17:26:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729963412725/393db82d-058c-4dcd-b230-dab7a53daa34.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the rise in multimedia consumption, a reliable media player is a must for every Linux user. <strong>VLC Media Player</strong> is one of the most popular choices, offering a versatile, all-in-one solution that’s free, open-source, and compatible with almost every file format. Whether you’re using Ubuntu, Fedora, Kali, or any other Linux distribution, this guide will walk you through five beginner-friendly ways to install VLC on your Linux system.</p>
<hr />
<h3 id="heading-what-is-vlc-media-player">What is VLC Media Player?</h3>
<p><strong>VLC</strong> is a powerful, open-source multimedia player that works seamlessly across various platforms, including <strong>Linux, Windows, macOS, Android, and iOS</strong>. It can play nearly every audio and video format without requiring extra codecs, thanks to its built-in codec library. Originally a desktop application, VLC is now available for mobile, making it the go-to media player for millions worldwide.</p>
<h3 id="heading-prerequisites-for-installing-vlc-on-linux">Prerequisites for Installing VLC on Linux</h3>
<p>To set up VLC, you’ll need a Linux-based OS (this guide covers Ubuntu, Fedora, Arch, Debian, and more), along with an internet connection to download the necessary packages. Let’s dive into each installation method, starting with the easiest options!</p>
<hr />
<h3 id="heading-method-1-install-vlc-using-snap-most-linux-distros">Method 1: Install VLC Using Snap (Most Linux Distros)</h3>
<p><strong>Snap</strong> is a package management tool that makes it easy to install and update software across different Linux distributions. It’s compatible with most Linux distros, so installing VLC this way is quick and straightforward.</p>
<h4 id="heading-installing-vlc-with-snap-on-ubuntu-debian-mint-and-kali">Installing VLC with Snap on Ubuntu, Debian, Mint, and Kali</h4>
<ol>
<li><p><strong>Open Terminal</strong> (Press <code>Ctrl + Alt + T</code>).</p>
</li>
<li><p><strong>Install Snap</strong>:</p>
<pre><code class="lang-bash"> sudo apt install snapd
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> sudo snap install vlc
</code></pre>
</li>
</ol>
<h4 id="heading-installing-vlc-on-fedora-using-snap">Installing VLC on Fedora Using Snap</h4>
<ol>
<li><p><strong>Open Terminal</strong>.</p>
</li>
<li><p><strong>Install Snapd</strong>:</p>
<pre><code class="lang-bash"> sudo dnf install snapd
 sudo ln -s /var/lib/snapd/snap /snap
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> sudo snap install vlc
</code></pre>
</li>
</ol>
<blockquote>
<p><strong>Note</strong>: Snap installations can sometimes feel slow, so if speed is an issue, consider using the package manager (Method 3).</p>
</blockquote>
<hr />
<h3 id="heading-method-2-install-vlc-using-the-software-center-gui-installation">Method 2: Install VLC Using the Software Center (GUI Installation)</h3>
<p>If you’re not yet comfortable with command-line installations, Ubuntu’s <strong>Software Center</strong> provides a simple GUI option. This method is perfect for beginners and works on Ubuntu, Mint, and Debian-based systems.</p>
<ol>
<li><p><strong>Open the Software Center</strong>: Click on “Show Applications” and type “Ubuntu Software.”</p>
</li>
<li><p><strong>Search for VLC</strong> in the Software Center.</p>
</li>
<li><p><strong>Install VLC</strong>: Click on VLC and hit “Install.” Enter your password if prompted.</p>
</li>
</ol>
<p>This quick, visual installation method is especially beginner-friendly.</p>
<hr />
<h3 id="heading-method-3-install-vlc-using-terminal-commands-apt-dnf-pacman">Method 3: Install VLC Using Terminal Commands (apt, dnf, pacman)</h3>
<p>Using your system’s package manager to install VLC via the terminal is a direct and efficient option. Each Linux distribution has a slightly different command, so follow the one for your system.</p>
<h4 id="heading-installing-vlc-on-ubuntu-debian-and-other-debian-based-systems">Installing VLC on Ubuntu, Debian, and Other Debian-Based Systems</h4>
<ol>
<li><p><strong>Open Terminal</strong>.</p>
</li>
<li><p><strong>Update Package Lists</strong>:</p>
<pre><code class="lang-bash"> sudo apt update &amp;&amp; sudo apt upgrade -y
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> sudo apt install vlc
</code></pre>
</li>
</ol>
<h4 id="heading-installing-vlc-on-fedora">Installing VLC on Fedora</h4>
<ol>
<li><p><strong>Open Terminal</strong>.</p>
</li>
<li><p><strong>Enable RPM Fusion</strong>:</p>
<pre><code class="lang-bash"> sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> sudo dnf install vlc
</code></pre>
</li>
</ol>
<p>This installation method is fast and reliable, especially for those comfortable with the terminal.</p>
<hr />
<h3 id="heading-method-4-install-vlc-using-flatpak-cross-distro-compatibility">Method 4: Install VLC Using Flatpak (Cross-Distro Compatibility)</h3>
<p>If you’re looking for a versatile installation tool, <strong>Flatpak</strong> offers cross-distro compatibility, allowing you to install VLC on almost any Linux setup.</p>
<ol>
<li><p><strong>Install Flatpak</strong>:</p>
<ul>
<li><p>For Debian-based systems:</p>
<pre><code class="lang-bash">  sudo apt install flatpak
</code></pre>
</li>
<li><p>For Fedora:</p>
<pre><code class="lang-bash">  sudo dnf install flatpak
</code></pre>
</li>
<li><p>For Arch Linux:</p>
<pre><code class="lang-bash">  sudo pacman -S flatpak
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Add Flathub Repository</strong>:</p>
<pre><code class="lang-bash"> flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> flatpak install flathub org.videolan.VLC
</code></pre>
</li>
<li><p><strong>Launch VLC</strong>: Open from your app launcher or type:</p>
<pre><code class="lang-bash"> flatpak run org.videolan.VLC
</code></pre>
</li>
</ol>
<blockquote>
<p><strong>Tip</strong>: Flatpak installations are secure and flexible, making it a great choice if you want software compatibility across various Linux environments.</p>
</blockquote>
<hr />
<h3 id="heading-method-5-advanced-building-vlc-from-source-optional-for-latest-features">Method 5: Advanced - Building VLC from Source (Optional for Latest Features)</h3>
<p>For advanced users or those looking to customize VLC’s installation, building VLC from source provides access to the latest features and versions.</p>
<ol>
<li><p><strong>Install Build Dependencies</strong>:</p>
<ul>
<li><p>For Debian-based systems:</p>
<pre><code class="lang-bash">  sudo apt-get build-dep vlc
</code></pre>
</li>
<li><p>For Fedora:</p>
<pre><code class="lang-bash">  sudo dnf builddep vlc
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Clone VLC Source Code</strong>:</p>
<pre><code class="lang-bash"> git <span class="hljs-built_in">clone</span> https://code.videolan.org/videolan/vlc.git
 <span class="hljs-built_in">cd</span> vlc
</code></pre>
</li>
<li><p><strong>Compile VLC</strong>:</p>
<pre><code class="lang-bash"> ./bootstrap
 ./configure
 make
</code></pre>
</li>
<li><p><strong>Install VLC</strong>:</p>
<pre><code class="lang-bash"> sudo make install
</code></pre>
</li>
</ol>
<p>This is best suited for users with experience in compiling software on Linux, so if you’re a beginner, try one of the other methods first.</p>
<hr />
<h3 id="heading-troubleshooting-vlc-installation-issues">Troubleshooting VLC Installation Issues</h3>
<p>Here are some common problems you might encounter when installing VLC on Linux, along with simple fixes.</p>
<ul>
<li><p><strong>VLC Won’t Launch</strong>: Try resetting VLC’s configuration:</p>
<pre><code class="lang-bash">  vlc --reset-config
</code></pre>
<p>  Or reinstall VLC:</p>
<pre><code class="lang-bash">  sudo apt remove vlc &amp;&amp; sudo apt install vlc
</code></pre>
</li>
<li><p><strong>Snap or Flatpak Installations Are Slow</strong>: Snap and Flatpak can sometimes feel slower due to sandboxing. If speed is a concern, use your system’s package manager.</p>
</li>
<li><p><strong>Choppy Video Playback</strong>: Update your graphics drivers and check VLC’s video settings under “Preferences” for smoother playback.</p>
</li>
</ul>
<p>With any of these methods, you’ll be up and running VLC on your Linux system in no time, ready to enjoy all your media without limitations!</p>
]]></content:encoded></item><item><title><![CDATA[Discovering the Power Behind Popular Linux GUI Applications]]></title><description><![CDATA[Linux is renowned for its versatility and its ability to run a wide variety of GUI applications. While these programs offer user-friendly interfaces, they are often driven by powerful command-line tools. In this post, we’ll explore five popular GUI a...]]></description><link>https://blog.overflowbyte.cloud/discovering-the-power-behind-popular-linux-gui-applications</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/discovering-the-power-behind-popular-linux-gui-applications</guid><category><![CDATA[Linux]]></category><category><![CDATA[linux-basics]]></category><category><![CDATA[System administration]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Tue, 22 Oct 2024 19:03:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729623511992/2c6c1f81-1234-400f-881b-ef2b68345fe6.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Linux is <strong>renowned for its versatility</strong> and its ability to run a wide variety of <strong>GUI applications</strong>. While these programs offer <strong>user-friendly interfaces</strong>, they are often driven by <strong>powerful command-line tools</strong>. In this post, we’ll explore five popular GUI applications in Linux and the <strong>commands</strong> that work behind the scenes.</p>
<hr />
<h2 id="heading-1-gimp-image-editor">1. <strong>GIMP (Image Editor)</strong></h2>
<h3 id="heading-overview">Overview</h3>
<p><strong>GIMP (GNU Image Manipulation Program)</strong> is a powerful, open-source image editing tool that rivals paid options like <strong>Adobe Photoshop</strong>. It is widely used for tasks ranging from <strong>simple image retouching</strong> to <strong>advanced image composition</strong>.</p>
<h3 id="heading-command-behind-it">Command Behind It</h3>
<p>To launch <strong>GIMP</strong> from the terminal, you can use:</p>
<pre><code class="lang-bash">gimp
</code></pre>
<p>If you want to open a specific image file with <strong>GIMP</strong>, you can include the file path:</p>
<pre><code class="lang-bash">gimp /path/to/image.png
</code></pre>
<h3 id="heading-underlying-command">Underlying Command</h3>
<p><strong>GIMP</strong> uses various command-line tools for image manipulation, such as <code>convert</code> from the <strong>ImageMagick suite</strong> for format conversion and <code>jpegoptim</code> for JPEG optimization. These commands allow GIMP to perform tasks like resizing, format conversion, and optimization behind the scenes.</p>
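<p>To get a feel for that layer, here are a few standalone invocations (assuming the <code>imagemagick</code> and <code>jpegoptim</code> packages are installed; the file names are placeholders):</p>
<pre><code class="lang-bash"># Convert a PNG to JPEG with ImageMagick
convert image.png image.jpg

# Resize an image to half its dimensions
convert image.png -resize 50% image-small.png

# Optimize a JPEG in place
jpegoptim image.jpg
</code></pre>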
<hr />
<h2 id="heading-2-gnome-system-monitor-system-resource-monitor">2. <strong>GNOME System Monitor (System Resource Monitor)</strong></h2>
<h3 id="heading-overview-1">Overview</h3>
<p>The <strong>GNOME System Monitor</strong> provides a <strong>graphical interface</strong> for viewing and managing system processes and resources like <strong>CPU, memory</strong>, and <strong>network usage</strong>. It’s essentially a graphical front-end for <strong>monitoring system health</strong>.</p>
<h3 id="heading-command-behind-it-1">Command Behind It</h3>
<p>To open the <strong>GNOME System Monitor</strong>, you can run:</p>
<pre><code class="lang-bash">gnome-system-monitor
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729622893451/599971f4-6c6e-423d-aab6-d0c045a67657.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-underlying-command-1">Underlying Command</h3>
<p>Behind the scenes, <strong>GNOME System Monitor</strong> runs commands like <code>top</code>, <code>ps</code>, and <code>free</code>. These commands provide information on system processes, memory usage, and CPU activity. For instance, <code>top</code> shows real-time system resource usage, and <code>ps</code> lists all running processes.</p>
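<p>You can view the same data directly in a terminal:</p>
<pre><code class="lang-bash"># Process and resource overview (batch mode, single iteration)
top -b -n 1 | head -n 5

# Snapshot of every running process
ps aux | head -n 5

# Memory usage in human-readable units
free -h
</code></pre>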
<hr />
<h2 id="heading-3-libreoffice-office-suite">3. <strong>LibreOffice (Office Suite)</strong></h2>
<h3 id="heading-overview-2">Overview</h3>
<p><strong>LibreOffice</strong> is a comprehensive, open-source office suite, offering tools for <strong>word processing, spreadsheets, presentations</strong>, and more. It's an excellent alternative to <strong>Microsoft Office</strong> and is widely used in Linux environments.</p>
<h3 id="heading-command-behind-it-2">Command Behind It</h3>
<p>To open <strong>LibreOffice</strong> from the command line, you can use:</p>
<pre><code class="lang-bash">libreoffice
</code></pre>
<p>For specific modules, such as opening a Word document or a spreadsheet, you can specify the application:</p>
<pre><code class="lang-bash">libreoffice --writer /path/to/document.docx
libreoffice --calc /path/to/spreadsheet.xlsx
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729623073773/f142795c-d308-47fc-9ada-c5a7f9d760b7.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-underlying-command-2">Underlying Command</h3>
<p><strong>LibreOffice</strong> can be used in conjunction with command-line tools like <code>pdftotext</code> for converting PDFs into text and <code>unoconv</code> for converting documents between formats like DOCX, ODT, and PDF. This command-line flexibility makes LibreOffice ideal for automated document processing tasks.</p>
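<p>As an illustration (assuming <code>pdftotext</code> from the <code>poppler-utils</code> package and <code>unoconv</code> are installed; file names are placeholders):</p>
<pre><code class="lang-bash"># Extract the text layer of a PDF
pdftotext report.pdf report.txt

# Convert a DOCX document to PDF
unoconv -f pdf document.docx
</code></pre>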
<hr />
<h2 id="heading-4-cheese-camera-application">4. <strong>Cheese (Camera Application)</strong></h2>
<h3 id="heading-overview-3">Overview</h3>
<p><strong>Cheese</strong> is a camera application for Linux that allows users to <strong>capture photos</strong> and <strong>record videos</strong> using their webcam. It's commonly used for <strong>quick snapshots</strong>, webcam testing, and video recordings.</p>
<h3 id="heading-command-behind-it-3">Command Behind It</h3>
<p>To open <strong>Cheese</strong> and start capturing video or images, use:</p>
<pre><code class="lang-bash">cheese
</code></pre>
<h3 id="heading-underlying-command-3">Underlying Command</h3>
<p><strong>Cheese</strong> uses <code>v4l2-ctl</code>, a command-line utility for controlling video devices, to access and configure the webcam. It also works with <code>ffmpeg</code> for video encoding and <code>mplayer</code> for playback of captured media.</p>
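<p>You can experiment with that layer yourself (<code>v4l2-ctl</code> comes from the <code>v4l-utils</code> package; the device path may differ on your system):</p>
<pre><code class="lang-bash"># List detected video capture devices
v4l2-ctl --list-devices

# Show the formats your webcam supports
v4l2-ctl -d /dev/video0 --list-formats-ext

# Record five seconds of webcam video with ffmpeg
ffmpeg -f v4l2 -i /dev/video0 -t 5 capture.mp4
</code></pre>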
<hr />
<h2 id="heading-5-shotwell-photo-manager">5. <strong>Shotwell (Photo Manager)</strong></h2>
<h3 id="heading-overview-4">Overview</h3>
<p><strong>Shotwell</strong> is a popular photo manager for Linux, enabling users to <strong>organize, view, and edit</strong> their photo collections. It's lightweight and integrates well with other GNOME applications.</p>
<h3 id="heading-command-behind-it-4">Command Behind It</h3>
<p>To open <strong>Shotwell</strong>, simply type:</p>
<pre><code class="lang-bash">shotwell
</code></pre>
<p>If you want to import specific photos directly from the command line, you can specify the directory or file path:</p>
<pre><code class="lang-bash">shotwell /path/to/photo_directory
</code></pre>
<h3 id="heading-underlying-command-4">Underlying Command</h3>
<p><strong>Shotwell</strong> utilizes <code>exiv2</code> for reading and modifying image metadata (such as EXIF data) and can work alongside <code>gphoto2</code>, a command-line tool for managing digital cameras, to import images directly from cameras connected via USB.</p>
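<p>Both tools work standalone as well (file names are placeholders):</p>
<pre><code class="lang-bash"># Print the EXIF metadata of a photo
exiv2 photo.jpg

# Download every file from a connected camera
gphoto2 --get-all-files
</code></pre>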
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>Understanding the <strong>command-line tools</strong> behind popular Linux GUI applications provides deeper insight into how these programs function. It also empowers you to <strong>troubleshoot</strong> or enhance your workflow by combining the <strong>ease of GUI</strong> with the <strong>power of the Linux command line</strong>. Whether you're editing images, managing system resources, or organizing your photos, knowing the commands that power these applications can help you become more <strong>efficient and informed</strong>.</p>
<p>Feel free to share your thoughts or additional examples in the comments!</p>
]]></content:encoded></item><item><title><![CDATA[Mastering Multiple Domains: How to Set Up a Web Server with Virtual Hosts on Ubuntu]]></title><description><![CDATA[In the realm of web development, hosting, and deployment, a single server often hosts multiple websites. This is where virtual hosts come in, acting as magical portals that differentiate between different domains all residing on the same machine. Tod...]]></description><link>https://blog.overflowbyte.cloud/mastering-multiple-domains-how-to-set-up-a-web-server-with-virtual-hosts-on-ubuntu</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/mastering-multiple-domains-how-to-set-up-a-web-server-with-virtual-hosts-on-ubuntu</guid><category><![CDATA[linux-basics]]></category><category><![CDATA[Linux]]></category><category><![CDATA[virtual machine]]></category><category><![CDATA[webserver]]></category><category><![CDATA[apache]]></category><category><![CDATA[System administration]]></category><category><![CDATA[hosting]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Mon, 26 Aug 2024 16:57:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724695880251/2f8b420c-3477-4790-8e8f-0006645b29a1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the realm of web development, hosting, and deployment, a single server often hosts multiple websites. This is where virtual hosts come in, acting as magical portals that differentiate between different domains all residing on the same machine. Today, we’ll explore the world of Ubuntu web servers and configuring virtual hosts to efficiently manage your web empire!</p>
<h2 id="heading-getting-started-web-server-installation"><strong>Getting Started: Web Server Installation</strong></h2>
<p>First things first, we need a web server. There are many web servers available, but Apache, a free and open-source powerhouse, is a popular choice for Ubuntu. Open your terminal and update the package list:</p>
<pre><code class="lang-bash"># Debian/Ubuntu
sudo apt update

# Red Hat / CentOS
sudo yum update
</code></pre>
<p>Now, install <strong>Apache</strong> with this command:</p>
<pre><code class="lang-bash">sudo apt install apache2
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*ImbuPMTZfsUCRhDoXQ4jUA.png" alt /></p>
<p>I’ve already installed Apache, so your output will look like this.</p>
<blockquote>
<p><em>This installs and configures Apache on your system. Once the installation completes, you can test Apache by browsing to the server’s IP, or to localhost on the Ubuntu server itself. Below is the default Apache page, served from the path “/var/www/html/”.</em></p>
</blockquote>
<p>You can edit the “index.html” file under that path, “/var/www/html/”.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/0*EKK4mzGQCNzwqj6X.png" alt /></p>
<h2 id="heading-moving-forward-hosting-your-site-and-utilizing-virtual-hosts"><strong>Moving Forward: Hosting Your Site and Utilizing Virtual Hosts</strong></h2>
<p>Now that we’ve covered the basics of setting up your web server, let’s delve into hosting your site. As your web presence expands, managing multiple sites efficiently becomes crucial. Consider a scenario where your server has some amount of resources, “X,” and one of your sites uses only “X/4” (say, a single core). In that case, valuable resources sit idle. This is where Apache’s Virtual Host functionality comes to the rescue.</p>
<h2 id="heading-what-are-virtual-hosts"><strong>What are Virtual Hosts?</strong></h2>
<p>Virtual hosts are configuration files that instruct Apache on how to manage requests for various domains. Each virtual host file outlines a document root, indicating the directory housing the website’s files. Apache utilizes this data to serve the appropriate content when a domain is accessed.</p>
<p>Before we proceed with creating a Virtual Host, let’s create a website named <a target="_blank" href="https://overflowbyte.tech"><code>overflowbyte.tech</code></a> and direct it to our server using our system’s host entry. Additionally, we’ll create a <code>public_html</code> directory within your domain directory. This directory will store the content to be served to your visitors.</p>
<p>Step 1: Creating a directory for our website (domain)</p>
<pre><code class="lang-bash">mkdir /var/www/overflowbyte.tech
mkdir /var/www/overflowbyte.tech/public_html/
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:531/1*-EwNrHlS0rK4WqwU7SDyaw.png" alt="mkdir /var/www/overflowbyte.tech" /></p>
<p>Now go to our created directory and create an <code>index.html</code> file.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> /var/www/overflowbyte.tech/public_html
nano index.html
</code></pre>
<p>after creating <code>index.html</code>paste the below HTML code</p>
<pre><code class="lang-bash">&lt;html&gt;
&lt;head&gt;
 &lt;title&gt; Welcome to overflowbyte.tech &lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
 &lt;p&gt; I<span class="hljs-string">'m running this website on an Ubuntu Server server!
&lt;/body&gt;
&lt;/html&gt;</span>
</code></pre>
<p>We have created our site and its default <code>index.html</code> page. Before moving on to the Virtual Host file so we can browse the website, we’ll set up the permissions for your <code>USER</code>, because the directory and file above were created with <strong><em>root</em></strong> ownership. If you want your regular user to be able to modify files in these web directories, you can change the ownership with these commands:</p>
<pre><code class="lang-bash">sudo chown -R <span class="hljs-variable">$USER</span>:<span class="hljs-variable">$USER</span> /var/www/overflowbyte.tech/public_html
</code></pre>
<p>The <code>$USER</code> variable will take the value of the user you are currently logged in as when you press <code>ENTER</code>. By doing this, the regular user now owns the <code>public_html</code> subdirectories where you will be storing your content.</p>
<p>You should also modify your permissions to ensure that read access is permitted to the general web directory and all of the files and folders it contains so that the pages can be served correctly:</p>
<pre><code class="lang-bash">sudo chmod -R 755 /var/www
</code></pre>
<p>Your web server now has the permissions it needs to serve content, and your user should be able to create content within the necessary folders. The next step is to create content for your virtual host sites.</p>
<p>Otherwise, we won’t be able to browse our newly created website: the server doesn’t know which site the request is for, so it will display the default Apache page instead.</p>
<h1 id="heading-creating-virtual-host-files"><strong>Creating Virtual Host Files</strong></h1>
<p>Here’s where the magic happens! Let’s create a virtual host file for a domain named “overflowbyte.tech”. Virtual host files are instrumental as they specify the precise configuration of your virtual hosts, guiding the Apache web server on how to respond to different domain requests.</p>
<p>Apache comes with a default virtual host file called <code>000-default.conf</code>. You can copy this file to create virtual host files for each of your domains.</p>
<p>Since we’re setting this up locally, we’ll need to add a Host entry in our system to direct our domain <a target="_blank" href="https://overflowbyte.tech"><code>overflowbyte.tech</code></a> to our VM’s IP.</p>
<p>I’ve already added a hosts entry pointing to my server, since I don’t want my browser to query DNS outside of my system. You can learn how to create one by following this <a target="_blank" href="https://www.manageengine.com/network-monitoring/how-to/how-to-add-static-entry.html">Host entry tutorial</a>.</p>
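<p>For reference, a hosts entry is just one line in <code>/etc/hosts</code> mapping the domain to an IP address. The address below is a placeholder; substitute your own VM’s IP:</p>
<pre><code class="lang-bash"># /etc/hosts (edit with sudo) -- 192.168.56.10 is a placeholder for your VM's IP
192.168.56.10   overflowbyte.tech
</code></pre>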
<p>Moving on, the first step is to create the <a target="_blank" href="https://httpd.apache.org/docs/2.4/vhosts/examples.html">Virtual Host</a> file:</p>
<ol>
<li>Copy the Default Configuration:<br /> Start by copying the default Apache configuration file:</li>
</ol>
<pre><code class="lang-bash">sudo cp /etc/apache2/sites-available/000-default.conf /etc/apache2/sites-available/overflowbyte.tech.conf
</code></pre>
<p>Be aware that the default Ubuntu configuration requires that each virtual host file should end in <code>.conf</code>.</p>
<p>Open the new file in your preferred text editor with <strong>root</strong> privileges:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> /etc/apache2/sites-available/
</code></pre>
<pre><code class="lang-bash">nano overflowbyte.tech.conf
</code></pre>
<p>Below is the default file as it appears after opening it in nano, with the comments removed for readability. If you want to see the full default file, check <code>000-default.conf</code> in the <code>/etc/apache2/sites-available</code> directory.</p>
<h2 id="heading-youll-see-a-bunch-of-directives-heres-what-to-modify"><strong>You’ll see a bunch of directives. Here’s what to modify:</strong></h2>
<ul>
<li><p><strong>ServerAdmin:</strong> Replace this with your email address.</p>
</li>
<li><p><strong>DocumentRoot:</strong> This should point to the directory containing your website’s files (e.g., /var/www/mydomain.com).</p>
</li>
<li><p><strong>ServerName:</strong> Specify the domain name this virtual host handles (e.g., mydomain.com).</p>
</li>
</ul>
<pre><code class="lang-bash">&lt;VirtualHost *:80&gt;
        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html
        ErrorLog <span class="hljs-variable">${APACHE_LOG_DIR}</span>/error.log
        CustomLog <span class="hljs-variable">${APACHE_LOG_DIR}</span>/access.log combined
&lt;/VirtualHost&gt;
</code></pre>
<p>Now we will start editing our virtual host file.</p>
<p>Moving forward, we should put our own email address in <code>ServerAdmin</code> so users can reach us in case Apache experiences an error:</p>
<pre><code class="lang-bash">ServerAdmin admin@overflowbyte.tech
</code></pre>
<p>We would also need to set the <code>DocumentRoot</code> directive to point to the directory where our site files are hosted:</p>
<pre><code class="lang-bash">DocumentRoot /var/www/overflowbyte.tech/public_html
</code></pre>
<p>The default file doesn’t come with a <code>ServerName</code> directive, so we’ll have to add one below the last directive. It establishes the base domain for this virtual host definition:</p>
<pre><code class="lang-bash">ServerName overflowbyte.tech
</code></pre>
<p>This ensures people reach the right site instead of the default one when they type in <a target="_blank" href="https://overflowbyte.tech"><code>overflowbyte.tech</code></a>.</p>
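<p>Optionally, if you also want the <code>www</code> subdomain to reach the same site, Apache’s <code>ServerAlias</code> directive can go right below <code>ServerName</code> (this assumes you also point <code>www.overflowbyte.tech</code> at your server’s IP):</p>
<pre><code class="lang-bash">ServerAlias www.overflowbyte.tech
</code></pre>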
<p>Now that we’re done configuring our site, below is our complete Virtual Host file for the domain “<a target="_blank" href="https://overflowbyte.tech">overflowbyte.tech</a>”. Let’s save it and activate it in the next step!</p>
<pre><code class="lang-bash">&lt;VirtualHost *:80&gt;
        ServerAdmin admin@overflowbyte.tech
        ServerName overflowbyte.tech
        DocumentRoot /var/www/overflowbyte.tech/public_html

        ErrorLog <span class="hljs-variable">${APACHE_LOG_DIR}</span>/error.log
        CustomLog <span class="hljs-variable">${APACHE_LOG_DIR}</span>/access.log combined

&lt;/VirtualHost&gt;
</code></pre>
<h2 id="heading-enabling-the-new-virtual-host-files"><strong>Enabling the New Virtual Host Files</strong></h2>
<p>Now that we have created our virtual host file, we must enable it. Apache includes some tools that allow you to do this.</p>
<p>We’ll be using the <code>a2ensite</code> tool to enable each of your sites. If you would like to read more about this script, you can refer to the <a target="_blank" href="https://manpages.debian.org/jessie/apache2/a2ensite.8.en.html"><code>a2ensite</code> documentation</a>.</p>
<pre><code class="lang-bash">sudo a2ensite overflowbyte.tech.conf
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*_PsnxdtWrq6xuXdGe00WXA.png" alt /></p>
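<p>Under the hood, <code>a2ensite</code> simply symlinks the config from <code>sites-available</code> into <code>sites-enabled</code>, where Apache actually reads it. A minimal sketch of the same mechanism using temporary stand-in directories:</p>
<pre><code class="lang-bash">avail=$(mktemp -d)    # stand-in for /etc/apache2/sites-available
enabled=$(mktemp -d)  # stand-in for /etc/apache2/sites-enabled
touch "$avail/overflowbyte.tech.conf"
ln -s "$avail/overflowbyte.tech.conf" "$enabled/overflowbyte.tech.conf"
readlink "$enabled/overflowbyte.tech.conf"   # prints the path back in "sites-available"
rm -r "$avail" "$enabled"
</code></pre>
<p>This is also why <code>a2dissite</code> is safe: it only removes the symlink, leaving your config file untouched in <code>sites-available</code>.</p>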
<p>But before browsing our newly created website, we first need to disable the default Apache page. We can do that with the following command:</p>
<pre><code class="lang-bash">sudo a2dissite 000-default.conf
</code></pre>
<p>Now it’s time to make the changes take effect by reloading Apache. As with any service, configuration changes need a reload (or restart) before they are reflected:</p>
<pre><code class="lang-bash">sudo systemctl reload apache2
</code></pre>
<p><img src="https://miro.medium.com/v2/resize:fit:698/1*cVKyHPAKuXcRII7F52ftRg.png" alt /></p>
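<p>Before relying on the new site, it’s worth asking Apache to validate the configuration (ideally before each reload) and then checking that the virtual host answers for the right <code>Host</code> header. Run this from the server itself; <code>127.0.0.1</code> works because Apache matches name-based virtual hosts on the <code>Host</code> header, not on the address you connect to:</p>
<pre><code class="lang-bash">sudo apache2ctl configtest      # should report "Syntax OK"
curl -s -H "Host: overflowbyte.tech" http://127.0.0.1/ | head -n 5
</code></pre>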
<p>Finally, let’s see the result of all the work we’ve done!</p>
<p><img src="https://miro.medium.com/v2/resize:fit:700/1*P0cqf6a4cjIAiLbt0RzTdA.png" alt /></p>
<p><strong>Congratulations!</strong> You’ve successfully configured a virtual host on your Ubuntu web server. Now you can repeat this process to create virtual hosts for all your domains, keeping your web empire organised and thriving!</p>
<p><strong>Bonus Tip:</strong> Don’t forget to set up your domain name to point to your server’s IP address for users to access your websites from the outside world.</p>
]]></content:encoded></item><item><title><![CDATA[A Beginner guide for "iotop" to processes on your Hard Disks]]></title><description><![CDATA[What is iotop ?
while we are studying about the iotop It become important to understand what is it right ?
so iotop is a command line utility to monitor and check the usage of I/O operations of our disk. You can check official repository or iotop. it...]]></description><link>https://blog.overflowbyte.cloud/a-beginner-guide-for-iotop-to-processes-on-your-hard-disks</link><guid isPermaLink="true">https://blog.overflowbyte.cloud/a-beginner-guide-for-iotop-to-processes-on-your-hard-disks</guid><category><![CDATA[Linux]]></category><category><![CDATA[linux for beginners]]></category><category><![CDATA[linux-basics]]></category><category><![CDATA[disk management]]></category><dc:creator><![CDATA[Pushpendra B]]></dc:creator><pubDate>Mon, 26 Aug 2024 16:37:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724690098499/81ea8929-7c2a-4548-bd52-e78d0972e087.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-what-is-iotop">What is iotop ?</h3>
<p>While we are studying iotop, it becomes important to understand what it actually is, right?</p>
<p>iotop is a command-line utility to monitor and check the usage of I/O operations on our disks. You can check the official repository of <a target="_blank" href="https://github.com/Tomas-M/iotop">iotop</a>. The original version was written by Guillaume Chazarain.</p>
<p>iotop watches I/O usage information output by the Linux kernel (requires 2.6.20 or later) and displays a table of current I/O usage by processes or threads on the system.</p>
<p>It displays columns for the I/O bandwidth read and written by each process/thread during the sampling period. It also displays the % of time the thread/process spent while swapping in and while waiting on I/O. For each process, its I/O priority (class/level) is shown.</p>
<p>In addition, the total I/O bandwidth read and written during the sampling period is displayed at the top of the interface.</p>
<p>Inside iotop you can use several keyboard shortcuts:</p>
<ul>
<li><p>Use the <strong>left</strong> and <strong>right</strong> arrow keys to change the sorting column.</p>
</li>
<li><p><code>r</code> reverses the sorting order.</p>
</li>
<li><p><code>o</code> toggles the <code>--only</code> option.</p>
</li>
<li><p><code>p</code> toggles the <code>--processes</code> option.</p>
</li>
<li><p><code>a</code> toggles the <code>--accumulated</code> option.</p>
</li>
<li><p><code>i</code> changes the priority of a thread or a process' thread(s).</p>
</li>
<li><p><code>q</code> quits. Any other key forces a refresh.</p>
</li>
</ul>
<p>Without any further delay, let's move on to the installation of this wonderful tool.</p>
<h3 id="heading-installation-of-iotop">Installation of iotop</h3>
<p>Installing iotop is simple: we just have to fire up a single command for the package manager of our respective Linux distro.</p>
<h3 id="heading-ubuntudebianlinux-mint">Ubuntu/Debian/Linux Mint:</h3>
<pre><code class="lang-bash">sudo apt install iotop
<span class="hljs-comment"># or, if you are logged in as the root user</span>
apt install iotop
</code></pre>
<p><strong>CentOS/RHEL:</strong></p>
<pre><code class="lang-bash">sudo yum install iotop
<span class="hljs-comment"># or</span>
yum install iotop
</code></pre>
<h3 id="heading-basic-usage-of-iotop">Basic Usage of iotop</h3>
<p>Using iotop is not hard once you understand the basics of disk I/O operations, and running it is simple. Just type the command below and you are good to go exploring:</p>
<pre><code class="lang-bash">sudo iotop
</code></pre>
<p>This will display a list of processes along with their disk I/O statistics. The default output includes several important columns which you can see in the image:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724687841673/a720c88c-245d-4a0a-b3a7-9b972ec61019.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>PID</strong>: The Process ID.</p>
</li>
<li><p><strong>PRIO</strong>: The I/O priority of the process.</p>
</li>
<li><p><strong>USER</strong>: The user who owns the process.</p>
</li>
<li><p><strong>DISK READ</strong>: The amount of data read from the disk in KiB/s.</p>
</li>
<li><p><strong>DISK WRITE</strong>: The amount of data written to the disk in KiB/s.</p>
</li>
<li><p><strong>SWAPIN</strong>: The percentage of the process's I/O that is being swapped in.</p>
</li>
<li><p><strong>IO</strong>: The percentage of time the process is waiting on I/O.</p>
</li>
</ul>
<p>You can navigate within <code>iotop</code> using simple commands:</p>
<ul>
<li><p>Press <strong><mark>o</mark></strong> to filter and display only processes with active I/O.</p>
</li>
<li><p>Press <code>q</code> to quit the program.</p>
</li>
</ul>
<h3 id="heading-advanced-iotop-usage">Advanced iotop Usage</h3>
<p><strong>1. Filtering by User or Process:</strong> You can focus on specific users or processes using <code>iotop</code>. For instance, to filter by user, use:</p>
<pre><code class="lang-bash">sudo iotop -u username
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724688365914/e81c1c91-523a-47ca-9780-729d9d89efb9.png" alt class="image--center mx-auto" /></p>
<p>Or to monitor a specific process:</p>
<pre><code class="lang-bash">sudo iotop -p PID
<span class="hljs-comment"># below is my PID 3372 </span>
sudo iotop -p 3372
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724688634037/1c71a82e-2370-48cd-ac72-4e9064c4cf22.png" alt class="image--center mx-auto" /></p>
<p><strong>2. Batch Mode for Logging:</strong> Running <code>iotop</code> in batch mode allows you to log disk I/O activity for later analysis. This is particularly useful for long-term monitoring or troubleshooting:</p>
<pre><code class="lang-bash">sudo iotop -b -o &gt; iotop.log
</code></pre>
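<p>Once you have a batch log, plain text tools are enough to mine it. Below is a hypothetical sketch that pulls the heaviest disk writer out of a log. The column layout (<code>TID PRIO USER DISK-READ DISK-WRITE SWAPIN IO COMMAND</code>) is assumed from the default <code>iotop -b</code> output, so verify it against your iotop version; <code>sample.log</code> with fabricated lines stands in for the <code>iotop.log</code> captured above.</p>
<pre><code class="lang-bash"># two fabricated sample lines in the shape of iotop batch output
cat &gt; sample.log &lt;&lt;'EOF'
 1234 be/4 www-data    0.00 K/s  512.00 K/s  0.00 %  3.00 % apache2
 5678 be/4 mysql       0.00 K/s 2048.00 K/s  0.00 % 12.00 % mysqld
EOF
# field 6 is the numeric DISK WRITE value; sort descending, show the top writer
sort -k6 -rn sample.log | head -n 1
rm sample.log
</code></pre>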
<p><strong>3. Customising Output:</strong> Adjust the delay between updates using <code>-d</code>:</p>
<pre><code class="lang-bash">sudo iotop -d 5
</code></pre>
<p>You can also limit the number of iterations with <code>-n</code>:</p>
<pre><code class="lang-bash">sudo iotop -n 10
</code></pre>
<p>Display values in kilobytes instead of the default human-friendly units with <code>-k</code>:</p>
<pre><code class="lang-bash">sudo iotop -k
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724688783077/bb552c7a-e0c8-42d5-9692-6c497f82803b.png" alt class="image--center mx-auto" /></p>
<p>To suppress some of the header lines (useful when logging output in batch mode), use <code>-q</code>:</p>
<pre><code class="lang-bash">sudo iotop -q
<span class="hljs-comment"># suppresses some of the header lines in the output</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724688928923/07d19875-bdfb-45e3-9437-73737eb22dbe.png" alt class="image--center mx-auto" /></p>
<p><strong>4. Combining with Other Tools:</strong> For a more automated approach, you can combine <code>iotop</code> with <code>cron</code> to run at regular intervals, or use <code>grep</code> to filter the output for specific patterns:</p>
<pre><code class="lang-bash">sudo iotop -b -o | grep <span class="hljs-string">'pattern'</span> &gt; filtered_iotop.log
</code></pre>
<h3 id="heading-optimizing-disk-performance-using-iotop">Optimizing Disk Performance Using iotop</h3>
<p><strong>1. Identifying and Terminating Problematic Processes:</strong> If <code>iotop</code> reveals a process that’s consuming too much I/O, you can terminate it to free up resources. Try a plain <code>kill PID</code> (SIGTERM) first so the process can exit cleanly, and fall back to <code>-9</code> only if it refuses to die:</p>
<pre><code class="lang-bash">sudo <span class="hljs-built_in">kill</span> -9 PID
</code></pre>
<p><strong>2. Adjusting I/O Priorities:</strong> For processes that need to run but are consuming too much I/O, you can adjust their I/O priority using <code>ionice</code>:</p>
<pre><code class="lang-bash">sudo ionice -c3 -p PID
</code></pre>
<p>This command will set the process to the "idle" priority class, meaning it will only use disk I/O when the system is otherwise idle.</p>
<p><strong>3. Proactive Monitoring:</strong> Set up alerts based on <code>iotop</code> output to catch I/O issues early. For example, you can use a monitoring script that sends an email or triggers an alert if disk I/O crosses a certain threshold.</p>
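<p>As a sketch of such an alert, the one-sample check below parses the summary line that <code>iotop -b</code> prints and flags high write rates. The line format (<code>Total DISK READ : … | Total DISK WRITE : …</code>) and the idea of feeding it from <code>sudo iotop -b -n 1 | head -n 1</code> are assumptions; check them against your iotop version before wiring this into cron.</p>
<pre><code class="lang-bash"># a fabricated summary line standing in for: sudo iotop -b -n 1 | head -n 1
sample='Total DISK READ :       0.00 B/s | Total DISK WRITE :    4096.00 K/s'
write_kbs=$(echo "$sample" | awk -F'Total DISK WRITE :' '{print $2}' | awk '{print $1}')
threshold=1024   # alert above 1024 K/s
awk -v w="$write_kbs" -v t="$threshold" 'BEGIN { exit !(w &gt; t) }' \
  &amp;&amp; echo "ALERT: disk write ${write_kbs} K/s exceeds ${threshold} K/s"
</code></pre>
<p>In a real deployment, the <code>echo</code> would be replaced by a mail or webhook call.</p>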
<p><strong>4. Regular Analysis:</strong> Make it a habit to run <code>iotop</code> periodically, especially if you notice performance issues. Regular monitoring helps you catch problems early, ensuring your system runs smoothly.</p>
<h3 id="heading-use-cases-of-iotop">Use Cases of iotop</h3>
<p>Below are some use cases for iotop that can be helpful in many ways.</p>
<p><strong>1. Identifying Disk I/O Bottlenecks:</strong> High disk I/O can cause significant slowdowns in system performance. With <code>iotop</code>, you can quickly identify which processes are consuming the most I/O resources. For example, if your system is sluggish, running <code>iotop</code> can help you pinpoint processes that are hogging the disk, allowing you to take appropriate action.</p>
<p><strong>2. Monitoring Performance of Disk-Intensive Applications:</strong> Applications like databases, backup processes, or large file transfers are often very disk-intensive. Using <code>iotop</code>, you can monitor these applications in real time, ensuring they aren’t causing unnecessary strain on your system or interfering with other processes.</p>
<p><strong>3. Diagnosing Swap Usage Issues:</strong> If your system is heavily using swap, it can lead to increased disk I/O, slowing down your system. <code>iotop</code> helps you monitor swap usage as well and identify the processes causing excessive swapping, enabling you to optimise memory usage and reduce swap dependence.</p>
<p><strong>4. Analysing Disk Write Patterns:</strong> Understanding which processes are writing heavily to disk can help in managing disk wear, especially for SSDs. <code>iotop</code> provides a clear view of disk write activity, making it easier to manage disk health and longevity.</p>
<p>I hope you've enjoyed this article on iotop. You can also use the resources mentioned below to expand your understanding:</p>
<ul>
<li><p><a target="_blank" href="https://linux.die.net/man/1/iotop">https://linux.die.net/man/1/iotop</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/Tomas-M/iotop">https://github.com/Tomas-M/iotop</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>