KServe - The Ultimate Guide to Production Model Serving on Kubernetes

Open Table of Contents

Introduction
What You’ll Learn in This Blog
The “Last Mile” Problem: Why Model Serving/Inference is a Unique Challenge
I. Foundational Introduction to KServe
- 1. What is KServe?
- 2. Key Benefits that Set KServe Apart
II. KServe Design and Architecture Concepts
III. Lab: Deploy Your First KServe Inference Service on GKE
Conclusion: From Theory to Production

Introduction

Productionizing AI infrastructure can often be overly complex. Between deploying an experimental Large Language Model (LLM) and serving it at scale, there is a maze of infrastructure decisions, framework incompatibilities, and operational challenges that can delay deployments by months. KServe’s mission is to eliminate this complexity. It is a standardized, cloud-native platform designed for deploying Generative AI (Gen AI) and machine learning (LLM) LLMs on Kubernetes (K8S), whether you are serving traditional ML models or the latest frontier LLM.

Originally known as KFServing, KServe has evolved into the Cloud Native Computing Foundation’s (CNCF) reference AI Inference implementation, trusted by leading organizations running AI at scale.

Whether you are a data scientist looking to deploy your latest ML experiment, a DevOps engineer building scalable ML infrastructure, or a decision-maker evaluating AI infrastructure platforms, KServe provides the production-grade foundation you need without vendor lock-in.

What You’ll Learn in This Blog

In this comprehensive guide, you’ll discover:

The “Last Mile” Problem: Understanding why model serving presents unique infrastructure challenges that go far beyond simple API deployment
Foundational Introduction to KServe: What KServe is, its evolution from KFServing, and the key benefits that set it apart from other serving solutions
KServe Design and Architecture: Deep dive into core design principles, control and data planes, deployment modes (Standard vs. Knative), inference protocols, and the LLMInferenceService resource
Hands-On Lab: Step-by-step walkthrough of deploying Gemma 3 on Google Kubernetes Engine using KServe’s Standard mode with Gateway API and KEDA autoscaling

By the end, you’ll understand why KServe has become the open-source standard for model serving—and how to use it to accelerate your own AI deployments.

The “Last Mile” Problem: Why Model Serving/Inference is a Unique Challenge

When you train an ML model, the next critical step is to make it accessible to your end users through a process called serving. In simpler terms, serving means loading your trained model into an inference server and exposing an API endpoint that applications or users can call to get predictions. This is where your model transitions from an experimental artifact to a production service that delivers real business value.

Model Inference

Process of using a trained model to make predictions or generate outputs based on new inputs
Supports real‑time tasks like text generation, translation, and summarization
Essential for responsive, production‑grade applications
Requires building a scalable platform to handle high demand and low latency

You might wonder why model serving requires specialized tools like KServe. After all, can’t we just wrap our model in a Flask API and call it a day? The reality is that production model serving presents a set of challenges that go far beyond simply calling model.predict().

First, there’s the performance challenge. Inference workloads have unique characteristics—they’re often bursty, require low latency, and need to handle concurrent requests efficiently. A simple Flask server will buckle under production traffic, leading to poor user experiences and wasted resources.

Then there’s the scalability problem. Production models need to scale up and down based on demand, sometimes dramatically. Traditional (Kubernetes) autoscaling approaches based on CPU metrics don’t work well for inference workloads, which are often GPU-bound and have different resource utilization patterns.

Model versioning presents another hurdle. Organizations need to serve multiple versions of the same model simultaneously for A/B testing, canary deployments, or gradual rollouts. Managing this complexity manually is error-prone and time-consuming.

Next, there’s the observability challenge. Unlike traditional web services, model inference requires specialized monitoring. You need to track not just latency and throughput, but also model-specific metrics like prediction drift, feature importance, and confidence scores.

The cold start problem becomes particularly acute with LLMs. Large language models can take several minutes to load into GPU memory, making standard container starting approaches impractical. Users won’t wait 8 minutes for your model to “wake up.” This requires local LLM caching on the nodes to solve this known ML/LLM challenge.

Security and compliance present another layer of complexity, especially for LLM deployments. You need a specialized AI/LLM API gateway that handles authentication and authorization, but also implements guardrails to prevent prompt injection attacks, data leakage, and generation of harmful content. Without proper safeguards, your LLM endpoints become potential security vulnerabilities and compliance nightmares.

KServe addresses all these challenges out of the box, providing a production-ready foundation for model serving that would otherwise require months of custom engineering.

But what is KServe after all?

I. Foundational Introduction to KServe

1. What is KServe?

From the KServe official documentation, KServe is defined as:

KServe:

“Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes”

What makes KServe particularly powerful is its ability to serve any framework—TensorFlow, PyTorch, XGBoost, Scikit-learn, and more—through a unified interface. It’s designed from the ground up to handle the unique challenges of production inference workloads, from autoscaling to versioning to monitoring.

KServe began its journey as KFServing, a project under the Kubeflow umbrella aimed at solving the ML model’s serving problem on Kubernetes. As the project matured and gained adoption, it became clear that its value extended beyond just Kubeflow users.

In 2021, the project was renamed to KServe and became a standalone CNCF (Cloud Native Computing Foundation) project. This change reflected its broader mission to serve as the standard for model serving on Kubernetes, regardless of whether you’re using Kubeflow or not.

Today, KServe is widely adopted across industries - see the industry adopters here: - and has become the reference implementation for model serving on Kubernetes. Its CNCF status ensures it remains vendor-neutral, community-driven, and aligned with the broader cloud-native ecosystem.

KServe Project Information (November 2025)

Latest Version: v0.16.0
Status: CNCF Incubating Maturity Level
CNCF Page: cncf.io/projects/kserve
GitHub: github.com/kserve/kserve
Documentation: kserve.github.io/website
Community Slack: #kserve on Kubernetes Slack
Original Creator: Dan Sun

2. Key Benefits that Set KServe Apart

While many excellent serving and inference projects exist in the industry, several key features and benefits led us at Drizzle AI Systems to select KServe for our production-ready AI infrastructure mission:

Kubernetes-Native Architecture: KServe is built from the ground up for Kubernetes, leveraging native Kubernetes objects like Pods and Deployments, also K8S constructs like CRDs and operators. This deep integration means it works seamlessly with existing Kubernetes tooling and processes, rather than fighting against them.
Cloud-Agnostic Portability: Deploy anywhere—whether you’re on AWS, Azure, GCP, on-premises, or hybrid environments, KServe provides consistent behavior and identical APIs across all platforms. This eliminates vendor lock-in and enables seamless workload migration between cloud providers.
Unified Platform: A single platform for both Generative AI (mainly LLMs) and Predictive AI (mainly traditional ML frameworks like PyTorch, TensorFlow, etc) inference on Kubernetes. This eliminates the operational complexity of managing separate serving infrastructures for different AI workloads, providing consistent APIs, monitoring, and deployment patterns across your entire model portfolio and teams.
Production-Grade Scaling: Based on Keda and Knative, KServe automatically handles traffic spikes, scale-to-zero (Knative option), and manage multi models efficiently.
LLMs and ML Framework Agnostic Excellence: KServe’s unified interface seamlessly supports multi-LLM frameworks (Hugging Face, vLLM, etc) and Multi-ML frameworks (TensorFlow, PyTorch, XGBoost, Scikit-learn, etc) through a single, consistent API.
Advanced Inference Patterns: Built-in support for OpenAI-Compatible APIs, Canary deployments, A/B testing, InferenceGraph for model ensembles, multi-node inference, intelligent routing and density packing using ModelMesh … etc
Optimized Inference Performance: KServe delivers great performance through intelligent resource management and advanced caching strategies. Its KV Cache Offloading capabilities optimize GPU memory utilization, while model caching dramatically reduces cold start times from 15-20 minutes to under 1 minute for large language models.
Seamless Operations - Production monitoring, logging, low-latency prediction serving for real-time applications and observability out of the box.
Enterprise Security: KServe provides Envoy AI Gateway integration, which is an enterprise-grade API management and routing for AI workloads that ensures authentication, authorization, and rate limiting features out-of-the-box, meeting the strict security requirements of production environments.
Proven at Scale: Organizations like Bloomberg, IBM, and Nvidia trust KServe for their production workloads, validating its reliability and performance under real-world conditions.

Now that you understand what KServe is and why it’s useful for production AI workloads, it is the right time now to dive into how it actually works.

II. KServe Design and Architecture Concepts

1. Core Design Principles

KServe’s power lies in three fundamental design principles that make production model serving accessible without sacrificing flexibility:

Declarative Simplicity: Model deployment inputs are defined in a main YAML configuration (InferenceService), eliminating the need to manually configure K8S objects like containers, pods, services, etc. KServe starts from the InferenceService and orchestrate everything else automatically.
Extending Kubernetes: KServe extends K8S through purpose-built Custom Resources (CRDs) designed specifically for inference workloads. This deep integration means you get cloud-native scalability, reliability, and observability patterns that work seamlessly with your existing K8s infrastructure—without needing to become a Kubernetes expert.
Runtime Flexibility Through Abstraction: KServe’s pluggable architecture separates the “what” from the “how” of model serving. Choose the optimal inference engine for your workload—vLLM for LLMs, TorchServe for PyTorch, TensorRT for NVIDIA optimization, or bring your own custom runtime—all while maintaining a consistent API and operational model.

2. KServe Generative Inference Architecture

The following figure describes the main components of the GenAI Inference architecture:

Key Component:

Envoy AI Gateway: The top-level, unified API gateway that provides several routing features (smart request distribution based on model requirements), security features (like rate limiting, auth/z, etc) and monitoring features (like LLM Observability and model usage metering)
Gateway API Inference Extension: The official Kubernetes Gateway implementation, that provides enhanced routing capabilities such as endpoint picker, load-aware routing, cache-aware routing, etc.
KServe Controller: more details in the next section.
Inference Service Deployment: handle LLM standard K8S deployments controlling the vLLM containers, storage containers, model caching / OCI (caches models across K8S nodes).
Distributed Inference: For larger models requiring distributed processing vLLM deployment (e.g. vLLM head, vLLM workers, vLLM prefilling, vLLM decoding,..)
Infrastructure Components: The foundation layer orchestrates a sophisticated hardware ecosystem designed for maximum flexibility and performance:
- Heterogeneous GPU Farms(H100, H200, A100, MIG)
- Distributed KV Cache: A cluster-wide shared memory system for model key-value pairs, across all inference pods, dramatically reducing redundant computations and improving response times for similar queries.
- Model Registry / Hugging Face Hub / GCS / Azure / S3; providing deployment flexibility without vendor lock-in.

This architecture showcases how KServe unifies specialized components into a production-grade platform. Each layer tackles specific LLM serving challenges, from intelligent request routing and dynamic resource allocation to hardware acceleration and cache optimization, creating an end-to-end solution that transforms complex infrastructure into simple, declarative configurations.

3. KServe Control and Data Planes

KServe consists of two main components:

Control Plane

KServe Controle Plan

The brain of the KServe operations; it manages the complete inference service lifecycle, orchestrates with Kubernetes APIs, handles resource provisioning, etc. Operating independently from the data plane, this architectural ensures clean concerns.

It onsists of several components such as Controller Manager, LocalModel Controller, Networking Components, Autoscaling Components, and much more.

Data Plane

KServe Data Plan

The data plane is designed to be independent from the control plane, focusing purely on inference execution, using inference API that is independent of any specific ML/DL framework and model server.

We’ll be publishing deep-dive articles on the control plane and data plane components in our blog. Stay tuned!

4. KServe Deployment Modes

KServe offers two powerful deployment modes to fit diverse operational needs:

Standard Mode (Raw K8S Deployment mode): Standard Deployment mode uses standard Kubernetes resources and is recommended for most production environments.
Knative mode: Leverages Knative Serving for event-driven, scale-to-zero capabilities.

Standard Mode

This is the recommended deployment mode for LLM Serving. In this mode KServe uses standard Kubernetes objects for maximum control, and it is the recommended for most production environments.

The following diagram shows how this standar deployment mode (with Gateway API) working.

Figure: Official KServe Standard Deployment Mode Architecture

Standard Mode Architecture Characteristics:

Native K8S Deployments (Deployment, Service, …)
Gateway API: Leverages K8S Gateway API for advanced traffic management (recommended way)
Kubernetes Ingress Fallback: Supports Kubernetes Ingress (e.g. Nginx) but with limited functionality
Horizontal Pod AutoScaler (HPA) + Optional KEDA: Uses HPA for basic scaling, with KEDA for advanced metrics

When is it best to use the Standard mode?

✅ Production environment requiring high availability, or omplex networking
✅ Multi-tenant environments with strict isolation requirements
✅ Applications requiring persistent connections or streaming

Knative Mode:

In this mode, KServe leverages Knative for serverless capabilities including scale-to-zero, this mode is ideal for dynamic workloads requiring resource efficiency.

The following diagram shows how the KServe Knative mode works:

Figure: Official KServe Knative Deployment Mode Architecture

Knative Mode Architecture Characteristics:

Knative Resources (Knative Service and Revision)
Knative Gateway (Istio or Kourier)
Scale-to-Zero: Scale down to zero replicas when no traffic
Queue Proxy: Each pod includes a Knative queue proxy for metrics and request handling
Revision Management: Automatic deployment versioning and traffic splitting features.

When is it best to use the Knative mode?

✅ Cost efficiency is a priority benefiting from scale-to-zero
✅ Unpredictable traffic patterns
✅ Burst traffic use case where instant scaling is required

Limitations of this mode

⚠️ Cold Start Latency: Initial requests may experience higher latency
⚠️ Resource Overhead: Queue proxy containers add resource overhead
⚠️ Networking Dependencies: Requires Istio or Kourier for networking

5. The KServe Inference Protocols

KServe supports two protocol versions for model inference, each designed to meet different operational needs:

KServe V1 Protocol (Legacy) The V1 protocol provides a standardized prediction workflow across all model frameworks. While still supported for backward compatibility, it offers basic inference capabilities with a simple request/response pattern.

KServe V2 Protocol: The Open Inference Protocol (Recommended) The V2 protocol, also known as the Open Inference Protocol, sets the modern standard for model serving. KServe’s V2 protocol improves upon several limitations found in the V1 protocol, delivering better performance and enhanced compatibility across various model frameworks and servers. It supports both HTTP/REST and gRPC interfaces, providing flexibility in implementation.

The V2 protocol is recommended for all new deployments and serves as the foundation for KServe’s advanced features. For more details about the V2 Protocol please refer to the official documentation

6. The LLMInferenceService

LLMInferenceService

It is a Kubernetes Custom Resource Definition (CRD) introduced in KServe. The LLMInferenceService provides the standard pattern on deploying and managing LLM inference workloads on Kubernetes.

While KServe has traditionally used the InferenceService CRD for serving machine learning models, the platform now adopts a dual-track architecture to address the distinct requirements of modern AI workloads:

InferenceService: The original CRD optimized for Predictive AI workloads. It excels at serving traditional ML models from frameworks. While still capable of serving LLMs, it’s designed primarily for conventional machine learning inference patterns.

LLMInferenceService: A purpose-built CRD specifically engineered for GenAI workloads and LLMs. This specialized resource addresses the unique challenges of LLM serving, including:

Distributed multi-node inference for models that exceed single-GPU capacity
Prefill-decode separation for optimized token generation
Advanced routing strategies for disaggregated serving architectures
Integrated support for KV cache management and sharing
Native OpenAI-compatible API endpoints

This architectural separation enables KServe to deliver LLM-specific optimizations and features without introducing unnecessary complexity to the traditional InferenceService API.

At this stage, you might already feel saturated with the conceptual terminologies. So, let’s switch things up a bit and get hands-on with KServe. In the next section, you’ll be deploying and serving Gemma 3 with vLLM in KServe, demonstrating how to utilize GPU resources in KServe on Google Kubernetes Engine (GKE). Let’s do it!

III. Lab: Deploy Your First KServe Inference Service on GKE

In this hands-on lab, you’ll learn how to serve a GenAI LLM on Kubernetes using KServe’s recommended deployment pattern (Raw K8S Deployment) with a Gateway API for routing and an LLMInferenceService. We will use Google Kubernetes Engine (GKE) and Gemma 3.

1. GKE Setup

First, we’ll provision a GKE cluster. An Autopilot cluster is ideal as it manages the underlying nodes, allowing you to focus on the application.

Ensure you have a gcp project with billing enabled and the GKE API activated.
Ensure you have the following tools installed on your workstation:

- gcloud CLI
- kubectl
- helm

Set the default environment variables:

export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1 # update to suit your needs
export CLUSTER_NAME=dais-kserve-raw-demo # updated for this lab

Create a GKE Autopilot cluster with the necessary permissions for networking:

gcloud container clusters create-auto ${CLUSTER_NAME} \
  --location=$REGION \
  --project=$PROJECT_ID \
  --labels=created-by=dais,guide=kserve-the-ultimate-guide

# Configure kubectl to communicate with your new cluster
gcloud container clusters get-credentials ${CLUSTER_NAME} \
 --region ${REGION} \
 --project ${PROJECT_ID}

You should see an output similar to my outout below:

kubeconfig entry generated for dais-kserve-raw-demo.
NAME                  LOCATION     MASTER_VERSION      MASTER_IP        MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS   STACK_TYPE
dais-kserve-raw-demo  us-central1  1.33.5-gke.1201000  136.119.xxx.xxx  ek-standard-8  1.33.5-gke.1201000  3          RUNNING  IPV4

2. Recommended KServe Setup with Gateway API, and KEDA

In this section, we will manually install the required components for KServe’s Standard deployment mode. We will install Gateway API, KEDA for autoscaling, Envoy Gateway for traffic management, and KServe with LLMInferenceService support.

Resources to be installed

Infrastructure Components for Kserve Standard

KEDA (for Standard KServe autoscaling)
KEDA OpenTelemetry Addon (for Standard KServe autoscaling)

Infrastructure Components for LLMInferenceService:

External Load Balancer (MetalLB for local clusters)
Cert-Manager
Gateway API CRDs
Gateway API Inference Extension CRDs
Envoy Gateway
Envoy AI Gateway
LeaderWorkerSet (multi-node deployments)
GatewayClass
Gateway

KServe Components

KServe CRDs and Controller (Standard)
LLMInferenceService CRDs and Controller

Local Deployment

Follow the quick start guide at https://kserve.github.io/website/docs/getting-started/quickstart-guide to deploy and test KServe locally.

2.1. Define Component Versions

The versions of the components we are about to install.

export ISTIO_VERSION=1.27.1
export KSERVE_VERSION=v0.16.0
export GATEWAY_API_INFERENCE_EXTENSION_VERSION=v1.1.0
export CERT_MANAGER_VERSION=v1.16.1
export GATEWAY_API_VERSION=v1.4.0
export KEDA_VERSION=2.18.1
export ENVOY_GATEWAY_VERSION=v1.5.0
export ENVOY_AI_GATEWAY_VERSION=v0.3.0
export LEADER_WORKER_SET_VERSION=v0.6.2

2.2. Install Core Dependencies

We’ll begin by installing the foundational networking and security components.

Install Gateway API CRDs, and Inference Extension

Gateway API provides a standard, vendor-agnostic way to expose network services. We’ll also install the Kubernetes Standard Inference Extension for LLM-specific routing capabilities.

echo "Installing Gateway API CRDs..."
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

echo "Installing Gateway API Inference Extension CRDs..."
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml

Install Cert-Manager

Cert-Manager is required to automate the management and issuance of TLS certificates for KServe’s webhooks.

# Add the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io --force-update
helm repo update

# Install Cert-Manager
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.16.1 \
  --set crds.enabled=true

echo "Successfully installed Cert-Manager"

You can verify the installation by checking the cert-manager pods (your output may vary):

Context "gke_project-xxxxxxxxx_us-central1_dais-kserve-raw-demo"
Active namespace is "cert-manager".
> kubectl get po
NAME                                       READY   STATUS    RESTARTS      AGE
cert-manager-76d8bd58dd-8w8b9              1/1     Running   0             5m25s
cert-manager-cainjector-77c548b55f-4c46f   1/1     Running   0             5m25s
cert-manager-startupapicheck-8qntf         1/1     Running   3 (66s ago)   4m30s
cert-manager-webhook-85b85dfff8-drqts      1/1     Running   0             5m25s

2.3. Install KEDA for Autoscaling

KEDA (Kubernetes Event-driven Autoscaling) is the recommended autoscaling solution for KServe in Standard mode.

# Add the KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA
helm install keda kedacore/keda --version 2.18.1 --namespace keda --create-namespace --wait

echo "Successfully installed KEDA"

# Install OpenTelemetry Operator for KEDA
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm install my-opentelemetry-operator open-telemetry/opentelemetry-operator \
  -n opentelemetry-operator-system --create-namespace \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib"

# Install KEDA OpenTelemetry Addon
helm upgrade -i kedify-otel oci://ghcr.io/kedify/charts/otel-add-on \
  --version=v0.0.6 --namespace keda --wait \
  --set validatingAdmissionPolicy.enabled=false

echo "Successfully installed KEDA OpenTelemetry Addon"

Example of the expected output when installing KEDA:

Update Complete. ⎈Happy Helming!⎈
NAME: keda
LAST DEPLOYED: Tue Nov 18 17:24:40 2025
NAMESPACE: keda
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
:::^.     .::::^:     :::::::::::::::    .:::::::::.                   .^.
7???~   .^7????~.     7??????????????.   :?????????77!^.              .7?7.
7???~  ^7???7~.       ~!!!!!!!!!!!!!!.   :????!!!!7????7~.           .7???7.
7???~^7????~.                            :????:    :~7???7.         :7?????7.
7???7????!.           ::::::::::::.      :????:      .7???!        :7??77???7.
7????????7:           7???????????~      :????:       :????:      :???7?5????7.
7????!~????^          !77777777777^      :????:       :????:     ^???7?#P7????7.
7???~  ^????~                            :????:      :7???!     ^???7J#@J7?????7.
7???~   :7???!.                          :????:   .:~7???!.    ~???7Y&@#7777????7.
7???~    .7???7:      !!!!!!!!!!!!!!!    :????7!!77????7^     ~??775@@@GJJYJ?????7.
7???~     .!????^     7?????????????7.   :?????????7!~:      !????G@@@@@@@@5??????7:
::::.       :::::     :::::::::::::::    .::::::::..        .::::JGGGB@@@&7:::::::::
                                                                      ?@@#~
                                                                      P@B^
                                                                    :&G:
                                                                    !5.
                                                                    .Kubernetes Event-driven Autoscaling (KEDA) - Application autoscaling made simple.

        _       _               _     _                               ?@@#~
   ___ | |_ ___| |     __ _  __| | __| |     ___  _ __                P@B^
  / _ \| __/ _ \ |    / _` |/ _` |/ _` |___ / _ \| '_ \             :&G:
 | (_) | ||  __/ |   | (_| | (_| | (_| |___| (_) | | | |            !5.
  \___/ \__\___|_|    \__,_|\__,_|\__,_|    \___/|_| |_|            ,
                                                                    .
Successfully installed KEDA OpenTelemetry Addon
~

2.4. InstallLeaderWorkerSet, Envoy Gateway and Envoy AI Gateway

LeaderWorkerSet is required for multi-node deployments.
Envoy Gateway provides a cloud-native gateway implementation that works well with KServe. We’ll also install the AI Gateway for LLM-specific routing capabilities.
Envoy AI Gateway requires Envoy Gateway version 1.5.0 or higher. For the best experience while trying out AI Gateway, we recommend using the latest version as shown in the commands below.
(Optional) Depending on the additional features you want to handle by the Envoy A Gateway (like rate limiting or InferencePool), you need to pass additional addon values files to modify the Envoy Gateway installation. Currently, supported addons are:
- Rate Limiting: https://github.com/envoyproxy/ai-gateway/blob/main/examples/token_ratelimit/envoy-gateway-values-addon.yaml
- InferencePool: https://github.com/envoyproxy/ai-gateway/blob/main/examples/inference-pool/envoy-gateway-values-addon.yaml

# Create the namespace for KServe components
kubectl create namespace kserve --dry-run=client -o yaml | kubectl apply -f -

# Install LeaderWorkerSet for multi-node deployments
kubectl apply -f https://github.com/kubernetes-sigs/lws/releases/download/v0.6.2/manifests.yaml

echo "Successfully installed LeaderWorkerSet"

# Install Envoy Gateway
helm install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.5.0 \
  -n envoy-gateway-system \
  --create-namespace \
  --wait

echo "Successfully installed Envoy Gateway"

# Create GatewayClass
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
EOF

# Create Gateway
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-ingress-gateway
  namespace: kserve
spec:
  gatewayClassName: envoy
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
  infrastructure:
    labels:
      serving.kserve.io/gateway: kserve-ingress-gateway
EOF

# Install Envoy AI Gateway CRDs
helm install aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
  --version v0.3.0 \
  --namespace envoy-ai-gateway-system \
  --create-namespace

# Install Envoy AI Gateway
helm install aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.3.0 \
  --namespace envoy-ai-gateway-system \
  --create-namespace

# Configure Envoy Gateway for AI Gateway integration

kubectl apply -f "https://raw.githubusercontent.com/envoyproxy/ai-gateway/v0.3.0/manifests/envoy-gateway-config/redis.yaml"
kubectl apply -f "https://raw.githubusercontent.com/envoyproxy/ai-gateway/v0.3.0/manifests/envoy-gateway-config/config.yaml"
kubectl apply -f "https://raw.githubusercontent.com/envoyproxy/ai-gateway/v0.3.0/manifests/envoy-gateway-config/rbac.yaml"

# Enable Gateway API Inference Extension support for Envoy Gateway
kubectl apply -f "https://raw.githubusercontent.com/envoyproxy/ai-gateway/v0.3.0/examples/inference-pool/config.yaml"
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway

echo "Successfully installed Envoy AI Gateway"

ensure components are deployed and healthy and surface any installation errors

Tip: Run the commands below to validate installations, check rollout status, and collect logs for troubleshooting.

2.5. Install KServe with LLMInferenceService

Now install KServe’s CRDs and controller configured for Standard mode, and ensure the LLMInferenceService CRDs are available. The commands below will:

Install KServe CRDs and controller via Helm
Set the default deployment mode to “Standard”
Restart the controller so the change takes effect

# Install KServe CRDs first
helm install kserve-crd \
  oci://ghcr.io/kserve/charts/kserve-crd \
  --version v0.0.16 \
  --namespace kserve \
  --wait

# Install the KServe controller
helm install kserve \
  oci://ghcr.io/kserve/charts/kserve-resources \
  --version v0.0.16 \
  --namespace kserve \
  --wait

# Update KServe configuration for Standard mode
kubectl patch configmap inferenceservice-config -n kserve --type merge -p '{"data":{"deploy":"{\"defaultDeploymentMode\":\"Standard\"}"}}'

# Restart the KServe controller to apply the configuration changes
kubectl rollout restart deployment/kserve-controller-manager -n kserve

echo "Successfully installed KServe in Standard mode"

2.6. Verify Installation

Ensure all components are running correctly before proceeding.

# Check that the KServe controller manager is running
kubectl rollout status deployment/kserve-controller-manager -n kserve

# Verify that the required ServingRuntimes are available
kubectl get ServingRuntimes -n kserve

# Check that KEDA is running
kubectl rollout status deployment/keda-operator -n kserve

# Check that Envoy Gateway is running
kubectl rollout status deployment/envoy-gateway -n envoy-gateway-system

# Check that Envoy AI Gateway is running
kubectl rollout status deployment/ai-gateway-controller -n envoy-ai-gateway-system

# Check that LeaderWorkerSet is running
kubectl rollout status deployment/lws-controller-manager -n lws-system

3. Deploy Gemma 3 with LLMInferenceService

With the infrastructure in place, we can now deploy the Gemma 3 model using the LLMInferenceService.

3.1. Create Hugging Face Secret

You need a Hugging Face token to download the Gemma 3 model.

Generate a Hugging Face access token with read permissions.
Ensure you have accepted the terms of use for the Gemma 3 model on Hugging Face.

# Create a namespace for our inference service
kubectl create namespace dais-kserve-demo

# Replace XXX with your actual Hugging Face token
export HF_TOKEN="XXX"

# Create the secret in Kubernetes
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
    name: hf-secret
    namespace: dais-kserve-demo
type: Opaque
stringData:
    hf_api_token: ${HF_TOKEN}
EOF

3.2. Deploy the LLMInferenceService

Create the LLMInferenceService manifest. This resource is specifically designed for LLM workloads and integrates with KEDA for autoscaling.

kubectl apply -f - <<EOF
apiVersion: llm.serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: huggingface-gemma3
  namespace: dais-kserve-demo
spec:
  predictor:
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-l4
      cloud.google.com/gke-accelerator-count: "1"
    model:
      modelFormat:
        name: huggingface
      args:
        - --enable_docs_url=True
        - --model_name=gemma3-4b-it
        - --model_id=google/gemma-3-4b-it
      env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: hf_api_token
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
  autoscaling:
    minReplicas: 1
    maxReplicas: 3
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
EOF

3.3. Create an HTTPRoute to Expose the Service

Create an HTTPRoute to connect our Gateway to the predictor Service created by KServe.

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: gemma3-route
  namespace: dais-kserve-demo
spec:
  parentRefs:
  - name: eg
    namespace: envoy-gateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    forwardTo:
    - serviceName: huggingface-gemma3-predictor
      port: 8080
EOF

4. Test the Inference Service

First, wait for the model pod to be in a Running state.

kubectl get pods -n dais-kserve-demo -w

Next, get the external IP address of the Envoy gateway, which is our entry point.

export INGRESS_IP=$(kubectl get svc eg-envoy-gateway -n envoy-gateway-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export SERVICE_URL="http://${INGRESS_IP}"
echo "Inference Service is available at: ${SERVICE_URL}"

Now, send a request to the OpenAI-compatible endpoint provided by the vLLM server runtime.

curl -X POST "${SERVICE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "gemma3-4b-it",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI Infra expert"
        },
        {
            "role": "user",
            "content": "What is KServe with Standard mode and KEDA?"
        }
    ],
    "max_tokens": 100,
    "stream": false
}'

You can also access the interactive Swagger UI for testing at ${SERVICE_URL}/docs.

4.1. Test Autoscaling

To test the autoscaling capabilities provided by KEDA, you can generate load on the service:

# Install hey if you don't have it
go install github.com/rakyll/hey@latest

# Generate load
hey -z 2m -c 10 -m POST -d '{
    "model": "gemma3-4b-it",
    "messages": [
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ],
    "max_tokens": 50
}' -H "Content-Type: application/json" "${SERVICE_URL}/v1/chat/completions"

# Watch the replicas scale up
kubectl get pods -n dais-kserve-demo -w

5. Clean Up

Delete the GKE cluster to remove all resources created in this lab.

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID \
--quiet

echo "Cluster ${CLUSTER_NAME} deleted successfully"

Conclusion: From Theory to Production

KServe provides the powerful, open-source building blocks for enterprise-grade model serving. It addresses the unique challenges of production inference workloads, from scalability to versioning to observability, all while maintaining a simple, consistent interface that works across frameworks and deployment environments.

But building, securing, and observing a complete, production-ready AI platform around KServe is a complex engineering challenge. It requires expertise in Kubernetes, MLOps, security, observability, and infrastructure management expertise that many organizations struggle to acquire and maintain.

At Drizzle AI Systems (a.k.a DAIS), we specialize in building these platforms, turning open-source components like KServe into battle-tested, managed AI infrastructure. We’ve helped organizations across industries move from experimenting with models to running them confidently in production.

Ready to embrace the power of KServe in your production AI infrastructure?

Schedule a Free 30-Minute Strategy Session with our AI infrastructure experts. In this consultation, you’ll get:

✅ Architecture review of your current or planned KServe deployment
✅ Custom recommendations for your specific use case (LLMs, traditional ML, or hybrid)
✅ Cost optimization strategies for GPU utilization and autoscaling
✅ Security and compliance guidance for production AI workloads
✅ Clear roadmap from proof-of-concept to production-ready infrastructure

Book Your Free Consultation →

Explore Our Open-Source Contributions

We believe in contributing back to the community. Check out our GitHub organization for more AI Infra extensions, deployment templates, and production-ready configurations:

Visit Drizzle AI Systems on GitHub →

Found this guide helpful? Share it with your team:

Share on Twitter

Share on LinkedIn

Questions or feedback? Reach out to us at contact@drizzle.systems or connect with me on LinkedIn.

Get Involved with KServe

Interested in contributing to the future of model serving? The KServe community welcomes developers, data scientists, and infrastructure engineers of all experience levels.

Ways to Get Involved:

📚 Documentation: Help improve guides and tutorials
🐛 Bug Reports: Report issues or inconsistencies you encounter
💡 Feature Requests: Suggest enhancements for the project
🔧 Code Contributions: Submit pull requests for bug fixes or new features
💬 Community Support: Answer questions on Slack or GitHub Discussions
🎤 Share Your Story: Present your KServe use case at community meetings

Join the Community:

Whether you’re fixing a typo in the docs or architecting a major feature, your contributions help make KServe better for everyone.

Frequently Asked Questions (FAQ)

What is the difference between Standard mode and Knative mode in KServe?

Standard mode uses native Kubernetes resources (Deployments, Services, HPA) and is recommended for production LLM workloads requiring stable endpoints and complex networking. Knative mode leverages Knative Serving for serverless capabilities including scale-to-zero, making it ideal for cost optimization with unpredictable traffic patterns. Standard mode offers more control and stability, while Knative mode provides better resource efficiency.

Does KServe support GPU acceleration?

Yes, KServe fully supports GPU acceleration for inference workloads. It works seamlessly with NVIDIA GPUs, including A100, H100, L4, and others. You can specify GPU requirements in your InferenceService or LLMInferenceService resource definitions, and KServe will automatically schedule pods on GPU-enabled nodes.

How does KServe handle model versioning and A/B testing?

KServe provides built-in support for canary deployments and traffic splitting, allowing you to deploy multiple versions of a model simultaneously and gradually shift traffic between them. This enables safe A/B testing and gradual rollouts of new model versions without service interruption.

What monitoring and observability features does KServe provide?

KServe integrates with standard Kubernetes observability tools and provides metrics compatible with Prometheus and Grafana. It exposes model-specific metrics like prediction latency, throughput, and error rates. When combined with tools like the Envoy AI Gateway, you can also track LLM-specific metrics like token usage and prompt characteristics.

Can I use KServe with my existing ML models?

Yes, KServe supports a wide range of ML frameworks including TensorFlow, PyTorch, Scikit-learn, XGBoost, and custom models. For LLMs, it supports popular frameworks like Hugging Face Transformers, vLLM, and TensorRT-LLM. You can also create custom ServingRuntimes for specialized frameworks.

What is the LLMInferenceService and when should I use it?

LLMInferenceService is a specialized CRD designed specifically for serving Large Language Models. It provides LLM-specific features like distributed multi-node inference, prefill-decode separation, KV cache management, and native OpenAI-compatible APIs. Use it instead of the standard InferenceService when deploying LLMs for better performance and easier configuration.

How does KServe compare to solutions like vLLM or TGI?

This is a key concept: KServe is not a replacement for these, it’s an orchestrator for them. vLLM and TGI are high-performance inference servers. KServe provides the production-grade platform around them, managing:

The deployment as a Kubernetes-native resource (LLMInferenceService)
Network routing (via Gateway API)
Autoscaling (via KEDA)
Observability and logging

In our lab, the LLMInferenceService automatically uses a vLLM-based runtime to serve the Gemma 3 model.

What are the security best practices for KServe deployments?

Key security best practices include: enabling authentication and authorization through the Envoy AI Gateway, using secrets management for API tokens (like Hugging Face tokens), implementing network policies to restrict pod-to-pod communication, using RBAC for controlling access to KServe resources, and enabling TLS for encrypted communication. For LLM deployments, also implement prompt injection protection and content filtering.

How much does KServe cost?

KServe itself is completely free and open-source under the Apache 2.0 license. You only pay for the underlying infrastructure costs (Kubernetes cluster, GPUs, storage, networking) from your cloud provider. There are no licensing fees, vendor lock-in, or usage-based charges for KServe itself.

Can KServe scale to zero to save costs?

Yes, when using Knative mode, KServe can scale deployments down to zero replicas when there’s no traffic, significantly reducing infrastructure costs. However, this comes with cold start latency when the service needs to wake up. For LLM workloads where cold starts can take several minutes, Standard mode with a minimum replica count is often preferred.

The Ultimate Guide to KServe

Table of Contents

Introduction

What You’ll Learn in This Blog

The “Last Mile” Problem: Why Model Serving/Inference is a Unique Challenge

I. Foundational Introduction to KServe

1. What is KServe?

2. Key Benefits that Set KServe Apart

II. KServe Design and Architecture Concepts

1. Core Design Principles

2. KServe Generative Inference Architecture

3. KServe Control and Data Planes

4. KServe Deployment Modes

5. The KServe Inference Protocols

6. The LLMInferenceService

III. Lab: Deploy Your First KServe Inference Service on GKE

1. GKE Setup

2. Recommended KServe Setup with Gateway API, and KEDA

3. Deploy Gemma 3 with LLMInferenceService

4. Test the Inference Service

5. Clean Up

Conclusion: From Theory to Production

Ready to embrace the power of KServe in your production AI infrastructure?

Explore Our Open-Source Contributions

What is the difference between Standard mode and Knative mode in KServe?

Does KServe support GPU acceleration?

How does KServe handle model versioning and A/B testing?

What monitoring and observability features does KServe provide?

Can I use KServe with my existing ML models?

What is the LLMInferenceService and when should I use it?

How does KServe compare to solutions like vLLM or TGI?

What are the security best practices for KServe deployments?

How much does KServe cost?

Can KServe scale to zero to save costs?

Make it 10x Easier to Production-Grade AI Infrastructure

The Ultimate Guide to KServe

Table of Contents

Introduction

What You’ll Learn in This Blog

The “Last Mile” Problem: Why Model Serving/Inference is a Unique Challenge

I. Foundational Introduction to KServe

1. What is KServe?

2. Key Benefits that Set KServe Apart

II. KServe Design and Architecture Concepts

1. Core Design Principles

2. KServe Generative Inference Architecture

3. KServe Control and Data Planes

4. KServe Deployment Modes

5. The KServe Inference Protocols

6. The LLMInferenceService

III. Lab: Deploy Your First KServe Inference Service on GKE

1. GKE Setup

2. Recommended KServe Setup with Gateway API, and KEDA

3. Deploy Gemma 3 with LLMInferenceService

4. Test the Inference Service

5. Clean Up

Conclusion: From Theory to Production

Ready to embrace the power of KServe in your production AI infrastructure?

Explore Our Open-Source Contributions

Share This Guide

What is the difference between Standard mode and Knative mode in KServe?

Does KServe support GPU acceleration?

How does KServe handle model versioning and A/B testing?

What monitoring and observability features does KServe provide?

Can I use KServe with my existing ML models?

What is the LLMInferenceService and when should I use it?

How does KServe compare to solutions like vLLM or TGI?

What are the security best practices for KServe deployments?

How much does KServe cost?

Can KServe scale to zero to save costs?

Make it 10x Easier to Production-Grade AI Infrastructure