How to Deploy Ollama on Kubernetes | AI Model Serving on k8s

Deploy Ollama on Kubernetes

Running AI models in a Kubernetes cluster lets you scale and manage them like any other workload. If you’ve ever wanted to serve AI models on your own infrastructure, this guide walks you through deploying Ollama on Kubernetes. By the end, you’ll have a working setup for AI model serving inside a Kubernetes cluster.

1. Setting Up the Kubernetes Namespace

To keep resources organized, start by creating a dedicated namespace for Ollama:

kubectl create namespace ollama

Verify the namespace:

kubectl get ns

This ensures that all Ollama-related resources remain isolated within a specific namespace.
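
Optionally, you can make ollama the default namespace for your current kubectl context, so the -n ollama flag in the commands below becomes unnecessary:

kubectl config set-context --current --namespace=ollama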

2. Deploying Ollama on Kubernetes

Create an ollama-deployment.yaml file to deploy Ollama as a Kubernetes Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP

Apply the deployment:

kubectl apply -f ollama-deployment.yaml

Check if the pod is running:

kubectl get pods -n ollama

Once the pod is up and running, the AI model serving infrastructure is ready.
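
If you want the shell to block until the pod actually reports Ready (the image pull alone can take a minute or two), a kubectl wait one-liner works; the 300-second timeout here is just an arbitrary example:

kubectl wait --for=condition=Ready pod -l app=ollama -n ollama --timeout=300s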

3. Exposing Ollama Using NodePort

To make Ollama accessible from outside the cluster, expose it using a NodePort Service. Create an ollama-service.yaml file:

apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: NodePort
  selector:
    app: ollama
  ports:
    - port: 80
      name: http
      targetPort: 11434
      protocol: TCP
      nodePort: 30007

Apply the service:

kubectl apply -f ollama-service.yaml

Verify the service status:

kubectl get svc -n ollama

The Ollama API is now exposed on port 30007 of every node, so external requests can reach it.
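
The <NODE_IP> placeholder used below is the address of any node in the cluster; you can look it up with kubectl. As a quick sanity check, the Ollama root endpoint should answer with "Ollama is running":

# Find a node IP to use as <NODE_IP>
kubectl get nodes -o wide

# Quick reachability check against the NodePort
curl http://<NODE_IP>:30007/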

4. Testing the Ollama Deployment

To confirm that Ollama is accessible, send a test request using curl:

curl -s http://<NODE_IP>:30007/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}' | jq -r '.response' | tr -d '\n'

The ollama/ollama image starts without any models on board, so llama2 must be pulled before it can answer; depending on your Ollama version, the first generate request may simply return a "model not found" error until the pull completes. You can monitor the pod logs to see what the server is doing:

kubectl logs -f deployment/ollama -n ollama
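
If the request returns a "model not found" error, pull the model explicitly first. A minimal sketch, using either the Ollama CLI inside the running pod or Ollama's /api/pull endpoint through the NodePort service:

# Pull llama2 with the Ollama CLI inside the running pod
kubectl exec -n ollama deploy/ollama -- ollama pull llama2

# Or pull it over HTTP through the exposed service
curl http://<NODE_IP>:30007/api/pull -d '{"name": "llama2"}'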

5. Running AI Models with Ollama

Ollama can serve many different models from the same deployment. To try another one, pull it and then run a request, for example against orca-mini:3b:

curl http://<NODE_IP>:30007/api/generate -d '{
  "model": "orca-mini:3b",
  "prompt": "What is Kubernetes?"
}'

The response comes from whichever model the request names, so a single Ollama deployment can serve several models for inference within Kubernetes.
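
To see which models the server currently has available, Ollama's /api/tags endpoint lists everything that has been pulled; a quick check (assuming jq is installed):

curl -s http://<NODE_IP>:30007/api/tags | jq -r '.models[].name'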

6. Summary & Next Steps

By following this guide, you have:

  • Set up a dedicated Kubernetes namespace for Ollama
  • Deployed Ollama as a Kubernetes Deployment
  • Exposed the deployment with a NodePort Service for external access
  • Tested AI model serving within the cluster

With Ollama deployed, you can integrate it into applications, fine-tune models, or explore different service exposure methods such as LoadBalancer or Ingress. If you have questions or want to explore more AI deployments on Kubernetes, drop a comment below!
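
For reference, below is a minimal Ingress sketch for the same Service; it assumes an NGINX ingress controller is installed in the cluster and uses ollama.example.com as a placeholder hostname you would replace with your own.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    # Generations can run long, so raise the proxy read timeout
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 80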
