How to Deploy Ollama on Kubernetes | AI Model Serving on k8s

Deploy Ollama on Kubernetes

Running AI models in a Kubernetes cluster lets you scale and manage them like any other workload. If you’ve ever wanted to serve AI models on your own infrastructure, this guide walks you through deploying Ollama on Kubernetes. By the end, you’ll have a working setup for AI model serving inside a Kubernetes cluster.

1. Setting Up the Kubernetes Namespace

To keep resources organized, start by creating a dedicated namespace for Ollama:

kubectl create namespace ollama

Verify the namespace:

kubectl get ns

This ensures that all Ollama-related resources remain isolated within a specific namespace.
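
Optionally, you can make ollama the default namespace for your current kubectl context, so the -n ollama flag in the commands below becomes unnecessary:

kubectl config set-context --current --namespace=ollama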

2. Deploying Ollama on Kubernetes

Create an ollama-deployment.yaml file to deploy Ollama as a Kubernetes Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP

Apply the deployment:

kubectl apply -f ollama-deployment.yaml

Check if the pod is running:

kubectl get pods -n ollama

Once the pod is up and running, the AI model serving infrastructure is ready.
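
If you want the shell to block until the pod actually reports Ready (the image pull alone can take a minute or two), a kubectl wait one-liner works; the 300-second timeout here is just an arbitrary example:

kubectl wait --for=condition=Ready pod -l app=ollama -n ollama --timeout=300s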

3. Exposing Ollama Using NodePort

To make Ollama accessible from outside the cluster, expose it using a NodePort Service. Create an ollama-service.yaml file:

apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: NodePort
  selector:
    app: ollama
  ports:
    - port: 80
      name: http
      targetPort: 11434
      protocol: TCP
      nodePort: 30007

Apply the service:

kubectl apply -f ollama-service.yaml

Verify the service status:

kubectl get svc -n ollama

The Ollama API is now exposed on port 30007 of every node, so external requests can reach it.
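
The <NODE_IP> placeholder used below is the address of any node in the cluster; you can look it up with kubectl. As a quick sanity check, the Ollama root endpoint should answer with "Ollama is running":

# Find a node IP to use as <NODE_IP>
kubectl get nodes -o wide

# Quick reachability check against the NodePort
curl http://<NODE_IP>:30007/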

4. Testing the Ollama Deployment

To confirm that Ollama is accessible, send a test request using curl:

curl -s http://<NODE_IP>:30007/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}' | jq -r '.response' | tr -d '\n'

The ollama/ollama image starts without any models on board, so llama2 must be pulled before it can answer; depending on your Ollama version, the first generate request may simply return a "model not found" error until the pull completes. You can monitor the pod logs to see what the server is doing:

kubectl logs -f deployment/ollama -n ollama
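
If the request returns a "model not found" error, pull the model explicitly first. A minimal sketch, using either the Ollama CLI inside the running pod or Ollama's /api/pull endpoint through the NodePort service:

# Pull llama2 with the Ollama CLI inside the running pod
kubectl exec -n ollama deploy/ollama -- ollama pull llama2

# Or pull it over HTTP through the exposed service
curl http://<NODE_IP>:30007/api/pull -d '{"name": "llama2"}'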

5. Running AI Models with Ollama

Ollama can serve many different models from the same deployment. To try another one, pull it and then run a request, for example against orca-mini:3b:

curl http://<NODE_IP>:30007/api/generate -d '{
  "model": "orca-mini:3b",
  "prompt": "What is Kubernetes?"
}'

The response comes from whichever model the request names, so a single Ollama deployment can serve several models for inference within Kubernetes.
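
To see which models the server currently has available, Ollama's /api/tags endpoint lists everything that has been pulled; a quick check (assuming jq is installed):

curl -s http://<NODE_IP>:30007/api/tags | jq -r '.models[].name'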

6. Summary & Next Steps

By following this guide, you have:

  • Set up a dedicated Kubernetes namespace for Ollama
  • Deployed Ollama as a Kubernetes Deployment
  • Exposed the deployment with a NodePort Service for external access
  • Tested AI model serving within the cluster

With Ollama deployed, you can integrate it into applications, fine-tune models, or explore different service exposure methods such as LoadBalancer or Ingress. If you have questions or want to explore more AI deployments on Kubernetes, drop a comment below!
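
For reference, below is a minimal Ingress sketch for the same Service; it assumes an NGINX ingress controller is installed in the cluster and uses ollama.example.com as a placeholder hostname you would replace with your own.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ollama
  annotations:
    # Generations can run long, so raise the proxy read timeout
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 80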
