As DevOps teams manage increasingly complex artificial intelligence and machine learning (AI/ML) pipelines on Kubernetes, the pressure to maximize infrastructure efficiency rises—especially around expensive resources like GPUs. Whether you’re deploying real-time inference services or running distributed training jobs, underutilized GPU capacity can silently drain your cloud budget.
That’s why we’re pleased to introduce GPU time-slicing in Flexera Spot Ocean for Amazon EKS.
By taking the time-slicing configured on your EKS cluster into account when autoscaling GPU nodes, Spot Ocean enables multiple Kubernetes workloads to share a single GPU. The result? Drastically improved GPU utilization, lower costs and a smaller carbon footprint, without the complexity usually associated with fine-tuning GPU workloads.
What is GPU time-slicing?
By allocating dedicated time intervals to each workload, GPU time-slicing allows multiple pods or containers to share a single GPU. Instead of assigning an entire GPU to a single pod (the traditional model), you can now “oversubscribe” a GPU. Workloads scheduled on oversubscribed GPUs can then interleave over time, effectively sharing GPU resources without memory or fault isolation between replicas.
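Concretely, time-slicing is enabled at the cluster level; on EKS this is commonly done through the NVIDIA Kubernetes device plugin's sharing configuration. Here is a minimal sketch (the replica count is illustrative) that advertises each physical GPU as four schedulable nvidia.com/gpu resources:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu   # the extended resource to oversubscribe
        replicas: 4            # each physical GPU is advertised as 4 allocatable GPUs

With this in place, the Kubernetes scheduler can bind up to four pods to each physical GPU, and their work is interleaved on the device.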
Why it matters
For DevOps practitioners managing AI/ML infrastructure, GPU time-slicing offers major wins in efficiency, cost savings and simplicity.
Better GPU utilization
In most EKS-based GPU deployments, DevOps engineers bind GPU access by requesting nvidia.com/gpu: 1 in their pod specs. This means Kubernetes schedules one pod per GPU, even if the pod only uses a small fraction of the available resources.
Consider a scenario where each inference job uses only 10% of the GPU's compute. Without time-slicing, each job still claims 100% of the GPU, wasting the remaining 90%.
With GPU time-slicing, multiple pods can share the same GPU without changing your workload logic. In the example above, up to 10 workloads can run concurrently on the same GPU, as the pod spec sketch below illustrates.
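A minimal example of such an inference pod (name and image are hypothetical); note that the GPU request itself is unchanged from the traditional one-pod-per-GPU model:

apiVersion: v1
kind: Pod
metadata:
  name: inference-worker   # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/inference:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # still requests one GPU; with 10 time-slicing
                              # replicas, ten such pods fit on one physical GPU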
This is particularly useful for:
- Model inference services using frameworks like TensorFlow Serving or Triton Inference Server
- Hyperparameter tuning jobs that require moderate GPU power
- Batch processing pipelines that run in parallel
Lower GPU node cost
GPU nodes are expensive. By time-slicing, you can run more pods per GPU instance, reducing the total number of instances required and dramatically lowering costs.
Simple, centralized configuration
With Spot Ocean for EKS, GPU management just got smarter. Spot Ocean now supports GPU time-sharing configurations, allowing you to fully utilize each GPU by running multiple workloads simultaneously.
Once you’ve configured GPU sharing in your cluster, simply define the relevant settings in Spot Ocean: GPU multipliers, naming conventions and sharing mode.
Spot Ocean will factor this into its autoscaling and scheduling logic, resulting in smarter resource allocation, less waste and improved performance without added complexity.
Consider a real-world example of serving ML models efficiently: when deploying a PyTorch model behind an API using FastAPI and TorchServe, each inference takes ~150ms of GPU time and uses less than 10% of the GPU's memory. In a default EKS setup, each replica would need its own GPU, even though most of it would go unused.
If you configure eight replicas to share a single GPU, Spot Ocean will apply time-sharing logic and spin up just one GPU node without requiring a change in deployment strategy.
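A hedged sketch of that scenario (names are hypothetical; the public pytorch/torchserve image stands in for your model server). The manifest is an ordinary Deployment; the sharing configuration does all the work:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve-api   # hypothetical name
spec:
  replicas: 8            # all eight replicas can land on one time-sliced GPU
  selector:
    matchLabels:
      app: torchserve-api
  template:
    metadata:
      labels:
        app: torchserve-api
    spec:
      containers:
        - name: torchserve
          image: pytorch/torchserve:latest   # stand-in model server image
          resources:
            limits:
              nvidia.com/gpu: 1   # one slice per replica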
Considerations and best practices
GPU time-slicing introduces powerful efficiency gains—but it’s not a silver bullet for every use case. Here are a few key considerations:
- No memory/fault isolation: Time-sliced workloads share memory, so issues in one pod can potentially affect others. Avoid using it for untrusted multi-tenant scenarios or large training jobs where memory is fully consumed
- Latency variability: Since GPU access is serialized, time-sliced workloads may experience latency spikes. This is usually acceptable for batch or async inference jobs, but not ideal for low-latency, high-throughput serving
- Monitoring: Leverage GPU monitoring tools (such as the DCGM exporter or Prometheus + Grafana) to track utilization and fine-tune your time-slicing ratios; see the sample alert rule after this list
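As one illustration, assuming you scrape NVIDIA's dcgm-exporter with Prometheus, a rule like the following (the alert name and thresholds are placeholders) could flag GPUs that stay underutilized and might tolerate a higher replica count:

groups:
  - name: gpu-sharing
    rules:
      - alert: GpuSliceUnderutilized   # hypothetical alert name
        expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) < 30   # dcgm-exporter utilization metric
        for: 30m
        annotations:
          summary: GPU utilization below 30% for 30m; consider a higher time-slicing replica count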
Get started
Here is an example of an Ocean GPU time-slicing configuration in the Ocean API. This example assumes five replicas per GPU.
{ "launchSpec": { "oceanId": "oceanId", "name": "gpuTest", "gpu": { "sharing": [ { "gpuSharingType": "timeSlicing", "config": { "replicas": 5 } } ] } } }
It’s that simple. From here, Spot Ocean will spin up just one GPU node where five would otherwise be needed.
Unlock more infrastructure value
GPU time-slicing in Spot Ocean for Amazon EKS brings powerful new capabilities to DevOps teams managing AI workloads. By enabling efficient sharing of GPU resources while autoscaling, Spot Ocean helps reduce costs and unlock more value from your infrastructure without increasing operational burden.
Whether you’re scaling inference services or running parallel experiments, this feature ensures you’re not overspending on GPU resources.
If you already use Ocean, learn more about Ocean’s GPU time-slicing in the Spot API documentation.
To start using Ocean, sign up here to connect your AWS account, or book your demo with our solutions engineers.