A Guide on Choosing GPUs for Your Cloud

Graphics Processing Units (GPUs) are no longer just for gaming or graphics rendering. In today’s cloud environments, GPUs power artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), and even virtual desktops.

For cloud service providers (CSPs), system integrators, and enterprises, choosing the right GPUs can directly influence performance, scalability, and cost-efficiency. This guide will help you understand GPU options and how to choose the right one for your cloud workloads.

Why GPUs in the Cloud?

Cloud environments are evolving beyond basic compute and storage. Customers increasingly demand GPU-backed services for:

  • AI/ML & Deep Learning: Model training, inferencing, computer vision, and natural language processing.
  • Virtual Desktops (VDI): GPU acceleration for graphics-intensive applications.
  • Cloud Gaming & Streaming: High-quality rendering and video delivery.
  • Media & Graphics Production: 3D rendering, editing, and transcoding.
  • HPC & Simulations: Scientific modeling, engineering simulations, and large-scale data processing.
  • Enterprise & SaaS: GPU-powered apps, Dev/Test environments, and education workloads.

For CSPs, offering GPUs means new revenue streams, differentiated services, and meeting data sovereignty requirements while giving customers self-service access to specialized hardware.


GPU Deployment Models

Apache CloudStack and similar cloud platforms now offer integrated GPU management for IaaS clouds, but the choice of GPU type matters greatly depending on the workload. Here are some common GPU deployment and usage models:

1. Passthrough GPUs

  • One or more full physical GPUs are dedicated to an instance (virtual machine).
  • Best for workloads requiring large VRAM, low latency, or real-time processing (e.g., robotics, video transcoding, complex AI training).
  • Offers strong isolation and performance but limited scalability.
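On KVM/libvirt hosts, passthrough is typically declared as a `<hostdev>` element in the instance's domain XML. A minimal sketch of such a config fragment — the PCI address 0000:3b:00.0 is a placeholder, not a real device:

```xml
<!-- libvirt domain XML fragment: pass a physical GPU through to a guest
     as a PCI device. Replace the address with the GPU's actual address
     (as shown by lspci). -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```

With `managed='yes'`, libvirt detaches the device from its host driver and binds it to vfio-pci automatically when the instance starts.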

2. Shared / Virtualised GPUs

Multiple tenants share the underlying GPU hardware via virtualisation technologies. The virtualisation mechanism differs between vendors and GPU models; common variants include:

    • SR-IOV (AMD MxGPU, Intel Flex) – Hardware-based virtual functions with strong isolation.
    • vGPU (NVIDIA) – Software-driven GPU partitions for AI inferencing, VDI, and media workloads.
    • MIG (NVIDIA) – Multi-Instance GPU providing fully isolated GPU slices (ideal for multi-tenant AI).
    • Time-Sliced Sharing – Simpler, lower-concurrency GPU sharing suitable for bursty workloads.
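On a Linux host, the mediated-device (MDEV) types a GPU advertises can be discovered from sysfs. A minimal Python sketch, assuming the standard `/sys/class/mdev_bus` layout; the root path is parameterised so the function can be exercised without GPU hardware:

```python
from pathlib import Path


def list_mdev_types(mdev_bus="/sys/class/mdev_bus"):
    """Enumerate mediated-device (mdev) types advertised by host GPUs.

    Walks the standard sysfs layout:
      <mdev_bus>/<pci-addr>/mdev_supported_types/<type>/{name,available_instances}

    Returns a list of (pci_address, type_id, name, available_instances) tuples.
    """
    results = []
    root = Path(mdev_bus)
    if not root.is_dir():  # no mdev-capable device/driver on this host
        return results
    for device in sorted(root.iterdir()):
        types_dir = device / "mdev_supported_types"
        if not types_dir.is_dir():
            continue
        for mdev_type in sorted(types_dir.iterdir()):
            name_file = mdev_type / "name"
            avail_file = mdev_type / "available_instances"
            name = name_file.read_text().strip() if name_file.exists() else mdev_type.name
            avail = int(avail_file.read_text()) if avail_file.exists() else 0
            results.append((device.name, mdev_type.name, name, avail))
    return results
```

An mdev instance of one of these types is then created by echoing a UUID into the type's `create` node, which is what vendor vGPU tooling does under the hood.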

 

| Comparison Criteria | Passthrough | SR-IOV | MDEV (Mediated Device) |
| --- | --- | --- | --- |
| GPU access & technology | VM has direct access to the GPU through the hypervisor; the GPU is passed through to the instance as a PCI device | VM connects via Virtual Functions (VFs); the GPU exposes a Physical Function (PF) split into multiple VFs | VM gets a shared GPU or vGPU slice via MDEV emulation; the host OS partitions the GPU using a vendor driver (NVIDIA vGPU, Intel/AMD mdev) |
| I/O | Uses IOMMU for secure DMA and device isolation | Managed by the hypervisor + IOMMU | Uses VFIO-mdev with IOMMU protection (needs vendor drivers) |
| Multi-tenancy | Single tenant | Multi-tenant (hardware-dependent) | Multi-tenant (flexible partitioning) |
| Performance | High | High / near-native | Near-native / fair |
| Isolation | Strong | Strong | Hardware-dependent |


Key GPU Vendors

These are the current and upcoming GPU vendors in the ecosystem:

    • NVIDIA: Current market leader, offering the popular CUDA ecosystem with strong AI/ML and vGPU support.
    • AMD: Growing contender with ROCm stack and MxGPU virtualisation.
    • Intel: Emerging player with oneAPI and Flex GPUs for SR-IOV virtualisation.
    • Apple: Niche player with the Metal-based API/stack, mainly for consumer devices.
    • Others: Qualcomm (Adreno) and a few others in Android/mobile or proprietary ecosystems.

 

|  | NVIDIA | AMD | Intel |
| --- | --- | --- | --- |
| Platform/Tech. | CUDA | ROCm | oneAPI |
| Virtualisation | vGPU (GRID) | MxGPU (SR-IOV) | Basic/Flex |
| Current standing | Market leader | Growing contender | Catching up |
| Ecosystem | PyTorch, TensorFlow, etc. | Improving support in PyTorch, etc. | Lagging behind |

 


Choosing the Right GPU for Your Workload

When selecting GPUs, match workload requirements with the right GPU type, for example:

  • High-performance AI training & real-time apps → Passthrough high-end GPUs (e.g., NVIDIA H100, A100).
  • AI inferencing, VDI, and virtual apps → Shared GPUs (NVIDIA L40, A10, AMD MI300X, Intel Flex).
  • Cloud Gaming, Education, Remote Apps → vGPU or SR-IOV-based shared GPUs.
  • Enterprise-scale Multi-Tenant AI → NVIDIA MIG-enabled GPUs (A100, H100).
  • Graphics/Media workloads → Quadro-based vGPUs (NVIDIA Q-series).

 


Practical Considerations

When building a GPU-enabled IaaS cloud, also consider:

  • NUMA placement – Improper GPU-CPU memory alignment can severely affect performance.
  • Hypervisor support – Ensure GPU drivers align with chosen OS and hypervisor (e.g., RHEL, Ubuntu with KVM).
  • Licensing & Ecosystem – NVIDIA’s vGPU licensing vs. AMD/Intel’s open approaches.
  • Scalability & Limits – Most IaaS/CMP platforms impose per-tenant limits; for example, Apache CloudStack provides limit/quota controls for GPU resource usage (account.gpus, max.project.gpus).
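As a sketch of what such quota enforcement looks like, the following hypothetical check models a per-tenant GPU limit. The setting names come from the CloudStack controls mentioned above; the function itself is illustrative, not a real platform API:

```python
def can_allocate_gpus(requested: int, in_use: int, limits: dict,
                      scope: str = "account.gpus") -> bool:
    """Return True if `requested` additional GPUs fit under the tenant's quota.

    `limits` maps setting names (e.g. "account.gpus", "max.project.gpus")
    to integer caps. A limit of -1 is treated as 'unlimited', a common
    cloud-platform convention; a missing limit is treated as 0 (deny).
    """
    limit = limits.get(scope, 0)
    if limit == -1:
        return True
    return in_use + requested <= limit
```

The scheduler would run this check per scope (account, then project) before picking a host, rejecting the deployment early instead of failing at the hypervisor.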

KVM is a popular and growing hypervisor choice in the IaaS/cloud space. A key consideration is which GPU vendors, models, and technologies are supported on KVM via GPU passthrough and virtualisation:

| GPU Considerations with KVM | MDEV | SR-IOV | VFIO Passthrough |
| --- | --- | --- | --- |
| Sharing | Yes | Yes | No |
| Uses VFIO | Yes (vfio_mdev) | Yes (vfio-pci) | Yes (vfio-pci) |
| Granularity | Fine-grained mediated devices | Hardware-based VFs | Entire physical device |
| Device support | Software-defined (via driver) | Hardware-defined | Full passthrough |
| Needs IOMMU | Yes | Yes | Yes |
| Examples | NVIDIA vGPU | AMD MxGPU, Intel Flex | Full GPU passthrough |
| Guest driver | Vendor vGPU driver | Vendor driver | Vendor driver |
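Since all three KVM approaches above require the IOMMU, a quick readiness check on a host is to inspect `/sys/kernel/iommu_groups`. A Python sketch (the root path is parameterised so it can be tested without real hardware); note that all devices in one IOMMU group must be assigned to the same guest for VFIO passthrough, so a GPU that shares a group with other functions matters:

```python
from pathlib import Path


def iommu_group_of(pci_addr: str, groups_root="/sys/kernel/iommu_groups"):
    """Find the IOMMU group containing `pci_addr` and list its members.

    Walks the standard sysfs layout:
      <groups_root>/<group-id>/devices/<pci-addr>

    Returns (group_id, sorted_member_addresses), or None if the device is
    not found or the IOMMU is disabled/unsupported (no groups directory).
    """
    root = Path(groups_root)
    if not root.is_dir():  # IOMMU off: no passthrough or SR-IOV/MDEV possible
        return None
    for group in root.iterdir():
        members = [d.name for d in (group / "devices").iterdir()]
        if pci_addr in members:
            return (group.name, sorted(members))
    return None
```

An empty `/sys/kernel/iommu_groups` usually means the IOMMU was not enabled on the kernel command line (e.g. `intel_iommu=on` or `amd_iommu=on`).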


Challenges & The Road Ahead

While GPU integration into IaaS clouds is advancing rapidly, challenges remain:

  • Testing and validation across diverse GPU hardware.
  • Improving GPU resource management and orchestration.
  • Supporting advanced features such as live migration of GPU-enabled instances.
  • Expanding multi-hypervisor support and richer GPU metrics.
  • Lack of consistent virtualisation and technology specification across vendors.

Future cloud platforms will continue to mature GPU integration, making GPU-backed workloads as seamless as traditional CPU and storage provisioning.

Conclusion

Choosing the right GPU for your cloud depends on balancing performance, scalability, and workload type. Passthrough GPUs excel at raw power and isolation, while shared and virtualized GPUs enable multi-tenant efficiency and flexibility. NVIDIA, AMD, and Intel each offer unique strengths, and the decision ultimately rests on the nature of workloads you plan to support.

By carefully aligning GPU types with workload demands, cloud providers and enterprises can unlock new opportunities in AI, VDI, HPC, and beyond — building clouds that are ready for the next generation of compute.

ShapeBlue