Role

We're hiring a GPU Optimization Engineer who understands GPUs at a deep, architectural level: someone who knows exactly how to squeeze every last millisecond out of a model, which GPU constraints matter, and how to restructure models for real-world inference performance. You'll work across CUDA kernels, model graph optimizations, hardware-specific tuning, and porting models across GPU architectures. Your work directly impacts the latency, throughput, and reliability of smallest's real-time speech models.

What You'll Do

- Optimize model architectures (ASR, TTS, SLMs) for maximum performance on specific GPU hardware
- Profile models end-to-end to identify GPU bottlenecks: memory bandwidth, kernel launch overhead, fusion opportunities, quantization constraints
- Design and implement custom kernels (CUDA/Triton/tinygrad) for performance-critical model sections
- Perform operator fusion, graph optimization, and kernel-level scheduling improvements
- Tune models to fit GPU memory limits while maintaining quality
- Benchmark and calibrate inference across NVIDIA, AMD, and potentially emerging accelerators
- Port models across GPU chipsets (NVIDIA to AMD, edge GPUs, new compute backends)
- Work with TensorRT, ONNX Runtime, and custom runtimes for deployment
- Partner with the research and infra teams to ensure the entire stack is optimized for real-time workloads

Requirements

- Strong understanding of GPU architecture: SMs, warps, memory hierarchy, occupancy tuning
- Hands-on experience with CUDA, kernel writing, and kernel-level debugging
- Experience with kernel fusion and model graph optimizations
- Familiarity with TensorRT, ONNX Runtime, Triton, tinygrad, or similar inference engines
- Strong proficiency in PyTorch and Python
- Deep understanding of model architectures (transformers, convolutions, RNNs, attention, diffusion blocks)
- Experience profiling GPU workloads using Nsight, nvprof, or similar tools
- Strong problem-solving ability with a performance-first mindset

Great to Have

- Experience with quantization (INT8, FP8, hybrid formats)
- Experience with audio/speech models (ASR, TTS, SSL, vocoders)
- Contributions to open-source GPU stacks or inference runtimes
- Published work on systems-level model optimization

Years of Experience

3–5 years of specialized experience in GPU optimization, gained in academia or industry.

Education

Master's or PhD in GPU programming or a related field.

Note: we often make exceptions and hire brilliant candidates regardless of years of experience or education. Proof of work is paramount.