Machine Learning Software Engineer - GPU/NPU

  • Vancouver, British Columbia

Job description

Huawei Canada has an immediate 12-month contract opening for a Machine Learning Software Engineer.

About the team:

The Software-Hardware System Optimization Lab continuously improves the power efficiency and performance of smartphone products through software-hardware co-optimization and architecture innovation. We track trends in cutting-edge technologies and build competitive strength in mobile AI, graphics, multimedia, and software architecture for smartphone products.

About the job:

  • Profile and optimize end-to-end ML workloads and kernels to improve latency, throughput, and efficiency across GPU/NPU/CPU.

  • Identify bottlenecks (compute, memory, bandwidth) and land fixes: tiling, fusion, vectorization, quantization, mixed precision, layout changes.

  • Build/extend tooling for benchmarking, tracing, and automated regression/perf testing.

  • Collaborate with compiler/runtime teams to land graph- and kernel-level improvements.

  • Apply ML/RL-based techniques (e.g., cost models, schedulers, autotuners) to search better execution plans.

  • Translate promising research/prototypes into reliable, scalable production features and services.

The target annual compensation (based on 2,080 hours per year) ranges from $78,000 to $168,000, depending on education, experience, and demonstrated expertise.

Job requirements

About the ideal candidate:

  • Master's or PhD degree in Computer Science or a related field. Solid experience in ML systems or performance engineering (industry, OSS, or research). Fluency in Python and C++.

  • Hands-on experience with at least one compute stack: CUDA/ROCm, OpenCL, Metal/Vulkan compute, Triton, or vendor/open-source NPU toolchains.

  • Practical knowledge of PyTorch or TensorFlow/JAX and inference/training performance basics (mixed precision, graph optimizations, quantization).

  • Ability to turn ambiguous perf problems into measurable, repeatable experiments.

  • AI compiler exposure: TVM, IREE, XLA/MLIR, TensorRT, or similar. Profiling skills (Nsight, perf, VTune, CUPTI/ROCm tools) and comfort interpreting roofline and memory-hierarchy signals.

  • Experience with kernel scheduling/auto-tuning (RL, Bayesian/EA search) and hardware counters.

  • Background with custom accelerators/NPUs, DMA/tiling/SRAM management, or quantization (INT8/FP8).

  • Contributions to relevant OSS (links welcome).
