Software Engineer – Serverless Distributed Computing & AI Systems

  • Markham, Ontario

Job description

Huawei Canada has an immediate permanent opening for a Software Engineer.

About the team: 

The Distributed Data Storage and Management Lab leads research in distributed data systems, developing next-generation cloud serverless products that span core infrastructure and databases. The lab tackles a range of data challenges, including cloud-native disaggregated databases, pay-by-query usage models, and the optimization of low-level data transfers via RDMA. Its teams build advanced cloud serverless data infrastructure and deploy cutting-edge networking technologies for Huawei's global AI infrastructure.


About the job:

  • Architect and develop frameworks and engines for next-generation serverless computing tailored to AI workloads (LLM training/inference, agent execution, RL training, etc.).

  • Analyze and optimize end-to-end AI system performance, including distributed scheduling, data flow, and memory utilization across large clusters.

  • Research and evaluate cutting-edge technologies in distributed computing, serverless infrastructure, reinforcement learning, and LLM-based AI agents.

  • Collaborate cross-functionally with research, product, and platform teams to transform conceptual AI agent or RL research into scalable production systems.

  • Contribute thought leadership through innovation, technical presentations, and patent generation.

  • Stay ahead of industry trends, assessing emerging tools and frameworks (e.g., Ray, SkyPilot, vLLM, DeepSpeed, Mojo) to inform the team.

Job requirements

About the ideal candidate:

  • PhD with a research background in LLM systems, RL, AI agents, or distributed computing; or an MS in Computer Science, Electrical Engineering, or a related field with 3–4 years of AI industry experience.

  • Strong system design and software engineering skills, including experience with C++ or Python, concurrency, performance tuning, and large-scale distributed systems.

  • Proven expertise in one or more of the following areas:

    ◦ AI system architecture — LLM training/inference pipeline optimization, multi-agent orchestration, or reinforcement learning frameworks.

    ◦ Serverless/distributed infrastructure — autoscaling, resource scheduling, fault recovery, or cloud-native microservices.

  • Ability to lead complex technical projects, mentor peers, and deliver solutions with measurable impact.

  • Publications, open-source contributions, or patents in AI systems, RL, or distributed computing are an asset.

  • Familiarity with GPU cluster management, model parallelism, or memory-optimized inference (e.g., KVCache, offloading strategies) is an asset.

  • Demonstrated ability to bridge research and engineering, bringing experimental AI methods into production-grade systems, is an asset.
