
Job description
Huawei Canada has an immediate permanent opening for a Software Engineer.
About the team:
The Distributed Data Storage and Management Lab leads research in distributed data systems, aiming to develop next-generation cloud serverless products that encompass core infrastructure and databases. This lab addresses various data challenges, including cloud-native disaggregated databases, pay-by-query user models, and optimizing low-level data transfers via RDMA. Teams within this lab create advanced cloud serverless data infrastructure and implement cutting-edge networking technologies for Huawei's global AI infrastructure.
About the job:
Architect and develop frameworks and engines for next-generation serverless computing tailored to AI workloads (LLM training/inference, agent execution, RL training, etc.).
Analyze and optimize end-to-end AI system performance, including distributed scheduling, data flow, and memory utilization across large clusters.
Research and evaluate cutting-edge technologies in distributed computing, serverless infrastructure, reinforcement learning, and LLM-based AI agents.
Collaborate cross-functionally with research, product, and platform teams to transform conceptual AI agent or RL research into scalable production systems.
Contribute thought leadership through innovation, technical presentations, and patent generation.
Stay ahead of industry trends, assessing emerging tools and frameworks (e.g., Ray, SkyPilot, vLLM, DeepSpeed, Mojo) to inform the team.
Job requirements
About the ideal candidate:
PhD with research background in LLM systems, RL, AI agents, or distributed computing, or MS in Computer Science, Electrical Engineering, or related field with 3–4 years of AI industry experience.
Strong system design and software engineering skills, including experience with C++ or Python, concurrency, performance tuning, and large-scale distributed systems.
Proven expertise in one or more of the following areas:
- AI system architecture — LLM training/inference pipeline optimization, multi-agent orchestration, or reinforcement learning frameworks.
- Serverless / distributed infrastructure — autoscaling, resource scheduling, fault recovery, or cloud-native microservices.
Ability to lead complex technical projects, mentor peers, and deliver solutions with measurable impact.
Publications, open-source contributions, or patents in AI systems, RL, or distributed computing are an asset.
Familiarity with GPU cluster management, model parallelism, or memory-optimized inference (e.g., KVCache, offloading strategies) is an asset.
Demonstrated ability to bridge research and engineering, bringing experimental AI methods into production-grade systems, is an asset.
