Member of Technical Staff - Infrastructure & Systems for Large Models

    • Markham, Ontario
    • Montreal, Quebec
    • Edmonton, Alberta
    +2 more
  • ak6ad

Job description

Huawei Canada has an immediate 12-month contract opening for a Member of Technical Staff.

About the team:

Founded in 2012, the Noah’s Ark Lab has evolved into a prominent research organization with notable achievements in academia and industry. The lab’s mission focuses on advancing artificial intelligence and related fields to benefit the company and society. Driven by impactful, long-term projects, the lab aims to advance state-of-the-art research in areas including LLMs, RL, NLP, computer vision, AI theory, and autonomous driving, while integrating innovations into the company's products and services.

About the job:

  • Join a team that maintains the core infrastructure powering large-scale AI training.

  • Contribute to data loading, training workflows, and checkpointing systems for distributed model training.

  • Help improve tools that manage training jobs across compute clusters (e.g., GPUs, TPUs, multi-node setups).

  • Work on monitoring and logging tools to make long-running jobs reliable and observable.

  • Support optimization efforts (e.g., mixed precision, sharding) to make model training faster and more efficient.

  • Collaborate closely with machine learning engineers and researchers on new training methods and experiments.

  • Learn to scale systems, debug complex workloads, and make training pipelines reproducible.

  • Be part of a team that bridges research and infrastructure to accelerate AI development.

Job requirements

About the ideal candidate:

  • 1–2 years of software engineering experience.

  • Proficient in Python, with basic experience in backend or infrastructure development.

  • Familiarity with ML frameworks like PyTorch or TensorFlow.

  • Some exposure to distributed systems, training jobs, or cloud computing is a plus.

  • Comfortable using Linux, containers (e.g., Docker), and command-line tools.

  • Understanding of software engineering best practices (e.g., testing, version control).

  • Eager to learn about large-scale ML systems and infrastructure design.

  • Strong communication and collaboration skills; enjoys working with cross-functional teams.
