Machine Learning Engineer (Distributed Training)

CloudWalk

Full-time · Brazil · Posted 1 week ago
Tags: machine learning, engineer, architecture, kubernetes, statistics

Who we are:

CloudWalk is a fintech company reimagining the future of financial services. We are building intelligent infrastructure powered by AI, blockchain, and thoughtful design. Our products serve millions of entrepreneurs across Brazil and the US every day, helping them grow with tools that are fast, fair, and built for how business actually works. Learn more at cloudwalk.io.

Who We’re Looking For:

We’re looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You’ll work hands-on in our GPU cluster, helping researchers train and scale foundation models with frameworks such as Hugging Face Transformers, Accelerate, DeepSpeed, and FSDP. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.

This is not a research role - it’s about building and scaling the systems that let researchers move fast and models grow big. You’ll work closely with MLOps, infrastructure, and model developers to make our training runs efficient, resilient, and reproducible.
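
As a flavor of the work, here is a minimal sketch of the kind of training step this pipeline wraps, written with Hugging Face Accelerate. It is a generic illustration, not our actual pipeline: the gpt2 model, the toy dataset, and the hyperparameters are placeholders.

    import torch
    from accelerate import Accelerator
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    accelerator = Accelerator()  # picks up DDP/FSDP/DeepSpeed settings from `accelerate launch`

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
    tokenizer.pad_token = tokenizer.eos_token           # gpt2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Tiny stand-in dataset; a real pipeline would stream tokenized shards.
    enc = tokenizer(["hello world"] * 64, return_tensors="pt", padding=True)
    loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]), batch_size=8)

    # Accelerate wraps and shards these according to the launch configuration.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    model.train()
    for input_ids, attention_mask in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        accelerator.backward(out.loss)  # handles mixed-precision scaling and ZeRO hooks
        optimizer.step()
        optimizer.zero_grad()

The same script runs under plain DDP, FSDP, or DeepSpeed depending on the configuration passed to accelerate launch (for example, accelerate launch --num_processes 8 train.py).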

What You'll Do:

  • Own the architecture and maintenance of our distributed training pipeline;

  • Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate;

  • Design and debug multi-node/multi-GPU training runs (Kubernetes-based);

  • Optimize training performance: memory usage, speed, throughput, and cost;

  • Help manage experiment tracking, artifact storage, and resume logic (see the checkpoint/resume sketch after this list);

  • Build reusable, scalable training templates for internal use;

  • Collaborate with researchers to bring their training scripts into production shape.
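
As a rough illustration of that resume logic, here is a hypothetical checkpoint/resume pattern built on Accelerate’s state utilities. The tiny model, the path, and the save cadence are placeholders, not our production setup.

    import os
    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    model = torch.nn.Linear(16, 16)  # stand-in for an LLM
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model, optimizer = accelerator.prepare(model, optimizer)

    CKPT_DIR = "ckpt/latest"  # placeholder path

    # Resume if a previous run left state behind: load_state restores model,
    # optimizer, RNG, and grad-scaler state on every rank.
    if os.path.isdir(CKPT_DIR):
        accelerator.load_state(CKPT_DIR)

    for step in range(200):
        x = torch.randn(8, 16, device=accelerator.device)
        loss = model(x).pow(2).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        if (step + 1) % 100 == 0:
            accelerator.save_state(CKPT_DIR)  # every rank must reach this call together

A production version would also persist the global step and the dataloader position (for example via accelerator.register_for_checkpointing and accelerate’s skip_first_batches) so a preempted multi-node run restarts exactly where it left off.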

What We’re Looking For:

  • Expertise in distributed training: Experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups;

  • Strong PyTorch background: Comfortable writing custom training loops, schedulers, or callbacks;

  • Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it;

  • Infra literacy: You understand how GPUs, containers, and job schedulers work together. You can debug cluster issues, memory bottlenecks, or unexpected slowdowns (see the instrumentation sketch after this list);

  • Resilience mindset: You write code that can checkpoint, resume, log correctly, and keep running when things go wrong;

  • Collaborative builder: You don’t mind digging into other people’s scripts, making them robust, and helping everyone train faster.
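
For a sense of what chasing memory bottlenecks can look like, here is a minimal sketch of per-step GPU memory instrumentation using PyTorch’s built-in CUDA counters; one generic tactic, not a prescribed tool.

    import torch

    def log_peak_memory(step: int) -> None:
        """Print peak GPU memory since the last reset, then reset the counters."""
        if not torch.cuda.is_available():
            return
        peak_alloc = torch.cuda.max_memory_allocated() / 2**30
        peak_reserved = torch.cuda.max_memory_reserved() / 2**30
        print(f"step {step}: peak allocated {peak_alloc:.2f} GiB, "
              f"peak reserved {peak_reserved:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()

    # Called inside the training loop, e.g. every N steps:
    #     log_peak_memory(step)

Counters like these can be paired with torch.profiler traces or node-level monitoring to separate allocator reservation from genuine activation growth.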

Bonus Points:

  • Experience with Kubernetes-based GPU clusters and Ray;

  • Experience with experiment tracking (MLflow, W&B);

  • Familiarity with mixed precision, ZeRO stages, model parallelism;

  • Comfort with CLI tooling, profiling, logging, and telemetry;

  • Experience with dataloading bottlenecks and dataset streaming (see the streaming sketch below).
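
To illustrate the streaming point, here is a generic example using Hugging Face Datasets; the allenai/c4 corpus and the length transform are stand-ins, not our data or preprocessing.

    from datasets import load_dataset

    # streaming=True iterates examples lazily over the network instead of
    # materializing the whole corpus on disk first.
    stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
    stream = stream.shuffle(seed=42, buffer_size=10_000)  # approximate, buffered shuffle

    def add_length(example):
        return {"text_len": len(example["text"])}  # placeholder transform

    stream = stream.map(add_length)

    for i, example in enumerate(stream):  # smoke-test the first few rows
        print(example["text_len"])
        if i == 2:
            break

Streaming avoids stalling a multi-node job on an upfront download, at the cost of approximate shuffling through a bounded buffer.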

How We Hire:

  • Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)

  • Technical interview: deep dive into distributed training theory and reasoning (no code)

  • Cultural interview

  • If you are not willing to take an online quiz, do not apply.

If you’ve trained LLMs before - or helped others do it better - this role is for you. Even if you don’t check every box, we want to hear from you if you’re confident working with distributed compute and real-world LLM workloads.

Apply for this position