AI NVIDIA GPU Infrastructure Job - Remote

CozenTech is pleased to present an urgent job opportunity: AI NVIDIA Infrastructure Architect (Remote).

Why Apply Now?
Due to an immediate hiring need, qualified candidates who apply early will be fast-tracked through the hiring process. If your background and experience align with the role, we strongly encourage you to submit your resume promptly.

Job Title: AI NVIDIA Infrastructure Architect
Duration: 6 Months
Location: Remote, USA

Job Description:
This role focuses on managing and optimizing our AI infrastructure, ensuring seamless operations, and providing guidance and training to our team members.
The ideal candidate will have hands-on experience with AI operations, infrastructure management, and a strong understanding of high-performance computing (HPC) environments.
This position emphasizes operational excellence and team education rather than strategic development or workload definition.

Key Responsibilities:

  1. Manage and maintain AI infrastructure, ensuring high availability and performance.
  2. Implement and optimize AI operations using tools like NVIDIA Mission Control and Run:ai. NVIDIA Mission Control manages and monitors AI workloads running on NVIDIA systems, acting as a control center for AI projects. Run:ai organizes and shares GPU resources so that multiple users or teams can run AI jobs smoothly. Together, they make running, scaling, and managing AI workloads easier and more automated.
  3. Collaborate with cross-functional teams to support AI workloads and ensure efficient resource utilization.
  4. Provide training and mentorship to team members on AI infrastructure tools and best practices.
  5. Monitor system performance and troubleshoot issues to minimize downtime and optimize resource allocation.
  6. Assist in the deployment and scaling of AI models and applications.
  7. Stay updated with the latest advancements in AI infrastructure technologies and recommend improvements.
  8. Document processes, configurations, and best practices for AI infrastructure management.

Required Skills and Qualifications:

  1. Proven experience in managing AI infrastructure and operations.
  2. Proficiency with NVIDIA Mission Control/Bright Cluster Manager and Run:ai.
  3. Proficiency with Linux operating systems such as Ubuntu and RHEL.
  4. Strong understanding of high-performance computing (HPC) environments.
  5. Experience with cloud platforms and on-premises infrastructure.
  6. Excellent problem-solving skills and attention to detail.
  7. Ability to work collaboratively in a team environment and communicate effectively.
  8. Experience in training and mentoring technical teams.
  9. Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.

Preferred Qualifications:

  1. Experience with containerization technologies such as Docker and Kubernetes.
  2. Familiarity with AI frameworks and libraries (e.g., TensorFlow, PyTorch).
  3. Knowledge of network and storage solutions for AI workloads.
  4. Familiarity with job schedulers such as Slurm.

Best Regards

To apply, please send your resume to vignesh@cozentech.com.

Job Type: Contract
Job Location: California (CA)
