Senior Research Engineer – Multimodal & Video Foundation Model

Company: Tether.io
Location:
Job Description:

Overview

As a member of the AI model team, you will drive innovation in architecture development for cutting-edge models of various scales, including small, large, and multi-modal systems. Your work will enhance intelligence, improve efficiency, and introduce new capabilities to advance the field.

Responsibilities

  • Pioneer multimodal and video-centric research that moves quickly and breaks new ground, contributing directly to usable prototypes and scalable systems.
  • Design and implement novel AI architectures for multimodal language models, integrating text, visual, and audio modalities.
  • Engineer scalable training and inference pipelines optimized for large-scale multimodal datasets and distributed systems spanning thousands of GPUs.
  • Optimize systems and algorithms for efficient data processing, model execution, and pipeline throughput.
  • Build modular tools for preprocessing, analyzing, and managing multimodal data assets (e.g., images, video, text).
  • Collaborate cross-functionally with research and engineering teams to translate cutting-edge model innovations into production-grade solutions.
  • Prototype generative AI applications showcasing new capabilities of multimodal foundation models in real-world products.
  • Develop benchmarking tools to rigorously evaluate model performance across diverse multimodal tasks.

Qualifications

  • Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience.
  • Expertise in Python & PyTorch, including practical experience working with the full development pipeline from data processing & data loading to training, inference, and optimization.
  • Experience working with large-scale text data, or (bonus) interleaved data spanning audio, video, image, and/or text.
  • Direct hands-on experience developing or benchmarking at least one of the following: LLMs, vision-language models, audio-language models, or generative video models.

Nice-to-have skills

  • PhD in Computer Vision, Machine Learning, NLP, Computer Science, Applied Statistics, or a closely related field.
  • Demonstrated expertise in computer vision, video generation foundation models, and/or multimodal research.
  • First-author publications at leading AI conferences such as CVPR, ICCV, ECCV, ICML, ICLR, or NeurIPS.

Important information for candidates

Recruitment scams have become increasingly common. To protect yourself, please keep the following in mind when applying for roles:

  • Apply only through our official channels. We do not use third-party platforms or agencies for recruitment unless clearly stated. All open roles are listed on our official careers page: https://tether.recruitee.com/
  • Verify the recruiter’s identity. All our recruiters have verified LinkedIn profiles. If you’re unsure, you can confirm their identity by checking their profile or contacting us through our website.
  • Be cautious of unusual communication methods. We do not conduct interviews over WhatsApp, Telegram, or SMS. All communication is done through official company emails and platforms.
  • Double-check email addresses. All communication from us will come from email addresses ending in @tether.to or @tether.io.
  • We will never request payment or financial details. If someone asks for personal financial information or payment at any point during the hiring process, it is a scam. Please report it immediately.

Job details

  • Seniority level: Not Applicable
  • Employment type: Full-time
  • Job function: Information Technology
  • Industries: Technology, Information and Internet

Posted: April 11th, 2026