We’re hiring AI Engineers who love turning research into reliable, customer-facing products. You’ll architect and own the inference stack that powers our LLM-based APIs, RAG pipelines, and autonomous agents, then scale it to thousands of concurrent users.
What You’ll Do
Build and harden the core API layer that exposes large language models to internal and external clients (latency, auth, rate-limiting, multi-tenancy).
Design and ship robust RAG systems (chunking, embedding, retrieval, re-ranking) and agentic workflows (planning, tool-calling, reflection loops); a toy sketch of the retrieval loop follows this list.
Architect end-to-end ML pipelines that scale horizontally on AWS/GCP while keeping cost per token predictable.
Continuously evaluate new model families, fine-tuning techniques, and orchestration frameworks; run A/B tests to quantify real-world gains.
Optimize prompts, inference kernels, and serving stacks for throughput, latency, and memory footprint.
Write production-grade Python, extensive unit + integration tests, and clear design docs that any teammate can pick up.
Collaborate daily with product, design, and backend teams in a fast-moving, feedback-rich environment.
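For flavor, here's the shape of that retrieve-then-rerank loop. This is a deliberately toy sketch, not our production code: the hashed bag-of-words embedding and lexical re-rank stand in for a real embedding model and cross-encoder, and every function name here is illustrative.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-size word-window chunking; production chunks on structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: L2-normalized hashed bag-of-words."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Top-k chunks by cosine similarity to the query embedding."""
    q = embed(query)
    return sorted(chunks, key=lambda c: -sum(a * b for a, b in zip(q, embed(c))))[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Cheap lexical re-rank; production would use a cross-encoder."""
    q_toks = Counter(query.lower().split())
    return sorted(candidates, key=lambda c: -sum(q_toks[t] for t in set(c.lower().split())))

corpus = "Our serving layer caches embeddings so repeat queries skip the model entirely."
query = "how are embeddings cached?"
top_chunks = rerank(query, retrieve(query, chunk(corpus)))
```

In the real stack a vector index, caching layer, and learned re-ranker replace these stand-ins, but the pipeline shape is the same.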
What You’ll Bring
Must-haves
Hands-on experience with at least one LLM ecosystem: OpenAI APIs, LangChain, LlamaIndex, Vertex AI, or open-source models.
Comfortable designing REST/GraphQL APIs, containerizing with Docker, and deploying on Kubernetes (a minimal endpoint sketch follows this list).
Strong grasp of software engineering fundamentals: version control, testing, CI/CD, observability.
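To give a sense of scale for the API design line above, this is roughly the size of endpoint we mean. It's a minimal FastAPI sketch under our assumptions: the /v1/complete route, the CompletionRequest model, and the echo body are all made up for illustration.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

@app.post("/v1/complete")
async def complete(req: CompletionRequest) -> dict:
    # In production this sits behind auth, rate limiting, and
    # multi-tenant quota checks before reaching the model server.
    # The truncated echo below is a placeholder, not a real completion.
    return {"text": f"echo: {req.prompt[:req.max_tokens]}"}

# Run locally with: uvicorn app:app --reload
```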
Nice-to-haves
Experience with vector databases (Pinecone, Weaviate, Milvus) and embedding models.
Familiarity with inference optimization (vLLM, TensorRT-LLM, ONNX, quantization); a short vLLM example follows below.
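If inference optimization is new territory, the on-ramp is short. For example, vLLM's offline batch API looks roughly like this; the model id is a placeholder, and the API evolves, so treat this as a sketch and check the current vLLM docs.

```python
from vllm import LLM, SamplingParams

# Any Hugging Face model id works here; this one is a placeholder.
llm = LLM(model="your-org/your-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Batched generation: vLLM schedules and batches the prompts internally.
outputs = llm.generate(["Explain KV-cache paging in one sentence."], params)
print(outputs[0].outputs[0].text)
```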