MIAA‑774 [updated]

All modalities share the same mechanism, allowing cross‑modal interactions (e.g., “Describe this sound in the context of the photo”).

OpenAI (the team behind MIAA‑774) also publishes accompanying documentation, including a Data Card detailing provenance, known limitations, and recommended usage policies.

| Trick | Why It Matters |
|---|---|
| | Starts with single‑modality batches, gradually ramps up multi‑modal mixes → stabilizes convergence. |
| Retrieval‑augmented pre‑training (RAG) | Each token can attend to a frozen index of 500 M external documents/images → improves factuality. |
| Contrastive multimodal loss | Forces paired modalities (image‑caption, video‑audio) to align in latent space, boosting zero‑shot performance. |
| Sparse‑attention windows | Reduces quadratic cost for long sequences (up to 64 K tokens), enabling full‑document or long‑video processing. |
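To make the contrastive multimodal loss row concrete, here is a minimal sketch of a symmetric InfoNCE-style loss of the kind popularized by CLIP. It is not MIAA‑774's actual training objective, which the source does not specify. The function name `contrastive_loss`, the `temperature` default of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct "class" for row i is column i,
    # i.e. the i-th image should match the i-th caption.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings.
    # image_emb[i] and text_emb[i] are assumed to describe the same example.
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy batch: pairs that agree versus pairs that are swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each embedding is closest to its own partner and large when partners are mismatched, which is the pressure that pulls paired modalities together in latent space. A production version would operate on batched tensors in an autodiff framework rather than on Python lists.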

Future studies will focus on: