Models

You need to match your model's capabilities to your task.

Model Size

Model size (number of parameters) is directly related to general model performance. Choosing a larger model generally improves performance, but also increases GPU costs and latency.

  • Small Language Models (SLMs): Models with <10 billion parameters. They are great for basic tasks like classification, sentiment analysis, data extraction, and single function calling.

  • Large Language Models (LLMs): More capable models, often with >30 billion parameters. They offer state-of-the-art performance on complex tasks like multi-step agentic planning, large/infinite schema generation, codegen, and deep reasoning.

Most structured generation tasks work well with models of 10 billion parameters or fewer; the exceptions are agentic planning and codegen, which need models with 30 billion or more parameters.
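
For example, here is a minimal sketch of structured extraction with a small model served behind an OpenAI-compatible endpoint. The base URL, API key, model name, and json_schema response_format support are assumptions about your serving setup, not requirements of any specific model.

```python
# A minimal sketch of structured extraction with a small model behind an
# OpenAI-compatible server. The base URL, API key, and model name are
# placeholders, and "json_schema" support depends on your serving stack.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "name": "support_ticket",
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        },
        "required": ["category", "sentiment"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct",  # placeholder SLM
    messages=[{"role": "user", "content": "Classify this ticket: 'I was charged twice this month.'"}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(response.choices[0].message.content)  # e.g. {"category": "billing", "sentiment": "negative"}
```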

Model Version

"Instruct" or "Code" versions (e.g., Llama-3.1-70B-Instruct, Qwen3Coder) often perform better than "Base" or "Chat" versions (e.g., Llama-3.1-70B-Base) of an LLM.

While "Base" and "Chat" versions are general autocomplete engines, the "Instruct" and "Code" versions are specifically trained on following instructions and codegen, and thus, are more likely to follow strict syntax, indentation, brackets.

Task-specific models

Smaller models trained / fine-tuned for specific tasks or domains often match the performance of larger general-purpose models, and thus lower your latency and costs. Some of these models, like NanonetsOCR, are trained / fine-tuned to produce structured outputs out-of-the-box.

Examples include Qwen3Coder for codegen, NanonetsOCR for OCR / document processing, BioMistral-7B for healthcare, FinGPT V3 for finance, and Hermes 2 Pro for function calling. You can explore models to match your task on HuggingFace.
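
As an illustration, here is a minimal function-calling sketch against an OpenAI-compatible server (e.g., vLLM) serving a function-calling model; the endpoint, model name, and tool definition are assumptions made for the example.

```python
# A minimal sketch of single function calling. The server must expose
# OpenAI-compatible tool calling; endpoint, model, and tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Hermes-2-Pro-Llama-3-8B",  # example function-calling model
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured call: get_weather({"city": "Berlin"})
```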

Will your task benefit from a specific model paradigm?

Understand the different paradigms of LLMs, and you may find one that benefits your task:

If you have a PDF with tables, headers, and weird layouts, multimodality is non-negotiable. Pure text models (even GPT-5 class) fail when the layout has meaning. You should use Vision-Language Models (VLMs) like QwenVL or Nanonets-OCR.
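
A minimal sketch of sending a page image to a VLM through an OpenAI-compatible multimodal endpoint follows; the endpoint, model name, and file path are placeholders.

```python
# A minimal sketch: pass the rendered page image, not extracted text, so the
# model can use the layout. Endpoint, model name, and file path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("invoice_page_1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",  # example VLM
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table of line items from this page as JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```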

For step-by-step reasoning tasks (codegen, math problems, logic puzzles), you might need reasoning models that can separate "thinking" tokens from "output" tokens, like DeepSeek.

If you have 10 million user comments and you need to tag them into simple enums, you need a rote instruction follower. Google is practically giving Gemini 1.5 Flash away for free, or you can self-host with Llama 3.1 8B or Phi-4.
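
For the self-hosted route, here is a minimal enum-tagging sketch. It assumes a vLLM OpenAI-compatible server, whose guided_choice extension constrains the output to one of the labels; the endpoint, model name, and label set are illustrative.

```python
# A minimal sketch of enum tagging with a small self-hosted model.
# "guided_choice" is a vLLM-specific extension to the OpenAI API; the
# endpoint, model name, and label set are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
LABELS = ["praise", "complaint", "question", "spam"]

def tag(comment: str) -> str:
    response = client.chat.completions.create(
        model="Llama-3.1-8B-Instruct",  # example small model
        messages=[{"role": "user", "content": f"Tag this user comment: {comment}"}],
        extra_body={"guided_choice": LABELS},  # output is forced to be one label
    )
    return response.choices[0].message.content

print(tag("The app crashes every time I open settings."))  # -> "complaint"
```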

Sometimes the best way to generate structured data is to not generate it at all. Embedding models like OpenAI text-embedding-3, Cohere Embed, or Nomic convert text into embedding vectors. LLMs use the same mechanism internally: an embedding layer that converts tokens into token embeddings.

note

What are token embeddings?

Once text is tokenized by LLMs, each token needs to be represented in a way that captures not just the token but also its meaning and relation to other tokens. This is where embeddings come in. An embedding is a vector representation of a token, and it places each token into a high-dimensional space. Tokens with similar meanings are closer together in this space, allowing the model to understand semantic and syntactic similarities and differences.

If your task is strictly classification or retrieval (e.g., "Match user query to the closest FAQ ID"), using an embedding model is 100x cheaper and faster than asking an LLM to generate the ID.
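
A minimal sketch of that pattern with sentence-transformers is shown below; the model name and FAQ entries are illustrative.

```python
# A minimal sketch: match a user query to the closest FAQ ID with an
# embedding model and cosine similarity. No generation involved.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

faqs = {
    "FAQ-001": "How do I reset my password?",
    "FAQ-002": "How do I cancel my subscription?",
    "FAQ-003": "Where can I download my invoices?",
}
faq_ids = list(faqs.keys())
faq_embeddings = model.encode(list(faqs.values()), normalize_embeddings=True)

query = "I forgot my login credentials"
query_embedding = model.encode(query, normalize_embeddings=True)

# On normalized vectors, cosine similarity is just a dot product.
scores = util.cos_sim(query_embedding, faq_embeddings)[0]
print(faq_ids[int(scores.argmax())])  # -> "FAQ-001"
```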

Standard speech-to-text removes tone, emotion, and pauses. If your structured output relies on how something was said (e.g., "detect sarcasm," "identify speaker urgency," or "extract hesitation markers"), use audio-native models like GPT-4o Audio and Gemini 1.5 Pro, which process sound waves directly alongside text tokens.

Benchmarks

Consult open-source benchmarks, like BFCL for function calling or the IDP Leaderboard for OCR / document processing. Note, however, that most models train on the latest open-source benchmarks and are therefore overfit to them. Use benchmarks to shortlist the top 5-10 models, then evaluate all of them internally on your own task.