One of the most common questions I get from my students is how to structure the responses from LLMs to integrate them into pipelines. The answer is simple: use JSON schemas. It’s a simple and effective way to define what you want from the LLM, and most libraries will restrict the sampling process to only return the desired JSON output. Before we look into the examples below, there are a few things to keep in mind:
- Keep it simple: The simpler the schema, the more reliable the output. The more complex and restrictive the schema, the more likely the LLM will fail to return a valid response because of the restrictions you are applying to it.
- Use examples: If you can, provide examples of the expected output. This will help the LLM understand what you want and improve the chances of getting a valid response. In my experience, especially with enterprise use cases, newer LLMs are getting better at following the JSON schema without examples.
- Divide and conquer: If you have a complex schema (e.g. nested objects), consider breaking it down into smaller, simpler schemas and prompt the LLM several times to return each part separately. This will help the LLM focus on one thing at a time and improve the chances of getting a valid response.
- Prompt matters: The prompt you use to ask the LLM for a response is just as important as the schema. Make sure to provide clear instructions and context for what you want.
Most libraries will allow you to pass a JSON Schema as part of the prompt. If you’d like to try these online, Google AI Studio lets you test things for free: under its “Structured Output” option you can copy-paste a schema or visually define one using their form editor. This is a great way to see how the LLM responds to different schemas and refine your approach.
Google AI Studio Structured Output with JSON Schema
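If you prefer to work in code, most client libraries accept a schema directly. Below is a minimal sketch using the google-generativeai Python package; the API key, model name, prompt, and schema are placeholders I made up for illustration, and parameter names can shift between library versions, so treat this as a starting point rather than a definitive implementation.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A deliberately simple schema: one string field restricted to a few values.
schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "enum": ["yes", "no", "unsure"]}
    },
    "required": ["answer"],
}

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Does the following review recommend the product? ...",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema,
    ),
)
print(response.text)  # e.g. {"answer": "yes"}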
Categories, Multiple Choices, True/False, Classification
This use case, in my opinion, is the most common. I use classification in the most general sense here: a multiple choice question could be viewed as classifying the input into the correct answer. The schema below is a simple example of how to restrict the output to a specific set of values:
{
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "enum": ["sports", "politics", "entertainment", "technology"]
    }
  },
  "required": ["category"]
}
The sampler will be forced to return one of the options. More often than not, these restrictions are implemented at the sampling stage by disabling tokens that would violate the schema. In return, you are guaranteed to get the structure you want. For multiple choice questions, a common approach is to restrict the answers to A, B, C, D or True/False. It’s good to think about what the training data on the internet looks like. For example, multiple choice questions are often formatted as A, B, C, D in standardised exams. The LLMs are trained on this data and will often be more accurate when you follow these common conventions.
Check what the LLM returns without the schema, then adapt the schema to be closer to what the LLM returns by picking similar words or phrases.
The above tip is very important because you don’t want to reinvent how the LLM tries to speak. You can always map the output to your own esoteric categories later, as sketched below. This approach helps ease the constraints (e.g. the enum values) on the LLM by picking semantically similar outputs. At Semafind, we use this approach frequently to parse incoming text, such as classifying customer reviews into categories like priority, sentiment, and so on.
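As a concrete illustration, here is a small sketch of that mapping step in Python; the internal codes are made up for this example:

# The enum values are everyday words the model has seen countless times in
# training data; the internal codes below are hypothetical.
CATEGORY_MAP = {
    "sports": "CAT_SPT",
    "politics": "CAT_POL",
    "entertainment": "CAT_ENT",
    "technology": "CAT_TEC",
}

def to_internal_code(llm_output: dict) -> str:
    # llm_output is the parsed JSON from the schema above,
    # e.g. {"category": "sports"}.
    return CATEGORY_MAP[llm_output["category"]]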
Numbers and Optional Fields
The complementary part to string outputs is number fields. I often use these to give a score or confidence measure for the LLM. The schema below is a more robust way of handling sentiment classification:
{
  "type": "object",
  "properties": {
    "sentiment": {
      "type": "string",
      "description": "The classified sentiment.",
      "enum": ["positive", "negative", "neutral"]
    },
    "confidence_score": {
      "type": "number",
      "description": "The model's confidence in the classification (0.0 to 1.0).",
      "minimum": 0,
      "maximum": 1
    },
    "rationale": {
      "type": "string",
      "description": "Brief explanation or keywords supporting the sentiment."
    }
  },
  "required": ["sentiment", "confidence_score"]
}
We ask for a categorical sentiment, similar to the multiple choice example before, but also a confidence score. The rationale field is optional and can be used to provide additional information about the classification. This is useful for applications where you want more context or explanation alongside the label. It turns out LLMs are relatively good at giving some explanation for their outputs if you ask them to. However, take this with a grain of salt, as it doesn’t mean that’s how the LLM actually classified the text. How LLMs actually arrive at their outputs is still an open question under the umbrella of mechanistic interpretability research.
A sample output to the above schema with an appropriate prompt might be:
{
  "sentiment": "positive",
  "confidence_score": 0.85,
  "rationale": "The text expresses a favorable opinion about the product."
}
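Even with constrained sampling, it is worth validating the parsed output before acting on it. Below is a sketch using the jsonschema package; the 0.7 threshold and the review label are arbitrary choices for illustration:

import json

from jsonschema import validate

def classify_sentiment(raw_response: str, sentiment_schema: dict) -> str:
    """Parse and validate the model's output, then gate on confidence."""
    result = json.loads(raw_response)
    # Raises jsonschema.exceptions.ValidationError if the output drifts from
    # the schema (e.g. a missing field or an out-of-range score).
    validate(instance=result, schema=sentiment_schema)
    # 0.7 is an arbitrary threshold for illustration; tune it to your data.
    if result["confidence_score"] < 0.7:
        return "needs_human_review"
    return result["sentiment"]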
Arrays, Lists, and Bullet Points
When asking the LLM to list something, we use arrays to create a bullet point list. This works really well in practice, as opposed to hoping the LLM will generate parsable bullet points. If the prompt asks for key points, or uses the phrase “bullet points”, the schema often gets used correctly. The example below is a simple schema that asks the LLM to return a summary and key points:
{
  "type": "object",
  "properties": {
    "summary_text": {
      "description": "A concise paragraph summarizing the input.",
      "type": "string",
      "minLength": 20
    },
    "key_points": {
      "description": "A list of the most important points or takeaways.",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "maxItems": 5
    }
  },
  "required": ["summary_text", "key_points"]
}
The key_points field is an array of strings, which allows the LLM to return a list of items. This is useful for applications where you want a summary or key points from a larger text. The minItems and maxItems constraints ensure the list of key points stays within a reasonable range and stop the LLM from rambling on. We tried this recently at Semafind with a news intelligence platform prototype, and it worked really well, giving factual key points from a contextually augmented news article.
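If you prefer not to hand-write schemas, many Python client libraries also accept Pydantic models. Here is a sketch of the same summary schema, assuming Pydantic v2, where model_json_schema() produces an equivalent JSON Schema you can pass to the LLM:

from pydantic import BaseModel, Field

class Summary(BaseModel):
    summary_text: str = Field(
        min_length=20,
        description="A concise paragraph summarizing the input.",
    )
    key_points: list[str] = Field(
        min_length=1,  # maps to minItems in the generated JSON Schema
        max_length=5,  # maps to maxItems
        description="A list of the most important points or takeaways.",
    )

print(Summary.model_json_schema())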
Nested Fields
Finally, it can also be useful to further structure your output. This approach is common for scraping or extracting information from a larger text. For example, the below schema extracts two descriptions from a given text:
{
  "type": "object",
  "properties": {
    "description": {
      "type": "object",
      "properties": {
        "short": {
          "type": "string",
          "maxLength": 100
        },
        "long": {
          "type": "string",
          "minLength": 100,
          "maxLength": 500
        }
      },
      "required": ["short", "long"]
    },
    "title": {
      "type": "string",
      "maxLength": 50
    }
  },
  "required": ["description", "title"]
}
These are handy for showing long and short form content. The description field is an object itself with two properties, short and long, which allows you to provide different levels of detail. Notice the restrictions on the length of the strings. The more complex the schema gets, the more you either need a more powerful language model or, as I mentioned earlier in the post, need to break it down into smaller parts.
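To illustrate that divide-and-conquer advice, here is a sketch that fills the nested schema with two simpler calls instead of one; generate() is a hypothetical helper standing in for whichever LLM client you use, and the prompts are placeholders:

# Hypothetical helper: sends a prompt plus a JSON schema to your LLM client
# of choice and returns the parsed JSON object.
def generate(prompt: str, schema: dict) -> dict:
    raise NotImplementedError("wire this to your LLM client")

SHORT_SCHEMA = {
    "type": "object",
    "properties": {"short": {"type": "string", "maxLength": 100}},
    "required": ["short"],
}

LONG_SCHEMA = {
    "type": "object",
    "properties": {"long": {"type": "string", "minLength": 100, "maxLength": 500}},
    "required": ["long"],
}

def build_description(text: str) -> dict:
    # Two focused prompts are often easier for the model than one nested schema.
    short_part = generate(f"Describe this in under 100 characters:\n{text}", SHORT_SCHEMA)
    long_part = generate(f"Describe this in a detailed paragraph:\n{text}", LONG_SCHEMA)
    # Assemble the nested structure ourselves instead of asking for it directly.
    return {"description": {"short": short_part["short"], "long": long_part["long"]}}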
Conclusion
JSON schemas are a powerful way to structure the output of LLMs, but there is a limit after which extra constraints start to degrade the LLM’s performance. The examples above are a good starting point for structuring different types of outputs. You can mix and match them to create the schema you need. Don’t forget to test things before pushing anything out to production!