Structuring LLM outputs | Best practices for legal prompt engineering

Craft prompts to control LLM output formats for seamless use in databases, spreadsheets, or downstream applications.

You can craft a prompt to control not only what a model outputs but also the format of that output. This is especially helpful if you plan to use the output downstream in other systems or prompts.

For example, you may need to extract data points from a contract and save those data points into a contract lifecycle database that only accepts data in a specific format (e.g. dates formatted like YYYY-MM-DD). Or you may want to save an output into a spreadsheet as comma-delimited entries. Or you may be trying to set up a chain of prompts – a topic covered in a later article – which requires being able to accurately predict the output of the first prompt for use in the second.  

Whatever the goal, here are a few tried-and-true methods to style your outputs:

Give examples

Probably the best way to control outputs is to provide the model with examples of how to format things in your prompt. For example, if you want to control the date format provided in an answer, you can write something like this:

System Message:

INSTRUCTIONS:
Examine the text provided by the user and answer the query below.
FORMAT:  
If the answer is a date, format the answer as YYYY-MM-DD. For example, August 11, 2023, should be “2023-08-11”.
QUERY:
What is the effective date of the contract?

User Message:

This agreement is between Acme and XYZ Inc and goes into force on June 11, 2024.  

Completion:

2024-06-11
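
If the completion is headed for a database that only accepts YYYY-MM-DD, it is worth checking the format before saving it. The following is a minimal sketch using Python's standard library; the function name `is_iso_date` is hypothetical and chosen only for illustration.

```python
from datetime import datetime


def is_iso_date(completion: str) -> bool:
    """Return True if the completion is exactly one YYYY-MM-DD date."""
    text = completion.strip()
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts unpadded values such as "2024-6-11",
    # so require the canonical zero-padded form.
    return parsed.strftime("%Y-%m-%d") == text


print(is_iso_date("2024-06-11"))     # the completion shown above
print(is_iso_date("June 11, 2024"))  # the unformatted date from the contract
```

A completion that fails the check can be sent back to the model or flagged for manual review instead of being written to the database.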

Specify the “Data Type”

In the “Use separators to divide prompts” article, we used exactly this tactic. We wanted a list of entities, and for each entity, we specified the entity “Label” as well as the “Data Type” we wanted in the response like so:

System Message:

Extract the specified ENTITIES from the DOCUMENT TEXT using the specified Labels and Data Type.


User Message:

DOCUMENT TEXT:
[DOCUMENT TEXT OMITTED]  

ENTITIES:  

Entity #1:
Label: Investor
Data Type: Text
Entity #2:
Label: Signer
Data Type: Text
Entity #3:
Label: Investment
Data Type: Number
Entity #4:
Label: Investment Date
Data Type: Date

Completion:

Entity #1: IOU Inc
Entity #2: Not provided in the document
Entity #3: $400,000
Entity #4: 12-31-2022

Common data types include text, number, date, and true/false. If you want to get really fancy, you can even throw in data types like “Multiple Choice (select one)” or “Checkbox”, and the model is likely to output the desired response.
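
Because each entity was requested with a data type, downstream code can convert each line of the completion accordingly. The sketch below assumes the completion follows the “Entity #N: value” layout shown above, that dates come back as MM-DD-YYYY as in the sample, and that missing values use the exact phrase from the sample. The helper names `coerce` and `parse_entities` are hypothetical.

```python
import re
from datetime import date, datetime

# Assumption: the model reports missing values with this exact phrase.
NOT_PROVIDED = "Not provided in the document"


def coerce(value: str, data_type: str):
    """Convert one raw completion value according to its requested data type."""
    value = value.strip()
    if value == NOT_PROVIDED:
        return None
    if data_type == "Number":
        # Drop currency symbols and thousands separators: "$400,000" -> "400000"
        return float(re.sub(r"[^\d.\-]", "", value))
    if data_type == "Date":
        # Assumption: dates arrive as MM-DD-YYYY, as in the sample completion.
        return datetime.strptime(value, "%m-%d-%Y").date()
    return value  # "Text" and any other type are kept as plain strings


def parse_entities(completion: str, data_types: list[str]) -> list:
    """Pair each 'Entity #N: value' line with its data type, in order."""
    values = []
    for line, data_type in zip(completion.strip().splitlines(), data_types):
        _, _, raw = line.partition(":")
        values.append(coerce(raw, data_type))
    return values


sample = """Entity #1: IOU Inc
Entity #2: Not provided in the document
Entity #3: $400,000
Entity #4: 12-31-2022"""

print(parse_entities(sample, ["Text", "Text", "Number", "Date"]))
```

If the model returns a value that does not match its data type, `coerce` raises a `ValueError`, which is usually preferable to silently storing bad data.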

File formats like CSVs

Because LLMs have been trained on lots of data structured in formats like CSV, JSON, and XML, you can reference those formats to control the structure of the completion. Let’s say you want to analyze text and list any companies it mentions. If you want that list to be something you can open in Excel, simply ask for CSV format in your system message:

System Message:

  1. Carefully read the text provided by the user.
  2. Identify any companies mentioned and list them in CSV format with each value enclosed in quotes. The first column should be labelled “Company” and contain the company name. The second column should be labelled “Address” and contain the company address.
  3. If you don’t know a value, say “unknown”.

User Message:

This is an agreement between Acme Co, a Delaware corporation located at 321 Main Street, North City, NY 11111, and Sand Dunes LLC, a New Mexico company, located at 777 Side Ave, South City, NM 22222.

Completion:

"Company","Address"
"Acme Co","321 Main Street, North City, NY 11111"
"Sand Dunes LLC","777 Side Ave, South City, NM 22222"

Now let’s try another format like JSON:

System Message:

  1. Carefully read the text provided by the user.
  2. Identify any companies mentioned and list them in a JSON array. For each company, provide a key-value pair for the company name labelled “Company” and provide a key-value pair for the company address labelled “Address”.
  3. If you don’t know a value, say “unknown”.

User Message:

This is an agreement between Acme Co, a Delaware corporation located at 321 Main Street, North City, NY 11111, and Sand Dunes LLC, a New Mexico company, located at 777 Side Ave, South City, NM 22222.

Completion:

[
 {
   "Company": "Acme Co",
   "Address": "321 Main Street, North City, NY 11111"
 },
 {
   "Company": "Sand Dunes LLC",
   "Address": "777 Side Ave, South City, NM 22222"
 }
]

By relying on common data formats like the above, you can get a much more reliable structure to your outputs.
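
One practical benefit is that standard parsers can read these completions directly. The sketch below parses the CSV and JSON completions shown above with Python's built-in `csv` and `json` modules and rejects anything that does not match the expected columns or keys. The function names are hypothetical, and the check assumes the model used exactly the labels “Company” and “Address”.

```python
import csv
import io
import json

EXPECTED_FIELDS = ["Company", "Address"]

CSV_COMPLETION = '''"Company","Address"
"Acme Co","321 Main Street, North City, NY 11111"
"Sand Dunes LLC","777 Side Ave, South City, NM 22222"
'''

JSON_COMPLETION = '''[
 {"Company": "Acme Co", "Address": "321 Main Street, North City, NY 11111"},
 {"Company": "Sand Dunes LLC", "Address": "777 Side Ave, South City, NM 22222"}
]'''


def parse_company_csv(completion: str) -> list[dict]:
    """Parse a CSV completion, checking the header row first."""
    reader = csv.DictReader(io.StringIO(completion.strip()))
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    return [dict(row) for row in reader]


def parse_company_json(completion: str) -> list[dict]:
    """Parse a JSON completion, checking that every item has the expected keys."""
    data = json.loads(completion)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array")
    for item in data:
        if not isinstance(item, dict) or set(item) != set(EXPECTED_FIELDS):
            raise ValueError(f"Unexpected item: {item!r}")
    return data


csv_rows = parse_company_csv(CSV_COMPLETION)
json_rows = parse_company_json(JSON_COMPLETION)
print(csv_rows == json_rows)  # both formats carry the same records
```

Quoting each CSV value, as the prompt requests, is what lets the parser keep the commas inside an address in a single column.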

Nothing but the answer

Sometimes the LLM can be “wordy” and include filler and context that you don’t necessarily want in your output. To strip this filler out of your completions, be explicit about what you want.  

For example, let’s say you write a prompt to help identify the tone of certain text:  

System Message:

Read the text and identify its tone. Is the text hostile, friendly, or neutral?

User Message:

I am writing to you with a sense of urgency and frustration, as my previous attempts to resolve this matter amicably have been met with nothing but silence and disregard from your end. This is not only unprofessional but also unacceptable.

Completion:

The tone of the text is hostile

To make sure the LLM leaves out filler like “The tone of the text is…”, just tweak your system message as follows:

System Message:

Read the text and identify its tone. Respond with only one of the three words: Hostile, Friendly, or Neutral

User Message:

I am writing to you with a sense of urgency and frustration, as my previous attempts to resolve this matter amicably have been met with nothing but silence and disregard from your end. This is not only unprofessional but also unacceptable.

Completion:

Hostile
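
Even with an explicit instruction, it is worth confirming that the completion really is one of the allowed words before passing it along. A minimal sketch, assuming the three labels from the prompt above; the function name `parse_tone` is hypothetical.

```python
ALLOWED_TONES = {"hostile", "friendly", "neutral"}


def parse_tone(completion: str) -> str:
    """Return the tone label in lowercase, or raise if the model added filler."""
    # Tolerate surrounding whitespace, a trailing period, and case differences.
    label = completion.strip().rstrip(".").lower()
    if label not in ALLOWED_TONES:
        raise ValueError(f"Unexpected completion: {completion!r}")
    return label


print(parse_tone("Hostile"))
```

A wordy completion such as “The tone of the text is hostile” raises an error here, which signals that the prompt needs tightening.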

Advanced – function calling and structured outputs

Very advanced users, particularly those who want to take LLM outputs and use them in other software applications, may want to investigate methods like OpenAI’s function calling or structured outputs. These techniques let you constrain the model’s output to a predefined “function” signature or “schema”, so the result is highly structured and consistent.
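
Schemas for these features are generally written as JSON Schema. As a rough illustration only, the sketch below defines a hypothetical schema for the company list used earlier and a hand-rolled check that a response matches it. The schema name and the wrapper key `companies` are assumptions made for this example, and how a schema is passed to a given provider depends on that provider's API, which is not shown here.

```python
import json

# Hypothetical schema for illustration; all names are assumptions.
COMPANY_LIST_SCHEMA = {
    "type": "object",
    "properties": {
        "companies": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Company": {"type": "string"},
                    "Address": {"type": "string"},
                },
                "required": ["Company", "Address"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["companies"],
    "additionalProperties": False,
}


def matches_company_schema(payload: str) -> bool:
    """Hand-rolled check mirroring the schema above (not a general validator)."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"companies"}:
        return False
    companies = data["companies"]
    if not isinstance(companies, list):
        return False
    return all(
        isinstance(c, dict)
        and set(c) == {"Company", "Address"}
        and all(isinstance(v, str) for v in c.values())
        for c in companies
    )


response = '{"companies": [{"Company": "Acme Co", "Address": "321 Main Street, North City, NY 11111"}]}'
print(matches_company_schema(response))
```

In practice a dedicated JSON Schema validation library would replace the hand-written check; it is spelled out here only to keep the example self-contained.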
