Choosing the right large language model depends on balancing cost, capability, and speed for your specific needs.
Every day, it seems like a company releases a new large language model. While many of these models offer the same core prompt-and-completion functionality, each model can vary massively in terms of capability, cost, and what use cases they have been tailored to.
Last year, OpenAI released GPT-4, Meta released Llama 2, and Anthropic released Claude 2; a year later, OpenAI is now on to GPT-4o, GPT-4o mini, and o1, while Anthropic has Claude 3.5 Sonnet and Haiku. That’s just the tip of the iceberg. Hundreds of models have been updated or released in the last few years, and OpenAI alone offers dozens.
With all that choice, how do you know what model to use?
A good rule of thumb when selecting a model for your use case is that most models represent a tradeoff between capability and cost. With OpenAI’s GPT 3.5 turbo, you can generate outputs as long as the seven books of the Harry Potter series for under $2. With the more powerful GPT 4 32K model, the same set of outputs would cost a cool $100. But the GPT 4 series of models does a much better job of following longer and more complicated instructions, and for some use cases, that can make all the difference. (Not to mention that with the recently released GPT 4o, even the highest-performing models are becoming shockingly affordable.)
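The cost side of that tradeoff is easy to estimate up front, since API pricing is quoted per 1,000 (or per million) tokens. Here’s a back-of-the-envelope sketch; the token count and per-token prices below are illustrative assumptions, not current rates, so check your provider’s pricing page before relying on them:

```python
# Back-of-the-envelope output-cost comparison. The prices and token count
# here are ASSUMPTIONS for illustration -- check the provider's current
# pricing page for real numbers.
def output_cost(tokens: int, price_per_1k: float) -> float:
    """Dollar cost of generating `tokens` output tokens at a per-1K price."""
    return tokens / 1000 * price_per_1k

TOKENS = 1_500_000  # rough scale of a very long output run (assumption)
cheap = output_cost(TOKENS, 0.002)   # a GPT 3.5 turbo-class price (assumed)
pricey = output_cost(TOKENS, 0.12)   # a GPT 4 32K-class price (assumed)
print(f"cheap model: ${cheap:.2f}, pricey model: ${pricey:.2f}")
```

Even with rough numbers, the two-orders-of-magnitude gap between model tiers shows up immediately.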
Let’s look at one example to understand the tradeoffs between these two models. You can use this same prompt to compare other models, too. Specifically, we are going to use a prompt to classify an excerpt of text from a legal document based on a prominent legal entity taxonomy – SALI’s Legal Matter Specification Standard (“LMSS”).
System Message:
ENTITIES:
Entity 1:
Entity Label: Document Type
Entity Definition: This is the type of legal document the text in the DOCUMENT TEXT section relates to.
Entity Data Type: Single select text value from the following values:
- Advisory Document
- Bankruptcy and Restructuring Document
- Legal Services Engagement Documents
- Litigation Document
- Project Management Document
- Transactional Document
Entity 2:
Entity Label: Area of Law
Entity Definition: This is the practice area of law that the text in the DOCUMENT TEXT section relates to.
Entity Data Type: Single select text value from the following values:
- Banking Law
- Bankruptcy and Restructuring Law
- Commercial and Trade Law
- Constitutional and Civil Rights Law
- Corporate Law
- Criminal Law
- Education Law
- Energy Law
- Environmental and Natural Resource Law
- Finance and Lending Law
- Food and Drug Law
- Gaming Law
- Health Law
- Information Security Law
- Insurance Law
- Intellectual Property Law
- Labor and Employment Law
- Municipal Law
- Personal and Family Law
- Personal Injury and Tort Law
- Public and Administrative Law
- Real Property Law
- Securities and Financial Instruments Law
- Tax and Revenue Law
- Telecommunications Law
- Transportation Law
ASSIGNMENT:
Perform the following steps to complete the assignment:
1. Carefully analyze the document text provided by the user.
2. Use the definition of the entity and the data type of the entity to output a series of key-value pairs in CSV table format, with the entity label in the first column and the value in the second column.
3. If you are not confident in what an entity is, provide "null" as the value in the key-value pair.
4. Only respond with a CSV table. Do not include any other text in your response.
User Message:
PREMISES AND TERM
3.1 Lease of Premises.
(a) Lease of Premises. Subject to and upon the terms and conditions set forth in this Lease, Landlord hereby leases to Tenant, and Tenant hereby leases from Landlord, the Premises, together with all rights and appurtenances thereto.
(b) Remeasurement of Building. Tenant has the right within six (6) months of delivery of the Premises to Tenant to remeasure the Building. Such remeasurement shall be made from and to the exterior surfaces of exterior walls of the Building, and the exterior surfaces of doors and windows, and shall not include any stairwells, elevator shafts, utility or janitorial closets or storerooms. If such remeasurement indicates square footage different from that set forth in Section 1.4, Building Square Footage shall be adjusted to reflect the corrected square footage, and the Tenant Improvement Allowance, the Annual Rent and the Monthly Rent shall be adjusted to reflect the corrected Building Square Footage as follows: Annual Rent shall be established by multiplying the newly determined Building Square Footage by the per square foot annual rent figure set forth in Section 1.9, Monthly Rent shall be determined by dividing the newly determined Annual Rent by twelve (12), and the Tenant Improvement Allowance shall be established by multiplying the newly determined Building Square Footage by the per square foot tenant improvement allowance figure set forth in Section 1.11.
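In practice, a prompt like this is sent as a system/user message pair. Here’s a minimal sketch of how the request might be assembled for OpenAI’s Chat Completions API; the abbreviated prompt strings are placeholders standing in for the full text above:

```python
# Sketch: assembling the classification request. SYSTEM_PROMPT and
# DOCUMENT_TEXT are abbreviated stand-ins for the full prompt shown above.
SYSTEM_PROMPT = "ENTITIES:\n...\nASSIGNMENT:\n..."
DOCUMENT_TEXT = "PREMISES AND TERM\n3.1 Lease of Premises.\n..."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": DOCUMENT_TEXT},
]

# With the openai package installed, the call would look something like:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo", messages=messages)
# print(response.choices[0].message.content)
```

Swapping models is then just a matter of changing the `model` parameter, which makes side-by-side comparisons like the one below easy to run.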
When we prompt GPT 3.5 turbo with the above, we receive the following completion:
Document Type,Lease Agreement
Area of Law,Real Property Law
Compare that to the output GPT-4 provides:
"Document Type","Transactional Document"
"Area of Law","Real Property Law"
At first blush, there isn’t a big difference between the two outputs. They both seem to have performed the task well. However, upon closer examination, you’ll notice a problem with GPT 3.5 turbo’s output. The model selected “Lease Agreement” as the type of document the text is related to… but that was not one of the options we gave it.
Instead, GPT 3.5 turbo seems to have forgotten that instruction and simply filled in the most likely document type. GPT 4, on the other hand, correctly followed our instructions and labeled the text with one of the options we gave it: “Transactional Document”. This might seem like a subtle difference, but depending on your use case and your need for accuracy, there may be times when the pricier model is the right choice.
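Whichever model you choose, a practical safeguard against this failure mode is to validate the returned labels against the allowed values instead of trusting the completion outright. A minimal sketch (the allowed-value sets are abbreviated here, and the parsing assumes the two-column CSV format the prompt requests):

```python
import csv
import io

# Allowed values from the taxonomy prompt (abbreviated for space).
ALLOWED = {
    "Document Type": {
        "Advisory Document", "Bankruptcy and Restructuring Document",
        "Legal Services Engagement Documents", "Litigation Document",
        "Project Management Document", "Transactional Document",
    },
    "Area of Law": {"Real Property Law", "Corporate Law"},  # ...truncated
}

def invalid_labels(completion_csv: str) -> list:
    """Return (entity, value) pairs whose value is not an allowed option."""
    rows = csv.reader(io.StringIO(completion_csv))
    return [(k, v) for k, v in rows
            if k in ALLOWED and v != "null" and v not in ALLOWED[k]]

# GPT 3.5 turbo's completion fails validation...
print(invalid_labels("Document Type,Lease Agreement\nArea of Law,Real Property Law"))
# ...while GPT-4's quoted CSV passes.
print(invalid_labels('"Document Type","Transactional Document"\n"Area of Law","Real Property Law"'))
```

A check like this lets you catch hallucinated labels and retry, fall back to a stronger model, or flag the document for human review.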
Another key factor to consider along the capability and cost continuum is that some models handle longer prompts and completions better than others. For example, OpenAI’s most powerful models have a context window of 128,000 tokens, which works out to roughly 300 double-spaced pages. In comparison, GPT 3.5 turbo’s 16K model can only handle about 48 double-spaced pages.
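You can sanity-check whether a document will fit a given context window before sending it. The tokens-per-word ratio below is a common rule of thumb for English text, not an exact count; use a real tokenizer (such as OpenAI’s tiktoken library) when precision matters:

```python
# Rough context-window fit check. The ~1.33 tokens-per-word ratio is a
# common heuristic for English text (an ASSUMPTION, not a tokenizer count).
TOKENS_PER_WORD = 1.33

def fits_in_context(text: str, context_window: int,
                    reserved_for_output: int = 1000) -> bool:
    """Estimate whether `text` plus room for the completion fits the window."""
    est_tokens = int(len(text.split()) * TOKENS_PER_WORD)
    return est_tokens + reserved_for_output <= context_window

doc = "word " * 50_000                 # a ~50,000-word document
print(fits_in_context(doc, 16_000))    # too big for a 16K-token model
print(fits_in_context(doc, 128_000))   # fits a 128K-token model
```

Note the check reserves headroom for the completion itself, since the context window covers the prompt and the output combined.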
The final factor to consider is speed. Cheaper, lightweight models like GPT 3.5 turbo, GPT 4o mini, and Claude 3 Haiku require less computational horsepower, which means they can deliver results more quickly. The same prompt that GPT 4o mini answers in 5 seconds may take GPT 4o 20 seconds or more. If you have a need for speed, choose your model wisely.
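Latency varies with prompt length, output length, and provider load, so it’s worth measuring on your own prompts rather than relying on rules of thumb. A small timing harness like this works with any client; the model call is stubbed out here so the sketch stays self-contained:

```python
import time

def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single model call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# In practice `fn` would wrap your API client -- e.g., a call to
# client.chat.completions.create(...). A stub keeps this runnable.
def fake_model(prompt: str) -> str:
    return "Document Type,Transactional Document"

result, elapsed = time_call(fake_model, "classify this lease excerpt")
print(f"{elapsed:.3f}s -> {result}")
```

Run the same harness against each candidate model with a representative prompt, and the capability/cost/speed tradeoff becomes concrete numbers instead of marketing copy.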