
# Compare LLM Providers

In this guide, we compare answers from different LLM providers on a text summarization task.

## Environment setup

We use the OpenAI API and the Cohere API:

```bash
pip install openai cohere
export OPENAI_API_KEY="sk-..."
export COHERE_API_KEY="..."
```
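Before moving on, it can help to confirm both keys are visible to your Python process. This optional check uses only the standard library:

```python
import os

# Optional sanity check: fail fast if either API key is missing
for var in ("OPENAI_API_KEY", "COHERE_API_KEY"):
    if not os.environ.get(var):
        raise EnvironmentError(f"{var} is not set; export it before running this guide")
```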

## Data preparation

We load a publicly available congressional bill summarization dataset from HuggingFace and sample ten bills:

```python
import pandas as pd
from datasets import load_dataset

# Load the BillSum California test split and sample 10 bills
billsum = load_dataset("billsum", split="ca_test")
billsum_df = pd.DataFrame(billsum).sample(10, random_state=278487)
```
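If you want a feel for the data before generating anything, an optional inspection like the one below shows a bill alongside its reference summary, and confirms that the bills are long, which is why we truncate them later:

```python
# Optional: peek at one sampled bill and its reference summary
row = billsum_df.iloc[0]
print(row.text[:500])
print("Reference summary:", row.summary)

# Bill lengths vary widely, which motivates the 3,000-character truncation below
print(billsum_df.text.str.len().describe())
```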

## LLM response generation

We use OpenAI and Cohere to generate summaries of these bills:

```python
from langchain.llms import OpenAI, Cohere
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Configure both models deterministically (temperature 0) with a 100-token cap
gpt3 = OpenAI(temperature=0.0, max_tokens=100)
command = Cohere(temperature=0.0, max_tokens=100)

prompt_template = PromptTemplate(
    input_variables=["text"],
    template="""
    You are an expert summarizer of text. A good summary
    captures the most important information in the text and doesn't focus too much on small details.
    Text: {text}
    Summary:
    """
)

gpt3_chain = LLMChain(llm=gpt3, prompt=prompt_template)
command_chain = LLMChain(llm=command, prompt=prompt_template)

# Generate summaries, truncating each bill to its first 3,000 characters
gpt3_summaries = [gpt3_chain.run(bill[:3000]) for bill in billsum_df.text]
command_summaries = [command_chain.run(bill[:3000]) for bill in billsum_df.text]
```
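Before scoring, it is worth spot-checking a pair of outputs. This optional snippet prints the first bill's summary from each provider side by side:

```python
# Optional: spot-check the first bill's summary from each provider
print("GPT-3:\n", gpt3_summaries[0].strip())
print("\nCommand:\n", command_summaries[0].strip())
```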

## Create test suite

For this test suite, we want to compare GPT-3 against Cohere's Command model. We will use the SummaryQuality scoring metric to A/B test each set of candidate responses against the reference summaries from the dataset.

```python
from arthur_bench.run.testsuite import TestSuite

my_suite = TestSuite(
    "congressional_bills",
    "summary_quality",
    input_text_list=list(billsum_df.text),
    reference_output_list=list(billsum_df.summary)
)
```
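Note that arthur-bench persists test suites after creation; if my understanding of its local storage is right, a suite created in an earlier session can be reloaded by name alone, without re-passing the data:

```python
# Hedged sketch: reload a previously created suite by name
# (assumes arthur-bench looks up saved suites by suite name and scoring method)
my_suite = TestSuite("congressional_bills", "summary_quality")
```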

## Run test suite

my_suite.run("gpt3_summaries", candidate_output_list=gpt3_summaries)
my_suite.run("command_summaries", candidate_output_list=command_summaries)

## View results

Run `bench` from your command line to visualize the run results and compare the two providers' summaries:
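```bash
bench
```

This should launch a local web UI where you can browse both runs and their SummaryQuality scores side by side.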