gollm provides powerful tools for comparing responses from different LLM providers and models. This feature is particularly useful for evaluating model performance, choosing the best model for specific tasks, and understanding the strengths and weaknesses of different LLMs.
Here's a comprehensive guide on how to use the model comparison feature in gollm:
Define the prompt:

```go
promptText := `Generate information about a fictional person. Create a fictional person with the following attributes: name, age, occupation, city, country, favorite color, hobbies (1-5), education, pet name, and lucky number (1-100).
Ensure all fields are filled and adhere to the specified constraints. Return the data as a JSON object that adheres to this schema:
[Your JSON schema here]`
```
Define a validation function (if needed):
```go
validateComplexPerson := func(person ComplexPerson) error {
	if person.Age < 0 || person.Age > 150 {
		return fmt.Errorf("age must be between 0 and 150")
	}
	if len(person.Hobbies) < 1 || len(person.Hobbies) > 5 {
		return fmt.Errorf("number of hobbies must be between 1 and 5")
	}
	if person.LuckyNumber < 1 || person.LuckyNumber > 100 {
		return fmt.Errorf("lucky number must be between 1 and 100")
	}
	return nil
}
```
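To try the validator outside a full comparison run, here is a self-contained harness. The `ComplexPerson` struct is an assumed shape inferred from the prompt's attribute list, not a type defined by gollm; adjust its fields and JSON tags to match the schema you actually supply to the model.

```go
package main

import "fmt"

// ComplexPerson mirrors the attributes requested in the prompt.
// This shape is assumed for illustration only.
type ComplexPerson struct {
	Name          string   `json:"name"`
	Age           int      `json:"age"`
	Occupation    string   `json:"occupation"`
	City          string   `json:"city"`
	Country       string   `json:"country"`
	FavoriteColor string   `json:"favorite_color"`
	Hobbies       []string `json:"hobbies"`
	Education     string   `json:"education"`
	PetName       string   `json:"pet_name"`
	LuckyNumber   int      `json:"lucky_number"`
}

// validateComplexPerson enforces the constraints stated in the prompt.
func validateComplexPerson(person ComplexPerson) error {
	if person.Age < 0 || person.Age > 150 {
		return fmt.Errorf("age must be between 0 and 150")
	}
	if len(person.Hobbies) < 1 || len(person.Hobbies) > 5 {
		return fmt.Errorf("number of hobbies must be between 1 and 5")
	}
	if person.LuckyNumber < 1 || person.LuckyNumber > 100 {
		return fmt.Errorf("lucky number must be between 1 and 100")
	}
	return nil
}

func main() {
	valid := ComplexPerson{Name: "Ada", Age: 36, Hobbies: []string{"chess"}, LuckyNumber: 7}
	fmt.Println("valid person:", validateComplexPerson(valid)) // nil error

	invalid := ComplexPerson{Name: "Bob", Age: 200, Hobbies: []string{"golf"}, LuckyNumber: 7}
	fmt.Println("invalid person:", validateComplexPerson(invalid)) // age out of range
}
```

A validator like this lets the comparison retry a model whose output parses but violates the domain constraints, rather than silently accepting it.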
Analyze the comparison results:

```go
fmt.Println("\nFinal Analysis of Results:")
analysis := gollm.AnalyzeComparisonResults(results)
fmt.Println(analysis)
```
This comprehensive approach to model comparison allows you to:
- Compare responses from different providers and models simultaneously
- Evaluate model performance for specific tasks, including structured output generation
- Apply custom validation to ensure output quality across models
- Analyze success rates, attempt counts, and output consistency
- Make informed decisions about which model to use for different scenarios, based on detailed comparisons
The `AnalyzeComparisonResults` function provides insights into the differences between model responses, including:

- Success rates for each model
- Average number of attempts required
- Consistency of outputs across models
- Any notable differences or patterns in the responses
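The gollm result types behind this analysis are not shown in this excerpt. As a rough illustration of the kind of aggregation such an analysis performs, here is a self-contained sketch; `ComparisonResult` and `summarize` are hypothetical simplifications, not gollm's actual API.

```go
package main

import "fmt"

// ComparisonResult is a simplified, hypothetical stand-in for a
// per-model comparison outcome; the real gollm type may differ.
type ComparisonResult struct {
	Provider string
	Model    string
	Attempts int  // generation attempts before success or giving up
	Valid    bool // whether the final output passed validation
}

// summarize computes per-model success rates and average attempt
// counts, the kind of statistics a comparison analysis reports.
func summarize(results []ComparisonResult) map[string]string {
	type agg struct{ runs, successes, attempts int }
	byModel := map[string]*agg{}
	for _, r := range results {
		key := r.Provider + "/" + r.Model
		a, ok := byModel[key]
		if !ok {
			a = &agg{}
			byModel[key] = a
		}
		a.runs++
		a.attempts += r.Attempts
		if r.Valid {
			a.successes++
		}
	}
	out := map[string]string{}
	for key, a := range byModel {
		out[key] = fmt.Sprintf("success %d/%d, avg attempts %.1f",
			a.successes, a.runs, float64(a.attempts)/float64(a.runs))
	}
	return out
}

func main() {
	results := []ComparisonResult{
		{Provider: "openai", Model: "gpt-4o-mini", Attempts: 1, Valid: true},
		{Provider: "openai", Model: "gpt-4o-mini", Attempts: 2, Valid: true},
		{Provider: "anthropic", Model: "claude-3-haiku", Attempts: 3, Valid: false},
	}
	for model, summary := range summarize(results) {
		fmt.Println(model+":", summary)
	}
}
```

Grouping by provider/model pair like this makes it easy to spot a model that succeeds often but needs many retries versus one that succeeds on the first attempt.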
By leveraging these comparison tools, you can select the most appropriate model for each specific use case in your applications.