
Streamline Your RAG Pipeline Evaluation with Synthetic Data

Imagine slashing your data labeling time by 90% while improving the accuracy of your RAG pipeline evaluations. Sounds like a dream? With the RAGAs framework and Azure OpenAI, this dream becomes a reality.

Evaluating RAG (Retrieval-Augmented Generation) pipelines is crucial, but creating QA (questions and answers) samples manually is time-consuming. Synthetic data generation can reduce developer time by 90%. In this blog, I'll guide you through generating synthetic data using the RAGAs framework with Azure OpenAI, streamlining your RAG pipeline evaluation process.

The Challenge: Why Manual QA (Questions and Answers) Sampling is a Pain

Building and deploying a RAG pipeline is just the start. The real challenge is checking its performance accurately and efficiently. Manually creating QA samples for evaluation is not only time-consuming but also inconsistent. How can we ensure our RAG pipeline works well in a real-world setting without spending endless hours on manual data labeling?

Automate Your QA Sample Creation and Save Time

Imagine if you could reduce the time spent on data labeling by 90%. What if there was a way to generate synthetic data that provides a robust and comprehensive evaluation of your RAG pipeline? This is where the RAGAs (Retrieval-Augmented Generation Assessment) framework comes into play, changing the way we approach RAG evaluation.

The Struggle of Manual Data Labeling

Creating QA samples manually is slow and inefficient. It involves repetitive tasks that waste resources and delay your RAG system’s deployment. Moreover, manual sampling often misses the variety and complexity needed to thoroughly test the pipeline, leading to potential gaps in evaluation.

Plus, if you're working with old documents, finding experts to create accurate QA pairs can be tough. This lack of expertise can make it even harder to ensure your RAG pipeline is reliable and high-quality.

How to Do It: Step-by-Step Guide to Generating Synthetic Data

Step 1: Make sure you have the necessary libraries imported:

Import Libraries

import os
from dotenv import load_dotenv
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from langchain_openai.chat_models import AzureChatOpenAI
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from langchain_community.document_loaders import WebBaseLoader

Step 2: Configure your environment variables for Azure services:

 

Set Environment Variables

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = os.getenv("AZURE_ENDPOINT")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")

Step 3: Next, load your documents. Here’s an example of loading content from the Advancing Analytics website:

Load Documents

loader = WebBaseLoader("https://www.advancinganalytics.co.uk/")
documents = loader.load()

# The RAGAs test set generator expects a 'filename' key in each document's metadata,
# so copy it across from the loader's 'source' field
for document in documents:
    document.metadata['filename'] = document.metadata['source']

Step 4: We will need three models: a generator model for generating QA pairs, an embedding model for retrieving and generating context, and a critic model for validating the generation process.

 

Define Models

# Configuration dictionary for Azure services
azure_configs = {
    "azure_endpoint": os.getenv("AZURE_ENDPOINT"),
    "critic_llm_deployment": "gpt-4o",
    "critic_llm_name": "gpt-4o",
    "generator_llm_deployment": "gpt-4-32k",
    "generator_llm_name": "gpt-4-32k",
    "embedding_deployment": "text-embedding-ada-002",
    "embedding_name": "text-embedding-ada-002",
}

# Initialize the critic LLM (used to validate the generated QA pairs) with AzureChatOpenAI
critic_llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_configs["azure_endpoint"],
    azure_deployment=azure_configs["critic_llm_deployment"],
    model=azure_configs["critic_llm_name"],
    validate_base_url=False,
)

# Initialize the generator LLM (used to generate the QA pairs) with AzureChatOpenAI
generator_llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_configs["azure_endpoint"],
    azure_deployment=azure_configs["generator_llm_deployment"],
    model=azure_configs["generator_llm_name"],
    validate_base_url=False,
)

# Initialize the Azure OpenAI embeddings used by the test set generator for context retrieval
# (also reusable later for metrics such as answer_relevancy, answer_correctness and answer_similarity)
azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_configs["azure_endpoint"],
    azure_deployment=azure_configs["embedding_deployment"],
    model=azure_configs["embedding_name"],
)
 

Step 5: Use the Azure OpenAI models to create a synthetic test set from the loaded documents.

The RAGAs test set generator incorporates all three models and allows you to specify the distribution of the generated dataset.

For the purpose of this blog, I have set the test size to 5, but this can be adjusted as needed.

You can set how many examples should be simple Q&A, how many should involve reasoning, how many should be multi-context, and how many should be conditional. I have used the following distribution: simple at 50%, reasoning at 20%, multi_context at 20%, and conditional at 10%.

  • Simple: As the name implies, a simple, straightforward question.
  • Reasoning: The question is rewritten so that answering it effectively requires reasoning.
  • Conditional: The question is modified to introduce a conditional element, which adds complexity.
  • Multi-Context: The question is rephrased so that answering it requires information from multiple related sections or chunks.

Generate Testset

# Build the test set generator from the generator, critic and embedding models defined above
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    azure_embeddings
)

# Generate 5 synthetic samples with the chosen distribution of question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=5,
    distributions={simple: 0.5, reasoning: 0.2, multi_context: 0.2, conditional: 0.1}
)

Step 6: Once you have created a test set, load it into a DataFrame for analysis:

Save Dataframe

df = testset.to_pandas()
df.to_csv('testset.csv', index=False)


This DataFrame contains questions and answers about Advancing Analytics, along with some additional information. Here’s a breakdown of what each column represents:

  • Question: Lists questions like “What role does Advancing Analytics play in the industry?” and “What recognition has Advancing Analytics received?”
  • Contexts: Provides the background information or passages used to answer the questions.
  • Ground Truth: Contains the verified answers to the questions.
  • Evolution Type: Describes the type of reasoning needed to answer each question, such as “simple,” “reasoning,” “multi_context,” or “conditional.”
  • Metadata: Includes extra details like source URLs for further reading.
  • Episode Done: Shows whether the task related to each question is finished, marked as “True.”
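
As a quick sanity check, you can inspect the DataFrame directly. The snippet below is a small optional addition; it simply assumes the df from Step 6 and the column names listed above:

Inspect Testset

# Optional: quick look at the generated samples (uses the DataFrame from Step 6)
print(df.columns.tolist())                   # question, contexts, ground_truth, evolution_type, ...
print(df["evolution_type"].value_counts())   # mix of simple / reasoning / multi_context / conditional
print(df[["question", "ground_truth"]].head())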

One of the questions that I find particularly interesting comes from the multi-context evolution type. Here’s the question and answer:

Question: What services does Advancing Analytics offer in Data Governance and Security, and how do they align with partnerships like Microsoft and Databricks?

Answer: The context does not provide specific information on the services Advancing Analytics offers in Data Governance and Security, or how these services align with their partnerships with Microsoft and Databricks.

This is indeed true, as we are still in the process of gathering comprehensive information for our new website!

The Benefits: Why RAGAs is a Game Changer

⏱️ Save Time

By using the RAGAs framework, you can save time and resources, reducing the manual effort required for data labeling by 90%.

🎯 Improve Accuracy

RAGAs ensures a thorough and precise evaluation by generating varied and complex questions. This evolutionary approach, inspired by Evol-Instruct, leads to more robust assessments.

📊 Get Comprehensive Insights

RAGAs creates a diverse set of QA pairs, incorporating simple, reasoning, multi-context, and conditional questions. This ensures that your RAG pipeline is tested across a wide range of scenarios, identifying potential weaknesses and areas for improvement.
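
To illustrate where the synthetic test set fits in, here is a minimal sketch of running RAGAs' built-in metrics over it. This is not part of the generation steps above: it assumes the DataFrame from Step 6, reuses the critic and embedding models from Step 4, and relies on a hypothetical my_rag_pipeline(question) function standing in for your own retrieval and generation code. The metric selection and column names are illustrative.

Evaluate with the Testset (Sketch)

# Minimal sketch: score a RAG pipeline against the synthetic test set.
# `my_rag_pipeline` is a hypothetical placeholder for your own retrieval + generation step.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for _, row in df.iterrows():
    answer, contexts = my_rag_pipeline(row["question"])  # -> (str, list[str])
    records["question"].append(row["question"])
    records["answer"].append(answer)
    records["contexts"].append(contexts)
    records["ground_truth"].append(row["ground_truth"])

results = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=critic_llm,               # reuse the critic model as the evaluation judge
    embeddings=azure_embeddings,  # reuse the Azure embeddings
)
print(results)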

 

Conclusion: Efficiently Evaluate Your RAG Pipeline

By automating the generation of synthetic data, you can save time, improve accuracy, and provide comprehensive evaluations that guide your development process. For more details on the RAGAs framework, check out the official documentation.

While some time is still required to validate the generated questions and answers, this approach significantly reduces the effort compared with creating everything from scratch. This streamlined process allows us to quickly iterate and refine our models, ensuring they meet our high standards of quality and performance.

By following the steps outlined above, you can efficiently evaluate your RAG pipeline using synthetic data, saving time and ensuring a thorough assessment. Happy evaluating!

 

Author: Gavita Regunath