In today's AI-driven solutions, providing fast and accurate responses is essential. As users interact with models such as OpenAI's GPT and Meta's Llama, a recurring problem emerges, particularly in Retrieval Augmented Generation (RAG) architectures: we need an efficient storage solution where we can log not only the model's responses but also the retrieved context and any other metrics we may need for evaluation.
Now imagine a system that not only does this but also retrieves these interactions intelligently based on semantic similarity. This is where Azure Cosmos DB comes into play.
By leveraging Cosmos DB for semantic lookup of previously logged AI responses, we can enhance performance, reduce costs, and improve the overall user experience. This blog will delve into why integrating Cosmos DB is a best practice in solution design, especially when combined with the robustness and real-time capabilities of Azure Functions.
AI applications, particularly those involving conversational interfaces or chatbots, often face several challenges. The core problem is finding an efficient way to store and retrieve AI-generated responses that can scale with demand while maintaining high performance.
To address these challenges, we have delivered solutions for clients that combine Azure Functions with Cosmos DB for semantic lookup of previously generated responses.
This architecture ensures that the system can quickly serve responses by retrieving them from the database when appropriate, reducing latency and computational overhead.
Azure Functions enable us to build serverless applications that scale automatically based on demand. They act as the entry point for handling HTTP requests, processing logic, and returning responses.
Below is a basic example of an Azure Function in Python that handles HTTP requests:
import azure.functions as func
import logging
import json
import datetime
import uuid

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Received a request")
    try:
        # Parse the incoming request
        req_body = req.get_json()
        prompt = req_body.get('prompt')
        user_id = req_body.get('user_id')

        if not prompt or not user_id:
            logging.error("Prompt and user ID must be provided.")
            return func.HttpResponse(
                "Prompt and user ID must be provided.",
                status_code=400
            )

        # Initialise services and helpers (Create your own classes!)
        # For example:
        # embedding_service = EmbeddingService()
        # ai_model_api = AIModelApiService()
        # cosmos_helper = CosmosDBHelper(endpoint, key, database_name, container_name)
        # ...

        # Generate embedding for the prompt
        prompt_embedding = embedding_service.generate_embedding(prompt)

        # Look up past responses using semantic similarity
        existing_response = cosmos_helper.find_similar_response(
            user_id=user_id,
            prompt_embedding=prompt_embedding,
            similarity_threshold=0.95
        )

        if existing_response:
            # Return the existing response without calling the model again
            return func.HttpResponse(
                json.dumps({"answer": existing_response['response']}),
                status_code=200,
                mimetype="application/json"
            )

        # If no similar response is found, generate a new one
        response = ai_model_api.generate_response(prompt)
        response_text = response['choices'][0]['text'].strip()

        # Store the new prompt, response, and embedding for future lookups
        item = {
            "id": str(uuid.uuid4()),
            "user_id": user_id,
            "prompt": prompt,
            "response": response_text,
            "embedding": prompt_embedding,
            "timestamp": datetime.datetime.utcnow().isoformat()
        }
        cosmos_helper.store_item(item)

        # Return the newly generated response
        return func.HttpResponse(
            json.dumps({"answer": response_text}),
            status_code=200,
            mimetype="application/json"
        )

    except Exception as e:
        logging.error(f"Error processing request: {e}")
        return func.HttpResponse("An error occurred while processing the request.", status_code=500)
For each user prompt, generate a semantic embedding using a language model encoder. This embedding captures the contextual meaning of the prompt.
# Generate embedding for the prompt
prompt_embedding = embedding_service.generate_embedding(prompt)
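The embedding_service used above is one of the helper classes you would write yourself. As a rough sketch, an implementation backed by the Azure OpenAI embeddings API (via the openai Python package) could look like the following; the environment variable names, API version, and deployment name are assumptions for illustration, not part of the original example:

import os
from openai import AzureOpenAI  # assumes the openai>=1.0 package

class EmbeddingService:
    """Minimal sketch of an embedding helper backed by Azure OpenAI (assumed setup)."""

    def __init__(self, deployment: str = "text-embedding-ada-002"):
        # Endpoint, key, and deployment name are placeholders - use your own configuration.
        self.client = AzureOpenAI(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version="2024-02-01",
        )
        self.deployment = deployment

    def generate_embedding(self, text: str) -> list[float]:
        # Return the embedding vector for a single piece of text.
        response = self.client.embeddings.create(model=self.deployment, input=text)
        return response.data[0].embedding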
Store the prompt, response, and embedding in Cosmos DB for future retrieval:
# Create an item to store in Cosmos DB
item = {
    "id": str(uuid.uuid4()),
    "user_id": user_id,
    "prompt": prompt,
    "response": response_text,
    "embedding": prompt_embedding,
    "timestamp": datetime.datetime.utcnow().isoformat()
}

# Store the item in Cosmos DB
cosmos_helper.store_item(item)
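Similarly, cosmos_helper is your own wrapper around the azure-cosmos SDK. A minimal sketch of the storage side might look like the following, with store_item implemented as an upsert into the container; the find_similar_response method shown in the next step would sit on the same class:

from azure.cosmos import CosmosClient

class CosmosDBHelper:
    """Minimal sketch of a Cosmos DB wrapper for logged prompts, responses, and embeddings."""

    def __init__(self, endpoint: str, key: str, database_name: str, container_name: str):
        # Connect to the container that holds the logged interactions.
        client = CosmosClient(endpoint, credential=key)
        database = client.get_database_client(database_name)
        self.container = database.get_container_client(container_name)

    def store_item(self, item: dict) -> None:
        # Upsert so that writing an item with an existing id does not raise a conflict.
        self.container.upsert_item(item)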
Before generating a new response, check if a similar prompt has been processed before by computing the semantic similarity between the new prompt's embedding and those stored in Cosmos DB.
def find_similar_response(self, user_id, prompt_embedding, similarity_threshold=0.95):
    # Retrieve the user's previously logged prompts and embeddings
    query = "SELECT * FROM c WHERE c.user_id = @user_id"
    parameters = [{"name": "@user_id", "value": user_id}]
    items = list(self.container.query_items(
        query=query,
        parameters=parameters,
        enable_cross_partition_query=True
    ))

    # Find the stored item whose embedding is closest to the new prompt's embedding
    max_similarity = 0
    best_match = None
    for item in items:
        stored_embedding = item.get('embedding')
        similarity = compute_cosine_similarity(prompt_embedding, stored_embedding)
        if similarity > max_similarity:
            max_similarity = similarity
            best_match = item

    # Only treat it as a hit if the similarity clears the threshold
    if max_similarity >= similarity_threshold:
        return best_match
    return None
The cosine similarity function calculates the similarity between two embeddings:
import numpy as np

def compute_cosine_similarity(embedding1, embedding2):
    # Normalised dot product of the two embedding vectors
    embedding1 = np.array(embedding1)
    embedding2 = np.array(embedding2)
    return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
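As a quick sanity check on the thresholding: identical vectors score 1.0, orthogonal vectors score 0.0, and only near-duplicates clear the 0.95 threshold used above.

# Illustrative scores (values shown are approximate)
print(compute_cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0   - identical direction
print(compute_cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0   - unrelated
print(compute_cosine_similarity([1.0, 0.0], [1.0, 0.1]))  # ~0.995 - near-duplicate, above 0.95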
If no similar prompt exists, generate a new response using the AI Model API:
# Generate a new response using AI Model API
ai_response = ai_model_api.generate_response(prompt)
# Extract the generated text
final_answer = ai_response.get('choices')[0]['text'].strip()
# Store the new response in Cosmos DB for future use
item = {
    "id": str(uuid.uuid4()),
    "user_id": user_id,
    "prompt": prompt,
    "response": final_answer,
    "embedding": prompt_embedding,
    "timestamp": datetime.datetime.utcnow().isoformat()
}
cosmos_helper.store_item(item)
Finally, return the response to the user:
return func.HttpResponse(
    json.dumps({"answer": final_answer}),
    status_code=200,
    mimetype="application/json"
)
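To exercise the whole flow, you can run the Function App locally with Azure Functions Core Tools (func start) and post a prompt to it. The snippet below is a hypothetical client using the requests library; the route name ask and the default local port 7071 are assumptions, so adjust them to match your function's actual route:

import requests

# Assumed local endpoint: http://localhost:7071/api/<your-function-route>
url = "http://localhost:7071/api/ask"
payload = {"prompt": "What are the benefits of semantic caching?", "user_id": "user-123"}

response = requests.post(url, json=payload)
print(response.status_code)
print(response.json()["answer"])  # either a cached answer or a freshly generated one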
Leveraging Cosmos DB for semantic lookup of previously logged AI responses is a best practice that brings significant benefits to AI solutions. When combined with the robustness and real-time capabilities of Azure Functions, this solution design enhances performance, scalability, and user satisfaction. By intelligently storing and retrieving AI responses, you optimise resource usage and provide a smoother experience for users interacting with AI systems.
There are other methods and tools to consider for semantic caching, such as the open-source GPTCache library discussed below. Each of these alternatives has its own set of advantages and trade-offs; the best choice depends on factors such as the scale of your application, cost considerations, infrastructure preferences, and specific performance requirements.
Choosing the right caching solution is crucial for optimising the performance and cost-efficiency of AI applications utilising large language models. While Azure Cosmos DB offers a robust and scalable option for semantic caching, tools like GPTCache provide open-source alternatives that can be customised to fit unique project needs. By evaluating the features and benefits of each option, you can select the solution that best aligns with your application's goals and constraints.