Introduction
In today's AI-driven solutions, fast and accurate responses matter. As users interact with AI models such as OpenAI's GPT and Meta's Llama, a common challenge emerges, particularly in Retrieval Augmented Generation (RAG) architectures: we need an efficient storage solution that logs not only the models' responses but also the retrieved context and any other metrics required for evaluation.
Now imagine a system that not only logs these interactions but also retrieves them intelligently based on semantic similarity. This is where Azure Cosmos DB comes into play.
By leveraging Cosmos DB for semantic lookup of previously logged AI responses, we can enhance performance, reduce costs, and improve the overall user experience. This blog will delve into why integrating Cosmos DB is a best practice in solution design, especially when combined with the robustness and real-time capabilities of Azure Functions.
Problem Definition
AI applications, particularly those involving conversational interfaces or chatbots, often face several challenges:
- Redundant Processing: Generating responses for similar or identical queries repeatedly consumes resources unnecessarily.
- Latency Issues: Real-time applications require quick response times, which are difficult to maintain when every query triggers a fresh call to the model.
- Scalability Constraints: As the user base grows, the system must handle increased loads without degrading performance.
- Inefficient Data Retrieval: Without an intelligent retrieval system, accessing past responses could become slow and resource-intensive.
The core problem is finding an efficient way to store and retrieve AI-generated responses that can scale with demand while maintaining high performance.
High-Level Solution
To address these challenges, we have delivered the following solutions for clients:
- Semantic Logging with Cosmos DB: Store AI responses along with their semantic embeddings in Cosmos DB. This allows for quick retrieval of semantically similar responses without redundant requests.
- Integration with Azure Functions: Utilise Azure Functions to handle incoming requests and manage interactions between the AI models and Cosmos DB. Azure Functions provide a scalable and robust environment ideal for real-time applications.
- Semantic Similarity Search: Implement a mechanism to compute the semantic similarity between new queries and stored responses using cosine similarity on embeddings.
This architecture ensures that the system can quickly serve responses by retrieving them from the database when appropriate, thus reducing latency and computational overhead.
Architecture Overview
- Azure Functions: Act as the main orchestrator, handling HTTP requests, processing logic, and returning responses.
- AI Model API: Generates responses to user prompts when no similar past response is found.
- Azure Cosmos DB: Stores user prompts, AI responses, and their embeddings for future semantic searches.
- Semantic Search Logic: Utilises embeddings and cosine similarity to find relevant past responses efficiently.
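To make the request flow concrete, here is a minimal sketch of how a client might call such a function over HTTP. The function URL is a placeholder, and the payload fields and JSON response shape simply mirror the implementation walked through below.

import requests

# Placeholder URL for the deployed Azure Function (adjust to your own function app)
FUNCTION_URL = "https://<your-function-app>.azurewebsites.net/api/semantic-lookup"

payload = {"prompt": "Summarise our returns policy.", "user_id": "user-123"}
response = requests.post(FUNCTION_URL, json=payload)
print(response.status_code, response.json())  # e.g. 200 {"answer": "..."}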
Implementation Details
1. Setting Up Azure Functions
Azure Functions enable us to build serverless applications that scale automatically based on demand. They act as the entry point for handling HTTP requests, processing logic, and returning responses.
Below is a basic example of an Azure Function in Python that handles HTTP requests:
import azure.functions as func
import logging
import datetime
import json
import uuid


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Received a request")

    try:
        # Parse the incoming request
        req_body = req.get_json()
    except ValueError:
        return func.HttpResponse("Request body must be valid JSON.", status_code=400)

    prompt = req_body.get('prompt')
    user_id = req_body.get('user_id')

    if not prompt or not user_id:
        logging.error("Prompt and user ID must be provided.")
        return func.HttpResponse(
            "Prompt and user ID must be provided.",
            status_code=400
        )

    # Initialise services and helpers (Create your own classes!)
    # For example:
    # embedding_service = EmbeddingService()
    # ai_model_api = AIModelApiService()
    # cosmos_helper = CosmosDBHelper(endpoint, key, database_name, container_name)
    # ...

    # Generate embedding for the prompt
    prompt_embedding = embedding_service.generate_embedding(prompt)

    # Look up past responses using semantic similarity
    existing_response = cosmos_helper.find_similar_response(
        user_id=user_id,
        prompt_embedding=prompt_embedding,
        similarity_threshold=0.95
    )

    if existing_response:
        # Return the existing response without calling the model again
        return func.HttpResponse(
            json.dumps({"answer": existing_response['response']}),
            status_code=200,
            mimetype="application/json"
        )

    # If no similar response is found, generate a new one
    response = ai_model_api.generate_response(prompt)
    response_text = response['choices'][0]['text'].strip()

    # Storing the new response and returning it to the caller is covered in the following sections.
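The helper classes referenced above are left for you to implement. As a starting point, here is a minimal sketch of what a CosmosDBHelper might look like using the azure-cosmos SDK; the class name, constructor arguments, and method names are assumptions carried over from the comments in the snippet, not a prescribed design.

from azure.cosmos import CosmosClient

class CosmosDBHelper:
    """Minimal wrapper around a Cosmos DB container (sketch, not production code)."""

    def __init__(self, endpoint, key, database_name, container_name):
        # Connect to the account and get a handle to the target container
        client = CosmosClient(endpoint, credential=key)
        database = client.get_database_client(database_name)
        self.container = database.get_container_client(container_name)

    def store_item(self, item):
        # Persist a prompt/response/embedding document; 'id' must be set by the caller
        self.container.create_item(body=item)

The find_similar_response method shown in section 3 below would sit on this same class.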
2. Generating and Storing Embeddings
For each user prompt, generate a semantic embedding using a language model encoder. This embedding captures the contextual meaning of the prompt.
# Generate embedding for the prompt
prompt_embedding = embedding_service.generate_embedding(prompt)
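The blog does not prescribe a particular encoder. As an illustration, here is a sketch of an EmbeddingService backed by an OpenAI-compatible embeddings endpoint; the model name and client configuration are assumptions, and any encoder that returns a vector of floats would work.

from openai import OpenAI

class EmbeddingService:
    """Sketch of an embedding generator backed by an OpenAI-compatible API."""

    def __init__(self, model="text-embedding-3-small"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate_embedding(self, text):
        # Returns a list of floats capturing the semantic meaning of the text
        result = self.client.embeddings.create(model=self.model, input=text)
        return result.data[0].embedding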
Store the prompt, response, and embedding in Cosmos DB for future retrieval:
# Create an item to store in Cosmos DB
item = {
    "id": str(uuid.uuid4()),
    "user_id": user_id,
    "prompt": prompt,
    "response": response_text,
    "embedding": prompt_embedding,
    "timestamp": datetime.datetime.utcnow().isoformat()
}

# Store the item in Cosmos DB
cosmos_helper.store_item(item)
3. Semantic Lookup in Cosmos DB
Before generating a new response, check if a similar prompt has been processed before by computing the semantic similarity between the new prompt's embedding and those stored in Cosmos DB.
def find_similar_response(self, user_id, prompt_embedding, similarity_threshold=0.95):
    # Fetch the user's previously stored prompts and responses
    query = "SELECT * FROM c WHERE c.user_id = @user_id"
    parameters = [{"name": "@user_id", "value": user_id}]
    items = list(self.container.query_items(
        query=query,
        parameters=parameters,
        enable_cross_partition_query=True
    ))

    # Track the closest match found so far
    max_similarity = 0
    best_match = None
    for item in items:
        stored_embedding = item.get('embedding')
        if not stored_embedding:
            continue  # Skip documents logged without an embedding
        similarity = compute_cosine_similarity(prompt_embedding, stored_embedding)
        if similarity > max_similarity:
            max_similarity = similarity
            best_match = item

    # Only treat it as a hit if it clears the similarity threshold
    if max_similarity >= similarity_threshold:
        return best_match
    return None
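Scanning every stored item and computing similarity client-side is fine for modest volumes, but Azure Cosmos DB for NoSQL also offers built-in vector search that can push the ranking into the query itself. The sketch below assumes the container has been created with a vector embedding policy and vector index on the embedding property; treat it as an optional optimisation rather than part of the core walkthrough.

# Requires a container configured with a vector embedding policy and vector index
query = (
    "SELECT TOP 1 c.prompt, c.response, "
    "VectorDistance(c.embedding, @embedding) AS similarity_score "
    "FROM c WHERE c.user_id = @user_id "
    "ORDER BY VectorDistance(c.embedding, @embedding)"
)
parameters = [
    {"name": "@user_id", "value": user_id},
    {"name": "@embedding", "value": prompt_embedding},
]
results = list(self.container.query_items(
    query=query, parameters=parameters, enable_cross_partition_query=True
))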
The cosine similarity function calculates the similarity between two embeddings:
import numpy as np

def compute_cosine_similarity(embedding1, embedding2):
    embedding1 = np.array(embedding1)
    embedding2 = np.array(embedding2)
    # Dot product of the vectors divided by the product of their magnitudes
    return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
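As a quick sanity check, identical vectors score 1.0 and orthogonal vectors score 0.0:

print(compute_cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(compute_cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0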
4. Generating New Responses with AI Model API
If no similar prompt exists, generate a new response using the AI Model API:
# Generate a new response using AI Model API
ai_response = ai_model_api.generate_response(prompt)
# Extract the generated text
final_answer = ai_response.get('choices')[0]['text'].strip()
# Store the new response in Cosmos DB for future use
item = {
    "id": str(uuid.uuid4()),
    "user_id": user_id,
    "prompt": prompt,
    "response": final_answer,
    "embedding": prompt_embedding,
    "timestamp": datetime.datetime.utcnow().isoformat()
}
cosmos_helper.store_item(item)
5. Returning the Response
Finally, return the response to the user:
return func.HttpResponse(
    json.dumps({"answer": final_answer}),
    status_code=200,
    mimetype="application/json"
)
Why use Cosmos DB for Semantic Lookup?
- Performance Improvement: Reduces latency by retrieving existing responses instead of generating new ones every time.
- Cost Reduction: Decreases the number of API calls to AI models, saving on usage costs.
- Enhanced Scalability: Cosmos DB's global distribution and automatic scaling handle increasing loads seamlessly.
- Improved User Experience: Provides faster and more consistent responses to users.
- Simplified Maintenance: Centralises data storage and retrieval logic, making the system easier to maintain and extend.
Conclusion
Leveraging Cosmos DB for semantic lookup of previously logged AI responses is a best practice that brings significant benefits to AI solutions. When combined with the robustness and real-time capabilities of Azure Functions, this solution design enhances performance, scalability, and user satisfaction. By intelligently storing and retrieving AI responses, you optimise resource usage and provide a smoother experience for users interacting with AI systems.
Alternatives
There are other methods and tools to consider for semantic caching:
- In-Memory Caching: Utilising in-memory data stores like Redis to cache responses for faster retrieval (see the sketch below).
- Custom Databases: Implementing your own caching mechanism using relational or NoSQL databases based on your application's requirements.
- Hybrid Approaches: Combining multiple caching strategies to optimise performance, cost, and scalability.
Each of these alternatives has its own set of advantages and trade-offs. The best choice depends on factors such as the scale of your application, cost considerations, infrastructure preferences, and specific performance requirements.
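To illustrate the first alternative, here is a minimal sketch of exact-match response caching with Redis using the redis-py client. The key scheme, TTL, and connection details are assumptions, and unlike the Cosmos DB approach above this only catches identical prompts, not semantically similar ones.

import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_response(user_id, prompt, ttl_seconds=3600):
    # Key on the user and a hash of the exact prompt text
    key = f"ai-cache:{user_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"
    cached = r.get(key)
    if cached is not None:
        return cached
    # Cache miss: call the model (ai_model_api as in the sections above) and store the result
    response_text = ai_model_api.generate_response(prompt)['choices'][0]['text'].strip()
    r.set(key, response_text, ex=ttl_seconds)
    return response_text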
Final Thoughts
Choosing the right caching solution is crucial for optimising the performance and cost-efficiency of AI applications utilising large language models. While Azure Cosmos DB offers a robust and scalable option for semantic caching, tools like GPTCache provide open-source alternatives that can be customised to fit unique project needs. By evaluating the features and benefits of each option, you can select the solution that best aligns with your application's goals and constraints.
Author
Christopher Durow