Build scalable and serverless RAG workflows with a vector engine for Amazon OpenSearch Serverless and Amazon Bedrock Claude models

In pursuit of a more efficient and customer-centric support system, organizations are deploying cutting-edge generative AI applications. These applications are designed to excel in four critical areas: multi-lingual support, sentiment analysis, personally identifiable information (PII) detection, and conversational search capabilities. Customers worldwide can now engage with the applications in their preferred language, and the applications can gauge their emotional state, mask sensitive personal information, and provide context-aware responses. This holistic approach not only enhances the customer experience but also offers efficiency gains, ensures data privacy compliance, and drives customer retention and sales growth.

Generative AI applications are poised to transform the customer support landscape, offering versatile solutions that integrate seamlessly with organizations’ operations. By combining the power of multi-lingual support, sentiment analysis, PII detection, and conversational search, these applications promise to be a game-changer. They empower organizations to deliver personalized, efficient, and secure support services while ultimately driving customer satisfaction, cost savings, data privacy compliance, and revenue growth.

Amazon Bedrock and foundation models like Anthropic Claude are poised to enable a new wave of AI adoption by powering more natural conversational experiences. However, a key challenge that has emerged is tailoring these general purpose models to generate valuable and accurate responses based on extensive, domain-specific datasets. This is where the Retrieval Augmented Generation (RAG) technique plays a crucial role.

RAG allows you to retrieve relevant data from databases or document repositories to provide helpful context to large language models (LLMs). This additional context helps the models generate more specific, high-quality responses tuned to your domain.

In this post, we demonstrate building a serverless RAG workflow by combining the vector engine for Amazon OpenSearch Serverless with an LLM like Anthropic Claude hosted by Amazon Bedrock. This combination provides a scalable way to enable advanced natural language capabilities in your applications, including the following:

  • Multi-lingual support – The solution uses the ability of LLMs like Anthropic Claude to understand and respond to queries in multiple languages without any additional training needed. This provides true multi-lingual capabilities out of the box, unlike traditional machine learning (ML) systems that need training data in each language.
  • Sentiment analysis – This solution enables you to detect positive, negative, or neutral sentiment in text inputs like customer reviews, social media posts, or surveys. LLMs can provide explanations for the inferred sentiment, describing which parts of the text contributed to a positive or negative classification. This explainability helps build trust in the model’s predictions. Potential use cases could include analyzing product reviews to identify pain points or opportunities, monitoring social media for brand sentiment, or gathering feedback from customer surveys.
  • PII detection and redaction – The Claude LLM can be prompted to accurately identify various types of PII, like names, addresses, Social Security numbers, and credit card numbers, and replace them with placeholders or generic values while maintaining readability of the surrounding text. This supports compliance with regulations like GDPR and prevents sensitive customer data from being exposed. It also helps automate the labor-intensive process of PII redaction and reduces the risk of exposed customer data across various use cases, such as the following:
    • Processing customer support tickets and automatically redacting any PII before routing to agents.
    • Scanning internal company documents and emails to flag any accidental exposure of customer PII.
    • Anonymizing datasets containing PII before using the data for analytics or ML, or sharing the data with third parties.

Through careful prompt engineering, you can accomplish the aforementioned use cases with a single LLM. The key is crafting prompt templates that clearly articulate the desired task to the model. Prompting allows us to tap into the vast knowledge already present within the LLM for advanced natural language processing (NLP) tasks, while tailoring its capabilities to our particular needs. Well-designed prompts unlock the power and potential of the model.
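
As a minimal sketch of what such prompt templates might look like, the following code defines one template per task and wraps the filled template in the Human/Assistant turn format that Claude text-completion models expect. The template wording, the PROMPT_TEMPLATES dictionary, and the build_prompt helper are illustrative assumptions, not the templates used in the solution's repository.

```python
# Hypothetical prompt templates; the wording is illustrative only.
PROMPT_TEMPLATES = {
    "sentiment": (
        "Classify the sentiment of the text inside <text> tags as positive, "
        "negative, or neutral, and briefly explain which phrases led to that "
        "classification.\n<text>{text}</text>"
    ),
    "pii_redaction": (
        "Rewrite the text inside <text> tags, replacing any names, addresses, "
        "Social Security numbers, and credit card numbers with placeholders "
        "such as [NAME] or [ADDRESS]. Keep everything else unchanged.\n"
        "<text>{text}</text>"
    ),
    "translation": (
        "Translate the text inside <text> tags into {language}.\n"
        "<text>{text}</text>"
    ),
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the template for `task` and wrap it in Human/Assistant turns."""
    instruction = PROMPT_TEMPLATES[task].format(**fields)
    return f"\n\nHuman: {instruction}\n\nAssistant:"


prompt = build_prompt("sentiment", text="The checkout flow was quick and painless.")
print(prompt)
```

Keeping the task instructions in templates means the same model and the same invocation code can serve all three use cases; only the template key and its fields change.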

With the vector database capabilities of Amazon OpenSearch Serverless, you can store vector embeddings of documents, allowing ultra-fast, semantic (rather than keyword) similarity searches to find the most relevant passages to augment prompts.

Read on to learn how to build your own RAG solution using an OpenSearch Serverless vector database and Amazon Bedrock.

Solution overview

The following architecture diagram shows a scalable and fully managed RAG-based workflow for a wide range of generative AI applications, such as language translation, sentiment analysis, PII data detection and redaction, and conversational AI. The solution operates in two stages. The first stage generates vector embeddings from unstructured documents and stores them in an OpenSearch Serverless vector index. In the second stage, user queries are forwarded to the Anthropic Claude model on Amazon Bedrock along with the relevant context retrieved from that index, so the model can deliver more precise and relevant responses.

In the following sections, we discuss the two core functions of the architecture in more detail:

  • Index domain data
  • Query an LLM with enhanced context

Index domain data

In this section, we discuss the details of the data indexing phase.

Generate embeddings with Amazon Titan

We used the Amazon Titan Embeddings model to generate vector embeddings. With 1,536 dimensions, the embeddings capture semantic nuances in meaning and relationships. The model is available through the Amazon Bedrock serverless experience; you can access it using a single API without managing any infrastructure. The following code illustrates generating embeddings using a Boto3 client.

import json

import boto3

bedrock_client = boto3.client('bedrock-runtime')

# Generate embeddings with the Amazon Titan Embeddings model
response = bedrock_client.invoke_model(
    body=json.dumps({"inputText": 'Hello World'}),
    modelId='amazon.titan-embed-text-v1',
)
# The response body is a stream; read and parse it to get the vector
result = json.loads(response['body'].read())
embeddings = result.get('embedding')
print(f'Embeddings -> {embeddings}')

Store embeddings in an OpenSearch Serverless vector collection

OpenSearch Serverless offers a vector engine to store embeddings. As your indexing and querying needs fluctuate based on workload, OpenSearch Serverless automatically scales up and down based on demand. You no longer have to predict capacity or manage infrastructure sizing.

With OpenSearch Serverless, you don’t provision clusters. Instead, you define capacity in the form of OpenSearch Compute Units (OCUs). OpenSearch Serverless scales up to the maximum number of OCUs you define. You’re charged for a minimum of 4 OCUs, which can be shared across multiple collections that use the same AWS Key Management Service (AWS KMS) key.

The following screenshot illustrates how to configure capacity limits on the OpenSearch Serverless console.
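
To make the storage step more concrete, the following sketch builds a k-NN index definition and a document body as plain Python dictionaries. The field names embedding, text, and doc_type match the vector search query used in this post, and the dimension matches the Titan embeddings. The HNSW method, the nmslib engine, the cosinesimil space type, and the two helper functions are assumptions for illustration; check which engines and space types your collection supports.

```python
EMBEDDING_DIMENSION = 1536  # output size of amazon.titan-embed-text-v1


def build_index_body(dimension: int = EMBEDDING_DIMENSION) -> dict:
    """Return an index definition with a knn_vector field for embeddings."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    # Assumed method settings; adjust to what your collection supports.
                    "method": {
                        "name": "hnsw",
                        "engine": "nmslib",
                        "space_type": "cosinesimil",
                    },
                },
                "text": {"type": "text"},
                "doc_type": {"type": "keyword"},
            }
        },
    }


def build_document(text: str, doc_type: str, embedding: list) -> dict:
    """Return a document body, rejecting vectors of the wrong size."""
    if len(embedding) != EMBEDDING_DIMENSION:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSION} dimensions, got {len(embedding)}"
        )
    return {"text": text, "doc_type": doc_type, "embedding": embedding}


index_body = build_index_body()
print(index_body["mappings"]["properties"]["embedding"]["dimension"])
```

With a client such as opensearch-py, these dictionaries would typically be passed as the body argument when creating the index and when indexing each document.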

Query an LLM with domain data

In this section, we discuss the details of the querying phase.

Generate query embeddings

When a user queries for data, we first generate an embedding of the query with Amazon Titan embeddings. OpenSearch Serverless vector collections employ an Approximate Nearest Neighbors (A-NN) algorithm to find document embeddings closest to the query embeddings. The A-NN algorithm uses cosine similarity to measure the closeness between the embedded user query and the indexed data. OpenSearch Serverless then returns the documents whose embeddings have the smallest distance, and therefore the highest similarity, to the user’s query embedding. The following code illustrates our vector search query:

vector_query = {
    "size": 5,
    "query": {"knn": {"embedding": {"vector": embedded_search, "k": 2}}},
    "_source": False,
    "fields": ["text", "doc_type"],
}

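To illustrate what the similarity search computes, the following sketch ranks a few toy three-dimensional vectors against a query vector by exact cosine similarity. This is a brute-force stand-in for explanation only: OpenSearch Serverless uses an approximate algorithm over 1,536-dimensional embeddings, and the document IDs and vector values here are made up.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list, documents: dict, k: int) -> list:
    """Return the IDs of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query, vector), doc_id)
        for doc_id, vector in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real ones come from the Titan Embeddings model.
documents = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.2],
    "return-window": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

print(top_k(query, documents, k=2))  # ['refund-policy', 'return-window']
```

An exact scan like this compares the query with every stored vector, which is why an approximate index is used once the number of documents grows.
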
Query Anthropic Claude models on Amazon Bedrock

OpenSearch Serverless finds relevant documents for a given query by matching embedded vectors. We enhance the prompt with this context and then query the LLM. In this example, we use the AWS SDK for Python (Boto3) to invoke models on Amazon Bedrock. The Amazon Bedrock runtime client provides two APIs to interact with foundation models: invoke_model, which returns the complete response, and invoke_model_with_response_stream, which streams the response in chunks.

The following code invokes our LLM:

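import json
import boto3
bedrock_client = boto3.client('bedrock-runtime')
# model_id could be 'anthropic.claude-v2', 'anthropic.claude-v1', or 'anthropic.claude-instant-v1'
# Assumes prompt (the request payload) and model_id are defined earlier
response = bedrock_client.invoke_model_with_response_stream(body=json.dumps(prompt), modelId=model_id)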
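
The following sketch shows one way the retrieved passages might be folded into the prompt and how the streamed chunks could be reassembled into a single answer. It assumes the Claude text-completion request format (prompt and max_tokens_to_sample) and streamed events that carry a JSON payload with a completion field. The helper names, the prompt wording, and the sample passages are hypothetical.

```python
import json


def build_rag_prompt(question: str, passages: list) -> str:
    """Place retrieved passages ahead of the user's question."""
    context = "\n".join(f"<passage>{p}</passage>" for p in passages)
    return (
        "\n\nHuman: Answer the question using only the passages inside "
        "<context> tags. If they don't contain the answer, say you don't know.\n"
        f"<context>\n{context}\n</context>\n"
        f"Question: {question}\n\nAssistant:"
    )


def build_request_body(prompt: str, max_tokens: int = 512) -> str:
    """Serialize the request payload for a Claude text-completion model."""
    return json.dumps({"prompt": prompt, "max_tokens_to_sample": max_tokens})


def collect_stream_text(events) -> str:
    """Join the completion text from each streamed chunk event."""
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk:
            payload = json.loads(chunk["bytes"])
            parts.append(payload.get("completion", ""))
    return "".join(parts)


passages = ["Refunds are issued within 5 business days."]
body = build_request_body(build_rag_prompt("How long do refunds take?", passages))
print(body)
```

In this sketch, body would be passed as the body argument of invoke_model_with_response_stream, and the body of the response would be passed to collect_stream_text. An application that streams to the user would forward each chunk as it arrives instead of joining them.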


Before you deploy the solution, review the prerequisites.

Deploy the solution

The code sample along with the deployment steps are available in the GitHub repository. The following screenshot illustrates deploying the solution using AWS CloudShell.

Test the solution

The solution provides some sample data for indexing, as shown in the following screenshot. You can also index custom text. Initial indexing of documents may take some time because OpenSearch Serverless has to create a new vector index and then index documents. Subsequent requests are faster. To delete the vector index and start over, choose Reset.

The following screenshot illustrates how you can query your domain data in multiple languages after it’s indexed. You could also try out sentiment analysis or PII data detection and redaction on custom text. The response is streamed over Amazon API Gateway WebSockets.

Clean up

To clean up your resources, delete the following AWS CloudFormation stacks via the AWS CloudFormation console:

  • LlmsWithServerlessRagStack
  • ApiGwLlmsLambda


In this post, we provided an end-to-end serverless solution for RAG-based generative AI applications. This not only offers you a cost-effective option, particularly in the face of GPU cost and hardware availability challenges, but also simplifies the development process and reduces operational costs.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.

About the authors

Fraser Sequeira is a Startups Solutions Architect with AWS based in Mumbai, India. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS. He enjoys staying on top of the latest technology innovations from AWS and sharing his learnings with customers. He spends his free time tinkering with new open source technologies.

Kenneth Walsh is a New York-based Sr. Solutions Architect whose focus is AWS Marketplace. Kenneth is passionate about cloud computing and loves being a trusted advisor for his customers. When he’s not working with customers on their journey to the cloud, he enjoys cooking, audiobooks, movies, and spending time with his family and dog.

Max Winter is a Principal Solutions Architect for AWS Financial Services clients. He works with ISV customers to design solutions that allow them to leverage the power of AWS services to automate and optimize their business. In his free time, he loves hiking and biking with his family, music and theater, digital photography, 3D modeling, and imparting a love of science and reading to his two nearly-teenagers.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.
