Building a Privacy-Preserving LLM-Based Chatbot

Article Monday, November 27 2023

As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. However, this practice comes with its share of challenges, primarily around data privacy, protection, and governance.

Let’s consider the construction of the LLM itself, which is trained on a massive amount of data collected from public and private sources. Without careful anonymization and filtering, sensitive data — such as PII or intellectual property — may be inadvertently included in the training set, potentially leading to a privacy breach.

Furthermore, privacy concerns are introduced when interacting with LLMs, as users might input sensitive data, such as names, addresses, or even confidential business information. If these inputs aren’t handled properly, the misuse or exposure of this information is a genuine risk.

In this post, we’ll explore how to work with LLMs in a privacy-preserving way when building an LLM-based chatbot. As we walk through the technology from end-to-end, we’ll highlight the most acute data privacy concerns and we’ll show how using a data privacy vault addresses those concerns.

Let’s start by taking a closer look at the problem we need to solve.

The problem: Protecting sensitive information from exposure by a chatbot

Consider a company that uses an LLM-based chatbot for its internal operations. The LLM for the chatbot was built by modifying a pre-existing base model with embeddings created from internal company documents. The chatbot provides an easy-to-use interface that lets non-technical users within the company access information from internal data and documents.

The company has a sensitive internal project called “Project Titan.” Project Titan is so important and so sensitive that only people working on Project Titan know about it. In fact, the team often says: the first rule of Project Titan is don’t talk about Project Titan. Naturally, the team wants to take advantage of the internal chatbot and include Project Titan-specific information to speed up the creation of design documents, documentation, and press releases. However, they need to control who can see details about this sensitive project.

What we have is a tangible and pressing privacy concern that sits at the intersection of AI and data. These challenges appear extremely difficult to solve in a scalable and production-ready way. Simply having a private version of the LLM doesn’t address the core issue of data access.

The proposed solution: Sensitive data de-identification and fine-grained access control

Ultimately, we need to identify the key points where sensitive data must be de-identified during the process of building (or fine-tuning) the LLM and the end user’s interaction with the LLM-based chatbot. After careful analysis, we’ve identified that there are two key points in the process where we need to de-identify (and later re-identify) sensitive data:

Before ingestion: When documents from Project Titan are used to create embeddings, the project name, any PII, and anything else sensitive to the project must be de-identified. This de-identification should occur as part of the ETL pipeline prior to data ingestion into the LLM.
During use: When a user inputs data to the chatbot, any sensitive data included in that input must also be de-identified.

You can de-identify sensitive data using Skyflow’s polymorphic encryption and tokenization engine, which is included in Skyflow Data Privacy Vault. The engine detects PII as well as terms you define within a sensitive data dictionary, such as intellectual property (e.g., Project Titan).
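To make the idea concrete, here is a minimal sketch of dictionary- and pattern-based de-identification. It is not Skyflow’s implementation: `SENSITIVE_TERMS`, `EMAIL_PATTERN`, the `tok_` token format, and `deidentify` are all hypothetical names chosen for illustration, and a plain in-memory dict stands in for the vault.

```python
import re
import uuid

# Hypothetical sensitive data dictionary (an assumption for illustration).
SENSITIVE_TERMS = ["Project Titan"]
# Deliberately simple email pattern; real PII detection covers far more.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def deidentify(text, vault):
    """Replace sensitive values in `text` with tokens.

    `vault` maps original value -> token and is updated in place, so a
    value seen before is always replaced by the same token.
    """
    def token_for(value):
        if value not in vault:
            vault[value] = f"tok_{uuid.uuid4().hex[:8]}"
        return vault[value]

    # Dictionary terms are matched case-insensitively and share one token.
    for term in SENSITIVE_TERMS:
        text = re.sub(re.escape(term), lambda m: token_for(term), text,
                      flags=re.IGNORECASE)
    # Each distinct email address gets its own token.
    return EMAIL_PATTERN.sub(lambda m: token_for(m.group(0)), text)
```

Calling `deidentify` on several documents with the same `vault` dict yields the same token for “Project Titan” everywhere, which is what later lets a tokenized question line up with tokenized documents.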

Of course, only Project Titan team members who use the chatbot should be able to access the sensitive project data. Therefore, when the chatbot forms a response, we’ll rely on Skyflow’s governance engine (which provides fine-grained access control) and detokenization API to retrieve the sensitive data from the data privacy vault, making it available only to authorized end users.

Before we dive into the technical implementation, let’s go through a brief overview of foundational LLM concepts. If you’re already familiar with these concepts, you can skip the next section.

A brief primer on LLMs

LLMs are sophisticated artificial intelligence (AI) systems designed to analyze, generate, and work with human language. Built on advanced machine learning architectures, they are trained on vast quantities of text data, enabling them to generate text that is convincingly human-like in its coherence and relevance.

LLMs leverage a technology called transformers — one example is GPT, which stands for Generative Pre-Trained Transformer — to predict or generate a piece of text when given input or context. LLMs learn from patterns in the data they are trained on and then apply these learnings to understand newly given content or to generate new content.

Despite their benefits, LLMs pose potential challenges in terms of privacy, data security, and ethical considerations. This is because LLMs can inadvertently memorize sensitive information from their training data or generate inappropriate content if not properly regulated or supervised. Therefore, the use of LLMs necessitates effective strategies for data handling, governance, and preserving user privacy.

A technical overview of the solution

When embarking on any LLM project, we need to start with a model. Many open-source LLMs have been released in recent months, each with its specific area of focus. Instead of building an entire LLM from scratch, many developers choose a pre-built model and then augment it with vector embeddings generated from domain-specific data.

Vector embeddings encapsulate the semantic relationship between words and help algorithms understand context. The embeddings act as an additional contextual knowledge base to help augment the facts known by the base model.
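As a rough illustration of how embeddings capture semantic closeness, the sketch below compares vectors with cosine similarity. The three-dimensional vectors and their labels are made up for the example; real embedding models produce vectors with hundreds of dimensions, and `most_similar` is a hypothetical helper, not part of any library used in this post.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings (assumption): related phrases point the same way.
embeddings = {
    "launch plan": [0.9, 0.1, 0.0],
    "release schedule": [0.8, 0.2, 0.1],
    "cafeteria menu": [0.0, 0.1, 0.9],
}


def most_similar(query_vector, embeddings):
    """Return the label whose vector is closest to `query_vector`."""
    return max(embeddings,
               key=lambda label: cosine_similarity(query_vector, embeddings[label]))
```

A vector store such as Chroma performs this kind of nearest-neighbor lookup at scale, returning the stored text whose embedding is closest to the embedding of the question.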

In our case, we’ll begin with an existing model from Hugging Face, and then customize it with embeddings. Hugging Face provides ML infrastructure services as well as open-source models and datasets.

In addition to the Hugging Face model, we’ll use the following additional tools to build out our privacy-preserving LLM-based ETL pipeline and chatbot:

LangChain, an open-source Python library that chains together components typically used for building applications (such as chatbots) powered by LLMs
Snowflake, which we’ll use for internal document and data storage
Snowpipe, which we’ll use with Snowflake for automated data loading
Chroma, an AI-native, open-source database for vector embeddings
Streamlit, an open-source framework for building AI/ML-related applications using Python
RetrievalQA, a question-answering chain in LangChain which gets documents from a Retriever and then uses a QA chain to answer questions from those documents

The following diagram shows the high-level ETL and embeddings data flow:

Example of the ETL and embeddings data flow.

The ETL and embeddings flows from end to end are:

ETL

Start with source data, which may contain sensitive data.
Send data to Skyflow Data Privacy Vault for de-identification.
Use Snowpipe to load clean data into Snowflake.

Create vector embeddings

Load documents from Snowflake into LangChain.
Create vector embeddings with LangChain.
Store embeddings in Chroma.

Once the model has been customized with the Project Titan information, the user interaction and inference flow is as follows:

User interaction and inference information flow

1. Chat UI input

Accept user input via Streamlit’s chat UI.
Send user input to Skyflow for de-identification.

2. Retrieve embeddings

Get the embeddings from Chroma and attach to RetrievalQA.

3. Inference

Send clean data to RetrievalQA.
Use QA chain in RetrievalQA to answer the user’s question.

4. Chat UI response

Send RetrievalQA’s response to Skyflow for detokenization.
Send re-identified data to Streamlit for display to the end user.
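The four stages above can be sketched as a single function that threads the user’s input through each step. This is only a shape for the flow: `answer_question` and its four callable parameters are hypothetical, and the lambdas below are stand-ins for Skyflow, Chroma, and RetrievalQA rather than real calls to them.

```python
def answer_question(user_input, deidentify, retrieve, answer, reidentify):
    """Run one chat turn through the privacy-preserving flow."""
    clean_input = deidentify(user_input)         # 1. de-identify the user input
    context = retrieve(clean_input)              # 2. fetch matching documents
    clean_answer = answer(clean_input, context)  # 3. inference on clean data only
    return reidentify(clean_answer)              # 4. restore values for display


# Stand-in components (assumptions for illustration only).
TOKEN = "tok_1234"
documents = [f"{TOKEN} launches in Q3."]

result = answer_question(
    "When does Project Titan launch?",
    deidentify=lambda text: text.replace("Project Titan", TOKEN),
    retrieve=lambda query: [d for d in documents if TOKEN in query and TOKEN in d],
    answer=lambda query, context: context[0] if context else "I don't know.",
    reidentify=lambda text: text.replace(TOKEN, "Project Titan"),
)
```

The point of the structure is that the retrieval and inference steps only ever see tokenized text; the original values appear only before step 1 and after step 4.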

Now that we’re clear on the high-level process, let’s dive in and take a closer look at each step.

ETL: Cleaning the source data

Cleaning the source data with Skyflow Data Privacy Vault is fairly straightforward and I’ve covered some of this in a prior post. In this case, we need to process all the source documents for Project Titan available in an AWS S3 bucket.

Skyflow will store the raw files, de-identify PII and IP, and save the clean files to another S3 bucket.

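import boto3
from skyflow.errors import SkyflowError
from skyflow.service_account import generate_bearer_token, is_expired
from skyflow.vault import Client, ConnectionConfig, Configuration, RequestMethod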

# Authentication to Skyflow API
bearerToken = ''
def tokenProvider():
    global bearerToken
    # Reuse the cached token while it is still valid
    if not is_expired(bearerToken):
        return bearerToken
    bearerToken, _ = generate_bearer_token('<YOUR_CREDENTIALS_FILE_PATH>')
    return bearerToken

def processTrainingData(trainingData):
    try:
        # Vault connection configuration
        config = Configuration('<YOUR_VAULT_ID>', '<YOUR_VAULT_URL>', tokenProvider)

        # Define the connection API endpoint
        connectionConfig = ConnectionConfig('<YOUR_CONNECTION_URL>', RequestMethod.POST,
            requestHeader={
                'Content-Type': 'application/json',
                'Authorization': '<YOUR_CONNECTION_BASIC_AUTH>'
            },
            requestBody={
                'trainingData': trainingData
            }
        )
 
        # Connect to the vault
        client = Client(config)
    
        # Call the Skyflow API to de-identify the training data
        response = client.invoke_connection(connectionConfig)

        # Define the S3 bucket name and key for the file
        bucketName = "clean-data-bucket"
        fileKey = "{timestamp}-{generated-uuid}"

        # Write the data to a file in memory
        fileContents = bytes(response.training_data.encode("UTF-8"))

        # Upload the file to S3
        s3 = boto3.client("s3")
        s3.put_object(Bucket=bucketName, Key=fileKey, Body=fileContents)
    except SkyflowError as e:
        print('Error Occurred:', e)
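The `fileKey` value in the snippet above is a placeholder. One way to build a unique key of that general shape is sketched below; the exact timestamp layout, the optional prefix, and the name `make_file_key` are assumptions, since the original only indicates a timestamp followed by a generated UUID.

```python
import uuid
from datetime import datetime, timezone


def make_file_key(prefix=""):
    """Build an S3 object key of the form <prefix><UTC timestamp>-<uuid4>."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}{timestamp}-{uuid.uuid4()}"
```

Leading with the timestamp keeps keys roughly sortable by upload time, while the UUID avoids collisions when several documents are processed within the same second.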

Next, we’ll configure Snowpipe to detect new documents in our S3 bucket and load that data into Snowflake. To do this, we’ll need to create the following in Snowflake:

CREATE OR REPLACE TABLE custom_training_data (
  training_text BINARY
  );

CREATE OR REPLACE FILE FORMAT training_data_json_format
  TYPE = JSON;

CREATE OR REPLACE TEMPORARY STAGE training_data_stage
 FILE_FORMAT = training_data_json_format;

CREATE PIPE custom_training_data
  AUTO_INGEST = TRUE
  AS
  COPY INTO custom_training_data
    FROM (SELECT $1:records.fields.training_text
          FROM @training_data_stage t)
    ON_ERROR = 'continue';

With that, the raw data goes through a de-identification process, and we store only the de-identified data in Snowflake. Any sensitive data related to Project Titan is now obscured before it reaches the LLM, but because of Skyflow’s polymorphic encryption and tokenization, the de-identified data keeps its referential integrity (each token consistently stands in for the same original value), meaning we can return the data to its original form when interacting with the chatbot.

Creating vector embeddings: Customizing our LLM

Now that we have our de-identified text data stored in Snowflake, we’re confident that all information related to Project Titan has been properly concealed. The next step is to create embeddings of these documents.
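Before creating embeddings, it can be worth verifying that confidence with a quick check. The sketch below scans de-identified documents for any raw dictionary terms that slipped through; `find_leaks` is a hypothetical helper, and a simple case-insensitive match like this only catches exact terms, not paraphrases or misspellings.

```python
import re


def find_leaks(documents, sensitive_terms):
    """Return (document_index, term) pairs where a raw term still appears."""
    leaks = []
    for index, text in enumerate(documents):
        for term in sensitive_terms:
            if re.search(re.escape(term), text, flags=re.IGNORECASE):
                leaks.append((index, term))
    return leaks
```

An empty result is a reasonable gate for the embedding step; a non-empty one points at the specific documents that need to go back through de-identification.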

We’ll use the Instructor model provided by Hugging Face as our embedding model. We store our embeddings in Chroma, a vector database built expressly for this purpose. This will allow for the downstream retrieval and search support of the textual data stored in our vector database.

The code below loads the base model, embedding model, and storage context.

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

model_id = "hkunlp/instructor-large"
embed_model = HuggingFaceEmbeddings(model_name=model_id)
# Chroma collection that will hold the document embeddings
vector_store = Chroma("langchain_store", embed_model)

Next, we need to load all documents and add them to the vector store. For this, we use the Snowflake document loader in LangChain.

from langchain.document_loaders import SnowflakeLoader
import settings as s

QUERY = "select training_text as source from custom_training_data"
snowflake_loader = SnowflakeLoader(
    query=QUERY,
    user=s.SNOWFLAKE_USER,
    password=s.SNOWFLAKE_PASS,
    account=s.SNOWFLAKE_ACCOUNT,
    warehouse=s.SNOWFLAKE_WAREHOUSE,
    role=s.SNOWFLAKE_ROLE,
    database=s.SNOWFLAKE_DATABASE,
    schema=s.SNOWFLAKE_SCHEMA,
    metadata_columns=["source"],
)
training_documents = snowflake_loader.load()

vector_store.add_documents(training_documents)

With the training documents loaded into the vector store, we can create the question-answering chain.

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'),
                                 chain_type="stuff", 
                                 retriever=vector_store.as_retriever())
result = qa.run("What is Project Titan?")

This question (“What is Project Titan?”) will fail because the model doesn’t actually know about Project Titan; it only knows about a de-identified version of the string “Project Titan”.

To issue a query like this, the query needs to be first sent through Skyflow to de-identify the string and then the de-identified version is passed to the model. We’ll tackle this next as we start to put the pieces together for our chat UI.
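The mismatch is easy to reproduce with a toy example. The sketch below uses a naive keyword retriever as a stand-in for embedding-based retrieval; `keyword_retrieve`, the token value, and the sample document are all made up for illustration.

```python
def keyword_retrieve(query, documents):
    """Toy retriever: return documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words & set(doc.lower().split())]


# The stored document only ever contains the token, never the real name.
TOKEN = "tok_7f3a"  # hypothetical token standing in for "Project Titan"
stored_documents = [f"{TOKEN} launches in Q3"]

# A raw query has nothing in common with the de-identified document...
raw_hits = keyword_retrieve("Project Titan timeline", stored_documents)
# ...while the same query, de-identified first, lines up with it.
clean_hits = keyword_retrieve(f"{TOKEN} timeline", stored_documents)
```

Here `raw_hits` comes back empty and `clean_hits` contains the stored document, which is the same effect the de-identification step has on real retrieval.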

Chat UI Input: Preserving privacy of user-supplied data

We’re ready to focus on the chatbot UI aspect of our project, dealing with accepting and processing user input as well as returning results with Project Titan data detokenized when needed.

For this portion of the project, we will use Streamlit for our UI. The code below creates a simple chatbot UI with Streamlit.

import openai
import streamlit as st

st.title("🔏Acme Corp Assistant")

# Initialize the chat messages history
if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "Hello 👋!  \nHow can I help?"}
    ]

# Prompt for user input and save
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})

# display the prior chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Generate a response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = "TODO"

    message = {"role": "assistant", "content": response}
    st.session_state.messages.append(message)

Our simple chat UI looks like this:

As you can see, the UI accepts a user input, but doesn’t currently integrate with our LLM. Next, we need to send the user input to Skyflow for de-identification before we use RetrievalQA to answer the user’s question. Let’s start with accepting and processing our input data.

To detect and de-identify plaintext sensitive data with Skyflow, we can use the detect API endpoint with code similar to the following:

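def deIdentifyText(input):
    # Request body for the detect endpoint: tokenize any sensitive values found
    data = {
        "text": [
            {
                "message": input
            }
        ],
        "deidentify_option": "tokenize"
    }
    response = client.detect(data)

    # Return the de-identified version of the input text
    return response[0].processed_text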

Now that we’ve de-identified the user input data, we can send the question to RetrievalQA, which will then use a QA chain to answer the question from our documents.

def performCompletion(input):
    # De-identify the user's question before it reaches the model
    clean_input = deIdentifyText(input)

    qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'),
                                     chain_type="stuff",
                                     retriever=vector_store.as_retriever())
    return qa.run(clean_input)

We now have our response from RetrievalQA. However, we need to take one additional step before we can send it back to our user: detokenize (re-identify) our response through Skyflow’s detokenization API. This is fairly straightforward, similar to previous API calls to Skyflow.

Everything we need is encapsulated by the function performInference, which calls reIdentifyText after the completion is returned.

Who can see what and in which format is controlled by Skyflow’s governance engine. There’s too much to cover here, but if you want to learn more, see Introducing the Skyflow Data Governance Engine.

def performInference(input):
    response = performCompletion(input)

    return reIdentifyText(response)
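The body of reIdentifyText isn’t shown here; in the real application it would call Skyflow’s detokenization API, with the governance engine deciding what each user may see. Purely to illustrate that behavior, the sketch below swaps tokens back using an in-memory mapping and a group check. The `tok_` token format, the `VAULT` dict, the `[REDACTED]` marker, and the `user_groups` parameter are all assumptions, not Skyflow’s API.

```python
import re

# Hypothetical token format: "tok_" followed by 8 hex characters.
TOKEN_PATTERN = re.compile(r"tok_[0-9a-f]{8}")

# Stand-in vault: token -> (original value, group allowed to see it).
VAULT = {"tok_1a2b3c4d": ("Project Titan", "titan-team")}


def reidentify_text(text, user_groups, vault=VAULT):
    """Restore tokens the user may see; redact the ones they may not."""
    def restore(match):
        token = match.group(0)
        entry = vault.get(token)
        if entry is None:
            return token  # unknown token: leave untouched
        value, required_group = entry
        return value if required_group in user_groups else "[REDACTED]"

    return TOKEN_PATTERN.sub(restore, text)
```

With this shape, a Project Titan team member sees the original project name in the chatbot’s answer, while anyone else sees a redacted placeholder for the same response.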

These final steps connect our entire application from end-to-end. Now, we need to update our UI code from above so that the response is correctly set.

# If last message is not from assistant, we need to generate a new response
if st.session_state.messages[-1]["role"] != "assistant":
    # Generate a response
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
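            # Answer the most recent user message and show the reply
            response = performInference(st.session_state.messages[-1]["content"])
            st.write(response)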

With these pieces in place, here’s a quick demo of our privacy-preserving LLM-based chatbot in action:

*Example of the privacy-preserving bot in action.*

Tying it all together

In this article, we walked through the general steps to construct a privacy-preserving LLM-based chatbot. With organizations increasingly using LLM-based applications in their businesses and operations, the need to preserve data privacy has become acute. Concerns about the privacy and security of sensitive data are a major adoption blocker that keeps many companies from making full use of AI with their datasets.

Solving this problem requires identifying the key points where sensitive data might enter your system and need to be de-identified. When working with LLMs, those points occur during model training (whether you’re building an LLM or customizing one) and at the user input stage. You can use Skyflow Data Privacy Vault to implement effective de-identification and data governance for LLM-based AI tools like chatbots.

Building an LLM-based chatbot requires the use of several tools to ensure that data is handled in a manner that preserves privacy. Taking privacy-preserving measures is critical to prevent the misuse or exposure of sensitive information. By using the tools and methods we’ve demonstrated here, companies can leverage AI’s benefits and promote efficient data-driven decision-making while prioritizing data privacy and protection.

Sean’s been an academic, startup founder, and Googler. He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at Skyflow and host of the podcast Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter @seanfalconer.