Hello Pangeans! Meet GHOST, an AI chatbot powered by Pangea. I wanted to explain how I built it and get your feedback.
The Problem
When people use AI platforms, they might unknowingly leak sensitive information, which can be a real problem. It could happen when they share personal details without realizing it or accidentally reveal confidential stuff during interactions. This kind of slip-up can have serious consequences for privacy and security. So, it’s crucial for users to be careful about what they share and understand how AI systems work. Plus, companies need to have strong safeguards in place to protect data and make sure users know how to use these systems safely.
Solution
One effective way to mitigate the risk of sensitive information leakage in AI usage is to implement redaction and file sanitization. Redacting sensitive information involves selectively removing or obscuring confidential details from documents or inputs before they are processed by AI systems. Similarly, sanitizing files ensures that any potentially compromising data is cleansed or anonymized prior to analysis or sharing. These practices help safeguard privacy and security by minimizing the exposure of sensitive information while still letting users benefit from AI technologies. Additionally, incorporating encryption and access controls further strengthens data protection, ensuring that only authorized individuals can access or manipulate sensitive data within AI environments.
What does GHOST do?
GHOST is an AI chatbot powered by Pangea. You may wonder: why another bot? The aim is to showcase how a chatbot can be made safer and more trustworthy. With Pangea's Redact and file sanitization features, we can use AI without worrying about leaking sensitive or personal information.
How I built it
AI
When considering AI options, many opt for OpenAI. However, countless bots are already powered by it, so I thought: why not try something new? That's where Google's Gemini 1.5 Pro model comes into play. Gemini 1.5 delivers dramatically enhanced performance with a more efficient architecture.
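As a minimal sketch, here's how a prompt can be sent to Gemini 1.5 Pro with the google-generativeai Python SDK. The model identifier and key handling here are my illustrative assumptions; GHOST's actual wiring lives in the repo.

import google.generativeai as genai

# Configure the SDK with a key from Google AI Studio (placeholder value).
genai.configure(api_key="YOUR_GOOGLE_AI_STUDIO_KEY")

# "gemini-1.5-pro-latest" is an assumed model identifier; check what's available.
model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content("Explain redaction in one sentence.")
print(response.text)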
Redaction
Every prompt goes through a redaction pass before it reaches the AI. For this I have used Pangea's Redact service, which comes with predefined rules designed to handle various forms of sensitive data, such as personally identifiable information (PII), geographic locations, payment card industry (PCI) data, and more.
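Here's a minimal sketch of that step with the pangea-sdk Python package; the token and domain are placeholders you'd get from the Pangea console.

from pangea.config import PangeaConfig
from pangea.services import Redact

# Placeholder token/domain; real values come from the Pangea console.
config = PangeaConfig(domain="aws.us.pangea.cloud")
redact = Redact("YOUR_PANGEA_REDACT_TOKEN", config=config)

# The predefined rules catch PII such as names and card numbers.
response = redact.redact(text="Hi, I'm Jane Doe and my card is 4242 4242 4242 4242.")
print(response.result.redacted_text)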
Sanitization
Integrating AI with PDF files for Q&A offers significant advantages in efficiency, accuracy, and accessibility of information retrieval and analysis. However, it’s crucial to handle sensitive information appropriately, particularly in corporate environments where data privacy is paramount.
While leveraging AI for Q&A tasks streamlines processes and improves decision-making, directly exposing sensitive content to AI models poses risks. Instead, organizations must implement robust security measures to safeguard confidential information. With the help of Pangea’s Sanitize service, we can redact sensitive information, ensuring compliance while eliminating potentially harmful active content.
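As a rough sketch of sanitizing an uploaded PDF with the pangea-sdk (parameter names follow the Sanitize docs and may differ slightly by release):

from pangea.config import PangeaConfig
from pangea.services import Sanitize
from pangea.services.sanitize import SanitizeContent, SanitizeShareOutput

config = PangeaConfig(domain="aws.us.pangea.cloud")
client = Sanitize("YOUR_PANGEA_SANITIZE_TOKEN", config=config)

with open("report.pdf", "rb") as f:
    response = client.sanitize(
        file=f,
        uploaded_file_name="report.pdf",
        content=SanitizeContent(redact=True, defang=True),  # redact text, defang URLs
        share_output=SanitizeShareOutput(enabled=True, output_folder="sanitized/"),
    )

# The sanitized copy lands in Secure Share, ready for download.
print(response.result.dest_share_id)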
Secure Share with Sanitize allows us to share documents confidently. After sanitization, the user can download the file and use it anywhere as needed. Here is an example screenshot of a sanitized Active Substance Use Medical Summary Report:
At first, my plan was to extract the text content directly from the file, redact it, and then apply AI processing. However, a problem arises when you need to review the redacted content or reuse it later. By sanitizing the file instead, we can view the redacted content in its original format and effortlessly share it with others.
Text Extraction
I have used LLM Sherpa, a tool designed to help developers work more efficiently with LLMs by providing a set of APIs that handle tasks such as parsing documents, extracting layout information, and facilitating better indexing and retrieval.
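A minimal sketch with the llmsherpa package; the public parser URL below comes from the project's README (a self-hosted nlm-ingestor endpoint works too):

from llmsherpa.readers import LayoutPDFReader

# Public LLM Sherpa parser endpoint (from the project README).
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)

doc = pdf_reader.read_pdf("sanitized_report.pdf")

# Layout-aware chunks are convenient units for embedding and retrieval.
for chunk in doc.chunks():
    print(chunk.to_context_text())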
Text Embeddings
File contents are large, and it's hard to process the whole document every time a query is raised; there are also token limits and processing costs. By converting the content into embeddings, it becomes amenable to analyses such as similarity comparison, clustering, classification, and more. Using a simple comparison function, we can compute a similarity score for two embeddings to figure out whether two texts are talking about similar things. Based on the input query, we find the most similar chunks and send them to Gemini for question answering. For this, I have used Cohere's embeddings.
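Here's a minimal sketch of that retrieval step, with cosine similarity as the simple comparison function. The embed-english-v3.0 model name and the key are my assumptions for illustration:

import numpy as np
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

chunks = [
    "GHOST redacts every prompt before it reaches the model.",
    "Sanitized files can be shared via Secure Share.",
]
query = "How does GHOST protect personal data?"

# Cohere v3 embed models require an input_type hint.
doc_embs = co.embed(texts=chunks, model="embed-english-v3.0", input_type="search_document").embeddings
query_emb = co.embed(texts=[query], model="embed-english-v3.0", input_type="search_query").embeddings[0]

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by similarity and keep the best one for the Gemini prompt.
scores = [cosine(query_emb, e) for e in doc_embs]
print(chunks[int(np.argmax(scores))])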
Workflow
How to run it
You can hit the button and start running it in the cloud right away.
To run it locally, clone the repo:
#1 - Clone the repo
git clone https://github.com/dotAadarsh/ghost.git
#2 - Install the requirements
pip install -r requirements.txt
#3 - Grab your API keys and add them to the ./.streamlit/secrets.toml file (see the sketch after these steps)
- Pangea | Security Services for Developers
- Google AI Studio
- Cohere | The leading AI platform for enterprise
#4 - Run the application
streamlit run Home.py
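For reference, here's a hedged sketch of how a Streamlit app can read those secrets; the key names below are hypothetical and must match whatever Home.py actually expects:

# Assumed layout of ./.streamlit/secrets.toml (key names are hypothetical):
#   PANGEA_TOKEN = "pts_..."
#   PANGEA_DOMAIN = "aws.us.pangea.cloud"
#   GOOGLE_API_KEY = "AIza..."
#   COHERE_API_KEY = "..."
import streamlit as st

pangea_token = st.secrets["PANGEA_TOKEN"]
google_api_key = st.secrets["GOOGLE_API_KEY"]
cohere_api_key = st.secrets["COHERE_API_KEY"]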
What’s next for GHOST?
GHOST could use Pangea's AuthN to establish user profiles, letting users sign in and seamlessly pick up where they left off. I could have implemented this, but due to time constraints I wasn't able to :(. I'll try to add it soon.
Another concept: customer service routinely processes our audio recordings for training purposes. A tool could automatically redact sensitive information from those recordings, letting companies proceed with further processing in compliance with privacy regulations and industry guidelines. This could be powered by OpenAI's Whisper model and Pangea's services.
I was also thinking: why not build a Chrome extension that, when enabled, redacts sensitive info before the user submits it on any website? Perhaps we could develop custom extensions or bots tailored to a company's needs, helping prevent sensitive information leakage and ensuring compliance.
Outro
Playing with Pangea's APIs was interesting. I used the Postman collection as well to understand the API responses. I ran into different errors, but hey, a different error message is always a step forward :) and that kept me motivated. Thanks to the docs, the community, and AI. Give GHOST a try and let me know your feedback. Have a great day!
– Aadarsh