# This is an experiment: create vectorized embeddings out of an Evernote DB (PDF, DOCX, HTML, TXT)



## Features

* vectorize text, html files, pdfs and docx into one vector DB, split in tables (sqlite vss)
* use local self-hosted embeddings (CPU or GPU computed)
  * for sentences
* query a local sqlite vss vector db, use cache from LangChain (sqlite)
* use the OpenAI API and Mistral (self-hosted on-prem via Ollama) for the response processing
* compare with LLMware Bling

## Anti-Features

* due to cost reasons the OpenAI embeddings don't get used. So sorry ... not.

# Setup and configuration

⚠ This config is automated: if you run it on Google Colab, it downloads and executes a Bash script from a GitHub repo ⚠

In [23]:
import sys
import os
import subprocess

IN_COLAB = 'google.colab' in sys.modules

if not IN_COLAB:
    # The Evernote DB path containing the extracted data
    # It will not be needed on Colab
    extracted_evernote_db = "/home/marius/data/it-sec-research-extracted/IT sec research"

    # Output paths containing the Evernote text notes or documents data.
    # These get generated by the data extraction process
    output_path_extracted_notes = "/home/marius/source/bookworm/export.txt"
    output_path_extracted_docs = "/home/marius/source/bookworm/export.documents.txt"

    # Resulting DB or vector store path.
    result_db = "/home/marius/source/bookworm/evernote.db"

else:
    # For the Google Colab env we use different paths
    output_path_extracted_notes = "/content/export.txt"
    output_path_extracted_docs = "/content/export.documents.txt"
    result_db = "/content/evernote.db"

    # Download the data locally (just some txt files here)
    # Install pip dependencies in Colab
    subprocess.run('''
    source <(curl -s https://raw.githubusercontent.com/norandom/project_bookworm/main/scripts/prepare_colab_env.sh)
    ''',
    shell=True, check=True, executable='/bin/bash')

In [24]:
# To suppress some warnings
os.environ["TOKENIZERS_PARALLELISM"] = "False"

## Checks

In [25]:
print(output_path_extracted_notes)

/home/marius/source/bookworm/export.txt


## For the progress bars in Colab

⚛ If you don't add these magic commands, the `tqdm` progress bars will not update properly ⚛

In [5]:
%reload_ext autoreload
%autoreload 2

## Select key dependencies

* `cryptography` is used to handle some PDF functions here (signatures)

In [6]:
%pip show cryptography

Name: cryptography
Version: 42.0.5
Summary: cryptography is a package which provides cryptographic recipes and primitives to Python developers.
Home-page: 
Author: 
Author-email: The Python Cryptographic Authority and individual contributors <cryptography-dev@python.org>
License: Apache-2.0 OR BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: cffi
Required-by: pyOpenSSL


* `pikepdf` is used to repair some PDFs

In [7]:
%pip show pikepdf

Name: pikepdf
Version: 8.13.0
Summary: Read and write PDFs with Python, powered by qpdf
Home-page: 
Author: 
Author-email: "James R. Barlow" <james@purplerock.ca>
License: MPL-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: Deprecated, lxml, packaging, Pillow
Required-by: 


* `pypdf` with all features is needed because this DB consists of 100+ PDFs

In [8]:
%pip show "pypdf"

Name: pypdf
Version: 4.0.2
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page: 
Author: 
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 


* `torch` is used for tensors, and GPU processing

In [9]:
%pip show torch

Name: torch
Version: 2.2.1+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: fastai, sentence-transformers, torchaudio, torchdata, torchtext, torchvision


* `faiss` is used in the CPU version as a general vector store library. The data is serialized with `LangChain`. The FAISS CPU version uses AVX2. The GPU port has some implementation issues with disk persistence and merging.

In [10]:
%pip show faiss_cpu

Name: faiss-cpu
Version: 1.8.0
Summary: A library for efficient similarity search and clustering of dense vectors.
Home-page: 
Author: 
Author-email: Kota Yamaguchi <yamaguchi_kota@cyberagent.co.jp>
License: MIT License
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy
Required-by: 


## Text extraction

⬛ This doesn't need to get executed if you already have the `*.txt` files.


* Here the html and text data is extracted into txt
* The PDF and DOCX data is extracted into another txt file. This will be used for weighted data fusion later.

* the texts are normalized:
  * unicode normalization
  * surrogate characters get replaced
  * html gets converted to text
  * pdfs get repaired
  * docx files get read

* exceptions get handled (UTF-16 issues, PDF reference errors)
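
The two text-cleaning steps from the list above (Unicode normalization and surrogate replacement) can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact pipeline; the helper name `normalize_and_sanitize` and the sample string are made up for the example.

```python
import unicodedata

def normalize_and_sanitize(text: str) -> str:
    """NFKD-normalize text, then replace code points that UTF-8 cannot encode."""
    normalized = unicodedata.normalize("NFKD", text)
    # The 'replace' error handler turns lone surrogates into '?'
    return normalized.encode("utf-8", "replace").decode("utf-8")

# Hypothetical sample: 'fi' ligature, precomposed e-acute, lone surrogate
sample = "\ufb01le \u00e9 \ud800"
cleaned = normalize_and_sanitize(sample)
print(cleaned)
```

NFKD splits compatibility characters such as ligatures into their base characters, which tends to help later text matching; the trade-off is that accented letters become a base letter plus a combining mark.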

In [11]:
import glob
import os

import unicodedata  # to normalize text
import html2text  # to convert html to text
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader
import pikepdf  # to repair PDFs
from pathlib import Path
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

In [None]:
def convert_html_to_text(html_blob: str) -> str:
    """
    Converts a html blob into a string.
    """
    h = html2text.HTML2Text()
    h.mark_code = True
    h.escape_snob = True
    h.unicode_snob = True
    # h.use_automatic_links = True
    h.images_as_html = True
    h.single_line_break = True
    h.ignore_links = True
    return h.handle(html_blob)

def normalize_text(txt_blob: str) -> str:
    """
    Normalize a text blob using NFKD normalization.
    """
    return unicodedata.normalize("NFKD", txt_blob)

def repair_pdf(file_path: str) -> bool:
    """
    Attempts to repair a PDF file using pikepdf.
    """
    try:
        with pikepdf.open(file_path, allow_overwriting_input=True) as pdf:
            pdf.save(file_path)
        return True
    except pikepdf.PdfError as e:
        print(f"Failed to repair PDF {file_path}: {e}")
        return False

def read_and_convert_file(file_path: str, is_html: bool, is_pdf: bool, is_docx: bool) -> str:
    """
    Reads and converts a file from HTML, PDF, DOCX, or plain text to text.
    :param file_path:
    :param is_html:
    :param is_pdf:
    :param is_docx:
    :return:
    """

    content = ""
    if is_html:
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            return convert_html_to_text(content)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            return ""

    elif is_pdf:
        try:
            loader = PyPDFLoader(file_path)
            # ... fixes "Multiple definitions in dictionary at byte 0xb32 for key /ExtGState" error
            documents = loader.load()
            content = "\n".join(doc.page_content for doc in documents if hasattr(doc, 'page_content'))
        except Exception as e:
            print(f"Error loading PDF {file_path}: {e}. Attempting to repair...")
            if repair_pdf(file_path):
                try:
                    loader = PyPDFLoader(file_path)
                    documents = loader.load()
                    content = "\n".join(doc.page_content for doc in documents if hasattr(doc, 'page_content'))
                except Exception as e:
                    print(f"Failed to process PDF {file_path} after repair: {e}")
                    return ""
        return normalize_text(content)

    elif is_docx:
        try:
            loader = Docx2txtLoader(file_path)
            # load() returns a list of Document objects; join their text
            documents = loader.load()
            content = "\n".join(doc.page_content for doc in documents)
        except Exception as e:
            print(f"Error reading DOCX {file_path}: {e}")
            return ""
        return normalize_text(content)

    else:  # For plain text files
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return normalize_text(file.read())
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            return ""

def sanitize_text(text):
    """
    Removes or replaces surrogate characters from a string.
    """
    return text.encode('utf-8', 'replace').decode('utf-8')

def append_to_output(data: str, is_pdf: bool, is_docx: bool, output_path: str):
    """
    Appends sanitized data to an output file.
    """
    sanitized_data = sanitize_text(data)
    if is_pdf or is_docx:
        output_path = str(Path(output_path).with_suffix('')) + ".documents.txt"

    with open(output_path, "a", encoding='utf-8') as output_file:
        output_file.write(sanitized_data)

def process_file(file):
    is_html = file.endswith('.html')
    is_pdf = file.endswith('.pdf')
    is_docx = file.endswith('.docx')

    file_content = read_and_convert_file(file, is_html, is_pdf, is_docx)
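    # Notes go to the notes file; append_to_output redirects PDF/DOCX to *.documents.txt
    append_to_output(file_content, is_pdf, is_docx, output_path=output_path_extracted_notes)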

def process_files_in_directory(directory: str):
    txt_html_files = glob.glob(os.path.join(directory, "*.txt")) + glob.glob(os.path.join(directory, "*.html"))
    pdf_docx_files = glob.glob(os.path.join(directory, "img", "*.pdf")) + glob.glob(os.path.join(directory, "img", "*.docx"))
    all_files = txt_html_files + pdf_docx_files

    # Initialize the progress bar
    pbar = tqdm(total=len(all_files), desc="Processing files")

    with ThreadPoolExecutor(max_workers=3) as executor:
        # Submit all files to the executor and store future objects
        futures = [executor.submit(process_file, file) for file in all_files]

        # As tasks complete, update the progress bar
        for future in as_completed(futures):
            pbar.update(1)  # Update the progress bar by one for each task completed

    # Ensure the progress bar is closed upon completion
    pbar.close()

process_files_in_directory(extracted_evernote_db)

# Chunking of the texts

The texts need to be chunked (pre-processing) before the embedding process. Because we process text for similarity detection, overlapping chunks help: context that gets cut at a chunk boundary is partly repeated in the neighboring chunk. For log processing and detection engineering, overlaps would be counter-productive.

In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text_data(txt_file=output_path_extracted_notes):

    with open(txt_file, encoding='utf-8') as f:
        text_notes = f.read()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=20,
        length_function=len
    )

    chunks = text_splitter.create_documents([text_notes])
    print(f'Now you have {len(chunks)} chunks in {txt_file}')
    return chunks

# chunk individual text file containing the data
text_chunks = chunk_text_data(txt_file=output_path_extracted_notes)
doc_chunks = chunk_text_data(txt_file=output_path_extracted_docs)

Now you have 723845 chunks in /home/marius/source/bookworm/export.txt
Now you have 151259 chunks in /home/marius/source/bookworm/export.documents.txt


In [28]:
text_chunks[42].page_content

'fields. Primitive fields are those finite fields in which the exponent n is 1. In primitive fields'
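
To make the effect of `chunk_size=100` and `chunk_overlap=20` concrete, here is a simplified sliding-window sketch. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to split at separators such as line breaks and spaces, while this hypothetical `chunk_with_overlap` helper cuts at fixed character offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Synthetic 250-character text, only for illustration
text = "".join(str(i % 10) for i in range(250))
chunks = chunk_with_overlap(text)
print(len(chunks), [len(c) for c in chunks])
```

The last 20 characters of each window reappear at the start of the next one, which is what lets a similarity search still match a phrase that straddles a boundary.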

### Embedding costs - why no OpenAI?

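The OpenAI API charges for embeddings.
At this point there seems to be no way to pre-estimate the costs reliably.
The following calculation is probably flawed: it counts tokens with the `gpt-4` tokenizer and applies a rate of USD 0.03 per 1,000 tokens, which looks like a GPT-4 prompt price rather than an embedding price. But if it's correct, I wish the OpenAI team the best of luck with finding a new pricing model.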

In [13]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('gpt-4')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: { (0.03 / 1_000) * total_tokens}')

print_embedding_cost(text_chunks)

Total Tokens: 15769414
Embedding Cost in USD: 473.08241999999996
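
The arithmetic behind that number is just tokens divided by 1,000 times the per-1k rate, so the estimate scales linearly with whatever rate applies. A small sketch that separates the rate from the token count; `embedding_cost_usd` is a hypothetical helper, and the second rate is only a placeholder to show the scaling, not a quoted price.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Linear cost model: tokens / 1000 * price per 1000 tokens."""
    return total_tokens / 1_000 * usd_per_1k_tokens

total_tokens = 15_769_414  # token count printed by the cell above

# 0.03 is the rate used above; 0.0001 is a placeholder rate (check current pricing)
for rate in (0.03, 0.0001):
    print(f"rate {rate} USD/1k tokens -> {embedding_cost_usd(total_tokens, rate):.2f} USD")
```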


# Use Hugging Face Embeddings Sentence Transformers

Here we:

* use a self-hosted on-premises model for the embedding and vectorization
* configure it for the use with the CPU or GPU

This model is from the Beijing Academy of Artificial Intelligence
* https://huggingface.co/BAAI/bge-large-en-v1.5
* It uses: https://huggingface.co/docs/transformers/model_doc/auto

It produces embeddings with 1024 dimensions, roughly 500 fewer than the OpenAI embeddings I initially wanted to use.

## GPU detection (CUDA)

Here we detect whether a GPU is present, and if so, we initialize the model to use it later. If not, we fall back to the CPU. For this use case, an Nvidia V100 GPU is roughly 60x faster (my estimate).

In [13]:
import torch
use_cuda = torch.cuda.is_available()

USE_GPU=True

if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:',torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9)
    USE_GPU=True
    print("GPU enabled")

if not use_cuda:
    print('No CUDA available')
    USE_GPU=False


No CUDA available
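
The same CPU fallback can be written as one small function. This is a sketch, assuming that a missing PyTorch install should also count as "CPU only"; `pick_device` is a hypothetical helper, not part of the notebook.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we treat the host as CPU-only
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

model_kwargs = {"device": pick_device()}
print(model_kwargs)
```

Passing the device explicitly makes the choice visible in logs, instead of relying on the library's default device selection.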


## BAAI BERT Model

The Beijing Academy of Artificial Intelligence (BAAI) publishes state-of-the-art models on Hugging Face. Here the model is used to create the embeddings. An embedding here isn't a plain Word2Vec-style projection of text into a vector space: it also captures meaning from the surrounding context. I still have to research the details.

Bidirectional Encoder Representations from Transformers (BERT) is a language model based on the transformer architecture, notable for its dramatic improvement over previous state of the art models. It was introduced in October 2018 by researchers at Google.

In [14]:
from langchain.embeddings import HuggingFaceEmbeddings

# pre-trained model path
modelPath = "BAAI/bge-large-en-v1.5"

# Create a dictionary with model configuration options, specifying to use the CPU or GPU for computations
if not USE_GPU:
    model_kwargs = {'device':'cpu'}
else:
    model_kwargs = {}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}
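
Why `normalize_embeddings` matters for a cosine-based index can be shown with plain Python. For unit-length vectors the dot product equals the cosine similarity, and the squared Euclidean distance is `2 - 2 * cosine`, so both measures rank neighbors the same way. The vectors below are toy 3-d values, not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]  # toy vectors for illustration
ua, ub = l2_normalize(a), l2_normalize(b)
squared_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))

print("cosine of raw vectors:      ", cosine(a, b))
print("dot of normalized vectors:  ", dot(ua, ub))
print("squared L2 vs 2 - 2*cosine: ", squared_l2, 2 - 2 * dot(ua, ub))
```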

### Initialization of the Embedding model

In [15]:
# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
# this model requires sentence_transformers

embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
 )

In [16]:
vector = embeddings.embed_query(text_chunks[0].page_content)
# print(vector)
n_dimensions = len(vector)
print(n_dimensions, " dimensions are going to be used")

NameError: name 'text_chunks' is not defined

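The `NameError` above presumably comes from running this cell in a fresh kernel before the chunking cell. Once `text_chunks` is defined, the model produces one vector of 1024 dimensions per embedded text, i.e. per chunk rather than per line of the txt file.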

## Batch process the embedding

Many data-science tasks require splitting a larger processing operation into batch jobs.
Like in the good old Mainframe days.

Initially I wanted to use the basic vector DB sqlite-vss again: https://github.com/asg017/sqlite-vss

This is based on FAISS as well, but sqlite-vss doesn't seem to be able to handle concurrency. Recent sqlite versions can.

### FAISS

https://faiss.ai/ - a library for efficient similarity search and clustering of dense vectors.

### Concurrency and batch processing

We add one 1024-dimensional vector per chunk (up to 100 characters, split preferably at line breaks) to a vector store based on FAISS and LangChain.
The processing is done in batches of 500 chunks, using 3 threads in parallel.
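
The batching pattern can be tried without any model. In the sketch below, `embed_batch` is a stand-in that returns fake one-element "vectors"; only the splitting and the thread pool mirror the approach described above. The last line checks the batch count for the 723,845 text chunks reported earlier.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

def divide_chunks(items, n):
    """Yield consecutive slices of at most n items."""
    for i in range(0, len(items), n):
        yield items[i:i + n]

def embed_batch(batch):
    # Stand-in for the real embedding call: one fake "vector" per item
    return [[float(len(item))] for item in batch]

items = [f"chunk {i}" for i in range(1_234)]  # synthetic input
batches = list(divide_chunks(items, 500))

vectors = []
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(embed_batch, batch) for batch in batches]
    for future in as_completed(futures):
        # result() re-raises any exception from the worker thread
        vectors.extend(future.result())

print(len(batches), len(vectors), math.ceil(723_845 / 500))
```

Because `as_completed` yields futures in completion order, the collected vectors are not guaranteed to be in input order; that is fine for a vector store, where each entry carries its own text.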

In [19]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

from tqdm.notebook import tqdm
from typing import List, Optional
from langchain.schema.document import Document

from langchain_community.vectorstores import FAISS

In [22]:
def add_texts_in_batches(batch: List[Document], embeddings=embeddings) -> Optional[FAISS]:
    """
    Embed one batch and return it as its own FAISS store, or None on failure.

    Using type hints is a good idea here, because error messages get swallowed by the ThreadPoolExecutor.
    The exception handling serves the same purpose.
    Exceptions can cost performance, but only on the CPU level here.
    """

    try:
        return FAISS.from_documents(batch, embeddings, distance_strategy="COSINE")

    except Exception as e:
        print(f"Exception occurred in add_texts_in_batches: {e}")
        return None


def divide_chunks(chunks, n):
    """
    Divide and conquer :)
    """
    for i in range(0, len(chunks), n):
        yield chunks[i:i + n]


def vectorize_data_in_batches(chunks, embeddings):
    """
    This function orchestrates the embedding vectorization in batches.
    """

    num_workers = 3
    batch_size = 500  # Adjust based on your needs and memory constraints

    batches = list(divide_chunks(chunks, batch_size))
    faiss_db = None

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit all the batches for processing
        futures = {executor.submit(add_texts_in_batches, batch, embeddings=embeddings): batch for batch in batches}

        # Setup the tqdm progress bar
        progress_bar = tqdm(total=len(futures), desc="Processing batches")

        for future in as_completed(futures):
            # Each time a future completes, update the progress and collect the result
            progress_bar.update(1)
            try:
                db_result = future.result()  # The FAISS store returned by add_texts_in_batches
                if db_result is None:
                    continue  # This batch failed; the worker already printed the error
                if faiss_db is not None:
                    faiss_db.merge_from(db_result)
                else:
                    faiss_db = db_result

            except Exception as e:
                print(f"An error occurred: {e}")

        progress_bar.close()  # Ensure the progress bar is closed at the end

    if faiss_db is None:
        print("No batch was embedded successfully; nothing to save.")
        return

    faiss_db.save_local("faiss_index_cosine")
    print("All texts have been added to the database.")


vectorize_data_in_batches(chunks=text_chunks, embeddings=embeddings)

Processing batches:   0%|          | 0/1448 [00:00<?, ?it/s]

All texts have been added to the database.


# Similarity and MMR search

* this works on the FAISS index without a GPU
* you can retrieve the data from Kaggle: https://www.kaggle.com/mariusciepluch/faiss-text-db-infosec-archive
* the data is a FAISS index with cosine similarity
* you can use this FAISS index with MMR search


"Maximal Marginal Relevance a.k.a. MMR has been introduced in the paper The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. MMR tries to reduce the redundancy of results while at the same time maintaining query relevance of results for already ranked documents/phrases etc." (https://medium.com/tech-that-works/maximal-marginal-relevance-to-rerank-results-in-unsupervised-keyphrase-extraction-22d95015c7c5)

MMR search provides better results here.
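
A minimal sketch of the greedy MMR selection, using toy 2-d vectors instead of real embeddings. It assumes the common formulation `lambda * relevance - (1 - lambda) * redundancy`; the parameter name `lambda_mult` is chosen here for the example, and the function is not LangChain's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents, greedily trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.1], [0.8, -0.6]]  # docs 0 and 1 are exact duplicates

print("MMR:            ", mmr(query, docs, k=2))
print("relevance only: ", mmr(query, docs, k=2, lambda_mult=1.0))
```

With `lambda_mult=1.0` the duplicate is returned twice, which resembles the repeated hits in the plain similarity search further down; with the balanced setting the second pick is the less similar but non-redundant document.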

In [3]:
from langchain_community.vectorstores import FAISS

In [10]:
# Download the FAISS index via the Kaggle CLI (requires an API key, as far as I know)
!/home/marius/miniconda3/envs/llm_langchain/bin/kaggle datasets download -d mariusciepluch/faiss-text-db-infosec-archive

Downloading faiss-text-db-infosec-archive.zip to /home/marius/source/bookworm
100%|██████████████████████████████████████| 2.59G/2.59G [01:34<00:00, 35.9MB/s]
100%|██████████████████████████████████████| 2.59G/2.59G [01:34<00:00, 29.4MB/s]


In [ ]:
# Unzip the downloaded file
!7z x faiss-text-db-infosec-archive.zip

In [17]:
loaded_db = FAISS.load_local("faiss_index_cosine", embeddings, distance_strategy="COSINE", allow_dangerous_deserialization=True)

In [34]:
results = loaded_db.search(search_type="mmr", query="What is an exploit?")
print(results)

[Document(page_content="`exploit'"), Document(page_content='* **Exploit** : An exploit is a software or procedure that uses a vulnerability to effect some'), Document(page_content='exploit techniques disclosed?'), Document(page_content='Exploit-Entwicklung.')]


In [35]:
results_with_scores = loaded_db.similarity_search_with_score("What is an exploit?")
print(results_with_scores)

[(Document(page_content="`exploit'"), 0.24956527), (Document(page_content="`exploit'"), 0.24956527), (Document(page_content="`exploit'"), 0.24956527), (Document(page_content="`exploit'"), 0.24956527)]


In [32]:
retriever = loaded_db.as_retriever( search_type="mmr",)
docs = retriever.invoke("What is an exploit?")
print(docs)

[Document(page_content="`exploit'"), Document(page_content='* **Exploit** : An exploit is a software or procedure that uses a vulnerability to effect some'), Document(page_content='exploit techniques disclosed?'), Document(page_content='Exploit-Entwicklung.')]


In [41]:
retriever = loaded_db.as_retriever( search_type="mmr",  search_kwargs={'k': 50, 'fetch_k': 5000})
docs = retriever.invoke("What is an exploit and what is the process of creating it? How does reverse engineering contribute to exploit development?")
print(docs)

[Document(page_content='### Exploit Development'), Document(page_content='Reverse engineering, specific to computer science, is the act of deriving a'), Document(page_content='This article is about the reverse engineering of the exploit found in the leak'), Document(page_content='the Software Exploitation challenges and I designed all the Reverse\nEngineering challenges.'), Document(page_content='understand the thought process behind reverse engineering modern malware of'), Document(page_content='# Gdbinit | Reverse Engineering Mac OS X\n**Created:**| _1/3/2012 4:12:17 PM_  \n---|---'), Document(page_content='series. First thing said: “what is an exploit?”. It could be described as a'), Document(page_content='How might an attacker benefit from capturing or modifying the data?'), Document(page_content='to do Reverse Engineering and exploit development, yet some of it is required'), Document(page_content='hacking, crack, hack, unlock, unprotect, break, reverse engineer, recover,'), Docum

# Use the FAISS index with Mistral

In [42]:
from langchain_community.llms import Ollama
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())

llm = Ollama(model="mistral")

## Pass MMR search results to Mistral

* I am using Ollama and Mistral, self-hosted
* Mistral is a large language model that can be used for text generation and QA
* Here Mistral generates the answer from the MMR search results
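
Conceptually, a "stuff"-style chain concatenates the retrieved passages into one prompt and sends that to the model. The sketch below shows the idea only: `build_stuffed_prompt` and its prompt wording are hypothetical, not LangChain's actual template, and the two passages are copied from the search output above.

```python
def build_stuffed_prompt(question: str, passages: list[str]) -> str:
    """Join all retrieved passages into a single context block ahead of the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

passages = [
    "* **Exploit** : An exploit is a software or procedure that uses a vulnerability to effect some",
    "Reverse engineering, specific to computer science, is the act of deriving a",
]
prompt = build_stuffed_prompt("What is an exploit?", passages)
print(prompt)
```

Since every passage lands in one prompt, the retriever's `k` directly controls prompt length; with short 100-character chunks a large `k` stays manageable.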

In [43]:
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

In [45]:
query = "What is an exploit and how does the process of reverse engineering contribute to exploit development?"
answer = chain.invoke(query)
print(answer)

{'query': 'What is an exploit and how does the process of reverse engineering contribute to exploit development?', 'result': " An exploit is a software or procedure that uses a vulnerability to effect some unwanted or unintended action in a system. The process of reverse engineering contributes significantly to exploit development as it involves understanding the inner workings of a software or system, identifying vulnerabilities, and developing code (exploits) to take advantage of those vulnerabilities. Reverse engineering tools and techniques enable researchers and attackers to analyze software, disassemble code, and modify it to create exploits. Exploit development requires reverse engineering skills, custom shellcode payloads, and a deep understanding of the target system's vulnerabilities and exploitability."}


# Sandbox code - test area

In [26]:
from langchain_community.vectorstores import FAISS

texts = ["FAISS is an important library", "LangChain supports FAISS"]
faiss = FAISS.from_texts(texts, embeddings, distance_strategy="COSINE")
print(type(faiss))

faiss.save_local("test")

new_db = FAISS.load_local("test", embeddings, allow_dangerous_deserialization=True)

<class 'langchain_community.vectorstores.faiss.FAISS'>
