# This is an experiment: create vectorized embeddings out of an Evernote DB (PDF, DOCX, HTML, TXT)

## Features

* vectorize text, HTML, PDF and DOCX files into one vector DB, split into tables (sqlite-vss)
* use local self-hosted embeddings (computed on CPU or GPU)
  * for sentences
* query a local sqlite-vss vector DB, use the LangChain cache (SQLite)
* use the OpenAI API and Mistral (self-hosted on-premises via Ollama) for the response processing
* compare with LLMware Bling

## Anti-Features

* due to cost reasons the OpenAI embeddings don't get used. So sorry :p

# Configuration

In [2]:
import sys
import os
IN_COLAB = 'google.colab' in sys.modules

if not IN_COLAB:
    # The Evernote DB path containing the extracted data.
    extracted_evernote_db = "/home/marius/data/it-sec-research-extracted/IT sec research"

    # Output paths containing the Evernote text notes or documents data.
    # These get generated by the data extraction process
    output_path_extracted_notes = "/home/marius/source/bookworm/export.txt"
    output_path_extracted_docs = "/home/marius/source/bookworm/export.documents.txt"

    # Resulting DB or vector store path.
    result_db = "/home/marius/source/bookworm/evernote.db"

else:
    # For the Google Colab environment
    output_path_extracted_notes = "/content/export.txt"
    output_path_extracted_docs = "/content/export.documents.txt"
    result_db = "/content/evernote.db"

# Set explicitly to suppress the Hugging Face tokenizers fork warning
os.environ["TOKENIZERS_PARALLELISM"] = "True"

In [None]:
# Controls:

In [3]:
print(output_path_extracted_notes)

/content/export.txt


In [4]:
%reload_ext autoreload
%autoreload 2

## Dependencies

* Cryptography is used to handle some PDF functions here (signatures)

In [5]:
%pip show cryptography

Name: cryptography
Version: 42.0.5
Summary: cryptography is a package which provides cryptographic recipes and primitives to Python developers.
Home-page: 
Author: 
Author-email: The Python Cryptographic Authority and individual contributors <cryptography-dev@python.org>
License: Apache-2.0 OR BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: cffi
Required-by: pyOpenSSL


* pikepdf is used to repair some PDFs

In [None]:
%pip show pikepdf

Name: pikepdf
Version: 8.13.0
Summary: Read and write PDFs with Python, powered by qpdf
Home-page: 
Author: 
Author-email: "James R. Barlow" <james@purplerock.ca>
License: MPL-2.0
Location: /home/marius/miniconda3/envs/llm_langchain/lib/python3.11/site-packages
Requires: Deprecated, lxml, packaging, Pillow
Required-by: 
Note: you may need to restart the kernel to use updated packages.


* pypdf with all features is needed because this DB consists of 100+ PDFs

In [None]:
%pip show "pypdf"

Name: pypdf
Version: 4.0.2
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page: 
Author: 
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License: 
Location: /home/marius/miniconda3/envs/llm_langchain/lib/python3.11/site-packages
Requires: 
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [6]:
%pip show torch

Name: torch
Version: 2.2.1+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: fastai, sentence-transformers, torchaudio, torchdata, torchtext, torchvision


In [8]:
%pip show faiss_gpu

Name: faiss-gpu
Version: 1.7.2
Summary: A library for efficient similarity search and clustering of dense vectors.
Home-page: https://github.com/kyamagu/faiss-wheels
Author: Kota Yamaguchi
Author-email: KotaYamaguchi1984@gmail.com
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 


## Text extraction

* Here the html and text data is extracted into one txt file
* The PDF and DOCX data is extracted into another txt file

This will be used for weighted data fusion later.

* the texts are normalized:
  * unicode normalization
  * surrogate characters get replaced
  * html gets converted to text
  * pdfs get repaired
  * docx files get read

* exceptions get handled (UTF-16 issues, PDF reference errors)
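
To make the two normalization steps concrete, here is a minimal sketch using only the standard library. It follows the same approach as the functions in the next cell (NFKD normalization, then an encode/decode round trip with the `replace` error handler). The function name `clean_text` is made up for this illustration.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Illustrative helper (hypothetical name): normalize, then replace surrogates."""
    # NFKD decomposes compatibility characters, e.g. the ligature U+FB01 becomes "f" + "i"
    normalized = unicodedata.normalize("NFKD", text)
    # Lone surrogates cannot be encoded as UTF-8; "replace" swaps each one for "?"
    return normalized.encode("utf-8", "replace").decode("utf-8")


print(clean_text("\ufb01le"))        # the ligature is decomposed into two letters
print(clean_text("bad\ud800char"))   # the lone surrogate is replaced
```

Note that NFKD also splits accented letters into a base letter plus a combining mark, so the normalized text can be longer than the input.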

In [None]:
import glob
import os

import unicodedata  # to normalize text
import html2text  # to convert html to text
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader
import pikepdf  # to repair PDFs
from pathlib import Path
from tqdm.notebook import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

def convert_html_to_text(html_blob: str) -> str:
    """
    Converts a html blob into a string.
    """
    h = html2text.HTML2Text()
    h.mark_code = True
    h.escape_snob = True
    h.unicode_snob = True
    # h.use_automatic_links = True
    h.images_as_html = True
    h.single_line_break = True
    h.ignore_links = True
    return h.handle(html_blob)

def normalize_text(txt_blob: str) -> str:
    """
    Normalize a text blob using NFKD normalization.
    """
    return unicodedata.normalize("NFKD", txt_blob)

def repair_pdf(file_path: str) -> bool:
    """
    Attempts to repair a PDF file using pikepdf.
    """
    try:
        with pikepdf.open(file_path, allow_overwriting_input=True) as pdf:
            pdf.save(file_path)
        return True
    except pikepdf.PdfError as e:
        print(f"Failed to repair PDF {file_path}: {e}")
        return False

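def read_and_convert_file(file_path: str, is_html: bool, is_pdf: bool, is_docx: bool) -> str:
    """
    Reads and converts a file from HTML, PDF, DOCX, or plain text to text.
    :param file_path: path of the file to read
    :param is_html: True if the file is an HTML file
    :param is_pdf: True if the file is a PDF
    :param is_docx: True if the file is a DOCX document
    :return: the extracted text, or an empty string if the file could not be read
    """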

    content = ""
    if is_html:
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            # Normalize HTML-derived text as well, like the other file types
            return normalize_text(convert_html_to_text(content))
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            return ""

    elif is_pdf:
        try:
            loader = PyPDFLoader(file_path)
            # ... fixes "Multiple definitions in dictionary at byte 0xb32 for key /ExtGState" error
            documents = loader.load()
            content = "\n".join(doc.page_content for doc in documents if hasattr(doc, 'page_content'))
        except Exception as e:
            print(f"Error loading PDF {file_path}: {e}. Attempting to repair...")
            if repair_pdf(file_path):
                try:
                    loader = PyPDFLoader(file_path)
                    documents = loader.load()
                    content = "\n".join(doc.page_content for doc in documents if hasattr(doc, 'page_content'))
                except Exception as e:
                    print(f"Failed to process PDF {file_path} after repair: {e}")
                    return ""
        return normalize_text(content)

    elif is_docx:
        try:
            loader = Docx2txtLoader(file_path)
            # load() returns a list of Document objects, so join their text content
            documents = loader.load()
            content = "\n".join(doc.page_content for doc in documents)
        except Exception as e:
            print(f"Error reading DOCX {file_path}: {e}")
            return ""
        return normalize_text(content)

    else:  # For plain text files
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return normalize_text(file.read())
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            return ""

def sanitize_text(text):
    """
    Removes or replaces surrogate characters from a string.
    """
    return text.encode('utf-8', 'replace').decode('utf-8')

import threading

# Serializes writes, because process_file runs in several threads at once
_output_lock = threading.Lock()

def append_to_output(data: str, is_pdf: bool, is_docx: bool, output_path: str):
    """
    Appends sanitized data to an output file.
    """
    sanitized_data = sanitize_text(data)
    if is_pdf or is_docx:
        output_path = str(Path(output_path).with_suffix('')) + ".documents.txt"

    with _output_lock:
        with open(output_path, "a", encoding='utf-8') as output_file:
            output_file.write(sanitized_data)

def process_file(file):
    is_html = file.endswith('.html')
    is_pdf = file.endswith('.pdf')
    is_docx = file.endswith('.docx')

    file_content = read_and_convert_file(file, is_html, is_pdf, is_docx)
    # Notes go to the notes export; append_to_output derives the ".documents.txt" path for PDF and DOCX files
    append_to_output(file_content, is_pdf, is_docx, output_path=output_path_extracted_notes)

def process_files_in_directory(directory: str):
    txt_html_files = glob.glob(os.path.join(directory, "*.txt")) + glob.glob(os.path.join(directory, "*.html"))
    pdf_docx_files = glob.glob(os.path.join(directory, "img", "*.pdf")) + glob.glob(os.path.join(directory, "img", "*.docx"))
    all_files = txt_html_files + pdf_docx_files

    # Initialize the progress bar
    pbar = tqdm(total=len(all_files), desc="Processing files")

    with ThreadPoolExecutor(max_workers=3) as executor:
        # Submit all files to the executor and store future objects
        futures = [executor.submit(process_file, file) for file in all_files]

        # As tasks complete, update the progress bar
        for future in as_completed(futures):
            pbar.update(1)  # Update the progress bar by one for each task completed

    # Ensure the progress bar is closed upon completion
    pbar.close()

process_files_in_directory(extracted_evernote_db)

ModuleNotFoundError: No module named 'html2text'

## Chunking of the texts

The texts need to be split into chunks (pre-processing) before they can be embedded.

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text_data(txt_file=output_path_extracted_notes):

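    # The export files are written as UTF-8, so read them back with the same encoding
    with open(txt_file, encoding='utf-8') as f: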
        text_notes = f.read()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=20,
        length_function=len
    )

    chunks = text_splitter.create_documents([text_notes])
    print(f'Now you have {len(chunks)} chunks in {txt_file}')
    return chunks

# chunk individual text file containing the data
text_chunks = chunk_text_data(txt_file=output_path_extracted_notes)
doc_chunks = chunk_text_data(txt_file=output_path_extracted_docs)

Now you have 723845 chunks in /content/export.txt
Now you have 151259 chunks in /content/export.documents.txt
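
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a simplified sketch and not the algorithm of `RecursiveCharacterTextSplitter`, which additionally prefers to split at separators such as paragraph and line breaks. The function name is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(250))
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

With these settings each window starts 80 characters after the previous one, so the last 20 characters of one chunk are repeated at the start of the next.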


### Embedding costs - why no OpenAI?

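The OpenAI API charges for embeddings per token.
At this point there seems to be no way to pre-estimate the costs reliably.
The following calculation is probably flawed: the rate of 0.03 USD per 1,000 tokens looks like a chat-model prompt price rather than an embedding price, so it likely overestimates the cost.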

In [None]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('gpt-4')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: { (0.03 / 1_000) * total_tokens}')

print_embedding_cost(text_chunks)
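
A variant that keeps the price and the tokenizer as explicit parameters makes the assumptions behind such an estimate visible. This is only a sketch: `estimate_embedding_cost`, `whitespace_token_count` and `PLACEHOLDER_PRICE` are hypothetical names, and the price is a placeholder rather than a real OpenAI rate.

```python
from typing import Callable, Iterable, Tuple


def estimate_embedding_cost(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    usd_per_1k_tokens: float,
) -> Tuple[int, float]:
    """Return (total_tokens, estimated_cost_in_usd) for the given texts."""
    total_tokens = sum(count_tokens(text) for text in texts)
    return total_tokens, total_tokens / 1000 * usd_per_1k_tokens


def whitespace_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts whitespace-separated words
    return len(text.split())


# PLACEHOLDER_PRICE is an assumption for illustration, not a real OpenAI price
PLACEHOLDER_PRICE = 0.5
tokens, cost = estimate_embedding_cost(["a b c", "d e"], whitespace_token_count, PLACEHOLDER_PRICE)
print(f"Total tokens: {tokens}, estimated cost in USD: {cost:.4f}")
```

In the notebook, `count_tokens` could wrap a `tiktoken` encoder like the one created in the cell above, and the price would have to be taken from the current OpenAI price list for the chosen embedding model.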

## Use Hugging Face Embeddings Sentence Transformers

Here we:

* use a self-hosted, on-premises model for the embedding and vectorization
* configure it for use with the CPU or GPU

This model is from the Beijing Academy of Artificial Intelligence
* https://huggingface.co/BAAI/bge-large-en-v1.5
* It uses: https://huggingface.co/docs/transformers/model_doc/auto

It produces embeddings with 1024 dimensions, roughly 500 fewer than the 1536 dimensions of OpenAI's text-embedding-ada-002 embeddings.

In [10]:
import torch
use_cuda = torch.cuda.is_available()

USE_GPU=False

if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:',torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:',torch.cuda.get_device_properties(0).total_memory/1e9)
    USE_GPU=True
    print("GPU enabled")

if not use_cuda:
    print('No CUDA available')

__CUDNN VERSION: 8902
__Number CUDA Devices: 1
__CUDA Device Name: Tesla V100-SXM2-16GB
__CUDA Device Total Memory [GB]: 16.935682048
GPU enabled


In [11]:
from langchain.embeddings import HuggingFaceEmbeddings

# pre-trained model path
modelPath = "BAAI/bge-large-en-v1.5"

# Create a dictionary with model configuration options, specifying to use the CPU or GPU for computations
if not USE_GPU:
    model_kwargs = {'device': 'cpu'}
else:
    # No explicit device: sentence-transformers then selects CUDA when it is available
    model_kwargs = {}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}
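
Why `normalize_embeddings` matters: once vectors are scaled to unit length, their dot product equals their cosine similarity, so a plain inner-product comparison behaves like cosine similarity. A small sketch in plain Python, with made-up three-dimensional vectors standing in for the 1024-dimensional embeddings (all function names are illustrative):

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: List[float]) -> List[float]:
    # Assumes a non-zero vector; a zero vector would cause a division by zero
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

# Both values agree up to floating-point rounding
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```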

In [12]:
# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
# this model requires sentence_transformers

embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
 )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [13]:
vector = embeddings.embed_query(text_chunks[0].page_content)
# print(vector)
n_dimensions = len(vector)
print(n_dimensions, " dimensions are going to be used")

1024  dimensions are going to be used


## Batch process the embedding

Many data-science tasks require splitting a larger processing operation into batch jobs,
like in the good old mainframe days.

The vector DB: https://github.com/asg017/sqlite-vss
Basis: https://faiss.ai/ - a library for efficient similarity search and clustering of dense vectors.
The code in this section builds the FAISS index directly through LangChain.

We add vectors of 1024 dimensions per chunk (sentence, line break delimited) to the DB.
The processing is done in batches of 50 chunks, using 3 threads.
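
The number of batch jobs follows from the chunk count and the batch size. A quick sanity check, using the chunk count that the chunking cell printed for the notes file; the helper names are hypothetical:

```python
from typing import Iterator, Sequence


def count_batches(n_items: int, batch_size: int) -> int:
    # Integer ceiling division: a partially filled last batch still counts
    return (n_items + batch_size - 1) // batch_size


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(count_batches(723_845, 50))  # 14477
print([len(batch) for batch in iter_batches(list(range(120)), 50)])  # [50, 50, 20]
```

The result of 14477 matches the total shown by the "Processing batches" progress bar in the output of the run.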

In [14]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

from tqdm.notebook import tqdm  # Import tqdm for notebook
from typing import List
from langchain.schema.document import Document

from langchain_community.vectorstores import FAISS

In [16]:
from typing import Optional

def add_texts_in_batches(batch: List[Document], sqlite_table: str = "evernote", embeddings=embeddings) -> Optional[FAISS]:
    """
    Embeds one batch and returns it as a FAISS index, or None if the batch failed.

    Using type hints is a good idea here, because error messages get swallowed by the ThreadPoolExecutor.
    The exception handling serves the same purpose.

    Note: sqlite_table is currently unused.
    """

    try:
        db = FAISS.from_documents(batch, embeddings)
        return db

    except Exception as e:
        print(f"Exception occurred in add_texts_in_batches: {e}")
        return None

def divide_chunks(chunks, n):
    """
    Divide and conquer
    """
    for i in range(0, len(chunks), n):
        yield chunks[i:i + n]


def vectorize_data_in_batches(chunks, embeddings):
    num_workers = 3
    batch_size = 50  # Adjust based on your needs and memory constraints

    batches = list(divide_chunks(chunks, batch_size))
    faiss_db = None  # Merged index; stays None until the first batch succeeds

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit all the batches for processing
        futures = {executor.submit(add_texts_in_batches, batch, embeddings=embeddings): batch for batch in batches}

        # Setup the tqdm progress bar
        progress_bar = tqdm(total=len(futures), desc="Processing batches")

        for future in as_completed(futures):
            # Each time a future completes, update the progress and collect the result
            progress_bar.update(1)
            try:
                db_result = future.result()  # The value returned by add_texts_in_batches
                if db_result is None:
                    continue  # The batch failed; its error was already printed
                if faiss_db is None:
                    faiss_db = db_result
                else:
                    # merge_from() merges in place and returns None, so do not reassign
                    faiss_db.merge_from(db_result)

            except Exception as e:
                print(f"An error occurred: {e}")

        progress_bar.close()  # Ensure the progress bar is closed at the end

    if faiss_db is None:
        print("No batch could be embedded; nothing was saved.")
        return

    print("All texts have been added to the database.")
    faiss_db.save_local("faiss_index")

vectorize_data_in_batches(chunks=text_chunks, embeddings=embeddings)

[autoreload of faiss.swigfaiss failed: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "/usr/local/lib/python3.10/dist-packages/IPython/extensions/autoreload.py", line 394, in superreload
    module = reload(module)
  File "/usr/lib/python3.10/imp.py", line 315, in reload
    return importlib.reload(module)
  File "/usr/lib/python3.10/importlib/__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 619, in _exec
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/faiss/swigfaiss.py", line 711, in <module>
    class ParameterRangeVector(object):
  File "/usr/local/lib/python3.10/dist-packages/faiss/swigfaiss.py", line 738, in ParameterRangeVect

Exception occurred in add_texts_in_batches: name 'swig_ptr' is not defined


Processing batches:   0%|          | 0/14477 [00:00<?, ?it/s]

KeyboardInterrupt: 
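
On the remark that errors get swallowed by the `ThreadPoolExecutor`: an exception raised inside a worker is stored on its future and only re-raised when `future.result()` is called. A minimal, self-contained sketch with a made-up task function (`risky_task` and `run_all` are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable, Tuple


def risky_task(n: int) -> int:
    # Made-up workload: fails for odd inputs
    if n % 2:
        raise ValueError(f"odd input: {n}")
    return n * 10


def run_all(inputs: Iterable[int], max_workers: int = 3) -> Tuple[Dict[int, int], Dict[int, str]]:
    results: Dict[int, int] = {}
    errors: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed
        futures = {executor.submit(risky_task, n): n for n in inputs}
        for future in as_completed(futures):
            n = futures[future]
            try:
                results[n] = future.result()  # re-raises the worker's exception, if any
            except ValueError as exc:
                errors[n] = str(exc)
    return results, errors


results, errors = run_all(range(4))
print(results)
print(errors)
```

If `future.result()` is never called, the failure stays unnoticed, which is why collecting the results inside a `try` block is worthwhile.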