# Attention-based log-similarity clustering with FAISS and Linformer

## File Placement

Place the following file in the same directory as this notebook:

`lab_logs_blindtest_activity_sysmon_1000samples_july_28_2024_filtered_with_vectors_clean.parquet`

**SHA1 Hash:** `8349713e82c50b0c747b05e085e533d4b01e833a`

You can download the file from [Kaggle](https://www.kaggle.com/datasets/mariusciepluch/log2ml-blindtest-maldoc-activity-capture).
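
To confirm that the download matches the SHA1 hash above, the check can be sketched with the standard library. This is a minimal sketch, not part of the original notebook: `sha1_of_file` is an illustrative helper name, and the file is assumed to sit in the current working directory.

```python
import hashlib
from pathlib import Path

# Values copied from the "File Placement" section above
EXPECTED_SHA1 = "8349713e82c50b0c747b05e085e533d4b01e833a"
DATASET = Path(
    "lab_logs_blindtest_activity_sysmon_1000samples_july_28_2024_filtered_with_vectors_clean.parquet"
)


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if DATASET.exists():
    actual = sha1_of_file(DATASET)
    print("SHA1 matches" if actual == EXPECTED_SHA1 else f"SHA1 mismatch: {actual}")
else:
    print(f"{DATASET} not found in the current directory")
```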

## Package Installation

Install the required packages by running the following command:

```bash
pip install -r requirements.cpu.txt
```

The `requirements.cpu.txt` file is located in the `/dependencies` folder.


# Preparation: read the parquet file with the trace and vector data (Linformer)
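
The loading cell below decodes the `message_vector_str` column with `json.loads`. As a minimal sketch of that round trip, assuming the vectors were originally written with `json.dumps` (the export step is not shown in this notebook, and `vector_to_string` is a hypothetical helper), plain lists stand in for NumPy arrays:

```python
import json


def vector_to_string(vector):
    """Hypothetical export helper: encode a vector as a JSON string."""
    return json.dumps([float(x) for x in vector])


def string_to_list(s):
    """Decode a JSON string back into a list of floats."""
    return json.loads(s)


original = [0.35, 0.2, 0.26]  # Made-up values for illustration
encoded = vector_to_string(original)
decoded = string_to_list(encoded)

print(encoded)
print(decoded == original)
```

In the notebook itself, the decoded list is additionally wrapped in `np.array(...)` so that each row carries a NumPy vector.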

In [5]:
import pandas as pd
import numpy as np
import json

# Read from Parquet using pandas
pdf_read = pd.read_parquet("lab_logs_blindtest_activity_sysmon_1000samples_july_28_2024_filtered_with_vectors_clean.parquet")

# Function to convert JSON-encoded strings back to numpy arrays
def string_to_vector(s):
    return np.array(json.loads(s))

# Convert JSON strings back to numpy arrays
pdf_read['message_vector'] = pdf_read['message_vector_str'].apply(string_to_vector)

# Verify the shape of the vector
print("Vector lengths:")
print(pdf_read['message_vector'].apply(len).head())

# Check a sample vector to ensure dimensionality is preserved
sample_vector = pdf_read['message_vector'].iloc[0]
print(f"\nSample vector shape: {sample_vector.shape}")

# Drop the string column as it's no longer needed
pdf_read = pdf_read.drop(columns='message_vector_str')

# Verify vector lengths (equivalent to the Polars operation)
print("\nVector lengths (pandas equivalent of Polars operation):")
print(pdf_read['message_vector'].apply(len).head())

# If you need to see the full DataFrame structure
print("\nDataFrame info:")
pdf_read.info()

# If you want to see the first few rows of the DataFrame
print("\nFirst few rows of the DataFrame:")
print(pdf_read.head())

Vector lengths:
0 30000
1 30000
2 30000
3 30000
4 30000
Name: message_vector, dtype: int64

Sample vector shape: (30000,)

Vector lengths (pandas equivalent of Polars operation):
0 30000
1 30000
2 30000
3 30000
4 30000
Name: message_vector, dtype: int64

DataFrame info:

RangeIndex: 13455 entries, 0 to 13454
Data columns (total 16 columns):
 # Column Non-Null Count Dtype 
--- ------ -------------- ----- 
 0 @timestamp 13455 non-null datetime64[us, UTC]
 1 host.hostname 13455 non-null object 
 2 host.ip 13455 non-null object 
 3 log.level 13455 non-null object 
 4 winlog.event_id 13455 non-null int64 
 5 winlog.task 13455 non-null object 
 6 message 13455 non-null object 
 7 filtered_message 13455 non-null object 
 8 image 13455 non-null object 
 9 target_filename 13455 non-null object 
 10 parent_image 13455 non-null object 
 11 text 13455 non-null object 
 12 temp_folder 13455 non-null object 
 13 filename 13455 non-null object 
 14 label 13455 non-null object 
 15 message_vector 1345

## Loading the dataset

`X` holds the message vectors as a 2-D NumPy array with one row per log message; `X` is the conventional name for a feature matrix.

In [6]:
import numpy as np

df_f = pdf_read

print(df_f)

print()
# Convert the 'message_vector' column to a NumPy array
X = np.array(df_f['message_vector'].to_list())
print("Original data shape:", X.shape)
del pdf_read  # Removes only the name; df_f still references the same DataFrame, so no memory is freed here

 @timestamp host.hostname \
0 2024-07-28 15:08:24.277000+00:00 win10 
1 2024-07-28 15:08:24.488000+00:00 win10 
2 2024-07-28 15:08:25.005000+00:00 win10 
3 2024-07-28 15:08:25.005000+00:00 win10 
4 2024-07-28 15:08:25.030000+00:00 win10 
... ... ... 
13450 2024-07-28 23:35:53.054000+00:00 win10 
13451 2024-07-28 23:35:53.054000+00:00 win10 
13452 2024-07-28 23:35:54.133000+00:00 win10 
13453 2024-07-28 23:35:54.133000+00:00 win10 
13454 2024-07-28 23:41:55.301000+00:00 win10 

 host.ip log.level winlog.event_id \
0 fe80::c1af:35de:6006:d4cf information 3 
1 fe80::c1af:35de:6006:d4cf information 3 
2 fe80::c1af:35de:6006:d4cf information 10 
3 fe80::c1af:35de:6006:d4cf information 10 
4 fe80::c1af:35de:6006:d4cf information 10 
... ... ... ... 
13450 fe80::c1af:35de:6006:d4cf information 10 
13451 fe80::c1af:35de:6006:d4cf information 10 
13452 fe80::c1af:35de:6006:d4cf information 10 
13453 fe80::c1af:35de:6006:d4cf information 10 
13454 fe80::c1af:35de:6006:d4cf information 1 

 winlo

## Direct insertion into FAISS

No pre-processing.

In [78]:
import numpy as np
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain.embeddings.base import Embeddings

# Assuming df_f is your DataFrame and X is your numpy array of vectors
X = np.array(df_f['message_vector'].to_list())

# Create Document objects
documents = [Document(page_content=text, metadata={'index': i}) for i, text in enumerate(df_f['message'])]

# Create a custom Embeddings class for pre-computed vectors
class PrecomputedEmbeddings(Embeddings):
    def __init__(self, vectors):
        self.vectors = vectors

    def embed_documents(self, texts):
        # Return all vectors, assuming order matches
        return self.vectors.tolist()

    def embed_query(self, text):
        # This method is required but won't be used for indexing
        # Return a zero vector of the same dimension as your embeddings
        return np.zeros(self.vectors.shape[1]).tolist()

    # Convenience alias for embed_query; the Embeddings interface itself
    # only requires embed_documents and embed_query
    def embed_text(self, text):
        return self.embed_query(text)

# Create embeddings object
embeddings = PrecomputedEmbeddings(X)

# Create FAISS index
db = FAISS.from_documents(documents, embeddings, distance_strategy="COSINE")

# Save the index locally
db.save_local("faiss_index_sysmon_cosine")

print("FAISS index created and saved successfully.")

FAISS index created and saved successfully.


# Preparation: init the vector function (Linformer) - CPU

This is for the search queries.

In [14]:
from linformer_pytorch import LinformerLM
import torch
from tokenizers import Tokenizer

# Define the device
device = torch.device("cpu")

print("This uses a " + str(device) + " device")

# Load the custom tokenizer
tokenizer = Tokenizer.from_file("log_tokenizer.json")

# Initialize the Linformer model
linformer_model = LinformerLM(
 num_tokens=30000,
 input_size=700,
 channels=64,
 dim_k=128,
 dim_ff=128,
 dropout_ff=0.15,
 nhead=4,
 depth=2,
 dropout=0.1,
 activation="gelu",
 checkpoint_level="C0",
 parameter_sharing="layerwise",
 emb_dim=128,
).to(device)

def vectorize_text(text):
    MAX_LENGTH = 700  # Matches input_size of the Linformer model above

    # Tokenize using the custom tokenizer
    encoded = tokenizer.encode(text)

    # Get token IDs
    input_ids = encoded.ids

    # Truncate or zero-pad so the sequence is exactly MAX_LENGTH long
    input_ids = input_ids[:MAX_LENGTH]
    input_ids = input_ids + [0] * (MAX_LENGTH - len(input_ids))

    # Convert to a PyTorch tensor on the selected device
    input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

    # Get the model outputs without tracking gradients
    with torch.no_grad():
        outputs = linformer_model(input_ids)

    # Average over the sequence dimension to get one vector per input
    vector = outputs.mean(dim=1)
    return vector.cpu().numpy()

This uses a cpu device
Vector shape: (1, 30000)
Vector: [[0.3499783 0.20774072 0.2604245 ... 0.21256167 0.11159717 0.01982626]]
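
`vectorize_text` forces every token sequence to exactly `MAX_LENGTH` IDs before it reaches the model. A minimal sketch of that step in isolation, assuming a padding ID of 0 as in the cell above (`pad_or_truncate` is an illustrative name, not part of the notebook):

```python
def pad_or_truncate(token_ids, max_length=700, pad_id=0):
    """Return a copy of token_ids that is exactly max_length long."""
    # Slicing clips long sequences and leaves short ones unchanged
    clipped = list(token_ids[:max_length])
    # Multiplying by zero yields an empty list, so full-length inputs get no padding
    return clipped + [pad_id] * (max_length - len(clipped))


short = pad_or_truncate([5, 9, 2], max_length=6)
long_ = pad_or_truncate(list(range(10)), max_length=6)

print(short)  # [5, 9, 2, 0, 0, 0]
print(long_)  # [0, 1, 2, 3, 4, 5]
```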


# Test 1: using FAISS for string matches

In [70]:
# Vectorize a message of interest

interesting_log_line = r"""
TargetFilename: C:\Users\student\AppData\Local\Temp\file.exe
"""

# Vectorize the message
vectorized_log = vectorize_text(interesting_log_line)
print("Vectorized log shape:", vectorized_log.shape)

Vectorized log shape: (1, 30000)


In [54]:
import numpy as np
import faiss
from sklearn.preprocessing import normalize

# Function to find partial matches
def find_partial_matches(db, interesting_log_line, excluded_strings):
    needle = interesting_log_line.strip()
    matches = []
    for i in range(db.index.ntotal):
        doc = db.docstore.search(db.index_to_docstore_id[i])
        content = doc.page_content
        # Keep documents that contain the line and none of the excluded strings
        if needle in content and not any(excluded in content for excluded in excluded_strings):
            matches.append((doc, i))
    return matches

# Ensure the query vector is a 2D numpy array of float32
query_vector = vectorized_log.astype(np.float32)

# Normalize the query vector
query_vector = normalize(query_vector)

# Debugging: Print the shape and type of query_vector
print("Processed query_vector shape:", query_vector.shape)
print("Processed query_vector type:", type(query_vector))

# Parameters
excluded_strings = [
 "Image: C:\\Users\\student\\AppData\\Local\\miniconda3\\python.exe",
 "Image: C:\\Program Files (x86)\\Microsoft\\EdgeUpdate\\",
 "Image: C:\\Program Files\\Avast Software\\Avast",
 "SourceImage: C:\\ProgramData\\Microsoft\\Windows Defender\\platform\\4.18.2011.6-0\\MsMpEng.exe",
 "Image: C:\\Users\\student\\AppData\\Local\\Microsoft\\Teams\\current\\Teams.exe",
 "Image: C:\\Program Files (x86)\\Microsoft\\Edge\\Application",
 "SourceImage: C:\\Windows"
]
max_matches_to_print = 3 # Number of partial matches to process and print

try:
    # Get the raw FAISS index
    raw_index = db.index

    # Print index type for debugging
    print(f"Index type: {type(raw_index)}")

    # Find partial matches
    partial_matches = find_partial_matches(db, interesting_log_line, excluded_strings)

    if not partial_matches:
        print("No partial matches found for the interesting_log_line (excluding specified strings).")
    else:
        print(f"Found {len(partial_matches)} partial matches. Printing details for the first {max_matches_to_print}:")

        for match_num, (match, match_index) in enumerate(partial_matches[:max_matches_to_print], 1):
            print(f"\nPartial match {match_num}:")
            print(f"Message: {match.page_content[:300]}...")  # Print first 300 characters
            print(f"Metadata: {match.metadata}")

except Exception as e:
    print(f"Error occurred: {e}")
    print("FAISS index info:")
    print(f"Index size: {db.index.ntotal}")
    print(f"Index dimension: {db.index.d}")

Processed query_vector shape: (1, 30000)
Processed query_vector type: 
Index type: 
Found 114 partial matches. Printing details for the first 3:

Partial match 1:
Message: File created:
RuleName: EXE
UtcTime: 2024-07-28 15:12:53.459
ProcessGuid: {18e8265a-5fef-66a6-f701-000000004400}
ProcessId: 10072
Image: C:\Program Files\Microsoft Office\Root\Office16\EXCEL.EXE
TargetFilename: C:\Users\student\AppData\Local\Temp\file.exe
CreationUtcTime: 2024-07-23 14:24:50.520...
Metadata: {'index': 620}

Partial match 2:
Message: File created:
RuleName: EXE
UtcTime: 2024-07-28 15:44:34.527
ProcessGuid: {18e8265a-675e-66a6-1905-000000004400}
ProcessId: 8708
Image: C:\Program Files\Microsoft Office\Root\Office16\EXCEL.EXE
TargetFilename: C:\Users\student\AppData\Local\Temp\file.exe
CreationUtcTime: 2024-07-23 14:24:50.520...
Metadata: {'index': 4908}

Partial match 3:
Message: File created:
RuleName: EXE
UtcTime: 2024-07-28 15:52:39.361
ProcessGuid: {18e8265a-6942-66a6-6a05-000000004400}
ProcessId: 

# Test 2: using FAISS for similarity search (Cosine distance)

In [None]:
# Vectorize a message of interest

interesting_log_line = r"""
File created:
RuleName: EXE
Image: C:\Program Files\Microsoft Office\Root\Office16\EXCEL.EXE
TargetFilename: C:\Users\student\AppData\Local\Temp\file.exe
"""

# Vectorize the message
vectorized_log = vectorize_text(interesting_log_line)
print("Vectorized log shape:", vectorized_log.shape)

In [75]:
import numpy as np
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain.embeddings.base import Embeddings
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

# Assuming df_f is your DataFrame and it contains the 'message_vector' and 'message' columns
X = np.array(df_f['message_vector'].to_list())

# Apply StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply TruncatedSVD; because StandardScaler has already centered the data, this is equivalent to PCA
pca = TruncatedSVD(n_components=500, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Create Document objects
documents = [Document(page_content=text, metadata={'index': i}) for i, text in enumerate(df_f['message'])]

# Create a custom Embeddings class for pre-computed vectors
class PrecomputedEmbeddings(Embeddings):
    def __init__(self, vectors):
        self.vectors = vectors

    def embed_documents(self, texts):
        # Return all vectors, assuming order matches
        return self.vectors.tolist()

    def embed_query(self, text):
        # This method is required but won't be used for indexing
        # Return a zero vector of the same dimension as your embeddings
        return np.zeros(self.vectors.shape[1]).tolist()

    def embed_text(self, text):
        return self.embed_query(text)

# Create embeddings object with PCA-transformed vectors
embeddings = PrecomputedEmbeddings(X_pca)

# Create FAISS index
db = FAISS.from_documents(documents, embeddings, distance_strategy="COSINE")

# Save the index locally
db.save_local("faiss_index_sysmon_cosine_pca")

print("FAISS index created and saved successfully.")

# Save the scaler and PCA objects for later use
import joblib
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(pca, 'pca.joblib')

print("Scaler and PCA objects saved for future use.")

# Print some information about the transformed data
print(f"Original vector shape: {X.shape}")
print(f"PCA-transformed vector shape: {X_pca.shape}")
print(f"Explained variance ratio sum: {pca.explained_variance_ratio_.sum():.4f}")

FAISS index created and saved successfully.
Scaler and PCA objects saved for future use.
Original vector shape: (13455, 30000)
PCA-transformed vector shape: (13455, 500)
Explained variance ratio sum: 1.0000


In [76]:
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from langchain_community.vectorstores import FAISS
from langchain.embeddings.base import Embeddings

# Load the saved scaler and PCA objects
scaler = joblib.load('scaler.joblib')
pca = joblib.load('pca.joblib')

# Recreate the PrecomputedEmbeddings class (it needs to be defined before loading)
class PrecomputedEmbeddings(Embeddings):
    def __init__(self, vectors):
        self.vectors = vectors

    def embed_documents(self, texts):
        return self.vectors.tolist()

    def embed_query(self, text):
        return np.zeros(self.vectors.shape[1]).tolist()

    def embed_text(self, text):
        return self.embed_query(text)

# Create a dummy embeddings object: load_local requires one, but the searches
# below pass query vectors directly, so its contents are never used
embeddings = PrecomputedEmbeddings(np.zeros((1, 100)))

# Load your FAISS index
db = FAISS.load_local("faiss_index_sysmon_cosine_pca", embeddings, allow_dangerous_deserialization=True)

def preprocess_query(query_vector):
    # Ensure query_vector is 2D
    if query_vector.ndim == 1:
        query_vector = query_vector.reshape(1, -1)

    # Apply the same preprocessing as during index creation
    query_scaled = scaler.transform(query_vector)
    query_pca = pca.transform(query_scaled)

    return query_pca

# Your query vector
query_vector = vectorized_log.astype(np.float32)

# Preprocess the query vector
processed_query = preprocess_query(query_vector)

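# Query the FAISS index with the processed vector.
# Note: similarity_search_by_vector returns Documents only, so the metadata has
# no 'score' key and "N/A" is printed below; similarity_search_with_score_by_vector
# returns (Document, distance) pairs instead.
results = db.similarity_search_by_vector(processed_query[0], k=5)

# Print results
for doc in results:
    print(f"Score: {doc.metadata.get('score', 'N/A')}")
    print(f"Content: {doc.page_content[:300]}...")
    print("---")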

Score: N/A
Content: Process Create:
RuleName: -
UtcTime: 2024-07-28 15:38:31.608
ProcessGuid: {18e8265a-65f7-66a6-f904-000000004400}
ProcessId: 8612
Image: C:\Users\student\AppData\Local\miniconda3\python.exe
FileVersion: 3.12.4
Description: Python
Product: Python
Company: Python Software Foundation
OriginalFileName: p...
---
Score: N/A
Content: Process Create:
RuleName: -
UtcTime: 2024-07-28 15:38:35.802
ProcessGuid: {18e8265a-65fb-66a6-fe04-000000004400}
ProcessId: 2224
Image: C:\Users\student\AppData\Local\miniconda3\python.exe
FileVersion: 3.12.4
Description: Python
Product: Python
Company: Python Software Foundation
OriginalFileName: p...
---
Score: N/A
Content: Process Create:
RuleName: -
UtcTime: 2024-07-28 15:38:34.158
ProcessGuid: {18e8265a-65fa-66a6-fc04-000000004400}
ProcessId: 8200
Image: C:\Users\student\AppData\Local\miniconda3\python.exe
FileVersion: 3.12.4
Description: Python
Product: Python
Company: Python Software Foundation
OriginalFileName: p...
---
Score: N/A
Conte

# Test 3: using FAISS for similarity search (Cosine distance) with normalized vectors
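
The cell in this test normalizes the vectors so that L2 distance ranks results the same way as cosine similarity. The reason is the identity ‖a − b‖² = 2 − 2·cos(a, b), which holds for unit-length vectors. A small check of that identity in plain Python, using made-up vectors rather than dataset values (all helper names are illustrative):

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos = cosine_similarity(a, b)
dist_sq = squared_l2(l2_normalize(a), l2_normalize(b))

print(f"cosine similarity:          {cos:.4f}")
print(f"squared L2 of unit vectors: {dist_sq:.4f}")
print(f"2 - 2 * cosine:             {2 - 2 * cos:.4f}")
```

Because the squared distance decreases as the cosine increases, the nearest neighbours by L2 distance on normalized vectors are also the most similar by cosine.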

In [None]:
# Vectorize a message of interest

interesting_log_line = r"""
File created:
RuleName: EXE
Image: C:\Program Files\Microsoft Office\Root\Office16\EXCEL.EXE
TargetFilename: C:\Users\student\AppData\Local\Temp\file.exe
"""

# Vectorize the message
vectorized_log = vectorize_text(interesting_log_line)
print("Vectorized log shape:", vectorized_log.shape)

In [77]:
import numpy as np
import faiss
from sklearn.preprocessing import normalize
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain.embeddings.base import Embeddings

# Load your original vectors (assuming you still have access to them)
X = np.array(df_f['message_vector'].to_list())

# Normalize the vectors to unit length; for unit vectors, ranking by L2 distance
# matches ranking by cosine similarity
X_normalized = normalize(X, axis=1)

class PrecomputedEmbeddings(Embeddings):
    def __init__(self, vectors):
        self.vectors = vectors

    def embed_documents(self, texts):
        return self.vectors.tolist()

    def embed_query(self, text):
        # This should not be called, but we'll keep it for compatibility
        return np.zeros(self.vectors.shape[1]).tolist()

# Create embeddings object with normalized vectors
embeddings = PrecomputedEmbeddings(X_normalized)

# Create Document objects
documents = [Document(page_content=text, metadata={'index': i}) for i, text in enumerate(df_f['message'])]

# Create FAISS index
db = FAISS.from_documents(documents, embeddings, distance_strategy="COSINE")

# Save the index
db.save_local("faiss_index_sysmon_cosine_normalized")

print("FAISS index created and saved successfully.")

# Now, let's search:
def search_similar(query_vector, k=5):
    # Normalize the query vector
    query_vector_normalized = normalize(query_vector.reshape(1, -1))[0]

    # Returns Documents only; the metadata has no 'score' key, so "N/A" is printed below
    results = db.similarity_search_by_vector(query_vector_normalized, k=k)

    print(f"Top {k} similar messages:")
    for i, doc in enumerate(results, 1):
        print(f"\n{i}. Similarity Score: {doc.metadata.get('score', 'N/A')}")
        print(f"Message: {doc.page_content[:300]}...")  # Print first 300 characters
        print(f"Metadata: {doc.metadata}")

# Your query vector
query_vector = vectorized_log.astype(np.float32)

# Search for similar messages
search_similar(query_vector, k=5)

FAISS index created and saved successfully.
Top 5 similar messages:

1. Similarity Score: N/A
Message: Dns query:
RuleName: -
UtcTime: 2024-07-28 18:27:54.471
ProcessGuid: {00000000-0000-0000-0000-000000000000}
ProcessId: 5940
QueryName: dc.sec699-20.lab
QueryStatus: 0
QueryResults: ::ffff:192.168.20.101;
Image: ...
Metadata: {'index': 8550}

2. Similarity Score: N/A
Message: Dns query:
RuleName: -
UtcTime: 2024-07-28 17:24:01.108
ProcessGuid: {18e8265a-7eb1-66a6-6a08-000000004400}
ProcessId: 10084
QueryName: dc.sec699-20.lab
QueryStatus: 0
QueryResults: ::ffff:192.168.20.101;
Image: ...
Metadata: {'index': 7393}

3. Similarity Score: N/A
Message: Registry value set:
RuleName: Context,ProtectedModeExitOrMacrosUsed
EventType: SetValue
UtcTime: 2024-07-28 19:31:25.327
ProcessGuid: {18e8265a-9c88-66a6-170c-000000004400}
ProcessId: 10624
Image: C:\Program Files\Microsoft Office\Root\Office16\EXCEL.EXE
TargetObject: HKU\S-1-5-21-3148146594-1027658...
Metadata: {'index': 9736}

4. Similarity