project_bookworm/Ask_Phrack.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dd9277e7-b860-426a-9a2c-3bb60263be03",
   "metadata": {},
   "source": [
    "# Ask Phrack"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd56e703-8fab-4d5f-8269-de31afaf8a13",
   "metadata": {},
   "source": [
    "## Setup \n",
    "\n",
    "This is analogous to [the Wiki](https://because-security.atlassian.net/wiki/spaces/LML/pages/74121692/LangChain+GPT-4+and+a+security+motivation+for+Agents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "925cca71-fa9e-4736-bc98-246c4cd5a705",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "load_dotenv(find_dotenv(filename=\"env\"), override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5d06cca-cb1c-4611-946e-51bc676f53c7",
   "metadata": {},
   "source": [
    "## API key handing\n",
    "\n",
    "I cycle keys here anways, but the `print()` is commented out. :) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "58bbd926-bbb1-4a34-9382-bb9bc9e79e66",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "OPENAI_API_KEY=os.environ.get('OPENAI_API_KEY=')\n",
    "\n",
    "llm = ChatOpenAI(model=\"gpt-4\", temperature=0.9, max_tokens=512, api_key=OPENAI_API_KEY)\n",
    "# print(llm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c051e0b-0925-4a83-9b0e-43ac677b6e6d",
   "metadata": {},
   "source": [
    "## Basic Prompt 1on1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "12e23e56-0759-4ab8-b0ce-86bd8aaa85f1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'question': 'explain quantum mechanics in one sentence', 'text': ' \\n\\nQuantum mechanics is a branch of physics that studies the behavior and interactions of particles on a very small scale, such as atoms and subatomic particles.'}\n"
     ]
    }
   ],
   "source": [
    "from langchain.chains import LLMChain\n",
    "from langchain.prompts import PromptTemplate\n",
    "from langchain_openai import OpenAI\n",
    "\n",
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's keep the answer very short and so simple so that\n",
    " a child can understand it.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "\n",
    "llm = OpenAI()\n",
    "\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "\n",
    "# Define the question as a dictionary to match the expected input format\n",
    "question_dict = {\"question\": \"explain quantum mechanics in one sentence\"}\n",
    "\n",
    "# Invoke the chain with the question dictionary\n",
    "output = llm_chain.invoke(question_dict)\n",
    "\n",
    "\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7976cb2c-6cf7-4d86-b8b7-1f885e0f1228",
   "metadata": {},
   "source": [
    "## Extended Prompt 1on1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "3c6aac1a-d066-44fe-a2cb-4394b9cb88f0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content='Quantenmechanik ist das Studium von Phänomenen auf mikroskopischer Ebene, bei denen Teilchen gleichzeitig verschiedene Zustände einnehmen können, bis sie gemessen werden.'\n"
     ]
    }
   ],
   "source": [
    "from langchain.schema import(\n",
    "    AIMessage,\n",
    "    HumanMessage,\n",
    "    SystemMessage\n",
    ")\n",
    "\n",
    "chat = ChatOpenAI(model='gpt-4', temperature=0.5, max_tokens=1024)\n",
    "messages = [\n",
    "    SystemMessage(content='You are a physicist and respond only in German.'),\n",
    "    HumanMessage(content='explain quantum mechanics in one sentence')\n",
    "]\n",
    "output = chat(messages)\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9c8fec7-7e9e-482b-8b2c-a203ea9c4d88",
   "metadata": {},
   "source": [
    "## Sequential / chained Prompts 1on1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "cd4a17ad-cd1b-4b85-b516-019237c09d2d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new SimpleSequentialChain chain...\u001b[0m\n",
      "\u001b[36;1m\u001b[1;3mSure, the softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear generalization of logistic regression.\n",
      "\n",
      "Here is a Python function that implements the softmax function:\n",
      "\n",
      "```python\n",
      "import numpy as np\n",
      "\n",
      "def softmax(x):\n",
      "    \"\"\"Compute softmax values for each sets of scores in x.\"\"\"\n",
      "    e_x = np.exp(x - np.max(x))\n",
      "    return e_x / e_x.sum(axis=0) \n",
      "\n",
      "# test with some data\n",
      "scores = np.array([3.0, 1.0, 0.2])\n",
      "print(softmax(scores))\n",
      "```\n",
      "\n",
      "In this function, `np.exp(x - np.max(x))` is used to improve numerical stability. This doesn't change the result, but it does prevent possible overflows or underflows in the calculation.\n",
      "\n",
      "Please note that the input to the softmax function is a vector of real numbers and the output of the softmax function is a vector that sums to 1. The softmax function is often used in machine learning, specifically in the context of probabilistic classification.\u001b[0m\n",
      "\u001b[33;1m\u001b[1;3mThe `softmax` function in around Python implements the softmax function which is hugely employed in the domain of machine learning and data science. A softmax function is a type of squashing function which transform arbitrarily-lengthed vectors of 'real numbers' to a fixed vector of 'real numbers'.\n",
      "\n",
      "Especially, it's often applied in the last layer of a neural network-based classifiers handling multi-class classification problems. The output via a softmax function can be seen as representative probable results for each class. The probability for each class could be viewed as a percentage like score which then can be utilized to choose the maximum value for final class assignment, or can be used to assessibalize the level of confidence of the network's prediction.\n",
      "\n",
      "In this function:\n",
      "- `np.exp(x - np.max(x))` : It represents exponential of (x - max(x)), where x is an input array and `np.max(x)` is the max value element in array. It shifts the values of x by minus the max value in x. The result gives an array where the highest value is 0. This operation also help avoid potential problematics from overflow or underflows caused by the exponential operations.\n",
      "\n",
      "- `return e_x / e_x.sum(axis=0)` : It normalizes these exponential values(e_x) simply by attributing each value by the sum of all exponentials in the input data. By doing so, every elements / bits in the resulting output vector sums to 1 when tallyed up, hereby representing probabilities for distinct categories.\n",
      "\n",
      "To test the function, definitions `scores` an array of arbitrary numbers to form inputs, afterwards calls `softmax(scores)` to print the output, which evidences how data runs via the softmax function and gives the rersults.\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from langchain import PromptTemplate\n",
    "from langchain.chains import LLMChain, SimpleSequentialChain\n",
    "\n",
    "# Initialize the first ChatOpenAI model (gpt-3.5-turbo) with specific temperature\n",
    "llm1 = ChatOpenAI(model='gpt-4', temperature=0.5)\n",
    "\n",
    "# Define the first prompt template\n",
    "prompt_template1 = PromptTemplate.from_template(\n",
    "    template='You are an experienced scientist and Python programmer. Write a function that implements the concept of {concept}.'\n",
    ")\n",
    "# Create an LLMChain using the first model and the prompt template\n",
    "chain1 = LLMChain(llm=llm1, prompt=prompt_template1)\n",
    "\n",
    "# Initialize the second ChatOpenAI model (gpt-4-turbo) with specific temperature\n",
    "llm2 = ChatOpenAI(model='gpt-4', temperature=1.2)\n",
    "\n",
    "# Define the second prompt template\n",
    "prompt_template2 = PromptTemplate.from_template(\n",
    "    template='Given the Python function {function}, describe it as detailed as possible.'\n",
    ")\n",
    "# Create another LLMChain using the second model and the prompt template\n",
    "chain2 = LLMChain(llm=llm2, prompt=prompt_template2)\n",
    "\n",
    "# Combine both chains into a SimpleSequentialChain\n",
    "overall_chain = SimpleSequentialChain(chains=[chain1, chain2], verbose=True)\n",
    "\n",
    "# Invoke the overall chain with the concept \"linear regression\"\n",
    "output = overall_chain.invoke('softmax')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "859e231b-89f2-4d17-ad01-c7ba94d99c40",
   "metadata": {},
   "source": [
    "## LLM GPT-4 agents with Python 1on1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c1aafe55-4189-4f61-9df4-a4bd2f26622f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Python REPL can execute arbitrary code. Use with caution.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32;1m\u001b[1;3m\n",
      "Invoking: `Python_REPL` with `str(math.pi)[8]`\n",
      "\n",
      "\n",
      "\u001b[0m\u001b[36;1m\u001b[1;3mNameError(\"name 'math' is not defined\")\u001b[0m\u001b[32;1m\u001b[1;3m\n",
      "Invoking: `Python_REPL` with `import math\n",
      "str(math.pi)[8]`\n",
      "responded: I encountered an error while trying to access the 8th digit of Pi. Let me fix that and try again.\n",
      "\n",
      "\u001b[0m\u001b[36;1m\u001b[1;3m\u001b[0m\u001b[32;1m\u001b[1;3mThe 8th digit of Pi is 3.\u001b[0m\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'input': 'What is the 8th digit of Pi?',\n",
       " 'output': 'The 8th digit of Pi is 3.'}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain import hub\n",
    "from langchain.agents import AgentExecutor\n",
    "from langchain_experimental.tools import PythonREPLTool\n",
    "\n",
    "tools = [PythonREPLTool()]\n",
    "\n",
    "from langchain.agents import create_openai_functions_agent\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "instructions = \"\"\"You are an agent designed to write and execute python code to answer questions.\n",
    "You have access to a python REPL, which you can use to execute python code.\n",
    "If you get an error, debug your code and try again.\n",
    "Only use the output of your code to answer the question. \n",
    "You might know the answer without running any code, but you should still run the code to get the answer.\n",
    "If it does not seem like you can write code to answer the question, just return \"I don't know\" as the answer.\n",
    "\"\"\"\n",
    "base_prompt = hub.pull(\"langchain-ai/openai-functions-template\")\n",
    "prompt = base_prompt.partial(instructions=instructions)\n",
    "\n",
    "agent = create_openai_functions_agent(ChatOpenAI(temperature=0), tools, prompt)\n",
    "agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)\n",
    "agent_executor.invoke({\"input\": \"What is the 8th digit of Pi?\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36945f93-620f-42bc-b358-876566dbf08b",
   "metadata": {},
   "source": [
    "# Ask Phrak\n",
    "\n",
    "A little deeper?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "20420b66-d9a0-4516-b3cd-cea605696889",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Now you have 1059 chunks\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "with open('3.txt') as f:\n",
    "    js_engines_phrack_21 = f.read()\n",
    "\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=100,\n",
    "    chunk_overlap=20,\n",
    "    length_function=len\n",
    ")\n",
    "\n",
    "\n",
    "chunks = text_splitter.create_documents([js_engines_phrack_21])\n",
    "# print(chunks[2])\n",
    "# print(chunks[10].page_content)\n",
    "print(f'Now you have {len(chunks)} chunks')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c26f8d3b-8257-4e5e-ab81-3309f3a39a87",
   "metadata": {},
   "source": [
    "## Embedding with a specific model and cost calc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "f8fc1864-3c4f-4deb-8d9c-ad0803ec6ae1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Tokens: 20945\n",
      "Embedding Cost in USD: 0.008378\n"
     ]
    }
   ],
   "source": [
    "def print_embedding_cost(texts):\n",
    "    import tiktoken\n",
    "    enc = tiktoken.encoding_for_model('text-embedding-ada-002')\n",
    "    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])\n",
    "    print(f'Total Tokens: {total_tokens}')\n",
    "    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')\n",
    "    \n",
    "print_embedding_cost(chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39bb31fe-55a7-4145-8c4d-c799d5050d33",
   "metadata": {},
   "source": [
    "## Simple: OpenAI SaaS model embeddings (1536 dimensions)\n",
    "\n",
    "[OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "a0830bea-aac3-475d-a615-efba61b078eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "vector = embeddings.embed_query(chunks[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3e43f74-70dc-483c-9346-2f028c3c11dc",
   "metadata": {},
   "source": [
    "### Text loading and chunking\n",
    "\n",
    "This is needed to prepare the Phrack txt file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "943adca6-fe8d-4e9a-806e-bf73e1520cc7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Now you have 1059 chunks\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "with open('3.txt') as f:\n",
    "    phrack = f.read()\n",
    "\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=100,\n",
    "    chunk_overlap=20,\n",
    "    length_function=len\n",
    ")\n",
    "\n",
    "\n",
    "chunks = text_splitter.create_documents([phrack])\n",
    "# print(chunks[2])\n",
    "# print(chunks[10].page_content)\n",
    "print(f'Now you have {len(chunks)} chunks')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba42df21-3243-4daa-9304-51c60361f2fc",
   "metadata": {},
   "source": [
    "## SQlite VSS and OpenAI Embeddings w 1535 dims\n",
    "\n",
    "Local SQlite DB with VSS ext\n",
    "\n",
    "[SQlite VSS](https://github.com/asg017/sqlite-vss) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "1b54d3b7-bd43-4a3c-99c8-f082487a9faa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import OpenAIEmbeddings\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# vector = embeddings.embed_query(chunks[0].page_content)\n",
    "texts = [chunk.page_content for chunk in chunks]\n",
    "\n",
    "from langchain_community.vectorstores import SQLiteVSS\n",
    "\n",
    "db = SQLiteVSS.from_texts(\n",
    "    texts=texts,\n",
    "    embedding=embeddings,\n",
    "    table=\"test\",\n",
    "    db_file=\"./vss.db\",\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27a386e4-e749-45d2-9025-81642c455c6c",
   "metadata": {},
   "source": [
    "### Sample query with similarity search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "50c5723d-846b-4e1f-ad39-9d5a75e1d25d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'the heap and treats those as root nodes. In contrast, e.g. Spidermonkey'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# query it\n",
    "query = \"What is SpiderMonkey?\"\n",
    "data = db.similarity_search(query)\n",
    "\n",
    "# print results\n",
    "data[3].page_content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b1268c3-f931-4182-8217-4a6e8dbb71f6",
   "metadata": {},
   "source": [
    "## Vector DB result management with GPT 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "9c477da6-dce1-4be9-92a5-a804c685a8b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "\n",
    "llm = ChatOpenAI(model='gpt-4', temperature=1)\n",
    "retriever = db.as_retriever(search_type='similarity', search_kwargs={'k': 3})\n",
    "chain = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=retriever)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "792d5283-4759-42d2-ba23-57a76724712c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The text is setting up a discussion about a certain code and its implications, possibly in a cybersecurity context. It is about to summarize the action or function of this code, often a crucial part of understanding an exploit. However, the detailed description of the code and its actions is not given in the provided text.\n"
     ]
    }
   ],
   "source": [
    "query = \"Sum up the text in 3 sentences.\"\n",
    "answer = chain.run(query)\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "c1e864a8-80a9-4007-a234-f9d89c4c1805",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SpiderMonkey is the JavaScript engine used in Mozilla's Firefox browser. It is responsible for interpreting and executing JavaScript code.\n"
     ]
    }
   ],
   "source": [
    "query = \"What is SpiderMonkey?\"\n",
    "answer = chain.run(query)\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "1b714d44-895d-4078-bda9-fbd525765d9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The text provided doesn't give a full explanation or method on how to attack a JavaScript engine. However, one strategy mentioned is to inject a fake JavaScript Object into the engine. Please note that this represents illegal activity and is not condoned nor supported.\n"
     ]
    }
   ],
   "source": [
    "query = \"How to attack a JavaScript engine?\"\n",
    "answer = chain.run(query)\n",
    "print(answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "11b08b9d-057a-47c4-926e-36390af87ac5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cd9e3dd-fc9c-42fc-b4ad-6fecea9c4491",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}