# Use a local CPU Large Language Model (LLM) to generate text

This notebook uses a basic LLM, which:

* does not require a GPU
* is not fine-tuned for a specific task
* is not optimized for speed
* is not optimized for memory usage
* has a smaller model size

As a consequence, its output is generally:

* not as good as that of an LLM running on a GPU
* not as good as that of a fine-tuned LLM
* not as good as that of a larger LLM

Its purpose is to allow on-premises and self-hosted use of LLMs. 
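
Before running a model on your own hardware, it helps to check whether its weights fit in RAM. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes roughly 3 billion parameters (read from the `3b` in the name of the model loaded in the next section) and counts only the weights, ignoring activations, the tokenizer, and Python overhead. `weights_size_gib` is a hypothetical helper, not part of any library.

```python
# Bytes used to store one parameter for a few common data types.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_size_gib(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: about 3 billion parameters, inferred from the model name only.
approx_params = 3_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weights_size_gib(approx_params, dtype):.1f} GiB")
```

Under these assumptions the weights alone need about 11 GiB in `float32` and about half of that in a 16-bit type, so the machine needs noticeably more free RAM than that to run the notebook comfortably.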

In [5]:
# Install LangChain and its dependencies from requirements.txt,
# which pins the versions. Uncomment the next line to install them.
# %pip install -r requirements.txt

## Use the Hugging Face pipeline with LLMware Bling

* The Hugging Face pipeline is a convenient way to use a pre-trained model.
* LLMware Bling is a CPU LLM.
* The model is loaded with `trust_remote_code=True`, which allows Transformers to run model code downloaded from the Hugging Face Hub.

In [2]:
import os  # only needed if you set the cache directory below

# Optional: choose where downloaded models are cached. Set this in your shell
# before starting the notebook, or here before importing transformers, so that
# it takes effect. Newer Transformers releases deprecate TRANSFORMERS_CACHE
# in favor of HF_HOME.
# os.environ["TRANSFORMERS_CACHE"] = "/path/to/your/preferred/cache/directory"

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, logging, pipeline

model_id = "llmware/bling-stable-lm-3b-4e1t-v0"

# Optional: raise the log level to see details of the download and loading steps.
logging.set_verbosity_info()

# Download (or load from cache) the tokenizer and the model weights.
# trust_remote_code=True lets Transformers execute code from the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Wrap the model in a text-generation pipeline and expose it to LangChain.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500)
hf = HuggingFacePipeline(pipeline=pipe)


loading file vocab.json from cache at None
loading file merges.txt from cache at None
loading file tokenizer.json from cache at /home/marius/.cache/huggingface/hub/models--llmware--bling-stable-lm-3b-4e1t-v0/snapshots/a9e4d8d478d76dd062d9acd01b6ce3417217a344/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /home/marius/.cache/huggingface/hub/models--llmware--bling-stable-lm-3b-4e1t-v0/snapshots/a9e4d8d478d76dd062d9acd01b6ce3417217a344/special_tokens_map.json
loading file tokenizer_config.json from cache at /home/marius/.cache/huggingface/hub/models--llmware--bling-stable-lm-3b-4e1t-v0/snapshots/a9e4d8d478d76dd062d9acd01b6ce3417217a344/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json from cache at /home/marius/.cache/huggingface/hub/models--llmware--bling-stable-lm-3b-4e1t-v0/snapshots/a9e4d8d
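
The log above shows that the files are stored in a folder named after the model ID, with `/` replaced by `--` and a `models--` prefix. For on-premises use it can be handy to check whether a model is already cached before going offline. The following is a simplified sketch of that lookup: `hub_cache_dir`, `model_cache_folder`, and `is_cached` are hypothetical helpers, and the environment handling covers only `HF_HUB_CACHE` and `HF_HOME` with the usual `~/.cache/huggingface` default. The library's own resolution logic remains authoritative.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the Hub cache directory (simplified, see assumptions above)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    default_home = Path.home() / ".cache" / "huggingface"
    return Path(os.environ.get("HF_HOME", str(default_home))) / "hub"


def model_cache_folder(model_id: str) -> str:
    """Map a model ID to the folder name seen in the log output."""
    return "models--" + model_id.replace("/", "--")


def is_cached(model_id: str) -> bool:
    """Return True if a snapshots directory exists for this model."""
    return (hub_cache_dir() / model_cache_folder(model_id) / "snapshots").is_dir()


model_id = "llmware/bling-stable-lm-3b-4e1t-v0"
print(model_cache_folder(model_id))
print("cached locally:", is_cached(model_id))
```

A `True` result only means that a snapshot folder exists. It does not verify that every file of the model was downloaded completely.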

## Use the Hugging Face pipeline with LLMware Bling via LangChain

* The prompt is a basic LangChain prompt template with a single `{question}` placeholder.
* The `prompt | hf` chain fills the template with the question and passes the resulting text to the model.

In [3]:
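from langchain.prompts import PromptTemplate

# Prompt with a single {question} placeholder; the closing sentence nudges
# the model toward a step-by-step answer.
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Pipe the rendered prompt into the Hugging Face pipeline wrapper.
chain = prompt | hf

question = "What is electroencephalography?"

# Run the chain; the result is the generated text as a string.
test = chain.invoke({"question": question})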

Disabling tokenizer parallelism, we're using DataLoader multithreading already
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
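
To make the idea of the chain more concrete, here is a toy version in plain Python that needs neither LangChain nor a model. It is only an illustration of the idea behind the `|` operator: `Step`, `render_prompt`, and `fake_llm` are hypothetical names, and this is not how LangChain implements its chains. The placeholder "model" just reports how much text it received.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for one step of a chain."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that feeds the output of a into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


TEMPLATE = """Question: {question}

Answer: Let's think step by step."""

# Step 1: fill the template with the input variables.
render_prompt = Step(lambda variables: TEMPLATE.format(**variables))

# Step 2: placeholder model that does not generate any text.
fake_llm = Step(lambda prompt_text: f"<model received {len(prompt_text)} characters>")

toy_chain = render_prompt | fake_llm

question = {"question": "What is electroencephalography?"}
print(render_prompt.invoke(question))  # the text a model would receive
print(toy_chain.invoke(question))
```

Printing the rendered prompt on its own is a quick way to check what a model is actually asked before spending CPU time on generation.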


In [4]:
print(test)

 First, electroencephalography (EEG) is a medical test that measures electrical activity in the brain. Second, EEG is a type of electrodiagnostic test. Third, electrodiagnostic tests are used to evaluate neurological conditions.
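
Because the model is not optimized for speed, it can be useful to measure how long a call takes on your CPU. The helper below is a minimal sketch. `timed` is a hypothetical name, and `slow_echo` is a stand-in for the real call so that the example runs without a model.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_echo(inputs: dict) -> str:
    """Stand-in for a slow generation call; sleeps briefly, then echoes."""
    time.sleep(0.05)
    return f"echo: {inputs['question']}"


answer, seconds = timed(slow_echo, {"question": "What is electroencephalography?"})
print(f"{answer!r} took {seconds:.2f} s")
```

In the notebook, the same helper could wrap the real call, for example `timed(chain.invoke, {"question": question})`.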
