Tips and Tricks Picked Up From Using a Large Language Model to Search My Personal Wiki By "Vibe"

Recently, I wanted to find a document in my personal wiki that mentioned the idea of "there will always be another coding task" - unfortunately, traditional search came up short here, because I couldn't remember the exact wording I used, only the "vibe" of this particular phrase. However, I've been looking for a reason to use a large language model (aka LLM) with my wiki, and while there are other techniques that could accomplish the same goal, this seemed like as good an excuse as any to try it out! I learned some tips and tricks along the way, so I thought I'd share - if you just want the tips, feel free to skip ahead to the Summary - otherwise, read on!

Using ollama to run LLMs locally

First things first - choosing an LLM! There are a number of services and models out there - ChatGPT is kind of the "Google" of LLMs at the moment - but I personally don't feel super comfortable sending my personal notes to ChatGPT, so I wanted to make use of a model that I could run on my own hardware. There's a project named ollama that makes it really easy to do this, and among locally-runnable models we have some options as well - llama3.1 is one that came out fairly recently, so I chose to use that one:

$ sudo pacman -S ollama-cuda
$ sudo systemctl start ollama
$ ollama pull llama3.1
$ ollama run llama3.1 'what is 3 + 3?'
3 + 3 = 6.

Now that I have a local LLM to talk to, my task is pretty straightforward:

  1. Get a list of candidate documents
  2. Transform the documents
  3. Come up with a prompt for the model
  4. Submit the documents and the prompt to the LLM

Get a list of candidate documents

When you're using an LLM, you can't throw an unbounded amount of text at it - LLMs have what's called a context window, which, as I understand it, is like the model's short-term memory. Llama3.1 has a context window of 128K tokens (and I think that, depending on which model you're using, tokens tend to be something like three-quarters of a word), which is pretty big, but my wiki is considerably bigger. I tend to tag documents like the one I'm looking for as "Journal", "Reflection", or "Insights", and I'm pretty sure I wrote this document this year, so I'm only going to work with documents tagged as such that were modified in 2024, which gets me about 300 documents:

$ tiddlywiki --render ~/wiki /tmp/reflection-docs.json text/plain  '$:/core/templates/exporters/JsonFile' 'exportFilter' '[tag[Insights]] [tag[Journal]] [project[Reflection]]'
$ < /tmp/reflection-docs.json jq -c '.[] | select(.type == "text/vnd.tiddlywiki" and (.modified | startswith("2024")))' > /tmp/2024-reflection-docs.json

Transform the documents

The LLM should be fine with a single JSON document per line. TiddlyWiki's documents come with a bunch of extra metadata that I don't need for this, and since I want to keep each document as small as possible to fit as many into the context window as I can, I pare down the list of fields to just title and text:

$ < /tmp/2024-reflection-docs.json jq -c '{title: .title, text: .text}' > /tmp/llm-docs.json

Come up with a prompt for the model

Now that I had a bunch of documents for the LLM to look at, I needed to figure out the question to ask it. This is definitely more of an art than a science, but I started with this:

Given the following documents expressed as JSON documents with title and text properties, identify the five documents that most closely express the notion of "there will always be another coding task."

You'll notice that I asked for five documents rather than just "the closest" match - this is a trick I picked up from Simon Willison:

One tip I have for these things is to ask for 20 ideas for X. Always ask for lots of ideas, because if you ask it for an idea for X, it’ll come up with something obvious and boring. If you ask it for 20, by number 15, it’s really scraping the bottom of the barrel.

This works well for creative idea generation, but also for smoothing over weirdness you might see with the "top answer"!

TIP #1: Ask for multiple results in your prompt.

Submitting the documents and the prompt to the LLM

Now, there are multiple ways to interact with ollama - after running into limitations with the ollama CLI, I started using langchain, mostly because I stumbled upon it while searching and already had it on my radar. So here's an example script that does what I needed:

import fileinput
import json

from langchain_community.llms import Ollama

docs = [ json.loads(line.rstrip()) for line in fileinput.input() ]

llm = Ollama(model="llama3.1")

actual_prompt = 'find the titles of the five documents that most likely discuss the notion of "there will always be a next coding task"'
prompt = f'Given the documents that follow the next newline, expressed one-per-line as JSON documents with the title in the "title" key and the document body in the "text" key, {actual_prompt}:\n'
res = llm.invoke(prompt + '\n' + '\n'.join(json.dumps(d) for d in docs))

print(res)

Unfortunately, when I ran this, the responses I got were along the lines of "this appears to be a list of documents".

What I quickly realized is that the size of my documents was blowing through the context window - I confirmed this by first changing my prompt to something simple and unrelated (namely, "Ignore everything that follows this line and just tell me what 3+3 is"). I had naïvely assumed that I would get an error if I exceeded the context window, but that's not the case - the LLM simply "forgets" what's at the beginning.

TIP #2: Not really a tip, just a caveat: exceeding context window size just results in forgetfulness.
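Here's a rough sketch of the kind of check I mean - it just reuses the llm and docs variables from the script above, and the exact wording of the throwaway prompt doesn't matter much:

# A quick way to check whether the payload is blowing through the context
# window: put a trivial, unrelated instruction at the very front. If the
# model answers "6", the front of the input survived; if it starts
# describing the documents instead, the beginning has been forgotten.
canary = 'Ignore everything that follows this line and just tell me what 3 + 3 is.\n'
payload = '\n'.join(json.dumps(d) for d in docs)

print(llm.invoke(canary + payload))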

Figuring out how many documents fit into the context

At this point, I wanted to get a feel for how many documents I could stuff into the context, so I limited the number down to 50 or so, moved the prompt to the end (interestingly enough, Anthropic mentions that there are other benefits to putting your prompt after a series of documents), and changed it to something like "given the preceding JSON documents, tell me how many you see". I also tried "given the preceding JSON documents, tell me the title of the first one you see".

TIP #3: Move your prompt to the end to prevent the LLM forgetting it.

TIP #4: Ask the LLM questions about the context as a diagnostic.
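A rough sketch of that diagnostic, again reusing the names from the script above - the important part is that the documents come first and the question comes last:

# Take a smaller slice of documents and put the question *after* them, so
# that if anything falls out of the context window, it's documents at the
# front rather than the instructions.
subset = docs[:50]
diagnostic = 'Given the preceding JSON documents, tell me how many you see.'

payload = '\n'.join(json.dumps(d) for d in subset)
print(llm.invoke(payload + '\n\n' + diagnostic))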

The results were surprising - it was something like "6 documents", rather than the supplied 50 or so!

What ended up happening in this case is that although llama3.1 has a 128K token context window, langchain itself restricts the context window size to a measly 2K by default (this is apparently ollama's default as well). Increasing the size via num_ctx increased the number of documents considered by the LLM - I experimented with the value, watching my VRAM consumption along the way. At a window size of 64K, my VRAM usage reached 15GB of my available 16GB, so that seemed like a good place to stop.

TIP #5: Don't settle for langchain/ollama's default context window size of 2K tokens.
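Setting the context window size through langchain turned out to be just a constructor argument - something like this (64K fit within my 16GB of VRAM, but the right value depends on your hardware):

from langchain_community.llms import Ollama

# num_ctx is passed through to ollama and sets the context window size in
# tokens; the default of 2048 is far too small for this kind of task.
llm = Ollama(model="llama3.1", num_ctx=64 * 1024)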

Once I got the context window size set up properly, I found that with my 64K token context, I could include about 32K tokens' worth of documents before the LLM started to forget - it always ended up being roughly ½ the total context size, and I don't know why. For this subset of my wiki, that divided the data into 4 chunks of roughly 75 documents each. So then I adapted my code to use these chunks and just spit out the answer for each.
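The chunking itself doesn't need anything fancy - here's a sketch building on the script above, with the prompt moved to the end per tip #3 and rephrased to refer to the preceding documents. Cutting every 75 documents is roughly what the 32K token budget worked out to for my documents, so treat that number as an assumption rather than a rule:

# Split the documents into fixed-size chunks and ask the question once per
# chunk, with the documents first and the prompt last.
CHUNK_SIZE = 75

chunks = [docs[i:i + CHUNK_SIZE] for i in range(0, len(docs), CHUNK_SIZE)]
tail_prompt = f'Given the preceding JSON documents, {actual_prompt}.'

for chunk in chunks:
    payload = '\n'.join(json.dumps(d) for d in chunk)
    print(llm.invoke(payload + '\n\n' + tail_prompt))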

Another trick I tried was to run each chunk twice - LLMs are non-deterministic, so you may get a different answer each time. I figured that if I ran each chunk more than once, any titles that appeared more frequently in the results should end up higher in the final ranking. (This kind of reminds me of quantum algorithms, which give you a probabilistic result that you can refine by running them multiple times to boost your confidence in the answer.)

TIP #6: Ask your question multiple times and consolidate the answers.
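One crude way to do that consolidation, building on the chunking sketch above - this just checks whether each title appears verbatim in the model's answer, which assumes the model quotes titles exactly:

from collections import Counter

votes = Counter()

for chunk in chunks:
    payload = '\n'.join(json.dumps(d) for d in chunk)
    for _ in range(2):  # ask each chunk twice
        answer = llm.invoke(payload + '\n\n' + tail_prompt)
        # Credit a document every time its title shows up in an answer;
        # titles the model keeps coming back to float to the top.
        for doc in chunk:
            if doc['title'] in answer:
                votes[doc['title']] += 1

for title, count in votes.most_common(10):
    print(count, title)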

At this point, I was able to find the document I was looking for! So this is where I stopped my experimentation, but I had an idea for another technique for chunking and asking multiple times: instead of dividing the full document set into roughly equal chunks and running each chunk twice, I could divide the document set into chunks, run each chunk, then shuffle the document set, re-chunk, and run each new chunk - I might try that sometime if I have cause to do this kind of thing again!
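If I do, it might look something like this (untried, and again just a sketch on top of the previous snippets):

import random

# Untried idea: do several passes, shuffling the documents between passes so
# that each pass groups them into different chunks, then pool the votes as
# before.
for _ in range(3):
    random.shuffle(docs)
    chunks = [docs[i:i + CHUNK_SIZE] for i in range(0, len(docs), CHUNK_SIZE)]
    for chunk in chunks:
        payload = '\n'.join(json.dumps(d) for d in chunk)
        answer = llm.invoke(payload + '\n\n' + tail_prompt)
        for doc in chunk:
            if doc['title'] in answer:
                votes[doc['title']] += 1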

Summary

  1. Ask for multiple results in your prompt.
  2. Not really a tip, just a caveat: exceeding context window size just results in forgetfulness.
  3. Move your prompt to the end to prevent the LLM forgetting it.
  4. Ask the LLM questions about the context as a diagnostic.
  5. Don't settle for langchain/ollama's default context window size of 2K tokens.
  6. Ask your question multiple times and consolidate the answers.
Published on 2024-09-06