Stage 2 – Making Sense of the Chaos

This is the part where all the content sources came together into a centralized system I could actually interact with.

This post is a cleaned-up record of what I built, what worked, what didn’t, and what I planned next. If you’ve ever tried to unify fragmented notes, decks, blogs, and structured documents into a searchable system, this might resonate 🙂


What I Built

There were two main components at the heart of the system:

  1. Batch Processing Script
    PopulateChatSystemDataRepository.py — this was run manually to gather and format all source data into a single repository. My plan was to automate it later.
  2. Continuous Scanner
    A lightweight background service monitored for new blog posts and updates.

At that point, the batch script did the heavy lifting, though I intended to shift it onto Google Cloud Run to handle scale.
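The two-component split above can be sketched as a tiny skeleton. Everything here is illustrative (the gatherer names and stub records are mine, not the real script), but it shows the shape of the batch job: run every source gatherer, collect the results into one repository.

```python
# Hypothetical sketch of the batch script's top-level structure.
# Function names and stub records are illustrative, not the real code.

def gather_powerpoint():
    # In the real script the deck list was hardcoded.
    return [{"source": "pptx", "title": "Deck 1", "text": "..."}]

def gather_rss():
    return [{"source": "rss", "title": "Post 1", "text": "..."}]

def gather_onenote():
    return [{"source": "onenote", "title": "Skynet notes", "text": "..."}]

SOURCES = [gather_powerpoint, gather_rss, gather_onenote]

def populate_repository():
    """Run every gatherer and collect the records into one repository."""
    repository = []
    for gather in SOURCES:
        repository.extend(gather())
    return repository

records = populate_repository()
print(len(records))  # one stub record per source
```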


Where the Data Lived

The sources I processed included:

  • PowerPoint files
    These were manually selected and hardcoded into the script — a reasonable tradeoff given how few I needed to track.
  • RSS Feeds
    • My blog at bjrees.com
    • A few curated industry insight feeds
  • OneNote Notebooks, such as:
    • Project documentation (e.g. Skynet, The Oracle)
    • Notes from a Cambridge Judge Business School programme
    • Third-party and personal research logs
  • iCloud Backups
    These contained archived slide decks and supporting materials.

All of this data was funneled into a staging area for eventual vector embedding and retrieval.
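The funnelling step can be pictured as a small normaliser that coerces content from any source into one staging-area shape, ready for later chunking and embedding. The function name and record fields below are my illustration, not the actual script:

```python
from datetime import datetime, timezone

def to_staging_record(source, title, text):
    """Normalise content from any source into one common staging shape,
    ready for later chunking and vector embedding."""
    return {
        "source": source,                # e.g. "pptx", "rss", "onenote"
        "title": title.strip(),
        "text": " ".join(text.split()),  # collapse stray whitespace/newlines
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

record = to_staging_record("rss", " Stage 2 ", "making  sense\nof the chaos")
print(record["title"], "|", record["text"])
```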


Microsoft Graph API + OneNote

To pull content from OneNote, I used the Microsoft Graph API. First, I installed the required libraries:

pip install msal requests
  • msal handled authentication via Azure Active Directory
  • requests allowed me to interact with the Graph API endpoints

Once I authenticated, I could enumerate and query notebooks like this:

python ExtractNotes.py

After logging in via a Microsoft-generated URL, I could successfully extract content from all the notebooks I needed.


Licensing Curveballs

At the time, I hit a snag: my Microsoft 365 Family plan didn’t include SharePoint Online, which was required to query OneNote via the Graph API.

I weighed my options:

  1. Pay for a Business Standard plan (~£9.40/month)
  2. Try to use my home license in some way, even though it didn’t seem to have what I needed for OneNote

I went with option 2, using a one-month free trial of Microsoft Business Basic to validate the approach.


Google Sheets as the Backbone

The ingestion script used a JSON keyfile to interact with Google Sheets. It opened the sheet like this:

client.open_by_key(sheet_id).sheet1

Sheets acted as a live database — but I ran into 429 rate-limit errors, especially when repeatedly reading the same files. To solve this, I built a basic checkpointing system so the script would:

  • Cache previously processed records
  • Avoid re-downloading the same content every time
  • Track progress and only fetch new entries on each run
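The checkpointing idea above can be sketched with nothing but the standard library. This is a minimal version under my own assumptions (a local JSON file of seen IDs; the real script's format may differ), but it captures the cache / skip / track-progress loop:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def load_seen():
    """IDs of records already processed on earlier runs."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def save_seen(seen):
    CHECKPOINT.write_text(json.dumps(sorted(seen)))

def new_entries(entries, seen):
    """Keep only entries not processed before, avoiding re-downloads."""
    return [e for e in entries if e["id"] not in seen]

seen = load_seen()
batch = [{"id": "post-1"}, {"id": "post-2"}]   # stand-in for fetched records
fresh = new_entries(batch, seen)               # only what's new this run
seen.update(e["id"] for e in fresh)
save_seen(seen)                                # progress persists between runs
```

On the next run the same two IDs are already in the checkpoint, so `fresh` comes back empty and nothing is re-fetched.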

The GitHub Reset

After a short break from the project, I realized the codebase had grown too complex. I had introduced a lot of logic to deal with throttling and retries, but it made everything harder to understand.

So I rolled back to a much earlier commit and started again from a simpler foundation.

It was the right move.


What Came Next

Here’s what I tackled after that cleanup:

  • Migrated the whole project to an old home laptop
  • Simplified the ingestion pipeline
  • Ensured each run processed only new data, not the full archive
  • Finalized access and querying via Microsoft Graph API for OneNote and SharePoint content

Reflections

Skynet began as a chatbot experiment, but evolved into something bigger — a contextual knowledge system that drew from years of notes, presentations, and personal writing.

Stage 2 was about turning chaos into structure. The next phase was even more exciting: embeddings, retrieval, and building a system that could answer real questions, grounded in my own work.


Read Stage 1 if you missed the start.

