Stage 1 – getting set up: foundations of my AI-powered chatbot project

Over the past few months, I’ve been building something a bit different: a real-time AI-powered assistant designed to help me work better with my own content. The goal is to create a system that can scan and catalog documents, blog posts, audio recordings, and notes, then surface that information back to me as I need it—almost like a second brain. I wanted it to pull from tools I already use daily, like Google Sheets, OneNote, and GitHub, and use technologies like Pinecone, OpenAI, and Google Cloud to power the intelligence behind it.

This blog series is a step-by-step breakdown of how I built it, from a messy OneNote notebook to a working system. Each post will focus on one key stage, including the code, architecture, and lessons learned along the way. This first post covers the tech stack I chose and the initial environment setup, which turned out to be one of the most fun stages.


Setting Up the Environment for a Python-Based AI Chatbot

This project runs on Linux (via WSL on Windows), primarily because I want to build it in Python, which I already have some basic experience with. Here's how I set up the development environment and supporting tools.

Core Tools and Services

  • Google Cloud
    • Speech-to-Text for audio transcription
    • Cloud Run to execute background processing tasks
    • Google Sheets for structured data storage (e.g., cataloging blog posts)
  • OneDrive (Personal) for general document storage
  • iCloud Drive for mobile voice recordings
  • GitHub to manage Python code and version control
  • Trello as a lightweight project tracker
  • ChatGPT to assist with development and planning

Getting Linux + WSL Working Smoothly

There were a few initial stumbling blocks in getting everything up and running, especially around Windows Subsystem for Linux (WSL). Here’s the distilled process:

  1. Launch CMD as Admin, then enter WSL with:
    wsl
  2. Activate the Python virtual environment (on a new machine it has to be created first, as sketched after this list):
    source myenv/bin/activate
  3. Navigate to the correct project folder:
    cd ~/skynet/
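
If the virtual environment doesn't exist yet (for example on a fresh machine), it needs to be created once before step 2 will work. A minimal sketch, assuming Python 3 with the venv module is available and that the environment is named myenv in the folder where step 2 is run:

    # Create the environment once, then activate it as in step 2
    python3 -m venv myenv
    source myenv/bin/activate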

Code and Repository Setup

  • All code is version-controlled in GitHub
  • To commit and push changes:
    git add PopulateChatSystemDataRepository.py
    git commit -m "Message"
    git push origin main
  • To pull the latest version onto another machine:
    git pull origin main
  • Libraries required when switching machines (see the note after this list):
    pip install gspread
    pip install oauth2client
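
To avoid repeating individual installs every time I switch machines, the same libraries could be pinned in a requirements file and installed in one go. This is just a sketch of an alternative, not how the project is currently set up, and requirements.txt is an assumed file name:

    # requirements.txt
    gspread
    oauth2client

    # then, inside the activated environment:
    pip install -r requirements.txt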

Purpose of This Setup

One of the key tasks here is to catalog blog posts into a Google Sheet using Python:

    python3 SendDataToGoogleSheets.py

This setup forms the backbone of a knowledge base that can be queried by an AI chatbot.
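
For context, the script boils down to a handful of gspread calls: authenticate with a service-account key, open the sheet, and append one row per post. The sketch below is illustrative only; the credentials file name, sheet name, and field values are assumptions rather than the script's actual contents:

    import gspread
    from oauth2client.service_account import ServiceAccountCredentials

    # Scopes needed to read and write Google Sheets via the API.
    SCOPES = [
        "https://spreadsheets.google.com/feeds",
        "https://www.googleapis.com/auth/drive",
    ]

    # "credentials.json" is a placeholder for the downloaded service-account key.
    creds = ServiceAccountCredentials.from_json_keyfile_name("credentials.json", SCOPES)
    client = gspread.authorize(creds)

    # Open the catalog sheet by name and append one blog-post row
    # (columns follow the data format described later in this post).
    sheet = client.open("Chat System Data Repository").sheet1
    sheet.append_row([
        "msg-001",                    # Message ID (illustrative)
        "user-01",                    # User ID (illustrative)
        "2025-01-01T10:00:00Z",       # Timestamp
        "Stage 1 - getting set up",   # Message Text / post title
        "Blog RSS",                   # Source
        "Pending",                    # Response Status
    ])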

Why Use Google Cloud Run?

I plan to regularly publish new blog posts and documents. These need to be automatically picked up by a cloud-based system, not just left on a drive. To do that, I’m using Google Cloud Run to host the background process that parses and ingests this content.

The Cloud Run service is named skynet, though it hasn't been deployed live yet; I'm waiting until the code is fully tested.
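
For reference, once testing is done, the deploy itself should be a single command run from the project folder. This is just a sketch, assuming the gcloud CLI is installed and authenticated and that a source-based deploy is used; the region is only an example:

    gcloud run deploy skynet --source . --region europe-west2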

Setting Up the Google Sheet Database

  1. Create a Google Sheet and give it a meaningful name (e.g., “Chat System Data Repository”).
  2. Enable the Google Sheets API in the Google Cloud Console.
  3. Set up the API key and credentials for access (see the note after this list about sharing the sheet).
  4. Code integration is done using Python with the gspread library—no Zapier or low-code tools.
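
One gotcha worth calling out on step 3: assuming a service account is being used (which is what the oauth2client install above suggests), the Google Sheet also has to be shared with the service account's email address before gspread can open it. That address sits inside the downloaded key file; a small sketch for finding it, with credentials.json as a placeholder file name:

    import json

    # The sheet must be shared (Editor access) with this email address,
    # exactly as you would share it with another person.
    with open("credentials.json") as f:
        print(json.load(f)["client_email"])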

Data Format for Each Entry

Each blog post entry should include:

  • Message ID
  • User ID
  • Timestamp
  • Message Text
  • Source (e.g., Slack, WhatsApp)
  • Response Status (e.g., Processed, Pending)

Entries are parsed with Python’s feedparser library, which extracts standard RSS fields such as title, link, description, content, and publication time.
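
For the RSS side, feedparser does most of the heavy lifting. A minimal sketch of pulling those fields out of a feed; the feed URL is a placeholder, and the way the values are arranged into a row is my illustration rather than the script's real mapping:

    import feedparser

    # Placeholder URL; the real script points at the blog's actual RSS feed.
    feed = feedparser.parse("https://example.com/feed/")

    for entry in feed.entries:
        # Standard RSS fields that feedparser exposes on each entry
        # (summary/description and content are available the same way).
        title = entry.get("title", "")
        link = entry.get("link", "")
        published = entry.get("published", "")

        # One catalog row per post, loosely following the columns listed above.
        print([link, "blog", published, title, "Blog RSS", "Pending"])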

Next Steps

There are two major next steps:

  1. Add additional content sources into the pipeline (see Trello board).
  2. Take each new source through the same ingestion process described above.

Currently, everything runs through the PopulateChatSystemDataRepository.py script, which has been updated to handle edge cases like escape characters.
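
As an aside on those edge cases: the cleanup needed is mostly about characters that break a cell value or an API payload, such as embedded newlines and other control characters. A purely illustrative sketch of that kind of sanitising step (the function name and exact rules are mine, not the script's):

    def clean_text(text: str) -> str:
        # Flatten newlines and tabs so each value stays on one line in its cell.
        text = text.replace("\r", " ").replace("\n", " ").replace("\t", " ")
        # Drop any remaining non-printable control characters and trim whitespace.
        return "".join(ch for ch in text if ch.isprintable()).strip()

    print(clean_text("Stage 1 - getting set up\n"))  # -> "Stage 1 - getting set up"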

Now that the core data is in place inside a Google Sheet, the next stage is testing the pipeline end-to-end. Once that’s working, I’ll expand to include additional data sources.

