Deploying a Local LLM + RAG Chatbot
A Customer PoC on Single-Cell RNA-Seq Best Practices
Overview
This article describes a proof-of-concept (PoC) project for a customer seeking to build a fully local, domain-specific chatbot powered by Retrieval-Augmented Generation (RAG). The goal was to evaluate how well a lightweight language model, deployed on a CPU-based Linux machine, could deliver accurate, structured, and interactive answers based on a curated corpus of best-practice knowledge in single-cell RNA-sequencing (scRNA-seq).
Problem Statement: Why This Matters
Researchers and bioinformaticians working with single-cell RNA-seq data often face steep learning curves. There are dozens of notebooks, white papers, and tutorials—but no easy way to query this body of knowledge interactively, especially in air-gapped or sensitive computing environments.
The customer wanted a locally hosted chatbot that could:
Understand highly technical questions about RNA velocity, pseudotime, batch correction, and preprocessing.
Retrieve precise, context-aware answers from their internal corpus of scientific documents.
Run entirely offline, without depending on cloud APIs or GPU acceleration.
Technical Architecture
Language Model: phi-3.5-mini (3.8B) running locally on a CPU machine (Linux).
RAG Stack: Local embedding model (e.g., all-MiniLM) + Chroma vector DB.
Corpus: Best-practice guides and Jupyter notebooks from the single-cell field (e.g., scvelo, single-cell-best-practices).
Indexing: Documents chunked by semantic section (e.g., glossary, pseudotime.ipynb), embedded and indexed locally.
Frontend: Lightweight chat UI or CLI for direct question input.
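To make the architecture concrete, here is a minimal sketch of the indexing side of this stack, assuming sentence-transformers for the all-MiniLM embeddings and the chromadb Python client. The corpus path, chunking rule, and collection name are illustrative placeholders, not the PoC's actual code.

```python
# Sketch only: chunk local documents, embed them with all-MiniLM, and index them in Chroma.
# Paths, the chunking rule, and the collection name are illustrative assumptions.
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # local embedding model
client = chromadb.PersistentClient(path="./scrnaseq_index")  # on-disk vector DB
collection = client.get_or_create_collection("scrnaseq_best_practices")

def chunk_by_section(text: str, max_chars: int = 1500) -> list[str]:
    """Naive section-style chunking: split on blank lines, then cap chunk size."""
    chunks, current = [], ""
    for block in text.split("\n\n"):
        if current and len(current) + len(block) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += block + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Markdown guides shown here; notebooks would be converted or chunked separately.
for doc_path in Path("corpus").glob("**/*.md"):
    chunks = chunk_by_section(doc_path.read_text(encoding="utf-8"))
    collection.add(
        ids=[f"{doc_path.name}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"source": doc_path.name}] * len(chunks),
    )
```

At query time, the same embedding model encodes the user's question, and the nearest chunks are passed to the language model as context.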
Evaluation Methodology
How We Measured Chatbot Performance
To quantify model performance, we designed a custom evaluation framework:
Baseline QA Set: 39 handcrafted questions covering core concepts, from kinetic ODEs to latent time estimation, as well as pipeline-specific steps such as velocity graph construction.
Run 1: Ask all questions to the base LLM without any retrieval.
Run 2: Ask the same questions using the RAG pipeline (i.e., with chunked context provided).
Evaluation Criteria: Each answer was scored on:
Accuracy & Domain Relevance
Completeness
Clarity & Structure
Practical Utility
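A minimal sketch of how the two runs can be organized, keeping baseline and RAG answers side by side for scoring. The ask_baseline and ask_rag callables are hypothetical stand-ins for the bare model and the RAG pipeline; the actual harness is not reproduced here.

```python
# Sketch: the same 39 questions are sent once to the bare model (Run 1) and once
# through the RAG pipeline (Run 2). The callables are hypothetical stand-ins.
from typing import Callable

def run_evaluation(questions: list[str],
                   ask_baseline: Callable[[str], str],
                   ask_rag: Callable[[str], str]) -> list[dict]:
    """Run 1 and Run 2 over the same question set, keeping answers side by side."""
    return [
        {
            "question": q,
            "baseline_answer": ask_baseline(q),  # Run 1: bare phi-3.5-mini, no retrieval
            "rag_answer": ask_rag(q),            # Run 2: retrieved chunks added to the prompt
        }
        for q in questions
    ]
```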
Results & Insights
The RAG-enhanced responses outperformed the baseline in all categories:
+0.82 improvement in accuracy
+1.64 in completeness
+2.28 in clarity
+0.85 in utility
These improvements were especially noticeable in questions requiring multi-step reasoning or referencing multiple parts of the corpus (e.g., latent time vs pseudotime).
Frequently Asked Questions
Can I run a chatbot with my own data on a local Linux machine, without cloud dependencies?
Yes. This project showed that a CPU-based setup can power a real-world scientific chatbot, though inference is considerably slower than on a GPU.
Can small models like phi-3.5 deliver usable results on niche domains like single-cell genomics?
Yes, especially with RAG. RAG boosts precision without requiring large models.
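For illustration, here is a sketch of the query path under two assumptions: the Chroma index from the architecture section, and a GGUF build of phi-3.5-mini served through llama-cpp-python. The PoC's actual runtime and file names are not specified in this write-up, so treat paths and parameters as placeholders.

```python
# Sketch of retrieve-then-generate, assuming the index built earlier and a GGUF
# build of phi-3.5-mini run via llama-cpp-python. Paths and parameters are placeholders.
import chromadb
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./scrnaseq_index") \
                     .get_collection("scrnaseq_best_practices")
llm = Llama(model_path="phi-3.5-mini-instruct.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

def answer(question: str, k: int = 4) -> str:
    # Retrieve the k nearest chunks, then ask the local model to answer from them.
    hits = collection.query(query_embeddings=[embedder.encode(question).tolist()],
                            n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=512, temperature=0.2)
    return out["choices"][0]["text"].strip()
```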
What kind of content works well in a RAG corpus?
Structured tutorials, source code, best-practice guides, and Jupyter notebooks—especially when chunked by semantic topic.
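As an example of notebook chunking, the sketch below splits a notebook such as pseudotime.ipynb at its markdown headings, keeping each heading together with the code cells that follow it. The PoC's actual chunker may differ.

```python
# Sketch: split a Jupyter notebook into chunks at markdown headings, so each
# section heading stays together with its explanation and code cells.
import json

def chunk_notebook(path: str) -> list[str]:
    nb = json.load(open(path, encoding="utf-8"))
    chunks, current = [], []
    for cell in nb["cells"]:
        text = "".join(cell["source"])
        if cell["cell_type"] == "markdown" and text.lstrip().startswith("#") and current:
            chunks.append("\n\n".join(current))   # close the previous section
            current = []
        current.append(text if cell["cell_type"] == "markdown"
                       else f"```python\n{text}\n```")
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# e.g. chunk_notebook("corpus/pseudotime.ipynb")
```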
How can I evaluate chatbot performance?
Define question sets and scoring rubrics. Track accuracy, completeness, clarity, and utility.
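One possible way to encode such a rubric in code: the four criteria come from this project, while the 0-5 scale and equal weighting are illustrative assumptions.

```python
# Rubric as a data structure. The criteria match the evaluation above; the 0-5
# scale and equal weighting are assumptions made for this example.
from dataclasses import dataclass

CRITERIA = ("accuracy", "completeness", "clarity", "utility")

@dataclass
class ScoredAnswer:
    question: str
    scores: dict[str, float]  # criterion -> score on an assumed 0-5 scale

    def overall(self) -> float:
        return sum(self.scores[c] for c in CRITERIA) / len(CRITERIA)

def mean_delta(rag: list[ScoredAnswer], base: list[ScoredAnswer], criterion: str) -> float:
    """Average per-question improvement of RAG over baseline for one criterion."""
    return sum(r.scores[criterion] - b.scores[criterion] for r, b in zip(rag, base)) / len(rag)
```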
Is there a risk of LLM hallucination or bias?
Always. But RAG, version control, and blinded evaluations help mitigate it.
🧠 Was There Bias in Scoring?
Short answer: not in the scoring logic itself. But yes, there is a possibility of cognitive bias in how I framed and interpreted the results, because I had inferred which model was the “RAG-enhanced” one.
Here’s What Happened:
The actual scoring functions (accuracy, completeness, clarity, utility) were:
Heuristically applied
Based only on content length, keyword richness, sentence count, and phrasing clues
Executed identically for both models
✅ So the math wasn’t biased.
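For illustration only, heuristics in this spirit might look like the sketch below; the keyword list, thresholds, and weights are invented for the example and are not the functions used in the evaluation.

```python
# Sketch of heuristics in the spirit described above: length, keyword richness,
# sentence count, and phrasing clues, applied identically to both models' answers.
# The keyword list, thresholds, and weights are illustrative, not the actual functions.
import re

DOMAIN_TERMS = {"rna velocity", "pseudotime", "latent time", "spliced", "unspliced",
                "batch correction", "normalization", "velocity graph"}

def keyword_richness(answer: str) -> float:
    text = answer.lower()
    return sum(term in text for term in DOMAIN_TERMS) / len(DOMAIN_TERMS)

def clarity_score(answer: str) -> float:
    sentences = [s for s in re.split(r"[.!?]\s+", answer) if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    structured = bool(re.search(r"^\s*[-*\d]", answer, re.MULTILINE))  # lists read as structure
    return min(5.0, 2.0 + (1.0 if 10 <= avg_len <= 30 else 0.0) + 2.0 * structured)

def score(answer: str) -> dict[str, float]:
    # The same function is run on baseline and RAG answers, blind to which is which.
    return {
        "accuracy": 5.0 * keyword_richness(answer),
        "completeness": min(5.0, len(answer.split()) / 80),
        "clarity": clarity_score(answer),
        "utility": 5.0 if re.search(r"\b(use|run|set|apply|avoid|recommend)\b",
                                    answer.lower()) else 2.0,
    }
```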
However, my expectation framing (thinking B = RAG) could have:
Influenced how I described performance (“significantly outperforms”)
Created a confirmation bias in qualitative interpretation