Deploying a Local LLM + RAG Chatbot

A Customer PoC on Single-Cell RNA-Seq Best Practices


Overview

This article describes a proof-of-concept (PoC) project for a customer seeking to build a fully local, domain-specific chatbot powered by Retrieval-Augmented Generation (RAG). The goal was to evaluate how well a lightweight language model, deployed on a CPU-based Linux machine, could deliver accurate, structured, and interactive answers based on a curated corpus of best-practice knowledge in single-cell RNA-sequencing (scRNA-seq).


Problem Statement: Why This Matters

Researchers and bioinformaticians working with single-cell RNA-seq data often face steep learning curves. There are dozens of notebooks, white papers, and tutorials—but no easy way to query this body of knowledge interactively, especially in air-gapped or sensitive computing environments.

The customer wanted a locally hosted chatbot that could:

  • Understand highly technical questions about RNA velocity, pseudotime, batch correction, and preprocessing.

  • Retrieve precise, context-aware answers from their internal corpus of scientific documents.

  • Run entirely offline, without depending on cloud APIs or GPU acceleration.


Technical Architecture

  • Language Model: phi-3.5-mini (3.8B) running locally on a CPU machine (Linux).

  • RAG Stack: Local embedding model (e.g., all-MiniLM) + Chroma vector DB.

  • Corpus: Best-practice guides and Jupyter notebooks from the single-cell field (e.g., scvelo, single-cell-best-practices).

  • Indexing: Documents chunked by semantic section (e.g., glossary, pseudotime.ipynb), embedded and indexed locally (a minimal sketch follows this list).

  • Frontend: Lightweight chat UI or CLI for direct question input.
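
A minimal sketch of how this indexing and retrieval flow could be wired up, assuming the sentence-transformers and chromadb Python packages; the corpus path, collection name, embedding model variant (all-MiniLM-L6-v2), and blank-line chunking rule are illustrative placeholders, not the project's actual code.

```python
# Hypothetical indexing/retrieval sketch; paths, names, and chunking rule are illustrative.
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

CORPUS_DIR = Path("corpus")                           # exported guides/notebooks as .md/.txt
embedder = SentenceTransformer("all-MiniLM-L6-v2")    # small, local embedding model

client = chromadb.PersistentClient(path="rag_index")  # on-disk vector store
collection = client.get_or_create_collection("scrnaseq_best_practices")

def chunk_by_section(text: str) -> list[str]:
    """Naive 'semantic' chunking: split on blank-line-separated sections."""
    return [c.strip() for c in text.split("\n\n") if c.strip()]

# Indexing: embed every chunk and store it with a stable id and its source file.
for doc_path in CORPUS_DIR.glob("**/*.md"):
    chunks = chunk_by_section(doc_path.read_text())
    if not chunks:
        continue
    collection.add(
        ids=[f"{doc_path.name}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"source": doc_path.name}] * len(chunks),
    )

# Retrieval: embed the question and pull back the k most similar chunks.
def retrieve(question: str, k: int = 4) -> list[str]:
    hits = collection.query(query_embeddings=[embedder.encode(question).tolist()],
                            n_results=k)
    return hits["documents"][0]
```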


Evaluation Methodology

How We Measured Chatbot Performance

To quantify model performance, we designed a custom evaluation framework:

  1. Baseline QA Set: 39 handcrafted questions covering core concepts, from kinetic ODEs to latent time estimation, as well as pipeline-specific steps such as velocity graph construction.

  2. Run 1: Ask all questions to the base LLM without any retrieval.

  3. Run 2: Ask the same questions through the RAG pipeline (i.e., with retrieved, chunked context provided); a simplified comparison harness is sketched after this list.

  4. Evaluation Criteria: Each answer was scored on:

    • Accuracy & Domain Relevance

    • Completeness

    • Clarity & Structure

    • Practical Utility
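
A simplified sketch of the two-run comparison, written as a hypothetical harness parameterised over generate() and retrieve() callables so it stays independent of the concrete model and vector store; the prompt template and function names are assumptions, not the project's code.

```python
# Hypothetical two-run harness: the same questions asked with and without retrieval.
from typing import Callable

def answer_base(question: str, generate: Callable[[str], str]) -> str:
    """Run 1: the bare model, no retrieved context."""
    return generate(question)

def answer_rag(question: str,
               generate: Callable[[str], str],
               retrieve: Callable[[str], list[str]]) -> str:
    """Run 2: the same question with top-k retrieved chunks prepended to the prompt."""
    context = "\n\n".join(retrieve(question))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)

def run_eval(questions: list[str],
             generate: Callable[[str], str],
             retrieve: Callable[[str], list[str]]) -> list[dict]:
    """Collect paired answers for every question, ready for rubric scoring."""
    return [{"question": q,
             "base": answer_base(q, generate),
             "rag": answer_rag(q, generate, retrieve)}
            for q in questions]
```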


Results & Insights

The RAG-enhanced responses outperformed the baseline in all categories:

  • +0.82 improvement in accuracy

  • +1.64 in completeness

  • +2.28 in clarity

  • +0.85 in utility

These improvements were especially noticeable in questions requiring multi-step reasoning or referencing multiple parts of the corpus (e.g., latent time vs pseudotime).
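
Assuming each answer received a numeric score per criterion, improvements of this kind are simply per-criterion mean differences (RAG minus baseline) across the question set, as in the following sketch; the criterion keys and data layout are illustrative.

```python
# Per-criterion mean score difference (RAG minus baseline) across the question set.
CRITERIA = ["accuracy", "completeness", "clarity", "utility"]

def mean_deltas(base_scores: list[dict], rag_scores: list[dict]) -> dict[str, float]:
    """base_scores[i] and rag_scores[i] hold the criterion scores for question i."""
    n = len(base_scores)
    return {c: round(sum(r[c] - b[c] for b, r in zip(base_scores, rag_scores)) / n, 2)
            for c in CRITERIA}
```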

Frequently Asked Questions


Can I run a chatbot with my own data on a local Linux machine, without cloud dependencies?

Yes. This project showed that a CPU-based setup can power a real-world scientific chatbot, although inference is considerably slower than on a GPU.
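
As a rough illustration of CPU-only inference, the sketch below assumes a quantised GGUF build of phi-3.5-mini served through the llama-cpp-python bindings; the file name, context size, and thread count are placeholders, not values from the actual deployment.

```python
# Hypothetical CPU-only inference with llama-cpp-python; paths and values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3.5-mini-instruct-Q4_K_M.gguf",  # local quantised weights
    n_ctx=4096,      # context window large enough to hold retrieved chunks
    n_threads=8,     # tune to the number of physical CPU cores
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What does RNA velocity estimate?"}],
    max_tokens=256,
)
print(reply["choices"][0]["message"]["content"])
```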

Can a small model like phi-3.5-mini handle highly technical questions?

Yes, especially with RAG. RAG boosts precision without requiring large models.

What kinds of documents work best in the corpus?

Structured tutorials, source code, best-practice guides, and Jupyter notebooks, especially when chunked by semantic topic.

How should chatbot quality be measured?

Define question sets and scoring rubrics. Track accuracy, completeness, clarity, and utility.

Is there a risk of bias in the evaluation?

Always. But RAG, version control, and blinded evaluations help mitigate it.

🧠 Was There Bias in Scoring?

Short answer: not in the scoring logic itself, but there is a possibility of cognitive bias in how I framed and interpreted the results, because I inferred that one model was the "RAG-enhanced" one.


Here’s What Happened:

  • The actual scoring functions (accuracy, completeness, clarity, utility) were:

    • Heuristically applied

    • Based only on content length, keyword richness, sentence count, and phrasing clues (illustrated in the sketch below)

    • Executed identically for both models
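
For illustration, heuristics of that shape could look like the sketch below; the keyword list, weights, and 0–10 caps are assumptions made for this write-up, not the actual scoring functions.

```python
# Illustrative length/keyword/structure heuristics; keywords and scales are assumed.
import re

DOMAIN_KEYWORDS = {"rna velocity", "spliced", "unspliced", "latent time",
                   "pseudotime", "kinetics", "steady state", "velocity graph"}

def completeness_score(answer: str) -> float:
    """More sentences -> higher completeness, capped at 10."""
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    return min(10.0, 2.0 * len(sentences))

def accuracy_proxy(answer: str) -> float:
    """Keyword richness as a crude proxy for domain relevance, capped at 10."""
    text = answer.lower()
    hits = sum(1 for kw in DOMAIN_KEYWORDS if kw in text)
    return min(10.0, 10.0 * hits / len(DOMAIN_KEYWORDS))

def clarity_score(answer: str) -> float:
    """Reward structural cues such as bullet points and numbered steps."""
    bullets = len(re.findall(r"^\s*(?:[-*•]|\d+\.)\s", answer, flags=re.M))
    return min(10.0, 4.0 + bullets)

# The same functions are applied to both models' answers, so the arithmetic itself
# cannot favour one side; any bias enters at the interpretation stage.
```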

✅ So the math wasn’t biased.
However, my expectation framing (thinking B = RAG) could have:

  • Influenced how I described performance (“significantly outperforms”)

  • Created a confirmation bias in qualitative interpretation