[Spelunker] Start using embeddings to pre-select chunks (microsoft#733)
Had lots of "fun" getting embeddings to perform well enough due to the
puny 120k TPM rate limit.

Also finally started refactoring searchCode.ts into smaller pieces
(there are still a few big things left to extract).

Added createActionResultFromMarkdownDisplay to actionHelpers.ts.

And of course logging tweaks.
gvanrossum-ms authored Feb 19, 2025
1 parent cc25c63 commit e24e7a6
Showing 13 changed files with 923 additions and 547 deletions.
13 changes: 13 additions & 0 deletions ts/packages/agentSdk/src/helpers/actionHelpers.ts
@@ -76,6 +76,19 @@ export function createActionResultFromHtmlDisplayWithScript(
    };
}

+export function createActionResultFromMarkdownDisplay(
+    literalText: string,
+    entities: Entity[] = [],
+    resultEntity?: Entity,
+): ActionResultSuccess {
+    return {
+        literalText,
+        entities,
+        resultEntity,
+        displayContent: { type: "markdown", content: literalText },
+    };
+}

export function createActionResultFromError(error: string): ActionResultError {
    return {
        error,
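For illustration, a minimal, hypothetical call site for the new helper (assuming the usual agent-sdk export paths; the file-entity shape is illustrative, not from this commit):

```ts
// Hypothetical usage sketch, not code from this commit.
import { createActionResultFromMarkdownDisplay } from "@typeagent/agent-sdk/helpers/action";
import { Entity } from "@typeagent/agent-sdk";

function makeAnswerResult(answerMarkdown: string, fileNames: string[]) {
    // Attach the referenced files as entities so downstream consumers can use them.
    const entities: Entity[] = fileNames.map((name) => ({
        name,
        type: ["file"],
    }));
    return createActionResultFromMarkdownDisplay(answerMarkdown, entities);
}
```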
28 changes: 21 additions & 7 deletions ts/packages/agents/spelunker/design.md
@@ -16,22 +16,36 @@ Questions about the focused code base are answered roughly as follows:

1. Gather all relevant source files. (E.g. `**/*.{py,ts}`)
2. Chunkify locally (using chunker.py or typescriptChunker.ts)
-3. Send batches of chunks, in parallel, to a cheap, fast LLM
+3. Send batches of chunks, in parallel batches, to a cheap, fast LLM
+   with a prompt asking to summarize each chunk.
+
+(Note that 1-3 need to be done only for new or changed files.)
+
+4. Send batches of chunks, in parallel batches, to a cheap, fast LLM
    with a prompt asking it to find chunks relevant to the user question.
-4. Sort by relevance, keep top `N`. (E.g. `N = 30`)
-5. Send the selected chunks as context to a smart model (the "oracle")
+5. Sort selected chunks by relevance, keep top _N_.
+   (_N_ is dynamically computed to fit in the oracle prompt size limit.)
+6. Send the _N_ top selected chunks as context to a smart model ("the oracle")
    with the request to answer the user question using those chunks as context.
-6. Construct a result from the answer and the chunks used to come up with it.
+7. Construct a result from the answer and the chunks used to come up with it
+   ("references").

## How easy is it to target other languages?

- Need a chunker for each language; the rest is the same.
-- Chunking TypeScript was, realistically, a week's work.
+- Chunking TypeScript was, realistically, a week's work, so not too terrible.

+## Latest changes
+
+The summaries are (so far, only) used to update so-called "breadcrumb" blobs
+(placeholders for sub-chunks) to make the placeholder text look better
+(a comment plus the full signature, rather than just e.g. `def foo ...`).

## TO DO

- Prompt engineering (borrow from John Lam?)
- Evaluation of selection process (does the model do a good enough job?)
-- Scaling. It takes 60-80 seconds to select from ~4000 chunks.
-- Do we need a "global index" (of summaries) like John Lam's ask.py?
+- Scaling. It takes 20-50 seconds to select from ~4000 chunks (and $5).
+  About the same to summarize that number of chunks.
+- Do we need to send a "global index" (of summaries) like John Lam's ask.py?
How to make that scale?
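Steps 4-7 of the pipeline above amount to roughly the following. This is a schematic sketch, not the committed code: `makeBatches` and `keepBestChunks` are the helpers added in batching.ts below, `selectRelevantChunks` and `queryOracle` are hypothetical stand-ins, the budgets are made up, and `ChunkDescription` is assumed to carry a numeric relevance score.

```ts
// Schematic sketch of the selection + oracle flow described above.
import { makeBatches, keepBestChunks } from "./batching.js";
import { Chunk } from "./chunkSchema.js";
import { ChunkDescription } from "./selectorSchema.js";

// Hypothetical stand-ins for the real LLM calls.
declare function selectRelevantChunks(
    batch: Chunk[],
    question: string,
): Promise<ChunkDescription[]>;
declare function queryOracle(question: string, context: Chunk[]): Promise<string>;

async function answerQuestion(question: string, allChunks: Chunk[]): Promise<string> {
    // 4. Ask a cheap model, per batch and in parallel, which chunks are relevant.
    const batches = makeBatches(allChunks, 250_000, 60); // budgets are illustrative
    const ranked = (
        await Promise.all(batches.map((b) => selectRelevantChunks(b, question)))
    ).flat();
    // 5. Sort by descending relevance; keep what fits in the oracle's prompt.
    ranked.sort((a, b) => b.relevance - a.relevance);
    const context = keepBestChunks(ranked, allChunks, 100_000);
    // 6-7. Ask the smart model; its answer plus `context` become the result.
    return queryOracle(question, context);
}
```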
56 changes: 56 additions & 0 deletions ts/packages/agents/spelunker/scaling.md
@@ -0,0 +1,56 @@
# Scaling ideas

These are very unformed thoughts.

## Local indexing with fuzzy matching

Directly after chunking, add embeddings for all chunks, based on the code alone.
(Yes, I know that's pretty lame, but it's what we can do without summarizing all chunks.)

Whenever a question is asked, _first_ search the embeddings for _k_ nearest neighbors,
where _k_ is pretty large (maybe start with 1000).
Then pass those chunks on to the usual AI-driven selection process.

Do we still need summaries if we do this? How would they be used?
(Possibly we could generate summaries for the query context on demand.)
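A minimal sketch of that query-time flow, assuming a vector store with a `nearestNeighbors` method along the lines of `typeagent`'s `VectorStore` (the interfaces below are assumptions, not the real API):

```ts
// Hedged sketch: embedding-based pre-filter before the LLM selection pass.
type ChunkId = string;

// Assumed shape, loosely modeled on typeagent's scored results.
interface ScoredId {
    item: ChunkId;
    score: number;
}

// Assumed shape, loosely modeled on typeagent's VectorStore interface.
interface VectorStore {
    nearestNeighbors(embedding: Float32Array, maxMatches: number): Promise<ScoredId[]>;
}

async function preselectChunkIds(
    store: VectorStore,
    embed: (text: string) => Promise<Float32Array>, // wraps the embedding model
    question: string,
    k = 1000, // generous; tune by halving or doubling
): Promise<ChunkId[]> {
    const queryEmbedding = await embed(question);
    const neighbors = await store.nearestNeighbors(queryEmbedding, k);
    return neighbors.map((n) => n.item); // feed these to the usual selection
}
```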

### Implementation planning

- For now, skip the summarization phase.
- Copy vectorTable.ts from _examples/memoryProviders_ (which IMO isn't a real package).
- Maybe remove stuff we don't need, e.g. generics over `ValueType` and the other weird thing.
- Keep using `interface typeagent.VectorStore<ChunkId>` and put creation in one place.
- Add another file defining an `async` function to get an embedding (probably needs a model).
- After we've got `allChunks` filled (with all the chunks), batch compute and insert
  embeddings for each chunk into the vector store (see the throttling sketch after this list).
- When prepping for a question, instead of sending all chunks off for selection,
  get the query's embedding and request a generous _k_ nearest neighbors, and send _those_
  off to the selection process. Let's start with _k_=1000, and then see whether halving
  or doubling it makes much of a difference.
- The rest is the same.
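The batch-compute step is where the embedding model's rate limit bites (the "puny 120k TPM" from the commit message). A rough throttling sketch; the chars/4 token estimate and the pacing are assumptions, not the committed logic:

```ts
// Hedged sketch: compute embeddings in token-budgeted waves so a
// 120k TPM rate limit isn't blown. All numbers are illustrative.
async function embedAllChunks(
    texts: string[], // one text per chunk
    embedBatch: (batch: string[]) => Promise<Float32Array[]>, // wraps the model
    tokensPerMinute = 120_000,
): Promise<Float32Array[]> {
    const estimateTokens = (s: string) => Math.ceil(s.length / 4); // crude
    const results: Float32Array[] = [];
    let batch: string[] = [];
    let budget = 0;
    const flush = async () => {
        if (batch.length === 0) return;
        results.push(...(await embedBatch(batch)));
        batch = [];
    };
    for (const text of texts) {
        const cost = estimateTokens(text);
        if (budget + cost > tokensPerMinute && batch.length) {
            await flush();
            await new Promise((r) => setTimeout(r, 60_000)); // wait out the window
            budget = 0;
        }
        batch.push(text);
        budget += cost;
    }
    await flush();
    return results;
}
```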

### Again, with feeling

- Copy `vectorTable` from _examples/memoryProviders_, change to pass in the Database object.
  (We could import sqlite from memory-providers, but then the embeddings are in a different database.)
- BETTER: `import { sqlite } from "memory-providers"` and add a createStorageFromDb method.
- EVEN BETTER: Just extract the nearest-neighbors algorithm and do the rest myself. memory-providers is obsolete anyway.
- Create an embedding model when we initialize `QueryContext` (and put it there).
  (Look in old spelunker for example code.)
- Create a table named `ChunkEmbeddings (chunkId TEXT PRIMARY KEY, embedding BLOB)` when creating the db.
- Use `generateTextEmbeddings` or `generateEmbedding` from `typeagent` to get embedding(s).
  Those are async and not free and might fail, but they are generally pretty reliable.
  (There are retry versions too if we need them.)
- IIUC these normalize, so we can use dot product instead of cosine similarity
  (see the sketch after this list).
- Skip the summarizing step. (Keep the code and the Summaries table, we may need them later.)
- Manage embeddings as chunks are removed and added. Probably have to add something
  to remove all embeddings that reference a chunk for a given file (like we do for blobs).
- When processing a query, before the selection step, slim down the chunks using embeddings:
    - Get the embedding for the user query
    - Call `nearestNeighbors` on the `VectorTable`
    - Only read the selected chunk IDs from the Chunks table.
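On the dot-product point above: for unit-length embeddings, cosine similarity reduces to a plain dot product, so brute-force nearest neighbors is a tight loop (a sketch, not the code extracted from memory-providers):

```ts
// For normalized (unit-length) embeddings, cosine similarity == dot product.
function dotProduct(a: Float32Array, b: Float32Array): number {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Brute-force k-nearest-neighbors over in-memory embeddings (sketch).
function nearestNeighbors(
    query: Float32Array,
    rows: { chunkId: string; embedding: Float32Array }[],
    k: number,
): { chunkId: string; score: number }[] {
    return rows
        .map((row) => ({ chunkId: row.chunkId, score: dotProduct(query, row.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k);
}
```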

### TODO

- When there are fewer than maxConcurrency batches, create more batches and distribute
  the chunks evenly. (I have an algorithm in mind; it can go in `makeBatches`.
  A sketch follows.)
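Not necessarily the algorithm the author has in mind; one straightforward version, ignoring the per-batch character budget for brevity:

```ts
// Hedged sketch: split items into `want` near-equal batches so all
// maxConcurrency lanes get work. Ignores the character budget.
function rebalance<T>(items: T[], want: number): T[][] {
    const batches: T[][] = [];
    const per = Math.ceil(items.length / want); // items per batch, rounded up
    for (let i = 0; i < items.length; i += per) {
        batches.push(items.slice(i, i + per));
    }
    return batches;
}
```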
68 changes: 68 additions & 0 deletions ts/packages/agents/spelunker/src/batching.ts
@@ -0,0 +1,68 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import { Chunk } from "./chunkSchema.js";
import { console_log } from "./logging.js";
import { ChunkDescription } from "./selectorSchema.js";

export function makeBatches(
    chunks: Chunk[],
    batchSize: number, // In characters
    maxChunks: number, // How many chunks at most per batch
): Chunk[][] {
    const batches: Chunk[][] = [];
    let batch: Chunk[] = [];
    let size = 0;
    function flush(): void {
        batches.push(batch);
        console_log(
            ` [Batch ${batches.length} has ${batch.length} chunks and ${size} characters]`,
        );
        batch = [];
        size = 0;
    }
    for (const chunk of chunks) {
        const chunkSize = getChunkSize(chunk);
        if (
            size &&
            (size + chunkSize > batchSize || batch.length >= maxChunks)
        ) {
            flush();
        }
        batch.push(chunk);
        size += chunkSize;
    }
    if (size) {
        flush();
    }
    return batches;
}

export function keepBestChunks(
    chunkDescs: ChunkDescription[], // Sorted by descending relevance
    allChunks: Chunk[],
    batchSize: number, // In characters
): Chunk[] {
    const chunks: Chunk[] = [];
    let size = 0;
    for (const chunkDesc of chunkDescs) {
        const chunk = allChunks.find((c) => c.chunkId === chunkDesc.chunkId);
        if (!chunk) continue;
        const chunkSize = getChunkSize(chunk);
        if (size + chunkSize > batchSize && chunks.length) {
            break;
        }
        chunks.push(chunk);
        size += chunkSize;
    }
    return chunks;
}

function getChunkSize(chunk: Chunk): number {
    // This is all an approximation
    let size = chunk.fileName.length + 50;
    for (const blob of chunk.blobs) {
        size += blob.lines.join("").length + 4 * blob.lines.length;
    }
    return size;
}
119 changes: 119 additions & 0 deletions ts/packages/agents/spelunker/src/databaseUtils.ts
@@ -0,0 +1,119 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import * as fs from "fs";
import * as path from "path";
import { createRequire } from "module";

import Database, * as sqlite from "better-sqlite3";

import { SpelunkerContext } from "./spelunkerActionHandler.js";

import { console_log } from "./logging.js";

const databaseSchema = `
CREATE TABLE IF NOT EXISTS Files (
    fileName TEXT PRIMARY KEY,
    mtime FLOAT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS Chunks (
    chunkId TEXT PRIMARY KEY,
    treeName TEXT NOT NULL,
    codeName TEXT NOT NULL,
    parentId TEXT KEY REFERENCES Chunks(chunkId), -- May be null
    fileName TEXT KEY REFERENCES files(fileName) NOT NULL,
    lineNo INTEGER NOT NULL -- 1-based
);
CREATE TABLE IF NOT EXISTS Blobs (
    chunkId TEXT KEY REFERENCES Chunks(chunkId) NOT NULL,
    start INTEGER NOT NULL, -- 0-based
    lines TEXT NOT NULL,
    breadcrumb TEXT -- Chunk ID or empty string or NULL
);
CREATE TABLE IF NOT EXISTS Summaries (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    language TEXT, -- "python", "typescript", etc.
    summary TEXT,
    signature TEXT
);
CREATE TABLE IF NOT EXISTS ChunkEmbeddings (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    embedding BLOB NOT NULL
);
`;

function getDbOptions() {
    if (process?.versions?.electron !== undefined) {
        return undefined;
    }
    const r = createRequire(import.meta.url);
    const betterSqlitePath = r.resolve("better-sqlite3/package.json");
    const nativeBinding = path.join(
        betterSqlitePath,
        "../build/Release/better_sqlite3.n.node",
    );
    return { nativeBinding };
}

export function createDatabase(context: SpelunkerContext): void {
    if (!context.queryContext) {
        throw new Error(
            "context.queryContext must be set before calling createDatabase",
        );
    }
    const loc = context.queryContext.databaseLocation;
    if (context.queryContext.database) {
        console_log(`[Using database at ${loc}]`);
        return;
    }
    if (fs.existsSync(loc)) {
        console_log(`[Opening database at ${loc}]`);
    } else {
        console_log(`[Creating database at ${loc}]`);
    }
    const db = new Database(loc, getDbOptions());
    // Write-Ahead Logging, improving concurrency and performance
    db.pragma("journal_mode = WAL");
    // Fix permissions to be read/write only by the owner
    fs.chmodSync(context.queryContext.databaseLocation, 0o600);
    // Create all the tables we'll use
    db.exec(databaseSchema);
    context.queryContext.database = db;
}

export function purgeFile(db: sqlite.Database, fileName: string): void {
    const prepDeleteEmbeddings = db.prepare(`
        DELETE FROM ChunkEmbeddings WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE filename = ?
        )
    `);
    const prepDeleteSummaries = db.prepare(`
        DELETE FROM Summaries WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE fileName = ?
        )
    `);
    const prepDeleteBlobs = db.prepare(`
        DELETE FROM Blobs WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE filename = ?
        )
    `);
    const prepDeleteChunks = db.prepare(
        `DELETE FROM Chunks WHERE fileName = ?`,
    );
    const prepDeleteFiles = db.prepare(`DELETE FROM files WHERE fileName = ?`);

    db.exec(`BEGIN TRANSACTION`);
    prepDeleteSummaries.run(fileName);
    prepDeleteBlobs.run(fileName);
    prepDeleteEmbeddings.run(fileName);
    prepDeleteChunks.run(fileName);
    prepDeleteFiles.run(fileName);
    db.exec(`COMMIT`);
}
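For what it's worth, a hedged sketch of how an embedding could round-trip through the `ChunkEmbeddings` BLOB column above with better-sqlite3 (the actual insert/read code isn't part of this diff):

```ts
// Hedged sketch: storing/loading a Float32Array embedding as a BLOB.
// better-sqlite3 binds Node Buffers to BLOB columns directly.
function putEmbedding(
    db: sqlite.Database,
    chunkId: string,
    embedding: Float32Array,
): void {
    db.prepare(
        `INSERT OR REPLACE INTO ChunkEmbeddings (chunkId, embedding) VALUES (?, ?)`,
    ).run(
        chunkId,
        Buffer.from(embedding.buffer, embedding.byteOffset, embedding.byteLength),
    );
}

function getEmbedding(db: sqlite.Database, chunkId: string): Float32Array | undefined {
    const row = db
        .prepare(`SELECT embedding FROM ChunkEmbeddings WHERE chunkId = ?`)
        .get(chunkId) as { embedding: Buffer } | undefined;
    if (!row) return undefined;
    // Respect byteOffset: the Buffer may be a view into a shared pool.
    return new Float32Array(
        row.embedding.buffer,
        row.embedding.byteOffset,
        row.embedding.byteLength / 4,
    );
}
```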