[Spelunker] Start using embeddings to pre-select chunks (microsoft#733)
Had lots of "fun" getting embeddings to perform well enough due to the
puny 120k TPM rate limit.

Also finally started refactoring searchCode.ts into smaller pieces
(there are still a few big things left to extract).

Added createActionResultFromMarkdownDisplay to actionHelpers.ts.

And of course logging tweaks.
gvanrossum-ms authored Feb 19, 2025
1 parent cc25c63 commit e24e7a6
Showing 13 changed files with 923 additions and 547 deletions.
13 changes: 13 additions & 0 deletions ts/packages/agentSdk/src/helpers/actionHelpers.ts
@@ -76,6 +76,19 @@ export function createActionResultFromHtmlDisplayWithScript(
    };
}

+export function createActionResultFromMarkdownDisplay(
+    literalText: string,
+    entities: Entity[] = [],
+    resultEntity?: Entity,
+): ActionResultSuccess {
+    return {
+        literalText,
+        entities,
+        resultEntity,
+        displayContent: { type: "markdown", content: literalText },
+    };
+}

export function createActionResultFromError(error: string): ActionResultError {
    return {
        error,
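For illustration, a minimal, hypothetical call site for the new helper (assuming the usual agent-sdk export paths; the file-entity shape is illustrative, not from this commit):

```ts
// Hypothetical usage sketch, not code from this commit.
import { createActionResultFromMarkdownDisplay } from "@typeagent/agent-sdk/helpers/action";
import { Entity } from "@typeagent/agent-sdk";

function makeAnswerResult(answerMarkdown: string, fileNames: string[]) {
    // Attach the referenced files as entities so downstream consumers can use them.
    const entities: Entity[] = fileNames.map((name) => ({
        name,
        type: ["file"],
    }));
    return createActionResultFromMarkdownDisplay(answerMarkdown, entities);
}
```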
28 changes: 21 additions & 7 deletions ts/packages/agents/spelunker/design.md
@@ -16,22 +16,36 @@ Questions about the focused code base are answered roughly as follows:

1. Gather all relevant source files. (E.g. `**/*.{py,ts}`)
2. Chunkify locally (using chunker.py or typescriptChunker.ts)
-3. Send batches of chunks, in parallel, to a cheap, fast LLM
+3. Send batches of chunks, in parallel batches, to a cheap, fast LLM
+   with a prompt asking to summarize each chunk.
+
+(Note that 1-3 need to be done only for new or changed files.)
+
+4. Send batches of chunks, in parallel batches, to a cheap, fast LLM
    with a prompt asking it to find chunks relevant to the user question.
-4. Sort by relevance, keep top `N`. (E.g. `N = 30`)
-5. Send the selected chunks as context to a smart model (the "oracle")
+5. Sort selected chunks by relevance, keep top _N_.
+   (_N_ is dynamically computed to fit in the oracle prompt size limit.)
+6. Send the _N_ top selected chunks as context to a smart model ("the oracle")
    with the request to answer the user question using those chunks as context.
-6. Construct a result from the answer and the chunks used to come up with it.
+7. Construct a result from the answer and the chunks used to come up with it
+   ("references").

## How easy is it to target other languages?

- Need a chunker for each language; the rest is the same.
-- Chunking TypeScript was, realistically, a week's work.
+- Chunking TypeScript was, realistically, a week's work, so not too terrible.

+## Latest changes
+
+The summaries are (so far, only) used to update so-called "breadcrumb" blobs
+(placeholders for sub-chunks) to make the placeholder text look better
+(a comment plus the full signature, rather than just e.g. `def foo ...`).

## TO DO

- Prompt engineering (borrow from John Lam?)
- Evaluation of selection process (does the model do a good enough job?)
-- Scaling. It takes 60-80 seconds to select from ~4000 chunks.
-- Do we need a "global index" (of summaries) like John Lam's ask.py?
+- Scaling. It takes 20-50 seconds to select from ~4000 chunks (and $5).
+  About the same to summarize that number of chunks.
+- Do we need to send a "global index" (of summaries) like John Lam's ask.py?
How to make that scale?
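Steps 4-7 of the pipeline above amount to roughly the following. This is a schematic sketch, not the committed code: `makeBatches` and `keepBestChunks` are the helpers added in batching.ts below, `selectRelevantChunks` and `queryOracle` are hypothetical stand-ins, the budgets are made up, and `ChunkDescription` is assumed to carry a numeric relevance score.

```ts
// Schematic sketch of the selection + oracle flow described above.
import { makeBatches, keepBestChunks } from "./batching.js";
import { Chunk } from "./chunkSchema.js";
import { ChunkDescription } from "./selectorSchema.js";

// Hypothetical stand-ins for the real LLM calls.
declare function selectRelevantChunks(
    batch: Chunk[],
    question: string,
): Promise<ChunkDescription[]>;
declare function queryOracle(question: string, context: Chunk[]): Promise<string>;

async function answerQuestion(question: string, allChunks: Chunk[]): Promise<string> {
    // 4. Ask a cheap model, per batch and in parallel, which chunks are relevant.
    const batches = makeBatches(allChunks, 250_000, 60); // budgets are illustrative
    const ranked = (
        await Promise.all(batches.map((b) => selectRelevantChunks(b, question)))
    ).flat();
    // 5. Sort by descending relevance; keep what fits in the oracle's prompt.
    ranked.sort((a, b) => b.relevance - a.relevance);
    const context = keepBestChunks(ranked, allChunks, 100_000);
    // 6-7. Ask the smart model; its answer plus `context` become the result.
    return queryOracle(question, context);
}
```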
56 changes: 56 additions & 0 deletions ts/packages/agents/spelunker/scaling.md
@@ -0,0 +1,56 @@
# Scaling ideas

These are very unformed thoughts.

## Local indexing with fuzzy matching

Directly after chunking, add embeddings for all chunks, based on the code alone.
(Yes, I know that's pretty lame, but it's what we can do without summarizing all chunks.)

Whenever a question is asked, _first_ search the embeddings for _k_ nearest neighbors,
where _k_ is pretty large (maybe start with 1000).
Then pass those chunks on to the usual AI-driven selection process.

Do we still need summaries if we do this? How would they be used?
(Possibly we could generate summaries for the query context on demand.)
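A minimal sketch of that query-time flow, assuming a vector store with a `nearestNeighbors` method along the lines of `typeagent`'s `VectorStore` (the interfaces below are assumptions, not the real API):

```ts
// Hedged sketch: embedding-based pre-filter before the LLM selection pass.
type ChunkId = string;

// Assumed shape, loosely modeled on typeagent's scored results.
interface ScoredId {
    item: ChunkId;
    score: number;
}

// Assumed shape, loosely modeled on typeagent's VectorStore interface.
interface VectorStore {
    nearestNeighbors(embedding: Float32Array, maxMatches: number): Promise<ScoredId[]>;
}

async function preselectChunkIds(
    store: VectorStore,
    embed: (text: string) => Promise<Float32Array>, // wraps the embedding model
    question: string,
    k = 1000, // generous; tune by halving or doubling
): Promise<ChunkId[]> {
    const queryEmbedding = await embed(question);
    const neighbors = await store.nearestNeighbors(queryEmbedding, k);
    return neighbors.map((n) => n.item); // feed these to the usual selection
}
```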

### Implementation planning

- For now, skip the summarization phase.
- Copy vectorTable.ts from _examples/memoryProviders_ (which IMO isn't a real package).
- Maybe remove stuff we don't need, e.g. generics over `ValueType` and the other weird thing.
- Keep using `interface typeagent.VectorStore<ChunkId>` and put creation in one place.
- Add another file defining an `async` function to get an embedding (probably needs a model).
- After we've got `allChunks` filled (with all the chunks), batch compute and insert
  embeddings for each chunk into the vector store (see the throttling sketch after this list).
- When prepping for a question, instead of sending all chunks off for selection,
  get the query's embedding and request a generous _k_ nearest neighbors, and send _those_
  off to the selection process. Let's start with _k_=1000, and then see whether halving
  or doubling it makes much of a difference.
- The rest is the same.
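The batch-compute step is where the embedding model's rate limit bites (the "puny 120k TPM" from the commit message). A rough throttling sketch; the chars/4 token estimate and the pacing are assumptions, not the committed logic:

```ts
// Hedged sketch: compute embeddings in token-budgeted waves so a
// 120k TPM rate limit isn't blown. All numbers are illustrative.
async function embedAllChunks(
    texts: string[], // one text per chunk
    embedBatch: (batch: string[]) => Promise<Float32Array[]>, // wraps the model
    tokensPerMinute = 120_000,
): Promise<Float32Array[]> {
    const estimateTokens = (s: string) => Math.ceil(s.length / 4); // crude
    const results: Float32Array[] = [];
    let batch: string[] = [];
    let budget = 0;
    const flush = async () => {
        if (batch.length === 0) return;
        results.push(...(await embedBatch(batch)));
        batch = [];
    };
    for (const text of texts) {
        const cost = estimateTokens(text);
        if (budget + cost > tokensPerMinute && batch.length) {
            await flush();
            await new Promise((r) => setTimeout(r, 60_000)); // wait out the window
            budget = 0;
        }
        batch.push(text);
        budget += cost;
    }
    await flush();
    return results;
}
```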

### Again, with feeling

- Copy `vectorTable` from _examples/memoryProviders_, change to pass in the Database object.
  (We could import sqlite from memory-providers, but then the embeddings are in a different database.)
- BETTER: `import { sqlite } from "memory-providers"` and add a createStorageFromDb method.
- EVEN BETTER: Just extract the nearest-neighbors algorithm and do the rest myself. memory-providers is obsolete anyway.
- Create an embedding model when we initialize `QueryContext` (and put it there).
  (Look in old spelunker for example code.)
- Create a table named `ChunkEmbeddings (chunkId TEXT PRIMARY KEY, embedding BLOB)` when creating the db.
- Use `generateTextEmbeddings` or `generateEmbedding` from `typeagent` to get embedding(s).
  Those are async and not free and might fail, but they are generally pretty reliable.
  (There are retry versions too if we need them.)
- IIUC these normalize, so we can use dot product instead of cosine similarity
  (see the sketch after this list).
- Skip the summarizing step. (Keep the code and the Summaries table, we may need them later.)
- Manage embeddings as chunks are removed and added. Probably have to add something
  to remove all embeddings that reference a chunk for a given file (like we do for blobs).
- When processing a query, before the selection step, slim down the chunks using embeddings:
    - Get the embedding for the user query
    - Call `nearestNeighbors` on the `VectorTable`
    - Only read the selected chunk IDs from the Chunks table.
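On the dot-product point above: for unit-length embeddings, cosine similarity reduces to a plain dot product, so brute-force nearest neighbors is a tight loop (a sketch, not the code extracted from memory-providers):

```ts
// For normalized (unit-length) embeddings, cosine similarity == dot product.
function dotProduct(a: Float32Array, b: Float32Array): number {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Brute-force k-nearest-neighbors over in-memory embeddings (sketch).
function nearestNeighbors(
    query: Float32Array,
    rows: { chunkId: string; embedding: Float32Array }[],
    k: number,
): { chunkId: string; score: number }[] {
    return rows
        .map((row) => ({ chunkId: row.chunkId, score: dotProduct(query, row.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k);
}
```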

### TODO

- When there are fewer than maxConcurrency batches, create more batches and distribute
  the chunks evenly. (I have an algorithm in mind; it can go in `makeBatches`.
  A sketch follows.)
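Not necessarily the algorithm the author has in mind; one straightforward version, ignoring the per-batch character budget for brevity:

```ts
// Hedged sketch: split items into `want` near-equal batches so all
// maxConcurrency lanes get work. Ignores the character budget.
function rebalance<T>(items: T[], want: number): T[][] {
    const batches: T[][] = [];
    const per = Math.ceil(items.length / want); // items per batch, rounded up
    for (let i = 0; i < items.length; i += per) {
        batches.push(items.slice(i, i + per));
    }
    return batches;
}
```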
68 changes: 68 additions & 0 deletions ts/packages/agents/spelunker/src/batching.ts
@@ -0,0 +1,68 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import { Chunk } from "./chunkSchema.js";
import { console_log } from "./logging.js";
import { ChunkDescription } from "./selectorSchema.js";

export function makeBatches(
    chunks: Chunk[],
    batchSize: number, // In characters
    maxChunks: number, // How many chunks at most per batch
): Chunk[][] {
    const batches: Chunk[][] = [];
    let batch: Chunk[] = [];
    let size = 0;
    function flush(): void {
        batches.push(batch);
        console_log(
            ` [Batch ${batches.length} has ${batch.length} chunks and ${size} characters]`,
        );
        batch = [];
        size = 0;
    }
    for (const chunk of chunks) {
        const chunkSize = getChunkSize(chunk);
        if (
            size &&
            (size + chunkSize > batchSize || batch.length >= maxChunks)
        ) {
            flush();
        }
        batch.push(chunk);
        size += chunkSize;
    }
    if (size) {
        flush();
    }
    return batches;
}

export function keepBestChunks(
    chunkDescs: ChunkDescription[], // Sorted by descending relevance
    allChunks: Chunk[],
    batchSize: number, // In characters
): Chunk[] {
    const chunks: Chunk[] = [];
    let size = 0;
    for (const chunkDesc of chunkDescs) {
        const chunk = allChunks.find((c) => c.chunkId === chunkDesc.chunkId);
        if (!chunk) continue;
        const chunkSize = getChunkSize(chunk);
        if (size + chunkSize > batchSize && chunks.length) {
            break;
        }
        chunks.push(chunk);
        size += chunkSize;
    }
    return chunks;
}

function getChunkSize(chunk: Chunk): number {
    // This is all an approximation
    let size = chunk.fileName.length + 50;
    for (const blob of chunk.blobs) {
        size += blob.lines.join("").length + 4 * blob.lines.length;
    }
    return size;
}
119 changes: 119 additions & 0 deletions ts/packages/agents/spelunker/src/databaseUtils.ts
@@ -0,0 +1,119 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import * as fs from "fs";
import * as path from "path";
import { createRequire } from "module";

import Database, * as sqlite from "better-sqlite3";

import { SpelunkerContext } from "./spelunkerActionHandler.js";

import { console_log } from "./logging.js";

const databaseSchema = `
CREATE TABLE IF NOT EXISTS Files (
    fileName TEXT PRIMARY KEY,
    mtime FLOAT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS Chunks (
    chunkId TEXT PRIMARY KEY,
    treeName TEXT NOT NULL,
    codeName TEXT NOT NULL,
    parentId TEXT KEY REFERENCES Chunks(chunkId), -- May be null
    fileName TEXT KEY REFERENCES files(fileName) NOT NULL,
    lineNo INTEGER NOT NULL -- 1-based
);
CREATE TABLE IF NOT EXISTS Blobs (
    chunkId TEXT KEY REFERENCES Chunks(chunkId) NOT NULL,
    start INTEGER NOT NULL, -- 0-based
    lines TEXT NOT NULL,
    breadcrumb TEXT -- Chunk ID or empty string or NULL
);
CREATE TABLE IF NOT EXISTS Summaries (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    language TEXT, -- "python", "typescript", etc.
    summary TEXT,
    signature TEXT
);
CREATE TABLE IF NOT EXISTS ChunkEmbeddings (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    embedding BLOB NOT NULL
);
`;

function getDbOptions() {
    if (process?.versions?.electron !== undefined) {
        return undefined;
    }
    const r = createRequire(import.meta.url);
    const betterSqlitePath = r.resolve("better-sqlite3/package.json");
    const nativeBinding = path.join(
        betterSqlitePath,
        "../build/Release/better_sqlite3.n.node",
    );
    return { nativeBinding };
}

export function createDatabase(context: SpelunkerContext): void {
    if (!context.queryContext) {
        throw new Error(
            "context.queryContext must be set before calling createDatabase",
        );
    }
    const loc = context.queryContext.databaseLocation;
    if (context.queryContext.database) {
        console_log(`[Using database at ${loc}]`);
        return;
    }
    if (fs.existsSync(loc)) {
        console_log(`[Opening database at ${loc}]`);
    } else {
        console_log(`[Creating database at ${loc}]`);
    }
    const db = new Database(loc, getDbOptions());
    // Write-Ahead Logging, improving concurrency and performance
    db.pragma("journal_mode = WAL");
    // Fix permissions to be read/write only by the owner
    fs.chmodSync(context.queryContext.databaseLocation, 0o600);
    // Create all the tables we'll use
    db.exec(databaseSchema);
    context.queryContext.database = db;
}

export function purgeFile(db: sqlite.Database, fileName: string): void {
    const prepDeleteEmbeddings = db.prepare(`
        DELETE FROM ChunkEmbeddings WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE filename = ?
        )
    `);
    const prepDeleteSummaries = db.prepare(`
        DELETE FROM Summaries WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE fileName = ?
        )
    `);
    const prepDeleteBlobs = db.prepare(`
        DELETE FROM Blobs WHERE chunkId IN (
            SELECT chunkId
            FROM chunks
            WHERE filename = ?
        )
    `);
    const prepDeleteChunks = db.prepare(
        `DELETE FROM Chunks WHERE fileName = ?`,
    );
    const prepDeleteFiles = db.prepare(`DELETE FROM files WHERE fileName = ?`);

    db.exec(`BEGIN TRANSACTION`);
    prepDeleteSummaries.run(fileName);
    prepDeleteBlobs.run(fileName);
    prepDeleteEmbeddings.run(fileName);
    prepDeleteChunks.run(fileName);
    prepDeleteFiles.run(fileName);
    db.exec(`COMMIT`);
}
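For what it's worth, a hedged sketch of how an embedding could round-trip through the `ChunkEmbeddings` BLOB column above with better-sqlite3 (the actual insert/read code isn't part of this diff):

```ts
// Hedged sketch: storing/loading a Float32Array embedding as a BLOB.
// better-sqlite3 binds Node Buffers to BLOB columns directly.
function putEmbedding(
    db: sqlite.Database,
    chunkId: string,
    embedding: Float32Array,
): void {
    db.prepare(
        `INSERT OR REPLACE INTO ChunkEmbeddings (chunkId, embedding) VALUES (?, ?)`,
    ).run(
        chunkId,
        Buffer.from(embedding.buffer, embedding.byteOffset, embedding.byteLength),
    );
}

function getEmbedding(db: sqlite.Database, chunkId: string): Float32Array | undefined {
    const row = db
        .prepare(`SELECT embedding FROM ChunkEmbeddings WHERE chunkId = ?`)
        .get(chunkId) as { embedding: Buffer } | undefined;
    if (!row) return undefined;
    // Respect byteOffset: the Buffer may be a view into a shared pool.
    return new Float32Array(
        row.embedding.buffer,
        row.embedding.byteOffset,
        row.embedding.byteLength / 4,
    );
}
```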