[Spelunker] Start using embeddings to pre-select chunks (microsoft#733)
Had lots of "fun" getting embeddings to perform well enough, due to the puny 120k TPM rate limit. Also finally started refactoring searchCode.ts into smaller pieces (there are still a few big things left to extract). Added createActionResultFromMarkdownDisplay to actionHelpers.ts. And of course logging tweaks.
1 parent cc25c63 · commit e24e7a6 · 13 changed files with 923 additions and 547 deletions.
@@ -0,0 +1,56 @@
# Scaling ideas

These are very unformed thoughts.

## Local indexing with fuzzy matching

Directly after chunking, add embeddings for all chunks, just based on the code alone.
(Yes I know that's pretty lame, but it's what we can do without summarizing all chunks.)

Whenever a question is asked, _first_ search the embeddings for _k_ nearest neighbors,
where _k_ is pretty large (maybe start with 1000).
Then pass those chunks on to the usual AI-driven selection process.

Do we still need summaries if we do this? How would they be used?
(Possibly we could generate summaries for the query context on demand.)

### Implementation planning

- For now, skip the summarization phase.
- Copy vectorTable.ts from _examples/memoryProviders_ (which IMO isn't a real package).
- Maybe remove stuff we don't need, e.g. generics over `ValueType` and the other weird thing.
- Keep using `interface typeagent.VectorStore<ChunkId>` and put creation in one place.
- Add another file defining an `async` function to get an embedding (probably needs a model).
- After we've got `allChunks` filled (with all the chunks), batch compute and insert
  embeddings for each chunk into the vector store (see the sketch after this list).
- When prepping for a question, instead of sending all chunks off for selection,
  get the query's embedding and request a generous k nearest neighbors, and send _those_
  off to the selection process. Let's start with _k_=1000, and then see if halving
  or doubling it makes much of a difference.
- The rest is the same.
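
A minimal sketch of the batch-embedding step, assuming an embedding function (e.g. wrapping `generateTextEmbeddings` from `typeagent`) and a `VectorStore` keyed by chunk ID. All names and signatures here are illustrative, not the real API:

```ts
// Illustrative stand-ins for the real types.
type ChunkId = string;
interface Chunk { chunkId: ChunkId; text: string; }
interface VectorStore<Id> { put(vector: Float32Array, id: Id): Promise<void>; }
type EmbedFn = (texts: string[]) => Promise<Float32Array[]>;

// Batch-compute embeddings and insert them into the vector store.
// Batching keeps the request count down, which matters under a 120k TPM cap.
async function indexChunks(
    embed: EmbedFn,
    store: VectorStore<ChunkId>,
    allChunks: Chunk[],
    batchSize = 32, // chunks per embedding request; tune against the rate limit
): Promise<void> {
    for (let i = 0; i < allChunks.length; i += batchSize) {
        const batch = allChunks.slice(i, i + batchSize);
        const vectors = await embed(batch.map((c) => c.text));
        for (let j = 0; j < batch.length; j++) {
            await store.put(vectors[j], batch[j].chunkId);
        }
    }
}
```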

### Again, with feeling

- Copy `vectorTable` from _examples/memoryProviders_, change to pass in the Database object.
  (We could import sqlite from memory-providers, but then the embeddings are in a different database.)
- BETTER: `import { sqlite } from "memory-providers"` and add a createStorageFromDb method.
- EVEN BETTER: Just extract the nearest-neighbors algorithm and do the rest myself. memory-providers is obsolete anyway.
- Create an embedding model when we initialize `QueryContext` (and put it there).
  (Look in old spelunker for example code.)
- Create a table named `ChunkEmbeddings (chunkId TEXT PRIMARY KEY, embedding BLOB)` when creating the db.
- Use `generateTextEmbeddings` or `generateEmbedding` from `typeagent` to get embedding(s).
  Those are async and not free and might fail, but are generally pretty reliable.
  (There are retry versions too if we need them.)
- IIUC these normalize, so we can use dot product instead of cosine similarity.
- Skip the summarizing step. (Keep the code and the Summaries table; we may need them later.)
- Manage embeddings as chunks are removed and added. Probably have to add something
  to remove all embeddings that reference a chunk for a given file (like we do for blobs).
- When processing a query, before the selection step, slim down the chunks using embeddings
  (see the sketch after this list):
  - Get the embedding for the user query.
  - Call `nearestNeighbors` on the `VectorTable`.
  - Only read the selected chunk IDs from the Chunks table.
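
A sketch of that query-time flow. That `nearestNeighbors` returns scored chunk IDs is an assumption about the copied `vectorTable`; the other names are again illustrative:

```ts
import * as sqlite from "better-sqlite3";

// Assumed shape of the vector table's nearest-neighbors result.
interface ScoredId { item: string; score: number; }
interface VectorTable { nearestNeighbors(v: Float32Array, k: number): ScoredId[]; }
type EmbedOne = (text: string) => Promise<Float32Array>;

async function preselectChunks(
    db: sqlite.Database,
    vectors: VectorTable,
    embed: EmbedOne,
    query: string,
    k = 1000, // generous; halve or double to see if it matters
): Promise<unknown[]> {
    // Get the embedding for the user query.
    const queryEmbedding = await embed(query);
    // Dot-product scoring is fine here because the embeddings are normalized.
    const neighbors = vectors.nearestNeighbors(queryEmbedding, k);
    // Only read the selected chunk IDs from the Chunks table.
    const prepSelect = db.prepare(`SELECT * FROM Chunks WHERE chunkId = ?`);
    return neighbors.map((n) => prepSelect.get(n.item));
}
```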

### TODO

- When there are fewer than maxConcurrency batches, create more batches and distribute
  the chunks evenly. (I have an algorithm in mind; this can go in `makeBatches`.
  One possible shape is sketched below.)
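
The algorithm itself isn't spelled out here; this is one plausible shape, purely hypothetical, that balances by chunk count and ignores the character-size budget:

```ts
// Hypothetical rebalancing step: when we end up with fewer batches than
// maxConcurrency, re-split the items into near-equal batches.
function rebalance<T>(batches: T[][], maxConcurrency: number): T[][] {
    if (batches.length >= maxConcurrency) return batches;
    const items = batches.flat();
    const target = Math.min(maxConcurrency, items.length);
    const result: T[][] = Array.from({ length: target }, () => []);
    // Round-robin keeps batch sizes within one item of each other.
    items.forEach((item, i) => result[i % target].push(item));
    return result;
}
```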
@@ -0,0 +1,68 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import { Chunk } from "./chunkSchema.js";
import { console_log } from "./logging.js";
import { ChunkDescription } from "./selectorSchema.js";

export function makeBatches(
    chunks: Chunk[],
    batchSize: number, // In characters
    maxChunks: number, // How many chunks at most per batch
): Chunk[][] {
    const batches: Chunk[][] = [];
    let batch: Chunk[] = [];
    let size = 0;
    function flush(): void {
        batches.push(batch);
        console_log(
            ` [Batch ${batches.length} has ${batch.length} chunks and ${size} characters]`,
        );
        batch = [];
        size = 0;
    }
    for (const chunk of chunks) {
        const chunkSize = getChunkSize(chunk);
        if (
            size &&
            (size + chunkSize > batchSize || batch.length >= maxChunks)
        ) {
            flush();
        }
        batch.push(chunk);
        size += chunkSize;
    }
    if (size) {
        flush();
    }
    return batches;
}

export function keepBestChunks(
    chunkDescs: ChunkDescription[], // Sorted by descending relevance
    allChunks: Chunk[],
    batchSize: number, // In characters
): Chunk[] {
    const chunks: Chunk[] = [];
    let size = 0;
    for (const chunkDesc of chunkDescs) {
        const chunk = allChunks.find((c) => c.chunkId === chunkDesc.chunkId);
        if (!chunk) continue;
        const chunkSize = getChunkSize(chunk);
        if (size + chunkSize > batchSize && chunks.length) {
            break;
        }
        chunks.push(chunk);
        size += chunkSize;
    }
    return chunks;
}

function getChunkSize(chunk: Chunk): number {
    // This is all an approximation
    let size = chunk.fileName.length + 50;
    for (const blob of chunk.blobs) {
        size += blob.lines.join("").length + 4 * blob.lines.length;
    }
    return size;
}
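
For context, a hypothetical call site showing how the two helpers bracket the selection step (identifiers and numbers here are made up, not from the commit):

```ts
// Hypothetical usage; allChunks and rankedChunkDescs come from earlier steps.
const batches = makeBatches(allChunks, 100_000, 250); // ~100k chars, at most 250 chunks per batch
// ...each batch goes through the AI-driven selector, yielding ranked ChunkDescriptions...
const best = keepBestChunks(rankedChunkDescs, allChunks, 100_000); // keep top-ranked chunks within budget
```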
@@ -0,0 +1,119 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

import * as fs from "fs";
import * as path from "path";
import { createRequire } from "module";

import Database, * as sqlite from "better-sqlite3";

import { SpelunkerContext } from "./spelunkerActionHandler.js";

import { console_log } from "./logging.js";

const databaseSchema = `
CREATE TABLE IF NOT EXISTS Files (
    fileName TEXT PRIMARY KEY,
    mtime FLOAT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS Chunks (
    chunkId TEXT PRIMARY KEY,
    treeName TEXT NOT NULL,
    codeName TEXT NOT NULL,
    parentId TEXT KEY REFERENCES Chunks(chunkId), -- May be null
    fileName TEXT KEY REFERENCES Files(fileName) NOT NULL,
    lineNo INTEGER NOT NULL -- 1-based
);
CREATE TABLE IF NOT EXISTS Blobs (
    chunkId TEXT KEY REFERENCES Chunks(chunkId) NOT NULL,
    start INTEGER NOT NULL, -- 0-based
    lines TEXT NOT NULL,
    breadcrumb TEXT -- Chunk ID or empty string or NULL
);
CREATE TABLE IF NOT EXISTS Summaries (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    language TEXT, -- "python", "typescript", etc.
    summary TEXT,
    signature TEXT
);
CREATE TABLE IF NOT EXISTS ChunkEmbeddings (
    chunkId TEXT PRIMARY KEY REFERENCES Chunks(chunkId),
    embedding BLOB NOT NULL
);
`;

function getDbOptions() {
    if (process?.versions?.electron !== undefined) {
        return undefined;
    }
    const r = createRequire(import.meta.url);
    const betterSqlitePath = r.resolve("better-sqlite3/package.json");
    const nativeBinding = path.join(
        betterSqlitePath,
        "../build/Release/better_sqlite3.n.node",
    );
    return { nativeBinding };
}

export function createDatabase(context: SpelunkerContext): void {
    if (!context.queryContext) {
        throw new Error(
            "context.queryContext must be set before calling createDatabase",
        );
    }
    const loc = context.queryContext.databaseLocation;
    if (context.queryContext.database) {
        console_log(`[Using database at ${loc}]`);
        return;
    }
    if (fs.existsSync(loc)) {
        console_log(`[Opening database at ${loc}]`);
    } else {
        console_log(`[Creating database at ${loc}]`);
    }
    const db = new Database(loc, getDbOptions());
    // Write-Ahead Logging, improving concurrency and performance
    db.pragma("journal_mode = WAL");
    // Fix permissions to be read/write only by the owner
    fs.chmodSync(loc, 0o600);
    // Create all the tables we'll use
    db.exec(databaseSchema);
    context.queryContext.database = db;
}

export function purgeFile(db: sqlite.Database, fileName: string): void {
    const prepDeleteEmbeddings = db.prepare(`
        DELETE FROM ChunkEmbeddings WHERE chunkId IN (
            SELECT chunkId
            FROM Chunks
            WHERE fileName = ?
        )
    `);
    const prepDeleteSummaries = db.prepare(`
        DELETE FROM Summaries WHERE chunkId IN (
            SELECT chunkId
            FROM Chunks
            WHERE fileName = ?
        )
    `);
    const prepDeleteBlobs = db.prepare(`
        DELETE FROM Blobs WHERE chunkId IN (
            SELECT chunkId
            FROM Chunks
            WHERE fileName = ?
        )
    `);
    const prepDeleteChunks = db.prepare(
        `DELETE FROM Chunks WHERE fileName = ?`,
    );
    const prepDeleteFiles = db.prepare(`DELETE FROM Files WHERE fileName = ?`);

    db.exec(`BEGIN TRANSACTION`);
    prepDeleteSummaries.run(fileName);
    prepDeleteBlobs.run(fileName);
    prepDeleteEmbeddings.run(fileName);
    prepDeleteChunks.run(fileName);
    prepDeleteFiles.run(fileName);
    db.exec(`COMMIT`);
}
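
The `ChunkEmbeddings.embedding` column stores a BLOB. One plausible way to round-trip a `Float32Array` through it with better-sqlite3, as an assumed sketch (the commit doesn't show this code):

```ts
import * as sqlite from "better-sqlite3";

// Serialize a Float32Array to a Buffer for the BLOB column, and back.
// Assumes embeddings are stored as raw float32 values with a known dimension.
function embeddingToBlob(embedding: Float32Array): Buffer {
    return Buffer.from(embedding.buffer, embedding.byteOffset, embedding.byteLength);
}

function blobToEmbedding(blob: Buffer): Float32Array {
    return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}

function putEmbedding(
    db: sqlite.Database,
    chunkId: string,
    embedding: Float32Array,
): void {
    db.prepare(
        `INSERT OR REPLACE INTO ChunkEmbeddings (chunkId, embedding) VALUES (?, ?)`,
    ).run(chunkId, embeddingToBlob(embedding));
}
```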