Published: January 24, 2024
11
39
763

An underrated part of Cursor is our codebase indexing system. It provides efficient indexing/updating without storing any code on our servers. (1/9)

The first step is constructing a merkle tree on the client. We use napi to hook our rust implementation into our typescript frontend. We choose rust to maximize speed and minimize resource use for merkle tree construction. Why a merkle tree? (2/9)

With unexpected failures on the client or server, the state of the codebase vs the state of the indexed codebase on the server may deviate We could fix this by sending the client state to the server (or vice versa) But consider the case for large codebases ~50K files (3/9)

The size of each filename is around 32 chars (>=32 bytes), and the sha-256 hash is another 32 bytes. This means just the file names and the hash of their contents will take 3.2MB. On each small update, we’d have to upload/download 3.2MB! That's untenable! (4/9)

This is where the merkle tree comes in. A merkle tree is a data structure where every leaf node stores the hash of a block of data, and every parent stores the hash of its children. For any change, it takes O(log(n)) bandwidth to find where the client/server disagree. (5/9)

Then, for every file that has changed, we use the standard approach of tree-sitter to get syntactically relevant chunks. This means most chunks will be identical for a small change in a given file. Why is this important? (6/9)

The biggest bottleneck/expense is embeddings. So, we use dynamo-db as an embedding cache for code chunks. When re-indexing a file where a single line of code has changed, we see cache hits for all unchanged chunks. The only text we'll need to embed is the changed chunk. (7/9)

This is the reason why codebase indexing is very fast with open-source repos or when other users on your team have indexed a codebase. Most of the chunks are cache hits, so all we need to do is write the embeddings into a vector DB and update the merkle tree. (8/9)

And in this whole process, no code gets stored on our servers! Just the sha256 hashes of the chunks, file names, and 500 token embeddings. (9/9)

Share this thread

Read on Twitter

View original thread

Navigate thread

1/9