sse: semantic search on the edge

Query your documents finding semantic similarities.

Usage

sse crawl <project> <dir> [--include pattern] [--exclude pattern]
    - Crawls the given project directory, including or excluding files that match certain patterns.

sse query <project> <query> [--limit number]
    - Runs a query on a specific project.

sse help
    - Displays this help message.

Example response

sse query hammurabi 'how to render an avatar'
  src/lib/babylon/avatars/AvatarRenderer.ts [distance 12.820]
  src/lib/babylon/avatar-rendering-system.ts [distance 13.001]
  src/lib/babylon/avatars/adr-65/customizations.ts [distance 13.209]

Setup

Warning: the status of the project is very early alpha

Create a database in postgresql with the pgvector extension, for example, by running this:

CREATE ROLE datastore WITH LOGIN PASSWORD 'password' CREATEDB;
CREATE DATABASE datastore;
GRANT ALL PRIVILEGES ON DATABASE datastore TO datastore;
CREATE EXTENSION IF NOT EXISTS vector;

Modify sse/load_settings.py to your liking
Install dependencies: pip install psycopg2 requests flask
Alias main.py to be sse in your path

Indexing folders

Usage: sse crawl <project> <dir>, for example, sse crawl my-project ~/Notes. To index documents, sse runs the following algorithm:

let $SETTINGS be the JSON parsing of file ~/.config/sse.json or the unix env variable $SSE_CONFIG 
let $DB be a connection to a postgresql database defined in $SETTINGS
Create tables "source_embeddings", "file_info", and "chunk_embedding" if they do not exist in $DB

for each $FILE in $SRC:
  let $CONTENTS be the contents of the $FILE
  let $SHASUM be the sha256 sum of $CONTENTS, base32 encoded
  let $ABS_PATH be the absolute path of $FILE
  let $MODEL_ID be the embedding model id defined in $SETTINGS
  let $CHUNKS be the tokenization of the contents of $FILE into chunks of size determined by the model's tokenizer
  let $DATE be the current timestamp
  store $SHASUM, $MODEL_ID, $DATE in table "source_embeddings"
  store unique id, $ABS_PATH, $SHASUM, $DATE in table "file_info"
  for each $CHUNK in $CHUNKS:
    let $CHUNK_HASH be the sha256 sum of $CHUNK, base32 encoded
    let $CHUNK_EMBEDDING be the result of running the embedding model on $CHUNK
    store unique id, $MODEL_ID, $CHUNK_HASH, $CHUNK, $CHUNK_EMBEDDING, $SHASUM, the index of $CHUNK in $CHUNKS, $DATE in table "chunk_embedding"

Dependencies

e5-large-v2, torch
psycopg2
pgvector
"CREATE EXTENSION vector;" in the DB

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.ai		.ai
sse		sse
.envrc		.envrc
.gitignore		.gitignore
README.md		README.md
flake.nix		flake.nix
main.py		main.py
requirements.txt		requirements.txt
run		run
shell.nix		shell.nix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sse: semantic search on the edge

Usage

Example response

Setup

Indexing folders

Dependencies

About

Releases

Packages

Languages

longregen/sse

Folders and files

Latest commit

History

Repository files navigation

sse: semantic search on the edge

Usage

Example response

Setup

Indexing folders

Dependencies

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages