Minimalistic full-text search implemented in TypeScript.
🔎 Full text search using the BM25F algorithm for multi-field matching
🈯 Fully typed with TypeScript
🧐 Benchmark tests in CI/CD
♻️ JSON-serializable indexes
🌑 Zero runtime dependencies (in the core package)
```bash
yarn add @picosearch/picosearch
```
```typescript
import { Picosearch } from '@picosearch/picosearch';

type MyDoc = {
  id: string;
  text: string;
  additionalText: string;
};

const documents: MyDoc[] = [
  { id: '1', text: 'The quick brown fox', additionalText: 'A speedy canine' },
  { id: '2', text: 'Jumps over the lazy dog', additionalText: 'High leap' },
  { id: '3', text: 'Bright blue sky', additionalText: 'Clear and sunny day' },
];

const pico = new Picosearch<MyDoc>();
pico.insertMultipleDocuments(documents);

console.log(pico.searchDocuments('fox'));
// returns
// [
//   {
//     "id": "1",
//     "score": 0.5406145489041012,
//     "doc": {
//       "id": "1",
//       "text": "The quick brown fox",
//       "additionalText": "A speedy canine"
//     }
//   }
// ]
```
Please note that currently a document must be flat, may only contain string values, and must have an `id` field (also a string)!
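If your data is nested, you can flatten it before indexing. Below is a minimal sketch; the `flatten` helper is our own illustration, not part of the library:

```typescript
// Hypothetical helper (not part of @picosearch/picosearch): flattens a nested
// record into the flat, string-only shape the library currently expects.
const flatten = (
  obj: Record<string, unknown>,
  prefix = '',
): Record<string, string> =>
  Object.entries(obj).reduce<Record<string, string>>((acc, [key, value]) => {
    const path = prefix ? `${prefix}.${key}` : key;
    return typeof value === 'object' && value !== null
      ? { ...acc, ...flatten(value as Record<string, unknown>, path) }
      : { ...acc, [path]: String(value) };
  }, {});

const flatDoc = flatten({
  id: '1',
  text: 'The quick brown fox',
  meta: { author: 'Jane Doe' },
});
// => { id: '1', text: 'The quick brown fox', 'meta.author': 'Jane Doe' }
```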
By default, only generic preprocessing is applied (a simple regex tokenizer plus lowercasing). It is highly recommended to replace this with language-specific options. Currently, the following languages have an additional preprocessing package:
- English (`@picosearch/language-english`)
- German (`@picosearch/language-german`)
After installing one of them, use it like this:
```typescript
import { Picosearch } from '@picosearch/picosearch';
import * as englishOptions from '@picosearch/language-english';

const pico = new Picosearch<MyDoc>({ ...englishOptions });
```
Create an issue if you need another language!
You can also provide a custom tokenizer (for splitting a document into words/tokens) and a custom analyzer (for processing a single token before it is indexed). Just implement the types `Tokenizer` and `Analyzer` and pass your implementations to the constructor. Example:
```typescript
import {
  Picosearch,
  type Analyzer,
  type Tokenizer,
} from '@picosearch/picosearch';

const myTokenizer: Tokenizer = (doc: string): string[] => doc.split(' ');

const myAnalyzer: Analyzer = (token: string): string =>
  // when the analyzer returns '', it is removed
  ['and', 'I'].includes(token) ? '' : token.toLowerCase();

const pico = new Picosearch({
  tokenizer: myTokenizer,
  analyzer: myAnalyzer,
});
```
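To illustrate the effect, here is a small usage sketch. It assumes that the same tokenizer/analyzer pipeline is also applied to query terms, so a stopword filtered out at index time cannot match at search time:

```typescript
pico.insertMultipleDocuments([{ id: '1', text: 'salt and pepper' }]);

// 'and' is analyzed to '' and dropped, both at index time and (assuming the
// pipeline is shared) at query time, so the stopword matches nothing
console.log(pico.searchDocuments('and')); // => []

// 'Pepper' is lowercased by the analyzer, so matching is case-insensitive
console.log(pico.searchDocuments('Pepper')); // => hit for document '1'
```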
Indexes can be exported to and imported from JSON. This is useful, for example, for performing the more compute-heavy indexing offline when the search runtime is in the browser. It is very important that you pass the same tokenizer and analyzer to the new instance and don't change any other constructor options. Here's an example:
```typescript
import { Picosearch } from '@picosearch/picosearch';
import * as englishOptions from '@picosearch/language-english';

const pico = new Picosearch<MyDoc>({ ...englishOptions, keepDocuments: true });
// ...index documents

const jsonIndex = pico.toJSON();
const fromSerialized = new Picosearch<MyDoc>({ ...englishOptions, jsonIndex });
```
Beware of the `keepDocuments` option! You might want to set it to `false` if you only need the index for search and can retrieve individual documents at runtime by their ID in some other way.
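For completeness, here is a sketch of the offline/online split described above. It assumes that `toJSON()` returns a plain JSON-serializable value and that the serialized index is served as a static file at `/index.json`; adjust the transport to your setup:

```typescript
// offline (e.g. a Node build step): persist the index to a static file
import { writeFileSync } from 'node:fs';
writeFileSync('index.json', JSON.stringify(pico.toJSON()));

// at runtime (e.g. in the browser): fetch the file and revive the index,
// passing the same language options that were used at build time
const response = await fetch('/index.json');
const jsonIndex = await response.json();
const search = new Picosearch<MyDoc>({ ...englishOptions, jsonIndex });
```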
The CI/CD pipeline includes a benchmarking step to ensure there are no performance regressions. It currently validates against three datasets from the BEIR benchmark. Retrieval quality is checked to be on par with, or slightly above (due to multi-field matching), the BM25 baseline.
| | scidocs | nfcorpus | scifact |
| --- | --- | --- | --- |
| Picosearch+English (BM25F) | 15.6% | 32.9% | 69.0% |
| Baseline (BM25) [1] | 15.8% | 32.5% | 66.5% |
[1] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv. https://arxiv.org/pdf/2104.08663