Commit

Merge pull request #145 from loupe-php/develop
Version 0.9
Toflar authored Jan 17, 2025
2 parents 28c0cc4 + b1360df commit da01e0c
Showing 60 changed files with 2,712 additions and 353 deletions.
93 changes: 50 additions & 43 deletions README.md
@@ -8,12 +8,12 @@
Loupe…

* …only requires PHP and SQLite, you don't need anything else - no containers, no nothing
* …is typo-tolerant (based on the State Set Index Algorithm and Levenshtein)
* …is typo-tolerant (based on the State Set Index Algorithm and Damerau-Levenshtein)
* …supports phrase search using `"` quotation marks
* …supports negative keyword and phrase search using `-` as modifier
* …supports filtering (and ordering) on any attribute with any SQL-inspired filter statement
* …supports filtering (and ordering) on Geo distance
* …orders relevance based on a number of factors such as number of matching terms as well as proximity
* …orders relevance based on a number of factors such as number of matching terms, typos, proximity, word counts and exactness
* …auto-detects languages
* …supports stemming
* …is very easy to use
@@ -29,16 +29,16 @@ Note that some implementation details (e.g. libraries used) referenced in this b

## Performance

Performance depends on many factors but here are some ballpark numbers based on indexing the [~32k movies fixture by
MeiliSearch][MeiliSearch_Movies] and the test files in `bin` of this repository:
Performance depends on many factors but here are some ballpark numbers based on indexing the
[~32k movies fixture][MeiliSearch_Movies] provided by MeiliSearch.

* Indexing (`php bin/index_performance_test.php`) will take a little over 2min (~230 documents per second)
* Querying (`php bin/search_performance_test.php`) for `Amakin Dkywalker` with typo tolerance enabled and ordered by
relevance finishes in about `120 ms`
* **Indexing** will take a little over **90 seconds** (~350 documents per second)
* **Querying** for `Amakin Dkywalker` with typo tolerance and relevance ranking takes about **100 ms**

Note that anything above 50k documents is probably not a use case for Loupe. Please, also read the
[Performance](./docs/performance.md) chapter in the docs. You may report your own performance
measurements and more details in the [respective discussion][Performance_Topic].
Note that anything above 50k documents is probably not a use case for Loupe. You can run your own benchmarks
using the scripts in the `bin/bench` folder: `index.php` for indexing and `search.php` for searching.
Please, also read the [Performance](./docs/performance.md) chapter in the docs. You may report your own performance
measurements and more details in the [respective discussion][Performance_Topic].
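
The throughput figure follows from dividing the number of indexed documents by the elapsed time. A minimal sketch, assuming round numbers of 32,000 documents and 90 seconds (the exact fixture size and timing are not stated here, and `documentsPerSecond()` is a hypothetical helper, not part of Loupe):

```php
<?php

declare(strict_types=1);

/**
 * Converts a document count and an elapsed time into documents per second.
 * Hypothetical helper for illustration only.
 */
function documentsPerSecond(int $documentCount, float $elapsedSeconds): float
{
    if ($elapsedSeconds <= 0.0) {
        throw new InvalidArgumentException('Elapsed time must be positive.');
    }

    return $documentCount / $elapsedSeconds;
}

// Assumed round figures: ~32,000 documents indexed in ~90 seconds.
$throughput = documentsPerSecond(32000, 90.0);

printf('%.0F documents per second' . PHP_EOL, $throughput);
```

With these assumed inputs the result lands in the same ballpark as the ~350 documents per second quoted above. Real numbers depend on hardware, PHP version and configuration.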

## Acknowledgement

@@ -55,20 +55,17 @@ I even took the liberty to copy some of their test data to feed Loupe for functi
## Installation

1. Make sure you have `pdo_sqlite` available and your installed SQLite version is at least 3.16.0. This is when
PRAGMA functions have been added without which no schema comparisons are possible. It is recommended you run at
least version 3.35.0 which is when mathematical functions found its way into SQLite. Otherwise, Loupe has to
polyfill those which will result in a little performance penalty.
PRAGMA functions have been added without which no schema comparisons are possible. For best performance it is of
course better to run a more recent version to benefit from improvements within SQLite.
2. Run `composer require loupe/loupe`.

## Usage

```php
<?php

namespace App;
### Creating a client

require_once 'vendor/autoload.php';
The first step is configuring and creating a client.

```php
use Loupe\Loupe\Config\TypoTolerance;
use Loupe\Loupe\Configuration;
use Loupe\Loupe\LoupeFactory;
@@ -82,13 +79,18 @@ $configuration = Configuration::create()
->withTypoTolerance(TypoTolerance::create()->withFirstCharTypoCountsDouble(false)) // can be further fine-tuned but is enabled by default
;

$loupeFactory = new LoupeFactory();
$loupe = (new LoupeFactory())->create('path/to/my_loupe_data_dir', $configuration);
```

To create an in-memory search client:

$loupe = $loupeFactory->create('path/to/my_loupe_data_dir', $configuration);
```php
$loupe = (new LoupeFactory())->createInMemory($configuration);
```

// or create in-memory search:
$loupe = $loupeFactory->createInMemory($configuration);
### Adding documents

```php
$loupe->addDocuments([
[
'uuid' => 2,
@@ -110,8 +112,11 @@ $loupe->addDocuments([
'age' => 18,
],
]);
```

### Performing search

```php
$searchParameters = SearchParameters::create()
->withQuery('Gucleberry')
->withAttributesToRetrieve(['uuid', 'firstname'])
@@ -121,29 +126,30 @@ $searchParameters = SearchParameters::create()

$results = $loupe->search($searchParameters);

foreach ($results->getHits() as $hit) {
    // Only `uuid` and `firstname` are retrieved by the search parameters above.
    echo $hit['firstname'] . PHP_EOL;
}
```

`$results->toArray()` returns a list of search hits and metadata about the query.

```php
print_r($results->toArray());

/*
Array
(
[hits] => Array
(
[0] => Array
(
[uuid] => 6
[firstname] => Huckleberry
)

)

[query] => Gucleberry
[processingTimeMs] => 4
[hitsPerPage] => 20
[page] => 1
[totalPages] => 1
[totalHits] => 1
)
*/
[
'hits' => [
[
'uuid' => 6,
'firstname' => 'Huckleberry'
]
],
'query' => 'Gucleberry',
'processingTimeMs' => 4,
'hitsPerPage' => 20,
'page' => 1,
'totalPages' => 1,
'totalHits' => 1
]
```
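
The array shape above can be processed with plain PHP. The following is a minimal sketch that hard-codes the same structure so it runs without Loupe installed; the variable names are illustrative assumptions:

```php
<?php

declare(strict_types=1);

// Same shape as the result array shown above, hard-coded here
// so the snippet runs without Loupe installed.
$resultArray = [
    'hits' => [
        ['uuid' => 6, 'firstname' => 'Huckleberry'],
    ],
    'query' => 'Gucleberry',
    'processingTimeMs' => 4,
    'hitsPerPage' => 20,
    'page' => 1,
    'totalPages' => 1,
    'totalHits' => 1,
];

// Collect one attribute from every hit.
$firstnames = array_column($resultArray['hits'], 'firstname');

// Derive simple pagination state from the metadata.
$hasNextPage = $resultArray['page'] < $resultArray['totalPages'];

printf(
    'Query "%s" matched %d document(s): %s' . PHP_EOL,
    $resultArray['query'],
    $resultArray['totalHits'],
    implode(', ', $firstnames)
);
```

Only attributes requested for retrieval appear in each hit, so code reading the hits should stick to those keys.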

## Docs
@@ -152,6 +158,7 @@ Array
* [Configuration](./docs/configuration.md)
* [Indexing](./docs/indexing.md)
* [Searching](./docs/searching.md)
* [Ranking](./docs/ranking.md)
* [Tokenizer](./docs/tokenizer.md)
* [Performance](./docs/performance.md)

26 changes: 26 additions & 0 deletions bin/bench/index.php
@@ -0,0 +1,26 @@
<?php

$config = require_once __DIR__ . '/../config.php';

$options = getopt('l::du', ['limit::', 'debug', 'update']);
$limit = intval($options['l'] ?? $options['limit'] ?? 0);
$debug = isset($options['d']) || isset($options['debug']);
$update = isset($options['u']) || isset($options['update']);

$movies = json_decode(file_get_contents($config['movies']), true);
if ($limit > 0) {
    $movies = array_slice($movies, 0, $limit);
}

$config['loupe']->deleteAllDocuments();

$startTime = microtime(true);

$config['loupe']->addDocuments($movies);

if ($update) {
    $config['loupe']->addDocuments($movies);
}

echo sprintf('Indexed in %.2F s using %.2F MiB', microtime(true) - $startTime, memory_get_peak_usage(true) / 1024 / 1024);
echo PHP_EOL;
24 changes: 24 additions & 0 deletions bin/bench/search.php
@@ -0,0 +1,24 @@
<?php

use Loupe\Loupe\SearchParameters;

$config = require_once __DIR__ . '/../config.php';

$options = getopt('q::d', ['query::', 'debug']);
$query = $options['q'] ?? $options['query'] ?? 'Amakin Dkywalker';
$debug = isset($options['d']) || isset($options['debug']);

$startTime = microtime(true);

$searchParameters = SearchParameters::create()
->withQuery($query)
;

$result = $config['loupe']->search($searchParameters);

if ($debug) {
    print_r($result->toArray());
}

echo sprintf('Searched in %.2F ms using %.2F MiB', (microtime(true) - $startTime) * 1000, memory_get_peak_usage(true) / 1024 / 1024);
echo PHP_EOL;
5 changes: 3 additions & 2 deletions bin/config.php
@@ -1,9 +1,10 @@
<?php


use Loupe\Loupe\Configuration;
use Loupe\Loupe\LoupeFactory;

ini_set('memory_limit', '256M');

require_once __DIR__ . '/../vendor/autoload.php';

$movies = __DIR__ . '/../var/movies.json';
@@ -26,4 +27,4 @@
return [
'movies' => $movies,
'loupe' => $loupeFactory->create($dataDir, $configuration),
];
];
11 changes: 0 additions & 11 deletions bin/index_performance_test.php

This file was deleted.

15 changes: 0 additions & 15 deletions bin/search_performance_test.php

This file was deleted.

2 changes: 1 addition & 1 deletion composer.json
@@ -27,7 +27,7 @@
"wamania/php-stemmer": "^3.0",
"doctrine/lexer": "^2.0 || ^3.0",
"mjaschen/phpgeo": "^4.2",
"toflar/state-set-index": "^2.0.1",
"toflar/state-set-index": "^3.0",
"psr/log": "^2.0 || ^3.0",
"nitotm/efficient-language-detector": "^2.0"
},
10 changes: 8 additions & 2 deletions docs/configuration.md
@@ -29,6 +29,9 @@ $configuration = \Loupe\Loupe\Configuration::create()
;
```

Note that the order of searchable attributes has an influence on the [relevance ranking](./ranking.md) of search
results: attributes listed earlier carry more weight than attributes listed later.
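
To make the effect of attribute order concrete, here is a purely illustrative sketch. The decay factor of `0.8`, the function `positionalWeights()` and the attribute names are assumptions chosen for the example. They do not describe Loupe's actual ranking formula, which is covered in the ranking docs:

```php
<?php

declare(strict_types=1);

/**
 * Assigns each searchable attribute a weight that shrinks with its position.
 * Hypothetical illustration only, not Loupe's ranking formula.
 *
 * @param list<string> $searchableAttributes
 * @return array<string, float>
 */
function positionalWeights(array $searchableAttributes, float $decay = 0.8): array
{
    $weights = [];
    foreach (array_values($searchableAttributes) as $position => $attribute) {
        // Position 0 gets weight 1.0, every later position is scaled down further.
        $weights[$attribute] = $decay ** $position;
    }

    return $weights;
}

// Hypothetical attribute names: a match in `title` would count more than one in `overview`.
$weights = positionalWeights(['title', 'overview']);

print_r($weights);
```

The point of the sketch is only the ordering: whichever attribute is listed first ends up with the largest weight.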

## Filterable attributes

By default, no attribute can be filtered on in Loupe. Any attribute you want to filter for, needs to be defined as
@@ -127,11 +130,11 @@ Those are the two major configuration values that affect basically everything in
- The indexing performance
- The search performance

It's pretty hard to explain the State Set Index algorithm in a few short words but I tried my very best to explain
It's pretty hard to explain the State Set Index algorithm in a few short words, but I tried my very best to explain
some of it in the [Performance](performance.md) section. Best is to read the academic paper
linked. However, one thing to note: You **cannot** get wrong search results no matter what values you configure. Those
values are basically about the number of potential false-positives that then have to be filtered by
running the Levenshtein algorithm on all results. The higher the values, the less false-positives. But also the more
running the Damerau-Levenshtein algorithm on all results. The higher the values, the fewer false positives, but also the more
space required for the index.

The alphabet size is configured to `4` by default. The index length to `14`.
@@ -143,6 +146,9 @@ $typoTolerance = \Loupe\Loupe\Config\TypoTolerance::create()
;
```

Note: The paper works using the Levenshtein algorithm. Loupe includes adjustments built on top of that paper to support
Damerau-Levenshtein.
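
The practical difference between the two distances is how swapped adjacent characters are counted. The sketch below compares PHP's built-in `levenshtein()` with a small implementation of the optimal string alignment variant of Damerau-Levenshtein. `osaDistance()` is a hypothetical helper written for this example and works on bytes only; it is not the implementation Loupe uses, which may differ (for instance in multibyte handling):

```php
<?php

declare(strict_types=1);

/**
 * Optimal string alignment distance (restricted Damerau-Levenshtein):
 * like Levenshtein, but swapping two adjacent characters costs 1.
 * Operates on bytes, so it is only suitable for single-byte input.
 */
function osaDistance(string $a, string $b): int
{
    $lengthA = strlen($a);
    $lengthB = strlen($b);

    // $d[$i][$j] = distance between the first $i bytes of $a and the first $j bytes of $b.
    $d = [];
    for ($i = 0; $i <= $lengthA; $i++) {
        $d[$i] = [$i];
    }
    for ($j = 0; $j <= $lengthB; $j++) {
        $d[0][$j] = $j;
    }

    for ($i = 1; $i <= $lengthA; $i++) {
        for ($j = 1; $j <= $lengthB; $j++) {
            $cost = $a[$i - 1] === $b[$j - 1] ? 0 : 1;

            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,        // deletion
                $d[$i][$j - 1] + 1,        // insertion
                $d[$i - 1][$j - 1] + $cost // substitution or match
            );

            // Transposition of two adjacent characters.
            if ($i > 1 && $j > 1 && $a[$i - 1] === $b[$j - 2] && $a[$i - 2] === $b[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }

    return $d[$lengthA][$lengthB];
}

// "hte" is "the" with two adjacent letters swapped.
$plain = levenshtein('hte', 'the');
$withTranspositions = osaDistance('hte', 'the');

printf('Levenshtein: %d, with transpositions: %d' . PHP_EOL, $plain, $withTranspositions);
```

Plain Levenshtein needs two edits for the swapped pair, while the transposition-aware distance needs one, so a common typing mistake consumes less of the typo budget.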

### Typo thresholds

Usually, the longer the words, the more typos should be tolerated. It makes no sense to tolerate `6` typos for a word
28 changes: 28 additions & 0 deletions docs/indexing.md
@@ -1,5 +1,7 @@
# Indexing

## Adding documents

There are two methods to index documents in Loupe. Either you index only one document like so:

```php
@@ -43,6 +45,32 @@ Both of the methods return an `IndexResult` which provides the following methods
* `generalException()` - returns either `null` (if there was no general exception) or an exception implementing
`LoupeExceptionInterface`. A general exception is one that could not be linked to a document ID.

## Removing documents

To remove documents from the index, you can either remove a single document or batch the removal.
Whenever possible, prefer deleting multiple documents at once over deleting each document on its own:
this improves performance and reduces cleanup cost.

You'll need to pass in the id of a document to have it removed from the index.

```php
$loupe->deleteDocument(123);
```

Or you can remove multiple documents at once:

```php
$loupe->deleteDocuments([123, 456]);
```
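
The difference between the two approaches can be sketched without Loupe. `RecordingIndex` below is a hypothetical stand-in that only mimics the two method names shown above and counts how often a deletion is triggered; it says nothing about how Loupe performs the deletion internally:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in that mimics the two deletion methods shown above
 * and counts how many deletion operations were triggered.
 */
final class RecordingIndex
{
    public int $calls = 0;

    /** @var list<int|string> */
    public array $deletedIds = [];

    public function deleteDocument(int|string $id): void
    {
        $this->deleteDocuments([$id]);
    }

    /** @param list<int|string> $ids */
    public function deleteDocuments(array $ids): void
    {
        $this->calls++;
        array_push($this->deletedIds, ...$ids);
    }
}

$staleIds = [123, 456, 789];

// One call per document: three separate operations.
$oneByOne = new RecordingIndex();
foreach ($staleIds as $id) {
    $oneByOne->deleteDocument($id);
}

// One batched call: a single operation for the same result.
$batched = new RecordingIndex();
$batched->deleteDocuments($staleIds);

printf('One by one: %d calls, batched: %d call' . PHP_EOL, $oneByOne->calls, $batched->calls);
```

Collecting the IDs first and passing them in one call keeps the number of operations independent of the number of documents.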

## Removing all documents

If you need to remove all documents at once and start with a clean slate, there's a method for that:

```php
$loupe->deleteAllDocuments();
```

For schema related logic, read [the dedicated schema docs][Schema].

[Schema]: schema.md
3 changes: 3 additions & 0 deletions docs/performance.md
@@ -105,6 +105,9 @@ $configuration = \Loupe\Loupe\Configuration::create()
;
```

Note: The paper works using the Levenshtein algorithm. Loupe includes adjustments built on top of that paper to support
Damerau-Levenshtein.

## Limit the languages to detect

You can read more about what the tokenizer does in the [respective docs](tokenizer.md) but basically, if you know