Merge pull request #145 from loupe-php/develop

Version 0.9
loupe-php · Jan 17, 2025 · da01e0c
2 parents 28c0cc4 + b1360df
commit da01e0c

Showing 60 changed files with 2,712 additions and 353 deletions.
diff --git a/README.md b/README.md
@@ -8,12 +8,12 @@
 Loupe…
 
 * …only requires PHP and SQLite, you don't need anything else - no containers, no nothing
-* …is typo-tolerant (based on the State Set Index Algorithm and Levenshtein)
+* …is typo-tolerant (based on the State Set Index Algorithm and Damerau-Levenshtein)
 * …supports phrase search using `"` quotation marks
 * …supports negative keyword and phrase search using `-` as modifier
 * …supports filtering (and ordering) on any attribute with any SQL-inspired filter statement
 * …supports filtering (and ordering) on Geo distance
-* …orders relevance based on a number of factors such as number of matching terms as well as proximity
+* …orders relevance based on a number of factors such as number of matching terms, typos, proximity, word counts and exactness
 * …auto-detects languages
 * …supports stemming
 * …is very easy to use
@@ -29,16 +29,16 @@ Note that some implementation details (e.g. libraries used) referenced in this b
 
 ## Performance
 
-Performance depends on many factors but here are some ballpark numbers based on indexing the [~32k movies fixture by 
-MeiliSearch][MeiliSearch_Movies] and the test files in `bin` of this repository:
+Performance depends on many factors but here are some ballpark numbers based on indexing the 
+[~32k movies fixture][MeiliSearch_Movies] provided by MeiliSearch.
 
-* Indexing (`php bin/index_performance_test.php`) will take a little over 2min (~230 documents per second)
-* Querying (`php bin/search_performance_test.php`) for `Amakin Dkywalker` with typo tolerance enabled and ordered by 
-  relevance finishes in about `120 ms`
+* **Indexing** will take a little over **90 seconds** (~350 documents per second)
+* **Querying** for `Amakin Dkywalker` with typo tolerance and relevance ranking takes about **100 ms**
 
-Note that anything above 50k documents is probably not a use case for Loupe. Please, also read the
-[Performance](./docs/performance.md) chapter in the docs. You may report your own performance 
-measurements and more details in the [respective discussion][Performance_Topic].
+Note that anything above 50k documents is probably not a use case for Loupe. You can run your own benchmarks 
+using the scripts in the `bin/bench` folder: `index.php` for indexing and `search.php` for searching. 
+Please, also read the [Performance](./docs/performance.md) chapter in the docs. You may report your own performance 
+measurements and more details in the [respective discussion][Performance_Topic]. 
 
 ## Acknowledgement
 
@@ -55,20 +55,17 @@ I even took the liberty to copy some of their test data to feed Loupe for functi
 ## Installation
 
 1. Make sure you have `pdo_sqlite` available and your installed SQLite version is at least 3.16.0. This is when 
-   PRAGMA functions have been added without which no schema comparisons are possible. It is recommended you run at 
-   least version 3.35.0 which is when mathematical functions found its way into SQLite. Otherwise, Loupe has to 
-   polyfill those which will result in a little performance penalty.
+   PRAGMA functions were added, without which no schema comparisons are possible. For best performance, run a
+   more recent version to benefit from improvements within SQLite.
 2. Run `composer require loupe/loupe`.
 
 ## Usage
 
-```php
-<?php
-
-namespace App;
+### Creating a client
 
-require_once 'vendor/autoload.php';
+The first step is configuring and creating a client.
 
+```php
 use Loupe\Loupe\Config\TypoTolerance;
 use Loupe\Loupe\Configuration;
 use Loupe\Loupe\LoupeFactory;
@@ -82,13 +79,18 @@ $configuration = Configuration::create()
     ->withTypoTolerance(TypoTolerance::create()->withFirstCharTypoCountsDouble(false)) // can be further fine-tuned but is enabled by default
 ;
 
-$loupeFactory = new LoupeFactory();
+$loupe = (new LoupeFactory())->create('path/to/my_loupe_data_dir', $configuration);
+```
+
+To create an in-memory search client:
 
-$loupe = $loupeFactory->create('path/to/my_loupe_data_dir', $configuration);
+```php
+$loupe = (new LoupeFactory())->createInMemory($configuration);
+```
 
-// or create in-memory search:
-$loupe = $loupeFactory->createInMemory($configuration);
+### Adding documents
 
+```php
 $loupe->addDocuments([
     [
         'uuid' => 2,
@@ -110,8 +112,11 @@ $loupe->addDocuments([
         'age' => 18,
     ],
 ]);
+```
 
+### Performing search
 
+```php
 $searchParameters = SearchParameters::create()
     ->withQuery('Gucleberry')
     ->withAttributesToRetrieve(['uuid', 'firstname'])
@@ -121,29 +126,30 @@ $searchParameters = SearchParameters::create()
 
 $results = $loupe->search($searchParameters);
 
+foreach ($results->getHits() as $hit) {
+    echo $hit['firstname'] . PHP_EOL;
+}
+```
+
+The `$results` object contains a list of search hits and metadata about the query. Use `toArray()` to inspect it:
+
+```php
 print_r($results->toArray());
 
-/*
-Array
-(
-    [hits] => Array
-        (
-            [0] => Array
-                (
-                    [uuid] => 6
-                    [firstname] => Huckleberry
-                )
-
-        )
-
-    [query] => Gucleberry
-    [processingTimeMs] => 4
-    [hitsPerPage] => 20
-    [page] => 1
-    [totalPages] => 1
-    [totalHits] => 1
-)
-*/
+[
+    'hits' => [
+        [
+            'uuid' => 6,
+            'firstname' => 'Huckleberry'
+        ]
+    ],
+    'query' => 'Gucleberry',
+    'processingTimeMs' => 4,
+    'hitsPerPage' => 20,
+    'page' => 1,
+    'totalPages' => 1,
+    'totalHits' => 1
+]
 ```
 
 ## Docs
@@ -152,6 +158,7 @@ Array
 * [Configuration](./docs/configuration.md)
 * [Indexing](./docs/indexing.md)
 * [Searching](./docs/searching.md)
+* [Ranking](./docs/ranking.md)
 * [Tokenizer](./docs/tokenizer.md)
 * [Performance](./docs/performance.md)
 

diff --git a/bin/bench/index.php b/bin/bench/index.php
@@ -0,0 +1,26 @@
+<?php
+
+$config = require_once __DIR__ . '/../config.php';
+
+$options = getopt('l::du', ['limit::', 'debug', 'update']);
+$limit = intval($options['l'] ?? $options['limit'] ?? 0);
+$debug = isset($options['d']) || isset($options['debug']);
+$update = isset($options['u']) || isset($options['update']);
+
+$movies = json_decode(file_get_contents($config['movies']), true);
+if ($limit > 0) {
+    $movies = array_slice($movies, 0, $limit);
+}
+
+$config['loupe']->deleteAllDocuments();
+
+$startTime = microtime(true);
+
+$config['loupe']->addDocuments($movies);
+
+if ($update) {
+    $config['loupe']->addDocuments($movies);
+}
+
+echo sprintf('Indexed in %.2F s using %.2F MiB', microtime(true) - $startTime, memory_get_peak_usage(true) / 1024 / 1024);
+echo PHP_EOL;
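
The benchmark script above reports elapsed time and peak memory, while the README quotes throughput in documents per second. The following is a minimal sketch of how the two relate. The helper name `measure()` and the stand-in workload are assumptions made for illustration and are not part of Loupe or of this pull request.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Loupe): runs $task once and reports
 * wall-clock time, peak memory and throughput for $itemCount items.
 */
function measure(callable $task, int $itemCount): array
{
    $startTime = microtime(true);
    $task();
    $seconds = microtime(true) - $startTime;

    return [
        'seconds' => $seconds,
        'peakMiB' => memory_get_peak_usage(true) / 1024 / 1024,
        // Guard against division by zero for tasks that finish instantly.
        'itemsPerSecond' => $seconds > 0 ? $itemCount / $seconds : INF,
    ];
}

// Stand-in workload; in the benchmark this would be the addDocuments() call.
$items = range(1, 10000);
$stats = measure(static function () use ($items): void {
    array_sum(array_map(static fn (int $i): int => $i * 2, $items));
}, count($items));

printf(
    'Processed in %.2F s using %.2F MiB (%.0F items/s)' . PHP_EOL,
    $stats['seconds'],
    $stats['peakMiB'],
    $stats['itemsPerSecond']
);
```

Applied to the README figures, roughly 32,000 documents in a little over 90 seconds works out to about 350 documents per second. When running `bin/bench/index.php`, note that PHP's `getopt()` does not accept a space as separator for optional values (`l::`, `limit::`), so the limit has to be passed as `-l1000` or `--limit=1000`.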
diff --git a/bin/bench/search.php b/bin/bench/search.php
@@ -0,0 +1,24 @@
+<?php
+
+use Loupe\Loupe\SearchParameters;
+
+$config = require_once __DIR__ . '/../config.php';
+
+$options = getopt('q::d', ['query::', 'debug']);
+$query = $options['q'] ?? $options['query'] ?? 'Amakin Dkywalker';
+$debug = isset($options['d']) || isset($options['debug']);
+
+$startTime = microtime(true);
+
+$searchParameters = SearchParameters::create()
+    ->withQuery($query)
+;
+
+$result = $config['loupe']->search($searchParameters);
+
+if ($debug) {
+    print_r($result->toArray());
+}
+
+echo sprintf('Searched in %.2F ms using %.2F MiB', (microtime(true) - $startTime) * 1000, memory_get_peak_usage(true) / 1024 / 1024);
+echo PHP_EOL;
diff --git a/bin/config.php b/bin/config.php
@@ -1,9 +1,10 @@
 <?php
 
-
 use Loupe\Loupe\Configuration;
 use Loupe\Loupe\LoupeFactory;
 
+ini_set('memory_limit', '256M');
+
 require_once __DIR__ . '/../vendor/autoload.php';
 
 $movies = __DIR__ . '/../var/movies.json';
@@ -26,4 +27,4 @@
 return [
     'movies' => $movies,
     'loupe' => $loupeFactory->create($dataDir, $configuration),
-];
+];
diff --git a/bin/index_performance_test.php b/bin/index_performance_test.php
diff --git a/bin/search_performance_test.php b/bin/search_performance_test.php
diff --git a/composer.json b/composer.json
@@ -27,7 +27,7 @@
         "wamania/php-stemmer": "^3.0",
         "doctrine/lexer": "^2.0 || ^3.0",
         "mjaschen/phpgeo": "^4.2",
-        "toflar/state-set-index": "^2.0.1",
+        "toflar/state-set-index": "^3.0",
         "psr/log": "^2.0 || ^3.0",
         "nitotm/efficient-language-detector": "^2.0"
     },

diff --git a/docs/configuration.md b/docs/configuration.md
@@ -29,6 +29,9 @@ $configuration = \Loupe\Loupe\Configuration::create()
 ;
 ```
 
+Note that the order of searchable attributes has an influence on the [relevance ranking](./ranking.md) of search
+results: attributes listed earlier carry more weight than attributes listed later.
+
 ## Filterable attributes
 
 By default, no attribute can be filtered on in Loupe. Any attribute you want to filter for, needs to be defined as 
@@ -127,11 +130,11 @@ Those are the two major configuration values that affect basically everything in
 - The indexing performance
 - The search performance
 
-It's pretty hard to explain the State Set Index algorithm in a few short words but I tried my very best to explain 
+It's pretty hard to explain the State Set Index algorithm in a few short words, but I tried my very best to explain 
 some of it in the [Performance](performance.md) section. Best is to read the academic paper
 linked. However, one thing to note: You **cannot** get wrong search results no matter what values you configure. Those  
 values are basically about the number of potential false-positives that then have to be filtered by 
-running the Levenshtein algorithm on all results. The higher the values, the less false-positives. But also the more 
+running the Damerau-Levenshtein algorithm on all results. The higher the values, the fewer false-positives. But also the more 
 space required for the index.
 
 The alphabet size is configured to `4` by default. The index length to `14`.
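
To illustrate why the switch from Levenshtein to Damerau-Levenshtein matters for typo tolerance, here is a minimal sketch of the restricted variant (optimal string alignment), in which swapping two adjacent characters counts as a single edit. The function name `osaDistance()` is an assumption made for illustration. This is not Loupe's implementation, and the sketch works on bytes, so it is only meaningful for ASCII input.

```php
<?php

declare(strict_types=1);

/**
 * Restricted Damerau-Levenshtein distance (optimal string alignment).
 * Illustrative only: operates on bytes, so it is only meaningful for ASCII.
 */
function osaDistance(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);

    // $d[$i][$j] = distance between the first $i bytes of $a and the first $j bytes of $b.
    $d = [];
    for ($i = 0; $i <= $lenA; $i++) {
        $d[$i] = [$i];
    }
    for ($j = 0; $j <= $lenB; $j++) {
        $d[0][$j] = $j;
    }

    for ($i = 1; $i <= $lenA; $i++) {
        for ($j = 1; $j <= $lenB; $j++) {
            $cost = $a[$i - 1] === $b[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,        // deletion
                $d[$i][$j - 1] + 1,        // insertion
                $d[$i - 1][$j - 1] + $cost // substitution (or match)
            );

            // Transposition of two adjacent characters counts as one edit.
            if ($i > 1 && $j > 1 && $a[$i - 1] === $b[$j - 2] && $a[$i - 2] === $b[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }

    return $d[$lenA][$lenB];
}

echo levenshtein('Huckleberry', 'Hucklebrery') . PHP_EOL; // 2 (plain Levenshtein)
echo osaDistance('Huckleberry', 'Hucklebrery') . PHP_EOL; // 1 (transposition is one edit)
```

With plain Levenshtein the swapped `er`/`re` costs two edits, whereas the transposition-aware distance counts it as one, so a common typing slip uses up less of the allowed typo budget.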
@@ -143,6 +146,9 @@ $typoTolerance = \Loupe\Loupe\Config\TypoTolerance::create()
 ;
 ```
 
+Note: The paper works using the Levenshtein algorithm. Loupe includes adjustments built on top of that paper to support
+Damerau-Levenshtein.
+
 ### Typo thresholds
 
 Usually, the longer the words, the more typos should be tolerated. It makes no sense to tolerate `6` typos for a word 

diff --git a/docs/indexing.md b/docs/indexing.md
@@ -1,5 +1,7 @@
 # Indexing
 
+## Adding documents
+
 There are two methods to index documents in Loupe. Either you index only one document like so:
 
 ```php
@@ -43,6 +45,32 @@ Both of the methods return an `IndexResult` which provides the following methods
 * `generalException()` - returns either `null` (if there was no general exception) or an exception implementing 
   `LoupeExceptionInterface`. A general exception is one that could not be linked to a document ID.
 
+## Removing documents
+
+To remove documents from the index, you can either remove a single document or several documents at once.
+Whenever possible, prefer deleting multiple documents in one call over deleting each document on its own:
+batching improves performance and reduces the cleanup cost.
+
+You'll need to pass in the id of a document to have it removed from the index.
+
+```php
+$loupe->deleteDocument(123);
+```
+
+Or you can remove multiple documents at once:
+
+```php
+$loupe->deleteDocuments([123, 456]);
+```
+
+## Removing all documents
+
+If you need to remove all documents at once and start with a clean slate, there's a method for that:
+
+```php
+$loupe->deleteAllDocuments();
+```
+
 For schema related logic, read [the dedicated schema docs][Schema].
 
 [Schema]: schema.md
diff --git a/docs/performance.md b/docs/performance.md
@@ -105,6 +105,9 @@ $configuration = \Loupe\Loupe\Configuration::create()
 ;
 ```
 
+Note: The paper works using the Levenshtein algorithm. Loupe includes adjustments built on top of that paper to support
+Damerau-Levenshtein.
+
 ## Limit the languages to detect
 
 You can read more about what the tokenizer does in the [respective docs](tokenizer.md) but basically, if you know