Commit

Version 0.8
Toflar authored Nov 25, 2024
2 parents 83422c8 + 90e0334 commit f80a453
Showing 23 changed files with 864 additions and 308 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/ci.yaml
@@ -11,7 +11,7 @@ jobs:
- name: Setup PHP
uses: shivammathur/setup-php@v2
with:
-php-version: '8.2'
+php-version: '8.3'
coverage: none

- name: Checkout
@@ -30,7 +30,7 @@ jobs:
- name: Setup PHP
uses: shivammathur/setup-php@v2
with:
-php-version: '8.2'
+php-version: '8.3'
coverage: none

- name: Checkout
@@ -53,7 +53,7 @@ jobs:
fail-fast: false
matrix:
sqlite: ['3.16.0', 'default']
-php: ['8.1', '8.2', '8.3']
+php: ['8.1', '8.2', '8.3', '8.4']
composer: ['--prefer-stable', '--prefer-lowest']
steps:
- name: Setup PHP
7 changes: 4 additions & 3 deletions README.md
@@ -10,9 +10,10 @@ Loupe…
* …only requires PHP and SQLite, you don't need anything else - no containers, no nothing
* …is typo-tolerant (based on the State Set Index Algorithm and Levenshtein)
* …supports phrase search using `"` quotation marks
+* …supports negative keyword and phrase search using `-` as modifier
* …supports filtering (and ordering) on any attribute with any SQL-inspired filter statement
* …supports filtering (and ordering) on Geo distance
-* …orders relevance based on a typical TF-IDF Cosine similarity algorithm
+* …orders relevance based on a number of factors such as number of matching terms as well as proximity
* …auto-detects languages
* …supports stemming
* …is very easy to use
@@ -31,9 +32,9 @@ Note that some implementation details (e.g. libraries used) referenced in this b
Performance depends on many factors but here are some ballpark numbers based on indexing the [~32k movies fixture by
MeiliSearch][MeiliSearch_Movies] and the test files in `bin` of this repository:

-* Indexing (`php bin/index_performance_test.php`) will take less than 5min (~110 documents per second)
+* Indexing (`php bin/index_performance_test.php`) will take a little over 2min (~230 documents per second)
* Querying (`php bin/search_performance_test.php`) for `Amakin Dkywalker` with typo tolerance enabled and ordered by
-relevance finishes in about `80 ms`
+relevance finishes in about `120 ms`

Note that anything above 50k documents is probably not a use case for Loupe. Please, also read the
[Performance](./docs/performance.md) chapter in the docs. You may report your own performance
4 changes: 4 additions & 0 deletions docs/blog_post.md
@@ -1,5 +1,9 @@
# Loupe - a search engine with only PHP and SQLite

+> IMPORTANT: Quite a few things have changed over time. Loupe does not use the same libraries anymore and also does not
+> use a TF-IDF ranking order anymore. This blog post represents the initial version, make sure you also read the current
+> state of the documentation. The blog post might still help for the big picture though.
+
They say that when you want to explain something to people, you should tell them a story. After all, we all read and
hear stories every day, and it makes topics understandable and relatable. So here we go:

26 changes: 22 additions & 4 deletions docs/searching.md
@@ -31,6 +31,15 @@ $searchParameters = \Loupe\Loupe\SearchParameters::create()
;
```

+You can also exclude documents that match to a given keyword. Use `-` as modifier. You can exclude both, regular keywords
+as well as phrases:
+
+```php
+$searchParameters = \Loupe\Loupe\SearchParameters::create()
+->withQuery('This but -"not this" or -this')
+;
+```
+
Hint: Note that your query is stripped if it's very long. See the section about [maximum query tokens in the
configuration settings][Config].

@@ -88,9 +97,10 @@ To make sure you properly escape the filter values, you can use `SearchParameter

## Sort

-By default, Loupe sorts your results based on relevance. Relevance is determined using a TF-IDF algorithm combined
-with cosine similarity. The relevance attribute is reserved and is called `_relevance`. You can sort by your own
-attributes or by multiple ones and specify whether to sort ascending or descending:
+By default, Loupe sorts your results based on relevance. Relevance is determined using a number of factors such as the
+number of matching terms but also the proximity (search for `pink floyd` will make sure documents that contain `pink floyd`
+will be ranked higher than `the pink pullover of Floyd`). The relevance attribute is reserved and is called `_relevance`.
+You can sort by your own attributes or by multiple ones and specify whether to sort ascending or descending:

Note that you can only sort [on attributes that you have defined to be sortable in the configuration][Config].

@@ -109,7 +119,15 @@ $searchParameters = \Loupe\Loupe\SearchParameters::create()
;
```

-In this case, every hit will have an additional `_rankingScore` attribute with a value between `-1.0` and `1.0`.
+In this case, every hit will have an additional `_rankingScore` attribute with a value between `0.0` and `1.0`.
+
+You can also limit the search results to a `rankingScoreThreshold` between `0.0` and `1.0`:
+
+```php
+$searchParameters = \Loupe\Loupe\SearchParameters::create()
+->withRankingScoreThreshold(0.8)
+;
+```

## Pagination

73 changes: 0 additions & 73 deletions src/Internal/CosineSimilarity.php

This file was deleted.

25 changes: 25 additions & 0 deletions src/Internal/Doctrine/CachePreparedStatementsConnection.php
@@ -0,0 +1,25 @@
<?php

declare(strict_types=1);

namespace Loupe\Loupe\Internal\Doctrine;

use Doctrine\DBAL\Driver\Middleware\AbstractConnectionMiddleware;
use Doctrine\DBAL\Driver\Statement;

final class CachePreparedStatementsConnection extends AbstractConnectionMiddleware
{
/**
* @var array<string, Statement>
*/
private array $cachedStatements = [];

public function prepare(string $sql): Statement
{
if (isset($this->cachedStatements[$sql])) {
return $this->cachedStatements[$sql];
}

return $this->cachedStatements[$sql] = parent::prepare($sql);
}
}
20 changes: 20 additions & 0 deletions src/Internal/Doctrine/CachePreparedStatementsDriver.php
@@ -0,0 +1,20 @@
<?php

declare(strict_types=1);

namespace Loupe\Loupe\Internal\Doctrine;

use Doctrine\DBAL\Driver\Connection as DriverConnection;
use Doctrine\DBAL\Driver\Middleware\AbstractDriverMiddleware;

final class CachePreparedStatementsDriver extends AbstractDriverMiddleware
{
public function connect(
#[\SensitiveParameter]
array $params,
): DriverConnection {
return new CachePreparedStatementsConnection(
parent::connect($params),
);
}
}
16 changes: 16 additions & 0 deletions src/Internal/Doctrine/CachePreparedStatementsMiddleware.php
@@ -0,0 +1,16 @@
<?php

declare(strict_types=1);

namespace Loupe\Loupe\Internal\Doctrine;

use Doctrine\DBAL\Driver as DriverInterface;
use Doctrine\DBAL\Driver\Middleware as MiddlewareInterface;

class CachePreparedStatementsMiddleware implements MiddlewareInterface
{
public function wrap(DriverInterface $driver): DriverInterface
{
return new CachePreparedStatementsDriver($driver);
}
}
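
Note on the three `CachePreparedStatements*` classes above: read together, they wrap the DBAL driver so that `prepare()` hands back the same statement object whenever it sees an identical SQL string. The following dependency-free sketch illustrates only that decorator pattern. `StatementPreparer`, `CountingPreparer` and `CachingPreparer` are hypothetical stand-ins invented for this note, not Doctrine or Loupe classes, and how Loupe registers the real middleware is not shown in this excerpt.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a connection that can prepare statements.
interface StatementPreparer
{
    public function prepare(string $sql): object;
}

// Hypothetical inner "connection" that counts how often it really prepares.
final class CountingPreparer implements StatementPreparer
{
    public int $calls = 0;

    public function prepare(string $sql): object
    {
        $this->calls++;

        return new \stdClass();
    }
}

// Decorator that memoizes prepared statements by their SQL string,
// in the same spirit as CachePreparedStatementsConnection::prepare().
final class CachingPreparer implements StatementPreparer
{
    /**
     * @var array<string, object>
     */
    private array $cachedStatements = [];

    public function __construct(
        private StatementPreparer $inner,
    ) {
    }

    public function prepare(string $sql): object
    {
        // Only delegate to the inner preparer on a cache miss.
        return $this->cachedStatements[$sql] ??= $this->inner->prepare($sql);
    }
}

$inner = new CountingPreparer();
$cached = new CachingPreparer($inner);

$a = $cached->prepare('SELECT 1');
$b = $cached->prepare('SELECT 1'); // cache hit, inner preparer is not called
$c = $cached->prepare('SELECT 2'); // different SQL, cache miss

var_dump($a === $b);   // bool(true)
var_dump($inner->calls); // int(2)
```

In this sketch the cache is never evicted, so its size grows with the number of distinct SQL strings seen on the connection.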
5 changes: 1 addition & 4 deletions src/Internal/Engine.php
@@ -27,7 +27,7 @@

class Engine
{
-public const VERSION = '0.4.0'; // Increase this whenever a re-index of all documents is needed
+public const VERSION = '0.8.0'; // Increase this whenever a re-index of all documents is needed

private Parser $filterParser;

@@ -276,9 +276,6 @@ public function upsert(
$query .= ' RETURNING ' . $insertIdColumn;
}

-if ($table === IndexInfo::TABLE_NAME_DOCUMENTS) {
-}
-
$insertValue = $this->getConnection()->executeQuery($query, $values, $this->extractDbalTypes($values))->fetchOne();

if ($insertValue === false) {
5 changes: 1 addition & 4 deletions src/Internal/Index/IndexInfo.php
@@ -487,6 +487,7 @@ private function addTermsToDocumentsRelationToSchema(Schema $schema): void

$table->setPrimaryKey(['term', 'document', 'attribute', 'position']);
$table->addIndex(['document']);
+$table->addIndex(['position']);
}

private function addTermsToSchema(Schema $schema): void
@@ -507,10 +508,6 @@ private function addTermsToSchema(Schema $schema): void
$table->addColumn('length', Types::INTEGER)
->setNotnull(true);

-// Inversed Document Frequency
-$table->addColumn('idf', Types::FLOAT)
-    ->setNotnull(true);
-
$table->setPrimaryKey(['id']);
$table->addUniqueIndex(['term', 'state', 'length']);
$table->addIndex(['state']);
29 changes: 1 addition & 28 deletions src/Internal/Index/Indexer.php
@@ -81,7 +81,7 @@ public function addDocuments(array $documents): IndexResult

$this->persistStateSet();

-// Update storage (IDF etc.) only once
+// Update storage only once
$this->reviseStorage();
});
} catch (\Throwable $e) {
@@ -310,7 +310,6 @@ private function indexTerm(string $term, int $documentId, string $attributeName,
'term' => $term,
'state' => $state,
'length' => mb_strlen($term, 'UTF-8'),
-'idf' => 1,
],
['term', 'state', 'length'],
'id'
@@ -420,31 +419,5 @@ private function removeOrphans(): void
private function reviseStorage(): void
{
$this->removeOrphans();
-$this->updateInverseDocumentFrequencies();
-}
-
-private function updateInverseDocumentFrequencies(): void
-{
-// Notice the * 1.0 additions to the COUNT() SELECTS in order to force floating point calculations
-$query = <<<'QUERY'
-UPDATE
-%s
-SET
-idf = 1.0 + (LN(
-(SELECT COUNT(*) FROM %s) * 1.0
-/
-(SELECT COUNT(DISTINCT td.document) FROM %s AS td WHERE td.term = %s.id) * 1.0
-))
-QUERY;
-
-$query = sprintf(
-$query,
-IndexInfo::TABLE_NAME_TERMS,
-IndexInfo::TABLE_NAME_DOCUMENTS,
-IndexInfo::TABLE_NAME_TERMS_DOCUMENTS,
-IndexInfo::TABLE_NAME_TERMS,
-);
-
-$this->engine->getConnection()->executeStatement($query);
}
}
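
Note on the `updateInverseDocumentFrequencies()` method that this commit deletes from `Indexer.php`: its SQL stored, per term, `1.0 + LN(total documents / documents containing the term)`. The sketch below reproduces only that arithmetic in plain PHP so the effect of the dropped `idf` column is easier to picture. The function name `inverseDocumentFrequency()` and the sample counts are assumptions made for this note and do not come from the Loupe code base.

```php
<?php

declare(strict_types=1);

// Hypothetical helper mirroring the removed SQL expression:
// idf = 1.0 + LN(totalDocuments / documentsWithTerm)
function inverseDocumentFrequency(int $totalDocuments, int $documentsWithTerm): float
{
    if ($documentsWithTerm < 1 || $documentsWithTerm > $totalDocuments) {
        throw new \InvalidArgumentException('Expected 1 <= documentsWithTerm <= totalDocuments.');
    }

    // PHP's log() is the natural logarithm, like LN() in the removed query.
    return 1.0 + log($totalDocuments / $documentsWithTerm);
}

// Illustrative counts only: a term found in every document vs. a term found in 1% of them.
$common = inverseDocumentFrequency(1000, 1000);
$rare = inverseDocumentFrequency(1000, 10);

printf("%.3f %.3f\n", $common, $rare); // prints "1.000 5.605"
```

A term present in every document gets the minimum weight of `1.0`, while rarer terms get a larger weight. With the `idf` column removed, this value is no longer stored or recomputed after indexing.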