Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: BallTree algorithm with haversine distance #237

Open
blaylockbk opened this issue Aug 15, 2024 · 18 comments · Fixed by #263
Open

Feature Request: BallTree algorithm with haversine distance #237

blaylockbk opened this issue Aug 15, 2024 · 18 comments · Fixed by #263
Labels
enhancement New feature or request

Comments

@blaylockbk
Copy link

This is a neat polars plug in. Thanks! Do you have plans for implementing the Ball Tree algorithm like in scikit-learn?

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html

The haversine distance metric to find nearest neighbors between latitude/longitude points on Earth would be very useful.

@abstractqqq
Copy link
Owner

abstractqqq commented Aug 15, 2024

Yea... I used to have haversine dist for kdtree.. Then I realized kdtree only works for LP distances..

It will take me quite some time before I can get to implementing the ball tree. The current focus is on regression and perfecting the kd-tree algorithms...

@abstractqqq abstractqqq added the enhancement New feature or request label Aug 19, 2024
@AtlasPilotPuppy
Copy link

AtlasPilotPuppy commented Sep 12, 2024

@blaylockbk @abstractqqq I would be interested in implementing this.
The big question is what is the best format for the output.
I am assuming the inputs will be a column ofPoints which can be structs containing an x and a y coordinate. Or a column for the x and another column for the y coordinate.
Would the output contain the parent of each node if any and the distance to the parent ?
What would be the prefered way to take the arguments for t,k,Q,B - as named arguments potentially.

@abstractqqq
Copy link
Owner

abstractqqq commented Sep 12, 2024

@AtlasPilotPuppy This is great! And btw I am working on bringing the current Rust kdtree impl to Python, like what SciPy offers. I expect to merge this into main this weekend. The benchmarks look good, with the Rust Kdtree being faster when k is large and much faster when run in parallel mode. In other cases, it is slightly slower.

In terms of format, can you follow what I have for query_radius_ptwise or query_knn_ptwise? I think @blaylockbk would be ok with the API I currently have.

When you implement the ball tree, it would be good if you can implement the SpatialQueries trait (in the arkadia subcrate.. The project structure is kind of messy now I know but please excuse it for the time being...), as that would reduce a lot of code redundancy. If that is impossible, or you propose some much better trait for these data structures, please let me know!

Some other thoughts I have: in the special case of haversine, we can likely use [f64; 2] which will be more efficient than a Vec. But I leave that decision to you.

@abstractqqq
Copy link
Owner

@AtlasPilotPuppy This branch is where the current kdtree work lives..https://github.com/abstractqqq/polars_ds_extension/tree/pykd

In main, it is called SpacialQueries and lives in mod.rs. I realized it was a typo and should be "spatial" and fixed it in the pykd branch lol. Thank you for checking!

@AtlasPilotPuppy
Copy link

Sorry I found it. It was just a spelling discrepancy in the comment.
Thanks

@abstractqqq
Copy link
Owner

Sorry I found it. It was just a spelling discrepancy in the comment. Thanks

I am merging the progress in the branch into main. I made quite a few small changes in arkadia. Hopefully it shouldn't cause too much conflict

@AtlasPilotPuppy
Copy link

@abstractqqq @blaylockbk Wanted some input.
I have a working implementation of BallTree although the api right now is not usable.
So i can implement BallTree very similar to query_radius_ptwise or query_knn_ptwise
But BallTree is somewhat expensive to compute and can be queried repeatedly.
I am looking for input on What is the best way to structure it similar to the Sklearn BallTree API. This would allow us to create a model that can be queried repeatedly. The challenge is It does not fit very well into the DataFrame / Plugin paradigm.
While we can provide an interface similar to query_radius_ptwise or query_knn_ptwise that will have to rebuild the tree for each query.
My thought process at the moment is I can leave my implementation in there and wrap a polars plugins compatible api. We can figure out how to provide a sklearn style interface later.

@abstractqqq
Copy link
Owner

abstractqqq commented Sep 13, 2024

@AtlasPilotPuppy you are exactly right in that this kind of stuff doesn't fit in dataframe/plugin well, however, I would argue that for small datasets, it is quite convenient to query within a dataframe, and typically for a few thousand knn searches, the speed gain can already justify the convenience to use this in dataframes. If people truely want to do vector search on huge dataset, they should go for LanceDB and other more sophisticated ANN solutions that usually involves local storage and more complicated space partitions.

I think once you have the balltree data structure ready, I can make it available in dataframes as queries. Maybe only for haversine distance because most of the spatial queries can be done with kdtree already and kdtrees are somewhat cheaper.

My recent branch pykd (merged into main already) has an implementation of Kdtree that works with numpy input. You can find the Python wrapper in src/pymodels/py_kdt.rs and in python/polars_ds/kdtree.py. Strictly speaking we don't need the additional wrapper in kdtree.py, but that is a very thin layer and provides linter and additional documentation to users. All SciPy kdtree's methods are available there too, with some name changes. I think this is the direction you want to go.

@AtlasPilotPuppy
Copy link

Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.

@abstractqqq
Copy link
Owner

Excellent . Thank you for the very quick response. I agree. With dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.

And btw, unfortunately in order for Python Interop to work, you have to use OwnedLeaf because Python lifetime cannot be safely passed into Rust. So data has to be copied. Again, I don't think this is for large scale vector searches. So the copying is not a big concern.

@abstractqqq
Copy link
Owner

abstractqqq commented Sep 14, 2024

FYI I renamed python/polars_ds/kdtree.py to python/polars_ds/spatial.py, and you can put the ball tree algo in spatial.py too.

AtlasPilotPuppy pushed a commit to AtlasPilotPuppy/polars_ds_extension that referenced this issue Sep 21, 2024
…ctqqq#237

Thie implmentation does not utilize the SpatialQueries Trait. But can
be modified after somce discussion on how to handle the differeing
parameters
This does implement all the function utilized in the query_knn.py
There is demo code at the botton of basics.ipynb

There are also some Rust tests to test the functioning of the ball tree.
This ball tree was influenced by [this repo](https://github.com/grantslatton/ball-tree).
@AtlasPilotPuppy
Copy link

@abstractqqq I could use some guidance on implementing the Trait SpatialQueries. I was unsure of how to apply all the functions in my scenario.
This is tested and using the notebook and there are also some rust tests to verify the overall functioning of the balltree.

@abstractqqq
Copy link
Owner

I think the keys are the *_one_step queries. Most space partitioning algorithms start with the root node (root node in pending) In the one step function, it traverses the tree until it finds the next node which potentially may contain candidates. I haven't looked at your code yet. But I think BallTree's knn should be able to be written as a while loop and calling the one_step functions in the while loop until we exhaust the pending vec. (Traveling now and reading code on my phone is not productive. I will be home Sunday and take a closer look )

abstractqqq added a commit that referenced this issue Sep 23, 2024
Implemented Ball Tree using Haversine distance as mentioned in #237
@abstractqqq
Copy link
Owner

abstractqqq commented Sep 25, 2024

@AtlasPilotPuppy

I might need to reopen this (and maybe even revert back the commit) because there are some hidden assumptions in the ball tree construction that may not apply to haversine distance.

The "bouncing bubble" algorithm described in the bounding_sphere function updates the center of the point by moving in the direction of (pt - center), where pt is a point still outside the initial ball. When doing so, the direction vector is normalized and we move along the direction "scale" amount of distance. (This algo is also known as Ritten's bounding sphere algo for better reference.)

This is all fine if we are working in Euclidean space but for points that are coordinates (on spherical surface), what does normalizing even mean? It is easy to see that normalizing a point (in the Euclidean sense) on the sphere will result in a point not on the surface of the sphere.

Therefore it is very likely that the construction provided by the package doesn't work with Haversine distance, which is the goal of this PR.

The only workaround I see is to construct ball trees with leaves containing 1 point only. The rest of the construction doesn't involve normalizing and should work as long as the triangular inequality is true, which is the case for haversine. I need to double-check this. If we have to do this, then a complete rewrite is likely worth it because we can then take advantage of the fact that leaves contain 1 element.

@abstractqqq abstractqqq reopened this Sep 25, 2024
@AtlasPilotPuppy
Copy link

Hey that's a very valid point.
I think the approach is that haversine works with 2d points. Anything besides that uses euclidean.
Which means the bouncing ball can be applied in a euclidian scenario. We will have to find a different measure to find the bounds in the haversine scenario. Let me do some research on this.
So then we can have a match statement which would use haversine+ the correct bounding algorithm Or euclidian with the bouncing ball.

@abstractqqq
Copy link
Owner

Hey that's a very valid point. I think the approach is that haversine works with 2d points. Anything besides that uses euclidean. Which means the bouncing ball can be applied in a euclidian scenario. We will have to find a different measure to find the bounds in the haversine scenario. Let me do some research on this. So then we can have a match statement which would use haversine+ the correct bounding algorithm Or euclidian with the bouncing ball.

This is what I think: Maybe we specialize the BallTree to Haversine only.

According to the paper linked below, there are no significant advantages of Balltree (metric tree) over Kdtree, and Kdtree covers all L^p distances. No such space partitioning data structure work well on higher dimensions. Other, more specialized vector dbs / ANN search systems can do higher dimensions but those topics are wildly out of scope. I included Kdtree because I see it being used in calculating many entropy statistics that are sometimes used in time series. Haversine is a unique problem. It is low dimension but still commonly used.

The ultimate goal of the package is to be useful, fast, and improve developer experience. I don't expect nor want this package to cover all models/algorithms in sklearn. That will be very bloated and too much effort. Yes, writing all the algorithms/models is a great learning experience but only to the developers. To the end user, I don't want to confuse them with too many choices. Let's focus on small and useful data structures first.

I will also do some research. If we cannot come up with a solution in the near future, I will revert the commit.

https://link.springer.com/chapter/10.1007/978-3-540-74976-9_16

@abstractqqq
Copy link
Owner

@AtlasPilotPuppy

I didn't revert the commit but I commented everything out. Let's keep moving forward for now.

@AtlasPilotPuppy
Copy link

@abstractqqq perfect. I am working on a conditional and figuring out the best distance metric based on the paper and how to implement it. Might be 2-4 days before I can have it all done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants