Feature Request: BallTree algorithm with haversine distance #237
Yea... I used to have haversine distance for the kd-tree, then I realized a kd-tree only works for L^p distances. It will take me quite some time before I can get to implementing the ball tree. The current focus is on regression and perfecting the kd-tree algorithms...
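The restriction to L^p metrics comes from the kd-tree's pruning bound: along the splitting axis, |q[axis] − x[axis]| is a lower bound on the full L^p distance, so a subtree on the far side of a distant splitting plane can be skipped. That bound does not hold for haversine (two points whose longitudes differ by 180° are nearly coincident near a pole). A quick numeric check of the L^p bound, in illustrative pure Python:

```python
import random

def minkowski(a, b, p):
    """L^p (Minkowski) distance between two equal-length point tuples."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

random.seed(0)
for _ in range(1000):
    q = tuple(random.uniform(-10, 10) for _ in range(3))
    x = tuple(random.uniform(-10, 10) for _ in range(3))
    for p in (1, 2, 3):
        for axis in range(3):
            # kd-tree pruning relies on this: the gap along any single
            # axis never exceeds the full L^p distance, so a subtree
            # behind a distant splitting plane can be skipped safely.
            assert abs(q[axis] - x[axis]) <= minkowski(q, x, p) + 1e-9
```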
@blaylockbk @abstractqqq I would be interested in implementing this.
@AtlasPilotPuppy This is great! And btw I am working on bringing the current Rust kdtree impl to Python, like what SciPy offers. I expect to merge this into main this weekend. The benchmarks look good, with the Rust kdtree being faster when k is large and much faster when run in parallel mode; in other cases, it is slightly slower. In terms of format, can you follow what I have?

When you implement the ball tree, it would be good if you can implement the SpatialQueries trait (in the arkadia subcrate; the project structure is kind of messy now, I know, but please excuse it for the time being...), as that would reduce a lot of code redundancy. If that is impossible, or you can propose a much better trait for these data structures, please let me know!

Some other thoughts: in the special case of haversine, we can likely use [f64; 2], which will be more efficient than a Vec. But I leave that decision to you.
@AtlasPilotPuppy This branch is where the current kdtree work lives: https://github.com/abstractqqq/polars_ds_extension/tree/pykd In main, it is called SpacialQueries and lives in mod.rs. I realized it was a typo and should be "spatial" and fixed it in the pykd branch lol. Thank you for checking!
Sorry I found it. It was just a spelling discrepancy in the comment. |
I am merging the progress in the branch into main. I made quite a few small changes in arkadia. Hopefully it shouldn't cause too much conflict.
@abstractqqq @blaylockbk Wanted some input. |
@AtlasPilotPuppy You are exactly right that this kind of thing doesn't fit dataframes/plugins well. However, I would argue that for small datasets it is quite convenient to query within a dataframe, and typically for a few thousand knn searches the speed gain already justifies the convenience of using this in dataframes. If people truly want to do vector search on huge datasets, they should go for LanceDB and other more sophisticated ANN solutions, which usually involve local storage and more complicated space partitions.

I think once you have the ball tree data structure ready, I can make it available in dataframes as queries. Maybe only for haversine distance, because most of the spatial queries can already be done with a kdtree, and kdtrees are somewhat cheaper.

My recent branch pykd (merged into main already) has an implementation of a kdtree that works with numpy input. You can find the Python wrapper in src/pymodels/py_kdt.rs and in python/polars_ds/kdtree.py. Strictly speaking we don't need the additional wrapper in kdtree.py, but it is a very thin layer and provides linting support and additional documentation to users. All of SciPy's kdtree methods are available there too, with some name changes. I think this is the direction you want to go.
Excellent. Thank you for the very quick response. I agree: with dataframes of a few thousand rows there isn't much difference. It only becomes relevant at larger scale.
And btw, unfortunately, in order for Python interop to work, you have to use OwnedLeaf, because Python lifetimes cannot be safely passed into Rust. So data has to be copied. Again, I don't think this is for large-scale vector searches, so the copying is not a big concern.
FYI I renamed python/polars_ds/kdtree.py to python/polars_ds/spatial.py, and you can put the ball tree algo in spatial.py too. |
…ctqqq#237 This implementation does not utilize the SpatialQueries trait, but can be modified after some discussion on how to handle the differing parameters. It does implement all the functions used in query_knn.py. There is demo code at the bottom of basics.ipynb. There are also some Rust tests to verify the functioning of the ball tree. This ball tree was influenced by [this repo](https://github.com/grantslatton/ball-tree).
@abstractqqq I could use some guidance on implementing the Trait SpatialQueries. I was unsure of how to apply all the functions in my scenario. |
I think the keys are the *_one_step queries. Most space-partitioning algorithms start with the root node (the root node goes in the pending list). The one-step function traverses the tree until it finds the next node that may contain candidates. I haven't looked at your code yet, but I think BallTree's knn should be able to be written as a while loop that calls the one_step functions until the pending vec is exhausted. (Traveling now, and reading code on my phone is not productive. I will be home Sunday and take a closer look.)
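One way to read this pattern: keep a max-heap of the k best candidates and a pending stack of nodes, pop one node per step, and prune any ball whose center is farther from the query than the current worst candidate by more than its radius. A minimal Euclidean sketch in Python (all names here are illustrative, not the arkadia API):

```python
import heapq
import math

def dist(a, b):
    """Euclidean distance between two point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class BallNode:
    """A ball-tree node: a bounding ball plus either points (leaf) or two children."""
    def __init__(self, points, leaf_size=4):
        d = len(points[0])
        # Simple centroid + covering radius (not Ritter's bouncing bubble).
        self.center = tuple(sum(p[i] for p in points) / len(points) for i in range(d))
        self.radius = max(dist(self.center, p) for p in points)
        self.points, self.left, self.right = points, None, None
        if len(points) > leaf_size:
            # Split on the axis with the largest spread.
            axis = max(range(d), key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.left, self.right = BallNode(pts[:mid], leaf_size), BallNode(pts[mid:], leaf_size)
            self.points = None

def knn(root, q, k):
    """kNN as a while loop over a pending vec, one node processed per step."""
    pending, best = [root], []  # best is a max-heap keyed on negated distance
    while pending:
        node = pending.pop()
        worst = -best[0][0] if len(best) == k else math.inf
        if dist(q, node.center) - node.radius > worst:
            continue  # this ball cannot contain anything closer than our worst
        if node.points is not None:  # leaf: scan its points
            for p in node.points:
                d = dist(q, p)
                if len(best) < k:
                    heapq.heappush(best, (-d, p))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, p))
        else:
            pending.append(node.left)
            pending.append(node.right)
    return sorted((-nd, p) for nd, p in best)
```

A brute-force scan over the same points gives identical results, which makes the loop easy to unit-test.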
Implemented Ball Tree using Haversine distance as mentioned in #237
I might need to reopen this (and maybe even revert the commit) because there are some hidden assumptions in the ball tree construction that may not apply to haversine distance.

The "bouncing bubble" algorithm described in the bounding_sphere function updates the center of the ball by moving in the direction of (pt - center), where pt is a point still outside the initial ball. When doing so, the direction vector is normalized and we move along the direction by a "scale" amount of distance. (This algorithm is also known as Ritter's bounding sphere algorithm, for reference.) This is all fine if we are working in Euclidean space, but for points that are coordinates (on a spherical surface), what does normalizing even mean? It is easy to see that this Euclidean update applied to points on the sphere will result in a point not on the surface of the sphere. Therefore it is very likely that the construction provided by the package doesn't work with haversine distance, which is the goal of this PR.

The only workaround I see is to construct ball trees with leaves containing 1 point only. The rest of the construction doesn't involve normalizing and should work as long as the triangle inequality holds, which is the case for haversine. I need to double-check this. If we have to do this, then a complete rewrite is likely worth it, because we can then take advantage of the fact that leaves contain 1 element.
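The off-sphere drift is easy to check numerically: applying the Euclidean update center += scale * normalize(pt - center) to two points on the unit sphere yields a new center whose norm is no longer 1, so it no longer corresponds to any lat/lon coordinate. A minimal sketch with illustrative values (unit sphere assumed):

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Two points on the unit sphere (3-D embeddings of coordinates).
c = (1.0, 0.0, 0.0)
p = (0.0, 1.0, 0.0)

# Ritter-style "bouncing bubble" step: move the center toward the
# outlying point by `scale` along the normalized Euclidean direction.
scale = 0.5
diff = tuple(pi - ci for pi, ci in zip(p, c))
unit = tuple(d / norm(diff) for d in diff)
new_center = tuple(ci + scale * u for ci, u in zip(c, unit))

# The updated center has left the sphere entirely.
print(norm(new_center))  # ~0.737, not 1.0
```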
Hey that's a very valid point. |
This is what I think: maybe we specialize the ball tree to haversine only. According to the paper linked below, there is no significant advantage of a ball tree (metric tree) over a kd-tree, and the kd-tree covers all L^p distances. No such space-partitioning data structure works well in higher dimensions. Other, more specialized vector DBs / ANN search systems can handle higher dimensions, but those topics are wildly out of scope. I included the kd-tree because I see it being used in calculating many entropy statistics that are sometimes used in time series. Haversine is a unique problem: it is low-dimensional but still commonly used.

The ultimate goal of the package is to be useful, fast, and to improve developer experience. I don't expect nor want this package to cover all models/algorithms in sklearn. That would be very bloated and too much effort. Yes, writing all the algorithms/models is a great learning experience, but only for the developers; I don't want to confuse the end user with too many choices. Let's focus on small and useful data structures first.

I will also do some research. If we cannot come up with a solution in the near future, I will revert the commit.

https://link.springer.com/chapter/10.1007/978-3-540-74976-9_16
I didn't revert the commit but I commented everything out. Let's keep moving forward for now. |
@abstractqqq Perfect. I am working on a conditional and figuring out the best distance metric based on the paper, and how to implement it. Might be 2-4 days before I can have it all done.
This is a neat Polars plugin. Thanks! Do you have plans for implementing the Ball Tree algorithm like in scikit-learn?
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html
The haversine distance metric to find nearest neighbors between latitude/longitude points on Earth would be very useful.
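For reference, the haversine formula in question, sketched in pure Python (scikit-learn's BallTree haversine metric uses the same formula, with [latitude, longitude] inputs in radians):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Paris -> London, roughly 344 km
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```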