[jvm-packages] Support Ranker #10823
Hi @trivialfis, @eordentlich, could you help review it? Thx
It would be good to explain how this resolves the issue raised in this comment and the issue linked therein.
Some questions re plugin and preprocess.
*/
override private[spark] def preprocess(dataset: Dataset[_]): (Dataset[_], ColumnIndices) = {
  val (output, columnIndices) = super.preprocess(dataset)
  (output.sortWithinPartitions(getGroupCol), columnIndices)
How does this operation interact with spark-rapids plugin if enabled? Any implications on GPU memory?
Does this preprocess even get called if the plugin is enabled? If not, the partitions might not be sorted.
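To make the concern concrete, here is a minimal sketch in plain Scala collections (no Spark, no plugin) of what can go wrong when a partition reaches the ranker unsorted. It assumes group sizes are derived from consecutive runs of the same group id; `Row`, `groupRuns`, and the sample data are hypothetical names used only for illustration, not part of this PR.

```scala
object ContiguousGroupsSketch {
  // Hypothetical row type: a query/group id plus a relevance label.
  final case class Row(qid: Int, label: Double)

  // Collapse consecutive rows with the same qid into (qid, runLength) pairs.
  // This only yields true group sizes when each group's rows are contiguous.
  def groupRuns(partition: Seq[Row]): Seq[(Int, Int)] =
    partition.foldLeft(Vector.empty[(Int, Int)]) { (runs, row) =>
      runs.lastOption match {
        case Some((qid, n)) if qid == row.qid => runs.init :+ (qid -> (n + 1))
        case _                                => runs :+ (row.qid -> 1)
      }
    }

  // One partition whose rows arrive interleaved by group.
  val unsorted: Seq[Row] = Seq(Row(2, 1.0), Row(1, 0.0), Row(2, 0.0), Row(1, 1.0))

  // Stand-in for sortWithinPartitions(groupCol) applied to this partition.
  val sorted: Seq[Row] = unsorted.sortBy(_.qid)

  def main(args: Array[String]): Unit = {
    println(groupRuns(unsorted)) // each group is fragmented into two runs
    println(groupRuns(sorted))   // one run per group
  }
}
```

In this toy model the unsorted partition yields four one-row fragments, while the sorted partition yields one run per group, which is why skipping the sort on the plugin path would matter.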
My bad. Fixed this issue. Please help review it again. Thx very much.
Hi @eordentlich @trivialfis, could you help take a look at it? Thx very much.
This resolves the issue with the plugin and sorted partitions (and it's nice to see the test for this case too), but I'm still wondering how that partition sort is computed by the spark-rapids plugin when enabled. Is it done on the GPU?
Also, does this PR resolve the issue I reference in an earlier comment?
@trivialfis should take a look as well.
Hi @eordentlich, I just tried the case below, which follows the same pattern as XGBoost,
and got the physical plans below.
XGBoost leverages ColumnarRdd to extract the cuDF table. From ColumnarRdd we get the corresponding RDDs below, which come from the GPU plans below. So you can see that the final cuDF table comes from GpuSort, which runs on the GPU.
👍
Will look into this tomorrow.
Thank you for raising that. I will look into it, along with other LTR issues and feature requests, after sorting out some of the work on external memory. I still think a within-partition sort is sufficient for most use cases. The worst case is adding qid-based partitioning, which might be as expensive as a global sort.
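The trade-off described here can be sketched with a toy model in plain Scala collections, not Spark. It assumes the remaining concern is a group whose rows straddle partitions: a within-partition sort moves no rows between partitions, so such a group stays split, while qid-based partitioning reassigns every row. The names `sortWithin`, `partitionByQid`, and `splitQids`, and the modulo assignment, are illustrative assumptions and do not reflect Spark's actual partitioner or this PR's code.

```scala
object QidPartitioningSketch {
  // Hypothetical row type: a query/group id plus a relevance label.
  final case class Row(qid: Int, label: Double)
  type Partition = Vector[Row]

  // Stand-in for sortWithinPartitions(groupCol): rows never leave their partition.
  def sortWithin(parts: Vector[Partition]): Vector[Partition] =
    parts.map(_.sortBy(_.qid))

  // Stand-in for qid-based partitioning followed by a local sort.
  // Every row is reassigned, which models a full shuffle of the data.
  def partitionByQid(parts: Vector[Partition], n: Int): Vector[Partition] = {
    val byTarget = parts.flatten.groupBy(r => Math.floorMod(r.qid, n))
    Vector.tabulate(n)(i => byTarget.getOrElse(i, Vector.empty[Row]).sortBy(_.qid))
  }

  // Group ids whose rows appear in more than one partition.
  def splitQids(parts: Vector[Partition]): Set[Int] =
    parts
      .flatMap(_.map(_.qid).distinct)
      .groupBy(identity)
      .collect { case (qid, hits) if hits.size > 1 => qid }
      .toSet

  // Sample input: group 2 straddles both partitions.
  val input: Vector[Partition] = Vector(
    Vector(Row(2, 1.0), Row(1, 0.0), Row(1, 1.0)),
    Vector(Row(3, 0.0), Row(2, 0.0))
  )

  def main(args: Array[String]): Unit = {
    println(splitQids(sortWithin(input)))        // group 2 is still split
    println(splitQids(partitionByQid(input, 2))) // no group is split
  }
}
```

In the model, the local sort is cheap but leaves group 2 split across partitions; the qid-based variant keeps every group whole at the cost of touching every row, which matches the remark that it could be as expensive as a global sort.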
Looks good to me, assuming all tests can pass.