Skip to content

Conversation

@NProkoptsev
Copy link

@NProkoptsev NProkoptsev commented Feb 15, 2025

#5929 introduces a method for treatment of position bias, but it had limited options of specifying the position column (.position file or numpy array), uou can't have position column as a part of your train file.
In contrast, you can specify queries and weights with arguments "query_column" and "weight_column", this will make LightGBM load these columns from the train file directly.

This PR adds support for loading position column from file.

  • Adding support for loading positions as a column in train file.
  • Removed std::vectorstd::string position_ids_. It was used before to hold a list of unique positions. Now, number of unique positions is "max(positions) + 1".
  • Refactored loading positions in Metadata::SetPosition

@jameslamb
Copy link
Collaborator

Thanks for your interest in LightGBM.

To set some expectations for how things work here:

  • until you take this PR out of draft or explicitly ask for a review, we won't review it
  • changes this large should be accompanied by new tests thoroughly covering the new behavior (not just modifying existing tests just enough to make them pass)
  • you will have to make a strong case for why this is valuable, given that what you're proposing is an ABI-breaking change

@NProkoptsev
Copy link
Author

@microsoft-github-policy-service agree

@NProkoptsev
Copy link
Author

@jameslamb can you take a look at this PR?

I added test, but it currently breaks for some gpu builds (perhaps because gpu is not deterministic?)
The goal of this PR is to reach parity in how positions can be loaded. Other columns(query and weights) can be specified by column id or column name.

@jameslamb jameslamb marked this pull request as ready for review February 27, 2025 20:23
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I took it of draft since you said it's ready for review.

Before I help with the build errors, I'd like @metpavel or @shiyu1994 to look and give you some feedback on the design. This is a significant ABI-breaking change, I'd like them to comment on:

  • whether this functionality is worth it
  • whether this could be done without breaking LightGBM's ABI

@metpavel
Copy link
Collaborator

@NProkoptsev, thanks for adding this functionality — it would be very useful to also have an option to specify positions directly within a dataset column.

I have a couple of questions:

  1. Will it still be possible to load positions from a file?
  2. Can we keep the option to pass positions as strings (i.e., keep std::vectorstd::string position_ids_)? This is often helpful for tracking bias factors in more complex layouts.

@francoisWeber
Copy link

Thank you @NProkoptsev for adding this to LGBM :) Can't wait to to have it also in the scklearn-like API.
The documentation has very few information about this new feature. Could LGBM team be a bit more verbose on this point please ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants