recommendation.evaluator¶
- class fairdiverse.recommendation.evaluator.Abstract_Evaluator(config)[source]¶
Bases:
object
- eval(dataloader, model, store_scores=False)[source]¶
Evaluates the model on the provided dataloader and calculates performance metrics.
- Parameters:
dataloader – The data loader that provides batches of user-item interactions and corresponding labels.
model – The model to evaluate.
store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.
- Returns:
A dictionary containing the evaluation metric(s) (e.g., AUC score).
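For orientation, here is a minimal sketch of the contract that concrete evaluators follow: take a config at construction, iterate the dataloader, score each batch with the model, and return a metrics dictionary. The class name, config contents, and batch layout below are illustrative assumptions, not fairdiverse internals.

```python
from typing import Dict, Iterable, Tuple
import numpy as np


class SimpleEvaluator:
    """Illustrative stand-in following the Abstract_Evaluator contract."""

    def __init__(self, config: Dict):
        self.config = config

    def eval(self, dataloader: Iterable[Tuple[np.ndarray, np.ndarray]],
             model, store_scores: bool = False) -> Dict[str, float]:
        all_scores, all_labels = [], []
        for features, labels in dataloader:      # batch layout is an assumption
            all_scores.append(model(features))   # model returns per-example scores
            all_labels.append(labels)
        scores = np.concatenate(all_scores)
        labels = np.concatenate(all_labels)
        # A concrete subclass would compute its metric of choice here (e.g. AUC).
        return {"mean_score": round(float(scores.mean()), 4)}
```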
- class fairdiverse.recommendation.evaluator.CTR_Evaluator(config)[source]¶
Bases:
Abstract_Evaluator
- eval(dataloader, model, store_scores=False)[source]¶
Evaluates the model on the provided dataloader and calculates performance metrics.
This function runs the evaluation on a dataset using the provided model. It calculates the AUC score based on the predicted scores and ground truth labels. If store_scores is set to True, it also returns the evaluation results as a sparse matrix of predicted scores.
- Parameters:
dataloader – The data loader that provides batches of user-item interactions and corresponding labels.
model – The model to evaluate.
store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.
- Returns:
A dictionary containing the evaluation metric(s) (e.g., AUC score), and optionally, a sparse matrix of predicted scores.
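As a rough illustration of the CTR-style evaluation described above, the sketch below pools predictions over synthetic batches, computes AUC with scikit-learn, and builds an optional sparse user-item score matrix. The (user, item, label) batch layout, the random stand-in for the model, and the metric key name are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_users, n_items = 50, 100

# Synthetic batches of (user_ids, item_ids, labels); the layout is an assumption.
batches = [
    (rng.integers(0, n_users, 256), rng.integers(0, n_items, 256), rng.integers(0, 2, 256))
    for _ in range(4)
]

def predict(users, items):
    # Stand-in for the model's forward pass: random scores for illustration.
    return rng.random(len(users))

users, items, labels, scores = [], [], [], []
for u, i, y in batches:
    users.append(u); items.append(i); labels.append(y); scores.append(predict(u, i))
users, items = np.concatenate(users), np.concatenate(items)
labels, scores = np.concatenate(labels), np.concatenate(scores)

metrics = {"auc": round(roc_auc_score(labels, scores), 4)}
# Optional sparse score matrix; duplicate (user, item) pairs are summed by csr_matrix.
score_matrix = csr_matrix((scores, (users, items)), shape=(n_users, n_items))
print(metrics, score_matrix.shape)
```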
- class fairdiverse.recommendation.evaluator.LLM_Evaluator(config)[source]¶
Bases:
Abstract_Evaluator
- cal_acc_score(label_lists, score_lists, topk)[source]¶
Calculate accuracy scores for recommendation system evaluation.
This method computes the average NDCG (Normalized Discounted Cumulative Gain), HR (Hit Ratio), and MRR (Mean Reciprocal Rank) at a specified topk cutoff for a list of ground-truth labels and corresponding prediction scores.
- Parameters:
label_lists (List[List[int]]) – A list of lists containing ground-truth labels. Each sublist represents the relevant items for a user or query.
score_lists (List[List[float]]) – A list of lists containing predicted scores. Each sublist corresponds to the relevance scores for items matching the order in label_lists.
topk (int) – The number of top predictions to consider when calculating the metrics.
- Returns:
Dict[str, float] – A dictionary containing the average NDCG, HR, and MRR scores at the given topk, with keys formatted as ‘NDCG@{topk}’, ‘HR@{topk}’, and ‘MRR@{topk}’, respectively. Scores are rounded to 4 decimal places.
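The sketch below shows one way to compute NDCG@k, HR@k, and MRR@k from binary label lists and matching score lists, mirroring the output keys described above. It is illustrative and not necessarily identical to LLM_Evaluator's implementation; the helper name acc_scores is hypothetical.

```python
import numpy as np

def acc_scores(label_lists, score_lists, topk):
    """NDCG/HR/MRR at topk over binary relevance labels (illustrative)."""
    ndcgs, hrs, mrrs = [], [], []
    for labels, scores in zip(label_lists, score_lists):
        order = np.argsort(scores)[::-1][:topk]      # indices of the top-k scores
        gains = np.asarray(labels)[order]            # binary relevance in ranked order
        discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
        dcg = float((gains * discounts).sum())
        ideal = np.sort(np.asarray(labels))[::-1][:topk]
        idcg = float((ideal * discounts[:len(ideal)]).sum())
        ndcgs.append(dcg / idcg if idcg > 0 else 0.0)
        hrs.append(float(gains.any()))               # hit if any positive in the top-k
        hit_ranks = np.flatnonzero(gains)
        mrrs.append(1.0 / (hit_ranks[0] + 1) if hit_ranks.size else 0.0)
    return {
        f"NDCG@{topk}": round(float(np.mean(ndcgs)), 4),
        f"HR@{topk}": round(float(np.mean(hrs)), 4),
        f"MRR@{topk}": round(float(np.mean(mrrs)), 4),
    }

print(acc_scores([[1, 0, 0, 1], [0, 0, 1, 0]],
                 [[0.9, 0.3, 0.5, 0.2], [0.1, 0.8, 0.7, 0.2]], topk=2))
```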
- cal_fair_score(iid2pid, predict, topk)[source]¶
Calculate fairness scores for a recommendation system’s evaluation.
This method computes various fairness metrics at a specified top-k cutoff to evaluate the diversity and inclusiveness of the predicted items. It utilizes different fairness measures like MMF (Max-Min Fairness), Gini coefficient, Min-Max Ratio, and Entropy to quantify the balance across different categories or groups within the predictions.
- Parameters:
iid2pid (Dict[int, int]) – A mapping where keys are item IDs and values are their respective group/category IDs.
predict (List[Tuple[int, float]]) – A list of tuples, each containing an item ID and its predicted score.
topk (int) – The top-k count used to select the highest-scoring items for fairness evaluation.
- Returns:
A dictionary with keys as the metric names prefixed with the top-k cutoff (e.g., ‘MMF@5’) and values as the corresponding calculated scores, rounded to 4 decimal places.
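For intuition, the sketch below derives the named fairness measures from per-group exposure counts in the top-k lists. The helper name, the metric key names, and the reading of MMF as the minimum exposure share are assumptions; the exact definitions used by the evaluator may differ.

```python
import numpy as np

def fairness_scores(exposure_counts, topk):
    """Fairness metrics over per-group exposure counts in the top-k lists (illustrative)."""
    counts = np.asarray(exposure_counts, dtype=float)
    shares = counts / counts.sum()

    # Gini coefficient over exposure shares (0 = perfectly even exposure).
    sorted_shares = np.sort(shares)
    n = len(sorted_shares)
    gini = float(((2 * np.arange(1, n + 1) - n - 1) * sorted_shares).sum() / n)

    # Shannon entropy of the exposure distribution (higher = more balanced).
    entropy = float(-(shares[shares > 0] * np.log(shares[shares > 0])).sum())
    return {
        f"MMF@{topk}": round(float(shares.min()), 4),          # min share, as an MMF proxy
        f"Gini@{topk}": round(gini, 4),
        f"MinMaxRatio@{topk}": round(float(shares.min() / shares.max()), 4),
        f"Entropy@{topk}": round(entropy, 4),
    }

print(fairness_scores([40, 25, 10], topk=5))
```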
- get_cates_value(iid2pid, predict, topk)[source]¶
Get the category values based on predicted indices and their corresponding categories.
This method processes the predicted indices along with their mapping to category IDs and returns a list of counts for each category, representing the frequency of occurrence in the top-k predictions.
- Parameters:
iid2pid (dict) – A dictionary mapping item indices (int) to their respective category IDs (int). If an item index is not found in the dictionary, it defaults to -1.
predict (List[List[int]]) – A 2D list where each sublist contains the predicted indices (top-k predictions) for the corresponding input data points.
topk (int) – The number of top predictions considered for each data point. This determines how many elements from the beginning of each sublist in predict are processed.
- Returns:
List[int] – A list of integers where each value corresponds to the count of occurrences for a specific category across all top-k predictions. The order of these counts matches the sorted order of category IDs as returned by get_categories(iid2pid).
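A compact sketch of this counting step, assuming the category order is the sorted set of category IDs (the helper name is hypothetical):

```python
from collections import Counter

def cates_value(iid2pid, predict, topk):
    """Count how often each category appears in the top-k predictions (illustrative)."""
    counter = Counter()
    for pred in predict:
        for iid in pred[:topk]:
            counter[iid2pid.get(iid, -1)] += 1   # unknown items fall into bucket -1
    categories = sorted(set(iid2pid.values()))   # assumed category ordering
    return [counter[c] for c in categories]

iid2pid = {0: 0, 1: 0, 2: 1, 3: 2}
print(cates_value(iid2pid, predict=[[0, 2, 3], [1, 0, 2]], topk=2))   # -> [3, 1, 0]
```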
- get_data(data)[source]¶
This method processes the input data to extract prediction lists, label lists, and score lists for each user.
- Parameters:
data – The input data, where each user entry contains: 'predict_list', a list of predicted items; 'positive_items', a list of items considered positive (e.g., liked or preferred by the user); and 'scores', a list of scores corresponding to the predicted items, indicating the confidence of the prediction.
- Returns:
predict_lists: A list of prediction lists, one per user.
label_lists: For each user, a list of binary labels indicating whether each predicted item is positive (1) or not (0).
score_lists: A list of score lists corresponding to the predicted items for all users.
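A minimal sketch of that extraction, assuming each record is a dict with the three keys above; the helper name and the container type are assumptions.

```python
def split_data(data):
    """Turn per-user records into parallel predict/label/score lists (illustrative)."""
    predict_lists, label_lists, score_lists = [], [], []
    for record in data:
        positives = set(record["positive_items"])
        predict_lists.append(record["predict_list"])
        # Binary label: 1 if the predicted item is among the user's positive items.
        label_lists.append([1 if item in positives else 0 for item in record["predict_list"]])
        score_lists.append(record["scores"])
    return predict_lists, label_lists, score_lists

data = [{"predict_list": [5, 7, 2], "positive_items": [2], "scores": [0.8, 0.6, 0.4]}]
print(split_data(data))   # ([[5, 7, 2]], [[0, 0, 1]], [[0.8, 0.6, 0.4]])
```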
- llm_eval(grounding_result, iid2pid)[source]¶
Evaluate the performance of a language model based on grounding results and item-pid mappings.
This method assesses the accuracy and fairness of the model’s predictions at different top-K thresholds. It computes both accuracy scores and fairness scores and aggregates them into a comprehensive evaluation result.
- Parameters:
grounding_result (Dict[str, Any]) – The output from the model grounding process, containing the necessary information for evaluation.
iid2pid (Dict[str, str]) – A mapping from item IDs to product IDs, used in calculating fairness metrics.
- Returns:
eval_result (Dict[str, float]): A dictionary summarizing the evaluation outcomes, including accuracy and fairness scores for each specified top-K value.
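Putting the pieces together, the sketch below mirrors the shape of the combined result by reusing the hypothetical helpers from the earlier sketches (acc_scores, cates_value, fairness_scores) over two example top-K cutoffs; the inputs are toy data.

```python
iid2pid = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}                 # item -> group, illustrative
predict_lists = [[0, 2, 4, 1], [3, 1, 0, 2]]
label_lists = [[1, 0, 1, 0], [0, 1, 0, 0]]
score_lists = [[0.9, 0.7, 0.6, 0.2], [0.8, 0.6, 0.3, 0.1]]

eval_result = {}
for topk in (2, 3):                                      # hypothetical top-K thresholds
    eval_result.update(acc_scores(label_lists, score_lists, topk))
    eval_result.update(fairness_scores(cates_value(iid2pid, predict_lists, topk), topk))
print(eval_result)
```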
- class fairdiverse.recommendation.evaluator.Ranking_Evaluator(config)[source]¶
Bases:
Abstract_Evaluator
- eval(dataloader, model, store_scores=False)[source]¶
Evaluates the model on the provided dataloader and calculates performance metrics.
This function runs the evaluation on a dataset using the provided model. It calculates ranking metrics based on the predicted scores and ground truth labels. If store_scores is set to True, it also returns the evaluation results as a sparse matrix of predicted scores.
- Parameters:
dataloader – The data loader that provides batches of user-item interactions and corresponding labels.
model – The model to evaluate.
store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.
- Returns:
A dictionary containing the evaluation metric(s), and optionally, a sparse matrix of predicted scores.
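As a rough sketch of ranking-style evaluation on synthetic data, the example below scores a dense user-item matrix, computes Recall@k and NDCG@k as representative ranking metrics, and keeps the scores as a sparse matrix. The specific metrics reported by Ranking_Evaluator depend on its config and may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 50, 10

scores = rng.random((n_users, n_items))                  # stand-in for model scores
truth = {u: set(rng.choice(n_items, size=5, replace=False)) for u in range(n_users)}

recalls, ndcgs = [], []
for u in range(n_users):
    topk_items = np.argsort(scores[u])[::-1][:k]         # highest-scoring k items
    hits = np.array([1.0 if i in truth[u] else 0.0 for i in topk_items])
    recalls.append(hits.sum() / len(truth[u]))
    dcg = (hits / np.log2(np.arange(2, k + 2))).sum()
    idcg = (1.0 / np.log2(np.arange(2, min(k, len(truth[u])) + 2))).sum()
    ndcgs.append(dcg / idcg)

metrics = {f"Recall@{k}": round(float(np.mean(recalls)), 4),
           f"NDCG@{k}": round(float(np.mean(ndcgs)), 4)}
score_matrix = csr_matrix(scores)                        # optional sparse score matrix
print(metrics, score_matrix.nnz)
```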