recommendation.evaluator

class fairdiverse.recommendation.evaluator.Abstract_Evaluator(config)[source]

Bases: object

eval(dataloader, model, store_scores=False)[source]

Evaluates the model on the provided dataloader and calculates performance metrics.

Parameters:
  • dataloader – The data loader that provides batches of user-item interactions and corresponding labels.

  • model – The model to evaluate.

  • store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.

Returns:

A dictionary containing the evaluation metric(s) (e.g., AUC score).
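
A minimal sketch of the eval contract for custom evaluators, based only on the description above; the subclass name and the returned metric key are illustrative placeholders, not part of the library.

    from fairdiverse.recommendation.evaluator import Abstract_Evaluator

    class Toy_Evaluator(Abstract_Evaluator):
        """Toy subclass illustrating the eval interface described above."""

        def eval(self, dataloader, model, store_scores=False):
            # A real evaluator would iterate the dataloader, score each batch
            # with `model`, and compare predictions against the labels.
            return {"auc": 0.5}  # placeholder metric value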

class fairdiverse.recommendation.evaluator.CTR_Evaluator(config)[source]

Bases: Abstract_Evaluator

eval(dataloader, model, store_scores=False)[source]

Evaluates the model on the provided dataloader and calculates performance metrics.

This function runs the evaluation on a dataset using the provided model. It calculates the AUC score based on the predicted scores and ground truth labels. If store_scores is set to True, it also returns the evaluation results as a sparse matrix of predicted scores.

Parameters:
  • dataloader – The data loader that provides batches of user-item interactions and corresponding labels.

  • model – The model to evaluate.

  • store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.

Returns:

A dictionary containing the evaluation metric(s) (e.g., AUC score), and optionally, a sparse matrix of predicted scores.
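
A hedged usage sketch of the call shape, assuming `config`, `eval_loader`, and `model` are prepared by the surrounding training pipeline; how the metrics dictionary and the sparse score matrix are packed together when store_scores=True is an assumption based on the description above.

    from fairdiverse.recommendation.evaluator import CTR_Evaluator

    evaluator = CTR_Evaluator(config)        # `config` prepared elsewhere

    # Metrics-only evaluation (AUC over predicted scores vs. labels).
    metrics = evaluator.eval(eval_loader, model)

    # With store_scores=True the predicted scores are also returned as a
    # sparse matrix for later inspection or re-use.
    result = evaluator.eval(eval_loader, model, store_scores=True)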

class fairdiverse.recommendation.evaluator.LLM_Evaluator(config)[source]

Bases: Abstract_Evaluator

cal_acc_score(label_lists, score_lists, topk)[source]

Calculate accuracy scores for recommendation system evaluation.

This method computes the average NDCG (Normalized Discounted Cumulative Gain), HR (Hit Ratio), and MRR (Mean Reciprocal Rank) at a specified topk cutoff for a list of ground-truth labels and corresponding prediction scores.

Parameters:
  • label_lists (List[List[int]]) – A list of lists containing ground-truth labels. Each sublist represents the relevant items for a user or query.

  • score_lists (List[List[float]]) – A list of lists containing predicted scores. Each sublist holds the relevance scores for items in the same order as label_lists.

  • topk (int) – The number of top predictions to consider when calculating the metrics.

Returns:

Dict[str, float] A dictionary containing the average NDCG, HR, and MRR scores at the given topk, with keys formatted as ‘NDCG@{topk}’, ‘HR@{topk}’, and ‘MRR@{topk}’ respectively. Scores are rounded to 4 decimal places.
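
For reference, a minimal sketch of these metrics over binary relevance labels, following the standard definitions of NDCG@k, HR@k, and MRR@k; it is not the library's implementation, and details such as tie-breaking are assumptions.

    import math

    def acc_scores_sketch(label_lists, score_lists, topk):
        """Standard NDCG@k / HR@k / MRR@k over binary relevance labels."""
        ndcgs, hrs, mrrs = [], [], []
        for labels, scores in zip(label_lists, score_lists):
            # Rank this user's items by descending predicted score, keep top-k.
            order = sorted(range(len(scores)), key=lambda i: -scores[i])[:topk]
            ranked = [labels[i] for i in order]
            # DCG of the ranked labels vs. the ideal DCG of the best ordering.
            dcg = sum(rel / math.log2(r + 2) for r, rel in enumerate(ranked))
            ideal = sorted(labels, reverse=True)[:topk]
            idcg = sum(rel / math.log2(r + 2) for r, rel in enumerate(ideal))
            ndcgs.append(dcg / idcg if idcg > 0 else 0.0)
            hrs.append(1.0 if any(ranked) else 0.0)
            # Reciprocal rank of the first relevant item in the top-k, else 0.
            mrrs.append(next((1.0 / (r + 1) for r, rel in enumerate(ranked) if rel), 0.0))
        n = max(len(label_lists), 1)
        return {f"NDCG@{topk}": round(sum(ndcgs) / n, 4),
                f"HR@{topk}": round(sum(hrs) / n, 4),
                f"MRR@{topk}": round(sum(mrrs) / n, 4)}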

cal_fair_score(iid2pid, predict, topk)[source]

Calculate fairness scores for recommendation system evaluation.

This method computes various fairness metrics at a specified top-k cutoff to evaluate the diversity and inclusiveness of the predicted items. It uses fairness measures such as MMF (Max-Min Fairness), the Gini coefficient, Min-Max Ratio, and Entropy to quantify the balance across categories or groups within the predictions.

Parameters:
  • iid2pid (Dict[int, int]) – A mapping where keys are item IDs and values are their respective group/category IDs.

  • predict (List[Tuple[int, float]]) – A list of tuples, each containing an item ID and its predicted score.

  • topk (int) – The number of highest-scored items considered for fairness evaluation.

Returns:

A dictionary with keys as the metric names prefixed with the top-k cutoff (e.g., ‘MMF@5’) and values as the corresponding calculated scores, rounded to 4 decimal places.
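
As a rough illustration, the sketch below computes the named measures from per-category exposure counts in the top-k (e.g., the output of get_cates_value below); the formulas are the common textbook definitions, and the metric key names other than 'MMF@{topk}' are assumptions rather than the library's exact keys.

    import math

    def fairness_scores_sketch(counts, topk):
        """Common fairness measures over per-category exposure counts."""
        total = sum(counts)
        probs = [c / total for c in counts] if total else []
        n = len(counts)
        # Gini coefficient of the exposure distribution (0 = perfectly equal).
        sorted_c = sorted(counts)
        gini = (sum((2 * (i + 1) - n - 1) * c for i, c in enumerate(sorted_c))
                / (n * total)) if total else 0.0
        # Shannon entropy; higher means exposure is spread more evenly.
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        # Ratio of the least- to the most-exposed category.
        min_max = min(counts) / max(counts) if counts and max(counts) else 0.0
        # Max-min fairness proxy: exposure share of the worst-off category.
        mmf = min(probs) if probs else 0.0
        return {f"MMF@{topk}": round(mmf, 4),
                f"Gini@{topk}": round(gini, 4),
                f"MinMaxRatio@{topk}": round(min_max, 4),
                f"Entropy@{topk}": round(entropy, 4)}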

get_categories(iid2pid)[source]

get_cates_value(iid2pid, predict, topk)[source]

Get the category values based on predicted indices and their corresponding categories.

This method processes the predicted indices along with their mapping to category IDs and returns a list of counts for each category, representing the frequency of occurrence in the top-k predictions.

Parameters:
  • iid2pid (dict) – A dictionary mapping item indices (int) to their respective category IDs (int). If an item index is not found in the dictionary, its category defaults to -1.

  • predict (List[List[int]]) – A 2D list where each sublist contains the predicted indices (top-k predictions) for the corresponding input data point.

  • topk (int) – The number of top predictions considered for each data point; it determines how many elements from the beginning of each sublist in predict are processed.

Returns:

List[int] A list of integers where each value corresponds to the count of occurrences for a specific category across all top-k predictions. The order of these counts matches the sorted order of category IDs as returned by get_categories(iid2pid).
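
A brief sketch of the counting described above; how unmapped items (category -1) are folded into the result is an assumption here and is ultimately decided by get_categories(iid2pid).

    from collections import Counter

    def cates_value_sketch(iid2pid, predict, topk):
        """Count category occurrences across all users' top-k predictions."""
        counter = Counter()
        for preds in predict:
            for iid in preds[:topk]:
                counter[iid2pid.get(iid, -1)] += 1
        # Report counts in sorted category-ID order (mirroring get_categories);
        # the -1 bucket for unmapped items is ignored in this sketch.
        categories = sorted(set(iid2pid.values()))
        return [counter.get(cate, 0) for cate in categories]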

get_data(data)[source]

This method processes the input data to extract prediction lists, label lists, and score lists for each user.

Parameters:
  • data – The evaluation data to process, where each user's entry contains the following fields:

      'predict_list' – A list of predicted items.

      'positive_items' – A list of items that are considered positive (e.g., liked or preferred by the user).

      'scores' – A list of scores corresponding to the predicted items, indicating the confidence of each prediction.

Returns:

  • predict_lists: A list of predicted-item lists, one per user.

  • label_lists: For each user, a list of binary labels indicating whether each predicted item is positive (1) or not (0).

  • score_lists: A list of score lists corresponding to the predicted items for all users.
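
A hypothetical sketch of one user's entry in data, using only the field names listed above; the surrounding container type and the example values are assumptions for illustration.

    # One user's entry (field names from the description above).
    user_entry = {
        "predict_list": [42, 7, 19],      # predicted item IDs, best first
        "positive_items": [7, 100],       # ground-truth positives for this user
        "scores": [0.91, 0.82, 0.40],     # confidence for each predicted item
    }

    # Binary labels as described: 1 if a predicted item is positive, else 0.
    labels = [1 if item in user_entry["positive_items"] else 0
              for item in user_entry["predict_list"]]
    # -> [0, 1, 0]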

llm_eval(grounding_result, iid2pid)[source]

Evaluate the performance of a language model based on grounding results and item-pid mappings.

This method assesses the accuracy and fairness of the model’s predictions at different top-K thresholds. It computes both accuracy scores and fairness scores and aggregates them into a comprehensive evaluation result.

Parameters:
  • grounding_result (Dict[str, Any]) – The output from the model grounding process, containing the necessary information for evaluation.

  • iid2pid (Dict[str, str]) – A mapping from item IDs to product IDs, used in calculating fairness metrics.

Returns:

  • eval_result (Dict[str, float]): A dictionary summarizing the evaluation outcomes, including accuracy and fairness scores for each specified top-K value.
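
A call-shape sketch only, assuming the config carries the top-K cutoffs and that iid2pid maps string item IDs to group identifiers; the internal structure of grounding_result is produced by the grounding step and is not documented here.

    from fairdiverse.recommendation.evaluator import LLM_Evaluator

    config = {"topk": [5, 10]}            # assumed config key for the cutoffs
    evaluator = LLM_Evaluator(config)

    grounding_result = ...                # output of the model grounding step
    iid2pid = {"item_1": "brand_a", "item_2": "brand_b"}

    eval_result = evaluator.llm_eval(grounding_result, iid2pid)
    # eval_result maps names such as 'NDCG@5' or 'MMF@5' to rounded floats.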

class fairdiverse.recommendation.evaluator.Ranking_Evaluator(config)[source]

Bases: Abstract_Evaluator

eval(dataloader, model, store_scores=False)[source]

Evaluates the model on the provided dataloader and calculates performance metrics.

This function runs the evaluation on a dataset using the provided model. It calculates ranking metrics based on the predicted scores and ground truth labels. If store_scores is set to True, it also returns the evaluation results as a sparse matrix of predicted scores.

Parameters:
  • dataloader – The data loader that provides batches of user-item interactions and corresponding labels.

  • model – The model to evaluate.

  • store_scores – Whether to return the predicted scores as a sparse matrix. Defaults to False.

Returns:

A dictionary containing the evaluation metric(s), and optionally, a sparse matrix of predicted scores.
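
The call shape mirrors CTR_Evaluator.eval; a short sketch, assuming `config`, `eval_loader`, and `model` come from the surrounding pipeline and that the returned dictionary maps ranking-metric names to floats.

    from fairdiverse.recommendation.evaluator import Ranking_Evaluator

    evaluator = Ranking_Evaluator(config)
    metrics = evaluator.eval(eval_loader, model)

    # Print whichever ranking metrics the configuration produced.
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")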