process_daletor

fairdiverse.search.utils.process_daletor.Process(config)

Main function for DALETOR data processing.

Parameters:

config – A dictionary containing configuration settings.

Returns:

Generates and saves the training and test datasets, as well as the list training samples.

fairdiverse.search.utils.process_daletor.build_each_train_dataset(qid_list, qd, train_dict, rel_feat_dict, res_dir, query_emb, doc_emb)

Generates and saves the training dataset for each query in qid_list.

Parameters:
  • qid_list – A list of query IDs to process.

  • qd – A dictionary containing the query data (query, document list, subtopic info).

  • train_dict – A dictionary of pre-generated training samples.

  • rel_feat_dict – A dictionary of relevance features for query-document pairs.

  • res_dir – The directory where the processed data will be saved.

  • query_emb – A dictionary of query embeddings.

  • doc_emb – A dictionary of document embeddings.

Returns:

Saves the processed data for each query in a .pkl.gz file.
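As an illustration of the per-query .pkl.gz output described above, the following is a minimal sketch of writing and reading a gzip-compressed pickle. The helper names (save_query_sample, load_query_sample), the <qid>.pkl.gz file naming, and the keys of the sample dictionary are assumptions made for this example; they are not taken from the fairdiverse source.

```python
import gzip
import os
import pickle
import tempfile


def save_query_sample(res_dir, qid, sample):
    """Write one query's data to <res_dir>/<qid>.pkl.gz (hypothetical layout)."""
    os.makedirs(res_dir, exist_ok=True)
    path = os.path.join(res_dir, f"{qid}.pkl.gz")
    # gzip.open in binary mode gives a file object that pickle can write to.
    with gzip.open(path, "wb") as f:
        pickle.dump(sample, f)
    return path


def load_query_sample(path):
    """Read back a sample written by save_query_sample."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Round trip with made-up data; the dictionary keys are illustrative only.
demo_dir = tempfile.mkdtemp()
demo_sample = {
    "qid": "1",
    "doc_ids": ["d1", "d2"],
    "rel_feat": [[0.1, 0.2], [0.3, 0.4]],
}
demo_path = save_query_sample(demo_dir, "1", demo_sample)
restored = load_query_sample(demo_path)
```

Keeping one file per query lets a data loader read only the queries it needs, and the gzip layer reduces disk usage when embeddings are stored alongside the features.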

fairdiverse.search.utils.process_daletor.build_test_dataset(config)

Builds the test dataset for evaluation.

Parameters:

config – A dictionary containing configuration settings.

Returns:

Generates the test dataset and saves it in the result directory.

fairdiverse.search.utils.process_daletor.build_train_dataset(config, worker_num=20)

Builds the training dataset by distributing the workload across multiple workers.

Parameters:
  • config – A dictionary containing configuration settings.

  • worker_num – The number of workers to use for parallel processing (default: 20).

Returns:

Generates the training dataset and saves it in the result directory.
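One way to picture "distributing the workload across multiple workers" is to split the query IDs into roughly equal chunks and hand each chunk to a worker. The sketch below is an assumption about the general pattern, not the library's implementation: split_into_chunks, process_chunk, and run_workers are hypothetical names, and a thread pool is used here only to keep the example self-contained (the real function may well use processes instead).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, worker_num):
    """Split items into at most worker_num contiguous, near-equal chunks."""
    if not items:
        return []
    worker_num = max(1, min(worker_num, len(items)))
    base, extra = divmod(len(items), worker_num)
    chunks, start = [], 0
    for i in range(worker_num):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def process_chunk(qid_chunk):
    """Placeholder for per-chunk work (hypothetical; stands in for the
    per-query dataset building step)."""
    return [f"processed:{qid}" for qid in qid_chunk]


def run_workers(qid_list, worker_num=20):
    """Process all query IDs, one chunk per worker, preserving input order."""
    chunks = split_into_chunks(qid_list, worker_num)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        results = list(pool.map(process_chunk, chunks))
    # Flatten the per-chunk results back into a single list.
    return [item for chunk_result in results for item in chunk_result]
```

Capping the number of chunks at the number of query IDs avoids starting idle workers when worker_num exceeds the amount of work available.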

fairdiverse.search.utils.process_daletor.gen_list_training_sample(config, top_n=50, sample_num=200)

Generates list training samples by selecting top-ranked documents for each query.

Parameters:
  • config – A dictionary containing configuration settings.

  • top_n – The number of top-ranked documents to consider for each sample (default: 50).

  • sample_num – The number of samples to generate for each query (default: 200).

Returns:

Saves the generated training samples in a file for later use.
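To make the roles of top_n and sample_num concrete, here is a minimal sketch of list sampling for a single query. The documentation does not state how each sample is drawn, so the random shuffling of the top_n candidates is purely an assumption for illustration; sample_ranking_lists and its seed parameter are hypothetical and not part of fairdiverse.

```python
import random


def sample_ranking_lists(ranked_docs, top_n=50, sample_num=200, seed=None):
    """Return sample_num document lists drawn from the top_n ranked documents.

    Assumption: each sample is a random permutation of the candidates.
    The actual sampling strategy in the library may differ.
    """
    rng = random.Random(seed)
    candidates = ranked_docs[:top_n]
    samples = []
    for _ in range(sample_num):
        shuffled = candidates[:]  # copy so the input ranking is left untouched
        rng.shuffle(shuffled)
        samples.append(shuffled)
    return samples


# Example with a made-up ranking of five documents.
demo_ranking = ["d1", "d2", "d3", "d4", "d5"]
demo_samples = sample_ranking_lists(demo_ranking, top_n=3, sample_num=4, seed=0)
```

Each sample here contains only documents from the first top_n positions of the ranking, and exactly sample_num samples are produced per query.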