process_daletor
- fairdiverse.search.utils.process_daletor.Process(config)
Main function for DALETOR data processing.
- Parameters:
config – A dictionary containing configuration settings.
- Returns:
Generates and saves the training and test datasets, as well as the list training samples.
- fairdiverse.search.utils.process_daletor.build_each_train_dataset(qid_list, qd, train_dict, rel_feat_dict, res_dir, query_emb, doc_emb)
Generates and saves the training dataset for each query in qid_list.
- Parameters:
qid_list – A list of query IDs to process.
qd – A dictionary containing the query data (query, document list, subtopic info).
train_dict – A dictionary of pre-generated training samples.
rel_feat_dict – A dictionary of relevance features for query-document pairs.
res_dir – The directory where the processed data will be saved.
query_emb – A dictionary of query embeddings.
doc_emb – A dictionary of document embeddings.
- Returns:
Saves the processed data for each query in a .pkl.gz file.
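As a minimal sketch of the per-query `.pkl.gz` output described above, the snippet below writes one gzip-compressed pickle per query ID and reads it back. The helper names `save_query_data` and `load_query_data`, the `<qid>.pkl.gz` file naming, and the payload layout are assumptions for illustration; they are not taken from the library's implementation.

```python
import gzip
import os
import pickle
import tempfile


def save_query_data(res_dir, qid, data):
    """Write one query's data to <res_dir>/<qid>.pkl.gz (hypothetical naming)."""
    os.makedirs(res_dir, exist_ok=True)
    path = os.path.join(res_dir, f"{qid}.pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(data, f)
    return path


def load_query_data(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Toy payload; the real per-query structure is not specified in this reference.
example = {"qid": "1", "doc_ids": ["d1", "d2"], "labels": [1, 0]}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_query_data(tmp_dir, "1", example)
    restored = load_query_data(saved_path)
```

Opening the file with `gzip.open` in binary mode lets `pickle` read and write the compressed stream directly, so no intermediate uncompressed file is needed.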
- fairdiverse.search.utils.process_daletor.build_test_dataset(config)
Builds the test dataset for evaluation.
- Parameters:
config – A dictionary containing configuration settings.
- Returns:
Generates the test dataset and saves it in the result directory.
- fairdiverse.search.utils.process_daletor.build_train_dataset(config, worker_num=20)
Builds the training dataset by distributing the workload across multiple workers.
- Parameters:
config – A dictionary containing configuration settings.
worker_num – The number of workers to use for parallel processing.
- Returns:
Generates the training dataset and saves it in the result directory.
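The idea of spreading the workload across `worker_num` workers can be sketched as follows: split the query IDs into near-equal chunks and hand each chunk to one worker. The helpers `split_into_chunks` and `process_chunk` are hypothetical, and a thread pool is used here only to keep the sketch portable; how the library itself partitions and schedules the work is not stated in this reference.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, worker_num):
    """Split items into at most worker_num contiguous chunks of near-equal size."""
    worker_num = max(1, min(worker_num, len(items)))
    base, extra = divmod(len(items), worker_num)
    chunks, start = [], 0
    for i in range(worker_num):
        # The first `extra` chunks take one additional item each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(qid_chunk):
    """Hypothetical stand-in for per-chunk work such as build_each_train_dataset."""
    return [f"processed-{qid}" for qid in qid_chunk]


qid_list = [str(i) for i in range(1, 8)]  # seven toy query IDs
chunks = split_into_chunks(qid_list, worker_num=3)

# Threads keep this sketch portable; a process pool would suit CPU-bound work.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(process_chunk, chunks))
```

Capping the number of chunks at the number of items avoids starting workers that would receive an empty list when `worker_num` exceeds the number of queries.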
- fairdiverse.search.utils.process_daletor.gen_list_training_sample(config, top_n=50, sample_num=200)
Generates list training samples by selecting top-ranked documents for each query.
- Parameters:
config – A dictionary containing configuration settings.
top_n – The number of top-ranked documents to consider for each sample.
sample_num – The number of samples to generate for each query.
- Returns:
Saves the generated training samples in a file for later use.
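To illustrate how `top_n` and `sample_num` might interact, the sketch below draws `sample_num` lists per query, each a random ordering of that query's `top_n` highest-ranked documents. This is an assumption made for illustration: the actual sampling strategy is not described in this reference, and the function name `sample_top_n_lists`, the `ranked_docs` layout, and the `seed` parameter are hypothetical.

```python
import random


def sample_top_n_lists(ranked_docs, top_n=50, sample_num=200, seed=0):
    """Draw sample_num shuffled lists from each query's top_n ranked documents.

    ranked_docs maps a query ID to its documents in rank order (assumed layout).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = {}
    for qid, docs in ranked_docs.items():
        candidates = docs[:top_n]
        # rng.sample with k == len(candidates) returns a shuffled copy.
        samples[qid] = [
            rng.sample(candidates, len(candidates)) for _ in range(sample_num)
        ]
    return samples


# Small values keep the toy output readable; the documented defaults are 50 and 200.
ranked_docs = {"1": ["d1", "d2", "d3", "d4", "d5"], "2": ["d6", "d7"]}
samples = sample_top_n_lists(ranked_docs, top_n=3, sample_num=4)
```

Queries with fewer than `top_n` documents simply contribute all of their documents to each sample in this sketch.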