dataquality package#
Subpackages#
- dataquality.clients package
- Submodules
- dataquality.clients.api module
- dataquality.clients.objectstore module
- Module contents
- dataquality.core package
- Submodules
- dataquality.core.auth module
- dataquality.core.finish module
- dataquality.core.init module
- dataquality.core.log module
- dataquality.core.report module
- Module contents
- dataquality.dq_auto package
- Submodules
- dataquality.dq_auto.auto module
- dataquality.dq_auto.base_data_manager module
- dataquality.dq_auto.ner module
- dataquality.dq_auto.ner_trainer module
- dataquality.dq_auto.notebook module
- dataquality.dq_auto.schema module
- dataquality.dq_auto.tc_trainer module
- dataquality.dq_auto.text_classification module
- Module contents
- dataquality.dq_start package
- dataquality.integrations package
- Subpackages
- Submodules
- dataquality.integrations.fastai module
- dataquality.integrations.hf module
- dataquality.integrations.jsl module
- dataquality.integrations.keras module
- dataquality.integrations.lightning module
- dataquality.integrations.setfit module
- dataquality.integrations.torch module
- dataquality.integrations.torch_semantic_segmentation module
- dataquality.integrations.transformers_trainer module
- dataquality.integrations.ultralytics module
- Module contents
- dataquality.loggers package
- Subpackages
- dataquality.loggers.data_logger package
- Subpackages
- Submodules
- dataquality.loggers.data_logger.base_data_logger module
- dataquality.loggers.data_logger.image_classification module
- dataquality.loggers.data_logger.object_detection module
- dataquality.loggers.data_logger.semantic_segmentation module
- dataquality.loggers.data_logger.tabular_classification module
- dataquality.loggers.data_logger.text_classification module
- dataquality.loggers.data_logger.text_multi_label module
- dataquality.loggers.data_logger.text_ner module
- Module contents
- dataquality.loggers.logger_config package
- Subpackages
- Submodules
- dataquality.loggers.logger_config.base_logger_config module
- dataquality.loggers.logger_config.image_classification module
- dataquality.loggers.logger_config.object_detection module
- dataquality.loggers.logger_config.semantic_segmentation module
- dataquality.loggers.logger_config.tabular_classification module
- dataquality.loggers.logger_config.text_classification module
- dataquality.loggers.logger_config.text_multi_label module
- dataquality.loggers.logger_config.text_ner module
- Module contents
- dataquality.loggers.model_logger package
- Subpackages
- Submodules
- dataquality.loggers.model_logger.base_model_logger module
- dataquality.loggers.model_logger.image_classification module
- dataquality.loggers.model_logger.object_detection module
- dataquality.loggers.model_logger.semantic_segmentation module
- dataquality.loggers.model_logger.tabular_classification module
- dataquality.loggers.model_logger.text_classification module
- dataquality.loggers.model_logger.text_multi_label module
- dataquality.loggers.model_logger.text_ner module
- Module contents
- Submodules
- dataquality.loggers.base_logger module
- Module contents
- dataquality.schemas package
- Submodules
- dataquality.schemas.condition module
- dataquality.schemas.cv module
- dataquality.schemas.dataframe module
- dataquality.schemas.edit module
- dataquality.schemas.hf module
- dataquality.schemas.job module
- dataquality.schemas.metrics module
- dataquality.schemas.model module
- dataquality.schemas.ner module
- dataquality.schemas.report module
- dataquality.schemas.request_type module
- dataquality.schemas.route module
- dataquality.schemas.semantic_segmentation module
- dataquality.schemas.seq2seq module
- dataquality.schemas.split module
- dataquality.schemas.task_type module
- dataquality.schemas.torch module
- Module contents
- dataquality.utils package
- Subpackages
- dataquality.utils.semantic_segmentation package
- Submodules
- dataquality.utils.semantic_segmentation.constants module
- dataquality.utils.semantic_segmentation.errors module
- dataquality.utils.semantic_segmentation.lm module
- dataquality.utils.semantic_segmentation.metrics module
- dataquality.utils.semantic_segmentation.polygons module
- dataquality.utils.semantic_segmentation.utils module
- Module contents
- dataquality.utils.seq2seq package
- Submodules
- dataquality.utils.arrow module
- dataquality.utils.auth module
- dataquality.utils.auto module
- dataquality.utils.auto_trainer module
- dataquality.utils.cuda module
- dataquality.utils.cv module
- dataquality.utils.cv_smart_features module
- dataquality.utils.dq_logger module
- dataquality.utils.dqyolo module
- dataquality.utils.emb module
- dataquality.utils.file module
- dataquality.utils.hdf5_store module
- dataquality.utils.helpers module
- dataquality.utils.hf_images module
- dataquality.utils.hf_tokenizer module
- dataquality.utils.imports module
- dataquality.utils.jsl module
- dataquality.utils.keras module
- dataquality.utils.ml module
- dataquality.utils.name module
- dataquality.utils.od module
- dataquality.utils.patcher module
- dataquality.utils.profiler module
- dataquality.utils.setfit module
- dataquality.utils.task_helpers module
- dataquality.utils.tf module
- dataquality.utils.thread_pool module
- dataquality.utils.torch module
- dataquality.utils.transformers module
- dataquality.utils.ultralytics module
- dataquality.utils.upload module
- dataquality.utils.upload_model module
- dataquality.utils.vaex module
- dataquality.utils.version module
- Module contents
Submodules#
dataquality.analytics module#
- pydantic model ProfileModel#
Bases: BaseModel
User profile
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- field packages: Optional[Dict[str, str]] = None#
- field uuid: Optional[str] = None#
- class Analytics(ApiClient, config)#
Bases: Borg
Analytics is used to track errors and logs in the background.
To initialize the Analytics class you need to pass in an ApiClient and the dq config.
- Parameters:
ApiClient (Type[ApiClient]) – The ApiClient class
config (Config) – The dq config
- debug_logging(log_message, *args)#
This function is used to log debug messages. It will only log if the DQ_DEBUG environment variable is set to True.
- Return type:
None
- ipython_exception_handler(shell, etype, evalue, tb, tb_offset=None)#
This function is used to handle exceptions in ipython.
- Return type:
None
- track_exception_ipython(etype, evalue, tb)#
We parse the current environment and send the error to the api.
- Return type:
None
- handle_exception(etype, evalue, tb)#
This function is used to handle exceptions in python.
- Return type:
None
- capture_exception(error)#
This function is used to take an exception that is passed as an argument.
- Return type:
None
- log_import(module)#
This function is used to log an import of a module.
- Return type:
None
- log_function(function)#
This function is used to log a function call.
- Return type:
None
- log(data)#
This function is used to send the error to the api in a thread.
- Return type:
None
- set_config(config)#
This function is used to set the config post init.
- Return type:
None
dataquality.dqyolo module#
- main()#
dqyolo is a wrapper around ultralytics yolo that will automatically run the model on the validation and test sets and provide data insights.
- Return type:
None
dataquality.exceptions module#
- exception GalileoException#
Bases: Exception
A class for Galileo Exceptions
- exception GalileoWarning#
Bases: Warning
A class for Galileo Warnings
- exception LogBatchError#
Bases: Exception
An exception used to indicate an invalid batch of logged model outputs
dataquality.internal module#
Internal functions to help Galileans
- reprocess_run(project_name, run_name, alerts=True, wait=True)#
Reprocesses a run that has already been processed by Galileo
Useful if a new feature has been added to the system that you want applied to an old run that hasn't been migrated.
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True
- Return type:
None
- reprocess_transferred_run(project_name, run_name, alerts=True, wait=True)#
Reprocess a run that has been transferred from another cluster
This is an internal helper function that allows us to reprocess a run that has been transferred from another cluster.
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True
- Return type:
None
- rename_run(project_name, run_name, new_name)#
Assigns a new name to a run
Useful if a run was named incorrectly, or if a run was created with a temporary name and needs to be renamed to something more permanent
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
new_name (str) – The new name to assign to the run
- Return type:
None
- rename_project(project_name, new_name)#
Renames a project
Useful if a project was named incorrectly, or if a project was created with a temporary name and needs to be renamed to something more permanent
- Parameters:
project_name (str) – The name of the project
new_name (str) – The new name to assign to the project
- Return type:
None
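A minimal sketch of using these internal helpers. It assumes the functions are imported from dataquality.internal as documented above; the project and run names are placeholders for runs that already exist.

from dataquality.internal import reprocess_run, rename_run

# Re-run processing for an existing run, waiting for it to complete
reprocess_run("my_project", "my_run", alerts=True, wait=True)

# Give the run a more permanent name
rename_run("my_project", "my_run", new_name="baseline_v1")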
dataquality.metrics module#
- create_edit(project_name, run_name, split, edit, filter, task=None, inference_name=None)#
Creates an edit for a run given a filter
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split
edit (Union[Edit, Dict]) – The edit to make. See help(Edit) for more information
task (Optional[str]) – Required task name if run is MLTC
inference_name (Optional[str]) – Required inference name if split is inference
- Return type:
Dict
- get_run_summary(project_name, run_name, split, task=None, inference_name=None, filter=None)#
Gets the summary for a run/split
Calculates metrics (f1, recall, precision) overall (weighted) and per label. Also returns the top 50 rows of the dataframe (sorted by data_error_potential)
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
filter (Union[FilterParams, Dict, None]) – Optional filter to restrict the summary to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
Dict
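For illustration, a hedged example of pulling a filtered summary. It assumes the function is accessed as dq.metrics.get_run_summary and that FilterParams accepts the likely_mislabeled field listed in dataquality.schemas.metrics; the project and run names are placeholders.

import dataquality as dq
from dataquality.schemas.metrics import FilterParams

summary = dq.metrics.get_run_summary(
    project_name="my_project",   # hypothetical project
    run_name="my_run",           # hypothetical run
    split="training",
    filter=FilterParams(likely_mislabeled=True),  # restrict to likely-mislabeled rows
)
print(summary.keys())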
- get_metrics(project_name, run_name, split, task=None, inference_name=None, category='gold', filter=None)#
Calculates available metrics for a run/split, grouped by a particular category
The category/column provided (can be gold, pred, or any categorical metadata column) will result in metrics per "group" or unique value of that category/column
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
category (str) – The category/column to calculate metrics for. Default "gold". Can be "gold" for ground truth, "pred" for predicted values, or any metadata column logged (or smart feature).
filter (Union[FilterParams, Dict, None]) – Optional filter to restrict the metrics to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
Dict[str,List]
- display_distribution(project_name, run_name, split, task=None, inference_name=None, column='data_error_potential', filter=None)#
Displays the column distribution for a run. Plotly must be installed
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
column (str) – The column to get the distribution for. Default data_error_potential
filter (Union[FilterParams, Dict, None]) – Optional filter to restrict the distribution to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
None
- get_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, filter=None, as_pandas=True, include_data_embs=False, meta_cols=None)#
Gets the dataframe for a run/split
Downloads an arrow (or specified type) file to your machine and returns a loaded Vaex dataframe.
Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
filter (Union[FilterParams, Dict, None]) – Optional filter to restrict the dataframe to only matching rows. See dq.schemas.metrics.FilterParams
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities, etc.), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off-the-shelf data embeddings
meta_cols (Optional[List[str]]) – List of metadata columns to return in the dataframe. If "*" is included, return all metadata columns
- Return type:
Union[DataFrame,DataFrame]
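A short, hedged example of downloading a run's data as a vaex dataframe; the project and run names are placeholders.

import dataquality as dq

df = dq.metrics.get_dataframe(
    project_name="my_project",
    run_name="my_run",
    split="training",
    include_embs=True,   # joins the embeddings into the dataframe
    as_pandas=False,     # keep as a memory-mapped vaex dataframe
)
print(len(df))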
- get_edited_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, reviewed_only=False, as_pandas=True, include_data_embs=False)#
Gets the edited dataframe for a run/split
Exports a run/split's data with all active edits in the edits cart and returns a vaex or pandas dataframe
Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
reviewed_only (Optional[bool]) – Whether to export only reviewed edits or all edits. Default False (all edits)
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities, etc.), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off-the-shelf data embeddings
- Return type:
Union[DataFrame,DataFrame]
- get_epochs(project_name, run_name, split)#
Returns the epochs logged for a run/split
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
- Return type:
List[int]
- get_embeddings(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the embeddings for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the embeddings from the final epoch. Note that only the n and n-1 epoch embeddings are available for download
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch
- Return type:
DataFrame
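For example (a hedged sketch; per the note above only the final and second-to-last epochs are available, and the names are placeholders):

import dataquality as dq

emb_df = dq.metrics.get_embeddings("my_project", "my_run", split="validation")
# or pin a specific epoch (must be one of the last two logged epochs)
emb_df = dq.metrics.get_embeddings("my_project", "my_run", split="validation", epoch=4)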
- get_data_embeddings(project_name, run_name, split, inference_name='')#
Downloads the data (off the shelf) embeddings for a run/split
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
- Return type:
DataFrame
- get_probabilities(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the probabilities for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the probabilities from the final epoch.
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get probabilities for. Default final epoch
- Return type:
DataFrame
- get_raw_data(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the raw logged data for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the raw data from the final epoch.
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get data for. Default final epoch
- Return type:
DataFrame
- get_alerts(project_name, run_name, split, inference_name=None)#
Get alerts for a project/run/split
Alerts are automatic insights calculated and provided by Galileo on your data
- Return type:
List[Dict[str,str]]
- get_labels_for_run(project_name, run_name, task=None)#
Gets labels for a given run.
If multi-label, and a task is provided, this will get the labels for that task. Otherwise, it will get all task-labels
In NER, the full label set with the tags for each label will be returned
- Return type:
List
- get_tasks_for_run(project_name, run_name)#
Gets task names for a multi-label run
- Return type:
List[str]
Module contents#
- login()#
Log into your Galileo environment.
The function will prompt you for an Authorization Token (api key) that you can access from the console.
To skip the prompt for automated workflows, you can set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password. You can set GALILEO_API_KEY to your API key if you have one.
- Return type:
None
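A minimal sketch of an automated login. The environment variable names come from the docstring above; the values are placeholders.

import os
import dataquality as dq

os.environ["GALILEO_CONSOLE_URL"] = "https://console.my-galileo-deployment.io"  # enterprise deployments
os.environ["GALILEO_API_KEY"] = "<your-api-key>"
dq.login()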
- logout()#
- Return type:
None
- init(task_type, project_name=None, run_name=None, overwrite_local=True)#
Start a run
Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.
Before creating the project, check:
- The user is valid, login if not
- The DQ client version is compatible with the API version
Optionally provide project and run names to create a new project/run or restart existing ones.
- Return type:
None
- Parameters:
task_type (str) – The task type for modeling. This must be one of the valid dataquality.schemas.task_type.TaskType options
project_name (Optional[str]) – The project name. If not passed in, a random one will be generated. If provided, and the project does not exist, it will be created. If it does exist, it will be set.
run_name (Optional[str]) – The run name. If not passed in, a random one will be generated. If provided, and the run does not exist, it will be created. If it does exist, it will be set.
overwrite_local (bool) – If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True
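For example (a hedged sketch; the project and run names are placeholders):

import dataquality as dq

dq.init(
    task_type="text_classification",
    project_name="example_project",
    run_name="example_run",
)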
- log_data_samples(*, texts, ids, meta=None, **kwargs)#
Logs a batch of input samples for model training/test/validation/inference.
Fields are expected as lists of their content. Field names are the plural of those in log_data_sample (text -> texts). The expected arguments come from the task_type being used; see dq.docs() for details.
Example (text classification):
all_labels = ["A", "B", "C"]
dq.set_labels_for_run(labels=all_labels)

texts: List[str] = [
    "Text sample 1", "Text sample 2", "Text sample 3", "Text sample 4"
]
labels: List[str] = ["B", "C", "A", "A"]
meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2]
}
ids: List[int] = [0, 1, 2, 3]
split = "training"

dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
- Parameters:
texts (List[str]) – List[str] the input samples to your model
ids (List[int]) – List[int | str] the ids per sample
split – Optional[str] the split for this data. Can also be set via dq.set_split
meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields to each sample. The name of the field is the key of the dictionary, and the values are a list that corresponds in length and order to the text samples.
kwargs (Any) – See dq.docs() for details on other task specific parameters
- Return type:
None
- log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#
Logs model outputs for the model during training/test/validation.
- Parameters:
ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples
embs (Union[List, ndarray, None]) – The embeddings per output sample
split (Optional[Split]) – The current split. Must be set either here or via dq.set_split
epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch
logits (Union[List, ndarray, None]) – The logits for each sample
probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied
inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.
exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.
- Return type:
None
The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shape.
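A hedged sketch of logging outputs inside a text classification training loop. The model, dataloader, and the way embeddings/logits are obtained are placeholders; only the dq calls are taken from this page (set_epoch_and_split, log_model_outputs, and finish, documented below).

import dataquality as dq

for epoch in range(num_epochs):                  # num_epochs defined elsewhere
    dq.set_epoch_and_split(epoch, "training")
    for ids, texts, labels in train_loader:      # your own dataloader yielding logged ids
        logits, embs = model(texts)              # placeholder model returning logits + embeddings
        dq.log_model_outputs(ids=ids, embs=embs, logits=logits)

dq.finish()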
- configure(do_login=True, _internal=False)#
Update your active config with new information
You can use environment variables to set the config, or wait for prompts. Available environment variables to update:
- GALILEO_CONSOLE_URL
- GALILEO_USERNAME
- GALILEO_PASSWORD
- GALILEO_API_KEY
- Return type:
None
- finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#
Finishes the current run and invokes a job
- Parameters:
last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed. This is inclusive, so setting last_epoch to 5 would upload epochs 0, 1, 2, 3, 4, 5
wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run(). Default True
create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.
data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to "text", which corresponds to the input text. Can also be set to target, generated_output, or any other column that is logged as metadata.
upload_model (bool) – If True, the model will be stored in the Galileo project. Default False, or set by the environment variable DQ_UPLOAD_MODEL.
- Return type:
str
- set_labels_for_run(labels)#
Creates the mapping of the labels for the model to their respective indexes.
- Return type:
None
- Parameters:
labels (Union[List[List[str]], List[str]]) – An ordered list of labels (ie ["dog", "cat", "fish"])
If this is a multi-label type, then labels are a list of lists where each inner list indicates the label for the given task
This order MUST match the order of probabilities that the model outputs.
In the multi-label case, the outer order (order of the tasks) must match the task-order of the task-probabilities logged as well.
- get_current_run_labels()#
Returns the current run labels, if there are any
- Return type:
Optional[List[str]]
- get_data_logger(task_type=None, *args, **kwargs)#
- Return type:
- get_model_logger(task_type=None, *args, **kwargs)#
- Return type:
- get_run_link(project_name=None, run_name=None)#
Gets the link to the run in the UI
- Return type:
str
- set_tasks_for_run(tasks, binary=True)#
Sets the task names for the run (multi-label case only).
This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.
This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)
- Return type:
None
- Parameters:
tasks (List[str]) – The list of tasks for your run
binary (bool) – Whether this is a binary multi-label run. If true, tasks will also be set as your labels, and you should NOT call dq.set_labels_for_run; it will be handled for you. Default True
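For illustration, a hedged non-binary multi-label setup in which task order and label order must line up; the task and label names are made up.

import dataquality as dq

dq.set_tasks_for_run(["sentiment", "urgency"], binary=False)
dq.set_labels_for_run([
    ["negative", "neutral", "positive"],   # labels for the "sentiment" task
    ["low", "high"],                       # labels for the "urgency" task
])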
- set_tagging_schema(tagging_schema)#
Sets the tagging schema for NER models
Only valid for text_ner task_types. Others will throw an exception
- Return type:
None
- docs()#
Print the documentation for your specific input and output logging format
Based on your task_type, this will print the appropriate documentation
- Return type:
None
- wait_for_run(project_name=None, run_name=None)#
Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
project_name (
Optional[str]) β The project name. Default to current project if not passed in.run_name (
Optional[str]) β The run name. Default to current run if not passed in.
- Return type:
None
- Returns:
None. Function returns after the run transitions to finished
- get_run_status(project_name=None, run_name=None)#
Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
project_name (
Optional[str]) β The project name. Default to current project if not passed in.run_name (
Optional[str]) β The run name. Default to current run if not passed in.
- Return type:
Dict[str, Any]
- Returns:
Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.
- set_epoch(epoch)#
Set the current epoch.
When set, logging model outputs will use this if not logged explicitly
- Return type:
None
- set_split(split, inference_name=None)#
Set the current split.
When set, logging data inputs/model outputs will use this if not logged explicitly When setting split to inference, inference_name must be included
- Return type:
None
- set_epoch_and_split(epoch, split, inference_name=None)#
Set the current epoch and set the current split. When set, logging data inputs/model outputs will use this if not logged explicitly When setting split to inference, inference_name must be included
- Return type:
None
- set_console_url(console_url=None)#
For Enterprise users. Set the console URL to your Galileo Environment.
You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this.
- Return type:
None
- Parameters:
console_url (Optional[str]) – If set, that will be used. Otherwise, if an environment variable GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a url.
- log_data_sample(*, text, id, **kwargs)#
Log a single input example to disk
Fields are expected as singular elements. Field names are the singular of those in log_data_samples (texts -> text). The expected arguments come from the task_type being used; see dq.docs() for details.
- Parameters:
text (str) – The text input sample for your model
id (int) – The id for this sample
split – Optional[str] the split for this data. Can also be set via dq.set_split
kwargs (Any) – See dq.docs() for details on other task specific parameters
- Return type:
None
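For example (hedged; the label keyword is assumed to be the task-specific argument for text classification, per dq.docs()):

import dataquality as dq

dq.set_split("training")
dq.log_data_sample(text="I loved this movie", id=0, label="positive")  # label kwarg assumed for text classification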
- log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#
Log an iterable or other dataset to disk. Useful for logging memory mapped files
Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets
Valid examples:
d = [
    {"my_text": "sample1", "my_labels": "A", "my_id": 1, "sample_quality": 5.3},
    {"my_text": "sample2", "my_labels": "A", "my_id": 2, "sample_quality": 9.1},
    {"my_text": "sample3", "my_labels": "B", "my_id": 3, "sample_quality": 2.7},
]
dq.log_dataset(
    d, text="my_text", id="my_id", label="my_labels", meta=["sample_quality"]
)

Logging a pandas dataframe, df:
      text label  id  sample_quality
0  sample1     A   1             5.3
1  sample2     A   2             9.1
2  sample3     B   3             2.7
# We don't need to set text, id, or label because they match the defaults
dq.log_dataset(df, meta=["sample_quality"])

Logging an iterable of tuples:
d = [
    ("sample1", "A", "ID1"),
    ("sample2", "A", "ID2"),
    ("sample3", "B", "ID3"),
]
dq.log_dataset(d, text=0, id=2, label=1)

Invalid example:
d = {
    "my_text": ["sample1", "sample2", "sample3"],
    "my_labels": ["A", "A", "B"],
    "my_id": [1, 2, 3],
    "sample_quality": [5.3, 9.1, 2.7],
}

In the invalid case, use dq.log_data_samples:
meta = {"sample_quality": d["sample_quality"]}
dq.log_data_samples(
    texts=d["my_text"], labels=d["my_labels"], ids=d["my_id"], meta=meta
)
Keyword arguments are specific to the task type. See dq.docs() for details
- Parameters:
dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log
text (Union[str, int]) – str | int The column, key, or int index for text data. Default "text"
id (Union[str, int]) – str | int The column, key, or int index for id data. Default "id"
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns in your input data to be logged as metadata. For a pandas dataframe, this would be the list of columns corresponding to each metadata field to log
kwargs (Any) – See help(dq.get_data_logger().log_dataset) or dq.docs() for more general task details
- Batch_size:
The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000
- Return type:
None
- log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#
Log an image dataset of input samples for image classification
- Parameters:
dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).
imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths but could also be bytes for HF dataframes). Ignored for ImageFolder where local paths are directly retrieved from the dataset.
imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or the remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.
batch_size (int) – Number of samples to log in a batch. Default 10,000
id (str) – The name of the column containing the ids (in the dataframe)
label (str) – The name of the column containing the labels (in the dataframe)
split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split
inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split
parallel (bool) – upload in parallel if set to True
- Return type:
None
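A hedged sketch for an image classification dataframe; the column names and the remote bucket path are placeholders.

import dataquality as dq
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1],
    "label": ["cat", "dog"],
    "local_path": ["images/0.jpg", "images/1.jpg"],
    "remote_path": ["s3://my-bucket/images/0.jpg", "s3://my-bucket/images/1.jpg"],
})
dq.log_image_dataset(
    df,
    imgs_local_colname="local_path",
    imgs_remote="remote_path",   # providing remote paths skips uploading the images
    split="training",
)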
- log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#
Log data for tabular classification models with XGBoost
X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.
Example with numpy arrays:
import xgboost as xgb
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

model = xgb.XGBClassifier()
model.fit(X, y)

dq.log_xgboost(model, X, y=y, feature_names=feature_names, split="training")
# or for inference
dq.log_xgboost(
    model, X, feature_names=feature_names, split="inference", inference_name="my_inference"
)

Example with pandas DataFrames:
import xgboost as xgb
from sklearn.datasets import load_wine

X, y = load_wine(as_frame=True, return_X_y=True)

model = xgb.XGBClassifier()
model.fit(X, y)

dq.log_xgboost(model, X=X, y=y, split="training")
# or for inference
dq.log_xgboost(
    model, X=X, split="inference", inference_name="my_inference"
)
- Parameters:
model (XGBClassifier) – XGBClassifier model fit on the training data
X (Union[DataFrame, ndarray]) – The input data as a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)
y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference only
feature_names (Optional[List[str]]) – List of feature names if X is input as numpy array. Must have length n_features
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split
- Return type:
None
- get_dq_log_file(project_name=None, run_name=None)#
- Return type:
Optional[str]
- build_run_report(conditions, emails, project_id, run_id, link)#
Build a run report and send it to the specified emails.
- Return type:
None
- register_run_report(conditions, emails)#
Register conditions and emails for a run report.
After a run is finished, a report will be sent to the specified emails.
- Return type:
None
- class AggregateFunction(value)#
Bases: str, Enum
An enumeration.
- avg = 'Average'#
- min = 'Minimum'#
- max = 'Maximum'#
- sum = 'Sum'#
- pct = 'Percentage'#
- class Operator(value)#
Bases: str, Enum
An enumeration.
- eq = 'is equal to'#
- neq = 'is not equal to'#
- gt = 'is greater than'#
- lt = 'is less than'#
- gte = 'is greater than or equal to'#
- lte = 'is less than or equal to'#
- pydantic model Condition#
Bases: BaseModel
Class for building custom conditions for data quality checks
After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.
With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:
- Is the average confidence less than 0.3?
>>> c = Condition( ... agg=AggregateFunction.avg, ... metric="confidence", ... operator=Operator.lt, ... threshold=0.3, ... ) >>> c.evaluate(df)
- Is the max DEP greater or equal to 0.45?
>>> c = Condition( ... agg=AggregateFunction.max, ... metric="data_error_potential", ... operator=Operator.gte, ... threshold=0.45, ... ) >>> c.evaluate(df)
By adding filters, you can further narrow down the scope of the condition. If the aggregate function is "pct", you don't need to specify a metric,
as the filters will determine the percentage of data.
For example:
- Alert if over 80% of the dataset has confidence under 0.1
>>> c = Condition( ... operator=Operator.gt, ... threshold=0.8, ... agg=AggregateFunction.pct, ... filters=[ ... ConditionFilter( ... metric="confidence", operator=Operator.lt, value=0.1 ... ), ... ], ... ) >>> c.evaluate(df)
- Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
>>> c = Condition( ... operator=Operator.gte, ... threshold=0.2, ... agg=AggregateFunction.pct, ... filters=[ ... ConditionFilter( ... metric="is_drifted", operator=Operator.eq, value=True ... ), ... ], ... ) >>> c.evaluate(df)
- Alert if 5% or more of the dataset contains PII
>>> c = Condition( ... operator=Operator.gte, ... threshold=0.05, ... agg=AggregateFunction.pct, ... filters=[ ... ConditionFilter( ... metric="galileo_pii", operator=Operator.neq, value="None" ... ), ... ], ... ) >>> c.evaluate(df)
Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:
- Alert if the min confidence of drifted data is less than 0.15
>>> c = Condition( ... agg=AggregateFunction.min, ... metric="confidence", ... operator=Operator.lt, ... threshold=0.15, ... filters=[ ... ConditionFilter( ... metric="is_drifted", operator=Operator.eq, value=True ... ) ... ], ... ) >>> c.evaluate(df)
- Alert if over 50% of high DEP (>=0.7) data contains PII
>>> c = Condition( ... operator=Operator.gt, ... threshold=0.5, ... agg=AggregateFunction.pct, ... filters=[ ... ConditionFilter( ... metric="data_error_potential", operator=Operator.gte, value=0.7 ... ), ... ConditionFilter( ... metric="galileo_pii", operator=Operator.neq, value="None" ... ), ... ], ... ) >>> c.evaluate(df)
You can also call conditions directly, which will assert their truth against a df:
1. Assert that the average confidence is less than 0.3
>>> c = Condition( ... agg=AggregateFunction.avg, ... metric="confidence", ... operator=Operator.lt, ... threshold=0.3, ... ) >>> c(df)  # Will raise an AssertionError if False
- Parameters:
metric – The DF column for evaluating the condition
agg – An aggregate function to apply to the metric
operator – The operator to use for comparing the agg to the threshold (e.g. "gt", "lt", "eq", "neq")
threshold – Threshold value for evaluating the condition
filter – Optional filter to apply to the DataFrame before evaluating the condition
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
- field agg: AggregateFunction [Required]#
- field filters: List[ConditionFilter] [Optional]#
  Validated by: validate_filters
- field metric: Optional[str] = None#
  Validated by: validate_metric
- field threshold: float [Required]#
- evaluate(df)#
- Return type:
Tuple[bool,float]
- pydantic model ConditionFilter#
Bases: BaseModel
Filter a dataframe based on the column value
Note that the column used for filtering is the same as the metric used in the condition.
- Parameters:
operator – The operator to use for filtering (e.g. "gt", "lt", "eq", "neq"). See Operator
value – The value to compare against
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
- field metric: str [Required]#
- field value: Union[float, int, str, bool] [Required]#
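Putting the pieces together, a hedged sketch that registers a run report built from these classes; the metric values and email address are placeholders.

from dataquality import (
    AggregateFunction, Condition, ConditionFilter, Operator, register_run_report,
)

# Alert if more than 20% of high-DEP (>= 0.7) samples have confidence under 0.5
high_dep_low_conf = Condition(
    agg=AggregateFunction.pct,
    operator=Operator.gt,
    threshold=0.2,
    filters=[
        ConditionFilter(metric="data_error_potential", operator=Operator.gte, value=0.7),
        ConditionFilter(metric="confidence", operator=Operator.lt, value=0.5),
    ],
)
register_run_report(conditions=[high_dep_low_conf], emails=["ml-team@example.com"])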
- disable_galileo()#
- Return type:
None
- disable_galileo_verbose()#
- Return type:
None
- enable_galileo_verbose()#
- Return type:
None
- enable_galileo()#
- Return type:
None
- auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)#
Automatically gets insights on a text classification or NER dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console
One of hf_data or train_data should be provided. If neither is, a demo dataset will be loaded by Galileo for training.
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you'd like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path
inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, you should use the hf_inference_names param. Optional inference datasets to run with after training completes. The structure is a dictionary with the key being the inference name and the value one of: a Pandas dataframe, a Huggingface dataset, a path to a local file, or a Huggingface dataset hub path
max_padding_length (int) – The max length for padding the input text during tokenization. Default 200
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased
num_train_epochs (int) – The number of epochs to train for (early stopping will always be active). Default 15
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data
project_name (Optional[str]) – Optional project name. If not set, a random name will be generated
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(..., include_data_embs=True) in the data_emb col. Only available for TC currently. NER coming soon. Default True if a GPU is available, else default False.
early_stopping (bool) – Whether to use early stopping. Default True
- Return type:
None
For text classification datasets, the only required columns are text and label
For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies
MIT Movies dataset in huggingface format
tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...   [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...   [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...   [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...   [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...   [0, 0, 0, 7, 0, 0, ...
...                                                  ...
To see auto insights on a random, pre-selected dataset, simply run
import dataquality as dq

dq.auto()
An example using auto with a hosted huggingface text classification dataset
import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")
Similarly, for NER
import dataquality as dq

dq.auto(hf_data="conll2003")
An example using auto with sklearn data as pandas dataframes
import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data"
)
An example of using auto with a local CSV file with text and label columns
import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
- class DataQuality(model=None, task=TaskType.text_classification, labels=None, train_data=None, test_data=None, val_data=None, project='', run='', framework=None, *args, **kwargs)#
Bases:
object
- Parameters:
model (Optional[Any]) – The model to inspect; if a string, it will be assumed to be auto
task (TaskType) – Task type, for example "text_classification"
project (str) – Project name
run (str) – Run name
train_data (Optional[Any]) – Training data
test_data (Optional[Any]) – Optional test data
val_data (Optional[Any]) – Optional validation data
labels (Optional[List[str]]) – The labels for the run
framework (Optional[ModelFramework]) – The framework to use; if provided it will be used instead of inferring it from the model. For example, if you have a torch model, you can pass framework="torch".
args (Any) – Additional arguments
kwargs (Any) – Additional keyword arguments
from dataquality import DataQuality

with DataQuality(model, "text_classification",
                 labels=["neg", "pos"],
                 train_data=train_data) as dq:
    model.fit(train_data)
If you want to train without a model, you can use the auto framework:
from dataquality import DataQuality

with DataQuality(labels=["neg", "pos"], train_data=train_data) as dq:
    dq.finish()
- get_metrics(split=Split.train)#
- Return type:
Dict[str,Any]
- auto_notebook()#
- Return type:
None