dataquality package#
Subpackages#
- dataquality.clients package
- Submodules
- dataquality.clients.api module
ApiClient
ApiClient.get_token()
ApiClient.make_request()
ApiClient.get_current_user()
ApiClient.valid_current_user()
ApiClient.get_project()
ApiClient.get_projects()
ApiClient.get_project_by_name()
ApiClient.get_project_runs()
ApiClient.get_project_runs_by_name()
ApiClient.get_project_run()
ApiClient.get_project_run_by_name()
ApiClient.update_run_name()
ApiClient.update_project_name()
ApiClient.create_project()
ApiClient.create_run()
ApiClient.reset_run()
ApiClient.delete_run()
ApiClient.delete_run_by_name()
ApiClient.delete_project()
ApiClient.delete_project_by_name()
ApiClient.get_labels_for_run()
ApiClient.get_tasks_for_run()
ApiClient.get_epochs_for_run()
ApiClient.create_edit()
ApiClient.reprocess_run()
ApiClient.get_slice_by_name()
ApiClient.get_metadata_columns()
ApiClient.get_task_type()
ApiClient.export_run()
ApiClient.get_project_run_name()
ApiClient.get_run_status()
ApiClient.get_run_link()
ApiClient.wait_for_run()
ApiClient.get_presigned_url()
ApiClient.get_run_summary()
ApiClient.get_run_metrics()
ApiClient.get_column_distribution()
ApiClient.get_alerts()
ApiClient.delete_alerts_for_split()
ApiClient.delete_alerts()
ApiClient.get_edits()
ApiClient.export_edits()
ApiClient.notify_email()
ApiClient.get_splits()
ApiClient.get_inference_names()
ApiClient.set_metric_for_run()
ApiClient.get_healthcheck_dq()
ApiClient.upload_file_for_project()
ApiClient.get_presigned_url_for_model()
ApiClient.get_uploaded_model_info()
- dataquality.clients.objectstore module
- Module contents
- dataquality.core package
- Submodules
- dataquality.core.auth module
- dataquality.core.finish module
- dataquality.core.init module
- dataquality.core.log module
log_data_samples()
log_data_sample()
log_image_dataset()
log_xgboost()
log_dataset()
log_model_outputs()
log_od_model_outputs()
set_labels_for_run()
get_current_run_labels()
set_tasks_for_run()
set_tagging_schema()
get_model_logger()
get_data_logger()
docs()
set_epoch()
set_split()
set_epoch_and_split()
get_run_link()
- dataquality.core.report module
- Module contents
- dataquality.dq_auto package
- Submodules
- dataquality.dq_auto.auto module
- dataquality.dq_auto.base_data_manager module
- dataquality.dq_auto.ner module
- dataquality.dq_auto.ner_trainer module
- dataquality.dq_auto.notebook module
- dataquality.dq_auto.schema module
BaseAutoDatasetConfig
BaseAutoDatasetConfig.hf_data
BaseAutoDatasetConfig.train_path
BaseAutoDatasetConfig.val_path
BaseAutoDatasetConfig.test_path
BaseAutoDatasetConfig.train_data
BaseAutoDatasetConfig.val_data
BaseAutoDatasetConfig.test_data
BaseAutoDatasetConfig.input_col
BaseAutoDatasetConfig.target_col
BaseAutoDatasetConfig.formatter
BaseAutoTrainingConfig
- dataquality.dq_auto.tc_trainer module
- dataquality.dq_auto.text_classification module
- Module contents
- dataquality.dq_start package
- dataquality.integrations package
- Subpackages
- Submodules
- dataquality.integrations.fastai module
FAIKey
FastAiDQCallback
FastAiDQCallback.logger_config
FastAiDQCallback.init_config()
FastAiDQCallback.setup_idx_store()
FastAiDQCallback.reset_idx_store()
FastAiDQCallback.reset_config()
FastAiDQCallback.get_layer()
FastAiDQCallback.before_epoch()
FastAiDQCallback.before_fit()
FastAiDQCallback.before_train()
FastAiDQCallback.wrap_indices()
FastAiDQCallback.after_validate()
FastAiDQCallback.is_train_or_val()
FastAiDQCallback.before_validate()
FastAiDQCallback.after_fit()
FastAiDQCallback.before_batch()
FastAiDQCallback.after_pred()
FastAiDQCallback.register_hooks()
FastAiDQCallback.forward_hook_with_store()
FastAiDQCallback.prepare_split()
FastAiDQCallback.unpatch()
FastAiDQCallback.unhook()
FastAiDQCallback.unwatch()
convert_img_dl_to_df()
extract_split_indices()
convert_tab_dl_to_df()
- dataquality.integrations.hf module
- dataquality.integrations.jsl module
- dataquality.integrations.keras module
DataQualityCallback
DataQualityCallback.store
DataQualityCallback.logger_config
DataQualityCallback.model
DataQualityCallback.on_train_begin()
DataQualityCallback.on_epoch_begin()
DataQualityCallback.on_train_batch_begin()
DataQualityCallback.on_train_batch_end()
DataQualityCallback.on_test_begin()
DataQualityCallback.on_test_batch_begin()
DataQualityCallback.on_test_batch_end()
DataQualityCallback.on_predict_begin()
DataQualityCallback.on_predict_batch_end()
patch_model_fit_args_kwargs()
store_model_ids()
select_model_layer()
watch()
unwatch()
- dataquality.integrations.lightning module
- dataquality.integrations.setfit module
- dataquality.integrations.torch module
- dataquality.integrations.torch_semantic_segmentation module
SemanticTorchLogger
SemanticTorchLogger.convert_dataset()
SemanticTorchLogger.find_mask_category()
SemanticTorchLogger.get_image_ids_and_image_paths()
SemanticTorchLogger.queue_gold_and_pred()
SemanticTorchLogger.truncate_queue()
SemanticTorchLogger.resize_probs_and_gold()
SemanticTorchLogger.calculate_mislabeled_pixels()
SemanticTorchLogger.expand_binary_classification()
SemanticTorchLogger.get_argmax_probs()
SemanticTorchLogger.upload_contours_split()
SemanticTorchLogger.upload_dep_split()
SemanticTorchLogger.finish()
SemanticTorchLogger.run_one_epoch()
store_batch()
patch_iterator_and_batch()
watch()
- dataquality.integrations.transformers_trainer module
DQTrainerCallback
DQTrainerCallback.hook_manager
DQTrainerCallback.validate()
DQTrainerCallback.setup_model()
DQTrainerCallback.on_train_begin()
DQTrainerCallback.on_evaluate()
DQTrainerCallback.on_epoch_begin()
DQTrainerCallback.on_epoch_end()
DQTrainerCallback.on_train_end()
DQTrainerCallback.on_prediction_step()
DQTrainerCallback.on_step_end()
watch()
unwatch()
- dataquality.integrations.ultralytics module
- Module contents
- dataquality.loggers package
- Subpackages
- dataquality.loggers.data_logger package
- Subpackages
- Submodules
- dataquality.loggers.data_logger.base_data_logger module
- dataquality.loggers.data_logger.image_classification module
- dataquality.loggers.data_logger.object_detection module
- dataquality.loggers.data_logger.semantic_segmentation module
- dataquality.loggers.data_logger.tabular_classification module
- dataquality.loggers.data_logger.text_classification module
- dataquality.loggers.data_logger.text_multi_label module
- dataquality.loggers.data_logger.text_ner module
- Module contents
- dataquality.loggers.logger_config package
- Subpackages
- Submodules
- dataquality.loggers.logger_config.base_logger_config module
- dataquality.loggers.logger_config.image_classification module
- dataquality.loggers.logger_config.object_detection module
- dataquality.loggers.logger_config.semantic_segmentation module
- dataquality.loggers.logger_config.tabular_classification module
- dataquality.loggers.logger_config.text_classification module
- dataquality.loggers.logger_config.text_multi_label module
- dataquality.loggers.logger_config.text_ner module
- Module contents
- dataquality.loggers.model_logger package
- Subpackages
- Submodules
- dataquality.loggers.model_logger.base_model_logger module
- dataquality.loggers.model_logger.image_classification module
- dataquality.loggers.model_logger.object_detection module
- dataquality.loggers.model_logger.semantic_segmentation module
- dataquality.loggers.model_logger.tabular_classification module
- dataquality.loggers.model_logger.text_classification module
- dataquality.loggers.model_logger.text_multi_label module
- dataquality.loggers.model_logger.text_ner module
- Module contents
- dataquality.loggers.data_logger package
- Submodules
- dataquality.loggers.base_logger module
BaseLoggerAttributes
BaseLoggerAttributes.texts
BaseLoggerAttributes.labels
BaseLoggerAttributes.ids
BaseLoggerAttributes.split
BaseLoggerAttributes.meta
BaseLoggerAttributes.prob
BaseLoggerAttributes.gold_conf_prob
BaseLoggerAttributes.gold_loss_prob
BaseLoggerAttributes.gold_loss_prob_label
BaseLoggerAttributes.pred_conf_prob
BaseLoggerAttributes.pred_loss_prob
BaseLoggerAttributes.pred_loss_prob_label
BaseLoggerAttributes.gold
BaseLoggerAttributes.embs
BaseLoggerAttributes.probs
BaseLoggerAttributes.logits
BaseLoggerAttributes.epoch
BaseLoggerAttributes.aum
BaseLoggerAttributes.text_tokenized
BaseLoggerAttributes.gold_spans
BaseLoggerAttributes.pred_emb
BaseLoggerAttributes.gold_emb
BaseLoggerAttributes.pred_spans
BaseLoggerAttributes.text_token_indices
BaseLoggerAttributes.text_token_indices_flat
BaseLoggerAttributes.log_helper_data
BaseLoggerAttributes.inference_name
BaseLoggerAttributes.image
BaseLoggerAttributes.token_label_str
BaseLoggerAttributes.token_label_positions
BaseLoggerAttributes.token_label_offsets
BaseLoggerAttributes.label
BaseLoggerAttributes.token_deps
BaseLoggerAttributes.text
BaseLoggerAttributes.id
BaseLoggerAttributes.token_gold_probs
BaseLoggerAttributes.tokenized_label
BaseLoggerAttributes.input
BaseLoggerAttributes.target
BaseLoggerAttributes.generated_output
BaseLoggerAttributes.input_cutoff
BaseLoggerAttributes.target_cutoff
BaseLoggerAttributes.system_prompts
BaseLoggerAttributes.x
BaseLoggerAttributes.y
BaseLoggerAttributes.data_x
BaseLoggerAttributes.data_y
BaseLoggerAttributes.get_valid()
BaseGalileoLogger
BaseGalileoLogger.LOG_FILE_DIR
BaseGalileoLogger.logger_config
BaseGalileoLogger.proj_run
BaseGalileoLogger.write_output_dir
BaseGalileoLogger.split_name
BaseGalileoLogger.split_name_path
BaseGalileoLogger.get_valid_attributes()
BaseGalileoLogger.validate_and_format()
BaseGalileoLogger.set_split_epoch()
BaseGalileoLogger.is_valid()
BaseGalileoLogger.non_inference_logged()
BaseGalileoLogger.log()
BaseGalileoLogger.validate_task()
BaseGalileoLogger.upload()
BaseGalileoLogger.get_all_subclasses()
BaseGalileoLogger.get_logger()
BaseGalileoLogger.doc()
BaseGalileoLogger.validate_split()
BaseGalileoLogger.check_for_logging_failures()
BaseGalileoLogger.is_hf_dataset()
BaseGalileoLogger.label_idx_map
BaseGalileoLogger.labels_to_idx()
- Module contents
BaseGalileoLogger
BaseGalileoLogger.LOG_FILE_DIR
BaseGalileoLogger.logger_config
BaseGalileoLogger.split
BaseGalileoLogger.inference_name
BaseGalileoLogger.proj_run
BaseGalileoLogger.write_output_dir
BaseGalileoLogger.split_name
BaseGalileoLogger.split_name_path
BaseGalileoLogger.get_valid_attributes()
BaseGalileoLogger.validate_and_format()
BaseGalileoLogger.set_split_epoch()
BaseGalileoLogger.is_valid()
BaseGalileoLogger.non_inference_logged()
BaseGalileoLogger.log()
BaseGalileoLogger.validate_task()
BaseGalileoLogger.upload()
BaseGalileoLogger.get_all_subclasses()
BaseGalileoLogger.get_logger()
BaseGalileoLogger.doc()
BaseGalileoLogger.validate_split()
BaseGalileoLogger.check_for_logging_failures()
BaseGalileoLogger.is_hf_dataset()
BaseGalileoLogger.label_idx_map
BaseGalileoLogger.labels_to_idx()
- Subpackages
- dataquality.schemas package
- Submodules
- dataquality.schemas.condition module
- dataquality.schemas.cv module
CVSmartFeatureColumn
CVSmartFeatureColumn.image_path
CVSmartFeatureColumn.height
CVSmartFeatureColumn.width
CVSmartFeatureColumn.channels
CVSmartFeatureColumn.hash
CVSmartFeatureColumn.contrast
CVSmartFeatureColumn.overexp
CVSmartFeatureColumn.underexp
CVSmartFeatureColumn.blur
CVSmartFeatureColumn.lowcontent
CVSmartFeatureColumn.outlier_size
CVSmartFeatureColumn.outlier_ratio
CVSmartFeatureColumn.outlier_near_duplicate_id
CVSmartFeatureColumn.outlier_near_dup
CVSmartFeatureColumn.outlier_channels
CVSmartFeatureColumn.outlier_low_contrast
CVSmartFeatureColumn.outlier_overexposed
CVSmartFeatureColumn.outlier_underexposed
CVSmartFeatureColumn.outlier_low_content
CVSmartFeatureColumn.outlier_blurry
- dataquality.schemas.dataframe module
- dataquality.schemas.edit module
- dataquality.schemas.hf module
- dataquality.schemas.job module
- dataquality.schemas.metrics module
HashableBaseModel
MetaFilter
InferenceFilter
LassoSelection
FilterParams
FilterParams.class_filter
FilterParams.data_error_potential_high
FilterParams.data_error_potential_low
FilterParams.exclude_ids
FilterParams.gold_filter
FilterParams.ids
FilterParams.inference_filter
FilterParams.lasso
FilterParams.likely_mislabeled
FilterParams.likely_mislabeled_dep_percentile
FilterParams.meta_filter
FilterParams.misclassified_only
FilterParams.num_similar_to
FilterParams.pred_filter
FilterParams.regex
FilterParams.similar_to
FilterParams.span_regex
FilterParams.span_sample_ids
FilterParams.span_text
FilterParams.text_pat
- dataquality.schemas.model module
- dataquality.schemas.ner module
NERProbMethod
NERErrorType
TaggingSchema
NERColumns
NERColumns.id
NERColumns.sample_id
NERColumns.split
NERColumns.epoch
NERColumns.is_gold
NERColumns.is_pred
NERColumns.span_start
NERColumns.span_end
NERColumns.gold
NERColumns.pred
NERColumns.conf_prob
NERColumns.loss_prob
NERColumns.loss_prob_label
NERColumns.galileo_error_type
NERColumns.emb
NERColumns.inference_name
- dataquality.schemas.report module
- dataquality.schemas.request_type module
- dataquality.schemas.route module
Route
Route.projects
Route.runs
Route.users
Route.cleanup
Route.login
Route.current_user
Route.healthcheck
Route.healthcheck_dq
Route.slices
Route.split_path
Route.splits
Route.inference_names
Route.jobs
Route.latest_job
Route.presigned_url
Route.tasks
Route.labels
Route.epochs
Route.summary
Route.groupby
Route.metrics
Route.distribution
Route.alerts
Route.export
Route.edits
Route.export_edits
Route.notify
Route.token
Route.upload_file
Route.model
Route.link
Route.content_path()
- dataquality.schemas.semantic_segmentation module
SemSegCols
ErrorType
PolygonType
SemSegMetricType
ClassificationErrorData
SemSegMetricData
Pixel
Contour
Polygon
Polygon.area
Polygon.background_error_pct
Polygon.cls_error_data
Polygon.contours
Polygon.data_error_potential
Polygon.error_type
Polygon.ghost_percentage
Polygon.label_idx
Polygon.likely_mislabeled_pct
Polygon.polygon_type
Polygon.uuid
Polygon.contours_json()
Polygon.contours_opencv()
Polygon.dummy_polygon()
- dataquality.schemas.seq2seq module
Seq2SeqModelType
Seq2SeqInputCols
Seq2SeqInputCols.id
Seq2SeqInputCols.input
Seq2SeqInputCols.target
Seq2SeqInputCols.generated_output
Seq2SeqInputCols.split_
Seq2SeqInputCols.tokenized_label
Seq2SeqInputCols.input_cutoff
Seq2SeqInputCols.target_cutoff
Seq2SeqInputCols.token_label_str
Seq2SeqInputCols.token_label_positions
Seq2SeqInputCols.token_label_offsets
Seq2SeqInputCols.system_prompts
Seq2SeqInputTempCols
Seq2SeqOutputCols
Seq2SeqOutputCols.id
Seq2SeqOutputCols.emb
Seq2SeqOutputCols.token_logprobs
Seq2SeqOutputCols.top_logprobs
Seq2SeqOutputCols.generated_output
Seq2SeqOutputCols.generated_token_label_positions
Seq2SeqOutputCols.generated_token_label_offsets
Seq2SeqOutputCols.generated_token_logprobs
Seq2SeqOutputCols.generated_top_logprobs
Seq2SeqOutputCols.split_
Seq2SeqOutputCols.epoch
Seq2SeqOutputCols.inference_name
Seq2SeqOutputCols.generation_data
Seq2SeqOutputCols.generated_cols()
AlignedTokenData
LogprobData
ModelGeneration
BatchGenerationData
- dataquality.schemas.split module
- dataquality.schemas.task_type module
TaskType
TaskType.text_classification
TaskType.text_multi_label
TaskType.text_ner
TaskType.image_classification
TaskType.tabular_classification
TaskType.object_detection
TaskType.semantic_segmentation
TaskType.prompt_evaluation
TaskType.seq2seq
TaskType.llm_monitor
TaskType.seq2seq_completion
TaskType.seq2seq_chat
TaskType.get_valid_tasks()
TaskType.get_seq2seq_tasks()
TaskType.get_mapping()
- dataquality.schemas.torch module
- Module contents
RequestType
Route
Route.projects
Route.runs
Route.users
Route.cleanup
Route.login
Route.current_user
Route.healthcheck
Route.healthcheck_dq
Route.slices
Route.split_path
Route.splits
Route.inference_names
Route.jobs
Route.latest_job
Route.presigned_url
Route.tasks
Route.labels
Route.epochs
Route.summary
Route.groupby
Route.metrics
Route.distribution
Route.alerts
Route.export
Route.edits
Route.export_edits
Route.notify
Route.token
Route.upload_file
Route.model
Route.link
Route.content_path()
- dataquality.utils package
- Subpackages
- dataquality.utils.semantic_segmentation package
- Submodules
- dataquality.utils.semantic_segmentation.constants module
- dataquality.utils.semantic_segmentation.errors module
- dataquality.utils.semantic_segmentation.lm module
- dataquality.utils.semantic_segmentation.metrics module
- dataquality.utils.semantic_segmentation.polygons module
- dataquality.utils.semantic_segmentation.utils module
- Module contents
- dataquality.utils.seq2seq package
- dataquality.utils.semantic_segmentation package
- Submodules
- dataquality.utils.arrow module
- dataquality.utils.auth module
- dataquality.utils.auto module
- dataquality.utils.auto_trainer module
- dataquality.utils.cuda module
- dataquality.utils.cv module
- dataquality.utils.cv_smart_features module
- dataquality.utils.dq_logger module
- dataquality.utils.dqyolo module
- dataquality.utils.emb module
- dataquality.utils.file module
- dataquality.utils.hdf5_store module
- dataquality.utils.helpers module
- dataquality.utils.hf_images module
- dataquality.utils.hf_tokenizer module
- dataquality.utils.imports module
- dataquality.utils.jsl module
- dataquality.utils.keras module
- dataquality.utils.ml module
- dataquality.utils.name module
- dataquality.utils.od module
- dataquality.utils.patcher module
- dataquality.utils.profiler module
- dataquality.utils.setfit module
- dataquality.utils.task_helpers module
- dataquality.utils.tf module
- dataquality.utils.thread_pool module
- dataquality.utils.torch module
cleanup_cuda()
ModelHookManager
ModelOutputsStore
TorchHelper
TorchBaseInstance
store_batch_indices()
patch_iterator_with_store()
validate_fancy_index_str()
convert_fancy_idx_str_to_slice()
unpatch()
remove_hook()
remove_all_forward_hooks()
find_dq_hook_by_name()
PatchSingleDataloaderIterator
PatchSingleDataloaderNextIndex
PatchDataloadersGlobally
- dataquality.utils.transformers module
- dataquality.utils.ultralytics module
- dataquality.utils.upload module
- dataquality.utils.upload_model module
- dataquality.utils.vaex module
- dataquality.utils.version module
- Module contents
tqdm
tqdm.monitor_interval
tqdm.monitor
tqdm.format_sizeof()
tqdm.format_interval()
tqdm.format_num()
tqdm.status_printer()
tqdm.format_meter()
tqdm.write()
tqdm.external_write_mode()
tqdm.set_lock()
tqdm.get_lock()
tqdm.pandas()
tqdm.update()
tqdm.close()
tqdm.clear()
tqdm.refresh()
tqdm.unpause()
tqdm.reset()
tqdm.set_description()
tqdm.set_description_str()
tqdm.set_postfix()
tqdm.set_postfix_str()
tqdm.moveto()
tqdm.format_dict
tqdm.display()
tqdm.wrapattr()
- Subpackages
Submodules#
dataquality.analytics module#
- pydantic model ProfileModel#
Bases:
BaseModel
User profile
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- field packages: Optional[Dict[str, str]] = None#
- field uuid: Optional[str] = None#
- class Analytics(ApiClient, config)#
Bases:
Borg
Analytics is used to track errors and logs in the background
To initialize the Analytics class you need to pass in an ApiClient and the dq config.
- Parameters:
ApiClient (Type[ApiClient]) – The ApiClient class
config (Config) – The dq config
- debug_logging(log_message, *args)#
This function is used to log debug messages. It will only log if the DQ_DEBUG environment variable is set to True.
- Return type:
None
- ipython_exception_handler(shell, etype, evalue, tb, tb_offset=None)#
This function is used to handle exceptions in ipython.
- Return type:
None
- track_exception_ipython(etype, evalue, tb)#
We parse the current environment and send the error to the api.
- Return type:
None
- handle_exception(etype, evalue, tb)#
This function is used to handle exceptions in python.
- Return type:
None
- capture_exception(error)#
This function is used to take an exception that is passed as an argument.
- Return type:
None
- log_import(module)#
This function is used to log an import of a module.
- Return type:
None
- log_function(function)#
This function is used to log a function call.
- Return type:
None
- log(data)#
This function is used to send the error to the api in a thread.
- Return type:
None
- set_config(config)#
This function is used to set the config post init.
- Return type:
None
dataquality.dqyolo module#
- main()#
dqyolo is a wrapper around ultralytics yolo that will automatically run the model on the validation and test sets and provide data insights.
- Return type:
None
dataquality.exceptions module#
- exception GalileoException#
Bases:
Exception
A class for Galileo Exceptions
- exception GalileoWarning#
Bases:
Warning
A class for Galileo Warnings
- exception LogBatchError#
Bases:
Exception
An exception used to indicate an invalid batch of logged model outputs
dataquality.internal module#
Internal functions to help Galileans
- reprocess_run(project_name, run_name, alerts=True, wait=True)#
Reprocesses a run that has already been processed by Galileo
Useful if a new feature has been added to the system that is desired to be added to an old run that hasn't been migrated
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True
- Return type:
None
- reprocess_transferred_run(project_name, run_name, alerts=True, wait=True)#
Reprocess a run that has been transferred from another cluster
This is an internal helper function that allows us to reprocess a run that has been transferred from another cluster.
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True
- Return type:
None
- rename_run(project_name, run_name, new_name)#
Assigns a new name to a run
Useful if a run was named incorrectly, or if a run was created with a temporary name and needs to be renamed to something more permanent
- Parameters:
project_name (str) – The name of the project
run_name (str) – The name of the run
new_name (str) – The new name to assign to the run
- Return type:
None
- rename_project(project_name, new_name)#
Renames a project
Useful if a project was named incorrectly, or if a project was created with a temporary name and needs to be renamed to something more permanent
- Parameters:
project_name (str) – The name of the project
new_name (str) – The new name to assign to the project
- Return type:
None
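For example, a minimal sketch of renaming and then reprocessing an existing run (the project and run names below are placeholders):

.. code-block:: python

    from dataquality.internal import rename_run, reprocess_run

    # Give the run a clearer, permanent name
    rename_run("my_project", "run-2023-01-01", new_name="baseline_v1")

    # Recompute insights (and alerts) for the renamed run, blocking until done
    reprocess_run("my_project", "baseline_v1", alerts=True, wait=True)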
dataquality.metrics module#
- create_edit(project_name, run_name, split, edit, filter, task=None, inference_name=None)#
Creates an edit for a run given a filter
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split
edit (Union[Edit, Dict]) – The edit to make. See help(Edit) for more information
task (Optional[str]) – Required task name if run is MLTC
inference_name (Optional[str]) – Required inference name if split is inference
- Return type:
Dict
- get_run_summary(project_name, run_name, split, task=None, inference_name=None, filter=None)#
Gets the summary for a run/split
Calculates metrics (f1, recall, precision) overall (weighted) and per label. Also returns the top 50 rows of the dataframe (sorted by data_error_potential)
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the summary to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
Dict
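For example, a minimal sketch of pulling a summary for a finished run (the project/run names and filter dict are placeholders; split is shown as a string matching the split names above, and see dq.schemas.metrics.FilterParams for the supported filter keys):

.. code-block:: python

    import dataquality as dq

    # Overall summary for the training split
    summary = dq.metrics.get_run_summary("my_project", "my_run", split="training")

    # Summary restricted to likely-mislabeled samples via a filter dict
    mislabeled = dq.metrics.get_run_summary(
        "my_project", "my_run", split="training", filter={"likely_mislabeled": True}
    )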
- get_metrics(project_name, run_name, split, task=None, inference_name=None, category='gold', filter=None)#
Calculates available metrics for a run/split, grouped by a particular category
The category/column provided (can be gold, pred, or any categorical metadata column) will result in metrics per "group" or unique value of that category/column
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
category (str) – The category/column to calculate metrics for. Default "gold". Can be "gold" for ground truth, "pred" for predicted values, or any metadata column logged (or smart feature).
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the metrics to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
Dict[str, List]
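A short illustrative call, grouping metrics by the ground-truth label (project/run names are placeholders):

.. code-block:: python

    import dataquality as dq

    # One metrics entry per unique "gold" label in the training split
    per_class = dq.metrics.get_metrics(
        "my_project", "my_run", split="training", category="gold"
    )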
- display_distribution(project_name, run_name, split, task=None, inference_name=None, column='data_error_potential', filter=None)#
Displays the column distribution for a run. Plotly must be installed
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
column (str) – The column to get the distribution for. Default data error potential
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the distribution to only matching rows. See dq.schemas.metrics.FilterParams
- Return type:
None
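Illustrative usage (requires plotly to be installed; names are placeholders):

.. code-block:: python

    import dataquality as dq

    # Plot the data error potential distribution for the training split
    dq.metrics.display_distribution(
        "my_project", "my_run", split="training", column="data_error_potential"
    )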
- get_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, filter=None, as_pandas=True, include_data_embs=False, meta_cols=None)#
Gets the dataframe for a run/split
Downloads an arrow (or specified type) file to your machine and returns a loaded Vaex dataframe.
Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the data to only matching rows. See dq.schemas.metrics.FilterParams
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off the shelf data embeddings
meta_cols (Optional[List[str]]) – List of metadata columns to return in the dataframe. If "*" is included, return all metadata columns
- Return type:
Union[DataFrame, DataFrame]
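A minimal sketch of downloading a split (placeholder names). Note that when embeddings are included the result has multi-dimensional columns, so a vaex dataframe is returned even with as_pandas=True:

.. code-block:: python

    import dataquality as dq

    # Plain pandas dataframe of the training split
    df = dq.metrics.get_dataframe("my_project", "my_run", split="training")

    # Include model embeddings and off-the-shelf data embeddings (vaex returned)
    df_embs = dq.metrics.get_dataframe(
        "my_project", "my_run", split="training",
        include_embs=True, include_data_embs=True,
    )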
- get_edited_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, reviewed_only=False, as_pandas=True, include_data_embs=False)#
Gets the edited dataframe for a run/split
Exports a run/split's data with all active edits in the edits cart and returns a vaex or pandas dataframe
Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
reviewed_only (Optional[bool]) – Whether to export only reviewed edits or all edits. Default: False (all edits)
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off the shelf data embeddings
- Return type:
Union[DataFrame, DataFrame]
- get_epochs(project_name, run_name, split)#
Returns the epochs logged for a run/split
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
- Return type:
List[int]
- get_embeddings(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the embeddings for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the embeddings from the final epoch. Note that only the n and n-1 epoch embeddings are available for download
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch
- Return type:
DataFrame
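Illustrative usage (placeholder names; remember only the last two epochs are available for download):

.. code-block:: python

    import dataquality as dq

    # Embeddings from the final logged epoch of the training split
    emb_df = dq.metrics.get_embeddings("my_project", "my_run", split="training")

    # Or a specific (recent) epoch
    emb_df = dq.metrics.get_embeddings(
        "my_project", "my_run", split="training", epoch=4
    )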
- get_data_embeddings(project_name, run_name, split, inference_name='')#
Downloads the data (off the shelf) embeddings for a run/split
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
- Return type:
DataFrame
- get_probabilities(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the probabilities for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the probabilities from the final epoch.
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get probabilities for. Default final epoch
- Return type:
DataFrame
- get_raw_data(project_name, run_name, split, inference_name='', epoch=None)#
Downloads the raw logged data for a run/split at an epoch as a Vaex dataframe.
If not provided, will take the data from the final epoch.
An hdf5 file will be downloaded to local and a Vaex dataframe will be returned
- Parameters:
project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get data for. Default final epoch
- Return type:
DataFrame
- get_alerts(project_name, run_name, split, inference_name=None)#
Get alerts for a project/run/split
Alerts are automatic insights calculated and provided by Galileo on your data
- Return type:
List[Dict[str, str]]
- get_labels_for_run(project_name, run_name, task=None)#
Gets labels for a given run.
If multi-label, and a task is provided, this will get the labels for that task. Otherwise, it will get all task-labels
In NER, the full label set with the tags for each label will be returned
- Return type:
List
- get_tasks_for_run(project_name, run_name)#
Gets task names for a multi-label run
- Return type:
List[str]
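A short sketch combining a few of the lookup helpers above (project/run names are placeholders):

.. code-block:: python

    import dataquality as dq

    alerts = dq.metrics.get_alerts("my_project", "my_run", split="training")
    labels = dq.metrics.get_labels_for_run("my_project", "my_run")
    epochs = dq.metrics.get_epochs("my_project", "my_run", split="training")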
Module contents#
- login()#
Log into your Galileo environment.
The function will prompt you for an Authorization Token (api key) that you can access from the console.
To skip the prompt for automated workflows, you can set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password. You can set GALILEO_API_KEY to your API key if you have one.
- Return type:
None
- logout()#
- Return type:
None
- init(task_type, project_name=None, run_name=None, overwrite_local=True)#
Start a run
Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.
Before creating the project, check:
- The user is valid, login if not
- The DQ client version is compatible with the API version
Optionally provide project and run names to create a new project/run or restart existing ones.
- Return type:
None
- Parameters:
task_type (str) – The task type for modeling. This must be one of the valid dataquality.schemas.task_type.TaskType options
project_name (Optional[str]) – The project name. If not passed in, a random one will be generated. If provided, and the project does not exist, it will be created. If it does exist, it will be set.
run_name (Optional[str]) – The run name. If not passed in, a random one will be generated. If provided, and the run does not exist, it will be created. If it does exist, it will be set.
overwrite_local (bool) – If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True
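A minimal end-to-end sketch of starting and finishing a run (the project/run names are placeholders):

.. code-block:: python

    import dataquality as dq

    dq.login()  # or set GALILEO_USERNAME/GALILEO_PASSWORD or GALILEO_API_KEY
    dq.init(
        task_type="text_classification",
        project_name="example_project",
        run_name="example_run",
    )
    # ... log data samples and model outputs here ...
    dq.finish()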
- log_data_samples(*, texts, ids, meta=None, **kwargs)#
Logs a batch of input samples for model training/test/validation/inference.
Fields are expected as lists of their content. Field names are in the plural of log_input_sample (text -> texts). The expected arguments come from the task_type being used: See dq.docs() for details
ex (text classification):

.. code-block:: python

    all_labels = ["A", "B", "C"]
    dq.set_labels_for_run(labels=all_labels)

    texts: List[str] = [
        "Text sample 1", "Text sample 2", "Text sample 3", "Text sample 4"
    ]
    labels: List[str] = ["B", "C", "A", "A"]
    meta = {
        "sample_importance": ["high", "low", "low", "medium"],
        "quality_ranking": [9.7, 2.4, 5.5, 1.2],
    }
    ids: List[int] = [0, 1, 2, 3]
    split = "training"

    dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
- Parameters:
texts (List[str]) – List[str] the input samples to your model
ids (List[int]) – List[int | str] the ids per sample
split – Optional[str] the split for this data. Can also be set via dq.set_split
meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields to each sample. The name of the field is the key of the dictionary, and the values are a list that correspond in length and order to the text samples.
kwargs (Any) – See dq.docs() for details on other task specific parameters
- Return type:
None
- log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#
Logs model outputs for model during training/test/validation.
The expected argument shapes come from the task_type being used. See dq.docs() for more task specific details on parameter shape.
- Parameters:
ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples
embs (Union[List, ndarray, None]) – The embeddings per output sample
split (Optional[Split]) – The current split. Must be set either here or via dq.set_split
epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch
logits (Union[List, ndarray, None]) – The logits for each sample
probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied
inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.
exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.
- Return type:
None
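A minimal sketch of logging outputs from inside a training loop. The epoch, batch_ids, batch_embeddings, and batch_logits variables are placeholders for your own per-batch values:

.. code-block:: python

    import dataquality as dq

    dq.set_epoch_and_split(epoch, "training")
    dq.log_model_outputs(ids=batch_ids, embs=batch_embeddings, logits=batch_logits)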
- configure(do_login=True, _internal=False)#
Update your active config with new information
You can use environment variables to set the config, or wait for prompts. Available environment variables to update:
* GALILEO_CONSOLE_URL
* GALILEO_USERNAME
* GALILEO_PASSWORD
* GALILEO_API_KEY
- Return type:
None
- finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#
Finishes the current run and invokes a job
- Parameters:
last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed. This is inclusive, so setting last_epoch to 5 would upload epochs 0,1,2,3,4,5
wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run(). Default True
create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.
data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to "text" which corresponds to the input text. Can also be set to target, generated_output or any other column that is logged as metadata.
upload_model (bool) – If True, the model will be stored in the galileo project. Default False or set by the environment variable DQ_UPLOAD_MODEL.
- Return type:
str
- set_labels_for_run(labels)#
Creates the mapping of the labels for the model to their respective indexes.
- Return type:
None
- Parameters:
labels (Union[List[List[str]], List[str]]) – An ordered list of labels (ie ["dog", "cat", "fish"])
If this is a multi-label type, then labels are a list of lists where each inner list indicates the label for the given task
This order MUST match the order of probabilities that the model outputs.
In the multi-label case, the outer order (order of the tasks) must match the task-order of the task-probabilities logged as well.
- get_current_run_labels()#
Returns the current run labels, if there are any
- Return type:
Optional[List[str]]
- get_data_logger(task_type=None, *args, **kwargs)#
- Return type:
- get_model_logger(task_type=None, *args, **kwargs)#
- Return type:
- get_run_link(project_name=None, run_name=None)#
Gets the link to the run in the UI
- Return type:
str
- set_tasks_for_run(tasks, binary=True)#
Sets the task names for the run (multi-label case only).
This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.
This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)
- Return type:
None
- Parameters:
tasks (List[str]) – The list of tasks for your run
binary (bool) – Whether this is a binary multi label run. If true, tasks will also be set as your labels, and you should NOT call dq.set_labels_for_run; it will be handled for you. Default True
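For example, a sketch of a non-binary multi-label setup, where task order, label order, and probability order must all line up (task and label names are placeholders):

.. code-block:: python

    import dataquality as dq

    dq.set_tasks_for_run(["sentiment", "urgency"], binary=False)
    dq.set_labels_for_run([
        ["negative", "neutral", "positive"],  # labels for the "sentiment" task
        ["low", "high"],                      # labels for the "urgency" task
    ])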
- set_tagging_schema(tagging_schema)#
Sets the tagging schema for NER models
Only valid for text_ner task_types. Others will throw an exception
- Return type:
None
- docs()#
Print the documentation for your specific input and output logging format
Based on your task_type, this will print the appropriate documentation
- Return type:
None
- wait_for_run(project_name=None, run_name=None)#
Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
project_name (Optional[str]) – The project name. Defaults to current project if not passed in.
run_name (Optional[str]) – The run name. Defaults to current run if not passed in.
- Return type:
None
- Returns:
None. Function returns after the run transitions to finished
- get_run_status(project_name=None, run_name=None)#
Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
project_name (Optional[str]) – The project name. Defaults to current project if not passed in.
run_name (Optional[str]) – The run name. Defaults to current run if not passed in.
- Return type:
Dict[str, Any]
- Returns:
Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.
- set_epoch(epoch)#
Set the current epoch.
When set, logging model outputs will use this if not logged explicitly
- Return type:
None
- set_split(split, inference_name=None)#
Set the current split.
When set, logging data inputs/model outputs will use this if not logged explicitly. When setting split to inference, inference_name must be included
- Return type:
None
- set_epoch_and_split(epoch, split, inference_name=None)#
Set the current epoch and the current split. When set, logging data inputs/model outputs will use this if not logged explicitly. When setting split to inference, inference_name must be included
- Return type:
None
- set_console_url(console_url=None)#
For Enterprise users. Set the console URL to your Galileo Environment.
You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this.
- Return type:
None
- Parameters:
console_url (Optional[str]) – If set, that will be used. Otherwise, if an environment variable GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a url.
- log_data_sample(*, text, id, **kwargs)#
Log a single input example to disk
Fields are expected as singular elements. Field names are in the singular of log_input_samples (texts -> text). The expected arguments come from the task_type being used: See dq.docs() for details
- Parameters:
text (str) – str the input sample text to your model
id (int) – int | str the id for the sample
split – Optional[str] the split for this data. Can also be set via dq.set_split
kwargs (Any) – See dq.docs() for details on other task specific parameters
- Return type:
None
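Illustrative usage for a text classification run (label is one of the task-specific keyword arguments; see dq.docs()):

.. code-block:: python

    import dataquality as dq

    dq.log_data_sample(
        text="I loved this movie!", id=0, label="positive", split="training"
    )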
- log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#
Log an iterable or other dataset to disk. Useful for logging memory mapped files
Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets
Valid examples:

.. code-block:: python

    d = [
        {"my_text": "sample1", "my_labels": "A", "my_id": 1, "sample_quality": 5.3},
        {"my_text": "sample2", "my_labels": "A", "my_id": 2, "sample_quality": 9.1},
        {"my_text": "sample3", "my_labels": "B", "my_id": 3, "sample_quality": 2.7},
    ]
    dq.log_dataset(
        d, text="my_text", id="my_id", label="my_labels", meta=["sample_quality"]
    )

Logging a pandas dataframe, df:

.. code-block:: python

    #      text label  id  sample_quality
    # 0 sample1     A   1             5.3
    # 1 sample2     A   2             9.1
    # 2 sample3     B   3             2.7
    # We don't need to set text, id, or label because they match the defaults
    dq.log_dataset(df, meta=["sample_quality"])

Logging an iterable of tuples:

.. code-block:: python

    d = [
        ("sample1", "A", "ID1"),
        ("sample2", "A", "ID2"),
        ("sample3", "B", "ID3"),
    ]
    dq.log_dataset(d, text=0, id=2, label=1)

Invalid example:

.. code-block:: python

    d = {
        "my_text": ["sample1", "sample2", "sample3"],
        "my_labels": ["A", "A", "B"],
        "my_id": [1, 2, 3],
        "sample_quality": [5.3, 9.1, 2.7],
    }

In the invalid case, use dq.log_data_samples:

.. code-block:: python

    meta = {"sample_quality": d["sample_quality"]}
    dq.log_data_samples(
        texts=d["my_text"], labels=d["my_labels"], ids=d["my_id"], meta=meta
    )

Keyword arguments are specific to the task type. See dq.docs() for details

- Parameters:
dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log
text (Union[str, int]) – str | int The column, key, or int index for text data. Default "text"
id (Union[str, int]) – str | int The column, key, or int index for id data. Default "id"
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns to your input data to be logged as metadata. Consider a pandas dataframe, this would be the list of columns corresponding to each metadata field to log
kwargs (Any) – See help(dq.get_data_logger().log_dataset) for more details here or dq.docs() for more general task details
- Batch_size:
The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000
- Return type:
None
- log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#
Log an image dataset of input samples for image classification
- Parameters:
dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).
imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths but could also be bytes for HF dataframes). Ignored for ImageFolder where local paths are directly retrieved from the dataset.
imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.
batch_size (int) – Number of samples to log in a batch. Default 10,000
id (str) – The name of the column containing the ids (in the dataframe)
label (str) – The name of the column containing the labels (in the dataframe)
split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split
inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split
parallel (bool) – upload in parallel if set to True
- Return type:
None
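A minimal sketch using a Torchvision ImageFolder. The local directory and remote bucket path are placeholders; passing imgs_remote skips re-uploading the images:

.. code-block:: python

    import dataquality as dq
    from torchvision.datasets import ImageFolder

    train_ds = ImageFolder("data/train")
    dq.log_image_dataset(
        train_ds,
        imgs_remote="s3://my-bucket/train-images",
        split="training",
    )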
- log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#
Log data for tabular classification models with XGBoost
X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.
Example with numpy arrays:

.. code-block:: python

    import xgboost as xgb
    from sklearn.datasets import load_wine

    wine = load_wine()

    X = wine.data
    y = wine.target
    feature_names = wine.feature_names

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X, y=y, feature_names=feature_names, split="training")

    # or for inference
    dq.log_xgboost(
        model, X, feature_names=feature_names, split="inference",
        inference_name="my_inference"
    )

Example with pandas DataFrames:

.. code-block:: python

    import xgboost as xgb
    from sklearn.datasets import load_wine

    X, y = load_wine(as_frame=True, return_X_y=True)

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X=X, y=y, split="training")

    # or for inference
    dq.log_xgboost(model, X=X, split="inference", inference_name="my_inference")

- Parameters:
model (XGBClassifier) – XGBClassifier model fit on the training data
X (Union[DataFrame, ndarray]) – The input data as a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)
y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference only
feature_names (Optional[List[str]]) – List of feature names if X is input as numpy array. Must have length n_features
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split
- Return type:
None
- get_dq_log_file(project_name=None, run_name=None)#
- Return type:
Optional[str]
- build_run_report(conditions, emails, project_id, run_id, link)#
Build a run report and send it to the specified emails.
- Return type:
None
- register_run_report(conditions, emails)#
Register conditions and emails for a run report.
After a run is finished, a report will be sent to the specified emails.
- Return type:
None
- class AggregateFunction(value)#
Bases:
str
,Enum
An enumeration.
- avg = 'Average'#
- min = 'Minimum'#
- max = 'Maximum'#
- sum = 'Sum'#
- pct = 'Percentage'#
- class Operator(value)#
Bases:
str
,Enum
An enumeration.
- eq = 'is equal to'#
- neq = 'is not equal to'#
- gt = 'is greater than'#
- lt = 'is less than'#
- gte = 'is greater than or equal to'#
- lte = 'is less than or equal to'#
- pydantic model Condition#
Bases:
BaseModel
Class for building custom conditions for data quality checks
After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.
With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:
- Is the average confidence less than 0.3?
>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c.evaluate(df)
- Is the max DEP greater or equal to 0.45?
>>> c = Condition(
...     agg=AggregateFunction.max,
...     metric="data_error_potential",
...     operator=Operator.gte,
...     threshold=0.45,
... )
>>> c.evaluate(df)
By adding filters, you can further narrow down the scope of the condition. If the aggregate function is "pct", you don't need to specify a metric, as the filters will determine the percentage of data.
For example:
- Alert if over 80% of the dataset has confidence under 0.1
>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.8,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="confidence", operator=Operator.lt, value=0.1
...         ),
...     ],
... )
>>> c.evaluate(df)
- Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.2,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="is_drifted", operator=Operator.eq, value=True
...         ),
...     ],
... )
>>> c.evaluate(df)
Alert if 5% or more of the dataset contains PII
>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.05,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="galileo_pii", operator=Operator.neq, value="None"
...         ),
...     ],
... )
>>> c.evaluate(df)
Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:
- Alert if the min confidence of drifted data is less than 0.15
>>> c = Condition(
...     agg=AggregateFunction.min,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.15,
...     filters=[
...         ConditionFilter(
...             metric="is_drifted", operator=Operator.eq, value=True
...         )
...     ],
... )
>>> c.evaluate(df)
- Alert if over 50% of high DEP (>=0.7) data contains PII
>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.5,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="data_error_potential", operator=Operator.gte, value=0.7
...         ),
...         ConditionFilter(
...             metric="galileo_pii", operator=Operator.neq, value="None"
...         ),
...     ],
... )
>>> c.evaluate(df)
You can also call conditions directly, which will assert their truth against a df.
1. Assert that average confidence is less than 0.3
>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c(df)  # Will raise an AssertionError if False
- Parameters:
metric β The DF column for evaluating the condition
agg β An aggregate function to apply to the metric
operator – The operator to use for comparing the agg to the threshold (e.g. "gt", "lt", "eq", "neq")
threshold β Threshold value for evaluating the condition
filter β Optional filter to apply to the DataFrame before evaluating the condition
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
- field agg: AggregateFunction [Required]#
- field filters: List[ConditionFilter] [Optional]#
- Validated by: validate_filters
- field metric: Optional[str] = None#
- Validated by: validate_metric
- field threshold: float [Required]#
- evaluate(df)#
- Return type:
Tuple[bool, float]
- pydantic model ConditionFilter#
Bases:
BaseModel
Filter a dataframe based on the column value
Note that the column used for filtering is the same as the metric used in the condition.
- Parameters:
operator – The operator to use for filtering (e.g. "gt", "lt", "eq", "neq"). See Operator
value β The value to compare against
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
- field metric: str [Required]#
- field value: Union[float, int, str, bool] [Required]#
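For example, a sketch of registering a run report built from a Condition (the email address and threshold are placeholders):

.. code-block:: python

    import dataquality as dq
    from dataquality import AggregateFunction, Condition, Operator

    # Email a report after dq.finish() if average DEP exceeds 0.4
    high_dep = Condition(
        agg=AggregateFunction.avg,
        metric="data_error_potential",
        operator=Operator.gt,
        threshold=0.4,
    )
    dq.register_run_report(conditions=[high_dep], emails=["ml-team@example.com"])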
- disable_galileo()#
- Return type:
None
- disable_galileo_verbose()#
- Return type:
None
- enable_galileo_verbose()#
- Return type:
None
- enable_galileo()#
- Return type:
None
- auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)#
Automatically gets insights on a text classification or NER dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console
One of hf_data, train_data should be provided. If neither of those are, a demo dataset will be loaded by Galileo for training.
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Union[DatasetDict, str] Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you'd like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, you should use the hf_inference_names param. Optional inference datasets to run with after training completes. The structure is a dictionary with the key being the inference name and the value one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
max_padding_length (int) – The max length for padding the input text during tokenization. Default 200
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased
num_train_epochs (int) – The number of epochs to train for (early stopping will always be active). Default 15
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data
project_name (Optional[str]) – Optional project name. If not set, a random name will be generated
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(..., include_data_embs=True) in the data_emb col. Only available for TC currently. NER coming soon. Default True if a GPU is available, else default False.
early_stopping (bool) – Whether to use early stopping. Default True
- Return type:
None
For text classification datasets, the only required columns are text and label
For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies
MIT Movies dataset in huggingface format
tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...   [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...   [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...   [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...   [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...   [0, 0, 0, 7, 0, 0, ...
...                                                  ...
To see auto insights on a random, pre-selected dataset, simply run
.. code-block:: python

    import dataquality as dq

    dq.auto()

An example using auto with a hosted huggingface text classification dataset

.. code-block:: python

    import dataquality as dq

    dq.auto(hf_data="rungalileo/trec6")

Similarly, for NER

.. code-block:: python

    import dataquality as dq

    dq.auto(hf_data="conll2003")

An example using auto with sklearn data as pandas dataframes

.. code-block:: python

    import dataquality as dq
    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups

    # Load the newsgroups dataset from sklearn
    newsgroups_train = fetch_20newsgroups(subset="train")
    newsgroups_test = fetch_20newsgroups(subset="test")

    # Convert to pandas dataframes
    df_train = pd.DataFrame(
        {"text": newsgroups_train.data, "label": newsgroups_train.target}
    )
    df_test = pd.DataFrame(
        {"text": newsgroups_test.data, "label": newsgroups_test.target}
    )

    dq.auto(
        train_data=df_train,
        test_data=df_test,
        labels=newsgroups_train.target_names,
        project_name="newsgroups_work",
        run_name="run_1_raw_data",
    )

An example of using auto with a local CSV file with text and label columns

.. code-block:: python

    import dataquality as dq

    dq.auto(
        train_data="train.csv",
        test_data="test.csv",
        project_name="data_from_local",
        run_name="run_1_raw_data",
    )
- class DataQuality(model=None, task=TaskType.text_classification, labels=None, train_data=None, test_data=None, val_data=None, project='', run='', framework=None, *args, **kwargs)#
Bases:
object
- Parameters:
model (Optional[Any]) – The model to inspect, if a string, it will be assumed to be auto
task (TaskType) – Task type for example "text_classification"
project (str) – Project name
run (str) – Run name
train_data (Optional[Any]) – Training data
test_data (Optional[Any]) – Optional test data
val_data (Optional[Any]) – Optional validation data
labels (Optional[List[str]]) – The labels for the run
framework (Optional[ModelFramework]) – The framework to use, if provided it will be used instead of inferring it from the model. For example, if you have a torch model, you can pass framework="torch".
args (Any) – Additional arguments
kwargs (Any) – Additional keyword arguments

.. code-block:: python

    from dataquality import DataQuality

    with DataQuality(
        model,
        "text_classification",
        labels=["neg", "pos"],
        train_data=train_data,
    ) as dq:
        model.fit(train_data)
If you want to train without a model, you can use the auto framework:
.. code-block:: python

    from dataquality import DataQuality

    with DataQuality(labels=["neg", "pos"], train_data=train_data) as dq:
        dq.finish()
- get_metrics(split=Split.train)#
- Return type:
Dict[str, Any]
- auto_notebook()#
- Return type:
None