dataquality.integrations package#
Subpackages#
- dataquality.integrations.seq2seq package
Submodules#
dataquality.integrations.fastai module#
- class FAIKey(value)#
Bases:
Enum
An enumeration.
- dataloader_indices = 'dataloader_indices'#
- model_input = 'model_input'#
- model_output = 'model_output'#
- ids = 'ids'#
- class FastAiDQCallback(layer=None, finish=False, *args, **kwargs)#
Bases:
Callback
Dataquality logs the model embeddings and logits to measure the quality of the dataset. Provide the label names and the classifier layer to log the embeddings and logits. If no classifier layer is provided, the last layer of the model will be used. Here is how to take the last layer of the model: dqc = DataqualityCallback(labels=["negative", "positive"], layer=model.fc)
End to end example:
```python
from fastai.vision.all import *
from fastai.callback.galileo import DataqualityCallback

path = untar_data(URLs.PETS) / "images"
image_files = get_image_files(path)  # [:107]
label_func = lambda x: x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, image_files, valid_pct=0.2, label_func=label_func,
    item_tfms=Resize(224), num_workers=1, drop_last=False,
)
learn = vision_learner(dls, "resnet34", metrics=error_rate)
dqc = DataqualityCallback(labels=["nocat", "cat"])
learn.add_cb(dqc)
learn.fine_tune(2)
```
Dataquality logs the model embeddings and logits to measure the quality of the dataset. This helps to find mislabeled samples in a data-centric approach.
- Parameters:
layer (Optional[Any]) – Classifier layer with embeddings as input and logits as output.
finish (bool) – Upload after training is complete.
disable_dq (bool) – Disable data quality logging.
- logger_config: BaseLoggerConfig#
- init_config()#
- Return type:
None
- reset_idx_store()#
- Return type:
None
- reset_config()#
- Return type:
None
- get_layer()#
Get the classifier layer, whose inputs and outputs will be logged (embeddings and logits).
- Return type:
Module
- Returns:
The classifier layer.
- before_epoch()#
- Return type:
None
- before_fit()#
- Return type:
None
- before_train()#
Sets the split in data quality and registers the classifier layer hook.
- Return type:
None
- wrap_indices(dl)#
Wraps the get_idxs function of the dataloader to store the indices.
- Return type:
None
- after_validate()#
- Return type:
None
- is_train_or_val()#
- Return type:
bool
- before_validate()#
Sets the split in data quality and registers the classifier layer hook.
- Return type:
None
- after_fit()#
Uploads data to galileo and removes the classifier layer hook.
- Return type:
None
- before_batch()#
Clears the model outputs log.
- Return type:
None
- after_pred()#
Logs the model outputs.
- Return type:
None
- register_hooks()#
Registers the classifier layer hook.
- Return type:
None
- forward_hook_with_store(store, layer, model_input, model_output)#
Forward hook to store the output of a layer.
- Parameters:
store (Dict[FAIKey, Any]) – Dictionary to store the output in.
layer (Module) – Layer to store the output of.
model_input (Any) – Input to the model.
model_output (Any) – Output of the model.
- Return type:
None
- prepare_split(split=Split.test, inference_name=None)#
Run before test data to wrap the dataloader and set the split.
- Return type:
None
- unpatch()#
Unpatches the dataloader and removes the hook.
- Return type:
None
- unhook()#
Unpatches the dataloader and removes the hook.
- Return type:
bool
- unwatch()#
Unpatches the dataloader and removes the hook.
- Return type:
None
- convert_img_dl_to_df(dl, x_col='image')#
Converts a fastai DataLoader to a pandas DataFrame.
- Parameters:
dl (DataLoader) – fastai DataLoader to convert.
x_col (str) – Name of the column to use for the x values, for example image.
- Return type:
DataFrame
- Returns:
Pandas DataFrame with the data from the DataLoader.
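A minimal usage sketch (not from the library docs), assuming the fastai ImageDataLoaders object dls from the end-to-end example above:
```python
from dataquality.integrations.fastai import convert_img_dl_to_df

# Convert the training DataLoader into a DataFrame with an "image" column
train_df = convert_img_dl_to_df(dls.train, x_col="image")
print(train_df.head())
```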
- extract_split_indices(dls)#
- Return type:
Any
- convert_tab_dl_to_df(dl, x_col='text', y_col='label')#
Converts a fastai DataLoader to a pandas DataFrame.
- Parameters:
dl (DataLoader) – fastai DataLoader to convert.
x_col (str) – Name of the column to use for the x values, for example text.
y_col (str) – Name of the column to use for the y values, for example label.
- Return type:
DataFrame
- Returns:
Pandas DataFrame with the data from the DataLoader.
dataquality.integrations.hf module#
- infer_schema(label_list)#
Infers the schema via the exhaustive list of labels
- Return type:
- tokenize_adjust_labels(all_samples_per_split, tokenizer, label_names)#
- Return type:
BatchEncoding
- tokenize_and_log_dataset(dd, tokenizer, label_names=None, meta=None)#
This function tokenizes a huggingface DatasetDict and aligns the labels to BPE
After tokenization, this function will also log the dataset(s) present in the DatasetDict
- Parameters:
dd (DatasetDict) – DatasetDict from huggingface to log
tokenizer (PreTrainedTokenizerBase) – The pretrained tokenizer from huggingface
label_names (Optional[List[str]]) – Optional list of labels for the dataset. These can typically be extracted automatically (if the dataset came from hf datasets hub or was exported via Galileo dataquality). If they cannot be extracted, an error will be raised requesting label names
meta (Optional[List[str]]) – Optional metadata columns to be logged. The columns must be present in at least one of the splits of the dataset.
- Return type:
DatasetDict
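A minimal usage sketch, assuming a token-classification DatasetDict such as conll2003 and an already-initialized run (the project and run names here are hypothetical):
```python
import dataquality as dq
from datasets import load_dataset
from transformers import AutoTokenizer
from dataquality.integrations.hf import tokenize_and_log_dataset

dq.init(task_type="text_ner", project_name="my_ner_project", run_name="my_run")  # hypothetical names
dd = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Tokenizes each split, aligns the NER labels to BPE tokens, and logs the splits
tokenized_dd = tokenize_and_log_dataset(dd, tokenizer)
```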
- class TextDataset(hf_dataset)#
Bases:
Dataset
An abstracted Huggingface Text dataset for users to import and use
Get back a DataLoader via the get_dataloader function
- get_dataloader(dataset, **kwargs)#
Create a DataLoader for a particular split given a huggingface Dataset
The DataLoader will be a loader of a TextDataset. The __getitem__ for that dataset will return:
id - the Galileo ID of the sample
input_ids - the standard huggingface input_ids
attention_mask - the standard huggingface attention_mask
labels - output labels adjusted with tokenized NER data
- Parameters:
dataset (Dataset) – The huggingface dataset to convert to a DataLoader
kwargs (Any) – Any additional keyword arguments to be passed into the DataLoader, such as batch_size or shuffle
- Return type:
DataLoader
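Continuing the sketch above, get_dataloader can wrap a tokenized split; batch_size and shuffle are passed straight through to the underlying DataLoader:
```python
from dataquality.integrations.hf import get_dataloader

train_loader = get_dataloader(tokenized_dd["train"], batch_size=32, shuffle=True)
batch = next(iter(train_loader))
print(batch.keys())  # id, input_ids, attention_mask, labels
```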
dataquality.integrations.jsl module#
dataquality.integrations.keras module#
- class DataQualityCallback(store, logger_config, log_function, model, *args, **kwargs)#
Bases:
Callback
Initialize the callback by passing in the model and the input store.
- Parameters:
store (Dict[str, Any]) – The store to save the input and output to.
model (Layer) – The model to patch.
- store: Dict[str, Any]#
- logger_config: BaseLoggerConfig#
- model: Layer#
- on_train_begin(logs=None)#
Initialize the training by extracting the model input arguments and, from them, generating the indices of the batches.
- Return type:
None
- on_epoch_begin(epoch, logs)#
At the beginning of the epoch we set the epoch in the store.
- Parameters:
epoch (int) – The epoch number.
logs (Dict) – The logs.
- Return type:
None
- on_train_batch_begin(batch, logs=None)#
At the beginning of the batch we clear the helper data from the logger config.
- Return type:
None
- on_train_batch_end(batch, logs=None)#
At the end of the batch we log the input of the classifier and the output.
- Parameters:
batch (Any) – The batch number.
logs (Optional[Dict]) – The logs.
- Return type:
None
- on_test_begin(logs=None)#
At the beginning of the test we set the split to test. And generate the indices of the batches.
- Return type:
None
- on_test_batch_begin(batch, logs=None)#
At the beginning of the batch we clear the helper data from the logger config.
- Return type:
None
- on_test_batch_end(batch, logs=None)#
At the end of the test batch we log the input of the classifier and the output.
- Return type:
None
- on_predict_begin(batch)#
At the beginning of the prediction we set the split to validation.
- Return type:
None
- on_predict_batch_end(batch, logs=None)#
Log the validation batch
- Return type:
None
- patch_model_fit_args_kwargs(store, callback)#
Store the args and kwargs of model.fit in the store. Adds the callback to the callbacks of the model.
- Parameters:
store (Dict[str, Any]) – The store for the kwargs and args.
callback (Callable) – The callback to add to the model.
- Return type:
Callable
- Returns:
The patched model.fit function.
- store_model_ids(store)#
Stores the indices of the batch for a prebatched dataset.
- Return type:
Callable
- select_model_layer(model, layer=None)#
Selects the classifier layer from the model.
- Parameters:
model (Layer) – The model.
layer (Union[Layer, str, None]) – The layer to select. If None, the layer with the name 'classifier' is selected.
- Return type:
Layer
- watch(model, layer=None, seed=42)#
Watch a model and log the inputs and outputs of a layer.
- Parameters:
model (Layer) – The model to watch
layer (Optional[Any]) – The layer to watch; if None, the classifier layer is used
seed (int) – The seed to use for the model
- Return type:
None
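A minimal sketch, assuming a compiled Keras classifier whose final layer is named 'classifier' (the model, training data, project, run, and label names are all hypothetical):
```python
import dataquality as dq
from dataquality.integrations.keras import watch, unwatch

dq.init(task_type="text_classification", project_name="my_project", run_name="my_run")
dq.set_labels_for_run(["negative", "positive"])

watch(model)               # hooks the layer named "classifier" by default
model.fit(x_train, y_train, epochs=2)
unwatch(model)             # unpatch the model once training is done
dq.finish()
```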
- unwatch(model)#
Unpatches the model. Run after the run is finished.
- Parameters:
model (Layer) – The model to unpatch
- Return type:
None
dataquality.integrations.lightning module#
- class LightningDQCallback(classifier_layer='classifier', embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None)#
Bases:
Callback
,TorchLogger
,PatchManager
PyTorch Lightning callback for logging model outputs to DataQuality.
- Parameters:
classifier_layer (Union[Module, str, None]) – The layer to extract the logits from (the output is taken as the logits and the input to the layer as the hidden state).
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – The dimension to extract from the last hidden state.
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – The dimension to extract from the logits.
embedding_fn (Optional[Callable]) – A function to apply to the embedding.
logits_fn (Optional[Callable]) – A function to apply to the logits.
last_hidden_state_layer (Union[Module, str, None]) – Optionally, the layer to extract the last hidden state from. This will overwrite the input of the classifier_layer regarding the hidden state.
Example usage:
```python
train_dataset = datasets.ImageFolder("train_images", transform=load_transforms)
train_dataloader = DataLoader(train_dataset, batch_size=4, num_workers=0)

# 🔭🌕 Galileo logging
dq.init(task_type="image_classification", project_name="test_project", run_name="test_run")
dq.set_labels_for_run(["labelA", "labelB"])
dq.log_image_dataset(train_dataset, split="train")

callback = LightningDQCallback(classifier_layer=model.model[2])
trainer = pl.Trainer(max_epochs=1, callbacks=[callback])
trainer.fit(model=model, train_dataloaders=train_dataloader)
```
- hook_manager: ModelHookManager#
- on_fit_start(trainer, pl_module)#
Called when fit begins.
- Return type:
None
- on_fit_end(trainer, pl_module)#
Called when fit ends.
- Return type:
None
- on_train_epoch_start(trainer, pl_module)#
Called when the train epoch begins.
- Return type:
None
- on_validation_epoch_start(trainer, pl_module)#
Called when the val epoch begins.
- Return type:
None
- on_test_epoch_start(trainer, pl_module)#
Called when the test epoch begins.
- Return type:
None
dataquality.integrations.setfit module#
- class Evaluate(model, dq_store)#
Bases:
object
Call function to evaluate SetFit model and log input and output to Galileo.
- unwatch(setfit_obj)#
Unpatch SetFit model by replacing predict_proba function with original function.
- Parameters:
setfit_obj (Union[SetFitModel, SetFitTrainer, None]) – SetFitModel or SetFitTrainer
- Return type:
None
- watch(setfit, labels=None, project_name='', run_name='', finish=True, wait=False, batch_size=None, meta=None, validate_before_training=False)#
Watch a SetFit model or trainer and extract model outputs for dataquality. Returns a function that can be used to evaluate the model on a dataset.
- Parameters:
setfit (Union[SetFitModel, SetFitTrainer]) – SetFit model or trainer
labels (Optional[List[str]]) – list of labels
project_name (str) – name of project
run_name (str) – name of run
finish (bool) – whether to run dq.finish after evaluation
wait (bool) – whether to wait for dq.finish
batch_size (Optional[int]) – batch size for evaluation
meta (Optional[List]) – meta data for evaluation
validate_before_training (bool) – whether to do a test run before training
- Return type:
Evaluate
- Returns:
dq_evaluate function
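A minimal sketch, assuming a SetFitTrainer named trainer and a held-out Dataset named test_dataset (both hypothetical, as is the split keyword shown on the returned callable):
```python
from dataquality.integrations.setfit import watch

dq_evaluate = watch(
    trainer,
    labels=["negative", "positive"],
    project_name="my_setfit_project",  # hypothetical
    run_name="my_run",                 # hypothetical
)
trainer.train()
# Log model inputs and outputs on the held-out split
dq_evaluate(test_dataset, split="test")
```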
- evaluate(model)#
Watch SetFit model by replacing predict_proba function with SetFitModelHook.
- Parameters:
model (SetFitModel) – SetFit model
- Return type:
Evaluate
- Returns:
SetFitModelHook object
- auto(setfit_model='sentence-transformers/paraphrase-mpnet-base-v2', hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, labels=None, project_name='auto_tc_setfit', run_name=None, training_args=None, column_mapping=None, wait=True, create_data_embs=None)#
Automatically processes and generates insights on a text classification dataset.
Given a pandas dataframe, a file path, or a Huggingface dataset path, this function will load the data, train a Huggingface transformer model, and provide insights via a link to the Console.
At least one of hf_data, train_data should be provided. If neither is provided, a demo dataset will be used for training.
- Parameters:
setfit_model (SetFitModel or Huggingface model name) – Computes text embeddings for a given text dataset with the model. If a string is provided, it will be used to load a Huggingface model and train it on the data.
hf_data (Union[DatasetDict, str], optional) – Use this parameter if you have Huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (list of str, optional) – A list of key names in hf_data to be run as inference runs after training. If set, those keys must exist in hf_data.
train_data (pandas.DataFrame, Dataset, str, optional) – Training data to use. Can be a pandas dataframe, a Huggingface dataset, path to a local file, or Huggingface dataset hub path.
val_data (pandas.DataFrame, Dataset, str, optional) – Validation data to use for evaluation and early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val_data nor test_data are available, the train data will be split randomly in 80/20 ratio.
test_data (pandas.DataFrame, Dataset, str, optional) – Test data to use. If provided with val_data, it will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set.
inference_data (dict, optional) – Optional inference datasets to run after training. The structure is a dictionary with the key being the inference name and the value being a pandas dataframe, a Huggingface dataset, path to a local file, or Huggingface dataset hub path.
labels (list of str, optional) – List of labels for this dataset. If not provided, they will attempt to be extracted from the data.
project_name (str, optional) – Project name. If not set, a random name will be generated. Default is “auto_tc_setfit”.
run_name (str, optional) – Run name for this data. If not set, a random name will be generated.
training_args (dict, optional) – A dictionary of arguments for the SetFitTrainer. It allows you to customize training configuration such as learning rate, batch size, number of epochs, etc.
column_mapping (dict, optional) – A dictionary of column names to use for the provided data. Needs to map to the following keys: “text”, “id”, “label”.
wait (bool, optional) – Whether to wait for the processing of your run to complete. Default is True.
create_data_embs (bool, optional) – Whether to create data embeddings for this run. Default is None.
- Return type:
SetFitModel
- Returns:
SetFitModel – A SetFitModel instance trained on the provided dataset.
An example using auto with sklearn data as pandas dataframes:
```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from dataquality.auto.text_classification import auto

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

auto(
    setfit_model=model,
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data",
)
```
An example of using auto with a local CSV file with text and label columns:
```python
from dataquality.auto.text_classification import auto

auto(
    setfit_model="sentence-transformers/paraphrase-mpnet-base-v2",
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data",
)
```
- do_model_eval(model, encoded_data, wait, create_data_embs=None)#
- Return type:
SetFitModel
dataquality.integrations.torch module#
- class TorchLogger(model, last_hidden_state_layer=None, embedding_dim=None, logits_dim=None, classifier_layer=None, embedding_fn=None, logits_fn=None, helper_data=None, task=TaskType.text_classification)#
Bases:
TorchBaseInstance
TorchLogger sends the logs to [Galileo](https://www.rungalileo.io/) for each training step.
- embedding_dim: Union[int, slice, Tensor, List, Tuple, None]#
- logits_dim: Union[int, slice, Tensor, List, Tuple, None]#
- model: Module#
- watch(model, dataloaders=[], classifier_layer=None, embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None, unpatch_on_start=False, dataloader_random_sampling=False)#
Wraps a PyTorch model and optionally dataloaders to log the embeddings and logits to [Galileo](https://www.rungalileo.io/).
```python
dq.log_dataset(train_dataset, split="train")
train_dataloader = torch.utils.data.DataLoader(train_dataset)
model = TextClassificationModel(num_labels=len(train_dataset.list_of_labels))
watch(model, [train_dataloader, test_dataloader])
for epoch in range(NUM_EPOCHS):
    dq.set_epoch_and_split(epoch, "training")
    train()
    dq.set_split("validation")
    validate()
dq.finish()
```
- Parameters:
model (Module) – Pytorch Model to be wrapped
dataloaders (Optional[List[DataLoader]]) – List of dataloaders to be wrapped
classifier_layer (Union[Module, str, None]) – Layer to hook into (usually 'classifier' or 'fc'). Inputs are the embeddings and outputs are the logits.
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the embeddings, for example "[:, 0]" to remove the cls token
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the logits from layer input and logits from layer output. For example in NER "[:,1:,:]". If the layer is not found, the last_hidden_state_layer will be used
embedding_fn (Optional[Callable]) – Function to process embeddings from the model
logits_fn (Optional[Callable]) – Function to process logits from the model, e.g. lambda x: x[0]
last_hidden_state_layer (Union[Module, str, None]) – Layer to extract the embeddings from
unpatch_on_start (bool) – Force unpatching of dataloaders instead of global patching
dataloader_random_sampling (bool) – Whether a RandomSampler or WeightedRandomSampler is being used. If random sampling is being used, you must set this to True, otherwise logging will fail at the end of training.
- Return type:
None
- unwatch(model=None, force=True)#
Unwatches the model. Run after the run is finished.
- Parameters:
force (bool) – Force unwatch even if the model is not watched
- Return type:
None
dataquality.integrations.torch_semantic_segmentation module#
- class SemanticTorchLogger(imgs_remote_location, local_path_to_dataset_root, dataloaders, mask_col_name=None, *args, **kwargs)#
Bases:
TorchLogger
Class to log semantic segmentation models to Galileo
- Parameters:
imgs_remote_location (str) – name of the bucket that currently stores images in the cloud
local_path_to_dataset_root (str) – path to the parent dataset folder
mask_col_name (Optional[str]) – name of the column that contains the mask
dataloaders (Dict[str, DataLoader]) – dataloaders to be logged
- convert_dataset(dataset, split)#
Convert the dataset to the format expected by the dataquality client
- Parameters:
dataset (Any) – dataset to convert
start_int (int) – starting unique id for each example in the dataset as we need a unique identifier for each example. Defaults to 0.
- Return type:
List
- find_mask_category(batch)#
Finds the mask category and stores it in the helper data.
- Parameters:
batch (Dict[str, Any]) – batch from the dataloader
- Return type:
None
- get_image_ids_and_image_paths(split, logging_data)#
- Return type:
Tuple[List[int], List[str]]
- queue_gold_and_pred(probs, gold)#
Enqueue the ground truth and predicted masks for the batch
- Parameters:
probs (torch.Tensor) – probability vectors to queue for LM
gold (torch.Tensor) – gold masks resized to queue for LM
- Return type:
None
- truncate_queue()#
Truncate the queue to the batch size
- Parameters:
bs (int) – batch size
- Return type:
None
- resize_probs_and_gold(probs, gold)#
Resize the probs and gold to the correct size
- Parameters:
probs (torch.Tensor) – probability vectors to resize
gold (torch.Tensor) – gold masks to resize
- Return type:
Tuple[Tensor, Tensor]
- calculate_mislabeled_pixels(probs, gold_mask)#
Helper function to calculate the mislabeled pixels in the batch
- Parameters:
probs (torch.Tensor) – probability tensor of shape (bs, h, w, num_classes)
gold_mask (torch.Tensor) – gold truth mask of shape (bs, h, w)
- Return type:
Tensor
- Returns:
Mislabeled pixels tensor of shape (batch_size, height, width)
- expand_binary_classification(probs)#
Expands the binary classification to a 2 channel tensor
- Parameters:
probs (torch.Tensor) – binary classification tensor
- Returns:
bs, 2, h, w tensor
- Return type:
torch.Tensor
- get_argmax_probs()#
Helper function to get the argmax and probs from the model outputs
- Returns:
argmax and logits tensors
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- upload_contours_split(split)#
Uploads all contours for a given split to minio
Structure of the contours.json file:
{
    image_id: {
        polygon_uuid: contours,
        polygon_uuid2: contours
    },
    image_id2: {
        polygon_uuid3: contours,
        polygon_uuid4: contours
    }
}
- Parameters:
split (str) – split name
- Return type:
None
- upload_dep_split(split)#
Uploads all dep files for a given split to minio
- Parameters:
split (str) – split name
- Return type:
None
- finish()#
- Return type:
None
- run_one_epoch(dataloader, device)#
- Return type:
None
- store_batch(store)#
Stores the batch in the passed store.
- Parameters:
store (Dict[str, Dict[str, Union[ndarray, Tensor]]]) – location to store the batch
- Return type:
Callable
- patch_iterator_and_batch(store)#
Patches the iterator of the dataloader to return the indices and the batch.
- Parameters:
store (Dict[str, Any]) – location to store the indices and the batch
- Return type:
Callable
- watch(model, imgs_remote_location, local_path_to_dataset_root, dataloaders, mask_col_name=None, unpatch_on_start=False)#
Wraps a PyTorch model and optionally dataloaders to log the embeddings and logits to [Galileo](https://www.rungalileo.io/).
```python
train_dataloader = torch.utils.data.DataLoader(train_dataset)
model = SemSegModel()
watch(model, imgs_remote_location, local_path_to_dataset_root,
      {"train": train_dataloader, "validation": test_dataloader})
for epoch in range(NUM_EPOCHS):
    dq.set_epoch_and_split(epoch, "training")
    train()
    dq.set_split("validation")
    validate()
dq.finish()
```
- Parameters:
model (Module) – Pytorch Model to be wrapped
imgs_remote_location (str) – Name of the bucket from which the images come
local_path_to_dataset_root (str) – Path to the dataset root, which is stripped from the image path
dataloaders (Dict[str, DataLoader]) – Dataloaders to be wrapped
mask_col_name (Optional[str]) – Name of the column in the dataloader that contains the mask
unpatch_on_start (bool) – Whether to unpatch the model before patching it
- Return type:
None
dataquality.integrations.transformers_trainer module#
- class DQTrainerCallback(trainer, torch_helper, last_hidden_state_layer=None, embedding_dim=None, logits_dim=None, classifier_layer='classifier', embedding_fn=None, logits_fn=None)#
Bases:
TrainerCallback
,TorchBaseInstance
,Patch
DQTrainerCallback that provides data quality insights with Galileo. This callback logs during each training step and works with the Huggingface transformers Trainer library.
Callback for logging model outputs during training.
- Parameters:
trainer (Trainer) – Trainer object from Huggingface transformers
last_hidden_state_layer (Union[Module, str, None]) – Name of the last hidden state layer
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the embedding
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the logits
classifier_layer (Union[Module, str, None]) – Name of the classifier layer
embedding_fn (Optional[Callable]) – Function to extract the embedding from the last hidden state
logits_fn (Optional[Callable]) – Function to extract the logits
torch_helper (TorchHelper) – Store for the callback
- hook_manager: ModelHookManager#
- validate(args, state, control, **kwargs)#
Validate the model and dataset.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Any) – Keyword arguments (train_dataloader, eval_dataloader)
- Return type:
None
- setup_model(model)#
Setup the model for logging (attach hooks).
- Parameters:
model (Module) – Model
- Return type:
None
- on_train_begin(args, state, control, **kwargs)#
Event called at the beginning of training. Attaches hooks to model.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Any) – Keyword arguments (model, eval_dataloader, tokenizer…)
- Return type:
None
- on_evaluate(args, state, control, **kwargs)#
Event called after an evaluation phase.
- Return type:
None
- on_epoch_begin(args, state, control, **kwargs)#
Event called at the beginning of an epoch.
- Return type:
None
- on_epoch_end(args, state, control, **kwargs)#
Event called at the end of an epoch.
- Return type:
None
- on_train_end(args, state, control, **kwargs)#
Event called at the end of training.
- Return type:
None
- on_prediction_step(args, state, control, **kwargs)#
Event called after a prediction step.
- Return type:
None
- on_step_end(args, state, control, **kwargs)#
Perform a training step on a batch of inputs. Log the embeddings, ids and logits.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Dict) – Keyword arguments (including the model, inputs, outputs)
- Return type:
None
- watch(trainer, classifier_layer=None, embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None)#
Hook into the trainer to log to Galileo.
- Parameters:
trainer (Trainer) – Trainer object from the transformers library
classifier_layer (Union[Module, str, None]) – Name or Layer of the classifier layer to extract the logits and the embeddings from
embedding_dim (Union[int, slice, Tensor, List, Tuple, None]) – Dimension slice for the embedding
logits_dim (Union[int, slice, Tensor, List, Tuple, None]) – Dimension slice for the logits
logits_fn (Optional[Callable]) – Function to extract the logits
embedding_fn (Optional[Callable]) – Function to extract the embedding
last_hidden_state_layer (Union[Module, str, None]) – Name of the last hidden state layer if classifier_layer is not provided
- Return type:
None
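A minimal sketch, assuming a standard text-classification Trainer named trainer (project, run, and label names are hypothetical):
```python
import dataquality as dq
from dataquality.integrations.transformers_trainer import watch, unwatch

dq.init(task_type="text_classification", project_name="my_project", run_name="my_run")
dq.set_labels_for_run(["negative", "positive"])

watch(trainer, classifier_layer="classifier")
trainer.train()
unwatch(trainer)  # remove the Galileo callback from the trainer
dq.finish()
```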
- unwatch(trainer)#
unwatch is used to remove the callback from the trainer.
- Parameters:
trainer (Trainer) – Trainer object
- Return type:
None
dataquality.integrations.ultralytics module#
- find_midpoint(box, shape, resized_shape)#
Finds the midpoint of a box in xyxy format
- Parameters:
box (Union[Tuple, List]) – box in xyxy format
shape (Union[Tuple, List]) – shape of the image
resized_shape (Union[Tuple, List]) – shape of the resized image
- Return type:
Tuple[int, int, int, int]
- Returns:
midpoint of the box
- create_embedding(features, box, size=(640, 640))#
Creates an embedding from a feature map
- Parameters:
features (List) – feature map
box (List) – box in xyxy format
size (Tuple[int, int]) – size of the image
- Return type:
Tensor
- Returns:
embedding
- embedding_fn(features, boxes, size)#
Creates embeddings for all boxes
- Parameters:
features (List) – feature map
boxes (Any) – boxes in xyxy format
size (Any) – size of the image
- Return type:
Tensor
- Returns:
embeddings
- class StoreHook(on_finish_func=None)#
Bases:
object
Generic Hook class to store model input and output
Initializes the hook
- Parameters:
on_finish_func (Optional[Callable]) – function to be called when the hook is finished
- h: Any = None#
- hook(model, model_input, model_output)#
Hook function to store model input and output
- Parameters:
model (Any) – model
model_input (Any) – model input
model_output (Any) – model output
- Return type:
None
- store_hook(h)#
Stores hook for later removal
- Parameters:
h (Any) – hook
- Return type:
None
- class BatchLogger(old_function)#
Bases:
object
Batch Logger class to store batches for later logging
Store the batch by overwriting the given method
- Parameters:
old_function (Callable) – method that is wrapped
- class Callback(nms_fn=None, bucket='', relative_img_path='', labels=[], iou_thresh=0.7, conf_thresh=0.25)#
Bases:
object
Callback class that is used to log batches, embeddings and predictions
Initializes the callback
- Parameters:
nms_fn (Optional[Callable]) – non-maximum suppression function
- model: YOLO#
- file_map: Dict#
- postprocess(batch)#
Postprocesses the batch for a training step. Taken from ultralytics. Might be removed in the future.
- Parameters:
batch (Tensor) – batch to be postprocessed
- Return type:
Any
- register_hooks(model)#
Register hooks to the model to log predictions and embeddings
- Parameters:
model (Any) – the model to hook
- Return type:
None
- init_run()#
Initialize the run
- Return type:
None
- convert_dataset(dataset)#
Convert the dataset to the format expected by the dataquality client
- Return type:
List
- on_train_start(trainer)#
Register hooks and preprocess batch function on train start
- Parameters:
trainer (BaseTrainer) – the trainer
- Return type:
None
- on_train_end(trainer)#
Restore preprocess batch function on train end
- Parameters:
trainer (BaseTrainer) – the trainer
- Return type:
None
- on_val_batch_start(validator)#
Register hooks and preprocess batch function on validation start
- Parameters:
validator (BaseValidator) – the validator
- Return type:
None
- on_predict_start(predictor)#
Register hooks on prediction start. Note: prediction is not perfect as the model is not in eval mode. May be removed.
- Parameters:
predictor (BasePredictor) – the predictor
- Return type:
None
- on_predict_batch_end(predictor)#
Log predictions and embeddings on prediction batch end. Not functional yet.
- Return type:
None
- add_callback(model, cb)#
Add the callback to the model
- Parameters:
model (YOLO) – the model
cb (Callback) – the callback class
- Return type:
None
- watch(model, bucket, relative_img_path, labels, iou_thresh=0.7, conf_thresh=0.25)#
Watch the model for predictions and embeddings logging.
- Parameters:
model (YOLO) – the model to watch
- Return type:
None
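A minimal sketch, assuming a YOLO model and a dataset whose images are synced to a cloud bucket (the bucket, path, and label names are hypothetical):
```python
from ultralytics import YOLO
from dataquality.integrations.ultralytics import watch

model = YOLO("yolov8n.pt")
watch(
    model,
    bucket="s3://my-bucket",                      # hypothetical bucket
    relative_img_path="datasets/coco128/images",  # hypothetical path inside the bucket
    labels=["person", "car"],                     # hypothetical label names
)
model.train(data="coco128.yaml", epochs=1)
```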