dataquality.integrations package#
Subpackages#
- dataquality.integrations.seq2seq package
Submodules#
dataquality.integrations.fastai module#
- class FAIKey(value)#
Bases:
Enum
An enumeration.
- dataloader_indices = 'dataloader_indices'#
- model_input = 'model_input'#
- model_output = 'model_output'#
- ids = 'ids'#
- class FastAiDQCallback(layer=None, finish=False, *args, **kwargs)#
Bases:
Callback
Dataquality logs the model embeddings and logits to measure the quality of the dataset. Provide the label names and the classifier layer to log the embeddings and logits. If no classifier layer is provided, the last layer of the model will be used. Here is how to take the last layer of the model: dqc = DataqualityCallback(labels=["negative", "positive"], layer=model.fc)
End to end example:
```python
from fastai.vision.all import *
from fastai.callback.galileo import DataqualityCallback

path = untar_data(URLs.PETS) / "images"
image_files = get_image_files(path)  # [:107]
label_func = lambda x: x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, image_files, valid_pct=0.2, label_func=label_func,
    item_tfms=Resize(224), num_workers=1, drop_last=False,
)
learn = vision_learner(dls, "resnet34", metrics=error_rate)
dqc = DataqualityCallback(labels=["nocat", "cat"])
learn.add_cb(dqc)
learn.fine_tune(2)
```
Dataquality logs the model embeddings and logits to measure the quality of the dataset. This helps to find mislabeled samples in a data-centric approach.
- Parameters:
layer (Optional[Any]) – Classifier layer with embeddings as input and logits as output.
finish (bool) – Upload after training is complete.
disable_dq (bool) – Disable data quality logging.
- logger_config: BaseLoggerConfig#
- init_config()#
- Return type:
None
- reset_idx_store()#
- Return type:
None
- reset_config()#
- Return type:
None
- get_layer()#
Get the classifier layer, whose inputs and outputs will be logged (embeddings and logits).
- Return type:
Module
- Returns:
The classifier layer.
- before_epoch()#
- Return type:
None
- before_fit()#
- Return type:
None
- before_train()#
Sets the split in data quality and registers the classifier layer hook.
- Return type:
None
- wrap_indices(dl)#
Wraps the get_idxs function of the dataloader to store the indices.
- Return type:
None
- after_validate()#
- Return type:
None
- is_train_or_val()#
- Return type:
bool
- before_validate()#
Sets the split in data quality and registers the classifier layer hook.
- Return type:
None
- after_fit()#
Uploads data to galileo and removes the classifier layer hook.
- Return type:
None
- before_batch()#
Clears the model outputs log.
- Return type:
None
- after_pred()#
Logs the model outputs.
- Return type:
None
- register_hooks()#
Registers the classifier layer hook.
- Return type:
None
- forward_hook_with_store(store, layer, model_input, model_output)#
Forward hook to store the output of a layer.
- Parameters:
store (Dict[FAIKey, Any]) – Dictionary to store the output in.
layer (Module) – Layer to store the output of.
model_input (Any) – Input to the model.
model_output (Any) – Output of the model.
- Return type:
None
- prepare_split(split=Split.test, inference_name=None)#
Run before test data to wrap the dataloader and set the split.
- Return type:
None
- unpatch()#
Unpatches the dataloader and removes the hook.
- Return type:
None
- unhook()#
Unpatches the dataloader and removes the hook.
- Return type:
bool
- unwatch()#
Unpatches the dataloader and removes the hook.
- Return type:
None
- convert_img_dl_to_df(dl, x_col='image')#
Converts a fastai DataLoader to a pandas DataFrame.
- Parameters:
dl (DataLoader) – fastai DataLoader to convert.
x_col (str) – Name of the column to use for the x values, for example image.
- Return type:
DataFrame
- Returns:
Pandas DataFrame with the data from the DataLoader.
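A minimal usage sketch (not from the library docs), assuming the fastai ImageDataLoaders object dls from the end-to-end example above:
```python
from dataquality.integrations.fastai import convert_img_dl_to_df

# Convert the training DataLoader into a DataFrame with an "image" column
train_df = convert_img_dl_to_df(dls.train, x_col="image")
print(train_df.head())
```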
- extract_split_indices(dls)#
- Return type:
Any
- convert_tab_dl_to_df(dl, x_col='text', y_col='label')#
Converts a fastai DataLoader to a pandas DataFrame.
- Parameters:
dl (DataLoader) – fastai DataLoader to convert.
x_col (str) – Name of the column to use for the x values, for example text.
y_col (str) – Name of the column to use for the y values, for example label.
- Return type:
DataFrame
- Returns:
Pandas DataFrame with the data from the DataLoader.
dataquality.integrations.hf module#
- infer_schema(label_list)#
Infers the schema via the exhaustive list of labels
- Return type:
- tokenize_adjust_labels(all_samples_per_split, tokenizer, label_names)#
- Return type:
BatchEncoding
- tokenize_and_log_dataset(dd, tokenizer, label_names=None, meta=None)#
This function tokenizes a huggingface DatasetDict and aligns the labels to BPE
After tokenization, this function will also log the dataset(s) present in the DatasetDict
- Parameters:
dd (DatasetDict) – DatasetDict from huggingface to log
tokenizer (PreTrainedTokenizerBase) – The pretrained tokenizer from huggingface
label_names (Optional[List[str]]) – Optional list of labels for the dataset. These can typically be extracted automatically (if the dataset came from hf datasets hub or was exported via Galileo dataquality). If they cannot be extracted, an error will be raised requesting label names
meta (Optional[List[str]]) – Optional metadata columns to be logged. The columns must be present in at least one of the splits of the dataset.
- Return type:
DatasetDict
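A minimal usage sketch, assuming a token-classification DatasetDict such as conll2003 and an already-initialized run (the project and run names here are hypothetical):
```python
import dataquality as dq
from datasets import load_dataset
from transformers import AutoTokenizer
from dataquality.integrations.hf import tokenize_and_log_dataset

dq.init(task_type="text_ner", project_name="my_ner_project", run_name="my_run")  # hypothetical names
dd = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Tokenizes each split, aligns the NER labels to BPE tokens, and logs the splits
tokenized_dd = tokenize_and_log_dataset(dd, tokenizer)
```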
- class TextDataset(hf_dataset)#
Bases:
Dataset
An abstracted Huggingface Text dataset for users to import and use
Get back a DataLoader via the get_dataloader function
- get_dataloader(dataset, **kwargs)#
Create a DataLoader for a particular split given a huggingface Dataset
The DataLoader will be a loader of a TextDataset. The __getitem__ for that dataset will return:
id - the Galileo ID of the sample
input_ids - the standard huggingface input_ids
attention_mask - the standard huggingface attention_mask
labels - output labels adjusted with tokenized NER data
- Parameters:
dataset (Dataset) – The huggingface dataset to convert to a DataLoader
kwargs (Any) – Any additional keyword arguments to be passed into the DataLoader, such as batch_size or shuffle
- Return type:
DataLoader
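Continuing the sketch above, get_dataloader can wrap a tokenized split; batch_size and shuffle are passed straight through to the underlying DataLoader:
```python
from dataquality.integrations.hf import get_dataloader

train_loader = get_dataloader(tokenized_dd["train"], batch_size=32, shuffle=True)
batch = next(iter(train_loader))
print(batch.keys())  # id, input_ids, attention_mask, labels
```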
dataquality.integrations.jsl module#
dataquality.integrations.keras module#
- class DataQualityCallback(store, logger_config, log_function, model, *args, **kwargs)#
Bases:
Callback
Initialize the callback by passing in the model and the input store.
- Parameters:
store (Dict[str, Any]) – The store to save the input and output to.
model (Layer) – The model to patch.
- store: Dict[str, Any]#
- logger_config: BaseLoggerConfig#
- model: Layer#
- on_train_begin(logs=None)#
Initialize the training by extracting the model input arguments and, from them, generating the indices of the batches.
- Return type:
None
- on_epoch_begin(epoch, logs)#
At the beginning of the epoch we set the epoch in the store.
- Parameters:
epoch (int) – The epoch number.
logs (Dict) – The logs.
- Return type:
None
- on_train_batch_begin(batch, logs=None)#
At the beginning of the batch we clear the helper data from the logger config.
- Return type:
None
- on_train_batch_end(batch, logs=None)#
At the end of the batch we log the input of the classifier and the output.
- Parameters:
batch (Any) – The batch number.
logs (Optional[Dict]) – The logs.
- Return type:
None
- on_test_begin(logs=None)#
At the beginning of the test we set the split to test. And generate the indices of the batches.
- Return type:
None
- on_test_batch_begin(batch, logs=None)#
At the beginning of the batch we clear the helper data from the logger config.
- Return type:
None
- on_test_batch_end(batch, logs=None)#
At the end of the test batch we log the input of the classifier and the output.
- Return type:
None
- on_predict_begin(batch)#
At the beginning of the prediction we set the split to validation.
- Return type:
None
- on_predict_batch_end(batch, logs=None)#
Log the validation batch
- Return type:
None
- patch_model_fit_args_kwargs(store, callback)#
Store the args and kwargs of model.fit in the store. Adds the callback to the callbacks of the model.
- Parameters:
store (Dict[str, Any]) – The store for the kwargs and args.
callback (Callable) – The callback to add to the model.
- Return type:
Callable
- Returns:
The patched model.fit function.
- store_model_ids(store)#
Stores the indices of the batch for a prebatched dataset.
- Return type:
Callable
- select_model_layer(model, layer=None)#
Selects the classifier layer from the model.
- Parameters:
model (Layer) – The model.
layer (Union[Layer, str, None]) – The layer to select. If None, the layer with the name 'classifier' is selected.
- Return type:
Layer
- watch(model, layer=None, seed=42)#
Watch a model and log the inputs and outputs of a layer.
- Parameters:
model (Layer) – The model to watch
layer (Optional[Any]) – The layer to watch; if None, the classifier layer is used
seed (int) – The seed to use for the model
- Return type:
None
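A minimal sketch, assuming a compiled Keras classifier whose final layer is named 'classifier' (the model, training data, project, run, and label names are all hypothetical):
```python
import dataquality as dq
from dataquality.integrations.keras import watch, unwatch

dq.init(task_type="text_classification", project_name="my_project", run_name="my_run")
dq.set_labels_for_run(["negative", "positive"])

watch(model)               # hooks the layer named "classifier" by default
model.fit(x_train, y_train, epochs=2)
unwatch(model)             # unpatch the model once training is done
dq.finish()
```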
- unwatch(model)#
Unpatches the model. Run after the run is finished.
- Parameters:
model (Layer) – The model to unpatch
- Return type:
None
dataquality.integrations.lightning module#
- class LightningDQCallback(classifier_layer='classifier', embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None)#
Bases:
Callback
,TorchLogger
,PatchManager
PyTorch Lightning callback for logging model outputs to DataQuality.
- Parameters:
classifier_layer (Union[Module, str, None]) – The layer to extract the logits from (the output is taken as the logits and the input to the layer as the hidden state).
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – The dimension to extract from the last hidden state.
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – The dimension to extract from the logits.
embedding_fn (Optional[Callable]) – A function to apply to the embedding.
logits_fn (Optional[Callable]) – A function to apply to the logits.
last_hidden_state_layer (Union[Module, str, None]) – Optionally, the layer to extract the last hidden state from. This will overwrite the input of the classifier_layer regarding the hidden state.
Example usage:
```python
train_dataset = datasets.ImageFolder("train_images", transform=load_transforms)
train_dataloader = DataLoader(train_dataset, batch_size=4, num_workers=0)

# 🔭🌕 Galileo logging
dq.init(task_type="image_classification", project_name="test_project", run_name="test_run")
dq.set_labels_for_run(["labelA", "labelB"])
dq.log_image_dataset(train_dataset, split="train")

callback = LightningDQCallback(classifier_layer=model.model[2])
trainer = pl.Trainer(max_epochs=1, callbacks=[callback])
trainer.fit(model=model, train_dataloaders=train_dataloader)
```
- hook_manager: ModelHookManager#
- on_fit_start(trainer, pl_module)#
Called when fit begins.
- Return type:
None
- on_fit_end(trainer, pl_module)#
Called when fit ends.
- Return type:
None
- on_train_epoch_start(trainer, pl_module)#
Called when the train epoch begins.
- Return type:
None
- on_validation_epoch_start(trainer, pl_module)#
Called when the val epoch begins.
- Return type:
None
- on_test_epoch_start(trainer, pl_module)#
Called when the test epoch begins.
- Return type:
None
dataquality.integrations.setfit module#
- class Evaluate(model, dq_store)#
Bases:
object
Call function to evaluate SetFit model and log input and output to Galileo.
- unwatch(setfit_obj)#
Unpatch SetFit model by replacing predict_proba function with original function.
- Parameters:
setfit_obj (Union[SetFitModel, SetFitTrainer, None]) – SetFitModel or SetFitTrainer
- Return type:
None
- watch(setfit, labels=None, project_name='', run_name='', finish=True, wait=False, batch_size=None, meta=None, validate_before_training=False)#
Watch a SetFit model or trainer and extract model outputs for dataquality. Returns a function that can be used to evaluate the model on a dataset.
- Parameters:
setfit (Union[SetFitModel, SetFitTrainer]) – SetFit model or trainer
labels (Optional[List[str]]) – list of labels
project_name (str) – name of project
run_name (str) – name of run
finish (bool) – whether to run dq.finish after evaluation
wait (bool) – whether to wait for dq.finish
batch_size (Optional[int]) – batch size for evaluation
meta (Optional[List]) – meta data for evaluation
validate_before_training (bool) – whether to do a test run before training
- Return type:
Evaluate
- Returns:
dq_evaluate function
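A minimal sketch, assuming a SetFitTrainer named trainer and a held-out Dataset named test_dataset (both hypothetical, as is the split keyword shown on the returned callable):
```python
from dataquality.integrations.setfit import watch

dq_evaluate = watch(
    trainer,
    labels=["negative", "positive"],
    project_name="my_setfit_project",  # hypothetical
    run_name="my_run",                 # hypothetical
)
trainer.train()
# Log model inputs and outputs on the held-out split
dq_evaluate(test_dataset, split="test")
```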
- evaluate(model)#
Watch SetFit model by replacing predict_proba function with SetFitModelHook.
- Parameters:
model (SetFitModel) – SetFit model
- Return type:
Evaluate
- Returns:
SetFitModelHook object
- auto(setfit_model='sentence-transformers/paraphrase-mpnet-base-v2', hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, labels=None, project_name='auto_tc_setfit', run_name=None, training_args=None, column_mapping=None, wait=True, create_data_embs=None)#
Automatically processes and generates insights on a text classification dataset.
Given a pandas dataframe, a file path, or a Huggingface dataset path, this function will load the data, train a Huggingface transformer model, and provide insights via a link to the Console.
At least one of hf_data, train_data should be provided. If neither is provided, a demo dataset will be used for training.
- Parameters:
setfit_model (SetFitModel or Huggingface model name) – Computes text embeddings for a given text dataset with the model. If a string is provided, it will be used to load a Huggingface model and train it on the data.
hf_data (Union[DatasetDict, str], optional) – Use this parameter if you have Huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (list of str, optional) – A list of key names in hf_data to be run as inference runs after training. If set, those keys must exist in hf_data.
train_data (pandas.DataFrame, Dataset, str, optional) – Training data to use. Can be a pandas dataframe, a Huggingface dataset, path to a local file, or Huggingface dataset hub path.
val_data (pandas.DataFrame, Dataset, str, optional) – Validation data to use for evaluation and early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val_data nor test_data are available, the train data will be split randomly in 80/20 ratio.
test_data (pandas.DataFrame, Dataset, str, optional) – Test data to use. If provided with val_data, it will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set.
inference_data (dict, optional) – Optional inference datasets to run after training. The structure is a dictionary with the key being the inference name and the value being a pandas dataframe, a Huggingface dataset, path to a local file, or Huggingface dataset hub path.
labels (list of str, optional) – List of labels for this dataset. If not provided, they will attempt to be extracted from the data.
project_name (str, optional) – Project name. If not set, a random name will be generated. Default is “auto_tc_setfit”.
run_name (str, optional) – Run name for this data. If not set, a random name will be generated.
training_args (dict, optional) – A dictionary of arguments for the SetFitTrainer. It allows you to customize training configuration such as learning rate, batch size, number of epochs, etc.
column_mapping (dict, optional) – A dictionary of column names to use for the provided data. Needs to map to the following keys: “text”, “id”, “label”.
wait (bool, optional) – Whether to wait for the processing of your run to complete. Default is True.
create_data_embs (bool, optional) – Whether to create data embeddings for this run. Default is None.
- Return type:
SetFitModel
- Returns:
SetFitModel – A SetFitModel instance trained on the provided dataset.
An example using auto with sklearn data as pandas dataframes:
```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from dataquality.auto.text_classification import auto

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

auto(
    setfit_model=model,
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data",
)
```
An example of using auto with a local CSV file with text and label columns:
```python
from dataquality.auto.text_classification import auto

auto(
    setfit_model="sentence-transformers/paraphrase-mpnet-base-v2",
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data",
)
```
- do_model_eval(model, encoded_data, wait, create_data_embs=None)#
- Return type:
SetFitModel
dataquality.integrations.torch module#
- class TorchLogger(model, last_hidden_state_layer=None, embedding_dim=None, logits_dim=None, classifier_layer=None, embedding_fn=None, logits_fn=None, helper_data=None, task=TaskType.text_classification)#
Bases:
TorchBaseInstance
TorchLogger sends the logs to [Galileo](https://www.rungalileo.io/) for each training step.
- embedding_dim: Union[int, slice, Tensor, List, Tuple, None]#
- logits_dim: Union[int, slice, Tensor, List, Tuple, None]#
- model: Module#
- watch(model, dataloaders=[], classifier_layer=None, embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None, unpatch_on_start=False, dataloader_random_sampling=False)#
Wraps a PyTorch model and optionally dataloaders to log the embeddings and logits to [Galileo](https://www.rungalileo.io/).
```python
dq.log_dataset(train_dataset, split="train")
train_dataloader = torch.utils.data.DataLoader(train_dataset)
model = TextClassificationModel(num_labels=len(train_dataset.list_of_labels))
watch(model, [train_dataloader, test_dataloader])
for epoch in range(NUM_EPOCHS):
    dq.set_epoch_and_split(epoch, "training")
    train()
    dq.set_split("validation")
    validate()
dq.finish()
```
- Parameters:
model (Module) – Pytorch Model to be wrapped
dataloaders (Optional[List[DataLoader]]) – List of dataloaders to be wrapped
classifier_layer (Union[Module, str, None]) – Layer to hook into (usually 'classifier' or 'fc'). Inputs are the embeddings and outputs are the logits.
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the embeddings, for example "[:, 0]" to remove the cls token
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the logits from layer input and logits from layer output. For example in NER "[:,1:,:]". If the layer is not found, the last_hidden_state_layer will be used
embedding_fn (Optional[Callable]) – Function to process embeddings from the model
logits_fn (Optional[Callable]) – Function to process logits from the model, e.g. lambda x: x[0]
last_hidden_state_layer (Union[Module, str, None]) – Layer to extract the embeddings from
unpatch_on_start (bool) – Force unpatching of dataloaders instead of global patching
dataloader_random_sampling (bool) – Whether a RandomSampler or WeightedRandomSampler is being used. If random sampling is being used, you must set this to True, otherwise logging will fail at the end of training.
- Return type:
None
- unwatch(model=None, force=True)#
Unwatches the model. Run after the run is finished.
- Parameters:
force (bool) – Force unwatch even if the model is not watched
- Return type:
None
dataquality.integrations.torch_semantic_segmentation module#
- class SemanticTorchLogger(imgs_remote_location, local_path_to_dataset_root, dataloaders, mask_col_name=None, *args, **kwargs)#
Bases:
TorchLogger
Class to log semantic segmentation models to Galileo
- Parameters:
imgs_remote_location (str) – name of the bucket that currently stores images in the cloud
local_path_to_dataset_root (str) – path to the parent dataset folder
mask_col_name (Optional[str]) – name of the column that contains the mask
dataloaders (Dict[str, DataLoader]) – dataloaders to be logged
- convert_dataset(dataset, split)#
Convert the dataset to the format expected by the dataquality client
- Parameters:
dataset (Any) – dataset to convert
start_int (int) – starting unique id for each example in the dataset as we need a unique identifier for each example. Defaults to 0.
- Return type:
List
- find_mask_category(batch)#
Finds the mask category and stores it in the helper data.
- Parameters:
batch (Dict[str, Any]) – batch from the dataloader
- Return type:
None
- get_image_ids_and_image_paths(split, logging_data)#
- Return type:
Tuple[List[int], List[str]]
- queue_gold_and_pred(probs, gold)#
Enqueue the ground truth and predicted masks for the batch
- Parameters:
probs (torch.Tensor) – probability vectors to queue for LM
gold (torch.Tensor) – gold masks resized to queue for LM
- Return type:
None
- truncate_queue()#
Truncate the queue to the batch size
- Parameters:
bs (int) – batch size
- Return type:
None
- resize_probs_and_gold(probs, gold)#
Resize the probs and gold to the correct size
- Parameters:
probs (torch.Tensor) – probability vectors to resize
gold (torch.Tensor) – gold masks to resize
- Return type:
Tuple[Tensor, Tensor]
- calculate_mislabeled_pixels(probs, gold_mask)#
Helper function to calculate the mislabeled pixels in the batch
- Parameters:
probs (torch.Tensor) – probability tensor of shape (bs, h, w, num_classes)
gold_mask (torch.Tensor) – gold truth mask of shape (bs, h, w)
- Return type:
Tensor
- Returns:
Mislabeled pixels tensor of shape (batch_size, height, width)
- expand_binary_classification(probs)#
Expands the binary classification to a 2 channel tensor
- Parameters:
probs (torch.Tensor) – binary classification tensor
- Returns:
bs, 2, h, w tensor
- Return type:
torch.Tensor
- get_argmax_probs()#
Helper function to get the argmax and probs from the model outputs
- Returns:
argmax and logits tensors
- Return type:
Tuple[torch.Tensor, torch.Tensor]
- upload_contours_split(split)#
Uploads all contours for a given split to minio
Structure of the contours.json file:
{
    image_id: {
        polygon_uuid: contours,
        polygon_uuid2: contours
    },
    image_id2: {
        polygon_uuid3: contours,
        polygon_uuid4: contours
    }
}
- Parameters:
split (str) – split name
- Return type:
None
- upload_dep_split(split)#
Uploads all dep files for a given split to minio
- Parameters:
split (str) – split name
- Return type:
None
- finish()#
- Return type:
None
- run_one_epoch(dataloader, device)#
- Return type:
None
- store_batch(store)#
Stores the batch in the passed store.
- Parameters:
store (Dict[str, Dict[str, Union[ndarray, Tensor]]]) – location to store the batch
- Return type:
Callable
- patch_iterator_and_batch(store)#
Patches the iterator of the dataloader to return the indices and the batch.
- Parameters:
store (Dict[str, Any]) – location to store the indices and the batch
- Return type:
Callable
- watch(model, imgs_remote_location, local_path_to_dataset_root, dataloaders, mask_col_name=None, unpatch_on_start=False)#
Wraps a PyTorch model and optionally dataloaders to log the embeddings and logits to [Galileo](https://www.rungalileo.io/).
```python
train_dataloader = torch.utils.data.DataLoader(train_dataset)
model = SemSegModel()
watch(model, imgs_remote_location, local_path_to_dataset_root,
      {"train": train_dataloader, "validation": test_dataloader})
for epoch in range(NUM_EPOCHS):
    dq.set_epoch_and_split(epoch, "training")
    train()
    dq.set_split("validation")
    validate()
dq.finish()
```
- Parameters:
model (Module) – Pytorch Model to be wrapped
imgs_remote_location (str) – Name of the bucket from which the images come
local_path_to_dataset_root (str) – Path to the dataset root, which is stripped from the image path
dataloaders (Dict[str, DataLoader]) – Dataloaders to be wrapped
mask_col_name (Optional[str]) – Name of the column in the dataloader that contains the mask
unpatch_on_start (bool) – Whether to unpatch the model before patching it
- Return type:
None
dataquality.integrations.transformers_trainer module#
- class DQTrainerCallback(trainer, torch_helper, last_hidden_state_layer=None, embedding_dim=None, logits_dim=None, classifier_layer='classifier', embedding_fn=None, logits_fn=None)#
Bases:
TrainerCallback
,TorchBaseInstance
,Patch
DQTrainerCallback that provides data quality insights with Galileo. This callback logs during each training step and works with the Huggingface transformers Trainer library.
Callback for logging model outputs during training.
- Parameters:
trainer (Trainer) – Trainer object from Huggingface transformers
last_hidden_state_layer (Union[Module, str, None]) – Name of the last hidden state layer
embedding_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the embedding
logits_dim (Union[str, int, slice, Tensor, List, Tuple, None]) – Dimension of the logits
classifier_layer (Union[Module, str, None]) – Name of the classifier layer
embedding_fn (Optional[Callable]) – Function to extract the embedding from the last hidden state
logits_fn (Optional[Callable]) – Function to extract the logits
torch_helper (TorchHelper) – Store for the callback
- hook_manager: ModelHookManager#
- validate(args, state, control, **kwargs)#
Validate the model and dataset.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Any) – Keyword arguments (train_dataloader, eval_dataloader)
- Return type:
None
- setup_model(model)#
Setup the model for logging (attach hooks).
- Parameters:
model (Module) – Model
- Return type:
None
- on_train_begin(args, state, control, **kwargs)#
Event called at the beginning of training. Attaches hooks to model.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Any) – Keyword arguments (model, eval_dataloader, tokenizer…)
- Return type:
None
- on_evaluate(args, state, control, **kwargs)#
Event called after an evaluation phase.
- Return type:
None
- on_epoch_begin(args, state, control, **kwargs)#
Event called at the beginning of an epoch.
- Return type:
None
- on_epoch_end(args, state, control, **kwargs)#
Event called at the end of an epoch.
- Return type:
None
- on_train_end(args, state, control, **kwargs)#
Event called at the end of training.
- Return type:
None
- on_prediction_step(args, state, control, **kwargs)#
Event called after a prediction step.
- Return type:
None
- on_step_end(args, state, control, **kwargs)#
Perform a training step on a batch of inputs. Log the embeddings, ids and logits.
- Parameters:
args (TrainingArguments) – Training arguments
state (TrainerState) – Trainer state
control (TrainerControl) – Trainer control
kwargs (Dict) – Keyword arguments (including the model, inputs, outputs)
- Return type:
None
- watch(trainer, classifier_layer=None, embedding_dim=None, logits_dim=None, embedding_fn=None, logits_fn=None, last_hidden_state_layer=None)#
Hook into the trainer to log to Galileo.
- Parameters:
trainer (Trainer) – Trainer object from the transformers library
classifier_layer (Union[Module, str, None]) – Name or Layer of the classifier layer to extract the logits and the embeddings from
embedding_dim (Union[int, slice, Tensor, List, Tuple, None]) – Dimension slice for the embedding
logits_dim (Union[int, slice, Tensor, List, Tuple, None]) – Dimension slice for the logits
logits_fn (Optional[Callable]) – Function to extract the logits
embedding_fn (Optional[Callable]) – Function to extract the embedding
last_hidden_state_layer (Union[Module, str, None]) – Name of the last hidden state layer if classifier_layer is not provided
- Return type:
None
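A minimal sketch, assuming a standard text-classification Trainer named trainer (project, run, and label names are hypothetical):
```python
import dataquality as dq
from dataquality.integrations.transformers_trainer import watch, unwatch

dq.init(task_type="text_classification", project_name="my_project", run_name="my_run")
dq.set_labels_for_run(["negative", "positive"])

watch(trainer, classifier_layer="classifier")
trainer.train()
unwatch(trainer)  # remove the Galileo callback from the trainer
dq.finish()
```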
- unwatch(trainer)#
unwatch is used to remove the callback from the trainer.
- Parameters:
trainer (Trainer) – Trainer object
- Return type:
None
dataquality.integrations.ultralytics module#
- find_midpoint(box, shape, resized_shape)#
Finds the midpoint of a box in xyxy format
- Parameters:
box (Union[Tuple, List]) – box in xyxy format
shape (Union[Tuple, List]) – shape of the image
resized_shape (Union[Tuple, List]) – shape of the resized image
- Return type:
Tuple[int, int, int, int]
- Returns:
midpoint of the box
- create_embedding(features, box, size=(640, 640))#
Creates an embedding from a feature map
- Parameters:
features (List) – feature map
box (List) – box in xyxy format
size (Tuple[int, int]) – size of the image
- Return type:
Tensor
- Returns:
embedding
- embedding_fn(features, boxes, size)#
Creates embeddings for all boxes
- Parameters:
features (List) – feature map
boxes (Any) – boxes in xyxy format
size (Any) – size of the image
- Return type:
Tensor
- Returns:
embeddings
- class StoreHook(on_finish_func=None)#
Bases:
object
Generic Hook class to store model input and output
Initializes the hook
- Parameters:
on_finish_func (Optional[Callable]) – function to be called when the hook is finished
- h: Any = None#
- hook(model, model_input, model_output)#
Hook function to store model input and output
- Parameters:
model (Any) – model
model_input (Any) – model input
model_output (Any) – model output
- Return type:
None
- store_hook(h)#
Stores hook for later removal
- Parameters:
h (Any) – hook
- Return type:
None
- class BatchLogger(old_function)#
Bases:
object
Batch Logger class to store batches for later logging
Store the batch by overwriting the given method
- Parameters:
old_function (Callable) – method that is wrapped
- class Callback(nms_fn=None, bucket='', relative_img_path='', labels=[], iou_thresh=0.7, conf_thresh=0.25)#
Bases:
object
Callback class that is used to log batches, embeddings and predictions
Initializes the callback
- Parameters:
nms_fn (Optional[Callable]) – non-maximum suppression function
- model: YOLO#
- file_map: Dict#
- postprocess(batch)#
Postprocesses the batch for a training step. Taken from ultralytics. Might be removed in the future.
- Parameters:
batch (Tensor) – batch to be postprocessed
- Return type:
Any
- register_hooks(model)#
Register hooks to the model to log predictions and embeddings
- Parameters:
model (Any) – the model to hook
- Return type:
None
- init_run()#
Initialize the run
- Return type:
None
- convert_dataset(dataset)#
Convert the dataset to the format expected by the dataquality client
- Return type:
List
- on_train_start(trainer)#
Register hooks and preprocess batch function on train start
- Parameters:
trainer (BaseTrainer) – the trainer
- Return type:
None
- on_train_end(trainer)#
Restore preprocess batch function on train end
- Parameters:
trainer (BaseTrainer) – the trainer
- Return type:
None
- on_val_batch_start(validator)#
Register hooks and preprocess batch function on validation start
- Parameters:
validator (BaseValidator) – the validator
- Return type:
None
- on_predict_start(predictor)#
Register hooks on prediction start. Note: prediction is not perfect as the model is not in eval mode. May be removed.
- Parameters:
predictor (BasePredictor) – the predictor
- Return type:
None
- on_predict_batch_end(predictor)#
Log predictions and embeddings on prediction batch end. Not functional yet.
- Return type:
None
- add_callback(model, cb)#
Add the callback to the model
- Parameters:
model (YOLO) – the model
cb (Callback) – the callback class
- Return type:
None
- watch(model, bucket, relative_img_path, labels, iou_thresh=0.7, conf_thresh=0.25)#
Watch the model for predictions and embeddings logging.
- Parameters:
model (YOLO) – the model to watch
- Return type:
None
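A minimal sketch, assuming a YOLO model and a dataset whose images are synced to a cloud bucket (the bucket, path, and label names are hypothetical):
```python
from ultralytics import YOLO
from dataquality.integrations.ultralytics import watch

model = YOLO("yolov8n.pt")
watch(
    model,
    bucket="s3://my-bucket",                      # hypothetical bucket
    relative_img_path="datasets/coco128/images",  # hypothetical path inside the bucket
    labels=["person", "car"],                     # hypothetical label names
)
model.train(data="coco128.yaml", epochs=1)
```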