dataquality.loggers.model_logger package#
Subpackages#
- dataquality.loggers.model_logger.seq2seq package
Submodules#
dataquality.loggers.model_logger.base_model_logger module#
- class BaseGalileoModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoLogger
- log_file_ext = 'hdf5'#
- log()#
The top-level log function; it wraps the child logger's implementation in a try/except (see the sketch below)
- Return type:
None
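The docstring above describes a wrap-and-delegate pattern. A minimal sketch of that pattern, assuming a hypothetical child hook named _log(); the real hook name and error handling are not shown in this reference:

    class SketchLogger:
        """Illustrative only: a top-level log() that guards its child's implementation."""

        def log(self) -> None:
            try:
                self._log()  # subclass-specific logging logic (hypothetical hook name)
            except Exception as e:
                # Surface the failure without crashing the training loop; the real
                # logger's error handling may differ.
                print(f"Logging raised an error: {e}")

        def _log(self) -> None:
            raise NotImplementedError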
- write_model_output(data)#
Creates an hdf5 file from the data dict
- Return type:
None
- set_split_epoch()#
Sets the split for the current logger
If the split is not set, it will use the split set in the logger config
- Return type:
None
- upload()#
The upload function is implemented in the sister DataConfig class
- Return type:
None
- static get_model_logger_attr(cls)#
Returns the attribute that corresponds to the GalileoModelLogger class. This assumes only 1 GalileoModelLogger object exists in the class
- Parameters:
cls (object) – The class
- Return type:
str
- Returns:
The attribute name
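As a rough illustration of the lookup get_model_logger_attr describes, the sketch below scans an object's attributes for the one holding a model logger instance. The helper name and the loop are assumptions for the example, not the library's implementation:

    from dataquality.loggers.model_logger.base_model_logger import BaseGalileoModelLogger


    def find_model_logger_attr(obj: object) -> str:
        """Return the name of the (single) attribute holding a BaseGalileoModelLogger."""
        for attr in dir(obj):
            if isinstance(getattr(obj, attr, None), BaseGalileoModelLogger):
                return attr
        raise AttributeError("Object has no model logger attribute")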
- convert_logits_to_prob_binary(sample_logits)#
Converts logits to probs in the binary case
Takes the sigmoid of the single-class logits and adds the negative class prediction (1 - class pred)
- Return type:
ndarray
- convert_logits_to_probs(sample_logits)#
Converts logits to probs via softmax
- Return type:
ndarray
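The two converters above map raw logits to probabilities. A minimal NumPy sketch of both conversions, written as standalone functions for illustration (the library's methods operate on the logger instance):

    import numpy as np


    def binary_logits_to_probs(sample_logits: np.ndarray) -> np.ndarray:
        """Sigmoid of the positive-class logits plus the complement (1 - p)."""
        p = 1.0 / (1.0 + np.exp(-sample_logits))  # shape (n_samples,)
        return np.stack([1.0 - p, p], axis=1)     # shape (n_samples, 2)


    def logits_to_probs(sample_logits: np.ndarray) -> np.ndarray:
        """Row-wise softmax over the class dimension."""
        shifted = sample_logits - sample_logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)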
dataquality.loggers.model_logger.image_classification module#
- class ImageClassificationModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
TextClassificationModelLogger
- logger_config: BaseLoggerConfig = ImageClassificationLoggerConfig(labels=None, tasks=None, observed_num_labels=0, observed_labels=set(), tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, observed_ids={}, all_ids={})#
- write_model_output(model_output)#
Only write model output if there is data to write.
In image classification, it is possible that after filtering duplicate IDs there are no rows left to write. In that case, writing would raise an error, so we skip the write (see the sketch below)
- Return type:
None
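A tiny sketch of the guard described above, with an assumed dict-of-arrays payload; the real model_output type is not shown in this reference:

    from typing import Any, Dict


    def write_if_nonempty(model_output: Dict[str, Any]) -> None:
        """Skip the write entirely when de-duplication left nothing to log."""
        if len(model_output.get("ids", [])) == 0:
            return  # nothing to write; avoid erroring on an empty payload
        # ... otherwise proceed with the normal HDF5 write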
dataquality.loggers.model_logger.object_detection module#
- class ObjectDetectionModelLogger(ids=None, pred_boxes=None, gold_boxes=None, labels=None, pred_embs=None, gold_embs=None, image_size=None, embs=None, probs=None, logits=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoModelLogger
Takes in OD inputs as a list of batches
- Parameters:
pred_boxes (Optional[List[ndarray]]) – List of pred boxes per image. len(pred_boxes) == bs, pred_boxes[idx].shape == (n, 4), where n is the number of predicted boxes per sample
gold_boxes (Optional[List[ndarray]]) – List of gold boxes per image. len(gold_boxes) == bs, gold_boxes[idx].shape == (n, 4), where n is the number of gold boxes per sample
labels (Optional[List[ndarray]]) – List of box labels per image. labels.shape == (bs, n, 4), where n is the number of gold boxes per sample
- self.all_boxes: (bs, n, 2, 4). n = boxes; the first four values are pred, the last four are gold; [-1] * 4 for empty boxes
- self.deps: (bs, n). n = boxes; all boxes have a dep
- self.image_dep: (bs, 1). Image dep aggregated
- self.is_gold: (bs, n). n = boxes; True if gold, False if pred
- self.is_pred: (bs, n). n = boxes; True if pred, False if gold
- self.embs: (bs, n, dim). n = boxes; embedding for each box
- logger_config: BaseLoggerConfig = ObjectDetectionLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, image_cloud_path='', box_format=<BoxFormat.xyxy: 'xyxy'>)#
- validate_and_format()#
Validates params passed in during logging. Implemented by child
- Return type:
None
- construct_image_ids()#
Creates a list of image ids whose length equals the number of boxes
The ids passed in for the batch represent the ids of the images they map to. Since we store the box data as one row per box, we need to duplicate the image id for each box belonging to the same image.
When constructing the data for the batch, we store all preds first, then all golds, so we do the same here to map the IDs properly (see the sketch below)
- Return type:
List[int]
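A small sketch of the id duplication described above, assuming per-image box counts; the variable names are illustrative:

    image_ids = [7, 8]     # one id per image in the batch
    n_pred_boxes = [3, 1]  # predicted boxes per image
    n_gold_boxes = [2, 2]  # gold boxes per image

    # All preds first, then all golds, repeating each image id once per box
    pred_ids = [img for img, n in zip(image_ids, n_pred_boxes) for _ in range(n)]
    gold_ids = [img for img, n in zip(image_ids, n_gold_boxes) for _ in range(n)]
    box_image_ids = pred_ids + gold_ids  # [7, 7, 7, 8, 7, 7, 8, 8]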
dataquality.loggers.model_logger.semantic_segmentation module#
- class SemanticSegmentationModelLogger(imgs_remote_location='', image_paths=[], image_ids=[], gold_masks=tensor([]), pred_masks=tensor([]), gold_boundary_masks=tensor([]), pred_boundary_masks=tensor([]), output_probs=tensor([]), mislabeled_pixels=tensor([]), embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoModelLogger
Takes in SemSeg inputs as a list of batches
- Parameters:
image_ids (List[int]) – List of image ids
gold_masks (Tensor) – List of ground truth masks, np.ndarray of shape (batch_size, height, width)
pred_masks (Tensor) – List of prediction masks, np.ndarray of shape (batch_size, height, width)
gold_boundary_masks (Tensor) – List of gold boundary masks, np.ndarray of shape (batch_size, height, width)
pred_boundary_masks (Tensor) – List of predicted boundary masks, np.ndarray of shape (batch_size, height, width)
output_probs (Tensor) – Model probability predictions, np.ndarray of shape (batch_size, height, width, num_classes)
mislabeled_pixels (Tensor) – Model confidence predictions in the GT label, torch.Tensor of shape (batch_size, height, width)
- logger_config: SemanticSegmentationLoggerConfig = SemanticSegmentationLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False)#
- validate_and_format()#
Validates params passed in during logging. Implemented by child
- Return type:
None
- property local_dep_path: str#
- property local_proj_run_path: str#
- property local_contours_path: str#
- get_polygon_data(pred_polygons_batch, gold_polygons_batch)#
Returns polygon data for a batch of images as a dictionary that can then be used for our polygon df (see the sketch below)
- Parameters:
pred_polygons_batch (Tuple[List, List]) – polygon data for predictions in a minibatch of images
gold_polygons_batch (Tuple[List, List]) – polygon data for ground truth in a minibatch of images
- Returns:
a dict that can be used to create a polygon df
- Return type:
Dict[str, Any]
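A hypothetical sketch of turning per-image polygon batches into columnar data for a dataframe. The (image_ids, polygons_per_image) tuple layout and the column names are assumptions for illustration only, not the library's actual schema:

    from typing import Any, Dict, List, Tuple


    def polygon_columns(
        pred_polygons_batch: Tuple[List, List],
        gold_polygons_batch: Tuple[List, List],
    ) -> Dict[str, Any]:
        """Flatten per-image polygon batches into dataframe-ready columns."""
        data: Dict[str, List] = {"image_id": [], "is_pred": [], "polygon": []}
        for is_pred, (image_ids, polygons_per_image) in (
            (True, pred_polygons_batch),
            (False, gold_polygons_batch),
        ):
            for image_id, polygons in zip(image_ids, polygons_per_image):
                for polygon in polygons:
                    data["image_id"].append(image_id)
                    data["is_pred"].append(is_pred)
                    data["polygon"].append(polygon)
        return data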
dataquality.loggers.model_logger.tabular_classification module#
- class TabularClassificationModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoModelLogger
- logger_config: BaseLoggerConfig = TabularClassificationLoggerConfig(labels=None, tasks=None, observed_num_labels=0, observed_labels=set(), tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, feature_importances={})#
dataquality.loggers.model_logger.text_classification module#
- class GalileoModelLoggerAttributes(value)#
Bases:
str, Enum
An enumeration.
- embs = 'embs'#
- probs = 'probs'#
- logits = 'logits'#
- ids = 'ids'#
- split = 'split'#
- epoch = 'epoch'#
- inference_name = 'inference_name'#
- static get_valid()#
- Return type:
List[str]
- class TextClassificationModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoModelLogger
Class for logging model output data of Text Classification models to Galileo.
embs: Union[List, np.ndarray, torch.Tensor, tf.Tensor]. The embeddings per
text sample input. Only one embedding vector is allowed per input sample. The embs parameter can be formatted as any of:
np.ndarray
torch.tensor / tf.tensor
A list of List[float]
A list of numpy arrays
A list of tensorflow tensors
A list of pytorch tensors
logits: Union[List, np.ndarray, torch.Tensor, tf.Tensor]. Outputs from the
forward pass. If logits are provided, probs will be computed automatically and DO NOT need to be provided. Can be formatted as any of:
np.ndarray
torch.tensor / tf.tensor
A list of List[float]
A list of numpy arrays
A list of tensorflow tensors
A list of pytorch tensors
probs: Deprecated - the probabilities for each output sample (use logits instead)
ids: Indexes of each input field: List[int]. These IDs must align with the input
IDs for each sample input. This will be used to join them together for analysis by Galileo.
split: The model training/test/validation split for the samples being logged
ex:
    dq.set_epoch(0)
    dq.set_split("train")

    embs: np.ndarray = np.random.rand(4, 768)  # 4 samples, embedding dim 768
    logits: np.ndarray = np.random.rand(4, 3)  # 4 samples, 3 classes
    ids: List[int] = [0, 1, 2, 3]

    dq.log_model_outputs(embs=embs, logits=logits, ids=ids)
- logger_config: BaseLoggerConfig = TextClassificationLoggerConfig(labels=None, tasks=None, observed_num_labels=0, observed_labels=set(), tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False)#
- static get_valid_attributes()#
Returns a list of valid attributes that this logger accepts
- Return type:
List[str]
- validate_and_format()#
Validates that the current config is correct: embs, probs, and ids must exist and be the same length (see the sketch below).
- Return type:
None
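A minimal sketch of the length check described above; the function name and error wording are illustrative:

    import numpy as np


    def check_aligned_lengths(embs: np.ndarray, probs: np.ndarray, ids: list) -> None:
        """embs, probs, and ids must have one row per logged sample."""
        if not (len(embs) == len(probs) == len(ids)):
            raise ValueError(
                f"embs, probs, and ids must be the same length, "
                f"got {len(embs)}, {len(probs)}, {len(ids)}"
            )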
- write_model_output(model_output)#
Creates an hdf5 file from the data dict
- Return type:
None
dataquality.loggers.model_logger.text_multi_label module#
- class TextMultiLabelModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
TextClassificationModelLogger
Class for logging model outputs of Multi Label Text classification models to Galileo
embs: (Embeddings) List[List[Union[int,float]]]. Embeddings per text sample input.
Only one embedding vector is allowed per input (len(embs) == len(text) and embs.shape == 2)
logits: Output from the forward pass during model training/evaluation. List[List[List[float]]] or List[np.ndarray]. For each text input, a list of lists of floats is expected (one list/array per task). The number of inner lists must be the number of tasks (matching the labels logged). The order of the inner lists is assumed to match the order of the inner list of labels when logging input data (matching the tasks provided by the call to dataquality.set_tasks_for_run()).
probs: (Probabilities) Deprecated, use logits instead.
ids: Indexes of each input field: List[int]. These IDs must align with the input IDs for each sample input. This will be used to join them together for analysis by Galileo.
ex:
    dq.set_epoch(0)
    dq.set_split("train")

    # 3 samples, embedding dim 768. Only 1 embedding vector can be logged for all
    # tasks. Each task CANNOT have its own embedding vector
    embs: np.ndarray = np.random.rand(3, 768)

    # Logits per task. In this example, tasks "task_0" and "task_2" have 3 classes
    # but task "task_1" has 2
    logits: List[np.ndarray] = [
        np.random.rand(3, 3),  # 3 samples, 3 classes
        np.random.rand(3, 2),  # 3 samples, 2 classes
        np.random.rand(3, 3),  # 3 samples, 3 classes
    ]
    ids: List[int] = [0, 1, 2]

    dq.log_model_outputs(embs=embs, logits=logits, ids=ids)
- logger_config: TextMultiLabelLoggerConfig = TextMultiLabelLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=defaultdict(<class 'set'>, {}), tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, observed_num_tasks=0, binary=True)#
- validate_and_format()#
Validates that the current config is correct: embs, probs, and ids must exist and be the same length.
- Return type:
None
- convert_logits_to_prob_binary(sample_logits)#
Converts logits to probs in the binary case
Takes the sigmoid of the single-class logits and adds the negative class prediction (1 - class pred)
- Return type:
ndarray
- convert_logits_to_probs(sample_logits)#
Converts logits to probs via softmax per sample
In the binary multi-label case we don't run softmax; we use sigmoid instead (see the sketch below)
- Return type:
ndarray
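An illustrative NumPy sketch of the per-task conversion described above (sigmoid in the binary multi-label case, softmax otherwise); the real method operates on the logger's own fields:

    from typing import List

    import numpy as np


    def multi_label_probs(task_logits: List[np.ndarray], binary: bool) -> List[np.ndarray]:
        """Convert one logits array per task into probabilities."""
        probs = []
        for logits in task_logits:
            if binary:
                p = 1.0 / (1.0 + np.exp(-logits))              # sigmoid per task
                probs.append(np.stack([1.0 - p, p], axis=-1))
            else:
                shifted = logits - logits.max(axis=-1, keepdims=True)
                exp = np.exp(shifted)
                probs.append(exp / exp.sum(axis=-1, keepdims=True))  # softmax per task
        return probs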
dataquality.loggers.model_logger.text_ner module#
- class GalileoModelLoggerAttributes(value)#
Bases:
str, Enum
An enumeration.
- gold_emb = 'gold_emb'#
- gold_spans = 'gold_spans'#
- gold_conf_prob = 'gold_conf_prob'#
- gold_loss_prob = 'gold_loss_prob'#
- gold_loss_prob_label = 'gold_loss_prob_label'#
- embs = 'embs'#
- pred_emb = 'pred_emb'#
- pred_spans = 'pred_spans'#
- pred_conf_prob = 'pred_conf_prob'#
- pred_loss_prob = 'pred_loss_prob'#
- pred_loss_prob_label = 'pred_loss_prob_label'#
- probs = 'probs'#
- logits = 'logits'#
- ids = 'ids'#
- split = 'split'#
- epoch = 'epoch'#
- log_helper_data = 'log_helper_data'#
- inference_name = 'inference_name'#
- static get_valid()#
- Return type:
List[str]
- class TextNERModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoModelLogger
Class for logging model output data of Text NER models to Galileo.
embs: List[np.ndarray]: Each np.ndarray represents all embeddings of a given
sample. These embeddings are from the tokenized text, and will align with the tokens in the sample. If you have 12 samples in the dataset, with each sample of 20 tokens in length, and an embedding vector of size 768, len(embs) will be 12, and np.ndarray.shape is (20, 768).
logits: List[np.ndarray]: The NER prediction logits from the model
for each token. These outputs are from the tokenized text, and will align with the tokens in the sample. If you have 12 samples in the dataset, with each sample of 20 tokens in length, and observed_num_labels as 40, len(logits) will be 12, and np.ndarray.shape is (20, 40).
probs: Probabilities: List[np.ndarray]: deprecated, use logits
ids: List[int]: These IDs must align with the input
IDs for each sample input. This will be used to join them together for analysis by Galileo.
split: The model training/test/validation split for the samples being logged
ex: (see the data input example in the DataLogger for NER: dataquality.get_data_logger().doc())

    # Logged with dataquality.log_model_outputs
    logits = [
        np.array([model(the), model(president), model(is), model(joe),
                  model(bi), model(##den), model(<pad>), model(<pad>), model(<pad>)]),
        np.array([model(joe), model(bi), model(##den), model(addressed),
                  model(the), model(united), model(states), model(on), model(monday)])
    ]

    embs = [
        np.array([emb(the), emb(president), emb(is), emb(joe),
                  emb(bi), emb(##den), emb(<pad>), emb(<pad>), emb(<pad>)]),
        np.array([emb(joe), emb(bi), emb(##den), emb(addressed),
                  emb(the), emb(united), emb(states), emb(on), emb(monday)])
    ]

    epoch = 0
    ids = [0, 1]  # Must match the data input IDs
    split = "training"
    dataquality.log_model_outputs(
        embs=embs, logits=logits, ids=ids, split=split, epoch=epoch
    )
- logger_config: BaseLoggerConfig = TextNERLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, gold_spans={}, sample_length={})#
- static get_valid_attributes()#
Returns a list of valid attributes that GalileoModelConfig accepts
- Return type:
List[str]
- validate_and_format()#
Validates that the current config is correct: embs, probs, and ids must exist and be the same length.
- Return type:
None
Module contents#
- class BaseGalileoModelLogger(embs=None, probs=None, logits=None, ids=None, split='', epoch=None, inference_name=None)#
Bases:
BaseGalileoLogger
- log_file_ext = 'hdf5'#
- embs: Union[List, np.ndarray]#
- logits: Union[List, np.ndarray]#
- probs: Union[List, np.ndarray]#
- ids: Union[List, np.ndarray]#
- split: str#
- inference_name: Optional[str]#
- log()#
The top-level log function; it wraps the child logger's implementation in a try/except
- Return type:
None
- write_model_output(data)#
Creates an hdf5 file from the data dict
- Return type:
None
- set_split_epoch()#
Sets the split for the current logger
If the split is not set, it will use the split set in the logger config
- Return type:
None
- upload()#
The upload function is implemented in the sister DataConfig class
- Return type:
None
- static get_model_logger_attr(cls)#
Returns the attribute that corresponds to the GalileoModelLogger class. This assumes only 1 GalileoModelLogger object exists in the class
- Parameters:
cls (object) – The class
- Return type:
str
- Returns:
The attribute name
- convert_logits_to_prob_binary(sample_logits)#
Converts logits to probs in the binary case
Takes the sigmoid of the single-class logits and adds the negative class prediction (1 - class pred)
- Return type:
ndarray
- convert_logits_to_probs(sample_logits)#
Converts logits to probs via softmax
- Return type:
ndarray