dataquality package#

Subpackages#

Submodules#

dataquality.analytics module#

pydantic model ProfileModel#

Bases: BaseModel

User profile

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:

packages (Dict[str, str] | None)
uuid (str | None)

field packages: Optional[Dict[str, str]] = None#

field uuid: Optional[str] = None#

class Analytics(ApiClient, config)#

Bases: Borg

Analytics is used to track errors and logs in the background

To initialize the Analytics class you need to pass in an ApiClient and the dq config. :type ApiClient: Type[ApiClient] :param ApiClient: The ApiClient class :type config: Config :param config: The dq config

debug_logging(log_message, *args)#

This function is used to log debug messages. It will only log if the DQ_DEBUG environment variable is set to True.

Return type:: None

ipython_exception_handler(shell, etype, evalue, tb, tb_offset=None)#

This function is used to handle exceptions in ipython.

Return type:: None

track_exception_ipython(etype, evalue, tb)#

We parse the current environment and send the error to the api.

Return type:: None

handle_exception(etype, evalue, tb)#

This function is used to handle exceptions in python.

Return type:: None

capture_exception(error)#

This function is used to take an exception that is passed as an argument.

Return type:: None

log_import(module)#

This function is used to log an import of a module.

Return type:: None

log_function(function)#

This function is used to log an functional call

Return type:: None

log(data)#

This function is used to send the error to the api in a thread.

Return type:: None

set_config(config)#

This function is used to set the config post init.

Return type:: None

dataquality.dqyolo module#

main()#

dqyolo is a wrapper around ultralytics yolo that will automatically run the model on the validation and test sets and provide data insights.

Return type:: None

dataquality.exceptions module#

exception GalileoException#

Bases: Exception

A class for Galileo Exceptions

exception GalileoWarning#

Bases: Warning

A class for Galileo Warnings

exception LogBatchError#

Bases: Exception

An exception used to indicate an invalid batch of logged model outputs

dataquality.internal module#

Internal functions to help Galileans

reprocess_run(project_name, run_name, alerts=True, wait=True)#

Reprocesses a run that has already been processed by Galileo

Useful if a new feature has been added to the system that is desired to be added to an old run that hasn’t been migrated

Parameters:

project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True

Return type:

None

reprocess_transferred_run(project_name, run_name, alerts=True, wait=True)#

Reprocess a run that has been transferred from another cluster

This is an internal helper function that allows us to reprocess a run that has been transferred from another cluster.

Parameters:

project_name (str) – The name of the project
run_name (str) – The name of the run
alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True
wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True

Return type:

None

rename_run(project_name, run_name, new_name)#

Assigns a new name to a run

Useful if a run was named incorrectly, or if a run was created with a temporary name and needs to be renamed to something more permanent

Parameters:

project_name (str) – The name of the project
run_name (str) – The name of the run
new_name (str) – The new name to assign to the run

Return type:

None

rename_project(project_name, new_name)#

Renames a project

Useful if a project was named incorrectly, or if a project was created with a temporary name and needs to be renamed to something more permanent

Parameters:

project_name (str) – The name of the project
new_name (str) – The new name to assign to the project

Return type:

None

dataquality.metrics module#

create_edit(project_name, run_name, split, edit, filter, task=None, inference_name=None)#

Creates an edit for a run given a filter

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split
edit (Union[Edit, Dict]) – The edit to make. see help(Edit) for more information
task (Optional[str]) – Required task name if run is MLTC
inference_name (Optional[str]) – Required inference name if split is inference

Return type:

Dict

get_run_summary(project_name, run_name, split, task=None, inference_name=None, filter=None)#

Gets the summary for a run/split

Calculates metrics (f1, recall, precision) overall (weighted) and per label. Also returns the top 50 rows of the dataframe (sorted by data_error_potential)

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the summary to only to matching rows. See dq.schemas.metrics.FilterParams

Return type:

Dict

get_metrics(project_name, run_name, split, task=None, inference_name=None, category='gold', filter=None)#

Calculates available metrics for a run/split, grouped by a particular category

The category/column provided (can be gold, pred, or any categorical metadata column) will result in metrics per “group” or unique value of that category/column

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
category (str) – The category/column to calculate metrics for. Default “gold” Can be “gold” for ground truth, “pred” for predicted values, or any metadata column logged (or smart feature).
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the metrics to only to matching rows. See dq.schemas.metrics.FilterParams

Return type:

Dict[str, List]

display_distribution(project_name, run_name, split, task=None, inference_name=None, column='data_error_potential', filter=None)#

Displays the column distribution for a run. Plotly must be installed

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
task (Optional[str]) – (If multi-label only) the task name in question
inference_name (Optional[str]) – (If inference split only) The inference split name
column (str) – The column to get the distribution for. Default data error potential
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the distribution to only to matching rows. See dq.schemas.metrics.FilterParams

Return type:

None

get_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, filter=None, as_pandas=True, include_data_embs=False, meta_cols=None)#

Gets the dataframe for a run/split

Downloads an arrow (or specified type) file to your machine and returns a loaded Vaex dataframe.

Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the distribution to only to matching rows. See dq.schemas.metrics.FilterParams
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False) If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off the shelf data embeddings
meta_cols (Optional[List[str]]) – List of metadata columns to return in the dataframe. If “*” is included, return all metadata columns

Return type:

Union[DataFrame, DataFrame]

get_edited_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, reviewed_only=False, as_pandas=True, include_data_embs=False)#

Gets the edited dataframe for a run/split

Exports a run/split’s data with all active edits in the edits cart and returns a vaex or pandas dataframe

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference. The name of the inference split to get data for.
file_type (FileType) – The file type to download the data as. Default arrow
include_embs (bool) – Whether to include the embeddings in the data. Default False
include_probs (bool) – Whether to include the probs in the data. Default False
include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining
hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format
tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema
reviewed_only (Optional[bool]) – Whether to export only reviewed edits or all edits. Default: False (all edits)
as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False) If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True
include_data_embs (bool) – Whether to include the off the shelf data embeddings

Return type:

Union[DataFrame, DataFrame]

get_epochs(project_name, run_name, split)#

Returns the epochs logged for a run/split

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)

Return type:

List[int]

get_embeddings(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the embeddings for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the embeddings from the final epoch. Note that only the n and n-1 epoch embeddings are available for download

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch

Return type:

DataFrame

get_data_embeddings(project_name, run_name, split, inference_name='')#

Downloads the data (off the shelf) embeddings for a run/split

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference

Return type:

DataFrame

get_probabilities(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the probabilities for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the probabilities from the final epoch.

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch

Return type:

DataFrame

get_raw_data(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the raw logged data for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the probabilities from the final epoch.

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:

project_name (str) – The project name
run_name (str) – The run name
split (Split) – The split (training/test/validation/inference)
inference_name (str) – Required if split is inference
epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch

Return type:

DataFrame

get_alerts(project_name, run_name, split, inference_name=None)#

Get alerts for a project/run/split

Alerts are automatic insights calculated and provided by Galileo on your data

Return type:: List[Dict[str, str]]

get_labels_for_run(project_name, run_name, task=None)#

Gets labels for a given run.

If multi-label, and a task is provided, this will get the labels for that task. Otherwise, it will get all task-labels

In NER, the full label set with the tags for each label will be returned

Return type:: List

get_tasks_for_run(project_name, run_name)#

Gets task names for a multi-label run

Return type:: List[str]

Module contents#

login()#

Log into your Galileo environment.

The function will prompt your for an Authorization Token (api key) that you can access from the console.

To skip the prompt for automated workflows, you can set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password. You can set GALILEO_API_KEY to your API key if you have one.

Return type:: None

logout()#

Return type:: None

init(task_type, project_name=None, run_name=None, overwrite_local=True)#

Start a run

Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.

Before creating the project, check: - The user is valid, login if not - The DQ client version is compatible with API version

Optionally provide project and run names to create a new project/run or restart existing ones.

Return type:: None
Parameters:: task_type (str) – The task type for modeling. This must be one of the valid

dataquality.schemas.task_type.TaskType options :type project_name: Optional[str] :param project_name: The project name. If not passed in, a random one will be generated. If provided, and the project does not exist, it will be created. If it does exist, it will be set. :type run_name: Optional[str] :param run_name: The run name. If not passed in, a random one will be generated. If provided, and the project does not exist, it will be created. If it does exist, it will be set. :type overwrite_local: bool :param overwrite_local: If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True

log_data_samples(*, texts, ids, meta=None, **kwargs)#

Logs a batch of input samples for model training/test/validation/inference.

Fields are expected as lists of their content. Field names are in the plural of log_input_sample (text -> texts) The expected arguments come from the task_type being used: See dq.docs() for details

ex (text classification): .. code-block:: python

all_labels = [“A”, “B”, “C”] dq.set_labels_for_run(labels = all_labels)

texts: List[str] = [
“Text sample 1”, “Text sample 2”, “Text sample 3”, “Text sample 4”

]

labels: List[str] = [“B”, “C”, “A”, “A”]

meta = {
“sample_importance”: [“high”, “low”, “low”, “medium”] “quality_ranking”: [9.7, 2.4, 5.5, 1.2]

}

ids: List[int] = [0, 1, 2, 3] split = “training”

dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta split=split)

Parameters:

texts (List[str]) – List[str] the input samples to your model
ids (List[int]) – List[int | str] the ids per sample
split – Optional[str] the split for this data. Can also be set via
meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields to

Return type:

None dq.set_split

each sample. The name of the field is the key of the dictionary, and the values are a list that correspond in length and order to the text samples. :type kwargs: Any :param kwargs: See dq.docs() for details on other task specific parameters

log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#

Logs model outputs for model during training/test/validation.

Parameters:

ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples
embs (Union[List, ndarray, None]) – The embeddings per output sample
split (Optional[Split]) – The current split. Must be set either here or via dq.set_split
epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch
logits (Union[List, ndarray, None]) – The logits for each sample
probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied
inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.
exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.

Return type:

None

The expected argument shapes come from the task_type being used See dq.docs() for more task specific details on parameter shape

configure(do_login=True, _internal=False)#

Update your active config with new information

You can use environment variables to set the config, or wait for prompts Available environment variables to update: * GALILEO_CONSOLE_URL * GALILEO_USERNAME * GALILEO_PASSWORD * GALILEO_API_KEY

Return type:: None

finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#

Finishes the current run and invokes a job

Parameters:

last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed This is inclusive, so setting last_epoch to 5 would upload epochs 0,1,2,3,4,5
wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run() Default True
create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.
data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to ‘text’ which corresponds to the input text. Can also be set to target, generated_output or any other column that is logged as metadata.
upload_model (bool) – If True, the model will be stored in the galileo project. Default False or set by the environment variable DQ_UPLOAD_MODEL.

Return type:

str

set_labels_for_run(labels)#

Creates the mapping of the labels for the model to their respective indexes. :rtype: None

Parameters:: labels (Union[List[List[str]], List[str]]) – An ordered list of labels (ie [‘dog’,’cat’,’fish’]

If this is a multi-label type, then labels are a list of lists where each inner list indicates the label for the given task

This order MUST match the order of probabilities that the model outputs.

In the multi-label case, the outer order (order of the tasks) must match the task-order of the task-probabilities logged as well.

get_current_run_labels()#

Returns the current run labels, if there are any

Return type:: Optional[List[str]]

get_data_logger(task_type=None, *args, **kwargs)#

Return type:: BaseGalileoDataLogger

get_model_logger(task_type=None, *args, **kwargs)#

Return type:: BaseGalileoModelLogger

get_run_link(project_name=None, run_name=None)#

Gets the link to the run in the UI

Return type:: str

set_tasks_for_run(tasks, binary=True)#

Sets the task names for the run (multi-label case only).

This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.

This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)

Return type:

None

Parameters:

tasks (List[str]) – The list of tasks for your run
binary (bool) – Whether this is a binary multi label run. If true, tasks will also

be set as your labels, and you should NOT call dq.set_labels_for_run it will be handled for you. Default True

set_tagging_schema(tagging_schema)#

Sets the tagging schema for NER models

Only valid for text_ner task_types. Others will throw an exception

Return type:: None

docs()#

Print the documentation for your specific input and output logging format

Based on your task_type, this will print the appropriate documentation

Return type:: None

wait_for_run(project_name=None, run_name=None)#

Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.

Parameters:

project_name (Optional[str]) – The project name. Default to current project if not passed in.
run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

None

Returns:

None. Function returns after the run transitions to finished

get_run_status(project_name=None, run_name=None)#

Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.

Parameters:

project_name (Optional[str]) – The project name. Default to current project if not passed in.
run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

Dict[str, Any]

Returns:

Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.

set_epoch(epoch)#

Set the current epoch.

When set, logging model outputs will use this if not logged explicitly

Return type:: None

set_split(split, inference_name=None)#

Set the current split.

When set, logging data inputs/model outputs will use this if not logged explicitly When setting split to inference, inference_name must be included

Return type:: None

set_epoch_and_split(epoch, split, inference_name=None)#

Set the current epoch and set the current split. When set, logging data inputs/model outputs will use this if not logged explicitly When setting split to inference, inference_name must be included

Return type:: None

set_console_url(console_url=None)#

For Enterprise users. Set the console URL to your Galileo Environment.

You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this :rtype: None

Parameters:: console_url (Optional[str]) – If set, that will be used. Otherwise, if an environment variable

GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a url.

log_data_sample(*, text, id, **kwargs)#

Log a single input example to disk

Fields are expected singular elements. Field names are in the singular of log_input_samples (texts -> text) The expected arguments come from the task_type being used: See dq.docs() for details

Parameters:

text (str) – List[str] the input samples to your model
id (int) – List[int | str] the ids per sample
split – Optional[str] the split for this data. Can also be set via dq.set_split
kwargs (Any) – See dq.docs() for details on other task specific parameters

Return type:

None

log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#

Log an iterable or other dataset to disk. Useful for logging memory mapped files

Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets

valid examples:

d = [: {“my_text”: “sample1”, “my_labels”: “A”, “my_id”: 1, “sample_quality”: 5.3}, {“my_text”: “sample2”, “my_labels”: “A”, “my_id”: 2, “sample_quality”: 9.1}, {“my_text”: “sample3”, “my_labels”: “B”, “my_id”: 3, “sample_quality”: 2.7},

] dq.log_dataset(

d, text=”my_text”, id=”my_id”, label=”my_labels”, meta=[“sample_quality”]

)

Logging a pandas dataframe, df:: text label id sample_quality

0 sample1 A 1 5.3 1 sample2 A 2 9.1 2 sample3 B 3 2.7 # We don’t need to set text id or label because it matches the default dq.log_dataset(d, meta=[“sample_quality”])

Logging and iterable of tuples: d = [

(“sample1”, “A”, “ID1”), (“sample2”, “A”, “ID2”), (“sample3”, “B”, “ID3”),

] dq.log_dataset(d, text=0, id=2, label=1)

Invalid example:

d = {: “my_text”: [“sample1”, “sample2”, “sample3”], “my_labels”: [“A”, “A”, “B”], “my_id”: [1, 2, 3], “sample_quality”: [5.3, 9.1, 2.7]

}

In the invalid case, use dq.log_data_samples:

meta = {“sample_quality”: d[“sample_quality”]} dq.log_data_samples(

texts=d[“my_text”], labels=d[“my_labels”], ids=d[“my_ids”], meta=meta

)

Keyword arguments are specific to the task type. See dq.docs() for details

Parameters:

dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log
text (Union[str, int]) – str | int The column, key, or int index for text data. Default “text”
id (Union[str, int]) – str | int The column, key, or int index for id data. Default “id”
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns to your input data to be logged as metadata. Consider a pandas dataframe, this would be the list of
kwargs (Any) – See help(dq.get_data_logger().log_dataset) for more details here

Batch_size:

The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000

Return type:

None columns corresponding to each metadata field to log

or dq.docs() for more general task details

log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#

Log an image dataset of input samples for image classification

Parameters:

dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).
imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths but could also be bytes for HF dataframes). Ignored for ImageFolder where local paths are directly retrieved from the dataset.
imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.
batch_size (int) – Number of samples to log in a batch. Default 10,000
id (str) – The name of the column containing the ids (in the dataframe)
label (str) – The name of the column containing the labels (in the dataframe)
split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split
inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split
parallel (bool) – upload in parallel if set to True

Return type:

None

log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#

Log data for tabular classification models with XGBoost

X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.

Example with numpy arrays: .. code-block:: python

import xgboost as xgb from sklearn.datasets import load_wine

wine = load_wine()

X = wine.data y = wine.target feature_names = wine.feature_names

model = xgb.XGBClassifier() model.fit(X, y)

dq.log_xgboost(model, X, y=y, feature_names=feature_names, split=”training”)

# or for inference dq.log_xgboost(

model, X, feature_names, split=”inference”, inference_name=”my_inference”

)

Example with pandas DataFrames: .. code-block:: python

import xgboost as xgb from sklearn.datasets import load_wine

X, y = load_wine(as_frame=True, return_X_y=True)

model = xgb.XGBClassifier() model.fit(df, y)

dq.log_xgboost(model, X=df, y=y, split=”training”)

# or for inference dq.log_xgboost(

model, X=df, split=”inference”, inference_name=”my_inference”

)

Parameters:

model (XGBClassifier) – XGBClassifier model fit on the training data
X (Union[DataFrame, ndarray]) – The input data has a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)
y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference only
feature_names (Optional[List[str]]) – List of feature names if X is input as numpy array. Must have length n_features
split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split

Return type:

None

get_dq_log_file(project_name=None, run_name=None)#

Return type:: Optional[str]

build_run_report(conditions, emails, project_id, run_id, link)#

Build a run report and send it to the specified emails.

Return type:: None

register_run_report(conditions, emails)#

After a run is finished, a report will be sent to the specified emails.

Return type:: None

class AggregateFunction(value)#

Bases: str, Enum

An enumeration.

avg = 'Average'#

min = 'Minimum'#

max = 'Maximum'#

sum = 'Sum'#

pct = 'Percentage'#

class Operator(value)#

Bases: str, Enum

An enumeration.

eq = 'is equal to'#

neq = 'is not equal to'#

gt = 'is greater than'#

lt = 'is less than'#

gte = 'is greater than or equal to'#

lte = 'is less than or equal to'#

pydantic model Condition#

Bases: BaseModel

Class for building custom conditions for data quality checks

After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.

With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:

Is the average confidence less than 0.3?

>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c.evaluate(df)

Is the max DEP greater or equal to 0.45?

>>> c = Condition(
...     agg=AggregateFunction.max,
...     metric="data_error_potential",
...     operator=Operator.gte,
...     threshold=0.45,
... )
>>> c.evaluate(df)

By adding filters, you can further narrow down the scope of the condition. If the aggregate function is “pct”, you don’t need to specify a metric,

as the filters will determine the percentage of data.

For example:

Alert if over 80% of the dataset has confidence under 0.1

>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.8,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="confidence", operator=Operator.lt, value=0.1
...         ),
...     ],
... )
>>> c.evaluate(df)

Alert if at least 20% of the dataset has drifted (Inference DataFrames only)

>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.2,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="is_drifted", operator=Operator.eq, value=True
...         ),
...     ],
... )
>>> c.evaluate(df)

Alert 5% or more of the dataset contains PII

>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.05,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="galileo_pii", operator=Operator.neq, value="None"
...         ),
...     ],
... )
>>> c.evaluate(df)

Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:

Alert if the min confidence of drifted data is less than 0.15

>>> c = Condition(
...     agg=AggregateFunction.min,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.15,
...     filters=[
...         ConditionFilter(
...             metric="is_drifted", operator=Operator.eq, value=True
...         )
...     ],
... )
>>> c.evaluate(df)

Alert if over 50% of high DEP (>=0.7) data contains PII

>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.5,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(
...             metric="data_error_potential", operator=Operator.gte, value=0.7
...         ),
...         ConditionFilter(
...             metric="galileo_pii", operator=Operator.neq, value="None"
...         ),
...     ],
... )
>>> c.evaluate(df)

You can also call conditions directly, which will assert its truth against a df 1. Assert that average confidence less than 0.3 >>> c = Condition( … agg=AggregateFunction.avg, … metric=”confidence”, … operator=Operator.lt, … threshold=0.3, … ) >>> c(df) # Will raise an AssertionError if False

Parameters:

metric – The DF column for evaluating the condition
agg – An aggregate function to apply to the metric
operator – The operator to use for comparing the agg to the threshold (e.g. “gt”, “lt”, “eq”, “neq”)
threshold – Threshold value for evaluating the condition
filter – Optional filter to apply to the DataFrame before evaluating the condition

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:

field agg: AggregateFunction [Required]#

field filters: List[ConditionFilter] [Optional]#

Validated by:

validate_filters

field metric: Optional[str] = None#

Validated by:

validate_metric

field operator: Operator [Required]#

field threshold: float [Required]#

evaluate(df)#

Return type:: Tuple[bool, float]

pydantic model ConditionFilter#

Bases: BaseModel

Filter a dataframe based on the column value

Note that the column used for filtering is the same as the metric used in the condition.

Parameters:

operator – The operator to use for filtering (e.g. “gt”, “lt”, “eq”, “neq”) See Operator
value – The value to compare against

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:

field metric: str [Required]#

field operator: Operator [Required]#

field value: Union[float, int, str, bool] [Required]#

disable_galileo()#

Return type:: None

disable_galileo_verbose()#

Return type:: None

enable_galileo_verbose()#

Return type:: None

enable_galileo()#

Return type:: None

auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)#

Automatically gets insights on a text classification or NER dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console

One of hf_data, train_data should be provided. If neither of those are, a demo dataset will be loaded by Galileo for training.

Parameters:

hf_data (Union[DatasetDict, str, None]) – Union[DatasetDict, str] Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you’d like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – User this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, you should use the hf_inference_names param. Optional inference datasets to run with after training completes. The structure is a dictionary with the key being the inference name and the value one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path
max_padding_length (int) – The max length for padding the input text during tokenization. Default 200
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased
num_train_epochs (int) – The number of epochs to train for (early stopping will always be active). Default 15
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data
project_name (Optional[str]) – Optional project name. If not set, a random name will be generated
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(…, include_data_embs=True) in the data_emb col Only available for TC currently. NER coming soon. Default True if a GPU is available, else default False.
early_stopping (bool) – Whether to use early stopping. Default True

Return type:

None

For text classification datasets, the only required columns are text and label

For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies

MIT Movies dataset in huggingface format

tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...       [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...       [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...       [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...       [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...       [0, 0, 0, 7, 0, 0, ...
...                                               ...                      ...

To see auto insights on a random, pre-selected dataset, simply run

import dataquality as dq

dq.auto()

An example using auto with a hosted huggingface text classification dataset

import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")

Similarly, for NER

import dataquality as dq

dq.auto(hf_data="conll2003")

An example using auto with sklearn data as pandas dataframes

import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
     train_data=df_train,
     test_data=df_test,
     labels=newsgroups_train.target_names,
     project_name="newsgroups_work",
     run_name="run_1_raw_data"
)

An example of using auto with a local CSV file with text and label columns

import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)

class DataQuality(model=None, task=TaskType.text_classification, labels=None, train_data=None, test_data=None, val_data=None, project='', run='', framework=None, *args, **kwargs)#

Bases: object

Parameters:

model (Optional[Any]) – The model to inspect, if a string, it will be assumed to be auto
task (TaskType) – Task type for example “text_classification”
project (str) – Project name
run (str) – Run name
train_data (Optional[Any]) – Training data
test_data (Optional[Any]) – Optional test data
val_data (Optional[Any]) – Optional: validation data
labels (Optional[List[str]]) – The labels for the run
framework (Optional[ModelFramework]) – The framework to use, if provided it will be used instead of inferring it from the model. For example, if you have a torch model, you can pass framework=”torch”. If you have a torch model, you can pass framework=”torch”
args (Any) – Additional arguments
kwargs (Any) – Additional keyword arguments

from dataquality import DataQuality

with DataQuality(model, "text_classification",
                 labels = ["neg", "pos"],
                 train_data = train_data) as dq:
    model.fit(train_data)

If you want to train without a model, you can use the auto framework:

from dataquality import DataQuality

with DataQuality(labels = ["neg", "pos"],
                 train_data = train_data) as dq:
    dq.finish()

get_metrics(split=Split.train)#

Return type:: Dict[str, Any]

auto_notebook()#

Return type:: None