dataquality package#

Subpackages#

Submodules#

dataquality.analytics module#

pydantic model ProfileModel#

Bases: BaseModel

User profile

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field packages: Optional[Dict[str, str]] = None#
field uuid: Optional[str] = None#
class Analytics(ApiClient, config)#

Bases: Borg

Analytics is used to track errors and logs in the background

To initialize the Analytics class you need to pass in an ApiClient and the dq config.

Parameters:
  • ApiClient (Type[ApiClient]) – The ApiClient class

  • config (Config) – The dq config

debug_logging(log_message, *args)#

This function is used to log debug messages. It will only log if the DQ_DEBUG environment variable is set to True.

Return type:

None

ipython_exception_handler(shell, etype, evalue, tb, tb_offset=None)#

This function is used to handle exceptions in IPython.

Return type:

None

track_exception_ipython(etype, evalue, tb)#

We parse the current environment and send the error to the api.

Return type:

None

handle_exception(etype, evalue, tb)#

This function is used to handle exceptions in Python.

Return type:

None

capture_exception(error)#

This function is used to capture an exception that is passed as an argument.

Return type:

None

log_import(module)#

This function is used to log an import of a module.

Return type:

None

log_function(function)#

This function is used to log a function call

Return type:

None

log(data)#

This function is used to send the error to the api in a thread.

Return type:

None

set_config(config)#

This function is used to set the config post init.

Return type:

None

dataquality.dqyolo module#

main()#

dqyolo is a wrapper around Ultralytics YOLO that will automatically run the model on the validation and test sets and provide data insights.

Return type:

None

dataquality.exceptions module#

exception GalileoException#

Bases: Exception

A class for Galileo Exceptions

exception GalileoWarning#

Bases: Warning

A class for Galileo Warnings

exception LogBatchError#

Bases: Exception

An exception used to indicate an invalid batch of logged model outputs

dataquality.internal module#

Internal functions to help Galileans

reprocess_run(project_name, run_name, alerts=True, wait=True)#

Reprocesses a run that has already been processed by Galileo

Useful if a new feature has been added to the system and you want it applied to an old run that hasn't been migrated

Parameters:
  • project_name (str) – The name of the project

  • run_name (str) – The name of the run

  • alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True

  • wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True

Return type:

None

reprocess_transferred_run(project_name, run_name, alerts=True, wait=True)#

Reprocess a run that has been transferred from another cluster

This is an internal helper function that allows us to reprocess a run that has been transferred from another cluster.

Parameters:
  • project_name (str) – The name of the project

  • run_name (str) – The name of the run

  • alerts (bool) – Whether to create the alerts. If True, all alerts for the run will be removed, and recreated during processing. Default True

  • wait (bool) – Whether to wait for the run to complete processing on the server. If True, this will block execution, printing out the status updates of the run. Useful if you want to know exactly when your run completes. Otherwise, this will fire and forget your process. Default True

Return type:

None

rename_run(project_name, run_name, new_name)#

Assigns a new name to a run

Useful if a run was named incorrectly, or if a run was created with a temporary name and needs to be renamed to something more permanent

Parameters:
  • project_name (str) – The name of the project

  • run_name (str) – The name of the run

  • new_name (str) – The new name to assign to the run

Return type:

None

rename_project(project_name, new_name)#

Renames a project

Useful if a project was named incorrectly, or if a project was created with a temporary name and needs to be renamed to something more permanent

Parameters:
  • project_name (str) – The name of the project

  • new_name (str) – The new name to assign to the project

Return type:

None
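
A hedged sketch of using these internal helpers, assuming the project and run below already exist (all names are placeholders):

from dataquality.internal import rename_project, rename_run

# Give a temporary run a permanent name, then rename its project
rename_run("my_project", "temp_run_name", new_name="final_run_name")
rename_project("my_project", new_name="customer_churn_analysis")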

dataquality.metrics module#

create_edit(project_name, run_name, split, edit, filter, task=None, inference_name=None)#

Creates an edit for a run given a filter

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split

  • edit (Union[Edit, Dict]) – The edit to make. See help(Edit) for more information

  • task (Optional[str]) – Required task name if run is MLTC

  • inference_name (Optional[str]) – Required inference name if split is inference

Return type:

Dict

get_run_summary(project_name, run_name, split, task=None, inference_name=None, filter=None)#

Gets the summary for a run/split

Calculates metrics (f1, recall, precision) overall (weighted) and per label. Also returns the top 50 rows of the dataframe (sorted by data_error_potential)

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • task (Optional[str]) – (If multi-label only) the task name in question

  • inference_name (Optional[str]) – (If inference split only) The inference split name

  • filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the summary to only matching rows. See dq.schemas.metrics.FilterParams

Return type:

Dict
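
A minimal usage sketch, assuming you are already logged in and the run has finished processing (project/run names are placeholders; the split is passed as a string here for brevity):

import dataquality as dq

summary = dq.metrics.get_run_summary(
    project_name="my_project",
    run_name="my_run",
    split="training",
)
# Inspect the returned dict for the computed metrics and top rows
print(summary.keys())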

get_metrics(project_name, run_name, split, task=None, inference_name=None, category='gold', filter=None)#

Calculates available metrics for a run/split, grouped by a particular category

The category/column provided (can be gold, pred, or any categorical metadata column) will result in metrics per "group" or unique value of that category/column

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • task (Optional[str]) – (If multi-label only) the task name in question

  • inference_name (Optional[str]) – (If inference split only) The inference split name

  • category (str) – The category/column to calculate metrics for. Default "gold". Can be "gold" for ground truth, "pred" for predicted values, or any metadata column logged (or smart feature).

  • filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the metrics to only matching rows. See dq.schemas.metrics.FilterParams

Return type:

Dict[str, List]
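
A short sketch, under the same assumptions as above, that groups metrics by the predicted label instead of the default ground truth:

import dataquality as dq

metrics = dq.metrics.get_metrics(
    project_name="my_project",
    run_name="my_run",
    split="test",
    category="pred",
)
# Each key maps to a list with one value per unique "pred" group
print(list(metrics.keys()))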

display_distribution(project_name, run_name, split, task=None, inference_name=None, column='data_error_potential', filter=None)#

Displays the column distribution for a run. Plotly must be installed.

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • task (Optional[str]) – (If multi-label only) the task name in question

  • inference_name (Optional[str]) – (If inference split only) The inference split name

  • column (str) – The column to get the distribution for. Default data error potential

  • filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the distribution to only matching rows. See dq.schemas.metrics.FilterParams

Return type:

None

get_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, filter=None, as_pandas=True, include_data_embs=False, meta_cols=None)#

Gets the dataframe for a run/split

Downloads an arrow (or specified type) file to your machine and returns a loaded Vaex dataframe.

Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference. The name of the inference split to get data for.

  • file_type (FileType) – The file type to download the data as. Default arrow

  • include_embs (bool) – Whether to include the embeddings in the data. Default False

  • include_probs (bool) – Whether to include the probs in the data. Default False

  • include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining

  • hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format

  • tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema

  • filter (Union[FilterParams, Dict, None]) – Optional filter to provide to restrict the data to only matching rows. See dq.schemas.metrics.FilterParams

  • as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True

  • include_data_embs (bool) – Whether to include the off the shelf data embeddings

  • meta_cols (Optional[List[str]]) – List of metadata columns to return in the dataframe. If "*" is included, return all metadata columns

Return type:

DataFrame (pandas if as_pandas=True, otherwise vaex)
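
A hedged sketch of pulling a processed split into pandas, including every logged metadata column (names are placeholders):

import dataquality as dq

df = dq.metrics.get_dataframe(
    project_name="my_project",
    run_name="my_run",
    split="training",
    meta_cols=["*"],  # return all logged metadata columns
)
print(df.head())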

get_edited_dataframe(project_name, run_name, split, inference_name='', file_type=FileType.arrow, include_embs=False, include_probs=False, include_token_indices=False, hf_format=False, tagging_schema=None, as_pandas=True, include_data_embs=False)#

Gets the edited dataframe for a run/split

Exports a run/split's data with all active edits in the edits cart and returns a vaex or pandas dataframe

Special note for NER. By default, the data will be downloaded at a sample level (1 row per sample text), with spans for each sample in a spans column in a spacy-compatible JSON format. If include_embs or include_probs is True, the data will be expanded into span level (1 row per span, with sample text repeated for each span row), in order to join the span-level embeddings/probs

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference. The name of the inference split to get data for.

  • file_type (FileType) – The file type to download the data as. Default arrow

  • include_embs (bool) – Whether to include the embeddings in the data. Default False

  • include_probs (bool) – Whether to include the probs in the data. Default False

  • include_token_indices (bool) – (NER only) Whether to include logged text_token_indices in the data. Useful for reconstructing tokens for retraining

  • hf_format (bool) – (NER only) Whether to export the dataframe in a HuggingFace compatible format

  • tagging_schema (Optional[TaggingSchema]) – (NER only) If hf_format is True, you must pass a tagging schema

  • as_pandas (bool) – Whether to return the dataframe as a pandas df (or vaex if False). If you are having memory issues (the data is too large), set this to False, and vaex will memory map the data. If any columns returned are multi-dimensional (embeddings, probabilities etc), vaex will always be returned, because pandas cannot support multi-dimensional columns. Default True

  • include_data_embs (bool) – Whether to include the off the shelf data embeddings

Return type:

DataFrame (pandas if as_pandas=True, otherwise vaex)

get_epochs(project_name, run_name, split)#

Returns the epochs logged for a run/split

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

Return type:

List[int]

get_embeddings(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the embeddings for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the embeddings from the final epoch. Note that only the n and n-1 epoch embeddings are available for download

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference

  • epoch (Optional[int]) – The epoch to get embeddings for. Default final epoch

Return type:

DataFrame
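
A brief sketch, assuming the run has finished processing; omitting epoch uses the final epoch:

import dataquality as dq

emb_df = dq.metrics.get_embeddings(
    project_name="my_project",
    run_name="my_run",
    split="training",
)
# Inspect the returned Vaex dataframe and its columns
print(emb_df)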

get_data_embeddings(project_name, run_name, split, inference_name='')#

Downloads the data (off the shelf) embeddings for a run/split

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference

Return type:

DataFrame

get_probabilities(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the probabilities for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the probabilities from the final epoch.

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference

  • epoch (Optional[int]) – The epoch to get probabilities for. Default final epoch

Return type:

DataFrame

get_raw_data(project_name, run_name, split, inference_name='', epoch=None)#

Downloads the raw logged data for a run/split at an epoch as a Vaex dataframe.

If not provided, will take the raw data from the final epoch.

An hdf5 file will be downloaded to local and a Vaex dataframe will be returned

Parameters:
  • project_name (str) – The project name

  • run_name (str) – The run name

  • split (Split) – The split (training/test/validation/inference)

  • inference_name (str) – Required if split is inference

  • epoch (Optional[int]) – The epoch to get raw data for. Default final epoch

Return type:

DataFrame

get_alerts(project_name, run_name, split, inference_name=None)#

Get alerts for a project/run/split

Alerts are automatic insights calculated and provided by Galileo on your data

Return type:

List[Dict[str, str]]

get_labels_for_run(project_name, run_name, task=None)#

Gets labels for a given run.

If multi-label, and a task is provided, this will get the labels for that task. Otherwise, it will get all task-labels

In NER, the full label set with the tags for each label will be returned

Return type:

List

get_tasks_for_run(project_name, run_name)#

Gets task names for a multi-label run

Return type:

List[str]

Module contents#

login()#

Log into your Galileo environment.

The function will prompt you for an Authorization Token (API key) that you can access from the console.

To skip the prompt for automated workflows, you can set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password. You can set GALILEO_API_KEY to your API key if you have one.

Return type:

None
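
A non-interactive sketch for automated workflows using the environment variables described above (the URL and key are placeholders):

import os

import dataquality as dq

os.environ["GALILEO_CONSOLE_URL"] = "https://console.your-galileo-domain.com"
os.environ["GALILEO_API_KEY"] = "your-api-key"

dq.login()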

logout()#
Return type:

None

init(task_type, project_name=None, run_name=None, overwrite_local=True)#

Start a run

Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.

Before creating the project, check:
  • The user is valid, login if not
  • The DQ client version is compatible with the API version

Optionally provide project and run names to create a new project/run or restart existing ones.

Return type:

None

Parameters:

  • task_type (str) – The task type for modeling. This must be one of the valid dataquality.schemas.task_type.TaskType options

  • project_name (Optional[str]) – The project name. If not passed in, a random one will be generated. If provided, and the project does not exist, it will be created. If it does exist, it will be set.

  • run_name (Optional[str]) – The run name. If not passed in, a random one will be generated. If provided, and the run does not exist, it will be created. If it does exist, it will be set.

  • overwrite_local (bool) – If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True
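
A minimal sketch of starting a run; the names are placeholders and will be created if they do not exist:

import dataquality as dq

dq.init(
    task_type="text_classification",
    project_name="example_project",
    run_name="example_run",
)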

log_data_samples(*, texts, ids, meta=None, **kwargs)#

Logs a batch of input samples for model training/test/validation/inference.

Fields are expected as lists of their content. Field names are in the plural of log_input_sample (text -> texts). The expected arguments come from the task_type being used: See dq.docs() for details

Example (text classification):

all_labels = ["A", "B", "C"]
dq.set_labels_for_run(labels=all_labels)

texts: List[str] = [
    "Text sample 1", "Text sample 2", "Text sample 3", "Text sample 4"
]

labels: List[str] = ["B", "C", "A", "A"]

meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2],
}

ids: List[int] = [0, 1, 2, 3]
split = "training"

dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)

Parameters:
  • texts (List[str]) – List[str] the input samples to your model

  • ids (List[int]) – List[int | str] the ids per sample

  • split – Optional[str] the split for this data. Can also be set via dq.set_split

  • meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields to each sample. The name of the field is the key of the dictionary, and the values are a list that correspond in length and order to the text samples.

  • kwargs (Any) – See dq.docs() for details on other task specific parameters

Return type:

None

log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#

Logs model outputs during training/test/validation.

Parameters:
  • ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples

  • embs (Union[List, ndarray, None]) – The embeddings per output sample

  • split (Optional[Split]) – The current split. Must be set either here or via dq.set_split

  • epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch

  • logits (Union[List, ndarray, None]) – The logits for each sample

  • probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied

  • inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.

  • exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.

Return type:

None

The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shapes.
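
A hedged text classification sketch with random numpy arrays standing in for real model outputs; the embedding and class dimensions are placeholders, and the ids are assumed to match samples logged earlier:

import numpy as np

import dataquality as dq

dq.set_split("training")
dq.set_epoch(0)

n_samples, emb_dim, n_classes = 4, 768, 3
dq.log_model_outputs(
    ids=list(range(n_samples)),               # must match logged input ids
    embs=np.random.rand(n_samples, emb_dim),  # one embedding vector per sample
    logits=np.random.rand(n_samples, n_classes),
)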

configure(do_login=True, _internal=False)#

Update your active config with new information

You can use environment variables to set the config, or wait for prompts. Available environment variables to update:
  • GALILEO_CONSOLE_URL
  • GALILEO_USERNAME
  • GALILEO_PASSWORD
  • GALILEO_API_KEY

Return type:

None

finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#

Finishes the current run and invokes a job

Parameters:
  • last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed This is inclusive, so setting last_epoch to 5 would upload epochs 0,1,2,3,4,5

  • wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run() Default True

  • create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.

  • data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to 'text', which corresponds to the input text. Can also be set to target, generated_output, or any other column that is logged as metadata.

  • upload_model (bool) – If True, the model will be stored in the Galileo project. Default False or set by the environment variable DQ_UPLOAD_MODEL.

Return type:

str
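
A short sketch of closing out a run once all inputs and outputs have been logged:

import dataquality as dq

# Uploads the logged data and blocks until the Galileo server finishes processing
dq.finish(wait=True)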

set_labels_for_run(labels)#

Creates the mapping of the labels for the model to their respective indexes.

Return type:

None

Parameters:

labels (Union[List[List[str]], List[str]]) – An ordered list of labels (ie ['dog', 'cat', 'fish'])

If this is a multi-label type, then labels are a list of lists where each inner list indicates the label for the given task

This order MUST match the order of probabilities that the model outputs.

In the multi-label case, the outer order (order of the tasks) must match the task-order of the task-probabilities logged as well.
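
Two alternative sketches, one single-task and one multi-label (label and task names are placeholders):

import dataquality as dq

# Single task: order must match the model's output probability order
dq.set_labels_for_run(["negative", "neutral", "positive"])

# Multi-label alternative: one inner list per task, in the same task order
# dq.set_labels_for_run([["not_toxic", "toxic"], ["not_spam", "spam"]])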

get_current_run_labels()#

Returns the current run labels, if there are any

Return type:

Optional[List[str]]

get_data_logger(task_type=None, *args, **kwargs)#
Return type:

BaseGalileoDataLogger

get_model_logger(task_type=None, *args, **kwargs)#
Return type:

BaseGalileoModelLogger

get_run_link(project_name=None, run_name=None)#

Gets the link to the run in the UI

Return type:

str

set_tasks_for_run(tasks, binary=True)#

Sets the task names for the run (multi-label case only).

This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.

This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)

Return type:

None

Parameters:
  • tasks (List[str]) – The list of tasks for your run

  • binary (bool) – Whether this is a binary multi label run. If True, tasks will also be set as your labels, and you should NOT call dq.set_labels_for_run; it will be handled for you. Default True
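
A hedged sketch of a binary multi-label setup (task names are placeholders):

import dataquality as dq

# With binary=True the task names double as labels,
# so set_labels_for_run should not be called separately
dq.set_tasks_for_run(["toxicity", "spam"], binary=True)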

set_tagging_schema(tagging_schema)#

Sets the tagging schema for NER models

Only valid for text_ner task_types. Others will throw an exception

Return type:

None

docs()#

Print the documentation for your specific input and output logging format

Based on your task_type, this will print the appropriate documentation

Return type:

None

wait_for_run(project_name=None, run_name=None)#

Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.

Parameters:
  • project_name (Optional[str]) – The project name. Default to current project if not passed in.

  • run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

None

Returns:

None. Function returns after the run transitions to finished

get_run_status(project_name=None, run_name=None)#

Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.

Parameters:
  • project_name (Optional[str]) – The project name. Default to current project if not passed in.

  • run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

Dict[str, Any]

Returns:

Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.
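
A small sketch that checks the most recent job for the current run, assuming dq.init was already called:

import dataquality as dq

status = dq.get_run_status()
print(status.get("status"), status.get("created_at"))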

set_epoch(epoch)#

Set the current epoch.

When set, logging model outputs will use this if not logged explicitly

Return type:

None

set_split(split, inference_name=None)#

Set the current split.

When set, logging data inputs/model outputs will use this if not logged explicitly. When setting split to inference, inference_name must be included

Return type:

None

set_epoch_and_split(epoch, split, inference_name=None)#

Set the current epoch and set the current split. When set, logging data inputs/model outputs will use this if not logged explicitly. When setting split to inference, inference_name must be included

Return type:

None

set_console_url(console_url=None)#

For Enterprise users. Set the console URL to your Galileo Environment.

You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this.

Return type:

None

Parameters:

console_url (Optional[str]) – If set, that will be used. Otherwise, if an environment variable GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a url.

log_data_sample(*, text, id, **kwargs)#

Log a single input example to disk

Fields are expected as singular elements. Field names are in the singular of log_input_samples (texts -> text). The expected arguments come from the task_type being used: See dq.docs() for details

Parameters:
  • text (str) – the input sample to your model

  • id (int) – the id for the sample

  • split – Optional[str] the split for this data. Can also be set via dq.set_split

  • kwargs (Any) – See dq.docs() for details on other task specific parameters

Return type:

None
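
A minimal text classification sketch; label is assumed here to be the task-specific keyword argument described by dq.docs(), and the label value is a placeholder:

import dataquality as dq

dq.set_split("training")
dq.log_data_sample(text="What a fantastic movie!", id=0, label="positive")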

log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#

Log an iterable or other dataset to disk. Useful for logging memory mapped files

Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets

Valid examples:

d = [
    {"my_text": "sample1", "my_labels": "A", "my_id": 1, "sample_quality": 5.3},
    {"my_text": "sample2", "my_labels": "A", "my_id": 2, "sample_quality": 9.1},
    {"my_text": "sample3", "my_labels": "B", "my_id": 3, "sample_quality": 2.7},
]
dq.log_dataset(
    d, text="my_text", id="my_id", label="my_labels", meta=["sample_quality"]
)

Logging a pandas dataframe, df:

      text label  id  sample_quality
0  sample1     A   1             5.3
1  sample2     A   2             9.1
2  sample3     B   3             2.7

# We don't need to set text, id, or label because they match the defaults
dq.log_dataset(df, meta=["sample_quality"])

Logging an iterable of tuples:

d = [
    ("sample1", "A", "ID1"),
    ("sample2", "A", "ID2"),
    ("sample3", "B", "ID3"),
]
dq.log_dataset(d, text=0, id=2, label=1)

Invalid example:

d = {
    "my_text": ["sample1", "sample2", "sample3"],
    "my_labels": ["A", "A", "B"],
    "my_id": [1, 2, 3],
    "sample_quality": [5.3, 9.1, 2.7],
}

In the invalid case, use dq.log_data_samples:

meta = {"sample_quality": d["sample_quality"]}
dq.log_data_samples(
    texts=d["my_text"], labels=d["my_labels"], ids=d["my_id"], meta=meta
)

Keyword arguments are specific to the task type. See dq.docs() for details

Parameters:
  • dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log

  • text (Union[str, int]) – str | int The column, key, or int index for text data. Default "text"

  • id (Union[str, int]) – str | int The column, key, or int index for id data. Default "id"

  • split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split

  • meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns to your input data to be logged as metadata. Consider a pandas dataframe, this would be the list of columns corresponding to each metadata field to log

  • batch_size (int) – The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000

  • kwargs (Any) – See help(dq.get_data_logger().log_dataset) for more details here or dq.docs() for more general task details

Return type:

None

log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#

Log an image dataset of input samples for image classification

Parameters:
  • dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).

  • imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths but could also be bytes for HF dataframes). Ignored for ImageFolder where local paths are directly retrieved from the dataset.

  • imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.

  • batch_size (int) – Number of samples to log in a batch. Default 100,000

  • id (str) – The name of the column containing the ids (in the dataframe)

  • label (str) – The name of the column containing the labels (in the dataframe)

  • split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split

  • inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split

  • parallel (bool) – upload in parallel if set to True

Return type:

None
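
A hedged sketch logging a small pandas dataframe of local image paths (column names and paths are placeholders):

import pandas as pd

import dataquality as dq

df = pd.DataFrame(
    {
        "id": [0, 1],
        "label": ["cat", "dog"],
        "path": ["images/0.jpg", "images/1.jpg"],
    }
)

dq.log_image_dataset(
    df,
    imgs_local_colname="path",
    id="id",
    label="label",
    split="training",
)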

log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#

Log data for tabular classification models with XGBoost

X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.

Example with numpy arrays:

import xgboost as xgb
from sklearn.datasets import load_wine

wine = load_wine()

X = wine.data
y = wine.target
feature_names = wine.feature_names

model = xgb.XGBClassifier()
model.fit(X, y)

dq.log_xgboost(model, X, y=y, feature_names=feature_names, split="training")

# or for inference
dq.log_xgboost(
    model, X, feature_names=feature_names,
    split="inference", inference_name="my_inference"
)

Example with pandas DataFrames:

import xgboost as xgb
from sklearn.datasets import load_wine

X, y = load_wine(as_frame=True, return_X_y=True)

model = xgb.XGBClassifier()
model.fit(X, y)

dq.log_xgboost(model, X=X, y=y, split="training")

# or for inference
dq.log_xgboost(
    model, X=X, split="inference", inference_name="my_inference"
)

Parameters:
  • model (XGBClassifier) – XGBClassifier model fit on the training data

  • X (Union[DataFrame, ndarray]) – The input data as a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)

  • y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference only

  • feature_names (Optional[List[str]]) – List of feature names if X is input as numpy array. Must have length n_features

  • split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split

  • inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split

Return type:

None

get_dq_log_file(project_name=None, run_name=None)#
Return type:

Optional[str]

build_run_report(conditions, emails, project_id, run_id, link)#

Build a run report and send it to the specified emails.

Return type:

None

register_run_report(conditions, emails)#

Register conditions and emails for a run report.

After a run is finished, a report will be sent to the specified emails.

Return type:

None
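
A hedged sketch that registers a report to be emailed after dq.finish() whenever the average data error potential exceeds a threshold (the email address and threshold are placeholders; Condition, AggregateFunction, and Operator are documented below):

import dataquality as dq
from dataquality import AggregateFunction, Condition, Operator

high_dep = Condition(
    agg=AggregateFunction.avg,
    metric="data_error_potential",
    operator=Operator.gt,
    threshold=0.4,
)

dq.register_run_report(conditions=[high_dep], emails=["ml-team@example.com"])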

class AggregateFunction(value)#

Bases: str, Enum

An enumeration.

avg = 'Average'#
min = 'Minimum'#
max = 'Maximum'#
sum = 'Sum'#
pct = 'Percentage'#
class Operator(value)#

Bases: str, Enum

An enumeration.

eq = 'is equal to'#
neq = 'is not equal to'#
gt = 'is greater than'#
lt = 'is less than'#
gte = 'is greater than or equal to'#
lte = 'is less than or equal to'#
pydantic model Condition#

Bases: BaseModel

Class for building custom conditions for data quality checks

After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.

With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:

  1. Is the average confidence less than 0.3?
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c.evaluate(df)
    
  2. Is the max DEP greater or equal to 0.45?
    >>> c = Condition(
    ...     agg=AggregateFunction.max,
    ...     metric="data_error_potential",
    ...     operator=Operator.gte,
    ...     threshold=0.45,
    ... )
    >>> c.evaluate(df)
    

By adding filters, you can further narrow down the scope of the condition. If the aggregate function is "pct", you don't need to specify a metric, as the filters will determine the percentage of data.

For example:

  1. Alert if over 80% of the dataset has confidence under 0.1
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.8,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="confidence", operator=Operator.lt, value=0.1
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.2,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  3. Alert if 5% or more of the dataset contains PII
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.05,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:

  1. Alert if the min confidence of drifted data is less than 0.15
    >>> c = Condition(
    ...     agg=AggregateFunction.min,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.15,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         )
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if over 50% of high DEP (>=0.7) data contains PII
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.5,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="data_error_potential", operator=Operator.gte, value=0.7
    ...         ),
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

You can also call conditions directly, which will assert its truth against a df

  1. Assert that average confidence is less than 0.3
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c(df)  # Will raise an AssertionError if False

Parameters:
  • metric – The DF column for evaluating the condition

  • agg – An aggregate function to apply to the metric

  • operator – The operator to use for comparing the agg to the threshold (e.g. β€œgt”, β€œlt”, β€œeq”, β€œneq”)

  • threshold – Threshold value for evaluating the condition

  • filters – Optional list of filters to apply to the DataFrame before evaluating the condition

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field agg: AggregateFunction [Required]#
field filters: List[ConditionFilter] [Optional]#
Validated by:
  • validate_filters

field metric: Optional[str] = None#
Validated by:
  • validate_metric

field operator: Operator [Required]#
field threshold: float [Required]#
evaluate(df)#
Return type:

Tuple[bool, float]

pydantic model ConditionFilter#

Bases: BaseModel

Filter a dataframe based on the column value

Note that the column used for filtering is the same as the metric used in the condition.

Parameters:
  • operator – The operator to use for filtering (e.g. β€œgt”, β€œlt”, β€œeq”, β€œneq”) See Operator

  • value – The value to compare against

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field metric: str [Required]#
field operator: Operator [Required]#
field value: Union[float, int, str, bool] [Required]#
disable_galileo()#
Return type:

None

disable_galileo_verbose()#
Return type:

None

enable_galileo_verbose()#
Return type:

None

enable_galileo()#
Return type:

None

auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)#

Automatically gets insights on a text classification or NER dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console

One of hf_data or train_data should be provided. If neither is provided, a demo dataset will be loaded by Galileo for training.

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Union[DatasetDict, str] Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.

  • hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you'd like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data

  • train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, you should use the hf_inference_names param. Optional inference datasets to run with after training completes. The structure is a dictionary with the key being the inference name and the value one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • max_padding_length (int) – The max length for padding the input text during tokenization. Default 200

  • hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased

  • num_train_epochs (int) – The number of epochs to train for (early stopping will always be active). Default 15

  • labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, Galileo will attempt to extract them from the data

  • project_name (Optional[str]) – Optional project name. If not set, a random name will be generated

  • run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated

  • wait (bool) – Whether to wait for Galileo to complete processing your run. Default True

  • create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(…, include_data_embs=True) in the data_emb col. Only available for TC currently. NER coming soon. Default True if a GPU is available, else default False.

  • early_stopping (bool) – Whether to use early stopping. Default True

Return type:

None

For text classification datasets, the only required columns are text and label

For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies

MIT Movies dataset in huggingface format

tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...       [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...       [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...       [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...       [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...       [0, 0, 0, 7, 0, 0, ...
...                                               ...                      ...

To see auto insights on a random, pre-selected dataset, simply run

import dataquality as dq

dq.auto()

An example using auto with a hosted huggingface text classification dataset

import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")

Similarly, for NER

import dataquality as dq

dq.auto(hf_data="conll2003")

An example using auto with sklearn data as pandas dataframes

import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
     train_data=df_train,
     test_data=df_test,
     labels=newsgroups_train.target_names,
     project_name="newsgroups_work",
     run_name="run_1_raw_data"
)

An example of using auto with a local CSV file with text and label columns

import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
class DataQuality(model=None, task=TaskType.text_classification, labels=None, train_data=None, test_data=None, val_data=None, project='', run='', framework=None, *args, **kwargs)#

Bases: object

Parameters:
  • model (Optional[Any]) – The model to inspect; if a string is passed, it will be assumed to be auto

  • task (TaskType) – Task type, for example "text_classification"

  • project (str) – Project name

  • run (str) – Run name

  • train_data (Optional[Any]) – Training data

  • test_data (Optional[Any]) – Optional test data

  • val_data (Optional[Any]) – Optional: validation data

  • labels (Optional[List[str]]) – The labels for the run

  • framework (Optional[ModelFramework]) – The framework to use; if provided it will be used instead of inferring it from the model. For example, if you have a torch model, you can pass framework="torch"

  • args (Any) – Additional arguments

  • kwargs (Any) – Additional keyword arguments

from dataquality import DataQuality

with DataQuality(model, "text_classification",
                 labels = ["neg", "pos"],
                 train_data = train_data) as dq:
    model.fit(train_data)

If you want to train without a model, you can use the auto framework:

from dataquality import DataQuality

with DataQuality(labels = ["neg", "pos"],
                 train_data = train_data) as dq:
    dq.finish()
get_metrics(split=Split.train)#
Return type:

Dict[str, Any]

auto_notebook()#
Return type:

None