dataquality.core package#

Submodules#

dataquality.core.auth module#

login()#

Log into your Galileo environment.

The function will prompt you for an Authorization Token (API key), which you can access from the console.

To skip the prompt for automated workflows, you can set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password. You can set GALILEO_API_KEY to your API key if you have one.

Return type:

None
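
For example, a non-interactive login in an automated workflow might look like the sketch below (the environment variable names come from the description above; the key value is a placeholder):

    import os
    import dataquality as dq

    # Set credentials before calling login() so no prompt is shown.
    # Use either GALILEO_USERNAME/GALILEO_PASSWORD or an API key.
    os.environ["GALILEO_API_KEY"] = "<your-api-key>"

    dq.login()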

logout()#
Return type:

None

dataquality.core.finish module#

finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#

Finishes the current run and invokes a job

Parameters:
  • last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed. This is inclusive, so setting last_epoch to 5 would upload epochs 0, 1, 2, 3, 4, and 5

  • wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run(). Default True

  • create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.

  • data_embs_col (str) – Optional text column on which to compute data embeddings. If not set, defaults to 'text', which corresponds to the input text. Can also be set to target, generated_output, or any other column that is logged as metadata.

  • upload_model (bool) – If True, the model will be stored in the Galileo project. Default False, or set by the environment variable DQ_UPLOAD_MODEL.

Return type:

str
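
A typical end-of-training call might look like this sketch (argument values are illustrative):

    import dataquality as dq

    # Upload all logged data, wait for server-side processing, and
    # compute off-the-shelf data embeddings on the default "text" column.
    dq.finish(wait=True, create_data_embs=True)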

wait_for_run(project_name=None, run_name=None)#

Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises an error if only one of project_name and run_name is passed in.

Parameters:
  • project_name (Optional[str]) – The project name. Default to current project if not passed in.

  • run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

None

Returns:

None. The function returns after the run transitions to finished.

get_run_status(project_name=None, run_name=None)#

Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises an error if only one of project_name and run_name is passed in.

Parameters:
  • project_name (Optional[str]) – The project name. Default to current project if not passed in.

  • run_name (Optional[str]) – The run name. Default to current run if not passed in.

Return type:

Dict[str, Any]

Returns:

Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.
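
For example, checking the latest job status for the current run (the status key comes from the description above; the printed value is illustrative):

    import dataquality as dq

    status = dq.get_run_status()   # defaults to the current project/run
    print(status["status"])        # status string of the latest job for the run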

dataquality.core.init module#

class InitManager#

Bases: object

get_or_create_project(project_name)#

Gets a project by name, or creates a new one if it doesn’t exist.

Returns:

The project and a boolean indicating if the project was created

Return type:

Tuple[Dict, bool]

get_or_create_run(project_name, run_name, task_type)#

Gets a run by name, or creates a new one if it doesn’t exist.

Returns:

The run and a boolean indicating if the run was created

Return type:

Tuple[Dict, bool]

create_log_file_dir(project_id, run_id, overwrite_local)#
Return type:

None

create_run_name(project_name)#

Creates an auto-incrementing run_name for a given project

If a run_name is not passed into init, we create a run_name base with today’s date, and increment the digit at the end based on how many runs were created in this project with this scheme.

Return type:

str

e.g.:

2023-05-15_1, 2023-05-15_2, 2023-05-15_3, …, 2023-05-15_n

init(task_type, project_name=None, run_name=None, overwrite_local=True)#

Start a run

Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.

Before creating the project, check:
  • The user is valid; log in if not
  • The DQ client version is compatible with the API version

Optionally provide project and run names to create a new project/run or restart existing ones.

Return type:

None

Parameters:
  • task_type (str) – The task type for modeling. This must be one of the valid dataquality.schemas.task_type.TaskType options

  • project_name (Optional[str]) – The project name. If not passed in, a random one will be generated. If provided and the project does not exist, it will be created. If it does exist, it will be set.

  • run_name (Optional[str]) – The run name. If not passed in, a random one will be generated. If provided and the run does not exist, it will be created. If it does exist, it will be set.

  • overwrite_local (bool) – If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True
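
A minimal sketch of starting a run (the project and run names are placeholders; text_classification is one of the valid TaskType options):

    import dataquality as dq

    dq.init(
        task_type="text_classification",
        project_name="my_project",     # created if it does not exist
        run_name="my_first_run",       # auto-generated if omitted
    )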

delete_run(project_name, run_name)#

Deletes a run from Galileo

Return type:

None

dataquality.core.log module#

log_data_samples(*, texts, ids, meta=None, **kwargs)#

Logs a batch of input samples for model training/test/validation/inference.

Fields are expected as lists of their content. Field names are the plural of those in log_data_sample (text -> texts). The expected arguments come from the task_type being used: see dq.docs() for details.

Example (text classification):

    all_labels = ["A", "B", "C"]
    dq.set_labels_for_run(labels=all_labels)

    texts: List[str] = [
        "Text sample 1",
        "Text sample 2",
        "Text sample 3",
        "Text sample 4",
    ]
    labels: List[str] = ["B", "C", "A", "A"]
    meta = {
        "sample_importance": ["high", "low", "low", "medium"],
        "quality_ranking": [9.7, 2.4, 5.5, 1.2],
    }
    ids: List[int] = [0, 1, 2, 3]
    split = "training"

    dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)

Parameters:
  • texts (List[str]) – List[str] the input samples to your model

  • ids (List[int]) – List[int | str] the ids per sample

  • split – Optional[str] the split for this data. Can also be set via dq.set_split

  • meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields for each sample. The name of the field is the key of the dictionary, and the values are a list that corresponds in length and order to the text samples.

  • kwargs (Any) – See dq.docs() for details on other task-specific parameters

Return type:

None

log_data_sample(*, text, id, **kwargs)#

Log a single input example to disk

Fields are expected as singular elements. Field names are the singular of those in log_data_samples (texts -> text). The expected arguments come from the task_type being used: see dq.docs() for details.

Parameters:
  • text (str) – str the input sample to your model

  • id (int) – int | str the id for this sample

  • split – Optional[str] the split for this data. Can also be set via dq.set_split

  • kwargs (Any) – See dq.docs() for details on other task specific parameters

Return type:

None
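
For instance, a single text classification sample might be logged like this sketch (the label keyword is a task-specific kwarg; values are placeholders):

    import dataquality as dq

    dq.log_data_sample(
        text="Text sample 1",
        id=0,
        label="A",          # task-specific kwarg for text classification
        split="training",
    )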

log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#

Log an image dataset of input samples for image classification

Parameters:
  • dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).

  • imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths but could also be bytes for HF dataframes). Ignored for ImageFolder where local paths are directly retrieved from the dataset.

  • imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.

  • batch_size (int) – Number of samples to log in a batch. Default 100,000

  • id (str) – The name of the column containing the ids (in the dataframe)

  • label (str) – The name of the column containing the labels (in the dataframe)

  • split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split

  • inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split

  • parallel (bool) – upload in parallel if set to True

Return type:

None
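
A minimal sketch of logging an image classification dataframe (the column names are placeholders; providing imgs_remote instead would skip uploading the images):

    import pandas as pd
    import dataquality as dq

    df = pd.DataFrame(
        {
            "id": [0, 1],
            "label": ["cat", "dog"],
            "path": ["images/0.png", "images/1.png"],  # local image paths
        }
    )

    dq.log_image_dataset(
        df,
        imgs_local_colname="path",
        id="id",
        label="label",
        split="training",
    )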

log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#

Log data for tabular classification models with XGBoost

X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.

Example with numpy arrays:

    import xgboost as xgb
    from sklearn.datasets import load_wine

    wine = load_wine()

    X = wine.data
    y = wine.target
    feature_names = wine.feature_names

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X, y=y, feature_names=feature_names, split="training")

    # or for inference
    dq.log_xgboost(
        model, X, feature_names=feature_names, split="inference", inference_name="my_inference"
    )

Example with pandas DataFrames:

    import xgboost as xgb
    from sklearn.datasets import load_wine

    X, y = load_wine(as_frame=True, return_X_y=True)

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X=X, y=y, split="training")

    # or for inference
    dq.log_xgboost(
        model, X=X, split="inference", inference_name="my_inference"
    )

Parameters:
  • model (XGBClassifier) – XGBClassifier model fit on the training data

  • X (Union[DataFrame, ndarray]) – The input data as a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)

  • y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference only

  • feature_names (Optional[List[str]]) – List of feature names if X is input as numpy array. Must have length n_features

  • split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split

  • inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split

Return type:

None

log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#

Log an iterable or other dataset to disk. Useful for logging memory mapped files

Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets

Valid examples:

    d = [
        {"my_text": "sample1", "my_labels": "A", "my_id": 1, "sample_quality": 5.3},
        {"my_text": "sample2", "my_labels": "A", "my_id": 2, "sample_quality": 9.1},
        {"my_text": "sample3", "my_labels": "B", "my_id": 3, "sample_quality": 2.7},
    ]
    dq.log_dataset(
        d, text="my_text", id="my_id", label="my_labels", meta=["sample_quality"]
    )

Logging a pandas dataframe, df:

          text label  id  sample_quality
    0  sample1     A   1             5.3
    1  sample2     A   2             9.1
    2  sample3     B   3             2.7

    # We don't need to set text, id, or label because they match the defaults
    dq.log_dataset(df, meta=["sample_quality"])

Logging an iterable of tuples:

    d = [
        ("sample1", "A", "ID1"),
        ("sample2", "A", "ID2"),
        ("sample3", "B", "ID3"),
    ]
    dq.log_dataset(d, text=0, id=2, label=1)

Invalid example (a dict of columns cannot be traversed row by row):

    d = {
        "my_text": ["sample1", "sample2", "sample3"],
        "my_labels": ["A", "A", "B"],
        "my_id": [1, 2, 3],
        "sample_quality": [5.3, 9.1, 2.7],
    }

In the invalid case, use dq.log_data_samples:

    meta = {"sample_quality": d["sample_quality"]}
    dq.log_data_samples(
        texts=d["my_text"], labels=d["my_labels"], ids=d["my_id"], meta=meta
    )

Keyword arguments are specific to the task type. See dq.docs() for details

Parameters:
  • dataset (TypeVar(DataSet, bound= Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log

  • batch_size (int) – The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000

  • text (Union[str, int]) – str | int The column, key, or int index for text data. Default "text"

  • id (Union[str, int]) – str | int The column, key, or int index for id data. Default "id"

  • split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split

  • meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns in your input data to be logged as metadata. For a pandas dataframe, this would be the list of columns corresponding to each metadata field to log.

  • kwargs (Any) – See help(dq.get_data_logger().log_dataset) for more details here, or dq.docs() for more general task details

Return type:

None

log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#

Logs model outputs during training/test/validation.

Parameters:
  • ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples

  • embs (Union[List, ndarray, None]) – The embeddings per output sample

  • split (Optional[Split]) – The current split. Must be set either here or via dq.set_split

  • epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch

  • logits (Union[List, ndarray, None]) – The logits for each sample

  • probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied

  • inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.

  • exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.

Return type:

None

The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shape.
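
A minimal per-batch sketch for a text classification run (array shapes are illustrative; ids must match the ids of previously logged input samples):

    import numpy as np
    import dataquality as dq

    batch_ids = [0, 1, 2, 3]
    embs = np.random.rand(4, 768)    # one embedding vector per sample
    logits = np.random.rand(4, 3)    # one logit vector per sample (3 classes)

    dq.log_model_outputs(
        ids=batch_ids,
        embs=embs,
        logits=logits,
        split="training",
        epoch=0,
    )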

log_od_model_outputs(*, ids, pred_boxes, gold_boxes, labels, pred_embs, gold_embs, image_size, embs=None, probs=None, logits=None, split, epoch=None, inference_name=None)#

Logs model outputs during training/test/validation.

Parameters:
  • ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples

  • pred_boxes (List[ndarray]) – The predicted bounding boxes for each sample

  • gold_boxes (List[ndarray]) – The ground truth bounding boxes for each sample

  • labels (List[ndarray]) – The labels for each sample (classes for each bounding box)

  • pred_embs (List[ndarray]) – The embeddings for each predicted sample

  • gold_embs (List[ndarray]) – The embeddings for each ground truth sample

  • image_size (Optional[Tuple[int, int]]) – The size of the image

  • embs (Union[List, ndarray, None]) – The embeddings per output sample

  • logits (Union[List, ndarray, None]) – The logits for each sample

  • split (Split) – The current split. Must be set either here or via dq.set_split

  • epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch

  • inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.

  • exclude_embs – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.

Return type:

None

The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shape.
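
A rough sketch for a single image in an object detection run. The box format and embedding dimensions below are assumptions for illustration (boxes as (n, 4) arrays, embeddings as (n, d) arrays); check dq.docs() for the exact shapes your task expects:

    import numpy as np
    import dataquality as dq

    dq.log_od_model_outputs(
        ids=[0],
        pred_boxes=[np.array([[10, 20, 50, 80]])],   # predicted boxes for image 0 (assumed format)
        gold_boxes=[np.array([[12, 18, 52, 78]])],   # ground truth boxes for image 0
        labels=[np.array([1])],                      # class per ground truth box
        pred_embs=[np.random.rand(1, 512)],          # embedding per predicted box
        gold_embs=[np.random.rand(1, 512)],          # embedding per ground truth box
        image_size=(640, 480),
        split="training",
        epoch=0,
    )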

set_labels_for_run(labels)#

Creates the mapping of the labels for the model to their respective indexes.

Parameters:

labels (Union[List[List[str]], List[str]]) – An ordered list of labels (e.g. ['dog', 'cat', 'fish']).

If this is a multi-label type, then labels is a list of lists where each inner list indicates the labels for the given task.

This order MUST match the order of probabilities that the model outputs.

In the multi-label case, the outer order (order of the tasks) must match the task-order of the task-probabilities logged as well.

Return type:

None

get_current_run_labels()#

Returns the current run labels, if there are any

Return type:

Optional[List[str]]

set_tasks_for_run(tasks, binary=True)#

Sets the task names for the run (multi-label case only).

This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.

This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)

Return type:

None

Parameters:
  • tasks (List[str]) – The list of tasks for your run

  • binary (bool) – Whether this is a binary multi-label run. If true, tasks will also be set as your labels, and you should NOT call dq.set_labels_for_run; it will be handled for you. Default True
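
For example, a non-binary multi-label setup might look like this sketch (the task and label names are placeholders; the outer order of the label lists matches the task order):

    import dataquality as dq

    dq.set_tasks_for_run(["sentiment", "toxicity"], binary=False)
    dq.set_labels_for_run(
        [
            ["positive", "negative"],   # labels for the "sentiment" task
            ["toxic", "not_toxic"],     # labels for the "toxicity" task
        ]
    )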

set_tagging_schema(tagging_schema)#

Sets the tagging schema for NER models

Only valid for text_ner task_types. Others will throw an exception

Return type:

None

get_model_logger(task_type=None, *args, **kwargs)#
Return type:

BaseGalileoModelLogger

get_data_logger(task_type=None, *args, **kwargs)#
Return type:

BaseGalileoDataLogger

docs()#

Print the documentation for your specific input and output logging format

Based on your task_type, this will print the appropriate documentation

Return type:

None

set_epoch(epoch)#

Set the current epoch.

When set, logging model outputs will use this if not logged explicitly

Return type:

None

set_split(split, inference_name=None)#

Set the current split.

When set, logging data inputs/model outputs will use this if not logged explicitly. When setting the split to inference, inference_name must be included.

Return type:

None

set_epoch_and_split(epoch, split, inference_name=None)#

Set the current epoch and the current split. When set, logging data inputs/model outputs will use these if not logged explicitly. When setting the split to inference, inference_name must be included.

Return type:

None
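
For example, setting the epoch and split once per epoch so subsequent logging calls can omit them (the loop body is illustrative):

    import dataquality as dq

    for epoch in range(3):
        dq.set_epoch_and_split(epoch, "training")
        # ... run the epoch and call dq.log_model_outputs(ids=..., embs=..., logits=...)
        # without passing split or epoch explicitly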

Gets the link to the run in the UI

Return type:

str

dataquality.core.report module#

register_run_report(conditions, emails)#

Register conditions and emails for a run report.

After a run is finished, a report will be sent to the specified emails.

Return type:

None

build_run_report(conditions, emails, project_id, run_id, link)#

Build a run report and send it to the specified emails.

Return type:

None

Module contents#

configure(do_login=True, _internal=False)#

Update your active config with new information

You can use environment variables to set the config, or wait for prompts. Available environment variables to update:
  • GALILEO_CONSOLE_URL
  • GALILEO_USERNAME
  • GALILEO_PASSWORD
  • GALILEO_API_KEY

Return type:

None

set_console_url(console_url=None)#

For Enterprise users. Set the console URL to your Galileo environment.

You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this.

Parameters:

console_url (Optional[str]) – If set, that will be used. Otherwise, if the environment variable GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a URL.

Return type:

None
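
A minimal sketch for an enterprise deployment (the console URL is a placeholder):

    import dataquality as dq

    dq.set_console_url("https://console.mycompany.example.com")
    dq.login()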