dataquality.core package#
Submodules#
dataquality.core.auth module#
- login()#
Log into your Galileo environment.
The function will prompt you for an Authorization Token (API key) that you can access from the console.
To skip the prompt in automated workflows, set GALILEO_USERNAME (your email) and GALILEO_PASSWORD if you signed up with an email and password, or set GALILEO_API_KEY to your API key if you have one.
- Return type:
None
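For automated workflows, here is a minimal sketch of a non-interactive login using the environment variables described above; the console URL and API key values are placeholders, not real credentials:
.. code-block:: python

    import os

    # Set credentials before importing/logging in to skip the interactive prompt.
    # These values are placeholders; use your own console URL and API key.
    os.environ["GALILEO_CONSOLE_URL"] = "https://console.example-galileo.io"
    os.environ["GALILEO_API_KEY"] = "your-api-key"

    import dataquality as dq

    dq.login()  # reads the environment variables instead of prompting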
- logout()#
- Return type:
None
dataquality.core.finish module#
- finish(last_epoch=None, wait=True, create_data_embs=None, data_embs_col='text', upload_model=False)#
Finishes the current run and invokes a job
- Parameters:
  - last_epoch (Optional[int]) – If set, only epochs up to this value will be uploaded/processed. This is inclusive, so setting last_epoch to 5 would upload epochs 0, 1, 2, 3, 4, 5.
  - wait (bool) – If true, after uploading the data, this will wait for the run to be processed by the Galileo server. If false, you can manually wait for the run by calling dq.wait_for_run(). Default True.
  - create_data_embs (Optional[bool]) – If True, an off-the-shelf transformer will run on the raw text input to generate data-level embeddings. These will be available in the data view tab of the Galileo console. You can also access these embeddings via dq.metrics.get_data_embeddings(). Default True if a GPU is available, else default False.
  - data_embs_col (str) – Optional text column on which to compute data embeddings. If not set, we default to 'text', which corresponds to the input text. Can also be set to target, generated_output or any other column that is logged as metadata.
  - upload_model (bool) – If True, the model will be stored in the Galileo project. Default False, or set by the environment variable DQ_UPLOAD_MODEL.
- Return type:
str
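A minimal sketch of closing out a run once all inputs and outputs have been logged; the argument values shown are illustrative, not required:
.. code-block:: python

    import dataquality as dq

    # Upload the logged data and kick off processing on the Galileo server.
    # wait=True blocks until the run has been processed.
    result = dq.finish(wait=True, create_data_embs=False)  # returns a str (see above)

    # The link to the run in the UI can also be fetched explicitly.
    print(dq.get_run_link())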
- wait_for_run(project_name=None, run_name=None)#
Waits until a specific project run transitions from started to finished. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
  - project_name (Optional[str]) – The project name. Defaults to the current project if not passed in.
  - run_name (Optional[str]) – The run name. Defaults to the current run if not passed in.
- Return type:
None
- Returns:
None. Function returns after the run transitions to finished
- get_run_status(project_name=None, run_name=None)#
Returns the latest job of a specified project run. Defaults to the current run if project_name and run_name are empty. Raises error if only one of project_name and run_name is passed in.
- Parameters:
  - project_name (Optional[str]) – The project name. Defaults to the current project if not passed in.
  - run_name (Optional[str]) – The run name. Defaults to the current run if not passed in.
- Return type:
  Dict[str, Any]
- Returns:
  Dict[str, Any]. Response will have key status with value corresponding to the status of the latest job for the run. Other info, such as created_at, may be included.
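A small sketch of checking on a run after dq.finish(wait=False), using the two helpers above; the project and run names are placeholders:
.. code-block:: python

    import dataquality as dq

    # Block until the run moves from started to finished.
    dq.wait_for_run(project_name="my_project", run_name="my_run")

    # Or poll the latest job status without blocking.
    status = dq.get_run_status(project_name="my_project", run_name="my_run")
    print(status["status"])  # status of the latest job for the run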
dataquality.core.init module#
- class InitManager#
Bases:
object
- get_or_create_project(project_name)#
Gets a project by name, or creates a new one if it doesn’t exist.
- Returns:
The project and a boolean indicating if the project was created
- Return type:
Tuple[Dict, bool]
- get_or_create_run(project_name, run_name, task_type)#
Gets a run by name, or creates a new one if it doesn’t exist.
- Returns:
The run and a boolean indicating if the run was created
- Return type:
Tuple[Dict, bool]
- create_log_file_dir(project_id, run_id, overwrite_local)#
- Return type:
None
- create_run_name(project_name)#
Creates an auto-incrementing run_name for a given project
If a run_name is not passed into init, we create a run_name base with today’s date, and increment the digit at the end based on how many runs were created in this project with this scheme.
- Return type:
str
- e.g.:
  2023-05-15_1, 2023-05-15_2, 2023-05-15_3, …, 2023-05-15_n
- init(task_type, project_name=None, run_name=None, overwrite_local=True)#
Start a run.
Initialize a new run and new project, initialize a new run in an existing project, or reinitialize an existing run in an existing project.
Before creating the project, check:
- The user is valid; log in if not
- The DQ client version is compatible with the API version
Optionally provide project and run names to create a new project/run or restart existing ones.
- Return type:
  None
- Parameters:
  - task_type (str) – The task type for modeling. This must be one of the valid dataquality.schemas.task_type.TaskType options.
  - project_name (Optional[str]) – The project name. If not passed in, a random one will be generated. If provided and the project does not exist, it will be created. If it does exist, it will be set.
  - run_name (Optional[str]) – The run name. If not passed in, a random one will be generated. If provided and the run does not exist, it will be created. If it does exist, it will be set.
  - overwrite_local (bool) – If True, the current project/run log directory will be cleared during this function. If logging over many sessions with checkpoints, you may want to set this to False. Default True.
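A minimal sketch of starting a run, assuming "text_classification" is one of the valid TaskType options; the project and run names are placeholders:
.. code-block:: python

    import dataquality as dq

    dq.login()

    # Creates the project/run if they don't exist, otherwise re-uses them.
    dq.init(
        task_type="text_classification",  # must be a valid TaskType
        project_name="example_project",   # placeholder name
        run_name="example_run",           # placeholder name
    )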
- delete_run(project_name, run_name)#
Deletes a run from Galileo
- Return type:
None
dataquality.core.log module#
- log_data_samples(*, texts, ids, meta=None, **kwargs)#
Logs a batch of input samples for model training/test/validation/inference.
Fields are expected as lists of their content. Field names are the plural of those in log_input_sample (text -> texts). The expected arguments come from the task_type being used: See dq.docs() for details.
ex (text classification):
.. code-block:: python

    from typing import List

    all_labels = ["A", "B", "C"]
    dq.set_labels_for_run(labels=all_labels)

    texts: List[str] = [
        "Text sample 1",
        "Text sample 2",
        "Text sample 3",
        "Text sample 4",
    ]
    labels: List[str] = ["B", "C", "A", "A"]
    meta = {
        "sample_importance": ["high", "low", "low", "medium"],
        "quality_ranking": [9.7, 2.4, 5.5, 1.2],
    }
    ids: List[int] = [0, 1, 2, 3]
    split = "training"

    dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
- Parameters:
  - texts (List[str]) – List[str] the input samples to your model
  - ids (List[int]) – List[int | str] the ids per sample
  - split – Optional[str] the split for this data. Can also be set via dq.set_split
  - meta (Optional[Dict[str, List[Union[str, float, int]]]]) – Dict[str, List[str | int | float]]. Log additional metadata fields to each sample. The name of the field is the key of the dictionary, and the values are a list that correspond in length and order to the text samples.
  - kwargs (Any) – See dq.docs() for details on other task-specific parameters
- Return type:
  None
- log_data_sample(*, text, id, **kwargs)#
Log a single input example to disk
Fields are expected as singular elements. Field names are the singular of those in log_input_samples (texts -> text). The expected arguments come from the task_type being used: See dq.docs() for details.
- Parameters:
  - text (str) – The input sample to your model
  - id (int) – The id for this sample
  - split – Optional[str] the split for this data. Can also be set via dq.set_split
  - kwargs (Any) – See dq.docs() for details on other task-specific parameters
- Return type:
None
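A minimal sketch of logging one sample at a time, assuming a text classification run where label is accepted as a task-specific keyword argument:
.. code-block:: python

    import dataquality as dq

    dq.set_labels_for_run(["A", "B", "C"])
    dq.set_split("training")

    # Log a single example; label is a task-specific kwarg here (assumption).
    dq.log_data_sample(text="Text sample 1", id=0, label="B")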
- log_image_dataset(dataset, *, imgs_local_colname=None, imgs_remote=None, batch_size=100000, id='id', label='label', split=None, inference_name=None, meta=None, parallel=False, **kwargs)#
Log an image dataset of input samples for image classification
- Parameters:
  - dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The dataset to log. This can be a Pandas/HF dataframe or an ImageFolder (from Torchvision).
  - imgs_local_colname (Optional[str]) – The name of the column containing the local images (typically paths, but could also be bytes for HF dataframes). Ignored for ImageFolder, where local paths are directly retrieved from the dataset.
  - imgs_remote (Optional[str]) – The name of the column containing paths to the remote images (in the case of a df) or the remote directory containing the images (in the case of ImageFolder). Specifying this argument is required to skip uploading the images.
  - batch_size (int) – Number of samples to log in a batch. Default 10,000
  - id (str) – The name of the column containing the ids (in the dataframe)
  - label (str) – The name of the column containing the labels (in the dataframe)
  - split (Optional[Split]) – train/test/validation/inference. Can be set here or via dq.set_split
  - inference_name (Optional[str]) – If logging inference data, a name for this inference data is required. Can be set here or via dq.set_split
  - parallel (bool) – upload in parallel if set to True
- Return type:
None
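A sketch of logging an image classification dataset from a pandas dataframe; the column names and remote bucket paths are placeholders:
.. code-block:: python

    import pandas as pd
    import dataquality as dq

    df = pd.DataFrame(
        {
            "id": [0, 1],
            "label": ["cat", "dog"],
            "image_path": ["imgs/0.jpg", "imgs/1.jpg"],                 # local files
            "remote_path": ["gs://bucket/0.jpg", "gs://bucket/1.jpg"],  # optional
        }
    )

    dq.log_image_dataset(
        df,
        imgs_local_colname="image_path",
        imgs_remote="remote_path",  # provide to skip uploading the local images
        split="training",
    )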
- log_xgboost(model, X, *, y=None, feature_names=None, split=None, inference_name=None)#
Log data for tabular classification models with XGBoost
X can be logged as a numpy array or pandas DataFrame. If a numpy array is provided, feature_names must be provided. If a pandas DataFrame is provided, feature_names will be inferred from the column names.
Example with numpy arrays:
.. code-block:: python

    import xgboost as xgb
    from sklearn.datasets import load_wine

    wine = load_wine()

    X = wine.data
    y = wine.target
    feature_names = wine.feature_names

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X, y=y, feature_names=feature_names, split="training")

    # or for inference
    dq.log_xgboost(
        model, X, feature_names=feature_names, split="inference", inference_name="my_inference"
    )

Example with pandas DataFrames:
.. code-block:: python

    import xgboost as xgb
    from sklearn.datasets import load_wine

    X, y = load_wine(as_frame=True, return_X_y=True)

    model = xgb.XGBClassifier()
    model.fit(X, y)

    dq.log_xgboost(model, X=X, y=y, split="training")

    # or for inference
    dq.log_xgboost(
        model, X=X, split="inference", inference_name="my_inference"
    )
- Parameters:
  - model (XGBClassifier) – XGBClassifier model fit on the training data
  - X (Union[DataFrame, ndarray]) – The input data as a numpy array or pandas DataFrame. Data should have shape (n_samples, n_features)
  - y (Union[Series, ndarray, List, None]) – Optional pandas Series, List, or numpy array of ground truth labels with shape (n_samples,). Provide for non-inference splits only
  - feature_names (Optional[List[str]]) – List of feature names if X is input as a numpy array. Must have length n_features
  - split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
  - inference_name (Optional[str]) – Optional[str] the inference_name for this data. Can also be set via dq.set_split
- Return type:
None
- log_dataset(dataset, *, batch_size=100000, text='text', id='id', split=None, meta=None, **kwargs)#
Log an iterable or other dataset to disk. Useful for logging memory mapped files
Dataset provided must be an iterable that can be traversed row by row, and for each row, the fields can be indexed into either via string keys or int indexes. Pandas and Vaex dataframes are also allowed, as well as HuggingFace Datasets
Valid examples:

    d = [
        {"my_text": "sample1", "my_labels": "A", "my_id": 1, "sample_quality": 5.3},
        {"my_text": "sample2", "my_labels": "A", "my_id": 2, "sample_quality": 9.1},
        {"my_text": "sample3", "my_labels": "B", "my_id": 3, "sample_quality": 2.7},
    ]
    dq.log_dataset(
        d, text="my_text", id="my_id", label="my_labels", meta=["sample_quality"]
    )

Logging a pandas dataframe, df:

    #      text  label  id  sample_quality
    # 0  sample1      A   1             5.3
    # 1  sample2      A   2             9.1
    # 2  sample3      B   3             2.7
    # We don't need to set text, id, or label because they match the defaults
    dq.log_dataset(df, meta=["sample_quality"])

Logging an iterable of tuples:

    d = [
        ("sample1", "A", "ID1"),
        ("sample2", "A", "ID2"),
        ("sample3", "B", "ID3"),
    ]
    dq.log_dataset(d, text=0, id=2, label=1)

Invalid example:

    d = {
        "my_text": ["sample1", "sample2", "sample3"],
        "my_labels": ["A", "A", "B"],
        "my_id": [1, 2, 3],
        "sample_quality": [5.3, 9.1, 2.7],
    }

In the invalid case, use dq.log_data_samples:

    meta = {"sample_quality": d["sample_quality"]}
    dq.log_data_samples(
        texts=d["my_text"], labels=d["my_labels"], ids=d["my_id"], meta=meta
    )
Keyword arguments are specific to the task type. See dq.docs() for details
- Parameters:
  - dataset (TypeVar(DataSet, bound=Union[Iterable, DataFrame, Dataset, DataFrame])) – The iterable or dataframe to log
  - batch_size (int) – The number of data samples to log at a time. Useful when logging a memory mapped dataset. A larger batch_size will result in faster logging at the expense of more memory usage. Default 100,000
  - text (Union[str, int]) – str | int The column, key, or int index for text data. Default "text"
  - id (Union[str, int]) – str | int The column, key, or int index for id data. Default "id"
  - split (Optional[Split]) – Optional[str] the split for this data. Can also be set via dq.set_split
  - meta (Union[List[str], List[int], None]) – List[str | int] Additional keys/columns of your input data to be logged as metadata. Consider a pandas dataframe: this would be the list of columns corresponding to each metadata field to log
  - kwargs (Any) – See help(dq.get_data_logger().log_dataset) for more details here, or dq.docs() for more general task details
- Return type:
  None
- log_model_outputs(*, ids, embs=None, split=None, epoch=None, logits=None, probs=None, log_probs=None, inference_name=None, exclude_embs=False)#
Logs model outputs during training/test/validation.
- Parameters:
  - ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples
  - embs (Union[List, ndarray, None]) – The embeddings per output sample
  - split (Optional[Split]) – The current split. Must be set either here or via dq.set_split
  - epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch
  - logits (Union[List, ndarray, None]) – The logits for each sample
  - probs (Union[List, ndarray, None]) – Deprecated, use logits. If passed in, a softmax will NOT be applied
  - inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.
  - exclude_embs (bool) – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.
- Return type:
None
The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shapes.
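A sketch of logging outputs from inside a training loop; the embedding and logit shapes follow the one-row-per-sample convention of a text classification run (an assumption, see dq.docs() for your task type):
.. code-block:: python

    import numpy as np
    import dataquality as dq

    ids = [0, 1, 2, 3]             # must match the logged input ids
    embs = np.random.rand(4, 768)  # one embedding vector per sample
    logits = np.random.rand(4, 3)  # one logit vector per sample (3 classes)

    dq.log_model_outputs(
        ids=ids, embs=embs, logits=logits, split="training", epoch=0
    )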
- log_od_model_outputs(*, ids, pred_boxes, gold_boxes, labels, pred_embs, gold_embs, image_size, embs=None, probs=None, logits=None, split, epoch=None, inference_name=None)#
Logs model outputs during training/test/validation.
- Parameters:
  - ids (Union[List, ndarray]) – The ids for each sample. Must match input ids of logged samples
  - pred_boxes (List[ndarray]) – The predicted bounding boxes for each sample
  - gold_boxes (List[ndarray]) – The ground truth bounding boxes for each sample
  - labels (List[ndarray]) – The labels for each sample (classes for each bounding box)
  - pred_embs (List[ndarray]) – The embeddings for each predicted sample
  - gold_embs (List[ndarray]) – The embeddings for each ground truth sample
  - image_size (Optional[Tuple[int, int]]) – The size of the image
  - embs (Union[List, ndarray, None]) – The embeddings per output sample
  - logits (Union[List, ndarray, None]) – The logits for each sample
  - split (Split) – The current split. Must be set either here or via dq.set_split
  - epoch (Optional[int]) – The current epoch. Must be set either here or via dq.set_epoch
  - inference_name (Optional[str]) – Inference name indicator for this inference split. If logging for an inference split, this is required.
  - exclude_embs – Optional flag to exclude embeddings from logging. If True and embs is set to None, this will generate random embs for each sample.
- Return type:
None
The expected argument shapes come from the task_type being used. See dq.docs() for more task-specific details on parameter shapes.
- set_labels_for_run(labels)#
Creates the mapping of the labels for the model to their respective indexes.
- Return type:
  None
- Parameters:
  - labels (Union[List[List[str]], List[str]]) – An ordered list of labels (e.g. ['dog', 'cat', 'fish']). If this is a multi-label task type, then labels is a list of lists where each inner list indicates the labels for the given task.
This order MUST match the order of probabilities that the model outputs.
In the multi-label case, the outer order (order of the tasks) must match the task order of the task probabilities logged as well.
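A short sketch of both the single-task and multi-label forms; the task and label names are illustrative:
.. code-block:: python

    import dataquality as dq

    # Single-task: order must match the model's output probabilities.
    dq.set_labels_for_run(["dog", "cat", "fish"])

    # Multi-label: one inner list of labels per task, in task order.
    dq.set_labels_for_run(
        [["not_toxic", "toxic"], ["negative", "neutral", "positive"]]
    )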
- get_current_run_labels()#
Returns the current run labels, if there are any
- Return type:
  Optional[List[str]]
- set_tasks_for_run(tasks, binary=True)#
Sets the task names for the run (multi-label case only).
This order MUST match the order of the labels list provided in log_input_data and the order of the probability vectors provided in log_model_outputs.
This also must match the order of the labels logged in set_labels_for_run (meaning that the first list of labels must be the labels of the first task passed in here)
- Return type:
None
- Parameters:
  - tasks (List[str]) – The list of tasks for your run
  - binary (bool) – Whether this is a binary multi-label run. If True, tasks will also be set as your labels, and you should NOT call dq.set_labels_for_run; it will be handled for you. Default True
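A small sketch for a multi-label run; the task names are placeholders:
.. code-block:: python

    import dataquality as dq

    # Binary multi-label: tasks double as labels, so set_labels_for_run is not needed.
    dq.set_tasks_for_run(["toxicity", "sentiment"], binary=True)

    # Non-binary multi-label: also provide one label list per task, in the same order.
    # dq.set_tasks_for_run(["toxicity", "sentiment"], binary=False)
    # dq.set_labels_for_run([["not_toxic", "toxic"], ["negative", "neutral", "positive"]])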
- set_tagging_schema(tagging_schema)#
Sets the tagging schema for NER models
Only valid for text_ner task_types. Others will throw an exception
- Return type:
None
- get_model_logger(task_type=None, *args, **kwargs)#
- Return type:
- get_data_logger(task_type=None, *args, **kwargs)#
- Return type:
- docs()#
Print the documentation for your specific input and output logging format
Based on your task_type, this will print the appropriate documentation
- Return type:
None
- set_epoch(epoch)#
Set the current epoch.
When set, logging model outputs will use this if not logged explicitly
- Return type:
None
- set_split(split, inference_name=None)#
Set the current split.
When set, logging data inputs/model outputs will use this if not logged explicitly. When setting split to inference, inference_name must be included.
- Return type:
None
- set_epoch_and_split(epoch, split, inference_name=None)#
Set the current epoch and the current split. When set, logging data inputs/model outputs will use these if not logged explicitly. When setting split to inference, inference_name must be included.
- Return type:
None
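A sketch of using this setter so that per-batch logging calls can omit epoch and split; the shapes and batch ids are illustrative:
.. code-block:: python

    import numpy as np
    import dataquality as dq

    for epoch in range(3):
        dq.set_epoch_and_split(epoch, "training")
        for batch_ids in ([0, 1], [2, 3]):
            embs = np.random.rand(len(batch_ids), 768)
            logits = np.random.rand(len(batch_ids), 3)
            # epoch and split are picked up from the setter above
            dq.log_model_outputs(ids=batch_ids, embs=embs, logits=logits)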
- get_run_link(project_name=None, run_name=None)#
Gets the link to the run in the UI
- Return type:
str
dataquality.core.report module#
- register_run_report(conditions, emails)#
Register conditions and emails for a run report.
After a run is finished, a report will be sent to the specified emails.
- Return type:
None
- build_run_report(conditions, emails, project_id, run_id, link)#
Build a run report and send it to the specified emails.
- Return type:
None
Module contents#
- configure(do_login=True, _internal=False)#
Update your active config with new information
You can use environment variables to set the config, or wait for prompts. Available environment variables to update:
- GALILEO_CONSOLE_URL
- GALILEO_USERNAME
- GALILEO_PASSWORD
- GALILEO_API_KEY
- Return type:
None
- set_console_url(console_url=None)#
For Enterprise users. Set the console URL to your Galileo Environment.
You can also set GALILEO_CONSOLE_URL before importing dataquality to bypass this.
- Return type:
  None
- Parameters:
  - console_url (Optional[str]) – If set, that will be used. Otherwise, if the environment variable GALILEO_CONSOLE_URL is set, that will be used. Otherwise, you will be prompted for a URL.
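A minimal sketch of pointing the client at an enterprise console; the URL is a placeholder:
.. code-block:: python

    import os

    # Either set the environment variable before importing dataquality...
    os.environ["GALILEO_CONSOLE_URL"] = "https://console.mycompany.galileo.io"

    import dataquality as dq

    # ...or set it explicitly (skips the prompt), then log in.
    dq.set_console_url("https://console.mycompany.galileo.io")
    dq.login()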