dataquality.schemas package#

Submodules#

dataquality.schemas.condition module#

class Operator(value)#

Bases: str, Enum

An enumeration.

eq = 'is equal to'#
neq = 'is not equal to'#
gt = 'is greater than'#
lt = 'is less than'#
gte = 'is greater than or equal to'#
lte = 'is less than or equal to'#
class AggregateFunction(value)#

Bases: str, Enum

An enumeration.

avg = 'Average'#
min = 'Minimum'#
max = 'Maximum'#
sum = 'Sum'#
pct = 'Percentage'#
pydantic model ConditionFilter#

Bases: BaseModel

Filter a dataframe based on the column value

Note that the column used for filtering is the same as the metric used in the condition.

Parameters:
  • operator – The operator to use for filtering (e.g. “gt”, “lt”, “eq”, “neq”). See Operator

  • value – The value to compare against

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field metric: str [Required]#
field operator: Operator [Required]#
field value: Union[float, int, str, bool] [Required]#
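
For example, a filter keeping only rows where the confidence column is below 0.1 (a minimal sketch mirroring the parameters above):

>>> f = ConditionFilter(
...     metric="confidence", operator=Operator.lt, value=0.1
... )
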
pydantic model Condition#

Bases: BaseModel

Class for building custom conditions for data quality checks

After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.

With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:

  1. Is the average confidence less than 0.3?
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c.evaluate(df)
    
  2. Is the max DEP greater or equal to 0.45?
    >>> c = Condition(
    ...     agg=AggregateFunction.max,
    ...     metric="data_error_potential",
    ...     operator=Operator.gte,
    ...     threshold=0.45,
    ... )
    >>> c.evaluate(df)
    

By adding filters, you can further narrow down the scope of the condition. If the aggregate function is “pct”, you don’t need to specify a metric, as the filters will determine the percentage of data.

For example:

  1. Alert if over 80% of the dataset has confidence under 0.1
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.8,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="confidence", operator=Operator.lt, value=0.1
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.2,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  3. Alert if 5% or more of the dataset contains PII
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.05,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:

  1. Alert if the min confidence of drifted data is less than 0.15
    >>> c = Condition(
    ...     agg=AggregateFunction.min,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.15,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         )
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if over 50% of high DEP (>=0.7) data contains PII
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.5,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="data_error_potential", operator=Operator.gte, value=0.7
    ...         ),
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

You can also call conditions directly, which will assert their truth against a df

  1. Assert that average confidence is less than 0.3
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c(df)  # Will raise an AssertionError if False

Parameters:
  • metric – The DF column for evaluating the condition

  • agg – An aggregate function to apply to the metric

  • operator – The operator to use for comparing the agg to the threshold (e.g. “gt”, “lt”, “eq”, “neq”)

  • threshold – Threshold value for evaluating the condition

  • filters – Optional list of filters to apply to the DataFrame before evaluating the condition. See ConditionFilter

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field agg: AggregateFunction [Required]#
field filters: List[ConditionFilter] [Optional]#
Validated by:
  • validate_filters

field metric: Optional[str] = None#
Validated by:
  • validate_metric

field operator: Operator [Required]#
field threshold: float [Required]#
evaluate(df)#
Return type:

Tuple[bool, float]
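
Since evaluate returns both the truth value and the computed aggregate, you can unpack them directly (a usage sketch; c is a Condition built as above and df any supported DataFrame):

>>> passed, value = c.evaluate(df)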

dataquality.schemas.cv module#

class CVSmartFeatureColumn(value)#

Bases: str, Enum

A class holding the column names used by the smart feature methods. When updated, the corresponding schema in rungalileo also needs to be updated.

image_path: str = 'sf_image_path'#
height: str = 'sf_height'#
width: str = 'sf_width'#
channels: str = 'sf_channels'#
hash: str = 'sf_hash'#
contrast: str = 'sf_contrast'#
overexp: str = 'sf_overexposed'#
underexp: str = 'sf_underexposed'#
blur: str = 'sf_blur'#
lowcontent: str = 'sf_content'#
outlier_size: str = 'has_odd_size'#
outlier_ratio: str = 'has_odd_ratio'#
outlier_near_duplicate_id: str = 'near_duplicate_id'#
outlier_near_dup: str = 'is_near_duplicate'#
outlier_channels: str = 'has_odd_channels'#
outlier_low_contrast: str = 'has_low_contrast'#
outlier_overexposed: str = 'is_overexposed'#
outlier_underexposed: str = 'is_underexposed'#
outlier_low_content: str = 'has_low_content'#
outlier_blurry: str = 'is_blurry'#

dataquality.schemas.dataframe module#

pydantic model BaseLoggerDataFrames#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field data: DataFrame [Required]#
field emb: DataFrame [Required]#
field prob: DataFrame [Required]#
class FileType(value)#

Bases: str, Enum

Valid file extensions for an exported dataframe

arrow = 'arrow'#
parquet = 'parquet'#
json = 'json'#
csv = 'csv'#
class DFVar(skip_upload='skip_upload', progress_name='progress_name')#

Bases: object

skip_upload: str = 'skip_upload'#
progress_name: str = 'progress_name'#

dataquality.schemas.edit module#

class EditAction(value)#

Bases: str, Enum

The available actions you can take in an edit

relabel = 'relabel'#
delete = 'delete'#
select_for_label = 'select_for_label'#
relabel_as_pred = 'relabel_as_pred'#
update_text = 'update_text'#
shift_span = 'shift_span'#
pydantic model Edit#

Bases: BaseModel

A class to help create edits via dq.metrics. An edit is a combination of a filter and some edit action. You can use this class, as well as dq.metrics.create_edit and dq.metrics.get_edited_dataframe, to create automated edits and improved datasets, leading to automated retraining pipelines.

Parameters:
  • edit_action – EditAction the type of edit: delete, relabel, relabel_as_pred, update_text, shift_span (ner only), and select_for_label (inference only)

  • new_label – Optional[str] needed if action is relabel, ignored otherwise. The new label to set for the edit

  • search_string – Optional[str] needed when action is text replacement or shift_span. The search string to use for the edit

  • use_regex – bool = False. Used for the search_string. When searching, whether to use regex or not. Default False.

  • shift_span_start_num_words – Optional[int] Needed if action is shift_span. How many words (forward or back) to shift the beginning of the span by

  • shift_span_end_num_words – Optional[int] Needed if action is shift_span. How many words (forward or back) to shift the end of the span by

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field edit_action: EditAction [Required]#
Validated by:
  • new_label_if_relabel

  • shift_span_validator

  • text_replacement_if_update_text

  • validate_edit_action_for_split

field filter: Optional[FilterParams] = None#
field inference_name: Optional[str] = None#
field new_label: Optional[Annotated[str]] = None#
field note: Optional[Annotated[str]] = None#
field project_id: Optional[Annotated[UUID]] = None#
field run_id: Optional[Annotated[UUID]] = None#
field search_string: Optional[Annotated[str]] = None#
field shift_span_end_num_words: Optional[Annotated[int]] = None#
field shift_span_start_num_words: Optional[Annotated[int]] = None#
field split: Optional[str] = None#
field task: Optional[str] = None#
field text_replacement: Optional[Annotated[str]] = None#
field use_regex: bool = False#
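
A minimal sketch of a relabel edit over a filtered subset (the label and filter values here are hypothetical):

>>> edit = Edit(
...     edit_action=EditAction.relabel,
...     new_label="positive",
...     filter=FilterParams(text_pat="awesome"),
... )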

dataquality.schemas.hf module#

class HFCol(input_ids='input_ids', text='text', id='id', ner_tags='ner_tags', text_token_indices='text_token_indices', tokens='tokens', bpe_tokens='bpe_tokens', gold_spans='gold_spans', labels='labels', ner_labels='ner_labels', tags='tags')#

Bases: object

input_ids: str = 'input_ids'#
text: str = 'text'#
id: str = 'id'#
ner_tags: str = 'ner_tags'#
text_token_indices: str = 'text_token_indices'#
tokens: str = 'tokens'#
bpe_tokens: str = 'bpe_tokens'#
gold_spans: str = 'gold_spans'#
labels: str = 'labels'#
ner_labels: str = 'ner_labels'#
tags: str = 'tags'#
static get_fields()#
Return type:

List[str]

class SpanKey(label='label', start='start', end='end')#

Bases: object

label: str = 'label'#
start: str = 'start'#
end: str = 'end'#

dataquality.schemas.job module#

class JobName(value)#

Bases: str, Enum

An enumeration.

default = 'default'#
inference = 'inference'#

dataquality.schemas.metrics module#

pydantic model HashableBaseModel#

Bases: BaseModel

Hashable BaseModel https://github.com/pydantic/pydantic/issues/1303

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

pydantic model MetaFilter#

Bases: HashableBaseModel

A class for filtering arbitrary metadata dataframe columns

For example, to filter on a logged metadata column, “is_happy” for values [True], you can create a MetaFilter(name=”is_happy”, isin=[True])

You can use this filter for any columns, not just metadata columns. For example, you can use this to filter for DEP scores above 0.5: MetaFilter(name=”data_error_potential”, greater_than=0.5)
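
In code, those two examples read as:

>>> happy_filter = MetaFilter(name="is_happy", isin=[True])
>>> high_dep_filter = MetaFilter(name="data_error_potential", greater_than=0.5)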

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field greater_than: Optional[float] = None#
field isin: Optional[List[str]] = None#
field less_than: Optional[float] = None#
field name: Annotated[str] [Required]#
Constraints:
  • strict = True

pydantic model InferenceFilter#

Bases: HashableBaseModel

A class for filtering an inference split

  • is_otb: Filters samples that are / are not On-The-Boundary

  • is_drifted: Filters samples that are / are not Drifted

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field is_drifted: Optional[bool] = None#
field is_otb: Optional[bool] = None#
pydantic model LassoSelection#

Bases: HashableBaseModel

Representation of a lasso selection (used during an embeddings selection)

x and y correspond to the cursor movement while tracing the lasso. This is natively provided by plotly when creating a lasso selection

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field x: List[float] [Required]#
Validated by:
  • validate_xy

field y: List[float] [Required]#
Validated by:
  • validate_xy
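
A minimal sketch, assuming x and y are the cursor coordinates captured from a plotly lasso event (the values are illustrative):

>>> lasso = LassoSelection(
...     x=[0.1, 0.5, 0.5, 0.1], y=[0.2, 0.2, 0.8, 0.8]
... )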

pydantic model FilterParams#

Bases: HashableBaseModel

A class for sending filters to the API alongside almost any request.

Each field represents things you can filter the dataframe on.

Parameters:
  • ids – List[int] = [] filter for specific IDs in the dataframe (span IDs for NER)

  • similar_to – Optional[int] = None provide an ID to run similarity search on

  • num_similar_to – Optional[int] = None if running similarity search, how many similar samples to return

  • text_pat – Optional[StrictStr] = None filter text samples by some text pattern

  • regex – Optional[bool] = None if searching with text, whether to use regex

  • data_error_potential_high – Optional[float] = None only samples with DEP <= this

  • data_error_potential_low – Optional[float] = None only samples with DEP >= this

  • misclassified_only – Optional[bool] = None Only look at misclassified samples

  • gold_filter – Optional[List[StrictStr]] = None filter GT classes

  • pred_filter – Optional[List[StrictStr]] = None filter prediction classes

  • meta_filter – Optional[List[MetaFilter]] = None see MetaFilter class

  • inference_filter – Optional[InferenceFilter] = None see InferenceFilter class

  • span_sample_ids – Optional[List[int]] = None (NER only) filter for full samples

  • span_text – Optional[str] = None (NER only) filter only on span text

  • exclude_ids – List[int] = [] opposite of ids

  • lasso – Optional[LassoSelection] = None see LassoSelection class

  • class_filter – Optional[List[StrictStr]] = None filter GT OR prediction

  • likely_mislabeled – Optional[bool] = None Filter for only likely_mislabeled samples. False/None will return all samples

  • likely_mislabeled_dep_percentile – Optional[int] A percentile threshold for likely mislabeled. This field (ranged 0-100) determines the precision of the likely_mislabeled filter. The threshold is applied against the DEP distribution of the likely_mislabeled samples. A threshold of 0 returns all, 100 returns 1 sample, and 50 will return the top 50% DEP samples that are likely_mislabeled. Higher = more precision, lower = more recall. Default 0.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field class_filter: Optional[List[Annotated[str]]] = None#
field data_error_potential_high: Optional[float] = None#
field data_error_potential_low: Optional[float] = None#
field exclude_ids: List[int] = []#
field gold_filter: Optional[List[Annotated[str]]] = None#
field ids: List[int] = []#
field inference_filter: Optional[InferenceFilter] = None#
field lasso: Optional[LassoSelection] = None#
field likely_mislabeled: Optional[bool] = None#
field likely_mislabeled_dep_percentile: Optional[int] = 0#
Constraints:
  • ge = 0

  • le = 100

field meta_filter: Optional[List[MetaFilter]] = None#
field misclassified_only: Optional[bool] = None#
field num_similar_to: Optional[int] = None#
field pred_filter: Optional[List[Annotated[str]]] = None#
field regex: Optional[bool] = None#
field similar_to: Optional[List[int]] = None#
field span_regex: Optional[bool] = None#
field span_sample_ids: Optional[List[int]] = None#
field span_text: Optional[str] = None#
field text_pat: Optional[Annotated[str]] = None#
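
An illustrative sketch combining several of the fields above (the specific values are hypothetical):

>>> params = FilterParams(
...     misclassified_only=True,
...     data_error_potential_low=0.7,
...     meta_filter=[MetaFilter(name="is_happy", isin=[True])],
... )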

dataquality.schemas.model module#

class ModelFramework(value)#

Bases: str, Enum

An enumeration.

torch = 'torch'#
keras = 'keras'#
hf = 'hf'#
auto = 'auto'#
class ModelUploadType(value)#

Bases: str, Enum

An enumeration.

transformers = 'transformers'#
setfit = 'setfit'#

dataquality.schemas.ner module#

class NERProbMethod(value)#

Bases: str, Enum

An enumeration.

confidence = 'confidence'#
loss = 'loss'#
class NERErrorType(value)#

Bases: str, Enum

An enumeration.

wrong_tag = 'wrong_tag'#
missed_label = 'missed_label'#
span_shift = 'span_shift'#
ghost_span = 'ghost_span'#
none = 'None'#
class TaggingSchema(value)#

Bases: str, Enum

An enumeration.

BIO = 'BIO'#
BILOU = 'BILOU'#
BIOES = 'BIOES'#
class NERColumns(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
sample_id = 'sample_id'#
split = 'split'#
epoch = 'epoch'#
is_gold = 'is_gold'#
is_pred = 'is_pred'#
span_start = 'span_start'#
span_end = 'span_end'#
gold = 'gold'#
pred = 'pred'#
conf_prob = 'conf_prob'#
loss_prob = 'loss_prob'#
loss_prob_label = 'loss_prob_label'#
galileo_error_type = 'galileo_error_type'#
emb = 'emb'#
inference_name = 'inference_name'#

dataquality.schemas.report module#

class ConditionStatus(value)#

Bases: str, Enum

An enumeration.

passed = 'passed'#
failed = 'failed'#
pydantic model SplitConditionData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field ground_truth: float [Required]#
field inference_name: Optional[str] = None#
field split: str [Required]#
field status: ConditionStatus [Required]#
pydantic model ReportConditionData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field condition: str [Required]#
field metric: str [Required]#
field splits: List[SplitConditionData] [Required]#
pydantic model RunReportData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field conditions: List[ReportConditionData] [Required]#
field created_at: str [Required]#
field email_subject: str [Required]#
field project_name: str [Required]#
field run_name: str [Required]#

dataquality.schemas.request_type module#

class RequestType(value)#

Bases: str, Enum

An enumeration.

GET = 'get'#
POST = 'post'#
PUT = 'put'#
DELETE = 'delete'#
static get_method(request)#
Return type:

Callable
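
A hedged usage sketch (the exact callable returned is an implementation detail; presumably the verb is mapped to the matching HTTP client function):

>>> method = RequestType.get_method(RequestType.GET.value)
>>> # method is a Callable, invoked as e.g. method(url)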

dataquality.schemas.route module#

class Route(value)#

Bases: str, Enum

List of available API routes

projects = 'projects'#
runs = 'runs'#
users = 'users'#
cleanup = 'cleanup'#
login = 'login'#
current_user = 'current_user'#
healthcheck = 'healthcheck'#
healthcheck_dq = 'healthcheck/dq'#
slices = 'slices'#
split_path = 'split'#
splits = 'splits'#
inference_names = 'inference_names'#
jobs = 'jobs'#
latest_job = 'jobs/latest'#
presigned_url = 'presigned_url'#
tasks = 'tasks'#
labels = 'labels'#
epochs = 'epochs'#
summary = 'insights/summary'#
groupby = 'insights/groupby'#
metrics = 'metrics'#
distribution = 'insights/distribution'#
alerts = 'insights/alerts'#
export = 'export'#
edits = 'edits'#
export_edits = 'edits/export'#
notify = 'notify/email'#
token = 'get-token'#
upload_file = 'upload_file'#
model = 'model'#
static content_path(project_id=None, run_id=None, split=None)#
Return type:

str

dataquality.schemas.semantic_segmentation module#

class SemSegCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
image = 'image'#
image_path = 'image_path'#
mask_path = 'mask_path'#
split = 'split'#
meta = 'meta'#
class ErrorType(value)#

Bases: str, Enum

An enumeration.

class_confusion = 'class_confusion'#
classification = 'classification'#
missed = 'missed'#
background = 'background'#
none = 'None'#
class PolygonType(value)#

Bases: str, Enum

An enumeration.

gold = 'gold'#
pred = 'pred'#
dummy = 'dummy'#
class SemSegMetricType(value)#

Bases: str, Enum

An enumeration.

miou = 'mean_iou'#
biou = 'boundary_iou'#
dice = 'dice'#
pydantic model ClassificationErrorData#

Bases: BaseModel

Data needed for determining classification errors on backend

accuracy: number of pixels correctly classified / area of polygon

mislabeled_class: label idx of the class that was most frequently mislabeled

mislabeled_class_pct: the pct of pixels in the polygon that were classified as mislabeled_class

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field accuracy: float [Required]#
field mislabeled_class: int [Required]#
field mislabeled_class_pct: float [Required]#
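
A small sketch constructing the payload described above (the values are hypothetical: 40% of the polygon's pixels were correct, and class 3 accounted for 55% of them):

>>> err = ClassificationErrorData(
...     accuracy=0.4, mislabeled_class=3, mislabeled_class_pct=0.55
... )
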
pydantic model SemSegMetricData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field area_per_class: List[int] [Required]#
field metric: SemSegMetricType [Required]#
field value: float [Required]#
field value_per_class: List[float] [Required]#
pydantic model Pixel#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field x: int [Required]#
field y: int [Required]#
property deserialize_json: List[int]#

Takes a pixel object and returns it as a list of ints

property deserialize_opencv: List[List[int]]#

Takes a pixel object and returns a JSON-compatible list

We deserialize to a JSON compatible format that matches what OpenCV expects when drawing contours.

OpenCV expects a list of list of pixel coordinates.

pydantic model Contour#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field pixels: List[Pixel] [Required]#
pydantic model Polygon#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field area: Optional[int] = None#
field background_error_pct: Optional[float] = None#
field cls_error_data: Optional[ClassificationErrorData] = None#
field contours: List[Contour] [Required]#
field data_error_potential: Optional[float] = None#
field error_type: ErrorType = ErrorType.none#
field ghost_percentage: Optional[float] = None#
field label_idx: int [Required]#
field likely_mislabeled_pct: Optional[float] = None#
field polygon_type: PolygonType [Required]#
field uuid: str [Required]#
contours_json(image_id)#

Deserialize the contours as JSON

In the backend we store polygon contours per image, so we need to keep a reference to which image the polygon belongs to.

Return type:

Dict[int, List]

Example

>>> polygon = Polygon(
...     contours=[
...         Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)]),
...         Contour(pixels=[Pixel(x=12, y=9), Pixel(x=11, y=11)]),
...     ]
... )
>>> polygon.contours_json(123)
{123: [[[0, 0], [0, 1]], [[12, 9], [11, 11]]]}

contours_opencv()#

Deserialize the contours in a polygon to be OpenCV contour compatible

OpenCV.drawContours expects a list of np.ndarrays corresponding to the contours in the polygon.

Return type:

List[ndarray]

Example

>>> polygon = Polygon(
...     contours=[Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)])]
... )
>>> polygon.contours_opencv()
[np.array([[0, 0], [0, 1]])]

static dummy_polygon()#

Creates an empty polygon with default values in case we have an image with no polygons in either the pred or gold mask.

Return type:

Polygon

dataquality.schemas.seq2seq module#

class Seq2SeqModelType(value)#

Bases: str, Enum

An enumeration.

encoder_decoder = 'encoder_decoder'#
decoder_only = 'decoder_only'#
static members()#
Return type:

List[str]

class Seq2SeqInputCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
input = 'input'#
target = 'target'#
generated_output = 'generated_output'#
split_ = 'split'#
tokenized_label = 'tokenized_label'#
input_cutoff = 'input_cutoff'#
target_cutoff = 'target_cutoff'#
token_label_str = 'token_label_str'#
token_label_positions = 'token_label_positions'#
token_label_offsets = 'token_label_offsets'#
system_prompts = 'system_prompts'#
class Seq2SeqInputTempCols(value)#

Bases: str, Enum

An enumeration.

formatted_prompts = 'galileo_formatted_prompts'#
class Seq2SeqOutputCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
emb = 'emb'#
token_logprobs = 'token_logprobs'#
top_logprobs = 'top_logprobs'#
generated_output = 'generated_output'#
generated_token_label_positions = 'generated_token_label_positions'#
generated_token_label_offsets = 'generated_token_label_offsets'#
generated_token_logprobs = 'generated_token_logprobs'#
generated_top_logprobs = 'generated_top_logprobs'#
split_ = 'split'#
epoch = 'epoch'#
inference_name = 'inference_name'#
generation_data = '_generation_data'#
static generated_cols()#
Return type:

List[str]

class AlignedTokenData(token_label_offsets, token_label_positions)#

Bases: object

token_label_offsets: List[List[Tuple[int, int]]]#
token_label_positions: List[List[Set[int]]]#
append(data)#

Append offsets and positions for a single sample

Assumes that data holds alignment info for a single data sample. As such, when appending to token_label_offsets and token_label_positions we remove the “batch” dimensions respectively, e.g. data.token_label_offsets[0].

Return type:

None

class LogprobData(token_logprobs, top_logprobs)#

Bases: object

Data type for the top_logprobs for a single sample

Parameters:
  • token_logprobs – np.ndarray of shape [seq_len]. Token label logprobs for a single sample

  • top_logprobs – List[List[Tuple[str, float]]]. List of top-k (str) predictions + corresponding logprobs

token_logprobs: ndarray#
top_logprobs: List[List[Tuple[str, float]]]#
class ModelGeneration(generated_ids, generated_logprob_data)#

Bases: object

generated_ids: ndarray#
generated_logprob_data: LogprobData#
class BatchGenerationData(generated_outputs=<factory>, generated_token_label_positions=<factory>, generated_token_label_offsets=<factory>, generated_token_logprobs=<factory>, generated_top_logprobs=<factory>)#

Bases: object

Dataclass for Generated Output Data

Stores the processed information from generation over a batch OR df of text Inputs. Each parameter is a List of sample data with length equal to the number of samples currently in the BatchGenerationData object.

Parameters:
  • generated_outputs – List[str]. The actual generated strings for each Input sample

  • generated_token_label_positions – List[List[Set[int]]]. Token label positions for each sample

  • generated_token_label_offsets – List[List[Tuple[int, int]]]. Token label offsets for each sample

  • generated_token_logprobs – np.ndarray of shape [seq_len]. Token label logprobs for each sample

  • generated_top_logprobs – List[List[List[Tuple[str, float]]]]. top_logprobs for each sample

generated_outputs: List[str]#
generated_token_label_positions: List[List[Set[int]]]#
generated_token_label_offsets: List[List[Tuple[int, int]]]#
generated_token_logprobs: List[ndarray]#
generated_top_logprobs: List[List[List[Tuple[str, float]]]]#
extend_from(batch_data)#

Extend generation data from a new Batch

Note that we favor in-place combining of batches for improved memory and performance.

Return type:

None
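
A minimal sketch of in-place accumulation (empty batches shown for brevity; real batches carry the fields listed above):

>>> combined = BatchGenerationData()
>>> combined.extend_from(BatchGenerationData())  # mutates combined in place, returns None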

dataquality.schemas.split module#

class Split(value)#

Bases: str, Enum

An enumeration.

train = 'training'#
training = 'training'#
val = 'validation'#
valid = 'validation'#
validation = 'validation'#
test = 'test'#
testing = 'test'#
inference = 'inference'#
static get_valid_attributes()#
Return type:

List[str]

static get_valid_keys()#
Return type:

List[str]

conform_split(split)#

Conforms split name to our naming conventions

Raises GalileoException if split is invalid

Return type:

Split
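
For example, assuming the aliases listed in the enum above (a sketch):

>>> conform_split("train")
<Split.train: 'training'>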

dataquality.schemas.task_type module#

class TaskType(value)#

Bases: str, Enum

Valid task types supported for logging by Galileo

text_classification = 'text_classification'#
text_multi_label = 'text_multi_label'#
text_ner = 'text_ner'#
image_classification = 'image_classification'#
tabular_classification = 'tabular_classification'#
object_detection = 'object_detection'#
semantic_segmentation = 'semantic_segmentation'#
prompt_evaluation = 'prompt_evaluation'#
seq2seq = 'seq2seq'#
llm_monitor = 'llm_monitor'#
seq2seq_completion = 'seq2seq_completion'#
seq2seq_chat = 'seq2seq_chat'#
static get_valid_tasks()#

Tasks that are valid for dataquality.

Return type:

List[TaskType]

static get_seq2seq_tasks()#

Sequence to Sequence tasks types.

Return type:

List[TaskType]

static get_mapping(task_int)#

Converts the servers task type enum to client names

Return type:

TaskType

dataquality.schemas.torch module#

class HelperData(value)#

Bases: str, Enum

A collection of all default attributes across all loggers

dqcallback = 'dqcallback'#
signature_cols = 'signature_cols'#
orig_collate_fn = 'orig_collate_fn'#
model_outputs_store = 'model_outputs_store'#
model = 'model'#
hook_manager = 'hook_manager'#
last_action = 'last_action'#
patches = 'patches'#
dl_next_idx_ids = 'dl_next_idx_ids'#
batch = 'batch'#
model_input = 'model_input'#

Module contents#

class RequestType(value)#

Bases: str, Enum

An enumeration.

GET = 'get'#
POST = 'post'#
PUT = 'put'#
DELETE = 'delete'#
static get_method(request)#
Return type:

Callable

class Route(value)#

Bases: str, Enum

List of available API routes

projects = 'projects'#
runs = 'runs'#
users = 'users'#
cleanup = 'cleanup'#
login = 'login'#
current_user = 'current_user'#
healthcheck = 'healthcheck'#
healthcheck_dq = 'healthcheck/dq'#
slices = 'slices'#
split_path = 'split'#
splits = 'splits'#
inference_names = 'inference_names'#
jobs = 'jobs'#
latest_job = 'jobs/latest'#
presigned_url = 'presigned_url'#
tasks = 'tasks'#
labels = 'labels'#
epochs = 'epochs'#
summary = 'insights/summary'#
groupby = 'insights/groupby'#
metrics = 'metrics'#
distribution = 'insights/distribution'#
alerts = 'insights/alerts'#
export = 'export'#
edits = 'edits'#
export_edits = 'edits/export'#
notify = 'notify/email'#
token = 'get-token'#
upload_file = 'upload_file'#
model = 'model'#
static content_path(project_id=None, run_id=None, split=None)#
Return type:

str