dataquality.schemas package#

Submodules#

dataquality.schemas.condition module#

class Operator(value)#

Bases: str, Enum

An enumeration.

eq = 'is equal to'#
neq = 'is not equal to'#
gt = 'is greater than'#
lt = 'is less than'#
gte = 'is greater than or equal to'#
lte = 'is less than or equal to'#
class AggregateFunction(value)#

Bases: str, Enum

An enumeration.

avg = 'Average'#
min = 'Minimum'#
max = 'Maximum'#
sum = 'Sum'#
pct = 'Percentage'#
pydantic model ConditionFilter#

Bases: BaseModel

Filter a dataframe based on the column value

Note that the column used for filtering is the same as the metric used in the condition.

Parameters:
  • operator – The operator to use for filtering (e.g. “gt”, “lt”, “eq”, “neq”). See Operator

  • value – The value to compare against

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field metric: str [Required]#
field operator: Operator [Required]#
field value: Union[float, int, str, bool] [Required]#
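
For example, a filter keeping only rows where the confidence column is below 0.1 (a minimal sketch mirroring the parameters above):

>>> f = ConditionFilter(
...     metric="confidence", operator=Operator.lt, value=0.1
... )
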
pydantic model Condition#

Bases: BaseModel

Class for building custom conditions for data quality checks

After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.

With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:

  1. Is the average confidence less than 0.3?
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c.evaluate(df)
    
  2. Is the max DEP greater or equal to 0.45?
    >>> c = Condition(
    ...     agg=AggregateFunction.max,
    ...     metric="data_error_potential",
    ...     operator=Operator.gte,
    ...     threshold=0.45,
    ... )
    >>> c.evaluate(df)
    

By adding filters, you can further narrow down the scope of the condition. If the aggregate function is “pct”, you don’t need to specify a metric, as the filters will determine the percentage of data.

For example:

  1. Alert if over 80% of the dataset has confidence under 0.1
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.8,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="confidence", operator=Operator.lt, value=0.1
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.2,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  3. Alert if 5% or more of the dataset contains PII
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.05,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:

  1. Alert if the min confidence of drifted data is less than 0.15
    >>> c = Condition(
    ...     agg=AggregateFunction.min,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.15,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         )
    ...     ],
    ... )
    >>> c.evaluate(df)
    
  2. Alert if over 50% of high DEP (>=0.7) data contains PII
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.5,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="data_error_potential", operator=Operator.gte, value=0.7
    ...         ),
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
    >>> c.evaluate(df)
    

You can also call conditions directly, which will assert their truth against a df

  1. Assert that average confidence is less than 0.3
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )
    >>> c(df)  # Will raise an AssertionError if False

Parameters:
  • metric – The DF column for evaluating the condition

  • agg – An aggregate function to apply to the metric

  • operator – The operator to use for comparing the agg to the threshold (e.g. “gt”, “lt”, “eq”, “neq”)

  • threshold – Threshold value for evaluating the condition

  • filters – Optional list of filters to apply to the DataFrame before evaluating the condition. See ConditionFilter

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field agg: AggregateFunction [Required]#
field filters: List[ConditionFilter] [Optional]#
Validated by:
  • validate_filters

field metric: Optional[str] = None#
Validated by:
  • validate_metric

field operator: Operator [Required]#
field threshold: float [Required]#
evaluate(df)#
Return type:

Tuple[bool, float]
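
Since evaluate returns both the truth value and the computed aggregate, you can unpack them directly (a usage sketch; c is a Condition built as above and df any supported DataFrame):

>>> passed, value = c.evaluate(df)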

dataquality.schemas.cv module#

class CVSmartFeatureColumn(value)#

Bases: str, Enum

A class holding the column names used by the smart feature methods. When updated, the corresponding schema in rungalileo also needs to be updated.

image_path: str = 'sf_image_path'#
height: str = 'sf_height'#
width: str = 'sf_width'#
channels: str = 'sf_channels'#
hash: str = 'sf_hash'#
contrast: str = 'sf_contrast'#
overexp: str = 'sf_overexposed'#
underexp: str = 'sf_underexposed'#
blur: str = 'sf_blur'#
lowcontent: str = 'sf_content'#
outlier_size: str = 'has_odd_size'#
outlier_ratio: str = 'has_odd_ratio'#
outlier_near_duplicate_id: str = 'near_duplicate_id'#
outlier_near_dup: str = 'is_near_duplicate'#
outlier_channels: str = 'has_odd_channels'#
outlier_low_contrast: str = 'has_low_contrast'#
outlier_overexposed: str = 'is_overexposed'#
outlier_underexposed: str = 'is_underexposed'#
outlier_low_content: str = 'has_low_content'#
outlier_blurry: str = 'is_blurry'#

dataquality.schemas.dataframe module#

pydantic model BaseLoggerDataFrames#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field data: DataFrame [Required]#
field emb: DataFrame [Required]#
field prob: DataFrame [Required]#
class FileType(value)#

Bases: str, Enum

Valid file extensions for an exported dataframe

arrow = 'arrow'#
parquet = 'parquet'#
json = 'json'#
csv = 'csv'#
class DFVar(skip_upload='skip_upload', progress_name='progress_name')#

Bases: object

skip_upload: str = 'skip_upload'#
progress_name: str = 'progress_name'#

dataquality.schemas.edit module#

class EditAction(value)#

Bases: str, Enum

The available actions you can take in an edit

relabel = 'relabel'#
delete = 'delete'#
select_for_label = 'select_for_label'#
relabel_as_pred = 'relabel_as_pred'#
update_text = 'update_text'#
shift_span = 'shift_span'#
pydantic model Edit#

Bases: BaseModel

A class to help create edits via dq.metrics. An edit is a combination of a filter and some edit action. You can use this class, as well as dq.metrics.create_edit and dq.metrics.get_edited_dataframe, to create automated edits and improved datasets, leading to automated retraining pipelines.

Parameters:
  • edit_action – EditAction the type of edit: delete, relabel, relabel_as_pred, update_text, shift_span (ner only), and select_for_label (inference only)

  • new_label – Optional[str] needed if action is relabel, ignored otherwise. The new label to set for the edit

  • search_string – Optional[str] needed when action is text replacement or shift_span. The search string to use for the edit

  • use_regex – bool = False. Used for the search_string. When searching, whether to use regex or not. Default False.

  • shift_span_start_num_words – Optional[int] Needed if action is shift_span. How many words (forward or back) to shift the beginning of the span by

  • shift_span_end_num_words – Optional[int] Needed if action is shift_span. How many words (forward or back) to shift the end of the span by

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field edit_action: EditAction [Required]#
Validated by:
  • new_label_if_relabel

  • shift_span_validator

  • text_replacement_if_update_text

  • validate_edit_action_for_split

field filter: Optional[FilterParams] = None#
field inference_name: Optional[str] = None#
field new_label: Optional[Annotated[str]] = None#
field note: Optional[Annotated[str]] = None#
field project_id: Optional[Annotated[UUID]] = None#
field run_id: Optional[Annotated[UUID]] = None#
field search_string: Optional[Annotated[str]] = None#
field shift_span_end_num_words: Optional[Annotated[int]] = None#
field shift_span_start_num_words: Optional[Annotated[int]] = None#
field split: Optional[str] = None#
field task: Optional[str] = None#
field text_replacement: Optional[Annotated[str]] = None#
field use_regex: bool = False#
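
A minimal sketch of a relabel edit over a filtered subset (the label and filter values here are hypothetical):

>>> edit = Edit(
...     edit_action=EditAction.relabel,
...     new_label="positive",
...     filter=FilterParams(text_pat="awesome"),
... )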

dataquality.schemas.hf module#

class HFCol(input_ids='input_ids', text='text', id='id', ner_tags='ner_tags', text_token_indices='text_token_indices', tokens='tokens', bpe_tokens='bpe_tokens', gold_spans='gold_spans', labels='labels', ner_labels='ner_labels', tags='tags')#

Bases: object

input_ids: str = 'input_ids'#
text: str = 'text'#
id: str = 'id'#
ner_tags: str = 'ner_tags'#
text_token_indices: str = 'text_token_indices'#
tokens: str = 'tokens'#
bpe_tokens: str = 'bpe_tokens'#
gold_spans: str = 'gold_spans'#
labels: str = 'labels'#
ner_labels: str = 'ner_labels'#
tags: str = 'tags'#
static get_fields()#
Return type:

List[str]

class SpanKey(label='label', start='start', end='end')#

Bases: object

label: str = 'label'#
start: str = 'start'#
end: str = 'end'#

dataquality.schemas.job module#

class JobName(value)#

Bases: str, Enum

An enumeration.

default = 'default'#
inference = 'inference'#

dataquality.schemas.metrics module#

pydantic model HashableBaseModel#

Bases: BaseModel

Hashable BaseModel https://github.com/pydantic/pydantic/issues/1303

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

pydantic model MetaFilter#

Bases: HashableBaseModel

A class for filtering arbitrary metadata dataframe columns

For example, to filter on a logged metadata column, “is_happy” for values [True], you can create a MetaFilter(name=”is_happy”, isin=[True])

You can use this filter for any columns, not just metadata columns. For example, you can use this to filter for DEP scores above 0.5: MetaFilter(name=”data_error_potential”, greater_than=0.5)
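
In code, those two examples read as:

>>> happy_filter = MetaFilter(name="is_happy", isin=[True])
>>> high_dep_filter = MetaFilter(name="data_error_potential", greater_than=0.5)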

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field greater_than: Optional[float] = None#
field isin: Optional[List[str]] = None#
field less_than: Optional[float] = None#
field name: Annotated[str] [Required]#
Constraints:
  • strict = True

pydantic model InferenceFilter#

Bases: HashableBaseModel

A class for filtering an inference split

  • is_otb: Filters samples that are / are not On-The-Boundary

  • is_drifted: Filters samples that are / are not Drifted

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field is_drifted: Optional[bool] = None#
field is_otb: Optional[bool] = None#
pydantic model LassoSelection#

Bases: HashableBaseModel

Representation of a lasso selection (used during an embeddings selection)

x and y correspond to the cursor movement while tracing the lasso. This is natively provided by plotly when creating a lasso selection

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field x: List[float] [Required]#
Validated by:
  • validate_xy

field y: List[float] [Required]#
Validated by:
  • validate_xy
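
A minimal sketch, assuming x and y are the cursor coordinates captured from a plotly lasso event (the values are illustrative):

>>> lasso = LassoSelection(
...     x=[0.1, 0.5, 0.5, 0.1], y=[0.2, 0.2, 0.8, 0.8]
... )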

pydantic model FilterParams#

Bases: HashableBaseModel

A class for sending filters to the API alongside almost any request.

Each field represents things you can filter the dataframe on.

Parameters:
  • ids – List[int] = [] filter for specific IDs in the dataframe (span IDs for NER)

  • similar_to – Optional[int] = None provide an ID to run similarity search on

  • num_similar_to – Optional[int] = None if running similarity search, how many similar samples to return

  • text_pat – Optional[StrictStr] = None filter text samples by some text pattern

  • regex – Optional[bool] = None if searching with text, whether to use regex

  • data_error_potential_high – Optional[float] = None only samples with DEP <= this

  • data_error_potential_low – Optional[float] = None only samples with DEP >= this

  • misclassified_only – Optional[bool] = None Only look at misclassified samples

  • gold_filter – Optional[List[StrictStr]] = None filter GT classes

  • pred_filter – Optional[List[StrictStr]] = None filter prediction classes

  • meta_filter – Optional[List[MetaFilter]] = None see MetaFilter class

  • inference_filter – Optional[InferenceFilter] = None see InferenceFilter class

  • span_sample_ids – Optional[List[int]] = None (NER only) filter for full samples

  • span_text – Optional[str] = None (NER only) filter only on span text

  • exclude_ids – List[int] = [] opposite of ids

  • lasso – Optional[LassoSelection] = None see LassoSelection class

  • class_filter – Optional[List[StrictStr]] = None filter GT OR prediction

  • likely_mislabeled – Optional[bool] = None Filter for only likely_mislabeled samples. False/None will return all samples

  • likely_mislabeled_dep_percentile – Optional[int] A percentile threshold for likely mislabeled. This field (ranged 0-100) determines the precision of the likely_mislabeled filter. The threshold is applied against the DEP distribution of the likely_mislabeled samples. A threshold of 0 returns all, 100 returns 1 sample, and 50 will return the top 50% DEP samples that are likely_mislabeled. Higher = more precision, lower = more recall. Default 0.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field class_filter: Optional[List[Annotated[str]]] = None#
field data_error_potential_high: Optional[float] = None#
field data_error_potential_low: Optional[float] = None#
field exclude_ids: List[int] = []#
field gold_filter: Optional[List[Annotated[str]]] = None#
field ids: List[int] = []#
field inference_filter: Optional[InferenceFilter] = None#
field lasso: Optional[LassoSelection] = None#
field likely_mislabeled: Optional[bool] = None#
field likely_mislabeled_dep_percentile: Optional[int] = 0#
Constraints:
  • ge = 0

  • le = 100

field meta_filter: Optional[List[MetaFilter]] = None#
field misclassified_only: Optional[bool] = None#
field num_similar_to: Optional[int] = None#
field pred_filter: Optional[List[Annotated[str]]] = None#
field regex: Optional[bool] = None#
field similar_to: Optional[List[int]] = None#
field span_regex: Optional[bool] = None#
field span_sample_ids: Optional[List[int]] = None#
field span_text: Optional[str] = None#
field text_pat: Optional[Annotated[str]] = None#
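
An illustrative sketch combining several of the fields above (the specific values are hypothetical):

>>> params = FilterParams(
...     misclassified_only=True,
...     data_error_potential_low=0.7,
...     meta_filter=[MetaFilter(name="is_happy", isin=[True])],
... )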

dataquality.schemas.model module#

class ModelFramework(value)#

Bases: str, Enum

An enumeration.

torch = 'torch'#
keras = 'keras'#
hf = 'hf'#
auto = 'auto'#
class ModelUploadType(value)#

Bases: str, Enum

An enumeration.

transformers = 'transformers'#
setfit = 'setfit'#

dataquality.schemas.ner module#

class NERProbMethod(value)#

Bases: str, Enum

An enumeration.

confidence = 'confidence'#
loss = 'loss'#
class NERErrorType(value)#

Bases: str, Enum

An enumeration.

wrong_tag = 'wrong_tag'#
missed_label = 'missed_label'#
span_shift = 'span_shift'#
ghost_span = 'ghost_span'#
none = 'None'#
class TaggingSchema(value)#

Bases: str, Enum

An enumeration.

BIO = 'BIO'#
BILOU = 'BILOU'#
BIOES = 'BIOES'#
class NERColumns(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
sample_id = 'sample_id'#
split = 'split'#
epoch = 'epoch'#
is_gold = 'is_gold'#
is_pred = 'is_pred'#
span_start = 'span_start'#
span_end = 'span_end'#
gold = 'gold'#
pred = 'pred'#
conf_prob = 'conf_prob'#
loss_prob = 'loss_prob'#
loss_prob_label = 'loss_prob_label'#
galileo_error_type = 'galileo_error_type'#
emb = 'emb'#
inference_name = 'inference_name'#

dataquality.schemas.report module#

class ConditionStatus(value)#

Bases: str, Enum

An enumeration.

passed = 'passed'#
failed = 'failed'#
pydantic model SplitConditionData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field ground_truth: float [Required]#
field inference_name: Optional[str] = None#
field split: str [Required]#
field status: ConditionStatus [Required]#
pydantic model ReportConditionData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field condition: str [Required]#
field metric: str [Required]#
field splits: List[SplitConditionData] [Required]#
pydantic model RunReportData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field conditions: List[ReportConditionData] [Required]#
field created_at: str [Required]#
field email_subject: str [Required]#
field project_name: str [Required]#
field run_name: str [Required]#

dataquality.schemas.request_type module#

class RequestType(value)#

Bases: str, Enum

An enumeration.

GET = 'get'#
POST = 'post'#
PUT = 'put'#
DELETE = 'delete'#
static get_method(request)#
Return type:

Callable
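
A hedged usage sketch (the exact callable returned is an implementation detail; presumably the verb is mapped to the matching HTTP client function):

>>> method = RequestType.get_method(RequestType.GET.value)
>>> # method is a Callable, invoked as e.g. method(url)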

dataquality.schemas.route module#

class Route(value)#

Bases: str, Enum

List of available API routes

projects = 'projects'#
runs = 'runs'#
users = 'users'#
cleanup = 'cleanup'#
login = 'login'#
current_user = 'current_user'#
healthcheck = 'healthcheck'#
healthcheck_dq = 'healthcheck/dq'#
slices = 'slices'#
split_path = 'split'#
splits = 'splits'#
inference_names = 'inference_names'#
jobs = 'jobs'#
latest_job = 'jobs/latest'#
presigned_url = 'presigned_url'#
tasks = 'tasks'#
labels = 'labels'#
epochs = 'epochs'#
summary = 'insights/summary'#
groupby = 'insights/groupby'#
metrics = 'metrics'#
distribution = 'insights/distribution'#
alerts = 'insights/alerts'#
export = 'export'#
edits = 'edits'#
export_edits = 'edits/export'#
notify = 'notify/email'#
token = 'get-token'#
upload_file = 'upload_file'#
model = 'model'#
static content_path(project_id=None, run_id=None, split=None)#
Return type:

str

dataquality.schemas.semantic_segmentation module#

class SemSegCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
image = 'image'#
image_path = 'image_path'#
mask_path = 'mask_path'#
split = 'split'#
meta = 'meta'#
class ErrorType(value)#

Bases: str, Enum

An enumeration.

class_confusion = 'class_confusion'#
classification = 'classification'#
missed = 'missed'#
background = 'background'#
none = 'None'#
class PolygonType(value)#

Bases: str, Enum

An enumeration.

gold = 'gold'#
pred = 'pred'#
dummy = 'dummy'#
class SemSegMetricType(value)#

Bases: str, Enum

An enumeration.

miou = 'mean_iou'#
biou = 'boundary_iou'#
dice = 'dice'#
pydantic model ClassificationErrorData#

Bases: BaseModel

Data needed for determining classification errors on backend

accuracy: number of pixels correctly classified / area of polygon

mislabeled_class: label idx of the class that was most frequently mislabeled

mislabeled_class_pct: the pct of pixels in the polygon that were classified as mislabeled_class

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field accuracy: float [Required]#
field mislabeled_class: int [Required]#
field mislabeled_class_pct: float [Required]#
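
A small sketch constructing the payload described above (the values are hypothetical: 40% of the polygon's pixels were correct, and class 3 accounted for 55% of them):

>>> err = ClassificationErrorData(
...     accuracy=0.4, mislabeled_class=3, mislabeled_class_pct=0.55
... )
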
pydantic model SemSegMetricData#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field area_per_class: List[int] [Required]#
field metric: SemSegMetricType [Required]#
field value: float [Required]#
field value_per_class: List[float] [Required]#
pydantic model Pixel#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field x: int [Required]#
field y: int [Required]#
property deserialize_json: List[int]#

Takes a pixel object and returns it as a list of ints

property deserialize_opencv: List[List[int]]#

Takes a pixel object and returns a JSON-compatible list

We deserialize to a JSON compatible format that matches what OpenCV expects when drawing contours.

OpenCV expects a list of list of pixel coordinates.

pydantic model Contour#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field pixels: List[Pixel] [Required]#
pydantic model Polygon#

Bases: BaseModel

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Fields:
field area: Optional[int] = None#
field background_error_pct: Optional[float] = None#
field cls_error_data: Optional[ClassificationErrorData] = None#
field contours: List[Contour] [Required]#
field data_error_potential: Optional[float] = None#
field error_type: ErrorType = ErrorType.none#
field ghost_percentage: Optional[float] = None#
field label_idx: int [Required]#
field likely_mislabeled_pct: Optional[float] = None#
field polygon_type: PolygonType [Required]#
field uuid: str [Required]#
contours_json(image_id)#

Deserialize the contours as JSON

In the backend we store polygon contours per image, so we need to keep a reference to which image the polygon belongs to.

Return type:

Dict[int, List]

Example

>>> polygon = Polygon(
...     contours=[
...         Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)]),
...         Contour(pixels=[Pixel(x=12, y=9), Pixel(x=11, y=11)]),
...     ]
... )
>>> polygon.contours_json(123)
{123: [[[0, 0], [0, 1]], [[12, 9], [11, 11]]]}

contours_opencv()#

Deserialize the contours in a polygon to be OpenCV contour compatible

OpenCV.drawContours expects a list of np.ndarrays corresponding to the contours in the polygon.

Return type:

List[ndarray]

Example

>>> polygon = Polygon(
...     contours=[Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)])]
... )
>>> polygon.contours_opencv()
[np.array([[0, 0], [0, 1]])]

static dummy_polygon()#

Creates an empty polygon with default values in case we have an image with no polygons in either the pred or gold mask.

Return type:

Polygon

dataquality.schemas.seq2seq module#

class Seq2SeqModelType(value)#

Bases: str, Enum

An enumeration.

encoder_decoder = 'encoder_decoder'#
decoder_only = 'decoder_only'#
static members()#
Return type:

List[str]

class Seq2SeqInputCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
input = 'input'#
target = 'target'#
generated_output = 'generated_output'#
split_ = 'split'#
tokenized_label = 'tokenized_label'#
input_cutoff = 'input_cutoff'#
target_cutoff = 'target_cutoff'#
token_label_str = 'token_label_str'#
token_label_positions = 'token_label_positions'#
token_label_offsets = 'token_label_offsets'#
system_prompts = 'system_prompts'#
class Seq2SeqInputTempCols(value)#

Bases: str, Enum

An enumeration.

formatted_prompts = 'galileo_formatted_prompts'#
class Seq2SeqOutputCols(value)#

Bases: str, Enum

An enumeration.

id = 'id'#
emb = 'emb'#
token_logprobs = 'token_logprobs'#
top_logprobs = 'top_logprobs'#
generated_output = 'generated_output'#
generated_token_label_positions = 'generated_token_label_positions'#
generated_token_label_offsets = 'generated_token_label_offsets'#
generated_token_logprobs = 'generated_token_logprobs'#
generated_top_logprobs = 'generated_top_logprobs'#
split_ = 'split'#
epoch = 'epoch'#
inference_name = 'inference_name'#
generation_data = '_generation_data'#
static generated_cols()#
Return type:

List[str]

class AlignedTokenData(token_label_offsets, token_label_positions)#

Bases: object

token_label_offsets: List[List[Tuple[int, int]]]#
token_label_positions: List[List[Set[int]]]#
append(data)#

Append offsets and positions for a single sample

Assumes that data holds alignment info for a single data sample. As such, when appending to token_label_offsets and token_label_positions we remove the “batch” dimensions respectively, e.g. data.token_label_offsets[0].

Return type:

None

class LogprobData(token_logprobs, top_logprobs)#

Bases: object

Data type for the top_logprobs for a single sample

Parameters:
  • token_logprobs – np.ndarray of shape [seq_len]. Token label logprobs for a single sample

  • top_logprobs – List[List[Tuple[str, float]]]. List of top-k (str) predictions + corresponding logprobs

token_logprobs: ndarray#
top_logprobs: List[List[Tuple[str, float]]]#
class ModelGeneration(generated_ids, generated_logprob_data)#

Bases: object

generated_ids: ndarray#
generated_logprob_data: LogprobData#
class BatchGenerationData(generated_outputs=<factory>, generated_token_label_positions=<factory>, generated_token_label_offsets=<factory>, generated_token_logprobs=<factory>, generated_top_logprobs=<factory>)#

Bases: object

Dataclass for Generated Output Data

Stores the processed information from generation over a batch OR df of text Inputs. Each parameter is a List of sample data with length equal to the number of samples currently in the BatchGenerationData object.

Parameters:
  • generated_outputs – List[str]. The actual generated strings for each Input sample

  • generated_token_label_positions – List[List[Set[int]]]. Token label positions for each sample

  • generated_token_label_offsets – List[List[Tuple[int, int]]]. Token label offsets for each sample

  • generated_token_logprobs – np.ndarray of shape [seq_len]. Token label logprobs for each sample

  • generated_top_logprobs – List[List[List[Tuple[str, float]]]]. top_logprobs for each sample

generated_outputs: List[str]#
generated_token_label_positions: List[List[Set[int]]]#
generated_token_label_offsets: List[List[Tuple[int, int]]]#
generated_token_logprobs: List[ndarray]#
generated_top_logprobs: List[List[List[Tuple[str, float]]]]#
extend_from(batch_data)#

Extend generation data from a new Batch

Note that we favor in-place combining of batches for improved memory and performance.

Return type:

None
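
A minimal sketch of in-place accumulation (empty batches shown for brevity; real batches carry the fields listed above):

>>> combined = BatchGenerationData()
>>> combined.extend_from(BatchGenerationData())  # mutates combined in place, returns None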

dataquality.schemas.split module#

class Split(value)#

Bases: str, Enum

An enumeration.

train = 'training'#
training = 'training'#
val = 'validation'#
valid = 'validation'#
validation = 'validation'#
test = 'test'#
testing = 'test'#
inference = 'inference'#
static get_valid_attributes()#
Return type:

List[str]

static get_valid_keys()#
Return type:

List[str]

conform_split(split)#

Conforms split name to our naming conventions

Raises GalileoException if split is invalid

Return type:

Split
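
For example, assuming the aliases listed in the enum above (a sketch):

>>> conform_split("train")
<Split.train: 'training'>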

dataquality.schemas.task_type module#

class TaskType(value)#

Bases: str, Enum

Valid task types supported for logging by Galileo

text_classification = 'text_classification'#
text_multi_label = 'text_multi_label'#
text_ner = 'text_ner'#
image_classification = 'image_classification'#
tabular_classification = 'tabular_classification'#
object_detection = 'object_detection'#
semantic_segmentation = 'semantic_segmentation'#
prompt_evaluation = 'prompt_evaluation'#
seq2seq = 'seq2seq'#
llm_monitor = 'llm_monitor'#
seq2seq_completion = 'seq2seq_completion'#
seq2seq_chat = 'seq2seq_chat'#
static get_valid_tasks()#

Tasks that are valid for dataquality.

Return type:

List[TaskType]

static get_seq2seq_tasks()#

Sequence to Sequence tasks types.

Return type:

List[TaskType]

static get_mapping(task_int)#

Converts the servers task type enum to client names

Return type:

TaskType

dataquality.schemas.torch module#

class HelperData(value)#

Bases: str, Enum

A collection of all default attributes across all loggers

dqcallback = 'dqcallback'#
signature_cols = 'signature_cols'#
orig_collate_fn = 'orig_collate_fn'#
model_outputs_store = 'model_outputs_store'#
model = 'model'#
hook_manager = 'hook_manager'#
last_action = 'last_action'#
patches = 'patches'#
dl_next_idx_ids = 'dl_next_idx_ids'#
batch = 'batch'#
model_input = 'model_input'#

Module contents#

class RequestType(value)#

Bases: str, Enum

An enumeration.

GET = 'get'#
POST = 'post'#
PUT = 'put'#
DELETE = 'delete'#
static get_method(request)#
Return type:

Callable

class Route(value)#

Bases: str, Enum

List of available API routes

projects = 'projects'#
runs = 'runs'#
users = 'users'#
cleanup = 'cleanup'#
login = 'login'#
current_user = 'current_user'#
healthcheck = 'healthcheck'#
healthcheck_dq = 'healthcheck/dq'#
slices = 'slices'#
split_path = 'split'#
splits = 'splits'#
inference_names = 'inference_names'#
jobs = 'jobs'#
latest_job = 'jobs/latest'#
presigned_url = 'presigned_url'#
tasks = 'tasks'#
labels = 'labels'#
epochs = 'epochs'#
summary = 'insights/summary'#
groupby = 'insights/groupby'#
metrics = 'metrics'#
distribution = 'insights/distribution'#
alerts = 'insights/alerts'#
export = 'export'#
edits = 'edits'#
export_edits = 'edits/export'#
notify = 'notify/email'#
token = 'get-token'#
upload_file = 'upload_file'#
model = 'model'#
static content_path(project_id=None, run_id=None, split=None)#
Return type:

str