dataquality.schemas package#
Submodules#
dataquality.schemas.condition module#
- class Operator(value)#
Bases: str, Enum
An enumeration.
- eq = 'is equal to'#
- neq = 'is not equal to'#
- gt = 'is greater than'#
- lt = 'is less than'#
- gte = 'is greater than or equal to'#
- lte = 'is less than or equal to'#
- class AggregateFunction(value)#
Bases: str, Enum
An enumeration.
- avg = 'Average'#
- min = 'Minimum'#
- max = 'Maximum'#
- sum = 'Sum'#
- pct = 'Percentage'#
- pydantic model ConditionFilter#
Bases: BaseModel
Filter a dataframe based on a column value.
Note that the column used for filtering is the same as the metric used in the condition.
- Parameters:
operator – The operator to use for filtering (e.g. "gt", "lt", "eq", "neq"). See Operator
value – The value to compare against
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Fields:
- field metric: str [Required]#
- field value: Union[float, int, str, bool] [Required]#
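The filtering semantics above can be illustrated with a small self-contained sketch. This is a hypothetical plain-Python re-implementation for intuition only; the real library applies these operators to DataFrame columns, and the names `OPERATORS` and `apply_filter` are not part of dataquality.

```python
import operator as op

# Map the Operator member names to Python comparison functions
# (illustrative; not the library's implementation).
OPERATORS = {
    "eq": op.eq, "neq": op.ne,
    "gt": op.gt, "lt": op.lt,
    "gte": op.ge, "lte": op.le,
}

def apply_filter(rows, metric, operator_name, value):
    """Keep only rows whose `metric` column satisfies the comparison."""
    compare = OPERATORS[operator_name]
    return [r for r in rows if compare(r[metric], value)]

rows = [{"confidence": 0.05}, {"confidence": 0.2}, {"confidence": 0.9}]
low_conf = apply_filter(rows, "confidence", "lt", 0.1)
# low_conf == [{"confidence": 0.05}]
```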
- pydantic model Condition#
Bases: BaseModel
Class for building custom conditions for data quality checks
After building a condition, call evaluate to determine the truthiness of the condition against a given DataFrame.
With a bit of thought, complex and custom conditions can be built. To gain an intuition for what can be accomplished, consider the following examples:
- Is the average confidence less than 0.3?
>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c.evaluate(df)
- Is the max DEP greater than or equal to 0.45?
>>> c = Condition(
...     agg=AggregateFunction.max,
...     metric="data_error_potential",
...     operator=Operator.gte,
...     threshold=0.45,
... )
>>> c.evaluate(df)
By adding filters, you can further narrow down the scope of the condition. If the aggregate function is "pct", you don't need to specify a metric, as the filters will determine the percentage of data.
For example:
- Alert if over 80% of the dataset has confidence under 0.1
>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.8,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(metric="confidence", operator=Operator.lt, value=0.1),
...     ],
... )
>>> c.evaluate(df)
- Alert if at least 20% of the dataset has drifted (inference DataFrames only)
>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.2,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(metric="is_drifted", operator=Operator.eq, value=True),
...     ],
... )
>>> c.evaluate(df)
- Alert if 5% or more of the dataset contains PII
>>> c = Condition(
...     operator=Operator.gte,
...     threshold=0.05,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(metric="galileo_pii", operator=Operator.neq, value="None"),
...     ],
... )
>>> c.evaluate(df)
Complex conditions can be built when the filter has a different metric than the metric used in the condition. For example:
- Alert if the min confidence of drifted data is less than 0.15
>>> c = Condition(
...     agg=AggregateFunction.min,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.15,
...     filters=[
...         ConditionFilter(metric="is_drifted", operator=Operator.eq, value=True),
...     ],
... )
>>> c.evaluate(df)
- Alert if over 50% of high-DEP (>=0.7) data contains PII
>>> c = Condition(
...     operator=Operator.gt,
...     threshold=0.5,
...     agg=AggregateFunction.pct,
...     filters=[
...         ConditionFilter(metric="data_error_potential", operator=Operator.gte, value=0.7),
...         ConditionFilter(metric="galileo_pii", operator=Operator.neq, value="None"),
...     ],
... )
>>> c.evaluate(df)
You can also call a condition directly, which will assert its truth against a df:
- Assert that average confidence is less than 0.3
>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c(df)  # Will raise an AssertionError if False
- Parameters:
metric – The DF column for evaluating the condition
agg – An aggregate function to apply to the metric
operator – The operator to use for comparing the agg to the threshold (e.g. "gt", "lt", "eq", "neq")
threshold – Threshold value for evaluating the condition
filters – Optional filters to apply to the DataFrame before evaluating the condition
- Fields:
- field agg: AggregateFunction [Required]#
- field filters: List[ConditionFilter] [Optional]# - Validated by: validate_filters
- field metric: Optional[str] = None# - Validated by: validate_metric
- field threshold: float [Required]#
- evaluate(df)#
- Return type: Tuple[bool, float]
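To make the Tuple[bool, float] return value concrete, here is a hedged sketch of the evaluate() semantics over plain dicts instead of a DataFrame. The function `evaluate_condition` is a hypothetical stand-in, not the library's code; it mirrors the documented behavior of aggregating a metric and comparing it to the threshold.

```python
from typing import Dict, List, Tuple

def evaluate_condition(
    rows: List[Dict], agg: str, metric: str, operator: str, threshold: float
) -> Tuple[bool, float]:
    """Aggregate `metric` over rows, compare to threshold, return (passed, value)."""
    values = [r[metric] for r in rows]
    aggs = {
        "avg": lambda v: sum(v) / len(v),
        "min": min,
        "max": max,
        "sum": sum,
    }
    value = aggs[agg](values)
    comparisons = {
        "gt": value > threshold, "lt": value < threshold,
        "gte": value >= threshold, "lte": value <= threshold,
        "eq": value == threshold, "neq": value != threshold,
    }
    return comparisons[operator], value

rows = [{"confidence": 0.1}, {"confidence": 0.2}, {"confidence": 0.3}]
passed, value = evaluate_condition(rows, "avg", "confidence", "lt", 0.3)
# average confidence is 0.2, which is < 0.3, so passed is True
```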
dataquality.schemas.cv module#
- class CVSmartFeatureColumn(value)#
Bases: str, Enum
A class holding the column names appearing with the smart feature methods. When updated, the corresponding schema in rungalileo also needs to be updated.
- image_path: str = 'sf_image_path'#
- height: str = 'sf_height'#
- width: str = 'sf_width'#
- channels: str = 'sf_channels'#
- hash: str = 'sf_hash'#
- contrast: str = 'sf_contrast'#
- overexp: str = 'sf_overexposed'#
- underexp: str = 'sf_underexposed'#
- blur: str = 'sf_blur'#
- lowcontent: str = 'sf_content'#
- outlier_size: str = 'has_odd_size'#
- outlier_ratio: str = 'has_odd_ratio'#
- outlier_near_duplicate_id: str = 'near_duplicate_id'#
- outlier_near_dup: str = 'is_near_duplicate'#
- outlier_channels: str = 'has_odd_channels'#
- outlier_low_contrast: str = 'has_low_contrast'#
- outlier_overexposed: str = 'is_overexposed'#
- outlier_underexposed: str = 'is_underexposed'#
- outlier_low_content: str = 'has_low_content'#
- outlier_blurry: str = 'is_blurry'#
dataquality.schemas.dataframe module#
- pydantic model BaseLoggerDataFrames#
Bases: BaseModel
- Fields:
- field data: DataFrame [Required]#
- field emb: DataFrame [Required]#
- field prob: DataFrame [Required]#
dataquality.schemas.edit module#
- class EditAction(value)#
Bases: str, Enum
The available actions you can take in an edit
- relabel = 'relabel'#
- delete = 'delete'#
- select_for_label = 'select_for_label'#
- relabel_as_pred = 'relabel_as_pred'#
- update_text = 'update_text'#
- shift_span = 'shift_span'#
- pydantic model Edit#
Bases: BaseModel
A class to help create edits via dq.metrics. An edit is a combination of a filter and an edit action. You can use this class, as well as dq.metrics.create_edit and dq.metrics.get_edited_dataframe, to create automated edits and improved datasets, leading to automated retraining pipelines.
- Parameters:
edit_action – EditAction: the type of edit. One of delete, relabel, relabel_as_pred, update_text, shift_span (NER only), or select_for_label (inference only)
new_label – Optional[str] needed if action is relabel, ignored otherwise. The new label to set for the edit
search_string – Optional[str] needed when action is text replacement or shift_span. The search string to use for the edit
use_regex – bool = False. Used for the search_string. When searching, whether to use regex. Default False.
shift_span_start_num_words – Optional[int] needed if action is shift_span. How many words (forward or back) to shift the beginning of the span by
shift_span_end_num_words – Optional[int] needed if action is shift_span. How many words (forward or back) to shift the end of the span by
- Fields:
- field edit_action: EditAction [Required]# - Validated by: new_label_if_relabel, shift_span_validator, text_replacement_if_update_text, validate_edit_action_for_split
- field filter: Optional[FilterParams] = None#
- field inference_name: Optional[str] = None#
- field new_label: Optional[Annotated[str]] = None#
- field note: Optional[Annotated[str]] = None#
- field project_id: Optional[Annotated[UUID]] = None#
- field run_id: Optional[Annotated[UUID]] = None#
- field search_string: Optional[Annotated[str]] = None#
- field shift_span_end_num_words: Optional[Annotated[int]] = None#
- field shift_span_start_num_words: Optional[Annotated[int]] = None#
- field split: Optional[str] = None#
- field task: Optional[str] = None#
- field text_replacement: Optional[Annotated[str]] = None#
- field use_regex: bool = False#
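Conceptually, a relabel edit pairs a row filter with a label rewrite. The sketch below is a hypothetical stand-alone illustration of that idea over plain dicts; the real Edit is applied via dq.metrics.get_edited_dataframe, and `apply_relabel` is not a dataquality function.

```python
from typing import Callable, Dict, List

def apply_relabel(
    rows: List[Dict], matches: Callable[[Dict], bool], new_label: str
) -> List[Dict]:
    """Return a copy of rows with `label` replaced wherever the filter matches."""
    return [
        {**r, "label": new_label} if matches(r) else dict(r)
        for r in rows
    ]

rows = [{"id": 0, "label": "cat"}, {"id": 1, "label": "dgo"}]
# Relabel every row whose label is the typo "dgo"
fixed = apply_relabel(rows, lambda r: r["label"] == "dgo", "dog")
# fixed[1]["label"] == "dog"; the original rows list is untouched
```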
dataquality.schemas.hf module#
- class HFCol(input_ids='input_ids', text='text', id='id', ner_tags='ner_tags', text_token_indices='text_token_indices', tokens='tokens', bpe_tokens='bpe_tokens', gold_spans='gold_spans', labels='labels', ner_labels='ner_labels', tags='tags')#
Bases:
object
-
input_ids:
str
= 'input_ids'#
-
text:
str
= 'text'#
-
id:
str
= 'id'#
-
ner_tags:
str
= 'ner_tags'#
-
text_token_indices:
str
= 'text_token_indices'#
-
tokens:
str
= 'tokens'#
-
bpe_tokens:
str
= 'bpe_tokens'#
-
gold_spans:
str
= 'gold_spans'#
-
labels:
str
= 'labels'#
-
ner_labels:
str
= 'ner_labels'#
-
tags:
str
= 'tags'#
- static get_fields()#
- Return type:
List
[str
]
-
input_ids:
dataquality.schemas.job module#
dataquality.schemas.metrics module#
- pydantic model HashableBaseModel#
Bases: BaseModel
Hashable BaseModel https://github.com/pydantic/pydantic/issues/1303
- pydantic model MetaFilter#
Bases: HashableBaseModel
A class for filtering arbitrary metadata dataframe columns
For example, to filter a logged metadata column "is_happy" for the values [True], you can create a MetaFilter(name="is_happy", isin=[True])
You can use this filter on any column, not just metadata columns. For example, you can use this to filter for DEP scores above 0.5: MetaFilter(name="data_error_potential", greater_than=0.5)
- field greater_than: Optional[float] = None#
- field isin: Optional[List[str]] = None#
- field less_than: Optional[float] = None#
- field name: Annotated[str] [Required]# - Constraints: strict = True
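The combination of isin, greater_than, and less_than can be sketched as successive row filters. This is an illustrative stdlib re-implementation, assuming each row is a dict; `meta_filter` is a hypothetical name, not part of dataquality.

```python
from typing import Dict, List, Optional

def meta_filter(
    rows: List[Dict],
    name: str,
    isin: Optional[list] = None,
    greater_than: Optional[float] = None,
    less_than: Optional[float] = None,
) -> List[Dict]:
    """Apply MetaFilter-style constraints on column `name`, in sequence."""
    out = rows
    if isin is not None:
        out = [r for r in out if r[name] in isin]
    if greater_than is not None:
        out = [r for r in out if r[name] > greater_than]
    if less_than is not None:
        out = [r for r in out if r[name] < less_than]
    return out

rows = [{"data_error_potential": 0.2}, {"data_error_potential": 0.7}]
high_dep = meta_filter(rows, "data_error_potential", greater_than=0.5)
# high_dep == [{"data_error_potential": 0.7}]
```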
- pydantic model InferenceFilter#
Bases: HashableBaseModel
A class for filtering an inference split
is_otb: Filters samples that are / are not On-The-Boundary
is_drifted: Filters samples that are / are not Drifted
- field is_drifted: Optional[bool] = None#
- field is_otb: Optional[bool] = None#
- pydantic model LassoSelection#
Bases: HashableBaseModel
Representation of a lasso selection (used during an embeddings selection)
x and y correspond to the cursor movement while tracing the lasso. This is natively provided by plotly when creating a lasso selection
- Fields:
- field x: List[float] [Required]# - Validated by: validate_xy
- field y: List[float] [Required]# - Validated by: validate_xy
- pydantic model FilterParams#
Bases: HashableBaseModel
A class for sending filters to the API alongside almost any request.
Each field represents things you can filter the dataframe on.
- Parameters:
ids – List[int] = [] filter for specific IDs in the dataframe (span IDs for NER)
similar_to – Optional[int] = None provide an ID to run similarity search on
num_similar_to – Optional[int] = None if running similarity search, how many similar samples to return
text_pat – Optional[StrictStr] = None filter text samples by some text pattern
regex – Optional[bool] = None if searching with text, whether to use regex
data_error_potential_high – Optional[float] = None only samples with DEP <= this
data_error_potential_low – Optional[float] = None only samples with DEP >= this
misclassified_only – Optional[bool] = None only look at misclassified samples
gold_filter – Optional[List[StrictStr]] = None filter GT classes
pred_filter – Optional[List[StrictStr]] = None filter prediction classes
meta_filter – Optional[List[MetaFilter]] = None see MetaFilter class
inference_filter – Optional[InferenceFilter] = None see InferenceFilter class
span_sample_ids – Optional[List[int]] = None (NER only) filter for full samples
span_text – Optional[str] = None (NER only) filter only on span text
exclude_ids – List[int] = [] opposite of ids
lasso – Optional[LassoSelection] = None see LassoSelection class
class_filter – Optional[List[StrictStr]] = None filter GT OR prediction
likely_mislabeled – Optional[bool] = None filter for only likely_mislabeled samples. False/None will return all samples
likely_mislabeled_dep_percentile – Optional[int] a percentile threshold for likely mislabeled. This field (ranged 0-100) determines the precision of the likely_mislabeled filter. The threshold is applied against the DEP distribution of the likely_mislabeled samples. A threshold of 0 returns all, 100 returns 1 sample, and 50 will return the top 50% DEP samples that are likely_mislabeled. Higher = more precision, lower = more recall. Default 0.
- Fields:
- field class_filter: Optional[List[Annotated[str]]] = None#
- field data_error_potential_high: Optional[float] = None#
- field data_error_potential_low: Optional[float] = None#
- field exclude_ids: List[int] = []#
- field gold_filter: Optional[List[Annotated[str]]] = None#
- field ids: List[int] = []#
- field inference_filter: Optional[InferenceFilter] = None#
- field lasso: Optional[LassoSelection] = None#
- field likely_mislabeled: Optional[bool] = None#
- field likely_mislabeled_dep_percentile: Optional[int] = 0# - Constraints: ge = 0, le = 100
- field meta_filter: Optional[List[MetaFilter]] = None#
- field misclassified_only: Optional[bool] = None#
- field num_similar_to: Optional[int] = None#
- field pred_filter: Optional[List[Annotated[str]]] = None#
- field regex: Optional[bool] = None#
- field similar_to: Optional[List[int]] = None#
- field span_regex: Optional[bool] = None#
- field span_sample_ids: Optional[List[int]] = None#
- field span_text: Optional[str] = None#
- field text_pat: Optional[Annotated[str]] = None#
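The likely_mislabeled_dep_percentile description (0 returns all, 100 returns 1 sample, 50 returns the top 50% by DEP) can be sketched as a simple cutoff on ranked DEP scores. This is my reading of the documented semantics, not the backend implementation; `dep_percentile_cutoff` is a hypothetical name.

```python
from typing import List

def dep_percentile_cutoff(dep_scores: List[float], percentile: int) -> List[float]:
    """Keep the top (100 - percentile)% of samples by DEP.

    percentile=0 keeps all samples; higher values keep only the
    highest-DEP samples (more precision, less recall).
    """
    ranked = sorted(dep_scores, reverse=True)
    keep = max(1, round(len(ranked) * (100 - percentile) / 100))
    return ranked[:keep]

scores = [0.9, 0.1, 0.5, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 1.0]
top_half = dep_percentile_cutoff(scores, 50)   # top 50% DEP samples
everything = dep_percentile_cutoff(scores, 0)  # percentile 0 returns all
```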
dataquality.schemas.model module#
dataquality.schemas.ner module#
- class NERProbMethod(value)#
Bases: str, Enum
An enumeration.
- confidence = 'confidence'#
- loss = 'loss'#
- class NERErrorType(value)#
Bases: str, Enum
An enumeration.
- wrong_tag = 'wrong_tag'#
- missed_label = 'missed_label'#
- span_shift = 'span_shift'#
- ghost_span = 'ghost_span'#
- none = 'None'#
- class TaggingSchema(value)#
Bases: str, Enum
An enumeration.
- BIO = 'BIO'#
- BILOU = 'BILOU'#
- BIOES = 'BIOES'#
- class NERColumns(value)#
Bases: str, Enum
An enumeration.
- id = 'id'#
- sample_id = 'sample_id'#
- split = 'split'#
- epoch = 'epoch'#
- is_gold = 'is_gold'#
- is_pred = 'is_pred'#
- span_start = 'span_start'#
- span_end = 'span_end'#
- gold = 'gold'#
- pred = 'pred'#
- conf_prob = 'conf_prob'#
- loss_prob = 'loss_prob'#
- loss_prob_label = 'loss_prob_label'#
- galileo_error_type = 'galileo_error_type'#
- emb = 'emb'#
- inference_name = 'inference_name'#
dataquality.schemas.report module#
- class ConditionStatus(value)#
Bases: str, Enum
An enumeration.
- passed = 'passed'#
- failed = 'failed'#
- pydantic model SplitConditionData#
Bases: BaseModel
- Fields:
- field ground_truth: float [Required]#
- field inference_name: Optional[str] = None#
- field link: Optional[str] = None#
- field split: str [Required]#
- field status: ConditionStatus [Required]#
- pydantic model ReportConditionData#
Bases: BaseModel
- field condition: str [Required]#
- field metric: str [Required]#
- field splits: List[SplitConditionData] [Required]#
- pydantic model RunReportData#
Bases: BaseModel
- Fields:
- field conditions: List[ReportConditionData] [Required]#
- field created_at: str [Required]#
- field email_subject: str [Required]#
- field link: str [Required]#
- field project_name: str [Required]#
- field run_name: str [Required]#
dataquality.schemas.request_type module#
dataquality.schemas.route module#
- class Route(value)#
Bases: str, Enum
List of available API routes
- projects = 'projects'#
- runs = 'runs'#
- users = 'users'#
- cleanup = 'cleanup'#
- login = 'login'#
- current_user = 'current_user'#
- healthcheck = 'healthcheck'#
- healthcheck_dq = 'healthcheck/dq'#
- slices = 'slices'#
- split_path = 'split'#
- splits = 'splits'#
- inference_names = 'inference_names'#
- jobs = 'jobs'#
- latest_job = 'jobs/latest'#
- presigned_url = 'presigned_url'#
- tasks = 'tasks'#
- labels = 'labels'#
- epochs = 'epochs'#
- summary = 'insights/summary'#
- groupby = 'insights/groupby'#
- metrics = 'metrics'#
- distribution = 'insights/distribution'#
- alerts = 'insights/alerts'#
- export = 'export'#
- edits = 'edits'#
- export_edits = 'edits/export'#
- notify = 'notify/email'#
- token = 'get-token'#
- upload_file = 'upload_file'#
- model = 'model'#
- link = 'link'#
- static content_path(project_id=None, run_id=None, split=None)#
- Return type: str
dataquality.schemas.semantic_segmentation module#
- class SemSegCols(value)#
Bases: str, Enum
An enumeration.
- id = 'id'#
- image = 'image'#
- image_path = 'image_path'#
- mask_path = 'mask_path'#
- split = 'split'#
- meta = 'meta'#
- class ErrorType(value)#
Bases: str, Enum
An enumeration.
- class_confusion = 'class_confusion'#
- classification = 'classification'#
- missed = 'missed'#
- background = 'background'#
- none = 'None'#
- class PolygonType(value)#
Bases: str, Enum
An enumeration.
- gold = 'gold'#
- pred = 'pred'#
- dummy = 'dummy'#
- class SemSegMetricType(value)#
Bases: str, Enum
An enumeration.
- miou = 'mean_iou'#
- biou = 'boundary_iou'#
- dice = 'dice'#
- pydantic model ClassificationErrorData#
Bases: BaseModel
Data needed for determining classification errors on backend
accuracy: number of pixels correctly classified / area of polygon
mislabeled_class: label idx of the class that was most frequently mislabeled
mislabeled_class_pct: the pct of pixels in the polygon that were classified as mislabeled_class
- field accuracy: float [Required]#
- field mislabeled_class: int [Required]#
- field mislabeled_class_pct: float [Required]#
- pydantic model SemSegMetricData#
Bases: BaseModel
- Fields:
- field area_per_class: List[int] [Required]#
- field metric: SemSegMetricType [Required]#
- field value: float [Required]#
- field value_per_class: List[float] [Required]#
- pydantic model Pixel#
Bases: BaseModel
- Fields:
- field x: int [Required]#
- field y: int [Required]#
- property deserialize_json: List[int]#
Takes a pixel object and returns it as a list of ints
- property deserialize_opencv: List[List[int]]#
Takes a pixel object and returns a JSON-compatible list
We deserialize to a JSON-compatible format that matches what OpenCV expects when drawing contours.
OpenCV expects a list of lists of pixel coordinates.
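The two deserialization shapes described above differ only in nesting. The sketch below uses a plain dataclass stand-in (`PixelSketch`, a hypothetical name) rather than the real pydantic Pixel model, to show the flat-list vs. OpenCV-style nested-list outputs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PixelSketch:
    x: int
    y: int

    @property
    def deserialize_json(self) -> List[int]:
        # Flat [x, y] pair, suitable for plain JSON serialization
        return [self.x, self.y]

    @property
    def deserialize_opencv(self) -> List[List[int]]:
        # OpenCV contours expect one extra nesting level per point
        return [[self.x, self.y]]

p = PixelSketch(x=3, y=7)
# p.deserialize_json == [3, 7]
# p.deserialize_opencv == [[3, 7]]
```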
- pydantic model Contour#
Bases: BaseModel
- pydantic model Polygon#
Bases: BaseModel
- Fields:
- field area: Optional[int] = None#
- field background_error_pct: Optional[float] = None#
- field cls_error_data: Optional[ClassificationErrorData] = None#
- field data_error_potential: Optional[float] = None#
- field ghost_percentage: Optional[float] = None#
- field label_idx: int [Required]#
- field likely_mislabeled_pct: Optional[float] = None#
- field polygon_type: PolygonType [Required]#
- field uuid: str [Required]#
- contours_json(image_id)#
Deserialize the contours as JSON
In the backend we store polygon contours per image, so we need to keep a reference to which image the polygon belongs to.
- Return type: Dict[int, List]
Example
>>> polygon = Polygon(
...     contours=[
...         Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)]),
...         Contour(pixels=[Pixel(x=12, y=9), Pixel(x=11, y=11)]),
...     ]
... )
>>> polygon.contours_json(123)
{123: [[[0, 0], [0, 1]], [[12, 9], [11, 11]]]}
- contours_opencv()#
Deserialize the contours in a polygon to be OpenCV-contour compatible
OpenCV.drawContours expects a list of np.ndarrays corresponding to the contours in the polygon.
- Return type: List[ndarray]
Example
>>> polygon = Polygon(
...     contours=[Contour(pixels=[Pixel(x=0, y=0), Pixel(x=0, y=1)])]
... )
>>> polygon.contours_opencv()
[np.array([[0, 0], [0, 1]])]
dataquality.schemas.seq2seq module#
- class Seq2SeqModelType(value)#
Bases: str, Enum
An enumeration.
- encoder_decoder = 'encoder_decoder'#
- decoder_only = 'decoder_only'#
- static members()#
- Return type: List[str]
- class Seq2SeqInputCols(value)#
Bases: str, Enum
An enumeration.
- id = 'id'#
- input = 'input'#
- target = 'target'#
- generated_output = 'generated_output'#
- split_ = 'split'#
- tokenized_label = 'tokenized_label'#
- input_cutoff = 'input_cutoff'#
- target_cutoff = 'target_cutoff'#
- token_label_str = 'token_label_str'#
- token_label_positions = 'token_label_positions'#
- token_label_offsets = 'token_label_offsets'#
- system_prompts = 'system_prompts'#
- class Seq2SeqInputTempCols(value)#
Bases: str, Enum
An enumeration.
- formatted_prompts = 'galileo_formatted_prompts'#
- class Seq2SeqOutputCols(value)#
Bases: str, Enum
An enumeration.
- id = 'id'#
- emb = 'emb'#
- token_logprobs = 'token_logprobs'#
- top_logprobs = 'top_logprobs'#
- generated_output = 'generated_output'#
- generated_token_label_positions = 'generated_token_label_positions'#
- generated_token_label_offsets = 'generated_token_label_offsets'#
- generated_token_logprobs = 'generated_token_logprobs'#
- generated_top_logprobs = 'generated_top_logprobs'#
- split_ = 'split'#
- epoch = 'epoch'#
- inference_name = 'inference_name'#
- generation_data = '_generation_data'#
- static generated_cols()#
- Return type: List[str]
- class AlignedTokenData(token_label_offsets, token_label_positions)#
Bases: object
- token_label_offsets: List[List[Tuple[int, int]]]#
- token_label_positions: List[List[Set[int]]]#
- append(data)#
Append offsets and positions for a single sample
Assumes that data holds alignment info for a single data sample. As such, when appending to token_label_offsets and token_label_positions we remove the "batch" dimensions respectively.
e.g. >>> data.token_label_offsets[0]
- Return type: None
- class LogprobData(token_logprobs, top_logprobs)#
Bases: object
Data type for the top_logprobs for a single sample
Parameters:#
- token_logprobs: np.ndarray of shape [seq_len]
Token label logprobs for a single sample
- top_logprobs: List[List[Tuple[str, float]]]
List of top-k (str) predictions + corresponding logprobs
- token_logprobs: ndarray#
- top_logprobs: List[List[Tuple[str, float]]]#
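The top_logprobs shape (a list of (token, logprob) pairs per position) can be illustrated by building top-k pairs from a toy next-token distribution. This is a stdlib sketch for intuition; `top_k_logprobs` and the toy distribution are hypothetical, not part of dataquality.

```python
import math
from typing import Dict, List, Tuple

def top_k_logprobs(probs: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k most likely tokens with their log-probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(token, math.log(p)) for token, p in ranked]

# Toy next-token distribution for a single position
dist = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
top2 = top_k_logprobs(dist, 2)
# top2[0] == ("the", log(0.5))
```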
- class ModelGeneration(generated_ids, generated_logprob_data)#
Bases: object
- generated_ids: ndarray#
- generated_logprob_data: LogprobData#
- class BatchGenerationData(generated_outputs=<factory>, generated_token_label_positions=<factory>, generated_token_label_offsets=<factory>, generated_token_logprobs=<factory>, generated_top_logprobs=<factory>)#
Bases: object
Dataclass for Generated Output Data
Stores the processed information from generation over a batch OR df of text Inputs. Each parameter is a List of sample data with length equal to the number of samples currently in the BatchGenerationData object.
Parameters:#
- generated_outputs: List[str]
The actual generated strings for each Input sample
- generated_token_label_positions: List[List[Set[int]]]
Token label positions for each sample
- generated_token_label_offsets: List[List[Tuple[int, int]]]
Token label offsets for each sample
- generated_token_logprobs: np.ndarray of shape [seq_len]
Token label logprobs for each sample
- generated_top_logprobs: List[List[List[Tuple[str, float]]]]
top_logprobs for each sample
- generated_outputs: List[str]#
- generated_token_label_positions: List[List[Set[int]]]#
- generated_token_label_offsets: List[List[Tuple[int, int]]]#
- generated_token_logprobs: List[ndarray]#
- generated_top_logprobs: List[List[List[Tuple[str, float]]]]#
- extend_from(batch_data)#
Extend generation data from a new Batch
Note that we favor in-place combining of batches for improved memory and performance.
- Return type: None
dataquality.schemas.split module#
- class Split(value)#
Bases: str, Enum
An enumeration.
- train = 'training'#
- training = 'training'#
- val = 'validation'#
- valid = 'validation'#
- validation = 'validation'#
- test = 'test'#
- testing = 'test'#
- inference = 'inference'#
- static get_valid_attributes()#
- Return type: List[str]
- static get_valid_keys()#
- Return type: List[str]
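Several Split names share one value (train/training, val/valid/validation, test/testing), which is standard Python Enum aliasing: members defined with a duplicate value become aliases of the first member. A minimal sketch of that pattern (`SplitSketch` is an illustrative stand-in, not the library class):

```python
from enum import Enum

class SplitSketch(str, Enum):
    train = "training"
    training = "training"  # alias: same value as train
    val = "validation"
    valid = "validation"   # alias of val
    validation = "validation"
    test = "test"

# Lookup by value and by alias both normalize to the canonical member
assert SplitSketch("training") is SplitSketch.train
assert SplitSketch.valid is SplitSketch.val
assert SplitSketch.val.value == "validation"
```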
dataquality.schemas.task_type module#
- class TaskType(value)#
Bases: str, Enum
Valid task types supported for logging by Galileo
- text_classification = 'text_classification'#
- text_multi_label = 'text_multi_label'#
- text_ner = 'text_ner'#
- image_classification = 'image_classification'#
- tabular_classification = 'tabular_classification'#
- object_detection = 'object_detection'#
- semantic_segmentation = 'semantic_segmentation'#
- prompt_evaluation = 'prompt_evaluation'#
- seq2seq = 'seq2seq'#
- llm_monitor = 'llm_monitor'#
- seq2seq_completion = 'seq2seq_completion'#
- seq2seq_chat = 'seq2seq_chat'#
dataquality.schemas.torch module#
- class HelperData(value)#
Bases: str, Enum
A collection of all default attributes across all loggers
- dqcallback = 'dqcallback'#
- signature_cols = 'signature_cols'#
- orig_collate_fn = 'orig_collate_fn'#
- model_outputs_store = 'model_outputs_store'#
- model = 'model'#
- hook_manager = 'hook_manager'#
- last_action = 'last_action'#
- patches = 'patches'#
- dl_next_idx_ids = 'dl_next_idx_ids'#
- batch = 'batch'#
- model_input = 'model_input'#
Module contents#
- class RequestType(value)#
Bases: str, Enum
An enumeration.
- GET = 'get'#
- POST = 'post'#
- PUT = 'put'#
- DELETE = 'delete'#
- static get_method(request)#
- Return type: Callable
- class Route(value)#
Bases: str, Enum
List of available API routes
- projects = 'projects'#
- runs = 'runs'#
- users = 'users'#
- cleanup = 'cleanup'#
- login = 'login'#
- current_user = 'current_user'#
- healthcheck = 'healthcheck'#
- healthcheck_dq = 'healthcheck/dq'#
- slices = 'slices'#
- split_path = 'split'#
- splits = 'splits'#
- inference_names = 'inference_names'#
- jobs = 'jobs'#
- latest_job = 'jobs/latest'#
- presigned_url = 'presigned_url'#
- tasks = 'tasks'#
- labels = 'labels'#
- epochs = 'epochs'#
- summary = 'insights/summary'#
- groupby = 'insights/groupby'#
- metrics = 'metrics'#
- distribution = 'insights/distribution'#
- alerts = 'insights/alerts'#
- export = 'export'#
- edits = 'edits'#
- export_edits = 'edits/export'#
- notify = 'notify/email'#
- token = 'get-token'#
- upload_file = 'upload_file'#
- model = 'model'#
- link = 'link'#
- static content_path(project_id=None, run_id=None, split=None)#
- Return type: str