dataquality.dq_auto package#

Submodules#

dataquality.dq_auto.auto module#

auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)#

Automatically gets insights on a text classification or NER dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console

One of hf_data or train_data should be provided. If neither is provided, a demo dataset will be loaded by Galileo for training.

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.

  • hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you’d like to consider as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data

  • train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Optional inference datasets to run with after training completes. Use this param to include inference data alongside the train_data param; if you are passing data via the hf_data parameter, use the hf_inference_names param instead. The structure is a dictionary with the key being the inference name and the value one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path. See the inference example at the end of this section.

  • max_padding_length (int) – The max length for padding the input text during tokenization. Default 200

  • hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased

  • num_train_epochs (int) – The number of epochs to train for (early stopping will always be active). Default 15

  • labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data

  • project_name (Optional[str]) – Optional project name. If not set, a random name will be generated

  • run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated

  • wait (bool) – Whether to wait for Galileo to complete processing your run. Default True

  • create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset and upload them with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(…, include_data_embs=True) in the data_emb column. Only available for TC currently; NER coming soon. Default True if a GPU is available, else False.

  • early_stopping (bool) – Whether to use early stopping. Default True

Return type:

None

For text classification datasets, the only required columns are text and label

For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies

MIT Movies dataset in huggingface format

tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...       [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...       [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...       [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...       [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...       [0, 0, 0, 7, 0, 0, ...
...                                               ...                      ...

To see auto insights on a random, pre-selected dataset, simply run

import dataquality as dq

dq.auto()

An example using auto with a hosted huggingface text classification dataset

import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")

Similarly, for NER

import dataquality as dq

dq.auto(hf_data="conll2003")

An example using auto with sklearn data as pandas dataframes

import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
     train_data=df_train,
     test_data=df_test,
     labels=newsgroups_train.target_names,
     project_name="newsgroups_work",
     run_name="run_1_raw_data"
)

An example of using auto with a local CSV file with text and label columns

import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
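
An example attaching inference runs to the same training run via hf_inference_names or inference_data. This is a minimal sketch: the "inference" split name and the unlabeled.csv file are hypothetical and must exist in your own data.

import dataquality as dq

# With a DatasetDict: the listed keys must exist as splits in hf_data
dq.auto(
    hf_data="rungalileo/trec6",
    hf_inference_names=["inference"],  # hypothetical split name
)

# With in-memory or file data: map an inference name to a dataset
dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    inference_data={"new_customer_tickets": "unlabeled.csv"},  # hypothetical file
)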

dataquality.dq_auto.base_data_manager module#

class BaseDatasetManager#

Bases: object

DEMO_DATASETS: List[str] = []#
get_dataset_dict(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, labels=None, column_mapping=None)#

Creates and/or validates the DatasetDict provided by the user.

If the user provides a DatasetDict, we simply validate it. Otherwise, we parse a combination of the parameters provided, generate a DatasetDict of their training data, and validate that.

Return type:

DatasetDict
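
A minimal sketch of calling get_dataset_dict through one of the concrete managers documented below (TCDatasetManager); the no-argument construction is an assumption:

from dataquality.dq_auto.text_classification import TCDatasetManager

manager = TCDatasetManager()
# Builds and validates a DatasetDict from a huggingface hub path
dd = manager.get_dataset_dict(hf_data="rungalileo/trec6")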

try_load_dataset_dict(hf_data=None, train_data=None)#

Tries to load the DatasetDict if available

If the user provided the hf_data param, we load it from huggingface. If they provided nothing, we load the demo dataset. Otherwise, we return None, because the user provided train/test/val data, and that requires task-specific processing.

For HF datasets, we optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label

Return type:

Optional[DatasetDict]

dataquality.dq_auto.ner module#

class NERDatasetManager#

Bases: BaseDatasetManager

DEMO_DATASETS: List[str] = ['conll2003', 'rungalileo/mit_movies', 'wnut_17']#
auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, num_train_epochs=15, hf_model='distilbert-base-uncased', labels=None, project_name='auto_ner', run_name=None, wait=True, early_stopping=True)#

Automatically gets insights on an NER or Token Classification dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface token classification model, and provide Galileo insights via a link to the Galileo Console

One of hf_data, train_data should be provided. If neither of those are, a demo dataset will be loaded by Galileo for training.

The data must be provided in the standard huggingface format: a dataset with tokens and (ner_tags or tags) columns.

See example: https://huggingface.co/datasets/rungalileo/mit_movies

MIT Movies dataset in huggingface format

tokens                                              ner_tags
[what, is, a, good, action, movie, that, is, r...       [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...       [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...       [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...       [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...       [0, 0, 0, 7, 0, 0, ...
...                                               ...                      ...

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.

  • train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased

  • labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data

  • project_name (str) – Optional project name. If not set, a random name will be generated

  • run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated

  • wait (bool) – Whether to wait for Galileo to complete processing your run. Default True

  • early_stopping (bool) – Whether to use early stopping. Default True

Return type:

None

To see auto insights on a random, pre-selected dataset, simply run

from dataquality.auto.ner import auto

auto()

An example using auto with a hosted huggingface dataset

from dataquality.auto.ner import auto

auto(hf_data="rungalileo/mit_movies")

An example using auto with pandas dataframes

import pandas as pd
from dataquality.auto.ner import auto

# TODO EXAMPLE FOR NER FROM PANDAS DFs

auto(
    train_data=df_train,
    test_data=df_test,
    labels=['O', 'B-ACTOR', 'I-ACTOR', 'B-TITLE', 'I-TITLE', 'B-YEAR', 'I-YEAR'],
    project_name="ner_movie_reviews",
    run_name="run_1_raw_data"
)
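
Since the docstring leaves the pandas example as a TODO, here is a minimal sketch of what df_train and df_test could look like, assuming the standard tokens / ner_tags format, with tag ids indexing into the labels list passed above (the rows are illustrative only):

import pandas as pd

df_train = pd.DataFrame(
    {
        "tokens": [
            ["show", "me", "films", "with", "drew", "barrymore"],
            ["list", "movies", "from", "1985"],
        ],
        # 0=O, 1=B-ACTOR, 2=I-ACTOR, 5=B-YEAR (per the labels list above)
        "ner_tags": [[0, 0, 0, 0, 1, 2], [0, 0, 0, 5]],
    }
)
df_test = pd.DataFrame(
    {
        "tokens": [["who", "directed", "the", "matrix"]],
        # 3=B-TITLE, 4=I-TITLE
        "ner_tags": [[0, 0, 3, 4]],
    }
)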

An example of using auto with a local CSV file with tokens and ner_tags columns

from dataquality.auto.ner import auto

auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)

dataquality.dq_auto.ner_trainer module#

compute_metrics(eval_pred)#

Metrics computation for token classification

Taken directly from the docs https://huggingface.co/course/chapter7/2#metrics and updated for typing

Return type:

Dict
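
For reference, a sketch of the seqeval-based metrics computation the linked huggingface course page describes; the label_names list is a hypothetical example and the package's actual implementation may differ in details:

import numpy as np
import evaluate

metric = evaluate.load("seqeval")
label_names = ["O", "B-ACTOR", "I-ACTOR"]  # hypothetical label list

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop ignored (-100) positions and map ids back to string tags
    true_labels = [
        [label_names[l] for l in label if l != -100] for label in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }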

get_trainer(dd, model_checkpoint, num_train_epochs, labels=None, early_stopping=True)#
Return type:

Tuple[Trainer, DatasetDict]
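
A minimal sketch of calling get_trainer with a DatasetDict in the standard tokens / ner_tags format; the dataset choice and the follow-up trainer.train() call are assumptions:

from datasets import load_dataset
from dataquality.dq_auto.ner_trainer import get_trainer

dd = load_dataset("conll2003")
trainer, encoded_dd = get_trainer(
    dd,
    model_checkpoint="distilbert-base-uncased",
    num_train_epochs=15,
)
trainer.train()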

dataquality.dq_auto.notebook module#

auto_notebook()#
Return type:

None

dataquality.dq_auto.schema module#

class BaseAutoDatasetConfig(hf_data=None, train_path=None, val_path=None, test_path=None, train_data=None, val_data=None, test_data=None, input_col='text', target_col='label', formatter=<factory>)#

Bases: object

Configuration for creating a dataset from a file or object

One of hf_data, train_path, or train_data should be provided. If none of those is provided, a demo dataset will be loaded by Galileo for training.

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.

  • train_path (Optional[str]) – Optional path to training data file to use. Must be: * Path to a local file

  • val_path (Optional[str]) – Optional path to validation data to use. Must be: * Path to a local file

  • test_path (Optional[str]) – Optional path to test data to use. Must be: * Path to a local file

  • train_data (Union[DataFrame, Dataset, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset

  • val_data (Union[DataFrame, Dataset, None]) – Optional validation data to use. Can be one of * Pandas dataframe * Huggingface dataset

  • test_data (Union[DataFrame, Dataset, None]) – Optional test data to use. Can be one of * Pandas dataframe * Huggingface dataset

  • input_col (str) – Column name for input data, defaults to “text”

  • target_col (str) – Column name for target data, defaults to “label”

hf_data: Union[DatasetDict, str, None] = None#
train_path: Optional[str] = None#
val_path: Optional[str] = None#
test_path: Optional[str] = None#
train_data: Union[DataFrame, Dataset, None] = None#
val_data: Union[DataFrame, Dataset, None] = None#
test_data: Union[DataFrame, Dataset, None] = None#
input_col: str = 'text'#
target_col: str = 'label'#
formatter: BaseFormatter#
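
A minimal sketch of constructing the config from local files with non-standard column names; the file names are hypothetical, and how the config is passed to auto depends on the calling API:

from dataquality.dq_auto.schema import BaseAutoDatasetConfig

dataset_config = BaseAutoDatasetConfig(
    train_path="reviews_train.csv",  # hypothetical local file
    val_path="reviews_val.csv",
    input_col="review_text",         # mapped to the expected "text" column
    target_col="sentiment",          # mapped to the expected "label" column
)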
class BaseAutoTrainingConfig(model='distilbert-base-uncased', epochs=15, learning_rate=0.0003, batch_size=4, create_data_embs=None, data_embs_col='text', return_model=False)#

Bases: object

Configuration for training a HuggingFace model

Base config values are based on auto with Text Classification. They can be overridden by subclasses for each modality.

Parameters:
  • model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased

  • epochs (int) – Optional num training epochs. If not set, we default to 15

  • learning_rate (float) – Optional learning rate. If not set, we default to 3e-4

  • batch_size (int) – Optional batch size. If not set, we default to 4

  • create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If set to None, data embeddings will be created only if a GPU is available

  • return_model (bool) – Whether to return the trained model at the end of auto. Default False

  • data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to ‘text’

model: str = 'distilbert-base-uncased'#
epochs: int = 15#
learning_rate: float = 0.0003#
batch_size: int = 4#
create_data_embs: Optional[bool] = None#
data_embs_col: str = 'text'#
return_model: bool = False#
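
A minimal sketch of overriding the training defaults documented above; the values are illustrative:

from dataquality.dq_auto.schema import BaseAutoTrainingConfig

training_config = BaseAutoTrainingConfig(
    model="distilbert-base-uncased",
    epochs=5,
    learning_rate=1e-4,
    batch_size=16,
    create_data_embs=True,
    return_model=True,
)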

dataquality.dq_auto.tc_trainer module#

preprocess_function(input_data, tokenizer, max_length)#
Return type:

BatchEncoding

compute_metrics(metric, eval_pred)#
Return type:

Dict

get_trainer(dd, labels, model_checkpoint, max_padding_length, num_train_epochs, early_stopping=True)#
Return type:

Tuple[Trainer, DatasetDict]
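
A minimal sketch of calling get_trainer with a text classification DatasetDict; the dataset, the ClassLabel-based label extraction, and the follow-up trainer.train() call are assumptions:

from datasets import load_dataset
from dataquality.dq_auto.tc_trainer import get_trainer

dd = load_dataset("rungalileo/trec6")
# Assumes the label column is a ClassLabel; otherwise pass the label list explicitly
labels = dd["train"].features["label"].names
trainer, encoded_dd = get_trainer(
    dd,
    labels=labels,
    model_checkpoint="distilbert-base-uncased",
    max_padding_length=200,
    num_train_epochs=15,
)
trainer.train()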

dataquality.dq_auto.text_classification module#

class TCDatasetManager#

Bases: BaseDatasetManager

DEMO_DATASETS: List[str] = ['rungalileo/newsgroups', 'rungalileo/trec6', 'rungalileo/conv_intent', 'rungalileo/emotion', 'rungalileo/amazon_polarity_30k', 'rungalileo/sst2']#
auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, num_train_epochs=15, hf_model='distilbert-base-uncased', labels=None, project_name='auto_tc', run_name=None, wait=True, create_data_embs=None, early_stopping=True)#

Automatically gets insights on a text classification dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console

One of hf_data, train_data should be provided. If neither of those are, a demo dataset will be loaded by Galileo for training.

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.

  • hf_inference_names (Optional[List[str]]) – A list of key names in hf_data to be run as inference runs after training. If set, those keys must exist in hf_data

  • train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is what is used for the evaluation dataset in huggingface, and what is used for early stopping. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available, the train data will be randomly split 80/20 for use as evaluation data. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. The test data, if provided with val, will be used after training is complete, as the held-out set. If no validation data is provided, this will instead be used as the evaluation set. Can be one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Optional inference datasets to run with after training completes. The structure is a dictionary with the key being the inference name and the value one of * Pandas dataframe * Huggingface dataset * Path to a local file * Huggingface dataset hub path

  • max_padding_length (int) – The max length for padding the input text during tokenization. Default 200

  • hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased

  • labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, they will attempt to be extracted from the data

  • project_name (str) – Optional project name. If not set, a random name will be generated

  • run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated

  • wait (bool) – Whether to wait for Galileo to complete processing your run. Default True

  • create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. Default False

  • early_stopping (bool) – Whether to use early stopping. Default True

Return type:

Trainer

To see auto insights on a random, pre-selected dataset, simply run

from dataquality.auto.text_classification import auto

auto()

An example using auto with a hosted huggingface dataset

from dataquality.auto.text_classification import auto

auto(hf_data="rungalileo/trec6")

An example using auto with sklearn data as pandas dataframes

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from dataquality.auto.text_classification import auto

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data"
)

An example of using auto with a local CSV file with text and label columns

from dataquality.auto.text_classification import auto

auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)

Module contents#