dataquality.dq_auto package
Submodules
dataquality.dq_auto.auto module
- auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, hf_model='distilbert-base-uncased', num_train_epochs=15, labels=None, project_name=None, run_name=None, wait=True, create_data_embs=None, early_stopping=True)
Automatically gets insights on a text classification or NER dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console
One of hf_data or train_data should be provided. If neither is, Galileo loads a demo dataset for training.
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (Optional[List[str]]) – Use this param alongside hf_data if you have splits you'd like to treat as inference. A list of key names in hf_data to be run as inference runs after training. Any keys set must exist in hf_data.
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is used as the evaluation dataset in huggingface and for early stopping. If it is not provided but test_data is, test_data is used as the evaluation set. If neither val nor test is available, the train data is randomly split 80/20 to create evaluation data. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. If provided together with val_data, the test data is used after training completes, as the held-out set. If no validation data is provided, it is used as the evaluation set instead. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Use this param to include inference data alongside the train_data param. If you are passing data via the hf_data parameter, use the hf_inference_names param instead. Optional inference datasets to run after training completes. The structure is a dictionary whose keys are inference names and whose values are one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
max_padding_length (int) – The max length for padding the input text during tokenization. Default 200.
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased.
num_train_epochs (int) – The number of epochs to train for (early stopping is active by default; see early_stopping). Default 15.
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, auto will attempt to extract them from the data.
project_name (Optional[str]) – Optional project name. If not set, a random name will be generated.
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated.
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True.
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If True, Sentence-Transformers will be used to generate data embeddings for this dataset, and they will be uploaded with this run. You can access these embeddings via dq.metrics.get_data_embeddings in the emb column or dq.metrics.get_dataframe(…, include_data_embs=True) in the data_emb column. Only available for TC currently. NER coming soon. Default True if a GPU is available, else False.
early_stopping (bool) – Whether to use early stopping. Default True.
- Return type:
None
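The fallback rules described for val_data and test_data can be sketched as follows. This is a minimal illustration in plain Python, not the library's implementation: the function name choose_eval_split, the seed argument, and the use of lists of dicts as datasets are all assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Rows = List[dict]


def choose_eval_split(
    train: Rows,
    val: Optional[Rows] = None,
    test: Optional[Rows] = None,
    seed: int = 0,  # hypothetical: only here to make the sketch reproducible
) -> Tuple[Rows, Rows, Optional[Rows]]:
    """Return (train, evaluation, held_out) following the documented fallbacks."""
    if val is not None:
        # val is the evaluation set; test (if any) is held out until after training
        return list(train), list(val), (list(test) if test is not None else None)
    if test is not None:
        # no val: test is used as the evaluation set instead
        return list(train), list(test), None
    # neither val nor test: random 80/20 split of the training data
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:], None


rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
train_rows, eval_rows, held_out = choose_eval_split(rows)
print(len(train_rows), len(eval_rows), held_out)
```

Passing only train_data is therefore the simplest call, at the cost of evaluating on a slice of the training data.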
For text classification datasets, the only required columns are text and label
For NER, the required format is the huggingface standard format of tokens and tags (or ner_tags). See example: https://huggingface.co/datasets/rungalileo/mit_movies
MIT Movies dataset in huggingface format
tokens                                             ner_tags
[what, is, a, good, action, movie, that, is, r...  [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...  [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...  [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...  [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...  [0, 0, 0, 7, 0, 0, ...
...                                                ...
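Before handing NER data to auto, it can help to check that each row really follows this format. The sketch below is a standalone check in plain Python; the function check_ner_rows and the placeholder label names are hypothetical and are not part of dataquality.

```python
from typing import Dict, List


def check_ner_rows(rows: List[Dict[str, list]], labels: List[str]) -> None:
    """Raise ValueError if a row is not in the tokens / ner_tags (or tags) format."""
    for i, row in enumerate(rows):
        tokens = row.get("tokens")
        tags = row.get("ner_tags", row.get("tags"))
        if tokens is None or tags is None:
            raise ValueError(f"row {i}: expected tokens and ner_tags (or tags)")
        if len(tokens) != len(tags):
            raise ValueError(f"row {i}: {len(tokens)} tokens but {len(tags)} tags")
        for tag in tags:
            # tags are integer indices into the label list
            if not 0 <= tag < len(labels):
                raise ValueError(f"row {i}: tag id {tag} has no label")


# Placeholder label names; the real names come from the dataset itself
labels = [f"LABEL_{i}" for i in range(9)]
rows = [
    {"tokens": ["show", "me", "political", "drama", "movies"], "ner_tags": [0, 0, 7, 8, 0]},
]
check_ner_rows(rows, labels)  # passes silently when the rows are well formed
```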
To see auto insights on a random, pre-selected dataset, simply run
import dataquality as dq

dq.auto()
An example using auto with a hosted huggingface text classification dataset
import dataquality as dq

dq.auto(hf_data="rungalileo/trec6")
Similarly, for NER
import dataquality as dq

dq.auto(hf_data="conll2003")
An example using auto with sklearn data as pandas dataframes
import dataquality as dq
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

dq.auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data"
)
An example of using auto with a local CSV file with text and label columns
import dataquality as dq

dq.auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
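For reference, a CSV file with the required text and label columns can be produced with the standard library alone. The helper names and the sample rows below are illustrative assumptions; only the two column names come from the documentation above.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def write_text_label_csv(path: Path, rows: List[Dict[str, str]]) -> None:
    """Write rows to a CSV file whose header is exactly text,label."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)


def read_columns(path: Path) -> List[str]:
    """Return the header row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


# Made-up sample rows, written to a temporary directory for the demo
tmp_dir = Path(tempfile.mkdtemp())
train_path = tmp_dir / "train.csv"
write_text_label_csv(
    train_path,
    [
        {"text": "How far is the moon from the earth?", "label": "distance"},
        {"text": "Who wrote this book?", "label": "person"},
    ],
)
print(read_columns(train_path))
```

The resulting path could then be passed as train_data in the call shown above.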
dataquality.dq_auto.base_data_manager module
- class BaseDatasetManager
Bases: object
- DEMO_DATASETS: List[str] = []
- get_dataset_dict(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, labels=None, column_mapping=None)
Creates and/or validates the DatasetDict provided by the user.
If the user provides a DatasetDict, we simply validate it. Otherwise, we parse a combination of the parameters provided, generate a DatasetDict of their training data, and validate that.
- Return type:
DatasetDict
- try_load_dataset_dict(hf_data=None, train_data=None)
Tries to load the DatasetDict if available.
If the user provided the hf_data param, we load it from huggingface. If they provided nothing, we load the demo dataset. Otherwise, we return None, because the user provided train/test/val data, and that requires task-specific processing.
For HF datasets, we optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label
- Return type:
Optional[DatasetDict]
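The three branches described for try_load_dataset_dict can be sketched like this. It is an illustration under stated assumptions, not the library's code: load_fn is a hypothetical stand-in for whatever loads a dataset from the hub, and picking the demo dataset with random.choice is an assumption based on the "random, pre-selected dataset" wording elsewhere on this page.

```python
import random
from typing import Any, Callable, Optional, Sequence


def try_load_sketch(
    load_fn: Callable[[str], Any],  # hypothetical hub loader
    hf_data: Any = None,
    train_data: Any = None,
    demo_datasets: Sequence[str] = (),
) -> Optional[Any]:
    """Return a loaded dataset, or None when task-specific processing is needed."""
    if hf_data is not None:
        # a string is treated as a hub path; anything else is already in memory
        return load_fn(hf_data) if isinstance(hf_data, str) else hf_data
    if train_data is None:
        # the user provided nothing: fall back to a demo dataset
        if not demo_datasets:
            raise ValueError("no demo datasets configured")
        return load_fn(random.choice(list(demo_datasets)))
    # train/val/test data was provided; the caller must process it per task
    return None


def fake_loader(name: str) -> dict:
    """Stand-in loader used only for this demonstration."""
    return {"name": name}


print(try_load_sketch(fake_loader, hf_data="rungalileo/trec6"))
print(try_load_sketch(fake_loader, train_data=[{"text": "hi", "label": 0}]))
```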
dataquality.dq_auto.ner module
- class NERDatasetManager
Bases: BaseDatasetManager
- DEMO_DATASETS: List[str] = ['conll2003', 'rungalileo/mit_movies', 'wnut_17']
- auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, num_train_epochs=15, hf_model='distilbert-base-uncased', labels=None, project_name='auto_ner', run_name=None, wait=True, early_stopping=True)
Automatically gets insights on an NER or Token Classification dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface token classification model, and provide Galileo insights via a link to the Galileo Console
One of hf_data or train_data should be provided. If neither is, Galileo loads a demo dataset for training.
The data must be provided in the standard huggingface format: a dataset with tokens and tags (or ner_tags) columns.
See example: https://huggingface.co/datasets/rungalileo/mit_movies
MIT Movies dataset in huggingface format
tokens                                             ner_tags
[what, is, a, good, action, movie, that, is, r...  [0, 0, 0, 0, 7, 0, ...
[show, me, political, drama, movies, with, jef...  [0, 0, 7, 8, 0, 0, ...
[what, are, some, good, 1980, s, g, rated, mys...  [0, 0, 0, 0, 5, 6, ...
[list, a, crime, film, which, director, was, d...  [0, 0, 7, 0, 0, 0, ...
[is, there, a, thriller, movie, starring, al, ...  [0, 0, 0, 7, 0, 0, ...
...                                                ...
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is used as the evaluation dataset in huggingface and for early stopping. If it is not provided but test_data is, test_data is used as the evaluation set. If neither val nor test is available, the train data is randomly split 80/20 to create evaluation data. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. If provided together with val_data, the test data is used after training completes, as the held-out set. If no validation data is provided, it is used as the evaluation set instead. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
num_train_epochs (int) – The number of epochs to train for. Default 15.
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased.
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, auto will attempt to extract them from the data.
project_name (str) – Optional project name. Default auto_ner.
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated.
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True.
early_stopping (bool) – Whether to use early stopping. Default True.
- Return type:
None
To see auto insights on a random, pre-selected dataset, simply run
```python
from dataquality.dq_auto.ner import auto

auto()
```
An example using auto with a hosted huggingface dataset
```python
from dataquality.dq_auto.ner import auto

auto(hf_data="rungalileo/mit_movies")
```
An example using auto with pandas dataframes
```python
import pandas as pd
from dataquality.dq_auto.ner import auto

# TODO EXAMPLE FOR NER FROM PANDAS DFs
# df_train and df_test must be dataframes with tokens and tags (or ner_tags) columns
auto(
    train_data=df_train,
    test_data=df_test,
    labels=['O', 'B-ACTOR', 'I-ACTOR', 'B-TITLE', 'I-TITLE', 'B-YEAR', 'I-YEAR'],
    project_name="ner_movie_reviews",
    run_name="run_1_raw_data"
)
```
An example of using auto with a local CSV file containing the tokens and tags (or ner_tags) columns described above
```python
from dataquality.dq_auto.ner import auto

auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
```
dataquality.dq_auto.ner_trainer module
- compute_metrics(eval_pred)
Metrics computation for token classification
Taken directly from the docs https://huggingface.co/course/chapter7/2#metrics and updated for typing
- Return type:
Dict
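The key step in token classification metrics is dropping positions whose label is -100 (special tokens and sub-word continuations) before comparing predictions with gold labels. The sketch below shows only that masking step with a simple token-level accuracy in plain Python. It is a simplification: the linked Hugging Face course example computes entity-level precision, recall, and F1 with seqeval, and the function names here are made up for illustration.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # label id the Hugging Face course uses for ignored positions


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def token_accuracy(
    logits: List[List[List[float]]],  # [sentence][token][class]
    labels: List[List[int]],  # [sentence][token]
) -> Dict[str, float]:
    """Accuracy over tokens whose gold label is not IGNORE_INDEX."""
    correct = 0
    total = 0
    for sent_logits, sent_labels in zip(logits, labels):
        for tok_logits, gold in zip(sent_logits, sent_labels):
            if gold == IGNORE_INDEX:
                continue  # skip special tokens and masked sub-words
            total += 1
            correct += int(argmax(tok_logits) == gold)
    return {"accuracy": correct / total if total else 0.0}


example_logits = [[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]]
example_labels = [[1, IGNORE_INDEX, 0]]
print(token_accuracy(example_logits, example_labels))
```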
- get_trainer(dd, model_checkpoint, num_train_epochs, labels=None, early_stopping=True)
- Return type:
Tuple[Trainer, DatasetDict]
dataquality.dq_auto.notebook module
- auto_notebook()
- Return type:
None
dataquality.dq_auto.schema module
- class BaseAutoDatasetConfig(hf_data=None, train_path=None, val_path=None, test_path=None, train_data=None, val_data=None, test_data=None, input_col='text', target_col='label', formatter=<factory>)
Bases: object
Configuration for creating a dataset from a file or object
One of hf_data, train_path, or train_data should be provided. If none of them is, Galileo loads a demo dataset for training.
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
train_path (Optional[str]) – Optional path to the training data file to use. Must be a path to a local file.
val_path (Optional[str]) – Optional path to the validation data to use. Must be a path to a local file.
test_path (Optional[str]) – Optional path to the test data to use. Must be a path to a local file.
train_data (Union[DataFrame, Dataset, None]) – Optional training data to use. Can be one of: Pandas dataframe or Huggingface dataset.
val_data (Union[DataFrame, Dataset, None]) – Optional validation data to use. Can be one of: Pandas dataframe or Huggingface dataset.
test_data (Union[DataFrame, Dataset, None]) – Optional test data to use. Can be one of: Pandas dataframe or Huggingface dataset.
input_col (str) – Column name for input data. Defaults to "text".
target_col (str) – Column name for target data. Defaults to "label".
- hf_data: Union[DatasetDict, str, None] = None
- train_path: Optional[str] = None
- val_path: Optional[str] = None
- test_path: Optional[str] = None
- train_data: Union[DataFrame, Dataset, None] = None
- val_data: Union[DataFrame, Dataset, None] = None
- test_data: Union[DataFrame, Dataset, None] = None
- input_col: str = 'text'
- target_col: str = 'label'
- formatter: BaseFormatter
- class BaseAutoTrainingConfig(model='distilbert-base-uncased', epochs=15, learning_rate=0.0003, batch_size=4, create_data_embs=None, data_embs_col='text', return_model=False)
Bases: object
Configuration for training a HuggingFace model
Base config values are based on auto with Text Classification. They can be overridden by the subclass for each modality.
- Parameters:
model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased.
epochs (int) – Optional number of training epochs. If not set, we default to 15.
learning_rate (float) – Optional learning rate. If not set, we default to 3e-4.
batch_size (int) – Optional batch size. If not set, we default to 4.
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If set to None, data embeddings will be created only if a GPU is available.
return_model (bool) – Whether to return the trained model at the end of auto. Default False.
data_embs_col (str) – Optional text column on which to compute data embeddings. If not set, we default to 'text'.
- model: str = 'distilbert-base-uncased'
- epochs: int = 15
- learning_rate: float = 0.0003
- batch_size: int = 4
- create_data_embs: Optional[bool] = None
- data_embs_col: str = 'text'
- return_model: bool = False
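To make the None default of create_data_embs concrete, here is a small stand-in dataclass that mirrors the documented fields and resolves the flag from GPU availability. The class TrainingConfigSketch, its method, and the subclass are hypothetical; how the real class detects a GPU is not shown on this page, so availability is passed in as an argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfigSketch:
    """Stand-in mirroring the documented BaseAutoTrainingConfig defaults."""

    model: str = "distilbert-base-uncased"
    epochs: int = 15
    learning_rate: float = 3e-4
    batch_size: int = 4
    create_data_embs: Optional[bool] = None
    data_embs_col: str = "text"
    return_model: bool = False

    def resolve_create_data_embs(self, gpu_available: bool) -> bool:
        """None means: create embeddings only when a GPU is available."""
        if self.create_data_embs is None:
            return gpu_available
        return self.create_data_embs


@dataclass
class SmallBatchConfigSketch(TrainingConfigSketch):
    """Hypothetical subclass showing how a modality could override a default."""

    batch_size: int = 2


config = TrainingConfigSketch()
print(config.resolve_create_data_embs(gpu_available=False))
print(SmallBatchConfigSketch().batch_size)
```

An explicit True or False always wins over the GPU check, which keeps runs reproducible across machines.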
dataquality.dq_auto.tc_trainer module
- preprocess_function(input_data, tokenizer, max_length)
- Return type:
BatchEncoding
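The body of preprocess_function is not shown here, but its max_length argument suggests the usual fixed-length tokenization described for max_padding_length. As a rough illustration of what padding and truncating to a fixed length means, the toy function below operates on already-tokenized ids. It is an assumption-laden sketch, not the library's code: real Hugging Face tokenizers do this through their padding, truncation, and max_length arguments, and pad_id=0 is arbitrary.

```python
from typing import Dict, List, Sequence


def pad_or_truncate(
    token_ids: Sequence[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


print(pad_or_truncate([5, 6, 7], max_length=5))
print(pad_or_truncate([1, 2, 3, 4], max_length=2))
```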
- compute_metrics(metric, eval_pred)
- Return type:
Dict
- get_trainer(dd, labels, model_checkpoint, max_padding_length, num_train_epochs, early_stopping=True)
- Return type:
Tuple[Trainer, DatasetDict]
dataquality.dq_auto.text_classification module
- class TCDatasetManager
Bases: BaseDatasetManager
- DEMO_DATASETS: List[str] = ['rungalileo/newsgroups', 'rungalileo/trec6', 'rungalileo/conv_intent', 'rungalileo/emotion', 'rungalileo/amazon_polarity_30k', 'rungalileo/sst2']
- auto(hf_data=None, hf_inference_names=None, train_data=None, val_data=None, test_data=None, inference_data=None, max_padding_length=200, num_train_epochs=15, hf_model='distilbert-base-uncased', labels=None, project_name='auto_tc', run_name=None, wait=True, create_data_embs=None, early_stopping=True)
Automatically gets insights on a text classification dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console
One of hf_data or train_data should be provided. If neither is, Galileo loads a demo dataset for training.
- Parameters:
hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored.
hf_inference_names (Optional[List[str]]) – A list of key names in hf_data to be run as inference runs after training. If set, those keys must exist in hf_data.
train_data (Union[DataFrame, Dataset, str, None]) – Optional training data to use. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
val_data (Union[DataFrame, Dataset, str, None]) – Optional validation data to use. The validation data is used as the evaluation dataset in huggingface and for early stopping. If it is not provided but test_data is, test_data is used as the evaluation set. If neither val nor test is available, the train data is randomly split 80/20 to create evaluation data. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
test_data (Union[DataFrame, Dataset, str, None]) – Optional test data to use. If provided together with val_data, the test data is used after training completes, as the held-out set. If no validation data is provided, it is used as the evaluation set instead. Can be one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
inference_data (Optional[Dict[str, Union[DataFrame, Dataset, str]]]) – Optional inference datasets to run after training completes. The structure is a dictionary whose keys are inference names and whose values are one of: Pandas dataframe, Huggingface dataset, path to a local file, or Huggingface dataset hub path.
max_padding_length (int) – The max length for padding the input text during tokenization. Default 200.
num_train_epochs (int) – The number of epochs to train for. Default 15.
hf_model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default distilbert-base-uncased.
labels (Optional[List[str]]) – Optional list of labels for this dataset. If not provided, auto will attempt to extract them from the data.
project_name (str) – Optional project name. Default auto_tc.
run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated.
wait (bool) – Whether to wait for Galileo to complete processing your run. Default True.
create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. Default None; as documented for dq.auto, embeddings are then created only if a GPU is available.
early_stopping (bool) – Whether to use early stopping. Default True.
- Return type:
Trainer
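The rule that every name in hf_inference_names must exist in hf_data can be checked up front. The sketch below works on plain split names; the function partition_splits is hypothetical and only illustrates the documented constraint, not how dataquality enforces it.

```python
from typing import Iterable, List, Optional, Tuple


def partition_splits(
    dataset_keys: Iterable[str], hf_inference_names: Optional[List[str]] = None
) -> Tuple[List[str], List[str]]:
    """Split dataset keys into (training-side splits, inference splits)."""
    keys = list(dataset_keys)
    names = list(hf_inference_names or [])
    missing = [name for name in names if name not in keys]
    if missing:
        # mirrors the documented requirement that the keys exist in hf_data
        raise KeyError(f"inference names not found in hf_data: {missing}")
    regular = [key for key in keys if key not in names]
    return regular, names


print(partition_splits(["train", "validation", "inf_jan"], ["inf_jan"]))
```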
To see auto insights on a random, pre-selected dataset, simply run
```python
from dataquality.dq_auto.text_classification import auto

auto()
```
An example using auto with a hosted huggingface dataset
```python
from dataquality.dq_auto.text_classification import auto

auto(hf_data="rungalileo/trec6")
```
An example using auto with sklearn data as pandas dataframes
```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from dataquality.dq_auto.text_classification import auto

# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# Convert to pandas dataframes
df_train = pd.DataFrame(
    {"text": newsgroups_train.data, "label": newsgroups_train.target}
)
df_test = pd.DataFrame(
    {"text": newsgroups_test.data, "label": newsgroups_test.target}
)

auto(
    train_data=df_train,
    test_data=df_test,
    labels=newsgroups_train.target_names,
    project_name="newsgroups_work",
    run_name="run_1_raw_data"
)
```
An example of using auto with a local CSV file with text and label columns
```python
from dataquality.dq_auto.text_classification import auto

auto(
    train_data="train.csv",
    test_data="test.csv",
    project_name="data_from_local",
    run_name="run_1_raw_data"
)
```