dataquality.integrations.seq2seq package#
Subpackages#
- dataquality.integrations.seq2seq.formatters package
- Submodules
- dataquality.integrations.seq2seq.formatters.alpaca module
- dataquality.integrations.seq2seq.formatters.base module
- dataquality.integrations.seq2seq.formatters.chat module
ChatFormatter
ChatFormatter.name
ChatFormatter.input_col
ChatFormatter.target_col
ChatFormatter.max_train_size
ChatFormatter.process_batch
ChatFormatter.turns_col
ChatFormatter.metadata_col
ChatFormatter.content_col
ChatFormatter.role_col
ChatFormatter.user
ChatFormatter.assistant
ChatFormatter.system
ChatFormatter.format_sample()
ChatHistoryFormatter
- Module contents
Submodules#
dataquality.integrations.seq2seq.auto module#
- class S2SDatasetManager#
Bases: BaseDatasetManager
- DEMO_DATASETS: List[str] = ['tatsu-lab/alpaca']#
- try_load_dataset_dict_from_config(dataset_config)#
Tries to load the DatasetDict if available
If the user provided the hf_data param, we load it from huggingface. If they provided nothing, we load the demo dataset. Otherwise, we return None, because the user provided train/test/val data, and that requires task-specific processing.
For HF datasets, we optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label.
- Return type:
Tuple[Optional[DatasetDict], Seq2SeqDatasetConfig]
- get_dataset_dict_from_config(dataset_config, max_train_size=None, create_val_data_if_missing=True)#
Creates and/or validates the DatasetDict provided by the user.
If a user provides a DatasetDict, we simply validate it. Otherwise, we parse a combination of the parameters provided, generate a DatasetDict of their training data, and validate that.
If the user provides hf_data, we load that dataset from huggingface and optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label.
If the user provides train_path, val_path, or test_path, we load those files and convert them to a DatasetDict.
Else if the user provides train_data, val_data, or test_data, we convert those to a DatasetDict.
- Return type:
Tuple[DatasetDict, Seq2SeqDatasetConfig]
- auto(project_name='auto_s2s', run_name=None, dataset_config=None, training_config=None, generation_config=None, max_train_size=None, wait=True)#
Automatically get insights on a Seq2Seq dataset
Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console. If the number of epochs in training_config is set to 0, training/fine-tuning will be skipped and we will only do a forward pass (on all the splits).
One of the DatasetConfig fields hf_data, train_path, or train_data should be provided. If none of those is provided, a demo dataset will be loaded by Galileo for training.
The validation data is what is used for the evaluation dataset in training. If not provided, but test_data is, that will be used as the evaluation set. If neither val nor test are available (and training is not skipped), the train data will be randomly split 80/20 for use as evaluation data.
The test data, if provided with val, will be used after training is complete, as the hold-out set. If no validation data is provided, this will instead be used as the evaluation set.
- Parameters:
  - project_name (str) – Optional project name. If not set, a random name will be generated
  - run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated
  - dataset_config (Optional[Seq2SeqDatasetConfig]) – Optional config for loading the dataset. See Seq2SeqDatasetConfig for more details
  - training_config (Optional[Seq2SeqTrainingConfig]) – Optional config for training the model. See Seq2SeqTrainingConfig for more details
  - generation_config (Optional[Seq2SeqGenerationConfig]) – Optional config for generating predictions. See Seq2SeqGenerationConfig for more details
  - max_train_size (Optional[int]) – Optional max number of training examples to use.
  - wait (bool) – Whether to wait for Galileo to complete processing your run. Default True
- Return type:
Optional[PreTrainedModel]
To see auto insights on a random, pre-selected dataset, simply run:

```python
from dataquality.integrations.seq2seq import auto

auto()
```

An example using auto with a hosted huggingface dataset:

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

dataset_config = Seq2SeqDatasetConfig(hf_data="tatsu-lab/alpaca")
auto(dataset_config=dataset_config)
```

An example of using auto with local files containing input and target columns:

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

dataset_config = Seq2SeqDatasetConfig(
    train_path="train.jsonl", val_path="eval.jsonl"
)
auto(
    project_name="s2s_auto",
    run_name="completion_dataset",
    dataset_config=dataset_config,
)
```
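Based on the epochs=0 behavior described above, here is a minimal sketch of a forward-pass-only run (no fine-tuning); the demo dataset and config values are illustrative assumptions:

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqDatasetConfig,
    Seq2SeqTrainingConfig,
)

dataset_config = Seq2SeqDatasetConfig(hf_data="tatsu-lab/alpaca")
# epochs=0 skips training/fine-tuning; only a forward pass is run on all splits
training_config = Seq2SeqTrainingConfig(epochs=0)
auto(dataset_config=dataset_config, training_config=training_config)
```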
dataquality.integrations.seq2seq.core module#
- set_tokenizer(tokenizer, model_type, max_input_tokens=None, max_target_tokens=None)#
Seq2seq only. Set the tokenizer for your run
Must be either a Tokenizer or a fast pretrained tokenizer, and must support decode, encode, and encode_plus. We will use this tokenizer for both the input and the target. Both will be truncated after a certain length, which is set by the args max_input_tokens and max_target_tokens.
- Parameters:
  - tokenizer – This must be either an instance of Tokenizer from tokenizers or a PreTrainedTokenizerFast from huggingface (e.g. T5TokenizerFast, GPT2TokenizerFast, etc). Your tokenizer should have an .is_fast property that returns True if it’s a fast tokenizer. This class must implement the encode, decode, and encode_plus methods.
  - max_input_tokens – max number of tokens used in the input. We will tokenize the input and truncate at this number. If not specified, we will use tokenizer.model_max_length
  - max_target_tokens – max number of tokens used in the target. We will tokenize the target and truncate at this number. If not specified, we will use tokenizer.model_max_length
- Return type:
None
You can set your tokenizer via the set_tokenizer(tok) function imported from dataquality.integrations.seq2seq.core
NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer so as to match the tokenization process to your training process.
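For illustration, a minimal sketch of setting a fast huggingface tokenizer; the "encoder_decoder" model_type value and the token limits are assumptions, so check the model types supported by your dataquality version:

```python
from transformers import T5TokenizerFast

from dataquality.integrations.seq2seq.core import set_tokenizer

# Use the exact tokenizer (and settings) used for training so alignment is correct
tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-base")
set_tokenizer(
    tokenizer,
    "encoder_decoder",  # assumed model_type value; adjust for your model family
    max_input_tokens=512,
    max_target_tokens=128,
)
```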
- watch(tokenizer, model_type, model=None, generation_config=None, generation_splits=None, max_input_tokens=None, max_target_tokens=None, response_template=None)#
Seq2seq only. Log model generations for your run
- Return type:
None
Iterates over a given dataset and logs the generations for each sample. To generate outputs, a model that is an instance of transformers PreTrainedModel must be given, and it must have a generate method.
Unlike other watch functions, in this one we are just registering the model and generation config and not attaching any hooks to the model. We call it ‘watch’ for consistency.
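As a rough sketch, assuming an encoder-decoder model and the "encoder_decoder" model_type value (verify both against your dataquality version), watch could be called like this:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

from dataquality.integrations.seq2seq.core import watch

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-base")

watch(
    tokenizer,
    "encoder_decoder",           # assumed model_type value
    model=model,                 # must be a PreTrainedModel with a generate method
    generation_splits=["test"],  # splits on which to log generations
    max_input_tokens=512,
    max_target_tokens=128,
)
```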
dataquality.integrations.seq2seq.s2s_trainer module#
- validate_cols(ds, input_col, target_col)#
Validates that the input and target columns are in the dataset
- Return type:
None
- tokenize(ds, tokenizer, input_col, target_col, max_input_length, max_target_length)#
- Return type:
Dataset
- get_trainer(dd, input_col, target_col, training_config, generation_config)#
Sets up the model and tokenizer for training
Note that for now this function’s name is a misnomer, since our initial implementation does not use the Trainer class from transformers. We will likely refactor this in the future to use the Trainer class.
For now, this function sets up the model and tokenizer, tokenizes the data for each split, calls the DQ watch function, and returns the model and the tokenized dataset dict.
- Return type:
Tuple[PreTrainedModel, Dict[str, DataLoader]]
- do_train(model, dataloaders, training_config, wait)#
- Return type:
PreTrainedModel
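A minimal sketch of wiring these functions together, assuming a Galileo run has already been initialized and that each split contains "input" and "target" columns; the data and config values are illustrative:

```python
from datasets import Dataset, DatasetDict

from dataquality.integrations.seq2seq.s2s_trainer import do_train, get_trainer
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqGenerationConfig,
    Seq2SeqTrainingConfig,
)

# Illustrative data: each split needs the input and target columns
dd = DatasetDict(
    {
        "train": Dataset.from_dict(
            {"input": ["Translate to French: hello"], "target": ["bonjour"]}
        ),
        "validation": Dataset.from_dict(
            {"input": ["Translate to French: goodbye"], "target": ["au revoir"]}
        ),
    }
)

training_config = Seq2SeqTrainingConfig(epochs=1)
generation_config = Seq2SeqGenerationConfig()

# get_trainer sets up the model/tokenizer, tokenizes each split, and calls watch
model, dataloaders = get_trainer(
    dd, "input", "target", training_config, generation_config
)
model = do_train(model, dataloaders, training_config, wait=True)
```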
dataquality.integrations.seq2seq.schema module#
- class Seq2SeqDatasetConfig(hf_data=None, train_path=None, val_path=None, test_path=None, train_data=None, val_data=None, test_data=None, input_col='input', target_col='target', formatter=<factory>)#
Bases: BaseAutoDatasetConfig
Configuration for creating a dataset from a file or object
One of hf_data, train_path, or train_data should be provided. If none of those is provided, a demo dataset will be loaded by Galileo for training.
- Parameters:
  - hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored
  - train_path (Optional[str]) – Optional path to a training data file to use. Must be a path to a local file
  - val_path (Optional[str]) – Optional path to validation data to use. Must be a path to a local file
  - test_path (Optional[str]) – Optional path to test data to use. Must be a path to a local file
  - train_data (Union[DataFrame, Dataset, None]) – Optional training data to use. Can be a Pandas dataframe or a Huggingface dataset
  - val_data (Union[DataFrame, Dataset, None]) – Optional validation data to use. Can be a Pandas dataframe or a Huggingface dataset
  - test_data (Union[DataFrame, Dataset, None]) – Optional test data to use. Can be a Pandas dataframe or a Huggingface dataset
  - input_col (str) – Column name for input data, defaults to "input" for S2S
  - target_col (str) – Column name for target data, defaults to "target" for S2S
- input_col: str = 'input'#
- target_col: str = 'target'#
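For example, a sketch of building a config from an in-memory pandas dataframe; the data and column names are illustrative and simply match the defaults:

```python
import pandas as pd

from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

train_df = pd.DataFrame(
    {
        "input": ["Summarize: The quick brown fox jumps over the lazy dog."],
        "target": ["A fox jumps over a dog."],
    }
)

# input_col/target_col default to "input"/"target"; set them if your columns differ
dataset_config = Seq2SeqDatasetConfig(train_data=train_df)
```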
- class Seq2SeqTrainingConfig(model='google/flan-t5-base', epochs=3, learning_rate=0.0003, batch_size=4, create_data_embs=None, data_embs_col='input', return_model=False, accumulation_steps=4, max_input_tokens=512, max_target_tokens=128)#
Bases: BaseAutoTrainingConfig
Configuration for training a seq2seq model
- Parameters:
  - model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default google/flan-t5-base
  - epochs (int) – Optional num training epochs. If not set, we default to 3
  - learning_rate (float) – Optional learning rate. If not set, we default to 3e-4
  - accumulation_steps (int) – Optional accumulation steps. If not set, we default to 4
  - batch_size (int) – Optional batch size. If not set, we default to 4
  - create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If set to None, data embeddings will be created only if a GPU is available
  - max_input_tokens (int) – Optional max input tokens. If not set, we default to 512
  - max_target_tokens (int) – Optional max target tokens. If not set, we default to 128
  - data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to 'input', can also be set to target or generated_output
- model: str = 'google/flan-t5-base'#
- epochs: int = 3#
- accumulation_steps: int = 4#
- max_input_tokens: int = 512#
- max_target_tokens: int = 128#
- data_embs_col: str = 'input'#
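A sketch of a training config with the documented defaults written out explicitly; all values are illustrative:

```python
from dataquality.integrations.seq2seq.schema import Seq2SeqTrainingConfig

training_config = Seq2SeqTrainingConfig(
    model="google/flan-t5-base",  # pretrained huggingface model to fine-tune
    epochs=3,
    learning_rate=3e-4,
    batch_size=4,
    accumulation_steps=4,
    max_input_tokens=512,
    max_target_tokens=128,
    data_embs_col="input",        # or "target" / "generated_output"
)
```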
- class Seq2SeqGenerationConfig(max_new_tokens=64, temperature=0.2, do_sample=False, top_p=1.0, top_k=50, generation_splits=None)#
Bases: object
Configuration for generating insights from a trained seq2seq model
We use the default values in HF GenerationConfig. See more about the parameters here: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/text_generation#transformers.GenerationConfig
- Parameters:
  - generation_splits (Optional[List[str]]) – Optional list of splits to generate on. If not set, we default to ["test"]
- max_new_tokens: int = 64#
- temperature: float = 0.2#
- do_sample: bool = False#
- top_p: float = 1.0#
- top_k: int = 50#
- generation_splits: Optional[List[str]] = None#