dataquality.integrations.seq2seq package#

Subpackages#

Submodules#

dataquality.integrations.seq2seq.auto module#

class S2SDatasetManager#

Bases: BaseDatasetManager

DEMO_DATASETS: List[str] = ['tatsu-lab/alpaca']#
try_load_dataset_dict_from_config(dataset_config)#

Tries to load the DatasetDict if available

If the user provided the hf_data param, we load it from huggingface. If they provided nothing, we load the demo dataset. Otherwise, we return None, because the user provided train/test/val data, and that requires task-specific processing

For HF datasets, we optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label

Return type:

Tuple[Optional[DatasetDict], Seq2SeqDatasetConfig]

get_dataset_dict_from_config(dataset_config, max_train_size=None, create_val_data_if_missing=True)#

Creates and/or validates the DatasetDict provided by the user.

If a user provides a DatasetDict, we simply validate it. Otherwise, we parse a combination of the parameters provided, generate a DatasetDict of their training data, and validate that.

If the user provides hf_data, we load that dataset from huggingface and optionally apply a formatting function to the dataset to convert it to the format we expect. This is useful for datasets that have non-standard columns, like the alpaca dataset, which has instruction, input, and target columns instead of text and label

If the user provides train_path, val_path, or test_path, we load those files and convert them to a DatasetDict.

Else if the user provides train_data, val_data, or test_data, we convert those to a DatasetDict.

Return type:

Tuple[DatasetDict, Seq2SeqDatasetConfig]

auto(project_name='auto_s2s', run_name=None, dataset_config=None, training_config=None, generation_config=None, max_train_size=None, wait=True)#

Automatically get insights on a Seq2Seq dataset

Given either a pandas dataframe, file_path, or huggingface dataset path, this function will load the data, train a huggingface transformer model, and provide Galileo insights via a link to the Galileo Console. If the number of epochs in training_config is set to 0, training/fine-tuning will be skipped and we will only do a forward pass (on all the splits).

One of the DatasetConfig fields hf_data, train_path, or train_data should be provided. If none of them is provided, a demo dataset will be loaded by Galileo for training.

The validation data is used as the evaluation dataset during training. If it is not provided but test_data is, the test data will be used as the evaluation set. If neither val nor test data is available (and training is not skipped), the train data will be randomly split 80/20 and the 20% will be used as evaluation data.

If both test and validation data are provided, the test data will be used as the hold-out set after training is complete. If no validation data is provided, the test data will instead be used as the evaluation set.

Parameters:
  • project_name (str) – Optional project name. If not set, a random name will be generated

  • run_name (Optional[str]) – Optional run name for this data. If not set, a random name will be generated

  • dataset_config (Optional[Seq2SeqDatasetConfig]) – Optional config for loading the dataset. See Seq2SeqDatasetConfig for more details

  • training_config (Optional[Seq2SeqTrainingConfig]) – Optional config for training the model. See Seq2SeqTrainingConfig for more details

  • generation_config (Optional[Seq2SeqGenerationConfig]) – Optional config for generating predictions. See Seq2SeqGenerationConfig for more details

  • max_train_size (Optional[int]) – Optional max number of training examples to use.

  • wait (bool) – Whether to wait for Galileo to complete processing your run. Default True

Return type:

Optional[PreTrainedModel]

To see auto insights on a random, pre-selected dataset, simply run:

```python
from dataquality.integrations.seq2seq.auto import auto

auto()
```

An example using auto with a hosted huggingface dataset:

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

dataset_config = Seq2SeqDatasetConfig(hf_data="tatsu-lab/alpaca")
auto(dataset_config=dataset_config)
```

An example of using auto with a local file with text and label columns:

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

dataset_config = Seq2SeqDatasetConfig(
    train_path="train.jsonl", val_path="eval.jsonl"
)
auto(
    project_name="s2s_auto",
    run_name="completion_dataset",
    dataset_config=dataset_config,
)
```
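
An example of skipping fine-tuning and only running a forward pass, by setting epochs to 0 in the training config (a minimal sketch; the dataset and model shown are just the defaults used elsewhere in these docs):

```python
from dataquality.integrations.seq2seq.auto import auto
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqDatasetConfig,
    Seq2SeqTrainingConfig,
)

dataset_config = Seq2SeqDatasetConfig(hf_data="tatsu-lab/alpaca")
# epochs=0 skips training/fine-tuning; only a forward pass is run on all splits
training_config = Seq2SeqTrainingConfig(model="google/flan-t5-base", epochs=0)

auto(
    dataset_config=dataset_config,
    training_config=training_config,
)
```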

dataquality.integrations.seq2seq.core module#

set_tokenizer(tokenizer, model_type, max_input_tokens=None, max_target_tokens=None)#

Seq2seq only. Set the tokenizer for your run

Must be either a Tokenizer or a fast pretrained tokenizer, and must support decode, encode, and encode_plus. We will use this tokenizer for both the input and the target. They will both be truncated after a certain length, which is set in the args max_input_tokens and max_target_tokens.

Parameters:
  • tokenizer – Must be either an instance of Tokenizer from tokenizers or a PreTrainedTokenizerFast from huggingface (e.g. T5TokenizerFast, GPT2TokenizerFast, etc). Your tokenizer should have an .is_fast property that returns True if it is a fast tokenizer. This class must implement the encode, decode, and encode_plus methods.

  • max_input_tokens – Max number of tokens used in the input. We will tokenize the input and truncate at this number. If not specified, we will use tokenizer.model_max_length

  • max_target_tokens – Max number of tokens used in the target. We will tokenize the target and truncate at this number. If not specified, we will use tokenizer.model_max_length

Return type:

None

You can set your tokenizer via the set_tokenizer(tok) function imported from dataquality.integrations.seq2seq.core

NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer so as to match the tokenization process to your training process.
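
For example (a minimal sketch; the model name and the "encoder_decoder" model_type value are assumptions, so verify the accepted model_type values against the library):

```python
from transformers import T5TokenizerFast

from dataquality.integrations.seq2seq.core import set_tokenizer

# Use the same tokenizer (with the same settings) that you use for training
tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-base")

set_tokenizer(
    tokenizer,
    "encoder_decoder",  # assumed model_type value; verify against the library
    max_input_tokens=512,
    max_target_tokens=128,
)
```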

watch(tokenizer, model_type, model=None, generation_config=None, generation_splits=None, max_input_tokens=None, max_target_tokens=None, response_template=None)#

Seq2seq only. Log model generations for your run

Return type:

None

Iterates over a given dataset and logs the generations for each sample. To generate outputs, a model that is an instance of transformers PreTrainedModel must be given, and it must have a generate method.

Unlike other watch functions, in this one we are just registering the model and generation config and not attaching any hooks to the model. We call it ‘watch’ for consistency.
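
A minimal sketch of registering a model for generation logging (the model name, the "encoder_decoder" model_type value, and passing a transformers GenerationConfig are assumptions; verify against the library):

```python
from transformers import GenerationConfig, T5ForConditionalGeneration, T5TokenizerFast

from dataquality.integrations.seq2seq.core import watch

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-base")

watch(
    tokenizer,
    "encoder_decoder",  # assumed model_type value
    model=model,  # must be a PreTrainedModel with a generate method
    generation_config=GenerationConfig(max_new_tokens=64),  # assumed to accept a transformers GenerationConfig
    generation_splits=["test"],  # splits to generate on
    max_input_tokens=512,
    max_target_tokens=128,
)
```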

dataquality.integrations.seq2seq.s2s_trainer module#

validate_cols(ds, input_col, target_col)#

Validates that the input and target columns are in the dataset

Return type:

None

tokenize(ds, tokenizer, input_col, target_col, max_input_length, max_target_length)#
Return type:

Dataset

get_trainer(dd, input_col, target_col, training_config, generation_config)#

Sets up the model and tokenizer for training

Note that for now this function name is a misnomer, since our initial implementation does not use the Trainer class from transformers. We will likely refactor this in the future to use the Trainer class.

For now, this function sets up the model and tokenizer, tokenizes the data for each split, calls the DQ watch function, and returns the model and the tokenized dataset dict.

Return type:

Tuple[PreTrainedModel, Dict[str, DataLoader]]

do_train(model, dataloaders, training_config, wait)#
Return type:

PreTrainedModel
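
A minimal sketch of how get_trainer and do_train fit together, assuming a DatasetDict whose splits already contain "input" and "target" columns (the file names are placeholders; in practice, auto handles this setup for you):

```python
from datasets import load_dataset

from dataquality.integrations.seq2seq.s2s_trainer import do_train, get_trainer
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqGenerationConfig,
    Seq2SeqTrainingConfig,
)

# Placeholder files; each record is expected to have "input" and "target" fields
dd = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "eval.jsonl"}
)

training_config = Seq2SeqTrainingConfig(epochs=1)
generation_config = Seq2SeqGenerationConfig(generation_splits=["validation"])

# Sets up the model/tokenizer, tokenizes each split, and calls the DQ watch function
model, dataloaders = get_trainer(
    dd, "input", "target", training_config, generation_config
)
model = do_train(model, dataloaders, training_config, wait=True)
```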

dataquality.integrations.seq2seq.schema module#

class Seq2SeqDatasetConfig(hf_data=None, train_path=None, val_path=None, test_path=None, train_data=None, val_data=None, test_data=None, input_col='input', target_col='target', formatter=<factory>)#

Bases: BaseAutoDatasetConfig

Configuration for creating a dataset from a file or object

One of hf_data, train_path, or train_data should be provided. If none of them is provided, a demo dataset will be loaded by Galileo for training.

Parameters:
  • hf_data (Union[DatasetDict, str, None]) – Use this param if you have huggingface data in the hub or in memory. Otherwise see train_data, val_data, and test_data. If provided, train_data, val_data, and test_data are ignored

  • train_path (Optional[str]) – Optional path to training data file to use. Must be a path to a local file

  • val_path (Optional[str]) – Optional path to validation data file to use. Must be a path to a local file

  • test_path (Optional[str]) – Optional path to test data file to use. Must be a path to a local file

  • train_data (Union[DataFrame, Dataset, None]) – Optional training data to use. Can be a Pandas dataframe or a Huggingface dataset

  • val_data (Union[DataFrame, Dataset, None]) – Optional validation data to use. Can be a Pandas dataframe or a Huggingface dataset

  • test_data (Union[DataFrame, Dataset, None]) – Optional test data to use. Can be a Pandas dataframe or a Huggingface dataset

  • input_col (str) – Column name for input data, defaults to “input” for S2S

  • target_col (str) – Column name for target data, defaults to “target” for S2S

input_col: str = 'input'#
target_col: str = 'target'#
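
For example, an in-memory dataset with custom column names could be configured like this (a sketch; the dataframe contents are placeholders):

```python
import pandas as pd

from dataquality.integrations.seq2seq.schema import Seq2SeqDatasetConfig

train_df = pd.DataFrame(
    {
        "prompt": ["Translate to French: Hello", "Translate to French: Goodbye"],
        "completion": ["Bonjour", "Au revoir"],
    }
)

dataset_config = Seq2SeqDatasetConfig(
    train_data=train_df,      # pandas dataframe or huggingface Dataset
    input_col="prompt",       # defaults to "input" if not set
    target_col="completion",  # defaults to "target" if not set
)
```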
class Seq2SeqTrainingConfig(model='google/flan-t5-base', epochs=3, learning_rate=0.0003, batch_size=4, create_data_embs=None, data_embs_col='input', return_model=False, accumulation_steps=4, max_input_tokens=512, max_target_tokens=128)#

Bases: BaseAutoTrainingConfig

Configuration for training a seq2seq model

Parameters:
  • model (str) – The pretrained AutoModel from huggingface that will be used to tokenize and train on the provided data. Default google/flan-t5-base

  • epochs (int) – Optional num training epochs. If not set, we default to 3

  • learning_rate (float) – Optional learning rate. If not set, we default to 3e-4

  • accumulation_steps (int) – Optional accumulation steps. If not set, we default to 4

  • batch_size (int) – Optional batch size. If not set, we default to 4

  • create_data_embs (Optional[bool]) – Whether to create data embeddings for this run. If set to None, data embeddings will be created only if a GPU is available

  • max_input_tokens (int) – Optional max input tokens. If not set, we default to 512

  • max_target_tokens (int) – Optional max target tokens. If not set, we default to 128

  • data_embs_col (str) – Optional text col on which to compute data embeddings. If not set, we default to ‘input’, can also be set to target or generated_output

model: str = 'google/flan-t5-base'#
epochs: int = 3#
accumulation_steps: int = 4#
max_input_tokens: int = 512#
max_target_tokens: int = 128#
data_embs_col: str = 'input'#
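
For example, a lighter training run might override the defaults like this (a sketch; the values are illustrative):

```python
from dataquality.integrations.seq2seq.schema import Seq2SeqTrainingConfig

training_config = Seq2SeqTrainingConfig(
    model="google/flan-t5-small",  # any seq2seq model from the huggingface hub
    epochs=1,
    learning_rate=1e-4,
    batch_size=8,
    accumulation_steps=1,
    max_input_tokens=256,
    max_target_tokens=64,
    create_data_embs=False,  # skip data embeddings even if a GPU is available
)
```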
class Seq2SeqGenerationConfig(max_new_tokens=64, temperature=0.2, do_sample=False, top_p=1.0, top_k=50, generation_splits=None)#

Bases: object

Configuration for generating insights from a trained seq2seq model

We use the default values in HF GenerationConfig. See more about the parameters here: https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/text_generation#transformers.GenerationConfig

Parameters:

generation_splits (Optional[List[str]]) – Optional list of splits to generate on. If not set, we default to [“test”]

max_new_tokens: int = 64#
temperature: float = 0.2#
do_sample: bool = False#
top_p: float = 1.0#
top_k: int = 50#
generation_splits: Optional[List[str]] = None#
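
For example, to sample generations on both the validation and test splits (a sketch; the values are illustrative):

```python
from dataquality.integrations.seq2seq.schema import Seq2SeqGenerationConfig

generation_config = Seq2SeqGenerationConfig(
    max_new_tokens=128,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    generation_splits=["validation", "test"],  # defaults to ["test"] if not set
)
```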

Module contents#