dataquality.loggers.data_logger.seq2seq package#

Submodules#

dataquality.loggers.data_logger.seq2seq.chat module#

class Seq2SeqChatDataLogger(meta=None)#

Bases: Seq2SeqDataLogger

logger_config: Seq2SeqChatLoggerConfig = Seq2SeqChatLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#

dataquality.loggers.data_logger.seq2seq.completion module#

class Seq2SeqCompletionDataLogger(meta=None)#

Bases: Seq2SeqDataLogger

logger_config: Seq2SeqCompletionLoggerConfig = Seq2SeqCompletionLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#

dataquality.loggers.data_logger.seq2seq.formatters module#

class BaseSeq2SeqDataFormatter(logger_config)#

Bases: ABC

abstract set_input_cutoff(df)#
Return type:

DataFrame

abstract format_text(text, ids, tokenizer, max_tokens, split_key)#

Tokenize and align the text samples

format_text tokenizes and computes token alignments for each sample in text. Different logic is applied depending on the model architecture (EncoderDecoder vs. DecoderOnly).

In the end, we return AlignedTokenData and the target token strings (corresponding to token_label_str in Seq2SeqDataLogger). For both EncoderDecoder and DecoderOnly models the output is expected to be token alignment and string data over just the <Target> tokens in the Seq2Seq task. Note though that the input text samples are different between the two model architectures. See their respective implementations for further details.

Return type:

Tuple[AlignedTokenData, List[List[str]], List[str]]

Additional information computed / variable assignments:
  • Assign the necessary self.logger_config fields

  • Compute token_label_str: the per token str representation of each sample (List[str]), saved and used for high DEP tokens

  • In Decoder-Only: decode the response tokens to get the str representation of the response (i.e. the target shown in the UI)

Parameters:#

text: List[str]

Batch of str samples. For EncoderDecoder models these are exactly the targets, whereas for DecoderOnly models each sample is the full formatted_prompt.

ids: List[int]

Sample ids - used for logger_config assignment

tokenizer: PreTrainedTokenizerFast

max_tokens: Optional[int]

split_key: str

Return:#

batch_aligned_data: AlignedTokenData

Aligned token data for just target tokens, based on text

token_label_str: List[List[str]]

The target tokens (as strings) - see Seq2SeqDataLogger.token_label_str

targets: List[str]

The decoded response tokens - i.e. the string representation of the Targets for each sample. Note that this is only computed for Decoder-Only models. Returns [] for Encoder-Decoder
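
As an illustration of the kind of alignment this method computes, the sketch below tokenizes a batch of target strings with a huggingface fast tokenizer and collects the character offsets and per-token strings that back token_label_offsets / token_label_str. Variable names are illustrative, not the library's internals.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=True)
    targets = ["Shields a business entity from liability."]

    encoded = tokenizer(
        targets,
        truncation=True,
        max_length=128,
        return_offsets_mapping=True,
    )
    # (start, end) character span of each target token - the raw material
    # for token_label_offsets / token_label_positions
    offsets = encoded["offset_mapping"][0]
    # per-token string representation, akin to token_label_str
    token_strs = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])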

abstract generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#

Generate and extract model logprobs

Tokenize the input string and then use hf.generate to generate just the output tokens.

We don’t rely on the scores returned by hf since these can be altered by hf internally depending on the generation config, and can be hard to parse in the case of beam search.

Instead, we pass the generated output back through the model to extract the token logprobs. We effectively ask the model to evaluate its own generation - which is identical to generation because of causal language modeling.

Return type:

ModelGeneration

Parameters:#

input_str: str

Input string context used to seed the generation

tokenizer: PreTrainedTokenizerFast

max_input_tokens: int

The max number of tokens to use for tokenization

model: PreTrainedModel

generation_config: GenerationConfig

The user's generation config specifying the parameters for generation

Return:#

model_generation: ModelGeneration

  • generated_ids: np.ndarray of shape - [seq_len]

  • generated_token_logprobs: np.ndarray of shape - [seq_len]

  • generated_top_logprobs: List[List[Tuple[str, float]]]
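
A minimal sketch of the "score your own generation" idea described above, for a decoder-only model. The slicing and variable names are illustrative, not the library's actual implementation; the parameter names (tokenizer, model, input_str, max_input_tokens, generation_config) are those documented for this method.

    import torch
    import torch.nn.functional as F

    enc = tokenizer(
        input_str, truncation=True, max_length=max_input_tokens, return_tensors="pt"
    )
    with torch.no_grad():
        output_ids = model.generate(**enc, generation_config=generation_config)

    n_input = enc["input_ids"].shape[1]
    generated_ids = output_ids[0, n_input:]  # just the newly generated tokens

    # Re-run the full sequence through the model so it scores its own generation
    with torch.no_grad():
        logits = model(output_ids).logits  # [1, seq_len, vocab]
    logprobs = F.log_softmax(logits, dim=-1)

    # logits at position t predict token t + 1, so shift by one
    generated_token_logprobs = (
        logprobs[0, n_input - 1 : -1]
        .gather(-1, generated_ids.unsqueeze(-1))
        .squeeze(-1)
    )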

static process_generated_logits(generated_logits, generated_ids, tokenizer)#
Return type:

ModelGeneration

class EncoderDecoderDataFormatter(logger_config)#

Bases: BaseSeq2SeqDataFormatter

Seq2Seq data logger for EncoderDecoder models

Logging input data for EncoderDecoder models requires:

  1. tokenizer: This must be an instance of PreTrainedTokenizerFast from huggingface (i.e. T5TokenizerFast or GPT2TokenizerFast, etc.). Your tokenizer should have an .is_fast property that returns True if it's a fast tokenizer. This class must implement the encode, decode, and encode_plus methods.

You can set your tokenizer via either the seq2seq set_tokenizer() or watch(tokenizer, …) functions in dataquality.integrations.seq2seq.core

  2. A two column (i.e. completion) dataset (pandas/huggingface etc.) with string 'text' (model <Input> / <Instruction> / <Prompt>, …) and 'label' (model <Target> / <Completion> / …) columns + a data sample id column. Ex: Billsum dataset, with text as the <Input> and summary as the <Label>:

        id  text                          summary
        0   SECTION 1. LIABILITY …        Shields a business entity …
        1   SECTION 1. SHORT TITLE. …     Human Rights Information Act …
        2   SECTION 1. SHORT TITLE. …     Jackie Robinson Commemorative Coin …
        3   SECTION 1. NONRECOGNITION …   Amends the Internal Revenue Code to …
        4   SECTION 1. SHORT TITLE. …     Native American Energy Act - (Sec. 3…

You can log your dataset via the dq.log_dataset function, passing in the column mapping as necessary for text, label, and id:

    dq.log_dataset(ds, text="text", label="summary", id="id")

Putting it all together:

    import dataquality as dq
    from dataquality.integrations.seq2seq.core import set_tokenizer
    from datasets import load_dataset
    from transformers import T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    ds = load_dataset("billsum")
    # Add id column to each dataset split as the idx
    ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
    dq.init("seq2seq")
    # You can either use set_tokenizer() or watch()
    set_tokenizer(
        tokenizer, "encoder_decoder", max_input_tokens=512, max_target_tokens=128
    )
    dq.log_dataset(ds["train"], label="summary", split="train")

NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer so that the tokenization process matches your training process.

NOTE 2: Unlike DecoderOnly models, EncoderDecoder models explicitly separate the processing of the <Input> and <Target> data. Therefore, we do not need any additional information to isolate / extract information on the <Target> data.

format_text(text, ids, tokenizer, max_tokens, split_key)#

Further validation for Encoder-Decoder

Return type:

Tuple[AlignedTokenData, List[List[str]], List[str]]

For Encoder-Decoder we need to:
  • Save the target token ids: equivalent to the ground truth, they allow us to compare with the predictions and get perplexity and DEP scores

  • Save the target tokens: decoding of the ids, to identify the tokens

  • Save the offsets and positions of the target tokens: allows us to extract token level information and align the tokens with the full sample text

We achieve this by:
  • Tokenizing the target texts using max_target_tokens

  • From the tokenized outputs, generating the corresponding token alignments (i.e. label_offsets and label_positions)

  • Saving the ground-truth token ids in the id_to_tokens map, mapping sample id to tokenized label (sample_id -> List[token_id])
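
A hedged sketch of that bookkeeping, assuming id_to_tokens is keyed first by split and then by sample id (the loop and variable names are illustrative, not the library's code):

    encoded = tokenizer(
        text,                       # the target strings for this batch
        truncation=True,
        max_length=max_tokens,      # i.e. max_target_tokens
        return_offsets_mapping=True,
    )
    # ground-truth token ids per sample id, later used during model logging
    for sample_id, token_ids in zip(ids, encoded["input_ids"]):
        logger_config.id_to_tokens[split_key][sample_id] = token_ids
    # encoded["offset_mapping"] holds the (start, end) character span of each
    # target token - the basis for label_offsets / label_positions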

generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#

Generate response for a single sample - Encoder Decoder

Return type:

ModelGeneration

set_input_cutoff(df)#

Calculate the cutoff index for the input strings.

When using Encoder-Decoder models, the input tokens are truncated based on the respective Encoder's max_length OR the user specified max_length (note: these may be different between Encoder and Decoder - see max_input_tokens vs. max_target_tokens).

Return type:

DataFrame

This function adds one column to the df:
  • ‘input_cutoff’: the position of the last character in the input.
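
One way to picture the computation (illustrative only; the real implementation works over the logged dataframe): tokenize the input with truncation and take the largest end offset among the kept tokens as the character cutoff.

    enc = tokenizer(
        input_text,
        truncation=True,
        max_length=max_input_tokens,
        return_offsets_mapping=True,
    )
    # special tokens can carry a (0, 0) offset, so take the max end position
    input_cutoff = max(end for _, end in enc["offset_mapping"])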

class DecoderOnlyDataFormatter(logger_config)#

Bases: BaseSeq2SeqDataFormatter

Seq2Seq data logger for DecoderOnly models

Logging input data for DecoderOnly models requires:

  1. tokenizer: This must be an instance of PreTrainedTokenizerFast from huggingface (i.e. T5TokenizerFast or GPT2TokenizerFast, etc.). Your tokenizer should have an .is_fast property that returns True if it's a fast tokenizer. This class must implement the encode, decode, and encode_plus methods.

You can set your tokenizer via either the seq2seq set_tokenizer() or watch(tokenizer, …) functions in dataquality.integrations.seq2seq.core

  2. A two column (i.e. completion) dataset (pandas/huggingface etc.) with string 'text' (model <Input> / <Instruction> / <Prompt>, …) and 'label' (model <Target> / <Completion> / …) columns + a data sample id column. Ex: Billsum dataset, with text as the <Input> and summary as the <Label>:

        id  text                          summary
        0   SECTION 1. LIABILITY …        Shields a business entity …
        1   SECTION 1. SHORT TITLE. …     Human Rights Information Act …
        2   SECTION 1. SHORT TITLE. …     Jackie Robinson Commemorative Coin …
        3   SECTION 1. NONRECOGNITION …   Amends the Internal Revenue Code to …
        4   SECTION 1. SHORT TITLE. …     Native American Energy Act - (Sec. 3…

You can log your dataset via the dq.log_dataset function, passing in the column mapping as necessary for text, label, and id:

    dq.log_dataset(ds, text="text", label="summary", id="id")

Putting it all together:

    import dataquality as dq
    from dataquality.integrations.seq2seq.core import set_tokenizer
    from datasets import load_dataset
    from transformers import T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    ds = load_dataset("billsum")
    # Add id column to each dataset split as the idx
    ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
    dq.init("seq2seq")
    # You can either use set_tokenizer() or watch()
    set_tokenizer(
        tokenizer, "encoder_decoder", max_input_tokens=512, max_target_tokens=128
    )
    dq.log_dataset(ds["train"], label="summary", split="train")

NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer so that the tokenization process matches your training process.

NOTE 2: Unlike DecoderOnly models, EncoderDecoder models explicitly separate the processing of the <Input> and <Target> data. Therefore, we do not need any additional information to isolate / extract information on the <Target> data.

format_text(text, ids, tokenizer, max_tokens, split_key)#

Further formatting for Decoder-Only

Text is the formatted prompt of combined input/target

Tokenize text using the user's max_input_tokens. From the tokenized outputs, generate the corresponding token alignments (i.e. label_offsets and label_positions).

Save the tokenized labels for each sample as id_to_tokens. This is essential during model logging for extracting GT token label information.

We also save a formatted_prompt_lengths map used later to remove padding tokens.

Return type:

Tuple[AlignedTokenData, List[List[str]], List[str]]
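
For intuition, one way to locate where the response begins inside each tokenized formatted prompt is to search for the (tokenized) response template. The helper below is purely illustrative, not the library's code.

    def response_start(prompt_ids, response_template_ids):
        """Return the index of the first response token, or None if not found."""
        n = len(response_template_ids)
        # search from the end, since the template immediately precedes the response
        for i in range(len(prompt_ids) - n, -1, -1):
            if prompt_ids[i : i + n] == response_template_ids:
                return i + n
        return None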

generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#

Generate response for a single sample - Decoder Only

Return type:

ModelGeneration

set_input_cutoff(df)#

Calculate the cutoff index for the inputs

Return type:

DataFrame

Set the cutoff for the Input to just be the entire sample, i.e. the length of the input.
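
In pandas-style pseudocode (illustrative; the logger operates on its own dataframe columns), that amounts to:

    # the whole formatted prompt is "seen", so the cutoff is the sample length
    df["input_cutoff"] = df["text"].str.len()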

get_data_formatter(model_type, logger_config)#

Returns the data formatter for the given model_type

Return type:

BaseSeq2SeqDataFormatter
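
A plausible sketch of the dispatch; the string keys are illustrative assumptions, not necessarily the library's model_type values.

    def get_data_formatter(model_type, logger_config):
        formatters = {
            "encoder_decoder": EncoderDecoderDataFormatter,
            "decoder_only": DecoderOnlyDataFormatter,
        }
        return formatters[model_type](logger_config)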

dataquality.loggers.data_logger.seq2seq.seq2seq_base module#

class Seq2SeqDataLogger(meta=None)#

Bases: BaseGalileoDataLogger

Seq2Seq base data logger

This class defines the base functionality for logging input data in Seq2Seq tasks - i.e. shared between EncoderDecoder and DecoderOnly architectures.

At its core, Seq2Seq data logging expects the user's tokenizer (logged through the provided 'watch' integration) and expects the dataset to be formatted as a two column dataset - corresponding to Inputs and Targets.

During processing, we use the tokenizer to tokenize the Target data (used later during model output logging) and prepare for the alignment of token-level and string character level information.

After processing, the following key information is extracted:
  • ids

  • texts: corresponding to the <Input> data column

  • labels: corresponding to the <Target> data column

  • token_label_offsets + token_label_positions: used for alignment of token level and string character level information within the UI. Note this only applies to the <Target> data.

Additionally, we critically save the tokenized Target data as the ground truth “labels” for model output logging.

While much of the general Seq2Seq logic can be shared between EncoderDecoder and DecoderOnly models, there are nuances and specific information that differentiate them. Therefore, the following abstract functions must be overridden by subclasses

  • validate_and_format

  • calculate_cutoffs

Note that some shared functionality is implemented here - generally around error handling.

logger_config: Seq2SeqLoggerConfig = Seq2SeqLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#
DATA_FOLDER_EXTENSION = {'data': 'arrow', 'emb': 'hdf5', 'prob': 'hdf5'}#
property split_key: str#
validate_and_format()#

Seq2Seq validation

Validates input lengths and existence of a tokenizer

Further validation is done in the formatter for model specific validation (Encoder-Decoder vs Decoder-Only)

Return type:

None

log_dataset(dataset, *, batch_size=100000, text='input', id='id', label='target', formatted_prompt='formatted_label', split=None, inference_name=None, meta=None, **kwargs)#

Log a dataset/iterable of input samples.

Provide the dataset and the keys to index into it. See the child classes for details

Return type:

None
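
For example, mirroring the default column names in the signature above (the dataset and column names here are illustrative):

    import dataquality as dq

    dq.log_dataset(
        train_ds,            # pandas / huggingface dataset or other iterable
        text="input",
        label="target",
        id="id",
        split="train",
    )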

static get_valid_attributes()#

Returns a list of valid attributes for this Logger class

Return type:

List[str]

create_in_out_frames(in_frame, dir_name, prob_only, split, epoch_or_inf)#

Formats the input data and model output data.

For Seq2Seq we need to:
  • add the generated output to the input dataframe

  • calculate the text cutoffs for the input dataframe

  • call the super method to create the dataframe

Return type:

BaseLoggerDataFrames

convert_large_string(df)#

Cast regular string to large_string for the text columns

In Seq2Seq the text columns are the input and target columns. See BaseDataLogger.convert_large_string for more details

Return type:

DataFrame

add_generated_output_to_df(df, split)#

Adds the generated output to the dataframe, and also adds the token_label_positions column

Return type:

Optional[DataFrame]

classmethod separate_dataframe(df, prob_only=True, split=None)#

Separates the singular dataframe into its 3 components

Gets the probability df, the embedding df, and the “data” df containing all other columns

Return type:

BaseLoggerDataFrames

calculate_cutoffs(df)#

Calculates cutoff indexes for the input and/or target string.

Transformer models (or sub-modules) are trained over a maximum number of tokens / sequence length. This max_length controls the maximum number of tokens that the transformer model can process / “see.” During training, the tokenizer uses this max_length to truncate additional tokens - so any tokens beyond the max token length are fully ignored.

calculate_cutoffs adds relevant max_length information at the string character level for the target and/or input columns. This character info communicates to the UI how much of the respective string gets “seen” during processing by the model.

In this abstract definition, we include basic error checking and compute the cutoffs for the target column. This logic is shared by EncoderDecoder and DecoderOnly models - it relies on the saved offset mapping.

Return type:

DataFrame

Therefore, this function adds the following columns to df:
  • ‘target_cutoff’: the position of the last character in the target

See formatters (EncoderDecoder and DecoderOnly) for model specific details when computing input_cutoff.
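
As a sketch of the target-side computation (illustrative; the real code works over the dataframe and its saved offsets), the cutoff is the end of the last aligned target token:

    def target_cutoff(token_label_offsets):
        """token_label_offsets: (start, end) character spans per target token."""
        return max((end for _, end in token_label_offsets), default=0)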

Module contents#