dataquality.loggers.data_logger.seq2seq package#
Submodules#
dataquality.loggers.data_logger.seq2seq.chat module#
- class Seq2SeqChatDataLogger(meta=None)#
Bases: Seq2SeqDataLogger
- logger_config: Seq2SeqChatLoggerConfig = Seq2SeqChatLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#
dataquality.loggers.data_logger.seq2seq.completion module#
- class Seq2SeqCompletionDataLogger(meta=None)#
Bases: Seq2SeqDataLogger
- logger_config: Seq2SeqCompletionLoggerConfig = Seq2SeqCompletionLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#
dataquality.loggers.data_logger.seq2seq.formatters module#
- class BaseSeq2SeqDataFormatter(logger_config)#
Bases: ABC
- abstract set_input_cutoff(df)#
- Return type: DataFrame
- abstract format_text(text, ids, tokenizer, max_tokens, split_key)#
Tokenize and align the text samples
format_text tokenizes and computes token alignments for each sample in text. Different logic is applied depending on the model architecture (EncoderDecoder vs. DecoderOnly).
In the end, we return AlignedTokenData and the target token strings (corresponding to token_label_str in Seq2SeqDataLogger). For both EncoderDecoder and DecoderOnly models the output is expected to be token alignment and string data over just the <Target> tokens in the Seq2Seq task. Note though that the input text samples are different between the two model architectures. See their respective implementations for further details.
- Return type: Tuple[AlignedTokenData, List[List[str]], List[str]]
- Additional information computed / variable assignments:
- Assign the necessary self.logger_config fields
- Compute token_label_str: the per-token str representation of each sample (List[str]), saved and used for high-DEP tokens
- In Decoder-Only: decode the response tokens to get the str representation of the response (i.e. the target shown in the UI)
Parameters:#
- text: List[str]
batch of str samples. For EncoderDecoder models these are exactly the targets, whereas for DecoderOnly models each sample is the full formatted_prompt
- ids: List[int]
sample ids - used for logger_config assignment
- tokenizer: PreTrainedTokenizerFast
- max_tokens: Optional[int]
- split_key: str
Return:#
- batch_aligned_data: AlignedTokenData
Aligned token data for just target tokens, based on text
- token_label_str: List[List[str]]
The target tokens (as strings) - see Seq2SeqDataLogger.token_label_str
- targets: List[str]
The decoded response tokens - i.e. the string representation of the Targets for each sample. Note that this is only computed for Decoder-Only models; returns [] for Encoder-Decoder.
- abstract generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#
Generate and extract model logprobs
Tokenize the input string and then use hf.generate to generate just the output tokens.
We don't rely on the scores returned by hf, since these can be altered internally depending on the generation config and can be hard to parse in the case of beam search.
Instead, we pass the generated output back through the model to extract the token logprobs. We effectively ask the model to evaluate its own generation - which is identical to generation because of causal language modeling (see the sketch after the Return section below).
- Return type: ModelGeneration
Parameters:#
- input_str: str
Input string context used to seed the generation
- tokenizer: PreTrainedTokenizerFast
- max_input_tokens: int
the max number of tokens to use for tokenization
- model: PreTrainedModel
- generation_config: GenerationConfig
The user's generation config specifying the parameters for generation
Return:#
- model_generation: ModelGeneration
generated_ids: np.ndarray of shape [seq_len]
generated_token_logprobs: np.ndarray of shape [seq_len]
generated_top_logprobs: List[List[Tuple[str, float]]]
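As a concrete illustration of this re-scoring step, here is a minimal hedged sketch for the decoder-only case (the model name, prompt, and variable names are placeholders, not the library's internals): generate output tokens with hf's generate(), then run the full sequence back through the model and read off each generated token's logprob.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model/tokenizer - any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
generated = model.generate(input_ids, max_new_tokens=8, do_sample=False)
gen_ids = generated[0, input_ids.shape[1]:]  # just the generated tokens

# Re-run the full sequence: a causal LM's logits at position i score the
# token at position i + 1, so each generated token's logprob comes from
# the position immediately before it.
with torch.no_grad():
    logits = model(generated).logits[0]
logprobs = F.log_softmax(logits, dim=-1)
positions = torch.arange(input_ids.shape[1] - 1, generated.shape[1] - 1)
generated_token_logprobs = logprobs[positions, gen_ids]
```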
- static process_generated_logits(generated_logits, generated_ids, tokenizer)#
- Return type:
- class EncoderDecoderDataFormatter(logger_config)#
Bases: BaseSeq2SeqDataFormatter
Seq2Seq data logger for EncoderDecoder models
Logging input data for EncoderDecoder models requires:
1. tokenizer: This must be an instance of PreTrainedTokenizerFast from huggingface (i.e. T5TokenizerFast or GPT2TokenizerFast, etc). Your tokenizer should have an .is_fast property that returns True if it's a fast tokenizer. This class must implement the encode, decode, and encode_plus methods. You can set your tokenizer via either the seq2seq set_tokenizer() or watch(tokenizer, …) functions in dataquality.integrations.seq2seq.core
2. A two-column (i.e. completion) dataset (pandas/huggingface etc.) with string 'text' (model <Input> / <Instruction> / <Prompt>, …) and 'label' (model <Target> / <Completion> / …) columns plus a data sample id column. For example, the Billsum dataset, with text as the <Input> and summary as the <Label>:

| id | text | summary |
| --- | --- | --- |
| 0 | SECTION 1. LIABILITY … | Shields a business entity … |
| 1 | SECTION 1. SHORT TITLE. … | … Human Rights Information Act … |
| 2 | SECTION 1. SHORT TITLE. … | … Jackie Robinson Commemorative Coin … |
| 3 | SECTION 1. NONRECOGNITION … | Amends the Internal Revenue Code to … |
| 4 | SECTION 1. SHORT TITLE. … | … Native American Energy Act - (Sec. 3… |

You can log your dataset via the dq.log_dataset function, passing in the column mapping as necessary for text, label, and id: dq.log_dataset(ds, text="text", label="summary", id="id")
- Putting it all together:

```python
import dataquality as dq
from dataquality.integrations.seq2seq.core import set_tokenizer
from datasets import load_dataset
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
ds = load_dataset("billsum")
# Add id column to each dataset split as the idx
ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
dq.init("seq2seq")
# You can either use set_tokenizer() or watch()
set_tokenizer(
    tokenizer, "encoder_decoder", max_input_tokens=512, max_target_tokens=128
)
dq.log_dataset(ds["train"], label="summary", split="train")
```
NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer, so that tokenization here matches the tokenization used during training.
NOTE 2: Unlike DecoderOnly models, EncoderDecoder models explicitly separate the processing of the <Input> and <Target> data. Therefore, we do not need any additional information to isolate / extract information on the <Target> data.
- format_text(text, ids, tokenizer, max_tokens, split_key)#
Further validation for Encoder-Decoder
- Return type: Tuple[AlignedTokenData, List[List[str]], List[str]]
- For Encoder-Decoder we need to:
- Save the target token ids: equivalent to ground truth, this allows us to compare with the predictions and get perplexity and DEP scores
- Save the target tokens: the decoding of the ids, to identify the tokens
- Save the offsets and positions of the target tokens: this allows us to extract token-level information and align the tokens with the full sample text
- We achieve this by (see the sketch below):
- Tokenizing the target texts using max_target_tokens
- From the tokenized outputs, generating the corresponding token alignments (i.e. label_offsets and label_positions)
- Saving the ground-truth token ids in the id_to_tokens map, mapping sample id to tokenized label (sample_id -> List[token_id])
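A minimal hedged sketch of these steps with a huggingface fast tokenizer (the model name and sample target are placeholders; this is not the library's internal code):

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")  # placeholder
targets = ["Shields a business entity from liability ..."]

# Tokenize the target texts, truncating at max_target_tokens;
# return_offsets_mapping yields per-token (start_char, end_char) spans,
# i.e. the alignments between tokens and the target string.
encoded = tokenizer(
    targets,
    max_length=128,  # stands in for max_target_tokens
    truncation=True,
    return_offsets_mapping=True,
)
token_ids = encoded["input_ids"][0]     # ground-truth token ids
offsets = encoded["offset_mapping"][0]  # label offsets per token
# Map sample id -> tokenized label, analogous to the id_to_tokens map.
id_to_tokens = {0: token_ids}
```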
- generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#
Generate response for a single sample - Encoder Decoder
- Return type: ModelGeneration
- set_input_cutoff(df)#
Calculate the cutoff index for the input strings.
When using Encoder-Decoder models, the input tokens are truncated based on the respective Encoder's max_length OR the user-specified max_length (note: these may be different between Encoder and Decoder - see max_input_tokens vs. max_target_tokens).
- Return type: DataFrame
- This function adds one column to the df:
- 'input_cutoff': the position of the last character in the input.
- class DecoderOnlyDataFormatter(logger_config)#
Bases: BaseSeq2SeqDataFormatter
Seq2Seq data logger for DecoderOnly models
Logging input data for DecoderOnly models requires:
1. tokenizer: This must be an instance of PreTrainedTokenizerFast from huggingface (i.e. T5TokenizerFast or GPT2TokenizerFast, etc). Your tokenizer should have an .is_fast property that returns True if it's a fast tokenizer. This class must implement the encode, decode, and encode_plus methods. You can set your tokenizer via either the seq2seq set_tokenizer() or watch(tokenizer, …) functions in dataquality.integrations.seq2seq.core
2. A two-column (i.e. completion) dataset (pandas/huggingface etc.) with string 'text' (model <Input> / <Instruction> / <Prompt>, …) and 'label' (model <Target> / <Completion> / …) columns plus a data sample id column. For example, the Billsum dataset, with text as the <Input> and summary as the <Label>:

| id | text | summary |
| --- | --- | --- |
| 0 | SECTION 1. LIABILITY … | Shields a business entity … |
| 1 | SECTION 1. SHORT TITLE. … | … Human Rights Information Act … |
| 2 | SECTION 1. SHORT TITLE. … | … Jackie Robinson Commemorative Coin … |
| 3 | SECTION 1. NONRECOGNITION … | Amends the Internal Revenue Code to … |
| 4 | SECTION 1. SHORT TITLE. … | … Native American Energy Act - (Sec. 3… |

You can log your dataset via the dq.log_dataset function, passing in the column mapping as necessary for text, label, and id: dq.log_dataset(ds, text="text", label="summary", id="id")
- Putting it all together:

```python
import dataquality as dq
from dataquality.integrations.seq2seq.core import set_tokenizer
from datasets import load_dataset
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
ds = load_dataset("billsum")
# Add id column to each dataset split as the idx
ds = ds.map(lambda x, idx: {"id": idx}, with_indices=True)
dq.init("seq2seq")
# You can either use set_tokenizer() or watch()
# ("decoder_only" is the model type for this DecoderOnly formatter)
set_tokenizer(
    tokenizer, "decoder_only", max_input_tokens=512, max_target_tokens=128
)
dq.log_dataset(ds["train"], label="summary", split="train")
```
NOTE: We assume that the tokenizer you provide is the same tokenizer used for training. This must be true in order to align inputs and outputs correctly. Ensure all necessary properties (like add_eos_token) are set before setting your tokenizer, so that tokenization here matches the tokenization used during training.
NOTE 2: Unlike EncoderDecoder models, DecoderOnly models do not explicitly separate the processing of the <Input> and <Target> data. Therefore, we need additional information (e.g. the formatted prompt lengths and the response template) to isolate / extract information on the <Target> data.
- format_text(text, ids, tokenizer, max_tokens, split_key)#
Further formatting for Decoder-Only
Text is the formatted prompt of combined input/target
Tokenize text using the user's max_input_tokens. From the tokenized outputs, generate the corresponding token alignments (i.e. label_offsets and label_positions).
Save the tokenized labels for each sample as id_to_tokens. This is essential during model logging for extracting GT token label information.
We also save a formatted_prompt_lengths map used later to remove padding tokens (see the sketch below for isolating the response tokens within the formatted prompt).
- Return type: Tuple[AlignedTokenData, List[List[str]], List[str]]
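A hedged sketch of isolating the response tokens inside the formatted prompt (the prompt format, response template, and model are placeholders; this is not the library's internal implementation):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder
formatted_prompt = (
    "### Input:\nSECTION 1. SHORT TITLE.\n### Response:\nNative American Energy Act"
)
response_template = "### Response:\n"

# Tokenize the full formatted prompt with character offsets.
encoded = tokenizer(formatted_prompt, return_offsets_mapping=True)
response_start = formatted_prompt.index(response_template) + len(response_template)

# Keep only the tokens whose character span starts inside the response -
# these are the <Target> tokens used as ground-truth labels.
target_ids = [
    tok_id
    for tok_id, (start, _end) in zip(encoded["input_ids"], encoded["offset_mapping"])
    if start >= response_start
]
print(tokenizer.decode(target_ids))  # the string representation of the Target
```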
- generate_sample(input_str, tokenizer, model, max_input_tokens, generation_config, input_id=None, split_key=None)#
Generate response for a single sample - Decoder Only
- Return type: ModelGeneration
- set_input_cutoff(df)#
Calculate the cutoff index for the inputs
- Return type:
DataFrame
- Set the cutoff for the Input to just be the entire sample, i.e. the length of the input (see the sketch below).
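In other words (a minimal sketch, assuming a pandas-style DataFrame with a text column; the logger operates on its own dataframe stack):

```python
import pandas as pd

df = pd.DataFrame({"text": ["short sample", "a somewhat longer sample"]})
# For Decoder-Only, the model "sees" the entire formatted sample,
# so the input cutoff is simply the length of the input string.
df["input_cutoff"] = df["text"].str.len()
```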
- get_data_formatter(model_type, logger_config)#
Returns the data formatter for the given model_type
- Return type: BaseSeq2SeqDataFormatter
dataquality.loggers.data_logger.seq2seq.seq2seq_base module#
- class Seq2SeqDataLogger(meta=None)#
Bases: BaseGalileoDataLogger
Seq2Seq base data logger
This class defines the base functionality for logging input data in Seq2Seq tasks - i.e. shared between EncoderDecoder and DecoderOnly architectures.
At its core, Seq2Seq data logging expects the user's tokenizer (logged through the provided 'watch' integration) and expects the dataset to be formatted as a two-column dataset - corresponding to Inputs and Targets.
During processing, we use the tokenizer to tokenize the Target data (used later during model output logging) and prepare for the alignment of token-level and string character level information.
- After processing, the following key information is extracted:
- ids
- texts: corresponding to the <Input> data column
- labels: corresponding to the <Target> data column
- token_label_offsets + token_label_positions: used for alignment of token-level and string character-level information within the UI. Note this only applies to the <Target> data.
Additionally, we critically save the tokenized Target data as the ground truth “labels” for model output logging.
While much of the general Seq2Seq logic can be shared between EncoderDecoder and DecoderOnly models, there are nuances and specific information that differentiate them. Therefore, the following abstract functions must be overridden by subclasses:
validate_and_format
calculate_cutoffs
Note that some shared functionality is implemented here - generally around error handling.
- logger_config: Seq2SeqLoggerConfig = Seq2SeqLoggerConfig(labels=None, tasks=None, observed_num_labels=None, observed_labels=None, tagging_schema=None, last_epoch=0, cur_epoch=None, cur_split=None, cur_inference_name=None, training_logged=False, validation_logged=False, test_logged=False, inference_logged=False, exception='', helper_data={}, input_data_logged=defaultdict(<class 'int'>, {}), logged_input_ids=defaultdict(<class 'set'>, {}), idx_to_id_map=defaultdict(<class 'list'>, {}), conditions=[], report_emails=[], ner_labels=[], int_labels=False, feature_names=[], metadata_documents=set(), finish=<function BaseLoggerConfig.<lambda>>, existing_run=False, dataloader_random_sampling=False, remove_embs=False, sample_length={}, tokenizer=None, max_input_tokens=None, max_target_tokens=None, id_to_tokens=defaultdict(<class 'dict'>, {}), model=None, generation_config=None, generation_splits=set(), model_type=None, id_to_formatted_prompt_length=defaultdict(<class 'dict'>, {}), response_template=None)#
- DATA_FOLDER_EXTENSION = {'data': 'arrow', 'emb': 'hdf5', 'prob': 'hdf5'}#
- property split_key: str#
- validate_and_format()#
Seq2Seq validation
Validates input lengths and existence of a tokenizer
Further validation is done in the formatter for model specific validation (Encoder-Decoder vs Decoder-Only)
- Return type: None
- log_dataset(dataset, *, batch_size=100000, text='input', id='id', label='target', formatted_prompt='formatted_label', split=None, inference_name=None, meta=None, **kwargs)#
Log a dataset/iterable of input samples.
Provide the dataset and the keys to index into it; see the child class for details and the usage sketch below.
- Return type: None
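A hedged usage sketch (the dataset and its column names are illustrative): when your dataset's columns differ from the defaults (text='input', label='target', id='id'), pass the mapping explicitly.

```python
import pandas as pd

import dataquality as dq

# Illustrative two-column completion dataset with an id column.
df = pd.DataFrame(
    {
        "id": [0, 1],
        "article": ["SECTION 1. LIABILITY ...", "SECTION 1. SHORT TITLE. ..."],
        "summary": ["Shields a business entity ...", "Human Rights Information Act ..."],
    }
)
# Map the dataset's columns onto the logger's text / label / id keys.
dq.log_dataset(df, text="article", label="summary", id="id", split="train")
```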
- static get_valid_attributes()#
Returns a list of valid attributes for this Logger class
- Return type: List[str]
- create_in_out_frames(in_frame, dir_name, prob_only, split, epoch_or_inf)#
Formats the input data and model output data. For Seq2Seq we need to:
- add the generated output to the input dataframe
- calculate the text cutoffs for the input dataframe
- call the super method to create the dataframe
- Return type:
- convert_large_string(df)#
Cast regular string to large_string for the text columns
In Seq2Seq the text columns are the input and target columns. See BaseDataLogger.convert_large_string for more details
- Return type: DataFrame
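For intuition, a hedged sketch of the cast itself using pyarrow (the logger works on its own dataframe stack; this only shows why large_string matters - it uses 64-bit offsets, so very large text columns do not overflow the 32-bit offsets of the regular string type):

```python
import pyarrow as pa

table = pa.table({"input": pa.array(["some input text"], type=pa.string())})
# Cast the text column from string (int32 offsets) to large_string (int64).
table = table.set_column(0, "input", table["input"].cast(pa.large_string()))
print(table.schema)  # input: large_string
```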
- add_generated_output_to_df(df, split)#
Adds the generated output to the dataframe, and also adds the token_label_positions column
- Return type: Optional[DataFrame]
- classmethod separate_dataframe(df, prob_only=True, split=None)#
Separates the singular dataframe into its 3 components
Gets the probability df, the embedding df, and the “data” df containing all other columns
- Return type:
- calculate_cutoffs(df)#
Calculates cutoff indexes for the input and/or target string.
Transformer models (or sub-modules) are trained over a maximum number of tokens / sequence length. This max_length controls the maximum number of tokens that the transformer model can process / “see.” During training, the tokenizer uses this max_length to truncate additional tokens - so any tokens beyond the max token length are fully ignored.
calculate_cutoffs adds relevant max_length information at the string character level for the target and/or input columns. This character info communicates to the UI how much of the respective string gets “seen” during processing by the model.
In this abstract definition, we include basic error checking and compute the cutoffs for the target column. This logic is shared by EncoderDecoder and DecoderOnly models - it relies on the saved offset mapping.
- Return type: DataFrame
- Therefore, this function adds the following column to df:
- 'target_cutoff': the position of the last character in the target (see the sketch below)
See the formatters (EncoderDecoder and DecoderOnly) for model-specific details on computing input_cutoff.
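A minimal sketch of the shared computation (the offsets are made up; the real logic uses the token_label_offsets saved during input logging): the target cutoff is the end-character of the last token that survived truncation.

```python
# Per-token (start_char, end_char) offsets for one target sample, as
# produced by a fast tokenizer's offset mapping (illustrative values).
token_label_offsets = [(0, 7), (8, 9), (10, 18)]

# The last character the model "sees" in the target string.
target_cutoff = token_label_offsets[-1][1]  # -> 18
```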