dataquality.utils.seq2seq package#
Submodules#
dataquality.utils.seq2seq.data_error_potential module#
- get_token_dep_from_labels(probs, labels)#
Extracts DEP per token prediction using the labels as indexing tools
First, extract the probabilities of the GT token label
Probs is a numpy array of shape [batch_size, max_token_len, vocab_size] where for each sample (text input) in the batch, every token of that sample has a probability vector of size vocab_size (which can be 30k+).
Labels is of shape [batch_size, max_token_length], where for each sample, it indicates the index into the vocab that the token should be (the token label).
We use advanced indexing to extract out only the probabilities for the token label for each sample, for each batch.
Then, we get the second highest probabilities per token via similar indexing.
Finally, compute dep and return.
Returns: (token_dep, gold_probs)
NOTE: This function is not actively being used as we don’t require the user to pass in labels. However, if we want to support that flow (which would make processing faster and more memory efficient), we can leverage these here.
- Return type:
Tuple[ndarray, ndarray]
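A minimal sketch of the advanced indexing described above, assuming a margin-style DEP formula for illustration (the exact formula used internally may differ):

import numpy as np

def token_dep_sketch(probs: np.ndarray, labels: np.ndarray):
    # probs: [batch_size, max_token_len, vocab_size]; labels: [batch_size, max_token_len]
    batch_idx = np.arange(probs.shape[0])[:, None]   # [batch_size, 1]
    token_idx = np.arange(probs.shape[1])[None, :]   # [1, max_token_len]
    # Advanced indexing: probability assigned to each token's gold label
    gold_probs = probs[batch_idx, token_idx, labels]  # [batch_size, max_token_len]
    # Second-highest probability per token: zero out the gold prob, then take the max
    probs_no_gold = probs.copy()
    probs_no_gold[batch_idx, token_idx, labels] = 0.0
    second_probs = probs_no_gold.max(axis=-1)
    # Assumed margin-based DEP score; shown for illustration only
    token_dep = (1 - (gold_probs - second_probs)) / 2
    return token_dep, gold_probs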
- unpad_dep_probs_from_labels(token_dep, token_gold_probs, labels)#
Unpads the incoming numpy array by looking for padded/ignored indices
Ignored/padded indices are indicated by a -100 in the labels array.
token_dep, token_gold_probs, and labels are of shape [batch_size, max_token_length], but for each sample in the batch, the tokens for that sample that are ignored are -100 in the labels matrix. So we use that to get only the ones we care about.
We return a pyarrow array because each batch will have a different shape, which can’t be represented in numpy
NOTE: This function is not actively being used as we don’t require the user to pass in labels. However, if we want to support that flow (which would make processing faster and more memory efficient), we can leverage these here.
- Return type:
Tuple[array, array]
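A sketch of the unpadding logic under the -100 convention described above; the helper name and the plain pyarrow list arrays it returns are illustrative:

import numpy as np
import pyarrow as pa

def unpad_sketch(token_dep: np.ndarray, token_gold_probs: np.ndarray, labels: np.ndarray):
    deps, gold_probs = [], []
    for dep_row, prob_row, label_row in zip(token_dep, token_gold_probs, labels):
        keep = label_row != -100  # drop ignored/padded token positions
        deps.append(dep_row[keep].tolist())
        gold_probs.append(prob_row[keep].tolist())
    # pyarrow handles the ragged (per-sample variable length) result
    return pa.array(deps), pa.array(gold_probs)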
dataquality.utils.seq2seq.decoder_only module#
- isolate_response_tokens(tokenized_formatted_prompt, response_template)#
Identify the final instance of the response_template and use that to isolate just the response tokens.
tokenized_formatted_prompt has shape [num_tokens]
We search for the final occurrence of the response_template within the formatted prompt through sublist matching. After isolating the final response_template, we slice off the remaining tokens, representing the tokenized response.
- Return type:
List[int]
Example
>> tokenized_formatted_prompt = [[7, 1, 2, 3, 8, 5, 9, 1, 2, 3, 9, 10, 6]]
>> response_template = [1, 2, 3]
>> extract_tokenized_responses(tokenized_formatted_prompt, response_template)
[[9, 10, 6]]
If a sample does not contain the response_template we represent the tokenized_response for that sample as [] - i.e. the <Empty String>.
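A sketch of the sublist matching described above, reproducing the documented example; the function name and implementation are illustrative, not the library's exact code:

from typing import List

def isolate_response_sketch(tokenized_formatted_prompt: List[int], response_template: List[int]) -> List[int]:
    k = len(response_template)
    last_match = -1
    # Find the final occurrence of the response_template via sublist matching
    for i in range(len(tokenized_formatted_prompt) - k + 1):
        if tokenized_formatted_prompt[i : i + k] == response_template:
            last_match = i
    if last_match == -1:
        return []  # no response_template found -> empty response
    return tokenized_formatted_prompt[last_match + k:]

assert isolate_response_sketch([7, 1, 2, 3, 8, 5, 9, 1, 2, 3, 9, 10, 6], [1, 2, 3]) == [9, 10, 6]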
- extract_tokenized_responses(tokenized_formatted_prompts, response_template)#
Extracts the tokenized responses from the formatted prompts
For each sample, we isolate the response (from the input) and return it.
- Return type:
List[List[int]]
dataquality.utils.seq2seq.generation module#
- generate_on_batch(texts, ids, formatter, tokenizer, model, max_input_tokens, generation_config, split_key=None)#
Generate over a batch of text inputs
We use model to generate the output for each text sample individually and the corresponding logprob + token alignment data. Returns the processed batch data to be added to the dataframe.
- Return type:
BatchGenerationData
Parameters:#
- texts: pa.array of strs
Batch of input strings that we want to generate on
Return:#
: generated_data: BatchGenerationData
BatchGenerationData object with the processed generation data for the batch of text inputs.
- add_generated_output_to_df(df, generation_column, formatter, tokenizer, model, max_input_tokens, generation_config, split_key=None)#
Generates model outputs over df and extracts the logprob data
Using the user's model we generate the output for each sample in the df and the corresponding logprob data. We generate in batches of input text using vaex's evaluate_iterator (see the sketch below the parameter list). This avoids bringing the full S2SIC.input into memory; however, we do end up materializing the full logprob and token alignment data for the generated outputs.
- Return type:
DataFrame
- We specifically add the following 5 columns to the df:
generated_output: str
generated_token_label_positions: pa.array
generated_token_label_offsets: pa.array
generated_token_logprobs: pa.array
generated_top_logprobs: pa.array
NOTE: Although we bring into memory quite a bit of information about the generated outputs, in general users won't be generating over very many samples (on the order of 100s-1000s), because it simply takes too much time to do much more. Nevertheless, we should monitor this for memory issues.
Parameters:#
- df: vaex.DataFrame
Dataframe with the input data that we want to generate based on
- model: PreTrainedModel
- tokenizer: PreTrainedTokenizerFast
- max_input_tokens: the max number of tokens to use for tokenizing
- generation_config: GenerationConfig
The user's generation config specifying the parameters for generation
Return:#
: df: vaex.DataFrame
Updated Dataframe with the generated columns added (see above)
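A sketch of the batched iteration pattern described above, using vaex's evaluate_iterator; the stand-in generation step, column names, and chunk size are illustrative assumptions:

import numpy as np
import vaex

def fake_generate(text: str) -> str:
    # Stand-in for the model generation step (hypothetical)
    return text.upper()

def generate_over_df_sketch(df, text_col: str = "input", chunk_size: int = 2):
    outputs = []
    # Stream the input column in chunks so the full column never sits in memory at once
    for i1, i2, chunk in df.evaluate_iterator(text_col, chunk_size=chunk_size):
        texts = chunk.to_pylist() if hasattr(chunk, "to_pylist") else list(chunk)
        outputs.extend(fake_generate(text) for text in texts)
    # The generated column is materialized for the whole df, as noted above
    df["generated_output"] = np.array(outputs)
    return df

df = vaex.from_arrays(input=np.array(["a cat", "a dog", "a bird"]))
df = generate_over_df_sketch(df)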
dataquality.utils.seq2seq.logprobs module#
- get_top_logprob_indices(logprobs)#
Extract per-token top-k logprobs
logprobs can either be at the sample level or batch level.
In both situations, we compute the top logprobs along the final (-1) vocab dimension. We use np.argpartition to remove the overhead of sorting along the vocab dimension - O(n log n) -> O(n). For reference see: https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html
- Return type:
ndarray
Post-conditions:#
logprobs is left unchanged
TODO: this could be much faster with torch on GPU!
Parameters:#
- logprobs: np.ndarray of shape [(optional)batch_size, seq_len, vocab_size]
per-token logprobs for a sample or a batch
Return:#
: top_logprob_indices: np.ndarray of shape - […, TOP_K]
Indices of the top-k per-token logprobs. Note that we preserve all but the last (vocab) dimension to seamlessly handle both samples and batches.
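A sketch of the argpartition-based selection described above; the TOP_K value is illustrative:

import numpy as np

TOP_K = 5  # illustrative value

def top_logprob_indices_sketch(logprobs: np.ndarray, k: int = TOP_K) -> np.ndarray:
    # argpartition places the k largest entries in the last k slots of the vocab
    # dimension in O(n), without fully sorting; the input is left unchanged
    return np.argpartition(logprobs, -k, axis=-1)[..., -k:]

# Handles a single sample [seq_len, vocab_size] or a batch [batch_size, seq_len, vocab_size]
sample = np.log(np.random.dirichlet(np.ones(50), size=10))
batch = np.log(np.random.dirichlet(np.ones(50), size=(4, 10)))
assert top_logprob_indices_sketch(sample).shape == (10, TOP_K)
assert top_logprob_indices_sketch(batch).shape == (4, 10, TOP_K)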
- extract_top_logprobs(sample_logprobs, top_indices, tokenizer)#
Extract per token top_logprobs for a single sample
For each token, we extract the top-k predicted tokens and corresponding logprobs. Then we convert predicted token_ids into strings using the tokenizer.
- Return type:
List[List[Tuple[str, float]]]
Example top_logprobs data format for an example sequence:
[
    [("the", -0.2), ("a", -0.6), ...],
    [("cat", -0.05), ("dog", -0.1), ...],
    ...
]
- Breaking down this format:
The sample is represented as a List of Lists per token.
For each token, we store a fixed-length (k) list of tuples - (token_string, logprob) - one for each of the top-k predicted tokens.
Parameters:#
- sample_logprobs: np.ndarray of shape [seq_len, vocab_size]
- top_indices: np.ndarray of shape [seq_len, k]
- tokenizer: PreTrainedTokenizerFast
Return:#
: top_logprobs: List[List[Tuple[str, float]]]
len(top_logprobs) == sample_logprobs.shape[0] == num_tokens
len(top_logprobs[i]) == TOP_K
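A sketch of the conversion described above, assuming a Hugging Face PreTrainedTokenizerFast (convert_ids_to_tokens); names are illustrative:

import numpy as np
from typing import List, Tuple

def extract_top_logprobs_sketch(
    sample_logprobs: np.ndarray,  # [seq_len, vocab_size]
    top_indices: np.ndarray,      # [seq_len, k]
    tokenizer,                    # e.g. a PreTrainedTokenizerFast
) -> List[List[Tuple[str, float]]]:
    top_logprobs = []
    for token_logprobs, token_top_ids in zip(sample_logprobs, top_indices):
        # Convert the top-k predicted token ids to strings and pair them with their logprobs
        token_strings = tokenizer.convert_ids_to_tokens(token_top_ids.tolist())
        top_logprobs.append(
            [(tok, float(token_logprobs[idx])) for tok, idx in zip(token_strings, token_top_ids)]
        )
    return top_logprobs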
- process_sample_logprobs(sample_logprobs, sample_labels, sample_top_indices, tokenizer)#
Extract label_logprobs and top_logprobs
Whether the labels are GT target labels or generated labels, the process is identical. Extract the per token probability assigned to the token label and the top-k logprobs.
- Return type:
LogprobData
- Preconditions:
We assume that all inputs have been stripped of any padding tokens!
Parameters:#
- sample_logprobs: np.ndarray of shape - [seq_len, vocab_size]
Per-token logprobs for a single sample
- sample_labels: np.ndarray of shape - [seq_len]
Per-token labels for the sample. As a pre-condition, we assume that this is a 1D tensor of length seq_len. This is important for extracting logprobs correctly.
- sample_top_indices: np.ndarray of shape - [seq_len, TOP_K]
Top_K logprob indices for each token. Note that these are not in order.
- tokenizer: PreTrainedTokenizerFast
Tokenizer used by the model
Returns:#
: logprob_data: LogprobData
token_logprobs and top_logprobs for the sample
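A sketch of the label-logprob extraction described above; it returns a plain tuple rather than the LogprobData container and reuses the top-logprob sketch from the previous entry:

import numpy as np

def process_sample_logprobs_sketch(sample_logprobs, sample_labels, sample_top_indices, tokenizer):
    assert sample_labels.ndim == 1, "labels must be a 1D array of length seq_len (see pre-conditions)"
    # Advanced indexing: logprob assigned to each token's label
    token_logprobs = sample_logprobs[np.arange(len(sample_labels)), sample_labels]
    top_logprobs = extract_top_logprobs_sketch(sample_logprobs, sample_top_indices, tokenizer)
    return token_logprobs, top_logprobs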
dataquality.utils.seq2seq.offsets module#
- rollup_offset_mapping(offset_mapping)#
For a single sample’s tokenizer offsets, extract the character level offsets for each token.
- Return type:
Tuple[List[Tuple[int, int]], List[Set[int]]]
Character-level offsets align each token with its character index in a sample. They follow these rules:
- There must not be a gap between offsets: the end of one must be the beginning of the next.
- The span offsets must be strictly increasing.
- Each offset has 0 or more token positions associated with it. These positions indicate which tokens exist in the range indicated by the offset. ex: {'offsets': (0, 3), 'token_positions': {0, 1}} means that tokens 0 and 1 from the tokenizer are encapsulated by the character range (0, 3]
We take overlapping ranges and gaps between ranges, and fill them in contiguously
- ex:
offset_mapping = [(0, 1), (0, 20), (22, 23), (0, 0)]
is rolled into
[
    {'offsets': (0, 1), 'token_positions': {0, 1}},
    {'offsets': (1, 20), 'token_positions': {1}},
    {'offsets': (20, 22), 'token_positions': {}},
    {'offsets': (22, 23), 'token_positions': {2}},
]
and returned as
[(0, 1), (1, 20), (20, 22), (22, 23)], [{0, 1}, {1}, {}, {2}]
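The documented example, restated as a call:

from dataquality.utils.seq2seq.offsets import rollup_offset_mapping

span_offsets, span_positions = rollup_offset_mapping([(0, 1), (0, 20), (22, 23), (0, 0)])
# span_offsets   -> [(0, 1), (1, 20), (20, 22), (22, 23)]
# span_positions -> [{0, 1}, {1}, set(), {2}]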
- align_tokens_to_character_spans(samples_offsets, disable_tqdm=False)#
Iterates through each sample's offsets and creates character-aligned spans
- Return type:
Parameters:#
- disable_tqdm: bool
Flag for disabling tqdm. Used generally when we are calling align_tokens_to_character_spans over small (e.g. 1 sample) batches
- add_input_cutoff_to_df(df, tokenizer, text_col, max_tokens=None)#
Find the cutoff point in the string corresponding to the last token.
We tokenize the text and truncate after max_tokens tokens, i.e., we only keep the first max_tokens tokens. To find the position in the text corresponding to the last token we use the offset_mapping returned by the tokenizer.
- Return type:
DataFrame
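A sketch of the per-text cutoff computation described above, assuming a Hugging Face fast tokenizer (required for return_offsets_mapping); the helper name is illustrative:

def input_cutoff_sketch(text: str, tokenizer, max_tokens: int) -> int:
    # Tokenize with truncation and request character offsets for each kept token
    encoded = tokenizer(
        text,
        truncation=True,
        max_length=max_tokens,
        return_offsets_mapping=True,
        add_special_tokens=False,
    )
    offsets = encoded["offset_mapping"]
    if not offsets:
        return 0
    # End character of the last kept token = cutoff position in the original string
    return offsets[-1][1]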
- add_target_cutoff_to_df(df, target_offsets_col)#
Look at the last offset of the tokenized target to find the position of the last character of the target string that was used by the model. Note that typically the model does not use the entire target during teacher forcing and there is a cut-off point (for example 128 tokens, or 512 tokens, etc).
- Return type:
DataFrame
- align_response_tokens_to_character_spans(tokenizer, tokenized_response, max_input_tokens)#
Decodes then re-tokenizes the isolated response to get the character alignments
- TODO: This can probably be done by just tokenizing the "target" in isolation!
Specifically, we tokenize the Targets, then we figure out the index of the last token from the tokenized_response, find where that is in the offset map, and slice the offset map accordingly. This may also avoid strange space issues with tokenizers handling words at the start of a document.
- Return type:
Tuple[AlignedTokenData, str]
- Returns:
- aligned_token_data: AlignedTokenData
Aligned token data for a single Response - batch dim = 1.
- decoded_response: str
The string representation of the Response, used as the Target string in the console. Note: we do not remove special characters, so these will appear in the console!
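A sketch of the decode-then-re-tokenize flow described above; the exact arguments the real function passes (e.g. special-token handling) may differ:

from dataquality.utils.seq2seq.offsets import align_tokens_to_character_spans

def align_response_sketch(tokenizer, tokenized_response, max_input_tokens):
    # Decode the isolated response tokens back to text (special tokens are kept) ...
    decoded_response = tokenizer.decode(tokenized_response)
    # ... then re-tokenize it to recover character-level offsets
    encoded = tokenizer(
        decoded_response,
        truncation=True,
        max_length=max_input_tokens,
        return_offsets_mapping=True,
        add_special_tokens=False,
    )
    # Batch dim = 1: wrap the single sample's offsets in a list
    aligned_token_data = align_tokens_to_character_spans([encoded["offset_mapping"]], disable_tqdm=True)
    return aligned_token_data, decoded_response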
Module contents#
- remove_padding(padded_token_seq, num_tokens, padding_side)#
Remove padding tokens from a single token sequence
To remove padding tokens we use the tokenized labels and slice tokens depending on the padding side of the tokenizer.
We assume padded_token_seq is a sequence of tokens with shape [max_seq_len, …], where len(labels) = num_tokens <= max_seq_len and … indicates 0+ extra dimensions.
- Return type:
ndarray
Parameters:#
- padded_token_seq: np.ndarray of shape - [max_seq_len, …]
Padded token sequence. The first dimension must be the token dimension and be >= num_tokens. The following dimensions are unrestricted.
- num_tokens: int
Length of the non-padded logits.
- padding_side: str
Comes from the tokenizer used for the model and determines on which side padding is applied.
Returns:#
: non_padded_token_seq: np.ndarray of shape - [num_tokens, …]
Sequence with padded tokens removed, leaving other dimensions unaltered.
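A sketch of the slicing described above; names are illustrative:

import numpy as np

def remove_padding_sketch(padded_token_seq: np.ndarray, num_tokens: int, padding_side: str) -> np.ndarray:
    # Slice along the token (first) dimension, leaving any trailing dimensions untouched
    if padding_side == "right":
        return padded_token_seq[:num_tokens]
    return padded_token_seq[-num_tokens:]

# e.g. a right-padded logprob matrix [max_seq_len, vocab_size] with 3 real tokens
padded = np.zeros((5, 10))
assert remove_padding_sketch(padded, 3, "right").shape == (3, 10)
assert remove_padding_sketch(padded, 3, "left").shape == (3, 10)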