dataquality.utils.seq2seq package#
Submodules#
dataquality.utils.seq2seq.data_error_potential module#
- get_token_dep_from_labels(probs, labels)#
Extracts DEP per token prediction using the labels as indexing tools
First, extract the probabilities of the GT token label
Probs is a numpy array of shape [batch_size, max_token_len, vocab_size] where for each sample (text input) in the batch, every token of that sample has a probability vector of size vocab_size (which can be 30k+).
Labels is of shape [batch_size, max_token_length], where for each sample, it indicates the index into the vocab that the token should be (the token label).
We use advanced indexing to extract out only the probabilities for the token label for each sample, for each batch.
Then, we get the second highest probabilities per token via similar indexing.
Finally, compute dep and return.
Returns: (token_dep, gold_probs)
NOTE: This function is not actively being used as we don’t require the user to pass in labels. However, if we want to support that flow (which would make processing faster and more memory efficient), we can leverage these here.
- Return type:
Tuple[ndarray, ndarray]
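A minimal sketch of the advanced indexing described above, assuming a margin-style DEP formula for illustration (the exact formula used internally may differ):

import numpy as np

def token_dep_sketch(probs: np.ndarray, labels: np.ndarray):
    # probs: [batch_size, max_token_len, vocab_size]; labels: [batch_size, max_token_len]
    batch_idx = np.arange(probs.shape[0])[:, None]   # [batch_size, 1]
    token_idx = np.arange(probs.shape[1])[None, :]   # [1, max_token_len]
    # Advanced indexing: probability assigned to each token's gold label
    gold_probs = probs[batch_idx, token_idx, labels]  # [batch_size, max_token_len]
    # Second-highest probability per token: zero out the gold prob, then take the max
    probs_no_gold = probs.copy()
    probs_no_gold[batch_idx, token_idx, labels] = 0.0
    second_probs = probs_no_gold.max(axis=-1)
    # Assumed margin-based DEP score; shown for illustration only
    token_dep = (1 - (gold_probs - second_probs)) / 2
    return token_dep, gold_probs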
- unpad_dep_probs_from_labels(token_dep, token_gold_probs, labels)#
Unpads the incoming numpy array by looking for padded/ignored indices
Ignored/padded indices are indicated by a -100 in the labels array.
token_dep, token_gold_probs, and labels are of shape [batch_size, max_token_length], but for each sample in the batch, the tokens for that sample that are ignored are -100 in the labels matrix. So we use that to get only the ones we care about.
We return a pyarrow array because each batch will have a different shape, which can’t be represented in numpy
NOTE: This function is not actively being used as we don’t require the user to pass in labels. However, if we want to support that flow (which would make processing faster and more memory efficient), we can leverage these here.
- Return type:
Tuple[array, array]
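A sketch of the unpadding logic under the -100 convention described above; the helper name and the plain pyarrow list arrays it returns are illustrative:

import numpy as np
import pyarrow as pa

def unpad_sketch(token_dep: np.ndarray, token_gold_probs: np.ndarray, labels: np.ndarray):
    deps, gold_probs = [], []
    for dep_row, prob_row, label_row in zip(token_dep, token_gold_probs, labels):
        keep = label_row != -100  # drop ignored/padded token positions
        deps.append(dep_row[keep].tolist())
        gold_probs.append(prob_row[keep].tolist())
    # pyarrow handles the ragged (per-sample variable length) result
    return pa.array(deps), pa.array(gold_probs)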
dataquality.utils.seq2seq.decoder_only module#
- isolate_response_tokens(tokenized_formatted_prompt, response_template)#
Identify the final instance of the response_template and use that to isolate just the response tokens.
tokenized_formatted_prompt has shape [num_tokens]
We search for the final occurrence of the response_template within the formatted prompt through sublist matching. After isolating the final response_template, we slice off the remaining tokens, representing the tokenized response.
- Return type:
List[int]
Example
>> tokenized_formatted_prompt = [[7, 1, 2, 3, 8, 5, 9, 1, 2, 3, 9, 10, 6]]
>> response_template = [1, 2, 3]
>> extract_tokenized_responses(tokenized_formatted_prompt, response_template)
[[9, 10, 6]]
If a sample does not contain the response_template we represent the tokenized_response for that sample as [] - i.e. the <Empty String>.
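A sketch of the sublist matching described above, reproducing the documented example; the function name and implementation are illustrative, not the library's exact code:

from typing import List

def isolate_response_sketch(tokenized_formatted_prompt: List[int], response_template: List[int]) -> List[int]:
    k = len(response_template)
    last_match = -1
    # Find the final occurrence of the response_template via sublist matching
    for i in range(len(tokenized_formatted_prompt) - k + 1):
        if tokenized_formatted_prompt[i : i + k] == response_template:
            last_match = i
    if last_match == -1:
        return []  # no response_template found -> empty response
    return tokenized_formatted_prompt[last_match + k:]

assert isolate_response_sketch([7, 1, 2, 3, 8, 5, 9, 1, 2, 3, 9, 10, 6], [1, 2, 3]) == [9, 10, 6]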
- extract_tokenized_responses(tokenized_formatted_prompts, response_template)#
Extracts the tokenized responses from the formatted prompts
For each sample, we isolate the response (from the input) and return it.
- Return type:
List[List[int]]
dataquality.utils.seq2seq.generation module#
- generate_on_batch(texts, ids, formatter, tokenizer, model, max_input_tokens, generation_config, split_key=None)#
Generate over a batch of text inputs
We use model to generate the output for each text sample individually and the corresponding logprob + token alignment data. Returns the processed batch data to be added to the dataframe.
- Return type:
BatchGenerationData
Parameters:#
- texts: pa.array of strs
Batch of input strings that we want to generate on
Return:#
: generated_data: BatchGenerationData
BatchGenerationData object with the processed generation data for the batch of text inputs.
- add_generated_output_to_df(df, generation_column, formatter, tokenizer, model, max_input_tokens, generation_config, split_key=None)#
Generates model outputs over df and extracts the logprob data
Using the user's model we generate the output for each sample in the df and the corresponding logprob data. We generate in batches of input text using vaex's evaluate_iterator (see the sketch below the parameter list). This avoids bringing the full S2SIC.input into memory; however, we do end up materializing the full logprob and token alignment data for the generated outputs.
- Return type:
DataFrame
- We specifically add the following 5 columns to the df:
generated_output: str
generated_token_label_positions: pa.array
generated_token_label_offsets: pa.array
generated_token_logprobs: pa.array
generated_top_logprobs: pa.array
NOTE: Although we bring into memory quite a bit of information about the generated outputs, in general users won't be generating over very many samples (on the order of 100s-1000s), because it simply takes too much time to do much more. Nevertheless, we should monitor this for memory issues.
Parameters:#
- df: vaex.DataFrame
Dataframe with the input data that we want to generate based on
- model: PreTrainedModel
- tokenizer: PreTrainedTokenizerFast
- max_input_tokens: the max number of tokens to use for tokenizing
- generation_config: GenerationConfig
The user's generation config specifying the parameters for generation
Return:#
: df: vaex.DataFrame
Updated Dataframe with the generated columns added (see above)
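A sketch of the batched iteration pattern described above, using vaex's evaluate_iterator; the stand-in generation step, column names, and chunk size are illustrative assumptions:

import numpy as np
import vaex

def fake_generate(text: str) -> str:
    # Stand-in for the model generation step (hypothetical)
    return text.upper()

def generate_over_df_sketch(df, text_col: str = "input", chunk_size: int = 2):
    outputs = []
    # Stream the input column in chunks so the full column never sits in memory at once
    for i1, i2, chunk in df.evaluate_iterator(text_col, chunk_size=chunk_size):
        texts = chunk.to_pylist() if hasattr(chunk, "to_pylist") else list(chunk)
        outputs.extend(fake_generate(text) for text in texts)
    # The generated column is materialized for the whole df, as noted above
    df["generated_output"] = np.array(outputs)
    return df

df = vaex.from_arrays(input=np.array(["a cat", "a dog", "a bird"]))
df = generate_over_df_sketch(df)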
dataquality.utils.seq2seq.logprobs module#
- get_top_logprob_indices(logprobs)#
Extract per-token top-k logprobs
logprobs can either be at the sample level or batch level.
In both situations, we compute the top logprobs along the final (-1) vocab dimension. We use np.argpartition to remove the overhead of sorting along the vocab dimension - O(n log n) -> O(n). For reference see: https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html
- Return type:
ndarray
Post-conditions:#
logprobs is left unchanged
TODO: this could be much faster with torch on GPU!
Parameters:#
- logprobs: np.ndarray of shape [(optional)batch_size, seq_len, vocab_size]
per-token logprobs for a sample or a batch
Return:#
: top_logprob_indices: np.ndarray of shape - […, TOP_K]
Indices of the top-k per-token logprobs. Note that we preserve all but the last (vocab) dimension to seamlessly handle both samples and batches.
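A sketch of the argpartition-based selection described above; the TOP_K value is illustrative:

import numpy as np

TOP_K = 5  # illustrative value

def top_logprob_indices_sketch(logprobs: np.ndarray, k: int = TOP_K) -> np.ndarray:
    # argpartition places the k largest entries in the last k slots of the vocab
    # dimension in O(n), without fully sorting; the input is left unchanged
    return np.argpartition(logprobs, -k, axis=-1)[..., -k:]

# Handles a single sample [seq_len, vocab_size] or a batch [batch_size, seq_len, vocab_size]
sample = np.log(np.random.dirichlet(np.ones(50), size=10))
batch = np.log(np.random.dirichlet(np.ones(50), size=(4, 10)))
assert top_logprob_indices_sketch(sample).shape == (10, TOP_K)
assert top_logprob_indices_sketch(batch).shape == (4, 10, TOP_K)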
- extract_top_logprobs(sample_logprobs, top_indices, tokenizer)#
Extract per token top_logprobs for a single sample
For each token, we extract the top-k predicted tokens and corresponding logprobs. Then we convert predicted token_ids into strings using the tokenizer.
- Return type:
List[List[Tuple[str, float]]]
Example top_logprobs data format for an example sequence:
[
    [("the", -0.2), ("a", -0.6), ...],
    [("cat", -0.05), ("dog", -0.1), ...],
    ...
]
- Breaking down this format:
The sample is represented as a List of Lists per token.
For each token, we store a fixed-length (k) list of tuples - (token_string, logprob) - one for each of the top-k predicted tokens.
Parameters:#
- sample_logprobs: np.ndarray of shape [seq_len, vocab_size]
- top_indices: np.ndarray of shape [seq_len, k]
- tokenizer: PreTrainedTokenizerFast
Return:#
: top_logprobs: List[List[Tuple[str, float]]]
len(top_logprobs) == sample_logprobs.shape[0] == num_tokens
len(top_logprobs[i]) == TOP_K
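A sketch of the conversion described above, assuming a Hugging Face PreTrainedTokenizerFast (convert_ids_to_tokens); names are illustrative:

import numpy as np
from typing import List, Tuple

def extract_top_logprobs_sketch(
    sample_logprobs: np.ndarray,  # [seq_len, vocab_size]
    top_indices: np.ndarray,      # [seq_len, k]
    tokenizer,                    # e.g. a PreTrainedTokenizerFast
) -> List[List[Tuple[str, float]]]:
    top_logprobs = []
    for token_logprobs, token_top_ids in zip(sample_logprobs, top_indices):
        # Convert the top-k predicted token ids to strings and pair them with their logprobs
        token_strings = tokenizer.convert_ids_to_tokens(token_top_ids.tolist())
        top_logprobs.append(
            [(tok, float(token_logprobs[idx])) for tok, idx in zip(token_strings, token_top_ids)]
        )
    return top_logprobs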
- process_sample_logprobs(sample_logprobs, sample_labels, sample_top_indices, tokenizer)#
Extract label_logprobs and top_logprobs
Whether the labels are GT target labels or generated labels, the process is identical. Extract the per token probability assigned to the token label and the top-k logprobs.
- Return type:
LogprobData
- Preconditions:
We assume that all inputs have been stripped of any padding tokens!
Parameters:#
- sample_logprobs: np.ndarray of shape - [seq_len, vocab_size]
Per-token logprobs for a single sample
- sample_labels: np.ndarray of shape - [seq_len]
Per-token labels for the sample. As a pre-condition, we assume that this is a 1D tensor of length seq_len. This is important for extracting logprobs correctly.
- sample_top_indices: np.ndarray of shape - [seq_len, TOP_K]
Top_K logprob indices for each token. Note that these are not in order.
- tokenizer: PreTrainedTokenizerFast
Tokenizer used by the model
Returns:#
: logprob_data: LogprobData
token_logprobs and top_logprobs for the sample
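A sketch of the label-logprob extraction described above; it returns a plain tuple rather than the LogprobData container and reuses the top-logprob sketch from the previous entry:

import numpy as np

def process_sample_logprobs_sketch(sample_logprobs, sample_labels, sample_top_indices, tokenizer):
    assert sample_labels.ndim == 1, "labels must be a 1D array of length seq_len (see pre-conditions)"
    # Advanced indexing: logprob assigned to each token's label
    token_logprobs = sample_logprobs[np.arange(len(sample_labels)), sample_labels]
    top_logprobs = extract_top_logprobs_sketch(sample_logprobs, sample_top_indices, tokenizer)
    return token_logprobs, top_logprobs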
dataquality.utils.seq2seq.offsets module#
- rollup_offset_mapping(offset_mapping)#
For a single sample’s tokenizer offsets, extract the character level offsets for each token.
- Return type:
Tuple[List[Tuple[int, int]], List[Set[int]]]
Character-level offsets align each token with its character index in a sample. They follow these rules:
- There must not be a gap between offsets: the end of one must be the beginning of the next.
- The span offsets must be strictly increasing.
- Each offset has 0 or more token positions associated with it. These positions indicate which tokens exist in the range indicated by the offset. ex: {'offsets': (0, 3), 'token_positions': {0, 1}} means that tokens 0 and 1 from the tokenizer are encapsulated by the character range (0, 3]
We take overlapping ranges and gaps between ranges, and fill them in contiguously
- ex:
offset_mapping = [(0, 1), (0, 20), (22, 23), (0, 0)]
is rolled into
[
    {'offsets': (0, 1), 'token_positions': {0, 1}},
    {'offsets': (1, 20), 'token_positions': {1}},
    {'offsets': (20, 22), 'token_positions': {}},
    {'offsets': (22, 23), 'token_positions': {2}},
]
and returned as
[(0, 1), (1, 20), (20, 22), (22, 23)], [{0, 1}, {1}, {}, {2}]
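The documented example, restated as a call:

from dataquality.utils.seq2seq.offsets import rollup_offset_mapping

span_offsets, span_positions = rollup_offset_mapping([(0, 1), (0, 20), (22, 23), (0, 0)])
# span_offsets   -> [(0, 1), (1, 20), (20, 22), (22, 23)]
# span_positions -> [{0, 1}, {1}, set(), {2}]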
- align_tokens_to_character_spans(samples_offsets, disable_tqdm=False)#
Iterates through each sample's offsets and creates character-aligned spans
- Return type:
Parameters:#
- disable_tqdm: bool
Flag for disabling tqdm. Used generally when we are calling align_tokens_to_character_spans over small (e.g. 1 sample) batches
- add_input_cutoff_to_df(df, tokenizer, text_col, max_tokens=None)#
Find the cutoff point in the string corresponding to the last token.
We tokenize the text and truncate after max_tokens tokens, i.e., we only keep the first max_tokens tokens. To find the position in the text corresponding to the last token we use the offset_mapping returned by the tokenizer.
- Return type:
DataFrame
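A sketch of the per-text cutoff computation described above, assuming a Hugging Face fast tokenizer (required for return_offsets_mapping); the helper name is illustrative:

def input_cutoff_sketch(text: str, tokenizer, max_tokens: int) -> int:
    # Tokenize with truncation and request character offsets for each kept token
    encoded = tokenizer(
        text,
        truncation=True,
        max_length=max_tokens,
        return_offsets_mapping=True,
        add_special_tokens=False,
    )
    offsets = encoded["offset_mapping"]
    if not offsets:
        return 0
    # End character of the last kept token = cutoff position in the original string
    return offsets[-1][1]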
- add_target_cutoff_to_df(df, target_offsets_col)#
Look at the last offset of the tokenized target to find the position of the last character of the target string that was used by the model. Note that typically the model does not use the entire target during teacher forcing and there is a cut-off point (for example 128 tokens, or 512 tokens, etc).
- Return type:
DataFrame
- align_response_tokens_to_character_spans(tokenizer, tokenized_response, max_input_tokens)#
Decodes then re-tokenizes the isolated response to get the character alignments
- TODO: This can probably be done by just tokenizing the "target" in isolation!
Specifically, we tokenize the Targets, then we figure out the index of the last token from the tokenized_response, find where that is in the offset map, and slice the offset map accordingly. This may also avoid strange space issues with tokenizers handling words at the start of a document.
- Return type:
Tuple[AlignedTokenData, str]
- Returns:
- aligned_token_data: AlignedTokenData
Aligned token data for a single Response - batch dim = 1.
- decoded_response: str
The string representation of the Response, used as the Target string in the console. Note: we do not remove special characters, so these will appear in the console!
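A sketch of the decode-then-re-tokenize flow described above; the exact arguments the real function passes (e.g. special-token handling) may differ:

from dataquality.utils.seq2seq.offsets import align_tokens_to_character_spans

def align_response_sketch(tokenizer, tokenized_response, max_input_tokens):
    # Decode the isolated response tokens back to text (special tokens are kept) ...
    decoded_response = tokenizer.decode(tokenized_response)
    # ... then re-tokenize it to recover character-level offsets
    encoded = tokenizer(
        decoded_response,
        truncation=True,
        max_length=max_input_tokens,
        return_offsets_mapping=True,
        add_special_tokens=False,
    )
    # Batch dim = 1: wrap the single sample's offsets in a list
    aligned_token_data = align_tokens_to_character_spans([encoded["offset_mapping"]], disable_tqdm=True)
    return aligned_token_data, decoded_response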
Module contents#
- remove_padding(padded_token_seq, num_tokens, padding_side)#
Remove padding tokens from a single token sequence
To remove padding tokens we use the tokenized labels and slice tokens depending on the padding side of the tokenizer.
We assume padded_token_seq is a sequence of tokens with shape [max_seq_len, …], where len(labels) = num_tokens <= max_seq_len and … indicates 0+ extra dimensions.
- Return type:
ndarray
Parameters:#
- padded_token_seq: np.ndarray of shape - [max_seq_len, …]
Padded token sequence. The first dimension must be the token dimension and be >= num_tokens. The following dimensions are unrestricted.
- num_tokens: int
Length of the non-padded logits.
- padding_side: str
Comes from the tokenizer used for the model and determines on which side padding is applied.
Returns:#
: non_padded_token_seq: np.ndarray of shape - [num_tokens, …]
Sequence with padded tokens removed, leaving other dimensions unaltered.
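A sketch of the slicing described above; names are illustrative:

import numpy as np

def remove_padding_sketch(padded_token_seq: np.ndarray, num_tokens: int, padding_side: str) -> np.ndarray:
    # Slice along the token (first) dimension, leaving any trailing dimensions untouched
    if padding_side == "right":
        return padded_token_seq[:num_tokens]
    return padded_token_seq[-num_tokens:]

# e.g. a right-padded logprob matrix [max_seq_len, vocab_size] with 3 real tokens
padded = np.zeros((5, 10))
assert remove_padding_sketch(padded, 3, "right").shape == (3, 10)
assert remove_padding_sketch(padded, 3, "left").shape == (3, 10)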