stringalign.statistics#
- exception stringalign.statistics.CombinedAlignmentWarning[source]#
Used to warn when passing alignments with potentially combined operations to the confusion matrix.
- class stringalign.statistics.StringConfusionMatrix(true_positives: Counter[str], false_positives: Counter[str], false_negatives: Counter[str], edit_counts: Counter[AlignmentOperation])[source]#
A confusion-matrix like object that counts edit operations for aligned strings.
The string confusion matrix counts the number of true positives, false positives and false negatives for two aligned strings. However, the number of true negatives does not make sense in the context of string alignment, as it would correspond to the number of times a token occurs in neither string.
We use the following definitions of true positives, false positives and false negatives:
- True positives: The number of times a token occurs in the “same place” in both strings, i.e. the number of stringalign.align.Kept operations with the given token as the substring.
- False positives: The number of times a token occurs in the predicted string but not the reference, i.e. the number of stringalign.align.Inserted operations with the given token as the substring plus the number of stringalign.align.Replaced operations with the given token as the predicted token.
- False negatives: The number of times a token occurs in the reference string but not the predicted, i.e. the number of stringalign.align.Deleted operations with the given token as the substring plus the number of stringalign.align.Replaced operations with the given token as the reference token.
- Edit count: The number of edit operations (stringalign.align.Inserted, stringalign.align.Deleted or stringalign.align.Replaced).
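The counting rules above can be sketched with plain Counters. Note that Kept, Inserted, Deleted and Replaced below are simplified stand-ins for the stringalign.align operation classes, and the attribute names (substring, replacement) are assumptions for illustration only:

```python
from collections import Counter
from dataclasses import dataclass

# Simplified stand-ins for the stringalign.align operation classes.
@dataclass(frozen=True)
class Kept:
    substring: str

@dataclass(frozen=True)
class Inserted:
    substring: str

@dataclass(frozen=True)
class Deleted:
    substring: str

@dataclass(frozen=True)
class Replaced:
    substring: str    # predicted token (attribute name is an assumption)
    replacement: str  # reference token (attribute name is an assumption)

def count_confusion(alignment):
    """Tally TP/FP/FN counters from a sequence of alignment operations."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for op in alignment:
        if isinstance(op, Kept):
            tp[op.substring] += 1          # same token in the same place
        elif isinstance(op, Inserted):
            fp[op.substring] += 1          # only in the prediction
        elif isinstance(op, Deleted):
            fn[op.substring] += 1          # only in the reference
        elif isinstance(op, Replaced):
            fp[op.substring] += 1          # predicted token counts as FP
            fn[op.replacement] += 1        # reference token counts as FN

    return tp, fp, fn

# Aligning reference "cat" with prediction "cut": 'a' was replaced by 'u'.
alignment = [Kept("c"), Replaced("u", "a"), Kept("t")]
tp, fp, fn = count_confusion(alignment)
```

This mirrors the definitions: the replaced pair contributes one false positive for the predicted token and one false negative for the reference token.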
In general, you should not initialize this class with the default constructor, but rather use one of the utility classmethod constructors: from_strings(), from_strings_and_alignment() or from_string_collections().
- compute_dice(aggregate_over: Iterable[str] | None = None) dict[str, float] | float#
Compute the F1 score, also known as the Dice score.
The F1 score is given by the harmonic mean of the true positive rate and the positive predictive value. Alternatively, you can interpret it as the number of true positives divided by the average of the number of predicted positives and the number of positives in the reference.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the F1 score aggregated for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the F1 score for a set of special characters. See StringConfusionMatrix.compute_true_positive_rate() for examples of how this argument works.
- Returns:
f1_score – Either a dictionary that maps tokens to their F1 score, or, if aggregate_over is provided, a single float that represents the F1 score aggregated for the specified tokens.
- Return type: dict[str, float] | float
- compute_f1_score(aggregate_over: Iterable[str] | None = None) dict[str, float] | float[source]#
Compute the F1 score, also known as the Dice score.
The F1 score is given by the harmonic mean of the true positive rate and the positive predictive value. Alternatively, you can interpret it as the number of true positives divided by the average of the number of predicted positives and the number of positives in the reference.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the F1 score aggregated for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the F1 score for a set of special characters. See StringConfusionMatrix.compute_true_positive_rate() for examples of how this argument works.
- Returns:
f1_score – Either a dictionary that maps tokens to their F1 score, or, if aggregate_over is provided, a single float that represents the F1 score aggregated for the specified tokens.
- Return type: dict[str, float] | float
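As a sketch of the aggregation described above, micro-averaged F1 over a set of tokens could be computed like this (micro_f1 is a hypothetical helper for illustration, not part of the stringalign API):

```python
from collections import Counter

def micro_f1(tp: Counter, fp: Counter, fn: Counter, tokens) -> float:
    # Micro-averaging: sum the per-token counts first, then apply
    # F1 = 2*TP / (2*TP + FP + FN), which equals the harmonic mean of
    # the true positive rate and the positive predictive value.
    tp_sum = sum(tp[t] for t in tokens)
    fp_sum = sum(fp[t] for t in tokens)
    fn_sum = sum(fn[t] for t in tokens)
    return 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)

# One kept 'e', one spurious 'o', one missed 'e' and one missed 'ø':
tp = Counter({"e": 1})
fp = Counter({"o": 1})
fn = Counter({"e": 1, "ø": 1})
score = micro_f1(tp, fp, fn, ["e", "o", "ø"])  # 2*1 / (2*1 + 1 + 2) = 0.4
```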
- compute_false_discovery_rate(aggregate_over: Iterable[str] | None = None) dict[str, float] | float[source]#
Compute the false discovery rate.
The false discovery rate is given by the number of false positives divided by the total number of predicted positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the false discovery rate for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the false discovery rate for a set of special characters. See StringConfusionMatrix.compute_true_positive_rate() for examples of how this argument works.
- Returns:
false_discovery_rate – Either a dictionary that maps tokens to their false discovery rate, or, if aggregate_over is provided, a single float that represents the false discovery rate aggregated for the specified tokens.
- Return type: dict[str, float] | float
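The formula above amounts to the following one-liner (a hypothetical standalone helper, shown only to make the definition concrete):

```python
def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    # FDR = FP / (TP + FP): the fraction of predicted positives that are wrong.
    return false_positives / (true_positives + false_positives)

rate = false_discovery_rate(true_positives=3, false_positives=1)  # 1/4 = 0.25
```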
- compute_positive_predictive_value(aggregate_over: Iterable[str] | None = None) dict[str, float] | float[source]#
Compute the positive predictive value, also known as precision.
The positive predictive value is given by the number of true positives divided by the total number of predicted positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the positive predictive value for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the positive predictive value for a set of special characters. See StringConfusionMatrix.compute_true_positive_rate() for examples of how this argument works.
- Returns:
positive_predictive_value – Either a dictionary that maps tokens to their positive predictive value, or, if aggregate_over is provided, a single float that represents the positive predictive value aggregated for the specified tokens.
- Return type: dict[str, float] | float
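The per-token dictionary returned when aggregate_over is omitted can be sketched like this (a hypothetical helper working on plain Counters, not the stringalign implementation):

```python
from collections import Counter

def positive_predictive_value(tp: Counter, fp: Counter) -> dict[str, float]:
    # Per-token precision: TP / (TP + FP) for every token seen in either counter.
    return {token: tp[token] / (tp[token] + fp[token]) for token in set(tp) | set(fp)}

tp = Counter({"a": 3, "b": 1})
fp = Counter({"a": 1})
ppv = positive_predictive_value(tp, fp)  # {"a": 0.75, "b": 1.0}
```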
- compute_precision(aggregate_over: Iterable[str] | None = None) dict[str, float] | float#
Compute the positive predictive value, also known as precision.
The positive predictive value is given by the number of true positives divided by the total number of predicted positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the positive predictive value for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the positive predictive value for a set of special characters. See StringConfusionMatrix.compute_true_positive_rate() for examples of how this argument works.
- Returns:
positive_predictive_value – Either a dictionary that maps tokens to their positive predictive value, or, if aggregate_over is provided, a single float that represents the positive predictive value aggregated for the specified tokens.
- Return type: dict[str, float] | float
- compute_recall(aggregate_over: Iterable[str] | None = None) dict[str, float] | float#
Compute the true positive rate, also known as sensitivity or recall.
The true positive rate is given by the number of true positives divided by the total number of positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the true positive rate for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the true positive rate for a set of special characters.
- Returns:
true_positive_rate – Either a dictionary that maps tokens to their true positive rate, or, if aggregate_over is provided, a single float that represents the true positive rate aggregated for the specified tokens.
- Return type: dict[str, float] | float
Examples
If we compute the true positive rate without aggregating over tokens, we get a dict of true positive rates
>>> cm = StringConfusionMatrix.from_strings("ostehøvel", "ostehovl")
>>> expected_tp = {'o': 1.0, 's': 1.0, 't': 1.0, 'h': 1.0, 'v': 1.0, 'l': 1.0, 'e': 0.5, 'ø': 0.0}
>>> tp = cm.compute_true_positive_rate()
>>> expected_tp == tp
True
If we specify an iterable of tokens to aggregate over, we get the total true positive rate for those tokens. In this case, we aggregate over ["æ", "ø", "å"], and the prediction did not find any of those tokens, so the true positive rate is zero.
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.0
The aggregated statistic is micro-averaged: the function counts the true positives and false negatives for all the specified tokens, sums them and then computes the true positive rate.
>>> cm = StringConfusionMatrix.from_strings("blåbær- og bringebærsyltetøy", "blabaer- og bringebærsyltetoy")
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.25
- compute_sensitivity(aggregate_over: Iterable[str] | None = None) dict[str, float] | float#
Compute the true positive rate, also known as sensitivity or recall.
The true positive rate is given by the number of true positives divided by the total number of positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the true positive rate for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the true positive rate for a set of special characters.
- Returns:
true_positive_rate – Either a dictionary that maps tokens to their true positive rate, or, if aggregate_over is provided, a single float that represents the true positive rate aggregated for the specified tokens.
- Return type: dict[str, float] | float
Examples
If we compute the true positive rate without aggregating over tokens, we get a dict of true positive rates
>>> cm = StringConfusionMatrix.from_strings("ostehøvel", "ostehovl")
>>> expected_tp = {'o': 1.0, 's': 1.0, 't': 1.0, 'h': 1.0, 'v': 1.0, 'l': 1.0, 'e': 0.5, 'ø': 0.0}
>>> tp = cm.compute_true_positive_rate()
>>> expected_tp == tp
True
If we specify an iterable of tokens to aggregate over, we get the total true positive rate for those tokens. In this case, we aggregate over ["æ", "ø", "å"], and the prediction did not find any of those tokens, so the true positive rate is zero.
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.0
The aggregated statistic is micro-averaged: the function counts the true positives and false negatives for all the specified tokens, sums them and then computes the true positive rate.
>>> cm = StringConfusionMatrix.from_strings("blåbær- og bringebærsyltetøy", "blabaer- og bringebærsyltetoy")
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.25
- compute_token_error_rate() float[source]#
Compute the token error rate (a generalisation of CER and WER).
The token error rate is the number of token edits divided by the total number of tokens in the reference.
If the tokenizer tokenizes the string into characters, this is equivalent to the character error rate (CER).
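The computation described above reduces to a single division (a hypothetical standalone helper, shown only to make the definition concrete):

```python
def token_error_rate(num_edits: int, reference_tokens: list) -> float:
    # Edit operations (insertions, deletions and replacements) divided by
    # the total number of tokens in the reference.
    return num_edits / len(reference_tokens)

# Turning "ostehovl" back into "ostehøvel" takes two edits, and the
# reference has nine characters, so the CER is 2/9.
cer = token_error_rate(2, list("ostehøvel"))
```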
- compute_true_positive_rate(aggregate_over: Iterable[str] | None = None) dict[str, float] | float[source]#
Compute the true positive rate, also known as sensitivity or recall.
The true positive rate is given by the number of true positives divided by the total number of positives.
- Parameters:
aggregate_over (optional) – If provided, this function returns only a single number, which is the true positive rate for the tokens in the aggregate_over iterable. This is useful e.g. if you want to compute the true positive rate for a set of special characters.
- Returns:
true_positive_rate – Either a dictionary that maps tokens to their true positive rate, or, if aggregate_over is provided, a single float that represents the true positive rate aggregated for the specified tokens.
- Return type: dict[str, float] | float
Examples
If we compute the true positive rate without aggregating over tokens, we get a dict of true positive rates
>>> cm = StringConfusionMatrix.from_strings("ostehøvel", "ostehovl")
>>> expected_tp = {'o': 1.0, 's': 1.0, 't': 1.0, 'h': 1.0, 'v': 1.0, 'l': 1.0, 'e': 0.5, 'ø': 0.0}
>>> tp = cm.compute_true_positive_rate()
>>> expected_tp == tp
True
If we specify an iterable of tokens to aggregate over, we get the total true positive rate for those tokens. In this case, we aggregate over ["æ", "ø", "å"], and the prediction did not find any of those tokens, so the true positive rate is zero.
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.0
The aggregated statistic is micro-averaged: the function counts the true positives and false negatives for all the specified tokens, sums them and then computes the true positive rate.
>>> cm = StringConfusionMatrix.from_strings("blåbær- og bringebærsyltetøy", "blabaer- og bringebærsyltetoy")
>>> cm.compute_true_positive_rate(aggregate_over=["æ", "ø", "å"])
0.25
- edit_counts: Counter[AlignmentOperation]#
- classmethod from_string_collections(references: Iterable[str], predictions: Iterable[str], tokenizer: Tokenizer | None = None) Self[source]#
Create confusion matrix for many strings, summing statistics across pairs of references and predictions.
- Parameters:
references – Iterable containing the reference strings.
predictions – Iterable containing the strings to align with the references.
tokenizer (optional) – A tokenizer that turns a string into an iterable of tokens; any callable with that behaviour is sufficient. If not provided, then stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
- Returns:
confusion_matrix – The confusion matrix.
- Return type: Self
- classmethod from_strings(reference: str, predicted: str, tokenizer: Tokenizer | None = None, randomize_alignment: bool = False, random_state: Generator | int | None = None) Self[source]#
Create confusion matrix based on a reference string and a predicted string.
Note
This method will first align the strings and then create the confusion matrix. If you have already computed the alignment, you can use StringConfusionMatrix.from_strings_and_alignment() instead.
- Parameters:
reference – The reference string, also known as gold standard or ground truth.
predicted – The string to align with the reference.
tokenizer (optional) – A tokenizer that turns a string into an iterable of tokens; any callable with that behaviour is sufficient. If not provided, then stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
randomize_alignment – If True, then a random optimal alignment is chosen (slightly slower if enabled).
random_state – The NumPy RNG, or a seed used to create a NumPy RNG, for picking the optimal alignment. If None, then the default RNG is used instead.
- Returns:
confusion_matrix – The confusion matrix.
- Return type: Self
- classmethod from_strings_and_alignment(reference: str, predicted: str, alignment: Iterable[AlignmentOperation], tokenizer: Tokenizer | None = None) Self[source]#
Create confusion matrix based on a reference string, a predicted string and their alignment.
Important
The string metrics are not well defined if we include combined alignments. This is because the true positive count etc. is not well defined for multi-token strings. For example, how many true positives are there for the 'll' substring in the string 'llllll'? The answer is most likely either 3 or 5, depending on whether we count overlapping substrings.
- Parameters:
reference – The reference string, also known as gold standard or ground truth.
predicted – The string to align with the reference.
alignment – An optimal alignment for these strings.
tokenizer (optional) – A tokenizer that turns a string into an iterable of tokens; any callable with that behaviour is sufficient. If not provided, then stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
- Returns:
confusion_matrix – The confusion matrix.
- Return type: Self
- classmethod get_empty() Self[source]#
Make an empty confusion matrix (equivalent to that of two empty strings).
This can be used as a starting point for summing multiple confusion matrices when computing micro-averaged metrics over multiple string pairs.
- Returns:
confusion_matrix – An empty confusion matrix.
- Return type: Self
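The summation pattern described above can be sketched with a stripped-down stand-in class (MiniConfusionMatrix is hypothetical, used only to show how an empty matrix acts as the identity element when summing; the real StringConfusionMatrix holds stringalign Counters):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MiniConfusionMatrix:
    # Simplified stand-in for StringConfusionMatrix: just the three counters.
    true_positives: Counter = field(default_factory=Counter)
    false_positives: Counter = field(default_factory=Counter)
    false_negatives: Counter = field(default_factory=Counter)

    def __add__(self, other: "MiniConfusionMatrix") -> "MiniConfusionMatrix":
        # Summing matrices adds the token counts, which is exactly what
        # micro-averaging over multiple string pairs requires.
        return MiniConfusionMatrix(
            self.true_positives + other.true_positives,
            self.false_positives + other.false_positives,
            self.false_negatives + other.false_negatives,
        )

    @classmethod
    def get_empty(cls) -> "MiniConfusionMatrix":
        # Equivalent to the confusion matrix of two empty strings.
        return cls()

matrices = [
    MiniConfusionMatrix(Counter({"a": 2}), Counter({"b": 1}), Counter()),
    MiniConfusionMatrix(Counter({"a": 1}), Counter(), Counter({"c": 1})),
]
total = sum(matrices, start=MiniConfusionMatrix.get_empty())
```

Any micro-averaged metric can then be computed from the summed counters of `total`.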