API reference

Type aliases

Sample = Mapping[FieldName, FieldValue]
FieldName = str
FieldValue = Union[float, int, bool, str, Sequence[FieldValue]]
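
As a rough illustration, the aliases above could be written with the standard typing module as follows. This is only a sketch: the forward-reference string and the field names in the sample are assumptions made for the example, not taken from the library.

```python
from typing import Mapping, Sequence, Union

FieldName = str
# The string "FieldValue" is a forward reference, so the alias can refer to itself.
FieldValue = Union[float, int, bool, str, Sequence["FieldValue"]]
Sample = Mapping[FieldName, FieldValue]

# Hypothetical sample: the field names below are illustrative only.
sample: Sample = {
    "words": ["john", "talks"],                                  # sequence of strings
    "chars": [["j", "o", "h", "n"], ["t", "a", "l", "k", "s"]],  # nested sequence
    "length": 2,                                                 # scalar value
}
```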

Classes

Vocab

class text2array.Vocab(**kwargs)

Bases: collections.UserDict, typing.MutableMapping

A dictionary mapping each field name to a StringStore object that serves as that field’s vocabulary.

extend(samples, fields=None)

Extend vocabulary with field values in samples.

Parameters:

samples (Iterable[Sample]) – Samples to extend the vocabulary with.
fields (Optional[Iterable[str]]) – Extend only these field names. Defaults to all field names in the vocabulary.

Return type:	None

classmethod from_samples(samples, options=None, pbar=None)

Make an instance of this class from an iterable of samples.

A vocabulary is only made for fields whose value is a string token or a (nested) sequence of string tokens. It is important that samples be a true iterable, i.e. one that can be iterated over more than once (not a one-shot iterator such as a generator). No exception is raised when this is violated.

Parameters:

samples (Iterable[Sample]) – Iterable of samples.
pbar (Optional[tqdm]) – Instance of tqdm for displaying a progress bar.
options (Optional[Mapping[str, dict]]) – Mapping from field names to dictionaries that control how each field’s vocabulary is created. Recognized dictionary keys are:

    min_count (int) – Exclude tokens occurring fewer than this number of times from the vocabulary (default: 1).
    pad (str) – String token to represent padding (default: `<pad>`). If None, no padding token is added to the vocabulary. Otherwise, it is the first entry in the vocabulary (index 0). Note that if the field has no sequential values, no padding token is added. String field values are not considered sequential.
    unk (str) – String token to represent unknown tokens (default: `<unk>`). If None, no unknown token is added to the vocabulary, so querying the vocabulary with a token it does not contain raises an error. Otherwise, it is the first entry in the vocabulary after pad, if any (index 0 or 1).
    max_size (int) – Maximum size of the vocabulary, excluding pad and unk (default: None). If None, the vocabulary size is not limited. Otherwise, at most this number of the most frequent tokens are included. Note that min_count also limits the size implicitly, so the effective size is whichever limit is smaller.

Returns:	Vocabulary instance.
Return type:	Vocab
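
To make the interaction between min_count, max_size, pad and unk concrete, here is a minimal standalone sketch of the selection rules described above. It is not the library's implementation: build_vocab is a hypothetical helper, it works on a flat list of tokens for a single sequential field, and the tie-breaking order among equally frequent tokens is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_vocab(
    tokens: Iterable[str],
    min_count: int = 1,
    pad: Optional[str] = "<pad>",
    unk: Optional[str] = "<unk>",
    max_size: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper returning vocabulary entries ordered by index."""
    counts = Counter(tokens)
    # Special tokens come first: pad at index 0, unk right after it (if present).
    specials = [t for t in (pad, unk) if t is not None]
    # Keep tokens seen at least min_count times, most frequent first.
    kept = [t for t, c in counts.most_common() if c >= min_count]
    # max_size caps the number of regular tokens, excluding pad and unk.
    if max_size is not None:
        kept = kept[:max_size]
    return specials + [t for t in kept if t not in specials]


tokens = ["a", "b", "b", "c", "c", "c"]
vocab_min2 = build_vocab(tokens, min_count=2)   # "a" is too rare
vocab_top1 = build_vocab(tokens, max_size=1)    # only the most frequent token
```

With these inputs, vocab_min2 is `['<pad>', '<unk>', 'c', 'b']` and vocab_top1 is `['<pad>', '<unk>', 'c']`, showing that whichever limit is stricter determines the final size.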

itos(samples)

Convert integers in the given samples to strings according to this vocabulary.

This method is essentially the inverse of stoi.

Parameters:	samples (Iterable[Sample]) – Samples to convert.
Returns:	Converted samples.
Return type:	Iterable[Sample]

stoi(samples)

Convert strings in the given samples to integers according to this vocabulary.

This conversion means mapping all the (nested) string field values to integers according to the mapping specified by the StringStore object of that field. Field names with no entry in the vocabulary are ignored. Note that the actual conversion is lazy; it is not performed until the resulting iterable is iterated over.

Parameters:	samples (Iterable[Sample]) – Samples to convert.
Returns:	Converted samples.
Return type:	Iterable[Sample]
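
The lazy, nested conversion described for stoi can be pictured with a generator. The sketch below is an assumption-laden stand-in rather than the library's code: plain dictionaries take the place of StringStore objects, and stoi_sketch and convert are hypothetical names.

```python
from typing import Any, Dict, Iterable, Iterator, Mapping


def convert(value: Any, mapping: Mapping[str, int]) -> Any:
    """Recursively map strings to integers, leaving other scalars untouched."""
    if isinstance(value, str):
        return mapping[value]
    if isinstance(value, (list, tuple)):
        return [convert(v, mapping) for v in value]
    return value


def stoi_sketch(
    samples: Iterable[Mapping[str, Any]],
    vocab: Mapping[str, Mapping[str, int]],
) -> Iterator[Dict[str, Any]]:
    # A generator: nothing is converted until the result is iterated over.
    for sample in samples:
        yield {
            # Fields without a vocabulary entry are passed through unchanged.
            name: convert(value, vocab[name]) if name in vocab else value
            for name, value in sample.items()
        }


vocab = {"ws": {"a": 0, "b": 1}}
samples = [{"ws": ["a", "b", "b"], "n": 3}, {"ws": [["b"], ["a"]], "n": 2}]
lazy = stoi_sketch(samples, vocab)   # no work done yet
converted = list(lazy)               # conversion happens here
```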

ShuffleIterator

class text2array.ShuffleIterator(items, key=None, scale=1.0, rng=None)

Bases: typing.Iterable, collections.abc.Sized

Iterator that shuffles a sequence of items before iterating.

When key is not given, this iterator performs ordinary shuffling using random.shuffle. Otherwise, a noisy sorting is performed: the items are sorted in ascending order by the value of the given key plus some random noise \(\epsilon \sim \mathrm{Uniform}(-z, z)\), where \(z\) equals scale times the standard deviation of the key values. Thus, scale regulates how noisy the sorting is: the larger it is, the noisier the sorting becomes, i.e. the more closely it resembles random shuffling. In the extreme case where scale=0, the items are simply sorted by key.

This is useful when working with text data, where we want to shuffle the dataset while also minimizing padding by keeping sentences of similar lengths close together.

Example

>>> from random import Random
>>> from text2array import ShuffleIterator
>>> samples = [
...   {'ws': ['a', 'b', 'b']},
...   {'ws': ['a']},
...   {'ws': ['a', 'a', 'b', 'b', 'b', 'b']},
... ]
>>> iter_ = ShuffleIterator(samples, key=lambda s: len(s['ws']), rng=Random(1234))
>>> for s in iter_:
...   print(s)
...
{'ws': ['a']}
{'ws': ['a', 'a', 'b', 'b', 'b', 'b']}
{'ws': ['a', 'b', 'b']}

Parameters:

items (Sequence[Any]) – Sequence of items to shuffle and iterate over.
key (typing.Callable[[Any], int]) – Callable to get the key value of an item.
scale (float) – Value to regulate the noise of the sorting. Must not be negative.
rng (Optional[Random]) – Random number generator to use for shuffling. Set this to ensure reproducibility. If not given, an instance of Random with the default seed is used.
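
The noisy sorting described for ShuffleIterator can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's code: noisy_sort is a hypothetical function, and the population standard deviation (statistics.pstdev) is assumed because the text does not say which estimator is used.

```python
import random
import statistics
from typing import Any, Callable, List, Optional, Sequence


def noisy_sort(
    items: Sequence[Any],
    key: Callable[[Any], float],
    scale: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[Any]:
    """Hypothetical helper: sort by key plus Uniform(-z, z) noise."""
    if not items:
        return []
    rng = rng or random.Random()
    keys = [key(x) for x in items]
    # z = scale * standard deviation of the key values (population std assumed).
    z = scale * statistics.pstdev(keys)
    noisy_keys = [k + rng.uniform(-z, z) for k in keys]
    # Sort item positions by their noisy key, then read the items in that order.
    order = sorted(range(len(items)), key=noisy_keys.__getitem__)
    return [items[i] for i in order]


sentences = ["ccc", "a", "bb", "dddd"]
plain_sorted = noisy_sort(sentences, key=len, scale=0)   # no noise at all
shuffled = noisy_sort(sentences, key=len, scale=1.0, rng=random.Random(0))
```

With scale=0 the noise interval collapses to a single point, so plain_sorted is exactly the items ordered by length. Larger values of scale widen the interval and let items with distant keys swap places.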

BatchIterator

class text2array.BatchIterator(samples, batch_size=1)

Bases: typing.Iterable, collections.abc.Sized

Iterator that produces batches of samples.

Example

>>> from text2array import BatchIterator
>>> samples = [
...   {'ws': ['a']},
...   {'ws': ['a', 'b']},
...   {'ws': ['b', 'b']},
... ]
>>> iter_ = BatchIterator(samples, batch_size=2)
>>> for b in iter_:
...   print(list(b))
...
[{'ws': ['a']}, {'ws': ['a', 'b']}]
[{'ws': ['b', 'b']}]

Parameters:

samples (Iterable[Sample]) – Iterable of samples to batch.
batch_size (int) – Maximum number of samples in each batch.

Note

When samples is an instance of Sized, this iterator can be passed to len to get the number of batches. Otherwise, a TypeError is raised.
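
As a rough picture of the batching behaviour and of why len only works for sized inputs, consider this standalone sketch. The names batches and num_batches are hypothetical and the code is not taken from the library.

```python
import math
from itertools import islice
from typing import Any, Iterable, Iterator, List, Sized


def batches(samples: Iterable[Any], batch_size: int = 1) -> Iterator[List[Any]]:
    """Yield consecutive chunks of at most batch_size samples."""
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def num_batches(samples: Sized, batch_size: int = 1) -> int:
    # len() raises TypeError for unsized inputs such as generators.
    return math.ceil(len(samples) / batch_size)


samples = [{"ws": ["a"]}, {"ws": ["a", "b"]}, {"ws": ["b", "b"]}]
result = list(batches(samples, batch_size=2))
count = num_batches(samples, batch_size=2)
```

The last batch may be smaller than batch_size, which is why the number of batches is the ceiling of the sample count divided by batch_size.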

BucketIterator

class text2array.BucketIterator(samples, key, batch_size=1, shuffle_bucket=False, rng=None, sort_bucket=False, sort_bucket_by=None)

Bases: typing.Iterable, collections.abc.Sized

Iterator that batches together samples from the same bucket.

Example

>>> from text2array import BucketIterator
>>> samples = [
...   {'ws': ['a']},
...   {'ws': ['a', 'b']},
...   {'ws': ['b']},
...   {'ws': ['c']},
...   {'ws': ['b', 'b']},
... ]
>>> iter_ = BucketIterator(samples, key=lambda s: len(s['ws']), batch_size=2)
>>> for b in iter_:
...   print(list(b))
...
[{'ws': ['a']}, {'ws': ['b']}]
[{'ws': ['c']}]
[{'ws': ['a', 'b']}, {'ws': ['b', 'b']}]

Parameters:

samples (Iterable[Sample]) – Iterable of samples to batch.
key (typing.Callable[[Sample], Any]) – Callable to get the bucket key of a sample.
batch_size (int) – Maximum number of samples in each batch.
shuffle_bucket (bool) – Whether to shuffle every bucket before batching.
rng (Optional[Random]) – Random number generator to use for shuffling. Set this to ensure reproducibility. If not given, an instance of Random with the default seed is used.
sort_bucket (bool) – Whether to sort every bucket before batching. When both shuffle_bucket and sort_bucket are True, sorting is ignored (but don’t rely on this behavior).
sort_bucket_by (typing.Callable[[Sample], Any]) – Callable acting as the sort key if sort_bucket=True.

Note

When samples is an instance of Sized, this iterator can be passed to len to get the number of batches. Otherwise, a TypeError is raised.
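
Bucketing can be thought of as grouping samples by key and then batching within each group. The sketch below reproduces the output of the example above under two assumptions that the text does not confirm: buckets are visited in order of first appearance, and no shuffling or sorting is applied. bucket_batches is a hypothetical name.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List


def bucket_batches(
    samples: Iterable[Any],
    key: Callable[[Any], Any],
    batch_size: int = 1,
) -> Iterator[List[Any]]:
    """Group samples by bucket key, then batch inside each bucket."""
    buckets: Dict[Any, List[Any]] = {}
    for sample in samples:
        buckets.setdefault(key(sample), []).append(sample)
    # Dictionaries preserve insertion order, so buckets appear in first-seen order.
    for bucket in buckets.values():
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]


samples = [
    {"ws": ["a"]},
    {"ws": ["a", "b"]},
    {"ws": ["b"]},
    {"ws": ["c"]},
    {"ws": ["b", "b"]},
]
result = list(bucket_batches(samples, key=lambda s: len(s["ws"]), batch_size=2))
```

Because every batch comes from a single bucket, bucketing by sequence length means no sample in a batch needs padding.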

Batch

class text2array.Batch(samples=None)

Bases: collections.UserList, typing.MutableSequence

A class to represent a single batch.

Parameters:	samples (Sequence[Sample]) – Sequence of samples this batch should contain.

to_array(pad_with=0)

Convert the batch into NumPy arrays, one per field.

Parameters:	pad_with (Union[int, float, Mapping[str, Union[int, float]]]) – Pad sequential field values with this value. Can also be a mapping from field names to the padding value for that field. Fields whose name is not in the mapping are padded with zeros.
Returns:	A mapping from field names to arrays whose first dimension corresponds to the batch size as returned by len.
Return type:	Dict[str, ndarray]
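
The padding rule for pad_with can be illustrated without NumPy. The sketch below is a simplified stand-in, not the library's code: pad_batch is a hypothetical helper, it returns nested lists instead of ndarray objects, and it handles only one level of nesting.

```python
from typing import Any, Dict, List, Mapping, Sequence, Union

Number = Union[int, float]


def pad_batch(
    samples: Sequence[Mapping[str, Any]],
    pad_with: Union[Number, Mapping[str, Number]] = 0,
) -> Dict[str, List[Any]]:
    """Collect each field across the batch, padding sequences to equal length."""
    out: Dict[str, List[Any]] = {}
    for name in samples[0]:
        values = [s[name] for s in samples]
        if isinstance(values[0], (list, tuple)):
            # A mapping gives per-field padding values; missing fields use zero.
            fill = pad_with.get(name, 0) if isinstance(pad_with, Mapping) else pad_with
            width = max(len(v) for v in values)
            values = [list(v) + [fill] * (width - len(v)) for v in values]
        out[name] = values
    return out


batch = [{"ws": [1, 2, 3], "y": 0}, {"ws": [4], "y": 1}]
padded = pad_batch(batch, pad_with={"ws": -1})
```

Here padded["ws"] has one row per sample, so its first dimension equals the batch size, and the shorter sequence is filled with -1.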

StringStore

class text2array.StringStore(initial=None, default=None)

An ordered set of strings, with an optional default value for unknown strings.

This class implements both MutableSet and Sequence with str as its contents.

Example

>>> from text2array import StringStore
>>> store = StringStore('abb', default='a')
>>> list(store)
['a', 'b']
>>> store.add('b')
1
>>> store.add('c')
2
>>> list(store)
['a', 'b', 'c']
>>> store.index('a')
0
>>> store.index('b')
1
>>> store.index('d')
0

Parameters:

initial (Optional[Sequence[str]]) – Initial elements of the store.
default (Optional[str]) – Default string as a representation of unknown strings, i.e. those that do not exist in the store.
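
One way to picture the behaviour shown in the example above is an insertion-ordered dictionary from strings to indices. The class below is a hypothetical sketch (OrderedStrings is not part of the library) and implements only a small subset of the MutableSet and Sequence interfaces. Raising ValueError when no default is set is an assumption.

```python
from typing import Dict, Iterable, Iterator, Optional


class OrderedStrings:
    """Hypothetical ordered set of strings with a fallback for unknown ones."""

    def __init__(
        self,
        initial: Optional[Iterable[str]] = None,
        default: Optional[str] = None,
    ) -> None:
        self._index: Dict[str, int] = {}
        self.default = default
        for s in initial or []:
            self.add(s)

    def add(self, s: str) -> int:
        # Existing strings keep their index; new ones get the next free index.
        return self._index.setdefault(s, len(self._index))

    def index(self, s: str) -> int:
        if s in self._index:
            return self._index[s]
        if self.default is not None:
            # Unknown strings map to the index of the default string.
            return self._index[self.default]
        raise ValueError(f"{s!r} is not in the store")

    def __iter__(self) -> Iterator[str]:
        return iter(self._index)

    def __len__(self) -> int:
        return len(self._index)


store = OrderedStrings("abb", default="a")
```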