API reference
Type aliases
Sample = Mapping[FieldName, FieldValue]
FieldName = str
FieldValue = Union[float, int, bool, str, Sequence[FieldValue]]
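As a rough illustration of what these aliases describe, the sketch below spells them out in runnable form and builds one sample. Writing the recursive reference as the string "FieldValue" is an assumption about how the recursion could be expressed, and the field names ws, cs, and label are made up for the example.

```python
from typing import Mapping, Sequence, Union

FieldName = str
# The string "FieldValue" is a forward reference, needed because the alias
# refers to itself (an assumption; the library may spell this differently).
FieldValue = Union[float, int, bool, str, Sequence["FieldValue"]]
Sample = Mapping[FieldName, FieldValue]

# A hypothetical sample: a flat token sequence, a nested (character-level)
# sequence, and a scalar field.
sample: Sample = {
    "ws": ["ab", "c"],
    "cs": [["a", "b"], ["c"]],
    "label": 1,
}
```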
Classes
Vocab
class text2array.Vocab(**kwargs)
Bases: collections.UserDict, typing.MutableMapping
A dictionary from field names to StringStore objects as the field’s vocabulary.
extend(samples, fields=None)
Extend vocabulary with field values in samples.
Parameters:
Return type: None
classmethod from_samples(samples, options=None, pbar=None)
Make an instance of this class from an iterable of samples.
A vocabulary is only made for fields whose value is a string token or a (nested) sequence of string tokens. It is important that samples be a true iterable, i.e. one that can be iterated more than once (a generator, for example, cannot). No exception is raised when this is violated.
Parameters:
- samples (Iterable[Sample]) – Iterable of samples.
- pbar (Optional[tqdm]) – Instance of tqdm for displaying a progress bar.
- options (Optional[Mapping[str, dict]]) – Mapping from field names to dictionaries that control the creation of the vocabularies. Recognized dictionary keys are:
  - min_count (int): Exclude tokens occurring fewer than this number of times from the vocabulary (default: 1).
  - pad (str): String token to represent padding tokens. If None, no padding token is added to the vocabulary. Otherwise, it is the first entry in the vocabulary (index is 0). Note that if the field has no sequential values, no padding is added. String field values are not considered sequential (default: <pad>).
  - unk (str): String token to represent unknown tokens. If None, no unknown token is added to the vocabulary, so querying the vocabulary with an unknown token raises an error. Otherwise, it is the first entry in the vocabulary after pad, if any (index is either 0 or 1) (default: <unk>).
  - max_size (int): Maximum size of the vocabulary, excluding pad and unk. If None, there is no limit on the vocabulary size. Otherwise, at most this number of the most frequent tokens are included in the vocabulary. Note that min_count also limits the size implicitly, so the size is bounded by whichever is smaller (default: None).
Returns: Vocabulary instance.
Return type: Vocab
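To make the interaction between min_count, pad, unk, and max_size concrete, here is a minimal sketch for a single sequential field. It is not the library's implementation: build_vocab is a hypothetical helper, it works on a flat list of tokens, and breaking frequency ties by first occurrence is an assumption.

```python
from collections import Counter

def build_vocab(tokens, min_count=1, pad="<pad>", unk="<unk>", max_size=None):
    """Return an ordered token list for one field (hypothetical helper)."""
    counts = Counter(tokens)
    # most_common(None) returns every token, most frequent first.
    kept = [tok for tok, n in counts.most_common(max_size) if n >= min_count]
    # pad comes first, then unk; either is skipped when set to None.
    specials = [tok for tok in (pad, unk) if tok is not None]
    return specials + [tok for tok in kept if tok not in specials]

tokens = "a b b c c c".split()
full = build_vocab(tokens)                   # all tokens kept
frequent = build_vocab(tokens, min_count=2)  # 'a' occurs once and is dropped
capped = build_vocab(tokens, max_size=1)     # only the most frequent token
no_pad = build_vocab(tokens, pad=None)       # unk moves to index 0
```

Because tokens are ranked by frequency, applying max_size before or after the min_count filter gives the same result, which is why the smaller of the two limits wins.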
itos(samples)
Convert integers in the given samples to strings according to this vocabulary.
This method is essentially the inverse of stoi.
Parameters: samples (Iterable[Sample]) – Samples to convert.
Returns: Converted samples.
Return type: Iterable[Sample]
stoi(samples)
Convert strings in the given samples to integers according to this vocabulary.
This conversion means mapping all the (nested) string field values to integers according to the mapping specified by the StringStore object of that field. Field names with no entry in the vocabulary are ignored. Note that the actual conversion is lazy; it is not performed until the resulting iterable is iterated over.
Parameters: samples (Iterable[Sample]) – Samples to convert.
Returns: Converted samples.
Return type: Iterable[Sample]
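The lazy, nested conversion described for stoi and its inverse itos can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: stoi_sketch and itos_sketch are hypothetical names, and the vocabulary is modelled as a plain dict from field names to token lists instead of StringStore objects.

```python
def _convert(value, table):
    """Recursively map a (nested) field value through a lookup table."""
    if isinstance(value, (list, tuple)):
        return [_convert(v, table) for v in value]
    return table[value]  # raises KeyError for entries missing from the table

def _map_samples(samples, tables):
    # A generator expression: nothing is converted until it is iterated over.
    return (
        {name: _convert(v, tables[name]) if name in tables else v
         for name, v in s.items()}
        for s in samples
    )

def stoi_sketch(samples, vocab):
    tables = {f: {tok: i for i, tok in enumerate(toks)} for f, toks in vocab.items()}
    return _map_samples(samples, tables)

def itos_sketch(samples, vocab):
    tables = {f: dict(enumerate(toks)) for f, toks in vocab.items()}
    return _map_samples(samples, tables)

vocab = {"ws": ["<pad>", "<unk>", "a", "b"]}
samples = [{"ws": ["a", "b", "a"], "n": 3}]  # field 'n' has no vocabulary
ids = list(stoi_sketch(samples, vocab))
restored = list(itos_sketch(ids, vocab))
```

The field n passes through unchanged because it has no entry in the vocabulary, and mapping the integers back recovers the original samples.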
ShuffleIterator
class text2array.ShuffleIterator(items, key=None, scale=1.0, rng=None)
Bases: typing.Iterable, collections.abc.Sized
Iterator that shuffles a sequence of items before iterating.
When key is not given, this iterator performs ordinary shuffling using random.shuffle. Otherwise, a noisy sorting is performed. The items are sorted in ascending order by the value of the given key plus some random noise \(\epsilon \sim \mathrm{Uniform}(-z, z)\), where \(z\) equals scale times the standard deviation of the key values. This formulation means that scale regulates how noisy the sorting is: the larger it is, the noisier the sorting becomes, i.e. the more closely it resembles random shuffling. In the extreme case where scale=0, this method just sorts the items by key. This is useful when working with text data, where we want to shuffle the dataset and also minimize padding by ensuring that sentences of similar lengths are not too far apart.
Example
>>> from random import Random
>>> from text2array import ShuffleIterator
>>> samples = [
...     {'ws': ['a', 'b', 'b']},
...     {'ws': ['a']},
...     {'ws': ['a', 'a', 'b', 'b', 'b', 'b']},
... ]
>>> iter_ = ShuffleIterator(samples, key=lambda s: len(s['ws']), rng=Random(1234))
>>> for s in iter_:
...     print(s)
...
{'ws': ['a']}
{'ws': ['a', 'a', 'b', 'b', 'b', 'b']}
{'ws': ['a', 'b', 'b']}
Parameters:
- items (Sequence[Any]) – Sequence of items to shuffle and iterate over.
- key (typing.Callable[[Any], int]) – Callable to get the key value of an item.
- scale (float) – Value to regulate the noise of the sorting. Must not be negative.
- rng (Optional[Random]) – Random number generator to use for shuffling. Set this to ensure reproducibility. If not given, an instance of Random with the default seed is used.
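The noisy sorting described above can be sketched in a few lines. This is a minimal illustration, not the class's actual code: noisy_sorted is a hypothetical helper, and using the population standard deviation is an assumption, since the description does not say which estimator is used.

```python
import random
import statistics

def noisy_sorted(items, key, scale=1.0, rng=None):
    """Sort items by key plus Uniform(-z, z) noise (hypothetical helper)."""
    if scale < 0:
        raise ValueError("scale must not be negative")
    if not items:
        return []
    if rng is None:
        rng = random.Random()
    keys = [key(item) for item in items]
    # Assumption: population standard deviation of the key values.
    z = scale * statistics.pstdev(keys)
    noisy = [k + rng.uniform(-z, z) for k in keys]
    # Sort positions by their noisy key, then read the items in that order.
    order = sorted(range(len(items)), key=noisy.__getitem__)
    return [items[i] for i in order]

samples = [{"ws": ["a", "b", "b"]}, {"ws": ["a"]}, {"ws": ["a"] * 6}]
by_length = noisy_sorted(samples, key=lambda s: len(s["ws"]), scale=0)
shuffled = noisy_sorted(samples, key=lambda s: len(s["ws"]), rng=random.Random(0))
```

With scale=0 the noise vanishes and the result is an ordinary sort by length; larger values of scale let items with nearby keys swap places more often.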
BatchIterator
class text2array.BatchIterator(samples, batch_size=1)
Bases: typing.Iterable, collections.abc.Sized
Iterator that produces batches of samples.
Example
>>> from text2array import BatchIterator
>>> samples = [
...     {'ws': ['a']},
...     {'ws': ['a', 'b']},
...     {'ws': ['b', 'b']},
... ]
>>> iter_ = BatchIterator(samples, batch_size=2)
>>> for b in iter_:
...     print(list(b))
...
[{'ws': ['a']}, {'ws': ['a', 'b']}]
[{'ws': ['b', 'b']}]
Parameters:
BucketIterator
class text2array.BucketIterator(samples, key, batch_size=1, shuffle_bucket=False, rng=None, sort_bucket=False, sort_bucket_by=None)
Bases: typing.Iterable, collections.abc.Sized
Iterator that batches together samples from the same bucket.
Example
>>> from text2array import BucketIterator
>>> samples = [
...     {'ws': ['a']},
...     {'ws': ['a', 'b']},
...     {'ws': ['b']},
...     {'ws': ['c']},
...     {'ws': ['b', 'b']},
... ]
>>> iter_ = BucketIterator(samples, key=lambda s: len(s['ws']), batch_size=2)
>>> for b in iter_:
...     print(list(b))
...
[{'ws': ['a']}, {'ws': ['b']}]
[{'ws': ['c']}]
[{'ws': ['a', 'b']}, {'ws': ['b', 'b']}]
Parameters:
- samples (Iterable[Sample]) – Iterable of samples to batch.
- key (typing.Callable[[Sample], Any]) – Callable to get the bucket key of a sample.
- batch_size (int) – Maximum number of samples in each batch.
- shuffle_bucket (bool) – Whether to shuffle every bucket before batching.
- rng (Optional[Random]) – Random number generator to use for shuffling. Set this to ensure reproducibility. If not given, an instance of Random with the default seed is used.
- sort_bucket (bool) – Whether to sort every bucket before batching. When both shuffle_bucket and sort_bucket are True, sorting will be ignored (but don’t rely on this behavior).
- sort_bucket_by (typing.Callable[[Sample], Any]) – Callable acting as the sort key if sort_bucket=True.
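The core bucketing idea, without shuffling or sorting, can be sketched as below. bucket_batches is a hypothetical helper, and visiting buckets in order of first appearance is an assumption; the class itself may order buckets differently.

```python
def bucket_batches(samples, key, batch_size=1):
    """Yield batches whose samples all share the same bucket key (sketch)."""
    buckets = {}
    for s in samples:
        # Group samples by bucket key, keeping their original order.
        buckets.setdefault(key(s), []).append(s)
    for bucket in buckets.values():
        # Cut each bucket into chunks of at most batch_size samples.
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]

samples = [
    {"ws": ["a"]},
    {"ws": ["a", "b"]},
    {"ws": ["b"]},
    {"ws": ["c"]},
    {"ws": ["b", "b"]},
]
batches = list(bucket_batches(samples, key=lambda s: len(s["ws"]), batch_size=2))
```

Since every batch holds sequences of equal length here, no padding is needed when the batch is later converted to arrays.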
Batch
class text2array.Batch(samples=None)
Bases: collections.UserList, typing.MutableSequence
A class to represent a single batch.
Parameters: samples (Sequence[Sample]) – Sequence of samples this batch should contain.
to_array(pad_with=0)
Convert the batch into ndarray.
Parameters: pad_with (Union[int, float, Mapping[str, Union[int, float]]]) – Pad sequential field values with this value. Can also be a mapping from field names to the padding value for that field. Fields whose name is not in the mapping will be padded with zeros.
Return type: Dict[str, ndarray]
Returns: A mapping from field names to arrays whose first dimension corresponds to the batch size as returned by len.
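How pad_with resolves to a per-field padding value can be sketched with plain lists standing in for ndarray. This is an illustration only: pad_fields is a hypothetical helper, it handles a single level of nesting, and it assumes every sample has the same fields.

```python
from collections.abc import Mapping

def pad_fields(samples, pad_with=0):
    """Collect each field across samples, padding sequences to equal length."""
    result = {}
    for name in samples[0]:
        values = [s[name] for s in samples]
        # A mapping gives per-field padding; unlisted fields fall back to 0.
        pad = pad_with.get(name, 0) if isinstance(pad_with, Mapping) else pad_with
        if isinstance(values[0], (list, tuple)):
            width = max(len(v) for v in values)
            values = [list(v) + [pad] * (width - len(v)) for v in values]
        result[name] = values
    return result

samples = [{"ws": [1, 2, 3], "cs": [7], "y": 0}, {"ws": [4], "cs": [8, 9], "y": 1}]
arrays = pad_fields(samples, pad_with={"ws": -1})
```

In this sketch the outer list of each field has one entry per sample, mirroring the statement that the first dimension corresponds to the batch size.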
StringStore
class text2array.StringStore(initial=None, default=None)
An ordered set of strings, with an optional default value for unknown strings.
This class implements both MutableSet and Sequence with str as its contents.
Example
>>> from text2array import StringStore
>>> store = StringStore('abb', default='a')
>>> list(store)
['a', 'b']
>>> store.add('b')
1
>>> store.add('c')
2
>>> list(store)
['a', 'b', 'c']
>>> store.index('a')
0
>>> store.index('b')
1
>>> store.index('d')
0
Parameters: