# Text API

## Overview

The `mxnet.contrib.text` APIs refer to classes and functions related to text data processing, such as building indices and loading pre-trained embedding vectors for text tokens, and storing them in the `mxnet.ndarray.NDArray` format.

```eval_rst
.. warning:: This package contains experimental APIs and may change in the near future.
```

This document lists the text APIs in mxnet:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.contrib.text.embedding
    mxnet.contrib.text.vocab
    mxnet.contrib.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.contrib import text
>>> import collections
```

### Looking up pre-trained word embeddings for indexed words

As a common use case, let us look up pre-trained word embedding vectors for indexed words in just a few lines of code. To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for all the keys in `counter` and load the defined fastText word embedding for all such indexed words. First, we need a `Vocabulary` object with `counter` as its argument.

```python
>>> my_vocab = text.vocab.Vocabulary(counter)
```

We can create a fastText word embedding object by specifying the embedding name `fasttext` and the pre-trained file `wiki.simple.vec`. We also specify that the indexed tokens for loading the fastText word embedding come from the defined `Vocabulary` object `my_vocab`.

```python
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...                                      vocabulary=my_vocab)
```

Now we are ready to look up the fastText word embedding vectors for indexed words, such as 'hello' and 'world'.

```python
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
```

### Using pre-trained word embeddings in `gluon`

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain the indices of the words 'hello' and 'world'.

```python
>>> my_embedding.to_indices(['hello', 'world'])
[2, 1]
```

We can obtain the vector representations of the words 'hello' and 'world' by specifying their indices (2 and 1) and setting the weight of an `mxnet.gluon.nn.Embedding` layer to `my_embedding.idx_to_vec`.

```python
>>> layer = gluon.nn.Embedding(len(my_embedding), my_embedding.vec_len)
>>> layer.initialize()
>>> layer.weight.set_data(my_embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
```
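The same layer can embed a whole tokenized sentence once the tokens are converted to indices. The short sketch below builds on `my_embedding` and `layer` defined above; the example sentence is illustrative only, and tokens absent from the vocabulary would map to index 0 ('<unk>').

```python
>>> # Map tokens to indices, then feed the indices through the initialized layer.
>>> sentence = ['hello', 'nice', 'world']
>>> indices = my_embedding.to_indices(sentence)
>>> layer(nd.array(indices)).shape
(3, 300)
```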
## Vocabulary

The vocabulary builds indices for text tokens. Such indexed tokens can be used by token embedding instances. The input counter, whose keys are candidate indices, may be obtained via [`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str).

```eval_rst
.. currentmodule:: mxnet.contrib.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary
```

Suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown token representation '<unk>' and a reserved token '<pad>'.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
...                                  reserved_tokens=['<pad>'])
```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping indices to tokens), `unknown_token` (representation of any unknown token), and `reserved_tokens`.

```python
>>> my_vocab.token_to_idx
{'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['<unk>', '<pad>', 'world', 'hello']
>>> my_vocab.unknown_token
'<unk>'
>>> my_vocab.reserved_tokens
['<pad>']
>>> len(my_vocab)
4
```

Besides the specified unknown token '<unk>' and the reserved token '<pad>', the 2 most frequent words, 'world' and 'hello', are also indexed.
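A vocabulary can also convert between tokens and indices with its `to_indices` and `to_tokens` methods. In the sketch below, the token 'nice' was not indexed, so it maps to index 0 of the unknown token '<unk>'.

```python
>>> # 'hello' is indexed at 3; 'nice' is unknown and maps to index 0 ('<unk>').
>>> my_vocab.to_indices(['hello', 'nice'])
[3, 0]
>>> my_vocab.to_tokens([2, 3])
['world', 'hello']
```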
## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those of GloVe and fastText, use [`embedding.create(embedding_name, pretrained_file_name)`](#mxnet.contrib.text.embedding.create). To get all the available `embedding_name` and `pretrained_file_name`, use [`embedding.get_pretrained_file_names()`](#mxnet.contrib.text.embedding.get_pretrained_file_names).

```python
>>> text.embedding.get_pretrained_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
 'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}
```

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use [`CustomEmbedding`](#mxnet.contrib.text.embedding.CustomEmbedding). Moreover, to load composite embedding vectors, such as concatenated embedding vectors, use [`CompositeEmbedding`](#mxnet.contrib.text.embedding.CompositeEmbedding).

The indexed tokens in a text token embedding may come from a vocabulary or from the loaded embedding vectors. In the former case, only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. In the latter case, all the tokens from the loaded embedding vectors, such as those loaded from a pre-trained token embedding file, are taken as the indexed tokens of the embedding.

```eval_rst
.. currentmodule:: mxnet.contrib.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_pretrained_file_names
    GloVe
    FastText
    CustomEmbedding
    CompositeEmbedding
```

### Indexed tokens are from a vocabulary

One can specify that only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file.

To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for the 2 most frequent keys in `counter` and load the defined fastText word embedding with the pre-trained file `wiki.simple.vec` for these 2 words.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2)
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...                                      vocabulary=my_vocab)
```

Now we are ready to look up the fastText word embedding vectors for indexed words.

```python
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
```

We can also access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping indices to tokens), and `vec_len` (length of each embedding vector).

```python
>>> my_embedding.token_to_idx
{'<unk>': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
['<unk>', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
300
```

If a token is unknown to `my_vocab`, its embedding vector is initialized to the default vector for unknown tokens (all elements are 0).

```python
>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
```

### Indexed tokens are from the loaded embedding vectors

One can also use all the tokens from the loaded embedding vectors, such as those loaded from a pre-trained token embedding file, as the indexed tokens of the embedding.

To begin with, we can create a fastText word embedding object by specifying the embedding name 'fasttext' and the pre-trained file 'wiki.simple.vec'. The argument `init_unknown_vec` specifies the default vector representation for any unknown token. To index all the tokens from this pre-trained word embedding file, we do not need to specify any vocabulary.

```python
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...                                      init_unknown_vec=nd.zeros)
```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping indices to tokens), `vec_len` (length of each embedding vector), and `unknown_token` (representation of any unknown token, default value is '<unk>').

```python
>>> my_embedding.token_to_idx['nice']
2586
>>> my_embedding.idx_to_token[2586]
'nice'
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
'<unk>'
```

For every unknown token, if its representation '<unk>' is encountered in the pre-trained token embedding file, index 0 of the property `idx_to_vec` maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of the property `idx_to_vec` maps to the default token embedding vector specified via `init_unknown_vec` (set to `nd.zeros` here). Since the pre-trained file does not have a vector for the token '<unk>', index 0 maps to an additional token '<unk>' and the number of tokens in the embedding is 111,052.

```python
>>> len(my_embedding)
111052
>>> my_embedding.idx_to_vec[0]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.49397001  0.39996001  0.24000999 -0.15121    -0.087512    0.37114
  ...
  0.089521    0.29175001 -0.40917999 -0.089206   -0.1816     -0.36616999]
>>> my_embedding.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
```
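Loaded embedding vectors can also be overwritten with `update_token_vectors`. The sketch below is a minimal example that assumes the new vectors are passed as an `NDArray` with one row per updated token; here the vector of 'nice' is simply replaced with all ones.

```python
>>> # Replace the loaded vector of 'nice' with an all-one vector of length vec_len (300).
>>> my_embedding.update_token_vectors(['nice'], nd.ones((1, 300)))
>>> my_embedding.get_vecs_by_tokens('nice')

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  ...
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
```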
### Implement a new text token embedding

To implement a new text token embedding, create a subclass of `mxnet.contrib.text.embedding._TokenEmbedding`. Also add `@mxnet.contrib.text.embedding._TokenEmbedding.register` before this class. See [`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/text/embedding.py) for examples.

## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.contrib.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
```

## API Reference

```eval_rst
.. automodule:: mxnet.contrib.text.embedding
    :members: register, create, get_pretrained_file_names
.. autoclass:: mxnet.contrib.text.embedding.GloVe
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.FastText
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CompositeEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. automodule:: mxnet.contrib.text.vocab
.. autoclass:: mxnet.contrib.text.vocab.Vocabulary
    :members: to_indices, to_tokens
.. automodule:: mxnet.contrib.text.utils
    :members: count_tokens_from_str
```