Transformer Data Loader
To Make Writing A Training Loop Simple
The previous article discussed my simple implementation of the transformer architecture from Attention Is All You Need by Ashish Vaswani et al.
This article discusses the implementation of a data loader in detail:
- Where to get text data
- How to tokenize text data
- How to assign a unique integer for each token text
- How to set up a DataLoader
Ultimately, we will have a data loader that simplifies writing a training loop.
1 Where To Get Text Data
We need pairs of sentences in two languages to perform translation tasks — for example, German and corresponding English texts.
The paper mentions the following two datasets:
- WMT 2014 English-to-German translation
- WMT 2014 English-to-French translation
But I wanted to use something much smaller so that I could train my model in less than a day without requiring massive GPU power.
Yet, I didn't want to write a web-scraping script to collect such paired texts, as that would take a lot of time and defeat the purpose.
So, I decided to use PyTorch’s torchtext.datasets, specifically to use Multi30k’s training dataset. Also, I decided to do German-to-English translation so that I could understand translated sentences generated by the model.
However, the torchtext.datasets library has other machine translation datasets, too. So, I wrote a utility function to load a dataset:
from torch.utils.data import IterableDataset
from torchtext import datasets
from typing import Tuple

def load_dataset(name: str, split: str, language_pair: Tuple[str, str]) -> IterableDataset:
    dataset_class = eval(f'datasets.{name}')
    dataset = dataset_class(split=split, language_pair=language_pair)
    return dataset
For example, I can load the training dataset from Multi30k German-English translation as follows:
dataset = load_dataset('Multi30k', 'train', ('de', 'en'))
The dataset has 29K pairs of German and English sentences.
Note: de is from Deutsch (the German language), and en is from English. So, ('de', 'en') means that we are loading a dataset of German-English text pairs.
The returned dataset is a torch.utils.data.IterableDataset, which is iterable and can be used in a for loop:
for de_text, en_text in dataset:
    print(de_text, en_text)
The first sentence pair is:
Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.
One thing to note is that we need to reload the IterableDataset once the loop reaches the end. If you run the for loop again without reloading, you will get a StopIteration exception.
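Here is a minimal sketch of that reload pattern (the same pattern appears in the training loop at the end of this article):
# the dataset is exhausted after one full pass, so reload it before each new pass
for epoch in range(3):
    dataset = load_dataset('Multi30k', 'train', ('de', 'en'))  # reload each time
    for de_text, en_text in dataset:
        pass  # ... process each sentence pair ...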
We can use DataLoader to generate sentence batches for each language:
from torch.utils.data import DataLoader

dataset = load_dataset('Multi30k', 'train', ('de', 'en'))
loader = DataLoader(dataset, batch_size=32)

for de_text_batch, en_text_batch in loader:
    ...  # still texts
However, we cannot feed text data directly into neural networks. So, we need to tokenize the text and convert it into PyTorch tensors.
2 How To Tokenize Text Data
We need to split each sentence into token texts; this process is called tokenization. Tokenization is language-specific, and it is not trivial to implement from scratch.
So, I used spaCy in my code.
It’s easy to install spaCy provided you already have a Python environment.
# Install spacy in your conda or virtual environment
pip install spacy
To use spaCy’s language tokenizer, we must obtain the respective language modules.
For example, we can download the English language module as follows:
# Download English language module
python -m spacy download en_core_web_sm
In the case of a venv environment, we can find the downloaded module at:
./venv/lib/python3.8/site-packages/en_core_web_sm
We can load the English language module as follows:
import spacy

tokenizer = spacy.load('en_core_web_sm')  # sm means small
As a side note, we can also import en_core_web_sm as a Python module:
import en_core_web_sm

tokenizer = en_core_web_sm.load()
I like the first method because it specifies which language to load as a string that we can store in a config file.
Either way, it’s simple to tokenize an English sentence as follows:
import spacy
tokenizer = spacy.load('en_core_web_sm')
tokens = tokenizer('Hello, world!')
print([token.text for token in tokens])
# Output: ['Hello', ',', 'world', '!']
Similarly, we can download de_core_news_sm for German text tokenization.
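The download command mirrors the English one:
# Download German language module
python -m spacy download de_core_news_sm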
Now, we can tokenize German and English sentences from the Multi30k dataset.
import spacy
de_tokenizer = spacy.load('de_core_news_sm')
en_tokenizer = spacy.load('en_core_web_sm')

dataset = load_dataset('Multi30k', 'train', ('de', 'en'))

for de_text, en_text in dataset:
    de_tokens = de_tokenizer(de_text)
    en_tokens = en_tokenizer(en_text)
… now what?
3 How To Assign A Unique Integer To Each Token Text
We want to convert each token text into a unique integer (token ID). We use token IDs to look up an embedding vector.
You can read more about word-embedding look-up in this article.
So, we need to build a map between token texts and token IDs.
Torchtext has a Vocab class for this purpose, but I decided to write my own implementation so that my code does not depend too much on the Torchtext framework.
Suppose I have a list of English or German texts. I can make a list of unique token texts as follows:
from collections import Counter
counter = Counter()
for doc in tokenizer.pipe(texts):
    token_texts = []
    for token in doc:
        token_text = token.text.strip()
        if len(token_text) > 0:  # not a white space
            token_texts.append(token_text)
    counter.update(token_texts)

# unique tokens
tokens = [token for token, count in counter.most_common()]
I used Counter to make a list of unique token texts. We could also use set to do the same. One advantage of Counter is that it keeps tokens in frequency order. When dealing with many tokens, you can limit the vocabulary to the most frequent N tokens, eliminating infrequently used ones:
counter.most_common(10000)  # up to most frequent 10,000 tokens
I’m using it to generate a list of unique tokens for my implementation.
The list below shows the top 20 tokens from the English sentences of the Multi30k dataset.
a
.
A
in
the
on
is
and
man
of
with
,
woman
are
to
Two
at
wearing
people
shirt
I save all the tokens in a file. So, the next time we need a list of the unique tokens, we can load it from the file.
import os

path = '<where we want to save tokens>'
os.makedirs(os.path.dirname(path), exist_ok=True)

with open(path, 'w') as f:
    f.writelines('\n'.join(tokens))
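To read them back later, here is a minimal sketch, assuming the one-token-per-line layout written above:
# load the saved tokens back, one token per line
with open(path) as f:
    tokens = [line.strip() for line in f if len(line.strip()) > 0]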
Now that we have a list of unique token texts, all we need to do is:
index_lookup = { tokens[i] : i for i in range(len(tokens)) }
Voilà! We have a map between token texts and unique token IDs.
That’s not all, though. We need to deal with four special tokens, so we reserve indices 0–3 for them:
# special token indices
UNK_IDX = 0
PAD_IDX = 1
SOS_IDX = 2
EOS_IDX = 3

UNK = '<unk>'  # Unknown
PAD = '<pad>'  # Padding
SOS = '<sos>'  # Start of sentence
EOS = '<eos>'  # End of sentence

SPECIAL_TOKENS = [UNK, PAD, SOS, EOS]
You can read more details about the special tokens in this article.
So, we combine the special tokens with the list of unique token texts and build a map of token texts and token IDs:
tokens = SPECIAL_TOKENS + tokens
index_lookup = { tokens[i] : i for i in range(len(tokens)) }
We can look up a token index by a token text as follows:
if token in index_lookup:
    token_index = index_lookup[token]
else:
    token_index = UNK_IDX
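As a side note, the same fallback can be written as a one-liner with the dictionary's get method:
token_index = index_lookup.get(token, UNK_IDX)  # returns UNK_IDX when the token is unknown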
So, I put everything together into my Vocab class:
import spacy
from typing import List
# special token indices
UNK_IDX = 0
PAD_IDX = 1
SOS_IDX = 2
EOS_IDX = 3

UNK = '<unk>'  # Unknown
PAD = '<pad>'  # Padding
SOS = '<sos>'  # Start of sentence
EOS = '<eos>'  # End of sentence

SPECIAL_TOKENS = [UNK, PAD, SOS, EOS]


class Vocab:
    def __init__(self, tokenizer: spacy.language.Language, tokens: List[str] = []) -> None:
        self.tokenizer = tokenizer
        self.tokens = SPECIAL_TOKENS + tokens
        self.index_lookup = {self.tokens[i]: i for i in range(len(self.tokens))}

    def __len__(self) -> int:
        return len(self.tokens)  # vocab size

    def __call__(self, text: str) -> List[int]:
        text = text.strip()
        return [self.to_index(token.text) for token in self.tokenizer(text)]

    def to_index(self, token: str) -> int:
        return self.index_lookup[token] if token in self.index_lookup else UNK_IDX
Now, we can convert a sentence text into a list of integers as follows:
vocab = Vocab(tokenizer, tokens)
token_indices = vocab('Hello, world!')
print(token_indices)
# output: [5599, 15, 1861, 1228]
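Any token that is not in the vocabulary falls back to UNK_IDX. For example, with a made-up out-of-vocabulary word (hypothetical output):
print(vocab('qwertyzx'))
# output: [0]   (UNK_IDX, assuming 'qwertyzx' is not in the Multi30k vocabulary)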
4 How To Set Up A DataLoader
I used PyTorch’s DataLoader and collate_fn to encapsulate the tokenization and token-index processing details, so it’s easy to use for training.
The idea of collate_fn is simple. It’s a function that converts a batch of raw data into tensors. A batch is a list of source (German) and target (English) sentence pairs:
def collate_fn(batch: List[Tuple[str, str]]):
    # ... convert text data into tensors ...
    return  # ... tensors ...
Once we have collate_fn defined, we can give it to DataLoader as follows:
loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
Inside the collate_fn function, we tokenize the sentence pairs from batch. We prepend SOS_IDX and append EOS_IDX to the target sentences. Finally, we convert the token indices into Tensors and keep them in a list.
from torch import Tensor
source_tokens_list = []
target_tokens_list = []
for i, (source_sentence, target_sentence) in enumerate(batch):
    # Tokenization
    source_tokens = source_vocab(source_sentence)
    target_tokens = target_vocab(target_sentence)

    target_tokens = [SOS_IDX] + target_tokens + [EOS_IDX]

    source_tokens_list.append( Tensor(source_tokens) )
    target_tokens_list.append( Tensor(target_tokens) )
Each sentence has a different number of tokens. So, we use pad_sequence to pad each token sequence up to the max sequence length (the longest sequence within the current batch):
from torch.nn.utils.rnn import pad_sequence
source_batch = pad_sequence(source_tokens_list,
                            padding_value=PAD_IDX,
                            batch_first=True)
target_batch = pad_sequence(target_tokens_list,
                            padding_value=PAD_IDX,
                            batch_first=True)
padding_value=PAD_IDX means we use PAD_IDX to pad shorter token ID sequences. As PAD_IDX is 1, we are appending 1s to them. batch_first=True means we want the shape to have the batch dimension first: (batch_size, max_sequence_length) instead of the default shape (max_sequence_length, batch_size), which I feel is unintuitive.
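To make the padding behavior concrete, here is a tiny sketch with made-up token values:
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # as defined above

a = torch.tensor([5., 6., 7.])   # made-up token IDs, length 3
b = torch.tensor([8., 9.])       # made-up token IDs, length 2
padded = pad_sequence([a, b], padding_value=PAD_IDX, batch_first=True)

print(padded.shape)  # torch.Size([2, 3]) -> (batch_size, max_sequence_length)
print(padded)
# tensor([[5., 6., 7.],
#         [8., 9., 1.]])   <- the shorter sequence is padded with 1s (PAD_IDX)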
For details of pad_sequence, please refer to the PyTorch documentation.
We split the target batch into two batches:
- Inputs to the decoder (each input starts with SOS_IDX)
- Labels for loss calculation (each label ends with EOS_IDX)
label_batch  = target_batch[:, 1:]    # ..., EOS_IDX
target_batch = target_batch[:, :-1]   # SOS_IDX, ...
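As a made-up illustration with the special token IDs defined above (SOS_IDX=2, EOS_IDX=3, PAD_IDX=1) and arbitrary word IDs:
# one padded target row : [2, 17, 42, 3, 1]   # <sos> w1 w2 <eos> <pad>
# decoder input row     : [2, 17, 42, 3]      # target_batch[:, :-1]
# label row             : [17, 42, 3, 1]      # target_batch[:, 1:]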
Then, we create a source mask and target mask:
source_mask, target_mask = create_masks(source_batch, target_batch)
For the details of create_masks, please look at this article.
At the end of collate_fn, we move all batches and masks to the target device:
....

all_batches = [ source_batch,
                target_batch,
                label_batch,
                source_mask,
                target_mask ]

# move everything to the target device
return [x.to(device) for x in all_batches]
I created a make_dataloader function to build a DataLoader given a dataset and a pair of Vocab objects. The collate_fn is defined within make_dataloader so that it can access all the input parameters:
def make_dataloader(
        dataset      : IterableDataset,
        source_vocab : Vocab,
        target_vocab : Vocab,
        batch_size   : int,
        device       : torch.device) -> DataLoader:

    def collate_fn(batch: List[Tuple[str, str]]):
        # ... all the above details ...
        ...

    return DataLoader( dataset,
                       batch_size = batch_size,
                       collate_fn = collate_fn )
At the end, make_dataloader returns a DataLoader with the collate_fn specified.
The data loader makes it easy to write a training loop as follows:
# Training parameters
epochs = 10
batch_size = 32
device = torch.device('cuda:0')

# Vocab pair
source_vocab = Vocab(de_tokenizer, de_tokens)
target_vocab = Vocab(en_tokenizer, en_tokens)

# Transformer
model = Transformer(....)

# Loss function
loss_func = ...

for epoch in range(epochs):
    dataset = load_dataset('Multi30k', 'train', ('de', 'en'))
    loader = make_dataloader(dataset,
                             source_vocab,
                             target_vocab,
                             batch_size,
                             device)

    for source, target, label, source_mask, target_mask in loader:
        logits = model(source, target, source_mask, target_mask)
        loss = loss_func(logits, label)

        ... # back-prop etc ...
5 References
- The Annotated Transformer (Harvard NLP)
- Language Modeling with nn.Transformer and Torchtext (PyTorch)