Pre-trained BERT

BERT is short for Bidirectional Encoder Representations from Transformers. It uses only the encoder of the Transformer, read bidirectionally; the decoder is not used, because it would be able to see the very tokens it has to predict. The main innovation of the model is its pre-training method, which uses a Masked Language Model and Next Sentence Prediction to capture word-level and sentence-level representations, respectively. This allows us to take a pre-trained BERT model and fine-tune it on downstream tasks such as sentiment classification, intent detection, question answering, and more.

Pre-training Task 1: Masked Language Model
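In the Masked Language Model task, roughly 15% of the input tokens are selected at random; of those, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model is then trained to predict the original token at each selected position. The sketch below illustrates this masking strategy; it is a simplified illustration rather than the official pre-processing code, and the helper name mask_tokens is my own.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking sketch: select ~15% of positions, then
    replace 80% with [MASK], 10% with a random token, keep 10%."""
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = position is not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok            # the model must recover this original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the token as-is
    return masked, labels

tokens = ["the", "man", "went", "to", "the", "store"]
print(mask_tokens(tokens, vocab=["dog", "ran", "apple"]))
```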

Pre-training Task 2: Next Sentence Prediction
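Next Sentence Prediction trains the model on sentence pairs: 50% of the time sentence B is the sentence that actually follows sentence A in the corpus (label IsNext), and 50% of the time B is a random sentence from elsewhere (label NotNext). A rough sketch of how such training pairs could be built follows; it is my own illustration, not the original pre-processing code.

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """NSP example construction sketch: use the true next sentence half the
    time, and a random sentence from the corpus the other half."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"
    return sent_a, sent_b, label
```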

Pre-trained Models

Google Research recently open-sourced its implementation of BERT and also released the following pre-trained models:

- BERT-Base, Uncased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters
- BERT-Large, Uncased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
- BERT-Base, Cased: 12 layers, 768 hidden units, 12 attention heads, 110M parameters
- BERT-Large, Cased: 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
- BERT-Base, Multilingual Cased: 104 languages, 12 layers, 768 hidden units, 12 attention heads, 110M parameters
- BERT-Base, Chinese: 12 layers, 768 hidden units, 12 attention heads, 110M parameters

In the following example, I will use the bert-base-uncased pre-trained model.
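As a quick sketch of how the model and its vocabulary can be loaded in PyTorch, here I use the Hugging Face transformers package; this is an assumption about tooling on my part, and the older pytorch-pretrained-bert package exposes a very similar API.

```python
import torch
from transformers import BertModel, BertTokenizer

# Download (or load from the local cache) the pre-trained weights and vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode for now; fine-tuning comes later
```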

Embedding

In BERT, the input embedding is the sum of three types of embeddings:

Embedding = Token Embeddings + Segment Embeddings + Position Embeddings

where:

- Token Embeddings are the WordPiece embedding vectors for each token;
- Segment Embeddings indicate which sentence (A or B) each token belongs to, so that sentence pairs can be told apart;
- Position Embeddings encode the position of each token in the sequence, since self-attention by itself is order-agnostic.

We have to convert the input into these features before it can be fed to BERT.
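A minimal sketch of that summation, using stand-in embedding tables with BERT-Base dimensions (hidden size 768, maximum sequence length 512, two segment ids) rather than BERT's real weights:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, max_len = 768, 30522, 512

word_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_len, hidden_size)
token_type_embeddings = nn.Embedding(2, hidden_size)

input_ids = torch.tensor([[101, 7592, 2088, 102]])            # toy token ids
token_type_ids = torch.zeros_like(input_ids)                  # all sentence A
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0, 1, 2, ...

# BERT's input representation is the element-wise sum of the three embeddings;
# inside BertEmbeddings this sum is followed by LayerNorm and dropout.
embedding = (word_embeddings(input_ids)
             + position_embeddings(position_ids)
             + token_type_embeddings(token_type_ids))
print(embedding.shape)  # torch.Size([1, 4, 768])
```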

Tokenization

BERT-Base, Uncased uses a vocabulary of 30,522 tokens. Tokenization splits the input text into a list of tokens that are present in this vocabulary. To deal with words that are not in the vocabulary, BERT uses BPE-based WordPiece tokenization, which breaks an unknown word into the subword units it does know.
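For example, with the tokenizer loaded above (the exact subword splits depend on the vocabulary, so treat the output as illustrative):

```python
text = "Here is the sentence I want embeddings for."
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']

# Each token is then mapped to its index in the 30,522-entry vocabulary.
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```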

Model Architecture

Here I use a pre-trained BERT model for binary sentiment analysis on the Stanford Sentiment Treebank (SST-2). A dropout layer and a linear classifier are placed on top of the pooled output of the [CLS] token, which gives the following structure:

- BertModel
    - embeddings: BertEmbeddings
        - word_embeddings: Embedding(30522, 768)
        - position_embeddings: Embedding(512, 768)
        - token_type_embeddings: Embedding(2, 768)
        - LayerNorm: FusedLayerNorm(torch.Size([768]))
        - dropout: Dropout(p=0.1)
    - encoder: BertEncoder
        - BertLayer (one of 12 identical layers)
            - attention: BertAttention
                - self: BertSelfAttention
                    - query: Linear(in_features=768, out_features=768, bias=True)
                    - key: Linear(in_features=768, out_features=768, bias=True)
                    - value: Linear(in_features=768, out_features=768, bias=True)
                    - dropout: Dropout(p=0.1)
                - output: BertSelfOutput
                    - dense: Linear(in_features=768, out_features=768, bias=True)
                    - LayerNorm: FusedLayerNorm(torch.Size([768]))
                    - dropout: Dropout(p=0.1)
            - intermediate: BertIntermediate
                - dense: Linear(in_features=768, out_features=3072, bias=True)
            - output: BertOutput
                - dense: Linear(in_features=3072, out_features=768, bias=True)
                - LayerNorm: FusedLayerNorm(torch.Size([768]))
                - dropout: Dropout(p=0.1)
    - pooler: BertPooler
        - dense: Linear(in_features=768, out_features=768, bias=True)
        - activation: Tanh()
- dropout: Dropout(p=0.1)
- classifier: Linear(in_features=768, out_features=2, bias=True)
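A rough sketch of how such a classifier can be fine-tuned end-to-end; I use BertForSequenceClassification from the transformers package as a stand-in for the architecture above, and the sentences, labels, and learning rate are placeholders rather than the actual SST-2 training setup.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A tiny toy batch: two SST-style sentences with binary sentiment labels.
batch = tokenizer(["a touching and funny film", "a dull, lifeless exercise"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```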