How to use BERT?
Using BERT for a specific task is relatively simple:
BERT can be used for a wide range of language tasks by adding only a small layer on top of the core model. Classification tasks such as sentiment analysis are handled similarly to next-sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
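As a rough sketch of what this looks like in code, the following assumes PyTorch and the Hugging Face transformers library; the classifier head and the two-label setup are illustrative choices, not part of BERT itself.

```python
# Minimal sketch: a linear classification layer over the Transformer output
# for the [CLS] token. The 2-class head is an assumption for sentiment analysis.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # e.g. positive / negative

inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = bert(**inputs)
cls_vector = outputs.last_hidden_state[:, 0]  # hidden state of the [CLS] token
logits = classifier(cls_vector)               # fine-tune BERT and the head jointly
```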
In question answering tasks (e.g. SQuAD v1.1), the software is given a question about a text sequence and is required to mark the answer in the sequence. Using BERT, a question-answering model can be trained by learning two extra vectors that mark the start and end of the answer.
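A minimal sketch of those two extra vectors, again assuming PyTorch; the names start_vector and end_vector and the random token outputs are illustrative placeholders, not BERT's actual fine-tuning code.

```python
# Sketch: two learned vectors score every token as a possible answer start or end.
import torch

hidden_size, seq_len = 768, 128
token_outputs = torch.randn(seq_len, hidden_size)  # stand-in for BERT's per-token output

start_vector = torch.nn.Parameter(torch.randn(hidden_size))  # learned during fine-tuning
end_vector = torch.nn.Parameter(torch.randn(hidden_size))

start_logits = token_outputs @ start_vector  # score of each token as the answer start
end_logits = token_outputs @ end_vector      # score of each token as the answer end
answer_start = start_logits.argmax()
answer_end = end_logits.argmax()             # predicted answer span in the passage
```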
In Named Entity Recognition (NER), the software is given a text sequence and is required to mark the various entity types (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
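A sketch of such a token-level classification head, assuming PyTorch; the label count and tensor shapes are illustrative assumptions rather than fixed by BERT.

```python
# Sketch: one classification layer applied to every token's output vector.
import torch

hidden_size, num_labels = 768, 9                  # e.g. BIO tags for Person/Org/Date/...
token_outputs = torch.randn(1, 128, hidden_size)  # stand-in for BERT output: (batch, seq, hidden)

ner_head = torch.nn.Linear(hidden_size, num_labels)
label_logits = ner_head(token_outputs)            # one label distribution per token
predicted_labels = label_logits.argmax(dim=-1)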
The language model in BERT is trained by predicting 15% of the tokens in the input, chosen at random. These tokens are preprocessed as follows: 80% are replaced with a "[MASK]" token, 10% with a random word, and 10% keep the original word.
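The following is a minimal, simplified sketch of that 80/10/10 scheme in plain Python; the function name and vocabulary handling are assumptions, not the authors' preprocessing code.

```python
# Sketch of the masking scheme: ~15% of tokens are selected for prediction,
# then 80% of those become [MASK], 10% become a random word, 10% stay unchanged.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked = list(tokens)
    for i in range(len(masked)):
        if random.random() < mask_prob:           # pick ~15% of positions to predict
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with the [MASK] token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: keep the original word
    return masked
```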
The intuition that led the authors to choose this approach is the following:
If we used [MASK] 100% of the time, the model would not necessarily produce good token representations for unmasked words. The unmasked tokens would still be used for context, but the model would be optimized only for predicting masked words.
If we used [MASK] 90% of the time and random words 10% of the time, this would teach the model that the observed word is never correct.
If we used [MASK] 90% of the time and kept the same word 10% of the time, the model could trivially copy the non-contextual embedding.