Machine learning and ML.NET – NLP and BERT

Contents

1. Prerequisites

2. Understanding the Transformer architecture

3. BERT intuition

4. ONNX model

5. Implementation using ML.NET

5.1 Data model

5.2 Training

5.3 Predictor

5.4 Helpers and extensions

5.5 Tokenizer

5.6 BERT

5.7 Program

Conclusion

So far on our journey through ML.NET, we have focused on computer vision problems such as image classification and object detection. In this article, we change direction slightly and explore NLP (natural language processing) and the kinds of problems we can solve with machine learning.

Natural language processing (NLP) is a subfield of artificial intelligence. Its main goal is to help programs understand and process natural language data. The output of this process is a computer program that can "understand" language.

Back in 2018, Google published a paper on a deep neural network called Bidirectional Encoder Representations from Transformers, or BERT. Because of its simplicity, it has become one of the most popular NLP algorithms, and with it anyone can train a state-of-the-art question answering system (or a variety of other models) in just a few hours. In this article, we will do exactly that: build a question answering system using BERT.

BERT is a neural network based on the Transformer architecture. That is why, in this article, we will first explore that architecture and then dig deeper into BERT itself:

  1. Prerequisites
  2. Understanding the Transformer architecture
  3. BERT intuition
  4. ONNX model
  5. Implementation using ML.NET

1. Prerequisites

The implementation provided here is written in C#, and we use .NET 5, so please make sure you have that SDK installed. If you are using Visual Studio, .NET 5 comes with version 16.8.3. Also, make sure you have the following packages installed:

$ dotnet add package Microsoft.ML
$ dotnet add package Microsoft.ML.OnnxRuntime
$ dotnet add package Microsoft.ML.OnnxTransformer

You can do the same from the Package Manager Console:

Install-Package Microsoft.ML
Install-Package Microsoft.ML.OnnxRuntime
Install-Package Microsoft.ML.OnnxTransformer

You can do the same thing with the Manage NuGet Packages option in Visual Studio.

If you need to brush up on the basics of machine learning with ML.NET, check out this article.

2. Understanding the Transformer architecture

Language is sequential data. Basically, you can think of it as a stream of words, where the meaning of each word depends on the words that come before and after it. This is why language is hard for computers to understand: in order to understand a single word, you need its context.

In addition, sometimes you need to provide a sequence of data (words) as output as well. A good example is translating English into Serbian: the input to the algorithm is a sequence of words, and the output has to be a sequence too.

In this example, the algorithm needs to understand English and know how to map English words to Serbian words (which, in essence, means it must have some understanding of Serbian as well). Over the years, many deep learning architectures have been used for this purpose, such as recurrent neural networks and LSTMs. However, it is the Transformer architecture that changed everything.

RNN and LSTM networks could not fully meet these requirements, because they are difficult to train and prone to vanishing (and exploding) gradients. Transformers aim to solve these problems and bring better performance and a better understanding of language. They were introduced in 2017 in the landmark paper "Attention Is All You Need".

In short, they use an encoder-decoder structure and self-attention layers to understand language better. If we go back to the translation example, the encoder is responsible for understanding English, and the decoder is responsible for understanding Serbian and for mapping English to Serbian.

During training, the encoder is given word embeddings for English. Computers don't understand words; they understand numbers and matrices (sets of numbers). That is why we convert words into a vector space, which means we assign a certain vector to each word in the language (we map them into a latent vector space). These are word embeddings, and there are many available, such as Word2Vec.

However, the position of a word in the sentence also matters for context. This is what positional encoding is for, and it is how the encoder gets information about the words together with their context. The encoder's self-attention layer then determines the relationships between words and tells us how relevant each word in the sentence is. This is how the encoder understands English. The data then goes through a feed-forward neural network and on to the decoder's attention layer, which maps between the two languages.

Before that, however, the decoder obtains the same kind of information about Serbian. In the same way, it uses word embeddings, positional encoding and self-attention to learn how to understand Serbian. The decoder's mapping attention layer then has information about both English and Serbian, and it learns how to convert words from one language to the other. To learn more about Transformers, see this article.
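To make positional encoding a little more concrete, here is a small sketch (not part of the solution we build in this article) of the sinusoidal encoding used in the original "Attention Is All You Need" paper. Each position in the sequence is turned into a vector, which is added to the word embedding so that the encoder knows where each word sits in the sentence:

using System;

namespace TransformerSketches
{
    public static class PositionalEncoding
    {
        // Returns the sinusoidal positional encoding vector for a single position.
        public static float[] Encode(int position, int modelDimension)
        {
            var encoding = new float[modelDimension];

            for (int i = 0; i < modelDimension; i += 2)
            {
                // Each pair of dimensions uses a sine/cosine wave of a different frequency.
                double angle = position / Math.Pow(10000, (double)i / modelDimension);

                encoding[i] = (float)Math.Sin(angle);
                if (i + 1 < modelDimension)
                {
                    encoding[i + 1] = (float)Math.Cos(angle);
                }
            }

            return encoding;
        }
    }
}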

3. BERT intuition

BERT uses the Transformer architecture to understand language. More precisely, it uses only the encoder. The architecture achieved two milestones. First, it is bidirectional: each sentence is learned in both directions, so the context is learned better, both the preceding and the following context. BERT was the first bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus (Wikipedia). It is also one of the first pre-trained NLP models. We are familiar with transfer learning from computer vision, but before BERT this concept was not really popular in the NLP world.

This makes sense, because you can train the model on a large amount of data and, once it understands the language, fine-tune it for a more specific task. That is why BERT's training is split into two phases: pre-training and fine-tuning.

In order to achieve bidirectionality, BERT uses two pre-training methods:

  • Masked language modeling - MLM
  • Next sentence prediction - NSP

Masked language modeling uses masked input. This means that some words in a sentence are hidden, and filling in those blanks is BERT's job; for example, given "Paris is the [MASK] of France", BERT should predict "capital". Next sentence prediction gives BERT two sentences as input and expects it to predict whether the second sentence follows the first. In practice, both methods happen at the same time.

In the fine-tuning phase, we train BERT for a specific task. This means that if we want to create a question answering solution, we only need to train BERT's additional layers, and that is exactly what we do in this tutorial. All we have to do is replace the output layers of the network with a new set of layers designed for our specific purpose. As input, we have a piece of text (the context) and a question, and as output we expect to get the answer to that question.

For example, given the two inputs "Jim is walking through the woods." (the paragraph, or context) and "What is his name?" (the question), our system should provide the answer "Jim".

4. ONNX model

Before we dive into the implementation using ML.NET, we need to introduce one more theoretical topic: the Open Neural Network Exchange (ONNX) file format. It is an open-source format for AI models and supports interoperability between frameworks.

Basically, you can train a model in a machine learning framework like PyTorch, save it and convert it to the ONNX format. You can then consume the ONNX model in a different framework, such as ML.NET. That is exactly what we do in this tutorial. You can find more information on the ONNX website.

In this tutorial, we use a pre-trained BERT model, namely BERT-Squad. Essentially, we import this model into ML.NET and run it inside our application.

One very interesting and useful thing about ONNX models is that there are many tools we can use to visualize them. This comes in handy when, as in this tutorial, we work with a pre-trained model.

We often need to know the names of the model's input and output layers, and a visualization tool is perfect for that. So, once we download the BERT model, we can open it in one of these tools. In this guide, we use Netron.

I know, it looks crazy: BERT is a big model. You may be wondering how we are going to use it and why we need all of this. The point is that, in order to use an ONNX model, we usually need to know the names of its input and output layers, and that is what we use Netron to find for BERT.
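If you prefer to do this programmatically, the Microsoft.ML.OnnxRuntime package we already installed can list the layer names and shapes for us. Here is a minimal sketch (the class and method names are my own, and the model file is the one we use later in this article):

using System;
using Microsoft.ML.OnnxRuntime;

public static class OnnxModelInspector
{
    // Prints the names and shapes of the model's input and output layers.
    public static void PrintLayers(string modelPath)
    {
        using var session = new InferenceSession(modelPath);

        Console.WriteLine("Inputs:");
        foreach (var input in session.InputMetadata)
        {
            Console.WriteLine($"  {input.Key} : {string.Join("x", input.Value.Dimensions)}");
        }

        Console.WriteLine("Outputs:");
        foreach (var output in session.OutputMetadata)
        {
            Console.WriteLine($"  {output.Key} : {string.Join("x", output.Value.Dimensions)}");
        }
    }
}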

5. Implementation using ML.NET

If you look at the dependencies section of the BERT-Squad repository we downloaded the model from, you will notice something interesting: a tokenization.py dependency. This means that we need to perform tokenization ourselves. Tokenization is the process of splitting a text sample into words (tokens). This is a requirement for natural language processing tasks, in which each word needs to be captured and analyzed further. There are many ways to do it.

More precisely, we perform word-piece tokenization, as described here, using a version ported from tokenization.py. To implement the complete solution, we structure it as follows:

In the Assets folder, you can find the downloaded ONNX model and a folder containing the model's vocabulary. The MachineLearning folder contains all the code we use in this application: the Trainer and Predictor classes, along with the data model classes. In a separate folder, we keep the helper class used for loading files, as well as the extension classes that implement Softmax over enumerable types and string splitting.

This solution is inspired by the implementation by Gjeran Vlot, which can be found here.

5.1 Data model

You may have noticed that the DataModel folder holds two classes, one for BERT's input and one for its predictions. The BertInput class represents the input; its properties have the same names and sizes as the corresponding layers of the model:

using Microsoft.ML.Data;

namespace BertMlNet.MachineLearning.DataModel
{
    public class BertInput
    {
        [VectorType(1)]
        [ColumnName("unique_ids_raw_output___9:0")]
        public long[] UniqueIds { get; set; }

        [VectorType(1, 256)]
        [ColumnName("segment_ids:0")]
        public long[] SegmentIds { get; set; }

        [VectorType(1, 256)]
        [ColumnName("input_mask:0")]
        public long[] InputMask { get; set; }

        [VectorType(1, 256)]
        [ColumnName("input_ids:0")]
        public long[] InputIds { get; set; }
    }
}

The BertPredictions class maps BERT's output layers:

using Microsoft.ML.Data;

namespace BertMlNet.MachineLearning.DataModel
{
    public class BertPredictions
    {
        [VectorType(1, 256)]
        [ColumnName("unstack:1")]
        public float[] EndLogits { get; set; }

        [VectorType(1, 256)]
        [ColumnName("unstack:0")]
        public float[] StartLogits { get; set; }

        [VectorType(1)]
        [ColumnName("unique_ids:0")]
        public long[] UniqueIds { get; set; }
    }
}

5.2 Training

The Trainer class is quite simple. It has only one method, BuildAndTrain, which takes the path to the pre-trained model and a flag indicating whether to run on the GPU.

using BertMlNet.MachineLearning.DataModel;
using Microsoft.ML;
using System.Collections.Generic;

namespace BertMlNet.MachineLearning
{
    public class Trainer
    {
        private readonly MLContext _mlContext;

        public Trainer()
        {
            _mlContext = new MLContext(11);
        }

        public ITransformer BuildAndTrain(string bertModelPath, bool useGpu)
        {
            var pipeline = _mlContext.Transforms
                            .ApplyOnnxModel(modelFile: bertModelPath, 
                                            outputColumnNames: new[] { "unstack:1", 
                                                                       "unstack:0", 
                                                                       "unique_ids:0" }, 
                                            inputColumnNames: new[] {"unique_ids_raw_output___9:0",
                                                                      "segment_ids:0", 
                                                                      "input_mask:0", 
                                                                      "input_ids:0" }, 
                                            gpuDeviceId: useGpu ? 0 : (int?)null);

            return pipeline.Fit(_mlContext.Data.LoadFromEnumerable(new List<BertInput>()));
        }
    }
}

In the method above, we build the pipeline. Here we apply the ONNX model and connect our data model classes to the layers of the BERT ONNX model. Note the flag we can use to run this model on either the CPU or the GPU. Finally, we fit the pipeline to an empty list of data. We do this so ML.NET can pick up the data schema, i.e. so it can load the model.
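As a quick usage sketch (the same calls appear later in the Bert class), the model is built once at start-up and then handed to the Predictor described in the next section:

using BertMlNet.MachineLearning;
using Microsoft.ML;

// Build the ONNX pipeline once and reuse the loaded model for all predictions.
var trainer = new Trainer();
ITransformer trainedModel = trainer.BuildAndTrain("..\\BertMlNet\\Assets\\Model\\bertsquad-10.onnx", useGpu: false);

var predictor = new Predictor(trainedModel);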

5.3 Predictor

The Predictor class is even simpler. It receives the trained and loaded model and creates a prediction engine. It then uses that prediction engine to make predictions on new input.

using BertMlNet.MachineLearning.DataModel;
using Microsoft.ML;

namespace BertMlNet.MachineLearning
{
    public class Predictor
    {
        private MLContext _mLContext;
        private PredictionEngine<BertInput, BertPredictions> _predictionEngine;

        public Predictor(ITransformer trainedModel)
        {
            _mLContext = new MLContext();
            _predictionEngine = _mLContext.Model
                                          .CreatePredictionEngine<BertInput, BertPredictions>(trainedModel);
        }

        public BertPredictions Predict(BertInput encodedInput)
        {
            return _predictionEngine.Predict(encodedInput);
        }
    }
}

5.4 Helpers and extensions

There is one helper class and two extension classes. The helper class, FileReader, has a single method for reading a text file. We use it later to load the vocabulary from a file. It is straightforward:

using System.Collections.Generic;
using System.IO;

namespace BertMlNet.Helpers
{
    public static class FileReader
    {
        public static List<string> ReadFile(string filename)
        {
            var result = new List<string>();

            using (var reader = new StreamReader(filename))
            {
                string line;

                while ((line = reader.ReadLine()) != null)
                {
                    if (!string.IsNullOrWhiteSpace(line))
                    {
                        result.Add(line);
                    }
                }
            }

            return result;
        }
    }
}

There are two extension classes: one performs the Softmax operation over a collection of elements, and the other splits a string while keeping the delimiters, yielding the pieces one at a time.

using System;
using System.Collections.Generic;
using System.Linq;

namespace BertMlNet.Extensions
{
    public static class SoftmaxEnumerableExtension
    {
        public static IEnumerable<(T Item, float Probability)> Softmax<T>(
                                            this IEnumerable<T> collection, 
                                            Func<T, float> scoreSelector)
        {
            var maxScore = collection.Max(scoreSelector);
            var sum = collection.Sum(r => Math.Exp(scoreSelector(r) - maxScore));

          return collection.Select(r => (r, (float)(Math.Exp(scoreSelector(r) - maxScore) / sum)));
        }
    }
}

using System.Collections.Generic;

namespace BertMlNet.Extensions
{
    static class StringExtension
    {
        public static IEnumerable<string> SplitAndKeep(
          					this string inputString, params char[] delimiters)
        {
            int start = 0, index;

            while ((index = inputString.IndexOfAny(delimiters, start)) != -1)
            {
                if (index - start > 0)
                    yield return inputString.Substring(start, index - start);

                yield return inputString.Substring(index, 1);

                start = index + 1;
            }

            if (start < inputString.Length)
            {
                yield return inputString.Substring(start);
            }
        }
    }
}
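To make the behavior of these two extensions concrete, here is a small illustrative sketch (the class is hypothetical and not part of the solution itself):

using System;
using BertMlNet.Extensions;

static class ExtensionsDemo
{
    public static void Run()
    {
        // SplitAndKeep splits on the given delimiters but keeps them as separate items.
        foreach (var part in "what's his name?".SplitAndKeep('\'', '?'))
        {
            Console.WriteLine(part);    // "what", "'", "s his name", "?"
        }

        // Softmax turns raw scores into probabilities that sum to 1.
        var scores = new[] { (Name: "a", Score: 1.0f), (Name: "b", Score: 2.0f) };
        foreach (var (item, probability) in scores.Softmax(s => s.Score))
        {
            Console.WriteLine($"{item.Name}: {probability:F2}");    // a: 0.27, b: 0.73
        }
    }
}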

5.5 Tokenizer

OK, so far we have explored the simpler parts of the solution. Let's move on to the more complex and more important part and see how tokenization is implemented. First, we define the special BERT tokens. For example, two sentences should always be separated with the [SEP] token to distinguish them, and the [CLS] token always appears at the beginning of the text and is specific to classification tasks.

namespace BertMlNet.Tokenizers
{
    public class Tokens
    {
        public const string Padding = "";
        public const string Unknown = "[UNK]";
        public const string Classification = "[CLS]";
        public const string Separation = "[SEP]";
        public const string Mask = "[MASK]";
    }
}

The tokenization process is implemented in the Tokenizer class. It has two public methods: Tokenize and Untokenize. The first one splits the received text into sentences and then converts each word into one or more tokens. Note that a single word may be represented by several tokens.

For example, the word "embeddings" is represented as the token array ['em', '##bed', '##ding', '##s']. The word has been split into smaller sub-words and characters. The two hash signs in front of some of those sub-words are just the tokenizer's way of indicating that the sub-word is part of a larger word and is preceded by another sub-word.

So, for example, the '##bed' token is distinct from the 'bed' token. The Tokenize method also returns each token's vocabulary index and segment index; both are used as BERT inputs. To learn more about why, check out this article.
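Before looking at the implementation, here is a rough usage sketch showing what the Tokenizer produces for our running example (the actual vocabulary indices depend on the model's vocab.txt, so they are only described in the comments):

using BertMlNet.Helpers;
using BertMlNet.Tokenizers;

var vocabulary = FileReader.ReadFile("..\\BertMlNet\\Assets\\Vocabulary\\vocab.txt");
var tokenizer = new Tokenizer(vocabulary);

// Question first, context second - the same order the Bert class uses later.
var tokens = tokenizer.Tokenize("What is his name?", "Jim is walking through the woods.");

// Roughly: [CLS] what is his name ? [SEP] jim is walking through the woods . [SEP]
// Each entry carries the token itself, its vocabulary index and a segment index
// (0 for the question part, 1 for the context part).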

using BertMlNet.Extensions;
using System;
using System.Collections.Generic;
using System.Linq;

namespace BertMlNet.Tokenizers
{
    public class Tokenizer
    {
        private readonly List<string> _vocabulary;

        public Tokenizer(List<string> vocabulary)
        {
            _vocabulary = vocabulary;
        }

        public List<(string Token, int VocabularyIndex, long SegmentIndex)> Tokenize(params string[] texts)
        {
            IEnumerable<string> tokens = new string[] { Tokens.Classification };

            foreach (var text in texts)
            {
                tokens = tokens.Concat(TokenizeSentence(text));
                tokens = tokens.Concat(new string[] { Tokens.Separation });
            }

            var tokenAndIndex = tokens
                .SelectMany(TokenizeSubwords)
                .ToList();

            var segmentIndexes = SegmentIndex(tokenAndIndex);

            return tokenAndIndex.Zip(segmentIndexes, (tokenindex, segmentindex) 
                                => (tokenindex.Token, tokenindex.VocabularyIndex, segmentindex)).ToList();
        }

        public List<string> Untokenize(List<string> tokens)
        {
            var currentToken = string.Empty;
            var untokens = new List<string>();
            tokens.Reverse();

            tokens.ForEach(token =>
            {
                if (token.StartsWith("##"))
                {
                    currentToken = token.Replace("##", "") + currentToken;
                }
                else
                {
                    currentToken = token + currentToken;
                    untokens.Add(currentToken);
                    currentToken = string.Empty;
                }
            });

            untokens.Reverse();

            return untokens;
        }

        public IEnumerable<long> SegmentIndex(List<(string token, int index)> tokens)
        {
            var segmentIndex = 0;
            var segmentIndexes = new List<long>();

            foreach (var (token, index) in tokens)
            {
                segmentIndexes.Add(segmentIndex);

                if (token == Tokens.Separation)
                {
                    segmentIndex++;
                }
            }

            return segmentIndexes;
        }

        private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)
        {
            if (_vocabulary.Contains(word))
            {
                return new (string, int)[] { (word, _vocabulary.IndexOf(word)) };
            }

            var tokens = new List<(string, int)>();
            var remaining = word;

            while (!string.IsNullOrEmpty(remaining) && remaining.Length > 2)
            {
                var prefix = _vocabulary.Where(remaining.StartsWith)
                    .OrderByDescending(o => o.Count())
                    .FirstOrDefault();

                if (prefix == null)
                {
                    tokens.Add((Tokens.Unknown, _vocabulary.IndexOf(Tokens.Unknown)));

                    return tokens;
                }

                remaining = remaining.Replace(prefix, "##");

                tokens.Add((prefix, _vocabulary.IndexOf(prefix)));
            }

            if (!string.IsNullOrWhiteSpace(word) && !tokens.Any())
            {
                tokens.Add((Tokens.Unknown, _vocabulary.IndexOf(Tokens.Unknown)));
            }

            return tokens;
        }

        private IEnumerable<string> TokenizeSentence(string text)
        {
            // remove spaces and split the , . : ; etc..
            return text.Split(new string[] { " ", "   ", "\r\n" }, StringSplitOptions.None)
                .SelectMany(o => o.SplitAndKeep(".,;:\\/?!#$%()=+-*\"'–_`<>&^@{}[]|~'".ToArray()))
                .Select(o => o.ToLower());
        }
    }
}

The other public method is Untokenize. It is used to reverse the process: as BERT's output we get a sequence of tokens, and the goal of this method is to turn that information back into meaningful words.

The class has a few more methods that support this process.
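For example, here is a minimal sketch of Untokenize merging word pieces back together, using the "embeddings" example from above (tokenizer is the instance from the earlier sketch):

using System.Collections.Generic;

var pieces = new List<string> { "em", "##bed", "##ding", "##s" };
var words = tokenizer.Untokenize(pieces);

// words now contains a single entry: "embeddings"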

5.6 BERT

The Bert class puts all of these pieces together. In the constructor, we read the vocabulary file and instantiate the Trainer, Tokenizer and Predictor objects. There is only one public method, Predict. It receives a context and a question and, as output, returns the answer along with its probability:

using BertMlNet.Extensions;
using BertMlNet.Helpers;
using BertMlNet.MachineLearning;
using BertMlNet.MachineLearning.DataModel;
using BertMlNet.Tokenizers;
using System.Collections.Generic;
using System.Linq;

namespace BertMlNet
{
    public class Bert
    {
        private List<string> _vocabulary;

        private readonly Tokenizer _tokenizer;
        private Predictor _predictor;

        public Bert(string vocabularyFilePath, string bertModelPath)
        {
            _vocabulary = FileReader.ReadFile(vocabularyFilePath);
            _tokenizer = new Tokenizer(_vocabulary);

            var trainer = new Trainer();
            var trainedModel = trainer.BuildAndTrain(bertModelPath, false);
            _predictor = new Predictor(trainedModel);
        }

        public (List<string> tokens, float probability) Predict(string context, string question)
        {
            var tokens = _tokenizer.Tokenize(question, context);
            var input = BuildInput(tokens);

            var predictions = _predictor.Predict(input);

            var contextStart = tokens.FindIndex(o => o.Token == Tokens.Separation);

            var (startIndex, endIndex, probability) = GetBestPrediction(predictions, contextStart, 20, 30);

            var predictedTokens = input.InputIds
                .Skip(startIndex)
                .Take(endIndex + 1 - startIndex)
                .Select(o => _vocabulary[(int)o])
                .ToList();

            var connectedTokens = _tokenizer.Untokenize(predictedTokens);

            return (connectedTokens, probability);
        }

        private BertInput BuildInput(List<(string Token, int Index, long SegmentIndex)> tokens)
        {
            var padding = Enumerable.Repeat(0L, 256 - tokens.Count).ToList();

            var tokenIndexes = tokens.Select(token => (long)token.Index).Concat(padding).ToArray();
            var segmentIndexes = tokens.Select(token => token.SegmentIndex).Concat(padding).ToArray();
            var inputMask = tokens.Select(o => 1L).Concat(padding).ToArray();

            return new BertInput()
            {
                InputIds = tokenIndexes,
                SegmentIds = segmentIndexes,
                InputMask = inputMask,
                UniqueIds = new long[] { 0 }
            };
        }

        private (int StartIndex, int EndIndex, float Probability) GetBestPrediction(BertPredictions result, int minIndex, int topN, int maxLength)
        {
            var bestStartLogits = result.StartLogits
                .Select((logit, index) => (Logit: logit, Index: index))
                .OrderByDescending(o => o.Logit)
                .Take(topN);

            var bestEndLogits = result.EndLogits
                .Select((logit, index) => (Logit: logit, Index: index))
                .OrderByDescending(o => o.Logit)
                .Take(topN);

            var bestResultsWithScore = bestStartLogits
                .SelectMany(startLogit =>
                    bestEndLogits
                    .Select(endLogit =>
                        (
                            StartLogit: startLogit.Index,
                            EndLogit: endLogit.Index,
                            Score: startLogit.Logit + endLogit.Logit
                        )
                     )
                )
                .Where(entry => !(entry.EndLogit < entry.StartLogit || entry.EndLogit - entry.StartLogit > maxLength || entry.StartLogit == 0 && entry.EndLogit == 0 || entry.StartLogit < minIndex))
                .Take(topN);

            var (item, probability) = bestResultsWithScore
                .Softmax(o => o.Score)
                .OrderByDescending(o => o.Probability)
                .FirstOrDefault();

            return (StartIndex: item.StartLogit, EndIndex: item.EndLogit, probability);
        }
    }
}

The Predict method carries out several steps. Let's explore them in more detail.

public (List<string> tokens, float probability) Predict(string context, string question)
        {
            var tokens = _tokenizer.Tokenize(question, context);
            var input = BuildInput(tokens);

            var predictions = _predictor.Predict(input);

            var contextStart = tokens.FindIndex(o => o.Token == Tokens.Separation);

            var (startIndex, endIndex, probability) = GetBestPrediction(predictions,
                                                                        contextStart,
                                                                        20,
                                                                        30);

            var predictedTokens = input.InputIds
                .Skip(startIndex)
                .Take(endIndex + 1 - startIndex)
                .Select(o => _vocabulary[(int)o])
                .ToList();

            var connectedTokens = _tokenizer.Untokenize(predictedTokens);

            return (connectedTokens, probability);
        }

First, the method tokenizes the question and the passed context (the text in which BERT should find the answer). Based on this information we build the BertInput; that happens in the BuildInput helper method. Basically, all the tokenized information is padded so it can be used as BERT input and is then used to initialize a BertInput object.

Then we get the model's predictions from the Predictor. This output is processed further to find the best prediction within the context: BERT picks the tokens from the context that are most likely to be the answer, and we select the best-scoring span. Finally, those tokens are untokenized back into words.
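To make the BuildInput step a bit more concrete, here is a small sketch of the padding logic it applies (the vocabulary indices below are made up for illustration):

using System.Linq;

// Every input vector must be exactly 256 values long, the sequence length the ONNX model expects.
const int sequenceLength = 256;

var tokenIndexes = new long[] { 101, 2054, 2003, 102 };    // hypothetical vocabulary indices

var padding = Enumerable.Repeat(0L, sequenceLength - tokenIndexes.Length);

var inputIds = tokenIndexes.Concat(padding).ToArray();                    // token ids followed by zeros
var inputMask = tokenIndexes.Select(_ => 1L).Concat(padding).ToArray();   // 1 for real tokens, 0 for padding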

5.7 Program

The Program class takes advantage of what we implemented in the Bert class. First, let's define the launch settings:

{
  "profiles": {
    "BERT.Console": {
      "commandName": "Project",
      "commandLineArgs": "\"Jim is walking through the woods.\" \"What is his name?\""
    }
  }
}

We define two command line arguments: "Jim is walking through the woods." and "What is his name?". As already mentioned, the first is the context and the second is the question. The Main method is minimal:

using System;
using System.Text.Json;

namespace BertMlNet
{
    class Program
    {
        static void Main(string[] args)
        {
            var model = new Bert("..\\BertMlNet\\Assets\\Vocabulary\\vocab.txt",
                                "..\\BertMlNet\\Assets\\Model\\bertsquad-10.onnx");

            var (tokens, probability) = model.Predict(args[0], args[1]);

            Console.WriteLine(JsonSerializer.Serialize(new
            {
                Probability = probability,
                Tokens = tokens
            }));
        }
    }
}

Technically, we create the Bert object with the path to the vocabulary file and the path to the model. Then we call the Predict method with the command line arguments. As output, we get:

{"Probability":0.9111285,"Tokens":["jim"]}

We can see that BERT is 91% sure that the answer to the question is "Jim", and that it is correct.

Conclusion

In this article, we learned how BERT works. More specifically, we had a chance to explore how the Transformer architecture works and how BERT uses it to understand language. Finally, we learned about the ONNX model format and how to use it in ML.NET.

https://rubikscode.net/2021/04/19/machine-learning-with-ml-net-nlp-with-bert/
