Author: Veysel Kocaman | Compiled by: VK | Source: Towards Data Science
Natural Language Processing (NLP) is a key component of many data science systems that must understand or reason about text. Common use cases include text classification, question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation.
NLP matters in a growing number of artificial intelligence applications. If you are building a chatbot, searching patent databases, matching patients to clinical trials, grading customer service or sales calls, or extracting summaries from financial reports, you must extract accurate information from text.
Text classification is one of the main tasks in modern natural language processing: assigning a suitable category to a sentence or document. The categories depend on the chosen dataset and can range across topics.
Every text classification problem follows similar steps but can be solved with different algorithms. Leaving aside classic and popular machine learning classifiers such as random forests or logistic regression, more than 150 deep learning frameworks have been proposed for various text classification problems.
Several benchmark datasets are used for text classification, and the latest benchmarks can be tracked at nlpprogress.com. Below are the basic statistics about these datasets.
Simple text classification applications typically follow these steps:
- Text preprocessing and cleaning
- Feature Engineering (manually create features from text)
- Feature vectorization (TF-IDF, frequency counts, encoding) or embedding (word2vec, doc2vec, BERT, ELMo, sentence embeddings, etc.)
- Train models with ML and DL algorithms.
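As a rough illustration of the vectorization step in the list above, the sketch below computes TF-IDF weights in plain Python. The two sample sentences, the naive whitespace tokenizer, and the function names `tokenize` and `tfidf_vectors` are assumptions made for this example; a real project would normally rely on a library implementation instead.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive preprocessing step)."""
    return text.lower().split()

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical mini-corpus, for illustration only.
docs = ["oil prices rise again", "the team wins again"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document (here "again") gets weight zero, which is the point of the IDF factor: such terms carry no information for telling categories apart.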
Text Classification in Spark-NLP
In this article, we will build a text classification model in Spark NLP using Universal Sentence Embeddings. We will then compare it with other ML and DL methods and other text vectorization methods.
There are several text classification options in Spark NLP:
- Text Preprocessing in Spark-NLP and ML Algorithms Based on Spark-ML
- Text Preprocessing and Word Embeddings in Spark-NLP and ML Algorithms (Glove, Bert, Elmo)
- Text Preprocessing and Sentence Embeddings in Spark-NLP and ML Algorithms (Universal Sentence Encoders)
- Text preprocessing and ClassifierDL module in Spark-NLP (based on TensorFlow)
As we discussed in depth in our seminal article on Spark NLP, all of these text processing steps before ClassifierDL can be implemented as a pipeline, that is, a specified sequence of stages in which each stage is either a transformer or an estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage: the transform() method of each stage updates the dataset and passes it to the next stage. With a pipeline, we can ensure that the training and test data go through the same feature processing steps.
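The estimator/transformer idea can be illustrated with a toy pipeline in plain Python. This is only a sketch of the concept, not how Spark ML implements it; the classes `Lowercase`, `VocabularyIndexer`, and `ToyPipeline` are hypothetical names invented for this example.

```python
class Lowercase:
    """A transformer: needs no fitting, it only transforms rows."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]


class VocabularyIndexer:
    """An estimator: learns a vocabulary in fit(), then applies it in transform()."""
    def fit(self, rows):
        words = sorted({w for row in rows for w in row["text"].split()})
        self.index = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, rows):
        # Words that were not seen during fit() map to -1.
        return [dict(row, ids=[self.index.get(w, -1) for w in row["text"].split()])
                for row in rows]


class ToyPipeline:
    """Runs the stages in order; each stage sees the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "Stocks Rise"}, {"text": "Team Wins"}]
test = [{"text": "Stocks Fall"}]

pipeline = ToyPipeline([Lowercase(), VocabularyIndexer()]).fit(train)
print(pipeline.transform(test))  # [{'text': 'stocks fall', 'ids': [1, -1]}]
```

Because the vocabulary is learned once in fit() and then reused, the test rows are processed with exactly the same steps as the training rows, which is the guarantee described above.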
Universal Sentence Encoders
In Natural Language Processing (NLP), text embedding plays an important role before building any deep learning model. Text embeddings convert text (words or sentences) into vectors.
Basically, text embedding methods encode words and sentences in fixed-length vectors to greatly improve the processing of text data. The idea is simple: words that appear in the same context tend to have similar meanings.
Techniques like Word2vec and GloVe work by converting a word into a vector, so the vector for "cat" lies closer to the vector for "dog" than to the vector for "eagle". When embedding a sentence, however, the context of the whole sentence needs to be captured in the vector. This is where Universal Sentence Encoders come in.
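"Closer" is usually measured with cosine similarity. The sketch below shows the computation on hand-picked 3-dimensional vectors; these numbers are hypothetical and chosen only to mimic the cat/dog/eagle example, whereas real Word2vec or GloVe vectors are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-picked for illustration only.
toy_vectors = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "eagle": [0.2, 0.1, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_eagle = cosine_similarity(toy_vectors["cat"], toy_vectors["eagle"])
print(f"cat~dog: {cat_dog:.3f}, cat~eagle: {cat_eagle:.3f}")
```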
Universal Sentence Encoders (USE) encode text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. Pre-trained Universal Sentence Encoders are publicly available on TensorFlow Hub. They come in two variants, one trained with a Transformer encoder and the other with a Deep Averaging Network (DAN).
Spark NLP uses the TensorFlow Hub version, wrapped so that it runs in the Spark environment. That is, you can simply plug this embedding into Spark NLP and train a model in a distributed fashion.
USE generates an embedding for the whole sentence directly, with no further computation: there is no need to average the word embeddings of each word in a sentence to obtain a sentence embedding.
Application of ClassifierDL and USE in Text Classification in Spark-NLP
In this article, we will use the AGNews dataset, one of the benchmark datasets in text classification tasks, to build a text classifier in Spark NLP using USE and ClassifierDL, the latest module added in Spark NLP version 2.4.4.
ClassifierDL is the first multi-class text classifier in Spark NLP, and it takes various text embeddings as input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built in TensorFlow and supports up to 50 classes.
In other words, you can use ClassifierDL to build a text classifier with BERT, ELMo, GloVe, or Universal Sentence Encoders in Spark NLP.
Let's start coding!
The following statements load the necessary packages and start a Spark session.
import sparknlp
spark = sparknlp.start()
# sparknlp.start(gpu=True) >> train on GPU
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)
>> Spark NLP version 2.4.5
>> Apache Spark version: 2.4.4
Then we can download the AGNews dataset from the GitHub repo (https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public).
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv
trainDataset = spark.read \
    .option("header", True) \
    .csv("news_category_train.csv")
trainDataset.show(10, truncate=50)
>>
+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
|Business| Stocks ended slightly higher on Friday but sta...|
|Business| Assets of the nation's retail money market mut...|
|Business| Retail sales bounced back a bit in July, and n...|
|Business|" After earning a PH.D. in Sociology, Danny Baz...|
|Business|  Short sellers, Wall Street's dwindling band o...|
+--------+--------------------------------------------------+
only showing top 10 rows
The AGNews dataset has 4 classes: World, Sci/Tech, Sports, Business
from pyspark.sql.functions import col
trainDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()
>>
+--------+-----+
|category|count|
+--------+-----+
|   World|30000|
|Sci/Tech|30000|
|  Sports|30000|
|Business|30000|
+--------+-----+

testDataset = spark.read \
    .option("header", True) \
    .csv("news_category_test.csv")
testDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()
>>
+--------+-----+
|category|count|
+--------+-----+
|Sci/Tech| 1900|
|  Sports| 1900|
|   World| 1900|
|Business| 1900|
+--------+-----+
Now, we can feed this data to the Spark NLP DocumentAssembler, which is the entry point to Spark NLP for any Spark DataFrame.
# The actual content is in the description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

# We can download pre-trained embeddings
use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# The classes/labels/categories are in the category column
classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages=[
        document,
        use,
        classsifierdl
    ])
Above, we take the dataset, feed it into the pipeline, get the sentence embeddings from USE, and then train ClassifierDL on them.
Now we start training. We will train for 5 epochs, set with .setMaxEpochs() in ClassifierDL. In the Colab environment, this takes about 10 minutes to complete.
use_pipelineModel = use_clf_pipeline.fit(trainDataset)
When you run this command, Spark NLP will write the training logs to the annotator_logs folder in the home directory. Below is the resulting log.
As you can see, we achieved over 90% validation accuracy in less than 10 minutes without text preprocessing, which is often the most time-consuming and laborious step in any NLP modeling.
Now let's get the predictions. We will use the test set downloaded above.
The following is the test result obtained with classification_report from the sklearn library.
We achieved a test set accuracy of 89.3%, which looks great!
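The prediction and evaluation code is not shown in this section, so the following is one possible sketch rather than the author's exact code. It assumes the fitted pipeline and the column names used above ("category" for the true label, "class" for the ClassifierDL output); the helper names `predict_labels`, `accuracy`, and `per_class_recall` are hypothetical, and the metrics are computed by hand instead of with sklearn.

```python
from collections import Counter

def predict_labels(pipeline_model, test_df):
    """Return (true_label, predicted_label) pairs for every row of test_df."""
    predictions = pipeline_model.transform(test_df)
    # "class.result" is an array column; selecting it yields a column named "result".
    rows = predictions.select("category", "class.result").collect()
    return [(row["category"], row["result"][0] if row["result"] else None)
            for row in rows]

def accuracy(pairs):
    """Fraction of pairs whose predicted label equals the true label."""
    return sum(1 for true, pred in pairs if true == pred) / len(pairs)

def per_class_recall(pairs):
    """Recall per true class: correct predictions / number of rows of that class."""
    totals = Counter(true for true, _ in pairs)
    hits = Counter(true for true, pred in pairs if true == pred)
    return {label: hits[label] / totals[label] for label in totals}

# Hand-written pairs, only to show what the metric helpers return.
sample_pairs = [
    ("Sports", "Sports"),
    ("World", "Business"),
    ("Business", "Business"),
    ("Sci/Tech", "Sci/Tech"),
]
print(accuracy(sample_pairs))  # 0.75
print(per_class_recall(sample_pairs))
```

With the objects defined earlier, the real evaluation would start from `pairs = predict_labels(use_pipelineModel, testDataset)`. The same pairs could also be split into two lists and passed to sklearn's classification_report.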
Text Classification with Text Preprocessing in Spark NLP Using BERT and GloVe Embeddings
As with any text classification problem, there are many useful text preprocessing techniques, including stemming, lemmatization, spell checking, and stopword removal. With the exception of spell checking, almost every NLP library in Python has tools for applying these techniques. Currently, Spark NLP is the only NLP library available with spell checking capabilities.
Let's apply these steps in a Spark NLP pipeline and then train a text classifier with GloVe embeddings. We will first apply several text preprocessing steps (normalization that keeps only alphabetic characters, stopword removal, and lemmatization), then get the word embedding of each token (the lemma of each token), and then average the word embeddings in each sentence to get a sentence embedding for each row.
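The averaging step can be sketched in a few lines of plain Python. The 2-dimensional word vectors below are hypothetical and stand in for the much larger GloVe vectors; `average_embeddings` is an illustrative helper, not a Spark NLP function.

```python
def average_embeddings(token_vectors):
    """Element-wise mean of equally sized word vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Hypothetical word vectors for the lemmas of one cleaned sentence.
word_vectors = {"oil": [1.0, 0.0], "price": [0.0, 1.0], "rise": [1.0, 1.0]}
lemmas = ["oil", "price", "rise"]

sentence_embedding = average_embeddings([word_vectors[t] for t in lemmas])
print(sentence_embedding)
```

Whatever the number of tokens, the result has the same dimension as a single word vector, which is what lets a classifier consume sentences of different lengths.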
For all these text preprocessing tools and more in Spark NLP, you can find detailed instructions and code examples in this Colab notebook ( https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).
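The pipeline definition itself is not shown in this section. The sketch below is one plausible way to assemble the stages just described; it is an assumption, not necessarily the author's exact code. The function name `build_glove_clf_pipeline`, the intermediate column names, and the default of 5 epochs are illustrative choices, and the pretrained() calls rely on Spark NLP's default English models.

```python
def build_glove_clf_pipeline(max_epochs=5):
    """Assemble preprocessing + GloVe + ClassifierDL stages (illustrative sketch)."""
    # Imported lazily so the function can be defined without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer, Normalizer, StopWordsCleaner, LemmatizerModel,
        WordEmbeddingsModel, SentenceEmbeddings, ClassifierDLApproach,
    )

    document_assembler = DocumentAssembler() \
        .setInputCol("description") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # Cleans tokens; the default pattern is assumed to drop non-letter characters.
    normalizer = Normalizer() \
        .setInputCols(["token"]) \
        .setOutputCol("normalized")

    stopwords_cleaner = StopWordsCleaner() \
        .setInputCols(["normalized"]) \
        .setOutputCol("cleanTokens") \
        .setCaseSensitive(False)

    lemma = LemmatizerModel.pretrained() \
        .setInputCols(["cleanTokens"]) \
        .setOutputCol("lemma")

    # The default pretrained word embeddings model is assumed to be GloVe.
    glove_embeddings = WordEmbeddingsModel.pretrained() \
        .setInputCols(["document", "lemma"]) \
        .setOutputCol("embeddings")

    # Average the word vectors of each row into a single sentence vector.
    embeddings_sentence = SentenceEmbeddings() \
        .setInputCols(["document", "embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

    classifier = ClassifierDLApproach() \
        .setInputCols(["sentence_embeddings"]) \
        .setOutputCol("class") \
        .setLabelColumn("category") \
        .setMaxEpochs(max_epochs) \
        .setEnableOutputLogs(True)

    return Pipeline(stages=[
        document_assembler, tokenizer, normalizer, stopwords_cleaner,
        lemma, glove_embeddings, embeddings_sentence, classifier,
    ])
```

With a Spark session started as above, `clf_pipeline = build_glove_clf_pipeline()` gives a pipeline object that can be fitted on the training data.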
Then we can train.
clf_pipelineModel = clf_pipeline.fit(trainDataset)
Get test results.
Now we have 88% test set accuracy! Even after all these text cleaning steps, we still cannot beat Universal Sentence Embeddings + ClassifierDL, mainly because USE performs better on raw text than on the cleaned version of the data.
To train the same classifier with BERT, we can replace the glove_embeddings stage with a BERT embeddings stage in the same pipeline built above.
word_embeddings = BertEmbeddings\
    .pretrained('bert_base_cased', 'en')\
    .setInputCols(["document", 'lemma'])\
    .setOutputCol("embeddings")\
    .setPoolingLayer(-2)  # default 0
We can also use ELMo embeddings.
word_embeddings = ElmoEmbeddings\
    .pretrained('elmo', 'en')\
    .setInputCols(["document", 'lemma'])\
    .setOutputCol("embeddings")
Fast inference with LightPipeline
As we discussed in depth in an earlier article, LightPipelines are Spark NLP-specific pipelines, equivalent to Spark ML pipelines but designed to process smaller amounts of data. They are useful for working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.
Spark NLP LightPipelines are Spark ML pipelines converted into a multi-threaded task on a single machine, which makes them more than 10 times faster for small amounts of data (small is relative, but roughly 50,000 sentences is a reasonable upper bound). To use them, we simply plug in a trained (fitted) pipeline. We do not even need to convert the input text to a DataFrame before feeding it into a pipeline that originally takes a DataFrame as input. This is very useful when we need predictions for a few lines of text from a trained ML model.
LightPipelines are easy to create and spare you from dealing with Spark Datasets. They are also very fast and, while running only on the driver node, execute their computation in parallel. Let's see how this applies to the case described above:
light_model = LightPipeline(clf_pipelineModel)
text = "Euro 2020 and the Copa America have both been moved to the summer of 2021 due to the coronavirus outbreak."
# annotate() returns a dict of lists, so take the first (and only) predicted label
light_model.annotate(text)['class'][0]
>> "Sports"
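The same approach works for a small batch of strings. The helper below is a hedged sketch: `classify_texts` is a hypothetical name, and it assumes the output column is called "class", as in the pipelines above.

```python
def classify_texts(light_model, texts):
    """Annotate a list of raw strings and return one predicted label per text."""
    # annotate() on a list returns one dict per text, mapping each output
    # column name to a list of result strings.
    annotations = light_model.annotate(texts)
    return [ann["class"][0] if ann.get("class") else None for ann in annotations]
```

For example, `classify_texts(LightPipeline(clf_pipelineModel), [text_a, text_b])` would return a list of two labels without creating a DataFrame.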
You can also save this trained model to disk and use it later with ClassifierDLModel.load() in another Spark pipeline.
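A minimal sketch of saving and reloading, assuming the trained classifier is the last stage of the fitted pipeline (as in both pipelines above). The function names and any path passed to them are placeholders chosen for this example.

```python
def save_classifier(pipeline_model, path):
    """Persist the last pipeline stage (the trained ClassifierDL model) to disk."""
    classifier_model = pipeline_model.stages[-1]
    # Standard Spark ML writer; overwrite() replaces an existing directory.
    classifier_model.write().overwrite().save(path)
    return path

def load_classifier(path):
    """Reload the saved classifier so it can be used as a stage in another pipeline."""
    from sparknlp.annotator import ClassifierDLModel  # lazy import: needs Spark NLP
    return ClassifierDLModel.load(path) \
        .setInputCols(["sentence_embeddings"]) \
        .setOutputCol("class")
```

The reloaded model expects the same kind of sentence embeddings it was trained on, so the new pipeline needs the same embedding stage in front of it.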
In this article, we trained a multi-class text classification model in Spark NLP using word embeddings and Universal Sentence Encoders, and obtained good model accuracy in less than 10 minutes of training time. The entire code can be found in this GitHub repo (Colab compatible, https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb). We also prepared another notebook that covers nearly all possible combinations of text classification in Spark NLP and Spark ML (CV, TfIdf, Glove, Bert, Elmo, USE, LR, RF, ClassifierDL, DocClassifier): https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb.
We have also started to offer online Spark NLP training for the public and enterprise (medical) versions. Here are the links to all public Colab notebooks (https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public)
John Snow Labs will organize virtual Spark NLP trainings; here is the link to the next training: