This week, my mentor and I had a brief discussion about an interesting potential project: applying policy gradient methods to natural language processing (NLP), especially sentiment analysis.
What is NLP?
Natural language processing (NLP) is a hot topic that builds computational algorithms to let computers automatically learn, analyze, and represent human language. Many systems are built on NLP, such as search engines, Amazon Alexa, and OpenAI's language models. Language is one of the most powerful and beautiful abilities humans have, and with NLP we can give machines the ability to talk and to analyze complex natural language. For example, machines can automatically translate between languages and even generate dialogue (as in chatbots).
Traditionally, NLP projects relied on shallow machine learning models and time-consuming, hand-crafted features. For a sentence, such a representation matrix is sparse, which leads to the curse of dimensionality. Nowadays, with well-trained word embeddings (low-dimensional dense representations), many NLP projects achieve strong performance compared to traditional methods like logistic regression or naive Bayes.
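To make the sparsity point concrete, here is a minimal sketch (using PyTorch; the vocabulary size and embedding dimension are arbitrary choices for illustration, not from any specific project) contrasting a one-hot representation of a token with a dense embedding lookup:

```python
import torch
import torch.nn as nn

vocab_size = 50_000   # illustrative vocabulary size
embed_dim = 300       # illustrative dense embedding dimension

# Sparse view: a single token as a one-hot vector of length vocab_size.
token_id = torch.tensor([123])
one_hot = nn.functional.one_hot(token_id, num_classes=vocab_size).float()
print(one_hot.shape)   # torch.Size([1, 50000]) -- almost entirely zeros

# Dense view: the same token looked up in an embedding table.
embedding = nn.Embedding(vocab_size, embed_dim)
dense = embedding(token_id)
print(dense.shape)     # torch.Size([1, 300]) -- low-dimensional and dense
```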
Reinforcement Learning
As discussed in the last post, RL is a method that trains an agent to choose discrete actions guided by a reward. Nowadays, more NLP projects, especially language generation tasks such as text summarization and chatbots, are starting to employ RL methods.
When we use RNN models to generate language, the ground-truth tokens seen during training are replaced at inference time by the model's own generated tokens, which compounds errors along the sequence. This issue has pushed researchers to apply RL to NLP problems. Consider a text classification task. We create an environment (the input words and context vectors seen at each time step) and an agent whose action is classifying the text, following a policy (the model parameters) that also involves predicting the next word of the sequence at each time step. The agent acts arbitrarily at first, but based on the outcome the agent (an RNN-based generative model) receives a reward that shapes its next action. This continues until the end of the sequence, where the final reward is computed. Reward functions vary by task; in a sentence generation task, for instance, the reward could measure information flow. A classic RL algorithm called REINFORCE can be used to optimize this objective, and it has been applied to NLP problems such as machine translation.
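As a rough sketch of what a REINFORCE update looks like in such a setup, here is a toy example in PyTorch: an RNN agent reads the token sequence, its final action is a class label, and the reward is +1 for a correct prediction. All module names, sizes, and the dummy data are assumptions made for illustration, not the actual project setup:

```python
import torch
import torch.nn as nn

class TextClassifierPolicy(nn.Module):
    """RNN 'agent': reads the token sequence, then picks a class as its action."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))                  # h: (1, batch, hidden_dim)
        return torch.log_softmax(self.head(h[-1]), dim=-1)   # log pi(action | state)

policy = TextClassifierPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

tokens = torch.randint(0, 10_000, (4, 20))    # dummy batch of token ids
labels = torch.randint(0, 2, (4,))            # dummy sentiment labels

log_probs = policy(tokens)
actions = torch.distributions.Categorical(logits=log_probs).sample()
reward = (actions == labels).float()          # +1 if the classification is correct

# REINFORCE: increase the log-probability of actions in proportion to their reward.
loss = -(reward * log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the reward is usually centered with a baseline to reduce variance, but the core idea is the same: weight each action's log-probability by the reward it eventually earned.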
However, RL methods require an appropriate action and state space, which can limit the model despite the promising results. Another method used for NLP is adversarial training. Consider GANs in computer vision: the objective is to fool a discriminator trained to distinguish generated samples from real ones. For a chatbot, we can frame the task in the same way, where the discriminator is trained to distinguish machine-generated language from human language.
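For intuition, here is a sketch of the discriminator side of such an adversarial setup: an RNN scores a token sequence as human-written or machine-generated, trained with a standard binary cross-entropy loss. Again, all sizes and the placeholder batches are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    """Scores a token sequence: 1 = human-written, 0 = machine-generated."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        _, h = self.rnn(self.embed(tokens))
        return self.head(h[-1]).squeeze(-1)    # one raw logit per sequence

disc = SequenceDiscriminator()
optimizer = torch.optim.Adam(disc.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

human_batch = torch.randint(0, 10_000, (8, 20))      # placeholder human sentences
machine_batch = torch.randint(0, 10_000, (8, 20))    # placeholder generated sentences

logits = disc(torch.cat([human_batch, machine_batch]))
targets = torch.cat([torch.ones(8), torch.zeros(8)])

loss = criterion(logits, targets)   # discriminator learns to tell the two apart
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The generator (e.g. the chatbot) would then be rewarded for fooling this
# discriminator, typically via a policy-gradient update like the one above.
```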