Vinobot - Automated Wine Club Reviews

Jay Ozer
19 min read · Dec 23, 2020

Introduction

Wouldn’t it be nice to solve an existing challenge in a practical manner? For my second practicum project, I thought I could delve into the relatively new world of text generation. When I mentioned my chosen scope, my sister asked if it was possible to generate short banners in the style of the banners of the best-performing products. She leads a software engineering team at a marketplace app where hosts and guests are matched for services, so she is always interested in features that can help new hosts build their profiles efficiently and increase bookings. Banners are free-text, short sentences written by hosts to draw attention to their product as guests browse the app. Since data for this purpose was not available, I decided to use wine review data instead and build a series of proofs of concept to deepen my understanding of what is possible in text generation.

In this article, I am going to share the methods I used, some of my key findings, my challenges, and hopefully a few useful tips and tricks I picked up along the way. The five analysis notebooks, along with a pickled, ready-to-consume data set, are on the project's GitHub repo, Vinobot.

Background and problem statement

Advertising is the backbone of sales. In any marketplace, some products are favored more than others, and one way to influence this is via banner ads. In many cases, users are either blown away by the offer or turned off by it. This is the decade of the swipe, after all. But what if you had a template for the winning strategy that you could slightly alter for the masses? In banner terms, you could theoretically increase sales by replicating the writing style of the top performers, right? Plus, for a single type of product, the banner should not have to be personalized too much, which works in our favor. I thought the same logic applies to wine reviews, which is why I chose them as my stand-in data set.

This study aims to create an automated content generation system that assists human writers and makes the writing process more efficient and effective. It would be ideal if Vinobot could imitate a sommelier and generate wine names, varieties, and short reviews.

Data

For this project, I was looking for a clean data set that would let me concentrate on the text generation portion rather than on data sourcing and prepping. I recently came across a series of tutorials on graph databases whose data source was a semi-curated data set built by lju-lazarevic to demonstrate graph databases. I will be using the winemag-data-130k-v3.csv file as my data set. It is a modified version of the original from Kaggle, which was scraped from Wine Enthusiast specifically for text analysis projects. Perfect!

The original data was ~51MB, but after further cleaning and converting, the largest training data set was around 30MB, hopefully decent enough to get coherent results.

Exploratory Analysis

Although I started out with a curated data set, there were still a few things to do. One of my goals for this project was to show data visually. While browsing the pandas docs, I came across a neat styling function that takes a scalar and returns a string with the CSS property 'color: red' for positive values and black otherwise. Below are the function and my results.

Sort totals & return NANs in red
Total number and percent of nans

It returns the sorted totals and shows the percentage of NaNs in red. I noticed that taster_twitter_handle has more NaNs than taster_name, which tells me some of the reviewers may not have a Twitter account. That is no reason to lose those rows. Therefore, I first populated taster_name nulls with "unknown reviewer" and then populated the NaNs of taster_twitter_handle using the taster_name field. Overall, my approach was to drop the minimum number of rows, so imputing was the name of my game.
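For reference, here is a minimal sketch of that styling function and the imputation step, assuming the cleaned DataFrame is named df (the helper name is mine; the exact notebook version is on the repo):

```python
import pandas as pd

def color_nans_red(val):
    # Red CSS for any non-zero NaN percentage, black otherwise
    return 'color: red' if val > 0 else 'color: black'

# Total and percentage of NaNs per column, sorted descending
nan_summary = pd.DataFrame({
    'total_nans': df.isna().sum(),
    'pct_nans': (df.isna().mean() * 100).round(2),
}).sort_values('total_nans', ascending=False)

nan_summary.style.applymap(color_nans_red, subset=['pct_nans'])

# Impute instead of dropping rows
df['taster_name'] = df['taster_name'].fillna('unknown reviewer')
df['taster_twitter_handle'] = df['taster_twitter_handle'].fillna(df['taster_name'])
```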

A good trick I had to learn was expanding the column width to its maximum. This was especially useful when I had to inspect the wine review column visually.
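In pandas this is a one-liner; a quick sketch (column names assume the original data set):

```python
import pandas as pd

# Show the full text of each cell instead of truncating long reviews
pd.set_option('display.max_colwidth', None)   # use -1 on pandas versions older than 1.0
df[['title', 'description']].head()
```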

Another favorite method of mine is the box plot. I always feel like I get so much information for the simple effort of creating the chart. But this time, I was almost fooled by the uniform-looking distribution of Switzerland's wine prices. With a median around $35 and no outliers, it looked like the dream subset to base price imputing on. I even remember thinking to myself:

“Figures,… of course it is Switzerland. After all, they are known for their military neutrality policy; the Swiss have not participated in a foreign war since the Treaty of Paris. And this certainly must have carried over to the wine prices.”

Hilarious! Once I counted the number of reviews, it was easy to see the root cause of the uniformity: Switzerland has only 6 wines out of ~119K.

country vs price
review counter
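For completeness, a sketch of the two checks above, assuming the cleaned frame is still called df:

```python
import matplotlib.pyplot as plt

# Price distribution per country; Switzerland looks deceptively tidy here
df.boxplot(column='price', by='country', figsize=(16, 6), rot=90)
plt.show()

# The review counter that gives the game away: only a handful of Swiss wines
print(df['country'].value_counts().get('Switzerland', 0))
```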

As the output of the exploratory phase, I decided to pickle my clean data set. Why pickle, you may ask? Believe me, I had been happily exporting with to_csv. But this sentence changed my mind: "Python-pickling does the same thing for Python objects. It creates a serialized, byte-wise .pkl file that preserves a Python object precisely and exactly." Here is the full article.

pickled
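The round trip is two calls; a minimal sketch (the file name is a placeholder of mine):

```python
# Serialize the cleaned frame byte-for-byte instead of round-tripping through CSV
df.to_pickle('wine_reviews_clean.pkl')

# Later, in the modeling notebooks
import pandas as pd
df = pd.read_pickle('wine_reviews_clean.pkl')
```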

Text Generation

During my first practicum project, Machine teaching restaurant health inspection scores, I had the chance to work with the violation description text column and was able to produce interesting insights. I think Dipanjan Sarkar explains the philosophy of language and its four problems best in his book "Text Analytics with Python". These are: the nature of the meaning of language, the use of language, language cognition, and the relationship between language and reality. All of these make NLP an interesting subject. After all, most machine learning techniques are tuned to work with numerical data.

Besides the usual infatuation, I have noticed a significant increase in my Medium inbox of articles stating how transformers and pre-trained language models are far superior to more traditional techniques such as RNNs and LSTMs. I used four methods to try to generate text: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), seq2seq, and GPT-2 (Generative Pre-trained Transformer).

Generate wine variety names with RNN

To generate variety names from scratch, there has to be a system that generates short texts quickly. These texts should have a unique style and could actually serve as names for new types of wine.

Recurrent neural networks are called recurrent because the algorithm performs the same computation for every element in the sequence. The inputs, outputs, and states are represented by vectors. The network generates the next character given the current one while keeping track of the history so far. (For instance, to generate one of my favorite varieties, Malbec, the sequence would be \t, m, a, l, b, e, c, \n: input '\t', output 'm'; then input 'm', output 'a'; and so on. Its state remembers that \t and m have been seen so far, and with that it can stop at the end of the sequence.)

RNN with 500 nodes
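For orientation, a minimal Keras sketch of a character-level network along these lines; the vocabulary size and sequence length are placeholders of mine, and the actual notebook is on the repo:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

vocab_size = 60   # placeholder: distinct characters, including '\t' start and '\n' end tokens
max_len = 25      # placeholder: longest variety name plus the start/end tokens

model = Sequential([
    # 500 recurrent units, matching the "RNN with 500 nodes" figure above
    SimpleRNN(500, input_shape=(max_len, vocab_size)),
    # Probability distribution over the next character
    Dense(vocab_size, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```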

I did three main runs: one with 200 epochs, one with 2K, and finally one with 5K. An epoch is one full pass over the data set, and the batch size, which I set to 64, is the number of samples after which the parameters are adjusted. The number of epochs itself is not that significant; what matters more are the training and validation errors, and as long as they keep dropping, training should continue. If the validation error starts increasing, it might be an indication of overfitting. With these in mind, I set the number of epochs to 5K but decided to terminate around 1500 based on the diminishing returns in the error rates. With a batch size of 64, one epoch consists of roughly (number of samples ÷ 64) parameter updates. Initially, I thought the loss value stopped improving around 1000 epochs, but it started improving again, which is why I capped the run at 5K. In the end, though, I was not able to get under a loss of 0.32, no matter how long I trained. Also, a mid-level SageMaker machine such as ml.p3.4xlarge was sufficient for this.

The process is straightforward. First, initialize the sequence with its start character. Then calculate the probability distribution over the next character and pick from it. Repeat until the end of the sequence is reached. The function below automates this iteration and generates wine variety names.

Generate wine name
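A hedged sketch of what such a generation loop can look like, assuming a trained model and the usual char_to_idx / idx_to_char lookups (the names are mine):

```python
import numpy as np

def generate_name(model, char_to_idx, idx_to_char, max_len=25):
    """Sample one wine variety name character by character, from '\t' to '\n'."""
    vocab_size = len(char_to_idx)
    sequence = [char_to_idx['\t']]        # start-of-name token
    name = ''
    while True:
        # One-hot encode the sequence seen so far, padded/truncated to max_len
        x = np.zeros((1, max_len, vocab_size))
        for t, idx in enumerate(sequence[-max_len:]):
            x[0, t, idx] = 1.0
        # Probability distribution over the next character
        probs = model.predict(x, verbose=0)[0]
        probs = probs / probs.sum()       # guard against floating-point drift
        next_idx = np.random.choice(vocab_size, p=probs)
        next_char = idx_to_char[next_idx]
        if next_char == '\n' or len(name) >= max_len:
            break
        name += next_char
        sequence.append(next_idx)
    return name

print(generate_name(model, char_to_idx, idx_to_char))
```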

Typically, having a larger training set should create better results, but in this case the amount of data I have seems sufficient. I think leaving the non-English characters in was the better idea (for that authentic black-box feeling, lol), since the other way around the names sound too homogeneous. A second typical way to improve a model's performance is training for more epochs. I ran 200, 2K, and 10K initially and decided 5K would be ideal for my needs; it runs fairly quickly and still produces decent, plausible results, and the loss did not change significantly after 1500 epochs.
Another way could be to increase the number of hidden units. I used 250 but also experimented with 50, 100, and 150, and it seemed that the more hidden units I used, the better the results. During my initial runs, about one out of every 50 generated names was an actual variety name. I wanted to randomize the final product a bit more, and after tinkering around, my best run was with 500 units and 1000 epochs. In my opinion this produced the most authentic results, without any repetition or signs of overfitting, and in the shortest amount of time.

I think the results could easily pass as new wine varieties. Some of my favorites were Masy, Siovasie, Srigaz, Graüy, and Charaussa. And then there were hilarious ones such as Chardonna, Mellot, and Pilot. Perhaps we can even combine the names; I am sure I would be interested in a glass of "Mellot Pilot". My girlfriend thought it was a cute wine name and said, "You don't want a stressed-out pilot, you want a mellow one. And when you go all out, you order a flight." 😜

Generate wine reviews with LSTM

Why change the method, right? The RNN worked pretty well for predicting wine variety names. But now we are dealing with longer sentences, wine reviews, and longer sentences create long-term dependencies.

Here is an example of what I mean by short and long term dependency:
- Short term dependency would be predicting ‘aromas’ in the sentence “With attractive melon and other tropical aromas”
- Long term dependency would be predicting the second part of the sentence “this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics… [it’s a good companion to a hearty winter stew.]”

When it comes to input sentences that span long intervals, there are limitations to RNNs. These limitations come from the vanishing and exploding gradient problems. The gradient is the rate of change of the error with respect to the weights, and the weights are adjusted to reduce the error. With vanishing gradients, the gradients of the weights become smaller and smaller and eventually approach zero as the method moves backward from the last time step towards the first. With exploding gradients, the gradient values of the weights become bigger and bigger as the method back-propagates towards the first time step. A common remedy is gradient clipping, which limits the maximum value of the gradients at every node. This sounds a lot like compressing audio, where hard clipping can create distortion and thus introduce noise.
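The clipping itself is a one-line change in a Keras optimizer, applied to whichever model is being trained; a minimal sketch (the threshold of 1.0 is an arbitrary choice of mine):

```python
from tensorflow.keras.optimizers import Adam

# Cap the norm of the gradients at 1.0 before each weight update,
# so an exploding gradient cannot blow up the parameters
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
```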

As the title suggests, my goal is to write like a sommelier:
- Input: a data set of wine reviews from many reviewers (I will use only Jim Gordon)
- Output: complete sentence(s) the way wine taster/reviewer Jim Gordon would finish them. Why Jim Gordon? Because his reviews represent 3.7% of the overall data, and that is the amount I could process in a reasonable time while still producing good results (a sketch of this filtering step follows).
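A sketch of that filtering step, assuming the pickled frame is df (variable names are mine):

```python
# Keep only Jim Gordon's reviews (roughly 3.7% of the full data set)
jim = df[df['taster_name'] == 'Jim Gordon']
corpus = ' '.join(jim['description'].tolist())
print(len(jim), 'reviews,', len(corpus), 'characters')
```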

The first step was preprocessing the text, and I had the chance to use my new favorite Python package, colorama, to visually inspect the changes.

using Fore and Style
colorama at work
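A minimal example of the colorama pattern used for the before/after inspection (the sample strings are illustrative, not from the notebook):

```python
from colorama import Fore, Style, init

init(autoreset=True)  # reset the color after every print

def show_change(before, after):
    # Raw text in red, cleaned text in green, for a quick visual diff
    print(Fore.RED + before)
    print(Fore.GREEN + after + Style.RESET_ALL)

show_change('Aromas include tropical fruit, broom, brimstone and dried herb.',
            'aromas include tropical fruit broom brimstone and dried herb')
```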

Next was creating the corpus:

Chardonnay should not feel jammy

Although at first it looked ready, I noticed the variety "Chardonnay" in the corpus and realized that I could get better results if I removed variety names from the reviews. Having the words Chardonnay and Red Blend in the same sentence is a dead giveaway that the review was manufactured by a bot; even a really drunk sommelier would not make that mistake. Two options come to mind: 1) remove the variety name completely, or 2) replace it with "wine" or "this wine".

I thought subtracting from the corpus rather than adding to it would keep the reviews more broadly applicable. I would love to achieve some coherence, and I don't think introducing noise in the form of new words is going to help with that. For these reasons, I went with option one. First, I created a list of unique wine variety names and then added these words to NLTK's stopwords to be excluded from the corpus.

Remove wine variety from corpus

Another valuable lesson I learned here is not to stick with the same method blindly. Initially, I used NLTK to remove the wine variety list, just as I would when removing stopwords; the only difference was the additional word list to be excluded. When the process ran for a long time, I assumed the issue was my corpus length, so I tried numerous runs with smaller and smaller data sets. But that was not the root cause: even with the smallest set of reviews (~400), I could not get it to work. That is when I found out the slowness was caused by my function running in a loop and comparing two large corpora word by word. After all, there are 708 unique varieties on my list. The alternative was a compiled regex (re.compile), which worked lightning-fast for my needs.
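A sketch of the regex approach, assuming corpus is the review text built earlier and the variety list comes straight from the data set (variable names are mine):

```python
import re

# 708 unique variety names, longest first so 'Red Blend' matches before 'Red'
varieties = sorted(df['variety'].dropna().str.lower().unique(), key=len, reverse=True)

# One compiled alternation instead of looping over the corpus word by word
variety_pattern = re.compile(r'\b(' + '|'.join(map(re.escape, varieties)) + r')\b')

corpus = variety_pattern.sub(' ', corpus.lower())
```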

Another issue to note is the presence of digits in the corpus. During character-to-integer mapping, I ended up removing the numbers, but this time I used .translate instead of regex. Using .translate was imperative for processing the larger corpus: re.sub took about twice as long as str.translate, and slightly longer still when I didn't use a pre-compiled pattern. All the cleaning and modifications brought the vocabulary size down to 34, which is pretty good considering that, apart from the space, the non-letter characters are typical punctuation. I created the LSTM model in Keras and compiled it with "categorical_crossentropy" for the loss and "adam" as the optimizer.
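A hedged sketch of the digit removal and the model definition; the window length and number of LSTM units are placeholders of mine, not the exact notebook values:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Drop digits with str.translate (noticeably faster here than re.sub)
corpus = corpus.translate(str.maketrans('', '', '0123456789'))

chars = sorted(set(corpus))     # the 34-character vocabulary after cleaning
seq_len = 100                   # placeholder: length of each training window

model = Sequential([
    LSTM(256, input_shape=(seq_len, len(chars))),   # 256 units is a placeholder
    Dense(len(chars), activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```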

The first test run, with a single epoch, generated mostly vowels one after another without even varying the letters properly.

The first proper run was with 20 epochs. I realized I had made the mistake of removing all stop words from the corpus, and the results were gibberish again, just with somewhat longer words.

The second proper run was again with 20 epochs. I still used a cleaned version of the vocabulary, effectively converting foreign letters to English ones. But perhaps this was the issue; perhaps I needed to keep the integrity of the corpus. At this point, I thought I might be inadvertently introducing noise. It looked like I was generating only the most common letters; the results were too homogeneous, and still gibberish, so I moved on to my third and final run.

My purpose for the third run was to test whether training speed is directly related to processing power. If you remember, I had only been using 'Jim Gordon' as my reviewer, processing just 3.7% of my data set. Since on average it takes over 3 hours to train for 20 epochs, I went with a massive ml.p3.16xlarge machine with eight Tesla GPUs (although at no point did it look like I was using them all) and was able to shave off an entire hour, finishing just under 2 hours.

I have seen examples where training ran for 500 epochs to generate much shorter sentences, so I do not think 20 epochs is enough to create a complex model. To produce coherent generated sequences, I would not only have to use my entire corpus but also build a much more complex network.

Auto complete reviews with seq2seq

When I first started learning about the seq2seq method, I thought that generating a new sequence that could be a possible ending of a sentence was close to what I am looking for. Ideally, I would like to apply this to banner generation, which requires generating around two sentences. My goal is to produce text that can finish an inputted sentence, similar to auto-completion. This alone makes me think the size of my data set may not be enough.

My approach to the vocabulary was a bit different this time. To have a larger variety, I only removed special characters from the training data set; in the end, my vocabulary length was 55. The approach to training is similar to before, but this time the LSTM network is trained to generate suffixes corresponding to prefix sentences (the encoder-decoder model). It is similar to how I generated wine variety names, but running sequence by sequence: instead of splitting on words, each review is divided into two at every character position, which generates one prefix and one suffix for each position, marked with start and end tokens just like the \t and \n from the RNN run. I also had to preprocess the sentences a bit differently. Instead of creating a single string corpus, I kept them as a list of two-sentence reviews, each between 40 and 50 words. Why? Because there are still limitations. LSTMs are difficult to train because of the long gradient paths: the gradients propagate from the end, all the way back through the recurrent cell to the beginning. So for long documents (~100 words, i.e., a single review), the unrolled network is effectively 100 steps deep, a major limitation.
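A small sketch of that splitting step (the function name and the start/end tokens are mine, mirroring the \t and \n convention):

```python
def make_prefix_suffix_pairs(review, start_token='\t', end_token='\n'):
    """Split one review into (prefix, suffix) training pairs, one per character position."""
    pairs = []
    for i in range(1, len(review)):
        encoder_input = review[:i]                              # what has been typed so far
        decoder_target = start_token + review[i:] + end_token   # the completion to learn
        pairs.append((encoder_input, decoder_target))
    return pairs

pairs = make_prefix_suffix_pairs('Bright acidity and ripe pear flavors.')
print(pairs[10])   # ('Bright acid', '\tity and ripe pear flavors.\n')
```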

To get this to run at all, I ended up abandoning my usual 3.7% data set and settled for less than 0.1%, about 400 reviews. The results were less than exciting, to say the least. After this failed experiment, I started to think it might be out of my reach to generate complete sentences, and that maybe I should stick with auto-completion but come up with a system that does it word by word.

Text generation by fine tuning GPT-2

Most state-of-the-art NLP models were trained on one particular task, such as sentiment classification or textual entailment, using supervised learning. But supervised models have two major limitations. First, they need a large amount of annotated data to learn a particular task, and this typically does not exist for a specific domain application. In fact, there is an entire industry built around data labeling and annotation services. Second, they fail to generalize to tasks other than what they have been trained for. Why? Because in raw text data, especially in its stemmed and lemmatized form, each word carries its own meaning but lacks semantic relationships across the corpus. This sparse word structure is why such a large amount of data is required. Sparse words and not enough data typically result in one of two outcomes, a poor model or overfitting, and neither works. In an ideal world, there would be a universal model that can represent any embedding (a fixed-length vector typically used to encode and represent a text entity) in a vector space with minimal feature engineering.

As I mentioned earlier, my Medium stream is full of great articles about transformers, and the first one I looked into was BERT since it seems to be the current buzzword; purely winning by frequency of occurrence! Unfortunately, it was not suited to my needs. BERT, DistilBERT, and all the BERT variants are not autoregressive models, meaning they cannot generate text: these models can see future time steps in the input, so they cannot do left-to-right generation. Luckily, GPT-2 is autoregressive; its developers masked out future time steps so that the model predicts the next word from left to right. “The “Generative” in GPT-2 means, the model was trained to predict (or “generate”) the next token in a sequence of tokens in an unsupervised way. 2 means, this is not the first time this was tried” (FloydHub)

Max Woolf explains fine-tuning GPT-2 best in his blog: “The actual Transformer architecture GPT-2 uses is very complicated to explain. For the purposes of fine-tuning, since we can’t modify the architecture, it’s easier to think of GPT-2 as a black box, taking in inputs and providing outputs. Like previous forms of text generators, the inputs are a sequence of tokens, and the outputs are the probability of the next token in the sequence, with these probabilities serving as weights for the AI to pick the next token in the sequence. In this case, both the input and output tokens are byte pair encodings, which instead of using character tokens (slower to train but includes case/formatting) or word tokens (faster to train but does not include case/formatting) like most RNN approaches, the inputs are “compressed” to the shortest combination of bytes including case/formatting, which serves as a compromise between both approaches but unfortunately adds randomness to the final generation length. The byte pair encodings are later decoded into readable text for a human generation.”

He also recommends the 124M model when fine-tuning GPT-2 for its balance of speed, size, and creativity, but suggests the 355M model when fine-tuning with large amounts of data (>10MB). My training corpus was around 30MB. I used both, and in my experiments there wasn't a massive difference between the two. I also used a wrapper (gpt2-client) around the original GPT-2 repository that offers the same functionality with more accessibility, comprehensibility, and utility. The code for generating random text from a prompt is referenced from the gpt2-client page.

I first ran that code without any training or fine-tuning on the wine corpus; I wanted to see whether my fine-tuning would make a difference. I typed "wine" as the anchor prompt, and the first sentence generated was:

“For now the only thing that stands between the two is kids bedtime.”

I thought that was funny in the context of wine. However, the rest of the generated text was surprisingly violent: mostly about fighting, crime, and the economy. It was not relevant to wine in the way I would like, at all. The call returns a single string, and the default length is 1023 tokens, so it is a long one. However, if I were building an API on top of the model I just created and needed to pass the generated text elsewhere, I could have added: text = gpt2.generate(sess, return_as_list=True)[0]

For fine-tuning, I trained the pre-trained model on my wine corpus. This took surprisingly little time, under 30 minutes, even while training on the entire corpus of ~7M tokens, and it finished with an average loss of 2.84. Once I created the model, I saved it for future use.

run355M
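For reference, the fine-tuning call roughly looks like this with Max Woolf's gpt-2-simple package, which is where the gpt2.generate(sess, …) call above comes from; the corpus file name and step count are placeholders of mine:

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='355M')        # one-time download of the pre-trained weights

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset='wine_corpus.txt',     # ~30MB of cleaned reviews (placeholder name)
              model_name='355M',
              run_name='run355M',            # checkpoints land in checkpoint/run355M
              steps=1000,                    # placeholder; stop once the loss plateaus
              print_every=50,
              sample_every=200)
```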

After loading the saved model (run355M), I can now generate text from it. An important parameter to note here is "include_prefix". In hindsight, perhaps adding the prefix "This wine is" was a bit of overkill and introduced some unwanted noise; without it, I assume I would still get the desired "This wine is" effect but without the repetitive string and the unnecessary pattern it introduces. Finally, the reason I set the length to 45 tokens is that, on average, the reviews run from 40 to 50 words.
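A sketch of the generation call under the same assumptions, with the parameters discussed above; it produces samples like the ones below:

```python
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run355M')     # the fine-tuned checkpoint saved earlier

reviews = gpt2.generate(sess,
                        run_name='run355M',
                        prefix='This wine is',
                        length=45,           # reviews average 40-50 words
                        include_prefix=True,
                        nsamples=5,
                        return_as_list=True)
for review in reviews:
    print(review, '\n')
```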

This wine is made from merlot, cabernet sauvignon, petit verdot and petit verdot. it has a firm structure, with the fruit, wood and acidity of the variety.

This wine is big and bold, with concentrated blackberry, black currant and cola flavors. it’s dry and firm, with a firm, dark tannic structure.

This wine is a blend of cabernet sauvignon, merlot and petit verdot. it’s rich in black cherry and blackberry jam, oak, and a good, solid tannin structure. drink now.

This wine is a blend of cabernet sauvignon, merlot and petit verdot. it’s a juicy wine with soft tannins, a ripe black cherry flavor and a light touch of acidity.

This wine is made from the cabernet sauvignon and merlot varieties. it has a strongly tannic character that is characteristic of the variety. it is very dry and shows intense acidity.

Success! The fourth one from the top is the winner in my book, but none of these reviews read like they were generated artificially. By the way, they all sound delicious; I wonder if one of them describes Mellot Pilot. I bulk-generated a few lists with variations, and these are available on the project's GitHub page.

Future work

Ideally, this app would be able to write like any one of the tasters at Wine Enthusiast. Perhaps one way to do this would be to continue fine-tuning on taster-specific data sets. The final product could be an app that shows the user a list of tasters, prompts them to pick one, and generates a list of two-sentence reviews.

Also, it would be better if the wine variety could be included in the choices; say, I would like to review a Chardonnay like Jim Gordon. Since there are currently 708 unique varieties in my wine list, perhaps I could first use BERT for topic modeling on the wine reviews and re-categorize my data set into a more manageable number.

Finally, I think reviews are like short poems and in some cases resemble a haiku. I also think that when they rhyme slightly (not song lyrics or drunk reviews), they sound more human. I would be interested in seeing how training with the Project Gutenberg poetry corpus would affect the results.

Final Thoughts

Transformers may be the missing link that enables computers to actually understand language and its context. After my LSTM run, I was not too hopeful I could generate coherent text. If GPT-2 is this good, I can only imagine what can be done with GPT-3 (released by OpenAI in June 2020). It is currently in private beta, but I read that Microsoft has teamed up with OpenAI to exclusively license the GPT-3 language model.

I believe sustainable business models are built upon democratization. I think pre-built NLP libraries can aid this cause by allowing rapid experimentation.


Jay Ozer

I spend my time following the proptech space and database technologies.