Final Blog

Vision

Is it possible to predict the box office of a movie? With “big data” resources and “machine learning” methods, we were able to generate some reasonable forecasts. Two months ago, we brainstormed about how to make a reasonable prediction for a hugely popular movie that might break the all-time record, Avengers: Endgame, and about whether the same method could then be applied to other movies as well.

Since then, we have explored how multiple data sources fit into different models. We used data covering three aspects: historical movies that are similar to the target movie, discussion on social media (Twitter) as a measure of popularity, and news articles from public media (New York Times) as opinions from professionals. As a result, we came up with a way to generalize this forecasting process and turn it into a useful prediction tool for all movies.

Achievements

We achieved the goals we set in the beginning:

  • Predict the total box office of the movie Avengers: Endgame.
  • Generalize the movie box office forecast model.

Data

We collected data from four sources and used three of them in the final prediction models. IMDB data provides basic features of each movie, such as genre and budget. In addition, IMDB reviews, along with their binarized rating scores, were used to train the NLP model that analyzes New York Times articles. Twitter data indicates the popularity of the movie. New York Times data reflects the attitudes of professional reviewers. We also collected trailer view counts through the YouTube API. However, because of the large number of movies and the restrictions of the YouTube API, trailer view counts were not available for every movie, so we did not use this feature in the final prediction model. The number of tweets reflects popularity as well as YouTube trailer view counts do (arguably even better), so dropping the trailer feature should not cost the prediction models much information.

Datasets Introduction

IMDB

Movie features of 45,000 movie records: genres, production companies, release date, IMDB popularity index, IMDB vote average, runtime, budget, revenue.
For the training set of the NLP model, 50,000 IMDB reviews for a variety of movies were used. Each review contains two parts: the review text and the review score. The review text was tokenized into vectors, and the review score was binarized to either 0 or 1. The ratio of positive to negative reviews was roughly 1:1.
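The review preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 1-10 score scale, the threshold of 5, and the vocabulary size are all assumptions, since the post does not state them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def binarize_score(score, threshold=5):
    """Map a rating to 1 (positive) or 0 (negative).

    Assumption: scores are on a 1-10 scale and anything above
    `threshold` counts as positive.
    """
    return 1 if score > threshold else 0


def build_vocab(reviews, max_words=10000):
    """Assign an integer id to the most frequent words; 0 is kept for unknown words."""
    counts = Counter(word for text in reviews for word in tokenize(text))
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab):
    """Turn a review into a vector of word ids that a sequence model can consume."""
    return [vocab.get(word, 0) for word in tokenize(text)]
```

A vocabulary built from the training reviews would then be reused unchanged when encoding New York Times text, so that the same word always maps to the same id.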

Social media: number of Tweets with hashtags of the movie title.

We use data from Twitter to represent the social influence of the movie. For simplicity, we only take the number of tweets into consideration. We used the Twitter premium API to estimate the number of tweets related to the target movie each day.

Since the API is very expensive and has very strict rate limits (100 requests per month), we need to store the request results and reuse them. For each movie, we collected and stored all tweets related to it in a 50-day window. Here is the figure.

In our final implementation, we only used one month of data (from 25 days before the release date to 5 days after it).
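Storing request results and then cutting them down to the one-month window could look roughly like this. It is a sketch under assumptions: `fetch` stands in for whatever function actually calls the Twitter premium API, and the cache directory, file naming, and the `{"YYYY-MM-DD": count}` layout are hypothetical.

```python
import json
import re
from datetime import date, timedelta
from pathlib import Path


def load_or_fetch(movie, fetch, cache_dir="tweet_cache"):
    """Return daily tweet counts for a movie, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical callable that takes the movie title and
    returns a dict of {"YYYY-MM-DD": tweet_count}.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", movie.lower()).strip("_")
    path = Path(cache_dir) / f"{slug}.json"
    if path.exists():
        return json.loads(path.read_text())
    counts = fetch(movie)  # the only place a rate-limited request is spent
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(counts))
    return counts


def select_window(counts, release, days_before=25, days_after=5):
    """Keep only the days from 25 days before release to 5 days after it."""
    start = release - timedelta(days=days_before)
    end = release + timedelta(days=days_after)
    return {d: n for d, n in counts.items()
            if start <= date.fromisoformat(d) <= end}
```

Caching the wider 50-day result and trimming it afterwards means the window can be changed later without spending another request.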

Public media articles

Although the data collected through the New York Times API includes the actual article content, only the headlines were used for NLP analysis to reflect professional opinions about the movie. The article body contains far more information than the headline alone, but most headlines were already representative of the content, so including the remaining sentences would mostly add redundancy. Processing only headlines also improves the performance of the software.

Model Structure and Implementation Detail


Our predictor is made of two parts, base and variance:

box office = base + variance 

The base value comes from a single linear regression model with features from the IMDB dataset. The base part relies on static features and does not respond to real-time information. In fact, using the average box office is good enough for this part.

For the variance part, we first use a pair of linear regression models, one with the number of tweets as its feature and one with the article ratings. The results of these two variance models are then treated as a pair of features for a third variance linear regression model.
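The structure above (one base regression, two variance regressions, and a third regression that combines them) can be sketched in plain Python as follows. Several things here are assumptions rather than the project's confirmed method: the function names are made up, the least-squares solver is a small hand-written one instead of a library, and the variance models are trained on whatever revenue the base model leaves unexplained, which the post does not state explicitly.

```python
def fit_ols(rows, targets):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Normal equations: (X^T X) w = X^T y
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w


def predict(w, row):
    """Evaluate a fitted linear model on one feature row."""
    return w[0] + sum(c * v for c, v in zip(w[1:], row))


def fit_predictor(imdb, tweets, articles, revenue):
    """Fit the base model, the two variance models, and the combining model."""
    base = fit_ols(imdb, revenue)
    # Assumption: variance models learn what the base model leaves unexplained.
    residual = [y - predict(base, x) for x, y in zip(imdb, revenue)]
    tweet_model = fit_ols(tweets, residual)
    article_model = fit_ols(articles, residual)
    # Outputs of the two variance models become features of the third one.
    stacked = [[predict(tweet_model, t), predict(article_model, a)]
               for t, a in zip(tweets, articles)]
    combiner = fit_ols(stacked, residual)
    return base, tweet_model, article_model, combiner


def predict_box_office(models, imdb_row, tweet_row, article_row):
    """box office = base + variance"""
    base, tweet_model, article_model, combiner = models
    variance = predict(combiner, [predict(tweet_model, tweet_row),
                                  predict(article_model, article_row)])
    return predict(base, imdb_row) + variance
```

Note that the hand-written solver fails with a division by zero when a feature is constant across all training movies, so a real implementation would more likely rely on an established regression library.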

Data processing

#### IMDB data
We used the IMDB dataset from Kaggle and cleaned it down to the features that are most relevant and most likely to affect the final revenue. To narrow down the dataset, only movies that are similar to the target movie are considered for training. Movies count as similar when they share production companies, have similar genres, and have a similar release time.
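One way to turn this similarity definition into a filter is sketched below. The record layout (`companies`, `genres`, `year`), the Jaccard overlap used for "similar genres", and the 10-year gap used for "similar release time" are all assumptions chosen for illustration; the post does not give the exact criteria.

```python
def genre_overlap(a, b):
    """Jaccard overlap between two genre lists (1.0 means identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similar_movies(target, candidates, k=10, max_year_gap=10):
    """Pick up to k training movies that resemble the target.

    Assumed record layout: {"title": str, "companies": [...],
    "genres": [...], "year": int}.
    """
    pool = [
        m for m in candidates
        # must share at least one production company with the target
        if set(m["companies"]) & set(target["companies"])
        # must have been released within `max_year_gap` years of the target
        and abs(m["year"] - target["year"]) <= max_year_gap
    ]
    # rank the remaining movies by how closely their genres match
    pool.sort(key=lambda m: genre_overlap(m["genres"], target["genres"]),
              reverse=True)
    return pool[:k]
```

The parameter `k` plays the role of the "number of similar movies" setting that controls how many movies feed the training set.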

#### Twitter
We map the number of tweets in each time interval (7 days by default) to a single feature: the number of related tweets i days before the release date is counted toward feature number i/7, rounded down. Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We use the premium search API to get the related tweets in a 30-day window (04/01 ~ 05/01), which are mapped to the features X3, X2, X1 and X0.
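The mapping from daily tweet counts to the features X3 through X0 can be sketched like this. The function name and the input layout are hypothetical, and folding the days after the release date into X0 is an assumption, since the post only defines the index for days before release.

```python
from datetime import date


def weekly_tweet_features(daily_counts, release, interval=7, n_features=4):
    """Sum daily tweet counts into interval-sized buckets.

    `daily_counts` maps a `date` to the number of tweets on that day.
    The returned list is [X0, X1, X2, X3], where X_i collects the days
    that are i * interval to (i + 1) * interval - 1 days before release.
    """
    features = [0] * n_features
    for day, count in daily_counts.items():
        days_before = (release - day).days
        # Assumption: days after the release are counted toward X0.
        index = max(days_before, 0) // interval
        if index < n_features:
            features[index] += count
    return features
```

With a release date of 04/26/2019, 04/01 is 25 days before release and lands in X3, while 04/20 through 05/01 all land in X0.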

Twitter has grown very quickly over the past 5 years, and more and more people have begun talking about movies on Twitter. For older movies, there is a huge gap between the number of tweets at the time and the number we would see nowadays. To weaken the influence of this problem, we apply the square root to the tweet counts.
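A small sketch of the square-root step, with a hypothetical function name, shows why it helps: a 100-fold gap in raw counts shrinks to a 10-fold gap after the transform.

```python
import math


def damp_tweet_counts(features):
    """Apply a square root to each tweet-count feature to compress large gaps."""
    return [math.sqrt(x) for x in features]


# Illustrative numbers only: raw counts that differ by a factor of 100 ...
old_movie, new_movie = damp_tweet_counts([100, 10000])
# ... differ by a factor of 10 after the transform.
ratio = new_movie / old_movie
```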

#### NYT articles
New York Times articles were used to provide a professional view of the movie. An LSTM (long short-term memory) natural language processing model was trained on IMDB movie reviews. Each training example contains the review text and a label: 0 for a negative review or 1 for a positive review. The model outputs a floating-point rating from 0 to 1 that represents how positive the review is. The New York Times articles were fed into the NLP model to evaluate their attitude.

For each movie, all the articles related to the movie within a time window were fed into the model to produce an average score. Similar movies that share the same keywords were fed into the model as well. The average article rating was found to be 0.91 for Avengers: Endgame and 0.96 for Avengers: Infinity War.
The number of articles for each movie depends on the window period and on how many articles exist within it. The default window was set to the 30 days before the movie release, and all New York Times articles from this period were used. The final output of this model was the average score of the movie and the total number of articles. For movies that do not have any articles, both the score and the count were set to 0.
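The aggregation step described above can be sketched as follows. The function name and the `(publication_date, score)` input layout are hypothetical, and treating the release day itself as part of the window is an assumption.

```python
from datetime import date, timedelta


def summarize_articles(articles, release, window_days=30):
    """Return (average score, article count) for articles inside the window.

    `articles` is a list of (publication_date, score) pairs, where each
    score is the 0-to-1 rating produced by the NLP model.
    """
    start = release - timedelta(days=window_days)
    # Assumption: the window runs from `start` through the release day.
    scores = [score for published, score in articles
              if start <= published <= release]
    if not scores:
        # Movies without any articles get a score of 0 and a count of 0.
        return 0.0, 0
    return sum(scores) / len(scores), len(scores)
```

Returning the count alongside the average lets the downstream regression distinguish a movie with no coverage from one whose coverage was genuinely negative.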

Visualization

For Avengers: Endgame, if we set the number of similar movies to 20, the prediction model gives a base value of 663.9 million. In our final code, the number of similar movies is set to 10 or 15 because we wanted to minimize the usage of the expensive Twitter API, and the base value became 888.9 million, which is similar to the previous movie, Avengers: Infinity War. This is reasonable considering the features we chose.

To visualize the data, we picked two features. The figure below shows the positive correlation between two of the features and the box office.

In fact, we can tell from social media and news that this movie is actually much more popular than Infinity War. Social media is able to capture this information, as the figure below shows:

The first peak occurred on April 2nd, which is the pre-sale date, and from April 22nd, the date of the world premiere, the amount of discussion kept growing to its climax. Combining this information with our prediction, we got a variance value of 1.95 billion, for a total of 2.8 billion in global revenue.
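As a quick check of the total, the decomposition box office = base + variance can be worked through with the 888.9 million base value from the final code and the 1.95 billion variance value, both expressed in billions:

```latex
\[
\text{box office} = \underbrace{0.8889}_{\text{base}} + \underbrace{1.95}_{\text{variance}} = 2.8389 \approx 2.8 \text{ billion}
\]
```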

Feelings

We are so excited that our project was voted the best overall project in the final presentation (4 votes from TAs). We did not make a very attractive poster for our project, but we did spend a lot of time on coding. One TA told us that our model was more complicated than those of most groups. Data collection, cleaning, and analysis in this project were really challenging. Our project can help customers decide whether to watch a movie, and it can be used by cinemas to decide how to allocate resources among different movies.

We used many techniques mentioned in this course, such as data cleaning (hw1), web scraping (hw2), NLP (hw7), machine learning (hw5), MapReduce (hw3), and data visualization (hw6). We have learned how to be a novice data scientist.