Final Blog

Vision

Is it possible to predict the box office of a movie? With “big data” resources and “machine learning” methods, we were able to generate some reasonable forecasts. Two months ago, we brainstormed about how to make a reasonable prediction for a hugely popular movie that might break the all-time record, Avengers: Endgame, with the idea that if the method worked, we could apply it to other movies as well.

Since then, we have explored how multiple data sources fit into different models. We used data covering three aspects: historical movies that are similar to the target movie, discussion on social media (Twitter) as a measure of popularity, and news articles from public media (New York Times) as opinions from professionals. As a result, we came up with a way to generalize this forecasting process and turn it into a useful prediction tool for all movies.

Achievements

We achieved the goals we set in the beginning:

  • Predict the total box office of the movie Avengers: Endgame.
  • Generalize the movie box office forecast model.

Data

We collected data from four sources and applied three of them in the final prediction models. Data from IMDB provides basic features of each movie, such as genre and budget. In addition, IMDB reviews, along with their binarized rating scores, were used to train the NLP model that analyzes New York Times articles. Data from Twitter indicates a movie’s popularity. Data from the New York Times shows the attitudes of professional reviewers. Trailer view counts were also collected with the YouTube API. However, because of the large number of movies and the restrictions of the YouTube API, trailer view counts were not available for all movies, so we did not use this feature in our final prediction model. The number of tweets reflects popularity in much the same way as trailer view counts do (arguably better), so dropping the trailer feature should not cost our prediction models much information.

Datasets Introduction

IMDB

Movie features of 45,000 movie records: genres, production companies, release date, IMDB popularity index, IMDB vote average, runtime, budget, revenue.
For the training set of the NLP model, 50,000 IMDB reviews for a variety of movies were used. Each review contains two parts: the review text and the review score. Review texts were tokenized into vectors, and review scores were binarized into either 0 or 1. The ratio of positive to negative reviews was roughly 1:1.

Social media: number of Tweets with hashtags of the movie title.

We use data from Twitter to represent the social influence of the movie. For simplicity, we only take the number of tweets into consideration. We used the Twitter premium API to estimate the number of tweets related to the target movie each day.

Since the API is very expensive and has very strict rate limits (100 requests per month), we need to store the request results and reuse them. For each movie, we collected and stored all tweets related to it in a 50-day window. Here is the figure.

In our final implementation, we only used one month of data (from 25 days prior to the release date to 5 days after the release date).

Public media articles

Although the data collected with the New York Times API includes the actual article content, only the headlines were used for NLP analysis to reflect professional opinions about the movie. The article body contains far more information than the headline alone, but most headlines were already representative of the content, so including the remaining sentences would add mostly redundant text. Using headlines only also improves the performance of the software.

Model Structure and Implementation Detail


Our predictor is made of two parts, base and variance:

box office = base + variance 

The base value comes from a single linear regression model with features from the IMDB dataset. The base part relies on static features and does not react to real-time signals. In fact, simply using the average box office is good enough for this part.

For the variance part, we first use a pair of linear regression models, one with the number of tweets as features and one with the article ratings as features. The outputs of these two models are then treated as a new pair of features for a third linear regression model, which produces the variance value.
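The two-stage variance model described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project code: `fit_linear`, `predict`, `fit_variance_stack`, and `predict_box_office` are hypothetical helper names, the hand-rolled least-squares solver stands in for a library linear regression, and the training target for the variance models is assumed to be the actual revenue minus the base prediction.

```python
def fit_linear(X, y):
    """Least-squares fit with intercept; returns [intercept, w1, w2, ...]."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    n = len(rows[0])
    # Augmented normal equations [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def predict(coef, x):
    """Apply a fitted model to one feature vector."""
    return coef[0] + sum(w * v for w, v in zip(coef[1:], x))


def fit_variance_stack(tweet_feats, article_feats, variance):
    """Fit two first-stage models, then a third model on their outputs."""
    m_tweets = fit_linear(tweet_feats, variance)
    m_articles = fit_linear(article_feats, variance)
    stacked = [[predict(m_tweets, t), predict(m_articles, a)]
               for t, a in zip(tweet_feats, article_feats)]
    m_final = fit_linear(stacked, variance)
    return m_tweets, m_articles, m_final


def predict_box_office(base, models, tweet_x, article_x):
    """Combine the pieces: box office = base + variance."""
    m_tweets, m_articles, m_final = models
    variance = predict(m_final, [predict(m_tweets, tweet_x),
                                 predict(m_articles, article_x)])
    return base + variance
```

Stacking keeps the tweet and article signals in separate first-stage models, so the third model only has to learn how much weight to give each of the two variance estimates.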

Data processing

IMDB data

We used the IMDB dataset from Kaggle and cleaned it down to the features that are most relevant and most likely to affect the final revenue. To narrow down the dataset, only movies that are similar to the target movie are considered for training. Similar movies are those with the same production companies, similar genres, and a similar release time.
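One way to turn this similarity definition into code is sketched below. The scoring rule is an assumption made for illustration (one point for a shared production company, the Jaccard overlap of genres, and a release-year closeness term); the field names `companies`, `genres`, and `year`, the `max_year_gap` parameter, and both function names are hypothetical.

```python
def similarity(target, candidate, max_year_gap=5):
    """Score how similar a candidate movie is to the target (higher is closer)."""
    shares_company = bool(set(target["companies"]) & set(candidate["companies"]))
    genres_t, genres_c = set(target["genres"]), set(candidate["genres"])
    union = genres_t | genres_c
    genre_overlap = len(genres_t & genres_c) / len(union) if union else 0.0
    year_gap = abs(target["year"] - candidate["year"])
    # Assumption: release-time similarity decays linearly to 0 at max_year_gap.
    recency = max(0.0, 1.0 - year_gap / max_year_gap)
    return float(shares_company) + genre_overlap + recency


def most_similar(target, movies, n=10):
    """Return the n candidates with the highest similarity score."""
    return sorted(movies, key=lambda m: similarity(target, m), reverse=True)[:n]
```

The `n` argument corresponds to the number of similar movies used for training, so a smaller `n` means fewer movies whose tweets have to be fetched.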

Twitter

We map the number of tweets in each time interval (7 days by default) into a single feature. The number of related tweets i days before the release date therefore goes into feature number i/7 (rounded down). Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We use the premium search API to get the related tweets in a 30-day window (04/01 ~ 05/01), which are mapped into features X3, X2, X1 and X0.

Twitter has grown quickly over the past 5 years, and more and more people talk about movies on it. For older movies, there is a huge gap between the number of tweets at the time and the number a comparable movie would get today. To weaken the influence of this gap, we apply the square root to the tweet counts.
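The interval mapping and the square-root scaling can be sketched as below. This is an illustration under stated assumptions rather than the project code: `weekly_tweet_features` is a hypothetical name, counts inside one interval are summed before the square root is taken, and days after the release date (negative offsets) are folded into X0.

```python
import math
from collections import defaultdict


def weekly_tweet_features(daily_counts, interval=7):
    """Map {days_before_release: tweet_count} to [X0, X1, ...].

    Feature Xk aggregates the days whose offset i satisfies i // interval == k.
    """
    buckets = defaultdict(int)
    for days_before, count in daily_counts.items():
        # Assumption: days after release (negative offsets) fold into X0.
        index = max(days_before, 0) // interval
        buckets[index] += count
    n_features = max(buckets) + 1 if buckets else 0
    # Square root dampens the gap between older and newer movies.
    return [math.sqrt(buckets[k]) for k in range(n_features)]
```

With a window running from 25 days before release to 5 days after, the offsets 25 down to -5 produce four features, X0 through X3.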

NYT articles

New York Times articles were used to provide a professional view of the movie. An LSTM (long short-term memory) natural language processing model was trained on IMDB movie reviews. Each training example consists of the review text and a label of 0 for a negative review or 1 for a positive review. The model outputs a floating-point rating from 0 to 1 representing how positive the text is. The New York Times articles were fed into this NLP model to evaluate their attitude.

For each movie, all the articles within a time window that relate to the movie were fed into the model to generate an average score. Similar movies that share the same keywords were fed into the model as well. The average article rating was 0.91 for Avengers: Endgame and 0.96 for Avengers: Infinity War.
The number of articles for each movie depends on the window period and on how many articles exist within it. The default window was set to 30 days before the movie release, and all related articles in the New York Times during this period were used. The final output of this model was the average score for the movie and the total number of articles. For movies that do not have any articles, both the score and the count were set to 0.
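The aggregation step can be sketched as follows. The LSTM itself is not reproduced here: `score_fn` is a stand-in for the trained model, `article_features` is a hypothetical name, and treating the window as inclusive at both ends is an assumption.

```python
from datetime import date, timedelta


def article_features(articles, release_date, score_fn, window_days=30):
    """Return (average score, article count) for one movie.

    articles: list of (publication_date, headline) pairs.
    score_fn: callable mapping a headline to a rating in [0, 1];
              a stand-in for the trained LSTM model.
    """
    start = release_date - timedelta(days=window_days)
    # Assumption: the window covers [start, release_date], both ends included.
    scores = [score_fn(headline) for pub_date, headline in articles
              if start <= pub_date <= release_date]
    if not scores:
        return 0.0, 0  # no articles: score and count both default to 0
    return sum(scores) / len(scores), len(scores)


# Example with a dummy scorer and placeholder headlines (not real data).
example = article_features(
    [(date(2019, 4, 20), "headline one"), (date(2019, 1, 1), "too early")],
    release_date=date(2019, 4, 26),
    score_fn=lambda text: 0.5,
)
```

Returning the count next to the average lets the downstream regression tell a movie with one glowing headline apart from one with many.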

Visualization

For Avengers: Endgame, if we set the number of similar movies to 20, the prediction model gives a base value of 663.9 million. In our final code, the number of similar movies is set to 10 or 15 because we wanted to minimize usage of the expensive Twitter API, and the base value became 888.9 million, which is similar to the previous movie, Avengers: Infinity War. This is reasonable considering the features we chose.

To visualize the data, we picked two features. The figure below shows the positive correlation between two of the features and the box office.

In fact, we can tell from social media and news that this movie is much more popular than Infinity War. Social media captures this information, as the figure below shows:

The first peak occurred on April 2nd, the pre-sale date, and from the release date (April 22nd) onward the amount of discussion kept growing to its climax. Combining this information with our prediction, we got a variance value of 1.95 billion, for a total of 2.8 billion in global revenue.

Feelings

We are so excited that our project was voted the best overall project in the final presentation (4 votes from TAs). We did not make a very attractive poster, but we did spend a lot of time on coding. One TA told us that our model was more complicated than those of most groups. Data collection, cleaning, and analysis in this project were really challenging. Our project can help moviegoers decide whether to watch a movie, and it can be used by cinemas to decide how to allocate resources across different movies.

We used many techniques mentioned in this course, such as data cleaning (hw1), web scraping (hw2), NLP (hw7), machine learning (hw5), MapReduce (hw3), data visualization (hw6). We have learned how to be a novice data scientist.

Blog Post 2

Current Status

We trained a box office prediction model that predicts a rough value for the final box office, and a social media model that correlates the amount of discussion about the movie with its box office.

Box Office Prediction Model

To simplify the problem, the model is trained only on Marvel superhero movies, and 5 features are considered for training. Three features describe the director, actors, and actresses: the number of Twitter followers for the top three. One feature captures the influence of release timing: the time passed since the first movie. One feature captures popularity: the number of views of the movie trailers on YouTube.

feature example
[top1_number_of_followers, top2_number_of_followers, top3_number_of_followers, years_past, times_watch_movie_trailers]

Since the amount of data is very limited, we chose a linear regression model to make the prediction. We used linear_model from the sklearn package.

The coefficients for each feature are:
Coefficients:
[16.29878943 -9.82518317 19.13476566 14.24214735 1.05286416]
Prediction:
[614.15516941]

Based on this model, the predicted box office for the movie is 614.2 million, which is much lower than we expected. One possible reason is that the trailer’s view count cannot yet reflect the popularity of Avengers: Endgame, since the trailer was only released on March 13th and there are still two weeks until the premiere. As the view count increases, we can expect a higher prediction.

For the data visualization, we selected the two most influential features as the x and y axes: the Twitter follower count of the top-1 crew member and the trailer view count. The z axis is the box office of the movie. As the plot shows, movies with more followers and more trailer views tend to have a higher box office.

As a next step, we plan to train with all the movies in the same genres as Avengers 4, that is, movies with the genres “Science Fiction”, “Action” or “Adventure”.
More features are also being considered for training: total Facebook likes of the cast and IMDB movie score.

Social Media Model

In the midterm report, we planned to use the Twitter standard API to estimate the number of tweets related to the target movie each day. For a specific movie, we planned to treat the number of related tweets i days before the release date as feature Xi. If we are interested in all tweets in the 30 days prior to the release date, we get 30 features: X30, X29, … , X1.
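The per-day features can be built from a table of daily counts roughly like this. It is a small sketch with a hypothetical function name (`daily_features`) and made-up counts, and it assumes that days with no recorded tweets count as 0.

```python
from datetime import date, timedelta


def daily_features(counts_by_date, release_date, days=30):
    """Return {"X30": ..., ..., "X1": ...} from a {date: tweet_count} mapping."""
    features = {}
    for i in range(days, 0, -1):
        day = release_date - timedelta(days=i)
        features[f"X{i}"] = counts_by_date.get(day, 0)  # missing days count as 0
    return features


# Example with made-up counts for two days.
example = daily_features({date(2019, 4, 25): 120, date(2019, 4, 2): 300},
                         release_date=date(2019, 4, 26))
```

Indexing by days before release, rather than by calendar date, is what lets movies with different release dates share the same feature columns.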

Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We first use the standard search API to get the related tweets from the past 9 days (03/27 ~ 04/04). These are features X30, X29, …, X22. Finally, we can plot the following figure.

The reason why the number of tweets related to Avengers: Endgame is very large on April 2 is that the pre-sales began on that day.

We have completed this function. However, due to the restrictions of the Twitter API, we cannot access very old data. Even after upgrading to a Premium account, we can only make 100 requests each month, which is far from enough for our project. So we may have to give up using Twitter to estimate the popularity of a movie and use data from other sources instead. Currently, we want to use the New York Times API to search for articles related to a movie. Since the number of articles is much smaller than the number of tweets, we may not be able to use the article count directly as a feature; we need to analyze the content of the articles.

Next Steps

Goals

Data - processing: enlarge our training data with more movies in the same genres. Search for features for these new movies.
Train model for predicting base value (first-day box office): linear regression model

Features

We found ways to automate data collection and located additional datasets, so we are going to extend the set of candidate features for our experiments.
The current feature list for the box-office prediction model is as follows:
[movie_title, box_office, release_date, genres, Series, actors, actors_popularity, director, director_popularity, production_company, company_total_gross, trailer_viewcount, pre_sale]

Previously, we could not find a good way to get an indicator of actor popularity through web scraping or an existing API. Many fan accounts and fake accounts use the same name as the actor, and we wanted to avoid counting them, so simple web scraping would not work and we collected the actors’ follower counts manually. This time, we found existing data on Facebook likes for top actors and combined multiple datasets. In addition, we use the YouTube API to query the view count of each movie’s trailer.

Switch from Twitter to NYT Article

The original data source for this project was Twitter. However, due to the limitations of the Twitter API, data collection turned out to be rather hard. Therefore, New York Times articles were used as a replacement. Due to the nature of articles, the analysis methodology needs to be adjusted. The newly proposed method is to apply natural language processing (semantic analysis) to each article and rate each movie based on the result. This article-based rating can be used as one feature of the prediction model.

Timeline

Apr/20 - Apr/25: Data preparation for linear regression training; Social media model training
Apr/26 - Apr/30: Data update and optimization
May/1 - May/2: Poster preparation
May/3 - May/10: Model generalization and conclusion; Blog3

Blog Post 1

About the Project

Is it impossible to predict the box office of a movie? Maybe not :) With “big data” resources and “machine learning” methods, we could make some accurate forecasts. That is what we are trying to achieve in this project. Starting from forecasting one popular movie, Avengers 4: EndGame, which will be released in April 2019, we will see how different features of the movie influence the box office. The features to be considered fall mainly into three categories: the financial influence of the movie crew, social comments, and the economic background of the movie industry. Learning from this prediction model for one specific movie, we hope to generalize the forecasting process into a useful tool for selecting the upcoming movie with the most box office potential.

Goals

  • Predict the first-day box office of the movie Avengers 4: EndGame.
  • Predict the total box office of the movie Avengers 4: EndGame.
  • Generalize the movie box office forecast model.

Data

Movie features: crew, plot keywords, similar movies, release information.

We use the dataset downloaded from Kaggle. After data-cleaning, here is part of the information.

Social influence: reviews, hashtags on social media

We use data from Twitter to represent the social influence of the movie. For simplicity, we first only take the number of tweets into consideration. In a later version, we will also use some NLP methods to analyze the content of tweets.

Methods

We divide the box office of a movie into two parts: base and variance.

box office = base + variance 

Then we use the dataset from Kaggle to predict the base value and use the data from Twitter to predict the variance part.
The Kaggle dataset has many useful features that help us predict the base value of the box office of a film. However, it is missing one necessary field: the box office on the first day of release. Box Office Mojo is a website that offers this information: https://www.boxofficemojo.com/alltime/days/?page=open&p=.htm. Therefore, an additional step of web scraping and data cleaning is needed to complete the movie features.


For the base part, we do a multiple regression analysis, where Y is the base value of the box office and X is a set of movie features. For the variance, we will adopt machine learning methods to analyze the data and also use regression to predict the value.

Timeline

Feb/2019: Topic discussion & Data preparation
Mar/1 - Mar/15: Methodology research & Data - preprocessing I
Mar/15 - Mar/31: Data - preprocessing II & Train model for predicting base value (first day box office)
Mar/30 - Apr/15: Train model for predicting base value (whole box office) & variance value(first day box office)
Apr/15 - Apr/30: Train model for predicting variance value(real time)

Current Status

Progress

We crawl all tweets related to a movie from the month before its release date and compute the number of tweets on each day. This gives 30 features x0, … , x29. So if we want to predict the box office of Avengers 4, we not only need to crawl all tweets related to Avengers 4, but also need to select a set of movies similar to Avengers 4 as our training set. We need to crawl the tweets for these movies one by one (MapReduce).

To use the Twitter API, we first created a Twitter developer account and obtained credentials. Then we created an app on https://developer.twitter.com/en/apps. Note that Twitter is very concerned about users’ privacy: you have to describe in detail the functionality of your app and how you will use the data you get from Twitter. After creating an app, we get the Consumer API keys and Access tokens and can use the Twitter APIs in our program.

In our project, we chose Tweepy, a Python wrapper around the Twitter API, because it is simple to use yet fully supports the Twitter API. There are many other libraries in various programming languages that let you use the Twitter API.

The first step is to set up Tweepy to authenticate with the Twitter credentials:

import tweepy

# Authenticate with the keys and tokens from the Twitter developer app.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Then we can use the search API to get tweets related to the tag “TheAvengers” with the following code:

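# csvWriter is a csv.writer opened beforehand; sinceId and maxId bound the tweet IDs to fetch.
# Note: api.search is the Tweepy 3.x name (Tweepy 4 renamed it to api.search_tweets).
for tweet in tweepy.Cursor(api.search, q="#TheAvengers", count=100,
                           lang="en",
                           since_id=sinceId,
                           max_id=maxId).items():
    print(tweet.created_at, tweet.text, tweet.id)
    csvWriter.writerow([tweet.created_at, tweet.id, tweet.text.encode('utf-8')])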

The result looks like:

Note that in the latest Twitter search API, Twitter replaced the parameters since and until with since_id and max_id; a tweet ID is an integer that increases monotonically on Twitter. Another problem is that the search index has a 7-day limit. In other words, if you want to get all tweets from the past 30 days, you have to call this API 5 times.

Next Steps

Goals

Data - preprocessing II: Use MapReduce to process the data. We set the date as the key and count the number of tweets on each day.
Collect data for the movie first day box office.
Train model for predicting base value (first day box office): linear regression model training.
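
The per-day tweet count in the first goal above can be sketched in a map/reduce style as follows. This is a single-machine illustration, not a distributed job: `map_tweet` and `reduce_counts` are hypothetical names, and the timestamp format is assumed to be the `YYYY-MM-DD HH:MM:SS` string that results from writing a tweet’s created_at value to CSV.

```python
from collections import Counter
from datetime import datetime


def map_tweet(created_at):
    """Map step: emit (date, 1) for one tweet timestamp string."""
    day = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").date()
    return (day.isoformat(), 1)


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each date key."""
    counts = Counter()
    for day, n in pairs:
        counts[day] += n
    return dict(counts)


# Example with placeholder timestamps (not real data).
pairs = [map_tweet(ts) for ts in ["2019-04-02 10:00:00",
                                  "2019-04-02 23:59:59",
                                  "2019-04-03 00:00:01"]]
daily = reduce_counts(pairs)
```

Because the key is the calendar date, the same two functions work unchanged whether the tweets for a movie arrive in one file or in several.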