Blog Post 1

About the Project

Is it impossible to predict the box office of a movie? Maybe not :) With “big data” resources and “machine learning” methods, we may be able to make reasonably accurate forecasts. That is what we are trying to achieve in this project. Starting with one popular movie, Avengers 4: EndGame, which will be released in April 2019, we will see how different features of a movie influence its box office. The features we consider fall mainly into three categories: the financial influence of the movie crew, social comments, and the economic background of the movie industry. Building on the prediction model for this one movie, we hope to generalize the forecasting process into a tool for selecting the upcoming movie with the most box office potential.

Goals

  • Predict the first-day box office of Avengers 4: EndGame.
  • Predict the total box office of Avengers 4: EndGame.
  • Generalize the movie box office forecast model.

Data

Movie features: crew, plot keywords, similar movies, release information.

We use a dataset downloaded from Kaggle. After data cleaning, here is part of the information.

Social influence: reviews, hashtags on social media

We use data from Twitter to represent the social influence of the movie. For simplicity, we first consider only the number of tweets. In a later version, we will also use NLP methods to analyze the content of the tweets.

Methods

We divide the box office of a movie into two parts: base and variance.

box office = base + variance 

Then we use the dataset from Kaggle to predict the base value and the data from Twitter to predict the variance part.
The Kaggle dataset has many useful features that help us predict the base value of the box office of a film. However, it is missing one necessary field: the box office record for the first day of release. Box Office Mojo is a website that offers this information: https://www.boxofficemojo.com/alltime/days/?page=open&p=.htm. Therefore, an additional step of web scraping and data cleaning is needed for the movie feature data.


For the base part, we do a multiple regression analysis, where Y is the base value of the box office and X is a set of movie features. For the variance part, we will adopt machine learning methods to analyze the Twitter data and also use regression to predict the value.
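
As a minimal sketch of the multiple regression for the base part, the snippet below fits ordinary least squares through the normal equations using only the Python standard library. The two features (budget and opening theaters), their units, and all numbers are made-up placeholders, not values from our Kaggle dataset, and the function names are hypothetical. In practice a library routine would normally replace the hand-written solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are left untouched
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_base_model(X, y):
    """Ordinary least squares via the normal equations (A^T A) w = A^T y.

    Returns [intercept, w1, ..., wk].
    """
    A = [[1.0] + list(map(float, row)) for row in X]  # leading 1.0 = intercept
    k = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(k)]
    return solve_linear_system(AtA, Aty)


def predict_base(coef, features):
    """Predicted base value for one movie's feature list."""
    return coef[0] + sum(w * f for w, f in zip(coef[1:], features))


# Hypothetical features: [budget in $M, opening theaters in thousands]
X_train = [[100, 3.0], [150, 3.5], [200, 4.0], [250, 4.2], [300, 4.5]]
# Synthetic targets generated as 10 + 0.5 * budget + 20 * theaters
y_train = [10 + 0.5 * b + 20 * t for b, t in X_train]

coef = fit_base_model(X_train, y_train)
prediction = predict_base(coef, [220, 4.1])
```

Because the synthetic targets are an exact linear function of the features, the fitted coefficients recover the generating values up to floating-point error, which makes the sketch easy to sanity-check before swapping in real data.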

Timeline

Feb/2019: Topic discussion & Data preparation
Mar/1 - Mar/15: Methodology research & Data preprocessing I
Mar/15 - Mar/31: Data preprocessing II & Train model for predicting base value (first day box office)
Mar/30 - Apr/15: Train model for predicting base value (whole box office) & variance value (first day box office)
Apr/15 - Apr/30: Train model for predicting variance value (real time)

Current Status

Progress

We crawl all tweets related to a movie during the month before its release date and compute the number of tweets on each day. This gives us 30 features x0, … , x29. So to predict the box office of Avengers 4, we need to crawl not only all tweets related to Avengers 4 but also the tweets for a set of movies similar to Avengers 4, which forms our training set. We need to crawl the tweets for these movies one by one (MapReduce).
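
The daily counts can be turned into the 30 features with a few lines of Python. This is only a sketch: the function name is hypothetical, the release date and timestamps are illustrative, and we assume here that x0 is the earliest day (30 days before release) and x29 is the day just before release.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_tweet_features(created_at_list, release_date, window_days=30):
    """Return [x0, ..., x29]: tweet counts per day before release.

    Assumed ordering: x0 is window_days days before release and the
    last entry is the day before release. Tweets outside the window
    (including release day itself) are ignored.
    """
    counts = Counter(ts.date() for ts in created_at_list)
    start = release_date - timedelta(days=window_days)
    return [counts.get(start + timedelta(days=i), 0) for i in range(window_days)]


# Illustrative timestamps and release date, not crawled data
timestamps = [
    datetime(2019, 3, 27, 10, 0),
    datetime(2019, 3, 27, 23, 59),
    datetime(2019, 4, 25, 8, 0),
    datetime(2019, 4, 26, 0, 0),  # release day, outside the window
]
features = daily_tweet_features(timestamps, date(2019, 4, 26))
```

Days with no tweets get a count of 0, so every movie yields a vector of the same length, which is what the regression step needs.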

To use the Twitter API, we first created a Twitter developer account and obtained credentials. Then we created an app on https://developer.twitter.com/en/apps. Note that Twitter is very concerned about users’ privacy: you have to describe in detail the functionality of your app and how you will use the data you get from Twitter. After creating the app, we can get the Consumer API keys and Access tokens and use the Twitter APIs in our program.

In our project, we chose Tweepy, a Python wrapper around the Twitter API, because it is simple to use yet fully supports the Twitter API. There are many other libraries in various programming languages that let you use the Twitter API.

The first step is to set up Tweepy to authenticate with the Twitter credentials:

import tweepy

# The four credentials come from the app page on developer.twitter.com
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Then we can use the search API to get tweets related to the hashtag “TheAvengers” with the following code:

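import csv

# "tweets.csv" is a placeholder output file; sinceId and maxId bound the tweet IDs to fetch
with open("tweets.csv", "a", newline="", encoding="utf-8") as f:
    csvWriter = csv.writer(f)
    # api.search is the Tweepy 3.x name (renamed to api.search_tweets in Tweepy 4)
    for tweet in tweepy.Cursor(api.search, q="#TheAvengers", count=100,
                               lang="en",
                               since_id=sinceId,
                               max_id=maxId).items():
        print(tweet.created_at, tweet.text, tweet.id)
        csvWriter.writerow([tweet.created_at, tweet.id, tweet.text])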

The result looks like:

Note that in the latest Twitter search API, Twitter replaced the parameters since and until with since_id and max_id, where a tweet ID is an integer that increases monotonically over time. Another problem is that the search index has a 7-day limit. In other words, if you want to get all tweets over a 30-day period, you have to call this API at least 5 separate times, one crawl per window of at most 7 days.
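
The count of 5 is simply 30 days divided into windows of at most 7 days, rounded up. A small sketch of that arithmetic follows; the function name and the end date are hypothetical, and the code only plans the date windows, it does not call the Twitter API.

```python
from datetime import date, timedelta
from math import ceil


def search_windows(end_day, total_days=30, window_days=7):
    """Split the total_days ending on end_day (inclusive) into consecutive
    windows of at most window_days days.

    Returns a list of (first_day, last_day) pairs, oldest window first.
    """
    first = end_day - timedelta(days=total_days - 1)
    windows = []
    while first <= end_day:
        last = min(first + timedelta(days=window_days - 1), end_day)
        windows.append((first, last))
        first = last + timedelta(days=1)
    return windows


# Illustrative end date: the 30 days ending on 2019-04-25
windows = search_windows(date(2019, 4, 25))
expected_calls = ceil(30 / 7)
```

Four of the windows are full 7-day spans and the last one covers the remaining 2 days. Each window has to be crawled while it is still inside the 7-day search index, so the crawls need to be spread over the month rather than run all at once.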

Next Steps

Goals

Data preprocessing II: Use MapReduce to process the data. We set the date as the key and count the number of tweets on each day.
Collect the first-day box office data for the movies.
Train model for predicting base value (first day box office): linear regression model training.
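
The date-keyed counting planned for Data preprocessing II can be sketched as a mapper and a reducer. This is a single-process simulation of the MapReduce pattern, not a job for an actual cluster. The function names are hypothetical, and we assume each CSV row is [created_at, tweet_id, text] with created_at formatted as "YYYY-MM-DD HH:MM:SS".

```python
from collections import defaultdict


def mapper(row):
    """Emit (date, 1) for one CSV row [created_at, tweet_id, text]."""
    created_at = row[0]
    # Assumed format "YYYY-MM-DD HH:MM:SS": the first 10 characters are the date
    yield created_at[:10], 1


def reducer(key, values):
    """Sum the counts emitted for one date."""
    return key, sum(values)


def run_mapreduce(rows):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Illustrative rows, not crawled data
rows = [
    ["2019-03-27 10:00:00", "1", "first tweet"],
    ["2019-03-27 11:30:00", "2", "second tweet"],
    ["2019-03-28 09:15:00", "3", "third tweet"],
]
tweets_per_day = run_mapreduce(rows)
```

Keeping the mapper and reducer as separate functions means the same logic could later be moved to a real MapReduce framework with little change.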