Blog Post 2

Current Status

We trained a box office prediction model that gives a rough estimate of a movie's final box office, and a social media model that relates the volume of discussion about a movie to its box office.

Box Office Prediction Model

To simplify the problem, the model is trained only on Marvel superhero movies, using 5 features. Three features describe the director, actors, and actresses: the Twitter follower counts of the three most-followed people among them. One feature captures the effect of release timing: the number of years since the first movie. One feature captures popularity: the view count of the movie's trailers on YouTube.

feature example
[top1_number_of_followers, top2_number_of_followers, top3_number_of_followers, years_past, times_watch_movie_trailers]

Since the amount of data is very limited, we chose a linear regression model for the prediction, using linear_model from the sklearn package.

The fitted coefficients for each feature and the resulting prediction are:
Coefficients:
[16.29878943 -9.82518317 19.13476566 14.24214735 1.05286416]
Prediction:
[614.15516941]
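
A fit of this kind with sklearn's linear_model can be sketched roughly as follows. This is a minimal sketch and not our training script: the training rows, the weights used to generate them, and the new feature row are all synthetic placeholders, and only the column order follows the feature example above.

```python
import numpy as np
from sklearn import linear_model

# Column order follows the feature example above.
FEATURES = [
    "top1_number_of_followers",
    "top2_number_of_followers",
    "top3_number_of_followers",
    "years_past",
    "times_watch_movie_trailers",
]

# Synthetic placeholder data (NOT the real Marvel dataset): 20 fake movies
# generated from made-up weights so the fit can be checked.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(20, len(FEATURES)))
true_weights = np.array([2.0, -1.0, 3.0, 1.5, 0.5])  # made up for illustration
y_train = X_train @ true_weights + 100.0              # fake box office values

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# One fitted weight per feature, in the same order as FEATURES.
print("Coefficients:", model.coef_)

# Predict for one new (also made-up) feature row.
X_new = np.array([[5.0, 4.0, 3.0, 11.0, 8.0]])
prediction = model.predict(X_new)
print("Prediction:", prediction)
```

Because the fake targets are an exact linear function of the inputs, the fit recovers the made-up weights. With real data and very few movies, the coefficients are much less stable, which is one reason a negative weight can appear on a feature that should help.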

Based on this model, the predicted box office for Avengers: Endgame is 614.2 million, which is much lower than we expected. One possible reason is that the trailer’s view count does not yet reflect the popularity of the movie, since the trailer was only released on March 13th and there are still two weeks until the premiere. As the view count increases, we expect the prediction to rise.

For the data visualization, we selected the two most influential features as the x and y axes: the Twitter follower count of the most-followed crew member (top1) and the trailer view count. The z axis is the movie's box office. The plot suggests that movies with more followers and more trailer views tend to have a higher box office.

As a next step, we plan to train on all movies in the same genres as Avengers: Endgame, that is, movies tagged “Science Fiction”, “Action”, or “Adventure”.
We are also considering more features for training: the cast's total Facebook likes and the IMDB movie score.

Social Media Model

In the midterm report, we planned to use the Twitter standard API to estimate the number of tweets related to the target movie each day. For a specific movie, we treat the number of related tweets posted i days before the release date as feature Xi. If we are interested in all tweets in the 30 days before the release date, we get 30 features: X30, X29, … , X1.
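
The mapping from tweet dates to the features Xi can be sketched as below. This is only an illustration of the counting step: the function name daily_count_features, the release date, and the tweet dates are hypothetical, and fetching tweets from the API is left out.

```python
from collections import Counter
from datetime import date


def daily_count_features(tweet_dates, release_date, window=30):
    """Return [X_window, ..., X_1], where X_i counts tweets i days before release."""
    counts = Counter()
    for d in tweet_dates:
        days_before = (release_date - d).days
        # Keep only tweets that fall inside the window before the release.
        if 1 <= days_before <= window:
            counts[days_before] += 1
    # Days with no tweets get 0, so the vector always has `window` entries.
    return [counts[i] for i in range(window, 0, -1)]


# Made-up release date and tweet dates, for illustration only.
release = date(2020, 1, 31)
tweets = [
    date(2020, 1, 30),   # 1 day before  -> X1
    date(2020, 1, 30),   # 1 day before  -> X1
    date(2020, 1, 7),    # 24 days before -> X24
    date(2019, 12, 1),   # outside the 30-day window, ignored
]
features = daily_count_features(tweets, release)
```

The list is ordered from X30 down to X1, so features[-1] is the count for the day before release.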

Take the movie Avengers: Endgame as an example, whose release date is 04/26/2019. We first used the standard search API to get the related tweets from the past 9 days (03/27 ~ 04/04). Counting back from the release date, these are features X30, X29, …, X22. Finally, we can plot the following figure.

The number of tweets related to Avengers: Endgame is very large on April 2 because pre-sales began on that day.

We have completed this function. However, due to the restrictions of the Twitter API, we cannot access very old data. Even after upgrading to a Premium account, we can only make 100 requests each month, which is far from enough for our project. So we may have to give up using Twitter to estimate the popularity of a movie and use data from other sources instead. Currently, we want to use the New York Times API to search for articles related to a movie. Since the number of articles is much smaller than the number of tweets, we may not be able to use the article count directly as a feature; we need to analyze the content of the articles.
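
A search against the New York Times Article Search API might be set up along these lines. This is a hedged sketch: the helper names build_search_url and extract_articles are hypothetical, the API key is a placeholder, and the parameter and response field names (q, begin_date, end_date, api-key, response.docs, headline.main) reflect our understanding of the v2 API and should be checked against the current documentation. The network call itself is omitted.

```python
from urllib.parse import urlencode

# Endpoint of the NYT Article Search API (v2), as we understand it.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(movie_title, begin_date, end_date, api_key, page=0):
    """Build a request URL for articles about a movie in a date range.

    Dates are strings in YYYYMMDD form.
    """
    params = {
        "q": movie_title,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_articles(response_json):
    """Pull (publication date, headline, abstract) out of a parsed response."""
    docs = response_json.get("response", {}).get("docs", [])
    return [
        (
            doc.get("pub_date", ""),
            doc.get("headline", {}).get("main", ""),
            doc.get("abstract", ""),
        )
        for doc in docs
    ]


# "YOUR_KEY" is a placeholder, not a working key.
url = build_search_url("Avengers: Endgame", "20190327", "20190404", "YOUR_KEY")
```

The extracted headline and abstract text is what the content analysis would then work on, since the raw article count is too small to use directly.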

Next Steps

Goals

Data processing: enlarge our training data with more movies in the same genres, and collect features for these new movies.
Train a model for predicting the base value (first-day box office): linear regression model

Features

We found ways to automate data collection and found additional datasets, so we are going to extend the set of candidate features for our experiments.
The current feature list for the box-office prediction model is as follows:
[movie_title, box_office, release_date, genres, Series, actors, actors_popularity, director, director_popularity, production_company, company_total_gross, trailer_viewcount, pre_sale]

Previously, we did not find a good way to get an indicator of actor popularity through web scraping or an existing API. Many fan accounts and fake accounts use the same name as an actor, and we want to avoid counting them, so simple web scraping would not work; we collected the actors' follower counts manually. This time, we found existing data on Facebook likes for top actors and combined multiple datasets. In addition, we use the YouTube API to query the view count of each movie’s trailer.
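
The trailer view count query could look roughly like this, assuming the YouTube Data API v3 videos endpoint with part=statistics. The helper names build_stats_url and parse_view_counts, the video ids, the API key, and the sample response are hypothetical placeholders; the HTTP request itself is left out.

```python
from urllib.parse import urlencode

# videos.list endpoint of the YouTube Data API v3.
VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_stats_url(video_ids, api_key):
    """Build a videos.list request for the statistics of several trailers."""
    params = {
        "part": "statistics",
        "id": ",".join(video_ids),  # several ids can be sent in one request
        "key": api_key,
    }
    return f"{VIDEOS_ENDPOINT}?{urlencode(params)}"


def parse_view_counts(response_json):
    """Map video id -> view count (the API returns counts as strings)."""
    return {
        item["id"]: int(item["statistics"]["viewCount"])
        for item in response_json.get("items", [])
        if "viewCount" in item.get("statistics", {})
    }


# Made-up ids and response, shaped like a videos.list result.
stats_url = build_stats_url(["abc", "def"], "YOUR_KEY")
sample_response = {
    "items": [
        {"id": "abc", "statistics": {"viewCount": "1200", "likeCount": "30"}},
        {"id": "def", "statistics": {}},  # statistics can be missing
    ]
}
view_counts = parse_view_counts(sample_response)
```

If a movie has several trailers, their counts can be summed into the single trailer_viewcount feature.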

Switch from Twitter to NYT Articles

The original data source for this project was Twitter. However, due to the limitations of the Twitter API, data collection proved difficult. Therefore, we use New York Times articles as a replacement. Due to the nature of articles, the analysis methodology needs to be adjusted. The new proposed method is to apply natural language processing, specifically semantic analysis, to each article and rate each movie based on the result. This article-based rating can then be used as one feature of the prediction model.
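
To make the idea of an article-based rating concrete, here is a toy sketch. It swaps in a very simple lexicon-based sentiment score as a stand-in for the analysis step, which is not yet decided; the word lists and the function names article_score and movie_rating are made up for illustration.

```python
import re

# Tiny hand-made word lists, for illustration only. A real run would use a
# proper sentiment lexicon or a trained model.
POSITIVE = {"great", "thrilling", "brilliant", "fun", "anticipated"}
NEGATIVE = {"boring", "weak", "disappointing", "dull", "bloated"}


def article_score(text):
    """Score one article in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # Articles with no opinion words are treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


def movie_rating(articles):
    """Average the per-article scores into one feature value for a movie."""
    if not articles:
        return 0.0
    return sum(article_score(a) for a in articles) / len(articles)
```

Averaging per article rather than pooling all words keeps one long review from dominating the rating.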

Timeline

Apr/20 - Apr/25: Data preparation for linear regression training; Social media model training
Apr/26 - Apr/30: Data update and optimization
May/1 - May/2: Poster preparation
May/3 - May/10: Model generalization and conclusion; Blog3