Movie Industry: Python Analysis

sherry salek
Jul 6, 2022

Updated: Jul 20, 2022

From small screens to the largest theaters, the analysis of movie industry data has become important in deciding what kind of media is presented to us. Many factors should be considered when analyzing the success or failure of a movie, such as the producer, production house, director, cast, runtime, genre, script, time of release, user ratings and, last but not least, the marketing. We are going to analyze a movie dataset from the Kaggle website. The data was scraped from IMDb.

Loading and Cleaning the Data

We start our analysis by importing our libraries:

Then we load the data and take a closer look at it to get a general feel for it.
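As a rough sketch of this step, the snippet below imports pandas, reads a CSV and prints the first rows and the shape. The three rows are invented stand-ins, not values from the Kaggle file, and the file name "movies.csv" mentioned in the comment is an assumption.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV: three invented rows, NOT real dataset values.
# With the real file you would call pd.read_csv("movies.csv") instead
# ("movies.csv" is an assumed file name).
toy_csv = io.StringIO(
    "name,genre,budget,gross\n"
    "Movie A,Drama,1000000,2500000\n"
    "Movie B,Comedy,500000,400000\n"
    "Movie C,Action,,9000000\n"
)
df = pd.read_csv(toy_csv)

print(df.head())  # first rows, to get a general feel for the data
print(df.shape)   # (number of rows, number of columns)
```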

What are my features?

We can observe that the data consists of 7668 rows and 15 columns. The columns are:

budget: the budget of a movie. Some movies don't have this, so it appears as 0
company: the production company
country: country of origin
director: the director
genre: main genre of the movie
gross: revenue of the movie
name: name of the movie
rating: rating of the movie (R, PG, etc.)
released: release date (YYYY-MM-DD)
runtime: duration of the movie
score: IMDb user rating
votes: number of user votes
star: main actor/actress
writer: writer of the movie
year: year of release

What are the types?
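One way to answer this is the dtypes attribute of a dataframe. The sketch below uses an invented two-row dataframe, so the printed types only illustrate the idea and are not the types of the real dataset.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "name": ["Movie A", "Movie B"],
        "budget": [1000000.0, 500000.0],
        "year": [1990, 2001],
    }
)

# One line per column with the type pandas inferred for it.
print(df.dtypes)
```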

Finding missing values:

We need to see if we have any missing data. Let's loop through the columns and find the percentage of null values in each one.
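A minimal sketch of such a loop, assuming the dataframe is called df. The rows are invented for illustration, and missing_pct is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "budget": [1e6, np.nan, 5e5, np.nan],
        "gross": [2.5e6, 4e5, np.nan, 9e6],
        "name": ["A", "B", "C", "D"],
    }
)

missing_pct = {}
for col in df.columns:
    # isnull() marks missing cells; the mean of a boolean column is the
    # fraction of True values, so * 100 gives a percentage.
    missing_pct[col] = df[col].isnull().mean() * 100
    print(f"{col}: {missing_pct[col]:.0f}% missing")
```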

There are many ways to handle missing values, such as:

Filling in missing values with a single value

A location-based imputation

Replacing missing values with the median, which is a very common approach

For the sake of time, however, I decided to drop all rows that have missing data.
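The options above can be sketched with fillna() and dropna(). The rows are invented, the variable names are hypothetical, and the location-based variant is left out because the text does not spell out what it refers to.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "country": ["US", "US", "UK", "UK"],
        "budget": [10.0, np.nan, 30.0, 50.0],
    }
)

# Option: fill missing values with a single constant value.
filled_constant = df["budget"].fillna(0)

# Option: fill missing values with the column median.
filled_median = df["budget"].fillna(df["budget"].median())

# The choice made in this post: drop every row with any missing value.
dropped = df.dropna()

print(filled_constant.tolist())
print(filled_median.tolist())
print(len(dropped))
```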

Change data type of columns:

The astype() method lets us set or convert the data type of an existing column in a DataFrame.
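A small sketch of astype() in use. Which columns were converted in the original analysis is not stated, so converting budget and gross from float to integer is an assumption, and the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "budget": [1000000.0, 500000.0],
        "gross": [2500000.0, 400000.0],
    }
)

# Assumed conversion: whole-number amounts stored as floats become integers.
# Missing values must be handled first, since NaN cannot be cast to int64.
df["budget"] = df["budget"].astype("int64")
df["gross"] = df["gross"].astype("int64")

print(df.dtypes)
```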

Create a correct year column:

We need to extract the year from the released column. Running the following should extract the correct year.
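One possible way to do this, assuming released is stored as YYYY-MM-DD as listed in the column descriptions. If the raw file stores the date differently, the parsing step would need adjusting. The column name yearcorrect is hypothetical and the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
# The second row shows a year column that disagrees with the release date.
df = pd.DataFrame(
    {
        "released": ["1980-06-13", "1999-12-31"],
        "year": [1980, 2000],
    }
)

# Parse the release date and keep only its year component.
df["yearcorrect"] = pd.to_datetime(df["released"]).dt.year

print(df)
```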

Sorting the table by the gross column:

We sort the table by the gross column in descending order using .sort_values(). With inplace=False (the default), sort_values() returns a new dataframe with the sorted values. With inplace=True, it returns None and modifies the original dataframe itself. By default, sorting is done in ascending order; passing ascending=False sorts in descending order.
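The difference between the two modes can be seen in a short sketch. The rows are invented and the variable names are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame({"name": ["A", "B", "C"], "gross": [400, 9000, 2500]})

# inplace=False (default): a new sorted dataframe is returned.
sorted_df = df.sort_values(by="gross", ascending=False)
order_before_inplace = df["name"].tolist()  # original df is unchanged

# inplace=True: the original dataframe is modified and None is returned.
result = df.sort_values(by="gross", ascending=False, inplace=True)

print(sorted_df)
print(result)
```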

Checking duplicate values

We check the company column to see if there are any duplicate values. The pandas drop_duplicates() method helps in removing duplicates from a data frame.
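A minimal sketch of this check on invented rows. The studio names are placeholders, not companies from the dataset.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "company": ["Studio X", "Studio Y", "Studio X"],
        "name": ["A", "B", "C"],
    }
)

# duplicated() flags every repeat of a value that was already seen.
has_duplicates = bool(df["company"].duplicated().any())

# drop_duplicates() keeps the first occurrence of each company.
unique_companies = df["company"].drop_duplicates().sort_values(ascending=False)

print(has_duplicates)
print(unique_companies.tolist())
```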

Data Analysis with Visualization

Checking Hypothesis

We are going to check which columns are most correlated with the gross column. The first column we test is the budget column: is it true that the more money a studio spends, the more money the movie brings in? Another candidate is the company, since bigger companies such as "20th century fox corporation" or "Walt Disney" make famous movies.

We can use a scatter plot to compare the budget and the gross. We also add a trendline to the scatter plot to show the positive correlation.
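A sketch of such a plot using matplotlib, with a straight-line trend fitted by numpy.polyfit. The plotting library used in the original analysis is not stated, so this is one possible choice, and the five points are invented.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented points for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "budget": [1.0, 2.0, 3.0, 4.0, 5.0],
        "gross": [2.1, 3.9, 6.2, 7.8, 10.1],
    }
)

# Least-squares straight line: gross is approximated by slope * budget + intercept.
slope, intercept = np.polyfit(df["budget"], df["gross"], deg=1)

fig, ax = plt.subplots()
ax.scatter(df["budget"], df["gross"], label="movies")
ax.plot(df["budget"], slope * df["budget"] + intercept, color="red", label="trendline")
ax.set_xlabel("Budget")
ax.set_ylabel("Gross")
ax.set_title("Gross vs Budget")
ax.legend()
# Call plt.show() or fig.savefig(...) to display or save the figure.
```

A positive slope of the fitted line corresponds to the positive correlation described above.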

Gross vs Budget

The coefficient of determination, R², is closely related to the correlation coefficient, R: for a straight-line fit with a single predictor, R² is the square of R. The correlation coefficient tells you how strong the linear relationship between two variables is.
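The relationship between R and R² for a straight-line fit can be checked numerically. The sketch below uses invented points and computes R² from the residuals of a least-squares line, then compares it with the square of the Pearson correlation.

```python
import numpy as np

# Invented points for illustration only, NOT real dataset values.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gross = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation coefficient R.
r = np.corrcoef(budget, gross)[0, 1]

# R squared from a least-squares line: 1 - residual variation / total variation.
slope, intercept = np.polyfit(budget, gross, deg=1)
predicted = slope * budget + intercept
ss_res = np.sum((gross - predicted) ** 2)
ss_tot = np.sum((gross - gross.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r, r_squared)
```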

We can also check the correlation matrix between all numeric columns:

We choose the Pearson method and show the matrix as a heat map, which makes it easy to see which variables are highly correlated with each other:
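A sketch of the matrix computation on invented rows. Selecting the numeric columns explicitly with select_dtypes() is a choice made here so that the text column is skipped regardless of the pandas version.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "budget": [1, 2, 3, 4],
        "gross": [2, 4, 6, 8],
        "runtime": [120, 90, 110, 100],
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_df = df.select_dtypes(include="number")
correlation_matrix = numeric_df.corr(method="pearson")

print(correlation_matrix)
```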

We change the columns of object type to the category type so that we can check the correlation of all the columns. This gives every field a numeric representation. We use cat.codes to turn each category into its integer code.
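A minimal sketch of this conversion on invented rows. df_numerized is a hypothetical name. Note that the codes follow the sorted order of the category labels, so they are arbitrary identifiers rather than a measure of anything, which is worth keeping in mind when reading correlations that involve them.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "company": ["Studio B", "Studio A", "Studio B"],
        "gross": [10, 20, 30],
    }
)

df_numerized = df.copy()
# Convert every non-numeric column to a category and replace it by its codes.
for col in df_numerized.select_dtypes(exclude="number").columns:
    df_numerized[col] = df_numerized[col].astype("category").cat.codes

print(df_numerized)
```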

Now we visualize it with a heat map:
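One way to draw such a heat map is with matplotlib's imshow(), sketched below on invented rows. The original figure may well have been produced with a different library, so treat this as an illustration of the idea only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "budget": [1, 2, 3, 4],
        "gross": [2, 4, 5, 9],
        "votes": [10, 5, 20, 30],
    }
)
correlation_matrix = df.corr(method="pearson")
labels = list(correlation_matrix.columns)

fig, ax = plt.subplots()
# Fix the color scale to the full range of a correlation coefficient.
image = ax.imshow(correlation_matrix.to_numpy(), cmap="viridis", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each coefficient inside its cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        value = correlation_matrix.iloc[i, j]
        ax.text(j, i, f"{value:.2f}", ha="center", va="center", color="white")

fig.colorbar(image, ax=ax)
ax.set_title("Correlation matrix")
# Call plt.show() or fig.savefig(...) to display or save the figure.
```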

In the visualization above, we can see that the gross and the budget are correlated. We can also see a correlation between the gross and the votes: successful movies tend to receive many votes.

Let's simplify the visualization above:

Let's show the table above in a more organized way. We use the unstack() method, the counterpart of the stack method. Applied to the correlation matrix, it returns a Series indexed by pairs of column names:

Then we sort the pairs. The pandas sort_values() function sorts values in ascending or descending order:

Then we can take a look at the pairs that have a high correlation (> 0.5):
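The three steps above (unstack, sort, filter) can be sketched together as follows. The rows are invented and the variable names are hypothetical. Each column paired with itself has a correlation of 1, so those pairs also pass the filter.

```python
import pandas as pd

# Invented rows for illustration only, NOT real dataset values.
df = pd.DataFrame(
    {
        "budget": [1, 2, 3, 4],
        "gross": [2, 4, 5, 9],
        "runtime": [100, 90, 120, 95],
    }
)
correlation_matrix = df.corr(method="pearson")

# Step 1: flatten the matrix into a Series indexed by (column, column) pairs.
corr_pairs = correlation_matrix.unstack()

# Step 2: sort the pairs by their correlation value (ascending by default).
sorted_pairs = corr_pairs.sort_values()

# Step 3: keep only the pairs with a correlation above 0.5.
high_corr = sorted_pairs[sorted_pairs > 0.5]

print(high_corr)
```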

We find that the company is not correlated with the gross. However, we find that votes are highly correlated with the gross.

Let's check the scatter plot for gross and votes:

Conclusions:

Votes and budget have the highest correlation with gross earnings. Movies that receive more votes tend to earn more.
Company has the lowest correlation with the gross. This suggests that the size and prominence of a production company has little linear relationship with the earnings of a movie.