As with any dataset, the first steps are going to be data exploration and data cleaning. We need to get a better understanding of what we're dealing with. Since you've gone through this process a number of times before on previous days, can you tackle the following challenges on your own?
Answer these questions about how the data is structured:
How many rows and columns does the dataset contain?
Are there any NaN values present?
Are there any duplicate rows?
What are the data types of the columns?
Convert the USD_Production_Budget, USD_Worldwide_Gross, and USD_Domestic_Gross columns to a numeric format by removing the $ signs and , separators.
Note that domestic in this context refers to the United States.
Convert the Release_Date column to a Pandas Datetime type.
Solution for Challenge 1
With any new dataset, it's a good idea to do some standard checks and conversions. I typically start by looking at .shape, .head(), .tail(), .info() and .sample(). Here's what I'm spotting already:
There are thousands of entries in the DataFrame - one entry for each movie. We'll have some challenges formatting the data before we can do more analysis because we have non-numeric characters in our budget and revenue columns.
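As a rough sketch of that first look, here is how those calls behave on a tiny stand-in DataFrame. The three rows and the Movie_Title column name are made up for illustration; only the Release_Date and USD_Production_Budget names come from the challenge text.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has thousands of entries.
data = pd.DataFrame({
    "Release_Date": ["1/1/2000", "2/15/2005", "3/30/2010"],
    "Movie_Title": ["Film A", "Film B", "Film C"],  # assumed column name
    "USD_Production_Budget": ["$1,000,000", "$25,000", "$300,000,000"],
})

n_rows, n_cols = data.shape  # (number of rows, number of columns)
print(n_rows, n_cols)

print(data.head(2))    # first two rows
print(data.tail(2))    # last two rows
print(data.sample(1))  # one randomly chosen row
data.info()            # column names, non-null counts and dtypes
```

Because the budget values contain $ and , characters, Pandas reads that column as text rather than numbers, which is exactly what .info() reveals.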
We can check for NaN values with this line:
data.isna().values.any()
And check for duplicates with this line:
data.duplicated().values.any()
We can see the total number of duplicates by creating a subset and looking at the length of that subset:
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
The fact that there are no duplicates or NaN (not-a-number) values in the dataset will make our job easier. We can also check for null values with .info(), which additionally shows us that we need to do some type conversion.
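To see what these checks report when something is wrong, here is a small sketch on a toy DataFrame that deliberately contains one duplicate row and one NaN. The toy values and the Movie_Title column name are invented for illustration; our actual dataset has neither problem.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 2 repeats row 1, and row 3 has a missing value.
toy = pd.DataFrame({
    "Movie_Title": ["Film A", "Film B", "Film B", "Film C"],
    "USD_Domestic_Gross": ["$10", "$20", "$20", np.nan],
})

has_nan = toy.isna().values.any()               # True if any cell is NaN
has_duplicates = toy.duplicated().values.any()  # True if any row repeats an earlier one

duplicated_rows = toy[toy.duplicated()]         # subset holding only the repeated rows
nan_per_column = toy.isna().sum()               # count of NaN values in each column

print(has_nan, has_duplicates, len(duplicated_rows))
print(nan_per_column)
```

Counting NaN values per column with .isna().sum() is a handy follow-up, since it tells you where the gaps are and not just that they exist.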
Solution for Challenge 2
In order to convert the data in the budget and revenue columns and remove all the non-numeric characters, we can use a nested for loop. We create two Python lists: the characters to remove and the column names. Inside the loop we strip each character with .str.replace() and then convert the column with pd.to_numeric() to achieve our goal.
chars_to_remove = [',', '$']
columns_to_clean = ['USD_Production_Budget',
                    'USD_Worldwide_Gross',
                    'USD_Domestic_Gross']

for col in columns_to_clean:
    for char in chars_to_remove:
        # Replace each character with an empty string.
        # regex=False makes sure '$' is treated as a literal character.
        data[col] = data[col].astype(str).str.replace(char, "", regex=False)
    # Convert column to a numeric data type
    data[col] = pd.to_numeric(data[col])
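If you'd rather avoid the inner loop, one possible alternative (not the approach used in this solution) is to strip both characters in a single pass with a regular-expression character class. The sample values below are made up so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical sample values in the same "$1,000,000" style as the dataset.
data = pd.DataFrame({
    "USD_Production_Budget": ["$1,000,000", "$25,000"],
    "USD_Worldwide_Gross": ["$2,500,000", "$0"],
    "USD_Domestic_Gross": ["$1,200,000", "$0"],
})

columns_to_clean = ["USD_Production_Budget",
                    "USD_Worldwide_Gross",
                    "USD_Domestic_Gross"]

for col in columns_to_clean:
    # Inside [...] the $ is a literal character, so this removes every $ and ,
    cleaned = data[col].astype(str).str.replace(r"[$,]", "", regex=True)
    # Convert the cleaned strings to a numeric data type
    data[col] = pd.to_numeric(cleaned)

print(data.dtypes)
```

Both versions should give the same result; the nested loop is easier to extend with more characters, while the regex version makes one pass per column.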
Solution for Challenge 3
To convert the Release_Date column to a Pandas datetime type, all we need to do is call the pd.to_datetime() function.
data.Release_Date = pd.to_datetime(data.Release_Date)
When we check .info() again we see that the columns now have the desired data type. This allows us to proceed with the next parts of our analysis.
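Here is a minimal sketch of the date conversion on two made-up rows. It assumes the dates are written as month/day/year strings, which may not match the actual file; passing an explicit format is optional, but it makes the parsing unambiguous when you know the layout.

```python
import pandas as pd

# Hypothetical dates, assumed to be in month/day/year order.
data = pd.DataFrame({"Release_Date": ["8/2/1915", "12/18/2009"]})

# Parse the strings into datetime values using the assumed format.
data["Release_Date"] = pd.to_datetime(data["Release_Date"], format="%m/%d/%Y")

print(data.dtypes)
# Once converted, date parts are available through the .dt accessor.
print(data["Release_Date"].dt.year.tolist())
```

With a proper datetime column you can sort chronologically or pull out the year or month through .dt, which is not possible while the values are plain strings.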