Crimes In Boston
Posted in Analysis
Data Context¶
Crime incident reports are provided by Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred.
Data Content¶
Records begin on June 14, 2015 and continue through September 3, 2018.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
After importing pandas and matplotlib (with the Jupyter notebook '%matplotlib inline' magic), I am storing the crime.csv file in a pandas DataFrame named 'crime'. I'm bringing in only the columns I think I will need by passing them to the 'usecols' argument. From here I want to take a quick preview of the data with the '.head()' method.
cols = ['INCIDENT_NUMBER','OFFENSE_CODE_GROUP','SHOOTING',
'OCCURRED_ON_DATE','STREET']
crime = pd.read_csv('crime.csv', usecols = cols)
crime.head()
A quick peek at the data looks good so far; now let's view the various attributes of the dataset. I'm using '.shape' to see that there are 638,148 rows and 5 columns. I also want to view the dtypes for each column to get an idea of how pandas is classifying the data.
crime.shape
crime.dtypes
All columns are showing as datatype 'object'. This makes sense given the data preview from earlier. Next I want to examine the data for missing values. I will call '.isna().sum()' on the 'crime' DataFrame to show the number of rows with missing data in each column. It looks like the 'SHOOTING' column has 636,109 missing values. After taking another look at the csv file, it appears that this column either contains a 'Y' to indicate there was a shooting involved or is blank to indicate there was not. Another column with missing values is 'STREET', with 21,742 values missing.
crime.isna().sum()
Now I am a little intrigued by the 1 missing value in both the 'OFFENSE_CODE_GROUP' and 'OCCURRED_ON_DATE' columns. After filtering the DataFrame with '.isnull()' to show the null row in the 'OCCURRED_ON_DATE' column, I discovered a junk row that contained '=======' in the 'INCIDENT_NUMBER' column. I am going to drop this row by filtering it out of the DataFrame and storing the result back into 'crime'.
crime[crime['OCCURRED_ON_DATE'].isnull()]
crime = crime[crime.INCIDENT_NUMBER != '=======']
OK, so that took care of that one rogue row...now to clean up the other two columns with missing values. For the 'SHOOTING' column I am using '.fillna()' to replace missing values with 'N'. For the 'STREET' column I am replacing its missing values with 'Unk'. In both cases I am passing the 'inplace=True' parameter so the changes affect the original DataFrame.
crime.isna().sum()
crime.SHOOTING.fillna('N',inplace=True)
crime.STREET.fillna('Unk', inplace=True)
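As a small self-contained sketch of what '.fillna()' does here (using a toy Series rather than the actual crime data, so the values below are illustrative):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the SHOOTING column: 'Y' when a shooting occurred, NaN otherwise
col = pd.Series(['Y', np.nan, np.nan, 'Y', np.nan])

# fillna returns a new Series; passing inplace=True would instead modify col directly
filled = col.fillna('N')
print(filled.tolist())  # ['Y', 'N', 'N', 'Y', 'N']
```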
The last manipulation I want to make to the dataset is changing the 'OCCURRED_ON_DATE' column from an object dtype to a datetime dtype using the 'pd.to_datetime()' method. Doing this will allow me to easily slice the date into its various components (year, month, day name) and perform analysis with those views. I passed the "errors='coerce'" argument to guard against invalid dates: any value that fails to parse becomes 'NaT' (Not a Time).
crime['OCCURRED_ON_DATE'] = pd.to_datetime(
crime['OCCURRED_ON_DATE'],
errors='coerce',
format='%m/%d/%Y %H:%M')
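The coercion behavior can be sketched with a toy example (synthetic date strings, not the actual crime data):

```python
import pandas as pd

# One deliberately malformed entry among valid dates
dates = pd.Series(['06/14/2015 00:00', 'not a date', '09/03/2018 21:00'])
parsed = pd.to_datetime(dates, errors='coerce', format='%m/%d/%Y %H:%M')

# With errors='coerce', the unparseable entry becomes NaT instead of raising
print(parsed.isna().sum())  # 1
```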
Previewing the data one last time before getting into the analysis, this time using '.tail()' to view the last 5 rows.
crime.tail()
What type of crimes are most common?¶
To start, I want to look at what types of crimes are most common within the entire dataset. Sidenote: these are not all crimes per se; they are really incidents that involve the Boston PD. With that said, let's take a look by calling '.value_counts()' on the 'OFFENSE_CODE_GROUP' column. I've also plotted the top 10 "crimes" on a bar chart for some visualization.
crime.OFFENSE_CODE_GROUP.value_counts().head()
crime.OFFENSE_CODE_GROUP.value_counts().nlargest(10).plot(kind='bar')
So it looks like the most common incident in the dataset is 'Motor Vehicle Accident Response', with 74,264 occurrences. To take this a bit further, I wanted to view the months that had the most Motor Vehicle Accident Responses. I did this by filtering the 'crime' DataFrame on 'Motor Vehicle Accident Response' and storing the result in a new DataFrame called 'accidents'. I then passed the 'OCCURRED_ON_DATE' column into the '.groupby()' method, converting it to month names with '.dt.month_name()'.
accidents = crime[crime['OFFENSE_CODE_GROUP']=='Motor Vehicle Accident Response']
months=['January','February','March','April','May','June',
'July','August','September','October','November','December']
(accidents.groupby(accidents['OCCURRED_ON_DATE'].dt.month_name())
    .OFFENSE_CODE_GROUP.count().reindex(months).plot(kind='bar'))
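The '.reindex(months)' step is there because grouping by month names returns the groups in alphabetical order (April, August, ...); reindexing with the ordered month list restores calendar order. A minimal sketch with synthetic dates (not the crime data):

```python
import pandas as pd

# Two January dates and one February date
dates = pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-10'])
s = pd.Series(1, index=dates)

# Grouping by month name sorts the groups alphabetically
counts = s.groupby(s.index.month_name()).count()
print(counts.index.tolist())  # ['February', 'January']

# Reindexing with an ordered list restores calendar order
ordered = counts.reindex(['January', 'February'])
print(ordered.tolist())  # [2, 1]
```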
The chart above shows that the summer months of June, July, and August are when Motor Vehicle Accident Responses are most frequent. I then did the exact same thing with 'STREET' to show the top 5 streets in Boston with the most Motor Vehicle Accident Responses. Since there were a number of missing values in the 'STREET' column, 'Unk' shows up with the highest count. Coming in second and third place are 'BLUE HILL AVE' and 'WASHINGTON ST', respectively.
accidents.groupby(accidents['STREET']).OFFENSE_CODE_GROUP.count().nlargest(5)
Which month/day of week has the most shootings?¶
The next column I wanted to analyze is 'SHOOTING'. I created a new DataFrame called 'shooting' by filtering the original 'crime' DataFrame down to rows where the 'SHOOTING' column equals 'Y'. From here I wanted to view the months and days when shooting incidents were most common.
shooting = crime[crime.SHOOTING == 'Y']
month_shootings = (shooting.groupby(shooting['OCCURRED_ON_DATE']
.dt.month_name()).SHOOTING.count()
.reindex(months).plot(kind='bar'))
Based on the above chart, it looks like the summer months of June, July, and August are again leading the results, along with December (so much for that holiday spirit). Now let's take a look at which day of the week resulted in the most shootings. I did another groupby on 'OCCURRED_ON_DATE', but this time I used '.dt.day_name()' to get the day of the week from the datetime column.
days=['Monday','Tuesday','Wednesday','Thursday',
'Friday','Saturday','Sunday']
week_shootings = (shooting.groupby(shooting['OCCURRED_ON_DATE']
    .dt.day_name()).SHOOTING.count().reindex(days).plot(kind='bar'))
And now a by-street view, with another groupby and '.nlargest()' to show the top 5 results...
shooting.groupby(shooting['STREET']).SHOOTING.count().nlargest(5)
Well, there you have it. If you find yourself in Boston during the heat of summer, make sure to avoid the crazy streets of Washington St. and Blue Hill Ave. However, as they say, "Past performance is not indicative of future results"...but don't chance it.
Disclaimer: Don't take any of this analysis seriously. I am merely working with data to learn/explore the Python pandas library.