Crimes In Boston

Posted in Analysis

Data Context

Crimes in Boston - Kaggle.com

Crime incident reports are provided by Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred.

Data Content

Records begin on June 14, 2015 and continue through September 3, 2018.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

After importing pandas and matplotlib (along with the Jupyter notebook '%matplotlib inline' magic), I am storing the crime.csv file in a Pandas DataFrame named 'crime'. I'm bringing over only the columns I think I will need by passing them to the 'usecols' argument. From here I want to take a quick preview of the data with the '.head()' method.

In [2]:
cols = ['INCIDENT_NUMBER','OFFENSE_CODE_GROUP','SHOOTING',
       'OCCURRED_ON_DATE','STREET']

crime = pd.read_csv('crime.csv', usecols=cols)
In [3]:
crime.head()
Out[3]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP SHOOTING OCCURRED_ON_DATE STREET
0 I182070945 Larceny NaN 9/2/2018 13:00 LINCOLN ST
1 I182070943 Vandalism NaN 8/21/2018 0:00 HECLA ST
2 I182070941 Towed NaN 9/3/2018 19:27 CAZENOVE ST
3 I182070940 Investigate Property NaN 9/3/2018 21:16 NEWCOMB ST
4 I182070938 Investigate Property NaN 9/3/2018 21:05 DELHI ST

A quick peek at the data looks good so far; now let's view the various attributes of the dataset. I'm using '.shape' to see that there are 638,148 rows and 5 columns. I also want to view the dtypes for each column to get an idea of how pandas is classifying the data.

In [4]:
crime.shape
Out[4]:
(638148, 5)
In [5]:
crime.dtypes
Out[5]:
INCIDENT_NUMBER       object
OFFENSE_CODE_GROUP    object
SHOOTING              object
OCCURRED_ON_DATE      object
STREET                object
dtype: object

All columns are showing as datatype 'object'. This makes sense given the data preview from earlier. Next I want to examine the data for missing values. I will call '.isna().sum()' on the 'crime' DataFrame to show the number of rows with missing data in each column. It looks like the 'SHOOTING' column has 636,109 missing values. After taking another look at the csv file, it appears this column either contains a 'Y' to indicate a shooting was involved or is blank to indicate one was not. Another column with missing values is 'STREET', with 21,743 values missing.

In [6]:
crime.isna().sum()
Out[6]:
INCIDENT_NUMBER            0
OFFENSE_CODE_GROUP         1
SHOOTING              636109
OCCURRED_ON_DATE           1
STREET                 21743
dtype: int64
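
A quick way to confirm that 'SHOOTING' really only holds 'Y' or blank is to call '.value_counts()' with 'dropna=False', which also counts the missing entries. Here is a minimal sketch on a small stand-in frame (in the notebook this would run against the full 'crime' DataFrame):

```python
import pandas as pd

# Stand-in frame for illustration; the real check would be
# crime['SHOOTING'].value_counts(dropna=False)
sample = pd.DataFrame({'SHOOTING': ['Y', None, None, 'Y', None]})

# dropna=False keeps the NaN bucket, so the output lists every
# distinct value in the column, missing entries included.
counts = sample['SHOOTING'].value_counts(dropna=False)
print(counts)
```

If anything other than 'Y' and NaN shows up, the Y/blank assumption would need revisiting.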

Now I am a little intrigued by the 1 missing value in both the 'OFFENSE_CODE_GROUP' and 'OCCURRED_ON_DATE' columns. After filtering the DataFrame with '.isnull()' to show the null row in the 'OCCURRED_ON_DATE' column, I discovered a junk row containing '=======' under 'INCIDENT_NUMBER'. I am going to drop this row by filtering it out of the DataFrame and storing the result back into 'crime'.

In [7]:
crime[crime['OCCURRED_ON_DATE'].isnull()]
Out[7]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP SHOOTING OCCURRED_ON_DATE STREET
319073 ======= NaN NaN NaN NaN
In [8]:
crime = crime[crime.INCIDENT_NUMBER != '=======']
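
An alternative that doesn't hard-code the '=======' sentinel is to drop any row whose 'OCCURRED_ON_DATE' is missing, which catches the same junk record. A sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with one good row and the junk row from the real data.
sample = pd.DataFrame({
    'INCIDENT_NUMBER': ['I182070945', '======='],
    'OCCURRED_ON_DATE': ['9/2/2018 13:00', None],
})

# dropna(subset=...) keeps only rows where OCCURRED_ON_DATE is present,
# removing the junk record without matching on its exact contents.
sample = sample.dropna(subset=['OCCURRED_ON_DATE'])
```

This version would also keep working if a future junk row used a different placeholder string.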

OK, so that took care of that one rogue row... now to clean up the other two columns with missing values. For the 'SHOOTING' column I am using '.fillna()' to replace missing values with 'N'. For the 'STREET' column I am replacing its missing values with 'Unk'. In both cases I am passing the 'inplace=True' parameter so the changes affect the original DataFrame.

In [9]:
crime.isna().sum()
Out[9]:
INCIDENT_NUMBER            0
OFFENSE_CODE_GROUP         0
SHOOTING              636108
OCCURRED_ON_DATE           0
STREET                 21742
dtype: int64
In [10]:
crime.SHOOTING.fillna('N',inplace=True)
In [11]:
crime.STREET.fillna('Unk', inplace=True)
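
Worth noting: recent versions of pandas warn against (and pandas 3.x removes) 'inplace=True' on a column accessor like this, since it may operate on a copy. An assignment-based equivalent, sketched on a stand-in frame, avoids the issue:

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the 'crime' DataFrame.
sample = pd.DataFrame({'SHOOTING': [None, 'Y'],
                       'STREET': [None, 'RIVER ST']})

# Assigning the filled Series back to the column is the forward-compatible
# form of fillna(..., inplace=True) on a single column.
sample['SHOOTING'] = sample['SHOOTING'].fillna('N')
sample['STREET'] = sample['STREET'].fillna('Unk')
```

The result is identical; only the assignment style differs.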

The last manipulation I want to make to the dataset is changing the 'OCCURRED_ON_DATE' column from an object dtype to a datetime dtype using the 'pd.to_datetime()' method. Doing this will let me easily slice the date into its various components (year, month, day name) and perform analysis on these views. I needed to pass the "errors='coerce'" argument to make sure I had valid dates; any invalid date results in 'NaT' (Not a Time).

In [12]:
crime['OCCURRED_ON_DATE'] = pd.to_datetime(
                            crime['OCCURRED_ON_DATE'],
                            errors='coerce',
                            format='%m/%d/%Y %H:%M')
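
Since "errors='coerce'" silently turns bad dates into NaT, a quick sanity check afterward is to count the nulls in the parsed column. A minimal sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with one valid date and one unparseable string.
sample = pd.DataFrame({'OCCURRED_ON_DATE': ['9/2/2018 13:00', 'not a date']})

# Any value that fails the format becomes NaT under errors='coerce',
# so summing the nulls reveals how many rows could not be parsed.
parsed = pd.to_datetime(sample['OCCURRED_ON_DATE'],
                        errors='coerce', format='%m/%d/%Y %H:%M')
bad = parsed.isna().sum()
```

A nonzero count here would flag rows that were quietly dropped from any date-based analysis.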

Previewing the data one last time before getting into the analysis, this time using '.tail()' to view the last 5 rows.

In [13]:
crime.tail()
Out[13]:
INCIDENT_NUMBER OFFENSE_CODE_GROUP SHOOTING OCCURRED_ON_DATE STREET
638143 I050310906-00 Warrant Arrests N 2016-06-05 17:25:00 COVENTRY ST
638144 I030217815-08 Homicide N 2015-07-09 13:38:00 RIVER ST
638145 I030217815-08 Warrant Arrests N 2015-07-09 13:38:00 RIVER ST
638146 I010370257-00 Warrant Arrests N 2016-05-31 19:35:00 NEW WASHINGTON ST
638147 142052550 Warrant Arrests N 2015-06-22 00:12:00 WASHINGTON ST

What type of crimes are most common?

To start, I want to look at which types of crimes are most common within the entire dataset. Sidenote: these are not all crimes per se; they are more incidents that involve the Boston PD. With that said, let's take a look by calling '.value_counts()' on the 'OFFENSE_CODE_GROUP' column. I've also plotted the top 10 "crimes" on a bar chart for some visualization.

In [14]:
crime.OFFENSE_CODE_GROUP.value_counts().head()
Out[14]:
Motor Vehicle Accident Response    74264
Larceny                            51870
Medical Assistance                 47080
Investigate Person                 37500
Other                              36150
Name: OFFENSE_CODE_GROUP, dtype: int64
In [15]:
crime.OFFENSE_CODE_GROUP.value_counts().nlargest(10).plot(kind='bar')
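
For a more readable chart, the Axes object the plot call returns can be given a title and axis labels. A hedged sketch, demonstrated on a small stand-in Series of offense counts:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this standalone sketch
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in counts; in the notebook this would be
# crime.OFFENSE_CODE_GROUP.value_counts()
counts = pd.Series({'Motor Vehicle Accident Response': 74264,
                    'Larceny': 51870,
                    'Medical Assistance': 47080})

# Series.plot returns a matplotlib Axes, so labels and a title
# can be attached directly to it.
ax = counts.nlargest(10).plot(kind='bar')
ax.set_title('Top offense groups')
ax.set_xlabel('Offense code group')
ax.set_ylabel('Incident count')
plt.tight_layout()
```

The same pattern applies to every bar chart in this post.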

So it looks like the most common incident in the dataset is 'Motor Vehicle Accident Response' with 74,264 occurrences. To take this a bit further, I wanted to see which months had the most Motor Vehicle Accident Responses. I did this by filtering the 'crime' DataFrame on 'Motor Vehicle Accident Response' and storing the result in a new DataFrame called 'accidents'. I then passed the 'OCCURRED_ON_DATE' column into '.groupby()', converting it to month names with '.dt.month_name()'.

In [16]:
accidents = crime[crime['OFFENSE_CODE_GROUP']=='Motor Vehicle Accident Response']
In [17]:
months=['January','February','March','April','May','June',
       'July','August','September','October','November','December']

accidents.groupby(accidents['OCCURRED_ON_DATE'].dt.month_name()).OFFENSE_CODE_GROUP.count().reindex(months).plot(kind='bar')
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0xcff8ad0>

The chart above shows that the summer months of June, July, and August are when Motor Vehicle Accident Responses are most frequent. I then did the same thing with 'STREET' to show the top 5 streets in Boston with the most Motor Vehicle Accident Responses. Since there were a number of missing values in the 'STREET' column, 'Unk' shows up with the highest count. Coming in second and third are 'BLUE HILL AVE' and 'WASHINGTON ST', respectively.

In [18]:
accidents.groupby(accidents['STREET']).OFFENSE_CODE_GROUP.count().nlargest(5)
Out[18]:
STREET
Unk               6762
BLUE HILL AVE     2492
WASHINGTON ST     2260
DORCHESTER AVE    1628
CENTRE ST         1232
Name: OFFENSE_CODE_GROUP, dtype: int64

Which month/day of week has the most shootings?

The next column I wanted to analyze was 'SHOOTING'. I created a new DataFrame called 'shooting' by filtering the original 'crime' DataFrame down to only the rows where the 'SHOOTING' column equals 'Y'. From here I want to view the months and days when shooting incidents were most common.

In [19]:
shooting = crime[crime.SHOOTING == 'Y']
In [20]:
month_shootings = (shooting.groupby(shooting['OCCURRED_ON_DATE']
                       .dt.month_name()).SHOOTING.count()
                       .reindex(months).plot(kind='bar'))

Based on the chart above, it looks like the summer months of June, July, and August are again leading the results, along with December (so much for that holiday spirit). Now let's take a look at which day of the week resulted in the most shootings. I did another groupby on 'OCCURRED_ON_DATE', this time using '.dt.day_name()' to get the day of the week from the datetime column.

In [21]:
days=['Monday','Tuesday','Wednesday','Thursday',
      'Friday','Saturday','Sunday']

week_shootings = (shooting.groupby(shooting['OCCURRED_ON_DATE']
                .dt.day_name()).SHOOTING.count().reindex(days).plot(kind='bar'))

And now a by-street view, with another groupby and '.nlargest()' to show the top 5 results...

In [22]:
shooting.groupby(shooting['STREET']).SHOOTING.count().nlargest(5)
Out[22]:
STREET
WASHINGTON ST    92
BLUE HILL AVE    64
DUDLEY ST        48
COLUMBIA RD      46
Unk              46
Name: SHOOTING, dtype: int64

Well, there you have it. If you find yourself in Boston during the heat of summer, make sure to avoid the crazy streets of Washington St. and Blue Hill Ave. However, as they say, "Past performance is not indicative of future results"... but don't chance it.

Disclaimer: Don't take any of this analysis seriously. I am merely working with data to learn/explore the Python pandas library.