SIMPLE PYTHON FOR DATA ANALYTICS PROJECT FOR BEGINNERS USING SAMPLE DATA.

Deborah Babundo
9 min read · Mar 14, 2022


“Your first data analytics project does not have to be complex”

image by grow.google.com

Python is one of the most important programming languages data analysts can use to take their skills to the next level. To start analyzing and visualizing data with Python, you need four main libraries: NumPy, pandas, seaborn, and matplotlib. As in any other tech profession, projects matter in data analytics because they showcase your proficiency in analyzing and visualizing data.

When I started practicing Python, I couldn’t find projects simple enough for a beginner like me to follow, so I decided to build my first project around something relatable: quick survey data. Your first Python project does not have to be complex.

For this tutorial, I will be using data I collected from my family and friends with Google Forms (to view the form, check here: QUICK SURVEY FORM). The responses come from a small sample of the population (my family and friends), and the data will be used to answer the questions below:
1. Factors people consider when making ice-cream purchases(in terms of price, brand popularity, and flavor).
2. Change in flavor preference.
3. The most preferred ice cream flavor.
4. The least preferred ice cream flavor.
5. How price affects the sales of ice cream.
6. Preferred choice by age and sex.
You can practice along with me: the data I collected can be accessed here, and the pivot table version can be accessed here.

In order to get started, you will need to have installed your preferred python environment on your system. Also, make sure you have the following dependencies installed on your computer:

  • Python 3.6+
  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn
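
Before going further, it can help to confirm that these libraries are importable in your environment. The snippet below is a minimal sketch using only the standard library; the helper name find_missing is made up for this illustration and is not part of any package.

```python
import importlib.util

# Packages this walkthrough relies on
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def find_missing(packages):
    """Return the names from `packages` that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All dependencies are available.")
```

If anything is reported as missing, install it with your environment's package manager before continuing.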

Now that you are here, I assume you are all set. We will start by importing the libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Next, we will load the data for this project with the read_csv function:

FR= pd.read_csv('Quick Survey-  FORM RESPONSES.csv')
QS= pd.read_csv('Quick Survey - PIVOTTABLE.csv')
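
Right after loading a CSV, it is worth a quick look at what pandas actually read. The sketch below uses a tiny made-up CSV held in memory (the column names and rows are hypothetical stand-ins, not the survey file) so it runs without any download.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey CSV; the real file has different columns.
sample_csv = io.StringIO(
    "Age,Sex,Favorite flavor\n"
    "21,Female,Strawberry\n"
    "24,Male,Vanilla\n"
    "21,Female,Chocolate\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # exact header text, needed for column lookups
print(df.head())            # first few responses
print(df.isna().sum())      # missing answers per column
```

Printing the column list is especially useful here, because the survey questions themselves are the column names and must be typed exactly when selecting a column.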

Next, we will visualize what people consider when purchasing ice cream: their favorite flavor or the price. We want to find out whether they would put the price ahead of their favorite flavor. Let’s start by creating a variable:

#create variable
FR1=FR['When purchasing ice-cream, do you consider the price of the ice-cream or your favorite flavor?']

Next up, we will plot the answers with the seaborn countplot function to find which choice was picked most often. This tells us the most considered factor:

#style first, so it applies to the plot drawn next
sns.set_style('ticks')
#plot: pass the column as x=, since newer seaborn releases treat the first positional argument as data
sns.countplot(x=FR1,hue=FR1,data=FR,palette='icefire_r')
sns.despine()
#labels
plt.title('PURCHASING CONSIDERATION',fontsize=14,fontfamily='verdana')
plt.xlabel('CHOICES',fontsize=14,fontfamily='verdana')
plt.ylabel('COUNT',fontsize=14,fontfamily='verdana')
plt.legend(bbox_to_anchor=(1.7,2.0))
PLOT OUTPUT

Based on the above plot output, we can conclude that the majority of the people in this sample will consider their favorite flavor over the price of the ice cream.
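
A countplot shows bar heights, but it can help to see the numbers behind them. Here is a minimal sketch with pandas value_counts, using made-up answers rather than the real responses:

```python
import pandas as pd

# Hypothetical answers, not the real survey responses.
answers = pd.Series(
    ["Favorite flavor", "Price", "Favorite flavor", "Favorite flavor", "Price"],
    name="consideration",
)

counts = answers.value_counts()                      # the bar heights in a countplot
shares = answers.value_counts(normalize=True) * 100  # the same counts as percentages

summary = pd.DataFrame({"count": counts, "percent": shares.round(1)})
print(summary)
```

With the survey data, the same two calls on the FR1 column would give the exact count and share for each choice.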

Up next, we will check what people consider when purchasing ice cream based on brand popularity and the quality of the ice cream. To create the variables, I will be using the already created pivot table CSV file, which you can find here:

#CREATING VARIABLE
QS1= QS['PURCHASING FACTOR [BRAND]']
QS2= QS['NO OF PEOPLE PER PURCHASING FACTOR.1']
#SPLIT VARIABLE
QS1=QS1.head(3)
QS2=QS2.head(3)
#convert QS2 to int
QS2=QS2.astype(int)
#use the fillna function to fill the no option choices
QS1=QS1.fillna('No Option')
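
Why the head, astype, and fillna steps? A likely reason (an assumption about how the pivot CSV is laid out) is that several small summary tables of different lengths sit side by side in one file, so the shorter columns are padded with empty cells. The sketch below reproduces that situation with a made-up CSV; its column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical pivot-style CSV: two short columns sit beside two longer ones,
# so the short ones are padded with empty cells.
pivot_csv = io.StringIO(
    "FACTOR,PEOPLE,FLAVOR,VOTES\n"
    "Quality,6,Vanilla,4\n"
    ",2,Mint,3\n"
    ",,Mango,1\n"
)
pivot = pd.read_csv(pivot_csv)

print(pivot["PEOPLE"].dtype)  # float64, because of the padded empty cell

labels = pivot["FACTOR"].head(2).fillna("No Option")  # keep 2 rows, name the blank label
people = pivot["PEOPLE"].head(2).astype(int)          # safe: no empty cell left in the slice

print(labels.tolist(), people.tolist())
```

Calling astype(int) on the full padded column would fail, which is why the column is trimmed with head first.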

We will visualize the data using the matplotlib bar function. This plot shows what the people in the sample consider between the popularity and the quality of an ice cream brand:

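#style: set before plotting so it applies to this chart
#(matplotlib 3.6+ names this style 'seaborn-v0_8'; older releases call it 'seaborn')
plt.style.use('seaborn-v0_8')
#LABELS
plt.title('PURCHASING FACTOR BASED ON BRAND',fontsize=12,fontfamily='Verdana')
plt.xlabel('CHOICES',fontsize=12,fontfamily='Verdana')
plt.ylabel('COUNT',fontsize=12,fontfamily='Verdana')
sns.despine()
#6-digit RGB hex codes; matplotlib reads 8-digit codes as RRGGBBAA, not AARRGGBB
cols=('#228B22','#00FF7F','#2E8B57')
#plot
plt.bar(QS1,QS2,color=cols,linewidth=3)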
PLOT OUTPUT

From the above plot, we can deduce that most people in the sample will consider the quality of the ice cream over the popularity of the brand.
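
A side note on the color strings passed to plt.bar: matplotlib accepts 6-digit hex codes (RRGGBB) and 8-digit codes, and it reads the 8-digit form as RRGGBBAA with the transparency last. The short sketch below checks this with matplotlib.colors.to_rgba; the specific color codes are only examples.

```python
from matplotlib import colors

# An 8-digit hex string is read as RRGGBBAA (alpha comes last).
r, g, b, a = colors.to_rgba("#228B2280")
print(round(a, 2))  # alpha taken from the last two digits

# The 6-digit form is fully opaque.
opaque = colors.to_rgba("#228B22")
print(opaque)

# An ARGB-style code with a leading FF is therefore read as
# red=FF, green=22, blue=8B, alpha=22, which is not forest green.
argb_read_as_rgba = colors.to_rgba("#FF228B22")
print(argb_read_as_rgba)
```

If a color copied from another tool looks wrong or nearly transparent, checking it this way shows how matplotlib is interpreting it.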

We know people’s preferences can change, so we will visualize whether their favorite flavor has changed in the last 6 months. We will start by creating variables:

#create variable
QS3=QS['CHANGE IN FAVORITE FLAVOR IN THE LAST SIX MONTH']
QS4=QS['CHANGE CHOICES']
#split variable
QS3=QS3.head(2)
QS4=QS4.head(2)
#convert QS4 to int with astype function.
QS4=QS4.astype(int)

We will plot a bar chart using the matplotlib bar function. This visualizes the sample population’s change in preference over the last 6 months:

#style: set before plotting so it applies to this chart
#(matplotlib 3.6+ names this style 'seaborn-v0_8'; older releases call it 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_style('ticks')
#labels
plt.title('CHOICE PREFERENCE IN THE LAST SIX MONTHS', fontsize=12,fontfamily='verdana')
plt.xlabel('CHOICES',fontsize=12,fontfamily='verdana')
plt.ylabel('COUNT',fontsize=12,fontfamily='verdana')
sns.despine()
cols=('#DDA0DD','#580F41')
#plot
plt.bar(QS3,QS4,color=cols,linewidth=5)
PLOT OUTPUT

From the above plot, we can deduce that the majority of the sample population has not changed their preferred flavor in the last 6 months.
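
The charts in this walkthrough set the title in one step and draw the bars in another, which works when everything runs in one script or one notebook cell. As an alternative, here is a minimal sketch that keeps the labels, style, and bars on one explicit Axes object. The labels and counts are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical labels and counts, not the survey's real numbers.
choices = ["Yes", "No"]
counts = [3, 8]

fig, ax = plt.subplots(figsize=(5, 4))
bars = ax.bar(choices, counts, color=["#DDA0DD", "#580F41"])

# Title and axis labels are set on the same Axes that holds the bars.
ax.set_title("CHOICE PREFERENCE IN THE LAST SIX MONTHS", fontsize=12)
ax.set_xlabel("CHOICES", fontsize=12)
ax.set_ylabel("COUNT", fontsize=12)

# Hide the top and right borders (what sns.despine() does by default).
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

fig.tight_layout()
```

In a script, finish with plt.show() to display the chart or fig.savefig with a file name of your choice to keep it.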

We know that a high price can affect the sales of a product. Next, we will visualize whether the sample population would still purchase their favorite flavor even if it is more expensive, starting with the variables and labels:

#create variable
QS5=QS['PRICE FACTOR[ EXPENSIVE]']
QS6=QS['NO OF CHOICES']
#split
QS5=QS5.head(2)
QS6=QS6.head(2)
#convert QS6 to int with astype function
QS6=QS6.astype(int)
#create labels
plt.title('PRICE FACTOR BASED ON HOW EXPENSIVE',fontsize=12,fontfamily='verdana')
plt.xlabel('CHOICES',fontsize=12,fontfamily='verdana')
plt.ylabel('COUNT',fontsize=12,fontfamily='verdana')

We will visualize this using the matplotlib bar function. The plot shows whether the sample population would still purchase their favorite flavor even if it is more expensive:

#style
cols=('#FF81C0','#F97306')
sns.despine()
sns.set_style('ticks')
#plot
plt.bar(QS5,QS6,color=cols,linewidth=5)
PLOT OUTPUT

Based on the above plot, we can deduce that the majority of the sample population will still consider purchasing their favorite flavor of ice cream even if it is more expensive than others.

I asked the sample population what flavor of ice cream they would purchase if they had all the available resources. Next, we will plot their choices to find the most preferred flavor, but first, let’s create the variables:

#create variable
QS7=QS['MOST PREFERRED FLAVOR']
QS8=QS['NO OF PEOPLE PER FLAVOR']
#split
QS7=QS7.head(10)
QS8=QS8.head(10)
#convert QS8 to int using the astype function
QS8=QS8.astype(int)

We will visualize this using the pie function in matplotlib, with the style set to our preference:

#style
sns.despine()
sns.set_style('ticks')
cols=('#DC143C','#A9561E','#8C000F','#929591','#FF796C','#7FFFD4','#D1B26F','#FFFFCB','#FBDD7E','#E6DDA6')
#plot
plt.pie(QS8,labels=QS7,colors=cols,explode=(0.1,0,0,0,0,0,0,0,0,0),autopct='%0.2f%%',radius=5)
plt.legend(bbox_to_anchor=(3.0,3.0))
PLOT OUTPUT

From the above plot, we can deduce that the most preferred flavor in the sample population is strawberry, chosen by 36.36% of respondents.
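
The percentages that autopct prints on the wedges are simply each count divided by the total. Here is a minimal sketch of that arithmetic, using made-up counts rather than the survey's numbers:

```python
import pandas as pd

# Hypothetical counts per flavor, not the survey's real numbers.
votes = pd.Series({"Strawberry": 5, "Vanilla": 3, "Chocolate": 2})

percent = votes / votes.sum() * 100  # the value autopct formats on each wedge
top_flavor = votes.idxmax()          # label of the largest wedge

print(percent.round(2))
print(top_flavor, f"{percent[top_flavor]:.2f}%")
```

Swapping idxmax for idxmin gives the label with the fewest votes instead.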

We just visualized the most preferred flavor. Do you know what’s next? Analyzing and visualizing the least preferred flavor. First, let’s create the variables:

#create variable
QS9=QS['LEAST PREFERRED FLAVOR']
QS10=QS['NO OF PEOPLE PER FLAVOR.1']
#split
QS9=QS9.head(7)
QS10=QS10.head(7)
#convert QS10 to int using astype function
QS10=QS10.astype(int)

Next up, we will plot the data using the matplotlib pie function, styled to our preference:

#style
sns.despine()
sns.set_style('ticks')
cols=('#DC143C','#A9561E','#FBDD7E','#FFD700','#7FFFD4','#D1B26F','#FFFFCB')
#plot
plt.pie(QS10,labels=QS9,colors=cols,explode=(0,0.1,0,0,0,0,0),autopct='%0.1f%%',radius=1.5)
plt.legend(bbox_to_anchor=(1.9,1.5))
PLOT OUTPUT

37% of the sample population chose chocolate as their least preferred flavor, the one flavor they would not consider even if they had all the resources. If the sample population were the customer base of an ice-cream company, chocolate would be the flavor to produce least, or the one most in need of improvement.

Next up, we will plot the ages of the sample population, but first, let’s create the labels:

#create labels
plt.title('AGE COUNT FOR THE SAMPLE POPULATION',fontsize=12,fontfamily='verdana')
plt.xlabel('AGE',fontsize=12,fontfamily='verdana')
plt.ylabel('COUNT',fontsize=12,fontfamily='verdana')

Now that we have created the labels, we will plot and style the ages of the sample population using the seaborn histplot function (distplot, its older equivalent, is deprecated in recent seaborn releases):

#style
sns.set_style('ticks')
sns.despine()
#plot: histplot draws a count histogram and adds no density curve by default
sns.histplot(FR['Age'])
PLOT OUTPUT

The most common age in the sample is 21, which means more of the people from whom the data was collected are 21 than any other age.
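
The most common value of a column is its mode, and it is not necessarily a majority. Here is a minimal sketch with made-up ages that finds the mode and the share of respondents at that age:

```python
import pandas as pd

# Hypothetical ages, not the survey's real responses.
ages = pd.Series([21, 21, 21, 19, 24, 24, 30, 27], name="Age")

most_common_age = ages.mode().iloc[0]           # most frequent value (smallest one if tied)
share = (ages == most_common_age).mean() * 100  # percent of respondents at that age

print(most_common_age, f"{share:.1f}%")
```

In this made-up example the most common age accounts for well under half of the respondents, which is why it helps to report the share alongside the mode.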

We will now compare the favorite flavor with the ages of the sample population. First, let’s create the label and variables:

#create variable
FR2=FR['Age']
FR3=FR['If you had available resources, what ice-cream flavor would be your most preferred choice?']
#create labels
plt.title("FAVORITE FLAVOR BASED ON AGE",fontsize=12,fontfamily='verdana')

Now that we have created the label and variables, we will visualize the data using the seaborn countplot function:

#style
sns.set_style('ticks')
sns.despine()
#plot
sns.countplot(x=FR2,hue=FR3,data=FR,palette='Spectral')
plt.legend(bbox_to_anchor=(1.5,1.1))
PLOT OUTPUT

The plot explains the choices of the different preferred flavors in the sample data based on the ages of the population.
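
A grouped countplot can get crowded when there are many ages and flavors. The same counts can be read off a table built with pandas crosstab. This sketch uses made-up responses and shortened, hypothetical column names:

```python
import pandas as pd

# Hypothetical responses with shortened column names.
df = pd.DataFrame({
    "Age": [21, 21, 24, 24, 21],
    "Flavor": ["Strawberry", "Vanilla", "Strawberry", "Strawberry", "Strawberry"],
})

# Rows are ages, columns are flavors, cells are respondent counts:
# the same numbers a countplot with x=Age and hue=Flavor draws as bars.
table = pd.crosstab(df["Age"], df["Flavor"])
print(table)
```

Each cell of the table corresponds to one bar in the grouped chart, and empty combinations show up as 0.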

In the same vein, let’s plot the relationship between the least preferred flavor and the ages of the sample population. Let’s create the label and variables:

#create variable
FR2=FR['Age']
FR4=FR['Even if you had all the resources, what ice-cream flavor would you not consider?']
#create labels
plt.title("LEAST FAVORITE FLAVOR BASED ON AGE",fontsize=12,fontfamily='verdana')

We will now plot the relationship between the least preferred flavor and the ages of the sample population:

#style
sns.set_style('ticks')
sns.set_context('poster')
sns.despine()
#plot
sns.countplot(x=FR2,hue=FR4,data=FR,palette='gist_earth')
plt.legend(bbox_to_anchor=(1.9,1.4))
PLOT OUTPUT

The plot explains the choices of the different least preferred flavors in the sample data based on the ages of the population.

We will repeat the process with the sex data of the sample population, starting with the relationship between the preferred flavors and sex. Let’s create the variables and label:

#create variable
FR3=FR['If you had available resources, what ice-cream flavor would be your most preferred choice?']
FR5=FR['Sex']
#create label
plt.title('FAVORITE FLAVOR BASED ON SEX',fontsize=12,fontfamily='verdana')

Let’s create the plot using the seaborn countplot function:

#style
sns.set_style('ticks')
sns.despine()
#plot
sns.countplot(x=FR5,hue=FR3,data=FR,palette='Pastel2_r')
plt.legend(bbox_to_anchor=(1.5,1.1))
PLOT OUTPUT

The plot explains the choices of the different preferred flavors in the sample data based on the sex of the sample population.

Next up, we will plot the relationship between the sex of the sample population and the least preferred flavor. Let’s create the variables and label:

#create variable
FR4=FR['Even if you had all the resources, what ice-cream flavor would you not consider?']
FR5=FR['Sex']
#create labels
plt.title("LEAST FAVORITE FLAVOR BASED ON SEX",fontsize=12,fontfamily='verdana')

Let’s create the plot using the seaborn countplot function:

#style
sns.set_style('ticks')
sns.despine()
#plot
ax=sns.countplot(x=FR4,hue=FR5,data=FR,palette='Paired')
ax.set(xlabel="LEAST FAVORITE FLAVOR", ylabel = "COUNT")
plt.legend(bbox_to_anchor=(1.2,1.0))
PLOT OUTPUT

The plot explains the choices of the different least preferred flavors in the sample data based on the sex of the sample population.
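
When the groups being compared differ in size, raw counts can make one group look more opinionated than it is. One option is to compare shares within each group instead. This is a minimal sketch with pandas crosstab and normalize="index", using made-up responses and hypothetical column names:

```python
import pandas as pd

# Hypothetical responses; the group sizes are deliberately unequal.
df = pd.DataFrame({
    "Sex": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "Least": ["Chocolate", "Chocolate", "Mint", "Vanilla", "Chocolate", "Mint"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(df["Sex"], df["Least"], normalize="index")
print((shares * 100).round(1))
```

In this made-up data, more females than males picked chocolate in absolute terms, yet the share within each group is the same.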

Finally, we will plot the frequency of the sex data of the sample population. This tells us the total number of females and males in the sample. Let’s create the variable and label first:

#create variable
FR5=FR['Sex']
#create labels
plt.title("FREQUENCY COUNT OF THE SEX DATA",fontsize=12,fontfamily='verdana')

Let’s create the plot using the seaborn countplot function:

#style
sns.set_style('ticks')
sns.despine()
#plot
ax=sns.countplot(x=FR5,hue=FR5,data=FR)
ax.set(xlabel="SEX", ylabel = "COUNT")
plt.legend(bbox_to_anchor=(1.5,1.3))
PLOT OUTPUT

The above plot shows that the sample population has more females than males: 21 females and 10 males.

In summary, although this is only a sample, visualizations like these can be used to draw conclusions in the real world. Please read through and practice, and let me know if you run into any issues. I will definitely help.


Written by Deborah Babundo

Hi! Since you are here: I am Ifeoma, and I love to write. I also love to analyze data. Love to read? Check my page for what I have written. Thanks for coming.
