Self-Reported User Engagement for an Online Forum¶
There are a myriad of ways to analyze and understand website usage and forum participation:
- click-through-rate
- counting backlinks
- PageRank, and so on.
Not least among these is asking the end users themselves to fill out a questionnaire. Such a questionnaire is called a survey, and it provides insight into not only how a person uses a forum but also why.
pip install opendatasets
import opendatasets as od
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
%matplotlib inline
import seaborn as sns
import numpy as np
Getting our dataset¶
We'll be using the StackOverflow developer survey dataset for our analysis. This survey is conducted annually, and we'll work with the 2020 edition.
The opendatasets helper library downloads the files for us.
od.download('stackoverflow-developer-survey-2020')
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').head()
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').tail()
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').columns
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText
By default, pandas truncates the display of long strings in a Series (here, a column), as seen above. We can expand it either by changing the IPython console settings in three lines or, for pandas output only, by using:
pd.set_option('display.max_colwidth', None)
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', usecols = ['QuestionText'], index_col='QuestionText')#.QuestionText
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText['CompFreq']
The question texts above are much more readable than their truncated versions. We've now loaded the dataset, and we're ready to move on to the next step of preprocessing and cleaning the data for our analysis.
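If the wider display is wanted for a single output only, pandas' option_context manager can scope the setting to one block and restore it afterwards. A minimal sketch, using a made-up Series (long_text is a hypothetical stand-in for the schema's question text):

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
default_width = pd.get_option('display.max_colwidth')

# Made-up stand-in for a long question text from the schema file.
long_text = pd.Series(['x' * 200], name='QuestionText')

# Inside the block, strings are shown in full.
with pd.option_context('display.max_colwidth', None):
    full = repr(long_text)

# Outside the block, the previous (truncating) setting applies again.
truncated = repr(long_text)
restored_width = pd.get_option('display.max_colwidth')

print(full)
print(truncated)
```

Unlike pd.set_option, this leaves the rest of the notebook's output unchanged.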
Data Preparation & Cleaning¶
While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:
- Age & Location
- Programming experience
- Forum usage
Let's select a subset of columns with the relevant data for our analysis.
stambha = ['Hobbyist', 'SOAccount', 'SOComm', 'SOPartFreq', 'SOVisitFreq', 'NEWSOSites', 'WelcomeChange', 'NEWCollabToolsWorkedWith','NEWOffTopic', 'NEWOtherComms', 'NEWStuck']
len(stambha)
Let's extract the required columns into a new DataFrame and call it df.
df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')[stambha].copy()
df
pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText[stambha]
df
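The schema file is re-read from disk in many cells of this notebook. One possible alternative is to read it once and keep the question text in a variable. The sketch below assumes a helper named load_question_text (hypothetical, not part of the original notebook) and uses a made-up in-memory CSV in place of survey_results_schema.csv:

```python
import io
import pandas as pd

def load_question_text(source):
    """Read a schema CSV and return a Series mapping column name -> question text."""
    return pd.read_csv(source, index_col='Column')['QuestionText']

# Toy stand-in for survey_results_schema.csv; the rows are made up.
toy_schema = io.StringIO(
    "Column,QuestionText\n"
    "Hobbyist,Do you code as a hobby?\n"
    "SOAccount,Do you have an account?\n"
)

# Read once, then reuse the Series for every lookup.
questions = load_question_text(toy_schema)
selected = questions[['SOAccount', 'Hobbyist']]
print(selected)
```

With the real file, the same function could be called with the path used above, and a lookup such as questions['SOComm'] would replace each repeated pd.read_csv call.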
Let's view some basic information about the data frame.
df.shape
df.info()
df.last_valid_index()
Let's now view some basic summary statistics about the columns.
df.describe()
df['Hobbyist'].value_counts()
df.sample(10)
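Before exploring individual columns, it can help to know how many answers are missing in each one, since df.info() reports only non-null counts. A small sketch on a made-up frame (the values are invented; only the column names come from the survey):

```python
import numpy as np
import pandas as pd

# Made-up responses; NaN marks a skipped question.
toy = pd.DataFrame({
    'Hobbyist': ['Yes', 'No', 'Yes', 'Yes'],
    'SOAccount': ['Yes', np.nan, 'No', np.nan],
})

# isna() gives a boolean frame; the column-wise mean is the missing fraction.
missing_pct = toy.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)
```

Applied to df, the same expression would rank the selected columns by how often respondents skipped them.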
Exploratory Data Analysis¶
Forum Account¶
Let's look at the distribution of responses on whether a respondent had a forum account. Registered users are presumably more likely to participate in forum activities such as surveys.
(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOAccount
user_counts = df.SOAccount.value_counts()
user_counts
A pie chart would be a good way to visualize the distribution.
plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOAccount)
plt.pie(user_counts, labels=user_counts.index, autopct='%.1f%%', startangle=0);
About 83% of survey respondents who have answered the question had an account on the forum.
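The qualifier "who have answered the question" matters because value_counts drops missing answers by default. A sketch with invented data showing how the share changes depending on whether skipped answers are counted:

```python
import numpy as np
import pandas as pd

# Made-up answers; NaN marks respondents who skipped the question.
toy_accounts = pd.Series(['Yes', 'Yes', 'Yes', 'No', np.nan, np.nan])

# Share among those who answered (NaN is excluded by default).
share_answered = toy_accounts.value_counts(normalize=True)

# Share among all respondents, with skipped answers as their own category.
share_all = toy_accounts.value_counts(normalize=True, dropna=False)

print(share_answered)
print(share_all)
```

The pie chart above is built from value_counts with its defaults, so it corresponds to the first of these two views.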
Hobby¶
Let's look at the distribution of responses on whether a respondent considered themselves a hobbyist.
(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).Hobbyist
hobby_counts = df.Hobbyist.value_counts()
hobby_counts
plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).Hobbyist)
plt.pie(hobby_counts, labels=hobby_counts.index, autopct='%.1f%%', startangle=0);
It appears that about four in five of the respondents code as a hobby. Note that this does not by itself tell us whether they also program professionally.
Let's also plot the visit frequency, but this time we'll convert the numbers into percentages, and sort by percentage values to make it easier to visualize the order.
(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq
VisitFreq_pct = df.SOVisitFreq.value_counts() * 100 / df.SOVisitFreq.count()
sns.barplot(x=VisitFreq_pct.values, y=VisitFreq_pct.index)
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq)
plt.ylabel(None);
plt.xlabel('Percentage');
It turns out that 55% of respondents visit our forum at least once daily, which is very encouraging. This seems to suggest that user retention is 30%/day, meaning 30% of those who have visited our forum today will probably return tomorrow, though self-reported visit frequency is only a rough proxy for retention.
On the flip side, this entails selection bias. Our respondents may not be representative of the average person who uses our forum, because users who did not stick around never took this survey. Those who respond to our call are likely to be the ones who already think well of our forum.
plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq)
plt.pie(VisitFreq_pct, labels=VisitFreq_pct.index, autopct='%.1f%%', startangle=0);
Community feel¶
There are various reasons why a member of a forum may feel excluded:
- Echo chamber: a hive mind that encourages groupthink and shuts down dissenting views
- Observer effect: 'observing the process changes the process.' On a public online forum, members may be put off by the thought that all their activity can be traced and tracked.
- Language barriers
- Etiquette and socially acceptable behavioural boundaries.
Let's visualize the data from the SOComm column.
(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOComm
(df.SOComm.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOComm)
plt.xlabel('Percentage');
It appears that close to 35% of respondents don't want to identify themselves with the StackOverflow label.
The NEWSOSites field contains information about which of the network's sites respondents have visited. Since the question allows multiple answers, the column contains lists of values separated by semicolons (;), which makes it a bit harder to analyze directly.
(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).NEWSOSites
df.NEWSOSites.value_counts()
Let's define a helper function which turns a column containing lists of values (like df.NEWSOSites) into a data frame with one column for each possible option.
def sandhi_vicched(col_series):
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null values of the column
    # (Series.items() replaces iteritems(), which was removed in pandas 2.0)
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column to the result
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]
NEWSOSites_df = sandhi_vicched(df.NEWSOSites)
NEWSOSites_df
The NEWSOSites_df has one column for each option that can be selected as a response. If a respondent has selected the option, the value in the column is True, otherwise it is False.
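For comparison, pandas also offers a vectorized way to build the same kind of boolean table, Series.str.get_dummies. The sketch below is an alternative to the helper above, not the notebook's own method, and runs on made-up answers:

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers; NaN marks a skipped question.
toy_sites = pd.Series(
    ['Stack Overflow;Super User', 'Stack Overflow', np.nan],
    name='NEWSOSites',
)

# One 0/1 column per option, split on ';', then converted to booleans.
# Rows with a missing answer end up False in every column.
dummies = toy_sites.str.get_dummies(sep=';').astype(bool)

# Column-wise totals count how many respondents picked each option.
totals = dummies.sum().sort_values(ascending=False)
print(dummies)
print(totals)
```

One difference worth noting: get_dummies orders the columns alphabetically, whereas the helper above keeps them in order of first appearance.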
We can now use the column-wise totals to identify the most popular forums.
NEWSOSites_totals = NEWSOSites_df.sum().sort_values(ascending=False)
NEWSOSites_totals
As one might expect, the most popular forum is "Stack Overflow", the first of its name.
sns.heatmap(NEWSOSites_df)
Note that here the cream colour marks a respondent who selected that forum option, and black marks one who did not. A heatmap requires two-dimensional (rectangular) data such as a DataFrame. Since our NEWSOSites_df is a boolean matrix, only two colours are present in the heatmap.
Asking and Answering Questions¶
We've already gained several insights about the respondents and the programming community in general, simply by exploring individual columns of the dataset. Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.
What do forumites do when stuck on a programming problem?¶
Let's look at the responses in the survey.
NEWStuck_df = sandhi_vicched(df.NEWStuck)
NEWStuck_numbers = NEWStuck_df.mean().sort_values(ascending=False)
NEWStuck_numbers
NEWStuck_df.sum()
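One subtlety with taking the mean of these boolean columns: respondents who skipped the question are False everywhere, so they still count in the denominator. A sketch on made-up answers comparing both denominators (the table is built with str.get_dummies only to keep the example self-contained, and the option labels are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up multi-select answers; NaN marks a skipped question.
toy_stuck = pd.Series(
    ['Visit Stack Overflow;Watch help / tutorial videos',
     'Visit Stack Overflow',
     np.nan,
     np.nan],
    name='NEWStuck',
)
flags = toy_stuck.str.get_dummies(sep=';').astype(bool)

# Denominator = every row, including respondents who skipped the question.
share_of_all_rows = flags.mean()

# Denominator = only respondents who answered the question.
answered = toy_stuck.notna()
share_of_answered = flags[answered].mean()

print(share_of_all_rows)
print(share_of_answered)
```

Which denominator is appropriate depends on the claim being made; the second matches statements of the form "among those who answered".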
We can visualize this information using a bar chart.
sns.set_style('darkgrid')
plt.figure(figsize=(20,6))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).NEWStuck)
sns.barplot(x=NEWStuck_numbers.index, y=NEWStuck_numbers.values);
plt.xticks(rotation=45);
Which version control and collaboration tools are popular with respondents?¶
For this we can use the NEWCollabToolsWorkedWith column, with similar processing as the previous one.
NEWCollabToolsWorkedWith_df = sandhi_vicched(df.NEWCollabToolsWorkedWith)
NEWCollabToolsWorkedWith_percentages = NEWCollabToolsWorkedWith_df.mean().sort_values(ascending=False) * 100
NEWCollabToolsWorkedWith_percentages
plt.figure(figsize=(12, 12))
sns.barplot(x=NEWCollabToolsWorkedWith_percentages.values, y=NEWCollabToolsWorkedWith_percentages.index)
plt.title("Collaboration Tools");
plt.xlabel('Percentage');
Once again, it's not surprising that GitHub is the collaboration tool most respondents have worked with, since it is easy to learn and widely adopted.
However, when we want to compare each tool's share of all selections, a pie chart is a better fit:
plt.figure(figsize=(20,20))
plt.title("Market-Share of Collaboration-Tools")
plt.rcParams['font.size'] = 25.0
plt.pie(NEWCollabToolsWorkedWith_percentages, labels=NEWCollabToolsWorkedWith_percentages.index, autopct='%.1f%%', startangle=0);
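Because respondents could select several tools, the per-tool percentages add up to more than 100, and the pie chart rescales them so the slices sum to 100%. The arithmetic behind the slice labels can be sketched as follows (the numbers are invented for illustration, not survey results):

```python
import pandas as pd

# Invented "percentage of respondents who selected each tool" values.
# They sum to 160 because one respondent can pick several tools.
toy_pct = pd.Series({'GitHub': 80.0, 'Slack': 50.0, 'Jira': 30.0})

# What the pie slices show: each value divided by the total of all values.
pie_share = toy_pct / toy_pct.sum() * 100
print(pie_share)
```

So a slice reads as "share of all selections", which is a different quantity from "share of respondents" shown in the bar chart.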
What is the percent of weekly active users among respondents?¶
To answer this, we can use the SOVisitFreq column.
df.SOVisitFreq
First, we'll count the number of occurrences of each unique value.
SOVisitFreq_df = df.SOVisitFreq.value_counts()
SOVisitFreq_df
It appears that a total of 6 options were included. Let's aggregate these to identify the percentage of respondents who selected each option.
SOVisitFreq_percentages = (SOVisitFreq_df.sort_values(ascending=False) * 100) /SOVisitFreq_df.sum()
SOVisitFreq_percentages
We can plot this information using a horizontal bar chart.
plt.figure(figsize=(20, 10))
sns.barplot(x=SOVisitFreq_percentages.values, y=SOVisitFreq_percentages.index)
plt.title("Forum visits in given time frame");
plt.xlabel('Percentage');
Perhaps not surprisingly, 55%+ of the respondents are daily active users of the forum.
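To put a single number on weekly active users, the categories that imply at least one visit per week can be added together. The sketch below uses made-up answers, and the option labels in at_least_weekly are an assumption about how the survey words them; they should be checked against the actual values of df.SOVisitFreq:

```python
import pandas as pd

# Made-up answers; None marks a skipped question.
toy_visits = pd.Series([
    'Multiple times per day',
    'Daily or almost daily',
    'A few times per week',
    'A few times per month or weekly',
    'Daily or almost daily',
    None,
])

# Assumed labels for "visits at least once a week" (verify against the data).
at_least_weekly = [
    'Multiple times per day',
    'Daily or almost daily',
    'A few times per week',
]

# Percentage of weekly active users among those who answered.
answered = toy_visits.dropna()
weekly_active_pct = answered.isin(at_least_weekly).mean() * 100
print(weekly_active_pct)
```

An option that mixes monthly and weekly visits is left out of the list here, which makes the estimate a conservative one.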
How often do hobbyists who are also part of other online communities visit our forum community?¶
df.groupby([df.SOVisitFreq, df.NEWOtherComms])['Hobbyist'].count().unstack().plot.barh(figsize=(20,20), stacked=True, fontsize=18)
plt.show();
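Stacked counts make it hard to compare groups of different sizes. One way to compare visit patterns between those who are and are not members of other communities is a row-normalized cross-tabulation. A sketch with invented data (only the column names come from the survey; the category values are simplified placeholders):

```python
import pandas as pd

# Invented responses with simplified category labels.
toy = pd.DataFrame({
    'SOVisitFreq': ['Daily', 'Daily', 'Weekly', 'Weekly', 'Daily', 'Weekly'],
    'NEWOtherComms': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})

# Each row sums to 100: the visit-frequency mix within each membership group.
row_share = pd.crosstab(
    toy['NEWOtherComms'], toy['SOVisitFreq'], normalize='index'
) * 100
print(row_share)
```

If the rows of such a table look alike on the real data, that would be consistent with membership in other communities having little bearing on visit frequency.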
How welcoming do visitors find our forum vis-a-vis other forums, especially regarding the request to let 'off-topic' posts stay on the forum?¶
df.groupby([df.WelcomeChange, df.NEWOffTopic])['NEWOtherComms'].count().unstack().plot.barh(figsize=(20,20), stacked=True, fontsize=18)
plt.show();
Conclusion¶
We've drawn many interesting inferences from the survey. Here's a summary of a few of them:
- Having an account on other forums does not appear to affect user retention. This finding goes against what would be considered intuitive: if a user also has an account on other forums, that does not necessarily mean they will spend less of the day catching up with your forum.
- GitHub is by far the most widely used collaboration tool among developers. Is this only because of the 'first-mover advantage' or the 'network effect'? This can be a topic for further investigation.
- Among respondents who answered the question, the probability of having an account on the forum is about 0.83. This is a conditional probability (given that the person responded), not a joint one.
- A significant percentage of programmers either visit the forums or watch a video/tutorial when they get stuck.
# Select a project name
project='self-reported-user-engagement'
# Install the Jovian library
!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project=project)