Self-Reported User Engagement for an Online Forum¶

There are a myriad of ways to analyze and understand website usage and forum participation:

click-through-rate
counting backlinks
PageRank, and so on.

Not least among these is asking the end users themselves to fill out a questionnaire. Such a questionnaire is called a survey, and it provides insight into not only how a person uses a forum but also why.

pip install opendatasets

Requirement already satisfied: opendatasets in /usr/local/lib/python3.6/dist-packages (0.0.109)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from opendatasets) (4.41.1)

import opendatasets as od
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
%matplotlib inline
import seaborn as sns
import numpy as np

Getting our dataset¶

We'll be using the Stack Overflow developer survey dataset for our analysis. The survey is conducted annually, and we'll work with the 2020 edition.

The opendatasets helper library downloads the files for us.

od.download('stackoverflow-developer-survey-2020')

Using downloaded and verified file: ./stackoverflow-developer-survey-2020/survey_results_public.csv
Using downloaded and verified file: ./stackoverflow-developer-survey-2020/survey_results_schema.csv
Using downloaded and verified file: ./stackoverflow-developer-survey-2020/README.txt

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').head()

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').tail()

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv').columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
       'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
       'YearsCodePro'],
      dtype='object')

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText

Column
Respondent                                                                                                                                                                                           Randomized respondent ID number (not in order of survey response time)
MainBranch                                                                                                                                                 Which of the following options best describes you today? Here, by "developer" we mean "someone who writes code."
Hobbyist                                                                                                                                                                                                                                            Do you code as a hobby?
Age                                                                                                                                                                            What is your age (in years)? If you prefer not to answer, you may leave this question blank.
Age1stCode                                                                                                                                                      At what age did you write your first line of code or program? (e.g., webpage, Hello World, Scratch project)
                                                                                                                                              ...                                                                                                                          
WebframeWorkedWith    Which web frameworks have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the framework and want to continue to do so, please check both boxes in that row.)
WelcomeChange                                                                                                                                                                                             Compared to last year, how welcome do you feel on Stack Overflow?
WorkWeekHrs                                                                                                                                                                        On average, how many hours per week do you work? Please enter a whole number in the box.
YearsCode                                                                                                                                                                                            Including any education, how many years have you been coding in total?
YearsCodePro                                                                                                                                                                NOT including education, how many years have you coded professionally (as a part of your work)?
Name: QuestionText, Length: 61, dtype: object

By default, pandas truncates long text in a Series (here, a single column), as seen above. We can show the full text by changing the display setting for pandas output:

pd.set_option('display.max_colwidth', None)
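If the wider display is only wanted for a single cell, one option is pandas' `option_context`, which restores the previous setting on exit. The sketch below is illustrative only; the `demo` frame is a hypothetical stand-in for the schema file.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option('display.max_colwidth')

# Hypothetical two-row frame standing in for the schema file.
demo = pd.DataFrame({
    'QuestionText': ['Do you code as a hobby? ' * 5, 'What is your age?']
})

# Inside the block the column width is unlimited; outside it is unchanged.
with pd.option_context('display.max_colwidth', None):
    inside = pd.get_option('display.max_colwidth')
    print(demo)

after = pd.get_option('display.max_colwidth')
```

Unlike `pd.set_option`, this leaves the display settings of every other cell in the notebook untouched.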

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', usecols=['QuestionText'], index_col='QuestionText')

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText['CompFreq']

'Is that compensation weekly, monthly, or yearly?'

The questions above are much more readable than their truncated versions. We've now loaded the dataset, and we're ready to move on to the next step of preprocessing & cleaning the data for our analysis.
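The schema file is re-read from disk every time a question is looked up. A minimal sketch of reading it once and reusing it follows; the in-memory CSV and the `question` helper are hypothetical stand-ins, not part of the dataset or of pandas.

```python
import io
import pandas as pd

# Hypothetical stand-in for survey_results_schema.csv (same two column names).
schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "Hobbyist,Do you code as a hobby?\n"
    "SOAccount,Do you have a Stack Overflow account?\n"
)

# Read once, index by column name, and keep only the question text.
schema = pd.read_csv(schema_csv, index_col='Column').QuestionText

def question(column):
    """Return the survey question behind a column name (hypothetical helper)."""
    return schema[column]

print(question('SOAccount'))
```

With the real file, replacing `schema_csv` with the path to survey_results_schema.csv should give the same lookup behaviour.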

Data Preparation & Cleaning¶

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

Age & Location
Programming experience
Forum usage

Let's select a subset of columns with the relevant data for our analysis.

stambha = ['Hobbyist', 'SOAccount', 'SOComm', 'SOPartFreq', 'SOVisitFreq', 'NEWSOSites', 'WelcomeChange', 'NEWCollabToolsWorkedWith','NEWOffTopic', 'NEWOtherComms', 'NEWStuck']

len(stambha)

11

Let's extract the required columns into a new DataFrame called df.

df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')[stambha].copy()
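As an alternative, `read_csv` can skip the unwanted columns while parsing through its `usecols` parameter. The sketch below uses a tiny made-up CSV (the rows are hypothetical) to show the idea. Note that `usecols` returns columns in file order, so the list is applied again to get the requested order.

```python
import io
import pandas as pd

# Hypothetical miniature of survey_results_public.csv.
raw_csv = io.StringIO(
    "Respondent,Hobbyist,Age,SOAccount\n"
    "1,Yes,25,Yes\n"
    "2,No,,No\n"
)

wanted = ['SOAccount', 'Hobbyist']

# usecols limits what is parsed; indexing with the list fixes the column order.
subset = pd.read_csv(raw_csv, usecols=wanted)[wanted]
print(subset)
```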

df

pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText[stambha]

Column
Hobbyist                                                                                                                                                                                                                                             Do you code as a hobby?
SOAccount                                                                                                                                                                                                                              Do you have a Stack Overflow account?
SOComm                                                                                                                                                                                                    Do you consider yourself a member of the Stack Overflow community?
SOPartFreq                                                                                                                     How frequently would you say you participate in Q&A on Stack Overflow? By participate we mean ask, answer, vote for, or comment on questions.
SOVisitFreq                                                                                                                                                                                                           How frequently would you say you visit Stack Overflow?
NEWSOSites                                                                                                                                                                              Which of the following Stack Overflow sites have you visited? Select all that apply.
WelcomeChange                                                                                                                                                                                              Compared to last year, how welcome do you feel on Stack Overflow?
NEWCollabToolsWorkedWith    Which collaboration tools have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you worked with the tool and want to continue to do so, please check both boxes in that row.)
NEWOffTopic                                                                                                                                                                         Do you think Stack Overflow should relax restrictions on what is considered “off-topic”?
NEWOtherComms                                                                                                                                                                                                    Are you a member of any other online developer communities?
NEWStuck                                                                                                                                                                                              What do you do when you get stuck on a problem? Select all that apply.
Name: QuestionText, dtype: object

df

Let's view some basic information about the data frame.

df.shape

(64461, 11)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Hobbyist                  64416 non-null  object
 1   SOAccount                 56805 non-null  object
 2   SOComm                    56476 non-null  object
 3   SOPartFreq                46792 non-null  object
 4   SOVisitFreq               56970 non-null  object
 5   NEWSOSites                58275 non-null  object
 6   WelcomeChange             52683 non-null  object
 7   NEWCollabToolsWorkedWith  52883 non-null  object
 8   NEWOffTopic               50804 non-null  object
 9   NEWOtherComms             57205 non-null  object
 10  NEWStuck                  54983 non-null  object
dtypes: object(11)
memory usage: 5.4+ MB
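The non-null counts above can be turned into the share of missing answers per column. The sketch below redoes that arithmetic for four of the columns, with the counts copied from the `df.info()` output; on the real frame, `df.isna().mean() * 100` should give the same figures.

```python
import pandas as pd

total_rows = 64461  # from df.shape

# Non-null counts copied from the df.info() output above.
non_null = pd.Series({
    'Hobbyist': 64416,
    'SOAccount': 56805,
    'SOPartFreq': 46792,
    'NEWOffTopic': 50804,
})

# Percentage of respondents who left each question blank.
missing_pct = ((1 - non_null / total_rows) * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

SOPartFreq has the most gaps of these four, with roughly 27% of rows blank, which is worth keeping in mind when reading percentages computed only over respondents who answered.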

df.last_valid_index()

64460

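Let's now view some basic statistics about the columns. Since every selected column holds text (dtype object), describe() reports the count, the number of unique values, the most frequent value, and its frequency rather than numeric summaries.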

df.describe()

df['Hobbyist'].value_counts()

Yes    50388
No     14028
Name: Hobbyist, dtype: int64

df.sample(10)

Exploratory Data Analysis¶

Forum Account¶

Let's look at the distribution of responses on whether a respondent had a forum account or not. Registered users are generally thought to be more likely to participate in forum activities like surveys.

(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOAccount

'Do you have a Stack Overflow account?'

user_counts = df.SOAccount.value_counts()
user_counts

Yes                        47275
No                          6101
Not sure/can't remember     3429
Name: SOAccount, dtype: int64

A pie chart would be a good way to visualize the distribution.

plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOAccount)
plt.pie(user_counts, labels=user_counts.index, autopct='%.1f%%', startangle=0);

About 83% of survey respondents who have answered the question had an account on the forum.
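The 83% figure can be checked directly from the counts shown earlier. A minimal sketch, rebuilding `user_counts` by hand from that output:

```python
import pandas as pd

# Counts copied from the SOAccount value_counts() output above.
user_counts = pd.Series({
    'Yes': 47275,
    'No': 6101,
    "Not sure/can't remember": 3429,
})

# Share of each answer among respondents who answered the question.
user_share = user_counts / user_counts.sum() * 100
print(user_share.round(1))
```

On the real column, `df.SOAccount.value_counts(normalize=True) * 100` computes the same shares in one step.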

Hobby¶

Let's look at the distribution of responses on whether a respondent codes as a hobby or not.

(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).Hobbyist

'Do you code as a hobby?'

hobby_counts = df.Hobbyist.value_counts()
hobby_counts

Yes    50388
No     14028
Name: Hobbyist, dtype: int64

plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).Hobbyist)
plt.pie(hobby_counts, labels=hobby_counts.index, autopct='%.1f%%', startangle=0);

It appears that roughly four in five respondents (about 78%) code as a hobby. Note that the question does not ask whether they also code professionally, so the two are not mutually exclusive.

Let's also plot the visit frequency, but this time we'll convert the counts into percentages and sort by percentage to make the order easier to see.

(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq

'How frequently would you say you visit Stack Overflow?'

# value_counts() already sorts in descending order; divide by the number of answers to get percentages
VisitFreq_pct = df.SOVisitFreq.value_counts() * 100 / df.SOVisitFreq.count()
sns.barplot(x=VisitFreq_pct, y=VisitFreq_pct.index)
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq)
plt.ylabel(None);
plt.xlabel('Percentage');

It turns out that about 59% of respondents visit our forum at least once a day (30.5% daily or almost daily plus 28.6% multiple times per day), which is very encouraging. This hints at healthy day-to-day retention, since most of those who visited our forum today will probably return tomorrow, although a one-off survey cannot measure retention directly.

On the flip side, this entails selection bias. Our respondents may not be representative of the average person who uses our forum, because people who did not stick around never took this survey. Those who already think well of our forum are more likely to respond to our call.

plt.figure(figsize=(20,10))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOVisitFreq)
plt.pie(VisitFreq_pct, labels=VisitFreq_pct.index, autopct='%.1f%%', startangle=0);

Community feel¶

There are various reasons why a member of a forum may feel excluded:

Echo chambers: a hive mind that encourages groupthink and shuts down dissenting views
Observer effect ('observing the process changes the process'): on a public online forum, the thought that all of one's activity can be traced and tracked
Language barriers
Etiquette and the boundaries of socially acceptable behaviour

Let's visualize the data from the SOComm column.

(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOComm

'Do you consider yourself a member of the Stack Overflow community?'

(df.SOComm.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).SOComm)
plt.xlabel('Percentage');

It appears that close to 35% of respondents do not identify themselves with the Stack Overflow label.

The NEWSOSites field records which Stack Overflow sites a respondent has visited. Since the question allows multiple answers, each value is a list of options separated by ;, which makes it a bit harder to analyze directly.

(pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).NEWSOSites

'Which of the following Stack Overflow sites have you visited? Select all that apply.'

df.NEWSOSites.value_counts()

Stack Overflow (public Q&A for anyone who codes);Stack Exchange (public Q&A for a variety of topics)                                                                                                                                                                22415
Stack Overflow (public Q&A for anyone who codes);Stack Exchange (public Q&A for a variety of topics);Stack Overflow Jobs (for job seekers)                                                                                                                          13891
Stack Overflow (public Q&A for anyone who codes)                                                                                                                                                                                                                    12762
Stack Overflow (public Q&A for anyone who codes);Stack Overflow Jobs (for job seekers)                                                                                                                                                                               4588
Stack Overflow (public Q&A for anyone who codes);Stack Exchange (public Q&A for a variety of topics);Stack Overflow Jobs (for job seekers);Stack Overflow for Teams (private Q&A for organizations)                                                                   906
                                                                                                                                                                                                                                                                    ...  
Stack Exchange (public Q&A for a variety of topics);Stack Overflow for Teams (private Q&A for organizations);Stack Overflow Advertising (for technology companies)                                                                                                      2
Stack Exchange (public Q&A for a variety of topics);Stack Overflow Jobs (for job seekers);Stack Overflow for Teams (private Q&A for organizations);Stack Overflow Talent (for hiring companies/recruiters);Stack Overflow Advertising (for technology companies)        2
Stack Exchange (public Q&A for a variety of topics);Stack Overflow Jobs (for job seekers);Stack Overflow Talent (for hiring companies/recruiters);Stack Overflow Advertising (for technology companies)                                                                 2
Stack Exchange (public Q&A for a variety of topics);Stack Overflow Advertising (for technology companies)                                                                                                                                                               1
Stack Exchange (public Q&A for a variety of topics);Stack Overflow for Teams (private Q&A for organizations)                                                                                                                                                            1
Name: NEWSOSites, Length: 61, dtype: int64

Let's define a helper function which turns a column containing lists of values (like df.NEWSOSites) into a data frame with one column for each possible option.

def sandhi_vicched(col_series):
    """Split a column of ';'-separated answers into one boolean column per option."""
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null values in the column
    # (.items() replaces .iteritems(), which was removed in pandas 2.0)
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column the first time it appears
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]

NEWSOSites_df = sandhi_vicched(df.NEWSOSites)

NEWSOSites_df

NEWSOSites_df has one column for each option that can be selected as a response. If a respondent has selected the option, the value in the column is True; otherwise it is False.
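pandas also ships a built-in that does a similar split, `Series.str.get_dummies`. The sketch below applies it to a made-up three-row Series (the answers are hypothetical); it is offered as an alternative, not as the method used in this notebook. Two differences are worth noting: the resulting columns are sorted alphabetically rather than by first appearance, and the values are 0/1 integers until cast to bool.

```python
import pandas as pd

# Hypothetical multi-select answers, including one blank response.
answers = pd.Series([
    'Stack Overflow;Stack Exchange',
    'Stack Overflow',
    None,
])

# One indicator column per option; blank responses become all-False rows.
flags = answers.str.get_dummies(sep=';').astype(bool)

# Column-wise totals, most popular option first.
totals = flags.sum().sort_values(ascending=False)
print(totals)
```

On a column with tens of thousands of rows, this vectorized form should also run noticeably faster than a Python-level loop.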

We can now use the column-wise totals to identify the most popular forums.

NEWSOSites_totals = NEWSOSites_df.sum().sort_values(ascending=False)
NEWSOSites_totals

Stack Overflow (public Q&A for anyone who codes)            57114
Stack Exchange (public Q&A for a variety of topics)         39219
Stack Overflow Jobs (for job seekers)                       21126
Stack Overflow for Teams (private Q&A for organizations)     2631
Stack Overflow Talent (for hiring companies/recruiters)      1417
Stack Overflow Advertising (for technology companies)         837
I have never visited any of these sites                       528
dtype: int64

As one might expect, the most popular forum is "Stack Overflow", the first of its name.

sns.heatmap(NEWSOSites_df)

<matplotlib.axes._subplots.AxesSubplot at 0x7f9dc74ac7f0>

Note that here a cream-coloured cell means the respondent in that row selected the forum in that column, and a black cell means they did not. A heatmap needs a rectangular grid of values, which a DataFrame provides. Since NEWSOSites_df is a boolean matrix, only two colours are present in the heatmap.
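With one row per respondent, the heatmap above mostly shows which columns are dense. One possible refinement, sketched here on a hypothetical four-respondent frame, is to count how often each pair of sites was selected together and plot that smaller matrix instead.

```python
import pandas as pd

# Hypothetical boolean answers for four respondents.
flags = pd.DataFrame({
    'Stack Overflow': [True, True, True, False],
    'Stack Exchange': [True, False, True, False],
    'Jobs':           [False, False, True, False],
})

# Cast to 0/1 so the matrix product counts co-selections.
as_int = flags.astype(int)

# Entry (a, b) is the number of respondents who selected both a and b;
# the diagonal holds each site's own total.
co_occurrence = as_int.T.dot(as_int)
print(co_occurrence)
```

Passing `co_occurrence` to `sns.heatmap` with `annot=True` would then display the counts in each cell.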

Asking and Answering Questions¶

We've already gained several insights about the respondents and the programming community in general, simply by exploring individual columns of the dataset. Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.

What do forumites do when stuck on a programming problem?¶

Let's look at the responses in the survey.

NEWStuck_df = sandhi_vicched(df.NEWStuck)
NEWStuck_numbers = NEWStuck_df.mean().sort_values(ascending=False)
NEWStuck_numbers

Visit Stack Overflow                                0.772607
Do other work and come back later                   0.464048
Watch help / tutorial videos                        0.450040
Call a coworker or friend                           0.425451
Go for a walk or other physical activity            0.369215
Play games                                          0.128217
Meditate                                            0.099905
Panic                                               0.093250
Visit another developer community (please name):    0.087495
dtype: float64

NEWStuck_df.sum()

Visit Stack Overflow                                49803
Go for a walk or other physical activity            23800
Do other work and come back later                   29913
Call a coworker or friend                           27425
Watch help / tutorial videos                        29010
Visit another developer community (please name):     5640
Play games                                           8265
Meditate                                             6440
Panic                                                6011
dtype: int64
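The fractions from `.mean()` use every row of df as the denominator, including respondents who skipped the question. The sketch below recomputes the "Visit Stack Overflow" share both ways, using the totals printed above and the non-null count for NEWStuck from `df.info()`.

```python
# Numbers copied from the outputs above.
selected_so = 49803   # respondents who ticked "Visit Stack Overflow"
all_rows = 64461      # every row in df, blank answers included
answered = 54983      # rows with a non-null NEWStuck value

# Share among all respondents (what NEWStuck_df.mean() reports).
share_all = selected_so / all_rows

# Share among only those who answered the question.
share_answered = selected_so / answered

print(round(share_all, 4), round(share_answered, 4))
```

To get the second figure from the frame itself, the mean can be taken over `NEWStuck_df[df.NEWStuck.notnull()]`.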

We can visualize this information using a bar chart.

sns.set_style('darkgrid')
plt.figure(figsize=(20,6))
plt.title((pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText).NEWStuck)
sns.barplot(x=NEWStuck_numbers.index, y=NEWStuck_numbers);
plt.xticks(rotation=45)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 <a list of 9 Text major ticklabel objects>)

Which version control and collaboration tools are popular with respondents?¶

For this we can use the NEWCollabToolsWorkedWith column, with similar processing as the previous one.

NEWCollabToolsWorkedWith_df = sandhi_vicched(df.NEWCollabToolsWorkedWith)
NEWCollabToolsWorkedWith_percentages = NEWCollabToolsWorkedWith_df.mean().sort_values(ascending=False) * 100
NEWCollabToolsWorkedWith_percentages

Github                            67.926343
Slack                             43.465041
Jira                              39.127534
Google Suite (Docs, Meet, etc)    34.054700
Gitlab                            30.320659
Confluence                        26.561797
Trello                            24.286002
Microsoft Teams                   20.970820
Microsoft Azure                   12.176355
Stack Overflow for Teams           4.742402
Facebook Workplace                 2.451094
dtype: float64

plt.figure(figsize=(12, 12))
sns.barplot(x=NEWCollabToolsWorkedWith_percentages, y=NEWCollabToolsWorkedWith_percentages.index)
plt.title("Collaboration Tools");
plt.xlabel('Percentage');

Once again, it's not surprising that GitHub is the collaboration tool that the largest share of respondents (about 68%) have worked with.

However, when we want to see the relative share of each tool, a pie chart is a better fit. Because respondents could select several tools, the percentages add up to more than 100, and the pie rescales them into shares of all selections:

plt.figure(figsize=(20,20))
plt.title("Market Share of Collaboration Tools")
plt.rcParams['font.size'] = 25.0
plt.pie(NEWCollabToolsWorkedWith_percentages, labels=NEWCollabToolsWorkedWith_percentages.index, autopct='%.1f%%', startangle=0);
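The labels on the pie chart above differ from the bar chart values because `plt.pie` divides each value by the sum of all values. A short sketch of that rescaling, using the percentages printed earlier:

```python
import pandas as pd

# Percentages copied from the NEWCollabToolsWorkedWith output above.
pct = pd.Series({
    'Github': 67.926343,
    'Slack': 43.465041,
    'Jira': 39.127534,
    'Google Suite (Docs, Meet, etc)': 34.054700,
    'Gitlab': 30.320659,
    'Confluence': 26.561797,
    'Trello': 24.286002,
    'Microsoft Teams': 20.970820,
    'Microsoft Azure': 12.176355,
    'Stack Overflow for Teams': 4.742402,
    'Facebook Workplace': 2.451094,
})

# Multi-select answers add up to well over 100%.
total = pct.sum()

# What the pie displays: each tool's share of all selections.
pie_share = pct / total * 100
print(round(total, 2), pie_share.round(1)['Github'])
```

So about 68% of respondents have worked with GitHub, while GitHub accounts for roughly 22% of all tool selections.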

What is the percent of weekly active users among respondents?¶

To answer this, we can use the SOVisitFreq column.

df.SOVisitFreq

0                 Multiple times per day
1                 Multiple times per day
2                  Daily or almost daily
3                 Multiple times per day
4        A few times per month or weekly
                      ...               
64456                                NaN
64457                                NaN
64458                                NaN
64459                                NaN
64460                                NaN
Name: SOVisitFreq, Length: 64461, dtype: object

First, we'll count the number of occurrences of each unique value.

SOVisitFreq_df = df.SOVisitFreq.value_counts()

SOVisitFreq_df

Daily or almost daily                                 17372
Multiple times per day                                16273
A few times per week                                  13493
A few times per month or weekly                        7901
Less than once per month or monthly                    1739
I have never visited Stack Overflow (before today)      192
Name: SOVisitFreq, dtype: int64

It appears that a total of 6 options were included. Let's aggregate these to identify the percentage of respondents who selected each option.

SOVisitFreq_percentages = (SOVisitFreq_df.sort_values(ascending=False) * 100) /SOVisitFreq_df.sum()
SOVisitFreq_percentages

Daily or almost daily                                 30.493242
Multiple times per day                                28.564157
A few times per week                                  23.684395
A few times per month or weekly                       13.868703
Less than once per month or monthly                    3.052484
I have never visited Stack Overflow (before today)     0.337019
Name: SOVisitFreq, dtype: float64

We can plot this information using a horizontal bar chart.

plt.figure(figsize=(20, 10))
sns.barplot(x=SOVisitFreq_percentages, y=SOVisitFreq_percentages.index)
plt.title("Forum visits in given time frame");
plt.xlabel('Percentage');

Perhaps not surprisingly, about 59% of the respondents are daily active users of the forum, and about 83% visit at least a few times per week, which makes them weekly active users.
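The daily and weekly shares can be recomputed by summing the relevant answer categories. A minimal sketch using the counts printed above; treating "A few times per week" and more frequent answers as weekly active is an assumption made here.

```python
import pandas as pd

# Counts copied from the SOVisitFreq value_counts() output above.
visit_counts = pd.Series({
    'Daily or almost daily': 17372,
    'Multiple times per day': 16273,
    'A few times per week': 13493,
    'A few times per month or weekly': 7901,
    'Less than once per month or monthly': 1739,
    'I have never visited Stack Overflow (before today)': 192,
})

daily = ['Daily or almost daily', 'Multiple times per day']
# Assumption: weekly active means at least a few visits per week.
weekly = daily + ['A few times per week']

total = visit_counts.sum()
daily_pct = visit_counts[daily].sum() / total * 100
weekly_pct = visit_counts[weekly].sum() / total * 100
print(round(daily_pct, 1), round(weekly_pct, 1))
```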

How often do hobbyists who are also part of other online communities visit our forum community?¶

df.groupby([df.SOVisitFreq, df.NEWOtherComms])['Hobbyist'].count().unstack().plot.barh(figsize=(20,20), stacked=True, fontsize=18)
plt.show();
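The grouped count above includes every row with a non-null Hobbyist value, so it does not single out hobbyists. One way to focus on them, sketched on a hypothetical six-row frame, is to filter on Hobbyist == 'Yes' first and then compare the visit-frequency distribution within each NEWOtherComms group using `pd.crosstab`.

```python
import pandas as pd

# Hypothetical miniature of df with the three relevant columns.
toy = pd.DataFrame({
    'Hobbyist':      ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes'],
    'NEWOtherComms': ['Yes', 'Yes', 'No', 'Yes', 'No', None],
    'SOVisitFreq':   ['Daily or almost daily', 'A few times per week',
                      'Daily or almost daily', 'Daily or almost daily',
                      'Daily or almost daily', 'A few times per week'],
})

# Keep only respondents who code as a hobby.
hobbyists = toy[toy.Hobbyist == 'Yes']

# Percentage of each visit frequency within each NEWOtherComms group;
# rows with a missing answer in either column are dropped by crosstab.
table = pd.crosstab(hobbyists.SOVisitFreq, hobbyists.NEWOtherComms,
                    normalize='columns') * 100
print(table)
```

Normalizing within each group makes the two groups comparable even when one is much larger than the other, which raw stacked counts do not.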

How welcoming do visitors find our forum vis-a-vis other forums, especially regarding the request to let 'off-topic' posts stay on the forum?¶

df.groupby([df.WelcomeChange, df.NEWOffTopic])['NEWOtherComms'].count().unstack().plot.barh(figsize=(20,20), stacked=True, fontsize=18)
plt.show();

Conclusion¶

We've drawn many interesting inferences from the survey. Here's a summary of a few of them:

Having an account on other forums does not appear to affect user retention.

This finding runs against intuition. If a user also uses or has an account on other forums, that does not necessarily mean they will spend less of their day catching up with your forum.

GitHub is by far the most widely used collaboration tool among developers. Is this only because of 'first mover advantage' or the 'network effect'? This could be a topic for further investigation.
Among respondents who answered the account question, the proportion who have opened an account is 0.83. This is a conditional proportion among survey respondents, not a joint probability.
A significant percentage of programmers either come to the forums (about 77%) or watch a video/tutorial (about 45%) when they get stuck.

References¶

# Select a project name
project='self-reported-user-engagement'
# Install the Jovian library
!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project=project)

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Committed successfully! https://jovian.ml/vedant-madane/self-reported-user-engagement

'https://jovian.ml/vedant-madane/self-reported-user-engagement'

Data Analysis With Python Course Project