Where do networks come from?

Both the peer effects and the structural approaches to network effects assume some degree of exogeneity in the existence and structure of network ties.

Exogeneity is both a theoretical claim and an empirical assumption. All reasonable theories are built on a set of axioms that assume some primitive or exogenous features of the world or of the target system being analyzed. Many models in economics, for instance, assume that preferences are exogenous. From these preferences, we are then able to derive things like behavior, choice, “roles,” and the structure of social relationships.

Screen Shot 2017-05-10 at 10.31.25 AM.png

Similarly, some sociological and anthropological traditions start with axioms that assume that “roles” are exogenous. These roles (e.g., the position an individual occupies in a social structure) govern behavior, preferences, and social relationships.

Screen Shot 2017-05-10 at 10.31.32 AM

Much of the network analysis we’ve been conducting or discussing thus far also has an exogeneity assumption built in. The primitives are social relationships and their structure. All other things we observe, such as behavior, preferences, and roles, emerge from the pattern of exogenous network ties. In the lectures on structural holes, status, and peer effects, we argue that the pattern of social relationships causes differences in behavior, preferences, and roles, and not vice versa.

Screen Shot 2017-05-10 at 10.31.38 AM

The challenge of network formation

However, a challenge for the social-relationships-first perspective is that networks are unlikely to be fully “exogenous.” They form and evolve through processes that make some people more likely to connect with each other and others less likely to do so.

Network scholars have spent considerable time trying to understand how networks form and change. At a broad conceptual level, we can think about five factors that shape whether a tie forms between two individuals (e.g., ego and alter).

Screen Shot 2017-05-10 at 11.08.33 AM.png

The logic behind most models of network formation is simple. At one end, there are “benefits” to connecting with someone, whether actual or perceived, pecuniary or non-pecuniary (psychic). At the other end, there are “costs” that make it easier or harder to form a relationship with someone, because searching for them, coordinating with them, or dealing with them is more or less costly than with someone else. Relatedly, some individuals may have a lower cost of building a network than others, and some alters may be lower cost (relative to benefit) to connect with.

Factor 1: Characteristics of Ego, the sender.

“Factor 1” covers characteristics that make it easier for certain types of people to connect with many others. These characteristics may either lower the cost of making many connections (relative to others) or provide greater benefit from doing so. Research in this stream has found a substantial range of individual-level characteristics that predict an increased or decreased propensity to have a certain type of network. These include:

  • Personality: Some work has found that differences in personality traits are correlated with network structure. For instance, individuals who have many ties are also likely to have extroverted personalities. Relatedly, those who are high in “self-monitoring” have a greater likelihood of being “brokers” or occupying “structural holes” in a social network.
  • Other factors that may also be related to larger networks include:
    • Strategic intent
    • Intelligence
    • Physical characteristics (e.g., beauty or height)
    • Age
  • Some factors may describe an individual at a certain point in time:
    • After the loss of a job
    • After being promoted to a new role
  • Other factors may be socially constructed, but describe Ego in a given context:
    • Caste
    • Religion

One can reason about the various ways in which these characteristics of Ego either lower their costs of making ties or increase the benefit they get. Can you come up with other individual-level factors that might matter?

Factor 2: Characteristics of Alter, the receiver.

A related set of arguments can be made about the characteristics of an alter or alters. For instance, one could theorize about the following characteristics of alter(s) that may make them more likely to receive connections from others.

  • Personality
  • Intelligence
  • Skill
  • Wealth
  • Social standing
  • Formal role in the organization

As with the Ego-centric perspective, one could use a “cost” and “benefit” logic to reason about why some Alters may have more advice seekers (e.g., they are smart) or more friends (e.g., they are helpful). In purely altercentric models, we ignore the characteristics of Ego.

Factor 3: The interaction of Ego/Alter characteristics (e.g., homophily)

The third factor concerns the “Ego-Alter” interaction. In such models, something about the characteristics of Ego and Alter together predicts an increased or decreased propensity to have network ties. The most common theme in these models is homophily, the tendency for individuals who are similar to each other to have a higher propensity to connect. Research has found that individuals who are similar in the following characteristics are more likely to connect with each other, relative to the alternatives:

  • Race and ethnicity
  • Gender
  • Age
  • Formal organizational position
  • Occupation
  • Religion

There are many theories about why such a pattern exists. On one hand, social contexts (e.g., communities, neighborhoods, etc.) are often organized by these characteristics. This makes it much easier to connect with people who are similar to you. There is also an element of choice. Individuals who are similar to you are likely to have similar experiences, share similar values, and like and dislike similar things. As a consequence, the cost of interacting with similar people is likely to be lower than the cost of interacting with people who are different.
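A minimal sketch of how one might check for homophily in a network, using base R only. The network `adj` and the attribute vector `group` are made up for illustration and are not data from this post. The idea is to compare the share of observed ties that connect similar people against the share we would expect if ties ignored the attribute.

```r
# Hypothetical directed network among six people (row = sender, column = receiver).
adj <- matrix(0, nrow = 6, ncol = 6)
adj[1, 2] <- 1; adj[2, 1] <- 1   # within group A
adj[2, 3] <- 1                   # within group A
adj[4, 5] <- 1; adj[5, 6] <- 1   # within group B
adj[3, 4] <- 1                   # across groups

# Hypothetical attribute (e.g., occupation) for each person.
group <- c("A", "A", "A", "B", "B", "B")

# TRUE where the row person and the column person share the attribute.
same_group <- outer(group, group, "==")

# Share of observed ties that connect similar people.
share_same <- sum(adj * same_group) / sum(adj)

# Baseline: share of all possible ordered pairs (excluding self-pairs)
# that are same-group.
diag(same_group) <- FALSE
baseline <- sum(same_group) / (6 * 5)

c(observed = share_same, baseline = baseline)
```

If the observed share is well above the baseline, ties are disproportionately between similar people. This alone does not say whether the pattern comes from choice or from how social contexts are organized.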

However, the type of relation may matter here. In mating networks you are more likely to see heterophily than homophily. This might also be true of mentoring relationships, where individuals are more likely to be mentored by those at a different level of seniority.

What other factors at this level might increase or decrease the cost of interaction or raise its benefits?

Factor 4: Social and Physical Context

The fourth factor can broadly be thought of as the social or physical context within which individuals are forming social networks. A simple example is office or neighborhood layout. A large body of research has found that physical distance has a substantial effect on whether two individuals form ties. Scientists who are located near each other, for instance, are more likely to collaborate, and their research trajectories also become rather similar.

Research has found an exponential relationship between physical distance and the propensity to connect: individuals who are physically proximate are substantially more likely to interact, and rates of interaction decline steeply as distance increases. This effect is called propinquity.
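To make the shape of this relationship concrete, here is a small sketch in R. The functional form and the parameter values (`p0`, the tie probability at zero distance, and `lambda`, the decay rate) are assumptions chosen for illustration, not estimates from any study.

```r
# Hypothetical exponential-decay model of tie probability by distance.
# p0 and lambda are assumed values, not empirical estimates.
tie_probability <- function(distance, p0 = 0.6, lambda = 0.05) {
  p0 * exp(-lambda * distance)
}

distances <- c(0, 10, 20, 40, 80)   # e.g., meters between desks
probs <- tie_probability(distances)

# Tie probability falls quickly at first, then flattens out.
round(probs, 3)
```

Under this form, every additional 10 units of distance multiplies the tie probability by the same factor, exp(-10 * lambda), which is what produces the steep early decline.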

In addition to propinquity, other aspects of the social context are also likely to affect the extent of tie formation. These factors could be the reorganization of roles, task inter-dependencies, as well as cultural or organizational norms regarding competition or collaboration. Incentives are also important in determining what the shape of the network might be. The challenge with many of these effects is that they are often “absorbed” into the intercept of the model. That is, they can only be detected when looking across contexts, not within a context.

Factor 5: Endogenous Network Processes


Finally, the structure of one part of the network may affect the structure of another. Consider a simple example: reciprocity. If I consider you a friend, there is a social-psychological as well as a sociological process that increases the likelihood that you also consider me a friend. This is akin to tit-for-tat: if you give me a gift, I will give you one in return. Networks exhibit this property with substantial regularity (but not always!). In this context, the emergence of a network tie, the reciprocal one, is endogenous to the network. That is, it emerges from within the network structure and not outside of it.
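A minimal sketch of how reciprocity can be measured from an adjacency matrix, using base R. The nominations below are made up for illustration.

```r
# Hypothetical directed friendship nominations (row = sender, column = receiver).
people <- c("Alice", "Bob", "Chris", "Dina")
friend <- matrix(0, 4, 4, dimnames = list(people, people))
friend["Alice", "Bob"]   <- 1
friend["Bob",   "Alice"] <- 1   # reciprocated
friend["Alice", "Chris"] <- 1   # not reciprocated
friend["Chris", "Dina"]  <- 1
friend["Dina",  "Chris"] <- 1   # reciprocated

# A tie i -> j is reciprocated when j -> i also exists, so multiplying the
# matrix element-wise by its transpose keeps only reciprocated ties.
reciprocated <- friend * t(friend)

# Share of all ties that are reciprocated.
reciprocity <- sum(reciprocated) / sum(friend)
reciprocity
```

Here four of the five nominations are returned, so the share is 0.8. A value far above what random tie placement would produce is one sign that an endogenous process is at work.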

Similarly, there are other endogenous processes that researchers have detected in networks. These include transitivity: a friend of a friend is often a friend. Heiderian balance theory, for example, argues that individuals desire balance in their relationships. The situation of being friends with your friend’s enemy is unsustainable according to balance theory (why?). Because it is unsustainable, that structure will endogenously change into something else: either the enemies become friends or the network splits.
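As a sketch of how transitivity can be quantified, the base R code below counts how many "friend of a friend" paths are closed by a direct tie. The four-person undirected network is hypothetical.

```r
# Hypothetical undirected friendship network (symmetric matrix, no self-ties).
people <- c("Alice", "Bob", "Chris", "Dina")
friend <- matrix(0, 4, 4, dimnames = list(people, people))
ties <- rbind(c("Alice", "Bob"), c("Bob", "Chris"),
              c("Alice", "Chris"), c("Chris", "Dina"))
for (k in seq_len(nrow(ties))) {
  friend[ties[k, 1], ties[k, 2]] <- 1
  friend[ties[k, 2], ties[k, 1]] <- 1
}

# Entry [i, j] of friend %*% friend counts two-step paths i - k - j.
two_paths <- friend %*% friend
diag(two_paths) <- 0   # ignore paths that return to the starting person

# A two-step path is "closed" when i and j are also directly tied.
closed <- sum(two_paths * friend)
transitivity <- closed / sum(two_paths)
transitivity
```

In this toy network, Alice, Bob, and Chris form a closed triangle, while Dina is tied only to Chris, so 6 of the 10 two-step paths are closed (0.6).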

Other forces include preferential attachment. New entrants into a network are more likely to connect to existing individuals in proportion to their degree centrality. This process gives some networks a power law degree distribution, rather than the binomial/normal distribution that would be expected if the network were formed through a purely random process.
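The contrast can be illustrated with a small simulation in base R. This is a simplified sketch under assumed settings: each entrant forms exactly one tie, the network size and seed are arbitrary, and the "random" benchmark is uniform attachment to an existing node rather than a full random-graph model.

```r
set.seed(42)          # arbitrary seed so the sketch is reproducible
n_nodes <- 500        # assumed network size

# Preferential attachment: each entrant links to one existing node chosen
# with probability proportional to that node's current degree.
degree_pa <- numeric(n_nodes)
degree_pa[1:2] <- 1   # start from a single tie between nodes 1 and 2
for (new in 3:n_nodes) {
  target <- sample.int(new - 1, size = 1, prob = degree_pa[1:(new - 1)])
  degree_pa[target] <- degree_pa[target] + 1
  degree_pa[new] <- 1
}

# Benchmark: each entrant links to one existing node chosen uniformly.
degree_rand <- numeric(n_nodes)
degree_rand[1:2] <- 1
for (new in 3:n_nodes) {
  target <- sample.int(new - 1, size = 1)
  degree_rand[target] <- degree_rand[target] + 1
  degree_rand[new] <- 1
}

# Compare the most connected node under each process.
c(preferential = max(degree_pa), random = max(degree_rand))
```

Both processes create the same number of ties, but preferential attachment should typically concentrate many more of them on a few early, well-connected nodes. Plotting `table(degree_pa)` on log-log axes is one way to inspect the heavy tail.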


[Figure: power law distribution]

[Figure: normal distribution]



Empirical considerations

Though the theoretical ideas behind network formation are quite straightforward, disentangling the differential impact of these effects remains quite challenging. In a subsequent post, we will discuss the various approaches to estimating these models.



Seeing the networks in your company

Thus far we have assumed that we had network data. But data like the “Professionals” data were gathered using a survey in a real organization. In this post I will walk you through the process of creating a simple network survey in SurveyMonkey (a web-based survey application) and analyzing the responses from the survey using R. Let’s begin by going to www.surveymonkey.com. Here is the landing page (as of May 5, 2017). You will need to purchase a basic subscription to download the data (I purchased an educator subscription for $18).

Screen Shot 2017-05-05 at 8.31.33 AM.png

I’ve signed up for a free account (for now). After I complete all my signup information, here is the screen that I get, asking me to start by creating a survey.

Screen Shot 2017-05-05 at 8.35.15 AM

I will call my survey, “Simple Network Survey.” I enter this into the text box, and then press + Add Questions. Pressing this takes me to a new screen.


Screen Shot 2017-05-05 at 8.37.27 AM

In order to create the appropriate network data (where we know who considers whom a friend, advice giver, etc.), we will need to begin by asking people who they are. I prefer to do this first using a dropdown menu where an individual can select just one option. The question I ask is: What is your name? Please select from the dropdown menu. Make sure that the question type is “Dropdown.”

Screen Shot 2017-05-05 at 8.39.27 AM.png

Once I have this, I would like to enter the names of the people who will be taking the survey. My list (of fake people) includes: Alice, Bob, Chris, Dina, Elena, Frank, and Greg. I add these using the “Add Answers in Bulk” option:

Screen Shot 2017-05-05 at 8.42.24 AM.png

Once I click save, I move to the Options tab, and I check off “Require an Answer to This Question.” Next I click DONE. 

I now create a new page (+ New Page). This is where I will place the network survey.

Screen Shot 2017-05-05 at 8.44.24 AM.png

For the purposes of this example, I will only ask two questions about people’s networks. What questions shall we ask?

Perhaps one of the hardest things to teach about network analysis is determining the right types of questions to ask people. The questions should reveal something about people and their social networks that we might not have been able to assess if we hadn’t asked them those questions.

We can think about kinds of questions in terms of a 2×2. On one dimension, we have questions about networks that provide people with resources (Instrumental) versus questions about more personal/social relationships (Expressive). On the other dimension, we have questions that are either “Enduring or qualitative” or “Event based.” The table below summarizes some examples.

             | Enduring/Qualitative       | Event Based
Instrumental | Advice                     | Asked for advice in the past week
Expressive   | Friendship; Social support | Informally go to lunch; Talked about important personal matters

Here are some examples:

Questions about who you know:

Below is a list of names of your colleagues at [firm name]. Some of them you may (1) know well, others you (2) may be acquainted with, and still others (3) you may not know at all. Please check the box next to the names of those individuals who are in categories (1) or (2).

Advice (Work-related)

Sometimes it is useful to get help or advice from your colleagues on performing some aspect of doing your work well. Please check the box next to the names of those individuals who you would approach for help or advice on such work related issues.

Advice (Work related) Reciprocal

There also may be people who come to you seeking help or advice about doing their own work well. Please check the box next to the names of those individuals who might typically come to you for help or advice on work related issues.

Advice (Career and Success)

Sometimes it is useful to seek advice from colleagues at work about more than just how to do your work well. For example, you may be interested in “how things work” around here, or how to optimize your chances for a successful career here. If you needed help along these lines, who would you go to for help or advice regarding these issues? Please check the box next to the names of those individuals who you would approach for help or advice on these non-technical issues.

Advice (Career and Success) Reciprocal

There also may be people who come to you seeking help or advice about such non-task related issues. Please check the box next to the names of those individuals who might typically come to you for help or advice along these dimensions.


Friendship

Sometimes during the course of interactions at the workplace, friendships form. We are interested in whether you have people at [firm name] who you consider to be friends of yours. Please check the box next to the names of the individuals who you think of as friends here at [firm name].

Event based questions:


Below you will find a list of people who work at [firm name]. Please check the names of the individuals with whom you have met for lunch at least once during the past 30 days.

Event based advice

Below you will find a list of people who work at [firm name]. Please check the names of the individuals from whom you’ve sought out advice about work related matters at least once during the past 30 days.

The problem of recall: People are highly inaccurate when you ask them to recall specific interaction events. They are much more accurate when you ask them to recall enduring and qualitatively meaningful relationships.  Events are highly informative when you know what happens during that event, but otherwise they are harder to generalize from.

Now that we have some examples of questions, let’s add one to the survey. I typically recommend having two questions, one expressive (e.g., friendship) and one instrumental (e.g., advice). They usually provide different information.

Let’s, for the sake of example, add an advice network question to Page 2. We will create a “Multiple Choice” question where the answers are the names of the people in the organization (e.g., Alice, etc.). The question we ask is:

Sometimes it is useful to get help or advice from your colleagues on performing some aspect of doing your work well. Please check the box next to the names of those individuals who you would approach for help or advice on such work related issues.

We will also add a short note telling people not to select their own name and to check as few or as many names as appropriate. Below the options, also check “Allow more than one answer to this question (use checkboxes).”

Screen Shot 2017-05-05 at 9.00.21 AM

Let us now save this question by clicking save.

I will now add one more question. This can be our “dependent variable,” which measures the extent to which co-workers have a positive or negative impact.

Screen Shot 2017-05-05 at 9.55.18 AM.png

After all the questions are in, click “Next” at the top and let’s begin collecting responses.

Screenshot 2017-05-05 10.39.16.png

We will use the “Get Web Link” option. The web link for the survey I made is:


Let’s quickly fill out the survey. I will also fill in responses for everyone in the roster.

Screenshot 2017-05-05 10.42.01.png

After all the responses are in for all the people in the organization (e.g., Alice…) we can download the data. I have downloaded the Excel export. It comes as a zip file, which unpacks to a csv file with the data. These are respectively attached here and here.

The raw CSV file that is exported from Survey Monkey looks like this:

Screenshot 2017-05-05 19.26.35.png

Let’s clean this up so that we get a 7×7 matrix. Note that there is an ordered list of names on the left (Alice…Greg on the rows) and a similarly ordered list of names at the top (columns). The rows are the respondents (senders) and the columns are the people with whom they do and do not have a relationship. With the names, the matrix looks like:

Screenshot 2017-05-05 19.30.34.png

Without the names, it looks like:

Screenshot 2017-05-05 19.37.20.png

Try to match it up to the survey response in our original file. The matrix is now saved as surveyexample.csv.
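The cleanup can also be done in code rather than by hand. The sketch below assumes the checkbox answers have already been read into a list mapping each respondent to the names they checked. The `checked` list is hypothetical and does not reproduce the responses in the file above.

```r
# Roster from the survey.
roster <- c("Alice", "Bob", "Chris", "Dina", "Elena", "Frank", "Greg")

# Hypothetical responses: for each respondent, the names they checked.
# These are made up for illustration and are not the responses in the post.
checked <- list(
  Alice = c("Bob", "Chris"),
  Bob   = c("Alice"),
  Chris = c("Alice", "Dina", "Elena"),
  Dina  = c("Chris"),
  Elena = c("Chris"),
  Frank = c("Greg"),
  Greg  = c("Frank", "Alice")
)

# Start from an empty 7x7 matrix: rows are senders, columns are receivers.
adj <- matrix(0, nrow = length(roster), ncol = length(roster),
              dimnames = list(roster, roster))

# Mark a 1 wherever a respondent checked a colleague's name.
for (sender in names(checked)) {
  adj[sender, checked[[sender]]] <- 1
}

adj
```

Row sums of `adj` give the number of names each respondent checked (outdegree), and column sums give the number of times each person was named (indegree).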

The following code imports the data (the cleaned up version above) and plots the network:

# This file provides some simple code to get you started on your Network Analysis Journey

# Load the packages used below: data.table provides fread(); sna provides gplot().
library(data.table)
library(sna)

#(Q0) "who do you know or know of at [the firm]",

# Load the "Survey Monkey" network data from Dropbox.
survey <- fread('https://www.dropbox.com/s/nd13m6szn8d8lto/surveyexample.csv?dl=1')

# Convert the data.table object into matrix format so it can be
# analyzed using the sna package.
survey = as.matrix(survey)

# This creates the node names
names = c("Alice", "Bob", "Chris", "Dina", "Elena", "Frank", "Greg")

# Rename all the rows
rownames(survey) = names

# Rename all the columns
colnames(survey) = names

# Plot the survey network
gplot(survey, label = names)

Here is the resulting network.

Screenshot 2017-05-05 20.44.55.png


We can calculate each person’s centrality and also correlate the network positions with the final question we asked. We first need to convert the responses to that question into numeric values and then import them into R.

# This file provides some simple code to get you started on your Network Analysis Journey

# Load the packages used below: data.table provides fread(); sna provides
# gplot() and degree().
library(data.table)
library(sna)

#(Q0) "who do you know or know of at [the firm]",

# Load the "Survey Monkey" network data from Dropbox.
survey <- fread('https://www.dropbox.com/s/nd13m6szn8d8lto/surveyexample.csv?dl=1')

# Convert the data.table object into matrix format so it can be
# analyzed using the sna package.
survey = as.matrix(survey)

# This creates the node names
names = c("Alice", "Bob", "Chris", "Dina", "Elena", "Frank", "Greg")

# Rename all the rows
rownames(survey) = names

# Rename all the columns
colnames(survey) = names

# Plot the survey network
gplot(survey, label = names)

# Load the "Survey Monkey" outcome data from Dropbox.
surveyoutcome <- fread('https://www.dropbox.com/s/we2dvevfejte8ov/surveyoutcome.csv?dl=1')

# Convert the data.table object into matrix format.
surveyoutcome = as.matrix(surveyoutcome)

# Rename the columns and create a variable which is the integer
# version of the numeric response
colnames(surveyoutcome) = c("name", "response", "respval")
respval = as.integer(surveyoutcome[,3])

# Calculate outdegree for the survey response
survey.outdegree = degree(survey, cmode = "outdegree")

# Estimate a model regressing the respval on the outdegree
m.0 = lm(respval ~ survey.outdegree)

# Print the regression results
summary(m.0)

Here is the regression outcome:

Screenshot 2017-05-05 21.01.51.png


The above walk-through should give you a way to collect network data, and then analyze it using R.

Before I conclude, I want to discuss the various survey approaches used by network analysts.

Types of Network Surveys

Roster based surveys: Roster based methods are perhaps the most common approach. This is what we just completed above. With roster surveys, you provide the respondent with a list of names of people or organizations. Then you ask them to indicate (by checking off the boxes next to the names) which of these people they have a certain relationship with. The nice thing about roster based surveys is that they tend to be quite accurate because people don’t have to recall the names out of the blue. Further, the roster allows you to get longer network lists than if people had to recall names from memory. The downside is that if the organization has too many people (say, in the thousands), it would be too hard to make people go through a list of 1,000 or, even worse, 2,000 names.

List based surveys: The other type of survey is a list survey. Here you ask the question and then request that your respondents list the names of people in the organization that they have this relationship with. What might be some concerns with a survey method like this? 

Ego-network surveys:  This is a slightly modified version of the list-based survey. Here you ask the people to list up to five people (or k people) that they have a certain relationship with. Then you ask them to indicate whether the people listed also have a relationship of a certain type with each other. 

Position generator surveys: This is perhaps the least structural of the network surveys. Here you provide a list of the “positions” that people can potentially occupy. In an organization, for example, you list the different functional areas, levels of seniority, etc. You then ask people whether they have no relationship with someone in such a position, an acquaintance in that position, a friend in that position, etc. This is a very indirect measure of networks, but it provides a broad understanding of the “range” of a person’s network.

In addition to these classical approaches to collecting network data, organizations have more modern methods available to figure out potential sources of interaction between their employees. These include:

Email:  IT administrators know every email you send to everyone else and what it contains. This is true in most cases in the vast majority of organizations. Scary, yes. True, yes. But this is information that everyone knows exists and some organizations are using it to understand informal interaction and trying to make better decisions with this information.

Mailing list/Groups activity: Another source of information about networks and interaction are the mailing lists that people are a part of.

RFID: Most of our ID cards have RFID these days; we use these cards to enter and exit buildings. RFID sensors can also be placed in strategic locations to understand face-to-face interactions between people. Conference organizers are also using RFID tags to understand interaction among attendees.

Online data sources:

LinkedIn —  LinkedIn has a massive economic graph. Their data include where people got their degrees, where they worked, who they worked with, etc.

Facebook: This is the largest social network in the world. Period.

About firms: The websites of venture capital firms tell you who their partners are, where they attended college, and when they graduated. They also tell you that some may be investing in similar projects.


More: In a future post, I will walk through how to create “network” data using text in documents. The “ties” here are measures of similarity between the text descriptions of entities.


Peer effects, knowledge transfer and social influence

The structural approach to social networks is inherently beautiful as a representational approach. I am always in awe of the fact that we can learn so much about how human beings act or their outcomes based merely on the pattern of their social ties. The idea is both simple and profound.

The structural approach is built on assumptions regarding information transfer across a simpler unit of analysis: the dyad. In the world of dyads, new complications arise and different theories must be developed and tested.

Let us take the Professionals data we have been analyzing as an example. Here is the advice network among these professionals.

Screen Shot 2017-05-04 at 10.45.24 AM.png

In the prior analyses, we have focused on analyzing the structure of each node’s connections. For example, each node has a specific number of incoming connections, its indegree:

Screen Shot 2017-05-04 at 10.47.03 AM.png

The beauty of the structural approach to social networks is that we can learn a lot about the outcomes of individuals and organizations by merely looking at the pattern of their relationships. Recall our prior analysis. There is information in indegree. We were able to explain 6.5% of the variation in our measure of whether a person has the “knowledge to succeed” just by looking at the count of their incoming connections! While indegree may capture or reflect other processes and might not be causal, it is nevertheless information rich.

However, an Ego’s alters (e.g., the people that a focal node is connected to) are not all the same—as we sometimes implicitly assume in our models. As a note, I don’t believe that researchers actually believe that all the people we are connected to are the same. Indeed, betweenness, closeness, eigenvector centrality, all assume that not all connections are the same by their very construction. However, the heterogeneity in alter characteristics is implicit rather than explicit because we never specify in our theories or models, exactly how these individuals vary.

The peer effects framework, on the other hand, often ignores variation in structure but emphasizes variation in the characteristics of connections.

Below, I walk through some examples of this approach.

A simple model of peer effects

The “peer effects” framework gets its name from a line of research in the economics of education where scholars were attempting to understand the impact of classroom peers on academic outcomes. Hence, peer effects.

Let us start with a simple setup. Let us assume there are 100 students in a classroom. The teacher has decided that everyone in the class will have a study partner, so she asks the students to pair up into groups of two. There are now 50 pairs, each with two people. The teacher wonders whether having a smart peer (i.e., alter) increases the performance of a focal student (i.e., Ego). Visually, she is interested in understanding this influence process:

Screen Shot 2017-05-04 at 1.20.36 PM.png

At the end of the class, all of the students take a standardized exam. This exam is scored on a 100 point scale, and students can get anywhere from a score of 0 to 100. The teacher takes this score and runs the following regression with 100 observations, 1 for each student. She’s also good with standard errors, so she clusters standard errors at the level of the dyad:

score_{i} = \beta_{0} + \beta_{1} score_{j} + \epsilon 

After running the regression, she finds a large and statistically significant coefficient for \beta_{1}. How should she interpret it?

A naive causal interpretation is: for every unit increase in score_{j} there is a corresponding \beta_{1} increase in score_{i}. Or, by having a study partner with a certain score, there is a corresponding increase/decrease in the performance of the focal student. This interpretation is naive for a reason: it is probably (though not definitely) wrong.
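A small simulation can show why the naive reading is risky. In the base R sketch below, all values are simulated under assumed parameters: scores depend only on a student's own ability, students pair up with partners of similar ability, and there is no peer influence at all. The sketch does not cluster standard errors, since only the coefficient is of interest here.

```r
set.seed(1)           # arbitrary seed so the sketch is reproducible
n_students <- 100

# Each student has an underlying ability. Sorting lets us mimic students
# choosing partners who are similar to themselves.
ability <- sort(rnorm(n_students, mean = 70, sd = 10))

# Exam score = own ability + noise. The partner never enters the score.
score <- ability + rnorm(n_students, sd = 3)

# Pair neighbors in the sorted list: student 1 with 2, 3 with 4, and so on.
idx <- seq_len(n_students)
partner <- ifelse(idx %% 2 == 1, idx + 1, idx - 1)
score_j <- score[partner]

# The teacher's regression: focal student's score on the partner's score.
naive <- lm(score ~ score_j)
coef(naive)["score_j"]
```

Even though the simulated peer effect is zero by construction, the estimated coefficient on `score_j` should come out large and positive, purely because similar students paired up.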

But before we dive into why it is probably wrong, it is useful to reiterate that this “peer effects” representation is quite general. For example, the following outcomes might be determined in part by the influence of peers (however defined).


  • Finance: Putting money away into a retirement savings account, adopting a microfinance product, etc.
  • Health behaviors: Obesity, Happiness, use of HIV/AIDS test, etc.
  • Academic performance: Getting good grades, choosing a major.
  • Entrepreneurship: Becoming an entrepreneur; deciding against becoming an entrepreneur.
  • Careers: Quitting; moving to a new company.
  • Adoption of products: Prescribing a drug, buying a car.
  • Adoption of behaviors: Smoking, drinking, sexual events.
  • Adoption of ideas: Learning from patents.
  • Organizational behavior:  Adoption of corporate practices and policies.

The basic idea is simple: We observe some level or change in the behavior or characteristics of an alter (or alters) and we see whether these are correlated to the behaviors or outcomes of Ego.


This apparently simple process is much more nuanced and complicated than it appears. There are dozens of “mechanisms” that can lead to the correlation we might observe (or that the teacher observes). Here are a few reasons why we might observe a correlation, either positive or negative. Consider the case of product adoption.



#  | Name | Definition
1  | Direct transfer of specific information | Alter tells me about a product, but nothing more.
2  | Persuasion effects | Alter tells me about the product, and forcefully persuades me to adopt it.
3  | Direct transfer of general information | Alter tells me about a website that reviews products, and on this page a list is produced where the product that I adopt is listed first.
4  | Role-modeling / Imitation | I see Alter doing something, I copy it.
5  | Install base effects | I see many Alters adopting a product (e.g., buying an iPad), so I adopt the iPad.
6  | Threshold effects | I only buy an iPad if at least 10 people I know own it; once the 10th person adopts, I decide to adopt.
7  | Snob effects | I see an Alter (or Alters) doing something, I avoid doing it myself.
8  | Simultaneous | Alter helps me out and I help her out, and together we perform better than either one would alone because, by talking through a problem for example, we figure it out together.
9  | Reverse causality | The Alter does not affect Ego; rather, the Ego affects the Alter.
10 | Contextual effects | We are both in the same neighborhood, and because we get exposed to the same billboard, we see the same advertisement for a product, and thus we adopt it.
11 | Induced environmental effects | Having a high achieving peer results in a teacher who teaches at a higher level, thus the student learns more not because of greater transfer of information from her peer, but because teaching quality improves.
12 | Selection bias | I become friends with people who already own iPads. I become friends with people who like technology, and because they like technology, they also own iPads.
13 | Homophily effects | I like iPads and because I do, I become friends with others who like iPads.
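Some of these mechanisms are easy to state as explicit rules. As an illustration, here is a minimal R sketch of the threshold effect (#6). The function name `adopts` and the counts over time are hypothetical.

```r
# Hypothetical threshold rule: Ego adopts once at least `threshold`
# of the people Ego knows have adopted.
adopts <- function(n_alter_adopters, threshold = 10) {
  n_alter_adopters >= threshold
}

# Number of Ego's contacts who own the product in successive periods
# (made-up values for illustration).
adopters_over_time <- c(2, 5, 9, 10, 14)

# Whether the rule is satisfied in each period.
adopts(adopters_over_time)

# Ego's adoption period is the first period in which the rule is satisfied.
adoption_period <- which(adopts(adopters_over_time))[1]
adoption_period
```

With a threshold of 10, Ego does nothing while 2, 5, or 9 contacts own the product and adopts in the fourth period, when the tenth contact adopts. Contrast this with install base effects (#5), where each additional adopter would raise Ego's propensity to adopt gradually.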

Can you think of more mechanisms?


Which mechanism is actually at play in a specific context?

This question is a hard one. Because we have several potential mechanisms that we must work with, how do we rule out some of them? Some mechanisms are easier to rule out than others, but most are actually quite difficult to conclusively confirm or deny.

To deal with this issue (which is VERY common during the review process), I have come up with a two-part classification. The first set of mechanisms are what I call “pseudo-mechanisms.” Pseudo-mechanisms are alternative explanations of the correlation that have nothing to do with social influence of the type we care about: influence flowing from the peer to the focal individual. Charles Manski, in a famous paper, defined these as the reflection problem and the selection problem.

Reflection problem: The reflection problem asks you to imagine a mirror. You see two objects moving, and if it is unclear to you that you are looking at a mirror, then you can’t tell which one is the actual person who is moving and which one is the mirror image. More formally, imagine that we have two variables, x and y; let x be the measurement of the characteristics of the focal individual’s peers at time t and let y be the measurement of the focal individual’s characteristics at time t. Because the two are measured simultaneously, we are unable to tell whether a change in x has caused a change in y, or vice versa. And this indeterminacy exists for each observation.

Furthermore, we are unable to tell whether each of these actors was exposed to some environmental shock (advertising, etc.) at the same time, which would make their adoption correlated. The only way that we can ensure that the reflection problem is not an issue is by measuring the traits and characteristics of the xs prior to measuring those of y.

However, doing this does not resolve the issue of causality. Thus, it is a necessary, but insufficient, condition.

Another important, and much more difficult, condition now has to be met in order for the effect to have the title “Causal.”  This is the selection problem. Either of two conditions solves the selection problem:

  1. You know all the reasons why two people were paired together (i.e. why person y is friends with, shares a room with, or enters college with x).
  2. OR the two individuals are randomly assigned, thus breaking the correlation between the characteristics of x and y.
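A small simulation can make the second condition concrete. The sketch below is only an illustration under stated assumptions: a single made-up trait per person, a "selected" pairing in which similar people end up together (a crude stand-in for homophily), and a random pairing. None of the names or numbers come from the class data.

```r
# Hypothetical illustration: one trait per person, 2000 people, 1000 pairs.
set.seed(42)
n <- 2000
trait <- rnorm(n)

# Selected pairing: sort people by trait and pair each person with the next one,
# so that similar people end up together.
sorted <- sort(trait)
ego_sel   <- sorted[seq(1, n, by = 2)]
alter_sel <- sorted[seq(2, n, by = 2)]

# Random pairing: shuffle everyone, then pair them off in order.
shuffled <- sample(trait)
ego_rand   <- shuffled[seq(1, n, by = 2)]
alter_rand <- shuffled[seq(2, n, by = 2)]

# Correlation between the traits of x and y under each pairing rule.
cor_selected <- cor(ego_sel, alter_sel)
cor_random   <- cor(ego_rand, alter_rand)
print(c(selected = cor_selected, random = cor_random))
```

Under the selected pairing the two traits are almost perfectly correlated before any influence has taken place; under random pairing the correlation is close to zero, which is what lets a later correlation be read as influence.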

Assume for a moment that we have ruled out reflection and selection effects because (1) we use a lagged measure of peer consumption or action, and (2) the ego and alter are randomly paired. Even then, we have only ruled out a handful of the possible “mechanisms” producing the peer effects. We can rule out the “pseudo-mechanisms” #8 – #13 (except for #11), but that leaves us with 8 possible mechanisms.

Imagine a doctor telling you that “Yes, we’ve ruled out the fact that you are faking your symptoms, but there are 8 or more possible viruses that could be causing your infection!”

So, we need to now try and distinguish between these.

This is hard, even harder than resolving the reflection and selection problems.  The reflection and selection problems are interesting in that they are hard problems to solve, but we know how to solve them. Not to make too many medical analogies, but this is like separating conjoined twins. Hard, but someone can do it and has done it.

So how do we distinguish between different mechanisms, say #1 – #7?

This will depend a lot on context, and a lot on the data that you have available.

Let us examine a very simple situation where we have two students. Let us call the first student “Ego” and let us call the second student “Alter.” Assume for a moment that we have completely alleviated the problems of reflection and selection.


Screen Shot 2017-05-04 at 2.31.58 PM.png

Let us say that really there are two contender mechanisms.  (This is probably not true; but, for a moment assume that it is true.)

Mechanism 1: A student learns general study habits from his/her peer (alter) and this is why his/her performance increases.

Mechanism 2: A student interacts a lot with his/her peer (alter) and they study together, and the peer helps the student learn the material.

How would we go about designing a test that would distinguish between these two mechanisms?

  1. For instance, if what the student is getting from her peer is increased motivation, that should have a positive effect on various subjects.
  2. On the other hand, if the student is learning something rather specific (like how to do an integral), then the effects should be subject specific.

Assume you do this test, and you find out that there are effects across subjects, what can you say about the mechanisms? Can you say anything?

How to conduct the estimation in R

Standard peer effects estimations are quite straightforward. This is especially true when you have randomization in the pairing of focal individuals to peers and longitudinal data so you can lag the characteristics of the peer.

score_{i,t+1} = \beta_{0} + \beta_{1} score_{j,t} + \epsilon 

Here is a synthetic peer effects dataset in which 2000 individuals have been randomly paired: peer_effects.csv.

Let us examine the extent to which there are peer effects.

The model we want to estimate is:

postself_{i,t+1} = \beta_{0} + \beta_{1} prepeer_{j,t} + \epsilon 

Estimating this equation in R with this data results in:

Screen Shot 2017-05-04 at 3.28.39 PM.png

If the randomization is proper, this coefficient should be stable if we control for the focal individual’s own pretreatment score.

Screen Shot 2017-05-04 at 3.30.22 PM.png
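For readers who want to try the two regressions without the CSV, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the variable name preself, the true coefficients (0.5 on the focal individual's own pre-treatment score, 0.3 on the peer's), and the normal noise. It is not the data behind the screenshots.

```r
# Simulated stand-in for the peer effects data (all values are hypothetical).
set.seed(1)
n <- 2000
preself <- rnorm(n)   # focal individual's pre-treatment score
prepeer <- rnorm(n)   # randomly assigned peer's pre-treatment score
postself <- 0.5 * preself + 0.3 * prepeer + rnorm(n)
d <- data.frame(postself, preself, prepeer)

# Model 1: the peer's lagged score only.
m1 <- lm(postself ~ prepeer, data = d)

# Model 2: add the focal individual's own pre-treatment score as a control.
m2 <- lm(postself ~ prepeer + preself, data = d)

print(round(c(without_control = coef(m1)[["prepeer"]],
              with_control    = coef(m2)[["prepeer"]]), 3))
```

Because prepeer is drawn independently of preself (which is what random pairing buys you), the coefficient on prepeer barely moves when the control is added.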

Another worry we have is whether this effect of the peer (captured by the pre-treatment characteristics) is homogeneous or heterogeneous. That is, does it depend on the characteristics of the focal individual or does it apply to everyone? To test this, we include a main effect of the characteristics of the focal individual (self_char) and an interaction term (pre_peer * self_char).

Screen Shot 2017-05-04 at 3.33.01 PM.png

Here, we see that the peer effect depends on the characteristic of the focal individual. If the focal individual has this characteristic (e.g., willingness to listen), the peer effect is larger.
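The interaction model can be sketched the same way on simulated data. The variable names pre_peer and self_char follow the text; the name post_self, the coding of self_char as 0/1, and the true effect sizes are assumptions made only for this illustration.

```r
# Simulated data with a heterogeneous peer effect (all values are hypothetical).
set.seed(2)
n <- 2000
pre_peer  <- rnorm(n)
self_char <- rbinom(n, 1, 0.5)   # 1 = focal individual has the characteristic

# Assumed truth: the peer effect is 0.1 without the characteristic
# and 0.1 + 0.4 = 0.5 with it.
post_self <- 0.1 * pre_peer + 0.2 * self_char +
  0.4 * pre_peer * self_char + rnorm(n)

# pre_peer * self_char expands to both main effects plus their interaction.
m3 <- lm(post_self ~ pre_peer * self_char)
print(round(coef(m3), 3))
```

The coefficient labeled pre_peer:self_char is the difference in the peer effect between people with and without the characteristic; the coefficient on pre_peer alone is the peer effect for those without it.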

This is only a simple demonstration of the complexity of peer effects. There are likely to be many interactional factors that turn peer effects “on” or “off” or modulate them in some important way. One could imagine the following contingencies, where peer effects depend on characteristics of:

  • the focal individual
  • the environment
  • the alter/peer
  • personalities of both


Ideas on entrepreneurship, innovation and social networks

Here you will find some of my research, summaries of recent trends and topics in business research, and educational materials I’ve used or developed for my MBA and PhD classes. I focus on social network analysis, innovation and entrepreneurship. Here are some relevant posts:

Social Network Analysis

Class Syllabi

R/Methodological Tutorials

Conceptual lectures 

Analyzing Networks in R: Centrality and Graphing

One important procedure in network analysis is determining the centrality of a node within a social network. In this post, I will show you how to do four things:

  1. Calculate four centrality measures
    • Closeness centrality
    • Betweenness centrality
    • Degree centrality (indegree and outdegree)
    • Eigenvector centrality
  2. Symmetrize social networks
  3. Plot social networks using the gplot function in R.
  4. Correlate centrality measures to outcomes or dependent variables.

The Krackhardt Kite Network

Below is a stylized network, called the “Kite Network,” developed by Professor David Krackhardt of Carnegie Mellon University.

Screen Shot 2017-04-25 at 2.24.39 PM.png

The kite network has nodes that are more powerful than others. Which node is the most powerful in the kite network?

Screen Shot 2017-04-25 at 2.27.11 PM.png

One possible answer is node D. The reason is that it has the largest number of connections. Indeed, D is powerful. It has a type of centrality in the network called Popularity centrality or Degree centrality. If you want to get many people on board with an organizational change, or organize a party, D is your node. You can calculate degree centrality by merely counting the number of connections that a node has.

Screen Shot 2017-04-25 at 2.30.09 PM.png

Another answer is either F or G. The centrality of these nodes is a bit harder to see. They have what is called Farness centrality. If you count up the number of “hops” on the network it takes to get from one node (say, A) to all other nodes (B … to … J) and take the average, you get farness centrality. F and G have the lowest farness (or highest closeness), which means it takes a lot less time for information (or disease) to get from F and G to everyone else. Research has shown that Farness/Closeness is correlated with how fast ideas, knowledge, and information spread out from a starting point.

Screen Shot 2017-04-25 at 2.42.06 PM

Finally, H has what we call Betweenness centrality. Betweenness measures the extent to which information must travel over a certain node in order to get somewhere else in the network. In other words, nodes high in betweenness are bridges that connect otherwise disconnected parts of the network.  There is an extremely large body of research showing that individuals who are high in betweenness have access to diverse information in their organizations and are often the source of creative ideas, have greater bargaining power, and experience superior career outcomes.

Representing Networks

The Kite Network provides a very simple introduction to the idea of centrality. The starting point for thinking about network analysis is invariably a graph like the one above. Graphs are fundamental to network analysis, and we can understand a lot from just a graph. Some people, for instance, when they’ve seen enough graphs can tell how a network formed as well as what actions individuals can engage in, and so on.


The problem with graphs, however, is that as they grow larger and more dense, they reveal a lot less information through pure visualization.

For example, let’s compare the three graphs below:



With the small graph (with 10 nodes and 10% of the edges existing), it is rather easy to spin a story about who has power and who is marginal. The second graph (on the upper right) has only 50 nodes and 10% of the ties exist. Things are beginning to get messy. Once we move to 100 nodes and 10% ties, it is basically a hairball and little insight can be provided by just looking at it.
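A quick back-of-the-envelope calculation shows why the picture degrades so fast. The sketch below assumes undirected graphs and simply counts how many ties a 10% density implies at each size; it does not reproduce the three graphs themselves.

```r
# The number of possible undirected ties among n nodes is n * (n - 1) / 2.
n_nodes <- c(10, 50, 100)
density <- 0.10

possible_ties <- choose(n_nodes, 2)
expected_ties <- density * possible_ties

print(data.frame(n_nodes, possible_ties, expected_ties))
```

Holding density fixed, the number of ties grows roughly with the square of the number of nodes: about 4.5 ties at 10 nodes, 122.5 at 50, and 495 at 100, all drawn in the same amount of space.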

Due to the limited use of standard visualization techniques for networks, scholars have developed a wider and more flexible set of representations for networks and ways to reason about them.

The Starting Point for all Network Representations: Nodes and Edges

Recall that networks are made up of nodes and edges. These two elements are also the basic units of representation for the other methods we will use.

An important feature of all the representation strategies we will discuss is that they all represent almost exactly the same information as the graph above. Further, we can easily move from one representation to another in a few steps.


Let us begin with trying now to represent the Kite Network that we drew above as a matrix.  How do we go about doing this? I have created a csv file with the kite network that you can download here: kite.csv.  You can use the code from R-SNA-Kite.R to import the Kite network into R, and plot it.

# This provides some basic analysis of the kite network

# Load the packages we need: data.table for fread, sna for gplot

library(data.table)

library(sna)

# Load the kite network

kite <- fread('https://www.dropbox.com/s/c7f6q7nn2w34o1c/kite.csv?dl=1')

# Change the format to a matrix

kite = as.matrix(kite)

# Create a vector from A to J which will become the row and column names

names = c("A","B","C","D","E","F","G","H","I","J")

# Rename all the rows

rownames(kite) = names

# Rename all the columns

colnames(kite) = names

# Display the kite network matrix

kite

# Plot the kite network

gplot(kite, label = rownames(kite))




 We can also represent networks as lists instead of matrices.  Lists are exceptionally useful since there is “junk” information stored in matrices. This junk information primarily wastes space and adds clutter to the representation. The “0” values are junk in the sense that, although it is important to know that a tie is missing, we do not need to explicitly state it.

 Edge Lists

The edge list representation merely lists all the dyads which consist of the “1”’s in the matrix. We can easily do this for the Kite network by listing the edges:
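One way to produce such a list in R is to pull the positions of the 1's out of the matrix. The sketch below is self-contained, so it first rebuilds the kite matrix by hand; the 18 ties typed in are those of the kite as it is usually drawn, and I am assuming they match kite.csv. The names ties, idx and edge_list are just illustrative.

```r
nodes <- LETTERS[1:10]

# Ties of the kite as it is usually drawn (assumed to match kite.csv).
ties <- matrix(c("A","B", "A","C", "A","D", "A","F",
                 "B","D", "B","E", "B","G",
                 "C","D", "C","F",
                 "D","E", "D","F", "D","G",
                 "E","G", "F","G", "F","H", "G","H",
                 "H","I", "I","J"), ncol = 2, byrow = TRUE)

# Build the 10 x 10 matrix, indexing cells by row and column names.
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[ties] <- 1          # fill one direction
kite[ties[, 2:1]] <- 1   # mirror it, since the network is undirected

# Edge list: one row per "1" in the upper triangle of the matrix,
# so each undirected tie is listed once.
idx <- which(kite == 1 & upper.tri(kite), arr.ind = TRUE)
edge_list <- data.frame(from = nodes[idx[, "row"]],
                        to   = nodes[idx[, "col"]])
print(edge_list)
```

The matrix stores 100 cells, while the edge list needs only 18 rows to carry the same information.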




Node Lists

 Node lists are similar to edge lists in that they are lists, but they are organized around the node and the connections that the node has to other nodes.

A               B C D F

B               A D E G

The beauty of all three representations (matrices, edge lists, node lists) is that they can represent exactly the same binary networks. There are slight differences that arise which we will discuss a bit later.
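To see that the representations really are interchangeable, here is a minimal sketch that starts from a node list, fills in the matrix, and then recovers the node list from the matrix. The rows for A and B are the ones shown above; the rows for C through J are my assumption, taken from the kite as it is usually drawn.

```r
# Node list for the kite (rows C to J are assumed, see the note above).
node_list <- list(
  A = c("B", "C", "D", "F"),
  B = c("A", "D", "E", "G"),
  C = c("A", "D", "F"),
  D = c("A", "B", "C", "E", "F", "G"),
  E = c("B", "D", "G"),
  F = c("A", "C", "D", "G", "H"),
  G = c("B", "D", "E", "F", "H"),
  H = c("F", "G", "I"),
  I = c("H", "J"),
  J = c("I")
)
nodes <- names(node_list)

# Node list -> matrix: put a 1 in each ego's row for every listed contact.
kite <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
for (ego in nodes) kite[ego, node_list[[ego]]] <- 1

# Matrix -> node list: each row yields that node's contacts.
rebuilt <- lapply(nodes, function(ego) nodes[kite[ego, ] == 1])
names(rebuilt) <- nodes
print(rebuilt[c("A", "B")])
```

The round trip returns the same lists we started with, and the resulting matrix is symmetric because every tie appears in both nodes' lists.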

Directionality and Value in networks

 Undirected Networks

We have been working with undirected networks, that is, networks that lack direction in their edges.  There are some phenomena and interactions that inherently lack directionality.  The assumption of undirected ties has at least three implications:

  1. One implication is that you have less network data to represent.
  2. You either do not know the direction of flow, or you assume (implicitly or explicitly) that the flow of information is equivalent regardless of direction across the network.
  3. Your graph does not include arrows.

What are some examples of “naturally” undirected relationships?

  • Shared-memberships
  • Co-authorships
  • Marriage

Directed Networks

Although we have not been using them in our reasoning, directed networks are an important representational tool in many contexts. In directed networks we assume a direction to the flow of “stuff” in the network. This direction of flow is represented graphically by the use of arrows at the end of the edges in the network.

Directionality increases data. Having directions on edges essentially doubles the amount of information we need to store about each edge.
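A tiny example makes the doubling visible, and also shows the two usual ways of collapsing a directed network back into an undirected one (the "weak" and "strong" rules that the symmetrize function applies later in this post). The three people and their ties are made up, and pmax/pmin are a base R stand-in for illustration, not the sna implementation.

```r
people <- c("Ann", "Ben", "Cal")   # hypothetical names

# Directed ties: Ann -> Ben, Ann -> Cal, Ben -> Ann.
directed <- matrix(c(0, 1, 1,
                     1, 0, 0,
                     0, 0, 0), nrow = 3, byrow = TRUE,
                   dimnames = list(people, people))

# A directed network has n * (n - 1) cells to record, an undirected one half that.
n <- length(people)
cells_directed   <- n * (n - 1)
cells_undirected <- n * (n - 1) / 2

# Weak rule: a tie exists if it exists in either direction.
weak <- pmax(directed, t(directed))

# Strong rule: a tie exists only if it exists in both directions.
strong <- pmin(directed, t(directed))

print(weak)
print(strong)
```

Under the weak rule Ann is tied to both Ben and Cal; under the strong rule only the reciprocated Ann and Ben tie survives.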

Values in Networks

Another relaxation in our representation of networks is to add values to edges. Edges represent much more than just 0’s and 1’s. Networks can be valued – so that a dyad can have a value like 1..2..3..4 or .23 etc. What might be some examples of “valued” networks?

Although valued networks are more reflective of real social relationships than dichotomized networks, they are less commonly used. Part of the reason is that valued networks are harder to work with mathematically. Thus, people tend not to use them as much as their dichotomized siblings.


Now that we have the basics of representation down, let us try to extract some insight from the network. Let’s do network analysis.

The most common and often most useful way to analyze a social network is to look at the centrality of the nodes in the network. Centrality is a way to assess the relative importance of a node in a graph or a social network. Several different measures of centrality exist. Each measure has different properties and theoretical interpretations.

Measures of centrality can be classified into two types: (a) local and (b) global.

Local measures of centrality focus on a focal node (the focal node is the node that is currently the focus of attention) and the immediate features of the network surrounding that node. Local measures of centrality such as degree are often easy to calculate, but have as a limitation that they do not capture important features of the whole network.

Global measures, on the other hand, take into account the larger network and incorporate features that are not limited to the focal actor’s immediate network.  Global measures such as closeness, Eigenvector centrality, or betweenness centrality are often much more difficult to calculate (especially by hand) but provide very rich information about the position of an actor in a social network. Global measures often take into account the network ties of all other entities in the larger network as well.

 Local Measures of Centrality

The simplest measure of centrality in a social network is degree. There are two types of degree centrality – indegree and outdegree.

  • Indegree is the count of the total number of incoming connections to a node. In the language of friendship, indegree can be thought of as “popularity” centrality. The node is popular because many other nodes nominate it as a node with whom they have a certain kind of relationship.
  • Outdegree is the total number of outgoing connections from a node. Outdegree can be thought of as the level of gregariousness of a node. Nodes with high outdegrees have many outgoing connections. In directed graphs indegree and outdegree can be distinguished, but in an undirected graph (no arrows) we can simply measure degree centrality.


Indegree and Outdegree

Outdegree_{i} = \sum_{j} N_{ij} 

In the equation above, we can think of N_{ij} as the value of the cell with the row index i and column index j  in a network matrix N .

        Bob  James  Jill  Jane
Bob      0     1     1     0
James    0     0     1     1
Jill     0     1     0     1
Jane     1     1     0     0

In the network represented by the matrix above, Bob has an outdegree of 2, and so do James, Jill and Jane.  However, if we calculate indegree, represented as:

Indegree_{i} = \sum_{j} N_{ji} 

We find that Bob has an indegree of 1, James 3, and Jill and Jane each have an indegree of 2.
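Both sums are one-liners in R: outdegree is a row sum and indegree is a column sum. In the sketch below the matrix is typed in with Jill's row read as 0 1 0 1, which is the reading under which the degrees come out as quoted in the text; treat that row as an assumption of this example.

```r
people <- c("Bob", "James", "Jill", "Jane")

# Rows are senders (i), columns are receivers (j).
N <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 1,
              0, 1, 0, 1,   # Jill's row, read as 0 1 0 1 (assumption)
              1, 1, 0, 0), nrow = 4, byrow = TRUE,
            dimnames = list(people, people))

# Outdegree_i = sum over j of N[i, j]: add up each row.
outdegree <- rowSums(N)

# Indegree_i = sum over j of N[j, i]: add up each column.
indegree <- colSums(N)

print(rbind(outdegree, indegree))
```

Note that the outdegrees and the indegrees must add up to the same total, since every tie leaves one node and arrives at another. That is a handy check when coding a network by hand.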

Degree centrality is often a useful first cut at estimating the overall position of an entity in a social network. Although degree centrality is usually correlated with other more global measures of centrality, the correlation is not perfect and the information captured by the other centrality measures is sometimes as useful if not more useful than the humble degree centrality.

Global measures of centrality

Although indegree and outdegree are useful they are closer to “intuition” measures that rely on local and heuristic information about the actor than true position in the larger social network.

To really capture an actor’s position in a social network we will need to learn how to calculate more global measures.  Scholars have developed a variety of global measures of centrality, but three global measures are most commonly used. Interestingly, they also have a lot of technological applications and as one can imagine they are difficult to calculate by hand.

 Closeness centrality

The first measure we will cover is called closeness centrality. There are other names for it as well; sometimes it is called access centrality.  Simply put, closeness centrality captures the average distance from the focal node to all other nodes in the social network.  The mathematical representation of closeness is as follows:

Closeness_{i} = \left( \frac{\sum_{j \neq i} D_{ij}}{n-1} \right)^{-1}


This formula can be easily interpreted. We are trying to calculate the closeness of the node i to all other nodes in the network; thus, Closeness_{i}. Here D_{ij} is the distance (the number of steps on the shortest path) between nodes i and j. The numerator is the sum of all the pairwise distances between node i and all other nodes j (excluding i). That sum of distances is then divided by the total number of nodes in the network, n, minus 1 (to adjust the count to exclude node i). We now have farness, which is the average distance of node i to all other nodes in the network. Taking the reciprocal gets us closeness.

Let us try and calculate closeness centrality using the Kite network. Focusing on node D, let us begin by calculating the distance between node D and all other nodes. It will take node D only one step to reach nodes A, B, C, E, G, and F. Two steps are required to reach node H. Three steps are required to reach node I and four steps are required to reach node J. Farness can be calculated using the following arithmetic:

 \frac{1+1+1+1+1+1+2+3+4}{9} = 1.67 

The farness centrality for node D is approximately 1.67. This means that on average, node D is less than two steps away from information in the network. Try and calculate the closeness centrality for all other nodes in the network. Farness can easily be converted into closeness by taking the reciprocal (or some other scaling). Is the node that had the highest degree the one with the highest closeness?
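If you would rather check your hand calculations by machine, the sketch below computes farness and closeness for every node in base R. It rebuilds the kite from the ties of the network as it is usually drawn (assumed to match kite.csv) and finds shortest-path distances with the Floyd-Warshall algorithm, which is my choice for brevity here and not necessarily what a network package does internally.

```r
nodes <- LETTERS[1:10]

# Ties of the kite as it is usually drawn (assumed to match kite.csv).
ties <- matrix(c("A","B", "A","C", "A","D", "A","F",
                 "B","D", "B","E", "B","G",
                 "C","D", "C","F",
                 "D","E", "D","F", "D","G",
                 "E","G", "F","G", "F","H", "G","H",
                 "H","I", "I","J"), ncol = 2, byrow = TRUE)
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[ties] <- 1
kite[ties[, 2:1]] <- 1

# Start with distance 1 for direct ties, Inf otherwise, 0 on the diagonal.
D <- ifelse(kite == 1, 1, Inf)
diag(D) <- 0

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
for (k in nodes) for (i in nodes) for (j in nodes) {
  D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
}

# Farness is the average distance to the other n - 1 nodes; closeness is its reciprocal.
farness <- rowSums(D) / (length(nodes) - 1)
closeness <- 1 / farness

print(round(farness, 2))
print(names(which(closeness == max(closeness))))
```

For node D this gives 15/9, the 1.67 worked out by hand, and the nodes with the highest closeness are F and G, not D.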

The entities in a network that are high in closeness centrality are often the most appropriate choices for spreading information through the network.

Betweenness centrality

We now move to betweenness centrality. Betweenness is perhaps one of the most powerful measures of centrality and is tightly related to the idea of structural holes. Betweenness can be calculated as:

Betweenness_{i} = \sum_{\forall j,k} \frac{s_{j,k}(i)}{s_{jk}} 

The idea behind betweenness is simple. Betweenness measures the extent to which a node acts as a bridge between other nodes in the network. It is computed by looking at all pairs of nodes in the network and examining how frequently i, the focal node, exists on the shortest paths between nodes j and k.

  • The term s_{j,k}(i) in the equation  is the number of shortest paths originating at j and ending at k that must go through i.
  • The term s_{jk} is the total number of shortest paths going from j to k.
  • Thus \frac{s_{j,k}(i)}{s_{jk}} is the proportion of shortest paths between j and k that must go through i.
  • If we sum this term over all pairs of nodes excluding i in the network we have betweenness centrality.

Betweenness centrality calculations are quite difficult.

Most times a computer is required to do these calculations. However, we are in luck. Recent research indicates that local betweenness centrality, defined as:

  • Betweenness calculated based only on the network consisting of a focal node’s contacts and the connections between them

is highly correlated with the larger betweenness measure.

Let us try to calculate betweenness on a very simple graph consisting of three nodes – A, B, and C. In calculating the betweenness of B we look at the number of shortest paths from A to C and from C to A.


Since this is an undirected graph we can consider AC and CA to be the same. As we can see, there is only one shortest path between A and C. Thus, the denominator is 1. Of these shortest paths, one of them must go through B. Therefore, B’s betweenness is Betweenness(B) = 1/1 = 1. Similarly, we can see that in  computing A’s betweenness we evaluate the number of shortest paths between B and C. We find that there is 1 shortest path and none of these shortest paths goes through A  since B and C are directly connected. Thus, A’s betweenness centrality is Betweenness(A) = 0/1 = 0
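The same bookkeeping can be written out in a few lines of base R. This is a brute-force sketch meant for tiny connected graphs such as the A, B, C example, where I assume the ties are A to B and B to C, as the calculation above implies. It counts shortest paths using powers of the adjacency matrix, and network packages use faster algorithms.

```r
nodes <- c("A", "B", "C")
adj <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
adj["A", "B"] <- adj["B", "A"] <- 1
adj["B", "C"] <- adj["C", "B"] <- 1
n <- length(nodes)

# dist[j, k]: length of the shortest path from j to k.
# paths[j, k]: number of distinct shortest paths from j to k.
dist <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
diag(dist) <- 0
paths <- diag(n)
dimnames(paths) <- list(nodes, nodes)

walks <- diag(n)
for (len in 1:(n - 1)) {
  walks <- walks %*% adj                   # walks[j, k] = number of walks of length len
  newly <- walks > 0 & is.infinite(dist)   # pairs first reached at this length
  dist[newly] <- len
  paths[newly] <- walks[newly]
}

# i is on a shortest j-k path exactly when dist[j, i] + dist[i, k] == dist[j, k].
betweenness <- sapply(nodes, function(i) {
  total <- 0
  others <- setdiff(nodes, i)
  for (j in others) for (k in others) {
    if (j < k && dist[j, i] + dist[i, k] == dist[j, k]) {
      total <- total + paths[j, i] * paths[i, k] / paths[j, k]
    }
  }
  total
})
print(betweenness)
```

The condition j < k counts each unordered pair once, which mirrors treating AC and CA as the same pair in an undirected graph.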

If you like, try and calculate betweenness centrality scores for the kite  network. Who has the highest betweenness? Is it the same node with the highest degree or closeness?

Eigenvector centrality

The final measure of centrality is Eigenvector centrality. Think of Eigenvector (EV) centrality as degree centrality on Redbull. The basic intuition behind EV centrality is that it is not sufficient to have a large network, but your network contacts should also have a large network, and their network contacts should also have a large network, and so should their network contacts, etc.

Thus, EV centrality is a recursive measure of centrality which is based not only on your degree, but the degree of your contacts, their contacts, and so on. Under degree centrality, two people with a degree of 6 would have equivalent centrality even if one of them was connected to people who were not connected to anyone else and the other was connected to six people who themselves were also connected to many other people. Under EV centrality, the second person is more central.

It is generally not possible to calculate Eigenvector centrality by hand – except on the most trivial networks.

However, most network analysis packages have routines to calculate Eigenvector centrality quite efficiently.
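To see the intuition at work without a network package, here is a small made-up network in which two nodes, P and Q, both have degree 3. P's contacts are mostly dead ends, while Q sits inside a tightly knit group. The sketch takes the leading eigenvector of the adjacency matrix with base R's eigen(); the node names and ties are invented for illustration, and the scaling of the scores may differ from what a given package reports.

```r
nodes <- c("P", "l1", "l2", "Q", "q1", "q2", "q3")

# P is tied to two isolated contacts (l1, l2) and to q1.
# Q is tied to q1, q2, q3, who are also all tied to one another.
ties <- matrix(c("P","l1", "P","l2", "P","q1",
                 "Q","q1", "Q","q2", "Q","q3",
                 "q1","q2", "q1","q3", "q2","q3"), ncol = 2, byrow = TRUE)
adj <- matrix(0, 7, 7, dimnames = list(nodes, nodes))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

degree <- rowSums(adj)

# Leading eigenvector of the adjacency matrix; abs() removes the arbitrary sign.
ev <- abs(eigen(adj, symmetric = TRUE)$vectors[, 1])
names(ev) <- nodes

print(round(rbind(degree, eigenvector = ev)[, c("P", "Q")], 3))
```

P and Q tie on degree, but Q scores higher on eigenvector centrality because Q's contacts are themselves well connected.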

Calculating Centrality, Symmetrizing Matrices and Plotting Networks

Now that we have a basic grasp of measures of centrality, let us use the professionals data we worked with in the prior lecture to calculate centrality for the “advice network.” The analysis file can be found here at RSNAcentrality.R.  You must load the data first, up until the centrality calculations. 

# Create a "weak" and "strong" symmetrized version of the advice network (q1)

q1.weak = symmetrize(q1, rule = "weak") # a tie exists between ij and ji if ij == 1 OR ji == 1

q1.strong = symmetrize(q1, rule = "strong") # a tie exists between ij and ji if ij == 1 AND ji == 1

# Calculate degree centrality for q1

q1.indegree = degree(q1, cmode = "indegree")

q1.outdegree = degree(q1, cmode = "outdegree")

# Calculate betweenness centrality

q1.betweenness = betweenness(q1)

# Calculate eigenvector centrality (we will need to do this for an undirected network, let's use weak)

q1.evcent.weak = evcent(q1.weak)

# Calculate closeness centrality, let's do this again with the weak symmetrized network

q1.closeness.weak = closeness(q1.weak)

# plot histograms of each of the centrality measures

par(mfrow = c(3,2))

hist(q1.indegree)

hist(q1.outdegree)

hist(q1.betweenness)

hist(q1.evcent.weak)

hist(q1.closeness.weak)

Screen Shot 2017-05-03 at 4.09.38 PM.png


Let us take a look at the scatter plots comparing these measures.

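# What is the correlation between these centrality measures? Let's look at scatter plots.

pairs(cbind(indegree = q1.indegree, outdegree = q1.outdegree, betweenness = q1.betweenness, evcent = q1.evcent.weak, closeness = q1.closeness.weak))

Screen Shot 2017-05-03 at 4.12.48 PM

Finally, let’s test a simple hypothesis: the more “close” you are to others in a social network, the more likely you are to feel like you have the knowledge to succeed.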

# Examine if there is a correlation between closeness centrality in the advice network and whether

# they feel like they have the knowledge to succeed.

m.0 <- lm(attr$success ~ q1.closeness.weak)

summary(m.0)

# Plot the regression and the data points.

plot(q1.closeness.weak, attr$success)

abline(m.0)

Screen Shot 2017-05-03 at 4.17.34 PM.png

The first order correlation holds. Is this a real effect? How can we tell?

The organization behind the chart, an assignment.

Network analysis has permeated the analytical toolbox of many of the world’s most innovative companies. Firms, both large and small, have begun to use the tools of social network analysis (SNA) to understand the collaborative structure within their organizations and the collaborative and competitive structures in their respective markets. In the early days of SNA much of the work was conducted by boutique consultants. Today, many of the largest consulting firms have specialists in this area and many companies such as Google have built robust internal teams with this talent.

Despite the revolutionary change in the practice of SNA over the last decade, the basic workflow of conducting a SNA remains the same. Five steps in particular are universal and do not vary by industry, organizational size, or other firm characteristics. These steps are:

  1. Select a bounded social unit as the target of the SNA. This could be a division in a firm, the whole organization, or a cluster or industry. For example, one could do an SNA on just the medical school or all of Stanford University. One could also study a single VC or all of the firms on Sand Hill Road.
  2. Once a bounded unit is selected, learn about the context and ask: (1) what are the relevant entities (i.e. who are the people who should be included in the study); (2) what are the relevant relationships (i.e. friendship or co-investing); and, finally, (3) what are people trying to achieve with these relationships (i.e. generating more innovative ideas, getting promoted, or getting in on the best deals)?
  3. Once these are determined, the SNA will require the analyst to ask the relevant entities about their relationships. This is done through what is called a “Network Survey.”
  4. Once the surveys are completed, code and analyze the data. Data coding is relatively straightforward and can be done in Excel by creating a matrix and filling in the relationships between entities. After the coding, analysis will consist of visualization and calculating centrality measures.
  5. Interpret. Now that you have the data, you are likely to see patterns that you may have or have not expected. Why do you see them, do you see people with unexpected centrality in the network—why are they central, what are the implications for the bounded social unit and the person who has that centrality?

That is all it takes to do a SNA. The process can be scaled or scoped down depending on the context. Your task for the Final project is to follow these steps and in a team of 3 or less, conduct a SNA on a bounded social unit of your choice (except this class). In prior years, students have done SNA on their MBA class, their MSx class, their startup, their friend’s startup, among others.


Here are the deliverables for this class:

  1. Choose a bounded social context to study. I would recommend—for purely time-related reasons—to choose a social unit that is somewhere between 15-25 people. I am OK with smaller units as well as larger ones, but smaller units may not be as interesting and larger ones may become unmanageable in such a short period of time.
  2. After choosing the setting, describe why it is interesting/important, who your relevant entities will be, and what meaningful relationships exist between these people (and why). Finally, describe why you think that networks may matter here and for what.
  3. Develop and conduct a survey with two components. First, ask two network questions (the simplest are: who do you consider a friend? and who do you go to for work-related advice?). Second, ask up to 5 questions about people’s background and achievements. These could include: where did you go to undergrad, and how many years of experience do you have? With respect to achievements, you can ask questions about work satisfaction, feelings of success, etc. You can do this survey on paper or using online survey software such as Survey Monkey or Google Survey.
  4. After you have the survey responses, code and analyze them. Once you have the raw data, set up a meeting with me and I will help you calculate the centrality measures and visualize the data. I’ll send out an example of how you should store your data by the next class.
  5. Interpret. What is interesting and unexpected? Make some predictions about people’s outcomes (will someone leave, get promoted, have a brilliant idea), and justify them based on what you know of the context, network theory and your own intuition.

On the final day of class we’ll be presenting our analysis to the class. You have two final deliverables:

  1. Please prepare a 5 minute presentation with your analysis and interpretation.
  2. Write a short report with your findings and submit them to the TA. The report should be at most 3 pages (12pt font, double spaced).

Creating value for others

One of the most important resources you have for creating value is your social network. And one of the most important ways that you can create value is by understanding the goals and needs of people you know and using your own network to help them meet those goals in ways that only you can do. In so doing, you create value for the person being helped and also you.

This assignment’s goal is to help you reason more carefully about the unique value that your network holds for other people (and, of course, yourself). Using LinkedIn (I imagine that you have a LinkedIn account) I want you to begin by choosing 3 people in your network that you know are trying to accomplish something, whether it be finding a job in a specific industry, starting a business, or something else. Next, I want you to find, for each of these three people, 2 other people to whom they are not connected, whom they are unlikely to know, and who are likely to be useful to them in achieving their goals.

Once you have done this, please write a short report totaling no more than 2 pages with the following information about the three potential brokering opportunities you just described.

  1. Describe the goals and needs of each of the three individuals you listed; describe also how you know these people, how/why they are connected to you, and how you discovered that they had the goals you described.
  2. Next, describe the two other people in your network that might have the “resources” (i.e. a job opportunity at their firm, the connections to people who can provide financing for a venture, etc.) that would be useful for the people you listed, how you know these two people that can help the person, and why you think they would help you help the first person (e.g. is it a win-win situation, will they be repaying a debt to you, or will you be asking for a favor from these people that you may need to repay later).
  3. Third, describe how you will broker this connection and, more importantly, describe why this is the unique value that you provide and why it is unlikely that other people would be able to make these same types of brokering connections.
  4. Finally, based on your answers to these questions, describe what you think is the unique value that you can bring to the table with your network. Are you better at brokering one type of connection vs. another?