Ideas on entrepreneurship, innovation and social networks

Here you will find some of my research, summaries of recent trends and topics in business research, and educational materials I’ve used or developed for my MBA and PhD classes. I focus on social network analysis, innovation and entrepreneurship. Here are some relevant posts:

Social Network Analysis

Class Syllabi

R/Methodological Tutorials

Conceptual lectures

Analyzing Networks in R: Centrality and Graphing

One important procedure in network analysis is determining the centrality of a node within a social network. In this post, I will show you how to do four things:

1. Calculate four centrality measures
• Closeness centrality
• Betweenness centrality
• Degree centrality (indegree and outdegree)
• Eigenvector centrality
2. Symmetrize social networks
3. Plot social networks using the gplot function in R.
4. Correlate centrality measures to outcomes or dependent variables.

The Krackhardt Kite Network

Below is a stylized network, called the “Kite Network” developed by Professor David Krackhardt of Carnegie Mellon University.

The kite network has nodes that are more powerful than others. Which node is the most powerful in the kite network?

One possible answer is node D. The reason is that it has the most number of connections. Indeed, is powerful. It has a type of centrality in the network called Popularity centrality or Degree centrality. If you want to get many people on board with an organizational change, or organize a party, D is your node. You can calculate degree centrality by merely counting the number of connections that a node has.

Another answer is either F or G. The centrality of these nodes is a bit harder to see. They have what is called Farness centrality. If you count up the number of “hops” on the network it takes to get from one node (say, A) to all other nodes (B … to … J) and take the average, you get farness centrality. F and G have the lowest farness (or highest closeness) which means it takes a lot less time for information (or disease) to get from F and G to everyone else. Research has shown that Farness/Closeness is correlated to how fast ideas, knowledge, information spread out from a starting point.

Finally, H has what we call Betweenness centrality. Betweenness measures the extent to which information must travel over a certain node in order to get somewhere else in the network. In other words, nodes high in betweenness are bridges that connect otherwise disconnected parts of the network.  There is a extremely large body of research showing that individuals who are high in betweenness have access to diverse information in their organizations and are often the source of creative ideas, have greater bargaining power, and experience superior career outcomes.

Representing Networks

The Kite Network provides a very simple introduction to the idea of centrality. The the starting point for thinking about network analysis is invariably a graph like the one above. Graphs are fundamental to network analysis, we can understand lot from just a graph. Some people, for instance, when they’ve seen enough graphs can tell how a network formed as well as what actions that individuals can engage in and so on and so forth.

The problem with graphs, however, is that as graphs grow larger and more dense. They reveal a lot less information just through pure visualization.

For example, lets compare the three graphs below:

With the small graph (with 10 nodes and 10% of the edges existing), it is rather easy to spin a story about who has power and who is marginal. The second graph (on the upper right) has only 50 nodes and 10% of the ties exist. Things are beginning to get messy. Once we move to 100 nodes and 10% ties, it is basically a hairball and little insight can be provided by just looking at it.

Due to the limited use of standard visualization techniques for networks, scholars have developed a wider and more flexible set of representations for networks and ways to reason about them.

The Starting Point for all Network Representations: Nodes and Edges

Recall that networks are made up of nodes and edges. These two elements are also the basic units of representation for the other methods we will use.

An important feature of all the representation strategies we will discuss is that they all represent almost exactly the same information as the graph above. Further, we can, with ease move from one representation to another in a few steps.

Matrices

Let us begin with trying now to represent the Kite Network that we drew above as a matrix.  How do we go about doing this? I have created a csv file with the kite network that you can download here: kite.csv.  You can use the code from R-SNA-Kite.R to import the Kite network into R, and plot it.

# This provides some basic analysis of the kite network

library(data.table)

library(curl)

library(sna)

# Load the kite network

# Change the format to a matrix

kite = as.matrix(kite)

# Create a vector from A to J which will become the row and column names

names = c(“A”,“B”,“C”,“D”,“E”,“F”,“G”,“H”,“I”,“J”)

# Change the row names

# Rename all the rows

rownames(kite) = names

# Rename all the columns

colnames(kite) = names

# Display the kite network matrix

kite

# Plot the kite network

gplot(kite, label = rownames(kite)

Lists

We can also represent networks as lists instead of matrices.  Lists are exceptionally useful since there is “junk” information stored in matrices. This junk information primarily wastes spaces and adds clutter to the representation. The “0” values are junk in the sense that – although it is important to know that a tie is missing, we do not need to explicitly state it.

Edge Lists

The edge list representation merely lists all the dyads which consist of the “1”’s in the matrix. We can easily do this for the Kite network by listing the edges:

A->B

A->C

A->D

Node Lists

Node lists are similar to edge lists in that they are lists, but they are organized around the node and the connections that the node has to other nodes.

A               B C D F

B               A D E G

The beauty of all three representations (matricies, edge lists, node lists) is that they can represent exactly the same binary networks. There are slight differences that arise which we will discuss a bit later.

Directionality and Value in networks

Undirected Networks

We have been working undirected networks. That is, networks that lack direction in their edges.  There are some phenomenon and interactions that inherently lack directionality.  The assumption of undirected ties has at least three implications:

1. One implication is that you have less network data to represent.
2. You don’t know exactly – or assume implicitly or explicitly – that the flow of information is equivalent regardless of direction across the network.
3. Your graph does not include arrows.

What are some examples of “naturally” undirected relationships?

• Shared-memberships
• Co-authorships
• Marriage

Directed Networks

Although we have not been using them in our reasoning, directed networks are an important representational tool in many contexts. In directed networks we assume a direction to the flow of “stuff” in the network. This direction of flow is represented graphically by the use of arrows at the end of the edges in the network.

Directionality increases data. Having directions to edges essentially doubling the amount of information we need to store about each edge.

Values in Networks

Another relaxation in our representation of networks is to add values to edges. Edges represent much more than just 0’s and 1’s. Networks can be valued – so that a dyad can have a value like 1..2..3..4 or .23 etc. What might be some examples of “valued” networks?

Although valued networks are more reflective of real social relationships than dichotomized networks, they are less commonly used. Part of the reason is that valued networks are harder to work with mathematically. Thus, people do not to use them as much as their dichotomized siblings.

Centrality

Now that we have the basics of representation down, let us try to extract some insight from the network. Let’s do network analysis.

The most common and often most useful way to analyze a social network is to look at the centrality of the nodes in the network. Centrality is a way to assess the relative importance of a node in a graph or a social network. Several different measures of centrality exist. Each measure has different properties and theoretical interpretations.

Measures of centrality can be classified into two types: (a) local and (b) global.

Local measures of centrality focus on a focal node (the focal node is the node that is currently the focus of attention) and the immediate features of the network surrounding that node. Local measures of centrality such as degree are often easy to calculate, but have as a limitation that they do not capture important features of the whole network.

Global measures on the other hand, take into account the larger network and incorporate features that are not limited to the focal actors immediate network.  Global measures such as closeness, Eigenvector centrality, or betweenness centrality are often much more difficult to calculate (especially by hand) but provide very rich information about the position of an actor in a social network. Global measures often take into account the network ties of all other entities in the larger network as well.

Local Measures of Centrality

The simplest measure of centrality in a social network is degree. There are two types of degree centrality – indegree and outdegree.

• Indegree is the count of the total number of incoming connections to a node. In the language of friendship, indegree can be thought of as “popularity” centrality. The node is popular because many other nodes nominate it as a node with whom they have a certain kind of relationship.
• Outdegree is the total number of outgoing connections from a node. Outdegree can be thought of as the level of gregariousness of a node. Nodes with high outdegrees have many outgoing connections. In directed graphs indegree and outdegree can be distinguished, but in a undirected graph (no arrows) we can simply measure degree centrality.

Indegree and Outdegree

$Outdegree_{i} = \sum_{j} N_{ij}$

In the equation above, we can think of $N_{ij}$ as the value of the cell with the row index $i$ and column index $j$  in a network matrix $N$ .

 Bob James Jill Jane Bob 0 1 1 0 James 0 0 1 1 Jill 0 1 0 0 Jane 1 1 0 0

In the network represented by the matrix above, Bob has an outdegree of 2, but so does James, Jill and Jane.  However, if we calculate indegree, represented as:

$Indegree_{i} = \sum_{j} N_{ji}$

We find that Bob has an indegree of 1, James 3, and Jill and Jane each have an indegree of 2.

Degree centrality is often a useful first cut at estimating the overall position of an entity in a social network. Although degree centrality is usually correlated with other more global measures of centrality, the correlation is not perfect and the information captured by the other centrality measures is sometimes as useful if not more useful than the humble degree centrality.

Global measures of centrality

Although indegree and outdegree are useful they are closer to “intuition” measures that rely on local and heuristic information about the actor than true position in the larger social network.

To really capture an actor’s position in a social network we will need to learn how to calculate more global measures.  Scholars have developed a variety of global measures of centrality, but three global measures are most commonly used. Interestingly, they also have a lot of technological applications and as one can imagine they are difficult to calculate by hand.

Closeness centrality

The first measure we will cover is called closeness centrality. There are other names for it as well; sometimes it is called access centrality.  Simply put, closeness centrality captures the average distance from the focal node to all other nodes in the social network.  The mathematical representation of closeness is as follows:

$Closeness_{i} = \left( \frac{\sum_{\forall j,-i D_{ij}}}{n-1} \right)/1$

This formula can be easily interpreted.

The formula can be easily interpreted. We are trying to calculate the closeness of the node  to all other nodes in the network; thus, Closeness, . The numerator is the sum of all the pairwise distances between node i and all other nodes j (excluding i). That sum of distances is then divided by the total number of nodes in the network n subtracted by 1 (to adjust the count to exclude node i). We now have farness, which is the average distance of node i to all other nodes in the network. Taking the reciprocal gets us closeness.

Let us try and calculate closeness centrality using the Kite network. Focusing on node D, let us begin by calculating the distance between node D and all other nodes. It will take node D only one step to reach nodes A, B, C, E, G, and F. Two steps are required to reach node H. Three steps are required to reach node I and four steps are required to reach node J. Farness can be calculated using the following arithmetic:

$\frac{1+1+1+1+1+1+2+3+4}{9} = 1.67$

The farness centrality for node D is approximately 1.67. This means that on average, node D is less than two steps away from information in the network. Try and calculate the closeness centrality for all other nodes in network. Farness can easily be converted into closeness by taking the reciprocal (or some other scaling). Is the node that had the highest degree the one with the highest closeness?

The entities in a network that are high in closeness centrality are often the most appropriate choices for spreading information through the network.

Betweenness centrality

We now move to betweenness centrality. Betweenness is perhaps one of the most powerful measures of centrality and is tightly related to the idea of structural holes. Betweenness can be calculated as:

$Betweenness_{i} = \sum_{\forall j,k} \frac{s_{j,k}(i)}{s_{jk}}$

The idea behind betweenness is simple. Betweenness measures the extent to which a node acts as a bridge between other nodes in the network. It is computed by looking at all pairs of nodes in the network and examining how frequently i, the focal node, exists on the shortest paths between nodes j and k.

• The term $s_{j,k}(i)$ in the equation  is the number of shortest paths originating at j and ending at k that must go through i.
• The term $s_{jk}$ is the total number of shortest paths going from j to k.
• Thus $\frac{s_{j,k}(i)}{s_{jk}}$ is the proportion of shortest paths between j and k that must go through i.
• If we sum this term over all pairs of nodes excluding i in the network we have betweenness centrality.

Betweenness centrality calculations are quite difficult.

Most times a computer is required to do these calculations. However, we are in luck. Recent research indicates that local betweenness centrality, defined as:

• Betweenness calculated based on only on the network consisting of a focal node’s contacts and the connections between them

is highly correlated with the larger betweenness measure.

Let us try to calculate betweenness on a very simple graph consisting of three nodes – A, B, and C. In calculating the betweenness of B we look at the number of shortest baths between A and C and C and A.

A—B—C

Since this is an undirected graph we can consider AC and CA to be the same. As we can see, there is only one shortest path between A and C. Thus, the denominator is 1. Of these shortest paths, one of them must go through B. Therefore, B’s betweenness is Betweenness(B) = 1/1 = 1. Similarly, we can see that in  computing A’s betweenness we evaluate the number of shortest paths between B and C. We find that there is 1 shortest path and none of these shortest paths goes through A  since B and C are directly connected. Thus, A’s betweenness centrality is Betweenness(A) = 0/1 = 0

If you like, try and calculate betweenness centrality scores for the kite  network. Who has the highest betweenness? Is it the same node with the highest degree or closeness?

Eigenvector centrality

The final measure of centrality is Eigenvector centrality. Think of Eigenvector (EV) centrality as degree centrality on Redbull. The basic intuition behind EV centrality is that it is not sufficient to have a large network, but your network contacts should also have a large network, and their network contacts should also have a large network, and so should their network contacts, etc.

Thus a recursive measure of centrality which is based not only on your degree, but the degree of your contacts, their contacts, and so on. Thus, two people with degree of 6 would have equivalent centrality even if one of those people was connected to people who were not connected to anyone else and the other was connected to six people who themselves were also connected to many other people.

It is generally not possible to calculate Eigenvector centrality by hand – except on the most trivial networks.

However, most network analysis packages have routines to calculate Eigenvector centrality quite efficiently.

Calculating Centrality, Symmetrizing Matricies and Plotting Networks

Now that we have a basic grasp of measures of centrality, let us use the professionals data we worked with in the prior lecture to calculate centrality for the “advice network.” The analysis file can be found here at RSNAcentrality.R.  You must load the data first, up until the centrality calculations.

# Create a “weak” and “strong” symmetrized version of the advice network (q1)

q1.weak = symmetrize(q1,rule = “weak”) # a tie exists between ij and ji if ij == 1 OR ji == 1

q1.strong = symmetrize(q1,rule = “strong”) # a tie exists between ij and ji if ij == 1 AND ji == 1

# Calculate degree centrality for q1

q1.indegree = degree(q1, cmode = “indegree”)

q1.outdegree = degree(q1, cmode = “outdegree”)

# Calculate betweenness centrality

q1.betweenness = betweenness(q1)

# Calculate eigenvector centrality (we will need to do this for an undirected network, lets use weak)

q1.evcent.weak = evcent(q1.weak)

# Calculate closeness centrality, lets do this again with the weak symmetrized network

q1.closeness.weak = closeness(q1.weak)

# plot histograms of each of the centrality measures

par(mfrow = c(3,2))

hist(q1.indegree)

hist(q1.outdegree)

hist(q1.betweenness)

hist(q1.evcent.weak)

hist(q1.closeness.weak)

Let us take a look at the scatter plots comparing these measures.

# What is the correlation between these centrality measures? Lets look at scatter plots.

pairs(~q1.indegree+q1.outdegree+q1.betweenness+q1.evcent.weak+q1.closeness.weak)

Finally, lets test a simple hypothesis. That more more “close” you are to others in a social network, the more likely you feel like you have the knowledge to succeed.

# Examine if there is a correlation between closeness centrality in the advice network whether

# they feel like they have the knowledge to succeed.

m.0 <- lm(attr$success ~ q1.closeness.weak) summary(m.0) # Plot the regression and the data points. plot(q1.closeness.weak,attr$success)

abline(m.0)

The first order correlation holds. Is this a real effect? How can we tell?

PhD Student Stars

After I defended my PhD dissertation in March of 2010, I decided to send my friend Chris (who is a star Informatics professor now) an e-mail summarizing what I had learned during that experience. As I read this email 7 years later, there is little I would change about the advice I would give to a new PhD student. Indeed, I give very similar advice to my own students, some of whom are now professors at great universities themselves.

Here is the text of the original email:

———-

So, now with a PhD (well, enough signatures to get me a PhD) in hand. I thought I should perhaps write some of my thoughts down about what I learned throughout the process. Primarily, I learned that “research” is much like any other job, perhaps even akin to making “widgets” in a factory. There is a process. Although I haven’t figured the entire process all out yet, particularly the publishing part, which is now going to be the primary interface between me and the production of widgets, I think I have come up with an outline for a theory.

Prelims

Before I started graduate school, even before I started my MS (I think), I read the website below. It gave me the best advice in terms of a general framework about how I should think about acting/behaving during my time in graduate school. I believe it was what helped me get admitted, finish, and find a job.

http://www.psychwww.com/careers/suprstar.htm

I would recommend any graduate student read it and take it to heart. When I started graduate school for my master’s degree, I tried to model myself after these suggestions. Though others might argue otherwise, I think, for the most part I worked an average of around 5-6 hours of real work per day, for at most 6 days a week – putting peak times aside. I mostly worked at school. I think most faculty knew my name and I personally asked almost all faculty to come to my presentations.

I expect to work significantly harder during my faculty job. Raising the average real hours worked a day to 7 or at most 8.

Some observations about “poorly” performing students:

• The students who do the worst in graduate school are not present on campus and in the office on a regular basis. This conforms with the visibility hypothesis. Being on campus is important. First, you work. Second, you can talk with other students to resolve your problems. Talk to faculty and be a part of the intellectual life of the place. That means attending talks, giving talks. Even the “mindless” chitchat often contains important pieces of knowledge, gossip, tips and tricks, linkage into important networks that will provide guidance and encouragement during your PhD and beyond.
• Students who perform poorly often reinvent the wheel. They do not take good advice from others – both explicit advice and what would I consider “implicit” advice (e.g. modeling yourself after the best of the cohorts above you.)
• This includes writing papers. The structure of research papers is quite standard. This includes how to write introductions, results sections, etc. However, it also consists of due diligence on statistical procedures, etc. I have learned this through trial and error. But I often look at other good papers that try to do “similar things” (broadly defined) to see what types of other tests, etc. I should do before I wrap up my paper.
• It also includes presentations. Particularly glaring is the absence of students at other people’s presentations. I am often surprised by this since academic output consists of two tangible products: papers and presentations. Just as writing good papers requires reading good papers, giving good presentations requires going to good presentations. And much like how writing good papers requires the ability to take and give productive criticism, so does presentation.