Ideas on entrepreneurship, innovation and social networks

Here you will find some of my research, summaries of recent trends and topics in business research, and educational materials I’ve used or developed for my MBA and PhD classes. I focus on social network analysis, innovation and entrepreneurship. Here are some relevant posts:

Social Network Analysis

Class Syllabi

R/Methodological Tutorials

Conceptual lectures 

Analyzing Networks in R: Centrality and Graphing

One important procedure in network analysis is determining the centrality of a node within a social network. In this post, I will show you how to do four things:

  1. Calculate four centrality measures
    • Closeness centrality
    • Betweenness centrality
    • Degree centrality (indegree and outdegree)
    • Eigenvector centrality
  2. Symmetrize social networks
  3. Plot social networks using the gplot function in R.
  4. Correlate centrality measures to outcomes or dependent variables.

The Krackhardt Kite Network

Below is a stylized network called the “Kite Network,” developed by Professor David Krackhardt of Carnegie Mellon University.

The kite network has nodes that are more powerful than others. Which node is the most powerful in the kite network?


One possible answer is node D, because it has the largest number of connections. Indeed, D is powerful. It has a type of centrality in the network called Popularity centrality or Degree centrality. If you want to get many people on board with an organizational change, or organize a party, D is your node. You can calculate degree centrality by simply counting the number of connections that a node has.


Another answer is either F or G. The centrality of these nodes is a bit harder to see. They have what is called Closeness centrality, which is the inverse of Farness. If you count the number of “hops” it takes to get from one node (say, A) to each of the other nodes (B through J) and take the average, you get that node’s farness. F and G have the lowest farness (or highest closeness), which means it takes less time for information (or disease) to get from F and G to everyone else. Research has shown that farness/closeness is correlated with how fast ideas, knowledge, and information spread out from a starting point.


Finally, H has what we call Betweenness centrality. Betweenness measures the extent to which information must travel over a certain node in order to get somewhere else in the network. In other words, nodes high in betweenness are bridges that connect otherwise disconnected parts of the network. There is an extremely large body of research showing that individuals who are high in betweenness have access to diverse information in their organizations, are often the source of creative ideas, have greater bargaining power, and experience superior career outcomes.

Representing Networks

The Kite Network provides a very simple introduction to the idea of centrality. The starting point for thinking about network analysis is invariably a graph like the one above. Graphs are fundamental to network analysis, and we can understand a lot from just a graph. Some people, for instance, once they have seen enough graphs, can tell how a network formed, what actions individuals can take, and so on.

 

The problem with graphs, however, is that as they grow larger and denser, they reveal a lot less information through pure visualization.

For example, let’s compare the three graphs below:

 

 

With the small graph (with 10 nodes and 10% of the edges existing), it is rather easy to spin a story about who has power and who is marginal. The second graph (on the upper right) has only 50 nodes and 10% of the ties exist. Things are beginning to get messy. Once we move to 100 nodes and 10% ties, it is basically a hairball and little insight can be provided by just looking at it.
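A quick calculation shows why the pictures get messy so fast. The sketch below assumes the three graphs are undirected and that “10% of the ties exist” means 10% of all possible pairs of nodes are connected; the variable names are illustrative.

```r
# Number of ties implied by a fixed 10% density as the network grows.
# Assumption: undirected graphs, so there are choose(n, 2) possible ties.
n <- c(10, 50, 100)
density <- 0.10

possible_ties <- choose(n, 2)          # number of distinct pairs of nodes
expected_ties <- density * possible_ties

data.frame(nodes = n, possible_ties, expected_ties)
```

The number of possible ties grows with the square of the number of nodes, so holding density constant still means drawing far more lines in the same amount of space.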

Due to the limited use of standard visualization techniques for networks, scholars have developed a wider and more flexible set of representations for networks and ways to reason about them.

The Starting Point for all Network Representations: Nodes and Edges

Recall that networks are made up of nodes and edges. These two elements are also the basic units of representation for the other methods we will use.

An important feature of all the representation strategies we will discuss is that they represent almost exactly the same information as the graph above. Further, we can easily move from one representation to another in a few steps.

Matrices 

Let us begin by representing the Kite Network that we drew above as a matrix. How do we go about doing this? I have created a csv file with the kite network that you can download here: kite.csv. You can use the code from R-SNA-Kite.R to import the Kite network into R and plot it.

# This provides some basic analysis of the kite network

library(data.table)

library(curl)

library(sna)

# Load the kite network

kite <- fread("https://www.dropbox.com/s/c7f6q7nn2w34o1c/kite.csv?dl=1")

# Change the format to a matrix

kite = as.matrix(kite)

# Create a vector from A to J which will become the row and column names

names = c("A","B","C","D","E","F","G","H","I","J")

# Rename all the rows

rownames(kite) = names

# Rename all the columns

colnames(kite) = names

# Display the kite network matrix

kite

# Plot the kite network

gplot(kite, label = rownames(kite))

 

 

 Lists

We can also represent networks as lists instead of matrices. Lists are exceptionally useful because there is “junk” information stored in matrices. This junk information primarily wastes space and adds clutter to the representation. The “0” values are junk in the sense that, although it is important to know that a tie is missing, we do not need to state it explicitly.

 Edge Lists

The edge list representation simply lists all the dyads that correspond to the “1”s in the matrix. We can easily do this for the Kite network by listing the edges (the first few are shown here):

A->B

A->C

A->D

Node Lists

 Node lists are similar to edge lists in that they are lists, but they are organized around the node and the connections that the node has to other nodes.

A               B C D F

B               A D E G

The beauty of all three representations (matrices, edge lists, node lists) is that they can represent exactly the same binary networks. There are slight differences that arise which we will discuss a bit later.
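To make the equivalence concrete, here is a minimal sketch in base R that moves between the three representations. The post only lists a few of the kite’s ties, so the full edge set below is assumed to be the standard Krackhardt kite (18 undirected ties); object names such as edges, edge_list and node_list are illustrative.

```r
nodes <- LETTERS[1:10]

# ASSUMPTION: the standard Krackhardt kite edge set, one row per undirected tie
edges <- matrix(c(
  "A","B", "A","C", "A","D", "A","F",
  "B","D", "B","E", "B","G",
  "C","D", "C","F",
  "D","E", "D","F", "D","G",
  "E","G", "F","G", "F","H", "G","H",
  "H","I", "I","J"), ncol = 2, byrow = TRUE)

# Edge list -> matrix: index the cells by (row name, column name)
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[edges] <- 1
kite[edges[, 2:1]] <- 1   # mirror every tie because the network is undirected

# Matrix -> edge list: keep each "1" above the diagonal once
idx <- which(kite == 1 & upper.tri(kite), arr.ind = TRUE)
edge_list <- data.frame(from = nodes[idx[, "row"]], to = nodes[idx[, "col"]])

# Matrix -> node list: for each node, the names of the nodes it is tied to
node_list <- setNames(lapply(nodes, function(v) nodes[kite[v, ] == 1]), nodes)

nrow(edge_list)   # number of ties
node_list$A       # contacts of A
node_list$B       # contacts of B
```

Note that the edge list and node list never store the zeros, which is exactly the “junk” the matrix carries around.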

Directionality and Value in networks

 Undirected Networks

We have been working with undirected networks, that is, networks that lack direction in their edges. Some phenomena and interactions inherently lack directionality. The assumption of undirected ties has at least three implications:

  1. One implication is that you have less network data to represent.
  2. You either do not know the direction of flow, or you assume (implicitly or explicitly) that the flow of information is equivalent in both directions across each tie.
  3. Your graph does not include arrows.

What are some examples of “naturally” undirected relationships?

  • Shared-memberships
  • Co-authorships
  • Marriage

Directed Networks

Although we have not been using them in our reasoning, directed networks are an important representational tool in many contexts. In directed networks we assume a direction to the flow of “stuff” in the network. This direction of flow is represented graphically by the use of arrows at the end of the edges in the network.

Directionality increases data. Having directions on edges essentially doubles the amount of information we need to store about each edge.

Values in Networks

Another relaxation in our representation of networks is to add values to edges. Edges can represent much more than just 0s and 1s. Networks can be valued, so that a dyad can have a value like 1, 2, 3, 4 or 0.23. What might be some examples of “valued” networks?

Although valued networks are more reflective of real social relationships than dichotomized networks, they are less commonly used. Part of the reason is that valued networks are harder to work with mathematically. Thus, people do not use them as much as their dichotomized siblings.

Centrality

Now that we have the basics of representation down, let us try to extract some insight from the network. Let’s do network analysis.

The most common and often most useful way to analyze a social network is to look at the centrality of the nodes in the network. Centrality is a way to assess the relative importance of a node in a graph or a social network. Several different measures of centrality exist. Each measure has different properties and theoretical interpretations.

Measures of centrality can be classified into two types: (a) local and (b) global.

Local measures of centrality focus on a focal node (the focal node is the node that is currently the focus of attention) and the immediate features of the network surrounding that node. Local measures of centrality such as degree are often easy to calculate, but have as a limitation that they do not capture important features of the whole network.

Global measures, on the other hand, take into account the larger network and incorporate features that are not limited to the focal actor’s immediate network. Global measures such as closeness, Eigenvector centrality, or betweenness centrality are often much more difficult to calculate (especially by hand) but provide very rich information about the position of an actor in a social network. Global measures often take into account the network ties of all other entities in the larger network as well.

 Local Measures of Centrality

The simplest measure of centrality in a social network is degree. There are two types of degree centrality – indegree and outdegree.

  • Indegree is the count of the total number of incoming connections to a node. In the language of friendship, indegree can be thought of as “popularity” centrality. The node is popular because many other nodes nominate it as a node with whom they have a certain kind of relationship.
  • Outdegree is the total number of outgoing connections from a node. Outdegree can be thought of as the level of gregariousness of a node. Nodes with high outdegrees have many outgoing connections. In directed graphs indegree and outdegree can be distinguished, but in an undirected graph (no arrows) we can simply measure degree centrality.

 

Indegree and Outdegree

Outdegree_{i} = \sum_{j} N_{ij} 

In the equation above, we can think of N_{ij} as the value of the cell with the row index i and column index j  in a network matrix N .

        Bob   James   Jill   Jane
Bob     0     1       1      0
James   0     0       1      1
Jill    0     1       0      0
Jane    1     1       0      0

In the network represented by the matrix above, Bob, James and Jane each have an outdegree of 2, while Jill has an outdegree of 1. The picture changes if we calculate indegree, represented as:

Indegree_{i} = \sum_{j} N_{ji} 

We find that Bob has an indegree of 1, James 3, Jill 2, and Jane 1.
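The two formulas are just row sums and column sums, so they are easy to check in base R. The sketch below enters the matrix exactly as it is printed above; the object names N, outdegree and indegree are illustrative.

```r
people <- c("Bob", "James", "Jill", "Jane")

# Rows are senders (i), columns are receivers (j), as in N_ij
N <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 1,
              0, 1, 0, 0,
              1, 1, 0, 0),
            nrow = 4, byrow = TRUE, dimnames = list(people, people))

outdegree <- rowSums(N)   # sum over j of N_ij
indegree  <- colSums(N)   # sum over j of N_ji

rbind(outdegree, indegree)
```

Both vectors must add up to the same total, because every tie is outgoing for one node and incoming for another.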

Degree centrality is often a useful first cut at estimating the overall position of an entity in a social network. Although degree centrality is usually correlated with other more global measures of centrality, the correlation is not perfect and the information captured by the other centrality measures is sometimes as useful if not more useful than the humble degree centrality.

Global measures of centrality

Although indegree and outdegree are useful they are closer to “intuition” measures that rely on local and heuristic information about the actor than true position in the larger social network.

To really capture an actor’s position in a social network we will need to learn how to calculate more global measures.  Scholars have developed a variety of global measures of centrality, but three global measures are most commonly used. Interestingly, they also have a lot of technological applications and as one can imagine they are difficult to calculate by hand.

 Closeness centrality

The first measure we will cover is called closeness centrality. There are other names for it as well; sometimes it is called access centrality.  Simply put, closeness centrality captures the average distance from the focal node to all other nodes in the social network.  The mathematical representation of closeness is as follows:

Closeness_{i} = \left( \frac{\sum_{j \neq i} D_{ij}}{n-1} \right)^{-1}

 

This formula can be easily interpreted. We are trying to calculate the closeness of node i to all other nodes in the network; thus, Closeness_{i}. The numerator is the sum of the pairwise distances D_{ij} between node i and all other nodes j (excluding i). That sum is then divided by n - 1, the total number of nodes in the network minus one (to exclude node i from the count). We now have farness, which is the average distance from node i to all other nodes in the network. Taking the reciprocal gets us closeness.

Let us try and calculate closeness centrality using the Kite network. Focusing on node D, let us begin by calculating the distance between node D and all other nodes. It will take node D only one step to reach nodes A, B, C, E, G, and F. Two steps are required to reach node H. Three steps are required to reach node I and four steps are required to reach node J. Farness can be calculated using the following arithmetic:

 \frac{1+1+1+1+1+1+2+3+4}{9} = 1.67 

The farness centrality for node D is approximately 1.67. This means that, on average, node D is less than two steps away from information in the network. Try to calculate the closeness centrality for all other nodes in the network. Farness can easily be converted into closeness by taking the reciprocal (or some other scaling). Is the node that had the highest degree the one with the highest closeness?

The entities in a network that are high in closeness centrality are often the most appropriate choices for spreading information through the network.
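If you want to check your hand calculations, the sketch below computes farness and closeness for every node in base R. It assumes the standard Krackhardt kite edge set (the post does not list every tie), and geodesic_distance is a hypothetical helper written for this example, not a function from the sna package. Packaged routines may also scale closeness differently.

```r
nodes <- LETTERS[1:10]
# ASSUMPTION: the standard Krackhardt kite edge set
edges <- matrix(c(
  "A","B", "A","C", "A","D", "A","F", "B","D", "B","E", "B","G",
  "C","D", "C","F", "D","E", "D","F", "D","G", "E","G", "F","G",
  "F","H", "G","H", "H","I", "I","J"), ncol = 2, byrow = TRUE)
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[edges] <- 1
kite[edges[, 2:1]] <- 1

# Hypothetical helper: shortest-path ("hop") distances from an adjacency matrix.
# The k-th power of the matrix counts walks of length k, so the first k at
# which a pair becomes reachable is the distance between them.
geodesic_distance <- function(adj) {
  n <- nrow(adj)
  D <- matrix(Inf, n, n, dimnames = dimnames(adj))
  diag(D) <- 0
  walks <- diag(n)
  for (k in seq_len(n - 1)) {
    walks <- walks %*% adj
    D[walks > 0 & is.infinite(D)] <- k
  }
  D
}

D <- geodesic_distance(kite)
farness   <- rowSums(D) / (nrow(kite) - 1)   # average distance to the others
closeness <- 1 / farness                     # reciprocal of farness

round(farness, 2)
names(farness)[farness == min(farness)]      # the node(s) with the lowest farness
```

The row of D for node D reproduces the steps counted by hand above (six nodes at distance 1, then 2, 3 and 4).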

Betweenness centrality

We now move to betweenness centrality. Betweenness is perhaps one of the most powerful measures of centrality and is tightly related to the idea of structural holes. Betweenness can be calculated as:

Betweenness_{i} = \sum_{\forall j,k} \frac{s_{j,k}(i)}{s_{jk}} 

The idea behind betweenness is simple. Betweenness measures the extent to which a node acts as a bridge between other nodes in the network. It is computed by looking at all pairs of nodes in the network and examining how frequently i, the focal node, exists on the shortest paths between nodes j and k.

  • The term s_{j,k}(i) in the equation  is the number of shortest paths originating at j and ending at k that must go through i.
  • The term s_{jk} is the total number of shortest paths going from j to k.
  • Thus \frac{s_{j,k}(i)}{s_{jk}} is the proportion of shortest paths between j and k that must go through i.
  • If we sum this term over all pairs of nodes excluding i in the network we have betweenness centrality.

Betweenness centrality calculations are quite difficult.

Most times a computer is required to do these calculations. However, we are in luck. Recent research indicates that local betweenness centrality, defined as:

  • Betweenness calculated based only on the network consisting of a focal node’s contacts and the connections between them

is highly correlated with the larger betweenness measure.

Let us try to calculate betweenness on a very simple graph consisting of three nodes – A, B, and C. In calculating the betweenness of B we look at the number of shortest paths between A and C and between C and A.

A—B—C

Since this is an undirected graph we can consider AC and CA to be the same. As we can see, there is only one shortest path between A and C. Thus, the denominator is 1. Of these shortest paths, one of them must go through B. Therefore, B’s betweenness is Betweenness(B) = 1/1 = 1. Similarly, in computing A’s betweenness we evaluate the number of shortest paths between B and C. We find that there is 1 shortest path and that it does not go through A, since B and C are directly connected. Thus, A’s betweenness centrality is Betweenness(A) = 0/1 = 0.

If you like, try and calculate betweenness centrality scores for the kite  network. Who has the highest betweenness? Is it the same node with the highest degree or closeness?
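For readers who would rather let the computer do the counting, here is a brute-force sketch of the betweenness formula in base R. It is meant to mirror the definition, not to be efficient, and both helper functions are hypothetical names written for this example. Each unordered pair (j, k) is counted once because the graphs are undirected, and the kite’s edge set is assumed to be the standard Krackhardt kite.

```r
# Hypothetical helper: geodesic distances and the number of shortest paths.
# A walk whose length equals the distance is a shortest path, so the walk
# count at the moment a pair first becomes reachable is s_jk.
geodesic_counts <- function(adj) {
  n <- nrow(adj)
  D <- matrix(Inf, n, n); diag(D) <- 0
  S <- diag(n)
  walks <- diag(n)
  for (k in seq_len(n - 1)) {
    walks <- walks %*% adj
    new <- walks > 0 & is.infinite(D)
    D[new] <- k
    S[new] <- walks[new]
  }
  list(dist = D, paths = S)
}

# Hypothetical helper: sum over pairs (j, k) of s_jk(i) / s_jk
betweenness_by_hand <- function(adj) {
  g <- geodesic_counts(adj)
  n <- nrow(adj)
  b <- numeric(n)
  for (i in seq_len(n)) for (j in seq_len(n)) for (k in seq_len(n)) {
    on_geodesic <- g$dist[j, i] + g$dist[i, k] == g$dist[j, k]
    if (j < k && j != i && k != i && is.finite(g$dist[j, k]) && on_geodesic) {
      # shortest j-k paths through i = (paths j to i) * (paths i to k)
      b[i] <- b[i] + g$paths[j, i] * g$paths[i, k] / g$paths[j, k]
    }
  }
  setNames(b, rownames(adj))
}

# The three-node example A - B - C
abc <- c("A", "B", "C")
path3 <- matrix(c(0, 1, 0,
                  1, 0, 1,
                  0, 1, 0), 3, 3, dimnames = list(abc, abc))
b_path3 <- betweenness_by_hand(path3)   # A = 0, B = 1, C = 0

# The kite (ASSUMPTION: standard Krackhardt edge set)
nodes <- LETTERS[1:10]
edges <- matrix(c(
  "A","B", "A","C", "A","D", "A","F", "B","D", "B","E", "B","G",
  "C","D", "C","F", "D","E", "D","F", "D","G", "E","G", "F","G",
  "F","H", "G","H", "H","I", "I","J"), ncol = 2, byrow = TRUE)
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[edges] <- 1
kite[edges[, 2:1]] <- 1
b_kite <- betweenness_by_hand(kite)
round(b_kite, 2)
```

Packaged routines may count each pair in both directions, which doubles these scores on an undirected network without changing the ranking.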

Eigenvector centrality

The final measure of centrality is Eigenvector centrality. Think of Eigenvector (EV) centrality as degree centrality on Redbull. The basic intuition behind EV centrality is that it is not sufficient to have a large network, but your network contacts should also have a large network, and their network contacts should also have a large network, and so should their network contacts, etc.

Thus, EV centrality is a recursive measure of centrality that is based not only on your degree, but also on the degree of your contacts, their contacts, and so on. For example, two people with a degree of 6 have the same degree centrality, but they would not have the same EV centrality if one of them was connected to six people who were not connected to anyone else and the other was connected to six people who were themselves connected to many other people.

It is generally not possible to calculate Eigenvector centrality by hand – except on the most trivial networks.

However, most network analysis packages have routines to calculate Eigenvector centrality quite efficiently.
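As a rough illustration of what those routines do, the sketch below takes the leading eigenvector of the kite’s adjacency matrix with base R’s eigen(). The edge set is assumed to be the standard Krackhardt kite, and rescaling so that the top node scores 1 is my own choice; packages may scale the scores differently.

```r
nodes <- LETTERS[1:10]
# ASSUMPTION: the standard Krackhardt kite edge set
edges <- matrix(c(
  "A","B", "A","C", "A","D", "A","F", "B","D", "B","E", "B","G",
  "C","D", "C","F", "D","E", "D","F", "D","G", "E","G", "F","G",
  "F","H", "G","H", "H","I", "I","J"), ncol = 2, byrow = TRUE)
kite <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
kite[edges] <- 1
kite[edges[, 2:1]] <- 1

# For a symmetric matrix, eigen() returns eigenvalues in decreasing order,
# so the first column of $vectors belongs to the largest eigenvalue.
ev <- eigen(kite, symmetric = TRUE)
evc <- abs(ev$vectors[, 1])            # the sign of an eigenvector is arbitrary
evc <- setNames(evc / max(evc), nodes)

degree <- rowSums(kite)
round(cbind(degree, evc), 3)
```

Compare C and H in the output: both have a degree of 3, but C’s contacts are much better connected than H’s, so C ends up with the higher EV centrality.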

Calculating Centrality, Symmetrizing Matrices and Plotting Networks

Now that we have a basic grasp of measures of centrality, let us use the professionals data we worked with in the prior lecture to calculate centrality for the “advice network.” The analysis file can be found here: RSNAcentrality.R. You must first run the part of the file that loads the data, up to the centrality calculations.

# Create a “weak” and “strong” symmetrized version of the advice network (q1)

q1.weak = symmetrize(q1, rule = "weak") # a tie exists between i and j if ij == 1 OR ji == 1

q1.strong = symmetrize(q1, rule = "strong") # a tie exists between i and j if ij == 1 AND ji == 1

# Calculate degree centrality for q1

q1.indegree = degree(q1, cmode = "indegree")

q1.outdegree = degree(q1, cmode = "outdegree")

# Calculate betweenness centrality

q1.betweenness = betweenness(q1)

# Calculate eigenvector centrality (we will need to do this for an undirected network, let's use weak)

q1.evcent.weak = evcent(q1.weak)

# Calculate closeness centrality, let's do this again with the weak symmetrized network

q1.closeness.weak = closeness(q1.weak)

# plot histograms of each of the centrality measures

par(mfrow = c(3,2))

hist(q1.indegree)

hist(q1.outdegree)

hist(q1.betweenness)

hist(q1.evcent.weak)

hist(q1.closeness.weak)
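The “weak” and “strong” rules used at the top of this script are easy to see on a small directed network. The sketch below reproduces the two rules in base R on the four-person matrix from the degree section; it illustrates the OR/AND logic described in the comments and is not the sna implementation itself. The object names are illustrative.

```r
people <- c("Bob", "James", "Jill", "Jane")
N <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 1,
              0, 1, 0, 0,
              1, 1, 0, 0),
            nrow = 4, byrow = TRUE, dimnames = list(people, people))

# Weak rule: i and j are tied if i nominates j OR j nominates i
weak <- (N + t(N) > 0) * 1

# Strong rule: i and j are tied only if i nominates j AND j nominates i
strong <- N * t(N)

sum(weak) / 2     # undirected ties under the weak rule
sum(strong) / 2   # undirected ties under the strong rule
strong
```

The strong rule always keeps a subset of the ties kept by the weak rule, so the choice between them can change every centrality score that follows.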

 


 

Let us take a look at the scatter plots comparing these measures.

# What is the correlation between these centrality measures? Let's look at scatter plots.

pairs(~q1.indegree+q1.outdegree+q1.betweenness+q1.evcent.weak+q1.closeness.weak)

Finally, let’s test a simple hypothesis: the “closer” you are to others in a social network, the more likely you are to feel that you have the knowledge to succeed.

# Examine whether closeness centrality in the advice network is correlated with whether

# respondents feel like they have the knowledge to succeed.

m.0 <- lm(attr$success ~ q1.closeness.weak)

summary(m.0)

# Plot the regression and the data points.

plot(q1.closeness.weak,attr$success)

abline(m.0)


The first order correlation holds. Is this a real effect? How can we tell?

PhD Student Stars

After I defended my PhD dissertation in March of 2010, I decided to send my friend Chris (who is a star Informatics professor now) an e-mail summarizing what I had learned during that experience. As I read this email 7 years later, there is little I would change about the advice I would give to a new PhD student. Indeed, I give very similar advice to my own students, some of whom are now professors at great universities themselves.

Here is the text of the original email:

———-

So, now with a PhD (well, enough signatures to get me a PhD) in hand, I thought I should perhaps write some of my thoughts down about what I learned throughout the process. Primarily, I learned that “research” is much like any other job, perhaps even akin to making “widgets” in a factory. There is a process. Although I haven’t figured out the entire process yet, particularly the publishing part, which is now going to be the primary interface between me and the production of widgets, I think I have come up with an outline for a theory.

Prelims

Before I started graduate school, even before I started my MS (I think), I read the website below. It gave me the best advice in terms of a general framework about how I should think about acting/behaving during my time in graduate school. I believe it was what helped me get admitted, finish, and find a job.

http://www.psychwww.com/careers/suprstar.htm

I would recommend any graduate student read it and take it to heart. When I started graduate school for my master’s degree, I tried to model myself after these suggestions. Though others might argue otherwise, I think, for the most part I worked an average of around 5-6 hours of real work per day, for at most 6 days a week – putting peak times aside. I mostly worked at school. I think most faculty knew my name and I personally asked almost all faculty to come to my presentations.

I expect to work significantly harder during my faculty job, raising the average real hours worked a day to 7 or at most 8.

Some observations about “poorly” performing students:

  • The students who do the worst in graduate school are not present on campus and in the office on a regular basis. This conforms with the visibility hypothesis. Being on campus is important. First, you work. Second, you can talk with other students to resolve your problems. Talk to faculty and be a part of the intellectual life of the place. That means attending talks, giving talks. Even the “mindless” chitchat often contains important pieces of knowledge, gossip, tips and tricks, linkage into important networks that will provide guidance and encouragement during your PhD and beyond.
  • Students who perform poorly often reinvent the wheel. They do not take good advice from others – both explicit advice and what I would consider “implicit” advice (e.g. modeling yourself after the best of the cohorts above you.)
    • This includes writing papers. The structure of research papers is quite standard. This includes how to write introductions, results sections, etc. However, it also consists of due diligence on statistical procedures, etc. I have learned this through trial and error. But I often look at other good papers that try to do “similar things” (broadly defined) to see what types of other tests, etc. I should do before I wrap up my paper.
    • It also includes presentations. Particularly glaring is the absence of students at other people’s presentations. I am often surprised by this since academic output consists of two tangible products: papers and presentations. Just as writing good papers requires reading good papers, giving good presentations requires going to good presentations. And much like how writing good papers requires the ability to take and give productive criticism, so does presentation.

  • Read. I am often just in AWE of students’ lack of knowledge in their own field of study. I have encountered many students who are totally unaware of the basic (that is, core) papers or ideas in their field. Not that I am the most well-read person in the world or even the program, but I do work quite hard to keep abreast of recent literature (less so these days), the news, and the classics (putting a lot of time into this right now.) Reading and digesting the literature puts ideas, especially theoretical ideas, in context. Reading is important, as is remembering what you read. We all make mistakes. I might cite Smith’s 1975 paper, while it might be Smithe’s 1975 or 76 paper. But my “hunch” is that even when we make mistakes these are good heuristics for remembering papers, linking names to concepts (Granovetter -> Weak Ties) to eras (1970s), and linking these with each other into a “network” of sorts of concepts, authors, and eras. Knowing these basic things will give individuals a good lay-of-the-land with respect to where the holes are in the research, where the interesting problems are, and where your own research can fit in. It also goes back to “re-creating the wheel.” A good knowledge of the current and past literature will give you, in addition to a better theoretical lens with which to view your research, ideas about data, about survey instruments, about methods, and about framing research as well. A good quote about the importance of reading can be found here:

    “My first rule was given to me by TH White, author of The Sword in the Stone and other Arthurian fantasies and was: Read. Read everything you can lay hands on. I always advise people who want to write a fantasy or science fiction or romance to stop reading everything in those genres and start reading everything else from Bunyan to Byatt.” – Michael Moorcock

     

  • Do not wait for feedback to do work. I often notice students with a paralysis of sorts when it comes to doing anything that they have not gotten explicit directions from their advisors or approval from them for some reason. Keep playing with your data and your ideas. Feedback is slow, people are busy, and even when you do get feedback – remember that no one knows your data and the methods you used to analyze it better than you. Keep plugging away. I kind of have a heuristic about “regression analysis.” Once I get a “main effect” to be significant – I try (though, I increasingly notice that I often fail on some dimension) to do all I can to make it disappear (in theoretically justified ways of course). If I do get it to stay, then I am more confident. If it disappears, then you have to start searching for theory again (especially if you didn’t include the variable that made the effect disappear for a theoretically justified reason.)
  • Don’t listen to all the advice you get from your advisors. They are busy and they are human. Take all the comments in, make appropriate changes, and argue back when you have to. You will have to do it for the rest of your life with reviewers anyway. “Critiques” are not always correct.
  • Don’t TA too much. I see some students overload with TAs even in their 8th or 9th year (yes!). I think a manageable number of TAs per semester is three if you have your research organized and you are in your second or third year. If your research is a mess, keep it to 2 TAs a semester. Here is a simple formula. Assume that an average student can TA three classes per semester (not all unique), that is, 6 total classes a year, earning $28,800 per annum (or $4,800 per class) without significantly extending their time in the PhD program. Now assume that any additional TA above these 6 TAs per year will increase the length of time you stay in the PhD program by 3 months (that’s just 1/4 of a year) and that your opportunity cost of staying in the PhD program is $80,000 per year (an above average salary for a master’s student). Those 3 extra months cost you $20,000 in forgone salary, so the decision to TA just that one extra course will cost you $20,000 - $4,800 = $15,200. That is probably a low end of the estimate. Increasing the number of extra months that you might stay because of an extra TA by another month will increase this to over $20k lost. Bump up the salary… and you see the point. TAing, even if it just adds a “few” months, will really hurt your pocketbook. The next point is related more to time in the PhD program.
  • Little rules, big rules. Making sure you don’t break the little rules will help you make your deadlines on time. Finish your classes, your first paper, and your second paper ON TIME. That is almost like the first commandment of the PhD at Heinz. The little rules at Heinz are quite simple. These are the major milestones of the PhD here and are almost sacred. Doing this will provide you with enough structure in the formative periods of your PhD that will take you along through your proposal and defense. The more important thing coming out of finishing your FP and SP on time is that this will give you the “meta skills” to get you organized for your proposal and your dissertation. Finally, as a secondary note regarding the FP/SP deadlines is that there is an organizational memory. Everyone knows who didn’t finish their papers on time. Faculty have long memories as well. They are more lenient (with risky topics, etc.) when people make sure they obey the little rules. So, if you follow the little rules, you can break the big ones. Breaking the big rules is where the fun is.
  • Again, time. The academic job market penalizes “long” PhDs. This is a qualitative observation. Though there may be a handful of PhDs who finished after 9 years and ended up with jobs in academia, the fact of the matter is that it is really looked down upon. Six years might be the neutral point up to which it is OK not to have finished your PhD; after that your prospects of landing a good academic job decline quite dramatically, and they shrink to almost nil by the 8th or 9th year.
  • Be nice to other students. Word spreads about “assholes” (this is a technical term – see Van Maanen 1978). We all have made faux pas’ in our lives. Probably tons of them. But consistent “assholeary” is bad. Be trustworthy and others will trust you and even let others know they trust you too.
  • Everybody here is pretty smart. It is not just you. It is hard work that creates the gradient on which good graduate students vary. Hard work is demonstrated by being on schedule, writing, reading, and working every weekday for at least a few hours on your research (on average.) I am often surprised at how easy this is, and how some people just do not get it.
  • Know when to quit. Get real advice. Don’t stick it out longer than you have to because of your ego.