A journey into the Reddit network

Hyperlink formation

Abstract and objectives

Online social media platforms have grown in popularity over the years, attracting more and more users who interact and share content. As a result, their complexity has also increased. Among them, Reddit, a social news platform, allows its users to interact and build communities by submitting content and topics for discussion.

The goal of this project is to reveal the underlying structure of the network:

  • How are hyperlinks created between subreddits?
  • Can we characterise them using an existing social theory?
  • Are the results the same at all scales (i.e. when a single specific community is considered)?
  • Does the network exhibit structural changes through time?

We present here some results on how these topics relate to each other, both on a large scale and on a smaller scale. Finally we explore the evolution of the network through time.

The data
We use this Reddit dataset, which contains a network of the hyperlinks between subreddits that appear in posts, from January 2014 to April 2017. The nature of each relationship (positive/negative) was previously obtained by Kumar et al. [1] using crowd-sourcing and a text-based classifier trained on the resulting labels. We use a complementary dataset to group the subreddits into clusters.

At a global scale…

As a starting point, looking at the proportion of negative links in subgroups might be helpful. More precisely, let’s look at the proportion as a function of the activity of the subreddits.

The figure shows that more active subreddits are more likely to produce negative posts towards other subreddits than positive posts. Indeed we find for example

which are highly controversial subjects and have a lot of connections. This effect vanishes if we assume that users write positive or negative posts at random.
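The comparison described above can be sketched as follows. This is only an illustration: the `(source, target, sign)` edge format, the function names and the `bin_edges` parameter are assumptions made for the example, and activity is taken to be the number of hyperlinks a subreddit created.

```python
import random
from collections import Counter, defaultdict

def negative_share_by_activity(edges, bin_edges):
    """edges: list of (source, target, sign) with sign in {+1, -1}.
    Returns {bin_index: share of negative links}, where the bin is chosen
    from the number of hyperlinks the source subreddit created."""
    activity = Counter(src for src, _, _ in edges)
    totals, negatives = defaultdict(int), defaultdict(int)
    for src, _, sign in edges:
        # number of bin upper bounds that this activity exceeds
        b = sum(activity[src] > upper for upper in bin_edges)
        totals[b] += 1
        negatives[b] += sign < 0
    return {b: negatives[b] / totals[b] for b in sorted(totals)}

def shuffle_signs(edges, seed=0):
    """Null model: keep every link, but reassign the signs at random."""
    rng = random.Random(seed)
    signs = [s for _, _, s in edges]
    rng.shuffle(signs)
    return [(u, v, s) for (u, v, _), s in zip(edges, signs)]

# Toy data (hypothetical): "a" is very active and mostly negative.
edges = [("a", "b", -1), ("a", "c", -1), ("a", "d", 1), ("a", "e", -1),
         ("b", "a", 1), ("c", "a", 1)]
shares = negative_share_by_activity(edges, bin_edges=[2])
baseline = negative_share_by_activity(shuffle_signs(edges), bin_edges=[2])
```

Comparing `shares` with `baseline` over many shuffles would indicate whether the dependence on activity is stronger than what random signs produce.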

What happens if we consider neighboring relationships?

As a simple neighborhood, let’s consider only relations between three subreddits that are all connected to each other. Hereafter we will call such a group a triad. We begin our analysis with the balance theory, introduced by Heider in the 1940s [2].

Statement | Occurrence
The friend of my friend is my friend | High
The enemy of my friend is my enemy | High
The friend of my friend is my enemy | Low
The enemy of my enemy is my enemy | Low

This theory claims that some types of triads appear more often than others. In particular, it is based on the assumption that whenever an interpersonal relationship exists between two people or communities, there is a harmony such that the ideas shared by both subjects coexist without tension. This leads, for example, to an overrepresentation of the situation “the friend of my friend is my friend”.

This theory can be extended to the weak structural balance theory, introduced by Davis in the 1960s [3], by weakening the assumption and stating that only the relationship “the friend of my friend is my enemy” should be underrepresented.
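The two theories can be stated compactly in code. The sketch below assumes a triad is described by the three signs of its edges (+1 for a positive link, -1 for a negative one) and ignores edge direction; the function names are illustrative.

```python
def is_balanced(signs):
    """Balance theory (Heider): a triad is balanced when the product of its
    three signs is positive, i.e. it has zero or two negative edges."""
    s1, s2, s3 = signs
    return s1 * s2 * s3 > 0

def is_weakly_balanced(signs):
    """Weak balance (Davis): only the triad with exactly one negative edge
    ("the friend of my friend is my enemy") is ruled out."""
    return sum(1 for s in signs if s < 0) != 1

triads = {
    "the friend of my friend is my friend": (1, 1, 1),
    "the enemy of my friend is my enemy": (1, -1, -1),
    "the friend of my friend is my enemy": (1, 1, -1),
    "the enemy of my enemy is my enemy": (-1, -1, -1),
}
for statement, signs in triads.items():
    print(statement, is_balanced(signs), is_weakly_balanced(signs))
```

The only triad on which the two definitions disagree is the all-negative one, which balance theory rules out and weak balance allows.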

To check whether this relation holds in the Reddit network, we trained a linear classifier to guess the sign (positive/negative) of the closing edge of a triad from the characteristics of the other two relationships, which already exist when this hyperlink is created. As some subreddits have a preference for a certain type of relationship (e.g. the controversial ones seen above), we need to include a parameter to account for this. The table below shows whether each statement is more likely to appear than in a randomized network. We observe that the Reddit network adheres to the weak balance structure.

Statement | Prediction | Balance Theory | Weak Balance Theory
The friend of my friend is my friend | Likely | |
The enemy of my friend is my enemy | Likely | |
The friend of my friend is my enemy | Unlikely | |
The enemy of my enemy is my enemy | Likely | |
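One simple way to compare triad frequencies with a randomized network is to shuffle the signs over the existing links and recount. The sketch below does this for undirected triangles; it is a simplification under stated assumptions, not the classifier used here (it has no per-subreddit parameter), and the function names and edge format are hypothetical. Triangles with 0, 1, 2 and 3 negative edges correspond, in order, to “the friend of my friend is my friend”, “the friend of my friend is my enemy”, “the enemy of my friend is my enemy” and “the enemy of my enemy is my enemy”.

```python
import random
from collections import Counter
from itertools import combinations

def triad_counts(signed_edges):
    """signed_edges: {frozenset({u, v}): sign}, no self-loops.
    Counts triangles by their number of negative edges (0 to 3)."""
    neighbours = {}
    for edge in signed_edges:
        u, v = tuple(edge)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    counts = Counter()
    # Brute force over node triples: fine for a toy graph, too slow for Reddit.
    for u, v, w in combinations(sorted(neighbours), 3):
        if v in neighbours[u] and w in neighbours[u] and w in neighbours[v]:
            triangle = (frozenset((u, v)), frozenset((u, w)), frozenset((v, w)))
            counts[sum(signed_edges[e] < 0 for e in triangle)] += 1
    return counts

def surprise(signed_edges, n_shuffles=200, seed=0):
    """Observed count minus the mean count under sign shuffling, per triad type."""
    rng = random.Random(seed)
    observed = triad_counts(signed_edges)
    edges, signs = list(signed_edges), list(signed_edges.values())
    expected = Counter()
    for _ in range(n_shuffles):
        rng.shuffle(signs)
        expected.update(triad_counts(dict(zip(edges, signs))))
    return {k: observed[k] - expected[k] / n_shuffles for k in range(4)}

# Toy graph (hypothetical): four fully connected subreddits, one negative link.
toy = {frozenset(pair): 1 for pair in combinations("abcd", 2)}
toy[frozenset("ab")] = -1
print(triad_counts(toy), surprise(toy))
```

A positive value in `surprise` means the triad type is overrepresented relative to the shuffled network, a negative value that it is underrepresented.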

So, what’s next?

The previous theory might oversimplify the situation, as other crucial mechanisms might have been ignored. Here, we apply the theory of status as described in the paper Signed Networks in Social Media [4]. This theory introduces a new perspective on the creation of a relation between two subreddits: positive links are created from a subreddit with a lower status in the network to a subreddit with a higher status. In this situation, status can be interpreted as respect.

However, a linear classifier trained with this model performs worse than the one based on the balance theory, and the predictions of the status theory do not match the measurements.
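As a rough illustration of the status idea, the sketch below estimates a status score per subreddit from the signed links themselves and then measures how many links agree with the resulting ordering. The scoring heuristic (received positive and sent negative links raise status, sent positive and received negative links lower it) and the function names are assumptions made for this example, not necessarily the features used in our classifier.

```python
from collections import defaultdict

def status_scores(edges):
    """edges: list of (source, target, sign) with sign in {+1, -1}."""
    status = defaultdict(int)
    for src, dst, sign in edges:
        status[dst] += sign  # a positive link raises the target, a negative one lowers it
        status[src] -= sign  # and has the opposite effect on the source
    return dict(status)

def status_agreement(edges):
    """Share of links consistent with status theory: positive links should
    point to higher status, negative links to lower status."""
    status = status_scores(edges)
    consistent = sum(1 for src, dst, sign in edges
                     if (status[dst] - status[src]) * sign > 0)
    return consistent / len(edges)

# Toy data (hypothetical): a looks up to b, b looks up to c, c looks down on a.
edges = [("a", "b", 1), ("b", "c", 1), ("c", "a", -1)]
print(status_scores(edges), status_agreement(edges))
```

Because the scores are estimated from the same links they are checked against, this is only a consistency check and not a prediction experiment.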

At a local scale…

We have seen that the weak balance theory seems to hold for the Reddit network. But

Is it still the case at a local scale?

To address this question, we must first group the subreddits into distinct communities. To this end, we use this complementary dataset, which provides information about which kinds of users frequent which subreddits. Using this information, the subreddits are grouped into the communities shown in this interactive plot (click or double click on the legend for a pleasant surprise):

The communities labeled in the graph above are the ones identified as the most relevant for analysis. They contain a sufficiently large number of subreddits that interact with each other reasonably often (in our records). Other, non-coherent communities were combined into an Others community.

The communities were then classified into one of three classes, based on how much their subreddits interact and how positive their interactions are on average:

Interacting Social communities

Features:

  • High connectivity
  • High interactivity
  • Average negativity

These are communities whose subreddits interact with one another at a reasonable rate. This is measured by the number of hyperlinks created per subreddit and by the transitivity of the graph, i.e. the fraction of connected triples of subreddits that close into a triangle, which indicates how tightly connected a network is. These communities also exhibit an average level of negative interactions.
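Both measures are simple to compute. The sketch below assumes the community is given as a list of undirected `(u, v)` pairs, which discards the direction of the hyperlinks; the function names are illustrative.

```python
from itertools import combinations

def transitivity(edges):
    """3 * (number of triangles) / (number of connected triples)."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    closed = triples = 0
    for node, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1           # the path a - node - b
            closed += b in adj[a]  # the path closes into a triangle
    return closed / triples if triples else 0.0

def links_per_subreddit(edges):
    """Average number of hyperlinks per subreddit in the community."""
    nodes = {n for edge in edges for n in edge}
    return len(edges) / len(nodes) if nodes else 0.0

# Toy community (hypothetical): one triangle plus one pendant subreddit.
community = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(transitivity(community), links_per_subreddit(community))
```

Each triangle is counted three times in `closed`, once per central node, which is where the factor 3 in the definition comes from.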

In general, these communities seem to follow the weak theory of balance. The statement “the friend of my friend is my friend” is overrepresented in all of them, which is a characteristic of the Interacting Social communities. Nevertheless, Adult content seems to be the only community that fully respects the weak balance theory.

Polemic Interacting Social communities

Features:

  • High connectivity
  • High interactivity
  • High proportion of negative hyperlinks

This second class of communities differs from the first one by having a higher level of negative interactions. In general, the theory of balance does not apply here at all.

All communities of this class show an underrepresentation of “the friend of my friend is my friend” interactions and an overrepresentation of the counterintuitive “the enemy of my enemy is my enemy” interactions. This is probably due to the polemic nature of these categories and the possible complexity of the relationships between the subreddits in such communities. The social theories presented here, such as balance, rely on a simplistic binary classification of relations as either friendship or animosity. Given this complexity, they should not be expected to hold for these communities, and this is indeed what we find: neither the status theory nor the balance theory holds for any of them.

Non-community clusters

Features:

  • Low connectivity
  • Low interactivity

This last class does not in fact consist of communities at all: its clusters show a low, almost non-existent level of interactivity (an order of magnitude lower than in the other communities) and low connectivity. They were grouped together because users often search for similar content.

These clusters do not lend themselves to any type of social analysis. One such cluster, Porn, is an archetypal example of this class. Its subreddits do not seem to interact at all, probably because of the inherently non-interactive nature of this community: this is a passive cluster. All theme-related interacting subreddits should be found in the Adult content community. Neither the balance theory nor the status theory was found to apply to this class.

An evolution through time

Until now we have only considered the final state of the network. However, the network evolves through time: new hyperlinks are created, and the sentiment of the associated posts might change. Let’s examine the proportion of positive links created each month in the categories defined above.

Using a simple regression analysis, we find that the only communities with a significant uptrend in the proportion of negative hyperlinks created per month are the Gaming and Adult content communities. These observations might reveal some polarization of these two clusters, which could call into question the applicability of the weak balance theory in the distant future.
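A trend of this kind can be checked with a least-squares slope over the monthly proportions. In the sketch below, significance is assessed with a permutation test, which is a substitute chosen to keep the example dependency-free and not necessarily the test used for the results above; the function names and the toy series are hypothetical.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return covariance / sum((x - mean_x) ** 2 for x in xs)

def uptrend_p_value(monthly_share, n_permutations=2000, seed=0):
    """Returns (slope, one-sided p-value for an uptrend)."""
    months = list(range(len(monthly_share)))
    observed = slope(months, monthly_share)
    rng = random.Random(seed)
    shuffled = list(monthly_share)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # destroy any time ordering
        hits += slope(months, shuffled) >= observed
    return observed, (hits + 1) / (n_permutations + 1)

# Toy series (hypothetical): the negative share grows by one point per month.
share = [0.05 + 0.01 * month for month in range(12)]
trend, p_value = uptrend_p_value(share)
print(trend, p_value)
```

The p-value is the fraction of shuffled series whose slope is at least as large as the observed one, so a small value suggests the uptrend is unlikely to come from the ordering of months alone.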

We also observed that during certain periods, some subreddits created significantly more negative hyperlinks than usual. We call these periods conflicts, as they might be caused by a coordinated attack of negative hyperlinks. We managed to confirm this hypothesis for the following three categories, in which conflict periods systematically involved the creation of negative hyperlinks by popular subreddits:

One particularly interesting example is the Politics category, for which we observed a massive increase in negative hyperlinks in November 2016, around the US elections. Participation in conflicts also differs between categories: Politics and Popular subjects involve more subreddits (about 15-20%) than Gaming (only 5-6%). This means that conflicts in Politics and Popular subjects tend to concern a larger proportion of the community. These results are in line with the polemic nature of these categories, as discussed in the previous section.
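The notion of a conflict period can be made concrete with a simple outlier rule on monthly counts of negative hyperlinks. The sketch below flags months whose count lies more than a chosen number of standard deviations above the mean; both the rule and the threshold `z_threshold` are assumptions made for illustration and may differ from the criterion used for the results above.

```python
from statistics import mean, stdev

def conflict_periods(monthly_negatives, z_threshold=2.0):
    """monthly_negatives: {month_label: number of negative hyperlinks created}.
    Returns the months whose count is unusually high."""
    counts = list(monthly_negatives.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []  # perfectly flat series: nothing stands out
    return [month for month, count in monthly_negatives.items()
            if (count - mu) / sigma > z_threshold]

# Toy series (hypothetical counts) with a single spike.
history = {f"2016-{month:02d}": 10 for month in range(1, 13)}
history["2016-11"] = 100
print(conflict_periods(history))
```

Applied per subreddit, the share of subreddits in a community that have at least one flagged month would give a participation figure comparable to the percentages quoted above.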

Conclusion

So, overall, what have we learned from this analysis of the Reddit network? First, the network exhibits complex structures: the usual balance and status theories do not seem to apply directly, and negative hyperlinks tend to be created by the subreddits that participate more in the social network. Accounting for the popularity of the subreddit creating the hyperlink is therefore necessary. Doing so, we found a prevalence of interactions of the type “the friend of my friend is my friend”.

Because of this complex structure and the diversity of subreddits, we also analysed the network at a local scale, which reveals a classification of communities of subreddits into three categories:

  • the interacting social communities, such as Adult content and Tech, for which the weak balance theory holds, or at least interactions of the type “the friend of my friend is my friend” are overrepresented.
  • the polemic interacting social communities, which include the Politics and Popular subjects communities, with a large presence of situations of the type “the enemy of my enemy is my enemy”.
  • the non-community clusters, such as Porn, whose subreddits rarely interact.

Lastly, we conducted a temporal analysis to detect periods of conflict, characterised by a large increase in the number of newly created negative hyperlinks. In addition to showing an uptrend in the proportion of such negative hyperlinks created per month, the Gaming category has conflicts caused by subreddits that tend to be more popular.

References

  1. S. Kumar, W. L. Hamilton, J. Leskovec, D. Jurafsky. Community Interaction and Conflict on the Web. World Wide Web Conference, 2018.

  2. F. Heider. The Psychology of Interpersonal Relations. John Wiley & Sons, 1958.

  3. J. A. Davis. Structural balance, mechanical solidarity, and interpersonal relations. American Journal of Sociology 68(4), 1963, pp. 444-462.

  4. J. Leskovec, D. Huttenlocher, J. Kleinberg. Signed networks in social media. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010.