S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi.
DNA-inspired online behavioral modeling and its application to spambot detection.
IEEE Intelligent Systems, 31(5):58–64, 2016.

Digital DNA Fingerprinting.

An analysis of Twitter accounts that posted #Brexit tweets

This study, carried out by Serena Tardelli, Stefano Cresci, and Maurizio Tesconi at IIT-CNR, analyses a sample of Brexit tweets using the Digital DNA Fingerprinting technique. This innovative behavioral modeling approach encodes user behaviors as a sequence of actions, resulting in strings of characters that resemble the digital DNA of Twitter users. We then employ string mining and bioinformatics algorithms in order to characterize and detect social spambots, by grouping similar digital DNA sequences.

For this study, we modeled account behaviors by encoding every tweet shared by a user as a character that depends on the type of tweet: A for a simple tweet, C for a retweet, and T for a reply. In this way, every user is associated with a string of characters (i.e., a digital DNA sequence) that represents the chronologically ordered sequence of their actions. We then quantify behavioral similarity by looking at the longest common substring (LCS) between the digital DNA sequences of different users. The intuition behind our methodology is that automated (i.e., spambot) accounts feature a higher behavioral similarity than human-operated accounts.
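The encoding and the pairwise similarity measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the action labels ("tweet", "retweet", "reply") and the function names are assumptions introduced here, and the quadratic dynamic-programming routine is only practical for short sequences.

```python
# Assumed mapping, taken from the text: A = simple tweet, C = retweet, T = reply.
BASES = {"tweet": "A", "retweet": "C", "reply": "T"}


def encode_timeline(actions):
    """Turn a chronologically ordered list of action types into a DNA string."""
    return "".join(BASES[action] for action in actions)


def longest_common_substring(s, t):
    """Length of the longest contiguous run of characters shared by s and t.

    Classic dynamic programming, O(len(s) * len(t)) time, O(len(t)) memory.
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

For example, `encode_timeline(["tweet", "retweet", "reply", "tweet"])` yields the sequence `ACTA`, and two users whose sequences are `ACCTA` and `TCCTC` share the substring `CCT`, giving an LCS of 3.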

Table 1 shows some statistics about the data that we analysed.

Dataset | Accounts | Tweets
A sample of Brexit tweets from 2016-05-19 to 2016-05-31 | 340,640 | 1,824,555

Table 1: Dataset statistics.

Analysis of #Brexit tweets

In this first step, we extracted each user's digital DNA by considering the sequence of their tweets on the Brexit topic.

Figure 1: Sequence plot of Brexit tweets per user.

Each horizontal line in the plot of Figure 1 represents the digital DNA sequence of a user that posted on the Brexit topic. The length of the sequences is equal to the number of tweets on the topic, and the different colors show the types of tweets shared (e.g., tweet, retweet, reply).

A simple analysis of the lengths of the sequences shown in Figure 1 reveals that two accounts posted many more tweets than all the others: @iVoteLeave and @iVoteStay. While the vast majority of accounts posted fewer than 100 tweets, each of these two anomalous accounts posted around 15,000 tweets in a time span of only 13 days. A deeper analysis revealed that the two accounts were created at the same time and tweeted at exactly the same times, at a rate of one tweet per minute. All the tweets they published are retweets containing specific hashtags: the @iVoteLeave account retweeted tweets with the hashtags #voteLeave or #LeaveEU, while @iVoteStay retweeted tweets with the hashtags #strongerIn or #voteRemain. Clearly, these two accounts are automated bots that spam retweets.
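A fixed posting cadence such as the one-tweet-per-minute rhythm described above can be checked directly from tweet timestamps. The sketch below is one possible way to do so and is not taken from the study: the function name `dominant_gap` is hypothetical, and timestamps are assumed to be available as `datetime` objects.

```python
from collections import Counter
from datetime import datetime, timedelta


def dominant_gap(timestamps):
    """Return (most common gap in seconds, share of gaps equal to it).

    A share close to 1.0 means the account posts at an almost perfectly
    regular interval, which is hard to explain with manual activity.
    """
    ordered = sorted(timestamps)
    gaps = [int((b - a).total_seconds()) for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return None, 0.0
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)


# Illustrative data only: ten posts spaced exactly one minute apart.
start = datetime(2016, 5, 19)
example = [start + timedelta(minutes=i) for i in range(10)]
```

Calling `dominant_gap(example)` on the illustrative data returns a gap of 60 seconds with a share of 1.0; what share should count as suspicious on real data is a judgment call that this sketch does not settle.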

Analysis of user timelines

After discarding the two spambot accounts identified in the previous step, we crawled the Twitter timelines of the users who posted more than 100 tweets.

The boxplots in Figure 2 show the distribution of DNA bases, corresponding to the types of tweets, among the digital DNA sequences extracted from users' timelines. As the figure shows, the users in our dataset shared more retweets than replies or simple tweets. This behavior can be considered typical, since retweeting is less demanding than posting a new tweet or replying to another user. Nonetheless, a few users posted almost exclusively simple tweets (type A): they appear in the top part of Figure 3 as a cluster of DNA sequences (i.e., horizontal lines) that are almost completely red.

Figure 2: Distribution of the different DNA bases inside the sequences.

Figure 3: Sequence plot of user timelines.

Figure 4: Intra-sequence Shannon entropy.

Figure 5: Inter-sequence Shannon entropy.

Figures 4 and 5 show, respectively, the intra-sequence and inter-sequence Shannon entropy of the DNA sequences of user timelines. The boxplot of Figure 4 shows that the composition of most timelines is rather heterogeneous, since the majority contain a mix of tweet types. There is nonetheless a group of accounts that does not follow this pattern and has an entropy almost equal to 0, meaning that those timelines consist almost entirely of a single, easily predictable type of tweet. These are probably the accounts shown in the top part of Figure 3, which posted only type A tweets.
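To make the intra-sequence measure concrete, the sketch below computes the Shannon entropy of the base frequencies within a single digital DNA sequence. It assumes that intra-sequence entropy is the entropy of the distribution of A, C, and T in one timeline, which matches the description above but is not spelled out in it; the function name is introduced here for illustration.

```python
import math
from collections import Counter


def intra_sequence_entropy(seq):
    """Shannon entropy (in bits) of the base frequencies within one sequence.

    0.0 means a single type of action; log2(3) is the maximum for three bases.
    """
    if not seq:
        return 0.0
    n = len(seq)
    # p * log2(1/p) summed over the bases that actually occur in the sequence.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())
```

A timeline made only of simple tweets, such as `AAAA`, has entropy 0, while a timeline that uses the three action types equally often reaches the maximum of log2(3), roughly 1.58 bits.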

Figure 5 represents the inter-sequence entropy, which is useful for spotting synchronized behaviours among different users. The boxplot shows inter-sequence entropy values almost always higher than those computed intra-sequence, meaning that overall, the users of this group do not show synchronized behaviours.
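The text does not define how inter-sequence entropy is computed. One plausible reading, assumed in the sketch below, is a position-wise measure: for each position i, take the i-th action of every user and compute the Shannon entropy of that column. Under this assumption, low values mean that many users performed the same type of action at the same step, which is what synchronized behaviour would look like. The function name and the handling of sequences of different lengths are choices made here for illustration.

```python
import math
from collections import Counter


def inter_sequence_entropy(sequences):
    """Position-wise Shannon entropy (in bits) across a set of sequences.

    Returns one value per position; sequences shorter than a given position
    are simply left out of that column (an assumption of this sketch).
    """
    longest = max((len(s) for s in sequences), default=0)
    entropies = []
    for i in range(longest):
        column = [s[i] for s in sequences if i < len(s)]
        n = len(column)
        entropies.append(
            sum((c / n) * math.log2(n / c) for c in Counter(column).values())
        )
    return entropies
```

For the two sequences `AAC` and `AAT`, the first two positions are identical across users (entropy 0) and the third differs (entropy 1 bit).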


Figure 6: LCS similarity plot.

Figure 6 shows the LCS plot of the users. Given that the LCS is a measure of similarity between digital DNA sequences, suspiciously high values of LCS might serve as a red flag for automation.

Notably, the LCS curve of Figure 6 uncovers a group of accounts that share high behavioral similarity, with LCS values in the region of 3,200. This behavior is unusual and differs from the general trend of the group, which features drastically lower LCS values. A possible explanation is that this group of highly similar accounts is the one that posted only tweets of type A and that it is mainly composed of spambot accounts. To test this hypothesis, we isolated the accounts with LCS > 2,400. Figure 7 shows the sequence plot of this group, highlighting that their timelines are made almost entirely of type A tweets. This is typical of spambots, which rarely interact with other accounts. A deeper analysis of the 115 highly similar accounts in this group identified two main subgroups of spambots.
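The text does not spell out how the LCS curve is built. A common reading, assumed in the sketch below, is that for each k the curve reports the length of the longest substring shared by at least k accounts, so that a plateau of high values followed by a sharp drop isolates a group of near-identical accounts. The brute-force enumeration used here is only meant to illustrate the idea on toy data: it stores every substring of every sequence, so timelines with thousands of actions would need a suffix tree or suffix array instead. The function name is hypothetical.

```python
def lcs_curve(sequences):
    """For k = 2..n, length of the longest substring found in at least k sequences.

    Brute force: counts, for every distinct substring, how many sequences
    contain it. Suitable for short toy sequences only.
    """
    support = {}
    for seq in sequences:
        # Distinct substrings of this sequence, so each sequence counts once.
        seen = {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}
        for sub in seen:
            support[sub] = support.get(sub, 0) + 1

    curve = {}
    for k in range(2, len(sequences) + 1):
        curve[k] = max((len(sub) for sub, c in support.items() if c >= k), default=0)
    return curve


# Illustrative data only: three near-identical accounts and one unrelated one.
toy = ["AAAAC", "AAAAT", "CAAAA", "CTCT"]
```

On the toy data, `lcs_curve(toy)` gives 4 for k = 2 and k = 3 and drops to 0 for k = 4, mirroring in miniature the gap between the highly similar group and the rest. A cut-off such as the 2,400 used above would be chosen by reading the curve in the same way.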


Figure 7: Sequence plot of 115 highly similar accounts.

Spambots Groups

Group 1

A group of accounts presenting as female users, with similar screen names. Table 2 shows a sample of the accounts.

photo screen_name

Table 2: A group of Brexit spambot accounts.

These accounts periodically posted the same tweet at the same time over the considered time span of 13 days. Their timelines are identical. Below is an example of a tweet posted at the same time by these accounts. They all stopped posting after June 27th.

Group 2

A group of spambot accounts presenting as male users.


Table 3: Another group of Brexit spambot accounts.

This group also posted tweets at the same time during the period considered, and its activity stopped on June 10th.

Group 3

This group of automated accounts also posted similar Brexit tweets at the same time, but unlike groups 1 and 2, these accounts do not seem to be coordinated or synchronized with one another. They are still active on Twitter and their tweets always contain external URLs. They probably follow RSS feeds of news websites and tweet automatically as fresh news is published. Table 4 shows a sample of these accounts.


Table 4: A group of news spambot accounts that occasionally posted news about Brexit.

Our analysis, performed via the digital DNA behavioral modeling framework, showed that the accounts in spambot groups 1 and 2 are coordinated and share the same behavior and goal. Such accounts were created specifically to tamper with discussions on the Brexit topic. Being widespread, coordinated, and equipped with seemingly human Twitter profiles, they demonstrate the latest advances in spambot design and represent a dangerous threat to social platforms. In the literature, this novel type of spambot is called a social spambot.

Group 3 spambots are probably the least dangerous. These accounts share links to news articles and spread all kinds of information without a specific goal. Nonetheless, they are part of the wide and diverse set of automated and spambot accounts that pollute Twitter content every day.