Follow Graph

One of the most important pieces of Twitter data is the follow graph — the network among users where user A has a directed edge to user B if A follows B on Twitter. Follow relationships shape which tweets people see, which other users Twitter recommends they follow, and greatly affects their experience on the site. Here we describe how to get the follow graph out of the twclient database.

Without further ado, here’s the simple version:

select
    source_user_id,
    target_user_id
from follow
where
    -- the follow table does type-2 SCD, so this condition says
    -- "only currently valid rows (not marked obsolete by a subsequent fetch)"
    valid_end_dt is null;

As you can see, it isn’t complicated. Directed edges are stored in the follow table, with the following user in the source_user_id column and the followed user in the target_user_id column. Both columns are foreign keys to the user table, ensuring that the graph is kept consistent with records of users.

The only point which needs explanation is the WHERE clause. The follow table is stored in a type-2 SCD format, which minimizes the storage needed to handle repeated fetches of slowly changing data like users’ following lists. In this format, each row has start and end validity dates: when a follow edge is first observed, its row is created with valid_start_dt indicating when it was first observed and NULL valid_end_dt, with the latter indicating that it is still valid. When it is first not observed in a fetch where it should have been if it were still valid, the row is updated to set valid_end_dt to the time the follow edge was observed to be missing. Finally, if a follow relationship is first observed, then some time later found to be missing, then observed again (say because a user follows, unfollows and refollows another user), a new row is created. Because most user don’t regularly follow and unfollow large numbers of other users, this format allows us to store repeated fetches of the follow graph easily and with minimal space requirements.

You can also easily select your fetched follow graph as it existed at a particular time in the past. The WHERE clause snippet in the example above — where valid_end_dt is null — asks for all currently valid rows, or the most current state of the graph. But you can also get, for example, the follow graph as you had recorded it six months ago:

select
    source_user_id,
    target_user_id
from follow
where
    valid_start_dt < now() - interval '6 months' and
    (valid_end_dt >= now() - interval '6 months' or valid_end_dt is null);

That is, follow edges which were first observed more than six months ago, and which either expired more recently than six months ago or whch are still valid. Note that you need either of two conditions on valid_end_dt: valid_end_dt >= now() - interval '6 months', specifying follow edges which expired during the last six months because one user was observed to have unfollowed the other, and valid_end_dt is null to specify rows which are still valid.

It sounds obvious but is worth pointing out that this query will not give you the true state of the follow graph six months ago, as recorded on Twitter’s servers, but only the version you had fetched. For example, if your last fetch was seven months ago, six months ago your copy of the follow graph was a month out of date, and that’s the version this query will return.

Here’s a slightly more complicated query which selects only follow edges between a certain set of users—here, those with the “universe” tag. It leverages the tagging feature explained in the fetching data vignette and provides an example of how to use it:

with tmp_universe as
(
    select
        u.user_id
    from "user" u -- standard sql reserves this name, need to quote it
        inner join user_tag ut using(user_id)
        inner join tag ta using(tag_id)
    where
        -- just an example of using tagging, a tag
        -- with this name is not created automatically
        ta.name = 'universe'
)
select
    uts.user_id as source_user_id,
    utt.user_id as target_user_id
from tmp_universe uts
    inner join follow fo on fo.source_user_id = uts.user_id
    inner join tmp_universe utt on utt.user_id = fo.target_user_id
where
    fo.valid_end_dt is null;