Follow Graph
One of the most important pieces of Twitter data is the follow graph — the network among users where user A has a directed edge to user B if A follows B on Twitter. Follow relationships shape which tweets people see, which other users Twitter recommends they follow, and greatly affects their experience on the site. Here we describe how to get the follow graph out of the twclient database.
Without further ado, here’s the simple version:
select
source_user_id,
target_user_id
from follow
where
-- the follow table does type-2 SCD, so this condition says
-- "only currently valid rows (not marked obsolete by a subsequent fetch)"
valid_end_dt is null;
As you can see, it isn’t complicated. Directed edges are stored in the
follow
table, with the following user in the source_user_id
column and
the followed user in the target_user_id
column. Both columns are foreign
keys to the user
table, ensuring that the graph is kept consistent with
records of users.
The only point which needs explanation is the WHERE clause. The follow
table is stored in a type-2 SCD
format, which minimizes the storage needed to handle repeated fetches of slowly
changing data like users’ following lists. In this format, each row has start
and end validity dates: when a follow edge is first observed, its row is
created with valid_start_dt
indicating when it was first observed and NULL
valid_end_dt
, with the latter indicating that it is still valid. When it is
first not observed in a fetch where it should have been if it were still
valid, the row is updated to set valid_end_dt
to the time the follow edge
was observed to be missing. Finally, if a follow relationship is first
observed, then some time later found to be missing, then observed again (say
because a user follows, unfollows and refollows another user), a new row is
created. Because most user don’t regularly follow and unfollow large numbers of
other users, this format allows us to store repeated fetches of the follow
graph easily and with minimal space requirements.
You can also easily select your fetched follow graph as it existed at a
particular time in the past. The WHERE clause snippet in the example above —
where valid_end_dt is null
— asks for all currently valid rows, or the
most current state of the graph. But you can also get, for example, the follow
graph as you had recorded it six months ago:
select
source_user_id,
target_user_id
from follow
where
valid_start_dt < now() - interval '6 months' and
(valid_end_dt >= now() - interval '6 months' or valid_end_dt is null);
That is, follow edges which were first observed more than six months ago, and
which either expired more recently than six months ago or whch are still valid.
Note that you need either of two conditions on valid_end_dt
: valid_end_dt
>= now() - interval '6 months'
, specifying follow edges which expired during
the last six months because one user was observed to have unfollowed the other,
and valid_end_dt is null
to specify rows which are still valid.
It sounds obvious but is worth pointing out that this query will not give you the true state of the follow graph six months ago, as recorded on Twitter’s servers, but only the version you had fetched. For example, if your last fetch was seven months ago, six months ago your copy of the follow graph was a month out of date, and that’s the version this query will return.
Here’s a slightly more complicated query which selects only follow edges between a certain set of users—here, those with the “universe” tag. It leverages the tagging feature explained in the fetching data vignette and provides an example of how to use it:
with tmp_universe as
(
select
u.user_id
from "user" u -- standard sql reserves this name, need to quote it
inner join user_tag ut using(user_id)
inner join tag ta using(tag_id)
where
-- just an example of using tagging, a tag
-- with this name is not created automatically
ta.name = 'universe'
)
select
uts.user_id as source_user_id,
utt.user_id as target_user_id
from tmp_universe uts
inner join follow fo on fo.source_user_id = uts.user_id
inner join tmp_universe utt on utt.user_id = fo.target_user_id
where
fo.valid_end_dt is null;