================= Follow Graph ================= One of the most important pieces of Twitter data is the *follow graph* --- the network among users where user A has a directed edge to user B if A follows B on Twitter. Follow relationships shape which tweets people see, which other users Twitter recommends they follow, and greatly affects their experience on the site. Here we describe how to get the follow graph out of the twclient database. Without further ado, here's the simple version: .. code-block:: sql select source_user_id, target_user_id from follow where -- the follow table does type-2 SCD, so this condition says -- "only currently valid rows (not marked obsolete by a subsequent fetch)" valid_end_dt is null; As you can see, it isn't complicated. Directed edges are stored in the ``follow`` table, with the following user in the ``source_user_id`` column and the followed user in the ``target_user_id`` column. Both columns are foreign keys to the ``user`` table, ensuring that the graph is kept consistent with records of users. The only point which needs explanation is the WHERE clause. The ``follow`` table is stored in a `type-2 SCD `__ format, which minimizes the storage needed to handle repeated fetches of slowly changing data like users' following lists. In this format, each row has start and end validity dates: when a follow edge is first observed, its row is created with ``valid_start_dt`` indicating when it was first observed and NULL ``valid_end_dt``, with the latter indicating that it is still valid. When it is first *not* observed in a fetch where it should have been if it were still valid, the row is updated to set ``valid_end_dt`` to the time the follow edge was observed to be missing. Finally, if a follow relationship is first observed, then some time later found to be missing, then observed again (say because a user follows, unfollows and refollows another user), a new row is created. Because most user don't regularly follow and unfollow large numbers of other users, this format allows us to store repeated fetches of the follow graph easily and with minimal space requirements. You can also easily select your fetched follow graph *as it existed at a particular time in the past*. The WHERE clause snippet in the example above --- ``where valid_end_dt is null`` --- asks for all currently valid rows, or the most current state of the graph. But you can also get, for example, the follow graph as you had recorded it six months ago: .. code-block:: sql select source_user_id, target_user_id from follow where valid_start_dt < now() - interval '6 months' and (valid_end_dt >= now() - interval '6 months' or valid_end_dt is null); That is, follow edges which were first observed more than six months ago, and which either expired more recently than six months ago or whch are still valid. Note that you need either of two conditions on ``valid_end_dt``: ``valid_end_dt >= now() - interval '6 months'``, specifying follow edges which expired during the last six months because one user was observed to have unfollowed the other, and ``valid_end_dt is null`` to specify rows which are still valid. It sounds obvious but is worth pointing out that this query will not give you the *true* state of the follow graph six months ago, as recorded on Twitter's servers, but only the version you had fetched. For example, if your last fetch was seven months ago, six months ago your copy of the follow graph was a month out of date, and that's the version this query will return. Here's a slightly more complicated query which selects only follow edges between a certain set of users---here, those with the "universe" tag. It leverages the tagging feature explained in the :doc:`fetching data vignette ` and provides an example of how to use it: .. code-block:: sql with tmp_universe as ( select u.user_id from "user" u -- standard sql reserves this name, need to quote it inner join user_tag ut using(user_id) inner join tag ta using(tag_id) where -- just an example of using tagging, a tag -- with this name is not created automatically ta.name = 'universe' ) select uts.user_id as source_user_id, utt.user_id as target_user_id from tmp_universe uts inner join follow fo on fo.source_user_id = uts.user_id inner join tmp_universe utt on utt.user_id = fo.target_user_id where fo.valid_end_dt is null;