=====================================
Getting data from the Twitter API
=====================================

So you want to analyze some Twitter data. (That's why you're here, right?)
This vignette walks through how to get set up and how to acquire the data.

**Obligatory disclaimer / reminder**: You should comply with Twitter's terms
of service and respect user privacy. It's important to only access data you
have a right to access.

----------------
Introduction
----------------

Twclient makes acquiring data easier than directly interacting with the
Twitter REST API, which you can do through a lightweight client like
Twitter's own `twurl `__ or a more featureful package like `tweepy `__. Using
either of these makes you do quite a bit of work you'd rather avoid: thinking
about cursoring and pagination of results, manually handling multiple sets of
credentials if you have more than one, and of course munging data into the
format you want. (No disrespect to tweepy, of course: twclient uses it for
low-level interactions with the Twitter API.)

Data munging in particular is not a simple task. You may have to keep and
organize the raw JSON responses from Twitter's API, and then extract things
from them via a tool like `jq `__; if using tweepy, you have to write some
Python code to serialize the User, Tweet, etc. objects it produces to a
format you can work with. In general, of course, there's no way around this:
if you want to write an application like a Twitter client, which people can
use to view their feeds, post tweets, and whatever else, you need the API in
its full complexity. But here we have a simpler task---read-only scraping of
data---and so we can make a simpler tool. (For formatting that data and
exporting it from the database, see the :doc:`other vignette on exporting
data `.)

Note that Twitter has other data sources than the REST API, in particular the
`PowerTrack `__ API, and this package does not support those. It also does
not (yet) support Twitter's new `v2 API `__.

Enough talk! How do you get started? In brief: this package provides a
command-line interface for loading data from the Twitter REST API into the
database of your choice. You can invoke it as ``twclient``, as in
``twclient fetch users -n wwbrannon``. You need to get and set up API
credentials, set up your database, and then you can pull and work with data.

-------------
API setup
-------------

You can't get data from the Twitter API without API credentials, so the next
step is to get at least one set of credentials. If you don't already have
credentials, Twitter has `documentation `__ on how to get them. You'll
generally receive four pieces of `OAuth `__ authentication information: a
consumer key, consumer secret, access token and access token secret. If using
`OAuth 2.0 bearer tokens `__ you may receive only a consumer key and consumer
secret. Regardless, you can add them to twclient as follows (replacing the
"XXXXX" with your values, and omitting the token and token secret if using a
bearer token):

.. code-block:: bash

    twclient config add-api -n twitter1 \
        --consumer-key XXXXX \
        --consumer-secret XXXXX \
        --token XXXXX \
        --token-secret XXXXX

As with the database setup described below, this command stores the
credentials in your config file, under an API profile named "twitter1", for
ease of use. We've only added one set of credentials here, but you can add
arbitrarily many under different names. Twclient will seamlessly switch
between them as each one hits rate limits.
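For example, a second set of (equally fake) credentials can be added under
another profile name. This just repeats the command above with a different
``-n`` value, as the full script at the end of this vignette also does:

.. code-block:: bash

    # a second API profile; with both "twitter1" and "twitter2" configured,
    # twclient switches between them as each one hits its rate limit
    twclient config add-api -n twitter2 \
        --consumer-key XXXXX \
        --consumer-secret XXXXX \
        --token XXXXX \
        --token-secret XXXXX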
------------------
Database setup
------------------

Configuration
---------------

Next, you need to configure a database. The easy, low-effort way to do this
is to use a file-backed SQLite database. Because SQLite is built into Python
and doesn't have a separate server process, you don't need to install or
configure anything else to get started. Here's how:

.. code-block:: bash

    twclient config add-db -f /path/to/your/project/sqlite.db db

This command tells twclient to create a persistent profile for a database and
call it "db", with the database itself stored in SQLite format in the file
you specify. The database profile you create is stored in a twclient
configuration file, by default ``~/.twclientrc``, so that you don't need to
keep providing the database URL for each command.

Be aware, though, that if you want to interact with the database via SQL,
Python doesn't package a frontend shell or client. SQLite has a
`standard client `__ you can download, and you can also install it from
`Homebrew `__ (on a Mac) or your Linux distribution's package manager.

SQLite is not the only database you can use: twclient can use any database
`supported by sqlalchemy `__. In addition to SQLite, we've also tested with
`Postgres `__, which is used for the SQL examples in the :doc:`data export
vignette `. Do note that while you can use other sqlalchemy-compatible
databases, we've only tested with SQLite and Postgres.

If you want to use Postgres, you'll need to do at least a bit of work to set
up the database. If you're on a Mac, `Postgres.app `__ is a highly
user-friendly distribution of Postgres. It's not the only one: among others,
you can download the database from `its website `__, use `Amazon RDS `__, or
`run it with Docker `__. You can configure twclient to use a Postgres
database (that you've already set up) as follows:

.. code-block:: bash

    twclient config add-db -u "postgresql:///" postgres

The only new thing here is that, instead of a single file passed to the
``-f`` option, we have a `sqlalchemy connection URL `__ and the ``-u``
option. (``-f`` is syntactic sugar for sqlalchemy's SQLite URL format.) The
specific URL here, ``postgresql:///``, indicates the default database on a
Postgres server accessed through the default local Unix socket, with
trust/passwordless authentication, using sqlalchemy's default Postgres
driver. (If you're using Postgres.app on a Mac, this is likely the URL you
want to use.)

Installing the data model
---------------------------

Next up, we have to install the data model: create the tables, columns, keys
and other DB objects the twclient package uses. Be aware that doing this will
**drop all existing twclient data in your database**. The
``twclient initialize`` command will do the trick, but to confirm that you
understand running it will **drop all existing twclient data in your
database**, you have to specify the ``-y`` flag:

.. code-block:: bash

    # if you've configured more than one database with `twclient config add-db`,
    # pass the `-d` option to specify which one to initialize
    twclient initialize -y

And that's it! If you fire up a database client, you'll see a new database
schema installed. The tables, columns and other objects are documented, in
the form of their sqlalchemy model classes, in the API documentation for
``twclient.models``.
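For example, if you went with the SQLite setup above, a quick way to confirm
the schema was created is to point SQLite's standard command-line client at
the database file. This assumes you've installed that client separately (as
noted earlier) and uses the example path from the configuration step:

.. code-block:: bash

    # list the tables twclient just created, using the sqlite3 client
    sqlite3 /path/to/your/project/sqlite.db '.tables'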
-------------------------
Actually pulling data
-------------------------

Now comes the fun part: actually downloading some data. We'll assume you've
pulled together sets of Twitter users and `Twitter lists `__ you want to
retrieve information on.

This example will use the following two files, one each of individual users
and lists of users. (The usernames, user IDs and lists are fake for privacy
reasons, so replace them with real ones if you want to run this.)

Here's ``users.csv``:

::

    screen_name
    user1
    user2
    user3
    test1234
    foobar
    stuff

And here's ``lists.csv``:

::

    list
    myaccount/mylist
    2389231097
    18230127
    big_newspaper/reporters
    20218236
    1937309
    1824379139

A word about identifiers
--------------------------

In general, Twitter allows you to refer to a user or list by either a) a
numeric user ID or list ID, or b) a human-readable name. Readable names for
users are called screen names, and for lists are called "full names." List
full names consist of the screen name of the user who owns the list and a
list-specific slug, separated by a slash. (For example,
"cspan/members-of-congress".)

With twclient, you can mix numeric and human-readable names for lists, as in
``lists.csv`` above, but not for users. That is, you could instead use this
``users_alternative.csv``:

::

    user_id
    137923923763
    37480133935
    237290537913
    3784935713
    3096490427891
    612092404590

but not one file which mixes user IDs and screen names together. This is
because of the way the underlying Twitter API endpoints are implemented:
they'll accept mixed references to lists, but not to users.

Hydrating users
-----------------

The first step is to `hydrate `__ the target users, which confirms with the
Twitter API that they exist, retrieves some summary information about them
and creates records for them in the database. You can do this with the
``twclient fetch`` family of commands, and specifically
``twclient fetch users``. We'll start by fetching the users in the lists of
``lists.csv``, though you could do the individual users first:

.. code-block:: bash

    tail -n +2 lists.csv | xargs twclient fetch users -v -b -l

This command skips the CSV header line (via ``tail -n +2 lists.csv``), which
twclient doesn't actually use, and pipes the rest of it to
``twclient fetch users -v -b -l`` via ``xargs``. The ``-v`` flag requests
verbose output, ``-b`` says to continue even if the Twitter API says some of
the lists requested are protected or don't exist, and ``-l`` says that the
users to hydrate are given in the form of Twitter lists. (If you'd left the
header line out of the CSV file and wanted to avoid using xargs, note that
you could instead write something like
``twclient fetch users -v -b -l $(cat lists.csv)``.)

Similarly, you can hydrate the individual users as follows:

.. code-block:: bash

    tail -n +2 users.csv | xargs twclient fetch users -v -b -n

A noteworthy difference from the case of lists is that you use the ``-n``
option, for users identified by screen names, rather than the ``-l`` option
for lists.
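If the ``xargs`` idiom looks opaque, it may help to see what the two
pipelines above expand to. With the fake identifiers from ``lists.csv`` and
``users.csv``, they're equivalent to typing the references out by hand:

.. code-block:: bash

    # what the two xargs pipelines above expand to, header rows dropped
    twclient fetch users -v -b -l \
        myaccount/mylist 2389231097 18230127 \
        big_newspaper/reporters 20218236 1937309 1824379139
    twclient fetch users -v -b -n \
        user1 user2 user3 test1234 foobar stuff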
Tagging users
---------------

Having fetched the users, we may want to give them *tags* for easier
reference in SQL or later commands. Twclient has a tag table that allows you
to associate arbitrary tag names with user IDs, to keep track of relevant
groups of users in your analysis. Let's say we want to track all individually
fetched users together, and all users retrieved from lists together, as two
groups.

First, we need to create a tag:

.. code-block:: bash

    twclient tag create twitter_lists

Next, we associate the new tag with the users it should apply to:

.. code-block:: bash

    tail -n +2 lists.csv | xargs twclient tag apply twitter_lists -l

Similarly, we can tag the individually fetched users:

.. code-block:: bash

    twclient tag create twitter_users
    tail -n +2 users.csv | xargs twclient tag apply twitter_users -n

Users fetched from Twitter lists will be associated with the lists they are
members of in the ``list`` and ``user_list`` tables, so there's no need to
tag lists individually.

Finally, we might want to create one tag referring to both sets of users (for
example, to run a regular job for fetching everyone's tweets). We do the same
two-step as above:

.. code-block:: bash

    twclient tag create universe
    twclient tag apply universe -g twitter_users twitter_lists

This time, however, you can see that the ``-g`` option allows selecting users
to operate on---whether that's tagging, hydrating, or fetching tweets and
follow edges---according to tags you've defined.

Fetching tweets
-----------------

Now, with fully hydrated users, it's time to get down to one of our primary
jobs: fetching the users' tweets. We can do this with the
``twclient fetch tweets`` command:

.. code-block:: bash

    twclient fetch tweets -v -b -g universe

As before, ``-v`` asks for verbose output, ``-b`` says to ignore nonexistent
or protected users rather than aborting the job, and ``-g universe`` says to
fetch tweets for those users tagged ``universe``.

Note that twclient also extensively normalizes the tweet objects returned by
Twitter. In addition to the tweet text, we pull out URLs, hashtags,
"cashtags", user mentions and other things, so that it's easy to compute
derived datasets like the mention / quote / etc. graphs over users. (For how
to do this and sample SQL, see the vignette on :doc:`exporting data `.) The
raw JSON API responses are also saved so that you can work with data we don't
parse.

Fetching the follow graph
---------------------------

Finally, we want to get the user IDs of our target users' followers and
friends. (A "friend" is Twitter's term for the opposite of a follower: if A
follows B, B is A's friend and A is B's follower.) There are two more
``twclient fetch`` subcommands for this: ``twclient fetch friends`` and
``twclient fetch followers``. Neither command hydrates users, because the
underlying Twitter API endpoints don't, so the ``follow`` table will end up
being populated with bare numeric user IDs.

Here's fetching friends, using options you've seen all of by now:

.. code-block:: bash

    twclient fetch friends -v -b -g universe

And here's followers:

.. code-block:: bash

    twclient fetch followers -v -b -j 5000 -g universe

The one new flag used here, ``-j 5000``, indicates the size of the batch used
for loading follow edges. The default if you don't use ``-j`` is to
accumulate all edges in memory and load them at once, which is faster but can
cause out-of-memory errors for large accounts. Specifying ``-j`` will trade
runtime for memory and let you process these large accounts.

The ``-v`` flag is also particularly useful here: if you're working with
users who have many followers or friends, it can take some time to process
them. Verbose output will print progress information (``-v -v`` will print
even more) to help monitor the job.

The fetched follow graph data itself is stored in a `type-2 SCD `__ format,
which (without getting into the details) means that you can just keep running
these commands and storing multiple snapshots at different times, without
using enormous amounts of disk space. (See the
:doc:`exporting data vignette ` for details of how to get follow graph
snapshots out of the SCD table.)
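Because the data is stored as snapshots, a natural pattern is to re-run these
fetch commands on a schedule. Here's a minimal sketch of a refresh script you
might run from cron or a similar scheduler; the script name is made up, and
the tag, flags and batch size are simply the ones used above:

.. code-block:: bash

    #!/bin/bash
    # refresh-universe.sh -- a hypothetical periodic refresh job. It assumes
    # the "universe" tag plus the database and API profiles configured earlier.
    set -e

    twclient fetch tweets -b -g universe
    twclient fetch friends -b -j 5000 -g universe
    twclient fetch followers -b -j 5000 -g universe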
---------------------------
Putting it all together
---------------------------

Here's all of our hard work in one little script (again, remember that the
user IDs and list IDs are fake for privacy; replace them with real ones if
you want to run this example):

.. code-block:: bash

    #!/bin/bash

    set -xe

    # We assume you've already installed the twclient package (e.g., from PyPI)
    # and gotten API keys, so we won't show any of that here. See also the
    # command-line -h/--help option for more info.

    cat << EOF > users.csv
    screen_name
    user1
    user2
    user3
    test1234
    foobar
    stuff
    EOF

    cat << EOF > lists.csv
    list
    myaccount/mylist
    2389231097
    18230127
    big_newspaper/reporters
    20218236
    1937309
    1824379139
    EOF

    twclient config add-db -f /path/to/your/project/sqlite.db db

    twclient initialize -y

    twclient config add-api -n twitter1 \
        --consumer-key XXXXX \
        --consumer-secret XXXXX \
        --token XXXXX \
        --token-secret XXXXX

    twclient config add-api -n twitter2 \
        --consumer-key XXXXX \
        --consumer-secret XXXXX \
        --token XXXXX \
        --token-secret XXXXX

    tail -n +2 lists.csv | xargs twclient fetch users -v -b -l
    twclient tag create twitter_lists
    tail -n +2 lists.csv | xargs twclient tag apply twitter_lists -l

    tail -n +2 users.csv | xargs twclient fetch users -v -b -n
    twclient tag create twitter_users
    tail -n +2 users.csv | xargs twclient tag apply twitter_users -n

    twclient tag create universe
    twclient tag apply universe -g twitter_users twitter_lists

    twclient fetch tweets -v -b -g universe
    twclient fetch friends -v -b -g universe
    twclient fetch followers -v -b -j 5000 -g universe

Tada! Now you have data in a DB. You can use canned SQL queries, like those
in the :doc:`exporting data vignette `, to get whatever piece of data you
want out of it: the follow graph, a user's tweets, mention / quote / reply /
retweet graphs, etc. Your creativity in SQL is the limit.

Wasn't that easier than you're used to?
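One last tip before you go: if you'd like a quick sanity check before diving
into SQL proper, you can point the SQLite client at the file the script
created. The ``follow`` and ``user_list`` tables are the ones mentioned
above; see the ``twclient.models`` API documentation for the full schema:

.. code-block:: bash

    # rough sanity checks on what the script loaded, using the sqlite3 client
    # and the database file path from the script above
    sqlite3 /path/to/your/project/sqlite.db 'select count(*) from follow;'
    sqlite3 /path/to/your/project/sqlite.db 'select count(*) from user_list;'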