Case Study: Staple Yourself to a Tweet to Understand 30 Billion Redis Updates Per Day

header-graphic-redis-at-twitter (1)In a recent post, Raffi Kirkorian, VP of Engineering at Twitter, explained how Redis is used at Twitter to support over 30 billion timeline updates per day based on 5000 tweets per second or 400,000,000 tweets per day. There is no doubt, Twitter’s infrastructure deals with extremely high scale demands. So, next time you get a Tweet from Katy Perry, remember 39 million inserts just occurred on Redis.

In this post, we staple ourselves to a tweet to experience the use case behind Redis and dive into the architecture of Twitter’s giant Redis cluster as we report on portions of Raffi’s talk.

Staple Yourself to a Tweet—Producing and Distributing a Tweet

Most users only know that, when you follow someone, their tweets show up on your homepage timeline. Yet, there is a lot more to it.

Posting a tweet actually uses several components from the platform services team. This group of engineers develops and operates the machines and applications for the timeline service, tweet service, social graph service, and user service. These services are exposed via Twitter APIs to internal development teams and external developers.

katy-perry-updates-redis-at-twitterOf course, a tweet starts when a user experiences something worth tweeting and decides to capture it. The tweet information passes through load balancers, and hits Twitter’s Write API. There, the tweet begins to follow a process led by something called the fanout daemon. The daemon first does a query against the social graph service called Flock, and Flock provides a list of the followers of the person who tweeted. At this point, the daemon now has the ID of the original user, the tweet ID, and the IDs of all the followers.

Next, the daemon takes the list of followers and begins to iterate through each of the followers’ home timelines, updating each timeline with the latest tweet information. These timelines are stored in memory within Twitter’s Redis cluster and replicated across data centers on three different machines. For each user, the daemon inserts the Tweet ID (8 bytes) in a native list structure in Redis along with the User ID (8 bytes) and some additional information for retweets, replies, and similar system-centric data (4 bytes). Redis doesn’t store the 140 character tweet information itself nor does it store a list of the entire history of tweets by the users that are followed. Instead, the Twitter engineering team limits Redis to storing the last 800 tweet IDs for each home timeline. Even with this limited amount of information, the Redis cluster uses several terabytes of RAM. This allows the Twitter engineering team to cache almost every single active user’s home timeline in memory at any given time and provide the fastest response times possible. These are also written to disk.

So, if you follow Katy Perry, and she tweets, your home timeline representation in Redis is updated along with 39 million other people’s Redis lists.

twitter-redis-architecture-01

Staple Yourself to a Tweet—Tweet Consumption

While there might be 5000 tweets per second on average and peaks up to 12,000, views are actually what keeps the datastore busy. There are over 300,000 queries per second on home timelines and 30,000 on search-based timelines.

When a user logs in to Twitter or a 3rd party tool that uses the Twitter API, they are presented the home timeline, served from data in the Redis cluster. This is a temporal merge of all the people followed and includes some business rules like stripping out @ replies for people you don’t follow and visibility of retweets. The timeline contains the ID of all the tweets and those IDs are hydrated or rendered with additional data by pulling data for user objects from a system called Gizmoduck and tweet objects from a system called TweetyPie, each with their own caches.

For search, each tweet is tokenized by an ingester that also considers product features and creates the index based on the tags for each word in a tweet. Upon the writing to the index, each Tweet is also ranked by additional information like number of favorites, replies, and retweets. The index is stored in Early Bird machines, a modified Apache Lucene index that is stored in RAM and sharded on a massive cluster, replicating for load.

When a user does a search, a scatter/gather service called Blender queries one of every unique shard for a query match. Blender takes the tweet timeline, merges, re-computes, and sorts the results of a search timeline. Blender also powers the discover page.

twitter-redis-architecture-02

For more information on Redis:

  • There are Redis clients for ActionScript, C, C#, C++, Clojure, Common Lisp, D, Dart, Emacs Lisp, Erlang, Fancy, Go, Haskell, haXe, IO, Java, Lua, Node.js, Objective C, Perl, PHP, Pure Data, Python, Ruby, Scala, Scheme, Smalltalk, and Tcl.
  • How Redis is used at Viacom
  • How it Redis is used at Twitter
  • How Redis is used at Pinterest
  • How Redis is used at Superfeedr
  • Interview with the inventor of Redis, Salvatore Sanfilippo

Permalink

Tags: , , ,

2 comments on “Case Study: Staple Yourself to a Tweet to Understand 30 Billion Redis Updates Per Day

  1. The Katy Perry example is not accurate.

    Twitter doesn’t fan out writes for popular users.
    It merges tweets from popular users you are following at read time. They maintain a list of a hundred users who are too popular for write fanouts.

    • Adam Bloom on said:

      Hi Tim,
      Thanks for the clarification. Of course, I do my best to give accurate information that is also interesting, worth reading, and helpful. So, let me just clarify the referenced post/talk in more detail:
      A. Raffi explains the fanout use case and uses the example of Redis updates for 20K followers around 11 minutes.
      B. Around 25 min, he goes through problems with fanouts using Lady Gaga, Katy Perry, Justin Bieber, and others as examples. He also mentions the experiments Twitter is doing to resolve these issues.

      By your suggestion, I suspect one of the mentioned improvements was implemented since this talk.
      At the time of his talk (I believe Nov 2012), he said that Katy only had 28 million followers.

      Any updated references are greatly appreciated.

      Kindly,
      Adam

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>