<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Search-Time on despatches</title><link>https://icle.es/tags/search-time/</link><description>Recent content in Search-Time on despatches</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 18 Mar 2026 15:13:17 +0000</lastBuildDate><atom:link href="https://icle.es/tags/search-time/index.xml" rel="self" type="application/rss+xml"/><item><title>Making Twitter Faster</title><link>https://icle.es/2009/03/04/making-twitter-faster/</link><pubDate>Wed, 04 Mar 2009 17:36:35 +0000</pubDate><guid>https://icle.es/2009/03/04/making-twitter-faster/</guid><description>&lt;p>From my perspective, Twitter has a really really interesting technical problem
to solve: how to store and retrieve a large amount of data really, really
quickly.&lt;/p>
&lt;p>I am making some assumptions based on how I see Twitter working. I have little
information about how it is architected, apart from some posts that suggest
it is running Ruby on Rails with MySQL.&lt;/p>
&lt;p>Twitter is in the rare category where a very large amount of data is being
added. There should be no updates (except to user information, and there should
be relatively few of those). There is no need for transactionality.
If I guess right, the workload should be a large number of inserts and selects.&lt;/p>
&lt;p>While a relational database is probably the only viable choice for the time
being, I think that Twitter could scale and perform better if all the extra bits
of a relational database system were removed.&lt;/p>
&lt;p>I love challenges like this. Technical ones are easier ;-)&lt;/p>
&lt;p>If I didn&amp;rsquo;t have a lifetime job, I would prototype this in a bit more depth.
&lt;a href="http://garry.blog.kraya.co.uk" title="Garry&amp;#39;s Blog">Garry&lt;/a> pointed me in the
direction of &lt;a href="//hadoop.apache.org/" title="Hadoop">Hadoop&lt;/a>. Having had a quick look at
it, it can take care of the infrastructure, clustering and massive horizontal
scaling requirements.&lt;/p></description><content:encoded><![CDATA[<p>From my perspective, Twitter has a really really interesting technical problem
to solve: how to store and retrieve a large amount of data really, really
quickly.</p>
<p>I am making some assumptions based on how I see Twitter working. I have little
information about how it is architected, apart from some posts that suggest
it is running Ruby on Rails with MySQL.</p>
<p>Twitter is in the rare category where a very large amount of data is being
added. There should be no updates (except to user information, and there should
be relatively few of those). There is no need for transactionality.
If I guess right, the workload should be a large number of inserts and selects.</p>
<p>While a relational database is probably the only viable choice for the time
being, I think that Twitter could scale and perform better if all the extra bits
of a relational database system were removed.</p>
<p>I love challenges like this. Technical ones are easier ;-)</p>
<p>If I didn&rsquo;t have a lifetime job, I would prototype this in a bit more depth.
<a href="http://garry.blog.kraya.co.uk" title="Garry&#39;s Blog">Garry</a> pointed me in the
direction of <a href="//hadoop.apache.org/" title="Hadoop">Hadoop</a>. Having had a quick look at
it, it can take care of the infrastructure, clustering and massive horizontal
scaling requirements.</p>
<p>Now for the data layer on top. How to store and retrieve the data.
<a href="http://hadoop.apache.org/hbase/" title="HBase - a scalable distributed database">HBase</a>
is probably a good option but doing it manually should be fairly straightforward
too.</p>
<p>From my limited understanding of Twitter, there are two key pieces of
functionality: the timelines and search.</p>
<p>The timelines can be solved by storing each tweet as a file within a directory
structure. My tweets would go into</p>
<p><code>/w/o/r/d/s/o/n/s/a/n/d/&lt;tweet-filename&gt;</code></p>
<p>The filename would be <code>&lt;username&gt;-&lt;timestamp&gt;</code></p>
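<p>As a rough sketch of that layout, the path for a tweet could be derived from
the username and timestamp as below. The function name <code>tweet_path</code> and the
<code>root</code> parameter are made up for illustration, and I am assuming that the
directory chain is simply the username split into one directory per character.</p>

```python
def tweet_path(root: str, username: str, timestamp: int) -> str:
    """Build <root>/<u>/<s>/<e>/<r>/.../<username>-<timestamp>.

    Assumption: the directory chain is the username, one character per level.
    """
    sharded = "/".join(username)           # "ab" becomes "a/b"
    filename = f"{username}-{timestamp}"   # the <username>-<timestamp> file name
    return f"{root}/{sharded}/{filename}"


if __name__ == "__main__":
    print(tweet_path("/data", "wordsonsand", 1236158897))
```

<p>Listing one user's timeline then comes down to reading a single directory and
sorting the file names by their timestamp suffix.</p>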
<p>For the public timeline, you would have a similar folder structure, but keyed on
the timestamp instead. For example, a tweet with the timestamp 1236158897 would go
into the following structure as a symlink:</p>
<p><code>/1/2/3/6/1/5/8/8/9/7/&lt;username&gt;</code></p>
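<p>A minimal sketch of writing a tweet and linking it into the public timeline
might look like this. The <code>users</code> and <code>public</code> top-level folders are my own
assumption to keep the two trees apart, and <code>store_tweet</code> is a hypothetical
name, not part of any existing system.</p>

```python
import os
import tempfile


def store_tweet(root, username, timestamp, text):
    """Write the tweet file, then symlink it into the timestamp tree."""
    # Per-user tree: one directory per character of the username.
    user_dir = os.path.join(root, "users", *username)
    os.makedirs(user_dir, exist_ok=True)
    tweet_file = os.path.join(user_dir, f"{username}-{timestamp}")
    with open(tweet_file, "w", encoding="utf-8") as fh:
        fh.write(text)

    # Public tree: one directory per digit of the timestamp,
    # with a symlink named after the user pointing at the tweet file.
    public_dir = os.path.join(root, "public", *str(timestamp))
    os.makedirs(public_dir, exist_ok=True)
    link = os.path.join(public_dir, username)
    os.symlink(tweet_file, link)
    return tweet_file, link


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    print(store_tweet(demo_root, "wordsonsand", 1236158897, "hello world"))
```

<p>Two tweets from the same user in the same second would collide on the symlink
name, so a real version would need a tie-breaker of some kind.</p>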
<p>For search, pick up each word in the tweet and pop the tweet as a symlink into
a folder for that word. You could have a folder per word or follow the structure above.</p>
<p><code>/t/w/i/t/t/e/r/&lt;username&gt;-&lt;timestamp&gt;</code> OR</p>
<p><code>twitter/&lt;username&gt;-&lt;timestamp&gt;</code></p>
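<p>Here is a small sketch of the folder-per-word variant. How a tweet is split
into words is not something I have thought through, so the lowercase
alphanumeric tokenising below is an assumption, as are the names
<code>index_tweet</code> and <code>search</code>.</p>

```python
import os
import re
import tempfile


def index_tweet(index_root, tweet_file, text):
    """Symlink tweet_file into one folder per distinct word of the text."""
    # Assumption: a "word" is a run of lowercase letters or digits.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for word in words:
        word_dir = os.path.join(index_root, word)
        os.makedirs(word_dir, exist_ok=True)
        link = os.path.join(word_dir, os.path.basename(tweet_file))
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(tweet_file), link)
    return words


def search(index_root, word):
    """Return the names of the tweets that contain the word."""
    word_dir = os.path.join(index_root, word.lower())
    if not os.path.isdir(word_dir):
        return []
    return sorted(os.listdir(word_dir))


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    path = os.path.join(root, "wordsonsand-1236158897")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("Making Twitter faster")
    index_tweet(os.path.join(root, "index"), path, "Making Twitter faster")
    print(search(os.path.join(root, "index"), "twitter"))
```

<p>A search for a single word is then just a directory listing. Searching for
several words would mean intersecting the listings.</p>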
<p>You would then have an application running on top, with a distributed cache and
an API that makes access to the data easier than direct file access. Running on
Linux, the kernel will take care of a large part of the caching and
buffering automatically, as long as there is enough RAM on the box.</p>
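<p>To give a feel for what that API layer could look like, here is a very rough
sketch of reading one user's timeline. The in-process <code>lru_cache</code> is only a
stand-in for a real distributed cache, and <code>read_tweet</code> and
<code>user_timeline</code> are hypothetical names.</p>

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=10_000)
def read_tweet(path):
    """Read a tweet body; repeated reads are served from memory."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def user_timeline(root, username, limit=20):
    """Return the newest tweets of a user, newest first."""
    user_dir = os.path.join(root, *username)
    if not os.path.isdir(user_dir):
        return []
    prefix = username + "-"
    # Keep only this user's tweet files; skip the sub-directories that
    # belong to longer usernames sharing the same prefix.
    names = [
        n for n in os.listdir(user_dir)
        if n.startswith(prefix) and os.path.isfile(os.path.join(user_dir, n))
    ]
    # Sort by the integer timestamp after the last "-", newest first.
    names.sort(key=lambda n: int(n.rsplit("-", 1)[1]), reverse=True)
    newest = names[:limit]
    return [read_tweet(os.path.join(user_dir, n)) for n in newest]
```

<p>Tweets never change once written, which is what makes caching them this
aggressively safe.</p>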
<p>In theory, this can be done without Hadoop in between, by separating the
directory structures across multiple servers, but that can have complications of
its own, especially when adding and removing boxes for scalability.</p>
<p>You are also likely to run into limits on the number of files or
sub-directories, but they can be solved by &lsquo;archiving&rsquo; - multiple options
for that too&hellip;</p>
<p>Thinking about this problem brought me back to the good old days of working on
the search mechanism within megabus.com. We needed the site to deal with a large
number of searches on limited hardware when the project was still classified as
a pilot.</p>
<p>With some hard work and experimentation, we were able to reduce the search time
to a tenth of the original time.</p>
<p>I&rsquo;ll admit that I don&rsquo;t know the details or the intricacies of the requirements
that twitter has. I have probably over-simplified the problem but it was still
fun to think about. If you can think of problems with this - let me know; I
wanna turn them into opportunities ;-)</p>]]></content:encoded></item></channel></rss>