Thursday, February 28, 2013

BigTable used by Google

In cloud computing and many web-based applications, the most essential component is a database that can hold a large amount of detail about a large number of users. Traditionally, relational databases present a view composed of multiple tables, each with rows and named columns. Queries, mostly written in SQL, allow one to extract specific columns from the rows where certain conditions are met. In a traditional database, the ACID properties (atomicity, consistency, isolation and durability) are the main concern. But in a highly distributed environment it is often impractical to guarantee consistency, so the ACID properties make traditional databases an uncomfortable fit there. The alternative storage system that Google uses in its highly distributed environment is BigTable.
 
BigTable is a distributed storage system structured as a large table that may be petabytes in size and distributed among thousands of machines. It is described as a fast and extremely scalable DBMS (database management system). BigTable is a compressed, high-performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable and a few other Google technologies. It is used only inside Google and is not distributed outside the company, although Google allows access to it as part of Google App Engine.
 
Fig 1: Google BigTable
BigTable development started in 2004, and it is now widely used by Google applications such as web indexing, Google Reader, Google Maps, Google Book Search, Google Earth, Blogger.com, Orkut, YouTube, Gmail, etc.
BigTable maps two arbitrary string values (a row key and a column key) and a timestamp, hence a three-dimensional mapping, to an associated arbitrary byte array. There can be multiple versions of a cell, each with a different timestamp. To manage huge tables, BigTable splits them at row boundaries and saves the pieces as tablets. Each tablet is around 200MB, and each server serves about 100 tablets. This setup allows the tablets of a single table to be spread among many machines. It also allows for load balancing: if one tablet is receiving many queries, the system can move the busy tablet to another machine that is not so busy. Likewise, if a machine goes down, its tablets can be spread across many other machines so that the performance impact on any given machine is minimal. Tablets are stored as immutable SSTables plus a tail of logs (one log per machine). When a machine’s system memory is full, it compresses some tablets using Google's proprietary compression techniques, such as Zippy and BMDiff. The locations of BigTable tablets are themselves stored in table cells, and looking up a particular tablet is handled by a three-tiered system.
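The three-dimensional mapping described above can be pictured with a small in-memory sketch. This is only a toy model to show how a row key, a column key and a timestamp together select one byte array; the class name ToyTable, its methods and the sample keys are made up for illustration and are not part of any Google API.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a (row key, column key, timestamp) -> bytes map."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; older versions are kept."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp`.

        With timestamp=None the newest version overall is returned.
        Returns None when the cell has no matching version.
        """
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = ToyTable()
# Hypothetical sample data: two versions of the same cell.
table.put("com.example.www", "contents:", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:", 2, b"<html>v2</html>")

latest = table.get("com.example.www", "contents:")
older = table.get("com.example.www", "contents:", timestamp=1)
missing = table.get("com.example.www", "anchor:home")
```

Here latest holds the second version, older holds the first, and missing is None because that cell was never written. A real system would also persist and distribute this data, which the sketch leaves out entirely.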
                                                                                                                                                                                     
                       
Fig 2: BigTable Architecture
BigTable is not a relational database; it can be defined as a sparse, distributed, multi-dimensional sorted map.

Sparse: The table is sparse, which means that different rows in a table may use different columns, and many of the columns in the table may be empty for any particular row.

Distributed: BigTable’s data is distributed among many independent machines. At Google, BigTable is built on top of Google File System.

Sorted: In most associative arrays, a key is hashed to a position in a table. BigTable instead sorts its data by keys. This helps to keep related data close together, usually on the same machine.
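A minimal Python sketch can show why sparse, sorted rows are convenient. It assumes made-up row keys and column names, and it splits by row count purely for illustration (the real system is described as splitting tablets by size, around 200MB); the helper names scan_prefix and split_into_tablets are hypothetical.

```python
import bisect

# Sparse rows: each row stores only the columns it actually uses.
rows = {
    "com.example.www/index": {"contents:": b"home", "anchor:news": b"Home"},
    "com.example.www/about": {"contents:": b"about"},
    "org.sample.blog/post1": {"contents:": b"post", "lang:": b"en"},
    "com.example.mail/inbox": {"contents:": b"inbox"},
}

# Keeping row keys sorted places related keys next to each other.
sorted_keys = sorted(rows)


def scan_prefix(keys, prefix):
    """Return all keys that start with `prefix` from a sorted list of keys."""
    start = bisect.bisect_left(keys, prefix)  # first key >= prefix
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return keys[start:end]


def split_into_tablets(keys, max_rows):
    """Cut the sorted key list at row boundaries into contiguous chunks."""
    return [keys[i:i + max_rows] for i in range(0, len(keys), max_rows)]


example_rows = scan_prefix(sorted_keys, "com.example.")
tablets = split_into_tablets(sorted_keys, max_rows=2)
```

Because the keys are sorted, all rows under "com.example." form one contiguous run, so a scan touches neighbouring rows rather than jumping around the table. Each chunk returned by split_into_tablets covers a contiguous key range, which is what lets a tablet be handed to a single machine.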

Google built its own database storage system for the sake of scalability and better control over performance characteristics.