Sandip Foundation's Students' Blog.


Tuesday, March 31, 2015

MongoDB find() and findOne() methods

In MongoDB CRUD operations, the find() and findOne() methods are used to read documents from a collection. Let's study these methods with examples.

1. find():

The most basic operation for querying documents out of the database is find(). Called without arguments, the MongoDB find() method returns all documents present in the collection.

For example, here I have a student database whose collection, grades, holds 800 documents. After executing find() without arguments, the shell displays only the first batch of documents (10 in this example) and asks whether you want more. Type "it" to display the next batch.
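The batch-by-batch display can be pictured with a small sketch. This is plain Python over an in-memory list, not the MongoDB shell itself; the batches helper, the placeholder documents and the batch size of 10 (taken from the example above) are assumptions for illustration.

```python
from itertools import islice

# Hypothetical stand-in for the 800 documents in the grades collection.
documents = [{"_id": i} for i in range(800)]

def batches(docs, batch_size=10):
    """Yield the documents one batch at a time, like typing "it" in the shell."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

cursor = batches(documents)
first_batch = next(cursor)   # displayed right after the query runs
second_batch = next(cursor)  # displayed after asking for the next batch
print(len(first_batch), first_batch[0], second_batch[0])
```

The point of the design is that nothing beyond the current batch is materialized until the reader asks for it.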

The find() method can take arguments. You can filter the result by adding search criteria. For example, the "grades" collection contains documents of type "homework", "quiz" and "exam". If I want only the "exam" documents, the query is db.grades.find({"type": "exam"}).

We can also put multiple fields in the criteria, in which case a document must match all of them. Here I have selected students with type "exam". You can also include or exclude fields in the result by passing a second argument that sets each field name to true or false; here I excluded the object id field (_id).
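To make the matching and field-exclusion rules concrete, here is a minimal sketch in plain Python. It imitates the behaviour on an in-memory list and is not MongoDB's implementation; the find and project helpers, the sample documents and the score field are assumptions, and only equality criteria and exclusion projections are handled.

```python
# Hypothetical in-memory stand-in for the "grades" collection.
grades = [
    {"_id": 1, "student_id": 0, "type": "exam", "score": 54.6},
    {"_id": 2, "student_id": 0, "type": "quiz", "score": 31.9},
    {"_id": 3, "student_id": 1, "type": "exam", "score": 74.2},
    {"_id": 4, "student_id": 1, "type": "homework", "score": 96.7},
]

def project(doc, projection):
    """Apply an exclusion-only projection such as {"_id": False}."""
    if not projection:
        return dict(doc)
    excluded = {field for field, keep in projection.items() if not keep}
    return {key: value for key, value in doc.items() if key not in excluded}

def find(collection, criteria=None, projection=None):
    """Return every document whose fields equal all values in criteria."""
    criteria = criteria or {}
    return [
        project(doc, projection)
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

exams = find(grades, {"type": "exam"})
exam_of_student_1 = find(grades, {"type": "exam", "student_id": 1}, {"_id": False})
print(exams)
print(exam_of_student_1)
```

In the shell, the second call would correspond to db.grades.find({"type": "exam", "student_id": 1}, {"_id": false}).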

2. findOne():

This method also retrieves data, but it returns only one document at a time. Let's look at the following example: findOne() without an argument returns the very first document from the collection.

We can also add criteria to the findOne() method, just as with find(). In the following example only the document with "student_id" : 112 is returned, as specified in the criteria.
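The difference from find() can be sketched in a few lines of plain Python. Again this only imitates the behaviour on an in-memory list; the find_one helper and the sample documents are assumptions for illustration.

```python
# Hypothetical in-memory stand-in for the "grades" collection.
grades = [
    {"_id": 1, "student_id": 111, "type": "exam"},
    {"_id": 2, "student_id": 112, "type": "exam"},
    {"_id": 3, "student_id": 112, "type": "quiz"},
]

def find_one(collection, criteria=None):
    """Return the first document matching all criteria, or None if none match."""
    criteria = criteria or {}
    for doc in collection:
        if all(doc.get(field) == value for field, value in criteria.items()):
            return doc
    return None

first = find_one(grades)                          # no criteria: first document
match = find_one(grades, {"student_id": 112})     # first document that matches
missing = find_one(grades, {"student_id": 999})   # nothing matches
print(first, match, missing)
```

Even though two documents have student_id 112, only the first one encountered is returned.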

Monday, August 25, 2014

Types of NoSQL databases

Hello everyone, in my last blog “NoSQL” I only listed the different categories of NoSQL databases. Now let’s look at them in detail, one by one.

Key value store
Document store
Column Family store
Graph based

1. Key value store:

· This type of model is the simplest and easiest to implement. The main idea is the use of a hash table in which each unique key has a value associated with it.

· There is little complexity in the key value store data model, so it can be implemented quickly and easily.

· The key can be synthetic or auto-generated, while the value can be a string, JSON, BSON, etc.

· It is schema free: each value is stored against its key. For example, one row may have the key “Name” with the value “Kaveri”, but the next row does not have to hold a name as well; the same column can hold a different kind of data in a different row.

· The following table shows an example of the key value store data model:

key	Value
“India”	{“09, Sai Anand App, Gangapur Road, Nasik 422013 ,India”}
“US”	{“3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033”}
“CA”	{“47112,warm springs blvd #103 Fremont, CA, United States 94539”}

· Key value databases allow clients to read and write values using a key, as follows:

§ Get(key): returns the value associated with the provided key.

§ Put(key, value): associates the value with the key.

§ Multi-get(key1, key2, .., keyN): returns the list of values associated with the list of keys.

§ Delete(key): removes the entry for the key from the data store.

· Memcached, Coherence, Redis, Riak and Amazon’s DynamoDB are the most popular key value store NoSQL databases.
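The four operations listed above map almost directly onto a hash table. The following is a minimal in-memory sketch in Python, not the API of any of the products named; the class name KeyValueStore and the method names are assumptions chosen to mirror the list.

```python
class KeyValueStore:
    """A toy in-memory key value store backed by a hash table (a dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Associate the value with the key, replacing any earlier value."""
        self._data[key] = value

    def get(self, key):
        """Return the value for the key, or None if the key is absent."""
        return self._data.get(key)

    def multi_get(self, *keys):
        """Return the list of values for the given keys, in the same order."""
        return [self._data.get(key) for key in keys]

    def delete(self, key):
        """Remove the entry for the key; do nothing if it is absent."""
        self._data.pop(key, None)

store = KeyValueStore()
store.put("India", "09, Sai Anand App, Gangapur Road, Nasik 422013, India")
store.put("US", "3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033")
print(store.multi_get("India", "US"))
store.delete("US")
print(store.get("US"))
```

Note that the store never looks inside a value; it can only be found again through its key.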

2. Document store:

· These kinds of NoSQL databases are very interesting because, instead of rows and columns, data is stored in documents. This semi-structured data is simply stored as JSON or BSON.

· Document databases were inspired by Lotus Notes and are similar to key value stores. The model is basically documents that are collections of other key value collections.

· Document databases are the next level up from key value stores. They allow the stored values to be queried more efficiently.

· Document database would look like,

· Examples of document databases are: MongoDB, CouchDB, Cloudant.
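As an illustration of the document model, here is a hypothetical student document written as a Python dictionary and printed as JSON. The field names, the scores and the get_path helper are assumptions made for this sketch; they do not come from any particular database.

```python
import json

# A hypothetical document: keys map to values, and a value may itself be
# a nested document or a list of documents.
student = {
    "_id": 1,
    "name": "Kaveri",
    "address": {"city": "Nasik", "country": "India"},
    "grades": [
        {"type": "exam", "score": 82},
        {"type": "quiz", "score": 91},
    ],
}

def get_path(doc, dotted_path):
    """Follow a dotted path such as "address.city" into nested documents."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value

print(json.dumps(student, indent=2))
print(get_path(student, "address.city"))
```

Unlike a plain key value store, the database can see inside the value, which is what makes querying on fields such as address.city possible.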

3. Column family store:

· Column family stores were created to store and process very large amounts of data distributed over many machines.

· This type of NoSQL database is not schema free. It is a kind of semi-structured database, which means you need to specify groups of columns in advance; these groups are called column families.

· There are still keys, but they point to multiple columns arranged by column family.

· A column family database can have different columns on each row, so it is not relational, and a column family would not qualify as a table in an RDBMS.

· The implementation of column family stores is very similar to Google’s BigTable.

· Column family stores work better with complex datasets.

· Examples of column family databases are: BigTable, HBase, Accumulo.
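The layout described above, a row key pointing to columns grouped by column family, can be sketched with nested dictionaries. This is only a picture of the data model in Python, not how BigTable or HBase store data; the table contents, the family names and the put and get helpers are assumptions.

```python
# Column families are declared up front; the columns inside them are not.
COLUMN_FAMILIES = {"profile", "activity"}

# Layout: row key -> column family -> column -> value (hypothetical data).
users = {
    "user1": {
        "profile": {"name": "Kaveri", "city": "Nasik"},
        "activity": {"last_login": "2014-08-25"},
    },
    "user2": {
        "profile": {"name": "Asha"},  # this row has no "city" column
        "activity": {"last_login": "2014-08-19", "posts": "3"},
    },
}

def put(table, row_key, family, column, value):
    """Write one cell; the family must exist, the column may be new."""
    if family not in COLUMN_FAMILIES:
        raise KeyError("unknown column family: " + family)
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def get(table, row_key, family, column):
    """Read one cell, or None if the row, family or column is absent."""
    return table.get(row_key, {}).get(family, {}).get(column)

put(users, "user2", "profile", "city", "Pune")
print(get(users, "user2", "profile", "city"))
print(get(users, "user1", "activity", "posts"))
```

The fixed set of families is the part that makes the model "not schema free", while the free choice of columns per row is what separates it from a relational table.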

4. Graph databases:

· Instead of the tables of rows and columns and the fixed structure of SQL, a flexible graph model is used, which can scale across multiple machines.

· Graph structures are used with edges, nodes and properties, which provide index-free adjacency. Data can be easily transformed from one model to the other using a graph based NoSQL database.

· A graph database takes the document database to the extreme by introducing the concept of a type of relationship between documents or nodes. An example is the relationships between people on a social network such as Facebook.

· It uses sophisticated shortest path algorithms to make data querying more efficient.

· Although it can be slower than its other NoSQL counterparts, a graph database can have the most complex structure and still traverse billions of nodes and relationships very quickly.

· Examples of graph databases are: Neo4j, Sones, InfiniteGraph, AllegroGraph, OrientDB, InfoGrid.
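To illustrate nodes, relationships and a shortest path query, here is a minimal sketch in Python using breadth-first search over an adjacency list. It is a teaching example, not the algorithm of any product listed above; the friends graph and the shortest_path function are assumptions.

```python
from collections import deque

# Hypothetical social graph: every node directly lists its neighbours,
# so a traversal needs no index lookup to move along a relationship.
friends = {
    "Asha": ["Bina", "Chetan"],
    "Bina": ["Asha", "Deepa"],
    "Chetan": ["Asha", "Deepa"],
    "Deepa": ["Bina", "Chetan", "Esha"],
    "Esha": ["Deepa"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return one path with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "Asha", "Esha"))
```

Breadth-first search only finds shortest paths when every relationship counts the same; weighted relationships would need a different algorithm such as Dijkstra's.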

Tuesday, August 19, 2014

NoSQL Databases

NoSQL is a next generation class of DBMS used to store and retrieve data that does not have a specific format or table-like structure. NoSQL is also read as “Not only SQL” to emphasize that these databases are not limited to the SQL language and may offer SQL-like query languages. The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. NoSQL databases mostly address the following points:

· Non-relational

· Distributed

· Open source

· Horizontally scalable

As we know, there are different types of database management system, and the RDBMS is the most commonly used. If we go back in history, flat file systems were created first; then, in the 1970s, Codd came up with relational theory, and relational databases were developed from it. The problem with flat file systems was that there was no standard way of storing data and no standard way of communicating with it. Everybody was implementing their own protocol, and that created a lot of inefficiency. Relational databases standardized the way we communicate with a database. Then life moved on and everything was going well until, suddenly, we entered the big data scenario. Relational databases struggled to handle such huge amounts of data, so the answer was “NoSQL”. NoSQL was created because of the limitations of relational databases.

The data in relational databases is structured. We have to define the table structure in advance: we have to tell the system which columns the table will have, which data type each column will take, the maximum size of a value one can enter, and so on. This way we can't handle unstructured data, where there is no fixed format or the format is always changing. NoSQL handles both structured and unstructured data.

NoSQL focuses on providing:

· Scalability

· Performance

· High Availability

· Simple API

· Schema free

· Easy replication support

· Eventually consistent (not ACID)

NoSQL can handle large amounts of data. As data keeps growing, it provides scalability, and high availability in the face of hardware failure. NoSQL databases handle large amounts of data with very good performance, but in comparison they offer less functionality than an RDBMS.

NoSQL is a category that is commonly divided further into the three categories below, given with examples:

· Key Value Store: Memcached, Coherence, Redis

· Tabular: BigTable, HBase, Accumulo

· Document oriented: MongoDB, CouchDB, Cloudant

If you search you will find more categories. Here is the list of some:

· Column store/Column family

· Document store

· Key value store/Tuple store

· Graph database

· Multimodel database

· Object database

· Grid and Cloud solution

· XML database

· Multidimensional database

· Event sourcing

· Network model

· Other NoSQL related databases

· Unresolved and uncategorized

So what is missing from NoSQL databases if you compare them with relational databases? First, joins are not there. Joins are one of the reasons relational databases are harder to scale, so leaving them out is where much of the scalability and performance of NoSQL databases comes from. Second, support for complex transactions is not there: for example, you can't insert three records, update two records, check some condition and roll everything back if the check fails. Third, constraint support is not there. Transaction support and constraint support are missing at the database level, but both can be applied at the application level.
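What "at the application level" can mean is easiest to see in a small sketch. The Python below applies a group of changes to a copy of the data, checks a constraint, and only then publishes the result. It is a single-process illustration under strong assumptions (everything fits in one dictionary and there are no concurrent writers); the names run_all_or_nothing, transfer and no_negative_balances are hypothetical.

```python
import copy

def run_all_or_nothing(store, operations, is_valid):
    """Apply every operation to a copy; keep the result only if is_valid passes."""
    working = copy.deepcopy(store)
    for operation in operations:
        operation(working)
    if not is_valid(working):
        return False        # "rollback": the original store was never touched
    store.clear()
    store.update(working)   # "commit": publish all the changes together
    return True

def no_negative_balances(data):
    """Constraint enforced by the application instead of the database."""
    return all(balance >= 0 for balance in data.values())

def transfer(amount):
    """Build an operation that moves amount from account "a" to account "b"."""
    def operation(data):
        data["a"] -= amount
        data["b"] += amount
    return operation

accounts = {"a": 100, "b": 50}
accepted = run_all_or_nothing(accounts, [transfer(30)], no_negative_balances)
rejected = run_all_or_nothing(accounts, [transfer(500)], no_negative_balances)
print(accepted, rejected, accounts)
```

A real distributed store makes this much harder, because the copy-and-swap step itself would have to be coordinated across machines.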

What are the situations or scenarios where one can use a NoSQL database?

· The ability to store and retrieve great quantities of data is important.

· Storing relationships between elements is not important.

· You are dealing with a growing list of elements, for example Twitter posts, Internet server logs, blogs.

· The data is not structured, or the structure changes over time.

· A prototype or application needs to be developed quickly.

· Constraints and validation logic are not required to be implemented in the database.

What are the situations or scenarios where one should not use a NoSQL database?

· Complex transactions need to be handled.

· Joins must be handled by the database.

· Validations must be handled by the database.

Tuesday, July 22, 2014

Introduction to Hadoop

Hadoop is a framework made up of a Linux-based set of tools. It is not a single piece of software that you can download on your computer and say "Hey, I have downloaded Hadoop". It is open source and freely distributed under the Apache license. This means no single company controls Hadoop; it is maintained by the Apache Software Foundation. The concept behind Hadoop is big data.

Hadoop: Big Data Processing made easier

Corporate data stores continue to grow by almost 50% annually. This increase in stored data requires good management, which in turn calls for a change in current technology. Storage and management technology has evolved to an extent, but today’s enterprises still face many evolving needs that can strain storage technologies. Big data analytics demands capabilities beyond those of typical storage: it has to handle terabytes and petabytes of unstructured information, which is a big challenge. So something more is required, a new way or a platform to deal with large volumes of data.

Hadoop:-

Hadoop is an open-source software framework that offers a platform to deal with big data. Hadoop was derived from Google’s MapReduce and Google File System papers. Hadoop is written in the Java programming language and is an Apache top-level project.

The Hadoop platform is designed to solve problems caused by large amounts of complex and unstructured data that cannot be placed into tables directly. Hadoop solves the most common problem with big data, i.e. efficiently storing and accessing large amounts of data.

The intrinsic design of Hadoop allows it to run as a platform across a large number of machines that don’t share any memory or disks. It reduces the management overhead associated with large data sets. The Hadoop framework provides both reliability and data movement to applications. When data is loaded into the Hadoop platform, the software breaks it down into fragments, which are then spread across different servers. The distributed nature of the data means there is no longer a traditional central server where all the data is stored and where every request has to go to access it. Furthermore, Hadoop keeps track of where the data resides, and it protects the information by maintaining multiple copies of it.
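The idea of breaking data into fragments and keeping several copies on different servers can be sketched as follows. This is a simplified Python illustration, not Hadoop's actual placement policy: the round-robin placement, the tiny block size, the server names and the function names are all assumptions, and the sketch assumes the replication count does not exceed the number of servers.

```python
def split_into_blocks(data, block_size):
    """Cut the data into consecutive fragments of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + offset) % len(servers)]
            for offset in range(replication)
        ]
    return placement

servers = ["server-a", "server-b", "server-c", "server-d"]  # hypothetical names
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), servers)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

The placement table is the "keeping track" part: if one server fails, every block it held can still be read from the other servers listed for that block.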

A centralized database system may consist of a large disk drive connected to a server that features multiple processors, with all the limitations that implies. With Hadoop, in contrast, every server in the cluster is allowed to participate in processing the data. Hadoop spreads the work and the data across the cluster, each server operates on its own little piece of the data, and then all the results are unified.

Hadoop's processing model is referred to as MapReduce: the code and processes are mapped to all servers and the results are reduced into a single set. This process enables Hadoop to deal with massive data.
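The map and reduce steps can be illustrated with the classic word count example. The sketch below runs in a single Python process and only mimics the flow of data; it does not use Hadoop's Java API. The function names and the three input chunks, each standing for the piece of data held by one server, are assumptions.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one server's chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all values by key so each word ends up with one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each word into a single count."""
    return {word: sum(values) for word, values in groups.items()}

# Each string stands for the fragment stored on one server.
chunks = ["big data big", "data hadoop", "big hadoop"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because map_phase looks at one chunk only, the map step can run on every server at the same time; only the grouped results need to travel over the network.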

However, there are certain prerequisites, hardware requirements and configuration steps that must be met to ensure success. Big data analytics requires that organizations choose the data to analyze and then apply aggregation methods before the data goes through the extract, transform and load process. The data can be structured, unstructured or drawn from multiple sources such as social networks, data logs, websites, etc. This is accomplished through capabilities such as moving computing power closer to the data and performing parallel or batch processing of large data sets.

But Hadoop cannot accomplish everything on its own. Organizations will need to consider what additional components are required to build a Hadoop project. Effective management and implementation of Hadoop require some expertise and experience; if these are not available, the organization should take the help of a service provider that can offer full support for the Hadoop project.

Although Hadoop has been around for some time, more and more organizations have started using its capabilities and have reduced the challenge of handling big data processing. Major companies associated with Hadoop include Google, Yahoo, JP Morgan Chase, etc.