Introduction to NoSQL

Much of today's data is handled by relational database management systems (RDBMSs). E.F. Codd's 1970 paper on the relational model, "A relational model of data for large shared data banks", made data modeling and application programming easier.

Practice has shown that the relational model suits client-server programming very well, far beyond the benefits originally expected, and today it is the dominant technology for structured data storage in network and business applications.

NoSQL is a new, revolutionary database movement. The idea was proposed early on, and by 2009 the trend had gained strong momentum. NoSQL advocates promote the use of non-relational data storage, a concept that brings fresh thinking compared with the near-universal use of relational databases.

Relational databases follow the ACID rules

A database transaction is similar to a real-world transaction and has four properties:

1. A (Atomicity)
Atomicity means that the operations in a transaction are either all done or all not done. A transaction succeeds only if every operation in it succeeds. If any one operation fails, the entire transaction fails and must be rolled back.

For example, a bank transfer of 100 yuan from account A to account B takes two steps: 1) withdraw 100 yuan from account A; 2) deposit 100 yuan into account B. These two steps must either both complete or both not complete. If only the first step completed and the second failed, 100 yuan would simply vanish.

2. C (Consistency)
Consistency means that the database always remains in a consistent state: running a transaction never violates the integrity constraints that held before it started.

For example, if an integrity constraint relates a and b and a transaction changes a, it must also change b so that the constraint still holds after the transaction ends; otherwise the transaction fails.

3. I (Isolation)
Isolation means that concurrent transactions do not interfere with one another's data. If the data one transaction wants to access is being modified by another transaction, then as long as that other transaction has not committed, the data the first transaction sees is unaffected by the uncommitted changes.
For example, suppose a transaction is transferring 100 yuan from account A to account B. If B queries the account before that transaction has completed, B does not see the additional 100 yuan.

4. D (Durability)
Durability means that once a transaction is committed, the changes it made are saved permanently in the database and are not lost even if the system goes down.
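
The bank transfer used above to explain atomicity can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration under stated assumptions: the accounts table, the starting balances, and the helper names open_bank, balance and transfer are invented for this example. The CHECK constraint stands in for a consistency rule (no negative balances).

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:  # commit the initial rows
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("A", 500), ("B", 200)]
        )
    return conn


def balance(conn, name):
    """Return the current balance of one account."""
    row = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True only if both steps commit."""
    try:
        # The connection used as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            # Step 1: withdraw from the source account.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            # Step 2: deposit into the destination account.
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise LookupError(f"no such account: {dst}")
        return True
    except (sqlite3.IntegrityError, LookupError):
        # Either the CHECK constraint or step 2 failed; step 1 was undone.
        return False
```

If the deposit step fails, for example because the destination account does not exist, the withdrawal is rolled back too, so the balance of account A is unchanged.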

Distributed systems

A distributed system consists of multiple computers and communication software components connected over a computer network (local network or wide area network).

A distributed system is a software system built on top of a network. It is precisely because of the characteristics of this software that distributed systems are highly cohesive and transparent.

As a result, the difference between a network and a distributed system lies more in the high-level software (especially the operating system) than in the hardware.

Distributed systems can be used on different platforms such as PCs, workstations, local area networks, and wide area networks.

The benefits of distributed computing

Reliability (fault tolerance):
An important advantage in distributed computing systems is reliability. A system crash on one server does not affect the rest of the servers.

Scalability:
More machines can be added as needed in distributed computing systems.

Resource sharing:
Sharing data is essential for applications such as banking and booking systems.

Flexibility:
Since the system is very flexible, it is easy to install, implement and debug new services.

Faster speed:
A distributed computing system combines the computing power of multiple computers, so it can process work faster than a single-machine system.

Open systems:
Because it is an open system, the service can be accessed either locally or remotely.

Higher performance:
Higher performance (and better price/performance ratio) can be provided compared to centralized computer network clusters.

The disadvantages of distributed computing

Troubleshooting:
Troubleshooting and diagnosing problems is harder.

Software:
Less software support is a major drawback of distributed computing systems.

Network:
Network infrastructure issues, including transmission problems, high load, and loss of information.

Security:
The open nature of the system exposes distributed computing systems to data security and data sharing risks.

What is NoSQL?

NoSQL refers to non-relational databases. NoSQL, sometimes expanded as Not Only SQL, is a generic term for database management systems that differ from traditional relational databases.

NoSQL is used to store data at very large scale. (Google or Facebook, for example, collect trillions of bits of data about their users every day.) These kinds of data stores do not require a fixed schema and can scale out without extra operations.

Why Use NoSQL?

Today we can easily access and crawl data through third-party platforms such as Google and Facebook. Users' personal information, social networks, geographic locations, user-generated data and user action logs have multiplied. SQL databases are no longer a good fit for applications that mine this user data, whereas NoSQL databases have developed to handle such large data sets well.

Social Network:

Each record: UserID1, UserID2
Separate records: UserID, first_name,last_name, age, gender,...
Task: Find all friends of friends of friends of ... friends of a given user.
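
The social network task above is a graph traversal. As a rough sketch, it can be written as a breadth-first search over the (UserID1, UserID2) records. The function name friends_within, the max_hops parameter, and the assumption that friendship is symmetric are all choices made for this illustration and do not come from any particular database.

```python
from collections import deque


def friends_within(friendships, user, max_hops):
    """Return the users reachable from `user` in 1..max_hops friendship hops."""
    # Build an adjacency map from the (UserID1, UserID2) records,
    # assuming friendship is symmetric.
    neighbors = {}
    for a, b in friendships:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {user}
    queue = deque([(user, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in neighbors.get(current, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))

    seen.discard(user)  # the user is not their own friend
    return seen
```

In a relational database, each additional hop typically means another self-join on the friendship table, which is one reason this kind of query is often cited as awkward in SQL.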

Wikipedia page:

Large collection of documents
Combination of structured and unstructured data
Task: Retrieve all pages about athletics at the Summer Olympics before 1950.

RDBMS vs NoSQL

RDBMS
- Highly organized and structured data
- Structured query language (SQL)
- Data and relationships are stored in separate tables.
- Data manipulation language, data definition language
- Strict consistency
- Basic transactions

NoSQL
- Stands for Not Only SQL
- There is no declarative query language
- There are no predefined schemas
- Key-value stores, column stores, document stores, graph databases
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP theorem
- High performance, high availability and scalability

A brief history of NoSQL

The term NoSQL first appeared in 1998 as the name of a lightweight, open source relational database developed by Carlo Strozzi that did not provide SQL functionality.

In 2009, Last.fm's Johan Oskarsson organized a discussion about distributed open source databases, and Eric Evans from Rackspace reintroduced the term NoSQL, this time referring mainly to non-relational, distributed database designs that do not provide ACID guarantees.

The "no:sql" seminar held in Atlanta in 2009 was a milestone, with the slogan "select fun, profit from real_world where relational=false;". Therefore, the most common interpretation of NoSQL is "non-relational", which emphasizes the advantages of key-value stores and document databases rather than simply opposing RDBMS.

CAP Theorem

In computer science, the CAP theorem, also known as Brewer's theorem, states that a distributed computing system cannot satisfy all three of the following at the same time:

Consistency (all nodes see the same data at the same time)
Availability (every request is guaranteed a response, whether it succeeds or fails)
Partition tolerance (the system continues to operate even if messages between nodes are lost or part of the system fails)

The core of the CAP theorem is that a distributed system cannot satisfy consistency, availability and partition tolerance all at once; it can satisfy at most two of them.

Therefore, according to the CAP principle, NoSQL databases are divided into three categories: those that satisfy CA, those that satisfy CP, and those that satisfy AP:

CA - Single-site clusters: systems that satisfy consistency and availability, and generally do not scale well.
CP - Systems that satisfy consistency and partition tolerance; performance is usually not particularly high.
AP - Systems that satisfy availability and partition tolerance; they usually relax consistency requirements.
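
The difference between the CP and AP categories above can be sketched with a toy model. This is not how any real database is implemented: the class TwoNodeCluster, the Unavailable exception, and the partitioned flag are hypothetical names invented here to show the trade-off when two replicas cannot reach each other.

```python
class Unavailable(Exception):
    """Raised when a CP cluster refuses a request during a partition."""


class TwoNodeCluster:
    """Toy model: two replicas of one key-value map, linked by a network."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True while the replicas cannot communicate

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: replicate the write to both nodes.
            for replica in self.replicas:
                replica[key] = value
        elif self.mode == "CP":
            # Keep the replicas identical by giving up availability.
            raise Unavailable("write refused during partition")
        else:
            # Stay available by letting the replicas diverge.
            self.replicas[node][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)


# Usage: the same write during a partition, in both modes.
ap = TwoNodeCluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)  # accepted, but node 1 still holds the old value

cp = TwoNodeCluster("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
    cp_refused = False
except Unavailable:
    cp_refused = True  # both nodes still agree on the old value
```

In the AP cluster the two nodes now return different values for the same key until they are reconciled, while the CP cluster stays consistent at the cost of rejecting the write.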

The pros/cons of NoSQL

Advantages:

- High scalability
- Distributed computing
- Low cost
- Architectural flexibility, semi-structured data
- No complicated relationships

Disadvantages:

- No standardization
- Limited query capabilities (so far)
- Eventual consistency is not intuitive to program against

BASE

BASE: Basically Available, Soft-state, Eventually Consistent. Defined by Eric Brewer.

BASE is a relaxed set of requirements for availability and consistency in NoSQL databases:

Basically Available -- the system is basically available
Soft state -- soft state/flexible transactions. Soft state can be understood as "connectionless", whereas hard state is "connection-oriented"
Eventual consistency -- the data becomes consistent eventually. Reaching a consistent state in the end is also the ultimate goal of ACID.
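
One simple way to picture eventual consistency is a last-write-wins merge between two replicas that have diverged. The sketch below is an illustration only: the merge function, the (version, value) layout, and the sample data are assumptions made for this example, and real systems use a variety of other reconciliation strategies.

```python
def merge(local, remote):
    """Last-write-wins merge of two replica states.

    Each state maps key -> (version, value). Versions are assumed to be
    unique per key, so the result does not depend on the merge order.
    """
    merged = dict(local)
    for key, (version, value) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return merged


# Two replicas that accepted different writes while out of contact.
replica_a = {"cart": (1, "book")}
replica_b = {"cart": (2, "book,pen"), "name": (1, "Ann")}

# After exchanging state, both sides hold the same data.
converged_a = merge(replica_a, replica_b)
converged_b = merge(replica_b, replica_a)
```

Until the merge runs, a reader may see the older cart on one replica. Afterwards both replicas agree, which is the "eventually" in eventual consistency.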

ACID vs BASE

ACID	BASE
Atomicity (A)	Basically Available (BA)
Consistency (C)	Soft state/flexible transactions (S)
Isolation (I)	Eventual consistency (E)
Durability (D)

NoSQL database classification

Type	Representative products	Characteristics
Column storage	HBase, Cassandra, Hypertable	As the name implies, data is stored by column. The main advantages are convenient storage of structured and semi-structured data, easy data compression, and a large I/O advantage for queries that touch only one or a few columns.
Document storage	MongoDB, CouchDB	Documents are typically stored in a JSON-like format, and the stored content is document-oriented. This makes it possible to index certain fields and to implement some of the functionality of a relational database.
Key-value storage	Tokyo Cabinet / Tyrant, Berkeley DB, MemcacheDB, Redis	A value can be looked up quickly by its key. In general the store accepts the value as is, regardless of its format. (Redis includes additional features.)
Graph storage	Neo4J, FlockDB	The best storage for graph relationships. Solving such problems with a traditional relational database performs poorly and makes the design awkward to use.
Object storage	db4o, Versant	The database is operated through syntax similar to an object-oriented language, and data is accessed as objects.
XML database	Berkeley DB XML, BaseX	Efficient storage of XML data, with support for XML query syntax such as XQuery and XPath.
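
To make the first four categories in the table more concrete, the sketch below shapes the same made-up user data in four ways using plain Python structures. It does not use the API of any product listed above. The sample users, field names, and the helper find_by_field are hypothetical.

```python
import json

# Key-value: the store sees only an opaque value behind each key.
key_value = {"user:1001": json.dumps({"name": "Ann", "city": "Oslo"})}

# Document: each value is a structured document whose fields can be queried.
documents = [
    {"_id": 1001, "name": "Ann", "city": "Oslo", "tags": ["admin"]},
    {"_id": 1002, "name": "Bo", "city": "Rome"},
]

# Column: values are grouped per column, so one column can be scanned alone.
columns = {
    "name": {1001: "Ann", 1002: "Bo"},
    "city": {1001: "Oslo", 1002: "Rome"},
}

# Graph: relationships are stored explicitly as edges between nodes.
edges = [(1001, "FRIEND_OF", 1002)]


def find_by_field(docs, field, value):
    """Return the documents whose `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


# Key-value lookup needs the exact key; the caller decodes the value.
ann = json.loads(key_value["user:1001"])

# Document lookup can filter on any field.
in_oslo = find_by_field(documents, "city", "Oslo")

# Column lookup reads one column without touching the others.
all_cities = sorted(columns["city"].values())
```

Note that the two documents do not share the same set of fields, which reflects the lack of a predefined schema.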

Who's using it?

Many companies now use NoSQL:

Google
Facebook
Mozilla
Adobe
Foursquare
Linkedin
Digg
McGraw-Hill Education
Vermont Public Radio
