Storm Basics

May 17, 2021 09:00 Storm getting Started

Table of contents


Storm is a distributed, reliable, fault-020s data flow processing system. I t delegates work tasks to different types of components, each of which is responsible for a simple, specific task. T he input flow of the Storm cluster is managed by a component called spout, which passes data to bolt, which either saves the data to some kind of memory or passes the data to another bolt. As you can imagine, a Storm cluster is a series of bolts that convert spout data.

Here is a simple example of this concept. L ast night I saw the host on the news program talking about politicians and their positions on various political topics. They kept repeating different names, and I began to think about whether they were mentioned the same number of times and the deviation between different times.

Imagine an announcer reading subtitles as your data stream. Y ou can read a file with a spout (or socket, via HTTP, or something else). T he line of text is passed by spout to a bolt, which is then cut by word. T he word stream is passed on to another bolt, where each word is compared to a list of political names. F or each matching name encountered, the second bolt adds 1 to the count of the name in the database. Y ou can query the database at any time to see the results, and these counts are updated in real time as the data arrives. Refer to Topology Figure 1-1 for all components (spouts and bolts) and their relationships

Storm Basics

Now imagine that it's easy to define the parallelity levels of each bolt and spout across the Storm cluster, so you can extend your topology indefinitely. I t's amazing, isn't it? Although this is a simple example, you can also see the power of Storm.

What are some typical Storm applications?

The data processing stream

As the example above shows, unlike other streaming systems, Storm does not require intermediate queues.

Continuous calculations

Continuously sends data to clients, enabling them to update and display results in real time, such as site metrics.

Distributed remote procedure calls

Frequent CPU-intensive operations are parallelized.

Storm component

For a Storm cluster, a continuously running primary node organizes several nodes to work.

In a Storm cluster, there are two types of nodes: the master node master node and the worker nodes. T he primary node runs a daemon called Nimbus. T his daemon is responsible for distributing code in the cluster, assigning tasks to work nodes, and monitoring failures. T he Supervisor daemon runs on the working node as part of the topology. A Storm topology runs many working nodes on different machines.

Because Storm maintains all cluster state on Zookeyer or local disks, the daemon can be stateless and fail or restart without affecting the health of the entire system (see Figure 1-2)

Storm Basics

At the bottom of the system, Storm uses zeromq (0mq, zeromq T his is an advanced, embedded network communication library that provides great functionality to make Storm possible. Some of the characteristics of zeromq are listed below.

  • An Socket library of a synth architecture
  • For cluster products and supercomputing, faster than TCP
  • Communication can be made via inproc (in-process), IPC (inter-process), TCP, and multicast (multicast protocol).
  • Scalable multi-core messaging application for asynchronous I/O
  • N-N connectivity is achieved using fanouts, publishing subscriptions (PUB-SUB), pipelines, request responses (REQ-REP), and more

NOTE: Storm uses push/pull sockets only

Storm's characteristics

Of all these design ideas and decisions, there are some great features that make Storm unique.

  • Simplify programming: If you've ever tried real-time processing from scratch, you should understand how painful it is. With Storm, complexity is greatly reduced.
  • It's easier to develop in a JVM-based language, but you can develop in any language on Storm with a small middleware. There are ready-made middleware to choose from, and of course you can develop your own.
  • Fault tolerance: The Storm cluster focuses on the state of the work node and reassigns tasks if necessary if downtime is necessary.
  • Scalable: All you need to do to scale the cluster is to add machines. Storm assigns tasks to new machines when they are ready.
  • Reliable: All messages are guaranteed to be processed at least once. If something goes wrong, the message may be processed more than once, but you will never lose it.
  • Fast: Speed is a key factor driving Storm design
  • Transactional: You can get exactly once messaging semantics for pretty much any computationation. You can get exactly one message semantic for almost any calculation.