Data collection and ingestion is the task of fetching data from any data source. There are two ways to achieve data ingestion:
- Real-time Streaming – Organizations today generate data from many sources, and building a real-time data lake requires integrating that data into a single stream.
- Batch Ingestion – The first and foremost point to consider when implementing a data lake is to govern ingestion with strict rules and processes. For instance, Kafka and Flume can both connect directly to HBase and Hive, and Apache Spark can ingest and process data without writing it to disk.
While these capabilities are robust, they compromise the main idea of the data lake, which is to hold untouched and unaltered data. Ideally, data should be ingested into a raw landing zone where it is stored as received, and then copied to another zone for enrichment and processing.
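The raw-zone idea can be illustrated with a small sketch. It assumes the zones are plain directories on a local file system; the function names `land_raw` and `enrich`, and the whitespace-trimming "enrichment", are hypothetical and chosen only for illustration. The point is that the landed file stays byte-identical to the source, while all changes happen on a copy in a separate zone.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def land_raw(source: Path, raw_zone: Path) -> Path:
    """Copy a source file byte-for-byte into the raw landing zone."""
    raw_zone.mkdir(parents=True, exist_ok=True)
    target = raw_zone / source.name
    shutil.copy2(source, target)
    # The raw copy must be identical to what the source delivered.
    if sha256(source) != sha256(target):
        raise IOError(f"raw copy of {source.name} does not match the source")
    return target


def enrich(raw_file: Path, curated_zone: Path) -> Path:
    """Write a cleaned copy into a separate zone; the raw file is only read."""
    curated_zone.mkdir(parents=True, exist_ok=True)
    target = curated_zone / raw_file.name
    lines = raw_file.read_text(encoding="utf-8").splitlines()
    # Hypothetical enrichment step: trim whitespace and drop blank lines.
    cleaned = [line.strip() for line in lines if line.strip()]
    target.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return target
```

Calling `land_raw(source, raw_dir)` followed by `enrich(raw_file, curated_dir)` leaves two files: an unaltered one in the raw zone and a processed one in the curated zone. A real data lake would typically use object storage or HDFS rather than local directories, but the separation of zones works the same way.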
Some common objectives for building a data lake are:
- To serve as a central repository for big data
- To provide a testing setup for experimenting with new technologies and data
- Cost reduction through offloading of analytical systems and archiving of old data
- Metadata management and cataloging
- Automation of data pipelines
- Monitoring of metrics, with alert systems for failures and violations
- Data discovery, prototyping, and experimentation
Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data across multiple systems. It supports routing and processing data from any source to any destination, along with light data transformation.
NiFi is a platform with a UI for defining the source from which data is collected, the processors that transform it, and the destination where it is stored.
Every processor in NiFi exposes relationships such as success, retry, failure, and invalid data. These relationships are how processors connect to one another: each connection carries the data routed to a given relationship on to the next processor, so data keeps moving through the flow even when a processor reports a failure.
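To make the role of relationships concrete, here is a minimal sketch in plain Python. It is not NiFi's API: the processors `validate` and `to_upper`, the `run_flow` helper, and the relationship names used are all hypothetical. Each processor returns the name of a relationship together with the record, and a connection table decides which processor receives the record next.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A processor takes one record and returns (relationship_name, record).
Processor = Callable[[dict], Tuple[str, dict]]


def validate(record: dict) -> Tuple[str, dict]:
    """Hypothetical processor: route on whether an 'id' field is present."""
    if "id" not in record:
        return "invalid", record
    return "success", record


def to_upper(record: dict) -> Tuple[str, dict]:
    """Hypothetical processor: upper-case the 'name' field."""
    try:
        return "success", {**record, "name": record["name"].upper()}
    except (KeyError, AttributeError):
        return "failure", record


def run_flow(
    records: List[dict],
    start: str,
    processors: Dict[str, Processor],
    connections: Dict[Tuple[str, str], str],
) -> Dict[str, List[dict]]:
    """Push each record through the graph until it reaches a relationship
    that has no outgoing connection, and collect it under that name."""
    ended: Dict[str, List[dict]] = {}
    for record in records:
        current = start
        while True:
            relationship, record = processors[current](record)
            nxt: Optional[str] = connections.get((current, relationship))
            if nxt is None:
                ended.setdefault(f"{current}:{relationship}", []).append(record)
                break
            current = nxt
    return ended
```

With `connections = {("validate", "success"): "to_upper"}`, a record that has both `id` and `name` ends at `to_upper:success`, a record without `id` ends at `validate:invalid`, and a record without `name` ends at `to_upper:failure`. Failed records are routed rather than lost, which is the behavior the relationships are meant to provide.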
Key Apache NiFi features:
- Guaranteed Delivery – A core philosophy of NiFi is that guaranteed delivery is a must, even at a very large scale of data and operation. This is achieved through a purpose-built write-ahead log and content repository.
- Data Buffering with Back Pressure and Pressure Release – NiFi buffers all queued data and can apply back pressure when a queue reaches its configured limits, or age off data once it reaches a specified age.
- Prioritized Queuing – NiFi lets you set prioritization schemes that determine the order in which data is drawn from a queue.
- Flow-specific QoS – At some points in a data flow the data is critical and loss-intolerant, while at others processing time is what matters most. NiFi enables flow-specific configuration for these scenarios.
- Visual Command and Control – NiFi simplifies operation by letting you build the data flow visually, with changes taking effect in real time.
- Security – NiFi supports multi-tenant authorization. The authority level of a given data flow applies to each of its components, giving the admin fine-grained access control.
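The buffering, back-pressure, age-off, and prioritization behaviors listed above can be sketched with a small in-memory queue. This is only an illustration under stated assumptions, not how NiFi implements its queues: the class name `BoundedPriorityQueue`, its parameters, and the convention that a lower number means higher priority are all hypothetical.

```python
import heapq
import itertools
import time


class BoundedPriorityQueue:
    """Toy queue with a size limit (back pressure), a maximum age
    (age-off), and priority ordering (lower number is served first)."""

    def __init__(self, max_items, max_age_seconds, clock=time.monotonic):
        self.max_items = max_items
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._heap = []  # entries: (priority, order, enqueued_at, item)
        self._order = itertools.count()  # tie-breaker: FIFO within a priority

    def _age_off(self):
        """Drop entries that have been queued longer than max_age_seconds."""
        cutoff = self._clock() - self.max_age_seconds
        self._heap = [entry for entry in self._heap if entry[2] >= cutoff]
        heapq.heapify(self._heap)

    def offer(self, item, priority):
        """Return True if queued, or False to signal back pressure."""
        self._age_off()
        if len(self._heap) >= self.max_items:
            return False
        entry = (priority, next(self._order), self._clock(), item)
        heapq.heappush(self._heap, entry)
        return True

    def poll(self):
        """Return the highest-priority item, or None if the queue is empty."""
        self._age_off()
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]
```

A producer that receives `False` from `offer` is expected to slow down or retry later, which is the essence of back pressure. The injectable `clock` argument exists only so the age-off behavior can be exercised without waiting in real time.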
The 2018 update of Apache NiFi made some in-demand features available. New processors include CountText, which counts elements of text documents such as lines and words. Another much-needed addition is a Spark processor: ExecuteSparkInteractive gives a simpler and more efficient way to call Apache Spark batch and machine learning jobs.
Apache NiFi is certainly making waves in data streaming.
Author Bio:
Ethan Millar has worked with AegisSoftTech as a content creator for the last 5 years. He has extensive experience writing technical articles on Java, big data analytics services, .NET, and Microsoft CRM development.