Day 18 of 100 Days of Data Engineering – Streaming Analytics
- This course is about two core products used in the streaming world on Google Cloud Platform:
- Apache Beam
- Cloud Dataflow
- This is the final course you’ll need before studying for and sitting the Google Cloud Professional Data Engineer exam.
- Streaming – a type of data processing engine used for infinite data sets.
- In the world of streaming data there are two kinds of data sets.
- Bounded
- Unbounded
- Bounded data is static. It’s data at rest.
- The term unbounded means the data set is never complete.
- Analysis of unbounded data is only valid within certain time windows.
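The distinction shows up directly in Beam’s Python SDK. A minimal sketch (bucket, project, and topic names are hypothetical): reading a text file yields a bounded PCollection, while reading from a Pub/Sub topic yields an unbounded one.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True is needed for the unbounded Pub/Sub read
opts = PipelineOptions(streaming=True)

with beam.Pipeline(options=opts) as p:
    # Bounded: a file is data at rest; the collection has a known end.
    bounded = p | "ReadFile" >> beam.io.ReadFromText("gs://my-bucket/events.csv")

    # Unbounded: a Pub/Sub topic is never complete; elements keep arriving.
    unbounded = p | "ReadTopic" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/events"
    )
```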
- The three Vs of data are:
- Volume
- Variety
- Velocity
Streaming Analytics 2
- Apache Beam is an open source model for Batch and Streaming use cases.
- Beam = Batch + strEAM; together the two words spell Beam.
- Apache Beam will often be referred to as Beam.
- With Beam we build Pipelines.
- Beam has been designed for both batching and streaming, so you don’t have to choose one or the other.
- A pipeline is a concept very similar to a TensorFlow graph.
- Pipelines can be run in multiple execution environments.
- In this course our executor of choice is Cloud Dataflow.
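To make the pipeline idea concrete, here is a minimal word-count sketch. It runs as-is on the local DirectRunner; pointing it at Cloud Dataflow only changes the pipeline options, not the pipeline code.

```python
import apache_beam as beam

with beam.Pipeline() as p:  # DirectRunner by default
    (
        p
        | "Create" >> beam.Create(["batch", "stream", "batch"])  # a PCollection
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```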
- The paper that started it all was the MapReduce paper written in 2004.
- The two core concepts we have to understand are event time and processing time.
- In a perfect world we would process our data the instant it occurs, but that’s not realistic.
- If every event were processed the moment it happened, event time and processing time would be equal and the distribution would fall on a straight (ideal) line.
- In reality events arrive late and out of order, so pipelines must handle out-of-order data, as in the sketch below.
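One way out-of-order data is handled in Beam’s Python SDK: each element can be re-stamped with its event time, so downstream windowing groups by when the event happened rather than when it arrived. A sketch with hypothetical (event, unix-timestamp) pairs:

```python
import apache_beam as beam
from apache_beam import window

# Events arriving out of order: the "view" is received first
# even though the "click" happened first.
events = [("view", 1609459260), ("click", 1609459200)]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach each element's event time as its timestamp.
        | beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))
    )
```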
- A PCollection is simply a distributed set of data in our pipeline. (EXAM QUESTION)
- A PCollection can handle bounded and unbounded data sets.
- Windowing is breaking our data down into discrete data sets based on some criterion, usually time, as in the sketch below.
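A minimal fixed-window sketch, assuming elements carry event timestamps as shown earlier: 60-second windows slice the stream into discrete chunks, and the per-key count is computed separately inside each window.

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("a", 10), ("b", 70), ("a", 75)])  # (key, event time)
        | beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
        # Break the data into discrete 60-second windows of event time.
        | beam.WindowInto(window.FixedWindows(60))
        | beam.CombinePerKey(sum)  # one count per key, per window
        | beam.Map(print)
    )
```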
- Refinements are a way of further fine-tuning our results, e.g. updating a window’s result when late data arrives.
- The three tensions that arise from handling infinite unordered data are:
- Completeness
- Latency
- Cost
- How we balance these three tensions is determined by the use case.
- Sessions capture a burst of user activity.
- Bucketing events in a stream based on when they happened (rather than when they arrived) is called event-time-based windowing.
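Session windows are Beam’s built-in way to capture those bursts. A sketch with a hypothetical 300-second gap: two events 15 seconds apart land in one session, while an event arriving 380 seconds later starts a new one.

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("user1", 5), ("user1", 20), ("user1", 400)])
        | beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
        # A session closes after 300 seconds of inactivity.
        | beam.WindowInto(window.Sessions(300))
        | beam.CombinePerKey(sum)  # events per session
        | beam.Map(print)
    )
```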
- Triggers control when results are emitted. (EXAM QUESTION)
- Triggers are defined relative to the watermark, which tracks how far event time has progressed. The sketch below shows a trigger keyed off the watermark.
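A sketch using Beam’s Python SDK: AfterWatermark fires when the watermark passes the end of the window, and the early and late clauses trade latency against completeness and cost, which is exactly the three-way tension above.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("a", 10), ("a", 50)])
        | beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
        | beam.WindowInto(
            window.FixedWindows(60),
            # Fire when the watermark passes the end of the window,
            # with a speculative early result every 30s of processing
            # time and a refinement for each late element.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(30),
                late=trigger.AfterCount(1),
            ),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=600,  # accept data up to 10 minutes late
        )
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```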
Streaming Analytics 3
- Resource type is the name of the service or entity we want to monitor (e.g., BigQuery or a Cloud Dataflow job).
- Metric is the measurement we want to track on that resource (e.g., number of queries or watermark age).
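A hedged sketch of querying such a metric with the google-cloud-monitoring client library; the project ID is hypothetical, and dataflow.googleapis.com/job/data_watermark_age is Dataflow’s watermark-age metric.

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# resource.type names the service; metric.type names the measurement.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'resource.type = "dataflow_job" AND '
            'metric.type = "dataflow.googleapis.com/job/data_watermark_age"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value)
```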
- Cloud Dataflow is really two things:
- An SDK for authoring our pipelines
- A fully managed service for executing those pipelines
- A runner is an environment where our pipelines can be executed.
- When you choose Cloud Dataflow as your runner, Google throws in a few benefits (see the options sketch after this list):
- Optimizer
- Smart workers
- Monitoring
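Selecting Cloud Dataflow as the runner is just a matter of pipeline options; a sketch with hypothetical project and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # execute on the managed service
    project="my-project",                # hypothetical GCP project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # staging/temp files
)

# The pipeline code itself is unchanged; only the options differ.
with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```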
- Graphs are executed as a single unit.
- Every element in a collection has an implicit timestamp.
- A backing store is simply a file or table you want to read from.
- There are three core types of transforms applied to PCollections (see the sketch after this list):
- Element-wise – a ParDo; the same operation is executed on every element in parallel.
- Aggregating – multiple inputs combine into one output.
- Composite – sub-graphs built from other transforms.
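All three types in one short sketch: a ParDo for the element-wise step, CombinePerKey for the aggregating step, and a PTransform subclass packaging both as a composite.

```python
import apache_beam as beam

# Element-wise: the same logic runs on every element in parallel.
class ExtractWords(beam.DoFn):
    def process(self, line):
        for word in line.split():
            yield word

# Composite: a sub-graph of other transforms, applied as a single unit.
class CountWords(beam.PTransform):
    def expand(self, lines):
        return (
            lines
            | beam.ParDo(ExtractWords())   # element-wise
            | beam.Map(lambda w: (w, 1))
            | beam.CombinePerKey(sum)      # aggregating: many inputs, one output
        )

with beam.Pipeline() as p:
    p | beam.Create(["to be or not to be"]) | CountWords() | beam.Map(print)
```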
- A graph is a pictorial representation of how we authored our pipeline.
- We can use the Resources button in Stackdriver to monitor our resources.