
Day 18 of 100 Days of Data Engineering: streaming analytics

  • This course is about two core technologies used in the streaming world on Google's Cloud Platform:
    • Apache Beam
    • Cloud Dataflow 
  • This is the final course you’ll need before studying for and sitting the Google Professional Data Engineer exam. 
  • Streaming – a type of data processing engine designed for infinite data sets. 
  • In the world of streaming data there are two kinds of data sets.
    • Bounded
    • Unbounded 
  • Bounded data is static. It’s data at rest. 
  • The term unbounded means the data set is never complete. 
  • Analysis on unbounded data is only valid within certain time windows.
  • The three Vs of data are:
    • Volume
    • Variety 
    • Velocity

14-streaming analytics 2

  • Apache Beam is an open source model for Batch and Streaming use cases. 
  • Beam = Batch + strEAM. 
  • Apache Beam is often referred to simply as Beam.
  • With Beam we build Pipelines.
  • Beam has been designed for batching and streaming, so you don’t have to think about one or the other. 
  • A pipeline is a concept very similar to a TensorFlow graph. 
  • Pipelines can be run in multiple execution environments. 
  • In this course our executor of choice is Cloud Dataflow. 
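To make that concrete, here is a minimal sketch of a Beam pipeline using the Python SDK. It runs locally on the DirectRunner; swapping the runner option later sends the same graph to Cloud Dataflow.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run locally with the DirectRunner; the same pipeline can later be
# submitted to Cloud Dataflow by changing the runner option.
options = PipelineOptions(runner='DirectRunner')

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'streaming', 'world'])
     | 'Uppercase' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
```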
  • The paper that started it all was the MapReduce paper written in 2004. 
  • The two core concepts we have to understand are event time and processing time
  • In a perfect world we would process our data immediately once it’s received, but that’s not very real world. 
  • If events were processed the instant they occurred, event time and processing time would be equal, and a plot of one against the other would be a straight diagonal line; real pipelines always lag behind that ideal. 
  • Pipelines must handle out of order data. 
  • A PCollection is simply a set of data in our pipeline.   (EXAM QUESTION)
  • PCollections can handle both bounded and unbounded data sets; a sketch of both follows below. 
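A hedged sketch of both kinds of sources in the Python SDK (the bucket, file, and topic names are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # required for unbounded reads

with beam.Pipeline(options=options) as p:
    # Bounded PCollection: data at rest in a file.
    bounded = p | 'ReadFile' >> beam.io.ReadFromText('gs://my-bucket/events.csv')

    # Unbounded PCollection: a never-complete stream from Pub/Sub.
    unbounded = p | 'ReadStream' >> beam.io.ReadFromPubSub(
        topic='projects/my-project/topics/events')
```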
  • Windowing is breaking down our data into discrete data sets based on some metric, usually time. 
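For example, a sketch of 60-second fixed windows using the Python SDK (the timestamps are hand-assigned for illustration):

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 10), ('b', 70)])  # (key, event time in seconds)
     | 'Stamp' >> beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
     | 'Window' >> beam.WindowInto(window.FixedWindows(60))  # 60 s windows
     | 'CountPerKey' >> beam.combiners.Count.PerKey()
     | beam.Map(print))  # counts are computed per key, per window
```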
  • Refinements are a way of further fine-tuning our results. 
  • The three tensions that arise from handling infinite unordered data are:
    • Completeness
    • Latency
    • Cost
  • How we balance these three tensions is determined by the use case. 
  • Sessions capture a burst of user activity. 
  • Bucketing events based on when they actually happened (rather than when they were processed) is called event-time-based windowing; see the session-window sketch below.
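A sketch of session windows keyed by user, with a 10-minute gap defining the end of a session (timestamps are hand-assigned for illustration):

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('user1', 0), ('user1', 30), ('user1', 2000)])
     | 'Stamp' >> beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
     | 'Sessions' >> beam.WindowInto(window.Sessions(600))  # 10-minute gap
     | 'CountPerUser' >> beam.combiners.Count.PerKey()
     | beam.Map(print))  # two sessions: one with 2 events, one with 1
```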
  • Triggers control when results are emitted. (EXAM QUESTION)
  • Triggers are defined relative to the watermark; a sketch of a watermark-relative trigger follows below.
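In place of the original screenshot, here is a hedged sketch of a watermark-relative trigger in the Python SDK:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Emit an early (speculative) pane every 30 s of processing time, an
# on-time pane when the watermark passes the end of the window, and a
# late pane for each element arriving within the allowed lateness.
windowed = beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.AfterWatermark(
        early=trigger.AfterProcessingTime(30),
        late=trigger.AfterCount(1)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    allowed_lateness=3600)  # accept data up to an hour late
```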

15-streaming analytics 3

  • Resource Type is the name of the service or entity we want to monitor (e.g., BigQuery or Cloud Dataflow).
  • Metric is the measurement we want to track (e.g., number of queries or watermark age). 
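As a hedged sketch of reading such a metric with the Cloud Monitoring client library (the project ID is hypothetical, and the exact metric name should be verified against the published metrics list):

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = 'projects/my-project'  # hypothetical project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {'end_time': {'seconds': now},
     'start_time': {'seconds': now - 3600}})

# Resource type: a Dataflow job; metric: its data watermark age.
results = client.list_time_series(
    name=project_name,
    filter='metric.type = "dataflow.googleapis.com/job/data_watermark_age"',
    interval=interval,
    view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL)

for series in results:
    print(series.metric.type, series.resource.type)
```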
  • Cloud Dataflow is really 2 things:
    • SDK for authoring our pipelines
    • Fully managed service for running or executing those pipelines
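A sketch of the options that submit a pipeline to the managed service (the project, region, bucket, and job name are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',        # execute on the managed service
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    job_name='day18-streaming-demo')

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```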
  • A runner is an environment where our pipelines can be executed. 
  • When you choose Cloud Dataflow as your runner, Google throws in a few benefits:
    • Optimizer 
    • Smart workers
    • Monitoring 
  • Graphs are executed as a single unit. 
  • Every element in a PCollection has an implicit timestamp. 
  • A backing store is simply a file or table you want to read from.
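A sketch combining the two ideas: read from a backing store, then replace the implicit timestamp with an event time taken from the record itself (the file path and column layout are assumptions):

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | 'ReadBackingStore' >> beam.io.ReadFromText('gs://my-bucket/events.csv')
     | 'Parse' >> beam.Map(lambda line: line.split(','))
     # Batch sources assign a single implicit timestamp, so pull the real
     # event time out of the record (assumes column 0 is a unix timestamp).
     | 'Stamp' >> beam.Map(lambda rec: window.TimestampedValue(rec, float(rec[0]))))
```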
  • There are three core types of transforms (see the sketch after this list):
    • Element-wise – called a ParDo; the same operation is executed in parallel over each element. 
    • Aggregating – multiple inputs combined into one output. 
    • Composite – sub-graphs built from other transforms. 
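A sketch showing all three, with the composite transform wrapping the element-wise and aggregating steps:

```python
import apache_beam as beam

# Element-wise: a ParDo applies this DoFn to every element in parallel.
class SplitWords(beam.DoFn):
    def process(self, line):
        for word in line.split():
            yield word

# Composite: a transform built from a sub-graph of other transforms.
class CountWords(beam.PTransform):
    def expand(self, lines):
        return (lines
                | 'Split' >> beam.ParDo(SplitWords())            # element-wise
                | 'Count' >> beam.combiners.Count.PerElement())  # aggregating

with beam.Pipeline() as p:
    (p
     | beam.Create(['the cat sat', 'the dog ran'])
     | 'CountWords' >> CountWords()
     | beam.Map(print))
```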
  • A graph is a pictorial representation of how we authored our pipeline. 
  • We can use the Resources button in Stackdriver to monitor our resources.
