Day 17 of 100DaysOfDataEngineering
08-big data 1
- Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.
- The service is built on a core Google infrastructure component that many Google products have relied upon for over a decade.
- Google Cloud Pub/Sub is a publish/subscribe (Pub/Sub) service: a messaging service where the senders of messages are decoupled from the receivers of messages.
- Here are the key components of Pub/Sub (a minimal publish-and-pull sketch follows the list):
- Message: the data that moves through the service.
- Topic: a named entity that represents a feed of messages.
- Subscription: a named entity that represents an interest in receiving messages on a particular topic.
- Publisher (also called a producer): creates messages and sends (publishes) them to the messaging service on a specified topic.
- Subscriber (also called a consumer): receives messages on a specified subscription.
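A minimal sketch of the pieces above using the Python client library (google-cloud-pubsub). The project, topic, and subscription IDs are placeholders and are assumed to already exist.

```python
from google.cloud import pubsub_v1

# Placeholder IDs; the topic and subscription are assumed to already exist.
project_id = "my-project"
topic_id = "my-topic"
subscription_id = "my-subscription"

# Publisher: create a message and publish it to the topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, data=b"hello from the publisher")
print("Published message id:", future.result())

# Subscriber: pull messages from the subscription and acknowledge them.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for received in response.received_messages:
    print("Received:", received.message.data)
if response.received_messages:
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```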
- Google Cloud Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development.
- Cloud Datastore is ideal for applications that rely on highly available structured data at scale.
- A good use case for Datastore might be product catalogs that provide real-time inventory and product details for a retailer (sketched in the example below).
- Another use case might be user profiles that deliver a customized experience based on the user’s past activities and preferences.
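As a rough sketch of the product-catalog use case, here is how an entity might be written and read back with the Python client library (google-cloud-datastore). The kind, key, and property names are made up for illustration.

```python
from google.cloud import datastore

# Uses application-default credentials; kind and property names are illustrative.
client = datastore.Client()

# Write a product entity keyed by SKU.
key = client.key("Product", "sku-12345")
product = datastore.Entity(key=key)
product.update({"name": "Espresso machine", "price": 249.99, "in_stock": 12})
client.put(product)

# Read it back by key for a real-time product-details lookup.
fetched = client.get(key)
print(fetched["name"], fetched["in_stock"])
```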
- BigQuery is, at the most basic level, a serverless data warehouse.
- The term serverless simply means that all of the hardware and software components have been abstracted away from the user, leaving them to focus solely on the task at hand.
- With BigQuery you pay for each query you submit.
- BigQuery is a managed query service.
- BigQuery uses a SQL-like language (see the query sketch below).
- Once we’ve created our project and set the location, it’s important to make sure the APIs for the services we are going to work with are enabled.
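For example, a pay-per-query job can be submitted from Python with the google-cloud-bigquery client. The query below runs against a well-known BigQuery public dataset; the project is whatever is configured in the environment.

```python
from google.cloud import bigquery

# Uses application-default credentials and the project configured in the environment.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# Submitting the query is what we pay for; results stream back as rows.
for row in client.query(query):
    print(row["name"], row["total"])
```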
- Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data.
- Cloud Bigtable is exposed to applications through multiple clients, including a supported extension to the Apache HBase library for Java.
- Cloud Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.
- A single value in each row is indexed; this value is known as the row key.
- Each Bigtable table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row (see the sketch below).
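A short sketch of writing a cell and reading the row back by its row key with the Python client (google-cloud-bigtable). The instance, table, and column-family names are placeholders and assumed to exist.

```python
from google.cloud import bigtable

# Instance, table, and column family are placeholders and assumed to exist.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("product-catalog")

# Write one cell; the row key is the only indexed value in the table.
row_key = b"sku-12345"
row = table.direct_row(row_key)
row.set_cell("details", "name", b"Espresso machine")
row.commit()

# Read the row back by its key.
result = table.read_row(row_key)
print(result.cells["details"][b"name"][0].value.decode())
```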
- Most Hadoop architectures, on premises or in the cloud, require you to deploy a cluster and then proceed to fill that cluster with jobs, be they MapReduce jobs, Hive queries, or Spark jobs.
- With Cloud Dataproc you submit the job, and the cluster is spun up afterwards or as part of the overall job pipeline (see the job-submission sketch below).
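As a rough illustration of the submit-a-job workflow, here is how a PySpark job might be submitted to an existing cluster with the Python client (google-cloud-dataproc). The project, region, cluster name, and script path are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# The Dataproc API is regional, so point the client at the regional endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Submit a PySpark job to an existing cluster; the script lives in Cloud Storage.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/wordcount.py"},
}
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()
print("Job finished with state:", result.status.state.name)
```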
09-big data 2
- On-premises Hadoop deployments are costly and are either underutilized or overprovisioned.
- Google Cloud Platform (GCP) decouples compute and storage.
- The on-premises build-out is costly: you need hardware, you’ll need resources to build the servers, install the OS, configure the OS, and debug it, and then you might be able to process some data.
- In GCP we spin up our clusters and then submit jobs to those clusters.
- Scaling up means purchasing bigger servers.
- Scaling out means distributing your workload on many servers.
- Compute Engine resources live in regions or zones.
- Zones live inside regions.
- Disks and instances are both zonal resources.
- Communication within regions will almost always be cheaper and faster than communication across different regions.
- HDFS does exist on GCP; however, it’s only there for phase one of moving our on-premises jobs to GCP.
- After the jobs are moved, we want to begin using Google Cloud Storage (a short Cloud Storage sketch follows).
- GCP separates storage and compute.
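As a small sketch of using Cloud Storage in place of cluster-local HDFS, here is a write and read of an object with the Python client (google-cloud-storage). The bucket and object names are placeholders, and the bucket is assumed to exist.

```python
from google.cloud import storage

# Bucket and object names are placeholders; the bucket is assumed to exist.
client = storage.Client()
bucket = client.bucket("my-data-lake-bucket")

# Write an object (roughly the GCS equivalent of putting a file into HDFS).
blob = bucket.blob("raw/events/2024-01-01.csv")
blob.upload_from_string("id,event\n1,click\n2,view\n")

# Read it back; jobs can reference the same data as gs://my-data-lake-bucket/raw/...
print(blob.download_as_text())
```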
- Over a decade ago, Google built a new foundation for its search engine. It was called the Google File System (GFS for short), and it ran across thousands of servers, turning an entire data center into something that behaved a lot like a single machine.
- GFS was built for batch operations, i.e., operations that happen in the background before they’re actually applied to a live website, whereas Colossus is built for real-time services.
- Google Cloud Storage sits on Colossus.
- We can only use preemptible workers with Google Cloud Storage.
- Ideally, we want to use fewer persistent workers and more preemptibles for our jobs.
- A great way to apply this new approach is to spin up the bare minimum of persistent VMs and use a lot of preemptibles.
10-big data 3
- The first step in creating a new cluster is to give it a name.
- The default HDFS replication factor on Dataproc is 2.
- When you create a cluster, you create a master node and worker nodes.
- Choose zones closest to your data.
- There’s no need to queue jobs.
- On-premises Hadoop clusters are static.
- The high availability option uses 3 masters.
- The notifications icon (the bell) in the upper right-hand corner of the console shows the actions we’ve taken.
- Keep in mind that these clusters are VMs or virtual machines.
- We can use `gcloud dataproc clusters create` to create a cluster from the Google Cloud Shell.
- The three different options we have for our clusters are (see the cluster-creation sketch below):
  - single node
  - standard mode
  - high availability
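A sketch of creating a standard-mode cluster (one master, two workers) with the Python client (google-cloud-dataproc), roughly what the `gcloud dataproc clusters create` command above does. The project, region, cluster name, and machine types are placeholders.

```python
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# Regional endpoint, matching the region where the cluster will live.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A standard-mode cluster: one master and two workers (names and sizes are placeholders).
cluster = {
    "project_id": project_id,
    "cluster_name": "my-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print("Cluster created:", operation.result().cluster_name)
```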
- Keeping data in HDFS works against our Cloud Dataproc approach, because it ties storage to the cluster.
- Our end goal on GCP is to decouple compute and storage.
- Preemptible nodes are significantly discounted.
- Preemptible nodes run for at most 24 hours.
- Only worker nodes can be preempted.
- When a preemptible node receives a shutdown signal, it has 30 seconds before the system shutdown starts.
- There are different image versions with different software on the clusters.
- There are currently only two supported Cloud Dataproc image versions.
- Image version 1.1 is the default for all new clusters.
- Connectors make it easy for us to use GCS and BigQuery with Cloud Dataproc.
- We can easily scale our Dataproc clusters by increasing the number of worker nodes (see the resize sketch below).
- We can increase the number of workers to make our jobs run faster.
- We can decrease the number of workers to save money.
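A rough sketch of resizing an existing cluster's primary worker count with the Python client; the cluster name and the new size are placeholders, and the same change can also be made from the console or the gcloud CLI.

```python
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Scale the existing cluster out to five primary workers.
operation = cluster_client.update_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": "my-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 5}}},
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
print("Workers after resize:", operation.result().config.worker_config.num_instances)
```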
- The default dataproc cluster does not have high availability turned on.
- The High Availability option within Dataproc has 3 masters.
11-big data 4
12-big data 5
13-big data 6
- Whiteboarding is used by many companies for interviewing candidates.
- Additionally, whiteboarding is used extensively when designing data engineering solutions for clients.
- When you whiteboard, you explain by creating pictorial representations of the concept in simple form.
- When you whiteboard any topic, keep the scope limited to the question.
- Use one marker color, preferably black.
- All concepts should be represented as simple shapes (boxes, lines, circles).
- A Cloud Dataproc server could be a box labeled N1-M (node one, and it’s the master node). Writing out words takes too long, so use shorthand when possible.
- Try to keep objects the same size if they are similar or the same; the 10 nodes you draw for a Hadoop cluster should all be the same size.
- Look back at your audience often. They need to feel engaged.
- Talk and write slowly. This is hard for many newbies because whiteboarding can be stressful.
- One question or idea per board. If you only have one board, take a picture with a phone to capture the topic, then move on; avoid putting too many concepts on one board.
- Ask questions. You’re the expert, so you should be able to field almost anything they ask. If you don’t know the answer, DO NOT make it up; just tell them you don’t know, but that you’ll find out and get back to them with the answer.