100DaysOfDataEngineering – Day 16: GCP for data engineers
- introduction: who this course is for
1) database professionals
2) developers
3) machine learning engineers
4) anyone with a general interest
Q: Will this course help me attain the Google Certified Data Engineer certification?
A: Yes. This is the first course in a series of courses created to attain that certification. This course is not a brain dump. The series is designed to give you real world insight into what you’ll be doing as a data engineer working on Google’s cloud.
Q: Why become a data engineer?
A: Here are some great reasons why.
- It’s the single hottest career on the planet. Data Science is a distant number 2.
- You’ll always have a job.
- You’ll never hate your boss because you’ll just leave that job for another one.
- You can be an employee or a consultant.
- At the senior levels the salaries are very high ($200K average at the time this course was created).
- Data is very cool.
Q: Can I pass the test if I take just this course?
A: Nope. Only after you’ve taken the series of courses and worked inside the console will you be prepared for the exam.
Q: Is there an order you’d recommend we take the courses?
A: Yes. This course is the first in the series. It will lay the groundwork and provide you with the basics of Google’s Cloud Platform.
Q: How long will it take me to prepare for the exam?
A: Around 3-6 months.
Q: How much does the exam cost?
A: The exam is slated to cost $200 at final release.
Q: What’s the hardest part of the exam?
A: The breadth of what you are expected to know. You’ll need to know GCP but you’ll also have to know some machine learning and big data. Many database professionals who have taken the exam tell me they were unprepared for the questions around machine learning.
Q: How many courses are dedicated to preparing for this exam?
A: Right now I’ve designed 6 of them.
Q: Will I have to learn some programming?
A: Yes. Some of the Google Cloud components will require you to learn Java and Python. No worries there; I’ll walk you through what you’ll need to know.
Q: How is the exam structured?
A: All the questions are multiple choice and are based on the different services within GCP and a case study.
Q: What’s a case study?
A: A case study is a real-world, end-to-end example of what you would need to do to take a client from on-premises to the cloud.
Q: What would be your recommended approach to preparing for the exam?
A: You have to know what services are on Google Cloud Platform. Then, you have to be able to recommend a service for each scenario. Lastly, you have to be able to implement what you recommend. So:
- First, learn how to navigate GCP. The basics will take you about a month.
- Learn to use the services specific to data engineering. This means you’ll need to walk through and know how to use around 20 services. This takes some time: several months to walk through these services in detail.
- Walk through a case study end to end. This is where you put everything you’ve learned to work.
Q: Are we really expected to know big data, structured data and machine learning?
A: Yes. With the infrastructure and maintenance tasks abstracted away we should be able to do this.
CLOUD RESOURCE HIERARCHY
- organization (company), allows administration and oversight
- folders – additional grouping mechanisms
- delegation of rights
- security access/denial
- projects
- resources (each belongs to exactly one project)
- compute engine
- app engine services
- cloud storage buckets
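- As a quick illustration of the hierarchy, a project can be parented by a folder or directly by the organization when it is created from the command line. A minimal sketch with gcloud (the project, folder, and organization IDs below are hypothetical):
  $ gcloud projects create my-data-eng-project --folder=123456789         # parent the project under a folder...
  $ gcloud projects create my-data-eng-project --organization=987654321   # ...or directly under the organization (pick one)
  $ gcloud projects describe my-data-eng-project                          # the parent shows up in the output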
- Exam Case Study
- document of what you are doing in the real world
- proprietary technology
- predictive analytics (machine learning)
- technical requirements
- handle streaming and batch
- migrate existing hadoop workloads
- scalable
- use managed services whenever possible
- encrypt data in flight and at rest
- connect a vpn between the production data center and cloud environment
- BIG DATA SERVICES
- BigQuery – large-scale, fast data warehousing and analytics
- Cloud Dataflow
- batch and streaming big data processing
- supporting ETL
- batch computation and continuous computation
- Dataproc
- Cloud Datalab
- exploration
- analysis
- visualization
- Google Genomics
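- To get a feel for BigQuery, you can query one of Google’s public datasets straight from the command line with the bq tool. A minimal sketch (the usa_names public dataset is real; the query itself is just an example):
  $ bq query --use_legacy_sql=false \
    'SELECT name, SUM(number) AS total
     FROM `bigquery-public-data.usa_names.usa_1910_2013`
     GROUP BY name ORDER BY total DESC LIMIT 5'   # five most common US first names in the dataset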
Summary
- At the lowest level, resources are the fundamental components that make up all Google Cloud services.
- Google Compute Engine virtual machines (VMs), Google Cloud Pub/Sub topics, Google Cloud Storage (GCS) buckets, and Google App Engine (GAE) instances are all examples of resources.
- All these lower level resources can only be parented by projects, which represent the first grouping mechanism of the Cloud Resource Hierarchy.
- All resources must belong to exactly one project.
- The Organization resource is the root node of the Cloud Resource Hierarchy, and all resources that belong to an organization are grouped under the organization node.
- Folders are an additional grouping mechanism on top of projects.
- The IAM access control policies applied on the Organization resource apply throughout the hierarchy on all resources in the organization.
- With an Organization resource, projects belong to your organization instead of the employee who created the project. This means that the projects are no longer deleted when an employee leaves the company; instead they will follow the organization’s lifecycle on Google Cloud Platform.
- Folder resources provide an additional grouping mechanism and isolation boundaries between projects.
- Recall that the project resource is the base level organizing entity. Unlike an Organization, a project is required to use Google Cloud Platform, and forms the basis for creating, enabling and using all Cloud Platform services, managing APIs, enabling billing, adding and removing collaborators, and managing permissions.
Services
- Compute Engine – Virtual Machines, Disks, and Network [Compute]
- App Engine – Managed Application Platform [Compute]
- Cloud Storage – Object & File Storage and Serving [Storage and Databases]
- CloudSQL – Managed MySQL [Storage and Databases]
- BigTable – HBase Compatible NoSQL [Storage and Databases]
- CloudDatastore – Distributed Hierarchical Key/Value Storage. [Storage and Databases]
- CloudSpanner – Managed globally distributed relational database. [Storage and Databases]
- Persistent Disk – VM-Attachable Disks. [Storage and Databases]
- BigQuery – Serverless Data Warehousing & Analytics. [Big Data]
- Cloud Dataflow – Managed Data Processing. [Big Data]
- Dataproc – Managed Spark & Hadoop. [Big Data]
- Cloud Datalab – Data Visualization and Exploration. [Big Data]
- Google Genomics – Managed Genomics Platform. [Big Data]
03-GCP for data engineers – section 2
- creating an account: use Chrome, go to cloud.google.com, and click “try it free”
1) gmail account
2) free trial
3) credit card
- navigation
- home page, webpage, manage projects and resources
- CARDS
- RESOURCES
- TRACE
- API
- Getting started
- STATUS
- BILLING
- ERROR REPORTING
- icons: gift, gcp, ! ? notifications
- customize: turn on/off cards
- home hamburger
- BILLING at a high level
- create a budget: restrict/be notified
- IAM (identity and access management) -> quotas -> service accounts
- GCP Security
- editor
- Quotas – restrict through services or metrics
- Service Accounts
- Settings : name, id, project number
- Cloud Shell: command line access with a persistent home directory
- open/provision
- $ gcloud auth list   # list the credentialed accounts
- $ gcloud config set project clouddataprocH22   # set the active project
- GCP APIs
- not all services are enabled by default
- the process starts with a project
- ‘enable api’
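- Besides the ‘enable api’ button in the console, APIs can be enabled from Cloud Shell. A minimal sketch using the current gcloud services command group (the service name shown is real):
  $ gcloud services list --available | grep -i compute   # look up the service name
  $ gcloud services enable compute.googleapis.com        # enable it for the active project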
- Installing GCLOUD SDK
- Before we get started we need an account on Google’s Cloud Platform
- The location where we will be spending most of our time is the Google Cloud Platform Console or just console for short.
- The home page is a web-based console and is divided into panels called cards.
- The project is the entity around which almost everything is centered.
- Projects have three core identifiers.
- Name
- ProjectID
- Project Number
- The Resources card gives us the health of our services in one location.
- Cloud shell provides you with a command line interface for your Google resources.
- With cloud shell nothing is installed on your system.
- gcloud and other command line utilities are installed in the shell.
- The shell is actually a micro virtual instance running a Debian Linux operating system.
- Billing provides us insight into the various aspects of how our bills are calculated.
- When you first sign up you receive a $300 credit.
- Each project is associated with a billing ID.
- Billing administrators can be different than project administrators.
- The terms APIs and services can be used interchangeably.
- The path to creating a project is:
- Create Project
- Set location
- Enable APIs
- Not all APIs are enabled by default.
- The API Manager screen allows you to enable APIs you may need for a project that are not enabled by default.
- IAM stands for Identity and Access Management.
- Do keep in mind that access is at the project level.
- We can use the type column to tell us if the object is a person or a service.
- The ADD button on the IAM project screen allows us to give other users access to our projects.
- gcloud is a command line tool for performing platform tasks on GCP.
- It can be installed locally on your laptop.
- All the tools Google offers including gcloud are installed on the shell virtual instance.
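- A few everyday gcloud commands tie these ideas together. A minimal sketch (the project ID and user email are hypothetical):
  $ gcloud init          # authenticate and pick a default project
  $ gcloud config list   # show the active account and project
  $ gcloud config set project my-data-eng-project
  $ gcloud projects add-iam-policy-binding my-data-eng-project \
      --member=user:teammate@example.com --role=roles/viewer   # grant project-level access, like the ADD button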
04-GCP for data engineers – section 3
- Compute Engine is virtual machines, disks and network.
- Compute Engine is an IaaS, or infrastructure as a service.
- Compute Engine VMs spin up and tear down faster than the comparable offerings from AWS and Azure.
- The first thing we need to do when spinning up a compute instance is give it a name (after first creating a project to serve as its container, of course).
- The second thing we need to do is to give it a location.
- The name of the instance must be lower case.
- Billing estimates on the left hand side of the screen provide us with a good estimate of the cost for this instance.
- If we choose the boot disk option when creating our instance, there are many other OS images we can choose from.
- A preemptible VM is an instance that you can create and run at a much lower price than a normal instance. We can also set this on the creation page. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks.
- Once our instance has been spun up we can SSH into it.
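- The same spin-up workflow can be scripted. A minimal sketch with gcloud (the instance name, zone, and machine type are hypothetical):
  $ gcloud compute instances create demo-vm \
      --zone=us-central1-a \
      --machine-type=n1-standard-1 \
      --preemptible                        # optional: much cheaper, but the VM may be preempted
  $ gcloud compute ssh demo-vm --zone=us-central1-a   # SSH in once it is running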
- Google Cloud Launcher offers ready-to-go development stacks, solutions, and services to accelerate development.
- Under compute engine we have a lot of options for working with compute instances.
- An example is an instance template. An instance template is required for creating managed instance groups. A managed instance group contains identical virtual machine instances. To maintain these identical instances, the instance group uses the specified instance template to create VM instances for the group.
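- A sketch of that template-to-group relationship on the command line (the template and group names are hypothetical):
  $ gcloud compute instance-templates create web-template --machine-type=n1-standard-1
  $ gcloud compute instance-groups managed create web-group \
      --template=web-template --size=3 --zone=us-central1-a   # three identical VMs stamped from the template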
- Google Container Engine is a powerful cluster manager and orchestration system for running your Docker containers. Container Engine schedules your containers into the cluster and manages them automatically based on requirements you define (such as CPU and memory). It’s built on the open-source Kubernetes system, giving you the flexibility to take advantage of on-premises, hybrid, or public cloud infrastructure.
- Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.
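- Creating a container cluster is a one-liner. A minimal sketch (the cluster name and zone are hypothetical):
  $ gcloud container clusters create demo-cluster --num-nodes=3 --zone=us-central1-a
  $ kubectl get nodes   # gcloud fetches the credentials, so kubectl can talk to the new cluster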
- Google Interactive tutorials teach you the basics of commonly-used services.
05-GCP for data engineers – section 4
- All data in Google Cloud Storage belongs inside a project. A project consists of a set of users, a set of APIs, and billing, authentication, and monitoring settings for those APIs. You can have one project or multiple projects.
- A resource is an entity within Google Cloud Platform.
- Buckets are the basic containers that hold your data. Everything that you store in Google Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data, but unlike directories and folders, you cannot nest buckets.
- Bucket names have more restrictions than object names, because every bucket resides in a single Google Cloud Storage namespace. Also, bucket names can be used with a CNAME redirect, which means they need to conform to DNS naming conventions.
- Bucket labels are key:value metadata pairs that allow you to group your buckets along with other Google Cloud Platform resources.
- In GCP Storage, you create a bucket to store your data. A bucket has three properties that you specify when you create it: a globally unique name, a location where the bucket and its contents are stored, and a default storage class for objects added to the bucket.
- Cloud Storage offers four storage classes: Multi-Regional Storage, Regional Storage, Nearline Storage, and Coldline Storage. All storage classes offer the same throughput, low latency (time to first byte typically tens of milliseconds), and high durability.
- Multi-Regional Storage is geo-redundant, which means Cloud Storage stores your data redundantly in at least two regions separated by at least 100 miles within the multi-regional location of the bucket.
- Every bucket name must be unique across the entire Google Cloud Storage namespace.
- gsutil is a Python application that lets you access Cloud Storage from the command line.
- To use gsutil locally you must download and install the Google Cloud SDK.
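- A minimal gsutil sketch covering the three bucket properties above (the bucket name is hypothetical and must be globally unique):
  $ gsutil mb -l us-central1 -c regional gs://my-unique-bucket-001/   # name, location, default storage class
  $ gsutil cp sales.csv gs://my-unique-bucket-001/                    # upload an object into the bucket
  $ gsutil ls gs://my-unique-bucket-001/                              # list the bucket’s contents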
06-GCP for data engineers – section 5
- Cloud SQL is a fully managed MySQL database service on GCP.
- Fully managed means you don’t install anything; backups, replication, and patching are automated, and you’re guaranteed 99.95% availability.
- PostgreSQL is in beta at this time
- Data is replicated across multiple data centers.
- There are a few restrictions; only MySQL version 5.5 or higher is supported.
- Instance size is limited to 500 GB.
- The Cloud SQL service is enabled by default.
- Under storage, you choose SQL to navigate to the instance page.
- An instance is an installation of MySQL.
- You need a unique instance name that’s all lowercase and cannot be changed later.
- You need to specify a root password, region and zone.
- If you’re spinning up a production instance choose SSDs.
- You can enable backups during instance creation or after the instance has been spun up.
- I’d also recommend you enable storage autoscaling (automatic storage increases). This is one of the greatest features on GCP.
- You can also easily create a read-only replica on GCP. This is great for reporting.
- You can also add high availability and I’d highly recommend this for all real world production boxes.
- In order to back up the transaction logs you need to check Enable binary logging under Backups and binary logging.
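- Instances can also be created from the command line. A minimal sketch (the instance name, tier, and password are hypothetical; flag spellings vary slightly across gcloud versions):
  $ gcloud sql instances create my-mysql-01 \
      --tier=db-n1-standard-1 --region=us-central1
  $ gcloud sql users set-password root --host=% \
      --instance=my-mysql-01 --password=change-me   # set the root password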
07-GCP for data engineers – section 6
- This course is the second in a series and was created for learning the data engineering role on Google’s Cloud Platform often referred to as GCP.
- In this course we are going to learn GCP’s managed service for Hadoop and big data related processing.
- The service is called Cloud Dataproc.
- The course will help you learn the skills you need to pass the Google Certified Data Engineer exam.
- When you “spin up” or create a cluster in Cloud Dataproc you do so within the confines of a project.
- As data engineers you’ll need to know how to migrate existing on-premises Hadoop clusters to GCP.
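- Spinning up a Dataproc cluster inside a project is similarly short. A minimal sketch (the cluster name, region, and worker count are hypothetical):
  $ gcloud dataproc clusters create demo-cluster \
      --region=us-central1 --num-workers=2       # one master and two workers
  $ gcloud dataproc clusters delete demo-cluster --region=us-central1   # tear it down when done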
- There are two kinds of data: structured and unstructured.
- Approximately 90% of all data in the enterprise is unstructured.
- Approximately 70% of all companies use the cloud in some form or another right now. [2016]
- A very small portion of the vast amount of data collected is analyzed.
- Approximately 90% of all the accumulated data was created in the last two years.
- Unstructured data doesn’t fit neatly into a tabular data set. [Like an Excel spreadsheet]
- An example of internal unstructured data is email.
- A social media post about your organization is an example of external unstructured data.