Topics in the Google Cloud Professional Data Engineer certification exam
by Sabeehah Ahmed, Professional Services Delivery Engineer, Rackspace Technology
For a few years now, I've had an interest in the concepts of machine learning (ML) and wanted to know more. When I came across the Google® Cloud Professional Data Engineer certification exam, I was intrigued by how ML concepts intertwine with cloud concepts, especially in the Google Cloud Platform (GCP).
As a Biomedical Engineering student at university, I learned that the concepts of ML strongly relate to how the human brain works. It's interesting to see how computing can borrow from the neural connections of the brain. With no background in cloud other than a year of experience in the cloud computing field, I achieved three Google Cloud certifications, an accomplishment that makes me proud.
In this post, I'm sharing the primary resources that I used to study for this exam and the topics covered in the exam.
Study Sources
Here are the places I went to prepare for taking the exam.
Data Engineering on GCP course at Fast Lane
While studying for this certification, I first attended a four-day Google course, which I highly recommend if you are interested in Big Data and ML within Google. In this course, I learned about the tools that GCP provides when ingesting, preparing, and analyzing Big Data. Some of the key topics were:
- Challenges faced with Data Engineering
- Deep dive into:
- - BigQuery
- - Bigtable
- - Dataflow
- - Dataproc
- - ML services including Kubeflow
- Demos and labs for using each service
Although this course does not relate directly to the Professional Data Engineer certification exam, I found it beneficial and learned a lot. You can find the course at Fast Lane Data Engineering on the Google Cloud Platform.
There are also other courses outside of the data engineering track at Fast Lane Google Cloud Training.
 
Linux Academy
I used Linux Academy as my primary source of study material. I went through the whole data engineering course and found it very useful. They updated their course quite recently, so it is very precise and provides many helpful tips and content that comes up in the exam. Because the practice tests are similar to the certification exam, they prepared me well. As someone with little experience in the data world, I feel that this course is perfect in terms of content and explanation.
I took the Google Cloud Certified Professional Data Engineer course.
Coursera and Qwiklabs
Coursera has another useful course, which is slightly longer than the Linux Academy course. The Coursera course has example scenarios that helped me consider which tools are best for specific customer issues. Also, most sections have Qwiklabs that are quite useful. They let me use the GCP console to put theory into hands-on practice, which improved my understanding.
I focused on the following courses:
Data Engineering, Big Data, and Machine Learning on GCP
Preparing for the Google Cloud Professional Data Engineer Exam
Topics in the certification exam
This section provides a list of the main topics covered in the exam, some sub-topics, and occasionally my thoughts about the material.
BigQuery (A major focus in the exam)
- Integration with Google Identity and Access Management (IAM) roles
- Basic understanding of the GCP Key Management Service (KMS) and keys (Google-managed, customer-supplied, and customer-managed)
- Partitioned tables, specifically as used in SQL commands
- Wildcards
- Federated tables
- Integration with Google Cloud Storage (GCS)
- BigQuery (BQ) Data Transfer Service and connectors
- When to use normalized and denormalized data
- Loading different data formats into BQ, including a good understanding of the Apache Avro™, CSV, Apache Parquet, and JSON formats
- Pricing with slots
- Cached queries
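To make the partitioned-table and wildcard items above more concrete, here is a minimal sketch of the kind of standard SQL they refer to, wrapped in Python only so the strings are easy to reuse. The project, dataset, and table names are hypothetical, and the helper functions are my own illustration. `_PARTITIONTIME` and `_TABLE_SUFFIX` are the BigQuery pseudo-columns for ingestion-time partitioned tables and wildcard tables.

```python
# Hypothetical names throughout: my-project, sales, events, events_YYYYMMDD.
# Plain string formatting keeps the sketch short; real code should prefer
# query parameters over interpolating values into SQL.


def partition_query(table: str, start_date: str, end_date: str) -> str:
    """Count rows in an ingestion-time partitioned table for a date range.

    Filtering on _PARTITIONTIME lets BigQuery scan only matching partitions.
    """
    return (
        f"SELECT COUNT(*) AS row_count\n"
        f"FROM `{table}`\n"
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start_date}')\n"
        f"  AND _PARTITIONTIME < TIMESTAMP('{end_date}')"
    )


def wildcard_query(table_prefix: str, start_suffix: str, end_suffix: str) -> str:
    """Count rows across date-sharded tables that share a common prefix.

    _TABLE_SUFFIX holds whatever the trailing * matched in each table name.
    """
    return (
        f"SELECT COUNT(*) AS row_count\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )


if __name__ == "__main__":
    print(partition_query("my-project.sales.events", "2020-01-01", "2020-02-01"))
    print()
    print(wildcard_query("my-project.sales.events_", "20200101", "20200131"))
```

In both cases the filter is what matters for the exam: it limits how much data the query scans, which ties directly into pricing.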
Dataflow
- Integration with IAM roles, especially the developer role
- Differences between global, fixed, session, and sliding windows and when to use each type
- Best practices for handling pipeline errors, especially with try-catch blocks
- Different types of transform methods, for example, Apache Beam ParDo
- Watermarks
- Apache Beam
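The window types listed above are easier to tell apart with a small example. The following is a plain-Python sketch of the semantics only, not the Apache Beam API: the function names are hypothetical, timestamps are integer seconds, and every window is a half-open interval [start, end). A global window would simply put every element into one single window.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_window(ts: int, size: int) -> Window:
    """Return the single non-overlapping window of length `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Return every window of length `size`, started every `period` seconds,
    that contains ts. Windows overlap whenever period < size."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:  # the window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group events into sessions that close after `gap` seconds of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # The event arrives before the current session times out: extend it.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            # The gap was exceeded (or this is the first event): new session.
            sessions.append((ts, ts + gap))
    return sessions


if __name__ == "__main__":
    print(fixed_window(65, size=60))
    print(sliding_windows(65, size=60, period=30))
    print(session_windows([1, 3, 20, 22, 50], gap=10))
```

Roughly, fixed windows suit regular reports (per-minute totals), sliding windows suit moving averages, and session windows suit per-user activity bursts.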
Bigtable
- Schema design, such as when to use tall and narrow tables or short and wide ones
- Schema that might cause slow performance and how to optimize performance
- When to use hard disk drive (HDD)
- How to switch between HDD and solid-state drive (SSD)
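As an illustration of the schema-design and performance items above, here is a minimal sketch of row-key construction for a tall and narrow table (one row per reading). It assumes a hypothetical time-series workload with device IDs and millisecond timestamps; the function names and the `MAX_TS_MILLIS` bound are my own. The underlying facts are that Bigtable stores rows sorted lexicographically by row key, and that keys starting with a timestamp send all new writes to the same part of the key space.

```python
MAX_TS_MILLIS = 10**13  # hypothetical upper bound, used only to reverse timestamps


def hotspot_prone_key(device_id: str, ts_millis: int) -> str:
    """Timestamp-first key: all new writes sort next to each other,
    which tends to overload a single node."""
    return f"{ts_millis:013d}#{device_id}"


def row_key(device_id: str, ts_millis: int) -> str:
    """Device-first key with a reversed timestamp.

    Writes are spread across devices, and the newest reading for a
    device sorts first within that device's range of rows.
    """
    reversed_ts = MAX_TS_MILLIS - ts_millis
    return f"{device_id}#{reversed_ts:013d}"


if __name__ == "__main__":
    readings = [("sensor-a", 1_600_000_000_000), ("sensor-b", 1_600_000_000_001),
                ("sensor-a", 1_600_000_000_002)]
    print(sorted(hotspot_prone_key(d, t) for d, t in readings))
    print(sorted(row_key(d, t) for d, t in readings))
```

With the second key design, a prefix scan on `sensor-a#` returns that device's readings newest first, which is a common access pattern for time-series data.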
Pub/Sub
- Process of moving from an Apache Kafka workflow to Pub/Sub
- IAM controls on different levels, such as the fact that the publisher level has no IAM controls
- How message flow works, such as why delays in sending messages might occur
Cloud Spanner
- Not much in the exam, just basic concepts
- Primary and secondary indexes
Dataproc
- Good understanding of the Apache Hadoop ecosystem
- IAM integration
- Benefits of preemptible nodes
- Best practices for migrating Hadoop clusters to Dataproc, such as always separating storage from compute by keeping data in GCS
- Best practices for optimizing performance
- Connectors
- Apache Spark
Dataprep
- Not much in the exam but you should know the basic concepts
Machine Learning
- Differences between training and test data
- Overfitting and underfitting, such as why they can happen, and how to prevent them
- Good understanding of ML types, including supervised learning, unsupervised learning, and reinforcement learning, although I saw no questions on reinforcement
- Not much on TensorFlow, but you should know the basic concepts
- Good understanding of how neural networks (NN) work. There were questions on wide NN, deep NN, and both wide and deep NN
- Regularization parameters, such as L1 and L2, including a couple of scenario-based questions of when to use each type
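For the L1 versus L2 scenario questions mentioned above, it helps to see what each penalty does to a set of weights. The sketch below is a simplified one-dimensional illustration, not how any particular library implements regularization: for each unregularized weight v, it returns the value of w that minimizes 0.5*(w - v)^2 plus the penalty. The function names and sample weights are hypothetical.

```python
from typing import List


def l1_shrink(weights: List[float], lam: float) -> List[float]:
    """Minimize 0.5*(w - v)**2 + lam*abs(w) for each v (soft-thresholding).

    Weights with magnitude at or below lam become exactly zero.
    """
    shrunk = []
    for v in weights:
        if abs(v) <= lam:
            shrunk.append(0.0)
        else:
            shrunk.append(v - lam if v > 0 else v + lam)
    return shrunk


def l2_shrink(weights: List[float], lam: float) -> List[float]:
    """Minimize 0.5*(w - v)**2 + 0.5*lam*w**2 for each v.

    Every weight is scaled toward zero, but none becomes exactly zero.
    """
    return [v / (1.0 + lam) for v in weights]


if __name__ == "__main__":
    weights = [0.125, -1.0, 2.0]
    print("L1:", l1_shrink(weights, lam=0.25))
    print("L2:", l2_shrink(weights, lam=0.25))
```

The rule of thumb this illustrates: L1 tends to zero out unimportant features and so produces sparse models, while L2 keeps all features but with smaller weights.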
GCP ML services
- Good understanding of each service, especially Natural Language API, such as sentiment and entity analysis
- A couple of questions on when it is beneficial for a customer to use ML services
- AI Platform, including how it works and online versus batch predictions
Datalab
- Basic concepts
- A question came up about how you can share notebooks
Data Studio
- Basic concepts
- Caching with BQ, including query cache and prefetch cache
- A question came up about metrics and dimensions, so you should know the difference
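One way to remember the difference between metrics and dimensions: a dimension is a category you group by, and a metric is a number you aggregate within each group. The sketch below shows that idea in plain Python with made-up data; it is not how Data Studio works internally, and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical report rows: "country" acts as a dimension, "sessions" as a metric.
ROWS = [
    {"country": "DE", "sessions": 120},
    {"country": "US", "sessions": 50},
    {"country": "DE", "sessions": 80},
]


def aggregate(rows: List[dict], dimension: str, metric: str) -> Dict[str, int]:
    """Sum the metric for each distinct value of the dimension."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    return dict(totals)


if __name__ == "__main__":
    print(aggregate(ROWS, dimension="country", metric="sessions"))
```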
Cloud Composer
- You should know directed acyclic graph (DAG) files in detail, including the components
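Cloud Composer is built on Apache Airflow, where a DAG file declares tasks and the dependencies between them. The sketch below is not the Airflow API. It uses only the Python standard library (`graphlib`, Python 3.9 or later) to show what "directed" and "acyclic" mean for a pipeline: tasks run only after the tasks they depend on, and a circular dependency is rejected. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "extract": set(),
    "transform": {"extract"},
    "load_bigquery": {"transform"},
    "notify": {"load_bigquery"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies: Dict[str, Set[str]]) -> bool:
    """Return False if the dependencies contain a cycle and so are not a DAG."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```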
Extra notes
- The exam had no case studies
- The exam didn't have much on Cloud SQL
- You should know the data pipelines very well
- You should know the key differences between the data services in GCP
- The Google Practice exam is also quite useful, so consider taking it
Conclusion
I hope the post on my journey to certification helps you on your journey. Good luck to all those who plan to take the exam!
