+
+
SERVERLESS DATA ANALYSIS
+
+
GETTING STARTED WITH GCP CONSOLE
+
+
When the lab is ready, a green button will appear that looks like this:
+
+
[Screenshot: Start Lab button]
+
+
When you are ready to begin, click Start Lab.
+
+
+
+
Step 1: Locate the Username, Password, and Project ID
+
+
Press the green [Start Lab] button to start the lab. After setup is complete, you will see something similar to this on the right side of the Qwiklabs window:
+
+
[Screenshot: Qwiklabs connection details]
+
+
Step 2: Browse to Console
+
+
Open an Incognito window in your browser and go to http://console.cloud.google.com
+
+
Step 3: Sign in to Console
+
+
Log in with the Username and Password provided. The steps below are a general guide; the actual dialogs and procedures may vary from this example.
+
+
[Screenshot: sign-in dialog]
+
+
Step 4: Accept the conditions
+
+
Accept the new account terms and conditions.
+
+
[Screenshot: terms and conditions dialog]
+
+
This is a temporary account. You will only have access to the account for this one lab.
+
- Do not add recovery options
- Do not sign up for free trials
+
+
Step 5: Don't change the password
+
+
If prompted, don't change the password. Just click [Continue].
+
+
[Screenshot: password prompt]
+
+
Step 6: Agree to the Terms of Service
+
+
Select Yes for both questions and click [AGREE AND CONTINUE].
+
+
[Screenshot: Terms of Service dialog]
+
+
Step 7: Console opens
+
+
The Google Cloud Platform Console opens.
+
+
You may see a bar occupying the top part of the Console inviting you to sign up for a free trial. You can click on the [DISMISS] button so that the entire Console screen is available.
+
+
[Screenshot: GCP Console with free-trial banner]
+
+
Step 8: Switch project (if necessary)
+
+
On the blue bar at the top, click the drop-down icon to select the correct project (if it is not already selected). You can confirm the Project ID from your Qwiklabs window (shown in Step 1 above).
+
+
[Screenshot: project selector]
+
+
Click on "view more projects" if necessary and select the correct project id.
+
+
PART 1: BUILD A BIGQUERY QUERY
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you learn how to build up a complex BigQuery query using clauses, subqueries, built-in functions and joins.
+
+
What you learn
+
+
In this lab, you:
+
- Create and run a query
- Modify the query to add clauses, subqueries, built-in functions and joins.
+
+
Introduction
+
+
Duration is 1 min
+
+
The goal of this lab is to build up a complex BigQuery query using clauses, subqueries, built-in functions and joins, and to run the query.
+
+
Before you begin
+
+
Duration is 1 min
+
+
If you have not started the lab, go ahead and click the green "Start Lab" button. Once setup is done, Qwiklabs displays the credentials for this lab. Repeat the steps in Lab 0 to log into the Cloud Console with the credentials provided for this lab.
+
+
Here is a quick reference:
+
+
Open new incognito window → go to cloud console → login with provided credentials → follow the prompts → switch project if necessary
+
+
Create and run a query
+
+
Duration is 3 min
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Click the Compose Query button at the top left, then click Show Options and ensure you are using Standard SQL. You are using Standard SQL if the "Use Legacy SQL" checkbox is unchecked.
+
+
[Screenshot: BigQuery query options]
+
+
Step 2
+
+
Click Compose Query.
+
+
Step 3
+
+
In the New Query window, type (or copy-and-paste) the following query:
+
SELECT
  airline,
  date,
  departure_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
LIMIT
  100
+
+
What does this query do? ______________________
+
+
Step 4
+
+
Click Run Query.
+
+
Aggregate and Boolean functions
+
+
Duration is 5 min
+
+
Step 1
+
+
To the previous query, add an additional clause to filter by date and group the results by airline. Because you are grouping the results, the SELECT statement will have to use an aggregate function. In the New Query window, type the following query:
+
SELECT
  airline,
  COUNT(departure_delay)
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline
+
+
Step 2
+
+
Click Run Query. What does this query do? ______________________________________________________
+
+
What is the number you get for American Airlines (AA)?
+
+
+
+
Step 3
+
+
Now change the query slightly:
+
SELECT
  airline,
  COUNT(departure_delay)
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline
+
+
Step 4
+
+
Click Run Query. What does this query do? ______________________________________________________
+
+
What is the number you get for American Airlines (AA)?
+
+
+
+
Step 5
+
+
The first query returns the total number of flights by each airline from La Guardia, and the second query returns the total number of flights that departed late. (Do you see why?)
+
+
How would you get both the number delayed as well as the total number of flights?
+
+
+
+
+
+
Step 6
+
+
Run this query:
+
SELECT
  f.airline,
  COUNT(f.departure_delay) AS total_flights,
  SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
WHERE
  f.departure_airport = 'LGA' AND f.date = '2008-05-13'
GROUP BY
  f.airline
+
+
String operations
+
+
Duration is 3 min
+
+
Step 1
+
+
In the New Query window, type the following query:
+
SELECT
  CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
  `bigquery-samples.weather_geo.gsod`
WHERE
  station_number = 725030
  AND total_precipitation > 0
+
+
Step 2
+
+
Click Run Query.
+
+
Step 3
+
+
How would you do the airline query to aggregate over all these dates instead of just '2008-05-13'?
+
+
+
+
You could use a JOIN, as shown next.
+
+
Join on Date
+
+
Duration is 3 min
+
+
Step 1
+
+
In the New Query window, type the following query:
+
SELECT
  f.airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
+
+
Step 2
+
+
Click Run Query. How would you get the fraction of flights delayed for each airline?
+
+
You could put the entire query above into a subquery and then select from the columns of that result.
+
+
Subquery
+
+
Duration is 3 min
+
+
Step 1
+
+
In the New Query window, type the following query:
+
SELECT
  airline,
  num_delayed,
  total_flights,
  num_delayed / total_flights AS frac_delayed
FROM (
  SELECT
    f.airline AS airline,
    SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
    COUNT(f.arrival_delay) AS total_flights
  FROM
    `bigquery-samples.airline_ontime_data.flights` AS f
  JOIN (
    SELECT
      CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
    FROM
      `bigquery-samples.weather_geo.gsod`
    WHERE
      station_number = 725030
      AND total_precipitation > 0) AS w
  ON
    w.rainyday = f.date
  WHERE f.arrival_airport = 'LGA'
  GROUP BY f.airline )
ORDER BY
  frac_delayed ASC
+
+
Step 2
+
+
Click Run Query
+
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
PART 2: LOADING AND EXPORTING DATA
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you load data in different formats into BigQuery tables.
+
+
What you learn
+
+
In this lab, you:
+
- Load a CSV file into a BigQuery table using the web UI
- Load a JSON file into a BigQuery table using the CLI
+
+
Introduction
+
+
Duration is 1 min
+
+
In this lab, you load data into BigQuery in multiple ways. You also transform the data you load, and you query the data.
+
+
Upload data using the web UI
+
+
Duration is 14 min
+
+
Task: In this section of the lab, you upload a CSV file to BigQuery using the BigQuery web UI.
+
+
BigQuery supports the following data formats when loading data into tables: CSV, JSON, AVRO, or Cloud Datastore backups. This example focuses on loading a CSV file into BigQuery.
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI.
+
+
Step 2
+
+
Click the blue arrow to the right of your project name and choose Create new dataset.
+
+
Step 3
+
+
In the ‘Create Dataset' dialog, for Dataset ID, type cpb101_flight_data and then click OK.
+
+
Step 4
+
+
Download the following file to your local machine. This file contains the data that will populate the first table.
+
+
Download airports.csv
+
+
Step 5
+
+
Create a new table in the cpb101_flight_data dataset to store the data from the CSV file. Click the create table icon (the plus sign) to the right of the dataset.
+
+
Step 6
+
+
On the Create Table page, in the Source Data section:
+
- For Location, leave File upload selected.
- To the right of File upload, click Choose file, then browse to and select airports.csv.
- Verify File format is set to CSV.
+
+
Note: When you have created a table previously, the Create from Previous Job option allows you to quickly use your settings to create similar tables.
+
+
Step 7
+
+
In the Destination Table section:
+
- For Table name, leave cpb101_flight_data selected.
- For Destination table name, type AIRPORTS.
- For Table type, Native table should be selected and unchangeable.
+
+
Step 8
+
+
In the Schema section:
+
+
Step 9
+
+
In the Options section:
+
- For Field delimiter, verify Comma is selected.
- Since airports.csv contains a single header row, for Header rows to skip, type 1.
- Accept the remaining default values and click Create Table. BigQuery creates a load job to create the table and upload data into the table (this may take a few seconds). You can track job progress by clicking Job History.
+
+
Step 10
+
+
Once the load job is complete, click cpb101_flight_data > AIRPORTS.
+
+
Step 11
+
+
On the Table Details page, click Details to view the table properties and then click Preview to view the table data.
+
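For reference, the same load can also be done from Cloud Shell with the bq CLI. The command below is an illustrative sketch that mirrors the UI choices above (CSV source, one header row skipped); it assumes automatic schema detection, so adjust it if your lab specifies an explicit schema:

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --autodetect \
  $DEVSHELL_PROJECT_ID:cpb101_flight_data.AIRPORTS \
  ./airports.csv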
+
Upload data using the CLI
+
+
Duration is 7 min
+
+
Task: In this section of the lab, you upload multiple JSON files and an associated schema file to BigQuery using the CLI.
+
+
Step 1
+
+
Navigate to the Google Cloud Platform Console and to the right of your project name, click Activate Google Cloud Shell.
+
+
Step 2
+
+
Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory.
+
curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json
+
+
Step 3
+
+
The JSON files containing the data for your table are stored in a Google Cloud Storage bucket. They have URIs like the following:
+
+
+
+
Type the following command to create a table named flights_2014 in the cpb101_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.
+
+
Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID), so there's no need for you to remember it. If you require it, you can view your Project ID in the command line to the right of your username (after the @ symbol).
+
bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb101_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json
+
+
If you are prompted to select a project to be set as default, choose the Project ID that was set up when you started this Qwiklab (look in the "Connect" tab of your Qwiklabs window; the Project ID typically looks something like "qwiklabs-gcp-123xyz").
+
+
Note
+
There are multiple JSON files in the bucket named according to the convention: domestic_2014_flights_*.json. The wildcard (*) character is used to include all of the .json files in the bucket.
+
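If you want to see which files the wildcard matches before loading (assuming the training bucket permits listing), you can run:

gsutil ls gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json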
+
+
Step 4
+
+
Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb101_flight_data.
+
bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data
+
+
The output should look like the following:
+
+
+
+
+
+
+
+
+
+
Export table
+
+
Duration is 6 min
+
+
Task: In this section of the lab, you export a BigQuery table using the web UI and the bq CLI.
+
+
Step 1
+
+
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
+
+
Step 2
+
+
Go back to the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI.
+
+
Step 3
+
+
Select the AIRPORTS table that you created recently, and using the "down" button to its right, select the option for Export Table.
+
+
Step 4
+
+
In the dialog, specify a destination path in your bucket (for example, gs://<your-bucket-name>/bq/airports.csv) and click OK.
+
+
Step 5
+
+
Use the CLI to export the table:
+
bq extract cpb101_flight_data.AIRPORTS gs://<your-bucket-name>/bq/airports2.csv
+
+
Remember to replace <your-bucket-name> with the bucket you created earlier.
+
+
Step 6
+
+
Browse to your bucket and ensure that both .csv files have been created.
+
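If you prefer the command line, you can list the exported files instead of browsing (a quick illustrative check):

gsutil ls gs://<your-bucket-name>/bq/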
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
PART 3: ADVANCED SQL QUERIES
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?
+
+
What you learn
+
+
In this lab, you write a query that uses advanced SQL concepts:
+
- Nested fields
- Regular expressions
- WITH statement
- GROUP BY and HAVING
+
+
Introduction
+
+
Duration is 1 min
+
+
In this lab, you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?
+
+
To answer this question, we will use a BigQuery public dataset that has information on all GitHub commits.
+
+
Nested fields
+
Duration is 5 min
+
+
In this section, you will learn how to work with nested fields.
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI.
+
+
Step 2
+
+
Compose a new query, making sure that the "Legacy SQL" option is not checked (you are using Standard SQL).
+
SELECT
  author.email,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT 10
+
+
Step 3
+
+
Play a little with the query above to understand what it is doing. For example, instead of author.email, try just author. What type of field is author?
+
+
Step 4
+
+
Change to . Why does it not work? Replace by . Does this work? Why? What is the doing?
+
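As a hint for the questions above: difference is a repeated (ARRAY) field of STRUCTs, so its fields cannot be read directly; you either flatten it with UNNEST, as in the query above, or index a single element with OFFSET. A minimal illustrative sketch (field names taken from the query above; SAFE_OFFSET avoids errors on empty arrays):

SELECT
  author.email,
  difference[SAFE_OFFSET(0)].new_path AS first_path
FROM
  `bigquery-public-data.github_repos.commits`
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT 10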
+
Regular expressions
+
Duration is 5 min
+
+
In this section, you will learn how to use regular expressions. Let's assume that the filename extension is the programming language, i.e., a file that ends in .py has the language "py". How will you pull out the extension from the path?
+
+
Step 1
+
+
Type the following query:
+
SELECT
  author.email,
  LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT
  10
+
+
Step 2
+
+
Modify the query above to use lang only if the language consists purely of letters and has a length of fewer than 8 characters.
+
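One way to express that filter is sketched below; note that the potential solution in Step 3 uses a looser check, REGEXP_CONTAINS(lang, '[a-zA-Z]'), which only requires at least one letter:

WITH
  commits AS (
  SELECT
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR FROM author.date) = 2016 )
SELECT
  lang,
  path
FROM
  commits
WHERE
  lang IS NOT NULL
  AND LENGTH(lang) < 8
  AND REGEXP_CONTAINS(lang, r'^[a-zA-Z]+$')
LIMIT 10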
+
Step 3
+
+
Modify the query above to group by language and list in descending order of the number of commits. Here's a potential solution:
+
WITH
  commits AS (
  SELECT
    author.email,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR FROM author.date) = 2016 )
SELECT
  lang,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC
+
+
Weekend or weekday?
+
+
Duration is 5 min
+
+
Now, group the commits based on whether or not they happened on a weekend. How would you do it?
+
+
Step 1
+
+
Modify the query above to extract the day of the week from author.date. Days 2 to 6 are weekdays.
+
+
Step 2
+
+
Here's a potential solution:
+
WITH
  commits AS (
  SELECT
    author.email,
    EXTRACT(DAYOFWEEK FROM author.date) BETWEEN 2 AND 6 AS is_weekday,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR FROM author.date) = 2016 )
SELECT
  lang,
  is_weekday,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang,
  is_weekday
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC
+
+
Ignoring file extensions that do not correspond to programming languages, it appears that the most popular weekend programming languages are JavaScript, PHP and C.
+
+
Acknowledgment: This section of lab (and query) is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804
+
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
PART 4: A SIMPLE DATAFLOW PIPELINE
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.
+
+
What you learn
+
+
In this lab, you learn how to:
+
- Set up a Dataflow project
- Write a simple pipeline in Python
- Execute the pipeline on the local machine
- Execute the pipeline on the cloud
+
+
Introduction
+
+
Duration is 1 min
+
+
The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.
+
+
Open Dataflow project
+
+
Duration is 3 min
+
+
Step 1
+
+
Start CloudShell and navigate to the directory for this lab:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/python
+
+
If this directory doesn't exist, you may need to git clone the repository first:
+
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/data_analysis/lab2/python
+
+
Step 2
+
+
Install the necessary dependencies for Python dataflow:
+
sudo ./install_packages.sh
+
+
Verify that you have the right version of pip (should be > 8.0):
+
pip -V
+
+
If not, open a new CloudShell tab and it should pick up the updated pip.
+
+
Pipeline filtering
+
+
Duration is 5 min
+
+
Step 1
+
+
View the source code for the pipeline using nano:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grep.py
+
+
Step 2
+
+
What files are being read? _____________________________________________________
+
+
What is the search term? ______________________________________________________
+
+
Where does the output go? ___________________________________________________
+
+
There are three transforms in the pipeline:
+
- What does the first transform do? _________________________________
- What does the second transform do? ______________________________
  - Where does its input come from? ________________________
  - What does it do with this input? __________________________
  - What does it write to its output? __________________________
  - Where does the output go to? ____________________________
- What does the third transform do? _____________________
+
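Before looking at grep.py, it may help to know the general shape of a grep-style pipeline in the Beam Python SDK. The sketch below is illustrative only; the input pattern, search term, and output prefix are placeholders, not the actual values in grep.py:

import apache_beam as beam

# Placeholder values -- grep.py defines its own input, term, and output.
INPUT = '../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java'
SEARCH_TERM = 'import'
OUTPUT_PREFIX = '/tmp/output'

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'GetFiles' >> beam.io.ReadFromText(INPUT)    # transform 1: read the input files line by line
     | 'Grep' >> beam.FlatMap(
         lambda line: [line] if SEARCH_TERM in line else [])   # transform 2: keep only matching lines
     | 'Write' >> beam.io.WriteToText(OUTPUT_PREFIX))          # transform 3: write /tmp/output-* shards

Compare this shape against the three transforms you identified above.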
+
Execute the pipeline locally
+
+
Duration is 2 min
+
+
Step 1
+
+
Execute locally:
+
python grep.py
+
+
Note: you may see an error message related to oauth2 client logging; you can ignore it. The error simply says that logging from the oauth2 library will go to stderr.
+
+
Step 2
+
+
Examine the output file:
+
cat /tmp/output-*
+
+
Does the output seem logical? ______________________
+
+
Execute the pipeline on the cloud
+
+
Duration is 10 min
+
+
Step 1
+
+
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
+
+
Step 2
+
+
Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
+
gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp
+
+
Step 3
+
+
Edit the Dataflow pipeline by opening grepc.py in nano:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grepc.py
+
+
and changing the PROJECT and BUCKET variables appropriately.
+
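For context, grepc.py differs from the local version mainly in its pipeline options. The snippet below is a generic sketch of the kind of options a Dataflow run needs; the names and values are illustrative and may not match grepc.py exactly:

import apache_beam as beam

PROJECT = 'your-project-id'    # placeholder: set to your Project ID
BUCKET = 'your-bucket-name'    # placeholder: set to your bucket

argv = [
    '--project={0}'.format(PROJECT),
    '--job_name=examplegrep',
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/temp/'.format(BUCKET),
    '--runner=DataflowRunner',   # run on Cloud Dataflow instead of the local DirectRunner
]

p = beam.Pipeline(argv=argv)
# ... same transforms as the local pipeline ...
p.run()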
+
Step 4
+
+
Submit the Dataflow job to the cloud:
+
python grepc.py
+
+
Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).
+
+
Step 5
+
+
On your Cloud Console, navigate to the Dataflow section (from the 3 bars on the top-left menu), and look at the Jobs. Select your job and monitor its progress. You will see something like this:
+
+
[Screenshot: Dataflow job monitoring page]
+
+
Step 6
+
+
Wait for the job status to turn to Succeeded. At this point, your CloudShell will display a command-line prompt. In CloudShell, examine the output:
+
gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*
+
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
PART 5: MAPREDUCE IN DATAFLOW
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.
+
+
What you learn
+
+
In this lab, you learn how to:
+
- Use pipeline options in Dataflow
- Carry out mapping transformations
- Carry out reduce aggregations
+
+
Introduction
+
+
Duration is 1 min
+
+
The goal of this lab is to learn how to write MapReduce operations using Dataflow.
+
+
Identify Map and Reduce operations
+
+
Duration is 5 min
+
+
Step 1
+
+
Start CloudShell and navigate to the directory for this lab:
+
cd ~/training-data-analyst/courses/data_analysis/lab2
+
+
If this directory doesn't exist, you may need to git clone the repository:
+
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
+
+
Step 2
+
+
View the source code for the pipeline using nano:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano is_popular.py
+
+
Step 3
+
+
What custom arguments are defined? ____________________
+
+
What is the default output prefix? _________________________________________
+
+
How is the variable output_prefix in main() set? _____________________________
+
+
How are the pipeline arguments such as --runner set? ______________________
+
+
Step 4
+
+
What are the key steps in the pipeline? _____________________________________________________________________________
+
+
Which of these steps happen in parallel? ____________________________________
+
+
Which of these steps are aggregations? _____________________________________
+
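As a reminder of the pattern you are looking for: in Beam, a Map step is an element-wise transform (beam.Map or beam.FlatMap) and can run in parallel, while a Reduce step is an aggregation grouped by key (for example beam.CombinePerKey). The sketch below is illustrative only, not the actual is_popular.py code:

import apache_beam as beam

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'Create' >> beam.Create(['com.example.a', 'com.example.a', 'com.example.b'])
     | 'Map' >> beam.Map(lambda pkg: (pkg, 1))     # map: element-wise, runs in parallel
     | 'Reduce' >> beam.CombinePerKey(sum)         # reduce: aggregate the counts per key
     | 'Print' >> beam.Map(print))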
+
Execute the pipeline
+
+
Duration is 2 min
+
+
Step 1
+
+
Install the necessary dependencies for Python dataflow:
+
sudo ./install_packages.sh
+
+
Verify that you have the right version of pip (should be > 8.0):
+
pip -V
+
+
If not, open a new CloudShell tab and it should pick up the updated pip.
+
+
Step 2
+
+
Run the pipeline locally:
+
./is_popular.py
+
+
Note: you may see an error message related to oauth2 client logging; you can ignore it. The error simply says that logging from the oauth2 library will go to stderr.
+
+
Step 3
+
+
Examine the output file:
+
cat /tmp/output-*
+
+
Use command line parameters
+
+
Duration is 2 min
+
+
Step 1
+
+
Change the output prefix from the default value:
+
./is_popular.py --output_prefix=/tmp/myoutput
+
+
What will be the name of the new file that is written out?
+
+
Step 2
+
+
Note that we now have a new file in the /tmp directory:
+
ls -lrt /tmp/myoutput*
+
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
+
PART 6: SIDE INPUTS
Overview
+
+
Duration is 1 min
+
+
In this lab, you learn how to use BigQuery as a data source for Dataflow, and how to use the results of a pipeline as a side input to another pipeline.
+
+
What you learn
+
+
In this lab, you learn how to:
+
- Read data from BigQuery into Dataflow
- Use the output of a pipeline as a side-input to another pipeline
+
+
Introduction
+
+
Duration is 1 min
+
+
The goal of this lab is to learn how to use BigQuery as a data source for Dataflow, and how to use the result of a pipeline as a side input to another pipeline.
+
+
Try out BigQuery query
+
+
Duration is 4 min
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI, and click on Compose Query.
+
+
Step 2
+
+
Copy-and-paste this query:
+
SELECT
  content
FROM
  [fh-bigquery:github_extracts.contents_java_2016]
LIMIT
  10
+
+
Step 3
+
+
Click on Run Query.
+
+
What is being returned? _______________________________
+
+
The BigQuery table contains the content (and some metadata) of all the Java files present in GitHub in 2016.
+
+
Step 4
+
+
To find out how many Java files this table has, type the following query and click Run Query:
+
SELECT
  COUNT(*)
FROM
  [fh-bigquery:github_extracts.contents_java_2016]
+
+
The reason zero bytes are processed is that the row count is available from the table's metadata.
+
+
How many files are there in this dataset? __________________________________
+
+
Is this a dataset you want to process locally or on the cloud? ______________
+
+
Explore the pipeline code
+
+
Duration is 10 min
+
+
Step 1
+
+
On your Cloud Console, start CloudShell and navigate to the directory for this lab:
+
cd ~/training-data-analyst/courses/data_analysis/lab2
+
+
If this directory doesn't exist, you may need to git clone the repository:
+
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
+
+
Step 2
+
+
View the pipeline code using nano and answer the following questions:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/JavaProjectsThatNeedHelp.java
+
+
The pipeline looks like this (refer to this diagram as you read the code):
+
+
[Pipeline diagram]
+
+
Step 3
+
+
Looking at the class documentation at the very top, what is the purpose of this pipeline? __________________________________________________________
+
+
Where does GetJava get Java content from? _______________________________
+
+
What does ToLines do? (Hint: look at the content field of the BigQuery result) ____________________________________________________
+
+
Step 4
+
+
Why is the result of ToLines stored in a named PCollection instead of being directly passed to another apply()? ________________________________________________
+
+
What are the two actions carried out on javaContent? ____________________________
+
+
Step 5
+
+
If a file has 3 FIXMEs and 2 TODOs in its content (on different lines), how many calls for help are associated with it? __________________________________________________
+
+
If a file is in the package com.google.devtools.build, what are the packages that it is associated with? ____________________________________________________
+
+
Why is the numHelpNeeded variable not enough? Why do we need to do Sum.integersPerKey()? ___________________________________ (Hint: there are multiple files in a package)
+
+
Why is this converted to a View? ___________________________________________
+
+
Step 6
+
+
Which operation uses the View as a side input? _____________________________
+
+
Instead of simply ParDo.of(), this operation uses ____________________________
+
+
Besides c.element() and c.output(), this operation also makes use of what method in ProcessContext? __________________________________________________________
+
+
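The lab code here is Java, but the side-input pattern is the same in every Beam SDK: one PCollection is materialized as a view (the Java code uses View.asMap) and handed to another transform as an extra, read-only input. Below is a minimal Python sketch of the same idea, purely for illustration; the data and names are made up:

import apache_beam as beam

with beam.Pipeline('DirectRunner') as p:
    help_counts = p | 'HelpCounts' >> beam.Create([('pkgA', 3), ('pkgB', 7)])
    packages = p | 'Packages' >> beam.Create(['pkgA', 'pkgB'])

    (packages
     | 'Score' >> beam.Map(
         lambda pkg, counts: (pkg, counts[pkg]),
         counts=beam.pvalue.AsDict(help_counts))   # help_counts passed as a side input (a dict view)
     | 'Print' >> beam.Map(print))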
Execute the pipeline
+
+
Duration is 5 min
+
+
Step 1
+
+
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
+
+
Step 2
+
+
Execute the pipeline by typing in the following (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
+
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud3.sh <PROJECT> <YOUR-BUCKET-NAME> JavaProjectsThatNeedHelp
+
+
Monitor the job from the GCP console from the Dataflow section.
+
+
Step 3
+
+
Once the pipeline has finished executing, download and view the output:
+
gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.csv .
head output.csv
+
+
Stop here if you are done. Wait for instructions from the instructor before going on to the next section.
+
+
+
PART 7: STREAMING INTO BIGQUERY
+
+
Overview
+
+
Duration is 1 min
+
+
In this lab, you learn how to use Dataflow to aggregate records received in real time from Cloud Pub/Sub. The aggregate statistics will then be streamed into BigQuery and analyzed even as the data are streaming in.
+
+
What you learn
+
+
In this lab, you learn how to:
+
- Create Cloud Pub/Sub topic
- Read from Pub/Sub in Dataflow
- Compute windowed aggregates
- Stream into BigQuery
+
+
Introduction
+
+
Duration is 1 min
+
+
The goal of this lab is to learn how to use Pub/Sub as a real-time streaming source into Dataflow and BigQuery as a streaming sink.
+
+
[Diagram: Pub/Sub as a streaming source into Dataflow, BigQuery as the streaming sink]
+
+
Set up BigQuery and Pub/Sub
+
+
Duration is 3 min
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Next, click on the blue arrow next to your project name (on the left-hand panel) and click Create new dataset. If you do not have a dataset named demos, create one.
+
+
[Screenshot: Create Dataset dialog]
+
+
Step 2
+
+
Back in your Cloud Console, visit the Pub/Sub section of the GCP Console and click on Create Topic. Give your new topic the name streamdemo and select Create.
+
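Alternatively, the topic can be created from Cloud Shell (the exact command may vary slightly with your gcloud version):

gcloud pubsub topics create streamdemo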
+
Explore the pipeline code
+
+
Duration is 10 min
+
+
Step 1
+
+
Start CloudShell and navigate to the directory for this lab:
+
cd ~/training-data-analyst/courses/data_analysis/lab2
+
+
If this directory doesn't exist, you may need to git clone the repository:
+
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
+
+
Step 2
+
+
View the pipeline code using nano and answer the following questions:
+
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/StreamDemoConsumer.java
+
+
Step 3
+
+
What are the fields in the BigQuery table? _______________________________
+
+
Step 4
+
+
What is the pipeline source? ________________________________________________
+
+
Step 5
+
+
How often will aggregates be computed? ___________________________________________
+
+
Aggregates will be computed over what time period? _________________________________
+
+
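The two questions above are about two different window parameters: the window size is the period each aggregate covers, and the window period (or trigger) is how often a new aggregate is emitted. The Python sketch below only illustrates that distinction with made-up values; the lab's StreamDemoConsumer is Java and its actual settings may differ:

import apache_beam as beam

with beam.Pipeline('DirectRunner') as p:
    (p
     | 'Create' >> beam.Create([('hello world', 0), ('one two three', 45), ('hi there', 130)])
     | 'AddTimestamps' >> beam.Map(lambda kv: beam.window.TimestampedValue(kv[0], kv[1]))  # fake event times
     | 'Window' >> beam.WindowInto(beam.window.SlidingWindows(120, 30))  # 2-minute windows, emitted every 30s
     | 'CountWords' >> beam.Map(lambda line: len(line.split()))
     | 'Sum' >> beam.CombineGlobally(sum).without_defaults()             # one aggregate per window
     | 'Print' >> beam.Map(print))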
Step 6
+
+
What aggregate is being computed in this pipeline? ____________________________
+
+
How would you change it to compute the average number of words in each message over the time period? ____________________________
+
+
Step 7
+
+
What is the output sink for the pipeline? ____________________________
+
+
Execute the pipeline
+
+
Duration is 3 min
+
+
Step 1
+
+
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
+
+
Step 2
+
+
Execute the pipeline by typing in the following (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):
+
cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud4.sh <PROJECT> <YOUR-BUCKET-NAME>
+
+
Monitor the job from the GCP console from the Dataflow section. Note that this pipeline will not exit.
+
+
Step 3
+
+
Visit the Pub/Sub section of GCP Console and click on your streamdemo topic. Notice that it has a Dataflow subscription. Click on the Publish button and type in a message (any message) and click Publish:
+
+
[Screenshot: Pub/Sub topic Publish message dialog]
+
+
Step 4
+
+
Publish a few more messages.
+
+
Carry out streaming analytics
+
+
Duration is 3 min
+
+
Step 1
+
+
Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI. Compose a new query and type in (change your PROJECTID appropriately):
+
SELECT timestamp, num_words from [PROJECTID:demos.streamdemo] LIMIT 10
+
+
Clean up
+
+
Duration is 3 min
+
+
Step 1
+
+
Cancel the job from the GCP console from the Dataflow section.
+
+
Step 2
+
+
Delete the topic from the Pub/Sub section of GCP Console
+
+
Step 3
+
+
Delete the streamdemo table from the left panel of the BigQuery console.
+
+
©Google, Inc. or its affiliates. All rights reserved. Do not distribute.
+
+
+
+