Building a Snowflake Data Pipeline & LLM Analytics for Google Reviews

A Case Study for Poppy Kids Pediatric Dentistry

Jay Ozer
8 min read · Jul 18, 2024
Novato’s Premier Pediatric Dentistry

Hey there! I’m excited to share a project that’s close to my heart. As a data enthusiast married to a pediatric dentist, I’ve found myself at the intersection of technology and healthcare in a unique way. My wife runs Poppy Kids Pediatric Dentistry, and let me tell you, in today’s digital world, online reviews are an important part of running a dental practice.

We quickly realized that keeping track of what our little patients (and their parents!) think about us is super important. Their opinions shape our future. The reviews make our day and help other families discover the exceptional care we strive to provide.

But here’s the kicker — Google reviews can be as unpredictable as a toddler in a candy store. They appear and, out of nowhere, disappear, leaving us scratching our heads.

We recently received our 100th 5-star review, so it was time to take a systematic approach and tackle this challenge head-on. This post will walk you through how I built a nifty Snowflake data pipeline to automatically capture and analyze our Google reviews. I'm excited to share how we address these two critical needs for our family business.

The Challenge:

1. Vanishing Reviews: Google’s policy enforcement can lead to the sudden removal of reviews. While this helps maintain quality, it can inadvertently remove legitimate feedback. Without a system to track these changes, our business loses valuable insights and the ability to dispute removals effectively.

2. Performance Insights: Reviews serve as a real-time barometer of business performance. Having a comprehensive dataset of reviews enables deep analytics, helping the practice continually improve its services. Also, Google utilizes machine learning models to evaluate reviews. While this evaluation is necessary, it underscores the importance of keeping our own record of reviews. What better way than doing our own analytics and assessing their approach?

The Solution: A Snowflake Data Pipeline & Cortex Analytics

I developed an automated weekly pipeline using Snowflake to address these challenges. The pipeline has 6 main steps, with a 7th for data analytics using Snowflake Cortex functions.

Key Features of the Pipeline:
1. Incremental Updates: Only new reviews are added to the raw table, preventing duplicates.
2. Historical Tracking: By storing all fetched data, we maintain a history of reviews, including those that might be removed later.
3. Timezone Handling: Converts UTC timestamps to Pacific Time for local relevance.
4. Flexible Design: Can be easily adapted for multiple locations or businesses.

1. Data Extraction:
— Utilized the Outscraper API to fetch Google reviews: By default, the Google Places API returns only the five most recent reviews. It is possible to apply for Google Business Profile API access for your own business, but ideally I would like to analyze my business's reviews and compare them with reviews from other local pediatric practices. Outscraper has a very functional API that extracts Google reviews with a single line of code. This meant using Outscraper's Python package within Snowflake, which was much easier than I initially anticipated. Here are the step-by-step instructions for this step; you can find the complete code in my GitHub repo.

pip install --upgrade outscraper
snow snowpark package create outscraper
Run results and instructions to upload to stage

— Upload outscraper.zip to a Snowflake stage and then import it as an external package: First, create a stage, then use the Add Data option to load the outscraper.zip file into the newly created stage.

CREATE OR REPLACE STAGE my_package_stage; 
LIST @MY_PACKAGE_STAGE;
Load outscraper.zip into the newly created stage

Once the outscraper.zip is uploaded to the stage, the next step is to import it as a stage package. You can do this in a Python file or a notebook, which makes Snowflake Notebooks a very flexible feature. Remember the @ sign as you import the zip file; a hedged sketch of the same idea using a UDF's IMPORTS clause follows the screenshot below.

Stage Package installed
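
For illustration only, here is a minimal sketch of referencing the staged zip from a Python UDF via the IMPORTS clause. The function name and handler are placeholders of mine, not part of the pipeline; the actual pipeline functions later in this post only need the built-in `requests` package.

-- Hedged sketch: reference the staged outscraper.zip in a UDF (note the @ sign)
CREATE OR REPLACE FUNCTION check_outscraper_import()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('requests')
IMPORTS = ('@my_package_stage/outscraper.zip')
HANDLER = 'check_import'
AS
$$
def check_import():
    # the staged zip is resolved at import time
    import outscraper
    return outscraper.__name__
$$;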

2. Set up Secrets & Network Rule

I opted for a new schema called SECRETS in my data warehouse and added my Outscraper API key. I debated (with myself) whether to add my place_id as a secret but decided not to; it is far easier to replace the place_id and rerun the function. If, in the future, I am pulling records for 10 different businesses, a table that houses this info should work well.

CREATE SCHEMA IF NOT EXISTS SECRETS;
USE SCHEMA POPPY.SECRETS;

CREATE OR REPLACE SECRET OUTSCRAPER_API_KEY
TYPE = GENERIC_STRING
SECRET_STRING = '<outscraper_api_token>';

Next, I created a network rule and an external access integration so the API can be used within Snowflake.

-- Create network rule for Outscraper API
CREATE OR REPLACE NETWORK RULE outscraper_nr
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.app.outscraper.com');

-- Create external access integration
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION outscraper_eai
ALLOWED_NETWORK_RULES = (outscraper_nr)
ALLOWED_AUTHENTICATION_SECRETS = (POPPY.SECRETS.OUTSCRAPER_API_KEY)
ENABLED = true;

3. Create a Python UDF in Snowflake to securely make the API call and capture Google reviews:

This initial function loads all the data; everything else was derived from it. The `requests` package is available as a built-in Snowflake package, so the function runs out of the box.

CREATE OR REPLACE FUNCTION get_google_reviews(place_id STRING)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
EXTERNAL_ACCESS_INTEGRATIONS = (outscraper_eai)
SECRETS = ('api_key' = POPPY.SECRETS.OUTSCRAPER_API_KEY)
HANDLER = 'get_reviews'
PACKAGES = ('requests')
AS
$$
import requests
import json
import _snowflake

def get_reviews(place_id):
    api_key = _snowflake.get_generic_secret_string('api_key')
    url = "https://api.app.outscraper.com/maps/reviews-v3"
    params = {
        "query": place_id,
        "reviewsLimit": 1000,
        "async": False,
        "sort": "newest"
    }
    headers = {
        "X-API-KEY": api_key
    }
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        return json.dumps(response.json())
    else:
        return json.dumps({"error": f"API request failed with status code {response.status_code}"})
$$;

SHOW FUNCTIONS LIKE '%google_reviews%';

-- Test the function (replace with your own place_id)
SELECT get_google_reviews('ChIJLdUfAim9hYARqRFWA5LwoAI') AS reviews;

4. Data Storage:
— Raw JSON data is stored in the GOOGLE_REVIEWS_JSON table: I decided to store every pull as-is for safekeeping. The Outscraper API returns data in JSON format, so my JSON table has only three columns: place_id, fetch_timestamp, and raw_data for the JSON output (a minimal DDL sketch follows after this list).
— Parsed and flattened data is stored in the GOOGLE_REVIEWS_RAW table: This took me a few tries to parse due to the nested structure of the exported JSON document. I decided to parse and store all fields instead of a subset; you never know when you may need them.
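
As a rough illustration, the landing table and insert pattern might look like the sketch below. The schema name and the DEFAULT clause are assumptions of mine; the exact DDL (and the full column list of GOOGLE_REVIEWS_RAW) is in the GitHub repo.

-- Hedged sketch of the JSON landing table
CREATE TABLE IF NOT EXISTS POPPY.ANALYTICS.GOOGLE_REVIEWS_JSON (
    place_id        STRING,
    fetch_timestamp TIMESTAMP_TZ DEFAULT CURRENT_TIMESTAMP(),
    raw_data        VARIANT
);

-- Capture one API pull as a single row
INSERT INTO POPPY.ANALYTICS.GOOGLE_REVIEWS_JSON (place_id, fetch_timestamp, raw_data)
SELECT 'ChIJLdUfAim9hYARqRFWA5LwoAI', CURRENT_TIMESTAMP(), get_google_reviews('ChIJLdUfAim9hYARqRFWA5LwoAI');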

5. Data Processing:
— Updated function to return the past 30 days: First, I updated my function to return only reviews from the past 30 days. I considered pulling all reviews on every run, but that would be redundant since I am keeping a log of each run in my JSON table. Also, Outscraper gives a $10 credit that renews every month; although I don't have enough reviews to use up my free credits, you never know what happens in the future :) The function is identical to the initial load function apart from an additional time cutoff parameter.

CREATE OR REPLACE FUNCTION get_google_reviews_last_30_days(place_id STRING)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
EXTERNAL_ACCESS_INTEGRATIONS = (outscraper_eai)
SECRETS = ('api_key' = POPPY.SECRETS.OUTSCRAPER_API_KEY)
HANDLER = 'get_reviews_last_30_days'
PACKAGES = ('requests')
AS
$$
import requests
import json
from datetime import datetime, timedelta
import _snowflake

def get_reviews_last_30_days(place_id):
    api_key = _snowflake.get_generic_secret_string('api_key')
    url = "https://api.app.outscraper.com/maps/reviews-v3"

    thirty_days_ago = int((datetime.now() - timedelta(days=30)).timestamp())

    params = {
        "query": place_id,
        "reviewsLimit": 100,
        "async": False,
        "sort": "newest",
        "cutoff": thirty_days_ago
    }
    headers = {
        "X-API-KEY": api_key
    }
    response = requests.get(url, params=params, headers=headers)
    if response.status_code == 200:
        return json.dumps(response.json())
    else:
        return json.dumps({"error": f"API request failed with status code {response.status_code}"})
$$;

— Created a view (GOOGLE_REVIEWS_V): I created a view with only the fields I need for data analytics; a smaller data set made it easier to work with in the notebook. I also converted the UTC timestamps to my timezone, Pacific Time, during view creation. Along the way I learned about Snowflake's Time Travel feature. During view creation, the date must be filtered to get the latest data; here is my WHERE clause (a sketch of the full view follows it):

-- For getting latest data - the time travel
WHERE review_datetime_utc <= SYSDATE();
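
For reference, a minimal sketch of what such a view might look like; the columns selected from GOOGLE_REVIEWS_RAW here are assumptions of mine, and the actual view in the repo carries more fields.

-- Hedged sketch of the analytics view; column names are assumptions
CREATE OR REPLACE VIEW POPPY.ANALYTICS.GOOGLE_REVIEWS_V AS
SELECT
    review_id,
    author_title,
    review_rating,
    review_text,
    -- convert the UTC timestamp to Pacific Time for local relevance
    CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', review_datetime_utc) AS review_datetime_pt
FROM POPPY.ANALYTICS.GOOGLE_REVIEWS_RAW
-- For getting latest data - the time travel
WHERE review_datetime_utc <= SYSDATE();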

6. Automation:

— Implemented procedures to extract data from the API and insert new records into the raw table: I created two procedures plus an orchestrator procedure that calls them. The first two procedures are just wrappers; the main orchestrator calls them in order (a sketch of one wrapper appears after the orchestrator code below).

-- Main Orchestrator Procedure
CREATE OR REPLACE PROCEDURE POPPY.ANALYTICS.UPDATE_GOOGLE_REVIEWS(place_id STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    api_result STRING;
    insert_result STRING;
BEGIN
    -- Call procedure to pull API data into JSON table
    CALL POPPY.ANALYTICS.PULL_API_TO_JSON(:place_id) INTO :api_result;

    -- Call procedure to insert new records from JSON to RAW table
    CALL POPPY.ANALYTICS.INSERT_NEW_RECORDS_TO_RAW() INTO :insert_result;

    RETURN 'Google Reviews Update Complete for place_id: ' || :place_id || '. ' || :api_result || ' ' || :insert_result;
END;
$$;
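
The two wrapper bodies live in the GitHub repo; as a rough idea, the first one might look like the sketch below, reusing the table and column names from my earlier sketch (they are assumptions, not the exact repo code).

-- Hedged sketch of the first wrapper: land one API response in the JSON table
CREATE OR REPLACE PROCEDURE POPPY.ANALYTICS.PULL_API_TO_JSON(place_id STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO POPPY.ANALYTICS.GOOGLE_REVIEWS_JSON (place_id, fetch_timestamp, raw_data)
    SELECT :place_id, CURRENT_TIMESTAMP(), get_google_reviews_last_30_days(:place_id);
    RETURN 'Pulled latest API data into GOOGLE_REVIEWS_JSON.';
END;
$$;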

— Set up a Snowflake task: I decided to run this every week on Friday at 3 PM Pacific Time. Poppy Kids closes early on Fridays, so that marks the end of the week. Setting up an automated schedule with the Snowflake Tasks feature was super straightforward. Here is the code that runs my orchestrator stored procedure weekly at 3 PM:

-- Create scheduled task for google reviews update - every 7 days
CREATE OR REPLACE TASK POPPY.ANALYTICS.UPDATE_GOOGLE_REVIEWS_TASK
WAREHOUSE = POPPY_WH
SCHEDULE = 'USING CRON 0 15 * * FRI America/Los_Angeles'
AS
CALL POPPY.ANALYTICS.UPDATE_GOOGLE_REVIEWS('<place_id>');

-- Activate the task
ALTER TASK POPPY.ANALYTICS.UPDATE_GOOGLE_REVIEWS_TASK RESUME;
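
As a usage check (not part of the original setup), Snowflake's standard TASK_HISTORY table function shows whether the weekly run actually fired.

-- Check recent runs of the weekly task
SELECT name, state, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'UPDATE_GOOGLE_REVIEWS_TASK'))
ORDER BY scheduled_time DESC
LIMIT 10;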

7. Analytics:

The best part of this project was using the Snowflake Cortex functions for data analytics. I used a quickstart and tailored my data set and prompts. First, I checked things like the number of reviews each year and the hour of the day when most reviews were received. Using Cortex functions, I could assign a sentiment score to each review (a usage sketch follows). Surprisingly, most of the sentiment scores are similar, the benefit of having 5-star reviews :)
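
The per-review sentiment step is essentially a one-liner with Cortex, assuming the view exposes a review_text column as in my earlier sketch:

-- Sentiment score (roughly -1 to 1) for each review
SELECT
    review_text,
    SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment_score
FROM GOOGLE_REVIEWS_V
ORDER BY sentiment_score;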

I dug deeper by creating aspect-based sentiment, categorizing all review text. This provided more detailed insights into staff, service quality, and overall experience; a simplified sketch of the categorization call follows.
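
The exact prompts are in the repo; this sketch uses an illustrative prompt of my own, not the original one.

-- Hedged sketch: aspect-based categorization with CORTEX.COMPLETE
SELECT
    review_text,
    SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large',
        'Classify this pediatric dentistry review into exactly one aspect ' ||
        '(staff, service quality, or overall experience) and answer with the aspect only: ' || review_text
    ) AS aspect
FROM GOOGLE_REVIEWS_V;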

To pay homage to the oldies (NLTK), I created a word cloud analyzing all reviews, this time powered by `mistral-large`. You can find all the charts in my GitHub repo, but here is the LLM-powered word cloud (a sketch of the term-extraction query follows the image).

Poppy Kids Dental reviews — Wordcloud
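
Under the hood, the term extraction boils down to one aggregated Cortex call. This is a simplified sketch with an illustrative prompt rather than the one I actually used, and with no post-processing.

-- Hedged sketch: ask the LLM for word-cloud terms instead of NLTK frequency counts
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    'From the following reviews, list the 30 most characteristic words as a comma-separated list: ' ||
    LISTAGG(review_text, ' ')
) AS wordcloud_terms
FROM GOOGLE_REVIEWS_V;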

Conclusion:
This Snowflake data pipeline provides Poppy Kids Pediatric Dentistry with a powerful tool to maintain a comprehensive record of our Google reviews. It safeguards against losing valuable feedback and sets the stage for deep analytics to drive continuous improvement in patient care and satisfaction.

By leveraging Snowflake’s robust features and integrating with external APIs, we’ve created a scalable, secure, and efficient solution that can be adapted for businesses across various industries facing similar challenges with online reviews.

My GitHub repo contains the complete code for this pipeline, which provides a template for others to implement similar solutions.

I would love to get the next 100 reviews as soon as possible. With more data, I can run so many cool analytics.


Jay Ozer

I spend my time following the AI space and database technologies.