Whimful

BigQuery data sharing

Whimful publishes a daily pageview stats table directly into your GCP project via BigQuery Analytics Hub. Storage stays on Whimful's side; queries you run are billed to your project.

How to subscribe

  1. Your account admin enables sharing in account settings and shares the listing link with your engineering team.
  2. Open the listing link in a browser signed in to your GCP project. Pick a destination dataset name.
  3. Confirm. The linked dataset appears in your project under BigQuery.

Table reference — whimful_daily_pageview_stats

One row per (date, dimension combo) per day. Partitioned by event_date, clustered by host, path.

ColumnTypeDescription
event_dateDATEUTC date the event occurred. Partition column.
pathSTRINGURL path including hash fragment, with sensitive query keys removed by wa_common.sanitise_url_path. UTM params are extracted to the utm record and stripped from the path.
siteSTRINGSite name as shown in the Whimful dashboard — the page hostname lowercased with a leading "www." stripped.
hostSTRINGHostname of the page URL.
referrer_domainSTRINGRegistered domain of the referrer URL (NET.REG_DOMAIN(referrer)). NULL if no referrer.
deviceRECORDDecoded device information from user agent. Excludes browser_version and os_version (cardinality reduction).
↳ device.browserSTRINGBrowser family (e.g. "Chrome", "Safari").
↳ device.osSTRINGOperating system family (e.g. "macOS", "iOS").
↳ device.device_typeSTRINGDevice form factor (e.g. "mobile", "desktop", "tablet").
↳ device.device_vendorSTRINGDevice manufacturer when available (e.g. "Apple").
↳ device.device_modelSTRINGDevice model name when available (e.g. "iPhone").
↳ device.engineSTRINGBrowser engine (e.g. "WebKit", "Blink").
geoRECORDCoarse geo data. Coordinates are intentionally not exposed.
↳ geo.region_codeSTRINGISO 3166-1 alpha-2 country code.
↳ geo.region_subdivision_codeSTRINGISO 3166-2 region/state code.
↳ geo.citySTRINGCity name as resolved from IP geolocation.
bot_nameSTRINGBot/crawler name extracted from the user agent (e.g. "Googlebot", "GPTBot", "facebookexternalhit", "meta-externalads", "AdsBot-Google", "ClaudeBot"). NULL when the visitor does not match a known bot signature. Heuristic: the standard `(compatible; <name>/<version>)` Mozilla clause, identifiers immediately followed by a `+http://...` or `+<bot>@<domain>` contact marker, and finally any token ending in "bot", "crawler", or "spider".
bot_scoreINT64IP-reputation score from whimful.wa_processed.classified_ips. 0–11; BOT >= 7, SUSPICIOUS >= 4, CLEAN < 4. NULL when the IP has never been classified.
is_vpnBOOLTRUE when the IP belongs to a known VPN provider (whimful.wa_processed.classified_ips.isPredictedVpn). NULL when the IP has never been classified.
is_data_centerBOOLTRUE when the IP belongs to a known cloud/datacenter provider (whimful.wa_processed.classified_ips.isPredictedDatacenter). NULL when the IP has never been classified.
utmRECORDUTM campaign attribution parsed from the page URL query string.
↳ utm.sourceSTRINGutm_source value.
↳ utm.mediumSTRINGutm_medium value.
↳ utm.campaignSTRINGutm_campaign value.
↳ utm.termSTRINGutm_term value.
↳ utm.contentSTRINGutm_content value.
pageviewsINT64COUNT(*) of PAGE_VIEW events at this dimension grain.
users_hllBYTESHLL_COUNT.INIT(wa_uid). Use HLL_COUNT.MERGE() to count uniques at any rollup.
sessions_hllBYTESHLL_COUNT.INIT(wa_sid). Use HLL_COUNT.MERGE().
visit_duration_kllBYTESKLL sketch on client_info.visit_duration_millis from PAGE_PERFORMANCE events.
active_visit_duration_kllBYTESKLL sketch on client_info.active_visit_duration_millis.
cwv_metricsRECORDCore Web Vitals distributions (KLL sketches), captured from PAGE_PERFORMANCE events. Merge with KLL_QUANTILES.MERGE_POINT_DOUBLE(sketch, q).
↳ cwv_metrics.lcp_kllBYTESLargest Contentful Paint (ms).
↳ cwv_metrics.cls_kllBYTESCumulative Layout Shift (unitless).
↳ cwv_metrics.inp_kllBYTESInteraction to Next Paint (ms).
↳ cwv_metrics.fid_kllBYTESFirst Input Delay (ms).
↳ cwv_metrics.fcp_kllBYTESFirst Contentful Paint (ms).
↳ cwv_metrics.ttfb_kllBYTESTime to First Byte (ms).
↳ cwv_metrics.dom_content_loaded_kllBYTESdom_content_loaded time (ms).
↳ cwv_metrics.load_time_kllBYTESload_time (ms).

Example queries

Daily pageviews, last 30 days

SELECT event_date, SUM(pageviews) AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
GROUP BY event_date
ORDER BY event_date;

Unique users per day (HLL)

SELECT event_date, HLL_COUNT.MERGE(users_hll) AS unique_users
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
GROUP BY event_date
ORDER BY event_date;

By country, last 7 days

SELECT geo.region_code AS country,
       HLL_COUNT.MERGE(users_hll)    AS users,
       HLL_COUNT.MERGE(sessions_hll) AS sessions,
       SUM(pageviews)                AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 7
GROUP BY country
ORDER BY users DESC;

p75 Web Vitals by page (KLL)

SELECT path,
       KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.lcp_kll, 0.75) AS lcp_p75_ms,
       KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.inp_kll, 0.75) AS inp_p75_ms,
       KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.cls_kll, 0.75) AS cls_p75,
       SUM(pageviews) AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 7
GROUP BY path
HAVING pageviews >= 100
ORDER BY pageviews DESC
LIMIT 50;

UTM attribution with median active time on page

SELECT utm.source, utm.medium, utm.campaign,
       HLL_COUNT.MERGE(sessions_hll) AS sessions,
       KLL_QUANTILES.MERGE_POINT_DOUBLE(active_visit_duration_kll, 0.5) AS median_active_ms
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
  AND utm.source IS NOT NULL
GROUP BY utm.source, utm.medium, utm.campaign
ORDER BY sessions DESC;

Exclude bot, VPN, and data-center traffic

SELECT event_date,
       SUM(pageviews)             AS pageviews,
       HLL_COUNT.MERGE(users_hll) AS unique_users
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
  AND bot_name IS NULL
  AND COALESCE(bot_score, 0) < 4
  AND NOT COALESCE(is_vpn, FALSE)
  AND NOT COALESCE(is_data_center, FALSE)
GROUP BY event_date
ORDER BY event_date;

HLL and KLL primer

  • HLL sketches let you count uniques at any rollup. Always use HLL_COUNT.MERGE(column); never SUM or COUNT on the BYTES column directly.
  • KLL sketches store the full distribution. KLL_QUANTILES.MERGE_POINT_DOUBLE(col, q) returns the value at quantile q (e.g. 0.75 for p75). The standard convention for Web Vitals is p75.
  • Both sketch types are mergeable: pre-aggregate, save the sketches, merge them again later. Approximate but bounded: HLL ≈ 1.5% error, KLL bounded relative error.

Gotchas

  • Data lands at T+1, around 02:00 UTC. Yesterday's data appears around 02:00 UTC today.
  • The most recent two days are always re-materialised, so values for those dates can shift slightly as late Web Vitals arrive.
  • Path query strings are sanitised. Sensitive params (email, token, password, auth, session, key, secret) are stripped before storage.
  • All timestamps and partitions are UTC.
  • The bot_name column is a best-effort heuristic. It catches three patterns: the standard (compatible; <name>/<version>) Mozilla clause (Googlebot, bingbot, GPTBot, OAI-SearchBot, …), bots that announce a contact URL or email right after their identifier (facebookexternalhit, meta-externalads, AdsBot-Google, Applebot, ClaudeBot, …), and as a final fallback any token ending in "bot", "crawler", or "spider". For a stricter filter, combine it with bot_score < 4 (IP reputation), NOT is_vpn, and NOT is_data_center.
  • The IP-classification columns (bot_score, is_vpn, is_data_center) are NULL for IPs we have not seen frequently enough to classify yet. Use COALESCE when filtering so unclassified traffic is not silently dropped.
  • Storage cost is on Whimful; query bytes are billed to your project. Use partition filters (event_date >= …) to keep query cost low.