BigQuery data sharing
Whimful publishes a daily pageview stats table directly into your GCP project via BigQuery Analytics Hub. Storage stays on Whimful's side; queries you run are billed to your project.
How to subscribe
- Your account admin enables sharing in account settings and shares the listing link with your engineering team.
- Open the listing link in a browser signed in to your GCP project. Pick a destination dataset name.
- Confirm. The linked dataset appears in your project under BigQuery.
Table reference — whimful_daily_pageview_stats
One row per (date, dimension combo) per day. Partitioned by event_date, clustered by host, path.
| Column | Type | Description |
|---|---|---|
event_date | DATE | UTC date the event occurred. Partition column. |
path | STRING | URL path including hash fragment, with sensitive query keys removed by wa_common.sanitise_url_path. UTM params are extracted to the utm record and stripped from the path. |
site | STRING | Site name as shown in the Whimful dashboard — the page hostname lowercased with a leading "www." stripped. |
host | STRING | Hostname of the page URL. |
referrer_domain | STRING | Registered domain of the referrer URL (NET.REG_DOMAIN(referrer)). NULL if no referrer. |
device | RECORD | Decoded device information from user agent. Excludes browser_version and os_version (cardinality reduction). |
↳ device.browser | STRING | Browser family (e.g. "Chrome", "Safari"). |
↳ device.os | STRING | Operating system family (e.g. "macOS", "iOS"). |
↳ device.device_type | STRING | Device form factor (e.g. "mobile", "desktop", "tablet"). |
↳ device.device_vendor | STRING | Device manufacturer when available (e.g. "Apple"). |
↳ device.device_model | STRING | Device model name when available (e.g. "iPhone"). |
↳ device.engine | STRING | Browser engine (e.g. "WebKit", "Blink"). |
geo | RECORD | Coarse geo data. Coordinates are intentionally not exposed. |
↳ geo.region_code | STRING | ISO 3166-1 alpha-2 country code. |
↳ geo.region_subdivision_code | STRING | ISO 3166-2 region/state code. |
↳ geo.city | STRING | City name as resolved from IP geolocation. |
bot_name | STRING | Bot/crawler name extracted from the user agent (e.g. "Googlebot", "GPTBot", "facebookexternalhit", "meta-externalads", "AdsBot-Google", "ClaudeBot"). NULL when the visitor does not match a known bot signature. Heuristic: the standard `(compatible; <name>/<version>)` Mozilla clause, identifiers immediately followed by a `+http://...` or `+<bot>@<domain>` contact marker, and finally any token ending in "bot", "crawler", or "spider". |
bot_score | INT64 | IP-reputation score from whimful.wa_processed.classified_ips. 0–11; BOT >= 7, SUSPICIOUS >= 4, CLEAN < 4. NULL when the IP has never been classified. |
is_vpn | BOOL | TRUE when the IP belongs to a known VPN provider (whimful.wa_processed.classified_ips.isPredictedVpn). NULL when the IP has never been classified. |
is_data_center | BOOL | TRUE when the IP belongs to a known cloud/datacenter provider (whimful.wa_processed.classified_ips.isPredictedDatacenter). NULL when the IP has never been classified. |
utm | RECORD | UTM campaign attribution parsed from the page URL query string. |
↳ utm.source | STRING | utm_source value. |
↳ utm.medium | STRING | utm_medium value. |
↳ utm.campaign | STRING | utm_campaign value. |
↳ utm.term | STRING | utm_term value. |
↳ utm.content | STRING | utm_content value. |
pageviews | INT64 | COUNT(*) of PAGE_VIEW events at this dimension grain. |
users_hll | BYTES | HLL_COUNT.INIT(wa_uid). Use HLL_COUNT.MERGE() to count uniques at any rollup. |
sessions_hll | BYTES | HLL_COUNT.INIT(wa_sid). Use HLL_COUNT.MERGE(). |
visit_duration_kll | BYTES | KLL sketch on client_info.visit_duration_millis from PAGE_PERFORMANCE events. |
active_visit_duration_kll | BYTES | KLL sketch on client_info.active_visit_duration_millis. |
cwv_metrics | RECORD | Core Web Vitals distributions (KLL sketches), captured from PAGE_PERFORMANCE events. Merge with KLL_QUANTILES.MERGE_POINT_DOUBLE(sketch, q). |
↳ cwv_metrics.lcp_kll | BYTES | Largest Contentful Paint (ms). |
↳ cwv_metrics.cls_kll | BYTES | Cumulative Layout Shift (unitless). |
↳ cwv_metrics.inp_kll | BYTES | Interaction to Next Paint (ms). |
↳ cwv_metrics.fid_kll | BYTES | First Input Delay (ms). |
↳ cwv_metrics.fcp_kll | BYTES | First Contentful Paint (ms). |
↳ cwv_metrics.ttfb_kll | BYTES | Time to First Byte (ms). |
↳ cwv_metrics.dom_content_loaded_kll | BYTES | dom_content_loaded time (ms). |
↳ cwv_metrics.load_time_kll | BYTES | load_time (ms). |
Example queries
Daily pageviews, last 30 days
SELECT event_date, SUM(pageviews) AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
GROUP BY event_date
ORDER BY event_date;Unique users per day (HLL)
SELECT event_date, HLL_COUNT.MERGE(users_hll) AS unique_users
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
GROUP BY event_date
ORDER BY event_date;By country, last 7 days
SELECT geo.region_code AS country,
HLL_COUNT.MERGE(users_hll) AS users,
HLL_COUNT.MERGE(sessions_hll) AS sessions,
SUM(pageviews) AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 7
GROUP BY country
ORDER BY users DESC;p75 Web Vitals by page (KLL)
SELECT path,
KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.lcp_kll, 0.75) AS lcp_p75_ms,
KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.inp_kll, 0.75) AS inp_p75_ms,
KLL_QUANTILES.MERGE_POINT_DOUBLE(cwv_metrics.cls_kll, 0.75) AS cls_p75,
SUM(pageviews) AS pageviews
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 7
GROUP BY path
HAVING pageviews >= 100
ORDER BY pageviews DESC
LIMIT 50;UTM attribution with median active time on page
SELECT utm.source, utm.medium, utm.campaign,
HLL_COUNT.MERGE(sessions_hll) AS sessions,
KLL_QUANTILES.MERGE_POINT_DOUBLE(active_visit_duration_kll, 0.5) AS median_active_ms
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
AND utm.source IS NOT NULL
GROUP BY utm.source, utm.medium, utm.campaign
ORDER BY sessions DESC;Exclude bot, VPN, and data-center traffic
SELECT event_date,
SUM(pageviews) AS pageviews,
HLL_COUNT.MERGE(users_hll) AS unique_users
FROM `<your-project>.<linked-dataset>.whimful_daily_pageview_stats`
WHERE event_date >= CURRENT_DATE() - 30
AND bot_name IS NULL
AND COALESCE(bot_score, 0) < 4
AND NOT COALESCE(is_vpn, FALSE)
AND NOT COALESCE(is_data_center, FALSE)
GROUP BY event_date
ORDER BY event_date;HLL and KLL primer
- HLL sketches let you count uniques at any rollup. Always use
HLL_COUNT.MERGE(column); neverSUMorCOUNTon the BYTES column directly. - KLL sketches store the full distribution.
KLL_QUANTILES.MERGE_POINT_DOUBLE(col, q)returns the value at quantileq(e.g. 0.75 for p75). The standard convention for Web Vitals is p75. - Both sketch types are mergeable: pre-aggregate, save the sketches, merge them again later. Approximate but bounded: HLL ≈ 1.5% error, KLL bounded relative error.
Gotchas
- Data lands at T+1, around 02:00 UTC. Yesterday's data appears around 02:00 UTC today.
- The most recent two days are always re-materialised, so values for those dates can shift slightly as late Web Vitals arrive.
- Path query strings are sanitised. Sensitive params (email, token, password, auth, session, key, secret) are stripped before storage.
- All timestamps and partitions are UTC.
- The
bot_namecolumn is a best-effort heuristic. It catches three patterns: the standard(compatible; <name>/<version>)Mozilla clause (Googlebot, bingbot, GPTBot, OAI-SearchBot, …), bots that announce a contact URL or email right after their identifier (facebookexternalhit, meta-externalads, AdsBot-Google, Applebot, ClaudeBot, …), and as a final fallback any token ending in "bot", "crawler", or "spider". For a stricter filter, combine it withbot_score < 4(IP reputation),NOT is_vpn, andNOT is_data_center. - The IP-classification columns (
bot_score,is_vpn,is_data_center) are NULL for IPs we have not seen frequently enough to classify yet. UseCOALESCEwhen filtering so unclassified traffic is not silently dropped. - Storage cost is on Whimful; query bytes are billed to your project. Use partition filters (
event_date >= …) to keep query cost low.