Can we get rid of addonHistograms?
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from moztelemetry.dataset import Dataset
from moztelemetry import get_pings_properties, get_one_ping_per_client
Unable to parse whitelist (/mnt/anaconda2/lib/python2.7/site-packages/moztelemetry/histogram-whitelists.json). Assuming all histograms are acceptable.
Let’s just look at a non-representative 10% of main pings gathered on a recent Tuesday.
pings = Dataset.from_source("telemetry") \
    .where(docType='main') \
    .where(submissionDate="20170328") \
    .records(sc, sample=0.1)
subset = get_pings_properties(pings, ["payload/addonHistograms"])
full_count = subset.count()
full_count
37815981
filtered = subset.filter(lambda p: p["payload/addonHistograms"] is not None)
filtered_count = filtered.count()
filtered_count
25794
1.0 * filtered_count / full_count
0.0006820925787962502
addons = filtered.flatMap(lambda p: p['payload/addonHistograms'].keys()).map(lambda key: (key, 1))
addons.countByKey()
defaultdict(int, {u'Firebug': 92, u'shumway@research.mozilla.org': 15, u'uriloader@pdf.js': 4})
Wow, so most of those addonHistograms sections are empty.
…And those that aren’t are from defunct data collection sources. Looks like we can remove this without too many complaints. Excellent.
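To make the "mostly empty" observation explicit, the sections could be tallied as missing, empty, or populated. What follows is only a minimal sketch in plain Python, not part of the original analysis: `sample_pings` is a hypothetical stand-in for the real Spark RDD, and `summarize` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical sample data standing in for real pings (an assumption,
# not actual telemetry). Each ping is a dict shaped like the output of
# get_pings_properties(pings, ["payload/addonHistograms"]).
sample_pings = [
    {"payload/addonHistograms": {}},
    {"payload/addonHistograms": {}},
    {"payload/addonHistograms": {"Firebug": {}}},
    {"payload/addonHistograms": None},
    {"payload/addonHistograms": {}},
]


def summarize(pings, field="payload/addonHistograms"):
    """Tally missing, empty, and populated sections, plus per-add-on key counts."""
    summary = Counter()
    addon_keys = Counter()
    for ping in pings:
        section = ping.get(field)
        if section is None:
            summary["missing"] += 1
        elif len(section) == 0:
            summary["empty"] += 1
        else:
            summary["populated"] += 1
            # One count per add-on key, mirroring the flatMap + countByKey step.
            addon_keys.update(section.keys())
    return summary, addon_keys


summary, addon_keys = summarize(sample_pings)
print(dict(summary))
print(dict(addon_keys))
```

On the real data, the same check could presumably be run on the RDD itself, for example by filtering `filtered` for sections of length zero and calling `count()` on the result.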