As previously examined there is a large volume of pings coming from a single client_id
It’s gotten much worse since then, so more investigation is needed.
import ujson as json import matplotlib.pyplot as plt import pandas as pd import numpy as np import plotly.plotly as py from plotly.graph_objs import * from moztelemetry import get_pings_properties, get_one_ping_per_client from moztelemetry.dataset import Dataset %matplotlib inline
Unable to parse whitelist (/mnt/anaconda2/lib/python2.7/site-packages/moztelemetry/histogram-whitelists.json). Assuming all histograms are acceptable.
from datetime import datetime, timedelta
pings = Dataset.from_source("telemetry") \ .where(docType='main') \ .where(appBuildId="20161014004013")\ .where(appUpdateChannel="aurora") \ .records(sc, sample=1)
pings = pings.filter(lambda p: p["clientId"] == "omitted")
From last time, we know there was a roughly-increasing number of pings we were receiving from this client day over day. How has this changed?
Since the first analysis we’ve noticed that a lot of these pings are duplicates: they have the same id and everything! So while we’re here let’s skip the dupes and graph the volume of pings, counting only the first time we saw them.
submission_dates = get_pings_properties(pings, ["id", "meta/submissionDate"])
ping_counts = submission_dates.map(lambda p: (datetime.strptime(p["meta/submissionDate"], '%Y%m%d'), 1)).countByKey()
first_ping_counts = submission_dates\ .map(lambda p: (p["id"], p))\ .reduceByKey(lambda a, b: a if a["meta/submissionDate"] < b["meta/submissionDate"] else b)\ .map(lambda p: (datetime.strptime(p[1]["meta/submissionDate"], '%Y%m%d'), 1))\ .countByKey()
df = pd.DataFrame(ping_counts.items(), columns=["date", "count (including duplicates)"]).set_index(["date"]) ax = df.plot(figsize=(17, 7)) df2 = pd.DataFrame(first_ping_counts.items(), columns=["date", "unique count"]).set_index(["date"]) df2.plot(ax=ax) plt.xticks(np.arange(min(df.index), max(df.index) + timedelta(3), 5, dtype="datetime64[D]")) plt.ylabel("ping count") plt.xlabel("date") plt.grid(True) plt.show()
We are seeing an ever-increasing volume of duplicate pings. Very few are unique.
Operating on the assumption that the duplicate pings were distributed along with the Aurora 51 binaries and profile directories, that means only the unique pings have the chance to contain accurate information about the clients.
Yes, this means the majority of the subsession and platform analysis from the previous report is likely bunk as it didn’t filter out the duplicate pings. (To be fair, no one at the time knew the pings were duplicate and likely distributed with the binaries)
So, looking at only the non-duplicate pings, what can we see?
subset = get_pings_properties(pings, [ "id", "meta/geoCountry", "meta/geoCity", "environment/addons/activeAddons", "environment/settings/isDefaultBrowser", "environment/system/cpu/speedMHz", "environment/system/os/name", "environment/system/os/version", "payload/info/sessionLength", "payload/info/subsessionLength", ])
unique_pings = subset\ .map(lambda p: (p["id"], p))\ .reduceByKey(lambda a, b: "dupe")\ .map(lambda p: p[1])\ .filter(lambda p: p != "dupe")
unique_pings.count()
312266
pings_with_addon = unique_pings\ .flatMap(lambda p: [(addon["name"], 1) for addon in filter(lambda x: "isSystem" not in x or not x["isSystem"], p["environment/addons/activeAddons"].values())])\ .countByKey()
sorted(pings_with_addon.items(), key=lambda x: x[1], reverse=True)[:5]
[(u'Random Agent Spoofer', 311019), (u'Alexa Traffic Rank', 37439), (u'RefControl', 18358), (u'StopTube', 4831), (u'Stop YouTube Autoplay', 2)]
As before, every ping shows Random Agent Spoofer.
SESSION_MAX = 250
session_lengths = unique_pings\ .map(lambda p: p["payload/info/sessionLength"] if p["payload/info/sessionLength"] < SESSION_MAX else SESSION_MAX)\ .collect()
s = pd.Series(session_lengths) s.hist(bins=250, figsize=(17, 7)) plt.ylabel("ping count") plt.xlabel("session length in seconds") plt.xticks(np.arange(0, max(s) + 1, 5)) plt.show()
Not sure how this compares to “normal” session lengths on Aurora. But a peak around 1m and another around 1m15s is way too short for a human to be getting anything meaningful done.
unique_pings.map(lambda p: (p["environment/settings/isDefaultBrowser"], 1)).countByKey()
defaultdict(int, {None: 1, False: 312265})
Nope. (Note that this may mean nothing at all. Some platforms (lookin’ at you, Windows 10) make it difficult to set your default browser. Also: even without it being default it can be used if the user’s workflow is to always start by opening their browser.)
MHZ_MAX = 5000
mhzes = unique_pings\ .map(lambda p: p["environment/system/cpu/speedMHz"] if p["environment/system/cpu/speedMHz"] < MHZ_MAX else MHZ_MAX)\ .collect()
ds = pd.Series(mhzes) ds.hist(bins=250, figsize=(17, 7)) plt.ylabel("ping count (log)") plt.xlabel("speed in MHz") plt.yscale("log") plt.show()
ds.value_counts()[:10]
2400 84819 2399 73546 2397 44691 3504 44401 3503 41217 3495 3284 3700 2141 2396 2108 2299 2059 2398 1934 dtype: int64
Nowhere near as monocultural as the original analysis suspected, but still not a very broad spread.
def major_minor(version_string): return version_string.split('.')[0] + '.' + version_string.split('.')[1]
pings_per_os = unique_pings\ .map(lambda p: (p["environment/system/os/name"] + " " + major_minor(p["environment/system/os/version"]), 1))\ .countByKey()
print len(pings_per_os) sorted(pings_per_os.items(), key=lambda x: x[1], reverse=True)[:10]
1 [(u'Windows_NT 5.1', 312266)]
100% Windows XP. I should be more surprised.
pings_per_city = unique_pings\ .map(lambda p: (p["meta/geoCountry"] + " " + p["meta/geoCity"], 1))\ .countByKey()
print len(pings_per_city) sorted(pings_per_city.items(), key=lambda x: x[1], reverse=True)[:10]
458 [(u'CA Montr\xe9al', 100792), (u'FR ??', 62789), (u'US Costa Mesa', 19677), (u'US Phoenix', 14828), (u'FR Paris', 8888), (u'GB ??', 7155), (u'DE ??', 5210), (u'GB London', 5190), (u'NZ Auckland', 5065), (u'US Los Angeles', 4072)]
As expected, this is little different from the analysis over the duplicated pings. The hosts submitting even the duplicated pings have come from the world over.
That being said, Montreal’s very highly represented here. It’s a lovely city and all, but what conditions might result in its unexpected prominence?
Well, we now know that the overwhelming majority of pings sent by this client are geographically diverse and duplicated. This suggests that these instances were packaged up with their profile data directory intact, and are submitting a pre-packaged ping as soon as they come online.
This means the duplicated pings likely reflect the environment of the original host where the package was created: Windows XP, 3.5GHz CPU, 2m35s-long session, with Random Agent Spoofer installed.
These pings reflect likely reflect the hosts that are actually sending the pings: geographically diverse, but still running Windows XP while spending inhumanly-short times within the browser.
We still don’t know why this is happening. We do know that the number of initial duplicate pings we’re receiving is continuing to increase. I see no sign of this ever stopping.