Sysadmin vs Scientist

Dima Korolev, Engineer and Data Scientist via Quora

Here are two approaches to data science, which I call the Sysadmin approach and the Scientist approach.

Sysadmin approach: Use the knowledge obtained by reading Apache logs, nginx logs, systemd logs, cron logs, etc.

A good sysadmin would open the log file, press page down and watch it, stopping and scrolling back on anomalies.

A great sysadmin would run a couple of iterations of grep -v/sed/awk/cut/sort/uniq -c and find the most interesting patterns in well under an hour. For bonus points, they might use perl for it.

A sysadmin who is ready for data science will use jq instead of awk and python instead of perl. These admins are half ready to become full-stack data engineers, especially if data science and infrastructure in their company are done in Python.
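As a rough illustration of moving such a pipeline into Python, here is a minimal sketch of a grep -v / awk / sort / uniq -c iteration. The sample lines are made up, and the assumption that the status code is the ninth whitespace-separated field holds only for access logs shaped like the sample; the function name status_counts and the "/health" filter are hypothetical.

```python
from collections import Counter

# Made-up access log lines, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /health HTTP/1.1" 200 2',
    '10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET / HTTP/1.1" 200 512',
]

def status_counts(lines, exclude="/health"):
    """Count status codes, skipping lines that contain `exclude`."""
    counts = Counter()
    for line in lines:
        if exclude in line:          # plays the role of grep -v
            continue
        fields = line.split()
        if len(fields) > 8:          # assumed layout: status is field 9
            counts[fields[8]] += 1   # plays the role of awk '{print $9}'
    # plays the role of sort | uniq -c | sort -rn
    return counts.most_common()

for status, count in status_counts(SAMPLE):
    print(count, status)
```

The point is not that Python is shorter than the shell pipeline, but that each step becomes a named, reusable piece.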

Scientist approach: Use whatever tools are available to present raw logs as something that can be explained mathematically. Then abstract away from their appearance and look at the substance.

A good scientist would find a way to convert data into manageable CSV/TSV. Then they will use gnuplot or Excel to look at distributions of certain events and properties.

A great data scientist will abstract away log structure even further, presenting data views as something R or SciPy can handle.

A data scientist who loads her data into Redis, ElasticSearch or an SQL-friendly storage and then leverages that tool’s query engine is halfway to becoming a full-stack data engineer.

How to tell whether someone is an ex-admin or ex-scientist?

Ask them a straightforward data question along the lines of “You have a couple of gigabytes of logs, explain how you would compute X”.

Say, X is the click-through-rate, CTR.

An admin would grep -c events that are clicks, grep -c events that are impressions, divide the first count by the second and get the answer.

You would then ask whether this number makes sense. Hint: It might be above 100% due to a) duplicates, and b) click events w/o impression events.
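The counting approach, and the way it can overshoot 100%, can be sketched as follows. This assumes a JSON-lines log with an "event" field whose values are "click" and "impression"; that format, the field names, and the sample records are all hypothetical. The sample deliberately contains a duplicated click and a click with no matching impression.

```python
import json

# Hypothetical JSON-lines log; field names and values are assumptions.
SAMPLE_LOG = """\
{"event": "impression", "client_id": "c1", "content_id": "a"}
{"event": "click", "client_id": "c1", "content_id": "a"}
{"event": "click", "client_id": "c1", "content_id": "a"}
{"event": "click", "client_id": "c2", "content_id": "b"}
"""

def naive_ctr(lines):
    """Count click and impression events and divide, like two grep -c runs."""
    clicks = impressions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line).get("event")
        if event == "click":
            clicks += 1
        elif event == "impression":
            impressions += 1
    return clicks / impressions if impressions else float("nan")

ctr = naive_ctr(SAMPLE_LOG.splitlines())
print(f"naive CTR: {ctr:.0%}")
```

With three click events and one impression event, the naive ratio comes out well above 100%, which is exactly the kind of number that should prompt the follow-up question.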

An admin would then come up with the lists of distinct user ID-s who had a click event and distinct user ID-s who had an impression event.

You would then ask how they know the users are canonical. For instance, if a user has created an account, then a click from their account ID later should be counted towards an impression they had while still being an anonymous user, with a client ID but no account ID.
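One possible way to canonicalize users is sketched below. It rests on a strong assumption that may not hold for real logs: that at least one event recorded after sign-up carries both the client ID and the account ID, so the two can be linked. The field names client_id and account_id and both helper functions are hypothetical.

```python
def build_identity_map(events):
    """Map each client ID to an account ID it was seen together with."""
    client_to_account = {}
    for e in events:
        if e.get("account_id") and e.get("client_id"):
            client_to_account[e["client_id"]] = e["account_id"]
    return client_to_account

def canonical_user(event, client_to_account):
    """Prefer the account ID; fall back to a linked account, then the client ID."""
    if event.get("account_id"):
        return event["account_id"]
    client = event.get("client_id")
    return client_to_account.get(client, client)

# Made-up events: c1 browses anonymously, then signs up as u42 and clicks.
EVENTS = [
    {"event": "impression", "client_id": "c1", "content_id": "a"},
    {"event": "click", "client_id": "c1", "account_id": "u42", "content_id": "a"},
    {"event": "impression", "client_id": "c2", "content_id": "a"},
]

id_map = build_identity_map(EVENTS)
users = [canonical_user(e, id_map) for e in EVENTS]
print(users)  # ['u42', 'u42', 'c2']
```

The anonymous impression from c1 is now attributed to u42, so it can be matched with the later click. A client ID shared by several accounts would need a more careful rule than this last-one-wins map.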

What would a scientist do? Actually, pretty much the same. They might even use the very same grep / jq / awk / uniq commands.

The difference is that a scientist would first pose a couple of questions:

  • What is CTR?
  • What is an impression?
  • What is a click?
  • What is the user?
  • Are all users created equal, or how do we weight users?

Then they would come up with suggestions (hypotheses), for example:

  • CTR per user is the fraction of the pieces of content they clicked out of all the content they saw.
  • Service-wide CTR is the average of all per-user CTR-s.
  • A user is uniquely identified by their account ID.
    We discard anonymous users. Or: A user is uniquely identified by their client ID. We discard registered users. Or: We treat registered and anonymous users as equals.
  • A click or an impression is the presence of a certain event in the logs.
    We assume log events can be duplicated; the model already accounts for this, so it won’t bother us.
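Taken together, these hypotheses can be sketched in a few lines. The version below picks the first user definition (account ID only, anonymous users discarded) and uses sets of content IDs so that duplicated events collapse. Counting only clicks on content the user is known to have seen is an extra assumption of this sketch, not something the hypotheses state; the field names and sample events are hypothetical.

```python
from collections import defaultdict

def per_user_ctr(events):
    """Fraction of distinct content items each user clicked, out of those seen."""
    seen = defaultdict(set)
    clicked = defaultdict(set)
    for e in events:
        user = e.get("account_id")
        if user is None:
            continue  # hypothesis: anonymous users are discarded
        if e["event"] == "impression":
            seen[user].add(e["content_id"])     # sets absorb duplicates
        elif e["event"] == "click":
            clicked[user].add(e["content_id"])
    # Assumption: clicks without a matching impression are ignored,
    # so a per-user CTR can never exceed 100%.
    return {u: len(clicked[u] & items) / len(items) for u, items in seen.items()}

def service_ctr(events):
    """Unweighted average of all per-user CTRs."""
    rates = per_user_ctr(events)
    return sum(rates.values()) / len(rates) if rates else float("nan")

# Made-up events, including a duplicate click and an unmatched click.
EVENTS = [
    {"event": "impression", "account_id": "u1", "content_id": "a"},
    {"event": "impression", "account_id": "u1", "content_id": "b"},
    {"event": "click", "account_id": "u1", "content_id": "a"},
    {"event": "click", "account_id": "u1", "content_id": "a"},
    {"event": "impression", "account_id": "u2", "content_id": "a"},
    {"event": "click", "account_id": "u2", "content_id": "c"},
    {"event": "impression", "client_id": "c9", "content_id": "a"},
]

print(per_user_ctr(EVENTS))
print(service_ctr(EVENTS))
```

Swapping in a different user definition or a weighted average changes only a line or two, which is what makes the hypotheses easy to compare against each other.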

Whose approach is better?

Scientist’s.

What if you have plenty of users who saw just one piece of content and left? Their CTR will be 0%. If those users make up 90% of all users, the service-wide CTR, taken as the average of per-user CTR-s, can never exceed 10%, even though the service may be doing really well. It is just that the new marketing campaign attracts too many cheap yet uninterested users.
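The 10% ceiling follows directly from averaging per-user CTR-s, as this small check shows. The population of 100 users and the split into "bounced" and "engaged" groups are illustrative numbers only; the engaged users are given the best possible CTR of 100%.

```python
def service_ctr(per_user_ctrs):
    """Unweighted average of per-user CTRs."""
    return sum(per_user_ctrs) / len(per_user_ctrs)

bounced = [0.0] * 90   # saw one item, clicked nothing
engaged = [1.0] * 10   # best case: clicked everything they saw

best_case = service_ctr(bounced + engaged)
engaged_only = service_ctr(engaged)

print(best_case)     # ceiling with the bounced users included
print(engaged_only)  # the same service, measured on engaged users alone
```

Even with every engaged user clicking everything, the average cannot rise above 0.1, so a low number here says more about who the campaign brought in than about the service itself.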

What if plenty of anonymous users have low CTR because their impressions count as anonymous, while the clicks happen after they have created an account?

A scientist would answer these questions during her natural flow of working with data.

The “data model first” frame of mind ensures that the right questions will ask themselves. The “no time, grep away” approach will get to some numbers faster, but trusting that data to make business decisions would be a mistake since every once in a while those numbers would not make much sense.

Even if both the scientist and the admin have used the same set of underlying tools.

Remember: She who has asked the right question already knows half the answer.

Data science is about asking the right questions. The answers will then inevitably come along.
