Data Science: A treatise about love.

There is a book called 'What We Talk About When We Talk About Love', and the gist of it is that when we are nervous about something that is important to us, we don't really seem to make much sense.  We talk in metaphor, we speak in third person, we wonder out loud about the weather.  We fuss over whether you are dressed warmly enough.

I am a curious person by nature.  I follow the strings. It's my job.  I am a data scientist, and for a living I take this organism of a project or an organization or a business, as represented by its database, and I convince it to give me its secrets.  I love what I do immensely. I usually have three or four databases that I am playing with outside of work, and over the years they have grown into a collection, much in the same way one would collect pieces of art.  Many of them I have built myself.

When you work with data, there are conventions of respect.  If data is sentient, and I believe that to varying extents it is, there are ways that you demonstrate respect to the sentient being entrapped by your database engine.  If data were a person, he or she would be an old-fashioned romantic type, very much like the Oracle in The Matrix.  It likes it when you open the door, bring it flowers or cook it dinner.  Except since this is data, there are different ways you would go about it.

You set permission levels correctly so that it is safe.  You treat it with only SELECT statements in the beginning, while you get to know it.

And just like in the science fiction movies, whose conventions are eerily similar to courting rituals, there are gates you must pass through in order to access the Data's deepest secrets. 

Passing through the gates

The gate that is so fundamental it's not considered a gate is the zero point.  The connection.  Do you have permission?  Can you get on board?  This point zero is harder than one would think, and is never really about the individual player.  This is the step where the organism vets you and determines whether to give you passage at all.

One would be surprised how difficult this initial handshake can be.  I was at a job I loved once, and after six weeks I was still trying to get SSH access to the server.  It was heartbreaking and awkward, but I had no choice other than to leave and wish them well.

After gaining access, the next step is learning the data architecture, if there is one. I query the schema tables and print them out.  I like to look at them on paper, and mark the field definitions with notes.
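Querying the schema can be sketched as follows. This is only an illustration with Python's built-in sqlite3 module, where the catalog lives in `sqlite_master` and `PRAGMA table_info`; many other engines expose the same information through `INFORMATION_SCHEMA` views instead. The `customers` and `orders` tables and the `describe_schema` helper are hypothetical.

```python
import sqlite3

# Hypothetical two-table database to inspect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
""")

def describe_schema(conn):
    """Return {table_name: [(column_name, declared_type), ...]}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in cols]
    return schema

schema = describe_schema(conn)

# A plain-text listing that can be printed and marked up on paper.
for table, columns in schema.items():
    print(table)
    for name, col_type in columns:
        print(f"  {name}: {col_type}")
```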

I have used noSQL, but (other than Mongo) I do not necessarily trust it.  It is what happens when a relational database management system becomes Oedipal and kills its own father: its very method of inception. If databases were not relational, they would not have fathered the rebel noSQL. Key/value pairs have a limit, even if you have more processing power than you know what to do with.  I may be old-fashioned, but elegance and form are important. Structure.

In any case we were talking about gates. 

Engaging the Borg

Once you get a sense of the architecture, it's time to engage it.  We think of data scientists as atheists by default, but this process of engagement borders on the sacramental, a genuflecting of sorts.  We have learned the structure of this system, and the first complex query that is appropriately answered is the first really intelligent interaction with the data.  It fuels a sort of fire, and usually within a short time the questions that have been building up get answered in a flood of inspired SELECT queries.

JOINs are important in this phase, particularly for many-to-one relationships: they make the data more readable early on (as opposed to seeing a grid full of ID references in foreign key fields).

At least that is how it is for me.
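The readability point about many-to-one JOINs can be shown with a small sketch, again using Python's built-in sqlite3 module. The tables, names and amounts here are hypothetical; the only thing the example is meant to show is the difference between a grid of foreign key IDs and the same rows with the referenced name joined in.

```python
import sqlite3

# Hypothetical data: many orders point to one customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders (id, customer_id, total) VALUES
        (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# Without the JOIN: a grid of ID references.
raw = conn.execute(
    "SELECT id, customer_id, total FROM orders ORDER BY id").fetchall()

# With the JOIN: each foreign key is resolved to the name it points to.
readable = conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

print(raw)
print(readable)
```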

Presumably the organization that hired me already had a way that it got business intelligence from its datasets.  And, presumably, they are hiring a data scientist or analyst because they have a sneaking suspicion that there is more that they could be told, if they only understood the correct questions to ask.

The questions not yet asked

And this is the gate that separates the worthy from the unworthy.  This gate, and the ones that follow it, are not visible to the naked eye.  If you did not know they were there, you might miss them altogether.

It's like the infamous game of 'SET' where you know that the chances of there not being a set in this grid are extremely low, but you pull another card anyhow because it's been an uncomfortable amount of time since someone yelled 'SET!' and pulled their three cards.

My point is that the 'SET' is there.  The answers you are looking for are most often there, ladies and gentlemen; they just need to be teased out of the matrix.  A condition of finding them is believing they are there.

From here on out, there are other gates, but their appearance is a condition of your performance at the last gate.  A gate isn't always an on or off expression of Booleanity.  There are gradients and subtleties to some of them.   A good data scientist knows that there is a process. When problems are wider than your perception, there are methods to widen your vision.  Usually the key is down time: when the questions are taken home and rediscovered in the shower, or while making coffee.  The metaphorical apple falls and hits Newton on the head, and he thinks he has discovered gravity, and gives it its own 'constant'.  Never mind that gravity is old enough to have its own hieroglyph.

But even gravity, as obvious as it seemed as its own force in the 16th and 17th Centuries, may turn out hundreds of years later to be a function of the electromagnetic force.

Science is a moving target, and so is your data. 

Changing your data: the razor's edge

As the limits of your current concept of yourselves as a project experience a sort of punctuated evolution, you might need to modify the database fields.  This should be approached with much caution and respect, and should never be taken lightly.  It might be years before the fields of data you added are 'quanta'-full enough to give you intelligent responses. Remember that intelligence destroys knowledge to gain wisdom.

One should manage and consider these new fields and data as one would an orchard.  There will be a time of unfruitfulness, but the eventual first and subsequent harvests will be more than worth the wait.

And so, instead of talking about the weather, or speaking in third person or in metaphor, when it comes to data you really want to be able to speak directly into the machine, into the mind of your organization.

No shuffling of feet or 'little white lies'.  If your data has been collected with integrity, it can have the equivalent tolerances of an electron microscope for your organization.  The more airtight your data becomes, the more you have to balance looking very closely for patterns at all scales against the danger of not 'seeing the forest for the trees'.

In that way, it becomes a good practice to create a variety of different 'views' that look at both the macro and the micro.  You should be able to correlate the information/wisdom you are getting at all scales.  You can also use these views as scaffolding for other queries, and so you should think of them as part of your foundation.
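As a rough sketch of macro and micro views, here is one possible arrangement in Python's built-in sqlite3 module. Everything in it is hypothetical (the `sales` table, the view names, the figures); it only illustrates a detail-level view, a summary view built on top of it as scaffolding, and a quick check that the two scales agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw table.
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY, region TEXT, sold_on TEXT, amount REAL
    );
    INSERT INTO sales (region, sold_on, amount) VALUES
        ('north', '2021-01-05', 100.0),
        ('north', '2021-01-20', 50.0),
        ('south', '2021-01-11', 70.0),
        ('south', '2021-02-02', 30.0);

    -- Micro view: one row per sale, with the month derived for grouping.
    CREATE VIEW sales_detail AS
        SELECT id, region, substr(sold_on, 1, 7) AS month, amount
        FROM sales;

    -- Macro view: built on the micro view, so the two cannot drift apart.
    CREATE VIEW sales_by_region_month AS
        SELECT region, month, SUM(amount) AS total, COUNT(*) AS n_sales
        FROM sales_detail
        GROUP BY region, month;
""")

macro = conn.execute("""
    SELECT region, month, total, n_sales
    FROM sales_by_region_month
    ORDER BY region, month
""").fetchall()

# Correlate the scales: the grand total must match at both levels.
micro_total = conn.execute("SELECT SUM(amount) FROM sales_detail").fetchone()[0]
macro_total = conn.execute(
    "SELECT SUM(total) FROM sales_by_region_month").fetchone()[0]

print(macro)
print(micro_total, macro_total)
```

Building the summary view on the detail view, rather than on the raw table, is a design choice: any correction made at the micro level flows up to the macro level automatically.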

Presentation

I give this its own header because it is often forgotten.

While the geeky types like to talk in a way that seems itself encrypted, data will have the most impact if its patterns can be made visible: R, PowerBI, Tableau, etc. help tell the story of data so that it can be grokked.

There is a misconception that, given a simple data set that can be defined by two axes, there are many ways to display it.  I would disagree with that to some extent, and implore folks to be more choosy.  Where patterns are concerned, there are always some types of displays that demonstrate those patterns in a way that is superior to the others.

I seek them out.  I usually start with five different ways of displaying things, and then cull in a last pass to get down to three winners.  Sometimes logarithmic scales are invoked for added punch, but I think they are a bit dishonest.

Conclusion

If you have read this far, I want to reward you with a thought.  These days we hear many people saying that we should collectively 'TRUST THE SCIENCE'.

I actually would change that and offer a new rule: 'TRUST THE DATA'.

Science is corruptible by what is being funded, who happens to get along with whom, and who is singing the right song to the right tune.  It is administered in the interests of whoever is funding it.

Data is collected, much of the time, in a way that is automated and passive.  If it is collected well, it can be trusted immediately. It is democratically available to anyone with a spreadsheet program.

If data is accurate and unforged, it is more responsive and reliable than the practice of science, which is corruptible and much of the time has been corrupted.

My $0.02.