Thursday, November 14, 2024

21st Century Data Science

When I got to Princeton in 1976 (I graduated in 1980), the stats department was already using APL to teach stats. I wasn't taking any stats but was looking over the shoulders of classmates, and for sure APL looked interesting, so I tackled that. APL = A Programming Language, by Kenneth Iverson, then at Harvard.

Nowadays I'm in that same realm, laying foundations in basic Python for ascending the data science mountain, a metaphor I'll use, and one reminiscent of the "Calculus Mountain" I'd often decry on Math Forum, à la Andrew Hacker's The Math Myth. Calculus Mountain is abused by admin cullers to separate people from their dreams.

So couldn't Data Science Mountain be used just as abusively? Surely it could be. Depends on the school, the curriculum. My approach is to encourage many options, many pathways, posed to a consumer willing to self educate (that's idealistic, I realize). 

If a high schooler prefers a discrete math route, such as statistics, abet that with more number and group theory, some programming, and accredit these cohorts as college ready. If their discipline requires calculus, let the discipline teach it the way it needs its students to learn it, as it comes in many flavors.

In retrospect, tackling programming has become closer to taking up an instrument, a more broadly spread skill. In this case you're the music composer, with ways of making the music loop (repeat) or flow conditionally and interactively.

The musician is in the composer role in other words, whereas the instrument plays itself (like a player piano), at superhuman speeds.

And why should only electrical or electronics engineers learn the flute? 

Obviously the question is rhetorical, the point being there's no reason stats can't be an entrée into computer programming, and in fact stats has provided such a doorway ever since mainframes serving universities became available. The big corporations needed to keep churning out future staff. They would make their donations in the form of hardware. The PC revolution followed the same pattern, with many schools receiving tax deductible equipment donations.

Princeton had APL terminals scattered all over, including in my dorm (Princeton Inn) and at the Firestone Library. I taught myself APL, and much later J, also with Dr. Iverson on the team. I was following from Portland, writing web pages on that language, some of which Iverson saw as he helped spot some typos.

APL and J are what we now call "array-based languages", a family that also includes R and BQN. Operations are performed on n-dimensional arrays, treating the latter as atomic, whereas in conventional computer languages, one has to write busy looping code to hit all the cells in the matrix, one by one.

The Python community got busy a while back adding array-based packages, such as numpy (Travis Oliphant, 2005) and pandas (Wes McKinney, 2008), which allowed it to stay competitive in the APL-friendly array-based arena. 
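To make the contrast concrete, here is a minimal sketch, assuming numpy is installed. The little matrix and the variable names are made up for illustration. The same doubling operation is written twice: once with the busy looping code, once treating the array as a single operand.

```python
import numpy as np

# A small 2 x 3 matrix, made up for illustration.
matrix = [[1, 2, 3],
          [4, 5, 6]]

# Conventional style: nested loops visit every cell, one by one.
doubled_loop = []
for row in matrix:
    new_row = []
    for cell in row:
        new_row.append(cell * 2)
    doubled_loop.append(new_row)

# Array style: the whole array is treated as a single operand.
arr = np.array(matrix)
doubled_array = arr * 2

# Reductions along an axis need no explicit loop either.
column_sums = arr.sum(axis=0)   # sums down each column
```

Both routes give the same numbers. The array version simply leaves the looping to the library.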

Python might seem less pithy than say APL or BQN, as it sticks with the ascii-qwerty keyboard and prefers a more English wordy syntax, even though Guido himself (Python’s inventor) is a Dutchman. 

My students come through Python to the tabular data structure of rows and columns we're so used to from ancient times. Ledgers of rows and columns are in no way a new invention. Computer code becomes a way of automating the job of ledger keepers tasked with keeping those ledgers accurate and up to date, once their composer-designer has orchestrated the perfect system (I’m being idealistic again).

Equipped with these tools, a data scientist learns to lay pipe (metaphoric pipe) from where the sluices open, allowing raw data to flow in, through successive shaping, cleaning, cutting and patching operations that transform said data into something pristine, polished and suitable for the gods.
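A short stretch of such pipe might look like the following sketch, assuming pandas is installed. The table of city temperatures and the function name clean are hypothetical, invented only to show the shaping, cleaning and cutting steps chained one after another.

```python
import pandas as pd

# Raw readings as they might arrive: made-up data with a duplicate row,
# a missing value, and inconsistent capitalization.
raw = pd.DataFrame({
    "city": ["Portland", "portland", "Seattle", "Seattle", "Boise"],
    "temp_f": [54.0, 54.0, None, 50.0, 47.0],
})

def clean(df):
    """Shape, clean, and cut the raw table, one pipe segment at a time."""
    return (
        df.assign(city=df["city"].str.title())   # normalize capitalization
          .drop_duplicates()                     # drop repeated rows
          .dropna(subset=["temp_f"])             # cut rows with no reading
          .reset_index(drop=True)                # renumber the surviving rows
    )

tidy = clean(raw)
```

Each method call returns a new table, which is what lets the steps chain like pipe segments. A real pipeline would have more segments, and patching steps too, but the pattern is the same.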

The gods, in this case, are the model makers of Machine Learning, where the models are like golden eggs, crystal balls, magic flutes that predict (with some likelihood) the future. Data science is about inferring and predicting, and also about sensing and measuring. How small a data sample might I get away with, and still reliably track the action? That question becomes a topic for deep analysis.
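One way to get a feel for the sample size question is by simulation. The sketch below uses only the standard library. The bell-curve "population" is invented, and the function name spread_of_sample_means is my own. It draws many samples at each size and measures how much their averages wander.

```python
import random
import statistics

random.seed(2024)  # fixed seed so the run is repeatable

# A made-up "population" of 100,000 measurements.
population = [random.gauss(100, 15) for _ in range(100_000)]

def spread_of_sample_means(sample_size, trials=200):
    """Draw many samples of one size; report how much their means vary."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Compare three sample sizes, each ten times the last.
spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}
```

The spread should shrink as the samples grow, roughly in proportion to the square root of the sample size, so a hundredfold bigger sample buys only about a tenfold steadier estimate. That trade-off is where the deep analysis starts.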

Where data science meets machine learning is where Artificial Intelligence gets much of its nutrition, in terms of achieving practical results, such as when performing text to voice, voice to text, text to graphics, text to video, and extruding synthetic suggestive strands from the LLMs. 

Those famous deep learning neural network algorithms fit in here. Welcome to Hilbert Space.  

Data science is as much about data visualization (showing what’s so) as about stochastic extrapolation (predicting what’s next). Data science is about providing dashboards, often updated in real time, meaning a set or combination of instruments sufficient to monitor and perhaps influence or control a situation (I’m bundling the car steering wheel and pedals in with the dashboard instruments).


Data science is largely about anticipating the future based on the intelligent leveraging of what's known about the past.

In my Heuristics for Teachers regarding my Silicon Forest Digital Math, the data science stuff mostly fits into my Casino Math, one of four realms. 

Casino Math is about risk, taking a gamble, rolling the dice, and developing winning strategies, designing games and simulations, imparting best practices and training chief risk officers.
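A first Casino Math exercise might pair exact odds with a simulation, as in this standard-library sketch. The choice of game (rolling a 7 with two dice) and the variable names are mine, picked only for illustration.

```python
import random
from fractions import Fraction
from itertools import product

# Exact odds: enumerate all 36 equally likely outcomes of two dice.
outcomes = list(product(range(1, 7), repeat=2))
exact = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))

# Simulated odds: roll the dice many times and count the sevens.
random.seed(7)  # fixed seed so the run is repeatable
rolls = 100_000
hits = sum(
    1 for _ in range(rolls)
    if random.randint(1, 6) + random.randint(1, 6) == 7
)
estimate = hits / rolls
```

The enumeration gives the exact answer of one in six, and the simulated estimate should land close to it. Comparing the two is a gentle way to introduce how much trust a simulation deserves.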

The Silicon Forest is in the North American Pacific Northwest where casinos play an important role in the economy. Next to Casino Math we have Supermarket Math, Martian Math and Neolithic Math.

What's true about learning basic programming is one cannot help but butt up against well-worn math topics such as: sets, set operations, number sequences (rule based), random numbers, primes, Fibonacci numbers, cryptography, the web, history of the internet, and so on.
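Several of those topics can show up in a single short exercise. The following sketch uses only the standard library, and the function names are my own. It builds a rule-based sequence, tests for primes, and uses a set operation to find where the two overlap.

```python
from itertools import islice

def fibonacci():
    """Yield Fibonacci numbers forever: 0, 1, 1, 2, 3, 5, ..."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def is_prime(n):
    """Trial division: fine for small n, far too slow for cryptography."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

fibs = set(islice(fibonacci(), 15))               # first 15 Fibonacci numbers
primes = {n for n in range(100) if is_prime(n)}   # primes below 100

fib_primes = sorted(fibs & primes)                # set intersection
```

Here fib_primes comes out as [2, 3, 5, 13, 89], the Fibonacci numbers below 100 that are also prime.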

A data scientist is someone becoming fluent with a lot of lore, especially as the discipline becomes increasingly about representing data sets geospatially and considering Planet Earth the relevant display object. 

Landlubber math gets augmented with lat/long and spherical trig. The data scientist is likewise a geographer, taking advantage of what GIS/GPS has to offer.
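As a taste of that spherical trig, here is a sketch of the haversine formula for great-circle distance, using only the standard library. It assumes a perfectly spherical Earth with a mean radius of 6371 km. The function name and the rounded city coordinates are mine, for illustration only.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; the real Earth is slightly oblate

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two (lat, long) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Approximate coordinates, for illustration only.
portland = (45.52, -122.68)
princeton = (40.35, -74.66)

distance = great_circle_km(*portland, *princeton)
```

A handy sanity check: two points on the equator 90 degrees of longitude apart should be a quarter of the way around the globe, which is the radius times pi over two.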