An overview of Julia + Examples in exploratory data analysis

Daryl
9 min read · Jun 12, 2020


When I first chanced upon Julia, and read snippets about what it was capable of, I was downright sceptical.

“If it ain’t broke, don’t fix it.”

That one sentence pretty much summarised my sentiments.

Why would anyone want to pick up a new language, for no conceivable benefit?

Everyone and their grandmother was basically jumping on the Data Science bandwagon. Not a day went by when I didn’t see yet another person flaunting their new Python certificate on LinkedIn. (Though I am pretty sure that many of them, after a few tries, just dragged the “progress” bar all the way to the end and called it a day.)

Credits: Stack Overflow

R and Matlab were being shoved down our throats in the hallways of universities.

A quick Google search of these languages reveals that they have basically been around forever.

It is hard to believe that a dark horse could actually show up out of nowhere and displace them.

But as it turns out, a dark horse is indeed here.

This dark horse is Julia.

The Two Language Problem

Now, I will be proceeding with this explanation assuming that the average reader does not have an in-depth knowledge of computer architecture.

The two-language problem is a common issue that gets in the way of users who write code, especially in the realm of numerical analysis.

As a rule of thumb, the easier and more intuitive the language reads, the slower it runs.

Of course, it’s human nature to pick the path of least resistance. That is clearly the reason why Python is so popular: it’s not hard to pick up.

If a code block takes a full 5 seconds to run, it’s no big deal when the results are only required for a presentation that takes place tomorrow.

However, as you will see soon, there are indeed cases where a delay of several seconds is completely unacceptable.

There are also calculations with such an astronomically high number of permutations to compute that languages like Python would probably take months to process them.

Hence the continued demand for languages such as C, C++ and Fortran, which can process these calculations in a satisfactory length of time.

These are compiled languages that run at much higher speeds. However, the process of writing and compiling code in them is far more onerous and difficult.

https://link.springer.com/book/10.1007/978-1-4471-2736-9

In recent years, though, there have been attempts to bridge this divide via packages like Cython.

How Julia solves the problem

“Walks like Python, Runs like C”

A very apt summary of the language.

1. Convenience

The user interface functions like that of a scripting language.

It uses a REPL (read–evaluate–print loop) console: an interface that takes the input, evaluates it and returns the results to the user.

“To all appearances, using Julia is like coding in Python: type a line, get a result. But in the background, the code is compiled. Consequently, the first time a function is keyed in, it might be slow, but subsequent runs are faster. And once the code is working correctly, users can optimize it.”

-Nature Magazine
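The compile-on-first-call behaviour described in the quote can be observed with a rough sketch like the one below. The function `sum_of_squares` is a made-up example, not something from the Nature article, and the timings will vary from machine to machine, so treat this as an illustration rather than a benchmark.

```julia
# Hypothetical helper for illustration: sums the squares 1^2 + 2^2 + ... + n^2.
function sum_of_squares(n)
    total = 0
    for i in 1:n
        total += i^2
    end
    return total
end

# The first call includes the time Julia spends compiling the function.
first_call = @elapsed sum_of_squares(10_000)

# The second call reuses the already compiled code.
second_call = @elapsed sum_of_squares(10_000)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

On most setups the second number should come out noticeably smaller than the first, which is the "slow the first time, faster afterwards" effect.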

2. Unicode characters and LaTeX functionality

Just imagine the convenience brought about by the inclusion of Unicode characters in Microsoft Word & Excel.

Think how much harder your work would be, if you couldn’t add in
Δ (delta), θ (theta), Σ/σ (Sigma), Ψ (Psi), into documents and had to spell them out.

These symbols are now a part of the Julia syntax.
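As a small sketch of what this looks like in practice: in the Julia REPL or in Jupyter, typing a LaTeX-style name such as \sigma followed by the Tab key produces the symbol. The variable names and values below are made up purely for illustration.

```julia
# Greek letters are valid variable names (typed as \sigma<Tab>, \Delta<Tab>, \theta<Tab>).
σ = 2.5
Δx = 0.1
θ = π / 4            # π is a built-in constant

# Unicode also works inside function definitions.
circle_area(r) = π * r^2

# Some mathematical operators are available as symbols too.
is_member = 2 ∈ [1, 2, 3]   # same as `2 in [1, 2, 3]`
root = √16                  # same as `sqrt(16)`

println(circle_area(1.0))
println(is_member)
println(root)
```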

LaTeX functionality

It is worth pointing out, though, that the inclusion of LaTeX features is no longer new: R, Matlab, Python and Mathematica all have provisions for it.

But nevertheless, the fact that Julia has LaTeX should placate the worries of some users, who are afraid of losing functionality if they were to switch over.

3. S P E E D and S C A L E

Imagine the Flash when he straps on the Tachyon device

The Petaflop club

A name to describe the languages that have run applications at Petaflop speeds: in excess of One Quadrillion floating-point calculations per second.

(For perspective, One Quadrillion = One Thousand Trillion)

For a long time, the Petaflop club was occupied by 3 languages.

C, C++ and Fortran.

You guessed it.

The newest addition to the family is Julia.

The Julia blog does a wonderful job of chronicling the growth of the language and the various users that use and grow its libraries.

The Celeste Project: A consortium of researchers from UC Berkeley, Harvard, MIT, Intel, and other alphabet soup research labs using Julia to catalog the deluge of astronomical data from the Sloan Digital Sky Survey.

Galactic superclusters in the observable universe

Chaotic Dynamics

Poincare Maps

Quantitative Finance

An interesting addition here for sure. It’s not every day that you see financial institutions at the forefront of numerical computing.

But it turns out that BlackRock’s quants have been hard at work these past few years.

BlackRock has an in-house analytics division called Aladdin, an acronym for Asset, Liability, Debt and Derivative Investment Network.

What are we still waiting for?

With this realisation, it’s high time we got our feet wet and familiarised ourselves with Julia.

Exploratory Data Analysis in Julia

Yet another convenient fact about Julia is that it can be run in many different IDEs. Since I mostly conducted data analysis in Python in the past, I decided to run it in Jupyter Lab.

Do follow this guide to set up Julia within a Jupyter environment.

Though if you want a professional level IDE, feel free to skip Jupyter and familiarise yourself with Juno instead.

Since the focus of this post is on usage of Julia itself, we shall be picking a very simple dataset in which to conduct our EDA.

The Kaggle Titanic Dataset.

using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
using CSV, DataFrames
# Recent CSV.jl versions take the output type (DataFrame) as the second argument
df = CSV.read("train.csv", DataFrame; normalizenames = true)
summary(df)     # "891×12 DataFrame"
describe(df)
first(df, 5)

Notice that by default, Julia omits displaying many columns in between.

show(first(df, 5), allcols = true)     # Force display of all columns

An example of a dataframe in Julia

Downloading the relevant packages for plotting and data manipulation.

Pkg.add("StatsBase")
Pkg.add("StatsPlots")
Pkg.add("Gadfly")
Pkg.add("Plots")
using StatsBase
using StatsPlots
using Gadfly
using Plots
# Importing relevant packages for data manipulation
# Adding packages is highly similar to R.

Splitting the passengers by various criteria to get a brief rundown.

(Survival status, Gender, Ticket class)

countmap(df[!, :Survived])     # Survivors vs non-survivors
# Dict{Int64,Int64} with 2 entries:
#   0 => 549
#   1 => 342

countmap(df[!, :Sex])          # Gender distribution of passengers
# Dict{String,Int64} with 2 entries:
#   "male" => 577
#   "female" => 314

countmap(df[!, :Pclass])       # Distribution by ticket class
# Dict{Int64,Int64} with 3 entries:
#   2 => 184
#   3 => 491
#   1 => 216
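For readers wondering what countmap does under the hood, here is a rough sketch of the same tallying idea written with only the Julia standard library. The function `count_values` and the sample vector are hypothetical stand-ins made up for illustration; they are not how StatsBase implements it.

```julia
# Hypothetical stand-in for StatsBase.countmap, using Base only:
# walk through the items and keep a running tally per distinct value.
function count_values(items)
    counts = Dict{eltype(items), Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Made-up sample, not the Titanic column.
sample = ["male", "female", "female", "male", "male"]
counts = count_values(sample)

println(counts)
```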

Plotting their distribution via the Gadfly package

Plotting survival numbers against gender

p1=Gadfly.plot(df, x=:Sex, y=:Survived, color="Survived", Geom.histogram(position=:dodge),Scale.color_discrete_manual("red","green"))

Survival number against ticket classification

p2=Gadfly.plot(df, x=:Pclass, y=:Survived, color="Pclass", Geom.bar(), Scale.color_discrete_manual("blue","purple","yellow"))

Survival against ticket price.

p3=Gadfly.plot(df, x=:Fare, y=:Survived, color="Survived", Geom.histogram(), Scale.color_discrete_manual("blue"))

Data manipulation functions.

Identify columns where > 10% of the rows contain missing data.

filter(c -> count(ismissing, df[:, c]) / size(df, 1) > 0.1, names(df))
# 2-element Array{String,1}:
#   "Age"
#   "Cabin"
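The one-liner packs a few steps together: count the missing entries in a column, divide by the number of rows, and keep the column name if the fraction exceeds 0.1. The sketch below spells this out on made-up data stored in a plain Dict instead of a DataFrame; `missing_fraction` is a hypothetical helper name.

```julia
# Made-up columns for illustration, not the Titanic data.
columns = Dict(
    "Age"  => [22, missing, 26, missing, 35, 54, 2, 27, 14, 4],
    "Fare" => [7.25, 71.28, 7.93, 53.1, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
)

# Hypothetical helper: fraction of entries in a column that are missing.
missing_fraction(col) = count(ismissing, col) / length(col)

# Keep the names of columns where more than 10% of the rows are missing.
sparse_columns = sort([name for (name, col) in columns if missing_fraction(col) > 0.1])

println(sparse_columns)
```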

Deselecting columns where > 10% of the rows contain missing data.

# select! modifies df in place, so df1 and df refer to the same DataFrame
df1 = select!(df, Not([:Age, :Cabin]))
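One detail worth knowing here is the difference between select and select! in DataFrames: the version without the exclamation mark returns a new table and leaves the original alone. The sketch below shows this on a made-up three-row table, assuming the DataFrames package is installed; the table `toy` and its values are purely illustrative.

```julia
using DataFrames

# Made-up table for illustration, not the Titanic data.
toy = DataFrame(Name = ["A", "B", "C"],
                Age  = [22, missing, 35],
                Fare = [7.25, 71.28, 8.05])

# select returns a new DataFrame; toy keeps all of its columns.
trimmed = select(toy, Not(:Age))
toy_still_has_age = "Age" in names(toy)

# select! drops the column from toy itself.
select!(toy, Not(:Age))

println(names(trimmed))
println(names(toy))
```

If you want to keep the original data around for later comparison, select is the safer choice.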

Displaying the updated DataFrame

show(first(df1,1),allcols=true)

A quick look at the Machine Learning models in Julia via the “MLJ” package.

True to Julia’s purpose as a scientific computing language, MLJ gathers Machine Learning models from many packages in one single place.

Pkg.add("MLJ")
using MLJ
models()

As of the time of writing, models() lists 133 different Machine Learning models in the MLJ registry.

Once again, many are omitted in the display.

After Action Review and takeaways from this EDA task.

1. The Julia ecosystem is still under-developed and rapidly changing.

When you run into errors and try to search for answers, it is very common to not be able to find them. The user base is still rather small and it is not like Python where you can be sure that nearly every conceivable error has been encountered before and solved.

The rapidly changing ecosystem also means that functions are being rewritten and modules are being updated. What worked two weeks ago may not work today. And what worked previously may have been just a hardcoded, temporary stopgap measure.

(TL;DR It was a pain!)

2. Julia was not meant to be a language for automation and general-purpose scripting.

Broadly speaking, you will find less support for “everyday” tasks such as a code script to log into your Office Intranet, or sorting information in Microsoft Excel CSV files.

When displaying the dataframe earlier, we saw that by default, it eliminated many columns in between, unless specifically instructed to display all.

In scientific big data and numerical analysis, rows often number in the millions. It makes sense to omit most of the data from the display by default, rather than flood the screen.

3. The language is truly targeted toward the Scientific Computing community.

A quick glance at the software packages and documentation updates in the Julia blog and you see updates heavily geared toward the scientific community.

Multi-threaded parallelism

Graph analytics

Generative Adversarial networks

Is Julia a worthwhile addition to your programming toolbox?

For the average user, probably not.

But if your work involves crunching truly gigantic datasets, mathematical optimization, parallel processing, differential equations, then Julia is most definitely the language for you.

More projects to come in Julia soon!

Daryl

Graduated with a Physics degree, I write about physics, coding and quantitative finance.