If you're a data scientist and need to analyze loads of CSV files for insights into, say, stock-price and market movements, the Julia programming language trumps machine-learning rivals Python and R, according to Julia supporters.
Julia is a young language with roots in MIT's Computer Science and Artificial Intelligence Lab (CSAIL).
Some languages, such as Rust, aren't widely used by developers, but programmers appreciate them for qualities that suit systems programming rather than application programming. For example, Microsoft is looking to Rust for the memory-safety features lacking in C and C++, which are extensively employed in Windows and other Microsoft projects.
Julia, on the other hand, has been adopted by some programmers for its C-like speed, but it has a much smaller ecosystem of packages than Python.
According to Deepak Suresh, a machine-learning engineer at Julia Computing, multithreading capabilities give Julia libraries an advantage over both machine-learning rivals across a range of datasets read from CSV (comma-separated values) text files.
Suresh has benchmarked three CSV parsers: fread from the statistical programming language R, Pandas' read_csv for Python, and Julia's CSV.jl. He reckons that Julia comes out on top.
"Julia's CSV.jl is 1.5 to 5 times faster than Pandas even on a single core; with multithreading enabled, it is as fast or faster than R's read_csv," he notes.
The benchmarks were carried out on a machine with Ubuntu 18.04 powered by an Intel Xeon Silver 4114 processor running at 2.20GHz.
As he explains, Julia's CSV.jl is the only tool that is "fully implemented in its higher-level language rather than being implemented in C and wrapped from R/Python".
The benchmarks are meant to demonstrate the speed of loading data in Julia and also indicate the performance of Julia code during data analysis.
One of the example benchmarks looks at Apple stock price states – open, high, low and close – using a 2.5GB dataset with 50 million rows and five columns.
"The single threaded CSV.jl is about 1.5 times faster than R's fread from data.table. With multithreading CSV.jl is about 22 times faster. Pandas' read_csv takes 34s to read, this is slower than both R and Julia," Suresh declares.
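For readers who want a feel for what a read-time measurement like this involves, here is a minimal sketch in Python. It is not Suresh's benchmark code: it uses only Python's standard-library csv module rather than Pandas, writes a small synthetic five-column price file (the column names and the row count are assumptions made for illustration), and reports the best wall-clock time over a few full parses.

```python
import csv
import os
import random
import tempfile
import time


def write_synthetic_ohlc(path, n_rows, seed=0):
    """Write a hypothetical five-column file: timestamp, open, high, low, close."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "open", "high", "low", "close"])
        price = 100.0
        for i in range(n_rows):
            open_ = price
            # Random walk, clamped so the synthetic price stays positive
            close = max(1.0, open_ + rng.uniform(-1.0, 1.0))
            high = max(open_, close) + rng.uniform(0.0, 0.5)
            low = min(open_, close) - rng.uniform(0.0, 0.5)
            writer.writerow(
                [i, f"{open_:.4f}", f"{high:.4f}", f"{low:.4f}", f"{close:.4f}"]
            )
            price = close


def time_csv_read(path, repeats=3):
    """Return (best wall-clock seconds, data-row count) over several full parses."""
    best = float("inf")
    n_rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            n_rows = 0
            for row in reader:
                # Convert the four price fields so numeric parsing is timed too
                _ = [float(x) for x in row[1:]]
                n_rows += 1
        best = min(best, time.perf_counter() - start)
    return best, n_rows


def run_demo(n_rows=10_000):
    """Create a temporary synthetic file, time reading it, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_synthetic_ohlc(path, n_rows)
        return time_csv_read(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    seconds, rows = run_demo()
    print(f"parsed {rows} rows in {seconds:.4f}s")
```

Taking the best of several runs is one common way to reduce noise from disk caching and other processes. The numbers this sketch produces are not comparable with the figures quoted in the article, which come from a 2.5GB file and different parsers.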
Another looks at performance with a mortgage risk dataset from Kaggle, the Google-owned data-science platform. It is a mixed-type dataset with 356,000 rows and 2,190 columns.
"Pandas takes 119s to read in this dataset. Single-threaded fread is about twice faster than CSV.jl. However, with more threads Julia is either as fast or slightly faster than R," says Suresh.
Another is the acquisition dataset from US mortgage lender Fannie Mae, which has four million rows and 25 columns.
"Single-threaded data.table is 1.25 times faster than CSV.jl. But, the performance of CSV.jl keeps increasing with more threads. CSV.jl gets about 4 times faster with multi-threading," he says.
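The ratios in that last quote can be put on a common scale with a little arithmetic. The sketch below assumes that "about 4 times faster" means relative to single-threaded CSV.jl, which the quote does not state outright, and it normalises single-threaded CSV.jl to one time unit because no absolute timings are given for this dataset. The function name and its parameters are made up for illustration.

```python
def relative_read_times(csvjl_single=1.0, datatable_edge=1.25, threading_gain=4.0):
    """Express the quoted ratios as read times relative to single-threaded CSV.jl.

    Assumption: threading_gain is measured against single-threaded CSV.jl.
    """
    datatable_single = csvjl_single / datatable_edge  # 1.25x faster than CSV.jl
    csvjl_multi = csvjl_single / threading_gain       # 4x faster than itself
    return {
        "csvjl_single": csvjl_single,
        "datatable_single": datatable_single,
        "csvjl_multi": csvjl_multi,
        # How much faster multithreaded CSV.jl is than single-threaded data.table
        "csvjl_multi_vs_datatable_single": datatable_single / csvjl_multi,
    }


if __name__ == "__main__":
    for name, value in relative_read_times().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, single-threaded data.table takes 0.8 units and multithreaded CSV.jl takes 0.25 units, making it roughly 3.2 times faster than single-threaded data.table. The quote gives no multithreaded data.table figure for this dataset, so the sketch does not attempt that comparison.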