(Some tips on how to use Java if you’re from R or Python; some thoughts on software platforms and programming for data-science-or-whatever-we-call-it-now.)
Most of my research these days uses Python, R, or Java. It’s terrific that so many people are using Python and R as their primary languages now; this is way better than the bad old days when people overused Java just because that’s what they learned in their intro CS course. Python/R are better for many things. But fast, compiled, static languages are still important[1], and Java still seems to be a very good cross-platform approach for this[2], or at the very least, it’s helpful to know how to muck around with CoreNLP or Mallet. I think in undergrad I kept annoying my CS professors by insisting that we needed to stop using Java and do everything in Python, but I honestly think we now have the opposite problem: I’ve met many people recently who do lots of programming without traditional CS training (e.g. from the natural sciences, social sciences, statistics, humanities, etc.), who need to pick up some Java but find it fairly different from the lightweight languages they first learned. I don’t know of good overall introductions to the language for this audience, but here’s a little bit of information about development tools that make it easier.
Unlike R or Python, Java is really hard to program with just a text editor. You have to import tons of packages to do anything basic, and the names for everything are long and easy to misspell, which is extra bad because it takes more lines of code to do anything. While it’s important to learn the bare basics of compiling and running Java from the command line (at the very least because you need to understand it to run Java on a server), the only good way to write Java for real is with an IDE. An IDE is more complicated than a text editor, but once you get the basics down it is much more productive for most things. In many ways Java is an outdated, frustrating language, but Java plus an IDE is actually pretty good.
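To make the verbosity concrete, here is a minimal sketch of a word counter, something that is roughly a one-liner with `collections.Counter` in Python. The class name `WordCount` and the method `countWords` are made up for illustration; only the standard library is used.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class; the names here are illustrative assumptions.
class WordCount {

    // Counts how often each whitespace-separated token occurs (case-insensitive).
    static Map<String, Integer> countWords(String text) {
        // TreeMap keeps the keys sorted, so the printed output is stable.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // merge() inserts 1 for a new key, otherwise adds 1 to the old count.
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Use the command-line arguments as the text if any were given.
        String text = args.length > 0
                ? String.join(" ", args)
                : "the cat saw the other cat";
        System.out.println(countWords(text));
    }
}
```

Saved as `WordCount.java`, this compiles with `javac WordCount.java` and runs with `java WordCount`. Even at this size there are two imports and several long type names to get exactly right, which is where an IDE's autocompletion starts to pay off.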
The two most popular IDEs seem to be Eclipse and IntelliJ. I’ve noticed really good Java programmers often prefer IntelliJ. It’s probably better. I use Eclipse only because I learned it a long time ago. They’re both free.
The obvious things an IDE gives you include things like autosuggestion to tell you method names for a particular object, or instantly flagging misspelled variable names. But the most useful and underappreciated features, in my opinion, are for code navigation and refactoring. I feel like I became many times more productive when I learned how to use them.
For example:
- Go to a definition (Eclipse name: “Open Declaration”). Hold “Command” and all the function names, class names, and variable names get underlined. You can click one to navigate to where it’s declared. This is really helpful for following method calls: you are basically following the path your program would take at runtime. You can even navigate into the code for any library or the standard library.
- Back: this is a button on the toolbar. After you’ve navigated to a declaration, use this to go back to where you were before. This lets you do things like go to a method just to quickly refresh your memory about what’s going on, or maybe go to a class to remember what things are in it, then after a second go right back to what you were working on. This lets you effectively deal with a lot more complexity without holding it all in your head at once.
(The “Command” key is for Mac with Eclipse; there are equivalents for Linux and Windows and other IDEs too.)
With these two commands, you can move through your code, and other people’s code, like it’s a web browser. Enabling keyboard shortcuts makes it work even better. Then you can press a keyboard shortcut to navigate to the function currently under your cursor, and press another to go back to where you were. I think that by default these two commands don’t both have shortcuts; it’s worth adding them yourself (in Preferences). I actually mapped them to be like Chrome or Safari, using Command-[ and Command-] for Back and Open Declaration, respectively. I use them constantly when writing Java code.
But that’s just one navigational direction. You can also traverse in other directions with:
- See all references (Eclipse: right-click, “References”; or, Cmd-Shift-G). You invoke this on a function name in the code. Then you’ll get a listing on the sidebar of all places that call that function, and you can click on them to go to them. As opposed to going to a declaration, this lets you go backwards in a hypothetical call stack. It’s like being able to navigate to all inbound links, like all “cited by” in Google Scholar. And it’s useful for variables and classes, too. By invoking this on different things in your code, you quickly get little ego-network snapshots of your codebase’s dependency graph. This not only helps you track down bugs, but helps you figure out how to refactor or restructure your code.
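Here is a small sketch to try these navigation commands on. All class and method names are hypothetical, chosen only so that one method has exactly two callers. The comments describe what I would expect the IDE to show, not output from an actual session.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical classes for illustration only.
class Tokenizer {
    // "Open Declaration" on either call to tokenize(...) below should land here.
    // "References" invoked on this name should list both call sites in LengthStats.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}

class LengthStats {
    // Call site 1: counts the tokens.
    static int tokenCount(String text) {
        return Tokenizer.tokenize(text).size();
    }

    // Call site 2: finds the length of the longest token.
    static int longestToken(String text) {
        int longest = 0;
        for (String token : Tokenizer.tokenize(text)) {
            longest = Math.max(longest, token.length());
        }
        return longest;
    }
}

class NavigationDemo {
    public static void main(String[] args) {
        String text = "navigate code like a web browser";
        System.out.println(LengthStats.tokenCount(text));
        System.out.println(LengthStats.longestToken(text));
    }
}
```

Starting from `main`, Open Declaration takes you down into `tokenCount` and then into `tokenize`, while Back retraces those steps. References on `tokenize` goes the other way, from the declaration up to its two callers.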
There are many other useful navigational features as well, such as navigating to a class by typing a prefix of its name; and many other IDE features too. Different people tend to use different ones so it’s worth looking at what different people use.
Finally, besides navigation, a very useful feature is rename refactoring: any variable or function or class can be renamed, and all references to it get renamed too. Since names are pretty important for comprehension, this actually makes it much easier to write the first draft of code, because you don’t have to worry about getting the name right on the first try. By contrast, when I write large Python programs, I find I have to spend lots of time thinking through the structure and naming up front so I don’t hopelessly confuse myself later. There’s also move refactoring, where you can move functions between different files.
Navigation and refactoring aren’t just things for Java; they’re important things you want to do in any language. There are certainly IDEs and editor plugins for lightweight languages as well which support these things to greater or lesser degrees (e.g. RStudio, PyCharm, Syntastic…). And without IDE support, there are unix-y alternatives like CTags, perl -pi, grep, etc. These are good, but their accuracy relative to the semantics you care about often is less than 100%, which changes how you use them.
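As a rough illustration of that accuracy gap, the sketch below imitates a whole-word text search (in the spirit of `grep -w`) over a tiny source snippet in which two unrelated classes both define `size()`. The class `TextSearchDemo`, the helper `grepLines`, and the embedded snippet are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo; names and the embedded snippet are illustrative assumptions.
class TextSearchDemo {

    // Two unrelated methods share the name "size", and a comment mentions it too.
    static final String SOURCE = String.join("\n",
            "class Vocab { int size() { return words.length; } }",
            "class Batch { int size() { return rows; } }",
            "int v = vocab.size();",
            "int b = batch.size();",
            "// the size of each batch is fixed");

    // Returns the 1-based numbers of the lines containing the identifier as a whole word.
    static List<Integer> grepLines(String source, String identifier) {
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(identifier) + "\\b");
        List<Integer> hits = new ArrayList<>();
        String[] lines = source.split("\n");
        for (int i = 0; i < lines.length; i++) {
            if (wholeWord.matcher(lines[i]).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The text search cannot tell Vocab.size() from Batch.size() or from the comment.
        System.out.println(grepLines(SOURCE, "size"));
    }
}
```

The text search reports all five lines. A type-aware reference search on `Vocab.size()` would be expected to report only the call on line 3, since it resolves which `size` each call binds to. That difference is why a grep-based rename needs manual review while an IDE rename usually does not.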
Java and IDE-style development feel almost retrospective in some ways. To me at least, they’re associated with a big-organization, top-heavy, bureaucratic software engineering approach to programming, which feels distant from the needs of computational research or startup-style lightweight development. And they certainly don’t address some of the major development challenges facing scientific programming, like dependency management for interwoven code/data pipelines, or data/algorithm visualization done concurrently with code development. But these tools are still the most effective ones for a large class of problems, so they’re worth using well if you’re going to use them at all.
[1]: An incredibly long and complicated discussion addressed in many other places, but in my own work, static languages are necessary over lightweight ones for (1) algorithms that need more speed, especially ones that involve graphs or linguistic structure, or sample/optimize over millions of datapoints; (2) larger programs, say more than a few thousand lines of code, which is when dynamic typing starts to turn into a mess while static typing and abstractions start to pay off; (3) code with multiple authors, or that develops or uses libraries with nontrivial APIs; in theory dynamic types are fine if everyone is super good at communication and documentation, but in practice explicit interfaces make things much easier. If none of (1-3) are true, I think Python or R is preferable.
[2]: Long story here and depends on your criteria. Scala is similar to Java in this regard. The main comparison is to C and C++, which have a slight speed edge over Java (about the same for straightforward numeric loops, but gains in BLAS/LAPACK and other low-level support), are way better for memory usage, and can more directly integrate with your favorite more-productive high-level language (e.g. Python, R, or Matlab). But the interface between C/C++ and the lightweight language you care about is cumbersome. Cython and Rcpp do this better, and are especially good if you’re willing to be tied to either Python or R, but they’re still awkward enough that they slow you down and introduce new bugs. (Julia is a better approach since it just eliminates this dichotomy, but is still maturing.) C/C++’s weaknesses compared to Java include nondeterministic crashing bugs (due to the memory model), high conceptual complexity to use the better C++ features, time-wasting build issues, and no good IDEs. At the end of the day I find that I’m usually more productive in Java than C/C++, though the Cython or Rcpp hybrids can get to similar outcomes. These main criteria somewhat assume a Linux or Mac platform; people on Microsoft Windows are in a different world where there’s a great C++ IDE and C# is available, which is (mostly?) better than Java. But very few people in my work/research world use Windows and it’s been like this for many years, for better or worse.