idutils with R and Python (and related issues)

1 Sep 2006 posts

I was looking for tagging and indexing tools, especially for R and Python, that work nicely with Emacs. Here are some notes on what I looked at and tried, with varying levels of detail. At the end, there is some info on an R scanner for idutils that seems to work (if you are interested, please do try it out, and let me know if there are problems).

What I want

While editing, be able to jump to where a function is defined. This is what ctags seems great at. I need support for R and Python; support for LaTeX and general text would also be nice.
Find all calls to a function, variable, etc. Helps refactoring, navigating code, and serves as a poor-man's static call graph.
Any tool should be Emacs-friendly (many of the tools below are also vim friendly, or about as friendly for both emacs and vim).
A plus if a whole bunch of other languages we use are also supported (especially C/C++ and, to a lesser degree, Awk, sed, JavaScript and possibly elisp).

Some tools I looked at

imenu

This is part of (X)Emacs already. The appropriate parsing of the file needs to be defined elsewhere, like in the mode. For instance, ess has an "ess-imenu-R" (defined in ess-menu.el) and python-mode has function "py-imenu-create-index-engine".

Imenu is actually great, and lets you use ECB (Emacs Code Browser) very nicely with R, Python and LaTeX. You can go to a function definition, but only if the definition is in the buffer you are currently editing (it won't jump to and open another file).

lxr

Setup is too complex for what we need. Uses ctags underneath. No Emacs interface.

open-grok

From the OpenSolaris project. I could not download it over a three-day period, so I gave up. It also does not seem to be Emacs-friendly.

pypersrc

Only C, C++, Python, and Java. Not emacs callable (has Tk interface). But nice for what it does.

cscope

For C code mainly, but it can be used as a general text parsing and searching tool. I did not like how it works recursively with non-standard file types such as R or Python. Fixing this should be easy, though: just modify /usr/bin/cscope-indexer so that it looks for the files I want (.py, .R, .html). The Emacs interface is nice, but I liked it less than the idutils one, and extending cscope seems harder than working on the other alternatives. Also, although cscope is still widely used (and, as they say, has a very honorable Unix pedigree), it does not seem to be actively developed anymore.
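The file-selection step can also be sketched outside cscope-indexer. What follows is a minimal sketch, not the cscope-indexer script itself: it assumes that cscope picks up its list of sources from a file named cscope.files in the current directory (its documented default), and the function names are made up for illustration.

```python
import os

# Extensions to index; .py, .R and .html are the ones mentioned above.
EXTENSIONS = {".py", ".R", ".html"}


def collect_source_files(root, extensions=EXTENSIONS):
    """Return sorted paths under `root` whose extension is in `extensions`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1] in extensions:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def write_file_list(root, list_path="cscope.files"):
    """Write one path per line to `list_path` and return the paths."""
    paths = collect_source_files(root)
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Calling write_file_list(".") and then running "cscope -b" in the same directory should build the cross-reference from that list. Paths containing spaces would need extra quoting, which this sketch does not handle.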

gonzui + langhelp

langhelp is used by gonzui: langhelp by itself will not index arbitrary files, only the standard docs of a language (and for that it seems very nice, but that is not the point of what I am trying to do now).

To get langhelp running on my machine, I had to comment out line 165 of /usr/local/lib/site_ruby/1.8/langhelp/lh_ruby.rb, as I was getting the error "no such file to load -- rdoc/ri/ri_paths (LoadError)". That's OK, since Ruby is not relevant for me anyway.

gonzui itself works as a mix of ctags and idutils. However, it is a lot slower than ctags when indexing (at least on what I tried). That should not be a surprise (gonzui is written in Ruby). It does not support LaTeX or R. Usage from Emacs works just fine (search for gonzui-emacs). However, the buffers with the code are not the original ones and are not editable.

A simple example of usage is:

$ gonzui-import -d /home/ramon/test/gonzui.db /home/ramon/bzr-local-repositories/mpi.defs

Within Emacs:

M-x gonzui

and enter the tag you want to search for (the database has to be created in the directory where Emacs is being run).

global

Another apparently very nice and fast program. However, there is no support for Python or R, and how to extend it was not clear to me. Development is very active, and it looks like there is an enthusiastic user and developer community.

ctags (from Exuberant Ctags)

Very fast.
No LaTeX, TeX, or R. However, there is a patch for R and there are two separate patches for LaTeX.
Easily used from Emacs (just create the TAGS file using the -e option).
There is also a "native emacs" etags function that does understand LaTeX, and there is a way to use a regexp when calling etags so that it works with R (see links at end, and ESS manual).
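To give an idea of what a regexp-based tagger has to recognize in R code, here is a minimal sketch. It is not the regexp from the ESS manual: the pattern and the function name are assumptions made for illustration, and they only cover definitions of the form "name <- function" or "name = function" at the start of a line.

```python
import re

# Hypothetical pattern: an R name, then "<-" or "=", then the keyword "function".
R_FUNCTION_DEF = re.compile(
    r"^\s*([A-Za-z.][A-Za-z0-9._]*)\s*(?:<-|=)\s*function\b"
)


def find_r_definitions(source):
    """Return (name, line_number) pairs for R function definitions."""
    tags = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = R_FUNCTION_DEF.match(line)
        if match:
            tags.append((match.group(1), number))
    return tags
```

A tags file is essentially this list of names and positions, written in a format the editor understands. Definitions that use "->", assign(), or span several lines before the "function" keyword would be missed by a pattern this simple.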

ctags examples:

First create the tags file:

$ ctags -e --links=no --exclude="Examples/" --exclude="www/tmp" -R

You can tell emacs where to look for tags with:

visit-tags-table

And then, the absolutely fabulous "M-." and automagically you are taken to the place where the function is defined; if the function is not defined in the buffer you are editing, the appropriate file is visited. I don't know how I lived without this before.

So for going to the place where a function is defined, ctags works just fine. However, I do not like it for seeing where the function (or variable, or whatever) is used:

tags-apropos shows only places where it is defined;
tags-search goes file by file, does not show all matches at once (hackable?) and is essentially using grep.

idutils

Fast and powerful, but it does not include Python, LaTeX, or R scanners (see the options by running "mkid --help"). I really like the Emacs interface.

To begin, we can try the generic text scanner. A simple example:

mkid --lang-map="/home/ramon/emacs-files/id-lang-try.map" --default-lang="text" *.R
## simpler for later if we use the default ID file rather
## than --output=some_custom_thing

The file id-lang-try.map simply tells mkid that files ending with .R are to be parsed as text.

Now, inside emacs do:

M-x gid

and type what you want to search for.

It is reasonably fast (running in bzr-local-repositories it took 9.5 seconds on my laptop). Note that we can exclude directories and files, which is good for all the temporary and example files (e.g., mkid -p ./pomelo2/www/Examples).

idutils: R and Python using the Perl scanner

We can do better than telling mkid that Python and/or R are text. For example, tell it they are Perl. This is an example that shows that using the Perl scanner leads to more reasonable results (in what follows, f1.R is a "reasonable", medium sized R file; roll your own):

cp f1.R f1.txt
cp f1.R f1.pl

mkid --output id-f1.text -i text f1.txt
mkid --output id-f1.pl -i perl f1.pl
mkid --output id-f1w.pl --lang-option=perl:"--include=." -i perl f1.pl

The last invocation tells the scanner to include the "." as part of a token (I want to have, say, "print.myobject" rather than "print" and "myobject").

See how perl extraction is smaller (I'd assume also faster) and more reasonable:

```
fid --file id-f1.text f1.txt > f1-text.tokens
fid --file id-f1.pl f1.pl > f1-perl.tokens
fid --file id-f1w.pl f1.pl > f1-perl.tokensw
ls -lrta f1-*.toke*
## and now, open each one
```

The main differences for us are:

No single numbers extracted as tokens.
No terms after "#" included (this is not exactly true: "fid" does not list those terms, because they seem not to be in the ID database, but when you use gid ---not lid--- they'll show up).
No terms within " " are included.

But the first two do unwanted things with ".": tokens that included a "." are split at the period, which is not what we want with R (and probably not with Python either). Using the third invocation does preserve functions with a ".", but also includes numbers if they start with a ".".
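The trade-off with "." can be reproduced with a toy tokenizer. This is only a sketch of the behavior described above, not the idutils Perl scanner (which is written in C and handles many more cases); the function name and the include_dot parameter are made up for illustration.

```python
import re


def scan_tokens(source, include_dot=False):
    """Return the sorted set of tokens found in R-like `source`."""
    extra = "." if include_dot else ""
    # A token starts with a letter or "_" (or "." when include_dot is set),
    # so bare numbers such as "3" are never extracted.
    token = re.compile(r"[A-Za-z_%s][A-Za-z0-9_%s]*" % (extra, extra))
    tokens = set()
    for line in source.splitlines():
        line = re.sub(r'"[^"]*"', " ", line)  # drop double-quoted strings
        line = line.split("#", 1)[0]          # drop everything after "#"
        tokens.update(token.findall(line))
    return sorted(tokens)


example = (
    "print.myobject <- function(x) {\n"
    "  y <- x * .5  # halve it\n"
    '  cat("done", y)\n'
    "}\n"
)
print(scan_tokens(example))
# ['cat', 'function', 'myobject', 'print', 'x', 'y']
print(scan_tokens(example, include_dot=True))
# ['.5', 'cat', 'function', 'print.myobject', 'x', 'y']
```

With include_dot=True the name print.myobject survives as one token, but ".5" slips in as a token too, which is the same side effect seen with the third mkid invocation.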

Trying R and Python scanners

I've tried to incorporate R and Python scanners. Basically, I took the perl one, and modified it. You can download a patch against the CVS version of idutils (as of 2006-09-01) (file also available here) or get the source with the patch applied. The tests I've done (see the testsuite directory) seem to work well.

Supposing the above code does work well, there are several issues still:

The output includes language keywords. This could be avoided by checking each token against a list of keywords, but I am not sure it is worth it (or even desirable): I assume nobody will search for "if", checking every token would slow things down, and the extra space does not seem to be a big deal.
I do not like it at all that a token that appears only after a comment marker, and so is not included in the ID database, still shows up when you run gid. I'll try to look at that code, although I am not sure how worthwhile this really is.
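The keyword check mentioned in the first point could look roughly like this. It is a sketch only, not part of the patch: the keyword list is a partial, illustrative subset of R's reserved words, and the function name is hypothetical.

```python
# A partial, illustrative list of R reserved words (not exhaustive).
R_KEYWORDS = frozenset({
    "if", "else", "for", "while", "repeat", "function",
    "break", "next", "in", "TRUE", "FALSE", "NULL", "NA", "Inf", "NaN",
})


def drop_keywords(tokens, keywords=R_KEYWORDS):
    """Return the tokens in their original order, minus language keywords."""
    return [token for token in tokens if token not in keywords]
```

Each check is a single set lookup, so the per-token cost is small, but it is still paid for every token in every file; whether that is worth it is the open question above.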

Summary

ctags and idutils together seem to fulfill what I want, but I still need to use both, each for a different purpose: ctags to locate where a function is defined, idutils to find all references to a function, variable, etc. Extending idutils seems doable, and the R and Python scanners seem to work.

Thanks

I asked about the above issues at the ESS and Python lists and I got very helpful responses from several people. Thanks to all.

License

Date:	2006-09-01 (2nd revision)
Author:	Ramon Diaz-Uriarte