Open Source Endeca in 250 Lines or Less

Casey Durfee

The Seattle Public Library

casey.durfee@spl.org

Use the source, Luke

code:
http://extranet.spl.org/code/code4lib2007.zip


demo:
http://catalog.spl.org/catalog/

A more accurate description

I will detail how you can create an OPAC with features comparable to N-Dekka or Ockwa Bowser search products (faceted browsing, relevancy ranking, fuzzy searching) using the open-source Apache Solr search engine and my favorite programming language. I will present a catalog with most of N-Dekka's features in not very many lines of code and discuss performance/scalability concerns and common pitfalls when using Solr.
[put talk spiel here ]

For Sticklers

There is a "for sticklers" version you can download that has everything in 250 lines or less but DO NOT STARE DIRECTLY AT IT. It is so obfuscated and awful it may blind you.

Why count lines?

Number of Bugs ~ (Lines of Code)^1.5 (according to The Mythical Man-Month)

A 2500 line program has on average about 32x as many bugs as a 250 line program (10^1.5 ≈ 31.6).

Search results

This is what our search results screen will basically look like.

Giant Legos


by Sean Kenney
This here is a picture of a giant Lego block made up of 5,000 smaller blocks. I think about Solr and the other tools I use as giant Lego blocks that somebody else has already put together for you. They're extremely modular -- easy to stick on to anything else you might be building -- and they let you make really big, really Enterprise stuff really fast. Four of these gigantic bricks take as long to put together as four of the tiny bricks that make them up. That's why you can build a webapp -- with features that would cost you $50,000 or $100,000 to buy from some vendor who put together 5,000 smaller blocks themselves -- with not a whole lot of work.

Solr shortcuts

[spiel here]

Django features

[spiel here]

Free Developer Tools

[spiel here]

Base Template

[spiel here]

Base Screen

[spiel here]

Config File

[spiel here]

Search View -- constructing URL

All we will have in our URL are query, index, limits, sort and page number. It is simple to translate to Solr syntax.
[spiel here]

Mapping to Solr URL

/catalog/search/?q=cats&index=text&limit=genre:"Mystery fiction."&sort=pubdate desc

Becomes

http://dev7:8888/solr/select?q=text%3Acats%20AND%20genre%3A%22Mystery%20fiction.%22&wt=python&facet.field=topic&facet.field=genre&facet.field=format&facet.field=location&facet.field=place&facet.field=language&facet.field=author_exact&facet.zeros=false&facet=true&facet.limit=25&start=0
[spiel here]
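The mapping above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the talk: the function name `build_solr_url`, the constant names, and the page size of 10 are assumptions made for illustration. The host, facet fields, and parameter order are copied from the Solr URL above. Sorting is left out because the Solr URL on the slide does not show how the sort is passed along.

```python
from urllib.parse import urlencode, quote

SOLR_SELECT = "http://dev7:8888/solr/select"  # host and port as shown on the slide
FACET_FIELDS = ["topic", "genre", "format", "location",
                "place", "language", "author_exact"]
PAGE_SIZE = 10  # assumption: the slide does not state results per page


def build_solr_url(q, index="text", limits=(), page=1):
    """Translate catalog search parameters into a Solr select URL.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # The index becomes a field prefix; each limit is ANDed on unchanged.
    clauses = ["%s:%s" % (index, q)] + list(limits)
    params = [("q", " AND ".join(clauses)), ("wt", "python")]
    # One facet.field parameter per facet we want counts for.
    params += [("facet.field", field) for field in FACET_FIELDS]
    params += [
        ("facet.zeros", "false"),
        ("facet", "true"),
        ("facet.limit", "25"),
        # Page numbers are 1-based; Solr's start offset is 0-based.
        ("start", str((page - 1) * PAGE_SIZE)),
    ]
    # quote_via=quote encodes spaces as %20 rather than "+".
    return SOLR_SELECT + "?" + urlencode(params, quote_via=quote)
```

Calling `build_solr_url("cats", index="text", limits=['genre:"Mystery fiction."'])` produces the same query string as the Solr URL shown above.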

Search URL syntax

/catalog/search/?q=cats&index=text&limit=genre:%22Mystery%20fiction.%22&sort=pubdate%20desc
vs.
/ipac20/ipac.jsp?npp=10&ipp=20&spp=20&profile=dial&aspect=subtab14&term=cats&index=.GW&uindex=&oper=&ri=1&session=RY725251723V3.10868&menu=search&aspect=subtab14&npp=10&ipp=20&spp=20&profile=dial&ri=1&source=%7E%21horizon&sort=3100049&go_sort_limit.x=10&go_sort_limit.y=15&limit=MD01+%3D+dvdsmd+and+MT01+%3D+mt_g
or
/catalog/?view=full&Ntt=cats&sort=1&Ntk=Keyword&Ns=PubDateSort%7c1&N=4294951185
or
/result.asp?cmd=sort&inlibrary=false&noext=false&debugkey=&lastquery=cats&lsi=url&uilang=en&searchmode=assoc&hardsort=year&skin=queens&c_over=1&branch=&ref=336&curpage=1&hrecpos=0&aquamode=aqua
[spiel here]

Augmenting item results

[spiel here]

Shuffling data for easier display

[spiel here]

The Boringest Slide of the Presentation

[spiel here]

RSS template

[spiel here]

View Template: facet options

[spiel here]

View Template: brief bib

[spiel here]

Solr performance tricks

  • <optimize/>
  • huge filterCache
  • Cache as you can
  • Some facets are faster than others
  • Keep it warm
[spiel here]

Filter Cache

The filter cache is integral to good performance with facets. Its size should be roughly equal to the number of indexed records.
[spiel here]

Some Facets are Faster than Others

#jython facetWarmer.py
query on facet author_exact took 10.5620
query on facet publisher took 1.7960
query on facet pubdate took 0.4220
query on facet audience took 0.7030
query on facet language took 0.6100
query on facet genre took 0.5940
query on facet place took 1.1090
query on facet topic took 2.9060
query on facet collection took 0.4220
done!

You frequently need to warm facets. Run a query that hits every record in the index and facets on every field.
[spiel here]
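A warmer along these lines might look like the sketch below. It is not the `facetWarmer.py` from the talk: the function names, the `fetch` hook, and the Solr host are assumptions, and the facet list is simply the set of fields in the timing output above. It also assumes a Solr version whose query parser accepts `*:*` as a match-everything query; older versions may need a different catch-all.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_SELECT = "http://dev7:8888/solr/select"  # assumption: same host as the search URL
FACETS = ["author_exact", "publisher", "pubdate", "audience", "language",
          "genre", "place", "topic", "collection"]


def warming_url(facet, base=SOLR_SELECT):
    """Build a match-everything query that facets on a single field."""
    params = [
        ("q", "*:*"),      # assumption: supported by the Solr version in use
        ("rows", "0"),     # only the facet counts matter, not the documents
        ("facet", "true"),
        ("facet.field", facet),
        ("facet.limit", "25"),
    ]
    return base + "?" + urlencode(params)


def warm(facets=FACETS, fetch=None):
    """Request each facet once and return a list of (facet, seconds) pairs.

    `fetch` is a hypothetical hook so the loop can be exercised without
    a running Solr server; by default it performs a real HTTP GET.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read()
    timings = []
    for facet in facets:
        started = time.perf_counter()
        fetch(warming_url(facet))
        timings.append((facet, time.perf_counter() - started))
    return timings
```

Running `warm()` against a live server and printing each pair as `"query on facet %s took %.4f"` would give a listing in the same shape as the one above. Running it from cron or after each index rebuild is one way to keep the caches warm.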

anMARCy in the UK

"Here are 3 chords. Now go start a band." -- Sniffin' Glue
  • Eclipse
  • PyDev
  • Aptana
  • Firefox with:
    • Firebug
    • Color Picker
    • CSS Editor
    • View Source Chart
    • Measureit - screen ruler