libindexer 1.1
(c) Fabien MENEMENLIS 2012-07-18 (nihilist@dead-inside.org)

This program is released under the LGPL License. For copying information
see the file COPYING.

This is version 1.1 of libindexer, a fast indexing / data retrieving library
written in C language. Currently it is only known to work on FreeBSD and
MacOS X (changing the mmap() function in the libindexer code) and has not been
tested on Linux yet.
This is a "work in progress" library initially made for a search engine.

The library is made of 2 distincts parts:
- libindexer used to index / prepare a file
- libfsearch used to retrieve data from an indexed file

This implementation is based on JC Dufour's hash patch and doesn't require
any extra library. It's a much much faster version over the AVL implementation
as search time is linear no matter the number of records indexed. Note
however that creating the hash table on disk takes time and space and
is only worthy on huge files (we're talking about over 100 GB files here). If
you just want to play with smaller files (few GB files) go for the AVL
version. Anyway, kudos to JC for his work here, it's really an awesome job
you did there, you rock!


1. libindexer

libindexer uses a fast indexing method by creating an inverse dictionary
with a binary tree in memory, dumping the (sorted) data on disk and then
merging the sorted files. The binary tree is in memory and its size is limited
by the MAXCOUNT define in libindexer.c.
The indexer is very simple to call and contains only one function:

#include "libindexer.h"

indexer(int forkc, char *pname, char *outname, char *inname,
        indexer_read_func user_readdata);

where forkc is process count. You can use more than 1 CPU/core, libindexer will
fork() and read a record per process. libindexer uses semaphores, shared
memory and file locking to deal with read concurrency. Define NO_MULTI
in libindexer.c if you don't plan on using more than a process to index.
pname is the path to a filename that will be used to generate an unique key
in order to use semaphores / shared memory primitives. Usually you can use
argv[0].
outname is the basename for the files that will be created by libindexer.
4 files will be created: outname.off outname.idx outname.dboff and
outname.dbdata.
inname is the file to be indexed.
user_readdata is an user callback function that will be called to read each
record from the file to be indexed. The callback will receive the file
descriptor of the file to be indexed as its first argument, a pointer to
the read data as its second argument, a pointer to an alternative buffer
as its third argument and as a final argument a pointer to the current
document id as generated by the indexer, should you need it for further use
(a document can later be accessed by its id using the fsearch_getoffset()
function)

This alternative buffer is only useful if you want your data to be indexed,
and found, "as is" and won't be processed by the functions that clear strings
from non standard characters and make words all upper case so you can find
"The" "THE" or "the".

For example:
indexer(2, argv[0], "fileout", "filein.txt", readdata);
with readdata() defined as:
void readdata(FILE *f, char **data, char **ndata, int *id) {
    static char buffer[1024 + 1];

    fgets(buffer, 1024, f);
    *data = buffer;
    **ndata = NULL;
}

is enough to index filein.txt if it's a simple text file with lines < 1024
bytes.
Check libindexer.h for functions and arguments.


2. libfsearch

Once the file has been prepared for indexing with libindexer you have to use
libfsearch to retrieve data.
First, call fsearch_init() to initialize libfsearch with the output files
basename. It will return an s_fsearch structure defined in libfsearch.h
or NULL if there was an error.
    s_fsearch *fs;
    fs = fsearch_init("testout");

Then you have to prepare the string you are looking for with either
fsearch_clearstring() or fsearch_clearstringlnk().
    char *query;
    query = fsearch_clearstringlnk("text");
The difference between fsearch_clearstring() and fsearch_clearstringlnk() is
that the latter preserves special characters between letters / words. For
example looking for "A.B.C" would only find exactly "A.B.C" while
preparing the string with fsearch_clearstring() would also find lines
containing A and B and C no matter the order or if there are words in between.
When looking for multiple words you simply have to put them in a string
separated by spaces. For example query = fsearch_clearstring("two words");

Once you have prepared the query string you can call fsearch_seek() to perform
the lookup.
    int *res, nbres;
    res = (int *)malloc(sizeof(int) * 500);
    rescount = fsearch_seek(fs, query, &res, 500);
first argument is the s_fearch structure created by fsearch_init()
second argument is the query prepared by fsearch_clearstring()
third argument is an integer table that will be filled with results IDs
(record number in the original file). That is, if res[0] is 3 it will be the
4th line in our "filein.txt" file since we used fgets() to read the records
in our callback function.
fourth argument is the maximum size of the result table.
You can set res to NULL and set the maximum size to 0 if you want the result
table to be dynamically malloc'd with no limit, with a performance hit since
the table will be resized for every new result found during the lookup.

Once you have the record numbers in the result table you can obtain the
offset to the record in the original file by calling fearch_getoffset().
    f = fopen("infile.txt", "r");
    for (i = 0; i < nbres; i++) {
        fseeko(f, fsearch_getoffset(fs, res[i]), SEEK_SET);
        fgets(buffer, TBUFFER, f);
        printf("%s", buffer);
    }

Note that records are sorted by the words lowest positions. When looking for
"three" in a data set containing two lines "one two three" and "three two one",
first output will be "three two one".

You can use fsearch_close(fs) to close the file descriptors once you are done
with the library.

libfsearch also handles substring lookup (ONE "TWO WORDS" etc) and NOT
operator using ~ in front of the word you want to avoid (ONE ~TWO ~THREE).

Included are two small example files testindex.c and testfsearch.c as well
as a test set from the Gutenberg project. Try ./testindex and then
./testfsearch cosette

Send bug reports to nihilist@dead-inside.org

Fabien

