libindexer 1.1 (c) Fabien MENEMENLIS 2012-07-18 (nihilist@dead-inside.org) This program is released under the LGPL License. For copying information see the file COPYING. This is version 1.1 of libindexer, a fast indexing / data retrieving library written in C language. Currently it is only known to work on FreeBSD and MacOS X (changing the mmap() function in the libindexer code) and has not been tested on Linux yet. This is a "work in progress" library initially made for a search engine. The library is made of 2 distincts parts: - libindexer used to index / prepare a file - libfsearch used to retrieve data from an indexed file This implementation requires libavl 1.4 in your parent directory, check http://www.stanford.edu/~blp/avl/ and available on GNU's ftp at ftp://ftp.gnu.org/pub/gnu/avl/avl-1.4.0.tar.gz You'll need avl.o so don't "make clean" after you ./configure; make 1. libindexer libindexer uses a fast indexing method by creating an inverse dictionary with a binary tree in memory, dumping the (sorted) data on disk and then merging the sorted files. The binary tree is in memory and its size is limited by the MAXCOUNT define in libindexer.c. The indexer is very simple to call and contains only one function: #include "libindexer.h" indexer(int forkc, char *pname, char *outname, char *inname, indexer_read_func user_readdata); where forkc is process count. You can use more than 1 CPU/core, libindexer will fork() and read a record per process. libindexer uses semaphores, shared memory and file locking to deal with read concurrency. Define NO_MULTI in libindexer.c if you don't plan on using more than a process to index. pname is the path to a filename that will be used to generate an unique key in order to use semaphores / shared memory primitives. Usually you can use argv[0]. outname is the basename for the files that will be created by libindexer. 4 files will be created: outname.off outname.idx outname.dboff and outname.dbdata. inname is the file to be indexed. user_readdata is an user callback function that will be called to read each record from the file to be indexed. The callback will receive the file descriptor of the file to be indexed as its first argument, a pointer to the read data as its second argument, a pointer to an alternative buffer as its third argument and as a final argument a pointer to the current document id as generated by the indexer, should you need it for further use (a document can later be accessed by its id using the fsearch_getoffset() function) This alternative buffer is only useful if you want your data to be indexed, and found, "as is" and won't be processed by the functions that clear strings from non standard characters and make words all upper case so you can find "The" "THE" or "the". For example: indexer(2, argv[0], "fileout", "filein.txt", readdata); with readdata() defined as: void readdata(FILE *f, char **data, char **ndata, int *id) { static char buffer[1024 + 1]; fgets(buffer, 1024, f); *data = buffer; **ndata = NULL; } is enough to index filein.txt if it's a simple text file with lines < 1024 bytes. Check libindexer.h for functions and arguments. 2. libfsearch Once the file has been prepared for indexing with libindexer you have to use libfsearch to retrieve data. First, call fsearch_init() to initialize libfsearch with the output files basename. It will return an s_fsearch structure defined in libfsearch.h or NULL if there was an error. s_fsearch *fs; fs = fsearch_init("testout"); Then you have to prepare the string you are looking for with either fsearch_clearstring() or fsearch_clearstringlnk(). char *query; query = fsearch_clearstringlnk("text"); The difference between fsearch_clearstring() and fsearch_clearstringlnk() is that the latter preserves special characters between letters / words. For example looking for "A.B.C" would only find exactly "A.B.C" while preparing the string with fsearch_clearstring() would also find lines containing A and B and C no matter the order or if there are words in between. When looking for multiple words you simply have to put them in a string separated by spaces. For example query = fsearch_clearstring("two words"); Once you have prepared the query string you can call fsearch_seek() to perform the lookup. int *res, nbres; res = (int *)malloc(sizeof(int) * 500); rescount = fsearch_seek(fs, query, &res, 500); first argument is the s_fearch structure created by fsearch_init() second argument is the query prepared by fsearch_clearstring() third argument is an integer table that will be filled with results IDs (record number in the original file). That is, if res[0] is 3 it will be the 4th line in our "filein.txt" file since we used fgets() to read the records in our callback function. fourth argument is the maximum size of the result table. You can set res to NULL and set the maximum size to 0 if you want the result table to be dynamically malloc'd with no limit, with a performance hit since the table will be resized for every new result found during the lookup. Once you have the record numbers in the result table you can obtain the offset to the record in the original file by calling fearch_getoffset(). f = fopen("infile.txt", "r"); for (i = 0; i < nbres; i++) { fseeko(f, fsearch_getoffset(fs, res[i]), SEEK_SET); fgets(buffer, TBUFFER, f); printf("%s", buffer); } Note that records are sorted by the words lowest positions. When looking for "three" in a data set containing two lines "one two three" and "three two one", first output will be "three two one". You can use fsearch_close(fs) to close the file descriptors once you are done with the library. libfsearch also handles substring lookup (ONE "TWO WORDS" etc) and NOT operator using ~ in front of the word you want to avoid (ONE ~TWO ~THREE). Included are two small example files testindex.c and testfsearch.c as well as a test set from the Gutenberg project. Try ./testindex and then ./testfsearch cosette Send bug reports to nihilist@dead-inside.org Fabien