
Lucene -- an open source text search engine API (lecture notes)

王朝other · author unknown · 2008-05-31

/**

* This is the plain-text version of a lecture on Lucene. If you need the PDF version,

* please contact me (pengjy@263.net).

* Author: pengjy

* Date: 2002-04

* Keywords: lucene, api, token, index, chinese, unicode

*/

................page 1 ................

Lucene

an open source text search engine API

high-performance,

full-featured, pure Java

pengjy@263.net

................page 2 ................

Agenda

Overview

APIs

How Does a Search Engine Work

Features

For Chinese Characters

................page 3 ................

Overview

An Apache Jakarta Project

High-performance, full-featured

Open source text search engine APIs

Easy to use, fast to build your own search engine

................page 4 ................

Overview

Version 1.2 rc4

Applications using Lucene

2a.WebSearch

Jive Forums

RockyNewsgroup.org

................page 5 ................

APIs

org.apache.lucene.analysis

defines an abstract Analyzer API for converting

text from a java.io.Reader into a TokenStream,

an enumeration of Tokens. A TokenStream is composed

by applying TokenFilters to the output of a Tokenizer.

A few simple implementations are provided, including

StopAnalyzer and the grammar-based StandardAnalyzer

(generated with JavaCC).
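
The composition described above (a tokenizer whose output passes through filters) can be sketched in plain Java. This is only an illustration under stated assumptions: the class AnalyzerSketch, its method names, and the three-word stop list are hypothetical and are not Lucene's Analyzer, Tokenizer, or TokenFilter classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not Lucene's Analyzer, Tokenizer, or TokenFilter classes.
final class AnalyzerSketch {
    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    // Tokenizer step: split the text into runs of letters and digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Filter step: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter step: drop tokens that appear in the stop-word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Analyzer: the tokenizer followed by the filters, applied in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick fox and a lazy dog"));
    }
}
```

Each filter takes a token list and returns a new one, so further filters can be chained without touching the tokenizer.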

................page 6 ~9................

APIs

org.apache.lucene.document

provides a simple Document class. A document is

simply a set of named Fields, whose values may be

strings or instances of java.io.Reader.

org.apache.lucene.index

provides two primary classes: IndexWriter, which

creates and adds documents to indices; and IndexReader,

which accesses the data in the index.

org.apache.lucene.queryParser

uses JavaCC to implement a QueryParser

org.apache.lucene.search

provides data structures to represent queries

(TermQuery for individual words, PhraseQuery for phrases,

and BooleanQuery for boolean combinations of queries) and

the abstract Searcher which turns queries into Hits.

IndexSearcher implements search over a single IndexReader.

org.apache.lucene.store

defines an abstract class for storing persistent

data, the Directory, a collection of named files written

by an OutputStream and read by an InputStream. Two

implementations are provided: FSDirectory, which uses

a file system directory to store files, and RAMDirectory,

which implements files as memory-resident data structures.

org.apache.lucene.util

contains a few handy data structures, e.g.,

BitVector and PriorityQueue.
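
The storage abstraction described under org.apache.lucene.store (a collection of named files with a file-system and a memory-resident implementation) can be illustrated with a small sketch. The names SimpleDirectory and RamStore and the file name used in main are hypothetical; this is not the signature of Lucene's Directory or RAMDirectory.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: these are not Lucene's Directory or RAMDirectory classes.
final class DirectorySketch {
    // A store is a flat collection of named files.
    interface SimpleDirectory {
        void write(String name, byte[] data);
        byte[] read(String name);
        String[] list();
    }

    // Memory-resident implementation: each file is a byte array held in a map.
    static final class RamStore implements SimpleDirectory {
        private final Map<String, byte[]> files = new TreeMap<>();

        public void write(String name, byte[] data) {
            files.put(name, data.clone());
        }

        public byte[] read(String name) {
            byte[] data = files.get(name);
            if (data == null) {
                throw new IllegalArgumentException("no such file: " + name);
            }
            return data.clone();
        }

        // File names in sorted order.
        public String[] list() {
            return files.keySet().toArray(new String[0]);
        }
    }

    public static void main(String[] args) {
        SimpleDirectory dir = new RamStore();
        // "segment1.dict" is an invented file name.
        dir.write("segment1.dict", "lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(dir.read("segment1.dict"), StandardCharsets.UTF_8));
    }
}
```

Code that writes an index against the interface does not need to know whether the files live on disk or in memory, which is the point of the abstraction.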

................page 10 ................

How Does a Search Engine Work

Create indices

input --> analyzer (tokenize) --> filters --> tokens --> indices
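
The last step of this pipeline, turning tokens into an index, can be sketched as a tiny in-memory inverted index. This is a simplified illustration, not Lucene's IndexWriter: the class InvertedIndexSketch and its methods are hypothetical, and tokenizing is reduced to lower-casing and splitting on non-alphanumeric characters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index; not Lucene's IndexWriter.
final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenize the text and record the document id under every token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of the documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Open source search");
        index.add(2, "Search engine API");
        System.out.println(index.lookup("search"));
    }
}
```

A search then becomes a map lookup per term instead of a scan over every document.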

................page 11 ~ 14 ................

How Does a Search Engine Work

Store Indices

Rather than maintaining a single index, it builds

multiple index segments. For each new document indexed,

Lucene creates a new index segment.

It merges small segments with larger ones -- this

keeps the total number of segments small so searches remain

fast.

To prevent conflicts (or locking overhead) between

index readers and writers, Lucene never modifies segments

in place; it only creates new ones. When merging segments,

Lucene writes a new segment and deletes the old ones

after any active readers have closed them.
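
The segment scheme above can be sketched as follows. This is a simplified model under stated assumptions, not Lucene's actual merge policy: the class SegmentSketch, the MAX_SEGMENTS threshold, and the "merge everything once the threshold is exceeded" rule are all hypothetical. It only shows that each new document becomes its own segment and that merging writes a new segment instead of modifying old ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of segment creation and merging; not Lucene's merge policy.
final class SegmentSketch {
    static final int MAX_SEGMENTS = 3; // assumed threshold, for illustration only

    // Each segment is a read-only map: term -> ids of documents containing it.
    private List<SortedMap<String, List<Integer>>> segments = Collections.emptyList();

    // Every new document is written as its own small segment.
    void addDocument(int docId, String... terms) {
        SortedMap<String, List<Integer>> segment = new TreeMap<>();
        for (String term : terms) {
            segment.put(term, Collections.singletonList(docId));
        }
        List<SortedMap<String, List<Integer>>> next = new ArrayList<>(segments);
        next.add(Collections.unmodifiableSortedMap(segment));
        if (next.size() > MAX_SEGMENTS) {
            // Merge by writing one new segment; the old ones are never modified.
            next = new ArrayList<>(Collections.singletonList(merge(next)));
        }
        // Publish a new list; a reader still holding the old list is unaffected.
        segments = Collections.unmodifiableList(next);
    }

    static SortedMap<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> olds) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, List<Integer>> old : olds) {
            for (Map.Entry<String, List<Integer>> e : old.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return Collections.unmodifiableSortedMap(merged);
    }

    // A search consults every segment, so fewer segments means fewer lookups.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (SortedMap<String, List<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, Collections.emptyList()));
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        for (int id = 1; id <= 4; id++) {
            index.addDocument(id, "lucene");
        }
        System.out.println(index.segmentCount() + " segment(s), hits " + index.search("lucene"));
    }
}
```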

A Lucene index segment consists of several files:

A dictionary index containing one entry for each 100 entries

in the dictionary

A dictionary containing one entry for each unique word

A postings file containing an entry for each posting

Since Lucene never updates segments in place, they

can be stored in flat files instead of complicated B-trees.

For quick retrieval, the dictionary index contains offsets

into the dictionary file, and the dictionary holds offsets

into the postings file.
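
The two-level lookup described above can be sketched with in-memory arrays standing in for the files. This is an illustration under assumptions, not Lucene's file format: the class DictionaryIndexSketch is hypothetical, array positions play the role of file offsets, and the interval (100 in the text) is a constructor parameter.

```java
import java.util.Arrays;

// Hypothetical sketch of a sparse dictionary index; arrays stand in for files.
final class DictionaryIndexSketch {
    private final String[] terms;         // the dictionary: sorted unique terms
    private final long[] postingsOffsets; // per dictionary entry: offset into the postings file
    private final String[] indexTerms;    // the dictionary index: every interval-th term
    private final int interval;

    DictionaryIndexSketch(String[] sortedTerms, long[] postingsOffsets, int interval) {
        this.terms = sortedTerms.clone();
        this.postingsOffsets = postingsOffsets.clone();
        this.interval = interval;
        int n = (terms.length + interval - 1) / interval;
        this.indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            // Index entry i points at dictionary position i * interval.
            indexTerms[i] = terms[i * interval];
        }
    }

    // Returns the postings offset for the term, or -1 if the term is unknown.
    long postingsOffsetOf(String term) {
        // Step 1: binary search the small index for the last entry <= term.
        int pos = Arrays.binarySearch(indexTerms, term);
        if (pos < 0) {
            pos = -pos - 2;
        }
        if (pos < 0) {
            return -1; // the term sorts before the first dictionary entry
        }
        // Step 2: scan at most `interval` dictionary entries from that position.
        int start = pos * interval;
        int end = Math.min(start + interval, terms.length);
        for (int i = start; i < end; i++) {
            if (terms[i].equals(term)) {
                return postingsOffsets[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "date", "egg"};
        long[] offsets = {0, 10, 20, 30, 40}; // invented offsets
        DictionaryIndexSketch dict = new DictionaryIndexSketch(terms, offsets, 2);
        System.out.println(dict.postingsOffsetOf("date"));
    }
}
```

Only the small index has to be kept in memory; the dictionary itself is touched for a short scan per lookup.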

Lucene also implements a variety of tricks to compress

the dictionary and posting files -- thereby reducing disk

I/O -- without incurring substantial CPU overhead.
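
The text does not say which compression tricks are meant. As one widely used example (an assumption, not a description of the slide's content), a postings list of ascending document ids can be stored as gaps between consecutive ids, with each gap written as a variable-length integer of 7 data bits per byte. The class PostingsCompressionSketch below is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: gap encoding plus variable-length integers.
final class PostingsCompressionSketch {
    // Encode ascending, non-negative doc ids as gaps; each gap uses 7 bits per byte,
    // and the high bit of a byte says "more bytes follow".
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedDocIds) {
            int gap = id - previous;
            previous = id;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Reverse the encoding: read each gap and add it to the running doc id.
    static int[] decode(byte[] data) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int i = 0;
        while (i < data.length) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = data[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            ids.add(previous);
        }
        int[] result = new int[ids.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = ids.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ids = {3, 7, 200, 100000};
        byte[] packed = encode(ids);
        System.out.println(packed.length + " bytes instead of " + (ids.length * 4));
    }
}
```

Decoding needs only shifts and additions, which fits the claim that the saving in disk I/O comes without substantial CPU overhead.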

................page 15 ~ 22 ................

Features

Incremental indexing

Incremental indexing allows easy adding of documents to

an existing index. Lucene supports both incremental and batch

indexing.

Data sources

Lucene allows developers to deliver the document to the

indexer through a String or an InputStream, permitting the

data source to be abstracted from the data. However, with

this approach, the developer must supply the appropriate

readers for the data.

Indexing control

Some search engines can automatically crawl through a

directory tree or a Website to find documents to index.

Since Lucene operates primarily in incremental mode, it lets

the application find and retrieve documents.

File formats

Lucene does not parse file formats itself. Instead it supports

a filter mechanism, which offers a simple way to index word

processing documents, SGML documents, and other file formats.

Content tagging

Lucene supports content tagging by treating documents

as collections of fields, and supports queries that

specify which field(s) to search. This permits semantically

richer queries like "author contains 'Hamilton' AND body

contains 'Constitution'".
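
The fielded query quoted above can be illustrated with a minimal sketch. It assumes a document is a map from field name to a single string value and it scans every document linearly instead of using an index; the class FieldQuerySketch and its methods are hypothetical and are not Lucene's Document or query classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a fielded AND query; linear scan, no index.
final class FieldQuerySketch {
    // True if the named field of the document contains the word (case-insensitive).
    static boolean contains(Map<String, String> doc, String field, String word) {
        String value = doc.get(field);
        if (value == null) {
            return false;
        }
        String wanted = word.toLowerCase(Locale.ROOT);
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // author contains <authorWord> AND body contains <bodyWord>
    static List<Map<String, String>> search(List<Map<String, String>> docs,
                                            String authorWord, String bodyWord) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (contains(doc, "author", authorWord) && contains(doc, "body", bodyWord)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("author", "Alexander Hamilton");
        doc.put("body", "Notes on the Constitution");
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(search(docs, "Hamilton", "Constitution").size());
    }
}
```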

Stop-word processing

Search engines typically do not index certain common words,

called stop words, such as "a", "and", and "the". Lucene handles

stop words with the more general Analyzer mechanism, and provides

the StopAnalyzer class, which eliminates stop words from the

input stream.

Query features

Lucene supports a wide range of query features, including

all of those listed below:

Boolean queries, including "and" queries.

Returning a "relevance" score with each hit.

Handling adjacency or proximity queries -- "search followed

by engine" or "Knicks near Celtics".

Searching on single keywords.

Searching multiple indexes at once and merging the results to

give a meaningful relevance score.

However, Lucene does not support the valuable "Soundex",

or "sounds like", query.

Concurrency

Lucene allows users to search an index transactionally,

even if another user is simultaneously updating the index.

Non-English support

As Lucene preprocesses the input stream through the

Analyzer class provided by the developer, it is possible to

perform language-specific filtering.

................page 23 ................

For Chinese Characters

JavaCC -- the Java Compiler Compiler. It can be used to:

build complex compilers for languages such as

Java or C++.

write tools that parse Java source code and perform

automatic analysis or transformation tasks.

Grammars are written in an EBNF (Extended Backus-Naur Form) style.

................page 24 ................

For Chinese Characters

org.apache.lucene.analysis.standard.StandardTokenizer.jj

TOKEN : { // token patterns

<ALPHANUM: (<LETTER>|<DIGIT>)+ >

| <EMAIL: <ALPHANUM> "@" <ALPHANUM> ("." <ALPHANUM>)+ > // email address

}

................page 25 ................

For Chinese Characters

Add Unicode CJK to StandardTokenizer.jj

[

"\u4e00"-"\u9faf", //CJK Unified Ideographs

"\u3400"-"\u4dbf", //CJK Unified Ideographs Extension A

"\u3000"-"\u303f", //CJK Symbols and Punctuation

"\u2e80"-"\u2eff", //CJK Radicals Supplement

"\u3200"-"\u32ff", //Enclosed CJK Letters and Months

"\ufe30"-"\ufe4f", //CJK Compatibility Forms

"\u3300"-"\u33ff", //CJK Compatibility

"\uf900"-"\ufaff" //CJK Compatibility Ideographs

]
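
To check which characters the ranges above cover, the same table can be written as a plain Java lookup. This is only a sketch for experimenting with the ranges: the class CjkRangesSketch and the method isCjk are hypothetical, and this is not how the JavaCC-generated tokenizer matches characters.

```java
// Hypothetical helper mirroring the code point ranges listed on the slide.
final class CjkRangesSketch {
    private static final int[][] RANGES = {
        {0x4e00, 0x9faf}, // CJK Unified Ideographs
        {0x3400, 0x4dbf}, // CJK Unified Ideographs Extension A
        {0x3000, 0x303f}, // CJK Symbols and Punctuation
        {0x2e80, 0x2eff}, // CJK Radicals Supplement
        {0x3200, 0x32ff}, // Enclosed CJK Letters and Months
        {0xfe30, 0xfe4f}, // CJK Compatibility Forms
        {0x3300, 0x33ff}, // CJK Compatibility
        {0xf900, 0xfaff}, // CJK Compatibility Ideographs
    };

    // True if the character falls inside any of the listed ranges.
    static boolean isCjk(char c) {
        for (int[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCjk((char) 0x4e2d)); // a CJK Unified Ideograph
        System.out.println(isCjk('a'));
    }
}
```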

................page 26 ................

For Chinese Characters

Add Unicode CJK

Build Lucene (using the Lucene 1.2 source and Ant 1.4)

Test environment: Windows 2000 Server + WebLogic 6.1 SP2 +

MS SQL Server 2000 + Jive 2.2.3 + Lucene

................page 27 ................

Thank you!

My email: pengjy@263.net

................The end ................

 
 
 