Apache SolrCloud
-
Upload
michal-warecki -
Category
Technology
-
view
7.700 -
download
2
Transcript of Apache SolrCloud
Searching in the Cloud
Arkadiusz MasiakiewiczMicha Warecki
Przedstawi si, jestemy z PayU (Allegro group).
Chcemy opowiedzie o naszych krtkich dowiadczeniach z Solr. Opowiemy o technicznych szczegach tego fajnego serwera.
Mielimy problem z czasami wyszukiwania, dlatego zainteresowalimy si t technologi.
Moe by ciekawe bo zastosowalimy SolrCloud (std nazwa prezentacji).
Pytania prosimy na kocu.
About us
Arkadiusz MasiakiewiczNoSQL Databases
Full-text search engines
Neural networks
Scuba diving
Micha WareckiGC
JiT Compilers
Concurrency
Non-blocking algorithms
Programming languages runtime
Astronomy
Agenda
Introduction to Apache Solr
Data in PayU
About SolrCloud
Performance
About Apache Solr
Full-text search server based on Apache Lucene
Platform independent
Buzzword compatible (sharding, replication, clustering, cloud, scaling, big data, ...)
Where to use Solr?
Sklepy internetowe empikWyszkiwarkiCzciowe zastpienie bazy danychwyszukiwanie plikw po treciwyszukiwanie osb
Who uses Solr?
About Apache Solr
Indexer / Scheduler
Lucene IndexSolr server
SchemaConfigDocuments / Update queryClient applicationQuery (HTTP)ResponseWeb container
Omwi komponenty SolrKomunikacja
Platform independent
(XML|JSON|PHP...)/HTTP
HTTP-based queries
http://localhost:8080/solr/select?q=query&start=50
&rows=25
&fq=field:ala
&facet=on&facet.eld=category
&sort=dist(2, point1, point2) desc
Solr schema
name
type
indexed
stored
multiValued
required
Omwi atrybuty pl
type i stored wielko indeksu
Przykad z ycia: setki milionw dokumentw I zwracamy tylko id, ktore pozniej wyciagamy po indeksie z bazy
Solr schema
StrField String (UTF-8 encoded string or Unicode).
TextField Text, usually multiple words or tokens.
TrieDateField Date field accessible for Lucene TrieRange processing.
TrieDoubleField, TrieFloatField, TrieIntField, TrieLongField
UUIDField Universally Unique Identifier (UUID). Pass in a value of "NEW" and Solr will create a new UUID.
CurrencyFieldSupports currencies and exchange rates.
TextField analizery (StandardTokenizer whitespace, dots, itd.) i filtry (lowercasefilter) w czasie indeksowania i zapytania
CurrencyField zamiana wartoci w trakcie zapytania
Data import
SolrData feedersXML DocumentsXML
CSV
File system
Web spiders
RSS/Atom
POJO
DB
Jak to si dziej, e moemy odpyta solara o dokumenty?
Dodatkowo mona powiedzie, e istnieje moliwo zaimplementowania wasnego importu.
Data configuration
data source
documententity(query, delta query)field
SolrDBUpdate querySQLDIH
Peny import
Import przyrostowy
Dodatkowwo robimy full zamiast delta import, bo delta robi selecta per dokument
Plugins
Search components
Request handlers
Process factory
Search component moe podkrela wyszukiwane sowa kluczowe; zwraca dodatkowe informacje jak np. Ilo sw w polu
Request handlers podpinamy komponenty pod odpowiedni ciek URL (endpoint)
Process factory przy indeksowaniu mona za jego pomoc doda nowe pola, zmienia je itd.
Data in PayU
~100 GB index size
~400 milions of documents to search
More than 30 fields in schema
Dotyczy to tylko jednej tabeli bazodanowej
Jest to stosunkowo due wdroenie.
Data flow in PayU
Client applicationSolrDB1: search params
3: DB id's
2: DB id's
4: Data
Connecting to your old app
Hibernate criteria
JPA criteria
Custom criteria
q=field1:testvalue&wt=javabin
Mona wwczas bardzo atwo przecza si pomidzy baz danych, a Solr
Traditional Solr architecture
Solr shard1- config- schemaSolr shard1 replica- config- schemaSolr shard1- config- schema
Solr shard1- config- schemaSolr shard2 replica- config- schemaSolr shard2- config- schema
manually copy config
manually split index
manually shard queries
add replica
manually copy config
setup replication
does not provide fail-over
separate monitoring
Zookeeper cluster
SolrCloud architecture
confClusterstateconfconfconfZookeeper
Solr instance1
Shard1leader
Shard2Replica
Solr instance2
Shard1replica
Shard2leader
confClusterstateconfconfconfZookeeper
confClusterstateconfconfconfZookeeper2
Zookeeper1
Zookeeper3
Add documentMog by logiczne I fizyczne instancje
ZooKeeperZooKeeperZooKeeperAssigning machines
Number of shards : 3
Shard1
ZooKeeperZooKeeperZooKeeperAssigning machines
Number of shards : 3
Shard1
Shard2
ZooKeeperZooKeeperZooKeeperAssigning machines
Number of shards : 3
Shard1
Shard2
Shard3
Now you can search and index
ZooKeeperZooKeeperZooKeeperAssigning machines
Shard1
Shard2
Shard3
Replicashard1
ZooKeeperZooKeeperZooKeeperAssigning machines
Shard1
Shard2
Shard3
Replicashard1
Replicashard2
ZooKeeperZooKeeperZooKeeperAssigning machines
Shard1
Shard2
Shard3
Replicashard1
Replicashard2
Replicashard3
ZooKeeper configuration
{"collection1":{ "shards":{ "shard1":{ "range":"80000000-8443ffff", "state":"active", "replicas":{ "10.205.33.92:8080_solr_collection1_shard1_replica2":{ "state":"active", "core":"collection1_shard1_replica2", "node_name":"10.205.33.92:8080_solr", "base_url":"http://10.205.33.92:8080/solr", "leader":"true"}, "10.205.33.93:8080_solr_collection1_shard1_replica1":{ "state":"active", "core":"collection1_shard1_replica1", "node_name":"10.205.33.93:8080_solr", "base_url":"http://10.205.33.93:8080/solr"}}},
SolrCloud
Central Config in Zookeeper
Automatic Fail-Over
Near-Realtime
Leader Election
Optimistic Locking
Durable writes
Sharding in SolrCloud
Collection i.e. booksShard1Books part1replicareplica
Shard3Books part3replicareplica
Shard2Books part2replicareplica
Shardy s logicznym podziaem indeksu. W szczeglnoci mog by fizycznym podziaem.
Sharding in SolrCloud
--Ile dajemy shardw? 72? -- ee, daj 100 do penego :-)
Sharding in SolrCloud
Implicit documents distributinguniqeId.hashCode() % numServers.
Composite keygroupId!uniqeId
groupId.hashCode() in shard hash range
1-10
11-20
21-30
31-40
hashCode=5hashCode=10hashCode=35hashCode=15
?shard.keys=xxx!
hashCode=35
Jest jeszcze dostpny custom hashing
SolrJ
HTTP Solr Server
Embedded Solr ServerDoes not require HTTP
Cloud Solr ServerPass ZooKeeper hosts
SolrJ POJO
public class Item { @Field String id;
@Field("cat") String[] categories;
@Field List features;
}
//...Item item = new Item();item.id = "one";item.categories = new String[] { "aaa", "bbb", "ccc" };
server.addBean(item);
Cache
Filter cache
Field value cache
Query result cache
Document cache
User/Generic Caches
Field value cache dla facetingu
Query result cache trzyma posortowane id'ki
Document cache trzyma pelne dokumenty Lucene (enableLazyLoading ref dla pol I potem dociaga na podstawie fl)
Cache implementations
Least recent used (LRU)
Fast LRU
Least frequent used
LRUCache - synchronized LinkedHashMapFastLRUCache - ConcurrentHashMap
Omwi opcje cache'a
Warming queries
solr testDate desc
First searcherNew searcher
Kiedy jest otwierany nowy searcher
FQ vs Q
Filter Query (FQ) doesn't involve very complex document scoring
q=description:Potterfq=type:book
Questions?
Thank you!