Apache SolrCloud

Searching in the Cloud

Arkadiusz MasiakiewiczMicha Warecki

Przedstawi si, jestemy z PayU (Allegro group).

Chcemy opowiedzie o naszych krtkich dowiadczeniach z Solr. Opowiemy o technicznych szczegach tego fajnego serwera.

Mielimy problem z czasami wyszukiwania, dlatego zainteresowalimy si t technologi.

Moe by ciekawe bo zastosowalimy SolrCloud (std nazwa prezentacji).

Pytania prosimy na kocu.

About us

Arkadiusz MasiakiewiczNoSQL Databases

Full-text search engines

Neural networks

Scuba diving

Micha WareckiGC

JiT Compilers

Concurrency

Non-blocking algorithms

Programming languages runtime

Astronomy

Agenda

Introduction to Apache Solr

Data in PayU

About SolrCloud

Performance

About Apache Solr

Full-text search server based on Apache Lucene

Platform independent

Buzzword compatible (sharding, replication, clustering, cloud, scaling, big data, ...)

Where to use Solr?

Sklepy internetowe empikWyszkiwarkiCzciowe zastpienie bazy danychwyszukiwanie plikw po treciwyszukiwanie osb

Who uses Solr?

About Apache Solr

Indexer / Scheduler

Lucene IndexSolr server

SchemaConfigDocuments / Update queryClient applicationQuery (HTTP)ResponseWeb container

Omwi komponenty SolrKomunikacja

Platform independent

(XML|JSON|PHP...)/HTTP

HTTP-based queries

http://localhost:8080/solr/select?q=query&start=50

&rows=25

&fq=field:ala

&facet=on&facet.eld=category

&sort=dist(2, point1, point2) desc

Solr schema

name

type

indexed

stored

multiValued

required

Omwi atrybuty pl

type i stored wielko indeksu

Przykad z ycia: setki milionw dokumentw I zwracamy tylko id, ktore pozniej wyciagamy po indeksie z bazy

Solr schema

StrField String (UTF-8 encoded string or Unicode).

TextField Text, usually multiple words or tokens.

TrieDateField Date field accessible for Lucene TrieRange processing.

TrieDoubleField, TrieFloatField, TrieIntField, TrieLongField

UUIDField Universally Unique Identifier (UUID). Pass in a value of "NEW" and Solr will create a new UUID.

CurrencyFieldSupports currencies and exchange rates.

TextField analizery (StandardTokenizer whitespace, dots, itd.) i filtry (lowercasefilter) w czasie indeksowania i zapytania

CurrencyField zamiana wartoci w trakcie zapytania

Data import

SolrData feedersXML DocumentsXML

CSV

File system

Web spiders

RSS/Atom

POJO

E-mail

DB

Jak to si dziej, e moemy odpyta solara o dokumenty?

Dodatkowo mona powiedzie, e istnieje moliwo zaimplementowania wasnego importu.

Data configuration

data source

documententity(query, delta query)field

SolrDBUpdate querySQLDIH

Peny import

Import przyrostowy

Dodatkowwo robimy full zamiast delta import, bo delta robi selecta per dokument

Plugins

Search components

Request handlers

Process factory

Search component moe podkrela wyszukiwane sowa kluczowe; zwraca dodatkowe informacje jak np. Ilo sw w polu

Request handlers podpinamy komponenty pod odpowiedni ciek URL (endpoint)

Process factory przy indeksowaniu mona za jego pomoc doda nowe pola, zmienia je itd.

Data in PayU

~100 GB index size

~400 milions of documents to search

More than 30 fields in schema

Dotyczy to tylko jednej tabeli bazodanowej

Jest to stosunkowo due wdroenie.

Data flow in PayU

Client applicationSolrDB1: search params

3: DB id's

2: DB id's

4: Data

Connecting to your old app

Hibernate criteria

JPA criteria

Custom criteria

q=field1:testvalue&wt=javabin

Mona wwczas bardzo atwo przecza si pomidzy baz danych, a Solr

Traditional Solr architecture

Solr shard1- config- schemaSolr shard1 replica- config- schemaSolr shard1- config- schema

Solr shard1- config- schemaSolr shard2 replica- config- schemaSolr shard2- config- schema

manually copy config

manually split index

manually shard queries

add replica

manually copy config

setup replication

does not provide fail-over

separate monitoring

Zookeeper cluster

SolrCloud architecture

confClusterstateconfconfconfZookeeper

Solr instance1

Shard1leader

Shard2Replica

Solr instance2

Shard1replica

Shard2leader

confClusterstateconfconfconfZookeeper

confClusterstateconfconfconfZookeeper2

Zookeeper1

Zookeeper3

Add documentMog by logiczne I fizyczne instancje

ZooKeeperZooKeeperZooKeeperAssigning machines

Number of shards : 3

Shard1



Shard1

Shard2



Shard1

Shard2

Shard3

Now you can search and index


Shard1

Shard2

Shard3

Replicashard1


Shard1

Shard2

Shard3

Replicashard1

Replicashard2


Shard1

Shard2

Shard3

Replicashard1

Replicashard2

Replicashard3

ZooKeeper configuration

{"collection1":{ "shards":{ "shard1":{ "range":"80000000-8443ffff", "state":"active", "replicas":{ "10.205.33.92:8080_solr_collection1_shard1_replica2":{ "state":"active", "core":"collection1_shard1_replica2", "node_name":"10.205.33.92:8080_solr", "base_url":"http://10.205.33.92:8080/solr", "leader":"true"}, "10.205.33.93:8080_solr_collection1_shard1_replica1":{ "state":"active", "core":"collection1_shard1_replica1", "node_name":"10.205.33.93:8080_solr", "base_url":"http://10.205.33.93:8080/solr"}}},

SolrCloud

Central Config in Zookeeper

Automatic Fail-Over

Near-Realtime

Leader Election

Optimistic Locking

Durable writes

Sharding in SolrCloud

Collection i.e. booksShard1Books part1replicareplica

Shard3Books part3replicareplica

Shard2Books part2replicareplica

Shardy s logicznym podziaem indeksu. W szczeglnoci mog by fizycznym podziaem.


--Ile dajemy shardw? 72? -- ee, daj 100 do penego :-)


Implicit documents distributinguniqeId.hashCode() % numServers.

Composite keygroupId!uniqeId

groupId.hashCode() in shard hash range

1-10

11-20

21-30

31-40

hashCode=5hashCode=10hashCode=35hashCode=15

?shard.keys=xxx!

hashCode=35

Jest jeszcze dostpny custom hashing

SolrJ

HTTP Solr Server

Embedded Solr ServerDoes not require HTTP

Cloud Solr ServerPass ZooKeeper hosts

SolrJ POJO

public class Item { @Field String id;

@Field("cat") String[] categories;

@Field List features;

}

//...Item item = new Item();item.id = "one";item.categories = new String[] { "aaa", "bbb", "ccc" };

server.addBean(item);

Cache

Filter cache

Field value cache

Query result cache

Document cache

User/Generic Caches

Field value cache dla facetingu

Query result cache trzyma posortowane id'ki

Document cache trzyma pelne dokumenty Lucene (enableLazyLoading ref dla pol I potem dociaga na podstawie fl)

Cache implementations

Least recent used (LRU)

Fast LRU

Least frequent used

LRUCache - synchronized LinkedHashMapFastLRUCache - ConcurrentHashMap

Omwi opcje cache'a

Warming queries

solr testDate desc

First searcherNew searcher

Kiedy jest otwierany nowy searcher

FQ vs Q

Filter Query (FQ) doesn't involve very complex document scoring

q=description:Potterfq=type:book

Questions?

Thank you!

Apache SolrCloud

Technology

Transcript of Apache SolrCloud