Czy mo Ŝna pogodzi ć działalno ść Czy mo Ŝna pogodzi ć działalno ść naukow ą i komercyjn ą?
Na przykładzie historii silnika bazodanowego stworzonego przez firm ę Infobright Inc.
Dominik Ślęzak
O czym b ędzie
� Infobright dzisiaj� Historia rozwoju� Szczegóły technologiczne� Jak się w tym nie pogubić?
2
Infobright
� Technology Innovation� First commercial open source data
warehouse (analytical database engine)� Community & Enterprise Editions
� Lowest cost technology in the market� Simplest & easiest to build & manage� Powerful & scalable MySQL
STARTUP TO WATCH
Cool Vendor in Data Management and Integration
2009
2008 High Performance Data Warehousing Partner
of the YearInfobright: Economic
Data Warehouse Choice
Partner of the Year 2009
� Technology unlike anything in the industry
� Strong Momentum & Mature Product� Release 3.x generally available� +100 customers in 10 Countries� +40 Partners on 6 continents� A vibrant open source community
� +1 million visitors� 20,000 downloads
3
Infobright: Designed for Analytics
Best FitTrue Analytics
Good FitStandard Analytics
Poor FitOLTP
Analytics Questionsabout the dataAggregates: sums, counts...
Some requests for data, optimized by Knowledge Grid
Primarily data retrieval &mixed read/write activity(but we handle mixed read/writes in analytics)
Environment Analytical ad hoc queries
Analytical and standard queries
Transaction applications& operational reportingad hoc queries standard queries & operational reporting
Sample Queries
How many new customers were identified recently?What was the largest sale made for item A?
Report the sales figures from California for X campaign for Jan 2008
SELECT *“Go get the data”
Notes Knowledge Grid is used heavily to resolve the query – almost no I/O
Knowledge Grid is used to reduce I/O times by optimizing data access
It does not use the advantages of a columnar database
Why Customers Buy Infobright (1)
� Fast query response with no tuning
Customer’s Test Alternative Infobright
Analytical queries 2+ hours with MySQL <10 seconds
Query (AND – Left Join) 26.4 seconds in Oracle .02 seconds
� Fast and consistent data load speed as database grows. Up to 300GB/hour on a single server
5
Oracle query set 10 seconds – 15 minutes 0.43 – 22 seconds
BI report 7 hours in Informix 17 seconds
Data load 11 hours in MySQL ISAM 11 minutes
Online Marketing Need Infobright’s Solution
� Data Volumes – online marketing data generates staggering volumes of information that needs to be captured and analyzed
� Real-time Response needed for
� Support for storing up to 50 TBs of data compressed, with up to 40:1 compression, providing the lowest hardware footprint
� Blazing response times for deep
Online Marketing
� Real-time Response needed for rapid response to campaigns and specific targeting
� Deep Ad Hoc Analysis –differentiation through deeper analytics across more granular data
� Low Maintenance – insane growth rates, tight resourcing means a solution with as little maintenance and cost as possible
� Blazing response times for deep business analytic queries on TBs of granular clickstream data
� Lowest total cost of ownership in data warehousing – simple to implement &manage, requiring little administration
� Customer experience, including:
6
Telecommunications Need Infobright’s Solution
� Data Volumes are simply staggering; call center analysis, call data records (CDR), customer billing data, network data incl. alarms, alerts and events
� Support for storing up to 50 TBs of data compressed, with up to 40:1 compression, providing the lowest hardware footprint
� Keeps data together while providing fast
Telecommunications
and events
� Avoiding Data Silos – storing vast amounts of data and ensuring timely access typically results in multiple development projects, creating unconnected silos of data
� Highly Complex Ad Hoc Analysis– vast cross-section of TELCOapplications means widely differing requirements & highly complex, ad hoc queries
� Keeps data together while providing fast response times on TBs of varied data
� Complex business analytics without the maintenance overhead provides the lowest TCO in data warehousing
� Customer experience, including:
7
Financial Services Need Infobright’s Solution
� Response Times – Capital markets firms depend on the speed at which they can analyze and process trades
� Risk Management – Must prevent risk and identify fraud in real-time to
� Fast analytic response times regardless of TBs of ever-increasing trade and tick data volumes
� Detailed business analytics to spot inconsistencies across huge volumes of
Financial Services
risk and identify fraud in real-time to maintain the business
� Regulatory Compliance –Compliance with rising regulatory requirements, directly impacts their ability to stay in business:
� Trading compliance (Reg NMS, MiFID)
� Corporate governance (Sarbanes-Oxley, Basel II)
inconsistencies across huge volumes of live and historical data, prevents losses of money and reputation
� Reporting accuracy regardless of rising volumes helps maintain compliance
� Customer experience, including:
8
Dominik (back to 2005)
9
Challenges (2005-2010)
O u tlo o k T em p . H u m id . W in d S p o rt?
1 S u n n y H o t H ig h W e ak N o
2 S u n n y H o t H ig h S tro n g N o
3 O v ercast H o t H ig h W e ak Y es
4 R ain M ild H ig h W e ak Y es
5 R ain C o ld N o rm al W e ak Y es
6 R ain C o ld N o rm al S tro n g N o
7 O v ercast C o ld N o rm al S tro n g Y es
8 S u n n y M ild H ig h W e ak N o
9 S u n n y C o ld N o rm al W e ak Y es
1 0 R ain M ild N o rm al W e ak Y es
1 1 S u n n y M ild N o rm al S tro n g Y es
� Data tables are incomparably larger than on this illustration but the users want the query speed to stay within minutes
� Data may grow very quickly
�Current data warehouse solutions (MPP, indices, sorting, partitioning, etc.) are not flexible and often require special hardware
10
1 1 S u n n y M ild N o rm al S tro n g Y es
1 2 O v ercast M ild H ig h S tro n g Y es
1 3 O v ercast H o t N o rm al W e ak Y es
1 4 R ain M ild H ig h S tro n g N o
� Data may grow very quickly but the users expect the data warehouse to adjust to the new data very quickly as well
� Types of queries are hard to predict and the users expect that arbitrary queries will be executed reasonably quickly
Column vs. Row-Oriented
EMP_ID FNAME LNAME SALARY
1 Moe Howard 10000
2 Curly Joe 12000
3 Larry Fine 9000
Row Oriented ( 1,Moe,Howard,10000; 2,Curly, Joe,12000; 3,Larry,Fine,9000;)
� Works well if all the columns are needed for � Works well if all the columns are needed for every query.
� Efficient for transactional processing if all the data for the row is available
� Works well with aggregate results (sum, count, avg, min, max, ... )
� Only relevant columns need to be touched� Allows for very efficient compression
Column Oriented ( 1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)
11
Data Packs and Compression
64K
64K
Data Packs� Each data pack contains 65,536 data values� Compression is applied to each individual data pack� The compression algorithm varies depending on data
type and distribution
Compression64K
64K
Compression� Results vary depending on data
distribution among data packs� A typical overall compression ratio
seen in the field is around 10:1� Some customers have seen results
have been as high as 40:1
Patent PendingCompression
Algorithms
12
Knowledge Grid
Data Pack Nodes (DPN)A separate DPN is created for every data pack created in the database to store basic statistical information
HistogramsHistograms are created for every Data Pack that contains numeric data and creates 1024 MIN-MAX intervals.
13
This metadata layer is ~1% of the compressed volume.For example, a 1TB (raw) database would have Knowledge Grid of less than 1 GB.
Character Maps (CMAPs)Every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character
MIN-MAX intervals.
salary age job city
A Simple Query using Knowledge Grid
SELECT count (*) FROM employeesWHERE salary > 50000AND age < 65AND job = ‘Shipping’AND city = ‘TORONTO’;
2. Find the Data Packs that contain age < 65
3. Find the Data Packs that have job = ‘Shipping’
All packs ignored
All packs ignored
1. Find the Data Packs with salary > 50000
‘Shipping’
4. Find the Data Packs that have City = ‘Toronto’
All packs ignored
5. Now we eliminate all rows that have been flagged as irrelevant.
Only this pack will be decompressed
6. Finally we have identified the data pack that needs to be decompressed
Completely Irrelevant
Suspect
All values match
14
SELECT MAX(A) FROM T WHERE B>15;
E E
I/E I/E
1 2 3
I/E I/E
I/S/R denotes irrelevant/suspect/relevant; E – exact computation (decompression)
15
HashBlockJoin (advanced usage of metadata)
16
MySQL Pluggable Storage Engine Architecture
17
Infobright and MySQL (2006-2010)
Integration provides a simple path to highly scalable data warehousing for MySQL users
No new management interface to learninterface to learn
MySQL integration enables seamless connectivity to BI tools and MySQL drivers for ODBC, JDBC, C/C++, .NET, Perl, Python, PHP, Ruby, Tcl, etc.
18
19
Features
Technical Support Forums and/orone-time 4-hr support pack
Silver, Gold, Platinum
Warranty and Indemnification
No Included
Mixed Read/Writes No Supported
Comparison of ICE and IEE (2008-2010)
Infobright Loader Up to 50 GB/hrMulti-threaded, Up to 300GB/hr
Data Load Types Text onlyText
Binary (up to 100% faster)
MySQL Loader No Supported
Temp Tables No Supported
Platform Support(keeps changing)
64-bit Intel and AMDRHEL 5, CentOS 5,
Debian 32-bit, Ubuntu 8.04, Fedora 9, Windows XP...
64-bit Intel and AMDSolaris 10, Windows Server,
RHEL 5, RHEL AS, CentOS 5, Debian...
SubscriptionHours of Service
Response Time SLA
# Named Client
Contacts
Support Access
MethodsServices
* Two year PrePay
Charge/TB
One YearAnnual
Charge/TB
Silver
9am-5pmEDT
(no phone support)
Severity 1=4 hr
Severity 2=8 hr
1
� Web� Email� Self-Service Knowledge Base
•Training and consulting services at published rate
$12,950 $15,950
8am-6pmSeverity
� Phone� Web
•Health Check Service included
IEE Annual Subscriptions
Gold8am-6pm
(ET for NA,CET Europe)
Severity 1=2 hr
Severity 2=6 hr
3
� Web� Email� Self-Service Knowledge Base
included•10% Discount training and other consulting services•Sev 1 hot fixes
$15,950 $18,950
Platinum 7 x 24 x 365
Severity 1=1 hr
Severity 2=4 hr
5
� Phone� Web� Email� Self-Service Knowledge Base
•Health Check Service included•20% discount training and other consulting services•Sev 1 and 2 hot fixes
$18,950 $21,950
Perpetual Perpetual
(Same as Gold Support Level)
18% of Purchase Price not to increase by more then 10% per Yr.
$40,000
* Two year subscription pricing requires two year p repay
Why Customers Buy Infobright (2)
• Less cost
• Capacity-based subscription model
• Less hardware required
• Less work
• Simple schemas and standard SQL• Simple schemas and standard SQL
• Reduced monitoring requirements
• Reduced maintenance requirements
• Open source
• Fast implementation
• Open source community
• Broad platform support
22
Dominik (in 2010)
24
Dominik (in 2010)
25
Approximate SQL
� In such areas as, e.g., Business Intelligence and Web Analytics, there is an ongoing debate whether the answers to SQL statements have to be always exact.
� The same question occurs in the case of SQL-based machine learning algorithms, which are often based on heuristics, randomness and inexactness anyway.on heuristics, randomness and inexactness anyway.
� Motivation for SQL approximations is related also to such aspects as complexity of queries and data sources, dynamically changing data with a limited access, and huge data sets with a need to monitor convergence of query execution in time, regardless of whether the final answers are to be exact or not.
26
Data Granulation
27
SQL-based Machine Learning
28
Community Inspirations
� New Objectives� New Schemas� New Volumes� New Queries
� Count Distinct� Self-Joins� Decision Trees� Contingencies
� New Statistics
� New Data Types� SQL Extensions� Feature Extraction� Data Compression
29
DZIĘKUJĘ ZA UWAGĘ!!!www.infobright.com
www.infobright.org
http://en.wikipedia.org/wiki/Infobright
Top Related