【官方】MariaDB Optimization and Indexes 中文版(二)

ACMUG征集原创技术文章。详情请添加 A_CMUG或者扫描文末二维码关注我们的微信公众号。有奖征稿,请发送稿件至:acmug@acmug.com。
3306现金有奖征稿说明:知识无价,劳动有偿,ACMUG特约撰稿人有奖回报计划(修订版)


作者: MariaDB官方授权资词翻译组翻译

资词翻译组,由数名数据库技术爱好者组成,旨在传播技术,帮助他人,提升自我。目前的成员有:田丽芳,强昌金,王竹峰,侯军伟,吕智超,刘启荣,周彦伟,古雷。
译事三难:信、达、雅。信者,真也,真者,不伪也;达者,至也,至者,无过无不及也;雅者,文学性也,文学性者,当雅则雅当俗则俗也。我们深知能达成此三事,绝非一日之功,亦非常人所能。然,苟利国家生死以,岂因祸福避趋之。我们还是希望竭尽所能,不遗余力,做一点微小的工作,提高姿势水平。由于能力有限,水平一般,有错误不妥之处,还请批评指正,希望能得到大家的资词。

资词翻译组获得MySQL Server团队和MariaDB官方授权,翻译相关技术文章。


全文索引

全文索引概述

MariaDB已经支持了全文索引和搜索:

  • MariaDB中的全文索引(full-text index)是一种类型为FULLTEXT的索引,当搜索一个列中的部分文本时,它可以允许更多的选项。
  • 全文索引只能用于MyISAM和Aria表、MariaDB 10.0.5及后续版本的InnoDB表以及MariaDB 10.0.15及后续版本中的Mroonga表,而且只能为CHAR、VARCHAR或TEXT字段创建。
  • 即便是存储引擎支持,分区表也不能包含全文索引。
  • 在创建表时不能在CREATE TABLE语句中指定一个FULLTEXT索引定义,可以在之后使用ALTER TABLECREATE INDEX来添加。
  • 对于大的数据集,先往一个不含FULLTEXT索引的表里加载数据,再创建索引,要比往一个含有FULLTEXT的表中加载数据快得多。

全文搜索可用MATCH() AGAINST语法执行。MATCH()接收一个逗号分隔的列,用以指定要搜索的列。AGAINST接收要搜索的字符串,和一个可选的修饰符(modifier)以指定执行搜索所采用的类型。搜索的字符串一定要是一个文本字串符,而不是一个变量或是一个列名。

MATCH(col1, col2) AGAINST (expr [search_modifier])

排除的结果

  • 部分词被排除。
  • 长度小于4个字符(3个或以下)的词不会被保存在全文索引中。这个值可以通过修改ft_min_word_length(InnoDB中用innodb_ft_min_token_size)系统变量来调整。
  • 长度大于84个字符的词也不会被保存在全文索引中。这个值可以通过修改ft_max_word_length(InnoDB中用innodb_ft_max_token_size)系统变量来调整。
  • 停用词(stopwords)是一个含有常见词的列表,诸如”once”或”then”,除非IN BOOLEAN MODE模式被使用,否则它们不会出现在搜索结果中。MyISAM/Aria表和InnoDB表所用的停用词列表有所不同。更多细节与全部列表,以及如何修改默认列表,请参考stopwords 
  • 只对于MyISAM/Aria来讲,如果一个单词出现在一半以上的行记录中,它也会被从全文搜索的结果中排除。

相关度

MariaDB基于大量的因素,计算出每行结果的相关度,这些因素包括索引中单词的数量、行中唯一单词的数量、同时在索引和结果中的词的数量以及词的权重。在英语中,’cool’的权重将会小于’dandy’,至少目前是这样的。通过简单的使用字段列中的MATCH函数,相关度可以做为查询结果的一部分返回(给客户端)。

全文索引类型

自然语言模式

自然语言模式(IN NATURAL LANGUAGE MODE)是全文搜索的默认类型,并且可以省略关键词。没有特殊的运算符,搜索包括一个或多个逗号分隔的关键词。

结果以相关度倒序返回。

布尔模式

布尔(Boolean)搜索准许使用许多特殊运算符。

运算符 描述
+ 特定词必须在所有返回的行中。
特定词不出现在任何返回的行中。
< 它后面跟着的词的相关度要比其它词低,即使包含它的行仍然会匹配。
> 它后面跟着的词的相关度要比其它词高。
() 用来将单词归组到一个子表达式中。
~ 跟随其后的词给行一个负相关作用(这与“-”运算符不同,“-”是特意将特定词排除,也跟“<”操作符不同,“<”依然对行有正相关作用)。
* 通配符,代表0个或多个字符。它只能出现在词尾。
任何被双引号括起来的内容都被看成是一个整体(比如,你可以进行词组匹配了)。

搜索既不会以相关度顺序返回,也不适用50%的限制。停用词和单词最小值与最大值的限制仍然像以往一样适用。

QUERY扩展

查询扩展(query expansion)是自然语言搜索的变体。搜索字符串被用来执行一个常规的自然语言搜索。然后,搜索返回结果中相关度最高的行中词被加入到搜索字符串中再做一次搜索。Query返回第二次搜索的结果。IN NATURAL LANGUAGE MODE WITH QUERY EXPANSIONWITH QUERY EXPANSION 修饰符指定一个query扩展搜索。当搜索依赖于数据中的隐含信息时,这会很有用,比如MariaDB是一个数据库。

另见

举例

创建一个表,执行一个基本的搜索:

CREATE TABLE ft_myisam(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft_myisam(copy) VALUES ('Once upon a time'),
  ('There was a wicked witch'), ('Who ate everybody up');

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
+--------------------------+
| copy                     |
+--------------------------+
| There was a wicked witch |
+--------------------------+

多个词:

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked,witch');
+---------------------------------+
| copy                            |
+---------------------------------+
| There was a wicked witch        |
+---------------------------------+

由于’Once’是一个停用词,无结果返回:

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('Once');
Empty set (0.00 sec)

在数据中插入单词“wicked”超过一半行数,结果中将排除它:

INSERT INTO ft_myisam(copy) VALUES ('Once upon a wicked time'),
  ('There was a wicked wicked witch'), ('Who ate everybody wicked up');

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
Empty set (0.00 sec)

使用布尔模式(IN BOOLEAN MODE)突破50%的限制:

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked' IN BOOLEAN MODE);
+---------------------------------+
| copy                            |
+---------------------------------+
| There was a wicked witch        |
| Once upon a wicked time         |
| There was a wicked wicked witch |
| Who ate everybody wicked up     |
+---------------------------------+

返回相关度:

SELECT copy,MATCH(copy) AGAINST('witch') AS relevance 
  FROM ft_myisam WHERE MATCH(copy) AGAINST('witch');
+---------------------------------+--------------------+
| copy                            | relevance          |
+---------------------------------+--------------------+
| There was a wicked witch        | 0.6775632500648499 |
| There was a wicked wicked witch | 0.5031757950782776 |
+---------------------------------+--------------------+

使用Query扩展(WITH QUERY EXPANSION),在下面的例子中,’MariaDB’常常与单词’database’联系在一起,因此当查询扩展启用时,即使没有明确要求,它也会在结果中返回。

CREATE TABLE ft2(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft2(copy) VALUES
 ('MySQL vs MariaDB database'),
 ('Oracle vs MariaDB database'), 
 ('PostgreSQL vs MariaDB database'),
 ('MariaDB overview'),
 ('Foreign keys'),
 ('Primary keys'),
 ('Indexes'),
 ('Transactions'),
 ('Triggers');

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database');
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
+--------------------------------+
3 rows in set (0.00 sec)

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database' WITH QUERY EXPANSION);
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
| MariaDB overview               |
+--------------------------------+
4 rows in set (0.00 sec)

布尔模式中的部分词匹配:

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('Maria*' IN BOOLEAN MODE);
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
| MariaDB overview               |
+--------------------------------+

使用boolean操作符:

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('+MariaDB -database' IN BOOLEAN MODE);
+------------------+
| copy             |
+------------------+
| MariaDB overview |
+------------------+

MATCH AGAINST

语法

MATCH (col1,col2,...) AGAINST (expr [search_modifier])

描述

一个用于在全文索引上执行全文搜索的特定的结构。

详细说明,请参阅全文索引概述(##全文索引概述),更多有关该主题的文章请参阅全文索引(#全文索引)。

例子

CREATE TABLE ft_myisam(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft_myisam(copy) VALUES ('Once upon a time'), ('There was a wicked witch'), 
 ('Who ate everybody up');

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
+--------------------------+
| copy                     |
+--------------------------+
| There was a wicked witch |
+--------------------------+
SELECT id, body, MATCH (title,body) AGAINST
    ('Security implications of running MySQL as root'
    IN NATURAL LANGUAGE MODE) AS score
    FROM articles WHERE MATCH (title,body) AGAINST
    ('Security implications of running MySQL as root'
    IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+

myisam_ftdump

myisam_ftdump是一个用来显示MyISAM全文索引信息的程序。它可以扫描并转存整个索引,这可能是一个耗时的过程。

如果服务在运行,确保你先执行了FLASH TABLE语句。

使用

myisam_ftdump <table_name> <index_num>

table_name加不加.MYI索引后缀都可以。

index_num指的是表定义时的索引(顺序)号,从0开始。以下面的表定义为例:

CREATE TABLE IF NOT EXISTS `employees_example` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(30) NOT NULL,
  `last_name` varchar(40) NOT NULL,
  `position` varchar(25) NOT NULL,
  `home_address` varchar(50) NOT NULL,
  `home_phone` varchar(12) NOT NULL,
  `employee_code` varchar(25) NOT NULL,
  `bio` text NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `employee_code` (`employee_code`),
  FULLTEXT (`bio`)
) ENGINE=MyISAM;

全文索引号为2,主键是索引 0,唯一键是索引1

你可以像下面这样用myisam_ftdump生成一个以出现频率排序的索引记录列表:

myisam_ftdump -c mytexttable 1 | sort -r

选项

Option Description
-h, —help 显示帮助并退出。
-?, —help 同 -h。
-c, —count 计算每个单词的统计值(总数和全局权重)。
-d, —dump 转储索引(包括数据偏移和单词权重)。
-l, —length 报告长度分布。
-s, —stats 报告全局统计。
-v, —verbose 详细信息。

停用词

停用词(stopwords)是一个全文索引可忽略的常用词列表。

MyISAM停用词

针对MyISAM表的全文索引,列表默认从storage/myisam/ft_static.c文件中构建,并用系统的字符集和校验来搜索。系统变量ft_stopword_file允许默认列表可以被另一个文件覆盖,或完全忽略停用词。

如果停用词列表发生变更,任何已有的全文索引需要重新建立。

下表展示了本文撰写时停用词的默认列表,不过你应该始终将storage/myisam/ft_static.c作为权威列表。有关详细说明,请参阅全文索引概述(#全文索引概述),相关文章请参阅全文索引(#全文索引)。

a’s able about above
according accordingly across actually
after afterwards again against
ain’t all allow allows
almost alone along already
also although always am
among amongst an and
another any anybody anyhow
anyone anything anyway anyways
anywhere apart appear appreciate
appropriate are aren’t around
as aside ask asking
associated at available away
awfully be became because
become becomes becoming been
before beforehand behind being
believe below beside besides
best better between beyond
both brief but by
c’mon c’s came can
can’t cannot cant cause
causes certain certainly changes
clearly co com come
comes concerning consequently consider
considering contain containing contains
corresponding could couldn’t course
currently definitely described despite
did didn’t different do
does doesn’t doing don’t
done down downwards during
each edu eg eight
either else elsewhere enough
entirely especially et etc
even ever every everybody
everyone everything everywhere ex
exactly example except far
few fifth first five
followed following follows for
former formerly forth four
from further furthermore get
gets getting given gives
go goes going gone
got gotten greetings had
hadn’t happens hardly has
hasn’t have haven’t having
he he’s hello help
hence her here here’s
hereafter hereby herein hereupon
hers herself hi him
himself his hither hopefully
how howbeit however i’d
i’ll i’m i’ve ie
if ignored immediate in
inasmuch inc indeed indicate
indicated indicates inner insofar
instead into inward is
isn’t it it’d it’ll
it’s its itself just
keep keeps kept know
knows known last lately
later latter latterly least
less lest let let’s
like liked likely little
look looking looks ltd
mainly many may maybe
me mean meanwhile merely
might more moreover most
mostly much must my
myself name namely nd
near nearly necessary need
needs neither never nevertheless
new next nine no
nobody non none noone
nor normally not nothing
novel now nowhere obviously
of off often oh
ok okay old on
once one ones only
onto or other others
otherwise ought our ours
ourselves out outside over
overall own particular particularly
per perhaps placed please
plus possible presumably probably
provides que quite qv
rather rd re really
reasonably regarding regardless regards
relatively respectively right said
same saw say saying
says second secondly see
seeing seem seemed seeming
seems seen self selves
sensible sent serious seriously
seven several shall she
should shouldn’t since six
so some somebody somehow
someone something sometime sometimes
somewhat somewhere soon sorry
specified specify specifying still
sub such sup sure
t’s take taken tell
tends th than thank
thanks thanx that that’s
thats the their theirs
them themselves then thence
there there’s thereafter thereby
therefore therein theres thereupon
these they they’d they’ll
they’re they’ve think third
this thorough thoroughly those
though three through throughout
thru thus to together
too took toward towards
tried tries truly try
trying twice two un
under unfortunately unless unlikely
until unto up upon
us use used useful
uses using usually value
various very via viz
vs want wants was
wasn’t way we we’d
we’ll we’re we’ve welcome
well went were weren’t
what what’s whatever when
whence whenever where where’s
whereafter whereas whereby wherein
whereupon wherever whether which
while whither who who’s
whoever whole whom whose
why will willing wish
with within without won’t
wonder would wouldn’t yes
yet you you’d you’ll
you’re you’ve your yours
yourself yourselves zero

InnoDB停用词

在索引被创建的时候,全文索引停用词只有在系统变量innodb_ft_enable_stopword设置了时(默认就是)才被启用。

停用词列表由以下条件确定:

  • 如果设置了系统变量innodb_ft_user_stopword_table,这个表就当成是停用词列表。
  • 如果没有设置系统变量innodb_ft_user_stopword_table,则使用由innodb_ft_server_stopword_table指定的表。
  • 如果两者都未设置,使用内建的列表,它可以查询Information_Schema中的INNODB_FT_DEFAULT_STOPWORD表来查看。

在前两种情况中,设置系统变量和创建索引时指定的表必须要存在。它必须是一张只有一列的InnoDB表,该列名为VALUE,类型为VARCHAR。

InnoDB默认的停用词列表与MyISAM的默认列表不同,会少很多,在本文撰写时,列表如下:

a about an are
as at be by
com de en for
from how i in
is it la of
on or that the
this to was what
when where who will
with und the www

ANALYZE TABLE

语法

ANALYZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE tbl_name [,tbl_name ...] 
 [PERSISTENT FOR [ALL|COLUMNS ([col_name [,col_name ...]])] [INDEXES ([index_name [,index_name ...]])]]

描述

ANALYZE TABLE会分析并存储表的索引的分布情况(index statistics)。在分析的过程中,myisam表会被read lock锁锁住。Innodb表会被write lock锁锁住。ANALYZE TABLE语句适用于MyISAM、Aria、Innodb表。对于MyISAM表,该语句和myisamchk —analyze是等价的。

关于InnoDB引擎的ANALYZE TABLE详情,请查看InnoDB Restrictions

MariaDB在执行一个和常量的join时,会使用持久化的索引统计信息来决定表的join顺序。此外,索引分布情况可能在query时一个表使用哪些索引的决策中被使用。

此语句需要表的SELECT和INSERT权限

默认情况下,ANALYZE TABLE语句会被写入binlog,并且被复制。NO_WRITE_TO_BINLOG关键字可以保证该语句不会被写入binlog。

ANALYZE TABLE同时也支持分区表,你可以使用ALTER TABLE… ANALYZE PARTITION去分析一个或者多个分区。

MariaDB 10.0.1 扩展了ANALYZE TABLE,支持引擎独立的索引统计

目前,引擎独立的统计信息是通过全表和全索引扫描来收集的,所以这个操作很重。

ANALYZE TABLE的语法在MariaDB 10.0.1中进行了扩展,通过PERSISTENT FOR来支持特定列和索引的统计信息收集。

需要注意blob和text字段的统计信息是不能收集的,如果指定了这种类型的字段,会返回一个警告信息。

-- update all engine-independent statistics for all columns and indexes
ANALYZE TABLE tbl PERSISTENT FOR ALL;

-- update specific columns and indexes:
ANALYZE TABLE tbl PERSISTENT FOR COLUMNS (col1,col2,...) INDEXES (idx1,idx2,...);

-- empty lists are allowed:
ANALYZE TABLE tbl PERSISTENT FOR COLUMNS (col1,col2,...) INDEXES ();
ANALYZE TABLE tbl PERSISTENT FOR COLUMNS () INDEXES (idx1,idx2,...);

-- the following will only update mysql.table_stat fields:
ANALYZE TABLE tbl PERSISTENT FOR COLUMNS () INDEXES ();

-- with use_stat_tables != NEVER a simple ANALYZE TABLE collects engine-independent statistics for all columns and indexes.
SET SESSION use_stat_tables='COMPLEMENTARY';
ANALYZE TABLE tbl;

即使设置了STATS_PERSISTENT=1,如果需要更新引擎独立的统计信息,也需要使用PERSISTENT FOR关键字。

从MariaDB 5.3版本开始,Aria存储引擎在执行该语句的时候可以支持进度汇报。


注:ACMUG收录技术文章版权属于原作者本人所有。如有疑问,请联系作者。
看完转发,手留余香。关注我们,一起进步。
关注ACMUG公众号,参与社区活动,交流开源技术,分享学习心得,一起共同进步。

发表评论