はじめに

前回に引き続き、Postgresのオプティマイザの挙動を確認しようの会です。

tablename	attname	avg_width	n_distinct
table1	c1	11	-1
table1	c2	8	-0.55219007
table1	c3	7	8332669
table1	c4	6	930908
table1	c5	5	98600
table1	c6	4	9990
table1	c7	3	1000
table1	c8	2	100
table1	c9	2	10

みたいなデータ分布のテーブルがあるとして、

create index table1_c1_c6 on table1 (c1, c6);
create index table1_c6_c1 on table1 (c6, c1);

みたいなインデックスを貼ってc1 like '+%' and c6 = '5000'のような大味なlike演算子のクエリをブッパしたらどちらのインデクスが使われるんだろうねというお話です。

環境

今回からPostgreSQL 14.5をソースコードからコンパイルしてインストールする方法で検証します。

select version();

PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit

みんな大好きRed Hat Enterprise Linuxです。

あとはクエリヒントを与えられるようにpg_hint_planをソースコードからコンパイルしてインストールしています。

環境構築

こんな感じにテーブルを作成し、

create table table1
(
    id bigint primary key,
    c1 varchar(100),
    c2 varchar(100),
    c3 varchar(100),
    c4 varchar(100),
    c5 varchar(100),
    c6 varchar(100),
    c7 varchar(100),
    c8 varchar(100),
    c9 varchar(100)
);

create index table1_c1_c6 on table1 (c1, c6);
create index table1_c6_c1 on table1 (c6, c1);

こんな感じにデータを投入します。

synchronous_commitパラメータとかを変えずにデータ投入をしたらなんか12時間くらいかかってました。つらい

using Dapper;
using Npgsql;

namespace ConsoleApp3
{
    internal class Program
    {
        private static readonly string[] Symbols = { "+", "-", "*", "/", "@", "" };

        private static string GetSymbol(Random random)
        {
            var symbol = Symbols[random.Next(Symbols.Length)];
            return symbol;
        }

        static void Main(string[] args)
        {
            using var db = new NpgsqlConnection("Host=192.168.2.100;Username=testdb;Database=testdb");
            db.Open();

            var tx = db.BeginTransaction();
            var rand = new Random();

            for (var i = 0; i < 100_000_000; i++)
            {
                if (i != 0 && i % 10000 == 0)
                {
                    Console.WriteLine("{0}", i);
                    tx.Commit();
                    tx = db.BeginTransaction();
                }

                var n = rand.Next();

                var p = new
                {
                    Id = i,
                    C1 = $"{GetSymbol(rand)}{i:0000000000}",
                    C2 = n % 100_000_000,
                    C3 = n % 10_000_000,
                    C4 = n % 1_000_000,
                    C5 = n % 100_000,
                    C6 = n % 10_000,
                    C7 = n % 1_000,
                    C8 = n % 100,
                    C9 = n % 10,
                };

                db.Execute("insert into table1 (id, c1, c2, c3, c4, c5, c6, c7, c8, c9) values (@Id, @C1, @C2, @C3, @C4, @C5, @C6, @C7, @C8, @C9)", p, transaction: tx);
            }

            tx.Commit();
        }
    }
}

検証

以下のクエリを想定した場合、カーディナリティ自体が高いのはc1ですが、このlike演算子の使い方であればc6が先頭に来ているインデックスを使用したほうが有利なのはデータの傾向を知っている人間であればすぐにわかります。

explain analyse
select *
from table1 t
where c1 like '+%'
  and c6 = '5000';

この場合はちゃんとtable1_c6_c1（c6が先頭のインデックス）を使用してくれます。

Index Scan using table1_c6_c1 on table1 t  (cost=0.57..1888.12 rows=1716 width=56) (actual time=0.027..2.256 rows=1728 loops=1)
  Index Cond: (((c6)::text = '5000'::text) AND ((c1)::text >= '+'::text) AND ((c1)::text < ','::text))
  Filter: ((c1)::text ~~ '+%'::text)
Planning Time: 0.109 ms
Execution Time: 2.310 ms

また、c6の条件をlike演算子に変えてもちゃんとtable1_c6_c1を優先的に使ってくれます。

explain analyse
select *
from table1 t
where c1 like '+%'
  and c6 like '500%';

Bitmap Heap Scan on table1 t  (cost=2312.27..23275.90 rows=1714 width=56) (actual time=14.581..355.768 rows=18441 loops=1)
  Filter: (((c1)::text ~~ '+%'::text) AND ((c6)::text ~~ '500%'::text))
  Heap Blocks: exact=18289
  ->  Bitmap Index Scan on table1_c6_c1  (cost=0.00..2311.84 rows=19180 width=0) (actual time=12.230..12.231 rows=18441 loops=1)
        Index Cond: (((c6)::text >= '500'::text) AND ((c6)::text < '501'::text) AND ((c1)::text >= '+'::text) AND ((c1)::text < ','::text))
Planning Time: 0.091 ms
Execution Time: 357.301 ms

まとめ

細かい挙動は追えていませんが、ちゃんと使ってほしいインデックスを選択してくれるようです。えらい

おまけ

pg_hint_planを使うと強制的に使うインデックスとかを弄れます。まぁ使わないに越したことはないですけど

explain analyse
/*+
  IndexScan(t table1_c1_c6)
 */
select *
from table1 t
where c1 like '+%'
  and c6 like '500%';

Index Scan using table1_c1_c6 on table1 t  (cost=0.57..357692.13 rows=1714 width=56) (actual time=0.071..1810.166 rows=18441 loops=1)
  Index Cond: (((c1)::text >= '+'::text) AND ((c1)::text < ','::text) AND ((c6)::text >= '500'::text) AND ((c6)::text < '501'::text))
  Filter: (((c1)::text ~~ '+%'::text) AND ((c6)::text ~~ '500%'::text))
Planning Time: 0.128 ms
Execution Time: 1811.353 ms

おわり

レンコン畑でつかまえて

Postgresでも複数のインデックスが使用可能な状況ならカーディナリティが高いカラムのが優先的に使われるんだよね　その2

はじめに

環境

環境構築

検証

まとめ

おまけ