The effect of precompiling XPath expressions with the lxml module

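While looking into the Python lxml module I used in the previous article, I found that the XPath expressions used to search XML data can apparently be precompiled, so I measured how much faster that makes a search.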

The test data is the same as last time.

$ ls -hl | grep yahoo_list
-rw-rw-r-- 1 hogeuser hogeuser   11K  9月 25 05:46 yahoo_list.rss
-rw-rw-r-- 1 hogeuser hogeuser 1011K  9月 26 14:47 yahoo_list2.rss
Note: yahoo_list2.rss is yahoo_list.rss processed to be roughly 100 times larger.
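
How the larger file was produced is not described here. As a rough sketch of one way to do it, the snippet below repeats every item element of a feed until there are 100 times as many. The tiny inline feed and the helper name inflate are hypothetical stand-ins, and the standard library's xml.etree.ElementTree is used only to keep the sketch dependency-free.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-in for yahoo_list.rss: a tiny RSS feed with two items.
SMALL_RSS = """<rss version="2.0">
  <channel>
    <title>sample feed</title>
    <item><title>first</title></item>
    <item><title>second</title></item>
  </channel>
</rss>"""


def inflate(rss_text, factor):
    """Return the feed's root with every <item> repeated `factor` times."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    originals = channel.findall("item")
    # The originals count as the first copy, so append factor - 1 more.
    for _ in range(factor - 1):
        for item in originals:
            channel.append(copy.deepcopy(item))
    return root


big_root = inflate(SMALL_RSS, 100)
titles = big_root.findall("./channel/item/title")
print(len(titles))  # 2 items x 100 = 200
```

The same ratio shows up in the measurements: the small file yields 20 titles and the large one 2000.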

The measurement script follows.

#!/usr/bin/env python
# -*- coding:utf8 -*-
import sys
import timeit
from lxml import etree

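COUNT = 10000  # number of calls timed per function
if len(sys.argv) < 3:
    # Without both arguments, ET and XPATH would be undefined below.
    sys.exit("usage: %s SOURCE XPATH" % sys.argv[0])

print "SOURCE: %s" % sys.argv[1]
ET = etree.parse(sys.argv[1])

print "XPATH: %s" % sys.argv[2]
XPATH = sys.argv[2]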

def test1():
    '''Search using the XPath expression as a string.
    '''
    global finded
    finded = ET.xpath(XPATH)

compiled_xpath = etree.XPath(XPATH) # Precompile the XPath expression.
def test2():
    '''Search using the compiled XPath.
    '''
    global finded
    finded = compiled_xpath(ET)

for func in ["test1", "test2"]:
    time_eachcall = timeit.Timer(
        setup = ('from __main__ import %s' % func),
        stmt  = ('%s()' % func)
    )
    print func
    print "  Time = %s" % time_eachcall.timeit(number=COUNT)
    print "  Items = %d" % len(finded)

I ran it against each combination of the previous article's test data and XPath expressions.

$ python -V
Python 2.4.3

$ python test_lxml_xpath.py yahoo_list.rss //rss/channel/item/title
SOURCE: yahoo_list.rss
XPATH: //rss/channel/item/title
test1
  Time = 0.740010976791
  Items = 20
test2
  Time = 0.270004034042
  Items = 20

$ python test_lxml_xpath.py yahoo_list2.rss //rss/channel/item/title
SOURCE: yahoo_list2.rss
XPATH: //rss/channel/item/title
test1
  Time = 24.0603630543
  Items = 2000
test2
  Time = 22.6303420067
  Items = 2000

$ python test_lxml_xpath.py yahoo_list.rss /rss/channel/item/title
SOURCE: yahoo_list.rss
XPATH: /rss/channel/item/title
test1
  Time = 0.670010089874
  Items = 20
test2
  Time = 0.21000289917
  Items = 20

$ python test_lxml_xpath.py yahoo_list2.rss /rss/channel/item/title
SOURCE: yahoo_list2.rss
XPATH: /rss/channel/item/title
test1
  Time = 19.6702969074
  Items = 2000
test2
  Time = 18.6202809811
  Items = 2000

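Summarizing the results of the measurement script (each figure is the total time for 10,000 calls):

XPath expression: //rss/channel/item/title

                             yahoo_list.rss (11K)   yahoo_list2.rss (1011K)
Search with XPath string     0.74 sec               24.06 sec
Search with compiled XPath   0.27 sec               22.63 sec

XPath expression: /rss/channel/item/title

                             yahoo_list.rss (11K)   yahoo_list2.rss (1011K)
Search with XPath string     0.67 sec               19.67 sec
Search with compiled XPath   0.21 sec               18.62 sec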

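For yahoo_list.rss (XML data of about 10K), precompiling the XPath expression made the search noticeably faster, roughly 2.7 to 3.2 times.
For yahoo_list2.rss (about 1000K), on the other hand, the gain was only about 5 to 6 percent. Why did it have so little effect?
Some other factor may be involved.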
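
One way to look at the question is to convert the totals into time saved per call. The sketch below only redoes arithmetic on the figures measured above; the helper name saving is made up for illustration. On these numbers the per-call saving is somewhere between about 45 and 145 microseconds in every case, while a single search of the large file takes around 2 milliseconds. That would fit the reading that precompiling removes a roughly fixed per-call cost which the search time itself swamps on a large document, though the measurements here cannot confirm that.

```python
COUNT = 10000  # calls per measurement, as in the script above

# Total seconds for COUNT calls, copied from the results above.
# Keys: (XPath expression, file); values: (string XPath, compiled XPath).
RESULTS = {
    ("//rss/channel/item/title", "yahoo_list.rss"): (0.74, 0.27),
    ("//rss/channel/item/title", "yahoo_list2.rss"): (24.06, 22.63),
    ("/rss/channel/item/title", "yahoo_list.rss"): (0.67, 0.21),
    ("/rss/channel/item/title", "yahoo_list2.rss"): (19.67, 18.62),
}


def saving(string_total, compiled_total, count=COUNT):
    """Return (microseconds saved per call, saving as a percentage)."""
    saved = string_total - compiled_total
    return saved / count * 1e6, saved / string_total * 100


for (xpath, source), (string_total, compiled_total) in RESULTS.items():
    per_call_us, percent = saving(string_total, compiled_total)
    print("%-26s %-16s %6.1f us/call  %5.1f%%"
          % (xpath, source, per_call_us, percent))
```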

In any case, when a process uses the same XPath expression over and over, precompiling it looks worthwhile.
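
As a minimal sketch of that pattern, the snippet below compiles the expression once with etree.XPath and reuses the compiled object across several documents. The inline feeds and the function name titles_in are hypothetical, and the code is written for a current Python 3 rather than the Python 2.4 used for the measurements.

```python
from lxml import etree

# Hypothetical sample feeds standing in for real RSS files.
FEEDS = [
    b"<rss><channel><item><title>a</title></item>"
    b"<item><title>b</title></item></channel></rss>",
    b"<rss><channel><item><title>c</title></item></channel></rss>",
]

# Compile the expression once, outside the loop.
find_titles = etree.XPath("/rss/channel/item/title")


def titles_in(feed_bytes):
    """Parse one feed and return the text of every item title."""
    root = etree.fromstring(feed_bytes)
    # The compiled XPath object is callable on an element or a tree.
    return [title.text for title in find_titles(root)]


all_titles = [titles_in(feed) for feed in FEEDS]
print(all_titles)  # [['a', 'b'], ['c']]
```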

References
High-performance XML parsing in Python with lxml
XPath and XSLT with lxml