From byte-code to a query syntax tree¶

This document describes the core of the revenge package for getting a query synxtax tree from the byte-code of generator expressions. It spans several highly coupled modules:

The xotl.ql.revenge.scanners module,
the xotl.ql.revenge.parsers module,
the xotl.ql.revenge.walkers module,
and (parts of) the xotl.ql.revenge.qst module.

Though we have tried to make a linear exposition of the matter, the high degree of coupling between those phases imposes a challenge to justify decisions made in the first layers to accommodate the actions of the later ones.

This package derives from the work of John Aycock and others on the uncompyle2 package. In fact, this document was produced at the same time we were trying to grasp how that package worked and how we could modify it to better suite our needs. This fact has led to a pattern of exposition that may seem awkward at times because we have had to revise our previous assumptions when we analyzed further. Nevertheless, we have striven to remove those issues the best we could.

Since the goal of the original package is much more ambitious than ours we have stripped big parts of the original grammar and the scanner has been almost entirely reimplemented.

uncompyle2 handles the problem of getting Python source from the byte-code (reverse engineering the byte-code) by casting it as a parsing problem. Thus compiler techniques are applied.

The byte-code is passed to a scanner that produces the tokens. Then a parser recognizes if the stream of tokens can be generated by a given grammar, if so, it produces a syntax tree that captures the syntactical structure of the provide stream of tokens. Finally a walker transform the syntax tree into source code by traversing the syntax tree. xotl.ql.revenge differs in the last step where the walker produces a Query Syntax Tree (QST) instead.

In this document we try to expose how those components work together to reverse engineer a byte-code for a valid Python expression to a structure we call Query Syntax Tree.

Module `xotl.ql.revenge.scanners`¶

This module provides the Scanner that gets the byte-code and produces a stream of Tokens. Each token represents either an actual byte-code instruction with its arguments or a virtual instruction.

This document begins with an exploration of the byte-code for Python expressions and then describes the objects in the scanners module and how the tokens are produced. At the same time, we give proposals for the grammar that may be used to recognize higher level structures from those tokens.

Let’s inspect how simple expressions are compiled in different versions of Python.

Basic expressions¶

Let’s call basic expressions to those composed only by literals, variables (attributes, subscript), function calls, and binary and unary operator, excluding conditional and boolean expressions, generator expressions and other types of comprehensions and lambdas – which we’ll discuss later. Yield expressions are also out the question cause they can only occur within the body of a non-lambda function (i.e, not an expression).

The following are all valid basic expressions along with the disassembled byte-code [3]:

a

dis.dis(compile('a', '', 'eval'))
1           0 LOAD_NAME                0 (a)
            3 RETURN_VALUE

a + 1:

dis.dis(compile('a + 1', '', 'eval'))
  1           0 LOAD_NAME                0 (a)
              3 LOAD_CONST               0 (1)
              6 BINARY_ADD
              7 RETURN_VALUE

a(b[c**e/2:d], *y, **kw).at:

dis.dis(compile('a(b[c**e/2:d], *y, **kw).at', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_NAME                0 (a)
             3 LOAD_NAME                1 (b)
             6 LOAD_NAME                2 (c)
             9 LOAD_NAME                3 (e)
            12 BINARY_POWER
            13 LOAD_CONST               0 (2)
            16 BINARY_DIVIDE
            17 LOAD_NAME                4 (d)
            20 SLICE+3
            21 LOAD_NAME                5 (y)
            24 LOAD_NAME                6 (kw)
            27 CALL_FUNCTION_VAR_KW     1
            30 LOAD_ATTR                7 (at)
            33 RETURN_VALUE

Note

To understand the output of the dis.dis() function we refer the reader to the dis module of the Python standard library. We’ll assume a basic level of familiarity with it over the rest of the document. Nevertheless, we’ll link to the definition of a byte-code name the first time we use it.

The most important feature of basic expressions is they don’t have any kind of jump in the generated byte-code. The byte-code for basic expression resembles (with infrequent exceptions) the postfix notation where operands come before the operator.

Our parser can have several rules for recognizing this kind of expressions:

expr        ::=  atom
atom        ::=  identifier | literal
identifier  ::=  LOAD_FAST | LOAD_NAME | LOAD_GLOBAL | LOAD_DEREF
literal     ::=  LOAD_CONST
                
expr        ::=  binary_expr
binary_expr ::=  expr expr binary_op
binary_op   ::=  BINARY_ADD | BINARY_MULTIPLY | BINARY_AND |
                 BINARY_OR | BINARY_XOR | BINARY_SUBTRACT | BINARY_DIVIDE |
                 BINARY_TRUE_DIVIDE | BINARY_FLOOR_DIVIDE | BINARY_MODULO |
                 BINARY_LSHIFT | BINARY_RSHIFT | BINARY_POWER

... and so on.

Notice, however, the last instruction in all the basic expression shown so far is RETURN_VALUE and that is not recognized by the grammar presented. Before trying to find a solution for this issue, let’s inspect conditional expressions.

Conditional and boolean expressions¶

Conditional and boolean expressions, on the other hand, do produce jumps in the byte-code: they don’t run straightforwardly from top to bottom.

For instance, the simplest conditional expression “a if x else y”, in Python 2.7 and 3.4 compiles to:

dis.dis(compile('a if x else y', '', 'eval'))
  1           0 LOAD_NAME                0 (x)
              3 POP_JUMP_IF_FALSE       10
              6 LOAD_NAME                1 (a)
              9 RETURN_VALUE
        >>   10 LOAD_NAME                2 (y)
             13 RETURN_VALUE

In Pypy 2.7.3:

dis.dis(compile('a if x else y', '', 'eval'))  # doctest: +SKIP
  1           0 LOAD_NAME                0 (x)
              3 POP_JUMP_IF_FALSE       12
              6 LOAD_NAME                1 (a)
              9 JUMP_FORWARD             3 (to 15)
        >>   12 LOAD_NAME                2 (y)
        >>   15 RETURN_VALUE

The difference is the 9th offset, where Python uses a RETURN_VALUE, Pypy uses a JUMP_FORWARD.

It’s easy to see we can do a normalization step for Pypy byte-code so that the final byte-code is the same as that of CPython. Actually, later on we’ll find out that’s best to do the other way around and to keep a single RETURN_VALUE.

Boolean expressions¶

Boolean expressions produce byte-code that “short-circuit” the evaluation as soon as the result is known:

dis.dis(compile('a and b', '', 'eval'))
 1           0 LOAD_NAME                0 (a)
             3 JUMP_IF_FALSE_OR_POP     9
             6 LOAD_NAME                1 (b)
       >>    9 RETURN_VALUE


dis.dis(compile('a or b', '', 'eval'))
 1           0 LOAD_NAME                0 (a)
             3 JUMP_IF_TRUE_OR_POP      9
             6 LOAD_NAME                1 (b)
       >>    9 RETURN_VALUE

Here the JUMP_IF_FALSE_OR_POP and JUMP_IF_TRUE_OR_POP do the evaluation and stop it if the result is known. However, when combined things change:

dis.dis(compile('x and a or y', '', 'eval'))
 1           0 LOAD_NAME                0 (x)
             3 POP_JUMP_IF_FALSE       12
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_TRUE_OR_POP     15
       >>   12 LOAD_NAME                2 (y)
       >>   15 RETURN_VALUE

Now the and is implemented by POP_JUMP_IF_FALSE. This is because the jumps are optimized in several ways by Python VMs. Optimization changes the layout of the code. We have collected some of the optimization steps performed by the CPython VM below. If unoptimized we would’ve seen the same instructions. You may confirm that Pypy 2.7 doesn’t use the POP_JUMP_IF_FALSE for the expression above, but keeps the JUMP_IF_FALSE_OR_POP.

We’ll see that changing the precedence by grouping don’t change the instructions but the target of the first jump:

dis.dis(compile('x and (a or y)', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_NAME                0 (x)
             3 JUMP_IF_FALSE_OR_POP    15
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_TRUE_OR_POP     15
            12 LOAD_NAME                2 (y)
       >>   15 RETURN_VALUE

Again not taking jump targets into account hurts our ability to properly interpret the byte-code.

But first let’s take more samples:

dis.dis(compile('x and a and b', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_NAME                0 (x)
             3 JUMP_IF_FALSE_OR_POP    15
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_FALSE_OR_POP    15
            12 LOAD_NAME                2 (b)
       >>   15 RETURN_VALUE

dis.dis(compile('x or a or b', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_NAME                0 (x)
             3 JUMP_IF_TRUE_OR_POP     15
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_TRUE_OR_POP     15
            12 LOAD_NAME                2 (b)
       >>   15 RETURN_VALUE

These cases show that several and and or together are simply changed to target the same “final” offset.

The first observation we can make is that jumps in conditional and boolean expressions target the first instruction that deals with the result of the boolean sub-expression. Pictorially for a a and b this is like:

expr JUMP_IF_FALSE_OR_POP  ..... <SOME_INSTRUCTION>
                       |         ^
                       +--------/

The entire program after the jump and before its target contains the instructions for the other operand to the boolean operator. The instruction just before the target is not necessarily the last instruction of the other operand, it may be another jump or RETURN_VALUE like in:

dis.dis(compile('a and b or c', '', 'eval'))   # doctest: +SKIP
 1           0 LOAD_NAME                0 (a)
             3 POP_JUMP_IF_FALSE       12
             6 LOAD_NAME                1 (b)
             9 JUMP_IF_TRUE_OR_POP     15
       >>   12 LOAD_NAME                2 (c)
       >>   15 RETURN_VALUE

Furthermore, jumps may skip several operands at once (remind a and b and c).

It seems that the key to solving this issue is to keep track of the jumps and their targets and produce virtual codes that allow the structure to be retrieved by the parser.

Keeping track of jumps and targets in the scanner¶

Our first approach is to place a simple <COME_FROM> [1] virtual token at the most likely end of the sub-expression after a conditional jump. Our definition of the most likely end is rather intuitive and we’ll extend it as we analyze further.

In the simplest cases, it will be placed just before the RETURN_VALUE at the end:

from xotl.ql.revenge.scanners import xdis
xdis(compile('a and b', '', 'eval'))   # doctest: +SKIP
1         0 LOAD_NAME            a    (a)
          3 JUMP_IF_FALSE_OR_POP 9    (9)
          6 LOAD_NAME            b    (b)
        9_0 COME_FROM            3    (from 3)
    >>    9 RETURN_VALUE

For the sake of readability COME_FROM indicates the offset from which it is reached, it’s virtual offset if composed of the offset of the target and an index – since many jumps may target the same offset:

xdis(compile('a or b or c', '', 'eval'))   # doctest: +SKIP
1         0 LOAD_NAME            a    (a)
          3 JUMP_IF_TRUE_OR_POP  15   (15)
          6 LOAD_NAME            b    (b)
          9 JUMP_IF_TRUE_OR_POP  15   (15)
         12 LOAD_NAME            c    (c)
       15_0 COME_FROM            3    (from 3)
       15_1 COME_FROM            9    (from 9)
    >>   15 RETURN_VALUE

In the first case the COME_FROM is at the end of the right-hand operand. In the second case we see that the first jumps spans over the second.

Note

This is as far as we’ve gotten in this document. Below are just stuff that need to be edited and integrated.

`EXTENDED_ARG` removal¶

The EXTENDED_ARG opcode only comes when arguments in the next opcode are two big to fit into 2 bytes.

The Bytecode knows how to deal with extended arguments and the produced Instruction set have the arguments fully computed. The instruction set keeps the EXTENDED_ARG but we can easily skip it and never produce a token for it.

Virtual codes¶

Most of the time the scanner will produce tokens that wrap byte-code instructions. But for byte-codes that take arguments which indicate how many items to take from the stack (BUILD_LIST, BUILD_TUPLE, BUILD_SLICE, etc) the scanner transforms them in customized versions depending on the argument.

For instance the instruction BUILD_LIST 3 yields a token <BUILD_LIST_3> [1]. The scanner actually returns the stream of tokens and a map from customized code names to arguments. This means that a customized code must be produced per argument value found in the original instruction set.

The program shown below produces two customized tokens <BUILD_LIST_3> and <BUILD_LIST_2>:

dis.dis(compile('[1, 2, 3] + [9, 0]', '', 'eval'))
 1           0 LOAD_CONST               0 (1)
             3 LOAD_CONST               1 (2)
             6 LOAD_CONST               2 (3)
             9 BUILD_LIST               3
            12 LOAD_CONST               3 (9)
            15 LOAD_CONST               4 (0)
            18 BUILD_LIST               2
            21 BINARY_ADD
            22 RETURN_VALUE

Before the parser begins to recognize the program it should build new rules from those customized tokens. For instance for recognizing a <BUILD_LIST_3> it could add the rule:

list_expression ::= expr expr expr BUILD_LIST_3

JUMP_BACK¶

A JUMP_ABSOLUTE that targets a previous instruction yields a virtual code <JUMP_BACK>.

This allows the parser to recognize loops inside comprehensions more easily. The JUMP_ABSOLUTE in a comprehension always jumps to the FOR_ITER to get the next item.

Generator expressions, set and dictionary comprehensions¶

Since generator expressions are the constructor for expressing queries we need to make sure we can recognize them. However, they’re compiled using the same general algorithm and we need to analyze them together.

Generator expressions work by creating a nested function where the loops is actually done and call it immediately with the iterable as its single argument:

dis.dis(compile('(a for a in this)', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_CONST               0 (<code object <genexpr> ...>)
             3 MAKE_FUNCTION            0
             6 LOAD_NAME                0 (this)
             9 GET_ITER
            10 CALL_FUNCTION            1
            13 RETURN_VALUE

A set comprehension is indistinguishable from a generator expression:

dis.dis(compile('{a for a in this}', '', 'eval'))  # doctest: +SKIP
 1           0 LOAD_CONST               0 (<code object <setcomp> ...>)
             3 MAKE_FUNCTION            0
             6 LOAD_NAME                0 (this)
             9 GET_ITER
            10 CALL_FUNCTION            1
            13 RETURN_VALUE

The easiest way to distinguish if we’re looking at a set comprehension, dictionary comprehension or generator expression is by realizing that the code object has a unique name for each: “<genexpr>” for generator expressions, “<setcomp>” for set comprehensions and “<dictcomp>” for dictionary comprehensions [2].

This fact is used by the scanner to generate virtual tokens for each kind of comprehension: LOAD_GENEXPR, LOAD_DICTCOM, LOAD_SETCOMP and LOAD_LAMBDA. The last one is not related with comprehensions but is easy to show the similarity:

dis.dis(compile('(lambda t: None)(this)', '', 'eval')) # doctest: +SKIP
 1           0 LOAD_CONST               0 (<code object <lambda> ...>)
             3 MAKE_FUNCTION            0
             6 LOAD_NAME                0 (this)
             9 CALL_FUNCTION            1
            12 RETURN_VALUE

This shows that the parser needs to recognize this simple structure for comprehensions:

_comprehension ::= MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1

genexpr  ::= LOAD_GENEXPR  _comprehension
setcomp  ::= LOAD_SETCOMP  _comprehension
dictcomp ::= LOAD_DICTCOMP _comprehension

The nested function can be inspected:

dis.dis(compile('(a for a in this)', '', 'eval').co_consts[0])
 1           0 LOAD_FAST                0 (.0)
       >>    3 FOR_ITER                11 (to 17)
             6 STORE_FAST               1 (a)
             9 LOAD_FAST                1 (a)
            12 YIELD_VALUE
            13 POP_TOP
            14 JUMP_ABSOLUTE            3
       >>   17 LOAD_CONST               0 (None)
            20 RETURN_VALUE

Note

Support for generator expressions from Python 2.7 and 3.2+.

Generator expressions were introduced in Python 2.4, but the byte-code was actually clean up in Python 2.7, when they removed block-related byte-code. In the related function ‘Python/compile.c’ we read:

This code *knows* that the loop cannot contain break, continue, or
return, so it cheats and skips the SETUP_LOOP/POP_BLOCK steps used in
normal loops.

We rely on this fact.

List comprehensions¶

Since (at least) Python 3.2, list comprehensions take the form of the other comprehensions, thus we’ll need a virtual code LOAD_LISTCOMP and a rule:

listcomp ::= LOAD_LISTCOMP _comprehension

Yet the code object of a list comprehension needs to be analyzed. We’ll show how to deal with Python 2.7 list comprehensions first and then see how the nested function in Python 3.2+ is to be integrated.

In Pypy there’s a BUILD_LIST_FROM_ARG opcode to speed list comprehensions up. This is basically an informed BUILD_LIST: it peeks the TOS and tries to guess the length of the list it will build, if it can’t it will behave as a BUILD_LIST 0.

A list comprehension “[a for a in b]” will look like:

1           0 LOAD_NAME                0 (b)
            3 BUILD_LIST_FROM_ARG      0
            6 GET_ITER
      >>    7 FOR_ITER                12 (to 22)
           10 STORE_NAME               1 (a)
           13 LOAD_NAME                1 (a)
           16 LIST_APPEND              2
           19 JUMP_ABSOLUTE            7
      >>   22 RETURN_VALUE

while in Python 2.7 the same comprehension yields:

1           0 BUILD_LIST               0
            3 LOAD_NAME                0 (b)
            6 GET_ITER
      >>    7 FOR_ITER                12 (to 22)
           10 STORE_NAME               1 (a)
           13 LOAD_NAME                1 (a)
           16 LIST_APPEND              2
           19 JUMP_ABSOLUTE            7
      >>   22 RETURN_VALUE

Notice that the only change is that Pypy don’t use the BUILD_LIST 0 but add a BUILD_LIST_FROM_ARG after loading the list.

We can’t simply swap the codes in this case, their order is also affected.

We think this issue should be addressed by the parser and not the scanner since the iterable can be any expression, i.e. not a single LOAD_NAME.

[1]	(1, 2) In this document we use angular brackets to distinguish tokens from byte-code instructions names.

[2]

Since CPython 3.3, MAKE_FUNCTION will actually be preceded by two LOAD_CONST, the second one with the function qualified name. But we can ignore that fact for the analysis.

It will be easily treated by overriding the rule _comprehension:

_comprehension ::= LOAD_CONST MAKE_FUNCTION_0 expr GET_ITER
                   CALL_FUNCTION_1

Terms used in this document¶

absolute jump: Any jump instructions that takes the actual target offset that it jumps to: JUMP_ABSOLUTE and all the conditional jumps.
conditional jump: An instruction that only jumps conditionally: JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP, POP_JUMP_IF_FALSE, and POP_JUMP_IF_TRUE.

Note

Despite the fact that FOR_ITER only jumps when the iterator is exhausted, we don’t consider it neither a conditional jump nor an unconditional jump.
jump jump instruction: Any instructions that jumps: conditional jump, unconditional jump, relative jump and absolute jump.
relative jump: Any jump instruction that takes a delta argument to calculate the offset of the target: FOR_ITER, and JUMP_FORWARD.

Note

We don’t list all the instructions but only those that show up in expressions.
target: This term is used loosely to mean either the offset (after resolution for relative jumps), or the instruction at that offset.
unconditional jump: Any of the instructions JUMP_FORWARD and JUMP_ABSOLUTE

Optimizations¶

If the target of JUMP_IF_FALSE_OR_POP (JUMP_IF_TRUE_OR_POP) is a conditional jump of the same truth-testing kind then update the target to be that of the previous target.

This is the rule that collapses several and and or together.

If the target of JUMP_IF_*_OR_POP is a conditional jump of the opposite truth-testing kind, change the instruction to POP_JUMP_IF_* targeting the next instruction of the previous target.

This is what happens for x and a or b and also for (x or a) and b:

dis.dis(compile('x and a or b', '', 'eval'))
 1           0 LOAD_NAME                0 (x)
             3 POP_JUMP_IF_FALSE       12
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_TRUE_OR_POP     15
       >>   12 LOAD_NAME                2 (b)
       >>   15 RETURN_VALUE

dis.dis(compile('(x or a) and b', '', 'eval'))
 1           0 LOAD_NAME                0 (x)
             3 POP_JUMP_IF_TRUE        12
             6 LOAD_NAME                1 (a)
             9 JUMP_IF_FALSE_OR_POP    15
       >>   12 LOAD_NAME                2 (b)
       >>   15 RETURN_VALUE

... and others.

API of the module¶

xotl.ql.revenge.scanners.keep_single_return(instructions)¶

Transform the instructions so the a single RETURN_VALUE is kept at the bottom.

If the original instructions don’t have a RETURN_VALUE at the very end, return the same instructions set.

Otherwise all but the last RETURN_VALUE will be replaced by a JUMP_FORWARD that targets the RETURN_VALUE at the end of the instructions.

Offsets and jump targets are updated.

Parameters:	instructions (iterable) – The original instructions.
Return type:	list

xotl.ql.revenge.scanners.without_nops(instructions)¶

Return the same instruction set with NOPs removed.

The jump targets are properly updated.

Loosely based on the same algorithm in Python/peephole.c.

Parameters:	instructions – An iterable of `instructions`.
Returns:	A generator of instructions.

xotl.ql.revenge.scanners.normalize_pypy_conditional(instructions)¶

Apply the pypy normalization rule.

The pypy normalization rule states that:

If the target of a JUMP_FORWARD (or JUMP_ABSOLUTE) is a RETURN_VALUE replace the JUMP with the following instructions:
RETURN_VALUE
NOP
NOP

This rule does not only apply when using Pypy, the name simply comes because Pypy compiles conditional expressions using JUMPs.

Parameters:	instructions – An iterable of `instructions`.
Returns:	A generator of instructions.

class xotl.ql.revenge.scanners.Token(name, arg=None, argval=None, argrepr=None, offset=-1, starts_line=False, instruction=None)¶

Class representing a byte-code token.

A byte-code token is equivalent to the contents of one line as output by dis.dis().

Footnotes¶

[3]	Many of the examples on this document won’t look the same when you execute them on different versions or implementations of Python.

From byte-code to a query syntax tree¶

Module `xotl.ql.revenge.scanners`¶

Basic expressions¶

Conditional and boolean expressions¶

Boolean expressions¶

Keeping track of jumps and targets in the scanner¶

`EXTENDED_ARG` removal¶

Virtual codes¶

JUMP_BACK¶

Generator expressions, set and dictionary comprehensions¶

List comprehensions¶

Terms used in this document¶

Optimizations¶

API of the module¶

Footnotes¶

Table Of Contents

Previous topic

Next topic

From byte-code to a query syntax tree¶

Module xotl.ql.revenge.scanners¶

Basic expressions¶

Conditional and boolean expressions¶

Boolean expressions¶

Keeping track of jumps and targets in the scanner¶

EXTENDED_ARG removal¶

Virtual codes¶

JUMP_BACK¶

Generator expressions, set and dictionary comprehensions¶

List comprehensions¶

Terms used in this document¶

Optimizations¶

API of the module¶

Footnotes¶

Module `xotl.ql.revenge.scanners`¶

`EXTENDED_ARG` removal¶