From byte-code to a query syntax tree¶
This document describes the core of the revenge package for getting a query syntax tree from the byte-code of generator expressions. It spans several highly coupled modules:
- the xotl.ql.revenge.scanners module,
- the xotl.ql.revenge.parsers module,
- the xotl.ql.revenge.walkers module,
- and (parts of) the xotl.ql.revenge.qst module.
Though we have tried to make a linear exposition of the matter, the high degree of coupling between those phases imposes a challenge to justify decisions made in the first layers to accommodate the actions of the later ones.
This package derives from the work of John Aycock and others on the uncompyle2 package. In fact, this document was produced at the same time we were trying to grasp how that package worked and how we could modify it to better suit our needs. This has led to a pattern of exposition that may seem awkward at times, because we have had to revise our previous assumptions as we analyzed further. Nevertheless, we have striven to remove those issues as best we could.
Since the goal of the original package is much more ambitious than ours we have stripped big parts of the original grammar and the scanner has been almost entirely reimplemented.
uncompyle2 handles the problem of getting Python source from the byte-code (reverse engineering the byte-code) by casting it as a parsing problem. Thus compiler techniques are applied.
The byte-code is passed to a scanner that produces the tokens. Then a parser recognizes whether the stream of tokens can be generated by a given grammar; if so, it produces a syntax tree that captures the syntactical structure of the provided stream of tokens. Finally a walker transforms the syntax tree into source code by traversing it. xotl.ql.revenge differs in the last step, where the walker produces a Query Syntax Tree (QST) instead.
In this document we try to expose how those components work together to reverse engineer the byte-code of a valid Python expression into a structure we call a Query Syntax Tree.
Module xotl.ql.revenge.scanners¶
This module provides the Scanner that gets the byte-code and produces a stream of Tokens. Each token represents either an actual byte-code instruction with its arguments or a virtual instruction.
This document begins with an exploration of the byte-code for Python
expressions and then describes the objects in the scanners
module and
how the tokens are produced. At the same time, we give proposals for the
grammar that may be used to recognize higher level structures from those
tokens.
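Before diving in, the token-production step can be illustrated with the standard dis module. This is only a sketch of the idea, not the package's actual Scanner — the real scanner produces richer Token objects and also emits virtual instructions, which are introduced later in this document:

```python
import dis

def scan(code):
    # A minimal stand-in for the scanner: one (name, argument) token per
    # real byte-code instruction.  The real Scanner also emits *virtual*
    # tokens such as COME_FROM; those are discussed below.
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(code)]

tokens = scan(compile('a + 1', '', 'eval'))
```

On modern CPythons the exact opcodes vary (e.g. BINARY_ADD became BINARY_OP), but the stream always contains a LOAD_NAME token for `a` and a RETURN_VALUE token at the end of the evaluation.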
Let’s inspect how simple expressions are compiled in different versions of Python.
Basic expressions¶
Let’s call basic expressions those composed only of literals, variables (attributes, subscripts), function calls, and binary and unary operators, excluding conditional and boolean expressions, generator expressions and other kinds of comprehensions, and lambdas – which we’ll discuss later. Yield expressions are also out of the question because they can only occur within the body of a non-lambda function (i.e., not in an expression).
The following are all valid basic expressions along with the disassembled byte-code [3]:
a
dis.dis(compile('a', '', 'eval'))
  1           0 LOAD_NAME                0 (a)
              3 RETURN_VALUE
a + 1
dis.dis(compile('a + 1', '', 'eval'))
  1           0 LOAD_NAME                0 (a)
              3 LOAD_CONST               0 (1)
              6 BINARY_ADD
              7 RETURN_VALUE
a(b[c**e/2:d], *y, **kw).at
dis.dis(compile('a(b[c**e/2:d], *y, **kw).at', '', 'eval')) # doctest: +SKIP
  1           0 LOAD_NAME                0 (a)
              3 LOAD_NAME                1 (b)
              6 LOAD_NAME                2 (c)
              9 LOAD_NAME                3 (e)
             12 BINARY_POWER
             13 LOAD_CONST               0 (2)
             16 BINARY_DIVIDE
             17 LOAD_NAME                4 (d)
             20 SLICE+3
             21 LOAD_NAME                5 (y)
             24 LOAD_NAME                6 (kw)
             27 CALL_FUNCTION_VAR_KW     1
             30 LOAD_ATTR                7 (at)
             33 RETURN_VALUE
Note
To understand the output of the dis.dis()
function we refer
the reader to the dis
module of the Python standard library. We’ll
assume a basic level of familiarity with it over the rest of the document.
Nevertheless, we’ll link to the definition of a byte-code name the first
time we use it.
The most important feature of basic expressions is that they don’t have any kind of jump in the generated byte-code. The byte-code for basic expressions resembles (with infrequent exceptions) postfix notation, where operands come before the operator.
Our parser can have several rules for recognizing this kind of expression:
expr ::= atom
atom ::= identifier | literal
identifier ::= LOAD_FAST | LOAD_NAME | LOAD_GLOBAL | LOAD_DEREF
literal ::= LOAD_CONST

expr ::= binary_expr
binary_expr ::= expr expr binary_op
binary_op ::= BINARY_ADD | BINARY_MULTIPLY | BINARY_AND | BINARY_OR
            | BINARY_XOR | BINARY_SUBTRACT | BINARY_DIVIDE
            | BINARY_TRUE_DIVIDE | BINARY_FLOOR_DIVIDE | BINARY_MODULO
            | BINARY_LSHIFT | BINARY_RSHIFT | BINARY_POWER
... and so on.
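The no-jump property of basic expressions is easy to check mechanically. The sketch below uses only the standard dis module (the function name `is_basic` is ours, for illustration):

```python
import dis

# Every jump opcode, relative or absolute, across CPython versions.
JUMP_OPCODES = set(dis.hasjrel) | set(dis.hasjabs)

def is_basic(code):
    # A "basic" expression compiles to byte-code with no jumps at all.
    # Bookkeeping opcodes such as RESUME (newer CPythons) are not jumps,
    # so this check works across versions.
    return not any(ins.opcode in JUMP_OPCODES
                   for ins in dis.get_instructions(code))

assert is_basic(compile('a + b[c]', '', 'eval'))
assert not is_basic(compile('a if x else y', '', 'eval'))
```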
Notice, however, that the last instruction in all the basic expressions shown so far is RETURN_VALUE, which is not recognized by the grammar presented.
Before trying to find a solution for this issue, let’s inspect conditional
expressions.
Conditional and boolean expressions¶
Conditional and boolean expressions, on the other hand, do produce jumps in the byte-code: they don’t run straightforwardly from top to bottom.
For instance, the simplest conditional expression “a if x else y
”, in
Python 2.7 and 3.4 compiles to:
dis.dis(compile('a if x else y', '', 'eval'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 10
6 LOAD_NAME 1 (a)
9 RETURN_VALUE
>> 10 LOAD_NAME 2 (y)
13 RETURN_VALUE
In Pypy 2.7.3:
dis.dis(compile('a if x else y', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a)
9 JUMP_FORWARD 3 (to 15)
>> 12 LOAD_NAME 2 (y)
>> 15 RETURN_VALUE
The difference is at offset 9: where Python uses a RETURN_VALUE, Pypy uses a JUMP_FORWARD.
It’s easy to see that we can do a normalization step for Pypy byte-code so that the final byte-code is the same as that of CPython. Actually, later on we’ll find out that it’s best to do it the other way around and keep a single RETURN_VALUE.
See also
The functions without_nops()
and
normalize_pypy_conditional()
and
keep_single_return()
.
After the normalization step, we get the same byte-code and our analysis can proceed ignoring the differences.
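To make the normalization concrete, here is a toy model of the rule. Instructions are represented as (offset, opname, arg) triples, with jumps carrying their resolved target offset. The real normalize_pypy_conditional() works on actual instruction objects and also recomputes offsets; this sketch shows only the substitution rule:

```python
def normalize(instructions):
    # If a JUMP_FORWARD/JUMP_ABSOLUTE targets a RETURN_VALUE, replace the
    # jump with RETURN_VALUE itself.  The 3-byte jump becomes RETURN_VALUE
    # plus two NOPs so the following offsets stay valid (the NOPs can be
    # dropped later by a without_nops-style pass).
    opname_at = {off: op for off, op, _ in instructions}
    result = []
    for off, op, arg in instructions:
        if op in ('JUMP_FORWARD', 'JUMP_ABSOLUTE') and opname_at.get(arg) == 'RETURN_VALUE':
            result += [(off, 'RETURN_VALUE', None),
                       (off + 1, 'NOP', None),
                       (off + 2, 'NOP', None)]
        else:
            result.append((off, op, arg))
    return result

# Pypy-style byte-code for 'a if x else y' (offsets as in the listing above):
pypy = [(0, 'LOAD_NAME', 'x'), (3, 'POP_JUMP_IF_FALSE', 12),
        (6, 'LOAD_NAME', 'a'), (9, 'JUMP_FORWARD', 15),
        (12, 'LOAD_NAME', 'y'), (15, 'RETURN_VALUE', None)]
normalized = normalize(pypy)
```

After normalization the JUMP_FORWARD at offset 9 is gone and a RETURN_VALUE takes its place, matching the CPython listing.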
We may think, then, that conditionals might be recognized by the following rules:
conditional ::= condition_expression POP_JUMP_IF_FALSE
                then_result else_result
then_result ::= returning_expression
else_result ::= returning_expression
returning_expression ::= expr RETURN_VALUE
condition_expression ::= expr
expr ::= conditional
However, let’s first see how ‘nested’ conditionals interact: a if (a1 if x1 else y1) else y2. Inspecting the actual byte-code we see there’s an issue with the then_result inside the condition expression, since it does not end with the required RETURN_VALUE:
dis.dis(compile('a if (a1 if x1 else y1) else y2', '', 'eval'))
1 0 LOAD_NAME 0 (x1)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a1)
9 JUMP_FORWARD 3 (to 15)
>> 12 LOAD_NAME 2 (y1)
>> 15 POP_JUMP_IF_FALSE 22
18 LOAD_NAME 3 (a)
21 RETURN_VALUE
>> 22 LOAD_NAME 4 (y2)
25 RETURN_VALUE
The inner conditional expression cannot return the value of a1 or y1, but leaves it on the top of the stack (TOS) for further inspection by the outer expression. Surely we’re going to see this pattern even when combining conditional expressions with non-conditional ones, as in:
dis.dis(compile('(a if x else y) + 1', '', 'eval'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a)
9 JUMP_FORWARD 3 (to 15)
>> 12 LOAD_NAME 2 (y)
>> 15 LOAD_CONST 0 (1)
18 BINARY_ADD
19 RETURN_VALUE
Notice that in both cases the JUMP_FORWARD targets the byte-code that handles (or leaves alone) the result of the conditional expression. So inner conditional expressions end with neither a jump nor a RETURN_VALUE, but leave the TOS as it is.
This is actually the same issue as basic expressions having a final RETURN_VALUE.
And it means that an expr should not be considered a standalone expression, but one that always occurs inside a bigger structure. So if we change our rules to:
conditional ::= condition_expression POP_JUMP_IF_FALSE
                then_result else_result
condition_expression ::= expr
expr ::= conditional
else_result ::= expr
then_result ::= expr JUMP_FORWARD
then_result ::= expr JUMP_ABSOLUTE
without the trailing RETURN_VALUE for the else result, they now work for recognizing the expressions shown so far.
Since our expr cannot include the trailing RETURN_VALUE, that token needs to be matched by a higher-level structure. We’ll call that ‘returning an expression’; in fact, we already have the rule:
returning_expression ::= expr RETURN_VALUE
So instead of trying to recognize an expr, we should strive to recognize a sentence that returns an expression.
Yet, those production rules are not sufficient. They alone would make these two programs produce the same AST:
dis.dis(compile('a if x else (o if i else n)', '', 'eval'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 10
6 LOAD_NAME 1 (a)
9 RETURN_VALUE
>> 10 LOAD_NAME 2 (i)
13 POP_JUMP_IF_FALSE 20
16 LOAD_NAME 3 (o)
19 RETURN_VALUE
>> 20 LOAD_NAME 4 (n)
23 RETURN_VALUE
dis.dis(compile('o if (a if x else i) else n', '', 'eval'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a)
9 JUMP_FORWARD 3 (to 15)
>> 12 LOAD_NAME 2 (i)
>> 15 POP_JUMP_IF_FALSE 22
18 LOAD_NAME 3 (o)
21 RETURN_VALUE
>> 22 LOAD_NAME 4 (n)
25 RETURN_VALUE
If we disregard the targets of the jumps (as our grammar does) we can’t distinguish between those two programs. In fact, the many RETURN_VALUE instructions we see are only produced after an optimization step because we’re compiling the expression with 'eval'. Otherwise, they only show up in lambdas:
dis.dis(compile('a if x else (o if i else n)', '', 'exec'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a)
9 JUMP_FORWARD 15 (to 27)
>> 12 LOAD_NAME 2 (i)
15 POP_JUMP_IF_FALSE 24
18 LOAD_NAME 3 (o)
21 JUMP_FORWARD 3 (to 27)
>> 24 LOAD_NAME 4 (n)
>> 27 POP_TOP
28 LOAD_CONST 0 (None)
31 RETURN_VALUE
dis.dis(lambda: a if x else (o if i else n))
1 0 LOAD_GLOBAL 0 (x)
3 POP_JUMP_IF_FALSE 10
6 LOAD_GLOBAL 1 (a)
9 RETURN_VALUE
>> 10 LOAD_GLOBAL 2 (i)
13 POP_JUMP_IF_FALSE 20
16 LOAD_GLOBAL 3 (o)
19 RETURN_VALUE
>> 20 LOAD_GLOBAL 4 (n)
23 RETURN_VALUE
We’ll see the same problem in our analysis of boolean expressions, so let’s start with it.
Boolean expressions¶
Boolean expressions produce byte-code that “short-circuit” the evaluation as soon as the result is known:
dis.dis(compile('a and b', '', 'eval'))
1 0 LOAD_NAME 0 (a)
3 JUMP_IF_FALSE_OR_POP 9
6 LOAD_NAME 1 (b)
>> 9 RETURN_VALUE
dis.dis(compile('a or b', '', 'eval'))
1 0 LOAD_NAME 0 (a)
3 JUMP_IF_TRUE_OR_POP 9
6 LOAD_NAME 1 (b)
>> 9 RETURN_VALUE
Here the JUMP_IF_FALSE_OR_POP and JUMP_IF_TRUE_OR_POP do the evaluation and stop it as soon as the result is known. However, when they are combined, things change:
dis.dis(compile('x and a or y', '', 'eval'))
1 0 LOAD_NAME 0 (x)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (a)
9 JUMP_IF_TRUE_OR_POP 15
>> 12 LOAD_NAME 2 (y)
>> 15 RETURN_VALUE
Now the and is implemented with POP_JUMP_IF_FALSE. This is because jumps are optimized in several ways by Python VMs. Optimization changes the layout of the code; we have collected some of the optimization steps performed by the CPython VM below. Without optimization we would have seen the same instructions. You may confirm that Pypy 2.7 doesn’t use POP_JUMP_IF_FALSE for the expression above, but keeps the JUMP_IF_FALSE_OR_POP.
We’ll see that changing the precedence by grouping doesn’t change the instructions, but it does change the target of the first jump:
dis.dis(compile('x and (a or y)', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME 0 (x)
3 JUMP_IF_FALSE_OR_POP 15
6 LOAD_NAME 1 (a)
9 JUMP_IF_TRUE_OR_POP 15
12 LOAD_NAME 2 (y)
>> 15 RETURN_VALUE
Again not taking jump targets into account hurts our ability to properly interpret the byte-code.
But first let’s take more samples:
dis.dis(compile('x and a and b', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME 0 (x)
3 JUMP_IF_FALSE_OR_POP 15
6 LOAD_NAME 1 (a)
9 JUMP_IF_FALSE_OR_POP 15
12 LOAD_NAME 2 (b)
>> 15 RETURN_VALUE
dis.dis(compile('x or a or b', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME 0 (x)
3 JUMP_IF_TRUE_OR_POP 15
6 LOAD_NAME 1 (a)
9 JUMP_IF_TRUE_OR_POP 15
12 LOAD_NAME 2 (b)
>> 15 RETURN_VALUE
These cases show that when several and or or operators appear together, the jumps are simply changed to target the same “final” offset.
The first observation we can make is that jumps in conditional and boolean expressions target the first instruction that deals with the result of the boolean sub-expression. Pictorially, for a and b this looks like:
expr JUMP_IF_FALSE_OR_POP ..... <SOME_INSTRUCTION>
| ^
+--------/
The entire program after the jump and before its target contains the instructions for the other operand of the boolean operator. The instruction just before the target is not necessarily the last instruction of the other operand; it may be another jump or a RETURN_VALUE, as in:
dis.dis(compile('a and b or c', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME 0 (a)
3 POP_JUMP_IF_FALSE 12
6 LOAD_NAME 1 (b)
9 JUMP_IF_TRUE_OR_POP 15
>> 12 LOAD_NAME 2 (c)
>> 15 RETURN_VALUE
Furthermore, jumps may skip several operands at once (recall a and b and c).
It seems that the key to solving this issue is to keep track of the jumps and their targets and produce virtual codes that allow the structure to be retrieved by the parser.
Keeping track of jumps and targets in the scanner¶
Our first approach is to place a simple <COME_FROM>
[1] virtual
token at the most likely end of the sub-expression after a conditional jump.
Our definition of the most likely end is rather intuitive and we’ll extend
it as we analyze further.
In the simplest cases, it will be placed just before the
RETURN_VALUE
at the end:
from xotl.ql.revenge.scanners import xdis
xdis(compile('a and b', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME a (a)
3 JUMP_IF_FALSE_OR_POP 9 (9)
6 LOAD_NAME b (b)
9_0 COME_FROM 3 (from 3)
>> 9 RETURN_VALUE
For the sake of readability, COME_FROM indicates the offset from which it is reached; its virtual offset is composed of the offset of the target and an index – since many jumps may target the same offset:
xdis(compile('a or b or c', '', 'eval')) # doctest: +SKIP
1 0 LOAD_NAME a (a)
3 JUMP_IF_TRUE_OR_POP 15 (15)
6 LOAD_NAME b (b)
9 JUMP_IF_TRUE_OR_POP 15 (15)
12 LOAD_NAME c (c)
15_0 COME_FROM 3 (from 3)
15_1 COME_FROM 9 (from 9)
>> 15 RETURN_VALUE
In the first case the COME_FROM is at the end of the right-hand operand. In the second case we see that the first jump spans over the second.
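The placement of the virtual tokens can be sketched as follows. This is our own simplified version, built on the standard dis module; the package's scanner additionally computes the virtual offsets (15_0, 15_1, ...) shown in the listings above:

```python
import dis

def with_come_froms(code):
    # Sketch: before each instruction that is the target of one or more
    # forward jumps, emit a ('COME_FROM', source_offset) token per jump.
    instructions = list(dis.get_instructions(code))
    targets = {}                                   # target -> [sources]
    for ins in instructions:
        is_jump = ins.opcode in dis.hasjrel or ins.opcode in dis.hasjabs
        if is_jump and ins.argval > ins.offset:    # forward jumps only
            targets.setdefault(ins.argval, []).append(ins.offset)
    tokens = []
    for ins in instructions:
        for src in targets.get(ins.offset, ()):
            tokens.append(('COME_FROM', src))
        tokens.append((ins.opname, ins.argval))
    return tokens

tokens = with_come_froms(compile('a and b', '', 'eval'))
```

The exact opcodes for `a and b` differ between CPython versions, but a forward conditional jump is always present, so at least one COME_FROM token is emitted just before its target.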
Note
This is as far as we’ve gotten in this document. Below is material that still needs to be edited and integrated.
EXTENDED_ARG removal¶
The EXTENDED_ARG opcode only appears when the argument of the next opcode is too big to fit into 2 bytes.
The Bytecode knows how to deal with extended arguments, and the produced Instruction set has the arguments fully computed. The instruction set keeps the EXTENDED_ARG, but we can easily skip it and never produce a token for it.
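With the standard dis module this skipping is a simple filter, since dis.get_instructions() already folds the extended argument into the following instruction. A sketch (not the package's code):

```python
import dis

def tokens_without_extended_arg(code):
    # dis reports EXTENDED_ARG entries but also folds their value into the
    # next instruction's argument, so the scanner can simply drop them.
    return [ins for ins in dis.get_instructions(code)
            if ins.opname != 'EXTENDED_ARG']

# Force an argument > 255: an expression with 300 distinct names makes the
# later LOAD_NAME opargs exceed one byte.
source = '[%s]' % ', '.join('name%d' % i for i in range(300))
code = compile(source, '', 'eval')
```

After filtering, the stream has no EXTENDED_ARG tokens, yet the folded arguments survive: the LOAD_NAME for `name299` still resolves to that name.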
Virtual codes¶
Most of the time the scanner will produce tokens that wrap byte-code instructions. But for byte-codes that take arguments indicating how many items to take from the stack (BUILD_LIST, BUILD_TUPLE, BUILD_SLICE, etc.) the scanner transforms them into customized versions depending on the argument.
For instance the instruction BUILD_LIST 3
yields a token
<BUILD_LIST_3>
[1]. The scanner actually returns the stream of
tokens and a map from customized code names to arguments. This means that a
customized code must be produced per argument value found in the original
instruction set.
The program shown below produces two customized tokens <BUILD_LIST_3>
and
<BUILD_LIST_2>
:
dis.dis(compile('[1, 2, 3] + [9, 0]', '', 'eval'))
1 0 LOAD_CONST 0 (1)
3 LOAD_CONST 1 (2)
6 LOAD_CONST 2 (3)
9 BUILD_LIST 3
12 LOAD_CONST 3 (9)
15 LOAD_CONST 4 (0)
18 BUILD_LIST 2
21 BINARY_ADD
22 RETURN_VALUE
Before the parser begins to recognize the program it should build new rules
from those customized tokens. For instance for recognizing a
<BUILD_LIST_3>
it could add the rule:
list_expression ::= expr expr expr BUILD_LIST_3
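The customization step described above can be sketched as follows. This is our illustration, not the package's implementation; the set of customizable opcodes here is a subset, and `customize` is a hypothetical name:

```python
import dis

# Opcodes whose argument is a count of stack items (a subset, for example).
CUSTOMIZABLE = {'BUILD_LIST', 'BUILD_TUPLE', 'BUILD_SET', 'BUILD_SLICE',
                'UNPACK_SEQUENCE', 'CALL_FUNCTION'}

def customize(code):
    # Rename stack-count opcodes to NAME_n tokens and collect the
    # name -> argument map the parser uses to build rules on the fly.
    tokens, custom = [], {}
    for ins in dis.get_instructions(code):
        if ins.opname in CUSTOMIZABLE:
            name = '%s_%d' % (ins.opname, ins.arg)
            custom[name] = ins.arg
            tokens.append(name)
        else:
            tokens.append(ins.opname)
    return tokens, custom

# Non-constant elements keep BUILD_LIST n even on newer CPythons:
tokens, custom = customize(compile('[a, b, c] + [x, y]', '', 'eval'))
```

With this input, `custom` maps 'BUILD_LIST_3' to 3 and 'BUILD_LIST_2' to 2, exactly the information needed to synthesize the rule shown above.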
See also
The parsers
module.
The byte-codes that need to be customized are: BUILD_LIST, BUILD_TUPLE, BUILD_SET, BUILD_SLICE, UNPACK_SEQUENCE, MAKE_FUNCTION, CALL_FUNCTION, MAKE_CLOSURE, CALL_FUNCTION_VAR, CALL_FUNCTION_KW, CALL_FUNCTION_VAR_KW, DUP_TOPX, and RAISE_VARARGS – this last one is included only for completeness, not because we’ll find it in expressions.
JUMP_BACK¶
A JUMP_ABSOLUTE
that targets a previous instruction yields a virtual code
<JUMP_BACK>
.
This allows the parser to recognize loops inside comprehensions more easily.
The JUMP_ABSOLUTE
in a comprehension always jumps to the FOR_ITER
to
get the next item.
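Detecting back jumps can be sketched with the standard dis module. Note that newer CPythons (3.11+) replaced JUMP_ABSOLUTE with a dedicated JUMP_BACKWARD opcode, but the detection idea is the same; `token_names` is our illustrative name:

```python
import dis
import types

def token_names(code):
    # Sketch: any jump whose resolved target precedes the jump itself
    # becomes the virtual JUMP_BACK token.
    names = []
    for ins in dis.get_instructions(code):
        is_jump = ins.opcode in dis.hasjrel or ins.opcode in dis.hasjabs
        names.append('JUMP_BACK' if is_jump and ins.argval < ins.offset
                     else ins.opname)
    return names

# The genexpr's code object is among the constants of the outer code:
inner = next(c for c in compile('(a for a in b)', '', 'eval').co_consts
             if isinstance(c, types.CodeType))
```

Running `token_names(inner)` shows the comprehension loop: a FOR_ITER and, at the bottom of the loop body, a JUMP_BACK token.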
Generator expressions, set and dictionary comprehensions¶
Since generator expressions are the construct for expressing queries, we need to make sure we can recognize them. However, set and dictionary comprehensions are compiled using the same general algorithm, so we need to analyze them together.
Generator expressions work by creating a nested function where the loop is actually done, and calling it immediately with the iterable as its single argument:
dis.dis(compile('(a for a in this)', '', 'eval')) # doctest: +SKIP
1 0 LOAD_CONST 0 (<code object <genexpr> ...>)
3 MAKE_FUNCTION 0
6 LOAD_NAME 0 (this)
9 GET_ITER
10 CALL_FUNCTION 1
13 RETURN_VALUE
A set comprehension is indistinguishable from a generator expression:
dis.dis(compile('{a for a in this}', '', 'eval')) # doctest: +SKIP
1 0 LOAD_CONST 0 (<code object <setcomp> ...>)
3 MAKE_FUNCTION 0
6 LOAD_NAME 0 (this)
9 GET_ITER
10 CALL_FUNCTION 1
13 RETURN_VALUE
The easiest way to distinguish whether we’re looking at a set comprehension, dictionary comprehension or generator expression is by realizing that the code object has a unique name for each: “<genexpr>” for generator expressions, “<setcomp>” for set comprehensions and “<dictcomp>” for dictionary comprehensions [2].
This fact is used by the scanner to generate virtual tokens for each kind of comprehension: LOAD_GENEXPR, LOAD_DICTCOMP, LOAD_SETCOMP and LOAD_LAMBDA. The last one is not related to comprehensions, but it is easy to show the similarity:
dis.dis(compile('(lambda t: None)(this)', '', 'eval')) # doctest: +SKIP
1 0 LOAD_CONST 0 (<code object <lambda> ...>)
3 MAKE_FUNCTION 0
6 LOAD_NAME 0 (this)
9 CALL_FUNCTION 1
12 RETURN_VALUE
This shows that the parser needs to recognize this simple structure for comprehensions:
_comprehension ::= MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1
genexpr ::= LOAD_GENEXPR _comprehension
setcomp ::= LOAD_SETCOMP _comprehension
dictcomp ::= LOAD_DICTCOMP _comprehension
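The name-based detection can be sketched as follows. Caveat: since CPython 3.12, PEP 709 inlines set and dictionary comprehensions, so on those versions only generator expressions and lambdas still produce a nested code object; `load_token` and the fallback 'LOAD_CODE' are our illustrative names:

```python
import types

# Mapping from the inner code object's name to the virtual load token.
KIND = {'<genexpr>': 'LOAD_GENEXPR', '<setcomp>': 'LOAD_SETCOMP',
        '<dictcomp>': 'LOAD_DICTCOMP', '<lambda>': 'LOAD_LAMBDA'}

def load_token(outer):
    # Find the nested code object and pick the virtual token from co_name.
    inner = next(c for c in outer.co_consts if isinstance(c, types.CodeType))
    return KIND.get(inner.co_name, 'LOAD_CODE')

assert load_token(compile('(a for a in b)', '', 'eval')) == 'LOAD_GENEXPR'
assert load_token(compile('lambda t: None', '', 'eval')) == 'LOAD_LAMBDA'
```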
The nested function can be inspected:
dis.dis(compile('(a for a in this)', '', 'eval').co_consts[0])
1 0 LOAD_FAST 0 (.0)
>> 3 FOR_ITER 11 (to 17)
6 STORE_FAST 1 (a)
9 LOAD_FAST 1 (a)
12 YIELD_VALUE
13 POP_TOP
14 JUMP_ABSOLUTE 3
>> 17 LOAD_CONST 0 (None)
20 RETURN_VALUE
Note
Support for generator expressions covers Python 2.7 and 3.2+.
Generator expressions were introduced in Python 2.4, but the byte-code was cleaned up in Python 2.7, when block-related byte-codes were removed. In the relevant function in ‘Python/compile.c’ we read:
This code *knows* that the loop cannot contain break, continue, or
return, so it cheats and skips the SETUP_LOOP/POP_BLOCK steps used in
normal loops.
We rely on this fact.
List comprehensions¶
Since (at least) Python 3.2, list comprehensions take the form of the other comprehensions, thus we’ll need a virtual code LOAD_LISTCOMP and a rule:
listcomp ::= LOAD_LISTCOMP _comprehension
Yet the code object of a list comprehension needs to be analyzed. We’ll show how to deal with Python 2.7 list comprehensions first and then see how the nested function in Python 3.2+ is to be integrated.
In Pypy there’s a BUILD_LIST_FROM_ARG opcode to speed list comprehensions up. This is basically an informed BUILD_LIST: it peeks at the TOS and tries to guess the length of the list it will build; if it can’t, it behaves as a BUILD_LIST 0.
A list comprehension “[a for a in b]
” will look like:
1 0 LOAD_NAME 0 (b)
3 BUILD_LIST_FROM_ARG 0
6 GET_ITER
>> 7 FOR_ITER 12 (to 22)
10 STORE_NAME 1 (a)
13 LOAD_NAME 1 (a)
16 LIST_APPEND 2
19 JUMP_ABSOLUTE 7
>> 22 RETURN_VALUE
while in Python 2.7 the same comprehension yields:
1 0 BUILD_LIST 0
3 LOAD_NAME 0 (b)
6 GET_ITER
>> 7 FOR_ITER 12 (to 22)
10 STORE_NAME 1 (a)
13 LOAD_NAME 1 (a)
16 LIST_APPEND 2
19 JUMP_ABSOLUTE 7
>> 22 RETURN_VALUE
Notice that the only change is that Pypy doesn’t use the BUILD_LIST 0 but adds a BUILD_LIST_FROM_ARG after loading the iterable. We can’t simply swap the codes in this case; their order is also affected. We think this issue should be addressed by the parser and not the scanner, since the iterable can be any expression, i.e. not a single LOAD_NAME.
[1] (1, 2) In this document we use angular brackets to distinguish tokens from byte-code instruction names.
[2] Since CPython 3.3, this can easily be handled by overriding the rule: _comprehension ::= LOAD_CONST MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1
Terms used in this document¶
- absolute jump
  Any jump instruction that takes the actual target offset that it jumps to: JUMP_ABSOLUTE and all the conditional jumps.
- conditional jump
  An instruction that only jumps conditionally: JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP, POP_JUMP_IF_FALSE, and POP_JUMP_IF_TRUE.
  Note: Although FOR_ITER only jumps when the iterator is exhausted, we consider it neither a conditional jump nor an unconditional jump.
- jump, jump instruction
  Any instruction that jumps: conditional jump, unconditional jump, relative jump and absolute jump.
- relative jump
  Any jump instruction that takes a delta argument to calculate the offset of the target: FOR_ITER and JUMP_FORWARD.
  Note: We don’t list all the instructions, only those that show up in expressions.
- target
  This term is used loosely to mean either the offset (after resolution for relative jumps), or the instruction at that offset.
- unconditional jump
  Any of the instructions JUMP_FORWARD and JUMP_ABSOLUTE.
Optimizations¶
- If the target of a JUMP_IF_FALSE_OR_POP (JUMP_IF_TRUE_OR_POP) is a conditional jump of the same truth-testing kind, update the target to be that of the previous target.
  This is the rule that collapses several and and or together.
- If the target of a JUMP_IF_*_OR_POP is a conditional jump of the opposite truth-testing kind, change the instruction to POP_JUMP_IF_* targeting the next instruction after the previous target.
  This is what happens for x and a or b and also for (x or a) and b:

  dis.dis(compile('x and a or b', '', 'eval'))
    1           0 LOAD_NAME                0 (x)
                3 POP_JUMP_IF_FALSE       12
                6 LOAD_NAME                1 (a)
                9 JUMP_IF_TRUE_OR_POP     15
          >>   12 LOAD_NAME                2 (b)
          >>   15 RETURN_VALUE

  dis.dis(compile('(x or a) and b', '', 'eval'))
    1           0 LOAD_NAME                0 (x)
                3 POP_JUMP_IF_TRUE        12
                6 LOAD_NAME                1 (a)
                9 JUMP_IF_FALSE_OR_POP    15
          >>   12 LOAD_NAME                2 (b)
          >>   15 RETURN_VALUE
... and others.
API of the module¶
xotl.ql.revenge.scanners.keep_single_return(instructions)¶
    Transform the instructions so that a single RETURN_VALUE is kept at the bottom.
    If the original instructions don’t have a RETURN_VALUE at the very end, return the same instruction set.
    Otherwise, all but the last RETURN_VALUE are replaced by a JUMP_FORWARD that targets the RETURN_VALUE at the end of the instructions.
    Offsets and jump targets are updated.
    Parameters: instructions (iterable) – The original instructions.
    Return type: list
xotl.ql.revenge.scanners.without_nops(instructions)¶
    Return the same instruction set with NOPs removed.
    The jump targets are properly updated.
    Loosely based on the same algorithm in Python/peephole.c.
    Parameters: instructions – An iterable of instructions.
    Returns: A generator of instructions.
xotl.ql.revenge.scanners.normalize_pypy_conditional(instructions)¶
    Apply the pypy normalization rule.
    The pypy normalization rule states that if the target of a JUMP_FORWARD (or JUMP_ABSOLUTE) is a RETURN_VALUE, the JUMP is replaced with the following instructions: RETURN_VALUE NOP NOP.
    This rule does not only apply when using Pypy; the name simply comes from the fact that Pypy compiles conditional expressions using JUMPs.
    Parameters: instructions – An iterable of instructions.
    Returns: A generator of instructions.
class xotl.ql.revenge.scanners.Token(name, arg=None, argval=None, argrepr=None, offset=-1, starts_line=False, instruction=None)¶
    Class representing a byte-code token.
    A byte-code token is equivalent to the contents of one line as output by dis.dis().