Payment Applications Handle Lots of Money. No, Really, Lots of It!
Alberto Revelli (icesurfer) icesurfer@gmail.com
Mark Swift (swifty) swifty@swift.org
Ruxcon 2013

Agenda
✔ Introduction to payment applications
✔ Avenues of attack
✔ Cryptocomedy
✔ How to fix this?
✔ Key takeaways

Introduction
This is an area badly understood and apparently overlooked by the security industry (except, it appears, by clueless morons). The people who ultimately have to be convinced about security care about one thing: money. It's that simple.
Twilight zone / reversing the natural order of things:
• Swifty will do Attack & The Dummy's Guide to Payments
• Ice will do Defence

Context
• Front Office: business management / capture. Not interesting.
• Trade Recs: mess with these areas and you can hide FO/Ops misbehaviour for ages.
• Accounting: P&L (profit and loss), actuals reconciliation, funky accounting methods to reduce tax. Serious scope for mischief here if direct settlements are made from here.
• Payment Gateway: some kind of 'payment gateway' — either manual, a direct banking interface ('EFT', horribly insecure), or FIDES/SWIFT.
• Master Data / PO / Settlements
• Bank / Bureaux: hard. Leave these guys alone or you will get an extradition warrant for a Christmas present.

Landscape
• Seeing the wood for the trees:
  • Massive payments and recs volumes.
  • Acceptable error margins.
  • Human nature: find the explanation you are looking for.
• End-of-month reconciliations don't reconcile ('accruals').
  • Accounts are indecipherable: reversals, accruals, depreciation, loans, amortizations, and other dodgy accounting techniques.
• Computers are never wrong.
  • Who ever does a manual reconciliation check?
• Auditors don't audit.
  • Accountants and auditors look at processes, procedures, and evidenced accounts.
• Security by buzzword. Crypto is magic that fixes everything.
  • Q: How do you secure payments?
  • A: We use military-grade FIPS-certified AES crypto and D/H to verify integrity. We are fully PCI compliant.
The Opportunity
(Diagram: Accounting, Payment Gateway, Settlements, Trade Recs, connected over a message bus)
1) Direct settlements: often broadcast messages (sometimes signed).
2) GUI: often no 2FA for left/right-hand operators.
3) Master data not secured? Create a new counterparty.
4) Payment / recs files written out to 'secure' fileshares.
5) Private key lying around (if it even exists).
6) Cannot mess via the GUI? Go direct to the database.
7) Plaintext memory attacks if you really have to.
8) Server: signing and crypto may happen here. Direct attacks on payment files possible prior to signing.
9) Private key probably ripe for attack, so direct attacks on payment files post signing also possible.

Landscape Issues
MT940 – Statement (relevant lines):
:61:120706D1343280,2NMSCBCP .LEA 36175AFB//4612-0000000   (amount and currency)
:86:BCP.LEA 36175AFB   (unique ref)
:62F:C120709USD9542201,3   (closing balance)

MT101 – Payment File:
{- FIDES ABNAMROXX99 101 02
:30:120622
:21:36175AFB   (unique ref)
:32B:USD1343280,25   (amount and currency)
:50H:/10944563 Sucker Co LTD London   (a/c and name)
:57D:Lloyds Bank PLC Moosley Street Manchester LOYDGB21N78/GB/771935   (pay to this bank)
:59:/897766 Security Services London
:70:023-0000254
:71A:OUR
-}

Show.Me.The.Money

Payment file processing (straight from the manual)
The server process producing the payment file can be processed via a specific server queue. If needed, a specific server queue can be set up per client and per process. The files are automatically stored in a predefined subfolder as defined per the server queue. User access to this folder should be limited to prevent users from accessing and updating the output files. Besides the payment file, a report (bank list) is produced containing the total amount per currency, the number of transactions, and a total sum bank account. The sum bank account is an example of a hash total which can be used to check the integrity of the data. When the payment file is imported in the banking software, the same hash total is shown.
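A hash total of this kind is an arithmetic checksum, not a cryptographic hash, which is exactly what the rest of this talk pokes at. A minimal sketch (the `(account, amount)` file layout is invented for illustration) shows that an attacker who can edit the file can divert money while preserving both the amount total and the hash total:

```python
# A "hash total" as described in the manual: combine each account
# number with its amount, sum, and keep the most significant digits.
# The (account, amount) payment layout here is hypothetical.
def hash_total(payments, digits=22):
    total = sum(acct * amount for acct, amount in payments)
    return str(total)[:digits]

def amount_total(payments):
    return sum(amount for _, amount in payments)

original = [(100, 1000), (300, 1000)]

# Divert 500 from each payment to attacker account 200. Because 200 is
# the weighted average of accounts 100 and 300, both the amount total
# and the "hash total" come out unchanged.
tampered = [(100, 500), (300, 500), (200, 1000)]

assert hash_total(original) == hash_total(tampered)
assert amount_total(original) == amount_total(tampered)
```

Any checksum that is linear in the payment data can be rebalanced this way; an integrity check needs a keyed cryptographic MAC or a signature, not arithmetic.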
Transport of payment file
The transport of the payment file from the Account system subfolder to the banking software can be based on regular file transfer, FTP, or SFTP. With in-house banking software the file is often read directly from the Account subfolder; with external banking software, FTP or SFTP is used. SFTP is advised.

Straight from the manual...

Process Outside T1 Bank

Generate and Send Payments (MT101)

Encrypt payment files:
echo %Password%|gpg --encrypt --batch --passphrase-fd 0 --sign --output "C:\FTP\IT2\Working\delivery.txt.asc" --recipient "Fides Treasury Services Ltd (FTP Channels)" "C:\FTP\IT\Outgoing\*.*"

Upload payment files:
FTP -s:C:\Scripts\FidesITUpload.txt
Contents of FidesITUpload.txt:
open xxx.xx.xx.xx
[removed password]
%Password%
cd /EFT/In
bin
hash
put C:\FTP\IT\Working\*.asc
bye

Copy payment files:
Cmd.exe /C Move /y C:\FTP\IT\Outgoing\*.* C:\FTP\IT\Complete

Download Statements (MT940)
FTP -i -s:C:\Scripts\FidesITDownload.txt
Contents of FidesITDownload.txt:
open xxx.xx.xx.xx
[removed password]
%Password%
cd /ARS
bin
hash
mget *.asc C:\FTP\IT\Incoming\*
bye

Decrypt statements:
echo %Password%|gpg --batch --passphrase-fd 0 --decrypt-files *.ASC

Copy statements to the application server:
Move /y *.txt "\\server.xxx.com\IT\Statements"

Agenda
✔ Introduction to payment applications
✔ Avenues of attack
✔ Cryptocomedy
✔ How to fix this?
✔ Key takeaways

Theft and Avoidance – Basic
Payment file / system attacks:
• Manual payments: steal the bank creds and transfer cash. Recs avoidance more difficult.
• Direct bank interface: as above, or mess with the payment file and then 'correct' the reconciliation statement.
• FIDES/SWIFT: as above; just grab it before it gets encrypted/signed, or after decryption/verification.

Theft and Avoidance – Less Basic
• Attacks via the GUI:
  • 2nd-hand fraud: steal the authentication for left/right-hand officers and set up payments via the system. Made more difficult by 2FA.
  • Change customer payment details (IBAN etc.), then change them back later. Suppression of reconciliation difficult.
  • Internal collusion: tried and tested method.
• Other:
  • Direct changes to the database – master data (counterparty details, maybe payment amounts, etc.).
  • External partners in fraud: 'evil hackers' break in, change the payment amounts, and suppress breaks.
• Note: operator accuracy reports – is it just management who is checking on the most incompetent operators?

Theft and Avoidance – Elite
So stealing money turns out to be a lot easier than imagined. Hiding your tracks long term requires a different skill set... Channel your inner accountant (some examples):
• Accruals – "we'll work it out next month"
• Reversals – "don't worry, the money came back"
• Depreciation – to adjust balances and hide cash out.
• Amortization – "we'll spread it over a few years"
• Loans (3rd party & intercompany): the 'Starbucks' approach.
• Credits (3rd party & intercompany): balance down $1m? Easy, make up a credit from somewhere else!
• Interest rates: increase them and move surplus cash.
• Exchange rates: change them and move surplus cash.
Computers have been around for 150 years. Accountants have been around for thousands.

A Secured Process – T1 Bank
• Accounting tricks to hide cash-out still work.
• Database: how is payment and counterparty data stored prior to payment generation?
• How is the backend/ERP system (payment instructions) secured?
• GUI-based master data attacks should still work.
• On the positive side, GUI payment attacks are now difficult due to 2FA.
More about the crypto management process later...

Agenda
✔ Introduction to payment applications
✔ Avenues of attack
✔ Cryptocomedy
✔ How to fix this?
✔ Key takeaways

Key Management? What's That?

Hashing? Yes, We Do Hash
1. Multiply each account number by the respective amount
2. Sum all results
3.
Take up to the 22 most significant digits
Finance Systems Specialist: "It is easy to add a hash total to the current flat file produced from SAP. We do it most of the time for bank transfer files. It is difficult to crack a hash total solution."

Encryption? Yes, That Too!
Finance Systems Specialist: "Sure we encrypt the payment information! Here's what we do..."
1. Generate a very long key
2. For each payment line x, we calculate (key + x)
3. We then ...
* Published in 1553 and considered unbreakable. Publicly broken by Charles Babbage in 1854.

Agenda
✔ Introduction to payment applications
✔ Avenues of attack
✔ Cryptocomedy
✔ How to fix this?
✔ A few caveats
✔ Key takeaways

So, Let's Fix This Sucker
(Accounting → Payment Gateway → Settlements / Trade Recs)
We focus on this bit, because:
- this is where there is the opportunity to make arbitrary payments;
- there is the opportunity to cover the traces of those payments;
- before this step, any settlement request has to be against an existing counterparty pre-set-up in the system.

A Real Example of This Process
• User A creates the payment instruction
• User B approves it
• User C releases it
• Finally, User D finalizes it into an MT101 file
"Hey, four people need to control this, what could possibly go wrong?"

Payments Are NOT Immutable Artifacts
When a user creates or approves a payment, all he does is write some rows in a DB. "DBAs do not understand commodities markets" is not a good defence strategy.

Really, They Are NOT!
In most cases, once the MT101 is created it is stored as a text file in a (shared) folder, waiting for Dude E to encrypt it for transmission. What could possibly go wrong?

Basic, Obvious Idea: Sign Each Step
• User A creates the payment... the payment is signed with his private key.
• User B approves it... check the first signature, then re-sign.
• User C re-approves it... check both previous signatures, then re-sign.
You got the idea...
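The sign-each-step chain can be sketched in a few lines. Per-user HMAC secrets stand in here for the per-user private keys (which in the real design live in an HSM); the user names and payment layout are made up for illustration:

```python
import hashlib
import hmac
import json

# Stand-ins for per-user private keys held in the HSM (hypothetical).
KEYS = {"creator": b"key-A", "approver1": b"key-B", "approver2": b"key-C"}

def sign(user, payment, prior_sigs):
    # Each signature covers the payment AND all earlier signatures, so
    # approvals are chained: you cannot re-approve altered data.
    msg = json.dumps([payment, prior_sigs], sort_keys=True).encode()
    return hmac.new(KEYS[user], msg, hashlib.sha256).hexdigest()

def verify(user, payment, prior_sigs, sig):
    return hmac.compare_digest(sign(user, payment, prior_sigs), sig)

payment = {"beneficiary": "Sucker Co LTD", "amount": "USD1343280,25"}
s1 = sign("creator", payment, [])              # User A creates
assert verify("creator", payment, [], s1)      # User B checks...
s2 = sign("approver1", payment, [s1])          # ...then re-signs
s3 = sign("approver2", payment, [s1, s2])      # User C likewise

# A DBA who edits the row now invalidates the whole chain:
payment["beneficiary"] = "Evil Co"
assert not verify("creator", payment, [], s1)
```

With real asymmetric keys the verifier would hold only public keys; the chaining logic is the same.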
Just Add a Few Columns, Right?
Payment proposal = payment details plus (UserID, Signature) columns for the "Creator", the "First Approver", and the "Second Approver".

Ok, It's a Bit More Complicated
An MT101 is a tree, not a row:
• GeneralInfo: Name, Recipient, ...
• MT101Info: AccountServicingInstitution, Authorisation, InstructionParty, MessageIndexingTotal, RequestedExecutionDate, OrderingCustomer, ...
• MT101TransactionDetails: AccountWithInstitution, Beneficiary, ChargesAccount, CurrencyTransactionAmount, ...
Each of these rows can be signed by a different combo of people.

Complicated, and Somewhat Overlapping
One MT101 carries MT101Info plus several MT101TransactionDetails blocks, each with its own signature.

Who Will Sign Anyway?
Payment App + DBMS → Crypto Web Service → Active Directory / RSA → HSM subnet (vShield) with HSMs in Calgary, London and Singapore.

About the HSM Guys...
• A Hardware Security Module (FIPS 140-2)
• Intrusion-resistant, tamper-evident hardware
• Keys in hardware
• PKCS#11, Microsoft CAPI, CNG, JCA/JCE, OpenSSL
• PRNG, symmetric and asymmetric key pair generation, encryption, decryption, digital signing

Playing with HSMs Is Fun!
• Remote login with USB token and PIN
• Tamper-proof
• Buttons are spaced for users who are likely to wear gloves (e.g. soldiers in tanks)

Two Apps and Lots of Groups
• Payment App (Creators, Approvers, Finalizers) talks to the Payment Service for payment signing.
• Admin App (Admins): key–user association (AD admins), group–user association, HSM partition management (HSM admins).

Sample payment record (screenshot residue, partial): authentication data (AD, RSA); PIL-12041210000; 27.03.12; Fides; 150 USD; CRESCHZZ80A; CH05 0483 5143 0870 9100 0; Big Co, Inc., PO Box 30, CH-1101 Bienne/Biel; CREDSUISS CHF A/C 1430870-91; 7 Quai des Bergues, 1211 Geneva.
Sign This, B*tch!

Step 1: Payment Creation — the Payment App submits the XML to the Crypto Web Service; Sign1 is attached.
Step 2: First Approval — Sign2 is attached over the XML plus Sign1.
Step 3: Second Approval — Sign3 is attached over the XML plus Sign1 and Sign2.
Step 4: Finalization — the XML (Sign1, Sign2, Sign3) becomes an MT101 carrying the payer's signature, then Encr(FIDES).

WOW! This Might Work!
• DBAs cannot modify payment data.
• The file is protected at the moment of creation.
• Adding yourself to an AD group is not enough: you also need a CryptoService admin to generate your key.
• Job done?

It Is Easy to Get Things Wrong
...Oooops! MT101 != XML :))))

Quis Custodiet Ipsos Custodes?*
1) Wait for the payment finalization request
2) Check the signatures (we don't want others to cheat at the same time)
3) Change the payment beneficiary
4) Sign + encrypt
5) PROFIT!!!
* Who watches the watchmen?
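A compromised signing service like the one above is caught if the application independently rebuilds the MT101 from its own copy of the approved data and checks the returned signature against that file. A sketch, with HMAC standing in for the payer's asymmetric signature and all data invented:

```python
import hashlib
import hmac

PAYER_KEY = b"payer-signing-key"  # hypothetical; really in the HSM

def cs_sign(mt101: bytes) -> str:
    # Crypto service: verify the workflow signatures (elided here),
    # build an MT101, sign it, and return ONLY the signature.
    return hmac.new(PAYER_KEY, mt101, hashlib.sha256).hexdigest()

def app_finalize(local_mt101: bytes, sig: str) -> bytes:
    # Application: build its own MT101 from the signed XML and check
    # the returned signature against THAT before encrypting/sending,
    # so a rogue service cannot silently swap the beneficiary.
    expect = hmac.new(PAYER_KEY, local_mt101, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expect, sig):
        raise ValueError("signature does not match locally built MT101")
    return local_mt101  # would now be GPG-encrypted for the bank

good = b":32B:USD1343280,25 :59:/897766 Security Services"
evil = b":32B:USD1343280,25 :59:/666 Evil Hacker Co"
assert app_finalize(good, cs_sign(good)) == good   # honest service
try:
    app_finalize(good, cs_sign(evil))              # rogue service
except ValueError:
    pass                                           # tampering detected
else:
    raise AssertionError("tampered MT101 was accepted")
```

Neither side can unilaterally produce an accepted file: that is the dual control described next.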
Phase 4: Corrected — XML (Sign1, Sign2, Sign3) → MT101 + payer's signature → Encr(FIDES), with the MT101 now checked by the application.

Dual Control! Always!
1) The application submits the XML
2) The CS verifies the signatures
3) The CS generates an MT101 and signs it
4) The CS returns the signature
5) The application generates another MT101 and verifies that the signature corresponds
6) The application calls GPG with the following parameters: the MT101, the signature, and the recipient's public key
7) The resulting file is finally sent

Agenda
✔ Introduction to payment applications
✔ Avenues of attack
✔ Cryptocomedy
✔ How to fix this?
✔ Key takeaways

A Few Takeaways
1) Payment apps are everywhere and handle big bucks, but they are generally "secured" by clueless people
2) Financial system vendors seem to be utterly clueless about this
3) Most payment apps out there are likely to be horribly vulnerable
4) Why don't we hear about the thefts?

More Takeaways
5) It can be a fun problem to solve!
6) There are obviously a whole bunch of other intricacies, but you get the main idea
7) The devil is in the details

Questions?

----------------------------------------------------------------------

Symbolic Execution of Linux Binaries
A tool for the ...

About Symbolic Execution
● Dynamically explore all program branches.
● Inputs are considered symbolic variables.
● Symbols remain uninstantiated and become constrained at execution time.
● At a conditional branch operating on symbolic terms, the execution is forked.
● Each feasible branch is taken, and the appropriate constraints logged.

Input space >> number of paths:
int main() {
    int val;
    read(STDIN, &val, sizeof(val));
    if (val > 0)
        if (val < 100)
            do_something();
        else
            do_something_else();
}

This is used for:
● Test generation and bug hunting.
● Reasoning about reachability.
● Worst-Case Execution Time Analysis.
● Comparing different versions of a function.
● Deobfuscation, malware analysis.
● AEG: Automatic Exploit Generation. Whaat?!

State of the art
● Lots of academic papers:
  ○ 2008-12-OSDI-KLEE
  ○ Unleashing Mayhem on Binary Code
● Several implementations:
  ○ SymDroid, Cloud9, Pex, jCUTE, Java PathFinder, KLEE, S2E, FuzzBALL, Mayhem, CBASS
● Only a few work on binaries:
  ○ libVEX / IL based
  ○ QEMU based

Our aim
● Emulate Intel x86-64 machine code symbolically.
● Load ELF executables.
● Synthesize any process state as starting point.
● The final code should be readable and easy to extend.
● Use as few dependencies as possible: pyelftools, distorm3 and z3.
● Analysis state can be saved and restored.
● Workload can be distributed (dispy).

Basic architecture

Instruction frequency in GNU libc
● 336 different opcodes
● 160218 total instructions
● 37% of them are MOV or ADD
● currently 185 instructions implemented

CPU
● Based on the distorm3 Decompose interface.
● Most instructions are very simple, e.g.:

@instruction
def DEC(cpu, dest):
    res = dest.write(dest.read() - 1)
    # Affected flags: o..szapc
    cpu.calculateFlags('DEC', dest.size, res)

Memory
class Memory:
    def mprotect(self, start, size, perms): ...
    def munmap(self, start, size): ...
    def mmap(self, addr, size, perms): ...
    def putchar(self, addr, data): ...
    def getchar(self, addr): ...

Operating system model (Linux)
class Linux:
    def exe(self, filename, argv=[], envp=[]): ...
    def syscall(self, cpu): ...
    def sys_open(self, cpu, buf, flags, mode): ...
    def sys_read(self, cpu, fd, buf, count): ...
    def sys_write(self, cpu, fd, buf, size): ...
    def sys_close(self, cpu, fd): ...
    def sys_brk(self, cpu, brk): ...

Symbols and SMT solver
class Solver:
    def getallvalues(self, x, maxcnt=30): ...
    def minmax(self, x, iters=10000): ...
    def check(self): ...
    def add(self, constraint): ...
    # Symbol factory
    def mkArray(self, size, name): ...
    def mkBool(self, name): ...
    def mkBitVec(self, size, name): ...
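The fork-at-a-branch idea from the opening slides can be shown in miniature on the toy program above. A real engine asks the SMT solver whether each constraint set is satisfiable; this self-contained sketch brute-forces a small integer range as a stand-in for that check:

```python
# Miniature symbolic exploration of:
#   if (val > 0) { if (val < 100) do_something();
#                  else do_something_else(); }
# Constraints are Python predicates over the symbolic input `val`.

def feasible(constraints, lo=-200, hi=200):
    # Stand-in for an SMT satisfiability check: brute-force a range.
    return any(all(c(v) for c in constraints) for v in range(lo, hi + 1))

def explore():
    paths = []
    taken = (lambda v: v > 0,)           # fork 1: branch taken
    fallthrough = (lambda v: v <= 0,)    # fork 1: branch not taken
    if feasible(taken):                  # nested fork on the taken side
        for label, c2 in [("do_something", lambda v: v < 100),
                          ("do_something_else", lambda v: v >= 100)]:
            cs = taken + (c2,)
            if feasible(cs):
                paths.append((label, cs))
    if feasible(fallthrough):
        paths.append(("skip_both", fallthrough))
    return paths

for label, cs in explore():
    example = next(v for v in range(-200, 201) if all(c(v) for c in cs))
    print(label, example)
# → do_something 1
#   do_something_else 100
#   skip_both -200
```

Despite roughly 2^32 possible inputs, only three feasible paths exist, and each yields a concrete test input — the "input space >> number of paths" point.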
Operation over symbols is almost transparent:
>>> from smtlibv2 import *
>>> s = Solver()
>>> a = s.mkBitVec(32)
>>> b = s.mkBitVec(32)
>>> s.add(a + 2*b > 100)
>>> s.check()
'sat'
>>> s.getvalue(a), s.getvalue(b)
(101, 0)

The glue: basic initialization
1. Make Solver, Memory, Cpu and Linux objects.
2. Load the ELF binary into Memory, initialize CPU registers, initialize the stack.

solver = Solver()
mem = SMemory(solver, bits, 12)
cpu = Cpu(mem, arch)
linux = SLinux(solver, [cpu], mem, ...)
linux.exe("./my_test", argv=[], env=[])

The glue: basic analysis loop
states = ['init.pkl']
while len(states) > 0:
    linux = load(states.pop())
    while linux.running:
        linux.execute()
        if isinstance(linux.cpu.PC, Symbol):
            vals = solver.getallvalues(linux.cpu.PC)
            # -- generate a state for each value --
            break

Micro demo
python system.py -h
usage: system.py [-h] [-sym SYM] [-stdin STDIN] [-stdout STDOUT] [-stderr STDERR] [-env ENV] PROGRAM ...
python system.py -sym stdin my_prog
stdin: PDF-1.2++++++++++++++++++++++++++++++

Symbolic inputs
We need to mark which part of the environment is symbolic:
● STDIN: a file, partially symbolic. Symbols are marked with "+".
● STDOUT and STDERR are placeholders.
● ARGV and ENVP can be symbolic.

A toy example
int main(int argc, char* argv[], char* envp[]) {
    char buffer[0x100] = {0};
    read(0, buffer, 0x100);
    if (strcmp(buffer, "ZARAZA") == 0)
        printf("Message: ZARAZA!\n");
    else
        printf("Message: Not Found!\n");
    return 0;
}

Conclusions, future work
● Push all known optimizations: solver cache, implied values, redundant state elimination, constraint independence, KLEE-like counterexample cache, symbol simplification.
● Add more CPU instructions (FPU, SIMD).
● Improve the Linux model, add networking.
● Implement an OSX loader and OS model.
● https://github.com/feliam/pysymemu

Gracias.
Contacto: feliam@binamuse.com

----------------------------------------------------------------------

MIT-CSAIL-TR-2013-018
August 6, 2013
Computer Science and Artificial Intelligence Laboratory Technical Report
Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu

Sound Input Filter Generation for Integer Overflow Errors
Fan Long, Stelios Sidiroglou-Douskos, Deokhwan Kim, Martin Rinard
MIT CSAIL
{fanl, stelios, dkim, rinard}@csail.mit.edu

Abstract
We present a system, SIFT, for generating input filters that nullify integer overflow errors associated with critical program sites such as memory allocation or block copy sites. SIFT uses a static program analysis to generate filters that discard inputs that may trigger integer overflow errors in the computations of the sizes of allocated memory blocks or the number of copied bytes in block copy operations. The generated filters are sound — if an input passes the filter, it will not trigger an integer overflow error for any analyzed site.
Our results show that SIFT successfully analyzes (and therefore generates sound input filters for) 52 out of 58 memory allocation and block memory copy sites in analyzed input processing modules from five applications (VLC, Dillo, Swfdec, Swftools, and GIMP). These nullified errors include six known integer overflow vulnerabilities. Our results also show that applying these filters to 62895 real-world inputs produces no false positives. The analysis and filter generation times are all less than a second.

1. Introduction
Many security exploits target software errors in deployed applications. One approach to nullifying vulnerabilities is to deploy input filters that discard inputs that may trigger the errors.
We present a new static analysis technique and implemented system, SIFT, for automatically generating filters that discard inputs that may trigger integer overflow errors at analyzed memory allocation and block copy sites. We focus on this problem, in part, because of its practical importance. Because integer overflows may enable code injection or other attacks, they are an important source of security vulnerabilities [22, 29, 32]. Unlike all previous techniques of which we are aware, SIFT is sound — if an input passes the filter, it will not trigger an overflow error at any analyzed site.

1.1 Previous Filter Generation Systems
Standard filter generation systems start with an input that triggers an error [8–10, 24, 33]. They next use the input to generate an execution trace and discover the path the program takes to the error. They then use a forward symbolic execution on the discovered path (and, in some cases, heuristically related paths) to derive a vulnerability signature — a boolean condition that the input must satisfy to follow the same execution path through the program to trigger the same error. The generated filter discards inputs that satisfy the vulnerability signature. Because other unconsidered paths to the error may exist, these techniques are unsound (i.e., the filter may miss inputs that exploit the error).
It is also possible to start with a potentially vulnerable site and use a weakest precondition analysis to obtain an input filter for that site. To our knowledge, the only previous technique that uses this approach [4] is unsound in that 1) it uses loop unrolling to eliminate loops and therefore analyzes only a subset of the possible execution paths and 2) it does not specify a technique for dealing with potentially aliased values. As is standard, the generated filter incorporates checks from conditional statements along the analyzed execution paths.
The goal is to avoid filtering potentially problematic inputs that the program would (because of safety checks at conditionals along the execution path) process correctly. As a result, the generated input filters perform a substantial (between 10^6 and 10^10) number of operations.

1.2 SIFT
SIFT starts with a set of critical expressions from memory allocation and block copy sites. These expressions control the sizes of allocated or copied memory blocks at these sites. SIFT then uses an interprocedural, demand-driven, weakest precondition static analysis to propagate the critical expression backwards against the control flow. The result is a symbolic condition that captures all expressions that the application may evaluate (in any execution) to obtain the values of critical expressions. The free variables in the symbolic condition represent the values of input fields. In effect, the symbolic condition captures all of the possible computations that the program may perform on the input fields to obtain the values of critical expressions. Given an input, the generated input filter evaluates this condition over the corresponding input fields to discard inputs that may cause an overflow.

Figure 1. The SIFT architecture. (Annotated modules feed critical site identification and the static analysis, which produces symbolic conditions; the filter generator emits a property checker that either drops an incoming input and generates a report, or passes it to the application.)

A key challenge is that, to successfully extract effective symbolic conditions, the analysis must reason precisely about interprocedural computations that use pointers to compute and manipulate values derived from input fields.
Our analysis meets this challenge by deploying a novel combination of techniques including 1) a novel interprocedural weakest precondition analysis that works with symbolic representations of input fields and values accessed via pointers (including input fields read in loops and values accessed via pointers in loops) and 2) an alias analysis that ensures that the derived symbolic condition correctly characterizes the values that the program computes.
Another key challenge is obtaining loop invariants that enable the analysis to precisely characterize how loops transform the propagated symbolic conditions. Our analysis meets this challenge with a novel symbolic expression normalization algorithm that enables the fixed point analysis to terminate unless it attempts to compute a symbolic value that depends on a statically unbounded number of values computed in different loop iterations (see Section 3.2).
• Sound Filters: Because SIFT takes all paths to analyzed memory allocation and block copy sites into account, it generates sound filters — if an input passes the filter, it will not trigger an overflow in the evaluation of any critical expression (including the evaluation of intermediate expressions at distant program points that contribute to the value of the critical expression).¹
• Efficient Filters: Unlike standard techniques, SIFT incorporates no checks from the program's conditional statements and works only with arithmetic expressions that contribute directly to the values of the critical expressions. It therefore generates much more efficient filters than standard techniques (SIFT's filters perform tens of operations as opposed to tens of thousands or more). Indeed, our experimental results show that, in contrast to standard filters, SIFT's filters spend essentially all of their time reading the input (as opposed to checking if the input may trigger an overflow error).
• Accurate Filters: Our experimental results show that, empirically, ignoring execution path constraints results in no loss of accuracy. Specifically, we tested our generated filters on 62895 real-world inputs for six benchmark applications and found no false positives (incorrectly filtered inputs that the program would have processed correctly). We attribute this potentially counterintuitive result to the fact that standard integer data types usually contain enough bits to represent the memory allocation sizes and block copy lengths that benign inputs typically elicit.

¹ As is standard in the field, SIFT uses an alias analysis that is designed to work with programs that do not access uninitialized or out-of-bounds memory. Our analysis therefore comes with the following soundness guarantee. If an input passes the filter for a given critical expression e, the input field annotations are correct (see Section 3.4), and the program has not yet accessed uninitialized or out-of-bounds memory when the program computes a value of e, then no integer overflow occurs during the evaluation of e (including the evaluations of intermediate expressions that contribute to the final value of the critical expression).

1.3 SIFT Usage Model
Figure 1 presents the architecture of SIFT. The architecture is designed to support the following usage model:
Module Identification. Starting with an application that is designed to process inputs presented in one or more input formats, the developer identifies the modules within the application that process inputs of interest. SIFT will analyze these modules to generate an input filter for the inputs that these modules process.
Input Statement Annotation.
The developer annotates the relevant input statements in the source code of the modules to identify the input field that each input statement reads.
Critical Site Identification. SIFT scans the modules to find all critical sites (currently, memory allocation and block copy sites). Each critical site has a critical expression that determines the size of the allocated or copied block of memory. The generated input filter will discard inputs that may trigger an integer overflow error during the computation of the value of the critical expression.
Static Analysis. For each critical expression, SIFT uses a demand-driven backwards static program analysis to automatically derive the corresponding symbolic condition. Each conjunct expression in this condition specifies, as a function of the input fields, how the value of the critical expression is computed along one of the program paths to the corresponding critical site.
Input Parser Acquisition. The developer obtains (typically from open-source repositories such as Hachoir [1]) a parser for the desired input format. This parser groups the input bit stream into input fields, then makes these fields available via a standard API.
Filter Generation. SIFT uses the input parser and symbolic conditions to automatically generate the input filter. When presented with an input, the filter reads the fields of the input and, for each symbolic expression in the conditions, determines if an integer overflow may occur when the expression is evaluated. If so, the filter discards the input. Otherwise, it passes the input along to the application. The generated filters can be deployed anywhere along the path from the input source to the application that ultimately processes the input.
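In spirit, the check a generated filter performs is small: evaluate each symbolic condition over the parsed input fields and flag any intermediate value that leaves the machine-integer range. A sketch of that idea — the condition used here (width × height × 4) is a made-up stand-in, not one of SIFT's actually derived conditions:

```python
# Evaluate a size computation over parsed input fields, step by step,
# and report whether any intermediate leaves the 32-bit int range.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def may_overflow(factors):
    acc = 1
    for f in factors:
        acc *= f  # exact arithmetic, then a representability check
        if not (INT32_MIN <= acc <= INT32_MAX):
            return True
    return False

def accept(fields):
    # The filter discards inputs whose size computation may overflow;
    # the field names and the condition itself are hypothetical.
    return not may_overflow([fields["width"], fields["height"], 4])

assert accept({"width": 640, "height": 480})            # benign: pass
assert not accept({"width": 0xFFFF, "height": 0xFFFF})  # discarded
```

This is why such filters run in tens of operations: they evaluate only the arithmetic that feeds the critical expression, with no path conditions.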
1.4 Experimental Results
We used SIFT to generate input filters for modules in five real-world applications: VLC 0.8.6h (a network media player), Dillo 2.1 (a lightweight web browser), Swfdec 0.5.5 (a flash video player), Swftools 0.9.1 (SWF manipulation and generation utilities), and GIMP 2.8.0 (an image manipulation application). Together, the analyzed modules contain 58 critical memory allocation and block copy sites. SIFT successfully generated filters for 52 of these 58 critical sites (SIFT's static analysis was unable to derive symbolic conditions for the remaining six critical sites; see Section 5.2 for more details). These applications contain six integer overflow vulnerabilities at their critical sites. SIFT's filters nullify all of these vulnerabilities.
Analysis and Filter Generation Time. We configured SIFT to analyze all critical sites in the analyzed modules, then generate a single, high-performance composite filter that checks for integer overflow errors at all of the sites. The maximum time required to analyze all of the sites and generate the composite filter was less than a second for each benchmark application.
False Positive Evaluation. We used a web crawler to obtain a set of at least 6000 real-world inputs for each application (for a total of 62895 input files). We found no false positives — the corresponding composite filters accept all of the input files in this test set.
Filter Performance. We measured the composite filter execution time for each of the 62895 input files in our test set. The average time required to read and filter each input was at most 16 milliseconds, with this time dominated by the time required to read in the input file.

1.5 Contributions
This paper makes the following contributions:
• SIFT: We present SIFT, a sound filter generation system for nullifying integer overflow vulnerabilities.
SIFT scans modules to find critical memory allocation and block copy sites, statically analyzes the code to automatically derive symbolic conditions that characterize how the application may compute the sizes of the allocated or copied memory blocks, and generates input filters that discard inputs that may trigger integer overflow errors in the evaluation of these expressions. In comparison with previous filter generation techniques, SIFT is sound and generates efficient and empirically precise filters.
• Static Analysis: We present a new static analysis that automatically derives symbolic conditions that capture, as a function of the input fields, how the integer values of critical expressions are computed along the various possible execution paths to the corresponding critical site. Key elements of this static analysis include 1) a precise backwards symbolic analysis that soundly and accurately reasons about symbolic conditions in the face of instructions that use pointers to load and store computed values and 2) a novel normalization procedure that enables the analysis to effectively synthesize symbolic loop invariants.
• Experimental Results: We present experimental results that illustrate the practical viability of our approach in protecting applications against integer overflow vulnerabilities at memory allocation and block copy sites.
The remainder of the paper is organized as follows. Section 2 presents an example that illustrates how SIFT works. Section 3 presents the core SIFT static analysis for C programs. Section 4 presents the formalization of the static analysis and discusses the soundness of the analysis. Section 5 presents the experimental results. Section 6 presents related work. We conclude in Section 7.

2. Example
We next present an example that illustrates how SIFT nullifies an integer overflow vulnerability in Swfdec 0.5.5, an open source shockwave flash player. Figure 2 presents (simplified) source code from Swfdec.
 1 int jpeg_decoder_decode(JpegDecoder *dec) {
 2   ...
 3   jpeg_decoder_start_of_frame(dec, ...);
 4   jpeg_decoder_init_decoder(dec);
 5   ...
 6 }
 7 void jpeg_decoder_start_of_frame(JpegDecoder *dec) {
 8   ...
 9   dec->height = jpeg_bits_get_u16_be(bits);
10   /* dec->height = SIFT_input("jpeg_height", 16); */
11   dec->width = jpeg_bits_get_u16_be(bits);
12   /* dec->width = SIFT_input("jpeg_width", 16); */
13   for (i = 0; i < dec->n_components; i++) {
14     dec->components[i].h_sample = getbits(bits, 4);
15     /* dec->components[i].h_sample =
16        SIFT_input("h_sample", 4); */
17     dec->components[i].v_sample = getbits(bits, 4);
18     /* dec->components[i].v_sample =
19        SIFT_input("v_sample", 4); */
20   }
21 }
22 void jpeg_decoder_init_decoder(JpegDecoder *dec) {
23   int max_h_sample = 0;
24   int max_v_sample = 0;
25   int i;
26   for (i = 0; i < dec->n_components; i++) {
27     max_h_sample = MAX(max_h_sample,
28                        dec->components[i].h_sample);
29     max_v_sample = MAX(max_v_sample,
30                        dec->components[i].v_sample);
31   }
32   dec->width_blocks = (dec->width + 8*max_h_sample - 1)
33                       / (8*max_h_sample);
34   dec->height_blocks = (dec->height + 8*max_v_sample - 1)
35                        / (8*max_v_sample);
36   for (i = 0; i < dec->n_components; i++) {
37     int rowstride;
38     int image_size;
39     dec->components[i].h_subsample = max_h_sample /
40                                      dec->components[i].h_sample;
41     dec->components[i].v_subsample = max_v_sample /
42                                      dec->components[i].v_sample;
43     rowstride = dec->width_blocks * 8 * max_h_sample /
44                 dec->components[i].h_subsample;
45     image_size = rowstride * (dec->height_blocks * 8 *
46                  max_v_sample / dec->components[i].v_subsample);
47     dec->components[i].image = malloc(image_size);
48   }
49 }

Figure 2. Simplified Swfdec source code. Input statement annotations appear in comments.

When Swfdec opens an SWF file with embedded JPEG images, it calls jpeg_decoder_decode() (line 1 in Figure 2) to decode each JPEG image in the file. This function in turn calls the function jpeg_decoder_start_of_frame() (line 7) to read the image metadata and the function
jpeg_decoder_init_decoder() (line 22) to allocate memory buffers for the JPEG image.

There is an integer overflow vulnerability at lines 43-47, where Swfdec calculates the size of the buffer for a JPEG image as:

rowstride * (dec->height_blocks * 8 * max_v_sample / dec->components[i].v_subsample)

At this program point, rowstride equals:

(jpeg_width + 8*max_h_sample - 1) / (8*max_h_sample) * 8*max_h_sample / (max_h_sample / h_sample)

while the rest of the expression equals:

(jpeg_height + 8*max_v_sample - 1) / (8*max_v_sample) * 8*max_v_sample / (max_v_sample / v_sample)

where jpeg_height is the 16-bit height input field value that Swfdec reads at line 9 and jpeg_width is the 16-bit width input field value that Swfdec reads at line 11. h_sample is one of the horizontal sampling factor values that Swfdec reads at line 14, while max_h_sample is the maximum horizontal sampling factor value. v_sample is one of the vertical sampling factor values that Swfdec reads at line 17, while max_v_sample is the maximum vertical sampling factor value. Malicious inputs with specifically crafted values in these input fields can cause the image buffer size calculation to overflow. In this case Swfdec allocates an image buffer that is smaller than required and eventually writes beyond the end of the allocated buffer.

The loop at lines 13-20 reads an array of horizontal and vertical sampling factor values. Swfdec computes the maximum values of these factors in the loop at lines 26-31. It then uses these values to compute the size of the allocated buffer at each iteration of the loop at lines 36-48.

Analysis Challenges: This example highlights several challenges that SIFT must overcome to successfully analyze and generate a filter for this program. First, the expression for the size of the buffer uses pointers to access values derived from input fields. To overcome this challenge, SIFT uses an alias analysis [17] to reason precisely about expressions with pointers.
Second, the memory allocation site (line 47) occurs in a loop, with the size expression referencing input values read in a different loop (lines 13-20). Different instances of the same input field (h_sample and v_sample) are used to compute (potentially different) sizes for different blocks of memory allocated at the same site. To reason precisely about these different instances, the analysis works with an abstraction that materializes, on demand, abstract representatives of accessed input field and computed values (see Section 3). To successfully analyze the loop, the analysis uses a new loop invariant synthesis algorithm (which exploits a new expression normalization technique to reach a fixed point).

Finally, Swfdec reads the input fields (lines 14 and 17) and computes the size of the allocated memory block (lines 45-46) in different procedures. SIFT therefore uses an interprocedural analysis that propagates symbolic conditions across procedure boundaries to obtain precise symbolic conditions.

We next describe how SIFT generates a sound input filter to nullify this integer overflow error.

Source Code Annotations: SIFT provides a declarative specification interface that enables the developer to specify which statements read which input fields. In this example, the developer specifies that the application reads the input fields jpeg_height, jpeg_width, h_sample, and v_sample at lines 10, 12, 15-16, and 18-19 in Figure 2.
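The overflow described above can be reproduced in isolation. The sketch below is our own simplification, not SIFT code: it fixes all sampling factors to 1, so the size computation reduces to rounding each dimension up to a multiple of 8 and multiplying. A 65535×65535 image (the maximum 16-bit field values) then drives the product to exactly 2^32, past what a 32-bit int can hold. We use the GCC/Clang intrinsic __builtin_mul_overflow to observe the overflow without invoking undefined behavior:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified Swfdec size computation with all sampling factors fixed to 1
   (our simplification): width_blocks = ceil(width/8), rowstride = 8*blocks.
   Returns true if the 32-bit image_size computation would overflow. */
static bool image_size_overflows(int32_t width, int32_t height) {
    int32_t width_blocks  = (width  + 8 - 1) / 8;
    int32_t height_blocks = (height + 8 - 1) / 8;
    int32_t rowstride = width_blocks * 8;   /* at most 65536: still fits */
    int32_t colstride = height_blocks * 8;
    int32_t image_size;
    /* 65536 * 65536 == 2^32 does not fit in int32_t: the unchecked
       original code would pass a wrapped (too small) size to malloc. */
    return __builtin_mul_overflow(rowstride, colstride, &image_size);
}
```

With width = height = 65535 both strides are 65536 and the final multiplication overflows; a benign 100×100 image does not.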
SIFT uses this specification to map the variables dec->height, dec->width, dec->components[i].h_sample, and dec->components[i].v_sample at lines 9, 11, 14, and 17 to the corresponding input field values. The field names h_sample and v_sample map to two arrays of input fields that Swfdec reads in the loop at lines 14 and 17.

C: safe(
  (((sext(jpeg_width[16], 32) + 8[32] × sext(h_sample(1)[4], 32) − 1[32])
      / (8[32] × sext(h_sample(1)[4], 32)) × 8[32] × sext(h_sample(1)[4], 32))
    / (sext(h_sample(1)[4], 32) / sext(h_sample(2)[4], 32)))
  ×
  (((sext(jpeg_height[16], 32) + 8[32] × sext(v_sample(1)[4], 32) − 1[32])
      / (8[32] × sext(v_sample(1)[4], 32)) × 8[32] × sext(v_sample(1)[4], 32))
    / (sext(v_sample(1)[4], 32) / sext(v_sample(2)[4], 32))))

Figure 3. The symbolic condition C for the Swfdec example. Subexpressions in C are bit vector expressions. The bracketed superscript indicates the bit width of each expression atom. "sext(v, w)" is the signed extension operation that transforms the value v to the bit width w.

Compute Symbolic Condition: SIFT uses a demand-driven, interprocedural, backward static analysis to compute the symbolic condition C in Figure 3. We use the notation "safe(e)" in Figure 3 to denote that overflow errors should not occur in any step of the evaluation of the expression e. Subexpressions in C are in bit vector expression form so that the expressions accurately reflect the representation of the numbers inside the computer as fixed-length bit vectors, as well as the semantics of arithmetic and logical operations as implemented inside the computer on these bit vectors. In Figure 3, the bracketed superscripts indicate the bit width of each expression atom. sext(v, w) is the signed extension operation that transforms the value v to the bit width w. SIFT also tracks the sign of each arithmetic operation in C. For simplicity, Figure 3 omits this information. SIFT soundly handles the loops that access the input field arrays h_sample and v_sample.
The generated C reflects the fact that the variable dec->components[i].h_sample and the variable max_h_sample might be two different elements of the input array h_sample. In C, h_sample(1) corresponds to max_h_sample and h_sample(2) corresponds to dec->components[i].h_sample. SIFT handles v_sample similarly. C includes all intermediate expressions evaluated at lines 32-35 and 39-46.

In this example, C contains only a single term of the form safe(e). However, if there are multiple execution paths, SIFT generates a symbolic condition C that conjoins multiple terms of the form safe(e) to cover all paths.

Generate Input Filter: Starting with the symbolic condition C, SIFT generates an input filter that discards any input that violates C, i.e., any input for which, for some term safe(e) in C, the evaluation of e (including all subexpressions) triggers an integer overflow error. The generated filter extracts all instances of the input fields jpeg_height, jpeg_width, h_sample, and v_sample (these are the input fields that appear in C) from an incoming input. It then iterates over all combinations of pairs of the input fields h_sample and v_sample to consider all possible bindings of h_sample(1), h_sample(2), v_sample(1), and v_sample(2) in C. For each binding, it checks the entire evaluation of C (including the evaluation of all subexpressions) for overflow. If there is no overflow in any evaluation, the filter accepts the input; otherwise it rejects the input.

3. Static Analysis

This section presents the static analysis algorithm in SIFT. We have implemented our static analysis for C programs using the LLVM Compiler Infrastructure [2].

3.1 Core Language and Notation

s := l: x = read(f) | l: x = c | l: x = y |
     l: x = y op z | l: x = *p | l: *p = x |
     l: p = malloc | l: skip | s′; s′′ |
     l: if (x) s′ else s′′ | l: while (x) {s′}

s, s′, s′′ ∈ Statement    f ∈ InputField
x, y, z, p ∈ Var    c ∈ Int    l ∈ Label

Figure 4. The Core Programming Language

Figure 4 presents the core language that we use to present the analysis.
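As a preview of how the backward analysis of Section 3.2 operates over this language, consider a toy fragment (ours, not from Swfdec) that reads two fields, multiplies them, and reaches an allocation. Starting from safe of the critical variable and walking backwards, each statement substitutes into the condition:

```
l1: w = read(jpeg_width)     -- reads an input field
l2: h = read(jpeg_height)
l3: t = w op h               -- op = ×; t is the critical value
l4: p = malloc               -- critical site: start from safe(t)

Backward from l4 with C = safe(t):
  at l3 (x = y op z):  safe(t)[w × h / t]           = safe(w × h)
  at l2 (read):        safe(w × h)[jpeg_height(1)/h] = safe(w × jpeg_height(1))
  at l1 (read):        safe(jpeg_width(1) × jpeg_height(1))
```

The final condition mentions only input fields and constants, which is exactly the form a filter can evaluate against a concrete input.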
The language is modeled on a standard lowered program representation in which 1) nested expressions are converted into sequences of statements of the form l: x = y op z (where x, y, and z are either non-aliased variables or automatically generated temporaries) and 2) all accesses to potentially aliased memory locations occur in load or store statements of the form l: x = *p or l: *p = x. Each statement contains a unique label l ∈ Label.

A statement of the form "l: x = read(f)" reads a value from an input field f. Because the input may contain multiple instances of the field f, different executions of the statement may return different values. For example, the loop at lines 14-17 in Figure 2 reads multiple instances of the h_sample and v_sample input fields.

Labels and Pointer Analysis: Figure 5 presents three utility functions in our notation: first: Statement → Label, last: Statement → Label, and labels: Statement → P(Label). Intuitively, given a statement s, first maps s to the label that corresponds to the first atomic statement inside s; last maps s to the label that corresponds to the last atomic statement inside s; labels maps s to the set of labels that are inside s.

first(s) = { first(s′)   if s = s′; s′′
           { l           otherwise, where l is the label of s

last(s) = { last(s′′)    if s = s′; s′′
          { l            otherwise, where l is the label of s

labels(s) = { labels(s′) ∪ labels(s′′)         if s = s′; s′′
            { {l} ∪ labels(s′)                 if s = l: while (x) {s′}
            { {l} ∪ labels(s′) ∪ labels(s′′)   if s = l: if (x) s′ else s′′
            { {l}                              otherwise, where l is the label of s

Figure 5. Definitions of first, last, and labels

We use LoadLabel and StoreLabel to denote the sets of labels that correspond to load and store statements, respectively; LoadLabel ⊆ Label and StoreLabel ⊆ Label. Our static analysis uses an underlying pointer analysis [17] to disambiguate aliases at load and store statements.
The pointer analysis provides two functions, no_alias and must_alias:

no_alias: (StoreLabel × LoadLabel) → Bool
must_alias: (StoreLabel × LoadLabel) → Bool

We assume that the underlying pointer analysis is sound, so that 1) no_alias(lstore, lload) = true only if the load statement at label lload will never retrieve a value stored by the store statement at label lstore; and 2) must_alias(lstore, lload) = true only if the load statement at label lload will always retrieve a value from the last memory location that the store statement at label lstore manipulates (see Section 4.2 for a formal definition of the soundness requirements that the alias analysis must satisfy).

3.2 Intraprocedural Analysis

Because it works with a lowered representation, our static analysis starts with a variable v at a critical program point. It then propagates v backward against the control flow to the program entry point. In this way the analysis computes a symbolic condition that soundly captures how the program, starting with input field values, may compute the value of v at the critical program point. The generated filters use the analysis results to check whether the input may trigger an integer overflow error in any of these computations.

Condition Syntax: Figure 7 presents the definition of the symbolic conditions that our analysis manipulates and propagates. A condition consists of a set of conjuncts of the form safe(e), which represents that the evaluation of the symbolic expression e should not trigger an overflow condition (including all sub-computations in the evaluation; see Section 4.5 for the formal definition of a program state satisfying a condition).

C := C ∧ safe(e) | safe(e)
e := e1 op e2 | atom
atom := x | c | f(id) | l(id)

id ∈ {1, 2, . . . }    x ∈ Var    c ∈ Int
l ∈ LoadLabel    f ∈ InputField

Figure 7.
The Condition Syntax

Symbolic conditions may contain four kinds of atoms: c represents a constant, x represents the variable x, f(id) represents the value of an input field f (the analysis uses the natural number id to distinguish different instances of f), and l(id) represents a value returned by a load statement with the label l (the analysis uses the natural number id to distinguish values loaded at different executions of the load statement).

Analysis Framework: Given a series of statements s, a label l within s (l ∈ labels(s)), and a symbolic condition C at the program point after the statement with the label l, our demand-driven backwards analysis computes a symbolic condition F(s, l, C). The analysis ensures that if F(s, l, C) holds before executing s, then C will hold whenever the execution reaches the program point after the statement with the label l (see Section 4.5 for the formal definition).

Given a program s0 as a series of statements and a variable v at a critical site associated with the label l, our analysis generates the condition F(s0, l, safe(v)) to create an input filter that checks whether the input may trigger an integer overflow error in the computations that the program performs to obtain the value of v at the critical site.

Analysis of Assignment, Conditional, and Sequence Statements: Figure 6 presents the analysis rules for basic program statements. The analysis of assignment statements replaces the assigned variable x with the assigned value (c, y, y op z, or f(id), depending on the assignment statement). Here the notation C[ea/eb] denotes the new symbolic condition obtained by replacing every occurrence of eb in C with ea. The analysis rule for input read statements materializes a new id to represent the read value f(id). This mechanism enables the analysis to correctly distinguish different instances of the same input field (because different instances have different ids).
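One way to realize this condition language and its "no overflow in any sub-computation" requirement is a small expression tree with a checked evaluator. The types and names below are ours, a sketch rather than SIFT's implementation, and variable and load atoms are omitted for brevity; the point is that the overflow flag accumulates across every sub-computation, mirroring the meaning of safe(e):

```c
#include <stdbool.h>
#include <stdint.h>

/* A cut-down rendering of the atom/expression grammar of Figure 7
   (constants, input fields, and binary operations only; names are ours). */
typedef struct Expr Expr;
struct Expr {
    enum { E_CONST, E_FIELD, E_BINOP } tag;
    int32_t value;           /* E_CONST */
    int field;               /* E_FIELD: index into the field bindings */
    char op;                 /* E_BINOP: '+' or '*' */
    const Expr *lhs, *rhs;   /* E_BINOP operands */
};

/* Evaluate e in 32 bits under concrete field bindings. *of becomes true
   if ANY sub-computation overflows -- exactly what safe(e) must exclude. */
static int32_t eval(const Expr *e, const int32_t *fields, bool *of) {
    if (e->tag == E_CONST) return e->value;
    if (e->tag == E_FIELD) return fields[e->field];
    int32_t a = eval(e->lhs, fields, of);
    int32_t b = eval(e->rhs, fields, of);
    int32_t r = 0;
    if (e->op == '+') *of |= __builtin_add_overflow(a, b, &r);
    else              *of |= __builtin_mul_overflow(a, b, &r);
    return r;
}
```

For example, evaluating safe(f(1) × f(1)) with f(1) bound to 65536 sets the flag, while binding 100 leaves it clear.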
If the label l identifies the end of a conditional statement, the analysis of the statement takes the union of the symbolic conditions from the analysis of the true and false branches of the conditional statement. The resulting symbolic condition correctly takes the execution of both branches into account. If the label l identifies a program point within one of the branches of a conditional statement, the analysis will propagate the condition from that branch only. The analysis of sequences of statements propagates the symbolic condition backwards through the statements in sequence.

Statement s              Rules
l: x = c                 F(s, l, C) = C[c/x]
l: x = y                 F(s, l, C) = C[y/x]
l: x = y op z            F(s, l, C) = C[y op z / x]
l: x = read(f)           F(s, l, C) = C[f(id)/x], f(id) is fresh
s′; s′′                  F(s, l, C) = F(s′, last(s′), F(s′′, l, C)), if l ∈ labels(s′′)
                         F(s, l, C) = F(s′, l, C), if l ∈ labels(s′)
l: if (v) s′ else s′′    F(s, l, C) = F(s′, last(s′), C) ∧ F(s′′, last(s′′), C)
                         F(s, l′, C) = F(s′, l′, C), if l′ ∈ labels(s′)
                         F(s, l′, C) = F(s′′, l′, C), if l′ ∈ labels(s′′)
l: while (v) {s′}        F(s, l, C) = Cfix ∧ C, if norm(F(s′, last(s′), Cfix ∧ C)) = Cfix
                         F(s, l′, C) = F(s, l, C′), if F(s′, l′, C) = C′ and l′ ∈ labels(s′)
l: p = malloc            F(s, l, C) = C
l: x = *p                F(s, l, C) = C[l(id)/x], l(id) is fresh
l: *p = x                F(s, l, C) = C(l1(id1), l, x)(l2(id2), l, x) . . . (ln(idn), l, x)
                         for all l1(id1), . . . , ln(idn) in C, where:
                         C(lload(id), l, x) = C                    if no_alias(l, lload)
                                            = C[x/lload(id)]       if ¬no_alias(l, lload) ∧ must_alias(l, lload)
                                            = C[x/lload(id)] ∧ C   if ¬no_alias(l, lload) ∧ ¬must_alias(l, lload)

Figure 6. Static analysis rules. The notation C[ea/eb] denotes the symbolic condition obtained by replacing every occurrence of eb in C with ea. norm(C) is the normalization function that transforms the condition C to an equivalent normalized condition.

Analysis of Load and Store Statements: The analysis of a load statement x = *p replaces the assigned variable x with a materialized abstract value l(id) that represents the loaded value.
As with input read statements, the analysis uses a newly materialized id to distinguish values read on different executions of the load statement.

The analysis of a store statement *p = x uses the alias analysis to appropriately match the stored value x against all loads that may return that value. Specifically, the analysis locates all li(idi) atoms in C that either may or must load a value that the store statement stores into the location p. If the alias analysis determines that the li(idi) expression must load x (i.e., the corresponding load statement will always access the last value that the store statement stored into location p), then the analysis of the store statement replaces all occurrences of li(idi) with x. If the alias analysis determines that the li(idi) expression may load x (i.e., on some executions the corresponding load statement may load x, on others it may not), then the analysis produces two symbolic conditions: one with li(idi) replaced by x (for executions in which the load statement loads x) and one that leaves li(idi) in place (for executions in which the load statement loads a value other than x).

 1 Input: Expression e
 2 Output: Normalized expression e_norm
 3
 4 e_norm ← e
 5 f_cnt ← {all → 0}
 6 l_cnt ← {all → 0}
 7 for a in Atoms(e) do
 8   if a is in form f(id) then
 9     nextid ← f_cnt(f) + 1
10     f_cnt ← f_cnt[f → nextid]
11     e_norm ← e_norm[*f(nextid) / f(id)]
12   else if a is in form l(id) then
13     nextid ← l_cnt(l) + 1
14     l_cnt ← l_cnt[l → nextid]
15     e_norm ← e_norm[*l(nextid) / l(id)]
16   end if
17 end
18 for a in Atoms(e_norm) do
19   if a is in form *f(id) then
20     e_norm ← e_norm[f(id) / *f(id)]
21   else if a is in form *l(id) then
22     e_norm ← e_norm[l(id) / *l(id)]
23   end if
24 end

Figure 8. Normalization function norm(e). Atoms(e) iterates over the atoms in the expression e from left to right.

We note that, if the pointer analysis is imprecise, the symbolic condition may become intractably large. SIFT uses the DSA algorithm [17], a context-sensitive, unification-based pointer analysis. We found that, in practice, this analysis
is precise enough to enable SIFT to efficiently analyze our benchmark applications (see Figure 14 in Section 5.2).

Analysis of Loop Statements: The analysis uses a fixed-point algorithm to synthesize the loop invariant Cfix required to analyze while loops. Specifically, the analysis of a statement while (x) {s′} computes a sequence of symbolic conditions Ci, where C0 = ∅ and Ci = norm(F(s′, last(s′), C ∧ Ci−1)). Conceptually, each successive symbolic condition Ci captures the effect of executing an additional loop iteration. The analysis terminates when it reaches a fixed point (i.e., when it has performed n iterations such that Cn = Cn−1). Here Cn is the discovered loop invariant. This fixed point correctly summarizes the effect of the loop (regardless of the number of iterations that it may perform).

The loop analysis normalizes the analysis result F(s′, last(s′), C ∧ Ci−1) after each iteration. For a symbolic condition C = safe(e1) ∧ . . . ∧ safe(en), the normalization of C is norm(C) = remove_dup(safe(norm(e1)) ∧ . . . ∧ safe(norm(en))), where norm(ei) is the normalization of each individual expression in C (using the algorithm presented in Figure 8) and remove_dup() removes duplicate conjuncts from the condition.

Normalization facilitates loop invariant discovery for loops that read input fields or load values via pointers. Each analysis of the loop body during the fixed point computation produces new materialized values f(id) and l(id) with fresh ids. The new materialized f(id) represent input fields that the current loop iteration reads; the new materialized l(id) represent values that the current loop iteration loads via pointers. The normalization algorithm appropriately renumbers these ids in the new symbolic condition so that the first appearance of each id is in lexicographic order. Because the normalization only renumbers ids, the normalized condition is equivalent to the original condition (see Section 4.5).
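The renumbering effect of Figure 8 can be reproduced over a flat list of atoms. The sketch below is our own; it assumes at most 16 distinct names and ids, and replaces Figure 8's two passes with '*' markers by an explicit old-to-new id map per name, which yields the same result: ids are renumbered in order of first appearance, left to right:

```c
/* An atom of a condition: kind 'f' (input field) or 'l' (load label),
   name identifying the field/label, id the instance number. */
typedef struct { char kind; int name; int id; } Atom;

/* Renumber instance ids in order of first appearance (sketch; fixed-size
   tables stand in for Figure 8's two passes with '*' markers). */
static void normalize(Atom *atoms, int n) {
    int map_f[16][16] = {{0}}, map_l[16][16] = {{0}}; /* old id -> new id */
    int cnt_f[16] = {0}, cnt_l[16] = {0};             /* ids used per name */
    for (int i = 0; i < n; i++) {
        int (*map)[16] = atoms[i].kind == 'f' ? map_f : map_l;
        int *cnt       = atoms[i].kind == 'f' ? cnt_f : cnt_l;
        int *slot = &map[atoms[i].name][atoms[i].id];
        if (*slot == 0)                    /* first time we meet this id */
            *slot = ++cnt[atoms[i].name];
        atoms[i].id = *slot;
    }
}
```

Two successive loop-body analyses that differ only in which fresh ids they materialized then normalize to identical conditions, which is how the fixed point is detected.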
This normalization enables the analysis to recognize loop invariants that show up as equivalent successive analysis results that differ only in the materialized ids that they use to represent input fields and values accessed via pointers.

The above algorithm will reach a fixed point and terminate if it computes the symbolic condition of a value that depends on at most a statically fixed number of values from the loop iterations. For example, our algorithm is able to compute the symbolic condition of the size parameter value of the memory allocation in Figure 2 — the value of this size parameter depends only on the values of jpeg_width and jpeg_height, the current values of h_sample and v_sample, and the maximum values of h_sample and v_sample, each of which comes from one previous iteration of the loop at lines 26-31.

Note that the algorithm will not reach a fixed point if it attempts to compute a symbolic condition that contains an unbounded number of values from different loop iterations. For example, the algorithm will not reach a fixed point if it attempts to compute a symbolic condition for the sum of a set of numbers computed within the loop (the sum depends on values from all loop iterations). To ensure termination, our current implementation terminates the analysis

 1 Input: A symbolic condition C
 2 Output: F(lcall: v = call proc v1 . . . vk, lcall, C),
 3   where proc is defined as:
 4   proc(a1, a2, . . . , ak) { s; ret vret }
 5 Where: l1(id1), l2(id2), . . . , ln(idn)
 6   are all atoms of the form l(id)
 7   that appear in S.
 8
 9 R ← ∅
10 ST0 ← F(s, last(s), safe(vret))
11 for e0 in exprs(ST0[v1/a1] . . . [vn/an]) do
12   ST1 ← F(s, last(s), safe(l1(id1)))
13   for e1 in exprs(ST1[v1/a1] . . . [vn/an]) do
14     ...
15     STn ← F(s, last(s), safe(ln(idn)))
16     for en in exprs(STn[v1/a1] . . . [vn/an]) do
17       e′0 ← make_fresh(e0, C)
18       ...
19       e′n ← make_fresh(en, C)
20       R ← R ∧ C[e′0/v] . . . [e′i/li(idi)] . . .
21     end
22     ...
23   end
24 end
25 F(lcall: v = call proc v1 . . .
vk, lcall, C) ← R

Figure 9. Procedure Call Analysis Algorithm. make_fresh(e, C) renumbers ids in e so that occurrences of l(id) and f(id) will not conflict with the condition C. exprs(C) returns the set of expressions that appear in the conjuncts of C. For example, exprs(safe(e1) ∧ safe(e2)) = {e1, e2}.

and fails to generate a symbolic condition C if it fails to reach a fixed point after ten iterations.

In practice, we expect that many programs may contain expressions whose values depend on an unbounded number of values from different loop iterations. Our analysis can successfully analyze such programs because it is demand driven — it only attempts to obtain precise symbolic representations of expressions that may contribute to the values of expressions in the analyzed symbolic condition C (which, in our current system, are ultimately derived from expressions that appear at memory allocation and block copy sites). Our experimental results indicate that our approach is, in practice, effective for this set of expressions, specifically because these expressions tend to depend on at most a fixed number of values from loop iterations.

3.3 Inter-procedural Analysis

Analyzing Procedure Calls: Figure 9 presents the interprocedural analysis for procedure call sites. Given a symbolic condition C and a function call statement lcall: v = call proc v1 . . . vk that invokes a procedure proc(a1, a2, . . . , ak) { s; ret vret }, the analysis computes F(v = call proc v1 . . . vk, lcall, C).

Conceptually, the analysis performs two tasks. First, it replaces any occurrences of the procedure return value v in C (the symbolic condition after the procedure call) with symbolic expressions that represent the values that the procedure may return. Second, it transforms C to reflect the effect of any store instructions that the procedure may execute.
Specifically, the analysis finds expressions l(id) in C that represent values that 1) the procedure may store into a location p and 2) the computation following the procedure may access via a load instruction that may access (a potentially aliased version of) p. It then replaces occurrences of l(id) in C with symbolic expressions that represent the corresponding values computed (and stored into p) within the procedure.

The analysis examines the invoked procedure body s to obtain the symbolic expressions that correspond to the return value (see line 10) or the value of l(id) (see lines 12 and 15). The analysis avoids redundant analysis of the invoked procedure by caching the analysis results F(s, last(s), safe(vret)) and F(s, last(s), safe(l(id))) for reuse.

Note that symbolic expressions derived from an analysis of the invoked procedure may contain occurrences of the formal parameters a1, . . . , ak. The interprocedural analysis translates these symbolic expressions into the name space of the caller by replacing occurrences of the formal parameters a1, . . . , ak with the corresponding actual parameters v1, . . . , vk from the call site (see lines 11, 13, and 16 in Figure 9). Also note that the analysis carefully renumbers the ids in the symbolic expressions derived from an analysis of the invoked procedure before the replacements (see lines 17-19). This ensures that the occurrences of f(id) and l(id) in the expressions are fresh in C.

Propagation to Program Entry: To derive the final symbolic condition at the start of the program, the analysis propagates the current symbolic condition up the call tree through procedure calls until it reaches the start of the program. When the propagation reaches the entry of the current procedure proc, the algorithm uses the procedure call graph to find all call sites that may invoke proc.
It then propagates the current symbolic condition C to the callers of proc, appropriately translating C into the naming context of the caller by substituting any formal parameters of proc that appear in C with the corresponding actual parameters from the call site. The analysis continues this propagation until it has traced out all paths in the call graph from the initial critical site where the analysis started to the program entry point. The final symbolic condition C is the conjunction of the conditions derived along all of these paths.

3.4 Extension to C Programs

We next describe how we extend our analysis to real-world C programs to generate input filters.

Identify Critical Sites: SIFT transforms the application source code into the LLVM intermediate representation (IR) [2], scans the IR to identify critical values (i.e., size parameters of memory allocation and block copy call sites) inside the developer-specified module, and then performs the static analysis for each identified critical value. By default, SIFT recognizes calls to standard C memory allocation routines (such as malloc, calloc, and realloc) and block copy routines (such as memcpy). SIFT can also be configured to recognize additional memory allocation and block copy routines (for example, dMalloc in Dillo).

Bit Width and Signedness: SIFT extends the analysis described above to track the bit width of each expression atom. It also tracks the sign of each expression atom and arithmetic operation and correctly handles extension and truncation operations (i.e., signed extension, unsigned extension, and truncation) that change the width of a bit vector. SIFT therefore faithfully implements the representation of integer values in the C program.

Function Pointers and Library Calls: SIFT uses its underlying pointer analysis [17] to disambiguate function pointers. It can analyze programs that invoke functions via function pointers.
The static analysis may encounter procedure calls (for example, calls to standard C library functions) for which the source code of the callee is not available. A standard way to handle this situation is to work with an annotated procedure declaration that gives the static analysis information that it can use to analyze calls to the procedure. If code for an invoked procedure is not available, by default SIFT currently synthesizes information that indicates that symbolic expressions are not available for the return value or for any values accessible (and therefore potentially stored) via procedure parameters (code following the procedure call may load such values). This information enables the analysis to determine if the return value or values accessible via the procedure parameters may affect the analyzed symbolic condition C. If so, SIFT does not generate a filter. Because SIFT is demand-driven, this mechanism enables SIFT to successfully analyze programs with library calls (all of our benchmark programs have such calls) as long as the calls do not affect the analyzed symbolic conditions.

Annotations for Input Read Statements: SIFT provides a declarative specification language that developers use to indicate which input statements read which input fields. In our current implementation these statements appear in the source code in comments directly below the C statement that reads the input field. See lines 10, 12, 15-16, and 18-19 in Figure 2 for examples that illustrate the use of the specification language in the Swfdec example. The SIFT annotation generator scans the comments, finds the input specification statements, then inserts new nodes into the LLVM IR that contain the specified information. Formally, this information appears as procedure calls of the following form:

v = SIFT_input("field_name", w);

where v is a program variable that holds the value of the input field with the field name field_name.
The width (in bits) of the input field is w. The SIFT static analyzer recognizes such procedure calls as specifying the correspondence between input fields and program variables and applies the appropriate analysis rule for input read statements (see Figure 6).

Input Filter Generation: We prune any conjuncts that contain residual occurrences of abstract materialized values l(id) in the final symbolic condition C. We also replace every residual occurrence of a program variable v with 0. Formally speaking, these residual occurrences correspond to initial values of the program state σ and h̄ in the abstract semantics (see Section 4.3). The resulting condition CInp will contain only input value and constant atoms.

The filter operates as follows. It first uses an existing parser for the input format to parse the input and extract the input fields used in the input condition CInp. Open source parsers are available for a wide range of input file formats, including all of the formats in our experimental evaluation [1]. These parsers provide a standard API that enables clients to access the parsed input fields.

The generated filter evaluates each conjunct expression in CInp by replacing each symbolic input variable in the expression with the corresponding concrete value from the parsed input. If an integer overflow may occur in the evaluation of any expression in CInp, the filter discards the input and optionally raises an alarm. For input field arrays such as h_sample and v_sample in the Swfdec example (see Section 2), the input filter enumerates all possible combinations of concrete values (see Figure 12 for the formal definition of condition evaluation). The filter discards the input if any combination can trigger the integer overflow error.
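For the Swfdec condition of Figure 3, this enumeration can be sketched as follows. The code is our own, not SIFT's generator, and it makes one labeled simplification: bindings with zero or order-inverted sampling factors are skipped to avoid division by zero, since the paper's filters target overflow rather than divide errors. Every arithmetic step is checked, so an overflow anywhere in the evaluation rejects the input:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Checked 32-bit helpers: *of accumulates overflow across all steps. */
static int32_t cadd(int32_t a, int32_t b, bool *of) {
    int32_t r; *of |= __builtin_add_overflow(a, b, &r); return r;
}
static int32_t cmul(int32_t a, int32_t b, bool *of) {
    int32_t r; *of |= __builtin_mul_overflow(a, b, &r); return r;
}

/* Evaluate the Figure 3 condition for one binding of h_sample(1),
   h_sample(2), v_sample(1), v_sample(2); true means some step overflows. */
static bool binding_overflows(int32_t w, int32_t h,
                              int32_t hs1, int32_t hs2,
                              int32_t vs1, int32_t vs2) {
    bool of = false;
    /* Simplification (ours): skip zero/order-inverted factors, which would
       divide by zero; in Swfdec max_h_sample >= h_sample >= 1 holds. */
    if (hs1 <= 0 || hs2 <= 0 || vs1 <= 0 || vs2 <= 0 ||
        hs2 > hs1 || vs2 > vs1)
        return false;
    int32_t wb = cadd(cadd(w, cmul(8, hs1, &of), &of), -1, &of) / (8 * hs1);
    int32_t rs = cmul(cmul(wb, 8, &of), hs1, &of) / (hs1 / hs2);
    int32_t hb = cadd(cadd(h, cmul(8, vs1, &of), &of), -1, &of) / (8 * vs1);
    int32_t cs = cmul(cmul(hb, 8, &of), vs1, &of) / (vs1 / vs2);
    cmul(rs, cs, &of);   /* the image_size product itself */
    return of;
}

/* Accept only if no pair binding drawn from the sampling-factor arrays
   can make any step of the evaluation overflow. */
static bool filter_accepts(int32_t w, int32_t h,
                           const int32_t *hs, const int32_t *vs, size_t n) {
    for (size_t a = 0; a < n; a++) for (size_t b = 0; b < n; b++)
        for (size_t c = 0; c < n; c++) for (size_t d = 0; d < n; d++)
            if (binding_overflows(w, h, hs[a], hs[b], vs[c], vs[d]))
                return false;
    return true;
}
```

A 65535×65535 image with unit sampling factors is rejected (the final product reaches 2^32), while a 100×100 image passes; the cost of the four nested loops is negligible next to reading the input, consistent with the measurements in Section 1.4.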
Given multiple symbolic conditions generated from multiple critical program points, SIFT can create a single efficient filter that first parses the input, then checks the parsed input against all symbolic conditions in series. This approach amortizes the overhead of reading the input (in practice, reading the input consumes essentially all of the time required to execute the filter, see Figure 15) over all of the symbolic condition checks.

4. Soundness of the Static Analysis

We next formalize our static analysis algorithm on the core language in Figure 4 and discuss the soundness of the analysis. We focus on the intraprocedural analysis and omit a discussion of the interprocedural analysis as it uses standard techniques based on summary tables.

4.1 Dynamic Semantics of the Core Language

Program State: We define the program state (σ, ρ, ς, ρ̄, Inp) as follows:

σ: Var → (Loc + Int + {undef})
ς: Var → Bool
ρ: Loc → (Loc + Int + {undef})
ρ̄: Loc → Bool
Inp: InputField → P(Int)

σ and ρ map variables and memory locations to their corresponding values. We use undef to represent uninitialized values. We define that if any operand of an arithmetic operation is undef, the result of the operation is also undef. Inp represents the input file, which is unchanged during the execution. ς maps each variable to a boolean flag, which tracks whether the computation that generates the value of the variable (including all sub-computations) generates an overflow. ρ̄ maps each memory location to a boolean overflow flag similar to ς. The initial states σ0 and ρ0 map all variables and locations to undef. The initial states ς0 and ρ̄0 map all variables and locations to false. The values of uninitialized variables and memory locations are undefined as per the C language specification standard.

Small Step Rules: Figure 10 presents the small step dynamic semantics of the language.
Note that in Figure 10, overflow(a, b, op) is a function that returns true if and only if the computation a op b causes an overflow. A main point of departure from standard languages is that we also update ς and ρ̄ to track overflow errors during each execution step. For example, the op rule in Figure 10 appropriately updates the overflow flag of x in ς by checking whether the computation that generates the value of x (including the sub-computations that generate the values of y and z) results in an overflow condition.

read: c ∈ Inp(f), σ′ = σ[x → c], ς′ = ς[x → false] ⊢ ⟨l: x = read(f), σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ′, ρ, ς′, ρ̄, Inp⟩
const: σ′ = σ[x → c], ς′ = ς[x → false] ⊢ ⟨l: x = c, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ′, ρ, ς′, ρ̄, Inp⟩
assign: σ′ = σ[x → σ(y)], ς′ = ς[x → ς(y)] ⊢ ⟨l: x = y, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ′, ρ, ς′, ρ̄, Inp⟩
malloc: ξ ∈ Loc, ξ is fresh, σ′ = σ[p → ξ], ς′ = ς[p → false] ⊢ ⟨l: p = malloc, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ′, ρ, ς′, ρ̄, Inp⟩
seq-1: ⟨nil: skip; s, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s, σ, ρ, ς, ρ̄, Inp⟩
seq-2: ⟨s, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s′′, σ′, ρ′, ς′, ρ̄′, Inp⟩ ⊢ ⟨s; s′, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s′′; s′, σ′, ρ′, ς′, ρ̄′, Inp⟩
load: σ(p) = ξ, ξ ∈ Loc, σ′ = σ[x → ρ(ξ)], ς′ = ς[x → ρ̄(ξ)] ⊢ ⟨l: x = ∗p, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ′, ρ, ς′, ρ̄, Inp⟩
store: σ(p) = ξ, ξ ∈ Loc, ρ′ = ρ[ξ → σ(x)], ρ̄′ = ρ̄[ξ → ς(x)] ⊢ ⟨l: ∗p = x, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ, ρ′, ς, ρ̄′, Inp⟩
op: σ(y) ∉ Loc, σ(z) ∉ Loc, b = ς(y) ∨ ς(z) ∨ overflow(σ(y), σ(z), op) ⊢ ⟨l: x = y op z, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ[x → σ(y) op σ(z)], ρ, ς[x → b], ρ̄, Inp⟩
if-t: σ(x) ≠ 0 ⊢ ⟨l: if (x) s else s′, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s, σ, ρ, ς, ρ̄, Inp⟩
if-f: σ(x) = 0 ⊢ ⟨l: if (x) s else s′, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s′, σ, ρ, ς, ρ̄, Inp⟩
while-f: σ(x) = 0 ⊢ ⟨l: while (x) {s}, σ, ρ, ς, ρ̄, Inp⟩ → ⟨nil: skip, σ, ρ, ς, ρ̄, Inp⟩
while-t: σ(x) ≠ 0, s′ = s; l: while (x) {s} ⊢ ⟨l: while (x) {s}, σ, ρ, ς, ρ̄, Inp⟩ → ⟨s′, σ, ρ, ς, ρ̄, Inp⟩

Figure 10. The small step operational semantics of the language. "nil" is a special label reserved by the semantics.

4.2 Soundness of the Pointer Analysis

Our analysis uses an underlying pointer analysis [17] to analyze programs that use pointers. We formally state our assumptions about the soundness of the underlying pointer alias analysis as follows:

Definition 1 (Soundness of no_alias and must_alias). Given a sequence of execution ⟨s0, σ0, ρ0, ς0, ρ̄0⟩ → ⟨s1, σ1, ρ1, ς1, ρ̄1⟩ → ··· and two labels lstore, lload, where lstore is the label for the store statement sstore such that sstore = "lstore: ∗p = x" and lload is the label for the load statement sload such that sload = "lload: x′ = ∗p′", we have: no_alias(lstore, lload) → ∀i, first(si) = lstore, ∀j, first(sj) = lload, i < j → σi(p) ≠ σj(p′).

Dillo: ... ∧ safe(png_width[32] × ((c[32] × sext(png_bitdepth[8], 32)) >> 3[32]) × png_height[32])
GIMP: safe((gif_width[32] × gif_height[32]) × 2[32]) ∧ safe(gif_width[32] × gif_height[32] × 4[32])

Figure 16. The symbolic condition C in bit vector form for VLC, Swftools-png2swf, Swftools-jpeg2swf, Dillo, and GIMP. The superscript indicates the bit width of each expression atom. "sext(v, w)" is the sign extension operation that transforms the value v to the bit width w.

5.3.1 VLC

The VLC wav.c module contains an integer overflow vulnerability (CVE-2008-2430) when parsing WAV sound inputs. Figure 17 presents (a simplified version of) the source code that is related to this error. When VLC parses the format chunk of a WAV input, it first reads the input field fmt_size, which indicates the size of the format chunk (line 6). VLC then allocates a buffer to hold the format chunk (line 14 in Figure 17). A large fmt_size field value (for example, 0xfffffffe) will cause an overflow to occur when VLC computes the buffer size. We annotate the source code to specify where the module reads the fmt_size input field (line 11). SIFT then analyzes the module to obtain the symbolic condition C (Figure 16), which soundly summarizes how VLC computes the buffer size from input fields.

5.3.2 Dillo

Dillo contains an integer overflow vulnerability (CVE-2009-2294) in its png module. Figure 18 presents the simplified source code for this example. Dillo uses the libpng library to read PNG images. The libpng runtime calls png_process_data() (line 2) to process each PNG image.
This function then calls png_push_read_chunk() (line 11) to process each chunk in the PNG image. When the libpng runtime reads the first data chunk (the IDAT chunk), it calls the Dillo callback Png_datainfo_callback() (lines 66-75) in the Dillo PNG processing module. There is an integer overflow vulnerability at line 73 where Dillo calculates the size of the image buffer as png->rowbytes * png->height. On a 32-bit machine, inputs with large width and height fields can cause the image buffer size calculation to overflow. In this case Dillo allocates an image buffer that is smaller than required and eventually writes beyond the end of the allocated buffer. Figure 16 presents the symbolic condition C for Dillo. C soundly takes intermediate computations over all execution paths into consideration, including the switch branch at lines 45-59 that sets the variable png_ptr->channels and the PNG_ROWBYTES macro at lines 26-29. Note that the constant c[32] in C corresponds to the possible values of png_ptr->channels, which are between 1 and 4.

5.3.3 Swftools

Swftools is a set of utilities for creating and manipulating SWF files. Swftools contains two tools, png2swf and jpeg2swf, which transform PNG and JPEG images to SWF files. Each of these two tools contains an integer overflow vulnerability (CVE-2010-1516). Figure 19 presents (a simplified version of) the source code that contains the png2swf vulnerability. When processing PNG images, Swftools calls getPNG() (lines 20-43) at png2swf.c:763 to read the PNG image into memory. getPNG() first calls png_read_header() (lines 1-18) to locate and read the header chunk which contains the PNG metadata. It then uses the metadata information to calculate the length of the image data at png.h:502 (lines 39-40). There is no bounds check on the width and the height values from the header chunk before this calculation. On a 32-bit machine, a PNG image with large width and height values will trigger the integer overflow error.
We annotate, at lines 7 and 10, the statements that read the input fields png_width and png_height and use SIFT to derive the symbolic condition for this vulnerability. Figure 16 presents the symbolic condition C.

 1 // libpng main data process function.
 2 void png_process_data(png_structp png_ptr,
 3                       png_infop info_ptr, ...) {
 4   ...
 5   while (png_ptr->buffer_size) {
 6     // This is a wrapper for png_push_read_chunk
 7     png_process_some_data(png_ptr, info_ptr);
 8   }
 9 }
10 // chunk handler dispatcher
11 void png_push_read_chunk(png_structp png_ptr,
12                          png_infop info_ptr) {
13   if (!png_memcmp(png_ptr->chunk_name, png_IHDR, 4)) {
14     ...
15     png_handle_IHDR(png_ptr, info_ptr, ...);
16   }
17   ...
18   else if (!png_memcmp(png_ptr->chunk_name,
19                        png_IDAT, 4)) {
20     ...
21     // Datainfo callback is called
22     png_push_have_info(png_ptr, info_ptr);
23     ...
24   }
25 }
26 #define PNG_ROWBYTES(pixel_bits, width) \
27   ((pixel_bits) >= 8 ? \
28    ((width) * (((png_uint_32)(pixel_bits)) >> 3)) : \
29    ((((width) * ((png_uint_32)(pixel_bits))) + 7) >> 3))
30 void png_handle_IHDR(png_structp png_ptr,
31                      png_infop info_ptr, ...) {
32   ...
33   // read individual png fields from input buffer
34   width = png_get_uint_31(png_ptr, buf);
35   /* width = SIFT_input("png_width", 32); */
36   height = png_get_uint_31(png_ptr, buf + 4);
37   /* height = SIFT_input("png_height", 32); */
38   bit_depth = buf[8];
39   /* bit_depth = SIFT_input("png_bitdepth", 8); */
40   ...
41   png_ptr->width = width;
42   png_ptr->height = height;
43   png_ptr->bit_depth = (png_byte)bit_depth;
44   ...
45   switch (png_ptr->color_type) {
46     case PNG_COLOR_TYPE_GRAY:
47     case PNG_COLOR_TYPE_PALETTE:
48       png_ptr->channels = 1;
49       break;
50     case PNG_COLOR_TYPE_RGB:
51       png_ptr->channels = 3;
52       break;
53     case PNG_COLOR_TYPE_GRAY_ALPHA:
54       png_ptr->channels = 2;
55       break;
56     case PNG_COLOR_TYPE_RGB_ALPHA:
57       png_ptr->channels = 4;
58       break;
59   }
60   png_ptr->pixel_depth = (png_byte)(
61     png_ptr->bit_depth * png_ptr->channels);
62   png_ptr->rowbytes = PNG_ROWBYTES(
63     png_ptr->pixel_depth, png_ptr->width);
64 }
65 // Dillo datainfo initialization callback
66 static void Png_datainfo_callback(png_structp png_ptr,
67                                   ...) {
68   DilloPng *png;
69   png = png_get_progressive_ptr(png_ptr);
70   ...
71   // where the overflow happens
72   png->image_data = (uchar_t *) dMalloc(
73     png->rowbytes * png->height);
74   ...
75 }

Figure 18. The simplified source code from Dillo and libpng with annotations inside comments.

 1 static int png_read_header(FILE *fi,
 2                            struct png_header *header) {
 3   ...
 4   while (png_read_chunk(&id, &len, &data, fi)) {
 5     if (!strncmp(id, "IHDR", 4)) {
 6       ...
 7       header->width = data[0]<<24 | data[1]<<16 |
 8                       data[2]<<8 | data[3];
 9       /* header->width = SIFT_input("png_width", 32); */
10       header->height = data[4]<<24 | data[5]<<16 |
11                        data[6]<<8 | data[7];
12       /* header->height = SIFT_input("png_height", 32); */
13       ...
14     }
15     ...
16   }
17   ...
18 }
19
20 EXPORT int getPNG(const char *sname, int *destwidth,
21                   int *destheight, unsigned char **destdata) {
22   ...
23   unsigned long int imagedatalen;
24   ...
25   if (!png_read_header(fi, &header)) {
26     fclose(fi);
27     return 0;
28   }
29   ...
30   if (header.mode==3 || header.mode==0) bypp = 1;
31   else if (header.mode == 4) bypp = 2;
32   else if (header.mode == 2) bypp = 3;
33   else if (header.mode == 6) bypp = 4;
34   else {
35     ...
36     return 0;
37   }
38
39   imagedatalen = bypp * header.width *
40                  header.height + 65536;
41   imagedata = (unsigned char *)malloc(imagedatalen);
42   ...
43 }

Figure 19. The simplified source code from png2swf in swftools with annotations inside comments.

jpeg2swf contains a similar integer overflow vulnerability when processing JPEG images. At jpeg2swf.c:171, jpeg2swf first calls the libjpeg API to read the JPEG image. At jpeg2swf.c:173, jpeg2swf then immediately calculates the size of a memory buffer for holding the JPEG file in its own data structure. Because it directly uses the input width and height values in the calculation without range checks, large width and height values may cause overflow errors. Figure 16 presents the symbolic condition C for jpeg2swf.

5.3.4 GIMP

GIMP contains an integer overflow vulnerability (CVE-2012-3481) in its GIF loading plugin file-gif-load.c. When GIMP opens a GIF file, it calls load_image at file-gif-load.c:335 to load the entire GIF file into memory. For each individual image in the GIF file, this function first reads the image metadata information, then calls ReadImage to process the image. At file-gif-load.c:1064, the plugin calculates the size of the image output buffer as a function of the product of the width and height values from the input. Because it uses these values directly without range checks, large height and width fields may cause an integer overflow. In this case GIMP may allocate a buffer smaller than the required size. We annotate the source code based on the GIF specification and use SIFT to derive the symbolic condition for this vulnerability. Figure 16 presents the generated symbolic condition C.

5.4 Discussion

The experimental results highlight the combination of properties that, together, enable SIFT to effectively nullify potential integer overflow errors at memory allocation and block copy sites.
SIFT is efficient enough to deploy in production on real-world modules (the combined program analysis and filter generation times are always under a second), the analysis is precise enough to successfully generate input filters for the majority of memory allocation and block copy sites, the results provide encouraging evidence that the generated filters are precise enough to have few or even no false positives in practice, and the filters execute efficiently enough to deploy with acceptable filtering overhead.

6. Related Work

Weakest Precondition: Madhavan and Komondoor present an approximate weakest precondition analysis to verify the absence of null dereference errors in Java programs [21]. The underlying analysis domain tracks whether or not variables may contain null references. To obtain acceptable precision for the null dereference verification problem, the technique incorporates null-dereference checks from conditional statements into the propagated conditions. Because SIFT focuses on integer overflow errors, the underlying analysis domain (symbolic arithmetic expressions) and propagation rules are significantly more complex. SIFT also does not incorporate checks from conditional statements, a design decision that, for the integer overflow problem, produces efficient and accurate filters. The problems are also different: SIFT generates filters to eliminate security vulnerabilities, while Madhavan and Komondoor focus on verifying the absence of null dereferences. Flanagan and Saxe present a general intraprocedural weakest precondition analysis for generating verification conditions for ESC/Java programs [12]. SIFT differs in that it focuses on integer overflow errors. Because of this focus, SIFT can synthesize its own loop invariants (Flanagan and Saxe rely on developer-provided invariants).
In addition, SIFT is interprocedural, and uses the analysis results to generate sound filters that nullify integer overflow errors.

Anomaly Detection: Anomaly detection techniques generate (unsound) input filters by empirically learning properties of successfully or unsuccessfully processed inputs [14, 16, 19, 23, 25, 26, 30, 31]. Web-based anomaly detection [16, 26] uses input features (e.g., request length and character distributions) from attack-free HTTP traffic to model normal behaviors. HTTP requests that contain features that violate the model are flagged as anomalous and dropped. Similarly, Valeur et al. [30] propose a learning-based approach for detecting SQL-injection attacks. Wang et al. [31] propose a technique that detects network-based intrusions by examining the character distribution in payloads. Perdisci et al. [25] propose a clustering-based anomaly detection technique that learns features from malicious traces (as opposed to benign traces). Input rectification learns properties of inputs that the application processes successfully, then modifies inputs to ensure that they satisfy the learned properties [20]. Two key differences between SIFT and these techniques are that SIFT statically analyzes the application, not its inputs, and takes all execution paths into account to generate a sound filter.

Static Analysis for Finding Integer Errors: Several static analysis tools have been proposed to find integer errors [6, 27, 32]. KINT [32], for example, analyzes individual procedures, with the developer optionally providing procedure specifications that characterize the value ranges of the parameters. KINT also unsoundly avoids the loop invariant synthesis problem by replacing each loop with the loop body (in effect, unrolling the loop once). Despite substantial effort, KINT reports a large number of false positives [32]. SIFT addresses a different problem: it is designed to nullify, not detect, overflow errors.
In pursuit of this goal, it uses an interprocedural analysis, synthesizes symbolic loop invariants, and soundly analyzes all execution paths to produce a sound filter.

Symbolic Test Generation: DART [15] and KLEE [5] use symbolic execution to automatically generate test cases that can expose errors in an application. IntScope [29] and SmartFuzz [22] are symbolic execution systems specifically for finding integer errors. It would be possible to combine these systems with previous input-driven filter generation techniques to obtain filters that discard inputs that take the discovered path to the error. As discussed previously, SIFT differs in that it considers all possible paths, so that its generated filters come with a soundness guarantee: if an input passes the filter, it will not exploit the integer overflow error.

Runtime Check and Library Support: To alleviate the problem of false positives, several research projects have focused on runtime detection tools that dynamically insert runtime checks before integer operations [3, 7, 11, 34]. Another similar technique is to use safe integer libraries such as SafeInt [18] and CERT's IntegerLib [28] to perform sanity checks at runtime. Using these libraries requires that developers rewrite existing code to use safe versions of integer operations. However, the inserted code typically imposes non-negligible overhead. When integer errors happen in the middle of the program execution, these techniques usually raise warnings and terminate the execution, which effectively turns integer overflow attacks into DoS attacks. SIFT, in contrast, inserts no code into the application and blocks inputs that exploit integer overflow vulnerabilities, avoiding the attacks completely.

Benign Integer Overflows: In some cases, developers may intentionally write code that contains benign integer overflows [29, 32].
A potential concern is that techniques that nullify overflows may interfere with the intended behavior of such programs [29, 32]. Because SIFT focuses on critical memory allocation and block copy sites that are unlikely to have such intentional integer overflows, SIFT is unlikely to nullify benign integer overflows and therefore unlikely to interfere with the intended behavior of the program.

7. Conclusion

Integer overflow errors can lead to security vulnerabilities. SIFT analyzes how the application computes integer values that appear at memory allocation and block copy sites to generate input filters that discard inputs that may trigger overflow errors in these computations. Our results show that SIFT can quickly generate efficient and precise input filters for the vast majority of memory allocation and block copy call sites in our analyzed benchmark modules.

References

[1] Hachoir. http://bitbucket.org/haypo/hachoir/wiki/Home.
[2] The LLVM compiler infrastructure. http://www.llvm.org/.
[3] D. Brumley, T. Chiueh, R. Johnson, H. Lin, and D. Song. RICH: Automatically protecting against integer-based vulnerabilities. Department of Electrical and Computing Engineering, page 28, 2007.
[4] D. Brumley, H. Wang, S. Jha, and D. Song. Creating vulnerability signatures using weakest preconditions. In Proceedings of the 20th IEEE Computer Security Foundations Symposium, CSF ’07, pages 311–325, Washington, DC, USA, 2007. IEEE Computer Society.
[5] C. Cadar, D. Dunbar, and D. Engler. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI ’08, pages 209–224, Berkeley, CA, USA, 2008. USENIX Association.
[6] E. Ceesay, J. Zhou, M. Gertz, K. Levitt, and M. Bishop. Using type qualifiers to analyze untrusted integers and detecting security flaws in C programs.
Detection of Intrusions and Malware & Vulnerability Assessment, pages 1–16, 2006.
[7] R. Chinchani, A. Iyer, B. Jayaraman, and S. Upadhyaya. ARCHERR: Runtime environment driven program safety. Computer Security–ESORICS 2004, pages 385–406, 2004.
[8] M. Costa, M. Castro, L. Zhou, L. Zhang, and M. Peinado. Bouncer: securing software by blocking bad input. In Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP ’07. ACM, 2007.
[9] M. Costa, J. Crowcroft, M. Castro, A. Rowstron, L. Zhou, L. Zhang, and P. Barham. Vigilante: end-to-end containment of internet worms. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05. ACM, 2005.
[10] W. Cui, M. Peinado, and H. J. Wang. ShieldGen: Automatic data patch generation for unknown vulnerabilities with informed probing. In Proceedings of the 2007 IEEE Symposium on Security and Privacy. IEEE Computer Society, 2007.
[11] W. Dietz, P. Li, J. Regehr, and V. Adve. Understanding integer overflow in C/C++. In Proceedings of the 2012 International Conference on Software Engineering, pages 760–770. IEEE Press, 2012.
[12] C. Flanagan and J. B. Saxe. Avoiding exponential explosion: generating compact verification conditions. In Proceedings of the 28th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’01, pages 193–205, New York, NY, USA, 2001. ACM.
[13] V. Ganesh, T. Leek, and M. Rinard. Taint-based directed whitebox fuzzing. In ICSE ’09: Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009.
[14] D. Gao, M. K. Reiter, and D. Song. On gray-box program tracking for anomaly detection. In Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, SSYM ’04. USENIX Association, 2004.
[15] P. Godefroid, N. Klarlund, and K. Sen. DART: directed automated random testing.
In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 213–223, New York, NY, USA, 2005. ACM.
[16] C. Kruegel and G. Vigna. Anomaly detection of web-based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS ’03. ACM, 2003.
[17] C. Lattner, A. Lenharth, and V. Adve. Making context-sensitive points-to analysis with heap cloning practical for the real world. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pages 278–289, New York, NY, USA, 2007. ACM.
[18] D. LeBlanc. Integer handling with the C++ SafeInt class. http://msdn.microsoft.com/en-us/library/ms972705, 2004.
[19] F. Long, V. Ganesh, M. Carbin, S. Sidiroglou, and M. Rinard. Automatic input rectification. ICSE ’12, 2012.
[20] F. Long, V. Ganesh, M. Carbin, S. Sidiroglou, and M. Rinard. Automatic input rectification. In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 80–90, Piscataway, NJ, USA, 2012. IEEE Press.
[21] R. Madhavan and R. Komondoor. Null dereference verification via over-approximated weakest pre-conditions analysis. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’11, pages 1033–1052, New York, NY, USA, 2011. ACM.
[22] D. Molnar, X. C. Li, and D. A. Wagner. Dynamic test generation to find integer bugs in x86 binary Linux programs. USENIX Security ’09.
[23] D. Mutz, F. Valeur, C. Kruegel, and G. Vigna. Anomalous system call detection. ACM Transactions on Information and System Security, 9, 2006.
[24] J. Newsome, D. Brumley, and D. X. Song. Vulnerability-specific execution filtering for exploit prevention on commodity software. In NDSS, 2006.
[25] R. Perdisci, W. Lee, and N. Feamster.
Behavioral clustering of HTTP-based malware and signature generation using malicious network traces. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI ’10. USENIX Association, 2010.
[26] W. Robertson, G. Vigna, C. Kruegel, and R. A. Kemmerer. Using generalization and characterization techniques in the anomaly-based detection of web attacks. In Proceedings of the 13th Symposium on Network and Distributed System Security (NDSS), 2006.
[27] D. Sarkar, M. Jagannathan, J. Thiagarajan, and R. Venkatapathy. Flow-insensitive static analysis for detecting integer anomalies in programs. In IASTED, 2007.
[28] R. Seacord. The CERT C Secure Coding Standard. Addison-Wesley Professional, 2008.
[29] W. Tielei, W. Tao, L. Zhiqiang, and Z. Wei. IntScope: Automatically Detecting Integer Overflow Vulnerability in X86 Binary Using Symbolic Execution. In 16th Annual Network & Distributed System Security Symposium, 2009.
[30] F. Valeur, D. Mutz, and G. Vigna. A learning-based approach to the detection of SQL attacks. In DIMVA 2005, 2005.
[31] K. Wang and S. J. Stolfo. Anomalous payload-based network intrusion detection. In RAID, 2004.
[32] X. Wang, H. Chen, Z. Jia, N. Zeldovich, and M. Kaashoek. Improving integer security for systems with KINT. In OSDI. USENIX Association, 2012.
[33] X. Wang, Z. Li, J. Xu, M. K. Reiter, C. Kil, and J. Y. Choi. Packet vaccine: black-box exploit detection and signature generation. CCS ’06. ACM, 2006.
[34] C. Zhang, T. Wang, T. Wei, Y. Chen, and W. Zou. IntPatch: Automatically fix integer-overflow-to-buffer-overflow vulnerability at compile-time. Computer Security–ESORICS 2010, pages 71–86, 2010.

A.
Proof Sketch of the Relationship between the Original Semantics and the Abstract Semantics

A.1 The Alias Analyses and the Abstract Semantics

In order to prove the above relationship between the original semantics and the abstract semantics, we introduce the following lemma that states the property between alias relationships and the abstract semantics.

Lemma 5. Given an execution trace ⟨s0, σ̄0, ς̄0, h̄0⟩ →a ⟨s1, σ̄1, ς̄1, h̄1⟩ →a ··· in the abstract semantics, we have ∀i, first(si) = l, left(si) = "l: ∗p = x", ∀j, i < j, ... The proof proceeds by induction on n. Consider the case n > 0. If first(sn−1) ∉ StoreLabel, then based on the small step rules of the semantics, h̄n−1 = h̄n. It is straightforward to apply the induction rule to prove the condition. If first(sn−1) ∈ StoreLabel and sn−1 = "l: ∗p′ = x′", based on the induction, what we need to prove is the case where j = n: ∀i, first(si) = l, left(si) = "l: ∗p = x", i < n, ... For n > 0, we already have σ̄0, ..., σ̄n−1, ς̄0, ..., ς̄n−1, h̄0, ..., h̄n−1 that satisfy the conditions by the induction rule. We first construct σ̄n, ς̄n, and h̄n using the corresponding small step rule in the abstract semantics, and then prove that the construction satisfies the conditions.

Condition 1: If first(sn−1) ∉ LoadLabel, the proof is straightforward. For example, if first(sn−1) = l, left(sn−1) = "l: x = y op z", based on the small step rule of the original semantics we know that:

σn = σn−1[x → σn−1(y) op σn−1(z)]
ςn = ςn−1[x → ςn−1(y) ∨ ςn−1(z) ∨ overflow(σn−1(y), σn−1(z), op)]

and we can construct using the corresponding rule in the abstract semantics:

σ̄n = σ̄n−1[x → σ̄n−1(y) op σ̄n−1(z)]
ς̄n = ς̄n−1[x → ς̄n−1(y) ∨ ς̄n−1(z) ∨ overflow(σ̄n−1(y), σ̄n−1(z), op)]

By the induction rule we have ∀v ∈ Var, σ̄n−1(v) ∈ Int → σ̄n−1(v) = σn−1(v), so it is easy to show that:

∀v ∈ Var, v ≠ x, σ̄n(v) ∈ Int → σ̄n(v) = σn(v)

Also consider

σ̄n(x) ∈ Int → (σ̄n−1(y) ∈ Int ∧ σ̄n−1(z) ∈ Int) → (σ̄n−1(y) = σn−1(y) ∧ σ̄n−1(z) = σn−1(z)) → ((σ̄n−1(y) op σ̄n−1(z)) = (σn−1(y) op σn−1(z))) → (σ̄n(x) = σn(x))

We can do the proof similarly for ς and ς̄. Therefore Condition 1 holds.
If first(sn−1) ∈ LoadLabel and left(sn−1) = "l: x = ∗p", based on the semantic rules of the load statement, we know that:

σn = σn−1[x → ρn−1(σn−1(p))]
ςn = ςn−1[x → ρ̄n−1(σn−1(p))]

From the induction rule of Condition 2, we know that: ρn−1(σn−1(p)) ∈ Int → (ρn−1(σn−1(p)), ρ̄n−1(σn−1(p))) ∈ h̄(l). Therefore, if ρn−1(σn−1(p)) ∈ Int, it is possible to construct σ̄n and ς̄n as follows based on the abstract semantic rule of the load statement:

σ̄n = σ̄n−1[x → ρn−1(σn−1(p))]
ς̄n = ς̄n−1[x → ρ̄n−1(σn−1(p))]

From the induction rule of Condition 1, we know that σ̄n−1 = σn−1 and ς̄n−1 = ςn−1. These facts together are enough to show that Condition 1 holds.

Condition 2: If first(sn) ∉ LoadLabel, the proof is trivial by using the induction rule. Next we are going to sketch the proof of Condition 2 when first(sn) ∈ LoadLabel and left(sn) = "l: x = ∗p". We try to find a program state ⟨sm, σm, ρm, ςm, ρ̄m⟩ in prior execution steps, such that m < n, first(sm) = l′, left(sm) = "l′: ∗p′ = x′", σm(p′) = σn(p), and ∀m

(e.g., t0, t1, ..., t123, ...), native registers from the source architecture of an analyzed program can also show up as operands of REIL instructions. This does not limit the platform independence of REIL code, as REIL registers and source architecture registers are treated completely uniformly in REIL analysis algorithms. Native registers are simply used in REIL code to make it easier to relate the results of an analysis algorithm back to the original input code. The last REIL instruction operand type is the subaddress. Operands of this type are comparable to integer literals, but instead of integral values these operands always hold addresses of REIL instructions. Furthermore, this operand type can only appear as the third operand of JCC (conditional jump) instructions. Operands of this type are only generated when an original native assembly instruction is translated into a series of REIL instructions that contain branches arising from decisions or loops.
Examples for such instructions are the prefixed string operations (rep stos, ...) of the x86 instruction set, which are translated to REIL instructions that form a loop. Except for their type and their value, REIL operands are characterized by their size. This size is equal to the maximum size of the operand value. REIL operand sizes have names like b1, b2, b4, and so on, meaning that the size of the operand is 1 byte, 2 bytes, and 4 bytes respectively. For example, the integer literal operand 0x17/b2 is really two bytes large and could also be represented as 0x0017/b2, while the size of the register t0 in the operand t0/b4 is 32 bits. In addition to its operands, all REIL instructions can come with so-called meta-data. This meta-data is simply a map of key-value pairs that gives additional information about an instruction that is probably important during static code analysis. (Footnote 2: The one exception is the jump instruction JCC, where the third operand is the jump target.) In general, the number of pieces of meta-data associated with an instruction is not limited, but in practice most REIL instructions do not have any meta-data associated with them at all. In the current version of REIL there is only one kind of meta-data. Jump instructions that were generated during the translation of a subfunction call (like call on the x86 CPU or bl on PowerPC) are specifically marked with the key isCall and the value true. This is necessary because subfunction calls need to be treated very differently than conditional jumps during many static code analysis algorithms. The 17 different REIL instructions can be grouped into a few different instruction groups. The biggest group is formed by the arithmetic instructions, such as addition and subtraction.
Then there are the bitwise instructions that perform operations like bitwise OR and AND; the conditional instructions that are used to compare values and jump according to the result of the comparison; the data transfer instructions that access REIL memory and transfer the content of registers; and the remaining instructions, which do not really fall into any group.

2.1 The arithmetic instructions

With six members, the group of arithmetic instructions covers more than one third of the total instructions of the REIL instruction set.

•ADD: Addition of two values
•SUB: Subtraction of two values
•MUL: Unsigned multiplication of two values
•DIV: Unsigned division of two values
•MOD: Unsigned modulo of two values
•BSH: Logical shift operation

ADD and SUB work exactly like standard addition and subtraction on most platforms. The multiplicative instructions MUL, DIV, and MOD interpret all of their input operands in an unsigned way. The REIL instruction set does not contain signed counterparts of these instructions because signed multiplication and division can easily be simulated in terms of unsigned multiplication and division. The logical shift operation can be used either as a left shift or a right shift, depending on the sign of its second operand. If the second operand is positive, the shift operation is a left shift. If it is negative, the shift operation is a right shift. Arithmetic shifts do not exist in the REIL instruction set because arithmetic shifts can easily be simulated with the help of logical shifts. As in the case of the multiplicative instructions, keeping the REIL instruction set small was more important than adding the convenience of having more expressive REIL translations. Figure 1 shows examples of all arithmetic instructions. The structure of all arithmetic instructions is the same. The first two operands are the input operands of the operation while the third operand is the output operand where the result of the operation is stored.
The order of the input operands is the natural order generally used when writing the operations in infix notation on paper or in the source code of computer programs. For example, the first operand of the SUB operation is the minuend while the second operand is the subtrahend. In the DIV operation the first operand is the dividend and the second operand is the divisor.

ADD t0/b4, t1/b4, t2/b8
SUB t7/b4, t9/b4, t12/b8
MUL t8/b4, 4/b4, t9/b8
DIV 4000/b4, t2/b4, t3/b4
MOD t8/b4, 8/b4, t4/b4
BSH t1/b4, 2/b4, t2/b8

Figure 1: Examples of the arithmetic REIL instructions

Another important aspect of REIL is also first shown in Figure 1: potential overflows in the results of operations are handled explicitly. If an operation can overflow, the output operand must be large enough to store the whole result including the overflow. This is why the output operands of the example instructions are twice as large as their input operands (see footnote 3). The two exceptions are the output operands of the DIV and MOD instructions. Since the results of these operations can never be larger than the first input operand, an extension of the size of the output operand is not necessary; the output operand has the size of the input operand instead.

The explicit handling of overflows is an important difference from real architectures, where overflows produced by operations are nearly always cut off because of the fixed size of native CPU registers. This explicit overflow handling is what enables REIL algorithms to analyze the results of operations in greater detail when the exact overflowing value of a register is important, instead of merely having a flag that signals that an operation produced an overflow.

2.2 The bitwise instructions

The next biggest instruction group is formed by the three bitwise instructions.
•AND: Bitwise AND of two values
•OR: Bitwise OR of two values
•XOR: Bitwise XOR of two values

The three bitwise instructions work exactly as one expects bitwise instructions to work. Bit for bit, they combine the bits of the two input operands according to the truth table defined for their operation. The calculated value is then written to the output operand of the instruction.

A bitwise NOT instruction is not part of the REIL instruction set because NOT is equivalent to XOR-ing a value with a value of equal size that has all bits set. That means that to calculate the one's complement of the 16-bit value 0x1234 one would XOR it with the 16-bit value 0xFFFF.

(Footnote 3: The result operand of addition and subtraction is technically too large, because these operations performed on two 32-bit values can only overflow into the 33rd bit; however, there is no 33-bit REIL operand size, so the next biggest operand size (64 bits) was chosen.)

AND t0/b4, t1/b4, t2/b4
OR t7/b4, t9/b4, t12/b4
XOR t8/b4, 4/b4, t9/b4

Figure 2: Examples of the bitwise REIL instructions

Figure 2 shows examples of all bitwise instructions. Their general structure equals that of the arithmetic instructions: like them, bitwise instructions take two input operands and store the result of the operation in the output operand. One important difference is that none of the bitwise instructions produce an overflow. Explicit modeling of overflowing values and an extension of the size of the output operand are therefore not necessary.

2.3 Data transfer instructions

To access the REIL memory, two different REIL instructions are needed: one is used for loading a value of arbitrary size from the REIL memory, while the other one is used to store a value of arbitrary size to the REIL memory. Furthermore, this group of instructions contains an instruction that is used to transfer values into registers.
•LDM: Load a value from memory
•STM: Store a value to memory
•STR: Store a value in a register

The first operand of the LDM instruction contains the address of the REIL memory where the value is loaded from. This operand can be either an integer literal or a register. When the instruction is executed, it loads the value from the given memory address and stores it in the third operand of the instruction. The size of the value that is loaded from memory equals the size of the third operand: if the third operand is a 32-bit register, a 32-bit value is loaded from memory. As the loaded value is written to the third operand, the third operand must be a register.

The store operation STM is the inverse of the load operation LDM. It can be used to store a value of arbitrary size to memory. The first operand of the STM instruction is the value to be stored in memory; its size determines how many bytes are written to memory when the STM instruction is executed. The third operand specifies the address where the value of the first operand is written to. Both operands can be either integer literals or registers. The second operand is unused.

The STR instruction is one of the simplest instructions of the REIL instruction set. It copies a value to the output register specified in the instruction. The input operand can be either a literal (to load a register with a constant) or another register (to transfer the contents of one register to another register).

LDM 413800/b4, , t1/b2
STR t1/b2, , t2/b2
STM t2/b2, , 415280/b4

Figure 3: Examples of the data transfer REIL instructions

Figure 3 shows a sequence of data transfer instructions that load a value from memory, copy it to another register, and store it back to another address in memory. Since the size of the output register of an LDM instruction specifies how many bytes are loaded from memory, it is clear that two bytes are loaded from memory.
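The semantics of these instructions can be sketched with a small byte-addressable memory model. This is our own illustration: the class and method names are invented, and REIL itself leaves byte order to the translators (see the endianness discussion in section 3), so the little-endian layout below is an assumption.

```python
# Toy flat REIL memory; STM/LDM are modeled with an assumed little-endian layout.
class ReilMemory:
    def __init__(self):
        self.data = {}                        # sparse memory starting at address 0

    def stm(self, value, size_bytes, address):
        """STM: the first operand (value) is written to memory at the address."""
        for i in range(size_bytes):
            self.data[address + i] = (value >> (8 * i)) & 0xFF

    def ldm(self, address, size_bytes):
        """LDM: the output operand's size decides how many bytes are read."""
        return sum(self.data.get(address + i, 0) << (8 * i)
                   for i in range(size_bytes))

# The Figure 3 sequence: load two bytes, copy them (STR), store them back.
mem = ReilMemory()
mem.stm(0xBEEF, 2, 0x413800)     # pretend 0xBEEF already lives at 413800
t1 = mem.ldm(0x413800, 2)        # LDM 413800/b4, , t1/b2
t2 = t1                          # STR t1/b2, , t2/b2
mem.stm(t2, 2, 0x415280)         # STM t2/b2, , 415280/b4
assert mem.ldm(0x415280, 2) == 0xBEEF
```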
The sizes of the two used operands of an STR instruction are typically the same, as STR only copies a value. In the end the two-byte register t2 is stored back to memory.

2.4 Conditional instructions

The group of conditional instructions is used to compare values and, depending on the result of the comparison, to jump to one REIL instruction or another.

•BISZ: Compare a value to zero
•JCC: Conditional jump

The BISZ instruction is the only instruction of the REIL instruction set that can be used to compare values. In fact, it can only compare a single value to zero, but this is sufficient to emulate any kind of more complex comparison. The BISZ instruction takes a single operand and compares it to zero; depending on the value of the input operand, the output operand is set to 0 (if the value of the input operand was not 0) or 1 (if the value of the input operand was 0).

The conditional jump instruction JCC is typically used to process the result of a BISZ instruction. If the first operand of the JCC instruction evaluates to 0, the jump is not taken. If the first operand evaluates to any value other than zero, the jump is taken and control is transferred to the address (or sub-address) specified in the third operand. An unconditional jump is not part of the REIL instruction set because it can be emulated with a conditional jump by setting the first operand of the conditional jump to the integer literal 1 (or any other non-zero integer literal).

BISZ t0/b4, , t1/b1
JCC t1/b1, , 401000/b4

Figure 4: Examples of the conditional REIL instructions

Figure 4 shows a typical sequence of a single BISZ instruction followed by a JCC instruction that uses the output of the BISZ instruction to determine whether to take a jump to the address specified in its third operand. Since the output of a BISZ instruction is always either 0 or 1, the size of the output operand of BISZ instructions is always b1.
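One way BISZ and JCC can emulate a richer comparison can be sketched in Python. This is our own decomposition of an equality test into SUB, BISZ, and JCC, not actual translator output, and the function names are invented:

```python
# Emulating "jump if a == b" with SUB + BISZ + JCC, as a toy interpreter might.
def bisz(value: int) -> int:
    return 1 if value == 0 else 0          # BISZ: output operand is always b1

def jump_if_equal(a: int, b: int) -> bool:
    diff = (a - b) & ((1 << 64) - 1)       # SUB a, b, t0 (double-width result)
    cond = bisz(diff)                      # BISZ t0, , t1
    return cond != 0                       # JCC t1, , target: taken iff nonzero

assert jump_if_equal(5, 5) is True
assert jump_if_equal(5, 7) is False
```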
2.5 Other instructions

There are a few other instructions which do not really belong to any group at all.

•UNDEF: Undefines a value
•UNKN: Unknown source instruction
•NOP: No operation

The UNDEF instruction undefines the value of a register. This means that once the UNDEF instruction is executed, the value inside the undefined register is unknown. This is important because there are native assembly instructions which leave registers or flags in an undefined state. The x86 instruction DIV, for example, leaves a number of flags like the zero flag and the carry flag in an undefined state.

The UNKN instruction is a kind of placeholder instruction. It indicates that during the REIL code generation an original assembly instruction was encountered that could not be translated.

The NOP instruction does nothing. Nevertheless it is not useless: REIL translators can generate this instruction to pad control flow in certain edge cases. In a few situations this is very useful because it keeps REIL translators very simple. Without the NOP instruction, a REIL translator would have to look ahead to the next native instruction to generate correct REIL code (see footnote 4).

UNDEF , , t1/b4
UNKN , ,
NOP , ,

Figure 5: Examples of other REIL instructions

Figure 5 shows examples of the remaining REIL instructions. The only instruction that takes operands is the UNDEF instruction, which undefines a register.

3. THE REIL ARCHITECTURE

The definition of the REIL language includes the description of the REIL architecture and the definition of a virtual machine that can be used to execute the generated REIL code.

The REIL architecture is a simple register-based architecture without an explicit stack. The number of registers available in REIL code is unlimited. As previously explained, the names of REIL registers have the form tn. The index number of register names is unbounded.
Furthermore, there is no requirement that all REIL registers between t0 and t(n-1) are used by a given program that uses n different registers. A program that uses exactly three REIL registers can use t7, t799, and t3199 if desired.

REIL registers themselves do not have a fixed or limited width. The size of a REIL register is always equal to the size of the operands where it is used, and can even change between instructions: in one instruction register tn can have size bs while in another instruction it can have size bt. Since operands can grow arbitrarily large, REIL registers can also grow arbitrarily large. In practice we have not yet encountered registers with more than 128 bits (equivalent to b16), though.

We already mentioned that registers of the original input code can appear in REIL code. In fact, the registers of the original architecture will always appear in REIL code to make it possible to port results of REIL code analysis back to the original code. This does not violate the platform-independent nature of REIL code. REIL registers and native registers can be mixed at will and treated completely uniformly; while analyzing REIL code there is no difference between the registers t0, t1, and t2 and the registers eax, ebx, and ecx. At the end of an analysis algorithm one can then easily distinguish between the REIL registers (which have the tn form) and the native registers (which do not) to port the values of relevant registers back to the original assembly code.

(Footnote 4: Technically, the NOP instruction could of course be replaced by an instruction like ADD 0, 0, tn that also has no discernible effect on the program state.)

The memory of the virtual REIL machine follows a flat memory model. Unlike some real CPUs like the x86, which has memory segments (in real mode) or at least memory selectors (in protected mode), REIL memory starts at address 0 and can grow arbitrarily large.
While there is technically an infinite amount of storage available in REIL memory, practical concerns of the source architecture limit the memory used in practice. If the source assembly language (like 32-bit x86 assembly) can only address 4 GB of memory, only 4 GB of REIL memory will ever be accessed in REIL code created from x86 programs. REIL memory above the addressable memory range of the source architecture is never used.

Due to the flat memory model of the REIL memory, segmented memory access of native architectures must be simulated in REIL programs if necessary. This can be done by creating virtual segments which represent the memory segments of the native architecture. Since REIL memory is not limited in size, there is enough space available to make these virtual segments non-overlapping, meaning that memory access through one segment of the native architecture never interferes with memory access through another segment of the native architecture.

The endianness of the source architecture must also be considered when accessing REIL memory. On native architectures, endianness falls into two categories: in some cases (like x86) the architecture has a fixed endianness that cannot be changed at runtime, while other architectures (PowerPC, ARM) can switch the endianness of their memory accesses at runtime by executing a special instruction. In general, REIL does not have any mechanism to deal with endianness. All endianness issues must be handled by the REIL translators when generating the REIL instructions that access memory. This poses a problem when endianness is switched at runtime, because REIL code is generated in advance and cannot be updated anymore when the endianness switch happens. However, the rarity of endianness switching makes this a special situation that is seldom relevant for security audits.
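How a translator might bake endianness into the generated memory accesses can be sketched as follows. The decomposition of a multi-byte store into per-byte stores is our own illustration of the idea, with invented names; it is not actual translator output:

```python
# Byte order is fixed at translation time: the same 2-byte store yields
# differently ordered per-byte stores depending on the source architecture.
def split_store(value, address, size_bytes, big_endian):
    body = [(value >> (8 * i)) & 0xFF for i in range(size_bytes)]  # LE order
    if big_endian:
        body.reverse()
    return [(address + i, byte) for i, byte in enumerate(body)]

# x86-style (little-endian) vs. big-endian placement of 0x1234 at 0x1000
assert split_store(0x1234, 0x1000, 2, big_endian=False) == [(0x1000, 0x34), (0x1001, 0x12)]
assert split_store(0x1234, 0x1000, 2, big_endian=True) == [(0x1000, 0x12), (0x1001, 0x34)]
```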
After the REIL memory and the REIL registers are given an initial state, REIL code can be analyzed or even executed. Execution of REIL code happens just like program execution on a real CPU. Starting with the value in the program counter register, REIL code is executed. (There is no special REIL program counter register; rather, the program counter register of the input architecture is used. This is important to make sure that at each step of the REIL code analysis, the program counter register has the same value it would have during a real execution of the program on the source platform.) The REIL instruction at the position of the current program counter is fetched and interpreted with regard to the current state of the REIL register bank and the REIL memory. Once interpretation is complete, the REIL register bank and the REIL memory are updated to reflect the effects of the instruction on the global state.

4. TRANSLATING NATIVE CODE TO REIL

The translation of native assembly code to REIL code is straightforward. For each supported native assembly language there is a so-called REIL translator. This REIL translator takes a piece of native assembly code and translates it to REIL code. Iterating linearly over all instructions in a piece of input code, the translator translates each instruction to REIL code independently. The REIL translator does not look ahead to see what instruction follows the current instruction, and it does not require information generated during the translation of previous instructions. This statelessness of the translation makes REIL translators very simple. In fact, REIL translators are nothing but glorified maps that repeatedly map a single native instruction to a list of REIL instructions.

Due to the simplicity of REIL instructions and what they can do in one step, a single native assembly instruction is nearly always translated to many REIL instructions.
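The "glorified map" character of a translator can be sketched as follows. The per-mnemonic rules below are simplified inventions for illustration only; a real x86 translator emits considerably more REIL code per instruction, for example for all affected flags:

```python
# A stateless toy translator: each native instruction maps independently
# to a list of REIL instructions; no look-ahead, no translation history.
def translate_insn(insn):
    mnemonic, _, rest = insn.partition(" ")
    operands = [o.strip() for o in rest.split(",")] if rest else []
    rules = {
        "xor": lambda ops: [f"XOR {ops[0]}, {ops[1]}, t0",
                            f"STR t0, , {ops[0]}",
                            f"BISZ t0, , ZF"],     # one flag effect made explicit
        "nop": lambda ops: ["NOP , ,"],
    }
    return rules[mnemonic](operands)

def translate(code):
    reil = []
    for insn in code:
        reil.extend(translate_insn(insn))          # purely linear iteration
    return reil

assert translate(["xor eax, ebx", "nop"]) == [
    "XOR eax, ebx, t0", "STR t0, , eax", "BISZ t0, , ZF", "NOP , ,"]
```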
Experimental results have shown that, on average, an original instruction is translated into approximately 20 REIL instructions, while the most complex native instruction we found in practice was translated to more than 50 REIL instructions.

This one-to-many relation between native instructions and REIL instructions unfortunately destroys the direct correspondence between the address of a native assembly instruction and the addresses of the REIL instructions created for it. Having such a correspondence would be most desirable because it would make it significantly simpler to port the results of a REIL analysis algorithm back to the original assembly code. To solve this problem, the addresses of REIL instructions are shifted to the left by 8 bits (or multiplied by 0x100). This means that the first REIL instruction that corresponds to the native assembly instruction at offset n has the offset 0x100 * n, the second REIL instruction has the offset 0x100 * n + 1, and so on. This address translation limits the translation of a single native instruction to at most 256 REIL instructions. Should it ever happen that more than 256 REIL instructions are generated for a single native instruction, the addresses of the REIL instructions would overflow into the addresses of the REIL instructions of the following native instruction.

5. LIMITATIONS OF REIL

There are a number of more or less significant issues that might limit the use of REIL in practice. Some of these limitations are built into the REIL language itself, while others exist simply because we have not yet had time to implement certain aspects of native architectures.

The first limitation is that the REIL translators we have so far (32-bit x86, 32-bit PowerPC, and 32-bit ARM) are unable to translate certain classes of instructions. For example, none of the translators can translate FPU instructions.
CPU extensions like the MMX and SSE extensions of x86 CPUs are also not translated yet. We have chosen to skip the translation of these instructions because REIL is supposed to be a language for analyzing assembly code for security-critical bugs and vulnerabilities, and FPU, MMX, and SSE instructions are only very rarely involved in these kinds of flaws. Should FPU bugs or other CPU extension bugs become popular targets of software exploits in the future, we can easily extend our existing translators to handle these instructions.

Like FPU instructions, privileged instructions such as system calls, interrupts, and other kernel-level instructions are not translated by our current REIL translators. The justification for the lack of support for these instructions follows along the lines of the lack of FPU support: in our initial implementation of REIL we wanted to focus on the instructions that are most often involved in security-relevant software flaws. Depending on the exact effects of the missing privileged instructions, adding them to the REIL language might be anywhere from trivial to impossible. An instruction that has significant low-level effects on the underlying hardware, for example one that flushes the CPU cache, will never be part of REIL, for this would mean a complete loss of platform independence and/or a big increase in the number of instruction mnemonics. Other privileged instructions like interrupt execution can often be simulated using the features REIL already has.

REIL also cannot deal with exceptions in a platform-independent way. This means that at this point exceptions and the corresponding stack unwinding cannot be handled by REIL. Due to the lack of exception handling, common situations that throw exceptions (dividing by zero, hitting a breakpoint, ...) are simply ignored in the default REIL interpreter.

The next limitation is that REIL cannot handle self-modifying code of any kind.
This is simply because native code is pre-translated instruction by instruction for a native function, and the resulting REIL code is fixed after the initial translation. REIL instructions themselves do not reside in the REIL memory; they can therefore not be overwritten and modified during the interpretation of REIL code.

6. THE FUTURE OF REIL

The first and foremost goal of the next few months is to write more REIL translators (for example, to translate MIPS code) and to implement more REIL-based code analysis algorithms. Additionally, we have a few minor ideas about improving the quality of generated REIL code and its usefulness in static code analysis.

The first idea is the introduction of a bit-sized operand type b0. Right now the smallest operand type is the byte-sized operand b1. During bit-width analysis it might be useful to know that an operand that has size b1 in the current code uses no bits other than its least significant bit. Extending on this idea, it might be smarter to give the size of operands in bits instead of bytes.

An idea that could improve the correctness of REIL translation and of certain analysis algorithms is the introduction of two additional instructions, extend and reduce. The motivation for these two instructions is simple. Right now there are no limitations on how operand sizes can be combined in one instruction: when generating an ADD instruction, one input operand can have size b1 while the second input operand has size b4. A rule specifying that the input operands of all instructions must be of equal size would make REIL code more regular for analysis, and certain bug classes in REIL translators could be checked for automatically. The role of the extend instruction would be to extend a value of a smaller size like b1 to a larger size like b2 or b4 while keeping the value of the extended register the same. The reduce instruction would be the opposite of the extend instruction.
Reduce would shrink an operand to a smaller operand size. In this case it cannot be guaranteed that the value of the reduced register equals the value of the original register; in many situations overflowing high bits will be truncated and lost. This is perfectly acceptable, though, because such truncation is already used in many different situations, for example when writing the 33-bit-wide result of an addition of two 32-bit values back to a 32-bit register. Right now this truncation is done using an AND instruction; in the future the reduce instruction might make things semantically clearer.

The number of operand types might also be increased in the future. As soon as FPU instructions are supported by the REIL translators, it will be necessary to add single-precision and double-precision FPU operands. Another example concerns certain architectures like PowerPC, where registers can be addressed not by name but by an index into the register bank. These instructions cannot be translated to REIL yet because REIL does not know an operand type like "register index".

7. RELATED WORK

The use of intermediate languages for code analysis is nothing new. In fact, all serious compilers use some kind of intermediate language during the optimization phase of their code generation (see GCC, for example). Creating intermediate representations of disassembled assembly code in the context of security analysis is not nearly as widespread. Nevertheless, there are a few noteworthy approaches.

At the hack.lu conference 2008, Mihai Chiriac of the anti-virus software company BitDefender presented an intermediate language that he used to speed up the emulation of obfuscated malware programs [1]. The intermediate language he presented is structurally close to REIL: like REIL, his language has a very reduced instruction set where every instruction has exactly one effect on the global state.
Furthermore, his virtual architecture has an infinite number of virtual registers and a fully emulated memory.

An open-source implementation of an intermediate language specifically made for reverse engineering and statically analyzing binary code is the ELIR language of the ERESI project (http://www.eresi-project.org). Like REIL, the goal of the ELIR intermediate language is simplified platform-independent reasoning about assembly code by providing an intermediate language that makes the effects of all native assembly operations explicit. An overview of the ELIR language was given in Julien Vanegue's EKOPARTY 2008 talk "Static binary analysis with a domain specific language" [2].

A commercial use of intermediate language recovery from disassembled code in the context of security analysis is IDA Pro and Hex-Rays. IDA Pro is the industry-standard disassembler for many platforms, and Hex-Rays is a decompiler plugin for IDA Pro. The Hex-Rays decompiler uses an intermediate language representation (IR) of the underlying disassembled code to analyze and optimize the disassembled code and to decompile it into a C-style high-level language. As shown in Ilfak Guilfanov's Black Hat 2008 presentation "Decompilers and Beyond" [3][4], the intermediate representation used by Hex-Rays is significantly different from REIL. There are more instructions in the Hex-Rays IR, and they do not obey the single-responsibility rule for avoiding side effects. Other differences include the distinction between integer literals and pointers to code, which is present in the Hex-Rays IR but not in REIL, and features like the option to address basic blocks instead of addresses in jump instructions. Another striking difference, visible directly when looking at code snippets of REIL and the Hex-Rays IR, is that REIL uses far more temporary registers to translate a typical piece of code.
Another implementation of an intermediate language was created by GrammaTech in their CodeSurfer/x86 product. While it is not publicly available at this point, several whitepapers have been released about CodeSurfer/x86 (for example, see [5] or [6]). Unfortunately, these whitepapers focus on the results of certain static analysis algorithms implemented with CodeSurfer/x86 rather than on its intermediate language, so it is unclear at this point how similar that language is to REIL.

As part of AbsInt, an analysis framework specifically suited for statically analyzing embedded system code, Saarland University developed the intermediate language CRL2. Like REIL, CRL2 is generated by transforming the assembly code of a disassembled input program. Nevertheless, the similarities to REIL end at this point: CRL2 was specifically developed for detailed control flow analysis, and as a result CRL2 code is very complex due to a large number of annotations that are relevant for control flow. Examples of generated CRL2 code can be found at [7].

8. CONCLUSIONS

Using the information presented in this paper it is possible to write a complete implementation of the Reverse Engineering Intermediate Language REIL that can be used for static code analysis of disassembled assembly code. We have already created a commercial implementation of REIL in our product BinNavi and have successfully written several simple static code analysis algorithms. Thanks to REIL, these algorithms work platform-independently on x86, PowerPC, and ARM code.

9. REFERENCES

[1] Mihai G. Chiriac. Anti Virus 2.0 - Compilers in disguise. hack.lu, October 2008.
[2] Julien Vanegue. Static binary analysis with a domain specific language. EKOPARTY 2008, October 2008.
[3] Ilfak Guilfanov. Decompilers and beyond. Black Hat USA 2008, August 2008.
[4] Ilfak Guilfanov. Decompilers and beyond - Whitepaper. Black Hat USA 2008, August 2008.
[5] Gogul Balakrishnan, Radu Gruian, Thomas Reps, and Tim Teitelbaum.
CodeSurfer/x86 - a platform for analyzing x86 executables. In Lecture Notes in Computer Science, pages 250-254. Springer, 2005.
[6] T. Reps, G. Balakrishnan, J. Lim, and T. Teitelbaum. A next-generation platform for analyzing executables. In APLAS, pages 212-229, 2005.
[7] AbsInt Angewandte Informatik GmbH. CRL Version 2 Manual.

Recovering C++ Objects From Binaries Using Inter-Procedural Data-Flow Analysis

Wesley Jin, CMU, wesleyj@andrew.cmu.edu
Cory Cohen, CERT, cfc@cert.org
Jeffrey Gennari, CERT, jsg@cert.org
Charles Hines, CERT, hines@cert.org
Sagar Chaki, SEI, chaki@sei.cmu.edu
Arie Gurfinkel, SEI, arie@sei.cmu.edu
Jeffrey Havrilla, CERT, jsh@cert.org
Priya Narasimhan, CMU, priya@cs.cmu.edu

Abstract

Object-oriented programming complicates the already difficult task of reverse engineering software, and is being used increasingly by malware authors. Unlike with traditional procedural-style code, reverse engineers must understand the complex interactions between object-oriented methods and the shared data structures with which they operate, a tedious manual process.

In this paper, we present a static approach that uses symbolic execution and inter-procedural data flow analysis to discover object instances, data members, and methods of a common class. The key idea behind our work is to track the propagation and usage of a unique object instance reference, called a this pointer. Our goal is to help malware reverse engineers understand how classes are laid out and to identify their methods. We have implemented our approach in a tool called OBJDIGGER, which produced encouraging results when validated on real-world malware samples.

1. Introduction

As malware grows in sophistication, analysts and reverse engineers are increasingly encountering samples written in code following the object-oriented (OO) programming model. For those tasked with analyzing these programs, recovering class information is an essential but painstaking process.
Analysts are often forced to resort to slow, manual analysis of a large number of methods and data structures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. PPREW '14, January 25, 2014, San Diego, CA, US. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2649-0/14/01... $15.00. http://dx.doi.org/10.1145/2556464.2556465

Programs that follow a traditional, procedural programming model are typically arranged around functions with well-defined boundaries, inputs, and outputs. They use structures that support limited operations with simple relationships (e.g., C-style structs). The clear relationships between procedures make it relatively easy to recover control and data flow even after compilation.

Conversely, the OO programming model is organized around data structures (i.e., C++ objects) with complex relationships and interactions. While C++ data structures are easily recognizable in source code, compilation hides them behind sets of methods with no obvious organization or relevance to one another. Therefore, to reverse engineer an OO program, analysts must: (1) determine the related methods that belong to the same class; and (2) understand how they interact.
To facilitate the recovery of object structures and relationships, we have developed an approach that leverages the use of the "this pointer" (hereafter ThisPtr), a reference assigned to each unique (up to allocation instruction address) object instance. Specifically, we use symbolic execution [11] and static inter-procedural data-flow analysis [9] to track the propagation and usage of individual ThisPtr values between and within functions. Although previous authors, most notably Fokin et al. [7][6] and Sabanal et al. [18], have used ThisPtr tracking for OO reverse engineering purposes, our work is distinct for two reasons:

1) We document heuristics for identifying object-oriented methods and structures expressed as data-flow patterns, which can be detected in an automated way. Although patterns may vary from compiler to compiler or from one object to another, the key idea is that each variant can be captured as a unique pattern.

2) Our approach relies only on static analysis and symbolic execution. Thus, it has the ability to recover object-oriented artifacts that may not be created during execution.

We have implemented our approach on top of the ROSE analysis framework [15][17]. ROSE provides an infrastructure for disassembly, symbolic execution, control flow analysis, and data flow analysis. Our tool, OBJDIGGER, aggregates data from object instances created throughout a binary compiled with Microsoft's Visual Studio C++ (MSVC). It records potential constructors, data members, and methods. Tests against open-source programs compiled with MSVC, and against real-world malware samples, indicate that we are able to recover this information reasonably accurately. Although our current implementation is specific to MSVC, our approach could be extended to other compilers as well.
In summary, the contributions of this paper are:
•We present a purely static approach that uses symbolic execution and inter-procedural data-flow analysis to track object references in binaries produced by MSVC.
•We provide an implementation capable of analyzing a binary and producing a list of (1) potential constructors or builders; (2) methods; and (3) data members.
•We present techniques for detecting inheritance relationships and embedded objects.
•We demonstrate, through experimentation on both open-source programs and closed-source malware, that the techniques described in this paper can be practically applied to reverse engineering OO code.

The remainder of this paper is organized as follows. In Sec. 2, we provide a brief overview of C++ object internals. In Sec. 3, we formalize our goals and constraints. In Sec. 4, we provide definitions for data structures crucial to our approach. In Sec. 5, we describe our approach. In Sec. 6, we describe experiments conducted using our implementation. In Sec. 6.1.2, we describe limitations of our current work and our plans to address them in the future. In Sec. 7, we review related work. In Sec. 8, we conclude.

2. Implementing Object-Oriented C++ Features
In this section, we provide a basic overview of object-oriented C++ concepts. For a more detailed discussion, we direct interested readers to Gray [8]. Although this paper focuses on code following MSVC's __thiscall convention, objects produced by other compilers follow similar patterns. Our discussion focuses on the example presented in Fig. 1.

When objects are created, the compiler allocates space in memory for each class instance. The amount of space allocated is based on the number and size of data members, plus possible padding for alignment. Fig. 2 illustrates the layout of this memory region for Sub, Add, and Add1, as generated1 by MSVC.
Every instantiated object is referenced by a pointer to its start in memory; this reference is commonly referred to as the ThisPtr. The ThisPtr is maintained for the lifetime of the object. It is passed amongst methods, and is used for data member accesses and to make virtual function calls. For example, suppose that tPtr is a ThisPtr to an instance of Add1. The memory dereference [tPtr+12] points at the variable one (see Fig. 2), and [tPtr+8] points at the embedded Sub object inherited from Add.

Only one instance of a function is created by a compiler. However, many instances of a C++ class may exist. Therefore, the ThisPtr allows for operations on the data members within a particular object instance. In MSVC, the ThisPtr is typically passed as a hidden parameter to OO functions via the ECX register in accordance with the __thiscall calling convention (more on this later). For example, the sum() method, depicted in Fig. 3, implements the __thiscall calling convention. At 0x401120, the ThisPtr is retrieved from ECX. At 0x40112A, the immediate value 1 is moved into the memory location corresponding to the ThisPtr plus 12. This memory location corresponds to the offset of the Add1 class member variable one. Therefore, the instruction at 0x40112A corresponds to the high-level assignment one=1.

Inheritance manifests in memory layouts by embedding an object of each superclass inside the object of the subclass (see Fig. 2).

1 This output is generated using the -d1reportAllClassLayout flag.

class Sub {
private:
    int c;
public:
    int sub(int a, int b) { c = a - b; return c; }
};

class Add {
protected:
    int x;
    int y;
    Sub b;
public:
    Add() { x = 0; y = 0; }
    Add(int q, int e) { x = q; y = e; }
    int sum() { return x + y; }
    int sub() { return b.sub(x, y); }
};

class Add1 : public Add {
public:
    int one;
    int sum() { one = 1; return x + one; }
    Add1(int q) { x = q; }
};

Figure 1. Object-oriented code sample.
Dereferences to parent data members consist of addresses composed of the ThisPtr plus offsets into the parent and child objects. For example, to access the class member variable Add::x from an Add1 object, Add1's ThisPtr is adjusted by an offset to refer to the embedded Add instance (zero in this case) and then dereferenced as needed.

Virtual functions are implemented using virtual function tables that contain a list of virtual function addresses. The address of a virtual function table is stored as an implicit data member at offset zero of the object (i.e., accessed by reading [thisPtr]). Indirect calls to virtual functions are made by dereferencing the virtual function table and calling the target virtual function. Fig. 4 illustrates a virtual function call to the second function in a virtual function table. At address 0x402300, the virtual function table's address is moved from offset zero of the class layout into EAX. At 0x402305, the pointer corresponding to the second function in the virtual function table is moved into EAX, which is called in the next instruction.

When an inheritance relationship exists, the child object overwrites the virtual function table references of its parent (if it has one) on instantiation to ensure that the most specific virtual function table is installed at runtime. If there are multiple parents with virtual functions, the child object has multiple references to distinct virtual function tables, one per parent. In this arrangement, references to virtual function tables are placed at the beginning of each embedded parent object.

class Sub size(4):
 +---
 0 | c
 +---
class Add size(12):
 +---
 0 | x
 4 | y
 8 | Sub b
 +---
class Add1 size(16):
 +---
   | +--- (base class Add)
 0 | | x
 4 | | y
 8 | | Sub b
   | +---
12 | one
 +---
Figure 2. Class layouts for Sub, Add, and Add1.
0x401120: mov [esp+4], ecx
0x401125: mov eax, [esp+4]
0x40112A: mov dword ptr [eax+12], 1
0x401131: mov eax, [esp+4]
0x401136: mov eax, [eax]
0x401138: mov ecx, [esp+4]
0x40113D: add eax, [ecx+12]
0x401140: retn
Figure 3. Assembly code for sum() in Add1.

0x402300: mov eax, [ecx]
0x402305: mov eax, [eax+4]
0x40230A: call eax
Figure 4. Virtual function call example.

3. Problem Statement
Given a binary executable compiled from C++ source code without debugging information, recover the following:
•Unique constructors and builder methods for classes instantiated
•Methods associated with object instances
•The location and size of data members used in these methods

Goals and Assumptions. The goal of this work is to expedite the recovery of object-oriented structures in compiled executables. We aim to aid program understanding such that reverse engineers and malware analysts are able to quickly identify when objects and their data members are being used.

However, we do not seek to recover the original source code of classes, for two reasons. First, recovering source might not always be possible, because compilation is not an injective mapping: different sources can be compiled to produce the same binary, so identifying the original source code is impossible in general. Second, from the malware analyst's point of view, it is more important to understand the details of the compiled code (e.g., method relationships and class layouts) than high-level abstractions.

4. Definitions
In this section, we review basic definitions and abstractions from data-flow analysis, used in the next section. For additional information, we direct the reader to work by Kiss, Jász, and Gyimóthy [12].

A computer system can be defined as C = ⟨P, M, R⟩, where P is a program, and M and R are the memory locations and registers available for use by P. Each program is composed of a set of functions, F, which can be further divided into sequences of instructions (i.e., ∀f ∈ F, f = ⟨i0, i1, i2, ...⟩).
Let I be the set of instructions, and V the set of values they manipulate. Instructions read from and write to parts of M and R. Let Use : I → 2^(V×(M∪R)) be a mapping such that Use(i) is the set of all pairs ⟨v, a⟩, where v is a value read by i and a is either a memory address in M or a register in R that stores v:

Use(i) = { ⟨v, a⟩ | a ∈ M ∪ R, i reads v from a }

Simply stated, Use(i) is a data structure that maps instructions to the values read from particular registers and memory locations. Similarly, let Def : I → 2^(V×(M∪R)) be a mapping between an instruction and the locations it writes to:

Def(i) = { ⟨v, a⟩ | a ∈ M ∪ R, i writes v to a }

An instruction is said to depend on another instruction if it reads/uses a value that has been set by the other. We define the function DepOn to be a mapping between an instruction i and the set of triplets ⟨v, a, j⟩, where v is the value written to the register or memory location a by j:

DepOn(i) = { ⟨v, a, j⟩ | ⟨v, a⟩ ∈ Use(i) ∩ Def(j) }

Simply stated, DepOn adds to the data provided by Use by identifying the instruction responsible for defining the value that was read. Alternatively, the first instruction is said to be a dependent of the second. DepOf is the inverse of DepOn:

DepOf(j) = { ⟨v, a, i⟩ | ⟨v, a⟩ ∈ Use(i) ∩ Def(j) }

Finally, we define the notion of data-flow order. Consider a function X that calls three others, A, B, and C, such that B calls D just before returning. Furthermore, suppose that X passes each method the value v. Written in flow order, [A, B, D, C] implies that: 1) A, B, D, and C contain instructions that all use v (i.e., instructions in A, B, C, and D share a reaching definition of v); and 2) instructions in A dominate the instructions in B, D, and C; instructions in B dominate those in C and D; and so forth. (I.e., given a control-flow graph containing X, A, B, C, and D, and taking the first instruction in X as the starting node, every path from the start to the instructions in B, C, and D must go through the instructions in A.
Every path from the start to the instructions in C and D must go through those in A and B, and so forth.)

5. Approach
Our approach consists of a preliminary stage followed by iterative analysis passes. We begin by disassembling binaries using the ROSE framework, and use data- and control-flow analysis to build the Use, Def, DepOn, and DepOf maps. These data structures are then used to identify object-oriented structures. ROSE provides the x86 instruction semantics and symbolic emulation infrastructure required for this analysis.

The key idea behind identifying OO structures is to track the propagation of unique (up to allocation site) ThisPtr instances within and between functions. We begin by identifying the set of functions (FM) that possibly follow the __thiscall convention. Next, using heuristics about known heap-allocation functions, such as the new() operator, and about stack-allocation patterns, we identify points at which a ThisPtr is created. We track these pointers to functions in FM using inter-procedural data-flow analysis. Depending on data-flow order, we mark methods as either constructors or member/inherited functions. Within these functions, we look for data transfers to and from memory addresses based off of the ThisPtr. Depending on the offsets from the ThisPtr and the size of the dereferences, we recover the size and position of data members. OBJDIGGER uses the ROSE framework to perform control-flow and dominator analysis.

5.1 Data and Control Flow Analysis
To construct the four maps described above, we implemented the well-known work-list algorithm [3, 10, 12] for data-flow analysis. Our algorithm is shown in Procedure buildDependencies(). It maintains a list of symbolic expressions (called states) that capture the contents of registers and memory after each basic block2 is executed.
For each basic block B, and for each instruction i of B in flow order, the algorithm: (i) symbolically executes i; (ii) updates the register and memory contents of B's state with the result r; and (iii) adds i to the list of "modifiers" of r. This list records the addresses of all instructions that have contributed to the value up to this point. For example, processing the instruction add [eax], 5, located at address 0x00405630, adds 5 to the symbolic value stored at the address pointed to by EAX and appends 0x00405630 to that value's list of modifiers. When a different instruction reading this same memory location (e.g., cmp [eax], 0) is processed later, a dependency relationship with the add is established by reading the list of modifiers.

The state of each basic block, before any instructions are executed, is composed of the 'merged' states of the block's predecessors. In more detail, if control flow can reach a basic block from multiple locations, the contents of registers and memory at block entry may have different symbolic values and modifiers, depending on the specific path taken. The merged state therefore combines the information from each possible entry path by performing a union across all possible entry states. Explicitly, if the contents of a register or memory location are the same in two different entry states, the symbolic value for that location in the merged state is the same. If they differ, the merged state records that the value is unknown, and the resulting list of modifiers is the combination of the lists from each entry state.

The state of each basic block, after all instructions are executed, is compared with its previous state in states. If any of the register or memory contents have changed, states is updated and all of the block's successors (those that the block can flow into) are marked for processing. The algorithm terminates when the states of all blocks stop changing.
5.2 Identifying __thiscall Functions
Most methods follow the __thiscall calling convention. When identifying data members and inheritance, we restrict ourselves to such functions, and thus our first step is to identify them. Note that the steps outlined here are not precise enough to distinguish between __thiscall and some instances of __fastcall [5] (a more complete algorithm would also need to verify that EDX is not being used to pass parameters). However, this is a cheap way to eliminate from further analysis many functions that cannot be methods, thereby improving the overall efficiency of our approach.

A key trait of __thiscall in MSVC is that the ThisPtr is passed as a parameter in the ECX register.3 Exploiting this feature, we find __thiscall methods as follows: we examine each function in the binary, f ∈ F, and look for those that contain instructions that use ECX whose value has been defined externally to the function. That is, we examine DepOn and look for an instruction, i, that maps to a tuple ⟨∗, ECX, j⟩, where j is an instruction that belongs to a different function than i, and '∗' matches an arbitrary value. Therefore, the set of methods following __thiscall is:

FM ← { f | ∃i ∈ f, ∃j ∉ f : ⟨∗, ECX, j⟩ ∈ DepOn(i) }

Our algorithm for identifying __thiscall methods is shown in Procedure findThisCall(). It generates a set of pairs, each containing a __thiscall method and the first instruction within that method to read ECX. In the rest of this paper, by a __thiscall method we mean a method identified by findThisCall().

2 A sequence of instructions with one entry and one exit.
3 http://msdn.microsoft.com/library/ek8tkfbw(v=vs.80).aspx

Procedure buildDependencies()
Input: Func : a binary function composed of assembly instructions
Input: EntryState : symbolic state of the system, storing register and memory contents, upon function entry
Result: Uses, Defs, DepsOn, and DepsOf are populated for each instruction
1   foreach block ∈ getBasicBlocks(Func) do
2       states[block] ← initSymbolicState();
3       queue[block] ← true;
4   changed ← true;
5   while changed do
6       foreach block ∈ getBasicBlocks(Func) do
7           if queue[block] then
8               if isFirstBlock(block) then
9                   curstate ← EntryState;
10              else
11                  foreach pred ∈ getPredecessorBlocks(block) do
12                      curstate ← mergeStates(curstate, states[pred]);
13              foreach instr ∈ getInstructions(block) do
14                  curstate ← symbolicExec(instr, curstate);
15                  foreach aloc ∈ getRegsAndMemRead(instr) do
16                      symval ← getRegOrMemValue(aloc, curstate);
17                      Uses[instr] ← ⟨symval, aloc⟩;
18                      foreach definer ∈ getModifierList(symval) do
19                          DepsOn[instr] ← ⟨symval, aloc, definer⟩;
20                          DepsOf[definer] ← ⟨symval, aloc, instr⟩;
21                  foreach aloc ∈ getRegsAndMemWritten(instr) do
22                      symval ← getRegOrMemValue(aloc, curstate);
23                      Defs[instr] ← ⟨symval, aloc⟩;
24              if not regsAndMemEqual(curstate, states[block]) then
25                  changed ← true;
26                  foreach successor ∈ getSuccessorBlocks(block) do
27                      queue[successor] ← true;
28              states[block] ← curstate;
29              queue[block] ← false;

Procedure findThisCall()
Input: Funcs : set of functions from the executable
Input: DepsOn : the dependent-on map
Result: ThisCalls : set of pairs ⟨func, instr⟩, where func follows __thiscall and instr is the first instruction in func that reads ECX
1   ThisCalls ← nil;
2   foreach func ∈ Funcs do
3       foreach instr ∈ getInstructions(func) do
4           foreach ⟨value, aloc, depinst⟩ ∈ DepsOn[instr] do
5               deffunc ← getFunction(depinst);
6               if aloc = ECX and func ≠ deffunc then
7                   ThisCalls ← ThisCalls ∪ ⟨func, instr⟩;
8                   Repeat at Line 2;
9   return ThisCalls

5.3 Identifying Object Instances and Methods
Once potential __thiscall methods have been identified, the next step is to group them into object instances by finding those that share a common ThisPtr. Recall that the ThisPtr is a reference to an object instance. Object-oriented methods are passed these pointers so that they know which object instance they are operating on, and they use the pointers to obtain member values and identify virtual methods.
Therefore, we first identify a potential ThisPtr, which points to the stack or the heap. Next, from the data structures constructed earlier in buildDependencies(), we look for the object-oriented methods that have been passed this particular pointer in ECX.

Identifying ThisPtr creation follows a similar pattern for both the stack and the heap. Heap space is obtained using functions such as MSVC's new() operator. Stack space is allocated upon function invocation in the function prologue.4 An lea instruction is often used subsequently to load references to portions of this space. In the remainder of this section, we describe how we track a heap-addressed ThisPtr to object-oriented methods. Tracking a stack-addressed ThisPtr is very similar, except that the process begins at an lea instruction.

We are able to identify calls to new(), either by parsing the binary's import section or from fingerprints/hashes of known5 new() implementations. Once a call to new() has been identified, we iterate through each __thiscall method and attempt to identify those that contain an instruction that uses new()'s returned value. To identify methods belonging to an object created on the heap, we do the following for each function that calls new():

1. We retrieve the ThisPtr by identifying the first instruction, j, that reads EAX after the call to new().6 The symbolic value of the ThisPtr is found from Use(j), corresponding to the pair ⟨thisPtrFromNew, EAX⟩. See Procedure findReturnValueOfNew().

2. We then iterate through each __thiscall method called in the same function that calls new(), looking for those that contain an instruction, i, that reads ECX with a matching value.

Simply stated, we look for __thiscall methods that are passed values of ECX that match the symbolic value in EAX immediately following a call to new(). More formally:

objectMethods = { f ∈ FM | ∃i ∈ f : ⟨thisPtrFromNew, ECX⟩ ∈ Use(i) }

where f is a __thiscall method, i is an instruction in f, and thisPtrFromNew is defined above. Also see Procedure findObjectMethodsFromNew().

4 Typically push ebp; mov ebp, esp; sub esp, X, where X is the number of bytes allocated in the current stack frame.
5 We hash the bytes of unique new() implementations across different versions of the Visual Studio compiler and attempt to identify functions that match these signatures within a binary.
6 Functions such as new() typically return their result in the register EAX.

0x401008: call new
0x40100D: mov [ebp-4], eax
0x401010: mov ecx, [ebp-4]
0x401021: call constructor
0x401024: ...
0x401026: push param1_offset
0x40102D: push param2_offset
0x401030: mov ecx, [ebp-4]
0x401033: call method
Figure 5. Heap object construction and method call example.

Fig. 5 illustrates these concepts. The call to new() at 0x401008 allocates space on the heap. The ThisPtr, referring to this region, is returned in EAX, and the instruction at 0x40100D saves it into a temporary variable. This ThisPtr is then transferred to the ECX register prior to the constructor call at 0x401021 and the method call at 0x401033. Since constructor and method share a ThisPtr, they are methods belonging to the same class.
Procedure findReturnValueOfNew()
Input: NewCaller : a function that calls new()
Input: NewAddresses : set of addresses of new() functions
Input: Uses : the Uses map built by buildDependencies()
Result: ThisPtr : the symbolic value returned by a new() call
1   found ← false;
2   foreach instr ∈ getInstructions(NewCaller) do
3       if found = false then
4           if getInstructionType(instr) = x86_call then
5               if getCallDest(instr) ∈ NewAddresses then
6                   found ← true;   // found the call to new
7       else
8           foreach ⟨ThisPtr, aloc⟩ ∈ Uses[instr] do
9               if aloc = EAX then
10                  return ThisPtr; // return the symbolic value
11  return failure;                 // not usually reached

In a similar fashion, we identify objects created on the stack by identifying lea instructions, l, that reference locally allocated stack space. The value of the ThisPtr is found from the pair ⟨thisPtr, REG⟩ ∈ Def(l), where REG is the first operand of the lea instruction. The pointer is tracked to __thiscall methods in the same way as on the heap.

Identifying which methods are likely constructors is complicated by several factors. Constructors are required to return a ThisPtr, which distinguishes them from many, but not all, conventional methods. If the class uses virtual functions, initialization of the virtual function table pointers can be used to reliably identify constructors, but virtual functions are not present in all classes.

Procedure findObjectMethodsFromNew()
Input: NewCaller : a function that calls new()
Input: ThisCalls : set of functions from findThisCall()
Input: Uses : the Uses map built by buildDependencies()
Result: ObjectMethods : set of functions sharing a common ThisPtr
1   ObjectMethods ← nil;
2   thisptr ← findReturnValueOfNew(NewCaller);
    // Get the list of OO calls from this function
3   OurCalls ← ThisCalls ∩ getCalls(NewCaller);
4   foreach ⟨func, instr⟩ ∈ OurCalls do
5       foreach ⟨symval, aloc⟩ ∈ Uses[instr] do
6           if aloc = ECX and symval = thisptr then
7               ObjectMethods ← ObjectMethods ∪ func;
8   return ObjectMethods
Another common heuristic is that constructors are always called first after space is allocated for the object. This heuristic fails when compiler optimization has resulted in the constructor being inlined following the allocation. We chose to identify a constructor as the first method called following allocation of the object, provided that it returns the same ThisPtr that was passed as a parameter. This algorithm erroneously identifies some functions as constructors; for example, builder/factory methods can closely resemble constructors at the binary level. However, because these types of methods may be indistinguishable from constructors in the binary, we have not counted this as an error. The heuristic also misses some legitimate constructors, for example constructors that construct other types of objects.

5.4 Data Members
Once related __thiscall methods have been associated with unique object instances, we process each one to retrieve data members. Recall that the ThisPtr points into the memory region allocated for an object. Therefore, by finding memory dereferences that use the ThisPtr, and extracting their offsets into this region and their sizes, we identify the location and width of each data member in the class layout.

Specifically, we identify the first instruction, j, in the function to read ECX. We retrieve the value of the ThisPtr from the pair ⟨thisPtr, ECX⟩ ∈ Use(j). We then iterate through all of the other instructions, i, that dereference memory, looking for a pair ⟨∗, thisPtr + offset⟩ ∈ Use(i). The algorithm is given more formally in Procedure findDataMembers(), which produces a mapping, MemberMap, between a __thiscall method and a set of data members, each represented by the pair ⟨offset, size⟩.

Fig. 6 illustrates the use of the ThisPtr for accessing a data member. The ThisPtr is moved from ECX to EAX at 0x401104 and 0x401107. The data variable located at memory address ThisPtr plus 0xC is transferred to EAX at 0x40110A.
Therefore, we determine that there is a data member at offset twelve in this class's layout. Since the size of the dereference is 32 bits, we can assume that a variable of at least that size exists at that particular offset.

0x401100: push ebp
0x401101: mov ebp, esp
0x401103: push ecx
0x401104: mov [ebp-4], ecx
0x401107: mov eax, [ebp-4]
0x40110A: mov eax, [eax+0Ch]
0x40110D: add eax, 1
0x401110: mov esp, ebp
0x401112: pop ebp
0x401113: retn
Figure 6. Data member discovery example.

Procedure findDataMembers()
Input: ThisCalls : set of functions from findThisCall()
Input: Uses : the Uses map built by buildDependencies()
Result: MemberMap : mapping from functions to pairs ⟨offset, size⟩ describing data members
1   MemberMap ← nil;
2   foreach ⟨func, instr⟩ ∈ ThisCalls do
3       members ← nil;
4       foreach ⟨thisptr, aloc⟩ ∈ Uses[instr] do
5           if aloc = ECX then
6               foreach uinstr ∈ getInstructions(func) do
7                   if instr ≠ uinstr then
8                       ⟨symval, ualoc⟩ ← Uses[uinstr];
9                       if isMemReadType(ualoc) then
10                          offset ← thisptr − symval;
11                          if isConstant(offset) then
12                              size ← getReadSize(uinstr, ualoc);
13                              members ← members ∪ ⟨offset, size⟩;
14              break;  // done with this function
15      MemberMap[func] ← members;
16  return MemberMap

5.5 Virtual Function Tables
Objects that have virtual function tables initialize the memory at the ThisPtr (zero offset) with the address of the table. This memory write occurs within a constructor and typically takes the form mov [reg], vtableAddr, where reg contains the value of a ThisPtr. Therefore, if we find such an instruction, i, within a previously identified constructor, we record the written constant as a potential virtual function table address (i.e., ⟨vtableAddr, thisPtr⟩ ∈ Def(i)). We then identify calls made to entries within this table by examining the dependents of the mov instruction, i. In more detail, we find the set of instructions, Q:

Q = { q | ⟨vtableAddr, thisPtr, q⟩ ∈ DepOf(i) }

where Q contains the set of instructions that read the ThisPtr from the address initialized by the mov instruction.
Using symbolic execution, we follow the flow of the pointer from this instruction to a call instruction. We record the branch target and the offset of the call destination from the ThisPtr as an entry at the given offset within the virtual table.

5.6 Inheritance and Embedded Objects
Although our current implementation does not fully support inheritance detection, we describe our progress in this area.

Inheritance relationships can be determined by analyzing constructors. When a class inherits from a parent, the constructors of the subclass call the parent's constructors. Specifically, the subclass passes its ThisPtr to the parents' constructors. In the case of single inheritance, the subclass constructor passes the ThisPtr directly (the memory address is exactly equal to the ThisPtr, with no offset). In the case of multiple inheritance, the subclass passes the pointer plus the offset at which the parent is located in the class layout.

Unfortunately, this behavior is also observed when an object contains embedded objects. Therefore, in order to distinguish between embedded objects and inheritance, we need additional discriminators. One reliable method would be to check whether the subclass overwrites the virtual table address of its parent in its constructors; as mentioned earlier, however, classes in general, and the parent class in this case, are not required to contain virtual functions.

In summary, to identify inheritance relationships, we could: (1) retrieve all cross-references (calls out of the function) from constructors to other constructors; (2) compare the values of ECX at the beginning of each function: a constructor that calls other constructors sharing a common ECX value (possibly plus some constant) indicates either an inheritance relationship or an embedded object; and (3) check whether the caller overwrites the address passed to the called constructors.
Recall that the pointer to the virtual table is typically located at offset zero within a class layout. Therefore, if a constructor writes a pointer to a new virtual table into a memory address that corresponds to a ThisPtr passed to another constructor, we can label the other constructor as a parent. If we cannot find such an overwrite, it is possible that the constructor is instantiating an embedded object within the class. See Procedure lookForInheritance().

0x401104: mov [ebp-4], ecx
0x401107: mov ecx, [ebp-4]
0x40110A: call sub_4010C0
0x40110F: mov ecx, [ebp-4]
0x401112: add ecx, 10h
0x401115: call sub_401080
0x40111A: mov eax, [ebp-4]
0x40111D: mov dword ptr [eax], 0x40816C
0x401123: mov ecx, [ebp-4]
0x401126: mov dword ptr [ecx+10h], 0x40817C
Figure 7. Example constructor with multiple inheritance.

Fig. 7 shows part of a constructor that calls two other constructors, at 0x4010C0 and 0x401080. It passes its ThisPtr without any offset to the first call at 0x401107, and passes the ThisPtr plus 0x10 to the second call at 0x401112. At 0x40111D and 0x401126, we observe that these same memory locations are overwritten with constants corresponding to two new virtual function table addresses. Therefore, we know that this class inherits from two parents, whose constructors are at 0x4010C0 and 0x401080 and whose layouts are embedded at offsets zero and sixteen of the class (see Fig. 2 for an example of single inheritance and layout embedding).

In summary, there is an open problem in reliably distinguishing between embedded objects and multiple inheritance in the absence of virtual functions in the parent. Some of our remaining deficiencies stem from this difficulty, and we plan to continue investigating this problem in future work.

5.7 Object Instance Aggregation and Reporting
Our implementation aggregates data from object instances created throughout a binary.
This information is grouped by unique con- structor, and in some cases builder methods that return object in- stances that are largely indistinguishable from constructors. TheProcedure lookForInheritance() Input :Func : function identified as a constructor, member of ThisCalls Input :Constructors : the set of all identified constructors Input :ThisCalls : set of functions from findThisCall() Input :Uses : the Uses map built by buildDependencies() Input :Defs : the Defs map built by buildDependencies() Result :Parents : set of parent/inherited constructors called byFunc 1Parents←nil; 2passed←nil; 3foreachinstr∈getInstructions( Func)do // Find calls to other constructors 4ifgetInstructionType( instr)=x86_call then 5target←getCallDest( instr); 6 iftarget∈Constructors then // Get ThisPtr passed to each constructor 7 foreach⟨cxf,cxi⟩∈ThisCalls do 8 iftarget =cxfthen 9 foreach⟨symval ,aloc⟩∈Uses[cxi]do 10 ifaloc=ECX then 11 passed←passed∪⟨cxf,symval⟩; // Look for mov instruction that overwrites location of a passed ThisPtr 12foreachinstr∈getInstructions( Func)do 13ifgetInstructionType( instr)=x86_mov then 14 foreach⟨symval ,aloc⟩∈Defs[instr]do 15 ifisMemWriteType( aloc)then 16 foreach⟨pxf,thisptr⟩∈passed do 17 ifsymval =thisptr then 18 Parents←Parents∪pxf; 19returnParents ; list of all seen data members and methods associated with an ob- ject instance, produced by some constructor, are merged with that of another object instance, produced by the same constructor. In- formation from constructors known to belong to the same class, for example because they share a common virtual table, are also merged. In this way we provide results to the analyst which are more use- ful than individual object instances and yet are not truly class defi- nitions either. With more rigorous detection of inheritance and ob- ject embedding relationships these merged object instances should converge on complete class definitions although we do not claim that result in this work. Fig. 
8 shows data about merged object instances from one of our experiments. Note that the actual output of our tool reports raw addresses; for illustrative purposes here, we have substituted the raw addresses with symbol information obtained from the compiler-generated PDB files. This particular example shows correctly identified methods, members, and virtual function information for the class XmlText. However, it also illustrates a case in which our approach was unable to distinguish between an embedded object and an inheritance relationship. XmlText inherits from XmlNode; however, the XmlNode() and SetValue() methods of XmlNode were reported as methods of XmlText.

Constructor: __thiscall XmlText::XmlText(char *)
Vtable: 4b7264
Vtable Contents:
  Address: 4b7264 Pointer to Function @4035ae
Data Members:
  Offset: 16 Size: 4
  Offset: 20 Size: 4
  Offset: 24 Size: 4
  Offset: 28 Size: 4
  Offset: 36 Size: 4
  Offset: 40 Size: 4
  Offset: 44 Size: 1
Methods:
  void *__thiscall XmlText::XmlText()
  void __thiscall XmlNode::SetValue(char *)
  __thiscall XmlNode::XmlNode(XmlNode::NodeType)
  void *__thiscall XmlText::SetCDATA(bool)
Inherited methods:

Figure 8. Output of OBJDIGGER (with symbols substituted for addresses).

6. Experiments

To validate our approach, we conducted experiments on open-source packages, downloaded from SourceForge[7], and on real-world malware for which source is unavailable. We propose here a framework for evaluating such algorithms using a mixture of tool output, debugging information, and compiler-generated class member layouts.

6.1 Open-source Tests

6.1.1 Methodology

The open-source tests were designed to evaluate the effectiveness of our approach given ground truth. The packages that we used were: The Lean Mean C++ Option Parser version 1.3, Light Pop3/SMTP Library, X3C C++ Plugin Framework version 1.0.2, PicoHttpD Library version 1.2, and CImg Library version 1.0.5.
Each program serves a different purpose, such as XML or math parsing, and includes test programs that exercise different parts of their respective APIs. We ran our tool on a binary from each library. In these experiments, ground truth came from three sources: 1) a compiler layout produced by MSVC (as shown in Fig. 2) that contains information about the class layout and data members; 2) symbol information from compiler-generated PDB files, which allows us to map function addresses to symbolic names (from which we can determine the classes to which they belong); and 3) source code of the test programs and libraries.

The results of our experiments are summarized in Table 1. We collected data in three categories for each test package:

1. # of unique classes found / # of unique classes. Using the symbol information from the PDB, we counted the number of unique classes instantiated in the binary code. We excluded classes that were part of the standard compiler library. For the numerator, we counted those classes for which OBJDIGGER identified at least one instance of a constructor and associated methods and members.

2. # of methods found / # of methods in binary. We used the symbol information from the PDB to determine which methods were included in the binary. We counted a method as found by OBJDIGGER if it was associated with at least one instance of a constructor for the correct class. Note that inlined methods, and methods which were not included in the binary, are not present as symbols in the PDB file. We also excluded from the denominator cases where source code inspection confirmed that the methods were not in the control flow. This sometimes happens when the compiler includes functions only because they were part of an object file.

3. # of data members found / # of data members in binary. Using the compiler layout information, we compared the class members identified by OBJDIGGER to the members reported by the compiler.

[7] http://www.sourceforge.net
In certain circumstances we excluded from the denominator members that were known to have no uses in the binary. This sometimes occurs when the compiler excludes the only function which accesses a member because the method was never called.

Our testing methodology, for each package, was as follows:

1. Compile the test programs for the package. Generate layout information using the compiler and demangle class names using undname.

2. Run OBJDIGGER on each binary, which reports method addresses and object layouts, without names.

3. Extract symbol data from the PDB files using IDA Pro[8] and demangle names using undname[9]. This maps function addresses to method names.

4. Correlate the function addresses from the OBJDIGGER output to the names in the symbol data. As can be seen in Fig. 8, the symbolic names specify the classes that particular methods belong to, which allows us to determine the validity of grouped methods.

5. Compare the discovered data members to those reported by the compiler, using the class name obtained from the symbol for the constructor.

6. Manually inspect the source code of each test program, excluding any methods or members as described in the previous section.

Table 1. Test results for open-source packages.

Package            Classes        Methods        Members
PicoHttpD 1.2      8/9 (89%)      31/47 (66%)    18/25 (72%)
x3c 1.0.2          4/5 (80%)      21/24 (88%)    6/8 (75%)
CImg 1.0.5         7/7 (100%)     61/83 (73%)    33/42 (79%)
OptionParser       10/10 (100%)   37/52 (71%)    33/35 (94%)
Light Pop3/SMTP    8/9 (89%)      29/35 (83%)    16/23 (70%)

6.1.2 Discussion

Table 1 lists the recall, or true positive rate, for OBJDIGGER. Method and member totals are summed across all classes. While the table does not explicitly list precision values, there were no false positives generated for this test set using the tool, so precision was 100% in all tests.
In each case, we verified that all identified methods and data members were correctly associated with the classes to which they actually belong by looking up their symbolic names.

[8] https://www.hex-rays.com/products/ida/
[9] undname is an MSVC tool for demangling OO method names.

With regards to missed methods, OBJDIGGER was often able to identify many of them as following __thiscall, but was not able to associate them with a specific class. It was also able to group many of these missed methods with other found methods that shared the same ThisPtr; unfortunately, none of those other found methods could be positively identified as a constructor. For example, in the case of PicoHttpD, the single missed class was created as a global variable: a memory address for a location in the .rdata section was passed to the constructor. Currently, however, OBJDIGGER only checks for local stack addresses and space allocated by new(). Thus, even though OBJDIGGER correctly identified that this same pointer into .rdata was passed as a ThisPtr in a couple of other methods (those that we missed), we did not report a new class instance or any of these associated methods. We chose this conservative approach to avoid over-counting unique class instances. It is possible that these methods could have belonged to an object instance from a class that had already been identified, but created in another function.

When a constructor is not found, we are unable to associate any of the found members or methods in that object instance with a specific named class. This leads to a cascading effect where a single missed constructor negatively affects recall. Additionally, missed class methods also mean that any data members accessed inside of them were missed as well. This cascading effect is a fundamental challenge in analyzing OO code, since methods and data members are tied together to produce the object abstraction.
With regards to the missed methods and data members in CImg and Light Pop3, we suspect that these omissions were due to implementation bugs and not limitations of the approach. Specifically, at the time of the experiments, OBJDIGGER had problems tracking objects that were passed as parameters to other methods. The tool also had problems identifying certain methods that were called indirectly, by dereferencing addresses within a class's virtual table. We are currently working on addressing those issues.

A fundamental limitation of our approach is that we can only detect methods and members that are respectively called and accessed by the program being analyzed. Our technique relies on grouping methods together by shared ThisPtr. Thus, if a program creates a class with methods that are never called by any instance of that class (or associated with a unique constructor belonging to that class), OBJDIGGER fails to detect these methods. Similarly, if a data member is never accessed (i.e., OBJDIGGER never observes a memory read or write to a particular location within a class layout), OBJDIGGER fails to detect this particular data member.

6.2 Closed-source Malware Case Study

Object-oriented malware presents many challenges to analysts. Understanding object structures can be critical for recovering functionality. To demonstrate how OBJDIGGER can aid with malware analysis, we used it to help analyze a malware sample (file MD5 019d3b95b261a5828163c77530aaa72f on http://www.virustotal.com).

It is not uncommon for OO malware to encapsulate critical, malicious functionality in C++-style data structures. As a result, reverse engineering OO malware can be challenging because understanding program functionality may first require recovering C++ data structures. Manually recovering C++ data structures can be a tedious and error-prone task, especially if done piecewise or in conjunction with trying to understand program functionality.
OBJDIGGER automatically recovers object structures, thereby streamlining analysis efforts. For example, in the sample, OBJDIGGER quickly identifies object instances and potential constructors. Of the 585 functions within the sample, our tool identified nine object instances along with their constructors, methods, data members, and virtual function tables. The analyst can then inspect this reduced set to determine each data structure's relevance to the program.

0x401010 push 0FFFFFFFFh
0x401012 push 41497Bh
0x401017 mov eax, large fs:0
0x40101D push eax
0x40101E mov large fs:0, esp
0x401025 sub esp, 0A8h
0x40102B push esi
0x40102C lea ecx, [esp+4]
0x401030 call sub_403000
0x401035 mov eax, [esp+C0h]
0x40103C mov ecx, [esp+BCh]
0x401043 push eax
0x401044 push ecx
0x401045 lea ecx, [esp+Ch]
0x401049 mov dword ptr [esp+BCh], 0
0x401054 call sub_401470
0x401059 lea ecx, [esp+4]
0x40105D mov esi, eax
0x40105F mov dword ptr [esp+B4h], 0FFFFFFFFh
0x40106A call sub_401F20
0x40106F mov ecx, [esp+ACh]
0x401076 mov eax, esi
0x401078 pop esi
0x401079 mov large fs:0, ecx
0x401080 add esp, 0B4h
0x401086 retn

Figure 9. Main function of the malware sample.

Constructor: 403000
Vtable: 41647c
Vtable Contents:
  Address: 41647c Pointer to Function @ 403030
Data Members:
  Offset: 0x0 Size: 0x4
  Offset: 0x8 Size: 0x4
  Offset: 0x18 Size: 0x4
  ...
  Offset: 0xa0 Size: 0x4
  Offset: 0xa4 Size: 0x4
Methods:
  401470
  401f20

Figure 10. OBJDIGGER output for the malware sample main function.

Fig. 9 shows the disassembly for the main function, generated by IDA Pro. A cursory analysis of this function shows that it is a relatively simple routine containing three method calls: sub_403000, sub_401470, and sub_401f20. Note that in Fig. 10, OBJDIGGER identified all three of these methods as related to one object: sub_403000 is the constructor, and sub_401470 and sub_401f20 are class methods.
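A hedged source-level reading of Figs. 9 and 10 might look like the following. All names and method bodies here are invented placeholders (the real bodies are unknown), and the third call could equally be a destructor rather than an ordinary method.

```cpp
// Hypothetical reconstruction of the malware's main function: a
// stack-allocated object (lea ecx, [esp+4]), one constructor call,
// then two calls on the same ThisPtr, returning the first method's
// result (saved in esi, returned in eax).
struct Unknown {
    Unknown() {}                              // sub_403000
    virtual void on_event() {}                // vtable slot @403030 (placeholder)
    int run(int a, int b) { return a + b; }   // sub_401470 (placeholder body)
    void cleanup() {}                         // sub_401F20, or a destructor
};

int main_reconstructed() {
    Unknown obj;                 // call sub_403000 with ecx = &obj
    int result = obj.run(1, 2);  // call sub_401470: two stack args + ecx
    obj.cleanup();               // call sub_401F20 on the same object
    return result;               // mov eax, esi; retn
}
```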
With this information, the analyst immediately has a sense that the malware's functionality is organized around (at least) one object instantiated in the main function. Because significant parts of the program are encapsulated in this object, understanding its internals is likely critical to determining program functionality. For instance, understanding the purpose of this object's data members takes on greater importance because of the object's usage in the malware. Notably, OBJDIGGER also provides information on class member offsets and sizes, further simplifying analytical efforts.

In this scenario, the information provided by OBJDIGGER could be recovered manually, but this may take considerable effort. Automatically reasoning through C++ data structures saves time and frees the analyst to focus on questions that are more relevant to malware functionality.

7. Related Work

Sabanal et al. [18] provide a detailed discussion of recovering C++ data structures from binary code. In particular, they are the first to describe heuristics for recognizing C++ objects by monitoring the use of the ThisPtr in binary methods. Our work builds upon their ideas, and captures these heuristics as machine-recognizable data-flow patterns. Additionally, our work goes one step further by tracking the propagation of the ThisPtr between functions to identify common data members and methods of classes. Tröger and Cifuentes [21] pioneered a similar use of data-flow analysis techniques applied to binaries to recover virtual function tables; however, their work relied on dynamic execution of code to resolve the addresses of object methods. Lee et al. [13], Balakrishnan and Reps [2], and Ramalingam et al. [16] focus on variable and data type recovery in executable files. While type recovery is an important and related problem, our primary concern is to recover the class structure of objects. Srinivasan et
al. [20] propose a method that uses dynamic analysis to observe call relationships between methods to infer class hierarchy (similar to what we have done in a static context). However, their ability to recover class structure is limited to the portions of a binary that actually run. Furthermore, since they do not track memory dereferences that use the ThisPtr, they do not recover data members. Slowinska et al. [19] and Lin et al. [14] focus on type discovery of variables using dynamic analysis. Although their work does not deal with object-oriented code directly, their method of tracking the use of memory locations to infer size and type is similar to the way we track memory dereferences involving the ThisPtr to infer the size and offset of data members.

Adamantiadis [1] provides a detailed explanation of constructors, destructors, and virtual function tables at the binary level, and gives an example of reverse engineering an object-oriented C++ binary. However, the discussion does not propose an automated technique.

Dewey et al. [4] describe many techniques similar to the ones we use in our work. They specifically state, though, that they are focused on analyzing known non-malicious code for a specific class of vulnerability. Our work is designed to be used explicitly for analyzing malicious software.

Fokin et al. [7] adopt an approach that appears to be very similar to ours, but provide less detail about the data-flow analysis. Their earlier work [6] provides interesting insights about the aggregation of related object instances into classes.

8. Conclusion and Future Work

In this paper, we present a purely static approach for recovering object instances, data members, and methods from x86 binaries compiled using MSVC. We produced a tool, OBJDIGGER, which we tested against open-source software and real-world malware.
A comparison of the output from the open-source tests against ground truth, generated from the compiler's debug information, indicates that our technique can achieve its goal effectively. The tests against real-world malware demonstrate that our tool can aid in the malware reverse engineering process.

While our experiments demonstrate that our approach is viable, there is room for improvement. First, OBJDIGGER needs to be extended to recognize and recover objects instantiated at global scope. We are currently exploring this direction, building on data-flow analysis techniques to reason through the mechanics of global object creation and storage.

In certain object arrangements, inheritance and composition are hard to distinguish. Determining whether an embedded object is a parent or a member without relying on the presence of virtual function tables is an open problem, and more work is needed to correctly identify this arrangement. Similarly, constructors and destructors can be difficult to distinguish under certain circumstances. OBJDIGGER needs to be extended to accurately identify destructors to enable better identification and tracking of object scope.

Advanced OO features of C++, such as virtual inheritance, are currently not supported. Virtual inheritance fundamentally changes the layout of objects in memory. The primary mechanism used to implement virtual inheritance is the virtual base class table, which maintains offsets to multiple parent classes to remove the ambiguity possible with multiple inheritance. OBJDIGGER must be extended to correctly recognize and interpret these tables.

Further investigation is needed to fully understand the implications of compiler optimizations such as inlining of constructors, destructors, and other methods.

Finally, further experimentation is needed to determine to what extent OBJDIGGER can analyze non-MSVC-generated binaries.
Preliminary analysis suggests that compilers such as the GNU C++ Compiler use similar mechanisms to implement OO C++ features, but additional investigation is needed to determine what nuances, if any, exist in different compilers. It might also be interesting to investigate what we can discover of OO patterns in other languages, such as Delphi, which analysts see frequently in the malware realm.

On a more practical note, the output of OBJDIGGER can certainly be improved to help the analyst more quickly see relationships between methods and objects (perhaps as a custom plugin for IDA Pro).

Acknowledgments

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This material has been approved for public release and unlimited distribution. Carnegie Mellon®, CERT®, and CERT Coordination Center® are registered in the U.S. Patent and Trademark Office by Carnegie Mellon University. DM-0000440

References

[1] Aris Adamantiadis. Reversing C++ programs with IDA Pro and Hex-Rays. http://blog.0xbadc0de.be/archives/67.

[2] Gogul Balakrishnan and Thomas Reps. DIVINE: discovering variables in executables. In Proceedings of the 8th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI'07), pages 1–28, Berlin, Heidelberg, 2007. Springer-Verlag.

[3] Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. Iterative data-flow analysis, revisited. Technical report, Rice University, 2004.

[4] David Dewey and Jonathon T. Giffin. Static detection of C++ vtable escape vulnerabilities in binary code. In Proceedings of the 19th Annual Network and Distributed System Security Symposium (NDSS'12), http://www.internetsociety.org/static-detection-c-vtable-escape-vulnerabilities-binary-code, 2012.

[5] Agner Fog, Technical University of Denmark.
Calling conventions for different C++ compilers and operating systems. http://www.agner.org/optimize/calling_conventions.pdf, pages 16–17, last updated 04-09-2013.

[6] Alexander Fokin, Katerina Troshina, and Alexander Chernov. Reconstruction of class hierarchies for decompilation of C++ programs. In Proceedings of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), IEEE, pages 240–243, 2010.

[7] Alexander Fokin, Egor Derevenetc, Alexander Chernov, and Katerina Troshina. SmartDec: approaching C++ decompilation. In Proceedings of the 18th Working Conference on Reverse Engineering (WCRE'11), pages 347–356, 2011.

[8] Jan Gray. C++: Under the Hood. http://www.openrce.org/articles/files/jangrayhood.pdf, 1994.

[9] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation (PLDI'88), pages 35–46, 1988.

[10] Harold Johnson. Data flow analysis for 'intractable' system software. In SIGPLAN Symposium on Compiler Construction, pages 109–117, 1986.

[11] James C. King. Symbolic execution and program testing. Communications of the ACM (CACM), 19(7), July 1976.

[12] Ákos Kiss, Judit Jász, and Tibor Gyimóthy. Using dynamic information in the interprocedural static slicing of binary executables. Software Quality Control, 13(3):227–245, September 2005.

[13] JongHyup Lee, Thanassis Avgerinos, and David Brumley. TIE: principled reverse engineering of types in binary programs. In NDSS. The Internet Society, 2011.

[14] Z. Lin, X. Zhang, and D. Xu. Automatic reverse engineering of data structures from binary execution. In Proceedings of the Network and Distributed System Security Symposium (NDSS'10), March 2010.

[15] Dan Quinlan. ROSE: compiler support for object-oriented frameworks. In Parallel Processing Letters 10, no. 02n03, pages 215–226, 2000.

[16] G. Ramalingam, John Field, and Frank Tip.
Aggregate structure identification and its application to program analysis. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'99), pages 119–132, New York, NY, USA, 1999. ACM.

[17] ROSE website. http://www.rosecompiler.org.

[18] Paul Vincent Sabanal and Mark Vincent Yason. Reversing C++. http://www.blackhat.com/presentations/bh-dc-07/Sabanal_Yason/Paper/bh-dc-07-Sabanal_Yason-WP.pdf.

[19] Asia Slowinska, Traian Stancescu, and Herbert Bos. DDE: dynamic data structure excavation. In Proceedings of the First ACM Asia-Pacific Workshop on Systems (APSys'10), pages 13–18, New York, NY, USA, 2010. ACM.

[20] V. K. Srinivasan and T. Reps. Software architecture recovery from machine code. Technical Report TR1781, University of Wisconsin-Madison, March 2013. http://digital.library.wisc.edu/1793/65091.

[21] Jens Tröger and Cristina Cifuentes. Analysis of virtual method invocation for binary translation. In Proceedings of the 9th Working Conference on Reverse Engineering (WCRE'02), IEEE Computer Society, pages 65–, 2002.

[MS-SHLLINK] — v20100601
Shell Link (.LNK) Binary File Format
Copyright © 2010 Microsoft Corporation.
Release: Tuesday, June 1, 2010

[MS-SHLLINK]: Shell Link (.LNK) Binary File Format

Intellectual Property Rights Notice for Open Specifications Documentation

- Technical Documentation. Microsoft publishes Open Specifications documentation for protocols, file formats, languages, and standards, as well as overviews of the interaction among each of these technologies.

- Copyrights. This documentation is covered by Microsoft copyrights.
Regardless of any other terms that are contained in the terms of use for the Microsoft website that hosts this documentation, you may make copies of it in order to develop implementations of the technologies described in the Open Specifications and may distribute portions of it in your implementations using these technologies or your documentation as necessary to properly document the implementation. You may also distribute in your implementation, with or without modification, any schema, IDLs, or code samples that are included in the documentation. This permission also applies to any documents that are referenced in the Open Specifications.

- No Trade Secrets. Microsoft does not claim any trade secret rights in this documentation.

- Patents. Microsoft has patents that may cover your implementations of the technologies described in the Open Specifications. Neither this notice nor Microsoft's delivery of the documentation grants any licenses under those or any other Microsoft patents. However, a given Open Specification may be covered by Microsoft's Open Specification Promise (available here: http://www.microsoft.com/interop/osp) or the Community Promise (available here: http://www.microsoft.com/interop/cp/default.mspx). If you would prefer a written license, or if the technologies described in the Open Specifications are not covered by the Open Specifications Promise or Community Promise, as applicable, patent licenses are available by contacting iplg@microsoft.com.

- Trademarks. The names of companies and products contained in this documentation may be covered by trademarks or similar intellectual property rights. This notice does not grant any licenses under those rights.

- Fictitious Names. The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted in this documentation are fictitious.
No association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

Reservation of Rights. All other rights are reserved, and this notice does not grant any rights other than as specifically described above, whether by implication, estoppel, or otherwise.

Tools. The Open Specifications do not require the use of Microsoft programming tools or programming environments in order for you to develop an implementation. If you have access to Microsoft programming tools and environments, you are free to take advantage of them. Certain Open Specifications are intended for use in conjunction with publicly available standard specifications and network programming art, and assume that the reader either is familiar with the aforementioned material or has immediate access to it.

Revision Summary

Date        Revision History  Revision Class  Comments
05/22/2009  0.1               Major           First Release.
07/02/2009  0.1.1             Editorial       Revised and edited the technical content.
08/14/2009  0.2               Minor           Updated the technical content.
09/25/2009  0.3               Minor           Updated the technical content.
11/06/2009  0.3.1             Editorial       Revised and edited the technical content.
12/18/2009  0.3.2             Editorial       Revised and edited the technical content.
01/29/2010  0.4               Minor           Updated the technical content.
03/12/2010  0.4.1             Editorial       Revised and edited the technical content.
04/23/2010  0.5               Minor           Updated the technical content.
06/04/2010  0.6               Minor           Updated the technical content.

Contents

1 Introduction
  1.1 Glossary
  1.2 References
    1.2.1 Normative References
    1.2.2 Informative References
  1.3 Overview
  1.4 Relationship to Protocols and Other Structures
  1.5 Applicability Statement
  1.6 Versioning and Localization
  1.7 Vendor-Extensible Fields
2 Structures
  2.1 ShellLinkHeader
    2.1.1 LinkFlags
    2.1.2 FileAttributesFlags
    2.1.3 HotKeyFlags
  2.2 LinkTargetIDList
    2.2.1 IDList
    2.2.2 ItemID
  2.3 LinkInfo
    2.3.1 VolumeID
    2.3.2 CommonNetworkRelativeLink
  2.4 StringData
  2.5 ExtraData
    2.5.1 ConsoleDataBlock
    2.5.2 ConsoleFEDataBlock
    2.5.3 DarwinDataBlock
    2.5.4 EnvironmentVariableDataBlock
    2.5.5 IconEnvironmentDataBlock
    2.5.6 KnownFolderDataBlock
    2.5.7 PropertyStoreDataBlock
    2.5.8 ShimDataBlock
    2.5.9 SpecialFolderDataBlock
    2.5.10 TrackerDataBlock
    2.5.11 VistaAndAboveIDListDataBlock
3 Structure Examples
  3.1 Shortcut to a File
4 Security
5 Appendix A: Product Behavior
6 Change Tracking
7 Index

1 Introduction

This is a specification of the Shell Link Binary File Format. In this format a structure is called a shell link, or shortcut, and is a data object that contains information that can be used to access another data object. The Shell Link Binary File Format is the format of Microsoft Windows® files with the extension "LNK". Shell links are commonly used to support application launching and linking scenarios, such as Object Linking and Embedding (OLE), but they also can be used by applications that need the ability to store a reference to a target file.
1.1 Glossary

The following terms are defined in [MS-GLOS]: American National Standards Institute (ANSI) character set, Augmented Backus-Naur Form (ABNF), class identifier (CLSID), code page, Component Object Model (COM), Coordinated Universal Time (UTC), GUID, little-endian, NetBIOS name, object (3), Unicode, Universal Naming Convention (UNC).

The following terms are specific to this document:

extra data section: A data structure appended to the basic Shell Link Binary File Format data that contains additional information about the link target.

folder integer ID: An integer value that identifies a known folder.

folder GUID ID: A GUID value that identifies a known folder. Some folder GUID ID values correspond to folder integer ID values.

item ID (ItemID): A structure that represents an item in the context of a shell data source.

item ID list (IDList): A data structure that refers to a location. An item ID list is a multi-segment data structure where each segment's content is defined by a data source that is responsible for the location in the namespace referred to by the preceding segments.

link: An object that refers to another item.

link target: The item that a link references. In the case of a shell link, the referenced item is identified by its location in the link target namespace using an item ID list (IDList).

link target namespace: A hierarchical namespace. In Windows, the link target namespace is the Windows Explorer namespace, as described in [C706].

namespace: An abstract container used to hold a set of unique identifiers.

Object Linking and Embedding (OLE): A technology for transferring and sharing information between applications by inserting a file or part of a file into a compound document. The inserted file can be either linked or embedded. An embedded item is stored as part of the compound document that contains it; a linked item stores its data in a separate file.

relative path: A path that is implied by the current working directory or is calculated based on a specified directory. When a user enters a command that refers to a file, and the full path is not entered, the current working directory becomes the relative path of the referenced file.

resolving a link: The act of finding a specific link target, confirming that it exists, and finding whether it has moved.

Red-Green-Blue (RGB): A mapping of color components in which red, green, and blue and an intensity value are combined in various ways to reproduce a range of colors.

shell data source: An object that is responsible for a specific location in the namespace and for enumerating and binding IDLists to handlers.

shell link: A structure in Shell Link Binary File Format.

shim: A mechanism used to provide custom behavior to applications that do not work on newer versions of the operating system.

shortcut: A term that is used synonymously with shell link.

MAY, SHOULD, MUST, SHOULD NOT, MUST NOT: These terms (in all caps) are used as described in [RFC2119]. All statements of optional behavior use either MAY, SHOULD, or SHOULD NOT.

1.2 References

1.2.1 Normative References

We conduct frequent surveys of the normative references to assure their continued availability. If you have any issue with finding a normative reference, please contact dochelp@microsoft.com. We will assist you in finding the relevant information. Please check the archive site, http://msdn2.microsoft.com/en-us/library/E4BD6494-06AD-4aed-9823-445E921C9624, as an additional source.

[MS-DFSNM] Microsoft Corporation, "Distributed File System (DFS): Namespace Management Protocol Specification", September 2007.

[MS-DTYP] Microsoft Corporation, "Windows Data Types", January 2007.
[MS-LCID] Microsoft Corporation, "Windows Language Code Identifier (LCID) Reference", July 2007.

[MS-PROPSTORE] Microsoft Corporation, "Property Store Binary File Format", May 2009.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997, http://www.ietf.org/rfc/rfc2119.txt

[RFC5234] Crocker, D., Ed., and Overell, P., "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008, http://www.ietf.org/rfc/rfc5234.txt

1.2.2 Informative References

[C706] The Open Group, "DCE 1.1: Remote Procedure Call", C706, August 1997, http://www.opengroup.org/public/pubs/catalog/c706.htm

[MS-DLTW] Microsoft Corporation, "Distributed Link Tracking: Workstation Protocol Specification", January 2007.

[MS-GLOS] Microsoft Corporation, "Windows Protocols Master Glossary", March 2007.

[MSCHARSET] Microsoft Corporation, "INFO: Windows, Code Pages, and Character Sets", February 2005, http://support.microsoft.com/kb/75435

[MSDN-CODEPAGE] Microsoft Corporation, "Code Pages", http://msdn.microsoft.com/en-us/goglobal/bb964653.aspx

[MSDN-ISHELLLINK] Microsoft Corporation, "IShellLink Interface", http://msdn.microsoft.com/en-us/library/bb774950.aspx

[MS-CFB] Microsoft Corporation, "Compound File Binary File Format", October 2008.

[MSDN-MSISHORTCUTS] Microsoft Corporation, "How Windows Installer Shortcuts Work", http://support.microsoft.com/kb/243630

1.3 Overview

The Shell Link Binary File Format specifies a structure called a shell link. That structure is used to store a reference to a location in a link target namespace, which is referred to as a link target. The most important component of a shell link is the link target, which is specified in the form of an item ID list (IDList).
The shell link structure stores various information that is useful to end users, including:

- A keyboard shortcut that can be used to launch an application.
- A descriptive comment.
- Settings that control application behavior.
- Optional data stored in extra data sections.

Optional data can include a property store that contains an extensible set of properties in the format that is described in [MS-PROPSTORE].

The Shell Link Binary File Format can be managed using a COM object, programmed using the IShellLink interface, and saved into its persistence format using the IPersistStream or IPersistFile interface. It is most common for shell links to be stored in a file with the .LNK file extension. By using the IPersistStream interface, a shell link can be saved into another storage system, for example a database or the registry, or embedded in another file format. For more information, see [MSDN-ISHELLLINK].

Multi-byte data values in the Shell Link Binary File Format are stored in little-endian format.

1.4 Relationship to Protocols and Other Structures

The Shell Link Binary File Format is used by the Compound File Binary File Format [MS-CFB]. The Shell Link Binary File Format uses the Property Store Binary File Format [MS-PROPSTORE].

1.5 Applicability Statement

This document specifies a persistence format for links to files in a file system or to applications that are available for installation. This persistence format is applicable for use as a stand-alone file and for containment within other structures.

1.6 Versioning and Localization

This specification covers versioning issues in the following areas:

Localization: The Shell Link Binary File Format defines the ConsoleFEDataBlock structure (section 2.5.2), which specifies a code page for displaying text.
That value can be used to specify a set of characters for a particular language or locale.

1.7 Vendor-Extensible Fields

A shell data source can extend the persistence format by storing custom data inside ItemID structures. The ItemIDs embedded in an IDList are in a format specified by the shell data sources that manage the ItemIDs. The ItemIDs are free to store whatever data is needed in this structure to uniquely identify the items in their namespace. The property store embedded in a link can be used to store property values in the shell link.

2 Structures

The Shell Link Binary File Format consists of a sequence of structures that conform to the following ABNF rules [RFC5234]:

SHELL_LINK = SHELL_LINK_HEADER [LINKTARGET_IDLIST] [LINKINFO]
             [STRING_DATA] *EXTRA_DATA

SHELL_LINK_HEADER: A ShellLinkHeader structure (section 2.1), which contains identification information, timestamps, and flags that specify the presence of optional structures.

LINKTARGET_IDLIST: An optional LinkTargetIDList structure (section 2.2), which specifies the target of the link. The presence of this structure is specified by the HasLinkTargetIDList bit (LinkFlags section 2.1.1) in the ShellLinkHeader.

LINKINFO: An optional LinkInfo structure (section 2.3), which specifies information necessary to resolve the link target. The presence of this structure is specified by the HasLinkInfo bit (LinkFlags section 2.1.1) in the ShellLinkHeader.

STRING_DATA: Zero or more optional StringData structures (section 2.4), which are used to convey user interface and path identification information. The presence of these structures is specified by bits (LinkFlags section 2.1.1) in the ShellLinkHeader.

EXTRA_DATA: Zero or more ExtraData structures (section 2.5).
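As a quick illustration of this sequence, the following Python sketch (illustrative only, not part of the specification) validates the two fixed ShellLinkHeader constants defined in section 2.1 and reports which of the optional top-level structures follow, using the HasLinkTargetIDList and HasLinkInfo bits defined in section 2.1.1:

```python
import struct
import uuid

# MUST-values from section 2.1 of this specification.
LINK_CLSID = uuid.UUID("00021401-0000-0000-c000-000000000046")
HEADER_SIZE = 0x4C

def check_shell_link(data: bytes) -> dict:
    """Validate the fixed ShellLinkHeader constants and report which
    optional top-level structures follow the header."""
    header_size, = struct.unpack_from("<I", data, 0)
    if header_size != HEADER_SIZE:
        raise ValueError("HeaderSize MUST be 0x0000004C")
    # LinkCLSID is stored in little-endian GUID layout on disk.
    if uuid.UUID(bytes_le=bytes(data[4:20])) != LINK_CLSID:
        raise ValueError("LinkCLSID MUST be 00021401-0000-0000-C000-000000000046")
    link_flags, = struct.unpack_from("<I", data, 20)
    return {
        "HasLinkTargetIDList": bool(link_flags & 0x00000001),
        "HasLinkInfo": bool(link_flags & 0x00000002),
    }
```

A conforming header is exactly 76 (0x4C) bytes, so the remaining header fields (timestamps, FileSize, and so on) sit at fixed offsets after LinkFlags.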
2.1 ShellLinkHeader

The ShellLinkHeader structure contains identification information, timestamps, and flags that specify the presence of optional structures, including LinkTargetIDList (section 2.2), LinkInfo (section 2.3), and StringData (section 2.4). The structure consists of the following fields, in order: HeaderSize, LinkCLSID, LinkFlags, FileAttributes, CreationTime, AccessTime, WriteTime, FileSize, IconIndex, ShowCommand, HotKey, Reserved1, Reserved2, Reserved3.

HeaderSize (4 bytes): The size, in bytes, of this structure. This value MUST be 0x0000004C.

LinkCLSID (16 bytes): A class identifier (CLSID). This value MUST be 00021401-0000-0000-C000-000000000046.

LinkFlags (4 bytes): A LinkFlags structure (section 2.1.1) that specifies information about the shell link and the presence of optional portions of the structure.

FileAttributes (4 bytes): A FileAttributesFlags structure (section 2.1.2) that specifies information about the link target.

CreationTime (8 bytes): A FILETIME structure ([MS-DTYP] section 2.3.1) that specifies the creation time of the link target in UTC (Coordinated Universal Time). If the value is zero, there is no creation time set on the link target.

AccessTime (8 bytes): A FILETIME structure ([MS-DTYP] section 2.3.1) that specifies the access time of the link target in UTC (Coordinated Universal Time). If the value is zero, there is no access time set on the link target.

WriteTime (8 bytes): A FILETIME structure ([MS-DTYP] section 2.3.1) that specifies the write time of the link target in UTC (Coordinated Universal Time). If the value is zero, there is no write time set on the link target.

FileSize (4 bytes): A 32-bit unsigned integer that specifies the size, in bytes, of the link target. If the link target file is larger than 0xFFFFFFFF, this value specifies the least significant 32 bits of the link target file size.

IconIndex (4 bytes): A 32-bit signed integer that specifies the index of an icon within a given icon location.

ShowCommand (4 bytes): A 32-bit unsigned integer that specifies the expected window state of an application launched by the link. This value SHOULD be one of the following:

SW_SHOWNORMAL (0x00000001): The application is open and its window is open in a normal fashion.
SW_SHOWMAXIMIZED (0x00000003): The application is open, and keyboard focus is given to the application, but its window is not shown.
SW_SHOWMINNOACTIVE (0x00000007): The application is open, but its window is not shown. It is not given the keyboard focus.

All other values MUST be treated as SW_SHOWNORMAL.

HotKey (2 bytes): A HotKeyFlags structure (section 2.1.3) that specifies the keystrokes used to launch the application referenced by the shortcut key. This value is assigned to the application after it is launched, so that pressing the key activates that application.

Reserved1 (2 bytes): A value that MUST be zero.

Reserved2 (4 bytes): A value that MUST be zero.

Reserved3 (4 bytes): A value that MUST be zero.

2.1.1 LinkFlags

The LinkFlags structure defines bits that specify which shell link structures are present in the file format after the ShellLinkHeader structure (section 2.1). The bits, from least significant (A) to most significant (AA), are defined as follows; the remaining five high bits MUST be zero:

A HasLinkTargetIDList: The shell link is saved with an item ID list (IDList). If this bit is set, a LinkTargetIDList structure (section 2.2) MUST follow the ShellLinkHeader.

B HasLinkInfo: The shell link is saved with link information. If this bit is set, a LinkInfo structure (section 2.3) MUST be present.

C HasName: The shell link is saved with a name string. If this bit is set, a NAME_STRING StringData structure (section 2.4) MUST be present.

D HasRelativePath: The shell link is saved with a relative path string. If this bit is set, a RELATIVE_PATH StringData structure (section 2.4) MUST be present.

E HasWorkingDir: The shell link is saved with a working directory string. If this bit is set, a WORKING_DIR StringData structure (section 2.4) MUST be present.

F HasArguments: The shell link is saved with command line arguments. If this bit is set, a COMMAND_LINE_ARGUMENTS StringData structure (section 2.4) MUST be present.

G HasIconLocation: The shell link is saved with an icon location string. If this bit is set, an ICON_LOCATION StringData structure (section 2.4) MUST be present.

H IsUnicode: The shell link contains Unicode encoded strings. This bit SHOULD be set.

I ForceNoLinkInfo: The LinkInfo structure (section 2.3) is ignored.

J HasExpString: The shell link is saved with an EnvironmentVariableDataBlock (section 2.5.4).

K RunInSeparateProcess: The target is run in a separate virtual machine when launching a link target that is a 16-bit application.

L Unused1: A bit that is undefined and MUST be ignored.

M HasDarwinID: The shell link is saved with a DarwinDataBlock (section 2.5.3).

N RunAsUser: The application is run as a different user when the target of the shell link is activated.

O HasExpIcon: The shell link is saved with an IconEnvironmentDataBlock (section 2.5.5).

P NoPidlAlias: The file system location is represented in the shell namespace when the path to an item is parsed into an IDList.

Q Unused2: A bit that is undefined and MUST be ignored.

R RunWithShimLayer: The shell link is saved with a ShimDataBlock (section 2.5.8).

S ForceNoLinkTrack: The TrackerDataBlock (section 2.5.10) is ignored.

T EnableTargetMetadata: The shell link attempts to collect target properties and store them in the PropertyStoreDataBlock (section 2.5.7) when the link target is set.

U DisableLinkPathTracking: The EnvironmentVariableDataBlock is ignored.

V DisableKnownFolderTracking: The SpecialFolderDataBlock (section 2.5.9) and the KnownFolderDataBlock (section 2.5.6) are ignored when loading the shell link. If this bit is set, these extra data blocks SHOULD NOT be saved when saving the shell link.

W DisableKnownFolderAlias: If the link has a KnownFolderDataBlock (section 2.5.6), the unaliased form of the known folder IDList SHOULD be used when translating the target IDList at the time that the link is loaded.

X AllowLinkToLink: Creating a link that references another link is enabled. Otherwise, specifying a link as the target IDList SHOULD NOT be allowed.

Y UnaliasOnSave: When saving a link for which the target IDList is under a known folder, either the unaliased form of that known folder or the target IDList SHOULD be used.

Z PreferEnvironmentPath: The target IDList SHOULD NOT be stored; instead, the path specified in the EnvironmentVariableDataBlock (section 2.5.4) SHOULD be used to refer to the target.

AA KeepLocalIDListForUNCTarget: When the target is a UNC name that refers to a location on a local machine, the local path IDList in the PropertyStoreDataBlock (section 2.5.7) SHOULD be stored, so it can be used when the link is loaded on the local machine.

2.1.2 FileAttributesFlags

The FileAttributesFlags structure defines bits that specify the file attributes of the link target, if the target is a file system item. File attributes can be used if the link target is not available, or if accessing the target would be inefficient.
It is possible for the target item's attributes to be out of sync with this value. The bits, from least significant (A) upward, are defined as follows; the remaining high bits MUST be zero:

A FILE_ATTRIBUTE_READONLY: The file or directory is read-only. For a file, if this bit is set, applications can read the file but cannot write to it or delete it. For a directory, if this bit is set, applications cannot delete the directory.

B FILE_ATTRIBUTE_HIDDEN: The file or directory is hidden. If this bit is set, the file or folder is not included in an ordinary directory listing.

C FILE_ATTRIBUTE_SYSTEM: The file or directory is part of the operating system or is used exclusively by the operating system.

D Reserved1: A bit that MUST be zero.

E FILE_ATTRIBUTE_DIRECTORY: The link target is a directory instead of a file.

F FILE_ATTRIBUTE_ARCHIVE: The file or directory is an archive file. Applications use this flag to mark files for backup or removal.

G Reserved2: A bit that MUST be zero.

H FILE_ATTRIBUTE_NORMAL: The file or directory has no other flags set. If this bit is 1, all other bits in this structure MUST be clear.

I FILE_ATTRIBUTE_TEMPORARY: The file is being used for temporary storage.

J FILE_ATTRIBUTE_SPARSE_FILE: The file is a sparse file.

K FILE_ATTRIBUTE_REPARSE_POINT: The file or directory has an associated reparse point.

L FILE_ATTRIBUTE_COMPRESSED: The file or directory is compressed. For a file, this means that all data in the file is compressed. For a directory, this means that compression is the default for newly created files and subdirectories.

M FILE_ATTRIBUTE_OFFLINE: The data of the file is not immediately available.

N FILE_ATTRIBUTE_NOT_CONTENT_INDEXED: The contents of the file need to be indexed.

O FILE_ATTRIBUTE_ENCRYPTED: The file or directory is encrypted. For a file, this means that all data in the file is encrypted. For a directory, this means that encryption is the default for newly created files and subdirectories.

2.1.3 HotKeyFlags

The HotKeyFlags structure specifies input generated by a combination of keyboard keys being pressed. It consists of a LowByte followed by a HighByte.

LowByte (1 byte): An 8-bit unsigned integer that specifies a virtual key code that corresponds to a key on the keyboard. This value MUST be one of the following:

0x30 through 0x39: "0" key through "9" key
0x41 through 0x5A: "A" key through "Z" key
VK_F1 (0x70) through VK_F24 (0x87): "F1" key through "F24" key
VK_NUMLOCK (0x90): "NUM LOCK" key
VK_SCROLL (0x91): "SCROLL LOCK" key

HighByte (1 byte): An 8-bit unsigned integer that specifies bits that correspond to modifier keys on the keyboard. This value MUST be one or a combination of the following:

HOTKEYF_SHIFT (0x01): The "SHIFT" key on the keyboard.
HOTKEYF_CONTROL (0x02): The "CTRL" key on the keyboard.
HOTKEYF_ALT (0x04): The "ALT" key on the keyboard.

2.2 LinkTargetIDList

The LinkTargetIDList structure specifies the target of the link. The presence of this optional structure is specified by the HasLinkTargetIDList bit (LinkFlags section 2.1.1) in the ShellLinkHeader (section 2.1).

IDListSize (2 bytes): The size, in bytes, of the IDList field.

IDList (variable): A stored IDList structure (section 2.2.1), which contains the item ID list. An IDList structure conforms to the following ABNF [RFC5234]:

IDLIST = *ITEMID TERMINALID

2.2.1 IDList

The stored IDList structure specifies the format of a persisted item ID list.

ItemIDList (variable): An array of zero or more ItemID structures (section 2.2.2).

TerminalID (2 bytes): A 16-bit, unsigned integer that indicates the end of the item IDs. This value MUST be zero.
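The termination rule lends itself to a simple loop. A minimal Python sketch (illustrative only, not part of the specification) that splits a stored IDList into its ItemID Data payloads:

```python
import struct

def parse_idlist(buf: bytes) -> list:
    """Split a stored IDList (section 2.2.1) into ItemID Data payloads.
    Each ItemIDSize counts the 2-byte size field itself, so the Data
    payload is ItemIDSize - 2 bytes; a zero TerminalID ends the list."""
    items, pos = [], 0
    while True:
        item_id_size, = struct.unpack_from("<H", buf, pos)
        if item_id_size == 0:  # TerminalID: end of the item IDs
            return items
        items.append(buf[pos + 2 : pos + item_id_size])
        pos += item_id_size
```

For example, two ItemIDs of sizes 5 and 4 followed by a zero TerminalID yield two payloads of 3 and 2 bytes; interpreting each payload is up to the shell data source that owns that segment of the namespace.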
2.2.2 ItemID

An ItemID is an element in an IDList structure (section 2.2.1). The data stored in a given ItemID is defined by the source that corresponds to the location in the target namespace of the preceding ItemIDs. This data uniquely identifies the items in that part of the namespace.

ItemIDSize (2 bytes): A 16-bit, unsigned integer that specifies the size, in bytes, of the ItemID structure, including the ItemIDSize field.

Data (variable): The shell data source-defined data that specifies an item.

2.3 LinkInfo

The LinkInfo structure specifies information necessary to resolve a link target if it is not found in its original location. This includes information about the volume that the target was stored on, the mapped drive letter, and a Universal Naming Convention (UNC) form of the path if one existed when the link was created. For more details about UNC paths, see [MS-DFSNM] section 2.2.1.4.

The structure consists of the following fields, in order: LinkInfoSize, LinkInfoHeaderSize, LinkInfoFlags, VolumeIDOffset, LocalBasePathOffset, CommonNetworkRelativeLinkOffset, CommonPathSuffixOffset, LocalBasePathOffsetUnicode (optional), CommonPathSuffixOffsetUnicode (optional), VolumeID (variable), LocalBasePath (variable), CommonNetworkRelativeLink (variable), CommonPathSuffix (variable), LocalBasePathUnicode (variable), CommonPathSuffixUnicode (variable).

LinkInfoSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in bytes, of the LinkInfo structure. All offsets specified in this structure MUST be less than this value, and all strings contained in this structure MUST fit within the extent defined by this size.
LinkInfoHeaderSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in bytes, of the LinkInfo header section, which includes all specified offsets. This value MUST be defined as shown in the following table, and it MUST be less than LinkInfoSize.<1>

0x0000001C: Offsets to the optional fields are not specified.
0x00000024 ≤ value: Offsets to the optional fields are specified.

LinkInfoFlags (4 bytes): Flags that specify whether the VolumeID, LocalBasePath, LocalBasePathUnicode, and CommonNetworkRelativeLink fields are present in this structure. All bits other than A and B MUST be zero. The bits are defined as:

A VolumeIDAndLocalBasePath: If set, the VolumeID and LocalBasePath fields are present, and their locations are specified by the values of the VolumeIDOffset and LocalBasePathOffset fields, respectively. If the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024, the LocalBasePathUnicode field is present, and its location is specified by the value of the LocalBasePathOffsetUnicode field. If not set, the VolumeID, LocalBasePath, and LocalBasePathUnicode fields are not present, and the values of the VolumeIDOffset and LocalBasePathOffset fields are zero. If the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024, the value of the LocalBasePathOffsetUnicode field is zero.

B CommonNetworkRelativeLinkAndPathSuffix: If set, the CommonNetworkRelativeLink field is present, and its location is specified by the value of the CommonNetworkRelativeLinkOffset field. If not set, the CommonNetworkRelativeLink field is not present, and the value of the CommonNetworkRelativeLinkOffset field is zero.

VolumeIDOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the VolumeID field. If the VolumeIDAndLocalBasePath flag is set, this value is an offset, in bytes, from the start of the LinkInfo structure; otherwise, this value MUST be zero.

LocalBasePathOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the LocalBasePath field. If the VolumeIDAndLocalBasePath flag is set, this value is an offset, in bytes, from the start of the LinkInfo structure; otherwise, this value MUST be zero.

CommonNetworkRelativeLinkOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the CommonNetworkRelativeLink field. If the CommonNetworkRelativeLinkAndPathSuffix flag is set, this value is an offset, in bytes, from the start of the LinkInfo structure; otherwise, this value MUST be zero.

CommonPathSuffixOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the CommonPathSuffix field. This value is an offset, in bytes, from the start of the LinkInfo structure.

LocalBasePathOffsetUnicode (4 bytes): An optional, 32-bit, unsigned integer that specifies the location of the LocalBasePathUnicode field. If the VolumeIDAndLocalBasePath flag is set, this value is an offset, in bytes, from the start of the LinkInfo structure; otherwise, this value MUST be zero. This field can be present only if the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024.

CommonPathSuffixOffsetUnicode (4 bytes): An optional, 32-bit, unsigned integer that specifies the location of the CommonPathSuffixUnicode field. This value is an offset, in bytes, from the start of the LinkInfo structure. This field can be present only if the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024.
VolumeID (variable): An optional VolumeID structure (section 2.3.1) that specifies information about the volume that the link target was on when the link was created. This field is present if the VolumeIDAndLocalBasePath flag is set.

LocalBasePath (variable): An optional, NULL-terminated string, defined by the system default code page, which is used to construct the full path to the link item or link target by appending the string in the CommonPathSuffix field. This field is present if the VolumeIDAndLocalBasePath flag is set.

CommonNetworkRelativeLink (variable): An optional CommonNetworkRelativeLink structure (section 2.3.2) that specifies information about the network location where the link target is stored.

CommonPathSuffix (variable): A NULL-terminated string, defined by the system default code page, which is used to construct the full path to the link item or link target by being appended to the string in the LocalBasePath field.

LocalBasePathUnicode (variable): An optional, NULL-terminated, Unicode string that is used to construct the full path to the link item or link target by appending the string in the CommonPathSuffixUnicode field. This field can be present only if the VolumeIDAndLocalBasePath flag is set and the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024.

CommonPathSuffixUnicode (variable): An optional, NULL-terminated, Unicode string that is used to construct the full path to the link item or link target by being appended to the string in the LocalBasePathUnicode field. This field can be present only if the value of the LinkInfoHeaderSize field is greater than or equal to 0x00000024.

2.3.1 VolumeID

The VolumeID structure specifies information about the volume that a link target was on when the link was created. This information is useful for resolving the link if the file is not found in its original location.
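The LocalBasePath/CommonPathSuffix concatenation rule of section 2.3 can be made concrete with a small sketch (illustrative only; the fixed field offsets follow the LinkInfo field order above, treating the system default code page as Latin-1 is purely an assumption for the example, and the sample values in the test are hypothetical):

```python
import struct

VOLUME_ID_AND_LOCAL_BASE_PATH = 0x00000001  # LinkInfoFlags bit A

def read_cstr(buf: bytes, offset: int) -> str:
    """Read a NULL-terminated string; Latin-1 stands in for the
    system default code page in this sketch."""
    return buf[offset:buf.index(b"\x00", offset)].decode("latin-1")

def local_path(link_info: bytes):
    """Rebuild the target path from a LinkInfo blob: the LocalBasePath
    string with the CommonPathSuffix string appended, when the
    VolumeIDAndLocalBasePath flag is set; None otherwise."""
    flags, = struct.unpack_from("<I", link_info, 8)        # LinkInfoFlags
    if not flags & VOLUME_ID_AND_LOCAL_BASE_PATH:
        return None
    base_off, = struct.unpack_from("<I", link_info, 16)    # LocalBasePathOffset
    suffix_off, = struct.unpack_from("<I", link_info, 24)  # CommonPathSuffixOffset
    return read_cstr(link_info, base_off) + read_cstr(link_info, suffix_off)
```

Both offsets are measured from the start of the LinkInfo structure itself, which is why the sketch indexes into the whole blob rather than into the data area after the header.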
0 1 2 3 4 5 6 7 8 9 1 0 1 2 3 4 5 6 7 8 9 2 0 1 2 3 4 5 6 7 8 9 3 0 1 VolumeIDSize DriveType DriveSerialNumber VolumeLabelOffset 22 / 52 [MS-SHLLINK] — v20100601 Shell Link (.LNK) Binary File Format Copyright © 2010 Microsoft Corporation. Release: Tuesday, June 1, 2010 VolumeLabelOffsetUnicode (optional) Data (variable) ... VolumeIDSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in bytes, of this structure. This value MUST be greater than 0x00000010. All offsets specified in this structure MUST be less than this value, and all strings contained in this structure MUST fit within t he extent defined by this size. DriveType (4 bytes): A 32-bit, unsigned integer that specifies the type of drive the link target is stored on. This value MUST be one of the following: Value Meaning DRIVE_UNKNOWN 0x00000000 The drive type cannot be determined. DRIVE_NO_ROOT_DIR 0x00000001 The root path is invalid; for example, there is no volume mounted at the path. DRIVE_REMOVABLE 0x00000002 The drive has removable media, such as a floppy drive, thumb drive, or flash card reader. DRIVE_FIXED 0x00000003 The drive has fixed media, such as a hard drive or flash drive. DRIVE_REMOTE 0x00000004 The drive is a remote (network) drive. DRIVE_CDROM 0x00000005 The drive is a CD -ROM drive. DRIVE_RAMDISK 0x00000006 The drive is a RAM disk. DriveSerialNumber (4 bytes): A 32-bit, unsigned integer that specifies the drive serial number of the volume the link target is stored on. VolumeLabelOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of a string that contains the volume label of the drive that the link target is stored on. This value is an offset, in bytes, from the start of the VolumeID structure to a NULL-terminated string of characters, defined by the system default code page. The volume label string is located in the Data field of this structure. 
If the value of this field is 0x00000014, it MUST be ignored, and the value of the VolumeLabelOffsetUnicode field MUST be used to locate the volume label string.

VolumeLabelOffsetUnicode (4 bytes): An optional, 32-bit, unsigned integer that specifies the location of a string that contains the volume label of the drive that the link target is stored on. This value is an offset, in bytes, from the start of the VolumeID structure to a NULL-terminated string of Unicode characters. The volume label string is located in the Data field of this structure. If the value of the VolumeLabelOffset field is not 0x00000014, this field MUST be ignored, and the value of the VolumeLabelOffset field MUST be used to locate the volume label string.

Data (variable): A buffer of data that contains the volume label of the drive as a string defined by the system default code page or Unicode characters, as specified by preceding fields.

2.3.2 CommonNetworkRelativeLink

The CommonNetworkRelativeLink structure specifies information about the network location where a link target is stored, including the mapped drive letter and the UNC path prefix. For details on UNC paths, see [MS-DFSNM] section 2.2.1.4.

Layout: CommonNetworkRelativeLinkSize, CommonNetworkRelativeLinkFlags, NetNameOffset, DeviceNameOffset, NetworkProviderType, NetNameOffsetUnicode (optional), DeviceNameOffsetUnicode (optional), NetName (variable), DeviceName (variable), NetNameUnicode (variable), DeviceNameUnicode (variable).

CommonNetworkRelativeLinkSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in bytes, of the CommonNetworkRelativeLink structure.
This value MUST be greater than or equal to 0x00000014. All offsets specified in this structure MUST be less than this value, and all strings contained in this structure MUST fit within the extent defined by this size.

CommonNetworkRelativeLinkFlags (4 bytes): Flags that specify the contents of the DeviceNameOffset and NetProviderType fields. Only the bits A and B are defined; all remaining bits MUST be zero. The bits are defined as:

A (ValidDevice): If set, the DeviceNameOffset field contains an offset to the device name. If not set, the DeviceNameOffset field does not contain an offset to the device name, and its value MUST be zero.

B (ValidNetType): If set, the NetProviderType field contains the network provider type. If not set, the NetProviderType field does not contain the network provider type, and its value MUST be zero.

NetNameOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the NetName field. This value is an offset, in bytes, from the start of the CommonNetworkRelativeLink structure.

DeviceNameOffset (4 bytes): A 32-bit, unsigned integer that specifies the location of the DeviceName field. If the ValidDevice flag is set, this value is an offset, in bytes, from the start of the CommonNetworkRelativeLink structure; otherwise, this value MUST be zero.

NetworkProviderType (4 bytes): A 32-bit, unsigned integer that specifies the type of network provider. If the ValidNetType flag is set, this value MUST be one of the following; otherwise, this value MUST be ignored.

WNNC_NET_AVID 0x001A0000
WNNC_NET_DOCUSPACE 0x001B0000
WNNC_NET_MANGOSOFT 0x001C0000
WNNC_NET_SERNET 0x001D0000
WNNC_NET_RIVERFRONT1 0x001E0000
WNNC_NET_RIVERFRONT2 0x001F0000
WNNC_NET_DECORB 0x00200000
WNNC_NET_PROTSTOR 0x00210000
WNNC_NET_FJ_REDIR 0x00220000
WNNC_NET_DISTINCT 0x00230000
WNNC_NET_TWINS 0x00240000
WNNC_NET_RDR2SAMPLE 0x00250000
WNNC_NET_CSC 0x00260000
WNNC_NET_3IN1 0x00270000
WNNC_NET_EXTENDNET 0x00290000
WNNC_NET_STAC 0x002A0000
WNNC_NET_FOXBAT 0x002B0000
WNNC_NET_YAHOO 0x002C0000
WNNC_NET_EXIFS 0x002D0000
WNNC_NET_DAV 0x002E0000
WNNC_NET_KNOWARE 0x002F0000
WNNC_NET_OBJECT_DIRE 0x00300000
WNNC_NET_MASFAX 0x00310000
WNNC_NET_HOB_NFS 0x00320000
WNNC_NET_SHIVA 0x00330000
WNNC_NET_IBMAL 0x00340000
WNNC_NET_LOCK 0x00350000
WNNC_NET_TERMSRV 0x00360000
WNNC_NET_SRT 0x00370000
WNNC_NET_QUINCY 0x00380000
WNNC_NET_OPENAFS 0x00390000
WNNC_NET_AVID1 0x003A0000
WNNC_NET_DFS 0x003B0000
WNNC_NET_KWNP 0x003C0000
WNNC_NET_ZENWORKS 0x003D0000
WNNC_NET_DRIVEONWEB 0x003E0000
WNNC_NET_VMWARE 0x003F0000
WNNC_NET_RSFX 0x00400000
WNNC_NET_MFILES 0x00410000
WNNC_NET_MS_NFS 0x00420000
WNNC_NET_GOOGLE 0x00430000

NetNameOffsetUnicode (4 bytes): An optional, 32-bit, unsigned integer that specifies the location of the NetNameUnicode field. This value is an offset, in bytes, from the start of the CommonNetworkRelativeLink structure. This field MUST be present if the value of the NetNameOffset field is greater than 0x00000014; otherwise, this field MUST NOT be present.

DeviceNameOffsetUnicode (4 bytes): An optional, 32-bit, unsigned integer that specifies the location of the DeviceNameUnicode field. This value is an offset, in bytes, from the start of the CommonNetworkRelativeLink structure. This field MUST be present if the value of the NetNameOffset field is greater than 0x00000014; otherwise, this field MUST NOT be present.
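The fixed-size offset fields above are what locate the variable-length strings that follow them. A minimal parsing sketch, not part of the specification itself; it assumes ANSI-only strings, uses cp1252 as a stand-in for the system default code page, and treats ValidDevice as the low-order flag bit (0x1), as in common implementations:

```python
import struct

def read_common_network_relative_link(data: bytes) -> dict:
    """Sketch: resolve the NetName/DeviceName strings through the offset
    fields of a CommonNetworkRelativeLink ([MS-SHLLINK] 2.3.2).
    Assumptions: ANSI-only strings, cp1252 standing in for the system
    default code page, ValidDevice as flag bit 0x1."""
    size, flags, net_off, dev_off, _net_provider = struct.unpack_from("<5I", data, 0)
    if size < 0x14:
        raise ValueError("CommonNetworkRelativeLinkSize MUST be >= 0x00000014")

    def cstr(off: int) -> str:
        # Strings are NULL-terminated and must lie within the structure.
        end = data.index(b"\x00", off)
        return data[off:end].decode("cp1252")

    result = {"net_name": cstr(net_off)}
    if flags & 0x1:  # ValidDevice: DeviceNameOffset is meaningful
        result["device_name"] = cstr(dev_off)
    return result
```

The Unicode variants would be read the same way through NetNameOffsetUnicode and DeviceNameOffsetUnicode when NetNameOffset is greater than 0x00000014.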
NetName (variable): A NULL-terminated string, as defined by the system default code page, which specifies a server share path; for example, "\\server\share".

DeviceName (variable): A NULL-terminated string, as defined by the system default code page, which specifies a device; for example, the drive letter "D:".

NetNameUnicode (variable): An optional, NULL-terminated, Unicode string that is the Unicode version of the NetName string. This field MUST be present if the value of the NetNameOffset field is greater than 0x00000014; otherwise, this field MUST NOT be present.

DeviceNameUnicode (variable): An optional, NULL-terminated, Unicode string that is the Unicode version of the DeviceName string. This field MUST be present if the value of the NetNameOffset field is greater than 0x00000014; otherwise, this field MUST NOT be present.

2.4 StringData

StringData refers to a set of structures that convey user interface and path identification information. The presence of these optional structures is controlled by LinkFlags (section 2.1.1) in the ShellLinkHeader (section 2.1). The StringData structures conform to the following ABNF rules [RFC5234].

STRING_DATA = [NAME_STRING] [RELATIVE_PATH] [WORKING_DIR] [COMMAND_LINE_ARGUMENTS] [ICON_LOCATION]

NAME_STRING: An optional structure that specifies a description of the shortcut that is displayed to end users to identify the purpose of the shell link. This structure MUST be present if the HasName flag is set.

RELATIVE_PATH: An optional structure that specifies the location of the link target relative to the file that contains the shell link. When specified, this string SHOULD be used when resolving the link. This structure MUST be present if the HasRelativePath flag is set.
WORKING_DIR: An optional structure that specifies the file system path of the working directory to be used when activating the link target. This structure MUST be present if the HasWorkingDir flag is set.

COMMAND_LINE_ARGUMENTS: An optional structure that stores the command-line arguments that should be specified when activating the link target. This structure MUST be present if the HasArguments flag is set.

ICON_LOCATION: An optional structure that specifies the location of the icon to be used when displaying a shell link item in an icon view. This structure MUST be present if the HasIconLocation flag is set.

All StringData structures have the following layout: CountCharacters (2 bytes), String (variable).

CountCharacters (2 bytes): A 16-bit, unsigned integer that specifies either the number of characters, defined by the system default code page, or the number of Unicode characters found in the String field. A value of zero specifies an empty string.

String (variable): An optional set of characters, defined by the system default code page, or a Unicode string with a length specified by the CountCharacters field. This string MUST NOT be NULL-terminated.

2.5 ExtraData

ExtraData refers to a set of structures that convey additional information about a link target. These optional structures can be present in an extra data section that is appended to the basic Shell Link Binary File Format. The ExtraData structures conform to the following ABNF rules [RFC5234]:

EXTRA_DATA = *EXTRA_DATA_BLOCK TERMINAL_BLOCK
EXTRA_DATA_BLOCK = CONSOLE_PROPS / CONSOLE_FE_PROPS / DARWIN_PROPS / ENVIRONMENT_PROPS / ICON_ENVIRONMENT_PROPS / KNOWN_FOLDER_PROPS / PROPERTY_STORE_PROPS /
SHIM_PROPS / SPECIAL_FOLDER_PROPS / TRACKER_PROPS / VISTA_AND_ABOVE_IDLIST_PROPS

EXTRA_DATA: A structure consisting of zero or more property data blocks followed by a terminal block.

EXTRA_DATA_BLOCK: A structure consisting of any one of the following property data blocks.

CONSOLE_PROPS: A ConsoleDataBlock structure (section 2.5.1).
CONSOLE_FE_PROPS: A ConsoleFEDataBlock structure (section 2.5.2).
DARWIN_PROPS: A DarwinDataBlock structure (section 2.5.3).
ENVIRONMENT_PROPS: An EnvironmentVariableDataBlock structure (section 2.5.4).
ICON_ENVIRONMENT_PROPS: An IconEnvironmentDataBlock structure (section 2.5.5).
KNOWN_FOLDER_PROPS: A KnownFolderDataBlock structure (section 2.5.6).
PROPERTY_STORE_PROPS: A PropertyStoreDataBlock structure (section 2.5.7).
SHIM_PROPS: A ShimDataBlock structure (section 2.5.8).
SPECIAL_FOLDER_PROPS: A SpecialFolderDataBlock structure (section 2.5.9).
TRACKER_PROPS: A TrackerDataBlock structure (section 2.5.10).
VISTA_AND_ABOVE_IDLIST_PROPS: A VistaAndAboveIDListDataBlock structure (section 2.5.11).
TERMINAL_BLOCK: A structure that indicates the end of the extra data section.

The general structure of an extra data section is: ExtraDataBlock (variable), followed by TerminalBlock.

ExtraDataBlock (variable): An optional array of bytes that contains zero or more property data blocks listed in the EXTRA_DATA_BLOCK syntax rule.

TerminalBlock (4 bytes): A 32-bit, unsigned integer that indicates the end of the extra data section. This value MUST be less than 0x00000004.

2.5.1 ConsoleDataBlock

The ConsoleDataBlock structure specifies the display settings to use when a link target specifies an application that is run in a console window.
<2>

Layout: BlockSize, BlockSignature, FillAttributes (2 bytes), PopupFillAttributes (2 bytes), ScreenBufferSizeX, ScreenBufferSizeY, WindowSizeX, WindowSizeY, WindowOriginX, WindowOriginY (2 bytes each), Unused1, Unused2, FontSize, FontFamily, FontWeight, FaceName (64 bytes), CursorSize, FullScreen, InsertMode, AutoPosition, HistoryBufferSize, NumberOfHistoryBuffers, HistoryNoDup, ColorTable (64 bytes).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the ConsoleDataBlock structure. This value MUST be 0x000000CC.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the ConsoleDataBlock extra data section. This value MUST be 0xA0000002.

FillAttributes (2 bytes): A 16-bit, unsigned integer that specifies the fill attributes that control the foreground and background text colors in the console window. The following bit definitions can be combined to specify 16 different values each for the foreground and background colors:

FOREGROUND_BLUE (0x0001): The foreground text color contains blue.
FOREGROUND_GREEN (0x0002): The foreground text color contains green.
FOREGROUND_RED (0x0004): The foreground text color contains red.
FOREGROUND_INTENSITY (0x0008): The foreground text color is intensified.
BACKGROUND_BLUE (0x0010): The background text color contains blue.
BACKGROUND_GREEN (0x0020): The background text color contains green.
BACKGROUND_RED (0x0040): The background text color contains red.
BACKGROUND_INTENSITY (0x0080): The background text color is intensified.
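Since the foreground bits occupy the low nibble and the background bits the next nibble, a FillAttributes value splits into two 4-bit indexes. A small illustrative helper (the function name is an assumption, not from the specification):

```python
def fill_attribute_indexes(fill_attributes: int) -> tuple[int, int]:
    """Sketch: split a FillAttributes value into (foreground, background)
    4-bit indexes. Each index combines the BLUE/GREEN/RED/INTENSITY bits
    and can be used to index the 16-entry ColorTable described later in
    this section."""
    foreground = fill_attributes & 0x000F
    background = (fill_attributes >> 4) & 0x000F
    return foreground, background
```

For example, FOREGROUND_GREEN | FOREGROUND_RED | BACKGROUND_BLUE (0x0016) yields foreground index 6 and background index 1.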
PopupFillAttributes (2 bytes): A 16-bit, unsigned integer that specifies the fill attributes that control the foreground and background text color in the console window popup. The values are the same as for the FillAttributes field.

ScreenBufferSizeX (2 bytes): A 16-bit, signed integer that specifies the horizontal size (X axis), in characters, of the console window buffer.

ScreenBufferSizeY (2 bytes): A 16-bit, signed integer that specifies the vertical size (Y axis), in characters, of the console window buffer.

WindowSizeX (2 bytes): A 16-bit, signed integer that specifies the horizontal size (X axis), in characters, of the console window.

WindowSizeY (2 bytes): A 16-bit, signed integer that specifies the vertical size (Y axis), in characters, of the console window.

WindowOriginX (2 bytes): A 16-bit, signed integer that specifies the horizontal coordinate (X axis), in pixels, of the console window origin.

WindowOriginY (2 bytes): A 16-bit, signed integer that specifies the vertical coordinate (Y axis), in pixels, of the console window origin.

Unused1 (4 bytes): A value that is undefined and MUST be ignored.

Unused2 (4 bytes): A value that is undefined and MUST be ignored.

FontSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in pixels, of the font used in the console window.

FontFamily (4 bytes): A 32-bit, unsigned integer that specifies the family of the font used in the console window. This value MUST be one of the following:

FF_DONTCARE (0x0000): The font family is unknown.
FF_ROMAN (0x0010): The font is variable-width with serifs; for example, "Times New Roman".
FF_SWISS (0x0020): The font is variable-width without serifs; for example, "Arial".
FF_MODERN (0x0030): The font is fixed-width, with or without serifs; for example, "Courier New".
FF_SCRIPT (0x0040): The font is designed to look like handwriting; for example, "Cursive".
FF_DECORATIVE (0x0050): The font is a novelty font; for example, "Old English".

FontWeight (4 bytes): A 32-bit, unsigned integer that specifies the stroke weight of the font used in the console window. A value of 700 or greater specifies a bold font; a value less than 700 specifies a regular-weight font.

FaceName (64 bytes): A 32-character Unicode string that specifies the face name of the font used in the console window.

CursorSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the cursor, in pixels, used in the console window. A value of 25 or less specifies a small cursor; 26 through 50, a medium cursor; 51 through 100, a large cursor.

FullScreen (4 bytes): A 32-bit, unsigned integer that specifies whether to open the console window in full-screen mode. A value of 0x00000000 means full-screen mode is off; any nonzero value means full-screen mode is on.

InsertMode (4 bytes): A 32-bit, unsigned integer that specifies insert mode in the console window. A value of 0x00000000 means insert mode is disabled; any nonzero value means insert mode is enabled.

AutoPosition (4 bytes): A 32-bit, unsigned integer that specifies the auto-position mode of the console window. If the value is 0x00000000, the values of the WindowOriginX and WindowOriginY fields are used to position the console window; any nonzero value means the console window is positioned automatically.

HistoryBufferSize (4 bytes): A 32-bit, unsigned integer that specifies the size, in characters, of the buffer that is used to store a history of user input into the console window.

NumberOfHistoryBuffers (4 bytes): A 32-bit, unsigned integer that specifies the number of history buffers to use.

HistoryNoDup (4 bytes): A 32-bit, unsigned integer that specifies whether to remove duplicates in the history buffer.
A value of 0x00000000 specifies that duplicates are not allowed; any nonzero value specifies that duplicates are allowed.

ColorTable (64 bytes): A table of 16 32-bit, unsigned integers specifying the RGB colors that are used for text in the console window. The values of the fill attribute fields FillAttributes and PopupFillAttributes are used as indexes into this table to specify the final foreground and background color for a character.

2.5.2 ConsoleFEDataBlock

The ConsoleFEDataBlock structure specifies the code page to use for displaying text when a link target specifies an application that is run in a console window. <3>

Layout: BlockSize, BlockSignature, CodePage.

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the ConsoleFEDataBlock structure. This value MUST be 0x0000000C.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the ConsoleFEDataBlock extra data section. This value MUST be 0xA0000004.

CodePage (4 bytes): A 32-bit, unsigned integer that specifies a code page language code identifier. For details concerning the structure and meaning of language code identifiers, see [MS-LCID]. For additional background information, see [MSCHARSET] and [MSDN-CODEPAGE].

2.5.3 DarwinDataBlock

The DarwinDataBlock structure specifies an application identifier that can be used instead of a link target IDList to install an application when a shell link is activated.

Layout: BlockSize, BlockSignature, DarwinDataAnsi (260 bytes), DarwinDataUnicode (520 bytes, optional).
BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the DarwinDataBlock structure. This value MUST be 0x00000314.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the DarwinDataBlock extra data section. This value MUST be 0xA0000006.

DarwinDataAnsi (260 bytes): A NULL-terminated string, defined by the system default code page, which specifies an application identifier. This field SHOULD be ignored.

DarwinDataUnicode (520 bytes): An optional, NULL-terminated, Unicode string that specifies an application identifier. <4>

2.5.4 EnvironmentVariableDataBlock

The EnvironmentVariableDataBlock structure specifies a path to environment variable information when the link target refers to a location that has a corresponding environment variable.

Layout: BlockSize, BlockSignature, TargetAnsi (260 bytes), TargetUnicode (520 bytes).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the EnvironmentVariableDataBlock structure. This value MUST be 0x00000314.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the EnvironmentVariableDataBlock extra data section. This value MUST be 0xA0000001.

TargetAnsi (260 bytes): A NULL-terminated string, defined by the system default code page, which specifies a path to environment variable information.

TargetUnicode (520 bytes): An optional, NULL-terminated, Unicode string that specifies a path to environment variable information.
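Because BlockSize is fixed at 0x00000314, the TargetAnsi and TargetUnicode fields sit at fixed offsets (8 and 268), which makes this block straightforward to read. A hedged sketch (the helper name is illustrative; cp1252 is assumed as a stand-in for the system default code page):

```python
import struct

ENV_PROPS_SIG = 0xA0000001  # EnvironmentVariableDataBlock signature

def parse_environment_block(block: bytes) -> str:
    """Sketch: extract the target path from an EnvironmentVariableDataBlock
    ([MS-SHLLINK] 2.5.4). Prefers the Unicode copy when it is non-empty,
    falling back to the code-page string."""
    size, sig = struct.unpack_from("<2I", block, 0)
    if size != 0x314 or sig != ENV_PROPS_SIG:
        raise ValueError("not an EnvironmentVariableDataBlock")
    ansi = block[8:8 + 260]          # TargetAnsi, fixed 260 bytes
    uni = block[8 + 260:8 + 260 + 520]  # TargetUnicode, fixed 520 bytes
    target = uni.decode("utf-16-le", "ignore").split("\x00")[0]
    if not target:
        target = ansi.split(b"\x00")[0].decode("cp1252", "replace")
    return target
```

The IconEnvironmentDataBlock in the next section has the same shape and could be parsed identically, with 0xA0000007 as the expected signature.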
2.5.5 IconEnvironmentDataBlock

The IconEnvironmentDataBlock structure specifies the path to an icon. The path is encoded using environment variables, which makes it possible to find the icon across machines where the locations vary but are expressed using environment variables.

Layout: BlockSize, BlockSignature, TargetAnsi (260 bytes), TargetUnicode (520 bytes).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the IconEnvironmentDataBlock structure. This value MUST be 0x00000314.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the IconEnvironmentDataBlock extra data section. This value MUST be 0xA0000007.

TargetAnsi (260 bytes): A NULL-terminated string, defined by the system default code page, which specifies a path that is constructed with environment variables.

TargetUnicode (520 bytes): An optional, NULL-terminated, Unicode string that specifies a path that is constructed with environment variables.

2.5.6 KnownFolderDataBlock

The KnownFolderDataBlock structure specifies the location of a known folder. This data can be used when a link target is a known folder to keep track of the folder so that the link target IDList can be translated when the link is loaded.

Layout: BlockSize, BlockSignature, KnownFolderID (16 bytes), Offset.

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the KnownFolderDataBlock structure. This value MUST be 0x0000001C.
BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the KnownFolderDataBlock extra data section. This value MUST be 0xA000000B.

KnownFolderID (16 bytes): A GUID that specifies the folder GUID ID.

Offset (4 bytes): A 32-bit, unsigned integer that specifies the location of the ItemID of the first child segment of the IDList specified by KnownFolderID. This value is the offset, in bytes, into the link target IDList.

2.5.7 PropertyStoreDataBlock

A PropertyStoreDataBlock structure specifies a set of properties that can be used by applications to store extra data in the shell link.

Layout: BlockSize, BlockSignature, PropertyStore (variable).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the PropertyStoreDataBlock structure. This value MUST be greater than or equal to 0x0000000C.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the PropertyStoreDataBlock extra data section. This value MUST be 0xA0000009.

PropertyStore (variable): A serialized property storage structure ([MS-PROPSTORE] section 2.2).

2.5.8 ShimDataBlock

The ShimDataBlock structure specifies the name of a shim that can be applied when activating a link target.

Layout: BlockSize, BlockSignature, LayerName (variable).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the ShimDataBlock structure. This value MUST be greater than or equal to 0x00000088.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the ShimDataBlock extra data section. This value MUST be 0xA0000008.
LayerName (variable): A Unicode string that specifies the name of a shim layer to apply to a link target when it is being activated.

2.5.9 SpecialFolderDataBlock

The SpecialFolderDataBlock structure specifies the location of a special folder. This data can be used when a link target is a special folder to keep track of the folder, so that the link target IDList can be translated when the link is loaded.

Layout: BlockSize, BlockSignature, SpecialFolderID, Offset.

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the SpecialFolderDataBlock structure. This value MUST be 0x00000010.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the SpecialFolderDataBlock extra data section. This value MUST be 0xA0000005.

SpecialFolderID (4 bytes): A 32-bit, unsigned integer that specifies the folder integer ID.

Offset (4 bytes): A 32-bit, unsigned integer that specifies the location of the ItemID of the first child segment of the IDList specified by SpecialFolderID. This value is the offset, in bytes, into the link target IDList.

2.5.10 TrackerDataBlock

The TrackerDataBlock structure specifies data that can be used to resolve a link target if it is not found in its original location when the link is resolved. This data is passed to the Link Tracking service [MS-DLTW] to find the link target.

Layout: BlockSize, BlockSignature, Length, Version, MachineID (variable), Droid (32 bytes), DroidBirth (32 bytes).
BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the TrackerDataBlock structure. This value MUST be 0x00000060.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the TrackerDataBlock extra data section. This value MUST be 0xA0000003.

Length (4 bytes): A 32-bit, unsigned integer. This value MUST be greater than or equal to 0x00000058.

Version (4 bytes): A 32-bit, unsigned integer. This value MUST be 0x00000000.

MachineID (variable): A character string, as defined by the system default code page, which specifies the NetBIOS name of the machine where the link target was last known to reside.

Droid (32 bytes): Two GUID values that are used to find the link target with the Link Tracking service, as specified in [MS-DLTW].

DroidBirth (32 bytes): Two GUID values that are used to find the link target with the Link Tracking service, as specified in [MS-DLTW].

2.5.11 VistaAndAboveIDListDataBlock

The VistaAndAboveIDListDataBlock structure specifies an alternate IDList that can be used instead of the LinkTargetIDList structure (section 2.2) on platforms that support it. <5>

Layout: BlockSize, BlockSignature, IDList (variable).

BlockSize (4 bytes): A 32-bit, unsigned integer that specifies the size of the VistaAndAboveIDListDataBlock structure. This value MUST be greater than or equal to 0x0000000A.

BlockSignature (4 bytes): A 32-bit, unsigned integer that specifies the signature of the VistaAndAboveIDListDataBlock extra data section. This value MUST be 0xA000000C.

IDList (variable): An IDList structure (section 2.2.1).
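The TERMINAL_BLOCK rule from section 2.5 gives a simple loop for walking an extra data section: read a 4-byte BlockSize, stop when it is less than 0x00000004, otherwise dispatch on the 4-byte BlockSignature that follows it. A sketch (the generator name is an assumed helper, not from the specification):

```python
import struct

def iter_extra_data_blocks(extra: bytes):
    """Sketch: walk an ExtraData section ([MS-SHLLINK] 2.5), yielding
    (BlockSignature, block bytes) until the TerminalBlock is reached."""
    off = 0
    while off + 4 <= len(extra):
        (block_size,) = struct.unpack_from("<I", extra, off)
        if block_size < 0x00000004:  # TerminalBlock: end of the section
            break
        (signature,) = struct.unpack_from("<I", extra, off + 4)
        yield signature, extra[off:off + block_size]
        off += block_size
```

A caller would then match each signature against the block signatures listed above (0xA0000002 for ConsoleDataBlock, 0xA0000003 for TrackerDataBlock, and so on) and skip blocks it does not recognize.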
3 Structure Examples

3.1 Shortcut to a File

This section presents a sample of the Shell Link Binary File Format, consisting of a shortcut to a file with the path "C:\test\a.txt". The following is the hexadecimal representation of the contents of the shell link.

     x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0000 4C 00 00 00 01 14 02 00 00 00 00 00 C0 00 00 00
0010 00 00 00 46 9B 00 08 00 20 00 00 00 D0 E9 EE F2
0020 15 15 C9 01 D0 E9 EE F2 15 15 C9 01 D0 E9 EE F2
0030 15 15 C9 01 00 00 00 00 00 00 00 00 01 00 00 00
0040 00 00 00 00 00 00 00 00 00 00 00 00 BD 00 14 00
0050 1F 50 E0 4F D0 20 EA 3A 69 10 A2 D8 08 00 2B 30
0060 30 9D 19 00 2F 43 3A 5C 00 00 00 00 00 00 00 00
0070 00 00 00 00 00 00 00 00 00 00 00 46 00 31 00 00
0080 00 00 00 2C 39 69 A3 10 00 74 65 73 74 00 00 32
0090 00 07 00 04 00 EF BE 2C 39 65 A3 2C 39 69 A3 26
00A0 00 00 00 03 1E 00 00 00 00 F5 1E 00 00 00 00 00
00B0 00 00 00 00 00 74 00 65 00 73 00 74 00 00 00 14
00C0 00 48 00 32 00 00 00 00 00 2C 39 69 A3 20 00 61
00D0 2E 74 78 74 00 34 00 07 00 04 00 EF BE 2C 39 69
00E0 A3 2C 39 69 A3 26 00 00 00 2D 6E 00 00 00 00 96
00F0 01 00 00 00 00 00 00 00 00 00 00 61 00 2E 00 74
0100 00 78 00 74 00 00 00 14 00 00 00 3C 00 00 00 1C
0110 00 00 00 01 00 00 00 1C 00 00 00 2D 00 00 00 00
0120 00 00 00 3B 00 00 00 11 00 00 00 03 00 00 00 81
0130 8A 7A 30 10 00 00 00 00 43 3A 5C 74 65 73 74 5C
0140 61 2E 74 78 74 00 00 07 00 2E 00 5C 00 61 00 2E
0150 00 74 00 78 00 74 00 07 00 43 00 3A 00 5C 00 74
0160 00 65 00 73 00 74 00 60 00 00 00 03 00 00 A0 58
0170 00 00 00 00 00 00 00 63 68 72 69 73 2D 78 70 73
0180 00 00 00 00 00 00 00 40 78 C7 94 47 FA C7 46 B3
0190 56 5C 2D C6 B6 D1 15 EC 46 CD 7B 22 7F DD 11 94
01A0 99 00 13 72 16 87 4A 40 78 C7 94 47 FA C7 46 B3
01B0 56 5C 2D C6 B6 D1 15 EC 46 CD 7B 22 7F DD 11 94
01C0 99 00 13 72 16 87 4A 00 00 00 00

HeaderSize: (4 bytes, offset 0x0000), 0x0000004C, as required.

LinkCLSID: (16 bytes, offset 0x0004), 00021401-0000-0000-C000-000000000046.

LinkFlags: (4 bytes, offset 0x0014), 0x0008009B, which means the following LinkFlags (section 2.1.1) are set: HasLinkTargetIDList, HasLinkInfo, HasRelativePath, HasWorkingDir, IsUnicode, EnableTargetMetadata.

FileAttributes: (4 bytes, offset 0x0018), 0x00000020, which means the following FileAttributesFlags (section 2.1.2) are set: FILE_ATTRIBUTE_ARCHIVE.

CreationTime: (8 bytes, offset 0x001C), FILETIME 9/12/08, 8:27:17 PM.

AccessTime: (8 bytes, offset 0x0024), FILETIME 9/12/08, 8:27:17 PM.

WriteTime: (8 bytes, offset 0x002C), FILETIME 9/12/08, 8:27:17 PM.

FileSize: (4 bytes, offset 0x0034), 0x00000000.

IconIndex: (4 bytes, offset 0x0038), 0x00000000.

ShowCommand: (4 bytes, offset 0x003C), SW_SHOWNORMAL(1).

Hotkey: (2 bytes, offset 0x0040), 0x0000.

Reserved: (2 bytes, offset 0x0042), 0x0000.

Reserved2: (4 bytes, offset 0x0044), 0x00000000.

Reserved3: (4 bytes, offset 0x0048), 0x00000000.

Because HasLinkTargetIDList is set, a LinkTargetIDList structure (section 2.2) follows:

IDListSize: (2 bytes, offset 0x004C), 0x00BD, the size of IDList.
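The fixed header fields just enumerated can be unpacked directly with Python's struct module. A sketch (offsets as listed in the walkthrough; the function name and returned dictionary are illustrative, not part of the format):

```python
import struct

def parse_header(data: bytes) -> dict:
    """Sketch: unpack a few fixed ShellLinkHeader fields at the offsets
    given in the walkthrough above. Assumed helper, not normative."""
    (header_size,) = struct.unpack_from("<I", data, 0x00)
    if header_size != 0x4C:
        raise ValueError("HeaderSize MUST be 0x0000004C")
    link_flags, file_attributes = struct.unpack_from("<2I", data, 0x14)
    file_size, icon_index, show_command = struct.unpack_from("<IiI", data, 0x34)
    return {
        "link_flags": link_flags,
        "file_attributes": file_attributes,
        "file_size": file_size,
        "icon_index": icon_index,
        "show_command": show_command,
        # Two of the LinkFlags bits named above:
        "has_target_idlist": bool(link_flags & 0x00000001),
        "has_link_info": bool(link_flags & 0x00000002),
    }
```

Run against the first 0x4C bytes of the hexadecimal dump, this reproduces the values listed above: LinkFlags 0x0008009B, FileAttributes 0x00000020, ShowCommand 1.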
IDList: (189 bytes, offset 0x004E), an IDList structure (section 2.2.1) follows:
ItemIDList: (187 bytes, offset 0x004E), ItemID structures (section 2.2.2) follow:
ItemIDSize: (2 bytes, offset 0x004E), 0x0014.
Data: (18 bytes, offset 0x0050), <18 bytes of data> [computer]
ItemIDSize: (2 bytes, offset 0x0062), 0x0019.
Data: (23 bytes, offset 0x0064), <23 bytes of data> [c:]
ItemIDSize: (2 bytes, offset 0x007B), 0x0046.
Data: (68 bytes, offset 0x007D), <68 bytes of data> [test]
ItemIDSize: (2 bytes, offset 0x00C1), 0x0048.
Data: (70 bytes, offset 0x00C3), <70 bytes of data> [a.txt]
TerminalID: (2 bytes, offset 0x0109), 0x0000, indicates the end of the IDList.

Because HasLinkInfo is set, a LinkInfo structure (section 2.3) follows:

LinkInfoSize: (4 bytes, offset 0x010B), 0x0000003C.
LinkInfoHeaderSize: (4 bytes, offset 0x010F), 0x0000001C, as specified in the LinkInfo structure definition.
LinkInfoFlags: (4 bytes, offset 0x0113), 0x00000001, VolumeIDAndLocalBasePath is set.
VolumeIDOffset: (4 bytes, offset 0x0117), 0x0000001C, references offset 0x0127.
LocalBasePathOffset: (4 bytes, offset 0x011B), 0x0000002D, references the character string "C:\test\a.txt".
CommonNetworkRelativeLinkOffset: (4 bytes, offset 0x011F), 0x00000000, indicates CommonNetworkRelativeLink is not present.
CommonPathSuffixOffset: (4 bytes, offset 0x0123), 0x0000003B, references offset 0x00000146, the character string "" (empty string).
VolumeID: (17 bytes, offset 0x0127), because VolumeIDAndLocalBasePath is set, a VolumeID structure (section 2.3.1) follows:
VolumeIDSize: (4 bytes, offset 0x0127), 0x00000011, indicates the size of the VolumeID structure.
DriveType: (4 bytes, offset 0x012B), DRIVE_FIXED(3).
DriveSerialNumber: (4 bytes, offset 0x012F), 0x307A8A81.
VolumeLabelOffset: (4 bytes, offset 0x0133), 0x00000010, indicates that VolumeLabelOffsetUnicode is not specified, and references offset 0x0137, where the volume label is stored.
Data: (1 byte, offset 0x0137), "", an empty character string.
LocalBasePath: (14 bytes, offset 0x0138), because VolumeIDAndLocalBasePath is set, the character string "c:\test\a.txt" is present.
CommonPathSuffix: (1 byte, offset 0x0146), "", an empty character string.

Because HasRelativePath is set, the RELATIVE_PATH StringData structure (section 2.4) follows:

CountCharacters: (2 bytes, offset 0x0147), 0x0007 Unicode characters.
String: (14 bytes, offset 0x0149), the Unicode string ".\a.txt".

Because HasWorkingDir is set, the WORKING_DIR StringData structure (section 2.4) follows:

CountCharacters: (2 bytes, offset 0x0157), 0x0007 Unicode characters.
String: (14 bytes, offset 0x0159), the Unicode string "c:\test".

Extra data section: (100 bytes, offset 0x0167), an ExtraData structure (section 2.5) follows:

ExtraDataBlock: (96 bytes, offset 0x0167), the TrackerDataBlock structure (section 2.5.10) follows:
BlockSize: (4 bytes, offset 0x0167), 0x00000060.
BlockSignature: (4 bytes, offset 0x016B), 0xA0000003, which identifies the TrackerDataBlock structure (section 2.5.10).
Length: (4 bytes, offset 0x016F), 0x00000058, the required minimum size of this extra data block.
Version: (4 bytes, offset 0x0173), 0x00000000, the required version.
MachineID: (16 bytes, offset 0x0177), the character string "chris-xps", with zero fill.
Droid: (32 bytes, offset 0x0187), 2 GUID values.
DroidBirth: (32 bytes, offset 0x01A7), 2 GUID values.
TerminalBlock: (4 bytes, offset 0x01C7), 0x00000000, indicates the end of the extra data section.
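The size-prefixed layout of the extra data section lends itself to a simple cursor loop. The following sketch is illustrative only and handles none of the per-block semantics: it walks ExtraDataBlock structures by their BlockSize until a terminal block with size < 4 is reached, as described above. The input blob is synthetic, shaped like a minimal TrackerDataBlock, not the example bytes.

```python
import struct

def extra_data_blocks(buf, offset=0):
    """Collect (BlockSignature, body) pairs from an ExtraData section.

    A BlockSize value smaller than 0x00000004 marks the TerminalBlock
    (section 2.5), which ends the iteration."""
    blocks = []
    while offset + 4 <= len(buf):
        size, = struct.unpack_from("<I", buf, offset)
        if size < 4:                      # TerminalBlock reached
            break
        sig, = struct.unpack_from("<I", buf, offset + 4)
        blocks.append((sig, buf[offset + 8: offset + size]))
        offset += size
    return blocks

# Synthetic section: one 16-byte block carrying the TrackerDataBlock
# signature (0xA0000003), followed by a TerminalBlock of zero.
blob = struct.pack("<II8s", 16, 0xA0000003, b"payload!") + struct.pack("<I", 0)
assert extra_data_blocks(blob) == [(0xA0000003, b"payload!")]
```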
4 Security

None.

5 Appendix A: Product Behavior

The information in this specification is applicable to the following Microsoft products:

- Microsoft Windows NT® 3.1 operating system
- Microsoft Windows NT® 3.5 operating system
- Microsoft Windows NT® 3.51 operating system
- Microsoft Windows NT® 4.0 operating system
- Microsoft Windows® 2000 operating system
- Windows® XP operating system
- Windows Server® 2003 operating system
- Windows Vista® operating system
- Windows Server® 2008 operating system
- Windows® 7 operating system
- Windows Server® 2008 R2 operating system

Exceptions, if any, are noted below. If a service pack number appears with the product version, behavior changed in that service pack. The new behavior also applies to subsequent service packs of the product unless otherwise specified.

Unless otherwise specified, any statement of optional behavior in this specification prescribed using the terms SHOULD or SHOULD NOT implies product behavior in accordance with the SHOULD or SHOULD NOT prescription. Unless otherwise specified, the term MAY implies that the product does not follow the prescription.

<1> Section 2.3: In Windows, Unicode characters are stored in this structure if the data cannot be represented as ANSI characters due to truncation of the values. In this case, the value of the LinkInfoHeaderSize field is greater than or equal to 36.

<2> Section 2.5.1: In Windows environments, this is commonly known as a "command prompt" window.

<3> Section 2.5.2: In Windows environments, this is commonly known as a "command prompt" window.

<4> Section 2.5.3: In Windows, this is a Windows Installer (MSI) application descriptor. For more information, see [MSDN-MSISHORTCUTS].
<5> Section 2.5.11: The VistaAndAboveIDListDataBlock structure is supported on Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2 only.

6 Change Tracking

This section identifies changes made to [MS-SHLLINK] protocol documentation between the April 2010 and June 2010 releases. Changes are classed as major, minor, or editorial.

Major changes affect protocol interoperability or implementation. Examples of major changes are:

- A document revision that incorporates changes to interoperability requirements or functionality.
- An extensive rewrite, addition, or deletion of major portions of content.
- A protocol is deprecated.
- The removal of a document from the documentation set.
- Changes made for template compliance.

Minor changes do not affect protocol interoperability or implementation. Examples are updates to fix technical accuracy or ambiguity at the sentence, paragraph, or table level.

Editorial changes apply to grammatical, formatting, and style issues.

No changes means that the document is identical to its last release.

Major and minor changes can be described further using the following revision types:

- New content added.
- Content updated.
- Content removed.
- New product behavior note added.
- Product behavior note updated.
- Product behavior note removed.
- New protocol syntax added.
- Protocol syntax updated.
- Protocol syntax removed.
- New content added due to protocol revision.
- Content updated due to protocol revision.
- Content removed due to protocol revision.
- New protocol syntax added due to protocol revision.
- Protocol syntax updated due to protocol revision.
- Protocol syntax removed due to protocol revision.
- New content added for template compliance.
- Content updated for template compliance.
- Content removed for template compliance.
- Obsolete document removed.

Editorial changes always have the revision type "Editorially updated."

Some important terms used in revision type descriptions are defined as follows:

Protocol syntax refers to data elements (such as packets, structures, enumerations, and methods) as well as interfaces.

Protocol revision refers to changes made to a protocol that affect the bits that are sent over the wire.

Changes are listed in the following table. If you need further information, please contact protocol@microsoft.com.

Section | Tracking number (if applicable) and description | Major change (Y or N) | Revision type
2.3 LinkInfo | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.3.1 VolumeID | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.3.2 CommonNetworkRelativeLink | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.4 StringData | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.5.3 DarwinDataBlock | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.5.4 EnvironmentVariableDataBlock | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.5.5 IconEnvironmentDataBlock | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
2.5.10 TrackerDataBlock | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
3.1 Shortcut to a File | Replaced the term "ANSI" with "system default code page". | N | Editorially updated.
4 Security | Added section. | N | New content added for template compliance.
7 Index

Applicability 7
Change tracking 50
CommonNetworkRelativeLink packet 23
ConsoleDataBlock packet 29
ConsoleFEDataBlock packet 33
DarwinDataBlock packet 34
EnvironmentVariableDataBlock packet 35
Example - shortcut to file 44
ExtraData packet 27
Fields - vendor-extensible 7
FileAttributeFlags packet 12
Glossary 4
HotKeyFlags packet 13
IconEnvironmentDataBlock packet 37
IDList packet 17
Informative references 5
Introduction 4
ItemID packet 18
KnownFolderDataBlock packet 38
LinkFlags packet 10
LinkInfo packet 18
LinkTargetIDList packet 17
Localization 7
Normative references 5
Overview (synopsis) 6
Product behavior 49
PropertyStoreDataBlock packet 39
References - informative 5, normative 5
Relationship to protocols and other structures 6
Security 48
ShellLinkHeader packet 8
ShimDataBlock packet 39
Shortcut to file example 44
SpecialFolderDataBlock packet 40
StringData packet 26
Structures 8
TrackerDataBlock packet 40
Tracking changes 50
Vendor-extensible fields 7
Versioning 7
VistaAndAboveIDListDataBlock packet 42
VolumeID packet 21

This paper is included in the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15), July 8-10, 2015, Santa Clara, CA, USA. ISBN 978-1-931971-225. Open access to the Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15) is sponsored by USENIX.
https://www.usenix.org/conference/atc15/technical-session/presentation/amvrosiadis

Identifying Trends in Enterprise Data Protection Systems

George Amvrosiadis, University of Toronto; Medha Bhadkamkar, Symantec Research Labs

George Amvrosiadis, Dept.
of Computer Science, University of Toronto (gamvrosi@cs.toronto.edu); Medha Bhadkamkar, Symantec Research Labs (medha_bhadkamkar@symantec.com)

Abstract

Enterprises routinely use data protection techniques to achieve business continuity in the event of failures. To ensure that backup and recovery goals are met in the face of the steep data growth rates of modern workloads, data protection systems need to constantly evolve. Recent studies show that these systems routinely miss their goals today. However, there is little work in the literature to understand why this is the case.

In this paper, we present a study of 40,000 enterprise data protection systems deploying Symantec NetBackup, a commercial backup product. In total, we analyze over a million weekly reports which have been collected over a period of three years. We discover that the main reason behind inefficiencies in data protection systems is misconfigurations. Furthermore, our analysis shows that these systems grow in bursts, leaving clients unprotected at times, and are often configured using the default parameter values. As a result, we believe there is potential in developing automated, self-healing data protection systems that achieve higher efficiency standards. To aid researchers in the development of such systems, we use our dataset to identify trends characterizing data protection systems with regards to configuration, job scheduling, and data growth.

1 Introduction

Studies analyzing the characteristics of storage systems are an important aid in the design and implementation of techniques that can improve the performance and robustness of these systems. In the past 30 years, numerous file system studies have investigated different aspects of desktop and enterprise systems [2, 6, 7, 19, 30, 39, 47, 51, 55, 56].
However, little work has been published to provide insight into the characteristics of backup systems, with existing studies focusing on deduplication rates [52] and on the characteristics of the file systems storing the backup images [66]. With this study, we look into the backup application generating these images, their internal structure, and the characteristics of the jobs that created them.

Modern data growth rates and shorter recovery windows are driving the need for innovation in the area of data protection. Recent surveys of CIOs and IT professionals indicate that 90% of businesses use more than two backup products [18], and only 28% of backup jobs complete within their scheduled window [34, 65]. The goal of this study is to investigate how data protection systems are configured and operate. Our analysis shows that the inefficiency of backup systems is largely attributed to misconfigurations. We believe automating configuration management can help alleviate these configuration issues significantly. Our findings motivate and support research on automated data protection [22, 27], by identifying trends in data protection systems, and related directions for future research.

Our study is based on a million weekly reports collected in a span of three years, from 40,000 enterprise backup systems, also referred to as domains in the rest of the paper. Each domain is a multi-tiered network of backup servers deploying Symantec NetBackup [61], an enterprise backup product. To the best of our knowledge, this dataset is the largest in existing literature in terms of both the number of domains, and the time span covered. As a result, we are able to analyze the characteristics of a diverse domain population, and its evolution over time.

First, we investigate how backup domains are configured. Identifying common growth trends is useful for provisioning system resources, such as network or storage bandwidth, to accommodate future growth.
We find that the population of protected client machines grows in bursts and rarely shrinks. Furthermore, domains protect data of a single type, such as database files or virtual machines, regardless of domain size. Overall, our findings suggest that automated configuration is an important and feasible direction for future research to accommodate growth bursts in the number of protected clients.

The configuration of a backup system, with regards to job frequency and scheduling, is also an important contributor to resource consumption. Understanding common practices employed by systems in the field can give us better insight in the load that these systems face, and the characteristics of that load. To derive these trends, we analyzed 210 million jobs performing a variety of tasks, ranging from data backup and recovery, to management of backup archives. We find that jobs occur in bursts, due to the preference of default scheduling parameters by users. Moreover, job types are strongly correlated to specific days and times of the week. To avoid these bursts of activity, we expect future backup systems to follow more flexible scheduling plans based on data protection guarantees and resource availability [4, 26, 48].

Finally, successful resource provisioning for backup storage capacity requires data growth rate knowledge. Our results show that jobs in the order of tens of GBs are the norm, even with deduplication ratios of 88%. Also, retention periods for these jobs are selected as a function of backup frequency, and backups are performed at intervals significantly shorter than the periods for which they are retained. Thus, future data protection offering faster backup and recovery times through the use of snapshots [1, 22], will have to be designed to handle significant data churn, or employ these mechanisms selectively.

We summarize the most important observations of our study in Table 1. Note that a policy (see Section 2.2) refers to a predefined set of configuration parameters specific to an application.

Characteristic | Observation | Section | Previous work
System setup | The initial configuration period of backup domains is at least 3 weeks. | 4.1 | None
Protected clients | Clients tend to be added to a domain in groups, on a monthly basis. | 4.2 | None
Backup policies | 82% of backup domains protect one type of data. | 4.3 | None
Backup policies | The number of backup job policies in a domain remains mostly fixed. Also, 79% of clients subscribe to a single policy. | 4.4 | None
Job frequency | Full backups tend to occur every few days, while incremental ones occur daily. Recovery operations occur for few domains, on a weekly or monthly basis. | 5.2 | None
Job frequency | Users prefer default scheduling windows during weekdays, resulting in nightly bursts of activity. | 5.3 | None
Job sizes | Incremental and full backups tend to be similar to each other in terms of size and number of files. Recovery jobs restore either few files and bytes, or entire volumes. | 6.1 | Considers file sizes instead [66]
Deduplication ratios | Deduplication can result in the reduction of backup image sizes by more than 88%, despite average job sizes ranging in the tens of gigabytes. | 6.2 | We confirm their findings [66]
Data retention | Incremental backups are retained for weeks, while full backups are retained for months, and retention depends on their scheduling frequency. | 6.3 | We confirm their findings [66]

Table 1: A summary of the most important observations of our study.

The rest of the paper is organized as follows. In Section 2, we provide an overview of the evolution of backup systems. Section 3 describes the dataset used in this study.
Sections 4 through 6 present our analysis results on backup domain configuration, job scheduling, and data growth, respectively. Finally, we discuss directions for research on next-generation data protection systems, supported by our findings, in Section 7, and conclude in Section 8.

2 Background

Formally, backup is the process of making redundant copies of data, so that it can be retrieved if the original copy becomes unavailable. In the past 30 years, however, data growth coupled with capacity and bandwidth limitations have triggered a number of paradigm shifts in the way backup is performed. Recently, data growth trends have once again prompted efforts to rethink backup [1, 9, 20, 22, 27]. This section underlines the importance of field studies in this process (Section 2.1), putting our study in context, and describes the architecture of modern backup systems (Section 2.2).

2.1 Evolution of backup and field studies

In the early 1990s, backup consisted of using simple command-line tools to copy data to/from tape. A number of studies tested and outlined the shortcomings of these contemporary backup methods [38, 54, 69, 70]. The limitations of this approach, which included scaling, archive management, operating on online systems, and completion time, were subsequently addressed sufficiently by moving to a client-server backup model [8, 11, 15, 16]. In this model, job scheduling, policy configuration, and archive cataloging were all unified at the server side.

In the early 2000s, deduplicating storage systems were developed [53, 67], which removed data redundancy, lowering the cost of backup storage. Subsequently, Wallace et al. [66] published a study that aims to characterize backup storage characteristics by looking at the contents and workload of file systems that store images produced by backup applications such as NetBackup.
A large body of work used their results to simulate deduplicating backup systems more realistically [41, 43, 44, 57, 62], and was built on the motivation provided by the study's results [40, 42, 46, 58]. The authors analyze weekly reports from appliances, while we analyze reports from the backup application, which has visibility within the archives and the jobs that created them. However, the two studies overlap in three points. First, the deduplication ratios reported for backups confirm our findings. Second, we report backup data retention as a configuration parameter, while they report on file age, two distributions that overlap for popular values. Third, the average job sizes we report are 5-8 times smaller than the file sizes reported in their study, likely because they take into account all files in the file system storing the backup images. Overlaps between our study and previous work are summarized in Table 1.

Figure 1: Architecture of a modern backup domain. (a) 3-tier architecture: a master server (tier one: media management, job scheduling, backup policies, catalog metadata), storage servers (tier two: data management), and clients (tier three). (b) 2-tier architecture: a master server (tier one: data/media management, job scheduling, backup policies, catalog metadata) with a fast storage interface to backup storage, and clients (tier two).

Recently, an ongoing effort has been initiated in the industry to redefine enterprise data protection as a response to modern data growth rates and shorter backup windows [12, 18, 65]. Proposed deviations from the traditional model rely on data snapshots, trading management complexity for faster job completion rates [22], and a paradigm shift from backup to data protection policies, in which users specify constraints on data availability as opposed to backup frequency and scheduling [1].
The latter paradigm allows the system to make decisions on individual policy parameters that can increase global efficiency, while keeping misconfigurations to a minimum. In this direction, previous work leverages predictive analytics to configure backup systems [9, 20, 25]. We believe that all this work is promising, and that a study characterizing the configuration and evolution of backup systems over time could aid in developing new approaches and predictive models that ensure backup systems meet their goals timely, while efficiently utilizing their resources.

2.2 Anatomy of modern backup systems

Modern backup domains typically consist of three tiers of operation: a master server, one or more storage servers, and several clients, as shown in Figure 1a. The domain's master server maintains information on backup images and backup policies. It is also responsible for scheduling and monitoring backup jobs, and assigning them to storage servers. Storage servers manage storage media, such as tapes and hard drives, used to archive backup images. By abstracting storage media management in this way, clients can send data directly to their corresponding storage server, avoiding a bandwidth bottleneck at the master server. Finally, domain clients can be desktops, servers, or virtual machines generating data that is protected by the backup system against failures. In an alternative 2-tiered architecture model (Figure 1b), the storage servers are absent and the storage media are directly managed by the master server. The majority of enterprise backup software today, including Symantec NetBackup, supports the 3-tiered model [3, 5, 13, 17, 21, 28, 32, 60, 68].

Performing a backup generally consists of a sequence of operations, each of which is executed as an independent job.
Such jobs include: snapshots of the state of data at a given point in time, copying data into a backup image as part of a full backup, copying modified data since the last backup as part of an incremental backup, restoring data from a backup image as part of a recovery operation, and managing backup images or backing up the domain's configuration as part of a management operation. These jobs are typically employed in a predefined order. For example, a full backup may be followed by a management operation that deletes backup images past their retention periods.

To be consistently backed up, or provide point-in-time recovery guarantees, business applications may require specific operations to take place. In these scenarios, backup products offer predefined policies that are specific to individual applications. For instance, a Microsoft Exchange Server policy will also back up the transaction log, to capture any updates since the backup was initiated. Users can further configure policies to specify the characteristics of backup jobs, such as their frequency and retention rate.

3 Dataset Information

Our analysis is based on telemetry reports collected from customer installations of a commercial backup product, Symantec NetBackup [61], in enterprise and regular production environments. Reports are only collected from customers who opted to participate in the telemetry program, so our dataset represents a fraction of the customer base. The reports contain no personal identifiable information, or details about the data being backed up.

Report types. Each report in our dataset belongs to exactly one of three types: installation, runtime, or domain report. Reports of different types are collected at distinct points in the lifetime of a backup domain.
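The predefined job ordering and per-policy parameters described in Section 2.2 can be sketched roughly as follows. The field names and the scheduling rule are hypothetical illustrations, not NetBackup's actual configuration schema.

```python
from dataclasses import dataclass

@dataclass
class BackupPolicy:
    # Illustrative parameters only; real products expose far more knobs.
    name: str
    policy_type: str             # e.g. "MS-Exchange-Server", "VMware"
    full_every_days: int = 7     # full backup frequency
    incr_every_days: int = 1     # incremental backup frequency
    retention_days: int = 30     # how long backup images are kept

def jobs_for(policy, day):
    """Return the day's ordered job sequence: a full backup is followed
    by a management job that expires images past their retention."""
    if day % policy.full_every_days == 0:
        return ["snapshot", "full_backup", "expire_old_images"]
    if day % policy.incr_every_days == 0:
        return ["snapshot", "incremental_backup"]
    return []

p = BackupPolicy("exchange-prod", "MS-Exchange-Server")
assert jobs_for(p, 14) == ["snapshot", "full_backup", "expire_old_images"]
assert jobs_for(p, 15) == ["snapshot", "incremental_backup"]
```

The point of the sketch is the ordering constraint: the management job always trails the full backup, mirroring the example in the text.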
Installation reports are generated when the backup software is successfully installed on a server, and can be used to determine the time each server of a domain first came online. Runtime reports are generated and transmitted on a weekly basis from online domains, and contain daily aggregate data about the backup jobs running on the system. Domain reports are also generated and transmitted on a weekly basis, and report daily aggregate metrics that describe the configuration of the backup domain. The telemetry report metrics used in this study are summarized in Table 2.

Report type | Metrics used in study
Installation | Installation time
Runtime report | Job information: starting time, type, size, number of files, client policy, deduplication ratio, retention period
Domain report | Number and type of policies, number of clients, number of storage media, number of storage servers and appliances

Table 2: Telemetry report metrics used in the study.

Dataset size. The telemetry reports in our dataset were collected over the span of 3 years (January 2012 to December 2014), across two major versions of the NetBackup software. We collected 1 million reports from over 40,000 server installations deployed in 124 countries, on most modern operating systems.

Monitoring duration. The backup domains included in our study were each monitored for 5.5 months on average, and up to 32 months. We elaborate on our strategy for excluding some of the domains from our analysis in Section 4.1. Note that the monitoring time is not always equivalent to the total lifetime of the domain, as many of these domains were still online at the time of this writing.

Architecture. While NetBackup supports the 3-tiered architecture model, only 35% of domains in our dataset use dedicated storage servers. The remaining domains omit that layer, opting for a 2-tier system instead.
Additionally, while backup software can be installed on any server, storage companies also offer Purpose-Built Backup Appliances (PBBAs) [33]. 31% of domains in our dataset represent this market by deploying NetBackup on Symantec PBBAs.

4 Domain configuration

This section analyzes the way backup domains are configured with regards to their clients and backup policies. We use the periodic telemetry reports to quantify the growth rate of the number of clients and policies across domains, and characterize the diversity of policy types based on the type of data and applications they protect.

Figure 2: The average number of clients, policies, and storage media for a given week of operation, as a fraction of the expected total, i.e. the overall mean. We begin our analysis on the fourth week of operation, when these quantities become relatively stable.

4.1 Initial configuration period

Observation 1: Backup domains take at least 3 weeks to reach a stable configuration after installation.

The number of clients, policies, and storage media are three characteristic factors of a backup domain's configuration. These numbers fluctuate as resources are added to, or removed from the domain. As we monitor domains since their creation, we find the number of clients, policies, and storage media to be initially close to zero, and then increase rapidly until the domain is properly configured. After this initial configuration period, variability for these numbers tends to be low over the lifetime of each domain, with standard deviations less than 16% of the corresponding mean.

To avoid having the initial weeks of operation affect our results, we exclude them from our analysis.
To estimate the average configuration period length, we analyze the number of clients, policies, and storage media in a backup domain as a fraction of the overall mean, i.e. the expected total. In Figure 2, we report the average fractions for all domains that have been monitored for more than 16 weeks. For example, a fraction of 0.47 for the number of clients during the first week of operation implies that the number of clients at that time is 47% of the domain's expected total. With the exception of storage media, which seem to be added to backup domains from their first week of operation, we find that the number of clients and policies tends to be significantly lower for the first 3 weeks of operation. As a result, we choose to start our analysis from the fourth week of operation.

4.2 Client growth rate

Observation 2: The number of clients in a domain increases by an average of 7 clients every 3.7 months.

Figure 3: Distribution of the average rate at which the number of clients changes, across all domains in our dataset (median: 1.3 months; mean: 3.7 months). On average, 93% of client population changes are attributed to the addition of clients.

Clients are the producers of backup data, and the consumers of said data during recovery. As a result, the number of jobs running on a backup domain is directly proportional to the number of clients in the domain, deeming it important to quantify the rate at which their population grows over time.

Once the initial configuration period for a backup domain has elapsed, we find that clients tend to be added to, or removed from the domain in groups.
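The fraction-of-expected-total normalization behind Figure 2 can be sketched as below. The weekly counts are made-up, and the overall mean stands in for the "expected total" exactly as in the text.

```python
from statistics import mean

def fraction_of_expected(weekly_counts):
    """Express each week's count as a fraction of the overall mean,
    the 'expected total' used for Figure 2."""
    expected = mean(weekly_counts)
    return [c / expected for c in weekly_counts]

# Toy domain: roughly 3 weeks of ramp-up, then a stable client population.
counts = [47, 70, 90, 100, 100, 100, 100, 100]
fracs = fraction_of_expected(counts)
assert fracs[0] < 0.6                    # early weeks well below the mean
assert all(f > 0.9 for f in fracs[3:])   # stable from the fourth week on
```

Dropping the first few weeks before computing per-domain statistics, as the paper does, prevents this ramp-up from skewing the averages.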
Therefore, we characterize a domain's client population growth by quantifying the average rate of change in the client population, the sign indicating an increase or decrease in the population, and the size of each change.

To estimate the rate at which the number of clients changes, we extract inter-arrival times between changes through change-point analysis [37], a cost-benefit approach for detecting changes in time series. Then, we estimate the average rate of change for a domain as the average of these inter-arrival times. In Figure 3, we show the distribution of the average rates of change, i.e. the average number of months between changes in the number of clients across domains. For 42% of backup domains, the number of clients remains fixed after the first 3 weeks of operation, while on average the number of clients in a domain changes every 3.7 months. Overall, we find no strong correlation between the rate of change in the number of clients, and the domain's lifetime.

We further analyze the sign and size of each population change. Of all events in which a domain's client population changes, 93% are attributed to the addition of clients. However, 78% of domains never remove clients. Regarding the size of each change, Figure 4 shows the distribution of the average number of clients involved in each change, across all domains in our study. On average, a domain's population changes by 7.3 clients at a time. The average standard deviation of the number of clients over time is 13.1% of the corresponding expected value, indicating low variation overall. However, the 95% confidence intervals (C.I.) for each mean (Figure 4) suggest that growth spurts as large as 2.16 times the average value are possible, as this is the width of the average 95% confidence interval.

Figure 4: Distribution of the average number of clients involved in each change of a domain's client population, across all domains in our dataset (median: 3.0 clients; mean: 7.3 clients). The 95% confidence intervals (C.I.) for each domain's average are also shown.

Policy category          Domains with at least 1 policy
File and block policy    61.24%
Database policy          20.34%
Virtual machine policy   15.13%
Application policy       13.52%
Metadata backup policy   31.93%

Table 3: Percentage of backup domains with at least one policy of a given category. Less than a third of domains protect the master server using a metadata backup policy.

4.3 Diversity of protected data

Observation 3: 82% of backup domains protect one type of data, and only 32% of domains effectively protect the master server's state and metadata.

To provide consistent online backups, backup products offer optimizations for different application types, implemented as dedicated policy types [14, 23, 59]. For our analysis, we partitioned these policy types into four categories. File and block policies are specifically tailored for backing up raw device data blocks, or file and operating system data and metadata, e.g. from NTFS, AFS, or Windows volumes. Database policies are designed to provide consistent online backups for specific database management systems, such as DB2 and Oracle. Virtual machine policies are tuned to back up and restore VM images, from virtual environments such as VMware or Hyper-V. Application policies specialize in backing up state for client-server applications, such as Microsoft Exchange and Lotus Notes. Finally, a metadata backup policy can be set up to back up the master server's state.

In Table 3, we show the probability that at least one policy of a given category will be present in a backup domain. Since domains may deploy policies from multiple categories, these percentages add up to more than 100%.
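The per-category prevalence computation behind Table 3 can be sketched as follows; the domain-to-category mapping is hypothetical, and because a domain can count toward several categories, the percentages may sum past 100%.

```python
def category_prevalence(domain_policies):
    """For each policy category, the percentage of domains with at
    least one policy of that category (as in Table 3)."""
    categories = {c for cats in domain_policies.values() for c in cats}
    total = len(domain_policies)
    return {c: 100.0 * sum(1 for cats in domain_policies.values() if c in cats) / total
            for c in sorted(categories)}

# Hypothetical domains and the policy categories they deploy.
domains = {
    "d1": {"file"},
    "d2": {"file", "database"},
    "d3": {"file", "vm"},
    "d4": {"database"},
}
print(category_prevalence(domains))
```

For this toy input, "file" appears in 3 of 4 domains (75%) and the percentages total 150%, illustrating why the Table 3 column exceeds 100%.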
Figure 5: Distribution of the number of policy categories per backup domain: one category 82.21%, two 15.82%, three 1.72%, four 0.26%. The metadata backup policy category is not accounted for in these numbers.

Figure 6: Distribution of the number of policy types per backup domain, across all domains in the study (median: 1.9 policy types; mean: 2.6 policy types). More than 25 distinct NetBackup policy types are present in the telemetry data.

Surprisingly, we find that only 32% of backup domains register a metadata backup policy to protect the master server's data. While the remaining domains may employ a different mechanism to back up the master server, guaranteeing no data inconsistencies while doing so is challenging. In any case, this result suggests that automatically configured metadata backup policies should be a priority for future backup systems.

We also look into the number of policy categories represented by each domain's policies, to gauge the diversity in the types of protected data. Interestingly, Figure 5 shows that 82% of domains deploy policies of a single category (excluding metadata backup policies), and the remaining domains mostly use policies of two distinct categories. We further examine the number of distinct policy types that are deployed in each domain. As shown in Figure 6, domains tend to make use of a small number of policy types. Specifically, 61% of the domains deploy policies of only one or two distinct types.
4.4 Backup policies

Observation 4: After the initial configuration period, the number of policies in a domain remains mostly fixed, and 79% of clients subscribe to a single policy each.

Figure 7: Distribution of the average number of policies per backup domain (median: 7.0 policies; mean: 30.1 policies). The 95% confidence intervals for each average are also shown. Overall, the number of policies remains stable over the lifetime of a domain.

Following from Section 4.2, the policies in a backup domain, along with the number of clients, are indicative of the domain's load. Recall from Section 2.2 that clients subscribe to policies, which determine the characteristics of backup jobs. Therefore, it is important to quantify both the number of policies in a domain and the characteristics of each, to effectively characterize the domain's workload. We defer an analysis of job characteristics to the remainder of the paper, and focus here on the number of policies in each domain.

In Figure 7, we show the distribution of the average number of policies in a given backup domain, across all domains in our dataset. Overall, we find that once the initial configuration period is complete, the number of backup policies in a domain remains mostly stable. Specifically, the expected width of the 95% confidence interval is 2.5% of the average number of policies. Figure 7 also shows that the average backup domain carries 30 backup policies, while 5% of domains carry over 128. While each policy may represent a group of clients with specific data protection needs, we find that individual clients usually subscribe to a single policy. In Figure 8, we show the distribution of the average number of policies that each client subscribes to.
More than 79% of clients belong to only one policy, while 16% spend some or most of their time unprotected (less than one policy on average). The latter result, coupled with the large number of policies in backup domains and the fact that clients are added to a domain in groups (Section 4.2), suggests that manual policy configuration might not be ideal as a domain's client population inflates over time.

5 Job scheduling

While the master server can reorder policy jobs to increase overall system efficiency, it adheres to user preferences that dictate when, and how often, a job should be scheduled. This section looks into the way that these parameters are configured by users across backup domains, and the workload generated in the domain as a result.

Figure 8: Distribution of the average number of policies that a domain client subscribes to (median: 1.0 policies; mean: 1.2 policies). Overall, 79% of clients subscribe to one policy, while 16% spend some or most time unprotected by a policy (x < 1).

Job type                 Percentage of jobs
Incremental Backups      45.27%
Full Backups             31.20%
Snapshot Operations      12.61%
Management Operations    10.12%
Recovery Operations       0.80%

Table 4: Breakdown of all jobs in the dataset by type.

5.1 Job types

Recall from Section 2.2 that policies consist of a predefined series of operations, each carried out by a separate job. We collected data from 209.5 million jobs, and we group them in five distinct categories: full and incremental backups, snapshots, recovery, and management operations. In Table 4, we show a breakdown of all jobs in our dataset by job type.
Across all monitored backup domains, we find that 76% of jobs perform data backups, having processed a total of 1.64 Exabytes of data, while 13% of jobs take snapshots of data. On the other hand, less than 1% of jobs are tasked with data recovery, having restored a total of 5.12 Petabytes of data. Finally, 10% of jobs are used to manage backup images, e.g. migrate, duplicate, or delete them. Due to the data transfer of backup images, these jobs processed 4.88 Exabytes of data. We analyze individual job sizes in Section 6.

5.2 Scheduling frequency

Observation 5: Full backups tend to occur every 5 days or fewer. Recovery operations occur for few domains, on a weekly or monthly basis.

Figure 9: Distribution of the average scheduling frequency of different job types across backup domains: management operations (mean: 3 days), incremental backups (mean: 2 days), full backups (mean: 5 days), snapshot operations (mean: 5 days), and recovery operations, broken into two groups of domains with 5 or more (mean: 6 days) and fewer than 5 (mean: 24 days) recovery operations each. Despite being of similar size, the characteristics of each group differ significantly.

A factor indicative of data churn in a backup domain is the rate at which jobs are scheduled to back up, restore, or manage backed-up data. To quantify the scheduling frequency of different job types for a given domain, we rely on the starting times of individual jobs. Specifically, starting times are used to estimate the average occurrence rate of different jobs of each domain policy, on individual clients. In Figure 9, we show the distributions of the scheduling frequency of different job types across backup domains. Overall, we find that the average frequency of recovery operations differs depending on their number.
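The frequency estimate described above (average gap between consecutive job start times, per job type) can be sketched as follows; the timestamps are hypothetical stand-ins for the telemetry records.

```python
from datetime import datetime

def avg_frequency_days(start_times):
    """Average scheduling frequency as the mean gap between
    consecutive job start times, in days."""
    times = sorted(start_times)
    gaps = [(b - a).total_seconds() / 86400.0
            for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical full-backup start times for one client policy.
starts = [datetime(2015, 3, d, 18, 0) for d in (1, 6, 11, 16, 21)]
print(avg_frequency_days(starts))  # -> 5.0
```

Applying this per policy and per client, then averaging within a domain, yields the per-domain frequencies whose distributions appear in Figure 9.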
In Figure 9, we show the distributions of the recovery frequency for two domain groups, having recovered data more and less than 5 times. The former group consists of 337 domains that recovered data 17 times on average, and the latter consists of 262 domains with 3 recovery operations on average. By definition, our analysis excludes an additional 676 domains that initiate recovery only once. For domains with multiple events, the distribution of their frequency spans 1-2 weeks, with an average of 6 days. On the other hand, domains with fewer recovery operations perform them significantly less frequently, up to 2 months apart and every 24 days on average. Since recovery operations are initiated manually by users, we have no accurate way of pinpointing their cause. These results, however, suggest that frequent recovery operations may be attributed to disaster recovery testing, while infrequent ones may be due to actual disasters. Interestingly, both domain groups are equally small, but when domains with a single recovery event are factored in, the group of infrequent recovery operations doubles in size.

In the case of backup jobs, the general belief is that systems in the field rely on weekly full backups, complemented by daily incremental backups [11, 36, 67]. Our results confirm this assumption for incremental backups, which take place every 1-2 days in 81% of domains. Daily incremental backups are also the default option in NetBackup. For full backups, however, our analysis shows that only 17% of domains perform them every 6-8 days on average. Instead, the majority of domains perform full backups more often: 15% perform them every 1-2 days, and 57% perform them every 2-6 days. This is despite the fact that weekly full backups are the default option. As expected, management operations take place on a daily or weekly basis, since they usually follow (or precede) an incremental or full backup operation. Snapshot operations display a similar trend to full backups, as they are mostly used by clients in lieu of the latter.

Figure 10: Tukey boxplots (without outliers) that represent the average size of full backup jobs, for different job scheduling frequencies. Means for each boxplot are also shown. Frequent full backups seem to be associated with larger job sizes, suggesting that they may be preferred as a response to high data churn.

Of the 65% of domain policies that perform full backups every 6 days or fewer, only 33% also perform incremental backups at all. On the other hand, 76% of policies that perform weekly full backups also rely on incremental backups. To determine whether full backups are performed frequently to accommodate high data churn, we group average full backup sizes per client policy according to their scheduling frequency, and present the results as a series of boxplots in Figure 10. Note that regardless of frequency, full backups tend to be small (medians in the order of a few gigabytes), due to the efficiency of deduplication. However, the larger percentiles of each distribution show that larger backup sizes tend to occur when full backups are taken more frequently than once per week. While this confirms our assumption of high data churn for a fraction of the clients, the remaining small backup sizes could also be attributed to overly conservative configurations, a sign that policy auto-configuration is an important feature for future data protection systems.

5.3 Scheduling windows

Observation 6: Users prefer default scheduling windows during weekdays, resulting in nightly bursts of activity. Default values are overridden, however, to avoid scheduling jobs during the weekend.
Figure 11: Probability density function for scheduling policy jobs at a given hour of a given day of the week. Policies tend to be configured using the default scheduling windows at 6pm and 12am, resulting in high system load during those hours.

Another important factor for characterizing the workload of a backup system is the exact time jobs are scheduled. A popular belief is that backup operations take place late at night or during weekends, when client systems are expected to be idle [15, 66]. In Figure 11, we show our findings for all the jobs in our dataset. The presented density function was computed by normalizing the number of jobs that take place in a given domain, to prevent domains with more jobs from affecting the overall trend disproportionately. We note that this normalization had minimal effect on the result, which suggests that the presented trend is common across domains.

The hourly scheduling frequency is similar for each day, although there is less activity during the weekend. We also find that the probability of a job being scheduled is highest starting at 6pm and 12am on a weekday. We attribute the timing of job scheduling to customers using the default scheduling windows suggested by NetBackup, which start at 6pm and 12am every day. The choice to exclude weekends, however, seems to be an explicit choice of the user. This result suggests that automated job scheduling, where the only constraints would be to leverage device idleness [4, 26, 48], would be more practical, allowing the system to schedule jobs so that such activity bursts are avoided.

While Figure 11 merges all job types, different jobs exhibit different scheduling patterns, as shown in Figure 9.
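The per-domain normalization described above can be sketched as follows: each domain contributes a probability distribution over the 168 hours of the week, and domains are averaged with equal weight so that job-heavy domains do not dominate. The job data is hypothetical.

```python
def hour_of_week_density(domains):
    """Average per-domain hour-of-week distributions, as in Figure 11.
    `domains` maps a domain id to the list of hour-of-week slots
    (0-167, Monday 12am = 0) at which its jobs started."""
    density = [0.0] * 168
    for hours in domains.values():
        for h in hours:
            # Each domain contributes total weight 1/len(domains),
            # split evenly across its own jobs.
            density[h] += 1.0 / len(hours) / len(domains)
    return density  # sums to 1.0 across all 168 slots

# Hypothetical: a busy domain and a small one, both favoring 6pm Monday (slot 18).
domains = {"big": [18] * 90 + [0] * 10, "small": [18, 42]}
d = hour_of_week_density(domains)
print(round(d[18], 2))
```

Without the per-domain normalization, the "big" domain's 100 jobs would swamp the "small" domain's 2; with it, each domain contributes equally to the overall trend.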
Our data, however, does not allow a matching of job types to scheduling times at a granularity finer than the day on which the job was scheduled. Thus, we partition jobs based on their type, and in Figure 12 we show the probability that a job of a given type will be scheduled on a given day of the week. We find that incremental backups are scheduled to complement full backups, as they tend to get scheduled from Monday to Thursday, while full backups are mostly scheduled on Fridays. Note that the latter does not contradict our previous result of full backups that take place more often than once a week, as the probability of scheduling full backups any other day is still comparatively high. Recovery operations also take place within the week, with a slightly higher probability on Tuesdays (which we confirmed as not related to Patch Tuesday [49]). Finally, management operations do not follow any particular trend and are equally likely to be scheduled on any day of the week.

Figure 12: Probability of a policy job occurring on a given day of the week, based on its type. Incremental backups tend to be scheduled to complement full backups, while users initiate recovery operations more frequently at the beginning of the week.

Figure 13: Distribution of the average job size of a given job type across backup domains, after the data has been deduplicated at the client side: management operations (mean: 32.9GB), incremental backups (mean: 34.9GB), full backups (mean: 47.1GB), and recovery operations (mean: 51.8GB). Incremental backups resemble full backups in size.
6 Backup data growth

Characterizing backup data growth is crucial for estimating the amount of data that needs to be transferred and stored, which allows for efficient provisioning of storage capacity and bandwidth. Towards this goal, we analyze the sizes and number of files of different job types, and their deduplication ratios across backup domains. Finally, we look into the time that backup data is retained.

6.1 Job sizes and number of files

Observation 7: Incremental and full backups tend to be similar in size and files transferred, due to the effectiveness of deduplication, or misconfigurations. Recovery jobs restore either a few files, or entire volumes.

Figure 14: Distributions of the average number of files transferred per job, across different job types: management operations (mean: 11,724 files), incremental backups (mean: 52,033 files), full backups (mean: 75,916 files), and recovery operations (mean: 73,223 files). The trends are consistent with those for job sizes (Figure 13).

An obvious factor when estimating a domain's data growth is the size of backup jobs. In Figure 13, we show the distributions of the average number of bytes transferred for different job types across all domains, after the data has been deduplicated at the client. Averages for each operation are shown in the legend, and marked on the x axis. Snapshot operations are not included, as they do not incur data transfer.

Surprisingly, incremental backups resemble full backups in size. Although the distribution of full backups is skewed toward larger job sizes, 29% of full backups on domains that also perform incremental backups tend to be equal or smaller in size than the latter, 21% range from 1-1.5 times the size of incremental backups, and the remainder range from 1.5-10^6 times. We attribute the small size difference to three reasons.
First, systems with low data churn can achieve high deduplication rates, which are common as we show in Section 6.2. Second, misconfigured policies or volumes that do not support incremental backups often fall back to full backups, as suggested by support tickets. Third, maintenance applications, such as anti-virus scanners, can update file metadata, making unchanged files appear modified. Overall, the average backup job sizes in Figure 13 are 5-8 times smaller than the file sizes reported by Wallace et al. [66], likely due to their study considering the sizes of all files in the file system storing the backup images.

Since recovery operations can be triggered by users to recover an entire volume or individual files, the distribution of recovery job sizes is not surprising. 32% of recovery jobs restore less than 1GB, while the average job can be as large as 51GB. Finally, management operations, which consist mostly of metadata backups (95.7%), but also backup image (1.5%) and snapshot (2.8%) duplication operations, are much smaller than all other operations, as expected.

Figure 14 shows the distributions of the average number of files transferred for different job types in each domain. Similar to job sizes, the average number of files transferred per incremental backup is 31% smaller than that for full backups, and both job types are characterized by similar CDF curves. Recovery operations transfer as many files as full backups on average, yet the majority transfer fewer than 200 files. This is in line with our results on recovery job sizes. Given that large recovery jobs also occur less frequently, these results suggest that most recovery operations are not triggered as a disaster response, but rather to recover data lost due to errors, or to test the recoverability of backup images. Management operations, being mostly metadata backups, transfer significantly fewer files than other job types on average.

Figure 15: Distributions of the average daily deduplication ratio of different job types, across backup domains: management operations (mean: 66.8%), incremental backups (mean: 88.1%), and full backups (mean: 89.1%). Incremental and full backups observe high deduplication ratios, while the uniqueness of metadata backups (management operations) makes them harder to deduplicate.

6.2 Deduplication ratios

Observation 8: Deduplication can result in the reduction of backup image sizes by more than 88%, despite average job sizes ranging in the tens of gigabytes.

For clients that use NetBackup's deduplication solution, we analyzed the daily deduplication ratios of jobs, i.e. the percentage by which the number of bytes transferred was reduced due to deduplication. Figure 15 shows the distributions of the average daily deduplication ratio for management operations, full, and incremental backups across backup domains. Recovery and snapshot jobs are not included, as the notion of deduplication does not apply. Since deduplication happens globally across backup images, deduplication ratios for backups tend to increase after the first few iterations of a policy. In general, sustained deduplication ratios as high as 99% are not unusual. Across all domains in our dataset, however, the average daily deduplication ratio is 88-89%, for both full and incremental backups. It is interesting to note that despite such high deduplication ratios, jobs in the order of tens of gigabytes are common (Figure 13), suggesting that even for daily incremental jobs, the actual job sizes are an order of magnitude larger. These results are in agreement with previous work [66], which reports average deduplication ratios of 91%.
Figure 16: Distributions of retention period lengths for different job types: management operations (mean: 16 days), incremental backups (mean: 25 days), full backups (mean: 40 days), and snapshot operations (mean: 37 days). 3% of jobs have infinite retention periods. Incremental backups are typically retained for almost half the time of full backups, the majority of which are retained for months.

Finally, for management operations the average deduplication ratio is 68%. Since only 1.1% of domains that use deduplication enable it for management operations, we do not attach much importance to this result. For the reported domains, however, it can be attributed to the uniqueness of metadata backups, which do not share files with other backup images on the same backup domain and consist of large binary files.

6.3 Data retention

Observation 9: Incremental backups are retained for weeks, while full backups are retained for months, and retention depends on their scheduling frequency.

Another factor characteristic of backup storage growth is the retention time for backup images, which is a configurable policy parameter. Once a backup image expires, the master server deletes it from backup storage. We have analyzed the retention periods assigned to each job in our telemetry reports, and show the distributions of retention period lengths for different job types in Figure 16. Our initial observation is that job retention periods coincide with the values available by default in NetBackup, although users can specify custom periods. These values range from 1 week to 1 year, and correspond to the steps in the CDF shown. Federal laws, such as HIPAA [63] and FoIA [64], require minimum retention periods from a few years up to infinity for certain types of data. In our case, 3% of jobs are either assigned custom retention periods longer than 1 year, or are retained indefinitely.
On the other extreme, only 3% of jobs are assigned custom retention periods shorter than 1 week. Previous work confirms our findings, by reporting similar ages for backup image files [66].

In particular, management operations (metadata backups and backup image duplicates) are mostly retained for 1 week. Incremental backups are mostly retained for 2 weeks, the default option. Full backups and snapshots, on the other hand, are more likely retained for months. Overall, 94% of jobs select a preset retention period from NetBackup's list, and 35% of jobs keep the default suggestion of 2 weeks. This suggests that the actual retention period length is not a crucial policy parameter.

Finally, we find a strong correlation (Pearson's r = 0.53) between the length of retention periods for full backups, and the frequency with which they take place. Specifically, we find that clients taking full backups less frequently retain them for longer periods of time. On the other hand, no such correlation exists for management operations and incremental backups. This is because almost all data resulting from a management operation is retained for 1 week (Figure 16), and almost all incremental backups are performed with a frequency of 1-2 days apart (Figure 9). The correlation of retention period length and frequency of full backup operations, coupled with the preference for default values, may suggest that retention periods are selected as a function of storage capacity, or that they are at least limited by that factor.

7 Insight: next-generation data protection

This section outlines five major directions for future work on data protection systems. In each case, we identify existing literature and describe how our findings encourage future work.

Automated configuration and self-healing.
To alleviate performance and availability problems of data protection systems, existing work uses historical data to perform automated storage capacity planning [9], and data prefetching and network scheduling [25]. Our findings support this line of work. We have shown that backup domains grow in bursts, and client policies are either configured using default values, misconfigured, or not configured at all. As a result, clients are left unprotected, jobs are scheduled in bursts, and users are not warned of imminent problems. To enable automated policy configuration and self-healing data protection systems, further research is necessary.

Deduplication. Our findings confirm the efficiency of deduplication at reducing backup image sizes. We further show that in many systems, incremental backups are replaced by frequent full, deduplicated backups. This is likely due to the adoption of deduplication, which improves on incremental backups by looking for duplicates across all backup data in the domain. To completely replace incremental backups, however, it is necessary to improve on the time required to restore the original data from deduplicated storage, which directly affects recovery times. Currently, this is an area of active research [24, 35, 43, 50].

Efficient storage utilization. Our analysis shows that job retention periods are selected as a function of backup frequency, likely to ensure sufficient backup storage space will be available. Additionally, 31% of domains in our dataset use dedicated backup appliances (PBBAs), a market currently experiencing growth [33]. We believe that storage capacity in these dedicated systems should be utilized fully, and retention periods should be dynamically adjusted to fill it, providing the ability to recover older versions of data. In this direction, related work on stream-processing systems [29] could be adapted to the needs of backup data.

Accident insurance.
Most recovery operations in our dataset appear to be small in both the number of files and bytes they recover, compared to their respective backups. This result suggests that recovery operations are mostly triggered to restore a few files, or to test the integrity of backup images. This motivates us to re-examine the requirement of instant recovery for backup systems as a problem of determining which data is more likely to be recovered, and storing it closer to clients [40, 45].

Content-aware backups. Data protection strategies can generate data at a rate up to 5 times higher than production data growth [1]. This is due to the practice of creating multiple copies and backing up temporary files used for test-and-development or data analytics processes, such as the Shuffle stage of MapReduce tasks [10]. Depending on the storage interface used, it might be more efficient to recompute these datasets rather than restoring them from backup storage. Another challenge for contemporary backup software is detecting data changes since the last backup among PBs of data and billions of files [31]. By augmenting data protection systems to account for data types and modification events, we can potentially reduce the time needed to complete backup and restore operations.

8 Conclusion

We investigated an extensive dataset representing a diverse population of enterprise data protection systems to demonstrate how these systems are configured and evolve over time. Among other results, our analysis showed that these systems are usually configured to protect one type of data, and while their client population growth is steady and bursty, their backup policies don't change. With regards to job scheduling, we find that the popularity of default values can have an adverse effect on the efficiency of the system by creating bursty workloads.
Finally, we showed that full and incremental backups tend to be similar in size and number of files, as a result of efficient deduplication and misconfigurations. We hope that our data and the proposed areas of future research will enable researchers to simulate realistic scenarios for building next-generation data protection systems that are easy to configure and manage.

Acknowledgments

The study would not have been possible without the telemetry data collected by Symantec's NetBackup team, and we thank Liam McNerney and Aaron Christensen for their invaluable assistance in understanding the data. We also thank the four anonymous reviewers and our shepherd, Fred Douglis, for helping us improve our paper significantly. Finally, we would like to thank Petros Efstathopoulos, Fanglu Guo, Vish Janakiraman, Ashwin Kayyoor, CW Hobbs, Bruce Montague, Sanjay Sahwney, and all other members of Symantec's Research Labs for their feedback during the earlier stages of our study.

References

[1] Actifio. Actifio Copy Data Virtualization: How It Works, August 2014.

[2] Agrawal, N., Bolosky, W. J., Douceur, J. R., and Lorch, J. R. A five-year study of file-system metadata. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007).

[3] Arcserve. arcserve Unified Data Protection. http://www.arcserve.com, May 2014.

[4] Bachmat, E., and Schindler, J. Analysis of methods for scheduling low priority disk drive tasks. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2002).

[5] Bacula Systems. Bacula 7.0.5. http://www.bacula.org, July 2014.

[6] Baker, M., Hartman, J. H., Kupfer, M. D., Shirriff, K., and Ousterhout, J. K. Measurements of a Distributed File System. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (1991).

[7] Bennett, J. M., Bauer, M. A., and Kinchlea, D.
Characteristics of Files in NFS Environments. In Pro- ceedings of the 1991 ACM SIGSMALL/PC Symposium onSmall Systems (1991). [8]B HATTACHARYA , S., M OHAN , C., B RANNON , K. W., NARANG , I., H SIAO , H.-I., AND SUBRAMANIAN , M. Coordinating Backup/Recovery and Data ConsistencyBetween Database and File Systems. In Proceedings of the 2002 ACM SIGMOD International Conference onManagement of Data (2002), SIGMOD. [9]C HAMNESS , M. Capacity Forecasting in a Backup Stor- age Environment. In Proceedings of the 25th Interna- tional Conference on Large Installation System Adminis-tration (2011).[10] C HEN,Y . ,A LSPAUGH , S., AND KATZ, R. Interactive Analytical Processing in Big Data Systems: A Cross- industry Study of MapReduce Workloads. Proc. VLDB Endow. 5, 12 (Aug. 2012), 1802–1813. [11] CHERVENAK , A. L., V ELLANKI , V., AND KURMAS , Z. Protecting File Systems: A Survey Of Backup Tech- niques. In Proceedings of the Joint NASA and IEEE Mass Storage Conference (1998). [12] COMM VAULT SYSTEMS . Get Smart About Big Data: In- tegrated Backup, Archive & Reporting to Solve Big Data Management Problems, July 2013. [13] COMM VAULT SYSTEMS INC. CommVault Sim- pana 10. http://www.commvault.com/simpana- software, April 2014. [14] COMM VAULT SYSTEMS INC. CommVault Simpana: Solutions for Protecting and Managing Business Ap-plications. http://www.commvault.com/solutions/ enterprise-applications, April 2015. [15] DASI LVA, J., G UDMUNDSSON , O., AND MOSS ́E, D. Performance of a Parallel Network Backup Manager, 1992. [16] DASI LVA, J., AND GUTHMUNDSSON , O. The Amanda Network Backup Manager. In Proceedings of the 7th USENIX Conference on System Administration (1993), LISA. [17] DELL INC. Dell NetVault 10.0. http://software. dell.com/products/netvault-backup, May 2014. [18] DIMENSIONAL RESEARCH . The state of IT recov- ery for SMBs. http://axcient.com/state-of-it- recovery-for-smbs, Oct. 2014. [19] DOUCEUR , J. R., AND BOLOSKY , W. J. A Large-scale Study of File-system Contents. 
In Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (1999).
[20] DOUGLIS, F., BHARDWAJ, D., QIAN, H., AND SHILANE, P. Content-aware Load Balancing for Distributed Backup. In Proceedings of the 25th International Conference on Large Installation System Administration (2011), LISA.
[21] EMC CORPORATION. EMC NetWorker 8.2. http://www.emc.com/data-protection/networker.htm, July 2014.
[22] EMC CORPORATION. EMC ProtectPoint: Protection Software Enabling Direct Backup from Primary Storage to Protection Storage, 2014.
[23] EMC CORPORATION. EMC NetWorker Application Modules Data Sheet. http://www.emc.com/collateral/software/data-sheet/h2479-networker-app-modules-ds.pdf, January 2015.
[24] FU, M., FENG, D., HUA, Y., HE, X., CHEN, Z., XIA, W., HUANG, F., AND LIU, Q. Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information. In Proceedings of the 2014 USENIX Annual Technical Conference (2014).
[25] GIAT, A., PELLEG, D., RAICHSTEIN, E., AND RONEN, A. Using Machine Learning Techniques to Enhance the Performance of an Automatic Backup and Recovery System. In Proceedings of the 3rd Annual Haifa Experimental Systems Conference (2010), SYSTOR.
[26] GOLDING, R., BOSCH, P., STAELIN, C., SULLIVAN, T., AND WILKES, J. Idleness is not sloth. In Proceedings of the USENIX 1995 Technical Conference (1995), TCON'95.
[27] HEWLETT-PACKARD. Rethinking backup and recovery in the modern data center, November 2013.
[28] HEWLETT-PACKARD COMPANY. HP Data Protector 9.0.1. http://www.autonomy.com/products/data-protector, August 2014.
[29] HILDRUM, K., DOUGLIS, F., WOLF, J. L., YU, P. S., FLEISCHER, L., AND KATTA, A. Storage Optimization for Large-scale Distributed Stream-processing Systems. Trans. Storage 3, 4 (Feb.
2008), 5:1–5:28.
[30] HSU, W. W., AND SMITH, A. J. Characteristics of I/O Traffic in Personal Computer and Server Workloads. Tech. rep., EECS Department, University of California, Berkeley, 2002.
[31] HUGHES, D., AND FARROW, R. Backup Strategies for Molecular Dynamics: An Interview with Doug Hughes. USENIX ;login: 36, 2 (Apr. 2011), 25–28.
[32] IBM CORPORATION. IBM Tivoli Storage Manager 7.1. http://www.ibm.com/software/products/en/tivostormana, November 2013.
[33] INTERNATIONAL DATA CORPORATION. Worldwide Purpose-Built Backup Appliance (PBBA) Market Revenue Increases 11.2% in the Third Quarter of 2014, According to IDC. http://www.idc.com/getdoc.jsp?containerId=prUS25351414, December 2014.
[34] IRON MOUNTAIN. Data Backup and Recovery Benchmark Report. http://www.ironmountain.com/Knowledge-Center/Reference-Library/View-by-Document-Type/White-Papers-Briefs/I/Iron-Mountain-Data-Backup-and-Recovery-Benchmark-Report.aspx, 2013.
[35] KACZMARCZYK, M., BARCZYNSKI, M., KILIAN, W., AND DUBNICKI, C. Reducing Impact of Data Fragmentation Caused by In-line Deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference (2012).
[36] KEETON, K., SANTOS, C., BEYER, D., CHASE, J., AND WILKES, J. Designing for Disasters. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies (2004), FAST.
[37] KILLICK, R., AND ECKLEY, I. A. changepoint: An R package for Changepoint Analysis. In Journal of Statistical Software (May 2013).
[38] KOLSTAD, R. A Next Step in Backup and Restore Technology. In Proceedings of the 5th USENIX Conference on System Administration (1991), LISA.
[39] LEUNG, A. W., PASUPATHY, S., GOODSON, G., AND MILLER, E. L. Measurement and Analysis of Large-scale Network File System Workloads. In Proceedings of the USENIX 2008 Annual Technical Conference (2008).
[40] LI, C., SHILANE, P., DOUGLIS, F., SHIM, H., SMALDONE, S., AND WALLACE, G.
Nitro: A Capacity-Optimized SSD Cache for Primary Storage. In Proceedings of the 2014 USENIX Annual Technical Conference (2014), ATC.
[41] LI, M., QIN, C., LEE, P. P. C., AND LI, J. Convergent Dispersal: Toward Storage-Efficient Security in a Cloud-of-Clouds. In Proceedings of the 6th USENIX Workshop on Hot Topics in Storage and File Systems (2014), HotStorage.
[42] LI, Z., GREENAN, K. M., LEUNG, A. W., AND ZADOK, E. Power Consumption in Enterprise-scale Backup Storage Systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (2012), FAST.
[43] LILLIBRIDGE, M., ESHGHI, K., AND BHAGWAT, D. Improving Restore Speed for Backup Systems that Use Inline Chunk-Based Deduplication. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (2013), FAST.
[44] LIN, X., LU, G., DOUGLIS, F., SHILANE, P., AND WALLACE, G. Migratory Compression: Coarse-grained Data Reordering to Improve Compressibility. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (2014), FAST.
[45] LIU, J., CHAI, Y., QIN, X., AND XIAO, Y. PLC-cache: Endurable SSD cache for deduplication-based primary storage. In Mass Storage Systems and Technologies (MSST), 2014 30th Symposium on (2014).
[46] MEISTER, D., BRINKMANN, A., AND SÜSS, T. File Recipe Compression in Data Deduplication Systems. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (2013), FAST.
[47] MEYER, D. T., AND BOLOSKY, W. J. A Study of Practical Deduplication. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (2011).
[48] MI, N., RISKA, A., LI, X., SMIRNI, E., AND RIEDEL, E. Restrained utilization of idleness for transparent scheduling of background tasks. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems (2009), SIGMETRICS.
[49] MICROSOFT CORPORATION.
Understanding Windows automatic updating. http://windows.microsoft.com/en-us/windows/understanding-windows-automatic-updating.
[50] NG, C.-H., AND LEE, P. P. C. RevDedup: A Reverse Deduplication Storage System Optimized for Reads to Latest Backups. In Proceedings of the 4th Asia-Pacific Workshop on Systems (2013).
[51] OUSTERHOUT, J. K., DA COSTA, H., HARRISON, D., KUNZE, J. A., KUPFER, M., AND THOMPSON, J. G. A Trace-driven Analysis of the UNIX 4.2 BSD File System. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (1985).
[52] PARK, N., AND LILJA, D. J. Characterizing Datasets for Data Deduplication in Backup Applications. In Proceedings of the IEEE International Symposium on Workload Characterization (2010), IISWC.
[53] QUINLAN, S., AND DORWARD, S. Venti: A New Approach to Archival Data Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (2002), FAST.
[54] ROMIG, S. M. Backup at Ohio State, Take 2. In Proceedings of the 4th USENIX Conference on System Administration (1990), LISA.
[55] ROSELLI, D., LORCH, J. R., AND ANDERSON, T. E. A Comparison of File System Workloads. In Proceedings of the USENIX Annual Technical Conference (2000).
[56] SATYANARAYANAN, M. A Study of File Sizes and Functional Lifetimes. In Proceedings of the 8th ACM Symposium on Operating Systems Principles (1981).
[57] SHIM, H., SHILANE, P., AND HSU, W. Characterization of Incremental Data Changes for Efficient Data Protection. In Proceedings of the 2013 USENIX Annual Technical Conference (2013), ATC.
[58] SMALDONE, S., WALLACE, G., AND HSU, W. Efficiently Storing Virtual Machine Backups. In Proceedings of the 5th USENIX Workshop on Hot Topics in Storage and File Systems (2013), HotStorage.
[59] SYMANTEC CORPORATION. Symantec NetBackup 7.6 Data Sheet: Data Protection. http://www.symantec.com/content/en/us/enterprise/fact_sheets/b-netbackup-ds-21324986.pdf, January 2014.
[60] SYMANTEC CORPORATION. Symantec NetBackup 7.6. http://www.symantec.com/backup-software, March 2015.
[61] SYMANTEC CORPORATION. Symantec NetBackup 7.6.1 Getting Started Guide. https://support.symantec.com/en_US/article.DOC7941.html, February 2015.
[62] TARASOV, V., MUDRANKIT, A., BUIK, W., SHILANE, P., KUENNING, G., AND ZADOK, E. Generating Realistic Datasets for Deduplication Analysis. In Proceedings of the 2012 USENIX Annual Technical Conference (2012), ATC.
[63] U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES. The Health Insurance Portability and Accountability Act. http://www.hhs.gov/ocr/privacy.
[64] U.S. DEPARTMENT OF JUSTICE. The Freedom of Information Act. http://www.foia.gov.
[65] VANSON BOURNE. Virtualization Data Protection Report 2013 – SMB edition. http://www.dabcc.com/documentlibrary/file/virtualization-data-protection-report-smb-2013.pdf, 2013.
[66] WALLACE, G., DOUGLIS, F., QIAN, H., SHILANE, P., SMALDONE, S., CHAMNESS, M., AND HSU, W. Characteristics of Backup Workloads in Production Systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (2012).
[67] ZHU, B., LI, K., AND PATTERSON, H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (2008).
[68] ZMANDA INC. Amanda 3.3.6. http://amanda.zmanda.com, July 2014.
[69] ZWICKY, E. D. Torture-testing Backup and Archive Programs: Things You Ought to Know But Probably Would Rather Not. In Proceedings of the 5th USENIX Conference on System Administration (1991), LISA.
[70] ZWICKY, E. D. Further Torture: More Testing of Backup and Archive Programs. In Proceedings of the 17th USENIX Conference on System Administration (2003), LISA.

Quick introduction into SAT/SMT solvers and symbolic execution

Dennis Yurichev

December 2015 – May 2017

Contents

1 This is a draft!
2 Thanks
3 Introduction
4 Is it a hype? Yet another fad?
5 SMT-solvers
  5.1 School-level system of equations
  5.2 Another school-level system of equations
  5.3 Connection between SAT and SMT solvers
  5.4 Zebra puzzle (AKA Einstein puzzle)
  5.5 Sudoku puzzle
    5.5.1 The first idea
    5.5.2 The second idea
    5.5.3 Conclusion
    5.5.4 Homework
    5.5.5 Further reading
    5.5.6 Sudoku as a SAT problem
  5.6 Solving Problem Euler 31: "Coin sums"
  5.7 Using Z3 theorem prover to prove equivalence of some weird alternative to XOR operation
    5.7.1 In SMT-LIB form
    5.7.2 Using universal quantifier
    5.7.3 How the expression works
  5.8 Dietz's formula
  5.9 Cracking LCG with Z3
  5.10 Solving pipe puzzle using Z3 SMT-solver
    5.10.1 Generation
    5.10.2 Solving
  5.11 Cracking Minesweeper with Z3 SMT solver
    5.11.1 The method
    5.11.2 The code
  5.12 Recalculating micro-spreadsheet using Z3Py
    5.12.1 Unsat core
    5.12.2 Stress test
    5.12.3 The files

1 Satisfiability modulo theories
2 Boolean satisfiability problem
3 Also Known As
4 Linear congruential generator

6 Program synthesis
  6.1 Synthesis of simple program using Z3 SMT-solver
    6.1.1 Few notes
    6.1.2 The code
  6.2 Rockey dongle: finding unknown algorithm using only input/output pairs
    6.2.1 Conclusion
    6.2.2 The files
    6.2.3 Further work
7 Toy decompiler
  7.1 Introduction
  7.2 Data structure
  7.3 Simple examples
  7.4 Dealing with compiler optimizations
    7.4.1 Division using multiplication
  7.5 Obfuscation/deobfuscation
  7.6 Tests
    7.6.1 Evaluating expressions
    7.6.2 Using Z3 SMT-solver for testing
  7.7 My other implementations of toy decompiler
    7.7.1 Even simpler toy decompiler
  7.8 Difference between toy decompiler and commercial-grade one
  7.9 Further reading
  7.10 The files
8 Symbolic execution
  8.1 Symbolic computation
    8.1.1 Rational data type
  8.2 Symbolic execution
    8.2.1 Swapping two values using XOR
    8.2.2 Change endianness
    8.2.3 Fast Fourier transform
    8.2.4 Cyclic redundancy check
    8.2.5 Linear congruential generator
    8.2.6 Path constraint
    8.2.7 Division by zero
    8.2.8 Merge sort
    8.2.9 Extending Expr class
    8.2.10 Conclusion
  8.3 Further reading
9 KLEE
  9.1 Installation
  9.2 School-level equation
  9.3 Zebra puzzle
  9.4 Sudoku
  9.5 Unit test: HTML/CSS color
  9.6 Unit test: strcmp() function
  9.7 UNIX date/time
  9.8 Inverse function for base64 decoder
  9.9 CRC (Cyclic redundancy check)
    9.9.1 Buffer alteration case #1
    9.9.2 Buffer alteration case #2
    9.9.3 Recovering input data for given CRC32 value of it
    9.9.4 In comparison with other hashing algorithms
  9.10 LZSS decompressor
  9.11 strtodx() from RetroBSD
  9.12 Unit testing: simple expression evaluator (calculator)
  9.13 Regular expressions
  9.14 Exercise
10 (Amateur) cryptography
  10.1 Serious cryptography
    10.1.1 Attempts to break "serious" crypto
  10.2 Amateur cryptography
    10.2.1 Bugs
    10.2.2 XOR ciphers
    10.2.3 Other features
    10.2.4 Examples
  10.3 Case study: simple hash function
    10.3.1 Manual decompiling
    10.3.2 Now let's use the Z3
11 SAT-solvers
  11.1 CNF form
  11.2 Example: 2-bit adder
    11.2.1 MiniSat
    11.2.2 CryptoMiniSat
  11.3 Cracking Minesweeper with SAT solver
    11.3.1 Simple population count function
    11.3.2 Minesweeper
  11.4 Conway's "Game of Life"
    11.4.1 Reversing back state of "Game of Life"
    11.4.2 Finding "still lives"
    11.4.3 The source code
12 Acronyms used

1 This is a draft!
This is a very early draft, but it may still be interesting to someone. The latest version is always available at http://yurichev.com/writings/SAT_SMT_draft-EN.pdf. The Russian version is at http://yurichev.com/writings/SAT_SMT_draft-RU.pdf. Current text version: May 8, 2017.
For news about updates, you may subscribe to my twitter5, facebook6, or github repo7.

2 Thanks

Leonardo de Moura and Nikolaj Bjorner, for help.

3 Introduction

SAT/SMT solvers can be viewed as solvers of huge systems of equations. The difference is that SMT solvers take systems in arbitrary format, while SAT solvers are limited to boolean equations in CNF8 form. A lot of real world problems can be represented as problems of solving a system of equations.

4 Is it a hype? Yet another fad?

Some people say this is just another hype. No: SAT is old enough and fundamental to CS9. The reason for the increased interest in it is that computers have become faster over the last couple of decades, so there are attempts to solve old problems using SAT/SMT which were inaccessible in the past.

5 https://twitter.com/yurichev
6 https://www.facebook.com/dennis.yurichev.5
7 https://github.com/dennis714/SAT_SMT_article
8 Conjunctive normal form
9 Computer science

5 SMT-solvers

5.1 School-level system of equations

I've got this school-level system of equations copypasted from Wikipedia10:

3x + 2y - z = 1
2x - 2y + 4z = -2
-x + (1/2)y - z = 0

Will it be possible to solve it using Z3? Here it is:

#!/usr/bin/python
from z3 import *

x = Real('x')
y = Real('y')
z = Real('z')

s = Solver()
s.add(3*x + 2*y - z == 1)
s.add(2*x - 2*y + 4*z == -2)
s.add(-x + 0.5*y - z == 0)

print s.check()
print s.model()

We see this after a run:

sat
[z = -2, y = -2, x = 1]

If we change any equation in some way so that it has no solution, s.check() will return "unsat". I've used the "Real" sort (some kind of data type in SMT-solvers) because the last equation contains the coefficient 1/2, which is, of course, a real number. For an integer system of equations, the "Int" sort would work fine. The Python (and other high-level PLs11 like C#) interface is highly popular, because it's practical, but in fact, there is a standard language for SMT-solvers called SMT-LIB12.
Our example rewritten to it looks like this:

(declare-const x Real)
(declare-const y Real)
(declare-const z Real)
(assert (= (- (+ (* 3 x) (* 2 y)) z) 1))
(assert (= (+ (- (* 2 x) (* 2 y)) (* 4 z)) -2))
(assert (= (- (+ (- 0 x) (* 0.5 y)) z) 0))
(check-sat)
(get-model)

This language is very close to LISP, but is somewhat hard to read for untrained eyes. Now we run it:

% z3 -smt2 example.smt
sat
(model
  (define-fun z () Real (- 2.0))
  (define-fun y () Real (- 2.0))
  (define-fun x () Real 1.0)
)

So when you look back at my Python code, you may feel that these 3 expressions could be executed. This is not true: the Z3Py API offers overloaded operators, so expressions are constructed and passed into the guts of Z3 without any execution13. I would call it an "embedded DSL14".

10 https://en.wikipedia.org/wiki/System_of_linear_equations
11 Programming Language
12 http://smtlib.cs.uiowa.edu/papers/smt-lib-reference-v2.5-r2015-06-28.pdf
13 https://github.com/Z3Prover/z3/blob/6e852762baf568af2aad1e35019fdf41189e4e12/src/api/python/z3.py
14 Domain-specific language

The same goes for the Z3 C++ API: you may find "operator+" declarations there and many more15. Z3 APIs16 for Java, ML and .NET also exist17.
Z3Py tutorial: https://github.com/ericpony/z3py-tutorial.
Z3 tutorial which uses the SMT-LIB language: http://rise4fun.com/Z3/tutorial/guide.

5.2 Another school-level system of equations

I've found this one somewhere on Facebook:

Figure 1: System of equations

It's that easy to solve it in Z3:

#!/usr/bin/python
from z3 import *

circle, square, triangle = Ints('circle square triangle')
s = Solver()
s.add(circle+circle==10)
s.add(circle*square+square==12)
s.add(circle*square-triangle*circle==circle)
print s.check()
print s.model()

sat
[triangle = 1, square = 2, circle = 5]

5.3 Connection between SAT and SMT solvers

Early SMT-solvers were frontends to SAT solvers, i.e., they translated input SMT expressions into CNF and fed a SAT-solver with it. The translation process is sometimes called "bit blasting". Some SMT-solvers still work that way: STP uses MiniSAT or CryptoMiniSAT as its backend SAT-solver.
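To get a feel for what bit blasting produces, here is a hand-made illustration (mine, not the encoding STP or any real solver emits): the one-bit constraint c == (a AND b) translated into three CNF clauses, checked by brute force in plain Python.

```python
from itertools import product

# A hand-made Tseitin-style CNF encoding of the one-bit constraint
# c == (a AND b). Each clause is a list of literals: a positive number
# means the variable itself, a negative number means its negation.
# Variables: 1 = a, 2 = b, 3 = c.
clauses = [[-1, -2, 3],  # (a AND b) implies c
           [1, -3],      # c implies a
           [2, -3]]      # c implies b

def satisfies(assignment, clauses):
    # assignment maps a variable number to a boolean;
    # a CNF holds when every clause has at least one true literal
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# brute force over all 8 assignments: the CNF holds exactly when c == (a AND b)
for a, b, c in product([False, True], repeat=3):
    assert satisfies({1: a, 2: b, 3: c}, clauses) == (c == (a and b))

print("CNF encoding of c == (a AND b) verified for all 8 assignments")
```

A solver translating a whole 32-bit constraint would emit clauses like these for every bit of every gate, which is why the resulting CNF instances get large.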
Some other SMT-solvers are more advanced (like Z3), so they use something even more complex.

5.4 Zebra puzzle (AKA Einstein puzzle)

The Zebra puzzle is a popular puzzle, defined as follows:

15 https://github.com/Z3Prover/z3/blob/6e852762baf568af2aad1e35019fdf41189e4e12/src/api/c%2B%2B/z3%2B%2B.h
16 Application programming interface
17 https://github.com/Z3Prover/z3/tree/6e852762baf568af2aad1e35019fdf41189e4e12/src/api

1. There are five houses.
2. The Englishman lives in the red house.
3. The Spaniard owns the dog.
4. Coffee is drunk in the green house.
5. The Ukrainian drinks tea.
6. The green house is immediately to the right of the ivory house.
7. The Old Gold smoker owns snails.
8. Kools are smoked in the yellow house.
9. Milk is drunk in the middle house.
10. The Norwegian lives in the first house.
11. The man who smokes Chesterfields lives in the house next to the man with the fox.
12. Kools are smoked in the house next to the house where the horse is kept.
13. The Lucky Strike smoker drinks orange juice.
14. The Japanese smokes Parliaments.
15. The Norwegian lives next to the blue house.

Now, who drinks water? Who owns the zebra?

In the interest of clarity, it must be added that each of the five houses is painted a different color, and their inhabitants are of different national extractions, own different pets, drink different beverages and smoke different brands of American cigarets [sic]. One other thing: in statement 6, right means your right.

(https://en.wikipedia.org/wiki/Zebra_Puzzle)

It's a very good example of a CSP18. We will encode each entity as an integer variable representing the number of a house. Then, to define that the Englishman lives in the red house, we will add this constraint: Englishman == Red, meaning that the number of the house where the Englishman resides and the number of the house which is painted red are the same. To define that the Norwegian lives next to the blue house, we don't really know whether he is on the left side of the blue house or on the right side, but we know that the house numbers differ by just 1. So we will define this constraint: Norwegian==Blue-1 OR Norwegian==Blue+1. We will also need to limit all house numbers, so they will be in the range of 1..5.
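The Norwegian==Blue-1 OR Norwegian==Blue+1 encoding of "next to" can be sanity-checked by plain enumeration; this throwaway Python sketch (mine, separate from the solver code in this section) lists every pair of house numbers the constraint admits:

```python
from itertools import product

# Enumerate every (Norwegian, Blue) pair of house numbers in 1..5 and keep
# those satisfying Norwegian==Blue-1 or Norwegian==Blue+1 ("next to").
next_to = [(n, b) for n, b in product(range(1, 6), repeat=2)
           if n == b - 1 or n == b + 1]

# the surviving pairs are exactly those whose house numbers differ by 1
assert all(abs(n - b) == 1 for n, b in next_to)
assert len(next_to) == 8

print(next_to)
```

The same Or() pattern appears twice more in the puzzle, for statements 11 and 12.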
We will also use Distinct to state that all entities of the same type have different house numbers.

18 Constraint satisfaction problem

#!/usr/bin/env python
from z3 import *

Yellow, Blue, Red, Ivory, Green=Ints('Yellow Blue Red Ivory Green')
Norwegian, Ukrainian, Englishman, Spaniard, Japanese=Ints('Norwegian Ukrainian Englishman Spaniard Japanese')
Water, Tea, Milk, OrangeJuice, Coffee=Ints('Water Tea Milk OrangeJuice Coffee')
Kools, Chesterfield, OldGold, LuckyStrike, Parliament=Ints('Kools Chesterfield OldGold LuckyStrike Parliament')
Fox, Horse, Snails, Dog, Zebra=Ints('Fox Horse Snails Dog Zebra')

s = Solver()

# colors are distinct for all 5 houses:
s.add(Distinct(Yellow, Blue, Red, Ivory, Green))
# all nationalities are living in different houses:
s.add(Distinct(Norwegian, Ukrainian, Englishman, Spaniard, Japanese))
# so are beverages:
s.add(Distinct(Water, Tea, Milk, OrangeJuice, Coffee))
# so are cigarettes:
s.add(Distinct(Kools, Chesterfield, OldGold, LuckyStrike, Parliament))
# so are pets:
s.add(Distinct(Fox, Horse, Snails, Dog, Zebra))

# limits.
# adding two constraints at once (separated by comma) is the same
# as adding one And() constraint with two subconstraints
s.add(Yellow>=1, Yellow<=5)
s.add(Blue>=1, Blue<=5)
s.add(Red>=1, Red<=5)
s.add(Ivory>=1, Ivory<=5)
s.add(Green>=1, Green<=5)
s.add(Norwegian>=1, Norwegian<=5)
s.add(Ukrainian>=1, Ukrainian<=5)
s.add(Englishman>=1, Englishman<=5)
s.add(Spaniard>=1, Spaniard<=5)
s.add(Japanese>=1, Japanese<=5)
s.add(Water>=1, Water<=5)
s.add(Tea>=1, Tea<=5)
s.add(Milk>=1, Milk<=5)
s.add(OrangeJuice>=1, OrangeJuice<=5)
s.add(Coffee>=1, Coffee<=5)
s.add(Kools>=1, Kools<=5)
s.add(Chesterfield>=1, Chesterfield<=5)
s.add(OldGold>=1, OldGold<=5)
s.add(LuckyStrike>=1, LuckyStrike<=5)
s.add(Parliament>=1, Parliament<=5)
s.add(Fox>=1, Fox<=5)
s.add(Horse>=1, Horse<=5)
s.add(Snails>=1, Snails<=5)
s.add(Dog>=1, Dog<=5)
s.add(Zebra>=1, Zebra<=5)

# main constraints of the puzzle:

# 2. The Englishman lives in the red house.
s.add(Englishman==Red)
# 3. The Spaniard owns the dog.
s.add(Spaniard==Dog)
# 4. Coffee is drunk in the green house.
s.add(Coffee==Green)
# 5. The Ukrainian drinks tea.
s.add(Ukrainian==Tea)
# 6. The green house is immediately to the right of the ivory house.
s.add(Green==Ivory+1)
# 7. The Old Gold smoker owns snails.
s.add(OldGold==Snails)
# 8. Kools are smoked in the yellow house.
s.add(Kools==Yellow)
# 9. Milk is drunk in the middle house.
s.add(Milk==3) # i.e., 3rd house
# 10. The Norwegian lives in the first house.
s.add(Norwegian==1)
# 11. The man who smokes Chesterfields lives in the house next to the man with the fox.
s.add(Or(Chesterfield==Fox+1, Chesterfield==Fox-1)) # left or right
# 12. Kools are smoked in the house next to the house where the horse is kept.
s.add(Or(Kools==Horse+1, Kools==Horse-1)) # left or right
# 13. The Lucky Strike smoker drinks orange juice.
s.add(LuckyStrike==OrangeJuice)
# 14. The Japanese smokes Parliaments.
s.add(Japanese==Parliament)
# 15. The Norwegian lives next to the blue house.
s.add(Or(Norwegian==Blue+1, Norwegian==Blue-1)) # left or right

r=s.check()
print r
if r==unsat:
    exit(0)
m=s.model()
print(m)

When we run it, we get the correct result:

sat
[Snails = 3, Blue = 2, Ivory = 4, OrangeJuice = 4, Parliament = 5, Yellow = 1,
 Fox = 1, Zebra = 5, Horse = 2, Dog = 4, Tea = 2, Water = 1, Chesterfield = 2,
 Red = 3, Japanese = 5, LuckyStrike = 4, Norwegian = 1, Milk = 3, Kools = 1,
 OldGold = 3, Ukrainian = 2, Coffee = 5, Green = 5, Spaniard = 4, Englishman = 3]

5.5 Sudoku puzzle

A Sudoku puzzle is a 9*9 grid with some cells filled with values and some left empty:

. . 5 | 3 . . | . . .
8 . . | . . . | . 2 .
. 7 . | . 1 . | 5 . .
------+-------+------
4 . . | . . 5 | 3 . .
. 1 . | . 7 . | . . 6
. . 3 | 2 . . | . 8 .
------+-------+------
. 6 . | 5 . . | . . 9
. . 4 | . . . | . 3 .
. . . | . . 9 | 7 . .

Unsolved Sudoku

The numbers in each row must be unique, i.e., each row must contain all 9 numbers in the range of 1..9 without repetition. The same goes for each column and also for each 3*3 square. This puzzle is a good candidate for trying an SMT solver on, because it's essentially an unsolved system of equations.

5.5.1 The first idea

The only thing we must decide is how to determine, in a single expression, whether 9 input variables hold all 9 unique numbers.
They are not ordered or sorted, after all. From school-level arithmetic, we can devise this idea:

    10^i1 + 10^i2 + ... + 10^i9 = 1111111110    (1)

Take each input variable, compute 10^i, and sum them all. If all input values are unique, each will settle at its own place. Even more than that: there will be no holes, i.e., no skipped values. So, in the case of Sudoku, the final result will be the number 1111111110, indicating that all 9 input values are unique and lie in the range of 1..9. Exponentiation is a heavy operation; can we use binary operations instead? Yes, just replace 10 with 2:

    2^i1 + 2^i2 + ... + 2^i9 = 1111111110 (base 2)    (2)

The effect is just the same, but the final value is in base 2 instead of 10. Now a working example:

import sys
from z3 import *

"""
coordinates:
------------------------------
00 01 02 | 03 04 05 | 06 07 08
10 11 12 | 13 14 15 | 16 17 18
20 21 22 | 23 24 25 | 26 27 28
------------------------------
30 31 32 | 33 34 35 | 36 37 38
40 41 42 | 43 44 45 | 46 47 48
50 51 52 | 53 54 55 | 56 57 58
------------------------------
60 61 62 | 63 64 65 | 66 67 68
70 71 72 | 73 74 75 | 76 77 78
80 81 82 | 83 84 85 | 86 87 88
------------------------------
"""

s=Solver()

# using a Python list comprehension, construct an array of arrays of BitVec instances:
cells=[[BitVec('cell%d%d' % (r, c), 16) for c in range(9)] for r in range(9)]

# http://www.norvig.com/sudoku.html
# http://www.mirror.co.uk/news/weird-news/worlds-hardest-sudoku-can-you-242294
puzzle="..53.....8......2..7..1.5..4....53...1..7...6..32...8..6.5....9..4....3......97.."
# process the text line:
current_column=0
current_row=0
for i in puzzle:
    if i!='.':
        s.add(cells[current_row][current_column]==BitVecVal(int(i),16))
    current_column=current_column+1
    if current_column==9:
        current_column=0
        current_row=current_row+1

one=BitVecVal(1,16)
mask=BitVecVal(0b1111111110,16)

# for all 9 rows
for r in range(9):
    s.add((one<<cells[r][0]) + (one<<cells[r][1]) + (one<<cells[r][2]) +
          (one<<cells[r][3]) + (one<<cells[r][4]) + (one<<cells[r][5]) +
          (one<<cells[r][6]) + (one<<cells[r][7]) + (one<<cells[r][8]) == mask)

The constraints for the 9 columns and the 9 3*3 squares are built in the same way. This version works, but slowly.

5.5.2 Using Distinct()

Z3 also has a built-in Distinct() constraint, which states that all its arguments must be distinct. With it, the range of each cell must be limited explicitly:

# all cells must hold values in the 1..9 range:
for r in range(9):
    for c in range(9):
        s.add(cells[r][c]>=1)
        s.add(cells[r][c]<=9)

# for all 9 rows
for r in range(9):
    s.add(Distinct(cells[r][0], cells[r][1], cells[r][2], cells[r][3], cells[r][4],
                   cells[r][5], cells[r][6], cells[r][7], cells[r][8]))

# for all 9 columns
for c in range(9):
    s.add(Distinct(cells[0][c], cells[1][c], cells[2][c], cells[3][c], cells[4][c],
                   cells[5][c], cells[6][c], cells[7][c], cells[8][c]))

# enumerate all 9 squares
for r in range(0, 9, 3):
    for c in range(0, 9, 3):
        # add constraints for each 3*3 square:
        s.add(Distinct(cells[r+0][c+0], cells[r+0][c+1], cells[r+0][c+2],
                       cells[r+1][c+0], cells[r+1][c+1], cells[r+1][c+2],
                       cells[r+2][c+0], cells[r+2][c+1], cells[r+2][c+2]))

s.check()
m=s.model()
for r in range(9):
    for c in range(9):
        sys.stdout.write(str(m[cells[r][c]])+" ")
    print("")

(https://github.com/dennis714/SAT_SMT_article/blob/master/SMT/sudoku2.py)

% time python sudoku2.py
1 4 5 3 2 7 6 9 8
8 3 9 6 5 4 1 2 7
6 7 2 9 1 8 5 4 3
4 9 6 1 8 5 3 7 2
2 1 8 4 7 3 9 5 6
7 5 3 2 9 6 4 8 1
3 6 7 5 4 2 8 1 9
9 8 4 7 6 1 2 3 5
5 2 1 8 3 9 7 6 4

real 0m0.382s
user 0m0.346s
sys 0m0.036s

That's much faster.

5.5.3 Conclusion

SMT-solvers are so helpful that our Sudoku solver contains nothing else: we have just defined the relationships between the variables (cells).

5.5.4 Homework

As it seems, a true Sudoku puzzle is one that has only one solution. The piece of code included here shows only the first one found. Using the method described earlier (5.6, also called "model counting"), try to find more solutions, or prove that the solution you have just found is the only one possible.
5.5.5 Further reading

http://www.norvig.com/sudoku.html

5.5.6 Sudoku as a SAT problem

It is also possible to represent a Sudoku puzzle as a huge CNF equation and use a SAT-solver to find the solution, but it is just trickier. Some articles about it: Building a Sudoku Solver with SAT [20], Tjark Weber; A SAT-based Sudoku Solver [21], Ines Lynce, Joel Ouaknine; Sudoku as a SAT Problem [22], Gihwon Kwon, Himanshu Jain; Optimized CNF Encoding for Sudoku Puzzles [23].

An SMT-solver may also use a SAT-solver at its core, so it does all the mundane translation work. As a "compiler", it may not do this in the most efficient way, though.

5.6 Solving Project Euler problem 31: "Coin sums"

(This text was first published in my blog [24] on 10-May-2013.)

    In England the currency is made up of pound, £, and pence, p, and there are eight coins
    in general circulation: 1p, 2p, 5p, 10p, 20p, 50p, £1 (100p) and £2 (200p). It is
    possible to make £2 in the following way: 1×£1 + 1×50p + 2×20p + 1×5p + 1×2p + 3×1p.
    How many different ways can £2 be made using any number of coins?
    (Project Euler problem 31, "Coin sums")

Using Z3 to solve this is overkill, and also slow, but nevertheless it works, showing all possible solutions as well. The piece of code for blocking an already found solution and searching for the next one, and thus counting all solutions, was taken from a Stack Overflow answer [25]. This is also called "model counting". Constraints like "a>=0" must be present, because otherwise the Z3 solver would find solutions with negative numbers.
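Before running the solver, the expected count can be obtained independently with textbook dynamic programming (my own sketch, not part of the original program); this gives a number to compare the model-counting loop against:

```python
# count the ways to make 200 pence from the eight coins by dynamic
# programming: ways[n] = number of ways to make n pence
coins = [1, 2, 5, 10, 20, 50, 100, 200]
ways = [1] + [0] * 200          # one way to make 0p: take no coins
for coin in coins:
    for n in range(coin, 201):
        ways[n] += ways[n - coin]
print(ways[200])  # 73682
```

This runs instantly, which underlines the "overkill" remark: the solver is used here not for speed, but to demonstrate model counting.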
#!/usr/bin/python
from z3 import *

a,b,c,d,e,f,g,h = Ints('a b c d e f g h')

s = Solver()
s.add(1*a + 2*b + 5*c + 10*d + 20*e + 50*f + 100*g + 200*h == 200,
      a>=0, b>=0, c>=0, d>=0, e>=0, f>=0, g>=0, h>=0)

result=[]

while True:
    if s.check() == sat:
        m = s.model()
        print(m)
        result.append(m)
        # create a new constraint that blocks the current model
        block = []
        for d in m:
            # d is a declaration
            if d.arity() > 0:
                raise Z3Exception("uninterpreted functions are not supported")
            # create a constant from the declaration
            c=d()
            if is_array(c) or c.sort().kind() == Z3_UNINTERPRETED_SORT:
                raise Z3Exception("arrays and uninterpreted sorts are not supported")
            block.append(c != m[d])
        s.add(Or(block))
    else:
        print(len(result))
        break

[20] http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2011/assignments/MIT6_005F11_ps4.pdf
[21] https://www.lri.fr/~conchon/mpri/weber.pdf
[22] http://sat.inesc-id.pt/~ines/publications/aimath06.pdf
[23] http://www.cs.cmu.edu/~hjain/papers/sudoku-as-SAT.pdf
[24] http://dennisyurichev.blogspot.de/2013/05/in-england-currency-is-made-up-of-pound.html
[25] http://stackoverflow.com/questions/11867611/z3py-checking-all-solutions-for-equation ; another question: http://stackoverflow.com/questions/13395391/z3-finding-all-satisfying-models

It works very slowly, and this is what it produces:

[h = 0, g = 0, f = 0, e = 0, d = 0, c = 0, b = 0, a = 200]
[f = 1, b = 5, a = 0, d = 1, g = 1, h = 0, c = 2, e = 1]
[f = 0, b = 1, a = 153, d = 0, g = 0, h = 0, c = 1, e = 2]
...
[f = 0, b = 31, a = 33, d = 2, g = 0, h = 0, c = 17, e = 0]
[f = 0, b = 30, a = 35, d = 2, g = 0, h = 0, c = 17, e = 0]
[f = 0, b = 5, a = 50, d = 2, g = 0, h = 0, c = 24, e = 0]

73682 results in total.

5.7 Using the Z3 theorem prover to prove equivalence of some weird alternative to the XOR operation

(This text was first published in my blog in April 2015: http://blog.yurichev.com/node/86.)
There is "A Hacker's Assistant" program [26] (Aha!) written by Henry Warren, who is also the author of the great "Hacker's Delight" book. The Aha! program is essentially a superoptimizer [27], which blindly brute-forces a list of generic RISC CPU instructions to find the shortest possible (and jumpless, or branch-free) CPU code sequence for a desired operation. For example, Aha! can easily find a jumpless version of the abs() function.

Compiler developers use superoptimization to find the shortest possible (and/or jumpless) code, but I tried to do the opposite: to find the longest code for some primitive operation. I used Aha! to find an equivalent of the basic XOR operation without using the actual XOR instruction, and the most bizarre example Aha! gave is:

Found a 4-operation program:
   add   r1,ry,rx
   and   r2,ry,rx
   mul   r3,r2,-2
   add   r4,r3,r1
   Expr: (((y & x)*-2) + (y + x))

It is hard to say why/where we could use it; maybe for obfuscation, I'm not sure. I would call this suboptimization (as opposed to superoptimization). Or maybe superdeoptimization.

But my other question was: is it possible to prove that this formula is correct at all? Aha! checks some input/output values against the XOR operation, but, of course, not all possible values. It is 32-bit code, so it may take a very long time to try all possible 32-bit inputs. We can try the Z3 theorem prover for the job. It is called a prover, after all. So I wrote this:

#!/usr/bin/python
from z3 import *

x = BitVec('x', 32)
y = BitVec('y', 32)
output = BitVec('output', 32)

s = Solver()
s.add(x^y==output)
s.add(((y & x)*0xFFFFFFFE) + (y + x)!=output)
print(s.check())

[26] http://www.hackersdelight.org/
[27] http://en.wikipedia.org/wiki/Superoptimization

In plain English, this means: "is there any case for x and y where x XOR y does not equal ((y & x)*(-2)) + (y + x)?"

... and Z3 prints "unsat", meaning it cannot find any counterexample to the equation. So this Aha! result is proven to work just like the XOR operation.
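The proof can be complemented by a quick randomized sanity check in plain Python (my own sketch); this is essentially what Aha! itself does, only on a handful of values, which is exactly why a real proof is still needed:

```python
import random

def weird_xor(x, y, bits=32):
    # (((y & x) * -2) + (y + x)), truncated like a 32-bit register
    mask = (1 << bits) - 1
    return (((y & x) * -2) + (y + x)) & mask

random.seed(0)
for _ in range(100000):
    x = random.getrandbits(32)
    y = random.getrandbits(32)
    assert weird_xor(x, y) == x ^ y
print("100000 random tests passed")
```

A hundred thousand passing random tests is encouraging, but still covers a negligible fraction of the 2^64 input pairs; Z3 covers them all.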
Oh, I also tried to extend the formula to 64 bits:

#!/usr/bin/python
from z3 import *

x = BitVec('x', 64)
y = BitVec('y', 64)
output = BitVec('output', 64)

s = Solver()
s.add(x^y==output)
s.add(((y & x)*0xFFFFFFFE) + (y + x)!=output)
print(s.check())

Nope, now it says "sat", meaning Z3 found at least one counterexample. Oops, that is because I forgot to extend the -2 constant to a 64-bit value:

#!/usr/bin/python
from z3 import *

x = BitVec('x', 64)
y = BitVec('y', 64)
output = BitVec('output', 64)

s = Solver()
s.add(x^y==output)
s.add(((y & x)*0xFFFFFFFFFFFFFFFE) + (y + x)!=output)
print(s.check())

Now it says "unsat", so the formula given by Aha! works for 64-bit code as well.

5.7.1 In SMT-LIB form

Now we can rephrase our expression in a more suitable form: x + y - ((x & y)<<1). It also works well in Z3Py:

#!/usr/bin/python
from z3 import *

x = BitVec('x', 64)
y = BitVec('y', 64)
output = BitVec('output', 64)

s = Solver()
s.add(x^y==output)
s.add((x + y - ((x & y)<<1)) != output)
print(s.check())

Here is how to define it in the SMT-LIB way:

(declare-const x (_ BitVec 64))
(declare-const y (_ BitVec 64))
(assert
    (not (=
        (bvsub (bvadd x y) (bvshl (bvand x y) (_ bv1 64)))
        (bvxor x y)
    ))
)
(check-sat)

5.7.2 Using quantifiers

Z3 supports the existential quantifier exists, which is true if at least one set of variables satisfies the underlying condition:

(declare-const x (_ BitVec 64))
(declare-const y (_ BitVec 64))
(assert
    (exists ((x (_ BitVec 64)) (y (_ BitVec 64)))
        (not (=
            (bvsub (bvadd x y) (bvshl (bvand x y) (_ bv1 64)))
            (bvxor x y)
        ))
    )
)
(check-sat)

It returns "unsat", meaning Z3 could not find any counterexample to the equation, i.e., none exists. This quantifier is also known as ∃ in mathematical logic lingo.

Z3 also supports the universal quantifier forall, which is true if the equation is true for all possible values.
So we can rewrite our SMT-LIB example as:

(declare-const x (_ BitVec 64))
(declare-const y (_ BitVec 64))
(assert
    (forall ((x (_ BitVec 64)) (y (_ BitVec 64)))
        (=
            (bvsub (bvadd x y) (bvshl (bvand x y) (_ bv1 64)))
            (bvxor x y)
        )
    )
)
(check-sat)

It returns "sat", meaning the equation is correct for all possible 64-bit x and y values, as if they had all been checked. Mathematically speaking [28]:

    ∀x, y: x ⊕ y = x + y − ((x & y) ≪ 1)

[28] ∀ means the equation must be true for all possible values, which are chosen from the natural numbers (N).

5.7.3 How the expression works

First of all, binary addition can be viewed as binary XORing with carrying (11.2). Here is an example: let's add 2 (10b) and 2 (10b). XORing these two values gives 0, but a carry is generated during the addition of the two second bits. That carry bit is propagated further and settles at the place of the 3rd bit: 100b. 4 (100b) is hence the final result of the addition. If no carry bits are generated during addition, the addition is merely XORing. For example, let's add 1 (1b) and 2 (10b): 1 + 2 equals 3, but 1 ⊕ 2 is also 3.

If addition is XORing plus carry generation and application, we should somehow eliminate the effect of carrying. The first part of the expression (x + y) is the addition; the second part ((x & y) ≪ 1) is just the computation of every carry bit that was used during the addition. If we subtract the carry bits from the result of the addition, only the XOR effect is left.

It is hard to say how Z3 proves this: maybe it just simplifies the equation down to a single XOR using simple boolean algebra rewriting rules?

5.8 Dietz's formula

One of the impressive examples of Aha!'s work is the finding of Dietz's formula [29], which is code computing the average of two numbers without overflow (important if you want to find the average of numbers like 0xFFFFFF00 and so on, using 32-bit registers). Taking this as input:

[29] http://aggregate.org/MAGIC/#Average%20of%20Integers

int userfun(int x, int y) {
    // To find Dietz's formula for
    // the floor-average of two
    // unsigned integers.
    return ((unsigned long long)x + (unsigned long long)y) >> 1;
}

... Aha! gives this:

Found a 4-operation program:
   and   r1,ry,rx
   xor   r2,ry,rx
   shrs  r3,r2,1
   add   r4,r3,r1
   Expr: (((y ^ x) >>s 1) + (y & x))

And it works correctly [30]. But how do we prove it? We will place Dietz's formula on the left side of the equation and (x + y) >> 1 on the right side:

    ∀x, y ∈ 0..2^64−1: (x & y) + ((x ⊕ y) ≫ 1) = (x + y) ≫ 1

One important thing is that we cannot operate on 64-bit values on the right side, because the result would overflow. So we will zero-extend the inputs on the right side by 1 bit (in other words, we will just prepend 1 zero bit to each value). The result of Dietz's formula will also be extended by 1 bit. Hence, both sides of the equation will have a width of 65 bits:

(declare-const x (_ BitVec 64))
(declare-const y (_ BitVec 64))
(assert
    (forall ((x (_ BitVec 64)) (y (_ BitVec 64)))
        (=
            ((_ zero_extend 1)
                (bvadd (bvand x y) (bvlshr (bvxor x y) (_ bv1 64)))
            )
            (bvlshr (bvadd ((_ zero_extend 1) x) ((_ zero_extend 1) y)) (_ bv1 65))
        )
    )
)
(check-sat)

Z3 says "sat". 65 bits are enough, because the result of the addition of the two biggest 64-bit values has a width of 65 bits: 0xFF...FF + 0xFF...FF = 0x1FF...FE.

As in the previous example about the XOR equivalent, (not (= ...)) and exists can also be used here instead of forall.

[30] For those interested in how it works: its mechanics are closely related to the weird XOR alternative we have just seen. That is why I placed these two pieces of text one after another.

5.9 Cracking an LCG with Z3

(This text first appeared in my blog in June 2015: http://yurichev.com/blog/modulo/.)

There are well-known weaknesses in LCGs [31], but let's see if it would be possible to crack one straightforwardly, without any special knowledge. We will define all relations between the LCG states in terms of Z3. Here is a test program:
[31] http://en.wikipedia.org/wiki/Linear_congruential_generator#Advantages_and_disadvantages_of_LCGs , http://www.reteam.org/papers/e59.pdf , http://stackoverflow.com/questions/8569113/why-1103515245-is-used-in-rand/8574774#8574774

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int i;
    srand(time(NULL));
    for (i=0; i<10; i++)
        printf ("%d\n", rand()%100);
};

It prints 10 pseudorandom numbers in the 0..99 range:

37 29 74 95 98 40 23 58 61 17

Let's say we are observing only 8 of these numbers (from 29 to 61) and we need to predict the next one (17) and/or the previous one (37). The program is compiled using MSVC 2013 (I chose it because its LCG is simpler than the one in Glibc):

.text:0040112E rand            proc near
.text:0040112E                 call    __getptd
.text:00401133                 imul    ecx, [eax+0x14], 214013
.text:0040113A                 add     ecx, 2531011
.text:00401140                 mov     [eax+14h], ecx
.text:00401143                 shr     ecx, 16
.text:00401146                 and     ecx, 7FFFh
.text:0040114C                 mov     eax, ecx
.text:0040114E                 retn
.text:0040114E rand            endp

Let's define the LCG in Z3Py:

#!/usr/bin/python
from z3 import *

output_prev = BitVec('output_prev', 32)
state1 = BitVec('state1', 32)
state2 = BitVec('state2', 32)
state3 = BitVec('state3', 32)
state4 = BitVec('state4', 32)
state5 = BitVec('state5', 32)
state6 = BitVec('state6', 32)
state7 = BitVec('state7', 32)
state8 = BitVec('state8', 32)
state9 = BitVec('state9', 32)
state10 = BitVec('state10', 32)
output_next = BitVec('output_next', 32)

s = Solver()
s.add(state2 == state1*214013+2531011)
s.add(state3 == state2*214013+2531011)
s.add(state4 == state3*214013+2531011)
s.add(state5 == state4*214013+2531011)
s.add(state6 == state5*214013+2531011)
s.add(state7 == state6*214013+2531011)
s.add(state8 == state7*214013+2531011)
s.add(state9 == state8*214013+2531011)
s.add(state10 == state9*214013+2531011)
s.add(output_prev==URem((state1>>16)&0x7FFF,100))
s.add(URem((state2>>16)&0x7FFF,100)==29)
s.add(URem((state3>>16)&0x7FFF,100)==74)
s.add(URem((state4>>16)&0x7FFF,100)==95)
s.add(URem((state5>>16)&0x7FFF,100)==98)
s.add(URem((state6>>16)&0x7FFF,100)==40)
s.add(URem((state7>>16)&0x7FFF,100)==23)
s.add(URem((state8>>16)&0x7FFF,100)==58)
s.add(URem((state9>>16)&0x7FFF,100)==61)
s.add(output_next==URem((state10>>16)&0x7FFF,100))

print(s.check())
print(s.model())

URem stands for unsigned remainder. The solver works for some time and gives us the correct result!

sat
[state3 = 2276903645, state4 = 1467740716, state5 = 3163191359,
 state7 = 4108542129, state8 = 2839445680, state2 = 998088354,
 state6 = 4214551046, state1 = 1791599627, state9 = 548002995,
 output_next = 17, output_prev = 37, state10 = 1390515370]

I added 10 states to be sure the result would be correct; it might not be with a smaller set of information.

That is the reason why an LCG is not suitable for any security-related task. This is why cryptographically secure pseudorandom number generators exist: they are designed to be protected against such simple attacks. Well, at least as long as the NSA [32] doesn't get involved [33].

Security tokens like "RSA SecurID" can be viewed as just a CSPRNG [34] with a secret seed. The token shows a new pseudorandom number each minute, and the server can predict it, because it knows the seed. Imagine if such a token implemented an LCG: it would be much easier to break!

[32] National Security Agency
[33] https://en.wikipedia.org/wiki/Dual_EC_DRBG
[34] Cryptographically Secure Pseudorandom Number Generator

5.10 Solving a pipe puzzle using the Z3 SMT-solver

"Pipe puzzle" is a popular puzzle (just google "pipe puzzle" and look at the images). This is how a shuffled puzzle looks:

Figure 2: Shuffled puzzle

... and solved:

Figure 3: Solved puzzle

Let's try to find a way to solve it.

5.10.1 Generation

First, we need to generate it. Here is my quick idea. Take an 8*16 array of cells; each cell may contain some type of block. There are joints between the cells: vertical joints vjoints[...,0] .. vjoints[...,15] and horizontal joints hjoints[0,...] .. hjoints[7,...] (see the figure).
Blue lines are horizontal joints; red lines are vertical joints. We just set each joint, randomly, to 0/false (absent) or 1/true (present). Once set, it is easy to find the type of each cell:

joints | our internal name | angle | symbol
   0   | type 0            | 0°    | (space)
   2   | type 2a           | 0°    | │
   2   | type 2a           | 90°   | ─
   2   | type 2b           | 0°    | ┌
   2   | type 2b           | 90°   | ┐
   2   | type 2b           | 180°  | ┘
   2   | type 2b           | 270°  | └
   3   | type 3            | 0°    | ├
   3   | type 3            | 90°   | ┬
   3   | type 3            | 180°  | ┤
   3   | type 3            | 270°  | ┴
   4   | type 4            | 0°    | ┼

Dangling joints can be present at the first stage (i.e., a cell with only one joint), but they are removed recursively, and such cells are transformed into empty cells. Hence, at the end, all cells have at least two joints, and the whole plumbing system has no connections to the outer world. I hope this makes things clearer.

The C source code of the generator is here: https://github.com/dennis714/SAT_SMT_article/tree/master/SMT/pipe/generator . All horizontal joints are stored in the global array hjoints[] and all vertical ones in vjoints[]. The C program generates ANSI-colored output like the one shown above (figures 2 and 3), plus an array of types, with no angle information for each cell:

[
["0", "0", "2b", "3", "2a", "2a", "2a", "3", "3", "2a", "3", "2b", "2b", "2b", "0", "0"],
["2b", "2b", "3", "2b", "0", "0", "2b", "3", "3", "3", "3", "3", "4", "2b", "0", "0"],
["3", "4", "2b", "0", "0", "0", "3", "2b", "2b", "4", "2b", "3", "4", "2b", "2b", "2b"],
["2b", "4", "3", "2a", "3", "3", "3", "2b", "2b", "3", "3", "3", "2a", "2b", "4", "3"],
["0", "2b", "3", "2b", "3", "4", "2b", "3", "3", "2b", "3", "3", "3", "0", "2a", "2a"],
["0", "0", "2b", "2b", "0", "3", "3", "4", "3", "4", "3", "3", "3", "2b", "3", "3"],
["0", "2b", "3", "2b", "0", "3", "3", "4", "3", "4", "4", "3", "0", "3", "4", "3"],
["0", "2b", "3", "3", "2a", "3", "2b", "2b", "3", "3", "3", "3", "2a", "3", "3", "2b"],
]

5.10.2 Solving

First of all, we will think of the puzzle as an 8*16 array of cells, where each cell has four bits: "T" (top), "B" (bottom), "L" (left), "R" (right). Each bit represents half of a joint.
(The figure here shows the 8*16 grid of cells, rows [0,...] .. [7,...] and columns [...,0] .. [...,15], each cell carrying its four T/B/L/R half-joint bits.)

Now we define arrays for each of the four half-joints, plus angle information:

HEIGHT=8
WIDTH=16

# if T/B/R/L is Bool instead of Int, the Z3 solver will work faster
T=[[Bool('cell_%d_%d_top' % (r, c)) for c in range(WIDTH)] for r in range(HEIGHT)]
B=[[Bool('cell_%d_%d_bottom' % (r, c)) for c in range(WIDTH)] for r in range(HEIGHT)]
R=[[Bool('cell_%d_%d_right' % (r, c)) for c in range(WIDTH)] for r in range(HEIGHT)]
L=[[Bool('cell_%d_%d_left' % (r, c)) for c in range(WIDTH)] for r in range(HEIGHT)]
A=[[Int('cell_%d_%d_angle' % (r, c)) for c in range(WIDTH)] for r in range(HEIGHT)]

We know that if a half-joint is present, the corresponding half-joint of the neighbouring cell must also be present, and vice versa.
We define this using these constraints:

# shorthand variables for True and False:
t=True
f=False

# "top" of each cell must be equal to "bottom" of the cell above
# "bottom" of each cell must be equal to "top" of the cell below
# "left" of each cell must be equal to "right" of the cell at left
# "right" of each cell must be equal to "left" of the cell at right
for r in range(HEIGHT):
    for c in range(WIDTH):
        if r!=0:
            s.add(T[r][c]==B[r-1][c])
        if r!=HEIGHT-1:
            s.add(B[r][c]==T[r+1][c])
        if c!=0:
            s.add(L[r][c]==R[r][c-1])
        if c!=WIDTH-1:
            s.add(R[r][c]==L[r][c+1])

# "left" of each cell of the first column shouldn't have any connection,
# and the same goes for "right" of each cell of the last column
for r in range(HEIGHT):
    s.add(L[r][0]==f)
    s.add(R[r][WIDTH-1]==f)

# "top" of each cell of the first row shouldn't have any connection,
# and the same goes for "bottom" of each cell of the last row
for c in range(WIDTH):
    s.add(T[0][c]==f)
    s.add(B[HEIGHT-1][c]==f)

Now we will enumerate all the cells of the initial array (5.10.1). The first two cells there are empty. The third one has type "2b". This is "┌", and it can be oriented in 4 possible ways. If it has angle 0°, the bottom and right half-joints are present and the others are absent. If it has angle 90°, it looks like "┐", and the bottom and left half-joints are present, the others absent.
In plain English: "if a cell of this type has angle 0°, these half-joints must be present, OR if it has angle 90°, these other half-joints must be present, OR, etc." Likewise, we define all these rules for all types and all possible angles:

for r in range(HEIGHT):
    for c in range(WIDTH):
        ty=cells_type[r][c]
        if ty=="0":
            s.add(A[r][c]==0)
            s.add(T[r][c]==f, B[r][c]==f, L[r][c]==f, R[r][c]==f)
        if ty=="2a":
            s.add(Or(And(A[r][c]==0,   L[r][c]==f, R[r][c]==f, T[r][c]==t, B[r][c]==t),   # │
                     And(A[r][c]==90,  L[r][c]==t, R[r][c]==t, T[r][c]==f, B[r][c]==f)))  # ─
        if ty=="2b":
            s.add(Or(And(A[r][c]==0,   L[r][c]==f, R[r][c]==t, T[r][c]==f, B[r][c]==t),   # ┌
                     And(A[r][c]==90,  L[r][c]==t, R[r][c]==f, T[r][c]==f, B[r][c]==t),   # ┐
                     And(A[r][c]==180, L[r][c]==t, R[r][c]==f, T[r][c]==t, B[r][c]==f),   # ┘
                     And(A[r][c]==270, L[r][c]==f, R[r][c]==t, T[r][c]==t, B[r][c]==f)))  # └
        if ty=="3":
            s.add(Or(And(A[r][c]==0,   L[r][c]==f, R[r][c]==t, T[r][c]==t, B[r][c]==t),   # ├
                     And(A[r][c]==90,  L[r][c]==t, R[r][c]==t, T[r][c]==f, B[r][c]==t),   # ┬
                     And(A[r][c]==180, L[r][c]==t, R[r][c]==f, T[r][c]==t, B[r][c]==t),   # ┤
                     And(A[r][c]==270, L[r][c]==t, R[r][c]==t, T[r][c]==t, B[r][c]==f)))  # ┴
        if ty=="4":
            s.add(A[r][c]==0)
            s.add(T[r][c]==t, B[r][c]==t, L[r][c]==t, R[r][c]==t)                         # ┼

Full source code is here: https://github.com/dennis714/SAT_SMT_article/blob/master/SMT/pipe/solver/solve_pipe_puzzle1.py .

It produces this result (it prints the angle for each cell plus a (pseudo)graphical representation):

Figure 4: Solver script output

It worked for 4 seconds on my old and slow Intel Atom N455 1.66GHz. Is that fast? I don't know, but again, what is really cool is that we know nothing about the mathematical background of all this: we just defined cells and (half-)joints and the relations between them.

The next question is: how many solutions are possible? Using the method described earlier (5.6), I altered the solver script [35], and the solver said two solutions are possible. Let's compare the two solutions using gvimdiff:

Figure 5: gvimdiff output (pardon my red cursor in the bottom-left corner of the left pane)

4 cells in the middle can be oriented differently. Perhaps other puzzles may produce different results.
P.S. Each half-joint is defined as a boolean. In fact, the first version of the solver was written using an integer type for half-joints, with 0 for False and 1 for True. I did it that way because I wanted to make the source code tidier and narrower, without long words like "False" and "True". It worked, but slower. Perhaps Z3 handles boolean data types faster? Better? Anyway, I am writing this to note that an integer type can also be used instead of boolean, if needed.

[35] https://github.com/dennis714/SAT_SMT_article/blob/master/SMT/pipe/solver/solve_pipe_puzzle2.py

5.11 Cracking Minesweeper with the Z3 SMT solver

For those who are not very good at playing Minesweeper (like me), it is possible to predict bomb placement without touching a debugger. Here I clicked somewhere and see revealed empty cells and cells with a known number of "neighbours".

What do we have here, actually? Hidden cells, empty cells (where bombs are not present), and empty cells with numbers, which show how many bombs are placed nearby.

5.11.1 The method

Here is what we can do: we will try to place a bomb on every possible hidden cell and ask the Z3 SMT solver whether it can disprove the very fact that a bomb can be placed there.

Take a look at this fragment. "?" marks a hidden cell, "." an empty cell, and a number is the number of neighbouring bombs.

    C1 C2 C3
R1   ?  ?  ?
R2   ?  3  .
R3   ?  1  .

So there are 5 hidden cells. We will check each hidden cell by placing a bomb there. Let's first pick the top-left cell:

    C1 C2 C3
R1   *  ?  ?
R2   ?  3  .
R3   ?  1  .

Then we will try to solve the following system of equations (RrCc is the cell at row r and column c):

• R1C2+R2C1+R2C2=1 (because we placed a bomb at R1C1)
• R2C1+R2C2+R3C1=1 (because we have "1" at R3C2)
• R1C1+R1C2+R1C3+R2C1+R2C2+R2C3+R3C1+R3C2+R3C3=3 (because we have "3" at R2C2)
• R1C2+R1C3+R2C2+R2C3+R3C2+R3C3=0 (because we have "." at R2C3)
• R2C2+R2C3+R3C2+R3C3=0 (because we have "." at R3C3)

As it turns out, this system of equations is satisfiable, so there could be a bomb at this cell. But this information is not interesting to us, since we want to find cells we can freely click on, so we will try another one. And if the equations are unsatisfiable, that implies that a bomb cannot be there, and we can click on that cell.
5.11.2 The code

#!/usr/bin/python
from z3 import *
import sys

known=[
"01?10001?",
"01?100011",
"011100000",
"000000000",
"111110011",
"????1001?",
"????3101?",
"?????211?",
"?????????"]

WIDTH=len(known[0])
HEIGHT=len(known)
print("WIDTH=", WIDTH, "HEIGHT=", HEIGHT)

def chk_bomb(row, col):
    s=Solver()

    cells=[[Int('cell_r=%d_c=%d' % (r,c)) for c in range(WIDTH+2)] for r in range(HEIGHT+2)]

    # make a border
    for c in range(WIDTH+2):
        s.add(cells[0][c]==0)
        s.add(cells[HEIGHT+1][c]==0)
    for r in range(HEIGHT+2):
        s.add(cells[r][0]==0)
        s.add(cells[r][WIDTH+1]==0)

    for r in range(1,HEIGHT+1):
        for c in range(1,WIDTH+1):
            t=known[r-1][c-1]
            if t in "012345678":
                s.add(cells[r][c]==0)
                # we need the empty border so the following expression
                # would work for all possible cases:
                s.add(cells[r-1][c-1] + cells[r-1][c] + cells[r-1][c+1] +
                      cells[r][c-1] + cells[r][c+1] +
                      cells[r+1][c-1] + cells[r+1][c] + cells[r+1][c+1]==int(t))

    # place a bomb:
    s.add(cells[row][col]==1)

    result=str(s.check())
    if result=="unsat":
        print("row=%d col=%d, unsat!" % (row, col))

# enumerate all hidden cells:
for r in range(1,HEIGHT+1):
    for c in range(1,WIDTH+1):
        if known[r-1][c-1]=="?":
            chk_bomb(r, c)

The code is almost self-explanatory. We need a border for the same reason Conway's "Game of Life" implementations also have one (to make the calculation functions simpler). Whenever we know that a cell is free of a bomb, we put a zero there. Whenever we know the number of neighbours, we add a constraint, again, just like in the "Game of Life": the number of neighbours must be equal to the number we have seen in Minesweeper. Then we place a bomb somewhere and check. Let's run it:

row=1 col=3, unsat!
row=6 col=2, unsat!
row=6 col=3, unsat!
row=7 col=4, unsat!
row=7 col=9, unsat!
row=8 col=9, unsat!

These are the cells where I can click safely, so I did. Now we have more information, so we update the input:

known=[
"01110001?",
"01?100011",
"011100000",
"000000000",
"111110011",
"?11?1001?",
"???331011",
"?????2110",
"???????10"]

I run it again:

row=7 col=1, unsat!
row=7 col=2, unsat!
row=7 col=3, unsat!
row=8 col=3, unsat!
row=9 col=5, unsat!
row=9 col=6, unsat!

I click on these cells and update the input again:

known=[
"01110001?",
"01?100011",
"011100000",
"000000000",
"111110011",
"?11?1001?",
"222331011",
"??2??2110",
"????22?10"]

row=8 col=2, unsat!
row=9 col=4, unsat!

This is the last update:

known=[
"01110001?",
"01?100011",
"011100000",
"000000000",
"111110011",
"?11?1001?",
"222331011",
"?22??2110",
"???322?10"]

... and the last result:

row=9 col=1, unsat!
row=9 col=2, unsat!

Voila! The source code: https://github.com/dennis714/SAT_SMT_article/blob/master/SMT/minesweeper/minesweeper_solver.py .

Some discussion on HN: https://news.ycombinator.com/item?id=13797375 .

See also: cracking Minesweeper using a SAT solver: 11.3.

5.12 Recalculating a micro-spreadsheet using Z3Py

There is a nice exercise [36]: write a program to recalculate a micro-spreadsheet, like this one:

1     0        B0+B2        A0*B0*C0
123   10       12           11
667   A0+B1    (C1*A0)*122  A3+C2

[36] Blog post in Russian: http://thesz.livejournal.com/280784.html

As it turns out, though overkill, this can be solved using Z3 with little effort:

#!/usr/bin/python
from z3 import *
import sys, re

# MS Excel or LibreOffice style,
# except that the first top-left cell is A0, not A1
def coord_to_name(R, C):
    return "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[R]+str(C)

# open the file and parse it as a list of lists;
# filter(None, ...) removes empty sublists:
f=open(sys.argv[1],"r")
ar=list(filter(None, [item.rstrip().split() for item in f.readlines()]))
f.close()

WIDTH=len(ar[0])
HEIGHT=len(ar)

# cells{} is a dictionary with keys like "A0", "B9", etc:
cells={}
for R in range(HEIGHT):
    for C in range(WIDTH):
        name=coord_to_name(R, C)
        cells[name]=Int(name)

s=Solver()

cur_R=0
cur_C=0
for row in ar:
    for c in row:
        # a string like "A0+B2" becomes 'cells["A0"]+cells["B2"]':
        c=re.sub(r'([A-Z]{1}[0-9]+)', r'cells["\1"]', c)
        st="cells[\"%s\"]==%s" % (coord_to_name(cur_R, cur_C), c)
        # evaluate the string.
        # The Z3Py expression is constructed at this step:
        e=eval(st)
        # add the constraint:
        s.add(e)
        cur_C=cur_C+1
    cur_R=cur_R+1
    cur_C=0

result=str(s.check())
print(result)
if result=="sat":
    m=s.model()
    for r in range(HEIGHT):
        for c in range(WIDTH):
            sys.stdout.write(str(m[cells[coord_to_name(r, c)]])+"\t")
        sys.stdout.write("\n")

(https://github.com/dennis714/yurichev.com/blob/master/blog/spreadsheet/1.py)

All we do is create a pack of integer variables, one for each cell, named A0, B1, etc. All of them are stored in the cells[] dictionary, keyed by string. Then we parse all the strings from the cells and add to the list of constraints either A0=123 (in the case of a number in a cell) or A0=B1+C2 (in the case of an expression in a cell). There is a slight preparation step: a string like A0+B2 becomes cells["A0"]+cells["B2"].

Then the string is evaluated using the Python eval() function, which is highly dangerous [37]: imagine if the end user could put a string other than an expression into a cell? Nevertheless, it serves our purposes well, because it is the simplest way to pass a string with an expression into Z3.

[37] http://stackoverflow.com/questions/1832940/is-using-eval-in-python-a-bad-practice

Z3 does the job with little effort:

% python 1.py test1
sat
1       0       135     82041
123     10      12      11
667     11      1342    83383

5.12.1 Unsat core

Now the problem: what if there is a circular dependency? Like:

1       0       B0+B2   A0*B0
123     10      12      11
C1+123  C0*123  A0*122  A3+C2

The first two cells of the last row (C0 and C1) are linked to each other. Our program will just say "unsat", meaning it could not satisfy all the constraints together. We cannot use that as an error message reported to the end user, because it is highly unfriendly. However, we can fetch the unsat core, i.e., the list of variables which Z3 finds conflicting:

...
s=Solver()
s.set(unsat_core=True)
...
        # add the constraint:
        s.assert_and_track(e, coord_to_name(cur_R, cur_C))
...
if result=="sat":
    ...
else:
    print(s.unsat_core())

(https://github.com/dennis714/yurichev.com/blob/master/blog/spreadsheet/2.py)

We should explicitly turn on unsat core support and use the assert_and_track() method instead of add(), because this feature slows down the whole process and is turned off by default.
That works:

% python 2.py test_circular
unsat
[C0, C1]

Perhaps these variables could be removed from the 2D array, marked as unresolved, and the whole spreadsheet recalculated again.

5.12.2 Stress test

How do we generate a large random spreadsheet? Here is what we can do. First, create a random DAG [38], like this one:

[38] Directed acyclic graph
randomNonNumberExpression[g,vertex]]

(* Main part *)

(* Create random graph *)
In[21]:= WIDTH=7;HEIGHT=8;TOTAL=WIDTH*HEIGHT
Out[21]= 56
In[24]:= g=DirectedGraph[RandomGraph[BernoulliGraphDistribution[TOTAL,0.05]],"Acyclic"];
...
(* Generate random expressions and numbers *)
In[26]:= expressions=Map[assignNumberOrExpr[g,#]&,VertexList[g]];
(* Make 2D table of it *)
In[27]:= t=Partition[expressions,WIDTH];
(* Export as tab-separated values *)
In[28]:= Export["/home/dennis/1.txt",t,"TSV"]
Out[28]= /home/dennis/1.txt
In[29]:= Grid[t,Frame->All,Alignment->Left]

Here is an output from Grid[] (one row per line):

846  499  A3*913-H4  808  278  303  D1+579+B6
B4*860+D2  999  59  442  425  A5*163+B2+127*C2*927*D3*213+C1  583
G6*379-C3-436-C4-289+H6  972  804  D2  G5+108-F1*413-D3  B5  G4*981*D2
F2  E0  B6-731-D3+791+B4*92+C1  551  F4*922*C2+760*A6-992+B4-184-A4  B1-624-E3  F4+182+A4*940-E1+76*C1
519  G1*402+D1*107*G3-458*A1  D3  B4  B3*811-D3*345+E0  B5  H5
F5-531+B5-222*E4  9  B5+106*B6+600-B1  E3  A5+866*F6+695-A3*226+C6  F4*102*E4*998-H0  B1-616-G5+812-A5
C3-956*A5  G4*408-D3*290*B6-899*G5+400+F1  B2-701+H6  A3+782*A5+46-B3-731+C1  42  287  H0
B4-792*H4*407+F6-425-E1  D2  D3  F2-327*G4*35*E1  E1+376*A6-606*F6*554+C5  E3  F6*484+C1-114-H4-638-A3

Using this script, I can generate a random spreadsheet of 26·500 = 13000 cells, which seems to be processed in a couple of seconds.

5.12.3 The files

The files, including the Mathematica notebook: https://github.com/dennis714/yurichev.com/tree/master/blog/spreadsheet .

6 Program synthesis

Program synthesis is the process of automatic program generation, in accordance with some specific goals.

6.1 Synthesis of a simple program using the Z3 SMT solver

Sometimes a multiplication operation can be replaced with several shift/addition/subtraction operations. Compilers do so because such a pack of instructions can be executed faster. For example, multiplication by 19 is replaced by GCC 5.4 with a pair of instructions: lea edx, [eax+eax*8] and lea eax, [eax+edx*2]. This is sometimes also called "superoptimization".

Let's see if we can find the shortest possible instruction pack for some specified multiplier.
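As a quick sanity check (not in the original), the two LEA instructions are easy to model in a few lines of Python: the first computes eax + eax*8 = 9·eax, the second eax + edx*2 = eax + 18·eax = 19·eax, with the usual 32-bit wrap-around:

```python
MASK = 0xFFFFFFFF  # emulate 32-bit register wrap-around

def mul19(eax):
    # lea edx, [eax+eax*8]  ; edx = eax*9
    edx = (eax + eax * 8) & MASK
    # lea eax, [eax+edx*2]  ; eax = eax + eax*18 = eax*19
    eax = (eax + edx * 2) & MASK
    return eax

for x in (0, 1, 7, 12345, 0xFFFFFFFF):
    assert mul19(x) == (x * 19) & MASK
```

The synthesis task below is exactly this, in reverse: given the multiplier, find such a sequence.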
As I've already written once, an SMT solver can be seen as a solver of huge systems of equations. The task is to construct such a system of equations which, when solved, produces a short program. I will use an electronics analogy here; it can make things a little simpler.

First of all, what will our program consist of? There will be 3 allowed operations: ADD/SUB/SHL. Only registers are allowed as operands, except for the second operand of SHL (which can be in the 1..31 range). Each register is assigned only once (as in SSA39).

39 Static single assignment form

And there will be a "magic block", which takes all previous register states, plus an operation type and operands, and produces the value of the next register's state.

op ------------+
op1_reg -----+ |
op2_reg ---+ | |
           | | |
           v v v
        +---------------+
        |               |
registers -> |          | -> new register's state
        |               |
        +---------------+

Now let's take a look at our schematics at the top level:

0 -> blk -> blk -> blk .. -> blk -> 0
1 -> blk -> blk -> blk .. -> blk -> multiplier

Each block takes the previous state of the registers and produces new states. There are two chains. The first chain takes 0 as the state of R0 at the very beginning, and the chain is supposed to produce 0 at the end (since zero multiplied by any value is still zero). The second chain takes 1 and must produce the multiplier as the state of the very last register (since 1 multiplied by the multiplier must equal the multiplier).

Each block is "controlled" by the operation type, operands, etc. For each column, there is its own set. Now you can view these two chains as two equations. The ultimate goal is to find such a state of all operation types and operands that the first chain will equal 0, and the second the multiplier.

Let's also take a look inside the "magic block":

            op1_reg        op
               |           |
               v           v
                        +-----+
registers -+-> selector1 --> | ADD |
           |                 | SUB | ---> result
           |                 | SHL |
           +-> selector2 --> +-----+
                   ^    ^
                   |    |
              op2_reg  op2_imm

Each selector can be viewed as a simple multi-positional switch. If the operation is SHL, a value in the 1..31 range is used as the second operand.

So you can imagine this electric circuit, and your goal is to turn all the switches into such a state that the two chains will have 0 and the multiplier on their outputs. This sounds like a logic puzzle in some way. Now we will try to use Z3 to solve this puzzle.
First, we define all variables:

R=[[BitVec('S_s%d_c%d' % (s, c), 32) for s in range(MAX_STEPS)] for c in range(CHAINS)]
op=[Int('op_s%d' % s) for s in range(MAX_STEPS)]
op1_reg=[Int('op1_reg_s%d' % s) for s in range(MAX_STEPS)]
op2_reg=[Int('op2_reg_s%d' % s) for s in range(MAX_STEPS)]
op2_imm=[BitVec('op2_imm_s%d' % s, 32) for s in range(MAX_STEPS)]

R[][] is the registers' state for each chain and each step. On the contrary, the op/op1_reg/op2_reg/op2_imm variables are defined for each step, but are shared by both chains, since both chains at each column have the same operation/operands.

Now we must limit the range of operations, and also, the register number at each step must not be bigger than the step number; in other words, the instruction at each step is allowed to access only registers which were already set before:

for s in range(1, STEPS): # for each step
    sl.add(And(op[s]>=0, op[s]<=2))
    sl.add(And(op1_reg[s]>=0, op1_reg[s]<s))
    sl.add(And(op2_reg[s]>=0, op2_reg[s]<s))
    sl.add(And(op2_imm[s]>=1, op2_imm[s]<=31))

Fix the register of the first step for both chains, and require the last register to equal the chain's input multiplied by the multiplier:

for c in range(CHAINS): # for each chain:
    sl.add(R[c][0]==chain_inputs[c])
    sl.add(R[c][STEPS-1]==chain_inputs[c]*multiplier)

Now let's add the "magic blocks":

    for s in range(1, STEPS):
        sl.add(R[c][s]==simulate_op(R,c, op[s], op1_reg[s], op2_reg[s], op2_imm[s]))

Now, how is the "magic block" defined?

def selector(R, c, s):
    # for all MAX_STEPS:
    return If(s==0, R[c][0],
           If(s==1, R[c][1],
           If(s==2, R[c][2],
           If(s==3, R[c][3],
           If(s==4, R[c][4],
           If(s==5, R[c][5],
           If(s==6, R[c][6],
           If(s==7, R[c][7],
           If(s==8, R[c][8],
           If(s==9, R[c][9],
           0)))))))))) # default

def simulate_op(R, c, op, op1_reg, op2_reg, op2_imm):
    op1_val=selector(R,c,op1_reg)
    return If(op==0, op1_val + selector(R, c, op2_reg),
           If(op==1, op1_val - selector(R, c, op2_reg),
           If(op==2, op1_val << op2_imm,
           0))) # default

It is very important to understand: if the operation is ADD/SUB, op2_imm's value is just ignored. Otherwise, if the operation is SHL, the value of op2_reg is ignored. Just like in the case of a digital circuit.

The code: https://github.com/dennis714/SAT_SMT_article/blob/master/pgm_synth/mult.py .
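Before running the solver, it may help to have a plain-Python reference interpreter for this three-operation ISA (an illustration, not part of mult.py). Any synthesized sequence can be checked against it, for example the program the solver finds for multiplier 12 (r1=SHL r0, 2; r2=SHL r1, 1; r3=ADD r1, r2):

```python
MASK = 0xFFFFFFFF  # 32-bit registers, as in the Z3 model

def run(program, r0):
    """Interpret a synthesized sequence for the toy ADD/SUB/SHL ISA.

    program is a list of (op, arg1, arg2) tuples; arguments index
    earlier registers, except SHL's second argument, which is an
    immediate in the 1..31 range."""
    regs = [r0]
    for op, a, b in program:
        if op == "ADD":
            regs.append((regs[a] + regs[b]) & MASK)
        elif op == "SUB":
            regs.append((regs[a] - regs[b]) & MASK)
        elif op == "SHL":
            regs.append((regs[a] << b) & MASK)
    return regs[-1]

# r1=SHL r0, 2 ; r2=SHL r1, 1 ; r3=ADD r1, r2  ->  4x + 8x = 12x
mul12 = [("SHL", 0, 2), ("SHL", 1, 1), ("ADD", 1, 2)]
for x in (0, 1, 7, 1000003):
    assert run(mul12, x) == (x * 12) & MASK
```

The solver's own "tests are OK" line does essentially the same check on random inputs.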
Now let's see how it works:

% ./mult.py 12
multiplier= 12
attempt, STEPS= 2
unsat
attempt, STEPS= 3
unsat
attempt, STEPS= 4
sat!
r1=SHL r0, 2
r2=SHL r1, 1
r3=ADD r1, r2
tests are OK

The first step is always the step containing 0/1, i.e., r0. So when our solver reports 4 steps, this means 3 instructions.

Something harder:

% ./mult.py 123
multiplier= 123
attempt, STEPS= 2
unsat
attempt, STEPS= 3
unsat
attempt, STEPS= 4
unsat
attempt, STEPS= 5
sat!
r1=SHL r0, 2
r2=SHL r1, 5
r3=SUB r2, r1
r4=SUB r3, r0
tests are OK

Now the code multiplying by 1234:

r1=SHL r0, 6
r2=ADD r0, r1
r3=ADD r2, r1
r4=SHL r2, 4
r5=ADD r2, r3
r6=ADD r5, r4

Looks great, but it took 23 seconds to find it on my Intel Xeon CPU E3-1220 @ 3.10GHz. I agree, this is far from practical usage. Also, I'm not quite sure that this piece of code will work faster than a single multiplication instruction. But anyway, it's a good demonstration of SMT solvers' capabilities.

The code multiplying by 12345 (150 seconds):

r1=SHL r0, 5
r2=SHL r0, 3
r3=SUB r2, r1
r4=SUB r1, r3
r5=SHL r3, 9
r6=SUB r4, r5
r7=ADD r0, r6

Multiplication by 123456 (8 minutes!):

r1=SHL r0, 9
r2=SHL r0, 13
r3=SHL r0, 2
r4=SUB r1, r2
r5=SUB r3, r4
r6=SHL r5, 4
r7=ADD r1, r6

6.1.1 A few notes

I've removed SHR instruction support, simply because code multiplying by a constant makes no use of it. Even more: it's not a problem to add support for constants as the second operand of all instructions, but again, you wouldn't find a piece of code which does this job and uses some additional constants. Or maybe I'm wrong?

Of course, for another job you'll need to add support for constants and other operations. But at the same time, it will work slower and slower. So I had to keep the ISA40 of this toy CPU41 as compact as possible.

6.1.2 The code

https://github.com/dennis714/SAT_SMT_article/blob/master/pgm_synth/mult.py .

6.2 Rockey dongle: finding an unknown algorithm using only input/output pairs

(This text was first published in August 2012 in my blog: http://blog.yurichev.com/node/71 .)

Some smartcards can execute Java or .NET code — that's a way to hide your sensitive algorithm in a chip that is very hard to break into (decapsulate).
For example, one may encrypt/decrypt data files with a hidden crypto algorithm, rendering piracy of such software close to impossible — an encrypted data file created on software with the smartcard connected would be impossible to decrypt on a cracked version of the same software. (This leads to many nuisances, though.) That's what is called a black box.

Some software protection dongles offer this functionality too. One example is Rockey4 42.

40 Instruction Set Architecture
41 Central processing unit
42 http://www.rockey.nl/en/rockey.html

Figure 7: Rockey4 dongle

This is a small dongle connected via USB. It contains some user-defined memory, but also memory for user algorithms.

The virtual (toy) CPU for these algorithms is very simple: it offers only 8 16-bit registers (however, only 4 can be set and read) and 8 operations (addition, subtraction, cyclic left shift, multiplication, OR, XOR, AND, negation). The second instruction argument can be a constant (from 0 to 63) instead of a register.

Each algorithm is described by a string like A=A+B, B=C*13, D=D^A, C=B*55, C=C&A, D=D|A, A=A*9, A=A&B. There is no memory, no stack, no conditional/unconditional jumps, etc. Each algorithm, obviously, can't have side effects, so they are actually pure functions and their results can be memoized.

By the way, as mentioned in the Rockey4 manual, the first and the last instruction cannot have constants. Maybe that's because these fields are used for some internal data: each algorithm's start and end should be marked somehow internally anyway.

Would it be possible to reveal a hidden, impossible-to-read algorithm only by recording input/output dongle traffic? Common sense tells us "no". But we can try anyway.

Since my goal wasn't to break into some Rockey-protected software, and I was interested only in the limits (which algorithms we could find), I made some things simpler: we will work with only 4 16-bit registers, and there will be only 6 operations (add, subtract, multiply, OR, XOR, AND).

Let's first calculate how much information would have to be covered in the brute-force case. There are 384 possible instructions of the reg=reg,op,reg format for 4 registers and 6 operations, and also 6144 instructions of the reg=reg,op,constant format. Remember that the constant is limited to 63 as its maximal value?
That helps us a little. So there are 6528 possible instructions in total. This means there are ≈1.1·10^19 possible 5-instruction algorithms. That's too much.

How can we express each instruction as a system of equations? Remembering some school mathematics, I wrote this:

Function one_step()=
    # Each Bx is an integer, but may only be 0 or 1.
    # Only one of B1..B4 and one of B5..B9 can be set
    reg1=B1*A + B2*B + B3*C + B4*D
    reg_or_constant2=B5*A + B6*B + B7*C + B8*D + B9*constant
    reg1 should not be equal to reg_or_constant2

    # Only one of B10..B15 can be set
    result=result+B10*(reg1*reg2)
    result=result+B11*(reg1^reg2)
    result=result+B12*(reg1+reg2)
    result=result+B13*(reg1-reg2)
    result=result+B14*(reg1|reg2)
    result=result+B15*(reg1&reg2)

    # B16 - true if the register isn't updated in this part
    # B17 - true if the register is updated in this part
    # (B16 cannot be equal to B17; likewise for B18/B19, B20/B21, B22/B23)
    A=B16*A + B17*result
    B=B18*B + B19*result
    C=B20*C + B21*result
    D=B22*D + B23*result

That's how we can express each instruction of the algorithm. A 5-instruction algorithm can be expressed like this: one_step (one_step (one_step (one_step (one_step (input_registers))))).

Let's also add five known input/output pairs, and we'll get a system of equations like this:

one_step (one_step (one_step (one_step (one_step (input_1)))))==output_1
one_step (one_step (one_step (one_step (one_step (input_2)))))==output_2
one_step (one_step (one_step (one_step (one_step (input_3)))))==output_3
one_step (one_step (one_step (one_step (one_step (input_4)))))==output_4
.. etc

So the question now is to find 523 boolean values satisfying the known input/output pairs.
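The instruction counts above are easy to re-derive (a quick check, not in the original text): 4 destination registers × 6 operations × 4 first operands × 4 second operands gives the register form, and replacing the second operand with one of 64 constants (0..63) gives the constant form:

```python
regs, ops, consts = 4, 6, 64            # 4 registers, 6 operations, constants 0..63

reg_form   = regs * ops * regs * regs   # dest, op, arg1, arg2 are all registers
const_form = regs * ops * regs * consts # arg2 is a constant instead of a register

assert reg_form == 384
assert const_form == 6144
assert reg_form + const_form == 6528

# the number of distinct 5-instruction algorithms exceeds 10^19:
assert (reg_form + const_form) ** 5 > 10 ** 19
```

Hence the need for a solver: enumerating ~10^19 candidate algorithms directly is hopeless.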
I wrote a small utility to probe the Rockey4 algorithm with random numbers; it produces results in this form:

RY_CALCULATE1: (input) p1=30760 p2=18484 p3=41200 p4=61741 (output) p1=49244 p2=11312 p3=27587 p4=12657
RY_CALCULATE1: (input) p1=51139 p2=7852 p3=53038 p4=49378 (output) p1=58991 p2=34134 p3=40662 p4=9869
RY_CALCULATE1: (input) p1=60086 p2=52001 p3=13352 p4=45313 (output) p1=46551 p2=42504 p3=61472 p4=1238
RY_CALCULATE1: (input) p1=48318 p2=6531 p3=51997 p4=30907 (output) p1=54849 p2=20601 p3=31271 p4=44794

p1/p2/p3/p4 are just other names for the A/B/C/D registers.

Now let's start with Z3. We will need to express the Rockey4 toy CPU in Z3Py (Z3 Python API) terms. It can be said that my Python script is divided into two parts:

• constraint definitions (like: output_1 should be n for input_1=m, a constant cannot be greater than 63, etc.);
• functions constructing the system of equations.

This piece of code defines a kind of structure consisting of 4 named 16-bit variables, each representing a register of our toy CPU:

Registers_State=Datatype ('Registers_State')
Registers_State.declare('cons',
    ('A', BitVecSort(16)),
    ('B', BitVecSort(16)),
    ('C', BitVecSort(16)),
    ('D', BitVecSort(16)))
Registers_State=Registers_State.create()

These enumerations define two new types (or sorts, in Z3's terminology):

Operation, (OP_MULT, OP_MINUS, OP_PLUS, OP_XOR, OP_OR, OP_AND) = EnumSort('Operation', ('OP_MULT', 'OP_MINUS', 'OP_PLUS', 'OP_XOR', 'OP_OR', 'OP_AND'))
Register, (A, B, C, D) = EnumSort('Register', ('A', 'B', 'C', 'D'))

This part is very important — it defines all the variables of our system of equations. op_step is the type of operation in an instruction. reg_or_constant is a selector between register and constant in the second argument: False if it's a register and True if it's a constant. reg_step is the destination register of the instruction. reg1_step and reg2_step are just the registers at arg1 and arg2. constant_step is the constant (in case it is used in the instruction instead of arg2).
op_step=[Const('op_step%s' % i, Operation) for i in range(STEPS)]
reg_or_constant_step=[Bool('reg_or_constant_step%s' % i) for i in range(STEPS)]
reg_step=[Const('reg_step%s' % i, Register) for i in range(STEPS)]
reg1_step=[Const('reg1_step%s' % i, Register) for i in range(STEPS)]
reg2_step=[Const('reg2_step%s' % i, Register) for i in range(STEPS)]
constant_step = [BitVec('constant_step%s' % i, 16) for i in range(STEPS)]

Adding constraints is very simple. Remember, I wrote that each constant cannot be larger than 63?

# according to the Rockey4 dongle manual, arg2 in the first and last instructions cannot be a constant
s.add (reg_or_constant_step[0]==False)
s.add (reg_or_constant_step[STEPS-1]==False)
...
for x in range(STEPS):
    s.add (constant_step[x]>=0, constant_step[x]<=63)

Known input/output values are added as constraints too. Now let's see how to construct our system of equations:

# Register, Registers_State -> int
def register_selector (register, input_registers):
    return If(register==A, Registers_State.A(input_registers),
           If(register==B, Registers_State.B(input_registers),
           If(register==C, Registers_State.C(input_registers),
           If(register==D, Registers_State.D(input_registers),
           0)))) # default

This function returns the corresponding register value from the structure. Needless to say, the code above is not executed: If() is a Z3Py function. The code only declares an expression, which will then be used in another. Expression declaration resembles a LISP-like PL in some way.

Here is another function, where register_selector() is used:

# Bool, Register, Registers_State, int -> int
def register_or_constant_selector (register_or_constant, register, input_registers, constant):
    return If(register_or_constant==False, register_selector(register, input_registers), constant)

The code here is never executed either. It only constructs one small piece of a very big expression. But for the sake of simplicity, one can think of all these functions as being called during the brute-force search, many times, at the fastest possible speed.
# Operation, Bool, Register, Register, Int, Registers_State -> int
def one_op (op, register_or_constant, reg1, reg2, constant, input_registers):
    arg1=register_selector(reg1, input_registers)
    arg2=register_or_constant_selector (register_or_constant, reg2, input_registers, constant)
    return If(op==OP_MULT, arg1*arg2,
           If(op==OP_MINUS, arg1-arg2,
           If(op==OP_PLUS, arg1+arg2,
           If(op==OP_XOR, arg1^arg2,
           If(op==OP_OR, arg1|arg2,
           If(op==OP_AND, arg1&arg2,
           0)))))) # default

Here is the expression describing each instruction. new_val will be assigned to the destination register, while all the other registers' values are copied from the input registers' state:

# Bool, Register, Operation, Register, Register, Int, Registers_State -> Registers_State
def one_step (register_or_constant, register_assigned_in_this_step, op, reg1, reg2, constant, input_registers):
    new_val=one_op(op, register_or_constant, reg1, reg2, constant, input_registers)
    return If (register_assigned_in_this_step==A,
               Registers_State.cons (new_val, Registers_State.B(input_registers), Registers_State.C(input_registers), Registers_State.D(input_registers)),
           If (register_assigned_in_this_step==B,
               Registers_State.cons (Registers_State.A(input_registers), new_val, Registers_State.C(input_registers), Registers_State.D(input_registers)),
           If (register_assigned_in_this_step==C,
               Registers_State.cons (Registers_State.A(input_registers), Registers_State.B(input_registers), new_val, Registers_State.D(input_registers)),
           If (register_assigned_in_this_step==D,
               Registers_State.cons (Registers_State.A(input_registers), Registers_State.B(input_registers), Registers_State.C(input_registers), new_val),
           Registers_State.cons(0,0,0,0))))) # default

This is the last function, describing a whole n-step program:

def program(input_registers, STEPS):
    cur_input=input_registers
    for x in range(STEPS):
        cur_input=one_step (reg_or_constant_step[x], reg_step[x], op_step[x], reg1_step[x], reg2_step[x], constant_step[x], cur_input)
    return cur_input
Again, for the sake of simplicity, it can be said that Z3 will now try each possible combination of registers/operations/constants against this expression to find one which satisfies all the input/output pairs. That sounds absurd, but it is close to reality: SAT/SMT solvers indeed try them all. The trick is to prune the search tree as early as possible, so that it runs in some reasonable time — and this is the hardest problem for solvers.

Now let's start with a very simple 3-step algorithm: B=A^D, C=D*D, D=A*C. Please note: register A is left unchanged. I programmed the Rockey4 dongle with the algorithm, and the recorded algorithm outputs are:

RY_CALCULATE1: (input) p1=8803 p2=59946 p3=36002 p4=44743 (output) p1=8803 p2=36004 p3=7857 p4=24691
RY_CALCULATE1: (input) p1=5814 p2=55512 p3=52155 p4=55813 (output) p1=5814 p2=52403 p3=33817 p4=4038
RY_CALCULATE1: (input) p1=25206 p2=2097 p3=55906 p4=22705 (output) p1=25206 p2=15047 p3=10849 p4=43702
RY_CALCULATE1: (input) p1=10044 p2=14647 p3=27923 p4=7325 (output) p1=10044 p2=15265 p3=47177 p4=20508
RY_CALCULATE1: (input) p1=15267 p2=2690 p3=47355 p4=56073 (output) p1=15267 p2=57514 p3=26193 p4=53395

It took about one second and only the 5 pairs above to find the algorithm (on my quad-core Xeon E3-1220 3.1 GHz; the Z3 solver, however, works in single-thread mode):

B = A ^ D
C = D * D
D = C * A

Note the last instruction: the C and A registers are swapped compared to the version I wrote by hand. But of course, this instruction works in the same way, because multiplication is a commutative operation.

Now, if I try to find a 4-step program satisfying these values, my script will offer this:

B = A ^ D
C = D * D
D = A * C
A = A | A

...and that's really fun, because the last instruction does nothing to the value in register A; it's like a NOP43 — but still, the algorithm is correct for all the values given.
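The recovered program is easy to check outside the dongle with a few lines of Python (an illustration, not part of the original scripts). All arithmetic is modulo 2^16, matching the 16-bit registers, and instructions execute sequentially, so D=C*A already sees the new C:

```python
M = 0xFFFF  # 16-bit registers

def algo(A, B, C, D):
    # the algorithm recovered by Z3: B=A^D, C=D*D, D=C*A
    B = (A ^ D) & M
    C = (D * D) & M
    D = (C * A) & M
    return A, B, C, D

# first recorded input/output pair from the dongle:
assert algo(8803, 59946, 36002, 44743) == (8803, 36004, 7857, 24691)
```

Running it against the other four recorded pairs works the same way.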
Here is another, 5-step algorithm: B=B^D, C=A*22, A=B*19, A=A&42, D=B&C, and the values:

RY_CALCULATE1: (input) p1=61876 p2=28737 p3=28636 p4=50362 (output) p1=32 p2=46331 p3=50552 p4=33912
RY_CALCULATE1: (input) p1=46843 p2=43355 p3=39078 p4=24552 (output) p1=8 p2=63155 p3=47506 p4=45202
RY_CALCULATE1: (input) p1=22425 p2=51432 p3=40836 p4=14260 (output) p1=0 p2=65372 p3=34598 p4=34564
RY_CALCULATE1: (input) p1=44214 p2=45766 p3=19778 p4=59924 (output) p1=2 p2=22738 p3=55204 p4=20608
RY_CALCULATE1: (input) p1=27348 p2=49060 p3=31736 p4=59576 (output) p1=0 p2=22300 p3=11832 p4=1560

It took 37 seconds, and we've got:

B = D ^ B
C = A * 22
A = B * 19
A = A & 42
D = C & B

A=A&42 was correctly deduced (look at these five p1's at the output (assigned to the output A register): 32, 8, 0, 2, 0).

A 6-step algorithm A=A+B, B=C*13, D=D^A, C=C&A, D=D|B, A=A&B and the values:

RY_CALCULATE1: (input) p1=4110 p2=35411 p3=54308 p4=47077 (output) p1=32832 p2=50644 p3=36896 p4=60884
RY_CALCULATE1: (input) p1=12038 p2=7312 p3=39626 p4=47017 (output) p1=18434 p2=56386 p3=2690 p4=64639
RY_CALCULATE1: (input) p1=48763 p2=27663 p3=12485 p4=20563 (output) p1=10752 p2=31233 p3=8320 p4=31449
RY_CALCULATE1: (input) p1=33174 p2=38937 p3=54005 p4=38871 (output) p1=4129 p2=46705 p3=4261 p4=48761
RY_CALCULATE1: (input) p1=46587 p2=36275 p3=6090 p4=63976 (output) p1=258 p2=13634 p3=906 p4=48966

90 seconds, and we've got:

A = A + B
B = C * 13
D = D ^ A
D = B | D
C = C & A
A = B & A

But that was still simple. Some 6-step algorithms are not possible to find, for example: A=A^B, A=A*9, A=A^C, A=A*19, A=A^D, A=A&B. The solver worked too long (up to several hours), so I never even found out whether it is feasible at all.

6.2.1 Conclusion

This is in fact an exercise in program synthesis. Some short algorithms for tiny CPUs are really possible to find using such a small set of data. Of course it's still not possible to reveal some complex algorithm, but this method definitely should not be ignored.
43 No Operation

6.2.2 The files

Rockey4 dongle programmer and reader, the Rockey4 manual, the Z3Py script for finding algorithms, and the input/output pairs: https://github.com/dennis714/SAT_SMT_article/tree/master/pgm_synth/rockey_files .

6.2.3 Further work

Perhaps constructing a LISP-like S-expression could work better than a program for a toy-level CPU. It's also possible to start with smaller constants and then proceed to bigger ones. This is somewhat similar to increasing the password length in brute-force password cracking.

7 Toy decompiler

7.1 Introduction

A modern-day compiler is the product of hundreds of developer-years. At the same time, a toy compiler can be an exercise for a student for a week (or even a weekend). Likewise, a commercial decompiler like Hex-Rays can be extremely complex, while a toy decompiler like this one can be easy to understand and remake.

The following decompiler, written in Python, supports only short basic blocks, with no jumps. Memory is also not supported.

7.2 Data structure

Our toy decompiler will use just one single data structure, representing an expression tree.

Many programming textbooks have an example of conversion from Fahrenheit temperature to Celsius, using the following formula:

celsius = ((fahrenheit - 32) * 5) / 9

This expression can be represented as a tree:

            "/"
           /   \
         "*"    9
        /   \
      "-"    5
     /   \
  INPUT   32

How to store it in memory? We see here 3 types of nodes: 1) numbers (or values); 2) arithmetic operations; 3) symbols (like "INPUT").

Many developers with OOP44 in mind would create some kind of class. Other developers maybe would use a "variant type". I'll use the simplest possible way of representing this structure: a Python tuple. The first element of the tuple is a string: either "EXPR_OP" for an operation, "EXPR_SYMBOL" for a symbol, or "EXPR_VALUE" for a value. In the case of a symbol or value, the string is followed by it. In the case of an operation, the string is followed by other tuples. The node type and operation type are stored as plain strings — to make the debugging output easier to read.
There are constructors in our code, in the OOP sense:

44 Object-oriented programming

def create_val_expr (val):
    return ("EXPR_VALUE", val)

def create_symbol_expr (val):
    return ("EXPR_SYMBOL", val)

def create_binary_expr (op, op1, op2):
    return ("EXPR_OP", op, op1, op2)

There are also accessors:

def get_expr_type(e):
    return e[0]

def get_symbol (e):
    assert get_expr_type(e)=="EXPR_SYMBOL"
    return e[1]

def get_val (e):
    assert get_expr_type(e)=="EXPR_VALUE"
    return e[1]

def is_expr_op(e):
    return get_expr_type(e)=="EXPR_OP"

def get_op (e):
    assert is_expr_op(e)
    return e[1]

def get_op1 (e):
    assert is_expr_op(e)
    return e[2]

def get_op2 (e):
    assert is_expr_op(e)
    return e[3]

The temperature conversion expression we just saw will be represented as:

"EXPR_OP" "/"
    "EXPR_OP" "*"
        "EXPR_OP" "-"
            "EXPR_SYMBOL" "arg1"
            "EXPR_VALUE" 32
        "EXPR_VALUE" 5
    "EXPR_VALUE" 9

...or as a Python expression:

('EXPR_OP', '/', ('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5)), ('EXPR_VALUE', 9))

In fact, this is an AST45 in its simplest form. ASTs are used heavily in compilers.

45 Abstract syntax tree

7.3 Simple examples

Let's start with the simplest example:

mov rax, rdi
imul rax, rsi

At the start, these symbols are assigned to the registers: RAX=initial_RAX, RBX=initial_RBX, RDI=arg1, RSI=arg2, RDX=arg3, RCX=arg4.

When we handle the MOV instruction, we just copy the expression from RDI to RAX. When we handle the IMUL instruction, we create a new expression, multiplying the expressions from RAX and RSI and putting the result into RAX again.

I can feed this to the decompiler, and we will see how the registers' state changes through processing:

python td.py --show-registers --python-expr tests/mul.s
...
line=[mov rax, rdi]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_SYMBOL', 'arg1')
line=[imul rax, rsi]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2'))
...
result=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2'))

The IMUL instruction is mapped to the "*" string, and then a new expression is constructed in handle_binary_op(), which puts the result into RAX.

In this output, the data structures are dumped using the Python str() function, which does mostly the same as print(). The output is bulky, so we can turn off the Python-expression output and see how this internal data structure can be rendered neatly using our internal expr_to_string() function:

python td.py --show-registers tests/mul.s
...
line=[mov rax, rdi]
rcx=arg4
rsi=arg2
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=arg1
line=[imul rax, rsi]
rcx=arg4
rsi=arg2
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=(arg1 * arg2)
...
result=(arg1 * arg2)

A slightly more advanced example:

imul rdi, rsi
lea rax, [rdi+rdx]

The LEA instruction is treated just as ADD.

python td.py --show-registers --python-expr tests/mul_add.s
...
line=[imul rdi, rsi]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2'))
rax=('EXPR_SYMBOL', 'initial_RAX')
line=[lea rax, [rdi+rdx]]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2'))
rax=('EXPR_OP', '+', ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2')), ('EXPR_SYMBOL', 'arg3'))
...
result=('EXPR_OP', '+', ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_SYMBOL', 'arg2')), ('EXPR_SYMBOL', 'arg3'))

And again, let's see this expression dumped neatly:

python td.py --show-registers tests/mul_add.s
...
result=((arg1 * arg2) + arg3)

Now another example, where we use 2 input arguments:

imul rdi, rdi, 1234
imul rsi, rsi, 5678
lea rax, [rdi+rsi]

python td.py --show-registers --python-expr tests/mul_add3.s
...
line=[imul rdi, rdi, 1234]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 1234))
rax=('EXPR_SYMBOL', 'initial_RAX')
line=[imul rsi, rsi, 5678]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg2'), ('EXPR_VALUE', 5678))
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 1234))
rax=('EXPR_SYMBOL', 'initial_RAX')
line=[lea rax, [rdi+rsi]]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg2'), ('EXPR_VALUE', 5678))
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 1234))
rax=('EXPR_OP', '+', ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 1234)), ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg2'), ('EXPR_VALUE', 5678)))
...
result=('EXPR_OP', '+', ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 1234)), ('EXPR_OP', '*', ('EXPR_SYMBOL', 'arg2'), ('EXPR_VALUE', 5678)))

...and now the neat output:

python td.py --show-registers tests/mul_add3.s
...
result=((arg1 * 1234) + (arg2 * 5678))

Now the temperature conversion program:

mov rax, rdi
sub rax, 32
imul rax, 5
mov rbx, 9
idiv rbx

You can see how the registers' state changes over execution (or rather parsing). Raw:

python td.py --show-registers --python-expr tests/fahr_to_celsius.s
...
line=[mov rax, rdi]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_SYMBOL', 'arg1')
line=[sub rax, 32]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32))
line=[imul rax, 5]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_SYMBOL', 'initial_RBX')
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5))
line=[mov rbx, 9]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_VALUE', 9)
rdx=('EXPR_SYMBOL', 'arg3')
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5))
line=[idiv rbx]
rcx=('EXPR_SYMBOL', 'arg4')
rsi=('EXPR_SYMBOL', 'arg2')
rbx=('EXPR_VALUE', 9)
rdx=('EXPR_OP', '%', ('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5)), ('EXPR_VALUE', 9))
rdi=('EXPR_SYMBOL', 'arg1')
rax=('EXPR_OP', '/', ('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5)), ('EXPR_VALUE', 9))
...
result=('EXPR_OP', '/', ('EXPR_OP', '*', ('EXPR_OP', '-', ('EXPR_SYMBOL', 'arg1'), ('EXPR_VALUE', 32)), ('EXPR_VALUE', 5)), ('EXPR_VALUE', 9))

Neat:

python td.py --show-registers tests/fahr_to_celsius.s
...
line=[mov rax, rdi]
rcx=arg4
rsi=arg2
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=arg1
line=[sub rax, 32]
rcx=arg4
rsi=arg2
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=(arg1 - 32)
line=[imul rax, 5]
rcx=arg4
rsi=arg2
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=((arg1 - 32) * 5)
line=[mov rbx, 9]
rcx=arg4
rsi=arg2
rbx=9
rdx=arg3
rdi=arg1
rax=((arg1 - 32) * 5)
line=[idiv rbx]
rcx=arg4
rsi=arg2
rbx=9
rdx=(((arg1 - 32) * 5) % 9)
rdi=arg1
rax=(((arg1 - 32) * 5) / 9)
...
result=(((arg1 - 32) * 5) / 9)

It is interesting to note that the IDIV instruction also calculates the remainder of the division, which is placed into the RDX register. It's not used here, but it is available for use. This is how the quotient and the remainder are stored in the registers:

def handle_unary_DIV_IDIV (registers, op1):
    op1_expr=register_or_number_in_string_to_expr (registers, op1)
    current_RAX=registers["rax"]
    registers["rax"]=create_binary_expr ("/", current_RAX, op1_expr)
    registers["rdx"]=create_binary_expr ("%", current_RAX, op1_expr)

Now this is the align2grain() function46:

; uint64_t align2grain (uint64_t i, uint64_t grain)
; return ((i + grain-1) & ~(grain-1));
; rdi=i
; rsi=grain
sub rsi, 1
add rdi, rsi
not rsi
and rdi, rsi
mov rax, rdi

46 Taken from https://docs.oracle.com/javase/specs/jvms/se6/html/Compiling.doc.html

...
line=[sub rsi, 1]
rcx=arg4
rsi=(arg2 - 1)
rbx=initial_RBX
rdx=arg3
rdi=arg1
rax=initial_RAX
line=[add rdi, rsi]
rcx=arg4
rsi=(arg2 - 1)
rbx=initial_RBX
rdx=arg3
rdi=(arg1 + (arg2 - 1))
rax=initial_RAX
line=[not rsi]
rcx=arg4
rsi=(~(arg2 - 1))
rbx=initial_RBX
rdx=arg3
rdi=(arg1 + (arg2 - 1))
rax=initial_RAX
line=[and rdi, rsi]
rcx=arg4
rsi=(~(arg2 - 1))
rbx=initial_RBX
rdx=arg3
rdi=((arg1 + (arg2 - 1)) & (~(arg2 - 1)))
rax=initial_RAX
line=[mov rax, rdi]
rcx=arg4
rsi=(~(arg2 - 1))
rbx=initial_RBX
rdx=arg3
rdi=((arg1 + (arg2 - 1)) & (~(arg2 - 1)))
rax=((arg1 + (arg2 - 1)) & (~(arg2 - 1)))
...
result=((arg1 + (arg2 - 1)) & (~(arg2 - 1)))
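The decompiled expression is easy to sanity-check in Python (a quick illustration, not from td.py): for a power-of-two grain, ((i + grain − 1) & ~(grain − 1)) rounds i up to the next multiple of grain:

```python
def align2grain(i, grain):
    # grain is assumed to be a power of two; the mask ~(grain-1)
    # clears the low bits after rounding up
    return (i + grain - 1) & ~(grain - 1)

assert align2grain(0, 16) == 0
assert align2grain(1, 16) == 16
assert align2grain(16, 16) == 16
assert align2grain(17, 16) == 32
```

This matches the expression the decompiler recovered from the five instructions above.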
7.4 Dealing with compiler optimizations

The following piece of code...

mov rax, rdi
add rax, rax

...will be transformed into the (arg1 + arg1) expression. It can be reduced to (arg1 * 2). Our toy decompiler can identify patterns like this and rewrite them.

# X+X -> X*2
def reduce_ADD1 (expr):
    if is_expr_op(expr) and get_op (expr)=="+" and get_op1 (expr)==get_op2 (expr):
        return dbg_print_reduced_expr ("reduce_ADD1", expr, create_binary_expr ("*", get_op1 (expr), create_val_expr (2)))
    return expr # no match

This function just tests whether the current node has the EXPR_OP type, the operation is "+", and both children are equal to each other. By the way, since our data structure is just a tuple of tuples, Python can compare them using the plain "==" operation. If the test succeeds, the current node is replaced with a new expression: we take one of the children, we construct a node of EXPR_VALUE type with the number 2 in it, and then we construct a node of EXPR_OP type with "*".

dbg_print_reduced_expr() serves solely debugging purposes — it just prints the old and the new (reduced) expressions.

The decompiler then traverses the expression tree recursively, in depth-first search fashion:

def reduce_step (e):
    if is_expr_op (e)==False:
        return e # expr isn't EXPR_OP, nothing to reduce (we don't reduce EXPR_SYMBOL and EXPR_VAL)
    if is_unary_op(get_op(e)):
        # recreate expr with reduced operand:
        return reducers(create_unary_expr (get_op(e), reduce_step (get_op1 (e))))
    else:
        # recreate expr with both reduced operands:
        return reducers(create_binary_expr (get_op(e), reduce_step (get_op1 (e)), reduce_step (get_op2 (e))))

...

# same as "return ...(reduce_MUL1 (reduce_ADD1 (reduce_ADD2 (... expr))))"
reducers=compose([
    ...
    reduce_ADD1,
    ...])

def reduce (e):
    print "going to reduce " + expr_to_string (e)
    new_expr=reduce_step(e)
    if new_expr==e:
        return new_expr # we are done here, expression can't be reduced further
    else:
        return reduce(new_expr) # reduced expr has been changed, so try to reduce it again

The reduction functions are called again and again, as long as the expression keeps changing. Now we run it:

python td.py tests/add1.s
...
going to reduce (arg1 + arg1)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
going to reduce (arg1 * 2)
result=(arg1 * 2)

So far so good; now what if we try this piece of code?

mov rax, rdi
add rax, rax
add rax, rax
add rax, rax

python td.py tests/add2.s
...
working out tests/add2.s
going to reduce (((arg1 + arg1) + (arg1 + arg1)) + ((arg1 + arg1) + (arg1 + arg1)))
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
reduction in reduce_ADD1() (((arg1 * 2) * 2) + ((arg1 * 2) * 2)) -> (((arg1 * 2) * 2) * 2)
going to reduce (((arg1 * 2) * 2) * 2)
result=(((arg1 * 2) * 2) * 2)

This is correct, but too verbose. We would like to rewrite the (X*n)*m expression to X*(n*m), where n and m are numbers. We could do this by adding another function like reduce_ADD1(), but there is a much better option: we can build a matcher for trees. You can think of it as a regular expression matcher, but one that works over trees.
def bind_expr (key):
    return ("EXPR_WILDCARD", key)

def bind_value (key):
    return ("EXPR_WILDCARD_VALUE", key)

def match_EXPR_WILDCARD (expr, pattern):
    return {pattern[1] : expr} # return {key : expr}

def match_EXPR_WILDCARD_VALUE (expr, pattern):
    if get_expr_type (expr)!="EXPR_VALUE":
        return None
    return {pattern[1] : get_val(expr)} # return {key : value}

def is_commutative (op):
    return op in ["+", "*", "&", "|", "^"]

def match_two_ops (op1_expr, op1_pattern, op2_expr, op2_pattern):
    m1=match (op1_expr, op1_pattern)
    m2=match (op2_expr, op2_pattern)
    if m1==None or m2==None:
        return None # one of the matches for the operands returned None, so we do the same
    # join two dicts from both operands:
    rt={}
    rt.update(m1)
    rt.update(m2)
    return rt

def match_EXPR_OP (expr, pattern):
    if get_expr_type(expr)!=get_expr_type(pattern): # be sure both are EXPR_OP.
        return None
    if get_op (expr)!=get_op (pattern): # be sure the op types are the same.
        return None
    if (is_unary_op(get_op(expr))):
        # match unary expression.
        return match (get_op1 (expr), get_op1 (pattern))
    else:
        # match binary expression.
        # first try to match operands as is.
        m=match_two_ops (get_op1 (expr), get_op1 (pattern), get_op2 (expr), get_op2 (pattern))
        if m!=None:
            return m
        # if matching was unsuccessful, AND the operation is commutative, try also swapped operands.
        if is_commutative (get_op (expr))==False:
            return None
        return match_two_ops (get_op1 (expr), get_op2 (pattern), get_op2 (expr), get_op1 (pattern))

# returns dict in case of success, or None
def match (expr, pattern):
    t=get_expr_type(pattern)
    if t=="EXPR_WILDCARD":
        return match_EXPR_WILDCARD (expr, pattern)
    elif t=="EXPR_WILDCARD_VALUE":
        return match_EXPR_WILDCARD_VALUE (expr, pattern)
    elif t=="EXPR_SYMBOL":
        if expr==pattern:
            return {}
        else:
            return None
    elif t=="EXPR_VALUE":
        if expr==pattern:
            return {}
        else:
            return None
    elif t=="EXPR_OP":
        return match_EXPR_OP (expr, pattern)
    else:
        raise AssertionError

Now, how we will use it:

# (X*A)*B -> X*(A*B)
def reduce_MUL1 (expr):
    m=match (expr, create_binary_expr ("*",
        (create_binary_expr ("*", bind_expr ("X"), bind_value ("A"))),
        bind_value ("B")))
    if m==None:
        return expr # no match
    return dbg_print_reduced_expr ("reduce_MUL1", expr, create_binary_expr ("*",
        m["X"], # new op1
        create_val_expr (m["A"] * m["B"]))) # new op2

We take the input expression, and we also construct a pattern to be matched. The matcher works recursively over both expressions synchronously. The pattern is also an expression, but it can use two additional node types: EXPR_WILDCARD and EXPR_WILDCARD_VALUE. These nodes are supplied with keys (stored as strings). When the matcher encounters EXPR_WILDCARD in the pattern, it just stashes the current expression and will return it. If the matcher encounters EXPR_WILDCARD_VALUE, it does the same, but only if the current node has the EXPR_VALUE type. bind_expr() and bind_value() are functions which create nodes of these two types.

All this means the reduce_MUL1() function will search for an expression of the form (X*A)*B, where A and B are numbers. In all other cases, the reducing function returns the input expression untouched, so these reducing functions can be chained.

Whenever reduce_MUL1() encounters a (sub)expression we are interested in, it returns a dictionary with keys and expressions. Let's add a print m call somewhere before the return and rerun:

python td.py tests/add2.s
...
going to reduce (((arg1 + arg1) + (arg1 + arg1)) + ((arg1 + arg1) + (arg1 + arg1)))
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
{'A': 2, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 2) * 2) -> (arg1 * 4)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
{'A': 2, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 2) * 2) -> (arg1 * 4)
reduction in reduce_ADD1() ((arg1 * 4) + (arg1 * 4)) -> ((arg1 * 4) * 2)
{'A': 4, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 4) * 2) -> (arg1 * 8)
going to reduce (arg1 * 8)
...
result=(arg1 * 8)

The dictionary has the keys we supplied plus the expressions the matcher found. We can then use them to create a new expression and return it. The two numbers are just multiplied while forming the second operand to the "*" operation.

Now a real-world optimization technique: optimizing GCC replaces multiplication by 31 with shift and subtraction operations:

mov rax, rdi
sal rax, 5
sub rax, rdi

Without reduction functions, our decompiler would translate this into ((arg1 << 5) - arg1). We can replace shifting left by multiplication:

# X<<n -> X*(2^n)
def reduce_SHL1 (expr):
    m=match (expr, create_binary_expr ("<<", bind_expr ("X"), bind_value ("Y")))
    if m==None:
        return expr # no match
    return dbg_print_reduced_expr ("reduce_SHL1", expr, create_binary_expr ("*", m["X"], create_val_expr (1<<m["Y"])))

# X*n - X -> X*(n-1)
def reduce_SUB3 (expr):
    m=match (expr, create_binary_expr ("-",
        create_binary_expr ("*", bind_expr("X1"), bind_value ("N")),
        bind_expr("X2")))
    if m!=None and match (m["X1"], m["X2"])!=None:
        return dbg_print_reduced_expr ("reduce_SUB3", expr, create_binary_expr ("*", m["X1"], create_val_expr (m["N"]-1)))
    else:
        return expr # no match

The matcher will return two X's, and we must make sure that they are equal.
In fact, in previous versions of this toy decompiler, I did the comparison with plain "==", and it worked. But we can reuse the match() function for the same purpose, because it processes commutative operations better. For example, if X1 is "Q+1" and X2 is "1+Q", the expressions are equal, but plain "==" will not work. On the other hand, when the match() function encounters a "+" operation (or another commutative operation) and the comparison fails, it will also try the swapped operands and compare again. However, to make this easier to understand, you can for a moment imagine there is "==" instead of the second match().

Anyway, here is what we've got:

working out tests/mul31_GCC.s
going to reduce ((arg1 << 5) - arg1)
reduction in reduce_SHL1() (arg1 << 5) -> (arg1 * 32)
reduction in reduce_SUB3() ((arg1 * 32) - arg1) -> (arg1 * 31)
going to reduce (arg1 * 31)
...
result=(arg1 * 31)

Another optimization technique is often seen in ARM Thumb code: AND-ing a value with a constant like 0xFFFFFFF0 is implemented using shifts:

mov rax, rdi
shr rax, 4
shl rax, 4

This code is quite common in ARM Thumb code, because it's a headache to encode a 32-bit constant using a couple of 16-bit Thumb instructions, while a single 16-bit instruction can shift by 4 bits left or right.

Also, the expression (x>>4)<<4 can jokingly be called the "twitching operator": I've heard the "--i++" expression was called like this in Russian-speaking social networks, it was some kind of meme ("operator podergivaniya").

Anyway, these reduction functions will be used:

# X>>n -> X / (2^n)
def reduce_SHR2 (expr):
    m=match(expr, create_binary_expr(">>", bind_expr("X"), bind_value("Y")))
    if m==None or m["Y"]>=64:
        return expr # no match
    return dbg_print_reduced_expr ("reduce_SHR2", expr, create_binary_expr ("/", m["X"], create_val_expr (1<<m["Y"])))

# X<<n -> X*(2^n)
def reduce_SHL1 (expr):
    m=match (expr, create_binary_expr ("<<", bind_expr ("X"), bind_value ("Y")))
    if m==None:
        return expr # no match
    return dbg_print_reduced_expr ("reduce_SHL1", expr, create_binary_expr ("*", m["X"], create_val_expr (1<<m["Y"])))

# (X / (2^n)) * (2^n) -> X & (~((2^n)-1))
def reduce_MUL2 (expr):
    m=match(expr, create_binary_expr ("*",
        create_binary_expr ("/", bind_expr("X"), bind_value("N1")),
        bind_value("N2")))
    if m==None or m["N1"]!=m["N2"] or is_2n(m["N1"])==False: # short-circuit expression
        return expr # no match
    return dbg_print_reduced_expr("reduce_MUL2", expr, create_binary_expr ("&", m["X"],
        create_val_expr((~(m["N1"]-1))&0xffffffffffffffff)))

Now the result:

working out tests/AND_by_shifts2.s
going to reduce ((arg1 >> 4) << 4)
reduction in reduce_SHR2() (arg1 >> 4) -> (arg1 / 16)
reduction in reduce_SHL1() ((arg1 / 16) << 4) -> ((arg1 / 16) * 16)
reduction in reduce_MUL2() ((arg1 / 16) * 16) -> (arg1 & 0xfffffffffffffff0)
going to reduce (arg1 & 0xfffffffffffffff0)
...
result=(arg1 & 0xfffffffffffffff0)

7.4.1 Division using multiplication

Division is often replaced by multiplication for performance reasons.

From school-level arithmetics, we can remember that division by 3 can be replaced by multiplication by 1/3. In fact, compilers sometimes do so for floating-point arithmetic; for example, the FDIV instruction in x86 code can be replaced by FMUL. At least MSVC 6.0 replaces division by 3 with multiplication by 1/3, and sometimes it's hard to be sure what operation was in the original source code.

But when we operate over integer values and CPU registers, we can't use fractions. However, we can rework the fraction:

result = x/3 = x * (1/3) = x * (1*MagicNumber)/(3*MagicNumber)

Given the fact that division by 2^n is very fast, we now should find that MagicNumber, for which the following equation will be true: 2^n = 3*MagicNumber.
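The derivation above is easy to sanity-check with a short sketch (plain Python, separate from the toy decompiler): take MagicNumber = ceil(2^n / 3) for some n, and confirm that multiplying and shifting really equals integer division by 3. The choice n=65 here is an assumption for illustration; it happens to give the well-known constant 0xAAAAAAAAAAAAAAAB.

```python
# Sketch: check the magic-number idea for unsigned division by 3.
# With n=65, MagicNumber = ceil(2**65 / 3) = 0xAAAAAAAAAAAAAAAB,
# and (x * MagicNumber) >> 65 == x // 3 for any 64-bit unsigned x.
import random

n = 65
magic = (2**n + 1) // 3          # ceil(2**65 / 3), since 2**65 % 3 == 2
assert magic == 0xAAAAAAAAAAAAAAAB

for _ in range(100000):
    x = random.getrandbits(64)
    assert (x * magic) >> n == x // 3
print("ok")
```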
This code performs division by 10:

mov rax, rdi
movabs rdx, 0cccccccccccccccdh
mul rdx
shr rdx, 3
mov rax, rdx

The division by 2^64 is somewhat hidden: the lower 64 bits of the product in RAX are not used (dropped); only the higher 64 bits of the product (in RDX) are used, and then shifted by an additional 3 bits.

The RDX register is set during processing of MUL/IMUL like this:

def handle_unary_MUL_IMUL (registers, op1):
    op1_expr=register_or_number_in_string_to_expr (registers, op1)
    result=create_binary_expr ("*", registers["rax"], op1_expr)
    registers["rax"]=result
    registers["rdx"]=create_binary_expr (">>", result, create_val_expr(64))

In other words, the assembly code we have just seen multiplies by 0cccccccccccccccdh / 2^(64+3), or divides by 2^(64+3) / 0cccccccccccccccdh. To find the divisor, we just have to divide the numerator by the denominator.

# n = magic number
# m = shifting coefficient
# return = 1 / (n / 2^m) = 2^m / n
def get_divisor (n, m):
    return (2**float(m))/float(n)

# (X*n)>>m, where m>=64 -> X/...
def reduce_div_by_MUL (expr):
    m=match (expr, create_binary_expr(">>",
        create_binary_expr ("*", bind_expr("X"), bind_value("N")),
        bind_value("M")))
    if m==None:
        return expr # no match
    divisor=get_divisor(m["N"], m["M"])
    return dbg_print_reduced_expr ("reduce_div_by_MUL", expr, create_binary_expr ("/", m["X"], create_val_expr (int(divisor))))

This works, but we have a problem: this rule takes the (arg1*0xcccccccccccccccd)>>64 expression first and finds the divisor to be equal to 1.25. This is correct: the result is shifted by 3 more bits afterwards (or divided by 8), and 1.25*8 = 10. But our toy decompiler doesn't support real numbers.

We can solve this problem in the following way: if the divisor has a fractional part, we postpone reducing, in the hope that two subsequent right shift operations will be reduced into a single one:

# (X*n)>>m, where m>=64 -> X/...
def reduce_div_by_MUL (expr):
    m=match (expr, create_binary_expr(">>",
        create_binary_expr ("*", bind_expr("X"), bind_value("N")),
        bind_value("M")))
    if m==None:
        return expr # no match
    divisor=get_divisor(m["N"], m["M"])
    if math.floor(divisor)==divisor:
        return dbg_print_reduced_expr ("reduce_div_by_MUL", expr, create_binary_expr ("/", m["X"], create_val_expr (int(divisor))))
    else:
        print "reduce_div_by_MUL(): postponing reduction, because divisor=", divisor
        return expr

That works:

working out tests/div_by_mult10_unsigned.s
going to reduce (((arg1 * 0xcccccccccccccccd) >> 64) >> 3)
reduce_div_by_MUL(): postponing reduction, because divisor= 1.25
reduction in reduce_SHR1() (((arg1 * 0xcccccccccccccccd) >> 64) >> 3) -> ((arg1 * 0xcccccccccccccccd) >> 67)
going to reduce ((arg1 * 0xcccccccccccccccd) >> 67)
reduction in reduce_div_by_MUL() ((arg1 * 0xcccccccccccccccd) >> 67) -> (arg1 / 10)
going to reduce (arg1 / 10)
result=(arg1 / 10)

I don't know if this is the best solution. An early version of this decompiler processed the input expression in two passes: the first pass for everything except division by multiplication, and the second pass for the latter. I don't know which way is better. Or maybe we could support real numbers in expressions?

A couple of words about better understanding of division by multiplication. Many people miss the "hidden" division by 2^32 or 2^64, when the lower 32-bit (or 64-bit) part of the product is not used (or just dropped). There is also a misconception that the modular inverse is used here. This is close, but not the same thing. The extended Euclidean algorithm is usually used to find the magic coefficient, but in fact, this algorithm is rather used to solve the equation; you can solve it using any other method. Also, needless to mention, the equation is unsolvable for some divisors, because this is a diophantine equation (i.e., an equation allowing only integer results), since we work on integer CPU registers, after all.

7.5 Obfuscation/deobfuscation

Despite the simplicity of our decompiler, we can see how to deobfuscate (or optimize) code using several simple tricks.
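The brute-force check mentioned later in this chapter can also be done right here, outside the decompiler; this sketch (plain Python, independent of the toy decompiler's code) compares the reduced (arg1 / 10) against the original multiply-and-shift expression:

```python
# Sketch: confirm that the reduced expression (arg1 / 10) matches the
# original ((arg1 * 0xcccccccccccccccd) >> 67) for random 64-bit values.
import random

MAGIC = 0xcccccccccccccccd

for _ in range(100000):
    x = random.getrandbits(64)
    assert (x * MAGIC) >> 67 == x // 10
print("ok")
```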
For example, this piece of code does nothing:

mov rax, rdi
xor rax, 12345678h
xor rax, 0deadbeefh
xor rax, 12345678h
xor rax, 0deadbeefh

We would need these rules to tame it:

# (X^n)^m -> X^(n^m)
def reduce_XOR4 (expr):
    m=match(expr, create_binary_expr("^",
        create_binary_expr ("^", bind_expr("X"), bind_value("N")),
        bind_value("M")))
    if m!=None:
        return dbg_print_reduced_expr ("reduce_XOR4", expr, create_binary_expr ("^", m["X"], create_val_expr (m["N"]^m["M"])))
    else:
        return expr # no match

...
# X op 0 -> X, where op is ADD, OR, XOR, SUB
def reduce_op_0 (expr):
    # try each:
    for op in ["+", "|", "^", "-"]:
        m=match(expr, create_binary_expr(op, bind_expr("X"), create_val_expr (0)))
        if m!=None:
            return dbg_print_reduced_expr ("reduce_op_0", expr, m["X"])
    # default:
    return expr # no match

working out tests/t9_obf.s
going to reduce ((((arg1 ^ 0x12345678) ^ 0xdeadbeef) ^ 0x12345678) ^ 0xdeadbeef)
reduction in reduce_XOR4() ((arg1 ^ 0x12345678) ^ 0xdeadbeef) -> (arg1 ^ 0xcc99e897)
reduction in reduce_XOR4() ((arg1 ^ 0xcc99e897) ^ 0x12345678) -> (arg1 ^ 0xdeadbeef)
reduction in reduce_XOR4() ((arg1 ^ 0xdeadbeef) ^ 0xdeadbeef) -> (arg1 ^ 0x0)
going to reduce (arg1 ^ 0x0)
reduction in reduce_op_0() (arg1 ^ 0x0) -> arg1
going to reduce arg1
result=arg1

This piece of code can be deobfuscated (or optimized) as well:

; toggle last bit
mov rax, rdi
mov rbx, rax
mov rcx, rbx
mov rsi, rcx
xor rsi, 12345678h
xor rsi, 12345679h
mov rax, rsi

working out tests/t7_obf.s
going to reduce ((arg1 ^ 0x12345678) ^ 0x12345679)
reduction in reduce_XOR4() ((arg1 ^ 0x12345678) ^ 0x12345679) -> (arg1 ^ 1)
going to reduce (arg1 ^ 1)
result=(arg1 ^ 1)

I also used the aha! [47] superoptimizer to find weird pieces of code which do nothing. Aha! is a so-called superoptimizer: it tries various pieces of code in brute-force manner, in an attempt to find the shortest possible alternative for some mathematical operation. While sane compiler developers use superoptimizers for this task, I tried it the opposite way, to find the oddest pieces of code for some simple operations, including the NOP operation.
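Claims like "this piece of code does nothing" are cheap to double-check numerically before trusting the reduction rules; for the XOR chain above, a quick check (separate from the decompiler) is:

```python
# Sketch: the four XORs above cancel out, so the whole chain is a no-op.
import random

for _ in range(1000):
    x = random.getrandbits(64)
    assert x ^ 0x12345678 ^ 0xdeadbeef ^ 0x12345678 ^ 0xdeadbeef == x
print("ok")
```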
In the past, I've used it to find a weird alternative to the XOR operation (5.7). So here is what aha! can find for NOP:

; do nothing (as found by aha)
mov rax, rdi
and rax, rax
or rax, rax

# X & X -> X
def reduce_AND3 (expr):
    m=match (expr, create_binary_expr ("&", bind_expr ("X1"), bind_expr ("X2")))
    if m!=None and match (m["X1"], m["X2"])!=None:
        return dbg_print_reduced_expr("reduce_AND3", expr, m["X1"])
    else:
        return expr # no match

...
# X | X -> X
def reduce_OR1 (expr):
    m=match (expr, create_binary_expr ("|", bind_expr ("X1"), bind_expr ("X2")))
    if m!=None and match (m["X1"], m["X2"])!=None:
        return dbg_print_reduced_expr("reduce_OR1", expr, m["X1"])
    else:
        return expr # no match

working out tests/t11_obf.s
going to reduce ((arg1 & arg1) | (arg1 & arg1))
reduction in reduce_AND3() (arg1 & arg1) -> arg1
reduction in reduce_AND3() (arg1 & arg1) -> arg1
reduction in reduce_OR1() (arg1 | arg1) -> arg1
going to reduce arg1
result=arg1

[47] http://www.hackersdelight.org/aha/aha.pdf

This is weirder:

; do nothing (as found by aha)
;Found a 5-operation program:
; neg r1,rx
; neg r2,rx
; neg r3,r1
; or r4,rx,2
; and r5,r4,r3
; Expr: ((x | 2) & -(-(x)))
mov rax, rdi
neg rax
neg rax
or rdi, 2
and rax, rdi

Rules added (I used the "NEG" string to represent sign change, to keep it distinct from the subtraction operation, which is just minus ("-")):

# (op(op X)) -> X, where both ops are NEG or NOT
def reduce_double_NEG_or_NOT (expr):
    # try each:
    for op in ["NEG", "~"]:
        m=match (expr, create_unary_expr (op, create_unary_expr (op, bind_expr("X"))))
        if m!=None:
            return dbg_print_reduced_expr ("reduce_double_NEG_or_NOT", expr, m["X"])
    # default:
    return expr # no match

...
# X & (X | ...)
#   -> X
def reduce_AND2 (expr):
    m=match (expr, create_binary_expr ("&",
        create_binary_expr ("|", bind_expr ("X1"), bind_expr ("REST")),
        bind_expr ("X2")))
    if m!=None and match (m["X1"], m["X2"])!=None:
        return dbg_print_reduced_expr("reduce_AND2", expr, m["X1"])
    else:
        return expr # no match

going to reduce ((-(-arg1)) & (arg1 | 2))
reduction in reduce_double_NEG_or_NOT() (-(-arg1)) -> arg1
reduction in reduce_AND2() (arg1 & (arg1 | 2)) -> arg1
going to reduce arg1
result=arg1

I also forced aha! to find a piece of code which adds 2, with no addition/subtraction operations allowed:

; arg1+2, without add/sub allowed, as found by aha:
;Found a 4-operation program:
; not r1,rx
; neg r2,r1
; not r3,r2
; neg r4,r3
; Expr: -(~(-(~(x))))
mov rax, rdi
not rax
neg rax
not rax
neg rax

Rule:

# -(~X) -> X+1
def reduce_NEG_NOT (expr):
    m=match (expr, create_unary_expr ("NEG", create_unary_expr ("~", bind_expr("X"))))
    if m==None:
        return expr # no match
    return dbg_print_reduced_expr ("reduce_NEG_NOT", expr, create_binary_expr ("+", m["X"], create_val_expr(1)))

working out tests/add_by_not_neg.s
going to reduce (-(~(-(~arg1))))
reduction in reduce_NEG_NOT() (-(~arg1)) -> (arg1 + 1)
reduction in reduce_NEG_NOT() (-(~(arg1 + 1))) -> ((arg1 + 1) + 1)
reduction in reduce_ADD3() ((arg1 + 1) + 1) -> (arg1 + 2)
going to reduce (arg1 + 2)
result=(arg1 + 2)

This is an artifact of the two's complement system of signed number representation. The same can be done for subtraction (just swap the NEG and NOT operations).
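The two's complement identity behind reduce_NEG_NOT (and its subtraction twin) can be checked directly in Python, masking to 64 bits to imitate register wraparound:

```python
# Sketch: the two's-complement identities behind reduce_NEG_NOT:
#   -(~x) == x + 1   and   ~(-x) == x - 1   (modulo 2**64)
import random

MASK = 0xffffffffffffffff
for _ in range(1000):
    x = random.getrandbits(64)
    assert (-(~x)) & MASK == (x + 1) & MASK
    assert (~(-x)) & MASK == (x - 1) & MASK
print("ok")
```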
Now let's add some fake luggage to the Fahrenheit-to-Celsius example:

; celsius = 5 * (fahr-32) / 9
; fake luggage:
mov rbx, 12345h
mov rax, rdi
sub rax, 32
; fake luggage:
add rbx, rax
imul rax, 5
mov rbx, 9
idiv rbx
; fake luggage:
sub rdx, rax

It's not a problem for our decompiler, because the noise is left in the RDX register and not used at all:

working out tests/fahr_to_celsius_obf1.s
line=[mov rbx, 12345h] rcx=arg4 rsi=arg2 rbx=0x12345 rdx=arg3 rdi=arg1 rax=initial_RAX
line=[mov rax, rdi] rcx=arg4 rsi=arg2 rbx=0x12345 rdx=arg3 rdi=arg1 rax=arg1
line=[sub rax, 32] rcx=arg4 rsi=arg2 rbx=0x12345 rdx=arg3 rdi=arg1 rax=(arg1 - 32)
line=[add rbx, rax] rcx=arg4 rsi=arg2 rbx=(0x12345 + (arg1 - 32)) rdx=arg3 rdi=arg1 rax=(arg1 - 32)
line=[imul rax, 5] rcx=arg4 rsi=arg2 rbx=(0x12345 + (arg1 - 32)) rdx=arg3 rdi=arg1 rax=((arg1 - 32) * 5)
line=[mov rbx, 9] rcx=arg4 rsi=arg2 rbx=9 rdx=arg3 rdi=arg1 rax=((arg1 - 32) * 5)
line=[idiv rbx] rcx=arg4 rsi=arg2 rbx=9 rdx=(((arg1 - 32) * 5) % 9) rdi=arg1 rax=(((arg1 - 32) * 5) / 9)
line=[sub rdx, rax] rcx=arg4 rsi=arg2 rbx=9 rdx=((((arg1 - 32) * 5) % 9) - (((arg1 - 32) * 5) / 9)) rdi=arg1 rax=(((arg1 - 32) * 5) / 9)
going to reduce (((arg1 - 32) * 5) / 9)
result=(((arg1 - 32) * 5) / 9)

We can try to pretend we affect the result with the noise:

; celsius = 5 * (fahr-32) / 9
; fake luggage:
mov rbx, 12345h
mov rax, rdi
sub rax, 32
; fake luggage:
add rbx, rax
imul rax, 5
mov rbx, 9
idiv rbx
; fake luggage:
sub rdx, rax
mov rcx, rax
; OR result with garbage (result of fake luggage):
or rcx, rdx
; the following instruction shouldn't affect result:
and rax, rcx

...but in fact, it's all reduced by the reduce_AND2() function we already saw (7.5):

working out tests/fahr_to_celsius_obf2.s
going to reduce ((((arg1 - 32) * 5) / 9) & ((((arg1 - 32) * 5) / 9) | ((((arg1 - 32) * 5) % 9) - (((arg1 - 32) * 5) / 9))))
reduction in reduce_AND2() ((((arg1 - 32) * 5) / 9) & ((((arg1 - 32) * 5) / 9) | ((((arg1 - 32) * 5) % 9) - (((arg1 - 32) * 5) / 9)))) -> (((arg1 - 32) * 5) / 9)
going to reduce
(((arg1 - 32) * 5) / 9)
result=(((arg1 - 32) * 5) / 9)

We can see that deobfuscation is in fact the same thing as the optimization used in compilers. We can try this function in GCC:

int f(int a)
{
        return -(~a);
};

Optimizing GCC 5.4 (x86) generates this:

f:
        mov eax, DWORD PTR [esp+4]
        add eax, 1
        ret

GCC has its own rewriting rules, some of which are probably close to what we use here.

7.6 Tests

Despite the simplicity of the decompiler, it's still error-prone. We need to be sure that the original expression and the reduced one are equivalent to each other.

7.6.1 Evaluating expressions

First of all, we can just evaluate (or run, or execute) the expression with random values as arguments, and then compare the results.

The evaluator does arithmetical operations when possible, recursively. When any symbol is encountered, its value (randomly generated before) is taken from a table.

un_ops={"NEG":operator.neg,
        "~":operator.invert}
bin_ops={">>":operator.rshift,
        "<<":(lambda x, c: x<<(c&0x3f)), # operator.lshift should be here, but it doesn't handle too big counts
        "&":operator.and_,
        "|":operator.or_,
        "^":operator.xor,
        "+":operator.add,
        "-":operator.sub,
        "*":operator.mul,
        "/":operator.div,
        "%":operator.mod}

def eval_expr(e, symbols):
    t=get_expr_type (e)
    if t=="EXPR_SYMBOL":
        return symbols[get_symbol(e)]
    elif t=="EXPR_VALUE":
        return get_val (e)
    elif t=="EXPR_OP":
        if is_unary_op (get_op (e)):
            return un_ops[get_op(e)](eval_expr(get_op1(e), symbols))
        else:
            return bin_ops[get_op(e)](eval_expr(get_op1(e), symbols), eval_expr(get_op2(e), symbols))
    else:
        raise AssertionError

def do_selftest(old, new):
    for n in range(100):
        symbols={"arg1":random.getrandbits(64),
                 "arg2":random.getrandbits(64),
                 "arg3":random.getrandbits(64),
                 "arg4":random.getrandbits(64)}
        old_result=eval_expr (old, symbols)&0xffffffffffffffff # signed->unsigned
        new_result=eval_expr (new, symbols)&0xffffffffffffffff # signed->unsigned
        if old_result!=new_result:
            print "self-test failed"
            print "initial expression: "+expr_to_string(old)
            print "reduced expression: "+expr_to_string(new)
            print "initial expression result: ", old_result
            print "reduced expression result: ", new_result
            exit(0)

In fact, this is very close to what the LISP EVAL function does, or even a LISP interpreter. However, not all symbols are set. If the expression uses initial values from RAX or RBX (to which the symbols "initial_RAX" and "initial_RBX" are assigned), the decompiler will stop with an exception, because no random values were assigned to these registers, and these symbols are absent in the symbols dictionary.

Using this test, I've suddenly found a bug here (despite the simplicity of all these reduction rules). Well, no-one is protected from eyestrain. Nevertheless, the test has a serious problem: some bugs can be revealed only if one of the arguments is 0, or 1, or -1. Maybe even more special cases exist. The aha! superoptimizer mentioned above tries at least these values as arguments while testing:

1, 0, -1, 0x80000000, 0x7FFFFFFF, 0x80000001, 0x7FFFFFFE, 0x01234567, 0x89ABCDEF, -2, 2, -3, 3, -64, 64, -5, -31415.

Still, you cannot be sure.

7.6.2 Using Z3 SMT-solver for testing

So here we will try the Z3 SMT-solver. An SMT-solver can prove that two expressions are equivalent to each other.

For example, with the help of aha!, I've found another weird piece of code which does nothing:

; do nothing (obfuscation)
;Found a 5-operation program:
; neg r1,rx
; neg r2,r1
; sub r3,r1,3
; sub r4,r3,r1
; sub r5,r4,r3
; Expr: (((-(x) - 3) - -(x)) - (-(x) - 3))
mov rax, rdi
neg rax
mov rbx, rax ; rbx=-x
mov rcx, rbx
sub rcx, 3 ; rcx=-x-3
mov rax, rcx
sub rax, rbx ; rax=(-(x) - 3) - -(x)
sub rax, rcx

Using the toy decompiler, I've found that this piece is reduced to the arg1 expression:

working out tests/t5_obf.s
going to reduce ((((-arg1) - 3) - (-arg1)) - ((-arg1) - 3))
reduction in reduce_SUB2() ((-arg1) - 3) -> (-(arg1 + 3))
reduction in reduce_SUB5() ((-(arg1 + 3)) - (-arg1)) -> ((-(arg1 + 3)) + arg1)
reduction in reduce_SUB2() ((-arg1) - 3) -> (-(arg1 + 3))
reduction in reduce_ADD_SUB() (((-(arg1 + 3)) + arg1) - (-(arg1 + 3))) -> arg1
going to reduce arg1
result=arg1

But is it correct?
I've added a function which can output expression(s) in SMT-LIB format; it's as simple as the function which converts an expression to a string. And this is the SMT-LIB file for Z3:

(assert (forall ((arg1 (_ BitVec 64)) (arg2 (_ BitVec 64)) (arg3 (_ BitVec 64)) (arg4 (_ BitVec 64)))
    (= (bvsub (bvsub (bvsub (bvneg arg1) #x0000000000000003) (bvneg arg1)) (bvsub (bvneg arg1) #x0000000000000003))
        arg1
    )
))
(check-sat)

In plain English terms, we are asking it to make sure that for all four 64-bit arguments, the two expressions are equivalent (the second is just arg1).

The syntax may be hard to understand, but in fact, it is very close to LISP, and the arithmetical operations are named "bvsub", "bvadd", etc., because "bv" stands for bit vector.

While running, Z3 shows "sat", meaning "satisfiable". In other words, Z3 couldn't find a counterexample for this expression.

In fact, I can rewrite this expression in the form expr1 != expr2, and then we would ask Z3 to find at least one set of input arguments for which the expressions are not equal to each other:

(declare-const arg1 (_ BitVec 64))
(declare-const arg2 (_ BitVec 64))
(declare-const arg3 (_ BitVec 64))
(declare-const arg4 (_ BitVec 64))
(assert (not
    (= (bvsub (bvsub (bvsub (bvneg arg1) #x0000000000000003) (bvneg arg1)) (bvsub (bvneg arg1) #x0000000000000003))
        arg1
    )
))
(check-sat)

Z3 says "unsat", meaning it couldn't find any such counterexample. In other words, for all possible input arguments, the results of these two expressions are always equal to each other.

Nevertheless, Z3 is not omnipotent. It fails to prove the equivalence of the code which performs division by multiplication. First of all, I extended it so both results will have a size of 128 bits instead of 64:

(declare-const x (_ BitVec 64))
(assert (forall ((x (_ BitVec 64)))
    (= ((_ zero_extend 64) (bvudiv x (_ bv17 64)))
       (bvlshr (bvmul ((_ zero_extend 64) x) #x0000000000000000f0f0f0f0f0f0f0f1) (_ bv68 128))
    )
))
(check-sat)
(get-model)

(bv17 is just the 64-bit number 17, etc. "bv" stands for "bit vector", as opposed to an integer value.)

Z3 worked for too long without any answer, and I had to interrupt it.
As the Z3 developers mentioned, such expressions are hard for Z3 so far: https://github.com/Z3Prover/z3/issues/514. Still, division by multiplication can be tested using the previously described brute-force check.

7.7 My other implementations of the toy decompiler

When I made an attempt to write it in C++, of course, a node in an expression was represented using a class. There is also an implementation in pure C [48], where a node is represented using a structure.

The matchers in both C++ and C versions don't return any dictionary; instead, the bind_value() functions take a pointer to a variable which will contain the value after successful matching. bind_expr() takes a pointer to a pointer, which will point to the part of the expression, again, in case of success. I took this idea from LLVM. Here are two pieces of code from the LLVM source code with a couple of reducing rules:

// (X >> A) << A -> X
Value *X;
if (match(Op0, m_Exact(m_Shr(m_Value(X), m_Specific(Op1)))))
    return X;

(lib/Analysis/InstructionSimplify.cpp)

// (A | B) | C and A | (B | C) -> bswap if possible.
// (A >> B) | (C << D) and (A << B) | (B >> C) -> bswap if possible.
if (match(Op0, m_Or(m_Value(), m_Value())) ||
    match(Op1, m_Or(m_Value(), m_Value())) ||
    (match(Op0, m_LogicalShift(m_Value(), m_Value())) &&
     match(Op1, m_LogicalShift(m_Value(), m_Value())))) {
    if (Instruction *BSwap = MatchBSwap(I))
        return BSwap;

(lib/Transforms/InstCombine/InstCombineAndOrXor.cpp)

As you can see, my matcher tries to mimic LLVM's. What I call reduction is called folding in LLVM. Both terms are popular. I also have a blog post about the LLVM obfuscator, in which the LLVM matcher is mentioned: https://yurichev.com/blog/llvm/.

The Python version of the toy decompiler uses strings in places where an enumerated data type is used in the C version (like OP_AND, OP_MUL, etc.) and where symbols are used in the Racket version [49] (like 'OP_DIV, etc.). This may be seen as inefficient; nevertheless, thanks to string interning, only the addresses of strings are compared in the Python version, not the strings as a whole. So strings in Python can be seen as a possible replacement for LISP symbols.
[48] https://github.com/dennis714/SAT_SMT_article/tree/master/toy_decompiler/files/C
[49] Racket is a dialect of Scheme (which is, in turn, a LISP dialect). https://github.com/dennis714/SAT_SMT_article/tree/master/toy_decompiler/files/Racket

7.7.1 Even simpler toy decompiler

Knowledge of LISP makes you understand all these things naturally, without significant effort. But back when I had no knowledge of it and still tried to make a simple toy decompiler, I made it using plain text strings which held the expression for each register (and even memory).

So when a MOV instruction copies a value from one register to another, we just copy the string. When an arithmetical instruction occurs, we do string concatenation:

std::string registers[TOTAL];

...

// all 3 arguments are strings
switch (ins, op1, op2)
{
...
case ADD:
    registers[op1]="(" + registers[op1] + " + " + registers[op2] + ")";
    break;
...
case MUL:
    registers[op1]="(" + registers[op1] + " * " + registers[op2] + ")";
    break;
...
}

Now you'll have long expressions for each register, represented as strings. For reducing them, you can use a plain simple regular expression matcher.

For example, for the rule (X*n)+(X*m) -> X*(n+m), you can match a (sub)string using the following regular expression: ((.*)*(.*))+((.*)*(.*)) [50]. If the string is matched, you get 4 groups (or substrings). You then just compare the 1st and 3rd using a string comparison function, then you check whether the 2nd and 4th are numbers, you convert them to numbers, sum them, and you make a new string consisting of the 1st group and the sum, like this: (" + X + "*" + (int(n) + int(m)) + ").

It was naïve and clumsy, it was a source of great embarrassment, but it worked correctly.

7.8 Difference between a toy decompiler and a commercial-grade one

Perhaps someone currently reading this text may rush into extending my source code. As an exercise, I would say that the first step could be support of partial registers, i.e., AL, AX, EAX. This is tricky, but doable. Another task may be support of FPU [51] x86 instructions (FPU stack modeling isn't a big deal).

The gap between a toy decompiler and a commercial decompiler like Hex-Rays is still enormous.
Several tricky problems must be solved, at least these:

• C data types: arrays, structures, pointers, etc. This problem is virtually non-existent for JVM [52] (Java, etc.) and .NET decompilers, because type information is present in the binary files.
• Basic blocks, C/C++ statements. Mike Van Emmerik in his thesis [53] shows how this can be tackled using SSA forms (which are also used heavily in compilers).
• Memory support, including the local stack. Keep in mind the pointer aliasing problem. Again, decompilers of JVM and .NET files are simpler here.

7.9 Further reading

There are several interesting open-source attempts to build a decompiler. Both the source code and the theses are an interesting study.

• decomp by Jim Reuter [54].
• DCC by Cristina Cifuentes [55]. It is interesting that this decompiler supports only one type (int). Maybe this is the reason why the DCC decompiler produces source code with the .B extension? Read more about the B typeless language (a C predecessor): https://yurichev.com/blog/typeless/.

[50] This regular expression string hasn't been properly escaped, for the sake of easier readability and understanding.
[51] Floating-point unit
[52] Java Virtual Machine
[53] https://yurichev.com/mirrors/vanEmmerik_ssa.pdf
[54] http://www.program-transformation.org/Transform/DecompReadMe, http://www.program-transformation.org/Transform/DecompDecompiler
[55] http://www.program-transformation.org/Transform/DccDecompiler, thesis: https://yurichev.com/mirrors/DCC_decompilation_thesis.pdf

• Boomerang by Mike Van Emmerik, Trent Waddington et al. [56]

As I've said, LISP knowledge can help you understand all this much more easily. Here is the well-known micro-interpreter of LISP by Peter Norvig, also written in Python: https://web.archive.org/web/20161116133448/http://www.norvig.com/lispy.html, https://web.archive.org/web/20160305172301/http://norvig.com/lispy2.html.

7.10 The files

Python version and tests: https://github.com/dennis714/SAT_SMT_article/tree/master/toy_decompiler/files. There are also C and Racket versions, but they are outdated.

Keep in mind: this decompiler is still at toy level, and it was tested only on the tiny test files supplied.
8 Symbolic execution

8.1 Symbolic computation

Let's first start with symbolic computation[57].

Some numbers can only be represented in the binary system approximately, like 1/3 and π. If we calculate 1/3 · 3 step-by-step, we may suffer loss of significance. We also know that sin(π/2) = 1, but calculating this expression in the usual way, we can also get some noise in the result. Arbitrary-precision arithmetic[58] is not a solution, because these numbers cannot be stored in memory as a binary number of finite length.

How could we tackle this problem? Humans reduce such expressions using paper and pencil, without any calculations. We can mimic human behaviour programmatically if we store an expression as a tree, and symbols like π are converted into numbers only at the very last step(s).

This is what Wolfram Mathematica[59] does. Let's start it and try this:

In[]:= x + 2*8
Out[]= 16 + x

Since Mathematica has no clue what x is, it's left as is, but 2*8 can be reduced easily, both by Mathematica and by humans, so that is what has been done. At some point in the future, Mathematica's user may assign some number to x, and then Mathematica will reduce the expression even further.

Mathematica does this because it parses the expression and finds some known patterns. This is also called term rewriting[60]. In plain English it may sound like this: "if there is a + operator between two known numbers, replace this subexpression with a computed number which is the sum of these two numbers, if possible". Just like humans do.

Mathematica also has rules like "replace sin(π) by 0" and "replace sin(π/2) by 1", but as you can see, π must be preserved as some kind of symbol instead of a number.

So Mathematica left x as an unknown value. This is, in fact, a common mistake by Mathematica's users: a small typo in an input expression may lead to a huge irreducible expression with the typo left in.

Another example: Mathematica deliberately left this unevaluated while computing a binary logarithm:

In[]:= Log[2, 36]
Out[]= Log[36]/Log[2]

It does so in the hope that at some point in the future, this expression will become a subexpression in another expression, and it will be reduced nicely at the very end.
But if we really need a numerical answer, we can force Mathematica to calculate it:

In[]:= Log[2, 36] // N
Out[]= 5.16993

Sometimes unresolved values are desirable:

In[]:= Union[{a, b, a, c}, {d, a, e, b}, {c, a}]
Out[]= {a, b, c, d, e}

Characters in the expression are just unresolved symbols[61] with no connections to numbers or other expressions, so Mathematica left them as is.

Another real-world example is symbolic integration[62], i.e., finding the formula for an integral by rewriting the initial expression using some predefined rules. Mathematica also does it:

In[]:= Integrate[1/(x^5), x]
Out[]= -(1/(4 x^4))

The benefits of symbolic computation are obvious: it is not prone to loss of significance[63] and round-off errors[64]. But the drawbacks are also obvious: you need to store the expression in a (possibly huge) tree and process it many times. Term rewriting is also slow. All these things are extremely clumsy in comparison to a fast FPU.

"Symbolic computation" is opposed to "numerical computation"; the latter is just processing numbers step-by-step, using a calculator, CPU or FPU. Some tasks can be solved better by the first method, some others by the second.

[56] http://boomerang.sourceforge.net/, http://www.program-transformation.org/Transform/MikeVanEmmerik, thesis: https://yurichev.com/mirrors/vanEmmerik_ssa.pdf
[57] https://en.wikipedia.org/wiki/Symbolic_computation
[58] https://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
[59] Other well-known symbolic computation systems are Maxima and SymPy.
[60] https://en.wikipedia.org/wiki/Rewriting

8.1.1 Rational data type

Some LISP implementations can store a number as a ratio/fraction[65], i.e., placing two numbers in a cell (which, in this case, is called an atom in LISP lingo). For example, you divide 1 by 3, and the interpreter, understanding that 1/3 is an irreducible fraction[66], creates a cell with the numbers 1 and 3. Some time after, you may multiply this cell by 6, and the multiplication function inside the LISP interpreter may return a much better result (2, without any noise). The printing function in the interpreter can also print something like 1 / 3 instead of a floating point number.
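Python's standard library offers exactly this kind of rational arithmetic in its fractions module, which is a quick way to see the behaviour described above (my own illustration, not from the original text):

```python
from fractions import Fraction

# 1 divided by 3 is kept as an exact irreducible ratio, not a float
x = Fraction(1, 3)
print(x)       # -> 1/3

# multiplying the ratio by 6 gives exactly 2, with no floating point noise
print(x * 6)   # -> 2
```

Just as with the LISP cells described above, the result of `x * 6` collapses back to a plain integer value, because the interpreter reduces the fraction 6/3 automatically.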
This is sometimes called "fractional arithmetic" [see Donald E. Knuth, The Art of Computer Programming, 3rd ed., (1997), 4.5.1, page 330].

This is not symbolic computation in any way, but it is slightly better than storing ratios/fractions as just floating point numbers. The drawbacks are clearly visible: you need more memory to store a ratio instead of a single number, and all arithmetic functions are more complex and slower, because they must handle both numbers and ratios.

Perhaps because of these drawbacks, some programming languages offer a separate (rational) data type, either as a language feature or supported by a library[67]: Haskell, OCaml, Perl, Ruby, Python, Smalltalk, Java, Clojure, C/C++[68].

8.2 Symbolic execution

8.2.1 Swapping two values using XOR

There is a well-known (but counterintuitive) algorithm for swapping two values in two variables using the XOR operation, without the use of any additional memory/register:

X=X^Y
Y=Y^X
X=X^Y

How does it work? It would be better to construct an expression at each step of execution.

#!/usr/bin/env python

class Expr:
    def __init__(self,s):
        self.s=s
    def __str__(self):
        return self.s
    def __xor__(self, other):
        return Expr("(" + self.s + "^" + other.s + ")")

def XOR_swap(X, Y):
    X=X^Y
    Y=Y^X
    X=X^Y
    return X, Y

new_X, new_Y=XOR_swap(Expr("X"), Expr("Y"))
print "new_X", new_X
print "new_Y", new_Y

[61] A symbol, like in LISP
[62] https://en.wikipedia.org/wiki/Symbolic_integration
[63] https://en.wikipedia.org/wiki/Loss_of_significance
[64] https://en.wikipedia.org/wiki/Round-off_error
[65] https://en.wikipedia.org/wiki/Rational_data_type
[66] https://en.wikipedia.org/wiki/Irreducible_fraction
[67] A more detailed list: https://en.wikipedia.org/wiki/Rational_data_type
[68] Via the GNU Multiple Precision Arithmetic Library

It works because Python is a dynamically typed PL, so the function doesn't care what it operates on: numerical values, or objects of the Expr() class. Here is the result:

new_X ((X^Y)^(Y^(X^Y)))
new_Y (Y^(X^Y))

You can remove the duplicated variables in your mind (since XORing by a value twice results in nothing). In new_X we can drop two X-es and two Y-es, and a single Y will be left. In new_Y we can drop two Y-es, and a single X will be left.

8.2.2 Change endianness

What does this code do?
mov eax, ecx
mov edx, ecx
shl edx, 16
and eax, 0000ff00H
or  eax, edx
mov edx, ecx
and edx, 00ff0000H
shr ecx, 16
or  edx, ecx
shl eax, 8
shr edx, 8
or  eax, edx

In fact, many reverse engineers play the shell game a lot, keeping track of what is stored where at each point in time.

Figure 8: Hieronymus Bosch – The Conjurer

Again, we can build an equivalent function which can take both numerical variables and Expr() objects. We also extend the Expr() class to support many arithmetical and boolean operations. In addition, Expr() methods take both Expr() objects and integer values on input.

#!/usr/bin/env python

class Expr:
    def __init__(self,s):
        self.s=s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __str__(self):
        return self.s
    def __xor__(self, other):
        return Expr("(" + self.s + "^" + self.convert_to_Expr_if_int(other).s + ")")
    def __and__(self, other):
        return Expr("(" + self.s + "&" + self.convert_to_Expr_if_int(other).s + ")")
    def __or__(self, other):
        return Expr("(" + self.s + "|" + self.convert_to_Expr_if_int(other).s + ")")
    def __lshift__(self, other):
        return Expr("(" + self.s + "<<" + self.convert_to_Expr_if_int(other).s + ")")
    def __rshift__(self, other):
        return Expr("(" + self.s + ">>" + self.convert_to_Expr_if_int(other).s + ")")

# change endianness

ecx=Expr("initial_ECX") # 1st argument
eax=ecx                 # mov eax, ecx
edx=ecx                 # mov edx, ecx
edx=edx<<16             # shl edx, 16
eax=eax&0xff00          # and eax, 0000ff00H
eax=eax|edx             # or eax, edx
edx=ecx                 # mov edx, ecx
edx=edx&0x00ff0000      # and edx, 00ff0000H
ecx=ecx>>16             # shr ecx, 16
edx=edx|ecx             # or edx, ecx
eax=eax<<8              # shl eax, 8
edx=edx>>8              # shr edx, 8
eax=eax|edx             # or eax, edx

print eax

I run it:

((((initial_ECX&65280)|(initial_ECX<<16))<<8)|(((initial_ECX&16711680)|(initial_ECX>>16))>>8))

Now this is something more readable, though a bit LISPy at first sight. In fact, this is a function which changes the endianness of a 32-bit word. By the way, my Toy Decompiler can do this job as well, but it operates on an AST instead of plain strings: see 7.
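To double-check that conclusion, here is a small sketch of my own (not part of the original text) that replays the same instruction sequence on a concrete 32-bit value, truncating to 32 bits after each shift, as the CPU would:

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide

def change_endianness(ecx):
    eax = ecx                      # mov eax, ecx
    edx = (ecx << 16) & MASK       # mov edx, ecx / shl edx, 16
    eax &= 0x0000ff00              # and eax, 0000ff00H
    eax |= edx                     # or  eax, edx
    edx = ecx & 0x00ff0000         # mov edx, ecx / and edx, 00ff0000H
    edx |= ecx >> 16               # shr ecx, 16 / or edx, ecx
    eax = (eax << 8) & MASK        # shl eax, 8
    edx >>= 8                      # shr edx, 8
    return eax | edx               # or  eax, edx

print(hex(change_endianness(0x11223344)))   # -> 0x44332211
```

The bytes come out in reverse order, confirming that the sequence is a 32-bit byte swap; applying the function twice returns the original value.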
8.2.3 Fast Fourier transform

I've found one of the smallest possible FFT implementations on reddit:

#!/usr/bin/env python
from cmath import exp,pi

def FFT(X):
    n = len(X)
    w = exp(-2*pi*1j/n)
    if n > 1:
        X = FFT(X[::2]) + FFT(X[1::2])
        for k in xrange(n/2):
            xk = X[k]
            X[k] = xk + w**k*X[k+n/2]
            X[k+n/2] = xk - w**k*X[k+n/2]
    return X

print FFT([1,2,3,4,5,6,7,8])

Just out of interest: what value does each element have on output?

#!/usr/bin/env python
from cmath import exp,pi

class Expr:
    def __init__(self,s):
        self.s=s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __str__(self):
        return self.s
    def __add__(self, other):
        return Expr("(" + self.s + "+" + self.convert_to_Expr_if_int(other).s + ")")
    def __sub__(self, other):
        return Expr("(" + self.s + "-" + self.convert_to_Expr_if_int(other).s + ")")
    def __mul__(self, other):
        return Expr("(" + self.s + "*" + self.convert_to_Expr_if_int(other).s + ")")
    def __pow__(self, other):
        return Expr("(" + self.s + "**" + self.convert_to_Expr_if_int(other).s + ")")

def FFT(X):
    n = len(X)
    # cast complex value to string, and then to Expr
    w = Expr(str(exp(-2*pi*1j/n)))
    if n > 1:
        X = FFT(X[::2]) + FFT(X[1::2])
        for k in xrange(n/2):
            xk = X[k]
            X[k] = xk + w**k*X[k+n/2]
            X[k+n/2] = xk - w**k*X[k+n/2]
    return X

input=[Expr("input_%d" % i) for i in range(8)]
output=FFT(input)
for i in range(len(output)):
    print i, ":", output[i]

The FFT() function is left almost intact; the only thing I added is that the complex value is converted into a string, and then an Expr() object is constructed.
0 : (((input_0+(((-1-1.22464679915e-16j)**0)*input_4))+(((6.12323399574e-17-1j)**0)*(input_2+(((-1-1.22464679915 e-16j)**0)*input_6))))+(((0.707106781187-0.707106781187j)**0)*((input_1+(((-1-1.22464679915e-16j)**0)* input_5))+(((6.12323399574e-17-1j)**0)*(input_3+(((-1-1.22464679915e-16j)**0)*input_7)))))) 1 : (((input_0-(((-1-1.22464679915e-16j)**0)*input_4))+(((6.12323399574e-17-1j)**1)*(input_2-(((-1-1.22464679915 e-16j)**0)*input_6))))+(((0.707106781187-0.707106781187j)**1)*((input_1-(((-1-1.22464679915e-16j)**0)* input_5))+(((6.12323399574e-17-1j)**1)*(input_3-(((-1-1.22464679915e-16j)**0)*input_7)))))) 2 : (((input_0+(((-1-1.22464679915e-16j)**0)*input_4))-(((6.12323399574e-17-1j)**0)*(input_2+(((-1-1.22464679915 e-16j)**0)*input_6))))+(((0.707106781187-0.707106781187j)**2)*((input_1+(((-1-1.22464679915e-16j)**0)* input_5))-(((6.12323399574e-17-1j)**0)*(input_3+(((-1-1.22464679915e-16j)**0)*input_7)))))) 3 : (((input_0-(((-1-1.22464679915e-16j)**0)*input_4))-(((6.12323399574e-17-1j)**1)*(input_2-(((-1-1.22464679915 e-16j)**0)*input_6))))+(((0.707106781187-0.707106781187j)**3)*((input_1-(((-1-1.22464679915e-16j)**0)* input_5))-(((6.12323399574e-17-1j)**1)*(input_3-(((-1-1.22464679915e-16j)**0)*input_7)))))) 4 : (((input_0+(((-1-1.22464679915e-16j)**0)*input_4))+(((6.12323399574e-17-1j)**0)*(input_2+(((-1-1.22464679915 e-16j)**0)*input_6))))-(((0.707106781187-0.707106781187j)**0)*((input_1+(((-1-1.22464679915e-16j)**0)* input_5))+(((6.12323399574e-17-1j)**0)*(input_3+(((-1-1.22464679915e-16j)**0)*input_7)))))) 5 : (((input_0-(((-1-1.22464679915e-16j)**0)*input_4))+(((6.12323399574e-17-1j)**1)*(input_2-(((-1-1.22464679915 e-16j)**0)*input_6))))-(((0.707106781187-0.707106781187j)**1)*((input_1-(((-1-1.22464679915e-16j)**0)* input_5))+(((6.12323399574e-17-1j)**1)*(input_3-(((-1-1.22464679915e-16j)**0)*input_7)))))) 6 : (((input_0+(((-1-1.22464679915e-16j)**0)*input_4))-(((6.12323399574e-17-1j)**0)*(input_2+(((-1-1.22464679915 
e-16j)**0)*input_6))))-(((0.707106781187-0.707106781187j)**2)*((input_1+(((-1-1.22464679915e-16j)**0)*input_5))-(((6.12323399574e-17-1j)**0)*(input_3+(((-1-1.22464679915e-16j)**0)*input_7))))))
7 : (((input_0-(((-1-1.22464679915e-16j)**0)*input_4))-(((6.12323399574e-17-1j)**1)*(input_2-(((-1-1.22464679915e-16j)**0)*input_6))))-(((0.707106781187-0.707106781187j)**3)*((input_1-(((-1-1.22464679915e-16j)**0)*input_5))-(((6.12323399574e-17-1j)**1)*(input_3-(((-1-1.22464679915e-16j)**0)*input_7))))))

We can see subexpressions of the form x^0 and x^1. We can eliminate them, since x^0 = 1 and x^1 = x; likewise, multiplications by 1 can be dropped:

    def __mul__(self, other):
        op1=self.s
        op2=self.convert_to_Expr_if_int(other).s
        if op1=="1":
            return Expr(op2)
        if op2=="1":
            return Expr(op1)
        return Expr("(" + op1 + "*" + op2 + ")")

    def __pow__(self, other):
        op2=self.convert_to_Expr_if_int(other).s
        if op2=="0":
            return Expr("1")
        if op2=="1":
            return Expr(self.s)
        return Expr("(" + self.s + "**" + op2 + ")")

0 : (((input_0+input_4)+(input_2+input_6))+((input_1+input_5)+(input_3+input_7)))
1 : (((input_0-input_4)+((6.12323399574e-17-1j)*(input_2-input_6)))+((0.707106781187-0.707106781187j)*((input_1-input_5)+((6.12323399574e-17-1j)*(input_3-input_7)))))
2 : (((input_0+input_4)-(input_2+input_6))+(((0.707106781187-0.707106781187j)**2)*((input_1+input_5)-(input_3+input_7))))
3 : (((input_0-input_4)-((6.12323399574e-17-1j)*(input_2-input_6)))+(((0.707106781187-0.707106781187j)**3)*((input_1-input_5)-((6.12323399574e-17-1j)*(input_3-input_7)))))
4 : (((input_0+input_4)+(input_2+input_6))-((input_1+input_5)+(input_3+input_7)))
5 : (((input_0-input_4)+((6.12323399574e-17-1j)*(input_2-input_6)))-((0.707106781187-0.707106781187j)*((input_1-input_5)+((6.12323399574e-17-1j)*(input_3-input_7)))))
6 : (((input_0+input_4)-(input_2+input_6))-(((0.707106781187-0.707106781187j)**2)*((input_1+input_5)-(input_3+input_7))))
7 :
(((input_0-input_4)-((6.12323399574e-17-1j)*(input_2-input_6)))-(((0.707106781187-0.707106781187j)**3)*((input_1-input_5)-((6.12323399574e-17-1j)*(input_3-input_7)))))

8.2.4 Cyclic redundancy check

I've always wondered which input bit affects which bit in the final CRC32 value.

From CRC[69] theory (a good and concise introduction: http://web.archive.org/web/20161220015646/http://www.hackersdelight.org/crc.pdf) we know that CRC is a shifting register with taps.

We will track each bit rather than each byte or word, which is highly inefficient, but serves our purpose better:

#!/usr/bin/env python
import sys

class Expr:
    def __init__(self,s):
        self.s=s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __str__(self):
        return self.s
    def __xor__(self, other):
        return Expr("(" + self.s + "^" + self.convert_to_Expr_if_int(other).s + ")")

BYTES=1

def crc32(buf):
    #state=[Expr("init_%d" % i) for i in range(32)]
    state=[Expr("1") for i in range(32)]
    for byte in buf:
        for n in range(8):
            bit=byte[n]
            to_taps=bit^state[31]
            state[31]=state[30]
            state[30]=state[29]
            state[29]=state[28]
            state[28]=state[27]
            state[27]=state[26]
            state[26]=state[25]^to_taps
            state[25]=state[24]
            state[24]=state[23]
            state[23]=state[22]^to_taps
            state[22]=state[21]^to_taps
            state[21]=state[20]
            state[20]=state[19]
            state[19]=state[18]
            state[18]=state[17]
            state[17]=state[16]
            state[16]=state[15]^to_taps
            state[15]=state[14]
            state[14]=state[13]
            state[13]=state[12]
            state[12]=state[11]^to_taps
            state[11]=state[10]^to_taps
            state[10]=state[9]^to_taps
            state[9]=state[8]
            state[8]=state[7]^to_taps
            state[7]=state[6]^to_taps
            state[6]=state[5]
            state[5]=state[4]^to_taps
            state[4]=state[3]^to_taps
            state[3]=state[2]
            state[2]=state[1]^to_taps
            state[1]=state[0]^to_taps
            state[0]=to_taps
    for i in range(32):
        print "state %d=%s" % (i, state[31-i])

buf=[[Expr("in_%d_%d" % (byte, bit)) for bit in range(8)] for byte in range(BYTES)]
crc32(buf)

[69] Cyclic redundancy check
Here are the expressions for each CRC32 bit for a 1-byte buffer:

state 0=(1^(in_0_2^1))
state 1=((1^(in_0_0^1))^(in_0_3^1))
state 2=(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))
state 3=(((1^(in_0_1^1))^(in_0_2^1))^(in_0_5^1))
state 4=(((1^(in_0_2^1))^(in_0_3^1))^(in_0_6^(1^(in_0_0^1))))
state 5=(((1^(in_0_3^1))^(in_0_4^1))^(in_0_7^(1^(in_0_1^1))))
state 6=((1^(in_0_4^1))^(in_0_5^1))
state 7=((1^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))
state 8=(((1^(in_0_0^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 9=((1^(in_0_1^1))^(in_0_7^(1^(in_0_1^1))))
state 10=(1^(in_0_2^1))
state 11=(1^(in_0_3^1))
state 12=((1^(in_0_0^1))^(in_0_4^1))
state 13=(((1^(in_0_0^1))^(in_0_1^1))^(in_0_5^1))
state 14=((((1^(in_0_0^1))^(in_0_1^1))^(in_0_2^1))^(in_0_6^(1^(in_0_0^1))))
state 15=((((1^(in_0_1^1))^(in_0_2^1))^(in_0_3^1))^(in_0_7^(1^(in_0_1^1))))
state 16=((((1^(in_0_0^1))^(in_0_2^1))^(in_0_3^1))^(in_0_4^1))
state 17=(((((1^(in_0_0^1))^(in_0_1^1))^(in_0_3^1))^(in_0_4^1))^(in_0_5^1))
state 18=(((((1^(in_0_1^1))^(in_0_2^1))^(in_0_4^1))^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))
state 19=((((((1^(in_0_0^1))^(in_0_2^1))^(in_0_3^1))^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 20=((((((1^(in_0_0^1))^(in_0_1^1))^(in_0_3^1))^(in_0_4^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 21=(((((1^(in_0_1^1))^(in_0_2^1))^(in_0_4^1))^(in_0_5^1))^(in_0_7^(1^(in_0_1^1))))
state 22=(((((1^(in_0_0^1))^(in_0_2^1))^(in_0_3^1))^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))
state 23=((((((1^(in_0_0^1))^(in_0_1^1))^(in_0_3^1))^(in_0_4^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 24=((((((in_0_0^1)^(in_0_1^1))^(in_0_2^1))^(in_0_4^1))^(in_0_5^1))^(in_0_7^(1^(in_0_1^1))))
state 25=(((((in_0_1^1)^(in_0_2^1))^(in_0_3^1))^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))
state 26=(((((in_0_2^1)^(in_0_3^1))^(in_0_4^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 27=((((in_0_3^1)^(in_0_4^1))^(in_0_5^1))^(in_0_7^(1^(in_0_1^1))))
state 28=(((in_0_4^1)^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))
state 29=(((in_0_5^1)^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))
state 30=((in_0_6^(1^(in_0_0^1)))^(in_0_7^(1^(in_0_1^1))))
state 31=(in_0_7^(1^(in_0_1^1)))

For larger buffers, the expressions grow exponentially. This is the 0th bit of the final state for a 4-byte buffer:

state 0=((((((((((((((in_0_0^1)^(in_0_1^1))^(in_0_2^1))^(in_0_4^1))^(in_0_5^1))^(in_0_7^(1^(in_0_1^1))))^(in_1_0^(1^(in_0_2^1))))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))^(in_1_3^(((1^(in_0_1^1))^(in_0_2^1))^(in_0_5^1))))^(in_1_4^(((1^(in_0_2^1))^(in_0_3^1))^(in_0_6^(1^(in_0_0^1))))))^(in_2_0^((((1^(in_0_0^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))))^(in_2_6^(((((((1^(in_0_0^1))^(in_0_1^1))^(in_0_2^1))^(in_0_6^(1^(in_0_0^1))))^(in_1_4^(((1^(in_0_2^1))^(in_0_3^1))^(in_0_6^(1^(in_0_0^1))))))^(in_1_5^(((1^(in_0_3^1))^(in_0_4^1))^(in_0_7^(1^(in_0_1^1))))))^(in_2_0^((((1^(in_0_0^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))))))^(in_2_7^(((((((1^(in_0_1^1))^(in_0_2^1))^(in_0_3^1))^(in_0_7^(1^(in_0_1^1))))^(in_1_5^(((1^(in_0_3^1))^(in_0_4^1))^(in_0_7^(1^(in_0_1^1))))))^(in_1_6^(((1^(in_0_4^1))^(in_0_5^1))^(in_1_0^(1^(in_0_2^1))))))^(in_2_1^((((1^(in_0_1^1))^(in_0_7^(1^(in_0_1^1))))^(in_1_0^(1^(in_0_2^1))))^(in_1_3^(((1^(in_0_1^1))^(in_0_2^1))^(in_0_5^1))))))))^(in_3_2^(((((((((1^(in_0_1^1))^(in_0_2^1))^(in_0_4^1))^(in_0_5^1))^(in_0_6^(1^(in_0_0^1))))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))^(in_2_0^((((1^(in_0_0^1))^(in_0_6^(1^(in_0_0^1))))^(in_0_7^(1^(in_0_1^1))))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))))^(in_2_1^((((1^(in_0_1^1))^(in_0_7^(1^(in_0_1^1))))^(in_1_0^(1^(in_0_2^1))))^(in_1_3^(((1^(in_0_1^1))^(in_0_2^1))^(in_0_5^1))))))^(in_2_4^(((((1^(in_0_0^1))^(in_0_4^1))^(in_1_2^(((1^(in_0_0^1))^(in_0_1^1))^(in_0_4^1))))^(in_1_3^(((1^(in_0_1^1))^(in_0_2^1))^(in_0_5^1))))^(in_1_6^(((1^(in_0_4^1))^(in_0_5^1))^(in_1_0^(1^(in_0_2^1))))))))))
The expression for the 0th bit of the final state for an 8-byte buffer has a length of about 350 KiB. It could, of course, be reduced significantly (because this expression is basically a XOR tree), but you can feel the weight of it.

Now we can process these expressions somehow to get a smaller picture of what affects what. Let's say, if we can find the "in_2_3" substring in an expression, this means that the 3rd bit of the 2nd byte of input affects this expression. But even more than that: since this is a XOR tree (i.e., an expression consisting only of XOR operations), if some input variable occurs twice, it is annihilated, since x^x = 0. More than that: if a variable occurs an even number of times (2, 4, 8, etc.), it is annihilated, but it is left if it occurs an odd number of times (1, 3, 5, etc.).

for i in range(32):
    #print "state %d=%s" % (i, state[31-i])
    sys.stdout.write ("state %02d: " % i)
    for byte in range(BYTES):
        for bit in range(8):
            s="in_%d_%d" % (byte, bit)
            if str(state[31-i]).count(s) & 1:
                sys.stdout.write ("*")
            else:
                sys.stdout.write (" ")
    sys.stdout.write ("\n")

(https://github.com/dennis714/SAT_SMT_article/blob/master/symbolic/4_CRC/2.py)

Now this is how each bit of a 1-byte input buffer affects each bit of the final CRC32 state:

state 00: *
state 01: * *
state 02: ** *
state 03: ** *
state 04: * ** *
state 05: * ** *
state 06: **
state 07: * **
state 08: * **
state 09: *
state 10: *
state 11: *
state 12: * *
state 13: ** *
state 14: ** *
state 15: ** *
state 16: * ***
state 17: ** ***
state 18: *** ***
state 19: *** ***
state 20: ** **
state 21: * ** *
state 22: ** **
state 23: ** **
state 24: * * ** *
state 25: **** **
state 26: ***** **
state 27: * *** *
state 28: * ***
state 29: ** ***
state 30: ** **
state 31: * *

This is 8*8 = 64 bits of an 8-byte input buffer:

state 00: * ** * *** * ** ** * * ***** *** * * ** *
state 01: * * ** * *** * ** ** * * ***** *** * * ** *
state 02: ** * ** * *** * ** ** * * ***** *** * * ** *
state 03: *** * ** * *** * ** ** * * ***** *** * * ** *
state 04: **** * ** * *** * ** ** * * ***** *** * * ** *
state 05: **** * ** * *** * ** ** * * ***** *** * * ** *
state 06: ** *** ** ** * ** *** * * ** ** *** * * * **
state 07: * ** *** ** ** * ** *** * * ** ** *** * * * **
state 08: * ** *** ** ** * ** *** * * ** ** *** * * * **
state 09: *** ** * * ** *** * ***** * * ** ** ** * * ** * *
state 10: ** * *** * * * * ** * * ** * * ** * ** *
state 11: ** * *** * * * * ** * * ** * * ** * ** *
state 12: ** * *** * * * * ** * * ** * * ** * ** *
state 13: ** * *** * * * * ** * * ** * * ** * ** *
state 14: ** * *** * * * * ** * * ** * * ** * ** *
state 15: ** * *** * * * * ** * * ** * * ** * ** *
state 16: * ** ****** ** ** ** * * * ** * ** * *** ***
state 17: * * ** ****** ** ** ** * * * ** * ** * *** ***
state 18: * * ** ****** ** ** ** * * * ** * ** * *** ***
state 19: * * * ** ****** ** ** ** * * * ** * ** * *** ***
state 20: ****** ** ** *** ** * * * ***** * **** * * ** **
state 21: ** *** ** * * * ** ** *** ** * * * ** * * ** *
state 22: ** * * *** ** ** * ** ***** * ** * *** * ** **
state 23: * ** * * *** ** ** * ** ***** * ** * *** * ** **
state 24: * *** * *** *** *** * * * * ** ***** ** * ** * ** *
state 25: * * *** *** * * **** * ** * *** * * ***** **
state 26: * * * *** *** * * **** * ** * *** * * ***** **
state 27: * *** * ***** **** * *** ** *** * ** * * *** *
state 28: *** * *** * ***** *** * * *** ** **** ***
state 29: *** * *** * ***** *** * * *** ** **** ***
state 30: ** *** * * *** ** * ** *** ** * ** *** * ** **
state 31: * ** * *** * ** ** * * ***** *** * * ** * *

8.2.5 Linear congruential generator

This is a popular PRNG[70] from the OpenWatcom CRT[71] library: https://github.com/open-watcom/open-watcom-v2/blob/d468b609ba6ca61eeddad80dd2485e3256fc5261/bld/clib/math/c/rand.c.

What expression does it generate on each step?
#!/usr/bin/env python

class Expr:
    def __init__(self,s):
        self.s=s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __str__(self):
        return self.s
    def __xor__(self, other):
        return Expr("(" + self.s + "^" + self.convert_to_Expr_if_int(other).s + ")")
    def __mul__(self, other):
        return Expr("(" + self.s + "*" + self.convert_to_Expr_if_int(other).s + ")")
    def __add__(self, other):
        return Expr("(" + self.s + "+" + self.convert_to_Expr_if_int(other).s + ")")
    def __and__(self, other):
        return Expr("(" + self.s + "&" + self.convert_to_Expr_if_int(other).s + ")")
    def __rshift__(self, other):
        return Expr("(" + self.s + ">>" + self.convert_to_Expr_if_int(other).s + ")")

seed=Expr("initial_seed")

def rand():
    global seed
    seed=seed*1103515245+12345
    return (seed>>16) & 0x7fff

for i in range(10):
    print i, ":", rand()

[70] Pseudorandom number generator
[71] C runtime library

0 : ((((initial_seed*1103515245)+12345)>>16)&32767)
1 : ((((((initial_seed*1103515245)+12345)*1103515245)+12345)>>16)&32767)
2 : ((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
3 : ((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
4 : ((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
5 : ((((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
6 : ((((((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
7 : ((((((((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
8 :
((((((((((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)
9 : ((((((((((((((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)

Now, if we once got several values from this PRNG, like 4583, 16304, 14440, 32315, 28670, 12568..., how would we recover the initial seed? The problem, in fact, is solving a system of equations:

((((initial_seed*1103515245)+12345)>>16)&32767)==4583
((((((initial_seed*1103515245)+12345)*1103515245)+12345)>>16)&32767)==16304
((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)==14440
((((((((((initial_seed*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)*1103515245)+12345)>>16)&32767)==32315

As it turns out, Z3 can solve this system correctly using only the first two equations:

#!/usr/bin/env python
from z3 import *

s=Solver()
x=BitVec("x",32)
a=1103515245
c=12345
s.add((((x*a)+c)>>16)&32767==4583)
s.add((((((x*a)+c)*a)+c)>>16)&32767==16304)
#s.add((((((((x*a)+c)*a)+c)*a)+c)>>16)&32767==14440)
#s.add((((((((((x*a)+c)*a)+c)*a)+c)*a)+c)>>16)&32767==32315)
s.check()
print s.model()

[x = 11223344]

(Though it takes about 20 seconds on my ancient Intel Atom netbook.)

8.2.6 Path constraint

How do you get the weekday from a UNIX timestamp?

#!/usr/bin/env python

input=...

SECS_DAY=24*60*60

dayno = input / SECS_DAY
wday = (dayno + 4) % 7

if wday==5:
    print "Thanks God, it's Friday!"

Let's say we should find a way to run the block with the print call in it. What should the input value be?
First,let’sbuildexpressionof wdayvariable: #!/usr/bin/env python class Expr: def __init__(self,s): self.s=s def convert_to_Expr_if_int(self, n): if isinstance(n, int): return Expr(str(n)) if isinstance(n, Expr): return n raise AssertionError # unsupported type def __str__(self): return self.s 71 def __div__(self, other): return Expr("(" + self.s + "/" + self.convert_to_Expr_if_int(other).s + ")") def __mod__(self, other): return Expr("(" + self.s + "%" + self.convert_to_Expr_if_int(other).s + ")") def __add__(self, other): return Expr("(" + self.s + "+" + self.convert_to_Expr_if_int(other).s + ")") input=Expr("input") SECS_DAY=24*60*60 dayno = input / SECS_DAY wday = (dayno + 4) % 7 print wday if wday==5: print "Thanks God, it's Friday!" (((input/86400)+4)%7) Inordertoexecutetheblock,weshouldsolvethisequation: ((input 86400+ 4)5mod 7. Sofar,thisiseasytaskforZ3: #!/usr/bin/env python from z3 import * s=Solver() x=Int("x") s.add(((x/86400)+4)%7==5) s.check() print s.model() [x = 86438] ThisisindeedcorrectUNIXtimestampforFriday: % date --date='@86438' Fri Jan 2 03:00:38 MSK 1970 Thoughthedatebackinyear1970,butit’sstillcorrect! This is also called “path constraint”, i.e., what constraint must be satisified to execute specific block? Severaltoolshas“path”intheirnames,like“pathgrind”, SymbolicPathFinder ,CodeSurferPathInspector, etc. Liketheshellgame,thistaskisalsooftenencountersinpractice. Youcanseethatsomethingdangerous canbeexecutedinsidesomebasicblockandyou’retryingtodeduce,whatinputvaluescancauseexecution ofit. Itmaybebufferoverflow,etc. Suchinputvaluesaresometimesalsocalled“inputsofdeath”. Manycrackmesaresolvedinthisway,allyouneedisfindapathintoblockwhichprints“keyiscorrect” orsomethinglikethat. Wecanextendthistinyexample: input=... SECS_DAY=24*60*60 dayno = input / SECS_DAY wday = (dayno + 4) % 7 print wday if wday==5: print "Thanks God, it's Friday!" 
else:
    print "Got to wait a little"

Now we have two blocks. For the first we should solve this equation: ((input / 86400) + 4) ≡ 5 (mod 7). But for the second we should solve the inverted equation: ((input / 86400) + 4) ≢ 5 (mod 7). By solving these equations, we will find two paths into both blocks.

KLEE (or a similar tool) tries to find a path to each [basic] block and produces an "ideal" unit test. Hence, KLEE can find a path into the block which crashes everything, or report about the correctness of the input key/license, etc. Surprisingly, KLEE can find backdoors in the very same manner.

KLEE is also called the "KLEE Symbolic Virtual Machine": by that, its creators mean that KLEE is a VM[72] which executes code symbolically rather than numerically (like a usual CPU).

8.2.7 Division by zero

If a division by zero is not guarded by a sanitizing check, and the exception isn't caught, it can crash the process.

Let's calculate the simple expression x / (2y + 4z - 12). We can add a warning into the __div__ method:

#!/usr/bin/env python

class Expr:
    def __init__(self,s):
        self.s=s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __str__(self):
        return self.s
    def __mul__(self, other):
        return Expr("(" + self.s + "*" + self.convert_to_Expr_if_int(other).s + ")")
    def __div__(self, other):
        op2=self.convert_to_Expr_if_int(other).s
        print "warning: division by zero if "+op2+"==0"
        return Expr("(" + self.s + "/" + op2 + ")")
    def __add__(self, other):
        return Expr("(" + self.s + "+" + self.convert_to_Expr_if_int(other).s + ")")
    def __sub__(self, other):
        return Expr("(" + self.s + "-" + self.convert_to_Expr_if_int(other).s + ")")

"""
      x
------------
2y + 4z - 12
"""
def f(x, y, z):
    return x/(y*2 + z*4 - 12)

print f(Expr("x"), Expr("y"), Expr("z"))

...so it will report about dangerous states and conditions:

warning: division by zero if (((y*2)+(z*4))-12)==0
(x/(((y*2)+(z*4))-12))

This equation is easy to solve; let's try Wolfram Mathematica this time:

In[]:= FindInstance[{(y*2 + z*4) - 12 == 0}, {y, z}, Integers]
Out[]= {{y -> 0, z -> 3}}

These values for
y and z can also be called "inputs of death".

8.2.8 Merge sort

How does merge sort work? I have copypasted the Python code from rosettacode.org almost intact:

#!/usr/bin/env python

class Expr:
    def __init__(self,s,i):
        self.s=s
        self.i=i
    def __str__(self):
        # return both symbolic and integer:
        return self.s+" (" + str(self.i)+")"
    def __le__(self, other):
        # compare only integer parts:
        return self.i <= other.i

# copypasted from http://rosettacode.org/wiki/Sorting_algorithms/Merge_sort#Python
def merge(left, right):
    result = []
    left_idx, right_idx = 0, 0
    while left_idx < len(left) and right_idx < len(right):
        # change the direction of this comparison to change the direction of the sort
        if left[left_idx] <= right[right_idx]:
            result.append(left[left_idx])
            left_idx += 1
        else:
            result.append(right[right_idx])
            right_idx += 1
    if left_idx < len(left):
        result.extend(left[left_idx:])
    if right_idx < len(right):
        result.extend(right[right_idx:])
    return result

def tabs (t):
    return "\t"*t

def merge_sort(m, indent=0):
    print tabs(indent)+"merge_sort() begin. input:"
    for i in m:
        print tabs(indent)+str(i)
    if len(m) <= 1:
        print tabs(indent)+"merge_sort() end. returning single element"
        return m
    middle = len(m) // 2
    left = m[:middle]
    right = m[middle:]
    left = merge_sort(left, indent+1)
    right = merge_sort(right, indent+1)
    rt=list(merge(left, right))
    print tabs(indent)+"merge_sort() end. returning:"
    for i in rt:
        print tabs(indent)+str(i)
    return rt

# input buffer has both symbolic and numerical values:
input=[Expr("input1",22), Expr("input2",7), Expr("input3",2), Expr("input4",1), Expr("input5",8), Expr("input6",4)]
merge_sort(input)

[72] Virtual Machine

The crucial part is the function which compares elements; obviously, the sort wouldn't work correctly without it. So we track both the expression for each element and its numerical value. Both are printed at the end, but whenever values are to be compared, only the numerical parts are used. Result:

merge_sort() begin. input:
input1 (22)
input2 (7)
input3 (2)
input4 (1)
input5 (8)
input6 (4)
merge_sort() begin.
input:
    input1 (22)
    input2 (7)
    input3 (2)
        merge_sort() begin. input:
        input1 (22)
        merge_sort() end. returning single element
        merge_sort() begin. input:
        input2 (7)
        input3 (2)
            merge_sort() begin. input:
            input2 (7)
            merge_sort() end. returning single element
            merge_sort() begin. input:
            input3 (2)
            merge_sort() end. returning single element
        merge_sort() end. returning:
        input3 (2)
        input2 (7)
    merge_sort() end. returning:
    input3 (2)
    input2 (7)
    input1 (22)
    merge_sort() begin. input:
    input4 (1)
    input5 (8)
    input6 (4)
        merge_sort() begin. input:
        input4 (1)
        merge_sort() end. returning single element
        merge_sort() begin. input:
        input5 (8)
        input6 (4)
            merge_sort() begin. input:
            input5 (8)
            merge_sort() end. returning single element
            merge_sort() begin. input:
            input6 (4)
            merge_sort() end. returning single element
        merge_sort() end. returning:
        input6 (4)
        input5 (8)
    merge_sort() end. returning:
    input4 (1)
    input6 (4)
    input5 (8)
merge_sort() end. returning:
input4 (1)
input3 (2)
input6 (4)
input2 (7)
input5 (8)
input1 (22)

8.2.9 Extending the Expr class

This is somewhat pointless; nevertheless, it's an easy task to extend my Expr class to support an AST instead of plain strings. It's also possible to add folding steps (like I demonstrated in the Toy Decompiler: 7). Maybe someone will want to do this as an exercise. By the way, the toy decompiler can be used as a simple symbolic engine as well: just feed all the instructions to it, and it will track the contents of each register.

8.2.10 Conclusion

For the sake of demonstration, I made things as simple as possible. But reality is always harsh and inconvenient, so all this shouldn't be taken as a silver bullet.

The files used in this part: https://github.com/dennis714/SAT_SMT_article/tree/master/symbolic .

8.3 Further reading

James C. King, Symbolic Execution and Program Testing: https://yurichev.com/mirrors/king76symbolicexecution.pdf

9 KLEE

9.1 Installation

Building KLEE from source is tricky. The easiest way to use KLEE is to install docker (https://docs.docker.com/engine/installation/linux/ubuntulinux/) and then to run the KLEE docker image (http://klee.github.io/docker/).

9.2 School-level equation

Let's revisit the school-level system of equations from (5.2).
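Before handing the system to KLEE, it can be cross-checked by plain brute force. The three constraints below are the ones the KLEE example that follows encodes; the 0..9 search range is my assumption (a sketch, not part of the original text):

```python
from itertools import product

# The same three constraints the KLEE example encodes:
#   circle + circle == 10
#   circle*square + square == 12
#   circle*square - triangle*circle == circle
solutions = [(c, s, t) for c, s, t in product(range(10), repeat=3)
             if c + c == 10 and c*s + s == 12 and c*s - t*c == c]
print(solutions)  # [(5, 2, 1)]: circle=5, square=2, triangle=1
```

Within 0..9 the solution is unique, which is consistent with KLEE generating a single test.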
We will force KLEE to find a path where all the constraints are satisfied:

int main()
{
    int circle, square, triangle;

    klee_make_symbolic(&circle, sizeof circle, "circle");
    klee_make_symbolic(&square, sizeof square, "square");
    klee_make_symbolic(&triangle, sizeof triangle, "triangle");

    if (circle+circle!=10) return 0;
    if (circle*square+square!=12) return 0;
    if (circle*square-triangle*circle!=circle) return 0;

    // all constraints should be satisfied at this point
    // force KLEE to produce .err file:
    klee_assert(0);
};

% clang -emit-llvm -c -g klee_eq.c
...
% klee klee_eq.bc
KLEE: output directory is "/home/klee/klee-out-93"
KLEE: WARNING: undefined reference to function: klee_assert
KLEE: WARNING ONCE: calling external: klee_assert(0)
KLEE: ERROR: /home/klee/klee_eq.c:18: failed external call: klee_assert
KLEE: NOTE: now ignoring this error at this location
KLEE: done: total instructions = 32
KLEE: done: completed paths = 1
KLEE: done: generated tests = 1

Let's find out where klee_assert() has been triggered:

% ls klee-last | grep err
test000001.external.err

% ktest-tool --write-ints klee-last/test000001.ktest
ktest file : 'klee-last/test000001.ktest'
args       : ['klee_eq.bc']
num objects: 3
object 0: name: b'circle'
object 0: size: 4
object 0: data: 5
object 1: name: b'square'
object 1: size: 4
object 1: data: 2
object 2: name: b'triangle'
object 2: size: 4
object 2: data: 1

This is indeed the correct solution to the system of equations.

KLEE has an intrinsic klee_assume() which tells KLEE to cut the path if some constraint is not satisfied.
So we can rewrite our example in this cleaner way:

int main()
{
    int circle, square, triangle;

    klee_make_symbolic(&circle, sizeof circle, "circle");
    klee_make_symbolic(&square, sizeof square, "square");
    klee_make_symbolic(&triangle, sizeof triangle, "triangle");

    klee_assume (circle+circle==10);
    klee_assume (circle*square+square==12);
    klee_assume (circle*square-triangle*circle==circle);

    // all constraints should be satisfied at this point
    // force KLEE to produce .err file:
    klee_assert(0);
};

9.3 Zebra puzzle

Let's revisit the zebra puzzle from (5.4). We just define all the variables and add constraints:

int main()
{
    int Yellow, Blue, Red, Ivory, Green;
    int Norwegian, Ukrainian, Englishman, Spaniard, Japanese;
    int Water, Tea, Milk, OrangeJuice, Coffee;
    int Kools, Chesterfield, OldGold, LuckyStrike, Parliament;
    int Fox, Horse, Snails, Dog, Zebra;

    klee_make_symbolic(&Yellow, sizeof(int), "Yellow");
    klee_make_symbolic(&Blue, sizeof(int), "Blue");
    klee_make_symbolic(&Red, sizeof(int), "Red");
    klee_make_symbolic(&Ivory, sizeof(int), "Ivory");
    klee_make_symbolic(&Green, sizeof(int), "Green");
    klee_make_symbolic(&Norwegian, sizeof(int), "Norwegian");
    klee_make_symbolic(&Ukrainian, sizeof(int), "Ukrainian");
    klee_make_symbolic(&Englishman, sizeof(int), "Englishman");
    klee_make_symbolic(&Spaniard, sizeof(int), "Spaniard");
    klee_make_symbolic(&Japanese, sizeof(int), "Japanese");
    klee_make_symbolic(&Water, sizeof(int), "Water");
    klee_make_symbolic(&Tea, sizeof(int), "Tea");
    klee_make_symbolic(&Milk, sizeof(int), "Milk");
    klee_make_symbolic(&OrangeJuice, sizeof(int), "OrangeJuice");
    klee_make_symbolic(&Coffee, sizeof(int), "Coffee");
    klee_make_symbolic(&Kools, sizeof(int), "Kools");
    klee_make_symbolic(&Chesterfield, sizeof(int), "Chesterfield");
    klee_make_symbolic(&OldGold, sizeof(int), "OldGold");
    klee_make_symbolic(&LuckyStrike, sizeof(int), "LuckyStrike");
    klee_make_symbolic(&Parliament, sizeof(int),
"Parliament"); klee_make_symbolic(&Fox, sizeof(int), "Fox"); klee_make_symbolic(&Horse, sizeof(int), "Horse"); klee_make_symbolic(&Snails, sizeof(int), "Snails"); klee_make_symbolic(&Dog, sizeof(int), "Dog"); klee_make_symbolic(&Zebra, sizeof(int), "Zebra"); // limits. if (Yellow<1 || Yellow>5) return 0; if (Blue<1 || Blue>5) return 0; if (Red<1 || Red>5) return 0; if (Ivory<1 || Ivory>5) return 0; if (Green<1 || Green>5) return 0; if (Norwegian<1 || Norwegian>5) return 0; if (Ukrainian<1 || Ukrainian>5) return 0; if (Englishman<1 || Englishman>5) return 0; if (Spaniard<1 || Spaniard>5) return 0; if (Japanese<1 || Japanese>5) return 0; if (Water<1 || Water>5) return 0; if (Tea<1 || Tea>5) return 0; 77 if (Milk<1 || Milk>5) return 0; if (OrangeJuice<1 || OrangeJuice>5) return 0; if (Coffee<1 || Coffee>5) return 0; if (Kools<1 || Kools>5) return 0; if (Chesterfield<1 || Chesterfield>5) return 0; if (OldGold<1 || OldGold>5) return 0; if (LuckyStrike<1 || LuckyStrike>5) return 0; if (Parliament<1 || Parliament>5) return 0; if (Fox<1 || Fox>5) return 0; if (Horse<1 || Horse>5) return 0; if (Snails<1 || Snails>5) return 0; if (Dog<1 || Dog>5) return 0; if (Zebra<1 || Zebra>5) return 0; // colors are distinct for all 5 houses: if (((1<=1 && Yellow<=5); klee_assume (Blue>=1 && Blue<=5); klee_assume (Red>=1 && Red<=5); klee_assume (Ivory>=1 && Ivory<=5); klee_assume (Green>=1 && Green<=5); klee_assume (Norwegian>=1 && Norwegian<=5); klee_assume (Ukrainian>=1 && Ukrainian<=5); klee_assume (Englishman>=1 && Englishman<=5); klee_assume (Spaniard>=1 && Spaniard<=5); klee_assume (Japanese>=1 && Japanese<=5); klee_assume (Water>=1 && Water<=5); klee_assume (Tea>=1 && Tea<=5); klee_assume (Milk>=1 && Milk<=5); klee_assume (OrangeJuice>=1 && OrangeJuice<=5); klee_assume (Coffee>=1 && Coffee<=5); klee_assume (Kools>=1 && Kools<=5); klee_assume (Chesterfield>=1 && Chesterfield<=5); klee_assume (OldGold>=1 && OldGold<=5); klee_assume (LuckyStrike>=1 && LuckyStrike<=5); klee_assume 
(Parliament>=1 && Parliament<=5);
    klee_assume (Fox>=1 && Fox<=5);
    klee_assume (Horse>=1 && Horse<=5);
    klee_assume (Snails>=1 && Snails<=5);
    klee_assume (Dog>=1 && Dog<=5);
    klee_assume (Zebra>=1 && Zebra<=5);

    // colors are distinct for all 5 houses:
    klee_assume (((1<<Yellow) | (1<<Blue) | (1<<Red) | (1<<Ivory) | (1<<Green))==0x3E);

    ...

9.4 Sudoku

#include <stdint.h>

/*
coordinates:
------------------------------
00 01 02 | 03 04 05 | 06 07 08
10 11 12 | 13 14 15 | 16 17 18
20 21 22 | 23 24 25 | 26 27 28
------------------------------
30 31 32 | 33 34 35 | 36 37 38
40 41 42 | 43 44 45 | 46 47 48
50 51 52 | 53 54 55 | 56 57 58
------------------------------
60 61 62 | 63 64 65 | 66 67 68
70 71 72 | 73 74 75 | 76 77 78
80 81 82 | 83 84 85 | 86 87 88
------------------------------
*/

uint8_t cells[9][9];

// http://www.norvig.com/sudoku.html
// http://www.mirror.co.uk/news/weird-news/worlds-hardest-sudoku-can-you-242294
char *puzzle="..53.....8......2..7..1.5..4....53...1..7...6..32...8..6.5....9..4....3......97..";

int main()
{
    klee_make_symbolic(cells, sizeof cells, "cells");

    // process text line:
    for (int row=0; row<9; row++)
        for (int column=0; column<9; column++)
        {
            char c=puzzle[row*9 + column];
            if (c!='.')
            {
                if (cells[row][column]!=c-'0') return 0;
            }
            else
            {
                // limit cells values to 1..9:
                if (cells[row][column]<1) return 0;
                if (cells[row][column]>9) return 0;
            };
        };

    // for all 9 rows
    for (int row=0; row<9; row++)
    {
        if (((1<<cells[row][0]) | (1<<cells[row][1]) | (1<<cells[row][2]) |
             (1<<cells[row][3]) | (1<<cells[row][4]) | (1<<cells[row][5]) |
             (1<<cells[row][6]) | (1<<cells[row][7]) | (1<<cells[row][8]))!=0x3FE) return 0;

    ...

The same can be done with klee_assume():

/*
coordinates:
------------------------------
00 01 02 | 03 04 05 | 06 07 08
10 11 12 | 13 14 15 | 16 17 18
20 21 22 | 23 24 25 | 26 27 28
------------------------------
30 31 32 | 33 34 35 | 36 37 38
40 41 42 | 43 44 45 | 46 47 48
50 51 52 | 53 54 55 | 56 57 58
------------------------------
60 61 62 | 63 64 65 | 66 67 68
70 71 72 | 73 74 75 | 76 77 78
80 81 82 | 83 84 85 | 86 87 88
------------------------------
*/

uint8_t cells[9][9];

// http://www.norvig.com/sudoku.html
// 
http://www.mirror.co.uk/news/weird-news/worlds-hardest-sudoku-can-you-242294
char *puzzle="..53.....8......2..7..1.5..4....53...1..7...6..32...8..6.5....9..4....3......97..";

int main()
{
    klee_make_symbolic(cells, sizeof cells, "cells");

    // process text line:
    for (int row=0; row<9; row++)
        for (int column=0; column<9; column++)
        {
            char c=puzzle[row*9 + column];
            if (c!='.')
                klee_assume (cells[row][column]==c-'0');
            else
            {
                klee_assume (cells[row][column]>=1);
                klee_assume (cells[row][column]<=9);
            };
        };

    // for all 9 rows
    for (int row=0; row<9; row++)
    {
        klee_assume (((1<<cells[row][0]) | (1<<cells[row][1]) | (1<<cells[row][2]) |
                      (1<<cells[row][3]) | (1<<cells[row][4]) | (1<<cells[row][5]) |
                      (1<<cells[row][6]) | (1<<cells[row][7]) | (1<<cells[row][8]))==0x3FE);

    ...

9.5 Unit test: HTML color

#include <stdio.h>
#include <string.h>
#include <stdint.h>

void HTML_color(uint8_t R, uint8_t G, uint8_t B, char* out)
{
    if (R==0xFF && G==0 && B==0)
    {
        strcpy (out, "red"); return;
    };
    if (R==0x0 && G==0xFF && B==0)
    {
        strcpy (out, "green"); return;
    };
    if (R==0 && G==0 && B==0xFF)
    {
        strcpy (out, "blue"); return;
    };

    // abbreviated hexadecimal
    if (R>>4==(R&0xF) && G>>4==(G&0xF) && B>>4==(B&0xF))
    {
        sprintf (out, "#%X%X%X", R&0xF, G&0xF, B&0xF); return;
    };

    // last resort
    sprintf (out, "#%02X%02X%02X", R, G, B);
};

int main()
{
    uint8_t R, G, B;

    klee_make_symbolic (&R, sizeof R, "R");
    klee_make_symbolic (&G, sizeof R, "G");
    klee_make_symbolic (&B, sizeof R, "B");

    char tmp[16];

    HTML_color(R, G, B, tmp);
};

There are 5 possible paths in this function; let's see if KLEE can find them all.
It’sindeedso: % clang -emit-llvm -c -g color.c % klee color.bc KLEE: output directory is "/home/klee/klee-out-134" KLEE: WARNING: undefined reference to function: sprintf KLEE: WARNING: undefined reference to function: strcpy KLEE: WARNING ONCE: calling external: strcpy(51867584, 51598960) KLEE: ERROR: /home/klee/color.c:33: external call with symbolic argument: sprintf KLEE: NOTE: now ignoring this error at this location KLEE: ERROR: /home/klee/color.c:28: external call with symbolic argument: sprintf KLEE: NOTE: now ignoring this error at this location KLEE: done: total instructions = 479 KLEE: done: completed paths = 19 KLEE: done: generated tests = 5 Wecanignorecallstostrcpy()andsprintf(),becausewearenotreallyinterestinginstateof outvariable. Sothereareexactly5paths: % ls klee-last assembly.ll run.stats test000003.ktest test000005.ktest info test000001.ktest test000003.pc test000005.pc messages.txt test000002.ktest test000004.ktest warnings.txt run.istats test000003.exec.err test000005.exec.err 1stsetofinputvariableswillresultin“red”string: % ktest-tool --write-ints klee-last/test000001.ktest ktest file : 'klee-last/test000001.ktest' args : ['color.bc'] num objects: 3 object 0: name: b'R' object 0: size: 1 object 0: data: b'\xff' object 1: name: b'G' object 1: size: 1 object 1: data: b'\x00' object 2: name: b'B' object 2: size: 1 object 2: data: b'\x00' 2ndsetofinputvariableswillresultin“green”string: % ktest-tool --write-ints klee-last/test000002.ktest ktest file : 'klee-last/test000002.ktest' args : ['color.bc'] num objects: 3 object 0: name: b'R' object 0: size: 1 object 0: data: b'\x00' object 1: name: b'G' object 1: size: 1 object 1: data: b'\xff' object 2: name: b'B' object 2: size: 1 object 2: data: b'\x00' 3rdsetofinputvariableswillresultin“#010000”string: % ktest-tool --write-ints klee-last/test000003.ktest ktest file : 'klee-last/test000003.ktest' args : ['color.bc'] 86 num objects: 3 object 0: name: b'R' object 0: size: 1 object 0: data: b'\x01' 
object 1: name: b'G'
object 1: size: 1
object 1: data: b'\x00'
object 2: name: b'B'
object 2: size: 1
object 2: data: b'\x00'

The 4th set of input variables results in the "blue" string:

% ktest-tool --write-ints klee-last/test000004.ktest
ktest file : 'klee-last/test000004.ktest'
args       : ['color.bc']
num objects: 3
object 0: name: b'R'
object 0: size: 1
object 0: data: b'\x00'
object 1: name: b'G'
object 1: size: 1
object 1: data: b'\x00'
object 2: name: b'B'
object 2: size: 1
object 2: data: b'\xff'

The 5th set of input variables results in the "#F01" string:

% ktest-tool --write-ints klee-last/test000005.ktest
ktest file : 'klee-last/test000005.ktest'
args       : ['color.bc']
num objects: 3
object 0: name: b'R'
object 0: size: 1
object 0: data: b'\xff'
object 1: name: b'G'
object 1: size: 1
object 1: data: b'\x00'
object 2: name: b'B'
object 2: size: 1
object 2: data: b'\x11'

These 5 sets of input variables can form a unit test for our function.

9.6 Unit test: strcmp() function

The standard strcmp() function from the C library can return 0, -1 or 1, depending on the comparison result. Here is my own implementation of strcmp():

int my_strcmp(const char *s1, const char *s2)
{
    int ret = 0;

    while (1)
    {
        ret = *(unsigned char *) s1 - *(unsigned char *) s2;
        if (ret!=0)
            break;
        if ((*s1==0) || (*s2)==0)
            break;
        s1++; s2++;
    };

    if (ret < 0)
        return -1;
    else if (ret > 0)
        return 1;
    return 0;
}

int main()
{
    char input1[2];
    char input2[2];

    klee_make_symbolic(input1, sizeof input1, "input1");
    klee_make_symbolic(input2, sizeof input2, "input2");

    klee_assume((input1[0]>='a') && (input1[0]<='z'));
    klee_assume((input2[0]>='a') && (input2[0]<='z'));
    klee_assume(input1[1]==0);
    klee_assume(input2[1]==0);

    my_strcmp (input1, input2);
};

Let's find out if KLEE is capable of finding all three paths. I intentionally made things simpler for KLEE by limiting the input arrays to 2 bytes each, i.e., 1 character plus a terminating zero byte.
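As a side check (mine, not from the original), the expected outcomes over these constrained inputs can be enumerated directly; for one-character strings the comparison reduces to a single byte subtraction:

```python
import string

# Model of my_strcmp() restricted to the same inputs KLEE gets:
# one character in 'a'..'z' plus a terminating zero byte.
def my_strcmp_1char(a, b):
    d = ord(a) - ord(b)       # same subtraction as in the C code
    return (d > 0) - (d < 0)  # normalize to -1/0/1

outcomes = {my_strcmp_1char(a, b)
            for a in string.ascii_lowercase
            for b in string.ascii_lowercase}
print(sorted(outcomes))  # [-1, 0, 1] -- the three paths KLEE has to cover
```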
% clang -emit-llvm -c -g strcmp.c
% klee strcmp.bc
KLEE: output directory is "/home/klee/klee-out-131"
KLEE: ERROR: /home/klee/strcmp.c:35: invalid klee_assume call (provably false)
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/strcmp.c:36: invalid klee_assume call (provably false)
KLEE: NOTE: now ignoring this error at this location
KLEE: done: total instructions = 137
KLEE: done: completed paths = 5
KLEE: done: generated tests = 5

% ls klee-last
assembly.ll   run.stats            test000002.ktest     test000004.ktest
info          test000001.ktest     test000002.pc        test000005.ktest
messages.txt  test000001.pc        test000002.user.err  warnings.txt
run.istats    test000001.user.err  test000003.ktest

The first two errors are about klee_assume(): these are the input values on which the klee_assume() calls got stuck. We can ignore them, or take a peek out of curiosity:

% ktest-tool --write-ints klee-last/test000001.ktest
ktest file : 'klee-last/test000001.ktest'
args       : ['strcmp.bc']
num objects: 2
object 0: name: b'input1'
object 0: size: 2
object 0: data: b'\x00\x00'
object 1: name: b'input2'
object 1: size: 2
object 1: data: b'\x00\x00'

% ktest-tool --write-ints klee-last/test000002.ktest
ktest file : 'klee-last/test000002.ktest'
args       : ['strcmp.bc']
num objects: 2
object 0: name: b'input1'
object 0: size: 2
object 0: data: b'a\xff'
object 1: name: b'input2'
object 1: size: 2
object 1: data: b'\x00\x00'

The three remaining files hold the input values for each path inside my implementation of strcmp():

% ktest-tool --write-ints klee-last/test000003.ktest
ktest file : 'klee-last/test000003.ktest'
args       : ['strcmp.bc']
num objects: 2
object 0: name: b'input1'
object 0: size: 2
object 0: data: b'b\x00'
object 1: name: b'input2'
object 1: size: 2
object 1: data: b'c\x00'

% ktest-tool --write-ints klee-last/test000004.ktest
ktest file : 'klee-last/test000004.ktest'
args       : ['strcmp.bc']
num objects: 2
object 0: name: b'input1'
object 0: size: 2
object 0: data: b'c\x00'
object 1: name: b'input2'
object 1: size: 2
object 1: data:
b'a\x00'

% ktest-tool --write-ints klee-last/test000005.ktest
ktest file : 'klee-last/test000005.ktest'
args       : ['strcmp.bc']
num objects: 2
object 0: name: b'input1'
object 0: size: 2
object 0: data: b'a\x00'
object 1: name: b'input2'
object 1: size: 2
object 1: data: b'a\x00'

The 3rd is the case where the first argument ("b") is lesser than the second ("c"). The 4th is the opposite case ("c" and "a"). The 5th is the case where they are equal ("a" and "a"). Using these 3 test cases, we get full coverage of our implementation of strcmp().

9.7 UNIX date/time

UNIX date/time (https://en.wikipedia.org/wiki/Unix_time) is the number of seconds that have elapsed since 1-Jan-1970 00:00 UTC. The C/C++ gmtime() function is used to decode this value into a human-readable date/time.

Here is a piece of code I've copypasted from some ancient version of Minix OS (http://www.cise.ufl.edu/~cop4600/cgi-bin/lxr/http/source.cgi/lib/ansi/gmtime.c) and reworked slightly:

1   #include <stdio.h>
2   #include <stdint.h>
3   #include <assert.h>
4
5   /*
6    * copypasted and reworked from
7    * http://www.cise.ufl.edu/~cop4600/cgi-bin/lxr/http/source.cgi/lib/ansi/loc_time.h
8    * http://www.cise.ufl.edu/~cop4600/cgi-bin/lxr/http/source.cgi/lib/ansi/misc.c
9    * http://www.cise.ufl.edu/~cop4600/cgi-bin/lxr/http/source.cgi/lib/ansi/gmtime.c
10   */
11
12  #define YEAR0 1900
13  #define EPOCH_YR 1970
14  #define SECS_DAY (24L * 60L * 60L)
15  #define YEARSIZE(year) (LEAPYEAR(year) ?
366 : 365)
16
17  const int _ytab[2][12] =
18  {
19      { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 },
20      { 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 }
21  };
22
23  const char *_days[] =
24  {
25      "Sunday", "Monday", "Tuesday", "Wednesday",
26      "Thursday", "Friday", "Saturday"
27  };
28
29  const char *_months[] =
30  {
31      "January", "February", "March",
32      "April", "May", "June",
33      "July", "August", "September",
34      "October", "November", "December"
35  };
36
37  #define LEAPYEAR(year) (!((year) % 4) && (((year) % 100) || !((year) % 400)))
38
39  void decode_UNIX_time(const time_t time)
40  {
41      unsigned int dayclock, dayno;
42      int year = EPOCH_YR;
43
44      dayclock = (unsigned long)time % SECS_DAY;
45      dayno = (unsigned long)time / SECS_DAY;
46
47      int seconds = dayclock % 60;
48      int minutes = (dayclock % 3600) / 60;
49      int hour = dayclock / 3600;
50      int wday = (dayno + 4) % 7;
51      while (dayno >= YEARSIZE(year))
52      {
53          dayno -= YEARSIZE(year);
54          year++;
55      }
56
57      year = year - YEAR0;
58
59      int month = 0;
60
61      while (dayno >= _ytab[LEAPYEAR(year)][month])
62      {
63          dayno -= _ytab[LEAPYEAR(year)][month];
64          month++;
65      }
66
67      char *s;
68      switch (month)
69      {
70      case 0: s="January"; break;
71      case 1: s="February"; break;
72      case 2: s="March"; break;
73      case 3: s="April"; break;
74      case 4: s="May"; break;
75      case 5: s="June"; break;
76      case 6: s="July"; break;
77      case 7: s="August"; break;
78      case 8: s="September"; break;
79      case 9: s="October"; break;
80      case 10: s="November"; break;
81      case 11: s="December"; break;
82      default:
83          assert(0);
84      };
85
86      printf ("%04d-%s-%02d %02d:%02d:%02d\n", YEAR0+year, s, dayno+1, hour, minutes, seconds);
87      printf ("week day: %s\n", _days[wday]);
88  }
89
90  int main()
91  {
92      uint32_t time;
93
94      klee_make_symbolic(&time, sizeof time, "time");
95
96      decode_UNIX_time(time);
97
98      return 0;
99  }

Let's try it:

% clang -emit-llvm -c -g klee_time1.c
...
% klee klee_time1.bc
KLEE: output directory is "/home/klee/klee-out-107"
KLEE: WARNING: undefined reference to function: printf
KLEE: ERROR: /home/klee/klee_time1.c:86: external call with symbolic argument: printf
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/klee_time1.c:83: ASSERTION FAIL: 0
KLEE: NOTE: now ignoring this error at this location
KLEE: done: total instructions = 101579
KLEE: done: completed paths = 1635
KLEE: done: generated tests = 2

Wow, the assert() at line 83 has been triggered. Why? Let's see the UNIX time value which triggers it:

% ls klee-last | grep err
test000001.exec.err
test000002.assert.err

% ktest-tool --write-ints klee-last/test000002.ktest
ktest file : 'klee-last/test000002.ktest'
args       : ['klee_time1.bc']
num objects: 1
object 0: name: b'time'
object 0: size: 4
object 0: data: 978278400

Let's decode this value using the UNIX date utility:

% date -u --date='@978278400'
Sun Dec 31 16:00:00 UTC 2000

After some investigation, I found that the month variable can hold the incorrect value of 12 (while 11 is the maximum, for December), because the LEAPYEAR() macro should receive the year number as 2000, not as 100. So I introduced a bug while reworking this function, and KLEE found it!

Just out of interest: what happens if I replace the switch() with an array of strings, as it usually appears in concise C/C++ code?

...
const char *_months[] =
{
    "January", "February", "March",
    "April", "May", "June",
    "July", "August", "September",
    "October", "November", "December"
};
...
    while (dayno >= _ytab[LEAPYEAR(year)][month])
    {
        dayno -= _ytab[LEAPYEAR(year)][month];
        month++;
    }

    char *s=_months[month];

    printf ("%04d-%s-%02d %02d:%02d:%02d\n", YEAR0+year, s, dayno+1, hour, minutes, seconds);
    printf ("week day: %s\n", _days[wday]);
...
KLEE detects an attempt to read beyond the array boundaries:

% klee klee_time2.bc
KLEE: output directory is "/home/klee/klee-out-108"
KLEE: WARNING: undefined reference to function: printf
KLEE: ERROR: /home/klee/klee_time2.c:69: external call with symbolic argument: printf
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/klee_time2.c:67: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location
KLEE: done: total instructions = 101716
KLEE: done: completed paths = 1635
KLEE: done: generated tests = 2

This is the same UNIX time value we've already seen:

% ls klee-last | grep err
test000001.exec.err
test000002.ptr.err

% ktest-tool --write-ints klee-last/test000002.ktest
ktest file : 'klee-last/test000002.ktest'
args       : ['klee_time2.bc']
num objects: 1
object 0: name: b'time'
object 0: size: 4
object 0: data: 978278400

So, if this piece of code can be triggered on a remote computer with this input value (an input of death), it's possible to crash the process (with some luck, though).

OK, now I'm fixing the bug by moving the year-subtracting expression down to line 43, and let's find out what UNIX time value corresponds to some fancy date like 2022-February-22:

1   #include <stdio.h>
2   #include <stdint.h>
3   #include <time.h>
4
5   #define YEAR0 1900
6   #define EPOCH_YR 1970
7   #define SECS_DAY (24L * 60L * 60L)
8   #define YEARSIZE(year) (LEAPYEAR(year) ?
366 : 365)
9
10  const int _ytab[2][12] =
11  {
12      { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 },
13      { 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 }
14  };
15
16  #define LEAPYEAR(year) (!((year) % 4) && (((year) % 100) || !((year) % 400)))
17
18  void decode_UNIX_time(const time_t time)
19  {
20      unsigned int dayclock, dayno;
21      int year = EPOCH_YR;
22
23      dayclock = (unsigned long)time % SECS_DAY;
24      dayno = (unsigned long)time / SECS_DAY;
25
26      int seconds = dayclock % 60;
27      int minutes = (dayclock % 3600) / 60;
28      int hour = dayclock / 3600;
29      int wday = (dayno + 4) % 7;
30      while (dayno >= YEARSIZE(year))
31      {
32          dayno -= YEARSIZE(year);
33          year++;
34      }
35
36      int month = 0;
37
38      while (dayno >= _ytab[LEAPYEAR(year)][month])
39      {
40          dayno -= _ytab[LEAPYEAR(year)][month];
41          month++;
42      }
43      year = year - YEAR0;
44
45      if (YEAR0+year==2022 && month==1 && dayno+1==22)
46          klee_assert(0);
47  }
48  int main()
49  {
50      uint32_t time;
51
52      klee_make_symbolic(&time, sizeof time, "time");
53
54      decode_UNIX_time(time);
55
56      return 0;
57  }

% clang -emit-llvm -c -g klee_time3.c
...
% klee klee_time3.bc
KLEE: output directory is "/home/klee/klee-out-109"
KLEE: WARNING: undefined reference to function: klee_assert
KLEE: WARNING ONCE: calling external: klee_assert(0)
KLEE: ERROR: /home/klee/klee_time3.c:47: failed external call: klee_assert
KLEE: NOTE: now ignoring this error at this location
KLEE: done: total instructions = 101087
KLEE: done: completed paths = 1635
KLEE: done: generated tests = 1635

% ls klee-last | grep err
test000587.external.err

% ktest-tool --write-ints klee-last/test000587.ktest
ktest file : 'klee-last/test000587.ktest'
args       : ['klee_time3.bc']
num objects: 1
object 0: name: b'time'
object 0: size: 4
object 0: data: 1645488640

% date -u --date='@1645488640'
Tue Feb 22 00:10:40 UTC 2022

Success, but the hours/minutes/seconds seem random. They are random indeed, because KLEE satisfied all the constraints we've put, nothing else.
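As a quick cross-check (my addition, not part of the original text), Python's time.gmtime() agrees with the date(1) output for the value KLEE produced:

```python
import time

# 1645488640 is the satisfying UNIX time value from the ktest file above:
t = time.gmtime(1645488640)
print(t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec)
# 2022 2 22 0 10 40  (i.e. Tue Feb 22 00:10:40 UTC 2022)
```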
Wedidn’taskittosethours/minutes/secondstozeroes. Let’saddconstraintstohours/minutes/secondsaswell: ... if (YEAR0+year==2022 && month==1 && dayno+1==22 && hour==22 && minutes==22 && seconds==22) klee_assert(0); ... Let’srunitandcheck... % ktest-tool --write-ints klee-last/test000597.ktest ktest file : 'klee-last/test000597.ktest' args : ['klee_time3.bc'] num objects: 1 object 0: name: b'time' object 0: size: 4 object 0: data: 1645568542 % date -u --date='@1645568542' Tue Feb 22 22:22:22 UTC 2022 Nowthatisprecise. Yes,ofcourse,C/C++librarieshasfunction(s)toencodehuman-readabledateintoUNIXtimevalue,but whatwe’vegothereisKLEEworking antipodeofdecodingfunction, inversefunction inaway. 9.8 Inverse function for base64 decoder It’spieceofcakeforKLEEtoreconstructinputbase64stringgivenjustbase64decodercodewithoutcor- responding encoder code. I’ve copypasted this piece of code from http://www.opensource.apple.com/ source/QuickTimeStreamingServer/QuickTimeStreamingServer-452/CommonUtilitiesLib/base64.c . Weaddconstraints(lines84,85)sothatoutputbuffermusthavebytevaluesfrom0to15. Wealsotell toKLEEthattheBase64decode()functionmustreturn16(i.e.,sizeofoutputbufferinbytes,line82). 
1   #include <stdio.h>
2   #include <stdint.h>
3   #include <string.h>
4
5   // copypasted from http://www.opensource.apple.com/source/QuickTimeStreamingServer/QuickTimeStreamingServer-452/CommonUtilitiesLib/base64.c
6
7   static const unsigned char pr2six[256] =
8   {
9       /* ASCII table */
10      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
11      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
12      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 62, 64, 64, 64, 63,
13      52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 64, 64, 64, 64, 64, 64,
14      64,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
15      15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 64, 64, 64, 64, 64,
16      64, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
17      41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 64, 64, 64, 64, 64,
18      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
19      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
20      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
21      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
22      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
23      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
24      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64,
25      64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64
26  };
27
28  int Base64decode(char *bufplain, const char *bufcoded)
29  {
30      int nbytesdecoded;
31      register const unsigned char *bufin;
32      register unsigned char *bufout;
33      register int nprbytes;
34
35      bufin = (const unsigned char *) bufcoded;
36      while (pr2six[*(bufin++)] <= 63);
37      nprbytes = (bufin - (const unsigned char *) bufcoded) - 1;
38      nbytesdecoded = ((nprbytes + 3) / 4) * 3;
39
40      bufout = (unsigned char *) bufplain;
41      bufin = (const unsigned char *) bufcoded;
42
43      while (nprbytes > 4) {
44          *(bufout++) =
45              (unsigned char) (pr2six[*bufin] << 2 | pr2six[bufin[1]] >> 4);
46          *(bufout++) =
47              (unsigned char) (pr2six[bufin[1]] << 4 | pr2six[bufin[2]] >> 2);
48          *(bufout++) =
49  
            (unsigned char) (pr2six[bufin[2]] << 6 | pr2six[bufin[3]]);
50          bufin += 4;
51          nprbytes -= 4;
52      }
53
54      /* Note: (nprbytes == 1) would be an error, so just ignore that case */
55      if (nprbytes > 1) {
56          *(bufout++) =
57              (unsigned char) (pr2six[*bufin] << 2 | pr2six[bufin[1]] >> 4);
58      }
59      if (nprbytes > 2) {
60          *(bufout++) =
61              (unsigned char) (pr2six[bufin[1]] << 4 | pr2six[bufin[2]] >> 2);
62      }
63      if (nprbytes > 3) {
64          *(bufout++) =
65              (unsigned char) (pr2six[bufin[2]] << 6 | pr2six[bufin[3]]);
66      }
67
68      *(bufout++) = '\0';
69      nbytesdecoded -= (4 - nprbytes) & 3;
70      return nbytesdecoded;
71  }
72
73  int main()
74  {
75      char input[32];
76      uint8_t output[16+1];
77
78      klee_make_symbolic(input, sizeof input, "input");
79
80      klee_assume(input[31]==0);
81
82      klee_assume (Base64decode(output, input)==16);
83
84      for (int i=0; i<16; i++)
85          klee_assume (output[i]==i);
86
87      klee_assert(0);
88
89      return 0;
90  }

% clang -emit-llvm -c -g klee_base64.c
...
% klee klee_base64.bc
KLEE: output directory is "/home/klee/klee-out-99"
KLEE: WARNING: undefined reference to function: klee_assert
KLEE: ERROR: /home/klee/klee_base64.c:99: invalid klee_assume call (provably false)
KLEE: NOTE: now ignoring this error at this location
KLEE: WARNING ONCE: calling external: klee_assert(0)
KLEE: ERROR: /home/klee/klee_base64.c:104: failed external call: klee_assert
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/klee_base64.c:85: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/klee_base64.c:81: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location
KLEE: ERROR: /home/klee/klee_base64.c:65: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location
...
We’reinterestingintheseconderror,where klee_assert() hasbeentriggered: % ls klee-last | grep err test000001.user.err test000002.external.err test000003.ptr.err test000004.ptr.err test000005.ptr.err % ktest-tool --write-ints klee-last/test000002.ktest ktest file : 'klee-last/test000002.ktest' args : ['klee_base64.bc'] num objects: 1 object 0: name: b'input' object 0: size: 32 object 0: data: b'AAECAwQFBgcICQoLDA0OD4\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00' Thisisindeedarealbase64string,terminatedwiththezerobyte,justasit’srequestedbyC/C++stan- dards.Thefinalzerobyteat31thbyte(startingatzerothbyte)isourdeed:sothatKLEEwouldreportlesser numberoferrors. Thebase64stringisindeedcorrect: % echo AAECAwQFBgcICQoLDA0OD4 | base64 -d | hexdump -C base64: invalid input 00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................| 00000010 base64 decoder Linux utility I’ve just run blaming for “invalid input”—it means the input string is not properlypadded. Nowlet’spaditmanually,anddecoderutilitywillnocomplainanymore: % echo AAECAwQFBgcICQoLDA0OD4== | base64 -d | hexdump -C 00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................| 00000010 Thereasonourgeneratedbase64stringisnotpaddedisbecausebase64decodersareusuallydiscards paddingsymbols(“=”)attheend. Inotherwords,theyarenotrequirethem,soisthecaseofourdecoder. Hence,paddingsymbolsareleftunnoticedtoKLEE. Soweagainmade antipodeorinversefunction ofbase64decoder. 95 9.9 CRC (Cyclic redundancy check) 9.9.1 Buffer alteration case #1 Sometimes,youneedtoalterapieceofdatawhichis protectedbysomekindofchecksumor CRC,andyou can’tchangechecksumorCRCvalue,butcanalterpieceofdatasothatchecksumwillremainthesame. Let’spretend,we’vegotapieceofdatawith“Hello,world!” stringatthebeginningand“andgoodbye” stringattheend. Wecanalter14charactersatthemiddle,butforsomereason,theymustbein a..zlimits, butwecanputanycharactersthere. CRC64ofthewholeblockmustbe 0x12345678abcdef12 . 
Let’ssee77: #include #include uint64_t crc64(uint64_t crc, unsigned char *buf, int len) { int k; crc = crc; while (len--) { crc ^= *buf++; for (k = 0; k < 8; k++) crc = crc & 1 ? (crc >> 1) ^ 0x42f0e1eba9ea3693 : crc >> 1; } return crc; } int main() { #define HEAD_STR "Hello, world!.. " #define HEAD_SIZE strlen(HEAD_STR) #define TAIL_STR " ... and goodbye" #define TAIL_SIZE strlen(TAIL_STR) #define MID_SIZE 14 // work #define BUF_SIZE HEAD_SIZE+TAIL_SIZE+MID_SIZE char buf[BUF_SIZE]; klee_make_symbolic(buf, sizeof buf, "buf"); klee_assume (memcmp (buf, HEAD_STR, HEAD_SIZE)==0); for (int i=0; i='a' && buf[HEAD_SIZE+i]<='z'); klee_assume (memcmp (buf+HEAD_SIZE+MID_SIZE, TAIL_STR, TAIL_SIZE)==0); klee_assume (crc64 (0, buf, BUF_SIZE)==0x12345678abcdef12); klee_assert(0); return 0; } Since our code uses memcmp() standard C/C++ function, we need to add --libc=uclibc switch, so KLEEwilluseitsownuClibcimplementation. % clang -emit-llvm -c -g klee_CRC64.c % time klee --libc=uclibc klee_CRC64.bc Ittakesabout1minute(onmyIntelCorei3-3110M2.4GHznotebook)andwegettingthis: ... real 0m52.643s user 0m51.232s sys 0m0.239s ... % ls klee-last | grep err test000001.user.err test000002.user.err 77ThereareseveralslightlydifferentCRC64implementations,theoneIuseherecanalsobedifferentfrompopularones. 96 test000003.user.err test000004.external.err % ktest-tool --write-ints klee-last/test000004.ktest ktest file : 'klee-last/test000004.ktest' args : ['klee_CRC64.bc'] num objects: 1 object 0: name: b'buf' object 0: size: 46 object 0: data: b'Hello, world!.. qqlicayzceamyw ... and goodbye' Maybeit’sslow,butdefinitelyfasterthanbruteforce. Indeed, log2261465:8whichiscloseto64bits. In otherwords,oneneed 14latincharacterstoencode64bits.AndKLEE+ SMTsolverneeds64bitsatsome placeitcanaltertomakefinalCRC64valueequaltowhatwedefined. Itriedtoreducelengthofthe middleblock to13characters:noluckforKLEEthen,ithasnospaceenough. 
9.9.2 Buffer alteration case #2

I went even further: what if the buffer must contain the very CRC64 value which the CRC64 calculation over the whole buffer produces? Fascinatingly, KLEE can solve this. The buffer will have the following format:

 Hello, world! <8 bytes (64-bit value)> and goodbye <6 more bytes>

int main()
{
#define HEAD_STR "Hello, world!.. "
#define HEAD_SIZE strlen(HEAD_STR)
#define TAIL_STR " ... and goodbye"
#define TAIL_SIZE strlen(TAIL_STR)
// 8 bytes for 64-bit value:
#define MID_SIZE 8
#define BUF_SIZE HEAD_SIZE+TAIL_SIZE+MID_SIZE+6

	char buf[BUF_SIZE];

	klee_make_symbolic(buf, sizeof buf, "buf");
	klee_assume (memcmp (buf, HEAD_STR, HEAD_SIZE)==0);
	klee_assume (memcmp (buf+HEAD_SIZE+MID_SIZE, TAIL_STR, TAIL_SIZE)==0);
	uint64_t mid_value=*(uint64_t*)(buf+HEAD_SIZE);
	klee_assume (crc64 (0, buf, BUF_SIZE)==mid_value);
	klee_assert(0);
	return 0;
}

It works:

 % time klee --libc=uclibc klee_CRC64.bc
 ...
 real 5m17.081s
 user 5m17.014s
 sys 0m0.319s
 % ls klee-last | grep err
 test000001.user.err
 test000002.user.err
 test000003.external.err
 % ktest-tool --write-ints klee-last/test000003.ktest
 ktest file : 'klee-last/test000003.ktest'
 args : ['klee_CRC64.bc']
 num objects: 1
 object 0: name: b'buf'
 object 0: size: 46
 object 0: data: b'Hello, world!.. T+]\xb9A\x08\x0fq ... and goodbye\xb6\x8f\x9c\xd8\xc5\x00'

The 8 bytes between the two strings are a 64-bit value which equals the CRC64 of this whole block. Again, this is faster than finding it by brute force. If the last spare 6-byte buffer is decreased to 4 bytes or fewer, KLEE works for so long that I stopped it.

9.9.3 Recovering input data for a given CRC32 value of it

I've always wanted to do this, but everyone knows it is impossible for input buffers larger than 4 bytes. As my experiments show, it's still possible for tiny input buffers which are constrained in some way.

The CRC32 value of the 6-byte "SILVER" string is known: 0xDFA3DFDD.
KLEE can find this 6-byte string, if it knows that each byte of the input buffer is within the A..Z limits:

1  #include <stdint.h>
2  #include <stdbool.h>
3
4  uint32_t crc32(uint32_t crc, unsigned char *buf, int len)
5  {
6  	int k;
7
8  	crc = ~crc;
9  	while (len--)
10 	{
11 		crc ^= *buf++;
12 		for (k = 0; k < 8; k++)
13 			crc = crc & 1 ? (crc >> 1) ^ 0xedb88320 : crc >> 1;
14 	}
15 	return ~crc;
16 }
17
18 #define SIZE 6
19
20 bool find_string(char str[SIZE])
21 {
22 	int i=0;
23 	for (i=0; i<SIZE; i++)
24 		if (str[i]<'A' || str[i]>'Z')
25 			return false;
26
27 	if (crc32(0, &str[0], SIZE)!=0xDFA3DFDD)
28 		return false;
29
30 	// OK, input str is valid
31 	klee_assert(0); // force KLEE to produce .err file
32 	return true;
33 };
34
35 int main()
36 {
37 	uint8_t str[SIZE];
38
39 	klee_make_symbolic(str, sizeof str, "str");
40
41 	find_string(str);
42
43 	return 0;
44 }

 % clang -emit-llvm -c -g klee_SILVER.c
 ...
 % klee klee_SILVER.bc
 ...
 % ls klee-last | grep err
 test000013.external.err
 % ktest-tool --write-ints klee-last/test000013.ktest
 ktest file : 'klee-last/test000013.ktest'
 args : ['klee_SILVER.bc']
 num objects: 1
 object 0: name: b'str'
 object 0: size: 6
 object 0: data: b'SILVER'

Still, it's no magic: if we remove the condition at lines 23..25 (i.e., relax the constraints), KLEE will produce some other string, which will still be correct for the given CRC32 value.

It works because 6 Latin characters in the A..Z limits contain ≈28.2 bits: log2(26^6) ≈ 28.2, which is even smaller than 32. In other words, the final CRC32 value holds enough bits to recover 28.2 bits of input. The input buffer can be even bigger, if each byte of it is under even tighter constraints (decimal digits, binary digits, etc.).

9.9.4 In comparison with other hashing algorithms

Things are that easy for some other hashing algorithms like the Fletcher checksum, but not for cryptographically secure ones (like MD5, SHA1, etc.): they are protected from such simple cryptanalysis. See also: 10.

9.10 LZSS decompressor

I've googled for a very simple LZSS78 decompressor and landed at this page: http://www.opensource.apple.com/source/boot/boot-132/i386/boot2/lzss.c.
Let's pretend we're looking at an unknown compression algorithm with no compressor available. Will it be possible to reconstruct a compressed piece of data so that the decompressor would generate the data we need? Here is my first experiment:

// copypasted from http://www.opensource.apple.com/source/boot/boot-132/i386/boot2/lzss.c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

//#define N 4096 /* size of ring buffer - must be power of 2 */
#define N 32 /* size of ring buffer - must be power of 2 */
#define F 18 /* upper limit for match_length */
#define THRESHOLD 2 /* encode string into position and length if match_length is greater than this */
#define NIL N /* index for root of binary search trees */

int decompress_lzss(uint8_t *dst, uint8_t *src, uint32_t srclen)
{
	/* ring buffer of size N, with extra F-1 bytes to aid string comparison */
	uint8_t *dststart = dst;
	uint8_t *srcend = src + srclen;
	int i, j, k, r, c;
	unsigned int flags;
	uint8_t text_buf[N + F - 1];

	dst = dststart;
	srcend = src + srclen;
	for (i = 0; i < N - F; i++)
		text_buf[i] = ' ';
	r = N - F;
	flags = 0;
	for ( ; ; )
	{
		if (((flags >>= 1) & 0x100) == 0)
		{
			if (src < srcend) c = *src++; else break;
			flags = c | 0xFF00; /* uses higher byte cleverly */
		}                          /* to count eight */
		if (flags & 1)
		{
			if (src < srcend) c = *src++; else break;
			*dst++ = c;
			text_buf[r++] = c;
			r &= (N - 1);
		}
		else
		{
			if (src < srcend) i = *src++; else break;
			if (src < srcend) j = *src++; else break;
			i |= ((j & 0xF0) << 4);
			j = (j & 0x0F) + THRESHOLD;
			for (k = 0; k <= j; k++)
			{
				c = text_buf[(i + k) & (N - 1)];
				*dst++ = c;
				text_buf[r++] = c;
				r &= (N - 1);
			}
		}
	}

	return dst - dststart;
}

78 Lempel–Ziv–Storer–Szymanski

int main()
{
#define COMPRESSED_LEN 15

	uint8_t input[COMPRESSED_LEN];
	uint8_t plain[24];
	uint32_t size=COMPRESSED_LEN;

	klee_make_symbolic(input, sizeof input, "input");
	decompress_lzss(plain, input, size);
	// https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo
	for (int i=0; i<23; i++)
		klee_assume (plain[i]=="Buffalo buffalo Buffalo"[i]);
	klee_assert(0);
	return 0;
}

What I did is change the size of the ring buffer from 4096 to 32, because with a bigger one KLEE consumes all the RAM79 it can. But I've found that KLEE can live with that small buffer. I've also decreased COMPRESSED_LEN gradually to check whether KLEE would still find a compressed piece of data, and it did:

 % clang -emit-llvm -c -g klee_lzss.c
 ...
 % time klee klee_lzss.bc
 KLEE: output directory is "/home/klee/klee-out-7"
 KLEE: WARNING: undefined reference to function: klee_assert
 KLEE: ERROR: /home/klee/klee_lzss.c:122: invalid klee_assume call (provably false)
 KLEE: NOTE: now ignoring this error at this location
 KLEE: ERROR: /home/klee/klee_lzss.c:47: memory error: out of bound pointer
 KLEE: NOTE: now ignoring this error at this location
 KLEE: ERROR: /home/klee/klee_lzss.c:37: memory error: out of bound pointer
 KLEE: NOTE: now ignoring this error at this location
 KLEE: WARNING ONCE: calling external: klee_assert(0)
 KLEE: ERROR: /home/klee/klee_lzss.c:124: failed external call: klee_assert
 KLEE: NOTE: now ignoring this error at this location
 KLEE: done: total instructions = 41417919
 KLEE: done: completed paths = 437820
 KLEE: done: generated tests = 4

 real 13m0.215s
 user 11m57.517s
 sys 1m2.187s
 % ls klee-last | grep err
 test000001.user.err
 test000002.ptr.err
 test000003.ptr.err
 test000004.external.err
 % ktest-tool --write-ints klee-last/test000004.ktest
 ktest file : 'klee-last/test000004.ktest'
 args : ['klee_lzss.bc']
 num objects: 1
 object 0: name: b'input'
 object 0: size: 15
 object 0: data: b'\xffBuffalo \x01b\x0f\x03\r\x05'

KLEE consumed 1GB of RAM and worked for 15 minutes (on my Intel Core i3-3110M 2.4GHz notebook), but here it is: 15 bytes which, if decompressed by our copypasted algorithm, result in the desired text!

79 Random-access memory
During my experimentation, I've found that KLEE can do an even cooler thing: find the size of the compressed piece of data:

int main()
{
	uint8_t input[24];
	uint8_t plain[24];
	uint32_t size;

	klee_make_symbolic(input, sizeof input, "input");
	klee_make_symbolic(&size, sizeof size, "size");
	decompress_lzss(plain, input, size);
	for (int i=0; i<23; i++)
		klee_assume (plain[i]=="Buffalo buffalo Buffalo"[i]);
	klee_assert(0);
	return 0;
}

...but then KLEE works much slower, consumes much more RAM, and I had success only with even smaller pieces of the desired text.

So how does LZSS work? Without peeking into Wikipedia, we can say this: if the LZSS compressor observes some data it has already seen, it replaces the data with a link to some place in the past, plus a size. If it observes something yet unseen, it puts the data as is. That's the theory.

And this is indeed what we've got. The desired text is three "Buffalo" words; the first and the last are equivalent, but the second is almost equivalent, differing from the first by one character. That's what we see:

 '\xffBuffalo \x01b\x0f\x03\r\x05'

Here is some control byte (0xff), the word "Buffalo " placed as is, then another control byte (0x01), then the beginning of the second word ("b") and more control bytes, presumably links to the beginning of the buffer. These are commands to the decompressor, saying, in plain English, "copy data from the output we've already produced, from that place to that place", etc.

Interesting: is it possible to meddle with this piece of compressed data? Out of whim, can we force KLEE to find a compressed piece of data where not just the "b" character is placed as is, but also the second character of the word, i.e., "bu"?

I've modified the main() function by adding a klee_assume(): now the 11th byte of the input (compressed) data (right after the "b" byte) must be "u".
I had no luck with 15 bytes of compressed data, so I increased it to 16 bytes:

int main()
{
#define COMPRESSED_LEN 16

	uint8_t input[COMPRESSED_LEN];
	uint8_t plain[24];
	uint32_t size=COMPRESSED_LEN;

	klee_make_symbolic(input, sizeof input, "input");
	klee_assume(input[11]=='u');
	decompress_lzss(plain, input, size);
	for (int i=0; i<23; i++)
		klee_assume (plain[i]=="Buffalo buffalo Buffalo"[i]);
	klee_assert(0);
	return 0;
}

...and voilà: KLEE found a compressed piece of data which satisfies our whimsical constraint:

 % time klee klee_lzss.bc
 KLEE: output directory is "/home/klee/klee-out-9"
 KLEE: WARNING: undefined reference to function: klee_assert
 KLEE: ERROR: /home/klee/klee_lzss.c:97: invalid klee_assume call (provably false)
 KLEE: NOTE: now ignoring this error at this location
 KLEE: ERROR: /home/klee/klee_lzss.c:47: memory error: out of bound pointer
 KLEE: NOTE: now ignoring this error at this location
 KLEE: ERROR: /home/klee/klee_lzss.c:37: memory error: out of bound pointer
 KLEE: NOTE: now ignoring this error at this location
 KLEE: WARNING ONCE: calling external: klee_assert(0)
 KLEE: ERROR: /home/klee/klee_lzss.c:99: failed external call: klee_assert
 KLEE: NOTE: now ignoring this error at this location
 KLEE: done: total instructions = 36700587
 KLEE: done: completed paths = 369756
 KLEE: done: generated tests = 4

 real 12m16.983s
 user 11m17.492s
 sys 0m58.358s
 % ktest-tool --write-ints klee-last/test000004.ktest
 ktest file : 'klee-last/test000004.ktest'
 args : ['klee_lzss.bc']
 num objects: 1
 object 0: name: b'input'
 object 0: size: 16
 object 0: data: b'\xffBuffalo \x13bu\x10\x02\r\x05'

So now we've found a piece of compressed data where two strings are placed as is: "Buffalo " and "bu".

 '\xffBuffalo \x13bu\x10\x02\r\x05'

Both pieces of compressed data, if fed into our copypasted function, produce the "Buffalo buffalo Buffalo" text string.

Please note: I still have no access to the LZSS compressor code, and I haven't got into the LZSS decompressor details yet.
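Both claims are easy to verify outside KLEE. Here is a direct Python port of the copypasted decompressor (a sketch, assuming it mirrors the C routine above exactly, including the reduced ring buffer N=32), fed with the two buffers KLEE produced:

```python
N, F, THRESHOLD = 32, 18, 2  # same constants as the reduced C version

def decompress_lzss(src: bytes) -> bytes:
    text_buf = bytearray(b' ' * (N + F - 1))  # ring buffer, pre-filled with spaces
    dst = bytearray()
    r = N - F
    flags = 0
    pos = 0
    while True:
        flags >>= 1
        if (flags & 0x100) == 0:             # out of flag bits: reload 8 of them
            if pos >= len(src):
                break
            flags = src[pos] | 0xFF00
            pos += 1
        if flags & 1:                        # literal byte, copied as is
            if pos >= len(src):
                break
            c = src[pos]; pos += 1
            dst.append(c)
            text_buf[r] = c
            r = (r + 1) & (N - 1)
        else:                                # back-reference: position + length
            if pos + 1 >= len(src):
                break
            i = src[pos]; j = src[pos + 1]; pos += 2
            i |= (j & 0xF0) << 4
            j = (j & 0x0F) + THRESHOLD
            for k in range(j + 1):
                c = text_buf[(i + k) & (N - 1)]
                dst.append(c)
                text_buf[r] = c
                r = (r + 1) & (N - 1)
    return bytes(dst)

# the two buffers KLEE produced:
print(decompress_lzss(b'\xffBuffalo \x01b\x0f\x03\r\x05'))   # b'Buffalo buffalo Buffalo'
print(decompress_lzss(b'\xffBuffalo \x13bu\x10\x02\r\x05'))  # b'Buffalo buffalo Buffalo'
```

Tracing the second buffer also confirms the reading above: 0xff flags eight literals ("Buffalo "), 0x13 flags the literals "b" and "u", and the remaining byte pairs are back-references into the ring buffer.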
Unfortunately, things are not that cool: KLEE is very slow, and I had success only with small pieces of text; also, the ring buffer size had to be decreased significantly (the original LZSS decompressor with its 4096-byte ring buffer cannot correctly decompress what we found). Nevertheless, it's very impressive, taking into account the fact that we are not getting into the internals of this specific LZSS decompressor. Once more, we've created the antipode of the decompressor, or its inverse function.

Also, it seems KLEE isn't very good so far with decompression algorithms (but who is?). I've also tried various JPEG/PNG/GIF decoders (which, of course, contain decompressors), starting with the simplest possible, and KLEE got stuck.

9.11 strtodx() from RetroBSD

I just found this function in RetroBSD: https://github.com/RetroBSD/retrobsd/blob/master/src/libc/stdlib/strtod.c. It converts a string into a floating point number for a given radix.

1  #include <stdlib.h>
2
3  // my own version, only for radix 10:
4  int isdigitx (char c, int radix)
5  {
6  	if (c>='0' && c<='9')
7  		return 1;
8  	return 0;
9  };
10
11 /*
12  * double strtodx (char *string, char **endPtr, int radix)
13  * This procedure converts a floating-point number from an ASCII
14  * decimal representation to internal double-precision format.
15  *
16  * Original sources taken from 386bsd and modified for variable radix
17  * by Serge Vakulenko, .
18  *
19  * Arguments:
20  * string
21  *     A decimal ASCII floating-point number, optionally preceded
22  *     by white space. Must have form "-I.FE-X", where I is the integer
23  *     part of the mantissa, F is the fractional part of the mantissa,
24  *     and X is the exponent. Either of the signs may be "+", "-", or
25  *     omitted. Either I or F may be omitted, or both. The decimal point
26  *     isn't necessary unless F is present. The "E" may actually be an "e",
27  *     or "E", "S", "s", "F", "f", "D", "d", "L", "l".
28  *     E and X may both be omitted (but not just one).
29  *
30  * endPtr
31  *     If non-NULL, store terminating character's address here.
32  *
33  * radix
34  *     Radix of floating point, one of 2, 8, 10, 16.
35 * 36 * The return value is the double-precision floating-point 37 * representation of the characters in string. If endPtr isn't 38 * NULL, then *endPtr is filled in with the address of the 39 * next character after the last one that was part of the 40 * floating-point number. 41 */ 42 double strtodx (char *string, char **endPtr, int radix) 43 { 44 int sign = 0, expSign = 0, fracSz, fracOff, i; 45 double fraction, dblExp, *powTab; 46 register char *p; 47 register char c; 48 49 /* Exponent read from "EX" field. */ 50 int exp = 0; 51 52 /* Exponent that derives from the fractional part. Under normal 53 * circumstances, it is the negative of the number of digits in F. 54 * However, if I is very long, the last digits of I get dropped 55 * (otherwise a long I with a large negative exponent could cause an 56 * unnecessary overflow on I alone). In this case, fracExp is 57 * incremented one for each dropped digit. */ 58 int fracExp = 0; 59 60 /* Number of digits in mantissa. */ 61 int mantSize; 62 63 /* Number of mantissa digits BEFORE decimal point. */ 64 int decPt; 65 66 /* Temporarily holds location of exponent in string. */ 67 char *pExp; 68 69 /* Largest possible base 10 exponent. 70 * Any exponent larger than this will already 71 * produce underflow or overflow, so there's 72 * no need to worry about additional digits. */ 73 static int maxExponent = 307; 74 75 /* Table giving binary powers of 10. 76 * Entry is 10^2^i. Used to convert decimal 77 * exponents into floating-point numbers. 
*/ 78 static double powersOf10[] = { 79 1e1, 1e2, 1e4, 1e8, 1e16, 1e32, //1e64, 1e128, 1e256, 80 }; 81 static double powersOf2[] = { 82 2, 4, 16, 256, 65536, 4.294967296e9, 1.8446744073709551616e19, 83 //3.4028236692093846346e38, 1.1579208923731619542e77, 1.3407807929942597099e154, 84 }; 85 static double powersOf8[] = { 86 8, 64, 4096, 2.81474976710656e14, 7.9228162514264337593e28, 87 //6.2771017353866807638e57, 3.9402006196394479212e115, 1.5525180923007089351e231, 88 }; 89 static double powersOf16[] = { 90 16, 256, 65536, 1.8446744073709551616e19, 91 //3.4028236692093846346e38, 1.1579208923731619542e77, 1.3407807929942597099e154, 92 }; 93 94 /* 95 * Strip off leading blanks and check for a sign. 96 */ 97 p = string; 98 while (*p==' ' || *p=='\t') 99 ++p; 100 if (*p == '-') { 101 sign = 1; 102 ++p; 103 } else if (*p == '+') 104 ++p; 105 106 /* 107 * Count the number of digits in the mantissa (including the decimal 103 108 * point), and also locate the decimal point. 109 */ 110 decPt = -1; 111 for (mantSize=0; ; ++mantSize) { 112 c = *p; 113 if (!isdigitx (c, radix)) { 114 if (c != '.' || decPt >= 0) 115 break; 116 decPt = mantSize; 117 } 118 ++p; 119 } 120 121 /* 122 * Now suck up the digits in the mantissa. Use two integers to 123 * collect 9 digits each (this is faster than using floating-point). 124 * If the mantissa has more than 18 digits, ignore the extras, since 125 * they can't affect the value anyway. 126 */ 127 pExp = p; 128 p -= mantSize; 129 if (decPt < 0) 130 decPt = mantSize; 131 else 132 --mantSize; /* One of the digits was the point. 
*/ 133 134 switch (radix) { 135 default: 136 case 10: fracSz = 9; fracOff = 1000000000; powTab = powersOf10; break; 137 case 2: fracSz = 30; fracOff = 1073741824; powTab = powersOf2; break; 138 case 8: fracSz = 10; fracOff = 1073741824; powTab = powersOf8; break; 139 case 16: fracSz = 7; fracOff = 268435456; powTab = powersOf16; break; 140 } 141 if (mantSize > 2 * fracSz) 142 mantSize = 2 * fracSz; 143 fracExp = decPt - mantSize; 144 if (mantSize == 0) { 145 fraction = 0.0; 146 p = string; 147 goto done; 148 } else { 149 int frac1, frac2; 150 151 for (frac1=0; mantSize>fracSz; --mantSize) { 152 c = *p++; 153 if (c == '.') 154 c = *p++; 155 frac1 = frac1 * radix + (c - '0'); 156 } 157 for (frac2=0; mantSize>0; --mantSize) { 158 c = *p++; 159 if (c == '.') 160 c = *p++; 161 frac2 = frac2 * radix + (c - '0'); 162 } 163 fraction = (double) fracOff * frac1 + frac2; 164 } 165 166 /* 167 * Skim off the exponent. 168 */ 169 p = pExp; 170 if (*p=='E' || *p=='e' || *p=='S' || *p=='s' || *p=='F' || *p=='f' || 171 *p=='D' || *p=='d' || *p=='L' || *p=='l') { 172 ++p; 173 if (*p == '-') { 174 expSign = 1; 175 ++p; 176 } else if (*p == '+') 177 ++p; 178 while (isdigitx (*p, radix)) 179 exp = exp * radix + (*p++ - '0'); 180 } 181 if (expSign) 182 exp = fracExp - exp; 183 else 104 184 exp = fracExp + exp; 185 186 /* 187 * Generate a floating-point number that represents the exponent. 188 * Do this by processing the exponent one bit at a time to combine 189 * many powers of 2 of 10. Then combine the exponent with the 190 * fraction. 191 */ 192 if (exp < 0) { 193 expSign = 1; 194 exp = -exp; 195 } else 196 expSign = 0; 197 if (exp > maxExponent) 198 exp = maxExponent; 199 dblExp = 1.0; 200 for (i=0; exp; exp>>=1, ++i) 201 if (exp & 01) 202 dblExp *= powTab[i]; 203 if (expSign) 204 fraction /= dblExp; 205 else 206 fraction *= dblExp; 207 208 done: 209 if (endPtr) 210 *endPtr = p; 211 212 return sign ? 
-fraction : fraction;
213 }
214
215 #define BUFSIZE 10
216 int main()
217 {
218 	char buf[BUFSIZE];
219 	klee_make_symbolic (buf, sizeof buf, "buf");
220 	klee_assume(buf[9]==0);
221
222 	strtodx (buf, NULL, 10);
223 };

(https://github.com/dennis714/SAT_SMT_article/blob/master/KLEE/strtodx.c)

Interestingly, KLEE cannot handle floating-point arithmetic, but it nevertheless found something:

 ...
 KLEE: ERROR: /home/klee/klee_test.c:202: memory error: out of bound pointer
 ...
 % ktest-tool klee-last/test003483.ktest
 ktest file : 'klee-last/test003483.ktest'
 args : ['klee_test.bc']
 num objects: 1
 object 0: name: b'buf'
 object 0: size: 10
 object 0: data: b'-.0E-66\x00\x00\x00'

As it seems, the string "-.0E-66" causes an out-of-bounds array access (a read) at line 202. On further investigation, I found that the powersOf10[] array is too short: its 6th element (counting from the 0th) has been accessed. And indeed, part of the array is commented out (line 79)! Probably someone's mistake?

9.12 Unit testing: simple expression evaluator (calculator)

I was looking for a simple expression evaluator (a calculator, in other words) which takes an expression like "2+2" on input and gives the answer. I found one at http://stackoverflow.com/a/13895198. Unfortunately, it has no bugs, so I introduced one: the token buffer (buf[] at line 31) is smaller than the input buffer (input[] at line 19).

1  // copypasted from http://stackoverflow.com/a/13895198 and reworked
2
3  // Bare bones scanner and parser for the following LL(1) grammar:
4  // expr -> term { [+-] term } ; An expression is terms separated by add ops.
5  // term -> factor { [*/] factor } ; A term is factors separated by mul ops.
6  // factor -> unsigned_factor ; A signed factor is a factor,
7  //         | - unsigned_factor ; possibly with leading minus sign
8  // unsigned_factor -> ( expr ) ; An unsigned factor is a parenthesized expression
9  //                  | NUMBER ; or a number
10 //
11 // The parser returns the floating point value of the expression.
12
13 #include <stdio.h>
14 #include <stdlib.h>
15 #include <string.h>
16 #include <ctype.h>
17 #include <klee/klee.h>
18
19 char input[128];
20 int input_idx=0;
21
22 char my_getchar()
23 {
24 	char rt=input[input_idx];
25 	input_idx++;
26 	return rt;
27 };
28
29 // The token buffer. We never check for overflow! Do so in production code.
30 // it's deliberately smaller than input[] so KLEE could find buffer overflow
31 char buf[64];
32 int n = 0;
33
34 // The current character.
35 int ch;
36
37 // The look-ahead token. This is the 1 in LL(1).
38 enum { ADD_OP, MUL_OP, LEFT_PAREN, RIGHT_PAREN, NOT_OP, NUMBER, END_INPUT } look_ahead;
39
40 // Forward declarations.
41 void init(void);
42 void advance(void);
43 int expr(void);
44 void error(char *msg);
45
46 // Parse expressions, one per line.
47 int main(void)
48 {
49 	// take input expression from input[]
50 	//input[0]=0;
51 	//strcpy (input, "2+2");
52 	klee_make_symbolic(input, sizeof input, "input");
53 	input[127]=0;
54
55 	init();
56 	while (1)
57 	{
58 		int val = expr();
59 		printf("%d\n", val);
60
61 		if (look_ahead != END_INPUT)
62 			error("junk after expression");
63 		advance(); // past end of input mark
64 	}
65 	return 0;
66 }
67
68 // Just die on any error.
69 void error(char *msg)
70 {
71 	fprintf(stderr, "Error: %s. Exiting.\n", msg);
72 	exit(1);
73 }
74
75 // Buffer the current character and read a new one.
76 void read()
77 {
78 	buf[n++] = ch;
79 	buf[n] = '\0'; // Terminate the string.
80 	ch = my_getchar();
81 }
82
83 // Ignore the current character.
84 void ignore()
85 {
86 	ch = my_getchar();
87 }
88
89 // Reset the token buffer.
90 void reset()
91 {
92 	n = 0;
93 	buf[0] = '\0';
94 }
95
96 // The scanner. A tiny deterministic finite automaton.
97 int scan()
98 {
99 	reset();
100 START:
101 	// first character is digit?
102 	if (isdigit (ch))
103 		goto DIGITS;
104
105 	switch (ch)
106 	{
107 	case ' ': case '\t': case '\r':
108 		ignore();
109 		goto START;
110
111 	case '-': case '+': case '^':
112 		read();
113 		return ADD_OP;
114
115 	case '!':
116 		read();
117 		return NOT_OP;
118
119 	case '*': case '/': case '%':
120 		read();
121 		return MUL_OP;
122
123 	case '(':
124 		read();
125 		return LEFT_PAREN;
126
127 	case ')':
128 		read();
129 		return RIGHT_PAREN;
130
131 	case 0:
132 	case '\n':
133 		ch = ' '; // delayed ignore()
134 		return END_INPUT;
135
136 	default:
137 		printf ("bad character: 0x%x\n", ch);
138 		exit(0);
139 	}
140
141 DIGITS:
142 	if (isdigit (ch))
143 	{
144 		read();
145 		goto DIGITS;
146 	}
147 	else
148 		return NUMBER;
149 }
150
151 // To advance is just to replace the look-ahead.
152 void advance()
153 {
154 	look_ahead = scan();
155 }
156
157 // Clear the token buffer and read the first look-ahead.
158 void init()
159 {
160 	reset();
161 	ignore(); // junk current character
162 	advance();
163 }
164
165 int get_number(char *buf)
166 {
167 	char *endptr;
168
169 	int rt=strtoul (buf, &endptr, 10);
170
171 	// has the whole buffer been processed?
172 	if (strlen(buf)!=endptr-buf)
173 	{
174 		fprintf (stderr, "invalid number: %s\n", buf);
175 		exit(0);
176 	};
177 	return rt;
178 };
179
180 int unsigned_factor()
181 {
182 	int rtn = 0;
183 	switch (look_ahead)
184 	{
185 	case NUMBER:
186 		rtn=get_number(buf);
187 		advance();
188 		break;
189
190 	case LEFT_PAREN:
191 		advance();
192 		rtn = expr();
193 		if (look_ahead != RIGHT_PAREN) error("missing ')'");
194 		advance();
195 		break;
196
197 	default:
198 		printf("unexpected token: %d\n", look_ahead);
199 		exit(0);
200 	}
201 	return rtn;
202 }
203
204 int factor()
205 {
206 	int rtn = 0;
207 	// If there is a leading minus...
208 	if (look_ahead == ADD_OP && buf[0] == '-')
209 	{
210 		advance();
211 		rtn = -unsigned_factor();
212 	}
213 	else
214 		rtn = unsigned_factor();
215
216 	return rtn;
217 }
218
219 int term()
220 {
221 	int rtn = factor();
222 	while (look_ahead == MUL_OP)
223 	{
224 		switch(buf[0])
225 		{
226 		case '*':
227 			advance();
228 			rtn *= factor();
229 			break;
230
231 		case '/':
232 			advance();
233 			rtn /= factor();
234 			break;
235 		case '%':
236 			advance();
237 			rtn %= factor();
238 			break;
239 		}
240 	}
241 	return rtn;
242 }
243
244 int expr()
245 {
246 	int rtn = term();
247 	while (look_ahead == ADD_OP)
248 	{
249 		switch(buf[0])
250 		{
251 		case '+':
252 			advance();
253 			rtn += term();
254 			break;
255
256 		case '-':
257 			advance();
258 			rtn -= term();
259 			break;
260 		}
261 	}
262 	return rtn;
263 }

(https://github.com/dennis714/SAT_SMT_article/blob/master/KLEE/calc.c)

KLEE found the buffer overflow with little effort (65 zero digits plus one tabulation symbol):

 % ktest-tool --write-ints klee-last/test000468.ktest
 ktest file : 'klee-last/test000468.ktest'
 args : ['calc.bc']
 num objects: 1
 object 0: name: b'input'
 object 0: size: 128
 object 0: data: b'0\t0000000000000000000000000000000000000000000000000000000000000000\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'

It's hard to say how the tabulation symbol (\t) got into the input[] array, but KLEE achieved what was desired: the buffer overflowed.
KLEE also found two expression strings which lead to a division error ("0/0" and "0%0"):

 % ktest-tool --write-ints klee-last/test000326.ktest
 ktest file : 'klee-last/test000326.ktest'
 args : ['calc.bc']
 num objects: 1
 object 0: name: b'input'
 object 0: size: 128
 object 0: data: b'0/0\x00' + 124 trailing '\xff' bytes
 % ktest-tool --write-ints klee-last/test000557.ktest
 ktest file : 'klee-last/test000557.ktest'
 args : ['calc.bc']
 num objects: 1
 object 0: name: b'input'
 object 0: size: 128
 object 0: data: b'0%0\x00' + 124 trailing '\xff' bytes

Maybe this is not an impressive result; nevertheless, it's yet another reminder that division and remainder operations must be wrapped somehow in production code to avoid possible crashes.

9.13 Regular expressions

I've always wanted to generate possible strings for a given regular expression. This is not so hard if you dive into regular-expression matcher theory and details, but can we force an RE matcher to do it for us?
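For tiny patterns the expected answers can also be enumerated directly, which is handy for cross-checking whatever a symbolic executor returns. A sketch using Python's re module on a small sample pattern (the pattern and the length-5 restriction are illustrative assumptions):

```python
import itertools
import re
import string

# Sample pattern: a digit, then one or more of a-c, then one of x/y/z.
PATTERN = re.compile(r'\d[a-c]+[xyz]')

# Enumerate every length-5 candidate of the right shape and keep full matches.
matches = [
    d + ''.join(mid) + t
    for d in string.digits
    for mid in itertools.product('abc', repeat=3)
    for t in 'xyz'
    if PATTERN.fullmatch(d + ''.join(mid) + t)
]

print(len(matches))   # 10 digits * 27 middles * 3 tails = 810
print(matches[0])     # '0aaax'
# At this length the [a-c]+ part must cover positions 1..3, so every match
# has a character from a..c at position 2 (counting from the 0th):
print(all(m[2] in 'abc' for m in matches))  # True
```

This kind of exhaustive check makes it obvious in advance which extra constraints on such a string are satisfiable and which are not.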
I took the lightest RE engine I've found, https://github.com/cesanta/slre, and wrote this:

int main(void)
{
	char s[6];
	klee_make_symbolic(s, sizeof s, "s");
	s[5]=0;

	if (slre_match("^\\d[a-c]+(x|y|z)", s, 5, NULL, 0, 0)==5)
		klee_assert(0);
}

So I wanted a string consisting of a digit, then "a"/"b"/"c" characters (at least one), and then one "x"/"y"/"z" character. The whole string must be 5 characters long.

 % klee --libc=uclibc slre.bc
 ...
 KLEE: ERROR: /home/klee/slre.c:445: failed external call: klee_assert
 KLEE: NOTE: now ignoring this error at this location
 ...
 % ls klee-last | grep err
 test000014.external.err
 % ktest-tool --write-ints klee-last/test000014.ktest
 ktest file : 'klee-last/test000014.ktest'
 args : ['slre.bc']
 num objects: 1
 object 0: name: b's'
 object 0: size: 6
 object 0: data: b'5aaax\xff'

This is indeed a correct string. "\xff" sits at the place where the terminating zero byte should be, but the RE engine we use ignores the last zero byte, because it takes the buffer length as a parameter. Hence, KLEE doesn't reconstruct the final byte.

Can we get more? Now we add an additional constraint:

int main(void)
{
	char s[6];
	klee_make_symbolic(s, sizeof s, "s");
	s[5]=0;

	if (slre_match("^\\d[a-c]+(x|y|z)", s, 5, NULL, 0, 0)==5 &&
	    strcmp(s, "5aaax")!=0)
		klee_assert(0);
}

 % ktest-tool --write-ints klee-last/test000014.ktest
 ktest file : 'klee-last/test000014.ktest'
 args : ['slre.bc']
 num objects: 1
 object 0: name: b's'
 object 0: size: 6
 object 0: data: b'7aaax\xff'

Let's say, out of whim, we don't like "a" at the 2nd position (counting from the 0th):

int main(void)
{
	char s[6];
	klee_make_symbolic(s, sizeof s, "s");
	s[5]=0;

	if (slre_match("^\\d[a-c]+(x|y|z)", s, 5, NULL, 0, 0)==5 &&
	    strcmp(s, "5aaax")!=0 &&
	    s[2]!='a')
		klee_assert(0);
}

KLEE found a way to satisfy our new constraint:

 % ktest-tool --write-ints klee-last/test000014.ktest
 ktest file : 'klee-last/test000014.ktest'
 args : ['slre.bc']
 num objects: 1
 object 0: name: b's'
 object 0: size: 6
 object 0: data: b'7abax\xff'

Let's also define a constraint KLEE cannot satisfy:

int main(void)
{
	char s[6];
	klee_make_symbolic(s, sizeof s, "s");
	s[5]=0;

	if (slre_match("^\\d[a-c]+(x|y|z)", s, 5, NULL, 0, 0)==5 &&
	    strcmp(s, "5aaax")!=0 &&
	    s[2]!='a' && s[2]!='b' && s[2]!='c')
		klee_assert(0);
}

Indeed it cannot, and KLEE finished without reporting that klee_assert() was triggered.

9.14 Exercise

Here is my crackme/keygenme, which may be tricky, but is easy to solve using KLEE: http://challenges.re/74/.

10 (Amateur) cryptography

10.1 Serious cryptography

Let's get back to the method we previously used (8.2) to construct expressions by running a Python function. We can try to build an expression for the output of the XXTEA encryption algorithm:

#!/usr/bin/env python

class Expr:
    def __init__(self,s):
        self.s=s
    def __str__(self):
        return self.s
    def convert_to_Expr_if_int(self, n):
        if isinstance(n, int):
            return Expr(str(n))
        if isinstance(n, Expr):
            return n
        raise AssertionError # unsupported type
    def __xor__(self, other):
        return Expr("(" + self.s + "^" + self.convert_to_Expr_if_int(other).s + ")")
    def __mul__(self, other):
        return Expr("(" + self.s + "*" + self.convert_to_Expr_if_int(other).s + ")")
    def __add__(self, other):
        return Expr("(" + self.s + "+" + self.convert_to_Expr_if_int(other).s + ")")
    def __and__(self, other):
        return Expr("(" + self.s + "&" + self.convert_to_Expr_if_int(other).s + ")")
    def __lshift__(self, other):
        return Expr("(" + self.s + "<<" + self.convert_to_Expr_if_int(other).s + ")")
    def __rshift__(self, other):
        return Expr("(" + self.s + ">>" + self.convert_to_Expr_if_int(other).s + ")")
    def __getitem__(self, d):
        return Expr("(" + self.s + "[" + d.s + "])")

# reworked from:
# Pure Python (2.x) implementation of the XXTEA cipher
# (c) 2009. Ivan Voras
# Released under the BSD License.
def raw_xxtea(v, n, k): def MX(): return ((z>>5)^(y<<2)) + ((y>>3)^(z<<4))^(sum^y) + (k[(Expr(str(p)) & 3)^e]^z) y = v[0] sum = Expr("0") DELTA = 0x9e3779b9 # Encoding only z = v[n-1] # number of rounds: #q = 6 + 52 / n q=1 while q > 0: q -= 1 sum = sum + DELTA e = (sum >> 2) & 3 p = 0 while p < n - 1: y = v[p+1] z = v[p] = v[p] + MX() p += 1 y = v[0] z = v[n-1] = v[n-1] + MX() return 0 v=[Expr("input1"), Expr("input2"), Expr("input3"), Expr("input4")] k=Expr("key") raw_xxtea(v, 4, k) for i in range(4): print i, ":", v[i] #print len(str(v[0]))+len(str(v[1]))+len(str(v[2]))+len(str(v[3])) Akeyischoosenaccordingtoinputdata,and,obviously,wecan’tknowitduringsymbolicexecution,so weleaveexpressionlike k[...]. Nowresultsforjustoneround,foreachof4outputs: 0 : (input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+ ((key[((0&3)^(((0+2654435769)>>2)&3))])^input4)))) 1 : (input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^ input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+ ((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)& 3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+ ((key[((0&3)^(((0+2654435769)>>2)&3))])^input4)))))))) 2 : (input3+(((((input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^ (((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+ ((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^ input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+ ((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4))) 
^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))>>5)^(input4<<2))+ ((input4>>3)^((input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^ input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+ ((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)&3))])^ (input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))))))<<4)))^(((0+2654435769)^input4)+((key[((2&3)^(((0+2654435769)>>2)& 3))])^(input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^ 112 input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+ ((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)& 3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4)))))))))))) 3 : (input4+(((((input3+(((((input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^ (((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^ ((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)&3))])^ (input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))))))>>5)^(input4<<2))+((input4>>3)^((input2+(((((input1+((((input4>>5)^ 
(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^ input4))))>>5)^(input3<<2))+((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^ (((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+ ((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^ (((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))<<4)))^(((0+2654435769)^ input4)+((key[((2&3)^(((0+2654435769)>>2)&3))])^(input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^ (input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+ ((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+ ((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769) >> 2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))))))))))>>5)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<< 4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<2))+(((input1+((((input4>>5)^ (input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^ input4))))>>3)^((input3+(((((input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+ 2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+ ((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>> 2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>> 5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^ 
input4))))))))>>5)^(input4<<2))+((input4>>3)^((input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^ (input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+ ((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+ ((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+2654435769) >> 2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^ (((0+2654435769)>>2)&3))])^input4))))))))<<4)))^(((0+2654435769)^input4)+((key[((2&3)^(((0+2654435769)>>2)&3))]) ^ (input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3) ^ (((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+((((input4>>5)^(input2<<2))+ ((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^((( 0+2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^( input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))))))<<4)))^(((0+ 2654435769)^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3) ^ (((0+2654435769)>>2)&3))])^input4)))))+((key[((3&3)^(((0+2654435769)>>2)&3))])^(input3+(((((input2+(((((input1+ ((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>> 2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<< 4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+ ((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+ 2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))>>5)^(input4<<2))+((input4>>3)^(( 
input2+(((((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^
(((0+2654435769)>>2)&3))])^input4))))>>5)^(input3<<2))+((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((
input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+
2654435769)^input3)+((key[((1&3)^(((0+2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^
(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))<<4)))^(((0+
2654435769)^input4)+((key[((2&3)^(((0+2654435769)>>2)&3))])^(input2+(((((input1+((((input4>>5)^(input2<<2))+
((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))>>5)^
(input3<<2))+((input3>>3)^((input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^
input2)+((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))<<4)))^(((0+2654435769)^input3)+((key[((1&3)^(((0+
2654435769)>>2)&3))])^(input1+((((input4>>5)^(input2<<2))+((input2>>3)^(input4<<4)))^(((0+2654435769)^input2)+
((key[((0&3)^(((0+2654435769)>>2)&3))])^input4))))))))))))))))

The size of the expression for each subsequent output is clearly bigger (I hope I haven't made a mistake somewhere). And this is just for 1 round. For 2 rounds, the total size of all 4 expressions is 970KB. For 3 rounds, it is 115MB. For 4 rounds, I haven't enough RAM on my computer. The expressions are exploding exponentially, and there are 19 rounds. You can weigh that.

Perhaps you can simplify these expressions: there are a lot of excessive parentheses. But I'm highly pessimistic: crypto algorithms are constructed in such a way as to not have any spare operations.

In order to crack it, you could use these expressions as a system of equations and try to solve it with an SMT solver. This is called an "algebraic attack". In other words, theoretically, you can build a system of equations like MD5(x) = 12341234 ..., but the expressions are so huge that it's impossible to solve them.
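The blow-up can be reproduced in miniature. Below is a sketch with a hypothetical toy round function (not XXTEA): it builds string expressions exactly the way the Expr class above does, and because each round splices several copies of the previous expression into the new one, the length grows geometrically:

```python
# Toy demonstration of symbolic-expression blow-up.
# The "round function" here is hypothetical, invented only to show growth:
# the new x references the old x three times and the old y once, so the
# expression length grows geometrically with the number of rounds.
def toy_round(x, y):
    # Feistel-like: new x mixes both halves, new y is the old x
    return "((%s<<4)^(%s>>5)^(%s+%s))" % (x, x, y, x), x

def expr_sizes(rounds):
    x, y = "input1", "input2"
    sizes = []
    for _ in range(rounds):
        x, y = toy_round(x, y)
        sizes.append(len(x) + len(y))
    return sizes

print(expr_sizes(8))  # strictly increasing, roughly tripling per round
```

A real cipher round references its inputs many times per operation, which is exactly why the XXTEA expressions above explode even faster.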
Yes, cryptographers are fully aware of this, and one of the goals of a successful cipher is to make the expressions as big as possible, within a reasonable running time and algorithm size. Nevertheless, you can find numerous papers about breaking these cryptosystems with a reduced number of rounds: when the expression hasn't exploded yet, it is sometimes possible. This cannot be applied in practice, but such experiments produce some interesting theoretical results.

10.1.1 Attempts to break "serious" crypto

CryptoMiniSat itself exists to support the XOR operation, which is ubiquitous in cryptography.

•Bitcoin mining with a SAT solver: http://jheusser.github.io/2013/02/03/satcoin.html, https://github.com/msoos/sha256-sat-bitcoin.
•Alexander Semenov, attempts to break A5/1, etc. (Russian presentation)
•Vegard Nossum - SAT-based preimage attacks on SHA-1
•Algebraic Attacks on the Crypto-1 Stream Cipher in MiFare Classic and Oyster Cards
•Attacking Bivium Using SAT Solvers
•Extending SAT Solvers to Cryptographic Problems
•Applications of SAT Solvers to Cryptanalysis of Hash Functions
•Algebraic-Differential Cryptanalysis of DES

10.2 Amateur cryptography

This is what you can find in serial numbers, license keys, executable file packers, CTF (Capture the Flag) challenges, malware, etc. Sometimes even in ransomware (but rarely nowadays, in 2017). Amateur cryptography can often be broken using an SMT solver, or even KLEE.

Amateur cryptography is usually based not on theory, but on visual complexity: if its creator gets results which seem chaotic enough, often, one stops improving it further. This is security not even through obscurity, but through chaotic mess. It is also sometimes called "The Fallacy of Complex Manipulation" (see RFC 4086).

Devising your own crypto algorithm is a very tricky thing to do. It can be compared to devising your own PRNG. Even the famous Donald Knuth constructed one in 1959, and it was visually very complex, but, as it turned out in practice, it had a very short cycle of length 3178. [See also: The Art of Computer Programming, vol. II, page 4, (1997).]

The very first problem is that making an algorithm which can generate very long expressions is a tricky thing in itself. A common error is to use operations like XOR and rotations/permutations, which can't help much.
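The collapsing of chained operations is easy to check mechanically. Here is a minimal sketch (plain Python, arbitrary constants) verifying that chained XORs, chained additions and chained rotations each reduce to a single operation:

```python
import random

MASK = (1 << 32) - 1

def rol(x, n):
    # 32-bit rotate left
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

random.seed(1)
for _ in range(1000):
    x = random.getrandbits(32)
    # any chain of XOR constants is a single XOR constant:
    assert ((x ^ 0x1234) ^ 0x5678) == (x ^ (0x1234 ^ 0x5678))
    # any chain of additions is a single addition (mod 2^32):
    assert ((x + 0x1234) + 0x5678) & MASK == (x + (0x1234 + 0x5678)) & MASK
    # any chain of rotations is a single rotation:
    assert rol(rol(x, 7), 9) == rol(x, 16)
print("all reductions hold")
```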
Even worse: some people think that XORing a value several times can be better, like (x ^ 1234) ^ 5678. Obviously, these two XOR operations (or, more precisely, any number of them) can be reduced to a single one. The same story applies to chained operations like addition and subtraction: they can all be reduced to a single one as well. Real crypto algorithms, like IDEA, use several operations from different groups, like XOR, addition and multiplication. Applying them all in a specific order makes the resulting expression irreducible.

When I prepared this part, I tried to make an example of such an amateur hash function:

// copypasted from http://blog.regehr.org/archives/1063
uint32_t rotl32b (uint32_t x, uint32_t n)
{
	assert (n<32);
	if (!n) return x;
	return (x<<n) | (x>>(32-n));
}

uint32_t rotr32b (uint32_t x, uint32_t n)
{
	assert (n<32);
	if (!n) return x;
	return (x>>n) | (x<<(32-n));
}

void megahash (uint32_t buf[4])
{
	for (int i=0; i<4; i++)
	{
		uint32_t t0=buf[0]^0x12345678^buf[1];
		uint32_t t1=buf[1]^0xabcdef01^buf[2];
		uint32_t t2=buf[2]^0x23456789^buf[3];
		uint32_t t3=buf[3]^0x0abcdef0^buf[0];

		buf[0]=rotl32b(t0, 1);
		buf[1]=rotr32b(t1, 2);
		buf[2]=rotl32b(t2, 3);
		buf[3]=rotr32b(t3, 4);
	};
};

int main()
{
	uint32_t buf[4];

	klee_make_symbolic(buf, sizeof buf);

	megahash (buf);

	if (buf[0]==0x18f71ce6 // or whatever
		&& buf[1]==0xf37c2fc9
		&& buf[2]==0x1cfe96fe
		&& buf[3]==0x8c02c75e)
		klee_assert(0);
};

KLEE can break it with little effort. Functions of such complexity are common in shareware which checks license keys, etc.
Here is how we can make its work harder, by making the rotations dependent on the inputs; this makes the number of possible inputs much, much bigger:

void megahash (uint32_t buf[4])
{
	for (int i=0; i<16; i++)
	{
		uint32_t t0=buf[0]^0x12345678^buf[1];
		uint32_t t1=buf[1]^0xabcdef01^buf[2];
		uint32_t t2=buf[2]^0x23456789^buf[3];
		uint32_t t3=buf[3]^0x0abcdef0^buf[0];

		buf[0]=rotl32b(t0, t1&0x1F);
		buf[1]=rotr32b(t1, t2&0x1F);
		buf[2]=rotl32b(t2, t3&0x1F);
		buf[3]=rotr32b(t3, t0&0x1F);
	};
};

Addition (or modular addition, as cryptographers say) can make things even harder:

void megahash (uint32_t buf[4])
{
	for (int i=0; i<4; i++)
	{
		uint32_t t0=buf[0]^0x12345678^buf[1];
		uint32_t t1=buf[1]^0xabcdef01^buf[2];
		uint32_t t2=buf[2]^0x23456789^buf[3];
		uint32_t t3=buf[3]^0x0abcdef0^buf[0];

		buf[0]=rotl32b(t0, t2&0x1F)+t1;
		buf[1]=rotr32b(t1, t3&0x1F)+t2;
		buf[2]=rotl32b(t2, t1&0x1F)+t3;
		buf[3]=rotr32b(t3, t2&0x1F)+t0;
	};
};

As an exercise, you can try to make a block cipher which KLEE wouldn't break. This is quite a sobering experience. But even if you can, this is not a panacea: there is an array of other cryptanalytic methods to break it.

Summary: if you deal with amateur cryptography, you can always give KLEE and an SMT solver a try. Even more: sometimes you have only the decryption function, and if the algorithm is simple enough, KLEE or an SMT solver can reverse things back.

One fun thing to mention: if you try to implement an amateur crypto algorithm in Verilog/VHDL to run it on an FPGA, maybe in a brute-force way, you may find that EDA tools can optimize many things during synthesis (this is the word they use for "compilation") and leave the crypto algorithm much smaller/simpler than it was. Even if you try to implement the DES algorithm in bare metal with a fixed key, Altera Quartus can optimize the first round of it and make it smaller than the others.

10.2.1 Bugs

Another prominent feature of amateur cryptography is bugs. Bugs here are often left uncaught, because the output of the encrypting function visually looked "good enough" or "obfuscated enough", so the developer stopped working on it.
This is especially a feature of hash functions: when you work on a block cipher, you have to write two functions (encryption/decryption), while a hashing function is a single one.

The weirdest amateur encryption algorithm I ever saw encrypted only the odd bytes of the input block, while the even bytes were left untouched, so the input plaintext was partially visible in the resulting encrypted block. It was an encryption routine used in license key validation. It is hard to believe someone did this on purpose; most likely, it was just an unnoticed bug.

10.2.2 XOR ciphers

The simplest possible amateur cryptography is just the application of the XOR operation using some kind of table. Maybe even simpler. This is a real algorithm I once saw:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rcx=input;
	rdx=0x5D7E0D1F2E0F1F84;
	rax=rcx;
	rax*=rdx;
	rdx=0x388D76AEE8CB1500;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=rdx;
	rdx=0xD2E9EE7E83C4285B;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+rdx;
	rdx=0x8888888888888889;
	rax=r8;
	rax*=rdx;
	// RDX here is a high part of multiplication result
	rdx=rdx>>5;
	// RDX here is division result!
	rax=rdx;
	rcx=r8+rdx*4;
	rax=rax<<6;
	rcx=rcx-rax;
	rax=r8;
	rax=_lrotl (rax, rcx&0xFF); // rotate left
	return rax;
};

(This example was also used by Murphy Berzish in his lecture about SAT and SMT: http://mirror.csclub.uwaterloo.ca/csclub/mtrberzi-sat-smt-slides.pdf, http://mirror.csclub.uwaterloo.ca/csclub/mtrberzi-sat-smt.mp4.)

If you are careful enough, this code can be compiled and will even work in the same way as the original. Then we are going to rewrite it gradually, keeping in mind all the register usage. Attention and focus are very important here: any tiny typo may ruin all your work!

Here is the first step:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rcx=input;
	rdx=0x5D7E0D1F2E0F1F84;
	rax=rcx;
	rax*=rdx;
	rdx=0x388D76AEE8CB1500;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=rdx;
	rdx=0xD2E9EE7E83C4285B;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+rdx;
	rdx=0x8888888888888889;
	rax=r8;
	rax*=rdx;
	// RDX here is a high part of multiplication result
	rdx=rdx>>5;
	// RDX here is division result!
	rax=rdx;
	rcx=r8+rdx*4;
	rax=rax<<6;
	rcx=rcx-rax;
	rax=r8;
	rax=_lrotl (rax, rcx&0xFF); // rotate left
	return rax;
};

Next step:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rcx=input;
	rdx=0x5D7E0D1F2E0F1F84;
	rax=rcx;
	rax*=rdx;
	rdx=0x388D76AEE8CB1500;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=rdx;
	rdx=0xD2E9EE7E83C4285B;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+rdx;
	rdx=0x8888888888888889;
	rax=r8;
	rax*=rdx;
	// RDX here is a high part of multiplication result
	rdx=rdx>>5;
	// RDX here is division result!
	rax=rdx;
	rcx=(r8+rdx*4)-(rax<<6);
	rax=r8;
	rax=_lrotl (rax, rcx&0xFF); // rotate left
	return rax;
};

We can spot division implemented via multiplication. Indeed, let's calculate the divisor in Wolfram Mathematica:

Listing 1: Wolfram Mathematica
In[1]:=N[2^(64 + 5)/16^^8888888888888889]
Out[1]:=60.

We get this:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rcx=input;
	rdx=0x5D7E0D1F2E0F1F84;
	rax=rcx;
	rax*=rdx;
	rdx=0x388D76AEE8CB1500;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=rdx;
	rdx=0xD2E9EE7E83C4285B;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+rdx;
	rax=rdx=r8/60;
	rcx=(r8+rax*4)-(rax*64);
	rax=r8;
	rax=_lrotl (rax, rcx&0xFF); // rotate left
	return rax;
};

One more step:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rax=input;
	rax*=0x5D7E0D1F2E0F1F84;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=0x388D76AEE8CB1500;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+0xD2E9EE7E83C4285B;
	rcx=r8-(r8/60)*60;
	rax=r8;
	rax=_lrotl (rax, rcx&0xFF); // rotate left
	return rax;
};

By simple reduction, we finally see that it's calculating the remainder, not the quotient:

uint64_t f(uint64_t input)
{
	uint64_t rax, rbx, rcx, rdx, r8;

	rax=input;
	rax*=0x5D7E0D1F2E0F1F84;
	rax=_lrotr(rax, rax&0xF); // rotate right
	rax^=0x388D76AEE8CB1500;
	rax=_lrotl(rax, rax&0xF); // rotate left
	r8=rax+0xD2E9EE7E83C4285B;
	return _lrotl (r8, r8 % 60); // rotate left
};

We end up with this fancy formatted source code:

#include <stdio.h>
#include <stdint.h>

#define C1 0x5D7E0D1F2E0F1F84
#define C2 0x388D76AEE8CB1500
#define C3 0xD2E9EE7E83C4285B

uint64_t hash(uint64_t v)
{
	v*=C1;
	v=_lrotr(v, v&0xF); // rotate right
	v^=C2;
	v=_lrotl(v, v&0xF); // rotate left
	v+=C3;
	v=_lrotl(v, v % 60); // rotate left
	return v;
};

int main()
{
	printf ("%llu\n", hash(...));
};

Since we are not cryptanalysts, we can't find an easy way to generate an input value for some specific output value. The rotate instruction's coefficients look frightening: it's a warranty that the function is not bijective; it has collisions, or, putting it more simply, many inputs may be possible for one output. Brute force is not a solution, because the values are 64-bit ones: that's beyond reality.

10.3.2 Now let's use Z3

Still, without any special cryptographic knowledge, we may try to break this algorithm using Z3. Here is the Python source code:

 1 #!/usr/bin/env python
 2
 3 from z3 import *
 4
 5 C1=0x5D7E0D1F2E0F1F84
 6 C2=0x388D76AEE8CB1500
 7 C3=0xD2E9EE7E83C4285B
 8
 9 inp, i1, i2, i3, i4, i5, i6, outp = BitVecs('inp i1 i2 i3 i4 i5 i6 outp', 64)
10
11 s = Solver()
12 s.add(i1==inp*C1)
13 s.add(i2==RotateRight (i1, i1 & 0xF))
14 s.add(i3==i2 ^ C2)
15 s.add(i4==RotateLeft(i3, i3 & 0xF))
16 s.add(i5==i4 + C3)
17 s.add(outp==RotateLeft (i5, URem(i5, 60)))
18
19 s.add(outp==10816636949158156260)
20
21 print s.check()
22 m=s.model()
23 print m
24 print (" inp=0x%X" % m[inp].as_long())
25 print ("outp=0x%X" % m[outp].as_long())

This is going to be our first solver. We see the variable definitions on line 9: these are just 64-bit variables. i1..i6 are intermediate variables, representing the values in the registers between instruction executions. Then we add the so-called constraints on lines 12..17. The constraint on line 19 is the most important one: we are going to try to find an input value for which our algorithm produces 10816636949158156260. RotateRight, RotateLeft and URem are functions from the Z3 Python API, not related to the Python language itself.
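Before running the solver, the recovered algorithm can also be cross-checked with a plain Python model (a sketch: the rotate helpers are written by hand, with explicit 64-bit wraparound). One collision is visible without any solver at all: C1 is an even constant, so the input's top bit is destroyed by the multiplication modulo 2^64, and h(x) == h(x ^ 2^63) holds for every x:

```python
# Plain-Python model of the reconstructed hash function above.
C1 = 0x5D7E0D1F2E0F1F84
C2 = 0x388D76AEE8CB1500
C3 = 0xD2E9EE7E83C4285B
M64 = (1 << 64) - 1

def rol64(x, n):
    n %= 64
    return ((x << n) | (x >> (64 - n))) & M64

def ror64(x, n):
    n %= 64
    return ((x >> n) | (x << (64 - n))) & M64

def h(v):
    v = (v * C1) & M64
    v = ror64(v, v & 0xF)   # _lrotr(v, v&0xF)
    v ^= C2
    v = rol64(v, v & 0xF)   # _lrotl(v, v&0xF)
    v = (v + C3) & M64
    return rol64(v, v % 60)

# C1 is even, so flipping the top input bit never changes v*C1 mod 2^64:
x = 0x12EE577B63E80B73
assert h(x) == h(x ^ (1 << 63))
```

Note that 0x12EE577B63E80B73 and 0x12EE577B63E80B73 ^ 2^63 = 0x92EE577B63E80B73 are exactly the first two preimages the solver finds below, so the multiplication alone already accounts for some of the collisions.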
Then we run it:

...>python.exe 1.py
sat
[i1 = 3959740824832824396,
 i3 = 8957124831728646493,
 i5 = 10816636949158156260,
 inp = 1364123924608584563,
 outp = 10816636949158156260,
 i4 = 14065440378185297801,
 i2 = 4954926323707358301]
 inp=0x12EE577B63E80B73
outp=0x961C69FF0AEFD7E4

"sat" means "satisfiable", i.e., the solver was able to find at least one solution. The solution is printed in the square brackets; the last two lines are the input/output pair in hexadecimal form. Yes, indeed, if we run our function with 0x12EE577B63E80B73 as input, the algorithm produces the value we were looking for.

But, as we noticed before, the function we are working with is not bijective, so there may be other correct input values. Z3 is not capable of producing more than one result at once, but let's hack our example slightly by adding line 21, which implies "look for any result other than this one":

 1 #!/usr/bin/env python
 2
 3 from z3 import *
 4
 5 C1=0x5D7E0D1F2E0F1F84
 6 C2=0x388D76AEE8CB1500
 7 C3=0xD2E9EE7E83C4285B
 8
 9 inp, i1, i2, i3, i4, i5, i6, outp = BitVecs('inp i1 i2 i3 i4 i5 i6 outp', 64)
10
11 s = Solver()
12 s.add(i1==inp*C1)
13 s.add(i2==RotateRight (i1, i1 & 0xF))
14 s.add(i3==i2 ^ C2)
15 s.add(i4==RotateLeft(i3, i3 & 0xF))
16 s.add(i5==i4 + C3)
17 s.add(outp==RotateLeft (i5, URem(i5, 60)))
18
19 s.add(outp==10816636949158156260)
20
21 s.add(inp!=0x12EE577B63E80B73)
22
23 print s.check()
24 m=s.model()
25 print m
26 print (" inp=0x%X" % m[inp].as_long())
27 print ("outp=0x%X" % m[outp].as_long())

Indeed, it finds another correct result:

...>python.exe 2.py
sat
[i1 = 3959740824832824396,
 i3 = 8957124831728646493,
 i5 = 10816636949158156260,
 inp = 10587495961463360371,
 outp = 10816636949158156260,
 i4 = 14065440378185297801,
 i2 = 4954926323707358301]
 inp=0x92EE577B63E80B73
outp=0x961C69FF0AEFD7E4

This can be automated: each found result can be added as a constraint, and then the next result is searched for.
Here is a slightly more sophisticated example:

 1 #!/usr/bin/env python
 2
 3 from z3 import *
 4
 5 C1=0x5D7E0D1F2E0F1F84
 6 C2=0x388D76AEE8CB1500
 7 C3=0xD2E9EE7E83C4285B
 8
 9 inp, i1, i2, i3, i4, i5, i6, outp = BitVecs('inp i1 i2 i3 i4 i5 i6 outp', 64)
10
11 s = Solver()
12 s.add(i1==inp*C1)
13 s.add(i2==RotateRight (i1, i1 & 0xF))
14 s.add(i3==i2 ^ C2)
15 s.add(i4==RotateLeft(i3, i3 & 0xF))
16 s.add(i5==i4 + C3)
17 s.add(outp==RotateLeft (i5, URem(i5, 60)))
18
19 s.add(outp==10816636949158156260)
20
21 # copypasted from http://stackoverflow.com/questions/11867611/z3py-checking-all-solutions-for-equation
22 result=[]
23 while True:
24     if s.check() == sat:
25         m = s.model()
26         print m[inp]
27         result.append(m)
28         # Create a new constraint that blocks the current model
29         block = []
30         for d in m:
31             # d is a declaration
32             if d.arity() > 0:
33                 raise Z3Exception("uninterpreted functions are not supported")
34             # create a constant from declaration
35             c=d()
36             if is_array(c) or c.sort().kind() == Z3_UNINTERPRETED_SORT:
37                 raise Z3Exception("arrays and uninterpreted sorts are not supported")
38             block.append(c != m[d])
39         s.add(Or(block))
40     else:
41         print "results total=",len(result)
42         break

We got:

1364123924608584563
1234567890
9223372038089343698
4611686019661955794
13835058056516731602
3096040143925676201
12319412180780452009
7707726162353064105
16931098199207839913
1906652839273745429
11130024876128521237
15741710894555909141
6518338857701133333
5975809943035972467
15199181979890748275
10587495961463360371
results total= 16

So there are 16 correct input values which all produce 10816636949158156260 as a result. The second one is 1234567890: it is indeed the value I originally used while preparing this example.

Let's also try to research our algorithm a bit more. Acting on a sadistic whim, let's find out whether there are any possible input/output pairs in which the lower 32-bit parts are equal to each other.
Let’sremovethe outpconstraintandaddanother,atline17: 1#!/usr/bin/env python 2 3from z3 import * 4 5C1=0x5D7E0D1F2E0F1F84 6C2=0x388D76AEE8CB1500 7C3=0xD2E9EE7E83C4285B 8 9inp, i1, i2, i3, i4, i5, i6, outp = BitVecs('inp i1 i2 i3 i4 i5 i6 outp', 64) 10 11 s = Solver() 12 s.add(i1==inp*C1) 13 s.add(i2==RotateRight (i1, i1 & 0xF)) 14 s.add(i3==i2 ^ C2) 15 s.add(i4==RotateLeft(i3, i3 & 0xF)) 16 s.add(i5==i4 + C3) 17 s.add(outp==RotateLeft (i5, URem(i5, 60))) 18 19 s.add(outp & 0xFFFFFFFF == inp & 0xFFFFFFFF) 20 21 print s.check() 22 m=s.model() 23 print m 24 print (" inp=0x%X" % m[inp].as_long()) 25 print ("outp=0x%X" % m[outp].as_long()) Itisindeedso: 122 sat [i1 = 14869545517796235860, i3 = 8388171335828825253, i5 = 6918262285561543945, inp = 1370377541658871093, outp = 14543180351754208565, i4 = 10167065714588685486, i2 = 5541032613289652645] inp=0x13048F1D12C00535 outp=0xC9D3C17A12C00535 Let’sbemoresadisticandaddanotherconstraint: last16bitsmustbe 0x1234: 1#!/usr/bin/env python 2 3from z3 import * 4 5C1=0x5D7E0D1F2E0F1F84 6C2=0x388D76AEE8CB1500 7C3=0xD2E9EE7E83C4285B 8 9inp, i1, i2, i3, i4, i5, i6, outp = BitVecs('inp i1 i2 i3 i4 i5 i6 outp', 64) 10 11 s = Solver() 12 s.add(i1==inp*C1) 13 s.add(i2==RotateRight (i1, i1 & 0xF)) 14 s.add(i3==i2 ^ C2) 15 s.add(i4==RotateLeft(i3, i3 & 0xF)) 16 s.add(i5==i4 + C3) 17 s.add(outp==RotateLeft (i5, URem(i5, 60))) 18 19 s.add(outp & 0xFFFFFFFF == inp & 0xFFFFFFFF) 20 s.add(outp & 0xFFFF == 0x1234) 21 22 print s.check() 23 m=s.model() 24 print m 25 print (" inp=0x%X" % m[inp].as_long()) 26 print ("outp=0x%X" % m[outp].as_long()) Ohyes,thispossibleaswell: sat [i1 = 2834222860503985872, i3 = 2294680776671411152, i5 = 17492621421353821227, inp = 461881484695179828, outp = 419247225543463476, i4 = 2294680776671411152, i2 = 2834222860503985872] inp=0x668EEC35F961234 outp=0x5D177215F961234 Z3worksveryfastanditimpliesthatthealgorithmisweak,itisnotcryptographicatall(likethemostof theamateurcryptography). 11SAT-solvers SMTvs. 
SAT is like a high-level PL vs. assembly language: the latter can be much more efficient, but it is harder to program in.

11.1 CNF form

CNF (conjunctive normal form, https://en.wikipedia.org/wiki/Conjunctive_normal_form) is a normal form. Any boolean expression can be converted to a normal form, and CNF is one of them. A CNF expression is a bunch of clauses (sub-expressions) consisting of terms (variables), ORs and NOTs, all of which are then glued together with AND into a full expression. There is a way to memorize it: CNF is "AND of ORs" (or "product of sums"), while DNF (disjunctive normal form) is "OR of ANDs" (or "sum of products").

An example: (¬A ∨ B) ∧ (C ∨ ¬D).

∨ stands for OR (logical disjunction, https://en.wikipedia.org/wiki/Logical_disjunction); the "+" sign is also sometimes used for OR. ∧ stands for AND (logical conjunction, https://en.wikipedia.org/wiki/Logical_conjunction); it is easy to memorize, because ∧ looks like the letter "A". The "·" sign is also sometimes used for AND. ¬ is negation (NOT).

11.2 Example: 2-bit adder

A SAT solver is merely a solver of huge boolean equations in CNF form. It just gives the answer: whether there is a set of input values which can satisfy the CNF expression, and what those input values must be.

Here is a 2-bit adder, for example:

Figure 9: 2-bit adder circuit

The adder is in its simplest form: it has no carry-in and carry-out, and it consists of 3 XOR gates and one AND gate. Let's try to figure out which sets of input values will force the adder to set both output bits. By a quick mental calculation, we can see that there are 4 ways to do so: 0+3=3, 1+2=3, 2+1=3, 3+0=3. Here is also the truth table, with these rows marked by asterisks:

                   aH aL bH bL | qH qL
   3+3=6  2(mod4)   1  1  1  1 |  1  0
   3+2=5  1(mod4)   1  1  1  0 |  0  1
   3+1=4  0(mod4)   1  1  0  1 |  0  0
 * 3+0=3  3(mod4)   1  1  0  0 |  1  1
   2+3=5  1(mod4)   1  0  1  1 |  0  1
   2+2=4  0(mod4)   1  0  1  0 |  0  0
 * 2+1=3  3(mod4)   1  0  0  1 |  1  1
   2+0=2  2(mod4)   1  0  0  0 |  1  0
   1+3=4  0(mod4)   0  1  1  1 |  0  0
 * 1+2=3  3(mod4)   0  1  1  0 |  1  1
   1+1=2  2(mod4)   0  1  0  1 |  1  0
   1+0=1  1(mod4)   0  1  0  0 |  0  1
 * 0+3=3  3(mod4)   0  0  1  1 |  1  1
   0+2=2  2(mod4)   0  0  1  0 |  1  0
   0+1=1  1(mod4)   0  0  0  1 |  0  1
   0+0=0  0(mod4)   0  0  0  0 |  0  0

Let's find what a SAT solver can say about it. First, we should represent our 2-bit adder as a CNF expression.
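The hand count above can be confirmed by brute force in a few lines (a sketch; the bit naming follows the truth table):

```python
from itertools import product

# Enumerate all 2-bit additions and keep those where both output bits are set,
# i.e. (a + b) mod 4 == 3.
hits = [(a, b) for a, b in product(range(4), repeat=2) if (a + b) & 3 == 3]
assert hits == [(0, 3), (1, 2), (2, 1), (3, 0)]

# The same condition expressed over individual bits, as in the circuit:
# qL = aL xor bL; qH = aH xor bH xor (aL and bL).
def both_bits_set(aH, aL, bH, bL):
    qL = aL ^ bL
    qH = aH ^ bH ^ (aL & bL)
    return qH == 1 and qL == 1

count = sum(both_bits_set(*bits) for bits in product((0, 1), repeat=4))
assert count == 4
print("4 ways, as expected")
```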
Using Wolfram Mathematica, we can express boolean expressions for both adder outputs:

In[]:=AdderQ0[aL_,bL_]=Xor[aL,bL]
Out[]:=aL ⊻ bL

In[]:=AdderQ1[aL_,aH_,bL_,bH_]=Xor[And[aL,bL],Xor[aH,bH]]
Out[]:=aH ⊻ bH ⊻ (aL && bL)

We need an expression for which both parts generate 1's. Let's use Wolfram Mathematica to find all instances of such an expression (I glued both parts together with And):

In[]:=Boole[SatisfiabilityInstances[And[AdderQ0[aL,bL],AdderQ1[aL,aH,bL,bH]],{aL,aH,bL,bH},4]]
Out[]:={1,1,0,0},{1,0,0,1},{0,1,1,0},{0,0,1,1}

Yes, indeed, Mathematica says there are 4 inputs which lead to the result we need; so Mathematica can also be used as a SAT solver. Nevertheless, let's proceed to CNF form. Using Mathematica again, let's convert our expression to CNF:

In[]:=cnf=BooleanConvert[And[AdderQ0[aL,bL],AdderQ1[aL,aH,bL,bH]],"CNF"]
Out[]:=(!aH ∥ !bH) && (aH ∥ bH) && (!aL ∥ !bL) && (aL ∥ bL)

It looks more complex. The reason for such verbosity is that CNF form doesn't allow XOR operations.

11.2.1 MiniSat

For starters, we can try MiniSat (http://minisat.se/MiniSat.html). The standard way to encode a CNF expression for MiniSat is to enumerate all the OR parts, one per line. Also, MiniSat doesn't support variable names, just numbers. Let's enumerate our variables: 1 will be aH, 2 – aL, 3 – bH, 4 – bL.

Here is what I got when I converted the Mathematica expression into the MiniSat input file:

p cnf 4 4
-1 -3 0
1 3 0
-2 -4 0
2 4 0

The two 4's on the first line are the number of variables and the number of clauses, respectively. Then there are 4 lines, one for each OR clause. A minus before a variable number means that the variable is negated; the absence of a minus means it is not negated. The zero at the end is just a terminating zero, meaning the end of the clause. In other words, each line is an OR clause with optional negations, and the task of MiniSat is to find a set of inputs which can satisfy all lines in the input file.
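The DIMACS format just described is simple enough that a toy brute-force checker fits in a few lines of Python (a sketch with hypothetical helper names; feasible only for a handful of variables, which is precisely the regime where a real SAT solver is unnecessary):

```python
from itertools import product

def parse_dimacs(text):
    """Parse DIMACS CNF: skip 'c'/'p' lines; each clause ends with 0."""
    clauses = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "cp":
            continue
        lits = [int(t) for t in line.split()]
        assert lits[-1] == 0   # terminating zero
        clauses.append(lits[:-1])
    return clauses

def all_solutions(clauses, nvars):
    # literal v is true iff bits[v-1]; literal -v is true iff not bits[v-1]
    for bits in product((False, True), repeat=nvars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            yield bits

adder_cnf = """p cnf 4 4
-1 -3 0
1 3 0
-2 -4 0
2 4 0
"""
sols = list(all_solutions(parse_dimacs(adder_cnf), 4))
assert len(sols) == 4   # matches the truth table: aH != bH and aL != bL
```

The four clauses encode exactly aH ≠ bH and aL ≠ bL, so 2·2 = 4 assignments survive; the enumeration-by-blocking trick shown next with MiniSat recovers the same four, one at a time.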
I named that file adder.cnf; now let's try MiniSat:

% minisat -verb=0 adder.cnf results.txt
SATISFIABLE

The results are in the results.txt file:

SAT
-1 -2 3 4 0

This means that if the first two variables (aH and aL) are false, and the last two variables (bH and bL) are set to true, the whole CNF expression is satisfiable. This seems to be true: if bH and bL are the only inputs set to true, both resulting bits are also in the true state.

Now, how do we get the other instances? SAT solvers, like SMT solvers, produce only one solution (or instance). MiniSat uses a PRNG, and its initial seed can be set explicitly; I tried different values, but the result is still the same. Nevertheless, CryptoMiniSat was in this case able to show all 4 possible instances, though in chaotic order. So this is not a very robust way. Perhaps the only known way is to negate the solution clause and add it to the input expression. We've got -1 -2 3 4; now we negate all the values in it (just toggle the minuses: 1 2 -3 -4) and add it to the end of the input file:

p cnf 4 5
-1 -3 0
1 3 0
-2 -4 0
2 4 0
1 2 -3 -4 0

Now we've got another result:

SAT
1 2 -3 -4 0

This means aH and aL must both be true, and bH and bL must be false, to satisfy the input expression. Let's negate this clause and add it again:

p cnf 4 6
-1 -3 0
1 3 0
-2 -4 0
2 4 0
1 2 -3 -4 0
-1 -2 3 4 0

The result is:

SAT
-1 2 3 -4 0

aH=false, aL=true, bH=true, bL=false. This is also correct, according to our truth table. Let's add it again:

p cnf 4 7
-1 -3 0
1 3 0
-2 -4 0
2 4 0
1 2 -3 -4 0
-1 -2 3 4 0
1 -2 -3 4 0

SAT
1 -2 -3 4 0

aH=true, aL=false, bH=false, bL=true. This is also correct. This is the fourth result, and there shouldn't be more. What if we add it anyway?

p cnf 4 8
-1 -3 0
1 3 0
-2 -4 0
2 4 0
1 2 -3 -4 0
-1 -2 3 4 0
1 -2 -3 4 0
-1 2 3 -4 0

Now MiniSat just says "UNSATISFIABLE", without any additional information in the resulting file. Our example is tiny, but MiniSat can work with huge CNF expressions.

11.2.2 CryptoMiniSat

The XOR operation is absent in CNF form, but crucial in cryptographic algorithms.
The simplest possible way to represent a single XOR operation in CNF form is (¬x ∨ ¬y) ∧ (x ∨ y); not that small an expression, though many XOR operations in a single expression can be optimized better.

One significant difference between MiniSat and CryptoMiniSat is that the latter supports clauses with XOR operations instead of ORs, because CryptoMiniSat aims to analyze crypto algorithms (http://www.msoos.org/xor-clauses/). XOR clauses are handled by CryptoMiniSat in a special way, without translating them to OR clauses. You just need to prepend a clause with "x" in the CNF file, and the OR clause is then treated as an XOR clause by CryptoMiniSat. As for the 2-bit adder, this smallest possible XOR-CNF expression can be used to find all inputs where both output adder bits are set:

(aH ⊕ bH) ∧ (aL ⊕ bL)

This is the .cnf file for CryptoMiniSat:

p cnf 4 2
x1 3 0
x2 4 0

Now I run CryptoMiniSat with various random values to initialize its PRNG...

% cryptominisat4 --verb 0 --random 0 XOR_adder.cnf
s SATISFIABLE
v 1 2 -3 -4 0
% cryptominisat4 --verb 0 --random 1 XOR_adder.cnf
s SATISFIABLE
v -1 -2 3 4 0
% cryptominisat4 --verb 0 --random 2 XOR_adder.cnf
s SATISFIABLE
v 1 -2 -3 4 0
% cryptominisat4 --verb 0 --random 3 XOR_adder.cnf
s SATISFIABLE
v 1 2 -3 -4 0
% cryptominisat4 --verb 0 --random 4 XOR_adder.cnf
s SATISFIABLE
v -1 2 3 -4 0
% cryptominisat4 --verb 0 --random 5 XOR_adder.cnf
s SATISFIABLE
v -1 2 3 -4 0
% cryptominisat4 --verb 0 --random 6 XOR_adder.cnf
s SATISFIABLE
Thiswillbeafunction,returning Trueif anyof2bitsof8inputsbitsare Trueandothersare False. First,wemaketruthtableofsuchfunction: In[]:= tbl2 = Table[PadLeft[IntegerDigits[i, 2], 8] -> If[Equal[DigitCount[i, 2][[1]], 2], 1, 0], {i, 0, 255}] Out[]= {{0, 0, 0, 0, 0, 0, 0, 0} -> 0, {0, 0, 0, 0, 0, 0, 0, 1} -> 0, {0, 0, 0, 0, 0, 0, 1, 0} -> 0, {0, 0, 0, 0, 0, 0, 1, 1} -> 1, {0, 0, 0, 0, 0, 1, 0, 0} -> 0, {0, 0, 0, 0, 0, 1, 0, 1} -> 1, {0, 0, 0, 0, 0, 1, 1, 0} -> 1, {0, 0, 0, 0, 0, 1, 1, 1} -> 0, {0, 0, 0, 0, 1, 0, 0, 0} -> 0, {0, 0, 0, 0, 1, 0, 0, 1} -> 1, {0, 0, 0, 0, 1, 0, 1, 0} -> 1, {0, 0, 0, 0, 1, 0, 1, 1} -> 0, ... {1, 1, 1, 1, 1, 0, 1, 0} -> 0, {1, 1, 1, 1, 1, 0, 1, 1} -> 0, {1, 1, 1, 1, 1, 1, 0, 0} -> 0, {1, 1, 1, 1, 1, 1, 0, 1} -> 0, {1, 1, 1, 1, 1, 1, 1, 0} -> 0, {1, 1, 1, 1, 1, 1, 1, 1} -> 0} Nowwecanmake CNFexpressionusingthistruthtable: In[]:= BooleanConvert[ BooleanFunction[tbl2, {a, b, c, d, e, f, g, h}], "CNF"] Out[]= (! a || ! b || ! c) && (! a || ! b || ! d) && (! a || ! b || ! e) && (! a || ! b || ! f) && (! a || ! b || ! g) && (! a || ! b || ! h) && (! a || ! c || ! d) && (! a || ! c || ! e) && (! a || ! c || ! f) && (! a || ! c || ! g) && (! a || ! c || ! h) && (! a || ! d || ! e) && (! a || ! d || ! f) && (! a || ! d || ! g) && (! a || ! d || ! h) && (! a || ! e || ! f) && (! a || ! e || ! g) && (! a || ! e || ! h) && (! a || ! f || ! g) && (! a || ! f || ! h) && (! a || ! g || ! h) && (a || b || c || d || e || f || g) && (a || b || c || d || e || f || h) && (a || b || c || d || e || g || h) && (a || b || c || d || f || g || h) && (a || b || c || e || f || g || h) && (a || b || d || e || f || g || h) && (a || c || d || e || f || g || h) && (! b || ! c || ! d) && (! b || ! c || ! e) && (! b || ! c || ! f) && (! b || ! c || ! g) && (! b || ! c || ! h) && (! b || ! d || ! e) && (! b || ! d || ! f) && (! b || ! d || ! g) && (! b || ! d || ! h) && (! b || ! e || ! f) && (! b || ! e || ! g) && (! b || ! e || ! h) && (! b || ! f || ! g) && (! b || ! f || ! 
h) && (! b || ! g || ! h) && (b || c || d || e || f || g || h) && (! c || ! d || ! e) && (! c || ! d || ! f) && (! c || ! d || ! g) && (! c || ! d || ! h) && (! c || ! e || ! f) && (! 128 c || ! e || ! g) && (! c || ! e || ! h) && (! c || ! f || ! g) && (! c || ! f || ! h) && (! c || ! g || ! h) && (! d || ! e || ! f) && (! d || ! e || ! g) && (! d || ! e || ! h) && (! d || ! f || ! g) && (! d || ! f || ! h) && (! d || ! g || ! h) && (! e || ! f || ! g) && (! e || ! f || ! h) && (! e || ! g || ! h) && (! f || ! g || ! h) ThesyntaxissimilartoC/C++. Let’scheckit. IwroteaPythonfunctiontoconvertMathematica’soutputinto CNFfilewhichcanbefeededtoSATsolver: #!/usr/bin/python import subprocess def mathematica_to_CNF (s, a): s=s.replace("a", a[0]).replace("b", a[1]).replace("c", a[2]).replace("d", a[3]) s=s.replace("e", a[4]).replace("f", a[5]).replace("g", a[6]).replace("h", a[7]) s=s.replace("!", "-").replace("||", " ").replace("(", "").replace(")", "") s=s.split ("&&") return s def POPCNT2 (a): s="(!a||!b||!c)&&(!a||!b||!d)&&(!a||!b||!e)&&(!a||!b||!f)&&(!a||!b||!g)&&(!a||!b||!h)&&(!a||!c||!d)&&" \ "(!a||!c||!e)&&(!a||!c||!f)&&(!a||!c||!g)&&(!a||!c||!h)&&(!a||!d||!e)&&(!a||!d||!f)&&(!a||!d||!g)&&" \ "(!a||!d||!h)&&(!a||!e||!f)&&(!a||!e||!g)&&(!a||!e||!h)&&(!a||!f||!g)&&(!a||!f||!h)&&(!a||!g||!h)&&" \ "(a||b||c||d||e||f||g)&&(a||b||c||d||e||f||h)&&(a||b||c||d||e||g||h)&&(a||b||c||d||f||g||h)&&" \ "(a||b||c||e||f||g||h)&&(a||b||d||e||f||g||h)&&(a||c||d||e||f||g||h)&&(!b||!c||!d)&&(!b||!c||!e)&&" \ "(!b||!c||!f)&&(!b||!c||!g)&&(!b||!c||!h)&&(!b||!d||!e)&&(!b||!d||!f)&&(!b||!d||!g)&&(!b||!d||!h)&&" \ "(!b||!e||!f)&&(!b||!e||!g)&&(!b||!e||!h)&&(!b||!f||!g)&&(!b||!f||!h)&&(!b||!g||!h)&&(b||c||d||e||f||g||h) &&" \ "(!c||!d||!e)&&(!c||!d||!f)&&(!c||!d||!g)&&(!c||!d||!h)&&(!c||!e||!f)&&(!c||!e||!g)&&(!c||!e||!h)&&" \ "(!c||!f||!g)&&(!c||!f||!h)&&(!c||!g||!h)&&(!d||!e||!f)&&(!d||!e||!g)&&(!d||!e||!h)&&(!d||!f||!g)&&" \ 
"(!d||!f||!h)&&(!d||!g||!h)&&(!e||!f||!g)&&(!e||!f||!h)&&(!e||!g||!h)&&(!f||!g||!h)" return mathematica_to_CNF(s, a) clauses=POPCNT2(["1","2","3","4","5","6","7","8"]) f=open("tmp.cnf", "w") f.write ("p cnf 8 "+str(len(clauses))+"\n") for c in clauses: f.write(c+" 0\n") f.close() Itreplacesa/b/c/...variablestothevariablenamespassed(1/2/3...),reworkssyntax,etc.Hereisaresult: p cnf 8 64 -1 -2 -3 0 -1 -2 -4 0 -1 -2 -5 0 -1 -2 -6 0 -1 -2 -7 0 -1 -2 -8 0 -1 -3 -4 0 -1 -3 -5 0 -1 -3 -6 0 -1 -3 -7 0 -1 -3 -8 0 -1 -4 -5 0 -1 -4 -6 0 -1 -4 -7 0 -1 -4 -8 0 -1 -5 -6 0 -1 -5 -7 0 -1 -5 -8 0 -1 -6 -7 0 -1 -6 -8 0 -1 -7 -8 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 8 0 1 2 3 4 5 7 8 0 1 2 3 4 6 7 8 0 1 2 3 5 6 7 8 0 1 2 4 5 6 7 8 0 1 3 4 5 6 7 8 0 -2 -3 -4 0 -2 -3 -5 0 129 -2 -3 -6 0 -2 -3 -7 0 -2 -3 -8 0 -2 -4 -5 0 -2 -4 -6 0 -2 -4 -7 0 -2 -4 -8 0 -2 -5 -6 0 -2 -5 -7 0 -2 -5 -8 0 -2 -6 -7 0 -2 -6 -8 0 -2 -7 -8 0 2 3 4 5 6 7 8 0 -3 -4 -5 0 -3 -4 -6 0 -3 -4 -7 0 -3 -4 -8 0 -3 -5 -6 0 -3 -5 -7 0 -3 -5 -8 0 -3 -6 -7 0 -3 -6 -8 0 -3 -7 -8 0 -4 -5 -6 0 -4 -5 -7 0 -4 -5 -8 0 -4 -6 -7 0 -4 -6 -8 0 -4 -7 -8 0 -5 -6 -7 0 -5 -6 -8 0 -5 -7 -8 0 -6 -7 -8 0 Icanrunit: % minisat -verb=0 tst1.cnf results.txt SATISFIABLE % cat results.txt SAT 1 -2 -3 -4 -5 -6 -7 8 0 Thevariablenameinresultslackingminussignis True. Variablenamewithminussignis False. Wesee therearejusttwovariablesare True: 1and8. Thisisindeedcorrect: MiniSatsolverfoundacondition,for whichourfunctionreturns True. Zeroattheendisjustaterminalsymbolwhichmeansnothing. We can ask MiniSat for another solution, by adding current solution to the input CNF file, but with all variablesnegated: ... -5 -6 -8 0 -5 -7 -8 0 -6 -7 -8 0 -1 2 3 4 5 6 7 -8 0 InplainEnglishlanguage, thismeans“givemeANYsolutionwhichcansatisfyallclauses, butalsonot equaltothelastclausewe’vejustadded”. 
MiniSat, indeed, found another solution, again with only 2 variables equal to True:

% minisat -verb=0 tst2.cnf results.txt
SATISFIABLE
% cat results.txt
SAT
1 2 -3 -4 -5 -6 -7 -8 0

By the way, the population count function for 8 neighbours (POPCNT8) in CNF form is the simplest:

a&&b&&c&&d&&e&&f&&g&&h

Indeed: it's true if all 8 input bits are True.

The function for 0 neighbours (POPCNT0) is also simple:

!a&&!b&&!c&&!d&&!e&&!f&&!g&&!h

It means it will return True if all input variables are False.

By the way, the POPCNT1 function is also simple:

(!a||!b)&&(!a||!c)&&(!a||!d)&&(!a||!e)&&(!a||!f)&&(!a||!g)&&(!a||!h)&&(a||b||c||d||e||f||g||h)&&
(!b||!c)&&(!b||!d)&&(!b||!e)&&(!b||!f)&&(!b||!g)&&(!b||!h)&&(!c||!d)&&(!c||!e)&&(!c||!f)&&(!c||!g)&&
(!c||!h)&&(!d||!e)&&(!d||!f)&&(!d||!g)&&(!d||!h)&&(!e||!f)&&(!e||!g)&&(!e||!h)&&(!f||!g)&&(!f||!h)&&(!g||!h)

This is just an enumeration of all possible pairs of the 8 variables (a/b, a/c, a/d, etc.), which implies: no two bits may be set simultaneously in any possible pair. And there is one more clause: "(a||b||c||d||e||f||g||h)", which implies: at least one bit must be set among the 8 variables.

And yes, you can ask Mathematica to find CNF expressions for any other truth table.

11.3.2 Minesweeper

Now we can use Mathematica to generate all population count functions for 0..8 neighbours.

For a 9*9 Minesweeper matrix including an invisible border, there will be 11*11 = 121 variables, mapped to the Minesweeper matrix like this:

1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44
...
100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121

Then we write a Python script which stacks all the population count functions: one function for each known number of neighbours (digit on the Minesweeper field). Each POPCNTx() function takes a list of variable numbers and outputs a list of clauses to be added to the final CNF file.

As for the empty cells, we also add them as clauses, but with a minus sign, which means the variable must be False. Whenever we try to place a bomb, we add its variable as a clause without a minus sign; this means the variable must be True.
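If in doubt, encodings like POPCNT1 can be sanity-checked exhaustively. The following Python sketch is mine (not from the repository): it evaluates the "all pairs" + "at least one bit" clauses directly over all 256 inputs and compares the result against an actual popcount of 1:

```python
from itertools import product

def popcnt1_cnf(a, b, c, d, e, f, g, h):
    bits = [a, b, c, d, e, f, g, h]
    # "no two bits set simultaneously": every pair has at least one zero
    pairs_ok = all(not (x and y) for i, x in enumerate(bits)
                   for y in bits[i+1:])
    # "at least one bit set among the 8 variables"
    at_least_one = any(bits)
    return pairs_ok and at_least_one

# exhaustive comparison against the intended semantics (popcount == 1)
ok = all(popcnt1_cnf(*bits) == (sum(bits) == 1)
         for bits in product([0, 1], repeat=8))
print(ok)  # True
```

The same exhaustive trick works for POPCNT2..POPCNT7; 256 cases each, so it is instant.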
Then we execute an external minisat process. The only thing we need from it is the exit code: if an input CNF is UNSAT, it returns 20.

We use here the information from the previous solving of Minesweeper: 5.11.

#!/usr/bin/python

import subprocess

WIDTH=9
HEIGHT=9
VARS_TOTAL=(WIDTH+2)*(HEIGHT+2)

known=[
"01?10001?",
"01?100011",
"011100000",
"000000000",
"111110011",
"????1001?",
"????3101?",
"?????211?",
"?????????"]

def mathematica_to_CNF (s, a):
    s=s.replace("a", a[0]).replace("b", a[1]).replace("c", a[2]).replace("d", a[3])
    s=s.replace("e", a[4]).replace("f", a[5]).replace("g", a[6]).replace("h", a[7])
    s=s.replace("!", "-").replace("||", " ").replace("(", "").replace(")", "")
    s=s.split ("&&")
    return s

def POPCNT0 (a):
    s="!a&&!b&&!c&&!d&&!e&&!f&&!g&&!h"
    return mathematica_to_CNF(s, a)

def POPCNT1 (a):
    s="(!a||!b)&&(!a||!c)&&(!a||!d)&&(!a||!e)&&(!a||!f)&&(!a||!g)&&(!a||!h)&&(a||b||c||d||e||f||g||h)&&" \
    "(!b||!c)&&(!b||!d)&&(!b||!e)&&(!b||!f)&&(!b||!g)&&(!b||!h)&&(!c||!d)&&(!c||!e)&&(!c||!f)&&(!c||!g)&&" \
    "(!c||!h)&&(!d||!e)&&(!d||!f)&&(!d||!g)&&(!d||!h)&&(!e||!f)&&(!e||!g)&&(!e||!h)&&(!f||!g)&&(!f||!h)&&(!g||!h)"
    return mathematica_to_CNF(s, a)

def POPCNT2 (a):
    s="(!a||!b||!c)&&(!a||!b||!d)&&(!a||!b||!e)&&(!a||!b||!f)&&(!a||!b||!g)&&(!a||!b||!h)&&(!a||!c||!d)&&" \
    "(!a||!c||!e)&&(!a||!c||!f)&&(!a||!c||!g)&&(!a||!c||!h)&&(!a||!d||!e)&&(!a||!d||!f)&&(!a||!d||!g)&&" \
    "(!a||!d||!h)&&(!a||!e||!f)&&(!a||!e||!g)&&(!a||!e||!h)&&(!a||!f||!g)&&(!a||!f||!h)&&(!a||!g||!h)&&" \
    "(a||b||c||d||e||f||g)&&(a||b||c||d||e||f||h)&&(a||b||c||d||e||g||h)&&(a||b||c||d||f||g||h)&&" \
    "(a||b||c||e||f||g||h)&&(a||b||d||e||f||g||h)&&(a||c||d||e||f||g||h)&&(!b||!c||!d)&&(!b||!c||!e)&&" \
    "(!b||!c||!f)&&(!b||!c||!g)&&(!b||!c||!h)&&(!b||!d||!e)&&(!b||!d||!f)&&(!b||!d||!g)&&(!b||!d||!h)&&" \
    "(!b||!e||!f)&&(!b||!e||!g)&&(!b||!e||!h)&&(!b||!f||!g)&&(!b||!f||!h)&&(!b||!g||!h)&&(b||c||d||e||f||g||h)&&" \
    "(!c||!d||!e)&&(!c||!d||!f)&&(!c||!d||!g)&&(!c||!d||!h)&&(!c||!e||!f)&&(!c||!e||!g)&&(!c||!e||!h)&&" \
"(!c||!f||!g)&&(!c||!f||!h)&&(!c||!g||!h)&&(!d||!e||!f)&&(!d||!e||!g)&&(!d||!e||!h)&&(!d||!f||!g)&&" \ "(!d||!f||!h)&&(!d||!g||!h)&&(!e||!f||!g)&&(!e||!f||!h)&&(!e||!g||!h)&&(!f||!g||!h)" return mathematica_to_CNF(s, a) def POPCNT3 (a): s="(!a||!b||!c||!d)&&(!a||!b||!c||!e)&&(!a||!b||!c||!f)&&(!a||!b||!c||!g)&&(!a||!b||!c||!h)&&" \ "(!a||!b||!d||!e)&&(!a||!b||!d||!f)&&(!a||!b||!d||!g)&&(!a||!b||!d||!h)&&(!a||!b||!e||!f)&&" \ "(!a||!b||!e||!g)&&(!a||!b||!e||!h)&&(!a||!b||!f||!g)&&(!a||!b||!f||!h)&&(!a||!b||!g||!h)&&" \ "(!a||!c||!d||!e)&&(!a||!c||!d||!f)&&(!a||!c||!d||!g)&&(!a||!c||!d||!h)&&(!a||!c||!e||!f)&&" \ "(!a||!c||!e||!g)&&(!a||!c||!e||!h)&&(!a||!c||!f||!g)&&(!a||!c||!f||!h)&&(!a||!c||!g||!h)&&" \ "(!a||!d||!e||!f)&&(!a||!d||!e||!g)&&(!a||!d||!e||!h)&&(!a||!d||!f||!g)&&(!a||!d||!f||!h)&&" \ "(!a||!d||!g||!h)&&(!a||!e||!f||!g)&&(!a||!e||!f||!h)&&(!a||!e||!g||!h)&&(!a||!f||!g||!h)&&" \ "(a||b||c||d||e||f)&&(a||b||c||d||e||g)&&(a||b||c||d||e||h)&&(a||b||c||d||f||g)&&(a||b||c||d||f||h)&&" \ "(a||b||c||d||g||h)&&(a||b||c||e||f||g)&&(a||b||c||e||f||h)&&(a||b||c||e||g||h)&&(a||b||c||f||g||h)&&" \ "(a||b||d||e||f||g)&&(a||b||d||e||f||h)&&(a||b||d||e||g||h)&&(a||b||d||f||g||h)&&(a||b||e||f||g||h)&&" \ "(a||c||d||e||f||g)&&(a||c||d||e||f||h)&&(a||c||d||e||g||h)&&(a||c||d||f||g||h)&&(a||c||e||f||g||h)&&" \ "(a||d||e||f||g||h)&&(!b||!c||!d||!e)&&(!b||!c||!d||!f)&&(!b||!c||!d||!g)&&(!b||!c||!d||!h)&&" \ "(!b||!c||!e||!f)&&(!b||!c||!e||!g)&&(!b||!c||!e||!h)&&(!b||!c||!f||!g)&&(!b||!c||!f||!h)&&" \ "(!b||!c||!g||!h)&&(!b||!d||!e||!f)&&(!b||!d||!e||!g)&&(!b||!d||!e||!h)&&(!b||!d||!f||!g)&&" \ "(!b||!d||!f||!h)&&(!b||!d||!g||!h)&&(!b||!e||!f||!g)&&(!b||!e||!f||!h)&&(!b||!e||!g||!h)&&" \ "(!b||!f||!g||!h)&&(b||c||d||e||f||g)&&(b||c||d||e||f||h)&&(b||c||d||e||g||h)&&(b||c||d||f||g||h)&&" \ "(b||c||e||f||g||h)&&(b||d||e||f||g||h)&&(!c||!d||!e||!f)&&(!c||!d||!e||!g)&&(!c||!d||!e||!h)&&" \ 
"(!c||!d||!f||!g)&&(!c||!d||!f||!h)&&(!c||!d||!g||!h)&&(!c||!e||!f||!g)&&(!c||!e||!f||!h)&&" \ "(!c||!e||!g||!h)&&(!c||!f||!g||!h)&&(c||d||e||f||g||h)&&(!d||!e||!f||!g)&&(!d||!e||!f||!h)&&" \ "(!d||!e||!g||!h)&&(!d||!f||!g||!h)&&(!e||!f||!g||!h)" return mathematica_to_CNF(s, a) def POPCNT4 (a): s="(!a||!b||!c||!d||!e)&&(!a||!b||!c||!d||!f)&&(!a||!b||!c||!d||!g)&&(!a||!b||!c||!d||!h)&&" \ "(!a||!b||!c||!e||!f)&&(!a||!b||!c||!e||!g)&&(!a||!b||!c||!e||!h)&&(!a||!b||!c||!f||!g)&&" \ "(!a||!b||!c||!f||!h)&&(!a||!b||!c||!g||!h)&&(!a||!b||!d||!e||!f)&&(!a||!b||!d||!e||!g)&&" \ "(!a||!b||!d||!e||!h)&&(!a||!b||!d||!f||!g)&&(!a||!b||!d||!f||!h)&&(!a||!b||!d||!g||!h)&&" \ "(!a||!b||!e||!f||!g)&&(!a||!b||!e||!f||!h)&&(!a||!b||!e||!g||!h)&&(!a||!b||!f||!g||!h)&&" \ "(!a||!c||!d||!e||!f)&&(!a||!c||!d||!e||!g)&&(!a||!c||!d||!e||!h)&&(!a||!c||!d||!f||!g)&&" \ "(!a||!c||!d||!f||!h)&&(!a||!c||!d||!g||!h)&&(!a||!c||!e||!f||!g)&&(!a||!c||!e||!f||!h)&&" \ "(!a||!c||!e||!g||!h)&&(!a||!c||!f||!g||!h)&&(!a||!d||!e||!f||!g)&&(!a||!d||!e||!f||!h)&&" \ "(!a||!d||!e||!g||!h)&&(!a||!d||!f||!g||!h)&&(!a||!e||!f||!g||!h)&&(a||b||c||d||e)&&(a||b||c||d||f)&&" \ "(a||b||c||d||g)&&(a||b||c||d||h)&&(a||b||c||e||f)&&(a||b||c||e||g)&&(a||b||c||e||h)&&(a||b||c||f||g)&&" \ "(a||b||c||f||h)&&(a||b||c||g||h)&&(a||b||d||e||f)&&(a||b||d||e||g)&&(a||b||d||e||h)&&(a||b||d||f||g)&&" \ "(a||b||d||f||h)&&(a||b||d||g||h)&&(a||b||e||f||g)&&(a||b||e||f||h)&&(a||b||e||g||h)&&(a||b||f||g||h)&&" \ "(a||c||d||e||f)&&(a||c||d||e||g)&&(a||c||d||e||h)&&(a||c||d||f||g)&&(a||c||d||f||h)&&(a||c||d||g||h)&&" \ "(a||c||e||f||g)&&(a||c||e||f||h)&&(a||c||e||g||h)&&(a||c||f||g||h)&&(a||d||e||f||g)&&(a||d||e||f||h)&&" \ "(a||d||e||g||h)&&(a||d||f||g||h)&&(a||e||f||g||h)&&(!b||!c||!d||!e||!f)&&(!b||!c||!d||!e||!g)&&" \ "(!b||!c||!d||!e||!h)&&(!b||!c||!d||!f||!g)&&(!b||!c||!d||!f||!h)&&(!b||!c||!d||!g||!h)&&" \ "(!b||!c||!e||!f||!g)&&(!b||!c||!e||!f||!h)&&(!b||!c||!e||!g||!h)&&(!b||!c||!f||!g||!h)&&" \ 
"(!b||!d||!e||!f||!g)&&(!b||!d||!e||!f||!h)&&(!b||!d||!e||!g||!h)&&(!b||!d||!f||!g||!h)&&" \ "(!b||!e||!f||!g||!h)&&(b||c||d||e||f)&&(b||c||d||e||g)&&(b||c||d||e||h)&&(b||c||d||f||g)&&" \ "(b||c||d||f||h)&&(b||c||d||g||h)&&(b||c||e||f||g)&&(b||c||e||f||h)&&(b||c||e||g||h)&&" \ "(b||c||f||g||h)&&(b||d||e||f||g)&&(b||d||e||f||h)&&(b||d||e||g||h)&&(b||d||f||g||h)&&" \ "(b||e||f||g||h)&&(!c||!d||!e||!f||!g)&&(!c||!d||!e||!f||!h)&&(!c||!d||!e||!g||!h)&&" \ "(!c||!d||!f||!g||!h)&&(!c||!e||!f||!g||!h)&&(c||d||e||f||g)&&(c||d||e||f||h)&&(c||d||e||g||h)&&" \ "(c||d||f||g||h)&&(c||e||f||g||h)&&(!d||!e||!f||!g||!h)&&(d||e||f||g||h)" return mathematica_to_CNF(s, a) def POPCNT5 (a): s="(!a||!b||!c||!d||!e||!f)&&(!a||!b||!c||!d||!e||!g)&&(!a||!b||!c||!d||!e||!h)&&" \ "(!a||!b||!c||!d||!f||!g)&&(!a||!b||!c||!d||!f||!h)&&(!a||!b||!c||!d||!g||!h)&&" \ "(!a||!b||!c||!e||!f||!g)&&(!a||!b||!c||!e||!f||!h)&&(!a||!b||!c||!e||!g||!h)&&" \ "(!a||!b||!c||!f||!g||!h)&&(!a||!b||!d||!e||!f||!g)&&(!a||!b||!d||!e||!f||!h)&&" \ 132 "(!a||!b||!d||!e||!g||!h)&&(!a||!b||!d||!f||!g||!h)&&(!a||!b||!e||!f||!g||!h)&&" \ "(!a||!c||!d||!e||!f||!g)&&(!a||!c||!d||!e||!f||!h)&&(!a||!c||!d||!e||!g||!h)&&" \ "(!a||!c||!d||!f||!g||!h)&&(!a||!c||!e||!f||!g||!h)&&(!a||!d||!e||!f||!g||!h)&&" \ "(a||b||c||d)&&(a||b||c||e)&&(a||b||c||f)&&(a||b||c||g)&&(a||b||c||h)&&(a||b||d||e)&&" \ "(a||b||d||f)&&(a||b||d||g)&&(a||b||d||h)&&(a||b||e||f)&&(a||b||e||g)&&(a||b||e||h)&&" \ "(a||b||f||g)&&(a||b||f||h)&&(a||b||g||h)&&(a||c||d||e)&&(a||c||d||f)&&(a||c||d||g)&&" \ "(a||c||d||h)&&(a||c||e||f)&&(a||c||e||g)&&(a||c||e||h)&&(a||c||f||g)&&(a||c||f||h)&&" \ "(a||c||g||h)&&(a||d||e||f)&&(a||d||e||g)&&(a||d||e||h)&&(a||d||f||g)&&(a||d||f||h)&&" \ "(a||d||g||h)&&(a||e||f||g)&&(a||e||f||h)&&(a||e||g||h)&&(a||f||g||h)&&(!b||!c||!d||!e||!f||!g)&&" \ "(!b||!c||!d||!e||!f||!h)&&(!b||!c||!d||!e||!g||!h)&&(!b||!c||!d||!f||!g||!h)&&" \ "(!b||!c||!e||!f||!g||!h)&&(!b||!d||!e||!f||!g||!h)&&(b||c||d||e)&&(b||c||d||f)&&" \ 
"(b||c||d||g)&&(b||c||d||h)&&(b||c||e||f)&&(b||c||e||g)&&(b||c||e||h)&&(b||c||f||g)&&" \ "(b||c||f||h)&&(b||c||g||h)&&(b||d||e||f)&&(b||d||e||g)&&(b||d||e||h)&&(b||d||f||g)&&" \ "(b||d||f||h)&&(b||d||g||h)&&(b||e||f||g)&&(b||e||f||h)&&(b||e||g||h)&&(b||f||g||h)&&" \ "(!c||!d||!e||!f||!g||!h)&&(c||d||e||f)&&(c||d||e||g)&&(c||d||e||h)&&(c||d||f||g)&&" \ "(c||d||f||h)&&(c||d||g||h)&&(c||e||f||g)&&(c||e||f||h)&&(c||e||g||h)&&(c||f||g||h)&&" \ "(d||e||f||g)&&(d||e||f||h)&&(d||e||g||h)&&(d||f||g||h)&&(e||f||g||h)" return mathematica_to_CNF(s, a) def POPCNT6 (a): s="(!a||!b||!c||!d||!e||!f||!g)&&(!a||!b||!c||!d||!e||!f||!h)&&(!a||!b||!c||!d||!e||!g||!h)&&" \ "(!a||!b||!c||!d||!f||!g||!h)&&(!a||!b||!c||!e||!f||!g||!h)&&(!a||!b||!d||!e||!f||!g||!h)&&" \ "(!a||!c||!d||!e||!f||!g||!h)&&(a||b||c)&&(a||b||d)&&(a||b||e)&&(a||b||f)&&(a||b||g)&&(a||b||h)&&" \ "(a||c||d)&&(a||c||e)&&(a||c||f)&&(a||c||g)&&(a||c||h)&&(a||d||e)&&(a||d||f)&&(a||d||g)&&" \ "(a||d||h)&&(a||e||f)&&(a||e||g)&&(a||e||h)&&(a||f||g)&&(a||f||h)&&(a||g||h)&&" \ "(!b||!c||!d||!e||!f||!g||!h)&&(b||c||d)&&(b||c||e)&&(b||c||f)&&(b||c||g)&&(b||c||h)&&(b||d||e)&&" \ "(b||d||f)&&(b||d||g)&&(b||d||h)&&(b||e||f)&&(b||e||g)&&(b||e||h)&&(b||f||g)&&(b||f||h)&&(b||g||h)&&" \ "(c||d||e)&&(c||d||f)&&(c||d||g)&&(c||d||h)&&(c||e||f)&&(c||e||g)&&(c||e||h)&&(c||f||g)&&(c||f||h)&&" \ "(c||g||h)&&(d||e||f)&&(d||e||g)&&(d||e||h)&&(d||f||g)&&(d||f||h)&&(d||g||h)&&" \ "(e||f||g)&&(e||f||h)&&(e||g||h)&&(f||g||h)" return mathematica_to_CNF(s, a) def POPCNT7 (a): s="(!a||!b||!c||!d||!e||!f||!g||!h)&&(a||b)&&(a||c)&&(a||d)&&(a||e)&&(a||f)&&(a||g)&&(a||h)&&(b||c)&&" \ "(b||d)&&(b||e)&&(b||f)&&(b||g)&&(b||h)&&(c||d)&&(c||e)&&(c||f)&&(c||g)&&(c||h)&&(d||e)&&(d||f)&&(d||g)&&" \ "(d||h)&&(e||f)&&(e||g)&&(e||h)&&(f||g)&&(f||h)&&(g||h)" return mathematica_to_CNF(s, a) def POPCNT8 (a): s="a&&b&&c&&d&&e&&f&&g&&h" return mathematica_to_CNF(s, a) POPCNT_functions=[POPCNT0, POPCNT1, POPCNT2, POPCNT3, POPCNT4, POPCNT5, POPCNT6, POPCNT7, POPCNT8] def 
coords_to_var (row, col):
    # we always use SAT variables as strings, anyway.
    # the 1st variable is 1, not 0
    return str(row*(WIDTH+2)+col+1)

def chk_bomb(row, col):
    clauses=[]

    # make empty border
    # all variables are negated (because they must be False)
    for c in range(WIDTH+2):
        clauses.append ("-"+coords_to_var(0,c))
        clauses.append ("-"+coords_to_var(HEIGHT+1,c))
    for r in range(HEIGHT+2):
        clauses.append ("-"+coords_to_var(r,0))
        clauses.append ("-"+coords_to_var(r,WIDTH+1))

    for r in range(1,HEIGHT+1):
        for c in range(1,WIDTH+1):
            t=known[r-1][c-1]
            if t in "012345678":
                # cell at r, c is empty (False):
                clauses.append ("-"+coords_to_var(r,c))
                # we need an empty border so the following expression would work for all possible cells:
                neighbours=[coords_to_var(r-1, c-1), coords_to_var(r-1, c), coords_to_var(r-1, c+1),
                    coords_to_var(r, c-1), coords_to_var(r, c+1),
                    coords_to_var(r+1, c-1), coords_to_var(r+1, c), coords_to_var(r+1, c+1)]
                clauses=clauses+POPCNT_functions[int(t)](neighbours)

    # place a bomb
    clauses.append (coords_to_var(row,col))

    f=open("tmp.cnf", "w")
    f.write ("p cnf "+str(VARS_TOTAL)+" "+str(len(clauses))+"\n")
    for c in clauses:
        f.write(c+" 0\n")
    f.close()

    child = subprocess.Popen(["minisat", "tmp.cnf"], stdout=subprocess.PIPE)
    child.wait()
    # 10 is SAT, 20 is UNSAT
    if child.returncode==20:
        print "row=%d, col=%d, unsat!" % (row, col)

for r in range(1,HEIGHT+1):
    for c in range(1,WIDTH+1):
        if known[r-1][c-1]=="?":
            chk_bomb(r, c)

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/minesweeper/minesweeper_SAT.py)

The output CNF file can be large, up to 2000 clauses or more; here is an example: https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/minesweeper/sample.cnf.

Anyway, it works just like my previous Z3Py script:

row=1, col=3, unsat!
row=6, col=2, unsat!
row=6, col=3, unsat!
row=7, col=4, unsat!
row=7, col=9, unsat!
row=8, col=9, unsat!
...but it runs way faster, even considering the overhead of executing an external program. Perhaps the Z3Py version could be optimized much better?

The files, including the Wolfram Mathematica notebook: https://github.com/dennis714/SAT_SMT_article/tree/master/SAT/minesweeper .

11.4 Conway's "Game of Life"

11.4.1 Reversing back state of "Game of Life"

How could we reverse back a known state of GoL? This can be solved by brute force, but that is extremely slow and inefficient.

Let's try to use a SAT solver. First, we need to define a function which will tell if a new cell will be created/born, preserved/stay, or die. Quick refresher: a cell is born if it has 3 neighbours, it stays alive if it has 2 or 3 neighbours, and it dies in any other case.

This is how I can define a function reflecting the state of a new cell in the next state:

if center==true:
    return popcnt2(neighbours) || popcnt3(neighbours)
if center==false:
    return popcnt3(neighbours)

We can get rid of the "if" construction:

result=(center==true && (popcnt2(neighbours) || popcnt3(neighbours))) || (center==false && popcnt3(neighbours))

...where "center" is the state of the central cell, "neighbours" are the 8 neighbouring cells, popcnt2 is a function which returns True if exactly 2 of its input bits are set, and popcnt3 is the same but for 3 bits (just like those used in my "Minesweeper" example (11.3)).
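As a quick reference model, the rule above can also be written directly in Python (a test sketch of mine, not the book's code; `neighbours` here is a plain list of eight 0/1 values):

```python
def newcell(center, neighbours):
    """Next state of a GoL cell: born with 3 neighbours,
    survives with 2 or 3, dies otherwise."""
    n = sum(neighbours)
    if center:
        return n == 2 or n == 3   # popcnt2 || popcnt3
    return n == 3                 # popcnt3

# a live cell with exactly 3 live neighbours stays alive:
print(newcell(True, [1, 1, 1, 0, 0, 0, 0, 0]))   # True
# a dead cell with only 2 live neighbours stays dead:
print(newcell(False, [1, 1, 0, 0, 0, 0, 0, 0]))  # False
```

Such a plain-Python model is handy for cross-checking whatever CNF encoding the solver ends up using.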
Using Wolfram Mathematica, I first create all helper functions and a truth table for the function which returns true if a cell must be present in the next state, or false if not:

In[1]:= popcount[n_Integer]:=IntegerDigits[n,2] // Total
In[2]:= popcount2[n_Integer]:=Equal[popcount[n],2]
In[3]:= popcount3[n_Integer]:=Equal[popcount[n],3]
In[4]:= newcell[center_Integer,neighbours_Integer]:=(center==1 && (popcount2[neighbours]||popcount3[neighbours]))||(center==0 && popcount3[neighbours])
In[13]:= NewCellIsTrue=Flatten[Table[Join[{center},PadLeft[IntegerDigits[neighbours,2],8]] -> Boole[newcell[center,neighbours]],{neighbours,0,255},{center,0,1}]]
Out[13]= {{0,0,0,0,0,0,0,0,0}->0,
{1,0,0,0,0,0,0,0,0}->0,
{0,0,0,0,0,0,0,0,1}->0,
{1,0,0,0,0,0,0,0,1}->0,
{0,0,0,0,0,0,0,1,0}->0,
{1,0,0,0,0,0,0,1,0}->0,
{0,0,0,0,0,0,0,1,1}->0,
{1,0,0,0,0,0,0,1,1}->1,
...

Now we can create a CNF expression out of the truth table:

In[14]:= BooleanConvert[BooleanFunction[NewCellIsTrue,{center,a,b,c,d,e,f,g,h}],"CNF"]
Out[14]= (!a||!b||!c||!d)&&(!a||!b||!c||!e)&&(!a||!b||!c||!f)&&(!a||!b||!c||!g)&&(!a||!b||!c||!h)&&
(!a||!b||!d||!e)&&(!a||!b||!d||!f)&&(!a||!b||!d||!g)&&(!a||!b||!d||!h)&&(!a||!b||!e||!f)&&
(!a||!b||!e||!g)&&(!a||!b||!e||!h)&&(!a||!b||!f||!g)&&(!a||!b||!f||!h)&&(!a||!b||!g||!h)&&
(!a||!c||!d||!e)&&(!a||!c||!d||!f)&&(!a||!c||!d||!g)&&(!a||!c||!d||!h)&&(!a||!c||!e||!f)&&
(!a||!c||!e||!g)&&(!a||!c||!e||!h)&&(!a||!c||!f||!g)&&(!a||!c||!f||!h)&&
...

Also, we need a second function, an inverted one, which will return true if the cell must be absent in the next state, or false otherwise:

In[15]:= NewCellIsFalse=Flatten[Table[Join[{center},PadLeft[IntegerDigits[neighbours,2],8]] -> Boole[Not[newcell[center,neighbours]]],{neighbours,0,255},{center,0,1}]]
Out[15]= {{0,0,0,0,0,0,0,0,0}->1,
{1,0,0,0,0,0,0,0,0}->1,
{0,0,0,0,0,0,0,0,1}->1,
{1,0,0,0,0,0,0,0,1}->1,
{0,0,0,0,0,0,0,1,0}->1,
...
In[16]:= BooleanConvert[BooleanFunction[NewCellIsFalse,{center,a,b,c,d,e,f,g,h}],"CNF"]
Out[16]= (!a||!b||!c||d||e||f||g||h)&&(!a||!b||c||!d||e||f||g||h)&&(!a||!b||c||d||!e||f||g||h)&&
(!a||!b||c||d||e||!f||g||h)&&(!a||!b||c||d||e||f||!g||h)&&(!a||!b||c||d||e||f||g||!h)&&
(!a||!b||!center||d||e||f||g||h)&&(!a||b||!c||!d||e||f||g||h)&&(!a||b||!c||d||!e||f||g||h)&&
(!a||b||!c||d||e||!f||g||h)&&(!a||b||!c||d||e||f||!g||h)&&(!a||b||!c||d||e||f||g||!h)&&
(!a||b||c||!d||!e||f||g||h)&&(!a||b||c||!d||e||!f||g||h)&&(!a||b||c||!d||e||f||!g||h)&&
...

Using the very same way as in my "Minesweeper" example, I can convert a CNF expression to a list of clauses:

def mathematica_to_CNF (s, center, a):
    s=s.replace("center", center)
    s=s.replace("a", a[0]).replace("b", a[1]).replace("c", a[2]).replace("d", a[3])
    s=s.replace("e", a[4]).replace("f", a[5]).replace("g", a[6]).replace("h", a[7])
    s=s.replace("!", "-").replace("||", " ").replace("(", "").replace(")", "")
    s=s.split ("&&")
    return s

And again, as in "Minesweeper", there is an invisible border to make processing simpler. SAT variables are also numbered as in the previous example:

1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44
...
100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121

Also, there is a visible border, always fixed to False, to make things simpler.

Now the working source code. Whenever we encounter "*" in final_state[], we add clauses generated by the cell_is_true() function, or by cell_is_false() otherwise. When we get a solution, it is negated and added to the list of clauses, so when minisat is executed the next time, it will skip the solution which was already printed.

...
def cell_is_false (center, a): s="(!a||!b||!c||d||e||f||g||h)&&(!a||!b||c||!d||e||f||g||h)&&(!a||!b||c||d||!e||f||g||h)&&" \ "(!a||!b||c||d||e||!f||g||h)&&(!a||!b||c||d||e||f||!g||h)&&(!a||!b||c||d||e||f||g||!h)&&" \ "(!a||!b||!center||d||e||f||g||h)&&(!a||b||!c||!d||e||f||g||h)&&(!a||b||!c||d||!e||f||g||h)&&" \ "(!a||b||!c||d||e||!f||g||h)&&(!a||b||!c||d||e||f||!g||h)&&(!a||b||!c||d||e||f||g||!h)&&" \ "(!a||b||c||!d||!e||f||g||h)&&(!a||b||c||!d||e||!f||g||h)&&(!a||b||c||!d||e||f||!g||h)&&" \ "(!a||b||c||!d||e||f||g||!h)&&(!a||b||c||d||!e||!f||g||h)&&(!a||b||c||d||!e||f||!g||h)&&" \ "(!a||b||c||d||!e||f||g||!h)&&(!a||b||c||d||e||!f||!g||h)&&(!a||b||c||d||e||!f||g||!h)&&" \ "(!a||b||c||d||e||f||!g||!h)&&(!a||!c||!center||d||e||f||g||h)&&(!a||c||!center||!d||e||f||g||h)&&" \ "(!a||c||!center||d||!e||f||g||h)&&(!a||c||!center||d||e||!f||g||h)&&(!a||c||!center||d||e||f||!g||h)&&" \ "(!a||c||!center||d||e||f||g||!h)&&(a||!b||!c||!d||e||f||g||h)&&(a||!b||!c||d||!e||f||g||h)&&" \ "(a||!b||!c||d||e||!f||g||h)&&(a||!b||!c||d||e||f||!g||h)&&(a||!b||!c||d||e||f||g||!h)&&" \ "(a||!b||c||!d||!e||f||g||h)&&(a||!b||c||!d||e||!f||g||h)&&(a||!b||c||!d||e||f||!g||h)&&" \ "(a||!b||c||!d||e||f||g||!h)&&(a||!b||c||d||!e||!f||g||h)&&(a||!b||c||d||!e||f||!g||h)&&" \ "(a||!b||c||d||!e||f||g||!h)&&(a||!b||c||d||e||!f||!g||h)&&(a||!b||c||d||e||!f||g||!h)&&" \ "(a||!b||c||d||e||f||!g||!h)&&(a||b||!c||!d||!e||f||g||h)&&(a||b||!c||!d||e||!f||g||h)&&" \ "(a||b||!c||!d||e||f||!g||h)&&(a||b||!c||!d||e||f||g||!h)&&(a||b||!c||d||!e||!f||g||h)&&" \ "(a||b||!c||d||!e||f||!g||h)&&(a||b||!c||d||!e||f||g||!h)&&(a||b||!c||d||e||!f||!g||h)&&" \ "(a||b||!c||d||e||!f||g||!h)&&(a||b||!c||d||e||f||!g||!h)&&(a||b||c||!d||!e||!f||g||h)&&" \ "(a||b||c||!d||!e||f||!g||h)&&(a||b||c||!d||!e||f||g||!h)&&(a||b||c||!d||e||!f||!g||h)&&" \ "(a||b||c||!d||e||!f||g||!h)&&(a||b||c||!d||e||f||!g||!h)&&(a||b||c||d||!e||!f||!g||h)&&" \ "(a||b||c||d||!e||!f||g||!h)&&(a||b||c||d||!e||f||!g||!h)&&(a||b||c||d||e||!f||!g||!h)&&" \ 
"(!b||!c||!center||d||e||f||g||h)&&(!b||c||!center||!d||e||f||g||h)&&(!b||c||!center||d||!e||f||g||h)&&" \ "(!b||c||!center||d||e||!f||g||h)&&(!b||c||!center||d||e||f||!g||h)&&(!b||c||!center||d||e||f||g||!h)&&" \ "(b||!c||!center||!d||e||f||g||h)&&(b||!c||!center||d||!e||f||g||h)&&(b||!c||!center||d||e||!f||g||h)&&" \ "(b||!c||!center||d||e||f||!g||h)&&(b||!c||!center||d||e||f||g||!h)&&(b||c||!center||!d||!e||f||g||h)&&" \ "(b||c||!center||!d||e||!f||g||h)&&(b||c||!center||!d||e||f||!g||h)&&(b||c||!center||!d||e||f||g||!h)&&" \ "(b||c||!center||d||!e||!f||g||h)&&(b||c||!center||d||!e||f||!g||h)&&(b||c||!center||d||!e||f||g||!h)&&" \ "(b||c||!center||d||e||!f||!g||h)&&(b||c||!center||d||e||!f||g||!h)&&(b||c||!center||d||e||f||!g||!h)" return mathematica_to_CNF(s, center, a) def cell_is_true (center, a): s="(!a||!b||!c||!d)&&(!a||!b||!c||!e)&&(!a||!b||!c||!f)&&(!a||!b||!c||!g)&&(!a||!b||!c||!h)&&" \ "(!a||!b||!d||!e)&&(!a||!b||!d||!f)&&(!a||!b||!d||!g)&&(!a||!b||!d||!h)&&(!a||!b||!e||!f)&&" \ "(!a||!b||!e||!g)&&(!a||!b||!e||!h)&&(!a||!b||!f||!g)&&(!a||!b||!f||!h)&&(!a||!b||!g||!h)&&" \ "(!a||!c||!d||!e)&&(!a||!c||!d||!f)&&(!a||!c||!d||!g)&&(!a||!c||!d||!h)&&(!a||!c||!e||!f)&&" \ "(!a||!c||!e||!g)&&(!a||!c||!e||!h)&&(!a||!c||!f||!g)&&(!a||!c||!f||!h)&&(!a||!c||!g||!h)&&" \ "(!a||!d||!e||!f)&&(!a||!d||!e||!g)&&(!a||!d||!e||!h)&&(!a||!d||!f||!g)&&(!a||!d||!f||!h)&&" \ "(!a||!d||!g||!h)&&(!a||!e||!f||!g)&&(!a||!e||!f||!h)&&(!a||!e||!g||!h)&&(!a||!f||!g||!h)&&" \ "(a||b||c||center||d||e||f)&&(a||b||c||center||d||e||g)&&(a||b||c||center||d||e||h)&&" \ "(a||b||c||center||d||f||g)&&(a||b||c||center||d||f||h)&&(a||b||c||center||d||g||h)&&" \ "(a||b||c||center||e||f||g)&&(a||b||c||center||e||f||h)&&(a||b||c||center||e||g||h)&&" \ "(a||b||c||center||f||g||h)&&(a||b||c||d||e||f||g)&&(a||b||c||d||e||f||h)&&(a||b||c||d||e||g||h)&&" \ "(a||b||c||d||f||g||h)&&(a||b||c||e||f||g||h)&&(a||b||center||d||e||f||g)&&(a||b||center||d||e||f||h)&&" \ 
"(a||b||center||d||e||g||h)&&(a||b||center||d||f||g||h)&&(a||b||center||e||f||g||h)&&" \ "(a||b||d||e||f||g||h)&&(a||c||center||d||e||f||g)&&(a||c||center||d||e||f||h)&&" \ "(a||c||center||d||e||g||h)&&(a||c||center||d||f||g||h)&&(a||c||center||e||f||g||h)&&" \ "(a||c||d||e||f||g||h)&&(a||center||d||e||f||g||h)&&(!b||!c||!d||!e)&&(!b||!c||!d||!f)&&" \ "(!b||!c||!d||!g)&&(!b||!c||!d||!h)&&(!b||!c||!e||!f)&&(!b||!c||!e||!g)&&(!b||!c||!e||!h)&&" \ "(!b||!c||!f||!g)&&(!b||!c||!f||!h)&&(!b||!c||!g||!h)&&(!b||!d||!e||!f)&&(!b||!d||!e||!g)&&" \ "(!b||!d||!e||!h)&&(!b||!d||!f||!g)&&(!b||!d||!f||!h)&&(!b||!d||!g||!h)&&(!b||!e||!f||!g)&&" \ "(!b||!e||!f||!h)&&(!b||!e||!g||!h)&&(!b||!f||!g||!h)&&(b||c||center||d||e||f||g)&&" \ "(b||c||center||d||e||f||h)&&(b||c||center||d||e||g||h)&&(b||c||center||d||f||g||h)&&" \ "(b||c||center||e||f||g||h)&&(b||c||d||e||f||g||h)&&(b||center||d||e||f||g||h)&&" \ "(!c||!d||!e||!f)&&(!c||!d||!e||!g)&&(!c||!d||!e||!h)&&(!c||!d||!f||!g)&&(!c||!d||!f||!h)&&" \ "(!c||!d||!g||!h)&&(!c||!e||!f||!g)&&(!c||!e||!f||!h)&&(!c||!e||!g||!h)&&(!c||!f||!g||!h)&&" \ "(c||center||d||e||f||g||h)&&(!d||!e||!f||!g)&&(!d||!e||!f||!h)&&(!d||!e||!g||!h)&&(!d||!f||!g||!h)&&" \ "(!e||!f||!g||!h)" return mathematica_to_CNF(s, center, a) ... 
(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/GoL_SAT_utils.py)

#!/usr/bin/python

import os
from GoL_SAT_utils import *

final_state=[
" * ",
"* *",
" * "]

H=len(final_state)    # HEIGHT
W=len(final_state[0]) # WIDTH

print "HEIGHT=", H, "WIDTH=", W

VARS_TOTAL=W*H+1
VAR_FALSE=str(VARS_TOTAL)

def try_again (clauses):
    # rules for the main part of the grid
    for r in range(H):
        for c in range(W):
            if final_state[r][c]=="*":
                clauses=clauses+cell_is_true(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W))
            else:
                clauses=clauses+cell_is_false(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W))

    # cells behind the visible grid must always be false:
    for c in range(-1, W+1):
        for r in [-1,H]:
            clauses=clauses+cell_is_false(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W))
    for c in [-1,W]:
        for r in range(-1, H+1):
            clauses=clauses+cell_is_false(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W))

    write_CNF("tmp.cnf", clauses, VARS_TOTAL)
    print "%d clauses" % len(clauses)
    solution=run_minisat ("tmp.cnf")
    os.remove("tmp.cnf")
    if solution==None:
        print "unsat!"
        exit(0)
    grid=SAT_solution_to_grid(solution, H, W)
    print_grid(grid)
    write_RLE(grid)
    return grid

clauses=[]
# always false:
clauses.append ("-"+VAR_FALSE)

while True:
    solution=try_again(clauses)
    clauses.append(negate_clause(grid_to_clause(solution, H, W)))
    print ""

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/reverse1.py)

Here is the result:

HEIGHT= 3 WIDTH= 3
2525 clauses
.*.
*.*
.*.
1.rle written
2526 clauses
.**
*..
*.*
2.rle written
2527 clauses
**.
..*
*.*
3.rle written
2528 clauses
*.*
*..
.**
4.rle written
2529 clauses
*.*
..*
**.
5.rle written
2530 clauses
*.*
.*.
*.*
6.rle written
2531 clauses
unsat!

The first result is the same as the initial state. Indeed: this is a "still life", i.e., a state which will never change, and it is a correct solution. The last solution is also valid.
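That the first solution really is a still life can be confirmed by stepping GoL forward once. This is a plain forward-simulation sketch of mine (not from the repository), assuming cells outside the grid are dead, just like the fixed-False border does:

```python
def step(grid):
    """One Game of Life generation; cells outside the grid are dead."""
    H, W = len(grid), len(grid[0])
    def alive(r, c):
        return 0 <= r < H and 0 <= c < W and grid[r][c] == "*"
    nxt = []
    for r in range(H):
        row = ""
        for c in range(W):
            n = sum(alive(r+dr, c+dc)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            if alive(r, c):
                row += "*" if n in (2, 3) else "."   # survives or dies
            else:
                row += "*" if n == 3 else "."        # born or stays dead
        nxt.append(row)
    return nxt

tub = [".*.", "*.*", ".*."]
print(step(tub) == tub)  # True: the state reproduces itself
```

The same function also shows the vertical bar flipping to a horizontal one, which is the oscillator behaviour discussed below.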
Now the problem: the 2nd, 3rd, 4th and 5th solutions are equivalent to each other; they are just mirrored or rotated. In fact, these are reflectional92 (like in a mirror) and rotational93 symmetries. We can solve this easily: we will take each solution, reflect and rotate it, and add the results, negated, to the list of clauses, so minisat will skip them during its work:

...
while True:
    solution=try_again(clauses)
    clauses.append(negate_clause(grid_to_clause(solution, H, W)))
    clauses.append(negate_clause(grid_to_clause(reflect_vertically(solution), H, W)))
    clauses.append(negate_clause(grid_to_clause(reflect_horizontally(solution), H, W)))
    # is this square?
    if W==H:
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,1), H, W)))
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,2), H, W)))
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,3), H, W)))
    print ""
...

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/reverse2.py)

The functions reflect_vertically(), reflect_horizontally() and rotate_square_array() are simple array manipulation routines.

Now we get just 3 solutions:

HEIGHT= 3 WIDTH= 3
2525 clauses
.*.
*.*
.*.
1.rle written
2531 clauses
.**
*..
*.*
2.rle written
2537 clauses
*.*
.*.
*.*
3.rle written
2543 clauses
unsat!

92https://en.wikipedia.org/wiki/Reflection_symmetry
93https://en.wikipedia.org/wiki/Rotational_symmetry

This one has only one single ancestor:

final_state=[
" * ",
" * ",
" * "]

HEIGHT= 3 WIDTH= 3
2503 clauses
...
***
...
1.rle written
2509 clauses
unsat!

This is an oscillator, of course. How many states can lead to such a picture?

final_state=[
" * ",
" ",
" ** ",
" * ",
" * ",
" *** "]

28; here are a few of them:

HEIGHT= 6 WIDTH= 5
5217 clauses
.*.*.
..*..
.**.*
..*..
..*.*
.**..
1.rle written
5220 clauses
.*.*.
..*..
.**.*
..*..
*.*.*
.**..
2.rle written
5223 clauses
..*.*
..**.
.**..
....*
*.*.*
.**..
3.rle written
5226 clauses
..*.*
..**.
.**..
*...*
..*.*
.**..
4.rle written
...
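The reflect/rotate helpers used above are not shown in the text; the following sketch is a plausible implementation of mine, working on grids represented as lists of strings (the versions in GoL_SAT_utils.py may differ in detail):

```python
def reflect_vertically(grid):
    # flip rows: top row becomes bottom row
    return grid[::-1]

def reflect_horizontally(grid):
    # flip columns: each row reversed
    return [row[::-1] for row in grid]

def rotate_square_array(grid, times):
    # rotate a square grid 90 degrees clockwise, 'times' times:
    # new row r is old column r, read bottom-to-top
    for _ in range(times):
        grid = ["".join(row[c] for row in reversed(grid))
                for c in range(len(grid))]
    return grid

g = [".**", "*..", "*.*"]
print(rotate_square_array(g, 4) == g)  # True: four rotations = identity
```

With these, negating the 2 reflections and 3 extra rotations of each solution removes all 8 symmetric duplicates of a square grid.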
Now the biggest one, a "space invader":

final_state=[
" ",
" * * ",
" * * ",
" ******* ",
" ** *** ** ",
" *********** ",
" * ******* * ",
" * * * * ",
" ** ** ",
" "]

HEIGHT= 10 WIDTH= 13
16469 clauses
..*.*.**.....
.....*****...
....**..*....
......*...*..
..**...*.*...
.*..*.*.**..*
*....*....*.*
..*.*..*.....
..*.....*.*..
....**..*.*..
1.rle written
16472 clauses
*.*.*.**.....
.....*****...
....**..*....
......*...*..
..**...*.*...
.*..*.*.**..*
*....*....*.*
..*.*..*.....
..*.....*.*..
....**..*.*..
2.rle written
16475 clauses
..*.*.**.....
*....*****...
....**..*....
......*...*..
..**...*.*...
.*..*.*.**..*
*....*....*.*
..*.*..*.....
..*.....*.*..
....**..*.*..
3.rle written
...

I don't know how many possible states can lead to the "space invader"; perhaps too many. I had to stop it. It also slows down during execution, because the number of clauses keeps increasing (due to the addition of negated solutions).

All solutions are also exported to RLE files, which can be opened by Golly94.

94http://golly.sourceforge.net/

11.4.2 Finding "still lives"

A "still life" in terms of GoL is a state which doesn't change at all.

First, using the previous definitions, we will define a truth table of a function which will return true if the center cell of the next state is the same as it has been in the previous state, i.e., hasn't changed:

In[17]:= stillife=Flatten[Table[Join[{center},PadLeft[IntegerDigits[neighbours,2],8]]-> Boole[Boole[newcell[center,neighbours]]==center],{neighbours,0,255},{center,0,1}]]
Out[17]= {{0,0,0,0,0,0,0,0,0}->1,
{1,0,0,0,0,0,0,0,0}->0,
{0,0,0,0,0,0,0,0,1}->1,
{1,0,0,0,0,0,0,0,1}->0,
...
In[18]:= BooleanConvert[BooleanFunction[stillife,{center,a,b,c,d,e,f,g,h}],"CNF"] Out[18]= (!a||!b||!c||!center||!d)&&(!a||!b||!c||!center||!e)&&(!a||!b||!c||!center||!f)&& (!a||!b||!c||!center||!g)&&(!a||!b||!c||!center||!h)&&(!a||!b||!c||center||d||e||f||g||h)&& (!a||!b||c||center||!d||e||f||g||h)&&(!a||!b||c||center||d||!e||f||g||h)&&(!a||!b||c||center||d||e||!f||g||h)&& (!a||!b||c||center||d||e||f||!g||h)&&(!a||!b||c||center||d||e||f||g||!h)&&(!a||!b||!center||!d||!e)&& ... #!/usr/bin/python import os from GoL_SAT_utils import * W=3 # WIDTH H=3 # HEIGHT VARS_TOTAL=W*H+1 VAR_FALSE=str(VARS_TOTAL) def stillife (center, a): s="(!a||!b||!c||!center||!d)&&(!a||!b||!c||!center||!e)&&(!a||!b||!c||!center||!f)&&" \ "(!a||!b||!c||!center||!g)&&(!a||!b||!c||!center||!h)&&(!a||!b||!c||center||d||e||f||g||h)&&" \ "(!a||!b||c||center||!d||e||f||g||h)&&(!a||!b||c||center||d||!e||f||g||h)&&" \ "(!a||!b||c||center||d||e||!f||g||h)&&(!a||!b||c||center||d||e||f||!g||h)&&" \ "(!a||!b||c||center||d||e||f||g||!h)&&(!a||!b||!center||!d||!e)&&(!a||!b||!center||!d||!f)&&" \ "(!a||!b||!center||!d||!g)&&(!a||!b||!center||!d||!h)&&(!a||!b||!center||!e||!f)&&" \ "(!a||!b||!center||!e||!g)&&(!a||!b||!center||!e||!h)&&(!a||!b||!center||!f||!g)&&" \ "(!a||!b||!center||!f||!h)&&(!a||!b||!center||!g||!h)&&(!a||b||!c||center||!d||e||f||g||h)&&" \ "(!a||b||!c||center||d||!e||f||g||h)&&(!a||b||!c||center||d||e||!f||g||h)&&" \ "(!a||b||!c||center||d||e||f||!g||h)&&(!a||b||!c||center||d||e||f||g||!h)&&" \ "(!a||b||c||center||!d||!e||f||g||h)&&(!a||b||c||center||!d||e||!f||g||h)&&" \ "(!a||b||c||center||!d||e||f||!g||h)&&(!a||b||c||center||!d||e||f||g||!h)&&" \ "(!a||b||c||center||d||!e||!f||g||h)&&(!a||b||c||center||d||!e||f||!g||h)&&" \ "(!a||b||c||center||d||!e||f||g||!h)&&(!a||b||c||center||d||e||!f||!g||h)&&" \ "(!a||b||c||center||d||e||!f||g||!h)&&(!a||b||c||center||d||e||f||!g||!h)&&" \ "(!a||!c||!center||!d||!e)&&(!a||!c||!center||!d||!f)&&(!a||!c||!center||!d||!g)&&" \ 
"(!a||!c||!center||!d||!h)&&(!a||!c||!center||!e||!f)&&(!a||!c||!center||!e||!g)&&" \ "(!a||!c||!center||!e||!h)&&(!a||!c||!center||!f||!g)&&(!a||!c||!center||!f||!h)&&" \ "(!a||!c||!center||!g||!h)&&(!a||!center||!d||!e||!f)&&(!a||!center||!d||!e||!g)&&" \ "(!a||!center||!d||!e||!h)&&(!a||!center||!d||!f||!g)&&(!a||!center||!d||!f||!h)&&" \ "(!a||!center||!d||!g||!h)&&(!a||!center||!e||!f||!g)&&(!a||!center||!e||!f||!h)&&" \ "(!a||!center||!e||!g||!h)&&(!a||!center||!f||!g||!h)&&(a||!b||!c||center||!d||e||f||g||h)&&" \ "(a||!b||!c||center||d||!e||f||g||h)&&(a||!b||!c||center||d||e||!f||g||h)&&" \ "(a||!b||!c||center||d||e||f||!g||h)&&(a||!b||!c||center||d||e||f||g||!h)&&" \ "(a||!b||c||center||!d||!e||f||g||h)&&(a||!b||c||center||!d||e||!f||g||h)&&" \ "(a||!b||c||center||!d||e||f||!g||h)&&(a||!b||c||center||!d||e||f||g||!h)&&" \ "(a||!b||c||center||d||!e||!f||g||h)&&(a||!b||c||center||d||!e||f||!g||h)&&" \ "(a||!b||c||center||d||!e||f||g||!h)&&(a||!b||c||center||d||e||!f||!g||h)&&" \ "(a||!b||c||center||d||e||!f||g||!h)&&(a||!b||c||center||d||e||f||!g||!h)&&" \ "(a||b||!c||center||!d||!e||f||g||h)&&(a||b||!c||center||!d||e||!f||g||h)&&" \ "(a||b||!c||center||!d||e||f||!g||h)&&(a||b||!c||center||!d||e||f||g||!h)&&" \ "(a||b||!c||center||d||!e||!f||g||h)&&(a||b||!c||center||d||!e||f||!g||h)&&" \ "(a||b||!c||center||d||!e||f||g||!h)&&(a||b||!c||center||d||e||!f||!g||h)&&" \ "(a||b||!c||center||d||e||!f||g||!h)&&(a||b||!c||center||d||e||f||!g||!h)&&" \ "(a||b||c||!center||d||e||f||g)&&(a||b||c||!center||d||e||f||h)&&" \ "(a||b||c||!center||d||e||g||h)&&(a||b||c||!center||d||f||g||h)&&" \ "(a||b||c||!center||e||f||g||h)&&(a||b||c||center||!d||!e||!f||g||h)&&" \ "(a||b||c||center||!d||!e||f||!g||h)&&(a||b||c||center||!d||!e||f||g||!h)&&" \ "(a||b||c||center||!d||e||!f||!g||h)&&(a||b||c||center||!d||e||!f||g||!h)&&" \ "(a||b||c||center||!d||e||f||!g||!h)&&(a||b||c||center||d||!e||!f||!g||h)&&" \ "(a||b||c||center||d||!e||!f||g||!h)&&(a||b||c||center||d||!e||f||!g||!h)&&" 
\ "(a||b||c||center||d||e||!f||!g||!h)&&(a||b||!center||d||e||f||g||h)&&" \ "(a||c||!center||d||e||f||g||h)&&(!b||!c||!center||!d||!e)&&(!b||!c||!center||!d||!f)&&" \ "(!b||!c||!center||!d||!g)&&(!b||!c||!center||!d||!h)&&(!b||!c||!center||!e||!f)&&" \ "(!b||!c||!center||!e||!g)&&(!b||!c||!center||!e||!h)&&(!b||!c||!center||!f||!g)&&" \ "(!b||!c||!center||!f||!h)&&(!b||!c||!center||!g||!h)&&(!b||!center||!d||!e||!f)&&" \ 141 "(!b||!center||!d||!e||!g)&&(!b||!center||!d||!e||!h)&&(!b||!center||!d||!f||!g)&&" \ "(!b||!center||!d||!f||!h)&&(!b||!center||!d||!g||!h)&&(!b||!center||!e||!f||!g)&&" \ "(!b||!center||!e||!f||!h)&&(!b||!center||!e||!g||!h)&&(!b||!center||!f||!g||!h)&&" \ "(b||c||!center||d||e||f||g||h)&&(!c||!center||!d||!e||!f)&&(!c||!center||!d||!e||!g)&&" \ "(!c||!center||!d||!e||!h)&&(!c||!center||!d||!f||!g)&&(!c||!center||!d||!f||!h)&&" \ "(!c||!center||!d||!g||!h)&&(!c||!center||!e||!f||!g)&&(!c||!center||!e||!f||!h)&&" \ "(!c||!center||!e||!g||!h)&&(!c||!center||!f||!g||!h)&&(!center||!d||!e||!f||!g)&&" \ "(!center||!d||!e||!f||!h)&&(!center||!d||!e||!g||!h)&&(!center||!d||!f||!g||!h)&&" \ "(!center||!e||!f||!g||!h)" return mathematica_to_CNF(s, center, a) def try_again (clauses): # rules for the main part of grid for r in range(H): for c in range(W): clauses=clauses+stillife(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W)) # cells behind visible grid must always be false: for c in range(-1, W+1): for r in [-1,H]: clauses=clauses+cell_is_false(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W)) for c in [-1,W]: for r in range(-1, H+1): clauses=clauses+cell_is_false(coords_to_var(r, c, H, W), get_neighbours(r, c, H, W)) write_CNF("tmp.cnf", clauses, VARS_TOTAL) print "%d clauses" % len(clauses) solution=run_minisat ("tmp.cnf") os.remove("tmp.cnf") if solution==None: print "unsat!" 
        exit(0)
    grid=SAT_solution_to_grid(solution, H, W)
    print_grid(grid)
    write_RLE(grid)
    return grid

clauses=[]
# always false:
clauses.append ("-"+VAR_FALSE)

while True:
    solution=try_again(clauses)
    clauses.append(negate_clause(grid_to_clause(solution, H, W)))
    clauses.append(negate_clause(grid_to_clause(reflect_vertically(solution), H, W)))
    clauses.append(negate_clause(grid_to_clause(reflect_horizontally(solution), H, W)))
    # is this square?
    if W==H:
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,1), H, W)))
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,2), H, W)))
        clauses.append(negate_clause(grid_to_clause(rotate_square_array(solution,3), H, W)))
    print ""

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/stillife1.py )

What do we get for 2x2?

1881 clauses
..
..
1.rle written
1887 clauses
**
**
2.rle written
1893 clauses
unsat!

Both solutions are correct: an empty square will progress into an empty square (no cells are born), and the 2x2 box is a well-known "still life".

What about a 3x3 square?

2887 clauses
...
...
...
1.rle written
2893 clauses
.**
.**
...
2.rle written
2899 clauses
.**
*.*
**.
3.rle written
2905 clauses
.*.
*.*
**.
4.rle written
2911 clauses
.*.
*.*
.*.
5.rle written
2917 clauses
unsat!

Here is a problem: we see the familiar 2x2 box again, but shifted. It is indeed a correct solution, but we are not interested in it, because we have already seen it.

What we can do is add another condition: we can force minisat to find only solutions with no empty rows and columns. This is easy. These are the SAT variables for a 5x5 square:

1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25

Each clause is an "OR" clause, so all we have to do is add 5 clauses:

1 OR 2 OR 3 OR 4 OR 5
6 OR 7 OR 8 OR 9 OR 10
...

That means that each row must have at least one True value somewhere. We can do the same for each column as well.

...
# each row must contain at least one cell!
for r in range(H):
    clauses.append(" ".join([coords_to_var(r, c, H, W) for c in range(W)]))
# each column must contain at least one cell!
for c in range(W):
    clauses.append(" ".join([coords_to_var(r, c, H, W) for r in range(H)]))
...

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/stillife2.py )

Now we can see that the 3x3 square has 3 possible "still lives":

2893 clauses
.*.
*.*
**.
1.rle written
2899 clauses
.*.
*.*
.*.
2.rle written
2905 clauses
.**
*.*
**.
3.rle written
2911 clauses
unsat!

The 4x4 square has 7:

4169 clauses
..**
...*
***.
*...
1.rle written
4175 clauses
..**
..*.
*.*.
**..
2.rle written
4181 clauses
..**
.*.*
*.*.
**..
3.rle written
4187 clauses
..*.
.*.*
*.*.
**..
4.rle written
4193 clauses
.**.
*..*
*.*.
.*..
5.rle written
4199 clauses
..*.
.*.*
*.*.
.*..
6.rle written
4205 clauses
.**.
*..*
*..*
.**.
7.rle written
4211 clauses
unsat!

When I try large squares, like 20x20, funny things happen. First of all, minisat finds solutions that are not very pleasing aesthetically, but still correct, like:

61033 clauses
....**.**.**.**.**.*
**..*.**.**.**.**.**
*...................
.*..................
**..................
*...................
.*..................
**..................
*...................
.*..................
**..................
*...................
.*..................
**..................
*...................
.*..................
..*.................
...*................
***.................
*...................
1.rle written
...

Indeed: all rows and columns have at least one True value.

Then minisat begins to add smaller "still lives" into the whole picture:

61285 clauses
.**....**...**...**.
.**...*..*.*.*...*..
.......**...*......*
..................**
...**............*..
...*.*...........*..
....*.*........**...
**...*.*...**..*....
*.*...*....*....*...
.*..........****.*..
................*...
..*...**..******....
.*.*..*..*..........
*..*...*.*..****....
***.***..*.*....*...
....*..***.**..**...
**.*..*.............
.*.**.**..........**
*..*..*..*......*..*
**..**..**......**..
43.rle written

In other words, the result is a square consisting of smaller "still lives". Minisat then alters these parts slightly, shifting them back and forth.
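Each already-seen solution is banned via negate_clause(grid_to_clause(...)), but those two helpers are not printed in the text either. A hypothetical sketch of the idea (my own guess at an implementation, assuming row-major variable numbering starting at 1, as in the 5x5 table earlier, and clauses kept as space-separated strings of DIMACS-style literals):

```python
# Hypothetical helpers (only the names follow the article): a found
# solution becomes a conjunction of literals, and its negation is a
# single OR-clause with every literal inverted, which is false only
# for that one assignment.

def coords_to_var(r, c, H, W):
    # row-major numbering starting at 1, as in the 5x5 variable table
    return str(r * W + c + 1)

def grid_to_clause(grid, H, W):
    # one literal per cell: positive if alive ('*'), negative if dead
    lits = []
    for r in range(H):
        for c in range(W):
            v = coords_to_var(r, c, H, W)
            lits.append(v if grid[r][c] == "*" else "-" + v)
    return " ".join(lits)

def negate_clause(clause):
    # invert every literal of the clause
    def inv(lit):
        return lit[1:] if lit.startswith("-") else "-" + lit
    return " ".join(inv(l) for l in clause.split())

block = ["**",
         "**"]
print(negate_clause(grid_to_clause(block, 2, 2)))  # -1 -2 -3 -4
```

Since a clause is a disjunction, the negated clause rules out exactly the one assignment that produced it, so minisat is forced to produce a different grid on the next iteration.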
Is it cheating? Anyway, it does this in strict accordance with the rules we defined. But we want a denser picture.

We can add a rule: in every 5-cell chunk there must be at least one True cell. To achieve this, we just split the whole square into 5-cell chunks and add a clause for each:

...
# make result denser:
lst=[]
for r in range(H):
    for c in range(W):
        lst.append(coords_to_var(r, c, H, W))
# divide them all by chunks and add to clauses:
CHUNK_LEN=5
for c in list_partition(lst,len(lst)/CHUNK_LEN):
    tmp=" ".join(c)
    clauses.append(tmp)
...

(https://github.com/dennis714/SAT_SMT_article/blob/master/SAT/GoL/stillife.py )

This is indeed denser:

61113 clauses
..**.**......*.*.*..
...*.*.....***.**.*.
...*..*...*.......*.
....*.*..*.*......**
...**.*.*..*...**.*.
..*...*.***.....*.*.
...*.*.*......*..*..
****.*..*....*.**...
*....**.*....*.*....
...**..*...**..*....
..*..*....*....*.**.
.*.*.**....****.*..*
..*.*....*.*..*..**.
....*.****..*..*.*..
....**....*.*.**..*.
*.**...****.*..*.**.
**...**.....**.*....
...**..*..**..*.**.*
***.*.*..*.*..*.*.**
*....*....*....*....
1.rle written
61119 clauses
..**.**......*.*.*..
...*.*.....***.**.*.
...*..*...*.......*.
....*.*..*.*......**
...**.*.*..*...**.*.
..*...*.***.....*.*.
...*.*.*......*..*..
****.*..*....*.**...
*....**.*....*.*....
...**..*...**..*....
..*..*....*....*.**.
.*.*.**....****.*..*
..*.*....*.*..*..**.
....*.****..*..*.*..
....**....*.*.**..*.
*.**...****.*..*.**.
**...**.....**.*....
...**..*.***..*.**.*
***.*..*.*..*.*.*.**
*.......*..**.**....
2.rle written
...

Let's try denser still: one mandatory true cell per each 4-cell chunk:

61133 clauses
.**.**...*....**..**
*.*.*...*.*..*..*..*
*....*...*.*..*.**..
.***.*.....*.**.*...
..*.*.....**...*..*.
*......**..*...*.**.
**.....*...*.**.*...
...**...*...**..*...
**.*..*.*......*...*
.*...**.**..***.****
.*....*.*..*..*.*...
**.***...*.**...*.**
.*.*..****.....*..*.
*....*.....**..**.*.
*.***.*..**.*.....**
.*...*..*......**...
...*.*.**......*.***
..**.*.....**......*
*..*.*.**..*.*..***.
**....*.*...*...*...
1.rle written
61139 clauses
.**.**...*....**..**
*.*.*...*.*..*..*..*
*....*...*.*..*.**..
.***.*.....*.**.*...
..*.*.....**...*..*.
*......**..*...*.**.
**.....*...*.**.*...
...**...*...**..*...
**.*..*.*......*...*
.*...**.**..***.****
.*....*.*..*..*.*...
**.***...*.**...*.**
.*.*..****.....*..*.
*....*.....**..**.*.
*.***.*..**.*.....**
.*...*..*......**..*
...*.*.**......*.**.
..**.*.....**....*..
*..*.*.**..*.*...*.*
**....*.*...*.....**
2.rle written
...

...and even more: one cell per each 3-cell chunk:

61166 clauses
**.*..**...**.**....
*.**..*.*...*.*.*.**
....**..*...*...*.*.
.**..*.*.**.*.*.*.*.
..**.*.*...*.**.*.**
*...*.*.**.*....*.*.
**.*..*...*.*.***..*
.*.*.*.***..**...**.
.*.*.*.*..**...*.*..
**.**.*..*...**.*..*
..*...*.**.**.*.*.**
..*.**.*..*.*.*.*...
**.*.*...*..*.*.*...
.*.*...*.**..*..***.
.*..****.*....**...*
..*.*...*..*...*..*.
.**...*.*.**...*.*..
..*..**.*.*...**.**.
..*.*..*..*..*..*..*
.**.**....**..**..**
1.rle written
61172 clauses
**.*..**...**.**....
*.**..*.*...*.*.*.**
....**..*...*...*.*.
.**..*.*.**.*.*.*.*.
..**.*.*...*.**.*.**
*...*.*.**.*....*.*.
**.*..*...*.*.***..*
.*.*.*.***..**...**.
.*.*.*.*..**...*.*..
**.**.*..*...**.*..*
..*...*.**.**.*.*.**
..*.**.*..*.*.*.*...
**.*.*...*..*.*.*...
.*.*...*.**..*..***.
.*..****.*....**...*
..*.*...*..*...*..*.
.**..**.*.**...*.*..
*..*.*..*.*...**.**.
*..*.*.*..*..*..*..*
.**...*...**..**..**
2.rle written
...

This is the densest variant. Unfortunately, it's impossible to construct a "still life" with one mandatory true cell per each 2-cell chunk.

11.4.3 The source code

Source code and a Wolfram Mathematica notebook: https://github.com/dennis714/SAT_SMT_article/tree/master/SAT/GoL .

12 Acronyms used

CNF: Conjunctive normal form
DNF: Disjunctive normal form
DSL: Domain-specific language
CPRNG: Cryptographically Secure Pseudorandom Number Generator
SMT: Satisfiability modulo theories
SAT: Boolean satisfiability problem
LCG: Linear congruential generator
PL: Programming Language
OOP: Object-oriented programming
SSA: Static single assignment form
CPU: Central processing unit
FPU: Floating-point unit
PRNG: Pseudorandom number generator
CRT: C runtime library
CRC: Cyclic redundancy check
AST: Abstract syntax tree
AKA: Also Known As
CTF: Capture the Flag
ISA: Instruction Set Architecture
CSP: Constraint satisfaction problem
CS: Computer science
DAG: Directed acyclic graph
NOP: No Operation
JVM: Java Virtual Machine
VM: Virtual Machine
LZSS: Lempel–Ziv–Storer–Szymanski
RAM: Random-access memory
FPGA: Field-programmable gate array
EDA: Electronic design automation
MAC: Message authentication code
ECC: Elliptic curve cryptography
API: Application programming interface
NSA: National Security Agency
Cross-core Microarchitectural Side Channel Attacks and Countermeasures

by

Gorka Irazoqui

A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Electrical and Computer Engineering

April 2017

APPROVED:
Professor Thomas Eisenbarth, Dissertation Advisor, ECE Department
Professor Berk Sunar, Dissertation Committee, ECE Department
Professor Craig Shue, Dissertation Committee, CS Department
Professor Engin Kirda, Dissertation Committee, Northeastern University

Abstract

In the last decade, multi-threaded systems and resource sharing have brought a number of technologies that facilitate our daily tasks in a way we never imagined. Among others, cloud computing has emerged to offer us powerful computational resources without having to physically acquire and install them, while smartphones have almost acquired the same importance desktop computers had a decade ago. This has only been possible thanks to the ever-evolving performance optimization improvements made to modern microarchitectures that efficiently manage concurrent usage of hardware resources. One of the aforementioned optimizations is the usage of shared Last Level Caches (LLCs) to balance different CPU core loads and to maintain coherency between shared memory blocks utilized by different cores. The latter, for instance, has enabled concurrent execution of several processes in low-RAM devices such as smartphones.

Although efficient hardware resource sharing has become the de-facto model for several modern technologies, it also poses a major concern with respect to security. Some of the concurrently executed co-resident processes might in fact be malicious and try to take advantage of hardware proximity. New technologies usually claim to be secure by implementing sandboxing techniques and executing processes in isolated software environments, called Virtual Machines (VMs).
However, the design of these isolated environments aims at preventing pure software-based attacks and usually does not consider hardware leakages. In fact, the malicious utilization of hardware resources as covert channels might have severe consequences for the privacy of the customers.

Our work demonstrates that malicious customers of such technologies can utilize the LLC as a covert channel to obtain sensitive information from a co-resident victim. We show that the LLC is an attractive resource to be targeted by attackers, as it offers high resolution and, unlike previous microarchitectural attacks, does not require core co-location. Particularly concerning are the cases in which cryptography is compromised, as it is the main component of every security solution. In this sense, the presented work not only introduces three attack variants that can be applicable in different scenarios, but also demonstrates the ability to recover cryptographic keys (e.g. AES and RSA) and TLS session messages across VMs, bypassing sandboxing techniques.

Finally, two countermeasures to prevent microarchitectural attacks in general, and LLC attacks in particular, from retrieving fine-grain information are presented. Unlike previously proposed countermeasures, ours do not add permanent overheads to the system but can be utilized as preemptive defenses. The first identifies leakages in cryptographic software that can potentially lead to key extraction, and thus can be utilized by cryptographic code designers to ensure the sanity of their libraries before deployment. The second detects microarchitectural attacks embedded into innocent-looking binaries, preventing them from being posted in official application repositories that usually have the full trust of the customer.

Acknowledgements

The attacks described in Chapter 4 are based on collaborative work with Mehmet Sinan İnci and resulted in peer-reviewed publications [IIES14b, IIES15].
The content presented in Chapter 5 has been published as [IES16]. Parts of the results introduced in Chapter 6 are based on joint work with Mehmet Sinan İnci and Berk Gulmezoglu and resulted in publication [İGI+16a]. The rest of the content of Chapter 6 has been published in [IES15a, IES15b]. The vulnerability analysis described in Chapter 7 was performed jointly with Intel employees Xiaofei Guo, Hareesh Khattri, Arun Kanuparthi and Kai Cong, and is under submission at a peer-reviewed venue. The remaining contributions in the chapter have also been submitted to a peer-reviewed conference. This work was supported by the National Science Foundation (NSF), under grants CNS-1318919, CNS-1314770 and CNS-1618837.

This thesis is the result of several years of effort that would not have been successful without the collaboration, both professional and personal, of certain people to whom I would like to express my gratitude.

My two advisors, Thomas Eisenbarth and Berk Sunar, have given me the confidence and freedom to achieve the results presented in this thesis. Without their knowledge, advice and personal treatment this journey would have been much more complicated. They were able to take the best out of my skills, and sometimes even believed in me more than I did. I will always feel that part of the achievements I accomplish during my career will belong to them.

As part of this thesis I had the chance of working closely with Yuval Yarom and Xiaofei Guo. Besides the pleasure of working with them, I established with both of them a relationship that goes beyond our professional collaborations. I wish them the best of luck in both their professional careers and their personal lives.

I would like to thank the members of the Vernam lab for being such a great support in the most difficult moments. Every single person in the lab was helpful at some specific point during the last 4 years.
I am looking forward to maintaining the good relationships we built in the upcoming years.

I also would like to express my gratitude to the One United and E soccer team members, who gave me the distraction that every person needs to accomplish professional success. Besides the three tournaments we won together, they provided me invaluable personal support which produced friendships that I will never forget. They can feel proud of being one of the most positive things I take from this experience.

My parents and my sister have always been the first ones to believe in me, and have a great influence on any success I achieve in my career. Although they already know what they mean to me, I would still like to thank them for everything they do to always bring out the best in me. I cannot imagine a better source of strength for the upcoming challenges.

Finally, I would like to thank Elena Gonzalez for being the best person I could ever find to stay by my side. Despite all the obstacles she never gave up on us. I will always be indebted to her for all the support she gave me to complete this thesis. She has shown me personal values that I rarely find in other people, and that I am willing to enjoy every day from now on.

Contents

1 Introduction . . . 1
2 Background . . . 6
2.1 Side Channel Attacks . . . 6
2.2 Computer Microarchitecture . . . 7
2.2.1 Hardware Caches . . . 8
2.3 The Cache as a Covert Channel . . . 11
2.3.1 The Evict and Time Attack . . . 11
2.3.2 The Prime and Probe Attack . . . 13
2.4 Functionality of Commonly Used Cryptographic Algorithms . . . 13
2.4.1 AES . . . 13
2.4.2 RSA . . . 16
2.4.3 Elliptic Curve Cryptography . . . 19
3 Related Work . . . 21
3.1 Classical Side Channel Attacks . . . 21
3.1.1 Timing Attacks . . . 21
3.1.2 Power Attacks . . . 22
3.2 Microarchitectural Attacks . . . 23
3.2.1 Hyper-threading . . . 24
3.2.2 Branch Prediction Unit Attacks . . . 24
3.2.3 Out-of-order Execution Attacks . . . 25
3.2.4 Performance Monitoring Units . . . 26
3.2.5 Special Instructions . . . 26
3.2.6 Hardware Caches . . . 27
3.2.7 Cache Internals . . . 28
3.2.8 Cache Pre-Fetching . . . 29
3.2.9 Other Attacks on Caches . . . 29
3.2.10 Memory Bus Locking Attacks . . . 30
3.2.11 DRAM and Rowhammer Attacks . . . 30
3.2.12 TEE Attacks . . . 31
3.2.13 Cross-core/-CPU Attacks . . . 32
4 The Flush and Reload Attack . . . 34
4.1 Flush and Reload Requirements . . . 35
4.2 Memory Deduplication . . . 36
4.3 Flush and Reload Functionality . . . 37
4.4 Flush and Reload Attacking AES . . . 39
4.4.1 Description of the Attack . . . 40
4.4.2 Recovering the Full Key . . . 42
4.4.3 Attack Scenario 1: Spy Process . . . 44
4.4.4 Attack Scenario 2: Cross-VM Attack . . . 45
4.4.5 Experiment Setup and Results . . . 45
4.4.6 Comparison to Other Attacks . . . 47
4.5 Flush and Reload Attacking Transport Layer Security: Reviving the Lucky 13 Attack . . . 49
4.5.1 The TLS Record Protocol . . . 49
4.5.2 HMAC . . . 50
4.5.3 CBC Encryption & Padding . . . 51
4.5.4 An Attack On CBC Encryption . . . 51
4.5.5 Analysis of Lucky 13 Patches . . . 52
4.5.6 Patches Immune to Flush and Reload . . . 53
4.5.7 Patches Vulnerable to Flush and Reload . . . 53
4.5.8 Reviving Lucky 13 on the Cloud . . . 55
4.5.9 Experiment Setup and Results . . . 59
4.6 Flush and Reload Outcomes . . . 62
5 The First Cross-CPU Attack: Invalidate and Transfer . . . 64
5.1 Cache Coherence Protocols . . . 65
5.1.1 AMD HyperTransport Technology . . . 67
5.1.2 Intel QuickPath Interconnect Technology . . . 68
5.2 Invalidate and Transfer Attack Procedure . . . 69
5.3 Exploiting the New Covert Channel . . . 71
5.3.1 Attacking Table Based AES . . . 72
5.3.2 Attacking Square and Multiply El Gamal Decryption . . . 72
5.4 Experiment Setup and Results . . . 73
5.4.1 Experiment Setup . . . 74
5.4.2 AES Results . . . 74
5.4.3 El Gamal Results . . . 76
5.5 Invalidate and Transfer Outcomes . . . 79
6 The Prime and Probe Attack . . . 80
6.1 Virtual Address Translation and Cache Addressing . . . 80
6.2 Last Level Cache Slices . . . 82
6.3 The Original Prime and Probe Technique . . . 83
6.4 Limitations of the Original Prime and Probe Technique . . . 84
6.5 Targeting Small Pieces of the LLC . . . 85
6.6 LLC Set Location Information Enabled by Huge Pages . . . 85
6.7 Reverse Engineering the Slice Selection Algorithm . . . 87
6.7.1 Probing the Last Level Cache . . . 88
6.7.2 Identifying m Data Blocks Co-Residing in a Slice . . . 88
6.7.3 Generating Equations Mapping the Slices . . . 89
6.7.4 Recovering Linear Hash Functions . . . 91
6.7.5 Experiment Setup for Linear Hash Functions . . . 92
6.7.6 Results for Linear Hash Functions . . . 93
6.7.7 Obtaining Non-linear Slice Selection Algorithms . . . 95
6.8 The LLC Prime and Probe Attack Procedure . . . 98
6.8.1 Prime and Probe Applied to AES . . . 99
6.8.2 Experiment Setup and Results for the AES Attack . . . 101
6.9 Recovering RSA Keys in Amazon EC2 . . . 105
6.10 Prime and Probe Outcomes . . . 114
7 Countermeasures . . . 116
7.1 Existing Countermeasures . . . 117
7.1.1 Page Coloring . . . 117
7.1.2 Performance Event Monitoring to Detect Cache Attacks . . . 118
7.2 Problems with Previously Existing Countermeasures . . . 119
7.3 Detecting Cache Leakages at the Source Code . . . 120
7.3.1 Preliminaries . . . 122
7.3.2 Methodology . . . 124
7.3.3 Evaluated Crypto Primitives . . . 130
7.3.4 Cryptographic Libraries Evaluated . . . 135
7.3.5 Results for AES . . . 135
7.3.6 Results for RSA . . . 137
7.3.7 Results for ECC . . . 140
7.3.8 Leakage Summary . . . 143
7.3.9 Comparison with Related Work . . . 143
7.3.10 Recommendations to Avoid Leakages in Cryptographic Software . . . 145
7.3.11 Outcomes . . . 147
7.4 MASCAT: Preventing Microarchitectural Attacks from Being Executed . . . 147
7.4.1 Microarchitectural Attacks . . . 149
7.4.2 Implicit Characteristics of Microarchitectural Attacks . . . 152
7.4.3 Our Approach: MASCAT, a Static Analysis Tool for Microarchitectural Attacks . . . 154
7.4.4 Experiment Setup . . . 160
7.5 Results . . . 161
7.5.1 Analysis of Microarchitectural Attacks . . . 161
7.5.2 Results for Benign Binaries . . . 161
7.5.3 Limitations . . . 164
7.5.4 Outcomes . . . 165
8 Conclusion . . . 166

List of Figures

1.1 Hardware attacks bypass VM isolation . . . 3
2.1 Side channel attack scenario . . . 7
2.2 Typical microarchitecture layout in modern processors . . . 8
2.3 Cache access time distribution . . . 9
2.4 Evict and Time procedure . . . 12
2.5 Prime and Probe procedure . . . 14
2.6 AES 128 state diagram with respect to its 4 main operations . . . 15
2.7 Last round of a T-table implementation of AES . . . 16
2.8 ECC elliptic curve in which R=P+Q . . . 19
2.9 ECDH procedure . . . 20
4.1 Memory Deduplication Feature . . . 37
4.2 Copy-on-Write Scheme . . . 38
4.3 Flush and Reload access time distinction . . . 39
4.4 Flush and Reload results for AES . . . 47
4.5 CBC mode TLS functionality . . . 50
4.6 Network time difference with Lucky 13 patches . . . 56
4.7 Cache access times for Lucky 13 vulnerabilities . . . 57
4.8 Flush and Reload 2 byte results for PolarSSL Lucky 13 vulnerability . . . 60
4.9 Flush and Reload 1 byte results for PolarSSL Lucky 13 vulnerability . . . 61
4.10 Flush and Reload 1 byte results for CyaSSL Lucky 13 vulnerability . . . 62
4.11 Flush and Reload 1 byte results for GnuTLS Lucky 13 vulnerability . . . 63
5.1 DRAM accesses vs directed probes thanks to the HyperTransport links . . . 68
5.2 HT link vs DRAM access . . . 69
5.3 HT and DRAM time access difference in AMD . . . 70
5.4 QPI and DRAM time access difference in Intel . . . 71
5.5 Miss counter values for each ciphertext value, normalized to the average . . . 75
5.6 Invalidate and Transfer key finding step . . . 76
5.7 Encryptions to recover the AES key with Invalidate and Transfer . . . 76
5.8 RSA decryption trace observed with Invalidate and Transfer . . . 77
5.9 Key recovery step for an Invalidate and Transfer RSA trace . . . 78
6.1 A hash function based on the physical address decides whether the memory block belongs to slice 0 or 1 . . . 83
6.2 Last level cache slice addressing methodology for Intel processors . . . 84
6.3 4KB vs 2MB offset information comparison . . . 86
6.4 Slice colliding memory block generation: Step 1 . . . 89
6.5 Slice colliding memory block generation: Step 2 . . . 90
6.6 Slice colliding memory block generation: Step 3 . . . 91
6.7 Slice access distribution in Intel Xeon E5-2670 v2 . . . 96
6.8 Prime and Probe probing cache vs memory time histogram . . . 98
6.9 T-table set identification step with Prime and Probe . . . 102
6.10 Table cache line access distribution per ciphertext byte . . . 103
6.11 Results for AES key recovery with Prime and Probe . . . 104
6.12 RSA trace identification step with Prime and Probe . . . 109
6.13 RSA multiplicand recovery and alignment with Prime and Probe . . . 110
6.14 Comparison of the final obtained peaks with the correct peaks with adjusted timeslot resolution . . . 111
7.1 Page coloring implementation . . . 118
7.2 HPC as a cache attack monitoring unit . . . 119
7.3 Vulnerable code snippet example . . . 124
7.4 Code snippet wrapped in cache tracer . . . 125
7.5 Results for cache traces obtained from toy example . . . 125
7.6 Taint analysis example . . . 128
7.7 Noise threshold adjustment . . . 130
7.8 First and Last round of an AES encryption . . .
131 7.9 WolfSSL and NSS AES leakage . . . . . . . . . . . . . . . . . . . . . 136 7.10 OpenSSL and Libgcrypt AES leakage . . . . . . . . . . . . . . . . . . 136 7.11 Montgomery ladder RSA leakage for WolfSSL . . . . . . . . . . . . . 137 7.12 Sliding window RSA leakage for a) WolfSSL and b) MbedTLS . . . . 137 7.13 Leakage for fixed window RSA in a) OpenSSL and b) IPP . . . . . . 138 7.14 Varying leakage due to cache missalignment explanation . . . . . . . 139 7.15 Montgomery Ladder ECC leakage for a) WolfSSL and b) Libgcrypt . 141 7.16 Sliding window ECC leakage in WolfSSL . . . . . . . . . . . . . . . . 142 7.17 ECC leakage in MbedTLS and Bouncy Castle . . . . . . . . . . . . . 142 7.18 wNAF ECC OpenSSL results . . . . . . . . . . . . . . . . . . . . . . 143 7.19 Flush and Reload code snippet from [YF14] . . . . . . . . . . . . . . 150 7.20 Rowhammer code snippet from [KDK+14a] . . . . . . . . . . . . . . . 151 7.21 Attribute analysis and threat score update implemented by MASCAT . 157 x 7.22 Visual example output of MASCAT , in which a flush and reload attack is detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 xi List of Tables 3.1 Side channel attack classification according to utilized data analysis method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.1 Comparison of cache side channel attack techniques against AES . . . 48 5.1 Summary of error results in the RSA key recovery attack . . . . . . . 78 6.1 Comparison of the profiled architectures . . . . . . . . . . . . . . . . 93 6.2 Slice selection hash function for the profiled architectures . . . . . . . 94 6.3 Hash selection algorithm implemented by the Intel Xeon E5-2670 v2 . 97 6.4 Successfully recovered peaks on average in an exponentiation . . . . . 111 7.1 Cryptographic libraries evaluated . . . . . . . . . . . . . . . . . . . . 135 7.2 Leakage summary for the cryptographic libraries. Default implemen- tations are presented in bold . . . . . . . . . 
. . . . . . . . . . . . . . 144 7.3 Antivirus analysis output for microarchitectural attacks . . . . . . . . 155 7.4 Percentage of attacks correctly flagged by MASCAT (true positives). . 161 7.5 Results for different groups of binaries from Ubuntu Software Center. 162 7.6 Results for different groups of APKs. . . . . . . . . . . . . . . . . . . 163 xii 7.7 Explanation for benign binaries classified as threats. . . . . . . . . . . 163 xiii Chapter 1 Introduction The rapid increase in transistor densities over the past few decades brought numer- ous computing applications, previously thought impossible, into the realm of every- day computing. With dense integration and increased clock rates, heat dissipation in single core architectures has become a major challenge for processor manufactur- ers. The design of multi-core architectures has been the method of choice to profit from further advances in integration, a solution that has shown to substantially improve the performance over single-core architectures. Lately, the design of multi- CPU sockets (each allocating multi-core systems) has taken even more advantage of the benefits of having several Central Processing Units executing different tasks in parallel. Despite its many advantages, multi-core and multi-CPU architectures are sus- ceptible to suffering under bandwidth bottlenecks if the architecture is not designed properly, especially when more cores are packed into a high-performance system. Parallelism can only be effective if shared resources are correctly managed, ensuring their fair distribution. Several components have been designed and added to mod- ern microarchitecture designs to achieve this goal. A good example are inclusive Last Level Caches (LLCs), which are shared across multiple CPU cores and aid the maintenance of cache coherency protocols within the same CPU socket by ensuring that copies in the upper level caches exist in the LLC. 
In fact, it is the aforementioned parallelism that has enabled many of the technologies we use on a daily basis. For instance, in the last decade we have witnessed the cloud revolution, which co-locates several customers (and their corresponding workloads) on a single physical machine. The concurrent execution of so many processes on the same computer would not be possible without the multi-core/multi-CPU designs that are common nowadays. Similarly, smartphones are now able to execute several processes at the same time by running some of them in the background. This has been possible thanks to the widespread adoption of multi-core architectures by embedded devices; modern smartphones not only have several cores in the same device, but have also started to incorporate more than one CPU socket.

Although parallelism and resource sharing help to improve performance, they also pose a big risk when untrusted processes execute alongside trusted processes. For example, a malicious attacker can take advantage of being co-resident with a potential victim and execute malicious code in commercial clouds. Similarly, a malicious application can try to steal sensitive information from a benign security-critical application, e.g., an online banking application. To cope with these issues, both trusted and untrusted processes are usually executed in a software-isolated environment. IaaS clouds, for instance, rent expensive hardware resources to multiple customers by offering guest OS instances sandboxed inside virtual machines (VMs). PaaS cloud services go one step further and allow users to share the application space while sandboxed at the OS level, e.g., using containers. Similar sandboxing techniques are used to isolate semi-trusted apps running on mobile devices. Even browsers use sandboxing to execute untrusted code without the risk of harming the local host.
All these mechanisms have a clear goal in mind: taking advantage of resource sharing among several applications/users while isolating each of them to prevent malicious exploitation of the said shared resources.

Despite the widespread adoption of the aforementioned technologies, the robustness of the resource-sharing scenarios that they provide had, prior to this work, not yet been tested against malicious users exploiting hardware covert channels. While the resistance of these technologies to pure software attacks is usually guaranteed, their response to hardware-leakage-based attacks remains an open question. In the following we will refer to these attacks, which utilize hardware resources like the cache and the Branch Prediction Unit (BPU) to gain information, as microarchitectural attacks.

By 2014, only [RTSS09, ZJRR12] had succeeded in recovering information from co-resident users in realistic cloud environments utilizing microarchitectural attacks: the first recovers keystrokes across VMs in Amazon EC2, the latter recovers an El Gamal decryption key in a lab virtualization environment. The problem mostly comes from the fact that these and the rest of the works prior to this thesis (except for [YF14]) only consider core-private covert channels like the L1 cache and the BPU. Therefore these attacks are only applicable when victim and attacker are core co-resident. With multi-core systems being the de facto architecture utilized not only in high-end servers but also in smartphone devices, the core co-residency requirement significantly reduces the applicability of previously known microarchitectural attacks. Further, constant optimization and penalty reduction make core-private resources difficult to utilize given the amount of noise that regular workloads introduce. For instance, L1 cache misses and L2 cache hits only differ by a few cycles, while mispredicted branches do not add substantial overhead. It is questionable whether these covert channels would support the typical amount of noise in sandboxed scenarios.

It is therefore necessary to investigate whether more powerful and applicable covert channels can be utilized for unauthorized information extraction. For instance, considering covert channels that do not require core co-location increases the attack probability, as only CPU socket co-residency is needed. This becomes a crucial fact in, e.g., commercial clouds, where core co-residence is highly unlikely.

[Figure 1.1: Malicious VMs can try to bypass the isolation provided by the hypervisors and steal information from co-resident VMs]

In fact, several microarchitectural components are connected not only across cores but also across CPU sockets. These include, among others, the Last Level Cache (LLC), the memory bus and the DRAM. All these possible covert channels have to be thoroughly investigated to understand their capabilities and to develop the appropriate countermeasures to prevent their utilization for information theft.

Contributions

This thesis focuses on one of the aforementioned potentially powerful covert channels, i.e., the LLC. We will show its exploitability under different assumptions and scenarios. We present three attacks that can take advantage of such a resource, namely Flush and Reload, Invalidate and Transfer and Prime and Probe. The first is only applicable across cores and under memory deduplication features, the second is capable of reaching victims across CPU sockets under the same memory deduplication assumption, and the latter is able to recover information across cores without any special requirement.

We show how and where these three attacks can be applied, specifically in scenarios where previous attacks had proven to behave poorly. For instance, we show for the first time how to recover information across VMs located in different cores with the Flush and Reload attack in VMware.
Further, we show that these attacks can be taken to multi-CPU-socket machines by applying the Invalidate and Transfer attack on a 4-CPU-socket school server. In addition, this thesis presents the LLC Prime and Probe attack, applicable in any hypervisor without any special requirement (even VMware with updated security features). In fact, we utilize it to recover an RSA key from a co-resident VM in Amazon EC2, demonstrating the big threat that cache attacks pose in real-world scenarios.

Lastly, we develop two tools that can help prevent these leakages from being exploited. However, unlike other proposed countermeasures, which add permanent performance overheads that private companies are unwilling to afford, our solutions take a preemptive approach. The first aims at identifying LLC leakages in cryptographic code to ensure they are caught before the software reaches end users. In particular, we use taint analysis, cache trace analysis and Mutual Information (MI) to derive whether a cryptographic code leaks or not. We found alarming results, as 50% of the implementations leaked information for the AES, RSA and ECC algorithms. The second tool, MASCAT, takes a different approach. With official application stores like Google Play or the Microsoft Store rising in popularity, it is important that the binaries offered by these repositories are malware-free. Unlike regular malware, e.g. shell code, which might be easily detectable by modern antivirus tools, microarchitectural attacks are different, as they do not look malicious. MASCAT serves as a microarchitectural attack antivirus, detecting such attacks inside binaries without having to inspect the source code manually.

Our work resulted in the discovery of several vulnerabilities in existing products that attackers could take advantage of to steal information belonging to co-resident victims.
As part of our responsibility as researchers, we notified the corresponding software designers about the vulnerabilities of their solutions. These conversations led to several security updates: VMware decided to implement a new salt-based deduplication mechanism (CVE-2014-3566), Intel modified the RSA implementation of its cryptographic library (CVE-2016-8100), the Bouncy Castle cryptographic library re-designed its AES implementation (2016-10003323) and WolfSSL closed leakages in all of its AES, RSA and ECC implementations (CVEs 2016-7438, 2016-7439 and 2016-7440). Thus, our investigation played a key role in improving and updating several current security solutions that, otherwise, could have compromised the privacy of their customers.

In summary, this work:

• Demonstrates the applicability of the Flush and Reload attack in virtualized environments by recovering AES keys across cores in less than a minute.

• Shows that cache attacks can go beyond cryptographic algorithms by attacking three TLS implementations and re-implementing the supposedly closed Lucky 13 attack.

• Presents Invalidate and Transfer, a new attack targeting the cache coherency protocol that works across CPU sockets. We demonstrate its viability by recovering AES and RSA keys.

• Introduces the LLC Prime and Probe attack, which does not require memory deduplication to succeed. In order to successfully apply the Prime and Probe attack, a thorough investigation of the architecture of LLCs is performed, in particular of how cache slices are distributed. We also show how the Prime and Probe attack succeeds in hypervisors where Flush and Reload was not able to succeed, e.g., the Xen hypervisor.

• Shows, for the first time, the applicability of microarchitectural attacks in commercial clouds. More precisely, we demonstrate how to recover an RSA key across co-resident VMs in Amazon EC2.

• Presents a cache leakage analysis tool to prevent the design of cryptographic algorithms from leaking information.
We found alarming results, as 50% of the implementations showed cache leakages that could lead to full key extraction.

• Introduces MASCAT, a tool to detect microarchitectural attacks embedded in apparently innocent-looking binaries. MASCAT serves as a verification process for official application distributors that want to ensure the sanity of the binaries being offered in their repositories.

• Proposes fixes (which have been adopted) to all the vulnerabilities discovered in commercial software as a result of the previously mentioned contributions.

The rest of the thesis is organized as follows. Chapter 2 describes the necessary background to understand the attacks and defenses later developed. Related work, both prior and concurrent to this thesis, is presented in Chapter 3. Chapters 4, 5 and 6 describe the deployment of Flush and Reload, Invalidate and Transfer and Prime and Probe, respectively. Finally, Chapter 7 describes the aforementioned countermeasures.

Chapter 2

Background

This thesis includes a detailed description of how cache attacks can be applied across cores to recover sensitive information belonging to a co-resident victim. To help the reader understand the attacks that will later be presented, this chapter gives an introduction to the typical microarchitecture layout found in modern processors and the attacks that were proposed prior to this thesis. Further, we give a description of the most widely used cryptographic algorithms, as they will be under the attack radar of our microarchitectural attacks in subsequent chapters.

2.1 Side Channel Attacks

Side Channel Attacks (SCA) are attacks that take advantage of the leakage coming from a side channel during a secure communication. Typical attacks on the direct channel involve brute-forcing the key or social phishing. Instead, side channel attacks observe additional information stemming from side channels that carry information about the key.
These side channel traces are then processed and correlated to either obtain the full key or to reduce its search space significantly. Figure 2.1 shows the overall idea, where two parties are trying to establish an encrypted communication over an insecure channel. Due to the leakage stemming from unintended side channels, an attacker can at least try to use that information to obtain the key.

This leakage can come in many forms. Power and Electromagnetic (EM) leakages are common in embedded devices and smartcards and usually imply having physical access to the device. Timing attacks can deduce information about a secret by distinguishing the overall processing time, but they suffer from the huge amount of noise in modern communication channels (e.g., the internet). Microarchitectural attacks, in general, try to obtain information from the usage of some microarchitectural resource. Although they require physical co-residence with the targeted process, they do not need physical access to the device. In fact, several scenarios can arise in which malicious code is executed alongside a potential victim on the same physical machine. Other leakage forms are less common, e.g. sound or thermal, but are constantly being studied to increase the applicability of side channel attacks [KJJ99, BECN+04, KJJR11].

[Figure 2.1: Side channel attack scenario. Instead of the direct channel, leakage coming from side channels is exploited to obtain information about the key. Figure from [Mic].]

2.2 Computer Microarchitecture

Modern computers execute user-specified instructions and data in a Central Processing Unit (CPU), store the memory necessary to execute software in DRAM, and interact with the outside world through peripherals. With a constantly evolving market and strong competitors fighting for the same marketplace, efficiency has become one of the main goals of every microarchitecture designer.
In fact, several hardware resources have been added to the basic microarchitectural components, e.g., caches or Branch Prediction Units (BPUs), to provide better performance. Further, modern microarchitectures now embed several processing units in the same processor, some of which are even capable of processing two threads concurrently. Even further, lately we have observed the rise of multi-CPU-socket computers, in which more than one CPU socket is embedded in the same piece of hardware. All these technological advances have the same goal: offering the end user the best computing performance.

A typical microarchitecture design commonly found in modern processors can be observed in Figure 2.2. The example shows a dual-socket system, each CPU having two cores. Each CPU core has private L1 and L2 caches, while all cores in a socket share the Last Level Cache (LLC). Further, each core has its own BPU, in charge of predicting the outcome of the branches being executed. The communication between the cache hierarchy and the memory is done through the memory bus, which is also in charge of maintaining coherency between shared blocks across CPU sockets. Finally, the DRAM stores the instructions and data that the program being executed needs.

[Figure 2.2: Typical microarchitecture layout in modern processors: two CPU sockets, each with two cores; per-core L1 instruction/data caches, L2 cache and BPU (predictor and BTB); a shared L3 cache per socket; memory bus and DRAM.]

Although each component would merit an exhaustive and thorough analysis, we put our focus on caches, as they are the core component being examined in this thesis.
Nevertheless, some of the functionalities of the aforementioned components will be described in greater detail later in the thesis when needed.

2.2.1 Hardware Caches

Hardware caches are small memories placed between the DRAM and the CPU cores to store data and instructions that are likely to be reused soon. When the software needs a particular memory block, the CPU first checks the cache hierarchy looking for that memory block. If found, the memory block is fetched from the cache and the access time is significantly faster. If the memory block is not found in the cache hierarchy, it is fetched from the DRAM at the cost of a slower access time. These two scenarios are called a cache hit and a cache miss, respectively.

At this point the main question is probably how much faster cache accesses are than DRAM accesses. This can be observed in Figure 2.3, for which we performed consecutive timed accesses to an L1-cached, an L3-cached and an uncached memory block on an Intel i5-3320M. The access time for the L1 cache is about 3 cycles and for the L3 cache around 7 cycles, while an access to memory takes around 25 cycles. Thus, an access to the DRAM is about 3 times slower than an access to the lowest cache level, which gives an idea of the performance improvement that the cache hierarchy offers to software execution.

[Figure 2.3: Reload timing distribution when a memory block is loaded from the L1 cache, the L3 cache and memory. The experiments were performed on an Intel i5-3320M.]

In particular, caches base their functionality on two main principles: spatial locality and temporal locality. While the first principle states that data residing close to the data being accessed is likely to be accessed soon, the latter assumes that recently accessed data is also likely to be accessed again soon.
Caches accomplish temporal locality by storing recently accessed memory blocks, and spatial locality by working with cache lines that load both the needed data and its neighbors into the cache. In consequence, the cache is usually divided into several fixed-size cache lines. However, caches can have very different design characteristics. In the following we describe how their functionality changes for different design choices.

2.2.1.1 Cache Addressing

One of the most important decisions in the design of a cache is how it is organized and addressed. In this sense there are three main design policies:

• Direct mapped caches: Each memory block has a fixed location in the cache, i.e., it can occupy only one specific cache line.

• Fully associative caches: A memory block can occupy any of the cache lines in the cache.

• N-way set associative caches: This is the most common design in modern processors. This design splits the cache into equally sized partitions called cache sets, each holding n cache lines. In this case, a memory block is constrained to occupy one of the n cache lines of a fixed set.

Each of the designs has advantages and disadvantages. For instance, the memory block search in a direct mapped cache is very efficient, as the CPU has to check only one location. In contrast, in a fully associative cache the CPU has to search for a memory block in all the cache lines. However, direct mapped caches suffer from collisions, as consecutive accesses to memory blocks that collide in the same cache line always trigger cache misses. Fully associative caches do not suffer from this problem, as any block can occupy any location in the cache. Thus, it is understandable that the most common choice in modern processors is the n-way set associative cache, as it balances the advantages and disadvantages of the first two designs.
2.2.1.2 Cache Replacement Policy

Another fundamental aspect in the design of n-way set associative caches is the algorithm that selects the block to be evicted within a set. Recall that each set holds n memory blocks, one of which has to be evicted to make room for a new one. The most common algorithms designed for this purpose are:

• Least Recently Used (LRU): This algorithm evicts the memory block that has gone the longest without being accessed, by keeping track of the accesses made to each of the n ways in the set.

• Round Robin: Also known as First-In First-Out (FIFO); evicts the memory block that has resided longest in the cache.

• Pseudo-random: The memory block to be evicted is selected by a pseudo-random algorithm.

Note that, in the case of cache attacks, knowledge of the replacement algorithm might be crucial for the success of the attack. In fact, some cache attacks like Prime and Probe can be challenging to implement under random replacement policies, as the occupancy of the memory blocks in the cache cannot be predicted. In this thesis, we focus on x86-64 architectures, which implement an LRU eviction policy.

2.2.1.3 Inclusiveness Property

The last, but perhaps most important, property that cache designers need to take into account is whether they feature inclusive, non-inclusive or exclusive caches. This has severe implications, among others, for the cache coherency protocol:

• Inclusive caches: Inclusive caches are those that require that any block that exists in the upper level caches (i.e., L1 or L2) also exists in the LLC. Note that this brings several simplifications when it comes to maintaining cache coherency across cores, since the LLC is shared. Thus, the inclusiveness property itself is in charge of keeping coherency between shared memory blocks across CPU cores. The drawback of inclusive caches is the additional cache lines wasted on several copies of the same memory block.
• Exclusive caches: Exclusive caches require that a memory block reside at only one cache level at a time. In contrast to inclusive caches, here the cache coherency protocol has to be implemented with the upper level caches as well. However, exclusive caches do not waste cache space maintaining several copies of the same block.

• Non-inclusive caches: This type of cache imposes no such requirement; a memory block can reside in one or more cache levels at a time.

Whether the cache features the inclusiveness property might also be a key factor when implementing cache attacks. In fact, as will be explained in later sections, attacks like Prime and Probe might only be applicable to inclusive caches, while other attacks like Invalidate and Transfer are agnostic of the inclusiveness property.

2.3 The Cache as a Covert Channel

This thesis presents, among others, the utilization of the LLC as a new covert channel to obtain information from a co-resident user. It is thus important to see how the cache can be utilized as a covert channel to recover information and, in particular, how previous works have done so. The first thing we should clarify for the reader is that knowledge of which set a victim has utilized can lead to key extraction. For instance, key-dependent data that always utilizes the same set can reveal the value of the key being processed if such a utilization of the set is observed. Based on this, research prior to this thesis demonstrating the application of spy processes to core-private resources like the L1 cache to obtain fine-grained information has been based on two main attacks:

2.3.1 The Evict and Time Attack

The Evict and Time attack was presented in [OST06] as a methodology to recover a complete AES key. In particular, the work demonstrated that key-dependent T-table accesses in the AES implementation (see Section 2.4) can lead to knowledge of the key. The methodology can be described as follows:

1.
The attacker performs a full AES encryption with a known, fixed plaintext and an unknown, fixed key. After this step, all the data utilized by the AES encryption resides in the cache.

2. The attacker evicts a specific set in the cache. Note that, if this set was used by the AES process, the data it held will no longer reside in the cache.

3. The attacker performs the same encryption with the same plaintext and key, and measures the time to complete the encryption.

The time spent performing the encryption will highly depend on whether the attacker evicted a set utilized by AES in step 1. If she did, the encryption will take longer, as the data has to be fetched from memory. If she did not, the attacker guesses that the AES encryption does not use the set she evicted, possibly discarding some key candidates.

[Figure 2.4: Evict and Time procedure]

Figure 2.4 graphically represents the states described. The victim first utilizes the yellow memory blocks, while the attacker evicts one of them in the eviction step. When she times the victim's process again, she sees that the victim needed a block from memory and infers that the victim uses the set she evicted. Note that in this case the attacker has to record two encryptions with the same plaintext and key, which might not be entirely realistic. Further, as the leakage comes from the ability to measure the overall encryption time, the attack is not considered overly practical.

2.3.2 The Prime and Probe Attack

The Prime and Probe attack was also proposed in [OST06], and was later utilized by [Acı07, RTSS09, ZJRR12]. In a sense the attack is similar to Evict and Time, but only the monitoring of the attacker's own memory blocks is required. This is a clear advantage, as it yields a more realistic scenario. These are the main steps:

1. The attacker fills the L1 cache with her own junk data.

2. The attacker waits until the victim executes his vulnerable code snippet.
The key-dependent branch/memory access will utilize some of the sets filled by the attacker in the L1.
3. The attacker accesses her own memory blocks again, measuring the time it takes to access them.
If the attacker observes high access times for some of the memory blocks, it means that the targeted software utilized the set where those memory blocks resided. In contrast, if the attacker observes low access times, it means that the set remained untouched during the software execution. This can be utilized to recover AES, RSA or El Gamal keys. Once again, the Prime and Probe steps can be seen graphically in Figure 2.5. The attacker first fills the set with her red memory blocks, then waits for the victim to utilize the cache (yellow block) and finally measures the time to re-access her red memory blocks. In this case she will trigger a miss in the probe step, meaning the victim utilized the corresponding set. Note that the Prime and Probe attack presents a much more practical scenario than Evict and Time, as no measurement of the victim's process is performed. However, as it was only applied in the L1 cache (mainly due to some complications that will be discussed later), the community still did not consider cache attacks practical enough to be performed in real world scenarios.
2.4 Functionality of Commonly Used Cryptographic Algorithms
In this section we review the functionality of the three most widely used cryptographic algorithms in secure systems: AES, RSA and ECC. The goal is to give the reader enough background to understand the attacks that will later be carried out on these ciphers.
2.4.1 AES
AES is one of the most widely used symmetric cryptographic ciphers, a family of ciphers that utilize the same key for encryption and decryption.
Figure 2.5: Prime and Probe procedure
Symmetric cryptography ciphers can clearly provide confidentiality, as they can encrypt a message that only someone with access to the secret key is able to decrypt. Further, symmetric key ciphers can also be utilized in Cipher Block Chaining (CBC) mode for message integrity checking, as symmetric key algorithm based Message Authentication Codes (MACs) can be built [Dwo04]. These ensure that the message does not get modified while being transmitted between two parties. In particular, AES is part of the Rijndael cipher family, and as a block cipher, it processes messages in blocks of 16 bytes. AES is composed of 4 main operations, namely SubBytes, ShiftRows, MixColumns and AddRoundKey. The key can be 128, 192 or 256 bits long, and depending on the length of the key, AES implements 10, 12 or 14 rounds of the 4 main operation stages. The description of these stages is:
•AddRoundKey: In this stage, the round key is XORed with the intermediate state of the cipher.
•SubBytes: A table look-up operation using a 256-entry, 8-bit S-box.
•ShiftRows: In this stage the last three rows of the state are shifted a given number of steps.
•MixColumns: A linear combination of the columns of the state.
Figure 2.6: AES-128 state diagram with respect to its 4 main operations
Every round implements these 4 operations except the last one, in which the MixColumns operation is not issued. Figure 2.6 represents the state diagram of a 10 round AES with respect to these operations. However, cryptographic library designers often decide to merge the SubBytes, ShiftRows and MixColumns operations into table look-up operations and XOR additions. The reason is that, at the cost of bigger table look-ups, the AES encryption is computed faster, as more operations are precomputed.
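The security-relevant consequence of these table look-ups is easiest to see in the last round, which omits MixColumns: each ciphertext byte is then just an S-box (or T-table) output XORed with a last-round key byte, so an attacker who learns which table entry was accessed recovers the key byte directly. A minimal Python sketch of this relation, deriving the standard AES S-box from its GF(2^8) definition; the key and state bytes below are hypothetical values chosen for illustration, not part of any real trace:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox(x):
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine map."""
    inv = next(y for y in range(256) if gf_mul(x, y) == 1) if x else 0
    out = 0
    for i in range(8):
        bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
               (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

# Last AES round for one byte: c = S[s] ^ k (no MixColumns).
k = 0x2A                 # hypothetical last-round key byte
s = 0x10                 # hypothetical last-round state byte
c = sbox(s) ^ k          # ciphertext byte the attacker observes
# If a cache attack reveals which S-box entry was read (i.e., s),
# the key byte falls out immediately:
recovered = c ^ sbox(s)
assert recovered == k
```

This one-XOR relation between table index, ciphertext and key is exactly what makes last-round cache leakage so damaging.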
Nevertheless, the memory cost is usually affordable in smartphones, laptops, desktop computers and servers. These tables each hold 256 32-bit values, and usually 4 tables are utilized. We call these the T-tables. Some implementations even utilize a different table for the last round. Figure 2.7 shows an example of the output of 4 bytes in the last round utilizing 4 T-tables. Observe that the T-table entry utilized directly depends on the key and the ciphertext byte, a detail that we will later utilize to perform an attack on the last round. In general, for the scope of microarchitectural attacks, we do not consider implementations based on the Intel AES hardware instructions (AES-NI). These instructions are built into modern processors to perform all the AES operations purely in hardware, i.e., without utilizing the cache hierarchy. Thus, any microarchitectural attack applied to AES-NI would not succeed in obtaining the key.
Figure 2.7: Last round of a T-table implementation of AES
2.4.2 RSA
RSA is the most widely used public key cryptographic algorithm. Public key cryptosystems try to solve the key distribution problem of symmetric cryptography. Note that symmetric key cryptography assumes that both parties share the same secret, but it does not explain how that secret can be shared in the first place. That is exactly where public key cryptography comes into place. In public key (or asymmetric) cryptosystems, in contrast to symmetric key cryptography, each agent in the communication is assumed to have two keys: a public key e and a private key d. The public key is publicly known, while the private key is kept secret. The two keys are related in the sense that one decodes what the other encoded, i.e., D(E(M)) = M and E(D(M)) = M.
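This inverse relation between the two keys can be checked with a textbook-sized example; the numbers below are illustrative and far too small for any real security:

```python
# Toy RSA with textbook-sized parameters -- illustration only.
p, q = 61, 53
n = p * q                      # 3233, the public modulus
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, gcd(e, phi) == 1
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

m = 65                         # a "message" < n
c = pow(m, e, n)               # encryption:   E(M) = M^e mod n
assert pow(c, d, n) == m       # decryption:   D(E(M)) = M

s = pow(m, d, n)               # "signature":  D(M) = M^d mod n
assert pow(s, e, n) == m       # verification: E(D(M)) = M
```

Both assertions hold because e and d are inverses modulo φ(n), mirroring the D(E(M)) = M and E(D(M)) = M relations above.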
It is assumed that revealing the public key does not reveal any information that makes it easy to compute D, and therefore only the user holding the private key can decrypt messages encrypted with his public key. Assume the two agents willing to establish a secure communication are Alice and Bob. With public key cryptography, if Alice wants to send an encrypted message to Bob, she will have to encrypt the message with Bob's public key, as the message is then only decryptable with Bob's private key. That is, public key cryptography ensures confidentiality between two communicating agents. The properties of public key cryptosystems also allow a user to digitally sign a message. This signature proves that the sender is indeed who he claims to be, i.e., public key cryptosystems also provide authenticity, a feature that symmetric key algorithms alone cannot claim. For instance, assume that Alice wishes to ensure not only confidentiality but also authenticity of her sent messages. Alice in this case would first sign the message with her own private key, and later encrypt it with Bob's public key: C = E_b(D_a(M)). Upon reception of this ciphertext, Bob would decrypt the message with his own private key (which only he knows) and verify the signature with Alice's public key (proving that only Alice could have signed the message), i.e., M = E_a(D_b(C)). Therefore the communication is not only encrypted but also authenticated. In practice, public key cryptosystems are usually utilized to distribute a shared symmetric key between two agents, such that these can later ensure confidentiality in their communication, since symmetric key cryptography is orders of magnitude faster than public key cryptography. In particular, RSA takes advantage of the practical difficulty of factoring the product of two large prime numbers to build a public key cryptosystem. Its operation is based on modular exponentiations.
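These modular exponentiations are typically computed with variants of the square-and-multiply method, whose secret-dependent control flow is precisely what many of the timing and cache attacks discussed later exploit. A minimal left-to-right sketch (real implementations use windowed variants and countermeasures):

```python
def mod_exp(base, exp, n):
    """Left-to-right square-and-multiply: returns base^exp mod n."""
    r = 1
    for bit in bin(exp)[2:]:     # exponent bits, most significant first
        r = (r * r) % n          # square on every bit
        if bit == '1':
            r = (r * base) % n   # extra multiply only when the bit is set
    return r

# The extra multiply on 1-bits is the classic timing/cache leak:
# each iteration's cost depends on a secret exponent bit.
assert mod_exp(4, 13, 497) == pow(4, 13, 497)
```

Because the multiply is only executed for 1-bits of the (secret) exponent, an observer who can distinguish square-only iterations from square-and-multiply iterations reads the key bit by bit.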
An overview of the key generation algorithm is presented in Algorithm 1, which starts by picking two distinct prime numbers p and q, and calculating n = p∗q and φ(n) = (p−1)∗(q−1). The public key e is chosen such that the greatest common divisor of e and φ(n) is equal to one. Finally, the private key d is chosen as the modular inverse of e with respect to φ(n). The resistance of RSA comes from the fact that, even if the attacker knows n, it is computationally infeasible for him to recover the primes p and q that generated it.

Algorithm 1 RSA key generation given prime numbers p and q
Input: Prime numbers p and q
Output: Public key e and private key d
n = p∗q;
φ(n) = (p−1)∗(q−1);
// Choose e s.t. gcd(e, φ(n)) = 1
d = e^(−1) mod φ(n);
return e, d;

The modular exponentiations themselves are usually implemented with a windowed method, processing w exponent bits per iteration with a table T of precomputed powers of the base:

while i > 0 do
  S = e_i e_(i−1) ... e_(i−w);
  R = R^(2^w) mod N;
  R = R ∗ T[S] mod N;
  i = i − w;
end
return R;

Typical sizes for RSA are 2048 and 4096 bits, while typical window sizes range between 4 and 6. Indeed this is one of the main disadvantages of RSA: a good security level is only achieved with very large keys. The following section describes Elliptic Curve Cryptography (ECC), another public key cryptosystem that provides the same level of security with much smaller key sizes.
Figure 2.8: ECC elliptic curve in which R = P + Q
2.4.3 Elliptic Curve Cryptography
Elliptic Curve Cryptography (ECC) also belongs to the category of public key cryptographic algorithms. As with RSA, each user has a public and private key pair. However, while the security of RSA relies mainly on the large prime factorization problem, ECC relies on the elliptic curve discrete logarithm problem: finding the discrete logarithm of an element in an elliptic curve with respect to a generator is computationally infeasible [HMV03, Mil86, LD00]. In fact, ECC achieves the same level of security as RSA with smaller key sizes. Typical sizes for ECC keys are 256 or 512 bits. The communication peers first have to agree on the ECC curve that they are going to utilize.
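Over a finite field, the geometric addition law described next turns into a handful of modular operations. A toy sketch on the small textbook curve y² = x³ + 2x + 2 over GF(17) (illustrative only, far too small for security), including the ECDH agreement of Figure 2.9; the scalars are hypothetical example keys:

```python
P_MOD, A = 17, 2                 # toy curve y^2 = x^3 + 2x + 2 over GF(17)
G = (5, 1)                       # base point on the curve

def ec_add(P, Q):
    """Point addition; None represents the point at infinity."""
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None              # P + (-P) = infinity
    if P == Q:                   # tangent slope for doubling
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:                        # chord slope for distinct points
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_mul(k, P):
    """Double-and-add scalar multiplication k*P."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

# ECDH: each peer multiplies the other's public key by its own scalar.
ka, kb = 3, 7                            # hypothetical private scalars
Pa, Pb = ec_mul(ka, G), ec_mul(kb, G)    # public keys
assert ec_mul(ka, Pb) == ec_mul(kb, Pa)  # both obtain ka*kb*G
```

The final assertion is the ECDH agreement: both sides compute the same point ka∗kb∗G without ever exchanging a private scalar.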
A curve is just the set of points defined by an equation, e.g., y² = x³ + ax + b. This equation is called the Weierstrass normal form of elliptic curves. On elliptic curves we define addition arithmetic as follows:
•If P is a point on the curve, −P is just its reflection over the x axis.
•If two points P and Q are distinct, the result of adding P + Q = R is computed by drawing a line crossing P and Q, which will intersect the curve in a third point −R. R is computed by taking the reflection of −R with respect to the x axis.
•P + P is computed by drawing a line tangent to the curve at P, which again will intersect the curve in a third point −2P. 2P is just its reflection over the x axis.
An example of an addition of two distinct points P and Q is presented in Figure 2.8, in which the line crossing P and Q intersects the curve in a third point −R, for which we calculate the negated value by taking its reflection over the x axis.
Figure 2.9: ECDH procedure. k_a ∗ k_b ∗ Q is the shared secret key
With the new ECC point arithmetic we can define, as with RSA, cryptosystems based on the elliptic curve discrete logarithm problem. For all of them we assume the two communication peers agree on a curve and a base point Q on the curve. Each of them chooses a scalar that will be their private key, and computes the public key as k ∗ Q, k being the scalar and Q the base point. Note that the resistance of ECC cryptosystems relies on the fact that, knowing Q and k ∗ Q, it is very difficult to obtain k. With these parameters, Elliptic Curve Diffie-Hellman (ECDH) is easily computable. In fact, both peers can agree on a shared key by computing k_a ∗ P_b and k_b ∗ P_a, where k_i and P_i are the private and public keys of peer i. The entire procedure for ECDH can be observed in Figure 2.9. With similar usage of the ECC properties, digital signatures can also be performed.

Chapter 3 Related Work
The following section provides the state-of-the-art of classical as well as microarchitectural side channel attack literature.
Publications are grouped and discussed to establish a comprehensive overview of the scenarios and covert channels that have been exploited to retrieve sensitive information.
3.1 Classical Side Channel Attacks
Classical side channel attacks were introduced by Kocher et al. [Koc96, KJJ99] almost two decades ago and opened a new era of cryptography research. Before the discovery of side channel attacks, the security of cryptographic algorithms was studied by assuming that adversaries only have black box access to cryptographic devices. As it turned out, this model is not sufficient for real world scenarios, because adversaries additionally have access to a certain amount of internal state information that is leaked through side channels. Typical side channels include the power consumption, electromagnetic emanations and timing of cryptographic devices. In the following we give an overview of how these have been exploited, with respect to timing and power side channel attacks.
3.1.1 Timing Attacks
Timing side channel attacks take advantage of variations in the execution time of security related processes to obtain information about the secret being utilized. These variations can arise due to several factors, e.g., cache line collisions or non-constant execution flows. For instance, Kocher [Koc96] demonstrated that execution time variations due to non-constant execution flow in an RSA decryption can be utilized to obtain information about the private key. Similar instruction execution variations were exploited by Brumley and Boneh [BB03] in a more realistic scenario, i.e., an OpenSSL-based web server. Later, Bernstein [Ber04] instead exploited timing variations stemming from cache set access time differences to recover AES keys. In 2009, Crosby et al. [CWR09] expanded on [Koc96], reducing the jitter effect in the measurements and therefore succeeding in recovering the key with significantly fewer traces. [BB03] was also further expanded in 2011 by Brumley et al.
[BT11], showing that ECC implementations can also suffer from timing variation leakage. Timing attacks have also been shown to recover plaintext messages sent over security protocols like TLS, as demonstrated by Al Fardan and Paterson [FP13].
3.1.2 Power Attacks
Over the last years, several attacks exploiting power side channels have been introduced. In the following, the steps required to perform a successful attack are described.
Measurement
In the first step, the acquisition of one or more side channel traces is performed, which results in discrete time series. The resulting traces represent physical properties of the device during execution, such as the power consumption or the electromagnetic field strength at a certain position.
Pre-processing
In the second, optional step, the raw time series can be enhanced for further processing. In this step several techniques are performed in order to reduce noise, remove redundant information by compressing the data, transform the traces into different representations such as the frequency domain, and align the traces.
Analysis
The actual analysis step extracts information from the acquired traces in order to support the key recovery. There are many more or less complex methods available, which can be categorized by two orthogonal criteria. The first is whether multiple side channel traces are needed or a single trace is sufficient to perform the attack. The second criterion is whether a training device is available to obtain the leakage characteristics, or the attack has to be based on assumptions about the leakage characteristics. Table 3.1 shows the four possible combinations of these criteria.
Table 3.1: Side channel attack classification according to the utilized data analysis method

                | Single Measurement          | Multiple Measurements
Non-Profiling   | Simple Analysis (SPA, SEMA) | Differential Analysis (DPA, DEMA, CPA, MIA, KSA)
Profiling       | Template Attacks (TA)       | Stochastic Approach (SA)

The most basic analysis is the simple power analysis (SPA), which only requires one side channel trace and does not require profiling. The common way to perform this kind of attack is by visual inspection of a plotted trace. The goal is to find patterns in the trace caused by conditional branching that depends on the secret key. This attack works well with computations on large numbers as used in public key cryptography. The differential analysis methods are very powerful, using statistics over multiple side channel traces. These methods do not need a profiling step and use leakage assumptions based on hypothetical internal values that depend on a small part of the key. This way, an attacker can try out different assumptions about a part of the key and compare the corresponding leakage predictions with the actually observable leakage. The statistical methods utilized to quantify the accuracy of the assumption are referred to as distinguishers in the side channel literature. The historically first distinguisher was the distance-of-means test [KJJ99], used by the differential power attack (DPA) and the differential electromagnetic attack (DEMA). Later, more powerful ones were introduced, including Pearson correlation (for CPA) [BCO04], mutual information (MIA) [GBTP08], and the Kolmogorov-Smirnov test (KSA) [WOM11]. Attacks with a profiling step are based upon the assumption that the leakage characteristics of different devices of the same type are similar. Using a training device, an attacker can model the leakage characteristics of the device and use this model for the actual attack. The historically first profiled attack was the template attack (TA) [CRR03].
This method is based on multivariate Gaussian models for all possible sub-keys. Template attacks allow key extraction based on only one side channel trace after proper characterization of a training device. The stochastic approach (SA) [SLP05, KSS10] approximates the leakage function of the device using a linear model. The actual key extraction can be performed by different methods; for example, in [SLP05] the maximum likelihood principle is used. The linear model parameters can also be used to identify leakage sources in a design, as described by De Santis et al. [DSKM+13].
3.2 Microarchitectural Attacks
Microarchitectural attacks exploit specific features of computer microarchitectures to recover cryptographic keys without requiring physical access to the device, by observing the footprint that other processes leave on shared hardware; co-located cloud users are therefore a natural target. The focus of this section is put on the hardware components that have been exploited in microarchitectural attacks, which have exhibited a significant shift towards multi-core/multi-CPU systems. A rough description of the basic components in modern microarchitectures was presented in Figure 2.2. Modern computers usually consist of one or more CPU sockets, each containing several CPU cores. Processes are executed in one or several cores at the same time. The Branch Prediction Unit (BPU) is in charge of making predictions on the possible outcomes of branches inside the code being executed. The L1/L2 and L3 caches store data and instructions that have recently been used, because they are very likely to be utilized again, thereby avoiding subsequent DRAM accesses. The memory bus is in charge of the communication between the cache hierarchy and the DRAM, and can further be used to communicate the state of shared memory across CPU sockets. Finally, the DRAM holds the memory pages that are necessary for the execution of the program.
It is particularly important for microarchitectural attacks to identify which of these components can serve as a covert channel to perform single-core, cross-core and cross-CPU attacks. In fact, from the figure we can identify many of the covert channels that will be explained later. For instance, it can be observed that an attacker trying to exploit Branch Prediction Units (BPUs) or L1/L2 caches has to co-reside in the same core as the victim, as they are core-private resources. A cross-core attack can be implemented if the L3 cache is utilized as the covert channel, since it is shared across cores. Finally, attacks across CPU sockets can be achieved by exploiting, among other components, the memory bus and the DRAM. All of these have been exploited in very different manners, which will be explained in the following sections.
3.2.1 Hyper-threading
Hyper-threading technology was introduced by Intel in 2002 to perform multiple computations in parallel. Each processor core has two virtual threads that share the work load of a process. The main purpose of hyper-threading is to increase the number of independent instructions in the pipeline. In 2005, Percival [Per05] exploited this technology to establish a cache covert channel in the L1 data cache. A spy process and a cryptographic library are executed in the same core but in different threads, and the spy code is able to recover some bits of the secret RSA exponent. In 2007, Aciicmez et al. [AS07] proposed a new method to exploit the leakage in the hyper-threading technology of Intel processors. To do so, they take advantage of the ALU's large parallel integer multiplier, shared between the two threads in a processor core. Even though they did not introduce a new vulnerability in the OpenSSL library, they showed that it is possible to use the ALU as a covert channel during the secret exponent computation.
3.2.2 Branch Prediction Unit Attacks
Control hazards have an increasing impact on the performance of a CPU as the pipeline depth increases.
Efficient handling of speculative execution of instructions becomes a critical solution against control hazards. This efficiency is usually achieved by predicting the most likely execution path, i.e., by predicting the most likely outcome of a branch. Specifically, Branch Prediction Units (BPUs) are in charge of predicting the most likely path that a branch will take. BPUs are usually divided into two main pieces: the Branch Target Buffer (BTB) and the predictor. The BTB is a buffer that stores the addresses of the most recently processed branches, while the predictor is in charge of making the most likely prediction of the path. As BPUs are accessible by any user within the same core, the BTB has become a clear target for microarchitectural attacks. Imagine a BPU in which branches that are not present in the BTB are always predicted as not taken, and are only loaded into the BTB once the branch is taken. If a piece of code has a security critical branch to be predicted, a malicious user can interact with the BTB (i.e., by filling it) to ensure that the branch will be predicted as not taken. If the attacker is able to measure the execution time of the piece of code, he will be able to tell whether the branch was mispredicted (i.e., the branch was taken) or correctly predicted (i.e., the branch was not taken). This is only one possible attack vector that can be implemented against the BPU. A more realistic scenario is one in which the attacker fills the BTB with always-taken branches (which evicts any existing branches from the BTB), then waits for the victim to execute his security critical branch, and finally measures the time to execute his always-taken branches again. If the security critical branch was taken, it will be loaded into the BTB and one of the attacker's branches will be evicted, causing a misprediction that he will observe in the measured time.
On the other hand, if the security critical branch was not taken, it will not be loaded into the BTB and the attacker will predict all his branches correctly. The two attack models discussed have been proposed in [AKKS07, AKS07]. However, BPU microarchitectural attacks have a clear disadvantage when compared to other microarchitectural attacks: BPUs are core-private resources. Thus, these attacks are only applicable if attacker and victim co-reside in the same core. Nevertheless, new scenarios arise in which core co-residency is easily achievable, such as TEE attacks (discussed in Section 3.2.12). Malicious OSs can control the scheduling and the CPU affinity of each process. In fact, attacks utilizing the BPU have already been proposed in that scenario [LSG+16].
3.2.3 Out-of-order Execution Attacks
A recently discovered microarchitectural side channel that was proved to establish communication between co-resident VMs is the exploitation of out-of-order instruction execution [D'A15]. Out-of-order execution of instructions is an optimization, present in almost all processors, that allows a later instruction to complete while waiting for the result of a previous instruction. A first intuition that out-of-order instructions finish earlier than in-order instructions was given in [CVBS09]. As with other microarchitectural attacks, such optimizations can indeed open new covert channels that can be exploited by malicious attackers. Assume two threads execute interdependent load and store operations on a shared variable (i.e., one thread stores the value loaded by the other one). If these threads run in parallel, three cases might arise: both threads are executed in-order and at the same time, both threads are executed in-order but one is executed faster, or the threads execute instructions out of order. Note that the result of the operations will be different in each of these cases.
Thus, a covert channel can be established by transmitting a 0 when instructions are executed out of order, and a 1 in any other case. Indeed, an attacker can ensure that a 1 is transmitted by utilizing memory barrier instructions, which keep the execution of the instructions in-order.
3.2.4 Performance Monitoring Units
Performance Monitoring Units (PMUs) are a set of special-purpose registers that store the counts of hardware and software related activities. In 1997, the first study was conducted by Ammons et al. [ABL97], where the logical structure of hardware counters is explained in order to profile different benchmarks. In 2008, Uhsadel et al. [UGV08] showed that hardware performance counters can be used to perform cache attacks by looking at the L1 and L2 D-cache misses. In 2013, Weaver et al. [WTM13] investigated x86_64 systems to analyze deterministic counter events, concluding that non-x86 hardware has more deterministic counters. In 2015, Bhattacharya et al. [BM15] showed how to use the branch prediction events in hardware performance counters to recover 1024-bit RSA keys on Intel processors.
3.2.5 Special Instructions
The set of instructions that a CPU can understand and execute is referred to as its instruction set architecture (ISA). Although usually composed of common instructions (e.g., mov or add), some ISAs include a set of special instructions that implement system functionalities the system cannot provide by itself. One example concerns CPUs that lack memory coherence protocols. In these cases, special instructions are needed to handle conflicting (thus incoherent) values between the DRAM and the cache hierarchy. For instance, the Intel x86_64 ISA provides the clflush instruction for this purpose. The clflush instruction evicts the desired data from the cache hierarchy, thus making sure that the next fetch of that data will come from the DRAM. Since ARMv8, ARM processors have started including similar instructions in their ISA.
Although these instructions can be crucial for certain processors, they also serve as helpers to implement microarchitectural attacks. In fact, an attacker often wants to evict a shared variable from the cache, either to reload it later [YF14, LGS+16, IES16], to measure the flushing time [GMWM16], or to access the DRAM continuously [KDK+14a]. A similar situation can be observed with the instructions added to utilize hardware random number generators. In particular, it has recently been shown that, due to the low throughput of the special rdseed instruction, a covert channel can be implemented by transmitting different bits depending on whether rdseed is exhaustively used or not [EP16].
3.2.6 Hardware Caches
Modern cache structures, as shown in Figure 2.2, are usually divided into core-private L1 instruction and data caches, and at least one level of unified, core-shared LLC. The instruction cache (I-cache) is the part of the L1 cache responsible for storing instructions recently executed by the CPU. In 2007, Aciicmez [Acı07] showed the applicability of I-cache attacks by exploiting the cipher's instruction accesses to recover a 1024-bit RSA secret key. The monitored I-cache sets are filled with dummy instructions by a spy process whose accesses are later timed to recover an RSA decryption key. One year later, Aciicmez et al. [AS08] demonstrated the power of I-cache attacks on OpenSSL RSA decryption processes that use the Chinese Remainder Theorem (CRT), Montgomery's multiplication algorithm and blinding for the modular exponentiation. In 2010, Aciicmez et al. [ABG10] revisited I-cache attacks. In this work, the attack is automated by using Hidden Markov Models (HMMs) and vector quantization to attack OpenSSL's DSA implementation. In 2012, Zhang et al. [ZJRR12] published a paper on cross-VM attacks within the same core using the L1 data cache. The attack targets Libgcrypt's ElGamal decryption process to recover the secret key.
To eliminate the noise, a Support Vector Machine (SVM) is applied to classify square, multiplication and modular reduction operations from the L1 data cache accesses. The output of the SVM is given to an HMM to further reduce the noise and increase the reliability of the method. This paper demonstrates the first attack in a cross-VM setup using the L1 data cache. Finally, in 2016, Zankl et al. [ZHS16] proposed an automated method to find I-cache leakage in the RSA implementations of various cryptographic libraries. The correlation technique shows that there are still many libraries that are vulnerable to I-cache attacks. The data cache stores recently accessed data and, as with the I-cache, it has been widely exploited to recover sensitive information. As early as 2005, Percival [Per05] demonstrated the usage of the L1 data cache as a covert channel to extract information from core co-resident processes. To demonstrate the efficiency of the side channel, the OpenSSL RSA implementation was targeted. The main reason for the leakage is that the different precomputed multipliers are loaded into different L1 data cache lines. By profiling each corresponding set it is possible to recover more than half of the bits of the secret exponent. A year later, Osvik et al. [OST06] presented the Prime and Probe and Evict and Time attacks, which were utilized to recover AES cryptographic keys. In 2007, Neve et al. [NS07] showed that it is possible to use the L1 data cache as a side channel in single-threaded processors by exploiting the OS scheduler. This technique was applied to the last round of OpenSSL's AES implementation, where only one T-table is used. The authors recovered the last round key with 20 ciphertext/access pairs. In 2009, a new type of attack was presented by Brumley et al. [BH09], where L1 data cache templates are implemented to recover OpenSSL ECC decryption keys. The goal of this method is to automate cache attacks by applying HMMs and vector quantization, demonstrated on the Pentium 4 and Intel Atom.
Finally, in 2011, Gullasch et al. [GBK11] presented the Flush and Reload attack, although the attack would acquire its name later. The study demonstrated that it is possible to exploit Linux' Completely Fair Scheduler (CFS) to interrupt the AES thread and recover the AES encryption key with very few encryptions.
LLC attacks
In 2014, Yarom et al. [YF14] implemented, for the first time, the Flush and Reload attack across cores/VMs to recover sensitive information, aided by memory deduplication mechanisms. The attack was applied to the GnuPG implementation of RSA. With this work the Flush and Reload attack became popular, and more scenarios in which it could be applied arose. For instance, Benger et al. [BvdPSY14] presented the Flush and Reload attack on the ECDSA implementation of OpenSSL. In the same year, Irazoqui et al. [IIES14b] applied Flush and Reload to recover AES encryption keys across VMs. This attack was implemented on VMware platforms to show the strength of the attack in virtualized environments. Shortly after, Zhang et al. [ZJRR14] showed the applicability of the Flush and Reload attack to verify co-location in PaaS clouds and to obtain the number of items in a co-resident user's shopping cart. In 2015, Irazoqui et al. [IIES15] showed that it is possible to recover sensitive data from incorrectly CBC-padded TLS packets. In the same year, Gruss et al. [GSM15] presented an automated way to catch LLC cache patterns applying the Flush and Reload method and consequently detect key strokes pressed by the victim. Eventually, hypervisor providers disabled the deduplication feature on their platforms to prevent Flush and Reload attacks. Therefore, there was a need to find a new way to target the LLC. Concurrently, Irazoqui et al. [IES15a] and Liu et al. [Fan15] described how to apply the Prime and Probe attack to the LLC on deduplication-free systems. While Irazoqui et al.
[IES15a] applied the method to recover the OpenSSL AES last round key in VMware and Xen hypervisors, Liu et al. [Fan15] demonstrated how to recover an El Gamal decryption key from the recent GnuPG implementation. In 2016, Inci et al. [IGI+16b] showed that commercial clouds are not immune to these attacks and applied the same technique in the commercial Amazon EC2 cloud to recover a 2048-bit RSA key from the Libgcrypt implementation. Moreover, Oren et al. [OKSK15] presented the feasibility of implementing cache attacks through JavaScript execution to profile incognito browsing. In the same year, Lipp et al. [LGS+16] applied three different methods (Prime and Probe, Flush and Reload and Evict and Reload) to attack the ARM processors typically used in mobile devices. The success of the work showed that it is possible to implement cache attacks on mobile platforms.

3.2.7 Cache Internals

With novel and more complex cache designs, which include many undocumented features, it became increasingly difficult to obtain the knowledge necessary to use the cache as a covert channel. Aiming at correctly characterizing the usage of the caches as an attack vector, researchers started investigating their designs. An example of these features are the LLC slices, selected by an undocumented hash function in Intel processors. In 2015, the first LLC slice reverse engineering was performed by Irazoqui et al. [IES15b], utilizing timing information to recover the slice selection algorithms of a total of 6 different Intel processors. Later, Maurice et al. [MSN+15] presented a more efficient method using performance counters to recover the slice selection algorithm. Later, Inci et al. [IGI+16b] and Yarom et al. [YGL+15] again utilized timing information to recover non-linear slice selection algorithms. Finally, in 2016, Yarom et al. [YGH16] exploited cache-bank conflicts on Intel Sandy Bridge processors.
The study shows that the cache banks in the L1 cache can be used to recover an RSA key if hyper-threading is enabled on the core.

3.2.8 Cache Pre-Fetching

Pre-fetching is a commonly used method in computer architecture to provide better performance when accessing instructions or data from local memory. It consists of predicting the utilization of a cache line and fetching it into the cache prior to its utilization. In 2014, Liu et al. [LL14] suggested that previous architectural countermeasures do not provide sufficient protection against the demand-fetch policy of shared caches. Thus, to prevent pre-fetching attacks they propose a randomization policy for the LLC in which a cache miss is not sent directly to the CPU; instead, randomized fetches are performed in the neighborhood of the missing memory line. In 2015, Rebeiro et al. [RM15] presented an analysis of sequential and arbitrary-stride pre-fetching on cache timing attacks. They conclude that ciphers with smaller tables leak more than ciphers with large tables due to data pre-fetching. In 2016, Gruss et al. [GMF+16] utilized pre-fetching instructions to obtain address information, defeating SMAP, SMEP and kernel ASLR.

3.2.9 Other Attacks on Caches

Other attacks have exploited different characteristics of a process rather than execution time and cache accesses. In 2010, Kong et al. [KJC+10] presented a thermal attack on the I-cache by checking the cache thermal sensors. They showed that dynamic thermal management (DTM) is not sufficient to prevent thermal attacks if the malicious code targets a specific section of the I-cache. In 2015, Masti et al. [MRR+15] expanded the previous work to multi-core platforms. They used core temperature as a side channel to communicate with other processes. In the same year, Riviere et al. [RNR+15] targeted the I-cache with electromagnetic fault injection (EMFI) on the control flow.
They showed that fault injection is straightforward to apply against cryptographic libraries even if they have countermeasures against fault attacks. In 2016, researchers focused on Spin-Transfer Torque RAM (STTRAM), a promising technology for cache applications. Rathi et al. [RDNG16] proposed three new techniques to handle the magnetic field and temperature of the processor. The new techniques are stalling, cache bypass and check-pointing to establish data security, and the resulting performance degradation is measured with the SPLASH benchmark suite. In the same year, Rathi et al. [RNG16] exploited the read/write current and latency of the STTRAM architecture to establish a side channel inside the LLC.

3.2.10 Memory Bus Locking Attacks

So far it has been shown how an attacker can create contention at different cache hierarchy levels to retrieve secret information from a co-resident user. Memory bus locking attacks are different in this sense, since they do not have the ability to recover fine-grain information. Yet, they can be utilized to perform a different set of malicious actions, or as a pre-step to the execution of more powerful side channel attacks.

When two different threads located in different cores operate on the same shared variable, the value of the shared variable retrieved by each thread might be different, since modifications might only be performed in core-private caches. This is called a data race. In order to solve this, atomic operations operate on shared variables in such a way that no other thread can see the modification half-complete. For that purpose, atomic operations usually utilize lock prefixes to ensure the particular shared variable in the cache hierarchy is locked until one of the threads has finished updating the value. The lock instructions work well when the data to be locked fits into a single cache line.
However, if the data to be locked spans more than one cache line, modern CPUs cannot issue two lock instructions on separate cache lines at the same time [IGES16]. Instead, modern CPUs adopt the solution of flushing any memory operation from the pipeline, incurring several overheads that can be exploited by malicious users. In fact, there are several examples of malicious usage of memory bus locking overheads. For instance, Varadarajan et al. [VZRS15], Xu et al. [XWW15] and Inci et al. [IGES16] utilized this mechanism to detect co-residency in IaaS clouds by observing the performance degradation of HTTP queries to the victim. Finally, Inci et al. [IIES16] and Zhang et al. [ZZL17] utilized memory bus locking effects as a Quality of Service (QoS) degradation mechanism in IaaS clouds.

3.2.11 DRAM and Rowhammer Attacks

The DRAM, often also called the main memory, is the hardware component in charge of providing memory pages to executing programs. There are two major attacks targeting the DRAM: the rowhammer attack and the DRAMA side channel attack. Since both attacks utilize different concepts and have different goals, we proceed to explain them separately.

The DRAM is usually divided into several categories, i.e., channels (physical links between the DRAM and the memory controller), Dual Inline Memory Modules (DIMMs, physical memory modules attached to each channel), ranks (back and front of a DIMM), banks (analogous to cache sets) and rows (analogous to cache lines). Two addresses are physically adjacent if they share the same channel, DIMM, rank and bank. Additionally, each bank contains a row buffer that holds the most recently accessed row. DRAMA side channel attacks take advantage of the fact that serving a DRAM access from the row buffer is faster than a row buffer miss. Thus, similar to what is observed in cache attacks, an attacker can create collisions in a bank's row buffer to infer information about a co-resident victim.
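The channel/DIMM/rank/bank/row decomposition above can be illustrated with a toy address model. The bit layout below is purely hypothetical (real memory controllers use undocumented, often XOR-based mappings); it only illustrates when two physical addresses contend for the same row buffer, which is the adjacency notion DRAMA relies on:

```python
# Toy model of DRAM address decomposition. Field positions are
# assumptions for illustration, not a real Intel/DDR mapping.
from collections import namedtuple

DramLoc = namedtuple("DramLoc", "channel dimm rank bank row")

def decompose(phys_addr: int) -> DramLoc:
    """Split a physical address into hypothetical DRAM coordinates."""
    channel = (phys_addr >> 6) & 0x1   # 1 bit  -> 2 channels
    dimm    = (phys_addr >> 7) & 0x1   # 1 bit  -> 2 DIMMs per channel
    rank    = (phys_addr >> 8) & 0x1   # 1 bit  -> front/back of DIMM
    bank    = (phys_addr >> 9) & 0x7   # 3 bits -> 8 banks
    row     = phys_addr >> 16          # remaining bits select the row
    return DramLoc(channel, dimm, rank, bank, row)

def same_bank(a: int, b: int) -> bool:
    """True if both addresses compete for the same row buffer."""
    la, lb = decompose(a), decompose(b)
    return la[:4] == lb[:4]  # channel, dimm, rank and bank all equal
```

Two addresses for which `same_bank` holds but whose rows differ will evict each other from the row buffer, producing the measurable timing difference that the side channel exploits.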
The rowhammer attack takes advantage of the influence that accesses to a row in a particular bank of the DRAM have on adjacent rows of the same bank. In fact, continuous accesses can increase the charge leak rate of adjacent rows. Typically, DRAM has a refresh rate (at which cell values are refreshed) to avoid the complete charge leak of cells in a DRAM row. However, if the charge leak rate is faster than the refresh rate, then some cells completely lose their charge before their value can be refreshed, resulting in bit flips. These bit flips can be particularly dangerous if adjacent rows contain security critical information. The rowhammer attack has been exploited more than the DRAMA side channel attack. In fact, since it was discovered [KDK+14a], rowhammer has been shown to be executable from a JavaScript extension [GMM16], on mobile devices [vdVFL+16], and even from an IaaS VM to break the isolation provided by the hypervisor [XZZT16]. Further, it has been shown that it can be utilized as a mechanism to inject faults into cryptographic keys that can lead to their full exposure [BM16].

3.2.12 TEE Attacks

Trusted Execution Environments (TEEs) are designed with the goal of executing processes securely in isolated environments, even in the presence of powerful adversaries such as a malicious Operating System. In order to achieve this goal, TEEs usually work, among other measures, with encrypted DRAM memory pages. However, TEEs still utilize the same hardware components as untrusted execution environments, and thus most of the microarchitectural attacks explained above are still applicable to processes executed inside a TEE. Examples of TEEs are the well known Intel SGX [Sch16] and the ARM TrustZone [ARM09].

Perhaps the distinguishing factor when discussing the applicability of microarchitectural attacks to TEEs is the DRAM memory encryption. While TEE DRAM memory pages are indeed encrypted, this data is decrypted when placed in the cache hierarchy for faster access by the CPU.
This implies that any of the cache hierarchy attacks discussed above is still applicable against TEE environments, as nothing stops an attacker from creating cache contention and obtaining decrypted TEE information. More than that, a malicious OS can schedule processes in any way it wants, interrupt the victim process every few cycles, or prevent it from using side-channel-free hardware resources (e.g. AES-NI [TNM11]). Thus, compared to the low resolution that microarchitectural attackers obtain under a commercial OS, a malicious OS obtains a much higher resolution, as it can observe every single memory access made by the victim.

A similar issue is experienced with BPU microarchitectural attacks. BPU attacks were largely dismissed due to their core co-residency limitation (BPUs are core-private resources) and their low access distinguishability. However, the whole picture changes when a malicious OS comes into play. Once again, a malicious OS is in control of the scheduling of the processes being executed in the system. Thus, it controls where and when each process executes, i.e., it can schedule malicious processes on the same core as the victim process. The situation becomes even more dangerous when special instructions are available to the OS to better control the outcome of the branches executed [LSG+16].

Although DRAM row access and memory bus locking attacks have not yet been shown to be applicable in the TEE scenario, the theoretical characteristics of these attacks suggest that both will soon be demonstrated to be applicable in TEEs. In fact, there is no solid theoretical argument that suggests TEEs would prevent memory bus locking attacks. As for the DRAM access attacks, our assumption is that applicability will depend on how the DRAM rows are placed in the row buffer. If these are placed unencrypted, then the attack would be equally viable.
Rowhammer, on the contrary, is assumed to be defeated by the memory authentication and integrity mechanisms utilized by TEEs.

3.2.13 Cross-core/-CPU Attacks

Multiple microarchitectural attacks have been reviewed in the previous paragraphs, but it has not been discussed how their applicability has improved over the years. In fact, microarchitectural attacks have steadily gained practicality since they were first implemented in 2005. Indeed, the first microarchitectural attacks (i.e. those targeting the L1 and the BPU) were largely dismissed for several years, as they were only applicable if victim and attacker shared the same CPU core. In a world of multi-core systems, this restriction seemed to be the main limitation preventing microarchitectural attacks from being considered a threat.

It was in 2013 when the first cross-core cache attack was presented, even though it required that victim and attacker share memory pages. Later this requirement would be eliminated, making cross-core cache attacks applicable in almost every system. However, there was a requirement that microarchitectural attacks had not yet met, i.e., to be applicable even when victim and attacker share the underlying hardware but are located in different CPU sockets.

Cross-CPU applicability would later be accomplished by cache coherency protocol attacks, rowhammer, DRAMA and memory bus locking attacks. Indeed, all of them target resources that are shared not only by every core, but further by every CPU socket in the system. As explained in the previous paragraphs, the first targets the cache coherence protocol characteristics, the next two target the DRAM characteristics, while the last targets the memory bus characteristics. In short, microarchitectural attacks have experienced a huge increase in popularity in recent years, especially due to their improved characteristics and applicability.
While they were largely dismissed at the beginning for being applicable only to core-private resources, current microarchitectural attacks are applicable even across CPU sockets.

Chapter 4

The Flush and Reload attack

Until 2013, microarchitectural attacks had been largely disregarded by the community, mainly due to their lack of applicability in real world scenarios. In particular, they suffered from the following limitations:

•Microarchitectural attacks were only shown to be successful on core-private resources, like the L1 caches and the BPUs. With the wide adoption of multi-core systems, the restriction of having to be located on the same core as the victim is a huge obstacle to the applicability of microarchitectural attacks.

•Core-private resources usually exhibit low resolution for executing microarchitectural attacks. For instance, L1 and L2 accesses only differ by a few cycles. BPU mispredictions are also hardly distinguishable, as their penalties have been heavily optimized. Therefore, core-private resources are less likely to withstand the amount of noise typically observed in real world scenarios.

•Before 2013, microarchitectural attacks lacked real world scenarios in which they could be applied, as they usually demanded a highly controlled environment.

In order to increase the applicability of microarchitectural attacks it is necessary to investigate and discover new covert channels that can be exploited across cores and that are resistant to the amount of noise typical of modern usage environments. The only work prior to ours that investigated cross-core covert channels is [YF14], in which an RSA key is recovered through the Last Level Cache (LLC). This chapter expands on [YF14], demonstrating that LLC attacks can recover a wider range of information in a number of scenarios. The first attack that we discuss is the Flush and Reload attack, first applied to the L1 in [GBK11], which acquired its name in [YF14].
The Flush and Reload attack utilizes the LLC as the covert channel to recover information and, prior to this work, was only shown to be applicable across processes being executed in the same OS. In this chapter we demonstrate how such an attack can be utilized to recover fine-grain information from a co-resident VM placed on a different core. In particular,

•We demonstrate that Flush and Reload can recover information from co-resident VMs in hypervisors as popular as VMware.

•We expand on the capabilities of Flush and Reload by recovering an AES key in less than a minute.

•We show that microarchitectural attacks can be applied against high level protocols by recovering TLS session messages.

4.1 Flush and Reload Requirements

We cannot start discussing the functionality of the Flush and Reload attack without first mentioning the requirements that are needed to successfully apply it. In particular, the Flush and Reload attack has four very important pre-requisites that have to be met in the targeted system for it to succeed:

•Shared memory with the victim: The Flush and Reload attack assumes that attacker and victim share at least the targeted memory blocks in the system, i.e., both access the same physical memory address. Although this requirement might seem difficult to achieve, we discuss in Section 4.2 the scenarios in which memory sharing approaches are implemented.

•CPU socket co-residency: The Flush and Reload attack is only applicable if attacker and victim co-reside in the same CPU socket, as the LLC is only shared across cores and not across CPU sockets. Note that, unlike prior microarchitectural attacks, core co-residency is not necessary.

•Inclusive LLC: The Flush and Reload attack requires the inclusiveness property in the LLC. This, as we will see in Section 4.4.1, is necessary to be able to manipulate memory blocks in the upper level caches. The vast majority of Intel processors feature an inclusive LLC.
However, as we will see in Section 5, similar technical procedures can exploit non-inclusive caches through cache coherence covert channels.

•Access to a flushing instruction: The Flush and Reload attack requires a specific instruction in the Instruction Set Architecture (ISA) capable of forcing memory blocks to be removed from the entire cache hierarchy. In x86-64 systems, this is provided through the clflush instruction.

Without any of these four requirements the Flush and Reload attack is not able to successfully recover data from a victim. Two of them can be assumed to be easily achievable in Intel processors, as they feature inclusive LLCs and the clflush instruction is accessible from userspace. CPU socket co-residency should also be easily accomplished, as modern computers do not usually feature more than two CPU sockets. However, the shared memory requirement might be harder to achieve. In the following section we describe mechanisms under which different processes/users share the same physical memory.

4.2 Memory Deduplication

Although the idea of different processes/users sharing the same physical memory might seem threatening, the truth is that we encounter mechanisms that permit it in popular OSs and hypervisors. In particular, all Linux OSs implement the so-called Kernel Samepage Merging (KSM) mechanism, which merges duplicate read-only memory pages belonging to different processes. Consequently KVM, a Linux-based hypervisor, features the same mechanism across different VMs. Furthermore, VMware implements Transparent Page Sharing (TPS), a mechanism similar to KSM that also allows different VMs to share memory. Even though the deduplication optimization saves memory and thus allows more virtual machines to run on the host system, it also opens a door to side channel attacks.
While the data in the cache cannot be modified or corrupted by an adversary, parallel access rights can be exploited to reveal secret information about processes executed in the target VM. We will focus on the Linux implementation of the Kernel Samepage Merging (KSM) memory deduplication feature and on the TPS mechanism implemented by VMware. We describe in detail the functionality of KSM, but the same procedure is implemented by TPS.

KSM is the Linux memory deduplication implementation that first appeared in Linux kernel version 2.6.32 [Jon10, KSM]. In this implementation, the KSM kernel daemon, ksmd, scans the user memory for potential pages to be shared among users [AEW09]. Since scanning the whole memory continuously would be CPU intensive and time consuming, KSM instead scans only the potential candidates and creates signatures for these pages. These signatures are kept in the deduplication table. When two or more pages with the same signature are found, they are cross-checked completely to determine if they are identical. To create signatures, KSM scans the memory at 20 msec intervals and at best scans only 25% of the potential memory pages at a time. This is why any memory disclosure attack, including ours, has to wait for a certain time before the deduplication takes effect, upon which the attack can be performed. In our case, it usually took around 30 minutes to share up to 32000 pages.

Figure 4.1: Memory Deduplication Feature

During the memory search, KSM analyzes three types of memory pages [SIYA12]:

•Volatile Pages: Pages whose contents change frequently and that should not be considered as candidates for memory sharing.

•Unshared Pages: Candidate pages for deduplication. Through the madvise system call, they are advertised to ksmd as likely candidates for merging.

•Shared Pages: Deduplicated pages that are shared between users or processes.
When a duplicate page signature is found among candidates and the contents are cross-checked, ksmd automatically tags one of the duplicate pages with a copy-on-write (COW) tag and shares it between the processes/users, while the other copy is eliminated. Experimental implementations [KSM] show that using this method it is possible to run over 50 Windows XP VMs with 1GB of RAM each on a physical machine with just 16GB of RAM. As a result, the power consumption and system cost are significantly reduced for systems with multiple users.

Figure 4.2: Copy-on-Write Scheme

4.3 Flush and Reload Functionality

We described the pre-requisites that the system in which the Flush and Reload attack will be applied needs to fulfill. If they are satisfied, and assuming victim and attacker share a memory block b, the Flush and Reload attack can be applied by performing the following steps:

Flushing stage: In this stage, the attacker uses the clflush instruction to flush b from the cache, hence making sure that it has to be retrieved from main memory the next time it needs to be accessed. We have to remark here that the clflush instruction does not only flush the memory block from the cache hierarchy of the corresponding working core; it is flushed from the caches of all the cores due to the LLC inclusiveness property. This is an important point: if it only flushed the corresponding core's caches, the attack would only work if the attacker's and victim's processes were co-residing on the same core. This would have required a much stronger assumption than just being on the same physical machine.

Target accessing stage: In this stage the attacker waits until the target runs a fragment of code which might use the memory block b that has been flushed in the first stage.

Reloading stage: In this stage the attacker reloads the previously flushed memory block b and measures the time it takes to reload.
We perform this with the popular (and userspace accessible) rdtsc instruction, which reads the hardware cycle counter. Before reading the cycle counter we issue memory barrier instructions (mfence and lfence) to make sure all load and store operations have finished before reading the memory block b. Depending on the reload time, the attacker decides whether the victim accessed the memory block (in which case the memory block will be present in the cache) or not (in which case the memory block will be fetched from memory).

The timing difference between a cache hit and a cache miss makes the aforementioned access easily detectable by the attacker. In fact, this is one of the big advantages of targeting the LLC as a covert channel.

Figure 4.3: Reload time in hardware cycles when a co-located VM uses the memory block b (red, LLC accesses) and when it does not use the targeted memory block b (blue, memory accesses) using KVM on an Intel XEON 2670

Figure 4.3 shows the reload times of a memory block being retrieved from the LLC and from the memory, represented as red and blue histograms respectively. We observe that LLC accesses utilizing the Flush and Reload technique usually take around 70 cycles, while memory accesses usually take around 200 cycles. Thus, an attacker using the Flush and Reload technique can easily distinguish when a victim uses a shared memory block.

Flush and Reload memory targets: As we explained, Flush and Reload targets memory blocks that are shared between victim and attacker, and mechanisms like KSM can help make this happen. However, this also means that Flush and Reload can only target very specific memory blocks, as KSM only shares read-only memory pages.
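The reload-time decision above amounts to a simple threshold test. The following Python sketch simulates only that decision logic; the probe itself (clflush, the timed access, and rdtsc with fences) is hardware-bound and is not shown. The synthetic samples are drawn around the ~70 and ~200 cycle figures reported above, and the 130-cycle threshold is an assumed value, not a measurement:

```python
# Simulated Flush+Reload decision stage. Real reload timings would come
# from rdtsc around a memory access; here we draw synthetic samples
# around the ~70-cycle (LLC hit) and ~200-cycle (DRAM) modes above.
import random

THRESHOLD = 130  # cycles; an assumed value between the two modes

def victim_accessed(reload_cycles: int) -> bool:
    """A fast reload means the victim touched the shared block b."""
    return reload_cycles < THRESHOLD

random.seed(1)
hits   = [random.gauss(70, 10) for _ in range(1000)]   # block was in the LLC
misses = [random.gauss(200, 20) for _ in range(1000)]  # block came from DRAM

hit_rate  = sum(victim_accessed(int(t)) for t in hits) / len(hits)
miss_rate = sum(not victim_accessed(int(t)) for t in misses) / len(misses)
```

With the two timing modes this well separated, the classifier is essentially error-free, which is why a single empirically chosen threshold suffices in practice.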
Functions and data declared globally usually belong to the kind of memory that an attacker can target with Flush and Reload. However, dynamically allocated data, as it is modifiable, cannot be targeted by Flush and Reload. Thus, an attacker needs to pick the application to attack carefully, taking this consideration into account.

4.4 Flush and Reload Attacking AES

AES has been one of the main targets of cache attacks. For instance, Bernstein demonstrated that table entries in different cache lines can have different L1 access times, while Osvik et al. applied the Evict and Time and the Prime and Probe attacks to the L1 cache. In this section we will describe the principles of our Flush and Reload attack on the C implementation of AES in OpenSSL. In [GBK11], Gullasch et al. described a Flush and Reload attack on the AES implementation of the OpenSSL library. However, in this study we are going to use the Flush and Reload method with some modifications that, from our point of view, have clear advantages over [GBK11]. We consider two scenarios: the attack as a spy process running in the same OS instance as the victim (as done in [GBK11]), and the attack running as a cross-VM attack in a virtualized environment.

4.4.1 Description of the Attack

As in prior Flush and Reload attacks, we assume that the adversary can monitor accesses to a given cache line. However, unlike the attack in [GBK11], this attack

•only requires the monitoring of a single memory block; and

•flushing can be done before encryption and reloading after encryption, i.e. the adversary does not need to interfere with or interrupt the attacked process.

More concretely, the Linux kernel features a completely fair scheduler which tries to evenly distribute CPU time among processes. Gullasch et al. [GBK11] exploited the Completely Fair Scheduler (CFS) [CFS] by overloading the CPU while a victim AES encryption process is running.
They managed to gain control over the CPU and suspend the AES process, thereby gaining an opportunity to monitor the cache accesses of the victim process. Our attack is agnostic to CFS and does not require time consuming overloading steps to gain access to the cache.

We assume the adversary monitors accesses to a single line of one of the T tables of an AES implementation, preferably a T table that is used in the last round of AES. Without loss of generality, let's assume the adversary monitors the memory block corresponding to the first positions of table T, where T is the lookup table applied to the targeted state byte si, and si is the i-th byte of the AES state before the last round. Let's also assume that a cache line can hold n T table values, e.g., the first n T table positions in our case. If si is equal to one of the indices of the monitored T table entries in the memory block (i.e. si ∈ {0,...,n−1} if the memory block contains the first n T table entries) then the monitored memory block will, with very high probability, be present in the cache (since it has been accessed by the encryption process). However, if si takes a different value, the monitored memory block is not loaded in this step. Nevertheless, since each T table is accessed l times (for AES-128 in OpenSSL, l = 40 per Tj), there is still a probability that the memory block was loaded by any of the other accesses.

In both cases, all that happens after the T table lookup is a possible reordering of bytes (due to AES's Shift Rows), followed by the last round key addition. Since the last round key is always the same, for si the n values are mapped to n specific and constant ciphertext byte values. This means that for n out of 256 ciphertext values the monitored memory block will always have been loaded by the AES operation, while for the remaining 256−n values the probability of it having been reloaded is smaller.
In fact, the probability that the specific T table memory block i has not been accessed by the encryption process is given as:

Pr[no access to T[i]] = (1 − t/256)^l

Here, l is the number of accesses to the specific T table. For OpenSSL 1.0.1 AES-128 we have l = 40. If we assume that each memory block can hold t = 16 entries per cache line, we have Pr[no access to T[i]] = 7.6%. However, if the T tables start in the middle of a cache line, an attacker can be smart enough to target those memory blocks for which t = 8 and Pr[no access to T[i]] = 28.6%. Therefore it is easily distinguishable whether the memory block is accessed or not. Indeed, this turns out to be the case, as confirmed by our experiments.

In order to distinguish the two cases, all that is necessary is to measure the timing of the reload of the targeted memory block. If the line was accessed by the AES encryption, the reload is quick; otherwise it takes more time. Based on a threshold that we will choose empirically from our measurements, we expect to distinguish main memory accesses from L3 cache accesses. For each possible value of the ciphertext byte ci we count how often either case occurs. Now, for n ciphertext values (the ones corresponding to the monitored T table memory block) the memory block has always been reloaded by AES, i.e. the reload counter is (close to) zero. These n ciphertext values are related to the state as follows:

ci = ki ⊕ T[s[i]] (4.1)

where s[i] can take n consecutive values. Note that Eq. (4.1) describes the last round of AES. The brackets in the index of the state byte s[i] indicate the reordering due to the Shift Rows operation. For the other values of ci, the reload counter is significantly higher. Given the n values of ci with a low reload counter, we can solve Eq. (4.1) for the key byte ki, since the indices s[i] as well as the table output values T[s[i]] are known for the monitored memory block. In fact, we get n possible key candidates for each ci with a zero reload counter.
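The no-access probability derived above is easy to check numerically; a minimal sketch (the function name is ours):

```python
# Probability that the monitored T-table memory block is *not* touched
# during one encryption: Pr = (1 - t/256)^l, where t is the number of
# table entries in the block and l the number of accesses to the table.
def p_no_access(t: int, l: int = 40) -> float:
    return (1 - t / 256) ** l

p16 = p_no_access(16)  # block holds 16 entries -> about 7.6%
p8  = p_no_access(8)   # block holds 8 entries  -> about 28%
```

The smaller the block (t = 8 instead of t = 16), the higher the no-access probability, which is exactly why targeting a half-populated cache line gives the attacker a stronger signal.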
The correct key is the only one that all n valid values for ci have in common. A general description of the key recovery algorithm is given in Algorithm 4, where key byte number 0 is recovered from the ciphertext values corresponding to the n low reload counter values recovered from the measurements. Again, n is the number of T table positions that a cache line holds. The reload vector Xi = [x(0), x(1), ..., x(255)] holds the reload counter values x(j) for each ciphertext value ci = j. Finally, K0 is the vector that, for each key byte candidate k, tracks the number of appearances in the key recovery step.

Algorithm 4 Recovery algorithm for key byte k0
Input: X0 // Reload vector for ciphertext byte 0
Output: k0 // Correct key byte 0
forall xj ∈ X0 do
  if xj < AccessThreshold then // Threshold for values with low reload counter
    Addcounter(Ti, Xi); // Increase counter of Xi using Ti
  end
end
return X0, X1, ..., X15

The above attack can be repeated on each byte by simply analyzing the collected ciphertexts and their timings for each of the ciphertext bytes individually. As before, the timings are profiled according to the value that each ciphertext byte ci takes in each of the encryptions, and are stored in a ciphertext byte vector. The attack process is described in Algorithm 5. In a nutshell, the algorithm monitors the first T table memory block of all used tables and hence stores four reload values per observed ciphertext. Note that this is a known-ciphertext attack, and therefore all that is needed is a flush of one memory block before each encryption. There is no need for the attacker to gain access to plaintexts. Finally, the attacker should apply Algorithm 4 to each of the obtained ciphertext reload vectors. Recall that each ciphertext reload vector uses a different T table, so the corresponding T table should be applied in the key recovery algorithm.

Performing the Attack. In the following we provide the details of the process followed during the attack.
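The key-recovery reasoning above can be illustrated with a toy simulation. The table below is a random byte permutation standing in for a real T table (which actually maps bytes to 32-bit words), and the noisy reload counters are replaced by the exact set of ciphertext values guaranteed to reload fast; both simplifications are ours, but the candidate-intersection logic is the one described in the text:

```python
# Toy simulation of the last-round key-byte recovery. The relation
# c = k XOR T[s] from Eq. (4.1) is kept; everything else is simplified.
import random

random.seed(0)
T = list(range(256))
random.shuffle(T)          # stand-in bijective "T table"
n = 16                     # T-table entries held by the monitored line
k_true = 0x42              # last-round key byte to be recovered

# Ciphertext values with a (near-)zero reload counter: those reachable
# from the monitored block, i.e. c = k XOR T[s] for s in 0..n-1.
low_counter = {k_true ^ T[s] for s in range(n)}

# Recovery: each low-counter c yields n key candidates c XOR T[s];
# the true key byte is the one common to all candidate sets.
candidates = None
for c in low_counter:
    cand = {c ^ T[s] for s in range(n)}
    candidates = cand if candidates is None else candidates & cand
```

A wrong key byte survives the intersection only if XOR-ing by its difference from the true key permutes the monitored n-entry set onto itself, which is overwhelmingly unlikely for a random table; the intersection therefore collapses to the single correct byte.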
Step 1: Acquire information about the offset of the T tables. The attacker has to know the offset of the T tables with respect to the beginning of the library. With that information, the attacker can refer and point to any memory block that holds T table values even when ASLR is activated. This means that some reverse engineering work has to be done prior to the attack. This can be done in a debugging step in which the offsets of the addresses of the four T tables are recovered.

Step 2: Collect measurements. In this step, the attacker requests encryptions and applies Flush and Reload between consecutive encryptions. The information gained, i.e. whether T_i[0] was accessed or not, is stored together with the observed ciphertext. The attacker needs to observe several encryptions to get rid of the noise and to be able to recover the key. Note that, while the reload step must be performed and timed by the attacker, the flush might be performed by other processes running in the victim OS.

Step 3: Key recovery. In this final step, the attacker uses the collected measurements and his knowledge of the public T tables to recover the key. From this information, the attacker applies the steps detailed in Section 7.4.2.1 to recover the individual bytes of the key.

4.4.3 Attack Scenario 1: Spy Process

In this first scenario we attack an encryption server running in the same OS as the spy process. The encryption server receives encryption requests, encrypts a plaintext and sends the ciphertext back to the client. The server and the client are running on different cores. Thus, the attack consists in distinguishing accesses to the LLC, i.e. the L3 cache, which is shared across cores, from accesses to main memory. Clearly, if the attacker is able to distinguish accesses between the LLC and main memory, it will also be able to distinguish between L1 and main memory accesses whenever server and client co-reside on the same core. In this scenario, both the attacker and the victim are using the same shared library.
KSM is responsible for merging those pages into one unified shared page. Therefore, the victim and attacker processes are linked through the KSM deduplication feature. Our attack works as described in the previous section. First the attacker discovers the offsets of the addresses of the T tables with respect to the beginning of the library. Next, it issues encryption requests to the server and receives the corresponding ciphertexts. After each encryption, the attacker checks with the Flush and Reload technique whether the chosen T table values have been accessed. Once enough measurements have been acquired, the key recovery step is performed. As we will see in our results section, the whole process takes less than half a minute.

Our attack significantly improves on previous cache side channel attacks such as evict+time or prime and probe [OST06]. Both attacks were based on spy processes targeting the L1 cache. A clear advantage of our attack is that, since it targets the shared last level cache, it works across cores. A more realistic attack scenario was proposed earlier by Bernstein [Ber04], where the attacker targets an encryption server. Our attack similarly works under a realistic scenario. However, unlike Bernstein's attack [Ber04], our attack does not require a profiling phase that involves access to an identical implementation with a known key. Finally, with respect to the previous Flush and Reload attack on AES, our attack does not need to interrupt the AES execution of the encryption server. We compare the different attacks according to the number of encryptions needed in Section 4.4.6.

4.4.4 Attack Scenario 2: Cross-VM Attack

In our second scenario the victim process is running in one virtual machine and the attacker in another one, but on the same physical server, possibly on different cores. For the purposes of this study it is assumed that the co-location problem has been solved using the methods proposed in [RTSS09].
The attack exploits memory overcommitment features that some hypervisors such as VMware provide. In particular, we focus on memory deduplication. The hypervisor periodically searches for identical pages across VMs and merges them into a single page in memory. Once this is done (without any intervention by the attacker), both the victim and the attacker access the same portion of physical memory, enabling the attack. The attack process is the same as in Scenario 1. Moreover, we later show that the key is recovered in less than a minute, which makes the attack practical.

In the previous scenario we discussed the improvements of our attack over earlier proposals, except for the most important one: we believe that the evict+time, prime and probe, and time collision attacks would be rather difficult to carry out in a real cloud environment. The first two, as we know them so far, target the L1 cache, which is not shared across cores. The attacker would have to be on the same core as the victim, which is a much stronger assumption than merely being on the same physical machine. Finally, targeting the CFS [GBK11] to evict the victim process requires the attacker's code to run in the same OS, which will certainly not be possible in a virtualized environment.

4.4.5 Experiment Setup and Results

We present results for both a spy process within the native machine and the cross-VM scenario. The target process is executed in Ubuntu 12.04 64-bit, kernel version 3.4, using the C implementation of AES in OpenSSL 1.0.1f for encryption. This is used when OpenSSL is configured with the no-asm and no-hw options. We want to remark that this is not the default option in the installation of OpenSSL in most products. All experiments were performed on a machine featuring an Intel i5-3320M four core clocked at 3.2GHz. The Core i5 has a three-level cache architecture: the L1 cache is 8-way associative, with a size of 2^15 bytes and a cache line size of 64 bytes.
The level-2 cache is 8-way associative as well, with a cache line width of 64 bytes and a total size of 2^18 bytes. The level-3 cache is 12-way associative with a total size of 2^22 bytes and a 64-byte cache line size. It is important to note that each core has private L1 and L2 caches, but the L3 cache is shared among all cores. Together with the deduplication performed by the VMM, the shared L3 cache allows the adversary to learn about data accesses by the victim process.

The attack scenario is as follows: the victim process is an encryption server that handles encryption requests through a socket connection and sends back the ciphertext, similar to Bernstein's setup in [Ber04]. But unlike Bernstein's attack, where packages of at least 400 bytes were sent to deal with the noise, our server only receives packages of 16 bytes (the plaintext). The encryption key used by the server is unknown to the attacker. The attack process sends encryption queries to the victim process. All measurements, such as the timing of the reload step, are done on the attacker side. In our setup, each cache line holds 16 T table values, which results in a 7.6% probability of not accessing a given memory block per encryption. All given attack results target only the first cache line of each T table, i.e. the first 16 values of each T table for Flush and Reload. Note that in the attack any memory block of the T table would work equally well. Both the native and the cross-VM attack establish the threshold for selecting the correct ciphertext candidates for the monitored T table line by selecting those values which are below half of the average of the overall timings for each ciphertext value. This is an empirical threshold that we set after running some experiments, as follows:

threshold = (Σ_{i=0}^{255} t_i) / (2 · 256)

Spy process attack setup: The attack process runs in the same OS as the victim process.
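The empirical threshold above, half of the mean timing over all 256 ciphertext values, can be written out directly; the timing vector below is a synthetic stand-in for measured reload counters:

```python
def reload_threshold(timings):
    # threshold = (sum of t_i over all 256 ciphertext values) / (2 * 256)
    return sum(timings) / (2 * len(timings))

def low_candidates(timings):
    # Ciphertext values below the threshold, i.e. those for which the
    # monitored T table line was (almost) always reloaded from cache.
    thr = reload_threshold(timings)
    return [c for c, t in enumerate(timings) if t < thr]

# Synthetic example: 16 values always fast (cached), 240 mostly slow.
timings = [5] * 16 + [300] * 240
print(low_candidates(timings))
```

For this synthetic vector the threshold lands between the two clusters, so exactly the 16 fast values survive, mirroring the n = 16 T table positions of one cache line.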
The communication between the processes is carried out via a localhost connection, and timing is measured using the Read Time-Stamp Counter (rdtsc). The attack is set up to work across cores; the encryption server runs on a different core than the attacker. We believe that distinguishing between L3 and main memory accesses is more susceptible to noise than distinguishing between L1 cache accesses and main memory accesses. Therefore, while working with the L3 cache gives us a more realistic setting, it also makes the attack more challenging.

Cross-VM attack setup: In this attack, we use VMware ESXi 5.5.0 build 1623387 running Ubuntu 12.04 64-bit guest OSes. We know that VMware implements TPS with large pages (2 MB) or small pages (4 KB). We decided to use the latter, since it seems to be the default for most systems. Furthermore, as stated in [VMWb], even if large page sharing is selected, the VMM will still look for identical small pages to share. For the attack we used two virtual machines, one for the victim and one for the attacker. The communication between them is carried out over the local IP connection.

The results are presented in Figure 4.4, which plots the number of correctly recovered key bytes over the number of timed encryptions. The dash-dotted line shows that the spy-process scenario completely recovers the key after only 2^17 encryptions. Prior to moving to the cross-VM scenario, a single-VM scenario was run to gauge the impact of using VMs. The dotted line shows that due to the noise introduced by virtualization we need to nearly double the number of encryptions to match the key recovery performance of the native case.

Figure 4.4: Number of correct key bytes guessed of the AES-128 bit key vs. number of encryption requests. Even 50,000 encryptions (i.e. less than 5 seconds of interaction) result in significant security degradation in both the native machine as well as the cross-VM attack scenario.
The solid line gives the result for the cross-VM attack: 2^19 observations are sufficient for stable full key recovery. The difference might be due to cpuid-like instructions which are emulated by the hypervisor, thereby introducing more noise into the attack. In the worst case, both the native spy process and the single-VM attack took around 25 seconds (for 400,000 encryptions). We believe that this is due to communication via the localhost connection. However, when we perform a cross-VM attack, it takes roughly twice as much time as in the previous cases. In this case we are performing the communication via local IPs that have to reach the router, which is believed to add the additional delay. This means that all of the described attacks, even in the cross-VM scenario, completely recover the key in less than one minute!

4.4.6 Comparison to Other Attacks

Next we compare the most commonly implemented cache-based side channel attacks to the proposed attack. Results are shown in Table 4.1. It is difficult to compare the attacks, since most of them have been run on different platforms. Many of the prior attacks target the OpenSSL 0.9.8 version of AES.
Most of these attacks exploit the fact that AES has a separate T table for the last round, significantly reducing the noise introduced by cache miss accesses. Hence, attacks on OpenSSL 0.9.8 AES usually succeed much faster, a trend confirmed by our attack results.

Table 4.1: Comparison of cache side channel attack techniques against AES

Attack                       | Platform   | Methodology            | OpenSSL  | Traces
Spy-process based attacks:
Collision timing [BM06]      | Pentium 4E | Time measurement       | 0.9.8a   | 300,000
Prime+probe [OST06]          | Pentium 4E | L1 cache prime-probing | 0.9.8a   | 16,000
Evict+time [OST06]           | Athlon 64  | L1 cache evicting      | 0.9.8a   | 500,000
Flush+Reload (CFS) [GBK11]   | Pentium M  | Flush+reload w/ CFS    | 0.9.8m   | 100
Flush+Reload [IIES14b]       | i5-3320M   | L3 cache flush+reload  | 0.9.8a   | 8,000
Bernstein [AE13]             | Core2Duo   | Time measurement       | 1.0.1c   | 2^22
Flush+Reload [IIES14b]       | i5-3320M   | L3 cache flush+reload  | 1.0.1f   | 100,000
Cross-VM attacks:
Bernstein [IIES14a]^1        | i5-3320M   | Time measurement       | 1.0.1f   | 2^30
Our attack (VMware)          | i5-3320M   | L3 cache flush+reload  | 1.0.1f^2 | 400,000

^1 Only parts of the key were recovered, not the whole key.
^2 The AES implementation was not updated for the recently released OpenSSL 1.0.1g and 1.0.2 beta versions, so the results for those libraries are identical.

Note that our attack, together with [AE13] and [IIES14a], are the only ones that have been run on a 64-bit processor. Moreover, we assume that, due to undocumented internal states and advanced features such as hardware prefetchers, running the attack on a 64-bit processor adds more noise than on the older platforms. With respect to the number of encryptions, we observe that the proposed attack shows significant improvements over most of the previous attacks.

Spy process in native OS: Even though our attack runs in a noisier environment than Bernstein's attack, evict+time, and cache timing collision attacks, it shows better performance. Only prime and probe and Flush and Reload using CFS show either comparable or better performance.
The proposed attack has better performance than prime and probe even though their measurements were performed with the attack and the encryption being run as one single process. The Flush and Reload attack in [GBK11] exploits a much stronger leakage, which requires the attacker to interrupt the target AES between rounds (an unrealistic assumption). Furthermore, Flush and Reload with CFS needs to monitor the entire T tables, while our attack only needs to monitor a single line of the cache, making the attack much more lightweight and subtle.

Cross-VM attack: So far there is only one publication that has analyzed cache-based leakage across VMs for AES [IIES14a]. Our attack shows dramatic improvements over [IIES14a], which needs 2^29 encryptions (hours of run time) for a partial recovery of the key. Our attack only needs 2^19 encryptions to recover the full key. Thus, while the attack presented in [IIES14a] needs to interact with the target for several hours, our attack succeeds in under a minute and recovers the entire key. Note that the CFS-enabled Flush and Reload attack in [GBK11] will not work in the cross-VM setting, since the attacker has no control over the victim OS's CFS.

4.5 Flush and Reload Attacking Transport Layer Security: Reviving the Lucky 13 Attack

Although cache attacks are usually applied to cryptographic algorithms, virtually any security-critical software that has non-constant execution flow can be targeted by the Flush and Reload attack. In this section, we show an example of such an application. In particular, we show for the first time that cache attacks can be utilized to attack security protocols like the Transport Layer Security (TLS) protocol, by re-implementing the Lucky 13 attack that was believed to be closed by the security community. The Lucky 13 attack targets a vulnerability in the TLS (and DTLS) protocol design.
The vulnerability is due to the MAC-then-encrypt mode, in combination with the padding of the CBC encryption, also referred to as MEE-TLS-CBC. In the following, our description focuses on this popular mode. Vaudenay [Vau02] showed how the CBC padding can be exploited for a message recovery attack. AlFardan et al. [FP13] showed, more than 10 years later, that the subsequent MAC verification introduces timing behavior that makes the message recovery attack feasible in practical settings. In fact, their work includes a comprehensive study of the vulnerability of several TLS libraries. In this section we give a brief description of the attack. For a more detailed description, please refer to the original paper [FP13].

4.5.1 The TLS Record Protocol

The TLS record protocol provides encryption and message authentication for bulk data transmitted in TLS. The basic operation of the protocol is depicted in Figure 4.5. When a payload is sent, a sequence number and a header are attached to it and a MAC tag is generated by any of the available HMAC choices. Once the MAC tag is generated, it is appended to the payload together with a padding. The payload, tag, and pad are then encrypted using a block cipher in CBC mode. The final message is formed by the encrypted ciphertext plus the header. Upon receiving an encrypted packet, the receiver decrypts the ciphertext with the session key that was negotiated in the handshake process. Next, the padding and the MAC tag need to be removed. For this, the receiver first checks whether the size of the ciphertext is a multiple of the block size and makes sure that the ciphertext can accommodate at minimum a zero-length record, a MAC tag, and at least one byte of padding. After decryption, the receiver checks if the recovered padding matches one of the allowed patterns.
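The sending-side flow just described (MAC over sequence number, header and data, then pad, then CBC-encrypt) can be sketched as follows. The block cipher here is a toy XOR construction standing in for AES, so the CBC algebra holds but it is of course not secure, and all keys and field values are illustrative:

```python
import hashlib
import hmac
import os

BLOCK = 16

def toy_encrypt_block(key, block):
    # Toy stand-in for AES: XOR with a key-derived pad (NOT secure).
    pad = hashlib.sha256(key).digest()[:BLOCK]
    return bytes(a ^ b for a, b in zip(block, pad))

toy_decrypt_block = toy_encrypt_block  # XOR is its own inverse

def cbc_encrypt(key, iv, plaintext):
    # C_i = E_k(P_i xor C_{i-1}), with C_0 = IV.
    out, prev = [], iv
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(key, bytes(
            a ^ b for a, b in zip(plaintext[i:i + BLOCK], prev)))
        out.append(prev)
    return b"".join(out)

def tls_pad(length):
    # TLS padding: n + 1 bytes, each with value n.
    n = BLOCK - (length + 1) % BLOCK
    return bytes([n]) * (n + 1)

def mee_tls_cbc(enc_key, mac_key, sqn, header, data):
    # MAC-then-encrypt: tag over sequence number || header || data,
    # then pad to a block boundary and CBC-encrypt payload || tag || pad.
    tag = hmac.new(mac_key, sqn + header + data, hashlib.sha1).digest()
    payload = data + tag
    payload += tls_pad(len(payload))
    iv = os.urandom(BLOCK)
    return header + iv + cbc_encrypt(enc_key, iv, payload)
```

On the receiving side the same structure is undone in reverse: CBC-decrypt, strip the padding, strip the tag, and recompute the MAC over the recovered payload.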
Figure 4.5: Encryption and authentication in the TLS record protocol when using HMAC and a block cipher in CBC mode.

A standard way to implement this decoding step is to check the last byte of the plaintext, and to use it to determine how many of the trailing bytes belong to the padding. Once the padding is removed and the plain payload is recovered, the receiver attaches the header and the sequence number and performs the HMAC operation. Finally, the computed tag is compared to the received tag. If they are equal, the contents of the message are concluded to have been securely transmitted.

4.5.2 HMAC

The TLS record protocol uses the HMAC algorithm to compute the tag. The HMAC algorithm is based on a hash function H and performs the following operations:

HMAC(K, M) = H((K ⊕ opad) || H((K ⊕ ipad) || M))

Common choices for H in TLS 1.2 are SHA-1, SHA-256 and the now defunct MD5. The message M is padded with a single 1 bit followed by zeros and an 8 byte length field. The pad aligns the data to a multiple of 64 bytes. K ⊕ opad already forms a 64 byte field, as does K ⊕ ipad. Therefore, the minimum number of compression function calls for an HMAC operation is 4. This means that, depending on the number of bytes of the message, the HMAC operation is going to take more or fewer compression function calls. To illustrate this, we repeat the example given in [FP13]. Assume that the plaintext size is 55 bytes. In this case an 8 byte length field is appended together with a padding of size 1, so that the total size is 64 bytes. In total, the HMAC operation is going to take four compression function calls. However, if the plaintext size is 58 bytes, an 8 byte length field is attached and 62 bytes of padding are appended to make the total size equal to 128 bytes. In this case, the total number of compression function calls is going to be equal to five.
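The length dependence described above can be made concrete with a small counting function (for a 64-byte-block hash such as SHA-1, counting both the inner and the outer hash of HMAC):

```python
def hash_blocks(n):
    # A 64-byte-block hash pads with one 0x80 byte, zeros, and an
    # 8-byte length field, so n bytes take ceil((n + 9) / 64) calls.
    return (n + 9 + 63) // 64

def hmac_compressions(msg_len, digest_len=20):
    # Inner hash: the 64-byte (K xor ipad) block plus the message.
    # Outer hash: the 64-byte (K xor opad) block plus the inner digest.
    return hash_blocks(64 + msg_len) + hash_blocks(64 + digest_len)

print(hmac_compressions(55))  # 55-byte message -> 4 calls
print(hmac_compressions(58))  # 58-byte message -> 5 calls
```

The jump from 4 to 5 compression function calls between 55 and 56 input bytes is exactly the boundary the Lucky 13 attack probes.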
Distinguishing the number of performed compression function calls is the basic idea that enables the Lucky 13 attack.

4.5.3 CBC Encryption & Padding

Until the support of the Galois Counter Mode in TLS 1.2, block ciphers were always used in cipher block chaining (CBC) mode in TLS. The decryption of each block C_i of a ciphertext is performed as follows:

P_i = D_k(C_i) ⊕ C_{i−1}

Here, P_i is the plaintext block and D_k(·) is the decryption under key k. For the prevalent AES, the block size is 16 bytes. The size of the message to be encrypted in CBC mode must indeed be a multiple of the cipher block size. The TLS protocol specifies the padding as follows: the last padding byte indicates the length of the padding, and the value of the remaining padding bytes is equal to the number of padding bytes needed. This means that if 3 bytes of padding are needed, the correct padding has to be 0x02|0x02|0x02. Possible TLS paddings are: 0x00, 0x01|0x01, 0x02|0x02|0x02, up to 0xff|0xff|...|0xff. Note that there are several valid paddings for each message length.

4.5.4 An Attack on CBC Encryption

We now discuss the basics of the Lucky 13 attack. For the purposes of this study the target cipher is AES in CBC mode, as described above. Again, we use the same example that AlFardan et al. gave in [FP13]. Assume that the sender is sending 4 non-IV blocks of 16 bytes each, one IV block, and the header. Let us further assume that we are using SHA-1 to compute the MAC tag, in which case the digest size is 20 bytes. The header has a fixed length of 5 bytes and the sequence number has a total size of 8 bytes. The payload would look like this:

HDR | C_IV | C_1 | C_2 | C_3 | C_4

Now assume that the attacker XORs a mask ∆ into C_3. The decryption of C_4 is then:

P*_4 = D_k(C_4) ⊕ C_3 ⊕ ∆ = P_4 ⊕ ∆

Focusing on the last two bytes P*_4(14) | P*_4(15), three possible scenarios emerge:

Invalid padding: This is the most probable case, where the plaintext ends with an invalid padding.
Therefore, according to the TLS protocol, this is treated as zero padding. 20 bytes of MAC (SHA-1) are removed and the corresponding HMAC operation on the client side is performed on 44 bytes plus 13 bytes of header, 57 bytes in total. Therefore the HMAC evaluates 5 compression function calls.

Valid 0x00 padding: If P*_4(15) is 0x00, this is considered a valid padding, and a single byte of padding is removed. Then the 20 bytes of digest are removed, and the HMAC operation on the client side is done on 43 + 13 bytes, 56 in total, which takes 5 compression function calls.

Any other valid padding: For instance, if we consider a valid padding of two bytes, the valid padding would be 0x01|0x01 and 2 bytes of padding are removed. Then 20 bytes of digest are removed, and the HMAC operation is performed over 42 + 13 = 55 bytes, which means four compression function calls.

The Lucky 13 attack is based on detecting this difference between 4 and 5 compression function calls. Recall that if an attacker knows that a valid 0x01|0x01 padding was achieved, she can directly recover the last two bytes of P_4, since

0x01|0x01 = P_4(14)|P_4(15) ⊕ ∆(14)|∆(15)

Note that a successful 0x01|0x01 padding is more likely to be achieved than any other, longer valid padding, and therefore the attacker can be confident enough that this is the padding that was forged. Furthermore, she can keep trying to recover the remaining bytes once she knows the first 2 bytes. The attacker needs to perform at most 2^16 trials to recover the last two bytes, and then up to 2^8 messages for each further byte that she wants to recover.

4.5.5 Analysis of Lucky 13 Patches

The Lucky 13 attack triggered a series of patches for all major implementations of TLS [FP13]. In essence, all libraries were fixed to remove the timing side channel exploited by Lucky 13, i.e. implementations were updated to handle different CBC paddings in constant time.
However, different libraries used different approaches to achieve this:

•Some libraries implement dummy functions or processes,
•Others use dummy data to process the maximum allowed padding length in each MAC check.

In the following, we discuss these different approaches for some of the most popular TLS libraries.

4.5.6 Patches Immune to Flush and Reload

In this section we analyze those libraries that are secure against the Flush and Reload technique.

•OpenSSL: The Lucky 13 vulnerability was fixed in OpenSSL versions 1.0.1, 1.0.0k, and 0.9.8y by February 2013, without the use of a time-consuming dummy function, by using dummy data. Basically, when a packet is received, the padding variation is considered and the maximum number of HMAC compression function evaluations needed to equalize the time is calculated. Then each compression function is computed directly, without calling any external function. For every message, the maximum number of compression functions is executed, so that no information is leaked through the timing channel in the case of incorrect padding. Furthermore, the OpenSSL patch removed any data-dependent branches, ensuring a fixed, data-independent execution flow. This is a generic solution for attacks related to microarchitectural leakage, i.e. cache timing or even branch prediction attacks.

•Mozilla NSS: This library was patched against the Lucky 13 attack in version 3.14.3 by using a constant time HMAC processing implementation. This implementation follows the approach of OpenSSL, calculating the maximum number of compression functions needed for a specific message and then computing the compression functions directly. This provides a countermeasure not only for timing and cache access attacks, but also for branch prediction attacks.

•MatrixSSL: MatrixSSL was fixed against Lucky 13 with the release of version 3.4.1 by adding timing countermeasures that reduce the effectiveness of the attack.
In the fix, the library authors implemented a decoding scheme that does a sanity check on the largest possible block size. In this scheme, when the received message's padding length is incorrect, MatrixSSL runs a loop as if there were a full 256 bytes of padding. When there are no padding errors, the same operations are executed as in the case of an incorrect padding, to sustain a constant time. Since there are no functions that are specifically called in the successful or unsuccessful padding cases, this library is not vulnerable to our Flush and Reload attack. In addition, MatrixSSL keeps track of all errors in the padding decoding and does the MAC check regardless of valid or invalid padding, rather than interrupting and finalizing the decoding process at the first error.

4.5.7 Patches Vulnerable to Flush and Reload

There are some patches that ensure constant time execution, and are therefore immune to the original Lucky 13 attack [FP13], but which are vulnerable to Flush and Reload. This implies a dummy function call or a different function call tree for valid and invalid paddings. Furthermore, if these calls are preceded by branch predictions, these patches might also be exploitable by branch prediction attacks. Some examples, including code snippets, are given below.

•GnuTLS: uses a dummy wait function that performs an extra compression function whenever the padding is incorrect. This function makes the response time constant to fix the original Lucky 13 vulnerability. Since this function is only called in the case of incorrect padding, it can be detected by a co-located VM running a Flush and Reload attack.

  if (memcmp(tag, &ciphertext->data[length], tag_size) != 0 || pad_failed != 0)
    /* HMAC was not the same. */
    {
      dummy_wait(params, compressed, pad_failed, pad, length + preamble_size);
    }

•PolarSSL: uses a dummy function called md_process to sustain constant time to fix the original Lucky 13 vulnerability.
Basically, the number of extra runs for a specific message is computed and the corresponding compressions are added by md_process. Whenever this dummy function is called, a co-located adversary can learn that the last padding was incorrect and use this information to realize the Lucky 13 attack.

  for (j = 0; j < extra_run; j++)
      md_process(&ssl->transform_in->md_ctx_dec, ssl->in_msg);

•CyaSSL: was fixed against Lucky 13 with the release of 2.5.0, on the same day the Lucky 13 vulnerability became public. In the fix, CyaSSL implements a timing-resistant pad/verify check function called TimingPadVerify, which uses the PadCheck function with dummy data for all padding length cases, whether or not the padding length is correct. CyaSSL also does all the calculations, such as the HMAC calculation, for the incorrect padding cases, which not only fixes the original Lucky 13 vulnerability but also prevents the detection of incorrect padding cases. This is due to the fact that the PadCheck function is called for both correctly and incorrectly padded messages, which makes it impossible to detect with our Flush and Reload attack. However, for correctly padded messages, CyaSSL calls the CompressRounds function, which is detectable with Flush and Reload. Therefore, we monitor the correct padding cases instead of the incorrect ones.

Correct padding case:

  PadCheck(dummy, (byte)padLen, MAX_PAD_SIZE - padLen - 1);
  ret = ssl->hmac(ssl, verify, input, pLen - padLen - 1 - t, content, 1);
  CompressRounds(ssl, GetRounds(pLen, padLen, t), dummy);
  if (ConstantCompare(verify, input + (pLen - padLen - 1 - t), t) != 0)

Incorrect padding case:

  CYASSL_MSG("PadCheck failed");
  PadCheck(dummy, (byte)padLen, MAX_PAD_SIZE - padLen - 1);
  ssl->hmac(ssl, verify, input, pLen - t, content, 1); /* still compare */
  ConstantCompare(verify, input + pLen - t, t);

4.5.8 Reviving Lucky 13 on the Cloud

As the cross-network timing side channel has been closed (cf.
Section 4.5.5), the Lucky 13 attack as originally proposed no longer works on the recent releases of most cryptographic libraries. In this work, we revive the Lucky 13 attack to target some of these (fixed) releases by gaining information through co-located VMs (a leakage channel not considered in the original paper) rather than through the network timing exploited in the original attack.

4.5.8.1 Regaining the Timing Channel

Most cryptographic libraries and implementations have been largely fixed to yield an almost constant time when the MAC processing time is measured over the network. As discussed in Section 4.5.5, although there are some similarities in these patches, there are also subtle differences which, as we shall see, have significant implications for security. Some of the libraries not only closed the timing channel but also various cache access channels. In contrast, other libraries left an open door to access-driven cache attacks on the protocol. In this section we analyze how an attacker can gain information about the number of compression functions performed during the HMAC operation by making use of leakages due to the shared memory hierarchy in VMs located on the same machine. This is sufficient to re-implement the Lucky 13 attack. More precisely, during MAC processing, depending on whether or not the actual MAC check terminates early, some libraries call a dummy function to equalize the processing time. Knowing whether this dummy function is called reveals whether the received packet was processed as having an invalid padding, a zero-length padding, or any other valid padding.

Figure 4.6: Histogram of network time measured for sent packages with valid (4 compression functions) and invalid (5 compression functions) paddings.
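How such an execution-flow oracle feeds back into Lucky 13 can be illustrated with a toy simulation: a function models whether the victim's decryption hits the 0x01|0x01 padding case (the 4 versus 5 compression-function difference observed through the cache), and the attacker brute-forces the mask ∆. Random bytes stand in for the fixed block decryption D_k(C_4); none of this is real library code:

```python
import os
from itertools import product

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d_c4 = os.urandom(16)  # stands in for D_k(C4): fixed, unknown to the attacker
c3 = os.urandom(16)
p4 = xor(d_c4, c3)     # true last plaintext block: P4 = D_k(C4) xor C3

def oracle(delta):
    # Models the Flush+Reload observation: True iff the masked record
    # decrypts to a valid 0x01|0x01 padding (4 compression functions).
    p_star = xor(d_c4, xor(c3, delta))   # P4* = P4 xor delta
    return p_star[14] == 0x01 and p_star[15] == 0x01

# Brute-force the last two mask bytes: at most 2^16 trials.
for a, b in product(range(256), repeat=2):
    if oracle(bytes(14) + bytes([a, b])):
        recovered = bytes([0x01 ^ a, 0x01 ^ b])
        break

assert recovered == p4[14:16]  # last two bytes of P4 recovered
```

In the real attack each oracle query costs one TLS session (the session dies on the decryption failure), which is why the multi-session setting with a repeated plaintext is required.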
In general, any difference in the execution flow between handling a well-padded message, a zero-padded message, or an invalidly padded message enables the Lucky 13 attack. This information is gained by the Flush and Reload technique if the cloud system enables deduplication features. To validate this idea, we ran two experiments:

•In the first experiment we generated encrypted packets using a PolarSSL client with valid and invalid paddings and measured the network time, as shown in Figure 4.6. Note that the two network-time distributions obtained for valid and invalid paddings are essentially indistinguishable, as intended by the patches.

•In the second experiment we see a completely different picture. Using PolarSSL we generated encrypted packets with valid and invalid paddings, which were then sent to a PolarSSL server. Here, instead, we measured the time it takes to load a specifically chosen PolarSSL library function from inside a co-located VM. Figure 4.7 shows the probability distributions for a function reloaded from the L3 cache vs. a function reloaded from main memory. The two distributions are clearly distinguishable, and the misidentification rate (the area under the overlapping tails in the middle of the two distributions) is very small. Note that this substitute timing channel provides much more precise timing than the network time. To see this more clearly, we refer the reader to Figure 2 in [FP13], where the network time is measured to obtain two overlapping Gaussians by measurements with OpenSSL-encrypted traffic. This is not a surprise, since the network channel is significantly noisier.

In conclusion, we regain a much more precise timing channel by exploiting the discrepancy between L3 cache and memory accesses as measured by a co-located attacker. In what follows, we define the attack scenario more concretely, and then precisely define the steps of the new attack.
Figure 4.7: Histogram of access time measured for a function call served from the L3 cache vs. a function call served from main memory.

4.5.8.2 New Attack Scenario

In our attack scenario, the side channel information is gained by monitoring the cache from a co-located VM. As in [FP13], we assume that the adversary captures, modifies, and replaces any message sent to the victim. However, TLS sessions work in such a way that when the protocol fails to decrypt a message, the session is closed. This is the reason why we focus on multi-session attacks, where the same plaintext is sent to the victim in the same position in every session, e.g. an encrypted password sent during user authentication. The fact that we are working with a different method in a different scenario gives us some advantages and disadvantages over the previous Lucky 13 work:

Advantages:

•Recent patches in cryptographic libraries mitigate the old Lucky 13 attack, but are still vulnerable in the new scenario.

•In the new scenario, no response from the server is needed. The old Lucky 13 attack needed a response to measure the time, which yielded a noisier environment in TLS than in DTLS.

•The new attack does not suffer from network channel noise. This source of noise was painful for the measurements, as we can see in the original paper, where in the case of TLS as many as 2^14 trials were necessary to guess a single byte value.

Disadvantages:

•Assumption of co-location: To target a specific victim, the attacker has to be co-located with that target. Alternatively, the attacker could simply reside on a physical machine and wait for some potential random victim running a TLS operation.

•Other sources of noise: The attacker no longer has to deal with network channel noise, but still has to deal with other microarchitectural sources of noise, such as instruction prefetching.
This new source of noise translates into more traces being needed, but, as we will see, far fewer than in the original Lucky 13 attack. In Section 7.5 we explain how to deal with this new noise.

4.5.8.3 Attack Description

In this section we describe how an attacker uses the Flush and Reload technique to gain information about the plaintext that is being sent to the victim.

•Step 1: Function identification: Identify the different function calls in the TLS record decryption process to learn which target functions are suitable for the spy process. The attacker can either calculate the offset of the function she is trying to monitor within the library, and then add the corresponding offset when Address Space Layout Randomization (ASLR) moves her user address space, or she can disable ASLR in her own VM and directly use the virtual address corresponding to the function she is monitoring.

•Step 2: Capture packet, mask and replace: The attacker captures the packet that is being sent and masks it in those positions that are useful for the attack. Then she sends the modified packet to the victim.

•Step 3: Flush targeted function from cache: The Flush and Reload process starts after the attacker replaces the original packet and sends the modified version. The co-located VM flushes the targeted function to ensure that no one but the victim can have run it. Any subsequent execution of the targeted function will result in a faster reload time during the reload step.

•Step 4: Reload target function and measure: Reload the corresponding function memory line and measure the reload time. According to a threshold that we set based on experimental measurements, we decide whether the dummy function was loaded from the cache (implying that the victim executed the dummy function earlier) or from main memory (implying the opposite).
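The decision made in steps 2–4 can be sketched end to end. In this minimal Python model the hardware primitives (clflush, the rdtsc-timed reload, the victim's packet processing) are replaced by hypothetical stand-ins, so everything below is an illustrative assumption rather than real attack code:

```python
THRESHOLD = 300  # cycles; assumed boundary between L3 and DRAM reloads

def flush(addr, cache):
    cache.discard(addr)                # stand-in for clflush (Step 3)

def reload_time(addr, cache):
    # stand-in for an rdtsc-timed load: fast if cached, slow otherwise
    return 220 if addr in cache else 420

def victim_executed_dummy(process_packet, dummy_fn, cache):
    flush(dummy_fn, cache)             # Step 3: evict the target function
    process_packet(cache)              # Step 2: victim handles our packet
    t = reload_time(dummy_fn, cache)   # Step 4: timed reload
    return t < THRESHOLD               # True -> dummy function was called

# Hypothetical victim code paths: one calls the dummy function,
# the other does not.
calls_dummy = lambda cache: cache.add("dummy_fn")
skips_dummy = lambda cache: None
print(victim_executed_dummy(calls_dummy, "dummy_fn", set()))  # True
print(victim_executed_dummy(skips_dummy, "dummy_fn", set()))  # False
```

The single boolean per packet is exactly the padding-validity signal the Lucky 13 attack needs.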
Since the attacker has to deal with instruction prefetching, she constantly runs Flush and Reload for a specified period of time. The attacker can therefore distinguish between functions that were merely preloaded and functions that were preloaded and executed, since the latter stay in the cache for a longer period of time.

4.5.9 Experiment Setup and Results

In this section we present our test environment, together with our detection method for avoiding the cache prefetching techniques that affect our measurements. Finally, we present the results of our experiments for the PolarSSL, GnuTLS and CyaSSL libraries.

4.5.9.1 Experiment Setup

The experiments were run on an Intel i5-650 dual core at 3.2 GHz. Our physical server includes a 256 KB per-core L2 cache and a 4 MB L3 cache shared between both cores. We used VMware ESXi 5.5.0, build number 162338, for virtualization. TPS (Transparent Page Sharing) is enabled with 4 KB pages. In this setting, our Flush and Reload technique can distinguish between L3 cache and main memory accesses.

For the TLS connection, we use an echo server, which reads and re-sends the message that it receives, and a client communicating with it. Client and echo server run in different virtual machines, both using Ubuntu 12.04 as the guest OS. We modify the echo server so that it adds a jitter to the encrypted reply message, modeling the man-in-the-middle attack. Once the message is sent, the echo server uses Flush and Reload to detect the different function calls and concludes whether the padding was correct or not.

4.5.9.2 Dealing with Cache Prefetching

Modern CPUs implement cache prefetching in a number of ways. These techniques affect our experiments, since the monitored function can be prefetched into the cache even if it was not executed by the victim process. To avoid false positives, it is not sufficient to detect whether the monitored functions were loaded into the cache; we must also determine how long they have resided there.
This is achieved by counting the number of subsequent detections of the given function in one execution. Thus, the attack process effectively distinguishes between prefetched functions and prefetched-and-executed functions. We use experiments to determine a threshold (which differs across the libraries) that distinguishes a prefetch-and-execute from a mere prefetch. For PolarSSL this threshold is based on observing three Flush and Reload accesses in a row. Assume that n is the number of subsequent accesses required to conclude that the function was executed. In the following, we present the required hits for the different libraries, i.e. the number of n-access sequences required to decide whether the targeted function was executed or not.

4.5.9.3 Attack on PolarSSL 1.3.6

Our first attack targets PolarSSL 1.3.6 with TLS 1.1. In the first scenario the attacker modifies the last two bytes of the encrypted message until she finds the ∆ that leads to a 0x01|0x01 padding. Recall that 2^16 different variations of the message can be tried. The first plot shows the success probability of guessing the right ∆ versus L, where L refers to the number of 2^16-trace batches needed. For example, L = 4 means that 2^16 * 4 messages are needed to detect the right ∆. Based on experimental results, we set the access threshold such that we consider a hit whenever the targeted function gets two accesses in a row.

The measurements were performed for different numbers of required hits. Figure 4.8 shows that requiring a single hit might not suffice, since the attacker gets false positives or, for a small number of messages, may miss the access altogether.

Figure 4.8: (PolarSSL 1.3.6) Success probability of recovering P_14 and P_15 vs. L, for different numbers of hits required. L refers to the number of 2^16-trace batches needed, so the total number of messages is 2^16 * L.
However, when we require two hits and the attacker has a sufficient number of messages (in this case L = 2^3), the probability of guessing the right ∆ is comfortably close to one. If the attacker increases the limit further to ensure an even lower number of false positives, she will need more messages to see the required number of hits. In the case of 3 hits, L = 2^4 is required for a success probability close to one.

Figure 4.9 shows the success probability of correctly recovering P_13, once the attacker has recovered the last two bytes. Now the attacker is looking for the padding 0x02|0x02|0x02. We observed behavior similar to the previous case: with L = 8 and a two-hit requirement we recover the correct byte with high probability. Again, if the attacker increases the requirement to 3 hits, she will need more measurements; about L = 16 is sufficient in practice.

Figure 4.9: (PolarSSL 1.3.6) Success probability of recovering P_13, assuming P_14 and P_15 known, vs. L, for different numbers of hits required. L refers to the number of 2^8-trace batches needed, so the total number of messages is 2^8 * L.

4.5.9.4 CyaSSL 3.0.0

Recall that the attack is much more effective if the attacker knows any of the preceding bytes of the plaintext, for example the last byte P_15. This would be the case in a Javascript/web setting where, by adjusting the length of an initial HTTP request, an attacker can ensure that there is only one unknown byte in the HTTP plaintext. In this case, the attacker would not need to try 2^16 possible variations but only 2^8 variations for each byte that she wants to recover. This is the scenario that we analyzed with CyaSSL TLS 1.2, where we assumed that the attacker knows P_15 and wants to recover P_14.
Now the attacker is again trying to obtain a 0x01|0x01 padding, but unlike in the previous case, she knows the ∆ that makes the last byte equal to 0x01. The implementation of CyaSSL behaves very similarly to that of PolarSSL: due to the access threshold, a single hit might lead to false positives. However, requiring two hits with a sufficient number of measurements is enough to obtain a success probability very close to one. The threshold is set as in the previous cases, where a hit is counted whenever we observe two Flush and Reload accesses in a row.

Figure 4.10: (CyaSSL 3.0.0) Success probability of recovering P_14, assuming P_15 known, vs. L, for different numbers of hits required. L refers to the number of 2^8-trace batches needed, so the total number of messages is 2^8 * L.

4.5.9.5 GnuTLS 3.2.0

Finally, we present the results confirming that GnuTLS 3.2.0 with TLS 1.2 is also vulnerable to this kind of attack. Again, the measurements were taken assuming that the attacker knows the last byte P_15 and wants to recover P_14, i.e., she wants to observe the case where she injects a 0x01|0x01 padding. However, GnuTLS's behavior shows some differences with respect to the previous cases.

For GnuTLS, we find that if we set an access threshold of three accesses in a row (which yields our desired hit), the probability of getting false positives is very low. Based on experimental measurements, we observed such behavior only when the dummy function is executed. However, the attacker needs more messages to be able to detect one of these hits. Observing one hit already indicates with high probability that the function was called, but we also consider the two-hit case in case the attacker wants the probability of false positives to be even lower.
Based on these measurements, we conclude that the attacker recovers the plaintext with very high probability, so we did not find it necessary to consider the three-hit case.

Figure 4.11: (GnuTLS 3.2.0) Success probability of recovering P_14, assuming P_15 known, vs. L, for different numbers of hits required. L refers to the number of 2^8-trace batches needed, so the total number of messages is 2^8 * L.

4.6 Flush and Reload Outcomes

In short, we demonstrated that if the memory deduplication requirement is satisfied, Flush and Reload can have severe consequences for processes/users co-residing in the same CPU socket, even if they are located in different CPU cores. We have demonstrated that such an attack can be utilized to recover cryptographic keys and TLS messages from CPU co-resident users. Moreover, we demonstrated that Flush and Reload can bypass the isolation techniques implemented by commonly used hypervisors to avoid cross-VM leakage. Despite all these advantages, we observed two major hurdles that Flush and Reload cannot overcome:

•Flush and Reload cannot attack victims located in a different CPU; it is restricted to victims located in the same CPU socket.

•Flush and Reload cannot be applied in systems in which memory deduplication does not exist, as the attacker does not get access to the victim's data. This fact also restricts Flush and Reload to attacking only statically allocated data.

In the following chapters we explain how to overcome these two obstacles.

Chapter 5
The First Cross-CPU Attack: Invalidate and Transfer

In the previous sections we presented Flush and Reload as a cross-core side channel attack targeting Intel processors. However, the utilized covert channel makes use of specific characteristics that Intel processors feature.
For example, the proposed LLC attack takes advantage of the inclusive cache design of these processors. Furthermore, it relies on the fact that the LLC is shared across cores. Therefore the Flush and Reload attack succeeds only when the victim and the attacker are co-located on the same CPU. These characteristics are not observed in other CPUs, e.g. AMD or ARM. Aiming to solve these issues, this chapter presents Invalidate and Transfer, an attack that expands deduplication-enabled LLC attacks to victims residing in different CPU sockets, with any LLC characteristics. We utilize AMD as an example, but the same technique should also succeed on ARM processors. In this regard, AMD servers present two main complications that prevent the application of existing side channel attacks:

•AMD tends to have more CPU sockets in high-end servers than Intel. This reduces the chance of being co-located on the same CPU, and therefore of applying the aforementioned Flush and Reload attack.

•LLCs in AMD are usually exclusive or non-inclusive. The former does not allocate a memory block in different cache levels at the same time; that is, data is present in only one level of the cache hierarchy. Non-inclusive caches show neither inclusive nor exclusive behavior. This means that any memory access first fetches the memory block into the upper-level caches, but the data can then be evicted from the outer or inner caches independently. Hence, accesses to the L1 cache cannot be detected by monitoring the LLC, as is possible on Intel machines.

Hence, to perform a side channel attack on AMD processors, both of these challenges need to be overcome. Here we present a covert channel that is immune to both complications. The proposed Invalidate and Transfer attack is the first side channel attack that works across CPUs featuring non-inclusive or exclusive caches. Invalidate and Transfer presents a new covert channel based on the cache coherency technologies implemented in modern processors.
In particular, we focus on AMD processors, which have exclusive caches that in principle are invulnerable to cache side channel attacks, although the results can be readily applied to multi-CPU Intel systems as well. In summary:

•We present the first cross-CPU side channel attack, showing that CPU co-location is not needed in multi-CPU servers to obtain fine grain information.

•We present a new deduplication-based covert channel that utilizes directory-based cache coherency protocols to extract sensitive information.

•We show that the new covert channel succeeds on processors where cache attacks have not been shown to be possible before, e.g. AMD exclusive caches.

•We demonstrate the feasibility of our new side channel technique by mounting an attack on a T-table based AES implementation and on a square-and-multiply implementation of the El Gamal scheme.

5.1 Cache Coherence Protocols

In order to ensure coherence between different copies of the same data, systems implement cache coherence protocols. In the multiprocessor setting, the coherency between shared blocks that are cached in different processors (and therefore in different caches) also needs to be maintained. The system has to ensure that each processor accesses the most recent value of a shared block, regardless of where that memory block is cached.

The two main categories of cache coherence protocols are snooping-based protocols and directory-based protocols. Although snooping-based protocols follow a decentralized approach, they usually require a centralized data bus that connects all caches, which results in excessive bandwidth requirements as the number of cores increases. Directory-based protocols, in contrast, enable point-to-point connections between cores and directories, an approach that scales much better with an increasing number of cores. We focus on the latter, since it is the prevailing choice in current multiprocessor systems.
The directory keeps track of the state of each cached memory block. Thus, upon a memory block access request, the directory decides the state that the memory block has to be turned into, both in the requesting node and in the sharing nodes that have a cached copy of the requested memory block. We analyze the simplest cache coherence protocol, with only 3 states, since the attack implemented in this study relies on read-only data; the additional states applied in more complicated coherency protocols do not affect the flow of our attack.

We introduce the terms home node for the node where the memory block resides, local node for the node requesting access to the memory block, and owner node for a node that has a valid copy of the memory block cached. This leads to various communication messages, summarized as follows:

•A memory block cached in one or more nodes can be in the uncached, exclusive/modified, or shared state.

•Upon a read hit, the local node's cache services the data. In this case, the memory block maintains its state.

•Upon a read miss, the local node contacts the home node to retrieve the memory block. The directory knows the state of the memory block in the other nodes, so its state is changed accordingly: if the block is in the exclusive state, it goes to shared; if the block is in the shared state, it maintains it. In both cases the local node then becomes an owner and holds a copy of the shared memory block.

•Upon a write hit, the local node sets the memory block to exclusive. The local node instructs the nodes that have a cached copy of the memory block to invalidate or update it.

•Upon a write miss, again the home node services the memory block. The directory knows the nodes that have a cached copy of the memory block, and therefore sends them either an update or an invalidate message. The local node then becomes the owner of the exclusive memory block.
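The directory transitions just listed can be condensed into a toy model. The class below is an illustrative sketch (the API and node names are invented, and writes are simplified to the invalidate variant), not part of any real coherence implementation:

```python
class Directory:
    """Toy home-node directory for the 3-state protocol sketched above."""

    def __init__(self):
        self.state = "uncached"
        self.owners = set()   # nodes holding a cached copy

    def read(self, node):
        if node in self.owners:
            return            # read hit: state unchanged
        # read miss: exclusive degrades to shared, shared stays shared,
        # and the requester becomes an owner
        self.state = "shared"
        self.owners.add(node)

    def write(self, node):
        # write hit or miss: every other copy is invalidated and the
        # local node becomes the owner of the exclusive block
        self.owners = {node}
        self.state = "exclusive"

d = Directory()
d.read("A"); d.read("B")
print(d.state, sorted(d.owners))  # shared ['A', 'B']
d.write("A")
print(d.state, sorted(d.owners))  # exclusive ['A']
```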
In practice, most cache coherency protocols have additional states that a memory block can acquire. The most studied is the MESI protocol, in which the exclusive state is split into the exclusive and modified states. A memory block is exclusive when a single node has a clean copy of it cached. However, when a cached memory block is modified, it acquires the modified state, since it is no longer consistent with the value stored in memory. A write-back operation sets the memory block back to the exclusive state.

The protocols implemented in modern processors are variants of the MESI protocol, mainly adding further states. For instance, the Intel i7 processor uses the MESIF protocol, which adds the forward state. This state designates the sharing processor that should reply to a request for a shared memory block, without involving a memory access operation. The AMD Opteron utilizes the MOESI protocol with the additional owned state. This state indicates that the memory block is owned by the corresponding cache and is out of date with respect to the memory value. However, contrary to the MESI protocol, where a transition from modified to shared involves a write-back operation, a node holding a memory block in the owned state can service it to the sharing nodes without writing it back to memory. Note that both the MESIF and the MOESI protocol involve a cache-to-cache block forwarding operation: both the owned and the forward state imply that a cache, rather than the DRAM, will satisfy the read request. If the access time from cache differs from regular DRAM access times, this behavior becomes an exploitable covert channel.

5.1.1 AMD HyperTransport Technology

Cache coherency plays a key role in multi-core servers, where a memory block might reside in many core-private caches in the same state or in a modified state. In multiple-socket servers, this coherency has to be maintained not only within a processor but also across CPUs.
Thus, complex technologies are implemented to ensure coherency across the system. These technologies center around the directory-based protocols explained in Section 5.1. The HyperTransport technology implemented by AMD processors serves as a good example. We focus only on the features relevant to the newly proposed covert channel; a detailed explanation can be found in [CKD+10, AMD].

The HyperTransport technology reserves a portion of the LLC to act as a directory cache for the directory-based protocol. This directory cache keeps track of the cached memory blocks present in the system. Once the directory is full, one of the previous entries is replaced to make room for a new cached memory block. The directory always knows the state of any cached memory block, i.e., if a cache line exists in any of the caches, it must also have an entry in the directory.

Any memory request goes first through the home node's directory. The directory knows which processors, if any, have the requested memory block cached. The home node initiates in parallel both a DRAM access and a probe filter. The probe filter is the action of checking in the directory which processor has a copy of the requested memory block. If any node holds a cached copy of the memory block, a directed probe against it is initiated, i.e., the memory block is forwarded directly from the cache holding it to the requesting processor. A directed probe message does not trigger a DRAM access. Instead, communication between nodes is carried over HyperTransport links, which can run as fast as 3 GHz. Figure 5.1 shows a diagram of how the HyperTransport links directly connect the different CPUs to each other, avoiding memory node accesses.

Although many execution patterns can arise from this protocol, we only explain those relevant to the attack, i.e. events triggered on read-only blocks, which we elaborate on later.
We assume that we have processors A and B, referred to as P_a and P_b, that share a memory block:

•If P_a and P_b have the same memory block cached, then upon a modification made by P_a, HyperTransport notifies P_b that P_a has the latest version of the memory block. P_a's copy of the block is thereby converted from a shared block into an owned block. Upon a new request made by P_b, HyperTransport transfers the updated memory block cached in P_a.

•Similarly, upon a cache miss in P_a, the home node sends a probe message to the processors that have a copy of the same shared memory block, if any. If, for instance, P_b has it, a directed probe message is initiated so that P_b can service the cached data through the HyperTransport links. HyperTransport thus reduces the latency of retrieving a memory block from the DRAM by also checking whether someone else maintains a cached copy of the same memory block. Note that this process does not involve a write-back operation.

•When a new entry has to be placed in the directory of P_a and the directory is full, one of the previously allocated entries has to be evicted to make room for the new entry. This is referred to as a downgrade probe. In this case, if the cache line is dirty, a write-back is forced, and an invalidate message is sent to all the processors (here, P_b) that maintain a cached copy of the same memory block.

Figure 5.1: DRAM accesses vs. directed probes over the HyperTransport links.

In short, HyperTransport reduces the latencies observed in previously implemented cache coherency protocols by issuing directed probes to the nodes that have a copy of the requested memory block cached. The HyperTransport links ensure a fast transfer to the requesting node. In fact, the introduction of HyperTransport links greatly improved the performance, and thus the viability, of multi-CPU systems.
Earlier multi-CPU systems relied on broadcast or directory protocols in which a request for an exclusively cached memory block held by an adjacent processor implied a write-back operation to retrieve the up-to-date memory block from the DRAM.

5.1.2 Intel QuickPath Interconnect Technology

In order to maintain cache coherency across multiple CPUs, Intel implements a technique similar to AMD's HyperTransport, called Intel QuickPath Interconnect (QPI) [Intd, IQP]. Indeed, the latter was designed five years later than the former to compete with the existing technology in AMD processors. Like HyperTransport, QPI connects one or more processors through high-speed point-to-point links running as fast as 3.2 GHz. Each processor has a memory controller on the same die to improve performance. As we have already seen with AMD, among other advantages this interface efficiently manages cache coherence in multi-processor servers by transferring shared memory blocks through the high-speed QPI links. Consequently, the mechanisms we propose in this work are also applicable to servers featuring multiple Intel CPUs.

Figure 5.2: Comparison of a directed probe access across processors: probe satisfied from CPU 1's cache directly via the HyperTransport link (a) vs. probe satisfied by CPU 1 via a slow DRAM access (b).

5.2 Invalidate and Transfer Attack Procedure

We propose a new spy process that takes advantage of leakage observed in the cache coherency protocol when memory blocks are shared between several processors/cores. The spy process does not rely on specific characteristics of the cache hierarchy, such as inclusiveness. In fact, the spy process works even across co-resident CPUs that do not share the same cache hierarchy.
From now on, we assume that the victim and attacker share the same memory block and that they are located in different CPUs, or in different cache hierarchies, of the same server. The spy process is executed in three main steps:

•Invalidate step: The attacker invalidates a memory block that is in his own cache hierarchy. If the invalidation is performed on a shared memory block that is also cached in another processor's cache hierarchy, HyperTransport sends an invalidate message to that processor. Therefore, after the invalidation step the memory block is invalidated in, and evicted from, all the caches holding a copy of it. The invalidation can be achieved with specialized instructions like clflush, if they are supported by the targeted processors, or by priming the set where the memory block resides in the cache directory.

•Wait step: The attacker waits for a certain period of time to let the victim do some computation. The victim may or may not use the invalidated memory block in this step.

•Transfer step: In the last step, the attacker requests access to the shared memory block that was invalidated. If any processor in the system has cached this memory block, the entry in the directory will have been updated and a directed probe request will be sent to that processor. If the memory block has not been used, the home directory will instead issue a DRAM access for the memory block. The system experiences a lower latency when a directed probe is issued, mainly because the memory block is served from another processor's cache hierarchy.

Figure 5.3: Timing distribution of a memory block request serviced by the DRAM (red) vs. a block request serviced by a co-resident processor (blue) on an AMD Opteron 6168. The measurements are taken from different CPUs. Outliers above 400 cycles have been removed.
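As with Flush and Reload, the three steps reduce to a timed load plus a threshold. The sketch below is a purely illustrative Python model: the coherency traffic and cycle counts are assumptions standing in for the real clflush/rdtsc machinery:

```python
THRESHOLD = 175  # cycles; assumed boundary between a directed probe
                 # (~150 cycles) and a DRAM access (~210 cycles)

def invalidate(block, caches):
    for c in caches:              # Step 1: the invalidate message
        c.discard(block)          # reaches every cache holding a copy

def transfer_time(block, caches):
    # Step 3: a directed probe is fast if some CPU re-cached the block;
    # otherwise the home node falls back to a slower DRAM access
    return 150 if any(block in c for c in caches) else 210

def victim_used_block(block, victim, caches):
    invalidate(block, caches)     # Step 1: invalidate everywhere
    victim(caches)                # Step 2: wait while the victim runs
    return transfer_time(block, caches) < THRESHOLD

touches = lambda caches: caches[0].add("blk")   # victim uses the block
idles = lambda caches: None                     # victim does not
print(victim_used_block("blk", touches, [set()]))  # True
print(victim_used_block("blk", idles, [set()]))    # False
```

The roughly 50-cycle gap modeled between the two cases mirrors the separation visible in the measured distributions.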
This is graphically observed in Figure 5.2: Figure 5.2(a) shows a request serviced over the HyperTransport link by a CPU that has the same memory block cached, while Figure 5.2(b) represents a request serviced by a DRAM access. This introduces a new leakage if the attacker is able to measure and distinguish the time that the two actions take; this is the covert channel exploited in this work.

We use the RDTSC instruction, which reads the time stamp counter, to measure the request time. In case RDTSC is not available from user mode, one can instead create a parallel thread that increments a shared variable acting as a counter. We also utilize the mfence instruction to ensure that all memory load/store operations have finished before reading the time stamp counter.

The timing distributions of both the DRAM access and the directed transfer access are shown in Figure 5.3, where 10,000 points of each distribution were taken across CPUs on a 48-core, four-CPU AMD Opteron 6168. The x-axis represents hardware cycles, while the y-axis represents the density function. The measurements are taken across processors. The blue distribution represents a directed probe access, i.e., a co-resident CPU has the memory block cached, whereas the red distribution represents a DRAM access, i.e., the memory block is not cached anywhere. It can be observed that the distributions differ by about 50 cycles, fine-grained enough to distinguish them. However, the variance of the two distributions is very similar, in contrast to LLC covert channels.

Figure 5.4: Timing distribution of a memory block request serviced by the DRAM (red) vs. a block request serviced by a co-resident core (blue) on a dual-core Intel E5-2609. The measurements are taken from the same CPU. Outliers above 700 cycles have been removed.
Nevertheless, we obtain a covert channel that works across CPUs and does not rely on the inclusiveness property of the cache. We also tested the viability of the covert channel on a dual-socket Intel Xeon E5-2609. Intel utilizes a technique similar to HyperTransport, called Intel QuickPath Interconnect. The results for the Intel processor are shown in Figure 5.4, again with the processes running in different CPUs. It can be observed that the distributions are even more distinguishable in this case.

5.3 Exploiting the New Covert Channel

In the previous section we established the viability of the covert channel. Here we demonstrate how one might exploit it to extract fine grain information. More concretely, we present two attacks:

•a symmetric cryptography algorithm, i.e. the table-based OpenSSL implementation of AES, and

•a public key algorithm, i.e. a square-and-multiply based libgcrypt implementation of the El Gamal scheme.

5.3.1 Attacking Table Based AES

We test the granularity of the new covert channel by mounting an attack on a software implementation of AES, as in Section 4.4. We again use the C OpenSSL reference implementation, which uses 4 different T-tables across the 10 rounds of AES-128. To recap, we monitor a memory block belonging to each one of the T-tables. Each memory block contains 16 T-table positions, and it has a certain probability, 8% in our particular case, of not being used in any of the 10 rounds of an encryption. Thus, by applying our Invalidate and Transfer attack and recording the ciphertext output, we can tell when the monitored memory block has not been used. For this purpose we invalidate the memory block before the encryption and try to probe it after the encryption.
In a noise free scenario, the monitored memory block will not be used for 240 of the ciphertext values with 8% probability, and it will not be used for the remaining 16 ciphertext values with 0% probability (because they directly map through the key to the monitored T-table memory block). Although microarchitectural attacks suffer from different microarchitectural sources of noise, we expect that the Invalidate and Transfer can still distinguish both distributions. Once we know the ciphertext values belonging to both distributions, we can apply the equation K_i = T[S_j] ⊕ C_i to recover the key. Since the last round of AES involves only a table look-up and an XOR operation, knowing the ciphertext and the T-table block position used is enough to obtain the key byte candidate that was used during the last AES round. Since a cache line holds 16 T-table values, we XOR each of the obtained ciphertext values with all the 16 possible T-table values that they could map to. Clearly, the key candidate will be a common factor in the computations, with the exception of the observed noise, which is eliminated via averaging. As the AES key schedule is invertible, knowing one of the round keys is equivalent to knowing the full encryption key.

5.3.2 Attacking Square and Multiply El Gamal Decryption

We test the viability of the new side channel technique with an attack on a square and multiply libgcrypt implementation of the public key ElGamal algorithm, as in [ZJRR12]. An ElGamal encryption involves a cyclic group of order p and a generator g of that cyclic group. Alice chooses a number a ∈ Z*_p, computes her public key as the 3-tuple (p, g, g^a), and keeps a as her secret key. To encrypt a message m, Bob first chooses a number b ∈ Z*_p, calculates y_1 = g^b and y_2 = ((g^a)^b) * m, and sends both to Alice. In order to decrypt the message, Alice utilizes her secret key a to compute ((y_1)^{-a}) * y_2. Note that if a malicious user recovers the secret key a, he can decrypt any message sent to Alice.
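Alice's exponentiation y_1^{-a} can be sketched with the textbook square-and-multiply loop. This is a minimal sketch with the conventional R = 1 initialization and 64-bit toy operands; libgcrypt operates on multi-precision integers, and in practice the negative exponent is first reduced (e.g., -a ≡ p-1-a mod p-1 for a prime modulus p).

```c
#include <stdint.h>
#include <assert.h>

/* Square-and-multiply modular exponentiation, scanning the exponent from
 * the most significant to the least significant bit. The data-dependent
 * multiply on the exponent's 1-bits is exactly the leak the spy process
 * observes. Requires GCC/Clang's unsigned __int128 for the products. */
static uint64_t modexp(uint64_t b, uint64_t e, uint64_t N)
{
    uint64_t r = 1;
    for (int i = 63; i >= 0; i--) {
        r = (uint64_t)((unsigned __int128)r * r % N);     /* square + reduction */
        if ((e >> i) & 1)
            r = (uint64_t)((unsigned __int128)r * b % N); /* multiply + reduction */
    }
    return r;
}
```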
Our target will be the y_1^{-a} computation, which uses the square and multiply technique as the modular exponentiation method. It bases its procedure on two operations: a square operation followed by a modulo reduction, and a multiplication operation followed by a modulo reduction. The algorithm starts with the intermediate state R = b, b being the base that is going to be powered, and then examines the secret exponent a from the most significant to the least significant bit. If the bit is a 0, the intermediate state is squared and reduced with the modulus. If, on the contrary, the exponent bit is a 1, the intermediate state is first squared, then multiplied with the base b, and then reduced with the modulus. Algorithm 2, presented in the background section and repeated below as Algorithm 6, shows the entire procedure.

Algorithm 6: Square and Multiply Exponentiation
    Input: base b, modulus N, secret E = (e_{k-1}, ..., e_1, e_0)
    Output: b^E mod N
    R = b;
    for i = k-1 downto 0 do
        R = R^2 mod N;
        if e_i == 1 then
            R = R * b mod N;
        end
    end
    return R;

As can be observed, the algorithm does not implement a constant execution flow, i.e., the functions that are used directly depend on the exponent bits. If the square and multiply pattern is known, the complete key can easily be computed by converting the pattern into ones and zeros. Indeed, our Invalidate and Transfer spy process can recover this information, since the functions are stored as shared memory blocks in cryptographic libraries. Thus, we mount an attack with the Invalidate and Transfer to monitor when the square and multiplication functions are utilized.

5.4 Experiment Setup and Results

In this section we present the test setup in which we implemented and executed the Invalidate and Transfer spy process, together with the results obtained for the AES and ElGamal attacks.

5.4.1 Experiment Setup

In order to prove the viability of our attack, we performed our experiments on a 48-core machine featuring four 12-core AMD Opteron 6168 CPUs.
This is a university server which has not been isolated for our experiments, i.e., other users are utilizing it at the same time. Thus, the environment is a realistic scenario in which undesired applications run concurrently with our attack. The machine runs at 1.9 GHz, featuring 3.2 GHz HyperTransport links. The server has 4 AMD Opteron 6168 CPUs with 12 cores each. Each core features a private 64 KB 2-way L1 data cache, a private 64 KB L1 instruction cache and a 16-way 512 KB L2 cache. Two 6 MB 96-way associative L3 caches, each one shared across 6 cores, complete the cache hierarchy. The L1 and L2 caches are core-private resources, whereas each L3 cache is shared between 6 cores. Both the L2 and L3 caches are non-inclusive, i.e., data can be allocated in any one cache level at a time. This is different from the inclusive LLC on which most of the cache spy processes in the literature have been executed. The attacks were implemented on a Red Hat enterprise server running the Linux 2.6.23 kernel. The attacks do not require root access to succeed; in fact, we did not have sudo rights on this server. Since ASLR was enabled, the targeted functions' addresses were retrieved by calculating their offset with respect to the starting point of the library. All the experiments were performed across CPUs, i.e., attacker and victim do not reside in the same CPU and do not share any LLC. To ensure this, we utilized the taskset command to assign the CPU affinity of our processes. Our targets were the AES C reference implementation of OpenSSL and the ElGamal square and multiply implementation of libgcrypt 1.5.2. The libraries are compiled as shared, i.e., all users in the OS will use the same shared symbols (through the KSM mechanism). In the case of AES we assume we are synchronized with the AES server, i.e., the attacker sends plaintexts and receives the corresponding ciphertexts. As for the ElGamal case, we assume we are not synchronized with the server.
Instead, the attacker process simply monitors the function until valid patterns are observed, which are then used for key extraction.

5.4.2 AES Results

As explained in Section 5.3, in order to recover the full key we need to target a single memory block from each of the four T-tables. However, in the case that a T-table memory block starts in the middle of a cache line, monitoring only 2 memory blocks is enough to recover the full key. In fact, there exists a memory block that contains both the last 8 values of T0 and the first 8 values of T1. Similarly, there exists a memory block that contains the last 8 values of T2 and the first 8 values of T3. Unlike in Section 4.4, we make use of this fact and only monitor those two memory blocks to recover the entire AES key. We store both the transfer timing and the ciphertext obtained by our encryption server.

Figure 5.5: Miss counter values for each ciphertext value, normalized to the average.

In order to analyze the results, we implement a miss counter approach: we count the number of times that each ciphertext value sees a miss, i.e., that the monitored cache line was not loaded for that ciphertext value. An example of one of the runs for ciphertext byte number 0 is shown in Figure 5.5. The 8 ciphertext values that obtain the lowest scores are the ones corresponding to the cache line, thereby revealing the key value. In order to obtain the key, we iterate over all possible key byte values and compute the last round of AES only for the monitored T-table values, and then group the miss counter values of the resulting ciphertexts in one set. We group in another set the miss counters of the remaining 248 ciphertext values. Clearly, for the correct key, the distance between the two sets will be maximum.
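A simplified sketch of this grouping step: for a key-byte guess, the ciphertext values consistent with the monitored line under the last-round relation C = K ⊕ T[S] form one set and the rest the other, and the difference of average miss counters is the distance to maximize. The `line[]` values below are stand-ins (the first 16 AES S-box bytes), and this is the generic one-line-per-table case with 16 mapped values; the straddling blocks used in the text map 8 values per table.

```c
#include <stdint.h>
#include <assert.h>

/* For one key-byte guess, split the 256 per-ciphertext-value miss counters
 * into the values that map to the monitored line under that guess and the
 * rest, and return the difference of the two set averages. The correct
 * guess maximizes this distance. */
static double guess_distance(const unsigned miss[256],
                             const uint8_t line[16], uint8_t guess)
{
    int mapped[256] = {0};
    for (int j = 0; j < 16; j++)
        mapped[guess ^ line[j]] = 1;          /* C = K xor T[S] */
    double in = 0, out = 0;
    int n_in = 0;
    for (int c = 0; c < 256; c++) {
        if (mapped[c]) { in  += miss[c]; n_in++; }
        else           { out += miss[c]; }
    }
    return out / (256 - n_in) - in / n_in;
}
```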
An example of the output of this step is shown in Figure 5.6, where the y-axis represents the miss counter ratio (i.e., the ratio of the miss counter values in both sets) and the x-axis represents the key byte guess value. It can be observed that the ratio of the correct key byte (180) is much higher than the ratio of the other guesses. Finally, we calculate the number of encryptions needed to recover the full AES key. This is shown in Figure 5.7, where the y-axis again represents the ratios and the x-axis represents the number of encryptions. As can be observed, the correct key is not distinguishable before 10,000 traces, but from 20,000 observations the correct key is clearly distinguishable from the rest. We conclude that the new method succeeds in recovering the correct key from 20,000 encryptions.

Figure 5.6: Correct key byte finding step, iterating over all possible keys. The maximum distance is observed for the correct key (byte guess 180, ratio 0.05315).

Figure 5.7: Difference of ratios over the number of encryptions needed to recover the full AES key. The correct key (bold red line) is clearly distinguishable from 20,000 encryptions.

5.4.3 El Gamal Results

Next we present the results obtained when the attack aims at recovering an ElGamal decryption key. We target a 2048 bit ElGamal key. Remember that, unlike in the case of AES, this attack does not need synchronization with the server, i.e., the server runs continuous decryptions while the attacker continuously monitors the vulnerable function.

Figure 5.8: Trace observed by the Invalidate and Transfer, where 4 decryption operations are monitored. The decryption stages are clearly visible when the square function usage takes the 0 value.
Since the modular exponentiation creates a very specific pattern with respect to both the square and multiply functions, we can easily know when the exponentiation occurred in time. We only monitor a single function, i.e., the square function. In order to avoid speculative execution, we do not monitor the main function address but the following one. This is sufficient to correctly recover a very high percentage of the ElGamal decryption key bits. For our experiments, we take into account the time that the invalidate operation takes; a minimum waiting period of 500 cycles between the invalidate and the transfer operations is sufficient to recover the key patterns. Figure 5.8 presents a trace where 4 different decryptions are caught. A 0 on the y-axis means that the square function is being utilized, a 1 that it is not, and the x-axis represents the time slot number. The decryption stages are clearly observable when the square function gets a 0 value. Recall that the execution flow caused by a 0 bit in the exponent is square + reduction, while the pattern caused by a 1 bit in the exponent is square + reduction + multiply + reduction. Since we only monitor the square operation, we reconstruct the patterns by checking the distance between two square operations. Clearly, the distance between the two square operations in a 00 trace will be smaller than the distance between the two square operations in a 10 trace, since the latter includes an additional multiplication function. With our waiting period threshold, we observe that the distance between two square operations without an intervening multiplication varies from 2 to 4 Invalidate and Transfer steps, while the distance between two square operations with an intervening multiplication varies from 6 to 8 Invalidate and Transfer steps. If the distance between two square operations is lower than 2, we consider it part of the same square operation. An example of such a trace is shown in Figure 5.9.
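The gap-to-bits conversion just described can be sketched as follows, with the thresholds taken from the text: fewer than 2 steps is still the same square operation, 2-4 steps mean a 0 bit, and a larger gap means a 1 bit.

```c
#include <stddef.h>
#include <assert.h>

/* Convert the time-slot indices at which the square function was observed
 * into exponent bits (MSB first). A gap of 2-4 Invalidate and Transfer
 * steps between consecutive squares means square+reduction only (bit 0);
 * a larger gap means an extra multiply+reduction happened in between
 * (bit 1); a gap below 2 is the same square operation. The bit belonging
 * to the final square cannot be decided from gaps alone and is left out. */
static size_t squares_to_bits(const int *slot, size_t n, char *bits)
{
    size_t k = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        int d = slot[i + 1] - slot[i];
        if (d < 2) continue;              /* same square operation */
        bits[k++] = (d <= 4) ? '0' : '1';
    }
    bits[k] = '\0';
    return k;
}
```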
Figure 5.9: Trace observed by the Invalidate and Transfer, converted into square and multiply functions. The y-axis shows a 0 when the square function is used and a 1 when it is not.

In the figure, S refers to a square operation, R refers to a modulo reduction operation, and M refers to a multiply operation. The x-axis represents the time slot, while the y-axis represents whether the square function was utilized: the 0 value means that the square function was utilized, whereas the 1 value means that it was not. The pattern obtained is SRMRSRSRMRSR, which can be translated into the key bit string 1010. However, due to microarchitectural sources of noise (context switches, interrupts, etc.) the recovered key still has some errors. In order to evaluate the error percentage obtained, we compare the obtained bit string with the real key. Any insertion, removal or wrongly guessed bit is counted as a single error. Table 5.1 summarizes the results. We evaluate 20 different key traces obtained with the Invalidate and Transfer spy process. On average, the key patterns have an error percentage of 2.58%. The minimum observed error percentage was 1.9% and the maximum was 3.47%. Thus, since the errors are very likely to occur at different points, we analyse more than one trace in order to decide the correct pattern. On average, 5 traces are needed to recover the key correctly.

Table 5.1: Summary of error results in the ElGamal key recovery attack

    Traces analysed                        20
    Maximum error observed                 3.47%
    Minimum error observed                 1.9%
    Average error                          2.58%
    Traces needed to recover full key      5

5.5 Invalidate and Transfer Outcomes

We presented the Invalidate and Transfer attack, capable of recovering, for the first time, information across CPU sockets in systems that provide memory deduplication.
The new attack utilizes the cache coherency protocol as a covert channel, and its effectiveness was proved by recovering both AES and ElGamal keys. Further, the attack is agnostic to the inclusiveness property of the cache. On the downside, the attack still requires attacker and victim to share memory, and thus is only applicable in VMMs with memory deduplication or in smartphones. In response, the next chapter introduces an attack that does not rely on the memory deduplication feature, and thus is applicable in virtually every system in which attacker and victim processes can co-reside.

Chapter 6

The Prime and Probe Attack

In previous chapters, we presented two side channel attacks that used hardware cache properties to retrieve information across cores/CPUs. However, both attacks worked under the assumption of shared memory between attacker and victim, which was achievable through mechanisms like KSM. Although we demonstrated that fine grain information can be recovered with them, due to the shared memory requirement we observe the following challenges associated with them:

• Some real world scenarios might not implement memory deduplication features. For instance, some commercial clouds have cross-VM memory deduplication disabled, as is the case for Amazon EC2. Furthermore, memory page sharing is not allowed between trusted and untrusted worlds in Trusted Execution Environments (TEEs).

• Flush and Reload and Invalidate and Transfer, since they rely on memory sharing, cannot retrieve information from dynamically allocated memory, as every user gets their own copy of it. This limits the applicability of both attacks.

These two inconveniences restrict the applicability of Flush and Reload and Invalidate and Transfer. Thus, it is necessary to know whether an attacker can bypass these obstacles and still gain information about a victim's activity across cores. This chapter explains a new approach that utilizes the Last Level Cache (LLC) as a covert channel without relying on memory deduplication features.
In particular, we take the already known Prime and Probe attack and make it successful on the LLC. This is not a straightforward process, as it still requires solving some technical issues when targeting the LLC.

6.1 Virtual Address Translation and Cache Addressing

In this chapter we present an attack that takes advantage of some known information in the virtual to physical address mapping process. Thus, we give a brief overview of the procedure followed by modern processors to access and address data in the cache [HP11]. In modern computing, processes use virtual memory to access the different requested memory locations. Indeed, processes do not have direct access to the physical memory, but use virtual addresses that are then mapped to physical addresses by the Memory Management Unit (MMU). This virtual address space is managed by the Operating System. The main benefits of virtual memory are security (processes are isolated from real memory) and the usage of more memory than physically available thanks to paging techniques. The memory is divided into fixed length contiguous blocks called memory pages. Virtual memory allows the usage of these memory pages even when they are not allocated in the main memory. When a specific process needs a page not present in the main memory, a page fault occurs and the page has to be loaded from the auxiliary disk storage. Therefore, a translation stage is needed to map virtual addresses to physical addresses prior to the memory access. In fact, cloud systems have two translation processes, i.e., guest OS virtual address to VMM virtual address and VMM virtual address to physical address. The first translation is handled by shadow page tables while the second one is handled by the MMU. This adds an abstraction layer over the physical memory that is handled by the VMM. During translation, the virtual address is split into two fields: the offset field and the page field. The length of both fields depends directly on the page size.
Indeed, if the page size is p bytes, the lower log2(p) bits of the virtual address are considered the page offset, while the rest are considered the page frame number (PFN). Only the PFN is processed by the MMU and needs to be translated from virtual to physical page number. The page offset remains untouched and has the same value in both the physical and the virtual address. Thus, the user still knows some bits of the physical address. Modern processors usually work with 4 KB pages and 48 bit virtual addresses, yielding a 12 bit offset and the remaining bits as virtual page number. In order to avoid the latency of virtual to physical address translation, modern architectures include a Translation Lookaside Buffer (TLB) that holds the most recently translated addresses. The TLB acts like a small cache that is checked prior to the MMU. One way to avoid TLB misses for large data processes is to increase the page size so that the memory is divided into fewer pages [CJ06, Inte, WW09]. Since the possible virtual to physical translation tags are significantly reduced, the CPU will observe fewer TLB misses than with 4 KB pages. This is the reason why most modern processors include the possibility to use huge size pages, which typically have a size of at least 1 MB. This feature is particularly effective in virtualized settings, where virtual machines are typically rented to offload the intensive hardware resource consumption from the customers' private computers. In fact, most well known VMMs support the usage of huge size pages by guest OSs to improve the performance of those heavy load processes [VMwc, KVM, Xenb]. Cache Addressing: Caches are physically tagged, i.e., the physical address is used to decide the position that the data is going to occupy in the cache.
With b-byte cache lines and m-way set associative caches with n sets, the lower log2(b) bits of the physical address are used to index the byte within a cache line, while the following log2(n) bits select the set that the memory line is mapped to in the cache. A graphical example of the procedure carried out to address the data in the cache can be seen in Figure 6.2. Recall that if a page size of 4 KB is used, the offset field is 12 bits long. If log2(n) + log2(b) is not bigger than 12, the set that a cache line is going to occupy can be addressed by the offset. In this case we say that the cache is virtually addressed, since the position occupied by a cache line can be determined by the virtual address. In contrast, if more than 12 bits are needed to address the corresponding set, we say that the cache is physically addressed, since only the physical address can determine the location of a cache line. Note that when huge size pages are used, the offset field is longer, and therefore bigger caches can be virtually addressed. As we will see, this information can be used to mount a cross-VM attack on the L3 cache in deduplication free systems. Note that this information was not necessary for Flush and Reload and Invalidate and Transfer, as we assumed that we shared ownership of the attacked memory blocks with the victim.

6.2 Last Level Cache Slices

Recent SMP microarchitectures divide the LLC into slices with the purpose of reducing the bandwidth bottlenecks when more than one core attempts to retrieve data from the LLC at the same time. The number of slices that the LLC is divided into usually matches the number of physical cores. For instance, a processor with s cores divides the LLC into s slices, decreasing the probability of resource conflict while accessing it. However, each core is still able to access the whole LLC, i.e., each core can access every slice.
Since the data will be spread over s "smaller caches", it is less likely that two cores will try to access the same slice at the same time. In fact, if each slice can support one access per cycle, the LLC does not introduce a bottleneck on the data throughput as long as each processor issues no more than one access per cycle. The slice that a memory block is going to occupy directly depends on its physical address and a non-public hash function, as in Figure 6.1.

Figure 6.1: A hash function based on the physical address decides whether the memory block belongs to slice 0 or 1.

Performance optimization of sliced caches has received a lot of attention in the past few years. In 2006, Cho et al. [CJ06] analyzed a distributed management approach for sliced L2 caches through OS page allocation. In 2007, Zhao et al. [ZIUN08] described a design for LLCs where part of the slice allocates core-private data. Cho et al. [JC07] describe a two-dimensional page coloring method to improve access latencies and miss rates in sliced caches. Similarly, Tam et al. [TASS07] also proposed an OS based method for partitioning the cache to avoid cache contention. In 2010 Hardavellas et al. [HFFA10] proposed an optimized cache block placement for caches divided into slices. Srikantaiah et al. [SKZ+11] presented a new adaptive multilevel cache hierarchy utilizing cache slices for L2 and L3 caches. In 2013 Chen et al. [CCC+13] detail the approach that Intel is planning to take for their next generation processors. The paper shows that the slices will be workload dependent and that some of them might be dynamically switched off for power saving. In 2014 Kurian et al. [KDK14b] proposed a data replication protocol in the LLC slices. Ye et al. [YWCL14] studied a cache partitioning system treating each slice as an independent smaller cache.
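For illustration, in the processors where this function has been reverse engineered (see the next section), the slice-index bits turn out to be XORs of subsets of physical address bits. A hypothetical two-slice function of that shape, with a made-up bit mask:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical 2-slice selection function: one slice-index bit computed as
 * the XOR (parity) of a subset of physical address bits selected by mask.
 * The mask is illustrative only; real processors use undocumented masks.
 * __builtin_parityll is a GCC/Clang builtin. */
static inline unsigned slice_bit(uint64_t pa, uint64_t mask)
{
    return __builtin_parityll(pa & mask);
}
```

Note that a function of this shape is linear over XOR, which is what makes the equation-system approach described later in this section workable.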
However, only very little effort has been put into analyzing the hash function used for selecting the LLC slice. A detailed analysis of the cache performance in Nehalem processors is given in [MHSM09], without an explanation of cache slices. The LLC slices and interconnections in the Sandy Bridge microarchitecture are discussed in [Bri], but the slice selection algorithm is not provided. In [hen], a cache analysis of the Intel Ivy Bridge architecture is presented and cache slices are mentioned. However, the hash function describing the slice selection algorithm is again not described, although it is mentioned that many bits of the physical address are involved. Hund et al. [HWH13] were the only ones describing the slice selection algorithm for a specific processor, i.e., the i7-2600. They recover the slice selection algorithm by comparing Hamming distances of different physical addresses. This information was again not needed for Flush and Reload and Invalidate and Transfer, as we could evict a memory block from the cache with the clflush instruction in systems where deduplication is enabled.

Figure 6.2: Last level cache addressing methodology for Intel processors. Slices are selected by the tag, which is given by the MSBs of the physical address.

6.3 The Original Prime and Probe Technique

The new attack proposed in this work is based on the methodology of the known Prime and Probe technique. Prime and Probe is a cache-based side channel attack technique used in [OST06, ZJOR11, ZJRR12] that can be classified as an access driven cache attack. The spy process ascertains which of the sets in the cache have been accessed by a victim. The attack is carried out in 3 stages:

• Priming stage: In this stage, the attacker fills the monitored cache sets with his own cache lines. This is done by reading his own data.
• Victim accessing stage: In this stage the attacker waits for the victim to access some positions in the cache, causing the eviction of some of the cache lines that were primed in the first stage.

• Probing stage: In this stage the attacker accesses the priming data again. When the attacker reloads data from a set that has been used by the victim, some of the primed cache lines have been evicted, causing a higher probe time. However, if the victim did not use any of the cache lines in a monitored set, all the primed cache lines will still reside in the cache, causing a low probe time.

6.4 Limitations of the Original Prime and Probe Technique

The original Prime and Probe technique was successfully applied to L1 caches to recover cryptographic keys. It is therefore an open question why, with multi-core systems and shared LLCs, no prior work has applied it to the LLC to recover information across cores. Here we summarize the three main problems of taking the Prime and Probe attack to the LLC:

• The L1 Prime and Probe attack fills the whole L1 cache with the attacker's data. As the LLC is usually at least two orders of magnitude bigger than the L1, filling the entire LLC does not seem a realistic approach.

• As memory pages are usually 4 KB and the page offset remains untouched by the virtual address translation, the attacker gains enough bits of information to fully know the location of a memory block in the L1. However, as the LLC has more sets, the attacker is unable to predict the set that his memory addresses will occupy.

• The LLC in Intel processors, as previously described, is divided into slices since the release of the Sandy Bridge architecture. The slice that a memory block occupies is decided by an undocumented hash algorithm. Thus, an attacker willing to fill one of the sets in the LLC might observe how his addresses get distributed across several slices and fail to fill the set.
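For reference, the three-stage round of Section 6.3 can be sketched in C as follows. `ev[]` is an eviction set of addresses mapping to the monitored set; building such a set for the sliced LLC is exactly what the challenges above make difficult, and the hit/miss threshold on the returned time is machine dependent.

```c
#include <stdint.h>
#include <assert.h>
#include <x86intrin.h>  /* __rdtsc, _mm_mfence (x86 GCC/Clang) */

/* One Prime and Probe round over a single monitored set, following the
 * three stages of Section 6.3. Returns the probe time in cycles: a high
 * value means the victim evicted some of our primed lines, i.e., it
 * touched the monitored set. */
static uint64_t prime_probe(volatile uint8_t **ev, int ways,
                            void (*victim)(void))
{
    for (int i = 0; i < ways; i++)    /* priming stage: fill the set */
        (void)*ev[i];
    if (victim)                       /* victim accessing stage */
        victim();
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < ways; i++)    /* probing stage: reload our lines */
        (void)*ev[i];
    _mm_mfence();
    return __rdtsc() - t0;
}
```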
In the following sections we solve the aforementioned challenges to successfully apply the Prime and Probe attack in the LLC.

6.5 Targeting Small Pieces of the LLC

The first obstacle when executing the original Prime and Probe attack in the LLC is the vast number of memory addresses needed to fill it. Instead, we will only target smaller pieces of it that will give us enough information about the secret process being executed by a victim. Recall Algorithm 2 from Section 2.4.2. The modular exponentiation leaked information due to the usage of the multiplication function when a "1" bit was found. Instead of filling the entire cache, we can just perform the Prime and Probe attack in the LLC set where the multiplication function resides. This gives us enough information about when the multiplication function is used, and therefore when the victim is processing a "1" or a "0" bit. We will cover in future sections how to know where this multiplication function resides.

6.6 LLC Set Location Information Enabled by Huge Pages

The second obstacle that the original Prime and Probe attack encounters when targeting the LLC is that the set occupied by the attacker's memory blocks is unknown, due to the lack of control over the physical address bits. The LLC Prime and Probe attack proposed in this work is enabled by making use of huge pages, thereby eliminating a major obstacle that normally restricts the Prime and Probe attack to the L1 cache. As explained in Section 6.1, a user does not use the physical memory directly, but is assigned a virtual memory, so that a translation from virtual to physical memory is performed at the hardware level. The address translation step creates an additional challenge for the attacker, since the real addresses of the variables of the target process are unknown to him.
Figure 6.3: Regular page (4 KB, top) and huge page (2 MB, bottom) virtual to physical address mapping for an Intel x86 processor. For huge pages, all L3 cache sets become transparently accessible even with virtual addressing.

However, this translation is only performed on some of the higher order bits of the virtual address, while a lower portion, named the offset, remains untouched. Since caches are addressed by the physical address, if we have a cache line size of b bytes, the lower log2(b) bits of the address will be used to resolve the corresponding byte in the cache line. Furthermore, if the cache is set-associative and, for instance, divided into n sets, then the next log2(n) bits of the address will select the set that each memory datum is going to occupy in the cache. The log2(b) bits that form the byte address within a cache line are contained within the offset field. However, depending on the cache size, the following field, which contains the set address, may exceed the offset boundary. The offsets allow addressing within a memory page. The OS's Memory Management Unit (MMU) keeps track of which page belongs to which process. The page size can be adjusted to better match the needs of the application. Smaller pages require more time for the MMU to resolve. Here we focus on the default 4 KB page size and the larger page sizes provided under the common name of huge pages. As we shall see, the choice of page size makes a significant difference in the attacker's ability to carry out a successful attack on a particular cache level:

• 4 KB pages: For 4 KB pages, the lower 12-bit offset of the virtual address is not translated, while the remaining bits are forwarded to the Memory Management Unit. In modern processors the memory line size is usually set at 64 bytes. This leaves 6 bits untouched by the Memory Management Unit while translating regular pages.
As shown in the top of Figure 6.3, the page offset is known to the attacker. Therefore, the attacker knows the 6-bit byte address plus 6 additional bits, which can only resolve accesses to small caches (64 sets at most). This is the main reason why techniques such as Prime and Probe have only targeted the L1 cache, since it is the only one permitting the attacker full control of the bits resolving the set. Therefore, the small page size indirectly prevents attacks targeting bigger caches like the L2 and L3.

• Huge pages: The scenario is different if we work with huge size pages. Typical huge page sizes are 1 MB or even greater. This means that the offset field in the page translation process is bigger, with 21 bits or more remaining untouched during page translation. Observe the example presented in Figure 6.3. For instance, assume that our computer has 3 levels of cache, with the last one shared, and that 64, 512 and 2048 are the numbers of sets the L1, L2 and L3 caches are divided into, respectively. The lowest 6 bits of the offset are used for addressing the 64 byte long cache lines. The following 6 bits are used to resolve the set addresses in the L1 cache. For the L2 and L3 caches this field is 9 and 11 bits wide, respectively. In this case, a huge page size of 2 MB (21 bit offset) will give the attacker full control of the set occupied by his data in all three levels of cache, i.e., the L1, L2 and L3 caches. The significance of targeting the last level cache becomes apparent when one considers the access time gap between the last level cache and the memory, which is much more pronounced than the access time difference between the L1 and L2 caches. Therefore, using huge pages makes it possible to mount a higher resolution Prime and Probe style attack.
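The bit arithmetic behind the two cases above can be sketched as follows (page size, b and n are powers of two; the `virtually_addressed` check is the log2(n) + log2(b) ≤ offset-bits condition restated):

```c
#include <stdint.h>
#include <assert.h>

/* Page-offset bits survive virtual-to-physical translation unchanged. */
static inline uint64_t page_offset(uint64_t va, uint64_t page) { return va & (page - 1); }

/* With b-byte lines and n sets: the low log2(b) bits index the byte in
 * the line, the next log2(n) bits select the set. */
static inline uint64_t line_byte(uint64_t pa, uint64_t b)             { return pa & (b - 1); }
static inline uint64_t cache_set(uint64_t pa, uint64_t b, uint64_t n) { return (pa / b) & (n - 1); }

/* The cache is virtually addressed iff log2(n) + log2(b) bits fit in the
 * untranslated page offset, i.e., b * n <= page size. */
static inline int virtually_addressed(uint64_t page, uint64_t b, uint64_t n)
{
    return b * n <= page;
}
```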
6.7 Reverse Engineering the Slice Selection Algorithm

The last inconvenience that we observe when executing Prime and Probe in the LLC is the fact that the LLC is divided into slices, whose assignment is decided by an undocumented hash function. This means that the attacker cannot control which slice is targeted. Although knowing the slice selection algorithm is not crucial to run a Prime and Probe attack (since we could calculate the eviction set for every set that we want to monitor), knowledge of the slice selection algorithm can save significant time, especially when we have to profile a large number of sets. Indeed, in the attack step, we can select a range of sets/slices s1, s2, ..., sn for which, thanks to the knowledge of the slice selection algorithm, we know that the memory blocks in the eviction set will not change.

This section describes the methodology applied to reverse engineer the slice selection algorithm for specific Intel processors. Note that the method can be used to reverse engineer slice selection algorithms for other Intel processors that have not been analyzed in this work. To the best of our knowledge, this slice selection hash function is not public. We solve the issue by:

• Generating data blocks at slice-colliding addresses to fill a specific set. Access timings are used to determine which data is stored in the same slice.

• Using the addresses of data blocks identified to be co-residing in slices to generate a system of equations. The resulting equation systems can then be solved to identify the slice selection algorithm implemented by the processor.

• Building a scalable tool, i.e., proving its applicability for a wide range of different architectures.

6.7.1 Probing the Last Level Cache

As stated in Section 6.2, the shared last level caches in SMP multicore architectures are usually divided into slices, with an unknown hash function determining the slice.
In order to reverse engineer this hash function, we need to recover addresses of data blocks co-residing in a set of a specific slice. The set where a data block is placed can be controlled, even in the presence of virtual addressing, if huge pages are used. Recall that by using 2 MB huge pages we gain control over the least significant 21 bits of the physical address, thereby controlling the set in which our blocks of data will reside. Once we have control over the set a data block is placed in, we can try to detect data blocks co-residing in the same slice. Co-residency can be inferred by distinguishing LLC accesses from memory accesses.

6.7.2 Identifying m Data Blocks Co-Residing in a Slice

We need to identify the m memory blocks that fill each one of the slices for a specific set. Note that we still do not know the memory blocks that collide in a specific set. In order to achieve this goal we perform the following steps:

• Step 1: Access one memory block b0 in a set in the LLC.

• Step 2: Access several additional memory blocks b1, b2, ..., bn that reside in the same set, but may reside in a different slice, in order to fill the slice where b0 resides.

• Step 3: Reload the memory block b0 to check whether it still resides in the last level cache or in memory. A high reload time indicates that the memory block b0 has been evicted from the slice, since Intel utilizes a Pseudo Least Recently Used (PLRU) cache eviction algorithm. Therefore we know that the m memory blocks required to evict b0 from the slice are among the accessed additional memory blocks b1, b2, ..., bn.

Figure 6.4: Generating additional memory blocks until a high reload value is observed, i.e., the monitored block is evicted from the LLC. The experiments were performed on an Intel i5-3320M.

• Step 4: Subtract one of the accessed additional memory blocks bi and repeat the protocol.
If b0 still resides in memory (high reload time), bi does not reside in the same slice. If b0 resides in the cache, it can be concluded that bi resides in the same cache slice as b0.

Steps 2 and 3 can be seen graphically in Figure 6.4, where additional memory blocks are generated until a high reload time is observed, indicating that the monitored block b0 was evicted from the LLC after memory block b26 was accessed. Step 4 is presented graphically in Figure 6.5, where each additional block is checked to see whether it affects the reload time observed in Figure 6.4. If the reload time remains high when one of the blocks bi is no longer accessed, bi does not reside in the same slice as the monitored block b0. In our particular case, we observe that the slice-colliding blocks are b3, b4, b7, b9, b10, b13, b14, b17, b18, b21, b22 and b24.

6.7.3 Generating Equations Mapping the Slices

Once m memory blocks that fill one of the cache slices have been identified, we generate additional blocks that reside in the same slice to be able to generate more equations. The approach is similar to the previous one:

• Access the m memory blocks b0, b1, ..., bm that fill one slice in a set in the LLC.

Figure 6.5: Subtracting memory blocks to identify the m blocks mapping to one slice on an Intel i5-3320M. Low reload values indicate that the line occupies the same slice as the monitored data.

• Access, one at a time, additional memory blocks that reside in the same set, but may reside in a different slice.

• Reload the memory block b0 to check whether it still resides in the LLC or in memory. Again, due to the PLRU algorithm, a high reload time indicates that b0 has been evicted from the slice. Hence, the additional memory block also resides in the same cache slice.
• Once a sufficiently large group of memory blocks that occupy the same LLC slice has been identified, we obtain their physical addresses to construct a matrix Pi of equations, where each row is one of the physical addresses mapping to the monitored slice.

The equation generation stage can be observed in Figure 6.6, where 4000 additional memory blocks occupying the same set were generated. Knowing the m blocks that fill one slice, accessing an additional memory block will output a higher reload value if it resides in the same slice as b0 (since it evicts b0 from the cache).

Handling Noise: We choose a detection threshold in such a way that we most likely only deal with false negatives, which do not affect the correctness of the solutions of the equation system. As can be observed in Figure 6.6, there are still a few values that are not clearly identified (i.e., those with reload values of 10-11 cycles). By simply not considering these, false positives are avoided and the resulting equation system remains correct.

Figure 6.6: Generating equations mapping one slice for 4000 memory blocks on an Intel i5-3320M. High reload values indicate that the line occupies the same slice as the monitored data.

6.7.4 Recovering Linear Hash Functions

The mapping of a memory block to a specific slice in the LLC is based on its physical address. A hash function H(p) takes the physical address p as input and returns the slice the address is mapped to. We know that H maps all possible p to s outputs, where s is the number of slices for the processor:

H : {0,1}^⌈log2 p⌉ → {0,1}^⌈log2 s⌉

The labeling of these outputs is arbitrary. However, each output should occur with roughly equal likelihood, so that accesses are balanced over the slices. We model H as a function of the address bits. In fact, as we will see, the observed hash functions are linear in the address bits pi.
In such a case we can model H as a concatenation of linear Boolean functions H(p) = H0(p) ∥ ... ∥ H_⌈log2 s⌉−1(p), where ∥ denotes concatenation. Then, Hi is given as

Hi(p) = hi,0 p0 + hi,1 p1 + ... + hi,l pl = Σ_{j=0}^{l} hi,j pj.

Here, hi,j ∈ {0,1} is a coefficient and pj is a physical address bit. The steps in the previous subsections provide addresses p mapped to a specific slice, which are combined in a matrix Pi, where each row is a physical address p. The goal is to recover the functions Hi, given as the coefficients hi,j. In general, for linear systems, the Hi can be determined by solving the equations

Pi · Ĥi = 0̂,  (6.1)
Pi · Ĥi = 1̂,  (6.2)

where Ĥi = (hi,0, hi,1, ..., hi,l)^T is a vector containing all coefficients of Hi. The right hand side is the ith bit of the representation of the respective slice, where 0̂ and 1̂ are the all-zeros and all-ones vectors, respectively. Note that finding a solution to Equation (6.1) is equivalent to finding the kernel (null space) of the matrix Pi. Also note that any linear combination of the vectors in the kernel is also a solution to Equation (6.1), and the sum of a particular solution to Equation (6.2) and any vector in the kernel is also a particular solution to Equation (6.2). In general:

ĥ ∈ ker Pi ⟺ Pi · ĥ = 0̂
∀ ĥ1, ĥ2 ∈ ker Pi : ĥ1 + ĥ2 ∈ ker Pi
∀ ĥ1, ĥ2 with Pi · ĥ1 = 1̂, ĥ2 ∈ ker Pi : Pi · (ĥ1 + ĥ2) = 1̂

Recall that each equation system should map to x = ⌈log2(s)⌉ bit selection functions Hi. Also note that we cannot infer the labeling of the slices, although the equation system mapping to slice 0 will never output a solution to Equation (6.2). This means that there is more than one possible solution, all of them valid, if the number of slices is greater than 2. In this case, (2^x − 1 choose x) solutions will satisfy Equation (6.1). However, any combination of x solutions is valid, differing only in the label referring to each slice. We have only considered linear systems in our explanation.
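Solving Equation (6.1) is a small exercise in Boolean linear algebra. The sketch below (pure Python, not the authors' tool) computes a kernel basis over GF(2) and, in a hypothetical 3-bit example, recovers the coefficients of a toy hash h(p) = p0 ⊕ p2 from addresses assumed to map to slice 0.

```python
def gf2_kernel(P, ncols):
    """Basis of {h : P·h = 0 over GF(2)}; rows of P are bit-lists."""
    rows = [sum(bit << j for j, bit in enumerate(r)) for r in P]
    pivots = {}  # pivot column -> fully reduced row (as an int bitmask)
    for r in rows:
        for c, pr in pivots.items():          # reduce by existing pivots
            if (r >> c) & 1:
                r ^= pr
        if r:
            c = (r & -r).bit_length() - 1     # lowest set bit = new pivot
            for c2 in pivots:                 # keep all rows mutually reduced
                if (pivots[c2] >> c) & 1:
                    pivots[c2] ^= r
            pivots[c] = r
    basis = []
    for free in range(ncols):                 # one kernel vector per free column
        if free in pivots:
            continue
        v = 1 << free
        for c, pr in pivots.items():          # back-substitute pivot equations
            if (pr >> free) & 1:
                v |= 1 << c
        basis.append([(v >> j) & 1 for j in range(ncols)])
    return basis

# Toy example: 3-bit "addresses" all satisfying p0 ⊕ p2 = 0
P0 = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
assert gf2_kernel(P0, 3) == [[1, 0, 1]]   # recovered coefficients: p0 ⊕ p2
```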
If the system is non-linear (i.e., the number of slices is not a power of two), it needs to be re-linearized. This can be done by expanding the matrix Pi with the non-linear terms, Pi = [Pi_linear | Pi_nonlinear], and solving Equations (6.1) and (6.2). Note that the higher the degree of the non-linear terms, the more equations are required to solve the system. For that reason, later in Section 6.7.7 we present an example of an alternative (more intuitive) approach that can be taken to recover non-linear slice selection algorithms.

This tool can also be useful in cases where the user cannot determine the slice selection algorithm, e.g., due to too few derived equations. Indeed, the first step that this tool implements is generating the memory blocks that co-reside in each of the slices. This information can already be used to mount a side channel attack.

6.7.5 Experiment Setup for Linear Hash Functions

In this section we describe our experiment setup. In order to test the applicability of our tool, we implemented our slice selection algorithm recovery method on a wide range of different computer architectures. The architectures on which our tool was tested are listed in Table 6.1, together with the relevant parameters.

Table 6.1: Comparison of the profiled architectures

Processor | Architecture | LLC size | Associativity | Slices | Sets/slice
Intel i5-650 [i65] | Nehalem | 4 MB | 16 | 2 | 2048
Intel i5-3320M [inta] | Ivy Bridge | 3 MB | 12 | 2 | 2048
Intel i5-4200M [iM] | Haswell | 3 MB | 12 | 2 | 2048
Intel i7-4702M [i74] | Haswell | 6 MB | 12 | 4 | 2048
Intel Xeon E5-2609 v2 [intb] | Ivy Bridge | 10 MB | 20 | 4 | 2048
Intel Xeon E5-2640 v3 [intc] | Haswell | 20 MB | 20 | 8 | 2048

Our experiments cover a wide range of linear (power of two) slice counts as well as different architectures. All architectures except the Intel Xeon E5-2640 v3 were running Ubuntu 12.04 LTS as the operating system, whereas the last one used Ubuntu 14.04.
Ubuntu, in root mode, allows the usage of huge pages. The huge page size on all the processors is 2 MB [Lin]. We also use a tool to obtain the physical addresses of the variables used in our code by looking at the /proc/PID/pagemap file. In order to obtain the slice selection algorithm, we profiled a single set, i.e., set 0. However, on all the architectures profiled in this paper, we verified that the set did not affect the slice selection algorithm. This might not be true for all Intel processors.

The experiments cover a wide selection of architectures, ranging from Nehalem (released in 2008) to Haswell (released in 2013). The processors include laptop CPUs (i5-3320M, i5-4200M and i7-4702M), a desktop CPU (i5-650) and server CPUs (Xeon E5-2609 v2, Xeon E5-2640 v3), demonstrating the viability of our tool in a wide range of scenarios. As can be seen, in the entire set of processors that we analyzed, each slice gets 2048 sets in the L3 cache. Apparently, Intel designs all of its processors in such a way that every slice gets all 2048 sets of the LLC. Indeed, this is not surprising, since it allows the use of cache addressing mechanisms independent of the size or the number of cores present in the cache. This also means that 17 bits of the physical address are required to select the set in the last level cache, well within the 21 bits of freedom that we obtain with huge pages.

6.7.6 Results for Linear Hash Functions

Table 6.2 summarizes the slice selection algorithm for all the processors analyzed in this work.

The Intel i5-650 is the oldest processor that was analyzed. Indeed, the slice selection algorithm that it implements is much simpler than the rest, involving only a single bit to decide between the two slices. This bit is the 17th bit, i.e., the bit immediately following the last set-selection bit.
Table 6.2: Slice selection hash functions for the profiled architectures

Processor | Architecture | Solutions | Slice selection algorithm
Intel i7-2600 [HWH13] | Sandy Bridge | – | p18⊕p19⊕p21⊕p23⊕p25⊕p27⊕p29⊕p30⊕p31; p17⊕p19⊕p20⊕p21⊕p22⊕p23⊕p24⊕p26⊕p28⊕p29⊕p31
Intel i5-650 | Nehalem | 1 | p17
Intel i5-3320M | Ivy Bridge | 1 | p17⊕p18⊕p20⊕p22⊕p24⊕p25⊕p26⊕p27⊕p28⊕p30⊕p32
Intel i5-4200M | Haswell | 1 | p17⊕p18⊕p20⊕p22⊕p24⊕p25⊕p26⊕p27⊕p28⊕p30⊕p32
Intel i7-4702M | Haswell | 3 | p17⊕p18⊕p20⊕p22⊕p24⊕p25⊕p26⊕p27⊕p28⊕p30⊕p32; p18⊕p19⊕p21⊕p23⊕p25⊕p27⊕p29⊕p30⊕p31⊕p32
Intel Xeon E5-2609 v2 | Ivy Bridge | 3 | p17⊕p18⊕p20⊕p22⊕p24⊕p25⊕p26⊕p27⊕p28⊕p30⊕p32; p18⊕p19⊕p21⊕p23⊕p25⊕p27⊕p29⊕p30⊕p31⊕p32
Intel Xeon E5-2640 v3 | Haswell | 35 | p17⊕p18⊕p20⊕p22⊕p24⊕p25⊕p26⊕p27⊕p28⊕p30⊕p32; p19⊕p22⊕p23⊕p26⊕p27⊕p30⊕p31; p17⊕p20⊕p21⊕p24⊕p27⊕p28⊕p29⊕p30

This means that an attacker using Prime and Probe techniques can fully control both slices, since all the bits are under his control.

It can be seen that the Intel i5-3320M and Intel i5-4200M processors implement a much more complicated slice selection algorithm than the previous one, evaluating many bits of the address in the hash function. Since both processors have 2 slices, our method outputs a single solution, i.e., a single vector in the kernel of the system of equations mapping to the zero slice. Note that, even though both processors have different microarchitectures and are from different generations, they implement the same slice selection algorithm.

We next focus on the 4-slice processors analyzed, i.e., the Intel i7-4702M and the Intel Xeon E5-2609 v2. Again, many upper bits are used by the hash function to select the slice. We obtain 3 kernel vectors for the system of equations mapping to the zero slice (two vectors and their linear combination). From the three solutions, any combination of two (one for h0 and one for h1) is a valid solution, i.e., the 4 different slices are represented.
However, the labeling of the slices is not known. Therefore, choosing a different solution combination will only affect the labeling of the non-zero slices, which is not important for the scope of the tool. It can also be observed that, even when we compare a high end server (Xeon E5-2609 v2) and a laptop (i7-4702M) with different architectures, the slice selection algorithm implemented by both of them is the same. Further note that one of the functions is equal to the one discussed for the two-slice architectures. We can therefore say that the hash function that selects the slice in the newer architectures depends only on the number of slices.

Finally, we focus on the Intel Xeon E5-2640 v3, which divides the last level cache into 8 slices. Note that this is a recent high end server, which might commonly be found in public cloud services. In this case, since 8 slices have to be addressed, we need 3 hash functions to map them. The procedure is the same as for the previous processors: we first identify the set of equations that maps to slice 0 (recall, this system never yields a solution to Equation (6.2)) by finding its kernel. The kernel gives us 3 possible vectors plus all their linear combinations. As before, any solution that takes a set of 3 vectors will be a valid solution for the equation system, differing only in the labeling of the slices. Note also that some of the solutions involve only a few bits of the physical address, making them suitable for side channel attacks.

In summary, we were able to obtain the slice selection hash functions in all cases. Our results show that the slice selection algorithm was simpler in the Nehalem architecture, while newer architectures like Ivy Bridge or Haswell use several bits of the physical address to select the slice. We also observed that the slice selection algorithm mostly depends on the number of slices present, regardless of the type of CPU analyzed (laptop or high end server).
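Once the coefficients are known, deciding which slice a physical address maps to is a parity computation over the involved bits. A minimal sketch (Python), using the function reported in Table 6.2 for the i5-3320M/i5-4200M as the example:

```python
# Bits involved in the i5-3320M / i5-4200M slice hash (Table 6.2)
BITS = [17, 18, 20, 22, 24, 25, 26, 27, 28, 30, 32]

def slice_bit(paddr):
    """Slice-selection bit: XOR (parity) of the involved address bits."""
    x = 0
    for b in BITS:
        x ^= (paddr >> b) & 1
    return x

assert slice_bit(0) == 0
assert slice_bit(1 << 17) == 1                 # one involved bit set
assert slice_bit((1 << 17) | (1 << 18)) == 0   # two involved bits cancel
assert slice_bit(1 << 16) == 0                 # set-field bits are ignored
```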
6.7.7 Obtaining Non-linear Slice Selection Algorithms

It is important to note that Prime and Probe attacks become much simpler when architectures with linear slice selection algorithms are targeted, because the memory blocks that form an eviction set do not change across set values. This means that we can calculate the eviction set for, e.g., set 0 and the memory blocks will be the same if we profile a different set s. As we will see in this section, this is not true for non-linear slice selection algorithms, where the profiled set also affects the slice selected.

As an example we utilize the Intel Xeon E5-2670 v2, the CPU behind the most widely used EC2 instance type, which has a 25 MB LLC distributed over 10 slices. By performing some small tests we can clearly observe that the set field affects the slice selection algorithm implemented by this processor. Indeed, it is also clearly observable that the implemented hash function is a non-linear function of the address bits, since the 16 memory blocks mapped to the same set within a huge memory page cannot be evenly distributed over 10 slices. Thus we describe the slice selection algorithm as

H(p) = h3(p) ∥ h2(p) ∥ h1(p) ∥ h0(p)  (6.3)

where H(p) is a concatenation of 4 different functions corresponding to the 4 bits necessary to represent 10 slices. Note that H(p) will output results from 0000 to 1001 if we label the slices 0-9. Thus, a non-linear function is needed that excludes outputs 10-15. Further note that p is the physical address and will be represented as a bit string: p = p0 p1 ... p35.

In order to recover the non-linear hash function implemented by the Intel Xeon E5-2670 v2, we performed experiments on a fully controlled machine featuring this processor. We first generate ten equation systems based on addresses colliding in the same slice, applying the same methodology explained in Section 6.7.4 and generating up to 100,000 additional memory blocks.
We repeat the same process 10 times, changing the primed memory block b0 each time to target a different slice. This outputs 10 different systems of addresses, each one referring to a different slice.

Figure 6.7: Number of addresses that each slice takes out of 100,000. The non-linear slices take fewer addresses than the linear ones.

The first important observation we made on the 10 different systems is that 8 of them behave differently from the remaining 2. In 8 of the recovered address systems, if 2 memory blocks in the same huge memory page collide in the same slice, they differ only in the 17th bit. This is not true for the remaining two address systems. We suspect, at this point, that the 2 systems behaving differently are the 8th and 9th slices. We will refer to these two slices as the non-linear slices.

Up to this point, one could solve the non-linear function after a re-linearization step, given sufficiently many equations. However, one may not be able to recover enough addresses. Recall that the higher the degree of the non-linear term, the more equations are needed. In order to keep our analysis simpler, we decided to take a different approach.

The second important observation we made is on the distribution of the addresses over the 10 slices. It turns out that the last two slices are mapped to by a lower number of addresses than the remaining 8 slices. Figure 6.7 shows the distribution of the 100,000 addresses over the 10 slices. The different distributions seen for the last two slices give us evidence that a non-linear slice selection function is implemented in the processor. Furthermore, it can be observed that the linear slices are mapped to by 81.25% of the addresses, while the non-linear slices get only about 18.75%, i.e., a proportion of 3/16. We will make use of this uneven distribution later.
We proceed to solve the first 8 slices and the last 2 slices separately using linear functions. For each we try to find solutions to Equations (6.1) and (6.2). This outputs two sets of linear solutions, one for the first 8 linear slices and one for the last 2 slices. Given that we can model the slice selection functions separately using linear functions, and given that the distribution is non-uniform, we suspect that the hash function is implemented in two levels. In the first level, a non-linear function chooses between either the 3 linear functions describing the 8 linear slices or the linear functions describing the 2 non-linear slices.

Table 6.3: Hash selection algorithm implemented by the Intel Xeon E5-2670 v2

Vector | Hash function H(p) = h0(p) ∥ ¬(nl(p))·h′1(p) ∥ ¬(nl(p))·h′2(p) ∥ nl(p)
h0 | p18⊕p19⊕p20⊕p22⊕p24⊕p25⊕p30⊕p32⊕p33⊕p34
h′1 | p18⊕p21⊕p22⊕p23⊕p24⊕p26⊕p30⊕p31⊕p32
h′2 | p19⊕p22⊕p23⊕p26⊕p28⊕p30
v0 | p9⊕p14⊕p15⊕p19⊕p21⊕p24⊕p25⊕p26⊕p27⊕p29⊕p32⊕p34
v1 | p7⊕p12⊕p13⊕p17⊕p19⊕p22⊕p23⊕p24⊕p25⊕p27⊕p31⊕p32⊕p33
v2 | p9⊕p11⊕p14⊕p15⊕p16⊕p17⊕p19⊕p23⊕p24⊕p25⊕p28⊕p31⊕p33⊕p34
v3 | p7⊕p10⊕p12⊕p13⊕p15⊕p16⊕p17⊕p19⊕p20⊕p23⊕p24⊕p26⊕p28⊕p30⊕p31⊕p32⊕p33⊕p34
nl | v0·v1·¬(v2·v3)

Therefore, we speculate that the 4 bits selecting the slice look like:

h0(p) = h0(p)
h1(p) = ¬(nl(p))·h′1(p)
h2(p) = ¬(nl(p))·h′2(p)
h3(p) = nl(p)

where h0, h1 and h2 are the hash functions selecting bits 0, 1 and 2 respectively, h3 is the function selecting the 3rd bit and nl is a non-linear function of unknown degree. We recall that the proportion of occurrence of the last two slices is 3/16. To obtain this distribution we need a degree 4 non-linear function in which two inputs are negated, i.e.:

nl = v0 · v1 · ¬(v2 · v3)  (6.4)

where nl is 0 for the 8 linear slices and 1 for the 2 non-linear slices. Observe that nl will be 1 with probability 3/16 and 0 with probability 13/16, matching the distributions seen in our experiments.
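The 3/16 figure can be verified by brute force over the four bits (a quick consistency check, treating v0..v3 as independent uniform bits):

```python
from itertools import product

# nl = v0 · v1 · ¬(v2 · v3): count the inputs for which nl = 1
hits = sum(v0 & v1 & (1 - (v2 & v3))
           for v0, v1, v2, v3 in product((0, 1), repeat=4))
assert hits == 3   # nl = 1 for 3 of the 16 inputs, i.e. probability 3/16
```

Only the v0 = v1 = 1 quadrant can fire (4 of 16 inputs), and the v2 = v3 = 1 case is excluded, leaving 3.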
Consequently, to find v0 and v1 we only have to solve Equation (6.2) for slices 8 and 9 together, requiring an output of 1. To find v2 and v3, we first separate those addresses of the linear slices 0-7 for which v0 and v1 output 1. For those cases, we solve Equation (6.2) for slices 0-7.

The result is summarized in Table 6.3, which shows both the non-linear function vectors v0, v1, v2, v3 and the linear functions h0, h′1, h′2. These results describe the behavior of the slice selection algorithm implemented in the Intel Xeon E5-2670 v2. It can be observed that the bits involved in the set selection (bits 6 to 16 for the LLC) are also involved in the slice selection process, unlike with linear selection algorithms. This means that for different sets, different memory blocks will map to the same slice.

Note that the method applied here can be used to reverse engineer other machines that use different non-linear slice selection algorithms. By looking at the distribution of the memory blocks over all the slices, we can always get the shape of the non-linear part of the slice selection algorithm. The rest of the steps are generic, and can even be applied to linear slice selection algorithms.

Figure 6.8: Histograms of 10,000 access times in the probe stage when all the lines are in the L3 cache and when all except one are in the cache (the remaining one in memory).

6.8 The LLC Prime and Probe Attack Procedure

With the previously discussed challenges solved, we can proceed to apply our LLC Prime and Probe attack. Our LLC Prime and Probe technique takes advantage of the control of the lower k bits of the virtual address that we gain with huge pages, and of the knowledge of the slice selection algorithm.
These are the main steps that our spy process follows to detect accesses to the last level cache:

• Step 1: Allocate huge pages if available: The spy process is based on the control that the attacker gains over the virtual address when using huge pages. Therefore the spy process has to have access to the available huge pages, which requires administrator rights. Recall that this is not a problem in the cloud scenario, where the attacker has administrator privileges in his guest OS. If huge pages are not available, an attacker can start from Step 2 at the cost of a more time-consuming eviction set discovery.

• Step 2: Find eviction sets: Due to the slice selection algorithm, an attacker first has to deduce which of his memory blocks collide in the same set-slice. To do this, he can be aided by the slice selection algorithm knowledge acquired previously in Section 6.7.4, speeding up the attack process. If the slice selection algorithm is not known, an attacker can also create eviction sets "on the fly", again at the cost of a more challenging and complicated process.

• Step 3: Prime the desired set-slice in the last level cache: In this step the attacker creates data that will occupy one of the set-slices in the last level cache. By controlling the virtual address (and, if known, the slice selection algorithm), the attacker knows the set-slice that the created data will occupy in the last level cache. Once sufficiently many blocks have been created to occupy the set and slice, the attacker primes the set-slice and ensures it is filled. Typically the last level caches are inclusive; thus we not only fill the shared last level cache set but also the corresponding sets in the upper level caches.

• Step 4: Victim process runs: After the priming stage, the victim runs the target process. Since one of the set-slices in the last level cache is already filled, if the targeted process uses the monitored set-slice, one of the primed blocks is going to be evicted.
Remember that we are priming the last level cache, so evicted memory blocks will reside in memory. If the monitored set-slice is not used, all the primed blocks will still reside in the cache hierarchy after the victim's process execution.

• Step 5: Probe and measure: Once the victim's process has finished, the spy process probes the primed memory blocks and measures the time to probe them all. If one or more blocks have been evicted by the targeted process, they will be loaded from memory and we will see a higher probe time. However, if all the blocks still reside in the cache, we will see a shorter probe time.

The last step can be made more concrete with the experiment results summarized in Figure 6.8. The experiment was performed in native execution (no VM) on an Intel i5-650, which has a 16-way associative last level cache. It can be seen that when all the blocks reside in the last level cache we obtain very precise probe timings, with an average around 250 cycles and very little variance. However, when one of the blocks is evicted from the last level cache and resides in memory, both the access time and the variance are higher. We conclude that the two types of accesses are clearly distinguishable.

6.8.1 Prime and Probe Applied to AES

In this section we explain how the Prime and Probe spy process can be applied to attack AES. Again, we use the C reference implementation of the OpenSSL 1.0.1f library, which uses 4 different T-tables during the AES execution. Recovering one round key is sufficient for AES-128, as the key scheduling is invertible. We use the last round as our targeted round for convenience. Since the 10th round does not implement the MixColumns operation, the ciphertext directly depends on the T-table position accessed and the last round key. Let Si be the value of the ith byte prior to the last round T-table look-up operation.
Then the ciphertext byte Ci will be:

Ci = Tj[Si] ⊕ K10_i  (6.5)

where Tj is the corresponding T-table applied to the ith byte and K10_i is the ith byte of the last round key. It can be observed that if the ciphertext and the T-table positions are known, we can guess the key by a simple XOR operation. We assume the ciphertext to always be known by the attacker. Therefore the attacker will use the Prime and Probe spy process to guess the T-table position used in the encryption and, consequently, obtain the key. Thus, if the attacker knows which set each of the T-table memory lines occupies, Prime and Probe will detect that the set is not accessed 8% of the time, and once he has obtained enough measurements, the key can be recovered from Equation (6.5).

Locating the Set of the T-Tables: The previous description implicitly assumes that the attacker knows the location, i.e. the sets, that each T-table occupies in the shared cache. A simple approach to gain this knowledge is to prime and probe every set in the cache, and analyze the timing behavior for a few random AES encryptions. The T-table based AES implementation leaves a distinctive fingerprint on the cache, as the T-table size as well as the access frequency (92% per line per execution) are known. Once the T-tables are detected, the attack can be performed on a single line per table. Nevertheless, this locating process can take a significant amount of time when the number of sets in the outermost shared cache is sufficiently high. An alternative, more efficient approach is to take advantage of the shared library page alignment that some OSs like Linux implement. Assuming that the victim is not using huge pages for the encryption process, the shared library is aligned at a 4 KB page boundary. This gives us some information to narrow down the search space, since the lower 12 bits of the virtual address will not be translated.
Thus, we know the offset fi modulo 64 of each T-table memory line, and the T-table location search space is reduced by a factor of 64. Furthermore, we only have to locate one T-table memory line per memory page, since the rest of the table occupies the consecutive sets in the last level cache.

Attack stages: Putting it all together, these are the main stages that we follow to attack AES with Prime and Probe:

• Step 1: Last level cache profile stage: The first stage of the attack is to gain knowledge about the structure of the last level cache, the number of slices, and the lines that fill one of the sets in the last level cache.

• Step 2: T-table set location stage: The attacker has to know which sets in the last level cache the T-tables occupy, since these are the sets that need to be primed to obtain the key.

• Step 3: Measurement stage: The attacker primes the desired set-slices, requests encryptions and probes again to check whether the monitored sets have been used or not.

• Step 4: Key recovery stage: Finally, the attacker uses the measurements taken in Step 3 to derive the last round key used by the AES server.

6.8.2 Experiment Setup and Results for the AES Attack

In this section we describe our experiment setup and the results obtained in native machine, single VM and cross-VM scenarios, specifically with deduplication-free hypervisors where Flush and Reload and Invalidate and Transfer could not succeed.

6.8.2.1 Testbed Setup

The machine used for all our experiments is a dual core Nehalem Intel i5-650 [inta] clocked at 3.2 GHz. This machine works with 64-byte cache lines and has private 8-way associative L1 and L2 caches of sizes 2^15 and 2^18 bytes, respectively. In contrast, the 16-way associative L3 cache is shared among all the cores and has a size of 2^22 bytes, divided into two slices. Consequently, the L3 cache has 2^12 sets in total.
Therefore 6 bits are needed to address the byte within a cache line and 12 more bits to specify the set in the L3 cache. The huge page size is set to 2 MB, which ensures a set field length of 21 bits that are untouched in the virtual to physical address translation stage. All the guest OSs use Ubuntu 12.04, while the VMMs used in our cloud experiments are Xen 4.1, fully virtualized, and VMware ESXI 5.5. Both allow the usage of huge size pages by guest OSs [BDF+03, Xenb, xena]. Neither implements memory deduplication mechanisms: Xen because it lacks such a feature, VMware because we disabled it manually. Recall that, under these settings, Flush and Reload and Invalidate and Transfer would not be able to recover any meaningful information. We will observe that this is not the case for Prime and Probe.

The target process uses the C reference implementation of OpenSSL 1.0.1f, which is the default if the library is configured with the no-asm and no-hw options. We would like to remark that these are not the default OpenSSL installation options in most products. The attack scenario is the same as in Section 4.4, where one process/VM handles encryption requests with a secret key. The attacker's process/VM is co-located with the encryption server, but on a different core. We assume synchronization with the server, i.e., the attacker starts the Prime and Probe spy process and then sends random plaintexts to the encryption server. The communication between the encryption server and the attacker is carried out via socket connections. Upon reception of the ciphertext, the attacker measures the L3 cache usage with the Prime and Probe spy process. All measurements are taken by the attacker's process/VM with the rdtscp instruction, which not only reads the time stamp counter but also ensures that all previous instructions have finished before its execution [rdt].
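The address decomposition just described can be sketched as follows; the bit positions assume the 64-byte-line, 2^12-set L3 of the i5-650 described above:

```python
LINE_BITS, SET_BITS = 6, 12          # 64-byte lines, 2^12 sets (across both slices)

def line_offset(addr: int) -> int:
    """Low 6 bits: byte offset within a 64-byte cache line."""
    return addr & ((1 << LINE_BITS) - 1)

def set_index(addr: int) -> int:
    """Next 12 bits (bits 6..17): which of the 2^12 L3 sets the line maps to."""
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

# With 2 MB huge pages the low 21 bits survive virtual-to-physical
# translation, and offset + set fields need only 6 + 12 = 18 bits,
# so the set index is computable directly from a virtual address:
assert LINE_BITS + SET_BITS <= 21
assert set_index(0x40) == 1               # second cache line -> set 1
assert set_index((1 << 18) + 0x40) == 1   # bit 18 lies beyond the set field
```

Without huge pages only the low 12 bits are known, which is exactly why the shared library page alignment argument above reduces the search space by 64 rather than eliminating it.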
Figure 6.9: Histograms of 500 access times monitored in the probe stage for a) a set used by a T-table memory line and b) a set not used by a T-table memory line. Measurements are taken in the Xen 4.1 cross-VM scenario.

6.8.2.2 The Cross-Core Cross-VM Attack

We perform the attack in three different scenarios: native machine, single VM, and cross-VM. In the native and single VM scenarios, we assume that huge size pages can be used by any non-root process running in the OS. Recall that in the cross-VM scenario, the attacker has administrator rights in his own OS.

The first step is to recognize the access pattern of the L3 cache in our Intel i5-650. Making use of the knowledge of the slice selection algorithm in Section 6.7.4, we observe that odd blocks (17th bit of the physical address equal to 1) and even blocks (17th bit of the physical address equal to 0) are allocated in different slices. Thus we need 16 odd blocks to fill a set in the odd slice, whereas we need 16 even blocks to fill a specific set in the even slice.

The second step is to recognize the set that each T-table cache line occupies in the L3 cache. For that purpose we monitor each of the possible sets according to the offset obtained from the Linux shared library alignment feature. Recall that if the offset modulo 64, f_0, of one of the T-tables is known, we only need to check the sets that are 64 positions apart, starting from f_0. When random plaintexts are sent, the set holding a T-table cache line is used around 90% of the time, while around 10% of the time the set remains unused.
The difference between a set allocating a T-table cache line and a set not allocating one can be seen graphically in Figure 6.9, where 500 random encryptions were monitored with Prime and Probe for both cases in a cross-VM scenario in Xen 4.1. It can be observed that monitoring an unused set results in more stable timings in the range of 200-300 cycles, whereas monitoring a set used by the T-tables outputs higher time values around 90% of the time, while we still see some lower time values below 300 around 10% of the time. Note that the key used by the AES server is irrelevant in this step, since the sets used by the T-table cache lines are independent of the key.

Figure 6.10: Miss counter values for ciphertext 0 normalized to the maximum value. The key is 0xe1 and we are monitoring the last 8 values of the T-table (since the table starts in the middle of a memory line).

The last step is to run Prime and Probe to recover the AES key used by the AES server. We consider as valid ciphertexts for the key recovery step those whose miss counters are below half the average of the overall timings. This threshold is based on the empirical results shown in Figure 6.10, and is calculated as the overall counter average divided by 2. The figure presents the miss counter value for all possible values of the ciphertext byte C_0, when the last line in the corresponding T-table is monitored. The key in this case is 0xe1 and the measurements are taken in a cross-VM scenario in Xen 4.1. In this case only 8 values take low miss counter values because the T-table finishes in the middle of a cache line. These values are clearly distinguishable from the rest and appear on opposite sides of the empirical threshold.

Results for the three scenarios are presented in Figure 6.11, where it can be observed that the noisier the scenario is, e.g.
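The thresholding step can be simulated in a few lines. The counter magnitudes and the monitored table byte values below are invented for illustration, but the classification rule (overall counter average divided by 2) is the one described above:

```python
import random

def below_threshold_values(miss_counter):
    """Return the ciphertext values whose miss counter falls below the
    empirical threshold of Figure 6.10: overall average divided by 2."""
    thr = sum(miss_counter) / len(miss_counter) / 2
    return {c for c, n in enumerate(miss_counter) if n < thr}

# Simulated counters: 8 ciphertext values map to the monitored (half)
# cache line and rarely miss; the other 248 miss almost every encryption.
# Both the key 0xE1 and the byte range 0xF8..0xFF are illustrative.
key = 0xE1
line_vals = range(0xF8, 0x100)
low = {t ^ key for t in line_vals}
counters = [random.randint(0, 40) if c in low else random.randint(360, 400)
            for c in range(256)]
assert below_threshold_values(counters) == low
```

Each recovered low-counter ciphertext value then yields a key-byte candidate via the XOR of Equation 6.5, and the candidates agree on the correct key byte.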
in the cross-VM scenario, the more monitored encryptions are needed to recover the key. The plot shows the number of correctly guessed key bytes vs. the number of encryptions needed. Recall that the maximum number of correctly guessed key bytes is 16 for AES-128. The attack needs only 150,000 encryptions to recover the full AES key in the native OS scenario. Due to the higher noise in the cloud setting, the single VM scenario recovers the full key with 250,000 encryptions.

Figure 6.11: Number of key bytes correctly recovered vs. number of encryptions needed for the native OS, single VM, and cross-VM scenarios.

The cross-VM scenario was analyzed in two popular hypervisors, Xen and VMware, requiring 650,000 and 500,000 encryptions respectively to recover the 16 key bytes. We believe that Xen requires a higher number of encryptions due to the higher noise caused by the usage of a fully virtualized hypervisor. It is important to remark that the attack is completed in only 9 and 35 seconds, respectively, for the native and single VM scenarios. In the cross-VM scenario, the attack succeeds in recovering the full key in 90 and 150 seconds in VMware and Xen, respectively. Recall that in the cross-VM scenario the external IP communication adds significant latency.

In short, compared to the Flush and Reload attack presented earlier, the Prime and Probe attack needs more encryptions to succeed in recovering the key. This is expected behavior, as the Prime and Probe attack suffers more from noise. Indeed, Flush and Reload needs w accesses from an unrelated process to create a noisy measurement, w being the associativity of the cache.
However, as the Prime and Probe attack fills the whole set, a single access from an unrelated process to the set being primed creates a noisy measurement. Nevertheless, Prime and Probe succeeds in recovering the key in hypervisors where Flush and Reload and Invalidate and Transfer cannot, including commercial clouds, which do not have memory deduplication enabled. We present a full real-world attack using Prime and Probe in the next section.

6.9 Recovering RSA Keys in Amazon EC2

As we said, the Prime and Probe attack does not assume any special requirements other than that attacker and victim are co-resident on the same CPU socket. Thus, there is no reason why it cannot succeed in commercial clouds, like Amazon EC2. Amazon EC2 mostly uses Intel-based servers, for which we know that the LLC is inclusive. More than that, Intel's market share in desktop and server processors was more than 80% at the beginning of 2016 [Alp09], and since Intel does not seem to offer non-inclusive caches in these devices (at least we have not observed any), our attack would work on a large fraction of current computers. In the cloud scenario, of course, co-residency with the target first has to be achieved. We assume co-residency is achieved (in fact, with the methodologies described in [VZRS15, IGES16]), and that the attacker utilizes the same hardware resources as an RSA decryption server.

To prove the viability of the Prime and Probe attack in Amazon EC2 across co-located VMs, we show how it can be utilized to steal RSA cryptographic keys. It is important to remark that the attack is not processor specific, and can be implemented on any processor with inclusive last level caches. The attack targets a sliding window implementation of RSA-2048. Note that attacking such an implementation is far more difficult than the one described in Section 5.4, since the multiplication function by itself does not give us enough information about the key.
In this case, contrary to the case of AES and as explained in Section 2.4.2, we put our focus on the multiplicands, which are dynamically allocated. Thus Flush and Reload, even on deduplication-enabled systems, would not be able to recover such information. We will use Libgcrypt 1.6.2 as our target library, which not only uses a sliding window implementation but also uses CRT and message blinding techniques. The message blinding process is performed as a side channel countermeasure against chosen-ciphertext attacks, in response to studies such as [GST14, GPPT15]. However, note that this does not prevent our attack, as we only focus on the key processing without requiring any particular ciphertext shape. Further, the CRT implementation requires us to recover d_p and d_q separately, but once the knowledge of both is acquired, the full key can be retrieved [HDWH12, Ham13].

The modular exponentiation in Libgcrypt uses the sliding window approach shown in Algorithm 8. In particular, Libgcrypt pre-computes the values c^3, c^5, c^7, ..., c^(2^W − 1) in a table. Then, the key is processed in windows that are required to start and finish with a set bit, and have a maximum length of W, W being the window size. Until a set bit is found, squares are issued normally. When a set bit is found and a window of length l (l ≤ W) is formed, l squares are issued and a multiplication with the appropriate table entry is performed. Clearly, the accesses to the precomputed table leak information about the window being processed, and therefore the key bit values of the windows.
Our attack uses the Prime and Probe side channel technique to recover the positions of the table T that holds the values c^3, c^5, c^7, ..., c^(2^W − 1), where W is the window size.

Algorithm 7 RSA with CRT and Message Blinding
Input: Ciphertext c ∈ Z_N, exponents d, e, modulus N = pq
Output: m
  r ←$ Z_N with gcd(r, N) = 1          // Message blinding
  c* = c · r^e mod N
  d_p = d mod (p − 1)                   // CRT conversion
  d_q = d mod (q − 1)
  m_1 = (c*)^(d_p) mod p                // Modular exponentiation
  m_2 = (c*)^(d_q) mod q
  h = q^(−1) · (m_1 − m_2) mod p        // Undo CRT
  m* = m_2 + h · q
  m = m* · r^(−1) mod N                 // Undo blinding
  return m

For CRT-RSA with 2048 bit keys, W = 5 for both exponentiations d_p, d_q. Observe that, if all the positions are recovered correctly, reconstructing the key is a straightforward step. In order to perform the attack:

•We make use of the fact that the offset of the address of each table entry does not change when a new decryption process is executed. Therefore, we only need to monitor a subsection of all possible sets, yielding a lower number of traces.

•Instead of monitoring both the multiplication and the table entry set (as in [Fan15] for El Gamal), we only monitor a table entry set in one slice. This avoids the step where the attacker has to locate the multiplication set and removes an additional source of noise.

Recall that we do not control the victim's user address space. This means that we do not know the location of each of the table entries, which indeed changes from execution to execution, as the table is dynamically allocated. Therefore we will monitor a set hoping that it will be accessed by the algorithm. However, our analysis shows a special behavior: each time a new decryption process is started, even if the location changes, the offset field does not change from decryption to decryption. Thus, we can directly relate a monitored set offset with a specific entry in the multiplication table.
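Algorithm 7 can be sketched directly in Python and checked against plain RSA decryption; the parameters at the bottom are tiny toy values for illustration only, not secure sizes:

```python
import math
import random

def rsa_crt_blinded_decrypt(c, d, e, p, q):
    """Sketch of Algorithm 7: CRT decryption with message blinding."""
    N = p * q
    while True:                          # blinding factor r coprime to N
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:
            break
    c_star = (c * pow(r, e, N)) % N      # blind the ciphertext
    dp = d % (p - 1)                     # CRT exponents: the attack's targets
    dq = d % (q - 1)
    m1 = pow(c_star, dp, p)              # two half-size exponentiations
    m2 = pow(c_star, dq, q)
    h = (pow(q, -1, p) * (m1 - m2)) % p  # undo CRT (Garner recombination)
    m_star = m2 + h * q
    return (m_star * pow(r, -1, N)) % N  # undo blinding

# Toy parameters (illustration only):
p, q, e = 61, 53, 17
N = p * q
d = pow(e, -1, (p - 1) * (q - 1))
m = 42
assert rsa_crt_blinded_decrypt(pow(m, e, N), d, e, p, q) == m
```

Note that blinding randomizes the data being exponentiated but not the exponents d_p and d_q themselves, which is why the table-access pattern targeted by the attack is unaffected.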
The knowledge of the processor on which the attack is carried out gives an estimation of the probability that the set/slice we monitor collides with the set/slice the victim is using. For each table entry, we fix a specific set/slice where not much noise is observed. In the Intel Xeon E5-2670 v2 processors utilized by Amazon EC2, the LLC is divided into 2048 sets and 10 slices.

Algorithm 8 RSA Sliding-Window Exponentiation
Input: Ciphertext c ∈ Z_N, exponent d, window size w
Output: c^d mod N
  // Table precomputation step
  T[0] = c^3 mod N
  v = c^2 mod N
  for i from 1 to 2^(w−1) − 1 do
      T[i] = T[i − 1] · v mod N
  end
  // Exponentiation step
  b = 1, j = len(d)
  while j > 0 do
      if e_j == 0 then
          b = b^2 mod N
          j = j − 1
      else
          Find e_j e_(j−1) ... e_l with j − l + 1 ≤ w and e_l = 1
          b = b^(2^(j−l+1)) mod N
          u = (e_j e_(j−1) ... e_l)_2
          if u == 1 then
              b = b · c mod N
          else
              b = b · T[(u − 3)/2] mod N
          end
          j = l − 1
      end
  end
  return b

Therefore, knowing the lowest 12 bits of the table locations, we will need to monitor one set/slice that solves s mod 64 = o, where s is the set number and o is the offset of a table location. This increases the probability of probing the correct set from 1/(2048 · 10) = 1/20480 to 1/((2048 · 10)/64) = 1/320, reducing the number of traces needed to recover the key by a factor of 64. Thus our spy process will monitor accesses to one of the 320 set/slices related to a table entry, hoping that the RSA decryption accesses it when we run repeated decryptions.

Recall that we reverse engineered the slice selection algorithm for Intel Xeon E5-2670 v2 processors in Section 6.7.7. Thanks to the knowledge of the non-linear slice selection algorithm, we can easily change our monitored set/slice if we observe a high amount of noise in one particular set/slice. Since we also have to monitor a different set per table entry, it also helps us to change our eviction set accordingly.
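Algorithm 8 can be sketched in Python and checked against the built-in modular exponentiation. This is a textbook reconstruction of sliding-window exponentiation with a table of odd powers, not Libgcrypt's actual code:

```python
def sliding_window_pow(c, d, N, w=5):
    """Left-to-right sliding-window exponentiation (sketch of Algorithm 8).

    Precomputes T[i] = c^(3 + 2i) mod N, i.e. the odd powers
    c^3, c^5, ..., c^(2^w - 1), then scans the exponent bits MSB-first.
    """
    v = pow(c, 2, N)
    T = [pow(c, 3, N)]                   # 2^(w-1) - 1 odd powers in total
    for _ in range(2 ** (w - 1) - 2):
        T.append(T[-1] * v % N)

    bits = bin(d)[2:]
    b, j, n = 1, 0, len(bits)
    while j < n:
        if bits[j] == '0':
            b = b * b % N                # plain square on a zero bit
            j += 1
        else:
            l = min(w, n - j)            # longest window ending in a set bit
            while bits[j + l - 1] == '0':
                l -= 1
            for _ in range(l):           # one square per window bit
                b = b * b % N
            u = int(bits[j:j + l], 2)    # odd window value, 1 .. 2^w - 1
            b = b * (c if u == 1 else T[(u - 3) // 2]) % N
            j += l
    return b

assert sliding_window_pow(7, 65537, 101 * 103) == pow(7, 65537, 101 * 103)
```

The leakage exploited by the attack is visible in the last line of the loop: the table index (u − 3)/2 depends directly on the window bits of the secret exponent.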
Thanks to the knowledge of the non-linear slice selection algorithm, we select a range of sets/slices s_1, s_2, ..., s_n for which the memory blocks that create the eviction sets do not change, and that allow us to profile all the precomputed table entries. The threshold is different for each of the sets, since the time to access different slices usually varies. Thus, the threshold for each of the sets has to be calculated before the monitoring phase.

In order to improve the applicability of the attack, the LLC can be monitored to detect whether there are RSA decryptions in the co-located VMs, as proposed in [IGES16]. Once it is established that RSA decryptions are taking place, the attack can be performed. In order to obtain high quality timing leakage, we synchronize the spy process and the RSA decryption by initiating communication between the victim and attacker, e.g. by sending a TLS request. Note that we are looking for a particular pattern observed for the RSA table entry multiplications, and therefore processes scheduled before the RSA decryption will not be counted as valid traces. In short, the attacker communicates with the victim before the decryption. After this initial communication, the victim starts the decryption while the attacker starts monitoring the cache usage. In this way, we monitor 4,000 RSA decryptions with the same key and same ciphertext for each of the 16 different sets related to the 16 table entries.

We investigate a hypothetical case where a system with dual CPU sockets is used. In such a system, depending on the hypervisor CPU management, two scenarios can play out: processes moving between sockets, and processes assigned to specific CPUs. In the former scenario, we can observe the necessary number of decryption samples simply by waiting over a longer period of time.
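The per-set calibration can be sketched as below. The text only states that each set needs its own threshold because slice access times differ; taking the midpoint between the mean hit time and the mean miss time is an assumed heuristic, not the thesis's stated method:

```python
def calibrate_threshold(hit_samples, miss_samples):
    """Per-set probe-time threshold: midpoint between the empirical mean
    access time with the line cached (hit) and evicted (miss).
    The midpoint rule is an assumption for illustration."""
    hit_mean = sum(hit_samples) / len(hit_samples)
    miss_mean = sum(miss_samples) / len(miss_samples)
    return (hit_mean + miss_mean) / 2

# Hypothetical calibration samples for one monitored set/slice (cycles):
thr = calibrate_threshold([70, 80, 75], [250, 300, 280])
assert 80 < thr < 250   # the threshold separates the two populations
```

In the real attack the calibration samples would be gathered per set/slice before the monitoring phase, exactly because a single global threshold would misclassify accesses on the slower slices.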
In this scenario, the attacker would collect traces and only use the information obtained during the times the attacker and the victim share a socket, discarding the rest as missed traces. In the latter scenario, once the attacker achieves co-location, as we have in Amazon EC2, the attacker will always run on the same CPU as the target, hence the attack will succeed in a shorter span of time.

Among the 4,000 observations for each monitored set, only a small portion contains information about the multiplication operations with the corresponding table entry. These are recognized because their exponentiation trace pattern differs from that of unrelated sets. In order to identify where each exponentiation occurs, we inspected 100 traces and created the timeline shown in Figure 6.12(b). It can be observed that the first exponentiation starts after 37% of the overall decryption time. Note that among all the traces recovered, only those that have more than 20 and fewer than 100 peaks are considered; the remaining ones are discarded as noise. Figure 6.12 shows measurements where no correct pattern was detected (Fig. 6.12(a)), and where a correct pattern was measured (Fig. 6.12(b)). In general, after the elimination step, there are 8-12 correct traces left per set.

We observe that the data obtained from each of these sets corresponds to 2 consecutive table positions. This is a direct result of CPU cache prefetching: when a cache line that holds a table position is loaded into the cache, the neighboring table position is also loaded due to the cache locality principle.

For each graph to be processed, we first need to align the creation of the look-up table with the traces. Identifying the table creation step is trivial since each table position is used twice, taking two or more time slots.
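The peak-count elimination step described above can be sketched as follows; the 300-cycle peak threshold is an assumed value chosen to be in line with the probe timings reported earlier, and the trace contents are synthetic:

```python
def count_peaks(trace, threshold=300):
    """A probed time above `threshold` cycles counts as a peak
    (i.e. a likely access to the monitored table-entry line)."""
    return sum(1 for t in trace if t > threshold)

def keep_valid(traces, threshold=300):
    """Keep only traces with more than 20 and fewer than 100 peaks,
    as in the elimination step described above."""
    return [tr for tr in traces if 20 < count_peaks(tr, threshold) < 100]

# Synthetic traces (cycle values are invented):
good = [500] * 50 + [200] * 950    # 50 peaks  -> plausible exponentiation
noisy = [500] * 400 + [200] * 600  # 400 peaks -> noise, discarded
flat = [200] * 1000                # 0 peaks   -> unrelated set, discarded
assert keep_valid([good, noisy, flat]) == [good]
```

After this filter, the surviving 8-12 traces per set feed the alignment and noise reduction stage.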
Figure 6.12: Different sets of data where we find a) a trace that does not contain information and b) a trace that contains information about the key.

Figure 6.13(a) shows the table access position indexes aligned with the table creation. In the figure, the top graph shows the true table accesses while the rest of the graphs show the measured data. It can be observed that the measured traces suffer from misalignment due to noise from various sources, e.g. RSA or co-located neighbors. In consequence, we apply alignment and noise reduction techniques typically utilized in DPA attacks [KJJ99] to obtain a clean cache trace. The result after such an alignment process can be observed in Figure 6.14, which compares the real indices of a particular table entry with those retrieved from our analysis.

Despite the good results of these noise reduction techniques, we still do not end up with a perfect trace. The overall results are presented in Table 6.4, where 0.65% of the peaks are misdetected. Thus, the key recovery algorithm described in [Ham13] is applied to establish relationships between the noisy d_p and d_q and recover a full clean RSA key. In order to apply such an algorithm we need information about the public key. We distinguish two different scenarios in which the attacker is able to utilize public keys for the error correction algorithm.
In both, we assume the attacker has already retrieved the leakage from the RSA decryption key of a (potentially known) server:

•Targeted Co-location: The Public Key is Known. In this case we assume that the attacker implemented a targeted co-location against a known server, and that she has enough information about the public key parameters of the target.

•Bulk Key Recovery: The Public Key is Unknown. In this scenario, the attacker can build up a database of public keys by mapping the entire IP range of the targeted Amazon EC2 region and retrieving all the public keys of hosts that have the TLS port open. The attacker then runs the above described algorithm for each of the recovered private keys against the entire public key database. Having the list of 'neighboring' IPs with an open TLS port also allows the attacker to initiate TLS handshakes to make the servers use their private keys with high frequency.

Figure 6.13: 10 traces from the same set where a) they are divided into blocks for a correlation alignment process and b) they have been aligned and the peaks can be extracted.

Figure 6.14: Comparison of the final obtained peaks with the correct peaks, with adjusted timeslot resolution.

Table 6.4: Successfully recovered peaks on average in an exponentiation
  Average number of traces/set          4000
  Average number of correct graphs/set    10
  Wrongly detected peaks                7.19%
  Misdetected peaks                     0.65%
  Correctly detected peaks             92.15%

In the first scenario, the attacker has information about the public key and thus he can directly apply the error correction algorithm that we will describe. In the second scenario, the attacker has a database of public keys, and he does not know which public key the private key leakage belongs to.
In this case, the attacker runs the error correction algorithm with each of the public keys in the database until he finds a successful correlation.

We proceed next to explain the algorithm used to recover the noise-free decryption key. The leakage analysis described above recovers information on the CRT version of the secret exponent d, namely d_p = d mod (p − 1) and d_q = d mod (q − 1). A noise-free version of either one can be used to trivially recover the factorization of N = pq, since gcd(m − m^(e·d_p), N) = p for virtually any m [CS04]. Thus, our goal is to retrieve a noise-free d_p or d_q from the noisy d*_p and d*_q. This is not directly our case: even with alignment and noise filtering techniques we still had misdetected and wrongly detected peaks. In such cases, we can exploit the relation of d_p and d_q to the known public key if the public exponent e is small [Ham13]. This is indeed our case, as almost all RSA implementations currently use e = 2^16 + 1 due to the heavy performance boost over a random, full size e. For the CRT exponents we calculated d_p and d_q as d_p = d mod (p − 1) and d_q = d mod (q − 1). Multiplying both sides by e we obtain

e·d_p ≡ 1 mod (p − 1)    (6.6)
e·d_q ≡ 1 mod (q − 1)    (6.7)

since d·e mod ((p − 1)·(q − 1)) = 1. This means that there exist integers k_p and k_q such that e·d_p = k_p(p − 1) + 1 for some 1 ≤ k_p