Citation
Implementing Embedded DSLs using Syntax Macros

Material Information

Title:
Implementing Embedded DSLs using Syntax Macros
Creator:
Anderson, William Blake
Publication Date:
Language:
English

Subjects

Genre:
Undergraduate Honors Thesis/Project

Notes

Abstract:
Developers frequently use domain-specific languages (DSLs) such as regex, SQL, and HTML for solving specialized problems. Embedding DSLs into programs often uses strings or other approaches that restrict the syntax of the DSL, prevent static analysis, and limit interactions with other code in the host language. This paper introduces a new model for embedded DSLs, termed syntax macros, and a generalized system of interpolation to resolve these issues.
General Note:
Awarded Bachelor of Science in Computer Science, summa cum laude, on April 29, 2021. Major: Computer Science
General Note:
College or School: Engineering
General Note:
Advisor: Peter Jonathan Dobbins. Advisor Department or School:

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright William Blake Anderson. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text

Implementing Embedded DSLs using Syntax Macros

William Blake Anderson
Herbert Wertheim College of Engineering, University of Florida
WillBAnders@gmail.com

Abstract

Developers frequently use domain-specific languages (DSLs) such as regex, SQL, and HTML for solving specialized problems. Embedding DSLs into programs often uses strings or other approaches that restrict the syntax of the DSL, prevent static analysis, and limit interactions with other code in the host language. This paper introduces a new model for embedded DSLs, termed syntax macros, and a generalized system of interpolation to resolve these issues.

Keywords: programming language design, embedded domain-specific languages, macros, interpolation

I. INTRODUCTION

A domain-specific language (DSL) is a programming language specially designed for and focused on a given problem, context, or application domain [6]. DSLs are often embedded within larger applications written in another language, called the host language. A few of the most well-known DSLs are:

Regular expressions (regex), for string matching.
SQL, for interacting with relational databases.
HTML, for expressing the structure of web pages.

To motivate the core issues addressed in this paper, consider the example SQL query in Figure 1 for looking up a user by their name in a database.

    db.execute(
        "SELECT * FROM users " +
        "WHERE name = '" + name + "'"
    );

Figure 1: SQL using string concatenation

This query is vulnerable to SQL injection as the username becomes part of the SQL code itself. A common solution to this is prepared statements, which use a template and arguments to set values separate from the query itself [4], shown in Figure 2.

    stmt = db.prepare(
        "SELECT * FROM users " +
        "WHERE name = ?"
    );
    stmt.set(0, name);
    stmt.execute();

Figure 2: SQL using prepared statements

This does resolve the SQL injection issue, but also impacts readability by separating the query, arguments, and execution.
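
The difference between Figures 1 and 2 can be reproduced with a short, runnable sketch. It uses Python's standard sqlite3 module rather than the unspecified db API of the figures; the table contents, the function names find_user_unsafe and find_user_safe, and the payload are assumptions made up for illustration.

```python
import sqlite3


def find_user_unsafe(db: sqlite3.Connection, name: str) -> list:
    """Figure 1 style: the name is spliced into the SQL text itself."""
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return db.execute(query).fetchall()


def find_user_safe(db: sqlite3.Connection, name: str) -> list:
    """Figure 2 style: the name is bound to a placeholder as data."""
    return db.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


# Hypothetical in-memory database used only for this demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# A crafted "name" that closes the string literal and adds an always-true clause.
payload = "x' OR '1'='1"

leaked = find_user_unsafe(db, payload)  # the injected clause matches every row
blocked = find_user_safe(db, payload)   # the payload is compared as a plain value
```

With concatenation the payload changes the meaning of the query, while the placeholder version treats the same text purely as a value to compare against.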
An alternative approach is using an embedded DSL for SQL to write the query using standard SQL syntax, as in Figure 3.

    db.execute(#sql {
        SELECT * FROM users
        WHERE name = $name
    });

Figure 3: SQL using embedded DSLs

Using #sql, the host language knows the intended DSL and can delegate to its implementation for this input. The DSL understands the syntax of the $name placeholder and can convert this to use prepared statements at runtime. If the database schema is known, it would also be possible to perform validation on the query and type of name.

This paper discusses standard approaches to representing DSLs in programming languages and contributes a new model for embedded DSLs termed syntax macros. Syntax macros build on established concepts for macros in existing languages such as hygiene. Furthermore, a generalized system of interpolation is introduced for interacting with the host language from the DSL.

Previous work by the author has implemented Rhovas, a programming language for API design and enforcement [1]. The novel syntax macro model has been prototyped in Rhovas, acting as the host language, to support embedded DSLs.

II. PROBLEM SCOPE

This paper is focused on DSLs that already have an existing implementation in the host language, meaning that the creation of a custom runtime is not necessary (or at least out of scope). Therefore, the problem is largely reduced to the compile-time aspects of syntax transformation and validation as opposed to runtime execution. Language design also plays a major role as the syntax used should be consistent between DSLs.

Given the goal is to perform validation for DSLs prior to runtime, both the host language and DSLs are presumed to be statically typed and compiled for there to be an impact. A DSL may be based on a dynamic or interpreted language but will support at least some degree of static analysis.

Examples in this paper will use well-known languages meeting these requirements: regex, SQL, and HTML. These languages are commonly embedded for use in other applications, have implementations provided by the standard library or common dependencies, and can benefit from syntax validation and/or static analysis.

There are two key features which will be considered requirements for a viable solution to representing embedded DSLs. Without these features, a DSL is significantly limited in how it can be expressed in the host language.

1. The syntax of the DSL should not be unnecessarily restricted by the host language. This includes the lexer, which may have different tokens for each language. The parser must also carefully manage lookahead to avoid inconsistencies between the current lexer and intended language being parsed.
2. The DSL should integrate with the host language for accessing values, which is often used for templating. This generally requires escaping from the DSL back to the host language to avoid grammatical restrictions.

III. EXISTING SOLUTIONS

Existing solutions for representing DSLs in languages can be grouped into four categories: strings, libraries, macros, and grammar extensions. These are based on the extent they interact with the host language and their ability to meet the two key features established in the problem scope above.

A. Strings

The most common approach for representing DSLs in other languages is through strings. Regex is a standard example of this as nearly all languages provide a regex engine in the standard library. However, DSLs must take care to escape any special characters reserved by the host language. Figure 4 shows two regex examples for matching a number where the first does not account for escape characters and the second does.

    Regex("\d+(\.\d+)?")       failure
    Regex("\\d+(\\.\\d+)?")    success

Figure 4: Reserved characters in regex

The SQL examples for string concatenation in Figure 1 and prepared statements in Figure 2 both fall into this category, as the SQL itself is provided within a string.

Often, external tooling is available to perform some degree of validation on DSLs written as strings, since strings are the most vulnerable to syntax, type-checking, or other logical errors.
In Java, popular options are the Checker Framework for format strings (and other validations) [9] as well as JDBC Checker for SQL prepared statements [7]. Validation on strings is a complex problem and can be severely limited when strings are generated dynamically.

B. Libraries

DSLs can also be implemented through a library for the host language, taking advantage of the existing type system in order to perform compile-time validation. The Kotlin programming language provides a library called kotlinx.html [11], which implements a DSL for HTML as shown in Figure 5.

    h1 { +"Hello, "; +name; +"!" }

Figure 5: HTML DSL in Kotlin using kotlinx.html

To achieve this syntax, the library uses two key features in Kotlin: lambda expressions with an implicit receiver (the HTML element) and operator overloading of unary plus. This results in a more compact syntax for the library at the cost of readability for developers without the necessary background knowledge.

Other common types of DSLs implemented as libraries include object relation mappers (ORMs) for working with SQL ([...] but is implemented as a library), and configuration libraries for working with JSON, XML, YAML, or other formats.

The primary disadvantage of libraries is that they rely on the host language for functionality. The most noticeable restriction is on syntax, but many libraries must also work around the type system of the language to provide static validation. This often results in additional complexity or excluding certain types of validation entirely to keep the API of the library manageable.

C. Macros

Macros are compile-time transformations of source code that operate on the AST of the language. Compared to libraries, the use of macros allows more expressive syntax and additional validation while reducing the runtime costs of using the DSL. The Racket programming language is well known for its hygienic macro system and emphasis on language-oriented programming using DSLs [2].
Figure 6 shows an example SQL query using the sql package [12], which provides a DSL for creating SQL statements.

    (select * #:from users #:where (= name ,name))

Figure 6: SQL DSL using macros in Racket

This is a prepared statement containing the name argument. This allows for static validation to be performed with the full support of the language. The use of hygienic macros in Racket [...]

Additionally, Racket allows for creating a custom reader (lexer) which can be used to support the standard syntax of the DSL in other files, and a method for changing the active reader exists through readtables. This system is not explicitly addressed in their paper, but it is acknowledged that the use of macros has limitations on the syntax and static semantics of the DSL caused by the host language [2].

An alternative approach is to apply macros to string literals, allowing the syntax of the DSL to have more possibilities. In the Scala programming language, macros have been used to implement DSLs with this approach combined with string interpolation [3]. The example in Figure 7 shows a DSL for HTML using this approach.

    html"<h1>Hello, $name!</h1>"

Figure 7: HTML DSL in Scala using macros

This uses a feature of Scala called implicit classes to extend StringContext, which represents an interpolated string literal. The html"" syntax is called an interpolator, which is desugared to a method call on StringContext. This method implements the macro which performs the necessary validation to verify syntax and type checking. Interpolated values are not treated as strings but rather full objects, allowing them to be converted to the appropriate form for the DSL.

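The idea that an interpolator receives the literal parts and the interpolated values separately, and converts each value according to its type, can be sketched as follows. This is a Python analogue written for illustration, not the Scala macro from [3]; the names Markup, to_html, and render_html are hypothetical, and the check happens at runtime rather than at compile time.

```python
import html
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Markup:
    """Hypothetical wrapper for a value that is already trusted HTML."""
    text: str


def to_html(value: Any) -> str:
    """Convert an interpolated value to HTML based on its type."""
    if isinstance(value, Markup):
        return value.text               # trusted markup is inserted unchanged
    return html.escape(str(value))      # everything else is escaped as text


def render_html(parts: list[str], values: list[Any]) -> str:
    """Interleave literal template parts with converted values."""
    if len(parts) != len(values) + 1:
        raise ValueError("expected exactly one more literal part than values")
    out = [parts[0]]
    for value, literal in zip(values, parts[1:]):
        out.append(to_html(value))
        out.append(literal)
    return "".join(out)


# The template <h1>Hello, $name!</h1> split into its literal parts.
greeting = render_html(["<h1>Hello, ", "!</h1>"], ["<b>Bob</b>"])
```

Because the value arrives as an object rather than as pre-joined text, the DSL decides how it is serialized; here a plain string is escaped while a Markup value would pass through unchanged.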

D. Grammar Extensions

The final category represents solutions which can extend the grammar of the language. This is a necessary requirement to support the correct syntax of a DSL, since the grammar is no longer restricted by the host language.

One of the most well-known examples of a grammar extension is JSX [13], a language that extends JavaScript to include XML literals used by React. The primary purpose for JSX is to express HTML templates, such as the one in Figure 8.

    const elem = <h1>Hello, {name}!</h1>

Figure 8: HTML template in JSX

JSX, however, is not itself a DSL as it is a separate language that mirrors JavaScript. Any changes to JavaScript must be incorporated manually into JSX, and the possibility of conflicts exists if syntax for XML literals were added to JavaScript itself. Furthermore, in an environment with multiple DSLs conflicts may still occur when [...] the host language.

One approach to this ambiguity is to specify the name of the DSL for use in the parser. The XJ language extension [5], short for eXtensible Java, uses an annotation identifying the class responsible for the DSL (referred to as a syntax class) to notify the parser of the appropriate grammar to use. Figure 9 shows an example SQL query using this approach.

    @Select User u from users where u.name = name

Figure 9: SQL DSL in XJ

The primary limitations of XJ are in relation to interactions between the DSL and the host language. Syntax classes are not naturally sandboxed from other Java code, which can be useful but also dangerous for DSLs that behave differently from Java such as other general-purpose languages. Macro hygiene, communication, and interpolation are all areas for future work raised by the authors [5].

An alternative approach is to use type information. The Wyvern programming language [8] uses a model for type-specific languages (TSLs) which establishes a framework for this approach. The parsing of TSL syntax is delayed until type checking, resolving any potential ambiguities in syntax. Figure 10 shows a potential ambiguity for two DSLs, one for URLs and the other for HTML, which is resolved using the declared type.

    let url: Url =
    let elem: Html = <h1>Hello, {name}!</h1>

Figure 10: URL and HTML TSLs in Wyvern

Although powerful, this approach is also complex. It is not clear if the use of types is indeed more seamless and natural for developers, as it relies on a detailed understanding of types in the program to determine the specific TSL being used. This disambiguation is also performed during static analysis rather than within the parser, which may serve as a barrier for tooling as the resulting grammar is context sensitive.

IV. SYNTAX MACRO MODEL

This section introduces a new model for implementing embedded DSLs termed syntax macros. Syntax macros build on established concepts for macros in existing languages (such as hygiene [2]) but can also extend the grammar of the language. This model is prototyped in Rhovas, a programming language for API design and enforcement [1].

The two key features established in the problem scope were supporting fully customizable syntax and the ability to access values from the host language. The first of these is solved by the parsing process, which replaces the entire grammar (including lexing) with one controlled by the DSL. The second is addressed by a generalized system of interpolation for interacting with the host language from within the DSL.

A. Macros in Rhovas

Rhovas has two types of macros: regular and syntax. Regular macros transform the existing AST, and therefore must use the same syntax as Rhovas to be parsed successfully. In comparison, syntax macros are applied during parsing to change the current grammar, which allows the macro to contain syntax that is different from Rhovas itself.

Figure 11 shows a regular macro for assertions with a lazy error message and a syntax macro for a regex DSL in Rhovas. Note that the regular macro uses the grammar of Rhovas, while the syntax macro uses a custom grammar for regex literals.

    #require(cond, error("lazy error"))
    #regex { /\d+(\.\d+)?/ }

Figure 11: Regular vs syntax macros in Rhovas

Both types of macros are prefixed with # to help developers identify locations in their code where control flow and/or syntax changes. Furthermore, regular macros use () like function calls while syntax macros use {} like lambda expressions. This distinguishes both types of macros syntactically so they are unambiguous from both a parsing and readability perspective.

B. Parsing

The grammar of a syntax macro is shown in Figure 12, where source is the starting rule of a given grammar.

    dsl ::= '#' id '{' source '}'

Figure 12: Grammar template of syntax macros

The host language is free to manage the registration of DSLs and determine how to resolve duplicate ids. The recommended approach is to use a project-level configuration to set aliases on more explicit dependencies. For example, sql can be defined as an alias for org.sqlite.v3 in projects using SQLite or an alias for com.mysql.v8 when working with MySQL.

During parsing, the host language uses the id to resolve the registered DSL and delegate parsing appropriately. The result is a grammar similar to the one depicted in Figure 13 depending on the available DSLs.

    dsl ::= '#' 'regex' '{' regex/source '}' |
            '#' 'sql' '{' sql/source '}' |
            '#' 'html' '{' html/source '}'
    regex/source ::=
    sql/source ::=
    html/source ::=

Figure 13: Complete grammar of syntax macros

Delegating to the DSL parser includes sharing both the current parser (which will be used later for interpolation) as well as the current CharStream. Both languages must pay close attention to the amount of lookahead required for parsing to avoid handling input with an incorrect lexer. Figure 14 shows an example DSL which could cause an error if the host language uses single-quote character literals (thus resulting in an error at the lexer level).

    #js { 'string' }

Figure 14: DSL using single-quote character literals

Furthermore, the language should establish a convention for parsing the surrounding braces. Since each language generally requires at least one token of lookahead, a natural approach is for the host language to parse the opening brace and the DSL the closing one. An alternative option is for both parsers to parse these braces as a type of formed transition; however, this necessitates resetting the state of the CharStream. Rhovas has adopted the latter convention based on the principle of least surprise for any DSL implementors.

The final parsing challenge is interpolation for accessing values within the host language. This uses the same process as delegating to a DSL in the host language but in the opposite direction, with the DSL delegating back to the host language. Interpolation is discussed in detail in Section D.

C. AST Representation

In the simplest cases, a DSL can be represented using only the AST of the host language. An example of this is regex, which can be converted to a string literal used with the regex library of the host language as shown in Figure 15.

    #regex { /\d+(\.\d+)?/ }    syntax macro
    Regex("\\d+(\\.\\d+)?")     equivalent AST

Figure 15: AST representation of regex DSL

If additional validation during static analysis is required, the DSL can instead return a custom AST hierarchy. When this AST is reached by the host language during compilation, it delegates handling to the DSL with the appropriate context. Once the necessary validation has been completed, the custom AST can be transformed into the AST of the host language.

D. DSL Interpolation

Interpolation is a method of inserting values and is most frequently used with strings. In Lisp languages, quasiquote performs a similar function by switching between data literals and evaluating code. The Kotlin programming language supports interpolation in strings using $ for variables and ${} for arbitrary expressions, as shown in Figure 16.

    "name = $name"
    "person.name = ${person.name}"

Figure 16: Interpolation syntax in Kotlin (and Rhovas)

Rhovas currently uses this syntax as it is heavily inspired by Kotlin. An interesting alternative is #{} used by Ruby, which would unify syntax between entering a DSL using #name{} and (temporarily) returning to the host language with #{}.

The key feature of interpolation with DSLs is that it is not string-based, but inserts the value with the necessary serialization or conversion. Figure 17 shows an example of interpolation with SQL, similar to Figure 3 but using an expression rather than a variable.

    db.execute(#sql {
        SELECT * FROM users
        WHERE name = ${user.name}
    });

Figure 17: Expression interpolation for SQL

Standardizing the syntax used for interpolation increases readability when working with multiple DSLs. Furthermore, the generalization of interpolation can also offer additional insight into other languages and DSLs.

In the Elixir programming language, variable assignment uses pattern matching and supports destructuring using variable bindings and values [10]. If the value in the pattern does not match the corresponding value being assigned, the assignment fails.
The example in Figure 18 demonstrates this behavior with destructuring a 3D point expected to have a y value of 0.

    {x, 0, z} = {1, 0, 1}    success (x = 1, z = 1)
    {x, 0, z} = {1, 2, 3}    failure

Figure 18: Pattern matching in Elixir

In the case when pattern matching should be on the existing value of a variable rather than rebinding an identifier, Elixir introduces the pin operator ^ as shown in Figure 19.

    y = 0
    {x, y, z} = {1, 2, 3}     success (x = 1, y = 2, z = 3)

    y = 0
    {x, ^y, z} = {1, 2, 3}    failure

Figure 19: Pin operator in Elixir

Interestingly, by considering the syntax of pattern matching from the perspective of a DSL it becomes clearer that the pin operator is simply interpolation of the variable y in the pattern. This offers a new perspective on pattern matching and implies that interpolation is a more generalized concept in programming languages that extends beyond just string literals.
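
The interpolation process described in this section, where the DSL scans its own syntax and hands control back to the host language at $name or ${expr}, can be sketched in a few lines. This is an illustrative Python model and not the Rhovas implementation: compile_sql, parse_host_expr, and evaluate are hypothetical names, the "host parser" only understands dotted names such as user.name, and SQL string literals containing $ are not handled.

```python
import re
from types import SimpleNamespace
from typing import Any, Mapping

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def parse_host_expr(source: str, pos: int) -> tuple[list[str], int]:
    """Stand-in for the host parser: reads a dotted name and returns
    its segments plus the offset where the DSL should resume."""
    path = []
    while True:
        match = _IDENT.match(source, pos)
        if match is None:
            raise SyntaxError(f"expected identifier at offset {pos}")
        path.append(match.group())
        pos = match.end()
        if source.startswith(".", pos):
            pos += 1
        else:
            return path, pos


def compile_sql(source: str) -> tuple[str, list[list[str]]]:
    """Turn a DSL body into a '?' template plus the host expressions
    that must be evaluated and bound as prepared-statement arguments."""
    template, exprs, pos = [], [], 0
    while pos < len(source):
        if source[pos] != "$":
            template.append(source[pos])       # ordinary DSL text
            pos += 1
        elif source.startswith("${", pos):
            path, pos = parse_host_expr(source, pos + 2)  # delegate to host
            if not source.startswith("}", pos):
                raise SyntaxError(f"expected '}}' at offset {pos}")
            pos += 1
            template.append("?")
            exprs.append(path)
        else:
            match = _IDENT.match(source, pos + 1)  # $name: one identifier only
            if match is None:
                raise SyntaxError(f"expected identifier at offset {pos + 1}")
            pos = match.end()
            template.append("?")
            exprs.append([match.group()])
    return "".join(template), exprs


def evaluate(path: list[str], env: Mapping[str, Any]) -> Any:
    """Resolve a dotted name against host variables at runtime."""
    value = env[path[0]]
    for attr in path[1:]:
        value = getattr(value, attr)
    return value


# The body of Figure 17, with a hypothetical host variable named user.
template, exprs = compile_sql("SELECT * FROM users WHERE name = ${user.name}")
args = [evaluate(e, {"user": SimpleNamespace(name="alice")}) for e in exprs]
```

The interpolated expression never becomes part of the SQL text: the DSL emits a placeholder and keeps the expression as a separate argument, which is the property that distinguishes DSL interpolation from string interpolation.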

V. FUTURE WORK

Syntax macros do not currently support arguments, which can be used for specifying additional options for the DSL. Example applications of this include selecting a specific regex flavor (Figure 20), providing a schema for SQL validation, or specifying the grammatical rule for a language.

    #regex(:pcre

Figure 20: Syntax macro arguments for regex flavor

The primary consequence of this is introducing ambiguity with regular macros. A related problem is macros with trailing lambdas, which also use {} syntax as shown in Figure 21. The benefits of allowing arguments need to be carefully weighed against the potential ambiguity for developers in these cases.

    numbers.filter { it > 0 }

Figure 21: Syntax for trailing lambdas

Additionally, there are no restrictions on the DSL modifying control flow, mutating variables, or overall performing any functionality code in the host language can do. Incorporating a sandbox-like model may improve the interactions between the host language and DSL.

There is also currently no concept of splicing for macros (regular and syntax). Splicing allows merging a list of ASTs into a template and is a common feature in many macro systems. An example macro using splicing for creating a list of properties in a struct declaration is provided in Figure 22.

    #struct(Point, x, y, z)

    struct Point {
        var x: Integer = x;
        var y: Integer = y;
        var z: Integer = z;
    }

Figure 22: Struct definition macro using splicing

Finally, there is no dedicated API for implementing DSLs using this system. A developer must integrate with the Rhovas compiler to define a parser and implement the necessary static analysis for their language. Like regular macros, syntax macros should be possible through an API without needing to work with the full compiler to implement effective tooling.

VI. CONCLUSION

Embedded DSLs in existing languages often have restricted syntax and difficulty interacting with the host language.
The proposed model of syntax macros establishes a conceptual framework for implementing embedded DSLs in a way that resolves these issues. Syntax macros extend the grammar of the host language, allowing different syntax to be used within a DSL. A generalized system of interpolation further allows DSLs to better integrate with the host language for accessing values. While there remain areas for future work on missing features and a dedicated API for macros, the syntax macro model has been successfully prototyped in the Rhovas programming language.

VII. REFERENCES

[1] W. B. Anderson. (2020, April). Language for API Desig[...]. Senior Project Report. University of Florida.
[2] [...] Domain [...] Proceedings of the ACM on Programming Languages, 2020, 4(OOPSLA), pp. 1-29. Available: https://dl.acm.org/doi/10.1145/3428297
[3] [...] in Proceedings of the 4th Workshop on Scala (SCALA [...]), Montpellier, France, pp. 1-10. Available: https://dl.acm.org/doi/10.1145/2489837.2489840
[4] R. E. Castillo, J. A. Caliwag, R. A. Pagaduan, and A. C. [...] Page of a Website Application using Prepared Statement [...] ICISS 2019: Proceedings of the 2019 2nd International Conference on Information Science and Systems, Tokyo, Japan, pp. 171-175. Available: https://dl.acm.org/doi/10.1145/3322645.3322704
[5] [...] 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation, Beijing, China, pp. 229-238. Available: https://ieeexplore.ieee.org/document/4637555
[6] A. van Deursen, P. Klint, and J. Visser. (2000, Jun.). Domain-Specific Languages: An Annotated Bibliography. SIGPLAN Notices [Online]. 35(6), pp. 26-36. Available: https://dl.acm.org/doi/10.1145/352029.352035
[7] [...] Proceedings. 26th International Conference on Software Engineering, Edinburgh, United Kingdom, 2004, pp. 697-698. Available: https://ieeexplore.ieee.org/document/1317494
[8] C. Omar, D. Kurilova, L. Nistor, B. Chung, A. Potanin, [...] Specific [...] ECOOP 2014 Object-Oriented Programming. Lecture Notes in Computer Science, 8586, pp. 105-130. Available: https://link.springer.com/chapter/10.1007/978-3-662-44202-9_5
[9] [...] Type System fo[...] ISSTA 2014: Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, pp. 127-137. Available: https://dl.acm.org/doi/proceedings/10.1145/2610384
[10] Elixir. Pattern matching (2020, Oct.). Available: https://elixir-lang.org/getting-started/pattern-matching.html
[11] Kotlin. Type-safe builders (2021, Feb.). Available: https://kotlinlang.org/docs/type-safe-builders.html
[12] Racket. SQL: A Structured Notation for SQL Statements. (2021, Feb.). Available: https://docs.racket-lang.org/sql/index.html
[13] React. Introducing JSX (2020, Dec.). Available: https://reactjs.org/docs/introducing-jsx.html