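Preface

Picking up where the previous post left off: baseCompile runs three stages (parse, optimize and generate) to turn a template into a render function, which in turn returns the virtual DOM. This post looks at what happens in the parse stage.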

graph LR
A[template]-->|parse|B[AST]-->|optimize|C[optimized AST]-->|generate|D[render]-->|return|E[virtual DOM]

parse turns the template into AST nodes. For example:


html
<div>
  <div v-for="(item, index) in arr" @click="clickHandler" class="item">
    Current Value is {{ item }}
  </div>
</div>

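After parsing, <div v-for="(item, index) in arr" @click="clickHandler" class="item"> produces the following AST node. v-for no longer appears in attrsList because the parser removes it from that list once it has been turned into the for, alias and iterator1 fields: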


json
{
    "type": 1,
    "tag": "div",
    "attrsList": [
        {
            "name": "@click",
            "value": "clickHandler",
            "start": 44,
            "end": 65
        },
        {
            "name": "class",
            "value": "item",
            "start": 66,
            "end": 78
        }
    ],
    "attrsMap": {
        "v-for": "(item, index) in arr",
        "@click": "clickHandler",
        "class": "item"
    },
    "rawAttrsMap": {
        "v-for": {
            "name": "v-for",
            "value": "(item, index) in arr",
            "start": 15,
            "end": 43
        },
        "@click": {
            "name": "@click",
            "value": "clickHandler",
            "start": 44,
            "end": 65
        },
        "class": {
            "name": "class",
            "value": "item",
            "start": 66,
            "end": 78
        }
    },
    "parent": {
        "type": 1,
        "tag": "div",
        "attrsList": [],
        "attrsMap": {},
        "rawAttrsMap": {},
        "children": [],
        "start": 0,
        "end": 5
    },
    "children": [],
    "start": 10,
    "end": 79,
    "for": "arr",
    "alias": "item",
    "iterator1": "index"
}

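Entry file

Open src/compiler/parser/index.ts. The rules for turning a template into an AST are fairly involved, and many of them cover cases you rarely care about day to day, so this post focuses on how start tags, end tags and text are handled.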

graph LR
baseCompile-->A["parse(template.trim())"]-->parseHTML

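parse assembles the AST. Two variables are worth noting:


ts
// Holds the AST nodes that are still unclosed, keeping the AST nesting in step with the template
const stack: any[] = []
// The root AST node, which is also what parse returns
let root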

parse passes parseHTML a set of hook functions that build the AST. Their bodies can be ignored for now:


ts
parseHTML(template, {
  // ...
  start(tag, attrs, unary, start, end) {
    // Handle a start tag
    // ...
  },
  end(tag, start, end) {
    // Handle an end tag
    // ...
  },
  chars(text: string, start?: number, end?: number) {
    // Handle text
    // ...
  },
  // ...
})

Now for parseHTML. Before it starts scanning the template, note these variables:


ts
export function parseHTML(html, options: HTMLParserOptions) {
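  // Start tags matched so far that have not been closed yet
  const stack: any[] = []
  // Current offset into the original template string
  let index = 0
  // last is the template string still remaining when the current loop iteration began; lastTag is the innermost unclosed start tag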
  let last, lastTag
  // ...
}

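index is the current offset relative to the start of the template html. Each time a piece of the template is matched, index moves forward and the matched part is dropped from html. Most of the slicing and offset bookkeeping is done by this one function: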


ts
function advance(n) {
  index += n
  html = html.substring(n)
}
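
To make the bookkeeping concrete, here is a minimal sketch of the same idea outside parseHTML. createCursor is a hypothetical helper that does not exist in Vue; it only wraps html and index in a closure the way parseHTML does.

```typescript
// Hypothetical stand-in for the closure state that parseHTML keeps.
function createCursor(template: string) {
  let html = template
  let index = 0
  return {
    // Same body as advance above: move the offset and drop the consumed prefix.
    advance(n: number): void {
      index += n
      html = html.substring(n)
    },
    get html(): string {
      return html
    },
    get index(): number {
      return index
    }
  }
}

const cursor = createCursor('<div>hi</div>')
cursor.advance('<div>'.length)
// cursor.html is now 'hi</div>' and cursor.index is 5
```

Because index always counts from the start of the original template, it can be stored directly as the start and end positions on matched tags and attributes.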

Tag matching

Setting the options aside, parse ends up calling parseHTML to scan the template. Its outline is:


ts
export function parseHTML(html, options: HTMLParserOptions) {
  const stack: any[] = []
  let index = 0
  let last, lastTag
  while (html) {
    last = html
    // Make sure we're not in a plaintext content element like script/style
    if (!lastTag || !isPlainTextElement(lastTag)) {
      let textEnd = html.indexOf('<')
      if (textEnd === 0) {
        // Comment:
        if (comment.test(html)) {
          // ...
          continue
        }
        // Conditional comment:
        if (conditionalComment.test(html)) {
          // ...
          continue
        }
        // Doctype:
        const doctypeMatch = html.match(doctype)
        if (doctypeMatch) {
          // ...
          continue
        }
        // End tag:
        const endTagMatch = html.match(endTag)
        if (endTagMatch) {
          // ...
          continue
        }
        // Start tag:
        const startTagMatch = parseStartTag()
        if (startTagMatch) {
          // ...
          continue
        }
      }
      // Text:
      // ...
    } else {
      // Style, Script & Textarea:
      // ...
    }
    if (html === last) {
      // Last Text:
      // ...
    }
  }
  // Close any tags that are still open
  parseEndTag()
  //...
}

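parseHTML is adapted from the HTML Parser written by John Resig, the creator of jQuery. It scans the template string and reports each piece it recognizes through the hooks, from which the AST is built.

The function relies on regular expressions to recognize each kind of content:

Comment: /^<!\--/, which matches the opening <!-- of a comment.
Conditional comment: /^<!\[/. This is syntax from IE 5 to 9 that tells the browser to process HTML conditionally; Vue ignores it.
Doctype: /^<!DOCTYPE [^>]+>/i. As the name suggests, this is the <!DOCTYPE> on the first line of an HTML document.
End tag: new RegExp(`^<\\/${qnameCapture}[^>]*>`), where qnameCapture captures a valid tag name (optionally with a namespace prefix); it matches </div>, for example.
Start tag: const startTagOpen = new RegExp(`^<${qnameCapture}`) first matches the beginning of the tag, then a loop matches the attributes one at a time, and const startTagClose = /^\s*(\/?)>/ matches the end. Attributes with a dynamic argument (for example v-bind:[key], :[key], @[event] or #[slot]) are matched with this regular expression: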


ts
const dynamicArgAttribute = /^\s*((?:v-[\w-]+:|@|:|#)\[[^=]+?\][^\s"'<>\/=]*)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

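Every other attribute, including plain attributes and directives such as v-for, v-if, :value and @click, is matched with the following regular expression: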


ts
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

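In other words, a start tag can match something like <div v-for="(item, index) in arr" @click="clickHandler" class="item">.

Text: anything that does not start with < is text, and everything up to the start of the next tag is treated as text.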
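
The order of these checks can be sketched as a small classifier. This is only an illustration: classify is a hypothetical helper, and ncname is simplified here to ASCII characters, whereas the real definition in Vue also accepts a range of Unicode characters.

```typescript
// Simplified assumption: ASCII-only tag names.
const ncname = '[a-zA-Z_][\\-\\.0-9_a-zA-Z]*'
const qnameCapture = `((?:${ncname}\\:)?${ncname})`
const startTagOpen = new RegExp(`^<${qnameCapture}`)
const endTag = new RegExp(`^<\\/${qnameCapture}[^>]*>`)
const comment = /^<!\--/
const conditionalComment = /^<!\[/
const doctype = /^<!DOCTYPE [^>]+>/i

// Hypothetical helper: names the branch of the parseHTML loop that would
// handle the beginning of html. The checks run in the same order as the loop.
function classify(html: string): string {
  if (html.indexOf('<') !== 0) return 'text'
  if (comment.test(html)) return 'comment'
  if (conditionalComment.test(html)) return 'conditional comment'
  if (doctype.test(html)) return 'doctype'
  if (endTag.test(html)) return 'end tag'
  if (startTagOpen.test(html)) return 'start tag'
  // A "<" that starts none of the above is treated as text.
  return 'text'
}

const kinds = [
  '<!-- note -->',
  '<![if IE]>',
  '<!DOCTYPE html>',
  '</div>',
  '<div class="item">',
  'Current Value is',
  '< 5'
].map(classify)
```

The last input shows the forgiving case: a < that is not followed by a tag name is not a start tag, so it ends up in the text branch.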

Start tags

Next, how each kind of matched content is turned into the AST.

Parsing the start tag

The following code handles a start tag:


ts
const startTagMatch = parseStartTag()
if (startTagMatch) {
  handleStartTag(startTagMatch)
  continue
}

For a start tag, parseStartTag extracts the tag name and all of its attributes:


ts
function parseStartTag() {
  const start = html.match(startTagOpen)
  if (start) {
    const match: any = {
      tagName: start[1],
      attrs: [],
      start: index
    }
    advance(start[0].length)
    let end, attr
    while (
      // startTagClose matches the end of the start tag: /^\s*(\/?)>/
      !(end = html.match(startTagClose)) &&
      // The attribute matching described above: dynamic-argument attributes first, then ordinary ones
      (attr = html.match(dynamicArgAttribute) || html.match(attribute))
    ) {
      attr.start = index
      advance(attr[0].length)
      attr.end = index
      match.attrs.push(attr)
    }
    if (end) {
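      // end[1] is non-empty when the group in /^\s*(\/?)>/ matched, e.g. <img/>, which marks a self-closing tag
      // Later on only non-unary tags are pushed onto the stack; unary tags are closed right away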
      match.unarySlash = end[1]
      advance(end[0].length)
      match.end = index
      return match
    }
  }
}
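
A standalone sketch of the same loop may help. parseStartTagSketch is a hypothetical function that works on its own copy of the string instead of the shared html and index, uses a simplified ASCII-only ncname, and already collapses each match into a name/value pair (which Vue does later, in handleStartTag).

```typescript
// Simplified assumption: ASCII-only tag names.
const ncname = '[a-zA-Z_][\\-\\.0-9_a-zA-Z]*'
const startTagOpen = new RegExp(`^<((?:${ncname}\\:)?${ncname})`)
const startTagClose = /^\s*(\/?)>/
const dynamicArgAttribute = /^\s*((?:v-[\w-]+:|@|:|#)\[[^=]+?\][^\s"'<>\/=]*)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/
const attribute = /^\s*([^\s"'<>\/=]+)(?:\s*(=)\s*(?:"([^"]*)"+|'([^']*)'+|([^\s"'=<>`]+)))?/

interface StartTagMatch {
  tagName: string
  attrs: { name: string; value: string }[]
  unarySlash: string
  end: number
}

function parseStartTagSketch(input: string): StartTagMatch | undefined {
  let html = input
  let index = 0
  const advance = (n: number): void => {
    index += n
    html = html.substring(n)
  }
  const open = html.match(startTagOpen)
  if (!open) return undefined
  const match: StartTagMatch = { tagName: open[1], attrs: [], unarySlash: '', end: 0 }
  advance(open[0].length)
  let end: RegExpMatchArray | null = null
  for (;;) {
    // Stop at ">" or "/>", otherwise consume one attribute per iteration.
    end = html.match(startTagClose)
    if (end) break
    const attr = html.match(dynamicArgAttribute) || html.match(attribute)
    if (!attr) break
    advance(attr[0].length)
    match.attrs.push({ name: attr[1], value: attr[3] || attr[4] || attr[5] || '' })
  }
  if (!end) return undefined
  match.unarySlash = end[1]
  advance(end[0].length)
  match.end = index
  return match
}

const tpl = '<div v-for="(item, index) in arr" @click="clickHandler" class="item">'
const divMatch = parseStartTagSketch(tpl)
const imgMatch = parseStartTagSketch('<img src="a.png"/>')
```

Here divMatch carries the three attributes v-for, @click and class with an empty unarySlash, while imgMatch has unarySlash set to "/".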

The start tag is then processed by handleStartTag:


ts
function handleStartTag(match) {
  const tagName = match.tagName
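  // true when running on the web (no other value seems to be used; perhaps it is left for custom builds)
  if (expectHTML) {
    // lastTag is the last tag on the stack
    // isNonPhrasingTag lists tags that cannot be nested inside a p, such as div, h1 or li; when one shows up, the open p is closed right away
    if (lastTag === 'p' && isNonPhrasingTag(tagName)) {
      // Closes the tag, see below
      parseEndTag(lastTag)
    }
    // canBeLeftOpenTag lists tags whose end tag may be omitted, such as li, p or td;
    // when the same tag opens again, e.g. <li>a<li>b, the previous one is closed implicitly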
    if (canBeLeftOpenTag(tagName) && lastTag === tagName) {
      parseEndTag(tagName)
    }
  }
  // ...
}
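
The two implicit-closing rules can be tried out in isolation. The sketch below is an assumption-laden simplification: openTags is a hypothetical function, the two tag lists are small illustrative subsets of the ones Vue uses, and close stands in for parseEndTag.

```typescript
// Illustrative subsets only; the real lists in Vue are longer.
const isNonPhrasingTag = (tag: string): boolean => ['div', 'h1', 'li', 'form'].indexOf(tag) >= 0
const canBeLeftOpenTag = (tag: string): boolean => ['li', 'p', 'td', 'tr'].indexOf(tag) >= 0

// Feeds a sequence of start tag names through the two rules and logs what happens.
function openTags(tags: string[]): string[] {
  const stack: string[] = []
  const log: string[] = []
  // Simplified stand-in for parseEndTag: closes everything down to the nearest matching tag.
  const close = (tag: string): void => {
    const pos = stack.lastIndexOf(tag)
    if (pos < 0) return
    while (stack.length > pos) log.push(`close ${stack.pop()}`)
  }
  for (const tag of tags) {
    // Rule 1: a non-phrasing tag cannot live inside an open p.
    if (stack[stack.length - 1] === 'p' && isNonPhrasingTag(tag)) close('p')
    // Rule 2: a tag that may be left open is closed when the same tag opens again.
    if (canBeLeftOpenTag(tag) && stack[stack.length - 1] === tag) close(tag)
    stack.push(tag)
    log.push(`open ${tag}`)
  }
  return log
}

const listLog = openTags(['ul', 'li', 'li'])
const paragraphLog = openTags(['p', 'div'])
```

listLog records that the first li is closed before the second one opens, and paragraphLog records that the p is closed before the div opens, which mirrors how browsers repair the same markup.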

Now the main flow, which is easy to follow: the attributes matched by parseStartTag are assembled into objects.


ts
function handleStartTag(match) {
  // ...
  const l = match.attrs.length
  const attrs: ASTAttr[] = new Array(l)
  for (let i = 0; i < l; i++) {
    // Splits each attribute around the =, e.g. class="item" → { name: "class", value: "item" }
    const args = match.attrs[i]
    const value = args[3] || args[4] || args[5] || ''
    const shouldDecodeNewlines =
      tagName === 'a' && args[1] === 'href'
        ? options.shouldDecodeNewlinesForHref
        : options.shouldDecodeNewlines
    attrs[i] = {
      name: args[1],
      value: decodeAttr(value, shouldDecodeNewlines)
    }
  }
  // ...
}

The code below pushes non-unary start tags onto the stack so that later end tags can be matched against them:


ts
function handleStartTag(match) {
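  // unarySlash is the "/" captured from a self-closing tag such as <img/>; unary is also true for tags that isUnaryTag reports as void, such as <br>, so both spellings are accepted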
  const unarySlash = match.unarySlash
  // ...
  const unary = isUnaryTag(tagName) || !!unarySlash
  // ...
  if (!unary) {
    stack.push({
      tag: tagName,
      lowerCasedTag: tagName.toLowerCase(),
      attrs: attrs,
      start: match.start,
      end: match.end
    })
    lastTag = tagName
  }
  if (options.start) {
    options.start(tagName, attrs, unary, match.start, match.end)
  }
  
}

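This is where options.start begins assembling the AST.

AST node for a start tag

The start hook, together with the createASTElement helper it calls, looks like this: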


ts
export function createASTElement(
  tag: string,
  attrs: Array<ASTAttr>,
  parent: ASTElement | void
): ASTElement {
  return {
    type: 1,
    tag,
    attrsList: attrs,
    attrsMap: makeAttrsMap(attrs),
    rawAttrsMap: {},
    parent,
    children: []
  }
}
start(tag, attrs, unary, start, end) {
  let element: ASTElement = createASTElement(tag, attrs, currentParent)
  if (!root) {
    root = element
  }
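  // Omitted: handling of special attributes such as v-for and v-if
  // Omitted: handling of v-pre and svg tags
  // Omitted: optimizations for static attributes, static styles and input tags
  // More on these another time...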
  
  if (!unary) {
    currentParent = element
    stack.push(element)
  } else {
    // A unary tag is closed right away, see the end tag section
    closeElement(element)
  }
}

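The core of it is pushing the start tag's AST node onto the stack, or closing it right away if it is a unary tag.

End tags

Parsing the end tag

An end tag is handled by calling parseEndTag: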


ts
// End tag:
const endTagMatch = html.match(endTag)
if (endTagMatch) {
  const curIndex = index
  advance(endTagMatch[0].length)
  parseEndTag(endTagMatch[1], curIndex, index)
  continue
}

parseEndTag searches the stack for the nearest unclosed tag with the same name:


ts
function parseEndTag(tagName?: any, start?: any, end?: any) {
  let pos, lowerCasedTagName
  if (start == null) start = index
  if (end == null) end = index
  // Find the closest opened tag of the same type
  if (tagName) {
    lowerCasedTagName = tagName.toLowerCase()
    for (pos = stack.length - 1; pos >= 0; pos--) {
      if (stack[pos].lowerCasedTag === lowerCasedTagName) {
        break
      }
    }
  } else {
    // If no tag name is provided, clean shop
    pos = 0
  }
  // ...
}

Every tag from the top of the stack down to the matching position is then closed, in reverse order:


ts
function parseEndTag(tagName?: any, start?: any, end?: any) {
  let pos, lowerCasedTagName
  // ...
  if (pos >= 0) {
    // Close all the open elements, up the stack
    for (let i = stack.length - 1; i >= pos; i--) {
      if (options.end) {
        options.end(stack[i].tag, start, end)
      }
    }
    // Remove the open elements from the stack
    stack.length = pos
    lastTag = pos && stack[pos - 1].tag
  } else if (lowerCasedTagName === 'br') {
    // A lone </br> end tag is treated as <br>, matching browser behavior
    if (options.start) {
      options.start(tagName, [], true, start, end)
    }
  } else if (lowerCasedTagName === 'p') {
    // A lone </p> end tag is expanded to <p></p>, matching browser behavior
    if (options.start) {
      options.start(tagName, [], false, start, end)
    }
    if (options.end) {
      options.end(tagName, start, end)
    }
  }
}
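
The search-then-close behavior is easy to verify on a bare stack. closeTags below is a hypothetical reduction of parseEndTag that returns the tags it would report through options.end and leaves out the br and p special cases.

```typescript
interface OpenTag {
  tag: string
  lowerCasedTag: string
}

// Returns the tags that would be closed, innermost first, and truncates the stack.
function closeTags(stack: OpenTag[], tagName?: string): string[] {
  const closed: string[] = []
  let pos = 0
  if (tagName) {
    // Find the closest opened tag of the same type; pos ends at -1 if there is none.
    const lower = tagName.toLowerCase()
    for (pos = stack.length - 1; pos >= 0; pos--) {
      if (stack[pos].lowerCasedTag === lower) break
    }
  }
  if (pos >= 0) {
    for (let i = stack.length - 1; i >= pos; i--) closed.push(stack[i].tag)
    stack.length = pos
  }
  return closed
}

const openStack: OpenTag[] = ['div', 'ul', 'li'].map(tag => ({ tag, lowerCasedTag: tag }))
const closedByUl = closeTags(openStack, 'UL') // closes li first, then ul
const closedBySpan = closeTags(openStack, 'span') // no matching start tag: nothing is closed
const closedAtEnd = closeTags(openStack) // no tag name: closes whatever is left
```

Closing ul also closes the li that was never explicitly closed, which is how the parser tolerates missing end tags inside a properly closed parent.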

AST node for an end tag

Here is how options.end produces the AST node:


ts
end(tag, start, end) {
  const element = stack[stack.length - 1]
  // pop stack
  stack.length -= 1
  currentParent = stack[stack.length - 1]
  closeElement(element)
}

All this code really does is pop the AST stack. closeElement takes care of the various properties on the AST node:


ts
function closeElement(element) {
  trimEndingWhitespace(element)
  if (!inVPre && !element.processed) {
    element = processElement(element, options)
  }
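  // v-if logic...
  if (currentParent && !element.forbidden) {
    if (element.elseif || element.else) {
      // v-if logic...
    } else {
      // slot logic...
      // Add the current node to its parent (the nearest unclosed non-unary tag) and record the parent on the node
      currentParent.children.push(element)
      element.parent = currentParent
    }
  }
  for (let i = 0; i < postTransforms.length; i++) {
    // Post-transform hooks contributed by platform modules. As far as I can tell the built-in web modules register none here: static class and style go through transforms in processElement, and input handling through preTransforms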
    postTransforms[i](element, options)
  }
}
export function processElement(element: ASTElement, options: CompilerOptions) {
  // Handles key, ref, slots, components and so on for the AST node
  processKey(element)
  element.plain =
    !element.key && !element.scopedSlots && !element.attrsList.length
  processRef(element)
  processSlotContent(element)
  processSlotOutlet(element)
  processComponent(element)
  for (let i = 0; i < transforms.length; i++) {
    element = transforms[i](element, options) || element
  }
  processAttrs(element)
  return element
}
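
Stripped of all attribute processing, the start and end hooks plus closeElement come down to a small stack discipline. The sketch below assumes a hypothetical createTreeBuilder with simplified hook signatures; it is not the parser's real API, only the parent/child bookkeeping.

```typescript
interface ElementNode {
  type: 1
  tag: string
  parent?: ElementNode
  children: ElementNode[]
}

function createTreeBuilder() {
  const stack: ElementNode[] = []
  let root: ElementNode | undefined
  let currentParent: ElementNode | undefined

  // Attach a finished element to the nearest unclosed ancestor.
  function closeElement(element: ElementNode): void {
    if (currentParent) {
      currentParent.children.push(element)
      element.parent = currentParent
    }
  }

  return {
    start(tag: string, unary: boolean): void {
      const element: ElementNode = { type: 1, tag, parent: currentParent, children: [] }
      if (!root) root = element
      if (!unary) {
        currentParent = element
        stack.push(element)
      } else {
        closeElement(element)
      }
    },
    end(): void {
      const element = stack[stack.length - 1]
      if (!element) return
      stack.length -= 1
      currentParent = stack[stack.length - 1]
      closeElement(element)
    },
    getRoot: (): ElementNode | undefined => root
  }
}

// Simulates the hook calls for <div><p></p><br></div>.
const builder = createTreeBuilder()
builder.start('div', false)
builder.start('p', false)
builder.end()
builder.start('br', true)
builder.end()
const tree = builder.getRoot()
```

Note that an element is attached to its parent only when it is closed, so children appear in the parent's list in the order their end tags (or unary tags) are seen.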

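processAttrs handles the remaining attributes of the AST node. The data bindings most common in day-to-day development are attached to the node here as well. For example: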


html
<div :data="test" @click="clickHandler"></div>

yields the following on the AST:


js
{
  // ...
  attrs: [
    { name: 'data', value: 'test', dynamic: false }
  ],
  events: {
    click: { value: 'clickHandler', dynamic: false }
  }
}

Text

Parsing text

The text-handling part looks like this:


ts
let text, rest, next
if (textEnd >= 0) {
  rest = html.slice(textEnd)
  while (
    !endTag.test(rest) &&
    !startTagOpen.test(rest) &&
    !comment.test(rest) &&
    !conditionalComment.test(rest)
  ) {
    // < in plain text, be forgiving and treat it as text
    next = rest.indexOf('<', 1)
    if (next < 0) break
    textEnd += next
    rest = html.slice(textEnd)
  }
  text = html.substring(0, textEnd)
}
if (textEnd < 0) {
  text = html
}
if (text) {
  advance(text.length)
}
if (options.chars && text) {
  options.chars(text, index - text.length, index)
}
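
The forgiving treatment of a stray < can be checked with a standalone version of the scan. takeText is a hypothetical helper that returns the text one pass of this branch would consume; the regular expressions use a simplified ASCII-only ncname.

```typescript
// Simplified assumption: ASCII-only tag names.
const ncname = '[a-zA-Z_][\\-\\.0-9_a-zA-Z]*'
const qnameCapture = `((?:${ncname}\\:)?${ncname})`
const startTagOpen = new RegExp(`^<${qnameCapture}`)
const endTag = new RegExp(`^<\\/${qnameCapture}[^>]*>`)
const comment = /^<!\--/
const conditionalComment = /^<!\[/

// Returns the leading text of html, skipping over any "<" that does not start a tag or comment.
function takeText(html: string): string {
  let textEnd = html.indexOf('<')
  if (textEnd < 0) return html
  let rest = html.slice(textEnd)
  while (
    !endTag.test(rest) &&
    !startTagOpen.test(rest) &&
    !comment.test(rest) &&
    !conditionalComment.test(rest)
  ) {
    // "<" in plain text: look for the next one.
    const next = rest.indexOf('<', 1)
    if (next < 0) break
    textEnd += next
    rest = html.slice(textEnd)
  }
  return html.substring(0, textEnd)
}

const comparison = takeText('1 < 2 is true</p>') // '1 < 2 is true'
const simple = takeText('hi<b>bold</b>') // 'hi'
```

In the first call the < before 2 is followed by a space, so neither startTagOpen nor endTag matches and the scan moves on to </p>.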

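The string up to the next < that really starts a tag or comment is taken as text. On to options.chars.

AST node for text

options.chars turns text into an AST node; both plain text and text containing interpolations go through it: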


ts
chars(text: string, start?: number, end?: number) {
  if (!currentParent) {
    return
  }
  const children = currentParent.children
  if (inPre || text.trim()) {
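    // Inside <pre>, or the text is not blank: keep it, decoding HTML entities unless the parent is a script or style tag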
    text = isTextTag(currentParent)
      ? text
      : (decodeHTMLCached(text) as string)
  } else if (!children.length) {
    // remove the whitespace-only node right after an opening tag
    text = ''
  } else if (whitespaceOption) {
    // whitespaceOption controls how spaces and line breaks in the template are compressed
    if (whitespaceOption === 'condense') {
      // in condense mode, remove the whitespace node if it contains
      // line break, otherwise condense to a single space
      text = lineBreakRE.test(text) ? '' : ' '
    } else {
      text = ' '
    }
  } else {
    text = preserveWhitespace ? ' ' : ''
  }
  // ...
}

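The whitespace option mentioned above is a Vue compiler option with two values, 'condense' | 'preserve': the former compresses spaces and line breaks, the latter does not. The rest of chars then creates the node: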


ts
chars(text: string, start?: number, end?: number) {
  // ...
  if (text) {
    if (!inPre && whitespaceOption === 'condense') {
      // condense consecutive whitespaces into single space
      text = text.replace(whitespaceRE, ' ')
    }
    let res
    let child: ASTNode | undefined
    // Parse the delimiters syntax {{ ... }}
    if (!inVPre && text !== ' ' && (res = parseText(text, delimiters))) {
      child = {
        type: 2,
        expression: res.expression,
        tokens: res.tokens,
        text
      }
    } else if (
      text !== ' ' ||
      !children.length ||
      children[children.length - 1].text !== ' '
    ) {
      child = {
        type: 3,
        text
      }
    }
    if (child) {
      children.push(child)
    }
  }
}
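
What parseText returns for interpolated text can be approximated as follows. parseTextSketch is a hypothetical, reduced version: it assumes the default {{ }} delimiters and skips filter parsing, so treat it as an illustration of the expression and tokens fields rather than the real implementation.

```typescript
type TextToken = string | { '@binding': string }

interface ParsedText {
  expression: string
  tokens: TextToken[]
}

function parseTextSketch(text: string): ParsedText | undefined {
  const tagRE = /\{\{((?:.|\r?\n)+?)\}\}/g
  // No interpolation at all: the caller falls back to a plain text node (type 3).
  if (!tagRE.test(text)) return undefined
  tagRE.lastIndex = 0
  const parts: string[] = []
  const tokens: TextToken[] = []
  let lastIndex = 0
  let match: RegExpExecArray | null
  while ((match = tagRE.exec(text))) {
    if (match.index > lastIndex) {
      // Literal text before the interpolation becomes a string literal.
      const plain = text.slice(lastIndex, match.index)
      tokens.push(plain)
      parts.push(JSON.stringify(plain))
    }
    // The interpolated expression is wrapped in _s(...), the render helper for stringifying.
    const exp = match[1].trim()
    parts.push(`_s(${exp})`)
    tokens.push({ '@binding': exp })
    lastIndex = match.index + match[0].length
  }
  if (lastIndex < text.length) {
    const plain = text.slice(lastIndex)
    tokens.push(plain)
    parts.push(JSON.stringify(plain))
  }
  return { expression: parts.join('+'), tokens }
}

const parsed = parseTextSketch('Current Value is {{ item }}')
// parsed.expression is '"Current Value is "+_s(item)'
```

The expression string is what ends up in the generated render code, while tokens keeps the structured form of the same text.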

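Depending on whether the text contains an interpolation, an AST node of type 2 or 3 is created. That wraps up text parsing.

Plain text elements

Plain text elements are the three tags style, script and textarea; whatever they contain is treated as plain text. On the web platform style and script are forbidden in templates by default, so in practice only a textarea with text inside falls into this category.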


html
<textarea id="story" name="story" rows="5" cols="33">
It was a dark and stormy night...
</textarea>

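parseHTML takes this branch when the innermost unclosed tag (lastTag) is a plain text element, that is, right after a textarea start tag has been parsed. Tags inside it are not parsed and are treated as plain text.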


ts
let endTagLength = 0
const stackedTag = lastTag.toLowerCase()
const reStackedTag =
  reCache[stackedTag] ||
  (reCache[stackedTag] = new RegExp(
    '([\\s\\S]*?)(</' + stackedTag + '[^>]*>)',
    'i'
  ))
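// Match the content and the end tag
// For example, with <textarea>114514</textarea>, once the start tag has been consumed the remaining 114514</textarea> is matched by ([\\s\\S]*?)(</textarea[^>]*>),
// giving text = "114514" and endTag = "</textarea>" below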
const rest = html.replace(reStackedTag, function (all, text, endTag) {
  endTagLength = endTag.length
  if (!isPlainTextElement(stackedTag) && stackedTag !== 'noscript') {
    // Unwrap comments and CDATA sections, keeping only the text inside them
    text = text
      .replace(/<!\--([\s\S]*?)-->/g, '$1')
      .replace(/<!\[CDATA\[([\s\S]*?)]]>/g, '$1')
  }
  if (shouldIgnoreFirstNewline(stackedTag, text)) {
    text = text.slice(1)
  }
  // Treat everything as plain text
  if (options.chars) {
    // The text hook, see above
    options.chars(text)
  }
  // Drop the matched plain text element content
  return ''
})
index += html.length - rest.length
html = rest
// Close the tag, see above
parseEndTag(stackedTag, index - endTagLength, index)

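Finishing the match

The loop exits either when the whole template html has been consumed, or when html comes out of all the steps above unchanged, which means what remains is plain text; in that case one final text node is added and parsing stops: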


ts
export function parseHTML(html, options: HTMLParserOptions) {
  let index = 0
  let last, lastTag
  while (html) {
    last = html
    if (!lastTag || !isPlainTextElement(lastTag)) {
      // Comments, conditional comments, doctype, start/end tags, text...
    } else {
      // Plain text elements...
    }
    // html is unchanged, so everything left is text
    if (html === last) {
      options.chars && options.chars(html)
      break
    }
  }
  // Close whatever tags are left on the stack
  parseEndTag()
  // ...
}

Miscellaneous

What is the type of an AST node?

1 is an HTML tag (custom component nodes included), 2 is text containing interpolated expressions, 3 is plain text.
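
Because type is a numeric tag, the three node kinds can be modelled as a discriminated union. The interfaces below are trimmed-down assumptions for illustration (the names carry a Sketch suffix to make clear they are not the real type definitions), keeping only the fields discussed in this post.

```typescript
interface ASTElementSketch {
  type: 1
  tag: string
  children: ASTNodeSketch[]
}

interface ASTExpressionSketch {
  type: 2
  expression: string
  text: string
}

interface ASTTextSketch {
  type: 3
  text: string
}

type ASTNodeSketch = ASTElementSketch | ASTExpressionSketch | ASTTextSketch

// Switching on type lets the compiler narrow the node to the right shape.
function describeNode(node: ASTNodeSketch): string {
  switch (node.type) {
    case 1:
      return `element <${node.tag}> with ${node.children.length} child(ren)`
    case 2:
      return `expression ${node.expression}`
    case 3:
      return `text ${JSON.stringify(node.text)}`
  }
}

const sample: ASTNodeSketch = {
  type: 1,
  tag: 'div',
  children: [
    { type: 2, expression: '"Current Value is "+_s(item)', text: 'Current Value is {{ item }}' },
    { type: 3, text: 'done' }
  ]
}
const descriptions = [sample, ...sample.children].map(describeNode)
```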

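Summary

parse parses the template by calling parseHTML. parseHTML matches the template with regular expressions, uses a stack to pair up unclosed tags, and hands the raw material for each AST node to the hooks. parse then builds the AST nodes in those hooks and ends up with the AST that corresponds to the template.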
