Package org.w3c.tidy

Class Lexer


  • public class Lexer
    extends java.lang.Object
    Lexer for html parser.

    Given a file stream fp it returns a sequence of tokens.

      GetToken(fp) gets the next token
      UngetToken(fp) provides one level undo

    The tags include an attribute list:
      - linked list of attribute/value nodes
      - each node has 2 null-terminated strings
      - entities are replaced in attribute values

    White space is compacted if not in preformatted mode. If not in preformatted mode then leading white space is discarded and subsequent white space sequences compacted to single space chars.

    If XmlTags is no then tag names are folded to upper case and attribute names to lower case.

    Not yet done:
      - Doctype subset and marked sections
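
    The white space rule above can be illustrated with a small standalone sketch. It is not the Lexer's own code: the class and method names are hypothetical, and it works on a String rather than on the lexer buffer. It only shows the stated behaviour: leading white space discarded, later runs compacted to one space, preformatted text left alone.

```java
final class WhitespaceSketch {
    /** Hypothetical helper: compacts white space unless preformatted is true. */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text; // preformatted mode keeps white space as is
        }
        StringBuilder out = new StringBuilder();
        // Starting in the "was white" state discards leading white space.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (white) {
                if (!wasWhite) {
                    out.append(' '); // first white char of a run becomes one space
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + compact("  a \n\t b  ", false) + "]");
    }
}
```

    The boolean plays the same role as the waswhite field listed below, which is described as "used to collapse contiguous white space".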

    Version:
    $Revision: 1100 $ ($Author: aditsu $)
    Author:
    Dave Raggett dsr@w3.org , Andy Quick ac.quick@sympatico.ca (translation to Java), Fabrizio Giustina
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected short badAccess
      for accessibility errors.
      protected short badChars
      for bad char encodings.
      protected boolean badDoctype
      set if html or PUBLIC is missing.
      protected short badForm
      for mismatched/mispositioned form tags.
      protected short badLayout
      for bad style errors.
      protected int columns
      at start of current token.
      protected Configuration configuration
      configuration.
      protected int doctype
      version as given by doctype (if any).
      protected short errors
      count of errors.
      protected java.io.PrintWriter errout
      error output stream.
      protected boolean excludeBlocks
      Netscape compatibility.
      protected boolean exiled
      true if moved out of table.
      static short IGNORE_MARKUP
      state: ignore markup.
      static short IGNORE_WHITESPACE
      state: ignore whitespace.
      protected StreamIn in
      file stream.
      protected Node inode
      Inline stack for compatibility with Mosaic.
      protected int insert
      for inferring inline tags.
      protected boolean insertspace
      when space is moved after end tag.
      protected java.util.Stack istack
      stack.
      protected int istackbase
      start of frame.
      protected boolean isvoyager
      true if xmlns attribute on html element.
      protected byte[] lexbuf
      Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
      protected int lexlength
      allocated.
      protected int lexsize
      used.
      protected int lines
      lines seen.
      static short MIXED_CONTENT
      state: mixed content.
      static short PREFORMATTED
      state: preformatted.
      protected boolean pushed
      true after token has been pushed back.
      protected Report report
      report.
      protected Node root
      Root node is saved here.
      protected boolean seenEndBody
      already seen end body tag?
      protected boolean seenEndHtml
      already seen end html tag?
      protected short state
      state of lexer's finite state machine.
      protected Style styles
      used for cleaning up presentation markup.
      protected Node token
      current node.
      protected int txtend
      end of current node.
      protected int txtstart
      start of current node.
      protected short versions
      bit vector of HTML versions.
      protected short warnings
      count of warnings in this document.
      protected boolean waswhite
      used to collapse contiguous white space.
    • Method Summary

      Modifier and Type Method Description
      void addByte​(int c)
      Adds a byte to lexer buffer.
      void addCharToLexer​(int c)
      Store char c as UTF-8 encoded byte stream.
      boolean addGenerator​(Node root)
      Add meta element for Tidy.
      void addStringLiteral​(java.lang.String str)
      calls addCharToLexer for each char in the string.
      void addStringToLexer​(java.lang.String str)
      Adds a string to lexer buffer.
      short apparentVersion()
      Return the html version used in document.
      boolean canPrune​(Node element)
      Can the given element be removed?
      void changeChar​(byte c)
      Substitute the last char in buffer.
      boolean checkDocTypeKeyWords​(Node doctype)
      Check system keywords (keywords should be uppercase).
      AttVal cloneAttributes​(AttVal attrs)
      Clones an attribute value and adds any asp or php node to the node list.
      Node cloneNode​(Node node)
      Clones a node and adds it to the node list.
      void deferDup()
      Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
      boolean endOfInput()
      Has end of input stream been reached?
      short findGivenVersion​(Node doctype)
      Examine DOCTYPE to identify version.
      boolean fixDocType​(Node root)
      Fixup doctype if missing.
      void fixHTMLNameSpace​(Node root, java.lang.String profile)
      Fix xhtml namespace.
      void fixId​(Node node)
      duplicate name attribute as an id and check if id and name match.
      boolean fixXmlDecl​(Node root)
      Ensure XML document starts with <?xml version="1.0"?>.
      Node getCDATA​(Node container)
      Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
      Node getToken​(short mode)
      Gets a token.
      short htmlVersion()
      Choose what version to use for new doctype.
      java.lang.String htmlVersionName()
      Choose what version to use for new doctype.
      Node inferredTag​(java.lang.String name)
      Generates and inserts a new node.
      int inlineDup​(Node node)
      This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc.
      Node insertedToken()  
      static boolean isCSS1Selector​(java.lang.String buf)
      In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
      boolean isPushed​(Node node)
      Is the node in the stack?
      static boolean isValidAttrName​(java.lang.String attr)
      Check if attr is a valid name.
      Node newLineNode()
      Adds a new line node.
      Node newNode()
      Creates a new node and adds it to the node list.
      Node newNode​(short type, byte[] textarray, int start, int end)
      Creates a new node and adds it to the node list.
      Node newNode​(short type, byte[] textarray, int start, int end, java.lang.String element)
      Creates a new node and adds it to the node list.
      Node parseAsp()
      Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value.
      java.lang.String parseAttribute​(boolean[] isempty, Node[] asp, Node[] php)
      consumes the '>' terminating start tags.
      AttVal parseAttrs​(boolean[] isempty)
      Parse tag attributes.
      void parseEntity​(short mode)
      Parse an html entity.
      Node parsePhp()
      PHP is like ASP but is based upon XML processing instructions, e.g.
      int parseServerInstruction()
      Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
      char parseTagName()
      Parses a tag name.
      java.lang.String parseValue​(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
      Parse an attribute value.
      void popInline​(Node node)
      Pop a copy of an inline node from the stack.
      protected boolean preContent​(Node node)
      Is content acceptable for pre elements?
      void pushInline​(Node node)
      Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
      boolean setXHTMLDocType​(Node root)
      Adds a new xhtml doctype to the document.
      void ungetToken()  
      protected void updateNodeTextArrays​(byte[] oldtextarray, byte[] newtextarray)
      Update oldtextarray in the current nodes.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • IGNORE_WHITESPACE

        public static final short IGNORE_WHITESPACE
        state: ignore whitespace.
        See Also:
        Constant Field Values
      • MIXED_CONTENT

        public static final short MIXED_CONTENT
        state: mixed content.
        See Also:
        Constant Field Values
      • PREFORMATTED

        public static final short PREFORMATTED
        state: preformatted.
        See Also:
        Constant Field Values
      • IGNORE_MARKUP

        public static final short IGNORE_MARKUP
        state: ignore markup.
        See Also:
        Constant Field Values
      • in

        protected StreamIn in
        file stream.
      • errout

        protected java.io.PrintWriter errout
        error output stream.
      • badAccess

        protected short badAccess
        for accessibility errors.
      • badLayout

        protected short badLayout
        for bad style errors.
      • badChars

        protected short badChars
        for bad char encodings.
      • badForm

        protected short badForm
        for mismatched/mispositioned form tags.
      • warnings

        protected short warnings
        count of warnings in this document.
      • errors

        protected short errors
        count of errors.
      • lines

        protected int lines
        lines seen.
      • columns

        protected int columns
        at start of current token.
      • waswhite

        protected boolean waswhite
        used to collapse contiguous white space.
      • pushed

        protected boolean pushed
        true after token has been pushed back.
      • insertspace

        protected boolean insertspace
        when space is moved after end tag.
      • excludeBlocks

        protected boolean excludeBlocks
        Netscape compatibility.
      • exiled

        protected boolean exiled
        true if moved out of table.
      • isvoyager

        protected boolean isvoyager
        true if xmlns attribute on html element.
      • versions

        protected short versions
        bit vector of HTML versions.
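
        One common way to use such a bit vector is to start with every version allowed and clear bits as constructs are seen that only some versions permit. The sketch below illustrates that idea only. The constant names and values are hypothetical and are not taken from this class.

```java
final class VersionBitsSketch {
    // Hypothetical flag values, one bit per HTML version.
    static final short HTML20 = 1;
    static final short HTML32 = 2;
    static final short HTML40_STRICT = 4;
    static final short HTML40_LOOSE = 8;
    static final short ALL = HTML20 | HTML32 | HTML40_STRICT | HTML40_LOOSE;

    /** Keeps only the versions that also allow the construct just seen. */
    static short constrain(short versions, short allowedByConstruct) {
        return (short) (versions & allowedByConstruct);
    }

    public static void main(String[] args) {
        short versions = ALL;
        // A construct assumed to exist only in HTML 3.2 and HTML 4.0 loose.
        versions = constrain(versions, (short) (HTML32 | HTML40_LOOSE));
        // A construct assumed to exist only in the two HTML 4.0 variants.
        versions = constrain(versions, (short) (HTML40_STRICT | HTML40_LOOSE));
        System.out.println((versions & HTML40_LOOSE) != 0);
    }
}
```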
      • doctype

        protected int doctype
        version as given by doctype (if any).
      • badDoctype

        protected boolean badDoctype
        set if html or PUBLIC is missing.
      • txtstart

        protected int txtstart
        start of current node.
      • txtend

        protected int txtend
        end of current node.
      • state

        protected short state
        state of lexer's finite state machine.
      • token

        protected Node token
        current node.
      • lexbuf

        protected byte[] lexbuf
        Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. Lexsize must be reset for each file. Byte buffer of UTF-8 chars.
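
        A minimal sketch of the idea, under assumptions: one growable UTF-8 byte buffer, with each piece of text identified by a start and end offset into it (compare txtstart and txtend). The class and method names are hypothetical, and the growth policy (doubling) is an assumption rather than something this page states.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LexBufSketch {
    private byte[] buf = new byte[8]; // buf.length plays the role of "allocated"
    private int size;                 // plays the role of "used"

    /** Appends one byte, growing the buffer when it is full. */
    void addByte(int c) {
        if (size == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[size++] = (byte) c;
    }

    /** Appends text as UTF-8 and returns its {start, end} span. */
    int[] addText(String s) {
        int start = size;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            addByte(b);
        }
        return new int[] { start, size };
    }

    /** Decodes the bytes in [start, end) back into a String. */
    String text(int start, int end) {
        return new String(buf, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LexBufSketch lex = new LexBufSketch();
        int[] first = lex.addText("h\u00e9llo");
        int[] second = lex.addText(" world!");
        System.out.println(lex.text(first[0], first[1]) + lex.text(second[0], second[1]));
    }
}
```

        Because the sketch stores only offsets, growing the array needs no fix-up. A design in which nodes hold a reference to the array itself would have to repoint them after reallocation, which may be what updateNodeTextArrays is for.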
      • lexlength

        protected int lexlength
        allocated.
      • lexsize

        protected int lexsize
        used.
      • inode

        protected Node inode
        Inline stack for compatibility with Mosaic. For deferring text node.
      • insert

        protected int insert
        for inferring inline tags.
      • istack

        protected java.util.Stack istack
        stack.
      • istackbase

        protected int istackbase
        start of frame.
      • styles

        protected Style styles
        used for cleaning up presentation markup.
      • configuration

        protected Configuration configuration
        configuration.
      • seenEndBody

        protected boolean seenEndBody
        already seen end body tag?
      • seenEndHtml

        protected boolean seenEndHtml
        already seen end html tag?
      • report

        protected Report report
        report.
      • root

        protected Node root
        Root node is saved here.
    • Constructor Detail

      • Lexer

        public Lexer​(StreamIn in,
                     Configuration configuration,
                     Report report)
        Instantiates a new Lexer.
        Parameters:
        in - StreamIn
        configuration - configuration instance
        report - report instance, for reporting errors
    • Method Detail

      • newNode

        public Node newNode()
        Creates a new node and adds it to the node list.
        Returns:
        Node
      • newNode

        public Node newNode​(short type,
                            byte[] textarray,
                            int start,
                            int end)
        Creates a new node and adds it to the node list.
        Parameters:
        type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
        textarray - array of bytes contained in the Node
        start - start position
        end - end position
        Returns:
        Node
      • newNode

        public Node newNode​(short type,
                            byte[] textarray,
                            int start,
                            int end,
                            java.lang.String element)
        Creates a new node and adds it to the node list.
        Parameters:
        type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
        textarray - array of bytes contained in the Node
        start - start position
        end - end position
        element - tag name
        Returns:
        Node
      • cloneNode

        public Node cloneNode​(Node node)
        Clones a node and adds it to the node list.
        Parameters:
        node - Node
        Returns:
        cloned Node
      • cloneAttributes

        public AttVal cloneAttributes​(AttVal attrs)
        Clones an attribute value and adds any asp or php node to the node list.
        Parameters:
        attrs - original AttVal
        Returns:
        cloned AttVal
      • updateNodeTextArrays

        protected void updateNodeTextArrays​(byte[] oldtextarray,
                                            byte[] newtextarray)
        Update oldtextarray in the current nodes.
        Parameters:
        oldtextarray - previous text array
        newtextarray - new text array
      • newLineNode

        public Node newLineNode()
        Adds a new line node. Used for creating preformatted text from Word2000.
        Returns:
        new line node
      • endOfInput

        public boolean endOfInput()
        Has end of input stream been reached?
        Returns:
        true if the end of the input stream has been reached
      • addByte

        public void addByte​(int c)
        Adds a byte to lexer buffer.
        Parameters:
        c - byte to add
      • changeChar

        public void changeChar​(byte c)
        Substitute the last char in buffer.
        Parameters:
        c - new char
      • addCharToLexer

        public void addCharToLexer​(int c)
        Store char c as UTF-8 encoded byte stream.
        Parameters:
        c - char to store
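
        For reference, the standard UTF-8 encoding of a single code point can be sketched as below. This is a generic illustration, not the method's actual body: the class and method names are hypothetical, the result goes to a ByteArrayOutputStream instead of the lexer buffer, and the input is not validated (surrogates and values above 0x10FFFF are not rejected).

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    /** Returns the UTF-8 bytes for code point c (no validation). */
    static byte[] encode(int c) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (c < 0x80) {
            out.write(c);                          // 1 byte: 0xxxxxxx
        } else if (c < 0x800) {
            out.write(0xC0 | (c >> 6));            // 2 bytes: 110xxxxx 10xxxxxx
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out.write(0xE0 | (c >> 12));           // 3 bytes
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            out.write(0xF0 | (c >> 18));           // 4 bytes
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Prints the number of bytes used for U+0041, U+00E9 and U+20AC.
        System.out.println(encode(0x41).length + " " + encode(0xE9).length + " " + encode(0x20AC).length);
    }
}
```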
      • addStringToLexer

        public void addStringToLexer​(java.lang.String str)
        Adds a string to lexer buffer.
        Parameters:
        str - String to add
      • parseEntity

        public void parseEntity​(short mode)
        Parse an html entity.
        Parameters:
        mode - mode
      • parseTagName

        public char parseTagName()
        Parses a tag name.
        Returns:
        first char after the tag name
      • addStringLiteral

        public void addStringLiteral​(java.lang.String str)
        calls addCharToLexer for each char in the string.
        Parameters:
        str - input String
      • htmlVersion

        public short htmlVersion()
        Choose what version to use for new doctype.
        Returns:
        html version constant
      • htmlVersionName

        public java.lang.String htmlVersionName()
        Choose what version to use for new doctype.
        Returns:
        html version name
      • addGenerator

        public boolean addGenerator​(Node root)
        Add meta element for Tidy. If the meta tag is already present, update release date.
        Parameters:
        root - root node
        Returns:
        true if the tag has been added
      • checkDocTypeKeyWords

        public boolean checkDocTypeKeyWords​(Node doctype)
        Check system keywords (keywords should be uppercase).
        Parameters:
        doctype - doctype node
        Returns:
        true if doctype keywords are all uppercase
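
        A check of this kind can be sketched as a search for keywords that appear in the text in anything other than upper case. The sketch below is an assumption-laden illustration: it works on a plain String rather than a doctype Node, the keyword list (SYSTEM, PUBLIC) is assumed, and the class and method names are hypothetical.

```java
final class DoctypeKeywordSketch {
    // Assumed keyword list; the real method may check more or different ones.
    private static final String[] KEYWORDS = { "SYSTEM", "PUBLIC" };

    /** True if keyword occurs in text with some letter not in upper case. */
    static boolean hasWrongCase(String keyword, String text) {
        int n = keyword.length();
        for (int i = 0; i + n <= text.length(); i++) {
            boolean sameIgnoringCase = text.regionMatches(true, i, keyword, 0, n);
            boolean sameExactly = text.regionMatches(false, i, keyword, 0, n);
            if (sameIgnoringCase && !sameExactly) {
                return true;
            }
        }
        return false;
    }

    /** True if every keyword that occurs in the text is written in upper case. */
    static boolean keywordsUppercase(String doctypeText) {
        for (String keyword : KEYWORDS) {
            if (hasWrongCase(keyword, doctypeText)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(keywordsUppercase("html PUBLIC \"-//W3C//DTD HTML 4.01//EN\""));
        System.out.println(keywordsUppercase("html public \"-//W3C//DTD HTML 4.01//EN\""));
    }
}
```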
      • findGivenVersion

        public short findGivenVersion​(Node doctype)
        Examine DOCTYPE to identify version.
        Parameters:
        doctype - doctype node
        Returns:
        version code
      • fixHTMLNameSpace

        public void fixHTMLNameSpace​(Node root,
                                     java.lang.String profile)
        Fix xhtml namespace.
        Parameters:
        root - root Node
        profile - current profile
      • setXHTMLDocType

        public boolean setXHTMLDocType​(Node root)
        Adds a new xhtml doctype to the document.
        Parameters:
        root - root node
        Returns:
        true if a doctype has been added
      • apparentVersion

        public short apparentVersion()
        Return the html version used in document.
        Returns:
        version code
      • fixDocType

        public boolean fixDocType​(Node root)
        Fixup doctype if missing.
        Parameters:
        root - root node
        Returns:
        false if current version has not been identified
      • fixXmlDecl

        public boolean fixXmlDecl​(Node root)
        Ensure XML document starts with <?xml version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
        Parameters:
        root - root node
        Returns:
        always true
      • inferredTag

        public Node inferredTag​(java.lang.String name)
        Generates and inserts a new node.
        Parameters:
        name - tag name
        Returns:
        generated node
      • getCDATA

        public Node getCDATA​(Node container)
        Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
        Parameters:
        container - container node
        Returns:
        cdata node
      • ungetToken

        public void ungetToken()
      • getToken

        public Node getToken​(short mode)
        Gets a token.
        Parameters:
        mode - one of the following:
        • MixedContent -- for elements which don't accept PCDATA
        • Preformatted -- white space preserved as is
        • IgnoreMarkup -- for CDATA elements such as script, style
        Returns:
        next Node
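
        The class description says that ungetToken provides one level of undo, and the pushed field is documented as "true after token has been pushed back". A minimal sketch of that pattern follows, with tokens simplified to Strings read from an Iterator. The class is hypothetical and leaves out modes and everything else the real lexer does.

```java
import java.util.Arrays;
import java.util.Iterator;

final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the pushed-back token if there is one, else the next token or null. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken() repeats the last token. */
    void ungetToken() {
        pushed = true;
    }

    public static void main(String[] args) {
        PushbackSketch lexer = new PushbackSketch(Arrays.asList("<p>", "text", "</p>").iterator());
        String first = lexer.getToken();
        lexer.ungetToken();
        System.out.println(first.equals(lexer.getToken()));
    }
}
```

        Calling ungetToken twice in a row has no further effect here, which is what "one level" means in this sketch.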
      • parseAsp

        public Node parseAsp()
        Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values:
          href='<%=rsSchool.Fields("ID").Value%>'
        where the ASP that generates the attribute value is masked from Tidy by the quote marks.
        Returns:
        parsed Node
      • parsePhp

        public Node parsePhp()
        PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
        Returns:
        parsed Node
      • parseAttribute

        public java.lang.String parseAttribute​(boolean[] isempty,
                                               Node[] asp,
                                               Node[] php)
        consumes the '>' terminating start tags.
        Parameters:
        isempty - flag is passed as array so it can be modified
        asp - asp Node, passed as array so it can be modified
        php - php Node, passed as array so it can be modified
        Returns:
        parsed attribute
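
        Java passes primitives by value, so a method that needs to hand back a second result alongside its return value can take a one-element array and write into slot 0. The sketch below shows that idiom on its own. The method and its parsing rule are hypothetical and much simpler than parseAttribute.

```java
final class OutParamSketch {
    /**
     * Hypothetical helper: returns the tag name and reports through
     * isempty[0] whether the text ended with a '/' (as in "br/").
     */
    static String parseTagText(String text, boolean[] isempty) {
        boolean selfClosing = text.endsWith("/");
        isempty[0] = selfClosing; // visible to the caller after the call returns
        return selfClosing ? text.substring(0, text.length() - 1) : text;
    }

    public static void main(String[] args) {
        boolean[] isempty = { false };
        String name = parseTagText("br/", isempty);
        System.out.println(name + " " + isempty[0]);
    }
}
```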
      • parseServerInstruction

        public int parseServerInstruction()
        Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
        Returns:
        delimiter
      • parseValue

        public java.lang.String parseValue​(java.lang.String name,
                                           boolean foldCase,
                                           boolean[] isempty,
                                           int[] pdelim)
        Parse an attribute value.
        Parameters:
        name - attribute name
        foldCase - fold case?
        isempty - is attribute empty? Passed as an array reference to allow modification
        pdelim - delimiter, passed as an array reference to allow modification
        Returns:
        parsed value
      • isValidAttrName

        public static boolean isValidAttrName​(java.lang.String attr)
        Check if attr is a valid name.
        Parameters:
        attr - String to check, must be non-null
        Returns:
        true if attr is a valid name.
      • isCSS1Selector

        public static boolean isCSS1Selector​(java.lang.String buf)
        In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
        Parameters:
        buf - css selector name
        Returns:
        true if the given string is a valid css1 selector name
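
        The rules quoted above can be turned into a small checker. The sketch below is one reading of that text, not the method's actual implementation: the class and method names are hypothetical, lower-case a-z is accepted alongside A-Z as an assumption, and a trailing backslash with nothing after it is treated as invalid.

```java
final class Css1SelectorSketch {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Checks a name against the CSS1 character rules described above. */
    static boolean looksLikeCss1Name(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int i = 0;
        boolean first = true;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= s.length()) {
                    return false; // dangling backslash
                }
                if (isHex(s.charAt(i))) {
                    // Numeric escape: at most four hexadecimal digits.
                    int digits = 0;
                    while (i < s.length() && digits < 4 && isHex(s.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any other single character may be escaped
                }
            } else {
                boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
                boolean digit = c >= '0' && c <= '9';
                boolean high = c >= 161 && c <= 255;
                boolean dash = c == '-';
                if (!(letter || digit || high || dash)) {
                    return false;
                }
                if (first && (digit || dash)) {
                    return false; // cannot start with a dash or a digit
                }
                i++;
            }
            first = false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCss1Name("nav-bar") + " " + looksLikeCss1Name("9lives"));
    }
}
```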
      • parseAttrs

        public AttVal parseAttrs​(boolean[] isempty)
        Parse tag attributes.
        Parameters:
        isempty - is tag empty?
        Returns:
        parsed attribute/value list
      • pushInline

        public void pushInline​(Node node)
        Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
          <p><em> text <p><em> more text
        shouldn't be mapped to
          <p><em> text </em></p><p><em><em> more text </em></em>
        Parameters:
        node - Node to be pushed
      • popInline

        public void popInline​(Node node)
        Pop a copy of an inline node from the stack.
        Parameters:
        node - Node to be popped
      • isPushed

        public boolean isPushed​(Node node)
        Is the node in the stack?
        Parameters:
        node - Node
        Returns:
        true if the node is found in the stack
      • inlineDup

        public int inlineDup​(Node node)
        This has the effect of inserting "missing" inline elements around the contents of block-level elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as will be the case in:
          <i><h1>italic heading</h1></i>
        which is then treated as equivalent to
          <h1><i>italic heading</i></h1>
        This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
        Parameters:
        node - original node
        Returns:
        stack size
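
        The interplay of pushInline, popInline, isPushed and inlineDup can be sketched with tag names in a list. This is a simplified illustration under assumptions: tags are plain Strings, the class and the openBlock method are hypothetical, and the implicit/OBJECT/APPLET exclusions are left out. It shows two points from the descriptions: an inline that is already on the stack is not pushed again, and opening a block replays the open inlines inside it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InlineStackSketch {
    private final List<String> istack = new ArrayList<>();

    boolean isPushed(String tag) {
        return istack.contains(tag);
    }

    /** Pushes an inline tag unless it is already on the stack. */
    void pushInline(String tag) {
        if (!isPushed(tag)) {
            istack.add(tag);
        }
    }

    /** Removes the most recent entry for the given tag, if any. */
    void popInline(String tag) {
        int index = istack.lastIndexOf(tag);
        if (index >= 0) {
            istack.remove(index);
        }
    }

    /** Returns the tokens for opening a block: the block tag, then the open inlines. */
    List<String> openBlock(String block) {
        List<String> tokens = new ArrayList<>();
        tokens.add("<" + block + ">");
        for (String tag : istack) {
            tokens.add("<" + tag + ">");
        }
        return tokens;
    }

    public static void main(String[] args) {
        InlineStackSketch stack = new InlineStackSketch();
        stack.pushInline("i");
        // <i><h1>: the heading is opened first, then the pending <i> is replayed.
        System.out.println(stack.openBlock("h1").equals(Arrays.asList("<h1>", "<i>")));
    }
}
```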
      • insertedToken

        public Node insertedToken()
        Returns:
      • canPrune

        public boolean canPrune​(Node element)
        Can the given element be removed?
        Parameters:
        element - node
        Returns:
        true if the element can be removed
      • fixId

        public void fixId​(Node node)
        duplicate name attribute as an id and check if id and name match.
        Parameters:
        node - Node to check for name/id attributes
      • deferDup

        public void deferDup()
        Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
      • preContent

        protected boolean preContent​(Node node)
        Is content acceptable for pre elements?
        Parameters:
        node - content
        Returns:
        true if node is acceptable in pre elements