HTML5 defines the fifth major revision of the core language of the World Wide Web, HTML. "HTML5 differences from HTML4" describes the differences between HTML4 and HTML5 and provides some of the rationale for the changes. This document may not provide accurate information as the HTML5 specification is still actively in development.
HTML5 is the next revision of the Hypertext Markup Language (HTML), the predominant markup language for describing the contents and structure of Web pages. A markup language is a set of markup tags, and HTML uses these tags to describe web pages.
HTML5 is a W3C specification that defines the fifth major revision of the Hypertext Markup Language (HTML). One of the major changes in HTML5 is in respect to how HTML addresses Web applications. Other new features in HTML5 include specific functions for embedding graphics, audio, video, and interactive documents.
HTML has been in continuous evolution since it was introduced to the Internet in the early 1990s. Some features were introduced in specifications; others were introduced in software releases. In some respects, implementations and author practices have converged with each other and with specifications and standards, but in other ways, they continue to diverge.
In all probability you have heard about HTML5, the latest web standard; some have embraced it, while others are sceptical about its use. The Internet changes so quickly that today’s latest technology is obsolete tomorrow. However, we cannot ignore the fact that yesterday’s technology is the building block of present and future technologies. For instance, when developing a new, stable scripting language we usually turn to native code, as it tends to remain unchanged for a long time.
Fundamentals:
<!DOCTYPE html>
The DOCTYPE is a simplified declaration carried over from earlier versions of HTML. Unlike the HTML 4.01 doctypes, it does not reference a DTD hosted on w3.org. It is compatible with older browsers. By contrast, <!DOCTYPE html5> triggers Quirks Mode in IE6, which breaks backward compatibility; hence <!DOCTYPE html>.
New Elements Introduced:
The W3C community has listened sincerely to suggestions and envisaged a future Internet architecture in which an abundance of features is essential for sustainability and growth. HTML5 includes basic and advanced structural elements to offer a rich browsing experience. For instance, its new APIs allow programmers to create more user-friendly applications and reduce the dependency on Flash for storing data and for intensive animation.
Header Element:
<header> contains introductory information for a page or section. It can include anything from branding information to an entire table of contents.
Navigation Element:
<nav> holds links to other pages or to other sections of the same page. Only primary navigation links should be placed in this element.
Section Element:
<section> represents a generic section of a document or application. It behaves in a similar manner to <div>, separating off a portion of the document.
Article Element:
<article> indicates an independent portion of a page, such as a blog post or a forum entry.
Aside Element:
<aside> marks content that is related to the main area of the document but separate from it. It can be used for pull quotes, related posts and tags.
Footer Element:
<footer> marks the end of a page. It can also be used at the end of each section on the same page.
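To see how the elements described above fit together, here is a minimal sketch written as a small JavaScript helper that returns a page skeleton as a string. The function names (buildPageSkeleton, escapeHtml), the link targets and all of the text content are placeholders of my own, not part of any specification.

```javascript
// Hypothetical helper: escapes text before it is placed into markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: returns a minimal HTML5 page that uses each of the
// structural elements once. All content is placeholder text.
function buildPageSkeleton(title) {
  var safeTitle = escapeHtml(title);
  return [
    '<!DOCTYPE html>',
    '<html>',
    '<head><meta charset="utf-8"><title>' + safeTitle + '</title></head>',
    '<body>',
    '  <header><h1>' + safeTitle + '</h1></header>',
    '  <nav><a href="/">Home</a> <a href="/archive">Archive</a></nav>',
    '  <section>',
    '    <article><h2>First post</h2><p>Post body.</p></article>',
    '    <aside><p>A pull quote related to the post.</p></aside>',
    '  </section>',
    '  <footer><p>Footer text for the page.</p></footer>',
    '</body>',
    '</html>'
  ].join('\n');
}

console.log(buildPageSkeleton('My Blog'));
```

Note that none of the elements needs an id to convey its role; the element name itself carries the meaning.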
Final Thoughts:
At first glance, these new elements seem to replace commonly used DIV ids. However, on closer inspection they behave more like classes and ordinary HTML elements: they can be used repeatedly on a page while retaining their semantic meaning.
HTML 4.01 and XHTML 1.0 have small syntax differences that can invalidate code. Bearing this in mind, HTML5 has built-in “slack” to make the transition easier.
This week thousands of system administrators who use Google products will open their inboxes to find an email from Google explaining that its Website Optimizer product contains a cross-site scripting (XSS) flaw that allows attackers to inject scripts into their Google-optimized web pages.
A part of this email follows:
“you are using a control script that could allow an attacker to execute malicious code on your site. To fix the vulnerable section of code, you should immediately either replace the control scripts in your affected experiments or stop the affected experiments and start new experiments”
On receiving this notification I quickly scrambled to my web sites to implement the fix recommended by Google. Later in the day I had time to dig deeper into the problem and analyse the security flaw in more detail. What I found is a multi-stage attack that relies on cookie injection, improper text parsing and DOM script injection.
I have documented my research in this article, and I hope that it will be of use to you. There is a lot to learn from other people’s mistakes, especially when those people are Google themselves.
The flaw exists in Google’s Website Optimizer, a set of scripts that web administrators use to gain insight into how online customers navigate their web sites.
Below is a segment of the flawed code.
<!-- Google Website Optimizer Control Script -->
<script>
function utmx_section(){}function utmx(){}
(function(){var k='XXXXXXXXXX',d=document,l=d.location,c=d.cookie;function f(n){
if(c){var i=c.indexOf(n+'=');if(i>-1){var j=c.indexOf(';',i);return c.substring(i+n.
length+1,j<0?c.length:j)}}}var x=f('__utmx'),xx=f('__utmxx'),h=l.hash;
d.write('<sc'+'ript src="'+
'http'+(l.protocol=='https:'?'s://ssl':'://www')+'.google-analytics.com'
+'/siteopt.js?v=1&utmxkey='+k+'&utmx='+(x?x:'')+'&utmxx='+(xx?xx:'')+'&utmxtime='
+new Date().valueOf()+(h?'&utmxhash='+escape(h.substr(1)):'')+
'" type="text/javascript" charset="utf-8"></sc'+'ript>')})();
</script><script>utmx("url",'A/B');</script>
<!-- End of Google Website Optimizer Control Script -->
This Website Optimizer Control Script is embedded within your web page to track it. It will be run on the user’s end, and under a successful attack it will extract a malicious script from their cookie and execute it in their browser.
The code above is standard JavaScript; however, it is not easy to read. There are two reasons for this. Firstly, like most Google client-side scripts, it is obfuscated, which makes it cryptic. Secondly, it was designed to run quickly and efficiently, not to be easily understood.
I manually de-obfuscated this code, and whilst doing that, I re-factored it to make it easy to understand. The code below should be easy enough to read by anyone with JavaScript knowledge, yet it fulfills the same function as the cryptic code provided by Google.
01. function AB_Analysis(){
02. var k='YOURTRACKINGNUMBER';
03. var d=document;
04. var l=d.location;
05. var h=l.hash;
06. var injectionvector1 = ReadFromCookie('__utmx');
07. var injectionvector2 = ReadFromCookie('__utmxx');
08. d.write
09. ('<script src="http://www.google-analytics.com/siteopt.js?v=1&utmxkey='+k
10. +'&utmx=' + injectionvector1
11. +'&utmxx='+ injectionvector2
12. +'&utmxtime=' + new Date().valueOf()
13. +(h?'&utmxhash='+escape(h.substr(1)):'')
14. + '" type="text/javascript" charset="utf-8"></script>')
15. }
16.
17. function ReadFromCookie(field_name){
18. var c = document.cookie;
19. var start = c.indexOf(field_name+'=');
20. var end = c.indexOf(';',start);
21. return c.substring(start + field_name.length + 1, end);
22. }
23.
The security flaw starts in lines 06 and 07:
06. var injectionvector1 = ReadFromCookie('__utmx');
07. var injectionvector2 = ReadFromCookie('__utmxx');
Both lines call the function ReadFromCookie, which parses the browser’s cookie string (document.cookie) without sanitising what it reads. The lack of sanitisation is on line 21:
21. return c.substring(start + field_name.length + 1, end);
Here we can see a classic mistake: data is blindly read from an untrusted source. The substring function reads from the start of the field’s data up to the first semicolon. What it reads should be a tracking value, but in this case it is a specially planted ‘dormant’ script. It is dormant because it resides inside a cookie and not inside the HTML of the web page itself. Lines 10 and 11 are where the real trouble begins to show. The extracted and potentially dangerous script is injected into the user’s DOM:
08. d.write
09. ('<script src="http://www.google-analytics.com/siteopt.js?v=1&utmxkey='+k
10. +'&utmx=' + injectionvector1
11. +'&utmxx='+ injectionvector2
12. +'&utmxtime=' + new Date().valueOf()
13. +(h?'&utmxhash='+escape(h.substr(1)):'')
14. + '" type="text/javascript" charset="utf-8"></script>')
The code above is the one responsible for the fatal injection. There is some irony here. In the same statement of code there exists some protection against XSS, but it does not go far enough.
Look at line 13:
13. +(h?'&utmxhash='+escape(h.substr(1)):'')
This code correctly treats the URL hash (variable h) as untrusted, because it can be manipulated in much the same way as the cookie. The lines before it, however, omit the call to escape(), which encodes characters such as quotes and angle brackets that an attacker needs in order to break out of the script tag. It’s a typical case of ‘so close, yet so far away’.
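The difference between the escaped and unescaped values can be seen in a simplified model of the tag-building code. This is only a sketch meant to be run under Node.js, not pasted into a page: the function names buildTagUnsafe and buildTagEscaped, the KEY placeholder and the payload are my own assumptions, and I use encodeURIComponent (the modern replacement for escape()) rather than whatever Google’s actual fix uses.

```javascript
// Simplified model of the URL built by the control script.
// The host and parameter names come from the script quoted above;
// 'KEY' is a placeholder tracking key.
var BASE = 'http://www.google-analytics.com/siteopt.js?v=1&utmxkey=KEY';

// Mirrors the flawed behaviour: cookie values are concatenated as-is.
function buildTagUnsafe(utmx, utmxx) {
  return '<script src="' + BASE +
    '&utmx=' + utmx +
    '&utmxx=' + utmxx +
    '" type="text/javascript" charset="utf-8"></script>';
}

// One possible mitigation (an assumption, not Google's published fix):
// percent-encode each cookie value before it is placed in the URL.
function buildTagEscaped(utmx, utmxx) {
  return '<script src="' + BASE +
    '&utmx=' + encodeURIComponent(utmx) +
    '&utmxx=' + encodeURIComponent(utmxx) +
    '" type="text/javascript" charset="utf-8"></script>';
}

// Hypothetical payload planted in the __utmx cookie. The leading quote and
// angle bracket close the src attribute and the original script tag.
var payload = '"></script><script>alert(1)</script>';

console.log(buildTagUnsafe(payload, ''));   // contains a second, attacker-controlled script element
console.log(buildTagEscaped(payload, ''));  // payload stays inside the src attribute
```

In the unsafe version the payload becomes markup of its own; in the escaped version the quote and angle brackets are percent-encoded, so the value remains an ordinary query string parameter.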
For those who find it hard to read JavaScript, I have included a flow chart showing the two functions, AB_Analysis and ReadFromCookie.

AB Analysis Function
The diagram above is a flowchart for the AB_Analysis script. This script is embedded on pages by web developers who are making use of the Google Website Optimizer. The red processes are where data is read from the cookie and added to a script, which is in turn injected into the DOM.

ReadFromCookie flowchart
Above is a flowchart for the ReadFromCookie function. There is no actual flaw here, except maybe that there is no limit to how much data is read out of the cookie. Also, the end of record detection is rather crude – simply looking for a semicolon in the data.
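To illustrate the two weaknesses just mentioned, here is a sketch of a more defensive reader. It is not the code Google shipped. The function name readCookieField, the 256-character limit and the pattern of allowed characters are all assumptions of mine, since the real format of the tracking value is not documented here. The cookie string is passed in as a parameter so the function can be tried outside a browser.

```javascript
// Hypothetical limits; adjust to the real format of the tracking value.
var MAX_VALUE_LENGTH = 256;
var SAFE_VALUE = /^[A-Za-z0-9_.:\-]*$/;

// Reads one field from a document.cookie-style string
// ("name=value; name2=value2"). Returns undefined when the field is
// missing, too long, or contains characters outside the allowed set.
function readCookieField(cookieString, fieldName) {
  var pairs = cookieString.split(';');
  for (var i = 0; i < pairs.length; i++) {
    var pair = pairs[i].trim();
    var eq = pair.indexOf('=');
    if (eq < 0) continue;
    // Compare the whole name so that "foo__utmx" does not match "__utmx".
    if (pair.substring(0, eq) !== fieldName) continue;
    var value = pair.substring(eq + 1);
    if (value.length > MAX_VALUE_LENGTH || !SAFE_VALUE.test(value)) {
      return undefined;
    }
    return value;
  }
  return undefined;
}

console.log(readCookieField('a=1; __utmx=12345:0-1; __utmxx=999', '__utmx'));
console.log(readCookieField('__utmx="><script>alert(1)</script>', '__utmx'));
```

Rejecting unexpected characters outright is a stricter choice than escaping them; which of the two is appropriate depends on what values the field legitimately holds.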
Below is how a normal cookie might look. Cookies are not very sophisticated and are generally described as simple text files on the user’s computer. HTML5 adds further client-side storage options, such as Web Storage, alongside cookies, but it does not replace them.
Normal Cookie Example
BEGIN COOKIE
__utmx: some_value;
__utmxx: some_other_value;
END COOKIE
The compromised cookie below contains script inside the __utmx and __utmxx fields. While it sits in the cookie, this script is not active and therefore not dangerous. However, when the AB_Analysis script is executed, the __utmx script gets activated through this XSS attack.
Compromised Cookie Example
BEGIN COOKIE
__utmx: <<malicious script goes here>>;
__utmxx: <<malicious script goes here>>;
END COOKIE
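The listings above are schematic. To a script, the browser exposes cookies through document.cookie as a single string of name=value pairs separated by semicolons. The sketch below splits such a string into its fields; parseCookieString and both sample strings are hypothetical, and the payload is only an illustration of the kind of value an attacker might plant.

```javascript
// Splits a document.cookie-style string ("name=value; name2=value2")
// into an object keyed by cookie name.
function parseCookieString(cookieString) {
  var result = {};
  cookieString.split(';').forEach(function (pair) {
    var trimmed = pair.trim();
    var eq = trimmed.indexOf('=');
    if (eq > 0) {
      result[trimmed.substring(0, eq)] = trimmed.substring(eq + 1);
    }
  });
  return result;
}

// Hypothetical cookie strings matching the two examples above.
var normalCookie = '__utmx=some_value; __utmxx=some_other_value';
var compromisedCookie = '__utmx="><script>alert(1)</script>; __utmxx=some_other_value';

console.log(parseCookieString(normalCookie));
console.log(parseCookieString(compromisedCookie));
```

Quotes and angle brackets come back from the string unchanged, which is why the value has to be treated as untrusted by whatever code reads it.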
An attack has two stages. First, the malicious script has to be injected into a cookie in the victim’s browser. After that, the user must visit a web page containing the Google AB_Analysis script. The attack can be summarised in the diagram below.

Attack on Google Web Optimizer
Google was fast to react and provide a fix; however, this fix needs to be deployed by every web site administrator who uses Google Website Optimizer. This applies to hundreds of thousands of web pages globally.
I hope that administrators are quick to fix this problem as it could easily result in an XSS attack against their site if targeted.
Full instructions on the different options for applying the fix can be found on the official Google support page.