Our Law and Society Presentation: Google Analytics: Analyzing the Latest Wave of Legal Concerns for Google in the U.S. and the E.U.

This is a much shortened version of our presentation (by Keidra Chaney & Raizel Liebler) at the Law and Society Conference. The complete version (with citations!) will be published in the next Buffalo Intellectual Property Law Journal.

What is Web Analytics?

The Web Analytics Association, the worldwide professional organization for web analytics, officially defines web analytics as “the measurement, collection, analysis and reporting of Internet data for the purposes of understanding and optimizing Web usage.”

Web analytics involves the collection and measurement of various forms of online user data, and is traditionally used as a tool for market researchers and web professionals to measure the effectiveness of website communication. As web transactions have become a major source of revenue for companies large and small, online marketing and web communication have become higher priorities for marketing departments, and for these companies, measuring and optimizing user results has become a priority as well. Web analytics commonly provides information on online user activity, including web page views, number of visitors, visitor location, and referring websites. Marketers then use this information to evaluate the effectiveness of website content.

The WAA cites the 1993 founding of web analytics software company WebTrends as the formal beginning of web analytics as an industry and a profession. There are two primary methods of data collection used by web analytics software to track user sessions on a website:

1.) Logfile analysis, which uses the log files stored on a website server to collect information on users’ IP addresses, date/time information, and referring websites. A number of open source web analytics tools, such as AWStats and Piwik, employ this method.
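As a rough illustration of what logfile analysis extracts, here is a minimal sketch (in JavaScript, using an invented sample log line) that parses one entry in the common web server log format; it is not tied to any particular analytics tool:

```javascript
// Parse one line of the common web server log format (with referrer):
// IP - - [date/time] "request" status bytes "referrer"
function parseLogLine(line) {
  const pattern = /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)"/;
  const match = line.match(pattern);
  if (!match) return null;
  return {
    ip: match[1],          // visitor's IP address
    datetime: match[2],    // date/time of the request
    request: match[3],     // HTTP method, path, and protocol
    status: Number(match[4]),
    referrer: match[6],    // the referring website, if any
  };
}

const entry = parseLogLine(
  '203.0.113.7 - - [10/Apr/2009:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/start.html"'
);
console.log(entry.ip);       // "203.0.113.7"
console.log(entry.referrer); // "http://example.com/start.html"
```

An analytics tool would run a parser like this over every line of the server log and aggregate the results into page views, visitor counts, and referrer reports.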

2.) Page tagging, which involves placing JavaScript code on a webpage to notify a third-party server whenever the page is loaded in a browser, such as Microsoft Internet Explorer or Firefox. This method is employed by Google Analytics.
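To illustrate the page tagging approach, here is a minimal sketch of how a tag might report a page view back to a third-party server. The collector hostname and parameter names are invented for illustration; real tags, including Google Analytics’, use their own endpoints and parameters:

```javascript
// Build the "beacon" URL a page tag would request to report a page view.
// collector.example.com and the parameter names are hypothetical.
function buildBeaconUrl(pageUrl, referrer) {
  const beacon = new URL("https://collector.example.com/track.gif");
  beacon.searchParams.set("page", pageUrl);         // page that was loaded
  beacon.searchParams.set("ref", referrer || "");   // referring site, if any
  beacon.searchParams.set("t", String(Date.now())); // cache-busting timestamp
  return beacon.toString();
}

// In a browser, the tag would typically request this URL as a tiny
// invisible image, which notifies the third-party server of the view:
//   new Image().src = buildBeaconUrl(location.href, document.referrer);
const url = buildBeaconUrl(
  "https://example.com/about",
  "https://google.com/search"
);
console.log(url.startsWith("https://collector.example.com/track.gif?page="));
```

The key difference from logfile analysis is that the data is collected by the third-party server the tag reports to, not read from the visited site’s own server logs.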

Cookies, a data collection method used by most hosted analytics software companies, track user sessions by placing a small piece of text on a user’s computer when a browser loads a page. The use of cookies by analytics vendors, including Google Analytics, will be the focus of much of our discussion and analysis in this article.

Cookies

An HTTP cookie is a small text file that is placed on a user’s computer hard disk by a web server when the user loads a webpage in their browser. Cookies are commonly employed by web servers to track and authenticate detailed information about online users, based on identifying the specific computer/browser combination of the user. First-party cookies are issued by the same website domain being visited. Third-party cookies are issued to track user activity across multiple websites.
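As a rough sketch of the mechanics, the following parses the Cookie header a browser sends back to a server on each request; the cookie names here are hypothetical:

```javascript
// Parse an HTTP Cookie header ("name1=value1; name2=value2; ...")
// into a simple name -> value map.
function parseCookies(header) {
  const cookies = {};
  for (const pair of header.split(/;\s*/)) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip malformed fragments
    cookies[pair.slice(0, eq)] = decodeURIComponent(pair.slice(eq + 1));
  }
  return cookies;
}

// A server might use a cookie like "visitor_id" to recognize the same
// computer/browser combination across sessions:
const cookies = parseCookies("visitor_id=abc123; session=xyz; lang=en");
console.log(cookies.visitor_id); // "abc123"
```

Because the identifier lives on the user’s machine, deleting cookies (as discussed below) resets it, which is why cookie deletion matters so much to analytics accuracy.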

Third-party cookies are commonly used by e-commerce companies for targeted online advertising based on clickstream behavior. While cookies are used by most analytics companies for data collection (including Google Analytics), privacy concerns do prompt some users to delete cookies from their computers after use. According to a 2007 report from web analytics firm comScore, 3 out of 10 internet users regularly delete cookies from their computers.

While cookie technology is not designed to violate consumer privacy, there have been instances of companies using this technology maliciously. A 2006 study on consumer understanding of cookie technology showed that users remain unclear about how cookies are used by websites, the advantages and disadvantages of their use, and the differences between cookies, viruses, and malware.



#Amazonfail, the Google Books Settlement, and the importance of open access for preserving cultural heritage: In honor of National Library Week

Over the past two years for National Library Week, I have posted about the importance of openness of publication and accessibility of government information, and the limitations of relying on Google. Free Government Information, Public.Resource.org, OpentheGovernment (PDF), and others are continuing to do a great job of promoting openness with regard to government (and scholarly) information. Unfortunately, most people are not aware of the great usefulness and importance of government information. But they do know about Amazon, Google, and YouTube, with many among us using them every day. What would many do to find information if these services stopped working?

The #Amazonfail censorship / glitch / griefing situation last weekend shows the power of publics working together and the organic nature of much of tagging and movementsourcing; people will often be able to create a simple way of communicating information with each other (the first person to use the #Amazonfail tag on Twitter used it because it worked as a folksonomy of the situation, and it spiralled from there because it was effective). But it also shows the difficulty for all when most rely on one source, Amazon, for information about bestsellers and similar items.

Siva Vaidhyanathan says that #Amazonfail is more than just about crowdsourcing and user tagging, it is about “metadata, cataloging, books, Web commerce, and justice.” A commenter quoted in the New York Times states that “We have to now keep a more diligent eye on Amazon and how they handle the world’s cultural heritage.”

Have we really placed Amazon (and similar companies) in charge of our cultural heritage? Perhaps not directly, but many people have high expectations for these companies’ ability to make information accessible –even if this does not take into account most of the aspects of information literacy.

But libraries differ from these for-profit companies in how they organize information and why they exist. Most libraries are not-for-profit, and their goal is to serve some type of public (what librarians call a patron group). Libraries are generally built on organizational systems similar to each other’s, such as Library of Congress or Dewey classification, but libraries are intentionally duplicative in their collections. Not only do libraries often have the same item in their collections, but through interlibrary loan, libraries are tied together in a larger network. And unlike Amazon and Google, even if a library’s online catalog wasn’t working, a user could still use the organizational system to find useful information.

But another major difference is that libraries (and even Twitter) directly rely on people for the system to work, not on an algorithm, as Amazon and Google do. As we’ve seen with Googlebombing and likely with #Amazonfail, it is possible for an algorithm to be fooled, or to provide inaccurate information.

We rely on Google quite openly, even though sometimes the information is not right. For example, as of this post’s publication, the top result when googling “four stages of tornadoes” gives the blunt answer of “u suck balls” from wiki.answers. This can’t possibly be anywhere close to the correct answer to this scientific question, but it is the one Google’s algorithm is choosing!

In my previous posts, I mentioned how what Google has promised from Google Books isn’t what is actually available in many cases. However, some are expecting this settlement between two private/non-public entities to somehow also be a settlement that protects the interests of the public, though there are many that disagree, including Siva Vaidhyanathan, some vehemently. There is a group of professors attempting to intervene in the Google settlement on behalf of the public:

“The proposed settlement will make Google the only company in the world with a license to use orphaned works.  No other company will be able to buy a similar license because, outside the context of the proposed class-action settlement in this case, there is no one from whom to buy such a license….The settling parties plot a cartel in orphaned works.

…  Because exclusive rights in orphaned works do not serve the ultimate purpose of copyright, the public domain has a claim to free, fair use of orphaned works.

We have the right to intervene to present the public domain’s claim to free, fair use of orphaned works.  None of the present parties will present our claim….”

And what about YouTube? While there is much government information on YouTube, what happens if the company goes out of business? Free Government Information ponders whether

agencies that rely on YouTube as a channel of communication keeping copies of the videos they post there? Would they make them available through another channel? What if … libraries had copies?

Relying on private companies — like Google, like YouTube, like West — to give us access to government information — leaves us without options if these access points disappear.

Access to government-funded scientific information is presently under challenge from H.R. 801, the Fair Copyright in Research Works Act, introduced by Rep. John Conyers. If enacted, the bill would reverse the National Institutes of Health (NIH) Public Access Policy regarding public access to taxpayer-funded research and make it impossible for other federal agencies to put similar policies into place. Publicly funded medical research is the metadata of our lives: we don’t see it, but it affects our health and how we live our lives.

Many oppose this bill, including Harvard University, which has written a letter opposing this legislation:

The NIH public access policy has meant that all Americans have access to the important biomedical research results that they have funded through NIH grants. Some 3,000 articles in the life sciences are added to this invaluable public resource each month because of the NIH policy, and one million visitors a month use the site to take advantage of these research papers. The policy respects copyright law and the valuable work of scholarly publishers.

[Instead of passing this bill], Congress should broaden the mandate to other agencies, by passing the Federal Research Public Access Act first introduced in 2006. Doing so would increase transparency of government and of the research that it funds, and provide the widest availability of research results to the citizens who funded it.

Google, Amazon, and the publishing industry provide highly valuable and useful tools and services, but we should not allow closed proprietary systems to determine how we handle information that belongs entirely or in part to the public, like the public domain, government publications, and publicly funded studies. And even when “public” information is not at issue, we need to become more wary of relying solely on these systems.

Multiple systems, locations, and means of access are essential to preserve our cultural heritage — as Free Government Information discusses in regards to government information, yet applicable to so much more:

… no single digital archive or repository can ever be as secure and safe as multiple archives, libraries, and repositories. … The nature of digital information is that it can easily be corrupted, altered, lost, or destroyed. It can become unreadable or unusable without constant attention. Relying on any single entity is simply not as safe as relying on multiple organizations. … But this is about more than redundant copies. It is also about relying on different organizations because they have different funding sources, different constituencies, different technologies, and different collections. No single digital collection can ever be as safe as multiple, reliable digital collections.

NIN on Google Earth

Bless his geeky heart, Uncle Trent is at it again. He’s released download statistics for The Slip on Google Earth. Since my job is about using “emerging media” to engage an audience, and I am also a slobbering NIN fan, the whole approach of using freeware and open source tools (Google, Flickr, YouTube, Blogger, you name it) to communicate with fans is really a case study for me. It’s working, and it’s something for organizations, not just musicians, to watch closely. Not to mention this goes back to understanding and trusting your audience: knowing that NIN fans are notoriously geeky and tech-savvy, and trusting them enough to release this information in a format they’d relate to.