Glossary

Google Refine

1

Google Refine is a free tool designed, as the tagline puts it, for “dealing with messy data.” Although Refine opens in a browser window, it is a desktop install and can therefore be used even when no internet connection is available. It is extremely useful for quickly transforming, assessing, and cleaning data from a variety of source formats. While its functions for joining, splitting, and summing data are more involved than other spreadsheet programs, its automatic macro-writing feature makes developing these a worthwhile investment in situations where these operations are needed on a regular basis (such as with a weekly or monthly data release).

What is it good for?


High-level analysis
Google Refine is especially good at providing quick, top-line overviews of imported data. For example, it provides an automatic row-count when data is loaded, while the “faceting” does the same for each unique value that appears in a given column. The summary window that this generates can be sorted both alphabetically and by frequency, making small variations or errors in the data easier to identify and update. Finally, selecting one or more of these values filters the dataset to contain only matching rows. These filters can be applied across many columns simultaneously, allowing for quick drilldown to specific value intersections. Any and all of these filters can be removed by closing the corresponding facet summary window.

Format conversion
Google Refine is also useful for transforming data formats. For example, it can automatically parse well-formed XML or JSON into spreadsheet format, an otherwise difficult and time-consuming manual or programmatic task.

Recording actions on the data
As mentioned above, Google Refine also keeps a step-by-step record of any transformative data manipulation (such as merging, splitting, removing or renaming columns), which can viewed and stepped through at any time using the Undo/Redo tab. Although these manipulations are more complex to execute in Refine when compared to other spreadsheet programs, the resulting commands can be applied wholesale to another file of the same structure. This is particularly useful when certain calculations or manipulations must be made on a regular basis, as with monthly economic data and the like.


What does it look like?

Google Refine Screenshot

The main interface of Google Refine is essentially a spreadsheet layout. The small arrows to the left of each column header provide access to the faceting, sorting, splitting and other features, while the Facet/Filter and Undo/Redo tabs appear next to the current row count.

Google Refine screenshot 2

Unlike most spreadsheet programs, however, a maximum of 50 rows can be viewed at any given time, and beyond this are paginated at the upper right. At any time, the currently selected rows can be exported as a .csv or other format.


How do I use it?

After downloading and installing Google Refine, double-click on the program’s blue diamond icon – it will open a new browser window or tab (on a PC, a small, black window may also appear – you can minimize, but do not close, this window).

Loading & parsing data
Select the file you want to open and click Next. Refine will guess the file type you have selected and offer a visual preview in the top half of the contents in the top half of the window. In the case of non-spreadsheet formats, such as XML and JSON, you simply need to highlight with your mouse the first element you want converted into a row – it will then show you a preview of the resulting file.


By using the Pick Record Elements and Update Preview buttons at middle right, you can experiment with your selection until it’s correct. Then simply click Create Project at the upper right.

Faceting


Select the arrow to the right of the column you want to facet and choose Facet>>Text facet from the dropdown. The resulting summary will be shown at right. By selecting one of the values, the dataset will be filtered to show only those containing that value.


To add values to the selection, roll over them and choose “Include” to the right of its value.


Where can I find it?

The latest version of Google Refine can be downloaded for free for all platforms at code.google.com. As noted earlier, though the interface operates through a browser window, Refine does not need a web connection to operate, nor does it transmit information to a remote server by default.


Issues & FAQs

While its tools for quickly “faceting” and cleaning data are extremely useful, there are some standard spreadsheet operations (such as concatenating columns and calculating derived values) that are more complex to execute in Google Refine than in programs like OpenOffice or Microsoft Excel, because it uses the custom GREL expression language. However, for complex analyses that must be performed on a regular basis, developing them in Google Refine is often more efficient in the long term, as the macros can be stored separately and so easily reapplied.


Tools & Tutorials

Several in-program and online tutorials have are available at Google Refine’s project home page. The downloads page also includes a Getting Started guide and FAQ section. Additional tutorials will also be available on this site in the near future.

Glossary

Code Vocabulary

0

object: basically a synonym for “thing”. When you see “object” it can stand in for almost any of the terms listed below.

variable: a named container for information. The name can be any word or combination of words, as long as they’re not already used (“reserved”) by the programming language.

data type: the category of stuff a particular variable “container” has or can have in it. The data type of a variable gives you a sense of its general characteristics, and rules for what can be done to it or with it.
Common datatypes


Numerical

  • Number: In javascript, this is the generic data type for any numerical value. Often you will use this data type to cast (see below) data that’s been loaded from an external data source, such as XML or JSON.
  • integer (int): Exactly what it sounds like, when you declare a variable of type int, it can only hold a whole-number positive or negative value.
  • float: A non-integer number; in other words, one whose precision (decimal places) may be greater than zero.

String: A string is a sequence of alphanumeric characters (which can include spaces), indicated by surrounding the string with either single (‘) or double (“) quotes. While either type of quote can be used to indicate a particular string, the opening and closing quote type must match. E.g.:

var myNameDouble = "Susan"; //this is fine
var myNameSingle = 'Susan'; //this is fine, too
var myNameBroken = 'Susan"; //this will produce an error. See how this comment is blue?

Object: Object (with a capital “O”) is probably the most versatile data type, because it can hold an arbitrary number of properties, whose names and data types are determined by the developer. When creating data-driven applications or pages, very often the types of objects created while programming will closely reflect the structure of the data objects being used (for example, an object may contain all of the information associated with a single item in an RSS feed).

Array: An array is an ordered (or indexed) collection of other data types. Arrays may contain only a single type of data (all numbers) or a mix (some numbers, some strings, some objects). While arrays are a useful data structure, when developing data-driven applications, creating separate arrays to “hold” lists of data is often unnecessary, as the existing data structure often acts as a de facto array for most of the ordered information in the application.

Boolean: A Boolean data type may contain only one of two values – true or false (which are therefore reserved words in almost every language). The Boolean data type is most often used in conjunction with Boolean and compound operators, which are used to compare two values to determine their relationship. Different forms may be used in different circumstances. E.g.:

== // is equal to
!= // is not equal to
>= // greater than or equal to
<= // less than or equal to
&& or AND // both joined variables must share the same value
|| or OR // either of the joined variables may have the correct value
! or NOT // the negation of a particular statement/value


functions & arguments (to functions): functions are kind of like recipes. The arguments to the function are the specific ingredients to be used; the “operations” or steps within the function are the steps of the recipe. Often at the end of a function, it returns a finished product to wherever in the program it was called from.

events, event listeners, event handlers: an event in a program is just that – something that happens. Very often, these may be so-called “user events”, such as a click or mouseover event; they may also be “system” events, such as a data file being successfully loaded from another web location. Event listeners are special functions, predefined by the programming language, that are triggered when their particular event occurs. An event handler is no different than any other function, except that it is associated with a particular event listener, and so is executed only when that listener is activated.

Glossary

SSH/Public Key Cryptography

0

SSH is Secure SHell. It’s a network protocol for encrypted communication between two computers. In plain English, that means that you can use SSH to access a secure, encrypted command line on a computer that you don’t happen to be sitting in front of. For more details, try Wikipedia’s article on Secure Shell.

SSH is a command line tool, but you can use it to copy files between two servers, send updates to a version control system like git or Subversion. However, system administrators and other command line cowboys will, ever so very rarely ask you for your “SSH key” or insist that they only use “key based authentication” or “public key cryptography”.

What is it good for?

Most people know they’re using bad passwords every day. There are lots of good reasons why experts prefer key based authentication to passwords, but one of the best is that strong keys are easier to use than strong passwords.

How to make strong passwords

Password strength, by xkcd

Key based authentication also makes it much easier for multiple people to access to a single server: you can authorize many keys to access a particular account. No more sharing passwords. For example, The New York World uses SSH keys to manage access to our web server so we don’t have to keep track of a single shared password for our reporters.


How does it work?

Each person generates a pair of keys. Windows users will want PuTTY. Mac and Linux users can pull up a console window (in Mac OSX, find it under Utilities > Terminal) and generate a key pair with a few short commands.

Once you’ve created the keys, one of them – the private key – stays on your computer. You don’t share it with anyone, ever. The second, your public key, you do share. You add your public key to a file on the server called .ssh/authorized_keys. When you try to access the server, SSH will send your private key to the server in lieu of a password. As long as your public key–the other half of your key pair–is listed in the authorized keys file, you’ll be able to connect.

This is infinitely more secure than sharing the same password between everyone who needs access to the server and far, far easier than memorizing truly random 650 character password.

If you’re really new to using the terminal, Git Hub’s help pages have a great overview to help you get started. Placeblogger’s Lisa Williams also maintains a pretty good resource sheet for folks who are just learning to deal with a terminal.

Step 1: Make sure you don’t already have a key pair set up

In your terminal window, type:

$ ls ~/.ssh

This will list the contents of the .ssh folder in your home directory. If the response you get is something along the lines of "ls: cannot access .ssh: No such file or directory
/home/amanda/.ssh
” you’re ready for step two. If you see a list of file names that includes perhaps one called “id_dsa” (or “id_rsa”) and one called “id_dsa.pub” (or “id_rsa.pub”), you might not need to make new keys. If you honestly have no idea where those keys came from, you’re probably better off starting fresh.

(“dsa” and “rsa” are two common methods for creating public keys.)

Next, type:

$ mkdir ssh_bak

This will create a new directory called “ssh_bak” that you can backup your keys to. Follow this with:

$ mv ~/.ssh/id_* ~/ssh_bak/

This will move any file with a name starting with “id_” into the backup directory you just created. (For more on the asterisk (*) check out: wildcard).

Step 2: Generate a new SSH key

The program that generates SSH keys is called ssh-keygen:

$ ssh-keygen -t rsa -C "your_email@youremail.com"

You should see a message like “Generating public/private key pair.” while your computer works, followed by a prompt asking you to provide a file to save the key to. The filename provided in parentheses (probably something like “(/home/sometheing/.ssh/id_rsa)“) is the default — just hit return to except the default value. Your next prompt will be for a passphrase. Make it a pretty good one but keep track of it. You won’t see anything on the screen as you’re typing your passphrase. That’s okay. Just type it and hit return when you’re done.

You can see your new keys by opening them in a text editor or with a command line tool like cat or less:

$ less .ssh/id_rsa
$ less .ssh/id_rsa.pub

Step 3: Put your key on the server

If you’re lucky, you can just give a copy of id_rsa.pub to your server administrator, who can append it to .ssh/authorized_keys for you. If you aren’t quite so lucky, sit tight, and we’ll cover the vi editor in a future post.

Do I still need a password?

You can (and should) password protect your private key. Github has nice instructions for doing that. If you do, you’ll be prompted to enter your passphrase whenever you use your key.


Issues & FAQs

Key-based SSH authentication is just one of many ways you can use public key cryptography — PGP and SSL are two others, but there are many more.

Glossary

KML

1

The acronym for Keyhole Markup Language, KML is a widely-used XML format for describing geospatial data. Files of this type are typically denoted with a .kml extension (compressed files will have the extension .kmz). Though best known on the web for its integration with Google Maps and Google Earth, KML is also an open standard approved by the Open Geospatial Consortium, and can be exported from (and, to a lesser extent imported to) complex mapping software, such as ArcGIS.

What is it good for?


KML is well suited for both creating and “consuming” geographic information. On the creation side, it can be used both to generate outlines of geographic regions (such as states, countries, or custom areas) or identify individual points or locations. On the consumption side, data published by government and other organizations in .kml can be visualized and republished nearly instantly, which makes it especially useful in breaking news situations.

In addition to basic point and outline information, KML *can* (but needn’t) include elaborate style information, such as fill and line styles, location icons and additional data. Because of this, some KML may generate an informative, fully styled map, while other KML may simply describe a range of geographic areas or locations.


What does it look like?

As mentioned above, KML is a specification of XML, and so follows the same grammatical rules. You’ll notice that it actually has an “xml” type header. Coordinates in KML are of format: latitude, longitude, height

KML describing an individual point:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Simple placemark</name>
    <description>Attached to the ground. Intelligently places itself at the height of the underlying terrain.</description>
    <Point>
      <coordinates>-122.0822035425683,37.42228990140251,0</coordinates>
    </Point>
  </Placemark>
</kml>

KML describing a simple polygon:

<MultiGeometry><Polygon><outerBoundaryIs><LinearRing><coordinates>
-91.853943,33.943816,0 
-91.779785,33.877713,0 
-91.689148,33.927864,0 
-91.818237,33.980264,0
-91.853943,33.943816,0 
</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>


How do I use it?

KML can be used effectively to create customized, interactive maps, as well as three-dimensional figures and extrusions (to represent buildings and other structures). ArcGIS can easily output KML from state, county, and country outlines – or any other geographic features that the system has on record. Many technologies – includes Google Maps, Google Earth, OpenLayers, Google Fusion Tables and even Flash – can can consume KML to visualize points, polygons and other map features. In most cases, this simply requires pointing the application to the desired KML file. For example, trying opening the following file in Google Earth: Sample KML

KML is also very useful for tracking weather events and breaking news. For example, the NOAA maps the water levels of U.S. rivers here. However, the KMZ for these locations, which is updated every 15 minutes, is available here. This makes possible a live-updated map such as this one.


Where can I find it?

National and international organizations release data in KML and KMZ formats, and it can also be produced through ArcGIS or even written by hand. As noted above, organizations like NOAA and Google often release urgent data in KML in part because it is so easily visualized by a wide variety of tools.


Issues & FAQs

KMZ is a compressed format of KML. While some programs (such as Google Earth) may automatically “unzip” this format and visualize the file contents, in many cases you will need to manually unzip the file using WinZip or some other software.

Also note that while ArcGIS can output high-quality KML, it is less good at consuming it – that is, converting styled KML to a parallel shapefile (.shp). This somewhat complicates the process of generating print-quality maps from online-data, as restyling the imported KML and data is usually required.


Tools & Tutorials

Not surprisingly, Google offers some tutorials on KML. GeoCommons also provides some KML data for download, as does the NYC Datamine. This handy tool provides an interactive interface for drawing arbitrary polygons, and also shows the KML.

Glossary

FTP and SFTP

1

If you’ve ever helped build a website, you probably used an FTP client to upload at least some of your files. You might even already know that “FTP” stands for “File Transfer Protocol.” SFTP is just encrypted (or Secure) FTP — the connection between your own computer and the server you’re transferring files to is encrypted, which makes it harder for someone else to view (or alter) the information you’re sending.

What is it good for?

At The New York World we publish most of our stories with WordPress, but when we make maps or cool interactive tools, we need to be able to provide custom javascript, css and xml, so we have to upload these files to our web server directly.

When you open an unencrypted connection to any server (and if you use the internet, you’re connecting to servers all day long), all the data that you send to that server is sent in clear (or “plain”) text. So is all the data that the server sends back. That means that anyone can intercept your exchanges and read them out. Intercepting data on wireless connections is distressingly easy to do. Luckily: there’s an easy solution. Encrypt!

You can encrypt almost any connection, but FTP connections are a great place to start. An encrypted SFTP connection will work just like an unencrypted FTP connection, and you’ll have the confidence of knowing that your password and files are being transfered securely.


How do I use it?

If you already have a favorite FTP client, look for a setting that says “encrypted,” “secure” or “SFTP.” Often, you’ll see a pull-down menu with FTP and SFTP among the options available.

I use Ubuntu Linux, where SFTP is built in to Nautilus. Mac users seem to like Cyberduck and Fugu. If you’re a Windows user, PuTTY is pretty great. Filezilla is available to OSX, Windows and Linux users.

A screenshot of a Filezilla connection dialog.

In Filezilla, SFTP is an option under "server type."


Issues & FAQs

This is, of course, a very simplified version of a technical concept. If you really want to understand SSH or SFTP, I highly recommend their respective Wikipedia pages.

Glossary

CSS

0

CSS is the acronym for Cascading Style Sheets, a means of providing style (fonts, colors) and placement (location on page) information for documents on the web. Typically, CSS is linked to (or “included”) in the header of an HTML document. In the CSS document (or stylesheet), collections of style and placement rules are given distinct names. These names are then applied to similarly identified parts of the web page.

What is it good for?


CSS simplifies the process of creating a consistent look and feel across the content of a web site, by linking the layout of any number of pages to a single file. An unlimited number of pages can link to the same stylesheet, greatly reducing the amount of code that has to be written to add new, properly formatted pages to a site. This also makes it much simpler to change the look and feel of a site later on, as changes to the central stylesheet will then be immediately visible across all linked pages. The “cascading” nature of stylesheets (see How do I use it? below) and the fact that more than one stylesheet can be included on a given page also allows for customization of look and feel for particularly sections or pages.

In addition to determining the basic look & layout of a site, more advanced CSS can be used to create sophisticated visual effects and even animations.


What does it look like?

Although there are multiple ways to apply CSS to a web document, including inline, header styles, and external stylesheets, only the last is recommended. While technically all three methods can achieve the same effects, external stylesheets provide the greatest consistency and flexibility. A basic CSS document might look like this:

/* 
This is a sample stylesheet. Anything written between the slash-asterisk "brackets"
will be ignored by the browser, so I can put comments about the stylesheet here.
*/
body { 
font: Arial,Helvetica,Sans-serif; 
color: #000000;
background: #ffffff;
background-image:none;
}
.contentLink {
font-weight:bold;
color: #000000;
}
#recent-posts {
color: #565a5c;
}

The above document specifies three separate “styles,” each of which will be applied only to the parts of the corresponding HTML document with those particular names. An individual style is described by a name and encased in curly braces {}. Each style may specify many characteristics, each of which is described in the syntax

property-name: property-value;

Generally, there are three mechanisms for identifying individual pieces of a web document: by “tag”, “class”, or “id”. In CSS, simply writing a name (body), means that the matching piece of the document (or document element) is a tag; preceding it with a dot (.contentLink) means it applies to all elements with that class; preceding it with a hash (#recent-posts) means it applies to the element with that id. In the HTML document to which the above stylesheet was linked, you would expect to find each of the following elements:

<body>
<div class="contentLink">This is a "div" container whose text would be bold.</div>
<div id="recent-posts">This is a div whose text would be blue. All of the text would be in Arial font with a white background and no background image, because of the styles specified for body tag.</div>
</body>


How do I use it?

The great efficiency of CSS lies in its “cascading” nature. Put simply, any given element of a web page inherits the styles of its container elements (also called “parent elements”), unless its own style overrides those particular characteristics (or “properties”). In the example above, the “body” tag contains both the “contentLink” div and the “recent-posts” div, so they will both have Arial font text in black (#000000), a white (#ffffff) background color, and no background image. The “contentLink” div will have bold text because “font-weight” is specified as bold in its style description. The “recent-posts” div will have blue text because its “color” property is set to blue (#565a5c), which will override the black it inherited from the body tag.

Where can I find it?

Stylesheets are linked to an HTML page through tags that can be found inside the

<head>

tag. A complete CSS link tag will look something like this:

<link rel="stylesheet" type="text/css" href="http://towcenter.org/wp-content/themes/spectrum/style.css" />

The “rel” and “type” attributes identify the linked file as css; as with regular links, the “href” attribute describes where to actually find the file. A given web page may link to several stylesheets, which will be applied to the page’s elements in order (with the lowest – and therefore most “recent” styles – taking precedence if two of the applied styles describe the same property, such as font).


Issues & FAQs

Although there are dozens of properties that can be applied to an HTML element via CSS, some browsers support properties that others do not, requiring browser-specific workarounds for certain effects, like opacity. For example, see Chris Coyier’s post here.

Such workarounds are necessary because although official standards for CSS are periodically approved by the World Wide Web Constortium (W3C), the organization cannot require browser developers (such as Microsoft, Apple and Mozilla) to support all (or any) of the features of each version. Although the most recent version – or “specification”- of CSS (CSS3) includes properties like animations and web fonts, some browsers may not support these features. Conversely, sometimes browser developers invent their own CSS properties, in order to include appealing features before they’ve been included in the official W3C specification.



Tools & Tutorials

For a list of available CSS properties, as well as tutorials and demos, visit the W3C’s CSS page.

CSS can be created and edited with any plain-text editor, but there are many tools that can make it easier.
Adobe’s DreamWeaver ($$$$), Aptana Studio (free) and Coda ($$ – Mac only), are some popular choices.

Both the Google Chrome and Mozilla FireFox (with the free FireBug plugin) browsers have sophisticated CSS-editing tools that allow you to edit CSS styles instantly on a live web page, cutting down the time and complexity involved in working with complex styles.

CSS Lint is a tool that can tell you if your CSS is error-free (or “valid”).

Glossary

AJAX

0

Is the acronym for Asynchronous Javascript And XML, referring to a technique for loading new information into a web page or application without requiring a complete reload or refresh of the page. AJAX was a defining feature of “Web 2.0″, which saw web applications engineered to more closely resemble the more seamless and responsive experiences of desktop applications.

What is it good for?


AJAX is necessary virtually any time you want to update part of a web page with new information without having to reload the entire page. The information could be text, numerical data, or images. Most often, an “AJAX refresh” will be the result of some kind of user activity on the page or web application, such as a button click. It may also be the result of a timed event, such as refreshing a set of headlines or a Twitter feed.

What does it look like?

If the web page or application you’re viewing allows content to change without the page reloading completely, it almost certainly employs AJAX techniques. Like many catch-phrases of web-evolution, “AJAX” is actually more of a development technique than a specific technology, so at the code level, AJAX can take on many forms. The most distinctive feature of AJAX is code that refers to either “get” or “post” requests; these define the two most common protocols for transmitting information to and from a server, and when visible through a “View Source” on a web page (or in one of its included files), almost always indicates the use of AJAX techniques.

How do I use it?

A typical “AJAX call” involves 3 fundamental components:
1. An event listener, or a piece of code that is triggered when a specific user-action occurs on a web page, such as a particular button being clicked.
2. A server request, made by the event listener code, which asks the server for new information.
3. A response handler – another piece of code that is run when the new information is received from the server. This will usually change something on the page, such as text or other media.

Although the term “AJAX” has persisted, the term now has a more generic meaning than the acronym implies. Notably, an “AJAX” call may return data types apart from XML, such as text and JSON. AJAX calls are also not strictly asynchronous, though that is how they are most commonly used.


Where can I find it?

Although AJAX calls can be written in raw Javascript, this can be tedious due to browser vagaries that can require special handling. Because AJAX techniques are so common, however, virtually every Javasript library (or “framework” – such as jQuery, Moo Tools, Prototype and DoJo) contains functions that handle the gritty, browser-specific details in the background, so the developer only has to worry about the main elements described above, that is:
1. What event should trigger the new information?
2. Where is the new information located?
3. What is the new information used for, once it’s returned by the server?


Issues & FAQs

AJAX techniques cannot be used to directly load information from another web domain. In other words, if your web page lives somewhere under www.mydomain.com, you cannot use AJAX methods directly to load information from www.anotherdomain.com. If possible, any information to be loaded should be moved/uploaded to your own domain. If this cannot be done, a PHP proxy can be used to “trick” the page into thinking that the information is coming from your own domain.


Tools & Tutorials

Full documentation for the jQuery.ajax() method; however, the shorthand jQuery.get() method is more common and concise. jQuery also supports AJAX post requests.