How to mine data from news stories

One basic function of data journalism is to extract newsworthy stories from data sets. Under the pressure of a deadline, journalists often need to know how to unearth stories by mining significant information from large, complex data sets. Perhaps less related to the daily crush of press deadlines, though equally important, is the ability to extract useful data from news stories. For crime reporters in particular, data sets can be a useful way to understand the context in which separate, apparently unrelated incidents occur.

Consider the alternative. Without independently sourced data sets, how do crime reporters track the incidence of criminal activity and the rate of detection? From time to time, authorities might issue reports intended to give an official perspective on the wider context of criminal activity and detection. For example, the police service may periodically issue a summary report of the number of murders that occurred over a certain period. Typically, these summary reports indicate either an increase or a decrease. But should journalists simply take these reports at face value?

High-quality journalism implies an ability to verify reports, even from official sources. Developing that capacity is in great part the responsibility of the individual journalist. Newsrooms can also improve the quality of crime reporting by empowering reporters to create and populate their own data sets, both as a means of producing original and independent trend-analysis journalism and as a means of verifying official reports.

A data set typically takes the form of a table or spreadsheet. A table can serve as a record of several reported incidents of crime, usually listed in chronological rows. Each row can include several columns, with each column containing a different category of information about the incident. Crime reporters are generally required to report a number of disparate incidents occurring in a geographic area. By aggregating those disparate incidents into a single table, reporters can discover leads embedded in the patterns of the occurrence of incidents.
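As a rough illustration of the idea, here is a minimal Python sketch of such a table, with one row per incident and one column per category, followed by a simple count of incidents per area. The rows, the column names (`day`, `area`, `offence`) and the values are placeholders invented for this sketch; they are not real incidents.

```python
from collections import Counter

# Placeholder rows invented for illustration; not real incidents.
# Each dict is one row, and each key is one column (category).
incidents = [
    {"day": 1, "area": "Area A", "offence": "robbery"},
    {"day": 2, "area": "Area B", "offence": "burglary"},
    {"day": 3, "area": "Area A", "offence": "robbery"},
    {"day": 5, "area": "Area A", "offence": "burglary"},
]

# Aggregating the separate incidents shows which areas recur most often.
by_area = Counter(row["area"] for row in incidents)

for area, count in by_area.most_common():
    print(area, count)
```

Counting on a different column, such as `offence`, works the same way. A count like this is only a lead to follow up, not a finding in itself.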

So, how can crime reporters create and populate tables using data embedded in crime reports? This process can be understood to comprise six basic steps:

  1. Define the set
  2. Number the instances
  3. Identify significant data points and define categories
  4. List categories in order of significance
  5. Draw the table and input data
  6. Review the table

Here is one practical example of this technique, using a crime story by Trinidad Express reporter Joel Julien. The story, “12 Murders In 72 Hours”, was published on page 3 on Monday 5 August 2013. The sidebar to the story read as follows:

“TWELVE murders have been committed between the night of the Emancipation holiday and yesterday. Among them were a double murder in Santa Cruz and a triple murder in Laventille. The following is a list of the dozen dead:

  • Andrew Gibson, 50, was fatally stabbed by a close female relative following a bout of drinking in Rio Claro.
  • Double murder: Gregory “Jamdown” Charles, 36, and Sherwin “Dan” Cole, 37, were fatally shot in Santa Cruz.
  • Sean Duncan, 28, was fatally shot while playing cards in his yard at Mulchan Street, Guaico, Sangre Grande, with friends when two men, one dressed like a police officer and the other in camouflage clothing, opened fire on the group.
  • Ganesh Ramroop, 44, was found by a relative in a car parked along the Chase Village flyover with a bullet wound to the stomach. He died while being taken to the Chaguanas Health Facility.
  • Triple murder: Jamal Michael Justin Fox, 28, aka “Blacka”, his nephew Jamal Fox, 21, and Shaquille Bishop, 21, were fatally shot inside a house at Desperlie Crescent in Laventille.
  • The charred remains of two burnt bodies were discovered near the Forres Park Landfill in Claxton Bay. They were burnt beyond recognition.
  • Venice Chattergoon, 27, was stabbed a total of 28 times at his Concerned Citizens Street, California
  • Shakeel Bethel, 19, was gunned down outside a bar in San Fernando.”


Can you convert the article above into a small data set, such as a table? Here’s one way to do it.

  1. To get started, give your table a title. Since the article is essentially a list of incidents of one particular type of event, the title of the table can match that type of event, e.g. ‘Emancipation Weekend Murders’.
  2. Next, determine the number of rows in your table. To do this, count how many instances of the defined event appear in the article. (Note that the article may list some incidents with multiple instances of the defined event, such as the double murder in Santa Cruz.) The number of rows should match the number of instances of the defined event. In this example, that means 12 rows.
  3. Now for your columns. Read the first incident listed in the article and highlight every significant item of information (data point) related to the defined event. Do the same for the subsequent incidents. As you proceed from one incident to the next, you should be able to determine whether the same data points are available for all of the incidents. If so, give a name to each of the various types of data points. The name of each column should match your definition of each category of data.
  4. Next, you probably want to order your columns so that they appear in decreasing order of importance from left to right. One easy way to do this is to create a separate document in which the name of every column stands on a separate line, then number each line to indicate its relative importance (1 = most important, etc.). This system is useful for larger data sets, for which you may need to develop reference documents recording both the names and the relative importance of your columns.
  5. Draw a table with the prescribed number of rows and columns, then enter the data points from each incident. Rows and columns can always be added after you get started.
  6. Review your table. Are any data points missing from the table? Are any types of data missing from your source?
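
The six steps above can be sketched in a few lines of Python. This is only one possible reading of the sidebar, not a definitive one. The column names (`name`, `age`, `method`, `location`), their importance ranks, the shortened place names and the one-word `method` labels are all assumptions made for this sketch. `None` marks a data point the sidebar does not give.

```python
import csv
import io

# Step 1: define the set. The title follows the example in the text.
TITLE = "Emancipation Weekend Murders"

# Steps 3 and 4: name the categories and rank them (1 = most important).
# Both the names and the ranks are illustrative assumptions.
COLUMN_RANKS = {"location": 3, "name": 1, "method": 4, "age": 2}
COLUMNS = sorted(COLUMN_RANKS, key=COLUMN_RANKS.get)

# Step 5: one row per victim (not per incident), transcribed from the sidebar
# in the order (name, age, method, location).
VICTIMS = [
    ("Andrew Gibson", 50, "stabbed", "Rio Claro"),
    ("Gregory Charles", 36, "shot", "Santa Cruz"),
    ("Sherwin Cole", 37, "shot", "Santa Cruz"),
    ("Sean Duncan", 28, "shot", "Sangre Grande"),
    ("Ganesh Ramroop", 44, "shot", "Chase Village"),
    ("Jamal Michael Justin Fox", 28, "shot", "Laventille"),
    ("Jamal Fox", 21, "shot", "Laventille"),
    ("Shaquille Bishop", 21, "shot", "Laventille"),
    (None, None, None, "Claxton Bay"),  # unidentified burnt body
    (None, None, None, "Claxton Bay"),  # unidentified burnt body
    ("Venice Chattergoon", 27, "stabbed", "California"),
    ("Shakeel Bethel", 19, "shot", "San Fernando"),
]

# Step 2: number the instances, starting from 1.
FIELDS = ("name", "age", "method", "location")
rows = [
    {"no": number, **dict(zip(FIELDS, victim))}
    for number, victim in enumerate(VICTIMS, start=1)
]


def to_csv(rows, columns):
    """Return the table as CSV text, with columns in ranked order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["no", *columns])
    writer.writeheader()
    writer.writerows(rows)  # None is written as an empty cell
    return buffer.getvalue()


def missing_counts(rows, columns):
    """Step 6: count the empty cells in each column."""
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}


print(TITLE)
print(to_csv(rows, COLUMNS))
print(missing_counts(rows, COLUMNS))
```

The review step shows that the two Claxton Bay rows have no name, age or method. Those gaps are in the source itself, not in the table. A fuller table might add columns for date, relationship to the suspect or street address, but the sidebar gives those data points for only some of the incidents.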


  1. Stories from your newsroom’s library or archive can be a useful source of data.
  2. Tables can be populated using data embedded in the text of news stories.
  3. Not all data are useful.
  4. Not all data sources are complete.

Did I miss something? Please leave your comment on this blog post.

