Nebulent - software solutions

MySQL to MongoDB in 2 days

Clio

Clio is a Software as a Service platform designed to process music and algorithmically match any song against a large set of other songs in one's library. Clio is the only software in the world that can analyze and decode the universal patterns that define musical identity and mood.

Far more advanced than cataloging simple metrics like beats-per-minute and key, Clio understands the flow of musical ideas, recognizes subtle differences between drum grooves, and identifies the unique performance styles of individual musicians. It finds and prioritizes the parts of the music that we, as listeners, find most important.

Unlike other machine learning or social recommendation-based solutions, Clio’s technology intuits the difference between Lady Gaga and Ravi Shankar and can find music that sounds (and feels!) like either one.

Original Architecture

Originally, the data architecture for the Clio platform was designed with a relational mindset. At the time of the initial design, the software stored some metadata and a single record of analysis data per song. Given that there are approximately 46,002,354 songs according to Gracenote (note that the iTunes Music Store has only 2.5% of these songs available), most relational databases would be able to store and process that amount of data efficiently. As the system grew and the algorithms matured, we ended up with thousands of records per song, which now included the entire set of DSP data (notes, beats, etc.).

With a release deadline looming, we hit a wall of data: billions of records for any of our medium-size clients. Without wasting much time, we decided to look into non-SQL databases. We had addressed similar problems before by implementing a Lucene/Solr farm layer on top of relational data, but its lack of support for hierarchical, document-based structures pointed us directly to MongoDB. Luckily for us, the nature of the application defined all audio data as a large self-contained set (document) of data per song (as illustrated below).

Relational Model

Magic of Spring Data

So, we found ourselves left with the original JPA-based domain model, a DAO layer and lots of service code around it. Luckily, SpringSource comes to the rescue with its Spring Data project and its support for MongoDB, among other non-SQL databases.

From the model above, the performance table naturally becomes our MongoDB BSON document, and all we need to do is add the Spring Data @Document annotation at the top of the Performance JPA bean, as illustrated below.

JPA Bean for Performance class
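
The original figure is not reproduced here, so the following is only a minimal sketch of what such a bean might look like, not the actual Clio class. The table name, the collection name, and every field other than id are hypothetical, and the import paths are those of recent Spring Data MongoDB and JPA releases, which may differ from the versions used at the time.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

import org.springframework.data.mongodb.core.mapping.Document;

// The existing JPA mapping stays in place; @Document additionally marks the
// class as a MongoDB document. The collection name is an assumption.
@Entity
@Table(name = "performance")
@Document(collection = "performance")
public class Performance {

    // Named "id" so that Spring Data treats it as the document key.
    @Id
    private Long id;

    // Hypothetical fields, for illustration only.
    private String title;
    private String artist;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getArtist() { return artist; }
    public void setArtist(String artist) { this.artist = artist; }
}
```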

Note that for MongoDB to function correctly, every object in the document must, according to Spring Data, define an id class property, which will serve as the primary key. In theory one can use the @Id annotation on a primary key attribute with a name other than id, but we found that this approach is currently not functional. As you can see, we got lucky with primary keys as well: they were all named id, and each table was normalized to have no composite primary keys, which will also not work as expected in JPA.

Now, all you need to do to start interacting with MongoDB is define a mongo template, as illustrated below.

Spring Data mongoTemplate
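
The original figure is missing, so here is a hedged sketch of one way to define the template, using Java configuration against a recent Spring Data MongoDB release. The connection string and the database name "clio" are assumptions made for illustration, and the article's own setup may well have used a different configuration style.

```java
import com.mongodb.client.MongoClients;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.core.MongoTemplate;

@Configuration
public class MongoConfig {

    // Exposes a MongoTemplate bean that DAOs and services can have injected.
    @Bean
    public MongoTemplate mongoTemplate() {
        // Hypothetical connection string and database name.
        return new MongoTemplate(
                MongoClients.create("mongodb://localhost:27017"), "clio");
    }
}
```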

Interacting with MongoDB in Spring Data

Collection Size

One of the things that was not apparent in the beginning was how to get a count of the documents in a collection without actually retrieving any data, similar to the MongoDB command below:

db['my.collection-1'].find().count()

To achieve this in Spring Data, simply do this:

mongoTemplate.getCollection("my.collection-1").count()

Like Queries

If you need to run a case-insensitive like search, do not use the regex(String) method of the Criteria class; use BasicDBObject instead, as shown below:

Query query = Query.query(Criteria.where("title").is(new BasicDBObject("$regex", title).append("$options", "i")));
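
One thing to watch with this kind of query is that the title is interpreted as a regular expression, so characters such as "." or "(" in user input change what matches. The sketch below models the same query document with plain maps and escapes the input with Pattern.quote. The class and method names are hypothetical, and it assumes the server's regex engine honors the \Q...\E quoting that Pattern.quote produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class LikeQuerySketch {

    // Builds {field: {"$regex": <escaped text>, "$options": "i"}} as plain
    // maps, mirroring the BasicDBObject passed to Criteria above.
    static Map<String, Object> containsIgnoreCase(String field, String text) {
        Map<String, Object> regex = new LinkedHashMap<>();
        regex.put("$regex", Pattern.quote(text)); // match the input literally
        regex.put("$options", "i");               // case-insensitive
        Map<String, Object> query = new LinkedHashMap<>();
        query.put(field, regex);
        return query;
    }

    // Local check of the intended semantics: an unanchored, case-insensitive
    // match, comparable to SQL's LIKE '%text%'.
    static boolean matchesLocally(String text, String candidate) {
        return Pattern.compile(Pattern.quote(text), Pattern.CASE_INSENSITIVE)
                .matcher(candidate)
                .find();
    }
}
```

The inner map holds the same two entries that the BasicDBObject in the query above carries, so the escaped string could be passed there in place of the raw title.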

Use Indexes

Indexes do help! After defining composite indexes that matched the parameters of our heaviest queries, we drastically improved performance. But keep in mind that there is a limit on the number of attributes that you can define in a composite index.
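
As a rough illustration, a composite index can be declared on the document class with Spring Data's @CompoundIndex annotation. The class, the index name, and the field names below are hypothetical, and whether the index is created automatically at startup depends on the Spring Data version and its configuration.

```java
import org.springframework.data.mongodb.core.index.CompoundIndex;
import org.springframework.data.mongodb.core.mapping.Document;

// Declares an ascending composite index over two hypothetical fields.
// List the fields in the order the heaviest queries filter on them.
@Document(collection = "performance")
@CompoundIndex(name = "artist_title_idx", def = "{'artist': 1, 'title': 1}")
public class IndexedPerformance {

    private Long id;
    private String artist;
    private String title;
}
```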

Deployment and Setup

Amazon EC2

It’s a fact that MongoDB requires lots of RAM, but during numerous tests we found that it also consumes a lot of CPU on Amazon EC2 instances. Based on our tests, the most stable EC2 instance type to use when you have more than 100GB of data (after export) is High-CPU Extra Large.

High Availability and Scalability

To address scalability and availability concerns, the proposed solution would implement a replicating cluster of MongoDB databases distributed across multiple availability zones. This would substantially improve the uptime SLA for the database services. In addition, the implementation would benefit from a horizontally scalable architecture, with 2-3 databases processing the load instead of just one, sitting behind a regional load balancer. This benefit comes at a cost, however, since more servers would have to be operated and each of them would hold its own copy of the database. Network charges should stay low: traffic within a single availability zone is free, although Amazon does bill for data transferred between zones.

Pitfalls on the way

  • In the case of composite keys, MongoDB will complain while serializing the field in the current version of Spring Data. Also note that you can only use “Long”, “String”, “BigInteger” or “ObjectId” as the MongoDB data type for the primary key in a document.
  • MongoDB cannot serialize the “Character” data type, so do not use “char” or “Character” in your Java beans. Instead, either change to String or use enum(s), which serialize just fine (see below). This is another issue with Spring Data.

Enumeration
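
The original figure is missing, so this is a minimal sketch of the enum approach described above rather than the article's own code. The Tonality enum, its one-character codes, and the surrounding bean are hypothetical. The helper shows one way to map a legacy character code to an enum constant while migrating.

```java
final class PerformanceFlags {

    // Hypothetical enum replacing a legacy one-character code.
    enum Tonality {
        MAJOR('M'),
        MINOR('m');

        private final char code;

        Tonality(char code) {
            this.code = code;
        }

        // Maps a legacy code read from the relational export to the enum.
        static Tonality fromCode(char code) {
            for (Tonality t : values()) {
                if (t.code == code) {
                    return t;
                }
            }
            throw new IllegalArgumentException("Unknown tonality code: " + code);
        }
    }

    // Before the change this field was: private Character tonality;
    private Tonality tonality;

    Tonality getTonality() { return tonality; }
    void setTonality(Tonality tonality) { this.tonality = tonality; }
}
```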

  • No support for transactions.
  • Spring Data currently does not permit setting values for multiple fields in a single call, so if you want to update multiple fields of a MongoDB document you have two options:
    1. Save the entire document (if the document exists, MongoDB will overwrite it).
    2. Perform multiple updates (one for each field you want to update) using “mongoTemplate”.
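
For the second option, each call carries an update document that sets a single field with MongoDB's $set operator. The sketch below models those documents with plain maps. The helper and field names are hypothetical, and the call that hands each document to mongoTemplate is left out because the update method signatures differ between Spring Data releases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class PerFieldUpdates {

    // Splits a map of changed fields into one {"$set": {field: value}}
    // document per field, in the iteration order of the input map.
    static List<Map<String, Object>> toSetDocuments(Map<String, Object> changes) {
        List<Map<String, Object>> updates = new ArrayList<>();
        for (Map.Entry<String, Object> change : changes.entrySet()) {
            Map<String, Object> field =
                    Collections.singletonMap(change.getKey(), change.getValue());
            Map<String, Object> update = Collections.singletonMap("$set", field);
            updates.add(update);
        }
        return updates;
    }
}
```

Because each update is applied separately and there are no transactions, a reader may briefly see a document with only some of the fields changed, which is a point in favor of the first option when the fields must change together.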
2011-05-14