Monday 30 September 2013

Web Scraper Shortcode WordPress Plugin Review

This short post is on the WP plugin called Web Scraper Shortcode, which enables one to retrieve a portion of a web page, or a whole page, and insert it directly into a post. This plugin might be used for getting fresh data or images from web pages for your WordPress-driven site without even visiting them. You can find more scraping plugins and software here.

To install it in WordPress go to Plugins -> Add New.
Usage

The plugin scrapes the page content and applies parameters to this scraped page if specified. To use the plugin just insert the

[web-scraper ]

shortcode into the HTML view of the WordPress page where you want to display the excerpts of a page or the whole page. The parameters are as follows:

    url – the URL of the page to be scraped (self-explanatory)
    element – the DOM navigation notation for the target element, similar to XPath.
    limit – the maximum number of elements to be scraped and inserted if the element notation points to several of them (like elements of the same class).

The plugin uses DOM (Document Object Model) notation, where consecutive DOM nodes are stated like node1.node2; for example: element = ‘div.img’. A specific element is targeted through ‘#’ notation. Example: if you want to scrape several ‘div’ elements of the class ‘red’ (<div class=’red’>…</div>), you need to specify the element attribute this way: element = ‘div#red’.
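
Putting the parameters together, a complete shortcode might look like the following (the exact attribute syntax is my illustrative assumption; check the plugin page for the authoritative format):

[web-scraper url="http://example.com/page.html" element="div#red" limit="3"]

This would insert up to three div elements of the class ‘red’ from the given page into your post.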
How to find DOM notation?

For inexperienced users, though, how is it possible to find the DOM notation of the desired element(s) on a web page? Web Developer Tools are a handy means for this. I would refer you to this paragraph on how to invoke Web Developer Tools in the browser (Google Chrome) and select a single page element to inspect it. As you select it with the ‘loupe’ tool, on the bottom line you’ll see the blue box with the element’s DOM notation:


The plugin content

As one who works with web scraping, I was curious about the means that the plugin uses for scraping. Looking at the plugin code, it turned out that the plugin acquires a web page through the ‘simple_html_dom’ class:

    require_once('simple_html_dom.php');   // include the simple_html_dom library
    $html = file_get_html($url);           // download and parse the whole page
    // ... the code then iterates over the designated elements, up to the set limit

Pitfalls

    Be careful if you put two or more [web-scraper] shortcodes on your website, since downloading other pages will drastically slow the page load speed. Even if you want only a small element, the PHP engine first loads the whole page and then iterates over its elements.
    You need to remember that many pictures on the web are referenced by shortened (relative) URLs. When such an image gets extracted it may show up as broken, since the URL is relative and the plugin does not prepend its base URL.
    The error “Fatal error: Call to a member function find() on a non-object …” will occur if you put this shortcode in a text-overloaded post.

Summary

I’d recommend using this plugin for short posts that need to embed elements from other pages. The use of this plugin is limited, though.



Source: http://extract-web-data.com/web-scraper-shortcode-wordpress-plugin-review/

Friday 27 September 2013

Visual Web Ripper: Using External Input Data Sources

Sometimes it is necessary to use external data sources to provide parameters for the scraping process. For example, you have a database with a bunch of ASINs and you need to scrape all product information for each one of them. As far as Visual Web Ripper is concerned, an input data source can be used to provide a list of input values to a data extraction project. A data extraction project will be run once for each row of input values.

An input data source is normally used in one of these scenarios:

    To provide a list of input values for a web form
    To provide a list of start URLs
    To provide input values for Fixed Value elements
    To provide input values for scripts

Visual Web Ripper supports the following input data sources:

    SQL Server Database
    MySQL Database
    OleDB Database
    CSV File
    Script (A script can be used to provide data from almost any data source)

To see it in action you can download a sample project that uses an input CSV file with Amazon ASIN codes to generate Amazon start URLs and extract some product data. Place both the project file and the input CSV file in the default Visual Web Ripper project folder (My Documents\Visual Web Ripper\Projects).
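
For illustration, an input CSV of ASINs might look like the short sample below (the column name and the ASIN values are placeholders, not taken from the sample project); Visual Web Ripper would run the extraction once for each row:

    ASIN
    B00EXAMPLE1
    B00EXAMPLE2
    B00EXAMPLE3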

For further information please look at the manual topic explaining how to use an input data source to generate start URLs.


Source: http://extract-web-data.com/visual-web-ripper-using-external-input-data-sources/

Thursday 26 September 2013

Scraping Amazon.com with Screen Scraper

Let’s look at how to use Screen Scraper for scraping Amazon products, given a list of ASINs in an external database.

Screen Scraper is designed to be interoperable with all sorts of databases and web-languages. There is even a data-manager that allows one to make a connection to a database (MySQL, Amazon RDS, MS SQL, MariaDB, PostgreSQL, etc), and then the scripting in screen-scraper is agnostic to the type of database.

Let’s go through a sample scrape project so you can see it at work. I don’t know how well you know Screen Scraper, but I assume you have it installed, and a MySQL database you can use. You need to:

    Make sure screen-scraper is not running as workbench or server
    Put the Amazon (Scraping Session).sss file in the “screen-scraper enterprise edition/import” directory.
    Put the mysql-connector-java-5.1.22-bin.jar file in the “screen-scraper enterprise edition/lib/ext” directory.
    Create a MySQL database for the scrape to use, and import the amazon.sql file.
    Put the amazon.db.config file in the “screen-scraper enterprise edition/input” directory and edit it to contain proper settings to connect to your database.
    Start the screen scraper workbench

Since this is a very simple scrape, you just want to run it in the workbench (most of the time you want to run scrapes in server mode). Start the workbench, and you will see the Amazon scrape in there, and you can just click the “play” button.

Note that a breakpoint comes up for each item. It would be easy to save the scraped details to a database table or file if you want. Also notice that in the database the “id_status” value changes as each item is scraped.

When the scrape is run, it looks in the database for products marked “not scraped”, so when you want to re-run the scrapes, you need to:

UPDATE asin
SET `id_status` = 0

Have a nice scraping! ))

P.S. We thank Jason Bellows from Ekiwi, LLC for such a great tutorial.


Source: http://extract-web-data.com/scraping-amazon-com-with-screen-scraper/

Using External Input Data in Off-the-shelf Web Scrapers

There is a question I’ve wanted to shed some light upon for a long time: “What if I need to scrape several URLs based on data in some external database?“

For example, recently one of our visitors asked a very good question (thanks, Ed):

    “I have a large list of amazon.com asin. I would like to scrape 10 or so fields for each asin. Is there any web scraping software available that can read each asin from a database and form the destination url to be scraped like http://www.amazon.com/gp/product/{asin} and scrape the data?”

This question impelled me to investigate this matter. I contacted several web scraper developers, and they kindly provided me with detailed answers that allowed me to bring the following summary to your attention:
Visual Web Ripper

An input data source can be used to provide a list of input values to a data extraction project. A data extraction project will be run once for each row of input values. You can find the additional information here.
Web Content Extractor

You can use the -at"filename" command line option to add new URLs from a TXT or CSV file:

    WCExtractor.exe projectfile -at"filename" -s

projectfile – the file name of the project (*.wcepr) to open
filename – the file name of the CSV or TXT file that contains URLs separated by newlines
-s – starts the extraction process

You can find some options and examples here.
Mozenda

Since Mozenda is cloud-based, the external data needs to be loaded up into the user’s Mozenda account. That data can then be easily used as part of the data extracting process. You can construct URLs, search for strings that match your inputs, or carry through several data fields from an input collection and add data to it as part of your output. The easiest way to get input data from an external source is to use the API to populate data into a Mozenda collection (in the user’s account). You can also input data in the Mozenda web console by importing a .csv file or importing one through our agent building tool.

Once the data is loaded into the cloud, you simply initiate building a Mozenda web agent and refer to that Data list. By using the Load page action and the variable from the inputs, you can construct a URL like http://www.amazon.com/gp/product/%asin%.
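
To make the URL construction concrete, here is a minimal Python sketch of the same idea (this is not Mozenda code; Mozenda builds the URL inside its agent from the %asin% variable, and the file name and column header below are assumptions):

    import csv

    # read ASINs from an input CSV (file name and column header are assumed)
    with open("asins.csv") as f:
        asins = [row["ASIN"] for row in csv.DictReader(f)]

    # build the destination URLs from the template given above
    urls = ["http://www.amazon.com/gp/product/%s" % asin for asin in asins]
    for url in urls:
        print(url)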
Helium Scraper

Here is a video showing how to do this with Helium Scraper:


The video shows how to use the input data as URLs and as search terms. There are many other ways you could use this data, way too many to fit in a video. Also, if you know SQL, you could run a query to get the data directly from an external MS Access database, like:

    SELECT * FROM [MyTable] IN "C:\MyDatabase.mdb"

Note that the database needs to be a “.mdb” file.
WebSundew Data Extractor

Basically, WebSundew allows using input data from external data sources. This may be a CSV or Excel file, or a database (MySQL, MSSQL, etc). Here you can see how to do this in the case of an external file, but you can do it with a database in a similar way (you just need to write an SQL script that returns the necessary data). In addition to passing URLs from external sources, you can pass other input parameters as well (input fields, for example).
Screen Scraper

Screen Scraper is really designed to be interoperable with all sorts of databases. We have composed a separate article where you can find a tutorial and a sample project about scraping Amazon products based on a list of their ASINs.


Source: http://extract-web-data.com/using-external-input-data-in-off-the-shelf-web-scrapers/

Tuesday 24 September 2013

Selenium IDE and Web Scraping

Selenium is a browser automation framework that includes an IDE, a Remote Control server and bindings of various flavors including Java, .Net, Ruby, Python and others. In this post we touch on the basic structure of the framework and its application to Web Scraping.
What is Selenium IDE


Selenium IDE is an integrated development environment for Selenium scripts. It is implemented as a Firefox plugin, and it allows browser interactions to be recorded and then edited, which works well for composing and debugging software tests. The Selenium Remote Control is a server specific to a particular environment; it lets custom scripts drive the controlled browsers. Selenium deploys on Windows, Linux, and iOS. You can read here how the various Selenium components are supported by the major browsers.
What Selenium does and Web Scraping

Basically, Selenium automates browsers, and this ability can no doubt be applied to web scraping. Since browsers (and Selenium) support JavaScript, jQuery and other ways of working with dynamic content, why not use this mix in web scraping, rather than trying to catch Ajax events with plain code? The second reason for this kind of scrape automation is browser-fashion data access (though today this is emulated with most libraries).

Yes, Selenium works to automate browsers, but how do you control Selenium from a custom script to automate a browser for web scraping? There are Selenium PHP and other language libraries (bindings) that allow scripts to call and use Selenium. It is possible to write Selenium clients (using the libraries) in almost any language we prefer, for example Perl, Python, Java, PHP etc. Those libraries (APIs), along with a server, the Java-written server that invokes browsers for actions, constitute the Selenium RC (Remote Control). Remote Control automatically loads the Selenium Core into the browser to control it. For more details on the Selenium components refer here.



A tough scrape task for a programmer

“…cURL is good, but it is very basic. I need to handle everything manually; I am creating HTTP requests by hand. This gets difficult – I need to do a lot of work to make sure that the requests that I send are exactly the same as the requests that a browser would send, both for my sake and for the website’s sake. (For my sake because I want to get the right data, and for the website’s sake because I don’t want to cause error messages or other problems on their site because I sent a bad request that messed with their web application.) And if there is any important javascript, I need to imitate it with PHP. It would be a great benefit to me to be able to control a browser like Firefox with my code. It would solve all my problems regarding the emulation of a real browser… it seems that Selenium will allow me to do this…” -Ryan S

Yes, that’s what we will consider below.
Scrape with Selenium

In order to create scripts that interact with the Selenium Server (Selenium RC, Selenium Remote WebDriver) or to create a local Selenium WebDriver script, you need to make use of language-specific client drivers (also called formatters; they are included in the selenium-ide-1.10.0.xpi package). The Selenium servers, drivers and bindings are available at the Selenium download page.
The basic recipe for scraping with Selenium:

    Use the Chrome or Firefox browser
    Get Firebug or Chrome Dev Tools (Ctrl+Shift+I) in action.
    Install the requirements (Remote Control or WebDriver, libraries and so on)
    Selenium IDE: record a ‘test’ run through a site, adding some assertions.
    Export it as a Python (or other language) script.
    Edit it (loops, data extraction, db input/output)
    Run the script against the Remote Control (a minimal sketch follows below)
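
As a minimal sketch of the last steps (an assumed example using the Python WebDriver bindings and a hypothetical target page with elements of class ‘price’, not a script exported from the IDE):

    from selenium import webdriver

    driver = webdriver.Firefox()                  # launch a local Firefox instance
    driver.get("http://example.com/products")     # hypothetical target URL

    # extract the text of every element with the assumed class name 'price'
    for element in driver.find_elements_by_class_name("price"):
        print(element.text)

    driver.quit()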

Short intro slides on scraping tough websites with Python & Selenium are available here (as Google Docs slides) and here (SlideShare).
Selenium components for Firefox installation guide

For how to install the Selenium IDE in Firefox, see here, starting at slide 21. The Selenium Core and Remote Control installation instructions are there too.
Extracting dynamic content using jQuery/JavaScript with Selenium

One programmer is doing a similar thing …

1. launch a Selenium RC (remote control) server
2. load a page
3. inject the jQuery script
4. select the contents of interest using jQuery/JavaScript
5. send them back to the PHP client using JSON

He particularly finds it quite easy and convenient to use jQuery for screen scraping, rather than using PHP/XPath.
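
A rough sketch of that flow, translated to the Python WebDriver bindings rather than a PHP client (the page URL, the jQuery CDN URL and the ‘.product-title’ selector are assumptions for illustration):

    import json
    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://example.com/catalog")      # step 2: load a page (assumed URL)

    # step 3: inject the jQuery script into the page
    driver.execute_script(
        "var s = document.createElement('script');"
        "s.src = 'http://code.jquery.com/jquery-1.10.2.min.js';"
        "document.head.appendChild(s);")
    time.sleep(2)   # crude wait for jQuery to load; a real script should poll for window.jQuery

    # steps 4-5: select the contents of interest with jQuery and return them as JSON
    data = driver.execute_script(
        "return JSON.stringify(jQuery('.product-title')"
        ".map(function() { return jQuery(this).text(); }).get());")
    print(json.loads(data))

    driver.quit()
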
Conclusion

The Selenium IDE is a popular tool for browser automation, mostly for its software testing application, yet Web Scraping techniques for tough dynamic websites may also be implemented with the IDE along with the Selenium Remote Control server. These are the basic steps:

    Record the ‘test’ browser behavior in the IDE and export it as a script in your chosen programming language
    The exported script runs against the Remote Control server, which drives the browser to send HTTP requests; the script then catches the Ajax-powered responses to extract content.

Selenium-based Web Scraping is an easy task for small-scale projects, but it consumes a lot of memory, since for each request it will launch a new browser instance.



Source: http://extract-web-data.com/selenium-ide-and-web-scraping/

Monday 23 September 2013

Using Forensic Social Media Data Mining to Discover Work Comp Fraud

As an employer, most of your employee based work comp claims are completely legitimate and should be handled in the best interest of the injured employee. Unfortunately, some are also fraudulent, causing increasing costs on baseless claims. Historically, it's been challenging to contest some of these claims, though recently a new road has opened, allowing enlightened employers to more rapidly travel this road to a truthful outcome.

Employers should now consider the usefulness of Facebook, YouTube, LinkedIn and other social media sites, which can contain posts negating the claims of allegedly injured workers by showing them participating in activities that are beyond the restrictions placed by the treating physician. These posts can happen on any given day, clearly elucidating a fraudulent claim. For example, let's say that an employee is out of work based on a "work comp" (workers compensation) claim which has restrictions, yet they post links, discussions, comments and photos that are clearly incriminating. This is an area companies should be mining regularly to protect themselves against out-of-scope workers compensation claims.

Recently, a transportation attorney based in central Pennsylvania was a guest speaker at an insurance transportation web seminar. His topic was the aggressive defense of trucking lawsuits, and he elaborated extensively on an example of forensic social media investigation assisting in the aggressive defense of frivolous lawsuits. I recall that the metrics were impressive; in one example a $250,000 claim was reduced to $2,500 when forensic social media data mining found New Years Eve photos proving the claimant's mobility to be much greater than stipulated in the lawsuit.

Social media offers a surprising if not inadvertent glimpse into the nuances of the lifestyles of anyone using it, and in certain cases, it also offers important evidence into the veracity of work comp and work comp lawsuit based claims. Employers should be aware of this avenue and investigate accordingly.

The D'Camera Group http://www.dcameragroup.com partners with businesses, creating long range plans to manage their Total Cost of Risk. D'Camera Group's proprietary approach bridges the gap between the client and the insurance marketplace through a series of engagements in which we Discover, Design and Implement a Risk Reduction Plan™. Our specialized business insurance services include risk discovery and assessment, risk reduction strategies, insurance loss management, workers compensation, experience mod factors and improving marketplace competitiveness.




Source: http://ezinearticles.com/?Using-Forensic-Social-Media-Data-Mining-to-Discover-Work-Comp-Fraud&id=5027122

Sunday 22 September 2013

Database Mining

The term database mining refers to the process of extracting information from a given database and transforming it into understandable information. The data mining process is also known as data dredging or data snooping. Consumer-focused companies in the retail, financial, communication, and marketing fields are using data mining for cost reduction and increased revenues. This process is a powerful technology which helps organisations focus on the most important and relevant information in their collected data. Organisations can easily understand potential customers and their behaviour with this process. By predicting the behaviour of future trends, recruitment process outsourcing firms assist organisations in making proactive and profitable business decisions. The term database mining originated from the similarities between searching for valuable information in large databases and mining a mountain for a vein of valuable crystal.

A recruitment process outsourcing firm helps an organisation improve its future by analyzing data from distinctive dimensions or angles. From the business point of view, data mining and data entry services lead the organisation to increased profitability and a better grasp of customer demands. The data mining process is a must for every organisation that wants to survive in a competitive market and assure quality. Nowadays data mining services are actively utilised and adapted by many organisations to achieve success and to analyse competitor growth, profit, budgets, sales and so on. Data mining is a form of artificial intelligence that uses automated processes to find the required information. You can easily and swiftly plan your business strategy for the future by finding and collecting the relevant information from huge amounts of data.

With advanced analytics and modern techniques, the database mining process uncovers in-depth business intelligence. You can ask for certain information and let this process provide you with answers that can lead to an immense improvement in your business and quality. Every organisation holds a huge amount of data in its database. Due to the rapid computerisation of business, a large amount of data gets produced by every organisation, and that is where database mining comes into the picture. When problems and challenges arise in the database management of your organisation, the fundamental use of data mining will help you out with maximum returns. Thus, from the strategic point of view, the rapidly growing world of digital data will depend on the ability to mine and manage the data.

Minakshi Zala works as a CV Writer for CV 24-7. Apart from providing professional CV writing services, we are one of the leading Recruitment Process Outsourcing companies, working in resume sourcing, resume searching, database cleansing, and database management.




Source: http://ezinearticles.com/?Database-Mining&id=7292341

Friday 20 September 2013

Using Charts For Effective Data Mining

The modern world is one where data is gathered voraciously. Modern computers with all their advanced hardware and software are bringing all of this data to our fingertips. In fact one survey says that the amount of data gathered doubles every year. That is quite some data to understand and analyze. And this means a lot of time, effort and money. That is where advancements in the field of Data Mining have proven to be so useful.

Data mining is basically a process of identifying underlying patterns and relationships among sets of data that are not apparent at first glance. It is a method by which large and unorganized amounts of data are analyzed to find underlying connections which might give the analyzer useful insight into the data being analyzed.

Its uses are varied. In marketing it can be used to target a product at a particular customer. For example, suppose a supermarket, while mining through its records, notices customers preferring to buy a particular brand of a particular product. The supermarket can then promote that product even further by giving discounts, promotional offers etc. related to that product. A medical researcher analyzing DNA strands can and will have to use data mining to find relationships existing among the strands. Apart from bio-informatics, data mining has found applications in several other fields like genetics, pure medicine, engineering, even education.

The Internet is also a domain where mining is used extensively. The world wide web is a vast mine of information. This information needs to be sorted, grouped and analyzed. Data Mining is used extensively here. For example, one of the most important aspects of the net is search. Every day several million people search for information over the world wide web. If each search query is to be stored then extensively large amounts of data will be generated. Mining can then be used to analyze all of this data and help return better and more direct search results which lead to better usability of the Internet.

Data mining requires advanced techniques to implement. Statistical models, mathematical algorithms or the more modern machine learning methods may be used to sift through tons and tons of data in order to make sense of it all.

Foremost among these is the method of charting. Here data is plotted in the form of charts and graphs. Data visualization, as it is often referred to, is a tried and tested technique of data mining. If visually depicted, data easily reveals relationships that would otherwise be hidden. Bar charts, pie charts, line charts, scatter plots, bubble charts etc. provide simple, easy techniques for data mining.
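
As a toy illustration of charting (the Python matplotlib library and the numbers below are my own choices, made up purely to show the technique), a few lines are enough to turn a small data set into a scatter plot that makes a hidden relationship visible:

    import matplotlib.pyplot as plt

    # made-up example data: advertising spend vs. units sold
    spend = [1, 2, 3, 4, 5, 6, 7, 8]
    sold = [12, 15, 21, 24, 33, 35, 44, 47]

    plt.scatter(spend, sold)
    plt.xlabel("Advertising spend (thousands)")
    plt.ylabel("Units sold")
    plt.title("A scatter plot reveals a roughly linear relationship")
    plt.show()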

Thus a clear simple truth emerges: in today's world of heavy data loads, mining is necessary. And charts and graphs are one of the surest methods of doing this. And if current trends are anything to go by, the importance of data mining cannot be undermined in any way in the near future.




Source: http://ezinearticles.com/?Using-Charts-For-Effective-Data-Mining&id=2644996

Thursday 19 September 2013

What You Need to Know About Popular Software - Data Mining Software

Simply put, data mining is the process of extracting hidden patterns from the organization's database. Over the years it has become a more and more important tool for adding value to the company's databases. Applications include business, medicine, science and engineering, and combating terrorism. This technique actually involves two very different processes, knowledge discovery and prediction. Knowledge discovery provides users with explicit information that in a sense is sitting in the database but has not been exposed. Prediction is an attempt to read into the future.

Data mining relies on the use of real-world data. To understand how this technology works we need first to review some basic concepts. Data are any facts whether numeric or textual that can be processed by a computer. The categories include operational, non-operational, and metadata. Operational or transactional elements include accounting, cost, inventory, and sales facts and figures. Non-operational elements include forecasts and information describing competitors and the industry as a whole. Metadata describes the data itself; it is required to set up and run the databases.

Data mining commonly performs four interrelated tasks: association rule learning, classification, clustering, and regression. Let's examine each in turn. Association rule learning, also known as market basket analysis, searches for relationships between variables. A classic example is a supermarket determining which products customers buy together. Customers who buy onions and potatoes often buy beef. Classification arranges data into predefined groups. This technology can do so in a sophisticated manner. In a related technique known as clustering the groups are not predefined. Regression involves data modeling.
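
For a flavour of what clustering looks like in practice, here is a minimal sketch using Python and the scikit-learn library (both the library and the toy purchase data are my own choices for illustration, not something the article prescribes):

    from sklearn.cluster import KMeans

    # toy data: each customer is described by (visits per month, average basket value)
    customers = [[2, 15], [3, 18], [25, 90], [22, 95], [4, 20], [24, 88]]

    # group the customers into two clusters that were not predefined by the analyst
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(model.labels_)   # e.g. [0 0 1 1 0 1]: low-spend vs. high-spend shoppers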

It has been alleged that data mining has been used both in the United States and elsewhere to combat terrorism. As always in such cases, those who know don't say, and those who say don't know. One may surmise that these anti-terrorist applications look for unusual patterns. Many credit card holders have been contacted when their spending patterns changed substantially.

Data mining has become an important feature in many customer relationship management applications. For example, this technology enables companies to focus their marketing efforts on likely customers rather than trying to sell to everyone out there. Human resources applications help companies recruit and manage employees. We have already mentioned market basket analysis. Strategic enterprise management applications help a company transform corporate targets and goals into operational decisions such as hiring and factory scheduling.

Given its great power, many people are concerned with the human rights and privacy issues around data mining. Sophisticated applications could work their way around privacy safeguards. As the technology becomes more widespread and less expensive, these issues may become more urgent. As data is summarized, wrong conclusions can be drawn. This problem not only affects human rights but also the company's bottom line.

Levi Reiss has authored or co-authored ten books on computers and the Internet. He teaches Linux and Windows operating systems plus other computer courses at an Ontario French-language community college. Visit his new website [http://www.mysql4windows.com] which teaches you how to download and run MySQL on Windows computers, even if they are "obsolete." For a break from computers check out his global wine website at http://www.theworldwidewine.com with his new weekly column reviewing $10 wines.





Source: http://ezinearticles.com/?What-You-Need-to-Know-About-Popular-Software---Data-Mining-Software&id=1920655

Wednesday 18 September 2013

Data Mining Tools - Understanding Data Mining

Data mining basically means pulling important information out of huge volumes of data. Data mining tools are used to examine the data from various viewpoints and summarize it into a useful database library. However, lately these tools have become computer-based applications in order to handle the growing amount of data. They are also sometimes referred to as knowledge discovery tools.

As a concept, data mining has always existed, and manual processes were once used as data mining tools. Later, with the advent of fast processing computers, analytical software tools, and increased storage capacities, automated tools were developed, which drastically improved the accuracy of analysis and the speed of data mining, and also brought down the costs of operation. These methods of data mining are essentially employed to facilitate the following major elements:

    Pull out, convert, and load data to a data warehouse system
    Collect and handle the data in a database system
    Allow the concerned personnel to retrieve the data
    Data analysis
    Data presentation in a format that can be easily interpreted for further decision making

We use these methods of mining data to explore the correlations, associations, and trends in the stored data that are generally based on the following types of relationships:

    Associations - simple relationships between the data
    Clusters - logical correlations are used to categorise the collected data
    Classes - certain predefined groups are drawn out and then data within the stored information is searched based on these groups
    Sequential patterns - this helps to predict a particular behavior based on the trends observed in the stored data

Industries which cater heavily to consumers in retail, financial, entertainment, sports, hospitality and so on rely on these data mining methods to obtain fast answers to questions and improve their business. The tools help them to study the buying patterns of their consumers and hence plan a strategy for the future to improve sales. For example, a restaurant might want to study the eating habits of its consumers at various times during the day. The data would then help it in deciding on the menu at different times of the day. Data mining tools certainly help a great deal when drawing out business plans, advertising strategies, discount plans, and so on. Some important factors to consider when selecting a data mining tool include the platforms supported, the algorithms on which they work (neural networks, decision trees), input and output options for data, the database structure and storage required, usability and ease of operation, automation processes, and reporting methods.

Jeff Smith is the managing director of Karma Technologies and feels strongly about implementing ways to be green into their business practices, to a point they are almost a paper-free company.




Source: http://ezinearticles.com/?Data-Mining-Tools---Understanding-Data-Mining&id=1109771

Tuesday 17 September 2013

Data Mining Social Networks, Smart Phone Data, and Other Data Base, Yet Maintaining Privacy

Is it possible to data mine social networks in such a way that it does not hurt the privacy of the individual user, and if so, can we justify doing so? It wasn't too long ago that the CEO of Google stated that it was important that they were able to keep data on Google searches so they could find disease, flu, and food-borne illness clusters, using this data and studying the regions in the searches to help fight against outbreaks of diseases or food-borne illnesses in the distribution system. This is one good reason to store the data and collect it for research; as long as it is anonymized, then theoretically no one is hurt.

Unfortunately, this also scares the users, because they know if the searches are indeed stored, this data can be used against them in the future, for instance, higher insurance rates, bombardment of advertising, or get them put onto some sort of future government "thought police" watch-list. Especially considering all the political correctness, and new ways of defining hate speech, bullying, and what is, what isn't, and what might be a domestically home-grown terrorist. The future concept of the thought police is very scary to most folks.

Usually if you want to collect data from a user, you have to give them something back in return, and therefore they are willing to sign away certain privacy rights on that data in trade for the use of such services; such as on their cell phone, perhaps a free iPhone app or a virtual product in an online social network.

Artificially Intelligent Search Features

It is no surprise that AI search features are getting smarter, even able to anticipate your next search question, or what you are really trying to ask, even second-guessing your question, for instance. Now then, let's discuss this for a moment. Many folks very much enjoy the search features of Amazon.com, which use artificial intelligence to recommend other books they might be interested in. And therefore the user probably does not mind giving away information about themselves for this upgraded service or ability, nor would the person mind having cookies put onto their Web browser.

Nevertheless, these types of systems are always exploited for other purposes. For instance, consider the Federal Trade Commission's do-not-call list, and consider how many corporations, political party organizations, and all of their affiliates and partners were able to bypass these rules due to the fact that the consumer or customer had bought something from them in the last six months. This is not what consumers or customers had in mind when they decided they wanted this "do not call list," and the resulting response from the marketplace proves we cannot trust the telecommunication companies, their lobbyists, or the insiders within their group (many of which over the years have indeed been somehow connected to the intelligence agencies - AT&T - NSA Echelon for example).

Now then, this article is in no way to be considered a conspiracy theory, it is just a known fact, yes national security does need access to such information, and often it might be relevant, catching bad guys, terrorists, spies, etc. The NSA is to protect the American People. However, when it comes to the telecommunication companies, their job is to protect shareholder's equity, maximize quarterly profits, expand their business models, and create new profit centers in their corporations.

Thus, such user data will be and has been exploited for future profits against the wishes of the consumer, without the consumer benefiting from free services for lower prices in any way. If there is an explained reason, trade-off, and a monetary consideration, the consumer might feel obliged to have additional calls bothering them while they are at home, additional advertising, and tracking of their preferences for ease of use and suggestions. What types of suggestions?

Well, there is a Starbucks two-blocks from here, turn right, then turn left and it is 200 yards, with parking available; "Sale on Frappachinos for gold-card holders today!" In this case the telecommunication company tracks your location, knows your preferences, and collects a small fee from Starbucks, and you get a free-phone, and 20% off your monthly 4G wireless fee. Is that something a consumer might want; when asked 75% of consumers or smart phone users say; yes. See that point?

In the future smart phones may transfer data between themselves, rather than going through a given or closest cell tower. In other words, packets of information may go from your cell phone, to the next nearest cell phone, to another nearby cell phone, and on to the person who is intended to receive them. Each mobile device the data passes through will not be able to read any of the information it is not assigned to receive, since it wasn't sent to it. By using such a scheme telecommunication companies can expand their services without building more new cell towers, and therefore they can lower the price.

However, it also means that when you lay your cell phone on the table, and it is turned on it would be constantly passing data through it, data which is not yours, and you are not getting paid for that, even though you had to purchase the smart phone. But if the phone was given to you, with a large battery, so it wouldn't go dead during all those transmissions, you probably wouldn't care, as long as your data packets of information were indeed safe and no one else could read them.

This technology exists now, and is being discussed, and consider if you will that the whole strategy of networking smart cell phones or personal tech devices together is nothing new. For instance, the same strategies have been designed for satellites, and to use an analogy, this scheme is very similar to the strategies FedEx uses when it sends packages to the next nearest FedEx office if that is their destination, without sending all of the packages all the way across the country to the central Memphis sort, and then all the way back again. They are saving time, fuel, space, and energy, and if cell phones did this it would save the telecommunication companies mega bucks in the savings of building new cell towers.

As long as you got a free cell phone, which many of us do, unless we have the mega top of the line edition, and if they gave you a long-lasting free battery it is win-win for the user. You probably wouldn't care, and the telecommunication companies could most likely lower the cost of services, and not need to upgrade their system, because they can carry a lot more data, without hundreds of billions of dollars in future investments.

Also, a net-centric system like this is more resistant to disruption in the event of an emergency, when emergency communications take precedence, putting every cell phone user as secondary traffic at the cell towers, which means their calls may not even get through.

Next, the last thing the telecommunication company would want to do is to data mine that data, or those packets of information from people like a soccer mom calling her son waiting at the bus stop at school. And anyone with a cell phone certainly wouldn't want their packets of information being stolen from them and rerouted because someone near them hacked into the system and had a cell phone that was displaying all of their information.

You can see the problems with all this, but you can also see the incredible economies of scale gained by making each and every cell phone a transmitter and receiver, which it already is in principle anyway, at least for the data you send and receive now. In the new system, all the data that is closest by would be able to transfer through the phone and be sent on its way. The receiving cell phone would wait until all the packets of data were in, and then display the information.

You can see why such a system also might cause people to have a problem with it because of what they call net neutrality. If someone was downloading a movie onto their iPad using a 3G or 4G wireless network, it could tie up all the cell phones nearby that were moving the data through them. In this case, it might upset consumers, but if that traffic could be somewhat delayed by priority, based on a simple AI algorithm decision matrix, then such a packet distribution tactic might allow for this to occur without disruption from the actual cell tower, meaning everyone would be better off. Therefore we all get information flow that is faster, more dispersed, and therefore safer from intruders. Please consider all this.



Source: http://ezinearticles.com/?Data-Mining-Social-Networks,-Smart-Phone-Data,-and-Other-Data-Base,-Yet-Maintaining-Privacy&id=4867112

Monday 16 September 2013

Processing Of Unorganized Data Is Important For Your Online Business

For every online business it is important to have a management service for organizing their paperwork and documents. These services are actually a fundamental factor for your business. Form processing is categorized under data processing services and helps make your business well-organized, with informative details on your website. Form processing means taking information out of completed forms and scanned images. Form Processing Services will help you in making good business links.

Online Form Processing Service offers you assured quality and trust when processing a variety of forms. There are huge numbers of organizations that are using these Forms Processing Services for enjoying the benefits of well-organized data. It is an essential tool for all the organizations and firms.

Outsourcing your data-related work to professional firms is always a great idea, and it brings other advantages too. These services offer you low expenditure and fast processing of your forms. They help you to accumulate large volumes of important data efficiently and securely. These services are always in demand and are appreciated by their consumers for the precision of their rapid online form processing. Reduce your burden by outsourcing this data-related work to an expert and dedicated data entry service.

Why is there a need to outsource your Form Processing Services to professionals?

Outsourcing Data Processing services involves work like capturing, digitizing and processing of data from a variety of resources and converting them into a database for efficient analysis and research.

It could be anything like insurance forms, medical claims, online forms, order forms, feedback, surveys, and questionnaires. Handling all the data of every form and maintaining these records is a tiring job for the organizations that have to focus on their core activities.

Many companies and organizations make use of questionnaires and forms in order to communicate with their customers and gathering information and making it a beneficial tool for your business. It is necessary to know and consider about the process and the requirements of the document.

Let's have a look on the beneficial side of Data Processing Services

It Reduces Cost - Managing manual document classification, extracting documents and separating them is a costly process. Form processing services help to reduce the cost and give cost-effective solutions.

It Increases Speed - Manual document classification, document separation and extraction of the data is tiresome, time-consuming, and a bad use of knowledgeable workers' time. Form processing services save time and increase speed.

It helps to upgrade - You can upgrade its features and level simply by reactivating the license. Installation of the software is not required. Better compatibility of your configurations is guaranteed.

It Increases Data Security - Programmed document processing reduces the need for humans to interact with the confidential data and hence increases your security. It classifies the documents as they get scanned and feeds them into the workflow system for immediate and automatic routing.

It Easily Generates Performance Reports - Manual processes are very tough to review. Using this service helps to control, monitor and report on documents. As a result, organizations can meet requirements in a much better way and problems can be identified more easily.

It prevents fraud - If your data is already stored in a proper way, and you have all the information, it reduces the risk of fraud. Because storing data in a proper way ensures accurate data, you can check it any time if needed.

Outsourced form processing services are safe and secure for the handling of confidential documents. Additionally, these services are affordable and more often than not low in cost, which makes them profitable in the long run. Thus, if you want, you too can hire professionals for such data entry services, which ensure complete accuracy of data while ensuring timely completion of your projects. By taking these services you can concentrate on other tasks easily.

A well-qualified and knowledgeable team of experts makes this work very easy. An experienced project manager and staff can handle a large number of information handling solutions, making things much easier for your company and thus helping to upgrade your data management systems. The prepared data and results are examined beforehand, so that customers get quality service and precise data.

Ashima is a content writer at SunTec India, product Data Entry Services. She is a versatile writer and can write on a variety of topics. She has keen interest in getting knowledge about Form Processing Services and likes to share her ideas on the same.



Source: http://ezinearticles.com/?Processing-Of-Unorganized-Data-Is-Important-For-Your-Online-Business&id=7747653

Sunday 15 September 2013

One of the Main Differences Between Statistical Analysis and Data Mining

Two methods of analyzing data that are common in both academic and commercial fields are statistical analysis and data mining. While statistical analysis has a long scientific history, data mining is a more recent method of data analysis that has arisen from Computer Science. In this article I want to give an introduction to these methods and outline what I believe is one of the main differences between the two fields of analysis.

Statistical analysis commonly involves an analyst formulating a hypothesis and then testing the validity of this hypothesis by running statistical tests on data that may have been collected for the purpose. For example, if an analyst was studying the relationship between income level and the ability to get a loan, the analyst may hypothesize that there will be a correlation between income level and the amount of credit someone may qualify for.

The analyst could then test this hypothesis with the use of a data set that contains a number of people along with their income levels and the credit available to them. A test could be run that indicates for example that there may be a high degree of confidence that there is indeed a correlation between income and available credit. The main point here is that the analyst has formulated a hypothesis and then used a statistical test along with a data set to provide evidence in support or against that hypothesis.
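
As a concrete illustration of this kind of test (the SciPy library and the made-up numbers below are my own choices, not the author's):

    from scipy.stats import pearsonr

    # made-up sample: annual income (thousands) and approved credit limit (thousands)
    income = [25, 32, 41, 48, 55, 63, 72, 80]
    credit = [3, 5, 6, 9, 10, 13, 14, 18]

    r, p_value = pearsonr(income, credit)
    print("correlation = %.2f, p-value = %.4f" % (r, p_value))
    # a large r with a small p-value is evidence in support of the hypothesis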

Data mining is another area of data analysis that has arisen more recently from computer science, and it has a number of differences from traditional statistical analysis. Firstly, many data mining techniques are designed to be applied to very large data sets, while statistical analysis techniques are often designed to form evidence in support of or against a hypothesis from a more limited set of data.

Probably the most significant difference here, however, is that data mining techniques are not used so much to form confidence in a hypothesis, but rather to extract unknown relationships that may be present in the data set. This is probably best illustrated with an example. Rather than the above case, where a statistician may form a hypothesis between income levels and an applicant's ability to get a loan, in data mining there is typically no initial hypothesis. A data mining analyst may have a large data set on loans that have been given to people, along with demographic information about these people such as their income level, their age, any existing debts they have and whether they have ever defaulted on a loan before.

A data mining technique may then search through this large data set and extract a previously unknown relationship between income levels, peoples existing debt and their ability to get a loan.

While there are quite a few differences between statistical analysis and data mining, I believe this difference is at the heart of the issue. A lot of statistical analysis is about analyzing data to either form confidence for or against a stated hypothesis while data mining is often more about applying an algorithm to a data set to extract previously unforeseen relationships.

The author has a number of websites that provide financial calculators including the sites mortgage calculator amortization and refinance calculator mortgage.




Source: http://ezinearticles.com/?One-of-the-Main-Differences-Between-Statistical-Analysis-and-Data-Mining&id=4578250

Friday 13 September 2013

What is Data Mining? Why Data Mining is Important?

Searching, collecting, filtering and analyzing data is what we define as data mining. A large amount of information can be retrieved in a wide range of forms, such as data relationships, patterns or significant statistical correlations. Today the advent of computers, large databases and the internet makes it easier to collect millions, billions and even trillions of pieces of data that can be systematically analyzed to help look for relationships and to seek solutions to difficult problems.

Governments, private companies, large organizations and businesses of all kinds collect large volumes of information for research and business development. All of this collected data can be stored for future use. Such information is most important whenever it is required, yet it takes a great deal of time to search for and find the required information on the internet or in other resources.

Here is an overview of data mining services inclusion:

* Market research, product research, survey and analysis
* Collection information about investors, funds and investments
* Forums, blogs and other resources for customer views/opinions
* Scanning large volumes of data
* Information extraction
* Pre-processing of data from the data warehouse
* Meta data extraction
* Web data online mining services
* data online mining research
* Online newspaper and news sources information research
* Excel sheet presentation of data collected from online sources
* Competitor analysis
* data mining books
* Information interpretation
* Updating collected data

After applying the process of data mining, you can easily extract information from the filtered data and refine it further. This process is mainly divided into 3 stages: pre-processing, mining and validation. In short, online data mining is a process of converting data into authentic information.

The most important point is that it takes a lot of time to find important information in the data. If you want to grow your business rapidly, you must make quick and accurate decisions to grab timely available opportunities.

Outsourcing Web Research is one of the best data mining outsourcing organizations having more than 17 years of experience in the market research industry. To know more information about our company please contact us.



Source: http://ezinearticles.com/?What-is-Data-Mining?-Why-Data-Mining-is-Important?&id=3613677

Wednesday 11 September 2013

Backtesting & Data Mining

Introduction

In this article we'll take a look at two related practices that are widely used by traders, called Backtesting and Data Mining. These are techniques that are powerful and valuable if we use them correctly; however, traders often misuse them. Therefore, we'll also explore two common pitfalls of these techniques, known as the multiple hypothesis problem and overfitting, and how to overcome them.

Backtesting

Backtesting is just the process of using historical data to test the performance of some trading strategy. Backtesting generally starts with a strategy that we would like to test, for instance buying GBP/USD when it crosses above the 20-day moving average and selling when it crosses below that average. Now we could test that strategy by watching what the market does going forward, but that would take a long time. This is why we use historical data that is already available.
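
As a sketch of what such a backtest looks like in code, here is a minimal Python/pandas illustration; the file name gbpusd.csv and its 'close' column are assumptions for the example, not part of the article:

    import pandas as pd

    # load historical daily GBP/USD closes (file name and column name are assumed)
    prices = pd.read_csv("gbpusd.csv")["close"]

    sma = prices.rolling(20).mean()            # the 20-day simple moving average
    position = (prices > sma).astype(int)      # long (1) above the average, flat (0) below

    # strategy return: yesterday's position applied to today's price change
    returns = prices.pct_change() * position.shift(1)
    print("sum of daily strategy returns: %.2f%%" % (returns.sum() * 100))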

"But wait, wait!" I hear you say. "Couldn't you cheat or at least be biased because you already know what happened in the past?" That's definitely a concern, so a valid backtest will be one in which we aren't familiar with the historical data. We can accomplish this by choosing random time periods or by choosing many different time periods in which to conduct the test.

Now I can hear another group of you saying, "But all that historical data just sitting there waiting to be analyzed is tempting isn't it? Maybe there are profound secrets in that data just waiting for geeks like us to discover it. Would it be so wrong for us to examine that historical data first, to analyze it and see if we can find patterns hidden within it?" This argument is also valid, but it leads us into an area fraught with danger...the world of Data Mining

Data Mining

Data Mining involves searching through data in order to locate patterns and find possible correlations between variables. In the example above involving the 20-day moving average strategy, we just came up with that particular indicator out of the blue, but suppose we had no idea what type of strategy we wanted to test? That's when data mining comes in handy. We could search through our historical data on GBP/USD to see how the price behaved after it crossed many different moving averages. We could check price movements against many other types of indicators as well and see which ones correspond to large price movements.

The subject of data mining can be controversial because as I discussed above it seems a bit like cheating or "looking ahead" in the data. Is data mining a valid scientific technique? On the one hand the scientific method says that we're supposed to make a hypothesis first and then test it against our data, but on the other hand it seems appropriate to do some "exploration" of the data first in order to suggest a hypothesis. So which is right? We can look at the steps in the Scientific Method for a clue to the source of the confusion. The process in general looks like this:

Observation (data) >>> Hypothesis >>> Prediction >>> Experiment (data)

Notice that we can deal with data during both the Observation and Experiment stages. So both views are right. We must use data in order to create a sensible hypothesis, but we also test that hypothesis using data. The trick is simply to make sure that the two sets of data are not the same! We must never test our hypothesis using the same set of data that we used to suggest our hypothesis. In other words, if you use data mining in order to come up with strategy ideas, make sure you use a different set of data to backtest those ideas.

Now we'll turn our attention to the main pitfalls of using data mining and backtesting incorrectly. The general problem is known as "over-optimization" and I prefer to break that problem down into two distinct types. These are the multiple hypothesis problem and overfitting. In a sense they are opposite ways of making the same error. The multiple hypothesis problem involves choosing many simple hypotheses while overfitting involves the creation of one very complex hypothesis.

The Multiple Hypothesis Problem

To see how this problem arises, let's go back to our example where we backtested the 20-day moving average strategy. Let's suppose that we backtest the strategy against ten years of historical market data and lo and behold guess what? The results are not very encouraging. However, being rough and tumble traders as we are, we decide not to give up so easily. What about a ten day moving average? That might work out a little better, so let's backtest it! We run another backtest and we find that the results still aren't stellar, but they're a bit better than the 20-day results. We decide to explore a little and run similar tests with 5-day and 30-day moving averages. Finally it occurs to us that we could actually just test every single moving average up to some point and see how they all perform. So we test the 2-day, 3-day, 4-day, and so on, all the way up to the 50-day moving average.
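
A sketch of that "test everything" loop (again Python/pandas, continuing the assumptions from the earlier snippet) shows how easy it is to slide into this trap:

    import pandas as pd

    prices = pd.read_csv("gbpusd.csv")["close"]   # same assumed data file as above

    results = {}
    for window in range(2, 51):                   # try every moving average from 2 to 50 days
        sma = prices.rolling(window).mean()
        position = (prices > sma).astype(int)
        results[window] = (prices.pct_change() * position.shift(1)).sum()

    # the 'best' window is partly an artifact of trying 49 hypotheses on one data set
    best = max(results, key=results.get)
    print("best window: %d days, return: %.2f%%" % (best, results[best] * 100))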

Now certainly some of these averages will perform poorly and others will perform fairly well, but there will have to be one of them which is the absolute best. For instance we may find that the 32-day moving average turned out to be the best performer during this particular ten year period. Does this mean that there is something special about the 32-day average and that we should be confident that it will perform well in the future? Unfortunately many traders assume this to be the case, and they just stop their analysis at this point, thinking that they've discovered something profound. They have fallen into the "Multiple Hypothesis Problem" pitfall.

The problem is that there is nothing at all unusual or significant about the fact that some average turned out to be the best. After all, we tested almost fifty of them against the same data, so we'd expect to find a few good performers, just by chance. It doesn't mean there's anything special about the particular moving average that "won" in this case. The problem arises because we tested multiple hypotheses until we found one that worked, instead of choosing a single hypothesis and testing it.
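As a rough sketch of what such a sweep looks like in code (Python with pandas; the CSV file name, the 'close' column and the toy profit rule are all hypothetical, not something from the article), the loop below backtests every moving-average length from 2 to 50 against the same in-sample data and reports the best performer. This is exactly the multiple-hypothesis trap: some length will always come out on top.

    import pandas as pd

    # Hypothetical in-sample data: one row per day with a 'close' column.
    close = pd.read_csv("gbpusd_in_sample.csv")["close"]

    def backtest_ma(close, n):
        """Toy rule: be long whenever yesterday's close was above its n-day moving average."""
        ma = close.rolling(n).mean()
        in_market = (close > ma).shift(1).fillna(False).astype(bool)
        daily_return = close.pct_change().fillna(0.0)
        return float(daily_return[in_market].sum())   # crude total return

    results = {n: backtest_ma(close, n) for n in range(2, 51)}
    best_n = max(results, key=results.get)
    print(f"Best in-sample length: {best_n}-day (return {results[best_n]:.2%})")
    # A 'winner' is guaranteed to appear by chance alone; this proves nothing yet.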

Here's a good classic analogy. We could come up with a single hypothesis such as "Scott is great at flipping heads on a coin." From that, we could create a prediction that says, "If the hypothesis is true, Scott will be able to flip 10 heads in a row." Then we can perform a simple experiment to test that hypothesis. If I can flip 10 heads in a row it actually doesn't prove the hypothesis. However if I can't accomplish this feat it definitely disproves the hypothesis. As we do repeated experiments which fail to disprove the hypothesis, then our confidence in its truth grows.

That's the right way to do it. However, what if we had come up with 1,000 hypotheses instead of just the one about me being a good coin flipper? We could make the same hypothesis about 1,000 different people...me, Ed, Cindy, Bill, Sam, etc. Ok, now let's test our multiple hypotheses. We ask all 1000 people to flip a coin. There will probably be about 500 who flip heads. Everyone else can go home. Now we ask those 500 people to flip again, and this time about 250 will flip heads. On the third flip about 125 people flip heads, on the fourth about 63 people are left, and on the fifth flip there are about 32. These 32 people are all pretty amazing aren't they? They've all flipped five heads in a row! If we flip five more times and eliminate half the people each time on average, we will end up with 16, then 8, then 4, then 2 and finally one person left who has flipped ten heads in a row. It's Bill! Bill is a "fantabulous" flipper of coins! Or is he?

Well we really don't know, and that's the point. Bill may have won our contest out of pure chance, or he may very well be the best flipper of heads this side of the Andromeda galaxy. By the same token, we don't know if the 32-day moving average from our example above just performed well in our test by pure chance, or if there is really something special about it. But all we've done so far is to find a hypothesis, namely that the 32-day moving average strategy is profitable (or that Bill is a great coin flipper). We haven't actually tested that hypothesis yet.
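To see how easily pure chance produces a 'winner' like Bill, here is a minimal simulation of the contest described above (plain Python; the seed is arbitrary and only there to make the run repeatable):

    import random

    random.seed(7)                    # arbitrary seed, just for repeatability
    flippers = 1000                   # with fair coins, roughly one should survive ten rounds

    survivors = list(range(flippers))
    for _ in range(10):
        # keep only the people who flipped heads this round
        survivors = [person for person in survivors if random.random() < 0.5]

    print(f"{len(survivors)} of {flippers} flipped ten heads in a row")
    # Whoever is left is our 'Bill': impressive looking, yet fully expected by chance.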

So now that we understand that we haven't really discovered anything significant yet about the 32-day moving average or about Bill's ability to flip coins, the natural question to ask is what should we do next? As I mentioned above, many traders never realize that there is a next step required at all. Well, in the case of Bill you'd probably ask, "Aha, but can he flip ten heads in a row again?" In the case of the 32-day moving average, we'd want to test it again, but certainly not against the same data sample that we used to choose that hypothesis. We would choose another ten-year period and see if the strategy worked just as well. We could continue to do this experiment as many times as we wanted until our supply of new ten-year periods ran out. We refer to this as "out of sample testing", and it's the way to avoid this pitfall. There are various methods of such testing, one of which is "cross validation", but we won't get into that much detail here.
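Continuing the moving-average sketch from above, out-of-sample testing simply means re-running the in-sample 'winner' on a period it has never seen (the second file name is, again, hypothetical):

    # Out-of-sample check: apply the in-sample winner to a different ten-year period.
    fresh_close = pd.read_csv("gbpusd_out_of_sample.csv")["close"]
    oos_return = backtest_ma(fresh_close, best_n)
    print(f"{best_n}-day MA out-of-sample return: {oos_return:.2%}")
    # Only if the rule also performs here do we start taking it seriously.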

Overfitting

Overfitting is really a kind of reversal of the above problem. In the multiple hypothesis example above, we looked at many simple hypotheses and picked the one that performed best in the past. In overfitting we first look at the past and then construct a single complex hypothesis that fits well with what happened. For example if I look at the USD/JPY rate over the past 10 days, I might see that the daily closes did this:

up, up, down, up, up, up, down, down, down, up.

Got it? See the pattern? Yeah, neither do I actually. But if I wanted to use this data to suggest a hypothesis, I might come up with...

My amazing hypothesis:

If the closing price goes up twice in a row then down for one day, or if it goes down for three days in a row we should buy,

but if the closing price goes up three days in a row we should sell,

but if it goes up three days in a row and then down three days in a row we should buy.

Huh? Sounds like a whacky hypothesis right? But if we had used this strategy over the past 10 days, we would have been right on every single trade we made! The "overfitter" uses backtesting and data mining differently than the "multiple hypothesis makers" do. The "overfitter" doesn't come up with 400 different strategies to backtest. No way! The "overfitter" uses data mining tools to figure out just one strategy, no matter how complex, that would have had the best performance over the backtesting period. Will it work in the future?

Not likely, but we could always keep tweaking the model and testing the strategy in different samples (out of sample testing again) to see if our performance improves. When we stop getting performance improvements and the only thing that's rising is the complexity of our model, then we know we've crossed the line into overfitting.
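To make the overfitting example concrete, here is a minimal sketch that hard-codes the 'amazing hypothesis' and replays it against the ten observed closes, reading the rules over consecutive three-day blocks as the story above implicitly does. It calls every trade correctly in-sample, which is exactly why a perfect historical fit, on its own, tells us so little.

    # The ten observed USD/JPY daily moves: 'U' for up, 'D' for down.
    moves = list("UUDUUUDDDU")

    def signal(block, previous_block=None):
        """The 'amazing hypothesis', applied to the most recent three-day block."""
        if previous_block == list("UUU") and block == list("DDD"):
            return "buy"                 # up three in a row, then down three in a row
        if block == list("UUD") or block == list("DDD"):
            return "buy"                 # up twice then down once, or down three in a row
        if block == list("UUU"):
            return "sell"                # up three days in a row
        return None                      # otherwise, no trade

    trades = correct = 0
    for start in (0, 3, 6):              # blocks covering days 1-3, 4-6, 7-9
        decision = signal(moves[start:start + 3], moves[start - 3:start] if start else None)
        if decision:
            trades += 1
            next_day_up = moves[start + 3] == "U"
            correct += (decision == "buy") == next_day_up

    print(f"{correct} of {trades} trades correct in-sample")   # 3 of 3: a perfect, meaningless fit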

Conclusion

So in summary, we've seen that data mining is a way to use our historical price data to suggest a workable trading strategy, but that we have to be aware of the pitfalls of the multiple hypothesis problem and overfitting. The way to make sure that we don't fall prey to these pitfalls is to backtest our strategy using a different dataset than the one we used during our data mining exploration. We commonly refer to this as "out of sample testing".

Scott Percival

October 2006

Scott Percival is the Director of Research for the FOREX Statistical Research Center at Market-geeks.com, a site which is devoted to using mathematics and the scientific method to study the behavior of prices in the FOREX market. Mr. Percival has a degree in Civil Engineering from Northeastern University, and has worked as a Registered Representative and trading instructor at Fidelity Investments. He is currently working toward the goal of becoming a full time FOREX trader.




Source: http://ezinearticles.com/?Backtesting-and-Data-Mining&id=341468

Tuesday 10 September 2013

Why Outsourcing Data Mining Services?

Are huge volumes of raw data waiting to be converted into information that you can use? Your organization's hunt for valuable information ends with data mining, which can help bring more accuracy and clarity to the decision-making process.

Today's world is information-hungry, and with the Internet offering flexible communication, there is a remarkable flow of data. It is important to make that data available in a readily workable format in which it can be of real help to your business. Filtered data is then of considerable use to the organization, which can use these services to increase profits, smooth the workflow and reduce overall risk.

Data mining is a process that involves sorting through vast amounts of data and seeking out the pertinent information. In most instances data mining is conducted by professionals, business organizations and financial analysts, although a growing number of fields are finding the benefits of using it in their business.

Data mining helps make every decision quicker and more feasible. The information it produces is used in several decision-making applications relating to direct marketing, e-commerce, customer relationship management, healthcare, scientific tests, telecommunications, financial services and utilities.

Data mining services include:

    Collecting data from websites into an Excel database
    Searching for and collecting contact information from websites
    Using software to extract data from websites
    Extracting and summarizing stories from news sources
    Gathering information about competitors' businesses

In this era of globalization, handling your important data is becoming a headache for many business verticals, and outsourcing is then a profitable option for your business. Since all projects are customized to suit the exact needs of the customer, huge savings in terms of time, money and infrastructure can be realized.

Advantages of Outsourcing Data Mining Services:

    Skilled and qualified technical staff who are proficient in English
    Improved technology scalability
    Advanced infrastructure resources
    Quick turnaround time
    Cost-effective prices
    Secure Network systems to ensure data safety
    Increased market coverage

Outsourcing will help you to focus on your core business operations and thus improve overall productivity. Data mining outsourcing has therefore become a wise choice for business. Outsourcing these services helps businesses manage their data effectively, which in turn enables them to achieve higher profits.



Source: http://ezinearticles.com/?Why-Outsourcing-Data-Mining-Services?&id=3066061

Data Mining Models - Tom's Ten Data Tips

What is a model? A model is a purposeful simplification of reality. Models can take on many forms. A built-to-scale look alike, a mathematical equation, a spreadsheet, or a person, a scene, and many other forms. In all cases, the model uses only part of reality, that's why it's a simplification. And in all cases, the way one reduces the complexity of real life, is chosen with a purpose. The purpose is to focus on particular characteristics, at the expense of losing extraneous detail.

If you ask my son, Carmen Elektra is the ultimate model. She replaces an image of women in general, and embodies a particularly attractive one at that. A model for a wind tunnel may look like the real car, at least on the outside, but doesn't need an engine, brakes, real tires, etc. The purpose is to focus on aerodynamics, so this model only needs to have an identical outside shape.

Data Mining models reduce intricate relations in data. They're a simplified representation of characteristic patterns in data. This can be for two reasons: either to predict or describe mechanics, e.g. "what application form characteristics are indicative of a future default credit card applicant?", or to give insight into complex, high-dimensional patterns. An example of the latter could be a customer segmentation. Based on clustering similar patterns of database attributes one defines groups like: high income/ high spending/ need for credit, low income/ need for credit, high income/ frugal/ no need for credit, etc.

1. A Predictive Model Relies On The Future Being Like The Past

As Yogi Berra said: "Predicting is hard, especially when it's about the future". The same holds for data mining. What is commonly referred to as "predictive modeling", is in essence a classification task.

Based on the (big) assumption that the future will resemble the past, we classify future occurrences for their similarity with past cases. Then we 'predict' they will behave like past look-alikes.

2. Even A 'Purely' Predictive Model Should Always (Be) Explain(ed)

Predictive models are generally used to provide scores (likelihood to churn) or decisions (accept yes/no). Regardless, they should always be accompanied by explanations that give insight into the model. This is for two reasons:

    buy-in from business stakeholders to act on predictions is of the utmost importance, and grows with understanding
    peculiarities in data do sometimes arise, and may become obvious from the model's explanation


3. It's Not About The Model, But The Results It Generates

Models are developed for a purpose. All too often, data miners fall in love with their own methodology (or algorithms). Nobody cares. Clients (not customers) who should benefit from using a model are interested in only one thing: "What's in it for me?"

Therefore, the single most important thing on a data miner's mind should be: "How do I communicate the benefits of using this model to my client?" This calls for patience, persistence, and the ability to explain in business terms how using the model will affect the company's bottom line. Practice explaining this to your grandmother, and you will come a long way towards becoming effective.

4. How Do You Measure The 'Success' Of A Model?

There are really two answers to this question. An important and simple one, and an academic and wildly complex one. What counts the most is the result in business terms. This can range from percentage of response to a direct marketing campaign, number of fraudulent claims intercepted, average sale per lead, likelihood of churn, etc.

The academic issue is how to determine the improvement a model gives over the best alternative course of business action. This turns out to be an intriguing, ill understood question. This is a frontier of future scientific study, and mathematical theory. Bias-Variance Decomposition is one of those mathematical frontiers.

5. A Model Predicts Only As Good As The Data That Go In To It

The old "Garbage In, Garbage Out" (GiGo), is hackneyed but true (unfortunately). But there is more to this topic. Across a broad range of industries, channels, products, and settings we have found a common pattern. Input (predictive) variables can be ordered from transactional to demographic. From transient and volatile to stable.

In general, transactional variables that relate to (recent) activity hold the most predictive power. Less dynamic variables, like demographics, tend to be weaker predictors. The downside is that model performance (predictive "power") on the basis of transactional and behavioral variables usually degrades faster over time. Therefore such models need to be updated or rebuilt more often.

6. Models Need To Be Monitored For Performance Degradation

It is imperative to always, always follow up model deployment by reviewing its effectiveness. Failing to do so should be likened to driving a car with blinders on. Reckless.

To monitor how a model keeps performing over time, you check whether the prediction as generated by the model, matches the patterns of response when deployed in real life. Although no rocket science, this can be tricky to accomplish in practice.

7. Classification Accuracy Is Not A Sufficient Indicator Of Model Quality

Contrary to common belief, even among data miners, no single number of classification accuracy (R2, Gini-coefficient, lift, etc.) is valid to quantify model quality. The reason behind this has nothing to do with the model itself, but rather with the fact that a model derives its quality from being applied.

The quality of model predictions calls for at least two numbers: one number to indicate accuracy of prediction (these are commonly the only numbers supplied), and another number to reflect its generalizability. The latter indicates resilience to changing multi-variate distributions, the degree to which the model will hold up as reality changes very slowly. Hence, it's measured by the multi-variate representativeness of the input variables in the final model.
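As one possible, minimal way to report 'two numbers' (the metric choices and thresholds below are a common convention, not something the tip prescribes), the sketch computes plain accuracy on a holdout set together with a population stability index on one input variable, as a crude proxy for how much the multi-variate input distribution is drifting:

    import numpy as np

    def accuracy(y_true, y_pred):
        """Plain classification accuracy on a holdout set."""
        return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

    def population_stability_index(expected, actual, bins=10):
        """PSI: compares a variable's distribution at build time against today's."""
        cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
        cuts[0], cuts[-1] = -np.inf, np.inf
        cuts = np.unique(cuts)                      # guard against duplicate bin edges
        e_pct = np.histogram(expected, cuts)[0] / len(expected)
        a_pct = np.histogram(actual, cuts)[0] / len(actual)
        e_pct = np.clip(e_pct, 1e-6, None)
        a_pct = np.clip(a_pct, 1e-6, None)
        return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

    # Hypothetical usage: y_holdout/y_predicted from a holdout set, and one input
    # variable as seen at model-build time versus in current production data.
    # acc = accuracy(y_holdout, y_predicted)
    # psi = population_stability_index(x_at_build_time, x_in_production)
    # A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a red flag.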

8. Exploratory Models Are As Good As the Insight They Give

There are many reasons why you want to give insight into the relations found in the data. In all cases, the purpose is to make a large amount of data and an exponential number of relations palatable. You knowingly ignore detail and point to "interesting" and potentially actionable highlights.

The key here is, as Einstein pointed out already, to have a model that is as simple as possible, but not too simple. It should be as simple as possible in order to impose structure on complexity. At the same time, it shouldn't be too simple so that the image of reality becomes overly distorted.

9. Get A Decent Model Fast, Rather Than A Great One Later

In almost all business settings, it is far more important to get a reasonable model deployed quickly, instead of working to improve it. This is for three reasons:

    A working model is making money; a model under construction is not
    When a model is in place, you have a chance to "learn from experience"; the same holds for even a mild improvement: is it working as expected?
    The best way to manage models is by getting agile in updating. No better practice than doing it... :)


10. Data Mining Models - What's In It For Me?

Who needs data mining models? As the world around us becomes ever more digitized, possible applications abound. And as data mining software has come of age, you don't need a PhD in statistics anymore to operate such applications.

In almost every instance where data can be used to make intelligent decisions, there's a fair chance that models could help. When 40 years ago underwriters were replaced by scorecards (a particular kind of data mining model), nobody could believe that such a simple set of decision rules could be effective. Fortunes have been made by early adopters since then.




Source: http://ezinearticles.com/?Data-Mining-Models---Toms-Ten-Data-Tips&id=289130

Sunday 8 September 2013

Data Mining And Importance to Achieve Competitive Edge in Business

What is data mining? And why is it so important in business? These are simple yet complicated questions to answer; below is brief information to help in understanding data and web mining services.

Data mining, in general terms, can be described as retrieving useful information or knowledge from data for further analysis from various perspectives, and summarizing it into valuable information to be used for increasing revenue, cutting costs, or gathering competitive information on a business or product. Data abstraction is of great importance in the business world, as it helps a business harness the power of accurate information, thus providing a competitive edge. Many business firms and companies have their own warehouses to help them collect, organize and mine information such as transactional data, purchase data and so on.

But having mining services and a warehouse on premises is not affordable and not very cost-effective as a route to reliable information, even though extracting information is a need of every business these days. Many companies are therefore providing accurate and effective data and web data mining solutions at a reasonable price.

Outsourced information abstraction services are offered at affordable rates and are available for a wide range of data mining solutions:

• Extracting business data
• Gathering data sets
• Digging information out of datasets
• Website data mining
• Stock market information
• Statistical information
• Information classification
• Information regression
• Structured data analysis
• Online data mining to gather product details
• Gathering prices
• Gathering product specifications
• Gathering images

Outsourced web mining and data gathering solutions have proved effective in cutting costs and increasing productivity at affordable rates. The benefits of data mining services include:

• A clear understanding of customers, services and products
• Lower or minimal marketing costs
• Exact information on sales and transactions
• Detection of beneficial patterns
• Reduced risk and increased ROI
• Detection of new markets
• A clear understanding of business problems and goals

Accurate data mining solutions can prove to be an effective way to cut costs by concentrating effort in the right place.



Source: http://ezinearticles.com/?Data-Mining-And-Importance-to-Achieve-Competitive-Edge-in-Business&id=5771888

Friday 6 September 2013

Data Mining

I just love the so-called politically correct terminology used today. How they decide it's really a politically correct term is beyond me. Why don't they call it what it is: misleading terminology, whitewashing, sugar coating. Example: collateral damage, i.e. innocent people killed in the wake of combat.

Now collecting information on people to compile marketing lists, so that we can all receive more junk mail, marketing phone calls and faxes, goes by a cute little term: Data Mining.

This is a quote from a banking website on the subject:
Let me try to clarify this. We are wanting to contact our commercial customers' payors that do not bank with us. We would get the information off their checks when our commercial customers deposit them. For example, I use ABC Cleaners and pay by check but I bank at a different bank than the Cleaners. The Cleaner's bank would take my information off my check to contact me.

Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1] and "The science of extracting useful information from large data sets or databases" [2]. Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.

The Kirchman Corporation, a leading provider of automation software and compliance services to the banking industry, with over 800 financial institutions worldwide, recently published a 4-page glossy executive summary on Data Mining in the banking industry.

[The Kirchman Corporation became a Metavante subsidiary in May 2004 and operates within the Metavante Financial Services Group. Metavante is a subsidiary of Marshall & Ilsley Corporation (NYSE: MI). Kirchman Bankway®, the easiest-to-use and most customer-focused core bank software on the market, provides a comprehensive package of business solutions for use by the smallest de novo to the largest multi-billion-dollar bank holding company. Operating on IBM and Sun Microsystems hardware platforms, Kirchman Bankway® makes it easy for everyone from the teller to the CEO to access everything they need to know about their customers, right at their fingertips. (www.kirchman.com, Orlando, FL)]

Data Mining software is used to analyse customer data and trends in banking as well as many other industries. Chase analyzed the attributes and habits of its checking account customers for clues that might reveal how to set the minimum balance requirement to retain profitable customers. The staff used data mining to develop profiles of customer groups whose members consistently had trouble maintaining minimum balances.

I'm sure you know where I'm headed with this. If banks use Data Mining to predict trends in one area of services, it's not a far leap to say they would also use data mining to add courtesy bounce protection for customers who have a low balance and would be more likely to overdraft. Would it be too far of a leap to say banks might even use data mining in deciding which customers get their account processed high to low?



Source: http://ezinearticles.com/?Data-Mining&id=12770

Thursday 5 September 2013

Data Mining, Not Just a Method But a Technique

Web data mining means segregating probable clients out of the huge amount of information available on the Internet by performing various searches. The data could be well organized and structured, or raw, depending on its use. Web data mining can be done using a simple database program or by investing money in a costly program.

Start collecting basic contact information on probable clients, such as names, addresses, landline and cell phone numbers, email addresses, and education or occupation if required.

CART and CHAID data mining

While collecting data you will find tree-shaped structures that represent decisions. These derived decisions give rules for the classification of the collected data. Well-known decision tree methods include Classification and Regression Trees, also known as CART, and Chi-Square Automatic Interaction Detection, also known as CHAID. CART and CHAID are decision tree techniques used for classifying collected data. They provide a set of rules that can be applied to new, unclassified data to make predictions. CART segments a dataset by creating two-way splits, whereas CHAID segments it using chi-square tests, creating multi-way splits. CART requires less data preparation than CHAID.
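As a rough sketch of a CART-style tree in practice (assuming scikit-learn is available; the CSV file, column names and parameters are hypothetical), the snippet below fits a tree that makes exactly the kind of two-way splits described above and prints the learned rules so they can be applied to unclassified records:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical file: one row per prospect, with a 'became_client' label column.
    data = pd.read_csv("prospects.csv")
    features = ["age", "income", "occupation_code", "region_code"]      # assumed columns

    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)     # binary (CART-style) splits
    tree.fit(data[features], data["became_client"])

    # The fitted rules can be printed as readable if/else splits and then
    # applied to new, unclassified prospects with tree.predict(...).
    print(export_text(tree, feature_names=features))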

Understanding customer's actions

Keep track of your customers' actions: what they buy, when they buy, why they buy, what they use their purchases for, and so on. Knowing such simple things about your customers will help you understand their needs better, so the data mining process will be easier and better-quality data will be mined. This will also improve your personal relationship with your customers, which finally results in a better professional relationship.

Following demography

Mine the data by demography, taking into account the geography as well as the socio-economic background of the business location. You can use government statistics as the source of your data collection. With that in mind you can build an understanding of the existing community and thus of the data required.

Use your informal conversations to serve your clients better

Use the small details of your conversations with and understanding of your customers to serve them. If necessary, conduct surveys, send a professional gift or use some other gesture that helps you better understand and fulfill customer needs. This will strengthen the bond between you and your customer, and you will be able to serve your customer better when providing data mining services.

Insert the collected information into a desktop database. As more information is collected, you will find that you can prepare specific templates for feeding it in. Using a desktop database, it is easier to make changes later on as and when required.
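For the desktop database step, a minimal sketch using Python's built-in sqlite3 module might look like this (the table layout and the sample record are hypothetical):

    import sqlite3

    conn = sqlite3.connect("clients.db")          # one local file acts as the desktop database
    conn.execute("""
        CREATE TABLE IF NOT EXISTS contacts (
            name       TEXT,
            address    TEXT,
            phone      TEXT,
            email      TEXT,
            occupation TEXT
        )
    """)

    # Inserting one collected record; a 'template' is simply this statement reused.
    conn.execute(
        "INSERT INTO contacts (name, address, phone, email, occupation) VALUES (?, ?, ?, ?, ?)",
        ("Jane Doe", "12 Example Street", "555-0100", "jane@example.com", "teacher"),
    )
    conn.commit()
    conn.close()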

Maintaining privacy

While doing this work, it is essential to ensure that you and your team members are not violating privacy laws in gathering or providing the data. Once trust is lost, you may also lose the customer, because trust is the basis of any relationship, business relationships included.



Source: http://ezinearticles.com/?Data-Mining,-Not-Just-a-Method-But-a-Technique&id=5416129

Tuesday 3 September 2013

Know What the Truth Behind Data Mining Outsourcing Service

We have arrived at what we call the information age, where industries rely on useful data for decision-making, the creation of products and other essential business uses. Mining information and converting it into useful knowledge is part of this trend and allows companies to reach their optimum potential. However, many companies cannot deal with data mining themselves because they are simply overwhelmed with other important tasks. This is where data mining outsourcing comes in.

Many definitions have been introduced, but data mining can be simply explained as a process that involves sorting through large amounts of raw data to extract the valuable information needed by industries and enterprises in various fields. In most cases this is done by professionals, professional organizations and financial analysts, and there has been considerable growth in the number of sectors and groups that take it up.
There are a number of reasons for the rapid growth in data mining outsourcing service subscriptions. Some of them are presented below:

A wide range of services

Many companies are turning to information mining outsourcing because providers cover a wide range of services. These services include, but are not limited to, gathering data from web applications into a database, collecting contact information from different sites, extracting data from websites using software, summarizing stories from news sources, and accumulating information on commercial competitors.

Many industries benefit

Many industries benefit because it is fast and practical. The information extracted by outsourced data mining service providers is used for crucial decisions in the fields of direct marketing, e-commerce, customer relationship management, healthcare, scientific tests and other experimental work, telecommunications, financial services, and a whole lot more.

A lot of advantages

Subscribing to data mining outsourcing services offers many benefits, as providers assure customers of services rendered to world standards. They strive to work with improved technology, scalability, sophisticated infrastructure and resources, timeliness, reasonable cost, safer systems for the security of information, and increased market coverage.

Outsourcing allows companies to focus on their core business and can improve overall productivity. Not surprisingly, information mining outsourcing has been a first choice of many companies looking to propel their business to higher profits.



Source: http://ezinearticles.com/?Know-What-the-Truth-Behind-Data-Mining-Outsourcing-Service&id=5303589

Sunday 1 September 2013

An Easy Way For Data Extraction

There are many data scraping tools available on the internet. With these tools you can download large amounts of data without any stress. Over the past decade, the internet revolution has turned the entire world into an information center. You can obtain any type of information from the internet. However, if you want particular information for one task, you need to search many websites. If you want to download all the information from those websites, you have to copy the information and paste it into your documents, which is hectic work for anyone. With these scraping tools you can save time and money and reduce manual work.

A web data extraction tool extracts the data from the HTML pages of different websites and compares it. Every day, many new websites are hosted on the internet, and it is not possible to see them all in a single day. With these data mining tools, you are able to cover far more web pages on the internet. If you use a wide range of applications, these scraping tools are very useful to you.

Data extraction software is used to compare structured data on the internet. There are many search engines on the internet that will help you find a website on a particular issue, but the data on different sites appears in different styles. A scraping tool will help you compare the data from different sites and structure it for your records.

The web crawler software tool is used to index web pages on the internet; it moves the data from the internet to your hard disk, so you can browse it much faster once it is downloaded. An important use of this tool is downloading data from the internet during off-peak hours: downloads that would otherwise take a lot of time can be pulled down at a fast rate. There is another tool for business people called an email extractor. With this tool you can easily collect the email addresses of target customers and send advertisements for your product to them at any time. It is a handy tool for building a database of customers.
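As a small illustration of the email-extractor idea (a sketch only; the URL is a placeholder and the regular expression is a simplification that will miss some address formats), the snippet below downloads one page and pulls out anything that looks like an email address:

    import re
    import urllib.request

    url = "http://example.com/contact"            # placeholder page to scan
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

    # Simplified pattern: characters, dots, pluses or dashes around an '@' sign.
    emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html))

    for address in sorted(emails):
        print(address)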

However, there are more scraping tools available on the internet, and some reputable websites provide information about these tools. You can download these tools by paying a nominal amount.



Source: http://ezinearticles.com/?An-Easy-Way-For-Data-Extraction&id=3517104