So you've got a dynamic site, filled with all sorts of user inputs, whether it be a 'phorum', or like my own site at
http://www.knowpost.com. ht://dig will take care of indexing and searching your html pages, but if you are like me, you have very few html pages, and most of your "content" resides in BLOBs in your database. You can't do anything useful using a like %searchword% query, it just isn't coming back relevant.
There has to be a better way, and indeed there is, with a few easy steps. Here's how to slap one together:
Noise ReductionThe first problem with your content is that it is filled with clunky "noisewords," like "a,the,where,look" Things that are there to help us humans to communicate, but really don't have anything to do with relevance. We gotta get rid of those. Essentially, what we're trying to do here is get all those noisewords out of your data, and build a table with two columns, the word, and its indicator (the content associated with it). We want something that will eventually look like this:
+------+------------+
| qid | word |
+------+------------+
| 6 | links |
| 5 | Fire |
| 5 | topics |
| 5 | related |
| 5 | Shakespeare|
| 4 | people |
| 4 | Knowpost |
| 3 | cuba |
| 3 | cigar |
+------+------------+
Lets create our table now:
mysql> CREATE TABLE search_table(
word VARCHAR(50),
qid INT
)
Next, since you want to make all your data compatible, not
just new data, we need to grab your sticky blobs, and their identifiers out of your database:
<?
$query = "SELECT blob,identifier FROM your_table";
$result = mysql_query($query);
$number = mysql_numrows($result);
$j = 0;
WHILE ($j < $number){
/* Your "blob" */
$body = mysql_result($result,$j,"blob");
/*Your "identifier" */
$qid = mysql_result($result,$j,"qid");
/* Open the noise words into an array */
$noise_words = file("noisewords.txt");
$filtered = $body;
/* Got to put a space before the first word in the body, so that we can
recognize the word later */
$filtered = ereg_replace("^"," ",$filtered);
/* Now we suck out all the noisewords, and transform whats left into an
array */
/* Brought to you by poor ereg coding! */
for ($i=0; $i<count($noise_words); $i++) {
$filterword = trim($noise_words[$i]);
$filtered = eregi_replace(" $filterword "," ",$filtered);
}
$filtered = trim($filtered);
$filtered = addslashes($filtered);
$querywords = ereg_replace(",","",$filtered);
$querywords = ereg_replace(" ",",",$querywords);
$querywords = ereg_replace("?","",$querywords);
$querywords = ereg_replace("(","",$querywords);
$querywords = ereg_replace(")","",$querywords);
$querywords = ereg_replace(".","",$querywords);
$querywords = ereg_replace(",","','",$querywords);
$querywords = ereg_replace("^","'",$querywords);
$querywords = ereg_replace("$","'",$querywords);
/* We should now have something that looks like 'Word1','Word2','Word3'
so lets turn it into an array */
$eachword = explode(",", $querywords);
/* and finally lets go through the array, and place each word into the
database, along with its identifier */
for ($k=0; $k<count($eachword); $k++){
$inputword = "INSERT INTO search_table VALUES($eachword[$k],$qid)";
mysql_query($inputword);
}
/* Get the next set of data */
$j++;
}
?>
That script just handles your old data. You'll want to
include a similar function to strip the noisewords out for every time new information comes into your database, through user input, your input, etc... so that your search engine is updated on the fly.
{mospagebreak title=Searching the Table} Now you have an easy to-use table of keywords and their associations. How do you query this table? Here's what I do:
First I format each searchterms passed into the script as 'word1','word2','word3' and stick it in a string called $querywords.
Then I throw them into this SQL query:
SELECT count(search_table.word) as score, search_table.qid,your_table.blob
FROM search_table,your_table
WHERE your_table.qid = search_table.qid AND search_table.word
IN($querywords)
GROUP BY search_table.qid
ORDER BY score DESC";
Set that query to $search, and print out the results like
so:
<?
$getresults = mysql_query($search);
$resultsnumber = mysql_numrows($getresults);
IF ($resultsnumber == 0){
PRINT "Your search returned no results. Try other keyword(s).";
}
ELSEIF ($resultsnumber > 0){
PRINT "Your search returned $resultsnumber results<BR>Listing them
in order of relevance<BR><BR>";
for($count = 0; $count<$resultsnumber; $count++){
$body = mysql_result($getresults,$count,"blob");
$qid = mysql_result($getresults,$count,"qid");
//tighten up the results
$body2print = substr($body, 0, 100);
$cnote = $count+1;
PRINT "$cnote. <a href=yourcontent.php3?qid=$qid>
<i>$body2print...</i></a><BR>";
}
}
?>
Presto, you've got keyword searching for your database,
complete with relevancy ranking. It may not be Google or altavista.
It may not support all those fancy boolean operators, or excite's (*cough*) conceptual mapping technology. But it works, its quick and enough to handle your user's demand.
| DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware. |
More PHP Articles
More By Clay Johnson
developerWorks - FREE Tools! |
The IBM DB2 Deep Compression ROI tool is designed for DBA’s and IT management personnel to perform a clinical analysis of the cost savings gained from the Storage Optimization feature of DB2 9 for Linux, UNIX and Windows. The feature, also known as Deep Compression, compresses data that lies within a database by up to 80% at times. FREE! Go There Now!
|
|
|
|
Join us for this on demand webcast to learn about developing complex systems more quickly and efficiently. We'll cover market drivers for developing, governing and reusing systems software assets and how you can develop system software assets with Rational Asset Manager. FREE! Go There Now!
|
|
|
|
Visit IBM developerWorks to download a free trial version of WebSphere Extended Deployment Compute Grid, which lets you schedule, execute, and monitor batch jobs. Because online transaction processing and batch jobs execute simultaneously on the same server resources, you can avoid costly duplication of resources. Compute Grid supports job types of Java transactional batch, compute-intensive and a new type called "native execution", which enables non-Java workloads to run on distributed end points. FREE! Go There Now!
|
|
|
|
Join this webcast to see how IBM Data Studio Developer and pureQuery can take the pain out of Java data access. uApplications developed using both Java and SQL have become a common requirement. Database connectivity using Java Database Connectivity (JDBC) to create an application is a multi-step tedious process, and tooling that covers both SQL and Java has been unavailable, until now. IBM Data Studio introduces the pureQuery platform: a high-performance, Java data access platform focused on simplifying the tasks of developing, managing, and optimizing database applications and services. FREE! Go There Now!
|
|
|
|
Manage, govern, and share services across your organization by using WebSphere Service Registry and Repository. Follow the hands-on exercises to learn how to navigate the Web interface to publish, find, reuse, and update services. FREE! Go There Now!
|
|
|
|
XML has become a common way of storing business data as flat files and many data server vendors including IBM have provided ways to store this data within relational database systems. Increasingly collections of XML files are accessed like databases using an xQuery and other XML standard mechanisms. Businesses find the need to combine the traditional tabular structured data with XML formatted data. In this webcast, you’ll learn about IBM’s WebSphere Federation Server technology, which provides users with the ability to integrate these two data formats. FREE! Go There Now!
|
|
|
|
This webcast outlines the best practices that must be instituted to gain the maximum benefit from SOA while maintaining high quality of service. Whether you are deploying new applications or managing and monitoring your existing infrastructure, learn how you can ensure high quality of services with SOA based solutions from IBM. All registrants who attend this live Web Seminar will receive complimentary access to a white paper titled “Maintaining QoS in an SOA Environment”. FREE! Go There Now!
|
|
|
|
Rational Modeling Extension for Microsoft .NET enhances usability for code generation supporting a more intelligent refactoring. The latest enhancements enable organizations with Java and .NET systems and software development maintain architectural integrity across heterogeneous platforms. FREE! Go There Now!
|
|
|
|
Visit IBM developerWorks to try the IBM SOA Sandbox for connectivity. The SOA Sandbox for connectivity provides a trial environment with the tooling and components to help you explore how to effectively connect your infrastructure and integrate all of the people, processes and information in your company. Use the hosted sandbox to explore SOA techniques that streamline connecting existing IT assets together, as well as learn how to connect them to new business logic. FREE! Go There Now!
|
|
|
|
With IBM Rational Systems Development Solution, you can deliver products faster with higher quality. Within this kit, Read the “Model Driven Systems Development” white paper to see how to improve product quality and communication. Then check out the rest of the e-Kit to learn more about important topics that can affect the success of any software project through customer examples, tutorials, informative Webcasts, and best practices for designing, building and managing systems. From start to finish, at every stage in your projects, Rational Systems Development Solution can help your company reach its full potential. FREE! Go There Now!
|
|
|
|
All FREE IBM® developerWorks Tools! |