I have a 5million records and 30%-40% have duplicates, incorrect formats and with missing/blank records. I am assigned to clean it up.
Anyone could help me suggest what process should be done?
Any suggestions are accepted
2007-07-21
16:19:14
·
5 answers
·
asked by
SaintKathy
2
in
Computers & Internet
➔ Programming & Design
the basis of dupes are the phone#, company names and contact persons.
There is a team that populate 7K+ records each day and they cant manage to see the dupes on their database records.
I was assigned to create process that minimize/reduce the dupes, blanks, missing & unnecessary records.
Any suggestions may help.
Could anyone suggest what forum could I repost this question that many could relate to my question?
2007-07-21
17:06:18 ·
update #1
Does your database system have a built-in database management system? If so, how comprehensive is it?
Have you considered purchasing a separate clean-up utility?
What resources have you appealed on the web, i.e.,forums, white-papers, etc.? Some sites have great search engines.
Regardless, there is still a lot of work that you will have to do yourself. This cleanup project entails a great deal of forethought, planning, and organization.
________________________________________________________________
1) List all of the problems that you want to take care of in the cleanup process.
Some can be taken care of quite simply. Others will require a bit of ingenuity and might be somewhat tedious.
2) Next I would decide how you will correct each problem. Create a layout of what constitutes a correctly formatted record entry. Use that as a template to compare all of the existing records to.
3) Plan in what sequence the cleanup tasks should be performed. Carefully note why you place task at a particular phase/stage.
4) Once you have designed the set of cleanup queries, do not attempt to use them on the database as of yet. Instead, you must first test these queries. Create a sample database—whose schema is similar to your current one. Load it with records: both good and bad ones. Make sure the bad ones reflect the various sorts of defects / problems that you are attempting to cleanse. Run the cleanup queries on this sample database in order to evaluate the efficacy of your solution. Make revisions as necessary.
5) You want to avoid this situation in the future. Your ultimate goal should be to improve the consistency and validity of your data, as well as the integrity of the database itself. Make sure that the database is properly setup. Also decide what changes you need to make to the database's rules, triggers, and contraints.
Data monitoring has become a key component of a complete data quality integration practice. Data monitoring ensures that once your data becomes consistent, accurate and reliable, it remains that way over time.
Below, I highlight a few basic considerations that may help your further decide how to plan and implement your cleanup project.
____________________________________________________________________________
DUPLICATES
To eliminate duplicate records.: Use the SQL DISTINCT keyword.
Caveat: Detemine how you will ensure that records that appear to belong to the same account, person, or entity, actually do?
This is going to be an iterative, pain-staking process. Make sure that you have this well thought out.
INCORRECT FORMAT
Handling incorrect formats is going to present the most challenging aspect of your task. Why is the particular field incorrectly formatted in the first place? How could you have prevented the user from having entered incorrectly formated data in the first place? What validation procedures need to be in effect. I need to know just what you mean by incorrect format. Are you referring to one field, or more than one field? Are you referring to string or numerical information. telephone numbers? Then, you are going to have to create an SQL statement that takes each current number, and converts it to the correct form. it into the correct.
To avoid this problem in the future, you need to have a design a format template. You may elect to convert whatever data the user enters into the proper format. If they are things like states, then people should be selecting states from a drop-down listbox.
MISSING / INCOMPLETE RECORDS (FIELDS)
I don't believe there is such as thing as a missing / blank record, I assume that you are referring to a database field / column that does not contain any data. In any event, use the the IS Null operator in your SQL query to ascertain which records have fields are missing data.
Caveat: Different databases handle NULL differently.Usually, NULL differs from zero, a string that contains only blanks, or an empty string (' ').
____________________________________________________________________________
OTHER CONSIDERATIONS:
To what degree is your database normalized?
Does your database need to be re-designed?
Do it contain the necessary tables and fields?
Are your primary and foreign keys properly utilized? These impact uniqueness and referential integrity.
What rules, triggers, and constraints are in place?
2007-07-21 18:14:40
·
answer #1
·
answered by Einstein 5
·
0⤊
1⤋
That is a tough one, someone must have a royally "F" it up to caused this, ok, assuming it is a relational DB, any fix you figured out would need SQL and one other language that has xDBC:
select unique(field) from Table A, Table B where A.field=B.field;
That may help with duplicate, this is firmly in the realm of RDB theory and this is a cross product with finding duplicates. The output is going to be very very long and slow.
What incorrect format are we speaking here? A number stored as string? Whatever it is the likely hood is high that you have to just redesign and repopulate a database from scratch.
Identify what missing/blank mean here I can offer no more advise at this point.
2007-07-21 16:37:14
·
answer #2
·
answered by Andy T 7
·
1⤊
2⤋
it looks like, the database need to normalize and re-design, but redesigning the database means the application must be re-design too.
I assume that you know how to eliminate unwanted records, fixing records with incorret format, and deleting blank records, so the hardest part is to delete some duplicate records.
the question is : what database server you're using?
cause I think only stored procedure can make the task easier.
2007-07-22 22:21:14
·
answer #3
·
answered by Manzana verde 5
·
0⤊
0⤋
You can do it as often as you like. Unless you add files to your system really often, do lots of downloads that create temporary files and such, it's kind of pointless to do it more often than every few days. I do it a couple of times a year, when things start bogging down. OTOH, I also have specific means for downloading files, delete incomplete downloads manually, and that sort of thing. I also send downloads to a firewire drive to keep them from bogging down C. If you practice good computer hygiene, you can get away with doing disk clean-up as often.
2016-04-01 06:26:01
·
answer #4
·
answered by Anonymous
·
0⤊
0⤋
It may not be that easy. May be you can contact a DB expert. Check websites like http://askexpert.info/
2007-07-22 01:54:27
·
answer #5
·
answered by Anonymous
·
0⤊
0⤋